Identification of data mining research frontier based on conference papers

Yue Huang (School of Information Science, Beijing Language and Culture University, Beijing, China)
Hu Liu (International Business School, Beijing Foreign Studies University, Beijing, China)
Jing Pan (School of Economics and Management, University of Science and Technology Beijing, Beijing, China)

International Journal of Crowd Science

ISSN: 2398-7294

Article publication date: 21 May 2021

Issue publication date: 3 August 2021

Abstract

Purpose

Identifying the frontiers of a research field is one of the most basic tasks in bibliometrics. Research published in leading conferences is crucial to the data mining community, yet few studies have focused on it. The purpose of this study is to detect the intellectual structure of data mining based on conference papers.

Design/methodology/approach

This study takes as its sample the papers of the nine top-ranked data mining conferences listed by Google Scholar Metrics. Based on paper counts, it first examines the annual volume of published papers and their distribution across conferences. Then, from the perspective of keywords, CiteSpace is used to analyze the conference papers and identify the frontiers of data mining, focusing on keyword term frequency, keyword betweenness centrality, keyword clustering and burst keywords.

Findings

The results showed that research activity in data mining followed a linear upward trend between 2007 and 2016. The frontier identification based on the conference papers showed that there were five research hotspots in data mining: clustering, classification, recommendation, social network analysis and community detection. The conference papers also covered a wide range of research content.

Originality/value

This study detected the research frontier from leading data mining conference papers. Based on the keyword co-occurrence network, and using four dimensions (keyword term frequency, betweenness centrality, clustering analysis and burst analysis), this paper identified and analyzed the research frontiers of the data mining discipline from 2007 to 2016.

Citation

Huang, Y., Liu, H. and Pan, J. (2021), "Identification of data mining research frontier based on conference papers", International Journal of Crowd Science, Vol. 5 No. 2, pp. 143-153. https://doi.org/10.1108/IJCS-01-2021-0001

Publisher: Emerald Publishing Limited

Copyright © 2021, Yue Huang, Hu Liu and Jing Pan.

License

Published in International Journal of Crowd Science. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode


1. Introduction

In the era of “Internet+,” big data has become a focus of attention. How to mine and use this massive amount of information is what makes the study of big data significant. Generally speaking, data mining refers to an engineered and systematic process of mining implicit and previously unknown but potentially useful information and patterns from large amounts of data. Through data mining, engineers can propose new algorithms and models and medical institutions can develop new antibodies and drugs. Data mining thus provides new ideas and methods for scientific research.

In recent years, the number of research results in the data mining discipline has been increasing and the volume published in well-known international journals and conferences continues to grow. To investigate the research trends of data mining in a more comprehensive and meticulous manner, this paper examines the data mining discipline based on the papers of authoritative international conferences.

The concept of the discipline frontier has been revised and enriched by other scholars since Price introduced it in 1965. Price (1965) held that the research frontier is time-varying, that is, it changes over time. For a discipline, the process of change in the research frontier essentially represents the development of that discipline. There are many concepts related to the research frontier, such as hot topics, emerging research areas, emerging topics, emerging trends and potential knowledge. Methods for identifying the research frontier fall roughly into two categories: qualitative and quantitative. The qualitative methods are relatively mature, whereas the quantitative methods are still developing and improving. In this paper, the discipline frontier is treated as the same concept as the research frontier.

Currently, there is no clear and uniform definition of the research frontier. The definitions are broadly divided into three categories:

  1. defining the cited literature as the research frontier;

  2. defining the citing literature as the research frontier; and

  3. defining the burst words or hot topics as the discipline frontier.

There are many concepts in the field of information science similar to the research frontier, such as emerging trends, emerging topics and research hotspots. The concept of emerging trends was proposed by Kontostathis et al. (2004) and refers to subject areas that have gradually attracted interest over time and are being discussed by more and more researchers. The concept of emerging topics was proposed by Matsumura et al. (2002) and refers to a set of emerging subject areas represented by multiple keywords or phrases in a particular scientific research field. It represents the most promising research directions or trends in that field. Although research hotspots have no clear definition, the term is widely used and generally refers to hot papers.

There are numerous methods for research frontier detection, such as co-citation analysis by Small (1973) and White and Griffith (1981), coupling analysis by Kessler (1963) and Weinberg (1974) and co-word analysis by Morris and Yen (2004) and Bhattacharya and Basu (1998). These methods have been used in various disciplines to help discover research topics from research papers. Many bibliographic elements, such as co-citations, coupling, keywords and authors, are used to detect intellectual structures. Among these elements, keywords carefully chosen by authors best convey the main ideas of the manuscripts. Almost all previous work detects the research frontier from journal papers. However, conference papers are crucial to the data mining research community.

In this paper, papers from authoritative international conferences in the field of data mining are taken as the research object. Keyword term frequency analysis, centrality analysis, keyword clustering and burst word analysis are then adopted to determine the frontiers and trends of the data mining discipline.

2. Data and methods

2.1 Data acquisition

The authors select Google Scholar Metrics as the data acquisition standard, because data mining is a relatively new research field and it is not a subcategory of an existing discipline. Google Scholar Metrics (2017) provides an academic evaluation standard to help researchers assess the visibility and influence of recent articles in scholarly publications. One of its important metrics is the H5-index. For a publication, the H5-index is the H-index for articles published in the past five complete years, i.e. it is the largest number h such that h articles published in the past five years have at least h citations each. The authors take as the data source the top international conferences that Google Scholar Metrics lists in the subcategory “Data Mining and Analysis” of the category “Engineering and Computer Science” (Table 1).
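The H5-index definition above can be illustrated with a minimal sketch. The function name `h_index` and the citation counts are hypothetical, chosen only for illustration; they are not data from Google Scholar Metrics.

```python
def h_index(citations):
    """Return the largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h


# Hypothetical citation counts for one venue's papers from the past five complete years.
example_citations = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_citations))
```

Restricting the input to articles from the past five complete years is what turns the ordinary H-index into the H5-index; the computation itself is unchanged.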

There are nine conferences in total. To guarantee data integrity and reliability, the Web of Science (WOS) and Scopus databases were used to complement each other. The search strategy was as follows: select the core database of WOS as the database; set the conference name to the name of the corresponding conference; set the time span to 2007 to 2016. In total, the bibliographic information of 11,870 conference papers was downloaded from WOS and Scopus.

2.2 Data preanalysis

By counting the bibliographic information of the 11,870 conference papers, the authors found that the IEEE International Conference on Data Mining has the highest number of publications (2,499), followed by the IEEE International Conference on Big Data (1,689), the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (1,628), the Pacific-Asia Conference on Knowledge Discovery and Data Mining (1,201), the European Conference on Machine Learning and Knowledge Discovery in Databases (1,094), the SIAM International Conference on Data Mining (1,085), the ACM Conference on Recommender Systems (1,058), the International Conference on Artificial Intelligence and Statistics (1,018) and the ACM International Conference on Web Search and Data Mining (598).
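As a quick consistency check, the per-conference counts reported above can be tallied and converted into shares. The counts are taken from the text and the keys are the abbreviations used in Table 1; the variable names are illustrative only.

```python
# Paper counts per conference for 2007-2016, as reported in the text.
paper_counts = {
    "ICDM": 2499, "ICBD": 1689, "KDD": 1628, "PAKDD": 1201, "ECMLPKDD": 1094,
    "SDM": 1085, "RecSys": 1058, "AISTATS": 1018, "WSDM": 598,
}

total = sum(paper_counts.values())
shares = {name: count / total for name, count in paper_counts.items()}

# Print conferences from most to fewest papers with their share of the sample.
for name, count in sorted(paper_counts.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {count:6,d} {shares[name]:6.1%}")
print("total", total)
```

The nine counts sum to the 11,870 papers in the sample.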

From 2007 to 2016, the number of papers published in data mining conferences showed an increasing trend (Figure 1). This reflects the growing emphasis on the data mining discipline in engineering and computing over the years, along with the deepening of research in this area. The line chart shows that, although the number of published papers fluctuates in individual years, the pattern of growth is exactly the opposite of that of journal papers: growth is larger before 2010 and after 2013 and relatively flat between 2010 and 2013. To a certain extent, this illustrates that conference-based research on data mining is moving toward a steadier trend.

2.3 Methodology

In this paper, the authors intend to use keyword term frequency analysis, keyword clustering analysis, keyword burst analysis and keyword betweenness centrality analysis for the research frontier of data mining based on conference papers.

2.3.1 Tools.

The research uses CiteSpace 5.0.R4 as the main analysis tool. CiteSpace is a visualization tool for academic literature analysis. Its main function is to detect hot topics and trends in a certain subject or field based on specific algorithms. Based on the principles and algorithms of co-citation/co-occurrence analysis, CiteSpace provides cooperative network analysis of author/institution/country, co-occurrence analysis of term/keyword/category, co-citation analysis of reference/author/journal and burst analysis for frontier detection. It also provides output and visualization in three views: clustering, timeline and time zone. Researchers can set various parameters according to their needs.

2.3.2 Methods.

Keyword term frequency analysis, betweenness centrality analysis and burst analysis work from the micro perspective: they aim to identify hot topics by measuring keyword term frequency, betweenness centrality and burst metrics for research frontier identification. The clustering analysis is more focused on macro analysis.

2.3.2.1 Keyword term frequency analysis.

The basic principle of term frequency analysis is to determine research hotspots and their trends from the frequency with which words occur. Term frequency analysis can reflect the hotspots in a research field through the keywords that exceed a given threshold. A higher term frequency indicates that researchers pay more attention to that topic. Studying the subject matter of the literature can not only reveal its hotspots but, in combination with term frequency, also reveal the time distribution of the research topics and thereby identify research hotspots and trends.
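A minimal sketch of term frequency analysis with a threshold follows. It assumes each paper is available as a `(year, keywords)` pair; the record layout, the function name `keyword_frequencies` and the toy records are assumptions for illustration, not the CiteSpace implementation or the study's data.

```python
from collections import Counter


def keyword_frequencies(records, min_count=2):
    """Count papers per keyword and record the first year each keyword appears.

    records: iterable of (year, list_of_keywords) pairs, one per paper.
    Returns (count, first_year, keyword) rows with count >= min_count,
    sorted by descending count and then alphabetically.
    """
    counts = Counter()
    first_year = {}
    for year, keywords in records:
        # Normalize case and count each keyword at most once per paper.
        for kw in {k.strip().lower() for k in keywords}:
            counts[kw] += 1
            first_year[kw] = min(year, first_year.get(kw, year))
    rows = [(n, first_year[kw], kw) for kw, n in counts.items() if n >= min_count]
    return sorted(rows, key=lambda row: (-row[0], row[2]))


# Toy records, invented for illustration.
toy_records = [
    (2007, ["Data mining", "Clustering"]),
    (2008, ["data mining", "Recommender system"]),
    (2013, ["Big data", "Data mining"]),
    (2014, ["big data", "recommender system"]),
]
for row in keyword_frequencies(toy_records):
    print(row)
```

The output rows have the same shape as Table 2 (frequency, year of first appearance, keyword), with `min_count` playing the role of the frequency threshold.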

2.3.2.2 Keyword betweenness centrality analysis.

Centrality in the CiteSpace menu means betweenness centrality, an indicator of the importance of the nodes in a network. This measure of node importance was proposed by Freeman (1978). In the visual knowledge network map, the nodes with higher betweenness centrality are highlighted with purple circles. The higher the centrality, the stronger the bridging (transitional) role of the key node. By analyzing these high-centrality nodes in chronological order, the authors can vertically compare the development history of the data mining discipline based on the journal papers and conference papers.
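To make the measure concrete, the following sketch computes normalized betweenness centrality on a small unweighted, undirected keyword network using Brandes' algorithm. It is an illustration under stated assumptions: the toy edges are invented and CiteSpace's own computation and scaling may differ.

```python
from collections import deque


def build_adjacency(edges):
    """Turn an undirected edge list into a dict mapping node -> set of neighbors."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def betweenness_centrality(adjacency):
    """Normalized betweenness for an unweighted, undirected graph (Brandes' algorithm)."""
    nodes = list(adjacency)
    score = {v: 0.0 for v in nodes}
    for source in nodes:
        # Breadth-first search that counts shortest paths from `source`.
        order = []
        preds = {v: [] for v in nodes}
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies, farthest nodes first.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    n = len(nodes)
    # Each unordered pair was counted from both endpoints; dividing the raw
    # sum by (n-1)(n-2) both halves it and normalizes it to the range [0, 1].
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 0.0
    return {v: score[v] * scale for v in nodes}


# Toy co-word network, invented for illustration.
toy_edges = [
    ("clustering", "data mining"),
    ("classification", "data mining"),
    ("data mining", "recommender system"),
    ("recommender system", "collaborative filtering"),
    ("recommender system", "matrix factorization"),
]
centrality = betweenness_centrality(build_adjacency(toy_edges))
for keyword, value in sorted(centrality.items(), key=lambda item: -item[1]):
    print(f"{value:.2f}  {keyword}")
```

In this toy network, "data mining" and "recommender system" lie on every shortest path between the two groups of keywords, so they receive the highest scores, while the leaf keywords score zero.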

2.3.2.3 Keywords clustering analysis.

Clustering analysis simplifies the intricate co-word network into several groups. The larger the cluster, the more keywords it contains. Through clustering analysis, the authors can directly understand the research hotspots in this field. Based on the similarity or relevance of knowledge, keywords with co-occurrence relations are reorganized into knowledge communities by clustering. In this way, the intricate co-word relationships among many analysis objects are reduced to relationships among a relatively small number of groups, which can be represented directly.
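The idea of reducing a co-word network to a few groups can be sketched as follows. This is a deliberately simple stand-in, not the clustering algorithm used by CiteSpace: weak co-occurrence links are dropped and the remaining connected components are treated as clusters. The function name, the `min_weight` threshold and the toy weights are assumptions.

```python
def cluster_keywords(edge_weights, min_weight=2):
    """Group keywords into connected components after dropping weak co-occurrence links.

    edge_weights: dict mapping (keyword_a, keyword_b) -> co-occurrence count.
    Returns a list of clusters (sorted keyword lists), largest cluster first.
    """
    neighbors = {}
    for (a, b), weight in edge_weights.items():
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if weight >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, clusters = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component, frontier = [], [start]
        seen.add(start)
        while frontier:
            node = frontier.pop()
            component.append(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        clusters.append(sorted(component))
    return sorted(clusters, key=lambda c: (-len(c), c))


# Toy co-occurrence counts, invented for illustration.
toy_weights = {
    ("big data", "hadoop"): 5,
    ("hadoop", "mapreduce"): 4,
    ("matrix factorization", "recommender system"): 6,
    ("collaborative filtering", "recommender system"): 3,
    ("big data", "recommender system"): 1,
    ("big data", "security"): 1,
}
for number, members in enumerate(cluster_keywords(toy_weights)):
    print(f"#{number} (size {len(members)}): {', '.join(members)}")
```

A keyword whose links all fall below the threshold ends up as a cluster of size one, which is analogous to the singleton clusters listed in Table 4.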

2.3.2.4 Keywords burst analysis.

The burst detection algorithm proposed by Kleinberg (2003) can be used to detect a sudden increase in research interest in a certain field. A keyword burst refers to a large change in the number of references to a certain keyword during a certain period of time, such as a sudden rise or a sudden drop. The beginning time of the citation history can reveal the development and evolution trend of a research topic in the related fields. In CiteSpace, if a cluster contains more burst nodes, then this field is more active or represents an emerging trend of research.
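The following is a minimal sketch of a two-state burst model in the spirit of Kleinberg (2003), applied to yearly keyword counts. The parameter values `s` and `gamma`, the function names and the toy counts are assumptions for illustration; the exact variant and settings used inside CiteSpace are not reproduced here.

```python
import math


def kleinberg_bursts(hits, totals, s=2.0, gamma=1.0):
    """Label each time slice 0 (baseline) or 1 (burst) with a two-state model.

    hits[t]   : number of papers using the keyword in slice t
    totals[t] : number of all papers in slice t
    s         : assumed ratio between the burst rate and the baseline rate
    gamma     : assumed cost factor for entering the burst state
    """
    n = len(hits)
    if sum(hits) == 0:
        return [0] * n
    p0 = sum(hits) / sum(totals)   # baseline rate over the whole period
    p1 = min(s * p0, 0.9999)       # elevated rate in the burst state

    def fit_cost(p, r, d):
        # Negative binomial log-likelihood; the binomial coefficient is the
        # same for both states, so it is left out.
        return -(r * math.log(p) + (d - r) * math.log(1 - p))

    enter_burst = gamma * math.log(n)  # moving up costs; moving down is free
    cost = [0.0, math.inf]             # the sequence starts in the baseline state
    back = []
    for r, d in zip(hits, totals):     # Viterbi pass over the slices
        new_cost, choice = [], []
        for state, p in enumerate((p0, p1)):
            options = [cost[prev] + (enter_burst if state > prev else 0.0) for prev in (0, 1)]
            best_prev = 0 if options[0] <= options[1] else 1
            new_cost.append(options[best_prev] + fit_cost(p, r, d))
            choice.append(best_prev)
        cost = new_cost
        back.append(choice)

    # Trace the cheapest state sequence backwards from the last slice.
    state = 0 if cost[0] <= cost[1] else 1
    states = [state]
    for choice in reversed(back[1:]):
        state = choice[state]
        states.append(state)
    states.reverse()
    return states


def burst_intervals(years, states):
    """Convert a 0/1 state sequence into (start_year, end_year) burst intervals."""
    intervals, start, prev_year = [], None, None
    for year, state in zip(years, states):
        if state == 1 and start is None:
            start = year
        elif state == 0 and start is not None:
            intervals.append((start, prev_year))
            start = None
        prev_year = year
    if start is not None:
        intervals.append((start, years[-1]))
    return intervals


# Toy yearly counts for one keyword, invented for illustration.
years = list(range(2007, 2017))
toy_hits = [2, 3, 2, 2, 3, 2, 20, 25, 30, 28]
toy_totals = [1000] * 10
toy_states = kleinberg_bursts(toy_hits, toy_totals)
print(burst_intervals(years, toy_states))
```

The first year of each interval corresponds to what the text calls the beginning year of a burst term.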

3. Keywords knowledge map analysis of conference papers

The authors use CiteSpace to apply keyword co-occurrence analysis to the data mining papers of the above-mentioned nine conferences over the 10-year period. The analysis time slice is one year, the node type is the keyword and the item selection criterion is “Top N = 50.” The authors choose minimum spanning tree pruning to simplify each slice network and the merged network.
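The slicing and selection settings above can be sketched as follows, assuming each paper is a `(year, keywords)` pair. The sketch keeps the `top_n` most frequent keywords in each one-year slice and merges the per-slice co-occurrence counts; minimum spanning tree pruning is omitted, and the function name and toy records are assumptions rather than CiteSpace internals.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(records, top_n=50):
    """Build a merged keyword co-occurrence network from one-year slices.

    records: iterable of (year, list_of_keywords) pairs, one per paper.
    Returns a Counter mapping (keyword_a, keyword_b), with keyword_a < keyword_b,
    to the number of papers in which both selected keywords appear together.
    """
    by_year = {}
    for year, keywords in records:
        cleaned = sorted({k.strip().lower() for k in keywords})
        by_year.setdefault(year, []).append(cleaned)

    merged = Counter()
    for year in sorted(by_year):
        papers = by_year[year]
        freq = Counter(k for kws in papers for k in kws)
        # Select the top_n keywords of this slice; ties are broken alphabetically.
        keep = set(sorted(freq, key=lambda k: (-freq[k], k))[:top_n])
        for kws in papers:
            kept = [k for k in kws if k in keep]
            for pair in combinations(kept, 2):
                merged[pair] += 1
    return merged


# Toy records, invented for illustration; top_n=2 keeps the example small.
toy_papers = [
    (2007, ["Data mining", "Clustering"]),
    (2007, ["data mining", "clustering", "SVM"]),
    (2008, ["Data mining", "Recommender system"]),
    (2008, ["recommender system", "data mining", "svm"]),
]
for pair, weight in sorted(cooccurrence_network(toy_papers, top_n=2).items()):
    print(pair, weight)
```

Keywords that never reach the top N of any slice do not enter the network at all, which is why the selection criterion shapes the resulting knowledge map.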

3.1 Keywords term frequency analysis

By running CiteSpace, the keywords co-occurrence knowledge map can be obtained. The representative high frequency keywords in the data mining conference papers from 2007 to 2016 are “data mining (2,644),” “algorithm (1,040),” “artificial intelligence (1,033),” “recommender system (1,001),” “information retrieval (573),” and “data set (429)” (shown in Table 2).

These high-frequency keywords show that the research hotspots in conference-based data mining research are mainly focused on artificial intelligence–based recommendation systems, information retrieval and data management. In addition, almost all of the high-frequency keywords first appeared in the first several years (2007–2010), which indicates that data mining research initially focused on fewer themes than it did later. This is consistent with the fact that research on data mining began its fast development around this time. For example, several leading journals on data mining were founded between 2007 and 2010, such as ACM Transactions on Knowledge Discovery from Data (2007), Advances in Data Analysis and Classification (2007), BioData Mining (2008), Statistical Analysis and Data Mining (2008) and ACM Transactions on Intelligent Systems and Technology (2010).

3.2 Keywords betweenness centrality analysis

In the visual knowledge network map, the keyword nodes with purple circles are the nodes with higher centrality, which are listed in Table 3. The keyword nodes with high betweenness centrality include “recommender system,” “world wide web,” “problem-solving,” “user interface,” “social networking (online),” and “social network.”

By analyzing these nodes with high centrality in order of the year, the authors can obtain the research development history of the data mining discipline based on conference papers between 2007 and 2016:

  • The first stage (2007–2008): implementation of data mining classification and clustering algorithms.

  • The second stage (2009): matrix decomposition for recommendation algorithms and semisupervised machine learning.

  • The third stage (2010–2011): online social network analysis and sentiment analysis based on social media.

  • The fourth stage (2012–2014): active learning and big data processing models.

  • The fifth stage (2015–2016): community discovery and cloud computing.

3.3 Keywords clustering analysis

By clustering the high-frequency words, the authors obtain the research hotspots of recent years in the data mining field based on the conference papers. Using the LLR (log-likelihood ratio) algorithm, a cluster view is generated (as shown in Figure 2) with 23 clusters (as shown in Table 4). The first four clusters are used as examples for analysis. The results are as follows.

  • The topic of the first cluster (#0) can be summarized as the Hadoop big data processing model. The average year is 2010 and the cluster includes 17 keywords, which are “big data,” “mapreduce,” “hadoop,” “real world data,” “real data set,” “algorithm,” “graph mining,” “spark,” “vector,” “forecasting,” “state of the art,” “cloud,” “time series,” “data stream,” “data mining,” “anomaly detection” and “cloud computing.” Among these, “big data,” “mapreduce,” “hadoop,” “real world data” and “real data set” are burst terms. The beginning years of citation are 2013, 2013, 2014, 2010 and 2010, respectively.

  • The topic of the second cluster (#1) can be summarized as recommender system. The average year is 2010, including 16 keywords, which are “factorization,” “mathematical model,” “matrix factorization,” “iterative method,” “personalization,” “matrix algebra,” “regression analysis,” “recommendation,” “high dimensional,” “recommender system,” “recommendation algorithm,” “online system,” “recommendation system,” “food processing,” “electronic commerce” and “new approach.” Among these, “factorization,” “mathematical model,” “matrix factorization,” “iterative method,” “personalization” and “matrix algebra” are burst terms. The beginning years of citation are 2009, 2007, 2010, 2008, 2008 and 2010, respectively.

  • The topic of the third cluster (#2) can be summarized as classification methods of machine learning and the LLR label is Bayesian network. The average year is 2008, including 16 keywords, which are “system,” “training data,” “prediction,” “bioinformatics,” “neural network,” “parameter estimation,” “gene expression data,” “selection,” “gene expression,” “Bayesian network,” “support vector machine,” “inference engine,” “large data set,” “deep learning,” “kernel method” and “regression.” Among these, “system,” “training data,” “prediction” and “bioinformatics” are burst terms. The beginning years of citation are 2014, 2007, 2007 and 2007, respectively.

  • The topic of the fourth cluster (#3) can be summarized as data management and the LLR label is administrative data processing. The average year is 2010, including 16 keywords, which are “data set,” “mining,” “network,” “space division multiple access,” “complex network,” “graph theory,” “trees (mathematics),” “decision support system,” “Gaussian process,” “dimensionality reduction,” “mathematical technique,” “community detection,” “high dimensional data,” “information management,” “administrative data processing” and “principal component analysis.” Among these, “data set,” “mining,” “network,” “space division multiple access,” “complex network” and “graph theory” are burst terms. The beginning years of citation are 2007, 2007, 2007, 2010, 2013 and 2007, respectively.

3.4 Keywords burst analysis

In the keyword analysis and clustering results, CiteSpace lists 52 burst words. Grouping them by the year in which their burst begins gives the following results.

There were 12 burst terms in 2007, including “computational efficiency,” “training data,” “probability distribution,” “mining,” “bioinformatics,” “prediction,” “internet,” “set theory,” “text mining,” “problem-solving,” “user interface” and “mathematical model.”

There were six burst terms in 2008, including “decision tree,” “graph theory,” “signal filtering and prediction,” “personalization,” “text processing” and “information service.”

There were five burst terms in 2009, including “classifier,” “software engineering,” “experimental evaluation,” “data set” and “graphical model.”

There were five burst terms in 2010, including “visualization,” “real world data,” “supervised learning,” “cluster analysis” and “data processing.”

There was one burst term “real data set” in 2011.

There were five burst terms in 2012, including “active learning,” “iterative method,” “Markov process,” “website” and “statistics.”

There were seven burst terms in 2013, including “topic modeling,” “real world,” “sentiment analysis,” “complex network,” “mapreduce,” “graphic method” and “state of the art method.”

There were 22 burst words from 2014 to 2016, including “big data,” “network,” “mapreduce,” “system,” “hadoop,” “factorization,” “state of the art method,” “website,” “real world,” “active learning,” “space division multiple access,” “topic modeling,” “matrix factorization,” “iterative method,” “complex network,” “decision-making,” “stochastic system,” “feature selection,” “matrix algebra,” “sentiment analysis,” “graphic method” and “markov process.” These terms reflect that research on Hadoop-based big data processing frameworks and the in-depth study of recommendation systems are among the frontier research topics of the discipline. In 2015, the keywords community discovery, random walk and the economic effect of big data appeared in the co-occurrence analysis results, reflecting that the implementation of complex data analysis algorithms and the improvement of social media evaluation systems are among the frontier research topics. In 2016, the keywords recommendation system, cloud computing, feature selection and e-commerce appeared in the co-occurrence analysis results, reflecting that commercialization analysis based on big data and artificial intelligence is one of the frontier research topics.

4. Conclusion

Unlike previous work, which identified research frontiers from journal papers, this paper used conference papers as the object of analysis. Based on the keyword co-occurrence network, and using four dimensions (keyword term frequency, betweenness centrality, clustering analysis and burst analysis), this paper identified and analyzed the research frontiers of the data mining discipline from 2007 to 2016. The purpose was to grasp the research frontiers of data mining more accurately and comprehensively based on conference papers. The clustering analysis focused on the macro level. The clustering results show that the conference papers focused on the infrastructure and application of big data. The keyword term frequency, betweenness centrality and burst analyses worked at the micro level. Their results showed that the contents of the conference papers were very diverse, application-oriented and time-sensitive. The next step is to conduct a comparative analysis of the frontiers of data mining identified from journal papers and from conference papers.

Figures

Figure 1. Annual published papers of data mining authoritative conferences

Figure 2. Data mining authoritative conference annual papers based on LLR clustering view

Table 1. Top nine conferences of Google Scholar Metrics in the field of data mining

Rank Publication Abbreviation h5-index
1 ACM SIGKDD International Conference on Knowledge discovery and data mining KDD 73
2 ACM International Conference on Web Search and Data Mining WSDM 54
3 IEEE International Conference on Data Mining ICDM 38
4 SIAM International Conference on Data Mining SDM 35
5 ACM Conference on Recommender Systems RecSys 34
6 European Conference on Machine Learning and Knowledge Discovery in Databases ECMLPKDD 31
7 International Conference on Artificial Intelligence and Statistics AISTATS 31
8 IEEE International Conference on Big Data ICBD 25
9 Pacific-Asia Conference on Knowledge Discovery and Data Mining PAKDD 22

Table 2. Keywords term frequency statistics of conference papers (frequency 160 or above)

Frequency Year Keyword
2,644 2007 Data mining
1,040 2007 Algorithm
1,033 2007 Artificial intelligence
1,001 2007 Recommender system
573 2007 Information retrieval
429 2007 Data set
394 2010 Social networking (online)
368 2007 Collaborative filtering
368 2007 Learning system
365 2008 Website
364 2008 World wide web
319 2013 Big data
303 2007 Classification
298 2007 Optimization
290 2007 Social network
288 2007 Clustering algorithm
256 2008 Search engine
245 2008 Classification (of information)
245 2007 Learning algorithm
237 2008 Forecasting
208 2007 Clustering
189 2010 Space division multiple access
178 2010 Matrix factorization
173 2007 Bayesian network
169 2009 Factorization
169 2010 Matrix algebra
168 2008 Semantics
160 2007 Machine learning

Table 3. Keywords betweenness centrality statistics of conference papers (centrality 0.13 or above)

Centrality Year Keyword
1.02 2007 Recommender system
0.97 2008 World wide web
0.91 2007 Problem-solving
0.9 2007 User interface
0.88 2010 Social networking (online)
0.88 2007 Social network
0.87 2007 Visualization
0.87 2007 Information system
0.74 2007 Information retrieval
0.66 2007 Learning system
0.63 2007 Machine learning
0.62 2008 Search engine
0.62 2008 International conference
0.58 2007 Mining
0.44 2007 Learning algorithm
0.34 2007 Optimization
0.33 2007 Data mining
0.31 2007 Parameter estimation
0.28 2007 Clustering problem
0.27 2007 Clustering algorithm
0.25 2007 Artificial intelligence
0.25 2007 Bayesian network
0.23 2007 Inference engine
0.21 2008 Classification (of information)
0.21 2008 Flow of solid
0.17 2007 Cluster analysis
0.15 2007 Gene expression
0.13 2007 Algorithm
0.13 2007 Gene expression data

Table 4. Keywords clustering statistics of conference papers

No. Scale Mean year Cluster name
0 17 2010 Data mining
1 16 2010 Recommender system
2 16 2008 Deep learning
3 16 2008 Administrative data processing
4 15 2008 Clustering
5 14 2009 Search engine
6 13 2007 Collaborative filtering
7 13 2009 Artificial intelligence
8 13 2009 Social network
9 13 2008 Semisupervised learning
10 11 2008 Classification
11 1 2008 Statistical method
12 1 2008 Database system
13 1 2008 Perceptron algorithm
14 1 2008 Association rule
15 1 2008 Pattern
16 1 2016 Word embedding
17 1 2008 Electric network analysis
18 1 2008 Knowledge-based system
19 1 2007 Security
20 1 2008 Database
21 1 2008 Speed up
22 1 2016 Big data analytics

References

Bhattacharya, S. and Basu, P.K. (1998), “Mapping a research area at the micro level using co-word analysis”, Scientometrics, Vol. 43 No. 3, pp. 359-372.

Freeman, L.C. (1978), “Centrality in social networks conceptual clarification”, Social Networks, Vol. 1 No. 3, pp. 215-239.

Google Scholar Metrics (2017), “Google scholar metrics”, available at: https://scholar.google.com/scholar/metrics.html (accessed 20 March 2017).

Kontostathis, A., Galitsky, L.M., Pottenger, W.M., Roy, S. and Phelps, D.J. (2004), “A survey of emerging trend detection in textual data mining”, in Berry, M.W. (Ed.), Survey of Text Mining, Springer, New York, NY, pp. 185-224.

Kessler, M.M. (1963), “Bibliographic coupling between scientific papers”, American Documentation, Vol. 14 No. 1, pp. 10-25.

Kleinberg, J. (2003), “Bursty and hierarchical structure in streams”, Data Mining and Knowledge Discovery, Vol. 7 No. 4, pp. 373-397.

Matsumura, N., Matsuo, Y., Ohsawa, Y. and Ishizuka, M. (2002), “Discovering emerging topics from WWW”, Journal of Contingencies and Crisis Management, Vol. 10 No. 2, pp. 73-81.

Morris, S.A. and Yen, G.G. (2004), “Crossmaps: visualization of overlapping relationships in collections of journal papers”, Proceedings of the National Academy of Sciences, Vol. 101 No. Supplement 1, pp. 5291-5296.

Price, D.J.D.S. (1965), “Networks of scientific papers”, Science, Vol. 149 No. 3683, pp. 510-515.

Small, H. (1973), “Co-citation in the scientific literature: a new measure of the relationship between two documents”, Journal of the American Society for Information Science, Vol. 24 No. 4, pp. 265-269.

White, H.D. and Griffith, B.C. (1981), “Author cocitation: a literature measure of intellectual structure”, Journal of the American Society for Information Science, Vol. 32 No. 3, pp. 163-171.

Weinberg, B.H. (1974), “Bibliographic coupling: a review”, Information Storage and Retrieval, Vol. 10 Nos 5/6, pp. 189-196.

Further reading

Chen, C. (2006), “CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature”, Journal of the American Society for Information Science and Technology, Vol. 57 No. 3, pp. 359-377.

Acknowledgements

This work is supported by BLCU Youth Talent Development Program and Science Foundation of Beijing Language and Culture University (supported by “the Fundamental Research Funds for the Central Universities”) (20YJ040003).

Corresponding author

Yue Huang can be contacted at: huang.yuet@blcu.edu.cn
