Text mining stackoverflow

Arash Joorabchi (Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland)
Michael English (Department of Computer Science & Information Systems, University of Limerick, Limerick, Ireland)
Abdulhussain E. Mahdi (Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland)

Journal of Enterprise Information Management

ISSN: 1741-0398

Publication date: 7 March 2016



Purpose

The use of social media, and in particular community Question Answering (Q & A) websites, by learners has increased significantly in recent years. The vast amounts of data posted on these sites provide an opportunity to investigate the topics under discussion and those receiving most attention. The purpose of this paper is to automatically analyse the content of a popular computer programming Q & A website, StackOverflow (SO), determine the exact topics of posted Q & As, and narrow down their categories to help determine the subject difficulties of learners. By doing so, the authors have been able to rank the identified topics and categories according to their frequencies, mark the most-asked-about subjects and, hence, identify the most difficult and challenging topics commonly faced by learners of computer programming and software development.


Design/methodology/approach

In this work the authors adopted a heuristic research approach combined with a text mining approach to investigate the topics and categories of Q & A posts on the SO website. Almost 186,000 Q & A posts were analysed and their categories refined using Wikipedia as a crowd-sourced classification system. Once all the topics and categories had been identified and their occurrence frequencies counted, their semantic relationships were established. These data were then presented as a rich graph that can be visualized using graph visualization software such as Gephi.
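The abstract does not give implementation details, so the following is only a minimal sketch of the kind of pipeline described above, not the authors' actual method. It assumes that topics are taken from links to English Wikipedia articles found in post bodies, that each topic is counted once per post, and that topic-to-category relationships are available from some lookup. The names `extract_topics`, `count_topics`, `category_edges`, `to_edge_csv`, the sample posts, and the `HYPOTHETICAL_CATEGORIES` mapping are all illustrative assumptions.

```python
import csv
import io
import re
from collections import Counter
from urllib.parse import unquote

# Matches links to English Wikipedia articles and captures the article title.
WIKI_LINK = re.compile(r"https?://en\.wikipedia\.org/wiki/([^\s\"'<>#?]+)")


def extract_topics(post_body):
    """Return the set of Wikipedia article titles linked from one post."""
    titles = set()
    for raw in WIKI_LINK.findall(post_body):
        # Decode percent-escapes, restore spaces, drop trailing punctuation.
        titles.add(unquote(raw).replace("_", " ").rstrip(".,;:"))
    return titles


def count_topics(posts):
    """Count in how many posts each linked topic occurs."""
    counts = Counter()
    for body in posts:
        counts.update(extract_topics(body))  # a set, so once per post
    return counts


def category_edges(topic_counts, topic_categories):
    """Build (topic, category, weight) edges, weighted by topic frequency."""
    return [
        (topic, category, n)
        for topic, n in topic_counts.items()
        for category in topic_categories.get(topic, [])
    ]


def to_edge_csv(edges):
    """Serialise edges as a Source/Target/Weight CSV edge list."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()


# Made-up sample data, for illustration only.
SAMPLE_POSTS = [
    '<p>See <a href="https://en.wikipedia.org/wiki/Regular_expression">regex</a>.</p>',
    "Use a https://en.wikipedia.org/wiki/Regular_expression "
    "or a https://en.wikipedia.org/wiki/Hash_table.",
    "No links here.",
]

# Hypothetical mapping; a real one would come from Wikipedia's category data.
HYPOTHETICAL_CATEGORIES = {
    "Regular expression": ["Pattern matching"],
    "Hash table": ["Data structures"],
}

if __name__ == "__main__":
    counts = count_topics(SAMPLE_POSTS)
    print(counts.most_common())
    print(to_edge_csv(category_edges(counts, HYPOTHETICAL_CATEGORIES)))
```

An edge list with `Source`, `Target`, and `Weight` columns is one of the CSV layouts that Gephi can import, which is why it is used here; a graph format such as GEXF would serve equally well.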


Findings

The reported results and corresponding discussion indicate that the insight gained from the process can be further refined and potentially used by instructors, teachers, and educators to focus on the commonly occurring topics/subjects when designing their course material, delivery, and teaching methods.

Research limitations/implications

The proposed approach limits the scope of the analysis to a subset of Q & As which contain one or more links to Wikipedia. Therefore, developing more sophisticated text mining methods capable of analysing a larger portion of available data would improve the accuracy and generalizability of the results.


Originality/value

The application of text mining and data analytics technologies in education has created a new interdisciplinary field of research between the education and information sciences, called Educational Data Mining (EDM). The work presented in this paper falls under this field of research and is an early attempt at investigating the practical applications of text mining technologies in the area of computer science (CS) education.



This research was funded under the “Research & Practice in ICT Learning” initiative – University of Limerick.


Joorabchi, A., English, M. and Mahdi, A. (2016), "Text mining stackoverflow", Journal of Enterprise Information Management, Vol. 29 No. 2, pp. 255-275. https://doi.org/10.1108/JEIM-11-2014-0109




Emerald Group Publishing Limited

Copyright © 2016, Emerald Group Publishing Limited
