Clustering of online learning resources via minimum spanning tree

Purpose
The quick growth of web-based and mobile e-learning applications, such as massive open online courses (MOOCs), has created a large volume of online learning resources. Confronted with such a large amount of learning data, it is important to develop effective clustering approaches for user group modeling and intelligent tutoring. This paper aims to address these issues.
 
 
 
 
Design/methodology/approach
In this paper, a minimum spanning tree based approach is proposed for clustering online learning resources. The novel clustering approach has two main stages, namely, an elimination stage and a construction stage. During the elimination stage, the Euclidean distance is adopted as a metric to measure the density of learning resources. Resources with very low densities are identified as outliers and removed. During the construction stage, a minimum spanning tree is built by initializing the centroids according to the degree of freedom of the resources. Online learning resources are subsequently partitioned into clusters by exploiting the structure of the minimum spanning tree.
 
 
 
 
Findings
Conventional clustering algorithms have a number of shortcomings that prevent them from handling online learning resources effectively. On the one hand, extant partitional clustering methods use a randomly assigned centroid for each cluster, which often leads to ineffective clustering results. On the other hand, classical density-based clustering methods are computationally expensive and time-consuming. Experimental results indicate that the proposed algorithm outperforms traditional clustering algorithms on online learning resources.
 
 
 
 
Originality/value
The effectiveness of the proposed algorithm has been validated on several data sets. Moreover, the proposed clustering algorithm has great potential in e-learning applications. It has been demonstrated how the novel technique can be integrated into various e-learning systems. For example, the clustering technique can classify learners into groups, so that homogeneous grouping can improve the effectiveness of learning. Moreover, clustering of online learning resources supports decision making in terms of tutorial strategies and instructional design for intelligent tutoring. Lastly, a number of directions for future research are identified in the study.


1. Introduction
E-learning is a means of education that incorporates self-motivation, communication, efficiency, and technology (Phobun and Vicheanpanya, 2010; Woldab, 2014). As a general tendency in intelligent tutoring and learning, e-learning has attracted an increasing amount of attention from researchers in the fields of computer science, pedagogy, and praxeology. With the rapid growth of e-learning resources, including content delivered through the internet, intranet/extranet, CD-ROM, audio or video tape, and satellite TV, the selection and organization of these materials is very time-consuming and challenging for users. Thus, it is necessary to cluster learning resources and subsequently recommend personalized resources to both teachers and learners. Clustering is the process of assigning class labels to objects based on the principle of minimizing the interclass similarity and maximizing the intraclass similarity (Li et al., 2013), and it is widely used in various scientific areas (Ben et al., 2011). For instance, taxonomists, social scientists, psychologists, biologists, statisticians, mathematicians, engineers, computer scientists, medical researchers, and others who collect and process real-world data have all contributed to clustering methodology (Jain, 2010; Mimaroglu and Erdil, 2011). At the same time, a recent trend in e-learning is the development of massive open online courses (MOOCs) and micro-courses. With the help of numerous teachers, MOOCs provide unlimited participation and open access via the internet to learners worldwide. It is not rare that tens of thousands of students from around the world enroll in a single course. As more and more learning resources are generated due to the explosion of MOOCs, it is hard to apply traditional clustering algorithms to analyze online learning resources.
Outliers or noise objects are very common in real-world data sets, especially in user-generated content. This brings new challenges to existing clustering methods. On the one hand, most traditional partitional clustering algorithms (e.g. K-means, bisecting K-means, and K-medoids) randomly assign objects as initial centroids of the clusters. When outliers are chosen as the initial centroids, the algorithm converges to an unstable result (i.e. the instability issue). On the other hand, the performance of classical density-based clustering methods (e.g. density-based spatial clustering of applications with noise (DBSCAN)) becomes computationally expensive and time-consuming in the presence of noise objects.
In this paper, a novel scheme is proposed to resolve the problems of instability and inefficiency in clustering online learning resources. Outliers are first eliminated based on the density of each resource. Then, a minimum spanning tree is constructed based on the distances among resources. The degree of freedom of each resource is subsequently calculated based on the structure of the minimum spanning tree. The resource with the largest value of degree of freedom is considered the initial centroid. In comparison with the previous work (Wang et al., 2015), a number of enhancements have been made: a more comprehensive literature review has been conducted in Section 2; the effectiveness of the proposed algorithm together with other clustering algorithms is evaluated with two additional data sets, including a two-dimensional data set (Section 4.3) and a real-world e-learning data set (Section 4.4), to improve the generalization of results; one classical density-based clustering method (i.e. DBSCAN) is implemented for comparison, and the experimental results are analyzed in more detail; and more detailed information and in-depth discussion is provided in the introduction, experiments, conclusion, and future research directions.
The rest of the paper is organized as follows. Section 2 describes related work on e-learning systems and clustering of online learning resources. Section 3 presents a novel clustering algorithm based on minimum spanning tree. Section 4 evaluates clustering algorithms with four data sets. Section 5 discusses the directions of incorporating the proposed clustering algorithm into e-learning systems. Finally, Section 6 provides concluding remarks.
2. Related works

2.1 E-learning systems

E-learning is valuable to educational institutions, corporations, and all types of learners as it eliminates distances and subsequent commutes (Phobun and Vicheanpanya, 2010). It is affordable and time-saving because a wide range of online learning resources can be accessed from properly equipped computer terminals. Thus, the development of e-learning systems is one of the fastest growing trends in educational uses of technology (Li et al., 2009). Applications and components of e-learning systems include construction of learning models, prediction of learners' learning behavior, development of mobile applications, and so forth. For instance, Zou et al. (2014) proposed an incidental word learning model for e-learning. In particular, they measured the load of various incidental word learning tasks so as to construct load-based learner profiles. A task generation method was further developed based on the learner profile to increase the effectiveness of various word learning activities. Boyer and Veeramachaneni (2015) designed a set of processes which take advantage of knowledge from both previous courses and previous weeks of the same course to make real-time predictions on learners' behavior. Ferschke et al. (2015) implemented a Lobby program through which students can be connected via a live link at any time. Zbick (2013) presented a web-based approach that provides an authoring tool for creating mobile applications for data collection purposes.

2.2 Clustering of learning resources
It is believed that e-learning systems should provide a variety of learning resources to satisfy the needs of different learners (Sabitha et al., 2016). With the rapid growth of online learning resources, learners are facing a serious problem of information overload. A tool is urgently required to assist learners in finding similar learning materials efficiently. Clustering algorithms are extensively employed for community discovery (Xie et al., 2012) and event detection (Rao and Li, 2012), which are important research topics in e-learning. Sabitha et al. (2016) employed a fuzzy clustering technique to combine learning and knowledge resources based on attributes of metadata. Mansur and Yusof (2013) tried to reveal the behavior of students from all activities in the Moodle e-learning system by using ontology clustering techniques. In their ontology model, the forum, quiz, assignment, and many other activities were used as clustering parameters. Govindarajan et al. (2013) employed a particle swarm optimization algorithm to analyze and cluster continuously captured data from students' learning interactions. However, some useless resources may exist in e-learning systems.
It is important to remove the noise objects before clustering. Mimaroglu and Erdil (2011) defined two variables named weight and attachment to address the issue of noise objects. The first one (i.e. weight) measures the similarity between two objects, and the second one (i.e. attachment) ranks the quality of each candidate centroid. Noise objects are removed based on their measurements of weight and attachment. Luo et al. (2010) proposed another method to exclude "fake" centroids based on the notion of density, as follows: let X = {x1, x2, …, xn} be the set of objects, and let DEN(xi) be the density of object xi. A small value of DEN(xi) indicates that xi is located in a relatively high-density region, and vice versa. The density of xi is compared with the average density ADEN. If an object has a DEN value higher than the average density, it is considered a "fake" centroid and therefore eliminated.
3. Clustering of online learning resources

3.1 The overall framework

The increasing availability of digital educational materials on the internet, called online learning resources, has been followed by the definition of indexing standards. However, the selection process of these elements is challenging to learners because of the diversity of metadata approaches, in addition to the lack of consensus about the definition of learning resources (Silva and Mustaro, 2009). In light of these considerations, learners need effective and efficient clustering methods to organize and manage such a large volume of online learning resources. The objective of clustering online learning resources in this study is to assign class labels to various learning resources by eliminating outliers, and to improve the accuracy of the clustering algorithm based on the minimum spanning tree as well as procedures for merging learning resources and small clusters.
As illustrated in Figure 1, a clustering framework for online learning resources with four key steps is proposed as follows:

(1) The density of each instance of online learning resource is measured in order to identify and eliminate outliers. Learning resources that are few and scattered in their areas are removed in this step.
(2) A minimum spanning tree is constructed to create a link of all learning resources. The minimum spanning tree is helpful to detect clusters of different shapes and sizes (Päivinen, 2005).
(3) A partitioning method based on the structure of minimum spanning tree is employed to merge learning resources into clusters.
(4) The small clusters that contain only a few learning resources are also merged into large ones.
The density-based clustering algorithm proposed in this paper can be applied to a number of areas in e-learning, for example, classification of learner, discovery of learning path, recommendation of learning resource, and intelligent tutoring. On the other hand, the key parameters of the proposed approach are explained below in the context of online learning resources for better understanding of the paper:

• distance between two online learning resources measures the dissimilarity between the contents of the two resources;

• density of an online learning resource measures the number of learning resources that are similar to the resource, i.e., whose distances to the learning resource are less than a threshold value;

• outlier or noise learning resource is a learning resource that is very different from the others; and

• usefulness of a learning resource refers to the relevancy of the learning resource to the learner's study or learning interest.
Mathematical definitions of the parameters can be found in the following subsection.

3.2 Elimination of outliers
The existence of outliers will produce useless learning resources in e-learning and disturb the effect of clustering. In order to solve this problem, the method proposed by Luo et al. (2010) is incorporated into the proposed algorithm. The related definitions are shown below:

Definition 1. The density of an object (i.e. an online learning resource) xi is:

DEN(xi) = (1/(p − 1)) Σ_{j=1, j≠i}^{p} d(xi, xj),

where d(xi, xj) is the Euclidean distance (Deza and Deza, 2016) between xi and xj.
Definition 2. The average density of online learning resources is:

ADEN = (1/p) Σ_{i=1}^{p} DEN(xi),

where p is the number of online learning resources.
Lemma 1. The density DEN of some normal online learning resources is larger than the average density ADEN.
Proof 1. If all online learning resources are normal, i.e., there are no outliers, then the value of ADEN must lie between DENmin and DENmax. Thus, the density DEN of some normal online learning resources must be larger than the average density ADEN. ◼

According to Lemma 1, a constant DEV is added to the average density ADEN. If DEN(xi) is larger than the sum of ADEN and DEV, the corresponding resource is considered an outlier and removed from the data set.
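The elimination stage can be sketched in Python as below, assuming DEN(xi) is the average Euclidean distance from xi to all other resources, in line with Definitions 1 and 2; the function names and the sample points are illustrative only:

```python
import math

def density(points, i):
    """DEN(x_i): average Euclidean distance from point i to all other points.
    A small DEN indicates a relatively high-density region."""
    dists = [math.dist(points[i], p) for j, p in enumerate(points) if j != i]
    return sum(dists) / len(dists)

def eliminate_outliers(points, dev):
    """Remove every point whose DEN exceeds ADEN + DEV (Lemma 1)."""
    dens = [density(points, i) for i in range(len(points))]
    aden = sum(dens) / len(dens)  # ADEN, the average density
    return [p for p, dn in zip(points, dens) if dn <= aden + dev]

# Four nearby resources plus one far-away outlier
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kept = eliminate_outliers(data, dev=0.5)
# → the outlier (10, 10) is removed; the four nearby points remain
```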

3.3 Generation of minimum spanning tree
After the elimination of noise resources, there is still a huge number of online learning resources. As a result, an efficient clustering technique is required to group similar learning resources together as clusters. The distance between each pair of remaining objects is first calculated, and then the minimum spanning tree of the remaining learning resources is built accordingly by using Prim's (1957) algorithm (Algorithm 1).

Algorithm 1. Generating the minimum spanning tree.
Input: a weighted connected graph, with a vertex set V and an edge set E
Output: a set V_new and a set E_new by which the minimum spanning tree is described
1: initialization: choose an arbitrary vertex v and add it into V_new;
2: while V_new ≠ V do
3: choose an edge ⟨u, v⟩ of minimum weight such that u ∈ V_new and v ∉ V_new;
4: add v into V_new and add ⟨u, v⟩ into E_new;
5: end while
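Under the assumption that resources are represented as Euclidean feature vectors, Algorithm 1 can be sketched as a simple array-based O(n²) implementation of Prim's algorithm (illustrative code, not the paper's own):

```python
import math

def prim_mst(points):
    """Build a minimum spanning tree over `points` (complete Euclidean graph).
    Returns the MST edges as (parent_index, child_index) pairs."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight connecting i to the tree
    parent = [-1] * n
    best[0] = 0.0           # start the tree from vertex 0
    edges = []
    for _ in range(n):
        # pick the cheapest vertex not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u))
        # relax edges from u to the remaining vertices
        for v in range(n):
            w = math.dist(points[u], points[v])
            if not in_tree[v] and w < best[v]:
                best[v], parent[v] = w, u
    return edges

# Two pairs of nearby points: the MST keeps both short edges and one long bridge
pts = [(0, 0), (0, 1), (5, 0), (5, 1)]
mst = prim_mst(pts)
total = sum(math.dist(pts[u], pts[v]) for u, v in mst)
# → 3 edges with total weight 7.0 (1 + 5 + 1)
```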

3.4 Merging learning resources into clusters
In the previous subsection, a minimum spanning tree is generated. The degree of freedom of each instance of learning resources can be obtained by using Definition 3:

Definition 3. The degree of freedom of an object (i.e. an online learning resource) xi is:

DOF(xi) = |{e ∈ E : xi is an endpoint of e}|,

where E denotes the set of edges of the minimum spanning tree.
It is believed that an object with a large value of degree of freedom has a large number of neighbors, and therefore may be a centroid (Mimaroglu and Erdil, 2011). Thus, the learning resources are sorted according to their degree of freedom. Subsequently, the learning resources are partitioned into clusters based on the structure of the minimum spanning tree and their degree of freedom (Algorithm 2). Figure 2 provides an example to demonstrate the operation of Algorithm 2, and Table I shows the Euclidean distance between all pairs of objects.
This algorithm is illustrated as follows: (1) Six objects v1, v2, …, v6 are used as an example (Figure 2). The parameters DEV and m are set to 0.5 and 2, respectively. The values of density DEN of each object are shown in Table II. Because there is no noise object, all objects are retained.
(2) A minimum spanning tree for the objects is generated by using Prim's algorithm. The resulting edges of the tree are (v1, v2), (v1, v4), (v2, v3), (v4, v5), and (v5, v6).

(3) Table II shows the degree of freedom of each object. The objects are sorted in descending order of their degree of freedom. The order of the objects after sorting is v1, v2, v4, v5, v3, v6. Thus, object v1 is put into the first cluster, i.e., Cluster 1.
(4) The immediate neighboring objects of v1 are v2 and v4. Object v2 is put into Cluster 1.

(5) The neighboring objects of the newly added object are subsequently considered. Because object v2 is the newly added object, object v3, which is the immediate neighbor of object v2, is considered. Object v3 is put into Cluster 1 because v3 has no other neighboring objects and the minimum distance between object v3 and its neighboring objects is d(v2, v3). Now, Cluster 1 has three objects, i.e., {v1, v2, v3}.
(6) The object with the largest value of degree of freedom is first chosen among the remaining objects. Among the three objects v4, v5, and v6, object v4 has the largest value of degree of freedom. The neighboring objects of v4 are v1 and v5. As object v1 has been put into Cluster 1, it is not considered here. Object v5 is put into Cluster 2 because d(v4, v5) = d(v5, v6). After that, the neighboring objects of the newly added object are considered, i.e., object v6. Object v6 is also added into Cluster 2 because it has no other neighboring objects, and the minimum distance between object v6 and its neighboring objects is d(v5, v6). At last, all objects have been put into clusters and two clusters are generated by the algorithm, i.e., Cluster 1 = {v1, v2, v3} and Cluster 2 = {v4, v5, v6} (Figure 3). Figure 2 shows the data before clustering and Figure 3 the data after clustering.
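Definition 3 and the sorting step can be checked on this worked example. The edge list below is the tree structure implied by the neighbor relations described in steps (4)-(6), with objects indexed 1 to 6 (the original figures are not reproduced here):

```python
from collections import Counter

def degree_of_freedom(edges):
    """DOF(x_i): the number of MST edges incident to object i (Definition 3)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

# MST edges implied by the walkthrough: v1-v2, v1-v4, v2-v3, v4-v5, v5-v6
edges = [(1, 2), (1, 4), (2, 3), (4, 5), (5, 6)]
deg = degree_of_freedom(edges)
# sort by descending degree of freedom, breaking ties by object index
order = sorted(deg, key=lambda i: (-deg[i], i))
# → [1, 2, 4, 5, 3, 6], i.e., v1, v2, v4, v5, v3, v6 as in step (3)
```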

3.5 Merging small clusters by the distance
Based on the minimum spanning tree generated, an initial clustering result is obtained by using Algorithm 2. However, there may be a large number of small clusters which only contain a few learning resources. To save computational resources, the small clusters are further merged into large clusters. Algorithm 3 details the merging of small clusters into large clusters, where minNum indicates the minimum number of objects required for a cluster, and minDis represents the minimum distance between clusters. If the number of objects in one cluster is less than minNum and the distance between the cluster and its closest neighboring cluster is less than minDis, the cluster is merged into its closest neighboring cluster.
Algorithm 3. Merging small clusters based on the distance.
1: for every cluster obtained above do
2: if the number of objects in the cluster < minNum then
3: if the distance between the cluster and its closest neighboring cluster < minDis then
4: merge the cluster into its closest neighboring cluster;
5: end if
6: end if
7: end for

3.6 Comparison of the proposed technique with density-based clustering methods

Clustering approaches are very popular for understanding the natural grouping or structure in a data set. There are various clustering algorithms such as K-means, bisecting K-means, K-medoids, and fuzzy c-means clustering. The main drawback of these approaches is the random selection of initial centroids (i.e. the instability issue). In addition, the traditional clustering approaches can find only spherical-shaped clusters (Govindarajan et al., 2013). Other clustering methods have been developed for non-spherical cluster shapes based on the notion of density. Density-based clustering can be used to filter out noise objects (outliers) and discover clusters of arbitrary shape effectively (Duan et al., 2006). DBSCAN is one of the most widely used density-based clustering algorithms; it can discover clusters of arbitrary shape in spatial databases with noise objects (Ester et al., 1996). The general idea of DBSCAN is that for each instance of a cluster, the neighborhood of a given radius (ε) has to contain at least a minimum number of points (MinPts), where ε and MinPts are parameters set manually by users. If a spatial index is used, the computational complexity of DBSCAN is O(n log n), where n is the number of objects. Otherwise, its computational complexity is O(n²).
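A minimal sketch of Algorithm 3, assuming clusters are represented as lists of 2-D points and taking the inter-cluster distance to be the minimum pairwise point distance (the paper does not specify which inter-cluster distance measure is used; the function name is illustrative):

```python
import math

def merge_small_clusters(clusters, min_num, min_dis):
    """Merge each cluster with fewer than min_num points into its closest
    neighboring cluster, provided their distance is below min_dis."""
    def dist(a, b):
        # single-linkage distance between two clusters (an assumption)
        return min(math.dist(p, q) for p in a for q in b)

    changed = True
    while changed:
        changed = False
        for i, c in enumerate(clusters):
            if len(c) < min_num and len(clusters) > 1:
                # closest neighboring cluster of c
                j = min((k for k in range(len(clusters)) if k != i),
                        key=lambda k: dist(c, clusters[k]))
                if dist(c, clusters[j]) < min_dis:
                    clusters[j].extend(c)   # merge c into its closest neighbor
                    del clusters[i]
                    changed = True
                    break                   # restart: indices have shifted
    return clusters

# A three-point cluster, a nearby singleton, and a distant pair
groups = [[(0, 0), (0, 1), (1, 0)], [(0.5, 1.5)], [(10, 10), (10, 11)]]
merged = merge_small_clusters(groups, min_num=2, min_dis=2.0)
# → the singleton is absorbed into the first cluster, leaving two clusters
```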
In this paper, a novel clustering technique is proposed based on the minimum spanning tree. Because the minimum spanning tree is built by using Prim's algorithm and the running time of Prim's algorithm is O(n²), the overall running time of the proposed technique is also O(n²). In this regard, the computational complexity of the proposed technique is comparable with that of DBSCAN. However, the efficiency of DBSCAN is highly dependent on appropriate settings of the user-defined parameters ε and MinPts, and its performance becomes computationally expensive and time-consuming in the presence of noise objects. The proposed technique is free of this problem because noise objects are removed at an early stage.

4. Experiments
In this section, the proposed clustering technique is evaluated by using four different data sets. First, we employ three data sets (i.e. "Smileface," "Aggregation," and "Jain" data sets) to test the effectiveness of our method and standard clustering algorithms, because these data sets have quite different densities, scales, and shapes. Second, a large-scale set of discussion threads from the forums of Coursera MOOCs is used for real-world validation. Specifically, the "Smileface" data set contains clusters with both uniform and uneven densities, which is suitable to evaluate the effectiveness of density-based clustering algorithms. The "Aggregation" data set has seven clusters with different scales, and the "Jain" data set contains two clusters with ambiguous boundaries. These features may be present in online learning resources and bring challenges to clustering approaches. The classical K-means clustering, average-link hierarchical clustering, complete-link hierarchical clustering, and DBSCAN methods are implemented in this study for comparison.

4.1 Results of different clustering algorithms on the "Smileface" data set
The algorithm proposed is first evaluated with the data set named "Smileface." The "Smileface" data set contains a total of 644 points which belong to four different clusters.
The K-means clustering performs very well on data points in globular-shaped clusters. However, clusters in the "Smileface" data set are not globular. Figure 4 shows the result of K-means clustering with K equal to 4. Figures 5 and 6 show the results of the average-link hierarchical clustering algorithm and the complete-link hierarchical clustering algorithm, respectively. It is observed that the four clusters are not separated very well by the three baseline algorithms (Table III). As shown in Figure 7, our clustering scheme is robust and can handle the outliers very well. The clusters produced by our algorithm are more satisfactory than those of the three baseline algorithms.

4.2 Results of different clustering algorithms on the "Aggregation" data set
Similarly, all clustering algorithms are also evaluated with the data set named "Aggregation." The "Aggregation" data set contains a total of 788 points, which belong to seven different clusters. In comparison with the "Smileface" data set, the "Aggregation" data set is more complex. Figure 8 shows the result of K-means clustering. It is observed that the red cluster contains points which belong to three different clusters. Furthermore, two clusters on the right-hand side with internal touch are separated into three clusters. As shown in Figure 9, the performance of average-link hierarchical clustering is good. However, the complete-link hierarchical algorithm performs very poorly on the "Aggregation" data set (Figure 10). Figure 11 shows the experimental result of our algorithm. It exactly separates the two clusters on the right-hand side which are wrongly separated by K-means clustering. However, it groups the points which belong to three clusters in the lower-left corner into two clusters. This problem will be further investigated in the future, as it provides a research direction for enhancement of the proposed algorithm.

4.3 Results of different clustering algorithms on the "Jain" data set

A two-dimensional data set, "Jain," is used for further evaluation of the robustness of the clustering algorithms over clusters with different densities. The "Jain" data set contains a total of 373 points, which belong to two clusters. Different from the aforementioned two data sets, the cluster densities of this data set differ from each other. Figure 12 shows the result of K-means clustering. It is observed that the performance of K-means on the "Jain" data set is poor, since the two clusters are not globular in shape. Figures 13 and 14 show the results of the average-link hierarchical clustering and complete-link clustering algorithms, respectively. In this case, it is observed that these two algorithms produce the same experimental results. Figure 15 shows the result of our algorithm, which treats the points with low density as noise points and eliminates them. This indicates that our algorithm is more suitable for identifying dense resources than the other baseline methods.

4.4 Results of different clustering algorithms on the e-learning data set
In this section, the proposed minimum spanning tree based clustering algorithm is compared with the classical density-based algorithm DBSCAN and the best-performing baseline, average-link hierarchical clustering, by using a real-world e-learning data set (Rossi and Gnawali, 2014). This data set is the anonymized version of the discussion threads from the forums of 60 Coursera MOOCs, comprising a total of about 100,000 threads. After removing the redundant items, 73,942 learning instances are used for evaluation. A total of 197 distinct courses are assigned to four clusters (i.e. automata-002, bigdata-edu-001, humankind-001, and gametheory-003).
The characteristics of the "MOOCs" data set are used to set the density parameters of DBSCAN, and the density of each object is used as the parameter of our minimum spanning tree algorithm.
The actual clusters of the MOOCs data set are shown in Figure 16. By tuning various combinations of parameters, the best clustering result of DBSCAN is shown in Figure 17. The result of average-link hierarchical clustering, which performed well on the previous "Aggregation" data set, is shown in Figure 18. However, it is observed that these two baselines both generate some errors on the e-learning data set. The clustering result of the proposed algorithm is shown in Figure 19, which is nearly the same as the ground truth (Figure 16).
On the one hand, the proposed minimum spanning tree based clustering algorithm shows higher accuracy and can group the online courses into clusters effectively. On the other hand, our algorithm can find the appropriate parameters efficiently on the "MOOCs" data set, i.e., it is robust in making a correct distinction between the labeled and unlabeled e-learning data sets.
The experimental results also indicate that determination of parameters for different clustering algorithms on e-learning data sets is a critical factor which affects the effectiveness of the algorithms. This provides another direction for future research.

5. E-learning applications
This section will discuss briefly how to apply the novel clustering technique proposed in various e-learning systems. Generally, the algorithm proposed can be employed in the following four aspects.

5.1 Classification of learner
As there may be tens of thousands of learners enrolled in a single course in MOOCs, it is very important to cluster the learners into groups. The effectiveness of learning can be greatly improved by homogeneous grouping: because the learners in a group have common characteristics, learning materials and teaching strategies can be adjusted accordingly. For instance, the MITx and HarvardX (2014) data set suggests that learners with good academic results have similar patterns of course video playback. These learners are quite close to each other if their attributes (e.g. frequency of playing videos) are plotted in an n-dimensional graph. As a result, the proposed clustering algorithm can differentiate the learners, and corresponding assistance can be offered subsequently.

5.2 Discovery of learning path
Discovery of learning path is a classical application in e-learning systems. However, it is very time-consuming and extremely challenging for users to identify their optimized learning paths when they wish to acquire new knowledge in a specific topic. A key step in the discovery of a learning path is to identify whether there is a strong linkage between two knowledge units (Leung and Li, 2003). By using the clustering result produced by the proposed method, it can be easily determined whether two knowledge units are in the same cluster. This is less time-consuming because it is not required to compare all pairs of knowledge units.

5.3 Recommendation of learning resource
In web-based learning, learners face the problem of online learning resource overload. It is essential to identify suitable learning resources from a potentially overwhelming variety of choices (Manouselis et al., 2010). The proposed algorithm can discover the natural grouping of online learning resources effectively, and the system can easily recommend both interesting and relevant learning resources to learners by using the clustering result.

5.4 Intelligent tutoring
Intelligent tutoring is a generation of learning-oriented methodology that incorporates the individuality of the learner in the learning process. It is very similar to what happens in a traditional individualized lesson with one tutor and one learner. In the learning-oriented approach, technology needs to be adapted to the needs of learners and tutors to create suitable methods for working with it (Aberšek et al., 2014). To this end, clustering of online learning resources is valuable to decision making in terms of tutorial strategies and instructional design.

6. Conclusions
A clustering algorithm for online learning resources based on the minimum spanning tree is proposed in this paper. Outliers are removed according to the density of each resource, which is measured by using the Euclidean distance. A minimum spanning tree is generated to connect neighboring online learning resources together by edges. The K-means clustering, average-link hierarchical clustering, complete-link hierarchical clustering, and DBSCAN algorithms are tested with four data sets in order to evaluate the performance of the different clustering techniques. Furthermore, it is elaborated how to apply the proposed algorithm in four different e-learning applications (i.e. classification of learner, discovery of learning path, recommendation of learning resource, and intelligent tutoring). The experimental results demonstrate the effectiveness of the proposed algorithm. Our technique will shed light on real-world online learning, i.e., the minimum spanning tree based clustering algorithm can classify a large amount of learning resources according to their characteristics. Such a feature can reduce the time spent searching for learning resources, alleviate the problem of ineffective study, and improve the efficiency of online learners. In the future, the density-based clustering method will be applied to choose representative documents for sentiment analysis. Moreover, the proposed algorithm will be further evaluated by using a large and high-dimensional learning corpus, as well as more real-world data sets. It will also be valuable to conduct a longitudinal study.
