Application of keyword extraction on MOOC resources

Purpose – Recent years have witnessed the rapid development of massive open online courses (MOOCs). With more and more courses being produced by instructors and participated in by learners all over the world, unprecedented massive educational resources are aggregated. The educational resources include videos, subtitles, lecture notes, quizzes, etc., on the teaching side, and forum contents, Wiki, logs of learning behavior, logs of homework, etc., on the learning side. However, the data are both unstructured and diverse. To facilitate knowledge management and mining on MOOCs, extracting keywords from the resources is important. This paper aims to adapt the state-of-the-art techniques to MOOC settings and evaluate their effectiveness on real data. In terms of practice, this paper also tries to answer, for the first time, the questions of to what extent MOOC resources can support keyword extraction models and how much human effort is required to make the models work well. Design/methodology/approach – Based on which side generates the data, i.e. instructors or learners, the data are classified into teaching resources and learning resources, respectively. The approach used on teaching resources is based on machine learning models with labels, while the approach used on learning resources is based on a graph model without labels. Findings – From the teaching resources, the methods used by the authors can accurately extract keywords with only 10 per cent labeled data. The authors find a characteristic of the data: resources of various forms, e.g. subtitles and PPTs, should be considered separately because they have different modeling abilities. From the learning resources, the keywords extracted from MOOC forums are not as domain-specific as those extracted from teaching resources, but they can reflect the topics that are actively discussed in forums, from which instructors can get feedback.
The authors implement two applications with the extracted keywords: generating a concept map and generating learning paths. The visual demos show they have the potential to improve learning efficiency when they are integrated into a real MOOC platform. Research limitations/implications – Conducting keyword extraction on MOOC resources is quite difficult because teaching resources are hard to obtain due to copyrights. Also, getting labeled data is tough because expertise in the corresponding domain is usually required. Practical implications – The experiment results support that MOOC resources are good enough for building models of keyword extraction, and an acceptable balance between human effort and model accuracy can be achieved. Originality/value – This paper presents a pioneer study on keyword extraction on MOOC resources and obtains some new findings.


Introduction
In recent years, massive open online courses (MOOCs) have benefited tens of millions of students all over the world. A very important characteristic of MOOCs is that they provide a one-stop online learning environment consisting of lecture videos, assignments, email notifications, discussion forums, quizzes and examinations. Along with the popularity of MOOCs, a large amount of online educational resources of various subject areas, ranging from the humanities to science, are being produced at an unprecedented rate. Not only can instructors provide videos, subtitles, lecture notes, questions, etc., but learners can also generate forum contents, Wiki posts, logs of homework submissions, etc. In fact, each MOOC platform is a large-scale "knowledge base" where the educational resources can be regarded as the outcome of crowd intelligence (from both instructors and learners). However, those resources are unstructured and diverse. For example, subtitles are well organized and formal, as they are usually produced by instructors, whereas the contents of posts are written by different learners and are thus colloquial and informal. As Figure 1 shows, MOOC resources can be divided into teaching resources and learning resources. By proposing proper models to discover knowledge from this crowd intelligence, it is promising to implement knowledge management, knowledge mining and even smart education for MOOCs. This paper explores the task of keyword extraction and its applications on MOOC resources. The reason for conducting this task is that in most work on knowledge engineering, e.g. construction of knowledge graphs and knowledge management, entity extraction is the first step. For our task, we refer to the "entity" as a keyword. The meaning of keyword in the educational setting is general and can intuitively include concepts, terminology, named entities and so on.
Keyword extraction from MOOC resources faces several difficulties: MOOCs cover different subject areas, so domain-specific methods would not help much, and the method we use should be instructor- and course-agnostic; obtaining a labeled training data set is extremely expensive, as domain expertise is usually required; and the volume of data is usually large and the textual styles are varied.
Despite those difficulties, once keywords are well extracted, many subsequent applications are feasible, e.g. construction of course-specific and domain-specific concept map, management of cross-domain concepts, knowledge discovery from crowd and even personalized learning by mining learners' behaviors.
Based on the partition by who generates the MOOC resources, i.e. instructors and learners, the research design of this paper is composed of three parts: (1) keyword extraction on resources generated by instructors; (2) keyword extraction on resources generated by learners; and (3) applications with keywords in MOOC settings.
As to the first part, it is difficult to collect the entire instructor-generated resources of many courses. Also, labeling the data requires expertise in the corresponding subject area. Even so, we invite the instructor and teaching assistants (TAs) to help label the teaching resources of one course, as we expect to use human knowledge to learn a classifier by supervised machine learning methods. Moreover, we design a semi-supervised learning framework to test whether using less labeled data is practical. We regard this task as a natural language processing problem, i.e. word sequence labeling. Sutton and McCallum (2011) show that probabilistic graphical models, especially conditional random fields (CRFs), can obtain state-of-the-art performance in many sequence labeling tasks like part-of-speech (POS) tagging, named entity recognition (NER) and word segmentation, so we leverage this kind of model to extract keywords from MOOC teaching resources.
As to the second part, keyword extraction on resources generated by learners, i.e. discussion forum contents, it is relatively convenient to collect the contents of many courses. However, the number of posts may be quite large, e.g. over ten thousand, so it is difficult to apply human knowledge through labeled data. On the other hand, as a kind of social media, forums have relational information between learners and contents. Drawing on methods of keyword extraction for social media, we model the MOOC forum of each course as a heterogeneous network. Then, through a graph-based random walk algorithm, keywords are extracted by ranking the importance of each word. We regard the top words in the ranking list as keywords.
After keywords are extracted, many novel educational applications can be developed within the MOOC settings. In the third part of this paper, we introduce two preliminary applications: generation of a concept map and generation of learning paths. Romero and Ventura (2010) propose that in the educational field, a concept map is useful for instructors to organize, design and manage course resources. We propose a new concept map called the semantic concept map (SCM). The main difference from traditional concept maps is that the edges, i.e. relationships between keywords, are defined by semantic similarity. This kind of concept map can be easily extended to various courses. Then, based on the SCM, we propose a method to automatically generate learning paths, which have potential for personalized learning.

IJCS 1,1
In what follows, we review the related work in Section 2. Section 3 introduces data sets used in this paper. Section 4 introduces the method of keyword extraction on the side of teaching resources. The corresponding method on the learning side, i.e. forum contents, is introduced in Section 5. Then in Section 6, we report the experiment results obtained from both sides of resources respectively. In Section 7, we state the two demo applications with extracted keywords. Finally, we conclude this paper in Section 8.

Related work
A keyword is a word that people regard as important in a text. In different situations, a keyword can be a named entity, a proper noun, a terminology or a concept. In this paper, the meaning of keyword is general, so unless otherwise specified, their differences are neglected.
In the past decades, Finkel et al. (2005), Nadeau and Sekine (2007) and Ratinov and Roth (2009) have studied keyword extraction tasks with machine learning methods, e.g. NER, terminology extraction and key phrase extraction. NER methods focus on named nouns, such as person names, locations, times and addresses, and they are often used for constructing knowledge bases, as seen in Dong et al. (2014) and Nickel et al. (2015). Terminology extraction methods are developed to extract domain-specific words. Recently, Nojiri and Manning (2015) and Qin et al. (2013) have proposed machine learning-based methods for keyword extraction. However, methods for one kind of keyword extraction may not transfer to another. For example, Nojiri and Manning (2015) show that directly applying existing NER methods to terminology extraction does not perform well. Our task differs in that we have labeled data.
Apart from supervised machine learning methods with human knowledge, another perspective on keyword extraction is the unsupervised approach. For example, Justeson and Katz (1995) propose a rule-based method, and Frantzi et al. (2000) and Bin and Shichao (2011) propose statistical methods. In this paper, we leverage a graph-based method proposed by Sonawane and Kulkarni (2014), which models the social relationships between words as a network and then ranks all the words according to their importance.
To our knowledge, a large number of studies of data analytics on MOOC data have been proposed in recent years. For example, Anderson et al. (2014) classify MOOC learners by analyzing their behavior patterns; they also study how to use a badge system to produce incentives based on learners' activity and contribution in the forum. Huang et al. (2014) analyze the behaviors of superposters in 44 MOOC forums and find that MOOC forums are mostly healthy. Wen et al. (2014) study sentiment analysis in MOOC discussion forums and find no positive correlation between the sentiment of posts and course dropout. Wang et al. (2015) study the learning gains reflected in forum discussions. Jiang et al. (2015) conduct an analysis from the perspective of influence by modeling the MOOC forum as a heterogeneous network. Kizilcec et al. (2013) research the behavior of learner disengagement. Moreover, some statistical reports and case study papers analyze the behavior of MOOC learners, such as Ho et al. (2013) and Breslow et al. (2013). However, few studies of keyword extraction have been conducted on MOOC data. Romero and Ventura (2010) and Novak and Cañas (2006) define a concept map as a connected graph that shows relationships between concepts and expresses the hierarchical structure of knowledge. To our knowledge, plenty of work on automatically constructing concept maps with data mining techniques has been done. For example, Tseng et al. (2007), Lee et al. (2009) and Qasim et al. (2013) leverage association-rule mining; Chen et al. (2008), Lau et al. (2009) and Huang et al. (2006, 2015) are based on text mining; and Marian and Maria (2009) and Chu et al. (2007) design specific algorithms. However, the majority of those methods are domain-specific, e.g. for specific courses or specific learning settings.
We expect to explore new methods by reducing their dependency on domains, so a new kind of semantic relationship is leveraged in this paper.

Overview of data sets
Based again on the partition by the source of the resources, i.e. instructors and learners, we introduce the available MOOC data respectively.

Resources on teaching side
We collect the resources of an interdisciplinary course conducted in the fall of 2013 on Coursera. The course involves computer science, social science and economics. The textual content includes video subtitles, PPTs, questions and forum contents (i.e. threads, posts and comments). Table I shows the statistics of the resources. We invited the instructor and two TAs to help label the data. As seen in Table I, the number of keywords in questions and PPTs is much smaller than that in subtitles. Based on our observation while labeling the data, the instructor and TAs still spent much time understanding each sentence, even though they should be more familiar with the contents than anyone else. We suspect this is because the resources were composed by different people. During the labeling activity, each person spent about 8 h labeling 3,000 sentences (on average 10 s per sentence).
A preprocessing step of word segmentation for Chinese may be necessary. We adopt the Stanford Word Segmenter[1] proposed by Chang et al. (2008). All data are randomly shuffled before further processing.

Resources on learning side
We collect data from 12 courses offered by Peking University on Coursera. They were offered in the Fall Semester of 2013 and the Spring Semester of 2014. In total, there are over 4,000 threads and over 24,000 posts. For later convenience, Table II lists the pairs of course codes and course titles. Table III shows the statistics of the data sets per course. "Posts" denotes both posts and comments.

Keyword extraction on teaching side
The resources generated by instructors in MOOCs mainly include lecture notes, subtitles, PPTs and questions. To extract keywords from the teaching resources, we regard this task as a sequence labeling problem, similar to other sequence labeling tasks such as NER and part-of-speech annotation, so probabilistic graphical models are the natural solution. Sutton and McCallum (2011) show that CRFs can achieve state-of-the-art performance. We define instructor- and course-agnostic features to reduce the domain dependency. Moreover, we propose a semi-supervised learning framework to reduce the human effort of labeling data.

Conditional random field's model
The problem of keyword extraction can be formally described as solving the conditional probability P(Y|X). The random variable X refers to the features of a sentence, which follows a word sequence x = {x_1, x_2, ..., x_T}, and the random variable Y is the label sequence of the sentence, y = {y_1, y_2, ..., y_T}. The label of a word is one of three classes: NO, ST and IN, which respectively mean not a keyword, the beginning word of a keyword and a middle word of a keyword. So the label variable is Y ∈ {NO, ST, IN}. We consider the conditional probability of the label sequence Y, i.e. p(Y|X), rather than the joint probability p(Y, X), so the linear-chain CRF framework proposed by Lafferty et al. (2001) is the natural choice. The conditional distribution over a label sequence y, given an observation word sequence x, can be defined as:

p(y|x) = (1/Z(x)) exp( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_{t-1}, y_t, x, t) )

where Z(x) is the normalization factor, {f_k} is the set of feature functions defined on a given x, and Θ = {λ_k} ∈ R^K is the parameter vector. T is the length of the sentence and K is the number of features. Given a training data set, the model Θ = {λ_k}_{k=1}^{K} can be learned by maximum likelihood estimation. To avoid overfitting, we add a regularization term to the objective. The log-likelihood function of p(y|x; λ), regularized by the Euclidean norm under λ ~ N(0, σ²), is:

ℓ(Θ) = \sum_i log p(y^{(i)} | x^{(i)}; Θ) − \sum_{k=1}^{K} λ_k^2 / (2σ²)

So the gradient function is:

∂ℓ/∂λ_k = \sum_i \sum_t f_k(y_{t-1}^{(i)}, y_t^{(i)}, x^{(i)}, t) − \sum_i \sum_t \sum_{y, y'} f_k(y, y', x^{(i)}, t) p(y, y' | x^{(i)}) − λ_k / σ²

The details of learning the CRF model can be found in Sutton and McCallum (2011).
Then, given a new word sequence x* and a learned model Θ = {λ_k}_{k=1}^{K}, the optimal label sequence y* can be calculated by:

y* = argmax_{y ∈ Y} p(y | x*; Θ)

where Y is the set of all possible label sequences for the given sentence x*. We use the L-BFGS algorithm to learn the model and the Viterbi algorithm to infer the optimal label sequence y*.
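As a concrete illustration of the inference step, the following is a minimal, self-contained Viterbi decoder over the three labels NO/ST/IN. The emission and transition scores are placeholders standing in for the learned CRF potentials; this is a sketch of the algorithm, not the authors' implementation.

```python
LABELS = ["NO", "ST", "IN"]

def viterbi(emission, transition):
    """Find the most likely label sequence under a linear-chain model.

    emission:   list of dicts, emission[t][label] = score of label at position t
    transition: dict, transition[(prev, cur)] = score of moving prev -> cur
    Returns the argmax label sequence (the Viterbi path).
    """
    T = len(emission)
    # delta[t][y]: best score of any path ending in label y at position t
    delta = [{y: emission[0][y] for y in LABELS}]
    back = [{}]
    for t in range(1, T):
        delta.append({})
        back.append({})
        for y in LABELS:
            best_prev = max(LABELS, key=lambda p: delta[t - 1][p] + transition[(p, y)])
            delta[t][y] = delta[t - 1][best_prev] + transition[(best_prev, y)] + emission[t][y]
            back[t][y] = best_prev
    # backtrack from the best final label
    y_last = max(LABELS, key=lambda y: delta[T - 1][y])
    path = [y_last]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```

In a real CRF the scores would be the learned linear combinations of feature functions; here the decoder only shows how the ST/IN labels are chained back into keyword spans.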

Feature engineering
A crucial part of the CRF framework is the definition of feature functions. Based on our observations, we define five kinds of features adapted to our educational data. All the features are course-agnostic, which keeps our framework flexible and scalable.

Text style features
- whether the target word is English;
- whether the two neighbor words are English;
- whether the word is the first word in a sentence;
- whether the word is the last word in a sentence; and
- whether the target word is in a quotation.
Text style features capture stylistic characteristics. Some keywords usually appear at the beginning or the end of a sentence in the instructor's language, e.g. "Network means ...".

Part-of-speech features

We treat the POS as a feature because fixed combinations of POS, e.g. adjective + noun or noun + noun, may indicate keyword phrases. We use the Stanford Log-linear POS Tagger[2] proposed by Toutanova et al. (2003) to assign a POS to each word. Note that for the corresponding feature functions, we adopt a binary value, 0 or 1, for every POS. For example, one function captures whether the target word is a noun, and so on.

Context features
- term frequency and inverted document frequency (TF-IDF) value of the target word and the two neighbor words;
- normalized uni-gram BM25 score of the target word;
- normalized bi-gram BM25 score of the target word; and
- normalized bi-gram BM25 score of the two neighbor words.
Context features capture the importance of words and word-level information within the whole document. The training set is partitioned into documents based on video clips. The statistical metric of normalized bi-gram BM25 scores proposed by Robertson et al. (2004) is used to quantify word relevance, with default parameters.
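For illustration, the sketch below computes a single term's BM25 weight in the common Okapi formulation with default parameters (k1 = 1.2, b = 0.75). The exact normalized variant the paper adopts from Robertson et al. (2004) may differ; this only shows the shape of the score.

```python
import math

def bm25_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 weight of one term in one document (common Okapi form).

    tf: term frequency in the document; df: number of documents containing
    the term; n_docs: total number of documents; doc_len / avg_doc_len:
    document length statistics used for length normalization.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
```

The score grows sub-linearly with term frequency, which is why it is more robust than raw TF as a context feature.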

Semantic features
- semantic similarity of the target word with the previous two words; and
- semantic similarity of the target word with the next two words.
Words that frequently co-occur may be keywords; likewise, words that are close in the semantic space may be keywords. So by learning word semantics, features of adjacent words can be captured. The similarity of two adjacent words in the semantic space is calculated with the corresponding word vectors trained by Word2Vec[3] proposed by Mikolov et al. (2013). All textual contents are used to learn the word embeddings. The corpus size is 145,232 words and the vector dimension is set to 100 by default.

Dictionary features
- whether the target word and the two neighbor words are in the dictionary; and
- whether the two neighbor words are in the dictionary.
As in most natural language processing tasks, a dictionary is useful. We therefore design a run-time dictionary, which is simply the set of keywords in the training data set.
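Putting the feature kinds together, a per-token feature map might look like the sketch below. The feature names and exact templates are our own illustration, since the paper describes the features only in prose.

```python
def token_features(sentence, t, pos_tags, dictionary):
    """Illustrative feature map for the word at position t.

    sentence: list of tokens; pos_tags: POS tags aligned with sentence;
    dictionary: the run-time set of keywords from the training data.
    """
    w = sentence[t]
    feats = {
        "is_english": w.isascii() and w.isalpha(),   # text style
        "is_first": t == 0,                          # sentence position
        "is_last": t == len(sentence) - 1,
        "pos=" + pos_tags[t]: True,                  # binary POS indicator
        "in_dict": w in dictionary,                  # run-time dictionary
    }
    if t > 0:
        feats["prev_in_dict"] = sentence[t - 1] in dictionary
        feats["prev_pos=" + pos_tags[t - 1]] = True
    if t < len(sentence) - 1:
        feats["next_in_dict"] = sentence[t + 1] in dictionary
    return feats
```

Context and semantic features (TF-IDF, BM25, embedding similarity) would be added to the same dictionary as real-valued entries.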

Semi-supervised learning framework
Because the effort of labeling training data is extremely expensive, we propose the semi-supervised framework. We leverage the ideas of self-training proposed by Liu et al. (2009) and k-nearest neighbors (KNN). The intuition is that if an unlabeled sample is similar to a labeled sample in the semantic space, the unlabeled sample is very likely to be successfully inferred by the model learned from all the current labeled data. The unlabeled sample is then turned into a labeled one and added to the labeled data set with model-inferred labels, and a new model can be learned. The novelty proposed here is that we use the word embeddings learned by Word2Vec to calculate the similarity between two sentences. The sentence vector is denoted as:

VecSen(s) = (1/|s|) \sum_{w ∈ s} VecWord(w)

where VecWord is the word vector. Algorithm 1 gives the details of the semi-supervised version of the training process. The time complexity of Algorithm 1 is O(NM²) + (M/c)·O(TrainCRF), where N and M are the sizes of the labeled and unlabeled sets, respectively, and c is the number of unlabeled samples selected to be inferred in each loop. The additional computing cost is rewarding, as human effort can be largely reduced, especially when N and M are not large.
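The self-training loop described above can be sketched as follows. The helper train_crf(labeled) is hypothetical, standing in for fitting a CRF and returning a model with a predict method; the similarity measure is cosine distance between averaged word vectors, as the sentence-vector definition suggests.

```python
import numpy as np

def sentence_vec(sentence, embeddings, dim=100):
    """Average of word vectors, as the paper defines the sentence vector."""
    vecs = [embeddings[w] for w in sentence if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def self_train(labeled, unlabeled, embeddings, train_crf, c=20):
    """Self-training sketch: each round moves the c unlabeled sentences
    closest (by cosine similarity) to the labeled set into it, with
    model-inferred labels, then retrains.
    """
    model = train_crf(labeled)
    while unlabeled:
        lab_vecs = [sentence_vec(s, embeddings) for s, _ in labeled]

        def closeness(s):
            v = sentence_vec(s, embeddings)
            dv = np.linalg.norm(v) + 1e-12
            return max(float(np.dot(v, u)) / (dv * (np.linalg.norm(u) + 1e-12))
                       for u in lab_vecs)

        unlabeled.sort(key=closeness, reverse=True)
        batch, unlabeled = unlabeled[:c], unlabeled[c:]
        labeled += [(s, model.predict(s)) for s in batch]  # model-inferred labels
        model = train_crf(labeled)
    return model
```

The O(NM²) term in the stated complexity corresponds to the repeated similarity computations in this loop.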

Keyword extraction on learning side
Due to the difficulty and complexity of labeling the massive data of MOOC forums, we leverage unsupervised approaches to extract keywords from the contents generated by learners. As discussion forums are a kind of social media, we build a graph to model the post-reply relationships. Then, a random walk algorithm is proposed to rank the importance of words. Finally, we regard the top-ranked words as keywords.
The intuition behind the graph model is that the more a word is replied to, the more important it is; and the more important word A is, the more importance it passes to word B when A replies to B. This is similar to algorithms for ranking Web pages, e.g. PageRank proposed by Brin and Page (1998).

Data model of massive open online course forum
To better model the importance of keywords, we design a heterogeneous network involving two kinds of entities, learners and words. In the following, we introduce the definition of the data model and then explain the intuition for designing such a network.
Definition 1. Heterogeneous network with learners and words. Given all the learners' records of a MOOC forum, the heterogeneous network is G = (V, E, W), where V = V_L ∪ V_D, and V_L = {v^L_1, ..., v^L_{n_L}} and V_D = {v^D_1, ..., v^D_{n_D}} are the sets of learners and words, respectively. E_L is the set of directed edges which denote the co-occurrence of two learners in the same thread; the learner who posts later points to the other. E_D is the set of directed and bidirectional edges which denote the co-occurrence of two words in the same thread; directed edges mean the two words belong to different posts, with the one appearing later pointing to the other, while bidirectional edges mean the two words belong to the same post. E_LD is the set of bidirectional edges which mean that a learner's contents contain the word and, in reverse, that the word appears in the learner's contents. W_L, W_D and W_LD are the sets of weight values, which count the co-occurrences of two entities on the corresponding edges. Self co-occurrence is meaningless and is consistently ignored. Figure 2 is a demo of the heterogeneous network with learners and words of a MOOC forum. We denote G_L = (V_L, E_L, W_L) as the weighted directed graph of learners, G_D = (V_D, E_D, W_D) as the weighted directed and bidirectional graph of words, and G_LD = (V_LD, E_LD, W_LD) as the weighted bipartite graph of authorship between learners and words. Denote n_L = |V_L| and n_D = |V_D| as the numbers of entities in V_L and V_D, respectively.
Such a heterogeneous network can embody the latent post-reply relationships between learners and words. In G_L and G_D, the more edges point to an entity, the more important it is; moreover, if more important entities point to a specific entity, that entity becomes more important. Similarly, in G_LD, the more edges point to a word, the more popular it is, and it is also more important if an important learner points to it. All the weight values capture the strength of the relationships. The importance transmitted between learners (in G_L) can thus be transferred to G_D. It is a process of mutual reinforcement between the two subnetworks.
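As an illustration, a simplified construction of the three weight maps from chronologically ordered posts might look as follows. For simplicity this sketch links every later post back to every earlier post in the same thread, which is our reading of the co-occurrence definition rather than the authors' exact procedure.

```python
from collections import defaultdict

def build_network(threads):
    """Build edge-weight maps for the learner/word heterogeneous network.

    threads: list of threads; each thread is a list of (learner, words)
    posts in chronological order.
    Returns (W_L, W_D, W_LD), each a {(src, dst): weight} map.
    """
    W_L = defaultdict(int)   # directed: later learner -> earlier learner
    W_D = defaultdict(int)   # word edges (both directions stored explicitly)
    W_LD = defaultdict(int)  # learner <-> word authorship
    for thread in threads:
        for i, (learner, words) in enumerate(thread):
            for w in set(words):
                W_LD[(learner, w)] += 1
            # bidirectional edges between distinct words of the same post
            for a in set(words):
                for b in set(words):
                    if a != b:
                        W_D[(a, b)] += 1
            # directed edges back to earlier posts in the thread
            for prev_learner, prev_words in thread[:i]:
                if prev_learner != learner:
                    W_L[(learner, prev_learner)] += 1
                for a in set(words):
                    for b in set(prev_words):
                        if a != b:
                            W_D[(a, b)] += 1
    return W_L, W_D, W_LD
```

Self co-occurrence is skipped (the a != b and learner checks), matching the definition above.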

Jump random walk algorithm
We design an algorithm for co-ranking learners and words, named Jump-Random-Walk (JRW), which simulates two random surfers jumping and walking between the different types of entities. Figure 3 shows the framework of the JRW algorithm. G_L is the subnetwork of learners, G_D is the subnetwork of words and G_LD is the subnetwork of authorship. β is the probability of walking along an edge within a homogeneous subnetwork. λ is the probability of jumping to the other subnetwork. λ = 0 means the two random surfers jump and walk independently within their respective homogeneous subnetworks. We assume the probabilities of jumping and walking are consistent.
Denote l ∈ R^{n_L} and d ∈ R^{n_D} as the ranking result vectors (also probability distributions) whose entries correspond to the entities of V_L and V_D, subject to ||l||_1 ≤ 1 and ||d||_1 ≤ 1 due to the existence of entities with no out-degree. Denote the four transition matrices of G_L, G_D, G_LD and G_DL as L ∈ R^{n_L × n_L}, D ∈ R^{n_D × n_D}, LD ∈ R^{n_L × n_D} and DL ∈ R^{n_D × n_L}, respectively. Adding a probability of random jumping to avoid being trapped in a small set of entities or at entities with no out-degree, the iteration functions are:

l = (1 − λ)(β L l + (1 − β) e_{n_L} / n_L) + λ LD d
d = (1 − λ)(β D d + (1 − β) e_{n_D} / n_D) + λ DL l

where the former terms to the right of the equal signs are the iterations within a homogeneous subnetwork and the latter terms cross between the two homogeneous subnetworks. λ is the probability of jumping to the other subnetwork, β is the probability of walking along an edge within a homogeneous subnetwork, and e_{n_L} ∈ R^{n_L} and e_{n_D} ∈ R^{n_D} are all-ones vectors. The four transition matrices are obtained by normalizing the corresponding weights, e.g. L_{i,j} = w^L_{i,j} / Σ_i w^L_{i,j} when Σ_i w^L_{i,j} ≠ 0, and similarly for D, LD and DL, where w^L_{i,j} is the weight of the edge from V^L_i to V^L_j, w^D_{i,j} is the weight of the edge between V^D_i and V^D_j, w^{LD}_{i,j} is the weight of the edge between V^L_i and V^D_j, and w^{DL}_{i,j} is the weight of the edge between V^D_i and V^L_j. Actually, w^{LD}_{i,j} = w^{DL}_{j,i}. When Σ_i w^L_{i,j} = 0, it means the student V^L_j always posts last in a thread. If Σ_i w^D_{i,j} = 0, it means the keyword V^D_j never has a peer in a thread, which almost never happens among our filtered words. Σ_i w^{LD}_{i,j} = 0 is likewise impossible, since every word has at least one author. The iteration repeats until |d − d'| ≤ ε, and then l and d are returned. Finally, we obtain two ranking lists, of learners and of words, but we only consider the ranking list of words within this paper.
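Our reading of the JRW updates can be sketched as a simple power iteration. The matrix orientations and the convergence test are assumptions based on the definitions above, not the authors' exact pseudocode.

```python
import numpy as np

def jrw(L, D, LD, DL, beta=0.85, lam=0.5, eps=1e-8, max_iter=1000):
    """Jump-Random-Walk co-ranking sketch.

    L (n_L x n_L) and D (n_D x n_D): transition matrices within the learner
    and word subnetworks; LD (n_L x n_D) and DL (n_D x n_L): cross-network
    transitions. Returns ranking vectors l (learners) and d (words).
    """
    n_L, n_D = L.shape[0], D.shape[0]
    l = np.full(n_L, 1.0 / n_L)
    d = np.full(n_D, 1.0 / n_D)
    for _ in range(max_iter):
        # walk within the subnetwork (with random restart), then jump across
        l_new = (1 - lam) * (beta * L @ l + (1 - beta) / n_L) + lam * LD @ d
        d_new = (1 - lam) * (beta * D @ d + (1 - beta) / n_D) + lam * DL @ l
        done = np.abs(d_new - d).sum() <= eps and np.abs(l_new - l).sum() <= eps
        l, d = l_new, d_new
        if done:
            break
    return l, d
```

With λ = 0 each update reduces to an independent PageRank-style iteration on its own subnetwork, matching the remark above.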

Experiment
Again, based on the partition of two kinds of resources, as well as an extra experiment, this section consists of three parts.

On teaching side
In this subsection, we use teaching resources, i.e. subtitles, PPTs and questions, to evaluate the supervised learning model. We introduce several baselines for comparison:
- Term frequency (TF): words are ranked by their term frequency; if a word is a keyword, the instructor may say it repeatedly in lecture.
- Bootstrapping (BT): instructors may have personal language styles, so we design a rule-based algorithm from several patterns containing keywords. This method is actually course- and instructor-dependent.
- Stanford Chinese NER (S-NER): an existing tool developed for NER, with an already-trained model[4], described by Nadeau and Sekine (2007); we only use it to infer keywords on our educational data sets.
- Terminology extraction (TermExtractor): an existing tool for terminology extraction[5]; its well-trained model is likewise only used to infer keywords on our data sets.
- Supervised keyword-CRF (SK-CRF): supervised learning based on CRFs with all the features defined above.
- Semi-supervised keyword-CRF (SSK-CRF): the semi-supervised version for keyword extraction. The parameter c, the number of candidates, is empirically set to 20.
We adopt three metrics, precision, recall and F1-value, to measure the results.
6.1.1 Results and analysis. Table IV shows the performance comparison between baselines. We use 30 per cent of the subtitle data as training data for SK-CRF and SSK-CRF, and the rest for evaluation. For SSK-CRF in particular, half of the training data are unlabeled. The statistics-based methods (TF@500 and TF@1000) are unreliable because many stopwords degrade the performance. The rule-based method (BT) is highly dependent on human experience, and its low precision means plenty of subsequent work is required to filter the outputs. On the other hand, Stanford Chinese NER and TermExtractor do not perform well, perhaps for two reasons: named entities and terminology are actually different from the keywords in our data, and the models are not learned from our data set. The semi-supervised CRF is comparable to the supervised version. Figure 4 further shows that semi-supervised learning remains comparable to the supervised version, especially when less than 20 per cent of the data are used for training; as before, half of the training data are regarded as unlabeled by SSK-CRF. Note that the amount of labeled data when using 10 per cent training data with SK-CRF is equivalent to that when using 20 per cent training data with SSK-CRF, yet SSK-CRF performs better than SK-CRF. This result means the semi-supervised framework can obtain satisfactory performance with only a handful of labeled data.

Now, we evaluate the different modeling abilities of the various kinds of MOOC textual content. As shown in Table V, the items in rows are training data sets, while those in columns are testing data sets. This table can explain some common situations in educational settings. Subtitles can cover almost all the keywords, so they are ideal as training data. Judging from the precisions, PPTs are also decent training data, but the recalls are low, perhaps because PPTs are usually in PDF format and produce incomplete sentences when converted to text.
Questions lead to even lower recalls than PPTs because not all keywords are present in questions, as shown in Table I. In summary, different kinds of MOOC textual content have different modeling abilities, so they should be considered separately.
6.1.2 Feature contribution. We analyze how the different kinds of features contribute to the model. The result is shown in Table VI. The dictionary features have a predominant influence on the final results, and the structure features are the second most important. The other features also contribute, though the differences among them are small. Even so, every kind of feature contributes to the model positively.

On learning side
After building a heterogeneous network for each course, Table VII shows the parameters of the network per course.
The importance of the keywords ranked at the top is hard to evaluate. Table VIII lists the top ten high-frequency words and the top ten keywords ranked by JRW, respectively. We can see the two kinds are highly overlapped, but the orders differ slightly. The bold keywords are related to course content, and the italic ones are mainly about the course quizzes, assignments, videos and other course stuff. Table IX shows the statistics of the top three "important posts", i.e. the posts that contain the top 20 keywords; the more keyword occurrences they contain, the higher they rank. From Table IX, we first find that the content lengths are mostly long, which is obvious from our definition of "important posts". From the dimension of votes, we cannot draw insight from the numbers. Author rank means the position of the post author in the ranking list of important learners, and we find these authors are truly the "important learners" of each course. Also, judging from the position in thread, the important posts are mostly at the top of a thread, which means the initial authors in a thread are inclined to express important information. In addition, the lengths of threads, i.e. # Posts in Thread, are significantly correlated with the important posts. These are the empirical observations we can summarize from the forum data.

Extra experiment
Considering the available data at hand, although we do not have labels for the forum data, we can learn a classifier from the labeled teaching resources and conduct the task of identifying the need for concept comprehension in forum contents. This task can be regarded as a binary classification of forum threads, i.e. identifying whether a thread is about concept comprehension. If a question contains keywords of the course, it is likely to be asking for the explanation of some concepts.
The result is post-evaluated, which means that a thread is marked as "1" in either of two situations: no concept is identified and the thread is indeed not about the need of concept comprehension; or at least one concept is identified and the definitions of the identified concepts can answer the question.
All other situations are marked as "0". We use 30 per cent of the subtitles to learn a classifier by the semi-supervised method. Only the thread title and the initial post are involved in this experiment, instead of all the posts. Table X exhibits the result. The accuracy is acceptable, and the relatively high recall is meaningful because it can accurately remind instructors which threads to intervene in. Moreover, this method can identify not only whether a thread is about concept comprehension but also which concept needs to be explained.
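The core of this identification step can be sketched as follows, assuming the set of course keywords has already been extracted; the keyword set and the sample thread are hypothetical, and the learned classifier is replaced here by simple keyword matching on the title and initial post.

```python
# Minimal sketch of the thread-identification idea: a thread is flagged as a
# "concept comprehension" question if its title or initial post mentions any
# extracted course keyword. Keywords and threads below are illustrative only.

COURSE_KEYWORDS = {"pagerank", "nash equilibrium", "structural balance"}

def identify_concepts(title: str, first_post: str, keywords=COURSE_KEYWORDS):
    """Return the course concepts mentioned in a thread's title + initial post."""
    text = (title + " " + first_post).lower()
    return sorted(k for k in keywords if k in text)

def is_concept_thread(title: str, first_post: str) -> bool:
    """Binary classification: does the thread ask about a course concept?"""
    return len(identify_concepts(title, first_post)) > 0

thread = ("Question about PageRank",
          "Why does the PageRank vector always converge? I don't get it.")
print(identify_concepts(*thread))  # which concept needs explaining
print(is_concept_thread(*thread))
```

The same two outputs mirror the two abilities noted above: the binary decision tells instructors which threads to intervene in, and the concept list tells them what to explain.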

Applications with keywords for massive open online courses
After keywords are extracted from the teaching resources of a course, we exhibit two intelligent applications of the keywords in MOOC settings: generation of a concept map and generation of a learning path. We conduct the applications on the course People and Network. It can be observed that the more frequently a concept appears, the more fundamental it is. For example, the top ten concepts, Node, Network, Reward, Probability, Graph, Game, Edge, Tactic, Hypothesis and Price, are all fundamental knowledge points of the course. So the metric of TF can capture the feature of fundamentality. The formal definition is:

TF_i = Σ_k f_ki

where f_ki is the number of times the ith concept appears in the kth document; a document corresponds to a video clip in MOOCs. On the other hand, low-frequency concepts are often the important knowledge points, so TF-IDF is ideal to measure the importance of a concept:

TF-IDF_i = TF_i × log(N / n_i)

where N is the number of video clips and n_i is the number of video clips in which the ith concept appears. For example, the top important concepts include PageRank and SignalSequence.
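The two metrics above can be computed directly from per-clip concept counts. The sketch below follows the paper's definitions (f_ki, N and n_i); the toy "documents" stand in for the concept occurrences of real video clips.

```python
import math
from collections import Counter

# Sketch of the two concept metrics, following the paper's definitions:
# f_ki = occurrences of concept i in document (video clip) k,
# N = number of video clips, n_i = number of clips containing concept i.
# The toy documents below stand in for per-clip concept lists.

docs = [
    ["node", "network", "node", "edge"],
    ["network", "graph", "pagerank"],
    ["node", "game", "network"],
]

N = len(docs)
tf = Counter()   # TF_i = sum over clips of f_ki (fundamentality)
df = Counter()   # n_i = number of clips containing concept i
for doc in docs:
    counts = Counter(doc)
    tf.update(counts)
    df.update(counts.keys())

# TF-IDF_i = TF_i * log(N / n_i): demotes concepts that appear everywhere.
tfidf = {c: tf[c] * math.log(N / df[c]) for c in tf}

print(tf["node"])        # frequent concept -> high TF (fundamental)
print(tfidf["network"])  # appears in every clip -> TF-IDF of 0
```

Note how "network", which occurs in every clip, gets a TF-IDF of zero: exactly the behavior that lets TF-IDF surface rarer but important concepts such as PageRank.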
Because the word embeddings learned by Word2Vec have the property that semantically similar words are close in the embedding space, we use embedding similarity as the weight of the edge between two concepts. For example, the concepts most semantically similar to Network are: NetworkAnalysis, SocialNetwork, ResidualNetwork, ComplexNetwork, NetworkSwitch, ComplexNetworkAnalysis, SocialNetwork, TrafficNetwork, SocialNetworkAnalysis and NetworkSwitchExperiment. Figure 5 shows demos of the SCM of the course People and Network for fundamentality and importance, respectively. We find the map can visually reveal the degree of semantic relationship between concepts, which helps learners build a "concept map" in their minds and remember concepts more easily. We use the tool Gephi to draw the maps.
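Building the weighted edges of the SCM from embeddings can be sketched as below. The random vectors stand in for the Word2Vec embeddings of real course concepts (an assumption for self-containedness), and cosine similarity is used as the edge weight.

```python
import numpy as np

# Sketch of building SCM edges from word embeddings: the weight of the edge
# between two concepts is the cosine similarity of their vectors. The random
# toy vectors below stand in for trained Word2Vec embeddings.

rng = np.random.default_rng(0)
concepts = ["Network", "SocialNetwork", "PageRank", "Game"]
vecs = {c: rng.normal(size=50) for c in concepts}

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Weighted edge list for the concept map (each unordered pair once).
edges = [(a, b, cosine(vecs[a], vecs[b]))
         for i, a in enumerate(concepts)
         for b in concepts[i + 1:]]

# Neighbors of "Network", most semantically similar first.
nbrs = sorted(((b, w) for a, b, w in edges if a == "Network"),
              key=lambda t: -t[1])
print(nbrs[0][0])
```

An edge list of this form (source, target, weight) is also exactly what Gephi accepts for drawing the map.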

Learning path
Based on the SCM, learners can also learn the course at their own pace. Here, we propose an algorithm (Algorithm 3) to generate a primary learning path according to the definition of the SCM. The learning path can then be revised by both instructors and learners as required.
The basic idea of the algorithm is simple. Each time, the current concept is taken and a candidate set of its k most semantically similar neighbors is selected. Among the candidate set, the TF or TF-IDF of each concept is calculated, and the top concept is selected as the next node in the path and becomes the new current concept. The algorithm can start from any concept. Note that the concepts selected into the candidate set should appear later than the current concept along the course, because learners may be confused by a path that does not conform to the instructor's design.
By taking the concept Node as the starting point and setting k = 10, the first ten concepts in the learning path with the metric of TF are: Node → Edge → Element → Set → Alternative → Vote → MajorityVoting → MajorityVotingRule → IndividualRanking → GroupRanking.
By taking the concept with the highest TF-IDF as the starting point, the first ten concepts are: PageRank → PageRankAlgorithm → SmallWorld → Balance → NashBalance → StructuralBalance → EquilibriumTheorem → MixedStrategyEquilibrium → NashBargaining → NashBargainingSolution. We can see these concepts are all important along the course.

Algorithm 3 Generation of learning path
INPUT: SCM = {C, R}, starting concept c_i, number of candidates k
OUTPUT: learning path p_i = {n_1, n_2, ..., n_|C|}
1: j = 1
2: n_j = c_i
3: p_i = {n_j}
4: C' = C − {n_j}
5: repeat
6:   T = the k concepts in C' that are most semantically similar to n_j and appear later than it
7:   j = j + 1
8:   n_j = the concept selected by some metric (TF or TF-IDF) in T
9:   p_i = p_i ∪ {n_j}
10:  C' = C' − {n_j}
11: until C' = ∅
12: return p_i

[Figure 5: Two kinds of SCM based on different concept metrics.]

Admittedly, the two demo learning paths are very primitive. They cannot yet support personalized or adaptive learning. However, by analyzing learners' behavior and homework logs, the learning paths can be made more intelligent. We leave this for future work.
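Algorithm 3 can be sketched in a few lines, under simplifying assumptions: `similarity` stands in for Word2Vec cosine similarity, `order` gives each concept's first-appearance position in the course, and `metric` maps a concept to its TF or TF-IDF score. Unlike the pseudocode, this sketch stops early if no later-appearing concept remains; all data are toy values.

```python
# Sketch of Algorithm 3 (generation of learning path). `similarity`, `order`
# and `metric` are hypothetical stand-ins for the paper's Word2Vec similarity,
# course ordering and TF/TF-IDF scores; the concept data below are toy values.

def learning_path(concepts, start, k, similarity, order, metric):
    path = [start]
    remaining = set(concepts) - {start}
    while remaining:
        cur = path[-1]
        # Candidates must appear later in the course than the current concept.
        later = [c for c in remaining if order[c] > order[cur]]
        if not later:                  # no valid successor: stop early
            break
        # The k most semantically similar later-appearing concepts.
        cand = sorted(later, key=lambda c: -similarity(cur, c))[:k]
        nxt = max(cand, key=metric)    # pick the top candidate by TF/TF-IDF
        path.append(nxt)
        remaining.discard(nxt)
    return path

order = {"Node": 0, "Edge": 1, "Graph": 2, "PageRank": 3}
tf = {"Node": 9, "Edge": 7, "Graph": 5, "PageRank": 2}
sim = lambda a, b: 1.0 / (1 + abs(order[a] - order[b]))  # toy similarity
print(learning_path(list(order), "Node", k=2,
                    similarity=sim, order=order, metric=tf.get))
# -> ['Node', 'Edge', 'Graph', 'PageRank']
```

Swapping `tf.get` for a TF-IDF lookup yields the importance-oriented path, exactly as in the two demo paths above.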

Conclusion
Along with the development of MOOCs, massive online educational resources are unprecedentedly produced by the crowd. Instructors can provide videos, subtitles, lecture notes, questions, etc., while learners can generate forum content, Wiki, logs of homework, etc. How to turn these data from unstructured to structured is a challenging problem. In this paper, we explore the task of keyword extraction on MOOC resources. Keyword extraction can benefit many subsequent applications. First, it is a kind of annotation for MOOC resources. The annotation can be used for studying machine learning methods for MOOC-related natural language processing tasks, such as information extraction, information retrieval and question answering. Second, keyword extraction can pick out domain-specific or cross-domain knowledge points from complex text. This result can be further processed to build a knowledge graph or concept map. With the graph (or the map), instructors can better organize the course, and learners can plan their own learning paths more easily. Then, by collecting feedback from learners, the whole teaching and learning process can become a virtuous cycle. Finally, crowd intelligence can lead to intelligent education.
Back to the task of this paper, we are faced with two challenges: MOOCs are cross-domain, and labeling training data is extremely expensive. So we propose a flexible framework based on semi-supervised machine learning with domain-agnostic features. Experiments demonstrate the efficacy of our framework: very little labeled data suffices to achieve decent performance. We find that various kinds of MOOC content, e.g. subtitles and PPTs, have different modeling ability for keyword extraction, so they should be treated separately in future work. Our framework can also be applied to the task of concept identification on MOOC forum content. Moreover, an unsupervised method based on a graph model is proposed by modeling MOOC forums as a heterogeneous network. Although the top keywords in MOOC forums are not the same as the keywords extracted from teaching resources, they can indicate the topics that concern learners in the forums. At the least, instructors can get feedback from this information.
In the future, methods of transfer learning and deep learning may be better suited to extracting cross-domain keywords. External knowledge resources, e.g. Wikipedia, may also be helpful. The relationships between keywords deserve more attention for building a domain-specific or even cross-domain concept map.