Abstract
Course sequencing plays a major role in Intelligent Tutoring Systems because it determines the learning path of the student. However, this order is difficult to define during the early stages, when there has been no interaction with the student. The objective of this study is to determine the sequence of learning concepts from an ontology and Wikipedia information. We applied a text mining algorithm to Wikipedia to determine the course sequencing. The knowledge base is formed by the concepts and relationships of an ontology, together with the Wikipedia articles on the same concepts. To evaluate the accuracy of the algorithm, we compared its output against the judgment of domain experts. The Pearson test yielded a correlation of 0.664 between the algorithm and the experts, with a confidence level higher than 99%. This method can define the learning sequence when there is no evidence of student knowledge; the sequence can later be modified according to the interaction of the student.
Introduction
An Intelligent Tutoring System (ITS) is an effective tool for teaching knowledge domains with significant gains in learning [4]. However, to be considered intelligent, an ITS should behave similarly to a human tutor and offer adapted, reactive, flexible, and personalized teaching [3]. Several architectures for modeling an ITS exist in the literature; a general description of an ITS architecture refers to four modules [6, 27]: the tutor module, the domain module, the student module, and the interface module. This work is based mainly on the student module and secondarily on the domain module.
The domain module represents knowledge about the domain in a form that the computer must be able to understand [5]. Different techniques are used to achieve this representation, such as semantic networks, Bayesian networks, and fuzzy cognitive maps, among others [26]. Ontologies, a type of semantic network, have been widely used in different areas in recent years, motivated by the growing interest in the semantic web [12].
Ontologies make it possible to define basic concepts and their relations, representing a knowledge domain in a standard, available, and manageable format [30]. The study of ontologies is an area of research with important advances [15, 29].
The student module is the basis for the personalization of an ITS. This module handles processes that are responsible for representing student cognitive issues such as analyzing student performance, detecting difficult concepts, representing student goals and plans, identifying acquired knowledge, and describing personality characteristics, among other tasks [7, 17].
An important goal in the student model is to determine the order in which concepts will be taught; this is called course sequencing [11, 28]. Course sequencing in an ITS helps to personalize the course; this personalization can be achieved by guiding students through activities appropriate to their needs [17]. To achieve this adaptation, students must show evidence of their knowledge so that the topics they need to learn can be established. A problem arises during the initial stages of student interaction with the ITS, when there is not enough information for the adaptation. Although a basic learning path can be established for all students, one goal of this study is to automate the process of building courses in ITSs, without the intervention of experts who define a general path.
To achieve course sequencing, we take advantage of ontologies together with Wikipedia as a knowledge source. Wikipedia is a free online encyclopedia written collaboratively [9]; it has a sufficient number of articles on different topics. Although the validity of the Wikipedia information has been questioned, studies [18] have shown a level of accuracy close to that of the Encyclopaedia Britannica. Also, an extensive scientific community has analyzed the Wikipedia corpus of information and its structure since its emergence in 2001 [2].
On the other hand, authors have shown the importance of Wikipedia in the educational field [9]; for instance, it is used for learning in pairs, collaborative learning, and sharing information, among others [13, 19]. Given these virtues, it is common for teachers to consult Wikipedia to organize, integrate, and enrich their courses [18].
This work determines the course sequencing to be taught through an ITS when there is no evidence of student interaction. To completely automate the process, experts are not taken into account. The study starts from an ontology that models a domain of knowledge for teaching in an ITS; we use Wikipedia text mining to ascertain the order in which concepts should be taught to the student.
The purpose of this research is to answer the following question: Can we combine an ontology and Wikipedia information to obtain a generic course sequencing? If so, to what extent can we do it? Otherwise, why can we not do it?
The article is structured as follows. Section 2 describes related work. Section 3 shows the mathematical description of the algorithm used to find the sequencing automatically. Section 4 describes the ontology that served as the basis for the automation process. Section 5 shows the experiment, results, and the discussion. Finally, the article presents the conclusions and references.
Related work
This section describes work related to course sequencing and the approach of the proposal.
Vuogh et al. [31] proposed to analyze the data generated by an ITS to establish the relation between units of knowledge and determine the pre-requisites in a curriculum. The goal of this article was to identify an alternative order of teaching through the detection of topics that do not significantly influence student knowledge. This approach has the disadvantage that it would only work on courses implemented on an ITS; a new course would involve generating all the supporting information to find different ways of teaching.
Limongelli et al. [18] showed a system that retrieves Wikipedia learning material and gives the student an order based on the links embedded in pages. This system follows a construction process based on the Grasha teaching style and a social didactic approach. The authors report evidence only about the usability of the system, not about its accuracy. Their sequencing method differs from ours, which focuses primarily on the text mining of articles.
Gasparetti et al. [11] reported an approach to help teachers define the prerequisite relations of text-based learning objects; the approach provides teachers with a valid list of learning object sequences. Semantic analysis techniques identify relevant Wikipedia concepts in each learning object; then, taking advantage of the knowledge in Wikipedia, the approach identifies constraints between learning objects. This approach is very similar to ours, with the difference that we start the knowledge base from an ontology and exploit different characteristics of Wikipedia. Their results are not as good as they expected; however, work is in progress to obtain better results.
The article by Muhammad [22] presents a literature review of the current state of the adaptation of learning paths in online learning systems; however, the article does not go into depth on any particular approach. It briefly explains some techniques used to solve this problem as well as challenges in this area.
Proposed approach
This study considers that a course sequencing can be derived from the structured knowledge of Wikipedia to automate teaching processes, taking an ontology as the starting point of the domain. The ontology gives a basis of relations forming a hierarchy, similar to a program of study.
Wikipedia articles are related to concepts that represent other articles; thus, knowledge is formed through relations. If explaining the concept x relies on the concept y, then a relation of dependence exists between both concepts. Measuring the dependence of the concept x on the concept y indicates the importance of y for understanding x. If we have a quantitative way of measuring this relation, we can establish an order of concepts. This proposal combines several factors to measure the relation between concepts: the frequency of terms, the coincidence of concepts between articles, and the number of hops between pages.
The following section presents the algorithm to obtain a course sequencing. We based the proposal on the structure of an ontology, the knowledge source of Wikipedia, and the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm; the latter is widely used in information retrieval and text mining [2, 34].
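As background for the TF-IDF weighting mentioned above, the following minimal sketch computes one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency). The function name tf_idf and the toy corpus are illustrative assumptions; the sketch does not reproduce the equations of this article.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Return the TF-IDF weight of `term` in `document` (lists of tokens)."""
    counts = Counter(document)
    # Term frequency: share of the document's tokens equal to the term.
    tf = counts[term] / len(document) if document else 0.0
    # Document frequency: number of documents that contain the term.
    df = sum(1 for doc in corpus if term in doc)
    # Inverse document frequency; zero when the term never appears.
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


# Hypothetical tokenized documents, not taken from Wikipedia.
corpus = [
    ["object", "class", "method", "class"],
    ["class", "inheritance"],
    ["method", "parameter", "return"],
]
weight = tf_idf("inheritance", corpus[1], corpus)
```

A term that appears in few documents but often within one of them receives a high weight, which is the intuition behind using term frequencies to relate concepts.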
Main equation
Figure 1 shows variables representing the problem and its relations. The root class of the ontology is represented by the variable r (root). The variable h (child) represents a class that is composed of other classes. Finally, the variable p (parent) describes classes that form another class.

Fig. 1. Relations between variables.
The set P represents the set of parents that influence a child (P = {p_1, p_2, ..., p_n}). Thus, the variable p_i represents an element of the set P (p_i ∈ P). The set X includes the set P, the root of the ontology (r), and a child (h) (X = {r, h} ∪ P). An element of X is denoted x_y (x_y ∈ X).
Elements of the set X are related in the following way (see Fig. 1): (1) The root element is influenced by the child element (r, h). (2) The child element is influenced by the element p_i (h, p_i).
The objective is to find the quantitative relation between the child (h) and the parents that influence the child (p_i). Each element of X corresponds to a Wikipedia page that describes a concept. The element x_y can be referred to by the concept that the page represents (its title) or by its content (the text of the article, including links, references, and others). We denote the concept as ct(x_y) and the content as cd(x_y).
The algorithm calculates a value that represents the parent p_i through its frequency with Equation 1.
Where the function f(ct(x_y), cp(p_i)) is the frequency search of the term p_i in the domain x_y, and the variable m is the number of elements of the set X.
The previous process is repeated for each element of X, and the results are summed. Because P is a subset of X, ct(x_y) must be different from ct(p_i).
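The summation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the frequency search is taken to be a case-insensitive count of whole-phrase matches, the article texts are invented, and the names term_frequency and parent_frequency are hypothetical. Any additional weighting applied by Equation 1 is not reproduced here.

```python
import re


def term_frequency(term, content):
    """Count case-insensitive whole-phrase occurrences of `term` in `content`."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return len(re.findall(pattern, content.lower()))


def parent_frequency(parent, pages):
    """Sum the occurrences of the parent's title over every other page in X."""
    return sum(
        term_frequency(parent, text)
        for title, text in pages.items()
        if title != parent  # the parent's own article is skipped
    )


# Hypothetical article texts standing in for Wikipedia content.
pages = {
    "object orientation": "Encapsulation and abstraction are central. Abstraction hides detail.",
    "encapsulation": "Encapsulation supports abstraction by hiding state.",
    "abstraction": "Abstraction is the main idea behind abstraction layers.",
}
t_abstraction = parent_frequency("abstraction", pages)
```

Skipping the parent's own page mirrors the condition that ct(x_y) must differ from ct(p_i).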
Then, we combine the factors j and a with Equation 2.
Where j(cd(h), cd(p_i)) is the adjustment factor of concept coincidence and a(cd(p_i), ct(h)) is the frequency factor of the parent.
Equation 3 uses the values returned by Equations 1 and 2 to obtain a non-normalized value of the relation between two terms, in this case, the parent-child relation (p_i, h).
Where s(ct(h), ct(p_i)) is the hops adjustment factor.
Equation 3 weights the value of v(p_i) by t. The factor s penalizes t; thus, t is reduced by s or, at least, remains unchanged. To increase v(p_i) after the penalty, 1 is added to the value of t.
To obtain a normalized value, each value of the parent-child relation (p_i, h) is needed; Equation 4 is then applied.
Next, we describe the factors considered in the equations.
The adjustment factor of concept coincidence, j, establishes a quantitative relation between two sets. The set c1 represents the words in the article content of h (cd(h)) and the set c2 represents the words in the article content of p_i (cd(p_i)). The quantitative relation is calculated by the Jaccard coefficient [14], represented in Equation 5.
Equation 5 represents the size of the intersection of both sets divided by the size of the union of both sets. This result is a normalized value and is assigned to j (j = sim(c1, c2)).
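The Jaccard coefficient described above can be sketched in a few lines. The tokenization by whitespace and the two sample sentences are assumptions made only for illustration; the article does not specify how article content is split into words.

```python
def jaccard(c1, c2):
    """Return |c1 ∩ c2| / |c1 ∪ c2| for two collections of words."""
    a, b = set(c1), set(c2)
    union = a | b
    if not union:
        # Two empty sets: define the similarity as 0 to avoid dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical article contents, split on whitespace.
words_h = "a class defines the methods of an object".split()
words_p = "a method belongs to a class".split()
j = jaccard(words_h, words_p)
```

Here the two word sets share 2 of 11 distinct words, so j = 2/11. The value always lies between 0 (no shared words) and 1 (identical sets).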
The frequency factor of the parent, a, considers the frequency of terms in reverse. That is, we previously handled the frequency as f(ct(x_y), cp(p_i)), where the frequency of the term p_i in the domain x_y was determined. Now, the function is used as f(ct(p_i), cp(h)), where the frequency of the term h in the domain p_i is determined. Thus, we apply Equation 6:
Where h is the child element of the set X. The normalized value v(p_i) corresponds to a in the main equation (a = v(p_i)).
The hops adjustment factor, s, refers to the number of hops between two concepts, that is, the number of links that the user must follow to go from one Wikipedia page to another. Here, c1 is represented by cp(h) and c2 is represented by cp(p_i). We compute the relation through Equation 7:
This equation calculates support values through equations:
Finally, the number of hops from concept c1 to concept c2 is represented by d(c1, c2). The result of Equation 7 is assigned to s (s = v(c1, c2)). This factor works in reverse to the other factors because it represents a penalty, where 0 represents no penalty and 1 represents the maximum penalty.
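One way to obtain the hop count d(c1, c2) is a breadth-first search over the page link graph. The sketch below assumes the links are available as a plain dictionary from a page title to the titles it links to; the graph, the function name hops, and the page titles are hypothetical. The mapping from d to the penalty s in Equation 7 is not reproduced.

```python
from collections import deque


def hops(links, source, target):
    """Return the fewest links to follow from `source` to `target`, or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        page, dist = queue.popleft()
        for nxt in links.get(page, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the target cannot be reached by following links


# Hypothetical outgoing links between pages.
links = {
    "Object-oriented programming": ["Class", "Encapsulation"],
    "Class": ["Method", "Inheritance"],
    "Method": ["Parameter"],
}
d = hops(links, "Object-oriented programming", "Parameter")
```

Breadth-first search visits pages in order of distance, so the first time the target is found the path is the shortest one. Links are directed, so the distance from c1 to c2 may differ from the distance from c2 to c1.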
The ontology was taken from [25]; this structure was developed through Methontology [8], following the phases of specification, conceptualization, formalization, implementation, and evaluation. Methontology is one of the most understandable methodologies to build ontologies [10]. It has been applied in several articles focused on the creation of ontologies for different domains [15, 29].
The chosen ontology models Object Orientation (OO), which is a field in computer science. OO is a programming paradigm organized around objects and data; it is an important topic in programming-related careers. An ontology is a representation of knowledge that can represent a domain. A computer can analyze it for decision making. The ontology represents the knowledge that a student must have. In this way, algorithms can make an inference to establish what to teach to the student.
Classes and the hierarchy of the ontology were obtained through the analysis of a course on Object Oriented Programming (OOP), with the support of a group of teachers with experience in the area. The central aspect in developing the ontology was the organization of knowledge regarding the structure of the course, considering which concepts are necessary to learn other concepts. Ontology construction begins by disaggregating the more general concepts of OO (abstraction, encapsulation, hierarchy, and modularity) [16, 32]. Then, the secondary concepts of OO (concurrency, persistence, and typing) were added to complete the ontology.
The ontology is presented in Fig. 2; the structure shows classes and their semantic relations. The central concept is object_orientation, and all classes converge in this concept. Different types of arrows represent the type of relation, such as part of, is a, type of, and derived from.

Fig. 2. Semantic relations of the Object Orientation.
For the purposes of this study, the ontology was reduced to the version in Fig. 3, considering only those classes that are represented by an article in Wikipedia, because the mining requires developed content for each concept.

Fig. 3. OOP trimmed ontology.
In the experiment, experts first determined the order of concepts based on the ontology; then, the proposed algorithm generated another order of concepts with the same ontology. Finally, we compared the results to obtain the degree of correlation.
Course sequencing by experts
The goal of this section is to establish the importance of a concept (specific) to learn another concept (general). The instrument is intended for teachers who teach the course of Object Oriented Programming (OOP). The ontology (Fig. 2) represents a section of the knowledge of that course. The type of sampling used is non-probabilistic because we do not require the representativeness of the population. The sample considered professors who teach the course in universities in Mexico.
The survey determines the importance of a concept to understand another. For example: To what extent do you believe that the concept abstraction is important for understanding the concept object_orientation? (Relation abstraction - object_orientation).
The survey has one question for each ontology relation. Each question is evaluated on a scale from one to seven, where one means minor importance and seven means very important.
The process consisted of the following steps: We made a list of universities where the OOP course is taught to identify teachers who are candidates to answer the survey. The survey was made based on the ontology, its concepts, and relations. Google Forms was used to develop the instrument. Teachers answered the questionnaire online without a time limit. Finally, we analyzed and interpreted the data.
The experiment considered changes to ontology relations to reduce the number of questions or to word them more clearly. For instance, the relation method - protocol was replaced by method - method_modifiers, method - parameters, and method - return.
Course sequencing by the algorithm
The relations obtained from the ontology in Fig. 3 were the input of the algorithm described in Section 3. The algorithm evaluated the same relations as the experts to determine the course sequencing.
The proposed method was implemented in the Java programming language with the support of the JWPL (Java Wikipedia Library) programming interface. JWPL provides structured access to all Wikipedia information in different languages; this includes a MediaWiki tag analyzer for in-depth page content analysis. JWPL offers methods for accessing properties such as links, templates, categories, text, and other properties [33].
Results
The obtained data are presented in Table 1. Each relation has an identifier (Id), and the relations are shown in the column of the same name. Each relation has information from experts and the algorithm. The first group has the non-normalized values from one to seven (value), the normalized values from zero to one (norm.), and the order given by experts (order). Values in the first group were normalized to be compared to the second group. The second group has the normalized values and the order estimated by the algorithm. The algorithm obtains the normalized values directly.
Table 1. Results of experts and the algorithm.
The normalized values were constructed from the non-normalized values through Formula 8, where c_i is a non-normalized element of the set C (C = {c_1, ..., c_m}). C refers to the concepts that compose another concept, where all elements of C compose the same concept. The variable P refers to the main concept. For example, P = object_orientation and C = {abstraction, concurrency, encapsulation, hierarchy, modularity, persistence}.
The order of the result was generated considering the normalized values. We grouped concepts according to the direct relations in the ontology. For example, the first concept, object_orientation, has six concepts derived directly (abstraction, concurrency, encapsulation, hierarchy, modularity, and persistence). We consider one order for these six concepts; other groupings of concepts have their own order (see Fig. 3).
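The grouping and ordering step can be sketched as follows. The sketch assumes that a higher normalized value means an earlier position in the order, which the text does not state explicitly; the function name order_within_groups and the values are hypothetical and are not the data of Table 1.

```python
from collections import defaultdict


def order_within_groups(values):
    """Map each (parent, child) relation to its 1-based rank inside its parent group."""
    groups = defaultdict(list)
    for (parent, child), value in values.items():
        groups[parent].append((child, value))
    order = {}
    for parent, members in groups.items():
        # Assumption: higher normalized value -> earlier position.
        ranked = sorted(members, key=lambda m: m[1], reverse=True)
        for position, (child, _) in enumerate(ranked, start=1):
            order[(parent, child)] = position
    return order


# Hypothetical normalized values, not the ones in Table 1.
values = {
    ("object_orientation", "abstraction"): 0.30,
    ("object_orientation", "encapsulation"): 0.25,
    ("object_orientation", "persistence"): 0.10,
    ("method", "parameters"): 0.60,
    ("method", "return"): 0.40,
}
order = order_within_groups(values)
```

Each parent concept gets its own ranking, so the position of a concept is only comparable with the positions of its siblings.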
Figure 4 graphs the results of experts and the algorithm. The x axis represents the relations in Table 1; each Id represents a relation. The y axis represents the order proposed by experts and the algorithm. The order of experts is drawn with a solid line, while the order of the algorithm is drawn with a dotted line. Ideally, the two lines would overlap at all points.

Fig. 4. Results of experts and the algorithm.
The hypothesis test considers two variables: algorithm and experts. The variable algorithm represents the quantitative relation degree between classes generated through the proposed algorithm. The variable experts represents the quantitative relation degree obtained through the group of teachers.
This study considered the Pearson and Spearman tests to measure the correlation of the data. These tests show whether the variables are related and to what extent. Two other tests, Levene and Kolmogorov-Smirnov, are needed to decide whether the Pearson or the Spearman test is used. The Levene test checks for homoscedasticity, the situation in which the variance of the error term is the same across all the values of the independent variables. The Kolmogorov-Smirnov test checks whether the data follow a normal distribution. If the data pass both the Levene and Kolmogorov-Smirnov tests, we use the Pearson test; otherwise, we use the Spearman test.
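The decision rule above can be sketched with pure-Python versions of both coefficients. In this sketch the outcomes of the Kolmogorov-Smirnov and Levene tests are passed in as boolean flags (is_normal, is_homoscedastic) rather than computed, significance levels are not calculated, and the sample data and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[ordered[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def correlation(x, y, is_normal, is_homoscedastic):
    """Use Pearson only when both assumption checks passed, else Spearman."""
    if is_normal and is_homoscedastic:
        return "pearson", pearson(x, y)
    return "spearman", spearman(x, y)


# Hypothetical normalized values, not the data of this study.
experts = [0.9, 0.7, 0.4, 0.2]
algorithm = [0.8, 0.5, 0.6, 0.1]
name, r = correlation(experts, algorithm, is_normal=True, is_homoscedastic=True)
```

In practice, a statistics package would supply the assumption tests and the p-value; the sketch only shows how the choice between the two coefficients follows from those checks.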
The algorithm and experts variables passed the Levene and Kolmogorov-Smirnov tests with a confidence level of 95%; therefore, we use the Pearson test. The hypotheses are: H0: the variables algorithm and experts are not correlated. H1: the variables algorithm and experts are correlated.
The Pearson test showed a correlation between both variables of 0.664, and a p-value below the significance level (0.001 < 0.05). Therefore, we reject H0 and accept H1: the variables algorithm and experts are correlated.
Discussion
Figure 4 compares the results of experts and the algorithm. The algorithm matched the position given by experts in 13 of the 23 relations (56%), was off by one position in five relations (22%), and was off by two or more positions in the rest (22%). It is difficult to match all values accurately; even the decisions of experts are subjective, and experts often do not agree in their opinions.
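The position comparison reported above can be tallied with a short routine. The orders below are invented for six relations and are not the data of Table 1; the function name position_agreement and the category labels are hypothetical.

```python
from collections import Counter


def position_agreement(expert_order, algorithm_order):
    """Tally absolute rank differences into exact, off-by-one, and larger misses."""
    tally = Counter()
    for e, a in zip(expert_order, algorithm_order):
        diff = abs(e - a)
        if diff == 0:
            key = "exact"
        elif diff == 1:
            key = "off_by_one"
        else:
            key = "off_by_two_or_more"
        tally[key] += 1
    total = len(expert_order)
    # Each entry holds the count and its share of all relations.
    return {
        k: (tally[k], tally[k] / total)
        for k in ("exact", "off_by_one", "off_by_two_or_more")
    }


# Hypothetical orders for six relations, not the data in Table 1.
experts = [1, 2, 3, 4, 5, 6]
algorithm = [1, 3, 2, 4, 6, 3]
summary = position_agreement(experts, algorithm)
```

Applied to the 23 relations of the study, the same tally would yield the three shares discussed in this section.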
The Pearson test obtained a positive correlation of 0.664, which represents a good correlation in the range [-1, 1], with a confidence level higher than 99%. Although we have a good correlation and a high confidence level, some aspects of the algorithm can still be adjusted to increase the correlation. A correlation above 0.8 would be a desirable target, considering the subjectivity that appears in this topic.
We can justify our method by the implicit relations existing in Wikipedia. Concepts of articles form relations to other concepts that represent a hierarchical organization similar to the study programs that describe a sequence of contents. This knowledge is generated and evaluated by people with experience in the domain who reached a consensus on the content of articles; therefore, we established a degree of relation between what an expert thinks and what others write.
A Wikipedia article contains concepts linked to a main concept; however, several factors make one concept more relevant than others. The proposal was that, by analyzing frequencies of concepts, measuring hops from page to page, and employing a similarity equation, the most important concepts for learning others can be identified.
Conclusions
This work proposed a method to build a course sequencing for an ITS considering the knowledge represented in ontologies and Wikipedia. The proposal considered the structured knowledge of an ontology to obtain concepts and relations. We proposed a text mining algorithm that works on the structure and content of Wikipedia to determine the order of contents.
The learning sequence can be established from the beginning of the course, when the system does not have information on the student to establish the learning path. In later phases, when the student interacts with the system, topics are reorganized according to the needs of the user, as shown in other models that adapt the teaching to the student [24].
Regarding the research question, we can say that it is possible to establish a course sequencing for students considering the relations of an ontology and the mining of Wikipedia information. We found that the sequencing obtained by the proposed algorithm and the order suggested by experts are correlated, with a Pearson coefficient of 0.664. Although our algorithm has an acceptable level of correlation and a confidence level greater than 99%, the correlation must be increased to approximate the values of experts more closely.
The vision of our work is to automate the process of constructing ITSs, making the process independent of the expert; this would help to reduce costs and time in the construction of ITSs.
The proposal opens lines of future work such as improving the accuracy of the order generated by the algorithm and obtaining concepts and relations from the Wikipedia structure without relying on an ontology.
