Abstract
In this paper, an analysis based on similarity metrics was carried out in order to detect the main concepts related to the superclasses of a pedagogical domain ontology. A corpus of articles in Spanish was built semi-automatically. Afterward, the corpus was lemmatized and three representations were extracted. Four textual similarity metrics based on terms, together with Pointwise Mutual Information (PMI), were implemented. For each experiment, a list of words was retrieved according to established thresholds for the metrics and evaluated against a gold standard built by a domain expert. Precision and recall were used in the evaluation step, and a detailed discussion by representation and class is presented. Results showed higher precision for the types of intelligences class and the 5-grams representation.
Introduction
The volume of available information is increasing exponentially; however, the heterogeneity of Web information makes its processing very difficult. Traditional information retrieval techniques obtain results according to keywords and the information to be processed, but this is not enough for semantic queries. Ontologies are presented as an option to process information, and they can be used for vocabulary management, natural language processing applications, searches, recommendation systems and e-learning, among others [10]. The ontology learning process integrates concept detection, creation, population and evaluation [11]. Traditionally, the ontology building process is manual, because it requires working with experts in the domain of interest, who identify and define the relationships and keywords in the text under analysis. This procedure is expensive, and in many cases the ontology creation step requires a great deal of time, due to the very high interaction between domain experts and linguistic scientists.
This work is focused on the first step of the ontology learning process, that is, concept detection. In previous research, a method for the detection and validation of superclasses was carried out; in this paper, experiments to detect the main concepts related to these superclasses are presented. A corpus in Spanish is built with pedagogical articles related to three main classes: learning styles, types of intelligences and learning strategies. In the experiments, textual similarity and corpus-based metrics are used to extract the most relevant terms (subclasses) for each of the superclasses. In addition to the metrics, experiments are performed with three representations of the corpus (sentences, pairs of words and 5-grams) and with different threshold values of the metrics for considering a word as retrieved. For the evaluation, precision, recall and a gold standard developed with the help of a domain expert are used.
The motivation behind the use of similarity metrics is grounded in the analysis of the number of retrieved words. The contributions in this sense are a different interpretation of the classical metrics, the use of an external corpus related to the domain for PMI, and a threshold for considering a word as recovered. Detecting concepts automatically makes it possible to extract relevant words with less interaction with the domain expert than a manual process. Moreover, the experiments can be replicated with a different input corpus.
The article is organized as follows. Section 2 introduces theoretical concepts about ontologies, similarity metrics and the pedagogic domain. Section 3 presents related work on the pedagogical domain and some other approaches focused on the concept detection process. In Section 4, experiments are carried out, followed by a detailed discussion of results in Section 5. Finally, Section 6 outlines conclusions and future work of the research.
Theoretical concepts
In practice, an ontology is a computational entity, an artificial resource that has been created [23]. In Computer Science, an ontology is defined as “an explicit specification of a conceptualization” [15]. Another definition is that of [34], which defines it as “a database that describes the concepts in the world or some domain”, some of their properties and how the concepts are related to each other. This database is defined from a base corpus, from which the main elements or keywords are extracted.
Subsequently, the relationships between keywords are inferred from the same text. In this way, a graph structure is created where the nodes are the keywords and the edges represent the relationships between them. Among the most representative applications of ontologies is the formal representation of knowledge, which facilitates the management and integration of data with different structures.
Formally, an ontology can be defined as the tuple O = (C, H, I, R, P, A), where C is the set of entities of the ontology, H is the set of taxonomic relationships between concepts, I is the set of instance relationships related to C, R is the set of non-taxonomic ontology relationships, P is the set of properties of the ontology entities and A is the set of axioms, that is, rules that allow checking the consistency of an ontology and inferring new knowledge through some inference mechanism [10].
There are several techniques for concept detection, drawn from information retrieval, pattern recognition and similarity metrics, among other areas. Similarity metrics are used for collocation detection, since two words that appear together are probably related. Subsection 2.1 addresses the similarity metrics used in this approach.
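As an illustration of the tuple O = (C, H, I, R, P, A), the following minimal Python sketch stores each component as a plain collection. The class name `Ontology`, its field names, the `subclasses` helper and the sample entries are assumptions made for this example only; they are not part of the formal definition in [10].

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Hypothetical container mirroring O = (C, H, I, R, P, A)."""

    concepts: set[str] = field(default_factory=set)                    # C
    taxonomy: set[tuple[str, str]] = field(default_factory=set)        # H: (child, parent)
    instances: set[tuple[str, str]] = field(default_factory=set)       # I: (instance, concept)
    relations: set[tuple[str, str, str]] = field(default_factory=set)  # R: (source, label, target)
    properties: set[tuple[str, str]] = field(default_factory=set)      # P: (concept, property)
    axioms: list[str] = field(default_factory=list)                    # A: rules kept as text

    def subclasses(self, parent: str) -> set[str]:
        """Return the direct children of `parent` in the taxonomy H."""
        return {child for child, p in self.taxonomy if p == parent}


# Illustrative content only: one superclass and two candidate subclasses.
onto = Ontology()
onto.concepts |= {"learning styles", "activo", "reflexivo"}
onto.taxonomy |= {("activo", "learning styles"), ("reflexivo", "learning styles")}
print(sorted(onto.subclasses("learning styles")))
```

In this reading, the concept detection step studied in the paper would populate `concepts` and suggest candidate entries for `taxonomy`; the other components belong to later steps of ontology learning.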
Similarity metrics
Similarity metrics compare the proximity between words or characters in two texts. In this paper, four metrics based on terms and one based on a corpus are used. Term-based metrics analyze frequency, depending on the distance between each pair of words; metrics of this type are the Dice, Jaccard, Overlap and Cosine coefficients. Corpus-based metrics calculate the occurrence of one word based on the occurrence of another. Mutual information is an example of this type of metric, for which a large external corpus is necessary.
Pointwise mutual information can be normalized to the interval [-1, +1], resulting in -1 for words that never occur together, 0 for independence, and +1 for complete co-occurrence (Equation 6).
The Dice and Overlap coefficients are similar to the Jaccard coefficient, but the Dice coefficient gives double weight to the positive coincidences between the terms, and the Overlap coefficient considers only the character cardinality of the smaller text instead of that of the union of characters, as the Jaccard coefficient does. The mutual information (MI) of the random variables X and Y is the expected value of the PMI over all possible outcomes.
The superclasses were determined by a domain expert, and from them the main concepts will be detected by the automatic process proposed in this paper. The superclasses are outlined in the next paragraphs.
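The behaviour at the three reference points can be checked with a short Python sketch. It assumes the commonly used normalization, in which PMI is divided by -log p(x, y); the function name `npmi` and the count-based signature are assumptions of this example and may differ in detail from Equation 6.

```python
import math


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Normalized PMI from raw counts, assuming PMI / -log p(x, y)."""
    if count_xy == 0:
        return -1.0  # the two words never occur together
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    if p_xy == 1.0:
        return 1.0  # both words occur in every observation
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


print(npmi(0, 10, 10, 100))    # never together
print(npmi(25, 50, 50, 100))   # independent: p(x, y) = p(x) * p(y)
print(npmi(10, 10, 10, 100))   # x and y always appear together
```

The counts in the three calls are invented to hit the limiting cases; in the experiments they would come from the external corpus.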
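The differences between the four term-based coefficients are easier to see side by side. The sketch below uses their standard set-based definitions on two token sets; the function names and the sample sets are assumptions of this example, and the paper later reinterprets the intersection and union counts per corpus representation.

```python
import math


def dice(a: set, b: set) -> float:
    """Dice: twice the intersection over the sum of the set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard: intersection over union."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap: intersection over the size of the smaller set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def cosine(a: set, b: set) -> float:
    """Cosine on sets: intersection over the geometric mean of the sizes."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


t1 = {"a", "b", "c"}
t2 = {"b", "c", "d", "e"}
for metric in (dice, jaccard, overlap, cosine):
    print(metric.__name__, round(metric(t1, t2), 4))
```

For the same pair of sets, Jaccard gives the smallest value and Overlap the largest, which is worth keeping in mind when a single threshold is applied to all four metrics.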
Related work
In this section, some related works are presented. The approaches are divided into two subsections: Subsection 3.1 covers the concept detection process, and Subsection 3.2 contains the approaches related to the pedagogic domain.
Concept detection process
Table 1 shows some recent approaches for the concept detection and ontology creation steps. Works like [7] use word representations and collocations as additional features in a linear-chain Conditional Random Field (CRF) classifier, obtaining an F-score of 82.44% on the CoNLL-2002 corpus, where the features are based on cross-lingual word representations. A simple and practical approach for automatic term and relationship extraction was explained in [21]: the term extraction scheme used domain-specific patterns to identify agriculture terms, and the relationship extraction scheme employed patterns, position vectors and WordNet similarity to identify four types of relations from agricultural text pertaining to crops. The relationship extraction scheme was evaluated using 10-fold cross-validation.
Approaches for ontology creation and concept detection steps
A methodology for manually building an ontology on non-communicable diseases (NCD) was presented in [31]. The proposed system consists of six ontologies which formalize concepts related to people, physical activity, the NCD, nutrition, geographic regions and symptoms. A method for concept extraction using linguistic patterns and NLP metrics such as morphological labeling is presented in recent research [20]. An Ontology-based Research Profile Creation (ObRPC) model was proposed in [19] to create keyword profiles of users to assist in the task of expert finding.
Methods for semi-automatic concept extraction using a database of Spanish verbs, diathesis alternations and syntactic-semantic schemes (ADESSE tool) [12] are presented in [26], where the extracted semantic patterns are the classes. This methodology was applied in the educational domain and replicated in the financial domain [25]; in both works, the class extraction was completed with the opinion of the domain expert.
A theoretical approach [33] describes two concepts, the Semantic Web and ontology, and points out the core role of ontology in the Semantic Web. Moreover, a Double-Channel Helix Methodology was described. In the analyzed approaches, the use of NLP techniques and semantic resources is dominant, however these.
The approaches to the pedagogical domain addressed in this subsection fall into four general categories: ontologies applied to e-learning, ontologies applied to classroom classes, approaches that present concepts in a specific topic, and ontologies as a tool to facilitate the learning process between lecturer and students. Figure 1 shows some approaches in this domain, which are described in Table 2 by analyzing the methodology, language and knowledge area in which the research was conducted.

Research in the pedagogy domain.
Approaches in the pedagogic domain
Some research is focused on online education, such as [8], [9] and, recently, [16], where ontologies are manually defined from XML resources available on the Internet, and the evaluation is a manual process too. On the other hand, [17] proposes an ontology for the Internet learning process. In both lines of work, an ontology for each entity in the learning process is defined, and the evaluation is conducted through a manual process supervised by domain experts.
There are works such as [32] focused on automatic learning; in that paper, an ontology based on the Internet of Things used in a classroom is created, considering the student intelligences. In [27], research focused on identifying a framework of social learning factors is presented. This work enables higher education to be sustainable and more adaptive; the primary method of the study is a systematic literature review process to define social learning factors that can be mapped. The result is an ontology model for social learning that can be implemented in higher education.
In [24], an ontological model for learning personalization is proposed, which involves student profiles according to the multiple intelligence theory of Howard Gardner. These works also use a domain ontology that helps to represent knowledge in virtual learning platforms. An approach for searching and selecting Massive Open Online Courses (MOOCs) that are relevant to the user and fit the student's level of education is presented in [29]. The ontology is based on the semantic representation of the user's knowledge in a subject area.
An ontology created from CASE diagrams for online education is presented by Bagiampou and Kameas [4]; its evaluation is carried out by experts through a manual process. In that work, the focus is on the construction step, where the classes are extracted manually. The process of creating an ontology from information on courses offered at advanced levels is explained in [3], where students can choose courses according to their academic background. Both works present the structure, information and hierarchy of the classes manually.
According to Figure 1, the proposed work is in the dark area (the intersection between the face-to-face domain and ontology building as a tool). Most of the research proposes a manual methodology; thus, this work is relevant in that it proposes a semi-automatic approach.
The proposal in this paper is part of a general methodology for ontology learning applied to the pedagogic domain. The experiments carried out in this section concern the ontology creation step, as can be seen in Figure 2. Experiments on class validation are available in [2].

Ontology learning general methodology.
A corpus was built using academic papers focused on Social Sciences (Pedagogy) and written in Spanish. The papers are related to the superclasses being extracted and were joined into an initial corpus. Table 3 shows the word frequencies, where the Learning Strategy class contains more words in its vocabulary than the other two classes because of the presence of two additional subcategory levels. After the analysis, it was concluded that the classes share many words. The final corpus vocabulary contains 18,563 elements.
Initial corpus
For the concept detection experiments, three representations of the corpus and three sets of similarity metrics were used, as shown in Figure 3, where the first level shows the superclasses, the second level corresponds to the corpus representations and the third level shows the metric sets. The result of each experiment was named using the notation ClassRepresentation,Metrics,γ. For example, the precision of PMI using the Wikipedia corpus for the learning styles class with the 5-grams representation and a threshold of 0.1 is represented by EA5g,Pw,0.1, and the precision of the term metrics for the intelligence types class with the sentences representation and a threshold of 0.15 is represented by TISe,Te,0.15.

Classes, representations and metric sets used in the experiments.
A gold standard related to the classes was built with the help of a domain expert. The final list contains words related to subclasses, descriptive terms and authors, among other important concepts associated with a class. The experiments for class detection were implemented as follows. The initial corpus was divided by class, and each subcorpus was named after the initials of the class in Spanish: EA, EE and TI, that is, learning styles, learning strategies and intelligence types respectively. The lemmas of each corpus were extracted using the TreeTagger tool [30]. Afterward, stop words were deleted and three representations were extracted: sentences separated by periods (Se), lemma pairs (Pa) and 5-grams (5g). The corpus representations were analyzed with the sets of metrics explained in Section 2.1: term-based metrics, including the Dice, Jaccard, Overlap and Cosine coefficients (Te); normalized PMI using the Wikipedia corpus (Pw); and normalized PMI using the books corpus (Pb). A threshold, represented by γ, was used to determine whether a word is considered as recovered, and the recovered words were compared against the gold standard. The precision metric was calculated for each experiment (Equation 7). This metric was used because it is not necessary to retrieve all the gold standard words; precision shows the percentage of words correctly retrieved in each experiment. The recall metric was also evaluated for each representation (Equation 8).
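The three corpus representations can be sketched in a few lines of Python. The sketch assumes that the text has already been lemmatized (the paper uses TreeTagger for that step), that pairs are adjacent lemmas, and that 5-grams are sliding windows over the filtered lemma stream; the stop-word list, the function names and the sample text are hypothetical.

```python
STOP_WORDS = {"el", "la", "de", "y", "en", "un", "ser"}  # hypothetical short list


def clean(tokens):
    """Drop stop words from a list of lemmas."""
    return [t for t in tokens if t not in STOP_WORDS]


def sentences(lemmatized_text):
    """Se: one list of lemmas per period-separated sentence."""
    parts = (clean(s.split()) for s in lemmatized_text.split("."))
    return [p for p in parts if p]


def pairs(tokens):
    """Pa: adjacent lemma pairs."""
    return list(zip(tokens, tokens[1:]))


def five_grams(tokens):
    """5g: sliding windows of five consecutive lemmas."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]


text = "el estilo de aprendizaje ser activo. el alumno reflexivo aprender en grupo."
se = sentences(text)
stream = [t for s in se for t in s]
print(se)
print(pairs(stream)[:3])
print(five_grams(stream)[:2])
```

Whether pairs and 5-grams may cross sentence boundaries is not stated in the text; the sketch lets them cross, which is one possible choice.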
The metrics in Equations 1, 2, 3 and 4 are used in the literature to obtain the similarity between two sentences, but this approach requires the similarity between two words. Usually, in the metrics (Section 2.1), t1 and t2 represent sentence 1 and sentence 2; for example, in the Jaccard coefficient (Equation 2), ∣t1∩t2∣ is the number of words that appear in both sentence 1 and sentence 2, and ∣t1∪t2∣ is the number of words that appear in either sentence. For this approach, the same metrics were used, but the variables were interpreted differently for the three corpus representations: t1 is the class and t2 is each of the vocabulary words, so the meaning of ∣t1∩t2∣ and ∣t1∪t2∣ depends on the corpus representation.
In the Se representation:
[∣t1∩t2∣:] Number of sentences in which the class (t1) and the word (t2) both appear. The words do not need to be adjacent; they only have to appear in the same sentence.
[∣t1∪t2∣:] Total number of sentences in which the class (t1) or the word (t2) appears.
In the Pa representation:
[∣t1∩t2∣:] Total number of appearances of the two words together, regardless of the order (t1:t2 or t2:t1).
[∣t1∪t2∣:] Frequency of t1 in the corpus plus frequency of t2 in the corpus.
In the 5g representation:
[∣t1∩t2∣:] Number of 5-grams in which t1 and t2 both appear.
[∣t1∪t2∣:] Number of 5-grams in which t1 or t2 appears.
The other metrics, including the normalized PMI in Equation 6, were calculated using this representation of the variables. For the PMI implementation, a different corpus is necessary; thus, two corpora were built: the first one was composed of random Wikipedia articles, and the second one was composed of free books on pedagogy, philosophy and psychology. The Wikipedia articles were obtained using a Web crawler, and the books were obtained manually. Table 4 shows the components of each corpus. The Wikipedia corpus is richer, but its domain is not delimited as in the books corpus; the initial hypothesis is that a corpus related to the analyzed domain could yield better results.
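The reinterpretation of the intersection and union counts can be sketched as follows. The sketch treats the class (t1) as a single token for simplicity, which is an assumption; the helper names and the toy data are hypothetical. The first helper covers the Se and 5g representations, where a unit is a sentence or a 5-gram, and the second covers the Pa representation.

```python
def unit_counts(units, t1, t2):
    """Se / 5g: units containing both terms, and units containing either term."""
    both = sum(1 for u in units if t1 in u and t2 in u)
    either = sum(1 for u in units if t1 in u or t2 in u)
    return both, either


def pair_counts(tokens, t1, t2):
    """Pa: adjacent co-occurrences in any order, and freq(t1) + freq(t2)."""
    both = sum(1 for a, b in zip(tokens, tokens[1:]) if {a, b} == {t1, t2})
    either = tokens.count(t1) + tokens.count(t2)
    return both, either


def jaccard_from_counts(both, either):
    """Jaccard coefficient computed from the reinterpreted counts."""
    return both / either if either else 0.0


# Toy sentences (Se); the same call works on a list of 5-grams.
units = [["estilo", "activo", "alumno"], ["estilo", "teórico"], ["alumno", "activo"]]
tokens = ["estilo", "activo", "alumno", "activo", "estilo"]

se_counts = unit_counts(units, "estilo", "activo")
pa_counts = pair_counts(tokens, "estilo", "activo")
print(se_counts, jaccard_from_counts(*se_counts))
print(pa_counts, jaccard_from_counts(*pa_counts))
```

Note that in the Pa representation the "union" is a sum of frequencies rather than a true set union, so the resulting values are not directly comparable across representations, which may be one reason the useful thresholds differ between them.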
Corpora obtained for PMI
In this section, a detailed discussion about results by class and its representation is presented. The precision of each experiment using a combination of representations and metrics is evaluated. Some experiments were taken as examples, in order to analyze the list of recovered words and to describe how the precision values were obtained.
Table 5 shows the list of words retrieved in the experiment EESe,Te,0.16. Since the Te set is made up of 4 term-overlap metrics, a voting system was used to determine the words that would be recovered. For this particular experiment, the value of γ determined whether a word is considered as retrieved. Table 5 shows 8 words whose associated metrics have values greater than 0.16 in 3 or 4 of the metrics. For example, the word conocimiento, when compared with the class learning strategies, obtained a Jaccard coefficient of 0.132, which is less than γ; however, the other three metrics had higher values, so conocimiento is considered as recovered. Just 3 of the 8 words (estudio, proceso, poder) are not within the gold standard, so the precision for this experiment is 0.625. The gold standard contains 38 words related to the EE class, so the recall for this experiment is 0.1316. The remaining words represent a type of strategy (metacognitive) and the characteristics of these strategies in learning theory.
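The voting rule and the two figures quoted for EESe,Te,0.16 can be reproduced with a small Python sketch. Only the Jaccard value of 0.132 for conocimiento, the three out-of-gold words, and the totals (8 retrieved, 38 in the gold standard) come from the text; every other score and every placeholder word (`ejemplo`, `r1` to `r4`, `g0` to `g32`) is invented for illustration, and the function names are assumptions.

```python
def retrieved_by_vote(scores, gamma, min_votes=3):
    """Keep words whose score exceeds gamma in at least min_votes metrics."""
    return {
        word
        for word, by_metric in scores.items()
        if sum(value > gamma for value in by_metric.values()) >= min_votes
    }


def precision_recall(retrieved, gold):
    """Precision and recall of a retrieved word set against a gold standard."""
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical scores; only the Jaccard value of conocimiento is from the text.
scores = {
    "conocimiento": {"dice": 0.233, "jaccard": 0.132, "overlap": 0.40, "cosine": 0.25},
    "ejemplo": {"dice": 0.10, "jaccard": 0.05, "overlap": 0.30, "cosine": 0.12},
}
voted = retrieved_by_vote(scores, gamma=0.16)
print(voted)

# Eight retrieved words, three of them outside a 38-word gold standard.
retrieved = {"estudio", "proceso", "poder", "conocimiento", "r1", "r2", "r3", "r4"}
gold = {"conocimiento", "r1", "r2", "r3", "r4"} | {f"g{i}" for i in range(33)}
precision, recall = precision_recall(retrieved, gold)
print(precision, round(recall, 4))
```

With 5 of 8 retrieved words in the gold standard, precision is 5/8 = 0.625 and recall is 5/38, which rounds to 0.1316, matching the values reported above.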
Results of term based metrics in EESe,Te,0.16 experiment
For the experiments TIPa,Te,0.02 and EAPa,Te,0.04, the precision was 1.0 and the recall was 0.17, retrieving 11 and 6 words respectively. For the intelligence types class, all the words corresponding to the types of intelligences are retrieved, including some possible combinations. In the case of learning styles, three of the four types mentioned in the literature are retrieved. All these words are included in the gold standard, and they are listed below:
[TIPa,Te,0.02:] Intrapersonal, lingüísticaverbal, inteligencia, musical, naturalista, interpersonal, múltiple, lógicomatematica, espacial, emocional, lingüístico.
[EAPa,Te,0.04:] teórico, aprendizaje, estudiante, reflexivo, activo.
Table 6 shows the precision using the Te metric set; the γ values were between 0.01 and 0.2, with intervals of 0.05. Table 6 only shows the results up to γ = 0.16, since for higher values the result is 0 for all experiments.
Precision in the experiments using terms representation
The condition explained above is applied to the number of words. For example, experiment EE5g,Te,0.16 recovered two words, of which only one is relevant; this represents a precision of 0.5, but one word is not enough to consider it as a concept in an ontology, so the precision was considered 0. The experiment TISe,Te,0.16 had a precision of 1.0, retrieving three words, all of which are relevant; however, because of the small number of words recovered, its precision is also considered 0.
For the learning styles class, the highest precision was obtained by EAPa,Te,0.04, with a precision of 1 and 6 correctly recovered words, among which are the 4 types of learning styles and the authors who created the questionnaire to detect them. In the class of teaching strategies, the best experiment was EESe,Te,0.16, with a precision of 0.625. The precision increased as the value of γ increased up to this point; with γ = 0.17, no words were recovered.
Finally, for the intelligence types class, the best results corresponded to TIPa,Te,0.03 and TIPa,Te,0.02, with precision 1.0. In these experiments, six and eleven words were recovered respectively, all associated with the concept of intelligence and with the types of intelligences. It is observed for the three classes that the representation in sentences (Se) has fewer experiments with precision 0, but the remaining precisions are lower than those in 5g and Pa, except for the class of teaching strategies. In this class, it is also observed that a γ of 0.16 was needed to obtain the highest results, while in the other two classes they were obtained with a much lower γ value.
Table 7 shows the results obtained by comparing the PMI metric with pedagogy books (Pb). The values of γ range from -0.5 to 0.5 in intervals of 0.05. The table only shows the results up to γ = 0.3, since for larger values the result is 0 in all experiments. In these experiments, the results did not vary much as γ increased, so some values are omitted where the results are the same.
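The adjustment for small result lists can be expressed as a guard in front of the precision computation. The cut-off of 5 words used below is an assumption taken from the later discussion of Graph 6c, which mentions a value of 0 when fewer than 5 words are recovered; the function name and the placeholder words are hypothetical.

```python
def adjusted_precision(retrieved, gold, min_words=5):
    """Precision, forced to 0 when fewer than min_words words are retrieved."""
    if len(retrieved) < min_words:
        return 0.0
    return len(retrieved & gold) / len(retrieved)


gold = {"w1", "w2", "w3", "w4", "w5", "w6"}

two_words = {"w1", "x"}              # raw precision 0.5, as in EE5g,Te,0.16
three_words = {"w1", "w2", "w3"}     # raw precision 1.0, as in TISe,Te,0.16
six_words = set(gold)                # enough words, all relevant

print(adjusted_precision(two_words, gold))
print(adjusted_precision(three_words, gold))
print(adjusted_precision(six_words, gold))
```

The guard keeps a very small but accurate list from being reported as a perfect result, at the cost of hiding the raw precision of those experiments.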
Precision in the experiments using Pb representation
The intelligence types class had more words as γ increased, while the class of teaching strategies had zeros from γ = 0.05. For the class of learning styles, the highest precision was 0.29 with the experiment EA5g,Pb,0.20, and for the class of intelligence types the highest precision was 0.25, with the experiment TI5g,Pb,0.25. The class of teaching strategies obtained its highest value in EE5g,Pb,0.10, but the precision barely reaches 0.041. Although the results with the Pb metrics decreased in all the representations with respect to the Te metrics, an improvement over Table 6 can be noticed in the 5g representation.
Table 8 shows the results obtained using the PMI metric with the Wikipedia corpus (Pw); the range of γ is the same as that used with Pb. Values are omitted where the results do not vary with respect to the previous experiment. In general, the results were lower than in the experiments with Pb. For the class of learning styles, the highest precision was obtained with the experiment EA5g,Pw,0.20, with a value of 0.1951; for the learning strategy class, the highest precision was 0.0286 with EE5g,Pw,0.05; and for the class of intelligence types, the highest precision was 0.1778 with TIPa,Pw,0.25. The intelligence types class was the only one that obtained better results using the representation in pairs, while the other two classes obtained them with the representation of 5-grams.
Precision in the experiments using Pw representation
The precision obtained with both Pw and Pb is much lower in the three classes. However, comparing the vocabulary of the corpora used in PMI with the initial corpus, words were found that were not shared despite being included in the gold standard. Figure 4 shows the number of gold standard words that do not appear within the pairs of words recovered with the Pw and Pb metrics. It can be seen that the number of words that do not appear within the Wikipedia corpus is much greater than the number of words that do not appear in the corpus of books, especially in the representation of pairs of words. The intelligence types class was the one with the most lost words within the corpus; however, this class is one of those with the highest precision in the representations, especially in 5g. Analyzing the representations, Pa has the largest number of words that do not appear in the gold standard, which is explained by the method used to obtain the representations. The Pa representation only relates two words when they appear together in the corpus, while the Se and 5g representations relate two words if they appear in the same sentence and if they are at a distance of four or fewer words respectively.

Number of words in the gold standard that do not appear in the Wikipedia corpus and books corpus.
Figure 5 presents the results of all the experiments for Pw and Pb. For these metrics, all the results are in the second and third quadrants. The experiments corresponding to the Pa representation are closest to the vertical axis; thus, precision is high although recall is low. The experiments using the 5g representation show some results slightly further from this axis.

Precision and recall in Pb and Pw representations.
Figure 6 shows the results of all the Te set experiments, separated by representation and class. In all the graphs, the vertical axis represents recall and the horizontal axis represents precision. Graph 6a shows the 5g representation, where it is observed that the majority of the experiments of the classes TI and EE fall in the third quadrant of the graph (low recall and precision), while most of the results for the class EA are displayed in the fourth quadrant (high precision, low recall). Only the EA class has a decreasing trend in terms of the results of these two metrics, reaching very high precision but with few words recovered from the total.

Precision and recall in Te representation by class.
Graph 6b corresponds to the Se representation, where the EA class mostly remains in the fourth quadrant, although a wider distribution along quadrants II, II, and IV is notable. Classes EE and TI presented several experiments with a very high recall, even 1.0, but with low precision. Therefore, many gold standard words are recovered, but words that do not belong to the gold standard are recovered as well.
Graph 6c corresponds to the Pa representation, where most of the results are the same for different values of γ or have a value of 0 because fewer than 5 words were recovered. Comparing the two axes of the graph, the EA class has the best results, and the EE class is the one with the largest number of experiments at the origin (0, 0) of the graph or very close to it.
In the experiments using Pw and Pb, precision and recall do not exhibit a linear relationship, since one metric is high while the other is low. Nevertheless, using the Te set, some experiments manage to balance the results of these two metrics, especially for the class EA in the 5g and Se representations.
In this paper, experiments to determine the effect of similarity metrics on the detection of the principal concepts of an ontology were presented. The experiments were carried out with three different representations of a corpus and with different thresholds for the above-mentioned metrics. The list of words retrieved by each experiment was then compared with a gold standard made with the help of a domain expert.
The principal contribution of this paper is the use of Information Retrieval evaluated using precision and recall. These metrics were calculated using modified similarity metrics and a threshold in order to obtain the similarity between two concepts. In an analysis by superclass, the best results were reached for the intelligence types and the worst ones for the teaching strategies.
The types of intelligences are the most theoretically supported in the literature; thus, the words in the gold standard are more related to each other. Teaching strategies are not universal among authors, since different names and depth levels are handled in their classifications. Regarding learning styles, although they are universally defined, there are only 4 styles, and the remaining words belong to other substyles and concepts that describe them, so it is more difficult to detect these terms automatically.
With respect to representations, 5g is the one that presents the best results, and the set of overlapping metrics showed better results for all classes. Although the results of Pw and Pb were lower, they could explain the relationship that exists between the corpus and the domain to be used. Although Pw has more instances and a larger vocabulary than Pb, it obtained the lowest results, since Pb has fewer instances but all of them are directly related to the pedagogical domain.
As future work, these experiments will be formalized in a methodology for principal concept extraction, in order to determine the relationships between the concepts. In addition, an analysis of fuzzy classes will be carried out according to the word lists retrieved in these experiments.
