Similarity metrics analysis for principal concepts detection in ontology creation

Abstract

In this paper, an analysis based on similarity metrics was carried out in order to detect the main concepts related to the superclasses of a pedagogical domain ontology. A corpus containing articles in Spanish was built semi-automatically. Afterward, the corpus was lemmatized and three representations were extracted. Four term-based textual similarity metrics and Pointwise Mutual Information were implemented. For each experiment, a list of words was retrieved according to established thresholds for the metrics and evaluated against a gold standard built by a domain expert. Precision and recall were used in the evaluation step, and a detailed discussion by representation and class is presented. Results showed a higher precision for the types of intelligences class and the 5-grams representation.

Keywords

Ontology learning, pedagogical domain, NLP.

1 Introduction

The volume of available information is increasing exponentially; however, the heterogeneity of Web information makes its processing very difficult. Traditional information retrieval techniques obtain results according to keywords and the information to be processed, but this is not enough for semantic queries. Ontologies are presented as an option to process information; they can be used for vocabulary management, natural language processing applications, searches, recommendation systems and e-learning, among others [10]. The ontology learning process integrates concept detection, creation, population and evaluation [11]. Traditionally, the ontology building process is manual: it requires working with domain experts, who identify and define the relationships and keywords in the text under analysis. This procedure is costly and in many cases the ontology creation step requires a great deal of time, due to the intensive interaction between domain experts and linguists.

This work focuses on the first step of the ontology learning process, that is, concept detection. In previous research, a method for the detection and validation of superclasses was carried out; in this paper, experiments to detect the main concepts related to these superclasses are presented. A corpus in Spanish is built from pedagogical articles related to three main classes: learning styles, types of intelligences and learning strategies. In the experiments, textual similarity and corpus-based metrics are used to extract the most relevant terms (subclasses) for each of the superclasses. In addition to the metrics, the experiments vary the representation of the corpus (sentences, pairs of words and 5-grams) and the metric threshold above which a word is considered retrieved. For evaluation, precision, recall and a gold standard developed with the help of a domain expert are used.

The motivation behind the use of similarity metrics is grounded in the analysis of the number of retrieved words. The contributions are a different interpretation of the classical metrics, the use of a domain-related external corpus for PMI, and a threshold to consider a word as retrieved. Automatic concept detection makes it possible to extract relevant words with less domain-expert interaction than a manual process. Moreover, the experiments can be replicated with a different input corpus.

The remainder of the article is organized as follows. Section 2 introduces theoretical concepts about ontologies, similarity metrics and the pedagogical domain. Section 3 presents related work on the pedagogical domain and on other approaches focused on the concept detection process. Section 4 describes the experiments, followed by a detailed discussion of results in Section 5. Finally, Section 6 outlines conclusions and future work.

2 Theoretical concepts

In practice, an ontology is a computational entity, an artificial resource that has been created [23]. In Computer Science, an ontology is defined as “an explicit specification of a conceptualization” [15]. Another definition is that of [34], which describes it as “a database that describes the concepts in the world or some domain”, some of their properties and how the concepts are related to each other. This database is defined from a base corpus, from which the main elements or keywords are extracted.

Subsequently, the relationships between keywords are inferred from the same text. In this way, a graph structure is created in which the nodes are the keywords and the edges represent the relationships between them. Among the most representative applications of ontologies is the formal representation of knowledge, which facilitates the management and integration of data with different structures.

Formally, an ontology can be defined as the tuple O = (C, H, I, R, P, A), where C is the set of entities of the ontology, H is the set of taxonomic relationships between concepts, I is the set of instance relationships related to C, R is the set of non-taxonomic ontology relationships, P is the set of properties of ontology entities and A is the set of axioms, rules that allow checking the consistency of an ontology and inferring new knowledge through some inference mechanism [10].

There are several techniques for concept detection, drawing on information retrieval, pattern recognition and similarity metrics, among other areas. Similarity metrics are used for collocation detection, since two words that appear together are probably related. Subsection 2.1 addresses the similarity metrics used in this approach.
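As an illustration only, the tuple O = (C, H, I, R, P, A) can be held in a small container. The sketch below assumes that each component is stored as a plain Python set or list of tuples and strings; the field names, the tuple layouts and the sample entries are hypothetical and are not taken from the ontology built in this work.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    """Illustrative container for O = (C, H, I, R, P, A); layouts are assumptions."""
    concepts: set = field(default_factory=set)    # C: entities of the ontology
    taxonomy: set = field(default_factory=set)    # H: (subclass, superclass) pairs
    instances: set = field(default_factory=set)   # I: (instance, concept) pairs
    relations: set = field(default_factory=set)   # R: (concept, relation, concept) triples
    properties: set = field(default_factory=set)  # P: (entity, property) pairs
    axioms: list = field(default_factory=list)    # A: rules, kept as plain strings here


# Hypothetical entries, for illustration only.
onto = Ontology()
onto.concepts |= {"estilo de aprendizaje", "activo"}
onto.taxonomy.add(("activo", "estilo de aprendizaje"))
```

In this reading, concept detection populates `concepts` and `taxonomy`; the remaining components belong to later steps of the ontology learning process.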

2.1 Similarity metrics

Similarity metrics compare the proximity between words or characters in two texts. In this paper, four term-based metrics and one corpus-based metric are used. Term-based metrics analyze frequency, depending on the distance between each pair of words; metrics of this type are the Dice, Jaccard, Overlap and Cosine coefficients. Corpus-based metrics calculate the occurrence of one word based on the occurrence of another. Mutual information is an example of this type of metric, for which a large external corpus is necessary.

Dice coefficient. It is based on set theory and considers the number of words shared by both texts. Twice this number of shared words is divided by the total number of words in the first and second texts (Equation 1). The result lies between 0 and 1, where 0 is zero similarity while 1 refers to maximum similarity [1].

${sim}_{D} (t_{1}, t_{2}) = 2 \frac{∣ t_{1} \cap t_{2} ∣}{∣ t_{1} ∣ + ∣ t_{2} ∣}$ (1)

Jaccard coefficient. It is similar to the Dice coefficient, and is obtained by dividing the intersection of the terms by their union. Its formula is presented in Equation 2 [18].

${sim}_{J} (t_{1}, t_{2}) = \frac{∣ t_{1} \cap t_{2} ∣}{∣ t_{1} \cup t_{2} ∣}$ (2)

Overlap coefficient. It is similar to the Jaccard coefficient, but the denominator is the cardinality of the smaller text instead of that of the union [14]. This change is specified in Equation 3.

${sim}_{T} (t_{1}, t_{2}) = \frac{∣ t_{1} \cap t_{2} ∣}{\min (∣ t_{1} ∣, ∣ t_{2} ∣)}$ (3)

Cosine coefficient. It is obtained by dividing the cardinality of the intersection of the two sets by the square root of the product of their cardinalities (Equation 4).

${sim}_{C} (t_{1}, t_{2}) = \frac{∣ t_{1} \cap t_{2} ∣}{\sqrt{∣ t_{1} ∣ ∣ t_{2} ∣}}$ (4)
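Equations 1 to 4 can be sketched directly on Python sets. This is a minimal illustration, assuming each text has already been reduced to a set of lemmas; the function names and the two sample texts are hypothetical, and returning 0.0 for empty inputs is a convention chosen here, not something stated in the paper.

```python
from math import sqrt


def dice(t1: set, t2: set) -> float:
    """Equation 1: twice the shared terms over the sum of both sizes."""
    total = len(t1) + len(t2)
    return 2 * len(t1 & t2) / total if total else 0.0


def jaccard(t1: set, t2: set) -> float:
    """Equation 2: shared terms over the size of the union."""
    union = len(t1 | t2)
    return len(t1 & t2) / union if union else 0.0


def overlap(t1: set, t2: set) -> float:
    """Equation 3: shared terms over the size of the smaller set."""
    smaller = min(len(t1), len(t2))
    return len(t1 & t2) / smaller if smaller else 0.0


def cosine(t1: set, t2: set) -> float:
    """Equation 4: shared terms over the root of the product of sizes."""
    denom = sqrt(len(t1) * len(t2))
    return len(t1 & t2) / denom if denom else 0.0


# Hypothetical lemmatized texts, invented for illustration only.
text_a = {"estilo", "aprendizaje", "activo", "reflexivo"}
text_b = {"estilo", "aprendizaje", "estudiante"}

scores = {f.__name__: f(text_a, text_b) for f in (dice, jaccard, overlap, cosine)}
```

With two shared lemmas out of four and three, the Overlap coefficient (2/3) is the largest of the four values and the Jaccard coefficient (2/5) the smallest, which matches the ordering seen in the results tables of Section 5.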

Mutual information. This technique uses advanced query search methods to calculate probabilities. The more often one word co-occurs close to the other in a collection of documents, the higher the value of MI [6]. In information retrieval, the Pointwise Mutual Information (PMI) of a pair of words x and y quantifies the discrepancy between the probability of their co-occurrence given their joint distribution and given their individual distributions, assuming independence (Equation 5).

$PMI (x; y) = \log \frac{p (y ∣ x)}{p (y)}$ (5)

Pointwise mutual information can be normalized to the interval [−1, +1], resulting in −1 for words that never occur together, 0 for independence, and +1 for complete co-occurrence (Equation 6), where h(x, y) = −log p(x, y) is the self-information of the joint event.

$NPMI (x; y) = \frac{PMI (x; y)}{h (x, y)}$ (6)
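Equations 5 and 6 can be sketched from raw co-occurrence counts. The sketch assumes the usual definition h(x, y) = −log p(x, y) and estimates probabilities as relative frequencies over `total` observation units; the function names, the count-based interface and the handling of the two boundary cases are assumptions made for illustration.

```python
from math import log


def pmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Equation 5, using p(y|x) / p(y) = p(x, y) / (p(x) * p(y))."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return log(p_xy / (p_x * p_y))


def npmi(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """Equation 6: PMI divided by h(x, y) = -log p(x, y)."""
    if count_xy == 0:
        return -1.0  # limit value when the words never occur together
    p_xy = count_xy / total
    if p_xy == 1.0:
        return 1.0   # h(x, y) = 0; treated here as complete co-occurrence
    return pmi(count_xy, count_x, count_y, total) / -log(p_xy)


# Hypothetical counts over 100 units, for illustration only.
always_together = npmi(10, 10, 10, 100)  # x and y only ever appear together
independent = npmi(10, 20, 50, 100)      # p(x, y) equals p(x) * p(y)
```

The two sample calls reproduce the reference points named in the text: a value close to +1 for complete co-occurrence and close to 0 for independence.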

The Dice and Overlap coefficients are similar to the Jaccard coefficient, but the Dice coefficient gives double weight to the matches between the terms, and the Overlap coefficient considers only the cardinality of the smaller text instead of that of the union, as the Jaccard coefficient does. The mutual information (MI) of the random variables X and Y is the expected value of the PMI over all possible outcomes.

2.2 Superclasses

The superclasses were determined by a domain expert, and from them the main concepts are detected by the automatic process proposed in this paper. The superclasses are outlined in the following paragraphs.

Learning styles: Learning styles describe the way in which a person learns. However, there are alternative views on how humans learn concepts and process information, and several theories describing the types of learning have been proposed. This work adopts as a reference the model of David Kolb [22], in which a learning style is determined using the Learning Style Inventory (LSI) scale. The theory proposes a method for describing how students solve problems and apply new knowledge from personal experience within their learning environment. It considers the psychological processes of perception and processing [28].

Intelligence types: Intelligence is the ability to solve problems, or to create products, which are valued within one or more cultural environments [13]. Humans have capacities and potentials that can be employed in productive ways (together or separately); this idea gave rise to the multiple intelligences theory.

Learning strategies: A learning strategy is a set of procedures that a student uses in a conscious, controlled and intentional manner as flexible tools to learn and solve problems [5].

3 Related work

This section presents related work, divided into two subsections: Subsection 3.1 covers the concept detection process, and Subsection 3.2 covers approaches related to the pedagogical domain.

3.1 Concept detection process

Table 1 shows some recent approaches for the concept detection and ontology creation steps. The work in [7] uses word representations and collocations as additional features in a linear-chain Conditional Random Field (CRF) classifier, obtaining an F-score of 82.44% on the CoNLL-2002 corpus, where the features are based on cross-lingual word representations. A simple and practical approach for automatic term and relationship extraction was explained in [21]: the term extraction scheme used domain-specific patterns to identify agriculture terms, and the relationship extraction scheme employed patterns, position vectors and WordNet similarity to identify four types of relations in agricultural text pertaining to crops. The relationship extraction scheme was evaluated using 10-fold cross-validation.

Table 1
Approaches for ontology creation and concept detection steps

Step	Automation	Learning approach	Language	Domain	Approach
Concept detection	Semi-automatic and unsupervised	Named Entity Recognition and collocations	Spanish	CoNLL-2002 corpus	[7]
Concept detection	Automatic	Pattern recognition - NLP	English	Agriculture	[21]
Creation	Semi-automatic	Semantic web and SWRL rules	English	NCD: physical activity, nutrition, geographic regions and symptoms	[31]
Concept detection	Automatic	Linguistic pattern extraction and statistical weights	English	Independent	[20]
Creation	Semi-automatic	Domain Dictionary and Semantic matching	English	Researcher profiles	[19]
Concept detection	Semi-automatic	Semantic relations extraction	Spanish	Financial texts	[25]

A methodology for manually building an ontology on non-communicable diseases (NCD) was presented in [31]. The proposed system consists of six ontologies which formalize concepts related to people, physical activity, the NCD, nutrition, geographic regions and symptoms. A method for concept extraction using linguistic patterns and NLP techniques such as morphological labeling is presented in [20]. An Ontology-based Research Profile Creation (ObRPC) model was proposed in [19] to create keyword profiles of users to assist in the task of expert finding.

Methods for semi-automatic concept extraction using a database of Spanish verbs, diathesis alternations and syntactic-semantic schemes (the ADESSE tool [12]) are presented in [26], where the extracted semantic patterns are the classes. This methodology was applied in the educational domain and replicated in the financial domain [25]; in both works, the class extraction was completed with the opinion of a domain expert.

A theoretical study [33] describes two concepts, the Semantic Web and ontology, and points out the core role of ontology in the Semantic Web. Moreover, a Double-Channel Helix Methodology is described. In the analyzed approaches, the use of NLP techniques and semantic resources is dominant, however these.

3.2 Pedagogical domain

The approaches to the pedagogical domain addressed in this subsection fall into four general categories: ontologies applied to e-learning, ontologies applied to classroom teaching, approaches that present the concepts of a specific topic, and ontologies as a tool to facilitate the learning process between lecturers and students. Figure 1 shows some approaches in this domain; they are described in Table 2, which gives the methodology, language and knowledge area of each work.

Fig.1

Research in the pedagogical domain.

Table 2

Approaches in the pedagogic domain

Approach	Automation	Methodology	Language	Sub domain	Area of knowledge
[16]	Semi-automatic	Semantic Web technologies in an e-learning platform	English	E-learning platform (Moodle)	Information and communication technologies
[17]	Semi-automatic	Ontology used in a recommendation system	Chinese	Chinese K12 education	Computer Sciences
[32]	Manual	Ontology used in a recommendation system	English	Internet-of-Things	Education
[27]	Manual	Documentary research	English	Social learning in Web 2.0	Information and communication technologies
[24]	Manual	Manual process: Protégé and SWRL	Spanish	Student profiles	Education
[29]	Semi-automatic	Ontology used in a recommendation system	English	Massive Open Online Courses (MOOC)	Computational and Information Sciences
[3]	Manual	Class hierarchy	English	Courses offered by a university	Technology and education
[9]	Manual	ABC model	English	Online education	Computational and Information Sciences

Some researchers focus on online education, such as [8], [9] and, recently, [16], where ontologies are manually defined from XML resources available on the Internet, and the evaluation is a manual process too. On the other hand, [17] proposes an ontology for the internet learning process. In both works an ontology is defined for each entity in the learning process, and the evaluation is conducted through a manual process supervised by domain experts.

There are works such as [32] focused on automatic learning; in that paper, an ontology based on the Internet of Things used in a classroom is created, considering the students' intelligences. In [27], research focused on identifying a framework of social learning factors is presented. This work enables higher education to be more sustainable and adaptive; the primary method of the study is a systematic literature review to define social learning factors that can be mapped. The result is an ontology model for social learning that can be implemented in higher education.

In [24], an ontological model for learning personalization is proposed, which builds student profiles according to Howard Gardner's theory of multiple intelligences. These works use a domain ontology that also helps to represent knowledge in virtual learning platforms. An approach for searching and selecting Massive Open Online Courses (MOOCs) that are relevant to the user and fit the student's level of education is presented in [29]. The ontology is based on the semantic representation of the user's knowledge in a subject area.

An ontology created from CASE diagrams for online education is presented by Bagiampou and Kameas [4]; its evaluation is carried out manually by experts. In that work, the focus is on the construction step, where the classes are extracted manually. The process of creating an ontology from information on the courses offered at advanced levels is explained in [3], where students can choose courses according to their academic background. Both works present the structure, information and hierarchy of the classes manually.

According to Figure 1, the proposed work lies in the dark area (the intersection between the face-to-face domain and ontology building as a tool). Most of the research proposes a manual methodology; thus, this work is relevant in proposing a semi-automatic approach.

4 Experiments

The proposal in this paper is part of a general methodology for ontology learning applied to the pedagogical domain. This section describes the experiments on the ontology creation step, shown in Figure 2. Experiments on class validation are available in [2].

Fig.2

Ontology learning general methodology.

A corpus was built from academic papers in the Social Sciences (Pedagogy) written in Spanish. The papers related to the superclasses were extracted and joined into an initial corpus. Table 3 shows the word frequencies; the LearningStrategy class has a larger vocabulary than the other two classes because of the presence of two further levels of subcategories. The analysis showed that the classes share many words. The final corpus vocabulary contains 18,563 elements.

Table 3

Initial corpus

Class	Words	Vocabulary
LearningStyle	60,551	8,587
LearningStrategy	68,397	10,145
IntelligenceType	56,090	9,863
Total	185,038	18,563

For the concept detection experiments, three representations of the corpus and three sets of similarity metrics were used, as shown in Figure 3, where the first level shows the superclasses, the second level corresponds to the corpus representations and the third level shows the metric sets. The result of each experiment was named using the notation Class_{Representation,Metrics,γ}, where the precision of PMI using the Wikipedia corpus for the learning styles class with the 5-grams representation and a threshold of 0.1 is represented by EA_5g,Pw,0.1, and the precision of the term metrics for the intelligence types class with the sentences representation and a threshold of 0.15 is represented by TI_Se,Te,0.15.

Fig.3

Classes, representations and metric sets used in the experiments.

A gold standard related to the classes was built with the help of a domain expert. The final list contains words related to subclasses, descriptive terms and authors, among other important concepts associated with a class. The experiments for class detection were implemented as follows:

The initial corpus was divided by class and each subcorpus was named after the initials of the class in Spanish: EA, EE and TI, that is, learning styles, learning strategies and intelligence types respectively.

Lemmas of each corpus were extracted using the TreeTagger tool [30]. Afterward, stop words were deleted and three representations were extracted: sentences separated by periods (Se), lemma pairs (Pa) and 5-grams (5g).

The corpus representations were analyzed with the sets of metrics explained in Section 2.1: term-based metrics, including the Dice, Jaccard, Overlap and Cosine coefficients (Te), normalized PMI using the Wikipedia corpus (Pw) and normalized PMI using the books corpus (Pb).

A threshold on the metrics, together with the gold standard, was used to determine whether a word is considered as retrieved. Thresholds are represented by γ.

The precision metric was calculated for each experiment (Equation 7). This metric was used because it is not necessary to retrieve all the gold standard words; precision shows the percentage of words correctly retrieved in each experiment.

The recall metric was also evaluated for each representation (Equation 8).

$P = \frac{\text{retrieved relevant items}}{\text{retrieved items}}$ (7)

$R = \frac{\text{retrieved relevant items}}{\text{relevant items}}$ (8)
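Equations 7 and 8 amount to two ratios over the set of retrieved words and the gold standard. The following is a minimal sketch; the function name and the placeholder word lists are hypothetical and do not reproduce any experiment in this paper.

```python
def precision_recall(retrieved, gold):
    """Return (precision, recall) as defined in Equations 7 and 8."""
    retrieved, gold = set(retrieved), set(gold)
    hits = retrieved & gold  # retrieved relevant items
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Placeholder data, for illustration only: 2 of the 4 retrieved words are in a gold list of 5.
example_retrieved = ["aprendizaje", "cognitivo", "proceso", "poder"]
example_gold = ["aprendizaje", "cognitivo", "metacognitivo", "estrategia", "memoria"]
p, r = precision_recall(example_retrieved, example_gold)
```

Here `p` is 2/4 and `r` is 2/5, which illustrates why a short, clean list of retrieved words can reach a high precision while its recall stays low.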

The metrics in Equations 1, 2, 3 and 4 are used in the literature to obtain the similarity between two sentences, but this approach requires the similarity between two words. Usually, in these metrics (Section 2.1), t₁ and t₂ represent sentence 1 and sentence 2; for example, in the Jaccard coefficient (Equation 2), ∣t₁ ∩ t₂∣ is the number of words that appear in both sentence 1 and sentence 2, and ∣t₁ ∪ t₂∣ is the number of distinct words across both sentences. For this approach, the same metrics were used but the variables were interpreted differently for the three corpus representations: t₁ is the class and t₂ is each of the vocabulary words, so the meaning of ∣t₁ ∩ t₂∣ and ∣t₁ ∪ t₂∣ depends on the corpus representation:

In Se representation:

∣t₁ ∩ t₂∣: Number of sentences in which both the class (t₁) and the word (t₂) appear. The words need not be adjacent; they only have to appear in the same sentence.

∣t₁ ∪ t₂∣: Total number of sentences in which the class (t₁) or the word (t₂) appears.

In Pa representation:

∣t₁ ∩ t₂∣: Total number of times the two words appear together, regardless of order (t₁:t₂ or t₂:t₁).

∣t₁ ∪ t₂∣: Frequency of t₁ in the corpus plus frequency of t₂ in the corpus.

In 5g representation:

∣t₁ ∩ t₂∣: Number of 5-grams in which both t₁ and t₂ appear.

∣t₁ ∪ t₂∣: Number of 5-grams in which t₁ or t₂ appears.
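The counting rules above can be sketched as follows. This is an illustration under several assumptions: the class t₁ is represented by a single lemma, 5-grams are taken as a sliding window over the lemma sequence, and "appearing together" in the Pa representation is read as being adjacent. The function names and the toy corpus are hypothetical.

```python
def five_grams(lemmas, n=5):
    """Sliding windows of n consecutive lemmas (assumed 5g construction)."""
    return [lemmas[i:i + n] for i in range(len(lemmas) - n + 1)]


def unit_counts(units, t1, t2):
    """Se and 5g: units (sentences or 5-grams) containing both terms, and either term."""
    both = sum(1 for u in units if t1 in u and t2 in u)
    either = sum(1 for u in units if t1 in u or t2 in u)
    return both, either


def pair_counts(lemmas, t1, t2):
    """Pa: adjacent occurrences in either order, and freq(t1) + freq(t2)."""
    both = sum(1 for a, b in zip(lemmas, lemmas[1:]) if {a, b} == {t1, t2})
    return both, lemmas.count(t1) + lemmas.count(t2)


# Toy lemmatized corpus, invented for illustration only.
sentences = [
    ["estilo", "aprendizaje", "activo"],
    ["estudiante", "reflexivo"],
    ["estilo", "reflexivo", "teórico"],
]
flat = ["estilo", "aprendizaje", "activo", "estudiante",
        "reflexivo", "estilo", "teórico"]

se = unit_counts(sentences, "estilo", "reflexivo")
g5 = unit_counts(five_grams(flat), "estilo", "reflexivo")
pa = pair_counts(flat, "estilo", "reflexivo")
```

The resulting pairs take the place of ∣t₁ ∩ t₂∣ and ∣t₁ ∪ t₂∣ in Equation 2; for instance, the Se counts (1, 3) give a Jaccard coefficient of 1/3. The same pair of words obtains different counts under each representation, which is why the thresholds γ differ across experiments.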

The metric in Equation 6 was also calculated using this interpretation of the variables. The PMI implementation requires a separate corpus; thus two corpora were built: the first was composed of random Wikipedia articles and the second of free books on pedagogy, philosophy and psychology. The Wikipedia articles were obtained using a Web crawler and the books were obtained manually. Table 4 shows the components of each corpus. The Wikipedia corpus is richer, but its domain is not delimited as in the books corpus; the initial hypothesis is that a corpus related to the analyzed domain could obtain better results.

Table 4

Corpora obtained for PMI

	Wikipedia	Books
Instances	174,605	113
Words	21,529,363	4,289,894
Vocabulary	498,866	116,986

5 Results

In this section, a detailed discussion about results by class and its representation is presented. The precision of each experiment using a combination of representations and metrics is evaluated. Some experiments were taken as examples, in order to analyze the list of recovered words and to describe how the precision values were obtained.

Table 5 shows the list of words retrieved in the experiment EE_Se,Te,0.16. The Te set is made up of four term-overlap metrics, so a voting system was used to determine which words are retrieved, with the value of γ as the threshold for each metric. Table 5 shows 8 words whose values are greater than 0.16 in 3 or 4 of the metrics. For example, the word conocimiento, when compared with the class learning strategies, obtained a Jaccard coefficient of 0.132, which is less than γ; however, the other three metrics had higher values, so conocimiento is considered as retrieved. Only 3 of the 8 words (estudio, proceso, poder) are not in the gold standard, so the precision for this experiment is 5/8 = 0.625. The gold standard contains 38 words related to the EE class, so the recall for this experiment is 5/38 = 0.1316. The remaining words represent a type of strategy (metacognitive) and the characteristics of these strategies in learning theory.

Table 5
Results of term based metrics in EE_Se,Te,0.16 experiment

Word	Dice	Jaccard	Overlap	Cosine
estudio	0.1655	0.0902	0.3802	0.2005
conocimiento	0.2333	0.132	0.4973	0.2753
proceso	0.2418	0.1375	0.4531	0.2733
cognitivo	0.3253	0.1943	0.6701	0.3794
metacognitivas	0.2556	0.1465	0.9271	0.3707
poder	0.1752	0.096	0.3971	0.2113
aprendizaje	0.5156	0.3474	0.5656	0.5177
estudiante	0.3134	0.1858	0.438	0.3269
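The voting step for experiment EE_Se,Te,0.16 can be retraced from the values in Table 5. The sketch below assumes the rule stated in the text, namely that a word is retrieved when at least 3 of its 4 term-based metrics exceed γ; the variable names are hypothetical, and the list of words outside the gold standard and the gold standard size of 38 are taken from the discussion of this experiment.

```python
GAMMA = 0.16
MIN_VOTES = 3  # assumed rule: at least 3 of the 4 metrics above gamma

# (Dice, Jaccard, Overlap, Cosine) values copied from Table 5.
table5 = {
    "estudio": (0.1655, 0.0902, 0.3802, 0.2005),
    "conocimiento": (0.2333, 0.132, 0.4973, 0.2753),
    "proceso": (0.2418, 0.1375, 0.4531, 0.2733),
    "cognitivo": (0.3253, 0.1943, 0.6701, 0.3794),
    "metacognitivas": (0.2556, 0.1465, 0.9271, 0.3707),
    "poder": (0.1752, 0.096, 0.3971, 0.2113),
    "aprendizaje": (0.5156, 0.3474, 0.5656, 0.5177),
    "estudiante": (0.3134, 0.1858, 0.438, 0.3269),
}


def votes(values, gamma):
    """Number of metrics whose value is strictly greater than gamma."""
    return sum(v > gamma for v in values)


retrieved = [w for w, vals in table5.items() if votes(vals, GAMMA) >= MIN_VOTES]

# Words reported in the text as absent from the gold standard.
not_in_gold = {"estudio", "proceso", "poder"}
gold_size = 38  # gold standard words for the EE class, as reported in the text

relevant = [w for w in retrieved if w not in not_in_gold]
precision = len(relevant) / len(retrieved)
recall = len(relevant) / gold_size
```

All eight words pass the vote, five of them are relevant, and the resulting values round to the reported precision of 0.625 and recall of 0.1316. Only cognitivo, aprendizaje and estudiante exceed γ on all four metrics, since the Jaccard coefficient is the one that most often falls below the threshold.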

For the experiments TI_Pa,Te,0.02 and EA_Pa,Te,0.04, the precision was 1.0 and the recall 0.17, retrieving 11 and 6 words respectively. For the intelligence types class, all the words corresponding to the types of intelligences are retrieved, including some of their possible combinations. In the case of learning styles, three of the four types mentioned in the literature are retrieved. All these words are included in the gold standard and are listed below:

TI_Pa,Te,0.02: Intrapersonal, lingüísticaverbal, inteligencia, musical, naturalista, interpersonal, múltiple, lógicomatematica, espacial, emocional, lingüístico.

EA_Pa,Te,0.04: teórico, aprendizaje, estudiante, reflexivo, activo.

Table 6 shows the precision using the Te metrics; the γ values ranged from 0.01 to 0.2, in intervals of 0.01. Table 6 only shows the results up to γ = 0.16, since for higher values the result is 0 for all experiments.

Table 6

Precision in the experiments using terms representation

γ	EA_5g,Te	EA_Se,Te	EA_Pa,Te	EE_5g,Te	EE_Se,Te
0.01	0.1899	0.0718	0.7059	0.1325	0.0527
0.02	0.3684	0.1391	0.7692	0.2449	0.0781
0.03	0.6316	0.1720	0.8000	0.2400	0.1093
0.04	0.6875	0.2188	1.0000	0.3333	0.1317
0.05	0.7273	0.2745	1.0000	0	0.1545
0.06	0.8000	0.3171	0	0	0.1744
0.07	0.8333	0.3871	0	0	0.2344
0.08	0	0.4583	0	0	0.2979
0.09	0	0.5556	0	0	0.2778
0.10	0	0.6000	0	0	0.2903
0.11	0	0.8182	0	0	0.3077
0.12	0	0.7778	0	0	0.3333
0.13	0	0.7500	0	0	0.3750
0.14	0	0.7143	0	0	0.5455
0.15	0	0	0	0	0.6000
0.16	0	0	0	0	0.6250

γ	EE_Pa,Te	TI_5g,Te	TI_Se,Te	TI_Pa,Te
0.01	0.2174	0.2105	0.1017	0.6842
0.02	0	0.4375	0.1593	1.0000
0.03	0	0.6364	0.1921	1.0000
0.04	0	0.6429	0.2743	0
0.05	0	0.6667	0.3553	0
0.06	0	0.6250	0.3966	0
0.07	0	0	0.4390	0
0.08	0	0	0.5000	0
0.09	0	0	0.5652	0
0.10	0	0	0.5500	0
0.11	0	0	0.5000	0
0.12	0	0	0.6364	0
0.13	0	0	0.7143	0
0.14	0	0	0	0
0.15	0	0	0	0
0.16	0	0	0	0

A condition on the number of retrieved words is also applied; for example, experiment EE_5g,Te,0.16 recovered two words, of which only one is relevant. This represents a precision of 0.5, but one word is not enough to be considered as a concept in an ontology, so the precision was taken as 0. The experiment TI_Se,Te,0.16 had a precision of 1.0, retrieving three words, all of which are relevant; however, because of the small number of words recovered, its precision was also taken as 0.

For the learning styles class, the highest precision was obtained by EA_Pa,Te,0.04, with precision 1 and 6 correctly recovered words, among which are the 4 types of learning styles and the authors who created the questionnaire to detect them. In the class of teaching strategies, the best experiment was EE_Se,Te,0.16, with a precision of 0.625. The precision increased with γ up to this value; beyond it (γ = 0.17), no words were recovered.

Finally, for the intelligence types class, the best results corresponded to TI_Pa,Te,0.03 and TI_Pa,Te,0.02, with precision 1.0. In these experiments, six and eleven words were recovered respectively, all associated with the concept of intelligence and the types of intelligences. For the three classes, the sentence representation (Se) has fewer experiments with precision 0, but its remaining precisions are lower than those of 5g and Pa, except for the class of teaching strategies. In this class, a γ of 0.16 was needed to obtain the highest results, while in the other two classes they were obtained with much lower γ values.

Table 7 shows the results obtained using the PMI metric with the pedagogy books corpus (Pb). The values of γ range from -0.5 to 0.5 in intervals of 0.05. The table only shows the results up to γ = 0.3, since for larger values the result is 0 in all experiments. In these experiments the results did not vary much as γ increased, so some values are omitted where the results are the same.

Table 7

Precision in the experiments using Pb representation

γ	EA_5g,Pb	EA_Se,Pb	EA_Pa,Pb	EE_5g,Pb	EE_Se,Pb
-0.50	0.0205	0.0191	0.0262	0.0259	0.0239
-0.35	0.0197	0.0182	0.0262	0.0259	0.024
-0.25	0.0201	0.0186	0.0263	0.0237	0.0219
-0.20	0.0206	0.0188	0.0258	0.0205	0.0196
-0.15	0.0233	0.0208	0.0248	0.0186	0.0182
-0.10	0.0265	0.0228	0.0257	0.0198	0.0183
-0.05	0.0302	0.0274	0.0286	0.0243	0.02
0.00	0.0330	0.0347	0.0275	0.0256	0.0215
0.05	0.0481	0.0425	0.0371	0.027	0.0223
0.10	0.0909	0.0796	0.0571	0.0417	0.0302
0.15	0.1429	0.1169	0.0701	0	0
0.20	0.2917	0.1944	0.1013	0	0
0.25	0	0	0	0	0
0.30	0	0	0	0	0

γ	EE_Pa,Pb	TI_5g,Pb	TI_Se,Pb	TI_Pa,Pb
-0.50	0.0357	0.0371	0.0349	0.0554
-0.35	0.0357	0.0371	0.0349	0.0554
-0.25	0.0357	0.0321	0.0307	0.0554
-0.20	0.0357	0.0308	0.0281	0.0554
-0.15	0.0358	0.0301	0.028	0.0555
-0.10	0.0340	0.0336	0.0297	0.0528
-0.05	0.0308	0.0300	0.0266	0.0469
0.00	0.0242	0.0385	0.0290	0.0443
0.05	0.0196	0.0431	0.0374	0.0496
0.10	0.0191	0.0667	0.0504	0.0517
0.15	0	0.1149	0.0902	0.0751
0.20	0	0.1458	0.1333	0.1034
0.25	0	0.2500	0.1860	0.1739
0.30	0	0	0	0.2414

The intelligence types class kept retrieving words as γ increased, while the class of teaching strategies had zeros from γ = 0.15. For the class of learning styles the highest precision was 0.29, with the experiment EA_5g,Pb,0.20, and for the class of intelligence types the highest precision was 0.25, with the experiment TI_5g,Pb,0.25. The class of teaching strategies obtains its highest value in EE_5g,Pb,0.10, but the precision barely reaches 0.041. Using the Pb metrics, the results decreased in all the representations with respect to the Te metrics, although an improvement relative to Table 6 can be noticed for the 5g representation.

Table 8 shows the results obtained using the PMI metric with the Wikipedia corpus (Pw); the range of γ is the same as that used for the Pb metrics. Values are omitted where the results do not vary with respect to the previous row. In general, the results were lower than in the experiments with Pb. For the class of learning styles the highest precision, 0.1951, was obtained with the experiment EA_5g,Pw,0.20; for the learning strategy class the highest precision was 0.0286 with EE_5g,Pw,0.05; and for the class of intelligence types the highest precision was 0.1778 with TI_Pa,Pw,0.25. The class of intelligence types was the only one that obtained its best results using the pairs representation, while the other two classes obtained them with the 5-grams representation.

Table 8

Precision in the experiments using Pw representation

γ	EA_5g,Pw	EA_Se,Pw	EA_Pa,Pw	EE_5g,Pw	EE_Se,Pw
-0.50	0.0339	0.0343	0.0403	0.0224	0.0253
-0.15	0.0339	0.0344	0.0403	0.0224	0.0254
-0.05	0.0339	0.0359	0.0406	0.0224	0.0254
0.00	0.0388	0.0407	0.0425	0.0236	0.0244
0.05	0.0567	0.0439	0.0491	0.0286	0.0233
0.10	0.0804	0.0503	0.0466	0	0.0177
0.15	0.1429	0.0599	0.0556	0	0
0.20	0.1951	0.0885	0.0621	0	0
0.25	0	0.1522	0.1139	0	0

γ	EE	TI	TI	TI
	Pa, Pw	5g, Pw	Se, Pw	Pa, Pw
-0.50	0.0240	0.0352	0.0357	0.0475
-0.15	0.0240	0.0352	0.0358	0.0475
-0.05	0.0243	0.0352	0.038	0.0483
0.00	0.0245	0.0398	0.0346	0.0493
0.05	0.0212	0.0405	0.0411	0.0468
0.10	0.0174	0.0634	0.0375	0.0579
0.15	0.0211	0	0.0465	0.0701
0.20	0	0	0	0.0982
0.25	0	0	0	0.1778
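The effect of the threshold γ on a PMI-based metric can be pictured with a small sketch. It is only an illustration under stated assumptions: the exact PMI variant and corpus statistics used in the experiments are not restated in this section, so the sketch uses normalized PMI (bounded in [-1, 1]) estimated from sentence-level co-occurrence, and the names `npmi_table` and `related_words` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def npmi_table(sentences):
    """Normalized PMI for every word pair that co-occurs in a sentence.

    Assumption: probabilities are estimated as the fraction of sentences
    containing a word (or both words of a pair).
    """
    n = len(sentences)
    word_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        words = set(sentence)
        word_count.update(words)
        pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))

    scores = {}
    for pair, c_xy in pair_count.items():
        x, y = tuple(pair)
        p_xy = c_xy / n
        pmi = math.log(p_xy / ((word_count[x] / n) * (word_count[y] / n)))
        # When the pair occurs in every sentence, -log(p_xy) is 0; the
        # score is set to 0.0 here to avoid dividing by zero (assumption).
        scores[pair] = pmi / -math.log(p_xy) if p_xy < 1 else 0.0
    return scores


def related_words(scores, seed, gamma):
    """Words whose score with `seed` is at least the threshold gamma."""
    related = set()
    for pair, score in scores.items():
        if seed in pair and score >= gamma:
            related.update(pair - {seed})
    return related
```

Raising γ can only shrink the retrieved list, which is consistent with the zeros that appear in the tables at the highest thresholds.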

The precision obtained with both the Pw and Pb representations is much lower in the three classes than with the Te metrics. However, comparing the vocabulary of the corpora used in PMI with the initial corpus, some words were found not to be shared despite being included in the gold standard. Figure 4 shows the number of gold words that do not appear within the pairs of words recovered with the Pw and Pb metrics. It can be seen that the number of words that do not appear within the Wikipedia corpus is much greater than the number of words that do not appear in the corpus of books, especially in the representation of pairs of words. The class of intelligence types was the one with the most words missing from the corpus; however, this class is one of those with the highest precision in the representations, especially in 5g. Analyzing the representations, Pa has the largest number of words that do not appear in the gold; this is explained by the method used to obtain the representations. The Pa representation only relates two words when they appear together in the corpus, while the Se and 5g representations relate two words if they appear in the same sentence and if they are at a distance of four or fewer words, respectively.
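The difference between the three representations can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that "appear together" means adjacent tokens, that distance is measured in token positions within a lemmatized sentence, and the function names are hypothetical.

```python
from itertools import combinations


def pairs_adjacent(sentence):
    """Pa: relate two words only when they are next to each other."""
    return {
        tuple(sorted((left, right)))
        for left, right in zip(sentence, sentence[1:])
        if left != right
    }


def pairs_window(sentence, max_distance=4):
    """5g: relate two words at a distance of max_distance or fewer positions."""
    related = set()
    for i, left in enumerate(sentence):
        for right in sentence[i + 1 : i + 1 + max_distance]:
            if left != right:
                related.add(tuple(sorted((left, right))))
    return related


def pairs_sentence(sentence):
    """Se: relate every two distinct words that share a sentence."""
    return {
        tuple(sorted((left, right)))
        for left, right in combinations(sentence, 2)
        if left != right
    }
```

For a sentence of six distinct lemmas these functions yield 5, 14 and 15 pairs respectively, so Pa is the most restrictive representation, which fits the observation that it leaves out the largest number of gold words.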

Fig.4

Number of words in the gold standard that do not appear in the Wikipedia corpus and books corpus.

Figure 5 presents the results of all the experiments for the Pw and Pb representations. For these metrics, all the results are in the second and third quadrants; the experiments corresponding to the Pa representation are the closest to the vertical axis, so precision is high although recall is low. The experiments using the 5g representation show some results slightly farther from this axis.

Fig.5

Precision and recall in Pb and Pw representations.

Figure 6 shows the results of all the Te set experiments, separated by representation and class. In all the graphs the vertical axis represents recall and the horizontal axis represents precision. Graph 6a shows the 5g representation, where most of the experiments of the classes TI and EE fall in the third quadrant of the graph (low recall and precision), while most of the results for the class EA fall in the fourth quadrant (high precision, low recall). Only the EA class has a decreasing trend in terms of the results of these two metrics, eventually obtaining very high precision but with few words recovered from the total.

Fig.6

Precision and recall in Te representation by class.

Graph 6b corresponds to the Se representation, where the EA class mostly remains in the fourth quadrant, although a wider distribution along quadrants II, II, and IV is notable. Classes EE and TI presented several experiments with a very high recall, even 1.0, but with low precision. Therefore, the recovery of words is high, but words that do not belong to the gold standard are also recovered.

Graph 6c corresponds to the Pa representation, where most of the results are the same for different values of γ or have a value of 0 when fewer than 5 words are recovered. Comparing the two axes of the graph, the EA class has the best results, and the EE class is the one with the largest number of experiments at the origin (0, 0) of the graph or very close to it.

In the experiments using Pw and Pb, precision and recall do not exhibit a linear relationship, since when one metric is high the other is low. Nevertheless, with the Te set some experiments manage to balance the results of these two metrics, especially for the class EA in the 5g and Se representations.
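The evaluation of each experiment against the gold standard can be sketched as below. This is a hedged illustration: the function name and the `min_retrieved` parameter are hypothetical, and the rule of reporting 0 when fewer than 5 words are recovered is one reading of the description of Graph 6c, not a confirmed detail of the implementation.

```python
def precision_recall(retrieved, gold, min_retrieved=5):
    """Precision and recall of a retrieved word list against a gold standard.

    Assumption: experiments that recover fewer than `min_retrieved` words
    are reported as (0.0, 0.0).
    """
    retrieved, gold = set(retrieved), set(gold)
    if len(retrieved) < min_retrieved or not gold:
        return 0.0, 0.0
    hits = len(retrieved & gold)
    # Precision: share of retrieved words that are gold words.
    # Recall: share of gold words that were retrieved.
    return hits / len(retrieved), hits / len(gold)
```

A long list that contains every gold word gives a recall of 1.0 with low precision, as seen for EE and TI in the Se representation, while a short list of mostly correct words gives the opposite pattern.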

6 Conclusions and future work

In this paper, experiments to determine the effect of similarity metrics on the detection of the principal concepts of an ontology were presented. The experiments were carried out with three different representations of a corpus and with different thresholds for these metrics. The list of words retrieved by each experiment was then compared with a gold standard built with the help of an expert in the domain.

The principal contribution of this paper is the use of Information Retrieval, evaluated using precision and recall. These measures were calculated for word lists obtained with modified similarity metrics and a threshold on the similarity between two concepts. In the analysis by superclass, the best results were reached for the intelligence types and the worst for the teaching strategies.

The types of intelligences are the most theoretically supported in the literature; thus, the words in the gold standard are more related to each other. Teaching strategies are not universal among authors, since different names and levels of depth are handled in their classifications. Regarding learning styles, although they are universally defined, there are only 4 styles, and the remaining words belong to substyles and concepts that describe them, so it is more difficult to detect these terms automatically.

With respect to the representations, 5g is the one that presents the best results, and the set of overlapping metrics showed better results for all classes. Although the results of Pw and Pb were lower, they could explain the relationship that exists between the corpus and the domain to be used. Although Pw has more instances and vocabulary than Pb, it obtained the lowest results, whereas Pb has few instances, but all of them are directly related to the pedagogical domain.

As future work, these experiments will be formalized in a methodology for principal concept extraction in order to determine the relationships between the concepts. In addition, an analysis of fuzzy classes will be carried out according to the word lists retrieved in these experiments.

References

1. Al-Shamri M.Y.H., Power coefficient as a similarity measure for memory-based collaborative recommender systems, Expert Systems with Applications 41(13), 2014.

2. Alemán, Somodevilla M.J. and Vilariño, A class validation proposal of a pedagogic domain ontology based on clustering analysis, Computer and Information Science 11(1), 2018.

3. Ameen, Khan K.U.R., Rani B.P., Creation of ontology in education domain, 2012 IEEE Fourth International Conference on Technology for Education, 2012.

4. Bagiampou, Kameas, A use case diagrams ontology that can be used as common reference for software engineering education, 2012 6th IEEE International Conference Intelligent Systems, 2012.

5. Barriga, Hernández, Estrategias docentes para un aprendizaje significativo. Una interpretación constructivista, McGraw-Hill, 2004.

6. Chew P.A. and Robinson D.G., Automated account reconciliation using probabilistic and statistical techniques, International Journal of Accounting & Information Management 20(4), 2012.

7. Copara, Ochoa, Thorne, Glavas, Exploring unsupervised features in conditional random fields for Spanish named entity recognition, 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 2016.

8. Dai, Li, Study of learning source ontology modeling in remote education, 2010 International Conference on Multimedia Technology, 2010.

9. [...], Zheng, You, Bai, Zhang, Research of online education ontology model, 2012 Fourth International Conference on Computational and Information Sciences, 2012.

10. Faria, Girardi, A domain-independent process for automatic ontology population from text, Science of Computer Programming 95 Part 1, 2014.

11. [...], Jia, Xu, Domain ontology learning for question answering system in network education, 2008 The 9th International Conference for Young Computer Scientists, 2008.

12. García-Miguel J.M., Vaamonde, Domínguez F.G., ADESSE, a Database with Syntactic and Semantic Annotation of a Corpus of Spanish, Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC10), 2010.

13. Gardner, Estructuras de la Mente, Fondo de Cultura Económica, 2001.

14. Gomaa, Fahmy, A survey of text similarity approaches, International Journal of Computer Applications, 2013.

15. Gruber T.R., Toward principles for the design of ontologies used for knowledge sharing, International Journal of Human-Computer Studies, 1995.

16. Hssina, Bouikhalene, Merbouha, An ontology to assess the performances of learners in an e-learning platform based on semantic web technology: Moodle case study, Europe and MENA Cooperation Advances in Information and Communication Technologies, 2017.

17. [...], Li, Xu, An approach of ontology based knowledge base construction for chinese K12 education, 2016 First International Conference on Multimedia and Image Processing (ICMIP), 2016.

18. Huang C.H., Yin, Hou, A text similarity measurement combining word semantic information with tf-idf method, Chinese journal of computer, 2011.

19. Jamaludin N.A., Annamalai, Jamil, Bakar Z.A., A model for keyword profile creation using extracted keywords and terminological ontology, 2013 IEEE Conference on e-Learning, e-Management and e-Services, 2013.

20. Kang Y.B., Haghighi P.D. and Burstein, Cfinder: An intelligent key concept finder from text for ontology development, Expert Systems with Applications 41(9), 2014.

21. Kaushik, Chatterjee, A practical approach for term and relationship extraction for automatic ontology creation from agricultural text, 2016 International Conference on Information Technology (ICIT), 2016.

22. Kolb, Learning style inventory, 1976.

23. Mahesh, Ontology development for machine translation: Ideology and methodology, Computing Research Laboratory New Mexico State University, 1996.

24. Méndez N.D.D., Carranza D.A.O. and Ocampo M.G., Representación ontológica de perfiles de estudiantes para la personalización del aprendizaje, Revista Educación en Ingeniería 10(19), 2015.

25. Ochoa J.L., Hernández-Alcaraz M.L., Almela, Valencia-García, Learning semantic relations from Spanish natural language documents in the financial domain, Proceedings of the 3rd International Conference on Computer Modeling and Simulation, 2011.

26. Ochoa-Hernández J.L., Desarrollo de una metodología para la construcción automática de ontologías en español a partir de texto libre, Ph.D. thesis, Departamento de Ingeniería de la información y las comunicaciones, Universidad de Murcia, 2011.

27. Oktavia, Meyliana, Prabowo, Kosala, Supangkat S.H., A conceptual social learning ontology for higher education in e-learning 2.0, 2016 International Conference on Information Management and Technology (ICIMTech), 2016.

28. Olivos, Santos, Martín, Cañas, Gómez and Maya, The relationship between learning styles and motivation to transfer of learning in a vocational training programme, Suma Psicológica 23(1), 2016.

29. Sammour, Al-Zoubi, Gladun, Khala, Schreurs, Semantic web and ontologies for personalization of learning in moocs, 2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS), 2015.

30. Schmid, Probabilistic part-of-speech tagging using decision trees, International Conference on New Methods in Language Processing, 1994.

31. Somodevilla M.J., Mena, Pineda I.H., Celis M.C.P., Deducting lifestyle patterns by ontologies SWRL rules, 26th International Workshop on Database and Expert Systems Applications (DEXA), 2015.

32. Uskov, Pandey, Bakken J.P. and Margapuri V.S., Smart engineering education: The ontology of internet-of-things applications, 2016 IEEE Global Engineering Education Conference (EDUCON), 2016.

33. Wang, Methodology research of ontology building in semantic web, Computer and Information Science 3(4), 2010.

34. Weigand, A multilingual ontology-based lexicon for news filtering-the TREVI project, Proceedings of the IJCAI Workshop on Multilingual Ontologies-Nagoya, 1997.