Ontological knowledge inferring approach: Introducing Directed Collocations (DC) and Joined Directed Collocations (JDC)

Abstract

The growing need to convert unstructured knowledge embedded in open-domain natural language text into machine-processable forms requires that the structured knowledge, which is hard to extract, be inducted into knowledge bases; this is what makes the Semantic Web vision a reality. In this context, ontologies and ontological knowledge (triples) play a vital role. This research introduces two novel concepts, Directed Collocation (DC) and Joined Directed Collocation (JDC), along with a methodical application of them to infer new ontological knowledge. The introduced Quality-Threshold-Value (QTV) parameter improves the quality of the inferred ontological knowledge. With a moderate QTV of 3, this approach inferred 95,491 new ontological knowledge facts from 43,100 triples of an open-domain Sri Lankan English news corpus. The outcome is thus approximately twice the size of the source corpus. Some inferred ontological knowledge was identical to the original corpus content, which is evidence of the accuracy of this approach. The remainder was validated using the inter-rater agreement method (high reliability), and around 56% of it was estimated to be effective. The inferred outcome, which is in the triple format, may be used in any knowledge base. The proposed approach is domain independent and thus helps to construct or extend ontologies for any domain with little or no help from human specialists.

Keywords

Semantic web, natural language processing, ontological knowledge, knowledge bases, collocation, triple

1. Introduction/background

Unlike in the past, access to information is now advanced and widespread. With the support of the internet, users can remotely access information generated anywhere in the world, 24/7. Though this provides a good opportunity for knowledge acquisition, human cognitive capability and processing power are limited. Therefore, the mass volume of knowledge encapsulated in sources such as web data and e-news is not effectively utilized, owing to information overload.

This situation is a real motive for Automatic Knowledge Extraction (AKE), which has become crucial and beneficial for humans. AKE is a non-trivial, highly challenging area of research in the Natural Language Processing (NLP) domain of computer science. Ontologies and knowledge bases are part and parcel of AKE, as structured knowledge that is hard to extract (e.g., RDF/triples) should be organized into knowledge bases so that it becomes machine processable. That is the vision of the Semantic Web.

1.1 Ontological knowledge

An ontology is a shared structure/schema or vocabulary that models the types of objects/concepts, their properties, and relations. According to Thomas R. Gruber, it is an explicit specification of a conceptualization [1]. It can act as a repository/knowledge base that stores real-world instances of such objects and relationships, allowing end users to reason over and infer knowledge.

Resource Description Framework (RDF) is a structured semantic pattern that consists of Subject, Predicate and Object components (a triple), in that order. RDF is a universal language that lets users describe resources (objects and relationships) in their own vocabularies, and it is the main construct of ontologies. RDF triples are of two types: some represent the schema layer of the ontology, while the others represent instances of the schema layer [2].

Therefore, RDF helps in constructing both the schema (ontology) and the instances (knowledge base) of the ontology. Henceforth, the two terms “ontological knowledge” and “ontological knowledge facts” are used interchangeably in this paper, both in a plural sense, to refer to RDF (both RDF schema and RDF instance types).

1.2 Problem formation

The use of ontologies is broad. Ontology development is a useful approach in the design and implementation of interoperable and multi-agent systems [3]. Ontology-based expert systems are widely used in many industry domains. Ontology-driven knowledge management systems help organize and manage a firm’s knowledge efficiently [4]. Machine learning and the construction of very large knowledge graphs have accompanied a proliferation of ontologies for many purposes, such as Semantic Web applications, business reporting and artificial intelligence. Ontologies can be extracted, learned, modularized, interrelated, transformed, analyzed, and harmonized, as well as developed in a formal process which can be manual or automated [5]. Domains are continuously updated, but ontologies do not automatically evolve to reflect such changes. To address this limitation, researchers have attempted automatic ontology generation from unstructured text corpora. Unfortunately, methods that aim to generate ontologies from unstructured text are domain-specific and require manual intervention. They also suffer from uncertainty in creating concept relationships and difficulty in finding axioms for the same concept [6]. Human intervention is often required to improve the quality of the ontologies, because the consensus and high-level abstraction involved require human cognitive processing. This makes fully automating ontology learning impossible. Another important issue is the scalability of ontology learning techniques. Extracting knowledge from the growing amounts of data on the Web in different formats requires scalable and efficient approaches [7].

Considering such background facts, it must be acknowledged that the biggest challenge in ontology construction is to gather appropriate ontological knowledge. These attempts are mostly carried out through manual processes which indeed consume more labor and time. In addition, scalability issues and being domain specific are vital concerns to be addressed. These turned out to be the real motivation for this research.

The primary objective of this research is to introduce a novel mechanism to automatically infer new ontological knowledge by utilizing the limited number of existing triples/ontological knowledge. The new ontological knowledge facts are inferred based on the existing ontological knowledge, using a novel approach named Joined Directed Collocation (JDC).

The proposed approach is generic: it can be applied to triples irrespective of their domain to infer new ontological knowledge. Hence, it can be used in developing either domain-specific or open-domain ontologies. Because the method infers a large amount of new ontological knowledge, it can be used to construct new ontologies or to extend existing ones.

2. Related work

The existing literature provides evidence for the importance of knowledge representation schemas that make extracted knowledge accessible to machines. Ontologies, knowledge graphs, frames and the Semantic Web are some of the knowledge representation schemas/frameworks that have become popular today. Although the ontology is by now a mature form of knowledge representation, the usability and application of ontologies have not declined.

Ontology learning is a research field that brings together the technologies of machine learning (ML), data and text mining, Natural Language Processing and knowledge representation. The review “Automatic ontology construction from text: a review from shallow to deep learning trend” covers various methods, application systems, and the difficulties of automated ontology construction from unstructured natural language text. It also highlights ways in which the ontology construction process could be enhanced, presenting techniques from shallow learning to deep learning [8]. Ontology learning frameworks exist that connect with ontology engineering tools, such as the TextToOnto framework on the KAON ontology engineering environment, OntoLT on Protégé and Text2Onto on the NeOn Toolkit. Full automation of ontology learning and construction is not yet feasible, and human intervention is vital [9]. Ontology 101 is an ontology development methodology for declarative frame-based systems [10]. It lists the main steps in the ontology development process while addressing issues that arise when defining class hierarchies and the properties of classes and instances. The highlighted steps include defining classes in the ontology, arranging the classes in a taxonomic hierarchy, defining slots and describing allowed values for those slots, and finally filling in the slot values for instances. Another well-structured methodology for building ontologies identifies a set of activities in the ontology development process: planify, specify, acquire knowledge, conceptualize, formalize, integrate, implement, evaluate, document, and maintain. It further suggests techniques to carry out at each stage, along with the deliverables, and highly recommends the reuse of existing ontologies [11].

The paper titled “Methods and Tools for automatic construction of Ontologies from textual resources” presents a comparative analysis of several ontology construction approaches and their associated tools [12]. The authors first compared four approaches for the automatic construction of ontologies from textual resources, all based on the general domain-independent ontology construction framework “METHONTOLOGY”: OntoLearn [13], Alvis [14], Text2Onto [15] and SPRAT [16]. In these four approaches, the following algorithmic techniques are generally used for the first three steps. Step 1 builds a glossary of terms using linguistic and statistical techniques. Step 2 builds concept taxonomies using structural and contextual techniques. Structural techniques utilize the structure of a term representing a concept. They can be based on the syntax of the term (e.g., domain ontology subsumes ontology) or on the morphology of the term (e.g., blood mononuclear cell as a variant of blood cell). A context is usually defined as a vector representing syntactic dependencies between the term that represents the concept and other surrounding terms. Contextual techniques are based on distributional and clustering techniques and on pattern-based techniques. Step 3 identifies ad-hoc relations using pattern-based techniques, external resources, and distributional and clustering techniques. The four approaches mentioned above use hybrid methods that select an appropriate combination of these techniques. Yet Another Methodology for Ontology (YAMO) proposes a new method to develop ontologies. That research used facet analysis and an analytico-synthetic classification approach, conceptualized by Ranganathan in 1997, and explained the methodology step by step using an example from the large-scale food domain [17].

Some research studies have attempted to extend existing ontologies by automatically extracting information from unstructured sources such as the Web. SOFIE: A Self-Organizing Framework for Information Extraction is one such attempt [18]. In this research, the existing ontology was used to extract new knowledge, and the extracted knowledge was then used to extend the same ontology. The approach parses the unstructured text and identifies strong positive and negative patterns as ontological facts and hypotheses that can be mapped into clause form for a Weighted MAX SAT solver. The authors introduce a set of rules, and SOFIE aims to find the hypotheses that should be accepted as true so that a maximum number of rules is satisfied (using the Weighted MAX SAT algorithm). The FMS algorithm is then run, assigning truth values to the hypotheses. Subsequently, the true hypotheses are accepted as new facts of the ontology. In this way, the ontology can be expanded. SOFIE combines pattern matching, word sense disambiguation and ontological reasoning in its model. This unsupervised and self-organizing approach showed fairly good precision and recall. Though it is a general-purpose knowledge extraction approach, its performance can be improved by customizing the rules for a specific type of input corpus.

A robust ontology design from natural language texts is presented in [19], using Discourse Representation Theory (DRT), linguistic frame semantics and ontology design patterns. The paper describes an implementation of ontology design from natural language texts that was developed for the Open Knowledge Extraction Challenge (OKE2015). The system, named FRED, defines a mapping between DRT and Resource Description Framework (RDF)/Web Ontology Language (OWL) while producing quality linked data and ontologies. The system architecture combines the best existing tools in a novel way to achieve good performance. It uses Boxer, a deep parser that produces linguistic frames in DRS. Boxer has been enhanced to identify the most appropriate linguistic frames when complex relations occur in the input text. Boxer uses VerbNet, which is linked to FrameNet. These mappings enable frame detection without training. Remarkably low computational time was observed in the evaluations.

DBpedia is a large-scale, multilingual ontology and knowledge base that makes a remarkable contribution to the Semantic Web [20]. The research article describes DBpedia as a community project that extracts structured knowledge from Wikipedia. DBpedia extracts knowledge from 111 language editions of Wikipedia and makes it freely available through the Semantic Web and Linked Data [21]. It is served on the Web in three forms: 1) downloadable data sets, where each data set contains the results of one of its extractors (e.g., abstract, disambiguation, geo coordinates, infobox, image, mappings); 2) a public SPARQL endpoint; 3) dereferenceable URIs according to the Linked Data principles. DBpedia has an ontology and a knowledge base. It maps Wikipedia infoboxes to this single shared ontology, and this mapping has been created through a worldwide crowdsourcing effort. The project publishers release the knowledge base for download and for access via SPARQL queries. DBpedia sets RDF links to external data sources and has now become one of the central interlinking hubs in the Linked Open Data (LOD) cloud. DBpedia provides many data sets for NLP tasks. Among them, topic signatures can be useful in tasks such as query expansion or document summarization.

When Open Information Extraction (OIE) is considered, the Semantic Web is an important topic to discuss. It is an attempt to make the data in web pages structured and tagged, so that machines can understand the internet. Though the Semantic Web is part and parcel of OIE, classifying it as a knowledge representation framework seems more appropriate. The World Wide Web (WWW) is for people, and its content is meaningful to humans. The key intention of the Semantic Web is to translate the current content of the WWW into a format that is meaningful to computers. Several machine-readable languages have been developed for this purpose, such as XML, RDF, DAML, OIL and SHOE. The main target of the Open Knowledge Extraction (OKE) challenge was to bridge the gap between Knowledge Extraction and the Semantic Web/Linked Data. The tasks were therefore designed to automatically extract structured content from textual data and represent it as Linked Data [22]. There were two tasks in the challenge. Task 1 was Entity Recognition, Linking and Typing for Knowledge Base population. Task 2 was Class Induction and entity typing for Vocabulary and Knowledge Base enrichment. The evaluation was done through a tool designed to measure the precision, recall and F-measure of candidate systems and benchmark them. The winner of Task 1 was Adel, which achieved a micro F1 of 0.6075 and a macro F1 of 0.6039 (micro precision = 0.6938, macro precision = 0.685). The winner of Task 2 was CETUS-FOX, which achieved a micro F1 of 0.4735 and a macro F1 of 0.4478 (micro precision = 0.4455, macro precision = 0.4182). According to the evaluation results, OKE/OIE still requires further research.

The authors of the paper [23] presented a framework for transforming unstructured documents into machine-readable forms using existing tools. They highlighted that the existing Web architecture consists of three separate layers, namely the Syntactic (HTML/XML), Semantic (RDF) and Ontology layers, and that answering queries is therefore difficult. To overcome this issue, they proposed a model that combines these layers by giving meaning to the words in the document (word semantics). Various semantics were added to each word, such as its stem, POS, named entity tag and WordNet synset ID (ontology and knowledge base), using a pre-defined format. They also defined a new query format that extracts information from documents in this enhanced format.

Knowledge Graphs (KGs) can be used as abstractions to represent and share knowledge in a structured way [24]. Their schema-less nature allows the graph to grow by adding new entities and relationships seamlessly. The knowledge graph has become a powerful tool for representing knowledge in the form of a labelled directed graph and for giving semantics to textual information [25]. Another research study introduces the concepts related to knowledge representation and analyzes the knowledge representation of knowledge graphs, mainly covering several typical domain knowledge graphs. It highlights knowledge representation that accommodates the differences among entities, relationships, and properties [26].

The paper titled “A Semi-automated Ontology Construction for Legal Question Answering” proposes a methodology for constructing legal Web Ontology Language (OWL) ontologies with Semantic Web Rule Language (SWRL) rules. Its authors have also developed a Legal NER system to identify legal named entities. Further, the paper provides evidence that ontologies remain viable in many different industry domains [27].

“Review for Ontology Construction from Unstructured Texts by using Deep Learning” is a study that introduces some recent representative technical research on ontology construction [28]. Among its findings, the observation that “the corpus acquisition is the most fundamental and important in ontology learning” motivated us to create our own Sri Lankan news corpus. Another finding, “realizing the importance of word representations in the construction of ontology”, influences approaches similar to ours that are based on syntactic collocation. A new agricultural ontology has been developed using the Jaccard relative extractor (JRE) and Naïve Bayes algorithms [29]. JRE identifies the similarity between two sentences/words in the agricultural documents, and the relationship between two terms is identified via the Naïve Bayes algorithm. The results show high precision (94.4%), outperforming the decision tree and K-nearest neighbor algorithms. However, this study also falls under the domain-specific category. Another research study has proposed a platform for rapid domain ontology construction from unstructured data [30]. Its authors showed that their approach works well for specific domains and taxonomic relationships, but performs poorly on open-domain text and non-taxonomic relationships.

Giving a definition for the collocation in linguistics, the paper [31] highlighted the fact, “The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure.” The study focused on extracting collocations from the “Collins-Robert English-French. French-English dictionary”. The author has made a comparison between the collocation knowledge extracted from the dictionary and the similar data retrieved from statistical processing. Results revealed that both methods are complementary and mutually enriching.

A syntactic collocation is characterized by the presence of a syntactic link between the collocation’s items [32]. Keeping that characteristic as a minimum syntactic constraint, the authors induced collocation patterns in a data-driven fashion with the support of a parser. Rather than relying on pre-defined patterns, as in other similar research, they consider any POS combination that has a syntactic link in the parser output.

The study [33] examined whether syntactic co-occurrence can help to trace phraseological development in foreign language learner writing. The authors experimented on Verb + Direct Object structures using a dataset from French-speaking learners of English within the framework of the Longitudinal Database of Learner English (LONGDALE) project. Automatic verb-noun collocation extraction has become an important NLP task. The results attained in this research domain can be used in various applications, including thesaurus building, semantic role labelling, language modelling and machine translation [34].

Fig. 1. High-level Design Diagram for Inferring new ontological knowledge.

3. Methodology

The proposed ontological knowledge inferring mechanism can be diagrammatically depicted as in Fig. 1.

The main data source for this inferring mechanism is a corpus of triples (RDF). It could be taken from an existing ontological knowledge corpus (depicted in block 1 of Fig. 1). To keep the RDF statements simple, URLs have been omitted and the three triple components are separated by the “ $|$ ” symbol in the example triples presented in this paper (e.g., Subject $|$ Predicate $|$ Object).

Apart from the RDF triples that represent the schema layer of an ontology, triples have instance-specific information embedded in their Subject and Object components. Such components therefore appear as fairly lengthy noun chunks.

E.g., Budget Proposal of 2016 $|$ introduced $|$ comprehensive policy framework

As inputs for inferring new ontological knowledge, the proposed approach prefers RDF statements that represent the schema layer of the ontology rather than the instance layer. Therefore, our own semantic abstraction method is used to convert instance-level triples into a more abstract format that is closer to the schema type.

3.1 Semantic abstraction method

This method (depicted in block 2 of Fig. 1) keeps the predicate part of the triple unchanged but removes all instance-specific terms other than the base term (noun) from both the Subject and Object components of the triple.

E.g., Sentence: The Budget Proposal of 2016 has introduced a comprehensive policy framework for tourism development.

Triple extracted from the example sentence:

Budget Proposal of 2016 $|$ introduced $|$ comprehensive policy framework (With instance specific additional information in the Subject and Object components of the triple.)

Triple after semantic abstraction (Abstract triple):

Proposal $|$ introduced $|$ framework (Without instance specific information.)

Using the semantic abstraction method, the triples are further processed to construct an abstract ontological knowledge corpus. This corpus is depicted in block 3 of Fig. 1.
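The abstraction step can be illustrated with a minimal sketch. The paper does not specify how the base noun is located, so the heuristic below (take the last word before the first preposition, or the last word of the chunk) is purely an assumption for illustration; a practical implementation would more likely rely on a parser. The names `PREPOSITIONS`, `head_noun` and `abstract_triple` are hypothetical, and a plain `|` is used as the separator.

```python
# Hypothetical heuristic: the base noun of an English noun chunk is taken to be
# the last word before the first preposition (or the last word overall).
PREPOSITIONS = {"of", "for", "in", "on", "at", "by", "with", "from", "to"}


def head_noun(chunk: str) -> str:
    """Return an approximate base noun for a noun chunk."""
    words = chunk.split()
    for i, word in enumerate(words):
        if word.lower() in PREPOSITIONS and i > 0:
            return words[i - 1]
    return words[-1]


def abstract_triple(triple: str) -> str:
    """Abstract a 'Subject | Predicate | Object' triple; the predicate is kept."""
    subject, predicate, obj = (part.strip() for part in triple.split("|"))
    return " | ".join([head_noun(subject), predicate, head_noun(obj)])


print(abstract_triple(
    "Budget Proposal of 2016 | introduced | comprehensive policy framework"
))
# prints: Proposal | introduced | framework
```

The heuristic reproduces the example above, but it would fail on chunks whose base noun is not in that position, which is why it should be read only as a sketch.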

The abstract ontological knowledge in the corpus is methodically processed using a novel approach named Joined Directed Collocation (JDC) to infer new ontological knowledge.

3.2 Joined Directed Collocation (JDC)

This is a novel concept introduced to infer new ontological knowledge from triples. Its conceptual framework is rooted in the existing concept of collocation. The authors first elaborate the fundamental collocation concept and then introduce the two novel concepts, Directed Collocation (DC) and Joined Directed Collocation (JDC). Finally, the method of inferring ontological knowledge based on the JDC concept is described with examples.

3.2.1 Collocation

The dictionary meaning of the term “collocate” is to locate two or more things close together or to be located together. “In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance” [35]. A syntactic collocation is a kind of collocation characterized by the presence of a syntactic link between the collocation’s items [36]. There are six main syntactic collocation types in the English language. These are:

Adjective $+$ Noun
Noun $+$ Noun
Verb $+$ Noun
Adverb $+$ Adjective
Verb $+$ Prepositional phrase
Verb $+$ Adverb

Collocation of two terms, $t_{1}$ and $t_{2}$, can be represented as ( $t_{1}$ $+$ $t_{2}$ ).

Applying Collocation concept into elements of ontological knowledge

The structure of ontological knowledge comprises three elements: “Subject”, “Verb/predicate” and “Object”. These elements can be mapped into “Verb $+$ Noun” and “Noun $+$ Noun” syntactic collocations, as elaborated below.

Noun (Subject) $+$ Verb
E.g., (Noun $+$ Verb): (teacher $+$ teach), (student $+$ study), (student $+$ play), (mother $+$ cook), (chef $+$ cook)

Verb $+$ Noun (Object)
E.g., (Verb $+$ Noun): (teach $+$ subject), (study $+$ subject), (play $+$ game), (cook $+$ meal)

Noun (Subject) $+$ Noun (Object)
E.g., (Noun $+$ Noun): (teacher $+$ subject), (student $+$ subject), (student $+$ game), (mother $+$ meal), (chef $+$ meal)
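As a rough illustration of this mapping, the sketch below counts the three kinds of collocation pairs over a handful of abstract triples. The triples are chosen here so that they yield exactly the example pairs listed above; they are illustrative and not taken from the paper's corpus, and the variable names are hypothetical.

```python
from collections import Counter

# Illustrative abstract triples (subject, verb, object).
triples = [
    ("teacher", "teach", "subject"),
    ("student", "study", "subject"),
    ("student", "play", "game"),
    ("mother", "cook", "meal"),
    ("chef", "cook", "meal"),
]

# Count each undirected collocation (t1 + t2) across the triples.
subject_verb = Counter((s, v) for s, v, _ in triples)
verb_object = Counter((v, o) for _, v, o in triples)
subject_object = Counter((s, o) for s, _, o in triples)

print(verb_object[("cook", "meal")])  # 2: two triples contain cook + meal
print(sorted(subject_object))
```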

3.2.2 Introducing Directed Collocation (DC)

This is the novel variation introduced to the collocation concept. A Directed Collocation is also a co-occurrence between two terms in the corpus. In this method, however, rather than simply finding the co-occurrence of two terms, the most frequently co-occurring term (To-term) is identified for a given term (From-term) within the specified corpus. It is therefore a co-occurrence in a specified direction, and the two terms are named From-term and To-term to show the flow of that direction. The direction emphasized here is very important, as it differentiates the Directed Collocation outcome from the Collocation outcome. Not all such From-term and To-term co-occurrences are eligible to be valid DCs; only those that are statistically significant qualify.

When a DC is denoted, the addition symbol ( $+$ ) used in collocation is replaced with an arrow ( $\rightarrow$ ) to indicate the direction of co-occurrence (From-term $\rightarrow$ To-term). The DC of two terms $t_{1}$ and $t_{2}$ can be represented using the notation ( $t_{1}\rightarrow t_{2}$ ), which means that $t_{2}$ is the most frequent term that goes with $t_{1}$ in the corpus.
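The role of direction can be seen in a small sketch. With the made-up counts below, "study" is the most frequent verb for "student", yet "student" is not the most frequent subject for "study"; the relation therefore holds in one direction only. The data and the helper `most_frequent_partner` are hypothetical and serve only to illustrate the definition.

```python
from collections import Counter

# Made-up (subject, verb) co-occurrences, for illustration only.
pairs = (
    [("student", "study")] * 3
    + [("student", "play")] * 1
    + [("researcher", "study")] * 5
)


def most_frequent_partner(from_term, pairs, from_index):
    """Return the term that co-occurs most often with from_term."""
    to_index = 1 - from_index
    counts = Counter(p[to_index] for p in pairs if p[from_index] == from_term)
    return counts.most_common(1)[0][0]


print(most_frequent_partner("student", pairs, 0))  # study
print(most_frequent_partner("study", pairs, 1))    # researcher
```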

3.2.3 Finding Directed Collocations (DCs) from the input ontological knowledge corpus

This sub-process is depicted in block 4 of Fig. 1. As mentioned above, the two terms of a collocation ( $t_{1}$ and $t_{2}$ ) could be nouns, verbs, adjectives, adverbs, prepositional phrases, etc. Since ontological knowledge (simply, triples) is the input data source for this approach, the subject terms (S), verb terms (V) and object terms (O) of triples can be selected as the terms of a collocation. On close inspection of the semantic structure of a triple, six possible DC types can be constructed, as follows.

DC Type 1. S $\rightarrow$ V (for a given subject term S, the most frequently co-occurring verb term is V)
DC Type 2. S $\leftarrow$ V (for a given verb term V, the most frequently co-occurring subject term is S)
DC Type 3. V $\rightarrow$ O (for a given verb term V, the most frequently co-occurring object term is O)
DC Type 4. V $\leftarrow$ O (for a given object term O, the most frequently co-occurring verb term is V)
DC Type 5. O $\rightarrow$ S (for a given object term O, the most frequently co-occurring subject term is S)
DC Type 6. O $\leftarrow$ S (for a given subject term S, the most frequently co-occurring object term is O)
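One possible way to encode the six DC types is as (from-position, to-position) pairs over a triple, as sketched below. The dictionary `DC_TYPES` and the helper `directed_pair` are hypothetical names introduced only for illustration; the paper does not prescribe a data structure.

```python
# Positions inside an abstract triple (subject, verb, object).
S, V, O = 0, 1, 2

# Each DC type is described by (from-position, to-position), following the
# six definitions above. The encoding itself is an assumption.
DC_TYPES = {
    1: (S, V),  # S -> V
    2: (V, S),  # S <- V
    3: (V, O),  # V -> O
    4: (O, V),  # V <- O
    5: (O, S),  # O -> S
    6: (S, O),  # O <- S
}


def directed_pair(triple, dc_type):
    """Return the (from-term, to-term) pair a triple contributes to a DC type."""
    from_pos, to_pos = DC_TYPES[dc_type]
    return triple[from_pos], triple[to_pos]


triple = ("student", "study", "subject")
for dc_type in DC_TYPES:
    print(dc_type, directed_pair(triple, dc_type))
```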

Bigrams, or DCs ( $t_{1}\rightarrow t_{2}$ ), of the six DC types are identified methodically from the corpus. The methodological steps are listed below.

Step 1:
Create distinct lists of subject terms (S), verb terms (V) and object terms (O) from the triple corpus.
Step 2:
For each term in the subject terms list, identify the most frequent verb term by scanning through all the triples in the corpus. This yields the candidate DCs of DC Type 1.
Step 3:
Repeat Step 2 for each of the other DC types to generate their candidate DCs. (The from-term is always taken from the appropriate list above.)

It is important to highlight here that, unlike in collocations, the from-term and to-term of a DC do not need to be adjacent in the triple. Instead, they may be placed anywhere in the triple structure (e.g., the terms of the O $\rightarrow$ S and O $\leftarrow$ S DC types are not adjacent in triples).
Step 4:
Identify the statistically significant DCs (explained below) and categorize them as valid DCs.
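Steps 1 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: triples are assumed to be (subject, verb, object) tuples, the encoding `DC_TYPES` is hypothetical, and statistical significance is approximated by the count threshold (QTV) described in Section 3.2.5. Grouping the counts by from-term replaces the explicit distinct lists of Step 1.

```python
from collections import Counter, defaultdict

S, V, O = 0, 1, 2
# (from-position, to-position) for DC Types 1 to 6; a hypothetical encoding.
DC_TYPES = {1: (S, V), 2: (V, S), 3: (V, O), 4: (O, V), 5: (O, S), 6: (S, O)}


def candidate_dcs(triples, dc_type):
    """Steps 1-3: for every from-term, keep the to-term(s) with the highest count."""
    from_pos, to_pos = DC_TYPES[dc_type]
    counts = defaultdict(Counter)
    for triple in triples:
        counts[triple[from_pos]][triple[to_pos]] += 1
    candidates = {}
    for from_term, to_counts in counts.items():
        best = max(to_counts.values())
        for to_term, n in to_counts.items():
            if n == best:
                candidates[(from_term, to_term)] = n  # n is the DC-score
    return candidates


def valid_dcs(triples, dc_type, qtv):
    """Step 4: keep the candidates whose DC-score reaches the threshold."""
    return {dc: n for dc, n in candidate_dcs(triples, dc_type).items() if n >= qtv}


# Toy corpus, made up for illustration.
corpus = (
    [("student", "study", "subject")] * 3
    + [("student", "play", "game")] * 2
    + [("chef", "cook", "meal")] * 2
)
print(valid_dcs(corpus, 1, qtv=3))  # {('student', 'study'): 3}
print(valid_dcs(corpus, 3, qtv=2))
```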

3.2.4 Elaborating Directed Collocation concept by introducing DC rules

Suppose

T1 $=$ {set of terms}
T2 $=$ {set of terms}
n(T1 $\cap$ T2) $\geq$ 0
$t_{1}\in$ T1 and $t_{2}\in$ T2

DC Rule 01: For a given term $t_{1}$ (in T1), there could be many collocations with various $t_{2}$ terms (in T2), having different co-occurrence counts within the corpus. Among all such collocations, the term $t_{2}$ that has the highest co-occurrence count with $t_{1}$ forms a candidate Directed Collocation (DC), symbolized as $t_{1}\rightarrow t_{2}$.

DC-score:

Every candidate DC has a property named the DC-score. The co-occurrence count that makes the two terms (bigram) a candidate DC is called the DC-score. Therefore,

$n(t_{1}+t_{2})=$ co-occurrence count of $t_{1}+t_{2}$ within the corpus
$n(t_{1}\rightarrow t_{2})=$ co-occurrence count of $t_{1}+t_{2}$ within the corpus, where this particular $t_{2}$ has the highest co-occurrence count with $t_{1}$ among all the terms of T2.
Then, $n(t_{1}\rightarrow t_{2})=$ DC-score

Elaborating the DC Rule 01:

Suppose
n(T1) $=$ k
n(T2) $=$ m
$t_{x}\in$ T1 and $t_{y}\in$ T2
$n(t_{x}\rightarrow t_{y})=\mathrm{Max}(n(t_{x}+t_{1}),\ n(t_{x}+t_{2}),\ \ldots,\ n(t_{x}+t_{m}))$
where $t_{y}$ is the element of T2 that has the maximum co-occurrence count with $t_{x}$ within the corpus.
Then, $n(t_{x}\rightarrow t_{y})=$ DC-score
According to DC Rule 01, ( $t_{x}\rightarrow t_{y}$ ) now becomes a candidate DC:
( $t_{x}\rightarrow t_{y}$ ) $=$ candidate DC

Applying the DC Rule 01 on ontological knowledge structure:

S $=$ {All subject terms in the ontological knowledge corpus}
V $=$ {All verb terms in the ontological knowledge corpus}
O $=$ {All object terms in the ontological knowledge corpus}
n(S $\cap$ O) $\geq$ 0
n(S $\cap$ V) $=$ 0
n(O $\cap$ V) $=$ 0
s $\in$ S (s belongs to S, and s $\in$ O for some s)
v $\in$ V (v belongs to V)
v $\notin$ S, for all v $\in$ V
v $\notin$ O, for all v $\in$ V
o $\in$ O (o belongs to O, and o $\in$ S for some o)
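These set relations can be checked on a small example. The triples below are made up for illustration, and the variable names are hypothetical; the sketch only shows that a term may appear in both S and O while V stays disjoint from both.

```python
# Toy abstract triples (subject, verb, object), made up for illustration.
triples = [
    ("teacher", "teach", "student"),
    ("student", "study", "subject"),
    ("chef", "cook", "meal"),
]

subjects = {s for s, _, _ in triples}
verbs = {v for _, v, _ in triples}
objects = {o for _, _, o in triples}

# A term may act as both subject and object, so the overlap can be non-empty.
print(subjects & objects)  # {'student'}

# Verb terms are assumed never to appear as subject or object terms.
assert subjects.isdisjoint(verbs)
assert objects.isdisjoint(verbs)
```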

E.g. 01

Suppose n(V) $=$ m
$n(s\rightarrow v_{y})=\mathrm{Max}(n(s+v_{1}),\ n(s+v_{2}),\ \ldots,\ n(s+v_{m}))$
where $v_{y}$ is the verb, out of all verbs in the corpus, that has the maximum co-occurrence count with the subject s.
$n(s\rightarrow v_{y})=$ DC-score
According to DC Rule 01, ( $s\rightarrow v_{y}$ ) now becomes a candidate DC:
$s\rightarrow v_{y}=$ candidate DC

Further elaboration of E.g. 01 with some real-world example values:

$n(s+v_{1})=$ n(student $+$ study) $=$ 6
$n(s+v_{2})=$ n(student $+$ play) $=$ 4
$n(s+v_{3})=$ n(student $+$ plagiarize) $=$ 1
student $\rightarrow$ study $=$ candidate DC (having the highest co-occurrence count, so DC-score $=$ 6)

E.g. 02

Suppose n(O) $=$ m
$n(v\rightarrow o_{y})=\mathrm{Max}(n(v+o_{1}),\ n(v+o_{2}),\ \ldots,\ n(v+o_{m}))$
where $o_{y}$ is the object, out of all objects in the corpus, that has the maximum co-occurrence count with the verb v.
$n(v\rightarrow o_{y})=$ DC-score
According to DC Rule 01, ( $v\rightarrow o_{y}$ ) now becomes a candidate DC:
$v\rightarrow o_{y}=$ candidate DC

Further elaboration of E.g. 02 with some real-world example values:

$n(v \rightarrow o_1) = n(\text{play} \rightarrow \text{game}) = 10$
$n(v \rightarrow o_2) = n(\text{play} \rightarrow \text{instrument}) = 8$
$n(v \rightarrow o_3) = n(\text{play} \rightarrow \text{video}) = 5$

play $\rightarrow$ game $=$ candidate DC (DC-score $= 10$)

In some situations, for a specific term $t_1$, several $t_2$ terms share the same highest co-occurrence count and therefore the same DC-score. In such situations, all such bigrams are considered candidate DCs.
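
DC Rule 01, including the tie case described above, can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the function name `candidate_dcs` and the input format (a list of co-occurring term pairs) are assumptions, and the `chef` pairs are hypothetical data added only to show a tie.

```python
from collections import Counter, defaultdict

def candidate_dcs(pairs):
    """Return {t1: (dc_score, [t2, ...])} for a list of co-occurring (t1, t2) pairs."""
    counts = Counter(pairs)  # n(t1 + t2) for every observed bigram
    by_t1 = defaultdict(dict)
    for (t1, t2), n in counts.items():
        by_t1[t1][t2] = n
    result = {}
    for t1, partners in by_t1.items():
        dc_score = max(partners.values())
        # Ties: every t2 that reaches the maximum count is kept as a candidate DC.
        best = sorted(t2 for t2, n in partners.items() if n == dc_score)
        result[t1] = (dc_score, best)
    return result

# Counts from E.g. 01: student/study = 6, student/play = 4, student/plagiarize = 1
sv_pairs = [("student", "study")] * 6 + [("student", "play")] * 4 + [("student", "plagiarize")]
# Hypothetical tie (not from the paper's data): two verbs share the maximum count
sv_pairs += [("chef", "cook")] * 2 + [("chef", "bake")] * 2

cands = candidate_dcs(sv_pairs)
print(cands["student"])  # (6, ['study'])
print(cands["chef"])     # (2, ['bake', 'cook'])
```

Returning a list of partners per term keeps every tied bigram, so the later QTV filter can treat each of them as a separate candidate DC.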

3.2.5 Identifying statistically significant (valid) DCs out of candidate DCs

For the base concept of collocation, a co-occurrence of two terms is considered a valid collocation only when that combination co-occurs more often than would be expected by chance [35]. Similarly, not every candidate DC is a valid DC. A candidate DC must be statistically significant to become a valid DC. In this approach, statistically significant DCs are identified from the candidate DCs using a pre-defined threshold value. This threshold value controls the quality of the outcome; thus, it is named the Quality-Threshold-Value (QTV).

DC Rule 02: If the DC-score of a candidate DC is greater than or equal to the pre-defined Quality-Threshold-Value (QTV), that Directed Collocation is valid, as it demonstrates a sufficient co-occurrence count in the corpus. Such a Directed Collocation is called a statistically significant, or valid, Directed Collocation.

Elaborating the DC Rule 02:

In the following example, $t_{1}$ and $t_{2}$ (used in the definition) are replaced with Subject terms s and Verb terms $v$ in the ontological knowledge corpus.

Candidate DCs:

$n(s_1 \rightarrow v_1) = n(\text{student} \rightarrow \text{study}) = 6$
$n(s_2 \rightarrow v_2) = n(\text{chef} \rightarrow \text{cook}) = 5$
$n(s_3 \rightarrow v_3) = n(\text{client} \rightarrow \text{pay}) = 4$

If the QTV is defined as 5 (as an example) then, (chef $\rightarrow$ cook) and (student $\rightarrow$ study) are the valid DCs.
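
The thresholding step of DC Rule 02 amounts to a simple filter over the candidate DCs. The sketch below uses the three candidate DCs from the example above; the function name `valid_dcs` and the dictionary layout are illustrative assumptions.

```python
def valid_dcs(candidates, qtv):
    """Keep candidate DCs whose DC-score is at least the QTV (DC Rule 02)."""
    return {dc: score for dc, score in candidates.items() if score >= qtv}

# Candidate DCs and their DC-scores from the example above
candidates = {
    ("student", "study"): 6,
    ("chef", "cook"): 5,
    ("client", "pay"): 4,
}

valid = valid_dcs(candidates, qtv=5)
print(sorted(valid))  # [('chef', 'cook'), ('student', 'study')]
```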

DC Rule 03: A Directed Collocation (DC) is valid only in the stated direction. Therefore, the symmetric property is not guaranteed.

Elaborating the DC Rule 03:

If both $t_1 \rightarrow t_2$ and $t_1 \leftarrow t_2$ are candidate DCs, the pair demonstrates the symmetric property. However, in the real world this is not guaranteed.

If $t_1 \rightarrow t_2$ is a valid DC but $t_1 \leftarrow t_2$ is not a candidate DC, the pair does not demonstrate the symmetric property (that is, it is asymmetric).

E.g., Proving asymmetric property

Suppose the DC type is (s $\rightarrow$ v), where v is the highest collocated verb (among all verbs) with subject s in the corpus.

Subject (S$_1$)	Verb (V$_1$)	Number of co-occurrences
($s_1$) employee	($v_1$) works at	n(employee $+$ works at) $=$ 30
($s_1$) employee	($v_2$) plays	n(employee $+$ plays) $=$ 20
($s_1$) employee	($v_3$) cheats	n(employee $+$ cheats) $=$ 10
($s_1$) employee	($v_4$) cooks	n(employee $+$ cooks) $=$ 5

candidate DC $=$ (employee $\rightarrow$ works at)

“works at” is the highest collocated verb with subject “employee” in the corpus.

n(employee $\rightarrow$ works at) $=$ 30

Suppose the DC type is (s $\leftarrow$ v), where s is the highest collocated subject (among all subjects) with verb v in the corpus.

Subject (S$_1$)	Verb (V$_1$)	Number of co-occurrences
($s_1$) labor	($v_1$) works at	n(labor $+$ works at) $=$ 40
($s_2$) employee	($v_1$) works at	n(employee $+$ works at) $=$ 30
($s_3$) woman	($v_1$) works at	n(woman $+$ works at) $=$ 20
($s_4$) man	($v_1$) works at	n(man $+$ works at) $=$ 10

candidate DC $=$ (labor $\leftarrow$ works at)

In the above example, (employee $\rightarrow$ works at) is a candidate DC.

But (employee $\leftarrow$ works at) is not a candidate DC.

Instead, (labor $\leftarrow$ works at) is the candidate DC.

In other words, the symmetric property has not been proved.
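
The asymmetry shown in this example can be checked mechanically. The following sketch uses the co-occurrence counts from the two tables above; the helper `best_partner` and its `position` parameter are hypothetical names introduced for illustration.

```python
def best_partner(counts, term, position):
    """Return the partner with the highest co-occurrence count for `term`.

    position=0 fixes the subject (s -> v); position=1 fixes the verb (s <- v).
    """
    scores = {pair[1 - position]: n for pair, n in counts.items() if pair[position] == term}
    return max(scores, key=scores.get)

# Co-occurrence counts n(s + v) taken from the two tables above
counts = {
    ("employee", "works at"): 30,
    ("employee", "plays"): 20,
    ("employee", "cheats"): 10,
    ("employee", "cooks"): 5,
    ("labor", "works at"): 40,
    ("woman", "works at"): 20,
    ("man", "works at"): 10,
}

forward = best_partner(counts, "employee", position=0)   # employee -> ?
backward = best_partner(counts, "works at", position=1)  # ? <- works at
print(forward)   # works at
print(backward)  # labor
```

Because the backward lookup returns "labor" rather than "employee", (employee $\rightarrow$ works at) has no matching candidate DC in the opposite direction.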

3.2.6 Introducing Joined Directed Collocations (JDC)

By combining two Directed Collocations (DCs) in a methodical way, a Joined Directed Collocation (JDC) can be formed. This sub-process is depicted in block 5 of Fig. 1 (High-level Design Diagram).

E.g. DC$_1$ $+$ DC$_2$ $=$ JDC

DC$_1$: $(s \rightarrow v)$ and DC$_2$: $(v \rightarrow o)$

$(s \rightarrow v) + (v \rightarrow o) = (s \rightarrow v \rightarrow o)$

Method of generating twelve JDC types using six DC types

Considering the ontological knowledge structure, 6 DC types have been identified. By systematically joining such DCs, at most 12 unique JDC types can be formed. The example below illustrates this systematic joining mechanism.

E.g. Suppose the DC type S $\rightarrow$ V is taken as DC1. Out of the S, V and O terms of the ontological knowledge structure, S and V have already been used. Therefore, O is the only remaining term with which to create DC2 by joining with the term V. Thus, two DC types are possible as DC2: V $\rightarrow$ O and V $\leftarrow$ O. Hence, the two possible JDC types are:

	DC ${}_{1}$ $+$ DC ${}_{2}$	$=$ JDC Type
JDC Type 1.	(S $\rightarrow$ V) $+$ (V $\rightarrow$ O)	$=$ (S $\rightarrow$ V $\rightarrow$ O)
JDC Type 2.	(S $\rightarrow$ V) $+$ (V $\leftarrow$ O)	$=$ (S $\rightarrow$ V $\leftarrow$ O)

The term V is the joining term of the DCs that generate JDC Types 1 and 2 above. The same logic can be applied to the other 5 DC types by taking each of them as DC1 and finding the possible DCs to join as DC2.

S $\leftarrow$ V as the DC ${}_{1}$ (V $=$ joining term),

JDC Type 3.	(S $\leftarrow$ V) $+$ (V $\rightarrow$ O)	$=$ S $\leftarrow$ V $\rightarrow$ O
JDC Type 4.	(S $\leftarrow$ V) $+$ (V $\leftarrow$ O)	$=$ S $\leftarrow$ V $\leftarrow$ O

V $\rightarrow$ O as the DC ${}_{1}$ (O $=$ joining term),

JDC Type 5.	(V $\rightarrow$ O) $+$ (O $\rightarrow$ S)	$=$ V $\rightarrow$ O $\rightarrow$ S
JDC Type 6.	(V $\rightarrow$ O) $+$ (O $\leftarrow$ S)	$=$ V $\rightarrow$ O $\leftarrow$ S

V $\leftarrow$ O as the DC ${}_{1}$ (O $=$ joining term),

JDC Type 7.	(V $\leftarrow$ O) $+$ (O $\rightarrow$ S)	$=$ V $\leftarrow$ O $\rightarrow$ S
JDC Type 8.	(V $\leftarrow$ O) $+$ (O $\leftarrow$ S)	$=$ V $\leftarrow$ O $\leftarrow$ S

O $\rightarrow$ S as the DC ${}_{1}$ (S $=$ joining term),

JDC Type 9.	(O $\rightarrow$ S) $+$ (S $\rightarrow$ V)	$=$ O $\rightarrow$ S $\rightarrow$ V
JDC Type 10.	(O $\rightarrow$ S) $+$ (S $\leftarrow$ V)	$=$ O $\rightarrow$ S $\leftarrow$ V

O $\leftarrow$ S as the DC ${}_{1}$ (S $=$ joining term),

JDC Type 11.	(O $\leftarrow$ S) $+$ (S $\rightarrow$ V)	$=$ O $\leftarrow$ S $\rightarrow$ V
JDC Type 12.	(O $\leftarrow$ S) $+$ (S $\leftarrow$ V)	$=$ O $\leftarrow$ S $\leftarrow$ V
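
The enumeration above follows a regular pattern: three cyclic orderings of the roles (S, V, O), each with two arrow directions for DC1 and two for DC2. A short sketch that reproduces the twelve types in the same order is given below; the function name `jdc_types` and the ASCII arrows `->` and `<-` are illustrative choices.

```python
from itertools import product

ROLES = ("S", "V", "O")
ARROWS = ("->", "<-")

def jdc_types():
    """List the JDC types in the order used above (Type 1 to Type 12)."""
    types = []
    for i in range(3):
        # Cyclic rotations (S, V, O), (V, O, S), (O, S, V); the middle role joins the two DCs.
        a, b, c = ROLES[i], ROLES[(i + 1) % 3], ROLES[(i + 2) % 3]
        for arrow1, arrow2 in product(ARROWS, repeat=2):
            types.append(f"{a} {arrow1} {b} {arrow2} {c}")
    return types

for number, jdc in enumerate(jdc_types(), start=1):
    print(f"JDC Type {number}: {jdc}")
```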

3.2.7 Method of inferring new ontological knowledge

Fig. 2. A symbolic illustration of inferring new ontological knowledge using the JDC method.

The novel DC and JDC concepts have been derived from the fundamental collocation concept. These two concepts are employed in formulating the JDC Assumption below, which is used to infer new ontological knowledge from an existing ontological knowledge corpus.

JDC Assumption for ontological knowledge inferring: Valid DCs are real semantic patterns (statistically significant collocations) that exist in the corpus. JDCs are formed by logically combining these DCs. Therefore, the three terms Subject (S), Verb (V) and Object (O) of a specific JDC type may have a contextual relationship among them, and re-arranging these terms in the semantic order S $+$ V $+$ O may reveal a new meaningful semantic relation that can be used as new ontological knowledge.

Based on the above JDC Assumption, hidden ontological knowledge can be inferred by discovering the three terms S, V and O based on the identified JDC types. The implementation methods of all JDC types would output a large number of three-value sets that consist of potential terms for S, V and O. Then, such three value instances can be re-organized in a semantic order of “S $+$ V $+$ O” to infer new ontological knowledge facts. A symbolic illustration of inferring new ontological knowledge using JDC method is depicted in Fig. 2.

The examples below further illustrate this concept.

E.g. 1:
DC1 (S $\rightarrow$ V): student $\rightarrow$ play
DC2 (V $\rightarrow$ O): play $\rightarrow$ soccer
Identified 3 terms: S $=$ student, V $=$ play, O $=$ soccer
Inferred ontological knowledge using JDC (S $\rightarrow$ V $\rightarrow$ O): student $+$ play $+$ soccer

E.g. 2:
DC1 (S $\rightarrow$ V): student $\rightarrow$ play
DC2 (V $\leftarrow$ O): play $\leftarrow$ game
Identified 3 terms: S $=$ student, V $=$ play, O $=$ game
Inferred ontological knowledge using JDC (S $\rightarrow$ V $\leftarrow$ O): student $+$ play $+$ game

E.g. 3:
DC1 (V $\rightarrow$ S): cook $\rightarrow$ chef
DC2 (S $\leftarrow$ O): chef $\leftarrow$ meal
Identified 3 terms: S $=$ chef, V $=$ cook, O $=$ meal
Inferred ontological knowledge using JDC (V $\rightarrow$ S $\leftarrow$ O): chef $+$ cook $+$ meal
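
The joining step in E.g. 1 and E.g. 2 can be sketched as a join of two DC lists on their shared term. This is a simplified illustration under stated assumptions: the function `join_dcs` and the tuple layout are hypothetical, and only the two JDC types whose joining term is V are shown, so the joined tuple is already in S $+$ V $+$ O order. Other JDC types would need an additional re-ordering step.

```python
def join_dcs(dc1, dc2):
    """Join two lists of DCs on their shared middle term.

    Each DC in dc1 is stored as (outer_term, joining_term) and each DC in dc2
    as (joining_term, outer_term), regardless of the arrow direction.
    """
    by_join = {}
    for join, outer in dc2:
        by_join.setdefault(join, []).append(outer)
    return [(outer1, join, outer2)
            for outer1, join in dc1
            for outer2 in by_join.get(join, [])]

# Valid DCs from E.g. 1 and E.g. 2 above
s_to_v = [("student", "play")]    # S -> V
v_to_o = [("play", "soccer")]     # V -> O
v_from_o = [("play", "game")]     # V <- O

type1 = join_dcs(s_to_v, v_to_o)    # JDC S -> V -> O
type2 = join_dcs(s_to_v, v_from_o)  # JDC S -> V <- O
print(type1)  # [('student', 'play', 'soccer')]
print(type2)  # [('student', 'play', 'game')]
```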

The ontological knowledge newly inferred by the JDC method is the final outcome of this research study. A corpus of such knowledge is depicted in block 6 of Fig. 1 (High-level Design Diagram).

4. Results and discussion

The Joined Directed Collocation method described in the Methodology section has been implemented and validated. This section discusses the analysis and results of these experiments.

4.1 Preparing the data source

The initial corpus of 43,100 triples extracted from Sri Lankan open-domain English news articles was considered to be the existing ontological knowledge. The triple extraction approach presented in our previous work, “Grammatical Structure Oriented Automated Approach for Surface Knowledge Extraction from Open Domain Unstructured Text”, was used to extract those triples [37]. We presented a semantic abstraction method by which such ontological knowledge was converted into the same number of abstract triples. This ensured that each triple contains only the three components: subject, predicate/verb and object.

4.2 Technological background

The JDC concept was implemented in Python 3.6.5. The experiments were run on a machine with an Intel i5-8350U processor (1.70 GHz, 4 cores, 8 threads) and 16 GB of memory.

4.3 Joined Directed Collocation (JDC) output

This method inferred new ontological knowledge from the 43,100 existing ontological knowledge facts. For a better analysis, results were obtained with the QTV parameter set to 1, 2, 3, 4, 5 and 10. Table 1 shows detailed statistics of the JDC algorithm output with QTV set to 3. Similar statistics were obtained for the other QTV values, but only the summarized figures are shown in Table 2.

The analysis showed that a considerable amount of the inferred ontological knowledge was identical to knowledge facts in the source ontological knowledge corpus (overlapping with the source ontological knowledge). Source ontological knowledge is considered to consist of valid knowledge facts. Since the JDC method infers ontological knowledge that overlaps with the source content, this suggests that the JDC method, based on the JDC Assumption, is able to infer valid ontological knowledge. Such overlap counts can be considered a useful indicator of the success and effectiveness of the JDC method. Table 2 reports these overlap percentages for the different QTV parameter values.

According to Table 2, the overlapped percentage of inferred ontological knowledge (considered to be valid) increases with the QTV parameter value, while the number of inferred knowledge facts decreases as the QTV increases. In other words, with a high QTV parameter value (a high level of collocation), the overlapped percentage is higher. From this behavior, we can conclude that the JDC method is more effective when the DCs joined into a JDC have a high level of collocation. That is, the quality of the inferred ontological knowledge depends on the co-occurrence counts, or DC-scores, of the DCs joined.

Moreover, when the QTV parameter value is increased, more DCs are rejected due to insufficient co-occurrence counts. Therefore, the higher the parameter value, the lower the amount of inferred ontological knowledge, which reduces the total number of ontological knowledge facts inferred by this method.

Table 1
JDC Inferred ontological knowledge with QTV $=$ 3

JDC type	JDC	Inferred ontological knowledge	Overlapped with source ontological knowledge	New ontological knowledge
1	S $\rightarrow$ V $\rightarrow$ O	294	112	182
2	S $\rightarrow$ V $\leftarrow$ O	6,424	314	6,110
3	S $\leftarrow$ V $\rightarrow$ O	1,218	645	573
4	S $\leftarrow$ V $\leftarrow$ O	587	322	265
5	V $\rightarrow$ O $\rightarrow$ S	476	258	218
6	V $\rightarrow$ O $\leftarrow$ S	390	125	265
7	V $\leftarrow$ O $\rightarrow$ S	1,164	532	632
8	V $\leftarrow$ O $\leftarrow$ S	163	88	75
9	O $\rightarrow$ S $\rightarrow$ V	561	174	387
10	O $\rightarrow$ S $\leftarrow$ V	83,000	1,339	81,661
11	O $\leftarrow$ S $\rightarrow$ V	617	236	381
12	O $\leftarrow$ S $\leftarrow$ V	597	123	474
Total		95,491	4,268	91,223

Table 2

JDC method – summary of the results

QTV	Total inferred ontological knowledge facts	Overlapped with source ontological knowledge as a percentage	Total new ontological knowledge facts
1	826,335	2.1%	809,325
2	223,144	3.1%	216,275
3	95,491	4.5%	91,223
4	52,067	5.8%	49,041
5	32,704	7.2%	30,354
10	4,620	17.8%	3,797
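
The overlap percentages in Table 2 can be re-derived from the other two columns. The sketch below assumes that the overlapped count equals the total inferred count minus the new count, which Table 1 confirms for QTV $=$ 3; the function name `overlap_percentage` is illustrative.

```python
# (QTV, total inferred facts, new facts) taken from Table 2
rows = [
    (1, 826_335, 809_325),
    (2, 223_144, 216_275),
    (3, 95_491, 91_223),
    (4, 52_067, 49_041),
    (5, 32_704, 30_354),
    (10, 4_620, 3_797),
]

def overlap_percentage(total_inferred, new_facts):
    """Share of inferred facts that already exist in the source corpus, in percent."""
    return 100 * (total_inferred - new_facts) / total_inferred

for qtv, total, new in rows:
    print(f"QTV={qtv}: overlap {overlap_percentage(total, new):.1f}%")
```

Rounded to one decimal place, the computed values match the percentages listed in Table 2.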

4.4 Inter-rater agreement based Manual validation method

Since the overlapping inferred ontological knowledge can be considered valid, the remaining knowledge facts (new ontological knowledge facts) require a proper validation mechanism to assess their effectiveness.

An inter-rater agreement based manual validation method served this purpose with high reliability. The sample selection algorithm designed for this test ensured high variability across the large corpus of newly inferred ontological knowledge.

Table 3
Inter-rater agreement test results of inferred ontological knowledge [QTV $=$ 3]

Statistical fact	Value
Total number of knowledge facts in the samples (200 per sample * 5)	1000
Number of knowledge facts which obtained 3/3 votes as effective	314
Number of knowledge facts which obtained 2/3 votes as effective	243
Number of knowledge facts which obtained 1/3 votes as effective	262
Number of knowledge facts which obtained 0/3 votes as effective	181
Number of knowledge facts which marked as effective	557
Number of knowledge facts which marked as ineffective	443
Effective ontological knowledge inferring rate	55.7%
Error rate	44.3%
Inter-Rater Reliability score (IRR) computed using Percent Agreement method	56.3%

Five sample sets of inferred ontological knowledge facts were randomly chosen using sampling without replacement. They were distributed among the 5 participants of the testing team, ensuring that every knowledge fact in the samples was manually verified by exactly three qualified examiners. The examiners analyzed the provided new ontological knowledge facts and marked each as either effective or ineffective. Once the validation results of the 3 examiners were collected, they were amalgamated to find the majority vote, which decides whether the inferred ontological knowledge fact is effective or not. The initial validation test was carried out using the ontological knowledge facts inferred with the QTV set to 3; this value was chosen to avoid extreme settings. The validation results of all 5 members were amalgamated and re-processed to find the final performance. The results of this experiment are listed in Table 3.
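
The majority-vote amalgamation described above can be illustrated with the vote distribution reported in Table 3. This sketch only reproduces the effective rate; the function name `majority_effective` is hypothetical, and the individual vote triples are rebuilt from the aggregated counts, so the order of votes within a triple is arbitrary.

```python
def majority_effective(votes):
    """Return True when at least two of the three examiners voted 'effective'."""
    return sum(votes) >= 2

# Distribution from Table 3: {number of 'effective' votes: number of facts}
distribution = {3: 314, 2: 243, 1: 262, 0: 181}

# Rebuild one vote triple per knowledge fact from the aggregated counts
samples = []
for yes_votes, facts in distribution.items():
    samples += [(True,) * yes_votes + (False,) * (3 - yes_votes)] * facts

effective = sum(majority_effective(v) for v in samples)
rate = 100 * effective / len(samples)
print(effective, len(samples), f"{rate:.1f}%")  # 557 1000 55.7%
```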

The validation results show that the effective ontological knowledge inferring rate is 55.7%. In addition, a similar validation test was carried out using the ontological knowledge facts inferred with the QTV set to 10. The intention of this secondary test was to validate the outcome with an extreme QTV value. The secondary test showed an effective ontological knowledge inferring rate of 56.3%, an increase of only 0.6 percentage points over the initial test. Therefore, this manual verification did not show much performance variation across QTV parameter values. The following statistics (for the validation test performed with the QTV parameter set to 3) give more insight into the performance of the JDC method.

Source ontological knowledge facts used to infer new ontological knowledge $=$ 43,100
Total inferred ontological knowledge $=$ 95,491
Overlapped with source ontological knowledge $=$ 4,268
New ontological knowledge inferred $=$ 91,223
Estimated effective new ontological knowledge $=$ 50,811
Total effective ontological knowledge inferred [when QTV $=$ 3] $= \sim$ 55,079

These statistics show that the amount of ontological knowledge inferred by the JDC method is more than twice the size of its source ontological knowledge corpus with a moderate value set for the QTV parameter (QTV $=$ 3). Also, around 56% of the inferred facts can be considered effective. By decreasing the parameter value, the size of the output can be increased further. However, increasing the parameter value may not significantly improve the effective ontological knowledge inferring rate.
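
The estimates in this list follow from the Table 1 totals and the 55.7% effective rate of Table 3. A short worked check is given below, assuming (as stated in Section 4.4) that all overlapped facts count as valid; the variable names are illustrative.

```python
source_facts = 43_100
total_inferred = 95_491
overlapped = 4_268
effective_rate = 0.557  # effective inferring rate from Table 3

new_facts = total_inferred - overlapped
# Apply the sampled effective rate to all new facts
estimated_effective_new = round(new_facts * effective_rate)
# Overlapped facts are treated as valid and added to the estimate
total_effective = estimated_effective_new + overlapped

print(new_facts, estimated_effective_new, total_effective)  # 91223 50811 55079
print(round(total_inferred / source_facts, 2))              # 2.22
```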

Table 4

JDC method inferred ontological knowledge

Potential ontological knowledge	Effective [YES or NO]
Discussions $\|$ focused on $\|$ policies	YES
Opposition $\|$ hold $\|$ elections	YES
Opposition $\|$ support $\|$ efforts	YES
Person $\|$ received $\|$ cheque	YES
Person $\|$ sought $\|$ intervention	YES
People $\|$ make $\|$ announcement	YES
People $\|$ reduce $\|$ gap	YES
Person $\|$ changed $\|$ mind	YES
Unions $\|$ have $\|$ majority	YES
Bill $\|$ published in $\|$ newspapers	YES
System $\|$ introduced to $\|$ country	YES
People $\|$ understand $\|$ arithmetic	YES
Request $\|$ made by $\|$ department	YES
President $\|$ appointed $\|$ parliament	YES
People $\|$ is within $\|$ economy	NO
People $\|$ say $\|$ drugs	NO
Person $\|$ gave $\|$ weeks	NO
Family $\|$ arrest $\|$ person	NO
India $\|$ arrived in $\|$ lanka	NO
Parties $\|$ led $\|$ government	NO
Article $\|$ were in $\|$ power	NO
Million $\|$ is $\|$ company	NO
Ministers $\|$ is $\|$ president	NO
Speaker $\|$ tabled in $\|$ parliament	NO

Table 4 shows some ontological knowledge inferred by the JDC method with QTV set to 3, together with the effective or ineffective label obtained through the inter-rater agreement test.

4.4.1 Reasons for less impressive effective ontological knowledge inferring rate and IRR score

For the conducted inter-rater agreement test, the Inter-Rater Reliability (IRR), or the level of agreement between raters, was computed using the Percent Agreement method. The IRR score obtained was 56.3%. Moreover, by amalgamating the three examiners’ decisions, the effectiveness of each ontological knowledge fact in the test samples was determined, giving a total effective ontological knowledge inferring rate of approximately 56%. This performance is not particularly impressive. One key reason is that, during this validation process, the examiners validated abstract syntactic patterns (ontological knowledge) without any relevant contextual information. Those patterns carried little or no instance-specific information on which to base a confident decision. Validating the effectiveness was therefore somewhat challenging and fairly subjective: individual perception matters, and opinions may differ. This may have contributed to the less impressive performance measures mentioned above.

The results revealed that this method can infer new ontological knowledge on a large scale. Thus, it can readily be used to extend existing ontologies.

4.5 Performance comparison

Ontology learning has many approaches, and these are used by existing tools. Such methods include, but are not limited to, conceptual clustering (e.g., ASIUM [38]), learning from patterns (e.g., SOAT [39]), NLP and machine learning (e.g., OntoLearn [40]) and statistical methods such as tf.idf (e.g., OntoLT [41]). However, the presented approach is an entirely new method developed using the collocation concept.

Compared with other existing ontology learning/construction systems, the proposed approach overcomes the following drawbacks of such systems.

Domain-specific nature: The majority of the existing tools attempt automatic ontology generation from an unstructured text corpus. Unfortunately, such methods are domain-specific and require manual intervention [6]. In contrast, this approach can be used to create both domain-specific and domain-independent (open domain) ontologies. The enabler behind this feature is the triple structure. Since triples are used as the data source for this approach, the domain of the source triples decides whether the output ontological knowledge is open domain or domain specific.

Scalability: The problems of existing ontology learning techniques highlighted in [42] emphasize scalability as a key issue. Extracting knowledge from massive amounts of data on the Web in heterogeneous formats requires scalable and efficient approaches. This study used an existing approach [37] for automatic knowledge extraction from unstructured text, which can also extract triples from open domain web data. Such extracted triples, or the triples of an existing small ontology, can be used as the data source. From such limited data, the method automatically infers new ontological knowledge amounting to approximately twice the size of the source. The size of the output depends on the size of the data source, and no scalability issues arise. Hence, this method can be considered an ontology extension tool as well.

Standardization: As highlighted above, heterogeneous formats of ontological knowledge introduce interoperability issues. This approach, however, uses standard RDF triples as the template for both source and output, which helps standardization.

5. Future work

Several principal approaches are available for finding collocations. In this research, a frequency-based mechanism was used to identify collocations. Other popular methods, such as the t-test and mutual information, can be experimented with to discover the best one. Furthermore, the distinct lists of Subjects, Verbs and Objects that were used to identify DCs were extracted from the source corpus itself. Forming such collections from another source of the same domain instead might be more effective.
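
As an illustration of one alternative named above, mutual information is commonly computed per pair as pointwise mutual information (PMI), $\log_2 \frac{P(t_1, t_2)}{P(t_1)\,P(t_2)}$. The sketch below is not part of the presented method: the function `pmi_scores`, the estimation of probabilities from pair counts, and the sample pairs are all assumptions made for illustration.

```python
import math
from collections import Counter

def pmi_scores(pairs):
    """Pointwise mutual information, log2(P(t1, t2) / (P(t1) * P(t2))), per observed pair."""
    pair_counts = Counter(pairs)
    left_counts = Counter(t1 for t1, _ in pairs)
    right_counts = Counter(t2 for _, t2 in pairs)
    total = len(pairs)
    return {
        (t1, t2): math.log2(n * total / (left_counts[t1] * right_counts[t2]))
        for (t1, t2), n in pair_counts.items()
    }

# Hypothetical subject-verb pairs, not taken from the paper's corpus
pairs = [("student", "study")] * 6 + [("student", "play")] * 4 + [("child", "play")] * 6
scores = pmi_scores(pairs)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 3))
```

Unlike a raw frequency count, PMI discounts pairs whose terms are individually frequent, so in this toy data (student, play) scores below zero although it co-occurs four times.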

6. Summary and conclusion

Ontologies are one of the best structures for serving as a knowledge base and enabling dynamic reasoning over knowledge. Hand-crafted ontologies are of excellent quality but expensive to build, so automatic ontology construction by extracting ontological knowledge from unstructured text has become a popular trend.

The Joined Directed Collocation (JDC) method proposed in this research can infer new ontological knowledge from an existing corpus of limited ontological knowledge facts or triples. It introduces two novel concepts, Directed Collocation (DC) and Joined Directed Collocation (JDC), based on the existing concept of collocation. By systematically joining the six types of DCs, 12 types of JDCs were identified. New ontological knowledge was inferred from the original corpus based on these 12 JDC types. This novel approach has been illustrated using examples enriched with mathematical symbols. To improve the quality of the inferred ontological knowledge, a Quality-Threshold-Value (QTV) parameter was introduced so that only DCs with a high level of collocation are used. Overall, the JDC method inferred 95,491 ontological knowledge facts, of which 91,223 were new, from a corpus of 43,100 existing ontological knowledge facts (with QTV $=$ 3, a moderate threshold). The total inferred ontological knowledge count was approximately twice the size of the source corpus. According to the results of the inter-rater agreement validation test, around 56% of the inferred new ontological knowledge can be considered effective. A percentage of the inferred ontological knowledge can be considered truly valid, as it is identical to ontological knowledge in the source corpus. With the QTV set to 10, approximately 17.8% of the inferred ontological knowledge facts were identical to triples of the source corpus, which supports the success of this method.

One limitation had an unfavorable impact on this study. The approach produces a massive output of automatically generated ontological knowledge. Although manual testing is the most effective validation approach, validating such massive amounts manually is labor-intensive, time-consuming and costly. Therefore, the manual validation was limited to selected samples.

With respect to the practical usefulness of the approach, triples serve as its input. Hence, any textual content from which triples can be extracted, in any domain (e.g., open domain news, web data, etc.), can be used as the primary data source to infer new ontological knowledge. The proposed approach is therefore practically usable for both open-domain and domain-specific ontology development. The inferred ontological knowledge has the simple “Subject $+$ Predicate $+$ Object” semantic pattern. Therefore, the outcome of this research can be used in any kind of knowledge base built on this semantic pattern.

The novel DC and JDC concepts and their method of application to infer new ontological knowledge are the key contributions of this study. The approach helps not only to automate the process of ontology construction/induction but also to extend existing ontologies (it can be used as an ontology extension tool). Finally, the findings of this research will have a considerable impact on the AKE body of knowledge. If the approach is effectively utilized, open domain text would no longer go to waste.

References

1. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition. 1993 Jun; 5(2): 199-220.

2. Antoniou, Groth, van Harmelen, Hoekstra. Semantic Web Primer. Third edition. The MIT Press; 2012; 270.

3. Lumsden, Hall, Cruickshank. Ontology definition and construction, and epistemological adequacy for systems interoperability: A practitioner analysis. Journal of Information Science. 2011; 37(3): 246-53.

4. Davies, Fensel, Harmelen. Towards the Semantic Web: Ontology – Driven Knowledge Management. 2003 Mar.

5. Baclawski, Bennett, Berg-Cross, Dickerson, Schneider, Seppälä, et al. Ontology Summit 2021 Communiqué: Ontology generation and harmonization. AO. 2022 May 4; 17(2): 233-48.

6. Elnagar, Yoon, Thomas. An Automatic Ontology Generation Framework with An Organizational Perspective. In 2020; 10.

7. Konys. Knowledge Repository of Ontology Learning Tools from Text. Procedia Computer Science. 2019; 159: 1614-28.

8. Al-Aswadi, Chan, Gan. Automatic ontology construction from text: a review from shallow to deep learning trend. Artif Intell Rev. 2020 Aug; 53(6): 3901-28.

9. Staab, Studer, editors. Handbook on Ontologies [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009 [cited 2017 May 19]. Available from: doi: 10.1007/978-3-540-92673-3.

10. Noy, McGuinness, et al. Ontology development 101: A guide to creating your first ontology [Internet]. Stanford, CA; 2001 [cited 2017 May 8]. Report No.: Stanford knowledge systems laboratory technical report KSL-01-05 and Stanford medical informatics technical report SMI-2001-0880. Available from: http://liris.cnrs.fr/alain.mille/enseignements/Ecole_Centrale/What%20is%20an%20ontology%20and%20why%20we%20need%20it.htm.

11. Fernandez, Gomez-Pearez, Juristo. Methontology: From Ontological Art Towards Ontological Engineering. 1997; 33-40. Report No. SS-97-06.

12. Gherasim, Harzallah, Berio, Kuntz. Methods and Tools for Automatic Construction of Ontologies from Textual Resources: A Framework for Comparison and Its Application. In: Guillet, Pinaud, Venturini, Zighed, editors. Advances in Knowledge Discovery and Management [Internet]. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013 [cited 2017 May 28]. 177-201. Available from: doi: 10.1007/978-3-642-35855-5_9.

13. Navigli, Velardi, Gangemi. Ontology learning and its application to automated terminology translation. IEEE Intelligent systems. 2003; 18(1): 22-31.

14. Nedellec. Semantic class learning and syntactic resources tuning. Technical report, Deliv. 6.4 a for ALVIS (Superpeer semantic Search Engine) Project; 2006.

15. Cimiano, Völker. A framework for ontology learning and data-driven change discovery. In: Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB), Lecture Notes in Computer Science, Springer [Internet]. Springer; 2005 [cited 2017 Sep 12]. 227-38. Available from: doi: 10.1007/b136569.pdf#page=238.

16. Maynard, Funk, Peters. SPRAT: a tool for automatic semantic pattern-based ontology population. In: International conference for digital libraries and the semantic web [Internet]. Trento, Italy; 2009 [cited 2017 Sep 12]. Available from: https://pdfs.semanticscholar.org/252c/a86439d6ba013d7c03963d30e9c68dfc491e.pdf.

17. Dutta, Chatterjee, Madalli. YAMO: Yet Another Methodology for large-scale faceted Ontology construction. Biswanath Dutta Dr, Devika P. Madalli Dr, editors. Journal of Knowledge Management. 2015 Feb 9; 19(1): 6-24.

18. Suchanek, Sozio, Weikum. SOFIE: a self-organizing framework for information extraction. In: Proceedings of the 18th international conference on World wide web [Internet]. ACM; 2009 [cited 2017 Jul 4]. 631-40. Available from: http://dl.acm.org/citation.cfm?id=1526794.

19.

Presutti

Draicchio

Gangemi

. Knowledge extraction based on discourse representation theory and linguistic frames. International Conference on Knowledge Engineering and Knowledge Management [Internet]. Springer; 2012 [cited 2017 Jul 4]. 114-29. Available from: doi: 10.1007/978-3-642-33876-2_12.

20.

Lehmann

Isele

Jakob

Jentzsch

Kontokostas

Mendes

, et al. DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web Journal. 2015; 6(2): 167-95.

21.

Wikipedia contributors. Linked data – Wikipedia, The Free Encyclopedia [Internet]. 2022; Available from: https//en.wikipedia.org/w/index.php?title=Linked_data&oldid=1107758074.

22.

Nuzzolese

Gentile

Presutti

Gangemi

Garigliotti

Navigli

. Open knowledge extraction challenge. Semantic Web Evaluation Challenge [Internet]. Springer; 2015; [cited 2017 Jul 2]. 3-15. Available from: doi: 10.1007/978-3-319-25518-7_1.

23.

Mihalcea

. Word semantics for information retrieval: moving one step closer to the Semantic Web. In: 13th; International Conference on Tools with Artificial Intelligence [Internet]. Texas, USA: IEEE; 2001 [cited 2017 May 29]. 280-7. Available from: http//ieeexplore.ieee.org/abstract/document/974475/.

24.

Bianchi

Soto

Palmonari

Cutrona

. Type vector representations from text: An empirical analysis. Deep Learning for Knowledge Graphs and Semantic Technologies Workshop, co-located with the Extended Semantic Web Conference, 2018.

25.

Duan, Shao, Zhou, Zou, Lin. Specifying architecture of knowledge graph with data graph, information graph, knowledge graph and wisdom graph. In: 2017 IEEE 15th International Conference on Software Engineering Research, Management and Applications (SERA) [Internet]. London, United Kingdom: IEEE; 2017 [cited 2019 Oct 26]. 327-32. Available from: http://ieeexplore.ieee.org/document/7965747/.

26.

Lin, Zhao, Huang, Liu. Domain knowledge graph-based research progress of knowledge representation. Neural Computing and Applications. 2021 Jan 1; 33(2): 681-90.

27.

Fawei, Pan, Kollingbaum, Wyner. A Semi-automated Ontology Construction for Legal Question Answering. New Gener Comput. 2019 Dec; 37(4): 453-78.

28.

Yuan, Wang, Zhou, Ren. A review for ontology construction from unstructured texts by using deep learning. In: Cen F, editor. International Conference on Internet of Things and Machine Learning (IoTML 2021) [Internet]. Shanghai, China: SPIE; 2022 [cited 2022 Sep 17]. 41. Available from: https://www.spiedigitallibrary.org/conference-proceedings-of-spie/12174/2628713/A-review-for-ontology-construction-from-unstructured-texts-by-using/10.1117/12.2628713.full.

29.

Deepa, Vigneshwari. An effective automated ontology construction based on the agriculture domain. ETRI Journal. 2022 Aug; 44(4): 573-87.

30.

Zhao, Dong, Zhang. ROCP: A Rapid Ontology Construction Platform from Unstructured Data. Data Science Journal. 2018 Sep 25; 17: 23.

31.

Fontenelle. Collocation acquisition from a corpus or from a dictionary: a comparison. 5th EURALEX International Congress on Lexicography, Euralex 1992 Part 1. Finland; 1992; 221-8.

32.

Seretan. Induction of Syntactic Collocation Patterns from Generic Syntactic Relations. Proceedings of Nineteenth International Joint Conference on Artificial Intelligence. 2005; 1698-9.

33.

Paquot, Naets, Gries STh. Using Syntactic Co-occurrences to Trace Phraseological Complexity Development in Learner Writing: Verb + Object Structures in LONGDALE. In: Le Bruyn, Paquot, editors. Learner Corpus Research Meets Second Language Acquisition [Internet]. 1st ed. Cambridge University Press; 2020 [cited 2022 Sep 19]. 122-47. Available from: https://www.cambridge.org/core/product/identifier/9781108674577%23CT-bp-6/type/book_part.

34.

Akinina, Kuznetsov, Toldova. The impact of syntactic structure on verb-noun collocation extraction. Компьютерная лингвистика и интеллектуальные технологии. 2013; 29: 2-16.

35.

Wikipedia contributors. Collocation – Wikipedia, The Free Encyclopedia [Internet]. 2022. Available from: https://en.wikipedia.org/w/index.php?title=Collocation&oldid=1098892033.

36.

Seretan. Induction of Syntactic Collocation Patterns from Generic Syntactic Relations. Proceedings of Nineteenth International Joint Conference on Artificial Intelligence. 2005; 1698-9.

37.

Tissera, Weerasinghe. Grammatical Structure Oriented Automated Approach for Surface Knowledge Extraction from Open Domain Unstructured Text. Journal of Information and Communication Convergence Engineering. 2022 Jun 30; 20(2): 113-24.

38.

Faure, Poibeau. First experiments of using semantic knowledge learned by ASIUM for information extraction task using INTEX. In 2000; 7-12.

39.

Hsu. SOAT: a semi-automatic domain ontology acquisition tool from Chinese corpus. Proceedings of the 19th international conference on Computational linguistics [Internet]. Taipei, Taiwan: Association for Computational Linguistics; 2002 [cited 2022 Sep 5]. 1-5. Available from: http://portal.acm.org/citation.cfm?doid=1071884.1071897.

40.

Missikoff, Navigli, Velardi. Integrated approach to Web ontology learning and engineering. Computer. 2002 Nov; 35(11): 60-3.

41.

Buitelaar, Olejnik, Sintek. OntoLT: A Protégé Plug-In for Ontology Extraction from Text. Proceedings of the International Semantic Web Conference. 2003; 31-44.

42.

Konys. Knowledge Repository of Ontology Learning Tools from Text. Procedia Computer Science. 2019; 159: 1614-28.

Table 1
JDC Inferred ontological knowledge with QTV = 3

Table 3
Inter-rater agreement test results of inferred ontological knowledge [QTV = 3]