Abstract
The growing need to convert unstructured knowledge embedded in open-domain natural language text into machine-processable forms requires that hard-won structured knowledge be inducted into knowledge bases, which helps make the Semantic Web vision a reality. In this context, ontologies and ontological knowledge (triples) play a vital role. This research introduces two novel concepts, Directed Collocation (DC) and Joined Directed Collocation (JDC), along with a methodical application of them to infer new ontological knowledge. An introduced Quality-Threshold-Value (QTV) parameter improves the quality of the inferred ontological knowledge. With a moderate QTV of 3, this approach inferred 95,491 new ontological knowledge facts from 43,100 triples of an open-domain Sri Lankan English news corpus. The outcome was thus approximately twice the size of the source corpus. Some inferred ontological knowledge was identical to the original corpus content, which is evidence for the accuracy of this approach. The remainder was validated using an inter-rater agreement method (high reliability), and around 56% of it was estimated to be effective. The inferred outcome, which is in triple format, may be used in any knowledge base. The proposed approach is domain independent and thus helps to construct or extend ontologies for any domain with little or no help from human specialists.
Keywords
Introduction/background
Unlike in the past, access to information is far broader today. With the support of the internet, users can remotely access information generated anywhere in the world, 24/7. Though this provides a good opportunity for knowledge acquisition, human cognitive capability and processing power are limited. Therefore, the mass of knowledge encapsulated in web data and in sources such as e-news is not effectively utilized, owing to information overload.
This situation is a real motive for Automatic Knowledge Extraction (AKE), which has become crucial and beneficial today. AKE is a non-trivial, highly challenging area of research in the Natural Language Processing (NLP) domain of computer science. Ontologies and knowledge bases are part and parcel of AKE, as structured knowledge that is hard to extract (e.g., RDF/triples) should be organized into knowledge bases so that it becomes machine processable. That is the vision of the Semantic Web.
Ontological knowledge
An ontology is a shared structure/schema or a vocabulary, which models the type of objects/concepts, their properties, and relations. It is an explicit specification of a conceptualization, according to Thomas R. Gruber [1]. It can act as a repository/knowledge base to store real world instances of such objects and relationships by allowing the end users to reason out and infer the knowledge.
Resource Description Framework (RDF) is a structured semantic pattern which consists of Subject, Predicate and Object (a triple) components respectively. RDF is a universal language that lets users describe resources (objects and relationships) in their own vocabularies and is the main construct of ontologies. RDF/triples are of two types. Some represent the schema layer of the ontology while the others represent the instances of the schema layer [2].
E.g. (URL has been omitted for the simplicity.) RDF schema: person
Therefore, RDF helps in constructing both the schema (ontology) and the instances (knowledge base) of the ontology. Henceforth, the terms “ontological knowledge” and “ontological knowledge facts” are used interchangeably in this paper to refer to RDF triples in the plural (both RDF schema and RDF instance types).
Problem formation
The use of ontologies is broad. Ontology development is a useful approach in the design and implementation of interoperable and multi-agent systems [3]. Ontology-based expert systems are widely used in many industry domains. Ontology-driven knowledge management systems help to organize and manage a firm’s knowledge efficiently [4]. Machine learning and the construction of very large knowledge graphs have accompanied a proliferation of ontologies for many purposes such as Semantic Web applications, business reporting and artificial intelligence. Ontologies can be extracted, learned, modularized, interrelated, transformed, analyzed, and harmonized, as well as developed in a formal process which can be manual or automated [5]. Domains are continuously updated, but ontologies do not evolve automatically to reflect such changes. To address this limitation, researchers have attempted automatic ontology generation from unstructured text corpora. Unfortunately, methods that aim to generate ontologies from unstructured text corpora are domain-specific and require manual intervention. They also suffer from uncertainty in creating concept relationships and difficulty in finding axioms for the same concept [6]. Human intervention is often required to improve the quality of the ontologies, because reaching consensus and high-level abstraction requires human cognitive processing. This makes fully automated ontology learning infeasible at present. Another important issue is the scalability of ontology learning techniques. Extracting knowledge from the growing amounts of data on the Web in different formats requires scalable and efficient approaches [7].
Considering such background facts, it must be acknowledged that the biggest challenge in ontology construction is to gather appropriate ontological knowledge. These attempts are mostly carried out through manual processes which indeed consume more labor and time. In addition, scalability issues and being domain specific are vital concerns to be addressed. These turned out to be the real motivation for this research.
The primary objective of this research is to introduce a novel mechanism to automatically infer new ontological knowledge by utilizing the limited number of existing triples/ontological knowledge. The new ontological knowledge facts are inferred based on the existing ontological knowledge, using a novel approach named Joined Directed Collocation (JDC).
The proposed approach is generic and can be applied on triples irrespective of their domain and infer new ontological knowledge. Hence, this approach can be used in developing either domain specific or open domain ontologies. This method infers a large amount of new ontological knowledge, thus can be used to construct new ontologies or extend existing ontologies.
Related work
The existing literature provides evidence for the importance of knowledge representation schemas that make extracted knowledge accessible to machines. Ontologies, knowledge graphs, frames and the Semantic Web are some of the knowledge representation schemas/frameworks which have become popular today. Although the ontology is by now a mature form of knowledge representation, its usability and range of application have not diminished.
Ontology learning is a research field that brings together the technologies of machine learning (ML), data and text mining, Natural Language Processing and knowledge representation. The review “Automatic ontology construction from text: a review from shallow to deep learning trend” covers various methods, application systems, and the difficulties of automated ontology construction from unstructured natural language text. It also highlights ways in which the ontology construction process could be enhanced, presenting techniques from shallow learning to deep learning [8]. Ontology learning frameworks exist in connection with ontology engineering tools, such as the TextToOnto framework on the KAON ontology engineering environment, OntoLT on Protégé and Text2Onto on the NeOn Toolkit. Full automation of ontology learning and construction is not yet feasible, and human intervention is vital [9]. Ontology 101 is an ontology development methodology for declarative frame-based systems [10]. It lists the main steps in the ontology development process while addressing issues that arise when defining class hierarchies and the properties of classes and instances. The highlighted steps include defining classes in the ontology, arranging the classes in a taxonomic hierarchy, defining slots and describing the allowed values for such slots, and finally filling in the slot values for instances. It is also a well-structured methodology for building ontologies. It also identifies a set of activities in the ontology development process such as planify, specify, acquire knowledge, conceptualize, formalize, integrate, implement, evaluate, document, and maintain. It further suggests techniques to carry out at each stage and the deliverables. This approach highly recommends the reuse of existing ontologies [11].
The paper “Methods and Tools for automatic construction of Ontologies from textual resources” presents a comparative analysis of some ontology construction approaches and their associated tools [12]. Its authors compared four approaches for the automatic construction of ontologies from textual resources, all based on the general domain-independent ontology construction framework “METHONTOLOGY”: OntoLearn [13], Alvis [14], Text2Onto [15] and SPRAT [16]. In these four approaches, the following algorithmic techniques have generally been used for the first three steps. Step 1 – Build a glossary of terms, using linguistic and statistical techniques. Step 2 – Build concept taxonomies, using structural and contextual techniques. Structural techniques utilize the structure of a term representing a concept. They can be based on the syntax of the term (e.g., domain ontology subsumes ontology), or on the morphology of the term (e.g., blood mononuclear cell as a variant of blood cell). A context is usually defined as a vector representing syntactic dependencies between the term that represents the concept and other surrounding terms. Contextual techniques are based on distributional and clustering techniques and on pattern-based techniques. Step 3 – Identify ad-hoc relations, using pattern-based techniques, external resources, and distributional and clustering techniques. In the four approaches mentioned above, hybrid methods have been used by appropriately selecting a combination of the above techniques. Yet Another Methodology for Ontology (YAMO) proposed a new method to develop ontologies. This research used facet analysis and an analytico-synthetic classification approach conceptualized by Ranganathan in 1997 to achieve its goals. It explained the methodology step by step based on an example in the large-scale food domain [17].
Some research studies have attempted to extend existing ontologies by automatically extracting information from unstructured sources such as the Web. SOFIE: A Self-Organizing Framework for Information Extraction is one such attempt [18]. In this research, the existing ontology was used to extract new knowledge, and the extracted knowledge was then fed back to extend the same ontology. This approach parses the unstructured text and identifies strong positive and negative patterns as ontological facts and hypotheses that can be mapped into clause form for a Weighted MAX SAT solver. The authors introduce a set of rules. SOFIE aims to find the hypotheses that should be accepted as true so that a maximum number of rules is satisfied (using the Weighted MAX SAT algorithm). Then, the FMS algorithm is run to assign truth values to the hypotheses. Subsequently, the true hypotheses are accepted as new facts of the ontology. In this way the ontology can be expanded. SOFIE combines pattern matching, word sense disambiguation and ontological reasoning in one model. This unsupervised and self-organizing approach showed fairly good precision and recall. Although it is a general-purpose knowledge extraction approach, performance can be improved by customizing the rules for a specific type of input corpus.
A robust ontology design from natural language texts is presented in [19], using Discourse Representation Theory (DRT), linguistic frame semantics and ontology design patterns. The paper describes an implementation of ontology design from natural language texts which was developed for the Open Knowledge Extraction Challenge (OKE2015). This system, named FRED, defines a mapping between DRT and the Resource Description Framework (RDF)/Web Ontology Language (OWL) while producing quality linked data and ontologies. The system architecture combines the best existing tools in a novel way to achieve good performance. It uses Boxer, a deep parser that produces linguistic frames in DRS. Boxer has been enhanced to identify the most appropriate linguistic frames when the input text contains complex relations. Boxer uses VerbNet, which is linked to FrameNet. These mappings help frame detection without training. Remarkably low computational time was observed in the evaluations.
DBpedia is a large-scale, multilingual ontology and knowledge base which makes a remarkable contribution to the Semantic Web [20]. The research article describes the DBpedia community project, which extracts structured knowledge from Wikipedia. DBpedia extracts knowledge from 111 language editions of Wikipedia and makes it freely available through the Semantic Web and Linked Data [21]. It is served on the Web in three forms: 1) downloadable data sets, where each data set contains the results of one of its extractors (e.g., abstract, disambiguation, geo coordinates, infobox, image, mappings); 2) a public SPARQL endpoint; 3) dereferenceable URIs according to the Linked Data principles. DBpedia has an ontology and a knowledge base. It maps Wikipedia infoboxes to this single shared ontology, and the mapping has been done through a worldwide crowdsourcing effort. The project publishers release the knowledge base for download and for access via SPARQL queries. It sets RDF links to external data sources and has now become one of the central interlinking hubs in the Linked Open Data (LOD) cloud. DBpedia provides many data sets for NLP tasks. Among them, topic signatures can be useful in tasks such as query expansion or document summarization.
When Open Information Extraction (OIE) is considered, the Semantic Web is an important topic to discuss. It is an attempt to make the data in web pages structured and tagged so that machines can understand the internet. Though the Semantic Web is part and parcel of OIE, classifying it as a knowledge representation framework seems more appropriate. The World Wide Web (WWW) is for people, and its content is meaningful to humans. The key intention of the Semantic Web is to translate the current content of the WWW into a format that is meaningful to computers. Several machine-readable languages have been developed for this purpose, such as XML, RDF, DAML, OIL and SHOE. The main target of the Open Knowledge Extraction (OKE) challenge was to bridge the gap between Knowledge Extraction and the Semantic Web/Linked Data. The tasks were therefore designed to automatically extract structured content from textual data and represent it as Linked Data [22]. There were two tasks in the challenge. Task 1 was Entity Recognition, Linking and Typing for Knowledge Base population. Task 2 was Class Induction and entity typing for Vocabulary and Knowledge Base enrichment. The evaluation was done with a tool designed to measure the precision, recall and F-measure of candidate systems and benchmark them. The winner of Task 1 was Adel, which achieved micro F1 and macro F1 scores of 0.6075 and 0.6039 respectively. (micro precision
The authors of [23] presented a framework for transforming unstructured documents into machine-readable forms using existing tools. They highlighted that the existing Web architecture consists of three separate layers, Syntactic (HTML/XML), Semantic (RDF) and Ontology, and hence answering queries is difficult. To overcome this issue, they proposed a model that combines these layers by giving meaning to the words in the document (word semantics). Various semantics were attached to each word, such as its stem, POS, named entity tag and WordNet synset ID (ontology and knowledge base), using a pre-defined format. They also defined a new query format which extracts information from documents in this enhanced format.
Knowledge Graphs (KGs) can be used as abstractions to represent and share knowledge in a structured way [24]. It has a schema-less nature which allows the graph to grow by adding new entities and relationships seamlessly. Knowledge graph has become a powerful tool to represent knowledge in the form of a labelled directed graph and to give semantics to textual information [25]. Another research study introduces the concepts related to the knowledge representation and analyzes knowledge representation of knowledge graphs, mainly including several typical domain knowledge graphs. It highlights the knowledge representation compliant with the difference of entities, relationships, and properties [26].
The paper “A Semi-automated Ontology Construction for Legal Question Answering” proposed a methodology for constructing legal Web Ontology Language (OWL) ontologies with Semantic Web Rule Language (SWRL) rules. Its authors also developed a Legal NER system to identify legal named entities. Further, it provides evidence that ontologies remain viable in many different industry domains [27].
“Review for Ontology Construction from Unstructured Texts by using Deep Learning” is a study which introduces some recent representative technical research on ontology construction [28]. Among its findings, “the corpus acquisition is the most fundamental and important in ontology learning” motivated us to create our own Sri Lankan news corpus. In addition, “realizing the importance of word representations in the construction of ontology” is another finding which supports approaches, like ours, that are based on syntactic collocation. A new agricultural ontology has been developed using the Jaccard relative extractor (JRE) and Naïve Bayes algorithms [29]. JRE identifies the similarity between two sentences/words in the agricultural documents, and the relationship between two terms is identified via the Naïve Bayes algorithm. The results show high precision (94.4%), outperforming the decision tree and K-nearest neighbor algorithms. However, this study also falls into the domain-specific category. Another research study has proposed a platform for rapid domain ontology construction from unstructured data [30]. Its authors showed that their approach works well for specific domains and taxonomy relationships, but performs poorly on open-domain text and non-taxonomy relationships.
Giving a definition for the collocation in linguistics, the paper [31] highlighted the fact, “The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure.” The study focused on extracting collocations from the “Collins-Robert English-French. French-English dictionary”. The author has made a comparison between the collocation knowledge extracted from the dictionary and the similar data retrieved from statistical processing. Results revealed that both methods are complementary and mutually enriching.
A syntactic collocation is characterized by the presence of a syntactic link between the collocation’s items [32]. Keeping that characteristic as a minimum syntactic constraint, the authors induced collocation patterns in a data-driven fashion with the support of a parser. Rather than relying on pre-defined patterns as in other similar research, they consider any POS combination which has a syntactic link in the parser output.
The study [33] examined whether syntactic co-occurrence can help to trace phraseological development in foreign language learner writing. They have experimented on Verb
High-level Design Diagram for Inferring new ontological knowledge.
The proposed ontological knowledge inferring mechanism can be diagrammatically depicted as in Fig. 1.
The main data source for this inferring mechanism is a corpus of triples (RDF). It could be taken from an existing ontological knowledge corpus (Depicted in block 1 of the Fig. 1 – High-level Design Diagram). For the simplicity of RDF statements, URLs have been omitted and the 3 triple components have been separated using “
Except for the RDF triples that represent the schema layer of an ontology, triples have some instance-specific information embedded in their Subject and Object components. Such components therefore look like fairly lengthy noun chunks.
E.g., Budget Proposal of 2016
The proposed approach prefers RDF statements that represent the schema layer of the ontology, rather than the instance layer, as inputs for inferring new ontological knowledge. Therefore, our own semantic abstraction method is used to convert such triples into a more abstract format which is closer to the schema type.
Semantic abstraction method
This method (depicted in block 2 of Fig. 1 – High-level Design Diagram) keeps the predicate part of the triple unchanged, but removes all instance-specific terms except the base term (noun) from both the Subject and Object components of the triple.
E.g., Sentence: The Budget Proposal of 2016 has introduced a comprehensive policy framework for tourism development.
Triple extracted from the example sentence:
Budget Proposal of 2016
Triple after semantic abstraction (Abstract triple):
Proposal
Using the semantic abstraction method, the triples are further processed to construct an abstract ontological knowledge corpus. This corpus is depicted in block 3 of Fig. 1 – High-level Design Diagram.
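The abstraction step can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the base term of a noun chunk is taken to be the last token before the first preposition, which is a simple heuristic of our own and not necessarily the rule used in this work. The names `Triple`, `PREPOSITIONS`, `base_term` and `abstract_triple`, and the object text of the example triple, are hypothetical.

```python
from typing import NamedTuple

# Assumed stop list: prepositions that start a post-modifier of the head noun.
PREPOSITIONS = {"of", "for", "in", "on", "at", "by", "with", "from", "to"}


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def base_term(noun_chunk: str) -> str:
    """Return the assumed base noun: the last token before the first preposition."""
    head_tokens = []
    for token in noun_chunk.split():
        if token.lower() in PREPOSITIONS:
            break
        head_tokens.append(token)
    # Fall back to the unchanged chunk if no token precedes a preposition.
    return head_tokens[-1] if head_tokens else noun_chunk


def abstract_triple(triple: Triple) -> Triple:
    """Keep the predicate unchanged and reduce subject and object to base terms."""
    return Triple(base_term(triple.subject), triple.predicate, base_term(triple.obj))


# Hypothetical instance-level triple built from the example sentence.
example = Triple(
    "Budget Proposal of 2016",
    "has introduced",
    "comprehensive policy framework for tourism development",
)
abstracted = abstract_triple(example)
```

A production version would more plausibly rely on a POS tagger or dependency parser to find the head noun; the token heuristic is used here only to keep the sketch self-contained.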
The abstract ontological knowledge in the corpus is methodically processed using a novel approach named as Joined Directed Collocation (JDC), to infer new ontological knowledge.
Joined Directed Collocation (JDC)
This is a novel concept introduced to infer new ontological knowledge from triples. Its conceptual framework is rooted in the existing concept of collocation. The authors first elaborate on the fundamental collocation concept and then introduce the two novel concepts, Directed Collocation (DC) and Joined Directed Collocation (JDC). Finally, the method of inferring ontological knowledge based on the JDC concept is described with examples.
Collocation
The dictionary meaning of the term “collocate” is to locate two or more things close together or to be located together. “In corpus linguistics, a collocation is a sequence of words or terms that co-occur more often than would be expected by chance” [35]. A syntactic collocation is a kind of collocation characterized by the presence of a syntactic link between the collocation’s items [36]. There are 6 main syntactic collocation types in the English language. Those are,
Adjective
Applying Collocation concept into elements of ontological knowledge
The structure of the ontological knowledge includes 3 elements such as “Subject”, “Verb/predicate” and “Object”. Such elements can be mapped into “Verb
Noun (Subject)
Introducing Directed Collocation (DC)
Directed Collocation is the novel variation introduced on the collocation concept. A Directed Collocation is also a co-occurrence between two terms in the corpus. However, rather than simply finding the co-occurrence of two terms, this method identifies, for a given term (the From-term), the most frequently co-occurring term (the To-term) within the specified corpus. It can therefore be seen as co-occurrence in a specified direction, and the names From-term and To-term reflect that direction. The direction is essential, as it differentiates the Directed Collocation outcome from the Collocation outcome. Not every From-term and To-term co-occurrence is a valid DC; only statistically significant ones are.
When DC is denoted, the addition symbol (
Finding Directed Collocations (DCs) from the input ontological knowledge corpus
This sub process is depicted in block 4 of the Fig. 1 – High-level Design Diagram. As mentioned above, Collocation’s two terms (
DC Type 1. S
Bigrams or DCs (
Step 1 – Create distinct lists of subject terms (S), verb terms (V) and object terms (O) from the triple corpus.
Step 2 – For each term in the subject terms list, identify the most frequent verb term by scanning through all the triples in the corpus. This yields the candidate DCs of DC Type 1.
Step 3 – Repeat Step 2 appropriately for the other DC types and generate candidate DCs. (The from-term is always taken from the appropriate one of the lists above.) It is important to highlight that, unlike in collocations, the from-term and to-term of a DC do not need to be adjacent in the triple. Instead, they can be placed anywhere in the triple structure (E.g., terms of O
Step 4 – Identify statistically significant DCs (explained below) and categorize them as valid DCs.
Suppose
T1
Every candidate DC has a property called the DC-score: the co-occurrence count which makes the two terms (bigram) a candidate DC. Therefore,
suppose n(T1)
S
E.g: 01
Suppose n(V)
Further elaboration of E.g. 01 with some real-world example values:
n(s
E.g: 02
Suppose n(O)
Further elaboration of E.g. 02 with some real-world example values:
n(v
In some situations, for a specific term t1, there can be multiple t2 terms with the same DC-score which demonstrates the highest co-occurrence count. In such situations, all such bigrams are considered as candidate DCs.
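The search for candidate DCs, including the tie rule just described, can be sketched as follows. This is a minimal illustration and not the authors’ implementation: it assumes each triple is a `(subject, verb, object)` tuple and that a co-occurrence is counted once per triple containing both terms. The function name `candidate_dcs`, the `ROLE_INDEX` table and the toy triples are invented for the example.

```python
from collections import Counter, defaultdict

# Position of each role inside a (subject, verb, object) tuple.
ROLE_INDEX = {"S": 0, "V": 1, "O": 2}


def candidate_dcs(triples, from_role, to_role):
    """Map each from-term to (top to-terms, DC-score) for one DC type."""
    counts = defaultdict(Counter)
    i, j = ROLE_INDEX[from_role], ROLE_INDEX[to_role]
    for triple in triples:
        counts[triple[i]][triple[j]] += 1  # one co-occurrence per triple

    result = {}
    for from_term, to_counts in counts.items():
        dc_score = max(to_counts.values())
        # Keep every to-term that reaches the highest count (ties included).
        top_terms = sorted(t for t, c in to_counts.items() if c == dc_score)
        result[from_term] = (top_terms, dc_score)
    return result


# Invented toy corpus of abstract triples.
toy_triples = [
    ("chef", "cooks", "meal"),
    ("chef", "cooks", "dinner"),
    ("chef", "tastes", "soup"),
    ("driver", "drives", "bus"),
    ("driver", "parks", "bus"),
]
sv_candidates = candidate_dcs(toy_triples, "S", "V")
```

In this toy corpus “chef” has a single top verb, while “driver” has two verbs tied at the same DC-score, so both bigrams become candidate DCs.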
Identifying statistically significant (valid) DCs out of candidate DCs
In the base concept of collocation, a co-occurrence of two terms is considered a valid collocation only when that combination has co-occurred more often than would be expected by chance [35]. Similarly, not every candidate DC is a valid DC: a candidate DC must be statistically significant to become valid. In this approach, statistically significant DCs are identified from the candidate DCs using a pre-defined threshold value. Because this threshold ensures the quality of the outcome, it is named the Quality-Threshold-Value (QTV).
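The thresholding step can be sketched as a simple filter. The comparison used here (a candidate is kept when its DC-score is greater than or equal to the QTV) is an assumption; the text does not state whether the boundary value itself passes. The function name `valid_dcs` and the scores are hypothetical.

```python
def valid_dcs(candidates, qtv):
    """Keep candidate DCs whose DC-score reaches the threshold.

    `candidates` maps a from-term to (list of to-terms, DC-score).
    Assumption: a candidate is valid when DC-score >= qtv.
    """
    return {
        (from_term, to_term): score
        for from_term, (to_terms, score) in candidates.items()
        if score >= qtv
        for to_term in to_terms
    }


# Hypothetical candidate DCs with invented scores.
candidates = {
    "chef": (["cooks"], 7),
    "driver": (["drives", "parks"], 2),
}
kept = valid_dcs(candidates, qtv=5)
```

Under this assumption a QTV of 1 keeps every candidate DC, and raising the QTV can only shrink the set of valid DCs.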
In the following example,
Candidate DCs:
n(s
If the QTV is defined as 5 (as an example) then, (chef
If both
If
E.g., Proving asymmetric property
Suppose, DC type (s
candidate DC
“works at” is the highest collocated verb with subject “employee” in the corpus.
n(employee
Suppose DC type (s
candidate DC
In the above example, (employee
But (employee
Instead, (labor
In other words, the symmetric property does not hold for DCs.
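The asymmetry can be reproduced on a small invented corpus. The sketch below assumes `(subject, verb, object)` tuples and made-up counts; only the terms “employee”, “works at” and “labor” are taken from the example above, and `top_partner` is a hypothetical helper.

```python
from collections import Counter


def top_partner(triples, from_idx, to_idx, from_term):
    """Return (to-term, count) for the most frequent partner of from_term."""
    counts = Counter(t[to_idx] for t in triples if t[from_idx] == from_term)
    return counts.most_common(1)[0]


# Invented triples: "employee" mostly takes "works at",
# but "works at" occurs more often with "labor" than with "employee".
toy_triples = [
    ("employee", "works at", "company"),
    ("employee", "works at", "office"),
    ("employee", "joins", "union"),
    ("labor", "works at", "site"),
    ("labor", "works at", "farm"),
    ("labor", "works at", "factory"),
]

forward = top_partner(toy_triples, 0, 1, "employee")   # subject to verb
backward = top_partner(toy_triples, 1, 0, "works at")  # verb to subject
```

Here the subject-to-verb direction pairs “employee” with “works at”, whereas the verb-to-subject direction pairs “works at” with “labor”, so reversing a DC does not generally give a DC.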
Introducing Joined Directed Collocations (JDC)
By combining two Directed Collocations (DC) in a methodical way, a Joined Directed Collocation (JDC) can be formed. This sub process is depicted in block 5 of the Fig. 1 – High-level Design Diagram.
E.g. DC
Method of generating twelve JDC types using six DC types
Considering the ontological knowledge structure, 6 DC types have been identified. With systematic joining of such DCs, at most 12 unique JDC types can be formed. The below example illustrates this systematic joining mechanism made towards generating JDCs.
E.g. Suppose the DC type S
The term V can be considered as the joining term of DCs to generate the JDC Type 1 and 2 above. The same logic can be applied to other 5 DC types by keeping them as DC1 and finding the possible DCs to join as DC2.
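The count of twelve can be checked with a short enumeration. The sketch assumes one reading of the joining rule: two DC types can be joined when they share exactly one role and together cover S, V and O. The ordering of the resulting pairs is arbitrary and is not claimed to match the JDC type numbering used in this paper.

```python
from itertools import combinations, permutations

ROLES = ("S", "V", "O")

# Six DC types: every ordered (from-role, to-role) pair of distinct roles.
dc_types = list(permutations(ROLES, 2))

# A pair of DC types forms a JDC type when the two DCs together cover
# S, V and O, which also means they share exactly one joining role.
jdc_types = [
    (dc1, dc2)
    for dc1, dc2 in combinations(dc_types, 2)
    if set(dc1) | set(dc2) == set(ROLES)
]
```

Of the fifteen unordered pairs of DC types, the three excluded ones are a DC type paired with its own reverse (for example S to V with V to S), which leaves twelve.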
Method of inferring new ontological knowledge
A symbolic illustration of inferring new ontological knowledge using JDC method.
Based on the fundamental collocation concept, DC and JDC novel concepts have been derived. These two new concepts have been employed in formulating the below mentioned JDC Assumption to infer new ontological knowledge using existing ontological knowledge corpus.
Based on the above JDC Assumption, hidden ontological knowledge can be inferred by discovering the three terms S, V and O based on the identified JDC types. The implementation methods of all JDC types would output a large number of three-value sets that consist of potential terms for S, V and O. Then, such three value instances can be re-organized in a semantic order of “S
The below examples will further illustrate the above concept comprehensively.
E.g. 1: DC1 (S
The newly inferred ontological knowledge using JDC method is the final outcome of this research study. A corpus of such knowledge is depicted in block 6 of the Fig. 1 – High-level Design Diagram.
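The inference step can be illustrated for a single JDC type. This sketch assumes that a valid subject-to-verb DC and a valid verb-to-object DC sharing the same verb are joined into one triple; it is one reading of the method, not the authors’ code. The function name and the DC sets are invented.

```python
from collections import defaultdict


def infer_by_joining_on_verb(sv_dcs, vo_dcs):
    """Join (subject, verb) DCs with (verb, object) DCs that share the verb."""
    objects_by_verb = defaultdict(set)
    for verb, obj in vo_dcs:
        objects_by_verb[verb].add(obj)
    return {
        (subject, verb, obj)
        for subject, verb in sv_dcs
        for obj in objects_by_verb.get(verb, ())
    }


# Hypothetical valid DCs of two types.
sv_dcs = {("chef", "cooks"), ("mother", "cooks"), ("driver", "drives")}
vo_dcs = {("cooks", "meal"), ("drives", "bus")}
inferred = infer_by_joining_on_verb(sv_dcs, vo_dcs)
```

Each inferred three-value set is already in subject, verb, object order for this JDC type; other JDC types would need their terms rearranged into that order after joining.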
The Joined Directed Collocation method described in the Methodology section has been implemented and validated. This section discusses the analysis and results of these experiments.
Preparing the data source
The initial corpus of 43,100 triples extracted from Sri Lankan open-domain English news articles was taken as the existing ontological knowledge. The triple extraction approach presented in our previous work, “Grammatical Structure Oriented Automated Approach for Surface Knowledge Extraction from Open Domain Unstructured Text”, was used to extract those triples [37]. The semantic abstraction method presented above converted this ontological knowledge into the same number of abstract triples, each of which contains only the three components subject, predicate/verb and object.
Technological background
The algorithms implementing the JDC concept were written in Python 3.6.5. The experiments were run on an Intel i5-8350U processor (1.70 GHz, 4 cores, 8 threads) with 16 GB of memory.
Joined Directed Collocation (JDC) output
This method inferred new ontological knowledge from the 43,100 existing ontological knowledge facts. For a better analysis, results were obtained with the QTV parameter set to 1, 2, 3, 4, 5 and 10. Table 1 shows the detailed statistics of the JDC algorithm output with QTV set to 3. Similar statistics were obtained for the other QTV values, but only the summarized figures are depicted in Table 1.
When analyzing the results, it was observed that a considerable amount of the inferred ontological knowledge was identical to knowledge facts of the source ontological knowledge corpus (that is, it overlapped with the source ontological knowledge). Source ontological knowledge is considered to consist of valid knowledge facts. Since the JDC method can infer ontological knowledge that overlaps with the source content, this suggests that the JDC method, based on the JDC Assumption, is able to infer valid ontological knowledge. Such overlap counts can be considered a good indicator of the success and effectiveness of the JDC method. Table 2 reports these percentages for the different QTV parameter values.
According to Table 2, the overlapping percentage of inferred ontological knowledge (considered to be valid) increases with the QTV parameter value, while the number of inferred knowledge facts decreases as the QTV increases. In other words, with a high QTV (a high level of collocation), a larger share of the inferred knowledge overlaps with the source. From this behavior we conclude that the JDC method is more effective when the DCs joined in a JDC have a high level of collocation; that is, the quality of the inferred ontological knowledge depends on the DC-scores of the DCs joined.
Moreover, when the QTV parameter value is increased, more DCs are rejected due to insufficient collocation. Therefore, the higher the parameter value, the lower the amount of inferred ontological knowledge. This affects the total number of ontological knowledge facts inferred by this method.
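The overlap measure can be sketched with set operations. The sketch assumes that the overlap percentage is taken relative to the number of inferred facts, which is not stated explicitly; the function name `overlap_stats` and the triples are invented.

```python
def overlap_stats(inferred, source):
    """Split inferred triples into overlapping and new, with the overlap share.

    Assumption: the percentage is computed over the inferred triples.
    """
    overlapping = inferred & source
    new_facts = inferred - source
    percent = 100.0 * len(overlapping) / len(inferred) if inferred else 0.0
    return overlapping, new_facts, percent


# Invented example: four inferred triples, one already present in the source.
source = {("chef", "cooks", "meal"), ("driver", "drives", "bus")}
inferred = {
    ("chef", "cooks", "meal"),
    ("mother", "cooks", "meal"),
    ("pilot", "flies", "plane"),
    ("nurse", "treats", "patient"),
}
overlapping, new_facts, percent = overlap_stats(inferred, source)
```

Only the facts in `new_facts` would then need manual validation, since the overlapping ones are treated as valid.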
Table 1. JDC inferred ontological knowledge with QTV = 3
Table 2. JDC method – summary of the results
Since the overlapping inferred ontological knowledge can be considered valid, the remaining knowledge facts (the new ontological knowledge facts) require a proper validation mechanism to establish their effectiveness.
A manual validation method based on inter-rater agreement served this purpose with high reliability. The sample selection algorithm designed for this test ensured high variability across the large corpus of newly inferred ontological knowledge.
Table 3. Inter-rater agreement test results of inferred ontological knowledge [QTV = 3]
Five sample sets of inferred ontological knowledge facts were chosen at random using sampling without replacement. They were distributed among the five participants of the testing team in such a way that every knowledge fact in the sample was manually verified by exactly three qualified examiners. The examiners analyzed the new ontological knowledge facts provided to them and marked each one as either effective or ineffective. Once the validation results of the three examiners had been collected, they were amalgamated to find the majority vote, which decides whether an inferred ontological knowledge fact is effective or not. The initial validation test was carried out on the ontological knowledge facts inferred with QTV set to 3; this value was chosen to avoid extreme situations. The validation results of all five members were amalgamated and re-processed to obtain the final performance. The results of this experiment are listed in Table 3.
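The majority-vote step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure's actual implementation: each fact is assumed to carry three boolean votes (True for effective), the fact identifiers are hypothetical, and the effective rate is taken as the share of facts whose majority vote is "effective".

```python
from typing import Dict, List


def majority_effective(votes: List[bool]) -> bool:
    """Return True when more than half of the votes mark the fact as effective."""
    if len(votes) % 2 == 0:
        raise ValueError("an odd number of votes is needed to avoid ties")
    return sum(votes) > len(votes) / 2


def effective_rate(ballots: Dict[str, List[bool]]) -> float:
    """Percentage of facts judged effective by majority vote."""
    if not ballots:
        raise ValueError("at least one fact is required")
    decisions = [majority_effective(votes) for votes in ballots.values()]
    return 100.0 * sum(decisions) / len(decisions)


# Toy ballots, invented for illustration only (three examiners per fact).
ballots = {
    "fact-1": [True, True, False],   # effective (2 of 3)
    "fact-2": [False, False, True],  # ineffective (1 of 3)
    "fact-3": [True, True, True],    # effective (3 of 3)
    "fact-4": [False, True, True],   # effective (2 of 3)
}

print(effective_rate(ballots))  # 75.0
```

Using an odd number of examiners per fact, as in the test described above, guarantees that the majority vote is never tied.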
The validation results show that the rate of effective inferred ontological knowledge is 55.7%. In addition, a similar validation test was carried out on the ontological knowledge facts inferred with QTV set to 10. The intention of this secondary test was to validate the outcome with an extreme QTV value. The secondary test yielded an effective rate of 56.3%, an increase of only 0.6 percentage points over the initial test. Hence, the manual verification did not show much variation in performance across QTV parameter values. The following statistics (from the validation test performed with QTV set to 3) give more insight into the performance of the JDC method.
Source ontological knowledge facts used to infer new ontological knowledge
The above statistics demonstrate that the number of ontological knowledge facts inferred by the JDC method is more than twice the size of its source ontological knowledge corpus, even with an average value set for the QTV parameter (QTV = 3).
JDC method inferred ontological knowledge
Table 4 shows some of the ontological knowledge inferred by the JDC method with QTV set to 3, together with the effective or ineffective label obtained through the inter-rater agreement test.
For the inter-rater agreement test, the Inter-Rater Reliability (IRR), or the level of agreement between raters, was computed using the Percent Agreement method. The IRR score obtained was 56.3%. Moreover, by amalgamating the three examiners' decisions, the effectiveness of each ontological knowledge fact in the test samples was determined, which gave an overall effective rate of approximately 56%. This performance does not appear very impressive. One key reason is that, during the validation process, the examiners had to validate abstract syntactic patterns (ontological knowledge) without any relevant contextual information. Those patterns carried little or no instance-specific information on which to base a confident decision. Validating effectiveness was therefore somewhat challenging and fairly subjective: individual perception matters, and opinions may differ. This may have contributed to the modest performance measures reported above.
The results reveal that this method can infer new ontological knowledge on a large scale. Thus, it can readily be used to extend existing ontologies.
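As an illustration of the Percent Agreement method, the sketch below computes the mean pairwise agreement over all items. It is a hedged example only: the text does not state how the three raters' labels were combined, so averaging over rater pairs is an assumption here (an alternative would count only the items on which all three raters agree). The function name `percent_agreement` and the toy ratings are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def percent_agreement(ratings: Sequence[Sequence[bool]]) -> float:
    """Mean pairwise percent agreement.

    Each inner sequence holds the labels that the raters gave to one item.
    Assumption: agreement is averaged over every pair of raters per item.
    """
    agreeing = 0
    total = 0
    for labels in ratings:
        for first, second in combinations(labels, 2):
            agreeing += int(first == second)
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more ratings")
    return 100.0 * agreeing / total


# Toy ratings, invented for illustration only (True = effective).
ratings = [
    [True, True, True],     # all 3 rater pairs agree
    [True, True, False],    # 1 of 3 rater pairs agrees
    [False, False, False],  # all 3 rater pairs agree
    [True, False, False],   # 1 of 3 rater pairs agrees
]

print(round(percent_agreement(ratings), 1))
```

In the toy data, 8 of the 12 rater pairs agree, which gives an agreement of roughly 66.7%.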
Performance comparison
Ontology learning follows many approaches, and these are used by existing tools. Such approaches include, but are not limited to, conceptual clustering (e.g., ASIUM [38]), learning from patterns (e.g., SOAT [39]), NLP and machine learning (e.g., OntoLearn [40]) and statistical methods such as tf.idf (e.g., OntoLT [41]). The approach presented here, however, is an entirely new method built on the concept of collocation.
Compared with the ontology learning/construction systems that exist today, the proposed approach overcomes the following drawbacks of such systems.
Future work
Several principal approaches are available for finding collocations. In this research, a frequency-based mechanism was used to identify them. Other popular methods, such as the t-test and mutual information, can be tried in order to discover which performs best. Furthermore, the distinct lists of Subjects, Verbs and Objects that were used to identify DCs were extracted from the source corpus itself. Forming such lists from another source in the same domain may be more effective.
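To make the alternatives mentioned above concrete, the following sketch computes pointwise mutual information and the t-score for a word pair from raw counts, using the standard textbook formulations. It is not part of the presented method: the function names, parameter names and toy counts are hypothetical, and the t-score uses the common approximation that the sample variance of a rare event equals its observed probability.

```python
import math


def pmi(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """Pointwise mutual information (base 2) of the pair (x, y).

    c_xy: count of the pair, c_x and c_y: counts of each term, n: corpus size.
    """
    if c_xy <= 0:
        raise ValueError("the pair must occur at least once")
    p_xy = c_xy / n
    p_x = c_x / n
    p_y = c_y / n
    return math.log2(p_xy / (p_x * p_y))


def t_score(c_xy: int, c_x: int, c_y: int, n: int) -> float:
    """t-score of the pair against the independence hypothesis.

    Assumption: the variance is approximated by the observed probability,
    which is reasonable when the pair is rare.
    """
    if c_xy <= 0:
        raise ValueError("the pair must occur at least once")
    observed = c_xy / n
    expected = (c_x / n) * (c_y / n)
    return (observed - expected) / math.sqrt(observed / n)


# Toy counts, invented for illustration only.
print(pmi(c_xy=8, c_x=16, c_y=32, n=1024))      # 4.0
print(round(t_score(8, 16, 32, 1024), 3))
```

Unlike a plain frequency count, both scores compare the observed pair frequency with the frequency expected if the two terms were independent, so frequent but unrelated term pairs are penalized.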
Summary and conclusion
Ontologies are among the best structures for serving as a knowledge base and for enabling dynamic reasoning over knowledge. Hand-crafted ontologies are of excellent quality but are expensive to build, so automatic ontology construction by extracting ontological knowledge from unstructured text has become a popular trend.
The Joined Directed Collocation (JDC) method proposed in this research can infer new ontological knowledge from an existing corpus of limited ontological knowledge facts, or triples. It introduces two novel concepts, Directed Collocation (DC) and Joined Directed Collocation (JDC), based on the existing concept of collocation. By systematically joining the six types of DCs, 12 types of JDCs were identified. New ontological knowledge was effectively inferred from the original corpus based on these 12 JDC types. This novel approach has been illustrated using appropriate examples enriched with mathematical symbols. To improve the quality of the inferred ontological knowledge, a Quality Threshold Value (QTV) parameter was introduced so that only DCs with a high level of collocation are used. Overall, the JDC method inferred 95,491 new ontological knowledge facts from a corpus of 43,100 existing ontological knowledge facts (when QTV = 3).
One limitation had an unfavorable impact on the success of this study. The approach produces a massive output of automatically generated ontological knowledge. Although manual testing is the most effective approach, validating such a massive amount manually is labor-intensive, time-consuming and costly. Therefore, the manual validation was limited to selected samples.
With respect to practical usefulness, this method takes triples as its input. Hence, textual content from any domain from which triples can be extracted (e.g., open-domain news, web data etc.) can serve as the primary data source for inferring new ontological knowledge with this approach. The proposed approach is therefore practically usable for both open-domain and domain-specific ontology development. The inferred ontological knowledge simply has the “Subject
The novel DC and JDC concepts and their method of application to infer new ontological knowledge are the key contributions of this study. They help not only to automate the process of ontology construction/induction but also to extend existing ontologies, so the method can be used as an ontology extension tool. Finally, the findings of this research are expected to have a considerable impact on the AKE body of knowledge. If the approach is utilized effectively, open-domain text would no longer go to waste.
