Abstract
Advances in machine learning and the development of very large knowledge graphs have accompanied a proliferation of ontologies of many types and for many purposes. These ontologies are commonly developed independently, and as a result, it can be difficult to communicate about and between them. To address this difficulty of communication, ontologies and the communities they serve must agree on how their respective terminologies and formalizations relate to each other. The process of coming into accord and agreement is called “harmonization.” The Ontology Summit 2021 examined the overall landscape of ontologies, the many kinds of ontology generation and harmonization, as well as the sustainability of ontologies. The Communiqué synthesizes and summarizes the findings of the summit as well as earlier summits on related issues. One of the major impediments to harmonization is the relatively poor quality of natural language definitions in many ontologies. The summit surveyed the state of the art in natural language definition development, based on lexicographic principles, as well as examples of ongoing projects that are explicitly dealing with harmonization and sustainability.
Introduction
Ontologies are proliferating, producing a complex landscape of many types, roles and uses for many purposes (Hitzler, 2021b), such as data integration, Semantic Web applications, business reporting and artificial intelligence. Ontologies can be extracted, learned, modularized, interrelated, transformed, analyzed, and harmonized, as well as developed in a formal process which can be manual or automated. There are now many ways that ontologies interact with other technologies, including, but not limited to, statistical and linguistic techniques, generation by machine learning tools, serving as the basis for machine learning so as to improve the quality and explainability of its results, and integration into machine learning architectures (Gaur, Faldu, and Sheth, 2021). Unfortunately, it is commonly the case that ontologies are constructed independently of one another. While such ontologies can serve their purposes very well, it can be difficult to align ontologies that were developed for overlapping domains. Furthermore, when an ontology is generated using automated methods, it could be difficult for humans to understand it. There is growing understanding of the need for harmonization and better definitions of terms in ontologies as well as some best practices to provide these.
Following some planning meetings, the Ontology Summit 2021 examined the landscape of ontology generation and harmonization using a series of virtual presentations and sessions held from February to May 2021. This Communiqué synthesizes and summarizes the findings of that series. Based on community interests, the Ontology Summit 2021 was organized into four tracks and the Communiqué is organized with one section for each of the summit tracks. The first track explored the many different kinds of ontology currently in use and provided an overall framework for the other tracks. The second track surveyed the different notions and levels of formality of definitions and practical methods to harmonize a variety of semantic resources. The third track examined automated techniques such as NLP and ML for constructing ontologies. The final track was concerned with the sustainability of ontologies.
One of the purposes of ontologies is for communication, and misinterpretations can occur when terms are not adequately defined. Natural language definitions, especially domain-specific definitions, are an essential part of ontologies; but, from an ontological perspective, they are often poorly written because of a lack of experience with writing definitions, a shortage of resources, or a lack of emphasis on properly defining terms during the ontology development process. Details on the notion of a semantically adequate natural language definition and its role in ontology development and harmonization are presented in Section 3.
To avoid misinterpretations, in the context of this Communiqué, we distinguish between natural language terms and terms that have a namespace prefix. When an important natural language term is ambiguous, the intended definition or definitions will be specified. For example, in Section 3.1, the word “annotation” is used, and the intended dictionary definition is specified. It is this notion of annotation that is used in the Communiqué, and not, for example,
The recent development of much more effective machine learning techniques has made it possible to extract ontological information from source documents. While such techniques are effective for particular tasks, currently they are generally opaque and do not lend themselves to easy human understanding and explainability. The issue of explainability was examined in a previous Ontology Summit (Baclawski et al., 2020a). One way to make machine learning techniques more explainable is to integrate machine learning with ontologies (Baclawski et al., 2018b; Gaur, Faldu, and Sheth, 2021). Many architectures have been developed for integrating ontologies with machine learning, and more generally, for integrating symbolic (which includes knowledge graphs and ontologies) and sub-symbolic (such as machine learning) techniques.
The ontological landscape
Ontologies have many aspects and purposes. As a result, the growing classification of ontologies will have many dimensions. These dimensions form a rich landscape rather than a simple linear “spectrum.” The distinctions among types of ontologies were studied and surveyed at the first two Ontology Summits in 2006 and 2007. The first was the Upper Ontology Summit (Obrst et al., 2006), and the second was on “Ontology, Taxonomy, Folksonomy: Understanding the Distinctions” (Grüninger, Bodenreider, Olken, Obrst, and Yim, 2008). A number of ontology dimensions were identified in these two summits, and some of the most prominent ontologies of the time were classified along the dimensions (Baclawski, 2007a, 2007b; Baclawski and Duggar, 2007). The most common classification of ontologies is by their level of generality, and Section 2.1 discusses the types of ontology as distinguished by how generally applicable the ontologies are.
One of the primary purposes of ontologies is for communication, and Section 2.2 surveys the types of ontology with respect to different communication requirements. Section 2.3 discusses the different approaches to what the entities of an ontology are intended to represent. Sections 2.4 and 2.5 survey two more dimensions: attitudes toward realism and the representation of uncertainty. Section 2 ends with a short discussion of ontological commitments in Section 2.6. However, there are still other dimensions in the landscape, and these are surveyed later on in Sections 3, 4 and 5.
Levels of generality
Ontologies are most often classified by their level of generality. The most generic or abstract ontologies are called “foundation(al) ontologies”, “generic ontologies”, “top-level ontologies”, or “upper ontologies” (Ontology Term List, 2020). The most specific ontologies are called “application ontologies” because they are generally associated with a specific or narrow range of applications. In between these two extremes, there are reference ontologies and domain ontologies. Reference ontologies are more specific than foundation ontologies, yet are not limited to a particular domain. Domain ontologies are limited to a single domain, but domains can form hierarchies with many levels of generality and the domain ontologies may also have many levels of generality (Schneider, 2021). Other notions of levels of ontology generality have been proposed, such as upper-level reference ontologies that provide specifications of requirements, functions, design or standards for a specific application (Chen, Ludwig, Ma, and Walther, 2019).
Kinds of language
One of the main purposes of ontology is to improve communication. In general, communication can be between people, between people and machines, and between machines. Machines operate and interoperate when the elements of the ontology are defined precisely and logically so that they are processable (e.g., inferences can be computed) and their ambiguity is minimized. Human language, in comparison, is much richer with a large variety of figures of speech and inherent ambiguities (Hanks and Jezek, 2008; Baclawski, 2021). Far from being a flaw of human language, this richness is its strength. Humans deal with the ambiguity of language by the context of a dialog, the manner of communication (e.g., inflection and gestures) and the ability to ask for clarification (Sowa, 2021). Accordingly, an ontology and its documentation should recognize the distinctions between the needs of humans and machines. One means of accomplishing this is to have a “language interface” that mediates between human and machine language. The language interface is an important feature of modern software engineering processes, and ontology development can also benefit from these practices (Bennett, 2021; Woods and Low, 2021).
An effective technique for communication is to base the ontology on the events and states of interest in a domain, as well as the arguments (subjects, objects, adverbs, etc.) of those events (as per neo-Davidsonian semantics). Basing an ontology on events and states allows the ontology to be deeply informed by narrative and event linguistic theory (Westerinen, 2021).
For a computable ontology using OWL, Common Logic or other logic-based ontology representation languages, identifiers (for elements of the ontology) are, in fact, symbols of the signature of the (logical) language. However, these identifiers are often natural language terms or phrases with which humans are already familiar, sometimes in highly emotional and biased ways. As a result, humans can easily forget or ignore that identifiers need to be treated as symbols whose interpretation is dictated by the formalizations used to define the identifiers. Misinterpretation is even more likely when an ontology element is not adequately defined to meet the intended interpretation. Inadequate definitions occur for many reasons. One reason why a natural language definition can be inadequate is an implicit reliance by ontology developers on the expectation that a user of the element will “understand” how the element is to be interpreted. Ontology developers may have a specific understanding of the natural language term or phrase being defined without realizing that some of the users of the ontology may have a different understanding (Schneider, 2021).
Well-defined concepts are an essential ingredient in communication and play a central role in semantic interoperability. Both communication and interoperability are based on a common understanding of concepts, services, information and contributing data. The Ontology Summit 2007 discussed the spectrum of useful semantic artifacts starting with an implicit preference for strong formality found in ontologies (Grüninger et al., 2008). A range of resources exist, such as folksonomies (e.g., simple user-defined keyword lists useful to annotate resources on the Web), along with taxonomies, conceptual models, and controlled vocabularies (e.g., the Medical Subject Headings (MeSH)). A controlled vocabulary reflects an association of terms and is part of language use in a domain, which, like a model, imposes some simplifying, constraining, and organizing form on the fluxing complexity of that domain about which we communicate. The rationale for using ontologies, by contrast, is to formalize concepts in a logical language to allow some degree of automated processing of data. There was less attention at the Ontology Summit 2007 on standardization of other forms of semantic resources in pre-formalization phases. More recently, there has been a greater recognition and appreciation of both ontologies and standard vocabularies to support communication, data sharing and interoperability.
Represented entities
Ontologies may be used for a range of purposes, such as data integration, Semantic Web applications, business reporting, and artificial intelligence. Different uses are best supported by different and sometimes contrasting modeling styles. A key distinction with ontologies, as with any kind of model, is the question of what kind of thing is represented by the elements of that model. An ontology is a model that represents things and the relationships among them, relevant for a purpose. Many ontologies are designed to contextualize data for a specific domain. For example, a “concept” ontology will contain logical statements framed in terms of domain attributes and how they define the meanings of things in that domain (Bennett, 2021). These logical statements can be used by humans to understand things in the domain, e.g., for business purposes. An ontology can also be used to model the data associated with the things in the domain, as opposed to the things themselves. Choosing what data to use to represent the things is a design decision, even if in many cases the design choices are obvious. Ontologies that are primarily concerned with data about things in the domain are often known as “operational” or “application” ontologies.
Further analysis suggests that there may be different styles of ontology that deal with operational data. An ontology for integrating multiple sources of data may need to have more semantically nuanced distinctions to deal with the different ways those data sources reflect conceptualizations of the world. By contrast, an ontology for reasoning over data from a single source (e.g., in a knowledge graph) would typically be simpler.
Other differences reflect operational design choices. For example, an operational ontology need not use a full foundational ontology to partition its world. It would also typically have fewer relationships, with little or no use of constraints such as property domains and ranges. By contrast, a concept ontology would have a richer set of relationships such as different types of whole-part relations, and would reflect the many constraints that apply to the relationships and properties.
In deriving an operational ontology from a concept ontology, several design steps are needed: flattening the class hierarchy; extracting concepts that are relevant to the application use case; selectively removing classes that reflect some of the top-level ontology (TLO) partitioning such as “things in roles”; and shortening the corresponding property paths, so that, for example, a conceptual path such as “Loan – Borrower (as party in role) – Person” becomes a simple “Loan – Person” relationship. Each application use case, being a distinct business context, would entail the extraction of different material from the concept ontology. In this regard, operational ontology design follows a similar path to any other technology development, with the concept ontology playing the role of a “Computationally Independent Model” (CIM) in the development process. For an ontology, especially a very large one, to be consistent in the broader sense of this term, it must not only be logically and conceptually consistent, but it must also have a consistent style throughout. Consistency in this broader sense requires conformance to style guidelines, such as naming conventions and change management procedures (Uschold, 2021).
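The path-shortening step described above can be illustrated with a minimal sketch. It assumes the concept ontology is available as a plain list of subject-property-object triples; the class and property names (hasBorrower, isRolePlayedBy, and so on) and the function shorten_role_paths are hypothetical and are not taken from any of the ontologies discussed at the summit.

```python
# Hypothetical concept-ontology triples; every name here is illustrative only.
CONCEPT_TRIPLES = [
    ("Loan", "hasBorrower", "Borrower"),
    ("Borrower", "isRolePlayedBy", "Person"),
    ("Loan", "hasAmount", "MonetaryAmount"),
]

# Classes that reflect "thing in role" partitioning and are to be removed.
ROLE_CLASSES = {"Borrower"}


def shorten_role_paths(triples, role_classes):
    """Replace each path A -p-> Role -q-> B with a direct edge A -p-> B."""
    # Collect, for each role class, the classes that can play the role.
    role_targets = {}
    for subject, _prop, obj in triples:
        if subject in role_classes:
            role_targets.setdefault(subject, []).append(obj)

    shortened = []
    for subject, prop, obj in triples:
        if subject in role_classes:
            continue  # edges leaving a role class are folded into the edge entering it
        if obj in role_classes:
            # Re-point the edge at whatever plays the role.
            for target in role_targets.get(obj, []):
                shortened.append((subject, prop, target))
        else:
            shortened.append((subject, prop, obj))
    return shortened


operational = shorten_role_paths(CONCEPT_TRIPLES, ROLE_CLASSES)
for triple in operational:
    print(triple)
```

This sketch handles only a single level of role indirection and silently drops an edge into a role class that has no player. A real derivation would also need to decide, per use case, which classes count as role classes and which property name the shortened edge should keep.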
Degrees of realism
Ontologies can be distinguished by the degree of “realism” that they adopt. There are many notions of realism that have been proposed and studied in philosophy. An ontology is realist, with respect to a particular notion of realism, if it is based on the thesis that there is a reality that exists independently of people. Neither philosophical nor computational ontologies require realism; indeed, computer scientists’ focus is generally more pluralistic with respect to what entities may be relevant, whether they are universals, defined classes, conceptual, cognitive, hypothetical, etc. Being more pluralistic has benefits for dealing with natural language, common sense, and other human capabilities. A pluralistic attitude allows for notions of “truth” that underlie different reasoning mechanisms (i.e., logics) and that depend on context and situation (Masolo, 2021). Different forms of logic have been a frequent topic of earlier Ontology Summits, including the ones in 2007 referenced above (Grüninger et al., 2008), in 2010 “Creating the Ontologists of the Future” (Neuhaus et al., 2010), and in 2017 “AI, Learning, Reasoning, and Ontologies” (Baclawski et al., 2018b). One of the most important examples of a reasoning mechanism different from logical reasoning is probabilistic reasoning, which is discussed in the next section.
Uncertainty management
Human understanding is typically probabilistic. To reflect our best understanding of the world, ontologies, for some situations or applications, should support uncertainty specification and reasoning. However, there is no generally agreed upon way to specify uncertainty in an ontology. For example, the likelihood of a specific RDF statement being true can be defined as simply as adding a
Ontology commitments
Ontology development and sustainability are processes during which many design decisions must be made. These decisions vary with respect to how much of the ontology is affected. Furthermore, different ontology development methodologies vary with respect to what decisions are made, who makes the decisions, as well as when and how the decisions are made (i.e., governance). While modularity can help limit the scope of a decision, there will nevertheless be decisions that have major consequences for the process and the artifacts that are developed. The design decisions also affect how well an ontology can adapt to future requirements. Decisions made during the development of an ontology are commonly referred to as “ontological commitments.” When using this term in the context of ontology development, one must be careful not to confuse the term with the philosophical notion of ontological commitment, such as Quine’s Criterion (Bricker, 2016).
Definitions and harmonization
The same term can be defined in many ways, depending on the purpose, audience, and circumstances. Ontologies use intensional definitions that specify properties of the thing or things denoted by the terms in the ontology (i.e., of their extension). Definitions can vary with respect to the level of precision, their purpose and linguistic function, the intended scope, and their quality (Gupta, 2021; Robinson, 1950; Seppälä, Ruttenberg, and Smith, 2016b). The precision of a definition is concerned with how precisely the definition specifies a term. The strongest form is a classical definition, which is an intensional definition that gives necessary and sufficient conditions for membership in the defined term’s extension. A weaker form is an intensional definition that includes conditions that are not necessary, or more conditions than is sufficient. Another weak form is an extensional definition that lists all examples explicitly. An extensional definition might not be possible or feasible, and there may be other examples that subsequently need to be added. The weakest form is an ostensive definition that is often non-verbal and accompanied by a gesture pointing to an example.
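The contrast between a classical (necessary and sufficient) intensional definition and an extensional definition can be made concrete with a small sketch. The example uses triangles rather than an ontology term from the summit, and the names Figure, is_triangle, and KNOWN_TRIANGLES are assumptions introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Figure:
    name: str
    closed: bool
    straight_sides: int


def is_triangle(figure: Figure) -> bool:
    """Classical definition: conditions that are jointly necessary and sufficient."""
    return figure.closed and figure.straight_sides == 3


# Extensional definition: an explicit list of the members known so far.
KNOWN_TRIANGLES = {"t1", "t2"}


def is_triangle_extensional(figure: Figure) -> bool:
    return figure.name in KNOWN_TRIANGLES


t1 = Figure("t1", closed=True, straight_sides=3)
t3 = Figure("t3", closed=True, straight_sides=3)  # a newly encountered example
square = Figure("sq", closed=True, straight_sides=4)

print(is_triangle(t3), is_triangle_extensional(t3))
```

The intensional test classifies the new figure t3 correctly without any change, whereas the extensional list has to be updated by hand, which is the limitation noted above for extensional definitions.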
Another distinction is concerned with a definition’s purpose (Robinson, 1950). The purpose of a real definition is to explain what the thing or things denoted by a term are, while a nominal definition explains the meaning of a term and is concerned with its usage in practice. Definitions can also be distinguished by their linguistic function (Seppälä, Ruttenberg, and Smith, 2016b): a descriptive definition aims to be compatible with all existing usages, while a stipulative definition is introduced either on a temporary basis or in a specific context and need not be compatible with any other usages of the term. One aspect of definition is that there should be terms, such as the term “entity,” that are explicitly left undefined. Otherwise, the definitions will inevitably be circular. In some ontologies, such as the Basic Formal Ontology (BFO), these terms are given textual elucidations that serve the same purpose as definitions.
Definitions can serve as links between humans, between human communities, between humans and ontologies, as well as between different ontologies. Historical attempts to standardize terms included creating core metadata models and common conceptual models for combining data into a single representation (Silva, Pérez-Alcázar, and Kofuji, 2019). For example, the Simple Knowledge Organization System (SKOS) is a core metadata model standard for the Web (SKOS, 2009). However, these standardization attempts have largely failed to be adopted because of flawed conceptualizations, lack of community agreement, and inadequate representation; and thus the attempts have resulted in silos. More recently, significant progress has been made leveraging best practices including the use of ontological analysis and design. The Ontology Summit 2021 surveyed the different notions and levels of formality of definitions, with emphasis on practical methods to harmonize a variety of semantic resources, and a summary of this survey is given in this section.
It is clear that domain vocabularies vary significantly in quality and scope, often with alternative definitions for the same term and definitions that have varying degrees of formality. This problem has been recognized for a long time. It is said that when Confucius was asked what he would do if he were a governor, he replied that he would “rectify the names” to make words correspond to reality. Standardization of term meanings remains challenging since there are many conflicting and overlapping glossaries and incompatible data models that define domain terms in idiosyncratic, domain or application-specific ways. Completely “rectifying the names,” like attempting to develop a single ontology for everything, may be too ambitious a goal, but harmonization can be achieved, albeit with some effort. Harmonizing terminology is underway in some domains, such as the cryosphere, which is the frozen water part of the Earth system concerned with ice fields and glaciers (ESIP, 2021), and in government-related domains, for example through the National Information Exchange Model (NIEM, 2017).
Writing an adequate definition, whether from a human or a computer perspective, is not easy. Indeed, one of the main problems with developing natural language definitions for ontologies is precisely the common assumption that anyone can write definitions and that they will then be harmonious. Ontology design experience as well as experiments have shown that even for well understood domains, the results of manual classification tasks performed by domain experts are highly inconsistent (Westerinen, 2021). The first step in writing definitions is to accept that it is a hard task. There are many definition writing principles and guidelines in lexicography, terminology, and logic that one can use for writing definitions (Seppälä, Ruttenberg, and Smith, 2017). One of the functions of a definition is to adjust the readers’ or systems’ inferential competences, i.e., what readers or systems infer when encountering the term that is defined (Seppälä, Ruttenberg, Schreiber, and Smith, 2016a). For a theoretical explanation of the functions of natural language definitions in ontologies, backed by empirical neuropsychological studies, see (Seppälä, Ruttenberg, and Smith, 2016b).
Understanding and defining the notions to be represented in an ontology are fundamental problems in ontology development that distinguish it from software, or even systems, development. In ontology development, there are at least two phases to representing a notion or entity. First, there is understanding what the notion or entity is. The second is the analysis phase in which the notion or entity is represented or modeled using the selected or available constructs. For example, one may have selected OWL as the ontology language or one is working within the context of a foundational, reference, or domain ontology.
During the “understanding” phase, the developer needs to survey relevant references, starting with a “common” dictionary (e.g., Oxford or Cambridge) and moving on to more domain-specific references (e.g., ISO specifications), if they exist. Using a range of references can provide a contrast among the different ways a notion or entity is understood as well as the contexts in which the notion or entity occurs. These contrasts will greatly aid in “understanding” the notion or entity and what may be needed to be represented to meet the stated needs of the ontology being developed. Ideally, the understanding phase should lead to a useful natural language definition, a critical element in representing a notion or entity for human interpretation.
The next phase is the ontological analysis process. Note that the understanding and analysis phases are usually performed in parallel rather than in series. An aid in ontological analysis can be a well-constructed foundational ontology (e.g., DOLCE, BFO, UFO, GFO), since such an ontology has already incorporated ontological analysis during its creation. One common pitfall during the analysis phase is “taxonomy seduction” which is the tendency to place an entity into a taxonomy without completing an understanding or analysis. Premature “classification” of a notion or entity can bias an entire development effort and/or require rework (Schneider, 2021).
Definitions and harmonization in the environmental sciences
In this section, a specific example of a domain that aims to address definitions and semantic harmonization is presented in detail. The domain is the environmental sciences, and the ontology is the Environmental Ontology (“EnvO GitHub Site”). EnvO is a semantic resource for semantically controlled descriptions of environmental entities. For example, the Darwin Core glossary uses EnvO in its habitat descriptions and was developed by applying text-mining approaches to extract habitat information from the Encyclopedia of Life and automatically create experimental habitat classes within EnvO.
EnvO’s initial focus was to represent biomes, environmental features, and environmental materials, and the initial purpose was for genomic and microbiome-related investigations. However, the need for environmental semantics is common to a multitude of fields, and EnvO’s scope has steadily grown since its initial description. As the scope has expanded, the ontology has been enhanced and generalized to support its increasingly diverse applications (such as the Cryo (glaciers and ice fields) and the Marine (ocean) realms) that are now examined in more detail.
The Global Cryosphere Watch (GCW) has sponsored the effort to harmonize term definitions that are already in use in the Cryo domain. Sometimes a single term will have dozens of different term definitions. Consider the term “snow cover”. The GCW harmonized definition is the following:
“An area density which inheres in snow distributed over an area of a landmass or other substrate.”
Additional information about this term is provided by annotations. An annotation is an explanation or comment added to data or metadata. Definitions are annotations, but there are other kinds of annotations that have a significant impact on understanding both by humans and by machines. For example, annotations can be used for explanations (Baclawski et al., 2020a) and for specifying the context and provenance of information (Baclawski et al., 2018a). An example of an annotation for GCW that is not part of a term definition states that, in general, snow cover is a layer of snow on the ground surface and can be compared to the related terms of snowfield and snowpack.
EnvO natural language definitions conform to the Minimum Information for the Reporting of an Ontology (MIRO) guidelines (Matentzoglu, Malone, Mungall et al., 2018). The MIRO guidelines specify seven categories of requirements for ontology reporting: (A) basic provenance information, (B) target audience and related ontologies, (C) scope of the ontology, (D) knowledge sources used, (E) ontology itself, expressed in a representation language, and the policies for its development, (F) sustainability plan, and (G) quality assurance. As required by part (A) of MIRO, best practices for documentation and for provenance information were instituted, including annotating the time the definition was added, the ORCID identifier (orcid.org) of the individual who created the update, and provenance information about definition development. Target audiences were identified, as required by part (B), and subsets were developed within the scopes of the target audiences, as required by part (C). For example, a special envoPolar subset of EnvO has been crafted with relevant terms and axioms for the polar environment community. Many knowledge sources were employed, including the incorporation of other ontologies. The process was documented as required by parts (D) and (E) of MIRO. Incorporation of the knowledge sources and ontologies is a major part of the GCW project because the incorporated terminology had to be made consistent with related terminology. For example, the GCW glossary analysis results were harmonized with EnvO and aligned with corresponding terms in the Semantic Web for Earth and Environment Technology (SWEET) ontology. The effort to align with SWEET has benefited SWEET by improving the definitions of its terms. Part (E) of MIRO is concerned not only with the results of the ontology development but also with the procedures and policies for its development. GCW developed templating methods to accelerate class creation, and spreadsheets were used to help update ontologies with definitions.
Specifically, the OBO Robot tool was used to facilitate EnvO collaboration. The Robot tool guides users through the process of creating new terms and is intended to be used by non-ontologists. Robot organizes new term requests in a standardized Google Sheet Template, and users can follow a step-by-step process to fill out the appropriate spreadsheet columns (Berg-Cross, 2021a). Finally, to fulfill part (G), EnvO and SWEET terms have been aligned with other OBO Foundry ontology terms (Ontology Tools and Resources, 2021).
Lessons learned and best practices
In this section, some of the best practices for harmonization of definitions are presented. For general advice about developing natural language definitions, see the Guidelines for Writing Definitions in Ontologies (Seppälä et al., 2017). The guidelines include advice such as: be brief, align, re-use, extend, and revise semantic resources. To ensure that definitions are brief, one should put other, more encyclopedic information in annotation notes.
Organizing the terms into a concept system (such as a network of concepts) is helpful for the alignment of terminology by showing all of the relationships between terms. A concept system can help to reduce ambiguity by helping to clarify the contexts of the terms. The concept system can also be helpful during any subsequent ontology development activities. Typically, the first step in any ontology development is to develop a taxonomy. A term definition that adds one or a few constraining characteristics to the nearest generically superordinate concept is specifying a taxonomic relationship. So definition development and taxonomy development are closely related. Reusing existing vocabularies, such as Schema.org, DCAT2, VIVO, DDI, etc., provides useful sources for the superordinate concepts of definitions; and hence also for the taxonomy (Berg-Cross, 2021b).
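The link between definitions and taxonomy noted above can be sketched in a few lines. The sketch assumes each definition is recorded as a pair of a superordinate concept (genus) and a constraining characteristic (differentia). The two entries are illustrative placeholders, not the harmonized GCW or EnvO definitions, and the function superordinates is a hypothetical helper.

```python
# term -> (nearest superordinate concept, constraining characteristic)
# Illustrative entries only; these are not the GCW or EnvO definitions.
DEFINITIONS = {
    "ice mass": ("environmental material", "composed of frozen water"),
    "glacier": ("ice mass", "that flows under its own weight"),
}


def superordinates(term, definitions):
    """Follow the genus of each definition upward to recover the taxonomic chain."""
    chain = []
    seen = {term}
    while term in definitions:
        genus = definitions[term][0]
        if genus in seen:
            raise ValueError(f"circular definition involving {genus!r}")
        chain.append(genus)
        seen.add(genus)
        term = genus
    return chain


print(superordinates("glacier", DEFINITIONS))
```

The chain stops at the first term that has no definition of its own, which corresponds to a term that is deliberately left undefined, and the cycle check flags definitions that would otherwise be circular.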
The relationships in a concept system should be as precise as possible. In particular, one should avoid vague comparisons, such as using the word “similar”. There will usually be many similar terms as well as terms with overlapping meanings. Much better relationships include subtypes, part-whole relationships, roles, influences, production (output) relationships, etc. For example, instead of describing a glacier as being similar to an ice mass, one can specify that “a glacier is a type of ice mass.” When specifying relationships between terms, one should be sensitive to issues of granularity. Metonymy is the naming of a thing by something related to it, for example, using the name of the whole for a part, or vice versa. Figures of speech such as metonymy are so commonplace that one may not be aware that one is using a figure of speech at all, but they should be avoided in ontologies. In natural language, relationships are often expressed by means of lexical modifiers like adjectives and adverbs, and such terms should also be well defined.
The use of a logical language for a definition or part of a definition allows the use of automated reasoning, and helps human-machine communication by making language more precise and less ambiguous (Seppälä, 2021; Woods and Low, 2021). While there are differences between textual and logical definitions (Seppälä et al., 2016a), it is desirable that their contents align to ensure consistent use of ontologies by humans and machines. Seppälä (2021) outlined a practical method to systematize and harmonize definitions in ontologies. The method can be applied both to textual and logical definitions to accelerate their creation. It may also be used for definition checking, quality control of ontologies, and automatic generation of definitions and axioms. The idea is to create templates for textual and logical definitions starting from one or more textual definitions, and leveraging classes and properties of existing ontologies when possible. The method consists, first, in abstracting the contents of a “seed” textual definition in terms of classes and properties already available in an existing ontology, or by creating new classes and properties, to produce a mapping between parts of the text and parts of the logical definition; second, in creating the corresponding template for the logical definition using the same ontology classes and properties. This ensures a close alignment between the textual and the logical definitions of an ontology. The semi-structured textual templates and the logical templates thus obtained can be generalized and specialized for new sub-categories thus ensuring harmonization.
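The template idea can be sketched as a pair of templates filled from the same slots, so that the textual and logical definitions cannot drift apart. This is only an illustration of the general approach, not the method as published: the template wording, the Manchester-like output notation, the `to_id` naming scheme, and the property `formedFrom` are all assumptions.

```python
from string import Template

# Hypothetical template pair abstracted from a "seed" definition of the form
# "Every <term> is a kind of <genus> that <property> <filler>."
TEXT_TEMPLATE = Template("Every $term is a kind of $genus that $property_label $filler.")
LOGIC_TEMPLATE = Template(
    "Class: $term_id EquivalentTo: $genus_id and ($property_id some $filler_id)"
)


def to_id(label):
    """Turn a natural language label into a CamelCase identifier (assumed scheme)."""
    return "".join(word.capitalize() for word in label.split())


def instantiate(term, genus, property_label, property_id, filler):
    """Fill both templates from the same slots so the two definitions stay aligned."""
    text = TEXT_TEMPLATE.substitute(
        term=term, genus=genus, property_label=property_label, filler=filler
    )
    logic = LOGIC_TEMPLATE.substitute(
        term_id=to_id(term),
        genus_id=to_id(genus),
        property_id=property_id,
        filler_id=to_id(filler),
    )
    return text, logic


# Illustrative slot values only.
text, logic = instantiate(
    "glacier", "ice mass", "is formed from", "formedFrom", "compacted snow"
)
```

Specializing the template for a new sub-category then amounts to calling `instantiate` with different slot values.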
Neuro-symbolic learning ontologies
Symbolic reasoning has a long history and continues to be an active area of research. Machine learning, which relies largely on sub-symbolic methods, is also a very active area of research. Although both are part of AI, these two areas have been developed on clearly distinct technical foundations and by separate research communities. The two areas have complementary strengths and weaknesses. Bridging the gap between symbolic and sub-symbolic approaches to AI is a long-standing unresolved challenge, and integrating these two areas is now the subject of growing research interest in AI. Bridging this gap was addressed at the Ontology Summit 2017 (Baclawski et al., 2018b), but new AI techniques, especially in machine learning, have since been developed, so revisiting this topic is timely.
Neuro-symbolic learning aims to integrate neural learning with symbolic approaches typically used in computational logic and knowledge representation in AI. One benefit of such an integration is the development of effective knowledge extraction methods towards explainable AI (Gaur, Faldu, and Sheth, 2021; Lamb, 2021; Sheth, 2021), but there are many other advantages (Sriram, 2021). While there are significant benefits for tighter integration of neural and symbolic paradigms, it is not known how best to integrate them, and many integration architectures have been proposed. Symbolic models can be the result of, or the basis for, different stages of a neural process. This section provides a survey and rough categorization of the large variety of neuro-symbolic architectures that are now being developed, but one can expect new architectures to be developed by researchers in this very active field.
Several architectures have been used and proposed for integrating symbolic and sub-symbolic methods (Kautz, 2021). The simplest and most common architecture is one in which symbolic data (e.g., documents) are processed with symbolic techniques to produce vectors that are then used as input to a sub-symbolic module (e.g., a neural network). The vector output of the sub-symbolic module is then interpreted in symbolic form using symbolic techniques. In other words, the sub-symbolic module is subsidiary to a symbolic enclosing architecture. A more elaborate version of this architecture uses multiple sub-symbolic submodules that are essentially treated as subroutines. Self-driving vehicles usually use this architecture.
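The simplest architecture described above can be sketched as a three-stage pipeline: a symbolic encoder, a sub-symbolic scorer, and a symbolic decoder. The vocabulary, labels, and hand-set weights below are hypothetical stand-ins; in practice the middle stage would be a trained neural network.

```python
VOCABULARY = ["glacier", "ice", "snow", "river", "water", "flow"]  # hypothetical
LABELS = ["cryosphere", "hydrosphere"]                             # hypothetical

# Hand-set weights standing in for a trained network:
# one row per label, one column per vocabulary word.
WEIGHTS = [
    [1.0, 1.0, 1.0, 0.0, 0.2, 0.1],
    [0.0, 0.1, 0.0, 1.0, 1.0, 1.0],
]


def encode(document):
    """Symbolic step: turn a text into a bag-of-words count vector."""
    tokens = document.lower().split()
    return [float(tokens.count(word)) for word in VOCABULARY]


def sub_symbolic(vector):
    """Sub-symbolic step: a single linear layer producing one score per label."""
    return [sum(w * x for w, x in zip(row, vector)) for row in WEIGHTS]


def decode(scores):
    """Symbolic step: interpret the output vector as a label."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best]


def classify(document):
    """The enclosing symbolic architecture calls the sub-symbolic module like a subroutine."""
    return decode(sub_symbolic(encode(document)))


first = classify("The glacier is ice and snow")
second = classify("river water flow")
```

The more elaborate variant mentioned in the text would replace `sub_symbolic` with several such modules, each invoked by the symbolic wrapper as needed.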
An alternative approach is to reverse the roles of symbolic and sub-symbolic to get an architecture in which it is the sub-symbolic system that is invoking a symbolic module, or possibly several symbolic submodules. The advantage of reversing the roles of symbolic and sub-symbolic is that doing so allows for very complex decision making, since symbolic reasoners can perform combinatorial reasoning much more scalably and efficiently than sub-symbolic systems.
In either of the approaches discussed above, the symbolic and sub-symbolic modules and submodules are presumed to have already been programmed and trained. No learning or any other kind of adaptation takes place during normal processing. Moreover, other than invoking each other, the symbolic and sub-symbolic modules do not influence each other. Some recently developed architectures have incorporated symbolic techniques within sub-symbolic modules or vice versa. One class of such architectures uses sub-symbolic techniques to generate symbolic modules. Examples of such architectures are tensor product representations and logic tensor networks, which can find generalization and part-whole hierarchies. In other words, these newly developed architectures can generate ontologies, or at least some aspects of ontologies. The forms of reasoning that can be incorporated include temporal logic, description logic, and first-order predicate logic (Hitzler, 2021a). While this technique involves a deeper integration of symbolic and sub-symbolic methods, the integration does not take place during normal processing. An approach that includes learning during normal processing is one that uses sub-symbolic methods, such as backpropagation, to train a symbolic system. In such a system, backpropagation is invoked whenever the system makes a mistake. This technique can be useful for conversational question-answering systems.
One can also reverse the roles of symbolic and sub-symbolic techniques, compared with the class of architectures in the previous paragraph. Such architectures use symbolic techniques to generate training data for sub-symbolic modules. For example, instances of logical inferences, expressed as input-output pairs, can be used to train a sub-symbolic model. Training with logical inference examples is primarily useful for mathematical problems, and it is surprisingly effective, although the trained model will sometimes make mistakes (Kapanipathi, 2021).
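Generating training data with a symbolic reasoner can be sketched as follows. The sketch assumes a simple propositional rule format and uses forward chaining to compute what follows from each set of starting facts; the rule set and atom names are hypothetical, and the resulting pairs would then be fed to whatever sub-symbolic model is being trained.

```python
import itertools


def forward_chain(facts, rules):
    """Compute all facts derivable from `facts` using (premises, conclusion) rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and set(premises) <= derived:
                derived.add(conclusion)
                changed = True
    return derived


def training_pairs(atoms, rules, max_facts=2):
    """Build (input facts, inferred facts) pairs for every small set of starting facts."""
    pairs = []
    for size in range(1, max_facts + 1):
        for facts in itertools.combinations(atoms, size):
            inferred = forward_chain(facts, rules) - set(facts)
            pairs.append((facts, tuple(sorted(inferred))))
    return pairs


# Hypothetical rule set: a -> b, and (b and c) -> d.
RULES = [(("a",), "b"), (("b", "c"), "d")]
pairs = training_pairs(["a", "b", "c"], RULES)
```

Each pair records one correct inference, so the labels are exact even though a model trained on them may still generalize imperfectly.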
One reason why sub-symbolic methods need knowledge-based methods is the intuition, based on human behavior, that intelligence necessarily involves learning, knowledge from experience, and reasoning, which could be expressed informally as the equation: intelligence = learning + knowledge from experience + reasoning.
The last approach is significantly different from the other approaches. Instead of employing machine learning techniques for the sub-symbolic modules, this approach uses signal processing (SP) techniques. In this architecture one starts with a “seed” ontology and then augments it incrementally using SP techniques applied to knowledge graphs extracted from various sources, such as documents. Signal processing is a highly developed and active area with a large community, and a connection between this community and the ontology community could have many benefits (Majumdar, 2021; Sowa, 2021; Baclawski et al., 2020b).
Sustainability of ontologies
Many organizations, including government agencies, standards bodies and commercial firms, use ontologies and have developed tools for various ontological activities, such as creation, evolution, mapping and other forms of harmonization. Sustainability requires a firm foundation. In general, building a firm foundation for sustainability requires addressing three issues: economic viability, social equity and environmental protection. One of the most important requirements of sustainability is to ensure that there is sufficient funding for maintaining the ontology for as long as its purpose remains relevant. The manner in which resources are allocated and monitored determines whether economic viability is achieved. Without proper oversight, economic viability cannot be maintained.
However, sustainability involves addressing much more than simply ensuring sufficient funding. It is also important to ensure social equity. As is the case with many institutions, companies, and society in general, ontologies can have biases. This can occur whether an ontology was developed by people or by an automated method, such as machine learning. The problem is that data invariably has biases to a greater or lesser extent, and automated techniques cannot find or correct them on their own (Suresh and Guttag, 2019). While combining symbolic knowledge with machine learning can help to discover and mitigate biases, it is important to accept that addressing bias remains a vital part of ontology development and maintenance. Using standards and rigorous methodology can also assist in ensuring that social equity is adequately addressed. Standards provide a common baseline for the involved parties, so that a wider group of people are able to fully participate in development. A thorough approach helps establish quality, which also encourages an unbiased approach (Dickerson, 2021).
The final issue that must be addressed is environmental protection. For an ontology, this refers to the human environment that surrounds the specific community that developed the ontology. It is important to recognize that communities and their ontologies do not exist in isolation. One must maintain avenues of communication and cooperation with adjacent and other related communities. Well designed definitions, documentation and harmonization, as discussed in Sections 3 and 4 above, can help address both social equity and environmental protection issues.
The sustainability issues discussed above illustrate that there is much more to ontology development than simply creating the ontology. The first step is to plan for infrastructure concerns, the engagement of stakeholders, and the scheduling of associated tasks (Franch and Ruhe, 2016). Once an ontology is released and is actively being used by those outside the development group, the need for revisions arises. A development team can only anticipate so many potential issues (Kotis, Vouros, and Spiliotopoulos, 2020). Those issues must be managed appropriately, or the post-release viability of the ontology will be in jeopardy.
Mechanisms must be implemented to facilitate revisions and, as appropriate, expansions to the original model. Well-designed feedback and editing channels, including templates, foster a robust environment in which an ontology can mature (Blasko, Kremen, and Kouba, 2015). Stakeholders can be resources to edit the content of an ontology as well as enforce equity through promotion of standards. Maintaining the technical infrastructure furthers the intellectual and collaborative infrastructure required to sustain ontologies for the long term. Ever-changing formats, languages, platforms and tools also make it hard to sustain ontology repositories, which was the topic of the Ontology Summit 2008 (Obrst et al., 2008).
The EnvO example in Sections 3.1 and 3.2 above illustrates how a community is addressing the sustainability of their ontology. The GCW glossary analysis results were harmonized with EnvO and aligned with corresponding terms in the SWEET ontology. SWEET is a lightweight ontology with broad coverage but sporadic definitions; historically, it served as a starting point for concepts within the Earth Sciences. Richer semantics were often added for particular domains, and spinoffs from SWEET were created. In comparison to EnvO’s concepts, SWEET concepts are less axiomatized and, as noted, fewer terms have definitions. Many of the legacy terms that are in SWEET were lifted from online sources like Wikipedia and were not subjected to analysis by domain experts. More recently, portions of SWEET were updated in a new release in 2021. The lessons that the EnvO community learned are valuable for other communities.
As research methodology and scope have expanded with technological advances, the management of ontologies and other related semantic resources has become a critical and distinct component of the ontology lifecycle. The scale and diversity of new semantic resources, such as knowledge graphs, neuro-symbolic generated products and domain vocabularies, require a reexamination of ontological engineering practices and of the various roles of ontologies in the overall semantic research enterprise.
Summary and conclusion
The proliferation of ontologies of many different types, purposes and roles has created an urgent need for ontology harmonization to improve communication between people, between people and machines and between machines. This Communiqué has surveyed the issues for ontology generation and harmonization. As there are many stakeholders that have an influence on or are impacted by these issues, the following summarization is organized by the various kinds of stakeholders.
The highest level stakeholders are communities and organizations who sponsor ontology development projects, either alone or as part of other projects. At this level, it is important to ensure that all three pillars of sustainability are well founded. Community agreement is needed for ontology extensions and revisions. Good mechanisms for community discussion are important, as are partner agreements with groups with domain vocabularies. A variety of tools are used for coordination and harmonization, such as Slack and GitHub, but community members are not necessarily skilled in the use of these tools.
The project managers of a project that includes ontology development are important stakeholders for ontology generation and harmonization. The effort to control the meaning of terms in vocabularies requires a lifecycle of its own, which must be managed like other digital data lifecycles. Project managers are responsible for selecting and enforcing appropriate style guidelines in general, and style guidelines for definitions in particular, such as the Guidelines for Writing Definitions in Ontologies and the MIRO guidelines. One should also maintain access to the vocabulary for reuse or for alignment with ontologies.
Project managers and developers must collaborate to make important high-level decisions during the ontology development process. A well-constructed foundational ontology can aid in ontological analysis, but can also affect other decisions and the development process. So, selecting a foundational ontology must be very carefully done. It may also be necessary to select a neuro-symbolic architecture that will be used for integrating symbolic and sub-symbolic techniques.
The end users of ontologies and ontology-based systems are stakeholders, and ontology developers should collaborate with end users to ensure that the technical language of the developer is consistent with the natural language of the end user. The use of natural language for the symbols of an ontology can result in confusion because of existing meanings that humans have for the symbols. When communicating with natural language, one must be aware of how people categorize the world. Unlike the classes and properties of ontologies, human categories “shimmer” (Hanks and Jezek, 2008; Baclawski, 2021). Accordingly, ontologies should recognize the distinctions between the needs of humans and machines. One of the functions of a definition is to adjust the readers’ or systems’ inferential competences, i.e., what they infer when encountering the term that is defined. One effective means for communication is to use narrative and event linguistic theory. The logic underlying an ontology should be selected according to the requirements of the users of the ontology. It may be important to include an appropriate notion of uncertainty.
There is another kind of stakeholder who is only indirectly part of ontology development but who has a significant impact, namely, the ontology researcher. Ontology harmonization can be very time-consuming, so tools are important to simplify the effort required as well as to manage the effort over time. The challenge is to develop better tools for harmonization. The scale and diversity of new semantic resources, such as knowledge graphs, neuro-symbolic generated products and domain vocabularies, require a reexamination of ontological engineering practices. Learning and training in these new semantic resources, and in their use in operational ontology practices, are important. Some preliminary mechanisms to maintain access to vocabularies for reuse or for alignment with ontologies have been established, but better mechanisms are needed. Another challenge is to develop better tools and techniques for reaching agreement efficiently with a diverse community.
The ontology developers have important responsibilities to ensure that the domain of the ontology is properly understood and documented. One important prerequisite for understanding a domain is to survey all existing relevant references. Having collected relevant terminology, the terms should be carefully analyzed and defined. Remember that it is not enough to standardize the terms; their meanings or expected interpretations should be standardized too. Writing good definitions is essential for standardizing meaning, but the first step in writing definitions is to accept that one does not know how to do it. The next step is to start learning how. There are now excellent guidelines for writing definitions. Some other concerns of definition development include avoiding vague comparisons, sensitivity to levels of granularity, and adequate alignment between logical and textual definitions.
Acknowledgements
Certain commercial software systems are identified in this paper. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology (NIST) or by the organizations of the authors or the endorsers of this Communiqué; nor does it imply that the products identified are necessarily the best available for the purpose. Further, any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NIST or any other supporting U.S. or European governments or corporate organizations.
Dr. Selja Seppälä is a Marie Skłodowska-Curie Career-FIT Research Fellow under the number MF20180003. She is grateful for the funding received from the European Union’s Marie Skłodowska-Curie grant agreement No. 713654.
We wish to acknowledge the support of the ontology community, especially the invited speakers and participants who contributed to the Ontology Summit. There were many invited speakers, some of whom gave presentations at more than one session. The complete list of sessions, speakers, and links to presentation slides and video recordings is available at
