Abstract
The TNM classification (Tumor-Node-Metastasis) is the most important coding scheme used to stage tumors based on size and location. Its coding rules may change across different TNM versions, such that the same tumor is represented by different codes in different versions. We present an ontology-based modular architecture for the management of TNM, using the coding rules for pancreas tumors in the considerably different TNM versions 7 and 8 in order to demonstrate how mappings (in the sense of re-classification) between TNM versions can be supported. To enable re-classification of tumor instances between TNM versions, mapping ontologies were created. This work describes two version mapping approaches, one using SWRL rules and the other using intermediate classes that represent the mapping criteria between the TNM versions. We show that tumor instances with defined characteristics were correctly classified in different TNM versions. In addition, we demonstrate ontological inconsistencies in classification systems based on informal text labels, as well as possible conversion problems due to different categorization criteria in different TNM versions.
Introduction
In oncology, the process of clinical and pathological characterization of malignant tumors in terms of tumor stages, called staging, is a fundamental component of cancer diagnosis and management (Webber et al., 2014). By far the most important coding system for tumor staging is the Tumor-Node-Metastasis classification, usually abbreviated as TNM, describing the growth and spread of solid malignant tumors (Sobin, Gospodarowicz, Wittekind, & International Union against Cancer, 2010).
As presented in previous work (Boeker, Faria, & Schulz, 2014; Boeker, França, Bronsert, & Schulz, 2016), TNM has been translated into formal logic, i.e., the TNM Ontology (TNM-O). Built into tools and applications, it is expected to support the coding of tumor-related findings and the interpretation of TNM codes. A possible use is the automatic classification of instance data extracted from clinical databases or electronic health records. Decomposing the TNM classes into all their defining criteria could help to detect logical inconsistencies and ambiguities in the TNM definitions through automated reasoning, thus contributing to the refinement of the TNM. The first models using TNM-O for feasibility testing were created for breast (Boeker et al., 2014) and colorectal cancer (Boeker et al., 2016). It was demonstrated that this approach can be used for accurate automatic classification of clinical data (Boeker et al., 2016).
Like other scientific classifications, the TNM staging system is subject to revisions based on scientific progress. However, revisions in TNM coding criteria affect the consistency and comparability of TNM-coded clinical data across different TNM versions.
Conventional, lexical terminology mappings between versions of a terminology are insufficient to provide insight regarding changes to the exact meaning of codes between versions. Ontology-based terminology systems have the potential to describe formally the meaning of a given code and the semantic changes to its meaning between versions and to make them amenable to logic-based reasoning.
This is the goal of the work presented. Its objectives are (1) to demonstrate two different methods of semantics-enabled versioning of TNM-O, and (2) to formally and empirically evaluate and compare these methods on a TNM-O representation of pancreatic cancer in the TNM versions 7 and 8.
In the following sections, we provide more detailed background information, starting with an overview of biomedical classification systems and conversion problems in general. We then provide a more thorough introduction to TNM and its ontological representation, and introduce the example of pancreatic cancer, which reveals significant differences between TNM versions.
Background
Biomedical classification systems and conversion problems
Terminology systems have a long tradition in medicine, where they offer standardized terms and codes with defined meanings and criteria for unambiguous categorization of medical items in order to facilitate communication in clinical routine and biomedical research. Regarding their main purpose, two major categories of terminology systems can be distinguished (Ingenerf, 2015; Schulz, Rodrigues, Rector, & Chute, 2017). On the one hand, language-based or concept-oriented systems provide standardized terms. Such systems are known as controlled vocabularies, thesauri and ontologies. On the other hand, statistics-oriented classification systems categorize items into non-overlapping classes, guided by classification rules. Such systems are known as aggregation terminologies. An example of the latter is the WHO International Classification of Diseases (ICD-10) (World Health Organization, 2018), while examples of the former are SNOMED CT (SNOMED International, 2018), MeSH (Nelson, Johnston, & Humphreys, 2001; US National Library of Medicine, 2018a) and the NCI thesaurus (National Cancer Institute, 2018).
Terminology systems can furthermore be distinguished by their level of formal-ontological grounding. While traditional terminology systems use text definitions and scope notes, ontology-based terminology systems make use of formal axioms in a computable language, e.g. OWL (W3C, 2018).
Most medical terminology systems address specific topics and purposes, such as ICD-10 for diagnosis coding, ICNP for nursing documentation (International Council of Nurses, 2018) or MeSH for biomedical literature indexing and retrieval. There are hundreds of such systems, each with a different scope, granularity and application context. The need to manage overlap and use common identifiers for overlapping content is addressed by the Unified Medical Language System (UMLS) Metathesaurus (Lindberg, Humphreys, & McCray, 1993; US National Library of Medicine, 2018b).
Terminology mapping is not only an issue when different terminology systems are to be aligned, e.g. for the integration of heterogeneous data, but it also matters in the management of different versions of the same terminology system. Changes in meaning or new categorization criteria give rise to problems of comparability of the items identified by a certain code. Such criteria may change over time due to scientific progress or emerging user needs.
Since the same codes may represent different meanings in different versions, it may be difficult to compare data annotated by a given code. An example of this was the transition between the ICD-9 and ICD-10, where, for 13% of the classes, mapping tables had to be created and where, in roughly a third of these cases (4% of the total) mapping was difficult or even impossible (Schulz, Zaiss, Brunner, Spinner, & Klar, 1998).
The TNM classification for malignant tumors
TNM, a cornerstone of the diagnosis and management of solid tumors (Webber et al., 2014), is characterized by a “shorthand notation” describing the growth of the local primary tumor (T), the spread to regional lymph nodes (N) and distant metastases (M) (Sobin et al., 2010). Based on the axes T, N and M, stage groupings are assigned to tumors with similar prognoses. Developed in France in the 1940s by Pierre Denoix and first published by the Union for International Cancer Control (UICC) in 1968, TNM has become the globally accepted basis for cancer staging (Brierley, 2006). Since then, the system has undergone several revisions reflecting the progress of scientific knowledge, with the 7th edition (Sobin et al., 2010) first published in 2009 and the 8th edition in 2017 (Brierley, Gospodarowicz, & Wittekind, 2017).
TNM’s objectives are six-fold: it supports treatment planning, prediction of outcomes (prognosis), evaluation of treatment results, exchange of information between stakeholders in the treatment process, continuing research in malignant diseases, and cancer control (Sobin et al., 2010; Webber et al., 2014). In short, TNM provides a common semantic reference for cancer management, research and information exchange (Sobin et al., 2010).
The three main axes T, N and M, together with alphanumeric modifiers, describe the extent of the tumor (cf. Table 1). With a prefix, the pre-treatment cTNM (c = clinical) and post-surgical pTNM (p = pathological) classifications are distinguished.
The main codes of the TNM coding scheme
There are several other prefixes and suffixes used for additional cases and special purposes (Sobin et al., 2010).
Figure 1 shows an example of a pancreas tumor with cancer spread into the regional lymph nodes.
Depending on the type of tumor, further subdivisions are possible and indicated by lower case characters (e.g. N2a and N2b). A series of additional symbols exists, which this work will not address.
TNM is different for each anatomical region, which yields more than sixty different sets of rules. The coding rules for the primary tumor (T codes) are based on tumor size and extension into neighboring structures for most organs, but there are some special criteria for certain tumor locations. For the metastatic structures (N and M codes) the coding rules may rely just on their presence or absence, but may also rely on their number, location or other criteria.
Like other scientific classifications, the TNM staging system is subject to revisions based on scientific progress. The revision process includes a formalized system for submitting proposals for changes directly to the Union for International Cancer Control (UICC) and an annual review of the relevant scientific literature, both of which are evaluated by a group of international, multidisciplinary expert panels (Webber et al., 2014).
However, revisions to TNM coding criteria affect the consistency and comparability of clinical data coded according to different TNM versions. A statistical evaluation is not possible without a re-classification, if, e.g., the same tumor is coded as T2 in one version and T3 in another version, or if the same code, like T2, means something different in two versions. The tumor characteristics on which the coding rules for a specific TNM code are based may be completely different between versions, or some subclasses may be split into several more granular subclasses. Where this is the case, there is no way, in principle, to form a simple mapping between the original class and the new more granular subclasses without the further information needed to distinguish them.

Schema of a pancreas tumor spreading into some of the regional lymph nodes. With a tumor size of more than 2 cm and in the absence of distant metastasis, this tumor would be classified as T2 N1 M0 (copyright with authors).
For pancreas tumors, there are considerable differences between TNM versions 7 and 8. TNM7 had only one set of rules for all pancreas tumors, whereas TNM8 distinguishes two sets of coding rules: one for tumors of the exocrine pancreas and one for well-differentiated tumors of the neuroendocrine pancreas of grades 1 and 2 (that is, tumors whose cells do not deviate markedly in appearance from normal cells). While the traditional anatomic view treated the pancreas as one organ with macroscopically homogeneous tissue, the current functional view of human anatomy considers the exocrine and neuroendocrine pancreases as two different organs, with two morphologically and functionally distinct tissue types. The exocrine pancreas produces pancreatic secretions which digest fats and carbohydrates in the small intestine, while the neuroendocrine pancreas produces insulin and other hormones which are released into the blood. The classification of neuroendocrine tumors of higher grades uses the rules for the exocrine pancreas. Another significant difference is based on the finding that the size of the tumor is a stronger predictor of survival than the extension beyond the pancreas, the latter being difficult to interpret due to the anatomical constitution of the pancreas surface structure (Allen et al., 2017).
For pancreas tumors, all possible kinds of mapping relations between the classes in the two versions can be found. In the context of this paper, the expression “mapping” is also used for re-classification between versions via OWL axioms. The tumor characteristics on which the coding rules for a specific TNM code are based:
can be identical in TNM7 and TNM8
can be a subset of tumor characteristics necessary for mapping to a code in the other version
can be an intersection (share some aspects and not others)
can be completely different
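The four kinds of relation listed above can be illustrated with a minimal sketch that treats the defining criteria of a code as a set of labels. This is only an illustration under stated assumptions: the criterion labels (such as `size_gt_2cm`) and the `relate` helper are hypothetical and are not taken from TNM-O, and "subset" is read here as one criteria set being strictly contained in the other.

```python
from enum import Enum


class Relation(Enum):
    IDENTICAL = "identical"
    SUBSET = "subset"
    INTERSECTION = "intersection"
    DISJOINT = "completely different"


def relate(criteria_a: frozenset, criteria_b: frozenset) -> Relation:
    """Compare the defining criteria of two codes from different versions."""
    if criteria_a == criteria_b:
        return Relation.IDENTICAL
    # Strict containment in either direction counts as the subset case.
    if criteria_a < criteria_b or criteria_b < criteria_a:
        return Relation.SUBSET
    # Some criteria shared, others not.
    if criteria_a & criteria_b:
        return Relation.INTERSECTION
    return Relation.DISJOINT


# Hypothetical criterion labels, chosen for illustration only.
tnm7_t2 = frozenset({"confined_to_pancreas", "size_gt_2cm"})
tnm8_exo_t2 = frozenset({"size_gt_2cm", "size_le_4cm"})
tnm7_t3 = frozenset({"beyond_pancreas", "no_celiac_axis_or_sma"})
tnm8_exo_t3 = frozenset({"size_gt_4cm"})

print(relate(tnm7_t2, tnm8_exo_t2).value)
print(relate(tnm7_t3, tnm8_exo_t3).value)
```

Only the identical case allows a code to be carried over without further information; in the other cases additional tumor characteristics are needed for re-classification.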
Table 2 lists the coding rules for pancreas tumors in both TNM versions, in order to present these differences more clearly.
Coding rules for pancreas tumors in TNM7 and in TNM8 with different coding rules for exocrine and neuroendocrine tumors (slightly abbreviated). Rules for the codes TX (Tumor cannot be assessed), T0 (No evidence of tumor), NX (Regional lymph nodes cannot be assessed), N0 (No regional lymph node metastasis) and M0 (No distant metastasis) are identical in both TNM versions (Sobin et al., 2010; Brierley et al., 2017) and are not listed here
Includes invasion of peripancreatic soft tissue.
A particular example of a change between TNM7 and TNM8 concerns a tumor that grows beyond the pancreas without invading two specific vessels, which was T3 in TNM7 (Fig. 2).

Pancreas tumor infiltrating tissue outside the pancreas (schematic). This tumor, which extends beyond the pancreas but without involvement of the celiac axis or the superior mesenteric artery, is represented by the TNM7 code T3. In TNM8, additional information – the size of the tumor – is necessary to assign the representing TNM code (copyright with authors).
A TNM7 T3 code can be translated into any of the codes T1, T2, T3 or T4 in TNM8, depending on other tumor characteristics. Examples of this include:
A tumor which invades the soft tissue surrounding the pancreas – without invading the celiac axis or the superior mesenteric artery – fulfills the criteria to be coded as T3 in TNM7 (“
A tumor which invades the common hepatic artery is also coded as T3 in TNM7, as it “invades structures beyond the pancreas, but not celiac axis or mesenteric artery” (see Table 2). This tumor would be coded as T4 according to the coding rules for exocrine pancreas tumors in TNM8.
These examples are listed in Table 3, which shows the possible mappings between TNM7 and exocrine pancreas TNM8 for all cases of tumors classified as T3 in TNM7.
As there are different sets of coding rules in TNM8 for tumors of the exocrine pancreas and the neuroendocrine pancreas, the mappings from TNM7 to TNM8 for neuroendocrine pancreas tumors are different (not listed here).
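The one-to-many character of this mapping can be sketched in a few lines of Python. The function below is a hypothetical illustration, not part of TNM-O: it takes a tumor already coded as T3 in TNM7 and returns the set of TNM8 exocrine pancreas T codes that remain possible given the available information. The size thresholds (T1 up to 2 cm, T2 above 2 and up to 4 cm, T3 above 4 cm) and the T4 criterion for the common hepatic artery are assumed from the TNM8 rules as summarized in the text and should be checked against Table 2.

```python
from typing import Optional, Set


def reclassify_tnm7_t3(
    size_cm: Optional[float] = None,
    invades_common_hepatic_artery: Optional[bool] = None,
) -> Set[str]:
    """Return the TNM8 (exocrine pancreas) T codes still possible for a TNM7 T3 tumor.

    None means the characteristic is unknown. A TNM7 T3 tumor by definition
    does not involve the celiac axis or the superior mesenteric artery, so
    only the common hepatic artery can lead to T4 here.
    """
    if invades_common_hepatic_artery is True:
        return {"T4"}

    candidates: Set[str] = set()
    if invades_common_hepatic_artery is None:
        # Vessel involvement unknown: T4 cannot be excluded.
        candidates.add("T4")

    if size_cm is None:
        # Size unknown: every size-based code remains possible.
        candidates |= {"T1", "T2", "T3"}
    elif size_cm <= 2:
        candidates.add("T1")
    elif size_cm <= 4:
        candidates.add("T2")
    else:
        candidates.add("T3")
    return candidates


print(reclassify_tnm7_t3(size_cm=3.0, invades_common_hepatic_artery=False))
print(sorted(reclassify_tnm7_t3()))
```

With no additional information the result contains all four codes, which mirrors the statement that a TNM7 T3 code alone cannot be translated unambiguously.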
How an exocrine pancreas tumor coded as T3 in TNM7 (
The criteria relevant for the assignment of the code in either TNM version are marked in bold.
The versioning issues described above, whose scope and impact require accurate human interpretation of the textual rules and descriptions in the authoritative TNM sources, motivate the central hypothesis of this paper: that expressing all TNM versions in a formal, computable language has advantages over purely textual descriptions of codes.
Such an approach, for which we would prefer to use ‘description logics’ (Baader, 2010) and which we have discussed in previous work (Boeker et al., 2014, 2016), would constitute a knowledge base that supports the coding of tumor-related findings and the interpretation of TNM codes. Implemented by appropriate tools, it could also automatically classify instance data extracted from clinical databases, electronic health records, or semantic extracts thereof (Boeker et al., 2016).
Decomposing the TNM classes into their defining criteria rooted in ontological standards, e.g. for anatomy, would help to detect inconsistencies and ambiguities through automated reasoning (Boeker et al., 2016). It would therefore be a foundation for TNM maintenance and refinement across versions.
Feasibility testing has been carried out for TNM-O models for breast (Boeker et al., 2014) and colorectal (Boeker et al., 2016) cancer. It demonstrated the value of this approach for accurate automatic classification of clinical data (Boeker et al., 2016).
In this paper we describe a new dimension of this endeavor, viz. TNM inter-version mapping, with the purpose of supporting tumor re-classification across TNM versions. Having shown that a logics-based rendering of TNM knowledge is feasible, we argue here that such a formal basis is equally tenable under formal-ontological assumptions, based on the realist tenet that ontologies are not use-case dependent descriptions of data or human knowledge, but rather universal axiomatizations of (classes of) entities of a domain. In the tumor domain, this means that pieces of tissue, organs, morphological structures as well as their inherent qualities and acts of diagnostic observations are to be represented.
Our work is also intended to serve as an example of how transitions between subsequent versions of ontology-based terminology systems can be expressed using formal methods.
In the case of TNM – in its use for classifying pathological entities – we contend that this particular aggregation terminology is suited to being consistently modeled and interpreted as an ontology in the above sense. This is despite other work which discouraged this for the “classical” clinical classification systems ICD-10 and ICD-11, whose referents are often blended with epistemic aspects (Bodenreider, Smith, & Burgun, 2004) and are therefore more accurately interpreted as diagnostic statements than as pathological entities (Schulz, Rodrigues, et al., 2017). The rationale for our assumption is the general understanding of pathologists’ findings as a gold standard, i.e. that their diagnostic conclusions denote pathological entities with negligible epistemic interference, in contrast to, e.g. clinical or radiological findings.1
This requires subscribing to the existence of, e.g. a T4 tumor as an observer-independent entity. A reference to this tumor, e.g. by a pathologist’s diagnosis “pT4” can be considered a precise account of the existence of this tumor, whereas a radiologist’s diagnosis cT4 should be seen as an approximate statement. The encodings themselves (distinguished between c and p) are contained in TNM-O only as subclasses of RepresentationalUnits that denote types of pathological entities.
OWL files were created using Protégé 5.2 (Musen & Protégé Team, 2015) in a modular approach. Organ- and version-specific ontologies were imported into the “hub”-ontology TNM-O (Boeker et al., 2016) under BioTopLite2 (Schulz & Boeker, 2013; Schulz, Boeker, & Martinez-Costa, 2017) as a domain top-level ontology. The module TNM-O-Anatomy denotes anatomical entities, following the structure and content of the Foundational Model of Anatomy (FMA) (Rosse & Mejino, 2003) whenever possible. To enable mapping between the pancreas tumor ontologies in TNM versions 7 and 8, each OWL file was imported into the TNM-O modular structure, and two OWL files containing the mapping information were added without changing the original OWL files. Here, mapping means the re-classification via OWL axioms.
Two mapping solutions were tested. In the first, SWRL rules defining the mapping criteria from TNM7 to TNM8 and vice versa were formulated in the human readable syntax described by Horrocks et al. (2004). The structure of these rules, which establish for each tumor class in one TNM version the additional criteria necessary to derive a tumor class in the other TNM version, is explained in detail in Section 4.4.
In the second mapping file, additional OWL classes and axioms were added to create a structure including every possible tumor subcategory, as explained further in Section 4.5. In this approach, additional representational entities with more than one TNM code were also created for cases where a TNM code could not be unambiguously assigned.
The ontologies were tested using the HermiT DL reasoner (Glimm, Horrocks, Motik, Stoilos, & Wang, 2014) version 1.3.8.
A Java-based classifier program for individuals (instances) using test data was developed employing the Java OWL API (Bechhofer, 2007), version 4.0.1, and the HermiT DL reasoner (Glimm et al., 2014), version 1.3.8.
Results
Modularization of TNM-O
Three pancreas-specific OWL files (Pancreas TNM7, Exocrine Pancreas TNM8 and Neuroendocrine Pancreas TNM8) were created following a structure similar to, albeit slightly improved over, the one already described for breast (Boeker et al., 2014) and colorectal cancer (Boeker et al., 2016). TNM-O.owl serves as the master OWL file, which imports all other OWL files, thus creating a modular structure as shown in Fig. 3.

Modular approach for the three pancreas TNM ontologies with the TNM-O as hub-ontology importing the top-level domain ontology BioTopLite2, TNM-Anatomy for the anatomical entities – taken from the Foundational Model of Anatomy as far as possible – and the ontologies for the three pancreas-specific TNM rules (Pancreas7, Pancreas8neuro or Pancreas8exo respectively). Ontologies for other organ sites and the mapping ontologies described further below are imported in the same way.
The basic structure of each of these ontologies is that a tumor located in a given anatomical region and with specific characteristics, e.g. a quality and its value, is represented by a TNM code (Fig. 4).

General structure of the TNM ontology. The upper part of the figure shows how top-level categories from the BioTopLite2 ontology (btl2) are related to the TNM ontology.
A tumor can be either a primary tumor or a metastatic tumor aggregate, defined as the mereological sum of a primary tumor and its metastatic regional lymph nodes and/or distant metastases. Classes for tumor qualities, value regions and the representational units (TNM codes) were created in the TNM-O hub ontology and can thus be re-used to create ontologies representing TNM coding rules for other organs. However, it is necessary to include the case in which a TNM code represents tissue without tumor cells. Therefore, the general class is not Tumor but (amount of) Anatomical Structure.
An example of the definition of a tumor class in the pancreas7 ontology is described below. This is the tumor class InvasivePancreasTumorNotBeyondCeliacTrunkOrSuperiorMesentericArtery which is represented by TNM7 code T3. According to TNM7 it is described as: “Invades structures beyond pancreas, but not celiac axis or superior mesenteric artery”. The example uses the following namespace prefixes: btl2: BioTopLite2, Pa7: Pancreas7-ontology, tnmo: TNM-O hub-ontology, tnmoFMA: TNM-Anatomy.
This tumor class is a subclass of btl2:MaterialObject:
The classes for the anatomical structures invaded by the tumor are also subclasses of btl2:MaterialObject, e.g.,
The classes describing the tumor quality (tnmo:Confinement) and the associated value (tnmo:Invasive) can be traced back to the top classes btl2:Quality and btl2:ValueRegion respectively. It could be seen as an under-specification that the growth behavior classes, e.g. tnmo:Invasive are modelled as primitive, but their full definition (participation in invasion process) would add complexity without any additional value for the intended use cases.
All classes describing TNM codes (e.g. Pa7:PancreasTNM7_pT3) are descendants of btl2:InformationObject, linked to Anatomical Structure classes by
The following example (Fig. 5) shows part of the hierarchical structure of the pancreas tumor classes in TNM7. It lists the tumor classes represented by the T-codes. This example also illustrates how tumor classes with the criteria “cannot be assessed” or “no evidence of primary tumor” were modelled. This is elaborated further in the discussion section.

Part of the hierarchical structure of the TNM7 ontology. This example lists the tumor classes representing the T-codes. Namespace prefixes: Pa7: pancreas TNM7, tnmo: TNM-O. TNM codes listed bold, in brackets.
Two different mapping approaches are introduced and compared in this and the subsequent section.
Each tumor class in the TNM ontologies is defined by specific criteria, and if these criteria differ across TNM versions, conversion between the version-specific definitions is performed by adding the information about the missing additional criteria. These mapping rules between TNM7 and TNM8 or vice versa were created using the Semantic Web Rule Language (SWRL), with rules for every possible conversion between the two TNM versions. The mapping rules follow the general structure:
An example of such a rule in human-readable syntax is listed below. It shows the rules for the re-classification of a tumor class from the TNM7 ontology, InvasivePancreasTumorNotBeyondCeliacTrunkOrSuperiorMesentericArtery, which is represented by TNM7 code T3 (cf. Table 3 for more examples). Because TNM8 does not denote a tumor with these exact conditions, this tumor can only be classified into a tumor class in one of the pancreas TNM8 ontologies if further information is provided. In the example below, the tumor is located in the exocrine pancreas and invades the common hepatic artery, and thus corresponds to InvasiveExocrinePancreasTumorInfiltratingDefinedBloodVessels, denoted by TNM8 code T4 (see Table 3). The SWRL rule is created as follows:
This mapping approach can be used to re-classify a tumor instance already classified in one TNM version, if the additional criteria needed to code in the other TNM version are known. Additionally, it has the advantage that all mapping rules can be easily listed using the SWRL tab in Protégé. The SWRL rules were built for all possible combinations of tumor characteristics specified in TNM7 and TNM8 and tested with individuals representing each of these cases. In total, 39 SWRL rules were created for the transformation from TNM7 to TNM8, 24 SWRL rules for the transformation from TNM8 (exocrine pancreas) to TNM7, and 22 SWRL rules for the transformation from TNM8 (neuroendocrine pancreas) to TNM7.
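The logic of such a mapping rule (source class plus additional criteria implies target class) can be mimicked in plain Python. The sketch below is not the authors' SWRL syntax and does not use a reasoner; the `MappingRule` type, the `apply_rules` helper and the fact labels such as `locatedIn:ExocrinePancreas` are hypothetical, while the two class names are the ones used in the example above.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class MappingRule:
    source_class: str              # tumor class in the source TNM version
    required_facts: FrozenSet[str]  # additional criteria needed for the mapping
    target_class: str              # tumor class in the target TNM version


# One illustrative rule: TNM7 T3 class -> TNM8 (exocrine pancreas) T4 class.
RULES = [
    MappingRule(
        source_class="InvasivePancreasTumorNotBeyondCeliacTrunkOrSuperiorMesentericArtery",
        required_facts=frozenset(
            {"locatedIn:ExocrinePancreas", "invades:CommonHepaticArtery"}
        ),
        target_class="InvasiveExocrinePancreasTumorInfiltratingDefinedBloodVessels",
    ),
]


def apply_rules(
    asserted_classes: Iterable[str],
    facts: Set[str],
    rules: Iterable[MappingRule] = RULES,
) -> Set[str]:
    """Add the target class of every rule whose antecedent is satisfied.

    A single pass suffices here because the illustrative rules do not chain.
    """
    inferred = set(asserted_classes)
    for rule in rules:
        if rule.source_class in inferred and rule.required_facts <= facts:
            inferred.add(rule.target_class)
    return inferred


tumor_classes = {"InvasivePancreasTumorNotBeyondCeliacTrunkOrSuperiorMesentericArtery"}
tumor_facts = {"locatedIn:ExocrinePancreas", "invades:CommonHepaticArtery"}
print(sorted(apply_rules(tumor_classes, tumor_facts)))
```

If the additional criteria are missing, no target class is added, which corresponds to the cases in which re-classification is not possible without further information.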
Figure 6 illustrates how SWRL mapping works, using the example of a re-classification from TNM7 (pancreas) to TNM8 (exocrine pancreas). All OWL-files containing classes common to TNM, as well as those containing classes for pancreas in the different TNM versions, and the SWRL rules are imported into the TNM-O hub ontology. An instance of a TNM7 tumor class (A), which is represented by a TNM7 code (n), is re-classified to derive the corresponding TNM8 tumor class (B) and the code (m) by applying the appropriate SWRL rules along with additional criteria necessary to assign the TNM8 code.

Structure of the TNM ontology as described in the text, explaining how the SWRL rules are used to re-classify an individual tumor. A pancreas tumor instance belongs to TNM7 class A. With the additional information, viz. that this tumor is growing in the exocrine pancreas and has characteristics x, y, z, it can be re-classified to exocrine pancreas TNM8 class B by applying the appropriate SWRL rules. Namespace prefixes: btl2: BioTopLite2, Pa7: Pancreas7-ontology, Px8: Exocrine Pancreas 8-ontology, tnmo: TNM-O hub-ontology, tnmoFMA: TNM-Anatomy.
While the SWRL mapping was based on the TNM-specific definitions for tumor classes, which usually comprise a combination of tumor characteristics, the second mapping approach decomposes these definitions even further, creating defined classes for tumor subcategories such as ConfinedPancreasTumor or PancreasTumorMoreThan2cm (which constitute intersections of tumor classes in TNM-O-vX-Tumor), and further down in the class hierarchy, for tumors with every possible combination of tumor characteristics. Some of these classes represent a TNM code and its definition exactly, whereas others are ambiguous and represent more than one TNM code. For the latter classes, disjunctive TNM codes were created, such as
The following example illustrates the concept of this mapping structure. Let a tumor instance have a size of 3 cm and be located in the exocrine pancreas without invasion of structures surrounding the pancreas. According to the TNM7 coding standards this tumor belongs to the class ConfinedPancreasTumorMoreThan2cm, which is represented by the TNM7 code T2. According to the TNM8 coding rules for exocrine pancreas tumors, it is classified as ExocrinePancreasTumor2to4cm represented by the TNM8 (exocrine pancreas) code T2. The mapping ontology includes classes for all possible tumor criteria, including a common subclass of both classes described above, viz. ConfinedExocrinePancreasTumor2to4cm. The tumor instance described above is an instance of this class, and its TNM code can be derived according to TNM7 and TNM8 exocrine pancreas coding rules, respectively.
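The idea of a common subclass carrying the codes of both versions can be sketched as follows. This is a simplified stand-in for the OWL class hierarchy, not the mapping ontology itself: the `Tumor` type, the lookup tables and the helper functions are hypothetical, and the interval boundaries of the 2 to 4 cm class (greater than 2 cm, at most 4 cm) are an assumption. The class names and codes are those of the example above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Tumor:
    size_cm: float
    in_exocrine_pancreas: bool
    confined_to_pancreas: bool  # no invasion of surrounding structures


# Intermediate class -> its version-specific superclasses.
SUPERCLASSES = {
    "ConfinedExocrinePancreasTumor2to4cm": (
        "ConfinedPancreasTumorMoreThan2cm",
        "ExocrinePancreasTumor2to4cm",
    ),
}

# Version-specific class -> (TNM version, code representing it).
CODE_OF = {
    "ConfinedPancreasTumorMoreThan2cm": ("TNM7", "T2"),
    "ExocrinePancreasTumor2to4cm": ("TNM8 exocrine", "T2"),
}


def is_confined_exocrine_2_to_4cm(tumor: Tumor) -> bool:
    """Membership test for the intermediate class (assumed boundaries)."""
    return (
        tumor.confined_to_pancreas
        and tumor.in_exocrine_pancreas
        and 2 < tumor.size_cm <= 4
    )


def codes_via_intermediate_class(tumor: Tumor) -> Dict[str, str]:
    """Derive the code in each TNM version from the intermediate class."""
    if not is_confined_exocrine_2_to_4cm(tumor):
        return {}
    parents = SUPERCLASSES["ConfinedExocrinePancreasTumor2to4cm"]
    return dict(CODE_OF[parent] for parent in parents)


example = Tumor(size_cm=3.0, in_exocrine_pancreas=True, confined_to_pancreas=True)
print(codes_via_intermediate_class(example))
```

Because the instance is classified once under the intermediate class, the codes for both versions follow from the superclass links rather than from separate rules per direction.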
Evaluation of the TNM-O approach using all possible instances
Both mapping solutions were evaluated by analyzing a sample of individuals encompassing all possible tumor classes in both TNM versions, with different combinations of tumor characteristics according to the pancreas tumor ontologies.
A Java-based software module, implemented for a previous version of the TNM ontology (Boeker et al., 2016), was adapted to this dataset and the new ontology. We demonstrated that both mapping approaches can be used to either re-classify tumors representing a TNM code in one version to the appropriate code in the other version or to classify a tumor in both TNM versions, if all information necessary to assign a code is provided. For several tumor classes the availability of additional patient information for the re-classification is important, as otherwise a re-classification would not be possible. For the test cases, the information necessary to validate all possible re-classification options was added, whereas real hospital patient data sets often lacked the required information to assign codes in the new version.
For each of the 85 SWRL rules, a dataset was created listing the initial tumor class and tumor characteristics essential for the transformations from TNM7 to TNM8 and vice versa. Using the software module, instances were created and classified by employing the SWRL definitions to derive the tumor class in the other TNM version. For each of these datasets the transformation resulted in the expected tumor class of the other TNM version. As no SWRL rules could be created for cases where an unambiguous re-classification was not possible due to missing tumor characteristic data, no datasets were created for these cases.
For the second mapping approach, 91 datasets were created representing different combinations of tumor characteristics. The corresponding tumor classes and their TNM code representations in TNM7 and TNM8 were derived as expected.
The difference between the two mapping approaches is illustrated by the following example. The tumor class ConfinedPancreasTumorMoreThan2cm, which is represented by the TNM7 code
Discussion
We proposed two methods for the versioning of ontology-based terminology systems based (i) on the Semantic Web Rule Language and (ii) on a mapping ontology containing all possible tumor subcategories. Further, we introduced a new TNM-O module for pancreatic cancer.
We showed that both methods produced correct mappings, which were complete within a simulated set of all possible TNM codes for pancreatic cancer in versions 7 and 8.
To our knowledge, this is the first implementation of a transition between two versions of a medical classification system based on formal ontology.
Related approaches to represent tumor classification and versioning
In addition to TNM coding, tumors are described by “clinical stages” which are determined by different rules. While there are few examples of representing the TNM coding rules by ontologies (Boeker et al., 2014, 2016; Dameron, Roques, Rubin, Marquet, & Burgun, 2006), several authors have described the use of Semantic Web technologies to represent clinical cancer staging (Alfonse, Aref, & Salem, 2014; Kumar, Yip, Smith, Marwede, & Novotny, 2005; Massicano, Sasso, & Amaral-Silva, 2015). Clinical cancer stages, which are also organ-specific, are mostly denoted by Roman numerals I to IV and are sometimes further modified by upper case letters. They correspond to global severity levels, as used in clinical guidelines for differential treatment options. Clinical stages are typically inferred from combinations of T, N and M values, and sometimes also from additional criteria such as biomarkers. For example, the TNM8 system for exocrine pancreas defines stage IIB as corresponding to TxN1M0 for
A recent work describes a representation of breast cancer staging using RDF and the semantic nanopublication knowledge graph framework (Seneviratne et al., 2018), intended to be used by physicians for tumor staging. The system lists the treatment guidelines associated with each clinical stage. It uses ontologies representing the combinations of the values for T, N, M and biomarkers for breast cancer stages according to staging manuals. The staging rules are expressed as simple intersections of values for T, N and M and known biomarkers, which can then be used to stage tumors in both TNM versions in parallel.
We did not include clinical stages in this work, because our focus was solely on TNM codes, which are defined by clinical and pathological criteria. However, because stages can be easily derived from given combinations of TNM codes (defined in the TNM system), a next step will be to add clinical staging to an updated TNM-O version.
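To illustrate how a clinical stage could be derived from a combination of T, N and M codes, the following Python sketch uses a simple decision function. It is not part of TNM-O: the function name `tnm8_stage` is hypothetical, and the stage groupings for exocrine pancreas are stated here as an assumption that should be checked against the TNM8 manual before any use.

```python
from typing import Optional

def tnm8_stage(t: str, n: str, m: str) -> Optional[str]:
    """Derive a clinical stage from T, N and M codes (assumed TNM8 grouping,
    exocrine pancreas). Returns None for combinations not covered here."""
    if m == "M1":
        return "IV"            # distant metastasis, regardless of T and N
    if m != "M0":
        return None            # e.g. M not assessed
    if t == "T4":
        return "III"           # regardless of N
    if t not in ("T1", "T2", "T3"):
        return None            # e.g. Tis or TX are not covered by this sketch
    if n == "N2":
        return "III"
    if n == "N1":
        return "IIB"
    if n == "N0":
        return {"T1": "IA", "T2": "IB", "T3": "IIA"}[t]
    return None

print(tnm8_stage("T2", "N1", "M0"))
```

In an ontology, the same grouping would be expressed as class definitions over the T, N and M classes rather than as procedural code, so that a reasoner could infer the stage.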
General approaches to ontology/terminology mapping and versioning
Versioning of knowledge bases, and mapping and merging of ontologies have been addressed in various contexts. A comprehensive overview and detailed definition of aspects related to ontology changes is provided by (Flouris, Plexousakis, & Antoniou, 2006; Flouris, Manakanatas, Kondylakis, Plexousakis, & Antoniou, 2008). The authors distinguish the evolution and versioning of ontologies, methods to resolve heterogeneity by providing translation rules (mapping, morphism, alignment, articulation), and discuss integration and merging of ontologies. They use the expression ontology articulation for the creation of an “intermediate ontology and mappings between the vocabularies of the intermediate ontology and each source” (Flouris et al., 2008), which is quite close to our approach.
A model of the representation of change in controlled biomedical terminologies is CONCORDIA (CONcept and Change-Operation Representation for any DIAlect), which is composed of a concept model, a set of change operations and their semantics, and a change-documentation model (Oliver, Shahar, Shortliffe, & Musen, 1999). This model is also used to synchronize local adaptations to shared health-care terminology (Oliver & Shahar, 2000). Its relevance to our work is rather limited, because it interprets mostly informal terminologies and thesauri as frame-based systems and emphasizes lexical features and relations, e.g. synonymy. The use of lexical features for suggesting mappings is also a focus of the Chimaera tool, a framework to align and merge ontologies (McGuinness, Fikes, Rice, & Wilder, 2000). Another approach, closer to our work, is the ontology-composition algebra, which describes a way to integrate ontologies from different sources, keeping the original ontologies and articulating linkages (e.g. intersection, union, difference) described by Horn clauses (Mitra & Wiederhold, 2004).
In general, these approaches use the word “ontologies” to refer to a heterogeneous class of artifacts, from informal thesauri to various kinds of logic-based models.
A more recent method for tracking, explaining and measuring changes between successive versions of a foundational, realist ontology (the Basic Formal Ontology, BFO) with a formal semantics was proposed by (Seppälä, Smith, & Ceusters, 2014). It suggests and scores eight categories of changes. Of these, issues of existence and relevance are changes that could also be applied to describe the evolution of TNM. For example, the distinction between the endocrine and exocrine pancreases as two “organs” has evolved over time, as has the relevance of this distinction for tumor classification, which was finally introduced in TNM version 8. The main difference from our scenario is that BFO was conceived as a realism-based upper-level ontology from the very beginning, whereas TNM-O is based on a conventional aggregation terminology, the interpretation of which as an ontology was not a primary design principle.
Re-classification between TNM versions in clinical settings
Changes to classification criteria are a clear use case for re-classification. They can lead to a different statistical distribution of individual patients in cancer staging groups and, as a consequence, to different treatment schemes or prognoses associated with these stages. For example, the new approach in TNM8 to focus on tumor size leads to a more balanced distribution of individuals with exocrine pancreas cancer between T1, T2, T3 and T4, compared to the classification following the TNM7 coding rules. The new TNM8 classification results in significantly improved prediction of the overall survival of pancreatic cancer patients (Cong et al., 2018). The authors extracted the necessary information for both TNM versions from radiology reports, but did not provide more details; presumably both TNM versions were used in parallel.
This study provides evidence that all possible combinations of tumor characteristics specified in the TNM coding system can be correctly (re-)classified. While we used test data in an experimental setting, real data can be analyzed in the same way by simply inserting them into the same format.
However, in clinical settings, automated re-classification between different versions of terminology systems depends on the availability of the necessary data. This is highly dependent on the local documentation policy and practice, features of the local documentation system and the availability of public datasets. We were unable to find any publicly available dataset which could have been used for TNM coding. While there are published templates with a comprehensive data set for pancreatic cancer, covering aspects beyond those required for TNM coding (The Royal College of Pathologists, 2017), routine pathology documentation in hospital settings often contains little or no information beyond the tumor characteristics required for the coding in the currently valid TNM version. With the significant changes for pancreatic cancer between TNM7 and TNM8, some criteria necessary for re-classification will be missing, and some information might be ambiguous. For example, a tumor which invades the peripancreatic soft tissue, a coding criterion for T1, T2 and T3 in TNM8, might simply be described as growing “beyond the pancreas”, which is sufficient to assign code T3 in TNM7, but insufficient for TNM8 where the size of the tumor must be known.
The following structured data were locally available for TNM7 coding: histopathology (necessary to define whether the tumor is growing in the exocrine or neuroendocrine pancreas) and the number of metastatic lymph nodes (necessary to define the N-codes). Unstructured data were available in reports, which often, but not always, also included the size of the tumor (necessary to define the T-codes) and information about distant metastases. Both NLP and manual sample re-analysis were beyond the scope of this work. At the time this manuscript was written, local documentation policy and practice had not yet changed to reflect TNM8 requirements because the new TNM version had only recently been introduced, which explains the lack of structured data.
Comparing the two mapping solutions, the SWRL approach, providing rule-based “translations” between definitions, may be easier to implement and require fewer computing resources. However, it only succeeds if all data necessary for a correct transition are available. In this respect, the bridging class method is likely to work better, as it will at least propose a combination of the possible tumor classes (as more generic union classes) in cases where the exact code cannot be derived. This is illustrated above by the bridging code example T2_Or_T3 which may either represent a tumor with the code T2 or with the code T3. Thus, the bridging class method identifies the ambiguity rather than just failing. As it performs a systematic decomposition into all defining criteria, it should be more helpful as an ontological support to future changes, such as extension of the prognostic factors beyond size and invasiveness (e.g. biomarkers).
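The difference in behavior can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions, not the OWL or SWRL implementation: the size bands, the table of sizes implied by a TNM7 code, and the function names `rule_based` and `bridging` are hypothetical, and union labels other than T2_Or_T3 are invented here for the example.

```python
import math
from typing import Optional

# Assumed TNM8 size bands for the T categories, as half-open intervals (lower, upper] in cm.
TNM8_SIZE_BANDS = [("T1", 0.0, 2.0), ("T2", 2.0, 4.0), ("T3", 4.0, math.inf)]

# What a TNM7 code alone tells us about the size (assumed): confined tumors are
# split at 2 cm, whereas extension beyond the pancreas (T3) says nothing about size.
SIZE_IMPLIED_BY_TNM7 = {"T1": (0.0, 2.0), "T2": (2.0, math.inf), "T3": (0.0, math.inf)}

def band_for(size_cm: float) -> Optional[str]:
    """TNM8 T code whose size band contains the given size."""
    for code, lower, upper in TNM8_SIZE_BANDS:
        if lower < size_cm <= upper:
            return code
    return None

def rule_based(tnm7_code: str, size_cm: Optional[float] = None) -> Optional[str]:
    """Rule-style translation: yields a code only if the needed data are present."""
    if tnm7_code == "T1":
        return "T1"
    if size_cm is None:
        return None  # no rule can fire without the tumor size
    return band_for(size_cm)

def bridging(tnm7_code: str, size_cm: Optional[float] = None) -> Optional[str]:
    """Bridging-style mapping: falls back to a union of all compatible codes."""
    if size_cm is not None:
        return band_for(size_cm)
    known_lower, known_upper = SIZE_IMPLIED_BY_TNM7[tnm7_code]
    compatible = [code for code, lower, upper in TNM8_SIZE_BANDS
                  if known_lower < upper and lower < known_upper]  # intervals overlap
    return "_Or_".join(compatible)

print(rule_based("T2"), bridging("T2"))
```

For a TNM7 T2 tumor of unknown size, the rule-based function returns nothing, whereas the bridging function returns the union label T2_Or_T3; once the size is supplied, both return the same single code.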
The representation of a traditional classification system using formal statements requires non-rigid classes
The TNM system is, at first glance, a traditional medical classification system using textual definitions (and scope elucidations). Our work aimed to complement the text with formal axioms in a way that closely follows the text definitions in the sources, which made apparent some differences between these two ways of encoding meaning.
As with all systems which provide semantic reference, definitions attached to codes depend on their purpose, here to describe categories of malignancy as the basis for cancer staging. TNM classes are precisely defined, unique and mutually exclusive (disjoint), so that each individual anatomical entity of relevance can be assigned to exactly one class. This focus can change as the science advances, and the classes can be re-arranged in other ways, including changes in granularity. Therefore, TNM codes always describe real-world entities delineated according to fiat criteria, e.g. measurement intervals on the metric scale, together with a certain, application-specific view of them. Classifications follow a different paradigm to other terminology systems: the aim is to aggregate objects into custom categories to support statistics, at the expense of not covering the full level of detail (Ingenerf, 2015; Schulz, Rodrigues et al., 2017). This may be a general problem when representing a classification using an ontology language. The issue has recently been discussed with regard to a planned logic-based alignment of ICD-11 and SNOMED CT (Rector, Schulz, Rodrigues, Chute, & Solbrig, 2019; Rodrigues et al., 2015).
Tumor classes representing “no assessment” or “no evidence” of a tumor: When formally representing the TNM coding rules in a way that closely follows the text definitions in the sources, one may be tempted to place tumor classes characterized by “no assessment” or “no evidence” under Primary Tumor and next to the classes in which a tumor is defined by size, as was done in the previous versions of the TNM ontology (Boeker et al., 2016). This mirrors the phenomenon that non-referring language expressions often follow the same grammatical patterns as referring ones.
In formal ontologies, however, the instantiation of a class implies the real-world existence of an individual. “No assessment” and “no evidence” related to tumors as part of examination results are epistemic notions (i.e. they refer to the examination and not to the tumor), which prohibits their use in interpreting tumor qualities. It would be akin to concluding that “tumor” described by “no evidence” probably does not exist at all. Associating the word “tumor” with the qualifier “no evidence” is a simple turn of phrase, used for convenience of expression, but which must not be interpreted literally. A more precise paraphrase would be “amount of tissue in which there is no evidence of malignant spread”.
Hence our decision to consider the entities under scrutiny as anatomical structures, which may or may not be or contain tumors. These particular “tumor classes” were subsumed by a class called
Whether or not a piece of tissue is assessed for tumor is independent of the existence of a tumor and its size in that tissue sample. The traditional TNM classification structure suggests disjointedness between these classes, which, of course, clashes with our assumption that TNM classes are classes of domain entities (and not of classificatory statements). Our solution makes a compromise by introducing the non-rigid classes
Conclusion
A modular approach was used to create a set of ontologies for the representation of the TNM coding rules across TNM versions. It was shown that mapping between different versions of the TNM scheme can be usefully implemented using OWL files for mapping. These can either be based on SWRL rules, which translate the definitions from the former to the latter version, or on mapping classes and OWL axioms that decompose the defining criteria into tumor subcategories, which can be re-combined to reflect new classification criteria. Both mapping solutions might be useful tools for the re-assignment of TNM codes in different TNM versions. Both methods have their merits with regard to ease of implementation and maximum preservation of information: the bridging-class method can highlight ambiguity while preserving as much information as possible, whereas the SWRL rule method is easier to implement, but fails entirely in the event of ambiguity. The feasibility study in the context of pancreatic cancer, where significant changes between TNM7 and TNM8 exist, suggests that this approach can be generalized and used for other tumor entities as well. However, re-coding in clinical documentation practice requires data of the granularity demanded by the target TNM version.
Our work provides support for the argument that modern classification systems in general, and TNM in particular, would benefit – at least in the long-term – from a formal foundation, which ensures that decisions about mapping between codes from different sources do not depend on the wording of labels, rules and coding guidelines, and their subjective interpretation. A similar attempt to define a formal foundation for version 11 of ICD was mostly unsuccessful, in part due to the complexity of the task but also due to an unclear ontological commitment to the “foundation” classes and their alignment with other standards such as SNOMED CT. We believe that applying the method of ontology-based interpretation to the much smaller and well-curated TNM classification may be more successful.
The ontology is available as open source via GitHub:
Acknowledgements
This work was conducted using the Protégé resource, which is supported by grant GM10331601 from the National Institute of General Medical Sciences of the United States National Institutes of Health.
This research was partly funded by the Federal Ministry of Education and Research (BMBF) as part of the MIRACUM consortium of the German Medical Informatics Initiative (FKZ 01ZZZ1801B).
We would like to thank James Balmford for English proof-reading.
