Abstract
Objective:
This study tests coverage of SNOMED CT as an expansion source in the process of automated expansion of clinical terms found in discharge summaries. Term expansion is commonly used as a technique in knowledge extraction, query formulation and semantic modelling among other applications. However, characteristics of the sources might affect credibility of outputs, and coverage is one of them.
Method:
We developed an automated method for testing coverage of more than one source at a time. We used several methods to clean our corpus of discharge summaries before we extracted text fragments as candidates for clinical concepts. We then used Unified Medical Language System (UMLS) sources and UMLS REST API to filter concepts from the pool of text fragments. Statistical measures like true positive rate and false negative rate were used to decide on the coverage of the source. We also tested the coverage of the individual SNOMED CT hierarchies using the same methods.
Results:
Findings suggest that a combination of four terminologies tested (SNOMED CT, NCI, LNC and MSH) achieves over 90% of coverage for term expansion. We also found that the SNOMED CT hierarchies that hold clinically relevant concepts provided 60% of coverage.
Conclusion:
We believe that our findings and the method we developed will be of use to both scientists and practitioners working in the domain of knowledge extraction.
Introduction
SNOMED CT is claimed to be the most comprehensive single resource of clinical terms available today (IHTSDO, 2020a). It has been created and is maintained by the SNOMED International organisation that currently has 37 countries as governing members. The number of concepts in the January 2019 edition of this resource was 349,548 and it is showing a steady growth in the number of concepts added in every edition (IHTSDO, 2020b). Its content is clinically verified and enables accurate representation of clinical concepts in electronic clinical records and clinical analytics systems (Lee et al., 2013). SNOMED CT’s structure consists of 19 large hierarchies that are joined through the top-level concept, called the SNOMED CT Concept. The hierarchies are made of concepts connected using the Is_A attribute. Hierarchies themselves are large acyclic graphs in nature and no concept from any of the hierarchies connects to any of the concepts from the other 18 hierarchies. In addition to the Is_A attribute, which is used purely for depicting hierarchy, other attributes, listed in the SNOMED CT Model Component (metadata) hierarchy, are used to depict relationships. In contrast to the Is_A attribute, these attributes can “connect” concepts located in different hierarchies. For example, Finding site (attribute) is used to represent the relationship between the concepts Pulmonary infarction (disorder) and Structure of pulmonary blood vessel (body structure) that reside in different hierarchies.
The ability of SNOMED CT to provide axioms using a concept’s hierarchical and non-hierarchical relationships allows for reasoning and classifies SNOMED CT as an ontology rather than just a taxonomy (Bodenreider, 2018). This is supported by the fact that the axioms in SNOMED CT are presented using Compositional Grammar that can be interpreted using Description Logic (SNOMED International, 2019). Considering that the number of concepts covered in SNOMED CT is significant and that SNOMED CT is a quality-controlled resource created by experts and scrutinised by many, it can be assumed that the axioms postulated in SNOMED CT are a close representation of the propositions in reality or in its representative ontology. These characteristics make SNOMED CT a suitable source for use in knowledge extraction and for term expansion in particular. Term expansion is a frequently used technique for knowledge extraction. Term expansion is used to overcome vocabulary mismatch issues where expansion increases the scope of analysis to cover syntactic patterns that are not otherwise present in the text. This technique keeps compound or obscure terms from preventing or confusing concept identification, making processes like knowledge extraction more effective (Alani et al., 2003). In term expansion, a term is supplemented with associated terms found in a source selected as an expansion repository. This source can be any body of text; however, the most commonly used sources are domain-specific and domain-independent ontologies and databases. SNOMED CT, in particular, is used for work in the domain of medicine and human biology (Stokes et al., 2009; Wang et al., 2008).
As part of our work on annotation of clinical datasets using openEHR Archetypes (Zivaljevic et al., 2015), we need a repository that we can use as a source for expansion of clinical terms found in the free-text discharge summaries that we use as a sample. We expect that the source will be comprehensive and have its concepts and their relationships presented as closely as possible to reality, and thus be a valid representation of the clinical domain. Considering its comprehensiveness, quality control and involvement of clinicians in its creation, SNOMED CT is presumed to be an obvious choice. It is expected that its 19 hierarchies, each of which covers a different subdomain relevant to the clinical domain, will provide valuable guidance in noise minimisation efforts, as they will allow different weightings to be assigned to different (sub)domains of interest. However, the research community reports issues that can potentially render SNOMED CT unsuitable for term expansion. Rodrigues et al. (2018) raise the question of modelling issues in SNOMED CT and suggest quality assurance improvements. However, their conclusions are based on a limited sample: a comparison of SNOMED CT concepts and mapped ICD-11 classes from the circulatory and digestive chapters only.
Miñarro-Giménez et al. (2018) conducted a qualitative analysis testing the utility of SNOMED CT in coding clinical text by measuring inter-annotator disagreement. Their results show that their annotators matched what was defined in the reference standard in only 21.6% of the codes assigned to the terms found in the provided sample. This, as the authors term it, “astonishingly low” result could mean that SNOMED CT features a high level of ambiguity, which would render it unsuitable for our research due to the potential introduction of noise in the term expansion process. However, the issues that the authors list as factors causing this are mostly subjective or of a research methods nature and do not depict a failure of SNOMED CT. The factors include annotators’ lack of domain knowledge, carelessness, annotation guideline issues, interface term issues and language issues. Bona and Ceusters (2018) demonstrated that some of the semantic tags in SNOMED CT incorrectly identify a concept’s place in the hierarchy. The semantic tag is part of the concept’s description and is enclosed in parentheses. It serves to disambiguate the description of a particular concept from identical descriptions that belong to other concepts. The example that the authors give is the pair of concepts [35566002 | Hematoma (morphologic abnormality)] and [385494008 | Hematoma (disorder)]. It was found that 89 of over 300,000 concepts in the SNOMED CT 20170131 release have mismatched semantic tags. However, given the small magnitude of the problem, we regard this issue as minor for our study.
The issue of SNOMED CT’s coverage, which we consider to be SNOMED CT’s ability to capture clinical information in a given text, has been assessed by many (Khorrami et al., 2018). Some deficiencies of SNOMED CT are reported to have practical implications for its utilisation in the clinical domain. For example, Liu et al. (2018) established that SNOMED CT’s adoption among ophthalmologists in the United States has not reached expected levels and find that the reason for that is the coverage of SNOMED CT’s ophthalmology component. Similarly, Rastegar-Mojarad et al. (2017) raised an issue of coverage in their attempt to map the list of procedures from their Gynaecology Surgery Registry to SNOMED CT. They found that only a small percentage of procedure names can be mapped to the concepts in SNOMED CT. The reason behind this issue was not the absence of suitable terms in SNOMED CT but the format in which the procedures are represented and the use of what they call procedure modifiers. This observation by Rastegar-Mojarad et al. is an important one, as it indicates that issues seen as SNOMED CT coverage problems could actually be unrelated to SNOMED CT and related to how concepts are presented in the corpus. This particular case highlighted the importance of stemming and lemmatisation of the text fragments that are to be mapped to the terms in SNOMED CT. In their comparison of Nomenclature for Properties and Units (NPU), Logical Observation Identifiers Names and Codes (LOINC) and SNOMED CT for use in representing laboratory results, Bietenbeck et al. (2018) suggest that for the best coverage SNOMED CT and LOINC should be used together, as some of the concepts in SNOMED CT are represented in a rather complex fashion, using post-coordinated rather than pre-coordinated expressions. However, they note that this particular issue is by design and is to ensure that duplication between SNOMED CT and LOINC is minimal.
The issues of SNOMED CT coverage raised in the literature prompted us to question its suitability for our study. As our plan was to conduct ontological term expansion of the terms found in the sample corpus, we needed an ontology that incorporated as many of our target concepts as possible. Ideally, the ontology would have all of the target concepts incorporated, but the likelihood of that was low. In this work, we tested the suitability of SNOMED CT for use in expansion of clinical concepts in terms of its coverage. The main goal was to test whether the coverage of SNOMED CT would allow for the majority (if not all) of the clinical terms from our sample corpus of discharge summaries to be recognised. We expected that at least 90% of the clinical terms found in the discharge summaries would be available in SNOMED CT. This was an arbitrary number based on a consensus of the research team members and the work of the authors mentioned later in the text. If less than 90% coverage was found, we would consider extending SNOMED CT with terms from other ontologies.
Method
The sample we used in our study was the set of discharge summaries provided by the i2b2 National Center for Biomedical Computing (Uzuner et al., 2007). The sample contains 889 unannotated, de-identified discharge summaries provided as free text. We extracted each record’s text and stored it in an SQL database along with the related RECORD ID. The extracted free text needed to be cleansed of elements that did not hold any value in the process. This particular corpus had section captions that divided the free text. The captions were not standard and varied from discharge summary to discharge summary. Because the captions contained no medical concepts related to the patient health record, only generic descriptions of the sections (Diagnosis, Family History, etc.), they were temporarily removed from the corpus. Some value was seen in classifying the recognised concepts by the headings under which they appeared, but we did not pursue that, knowing that the headings were specific to this particular corpus and might not be found in other corpora of clinical free text. This step would most likely be different for different corpora, due to the structural variations that can be expected. We called this step Text Cleaning.
Boundary detection took place next. In this step, we split the text into an array of sentences. For that to be possible, line breaks were normalised and changed to full stops where full stops were not already present, and double spaces were replaced with single spaces. It was ensured that the sentence boundaries were kept intact so that multi-word text fragments would not cross them. The candidates for clinical concepts were extracted next in the step we called Fragments Extraction. The algorithm extracted text fragments that were up to five words long and contained only a-z, A-Z, -, ', and space. We kept the punctuation in the sentence intact, which ensured that the extracted text fragments would not span across sentence segments. The minimum number of characters for text fragments that were one word in length was set to three. No minimum length was set for the words in text fragments made of two or more words. Text fragments were recorded once only, regardless of how many times they appeared in the corpus. However, the number of times that each text fragment appeared in the corpus was recorded. The algorithm used to extract the text fragments tokenised the text into one-word tokens and then, starting from each token, selected all n-word text fragments up to n = 5. The duplicates were then removed and text fragments of one word in length were filtered for stop words. The fragments that contained more than one word were not filtered for stop words. The final array was cached in a database.
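The boundary detection and Fragments Extraction steps described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the tiny stop-word list and the lower-casing of fragments are all assumptions introduced here, since the text does not specify them.

```python
import re
from collections import Counter

MAX_WORDS = 5                # fragments of up to five words
MIN_SINGLE_WORD_CHARS = 3    # minimum length for one-word fragments
# Illustrative stop-word list (assumption; the actual list is not given).
STOP_WORDS = {"the", "and", "was", "with", "for", "not"}

# A segment is a maximal run of the allowed characters (a-z, A-Z, -, ', space).
# Any other character (comma, full stop, digit, ...) ends the segment, so
# fragments never span punctuation or sentence boundaries.
SEGMENT_RE = re.compile(r"[A-Za-z'\- ]+")


def normalise(text):
    """Turn line breaks into full stops where missing; collapse repeated spaces."""
    lines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.endswith("."):
            line += "."
        lines.append(line)
    return re.sub(r" {2,}", " ", " ".join(lines))


def extract_fragments(text):
    """Return a Counter mapping each distinct fragment to its occurrence count."""
    counts = Counter()
    for segment in SEGMENT_RE.findall(normalise(text)):
        tokens = segment.split()
        for start in range(len(tokens)):
            for n in range(1, MAX_WORDS + 1):
                if start + n > len(tokens):
                    break
                # Lower-casing is an assumption made to simplify de-duplication.
                fragment = " ".join(tokens[start:start + n]).lower()
                # Only one-word fragments are length- and stop-word-filtered.
                if n == 1 and (len(fragment) < MIN_SINGLE_WORD_CHARS
                               or fragment in STOP_WORDS):
                    continue
                counts[fragment] += 1
    return counts
```

Because the result is a `Counter`, each fragment is stored once while its number of occurrences is still available, which mirrors the "recorded once, occurrences counted" behaviour described in the text.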
Due to our focus on expressiveness of terminology rather than higher level semantics of the text fragments retrieved, we were not concerned with recognising negation in the corpus. We viewed negation as a topic on its own and a question separate to this stage of our research. Each of the extracted text fragments was then used to formulate a query to the Unified Medical Language System (UMLS). The UMLS Representational State Transfer (REST) application programming interfaces (APIs) were used to query whether an exact match for the extracted text fragment existed in the UMLS Metathesaurus. If an exact match or matches existed, the source terminologies were recorded as associated with the text fragment used in the query. The information was returned by the UMLS in JavaScript Object Notation (JSON) format and the JSON responses were saved in the database for future reference. If SNOMED CT was listed as a source terminology for the discovered concept, we made a recursive query that revealed to which of the 19 SNOMED CT hierarchies the concept belonged. The version of SNOMED CT available in UMLS was the US Edition version 20190301.
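An exact-match lookup of this kind might be sketched as below. The endpoint, the `string`, `searchType` and `apiKey` parameters, and the `result.results[].rootSource` response shape reflect our reading of the public UMLS REST search documentation and are assumptions; they may differ from the API version and authentication scheme used in the study. The helper names are hypothetical, and the sketch does not establish whether `rootSource` alone is enough to enumerate every source terminology of a concept.

```python
from urllib.parse import urlencode

# Assumed base URL of the UMLS REST search endpoint.
UMLS_SEARCH_URL = "https://uts-ws.nlm.nih.gov/rest/search/current"


def build_exact_search_url(fragment, api_key):
    """Build a search URL that asks for exact matches of a text fragment."""
    params = {"string": fragment, "searchType": "exact", "apiKey": api_key}
    return f"{UMLS_SEARCH_URL}?{urlencode(params)}"


def sources_from_response(payload):
    """Collect the source abbreviations listed in a decoded JSON response.

    Results without an identifier, or with the placeholder identifier
    "NONE" (assumed to mark an empty result set), are ignored.
    """
    results = payload.get("result", {}).get("results", [])
    return {
        item["rootSource"]
        for item in results
        if item.get("ui") not in (None, "NONE") and "rootSource" in item
    }
```

The URL can be fetched with any HTTP client, and the decoded JSON stored alongside the fragment before `sources_from_response` is applied, which keeps the raw responses available for later reference as described above.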
As our aim was to learn how well SNOMED CT would perform as a term expansion source for our study, we wanted to establish the level of coverage that SNOMED CT would provide to our corpus. Elkin et al. (2006) conducted similar research testing how well SNOMED CT provided coverage for coding the most common conditions seen at one of their large clinics. They found that the coverage was 92.3% and declared that as sufficient coverage. Following Elkin et al., we used sensitivity as one measure. For us, sensitivity answered the question of how to quantify a UMLS source’s (the Source) coverage. For example, for SNOMED CT, we calculated Sensitivity Ratio as the number of concepts found in SNOMED CT divided by the number of concepts found in the UMLS as if it was without SNOMED CT. We calculated Sensitivity Ratio for every Source where at least one text fragment matched a concept (see Box 1, equation (1)). We also wanted to know the rate of false negatives as an indicator of the number of concepts that will not be recognised if the particular source is used. The false negative rate is the number of concepts found in the UMLS but not in the tested source divided by the total number of concepts found in the UMLS (see Box 1, equation (2) for calculation of false negatives). As we were also interested in the coverage of distinct SNOMED CT hierarchies (organised as SNOMED CT axes), we calculated the sensitivity of each of the SNOMED CT hierarchies. We defined the sensitivity of each hierarchy as the number of concepts found in that hierarchy over the total number of SNOMED CT concepts recognised, expressed as a percentage (see Box 1, equation (3)).
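The three measures described above can be illustrated with a short sketch. It assumes a hypothetical mapping from each matched text fragment to the set of UMLS sources in which it was found, and it reads "the UMLS as if it was without the Source" as "fragments matched by at least one other source"; that reading, the function names and the toy data are assumptions, since Box 1 is not reproduced here.

```python
def sensitivity_ratio(matches, source):
    """Fragments found in `source` over fragments found in any other source."""
    in_source = sum(1 for found in matches.values() if source in found)
    in_others = sum(1 for found in matches.values() if found - {source})
    return in_source / in_others if in_others else float("nan")


def false_negative_rate(matches, source):
    """Fragments found in the UMLS but not in `source`, over all UMLS matches."""
    total = sum(1 for found in matches.values() if found)
    missed = sum(1 for found in matches.values() if found and source not in found)
    return missed / total if total else float("nan")


def hierarchy_sensitivity(hierarchy_counts):
    """Percentage of recognised SNOMED CT concepts falling in each hierarchy.

    Uses the sum of the per-hierarchy counts as the total (an assumption).
    """
    total = sum(hierarchy_counts.values())
    if not total:
        return {}
    return {name: 100.0 * n / total for name, n in hierarchy_counts.items()}


# Toy data, invented purely for illustration.
matches = {
    "chest pain": {"SNOMEDCT_US", "CHV"},
    "aspirin": {"SNOMEDCT_US"},
    "hgb": {"LNC"},
    "tumor": {"NCI", "CHV"},
}
print(sensitivity_ratio(matches, "SNOMEDCT_US"))    # 2 of 3 fragments
print(false_negative_rate(matches, "SNOMEDCT_US"))  # 2 of 4 fragments
```

In the toy data, SNOMED CT matches two fragments, three fragments are matched by some other source, and two of the four UMLS-matched fragments are missed by SNOMED CT.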
Considering that the corpus of text we worked on consisted of discharge summaries, we expected that significant sensitivity would be shown by the hierarchy with “Clinical finding (finding)” as its root concept. We placed a threshold at 40% sensitivity per source (
A question may arise as to why we did not use Specificity as a measure in this work. Specificity takes into consideration true negatives, and in the case at hand, Specificity would have revealed the proportion between the number of text fragments that were correctly identified to have no meaning as concepts (true negatives) and the total number of text fragments that truly had no meaning as concepts (true negatives + false positives) (Box 1, equation (5)). We were able to calculate the number of true negatives for a Source as the number of text fragments that were recognised neither in that particular Source nor in any of the other Sources. However, establishing a reliable number for the false positives was not possible without human agents. The question of whether a recognised match between a text fragment and a concept in a Source is a true positive or a false positive cannot be answered using automated methods at this point in time.
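The proportion described above corresponds to the standard definition of specificity. Box 1 is not reproduced here, so the notation below (TN for true negatives, FP for false positives) is an assumption rather than the authors' own typesetting of equation (5).

```latex
\[
\text{Specificity} = \frac{TN}{TN + FP}
\]
```

Since FP could not be established without human review, this quantity could not be computed automatically, even though TN was available.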
Results
Out of 208,513 text fragments of 1–5 words length, 9777 matched to at least one concept in at least one UMLS source. The number of UMLS sources that the matches were found in was 85. The top 5 sources were Consumer Health Vocabulary (CHV), National Cancer Institute (NCI), SNOMED CT US Edition (SNOMEDCT_US), LOINC (LNC) and MeSH (MSH).
Table 1 shows the experiment results for the top five sources. The field “Concepts” shows the number of text fragments recognised as concepts in a particular Source. The next field, named “Unique,” shows the number of text fragments recognised as concepts only in one particular source. Sensitivity Ratio is given in the next field (TPR Source). It is explained in equation (1) as the number of concepts found in the Source divided by the number of concepts found in the UMLS as if it was without that particular Source. The next field, false negatives, is given in equation (2) and is based on the number of concepts found in the UMLS but not in the tested source. The last field in this table shows the false negatives of the Source in relation to the false negatives of SNOMED CT.
Table 1. Top 5 UMLS sources for concept matching.
CHV: Consumer Health Vocabulary; NCI: National Cancer Institute; SNOMED CT-US: SNOMED CT US Edition; LNC: LOINC; MSH: MeSH; Concepts: the number of text fragments recognised as concepts in the source; Unique: the number of text fragments only in the one particular source; TPR: true positive ratio (sensitivity); FNR: false negative ratio.
Table 2 shows the results of the combined sources. The first field lists the number of text fragments recognised as concepts by the relevant combination of Sources. The next field shows the number of concepts that are uniquely recognised when the particular Sources are combined. Sensitivity and false negatives for the particular combination of the sources are given as the last two fields in this table (description as for the fields in Table 1). Table 3 shows a sample of text fragments, their occurrences in the discharge summaries and the number of matching concepts in each of the top 5 Sources. Table 4 shows a sample of text fragments that have been matched to concepts in Sources other than SNOMED CT, their occurrences in the discharge summaries and the number of matching concepts in each of the top 5 Sources. Table 5 depicts the result of the coverage of distinct SNOMED CT hierarchies organised as SNOMED CT axes. It lists the SNOMED CT hierarchy axis followed by the number of concepts that belong to the particular SNOMED CT hierarchy axis and its coverage.
Table 2. Selected sources and results of source combinations.
SNOMED CT-US: SNOMED CT US Edition; NCI: National Cancer Institute; LNC: LOINC; MSH: MeSH; Concepts: the number of text fragments recognised as concepts in the source; Unique: the number of text fragments only in the one particular source; TPR: true positive ratio (sensitivity); FNR: false negative ratio.
Table 3. Sample of matched text fragments, their occurrences in the text and the number of matching concepts in each source.
CHV: Consumer Health Vocabulary; NCI: National Cancer Institute; SNOMED CT-US: SNOMED CT US Edition; LNC: LOINC; MSH: MeSH.
Table 4. Sample of text fragments matched to sources other than SNOMED CT.
CHV: Consumer Health Vocabulary; NCI: National Cancer Institute; SNOMED CT-US: SNOMED CT US Edition; LNC: LOINC; MSH: MeSH.
Table 5. Coverage of SNOMED CT axes.
Note. TPRAxis is the total number of true positives for a particular SNOMED CT hierarchy (axis). It represents the number of concepts found in that hierarchy over the number of total SNOMED CT concepts recognised, expressed as percentage.
Discussion
In this study, we demonstrated a fully automated approach for selecting sources for term expansion based on their coverage of a corpus of clinical text. Similar approaches have been described in the past, some as theoretical concepts or frameworks (Alani et al., 2003; Li and Motta, 2010), some as solutions that require input of a human agent (Maiga and Ddembe, 2009) and some as complex sets of algorithms that include more than just a test of coverage in the process of source evaluation (Martínez-Romero et al., 2014, 2017). However, the method we developed differs from these other methods in several respects. Firstly, it includes consideration for preparation of text for coverage testing. As we expected that the input would be free text, we allowed for text cleaning, stop word removal and tokenisation. Secondly, we deployed a version of brute-force testing of text fragments for presence in the sources. Starting from our one-word text fragments, we extracted all multi-word text fragments up to the length of 5 and tested them all for presence in the sources. We used caching to ensure that there was no performance penalty for doing so. Thirdly, our measure for selection of the sources to extend SNOMED CT as the preferred source was the sources’ coverage of SNOMED CT’s false negatives –
We also considered using CHV as a source for extending the selected source. However, despite offering an impressive 916 concepts that were unique (not present in other sources), and 57.91% of coverage of SNOMED CT’s false negatives, CHV’s lack of ontological properties made us decide against using it. Although the richness of this source would certainly have increased the number of text fragments recognised as concepts, its lack of ontological properties made it unsuitable for use in semantics operations.
NCI and SNOMED CT were found to have the same number of text fragments recognised as concepts. Accordingly, their number of false negative results, represented as
Nevertheless, some conclusions on the utility of extending SNOMED CT with the concepts from NCI can be made. NCI’s
The number of concepts recognised by LOINC and MeSH was relatively small compared to the top 3 sources. However, their use as extensions of the main source is worth considering. LOINC’s overlap with SNOMED CT is small at 6.7% (NLM.govt, 2019a) and considering its specialisation in laboratory results, we saw it as a good candidate for using as an extension to the main source. MeSH’s overlap with SNOMED CT was also small at 8.1% and the prospect of the increase in coverage of close to 7% prompted us to decide that its concepts are worth inclusion. The next source with the largest number of concepts that are not contained in SNOMED CT, NCI, LOINC and MeSH was MEDCIN. The number of concepts that MEDCIN would contribute would be 145 with 87 unique. That number of concepts does not significantly improve the resulting ontology, so we decided that it was not worth including.
Although our results showing the root axes of SNOMED CT terms revealed that the greatest number of results came from the Qualifier value hierarchy, close to 60% of the concepts came from the next four hierarchies (clinical finding, substance, body structure and procedure), which contained the concepts of highest relevance for expansion of clinical concepts. This shows the need for assigning weights to, or at least adjusting the weights of, the recognised concepts based on the semantic relevance of their roots.
Study limitations
The algorithm we used for preparing text fragments for matching to concepts did not stem the text fragments before they were sent to UMLS for matching. Stemming is used to reduce variant word forms to common roots (Xu and Croft, 1998). Our tests conducted during the experiment showed that deploying the Porter Stemmer (Porter, 1980) would improve recognition of the text fragments by 8–10%. We tested just the text fragments that had been found in at least one of the sources. We believe that the utility of stemming should be further evaluated and that various stemming algorithms should be included in the testing process. Also, our focus was on expressiveness of terminology rather than higher level semantics of the text fragments retrieved. Hence, we were not concerned with recognising negation in the corpus, nor was any effort put into deciphering the exact semantics of the text fragments.
Conclusion
We believe that all projects that utilise ontologies for term expansion, concept recognition, coding or similar applications need to undergo an exercise similar to what we have described in our current work. The unique contribution our research offers is the description and justification of the methodology used. We believe a similar process can be employed by researchers in many other projects. We also believe that the use of UMLS as a source of truth in an automated process of discovering true positives and false negatives is novel and that, as source ontologies of the UMLS grow, the utility of UMLS for this purpose will increase.
Footnotes
Acknowledgements
Deidentified clinical records used in this research were provided by the i2b2 National Center for Biomedical Computing funded by U54LM008748 and were originally prepared for the Shared Tasks for Challenges in NLP for Clinical Data organised by Dr. Ozlem Uzuner, i2b2 and SUNY.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
