Abstract
The ability to find highly related clinical concepts is essential for many applications such as for hypothesis generation, query expansion for medical literature search, search results filtering, ICD-10 code filtering and many other applications. While manually constructed medical terminologies such as SNOMED CT can surface certain related concepts, these terminologies are inadequate as they depend on expertise of several subject matter experts making the terminology curation process open to geographic and language bias. In addition, these terminologies also provide no quantifiable evidence on how related the concepts are. In this work, we explore an unsupervised graphical approach to mine related concepts by leveraging the volume within large amounts of clinical notes. Our evaluation shows that we are able to use a data driven approach to discovering highly related concepts for various search terms including medications, symptoms and diseases.
Introduction
Related clinical concepts are two or more terms or phrases that are highly associated and related. For example, the term “nausea” is highly related to “vomiting” and the term “pregnancy” is highly related to “ectopic”. Such related concepts are extremely important to health-care-related applications. One important use case of related concepts is for hypotheses generation. Let us say, a physician is investigating a series of nausea events that have recently affected the local community. During this investigation, if he/she observes that the term “nausea” is being strongly associated with “salmon” and “salad” in the given time range of the event, the physician may choose to investigate this relationship further to understand if the salmon or the salmon in the salad was the root cause of those nausea events and take the necessary steps to resolve the issue. Another use case of related concepts is in medical and biomedical literature search. For example, if a user is looking for articles related to “opioid”, adding related concepts such as “analgesics” and “pain” to expand the query could further improve the relevance of the search results. Currently, in order to do this, most applications will have to rely on availability of manually curated concepts in published dictionaries. With these dictionaries however, there is no guarantee that the related concepts are in fact really related as it might include a related diagnosis or a parent concept instead of concepts that may occur together.
Most health-care applications today rely primarily on existing “static” terminologies curated by human experts such as SNOMED CT, LOINC, and RxNorm. While these controlled terminologies are extremely useful, they are highly dependent on human expertise, which leads to several issues. First, since these terminologies were hand curated by several subject matter experts from a specific geographic area, the actual terminologies used in practice can vary significantly from institution to institution let alone country to country. Thus, it does not account for locality of information. Another issue is that even though related concepts can be anything from symptoms to medications to procedures, each of these controlled terminologies will only show relationships explicitly defined by the expert based on his or her knowledge (eg, clinical findings with certain procedures only) and most likely would not be able to show concepts that can only be clinically observed. For example, a lookup on chest pain using the SNOMED CT browser returns concepts such as dull chest pain, acute chest pain, and upper chest pain, but does not show related concepts that one would find specifically in clinical notes such as stuttering right-sided chest pain or radiating chest pain. Note that both these statements appeared in the MIMIC II Clinical Database 1 that we used for the evaluation of our work. Furthermore, these clinical terminologies provide no form of evidence, score, or statistics that would be extremely useful in applications. For example, what is the relationship strength between “chest pain” and “stuttering” or what is the probability of “chest pain” occurring with “vomiting” versus “chest pain” occurring with “acute”? Such questions cannot be answered with just controlled terminologies.
In this study, we thus explore a highly scalable graph-based approach to establish relationships between a search query and related concepts by leveraging large amounts of clinical notes. Specifically, we first construct a Concept-Graph using 10,000 clinical notes from the MIMIC II Database, where each node represents a unique word and the edges represent the links between the words. Then, given a query, we mine related concepts using links in the Concept-Graph and then rank the concepts found using a relatedness measure based on pointwise mutual information (PMI) and probability of co-occurrence. Evaluation on 10 different search topics, which include medications, symptoms, and diagnosis by five physicians, shows an average precision of 0.98 and a utility score above 0.90 using our best system.
One key advantage of our approach is that it is very general in that users would have complete control on the data used for constructing the Concept-Graph to find related concepts. For example, a user can construct a Concept-Graph using only cancer treatment related notes. A user may also choose to use all notes within the organization to construct a comprehensive Concept-Graph. Since the Concept-Graph also provides evidence information, one can directly obtain various statistical information from the graph to be used within an application. The resources used as part of this work can be found at https://github.com/rxnlp/clinical-concepts.
Methods
The goal of this work is to find a list of related clinical concepts given a search query by leveraging large amounts of clinical notes. The intuition here is that the volume of clinical texts can provide hints on how related the concepts are to the search query based on a co-occurrence relationship within a specific distance. The search query can be any term such as a medication (eg, valium, lithium), disease (eg, COPD), side effect (eg, diarrhea, nausea), symptom (eg, chest pain, headache), or even a partial header name in a clinical note (eg, history). The search query can be a unigram (single word) or a multiword expression (eg, diabetes mellitus). The related concepts returned would be a list of unigrams ranked by their relatedness scores (we plan to explore multiword concepts in our future work). While there are many different ways to estimate the relatedness score between a search query and a candidate concept, in this paper, we evaluate two measures for relatedness, namely, a modified PMI measure and a probability of occurrence measure (PROB). We use the Concept-Graph data structure to efficiently represent large amounts of text in a way that enables quick lookup of statistical information based on the volume as well as provide indicators of which concepts are linked to the search query.
In the following subsection, we first describe how the Concept-Graph is constructed using large amounts of clinical notes and the preprocessing involved in constructing the graph. Then, in “Discovering related concepts” section, we describe how the graph is used to find candidate concepts where some of these concepts become the related concepts. Finally, in “Computing relatedness scores for ranking related concepts” section, we describe the procedure that we use for relatedness scoring so that the related concepts can be ranked based on how relevant the concepts are to the search query.
Concept-Graph construction
The first step prior to building the Concept-Graph is to preprocess the clinical notes as they are fed into the Concept-Graph. We perform minimal preprocessing on the notes that include sentencing, lowercasing, and stop word removal. Each sentence in a clinical note is considered independent of one another. Sentences can be easily obtained from the clinical notes using existing sentence segmentation tools.2,3 In our case, we developed a simple sentence segmenter using punctuation as heuristics. We also remove stop words from each sentence. Stop words are common words in the language that appear both in day to day language as well as very commonly in clinical notes. We appended the English stop words used within the Terrier Package 4 with some manually curated clinical stop words. The list of stop words used is published in https://github.com/rxnlp/clinical-concepts. While some of the common clinical note terms would naturally have a low rank using our system, these terms are distracting and yield unnecessary memory overhead and thus we dropped some of these words (eg, patient, clinic, and hospital). The preprocessing steps used in our work are graphically demonstrated in Figure 1.

Preprocessing steps applied to clinical notes.
Once preprocessing is complete, the Concept-Graph is constructed. The Concept-Graph is essentially an undirected positional word graph data structure that represents large amounts of natural language text in a compressed and easy to analyze format. It naturally models co-occurrence relationships between words, as each unique word is a node in the graph and the edges represent the relationship between words as it appears within sentences. This provides cues on which two concepts are related just by leveraging the links based on the original text. Each preprocessed sentence from the clinical notes is fed into the Concept-Graph data structure where each unique word becomes a node in the graph and each node holds the sentence identifier (SID) as well as position of the word in the sentence (PID). For example, if a word “asthma” appears 10,000 times in the text, there will only be a single node to represent asthma. The node representing asthma keeps track of which sentences used that particular word along with the corresponding positional information. An edge A-B is used to indicate that word “A” appeared at least once next to word “B” and the direction of the edge does not matter. Figure 2 shows an example of a Concept-Graph constructed with three preprocessed sentences from some clinical text. Notice that just based on this simple example, a strong link can already be seen for example between “vomiting” and “diarrhea”.

Example of concept-Graph constructed using three sentences that have been preprocessed. each node represents a unique word in the text. Each node stores the sentence identifier (SID) and the position of the word in the sentence (PID).
In constructing the Concept-Graph for this work, we utilized 10,000 clinical notes that were randomly picked from the MIMIC II Clinical Database. 1 The MIMIC II Clinical Database contains comprehensive clinical data from intensive care unit (ICU) patients. These data were collected between 2001 and 2008 from different ICUs (medical, surgical, coronary care, and neonatal) in a single hospital. Thus, the query concept can be fairly general as the MIMIC II Database covers a range of treatments and conditions. We used all 10,000 notes to construct the Concept-Graph using the steps mentioned above.
Discovering related concepts
Once the Concept-Graph has been constructed, the next step is to find the related concepts for a given query. For this, the query terms would first have to be identified in the Concept-Graph. If the query terms (eg, “chest” and “pain”) are themselves linked (there is an edge linking the terms), then all concepts that are linked to the query terms will first be identified. Based on Figure 2, for the query vomiting, linked concepts would include “morning”, “nausea”, and “diarrhea”. We will refer to these as candidate concepts. From these candidate concepts, we find the related concepts by eliminating weak links. Weak links are found by identifying candidate concepts that do not fulfill the minimum overlap requirements between the SIDs.
Specifically, if a query term, SQ, and a candidate concept, CC i , share N sentences, the candidate concept is considered a related concept if the number of shared (overlapping) sentences exceed a threshold that we refer to as σoverlap. In addition to this, the overlap computation is subject to a distance constraint referred to as σwindow, such that the distance between the positions of the words in consideration is no more than σwindow. For example, in Figure 3, based on just the overlap of the SIDs, vomiting and nausea have an overlap of 3 (SID: 3, 4, 5). However, when we take σwindow = 3 into consideration, then the overlap will only be 2 (SID: 3, 4) since the distance between the position of the words for sentence 5, is 6 (5:12–5:6) and this exceeds the σwindow threshold. Thus, if we use σgoverlap = 3 and σwindow = 3, then this relationship will be considered a weak link since the overlap ends up being only 2.

An overlap example, where vomiting is the query and nausea is the candidate concept. The format used is SID:PID where SID is the sentence identifier and PID is the position of the word in the sentence.
The intuition for the σwindow restriction is that words that appear much further away are less likely to be related to the query term than words that are much closer in general to the search term. Formally, the overlap between a search query, SQ, and a candidate concept, CC
i
, can be expressed as follows:
Computing relatedness scores for ranking related concepts
While a query term may be linked with thousands of related concepts, there are some concepts that are more related than others. For example, for the query vomiting, intuitively we know that the term nausea is more strongly related to vomiting than the word morning. The term morning most likely appears in the context of morning sickness where a patient experiences vomiting. Another example is for the query asthma where the medication “QVAR” is more related to asthma than the word medication itself. Thus, to distinguish concepts that are highly related from ones that are marginally related, we introduce a ranking system that ranks relationships based on a relatedness score. Given a search query, SQ, and a related concept, RC
i
, we denote the relatedness score as Relatedness(SQ, RC
i
). We evaluate two different ways to compute Relatedness(SQ, RC
i
) with the first measure estimating the likelihood of a related concept occurring with the search query as follows:
The first part of Equation 3,
Once Relatedness(SQ, RC i )PROB and Relatedness(SQ k , RC i )PMI are computed for each related concept for a given search query, these related concepts are sorted by decreasing order of their relatedness scores.
Evaluation
The goal of our evaluation is to understand if the top related concepts discovered by our system are in fact related and relevant to the search query. For example, is “vomiting” related to “nausea”? To perform our evaluation, we first requested a physician to provide 10 search queries that he/she may want to search for to find related concepts. The topics we received were: lithium (medication), beta-blocker (medication), penicillin (antibiotic medication), advair (inhaler for asthma treatment), chest pain, pregnancy, myocardial infarction, bloody stool, fracture, and syncope.
Ground truth
We then used our system to generate the top 50 related concepts for these 10 topics using the two relatedness measures described in “Methods” section. We then presented the results to five independent physicians to rate our results. We asked these five physicians to rate the results based on the two rankings (PMI and PROB) as follows:
Given the ratings from the five physicians, we used majority vote as the final rating. This means that if a related concept has been rated as
Evaluation metric
To evaluate the overall performance of our system, we introduce two measures, one being precision and the other we refer to as a utility score. Precision evaluates how many of the top 50 concepts are relevant (ie, not noise). Utility score on the other hand assigns a score for each type of concept produced. Specifically, a score of 2 is assigned if the system found a relevant concept (R), a score of 1.5 if the system produced a relevant but general concept (RG), and a score of –2 if the system produces noise. With this, the more noise the system produces, the more it gets penalized and the more relevant concepts the system produces the better the overall score. Given top N concepts, the utility score is computed as follows:
Results
Precision of results
We first look into precision of our results to understand how many nonrelevant concepts are produced by the system. Table 1 shows a summary of precision based on our ground truth. Notice that with both rankings, the amount of noise produced is extremely low with above 95% of the results being either R or RG. In fact, the PMI-based ranking has an average precision of 98%. This indicates that all in all, the system finds concepts that are related to the search query. Also notice that in general, the overall precision of the PMI-based ranking is slightly higher than that of the PROB-based ranking. This shows that overall, PROB-based ranking introduces more noise than PMI.
Precision of top 50 results.
Utility of results
The utility of results indicate how usable the top related concepts are in practice. The less relevant the top concepts, the lower the utility score at different rank cutoffs. Based on Table 2, we can see that overall the PMI-based ranking provides a higher utility compared with the PROB-based ranking. This is because the PMI-based ranking immediately returns concepts that are considered
Utility scores at different rank cutoffs.
Sample results
Table 3 shows a snapshot of the top related concepts for four different search queries based on PROB-based ranking and PMI-based ranking. Notice that with the PMI-based rankings, the top related concepts are very specific to the search query. With the PROB-based ranking, the top concepts are more general and the top PMI concepts start appearing in the later ranks of PROB-based ranking. This is interesting because it produces two use cases for applications. For example, some applications may value related concepts that are more general in which case the PROB-based ranking would be more suitable. Some other applications may value concepts that are very clinically related, and in that case, the PMI-based ranking would be ideal.
Top 15 related concepts for four search queries ranked based on PMI and PROB scores.
Example Usage in Practice
The concept-graph can be used in a variety of settings once it has been constructed. For example, in medical literature search, the query terms used by the user can be expanded with related concepts to improve the search results. In more specific terms, if the user is interested in literature related to asthma, related concepts such as “advair”, “shortness”, and “breath” could help bring up literature that is more relevant. Another example is in the case of ICD-10 code set filtering. If a particular ICD-10 code (ie, the description of the code after stop word removal) matches none of the top N concepts related to the clinical finding or procedure in question, then the ICD-10 code suggested can be disregarded. For instance, let us say, the clinical finding is chest pain and the suggested ICD-10 code is “M79.642 Pain in left hand”. After stop word removal, this becomes “pain”, “left”, and “hand”. While pain and left would match related concepts of chest pain, none of the top related concepts for chest pain would be a match to hand. Thus, this code can be eliminated from the list of suggestions. This can improve precision of ICD-10 code set suggestions within automatic ICD-10 coding systems.
Conclusion
In this work, we proposed a method to mine related clinical concepts by leveraging the volume within large amounts of clinical notes along with a graph data structure. Our evaluation shows that our system is able to return highly relevant concepts with above 95% precision and our best method achieves an average utility score of 0.90. This shows that the related concepts generated by our system can be immediately used for a variety of tasks, including query expansion, hypothesis generation, incident investigation, sentence completion, and ICD-10 code set filtering.
Our system is not only lightweight wherein it relies on limited linguistics resources but also very general in that the same method can be applied to different types of big clinical data. The only requirement for our method to work is to have volume in data, which is almost not a problem in this era of Big Data. For example, we can run our method on all clinical notes from a specific department (eg, cardiology) across different organizations to obtain a very focused set of related concepts. We can even use the same method on all clinical notes within a particular time range to investigate an incident or an outbreak.
While this work was evaluated using the MIMIC II Database that is a fairly general dataset, we would like to explore its use in a more narrow situation to understand its applicability in a clinical setting. We would like to work with a physician in actually using our system to investigate certain surprising relationships that could help in their future clinical investigation.
Author Contributions
Conceived and designed the experiments: KG, SL and VS. Analyzed the data: KG, SL and VS. Wrote the first draft of the manuscript: KG, SL and VS. Contributed to the writing of the manuscript: KG, SL and VS. Agree with manuscript results and conclusions: KG, SL and VS. Jointly developed the structure and arguments for the paper: KG, SL and VS. Made critical revisions and approved final version: KG, SL and VS. All authors reviewed and approved of the final manuscript.
