Graph Neural Network-Based Diagnosis Prediction

Abstract

Diagnosis prediction is an important predictive task in health care that aims to predict patients' future diagnoses based on their historical medical records. A crucial requirement for this task is to effectively model high-dimensional, noisy, and temporal electronic health record (EHR) data. Existing studies address this requirement by applying recurrent neural networks with attention mechanisms, but they suffer from data insufficiency and noise. Recently, more accurate and robust medical knowledge-guided methods have been proposed and have achieved superior performance. These methods inject knowledge from a graph-structured medical ontology into deep models via attention mechanisms to supplement the input data. However, they only partially leverage the knowledge graph and neglect its global structure information, which is an important feature. To address this problem, we propose an end-to-end robust solution, namely Graph Neural Network-Based Diagnosis Prediction (GNDP). First, we propose to use the medical knowledge graph as internal information of a patient by constructing sequential patient graphs. These graphs not only carry the historical information from the EHR but are also infused with domain knowledge. Then we design a robust diagnosis prediction model based on a spatial-temporal graph convolutional network. The proposed model extracts meaningful features from sequential graph EHR data through multiple spatial-temporal graph convolution units to generate robust patient representations for accurate diagnosis prediction. We evaluate GNDP against a set of state-of-the-art methods on two real-world medical data sets. The results demonstrate that our method makes better use of the knowledge graph and improves the accuracy of diagnosis prediction.

Introduction

The accumulation of patients' electronic health record (EHR) or electronic medical record data lays a solid foundation for applying machine learning approaches in the medical domain, thus enabling clinical predictive tasks.¹ Such predictive tasks aim to predict an individual's future health status to improve the quality of personalized health care. Diagnosis prediction, which predicts patients' future diagnoses based on their historical EHR data, is one of the most popular yet complex tasks in the research community. On one hand, diagnosis prediction may contribute to clinical anticipation and precision diagnosis.² On the other hand, the high dimensionality, temporal nature, and noisiness of EHR data bring challenges to traditional machine learning methods.^3,4

In recent years, emerging deep learning techniques have attracted considerable attention from researchers and have been widely applied in various domains, including computer vision,⁵ natural language processing,⁶ and clinical predictions.⁷ Different from traditional machine learning models that require manual feature engineering by domain experts, deep learning models learn the data representation or the task-related features automatically from the source data. With a proper objective function and sufficient high-quality data, deep models can outperform traditional models in various tasks. Recurrent neural networks (RNNs), a notable category of deep models, have been broadly applied to clinical prediction tasks.^2,8–11 The literature indicates that RNNs have an outstanding ability to model sequential relationships, thus achieving impressive success in EHR-related tasks. However, it has also been reported that deep models are vulnerable to data insufficiency and noise, which are common in EHR data.^11,12 Moreover, RNNs cannot handle long sequences effectively.¹⁰ Therefore, the challenges in diagnosis prediction tasks cannot be tackled by deep models alone.

To address the aforementioned problem, researchers propose knowledge-guided clinical predictive methods that incorporate domain knowledge into deep models.^9,11–13 These methods make use of the strong sequential modeling ability of RNNs to learn the patients' representations from EHR data while injecting supplementary information into the model to alleviate data insufficiency and noise. The information is extracted either from a medical ontology or from the medical relations within the EHR. Graph-based Attention Model (GRAM)¹² adopts RNNs to model EHR data and utilizes a medical ontology as a knowledge graph to provide supplementary information in the training stage via a graph-attention mechanism. Knowledge-based Attention Model (KAME)¹¹ further exploits the knowledge in the predicting stage of a deep model and achieves state-of-the-art (SOTA) performance in diagnosis prediction tasks. Co-attention memory networks for diagnosis prediction (CAMP)¹³ applies augmented RNN-based models with a knowledge graph to enhance the prediction accuracy. These works indicate that using the medical knowledge graph effectively improves the model's robustness against data insufficiency and noise, thus benefiting the prediction performance. In medical ontologies (e.g., International Classification of Diseases^* [ICD], Clinical Classifications Software^† [CCS]), the medical events (i.e., diagnoses, medications, and procedures) are encoded in a hierarchical structure with parent–child relationships, which naturally forms a graph. Despite their success, the above studies only partially make use of the information derived from knowledge graphs (i.e., medical code co-occurrence and parent–child relationships) and are unable to capture the graph structure features that may be equally important. Furthermore, these studies exploit the medical knowledge from ontologies as external information separated from EHR data, which may introduce extra noise when training the deep model.

This article proposes a knowledge-guided predictive method, namely the Graph Neural Network-Based Diagnosis Prediction (GNDP), to perform accurate diagnosis prediction by fully exploiting the medical knowledge. Different from existing methods that adopt RNNs and attention mechanisms, the proposed GNDP is developed based on the framework of the spatial-temporal graph convolutional network (ST-GCN).¹⁴ Moreover, we propose to reconstruct patients' EHR in the spatial-temporal graph format, which naturally infuses medical knowledge into data and converts it to internal information. Figures 1 and 2 show toy examples of the input structure for the existing method and the proposed method. With the unified framework and graph format input, GNDP can leverage the sequential information from EHR and domain knowledge from medical ontologies simultaneously to learn more robust and accurate patients' representations and perform more accurate predictions in diagnosis prediction tasks.

FIG. 1.

The input structure of RNN-based methods. EHR, electronic health record; RNN, recurrent neural network.

FIG. 2.

The input structure of the proposed GNDP. GNDP, Graph Neural Network-Based Diagnosis Prediction.

The contributions of this article are as follows:

We propose GNDP, a graph convolution network-based, end-to-end, and robust diagnosis prediction method that can make use of the underlying spatial and temporal dependence of EHR data comprehensively to improve the accuracy of diagnosis prediction.

We introduce a spatial-temporal patient graph construction method. By integrating patients' EHR data with the medical knowledge graph, the domain knowledge is converted into internal information of the data, which helps deep models extract more meaningful features.

We empirically demonstrate that the proposed GNDP outperforms SOTA RNN and attention-based methods in the diagnosis prediction task.

This article is organized as follows: the Related Work section discusses the existing studies that are related to this work, including a description of EHR data. The details of the proposed GNDP are provided in the Proposed Method section, including the basic notations, the problem definition, and the network architecture. The Experiments and Evaluations section provides the implementation details together with the ablation and comparative experimental results. Finally, the conclusions are presented in the Conclusions section.

Related Work

This section provides a description of EHR data and discusses the existing work on clinical predictive methods, especially diagnosis prediction methods.

EHR systems are primarily designed for administrative purposes.¹⁵ Such systems store patients' records, which consist of massive and diverse medical variables and information, in a sequential order associated with their visits to the hospital.¹ Nowadays, EHR systems are broadly applied worldwide and have accumulated a tremendous amount of historical patient data.¹⁵ Based on this, researchers apply EHR data to multiple clinical predictive tasks such as diagnosis prediction, risk prediction, medicine recommendation, and disease progression.

Diagnosis prediction is one of the most important and difficult tasks in clinical predictions based on EHR data. This task aims to predict the patients' future diagnoses based on their historical medical records, and a crucial requirement for it is to model the patient visits effectively. Reverse Time Attention Mechanism (RETAIN)⁸ and Dipole¹⁰ are two influential studies that adopt deep learning techniques for diagnosis prediction. RETAIN utilizes RNNs to model reverse time-ordered sequential EHR data. This is inspired by the clinical practice that the most recent health status of a patient is more informative than earlier ones. Dipole applies bidirectional long short-term memory to handle long sequences, thus enhancing the data modeling ability of predictive models. These methods show that RNNs are effective in modeling patients' historical records. However, both of them still suffer from data insufficiency and noise.¹²

In practice, data insufficiency and noise are always present in EHR data.^1,11,12 As it is hard to solve these problems through data preprocessing, researchers propose to utilize domain knowledge to supplement the original data for deep models. GRAM¹² creates a co-occurrence matrix that consists of medical codes in EHR data and ancestor codes in a medical ontology, and uses it to generate reliable medical code representations via an attention mechanism. KAME¹¹ makes use of the medical ontology to generate both the medical code and ancestor code representations and applies them to the training and predicting stages of the model via attention mechanisms. CAMP¹³ also develops a knowledge-guided method based on the medical ontology and augmented memory networks that shares the same concept with KAME. All of these methods experimentally show that the use of domain knowledge can effectively improve the performance of RNN-based deep models on diagnosis prediction tasks.

However, the knowledge from medical ontologies may not be leveraged comprehensively by the above methods. The ontologies are graphs that model a set of medical concepts and their relationships. The above studies utilize these graphs by first creating an embedding matrix that consists of nodes from the data and their parent nodes from an ontology, and then generating knowledge-infused code representations via attention mechanisms. Although these studies leverage the parent–child relationships in the knowledge graph, they cannot extract the global structure information, which is a crucial feature of a graph.

Recently, graph neural networks (GNNs) have attracted wide attention from the deep learning community.^16,17 Different from convolutional neural networks that perform effective feature extraction on grid-like data, GNNs focus on graphs, which are non-Euclidean data.¹⁷ Studies show that GNNs have been successfully applied in various graph data-based tasks.^14,16–19 For this work, the most relevant GNN is ST-GCN.¹⁴ ST-GCN is primarily designed for skeleton-based action recognition tasks. It captures the features underlying the spatial configuration and temporal dynamics of graph-structured skeleton data to generate robust and accurate predictions of human actions. However, due to the significant differences between EHR and skeleton action data, ST-GCN cannot be applied to diagnosis prediction tasks directly. The data differ in two aspects: (1) the complexity of the graph structure. The number of nodes in EHR data is significantly larger than that in skeleton data (at least 669 nodes versus at most 25 nodes)^11,14; (2) the sparsity of the node attributes. The node attributes of graph EHR data are sparser in both the spatial and temporal domains. More detailed information can be found in the Experiments and Evaluations section.

Proposed Method

This section is divided into three parts. First, it defines the basic notations and the diagnosis prediction task. Then, it provides a detailed description of the method to convert EHR data and the medical ontology into a graph structure. Finally, it describes the framework of GNDP in detail.

Basic notations

The entire set of unique medical codes (i.e., diagnosis, procedure, and medication codes) from the EHR data set is denoted by $M = \{m_{1}, m_{2}, m_{3}, \dots, m_{|M|}\}$, where $|M|$ is the total number of codes. A patient p who has T visit records is represented as a sequence of visits: $P_{p} = \{x_{1}^{p}, x_{2}^{p}, \dots, x_{T}^{p}\}$. Each visit $x_{t}^{*}$ from an arbitrary patient contains multiple medical codes from the code set M ($x_{t}^{*} \subseteq M$).

The medical ontology $G$ used in previous works is CCS, which consists of diverse medical concepts with parent–child relationships in a hierarchical structure. The leaf nodes are in the lowest layer of $G$ and represent specific medical concepts, the same as the medical codes in the EHR data set. The ancestor nodes, which are more general concepts, are in the higher layers and are related to the leaf nodes. With the medical code set M, we can get a subontology $G_{M} = M + Λ$, where $Λ = \{a_{1}, a_{2}, \dots, a_{|Λ|}\}$ are the $|Λ|$ ancestor nodes in $G$ that are related to M. In this work, the subontology $G_{M}$ is used as an undirected graph to ensure that the features of each node can be propagated bidirectionally.

With the notations above, the diagnosis prediction task can be defined as follows: Given a patient visit record $P_{*} = \{x_{1}^{*}, x_{2}^{*}, \dots, x_{T-1}^{*}\}$ and the medical ontology $G_{M}$, the proposed model GNDP generates a multihot prediction of which medical codes will appear in $x_{T}^{*}$.

Graph EHR data construction

For simplicity, the graph construction procedure is described for a single patient. Given an EHR data set with $|M|$ unique medical codes, a knowledge graph $G_{M} = (V, E)$ can be extracted from a medical ontology $G$, where V is the node set and E is the edge set of the graph. Then, an adjacency matrix A can be generated according to E, where $A_{ij} = 1$ if $(v_{i}, v_{j}) \in E$ and $A_{ij} = 0$ otherwise. For the node set $V = \{v_{1}, \dots, v_{|M|}, \dots, v_{|M| + |Λ|}\}$, the indexes from 1 to $|M|$ are medical codes from M and the remaining indexes are ancestor codes from Λ. Each node $v_{*}$ in V is encoded as a one-hot vector of length $|V|$, where the i-th element is 1 if i is the index of $v_{*}$ in V. For the p-th patient who has T visit records $P_{p} = \{x_{1}^{p}, x_{2}^{p}, \dots, x_{T}^{p}\}$, each visit except the last one is represented by a vector that is the sum of the one-hot vectors of all the medical codes in $x_{*}^{p}$ and their ancestors in $G_{M}$. Notice that each ancestor code is added only once, so the resulting vector is multihot: its nonzero positions are exactly the indexes of the medical codes that appear in the visit and of their ancestors. Different from existing works^10–12 that require a medical code or visit embedding step, we directly use these multihot vectors as the model input.

The last visit $x_{T}^{p}$ is taken as the label of this patient and is encoded by a multihot vector $y^{p} \in \{0, 1\}^{|M|}$. By stacking the visit vectors from $x_{1}^{p}$ to $x_{T-1}^{p}$ in sequence, a patient visit matrix $P_{p}$ of size $|V| \times (T - 1)$ can be generated, where $|V|$ is the number of nodes (medical codes and their ancestors) and $(T - 1)$ is the number of visits. This matrix contains not only the occurrence information of the medical codes from the original patient visits but also the temporal dependency between visits. At this point, the patient visit records are converted into graphs where the adjacency matrix A represents the invariant structure and the feature matrix $P_{p}$ records the occurrence of each node. Note that all patients from an EHR data set share the same graph structure but have different feature matrices.
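
The construction above can be illustrated with a minimal sketch on a toy ontology. The function name `build_patient_graph`, the `parent_of` dictionary, and the toy codes are hypothetical names introduced here for illustration; they are not taken from the authors' implementation. The sketch stores the feature matrix with one row per visit, that is, as the transpose of the $|V| \times (T - 1)$ matrix described in the text.

```python
def build_patient_graph(visits, parent_of, codes):
    """Return (nodes, adjacency, features, label) for one patient.

    visits    : list of sets of medical codes, ordered in time (T >= 2)
    parent_of : dict mapping a node to its parent in the ontology
                (the root has no entry); hypothetical input format
    codes     : ordered list of the unique medical codes M
    """
    # Collect the ancestor set (Lambda) reachable from M, in first-seen order.
    ancestors = []
    for code in codes:
        node = code
        while node in parent_of:
            node = parent_of[node]
            if node not in ancestors:
                ancestors.append(node)
    nodes = list(codes) + ancestors            # V = M followed by Lambda
    index = {name: i for i, name in enumerate(nodes)}

    # Undirected adjacency matrix: A[i][j] = 1 if (v_i, v_j) is an edge.
    n = len(nodes)
    adjacency = [[0] * n for _ in range(n)]
    for child, parent in parent_of.items():
        if child in index and parent in index:
            i, j = index[child], index[parent]
            adjacency[i][j] = adjacency[j][i] = 1

    def visit_vector(visit):
        # Multihot vector over V: codes in the visit plus all their ancestors.
        vec = [0] * n
        for code in visit:
            node = code
            vec[index[node]] = 1
            while node in parent_of:           # an ancestor is only set once
                node = parent_of[node]
                vec[index[node]] = 1
        return vec

    # One row per visit 1..T-1 (transpose of the |V| x (T-1) matrix).
    features = [visit_vector(v) for v in visits[:-1]]
    # The last visit is the label, encoded as a multihot vector over M only.
    label = [1 if c in visits[-1] else 0 for c in codes]
    return nodes, adjacency, features, label


# Toy example: three leaf codes, two intermediate ancestors, one root.
toy_parent_of = {"d1": "a1", "d2": "a1", "d3": "a2", "a1": "root", "a2": "root"}
toy_visits = [{"d1"}, {"d2", "d3"}, {"d1", "d3"}]
nodes, adjacency, features, label = build_patient_graph(
    toy_visits, toy_parent_of, ["d1", "d2", "d3"])
```

In this toy case the node order is `d1, d2, d3, a1, root, a2`, so the first visit `{"d1"}` activates `d1`, `a1`, and `root`.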

The proposed GNDP

The framework of GNDP is shown in Figure 3. When the patient visit graphs, each consisting of a patient feature matrix and the adjacency matrix, are fed into GNDP, a batch normalization layer is first applied to normalize the input feature matrices. Six ST-GCN units then extract the features of the patient graphs in both the spatial and temporal domains. Thus, the data dependency, the temporal dependency, and the global structure information can all be exploited by the model to learn robust patient representations. Figure 4 illustrates the details of an ST-GCN unit, which consists of three layers. The first layer performs a regular two-dimensional (2D) convolution to expand the dimension of the input node features. Then, a graph convolution propagates the expanded node features along the graph edges, producing feature maps that aggregate the information of each node and its neighbors. The last layer is similar to the first one but uses a different kernel size: it performs a 2D convolution along the temporal axis to extract temporal information from the feature maps of the previous layer. As a result, a higher-level patient feature map is generated. Each ST-GCN unit is followed by a channel-wise attention layer to help the model focus on the channels that carry more meaningful features.²⁰ The first two, middle two, and last two ST-GCN units have 64, 128, and 256 output channels, respectively, and each pair of units is followed by a global average pooling layer. The outputs of these pooling layers are concatenated to achieve feature fusion and generate the final patient feature maps. Note that the outputs of the first two pooling operations are not passed to the following units. At the end of the model, a fully connected layer with a sigmoid activation function generates the final output for diagnosis prediction. The model can be trained end-to-end.

FIG. 3.

Framework of GNDP. ST-GCN, spatial-temporal graph convolutional network.

FIG. 4.

Details of an ST-GCN unit.

Spatial graph convolution

For simplicity, the spatial graph convolution operation is explained for a single patient with a single visit. Taking the adjacency matrix A and a patient visit vector x as the inputs, an effective and efficient graph convolution can be achieved by the following function, which is defined by GCN¹⁹: $\mathrm{GCN}_{out}(x) = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}} x W_{gcn}$ (1)

and $D^{ii} = \sum_{j} (A^{ij} + I^{ij})$ (2)

where D is the degree matrix of the input graph, I is the identity matrix representing the self-connections of the nodes, and $W_{gcn}$ is a learnable weight matrix.

In practice, the input feature of a single visit x can be represented by a 2D tensor of $(|V|, 1)$ dimensions, where $|V|$ is the number of nodes in the patient graph and 1 is the dimension of the node features. A standard 2D convolution with a $1 \times 1$ kernel and a $(1, 1)$ stride is performed on the input tensor, which is equivalent to multiplying the input by a learnable weight matrix $W_{cnn} \in \mathbb{R}^{d \times d}$ and adding a bias vector $b \in \mathbb{R}^{d}$. This generates a new tensor $x'$ with the shape $(|V|, d)$. In other words, this step maps the low-dimensional node features to a higher-dimensional space through learnable weights to enrich the features. Then, the graph convolution is implemented as the matrix multiplication of the normalized adjacency matrix and the new tensor. The spatial graph convolution process can be formally described as follows: $x' = \sum W_{cnn} \cdot x + b$ (3) $\mathrm{GCN}_{out}(x') = \tilde{A} \otimes x' \otimes W_{gcn}$ (4)

and $\tilde{A} = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}}$ (5)
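
As a worked illustration of Equations (1) to (5), the following pure-Python sketch normalizes a small adjacency matrix and applies one spatial graph convolution to a single visit. It is a sketch under stated assumptions rather than the authors' implementation: the helper names are hypothetical, and because the input has a single feature per node, the $1 \times 1$ convolution weight is represented here as a length-d vector.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} as a list of lists (Eqs. 2 and 5)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]          # D_ii = sum_j (A_ij + I_ij)
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def spatial_graph_conv(adj, x, w_cnn, b, w_gcn):
    """x holds one scalar feature per node; returns a |V| x d_out matrix."""
    # Eq. (3): a 1x1 convolution on a one-channel input acts as a per-node
    # affine map from 1 to d dimensions (assumed weight shape: length d).
    x_prime = [[x_i * w + b_k for w, b_k in zip(w_cnn, b)] for x_i in x]
    # Eq. (4): propagate features along the edges, then mix the channels.
    return matmul(matmul(normalize_adjacency(adj), x_prime), w_gcn)


# Toy path graph 0 - 1 - 2 where only node 0 is active in the visit.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_tilde = normalize_adjacency(toy_adj)
out = spatial_graph_conv(toy_adj, [1, 0, 0],
                         w_cnn=[1.0, 2.0], b=[0.0, 0.0],
                         w_gcn=[[1.0, 0.0], [0.0, 1.0]])
```

With the identity matrix as `w_gcn`, the output shows the propagation directly: node 0 keeps half of its expanded feature, node 1 receives a share scaled by $1/\sqrt{6}$, and node 2, which is two hops away, receives nothing after a single convolution.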

Temporal dependency modeling

Given a patient who has T visit records, the feature matrix is formed by concatenating the $T - 1$ visit vectors in sequential order, and the T-th vector is the prediction label. This is equivalent to assigning every node in the graph a $(T - 1)$-dimensional vector that carries the temporal information. Thus, the temporal axis is well-ordered with a constrained length for each patient, which makes it possible to define a simple convolutional operation to extract features in the temporal domain. Formally, the patient feature matrix can be represented as a three-dimensional tensor with $(|V|, T - 1, 1)$ dimensions, where $|V|$ is the size of the node set, $T - 1$ is the number of visits, and 1 is the dimension of the node features. After the first convolution layer of an ST-GCN unit, a new tensor with $(|V|, T - 1, d)$ dimensions is generated. By reshaping the tensor to $(|V|, d, T - 1)$ and performing graph convolution on the first and second dimensions, a tensor with the same shape is generated. Note that this step expands the dimension of the node features and aggregates them on the first two axes of the tensor, while the temporal axis remains unchanged. Therefore, inspired by ST-GCN,¹⁴ the whole spatial-temporal convolution operation can be defined as follows: $P'_{(|V|, T-1, d)} = \sum_{0}^{|V|-1} \sum_{0}^{T-2} W_{cnn} \cdot P_{(*, *, 1)} + b_{gcn}$ (6) $P_{GCN} = \mathrm{GCN}_{out}(P'_{(|V|, T-1, d)}) = \sum_{0}^{T-2} \tilde{A} \otimes P'_{(|V|, *, d)} \otimes W_{gcn}$ (7)

$P_{TCN} = \mathrm{TCN}_{out}(P_{GCN}) = \sum_{0}^{|V|-1} \sum_{0}^{d-1} W_{tcn} \cdot P_{GCN(*, T-1, *)} + b_{tcn}$ (8)
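
To make the temporal step in Equation (8) more concrete, here is a minimal sketch of a convolution along the time axis whose kernel is shared by all nodes. The text does not state the temporal kernel size or the padding scheme, so the odd kernel length and the zero padding used below are assumptions, and `temporal_conv` is a hypothetical helper name.

```python
def temporal_conv(p, w, b):
    """1D convolution along the time axis, shared across all nodes.

    p : features indexed [node][time][channel_in]
    w : kernel indexed [tap][channel_in][channel_out], odd number of taps
    b : bias indexed [channel_out]
    Zero padding (an assumption) keeps the number of time steps unchanged.
    """
    taps, half = len(w), len(w) // 2
    out = []
    for node_feats in p:
        steps = len(node_feats)
        node_out = []
        for t in range(steps):
            row = list(b)                      # start from the bias
            for k in range(taps):
                s = t + k - half               # source time step for this tap
                if 0 <= s < steps:             # positions outside are zeros
                    for ci, value in enumerate(node_feats[s]):
                        for co in range(len(b)):
                            row[co] += value * w[k][ci][co]
            node_out.append(row)
        out.append(node_out)
    return out


# One node, three visits, one channel, and a 3-tap kernel of ones.
toy_p = [[[1.0], [2.0], [3.0]]]
toy_w = [[[1.0]], [[1.0]], [[1.0]]]
toy_out = temporal_conv(toy_p, toy_w, b=[0.0])
# toy_out == [[[3.0], [6.0], [5.0]]]
```

Each output time step mixes a node's features with those of its neighboring visits only, which is why the graph convolution is needed beforehand to mix information across nodes.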

After six spatial-temporal convolution operations, the final feature maps are defined as follows: $P_{final} = [\mathcal{F}_{pooling}(P_{TCN2}), \mathcal{F}_{pooling}(P_{TCN4}), \mathcal{F}_{pooling}(P_{TCN6})]$ (9)

where $\mathcal{F}_{pooling}(\cdot)$ is the global average pooling operation. After a fully connected layer, the output of the GNDP is defined as follows: $\hat{y} = \mathrm{Sigmoid}(\mathrm{FCN}(P_{final}))$ (10)

where $\hat{y}$ is the multihot prediction result. Note that the dimension of $\hat{y}$ is the size of the medical code set M, which is the same as that of the label y.
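
Equations (9) and (10) can be sketched as follows. With the channel sizes given earlier (64, 128, and 256 for the second, fourth, and sixth units), the fused vector would have 64 + 128 + 256 = 448 entries; the toy example below uses far smaller maps. The function names and the `[channel][time][node]` layout are assumptions made for illustration.

```python
import math


def global_average_pool(feature_map):
    """Average a map indexed [channel][time][node] over time and nodes."""
    return [sum(sum(row) for row in channel) / sum(len(row) for row in channel)
            for channel in feature_map]


def predict(unit_outputs, fc_weight, fc_bias):
    """Fuse pooled features (Eq. 9), then apply FC + sigmoid (Eq. 10)."""
    # Concatenate the pooled vectors of the selected ST-GCN units.
    fused = [v for m in unit_outputs for v in global_average_pool(m)]
    # Fully connected layer: one logit per output code.
    logits = [sum(w * f for w, f in zip(w_row, fused)) + b_i
              for w_row, b_i in zip(fc_weight, fc_bias)]
    # Element-wise sigmoid gives an independent probability per code.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


# Two toy unit outputs with 1 and 2 channels, so the fused vector has 3 entries.
toy_units = [
    [[[1.0, 3.0]]],                  # pooled to [2.0]
    [[[0.0, 0.0]], [[4.0, 4.0]]],    # pooled to [0.0, 4.0]
]
probs = predict(toy_units,
                fc_weight=[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
                fc_bias=[0.0, 0.0])
```

Because the sigmoid is applied per output, the probabilities do not sum to one, which matches the multilabel nature of the task.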

Objective function

Diagnosis prediction is a multilabel classification task. Therefore, GNDP applies the binary cross-entropy loss as the objective function to measure the discrepancy between the ground truth multihot label y and the model prediction $\hat{y}$ as follows: $\mathcal{L}(\hat{y}, y) = -\frac{1}{|M|} \sum_{i=0}^{|M|-1} (y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}))$ (11)
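
Equation (11) can be checked numerically with a few lines of Python. The clamping constant `eps` is an assumption added here for numerical safety and is not part of the equation.

```python
import math


def binary_cross_entropy(y_hat, y, eps=1e-12):
    """Mean binary cross-entropy over all output codes, as in Eq. (11)."""
    total = 0.0
    for p, t in zip(y_hat, y):
        p = min(max(p, eps), 1.0 - eps)   # avoid log(0); not part of Eq. (11)
        total += t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return -total / len(y)


uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])   # equals ln 2
confident = binary_cross_entropy([0.9, 0.1], [1, 0])    # equals -ln 0.9
```

An uninformed prediction of 0.5 for every code yields a loss of $\ln 2 \approx 0.693$, and the loss decreases as the predicted probabilities move toward the true labels.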

Graph partition strategy

The graph convolution operation defined in Equation (1) is equivalent to computing the inner product between each node feature vector and a shared weight vector, which may neglect the local properties of the graph structure.¹⁴ Since the graph EHR data are constructed in a hierarchical structure, nodes at different levels should have distinct weights. To expose this property to the model, we design a strategy that divides the graph structure of EHR data into four subsets: (1) from the leaf nodes to their second-level ancestors; (2) from first-level ancestors to third-level ancestors; (3) from second-level ancestors to fourth-level ancestors; and (4) from third-level ancestors to the root code. To achieve this, four new adjacency matrices are created based on $G_{M}$ and the partition strategy above. We assign different edge weights in the four new adjacency matrices according to the subset to which the edges belong. Thus, the differences between the node levels in the graph can be learned by the proposed model. For implementation, the four new adjacency matrices are represented by a tensor with $(4, |V|, |V|)$ dimensions.

The reasons for using subgraphs instead of the whole structure are twofold. First, a subgraph restricts the propagation of node information to a more local neighborhood when performing graph convolution operations. Therefore, the local differential properties of the whole graph structure can be captured by the model. Meanwhile, the node information can still be transferred globally through the nodes that are shared by different subsets. Second, performing graph convolution on subgraphs is computationally cheaper because the adjacency matrices of the subgraphs are sparser than that of the original graph. The effectiveness of our partition strategy is verified in the Experiments and Evaluations section.
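
One possible reading of this partition strategy is sketched below. It assumes that every node carries an integer level (0 for leaf codes, 1 to 4 for ancestors, 5 for the root) and that subset s keeps the edges whose two endpoints both lie within three consecutive levels. Both the level encoding and the default edge weights are assumptions, and `partition_adjacency` is a hypothetical name.

```python
def partition_adjacency(edges, level, num_nodes, weights=(1.0, 1.0, 1.0, 1.0)):
    """Build one weighted adjacency matrix per subset.

    edges   : list of undirected edges as (i, j) index pairs
    level   : level[i] is 0 for a leaf, 1..4 for ancestors, 5 for the root
              (assumed encoding)
    weights : edge weight used inside each subset (placeholder values)
    Returns a nested list with shape (len(weights), num_nodes, num_nodes).
    """
    tensor = []
    for s, weight in enumerate(weights):
        adj = [[0.0] * num_nodes for _ in range(num_nodes)]
        for i, j in edges:
            # Subset s covers levels s, s + 1, and s + 2.
            if s <= level[i] <= s + 2 and s <= level[j] <= s + 2:
                adj[i][j] = adj[j][i] = weight
        tensor.append(adj)
    return tensor


# Toy chain: leaf (0) - level 1 - level 2 - level 3 - level 4 - root (5).
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
toy_level = [0, 1, 2, 3, 4, 5]
toy_tensor = partition_adjacency(toy_edges, toy_level, num_nodes=6)
```

In this toy chain each subset keeps two edges, and neighboring subsets share one edge (for example, the edge between levels 1 and 2 appears in the first and second subsets), which illustrates how information can still travel across subsets through shared nodes.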

Experiments and Evaluations

Data description

We use two real-world medical data sets in the experiments to examine the performance of the proposed model in the diagnosis prediction task. Data set-I is the third version of Medical Information Mart for Intensive Care (MIMIC-III),^‡ a publicly accessible benchmark data set for critical care that has been widely used in a variety of studies.^8,10–12,21 Data set-II is a private data set that is constructed from a real-world longitudinal EHR database. The medical events in both data sets are encoded following the ICD coding system. Table 1 shows the details of the two data sets. Data set-II contains more patients, and each patient has more visit records, whereas the average number of medical events per visit in Data set-II is significantly lower than in Data set-I. Therefore, Data set-I is more challenging for training deep models for diagnosis prediction.

Table 1.

Statistics of the data sets

	Data set-I	Data set-II
No. of patients	7499	14,060
No. of visits	19,911	258,140
Average no. of visits per patient	2.67	18.36
No. of unique ICD9 codes	4880	4914
Average no. of ICD9 codes per visit	13.06	3.28
Maximum no. of ICD9 codes per visit	39	23
No. of category codes	171	154
Average no. of category codes per visit	10.16	2.68
Maximum no. of category codes per visit	30	13

ICD, International Classification of Diseases.

We follow the initial data processing procedure developed by Choi et al.⁸ to create time-ordered patient sequences for each data set, and patients who have fewer than two visits are removed. After this, an exclusive knowledge graph for each data set can be constructed from the CCS medical ontology, which is also used in previous works.^11–13 As shown in Table 2, the structure and the size of the two constructed graphs are nearly the same, and both of them are significantly more complex than the graph used in ST-GCN.¹⁴

Table 2.

Statistics of the graphs

	Graph-I	Graph-II	Graph in ST-GCN¹⁴
No. of nodes in graph	5550	5572	25
No. of edges in graph	5564	5513	24
No. of leaf nodes in graph	4880	4914	–
No. of ancestor nodes in graph	670	658	–
No. of level 4 ancestor nodes	15	15	–
No. of level 3 ancestor nodes	130	130	–
No. of level 2 ancestor nodes	323	334	–
No. of level 1 ancestor nodes	202	179	–

ST-GCN, spatial-temporal graph convolutional network.

Previous works have argued that, in practice, predicting the category of each medical event preserves sufficient granularity for each diagnosis.^11,12 Therefore, we implement category diagnosis prediction by replacing the actual diagnosis codes in the target visit of each patient with the corresponding codes in the second level of the ICD-9 hierarchy as the category labels.

Baseline method

To examine the performance of the proposed approach GNDP, we conduct comparative experiments with the following baseline models:

GNDP_. GNDP_ removes our partition strategy and performs graph convolution operations with a single adjacency matrix.

GNDP_α. GNDP_α is the backbone of the proposed model but without feature fusion and channel-wise attention.

GNDP_β. GNDP_β removes the channel-wise attention layers behind each ST-GCN unit in GNDP and keeps the average pooling layers to perform feature fusion.

GNDP_γ. GNDP_γ removes the global average pooling layers after the second, fourth, and sixth ST-GCN units in GNDP and keeps the attention layers.

ST-GCN.¹⁴ ST-GCN uses different partition strategies to divide the input graph into subsets to enhance the model performance in action recognition tasks; these strategies are not applicable to the diagnosis prediction task. Thus, we take the uni-labeling partition strategy, which is equivalent to computing the inner product between the weight vector and the feature vectors of all neighboring nodes.¹⁴ We also include ST-GCN+, which adopts our partition strategy, as a baseline.

GCN.¹⁹ GCN, which is developed by Kipf and Welling, is considered to be one of the strongest baselines for graph convolutional networks.²² We follow the data preprocessing method introduced in Choi et al.¹² and Kipf and Welling¹⁹ and feed the data into a two-layer GCN model. Note that this model is incapable of learning the time dependency of the input data.

Dipole.¹⁰ Dipole is an attention-based bidirectional recurrent neural network, and it takes the same raw input as GRAM. We implement the Dipole $_{l}$ version as given in Ma et al.,¹⁰ which is based on a location-based attention mechanism.

GRAM.¹² GRAM is the pioneering work that uses a medical knowledge graph associated with EHR data to learn the medical code representations via attention mechanisms and RNNs. We implement the GRAM $+$ version,¹² which utilizes an initialized embedding matrix with original input visit sequences to generate medical code embeddings and feeds it into a single hidden layer Gated Recurrent Unit (GRU).²³

KAME.¹¹ KAME shares the framework of GRAM. We implement this model by adding a supplementary branch that generates a knowledge vector and concatenating its output with the hidden vector generated by the GRU from GRAM before the last classification layer.

CAMP.¹³ CAMP is a recent work that uses not only the medical ontology but also patient demographics to perform diagnosis prediction. The patient demographics consist of age and gender. However, this information is only available in Data set-I. Therefore, for Data set-II we implement CAMP_, which removes the patient demographic attention branch.

RNN. We use a one-directional GRU²³ to model the EHR sequence as a baseline for all the models above.

Evaluation metric

We evaluate the performance of all baseline methods and the proposed method by using visit-level precision@k and code-level accuracy@k, the same metrics as in previous works,^11,12 to provide multigrained measurements.

Visit-level precision@k measures the prediction precision of individual visits within patient sequences. For a single visit, the final output of our model is $\hat{y} = [\hat{y}_{1}, \hat{y}_{2}, \dots, \hat{y}_{|M|}]$, where $\hat{y} \in \mathbb{R}^{|M|}$, and the ground truth label is $y = [y_{1}, y_{2}, \dots, y_{|M|}]$, where $y_{*} \in \{0, 1\}$. The visit-level precision@k is defined as follows: $\text{visit-level precision@}k = \frac{|\hat{y}_{correct}|_{k}}{\min(k, Y)}$ (12)

where $|\hat{y}_{correct}|_{k}$ denotes the number of correct predictions among the top-k outputs of $\hat{y}$, which are ranked by their probability, and $Y$ is the number of positive labels ($y_{i} = 1$) in the target visit. Code-level accuracy@k measures the overall accuracy of the model predictions. For multiple patient sequences, the code-level accuracy@k is defined as follows: $\text{code-level accuracy@}k = \frac{\sum_{i=1}^{|P|} |\hat{y}_{correct}|_{k}}{\sum_{i=1}^{|P|} |Y|}$ (13)

where $|P|$ is the total number of patients.
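As a concrete illustration of Eqs. (12) and (13), the following Python sketch computes both metrics on toy data. The function names, the pooling of target visits into a flat list, and the handling of visits without positive labels are assumptions made for this example rather than details taken from the evaluation code.

```python
def top_k_indices(scores, k):
    """Indices of the k highest scores, highest first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def correct_at_k(scores, labels, k):
    """Number of true codes among the top-k predictions of one visit."""
    return sum(labels[i] for i in top_k_indices(scores, k))


def visit_level_precision_at_k(scores, labels, k):
    """Eq. (12) for one visit: correct top-k predictions / min(k, Y)."""
    num_positive = sum(labels)
    if num_positive == 0:
        return 0.0  # assumption: a visit without positive labels scores 0
    return correct_at_k(scores, labels, k) / min(k, num_positive)


def code_level_accuracy_at_k(all_scores, all_labels, k):
    """Eq. (13): correct top-k predictions pooled over all target visits,
    divided by the pooled number of positive labels."""
    correct = sum(correct_at_k(s, y, k) for s, y in zip(all_scores, all_labels))
    total = sum(sum(y) for y in all_labels)
    return correct / total if total else 0.0


# Two toy target visits over a vocabulary of 5 medical codes.
all_scores = [
    [0.9, 0.1, 0.8, 0.3, 0.7],
    [0.2, 0.6, 0.1, 0.9, 0.4],
]
all_labels = [
    [1, 0, 0, 1, 1],
    [0, 1, 0, 0, 0],
]

per_visit = [visit_level_precision_at_k(s, y, 2) for s, y in zip(all_scores, all_labels)]
mean_visit_precision = sum(per_visit) / len(per_visit)
code_accuracy = code_level_accuracy_at_k(all_scores, all_labels, 2)
```

With $k = 2$, the first visit has one correct code among its top two predictions and three positive labels, giving $1 / \min(2, 3) = 0.5$; the second visit gives $1 / \min(2, 1) = 1.0$. Pooled over both visits, the code-level accuracy is $2 / 4 = 0.5$. The two metrics differ because the visit-level denominator is capped at k, whereas the code-level denominator counts every positive label.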

We vary k from 5 to 30 to evaluate both the coarse-grained and fine-grained performance of each model; a greater value indicates better performance.

Implementation detail

We implement all the aforementioned approaches with PyTorch^§ 1.0. All training is performed on two Nvidia Titan V GPUs with CUDA 9.0 and an Intel Core i9-7900X processor. We split the data sets with different ratios to evaluate the performance of GNDP. First, following existing methods,^11–13 the data sets are randomly divided into training, validation, and testing sets in a $0.75 : 0.10 : 0.15$ ratio, repeated 10 times. Then the ratio of the testing set is increased to $0.30$ and the ratio of the training set is decreased to $0.60$. Regularization (l₂ norm with coefficient $0.3 \times 10^{-5}$) and dropout (with a dropout rate of 0.25) are used for training GNDP. The learning rate is initially set to $0.45 \times 10^{-3}$ and decays by $10\%$ every 10 epochs. For the baseline models, the parameters are set as in their original proposals.
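The split and learning-rate settings above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the training code: the function names and seeds are hypothetical, the split is assumed to be drawn over patients, and "decays by 10% every 10 epochs" is read as multiplying the learning rate by 0.9 at every tenth epoch.

```python
import random


def split_indices(n, ratios=(0.75, 0.10, 0.15), seed=0):
    """Randomly split range(n) into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def learning_rate(epoch, base_lr=0.45e-3, decay=0.10, every=10):
    """Step decay: the rate shrinks by `decay` once per `every` epochs."""
    return base_lr * (1.0 - decay) ** (epoch // every)


# Ten random 0.75 : 0.10 : 0.15 splits (one per seed), then one 0.60 : 0.10 : 0.30 split.
splits = [split_indices(1000, seed=s) for s in range(10)]
train, val, test = split_indices(1000, ratios=(0.60, 0.10, 0.30))

# Learning rate at the start of training and after the first two decay steps.
schedule = [learning_rate(e) for e in (0, 10, 20)]
```

In PyTorch, the same schedule would roughly correspond to `torch.optim.lr_scheduler.StepLR` with `step_size=10` and `gamma=0.9`, and the l₂ coefficient to the optimizer's `weight_decay` argument; the choice of optimizer is not stated in the text, so it is left open here.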

Result and evaluation

We examine the effectiveness and necessity of the proposed components in GNDP and, at the same time, compare GNDP with the most closely related model, ST-GCN, on the diagnosis prediction task on Data set-I. Table 3 shows the code-level accuracy and visit-level precision for different k values under the split ratio of $0.75 : 0.10 : 0.15$. Compared with ST-GCN+, GNDP improves the code-level accuracy and the visit-level precision by $4.99\%$ and $4.39\%$ (absolute), respectively, when $k = 5$, and by $5.78\%$ and $5.44\%$, respectively, when $k = 30$. As mentioned above, ST-GCN is designed for skeleton-based action recognition. Compared with graph-structured patient data, skeleton data have significantly denser time stamps (i.e., >300 on the Kinetics data set¹⁴) and a much simpler structure, so patient graphs are more challenging to model. The results suggest that our proposed model is more effective than ST-GCN at modeling sparse and complex EHR data. ST-GCN+ performs better than ST-GCN, and GNDP likewise outperforms GNDP_. These results confirm the effectiveness of the graph configuration partitioning strategy. The remaining results show that each component of the proposed GNDP contributes to higher precision and accuracy. In particular, when feature fusion is removed from GNDP (i.e., the global average pooling layer after the second, fourth, and sixth units), the code-level accuracy drops by $1.38\%$ and the visit-level precision drops by $2.30\%$ when $k = 5$.

Table 3.

Results of ablation experiments

Data set	Model	Code-level Accuracy@K						Visit-level Precision@K
Data set	Model	5	10	15	20	25	30	5	10	15	20	25	30
Data set-I	GNDP	0.3432	0.5401	0.6701	0.7571	0.8176	0.8629	0.7433	0.6766	0.7182	0.7811	0.8338	0.8749
	GNDP_	0.3337	0.5338	0.6625	0.7498	0.8127	0.8603	0.7291	0.6704	0.7115	0.7754	0.8295	0.8721
	GNDP_β	0.3331	0.5257	0.6530	0.7431	0.8082	0.8558	0.7275	0.6605	0.7013	0.7669	0.8248	0.8679
	GNDP_γ	0.3294	0.5188	0.6493	0.7364	0.7983	0.8473	0.7203	0.6538	0.6990	0.7626	0.8171	0.8616
	GNDP_α	0.3082	0.5058	0.6333	0.7228	0.7847	0.8303	0.7077	0.6420	0.6805	0.7460	0.8004	0.8407
	ST-GCN+	0.2933	0.4901	0.6095	0.6890	0.7434	0.8051	0.6994	0.6349	0.6711	0.7273	0.7782	0.8205
	ST-GCN	0.2801	0.4722	0.5895	0.6802	0.7364	0.7993	0.6922	0.6204	0.6619	0.7187	0.7737	0.8147

The values in bold are the best results in this experiment.

GNDP, Graph Neural Network-Based Diagnosis Prediction.
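The gains quoted for GNDP over ST-GCN+ can be checked directly against Table 3. The short Python sketch below copies the relevant entries from the table and computes the differences in percentage points; the dictionary layout and the function name `gain_in_points` are assumptions made for this example.

```python
# Entries copied from Table 3 (Data set-I): {model: {metric: {k: value}}}
TABLE3 = {
    "GNDP": {
        "accuracy": {5: 0.3432, 30: 0.8629},
        "precision": {5: 0.7433, 30: 0.8749},
    },
    "ST-GCN+": {
        "accuracy": {5: 0.2933, 30: 0.8051},
        "precision": {5: 0.6994, 30: 0.8205},
    },
}


def gain_in_points(model, baseline, metric, k, table=TABLE3):
    """Absolute difference between two table entries, in percentage points."""
    diff = table[model][metric][k] - table[baseline][metric][k]
    return round(100.0 * diff, 2)


gains = {
    (metric, k): gain_in_points("GNDP", "ST-GCN+", metric, k)
    for metric in ("accuracy", "precision")
    for k in (5, 30)
}
```

The resulting values are 4.99 and 4.39 points at $k = 5$ and 5.78 and 5.44 points at $k = 30$, which shows that the reported improvements are absolute differences between table entries rather than relative changes.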

We use Data set-I and Data set-II to compare our model against SOTA approaches under the split ratio of $0.75 : 0.10 : 0.15$. Table 4 shows that the knowledge-guided models, namely GNDP, GRAM, KAME, and CAMP, achieve better performance on both data sets than the models without domain knowledge. These results suggest that using medical knowledge can effectively improve the performance of diagnosis prediction models. However, the code-level accuracy and visit-level precision of Dipole and RNN are significantly higher than those of GCN, which performs diagnosis prediction on the knowledge graph directly. The reason is that GCN uses only the spatial information from the knowledge graph and cannot capture time sequence information. This implies that the temporal features of EHR data are crucial for predicting patients' future health information.

Table 4.

Results of comparative experiments-I

Data set	Model	Code-level Accuracy@K						Visit-level Precision@K
Data set	Model	5	10	15	20	25	30	5	10	15	20	25	30
Data set-I	GNDP	0.3432	0.5401	0.6701	0.7571	0.8176	0.8629	0.7433	0.6766	0.7182	0.7811	0.8338	0.8749
	CAMP	0.3225	0.5173	0.6489	0.7285	0.7933	0.8457	0.7219	0.6680	0.7074	0.7623	0.8219	0.8541
	CAMP_	0.3188	0.5074	0.6401	0.7253	0.7870	0.8411	0.7139	0.6630	0.7010	0.7523	0.8129	0.8453
	KAME¹¹	0.3167	0.5100	0.6379	0.7240	0.7862	0.8303	0.7103	0.6568	0.6967	0.7562	0.8091	0.8470
	GRAM¹¹	0.3123	0.5026	0.6296	0.7142	0.7798	0.8266	0.6698	0.6447	0.6847	0.7439	0.8007	0.8424
	Dipole¹¹	0.2774	0.4556	0.5801	0.6671	0.7354	0.7902	0.6220	0.5839	0.6310	0.6912	0.7542	0.8017
	RNN¹¹	0.2760	0.4548	0.5751	0.6647	0.7350	0.7867	0.6158	0.5803	0.6243	0.6912	0.7542	0.8017
	GCN	0.2465	0.3902	0.4909	0.5941	0.6790	0.7317	0.5526	0.5328	0.5751	0.6249	0.7011	0.7324
Data set-II	GNDP	0.6539	0.8033	0.8633	0.9028	0.9242	0.9429	0.6924	0.8251	0.8804	0.9142	0.9322	0.9490
	CAMP_	0.6015	0.7611	0.8410	0.8772	0.9052	0.9295	0.6635	0.8007	0.8587	0.9013	0.9197	0.9402
	KAME	0.6005	0.7602	0.8331	0.8753	0.9045	0.9249	0.6618	0.7934	0.8561	0.8923	0.9181	0.9353
	GRAM	0.5716	0.7533	0.8285	0.8746	0.9062	0.9227	0.6460	0.7855	0.8267	0.8908	0.9181	0.9321
	Dipole	0.5756	0.6893	0.7640	0.8102	0.8481	0.8828	0.6376	0.7418	0.8049	0.8430	0.8721	0.9001
	RNN	0.5677	0.6814	0.7525	0.8011	0.8403	0.8769	0.6366	0.7401	0.7932	0.8344	0.8528	0.8970
	GCN	0.5112	0.5726	0.6433	0.7201	0.7317	0.7512	0.5526	0.5328	0.5751	0.6249	0.7011	0.7324

The values in bold are the best results in this experiment.

CAMP, co-attention memory networks for diagnosis prediction; GRAM, graph-based attention model; KAME, knowledge-based attention model; RNN, recurrent neural network.

For the knowledge-guided models, compared with CAMP, GNDP improves the code-level accuracy by $2.07\%$, $2.12\%$, and $1.72\%$ and the visit-level precision by $2.14\%$, $1.08\%$, and $2.08\%$ when $k = 5, 15, 30$ on Data set-I. On Data set-II, compared with CAMP_, the improvements of GNDP are $5.24\%$, $2.23\%$, and $1.34\%$ for code-level accuracy and $2.89\%$, $2.17\%$, and $0.88\%$ for visit-level precision when $k = 5, 15, 30$. It is worth noting that the performance gap between CAMP_ and KAME is small, because CAMP_ and KAME use the knowledge information in a similar way. We conduct another set of comparative experiments under the split ratio of $0.60 : 0.10 : 0.30$ with the most competitive baselines. As shown in Table 5, GNDP still outperforms the other baselines. Moreover, although the code-level accuracy and visit-level precision of all the knowledge-guided models drop slightly when the training set is reduced, they remain better than those of the models without domain knowledge shown in Table 4.

Table 5.

Results of comparative experiments-II

Data set	Model	Code-level Accuracy@K						Visit-level Precision@K
Data set	Model	5	10	15	20	25	30	5	10	15	20	25	30
Data set-I	GNDP	0.3400	0.5387	0.6642	0.7520	0.8146	0.8606	0.7378	0.6754	0.7121	0.7760	0.8310	0.8725
	CAMP	0.3175	0.5084	0.6425	0.7269	0.7902	0.8400	0.7211	0.6649	0.7046	0.7537	0.8151	0.8435
	CAMP_	0.3139	0.5062	0.6401	0.7167	0.7813	0.8326	0.7105	0.6581	0.6992	0.7500	0.8030	0.8409
	KAME	0.3117	0.5028	0.6296	0.7211	0.7822	0.8248	0.7057	0.6494	0.6910	0.7530	0.8049	0.8380
	GRAM	0.3041	0.4976	0.6217	0.7123	0.7737	0.8206	0.6647	0.6447	0.6847	0.7392	0.7913	0.8327
Data set-II	GNDP	0.6510	0.7978	0.8631	0.8975	0.9229	0.9376	0.6866	0.8244	0.8768	0.9067	0.9226	0.9358
	CAMP_	0.5948	0.7595	0.8377	0.8718	0.9034	0.9288	0.6558	0.7983	0.8515	0.8961	0.9119	0.9329
	KAME	0.5998	0.7544	0.8311	0.8741	0.8929	0.9241	0.6573	0.7932	0.8496	0.8921	0.9163	0.9305
	GRAM	0.5669	0.7512	0.8204	0.8726	0.8957	0.9167	0.6408	0.7833	0.8230	0.8826	0.9112	0.9238

The values in bold are the best results in this experiment.

These results demonstrate that, owing to its better utilization of the medical knowledge graph and a reasonable model configuration, the proposed GNDP generates more accurate predictions than the existing knowledge-guided models.

Conclusions

In this study, we propose GNDP, a novel diagnosis prediction method that predicts patients' future health status based on their historical medical records. Taking advantage of GNNs, GNDP learns spatial and temporal patterns from patients' sequential graph data, in which the knowledge from the medical ontology and the information from the EHR are naturally fused. In this way, GNDP can make full use of medical knowledge as internal information of EHR data to improve prediction accuracy. We experimentally verify the necessity of the model components through ablation experiments and compare our model with SOTA approaches on two real-world EHR data sets in diagnosis prediction tasks. Experimental results confirm that GNDP significantly outperforms RNN-based, attention-based, and knowledge-guided clinical prediction models.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This work was supported by the National Key Research and Development Program of China under grant No. 2018YFC130078, the National Natural Science Foundation of China General Program under grant No. 61672420, the Key Project of Natural Science Foundation of China under grant No. 61532015, the Project of China Knowledge Center for Engineering Science and Technology, the National Natural Science Foundation of China Innovation Research Team under grant No. 61721002, and the Innovation Research Team of the Ministry of Education (IRT_17R86).

References

1. Jensen, Jensen, Brunak. Mining electronic health records: Towards better research applications and clinical care. Nat Rev Genet. 2012; 13:395–405.

2. Choi, Bahadori, Sun. Doctor AI: Predicting clinical events via recurrent neural networks. CoRR, abs/1511.05942, 2015.

3. Kho, Hayes, Rasmussen-Torvik, et al. Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J Am Med Inform Assoc, 2012; 19:212–218.

4. Caruana, Lou, Gehrke, et al. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10–13, 2015. pp. 1721–1730.

5. Ren, He, Girshick, Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.

6. Devlin, Chang M-W, Lee, Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

7. Shickel, Tighe, Bihorac, Rashidi. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform, 2018; 22:1589–1604.

8. Choi, Bahadori, Schuetz, et al. RETAIN: Interpretable predictive model in healthcare using reverse time attention mechanism. CoRR, abs/1608.05745, 2016.

9. Choi, Xiao, Stewart, Sun. Mime: Multilevel medical embedding of electronic health records for predictive healthcare. CoRR, abs/1810.09593, 2018.

10. Ma, Chitta, Zhou, et al. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks. CoRR, abs/1706.05764, 2017.

11. Ma, You, Xiao, et al. KAME: Knowledge-based attention model for diagnosis prediction in healthcare. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22–26, 2018. pp. 743–752.

12. Choi, Bahadori, Song, et al. GRAM: Graph-based attention model for healthcare representation learning. CoRR, abs/1611.07012, 2016.

13. Gao, Wang, et al. CAMP: Co-attention memory networks for diagnosis prediction in healthcare. In: 2019 IEEE International Conference on Data Mining, ICDM 2019, Beijing, China, November 8–11, 2019. pp. 1036–1041.

14. Yan, Xiong, Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. CoRR, abs/1801.07455, 2018.

15. Birkhead, Klompas, Shah. Uses of electronic health records for public health surveillance to advance public health. Ann Rev Public Health, 2015; 36:345–359.

16. Yao, Mao, Luo. Graph convolutional networks for text classification. CoRR, abs/1809.05679, 2018.

17. Zhou, Cui, Zhang, et al. Graph neural networks: A review of methods and applications. CoRR, abs/1812.08434, 2018.

18. Velickovic, Cucurull, Casanova, et al. Graph attention networks. ArXiv, abs/1710.10903, 2017.

19. Kipf, Welling. Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907, 2016.

20. Chen, Zhang, Xiao, et al. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. CoRR, abs/1611.05594, 2016.

21. Zhang, Qian, Li, et al. An interpretable fast model for predicting the risk of heart failure. In: Proceedings of the 2019 SIAM International Conference on Data Mining, SDM 2019, Calgary, Alberta, Canada, May 2–4, 2019. pp. 576–584.

22. Zhang, Cui, Zhu. Deep learning on graphs: A survey. CoRR, abs/1812.04202, 2018.

23. Cho, van Merrienboer, Gülçehre Ç, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.