Abstract
Background
Accurate extraction of relations between clinical entities is important for tasks such as clinical decision support, automated medical coding, and large-scale analysis of clinical text. However, most medical records are written as long, unstructured, and sometimes noisy narratives, which makes it difficult for traditional relation extraction systems to analyze document-level context properly. Improving relation extraction after entity recognition therefore remains a key challenge in clinical natural language processing.
Methods
In this work, we propose a deep learning framework that combines RoBERTa, a bidirectional gated recurrent unit (Bi-GRU) network, a multi-level attention mechanism, and a conditional random field (CRF) layer for clinical relation extraction. RoBERTa generates strong contextual embeddings for each token. The Bi-GRU layers then model sequential semantic dependencies in a more lightweight way than transformer-only models. The multi-level attention module operates at both the word and sentence levels to down-weight irrelevant or noisy information and highlight clinically important context. Finally, a CRF layer is added on top to produce globally consistent relation labels. The model is trained and evaluated on the MTSamples clinical transcription dataset.
Results
The experimental results show that the proposed model outperforms recent baseline systems and achieves a higher F1-score. The multi-level attention improves the ability of the model to capture long-range context and document-level relation cues.
Conclusion
The RoBERTa-Bi-GRU-Attention-CRF framework offers an effective and scalable way to extract the relationships between entities from unstructured clinical narratives. The improved relation extraction performance can support more accurate downstream applications in clinical information extraction and intelligent healthcare systems.
Keywords
Introduction
Information extraction is one of the key functions of Intelligent Communication Systems (ICS). It helps deliver intelligent and automated services in many domains. Typical application domains of ICS include recommendation systems (Ahmadian et al., 2022; Chen et al., 2020, December 8–13), emotion monitoring (Chen et al., 2020, December 8–13), intelligent healthcare services (Li et al., 2022), web data retrieval (Arora et al., 2023; Liang et al., 2023), and wireless sensing technologies (Qiu et al., 2018). Government organizations also use ICS to analyze public opinion and to support policy decisions. Within this area, relation extraction (RE) is an important task that focuses on finding the semantic links between entities in unstructured text.
The use of Electronic Medical Records (EMRs) has increased rapidly over the past decade. For example, in the United States, EMR adoption rose from around 10% to more than 96% between 2008 and 2017. This shift from paper-based to digital records has been driven mainly by government incentives, advances in technology, and the need for better efficiency and patient care. EMRs store rich clinical information, such as patient demographics, chief complaints, medical history, physical examination findings, diagnostic test results, medication history, treatment plans, progress notes, and discharge summaries. Extracting structured information from these unstructured texts is an essential task for many healthcare researchers.
One common way to extract structured information is to build triplets in the form of a subject-predicate-object model. In the medical domain, a triplet can connect a patient (subject) through a diagnosis (predicate) to a specific disease (object). Such triplets support several important real-world applications: (i) analyzing the relationship between patient attributes and disease outcomes for health risk prediction (Nelson et al., 2022; Ning et al., 2022), (ii) building clinical knowledge bases to assist clinical decision-making (Li et al., 2020; Liu et al., 2023), and (iii) developing diagnostic support tools that combine information from clinical records and biomedical literature (Wu et al., 2018).
However, manually extracting clinical entity relationships from EMRs is slow, inconsistent, and error-prone. Recent progress in natural language processing (NLP) and machine learning has greatly improved the automation, accuracy, and scalability of this relation extraction task.
Entity relationship extraction from clinical abstracts is mainly useful for constructing clinical knowledge graphs. These knowledge graphs can automatically identify the relationships among treatments, interventions, outcomes, and patient characteristics. They reduce the effort needed for evidence synthesis, support compliance with clinical trial protocols, and enable more personalized healthcare. Knowledge graphs built from these relationships help clinicians stay up to date on new therapeutic options and current research trends. In addition, analyzing the extracted relationships can highlight research gaps in existing studies and guide researchers in designing future clinical trials (Mohamed et al., 2020).
In this work, we use the MTSamples dataset as a representative benchmark. It contains transcriptions from about 40 medical specialties, with 140,214 total sentences and around 50,000 unique sentences. The proposed entity relationship extraction model can process large volumes of clinical text and identify the important clinical entities and their relationships. The model improves the understanding of clinical research data and supports downstream applications such as intelligent medical question answering, disease prediction, drug recommendation systems, and detection of adverse drug reactions.
The main contributions of this study are summarized below. First, based on insights from UMLS and previous studies, we identify clinically relevant associations in text and construct a specialized corpus for clinical entity relationship extraction. Second, we use pre-trained language models to capture semantic information and generate dynamic word representations; a hierarchical collaborative bidirectional GRU (Bi-GRU) network is then applied to extract deep latent features from the generated input vectors. Third, we design a multi-level attention mechanism to capture the important contextual information from each sentence of clinical text. Finally, we add a conditional random field (CRF) layer on top of the network to produce more accurate and consistent predictions of relationships among the clinical entities.
Related Work
In NLP, a relation is a meaningful connection or association between two or more entities. The main goal of RE is to automatically find and extract these connections from text, usually in the form of triplets and their components. Existing RE methods are generally divided into three groups: rule-based methods, classical machine learning (ML) methods, and deep learning (DL) approaches (Shi et al., 2020).
Rule-based methods depend on linguistic experts to design grammar rules or pattern templates for detecting relations. These methods can work well in domain-specific settings, but they usually do not generalize well and are not flexible enough to transfer to other domains. To improve pattern-based techniques, Nakashole et al. (2012) suggested using external knowledge, such as knowledge graphs, to enrich the entity and relation representations. This kind of integration can improve the accuracy of relation extraction and make the results more useful for downstream applications.
Zhang et al. (2017) reviewed the key traditional methods for RE and noted that many of them are based on statistical language models. Classical supervised ML approaches are often divided into two categories: kernel-based methods (Zelenko et al., 2003) and feature-engineering methods (Xu et al., 2012). In feature-engineering methods, linguistic features are transformed into vector representations. In kernel-based methods, special kernel functions are used to measure the similarity between relation instances.
In recent years, DL-based methods have become very popular for RE, especially those that use Graph Neural Networks (GNNs) and Distant Supervision (DS). Wu et al. (2021) discussed the basic ideas, development, and research trends of GNNs, while Scarselli et al. (2009) introduced the original GNN framework. GNN-based models are often effective for RE because they can capture the structural dependencies among the entities in a sentence or document. DS frameworks assume that if two entities are linked by a relation in a knowledge base and appear together in a sentence, then that sentence is likely to express that relation.
Graph Convolutional Networks (GCNs) are a type of GNN that applies convolution operations to graph structures, and they have been widely used for RE tasks. Zhang et al. (Mohamed et al., 2020) proposed a contextualized GCN (C-GCN), which uses a pruned dependency tree to focus on the shortest path between potential entity pairs. Guo et al. (2019) introduced an attention-guided GCN (AGGCN), which allows the model to automatically learn and select the subgraph structures that are most helpful for RE. Jin et al. (2020) pointed out that many earlier works did not model interdependent relations jointly, which can lead to the loss of relational context. They suggested that GCNs can efficiently capture the mutual dependencies among relations. As another example of DL expanding into new application areas, Wu et al. (2019) presented a deep reinforcement learning approach for autonomous search by UAVs in complex disaster environments.
In practical applications, RE is often treated as a text classification problem (Dongmei et al., 2020). This view makes it possible to use standard text classification architectures for RE. Transformer-based pre-trained language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) have shown strong performance on many NLP tasks, including text classification (Devlin et al., 2019; Jin et al., 2020). Because they can learn rich contextual representations from raw text, they have become standard DL models for RE.
Recently, biomedical and clinical RE has seen notable progress through hybrid models that combine transformers, graph structures, and Large Language Models (LLMs). Fang et al. (2025) proposed the GLiM hybrid model, which dynamically builds a graph-transformed representation of clinical text and then uses an LLM to reason over the graph, handling document-level biomedical RE with incomplete annotations. Yuan et al. (2024) proposed the HTGRS model, which uses a hierarchical tree graph and a relation-segmentation module to capture contextual information and improve cross-sentence inference.
Other work in this domain has combined syntactic structure with transformers. Kim et al. (Chen et al., 2020, December 8–13) combined BERT with a Graph Attention Network (GAT) applied over dependency parses. This hybrid framework, together with a simplified T5 model, showed that combining syntax with transformers can improve performance. Jain et al. (2023) proposed ReOnto, a neuro-symbolic system that uses ontological paths and GNNs to bring in domain knowledge from biomedical ontologies. Chaturvedi et al. (2025) focused on temporal relation extraction from clinical narratives and used a span-based graph transformer to model event sequences.
This literature shows that transformer-based RE has given promising results in the clinical domain. Yang et al. (2021) systematically evaluated BERT, RoBERTa, and XLNet models on clinical RE tasks. Their results showed that the clinical variants of RoBERTa and XLNet achieved higher F1-scores than standard BERT on benchmark datasets (MADE1.0, n2c2).
More recently, methods that combine domain knowledge with transformers have appeared in this domain. Roy and Pan (2021) enriched BERT with UMLS knowledge for clinical relation extraction, so that the model can use both structured medical knowledge and contextual language representations.
Methodology
In this section, we describe the proposed model for extracting entity relationships from clinical texts. The model is built on top of the pre-trained RoBERTa language model, which has shown strong performance on many NLP tasks. First, we explain the overall framework and its underlying theory. Then, we define the types of entity relationships considered and how the extraction task is implemented. After that, we describe how the data samples are labeled and how a dedicated relation extraction corpus is built from the clinical trial texts. To improve generalization and reduce overfitting, dropout regularization is applied in the network. Finally, comparative experiments show that the proposed model is effective and performs better than the baseline methods for clinical entity-relation extraction.
Overview of the Entity-Relation Extraction (E-RE) Model
The main goal of the Entity-Relation Extraction (E-RE) model is to capture the semantic relationships between entities present in clinical text. From these extracted relations, a knowledge map is constructed that connects the concepts within the clinical domain.
Problem Definition
Let the training dataset for relation extraction be denoted as
Model Architecture
Figure 1 presents the complete pipeline of our E-RE model, which uses as its backbone the Hierarchical Attention Network introduced by Yang et al. (Roy & Pan, 2021). The model is organized into five clearly separated levels: the input layer, the representation layer, word-level encoding and attention, sentence-level encoding and attention, and the CRF decoding layer. The processing steps of the architecture are as follows:

Entity-relation extraction (E-RE) model (proposed).
Step-1: Input Layer
At the input layer, the document is represented as a sequence of sentences. Each sentence
Step-2: Representation Layer (RoBERTa Encoder)
At this layer, all input tokens
Step-3: Word-Level Encoding with Bi-GRU
For each sentence
Step-4: Word-Level Attention Mechanism
Not all words in the input contribute equally to the meaning of a sentence. Therefore, a word-level attention mechanism is applied, as denoted by
This process allows the model to focus on important clinical words, such as disease names, symptoms, or treatment actions, while reducing the influence of irrelevant tokens.
Step-5: Sentence-Level Encoding with Bi-GRU
After word attention, the sentence vectors
Step-6: Sentence-Level Attention Mechanism
To further identify the most informative sentences, a sentence-level attention mechanism is applied. In this, each sentence hidden state
This mechanism enables the model to emphasize the sentences that are more relevant for clinical entity prediction, such as diagnostic conclusions or treatment descriptions.
Step-7: CRF Decoding Layer
At the top of the architecture, the final representation is passed to a CRF layer. This layer models the label dependencies between adjacent tokens, ensuring that the predicted label sequence follows valid constraints. Instead of predicting labels independently, the CRF finds the globally optimal label sequence using transition scores. During inference, the Viterbi algorithm is used to decode the best label sequence.
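As a minimal sketch of the decoding step, the following pure-Python function runs Viterbi over per-position unary scores and label-to-label transition scores. The function name `viterbi_decode`, the label set, and all scores are illustrative assumptions rather than values from the model; learned start and end transitions are omitted for brevity.

```python
from typing import Dict, List, Sequence


def viterbi_decode(
    unary: Sequence[Dict[str, float]],
    transition: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the highest-scoring label sequence.

    unary[t][y]      : score of label y at position t (hypothetical values)
    transition[a][b] : score of moving from label a to label b
    """
    if not unary:
        return []
    labels = list(unary[0].keys())
    # best[t][y] holds the best score of any path that ends in label y at t.
    best = [dict(unary[0])]
    backptr: List[Dict[str, str]] = []
    for t in range(1, len(unary)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition[p][cur])
            scores[cur] = best[-1][prev] + transition[prev][cur] + unary[t][cur]
            pointers[cur] = prev
        best.append(scores)
        backptr.append(pointers)
    # Follow the back-pointers from the best final label to the start.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(backptr):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy example: the transition O -> I is heavily penalised.
LABELS = ["B", "I", "O"]
transition = {a: {b: 0.0 for b in LABELS} for a in LABELS}
transition["O"]["I"] = -10.0
unary = [
    {"B": 1.0, "I": 0.0, "O": 2.0},
    {"B": 0.0, "I": 3.0, "O": 1.5},
]
print(viterbi_decode(unary, transition))
```

In this toy case, choosing the best label at each position independently would give the sequence O, I, which the transition penalty makes inconsistent. Viterbi instead selects the sequence with the best combined score.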
This hierarchical, attention-guided design allows the model to jointly capture local word-level semantics, sentence-level structure, and global document-level context. This clear and simple flow makes the architecture well suited to complex clinical text analysis.
In this study, we use a pre-trained RoBERTa encoder to generate contextual word representations from clinical text. These embeddings are then provided as input to the downstream Bi-GRU and multi-level attention modules of our E-RE framework. RoBERTa belongs to the same Transformer family as BERT, but it is trained with several design changes that yield stronger and more stable representations. We select RoBERTa as the backbone because of two important differences from BERT: (i) the tokenization/segmentation strategy and (ii) the pre-training procedure (summarized in Figure 2).

BERT vs RoBERTa comparison for representation learning in our model. (a) Masking policy used during masked language modeling, (b) the next sentence prediction (NSP) modeling, (c) byte-level text encoding/tokenization, and (d) common RoBERTa variants/model scales.
First, RoBERTa uses Byte Pair Encoding (BPE) for subword segmentation, while BERT uses the WordPiece algorithm. In BPE, the vocabulary is constructed by starting from a character-level inventory and repeatedly merging the most frequent adjacent symbol pair until a fixed vocabulary size is reached. This frequency-driven merging process is simpler and more efficient than WordPiece, and it helps create consistent subword units across related word forms. For example, words such as “lower” and “lowest” may share the subword stem “low”, which helps the model learn morphological patterns more effectively. In addition, RoBERTa's byte-level encoding reduces the number of unknown tokens produced during tokenization. This is especially useful in clinical narratives, where abbreviations, rare terms, and spelling variations occur frequently. In contrast to BPE, WordPiece chooses subwords by maximizing the likelihood improvement under a language model objective, which is more expensive to optimize over very large corpora. For this reason, RoBERTa adopts the BPE strategy to support robust subword decomposition and improve coverage of domain-specific terminology.
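The merge procedure described above can be illustrated with a small character-level sketch. It is a simplification: RoBERTa's tokenizer operates on bytes and uses a vocabulary learned from a large corpus, whereas the word frequencies, function names, and tie-breaking rule below are assumptions chosen only to keep the example deterministic.

```python
from collections import Counter
from typing import Dict, List, Tuple

Symbols = Tuple[str, ...]


def count_pairs(vocab: Dict[Symbols, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(vocab: Dict[Symbols, int], pair: Tuple[str, str]) -> Dict[Symbols, int]:
    """Replace every occurrence of `pair` with the concatenated symbol."""
    merged: Dict[Symbols, int] = {}
    for symbols, freq in vocab.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merges, starting from single characters."""
    vocab: Dict[Symbols, int] = {tuple(w): f for w, f in word_freqs.items()}
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Most frequent pair; ties are broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges


# Hypothetical toy corpus built around the "low" stem.
print(learn_bpe({"low": 5, "lower": 2, "lowest": 3}, num_merges=2))
```

After two merges on this toy corpus, "low" becomes a single symbol shared by "low", "lower", and "lowest", which is the stem-sharing effect described in the paragraph above.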
Second, RoBERTa differs from BERT in its pre-training strategy. BERT is trained with Masked Language Modeling (MLM) together with Next Sentence Prediction (NSP), and its masking pattern is typically selected once during preprocessing and then kept fixed. In contrast, RoBERTa removes the NSP objective and adopts dynamic masking, in which different tokens may be masked in different epochs even for the same sentence. The model therefore observes many masking configurations over the course of training, which increases the diversity of supervision signals and improves learning capacity, especially when the training corpora are large. For instance, in a sentence like “The quick brown fox jumps over the lazy dog,” different words may be masked across iterations, so the model learns to recover the missing information under multiple contexts instead of only one fixed mask layout. Dynamic masking helps build stronger contextual representations, which is important for our clinical relation extraction setting, where relation cues can be subtle and context-dependent.
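The difference between the two masking policies can be sketched in a few lines. This is a simplified illustration, not the actual pre-training code. Each token is masked independently with a fixed probability, and the 80/10/10 replacement rule used in practice is omitted. The names `mask_tokens`, `static_masking`, and `dynamic_masking` are hypothetical.

```python
import random
from typing import List

MASK = "<mask>"


def mask_tokens(tokens: List[str], rng: random.Random, mask_prob: float = 0.15) -> List[str]:
    """Mask each token independently with probability `mask_prob`."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def static_masking(tokens: List[str], epochs: int, seed: int = 0) -> List[List[str]]:
    """Sample one mask pattern during preprocessing and reuse it every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [list(masked) for _ in range(epochs)]


def dynamic_masking(tokens: List[str], epochs: int, seed: int = 0) -> List[List[str]]:
    """Sample a fresh mask pattern each time the sentence is seen."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]


sentence = "The quick brown fox jumps over the lazy dog".split()
for epoch, view in enumerate(dynamic_masking(sentence, epochs=3)):
    print(epoch, " ".join(view))
```

With static masking every epoch sees the same view of the sentence, whereas the dynamic variant draws a new pattern per epoch, so the masked positions will usually differ between epochs.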
In the proposed model, we select the Gated Recurrent Unit (GRU) network, a popular variant of the Long Short-Term Memory (LSTM) network that was originally proposed in (Williams & Zipser, 1989). We employ a Bi-GRU, which considers both past and future context when computing the hidden state at time t.
The proposed E-RE model integrates the Bi-GRU network as shown in Figure 1. At both the word level and the sentence level, the Bi-GRU encodes the input into contextual representations. The Bi-GRU retains the ability of the LSTM to model long-term dependencies but has a simpler internal structure than the Bi-LSTM. By combining the forward and backward GRU states, the model learns deep contextual features and captures semantic relationships that depend on the entire input sequence.
Bi-GRU at the Word Level
At the word level, the Bi-GRU encodes each token of the sentence. The forward hidden state
Here,
At the sentence level, the model treats each sentence
Using a Bi-GRU at both the word and sentence levels allows the proposed model to capture the sequential dependencies within and between sentences. After these hidden states are obtained, self-attention (SA) is applied to highlight the semantically important parts. At the end of this process, the word-level and sentence-level features are combined to form a rich representation of the document, which is then used for relation extraction.
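For reference, the commonly used GRU formulation and the bidirectional concatenation can be written as follows. This is the standard textbook form and not necessarily the notation of this paper. The symbols $x_t$ (input at step $t$), $W_{\ast}$, $U_{\ast}$, $b_{\ast}$ (learned parameters), $\sigma$ (logistic sigmoid), and $\odot$ (element-wise product) are assumptions introduced here for illustration, and some references swap the roles of $z_t$ and $1 - z_t$ in the final update.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new hidden state)} \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\right] && \text{(forward and backward states concatenated)}
\end{aligned}
```

The GRU has two gates and no separate cell state, which is why its internal structure is simpler than that of an LSTM.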
In recent years, attention mechanisms have become standard in computer vision, speech, and NLP, as they can efficiently capture long-range dependencies and focus on important context (Vaswani et al., 2017, December 4–9). The core idea of the attention mechanism in RE is to compute a relevance score for each unit (word or sentence) that indicates how much it contributes to the overall meaning.
Word-Level Attention (WL-Attention): Given a token sequence of length M, the Bi-GRU produces a sequence of contextual word-level features as:
The attention-weighted representation of the word sequence is then given by:
Sentence-Level Attention (SL-Attention): Before applying sentence-level attention, the Bi-GRU encodes each sentence into a sentence embedding
This two-stage attention (word-level and sentence-level) helps the model focus on the most informative features before making the final relation predictions.
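The pooling step shared by both attention levels can be sketched as follows, assuming a simplified scoring function: each hidden state is scored by its dot product with a context vector, the scores are normalised with a softmax, and the hidden states are averaged with those weights. The learned projection and tanh non-linearity of a full hierarchical attention network are omitted, and the names `attention_pool` and `context` are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: Sequence[Sequence[float]],
    context: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted sum of hidden states)."""
    scores = [sum(h_d * c_d for h_d, c_d in zip(h, context)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return weights, pooled


# Word level: pool hypothetical word states into one vector per sentence.
sentences = [
    [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]],
    [[0.5, 0.5], [0.0, 2.0]],
]
word_context = [1.0, 0.0]      # hypothetical learned word-level context vector
sentence_vectors = [attention_pool(s, word_context)[1] for s in sentences]

# Sentence level: pool the sentence vectors into one document vector.
sentence_context = [0.0, 1.0]  # hypothetical sentence-level context vector
weights, document_vector = attention_pool(sentence_vectors, sentence_context)
print(weights, document_vector)
```

In the full model a sentence-level Bi-GRU sits between the two pooling stages. The sketch skips it to keep the two-stage weighting itself visible.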
To predict the relation label for each entity pair, the CRF layer is placed on top of the Bi-GRU and attention outputs. Unlike the standard sequence tagging with BIO labels, our CRF acts as a structured classifier at the sentence level. It receives the attention-enhanced sentence representation
Here,
During decoding, the Viterbi algorithm is applied to find the label sequence with the highest score. The resulting relation prediction respects the structural constraints learned by the CRF, which makes it more robust than an unconstrained softmax classifier.
Consider the sentence: “Aspirin reduces inflammation in patients.” with the entity pair (Aspirin, inflammation).
The Bi-GRU encodes the sentence, and the attention layers highlight the verb “reduces” and its relation to both entities. The resulting feature vector is then passed to the CRF layer, which produces the sample unary scores presented in Table 1:
Unary Scores for Relation Labels.
After adding the transition constraints and applying Viterbi decoding, CAUSE becomes the highest-scoring and most consistent label. Therefore, the model outputs:
This study assumes that not every entity pair in a sentence of clinical text has a meaningful relation. To avoid forcing the model to assign a relation to every pair, we explicitly include NO-RELATION as a separate label in the CRF output space. During training, all candidate entity pairs in each sentence are enumerated. Pairs that are annotated with a relation receive their gold labels (e.g., CAUSE, TREAT), while the remaining pairs are labeled NO-RELATION. These negative examples are later used to compute both the unary potentials and the CRF transition scores, which helps the model learn to predict NO-RELATION when no relation is expressed in the text.
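The labeling of candidate pairs can be illustrated with a short sketch. It assumes that pairs are unordered and enumerated in order of appearance, which the text does not specify. The function name `label_candidate_pairs` and the gold annotation below are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

NO_RELATION = "NO-RELATION"


def label_candidate_pairs(
    entities: List[str],
    gold: Dict[Tuple[str, str], str],
) -> List[Tuple[str, str, str]]:
    """Enumerate all entity pairs in a sentence and attach a label to each.

    Pairs found in `gold` (in either order) keep their annotated relation;
    every other pair becomes a NO-RELATION negative example.
    """
    examples = []
    for head, tail in combinations(entities, 2):
        label = gold.get((head, tail)) or gold.get((tail, head)) or NO_RELATION
        examples.append((head, tail, label))
    return examples


# Entities from the sentence "Aspirin reduces inflammation in patients."
entities = ["Aspirin", "inflammation", "patients"]
gold = {("Aspirin", "inflammation"): "CAUSE"}  # hypothetical annotation
for example in label_candidate_pairs(entities, gold):
    print(example)
```

Because the number of pairs grows quadratically with the number of entities in a sentence, negative examples typically outnumber positive ones by a wide margin, which is one reason the NO-RELATION class is handled separately at evaluation time.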
During evaluation, NO-RELATION predictions are treated as negative cases. Following standard biomedical RE evaluation protocols, precision, recall, and F1-score are calculated only for the positive relations, excluding the NO-RELATION class.
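A minimal sketch of this scoring protocol is given below. It assumes micro-averaging over all positive relation types, which is one common choice and is not stated explicitly in the text. The function name `micro_prf` and the example labels are illustrative.

```python
from typing import List, Tuple

NO_RELATION = "NO-RELATION"


def micro_prf(
    gold: List[str],
    pred: List[str],
    negative: str = NO_RELATION,
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over positive relations only."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # A true positive needs a correct label that is not the negative class.
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    pred_pos = sum(1 for p in pred if p != negative)
    gold_pos = sum(1 for g in gold if g != negative)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["CAUSE", "TREAT", NO_RELATION, NO_RELATION, "TREAT"]
pred = ["CAUSE", NO_RELATION, NO_RELATION, "TREAT", "CAUSE"]
print(micro_prf(gold, pred))
```

Correctly predicted NO-RELATION pairs do not raise any of the three scores. A NO-RELATION pair that is wrongly given a positive label lowers precision, and a positive pair predicted as NO-RELATION lowers recall.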
Running Example
Using the same sentence, “Aspirin reduces inflammation in patients.”, the entity pair (Aspirin, inflammation) is considered again. The attention mechanism focuses on the word “reduces” and its connection to both entities. The CRF might output the unary scores shown in Table 2:
Unary Scores Including NO-RELATION.
After considering the transition constraints, the Viterbi decoder still selects CAUSE as the best label, which results in:
This example also shows that the proposed model can distinguish between real relations and the NO-RELATION case in clinical text when appropriate.
Given the model architecture described in the previous sections (RoBERTa encoder, hierarchical Bi-GRU layers, and self-attention modules), there is a clear risk of overfitting during training. Overfitting happens when the model starts to learn noise or very specific patterns from the training data instead of general patterns. In that case, the model shows high accuracy on the training set, but its performance is poor on the validation and/or test sets. This problem is significant in our study because we use a limited amount of annotated biomedical text and the clinical language is complex.
In general, overfitting appears mainly for three reasons:
Ideally, increasing the size of the annotated dataset would help reduce overfitting. However, in the biomedical domain this is difficult, because clinical abstracts are limited and manual annotation by domain experts is very costly and time-consuming. Therefore, instead of relying only on more data, we use regularization techniques during training to control overfitting.
This study adapts and applies the dropout regularization method proposed by Hinton et al. (Luo et al., 2018). Dropout improves generalization by randomly turning off (dropping) a subset of neurons during each training step. This prevents strong co-adaptation between units and stops the model from depending too much on a small set of features. In practice, typical dropout rates are between 0.2 and 0.5, and we chose values in this range for our experiments. As shown in Figure 1, dropout is applied at the word-level encoding layers and the sentence-level encoding layers. In this way, regularization is enforced at multiple stages of the hierarchical representation learning process.
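The mechanism can be sketched in pure Python using the common "inverted dropout" convention, in which kept activations are rescaled during training so that no rescaling is needed at inference time. This is a generic illustration rather than the customized variant used in the model. The function name and the rate of 0.5 are assumptions.

```python
import random
from typing import List


def dropout(
    values: List[float],
    rate: float,
    rng: random.Random,
    training: bool = True,
) -> List[float]:
    """Inverted dropout: zero each value with probability `rate` while training.

    Kept values are divided by (1 - rate) so the expected activation is
    unchanged; at inference time the input is returned untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
activations = [1.0] * 8            # hypothetical Bi-GRU outputs
print(dropout(activations, rate=0.5, rng=rng))                  # training mode
print(dropout(activations, rate=0.5, rng=rng, training=False))  # inference mode
```

Because a different random subset of units is dropped at every training step, no single unit can be relied on consistently, which is the co-adaptation effect described above.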
By integrating dropout into the proposed E-RE framework, our model learns more robust features, reduces the negative effect of limited training data, and achieves better relation extraction performance on unseen biomedical text.
Experiments
Dataset
In clinical text processing, collecting high-quality medical data for research is still difficult because of strict privacy rules such as HIPAA. For this reason, few medical records and medical text datasets are publicly available. In this work, we use the medical transcription dataset from mtsamples.com, which contains transcriptions from many different medical specialties. The dataset covers around 40 categories and includes 140,214 transcribed phrases with about 50,000 unique words. Figures 3 and 4 show the distribution of samples across medical specialties and the ranking of missing values in descending order.

Word count of each medical specialty.

Sentence length distribution.
In total, the dataset has 2,311,419 words and a vocabulary size of 28,581. The maximum sentence length is 3,114 words, while the median and mean sentence lengths are 421.5 and 487.5 words, respectively. These numbers show that the average transcription is quite long, with both the mean and the median length above 400 words. This reflects the narrative and detailed nature of clinical documentation.
To support reliable training and evaluation, we split the dataset into training, validation, and test sets using a 70:15:15 ratio. That is, 70% of the samples are used for training, 15% for validation, and 15% for final testing. Because the number of samples differs greatly across medical categories, we use stratified sampling, which ensures that the proportion of each medical specialty is similar in the training, validation, and test sets. Stratification also helps avoid severe class imbalance within individual splits and reduces the risk of biased learning. This is important for relation extraction tasks, where the context and style can vary considerably between specialties.
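A stratified 70:15:15 split can be sketched with the standard library alone. The function name, the rounding rule for small classes, and the fixed seed are assumptions made for illustration; the specialty names in the example are placeholders.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[str],
    ratios: Tuple[float, float, float] = (0.70, 0.15, 0.15),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices so each label keeps roughly the same proportions."""
    by_label: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for label in sorted(by_label):          # sorted for reproducibility
        indices = by_label[label]
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])   # remainder goes to test
    return train, val, test


# Hypothetical specialty labels with very different sample counts.
labels = ["Surgery"] * 100 + ["Allergy / Immunology"] * 20
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting within each specialty, rather than over the pooled data, means that a small specialty cannot end up entirely in one split by chance.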
Annotation Process
The annotation process is a key step in preparing the MTSamples clinical transcription dataset for relation extraction. Raw medical transcriptions are unstructured, so a systematic procedure is applied to identify and label the entities and their relationships.
First, we identify clinical entities such as symptoms, diagnoses, procedures, medications, anatomical locations, and laboratory or diagnostic findings. The annotation guidelines are derived from the UMLS semantic types and the existing clinical NLP literature. These guidelines help ensure that the annotators follow consistent rules when deciding what counts as an entity and how to classify it. Next, we examine the sentences that contain pairs of entities to check whether there is a meaningful clinical relationship between them. Annotators assign relation labels that represent clinically relevant interactions, such as treats, causes, indicates, associated-with, or part-of. To reduce ambiguity in this process, we designed a detailed annotation protocol that contains examples, instructions for handling borderline cases, and clear rules for multi-word entity boundaries.
Annotations are done manually by people with basic domain knowledge of clinical terminology. Each transcript is reviewed independently by at least two annotators. When the annotators disagree, they discuss the case, and if needed, a senior reviewer makes the final decision. Inter-annotator agreement scores are also computed to check the quality and consistency of the annotations. Finally, the labeled data are converted into a structured format suitable for model training. This step generates sentence-level and document-level records that include entity spans, relation labels, and the surrounding context.
Categories that have very few transcription records are excluded from the annotation process; only categories with at least 50 samples are kept. Figure 5 shows the number of labels in the dataset, and there are about 40 target labels in total.

Sample size of each medical specialty.
Table 3 gives a few examples of medical transcriptions with their corresponding specialties. Each transcription shown in Tables 3 and 4 belongs to a specific medical specialty and contains a rich set of medical terms and abbreviations.
Sample Instances of the Medical Records with Their Medical Specialty.
Data Inspection Sample.
Table 4 shows a portion of the original dataset after basic inspection.
In clinical text, the main goal of the relation extraction task is to identify the conceptual connection between each pair of medical entities mentioned in a sentence. When a meaningful connection exists between two entities, relation extraction becomes a classification problem: we must decide which relation label best describes the link.
Before extracting relations from text, it is important to clearly define which clinical entity relationship (ER) types are of interest and then to assign these ER types to the entity pairs in a sentence according to their meaning. In general, clinical health records contain rich and detailed medical information. Meystre et al. (Uzuner et al., 2010) suggest using ‘problem-oriented medical records’ to collect the data that support diagnosis, care planning, and other medical interventions for a patient. Uzuner et al. (2010) classified the semantic relations in clinical abstracts and defined relation types suitable for problem-centered records, taking into account the special characteristics of clinical text. Inspired by this literature, this study defines six types of clinical trial data relationships: Disease-Drug, Symptom-Drug, Intensity-Drug, Disease-Item, Disease-Method, and Disease-Intensity. In our implementation, we refine them into the categories shown in Table 5.
Clinical Entity Relation Types with Example Pairs.
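To make the classification framing concrete, the following minimal sketch enumerates the entity pairs in one sentence whose type combination matches one of the six relation types listed above. It is an illustration under stated assumptions, not the implementation used in this study: the `Entity` structure, the `candidate_pairs` helper, and the example entities are hypothetical, and the refined categories of Table 5 are not modeled.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple

# The six relation types named in the text, stored as (head type, tail type).
RELATION_TYPES = {
    ("Disease", "Drug"),
    ("Symptom", "Drug"),
    ("Intensity", "Drug"),
    ("Disease", "Item"),
    ("Disease", "Method"),
    ("Disease", "Intensity"),
}


class Entity(NamedTuple):
    """Hypothetical container for one annotated entity mention."""
    text: str
    etype: str  # entity type label, e.g. "Disease"


def candidate_pairs(entities: List[Entity]) -> List[Tuple[Entity, Entity, str]]:
    """Return (head, tail, relation name) for every pair with a permitted type combination."""
    candidates = []
    for a, b in combinations(entities, 2):
        # Try both orientations, since a relation type fixes which entity is the head.
        for head, tail in ((a, b), (b, a)):
            if (head.etype, tail.etype) in RELATION_TYPES:
                candidates.append((head, tail, f"{head.etype}-{tail.etype}"))
    return candidates


# Made-up example entities from a single sentence.
sentence_entities = [
    Entity("asthma", "Disease"),
    Entity("albuterol", "Drug"),
    Entity("severe", "Intensity"),
]
for head, tail, relation in candidate_pairs(sentence_entities):
    print(f"{relation}: {head.text} -> {tail.text}")
```

Each candidate produced this way would then be passed to the classifier, which decides whether the relation actually holds in the sentence.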
Text pre-processing comprises four standard NLP steps: (1) data cleaning / text normalization, (2) stop-word removal, (3) lemmatization, and (4) tokenization. We applied all four operations to each transcription to prepare the data for the main processing. Table 6 (not shown in full here) demonstrates the effect of the pre-processing pipeline on a sample input text (record no. 400, specialty “Allergy / Immunology”).
Effect of Pre-Processing Procedures on an Example Text Segment.
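The four pre-processing steps can be sketched with the Python standard library alone. This is a simplified illustration, not the pipeline used in this study: the stop-word list and the lemma dictionary are tiny hypothetical stand-ins for the resources a library such as NLTK or spaCy would provide, the example sentence is made up (it is not record no. 400), and whitespace tokenization is applied before stop-word removal because the word-level steps need tokens to operate on.

```python
import re
from typing import List

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "is", "was", "with", "in", "for", "she", "he"}
LEMMAS = {
    "complaining": "complain",
    "allergies": "allergy",
    "tried": "try",
    "medications": "medication",
}


def clean(text: str) -> str:
    """Lower-case the text, replace non-alphanumeric characters with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str) -> List[str]:
    """Apply cleaning, tokenization, stop-word removal, and dictionary-based lemmatization."""
    tokens = clean(text).split()                          # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [LEMMAS.get(t, t) for t in tokens]             # lemma lookup, unknown words unchanged


print(preprocess("She was complaining of allergies, and tried two medications."))
```

A dictionary lookup is only a placeholder for lemmatization; a real lemmatizer also uses part-of-speech information, which matters for clinical abbreviations and inflected drug names.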
After pre-processing, the samples are annotated using the BRAT tool, which exports annotation files in .ann format. Once labeling is complete, we align the .ann files with the corresponding raw text files. During this step, we (1) removed all illegal characters and blank lines, (2) split long texts into smaller segments using sliding windows and punctuation, (3) merged entity pairs into the sentence patterns that contain both entities, and (4) filtered out overly long sentences to reduce noise and improve training stability.
These steps produce the final input samples, in which each sentence (or segment) contains clear entity boundaries and the associated relation labels and is therefore suitable for training the proposed relation extraction model.
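The alignment step can be illustrated with a short sketch that reads BRAT standoff annotations (tab-separated `T` lines for entities and `R` lines for relations) and keeps, for each relation, the sentence that contains both entities. It rests on several assumptions: the function names, the `max_chars` threshold, and the example text are hypothetical, sentences are split on punctuation only, and the sliding-window splitting and illegal-character removal described above are omitted.

```python
import re
from typing import Dict, List, Tuple

EntityMap = Dict[str, Tuple[str, int, int, str]]  # id -> (type, start, end, surface text)
RelationList = List[Tuple[str, str, str]]         # (relation type, arg1 id, arg2 id)


def parse_ann(ann_text: str) -> Tuple[EntityMap, RelationList]:
    """Parse entity (T) and relation (R) lines of a BRAT .ann file; other lines are ignored."""
    entities: EntityMap = {}
    relations: RelationList = []
    for line in ann_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if line.startswith("T") and len(fields) >= 3:
            etype, offsets = fields[1].split(" ", 1)
            # Discontinuous spans are separated by ';'; keep the outermost boundaries.
            spans = [tuple(map(int, s.split())) for s in offsets.split(";")]
            entities[fields[0]] = (etype, spans[0][0], spans[-1][1], fields[2])
        elif line.startswith("R") and len(fields) >= 2:
            rtype, arg1, arg2 = fields[1].split()
            relations.append((rtype, arg1.split(":", 1)[1], arg2.split(":", 1)[1]))
    return entities, relations


def relation_samples(text: str, entities: EntityMap, relations: RelationList,
                     max_chars: int = 200) -> List[Tuple[str, str, str, str]]:
    """Return (sentence, entity 1, entity 2, relation) for relations that fit in one sentence."""
    # Sentence boundaries: '.', '!', '?' or ';' followed by whitespace.
    bounds = [0] + [m.end() for m in re.finditer(r"[.!?;]\s+", text)] + [len(text)]
    spans = list(zip(bounds[:-1], bounds[1:]))
    samples = []
    for rtype, a1, a2 in relations:
        _, s1, e1, t1 = entities[a1]
        _, s2, e2, t2 = entities[a2]
        for start, end in spans:
            both_inside = start <= min(s1, s2) and max(e1, e2) <= end
            if both_inside and end - start <= max_chars:  # drop overly long sentences
                samples.append((text[start:end].strip(), t1, t2, rtype))
    return samples


# Made-up example text and annotations.
TEXT = "Patient has asthma. Albuterol was prescribed for asthma; follow up in two weeks."
ANN = "T1\tDrug 20 29\tAlbuterol\nT2\tDisease 49 55\tasthma\nR1\tDisease-Drug Arg1:T2 Arg2:T1\n"

ents, rels = parse_ann(ANN)
print(relation_samples(TEXT, ents, rels))
```

Relations whose two entities fall in different sentences are silently dropped by this sketch, which is one reason a sliding-window split may be preferable for long clinical narratives.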
To evaluate the relation extraction (RE) model, we use common metrics from the literature (Ayush et al., 2016): precision, recall, and F1-score.
In our settings:
The formulas used are summarized in Table 7.
Evaluation Metrics.
We report F1, precision, and recall for the relation extraction model on the test set.
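For reference, the three reported metrics can be computed from true positives (TP), false positives (FP), and false negatives (FN) using their standard definitions, as in the sketch below. It is offered as an assumption about the usual formulation rather than a reproduction of Table 7: the function name and the example triples are hypothetical, and the averaging scheme over relation types (micro or macro) used in this study is not stated here.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(gold: Iterable[Hashable],
                        predicted: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over sets of predicted and gold relation instances."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)   # predicted relations that are in the gold standard
    fp = len(pred_set - gold_set)   # predicted relations that are not in the gold standard
    fn = len(gold_set - pred_set)   # gold relations the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up relation instances: (entity 1, entity 2, relation label).
gold = [
    ("asthma", "albuterol", "Disease-Drug"),
    ("cough", "codeine", "Symptom-Drug"),
    ("severe", "albuterol", "Intensity-Drug"),
    ("asthma", "spirometry", "Disease-Method"),
    ("asthma", "severe", "Disease-Intensity"),
]
predicted = gold[:3] + [("cough", "albuterol", "Symptom-Drug")]

p, r, f1 = precision_recall_f1(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

In this example three of the four predictions are correct and three of the five gold relations are found, so precision is 0.75 and recall is 0.6.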
All experiments are conducted on a machine running Windows 10, with Python 3.7.0 and PyTorch 1.8.0. The hardware includes an NVIDIA GeForce RTX 3080 Ti (16 GB) GPU. Under this configuration, the model trains and runs smoothly (Yang et al., 2016).
Table 8 lists the main hyperparameters used to train the proposed model.
Hyperparameter Settings.
To evaluate the effectiveness of the proposed clinical relation extraction (RE) model, we compared it with four baseline methods under the same experimental conditions. The results are summarized in Table 9 and presented in Figure 6. The following baseline models are included in the comparison:

Comparative visualization of the performance of each model.
Results of Comparison Between Each Experiment.
#P - Precision, #R - Recall, #F1 - F1-score.
From Table 9 and Figure 6, we can see that the proposed RoBERTa-BiGRU-MAtt-CRF model achieves the best performance across all three evaluation metrics (P, R, and F1). The detailed observations are as follows:
These results show that integrating a strong pre-trained language model (RoBERTa), BiGRU encoding, multi-head attention, and a CRF layer leads to a more powerful and robust clinical relation extraction model.
The proposed model outperforms the baselines on all three evaluation metrics: precision, recall, and F1-score. This suggests that our clinical E-RE approach is not only effective on the MTSamples dataset but may also be useful for a broader range of practical applications.
In general, a reliable RE model can help to automatically extract the relations between entities from many kinds of text, such as social media posts, news articles, clinical narratives, and biomedical research papers. With improved precision and recall, the model can support downstream tasks such as large-scale text analysis, topic modeling, and those parts of sentiment or opinion analysis in which the entity-relation structure is important. In this way, our study provides a useful tool for researchers and practitioners in different fields who need accurate and efficient relation extraction.
At the same time, the model has some clear limitations. In its current form, our neural network-based RE method focuses mainly on relations between two entities at a time. In real clinical text, however, one sentence often contains multiple entities and several relations that may interact with each other, and the present model does not explicitly handle such higher-order or multi-entity relations. In future work, it will be important to extend the framework so that it can capture the relationships among multiple entities within the same sentence or document, thereby making the model more complete and more useful in real-world clinical settings.
Conclusions
In this study, we proposed a deep learning-based model for clinical relation extraction that combines a RoBERTa encoder, a BiGRU network, a multi-head attention mechanism, and a CRF layer for the final E-RE prediction. By replacing the traditional BiLSTM with a BiGRU, the model efficiently captures contextual information from both past and future tokens. The integration of word-level and sentence-level attention mechanisms allows the model to focus on the important parts of the text and to learn richer feature representations, which in turn improves performance on the relation classification task. Experimental results show that the proposed RoBERTa-BiGRU-MAtt-CRF model achieves higher precision, recall, and F1-score than several strong baselines, supporting its effectiveness for extracting clinical entity relations from medical transcriptions. However, some limitations remain. At present, the model does not handle more complex linguistic patterns such as multiple overlapping relations in the same sentence or nested entity structures. In addition, our work assumes that gold-standard entity annotations are already available. For real-life clinical applications, an end-to-end system that combines named entity recognition (NER) and relation extraction will be necessary. In future work, we plan to (1) extend the model to learn NER and relation extraction jointly in a single framework, (2) explore knowledge-graph-enriched representations (e.g., integrating UMLS or other medical ontologies) to further improve accuracy, and (3) evaluate the model on additional clinical and biomedical datasets to test its robustness and generalizability across different types of clinical narratives.
Footnotes
Consent for Publication
All authors agree with the content of this manuscript and give their explicit consent to submit it for publication.
Author Contributions
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Competing Interests
The authors declare that they have no competing interests.
Code Availability
The code used in this study is available from the authors upon reasonable request.
