Abstract
Background
Magnetic resonance enterography (MRE) can visualize Crohn's disease (CD) and its complications.
Purpose
To determine the inter-observer agreement for the detection of ileocolonic CD.
Material and Methods
This post-hoc analysis included MRE scans from 48 patients selected from a prospective, blinded multicenter study of patients with suspected CD. Based on ileocolonoscopy in the main study, CD was diagnosed in 39 (81%) patients, with colonic involvement in 36 (75%). Two senior radiologists and two junior doctors undergoing specialist training, all blinded to clinical data, assessed the image quality, CD presence, and disease severity.
Results
The inter-observer agreement for CD detection varied by location: terminal ileum (κ = 0.77, 95% confidence interval [CI] = 0.63–0.89), colon (κ = 0.59, 95% CI = 0.42–0.74), and ileocolon (κ = 0.65, 95% CI = 0.49–0.80). Agreement was higher among senior radiologists than juniors: terminal ileum (κ = 0.83 vs. 0.73; P = 0.60) and colon (κ = 0.73 vs. 0.41; P = 0.11), though differences were not statistically significant. The inter-observer agreement for disease severity was poor to moderate with intraclass correlation coefficients (ICCs) of 0.51 (95% CI = 0.35–0.67) for the MaRIA and 0.46 (95% CI = 0.29–0.62) for the simplified MaRIA. Senior radiologists showed higher consistency, with moderate to good agreement: ICC of 0.69 for the MaRIA, 0.70 for the simplified MaRIA, and 0.80 for bowel wall thickness.
Conclusion
In early CD, MRE demonstrated moderate to substantial inter-observer agreement for ileocolonic evaluation. Limitations in colonic assessment likely reflect early disease detection rather than observer variability, highlighting the need for a complementary assessment.
Introduction
Crohn's disease (CD) is an idiopathic chronic inflammatory bowel disease characterized by transmural inflammation and a segmental distribution (1). The incidence of CD is increasing, and the future demand for diagnostic evaluation and monitoring is expected to rise (2). Magnetic resonance enterography (MRE) demonstrates a high diagnostic accuracy for detecting CD (3), and the Magnetic Resonance Index of Activity (MaRIA) was developed to assess the activity of ileocolonic CD (4,5). However, in clinical practice, MRE is primarily used as a small bowel examination that supplements ileocolonoscopy (6).
Detection and evaluation of CD rely on a range of findings, such as changes in the mucosa, intestinal wall, and surrounding tissue, and penetrating complications. Many of these indicators are subtle, particularly in the early stages of CD, potentially leading to significant interpretative variability among radiologists. Studies have reported suboptimal results in the evaluation of early colonic CD (3,7), and it is not clear whether this is caused by inadequate diagnostic accuracy or by variability among observers. The inter-observer variability ranges from fair to substantial in different studies (8). However, data regarding inter-observer variability are scarce, especially in colonic CD and in patients with newly diagnosed or suspected CD. This issue is of clinical relevance, as diagnosis, treatment decisions, and a subsequent diagnostic strategy for monitoring CD depend heavily on the validity of MRE findings. Reliable results are critical if transmural healing is to become a future treatment goal, as suggested in the STRIDE-2 recommendations (9).
The aim of the present study was to assess the inter-observer variability for the detection of suspected ileocolonic CD and the evaluation of disease severity.
Material and Methods
MRE scans were selected from a prospective, blinded, multicenter study of the diagnostic accuracy of MRE and panenteric capsule endoscopy in patients with suspected CD (7). The local Ethics Committee of Southern Denmark (S-20150189) and the Danish Data Protection Agency (16/10457) approved the study. All patients gave informed consent before participation. Before the inclusion of adolescents aged 15–17 years, both parents gave informed consent.
MRE procedure
All participants were centralized to a single imaging center. MRE was performed after overnight fasting with a 1.5-T Intera MRI unit (Philips, Eindhoven, Netherlands) with a Syn-body coil. Patients ingested 1 L of Mannitol 7.5% solution 1.5 h before the examination. Hyoscine butylbromide 20 mg was administered intravenously to reduce artifacts from bowel peristalsis. Post-contrast evaluation was facilitated by intravenous infusion of 15 mL gadoterate meglumine (0.5 mmol/mL) (Dotarem; Guerbet, Raleigh, NC, USA). Imaging was conducted using coronal/axial T2, balanced fast field echo (B-FFE), T1, spectral presaturation with inversion recovery (SPIR), axial T1-weighted, and diffusion-weighted sequences.
Selection of MRE
In total, 48 patients were selected based on findings at ileocolonoscopy to facilitate a mix of different disease localizations. JBB selected the cases and was not involved in the subsequent MRE analysis. The study included all patients with colonic CD from the main trial: 19 patients with CD exclusively in the colon, 17 with both colonic and terminal ileum involvement, three with CD isolated to the terminal ileum, and nine without CD. In the 36 patients with colonic disease, ileocolonoscopy revealed CD activity in 89 colonic segments: right colon (n = 27), transverse colon (n = 20), left colon (n = 24), and rectum (n = 18). We chose a high prevalence of CD in the colon due to an anticipated low sensitivity for this localization. Observers were not informed about the selection criteria or the distribution of CD in the cohort. They only learned that CD was clinically suspected.
Observers
The study involved four observers affiliated with a secondary center specializing in inflammatory bowel disease diagnosis and management. These included two senior consultants in abdominal radiology, each with more than 15 years of experience, and two junior doctors undergoing specialist training, each with 2 years of clinical MRI experience.
Image evaluation and data registration
All MRE scans were anonymized and reviewed using a standard Picture Archiving and Communication System (PACS; Philips, Amsterdam, the Netherlands). Observers were blinded to the patients' histories, the results of previous ileocolonoscopies, capsule endoscopies, and the initial MRE interpretation, as well as the evaluations of other observers. Observers analyzed the MREs in random order. The distension of the small and large bowel was assessed using a 3-point scale: poor (<50% of the intestine sufficiently dilated), sufficient (50%–75% of the intestine sufficiently dilated), and good (>75% of the intestine sufficiently dilated).
The image quality was assessed using a 3-point scale: good (diagnostic images without artifacts, score 3), sufficient (diagnostic images with artifacts, score 2), and poor (non-diagnostic images, score 1) (10). Imaging findings consistent with CD were evaluated, including mucosal ulcerations, bowel wall thickening (≥3 mm), bowel wall hyper-enhancement, diffusion restriction, bowel stenosis, creeping fat, dilated vasa recta, and the presence of an abscess or fistula adjacent to a diseased bowel segment (11). The observers classified each patient as having imaging findings suggestive of CD, based on an overall assessment of characteristic features (e.g. see Fig. 1). The observers evaluated the ileocolonic area – the terminal ileum (distal 20 cm), ascending colon, transverse colon, descending colon, sigmoid colon, and rectum. Disease activity was determined using the MaRIA (12) and the simplified MaRIA (13). For the global MaRIA, normal segments were assigned a value of zero to reduce the workload, thereby creating a modified MaRIA (mMaRIA).
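To make the scoring concrete, the following Python sketch computes a per-segment MaRIA, a per-segment simplified MaRIA, and a global score in which segments judged normal are set to zero, as described above. The coefficients are those commonly cited for the published indices (12,13); they are not restated in this article and should be read as an assumption. The function names and the example segment are hypothetical.

```python
def maria_segment(wall_mm, rce_percent, edema, ulcers):
    """Per-segment MaRIA.

    Assumed coefficients (not given in this article):
    1.5 x wall thickness (mm) + 0.02 x relative contrast enhancement (%)
    + 5 x edema + 10 x ulceration.
    """
    return 1.5 * wall_mm + 0.02 * rce_percent + 5 * int(edema) + 10 * int(ulcers)


def simplified_maria_segment(wall_mm, edema, fat_stranding, ulcers):
    """Per-segment simplified MaRIA.

    Assumed scoring (not given in this article):
    1 x (wall thickness > 3 mm) + 1 x edema + 1 x fat stranding + 2 x ulcers.
    """
    return int(wall_mm > 3) + int(edema) + int(fat_stranding) + 2 * int(ulcers)


def global_modified_score(segment_scores, judged_normal):
    """Sum the segment scores; segments judged normal contribute zero."""
    return sum(0.0 if normal else score
               for score, normal in zip(segment_scores, judged_normal))


# Hypothetical segment: 6 mm wall, 100% relative enhancement, edema present.
without_ulcer = maria_segment(6, 100, edema=True, ulcers=False)
with_ulcer = maria_segment(6, 100, edema=True, ulcers=True)
print(without_ulcer, with_ulcer)
```

Under these assumed weights, a disagreement about ulceration alone shifts the segment score by 10 points in the MaRIA and by 2 points in the simplified MaRIA, which is one way reader-dependent findings can propagate into the severity scores.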

Fig. 1. (a) A coronal T1W MRI scan after intravenous gadolinium administration, and (b) an axial T1W image also after intravenous gadolinium. The white arrows indicate changes consistent with Crohn's disease in the ascending colon. There is marked bowel wall thickening and contrast enhancement corresponding to the affected bowel segment. MRI, magnetic resonance imaging; T1W, T1-weighted.
Data management and collection
Study data were collected and managed using REDCap electronic data capture tools hosted at OPEN – Region of Southern Denmark (14). All authors had access to the study data, and they reviewed and approved the final manuscript.
Statistical analyses
The sample size was calculated, assuming a statistical power of 80% and choosing 95% confidence intervals (CIs), a minimum acceptable kappa of 0.30, and an expected kappa of 0.60. A sample size of 48 patients was sufficient to determine a significant agreement between four observers (15). Demographic data were analyzed using descriptive statistics, and a P value <0.05 was considered significant.
For binary data, Cohen's kappa was used to assess the inter-observer agreement between pairs of observers, and a bootstrapped 95% CI was calculated. For the detection of CD and specific findings with MRE, a kappa between four observers was calculated using Fleiss’ multi-rater kappa statistics. Kappa values were interpreted as follows: absence of agreement = 0, slight agreement ≤0.20, fair agreement = 0.21–0.40, moderate agreement = 0.41–0.60, substantial agreement = 0.61–0.80, and almost perfect agreement ≥0.81, as proposed by Landis and Koch (16).
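As an illustration of these agreement statistics, the sketch below implements Cohen's kappa for two raters, Fleiss' kappa for several raters, a percentile bootstrap CI obtained by resampling patients, and the interpretation bands listed above. It is a minimal standard-library Python sketch, not the Stata code used in the study; the bootstrap variant (percentile, 1000 resamples), the function names, and the toy ratings are assumptions.

```python
import random
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two raters who labelled the same cases."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    if p_exp == 1.0:
        return float("nan")  # undefined when both raters use one category only
    return (p_obs - p_exp) / (1 - p_exp)


def fleiss_kappa(rows):
    """Fleiss' kappa; rows[i] holds the labels given to case i by every rater."""
    n_cases, n_raters = len(rows), len(rows[0])
    totals = Counter()
    agreement_sum = 0.0
    for row in rows:
        counts = Counter(row)
        totals.update(counts)
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_obs = agreement_sum / n_cases
    p_exp = sum((t / (n_cases * n_raters)) ** 2 for t in totals.values())
    if p_exp == 1.0:
        return float("nan")
    return (p_obs - p_exp) / (1 - p_exp)


def bootstrap_ci(rows, stat, n_boot=1000, seed=0):
    """Percentile bootstrap 95% CI, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    values = []
    for _ in range(n_boot):
        value = stat([rows[rng.randrange(n)] for _ in range(n)])
        if value == value:  # drop undefined (NaN) resamples
            values.append(value)
    values.sort()
    return values[int(0.025 * len(values))], values[int(0.975 * len(values)) - 1]


def landis_koch(kappa):
    """Interpretation bands as listed in the text."""
    if kappa <= 0:
        return "no agreement"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data (hypothetical): 8 patients, 4 raters, 1 = CD present, 0 = absent.
toy = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 0],
       [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]]
k = fleiss_kappa(toy)
print(round(k, 2), landis_koch(k), bootstrap_ci(toy, fleiss_kappa))
```

A pairwise Cohen's kappa can be bootstrapped with the same helper by passing a function that extracts two rater columns from the resampled rows before calling `cohen_kappa`.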
For continuous variables (e.g. bowel wall thickness, disease severity), agreement was quantified using the intraclass correlation coefficient (ICC), applying a two-way random-effects model with absolute agreement for single-rater measures. ICC values were interpreted as follows: poor <0.50, moderate = 0.50–0.75, good = 0.75–0.90, and excellent reliability >0.90 (17). Statistical analyses were performed using Stata (Stata Statistical Software release 18, StataCorp LLC, College Station, TX, USA).
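The two-way random-effects ICC for absolute agreement can be derived from the mean squares of a subjects-by-raters table. The sketch below is a minimal standard-library Python illustration of that calculation for single-rater and average measures; it is not the Stata routine used in the study, it assumes a complete table without missing values, and the function name and the small ratings table are illustrative only.

```python
def icc_two_way_random(scores):
    """Two-way random-effects ICC, absolute agreement.

    scores[i][j] is the measurement of subject i by rater j (no missing data).
    Returns (single-rater ICC, average-measures ICC).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n)
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Illustrative table: 6 subjects rated by 4 raters (not study data).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc_two_way_random(ratings)
print(round(single, 2), round(average, 2))
```

Because systematic differences between raters enter the denominator, this absolute-agreement form penalizes a reader who consistently measures thicker or thinner walls, which a consistency-type ICC would not.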
Role of funding sources
The investigators initiated, planned, and undertook the study without funding from pharmaceutical companies or imaging equipment manufacturers.
Results
All four observers evaluated the 48 MRE examinations. Table 1 shows patient characteristics and disease distribution as determined by ileocolonoscopy. The prevalence of CD in this selected cohort was 81% (39/48). Based on the MRE findings, observers 1–4 classified 31, 32, 23, and 29 of 48 patients, respectively, as having CD.
Characteristics of 48 patients examined in the study.
Values are given as n (%) or median (range), IQR.
*Only for the patients with CD (n = 39).
BMI, body mass index; CD, Crohn’s disease.
The observers agreed on the adequacy of image quality for the diagnostic evaluation except in one case, where a single radiologist deemed the quality insufficient for diagnostic purposes. In the assessment of bowel distension, all observers rated small bowel and colonic distension as adequate in at least 90% and 95% of cases, respectively.
Inter-observer agreement for disease detection
The overall inter-observer agreement for detecting CD (presence or absence) varied by anatomical location. Among the four observers, agreement ranged from moderate to substantial, with kappa values of 0.77 (95% CI = 0.63–0.89) for the terminal ileum, 0.59 (95% CI = 0.42–0.74) for the colon, and 0.65 (95% CI = 0.49–0.80) for the ileocolon. Inter-observer agreement for the terminal ileum was numerically higher among senior radiologists (κ = 0.83, 95% CI = 0.65–0.96) than among junior doctors (κ = 0.73, 95% CI = 0.49–0.91), although this difference was not statistically significant (P = 0.60). Similarly, for the colon, senior radiologists demonstrated numerically higher agreement (κ = 0.73, 95% CI = 0.49–0.91) than junior doctors (κ = 0.41, 95% CI = 0.12–0.69), again without a statistically significant difference (P = 0.11). Full details are provided in Table 2.
Inter-observer agreement for detection of CD among pairs of observers.*
Values in parentheses are 95% CI.
*Observers 1 and 2 are senior consultants, and observers 3 and 4 are in specialist training.
CD, Crohn’s disease; CI, confidence interval.
Inter-observer agreement for disease severity
The ICC between four observers was 0.51 (95% CI = 0.35–0.67) for the mMaRIA and 0.46 (95% CI = 0.29–0.62) for the simplified MaRIA, indicating moderate and poor inter-observer agreement, respectively. Comparing only the two senior radiologists yielded a higher agreement, with an ICC of 0.69 (95% CI = 0.51–0.82) for the mMaRIA and 0.70 (95% CI = 0.53–0.82) for the simplified MaRIA. Detecting ulcerations proved challenging, with notable variation in detection rates across observers: observers 1–4 identified 20, 4, 9, and 0 segments with ulcerations, respectively. Table 3 summarizes the per-patient distribution of MRI findings in CD as assessed by the four observers.
Frequency of MRE findings in patients with CD, reported by each observer.
Values are given as n.
CD, Crohn’s disease; MRI, magnetic resonance imaging.
In 28 patients (one case excluded due to a missing measurement) in whom both senior radiologists agreed on the presence of ileocolonic CD, they demonstrated good agreement on the maximal bowel wall thickness, with an ICC of 0.80 (95% CI = 0.61–0.90) (Table 4). In contrast, the two junior doctors agreed on the presence of CD in 21 patients and demonstrated poor agreement on the maximal bowel wall thickness, with an ICC of 0.49 (95% CI = 0.09–0.76). Although fewer cases showed concordance in diagnosing CD in the colon, the agreement on maximal bowel wall thickness in the colon was similar to the overall values, with an ICC of 0.79 (95% CI = 0.48–0.93) for senior radiologists and 0.42 (95% CI = −0.15 to 0.85) for junior doctors.
Inter-observer agreement for disease severity among pairs of observers.*
Values are given as ICC, applying a two-way random-effects model with absolute agreement for both single and average measures.
*Observers 1 and 2 are senior consultants, and observers 3 and 4 are in specialist training.
ICC, intraclass correlation coefficient.
Discussion
This inter-observer study included 48 MRE examinations from patients undergoing their first diagnostic evaluation for suspected CD. Four observers demonstrated a moderate to substantial agreement for detection of CD, and disease activity was assessed with similar agreement using the mMaRIA or the simplified MaRIA. Agreement was highest between the two senior observers, suggesting that experience may improve diagnostic consistency. However, the study was not designed to detect differences between pairs of observers.
MRE is an important tool for diagnosing and monitoring CD, especially in advanced disease. MaRIA and later the simplified MaRIA were developed for ileocolonic CD based on endoscopy. Both scores have a high diagnostic accuracy and a substantial inter-observer agreement in expert centers (18,19). Yet, ECCO-ESGAR endorses MRE primarily as a modality for evaluating the small bowel and stricturing or penetrating CD. Since MRE offers the possibility for a panenteric exam strategy (colon plus small bowel), it would be convenient to use MRE in this fashion, thereby reducing the number of endoscopies needed. Studies on MRE for colonic evaluation are scarce, however, and the sensitivity varies significantly (20). In the METRIC trial, MRE demonstrated a high diagnostic accuracy for both the small bowel and colon compared to a consensus reference standard. However, the ability to detect CD in the colon was significantly lower in newly diagnosed patients (sensitivity 47%) compared to patients with a longer disease duration. The same was observed in the trial from which the data for this study originated (patients with suspected CD) (7). This raises the question of whether MRE simply fails to detect CD (low sensitivity) or whether early CD is difficult to interpret and therefore a source of variability between observers. In a study by Bhatnagar et al. with data from the METRIC trial, the inter-observer agreement for the presence of newly diagnosed CD in the small bowel or colon was 68% (κ = 0.36) and 61% (κ = 0.21), respectively (8). In comparison, we found a higher agreement between four observers with a kappa of 0.77 for the terminal ileum and 0.59 for the colon. This discrepancy is most likely related to differences in study designs. Bhatnagar et al. incorporated a reference standard, allowing for a partial comparison with endoscopy (8), whereas the present study focused solely on inter-observer variability. Leaving out a reference standard, the level of agreement is similar to this study and previous studies by Jensen et al. (10) and Schleder et al. (21).
Equally important is the ability to grade disease severity, which directly impacts treatment decisions and enables monitoring of disease activity over time and response to therapy. In this study, the overall inter-observer agreement for the mMaRIA was moderate, which is lower than previously reported. Differences in disease location, duration, and severity may explain these differences. Furthermore, omitting rectal contrast in this study might have affected the interpretation of the distal part of the colon and rectum. Differences in observer experience might also be a factor. Finally, differences in the applied statistical method could have influenced the reported ICCs across studies. MaRIA relies on single-observer measurements in clinical practice, and the single-rater ICC is the most appropriate measure, even in the current experiment with multiple raters (17). However, using the average of four observers increases the ICC of the mMaRIA substantially (from 0.51 to 0.81) (Table 4). Hence, using multiple observers and averaged scores may improve assessment consistency and reliability, although this approach is not feasible in routine clinical settings.
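The increase from 0.51 to 0.81 is what the Spearman-Brown relation predicts when a single-rater reliability is converted to the reliability of the mean of four raters. The short Python sketch below reproduces that arithmetic; it assumes the averaged ICC is the average-measures counterpart of the same two-way random-effects model, for which the relation holds exactly. The function name is hypothetical.

```python
def average_measures_icc(single_icc, n_raters):
    """Spearman-Brown step-up from a single-rater ICC to the ICC of the mean of n raters."""
    return n_raters * single_icc / (1 + (n_raters - 1) * single_icc)


# Single-rater mMaRIA ICC reported in the text, averaged over four observers.
print(round(average_measures_icc(0.51, 4), 2))
```

The same formula shows why the gain is unavailable in routine practice: it depends on several readers independently scoring every examination.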
When focusing on the senior radiologists, agreement was notably higher, with an ICC of 0.69 for the mMaRIA and 0.70 for the simplified MaRIA, aligning with previous studies (18,22,23). The pronounced inter-observer variability in ulceration detection (20 segments vs. none across readers) underscores the subjectivity of this imaging criterion. Ulcerations may be subtle and difficult to distinguish from mucosal irregularities or mild edema on MRE, leading to variation in interpretation thresholds. This is in line with Puylaert et al., who previously reported that MRI evaluation of ulcers accounted for nearly all discordant findings between scoring systems that incorporated ulcer detection (MaRIA and Clermont score) and those that did not (CD MRI index [CDMI] and London score) (24). Such variability affects activity scores and has prompted calls to prioritize simpler imaging markers over complex multifactorial indices (25). We tested the maximum bowel wall thickness as a simple measure of disease severity. Senior radiologists had higher inter-observer agreement than junior doctors, suggesting that experience affects diagnostic consistency. We found no significant difference between the maximum bowel wall thicknesses measured overall or in the colon only. However, due to the wide CIs, this result should be interpreted with caution. Although activity scoring is important, this study did not demonstrate full agreement on strictures. Although disagreement on low-grade strictures may be expected, the lack of concordance regarding high-grade strictures, which are findings with substantial clinical implications, is particularly concerning.
The present study has some strengths and limitations. First, we included a clinically relevant group of patients with suspected CD determined by a strict inclusion criterion. MRE examinations were selected for this analysis to ensure a representative spectrum of disease severity. The METRIC trial and our main study demonstrated a low diagnostic performance for detecting colonic CD (3,7). To ensure robust conditions for assessing MRE performance in colonic disease, we enriched the sample with endoscopic cases of colonic CD, yielding a high endoscopic prevalence. Observers were not informed about this selection criterion. Importantly, however, the prevalence that affects Cohen's κ is not the endoscopic prevalence, but the prevalence of each category as determined by the observers. Because κ is sensitive to the observed category distribution, imbalance can lead to κ deflation even when raw agreement is high. In our study, raw agreement for colonic CD was high (77%–87%), with κ values of 0.40–0.73, whereas for terminal ileal disease, raw agreement was similarly high (83%–96%), with κ values of 0.66–0.91. The expected agreement in the κ-analysis was in the range of 54%–61% and 51%–54%, respectively, reflecting a slight imbalance in the dataset. Interestingly, the imbalance was not attributable to a high prevalence of colonic CD on endoscopy, but rather to MRE's inability to detect signs of colonic CD, as reflected in the observers’ assessments. Although our goal was to achieve an optimal prevalence of CD in the analysis, this approach may have introduced spectrum bias, thereby limiting the generalizability of our results to broader clinical populations. In the assessment of MaRIA, we chose to assign a value of zero to segments deemed normal by the observers, rather than measuring bowel wall thickness and relative contrast enhancement for all segments. This decision was made to reduce assessment time. The potential influence of additional measurements on the inter-observer agreement is not accounted for. The observers were selected from a secondary center managing patients with inflammatory bowel disease and included both experienced radiologists and those in the final stages of their specialist training. This mix of expertise likely reflects real-world clinical practice, where not all centers have world-class experts available at all times. However, as the study was specifically designed to assess inter-observer agreement among four observers, it lacks sufficient statistical power to evaluate the impact of different levels of observer experience.
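The dependence of κ on the observers' own category distribution, as discussed in this paragraph, can be made explicit with a few lines of Python. The sketch below computes the chance-expected agreement from each observer's rate of positive calls and then κ from raw and expected agreement. The counts for observers 1 and 2 (31 and 32 of 48 patients classified as having CD) come from the Results; the 85% raw agreement and the balanced and skewed rates are hypothetical values chosen only to show the effect.

```python
def expected_agreement(rate_a, rate_b):
    """Chance agreement for a binary call, given each observer's rate of positive calls."""
    return rate_a * rate_b + (1 - rate_a) * (1 - rate_b)


def kappa_from_agreement(p_observed, p_expected):
    """Cohen's kappa from raw (observed) and chance-expected agreement."""
    return (p_observed - p_expected) / (1 - p_expected)


# Observers 1 and 2 classified 31 and 32 of 48 patients as having CD.
p_exp_seniors = expected_agreement(31 / 48, 32 / 48)
print(round(p_exp_seniors, 3))

# Hypothetical: identical raw agreement of 85% under two category distributions.
balanced = kappa_from_agreement(0.85, expected_agreement(0.5, 0.5))
skewed = kappa_from_agreement(0.85, expected_agreement(0.9, 0.9))
print(round(balanced, 2), round(skewed, 2))
```

With the same raw agreement, κ falls sharply as the observers' calls concentrate in one category, which is why the observers' own positive rates, rather than the endoscopic prevalence, determine how far κ sits below raw agreement.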
MRE will remain one of the diagnostic pillars in CD diagnostics and monitoring, although several studies indicate that detecting very early colonic disease is challenging. Based on the levels of agreement observed in this study, inter-observer variability does not appear to be the primary factor contributing to a poor diagnostic performance in identifying colonic CD in very early cases. However, this study suggests that experience influences MRE reading, highlighting the need for structured training programs and standardized reporting protocols to enhance diagnostic consistency across various clinical settings. Research on artificial intelligence (AI) in this field is emerging with promising results (26,27). AI-assisted analysis has the potential to bridge the experience gap and further enhance the modality.
In conclusion, the inter-observer agreement ranged from moderate to substantial for the detection of early ileocolonic CD and from poor to moderate for the assessment of disease severity. These findings suggest that any limitations in the diagnostic performance of colonic assessment with MRE are likely attributable to the modality's limitations in detecting early-stage disease rather than observer variability.
Footnotes
Acknowledgments
The authors would like to thank the staff at the Department of Radiology at Lillebaelt Hospital, Vejle, for their invaluable support with logistics and imaging procedures.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by public grants from the Region of Southern Denmark (grant nos. 16/9780 and 19/37100) and the Research Council of Lillebaelt Hospital (grant no. May 2016). The Danish Colitis and Crohn's Association (grant no. 2017) also provided support to cover the participants’ travel expenses.
