Abstract
Background
The reproducibility of intravoxel incoherent motion (IVIM)-based radiomics studies in humans has not been reported.
Purpose
To determine the effect of inter- and intra-observer variability on the reproducibility of IVIM-based radiomics features in cervical cancer (CC).
Material and Methods
The IVIM images of 25 patients with CC were retrospectively collected. Guided by the high-resolution T2-weighted images, two radiologists independently delineated the regions of interest (ROIs) twice, one month apart, on diffusion-weighted images at a b value of 1000 s/mm², at the largest transverse cross-section of each tumor. Each ROI was subsequently applied to the apparent diffusion coefficient (ADC), true diffusion coefficient (D), pseudo-diffusion coefficient (D*), and perfusion fraction (f) maps derived from the IVIM images. In total, 105 radiomics features were extracted from the IVIM-derived maps. The inter- and intra-observer reproducibility of the IVIM-derived features was evaluated using the intraclass correlation coefficient.
Results
Inter- and intra-observer variability affected the reproducibility of radiomics features. D* map had 100% and 95% reproducible features, ADC map had 89% and 93%, D map had 97% and 86%, while f map had 54% and 62% reproducible features with good to excellent reliability in the intra-observer analysis. Similarly, D* map had 90% and 94%, ADC map had 85% and 70%, D map had 81% and 78%, while f map had 41% and 93% reproducible features with good to excellent reliability in the inter-observer analysis.
Conclusion
Inter- and intra-observer variability can affect radiomics analysis. Multicenter studies should therefore pay more attention to intra- and inter-observer variability.
Background
Cervical cancer (CC) is one of the most prevalent gynecologic malignancies worldwide (1). Magnetic resonance imaging (MRI) is commonly used for the staging and evaluation of CC (2). MRI-based apparent diffusion coefficient (ADC) measurements are helpful in tumor differentiation, monitoring of treatment response, postoperative review, and recurrence prediction (3–5). Although ADC values reflect the Brownian motion of water molecules in vivo, they are also susceptible to capillary microcirculation perfusion, which results in “pseudo-diffusion.” In light of this, Le Bihan et al. (6) proposed that intravoxel incoherent motion (IVIM) diffusion-weighted imaging (DWI) can be used to separate “true diffusion” from “pseudo-diffusion.” Clinical studies have applied IVIM DWI in the differential diagnosis and evaluation of various cancers such as breast, rectal, and prostate cancers (7–10). Additionally, other studies have shown that IVIM can effectively differentiate tissues and monitor early tumor response in CC given its low perfusion and diffusion characteristics (11,12).
Radiomics uses automated algorithms to transform medical images into high-dimensional, minable data. It allows for quantitative high-throughput extraction and comprehensive analysis of the biological features and heterogeneity of tumors (13,14). Quantitative radiomics analysis based on IVIM DWI has been used for preoperative staging and efficacy assessment in patients with CC (15). Thapa et al. (15) proposed that the 75th-percentile histogram feature extracted from ADC, D, D*, and f maps can distinguish between early-stage and locally advanced CC.
However, most studies focus on the relationship between radiomics features and clinical application rather than on the reproducibility of those features. To date, quantitative analysis of radiomics features derived from MRI has been a challenge because of multiparameter imaging and multicenter settings. Many studies have found that features extracted from MRI may be affected by factors such as repetition time (TR), echo time (TE), sampling bandwidth, field strength, and scanner manufacturer (16–18). These factors weaken the reproducibility of the radiomics features. Furthermore, little attention has been paid to the reproducibility of IVIM-based radiomics features.
In addition, tumor delineation variability can lead to uncertain reproducibility in radiomics features (19,20). Assessment of intra- and inter-observer variability is therefore needed in radiomics analysis. Because variation in acquisition parameters and segmentation is inevitable, it is essential to identify robust and stable radiomics features before clinical application. To our knowledge, the intra- and inter-observer reproducibility of IVIM-based radiomics studies in humans has not been reported. The aim of the present study was therefore to investigate the reproducibility of IVIM-based radiomics features in patients with CC in two settings: (i) the intra-observer setting (with an interval of one month between delineations); and (ii) the inter-observer setting.
Material and Methods
Patient characteristics
The Institutional Medical Research Ethics Committees approved this study and informed consent was obtained from all participants. Twenty-five female patients diagnosed with CC were retrospectively recruited for the study between December 2015 and February 2016. Routine pelvic MRI (including IVIM imaging) was performed before treatment. Patients whose tumors were not visible (n = 2) or whose IVIM images showed apparent artifacts (n = 3) were excluded. The remaining 20 patients (mean age = 54.4 ± 6.9 years at first diagnosis of CC) were included. Tumors were staged according to the recommendations of the International Federation of Gynecology and Obstetrics (FIGO) as IB (n = 4), IIA (n = 2), IIB (n = 3), IIIA (n = 0), and IIIB (n = 11).
Image acquisition
Pre-treatment assessment involved routine pelvic MRI protocols (including IVIM imaging) performed using a 1.5-T GE Signa HDxt MRI scanner with an eight-channel abdominal phased-array coil (GE Medical Systems, Milwaukee, WI, USA). The MRI protocols used were sagittal T2-weighted (T2W) imaging, axial T1-weighted (T1W) imaging, axial fat saturation (FS)-T2W imaging, axial high-resolution T2W imaging, IVIM imaging, and post-contrast sagittal and axial FS T1W imaging. The post-contrast sequences were acquired after intravenous injection of gadopentetate dimeglumine (0.2 mL/kg at a rate of 2 mL/s).
The IVIM DWI was based on a single-shot echo-planar sequence and was performed with free breathing. The parameters of the IVIM DWI were as follows: 11 b values in three orthogonal directions (0, 30, 50, 80, 100, 150, 200, 400, 600, 800, and 1000 s/mm²); TR = 5000 ms; TE = 32 ms; flip angle = 90°; number of excitations (NEX) = 2; field of view (FOV) = 320 mm²; slice thickness = 4 mm; spacing = 5 mm; matrix = 128 × 128; phase encoding from anterior to posterior; and acquisition time = 2 min 25 s.
IVIM-derived parameters
Herein, several improvements were made to the IVIM protocol. The b values were first optimized for a better signal-to-noise ratio (SNR) of the images (21). The maximum b value was capped so that the SNR of the images remained adequate for non-linear fitting of the model, whereas the minimum b value was kept considerably low to reduce the effect of D*. The MADC software (GE AW4.6 post-processing workstation) was then used to generate the ADC, true diffusion coefficient (D), pseudo-diffusion coefficient (D*), and perfusion fraction (f) maps.
The IVIM-derived parameters of D, D*, and f were calculated based on the theory by Sumi et al. (10) using Eq. 1.
Where Sb and S0 are the mean signal intensities at a given b value and at b = 0 s/mm², respectively. D is the true diffusion coefficient, D* is the pseudo-diffusion coefficient, and f is the perfusion fraction.
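A commonly used form of the bi-exponential IVIM model that matches the symbols defined above is sketched here. Whether the perfusion term decays with D* alone or with D + D* differs between authors, so the exact form shown is an assumption rather than a reproduction of the cited equation.

```latex
\[
\frac{S_b}{S_0} = (1 - f)\,\exp(-bD) + f\,\exp(-bD^{*})
\]
```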
The ADC value was estimated based on the mono-exponential model on a voxel-by-voxel basis with all b values using Eq. 2.
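The two models can be illustrated with a minimal sketch on a single synthetic voxel. The mono-exponential model is taken as Sb = S0 · exp(−b · ADC) and fitted by log-linear least squares over all b values. For D and f, a simple segmented estimate on the high-b part of the curve is used in place of a full non-linear fit. The segmented approach, the cut-off `b_threshold`, the function names, and the synthetic parameters are all assumptions for illustration and do not describe how the MADC software works.

```python
import math

# b values (s/mm^2) listed in the acquisition protocol.
B_VALUES = [0, 30, 50, 80, 100, 150, 200, 400, 600, 800, 1000]


def linear_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_adc(b_values, signals):
    """Mono-exponential fit S_b = S_0 * exp(-b * ADC) over all b values."""
    _, slope = linear_fit(b_values, [math.log(s) for s in signals])
    return -slope


def fit_d_and_f(b_values, signals, b_threshold=200):
    """Segmented estimate of D and f from the high-b part of the curve.

    b_threshold is a hypothetical cut-off: above it the pseudo-diffusion
    term is assumed to have decayed, so ln(S_b / S_0) ~ ln(1 - f) - b * D.
    """
    s0 = signals[b_values.index(0)]
    high = [(b, math.log(s / s0)) for b, s in zip(b_values, signals) if b >= b_threshold]
    intercept, slope = linear_fit([b for b, _ in high], [y for _, y in high])
    return -slope, 1.0 - math.exp(intercept)


def biexponential_signal(b, s0, f, d, d_star):
    """Synthetic bi-exponential IVIM signal, used only to exercise the fits."""
    return s0 * ((1 - f) * math.exp(-b * d) + f * math.exp(-b * d_star))


# Synthetic voxel with made-up parameters (not values from this study).
signals = [biexponential_signal(b, 1000.0, 0.15, 0.9e-3, 20e-3) for b in B_VALUES]
adc = fit_adc(B_VALUES, signals)
d, f = fit_d_and_f(B_VALUES, signals)
```

On such a voxel the fitted ADC is larger than D, because the fast perfusion-related decay at low b values steepens the mono-exponential fit.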
Region of interest (ROI) segmentation
The largest cross-sections of the tumors were determined by two radiologists (radiologist A with three years of experience and radiologist B with nine years of experience). Guided by the high-resolution T2W images (Fig. 1a), the two radiologists independently delineated the ROI at the broadest level of the tumor on DWI images with a b value of 1000 s/mm² (Fig. 1b). Images with higher b values provide better tumor contrast, and previous studies have suggested that 1000 s/mm² is the b value that most clearly shows the boundary between tumors and normal tissues (22). The ROI covered as much of the lesion as possible while avoiding areas of hemorrhage and necrosis. Subsequently, the same ROI was applied to the corresponding ADC (Fig. 1c), D (Fig. 1d), D* (Fig. 1e), and f maps (Fig. 1f) for analysis. Both radiologists delineated the ROI again after one month to assess intra-observer variation. A1 and B1 represent the first ROI delineation, while A2 and B2 represent the second ROI delineation.
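Reusing one ROI across all parameter maps can be pictured as applying a single binary mask to several co-registered 2D arrays. The following is a minimal sketch under that assumption; the 3 × 3 grid, the map values, and the function names are hypothetical and chosen only for illustration.

```python
def roi_values(parameter_map, roi_mask):
    """Collect the voxel values of a 2D map that fall inside a binary ROI mask."""
    return [
        value
        for map_row, mask_row in zip(parameter_map, roi_mask)
        for value, inside in zip(map_row, mask_row)
        if inside
    ]


def roi_mean(parameter_map, roi_mask):
    """Mean of the map values inside the ROI."""
    values = roi_values(parameter_map, roi_mask)
    return sum(values) / len(values)


# Toy 3 x 3 slice: the ROI is drawn once, on the DWI image (b = 1000 s/mm^2).
roi_mask = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]

# Made-up parameter maps on the same grid (illustrative numbers only).
maps = {
    "ADC": [[1.9, 1.0, 1.1], [1.8, 0.9, 1.0], [2.0, 2.1, 1.9]],
    "D": [[1.6, 0.8, 0.9], [1.5, 0.7, 0.8], [1.7, 1.8, 1.6]],
}

# The same mask is reused for every map, so identical voxels are compared.
roi_means = {name: roi_mean(param_map, roi_mask) for name, param_map in maps.items()}
```

Because the mask is shared, any difference between two delineations (for example A1 and A2) propagates to every map in the same way.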

Image of a cervical squamous cell carcinoma from a 49-year-old woman with FIGO IIIB. (a) High-resolution T2W image showing an irregular tumor with soft-tissue signal at the cervix, (b) DWI image (b value = 1000 s/mm²), the green contour is the edge of the lesion (i.e. ROI), (c) ADC map, (d) D map, (e) D* map, and (f) f map derived from IVIM images with corresponding ROI. DWI, diffusion-weighted imaging; FIGO, International Federation of Gynecology and Obstetrics; ROI, region of interest; T2W, T2-weighted.
Feature extraction
A total of 105 radiomics features were extracted via the Radiomics module in the 3D Slicer software (http://www.slicer.org) (23). The feature classes included were shape, gray level co-occurrence matrix (GLCM), gray level dependence matrix (GLDM), first-order, gray level run length matrix (GLRLM), gray level size zone matrix (GLSZM), and neighborhood gray-tone difference matrix (NGTDM). A total of 11, 23, 14, 18, 16, 16, and 5 features were extracted for these feature classes, respectively.
Statistical analysis
The effect of inter- and intra-observer variability on the reproducibility of IVIM-based radiomics features was assessed using the intraclass correlation coefficient (ICC). An ICC value of 0 indicates no reliability, while a value of 1 indicates perfectly stable features. Previous studies have shown that an ICC value ≥0.75 indicates good to excellent reliability (24). However, Pavic et al. (25) proposed that ICC values ≥0.8 were more suitable because this threshold could limit type I and type II errors with a small sample size. The present study had a small sample size of 20 patients and thus the more stringent ICC value of 0.8 was used as the threshold for good to excellent reliability. Statistical analysis was performed using the R software (version 3.6.1). The HemI software (Heatmap Illustrator, version 1.0) was used to illustrate the hierarchical clustering of radiomics features on inter- and intra-observer ICC (26).
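To make the thresholding step concrete, the following is a minimal sketch of an ICC computation for one feature. The text does not state which ICC form was used, so the two-way random-effects, absolute-agreement, single-rater form, ICC(2,1), is an assumption here. The function names and the feature values are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings is a list of rows, one row per subject (here, per patient),
    with one column per rater or delineation session.
    """
    n = len(ratings)      # number of subjects
    k = len(ratings[0])   # number of raters or sessions
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def is_reproducible(icc, threshold=0.8):
    """Apply the ICC >= 0.8 cut-off used for good to excellent reliability."""
    return icc >= threshold


# Made-up values of one feature for five patients, delineated twice
# (for example A1 and A2); these are not data from the study.
feature_values = [
    [0.52, 0.50],
    [0.61, 0.63],
    [0.45, 0.44],
    [0.70, 0.72],
    [0.58, 0.57],
]
feature_icc = icc_2_1(feature_values)
feature_is_robust = is_reproducible(feature_icc)
```

Repeating this for each of the 105 features and each pair of delineations (A1A2, B1B2, A1B1, A2B2) would give the kind of ICC matrix that the heatmap summarizes.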
Results
Overall reliability
The reproducible features in ADC, D, D*, and f maps in the inter- and intra-observer analyses are displayed in Fig. 2. Radiomics features with an ICC value ≥0.8 are shown in red. These values indicate good to excellent reproducibility of the features against inter- and intra-observer variability.

Heatmap of the IVIM-based feature variation against inter- and intra-observer manual segmentation. Rows represent different radiomics features extracted from IVIM imaging. Columns represent the comparisons in the various inter- and intra-observer combinations. Different color scales represent different ICC thresholds. Red indicates ICC ≥ 0.8 (robust features), blue indicates 0.4 ≤ ICC < 0.8, while green indicates 0 < ICC < 0.4. A and B are radiologists A and B, respectively. 1 and 2 represent the first and second delineation of ROI (one month later), respectively.
For the intra-observer analysis (A1A2 and B1B2, respectively), the D* map had 100% (105/105) and 95% (100/105) reproducible features, the ADC map had 89% (93/105) and 93% (98/105), the D map had 97% (102/105) and 86% (90/105), while the f map had 54% (57/105) and 62% (65/105) reproducible features with good to excellent reliability.
For the inter-observer analysis (A1B1 and A2B2, respectively), the D* map had 90% (95/105) and 94% (99/105) reproducible features, the ADC map had 85% (89/105) and 70% (73/105), the D map had 81% (85/105) and 78% (82/105), while the f map had 41% (43/105) and 93% (98/105) reproducible features with good to excellent reliability. The reliability rates are presented in Fig. 3.
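The reported percentages follow directly from the feature counts out of 105. The short sketch below recomputes them from the counts given in the text. The dictionary layout and names are hypothetical, and the pairing of each count with A1A2/B1B2 or A1B1/A2B2 is an assumption based on the order in which the values are listed.

```python
TOTAL_FEATURES = 105

# Counts of features with ICC >= 0.8, as reported in the text.
# Assumed order: intra = (A1A2, B1B2); inter = (A1B1, A2B2).
reproducible_counts = {
    "intra": {"D*": (105, 100), "ADC": (93, 98), "D": (102, 90), "f": (57, 65)},
    "inter": {"D*": (95, 99), "ADC": (89, 73), "D": (85, 82), "f": (43, 98)},
}


def percent(count, total=TOTAL_FEATURES):
    """Share of reproducible features, rounded to the nearest whole percent."""
    return round(100 * count / total)


percentages = {
    analysis: {
        map_name: tuple(percent(count) for count in counts)
        for map_name, counts in per_map.items()
    }
    for analysis, per_map in reproducible_counts.items()
}
```

The same helper can be applied per feature class by passing the class size as `total`.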

Summary of the percentage of reproducible features (ICC ≥ 0.8) in ADC, D, D*, and f maps. A and B represent radiologists A and B, respectively. 1 and 2 represent the first and second delineation of ROI (one month later), respectively.
Reproducibility of feature categories
Different proportions of both inter- and intra-observer reproducible features (ICC ≥ 0.8) were observed across the feature categories in ADC, D, D*, and f maps (Fig. 4). The feature category with the highest reproducibility rate differed between maps.

The percentage of reproducible features in (a) ADC, (b) D, (c) D*, and (d) f maps. A and B represent radiologists A and B, respectively. 1 and 2 represent the first and second delineation of ROI (one month later), respectively.
For the intra-observer comparison, the reproducibility rates of the feature categories in ADC maps ranged from 78% to 100% for the A1A2 subgroup and from 85% to 100% for the B1B2 subgroup. Similar results were observed in D maps, with reproducibility rates of 91%–100% for the A1A2 subgroup and 81%–100% for the B1B2 subgroup. The D* map had a reproducibility rate of 100% for the A1A2 subgroup and 85%–100% for the B1B2 subgroup. However, the f map had poor reproducibility rates, in the range of 38%–100% for the A1A2 subgroup and 20%–85% for the B1B2 subgroup.
For the inter-observer comparison, the reproducibility rates of the feature categories in ADC maps ranged from 60% to 100% for the A1B1 subgroup and from 56% to 85% for the A2B2 subgroup. Similar results were observed in D maps, with reproducibility rates of 70%–100% for the A1B1 subgroup and 74%–85% for the A2B2 subgroup. The D* map had reproducibility rates of 80%–100% for the A1B1 subgroup and 80%–96% for the A2B2 subgroup. However, the f map had less consistent reproducibility rates, in the range of 11%–100% for the A1B1 subgroup and 83%–100% for the A2B2 subgroup.
Discussion
The present study investigated the effect of intra- and inter-observer variability on the reproducibility of radiomics features derived from IVIM imaging of patients with CC. Radiomics features derived from ADC, D, and D* maps were more reproducible than those from f maps. In the intra-observer analysis, the proportions of features with good to excellent reliability were 95%–100%, 89%–93%, 86%–97%, and 54%–62% for the D*, ADC, D, and f maps, respectively. In the inter-observer analysis, the corresponding proportions were 90%–94% for the D* map, 70%–85% for the ADC map, 78%–81% for the D map, and 41%–93% for the f map. ADC, D, and D* maps performed well in both intra- and inter-observer analyses, while the performance of the f map was poor in both.
Several studies have focused on the application of IVIM DWI in CC. Lee et al. (11) found that CC had the lowest f value (14.9 ± 2.6%), which differed significantly from that of the normal cervix (18.6 ± 2.2%) and uterine leiomyoma (17.3 ± 3.6%) (P < 0.05). Its D value (0.86 ± 0.16 × 10⁻³ mm²/s) was also the lowest and differed significantly from that of the normal cervix (1.32 ± 0.12 × 10⁻³ mm²/s) and myometrium (1.26 ± 0.12 × 10⁻³ mm²/s) (P < 0.05). As such, these values served as biomarkers for tissue-type differentiation. Similarly, Zhu et al. (12) reported that D and ADC values were positively correlated at all four time points within four weeks of concurrent chemoradiotherapy (P1 < 0.001, P2 = 0.003, P3 = 0.032, and P4 < 0.001). These observations helped evaluate the early response of CC to chemoradiotherapy. Three histogram analyses demonstrated that features extracted from IVIM DWI play important roles in clinical research, such as the differentiation of benign from malignant tumors and the evaluation of tumor grades (7–9). Cho et al. (7) reported that the maximum value extracted from the f map and the kurtosis extracted from the D* map showed significant differences between benign and malignant breast lesions (P = 0.024 and 0.003, respectively). Similarly, Zhang et al. (9) reported that the mean, median, 10th and 75th percentiles, kurtosis, and skewness extracted from ADC and D maps differed significantly between Gleason grades of prostate cancer (P ≤ 0.023). However, Nougaret et al. (8) reported that histogram analysis did not yield better results than the median values and was thus not necessary in routine clinical practice.
Most studies have focused on the relationship between radiomics features and clinical application. Although it is important to analyze the heterogeneity of a tumor non-invasively, the reproducibility of radiomics features should be established before their clinical application. To date, most studies have reported that the reproducibility of radiomics features derived from MRI is a major challenge in multiparameter imaging and multicenter settings (16–18). In IVIM DWI, only one study has investigated the impact of intra-observer variability on the reproducibility of IVIM-based radiomics features. Song et al. (27) studied the stability of 14 IVIM-based radiomics features in mouse models of breast cancer. To our knowledge, the present study is the first to evaluate the impact of intra- and inter-observer variability on the stability of IVIM-based radiomics features in humans. Herein, features derived from ADC, D, and D* maps were more reproducible than those from the f map. These results were inconsistent with those of Song et al. (27), who reported that features extracted from ADC and D maps were more reproducible than those from the D* and f maps under intra-observer variability. Inherent differences in pathological tissue types could have contributed to this discrepancy, because the former study was based on mouse models whereas ours was based on humans. In addition, the former study compared only intra-observer variability and did not take inter-observer variability into account. Scanning devices (28), feature categories (29), ROI segmentation (30), feature extraction methods and software (31), as well as IVIM double-exponential fitting software could also have affected the reproducibility. Notably, there is neither a standard IVIM double-exponential model-fitting software nor a consensus on the standard IVIM protocol.
IVIM-based quantitative analysis is a potential non-invasive means of exploring tumor heterogeneity (15). In most analyses of conventional statistical parameters, the mean value of D* has been hampered by a large standard deviation and is easily affected by the SNR, resulting in data instability and high uncertainty. However, the present radiomics analysis showed that features extracted from D* maps were more robust to both intra- and inter-observer segmentation variability (Fig. 4). This difference may arise because conventional statistical parameter analysis relies on mean measurements, whereas radiomics analysis relies on extracted quantitative features. The instability of the mean D* value causes statistical parameters to fluctuate, but it did not affect the extraction and analysis of quantitative features. This result is consistent with that of Thapa et al. (15), who reported that D* had strong discriminatory power in the evaluation of CC staging. The robustness of D*-derived features should therefore not be underestimated.
Several studies have also reported on the effect of inter- and intra-observer variability on the reproducibility of radiomics features based on other medical imaging modalities. In renal masses, an intra-observer analysis conducted by Kocak et al. (32) revealed that the percentage of reproducible features (ICC ≥ 0.75) was in the range of 84.4%–92.2% for unenhanced computed tomography (CT) images and 85.5%–93.1% for enhanced CT images. In the same study, the inter-observer analysis based on two-dimensional (2D) CT images yielded reproducible-feature rates of 76.7% and 84.9% for the unenhanced and enhanced CT images, respectively. The study also reported that GLCM (≥77.1%) and first-order (83.3%) features had higher reproducibility (ICC ≥ 0.75) on both unenhanced and enhanced CT images (32). In the present study, only 39% of GLCM and 11% of first-order features showed good reproducibility (ICC ≥ 0.8). This discrepancy could have been caused by differences in tumor types, imaging modalities, and ICC reproducibility thresholds. Additionally, the difference in the number of features and the lack of high-order features could also have contributed to the variations. In another study, Pavic et al. (25) studied the effect of inter-observer variability on reproducibility in different tumor types (non-small cell lung cancer [NSCLC], head and neck squamous cell carcinoma [HNSCC], and malignant pleural mesothelioma [MPM]) based on CT images. The study found that the percentages of reproducible features in NSCLC, HNSCC, and MPM were 90%, 59%, and 36%, respectively (25). Combined with those obtained in the present study, these results strongly suggest that tumor type and variability in inter-observer delineation play an essential role in the stability of radiomics features. They further support the importance of establishing a stable subset of reliable features by performing reproducibility analysis before multicenter studies.
Nevertheless, the present study has some limitations. First, it was a retrospective study with a small sample size, which made it prone to type I and type II errors; the ICC threshold was therefore set to a higher value of 0.8 to limit these errors (25). Second, tumor segmentation and feature extraction were based on 2D images, which may have affected the reproducibility analysis, because whole-tumor segmentation has been suggested to be more responsive to tumor texture and heterogeneity (33). However, whole-tumor segmentation has previously been criticized as impractical and time-consuming (32), and radiomics analysis in patients with CC has also been based on 2D segmentation in other studies (32,34). Third, only some common feature classes, namely shape, GLCM, GLDM, first-order, GLRLM, GLSZM, and NGTDM, were extracted; high-order features were not included in the analysis. Future studies enrolling larger sample sizes from multiple centers are required to overcome these limitations.
In conclusion, the impact of inter- and intra-observer manual segmentation variability on the reproducibility of IVIM radiomics features in patients with CC was determined. In total, 105 radiomics features were extracted from ADC, D, D*, and f maps, among which those derived from ADC, D, and D* maps were found to be more reproducible than those from f maps. Inter- and intra-observer variability was found to affect radiomics analysis. Multicenter studies should therefore pay more attention to intra- and inter-observer variability.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received the following financial support for the research, authorship, and/or publication of this article: This work was supported by the China National Key Research and Development Program (Grant No. 2016YFC0103400); the Natural Science Foundation of Hubei Province, PR China (Grant No. 2017CFB552), and the Applied Basic Research Programs of Wuhan, PR China (Grant No. 2017060201010160). JQ is supported by the Taishan Scholars Program of Shandong Province (Grant No. TS201712065).
