A pilot study of radiomics signature based on biparametric MRI for preoperative prediction of extrathyroidal extension in papillary thyroid carcinoma

Abstract

OBJECTIVE:

To investigate efficiency of radiomics signature to preoperatively predict histological features of aggressive extrathyroidal extension (ETE) in papillary thyroid carcinoma (PTC) with biparametric magnetic resonance imaging findings.

MATERIALS AND METHODS:

Sixty PTC patients with preoperative MR including T2WI and T2WI-fat-suppression (T2WI-FS) were retrospectively analyzed. Among them, 35 had ETE and 25 did not. Pre-contrast T2WI and T2WI-FS images depicting the largest section of tumor were selected. Tumor regions were manually segmented using ITK-SNAP software and 107 radiomics features were computed from the segmented regions using the open Pyradiomics package. Then, a random forest model was built to do classification in which the datasets were partitioned randomly 10 times to do training and testing with ratio of 1:1. Furthermore, forward greedy feature selection based on feature importance was adopted to reduce model overfitting. Classification accuracy was estimated on the test set using area under ROC curve (AUC).

RESULTS:

The model using T2WI-FS image features yields much higher performance than the model using T2WI features (AUC = 0.906 vs. 0.760 using 107 features). Among the top 10 important features of T2WI and T2WI-FS, there are 5 common features. After feature selection, the models trained using top 2 features of T2WI and the top 6 features of T2WI-FS achieve AUC 0.845 and 0.928, respectively. Combining features computed from T2WI and T2WI-FS, model performance decreases slightly (AUC = 0.882 based on all features and AUC = 0.913 based on top features after feature selection). Adjusting hyper parameters of the random forest model have negligible influence on the model performance with mean AUC = 0.907 for T2WI-FS images.

CONCLUSIONS:

Radiomics features based on pre-contrast T2WI and T2WI-FS is helpful to predict aggressive ETE in PTC. Particularly, the model trained using the optimally selected T2WI-FS image features yields the best classification performance. The most important features relate to lesion size and the texture heterogeneity of the tumor region.

Keywords

Thyroid cancer neoplasm staging radiomics diagnosis magnetic resonance imaging

1 Introduction

Papillary thyroid cancer (PTC) is the most common type of well-differentiated thyroid cancer, accounting for 90% of all thyroid carcinomas(1–3). PTCs are indolent differentiated cancers, which are associated with low mortality and morbidity [1 –5]. However, some PTCs show aggressive behavior and recurrence [1 , 4–7]. Extrathyroidal extension (ETE) is accepted as an important factor influencing the prognosis and recurrence of PTCs [4–6 , 9]. ETE is mainly defined as infiltration beyond the thyroid capsule, which has been a risk factor for poor prognosis of thyroid cancer. According to the seventh edition of the AJCC, the TNM classification for thyroid carcinomas divides ETE into minimal ETE (extension to sternothyroid muscle or perithyroid soft tissues) and extensive ETE (extension to subcutaneous soft tissue, larynx, trachea, esophagus, recurrent laryngeal nerve, or the prevertebral fascia or encases the carotid artery or mediastinal vessels) [7].

Preoperative detection of ETE is very important, not only for adequate nodal staging but also for surgical planning [2 , 9]. Several studies have demonstrated the superiority of various imaging tools, including US, CT, and MRI, in ETE assessments [10 –13]. Currently, US is the accepted method for the first diagnostic evaluation of ETE of PTCs. Previous studies on the diagnostic performance of preoperative US for the determination of ETE of PTCs reported many advantages, such as high-resolution imaging capability and ease of accessibility. However, US also has several limitations, the most important of which is operator dependency. US, CT and MRI have limited accuracy and sensitivity in the detection of ETE [10 –13]. Due to lack of objective parameters and indicators for quantitative analysis, it is not very accurate to apply these observation findings to predict ETE. Therefore, avoiding unnecessary biopsy procedures and finding more objective assessment methods is the key to clinical work.

The diagnosis of the pathological diseases such as PTC by doctors is still subjective problem, some image information that cannot be recognized by human visual system will make the doctor’s diagnosis inaccurate. During the past decades, medical imaging innovations with the development of computer hardware and the establishment of large databases, radiomics is emerged and expected to move towards quantitative imaging [14 –20]. Being different from the traditional megascopic analysis based on experiences, radiomics is a machine learning based technique that detects pathological changes that cannot be perceived by the human eye, using an automated high-throughput extraction of large amounts of quantitative features of medical images and to draw statistical information [14–19 , 22]. Previous studies have shown that radiomics features can characterize the texture and morphology of tumors regions from grayscale medical images [22 –25]. The current study presents a non-invasive method to predict ETE in patients with PTCs with quantitative features extracted from MRI imaging.

2 Materials & method

2.1 Study population

This retrospective study was approved by the institutional review board, and written informed consent was waived. A search performed on the hospital’s database identified 68 consecutive patients with surgically confirmed PTCs who underwent preoperative thyroid MRI at our institute between January 2015 and December 2018 for the evaluation of PTCs. Eight of the 68 patients were excluded for the following reasons: poor-quality MRI (n = 2), small-sized masses (i.e., too small to be identified on MRI, n = 3), and severe susceptibility artifacts and patient motion artifacts (n = 3). Finally, 60 patients (15 males and 45 females) with an average age of 46.5 years (vary from 24 to 74 years old), were included in this study. All patients underwent thyroid MRI within 2 weeks before a thyroidectomy.

2.2 MRI examinations

All examinations were performed on 1.5 Tesla MR Scanners (GE Signa HD 1.5 T MR scanner; GE Healthcare Systems, Milwaukee, WI, USA) with an eight-channel high-resolution receiver synergy-head/neck phased-array coil. All MRI protocols consisted of T1WI, T2WI-FS, DWI, and contrast-enhanced MRI. Conventional MRI sequences including axial T1WI spin echo (TR/TE=520/14ms), T2WI fast spin echo (TR/TE=3500/95ms) with and without fat suppression and coronal T2WI (TR/TE=3500/95). DWI was acquired in three orthogonal directions with two different b values (0 and 500 s/mm²) using STIR fat suppressed single-shot echo planar imaging spin echo sequences. Of all patients, contrast enhanced T1WI (TR/TE=520/14 ms) were obtained with or without fat-suppression immediately after the intravenous injection of 0.1 mmol/kg Gd-DTPA (Magnevist, Schering AG, Germany). The parameters were as follows: section thickness 3 mm, with a 1 mm intersection gap, FOV = 40×28 cm; matrix = 256×256, number of excitation (NEX)=4. The whole examination was completed within 30 minutes.

2.3 Overview

The proposed architecture of our study is illustrated in Fig. 1. The first step of our process is image acquisition, in which T2WI and T2WI-FS images of 60 PTCs were collected. Secondly, it is time-consuming for the radiologists to annotate the tumor contour slice by slice, and the contour of the tumor in this study was annotated on the slice with largest tumor area for each patient. Tumor segmentation was manually performed on a slice with the largest section of tumor by a radiologist (W.X, with 15 years’ experience) using ITK-SNAP software (an open software www.itksnap.org). The region of interest (ROI) covered the entire section of the tumor. To evaluate the reproducibility of features among the different segmentations, 20 cases were selected randomly and their images were re-segmentation by W.X two week after initial segmentation, and another radiologist (H.S.D with >20 years’ experience) also performed the same images. The class correlation coefficient (CCC) was used to measure the agreement of Inter- and intra-observer, and a CCC above 0.7 indicated excellent reliability. Finally, we found that all the segmentations met the acquirement. Then, the radiomics features were extracted by the open-source automatic feature analysis algorithm which provides a comprehensive characterization of thyroid nodules. Afterward, wavelet and LoGfilter were applied for extracting more specific features of the tumors for the purpose of studying whether it is useful for classification task. Finally, we make feature selection on the extracted features to select the most important features for classification, while reducing features dimensionality to avoid overfitting. We report the performance of the most important features for classification, and make statistical data analysis (p-values) on the models to show the differences.

Fig. 1

Overall workflow of the study.

For each patient, two image modalities (T2WI and T2WI-FS) are available. We first analyze the two modalities separately, and then analyze the combined modalities by simply concatenating the radiomics features of the two modalities. We denote the combined modalities as T2WI-combined.

2.4 Radiomics features

The radiomics features [26] are extracted using the open Pyradiomics packages implemented in python [16, 27], so that the results could be easily reproduced, and we totally extract 107 radiomics features from segmented region of interest on each MRI image slice. In our study, the radiomics features compose of First Order Statistics, Shape-based, Gray Level Co-occurrence Matrix (GLCM), Gray Level Run Length Matrix (GLRLM), Gray Level Size Zone Matrix (GLSZM), Gray Level Dependence Matrix (GLDM), Neighboring Gray Tone Difference Matrix (NGTDM).

First Order statistics features describe how voxel intensities within the region of interest (ROI) are distributed. Such features include entropy, mean, minimum and maximum intensity values, energy, skewness, etc. Shape-based features describe the size of the tumor area and the similarity of the tumor area to the sphere. The size characteristics of tumor area include volume, surface area, maximum diameter and effective diameter; and similarity characteristics include surface area ratio, density, eccentricity, sphericity, etc.

The Gray Level Co-occurrence Matrix (GLCM) [28] is a matrix that represents the intensity relationship of discrete gray levels of adjacent pixels or voxels in a 3D volume. GLCM features are extracted based on the probability distribution for the elements of the GLCM, including contrast, correlation, difference variance, etc. The gray level run length matrix (GLRLM) defines various texture features. Compared to GLCM, GLRLM also assesses the distribution of intensities of image, but mainly focus on run lengths, which reflects the comprehensive information of the image gray scale with respect to direction, adjacent interval and variation amplitude. GLRLM features include short run emphasis, long run emphasis, etc. The gray level size zone matrix (GLSZM) quantify the gray level regions in the image. The gray level regions refer to the number of connected voxels with the same gray level intensity. If the distance between the voxels is 1 according to the infinity norm, then we consider the voxel as a neighbor. GLSZM features include small area emphasis, gray level non-uniformity, etc. The Gray Level Dependence Matrix (GLDM) is designed to capture the coarseness of the image texture. GLDM counts the number of neighbor voxel around a central voxel within Chebyshev distance δ. GLDM features include small dependency emphasis, dependency non-uniformity, etc. The Neighboring Gray Tone Difference Matrix (NGTDM) quantifies the sum of gray level intensity differences between a certain voxel and the average gray level intensity of neighboring voxels within a Chebyshev distance δ. NGTDM features include busyness, complexity, etc. Table 1 list the general descriptions of radiomics features and the corresponding classified category.

Table 1
Radiomics feature characteristics

Radiomics feature Number of features

First Order Statistics 18

Shape-based 14

Gray Level Cooccurence Matrix (GLCM) 24

Gray Level Run Length Matrix (GLRLM) 16

Gray Level Size Zone Matrix (GLSZM) 16

Gray Level Dependence Matrix (GLDM) 14

Neighbouring Gray Tone Difference Matrix (NGTDM) 5

Radiomics feature	Number of features
First Order Statistics	18
Shape-based	14
Gray Level Cooccurence Matrix (GLCM)	24
Gray Level Run Length Matrix (GLRLM)	16
Gray Level Size Zone Matrix (GLSZM)	16
Gray Level Dependence Matrix (GLDM)	14
Neighbouring Gray Tone Difference Matrix (NGTDM)	5

2.5 Feature filter

Aside from original radiomics features, there are also some built-in optional filters in Pyradiomics, such as Laplacian of Gaussian (LoG), Wavelet, Square, Logarithm, Local Binary Pattern and so on, and adding filters can generate more specific features of the lesions. For studying whether it is useful for classification, in this paper, we select Laplacian of Gaussian (LoG) and Wavelet filters to obtain additional radiomics features. For LoG filters, we use five LoG filters with 5 sigma levels, 1.0–5.0, respectively. Since each LoG filter can yield 93 features, which consist of GLCM, GLRLM, GLSZM, GLDM, NGTDM and First Order Statistics, we can get totally 465 (93×5) features. For wavelet, as 1 level Wavelet decomposition yields 8 derived images, we can get totally 744 (93×8) features.

2.6 Random forest classification model

First, we select random forest as the classification method [29 –31]. Random forest ensembles a number of decision tree to vote for the most popular class and search for the best feature in a random subset of features at the same time, which reduces the overfitting problem. Because the tree partition algorithm used by random forest sets cutoffs for each feature independently, feature standardization is not necessary.

Next, to combine the two modalities T2WI and T2WI-FS, we concatenate the radiomics features for each patient, and we denote the combined modality as T2WI-combined. Therefore, for T2WI, T2WI-FS and T2WI-combined modality, each patient has 107, 107 and 214 features, respectively.

Then, since random forest model assigns a feature importance score for each feature, for each partition of the training and testing dataset and each feature, there is a feature importance. We calculate the average score for each feature across 10 partitions as the final feature importance score.

2.7 Feature selection and model training

Feature selection [32] is the process of selecting some of the most effective features from the original features to reduce the dimension of the dataset (especially important in small sample problems), and improve the performance of the learning algorithm. We adopt the SelectKBest algorithm of the open source sklearn feature selection in Python to search a minimally sized subset that maximizes the model performance. This approach greedily selects features according to the feature importance.

To train and test the classification models, the dataset is randomly partitioned into training and testing datasets with ratio 1:1. To reduce the influence of the randomness, the dataset is randomly shuffled and evaluated 10 times. Performance of the classifier was measured using receiver operating characteristic curve. All training and evaluation procedures are implemented on Python 2.7 and Pyradiomics 2.2. All model performance levels are evaluated using receiver operating characteristic (ROC) method. Areas under ROC curves the corresponding standard deviations are then computed and compared.

3 Result

3.1 Clinical features and surgicopathological diagnosis of ETE

The demographic and clinicopathological characteristics of the 60 patients in this study are delineated in Table 2. All the included patients have been performed the total or near-total thyroidectomy. This study included data from 15 males and 45 females, with an average age of 46.5 years (vary from 24 to 74 years old). Except for lymph node metastasis and primary tumor size, no significant statistical difference was observed between the clinicopathological features in terms of patient age, sex, location and surgical procedure in the PTC cases with and without ETE. The patients were divided into two groups based on the surgical pathology reports and surgeons’ operative reports. Among these patients, 25 patients were diagnosed as PTCs without ETE, and 35 patients were diagnosed as PTCs with ETE.

Table 2
Clinicopathological features of 2 groups according to ETE & without ETE in patients with PTC

without ETE ETE p value

Age(years) 48.80±11.54 44.91±11.79 0.209

0.511

≥45 15 18

<45 10 17

Sex 0.450

Male 5 10

Female 20 25

Primary tumor size(cm) 1.5(1.1,2.2) 1.1(1.0,1.3) 0.004

Lymph node metastasis <0.001

Positive 8 29

Negative 17 6

Tumor location, n 0.584

Right thyroid 11 19

Left thyroid 14 15

Isthmus 1

	without ETE	ETE	p value
Age(years)	48.80±11.54	44.91±11.79	0.209
			0.511
≥45	15	18
<45	10	17
Sex			0.450
Male	5	10
Female	20	25
Primary tumor size(cm)	1.5(1.1,2.2)	1.1(1.0,1.3)	0.004
Lymph node metastasis			<0.001
Positive	8	29
Negative	17	6
Tumor location, n			0.584
Right thyroid	11	19
Left thyroid	14	15
Isthmus		1

3.2 Random forest model is not sensitive to hyper parameters

To analyze the influence of the hyper parameters on the random forest model, we do grid searching and compare the performance of different models under different hyper parameters. Specifically, we tune three most important parameters in random forest model (other parameters are set as default value): the number of trees in the forest (n_estimators, we set this value ranging from 10 to 100 with step 10), the minimum number of samples in a leaf node (min_samples_leaf, we set this value ranging from 3 to 10 with step 1), and the maximum depth of the tree (max_depth, we set this value ranging from 3 to 10 with step 1). In this section, we analyze these parameters on the T2WI-FS dataset by using the open source ‘GridSearchCV’ algorithm in python, and the AUC score is used as evaluation measurement.

Through the combination of different hyper parameter values, we can get a total of 640 (8×8×10) AUC scores, it is not possible to drawn all AUC scores in a figure, so we randomly selected 10 of them and drew the AUC scores in Fig. 3. As shown in Fig. 3, theses curves illustrate that when min_samples_leaf is too large, the AUC will decrease regardless of how the n_estimators and max_depth change. Too many samples in one leaf generally lead to feature aggregation, resulting in the discriminative features weakened by other non-discriminative features. We also explore the effect of different max_depth values and n_estimators values, and as we can see from Fig. 3, both of them can barely impact the AUC score of model. Most of the AUC scores under different hyper parameters are around 0.9, which indicates that random forest model is a robust classifier and is not sensitive to hyper parameters.

For statistical data analysis, we computed the p-values of the AUC score sets under different hyper parameters in Fig. 3. The p-values between any two AUC sets are much bigger than 0.05 (the smallest p-value is 0.684, which is computed between the AUC sets from n_estimators=10, max_depth=10 parameter and n_extimator=90, max_depth=3 parameter), which indicates that the differences between different hyper parameter combinations are not significant. This result also verified the conclusion that random forest model is not sensitive to hyper parameters.

For further verification, we also use T2WI and T2WI-combined dataset to complete the above experiments, and the same conclusions are achieved. Therefore, we use the best parameters (max_depth = 5, n_estimators = 20, min_samples_leaf = 2) provided by GridSearchCV algorithm on T2WI-FS dataset to do the following experiments.

3.3 T2WI-FS is more informative than T2WI

We first use max_depth = 5, n_estimators = 20, and min_sample_leaf = 2 as the hyper parameters of random forest to evaluate the performance of different feature sets. When all features of T2WI, T2WI-FS and T2WI-combined (107, 107 and 214 features, respectively) were used, the performance of these models is shown as Fig. 2. Three feature sets (T2WI, T2WI-FS and T2WI-combined) achieve average AUC 0.760, 0.906 and 0.882, with standard deviation 0.082, 0.038 and 0.060, respectively. Since the performance of T2WI-FS is better than the other two, we can infer that T2WI-FS is more informative than T2WI. And we speculate that the more severe feature redundancy (limited number of patients (60) compared with the number of features (214)) caused by combination of T2WI and T2WI-FS result in the unsatisfactory performance of T2WI-combined.

Fig. 2

ROC curves of the random forest model using all features. (a) T2WI; (b) T2WI-FS; (c) T2WI-combined. For each feature set, training and testing dataset has been randomly partitioned 10 times.

For statistical data analysis, we made t-test on the 10 test sets shown in Fig. 2 for each dataset. For each dataset (T2WI, T2WI-FS and T2WI-combined), all the p-values of their 10 test sets are greater than 0.05 (where the smallest p-values of the three datasets are 0.1034, 0.6660 and 0.2734, respectively). The p-values show 10 sets are not statistically different, which indicates that the classifier using all features are reliable.

3.4 More feature filters do not improve the results

Figure 4 shows the AUC scores with different min_samples_leaf values on different feature sets (107 features for original, 465 features for LoG, 744 features for wavelet, 1209 features for LoG+wavelet). The average AUCs on these four feature sets are 0.906, 0.889, 0.894 and 0.889on the T2WI-FS datasets, respectively. Similarly, we also computed the p-values of the AUC scores between the three feature sets, and all the p-values are bigger than 0.05 (the smallest p-value is 0.845, which is computed between original AUC set and wavelet AUC set).

Fig. 3

The relation between performance and min_sampels_leaf with different n_estimators and max_depth.

Fig. 4

The relation between performance and min_sampels_leaf with different filters.

Therefore, we can conclude that although various kinds of filters can supply hundreds of new radiomics features of papillary thyroid carcinoma, the classifier cannot benefit from the new features and the curves of various kinds of filters have the similar tendency.

3.5 Feature selection improves model performance

To get better classification performance, we do feature selection based on the feature importance. We use the open source SelectKBest algorithms in sklearn model selection in Python to do forward greedy feature selection based on feature importance. The number of features we select ranges from 1 to 107 for all the three original feature sets (T2WI, T2WI-FS, T2WI-combined).We randomly shuffle the datasets 10 times, and split the training and testing set at ratio 1:1. For each selected subset of features, we recorded the AUC scores (each subset 10 AUC scores) on the 10 test sets, the average AUC scores of each selected feature performance on three datasets are shown in Fig. 5.

Fig. 5

AUC of the random forest model after selecting different feature set. The features are ranked by feature importance. For T2WI-combined, only the top 107 features are considered for a better visualization.

We computed the p-values between the AUC scores on different feature subsets of the three datasets, and all the 107 p-values between T2WI-FS and T2WI are smaller than 0.05 (where the highest p-value is 0.006), which means the differences between T2WI-FS and T2WI are significant. In other words, the features extracted from T2WI-FS are statistically significant than T2WI, and thus T2WI-FS performs better in classification. However, no p-value between T2WI-FS and T2WI-combined is smaller than 0.05 (where the smallest p-value is 0.119), which means the differences between T2WI-FS and T2WI-combined are not significant. We studied the features that we selected from T2WI-combined in each feature subset and find that most selected are from T2WI-FS. We suppose this is the reason for the unsignificant differences of p-values between T2WI-FS and T2WI-combined, and this result can further indicate that T2WI-FS performs better thanT2WI.

In the10 feature subsets which selected features ranging from 1 to 10, the best performances for T2WI, T2WI-FS and T2WI-combined are 0.845, 0.928 and 0.913 AUC score and standard deviation 0.184, 0.082 and 0.091 with 2, 6, 2 features selected, respectively. Specifically, for T2WI-FS, when39 features are selected, we can get highest AUC of 0.953. For statistical data analysis, we computed the p-values between different selected features to study the difference by using different number of features. The p-value results we got (only 5% of the p-values are less than 0.05) show that not all AUC scores of the selected features are significantly different. Specifically, for T2WI-FS, p-values between 6 features and 39 features, 6 features and 107 features, 39 features and 107 features are 0.049, 0.056 and 0.016, respectively. We speculate that this is because the random forest model is robust, and it is not sensitive to the number of features.

3.6 Size of ROI and heterogeneity of the ROI texture are critical for the classification

We further check the top ranked features of T2WI, T2WI-FS and T2WI-combined in Table 3. The AUC of the top 10 selected features are 0.778, 0.865 and 0.904 for T2WI, T2WI-FS and T2WI-combined, respectively. We found that for T2WI-combined, the majority of the features are coming from T2WI-FS. It is reasonable, because T2WI-FS achieves better AUC than T2WI, the features from T2WI-FS are more informative. We also found some top ranked features are shared by T2WI and T2WI-FS. Such features are likely to be the most important features. We look into the details of 5 of them appearing in top 10 ranked features. The MajorAxisLength and MinorAxisLength reflects the size of the ROI. ZoneEntropy measures the distribution uncertainty/randomness of the region sizes and gray levels. A higher value indicates more heterogeneity in the texture patterns. Coarseness measures the average difference between the center voxel and its neighborhood and indicates the rate of spatial change. Higher values represent lower spatial change rate and more uniform local texture. The last common feature, RunLengthNonUniformity, measures the similarity of run lengths across the image, with lower values indicating higher homogeneity between run lengths in the image. Roughly speaking, these three features (ZoneEntropy, Coarseness and RunLengthNonUniformity) measure the heterogeneity of the texture of ROI. Such features are in line with intuitive of a doctor.

Table 3
Top ranked radiomics features on T2WI, T2WI-FS and T2WI-combined. Common features appearing in top 10 features in T2WI and T2WI-FS are highlighted

T2WI T2WI-FS T2WI-combined

shape_MinorAxisLength glszm_ZoneEntropy ngtdm_Coarseness-T2FS

ngtdm_Coarseness ngtdm_Coarseness shape_MeshVolume-T2FS

gldm_DependenceNonUniformity shape_MajorAxisLength shape_VoxelVolume-T2FS

glszm_ZoneEntropy shape_MinorAxisLength shape_MinorAxisLength-T2FS

shape_MeshVolume shape_Maximum3DDiameter glszm_ZoneEntropy-T2FS

shape_SurfaceArea glszm_GrayLevelNonUniformity shape_SurfaceArea-T2FS

shape_MajorAxisLength shape_VoxelVolume gldm_DependenceNonUniformity-T2FS

glrlm_RunLengthNonUniformity gldm_DependenceEntropy gldm_DependenceEntropy-T2FS

glrlm_RunEntropy glrlm_RunLengthNonUniformity shape_Maximum3DDiameter-T2FS

shape_Sphericity gldm_GrayLevelNonUniformity glszm_GrayLevelNonUniformityNormalized-T2FS

T2WI	T2WI-FS	T2WI-combined
shape_MinorAxisLength	glszm_ZoneEntropy	ngtdm_Coarseness-T2FS
ngtdm_Coarseness	ngtdm_Coarseness	shape_MeshVolume-T2FS
gldm_DependenceNonUniformity	shape_MajorAxisLength	shape_VoxelVolume-T2FS
glszm_ZoneEntropy	shape_MinorAxisLength	shape_MinorAxisLength-T2FS
shape_MeshVolume	shape_Maximum3DDiameter	glszm_ZoneEntropy-T2FS
shape_SurfaceArea	glszm_GrayLevelNonUniformity	shape_SurfaceArea-T2FS
shape_MajorAxisLength	shape_VoxelVolume	gldm_DependenceNonUniformity-T2FS
glrlm_RunLengthNonUniformity	gldm_DependenceEntropy	gldm_DependenceEntropy-T2FS
glrlm_RunEntropy	glrlm_RunLengthNonUniformity	shape_Maximum3DDiameter-T2FS
shape_Sphericity	gldm_GrayLevelNonUniformity	glszm_GrayLevelNonUniformityNormalized-T2FS

For redundancy analysis of these top 10 features, we calculated the correlation coefficients for the three datasets respectively. For T2WI, the features can be divided into two groups according to their correlation coefficients, one group contains “original_glszm_ZoneEntropy” and “original_shape_Sphericity” (in which each feature’s correlation coefficients with other features are less than 0, ranges from – 0.924 to – 0.560), and the rest 8 features combine into another group (in which each feature’s correlation coefficients with other features ranges from 0.732 to 1). For T2WI-FS, the features can also be divided to two groups, one group is made up with “original_gldm_GrayLevelNonUniformity” (this feature’s correlation coefficients with others are all less than 0, ranging from – 0.520 to – 0.786), another group is made up with the rest 9 features (the correlation coefficients ranging from 0.570 to 0.996). Similarly, for T2WI-combined, one group contains “original_glszm_GrayLevelNonUniformityNormalized-T2FS”and “original_gldm_DependenceEntropy-T2FS” (the correlation coefficients ranging from – 0.83 to – 0.533), another group contains the rest 8 features (the correlation coefficients ranging from 0.744 to 1). Therefore, the selected 10 features for the three datasets are all redundant.

4 Discussion

ETE is considered to be an important factor that affects the prognosis and recurrence of PTCs(4–6, 9), but current methods that used to predict ETE status is extremely unbalanced in specificity and sensitivity, such as US, CT and MRI(10–13). In this study, the proposed method achieves average AUC 0.760, 0.906 and 0.882 respectively on the three original feature sets: T2WI, T2WI-FS and T2WI-combined. After feature selection, the model can achieve 0.845, 0.928 and 0.913AUC within 10 selected features for T2WI, T2WI-FS and T2WI-combined. The best 0.953 AUC is achieved by selected top 39 features on T2WI-FS dataset. The AUCs showed good predictive ability of ETE status on both training and independent validation cohorts. The high AUC signified that the radiomics feature was a potential marker to discriminate ETE. Therefore, the accurate prediction of ETE preoperatively would be very important in avoiding unnecessary surgical resection in PTC patients.

The application of radiomics in medicine is still in its infancy, and additional studies are needed to examine the radiomics signatures that could predict patient aggressive behavior and outcomes [14 , 25]. Compared to the conventional image analysis, radiomics don’t rely on the subjective factors and level of expertise, and can provide objective information of the lesion image [14–16 , 26]. It may help radiologists avoid subjective bias related to experience. In our study, the pyradiomics software was used to extract radiomics features of each PTC. This method is generally used in radiomics, which allows the high-throughput extraction of informative imaging features to quantify the heterogeneity and phenotypic differences within tumors. Radiomics analysis in cancer studies has been proved to be reliable information that can increase the precision in diagnosis, evaluate the prognosis and predict treatment response. To our knowledge, there are few radiomics related studies on PTC, and most of them are based on US texture analysis [22, 24]. Compared with the number of cases, the radiomics features extracted in our study were much larger, therefore, to avoid overfitting, we performed feature selection to reduce the dimension of features before building the model. The present study suggested that radiomics features based on pre-contrast T2WI and T2WI-FS is helpful to predict aggressiveness histological feature ETE in PTC. T2WI-FS image features outperformed T2WI and T2WI-combined features in the classification. In the experiments, we also found that feature selection can improve the performance of the model and the most significant two features of the radiomics features are the heterogeneity of ROI texture and the size of the lesions.

We developed random forest models to predict the ETE status of PTC. The main advantage of random forests is that each tree in the forest is randomly different, and each tree is highly irrelevant. In theory, the more trees that make up a random forest, the less likely it is to overfit. Random forests can process data with small sample size, which is suitable for the study of small samples of two classifications, and is also robust to outliers.

There are several limitations in our study. Firstly, our study is a retrospective study, since the primary tumor area was selected as the region of interest in this study, there are inevitably some deviations that may affect the analysis. Secondly, the sample size of the study is small, more samples would be needed to avoid overfitting and build a robust model. Thirdly, our samples were all from the same batch and we used to leave one out cross validation method to validate the model. Cross validation was the best method to avoid overfit and decrease the false positive rate. Finally, the MR images we used in this study were acquired on the same scanner machine, so the imaging parameters are the same. Due to the fact that texture features can be affected by different imaging parameters, multi-center data should also be introduced to increase the robustness of the model.

In conclusion, in this paper, we proposed a novel method based on radiomics and preoperative MRI to predict the ETE status in PTCs. The experimental results show that radiomics features are helpful in improving the preoperative prediction of ETE in PTC patients.

Footnotes

Acknowledgments

This study was supported in part by a grant-in-aid for scientific research from the Technology Plan of Jiangsu (Project No. H2019087), the Technology Plan of Wuxi (Project No. MS201901) and the Science and Technology Development Plan of Wuxi (Project No. N20192027).

References

Fagin

J.A.

and Wells

S.A.

Jr. , Biologic and Clinical Perspectives on Thyroid Cancer, The New England Journal of Medicine 375(11) (2016), 1054–1067.

Liang

, Huang

, Hu

, et al., Predicting Malignancy in Thyroid Nodules: Radiomics Score Versus 2017 American College of Radiology Thyroid Imaging, Reporting and Data System Thyroid 28(8) (2018), 1024–1033.

Siegel

R.L.

, Miller

K.D.

and Jemal

, Cancer statistics, 2019. CA: a Cancer Journal for Clinicians 69 (1) (2019), 7–34.

Lee

Y.K.

, Kim

, Shin

D.Y.

, et al., The prognosis of papillary thyroid cancer with initial distant metastasis is strongly associated with extensive extrathyroidal extension: A retrospective cohort study, Annals of Surgical Oncology 26(7) (2019), 2200–2209.

Kim

J.W.

, Roh

J.L.

, Gong

, et al., Extent of extrathyroidal extension as a significant predictor of nodal metastasis and extranodal extension in patients with papillary thyroid carcinoma, Annals of Surgical Oncology 24(2) (2017), 460–468.

Amit

, Boonsripitayanon

, Goepfert

R.P.

, et al., Extrathyroidal extension: does strap muscle invasion alone influence recurrence and survival in patients with differentiated thyroid cancer? Annals of Surgical Oncology 25(11) (2018), 3380–3388.

Edge

S.B.

and Compton

C.C.

, The American Joint Committee on Cancer: the 7th edition of the AJCC cancer staging manual and the future of TNM, Annals of Surgical Oncology 17(6) (2010), 1471–1474.

Lee

H.S.

, Park

, Kim

S.W.

, et al., Correlation of minimal extrathyroidal extension with pathologic features of lymph node metastasis in patients with papillary thyroid carcinoma, Journal of Surgical Oncology 112(6) (2015), 592–596.

Abraham

, Roshan

, Tran

, et al., The extent of extrathyroidal extension is a key determinant of prognosis in T4a papillary thyroid cancer, Journal of Surgical Oncology 120(6) (2019), 1016–1022.

10.

Choi

J.S.

, Kim

, Kwak

J.Y.

, et al., Preoperative staging of papillary thyroid carcinoma: comparison of ultrasound imaging and CT, AJR Am J Roentgenol 193(3) (2009), 871–878.

11.

Gweon

H.M.

, Son

E.J.

, Youk

J.H.

, et al., Preoperative assessment of extrathyroidal extension of papillary thyroid carcinoma: comparison of 2- and 3-dimensional sonography, Journal of Ultrasound in Medicine 33(5) (2014), 819–825.

12.

Kim

S.Y.

, Kwak

J.Y.

, Kim

E.K.

, et al., Association of preoperative US features and recurrence in patients with classic papillary thyroid carcinoma, Radiology 277(2) (2015), 574–583.

13.

, Moreira

A.L.

, Hatzoglou

, et al., Using diffusion-weighted MRI to predict aggressive histological features in papillary thyroid carcinoma: A novel tool for pre-operative risk stratification in thyroid cancer, Thyroid 25(6) (2015), 672–680.

14.

Gillies

R.J.

, Kinahan

P.E.

and Hricak

, Radiomics: images are more than pictures, they are data, Radiology 278(2) (2016), 563–577.

15.

, Li

, Zhang

, et al., Radiomics analysis of multicenter CT images for discriminating mucinous adenocarcinoma from nomucinous adenocarcinoma in rectal cancer and comparison with conventional CT values, Journal of X-ray Science and Technology 28(2) (2020), 285–297.

16.

van Griethuysen

J.J.M.

, Fedorov

, Parmar

, et al., Computational radiomics system to decode the radiographic phenotype, Cancer Research 77(21) (2017), e104–e107.

17.

Al Ajmi

, Forghani

, Reinhold

, et al., Spectral multi-energy CT texture analysis with machine learning for tissue classification: an investigation using classification of benign parotid tumours as a testing paradigm, European Radiology 28(6) (2018), 2604–2611.

18.

Cui

, Yang

, Shi

, et al., Radiomics analysis of multiparametric MRI for prediction of pathological complete response to neoadjuvant chemoradiotherapy in locally advanced rectal cancer, European Radiology 29(3) (2019), 1211–1220.

19.

Wang

, Gong

, Duan

, et al., Correlation between CT based radiomics features and gene expression data in non-small cell lung cancer, Journal of X-ray Science and Technology 27(5) (2019), 773–803.

20.

Aerts

H.J.W.L.

, Velazquez

E.R.

, Leijenaar

R.T.H.

, et al., Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach, Nature Communications 5 (2014), 5006.

21.

Ortiz-Ramon

, Larroza

, Ruiz-Espana

, et al., Classifying brain metastases by their primary site of origin using a radiomics approach based on texture analysis: a feasibility study, European Radiology 28(11) (2018), 4514–4523.

22.

Sollini

, Cozzi

, Chiti

and Kirienko

, Texture analysis and machine learning to characterize suspected thyroid nodules and differentiated thyroid cancer: Where do we stand? Eur J Radiol 99 (2018), 1–8.

23.

, Zhu

, Qiu

, et al., Prediction of immunohistochemistry of suspected thyroid nodules by use of machine learning-based radiomics, AJR Am J Roentgenol (2019), 1–10.

24.

Liu

, Zhou

, Yu

, et al., Prediction of lymph node metastasis in patients with papillary thyroid carcinoma: A radiomics method based on preoperative ultrasound images, Technology in Cancer Research & Treatment 18 (2019), 1533033819831713.

25.

, Zhong

, Dong

, et al., Radiomic analysis for preoperative prediction of cervical lymph node metastasis in patients with papillary thyroid carcinoma, Eur J Radiol 118 (2019), 231–238.

26.

Lambin

, Rios-Velazquez

, Leijenaar

, et al., Radiomics: extracting more information from medical images using advanced feature analysis, European Journal of Cancer 48(4) (2012), 441–446.

27.

Zhang

, Ma

, Liang

, et al., Morphologic feature elongation can predict occlusion status following pipeline embolization of intracranial aneurysms, World Neurosurgery 119 (2018), e934–e40.

28.

Song

C.I.

, Ryu

C.H.

, Choi

S.H.

, et al., Quantitative evaluation of vocal-fold mucosal irregularities using GLCM-based texture analysis, The Laryngoscope 123(11) (2013), e45–e50.

29.

Xiong

, Shao

, Ma

, et al., Lung field segmentation using weighted sparse shape composition with robust initialization, Medical Physics 44(11) (2017), 5916–5929.

30.

Criminisi

, Robertson

, Konukoglu

, et al., Regression forests for efficient anatomy detection and localization in computed tomography scans, Medical Image Analysis 17(8) (2013), 1293–1303.

31.

Lindner

, Thiagarajah

, Wilkinson

J.M.

, et al., Fully automatic segmentation of the proximal femur using random forest regression voting, IEEE Transactions on Medical Imaging 32(8) (2013), 1462–1472.

32.

Haury

A.C.

, Gestraud

and Vert

J.P.

, The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures, PloS One 6(12) (2011), e28210.