Abstract
OBJECTIVE:
To develop and test an optimal machine learning model based on contrast-enhanced computed tomography (CT) for preoperative prediction of the pathological grade of clear cell renal cell carcinoma (ccRCC).
METHODS:
A retrospective analysis of 53 pathologically confirmed cases of ccRCC was performed, and 25 consecutive ccRCC cases were selected as a prospective testing set. All patients underwent routine preoperative plain and contrast-enhanced abdominal CT scans. Renal tumor lesions were segmented on arterial phase images and 396 radiomics features were extracted. In the training set, seven classifiers for discriminating high- from low-grade ccRCC were constructed, each based on a different machine learning model, and their performance and stability in predicting ccRCC grade were evaluated through receiver operating characteristic (ROC) analysis and cross-validation. Prediction accuracy and the area under the ROC curve (AUC) were used as evaluation indices. Finally, the diagnostic efficacy of the optimal model was verified in the testing set.
RESULTS:
The accuracy and AUC values achieved by the support vector machine with radial basis function kernel (svmRadial), random forest, and naïve Bayes models were 0.860±0.158 and 0.919±0.118, 0.840±0.160 and 0.915±0.138, and 0.839±0.147 and 0.921±0.133, respectively, indicating high predictive performance, whereas the K-nearest neighbors model yielded a lower accuracy of 0.720±0.188 and a lower AUC of 0.810±0.150. Additionally, svmRadial had the smallest relative standard deviation (RSD; 0.13 for AUC, 0.17 for accuracy), which indicates the highest stability.
CONCLUSION:
svmRadial performed best in predicting the pathological grade of ccRCC using radiomics features computed from preoperative CT images, and thus may have high clinical potential for guiding preoperative decision-making.
Introduction
Renal cell carcinoma (RCC) is one of the most common cancers, with an annual morbidity count of 295,000 worldwide [1, 2]. There are three common subtypes of kidney cancer: clear cell RCC, chromophobe RCC, and papillary RCC. Clear cell RCC (ccRCC) accounts for 75% to 87% of malignant kidney tumors [3]. The Fuhrman grading scheme has been widely used to classify ccRCC into four pathological grades (I, II, III, and IV), where the first two are low-grade tumors and the latter two are high-grade tumors. Research has shown that Fuhrman grade correlates strongly with the growth rate and prognosis of tumors, with high-grade tumors being more aggressive and more likely to metastasize [4–8].
At present, the reference standard for preoperative tumor evaluation is pathological biopsy. However, the use of biopsy is limited in clinical practice because it is invasive and carries a high risk, so a noninvasive grading method is still needed. Texture analysis (TA) is a mathematical technique that extracts complex quantitative features from images, increasing the potential information value of radiological examinations [9, 10]. Recently, it has been shown that CT texture features have the potential to predict the preoperative Fuhrman grade of ccRCC and may have great clinical significance in the diagnosis and treatment of ccRCC [11–15]. However, recent evidence also shows that these conclusions are not completely reliable, because some texture features are sensitive to acquisition parameters, and similar texture feature values are not guaranteed even when the same acquisition protocols are used [16–18]. Reproducible texture features are therefore essential for achieving consistent results. A machine learning (ML)-based TA workflow consists of several consecutive steps, and any of them may become a source of reproducibility problems [18–24].
Machine learning-based models therefore need to demonstrate their robustness, reproducibility, and performance. We accordingly sought to construct a robust, reproducible, and high-performance model to predict the Fuhrman grade of ccRCC by testing the effects of the number of features and of different machine learning methods.
Materials and methods
Patient population
The medical ethics committee of the Affiliated Changzhou No. 2 People’s Hospital of Nanjing Medical University in Changzhou, China approved our study, and all study participants signed informed consent forms. A retrospective analysis was conducted in our hospital on renal carcinoma cases confirmed by surgery or biopsy between February 2018 and December 2019, which served as the training set. In addition, cases confirmed after January 1, 2020 were collected as a prospective testing set. Patients who met the following criteria were included: renal carcinoma confirmed by surgery or pathological biopsy, a definite World Health Organization (WHO) grade assignment, and routine plain and contrast-enhanced abdominal CT scans obtained within two weeks before surgery. Cases whose image quality was too poor for proper analysis were excluded. A total of 78 ccRCC cases were finally included, 53 in the training set and 25 in the testing set. The training set comprised 37 males and 16 females (average age: 60.8±11.4 years). The baseline characteristics are shown in Table 1.
The basic information of the subjects in training and test set
Two experienced diagnostic pathologists re-examined the postoperative histopathology slides and assigned the pathological grade according to the 2016 edition of the WHO/International Society of Urological Pathology (ISUP) grading criteria for renal cancer. In the training set, 42 cases were classified as low grade (WHO I–II) and 11 as high grade (WHO III–IV); in the test set, 20 were low grade and 5 were high grade.
Instruments and scanning methods
Patients underwent plain and contrast-enhanced CT scans of the abdomen prior to surgery. A 64-slice spiral CT scanner (SOMATOM Definition; Siemens AG, Munich, Germany) was used with the following scanning parameters: 120 kV tube voltage; automatic tube current modulation; 5 mm slice thickness; and 5 mm reconstruction thickness and reconstruction interval. Omnipaque (350 mg I/mL; GE Healthcare, Chicago, IL, USA) and Ultravist (370 mg I/mL; Schering AG, Berlin, Germany) were used as contrast media, injected at a rate of 2.5 to 3.5 mL/sec and a volume of 60 to 80 mL (scaled to body weight). After plain abdominal CT imaging was completed, the arterial and venous phases were scanned with delays of 30 to 35 and 70 to 75 seconds, respectively.
Segmentation
Two experienced radiologists (R1 and R2) segmented the tumors using ITK-SNAP (version 3.6.0, www.itksnap.org), an easy-to-use software package dedicated to medical image segmentation [26]. Each radiologist reviewed the images slice by slice to choose the axial slice on which the tumor had the maximum cross-sectional area, then manually drew a two-dimensional region of interest (ROI) covering the whole area of the tumor on that slice. R1 drew the ROI twice with a one-week interval, while R2 drew it only once. Each tumor therefore had three ROIs.
Feature extraction
The AK software (Artificial Intelligence Kit, version 3.2.0; GE Healthcare), which complies with the Image Biomarker Standardisation Initiative (IBSI), was used to extract the radiomics features. Preprocessing was performed first; because the analyzed images were all CT, only gray-value discretization was applied, and the gray values of the images were discretized into 256 bins. The AK software calculates three classes of features: histogram features (n = 42), two-dimensional (2D) shape features (n = 9), and texture features (n = 345). The histogram (first-order) features describe the voxel intensity distribution within the ROI. Texture features reflect the spatial heterogeneity of the tumor within the ROI, while 2D shape features describe the two-dimensional size and shape of the ROI.
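The gray-value discretization step can be sketched as follows. This is a minimal illustration rather than the AK software's implementation: the function name `discretize_fixed_bin_number`, the sample intensities, and the binning rule (an equal-width split of the ROI intensity range into 256 bins) are assumptions, since the text states only that gray values were discretized to 256 bins.

```python
def discretize_fixed_bin_number(values, n_bins=256):
    """Map intensities to bin indices 1..n_bins using equal-width bins
    spanning the minimum to the maximum intensity in the ROI (assumed rule)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant ROI carries no intensity variation: everything is bin 1.
        return [1 for _ in values]
    bins = []
    for v in values:
        b = int(n_bins * (v - lo) / (hi - lo)) + 1
        bins.append(min(b, n_bins))  # the maximum intensity falls in the last bin
    return bins


# Hypothetical ROI intensities in Hounsfield units (not data from the study).
roi_hu = [-20.0, 35.5, 80.0, 120.0, 236.0]
bins = discretize_fixed_bin_number(roi_hu)
```

Texture matrices such as the GLCM are then built from these bin indices rather than from the raw intensities.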
Feature selection
Data preprocessing was first performed to eliminate all features with zero variance. Then, we carried out inter-observer and intra-observer agreement tests using intraclass correlation coefficients (ICCs) to evaluate the reproducibility and consistency of each feature. The features extracted from the two ROI delineations made by R1 one week apart were used to test intra-observer agreement, while the features from the first delineation of R1 and the delineation of R2 were used to test inter-observer agreement. The features that passed both the intra- and inter-observer agreement tests (ICC > 0.75) were retained, and the values based on the first delineation of R1 were used. The Mann–Whitney U test was applied to compare each feature between the two groups, and the p-values were displayed in a Manhattan plot (Fig. 1). Features with p < 0.01 were retained. Next, minimum redundancy maximum relevance (mRMR) feature selection was used to find the best feature subsets of different sizes (i.e., 5, 10, 15, 20, 25, and 30 features).
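As a rough illustration of the univariate filtering step, the sketch below computes a two-sided Mann–Whitney U test with a normal approximation and keeps the features with p < 0.01. It is a sketch under stated assumptions: the study used R, whose test may apply exact p-values or tie and continuity corrections that are omitted here, and the feature names and values are hypothetical.

```python
import math


def midranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Return (U, two-sided p) using a normal approximation.
    No tie or continuity correction is applied (simplifying assumption)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


# Hypothetical feature values as (low-grade group, high-grade group).
features = {
    "hypothetical_feature_a": ([1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]),
    "hypothetical_feature_b": ([1, 3, 5, 7], [2, 4, 6, 8]),
}
retained = [name for name, (low, high) in features.items()
            if mann_whitney_u(low, high)[1] < 0.01]
```

Only the fully separated feature survives the p < 0.01 threshold here; the interleaved one is dropped.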

Manhattan plot of the p-values of the different types of radiomics features between high and low Fuhrman grades. The y-axis is –log10(p); the blue line marks p = 0.01.
Seven machine learning methods were used to construct the classifiers separately: random forest (RF), naive Bayes (NB), adaptive boosting (Adaboost), K nearest neighbors (KNN), neural network (NN), support vector machines with linear kernel (svmLinear), and support vector machines with the radial basis function kernel (svmRadial). The classifiers were then compared with one another.
The optimal feature subset was determined from the averaged AUC values of the seven ML models trained using each feature subset. Ten-fold cross-validation was used to train the seven classifiers on the training dataset. To address class imbalance, the SMOTE algorithm [25] was used to resample the dataset and balance the classes. The mean area under the curve (AUC) of the seven classifiers was calculated for each subset, and the feature subset with the maximum mean AUC was chosen as the final feature set.
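The core idea of SMOTE, creating synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors, can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the function name `smote`, the choice of k = 5, the random seed, and the two-dimensional feature vectors are all assumptions, and the text does not state whether resampling was applied inside each cross-validation fold.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate `n_new` synthetic samples from a list of minority-class
    feature vectors by nearest-neighbor interpolation (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-feature vectors for 11 high-grade cases (not study data).
high_grade = [[float(i), float(i % 3)] for i in range(11)]
# The training set had 42 low-grade and 11 high-grade cases, so 31 synthetic
# high-grade samples would balance the two classes.
synthetic = smote(high_grade, n_new=42 - 11)
```

Because each synthetic point lies on a segment between two real minority samples, the resampled class stays within the range of the observed feature values.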
After the optimal feature subset was determined, a nested cross-validation scheme was used to repeatedly train each ML method on the training dataset. The scheme contained two loops. The inner loop used 10-fold cross-validation to train the model while tuning the hyperparameters, yielding one model and one set of performance metrics per run. The outer loop repeated the inner loop 100 times, each time with a different 10-fold split of the dataset. Each ML method therefore produced 100 models and 100 values for each metric: AUC, accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). These values were summarized using the relative standard deviation (RSD) to evaluate the robustness of the classifiers [27]. The lower the RSD, the greater the stability of the classifier. The RSD was calculated as follows:

$$\mathrm{RSD} = \frac{\sigma}{\mu}$$

where σ and μ are the standard deviation and the mean, respectively, of the 100 values of a given metric.
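The repeated splitting and the stability summary can be sketched as follows, assuming the usual definition of RSD as the standard deviation divided by the mean. The function names, the random seed, and the use of the sample standard deviation are assumptions; model training and hyperparameter tuning are omitted, and the splits are not stratified by grade.

```python
import random
import statistics


def repeated_kfold(n_samples, k=10, repeats=100, seed=0):
    """Yield `repeats` different k-fold partitions of range(n_samples)."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Every k-th shuffled index goes to the same fold; fold sizes differ by at most one.
        yield [idx[i::k] for i in range(k)]


def relative_standard_deviation(values):
    """RSD as a fraction: sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# 100 different 10-fold splits of the 53 training cases.
splits = list(repeated_kfold(n_samples=53))

# Consistency check against the reported svmRadial AUC of 0.919±0.118.
reported_rsd_auc = 0.118 / 0.919
```

Dividing the reported standard deviation by the reported mean gives about 0.13, which agrees with the RSD reported for the svmRadial AUC.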
After the optimal model with high performance and robustness was selected, it was validated in the independent test set using ROC analysis.
Statistical analysis
R software (version 3.6.1; R Foundation for Statistical Computing, Vienna, Austria) was used for statistical analysis. The ICC was used as the criterion for inter- and intra-observer agreement, and features with ICC values greater than 0.75 were treated as reproducible. The “mRMRe” package was used for mRMR ensemble feature selection. The “caret” package, which includes all seven ML methods, was used to train the seven classifiers. Differences in the features between the two classes were evaluated using the Mann–Whitney U test, and the p-values were shown in a Manhattan plot drawn using the “manhattan” function in the “qqman” package. Statistical tests with p < 0.05 were considered significant unless otherwise noted.
Results
A total of 11 patients with high-grade tumors and 42 patients with low-grade tumors were included in the training set (Table 2). The mean patient age was 63.3±11.0 years among the high-grade cases and 60.1±11.6 years among the low-grade cases.
The performance metrics of training models using 7 different machine learning methods
Inter- and intra-observer agreement tests showed that 380 features had good reproducibility (ICC > 0.75; intra-observer ICC range: 0.78–1; inter-observer ICC range: 0.75–0.98; Fig. 2). The classifiers were then trained with six feature subsets (containing 5, 10, 15, 20, 25, and 30 features, respectively). Across the classifiers, the mean AUC was highest with the 10 selected features (Fig. 3).

Manhattan plots of the inter-observer (left) and intra-observer (right) ICCs. The different colors correspond to different classes of features. The blue line represents ICC = 0.75.

The mean AUC values of the predictive models constructed with different numbers of radiomics features. The predictive models with 10 features had the highest mean AUC value.
Next, we used a heatmap to show whether the 10 selected features differed between the low- and high-grade tumor groups; visual assessment indicated that the features differed between the two groups (Fig. 4). With the final feature set, we compared the metrics among the seven classifiers. As shown in Fig. 5 and Table 2, svmRadial had the highest accuracy (0.86), and NB had the highest AUC (0.921) and sensitivity (0.935). In contrast, KNN had the lowest accuracy (0.72).

Heatmap of the 10 selected features for each patient. Green indicates large values and red indicates small values.

The performance metrics of 7 predictive models built with different machine learning methods.
In the stability analysis, the most stable classifier was svmRadial (RSD: 0.13 for AUC, 0.17 for accuracy), followed by NB (RSD: 0.14 for AUC, 0.18 for accuracy), RF (RSD: 0.15 for AUC, 0.19 for accuracy), Adaboost (RSD: 0.16 for AUC, 0.22 for accuracy), svmLinear (RSD: 0.17 for AUC, 0.21 for accuracy), and NN (RSD: 0.18 for AUC, 0.22 for accuracy). KNN (RSD: 0.18 for AUC, 0.21 for accuracy) showed the worst stability (Fig. 6). Overall, svmRadial (RSD: 0.13 for AUC, 0.18 for accuracy; AUC: 0.919±0.118, accuracy: 0.860±0.158) outperformed the other classifiers.

Model performance metrics (left: AUC; right: accuracy) versus model stability. The smaller the RSD%, the more stable the model.
After modeling and internal validation, external validation via ROC analysis was performed on the optimal model, the svmRadial classifier trained using the 10-feature subset. The performance of the optimal model in external validation is shown in Fig. 7. The AUC was 0.86 (95% confidence interval: 0.67–1.00), and the accuracy, sensitivity, specificity, PPV, and NPV were 0.88 (95% confidence interval: 0.60–1.00), 0.80, 0.90, 0.67, and 0.95, respectively.
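The relationship among these test-set metrics can be illustrated with a small worked example. The confusion-matrix counts below are not reported in the text; they are inferred from the stated test-set composition (5 high-grade and 20 low-grade cases) and the reported sensitivity and specificity, with high grade assumed to be the positive class. The function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Inferred counts (assumption): 4 of 5 high-grade and 18 of 20 low-grade
# test cases classified correctly.
metrics = binary_metrics(tp=4, fn=1, tn=18, fp=2)
rounded = {name: round(value, 2) for name, value in metrics.items()}
```

Under these inferred counts the rounded values reproduce the reported accuracy, sensitivity, specificity, PPV, and NPV. With only five high-grade cases, two false positives are enough to bring the PPV down to 4/6.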

The ROC curve of the model built using the svmRadial machine learning method.
Discussion
The choice of surgical approach for ccRCC is closely related to prognosis and nuclear grade, so accurate preoperative prediction of the nuclear grade is important for treatment decision-making and prognosis prediction in patients with ccRCC. Current routine clinical imaging methods have limited value for preoperative pathologic grading of tumors [28–30]. While it is difficult for conventional imaging techniques to detect early microscopic tumors, radiomics can support more accurate diagnoses and treatment decisions by comprehensively evaluating tumor heterogeneity. Radiomics goes beyond conventional imaging by providing high-throughput, mineable, quantitative features that are invisible to the naked eye. These features are likely correlated with pathology and are potential biomarkers for clinical diagnosis and treatment [31–33]. As an economical and noninvasive method for the overall assessment of tumors, radiomics can screen and evaluate tumor features. Studies have shown that ML models based on patient tumors in combination with clinical information can make more accurate predictions than human experts [34].
To successfully implement radiomics-based predictive analysis, different models need to be evaluated and compared. However, only a few studies to date have explored and contrasted models based on different ML methods. For example, Zhang et al. investigated the diagnostic efficiency of nine ML models in the preoperative differentiation of lesions located at the anterior skull base and identified LDA as the optimal classification algorithm, with an AUC value greater than 0.8 [35]. Parmar et al. compared the performance of 11 ML methods for predicting OS in head and neck cancer patients and found that NN (AUC: 0.62, RSD: 10.52), RF (AUC: 0.61, RSD: 7.36), and NB (AUC: 0.67, RSD: 11.28) had high prognostic performance and stability [27]. In addition, Zhang et al. investigated the performance of nine ML methods for predicting local failure and distant failure in advanced nasopharyngeal carcinoma and found that RF (AUC: 0.85) and Adaboost (AUC: 0.82) had high prognostic performance and stability [36]. Therefore, there is no “one-size-fits-all” model applicable to all tumors. Some studies have assessed radiomics of ccRCC by combining volumetric CT TA with ML [37]. However, the optimal ML method for predicting pathological grade in patients with ccRCC has not yet been determined.
In our study, we investigated and compared seven machine learning methods for preoperative prediction of the nuclear grade of ccRCC. Contrast-enhanced CT images were used for feature extraction. Almost 400 quantitative radiomics features were analyzed, and feature selection was carried out. We tested different numbers of features, with mRMR as the feature selection method, to identify the optimal number of features for constructing the final model [29]. The classifiers were trained with six sets of selected features of different sizes (i.e., 5, 10, 15, 20, 25, and 30 features). The performance of the seven ML methods was evaluated using AUC values, and RSD values were used to quantify the stability of each method. The average performance with 10 selected features attained the highest AUC values, and svmRadial (AUC: 0.919±0.118, accuracy: 0.860±0.158) showed the greatest predictive performance and stability. Our results therefore suggest that svmRadial should be the preferred model for predicting nuclear grade in patients with ccRCC.
Next, we drew a heatmap of the 10 selected features and observed that the features differed between the low- and high-grade groups. To increase the reliability of the experimental results, we performed external validation. A total of 25 prospective ccRCC cases were evaluated, and the model again achieved high performance, indicating that the model has potential in clinical application. Finally, we confirmed that the model with the best stability and the model with the best predictive performance in our study were one and the same, making svmRadial the best ML method for predicting the grade of ccRCC.
We also tested the inter- and intra-observer agreement of the different segmentations. As shown in the supplemental materials (Fig. 8), for tumors with clear margins, the segmentations produced by different observers or at different times were similar, whereas for tumors with vague margins the segmentations varied greatly. We also found that the histogram, shape, and GLSZM features were reproducible, because these features summarize information over the whole ROI, so slight changes in a few voxels may not affect them. Some of the GLCM and RLM features were sensitive to the tumor segmentation because they capture detailed information within the ROI, so a small change may affect the computed feature values (Fig. 2).

Tumor segmentations by different radiologists at different times. The first row shows a tumor with vague margins, and the second row shows a tumor with clear margins. The first column shows the raw enhanced CT images; columns 2–4 show the segmentations by R1 at the first time, R1 at the second time, and R2, respectively.
Many studies split the dataset into training and test sets only once using random resampling, which may introduce bias from an unrepresentative split. The results of such studies may be unreliable, and there is a risk that only the best results were reported. In addition, many factors affect model performance, such as the number of features and the machine learning method. Previous studies did not examine these factors and simply used a multivariate logistic regression model to train the final model. Furthermore, they used only retrospective data, so their clinical applicability remained unclear. In our study, we used nested cross-validation to train the predictive model: the dataset was divided repeatedly 100 times, and the averaged performance metrics were obtained to demonstrate model robustness. We also trained the model with feature sets of different sizes and with different ML methods to find the optimal model. Finally, 25 cases were collected as an independent prospective validation set to validate the final model and demonstrate its clinical usefulness.
Importantly, our study had several limitations. First, larger sample sizes are needed to validate and evaluate the generalizability of our model. To mitigate this, we repeatedly trained the classifiers 100 times with different random seeds and used RSD values to quantify model stability. Second, only CT features were used. Recent studies have illustrated that multimode magnetic resonance imaging has the potential to noninvasively predict the nuclear grade of ccRCC [38]. Li et al. built a radiomics model based on magnetic resonance images to predict the WHO/ISUP nuclear grade of ccRCC [39]. Goyal et al. conducted TA based on magnetic resonance images for histological subtyping and grading of RCC, which is the direction of our future research [40]. Our model might be improved by adding features from other modalities. Third, the PPV of svmRadial in our model was low, which may be related to the small number of high-grade patients in our study. Finally, it is not clear how these features relate to the underlying biological mechanism of the nuclear grade of ccRCC, which should be further explored.
In conclusion, we investigated the role of contrast-enhanced CT texture features in preoperative nuclear grade prediction of ccRCC and compared seven ML models. svmRadial with the top 10 features showed the best diagnostic performance and the greatest stability. Such radiomics-based models can maximize the value of medical images. Determining an optimal radiomics-based ML method to predict nuclear grade noninvasively and preoperatively is of great significance for the initial diagnosis, evaluation, and treatment planning of patients with ccRCC.
