Abstract
Current machine learning models in artificial intelligence can improve prediction accuracy, but their underlying logic often remains opaque. To provide high prediction accuracy while enhancing model interpretability, this study selects the Extreme Gradient Boosting (XGBoost) model after comparing multiple single learners and integrated learning models. A probabilistic statistical prediction model for cancer is then constructed through parameter optimization, and its performance and interpretability are analyzed. The experimental results showed that the Area under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve for the single learners was generally lower than 80%, while the AUC of the XGBoost model was 84.4%, surpassing that of the comparison models. In addition, an Alpha-Fetoprotein value greater than 13.5 had a stronger predictive effect when combined with other factors. Serum Alanine Aminotransferase and Alpha-Fetoprotein assay values near 0 may produce either negative or positive effects, whereas higher values are more likely to produce a positive effect, which is in line with their clinical significance. Overall, XGBoost effectively improves out-of-sample prediction accuracy and interpretability, which is significant for practical liver cancer diagnosis prediction.
Keywords
Introduction
Currently, liver disease is the second and third leading cause of death in males and females, respectively. Moreover, liver cancer contributes to 4.7% of all newly diagnosed cancer cases globally [1]. Cancer diagnosis is based on imaging and liver tissue biopsy, but these two tests are not very useful for early diagnosis and screening. Liver biopsy is the current “gold standard” for tumor detection. The basic principle of this method is to perform pathological tests on the liver of patients with intermediate to advanced liver cancer after liver resection, which can accurately determine whether it is liver cancer or not. However, this process is invasive, causing great damage to patients [2]. Computer technology has facilitated the application of computer-aided diagnosis for hepatic cellular cancer, which has become a contemporary trend in clinical diagnosis. Liu et al. proposed a computer-aided diagnosis method for liver cells based on ultrasound and parametric imaging [3]. Yoon et al. proposed a computer-aided detection technique for breast cancer using artificial intelligence by retrospective analysis [4]. Virupakshappa et al. proposed a classification method for brain tumors based on spatial fuzzy and Artificial Neural Network (ANN) techniques [5]. With the rapid development of artificial intelligence technology and machine learning technology, significant progress has been made in the diagnosis and treatment of liver cancer. Based on the latest artificial intelligence technology, deep learning algorithms can be used to identify early signs of liver cancer from imaging data such as CT, MRI, and ultrasound. These algorithms, by learning a large amount of image data, are able to identify small abnormal changes that may be overlooked by the human eye, thus achieving earlier lesion detection than traditional methods. 
Although artificial intelligence and machine learning technologies have made progress in the field of liver cancer, they still face some challenges, such as interpretability and integration issues in clinical practice. Therefore, it can be seen that although current machine learning models can improve prediction accuracy, their inherent decision-making logic lacks transparency, which is particularly problematic in clinical applications because medical decisions require high interpretability to win the trust of doctors and patients. In response to this issue, the research aims to explore machine learning methods that can provide high prediction accuracy while enhancing model interpretability. Extreme Gradient Boosting (XGBoost) is an advanced machine learning technology that has been widely applied and recognized in many fields due to its high efficiency and flexibility, regularization, ability to handle missing values, and customizable optimization objectives and evaluation criteria. XGBoost is a type of integrated learning method, particularly an implementation of Gradient Boosting Decision Tree (GBDT). XGBoost can improve the prediction accuracy by combining the prediction results of multiple Decision Tree (DT) models. Therefore, in the scenario of liver cancer prediction, XGBoost can construct a model for predicting liver cancer risk by learning historical medical records, biomarkers, imaging features, and other information of liver cancer patients. By effectively combining and analyzing these features, XGBoost can help doctors identify high-risk patients and carry out early intervention.
The study constructs a probabilistic statistical prediction model based on XGBoost for clinical liver cancer research. The model is intended for early cancer prediction and diagnosis: it predicts the likelihood of hepatitis patients developing primary cancer and the impact of related variables to assist doctors in clinical diagnosis, and it describes the roles and relationships of each variable. The contribution of the research lies in applying XGBoost to liver cancer risk prediction and in conducting in-depth parameter optimization of the XGBoost model to ensure its optimal performance in liver cancer prediction. At the same time, an interpretability analysis of the prediction results is provided, and the key factors in model decision-making are revealed using methods such as feature importance assessment and SHapley Additive exPlanations (SHAP) analysis, enhancing the transparency of the model.
The research is divided into four parts. The first part summarizes the application of predictive models in medicine using current artificial intelligence technology. The second part is to analyze the prediction method under the interpretable artificial intelligence logic, which includes feature screening and the liver cancer prediction model construction under the single learner and integrated learning. The third part is to verify the performance of the liver cancer prediction model. The fourth part is a summary of the entire article.
Related work
Advances in artificial intelligence technologies are driving the application of machine learning in clinical research, which has resulted in several predictive models that aid physicians in making effective decisions [6]. Machine learning methods are widely used in clinical prediction, imaging, genomics, and other fields, and corresponding prediction models have been developed to support decision-making. By analyzing existing data with machine learning models, cancer patients can be accurately differentiated from healthy individuals. Different types of data carry different information, and the combination of genomics, imaging, and clinical data is common in research. In addition, thanks to the rapid development of machine learning algorithms, the detection of biomarker effects and research on gene expression have become easier. Dafni Rose J et al. proposed a novel computer-aided diagnosis method using mobile networks based on machine learning and deep learning for the accurate identification of early breast cancer, which effectively improved breast cancer identification performance [7]. Adla et al. proposed a novel computer-aided diagnosis method using mobile networks based on automatic deep learning for the automatic detection of skin cancer. The method detected skin lesions with a category attention layer, thus improving skin cancer recognition accuracy [8]. Chebbah et al. effectively reduced the damage [9]. Kalsoom et al. proposed a residual U-type network based on deep learning for the accurate segmentation of liver cancer, thus effectively enhancing tumor segmentation [10]. Sahu et al. developed a computer-aided integrated method for breast cancer diagnosis using Support Vector Machine (SVM) to assist in early cancer screening, resulting in increased diagnostic accuracy and classification effectiveness [11].
In addition, Cheng et al. constructed a single computer-aided detection model by developing a flexible three-dimensional depth algorithm for the diagnosis of liver cancer under different clinical situations, thus effectively enhancing the diagnostic efficiency [12]. Jain et al. proposed a neighborhood adaptive approach utilizing deep learning to improve liver tumor detection in multi-phase X-ray computed tomography images. The proposed adaptation method effectively enhanced the computer-aided diagnosis accuracy [13]. Calderaro et al. proposed a novel computer-aided diagnosis method based on artificial intelligence techniques for problems related to the diagnosis and prognosis of hepatocellular carcinoma, thus effectively enhancing the risk prediction of hepatocellular carcinoma [14]. Dhillon A et al. developed a novel method utilizing machine learning and deep learning for biomarker recognition, which aimed to address cancer diagnosis and prognosis issues. Their method may provide effective assistance in improving cancer diagnosis and prognosis [15].
Most current studies build their data samples by manual matching, which generates a limited number of classes with balanced positive and negative classes. However, this approach can introduce bias in practical applications, an understanding supported by research from domestic and foreign scholars. The perspective of this research, which uses large data samples with unbalanced positive and negative classes, is therefore innovative, and the results obtained are more consistent with actual clinical results. In addition, the study uses interpretable methods to investigate the reasonableness of model predictions, which is a further innovation.
Explanation of prediction method under artificial intelligence logic
The interpretability of models is crucial in clinical research, yet it is challenging to explain the predictive behavior of machine learning models in the medical field. Therefore, this section constructs a probability prediction model based on interpretable artificial intelligence logic.
Feature filtering with single learner
To address the problem that traditional machine learning models can improve prediction accuracy but cannot explain why their predictions improve, an XGBoost probabilistic statistical prediction model is constructed for clinical liver cancer research. Model interpretability is very important in clinical research because it affects patient safety, so the prediction model should pursue both prediction accuracy and interpretability. Feature screening is a crucial step in feature engineering and plays a significant role in developing prediction methods. Feature engineering converts data characteristics into features that represent all dimensions of the data; however, not all features have an impact on prediction [16]. Screening the features of the original data reduces the impact of noise on the model and makes it easier to discover information that is beneficial for prediction. By analyzing existing data, machine learning can identify and mine rules and information that facilitate predictions and decisions for new data [17]. The machine learning classification is shown in Fig. 1.
Schematic diagram of classification content in machine learning.
Fig. 1 gives a visual overview of machine learning methods and helps in understanding the types, characteristics, and relationships of the algorithms and technologies in the field. Depending on whether the input data carry labels, machine learning can be divided into supervised learning, unsupervised learning, and semi-supervised learning. Logistic regression models are widely used in supervised learning because they are interpretable and show how each factor influences the prediction while the other factors are held constant [18]. Logistic regression is a regression analysis in which the dependent variable is dichotomous; it models the relationship between this dependent variable and multiple independent variables and can be used to predict the probability of a specific event occurring within a certain range [19]. The present study focuses on predicting liver cancer using both single learners and ensemble models. A single learner is a single model used independently in machine learning, and logistic regression is a common example, especially for binary classification problems. A logistic regression model applies the Sigmoid function to a linear combination of features, mapping any real number to the interval from 0 to 1 as a predicted probability. The output can be interpreted as a probability estimate for a specific category, and the model is particularly common in medical applications because of its simplicity and high interpretability. In the single learner, assuming
In Eq. (1),
In Eq. (2),
In Eq. (3),
In Eq. (4),
In Eq. (5),
In Eq. (6),
In machine learning, the optimal parameter set is commonly derived by minimizing a loss function that measures the deviation between the predicted and actual values of the model. By taking the average log-likelihood loss across the entire dataset, maximizing the likelihood function is transformed into minimizing the loss function, which can be solved through optimization techniques such as gradient descent. The minimization problem is shown in Eq. (8).
In Eq. (8),
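The logistic regression steps described above (a Sigmoid applied to a linear combination of features, followed by minimizing the average log-likelihood loss with gradient descent) can be illustrated with a minimal sketch. The function names, the learning rate, the number of steps, and the toy data are assumptions chosen for illustration; they are not the notation or settings of the study.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # numerically stable branch for negative z
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Predicted probability of the positive class for one sample x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

def average_log_loss(weights, bias, X, y):
    """Average negative log-likelihood over the whole dataset."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, label in zip(X, y):
        p = min(max(predict_proba(weights, bias, x), eps), 1.0 - eps)
        total -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total / len(X)

def fit_logistic(X, y, learning_rate=0.5, steps=500):
    """Minimize the average log loss with batch gradient descent."""
    n = len(X)
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(steps):
        # Gradient of the log loss: (predicted probability - label) * feature.
        errors = [predict_proba(weights, bias, x) - label for x, label in zip(X, y)]
        grad_w = [sum(e * x[j] for e, x in zip(errors, X)) / n
                  for j in range(len(weights))]
        grad_b = sum(errors) / n
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias

# Toy data (hypothetical): one feature, larger values belong to the positive class.
X_toy = [[0.0], [1.0], [2.0], [3.0]]
y_toy = [0, 0, 1, 1]
w_fit, b_fit = fit_logistic(X_toy, y_toy)
```

After fitting, `predict_proba(w_fit, b_fit, [3.0])` is above 0.5 and `predict_proba(w_fit, b_fit, [0.0])` is below it, and the loss is lower than at the all-zero starting point.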
The process of combining single learners is referred to as integrated learning, which aggregates multiple weak learners into a stronger and more comprehensive ensemble. Depending on how the weak learners are generated, current integrated learning models mainly consist of Bagging and Boosting algorithms. A representative Boosting algorithm is XGBoost. The brief definition of XGBoost is shown in Eq. (9).
In Eq. (9),
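For orientation, the additive tree ensemble that XGBoost fits and its regularized objective are commonly written as below. This is the standard form from the XGBoost literature, given as a sketch; the symbols K (number of trees), f_k (the k-th tree), l (a differentiable loss), T (number of leaves), w (leaf weights), and the penalties \gamma and \lambda are assumed here and may differ from the notation of Eq. (9).

```latex
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

\mathcal{L} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert w \rVert^{2}
```

The first term rewards fitting the training labels, while \Omega penalizes trees with many leaves or large leaf weights, which is the regularization referred to in the introduction.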
In Eq. (10),
In Eq. (11),
In Eq. (12),
In Eq. (13),
Schematic diagram of correlation coefficient heat diagram.
In Fig. 2, the correlation coefficient heat map visualizes the strength of the correlations between variables. Each cell represents the correlation coefficient between a pair of variables, and the color depth indicates the strength of the correlation: the darker the color, the stronger the correlation. Most of the independent variables carry unique information with little overlap, which implies that the probabilistic statistical prediction model can be constructed without discarding any important variable because of overlapping information. For the probabilistic statistical prediction model of liver cancer, the proportions of individuals with and without the disease in the sample are unequal; for malignant tumors and similar diseases in particular, individuals with the disease make up less than 10% of the sample. Conventional classifiers treat accuracy as the primary measure of prediction performance. With imbalanced categories, however, a high accuracy can conceal poor classification of the minority category, so accuracy alone is not an appropriate primary evaluation index. Because the study analyzes liver cancer patients, it is a typical category-imbalanced classification problem. To solve it, the most applicable sampling scheme must be selected for each model, and performance must be compared on the test set so that the models can be interpreted more reasonably.
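The point about accuracy under category imbalance can be made concrete with a small sketch. The sample below is hypothetical (5 positives among 100 individuals, chosen only to match the "less than 10%" situation described above), and the function name is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical sample: 5 positives among 100 individuals.
y_true = [1] * 5 + [0] * 95
# A useless classifier that labels everyone as negative.
always_negative = [0] * 100
metrics = classification_metrics(y_true, always_negative)
```

Here the classifier reaches 95% accuracy while its recall is 0, which is why recall, precision, F1-score, and AUC are reported alongside accuracy.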
In addition, in the local correlation analysis of the interpretable model for liver cancer prediction, the study uses SHAP values, which represent the weighted average of the marginal contribution of one feature over all possible combinations of the other features. SHAP is an interpretable additive model inspired by the Shapley value in cooperative game theory, and it aims to explain the contribution of each feature to the model's prediction results. In the construction of disease prediction models, especially when facing class imbalance, the key is to improve the model's ability to recognize actual cases. Each prediction can therefore be explained in detail by calculating the impact of each feature value on the model's prediction and comparing it with the average predicted value. The SHAP value is shown in Eq. (14).
In Eq. (14),
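The idea of a weighted average of marginal contributions can be sketched with an exact Shapley computation on a toy model. Everything here is illustrative: the two-feature model, the instance, and the all-zero baseline are hypothetical, and absent features are simply replaced by baseline values, which is a simplification of how SHAP estimates expectations over background data.

```python
from itertools import combinations
from math import factorial

def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi

def toy_model(x):
    """Hypothetical model with an interaction between its two features."""
    return 2.0 * x[0] + 3.0 * x[1] + x[0] * x[1]

instance = [1.0, 2.0]   # the prediction being explained
baseline = [0.0, 0.0]   # reference values for "absent" features

def value_fn(coalition):
    """Model output when only the features in `coalition` take instance values."""
    mixed = [instance[j] if j in coalition else baseline[j]
             for j in range(len(instance))]
    return toy_model(mixed)

phi = shapley_values(value_fn, 2)
```

The two contributions sum to the difference between the prediction for the instance and the prediction for the baseline, which is the additivity property that makes each prediction decomposable feature by feature. The cost grows exponentially with the number of features, so practical tools rely on approximations or tree-specific algorithms.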
Empirical analysis of liver cancer prediction model
To verify the performance of the XGBoost model, the study first addresses the category imbalance for a single learner. The data are clinical data of liver cancer patients from a tertiary hospital over the past 6 years. The category imbalance is handled at the data level: on the training set obtained after splitting, random over-sampling, random under-sampling, and the Synthetic Minority Oversampling Technique (SMOTE) are used to balance the category distribution, so that the ratio of positive (hepatocellular carcinoma) to negative (non-hepatocellular carcinoma) categories is 1:1. A basic logistic regression model is then fitted and used to predict on the test set. The analysis indicators include the Area under Curve (AUC) of the Receiver Operating Characteristic (ROC) curve, recall, precision, F1-score, and accuracy. In addition, the Least Absolute Shrinkage and Selection Operator (LASSO) is applied to screen the variables, and the predictive indicators of the resulting logistic regression model are obtained. The results are shown in Table 1.
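Random under-sampling and random over-sampling of a training set can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names, the `ratio` argument (positives divided by negatives), the fixed seed, and the toy data are assumptions, and the sketch assumes both categories are present. SMOTE is not shown because it synthesizes new samples rather than resampling existing ones.

```python
import random

def split_indices(y):
    """Indices of positive (1) and negative (0) samples."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    return pos, neg

def random_under_sample(X, y, ratio=1.0, seed=0):
    """Randomly drop negatives so that positives / negatives is about `ratio`."""
    rng = random.Random(seed)
    pos, neg = split_indices(y)
    n_neg = min(len(neg), round(len(pos) / ratio))
    keep = pos + rng.sample(neg, n_neg)
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

def random_over_sample(X, y, ratio=1.0, seed=0):
    """Randomly duplicate positives so that positives / negatives is about `ratio`."""
    rng = random.Random(seed)
    pos, neg = split_indices(y)
    n_extra = max(0, round(len(neg) * ratio) - len(pos))
    keep = pos + neg + [rng.choice(pos) for _ in range(n_extra)]
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical training set: 5 positives and 95 negatives.
X_train = [[i] for i in range(100)]
y_train = [1] * 5 + [0] * 95

X_under, y_under = random_under_sample(X_train, y_train, ratio=1.0)
X_over, y_over = random_over_sample(X_train, y_train, ratio=1.0)
```

With a 1:1 ratio, under-sampling keeps all 5 positives and 5 randomly chosen negatives, while over-sampling duplicates positives until there are 95 of each. Resampling is applied only to the training set, so the test set keeps its original distribution.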
Randomly sampled logistic regression model predictive indicators and variable screened predictive indicators
From Table 1, oversampling gave the highest value on the precision index, 18.2%, and random oversampling gave the highest recall, 75.9%. Random oversampling was therefore selected for subsequent logistic regression model validation, and after variable screening with LASSO the model reached a recall of 75.9%, an AUC value of 80.4%, and an accuracy of 84.5%. Taken together, random oversampling is the most effective, but it is inherently unable to address the sparsity of the minority category and the inadequate data representation. Therefore, the study uses the randomly under-sampled data to fit the models. The comparison models are SVM, K-Nearest Neighbor (KNN), Naive Bayes, and ANN. The SVM uses linear and polynomial kernels (set to 1 and 2). The KNN uses 1, 3, and 5 nearest neighbors. The ANN has 50 neurons in the first hidden layer and 40 in the second. The results are shown in Fig. 3.
Other machine learning model prediction indicators.
According to Fig. 3, the ANN had the best overall prediction performance among the compared algorithms, with an accuracy rate of 82.2%. The reason is that an ANN can capture and simulate complex data relationships while automatically learning and extracting useful features from a large amount of data. Taken together, however, the performance of these models differed relatively little from that of logistic regression, which may indicate that they do not fully capture all the useful information in the data. Both recall and accuracy values are low, so more sophisticated algorithms are adopted to further fit the model. Because the performance of individual learners is generally poor, the study explores balanced learning as a solution to the category imbalance problem. The first step is to validate the performance of the XGBoost model proposed in the study. The prediction metrics of the XGBoost model with different sampling methods and sampling ratios are shown in Fig. 4.
Prediction index results of XGBoost model under different sampling methods and proportions.
In Fig. 4, 1–5 denote the five sampling methods: Original, Oversampling, Under, SMOTE, and Adaptive Synthetic (ADASYN). Labels 6–11 denote positive-to-negative category proportions of 1, 0.9, 0.8, 0.66, 0.5, and 0.33, respectively. Based on Fig. 4, method 1 without sampling had the lowest recall rate of 14.0%. The random under-sampling method 3 had the highest recall rate of 86.0%, which indicated good out-of-sample prediction performance under under-sampling. In addition, as the ratio of positive (cancer) to negative categories decreased, the recall rate gradually decreased while the accuracy gradually improved. When both categories had an equal number of samples, the recall rate stabilized at 86%, with an AUC value of 85.8%, giving the best prediction effect among the proportions. At the same time, in the process of improving the recall of cancer individuals, it is preliminarily determined that the optimal random under-sampling strategy is positive class: negative class
The optimal number of classifiers in XGBoost model and the determination of sample weights for leaf nodes in tree size.
According to Fig. 5, the reduction in loss on the validation set slowed after approximately 75 iterations, while on the training set it slowed after 150 iterations. As the number of iterations increases, the model continues to learn on the training set but improves only slowly on the validation set, which indicates over-fitting. Consequently, setting the number of trees at 100 is more suitable. The training time is set at 80. On this basis, the depth of the tree is set to 3. When the minimum leaf node sample weight of the tree was also 3, the AUC value was higher, at 93.14%. The tuning results for the gamma parameter, subsample parameter, column sampling parameter (colsample_bytree), and regularization parameter under these two optimal parameters are shown in Fig. 6.
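One common way to turn a validation loss curve into a choice for the number of boosting iterations is an early-stopping rule. The sketch below is an illustration under stated assumptions, not the procedure used in the study: the function name, the `patience` and `min_delta` arguments, and the synthetic loss curve are all hypothetical.

```python
def best_iteration(val_losses, patience=10, min_delta=0.0):
    """Return the 1-based iteration with the lowest validation loss,
    stopping once no improvement has been seen for `patience` iterations."""
    best_loss = float("inf")
    best_iter = 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_iter = i
        elif i - best_iter >= patience:
            break  # validation loss has stopped improving
    return best_iter

# Synthetic curve (hypothetical): the loss falls for 75 iterations, then creeps up.
val_curve = ([1.0 - 0.01 * i for i in range(1, 76)]
             + [0.25 + 0.001 * j for j in range(1, 76)])
chosen = best_iteration(val_curve)
```

On this synthetic curve the rule selects iteration 75. In practice a somewhat larger number of trees may still be kept when the validation loss is flat rather than rising.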
From Fig. 6, when the gamma parameter was 0, the highest AUC value of 93.2% was obtained on the validation set. The highest AUC value of 93.15% was obtained when the subsample parameter and the colsample_bytree parameter were both 0.8. The AUC value gradually decreased as the regularization parameter increased, so the regularization parameter is set to 0.00001. With these parameter settings, the best prediction results of the XGBoost model are determined by adjusting the learning rate. The Adaptive Boosting (AdaBoost) model and the Random Forest (RF) model are introduced for comparison and tuned with the same approach, and the predictive metrics of the models under their optimal parameters are compared. The results are shown in Table 2.
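For readers who want to reproduce a comparable configuration, the tuned values reported above can be collected under the parameter names used by the xgboost Python package. This mapping is an assumption: the text does not state which regularization term was tuned, so `reg_alpha` (L1) is used here as a placeholder and `reg_lambda` (L2) would be the alternative; the learning rate is omitted because its final value is not given.

```python
# Tuned values reported in the text, keyed by xgboost parameter names.
xgb_params = {
    "n_estimators": 100,       # number of trees
    "max_depth": 3,            # depth of each tree
    "min_child_weight": 3,     # minimum sample weight in a leaf node
    "gamma": 0,                # minimum loss reduction required to split
    "subsample": 0.8,          # fraction of rows sampled per tree
    "colsample_bytree": 0.8,   # fraction of columns sampled per tree
    "reg_alpha": 1e-5,         # ASSUMPTION: the text does not say L1 or L2
}

# Usage sketch (requires the xgboost package; not executed here):
# from xgboost import XGBClassifier
# model = XGBClassifier(**xgb_params)
```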
Comparison results of different models with optimal parameters for predicting indicators
XGBoost model remaining parameter optimization results.
Table 2 illustrates that, under the optimal parameters, the XGBoost model achieved a recall rate of 84.0% and an AUC value of 84.4%. These values are superior to those of AdaBoost and RF. Overall, the XGBoost model surpasses the single learners, AdaBoost, and RF.
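Because AUC is the main comparison metric, it may help to recall what it measures: the probability that a randomly chosen positive individual receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.
    Ties between a positive and a negative score count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical scores): three of the four pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Unlike accuracy, this quantity does not depend on a decision threshold or on the category proportions, which makes it suitable for the imbalanced setting studied here.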
In terms of interpretability, interpreting the XGBoost model can deepen the understanding of its internal mechanism. The first step is to rank the importance of the data variables in Fig. 2 and the other logistic regression variables, as shown in Fig. 7.
Results of the importance of relevant variable features.
Verification results of the impact of a single feature on model prediction and the synergistic effect of two features.
Fig. 7 shows that serum Alpha-Fetoprotein (AFP) and serum Alanine Aminotransferase (ALT) had the greatest impacts on the XGBoost model prediction. Therefore, interpretability methods are used to investigate how these two variables affect the prediction results. Identifying and understanding the impact of biomarkers such as AFP and ALT on liver cancer prediction makes it possible to better understand the predictive behavior of the model and to use this knowledge to optimize its performance and application. The validation results for the effect of a single feature on the model prediction and for the synergistic effect of two features are shown in Fig. 8.
From Fig. 8, the impact of AFP on the XGBoost model's prediction of cancer grew as the AFP value increased, and the change in prediction tended to stabilize once the value reached 250. An increase in ALT also had a positive effect on the prediction; a negative effect appeared once the value exceeded 50, but it was smaller. When the combined effect was considered, AFP values greater than 13.5 had a larger impact on the prediction, whereas the impact of ALT was smaller. Exploring the joint impact of two features on model prediction is a useful tool. However, it can describe the effect of at most two features simultaneously, and the results can be biased when the two features are strongly associated. Therefore, this study uses SHAP values to explore the impact of associated features on model prediction. The overall SHAP values are shown in Fig. 9.
Schematic diagram of overall SHAP values.
Based on Fig. 9, the SHAP values of AFP, ALT, and age ranked highest overall. This aligns with the feature importance results in Fig. 7 and suggests that AFP has the greatest impact on model prediction in clinical practice. Based on this, the partial dependency graph of AFP is shown in Fig. 10.
Partial dependency graph of AFP.
Based on Fig. 10, the AFP detection value is more likely to produce positive effects as it increases, which aligns with its clinical significance. From Figs 3 to 10, it can be seen that the logistic regression model tends to rely on information such as gender and past medical history (for example, pulmonary embolism) for prediction, while the XGBoost model can effectively use liver injury indicators. It relies mainly on disease indicators and uses medical history only for reference, which gives its predictions high credibility.
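A partial dependence curve of the kind discussed for Fig. 10 can be sketched in a few lines: the feature of interest is fixed at each grid value for every individual, and the model predictions are averaged. The function name, the toy prediction function, and the data are hypothetical; a fitted classifier's probability output would take the place of `toy_predict`.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction over the dataset with one feature fixed at each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)       # copy so the data are not changed
            modified[feature] = value  # fix the feature of interest
            total += predict(modified)
        curve.append(total / len(X))
    return curve

def toy_predict(row):
    """Hypothetical stand-in for a fitted model's prediction function."""
    return 2.0 * row[0] + row[1]

X_demo = [[0.0, 1.0], [0.0, 3.0]]
pd_curve = partial_dependence(toy_predict, X_demo, feature=0, grid=[0.0, 1.0, 2.0])
```

Because the other features keep their observed values while one feature is varied, the curve can be misleading when that feature is strongly associated with the others, which is the limitation that motivates the SHAP analysis.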
To further verify the excellent performance of the XGBoost model, it is compared with the most advanced methods related to this study. The comparison methods selected for the study are DT, RF, and SVM, which perform well in the small sample field. These four models are predicted separately on the test set and training set. The predicted AUC values of different models are compared, as shown in Fig. 11. From Fig. 11, the AUC value of the XGBoost model in the test set was 84.4%, which was 5.8%, 23.3%, and 20.4% higher than the AUC values of the DT, RF, and SVM models of 78.6%, 61.1%, and 64.0%, respectively. The AUC value of XGBoost model in the training set was 80.4%, which was 24.7%, 25.1%, and 23.2% higher than the AUC values of DT, RF, and SVM models of 55.7%, 55.3%, and 57.2%, respectively. Overall, the XGBoost model still has performance advantages compared with other advanced prediction models.
Comparison of predicted AUC values of different models.
To address the issue that traditional machine learning models can enhance prediction accuracy but fail to explain the reasoning behind their predictions, this study compared various models, selected the XGBoost prediction model, and analyzed it through examples. From the experimental results, among the single learners only the logistic regression model had an AUC value over 80%, and the rest were generally below 80%. Among the integrated learning models, the XGBoost model outperformed both AdaBoost and RF under optimal parameter settings, with 21.6% precision, 84.0% recall, 34.4% F1-score, 84.4% AUC value, and 84.8% accuracy. In addition, regarding interpretability, applying the XGBoost model to actual liver cancer prediction revealed that AFP and ALT had the greatest impacts on the model. The larger the AFP value, the greater its influence on the XGBoost model's prediction of whether an individual has cancer, and the change in prediction tended to stabilize after a value of 250. An increase in ALT had a positive impact on the prediction, followed by a negative impact of smaller magnitude once the value surpassed 50. Taken together, the XGBoost model improved recall in liver cancer patients by 10%, suggesting that multiple disease-related indicators should be strictly referred to in the diagnostic model, while information such as demographics and past medical history serves as an auxiliary basis.
However, interpretive models may differ between sample sets, so objective facts should serve as the standard for diagnosis in subsequent experiments. Future research can further explore the interactions between AFP, ALT, and other biomarkers, as well as how these interactions affect the predictive outcomes for liver cancer. Such work is expected to improve the accuracy and efficiency of liver cancer analysis and to give clinicians a more intuitive and easily understandable basis for model decisions.
