Abstract
Background
In clinical diagnosis, determining the level of malignancy in tumors and differentiating between benign and malignant tumors are common classification challenges. Accurate and early diagnosis is essential for targeted treatment, and machine learning methods can assist in making these judgments.
Methods
This paper focuses on classifying lung tissue as benign or malignant and on assessing the degree of aggressiveness in lung cancer. The study employed three methods without built-in feature selection: artificial neural network (ANN), logistic regression, and ridge penalized logistic regression. It also employed three methods with built-in feature selection: lasso penalized logistic regression, elastic-net penalized logistic regression, and sparse logistic regression with the hybrid L1/2 + 2 regularization (HLR).
Results
In the study on classifying benign and malignant lung tissue, ANN demonstrated the best predictive performance among the methods without built-in feature selection, achieving an average test accuracy of 91.82%. Among the methods with built-in feature selection, HLR outperformed the others with an average test accuracy of 96.67%. When determining the level of malignancy in lung tumors, ANN surpassed other methods without built-in feature selection, attaining an average test accuracy of 84.74%. In comparison, HLR exceeded the performance of other methods with built-in feature selection, reaching an average test accuracy of 93.33%.
Conclusions
The experimental results indicated that HLR with built-in feature selection and ANN without built-in feature selection exhibited strong competitiveness among the methods investigated in both classifying benign and malignant lung tissue and assessing the degree of aggressiveness in lung cancer.
Introduction
Lung cancer remains a leading cause of cancer-related mortality worldwide, with a significant impact on global health. 1 The high incidence of lung cancer is influenced not only by population growth and aging but also by various factors associated with socioeconomic development, such as smoking, alcohol, obesity, and air pollution. Meanwhile, the high mortality rate of lung cancer is primarily due to the lack of noticeable symptoms in its early stages, so most cases are detected at an advanced stage, after the best time for treatment has passed. This leads to a low five-year survival rate for lung cancer patients, averaging around 20%, whereas the five-year survival rates for two other common cancers, breast cancer and prostate cancer, are as high as 90% and 98%, respectively. Thus, improving early diagnostic methods for lung cancer is imperative, since early detection and accurate diagnosis are crucial for enhancing patient outcomes and survival rates. 2
However, the classification of lung tissue as benign or malignant and the determination of tumor malignancy levels pose significant challenges in clinical practice. Traditional diagnostic methods, 3 such as imaging techniques and biopsy, have limitations in terms of sensitivity, specificity, and invasiveness. Therefore, there is an urgent need for novel approaches that can accurately and non-invasively classify lung tissue and assess tumor malignancy.
In recent years, machine learning techniques have emerged as promising tools for cancer diagnosis and prognosis. These methods 4–9 leverage the power of computational algorithms to learn patterns and relationships from large-scale biological data, such as gene expression profiles. By training on datasets with known classifications, machine learning models can identify key features and biomarkers that are predictive of disease status or malignancy level. Once trained, these models can be applied to new, unseen data to make predictions and assist in clinical decision-making.
Nevertheless, using machine learning methods for lung cancer diagnosis presents several challenges. Firstly, the performance of machine learning models relies on large amounts of high-quality data. Due to data privacy and security issues, obtaining sufficient annotated medical data is difficult, especially for early-stage lung cancer cases. Secondly, for many medical institutions, allocating sufficient computational resources may pose financial and technical constraints. Lastly, many machine learning models are seen as “black boxes” whose decision-making processes are difficult to interpret, which hinders their acceptance by doctors and patients. Despite these difficulties, machine learning methods continue to demonstrate significant potential and effectiveness for early lung cancer diagnosis.
Among the various machine learning approaches, artificial neural networks (ANNs) 10 have gained significant attention due to their ability to model complex, non-linear relationships and achieve high predictive performance. ANNs are inspired by the structure and function of biological neural networks, consisting of interconnected nodes (neurons) organized in layers. Through a process of learning from data, ANNs can adjust the weights of the connections between neurons to minimize prediction errors and optimize performance.
However, the application of ANNs and other machine-learning methods to lung cancer diagnosis faces several challenges. One major issue is the high-dimensional nature of biological data, where the number of variables (e.g., genes) often vastly exceeds the number of samples. This “curse of dimensionality” can lead to overfitting, where the model performs well on the training data but fails to generalize to new, unseen data. Additionally, the presence of irrelevant or redundant features can reduce the interpretability and robustness of the models.
To address these challenges, feature selection techniques have been developed to identify the most informative and discriminative variables for classification. 11–13 These methods aim to reduce the dimensionality of the data while retaining the most relevant information. Feature selection can be broadly categorized into two approaches: filter methods, 14 which rank variables based on their individual relevance to the outcome, and wrapper methods, 15 which evaluate subsets of variables based on their collective predictive performance.
In the context of lung cancer diagnosis, several feature selection methods have been applied in conjunction with machine learning algorithms. For example, lasso penalized logistic regression 16 employs L1 regularization to promote sparsity and select a subset of informative genes. Elastic-Net penalized logistic regression 17 combines L1 and L2 regularization to balance between feature selection and grouping of correlated variables. More recently, hybrid L1/2 + 2 regularization (HLR) 18 has been proposed to achieve a more flexible and adaptive feature selection by incorporating both L1/2 and L2 penalty terms.
The present study aims to investigate the performance of various machine learning methods, with and without built-in feature selection, for the classification of lung tissue as benign or malignant and the determination of lung tumor malignancy levels. Specifically, we compare ANN, logistic regression, and ridge penalized logistic regression as methods without built-in feature selection, and lasso penalized logistic regression, elastic-net penalized logistic regression, and HLR as methods with built-in feature selection. By evaluating these approaches on publicly available gene expression datasets, we seek to identify the strengths and weaknesses of each method and provide insights into their potential clinical utility.
The remainder of this paper is organized as follows: Section 2 provides an overview of the machine learning methods employed in this study, including their mathematical formulations and implementation details. Section 3 presents the results of the experiments, comparing the performance of the different methods on the classification tasks. Finally, Section 4 discusses the implications of the findings, potential limitations, and future directions for research.
By conducting a comprehensive evaluation of machine learning methods for lung cancer diagnosis, this study contributes to the ongoing efforts to develop accurate, non-invasive, and clinically applicable tools for early detection and personalized treatment. The insights gained from this research can guide the development of more effective diagnostic strategies and ultimately improve patient outcomes in the face of the growing global burden of lung cancer.
Methods
ANN
An ANN 10 is a computational model inspired by biological neural networks. It consists of an input layer, hidden layers, and an output layer, which are composed of interconnected nodes (neurons) that form a network through weighted connections.
The learning process involves updating the weights by minimizing a loss function, such as mean squared error for regression or cross-entropy for classification. Activation functions, like sigmoid, tanh, ReLU, and softmax, introduce non-linearity and constrain the outputs to a suitable range.
By combining multiple layers with appropriate activation functions and optimizing the weights through backpropagation, an ANN can learn to approximate complex functions and solve various tasks, such as classification, regression, and pattern recognition.
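The training procedure described above can be sketched in a few lines. The following is a minimal illustration, not the network used in this study: it assumes a single hidden layer with tanh units, a sigmoid output, cross-entropy loss, and full-batch gradient descent. The function names (`forward`, `mean_loss`, `train`), the hidden layer size, and the learning rate are illustrative choices.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Return the hidden activations and the output probability for one sample."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    p = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, p


def mean_loss(X, y, W1, b1, w2, b2):
    """Mean cross-entropy over the dataset."""
    total = 0.0
    for x, t in zip(X, y):
        _, p = forward(x, W1, b1, w2, b2)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)


def train(X, y, n_hidden=3, lr=0.1, epochs=500, seed=0):
    """Full-batch gradient descent with backpropagation (illustrative settings)."""
    rng = random.Random(seed)
    n, n_in = len(X), len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):
        gW1 = [[0.0] * n_in for _ in range(n_hidden)]
        gb1 = [0.0] * n_hidden
        gw2 = [0.0] * n_hidden
        gb2 = 0.0
        for x, t in zip(X, y):
            h, p = forward(x, W1, b1, w2, b2)
            d_out = p - t  # gradient of the loss w.r.t. the output pre-activation
            gb2 += d_out
            for j in range(n_hidden):
                gw2[j] += d_out * h[j]
                d_hid = d_out * w2[j] * (1.0 - h[j] ** 2)  # chain rule through tanh
                gb1[j] += d_hid
                for i in range(n_in):
                    gW1[j][i] += d_hid * x[i]
        # Apply the averaged gradients after the whole batch has been processed.
        for j in range(n_hidden):
            w2[j] -= lr * gw2[j] / n
            b1[j] -= lr * gb1[j] / n
            for i in range(n_in):
                W1[j][i] -= lr * gW1[j][i] / n
        b2 -= lr * gb2 / n
    return W1, b1, w2, b2
```

On a toy AND problem, for example, `train([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 0, 1])` lowers the mean cross-entropy relative to the randomly initialized weights.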
Lasso regression
Lasso regression 16 is a linear regression method that adds L1 regularization to ordinary least squares regression to achieve parameter sparsity and feature selection. The ordinary least squares error serves as the loss function, while the L1 norm of the coefficients, scaled by a regularization parameter, acts as the penalty term.
To minimize the loss function, lasso regression typically uses coordinate descent, which updates one model parameter at a time while holding the others fixed and iterates until no parameter changes significantly. Because the L1 penalty is not differentiable at zero, each coordinate update applies a soft-thresholding step rather than a plain gradient step. During this process, some model parameters are set exactly to zero, so the corresponding features no longer affect the predictions, which achieves feature selection.
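As an illustration of how coordinate descent produces exact zeros, the sketch below minimizes the objective (1/(2n)) Σ (y_i − x_i·β)² + λ Σ |β_j| with a soft-thresholding update. It is a simplified sketch under stated assumptions (no intercept, a fixed number of sweeps, and hypothetical function names), not the implementation used in this study.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for the lasso (no intercept, illustrative only)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (z / n)
    return beta


# Toy data: y depends only on the first feature (y = 2 * x0).
X_toy = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y_toy = [2.0, 4.0, 6.0, 8.0]
beta_toy = lasso_coordinate_descent(X_toy, y_toy, lam=0.5)
```

In this toy example the coefficient of the second feature is driven exactly to zero, while the coefficient of the first feature is shrunk slightly below its least squares value of 2.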
Logistic regression
Logistic regression is a classification algorithm that aims to predict the sample's category based on the linear combination of input features. It uses a logistic function (also known as the sigmoid function) to map the output of the linear combination to a probability value. During the training of a logistic regression model, gradient descent is commonly used to iteratively update the model parameters in order to minimize the loss function. The cross-entropy loss, which measures the difference between the predicted probabilities and the true labels, serves as the loss function of logistic regression.
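The sigmoid mapping, the cross-entropy loss, and a single gradient descent update can be sketched as follows. The function names and the learning rate are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, beta, intercept):
    """Probability of the positive class for one sample."""
    return sigmoid(intercept + sum(b * xi for b, xi in zip(beta, x)))


def cross_entropy(X, y, beta, intercept):
    """Mean cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for x, t in zip(X, y):
        p = predict_proba(x, beta, intercept)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(X)


def gradient_step(X, y, beta, intercept, lr=0.1):
    """One full-batch gradient descent update of the coefficients and intercept."""
    n = len(X)
    g_beta = [0.0] * len(beta)
    g_int = 0.0
    for x, t in zip(X, y):
        err = predict_proba(x, beta, intercept) - t
        g_int += err
        for j, xj in enumerate(x):
            g_beta[j] += err * xj
    new_beta = [b - lr * g / n for b, g in zip(beta, g_beta)]
    return new_beta, intercept - lr * g_int / n
```

With all parameters at zero every predicted probability is 0.5, so the loss equals ln 2; each sufficiently small gradient step then reduces it.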
Ridge penalized logistic regression
Ridge penalized logistic regression, also known as ridge regression, introduces an L2 penalty into the loss function of logistic regression. Its loss function is the cross-entropy loss plus the sum of the squared coefficients scaled by a regularization parameter.
Lasso penalized logistic regression
Lasso penalized logistic regression 12 applies an L1 regularization term to the logistic regression model. Its loss function is the cross-entropy loss plus the sum of the absolute values of the coefficients scaled by a regularization parameter.
Elastic-net penalized logistic regression
Elastic-net penalized logistic regression 17 combines an L1 regularization term and an L2 regularization term in logistic regression. Its loss function is the cross-entropy loss plus a weighted combination of the L1 and L2 penalties on the coefficients.
HLR
HLR 18 combines an L1/2 regularization term and an L2 regularization term in logistic regression. Its loss function is the cross-entropy loss plus a weighted combination of the L1/2 and L2 penalties on the coefficients.
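For reference, the penalized objectives described in this section can be written in one common notation. The symbols are assumptions rather than the paper's own notation: x_i is the feature vector of sample i, y_i ∈ {0, 1} is its label, β is the coefficient vector with p entries, and λ, λ1, λ2 ≥ 0 are regularization coefficients. Equivalent parametrizations, such as a single λ with a mixing weight between the two penalties, are also in use.

```latex
\begin{aligned}
\ell(\beta) &= -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log p_i + (1-y_i)\log(1-p_i)\Big],
\qquad p_i = \frac{1}{1+e^{-x_i^{\top}\beta}} \\
L_{\text{ridge}}(\beta) &= \ell(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2} \\
L_{\text{lasso}}(\beta) &= \ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\
L_{\text{EN}}(\beta) &= \ell(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \\
L_{\text{HLR}}(\beta) &= \ell(\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j|^{1/2} + \lambda_2 \sum_{j=1}^{p} \beta_j^{2}
\end{aligned}
```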
Pros and cons of different machine learning methods
Of the methods described above, the six other than lasso regression were used to perform classification: each learns the relationship between the input data and the categories and converts its output into a probability distribution over the classes.
For methods without built-in feature selection, lasso regression was employed for feature selection, and the selected features were then passed to the model as input. Logistic regression is efficient and straightforward to train and performs well on linearly separable datasets. However, it cannot solve complex nonlinear problems, and its coefficient estimates become unstable when features are multicollinear. In contrast, ANN can adaptively handle nonlinear and high-dimensional data, but due to its complexity and black-box nature, the model's interpretability is relatively poor. Moreover, ridge penalized logistic regression introduces the L2 regularization term to limit the size of coefficients and promotes the grouping effect to encourage similar weights for correlated features, resulting in stable performance even in the presence of multicollinearity. However, L2 regularization does not shrink coefficients exactly to zero, so ridge penalized logistic regression is unable to exclude irrelevant features or noise.
To solve the feature selection problem, methods with built-in feature selection are used. Lasso penalized logistic regression introduces the L1 penalty to shrink some feature coefficients to zero, which brings it the ability of feature selection while simplifying the model's complexity. However, if the regularization parameter is poorly chosen during the selection process, some important features might be ignored or excluded. In order to reduce the influence of the regularization parameter, elastic-net regularized logistic regression combines the feature selection ability of the L1 regularization term with the stability of the L2 regularization term. On the other hand, HLR uses L1/2 regularization to induce sparse solutions for identifying important features and uses L2 regularization to generate grouping effects. Nevertheless, both elastic-net regularized logistic regression and HLR need to tune two regularization coefficients at the same time, which increases the time and complexity of computation.
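The different behavior of these penalties near zero can be checked numerically. The helper functions below are hypothetical and use two separate coefficients `lam1` and `lam2`, which is one possible parametrization rather than the one used in this study.

```python
def l2_penalty(beta):
    """Sum of squared coefficients (ridge term)."""
    return sum(b * b for b in beta)


def l1_penalty(beta):
    """Sum of absolute coefficients (lasso term)."""
    return sum(abs(b) for b in beta)


def l_half_penalty(beta):
    """Sum of square roots of absolute coefficients (L1/2 term)."""
    return sum(abs(b) ** 0.5 for b in beta)


def elastic_net_penalty(beta, lam1, lam2):
    return lam1 * l1_penalty(beta) + lam2 * l2_penalty(beta)


def hlr_penalty(beta, lam1, lam2):
    return lam1 * l_half_penalty(beta) + lam2 * l2_penalty(beta)


# A single small coefficient is penalized most by L1/2 and least by L2.
small = [0.01]
ordering = (l_half_penalty(small), l1_penalty(small), l2_penalty(small))
```

For a coefficient of 0.01 the three terms are roughly 0.1, 0.01, and 0.0001. The L2 term therefore exerts almost no pressure on small coefficients, whereas the L1 and L1/2 terms still penalize them noticeably, which is consistent with their tendency to produce sparse solutions.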
Evaluation
Confusion matrix
The confusion matrix, also known as the error matrix, illustrates the correspondence between the model's predictions and the actual labels. For binary classification, each prediction falls into one of four categories:
True Positive (TP): The model correctly predicts a positive class when the actual class is positive.
True Negative (TN): The model correctly predicts a negative class when the actual class is negative.
False Positive (FP): The model incorrectly predicts a positive class when the actual class is negative.
False Negative (FN): The model incorrectly predicts a negative class when the actual class is positive.
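The four categories can be counted directly from paired labels and predictions. The function name and the label encoding (1 for the positive class) are assumptions made for this sketch.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted != positive and actual != positive:
            tn += 1
        elif predicted == positive:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical labels and predictions, for illustration only.
counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 1], [1, 0, 1, 0, 1, 0, 0, 1])
```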
Accuracy
Accuracy represents the proportion of correctly classified samples (both positive and negative) out of all samples. The formula for calculating accuracy is: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Precision represents the proportion of samples that are correctly predicted as positive out of all samples that are predicted as positive. The formula for calculating precision is: Precision = TP / (TP + FP)
Specificity
Specificity represents the proportion of samples that are correctly predicted as negative out of all negative samples. The formula for calculating specificity is: Specificity = TN / (TN + FP)
Sensitivity
Sensitivity represents the proportion of samples that are correctly predicted as positive out of all positive samples. The formula for calculating sensitivity is: Sensitivity = TP / (TP + FN)
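The four metrics follow directly from the confusion matrix counts. The sketch below uses a hypothetical helper and hypothetical counts; the numbers are not results from this study.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, specificity, and sensitivity from the four counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "specificity": _ratio(tn, tn + fp),
        "sensitivity": _ratio(tp, tp + fn),
    }


# Hypothetical counts for a test set with 30 positive and 90 negative samples.
metrics = classification_metrics(tp=27, tn=85, fp=5, fn=3)
```

Precision and specificity both involve FP but use different denominators, which is why a model can score high on one and low on the other for the same class.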
Results and discussion
Study on classification of benign and malignant lung tissue
Data analysis
The study utilized a gene expression profiling dataset with the accession number GSE67061 from the GEO public database, which included 17 samples of normal airway epithelial cells as benign samples and 56 samples of lung squamous cell carcinoma tissues as malignant samples. To be more specific, all sample data belong to non-smokers. However, the limited sample size may restrict the generalizability of the study results. To balance the dataset, the benign samples were augmented to 300 cases, and the malignant samples were expanded to 100 cases using the SMOTE function, resulting in a ratio of 3:1 between benign and malignant samples.
Noise was generated using a normal distribution, and a subset of it was added to the experimental data. The benign data contained 120 instances of noise data and 180 instances of normal data, while the malignant sample data included 40 instances of noise data and 60 instances of normal data. The dataset was then split into 280 instances for training and 120 instances for testing. The training data comprised 210 instances of benign samples and 70 instances of malignant samples, while the test data consisted of 90 instances of benign samples and 30 instances of malignant samples. Notably, noise was only added to the training data, and the data used for training and testing were mutually exclusive to prevent data leakage and ensure the reliability of experimental results.
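The augmentation steps can be illustrated with a simplified, hand-written version of SMOTE-style interpolation followed by the generation of normally distributed noise samples. This is a sketch under stated assumptions and not the SMOTE function used in this study: the function names, the neighbor count `k`, and the noise mean and standard deviation are all hypothetical, and the class must contain at least two samples.

```python
import math
import random


def smote_like_oversample(samples, n_target, k=3, seed=0):
    """Grow one class to n_target samples by interpolating between neighbors."""
    rng = random.Random(seed)
    synthetic = []
    while len(samples) + len(synthetic) < n_target:
        base = rng.choice(samples)
        # The k nearest original samples of the same class, excluding base itself.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return samples + synthetic


def gaussian_noise_samples(n_samples, n_features, mean=0.0, sd=1.0, seed=0):
    """Draw noise samples from a normal distribution (illustrative parameters)."""
    rng = random.Random(seed)
    return [[rng.gauss(mean, sd) for _ in range(n_features)]
            for _ in range(n_samples)]


# Toy two-feature class with four samples, grown to ten.
toy_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
augmented = smote_like_oversample(toy_class, n_target=10)
noise = gaussian_noise_samples(n_samples=5, n_features=2)
```

Every synthetic sample lies on a line segment between two original samples of the same class, so the augmented data stay within the range spanned by the original measurements.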
Results and discussion on methods with built-in feature selection
This experiment focused on the performance of built-in feature selection approaches, including lasso penalized logistic regression, elastic-net penalized logistic regression, and HLR, in classifying benign and malignant lung tissue. The preprocessed dataset with 20,577 features was used as predictor variables, with the benign or malignant status of the samples serving as the response variable.
For each of the three methods, 500 experiments were conducted, and the final evaluation results were the average of all predictions. The experimental outcomes are shown in Table 1, with 1 indicating benign samples and 0 indicating malignant samples.
Table 1. The prediction performance of the built-in feature selection methods.
The results show that the HLR model achieved an accuracy of 96.67% and a precision of 100.00% in detecting malignant samples. This strong ability to identify cancerous tissue could support timely and appropriate treatment.
The elastic-net penalized logistic regression model also demonstrated robust performance, particularly in terms of specificity for malignant samples at 89.63% and precision for benign samples at 96.62%. This suggests that the elastic-net approach is effective in balancing the trade-offs between precision and specificity, which is essential in clinical settings to minimize false positives and false negatives.
In contrast, the lasso penalized logistic regression model showed lower accuracy at 93.53%. This might be caused by the L1 regularization, which may overly penalize the model coefficients, leading to the exclusion of potentially informative features and resulting in reduced predictive performance when dealing with complex datasets.
The ROC curve was also introduced to evaluate the accuracy of predictions. As shown in Figure 1, the HLR model had the highest AUC value at 0.933, slightly surpassing the elastic-net penalized logistic regression model, while the lasso penalized logistic regression model had the lowest AUC value at slightly below 0.9. Based on the ROC curve, it can be concluded that the HLR model performed the best in terms of classification, with its performance being relatively close to that of the elastic-net penalized logistic regression model.

Figure 1. ROC curves of methods with built-in feature selection. (a) Lasso (b) Elastic-Net (c) HLR.
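The AUC summarizes a ROC curve as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition is given below; the function name and the toy scores are illustrative, and this is not the code used to produce the figures.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: one positive sample is ranked below one negative sample.
auc_toy = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

In the toy example three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75; a perfect ranking would give 1.0.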
Results and discussion on methods without built-in feature selection
This experiment investigated the merits and limitations of methods without built-in feature selection, including ANN, ridge penalized logistic regression, and logistic regression, for classifying benign and malignant lung tissue. Because these three methods lack feature selection capabilities, using all genes as input may cause the model to learn noise from the training data, leading to overfitting. Additionally, including a large number of features complicates the interpretation and understanding of the model, which may cause distrust in model decisions in clinical applications.
From a biological point of view, performing feature selection also helps to identify driver genes that play a key role in lung cancer occurrence and progression, which are often key nodes or regulators in important signaling pathways. By studying these genes, it is possible to better understand their specific roles in signaling pathways and to draw a map of signaling pathways related to lung cancer, showing how genes interact and drive tumor development. For example, if certain genes are frequently selected in the EGFR pathway, this pathway may play an important role in lung cancer. Moreover, feature selection can reveal new therapeutic targets that serve as an important basis for the development of new drugs. By targeting these key genes, more effective anticancer drugs can be developed to improve the treatment effect.
Therefore, lasso regression was employed to select the 35 genes most correlated with the benign or malignant classification from a pool of 20,577 genes. Specifically, the number of selected genes was adjusted repeatedly during the experimental process, and the changes in the model evaluation metrics were observed for each setting. Through these adjustments, it was found that selecting 35 genes resulted in the best model performance.
The selected genes were then used as predictor variables, with the benign/malignant classification as the response variable. Each of the three methods was subjected to 500 experiments, and the average prediction results were taken as the final evaluation outcomes. Table 2 presents the experimental results.
Table 2. The prediction performance of the non-feature selection methods.
The values in Table 2 are rounded from the experimental results. According to the table, the ANN model achieved the highest accuracy on both the training data and the test data, at 99.91% and 91.82%, respectively, which suggests a generalization ability that may suit clinical applications.
For precision, both the logistic regression model and the ridge penalized logistic regression model exhibited perfect precision (100%) for malignant samples, but they fell short in precision for benign samples. This discrepancy highlights that, for these two models, achieving high precision in one class may compromise performance in the other. In contrast, the ANN balanced precision across the two classes well.
For specificity, the difference between categories was more obvious. The high specificity of the logistic regression model and the ridge penalized logistic regression model for benign samples suggested that these models were more conservative in predicting benign cases, reducing the risk of false positives for benign cases. However, the conservative prediction of benign cases led to an increase in false positives for malignant cases, resulting in a sharp drop in the specificity for malignant samples, with only 42.75% for the logistic regression model and 23.35% for the ridge penalized logistic regression model.
In summary, the logistic regression model and the ridge penalized logistic regression model may have advantages over the ANN model in scenarios where malignant samples are preferentially identified. However, the ANN model achieves a better performance balance between the categories and is suitable for application scenarios that need to consider the recognition performance of benign and malignant samples at the same time.
To further evaluate the accuracy of predictions, this study introduces the ROC curve. According to Figure 2, the ANN model has the highest AUC value of 0.894, the ridge penalized logistic regression model has the lowest AUC value of 0.633 and the logistic regression model has an AUC of 0.683. Thus, based on the ROC curve, it can be concluded that the ANN model has the best classification performance overall.

Figure 2. ROC curves of methods without built-in feature selection. (a) ANN (b) Ridge (c) Logistic.
Data analysis
In this study, the “One-vs-All” (OvA) strategy was employed for multi-class training. The study utilized a gene expression profiling dataset with the accession number GSE11969 from the GEO public database, which included 149 lung cancer samples. More specifically, the dataset consists of 90 AD samples, 4 AS samples, 18 LA samples, 2 LCNEC samples, and 35 SQ samples. Because some malignancy grades had very few samples (grade IIA, for example, had only 5 cases), the six malignancy grades (IA, IB, IIA, IIB, IIIA, and IIIB) were merged into three categories: I, II, and III.
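The grade merging and the OvA strategy can be sketched as follows. The function names are hypothetical, and choosing the class with the highest binary score is one common way to combine OvA models; the source does not state how the three models' outputs were combined.

```python
def merge_grade(grade):
    """Map a six-level grade such as 'IIA' or 'IIIB' to 'I', 'II', or 'III'."""
    return grade.rstrip("AB")


def one_vs_all_labels(labels, positive_class):
    """Binary labels for the model that separates positive_class from the rest."""
    return [1 if lab == positive_class else 0 for lab in labels]


def predict_one_vs_all(class_scores):
    """Pick the class whose binary model gives the highest score.

    class_scores maps each class name to the probability from its own model.
    """
    return max(class_scores, key=class_scores.get)


# Hypothetical grades and scores, for illustration only.
merged = [merge_grade(g) for g in ["IA", "IB", "IIA", "IIB", "IIIA", "IIIB"]]
labels_for_grade_one = one_vs_all_labels(merged, "I")
predicted = predict_one_vs_all({"I": 0.2, "II": 0.7, "III": 0.4})
```

Each of the three binary models sees the same samples but with a different labeling, which is why the training sets described for the grade I model contain twice as many grade I samples as samples of either other grade.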
To address the imbalanced sample sizes, the SMOTE function was used to augment the sample size of each category to 200, resulting in a balanced ratio of 1:1:1 for each category. Subsequently, noise was generated using the normal distribution, where each category consisted of 40 noise samples and 160 normal samples. This step was crucial in creating a more robust and representative dataset for training and evaluation.
After the data augmentation and noise addition, the enhanced dataset was divided into subsets for training, validation, and testing. Specifically, 280 samples were selected as the training data, 80 samples as the validation data, and the remaining 60 samples as the testing data. To illustrate this process, consider the data for the model that identifies malignancy grade I samples. The training data for this model consisted of 140 samples with malignancy grade I (including 40 noise samples), 70 samples with malignancy grade II (including 20 noise samples), and 70 samples with malignancy grade III (including 20 noise samples). The validation data included 40 samples with malignancy grade I, 20 samples with malignancy grade II, and 20 samples with malignancy grade III. Similarly, the testing data consisted of 20 samples each for malignancy grades I, II, and III.
Results and discussion on methods with built-in feature selection
This experiment employed lasso penalized logistic regression, elastic-net penalized logistic regression, and HLR methods to explore the advantages and disadvantages of feature selection methods in the classification of lung cancer malignancy grades. The preprocessed dataset, containing 17,067 features, was directly utilized as the predictor variables, with the malignancy grade of the samples serving as the response variable. Each of the three methods underwent 500 experiments, and the average of the prediction results was taken as the final evaluation result. Table 3 presents the experimental results.
Table 3. The prediction performance of the built-in feature selection methods.
As shown in Table 3, the HLR model achieved the highest prediction accuracy on the testing data, with a value of 93.33%. The elastic-net penalized logistic regression model followed, with a prediction accuracy of 88.31%, while the lasso penalized logistic regression model had the lowest accuracy at 86.93%. Additionally, the HLR model performed the best on train I, validation I, and train III data, but had the lowest performance on train II and validation II. The elastic-net penalized logistic regression model demonstrated better performance on train II and validation II, while the lasso penalized logistic regression model achieved the best prediction results on validation III data. These findings suggest that HLR was effective in learning the features of class I samples and exhibited the best overall performance. The elastic-net penalized logistic regression model could learn the features of class II samples, while the lasso penalized logistic regression model performed well in learning the features of class III samples. These results indicate that different penalization techniques can capture different features, emphasizing the importance of model selection based on the specific characteristics of the data. Clinically, the HLR model's superior performance on class I samples, which represent early-stage cancer, is significant for early diagnosis and intervention.
Moreover, Table 4 reveals that the HLR model achieved the highest precision for class I classification (94.44%) and the highest specificity for class II classification (95.00%). The elastic-net penalized logistic regression model demonstrated the highest precision for class II and class III samples, as well as the highest specificity for class I classification, all with values of 100.00%. The lasso penalized logistic regression model achieved the highest precision for class II classification, also at 100.00%. Notably, all three models exhibited a specificity of 100.00% for class III classification. These results indicate that the HLR model showed good predictive ability for class I samples and good classification ability for class II samples. The elastic-net penalized logistic regression model demonstrated good predictive ability for class II and class III samples, as well as good classification ability for class I samples. The lasso penalized logistic regression model exhibited good predictive ability for class II samples, while all three models demonstrated high classification ability for class III samples.
Table 4. The prediction performance of the built-in feature selection methods.
In summary, the experimental results highlight the strengths and weaknesses of each feature selection method in classifying lung cancer malignancy grades. The HLR model exhibited the best overall performance, while the elastic-net and lasso penalized logistic regression models showed specific advantages in learning and classifying certain sample classes. These findings provide valuable insights into the application of feature selection methods in the context of lung cancer malignancy grade classification.
Results and discussion on methods without built-in feature selection
This experiment utilized ANN, ridge penalized logistic regression, and logistic regression methods to investigate the advantages and disadvantages of methods without built-in feature selection in the classification of lung cancer malignancy grades. Lasso regression was employed to select the 50 most relevant genes related to lung tumor malignancy from a pool of 17,067 genes as predictor variables, with the malignancy grade of the samples serving as the response variable. The number of selected genes was again chosen according to the resulting model performance. Subsequently, each of the three methods underwent 500 experiments, and the average of the prediction results was considered the final evaluation result. Table 5 presents the experimental results obtained from these experiments.
Table 5. The prediction performance of the non-feature selection methods.
The data in Table 5 are rounded results derived from the experimental outcomes. It can be observed that the ANN model achieved the highest prediction accuracy on the testing data, with a value of 84.74%. The logistic regression model ranked second, with a prediction accuracy of 74.79%. In contrast, the ridge penalized logistic regression model exhibited the lowest accuracy on the testing data, with a value of 70.79%. Moreover, the ANN model consistently demonstrated the best prediction results on the training and validation data, significantly surpassing the performance of both the ridge penalized logistic regression model and the logistic regression model. This highlights the superior performance of the ANN model, which might be attributed to its ability to capture non-linear relationships within the data. When comparing the ridge penalized logistic regression model and the logistic regression model, the latter showed slightly higher prediction results on train I, validation I, train II, and train III. However, on validation II and validation III, the ridge penalized logistic regression model's prediction results were marginally higher than those of the logistic regression model. This suggests that the ridge penalized logistic regression model had a slightly better ability to learn the features of samples with malignancy grades II and III compared to the logistic regression model.
Table 6 shows that the ANN model achieved the best performance in terms of accuracy, precision, and specificity. The logistic regression model outperformed the ridge penalized logistic regression model in accuracy, precision, and specificity for samples with malignancy grades I and II, but it lagged slightly behind the ridge penalized logistic regression model in specificity for samples with malignancy grade III. In terms of classification precision, all three models were relatively weak at identifying samples with malignancy grade I, and both the ridge penalized logistic regression model and the logistic regression model were relatively weak at identifying samples with malignancy grade II, with values below 75.00%. In terms of classification specificity, all three models were relatively weak on samples with malignancy grade II, and both the ridge penalized logistic regression model and the logistic regression model were relatively weak on samples with malignancy grade III, with values below 75.00%. These findings illustrate that the selection of machine learning methods for lung cancer diagnosis should be based on the specific clinical needs and sample characteristics.
The prediction performance of the non-feature selection methods.
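The per-class precision and specificity discussed for Table 6 can be illustrated with a small sketch. It assumes a one-vs-rest treatment of each malignancy grade, which is a common convention but is not confirmed as the exact definition used in this study; the function name `per_class_metrics` and the toy labels are hypothetical.

```python
def per_class_metrics(y_true, y_pred, labels):
    """One-vs-rest precision and specificity for each class label.

    Assumption: each grade is scored against all other grades combined.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    results = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        tn = total - tp - fp - fn                           # true negatives
        results[c] = {
            # Share of samples predicted as grade c that really are grade c.
            "precision": tp / (tp + fp) if (tp + fp) else float("nan"),
            # Share of samples not of grade c that are correctly not labeled c.
            "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        }
    return results


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted grade matches the true grade."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy example with hypothetical labels (not data from the study).
y_true = ["I", "I", "II", "II", "III", "III"]
y_pred = ["I", "II", "II", "II", "III", "I"]
metrics = per_class_metrics(y_true, y_pred, labels=["I", "II", "III"])
```

Under this convention, a grade can have high specificity but low precision when few samples are wrongly assigned to it overall yet those few make up a large share of its predictions, which is why the two measures are reported separately.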
This study investigated the classification of lung tissue as benign or malignant and examined the level of malignancy of lung tumors using various machine learning techniques. These techniques included ANN, logistic regression, lasso penalized logistic regression, elastic-net penalized logistic regression, ridge penalized logistic regression, and sparse logistic regression with the hybrid L1/2 + 2 regularization (HLR). Among them, HLR is a novel machine learning method. The experimental results are broadly consistent with the existing literature, which suggests that the HLR method has certain advantages over traditional machine learning methods.
More specifically, for the classification of lung tissue as benign or malignant, ANN exhibited the best predictive performance among methods without built-in feature selection, with an average test accuracy of 91.82%. Among methods with built-in feature selection, HLR performed best, achieving an average test accuracy of 96.67%.
Similarly, when examining the level of malignancy of lung tumors, ANN again achieved the best predictive performance among methods without built-in feature selection, with an average test accuracy of 84.74%. Among methods with built-in feature selection, HLR performed best, reaching an average test accuracy of 93.33%.
It is worth noting that, because the dataset used in the experiments was imbalanced, the study employed the SMOTE algorithm to generate synthetic samples and balance the classes. However, the generated data may still differ from real data and may not cover all situations and variations present in real data. Additionally, since the occurrence and progression of lung cancer are influenced by various factors such as genetics, smoking, air pollution, and diet, the dataset used in the study may not adequately represent the diversity and complexity of the entire population of lung cancer patients.
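To make this limitation concrete, the following is a minimal sketch of the core SMOTE idea: each synthetic sample is an interpolation between a minority-class sample and one of its nearest minority-class neighbours. It is a simplified illustration, not the implementation used in this study; the function name `smote_oversample`, the parameter `k`, and the toy points are assumptions.

```python
import random
from math import dist


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by linear interpolation.

    Simplified sketch: neighbours are found by brute-force Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k minority samples closest to the chosen base sample.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class with two features (hypothetical values).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_oversample(minority, n_new=10, k=2, seed=1)
```

Because every synthetic point lies on a line segment between two existing minority samples, the generated data cannot leave the region already spanned by the observed samples, which is one reason it may fail to reflect variation present in real patients.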
Furthermore, the experiments showed that different datasets exhibited varying classification performance with different machine learning methods. In addition, varying the number of samples in each category within the same dataset could yield results contrary to the original conclusions. Therefore, future research will focus on identifying the common characteristics of sample data that perform well with specific machine learning methods. By using the same machine learning method to classify and predict different categories, our objective is to analyze the common features of the data that produce good predictive results. The characteristics identified from this analysis can provide valuable insights for selecting appropriate machine learning methods for new samples, allowing for quicker classification results and potentially granting patients more time for treatment.
Future studies should also focus on addressing the practical challenges of implementing these machine learning methods in clinical settings, including data availability, computational resources, and model interpretability, to facilitate their translation into routine clinical practice.
In conclusion, this study highlights the effectiveness of ANN and sparse logistic regression with HLR in accurately classifying lung tissue and determining the level of malignancy in lung tumors. These findings could potentially contribute to the development of more efficient and reliable diagnostic tools for lung cancer.
Footnotes
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (62102261); Guangdong Key Construction Discipline Research Capacity Enhancement Project (2022ZDJS049); The Scientific Computing Research Innovation Team of Guangdong Province (2021KCXTD052); Natural Science Foundation of Shaoguan University (SZ2022KJ07).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
