Abstract
Background
Salivary gland tumors are heterogeneous, making diagnosis challenging. Artificial intelligence (AI) is a potential adjunct in diagnosis, though its performance in ultrasound-based evaluation of salivary gland tumors is not clear.
Methods
Publication databases were searched from inception to August 2025. Eligible studies applied AI techniques to ultrasound for salivary gland tumors and reported diagnostic performance against histopathology.
Results
From 1,239 records, 17 retrospective studies including 5,351 patients were eligible. A range of AI methodologies was used. Convolutional neural networks (CNNs) were the most commonly applied approach, with additional use of radiomics, ensemble, and hybrid deep learning–machine learning models. Diagnostic performance ranged from moderate to excellent across individual studies: the best-performing models achieved an area under the curve (AUC) of 0.97, while some models showed only modest discrimination (AUC as low as 0.58).
Conclusion
AI models applied to ultrasound imaging suggest promising diagnostic performance across varied classification tasks for salivary gland tumors. Prospective multicenter studies with standardized protocols are required before clinical implementation.
1. Introduction
Salivary gland tumors (SGTs) are rare, accounting for 3–6% of head and neck cancers. 1 These tumors occur infrequently in both adults and children, with a slightly higher incidence in females.2,3 The parotid gland is most often affected (65–85% of cases), followed by the submandibular gland, minor salivary glands, and, least often, the sublingual gland. 4 Benign tumors make up 70–80% of SGTs, with pleomorphic adenoma (PMA) comprising 60–70% of these and Warthin tumor (WT) being the second most common benign type. 4 Malignant tumors constitute 20–30% of adult cases, with mucoepidermoid carcinoma and adenoid cystic carcinoma (ACC) as the main subtypes. 4 The World Health Organization (WHO) 2022 classification system recognizes 36 SGT subtypes,5,6 each with unique clinical behavior, therapy, and prognosis, emphasizing the need for accurate diagnosis.6,7
Fine-needle aspiration cytology (FNAC) is commonly used in the evaluation of salivary gland tumors (SGTs), but its sensitivity for differentiating benign from malignant lesions is variable (60%–86%) despite high specificity (91%–100%).5,8-14 Ultrasonography (US), computed tomography (CT), and magnetic resonance imaging (MRI) are the primary imaging modalities for SGTs. CT scans provide limited anatomic detail due to restricted soft tissue resolution. MRI offers superior delineation of tumor margins, extra-glandular extension, and perineural spread due to its enhanced soft tissue differentiation; however, its use is constrained by cost. 15 US is considered a first-line imaging modality for SGTs due to its affordability, speed, and convenience, although it remains operator-dependent and subject to variability. 8 Its noninvasive nature, wide availability, and ability to provide real-time morphologic information make US particularly well suited for AI-based image analysis aimed at improving lesion characterization and reducing operator dependency.
Artificial Intelligence (AI) has recently emerged as a transformative tool in clinical practice, particularly in medical imaging, through the use of deep learning (DL) algorithms. Convolutional neural networks (CNNs), the most widely used DL architecture, analyze image data by automatically extracting and learning hierarchical features. 16 This enables CNNs to identify subtle patterns that may be overlooked by human observers. AI-based diagnostic models have demonstrated promising results in various medical fields, including thyroid nodule detection on US, 16 breast cancer detection using multimodal imaging, 17 and head and neck oncology imaging. 18 These developments suggest that AI may reduce operator dependency and improve diagnostic accuracy in the ultrasound evaluation of SGTs.
Although AI has advanced in medical imaging, its application in head and neck imaging is currently limited, with most studies focusing on CT rather than US. 19 Research on AI for ultrasound-based SGT evaluation is scarce and typically conducted on a small scale. 20 There is little consolidated data on AI’s ability to distinguish benign from malignant SGTs via US. To address this, we conducted a systematic review to synthesize the current literature, assess the accuracy and clinical value of AI models, and outline key limitations and research needs.
2. Methods
2.1. PICO Format
The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) (ID: CRD420251082909). This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. 21
The PICO framework was defined as follows: • • • •
2.2. Search Strategy
A comprehensive literature search was performed across PubMed, MEDLINE, OpenAIRE, ScienceDirect, and Springer Nature Journals from inception to August 2025. The search strategy combined Medical Subject Headings (MeSH) and free-text terms with Boolean operators. The following search string was applied:
(“salivary gland” OR “parotid” OR “submandibular gland” OR “sublingual gland” OR “salivary neoplasm*” OR “salivary tumor*”) AND (“ultrasound” OR “ultrasonography” OR “sonogram” OR “sonography” OR “US imaging”) AND (“artificial intelligence” OR “AI” OR “machine learning” OR “deep learning” OR “neural network*” OR “CNN” OR “convolutional network” OR “computer-aided diagnosis” OR “CAD”) AND (“diagnos*” OR “classification” OR “prediction” OR “detect*” OR “differentiation” OR “malignancy” OR “benign” OR “tumor type” OR “neoplasm type”).
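The structure of the search string above (synonyms joined with OR inside each concept, concepts joined with AND) can be sketched programmatically. This is only an illustrative helper, not the tool the authors used: the names `CONCEPT_GROUPS` and `build_query` are hypothetical, straight quotes replace the typographic quotes shown above, and individual databases may require their own field tags or syntax.

```python
# Hypothetical helper: assemble a Boolean search string from concept groups.
# The terms are copied from the search string reported in Section 2.2.
CONCEPT_GROUPS = [
    ["salivary gland", "parotid", "submandibular gland", "sublingual gland",
     "salivary neoplasm*", "salivary tumor*"],
    ["ultrasound", "ultrasonography", "sonogram", "sonography", "US imaging"],
    ["artificial intelligence", "AI", "machine learning", "deep learning",
     "neural network*", "CNN", "convolutional network",
     "computer-aided diagnosis", "CAD"],
    ["diagnos*", "classification", "prediction", "detect*", "differentiation",
     "malignancy", "benign", "tumor type", "neoplasm type"],
]


def build_query(groups):
    """OR the quoted terms inside each group, then AND the groups together."""
    blocks = []
    for terms in groups:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        blocks.append(f"({quoted})")
    return " AND ".join(blocks)


QUERY = build_query(CONCEPT_GROUPS)
print(QUERY)
```

Keeping the terms in a list per concept makes it easier to audit which synonyms were searched and to adapt the string for each database.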
Both prospective and retrospective studies were considered. Only articles published in English and reporting the application of AI for US-based assessment of SGTs were included. To ensure completeness, the reference lists of included articles were manually screened for additional eligible studies.
2.3. Inclusion and Exclusion Criteria
2.3.1. Inclusion Criteria
• Studies applying AI techniques (DL, machine learning, or radiomics) to ultrasound imaging of salivary gland tumors.
• Studies reporting diagnostic performance metrics (e.g., sensitivity, specificity, accuracy, AUC).
• Original research involving human participants.
2.3.2. Exclusion Criteria
• Studies not involving US imaging.
• Reviews, meta-analyses, editorials, letters, or case reports.
• Animal or phantom studies.
• Studies lacking diagnostic performance outcomes or without a reference standard.
2.4. Data Extraction and Risk of Bias Assessment
Two reviewers independently extracted data; discrepancies were resolved by a third reviewer. Extracted information included: author, year of publication, country, study design, sample size, patient demographics (if reported), US modality (e.g., B-mode, elastography, contrast-enhanced US), AI method (e.g., convolutional neural network [CNN], support vector machine [SVM]), model architecture, features extracted (radiomics, texture, shape, etc.), dataset split and validation strategy, comparator (radiologist, histopathology), and reported diagnostic outcomes.
Risk of bias was assessed using the QUADAS-2 tool for diagnostic accuracy studies. For AI development studies without direct clinical comparison, adapted criteria focusing on dataset representativeness, reference standard, and validation method were applied.
2.5. Data Synthesis
Due to heterogeneity in AI models, ultrasound (US) modalities, diagnostic targets, and reported outcomes, no meta-analysis was performed. Instead, a narrative synthesis was conducted. Diagnostic performance metrics (e.g., sensitivity, specificity, accuracy, AUC, precision, recall, F1-score) were summarized in tables and compared qualitatively across studies. Particular attention was given to variations in AI model architecture, dataset size, validation methods, and reference standards.
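For readers less familiar with the metrics summarized here, the following minimal sketch shows how the threshold-based measures relate to a 2×2 table against histopathology. It assumes malignancy is the positive class; the class `ConfusionCounts`, the function `diagnostic_metrics`, and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # malignant on histopathology, flagged malignant by the model
    fp: int  # benign on histopathology, flagged malignant by the model
    fn: int  # malignant on histopathology, flagged benign by the model
    tn: int  # benign on histopathology, flagged benign by the model


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Derive the threshold-based metrics from a single 2x2 table."""
    sensitivity = c.tp / (c.tp + c.fn)  # identical to recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)  # identical to positive predictive value
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
EXAMPLE = ConfusionCounts(tp=45, fp=10, fn=5, tn=140)
METRICS = diagnostic_metrics(EXAMPLE)
for name, value in METRICS.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy and precision depend on the benign-to-malignant ratio of each dataset, they are harder to compare across studies than sensitivity and specificity.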
Heterogeneity was assessed qualitatively by evaluating differences in study design, AI methodologies, imaging protocols, diagnostic targets, and outcome reporting.
A limited quantitative synthesis (e.g., pooling of studies evaluating benign versus malignant classification) was considered. However, this approach was not feasible due to substantial variability in study objectives, including differences in classification tasks (binary vs multi-class), model outputs, and reporting formats.
Furthermore, most studies did not report sufficient statistical parameters (e.g., confidence intervals or standard errors for AUC), precluding reliable meta-analysis or construction of a forest plot.
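To illustrate why these parameters matter, the sketch below approximates a standard error and confidence interval for an AUC from the AUC itself and the two class sizes, using the Hanley and McNeil (1982) approximation. This is not a method applied in this review; the function names and the example values (AUC 0.90, 30 malignant and 70 benign cases) are hypothetical, and the approximation may be unreliable for small samples or AUCs near 1.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley and McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc**2 / (1 + auc)
    variance = (
        auc * (1 - auc)
        + (n_pos - 1) * (q1 - auc**2)
        + (n_neg - 1) * (q2 - auc**2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def auc_confidence_interval(auc, n_pos, n_neg, z=1.96):
    """Normal-approximation interval, clipped to the valid range [0, 1]."""
    se = hanley_mcneil_se(auc, n_pos, n_neg)
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Hypothetical study: AUC 0.90 with 30 malignant and 70 benign tumors.
SE = hanley_mcneil_se(0.90, n_pos=30, n_neg=70)
LOW, HIGH = auc_confidence_interval(0.90, n_pos=30, n_neg=70)
print(f"SE = {SE:.3f}, 95% CI = ({LOW:.3f}, {HIGH:.3f})")
```

Without either a reported standard error or both class sizes, a study's AUC cannot be weighted for pooling, which is the obstacle described above.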
3. Results
3.1. Literature Selection
A total of 1,239 records were identified through electronic databases, namely MEDLINE (n = 219), Springer (n = 925), OpenAIRE (n = 45), ScienceDirect (n = 5), and PubMed (n = 43), and from other sources (n = 2). After removal of 156 duplicates, 1,083 unique records remained and underwent primary screening of titles and abstracts. Of these, 1,061 records were excluded, leaving 22 studies for full-text review. Following secondary screening, 17 studies met the inclusion criteria and were included in the final review (Figure 1).
Figure 1. PRISMA flowchart showing the screening process of articles based on the study inclusion and exclusion criteria.
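The screening counts can be checked with simple arithmetic. The sketch below is an illustrative consistency check using the figures reported in this section; the function `prisma_flow` and its field names are hypothetical, and the number excluded at full-text review is derived here (22 minus 17) rather than stated in the text.

```python
# Record counts as reported in Section 3.1.
IDENTIFIED = {
    "MEDLINE": 219,
    "Springer": 925,
    "OpenAIRE": 45,
    "ScienceDirect": 5,
    "PubMed": 43,
    "Other sources": 2,
}
DUPLICATES_REMOVED = 156
EXCLUDED_AT_SCREENING = 1061
INCLUDED = 17


def prisma_flow(identified, duplicates, excluded_screening, included):
    """Walk the record counts through each stage of the selection process."""
    total = sum(identified.values())
    screened = total - duplicates
    full_text = screened - excluded_screening
    return {
        "identified": total,
        "screened": screened,
        "full_text_reviewed": full_text,
        "excluded_at_full_text": full_text - included,  # derived, not reported
        "included": included,
    }


FLOW = prisma_flow(IDENTIFIED, DUPLICATES_REMOVED, EXCLUDED_AT_SCREENING, INCLUDED)
for stage, count in FLOW.items():
    print(f"{stage}: {count}")
```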
3.2. Study Characteristics
Study Characteristics
BPT, benign parotid gland tumor; MPT, malignant parotid gland tumor; PMA, pleomorphic adenoma; WT, Warthin tumor; ACC, adenoid cystic carcinoma.
3.3. Diagnostic Targets
Artificial Intelligence Diagnostic Performance
ANOVA, analysis of variance; CEUS, contrast-enhanced ultrasound; CNN, convolutional neural network; CV, cross validation; DLR, deep learning-based radiomics; ET, ExtraTrees; FNAC, fine needle aspiration cytology; GLCM, gray-level co-occurrence matrix; GLDM, gray-level dependence matrix; GLRLM, gray-level run length matrix; GLSZM, gray-level size zone matrix; LASSO, least absolute shrinkage and selection operator; NGTDM, neighborhood gray-tone difference matrix; NR, not reported; pts, patients; RF, Random Forest; val., validation.
This range of diagnostic objectives underscores the versatility of AI methodologies in addressing both broad and nuanced classification tasks within salivary gland tumor imaging.
3.4. AI Models Utilized
The majority of studies employed CNNs as the primary classification model.22-26,32,34 Several CNN architectures, including ResNet18, MobileNetV3Small, and InceptionV3, were tested, demonstrating the adaptability of DL frameworks to heterogeneous imaging datasets. 34
A subset of studies implemented ensemble or multi-model machine learning strategies to enhance predictive performance. For instance, one study used an ensemble learning approach, 33 while another evaluated eight algorithms (Decision Tree, Random Forest, AdaBoost, XGBoost, Artificial Neural Network, SVM, Naïve Bayes, and K-Nearest Neighbors) alongside Logistic Regression. 35 Similarly, one study used a two-step radiomics–machine learning pipeline combining Least Absolute Shrinkage and Selection Operator-Bernoulli Naïve Bayes (LASSO-BNB) and Recursive Feature Elimination (RFE)-Voting (an ensemble of Random Forest, Extra Trees, and Hyperoptimized Gradient Boosting [HGBoost]) to improve diagnostic robustness. 25
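As a rough illustration of how a voting ensemble combines its base models, the sketch below averages the predicted probabilities of three classifiers and applies a threshold (so-called soft voting). Whether the cited study used soft or hard voting is not stated here, so this is an assumption; the function `soft_vote`, the model names, and the probabilities are hypothetical.

```python
from statistics import mean


def soft_vote(probabilities_by_model, threshold=0.5):
    """Average each case's positive-class probability across models."""
    n_cases = len(next(iter(probabilities_by_model.values())))
    averaged = [
        mean(probs[i] for probs in probabilities_by_model.values())
        for i in range(n_cases)
    ]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels


# Hypothetical positive-class probabilities for three tumors.
MODEL_OUTPUTS = {
    "random_forest": [0.80, 0.30, 0.55],
    "extra_trees": [0.70, 0.20, 0.40],
    "gradient_boosting": [0.90, 0.40, 0.45],
}
AVERAGED, LABELS = soft_vote(MODEL_OUTPUTS)
print(AVERAGED, LABELS)
```

In the third case one model alone would call the tumor positive, but the averaged probability falls below the threshold, which shows how voting can dampen the error of a single base model.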
Radiomics-based and hybrid DL approaches further demonstrated the advantage of integrating handcrafted radiomic features with deep learning representations.27-31 Additionally, one study evaluated multiple machine learning algorithms, including logistic regression, decision tree, random forest, XGBoost, and support vector machine (SVM), with SVM demonstrating the best diagnostic performance (Table 2).27-31
3.5. Diagnostic Performance in Characterizing SGTs
The included studies reported variable diagnostic performance, with AUC values ranging from 0.575 to 0.97 across individual studies. This range describes the spread of individual results rather than a pooled estimate.
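Because the AUC is the main metric compared across studies, a brief sketch of its meaning may help: it equals the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, with ties counted as one half. The function `auc_from_scores` and the scores below are hypothetical and serve only to illustrate this definition.

```python
def auc_from_scores(positive_scores, negative_scores):
    """AUC as the fraction of positive-negative pairs ranked correctly."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical model scores for malignant (positive) and benign (negative) tumors.
MALIGNANT = [0.9, 0.8, 0.4]
BENIGN = [0.7, 0.3, 0.2, 0.1]
AUC = auc_from_scores(MALIGNANT, BENIGN)
print(f"AUC = {AUC:.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking, which is why values near 0.58 indicate only modest discrimination.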
Studies differentiating benign from malignant SGTs reported the highest performance, with AUCs of 0.958 (CNN), 22 0.972 (SVM variants), 29 0.94 (DeepSGT), and 0.92–0.93 (DenseNet/DenseNet-like models), 32 while several studies reported lower values (model range 0.67–0.852; VGG16 0.618; 0.77–0.85).23,26,27 Notably, fusion and ensemble approaches (DLR fusion 0.916; nomogram 0.934; RFE-Voting 0.962)25,28 and well-trained CNNs (ResNet18/50, DenseNet variants) generally produced the highest AUCs, underscoring both the promise and heterogeneity of AI performance across studies.24,26,32
Three studies specifically addressed the distinction between pleomorphic adenoma and Warthin tumor, representing a more refined diagnostic challenge within benign subtypes.25,33,34 These studies achieved AUCs between 0.81 and 0.95, suggesting that AI-based image analysis can reliably differentiate benign salivary gland lesions with overlapping radiologic features.
In the study classifying stromal subtypes (stroma-low vs stroma-high) within pleomorphic adenomas, SVM showed the best discrimination (AUC 0.827), with Logistic Regression close behind (AUC 0.818). 36 Decision Tree (AUC 0.649) and KNN (AUC 0.575) performed substantially worse, suggesting linear/regularized models were better suited to this task.
For differentiation of primary salivary gland malignancies from secondary metastatic tumors, conventional US alone performed poorly (AUC 0.421), radiomics improved discrimination (AUC 0.636), and DL using ResNet50 performed better (AUC 0.763). A radiomics–deep learning fusion with a Multilayer Perceptron achieved the highest AUC (0.807), whereas the simple combined model yielded intermediate performance (AUC 0.711), indicating that modality-specific feature integration may provide the greatest diagnostic gain. 37
In the ACC-focused study, the SVM model achieved strong diagnostic performance, with an AUC of up to 0.913 in the external validation cohort. 38
In summary, AI techniques applied to US imaging of SGTs achieved generally high diagnostic performance across diverse architectures and validation settings. Despite heterogeneity in model design and reporting, these findings support the potential utility of AI in augmenting conventional ultrasonographic assessment, pending further prospective, multicenter validation.
3.6. Study Quality Assessment
QUADAS-2 Risk of Bias and Applicability Assessment
3.7. Limitations
The included studies share several recurrent limitations that constrain generalizability and clinical translation: most were retrospective and conducted at one or two centers, many had small or unevenly distributed samples, and external or multicenter validation was generally lacking. Technical and methodological issues were common, including reliance on static or grayscale ultrasound images (without elastography or dynamic information), heterogeneous ultrasound equipment and image acquisition, operator dependency, and variable preprocessing or labeling procedures. Several studies also had a relatively small number of malignant tumors, and a few reported potential bias from case selection and class imbalance. Taken together, these weaknesses underscore the need for larger, prospective, multicenter datasets with standardized acquisition and external validation.
4. Discussion
The parotid gland, the largest salivary gland, is located in front of the ear on each side of the face. Although most parotid tumors are benign, they are heterogeneous in nature and carry a risk of recurrence or malignant transformation. Consequently, precise characterization of parotid tumors is essential for accurate diagnosis and appropriate management. 22 Accurate preoperative characterization is critical, as benign lesions are generally managed with conservative intervention, whereas malignant tumors may necessitate more invasive options, including radical neck dissection. Misdiagnosis risks both overtreatment of benign disease and undertreatment of malignancies, underscoring the need for precise diagnostic tools.
This systematic review aimed to summarize the evidence on the diagnostic value of US-based AI models for differentiating benign from malignant SGTs. Across the included studies, AI models generally showed high performance, with sensitivities ranging from 81% to 100%, specificities from 66.7% to 100%, and AUCs between 0.86 and 0.96. These findings suggest that AI-augmented ultrasound has potential as a reliable diagnostic tool in the preoperative evaluation of SGTs.
Clinical background information and US both play pivotal roles in reaching an accurate diagnosis. Zhang et al combined US and clinical data and achieved among the highest performance in our review, with sensitivity of 97.2%, specificity of 94.4%, accuracy of 95.8%, and an AUC of 0.958. 22 Similarly, Li et al reported strong performance using MobileNetV3, with accuracy and AUC exceeding 0.87 and 0.94, respectively, while also surpassing radiologists’ performance. 34 He et al introduced a US-based ensemble machine learning model, achieving high internal (AUC 0.891) and external (AUC 0.833) validation results, and providing positive and negative predictive value estimates that reinforce its clinical reliability. 33 In addition, Su et al developed a machine learning model integrating clinical and ultrasound features to differentiate adenoid cystic carcinoma from other salivary gland tumors, achieving high diagnostic performance with external validation. 38 By achieving accuracies and AUC values comparable to those of radiologists, these models may serve as supportive adjuncts, potentially enhancing diagnostic confidence, reducing operator dependency, and supporting timely differentiation between benign and malignant SGTs. Importantly, their reliable sensitivity and predictive values suggest a role in guiding surgical planning and reducing the risks of both overtreatment and undertreatment.
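Predictive values, unlike sensitivity and specificity, depend on how common malignancy is in the population being tested. The sketch below applies Bayes' rule to the sensitivity (97.2%) and specificity (94.4%) reported by Zhang et al; the 25% malignancy prevalence is an assumed value chosen for illustration only, and the function `predictive_values` is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Sensitivity and specificity as reported by Zhang et al; prevalence is assumed.
PPV, NPV = predictive_values(sensitivity=0.972, specificity=0.944, prevalence=0.25)
print(f"PPV = {PPV:.3f}, NPV = {NPV:.3f}")
```

At a lower prevalence the same model would yield a lower positive predictive value, so predictive values reported by one center may not transfer directly to another.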
Complementing US-based AI studies, a multicenter retrospective study aimed to develop and validate DL models on contrast-enhanced CT to assist radiologists in distinguishing benign from malignant parotid tumors. Using data from 573 histopathology-confirmed cases across two centers, the authors trained six CNN architectures on arterial-phase CT images and constructed a baseline SVM model integrating clinical-radiological features with handcrafted radiomics signatures. The performance of senior and junior radiologists with and without optimal model assistance was compared. MobileNetV3 delivered the best performance and significantly exceeded the SVM, and model assistance improved the clinical benefit and overall efficiency of a junior radiologist. 20 In support of these findings, Zheng et al evaluated CT-based radiomics models in differentiating benign from malignant parotid tumors in a cohort of 388 patients. Radiomics features were extracted from non-contrast, arterial, and venous phase CT images, and models including Logistic Regression, SVM, and Random Forest were developed. The Random Forest model demonstrated the highest diagnostic performance, with an AUC of 0.91 in the test cohort, while combining radiomics with clinical features further improved predictive accuracy. 39 These results suggest that AI-augmented imaging, not only with ultrasound but also with CT, may serve as a non-invasive, reliable tool for preoperative tumor characterization.
The limitations of the available studies must be acknowledged. Most studies were retrospective and single-center, with relatively small sample sizes. Substantial heterogeneity in AI architectures, imaging protocols, and reported performance metrics precluded formal meta-analysis and limited direct comparability across studies. Therefore, the current findings should be interpreted with caution, given the absence of quantitative pooling.
Future research should prioritize prospective, multicenter trials with standardized ultrasound acquisition protocols and transparent reporting of AI development and validation. Incorporating advanced techniques such as elastography, radiomics, and multimodal data fusion (for example, combining clinical and imaging data) could further enhance diagnostic reliability.
5. Conclusion
This systematic review suggests that AI-assisted interpretation of ultrasound imaging may achieve high diagnostic performance in differentiating benign from malignant salivary gland tumors (SGTs), and in some studies may outperform radiologist assessment. Our findings are consistent with recent developments in the field and highlight the potential role of AI in supporting preoperative decision-making. Nonetheless, the heterogeneity of methodologies, limited external validation, and lack of prospective data warrant cautious interpretation and indicate the need for further rigorous evaluation before clinical implementation.
Footnotes
Acknowledgments
The authors acknowledge the use of generative artificial intelligence tools (ChatGPT, OpenAI) for assistance with language editing and clarity. The authors take full responsibility for the content of this manuscript.
Ethical Considerations
Ethical approval was not required for this study as it is a systematic review of previously published literature.
Consent to Participate
Patient consent was not required as no individual patient data or identifiable information were included in this systematic review. The authors have nothing to report.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data sharing is not applicable to this article as no new data were generated or analyzed in this study.
Trial Registration
This systematic review was registered with the International Prospective Register of Systematic Reviews (PROSPERO) under registration number
Permission to Reproduce Material
No previously published material requiring permission was reproduced in this manuscript.
