Diagnostic Performance of Artificial Intelligence in Salivary Gland Tumors Using Ultrasound Imaging: A Systematic Review

Abstract

Background

Salivary gland tumors are heterogeneous, making diagnosis challenging. Artificial intelligence (AI) is a potential adjunct in diagnosis, though its performance in ultrasound-based evaluation of salivary gland tumors is not clear.

Methods

Publication databases were searched from inception to August 2025. Eligible studies applied AI techniques to ultrasound for salivary gland tumors and reported diagnostic performance against histopathology.

Results

From 1,239 records, 17 retrospective studies including 5,351 patients were eligible. A range of AI methodologies was used. Convolutional neural networks (CNNs) were the most frequently applied approach, with additional use of radiomics, ensemble, and hybrid deep learning–machine learning models. Diagnostic performance ranged from moderate to excellent across individual studies, with the best-performing models achieving an area under the curve (AUC) of 0.97, while some models showed only modest discrimination (AUC as low as 0.58).

Conclusion

AI models applied to ultrasound imaging suggest promising diagnostic performance across varied classification tasks for salivary gland tumors. Prospective multicenter studies with standardized protocols are required before clinical implementation.

Keywords

salivary gland tumors, ultrasound, artificial intelligence, deep learning, diagnostic accuracy

1. Introduction

Salivary gland tumors (SGTs) are rare, accounting for 3–6% of head and neck cancers.¹ These tumors occur infrequently in both adults and children, with a slightly higher incidence in females.^2,3 The parotid gland is most often affected (65–85% of cases), followed by the submandibular gland, minor salivary glands, and, least often, the sublingual gland.⁴ Benign tumors make up 70–80% of salivary gland tumors, with pleomorphic adenoma (PMA) comprising 60–70% of these, and Warthin tumor (WT) as the second most common benign type.⁴ Malignant tumors constitute 20–30% of adult cases, with mucoepidermoid carcinoma and adenoid cystic carcinoma (ACC) as the main subtypes.⁴ The World Health Organization (WHO) 2022 classification system recognizes 36 SGT subtypes,^5,6 each with unique clinical behavior, therapy, and prognosis, emphasizing the need for accurate diagnosis.^6,7

Fine-needle aspiration cytology (FNAC) is commonly used in the evaluation of salivary gland tumors (SGTs), but its sensitivity for differentiating benign from malignant lesions is variable (60%–86%) despite high specificity (91%–100%).^5,8-14 Ultrasonography (US), computed tomography (CT), and magnetic resonance imaging (MRI) are the primary imaging modalities for SGTs. CT scans provide limited anatomic detail due to restricted soft tissue resolution. MRI offers superior delineation of tumor margins, extra-glandular extension, and perineural spread due to its enhanced soft tissue differentiation; however, its use is constrained by cost.¹⁵ US is considered a first-line imaging modality for SGTs due to its affordability, speed, and convenience, although it remains operator-dependent and subject to variability.⁸ Its noninvasive nature, wide availability, and ability to provide real-time morphologic information make US particularly well suited for AI-based image analysis aimed at improving lesion characterization and reducing operator dependency.

Artificial Intelligence (AI) has recently emerged as a transformative tool in clinical practice, particularly in medical imaging, through the use of deep learning (DL) algorithms. Convolutional neural networks (CNNs), the most widely used DL architecture, analyze image data by automatically extracting and learning hierarchical features.¹⁶ This enables CNNs to identify subtle patterns that may be overlooked by human observers. AI-based diagnostic models have demonstrated promising results in various medical fields, including thyroid nodule detection on US,¹⁶ breast cancer detection using multimodal imaging,¹⁷ and head and neck oncology imaging.¹⁸ These developments suggest that AI may reduce operator dependency and improve diagnostic accuracy in the ultrasound evaluation of SGTs.

Although AI has advanced in medical imaging, its application in head and neck imaging is currently limited, with most studies focusing on CT rather than US.¹⁹ Research on AI for ultrasound-based SGT evaluation is scarce and typically conducted on a small scale.²⁰ There is little consolidated data on AI’s ability to distinguish benign from malignant SGTs via US. To address this, we conducted a systematic review to synthesize the current literature, assess the accuracy and clinical value of AI models, and outline key limitations and research needs.

2. Methods

2.1. PICO Format

The study protocol was registered in the International Prospective Register of Systematic Reviews (PROSPERO) (ID: CRD420251082909). This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.²¹ The PICO framework was defined as follows:

• Population (P): Patients with SGTs undergoing US imaging.

• Intervention (I): AI applications, including machine learning and DL methods, for diagnostic evaluation of salivary gland tumors.

• Comparison (C): Conventional diagnostic methods such as radiologist interpretation or histopathology (reference standard), when available.

• Outcomes (O): Diagnostic performance metrics (e.g., sensitivity, specificity, accuracy, precision, recall, F1-score, area under the curve [AUC]) and model validation outcomes.
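The outcome metrics listed above all derive from the same 2×2 table of model predictions against the histopathology reference. The following is a minimal sketch, assuming malignant is treated as the positive class; the class name `ConfusionCounts`, the function `diagnostic_metrics`, and the example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """2x2 counts against the reference standard (assumed: malignant = positive)."""
    tp: int  # malignant on histopathology, predicted malignant
    fp: int  # benign on histopathology, predicted malignant
    fn: int  # malignant on histopathology, predicted benign
    tn: int  # benign on histopathology, predicted benign


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return the standard diagnostic accuracy metrics as fractions in [0, 1]."""
    sensitivity = c.tp / (c.tp + c.fn)  # same quantity as recall
    specificity = c.tn / (c.tn + c.fp)
    precision = c.tp / (c.tp + c.fp)    # positive predictive value
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "recall": sensitivity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
example = ConfusionCounts(tp=18, fp=4, fn=2, tn=76)
metrics = diagnostic_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The sketch does not guard against empty cells (for example, a test set with no malignant cases), which would raise a division error. With the benign-heavy case mix typical of SGT cohorts, accuracy can remain high even when sensitivity is low, which is one reason to report sensitivity and specificity alongside it.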

2.2. Search Strategy

A comprehensive literature search was performed across PubMed, MEDLINE, OpenAIRE, ScienceDirect, and Springer Nature Journals from inception to August 2025. The search strategy combined Medical Subject Headings (MeSH) and free-text terms with Boolean operators. The following search string was applied:

(“salivary gland” OR “parotid” OR “submandibular gland” OR “sublingual gland” OR “salivary neoplasm*” OR “salivary tumor*”) AND (“ultrasound” OR “ultrasonography” OR “sonogram” OR “sonography” OR “US imaging”) AND (“artificial intelligence” OR “AI” OR “machine learning” OR “deep learning” OR “neural network*” OR “CNN” OR “convolutional network” OR “computer-aided diagnosis” OR “CAD”) AND (“diagnos*” OR “classification” OR “prediction” OR “detect*” OR “differentiation” OR “malignancy” OR “benign” OR “tumor type” OR “neoplasm type”).
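The search string above follows a common pattern: synonyms within a concept are joined with OR, and the four concept blocks are joined with AND. As an illustration only, the sketch below rebuilds the string from term lists so that the structure is explicit and reusable across databases. The names `CONCEPT_BLOCKS` and `build_query` are hypothetical, and the use of straight quotation marks is an assumption about how the string would be entered in a database interface.

```python
# Each inner list is one concept block; terms are copied from the search string.
CONCEPT_BLOCKS = [
    ["salivary gland", "parotid", "submandibular gland", "sublingual gland",
     "salivary neoplasm*", "salivary tumor*"],
    ["ultrasound", "ultrasonography", "sonogram", "sonography", "US imaging"],
    ["artificial intelligence", "AI", "machine learning", "deep learning",
     "neural network*", "CNN", "convolutional network",
     "computer-aided diagnosis", "CAD"],
    ["diagnos*", "classification", "prediction", "detect*", "differentiation",
     "malignancy", "benign", "tumor type", "neoplasm type"],
]


def build_query(blocks):
    """Join terms with OR inside each block, then join the blocks with AND."""
    groups = []
    for terms in blocks:
        quoted = " OR ".join(f'"{term}"' for term in terms)
        groups.append(f"({quoted})")
    return " AND ".join(groups)


query = build_query(CONCEPT_BLOCKS)
print(query)
```

Keeping the terms in lists makes it easier to document per-database adaptations, since truncation symbols and phrase-quoting rules differ between interfaces.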

Both prospective and retrospective studies were considered. Only articles published in English and reporting the application of AI for US-based assessment of SGTs were included. To ensure completeness, the reference lists of included articles were manually screened for additional eligible studies.

2.3. Inclusion and Exclusion Criteria

2.3.1. Inclusion Criteria

• Studies applying AI techniques (DL, machine learning, or radiomics) to ultrasound imaging of salivary gland tumors.

• Studies reporting diagnostic performance metrics (e.g., sensitivity, specificity, accuracy, AUC).

• Original research involving human participants.

2.3.2. Exclusion Criteria

• Studies not involving US imaging.

• Reviews, meta-analyses, editorials, letters, or case reports.

• Animal or phantom studies.

• Studies lacking diagnostic performance outcomes or without a reference standard.

2.4. Data Extraction and Risk of Bias Assessment

Two reviewers independently extracted data; discrepancies were resolved by a third reviewer. Extracted information included: author, year of publication, country, study design, sample size, patient demographics (if reported), US modality (e.g., B-mode, elastography, contrast-enhanced US), AI method (e.g., convolutional neural network [CNN], support vector machine [SVM]), model architecture, features extracted (radiomics, texture, shape, etc.), dataset split and validation strategy, comparator (radiologist, histopathology), and reported diagnostic outcomes.

Risk of bias was assessed using the QUADAS-2 tool for diagnostic accuracy studies. For AI development studies without direct clinical comparison, adapted criteria focusing on dataset representativeness, reference standard, and validation method were applied.

2.5. Data Synthesis

Due to heterogeneity in AI models, ultrasound (US) modalities, diagnostic targets, and reported outcomes, no meta-analysis was performed. Instead, a narrative synthesis was conducted. Diagnostic performance metrics (e.g., sensitivity, specificity, accuracy, AUC, precision, recall, F1-score) were summarized in tables and compared qualitatively across studies. Particular attention was given to variations in AI model architecture, dataset size, validation methods, and reference standards.

Heterogeneity was assessed qualitatively by evaluating differences in study design, AI methodologies, imaging protocols, diagnostic targets, and outcome reporting.

A limited quantitative synthesis (e.g., pooling of studies evaluating benign versus malignant classification) was considered. However, this approach was not feasible due to substantial variability in study objectives, including differences in classification tasks (binary vs multi-class), model outputs, and reporting formats.

Furthermore, most studies did not report sufficient statistical parameters (e.g., confidence intervals or standard errors for AUC), precluding reliable meta-analysis or construction of a forest plot.
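To illustrate what the missing parameters would look like, the sketch below approximates a standard error and a 95% confidence interval for a reported AUC using the Hanley and McNeil (1982) formula, which needs only the AUC and the numbers of positive and negative cases. This is an approximation under stated assumptions, not the method of any included study; the function names and the example inputs (AUC 0.90, 30 malignant, 120 benign) are hypothetical.

```python
import math


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Approximate standard error of an AUC (Hanley and McNeil, 1982)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


def wald_interval(auc: float, se: float, z: float = 1.96):
    """Symmetric normal-approximation interval, clipped to [0, 1]."""
    return max(0.0, auc - z * se), min(1.0, auc + z * se)


# Hypothetical inputs for illustration only.
se = hanley_mcneil_se(auc=0.90, n_pos=30, n_neg=120)
lower, upper = wald_interval(0.90, se)
print(f"SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The symmetric interval becomes unreliable when the AUC is close to 1 or the malignant group is small, so estimates derived this way would be a fallback for pooling rather than a substitute for intervals reported by the original authors.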

3. Results

3.1. Literature Selection

A total of 1,239 records were identified: 1,237 through electronic databases (Medline, n = 219; Springer, n = 925; OpenAIRE, n = 45; ScienceDirect, n = 5; PubMed, n = 43) and 2 from other sources. After removal of 156 duplicates, 1,083 unique records remained and underwent primary screening of titles and abstracts. Of these, 1,061 records were excluded, leaving 22 studies for full-text review. Following secondary screening, 17 studies met the inclusion criteria and were included in the final review (Figure 1).
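The screening counts can be checked with simple arithmetic. The short sketch below reproduces the flow using only the figures stated in this paragraph; the variable names are illustrative, and the number of full-text exclusions is derived here rather than reported.

```python
# Records identified per source, as stated in the text.
identified = {
    "Medline": 219,
    "Springer": 925,
    "OpenAIRE": 45,
    "ScienceDirect": 5,
    "PubMed": 43,
    "Other sources": 2,
}

total_identified = sum(identified.values())
duplicates_removed = 156
screened = total_identified - duplicates_removed
excluded_on_title_abstract = 1061
full_text_reviewed = screened - excluded_on_title_abstract
included = 17
excluded_at_full_text = full_text_reviewed - included  # derived, not reported

print(total_identified, screened, full_text_reviewed, excluded_at_full_text)
```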

Figure 1.

PRISMA flowchart showing the screening process of articles based on the study inclusion and exclusion criteria

3.2. Study Characteristics

This review included 17 retrospective studies (Table 1), all of which used histopathology as the reference standard for diagnosis. The studies were published between 2023 and 2025 and were conducted primarily in academic and tertiary-care settings in China (n = 13) and Taiwan (n = 4). Sample sizes ranged from 48 to 907 patients per study, yielding a combined population of 5,351 patients. The mean age of patients across the included studies ranged from the late 30s to the early 60s. In most studies, patients with malignant SGTs were slightly older than those with benign lesions. Warthin tumors were typically seen in older individuals, while pleomorphic adenomas were more common among younger adults. The sex distribution of participants is presented in Table 1. Overall, male patients were more frequently affected by Warthin tumors, consistent with previously reported epidemiologic patterns.

Table 1.

Study Characteristics

Author, year	Country	Study design	Population	Age (mean ± SD, years)	Sex (M/F)	Tumor types (gland, type of tumor)	Reference standard
Zhang et al, 2023²²	China	Retrospective	173	Training set: Benign tumor 46.2 ± 23.8; Malignant tumor 46.6 ± 28.4	Benign tumor: 99/79;	Benign and malignant salivary gland tumors	Pathology (histology)
Zhang et al, 2023²²	China	Retrospective	173	Test set: Benign tumor 44.9 ± 22.9; Malignant tumor 64.4 ± 26.7	Malignant tumor: 43/47	Benign and malignant salivary gland tumors	Pathology (histology)
Wei et al, 2024²³	China	Retrospective	582	Training set: BPT 49.5 ± 15.5, MPT 52.8 ± 15.8	Training set BPT: 123/107, MPT: 52/43	Benign and malignant salivary gland tumors	Pathology (histology)
				Validation set: BPT 52.8 ± 15.6, MPT 61.5 ± 14.5	Validation set BPT: 39/18, MPT: 14/10
				External test set 1: BPT 53.5 ± 13.6, MPT 56.6 ± 15.3	External test set 1; BPT: 76/32, MPT: 2/8
				External test set 2: BPT 55.0 ± 12.2, MPT 61.8 ± 13.7	External test set 2; BPT: 25/8, MPT 13/2
Jiang et al, 2024²⁴	China	Retrospective	907	Median training Benign tumor 56.0; Malignant tumor 59.0	542/365	Benign and malignant salivary gland tumors	Pathology (histology)
Liu et al, 2025²⁵	China	Retrospective	336	Benign tumor 52.4 ± 14.1; Malignant tumor 53.3 ± 17.6; WT 58.7 ± 9.2; PMA 47.4 ± 15.4	Benign tumor: 172/91; Malignant tumor: 31/42	Benign and malignant salivary gland tumors	Pathology (histology)
Cheng et al, 2023²⁶	Taiwan	Retrospective	337	Training set: Benign tumor 46.2 ± 23.8; Malignant tumor 46.6 ± 28.4	Benign tumor: 99/79;	Benign and malignant salivary gland tumors	Pathology (histology)
Cheng et al, 2023²⁶	Taiwan	Retrospective	337	Testing set: Benign tumor 44.9 ± 22.9; Malignant tumor 64.4 ± 26.7	Malignant tumor: 43/47	Benign and malignant salivary gland tumors	Pathology (histology)
Cheng et al, 2025²⁷	Taiwan	Retrospective	294	Training 53 ± 14; Test 53 ± 15		Benign: PMA 37–48%, WT 26–41%, others	Pathology (histology)
Wang et al, 2024²⁸	China	Retrospective	526	51.7 ± 15.2 (range 12–87)	58% male, 42% female	PMA (207), WT (83), others	Pathology (histology)
Shan et al, 2024²⁹	China	Retrospective	48	Benign tumor 55.8 ± 14.5; Malignant tumor 62.4 ± 11.2	283/243	Parotid gland: PMA, WT, mucoepidermoid carcinoma, etc.	Pathology (histology)
Liao et al, 2024³⁰	Taiwan	Retrospective	122	53 (range 21–93)	87/169 overall	Salivary gland tumors — PMA, WT, adenoma, carcinoma	Pathology (histology)
Tu et al, 2023³¹	Taiwan	Retrospective	638	Benign tumor 52.0 (41.0–62.0); Malignant tumor 58.5 (43.5–69.0)	73/49	PMA, WT, carcinoma	Pathology (histology)
He et al, 2025³²	China	Retrospective	315	44 years (Benign tumor 42; Malignant tumor 50)	80/60	PMA, WT, mucoepidermoid carcinoma	Pathology (histology)
He et al, 2024³³	China	Retrospective	203	PMA: 40.9 ± 14.3	PMA: 37/50	Benign parotid gland tumors — PMA (n=99), WT (n=104)	Pathology (histology)
Li et al, 2025³⁴	China	Retrospective	91	PMA: 46.6 ± 21. WT: 61.3 ± 8.3	PMA: 23/29; WT: 33/6	Benign parotid gland tumors — PMA (n=52), WT (n=39)	Pathology (histology)
Liu et al, 2024³⁵	China	Retrospective	488	Overall mean 52 ± 13.8	PMA: 93/173; WT: 212/10	Benign parotid gland tumors — PMA (n=266), WT (n=222)	Pathology (histology)
Su et al, 2025³⁶	China	Retrospective	256	38.5 (range 4–76); Training cohort 38; Validation cohort 39	18/18; Malignant tumor: 10/2	Salivary gland PMA, stromal and epithelial variants	Pathology (histology)
Xia et al, 2025³⁷	China	Retrospective	140	<50 years: 24; >50 years: 116	Benign tumor: 336/222; Malignant tumor: 48/32	Primary epithelial malignancies (mucoepidermoid, adenocarcinoma)	Pathology (histology)
Su et al, 2025³⁸	China	Retrospective	365	∼50 years	N/A	ACC vs non-ACC salivary gland tumors	Pathology (histology)

ACC, adenoid cystic carcinoma; BPT, benign parotid gland tumor; MPT, malignant parotid gland tumor; PMA, pleomorphic adenoma; WT, Warthin tumor.

3.3. Diagnostic Targets

The diagnostic focus varied across studies (Table 2). Eleven investigations^22-32 aimed to differentiate between benign and malignant SGTs. Three studies^25,33,34 specifically addressed the distinction between pleomorphic adenoma and Warthin tumor, reflecting a more refined diagnostic challenge within benign subtypes. One study³⁶ explored the preoperative classification of stromal subtypes in pleomorphic adenoma, while another³⁷ focused on distinguishing primary from secondary salivary gland malignancies. The remaining studies incorporated both benign and malignant tumors, providing a broader context for evaluating AI performance across diverse histopathological entities. Among benign tumors, pleomorphic adenoma and Warthin tumor were consistently the most frequent, followed by less common entities such as basal cell adenoma and oncocytoma. Malignant tumors included a range of histologies, most notably mucoepidermoid carcinoma. A few studies also reported cases of metastatic or lymphoid malignancies involving the salivary glands. One study specifically focused on differentiating adenoid cystic carcinoma (ACC) from non-ACC salivary gland tumors, representing a subtype-level diagnostic classification task.³⁸

Table 2.

Artificial Intelligence Diagnostic Performance

Author, year	AI type	Features Extracted	Sensitivity (%)	Specificity (%)	Accuracy	Area under the curve (AUC)	Validation
Zhang et al, 2023²²	CNN	Retrospective	97.2%	94.4%	95.8%	0.958	Training/testing split (132 training, 41 testing). No external validation
Wei et al, 2024²³	CNN	Retrospective	DenseNet201 49.5 (train), 12.5 (val), 20.0 (ext1), 6.7 (ext2)	DenseNet201: 91.7, 93.0, 95.4, 87.9	DenseNet201 79.4, 69.1, 83.6, 62.5	DenseNet201 0.822, 0.780, 0.759, 0.733	Internal validation + 2 external test sets
Jiang et al, 2024²⁴	CNN	Retrospective	ResNet18 78.2 (internal), 83.3 (external)	ResNet18 92.7 (internal), 90.6 (external)	ResNet18 88.5 (internal), 89.8 (external)	ResNet18 0.947 (internal), 0.925 (external)	Internal (train/validation/test split with 5-fold CV) + External (second hospital cohort, n=88)
Liu et al, 2025²⁵	Machine learning/radiomics — two-step models: LASSO-BNB (Bernoulli Naive Bayes) and RFE-Voting (ensemble of RF, ET, HGBoost)	Retrospective	LASSO-BNB 92.5	LASSO-BNB 66.7	LASSO-BNB 86.8	LASSO-BNB 0.910	Internal 5-fold CV
Liu et al, 2025²⁵		Retrospective	RFE-Voting 84.0	RFE-Voting 82.1	RFE-Voting 83.0	RFE-Voting 0.962	Internal 5-fold CV
Cheng et al, 2023²⁶	CNN	105 radiomic features (shape, first-order, GLCM, GLDM, GLRLM, GLSZM, NGTDM); 10 selected via LASSO + ANOVA	78%	94.4%	ResNet50V2 89.0	ResNet50V2 0.920 (5-fold validation)	All three used internal 5-fold CV (training/validation, 264 pts)
Cheng et al, 2023²⁶	CNN		78%	94.4%	ResNet50V2 89.0	ResNet50V2 0.920 (5-fold validation)	Independent test cohort (73 pts, 163 images) used for ResNet50V2 reporting
Cheng et al, 2025²⁷	DLR + ExtraTrees hybrid model	Radiomic + deep learning fusion features (transfer learning).	DLR Nomogram (best model) 87.5%	86.60%	95.8%	0.850	Train 368; Test 158 (7:3 split); no external validation
Wang et al, 2024²⁸	SVM	CEUS and laboratory features.	91%	81%	86.40%	0.934	70/30 split; 5-fold CV
Shan et al, 2024²⁹	SVM (histogram + clinical features)	Ultrasound morphologic and histogram features.	Best model (SVM) 86.2%	not stated	79.80%	0.930	Training 177, Validation 79 (10-fold CV in training set)
Liao et al, 2024³⁰	CNN (custom architecture)	Deep CNN features (VGG16 backbone).	100%	87%	93%	NR	Training 500 images (250 benign/250 malignant); Testing 62 images (31 benign/31 malignant)
Tu et al, 2023³¹	DeepSGT transformer	Radiomic + deep learning fusion features.	60%	91%	80%	NR	111 patients (train + internal validation, 7:3 split); 29 patients external test set
He et al, 2025³²	Radiomics + machine learning (Random Forest, best model)	105 radiomic features (shape, first-order, GLCM, GLDM, GLRLM, GLSZM, NGTDM); 10 selected via LASSO + ANOVA	78%	92%	90%	0.94	Train 235 (198 benign/37 malignant); Test 59 (50B/9M); internal val. 27 (24 benign/3 malignant)
He et al, 2024³³	Ultrasound-based Ensemble Machine Learning model	Retrospective	89.3% (internal validation); 83.3% (external validation)	87.5% (internal); 83.3% (external)	86.5% (internal validation)	0.891 (internal); 0.833 (external)	Institution 1 (n=173) split 7:3 for training and internal validation; Institution 2 (n=30) used for external validation
Li et al, 2025³⁴	CNN	Retrospective	MobileNetV3 86.8%	MobileNetV3 87.3%	MobileNetV3 87.0%	MobileNetV3 0.946	Internal split (training 70%, validation 10%, testing 20% at patient level)
Liu et al, 2024³⁵	CNN	Retrospective	ResNet50 73.6%	ResNet50 90.4%	ResNet50 83.3%	ResNet50 0.908	Internal split (70% training, 20% validation, 10% testing); also separate “indeterminate FNAC” group tested (192 patients, 6:2:2 split)
Su et al, 2025³⁶	Machine learning models (best-performing: SVM)	Six LASSO-selected features (lesion size, shape, cystic areas, vascularity, mean, and skewness)	not stated	not stated	79.8%	0.827	Training cohort (n=177) and validation cohort (n=79); 10-fold cross-validation
Xia et al, 2025³⁷	Deep learning CNN (ResNet50d with Focal Loss)	Automatically learned deep grayscale features.	92.90%	89.20%	91.10%	0.807	70:30 split (221 training, 94 testing)
Su et al, 2025³⁸	Machine learning (SVM best model)	Clinical + ultrasound features	90.0%	91.84%	91.53%	0.913	Internal + external validation

ANOVA, analysis of variance; CEUS, contrast-enhanced ultrasound; CNN, convolutional neural network; CV, cross validation; DLR, deep learning-based radiomics; ET, ExtraTrees; FNAC, fine needle aspiration cytology; GLCM, gray-level co-occurrence matrix; GLDM, gray-level dependence matrix; GLRLM, gray-level run length matrix; GLSZM, gray-level size zone matrix; LASSO, least absolute shrinkage and selection operator; NGTDM, neighborhood gray-tone difference matrix; NR, not reported; pts, patients; RF, Random Forest; val., validation.

This range of diagnostic objectives underscores the versatility of AI methodologies in addressing both broad and nuanced classification tasks within salivary gland tumor imaging.

3.4. AI Models Utilized

The majority of studies^22-26,32,34 employed CNNs as the primary classification model. Several CNN architectures, including ResNet18, MobileNetV3Small, and InceptionV3, were tested, demonstrating the adaptability of DL frameworks to heterogeneous imaging datasets.³⁴

A subset of studies implemented ensemble or multi-model machine learning strategies to enhance predictive performance. For instance, one study used an ensemble learning approach,³³ while another evaluated eight algorithms (Decision Tree, Random Forest, AdaBoost, XGBoost, Artificial Neural Network, SVM, Naïve Bayes, and K-Nearest Neighbors) alongside Logistic Regression.³⁵ Similarly, one study used a two-step radiomics–machine learning pipeline combining Least Absolute Shrinkage and Selection Operator-Bernoulli Naïve Bayes (LASSO-BNB) and Recursive Feature Elimination (RFE)-Voting (an ensemble of Random Forest, Extra Trees, and Hyperoptimized Gradient Boosting [HGBoost]) to improve diagnostic robustness.²⁵
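As a rough illustration of how a voting ensemble combines several classifiers, the sketch below averages per-case malignancy probabilities from three models and applies a threshold. It assumes soft voting with equal weights and a 0.5 cut-off; the included study does not state these details here, and the function name `soft_vote` and all probabilities are hypothetical.

```python
def soft_vote(prob_by_model: dict, threshold: float = 0.5):
    """Average per-case probabilities across models, then threshold.

    prob_by_model maps a model name to a list of predicted probabilities,
    one per case, with all lists in the same case order.
    """
    model_probs = list(prob_by_model.values())
    n_cases = len(model_probs[0])
    averaged = [
        sum(probs[i] for probs in model_probs) / len(model_probs)
        for i in range(n_cases)
    ]
    labels = [1 if p >= threshold else 0 for p in averaged]
    return averaged, labels


# Hypothetical probabilities of malignancy for three cases.
predictions = {
    "random_forest": [0.8, 0.4, 0.1],
    "extra_trees": [0.6, 0.5, 0.2],
    "gradient_boosting": [0.7, 0.3, 0.3],
}
averaged, labels = soft_vote(predictions)
print(averaged, labels)
```

Averaging tends to dampen the idiosyncratic errors of any single model, which is the usual motivation for ensembles of this kind.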

Radiomics-based and hybrid DL approaches further demonstrated the advantage of integrating handcrafted radiomic features with deep learning representations.^27-31 Additionally, one study evaluated multiple machine learning algorithms, including logistic regression, decision tree, random forest, XGBoost, and support vector machine (SVM), with SVM demonstrating the best diagnostic performance (Table 2).^27-31

3.5. Diagnostic Performance in Characterizing SGTs

The included studies reported variable diagnostic performance, with AUC values ranging from 0.575 to 0.97 across individual models. This range reflects variability between studies and models rather than a pooled estimate.
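For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case, with ties counted as one half. The sketch below computes it directly from that definition; the function name `auc_from_scores` and the scores are hypothetical, and malignant is assumed to be the positive class.

```python
def auc_from_scores(pos_scores, neg_scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for illustration only.
malignant_scores = [0.9, 0.8, 0.4]
benign_scores = [0.7, 0.3, 0.2, 0.1]
auc = auc_from_scores(malignant_scores, benign_scores)
print(f"AUC = {auc:.3f}")
```

Under this reading, an AUC of 0.5 corresponds to chance-level ranking, so values near 0.58 indicate only marginal discrimination, whereas values near 0.97 indicate that almost every malignant case outranks almost every benign one.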

Studies differentiating benign and malignant SGTs showed peak performances, with AUCs of 0.958 (CNN),²² 0.972 (SVM variants),²⁹ 0.94 (DeepSGT), and 0.92–0.93 (DenseNet/DenseNet-like models),³² while several studies reported lower values (model range 0.67–0.852; VGG16 0.618; 0.77–0.85).^23,26,27 Notably, fusion and ensemble approaches (DLR fusion 0.916; nomogram 0.934; RFE-Voting 0.962)^25,28 and well-trained CNNs (ResNet18/50, DenseNet variants) generally produced the highest AUCs, underscoring both the promise and heterogeneity of AI performance across studies.^24,26,32

Three studies specifically addressed the distinction between pleomorphic adenoma and Warthin tumor, representing a more refined diagnostic challenge within benign subtypes.^25,33,34 These studies achieved AUCs between 0.81 and 0.95, suggesting that AI-based image analysis can reliably differentiate benign salivary gland lesions with overlapping radiologic features.

In the study classifying stromal subtypes (stroma-low vs stroma-high) within pleomorphic adenomas, SVM showed the best discrimination (AUC 0.827), with Logistic Regression close behind (AUC 0.818).³⁶ Decision Tree (AUC 0.649) and KNN (AUC 0.575) performed substantially worse, suggesting that linear/regularized models were better suited to this task.

For differentiation of primary salivary gland malignancies from secondary metastatic tumors, conventional US alone performed poorly (AUC 0.421), radiomics improved discrimination (AUC 0.636), and DL using ResNet50 performed better (AUC 0.763). A radiomics–deep learning fusion with a Multilayer Perceptron achieved the highest AUC (0.807), whereas the simple combined model yielded intermediate performance (AUC 0.711), indicating that modality-specific feature integration may provide the greatest diagnostic gain.³⁷

In the ACC-focused study, the SVM model achieved strong diagnostic performance, with an AUC of up to 0.913 in the external validation cohort.³⁸

In summary, AI techniques applied to US imaging of SGTs achieved generally high diagnostic performance across diverse architectures and validation settings. Despite heterogeneity in model design and reporting, these findings support the potential utility of AI in augmenting conventional ultrasonographic assessment, pending further prospective, multicenter validation.

3.6. Study Quality Assessment

Risk of bias was low in most QUADAS-2 domains across the included studies (Table 3). The Index Test domain represented the most frequent potential source of bias, primarily due to insufficient reporting of blinding procedures during test interpretation and limited detail regarding the conduct of the index test. Applicability concerns were minimal across all domains, supporting the relevance and generalizability of the included studies to the review question.

Table 3.

QUADAS-2 Risk of Bias and Applicability Assessment

Author, year	Patient selection	Index test	Reference standard	Flow & timing	Applicability: Patient selection	Applicability: Index test	Applicability: Reference standard
Zhang et al, 2023²²	Low	Unclear	Low	Low	Low	Low	Low
Wei et al, 2024²³	Low	Unclear	Low	Low	Low	Low	Low
Jiang et al, 2024²⁴	Low	Unclear	Low	Low	Low	Low	Low
Liu et al, 2025²⁵	Low	Unclear	Low	Low	Low	Low	Low
Cheng et al, 2023²⁶	Low	Unclear	Low	Low	Low	Low	Low
Cheng et al, 2025²⁷	Low	Unclear	Low	Low	Low	Low	Low
Wang et al, 2024²⁸	Low	Low	Low	Low	Low	Low	Low
Shan et al, 2024²⁹	Low	Unclear	Low	Low	Low	Low	Low
Liao et al, 2024³⁰	Unclear	Low	Low	Unclear	Low	Low	Low
Tu et al, 2023³¹	Low	Low	Low	Low	Low	Low	Low
He et al, 2025³²	Low	Low	Low	Low	Low	Low	Low
He et al, 2024³³	Low	Low	Low	Low	Low	Low	Low
Li et al, 2025³⁴	Low	Unclear	Low	Low	Low	Low	Low
Liu et al, 2024³⁵	Low	Unclear	Low	Low	Low	Low	Low
Su et al, 2025³⁶	Low	Low	Low	Low	Low	Low	Low
Xia et al, 2025³⁷	Low	Low	Low	Low	Low	Low	Low
Su et al, 2025³⁸	Low	Low	Low	Low	Low	Low	Low

3.7. Limitations

The included studies share several recurrent limitations that constrain generalizability and clinical translation: most were retrospective and single-center or limited to 1–2 centers, many had small or unevenly distributed samples, and external or multicenter validation was generally lacking. Technical and methodological issues were common, including reliance on static or grayscale ultrasound images (without elastography or dynamic information), heterogeneous ultrasound equipment and image acquisition, operator dependency, and variable preprocessing or labeling procedures. Several studies also had a relatively small number of malignant tumors, and a few reported potential bias from case selection and class imbalance. Taken together, these weaknesses underscore the need for larger, prospective, multicenter datasets with standardized acquisition and external validation.

4. Discussion

The parotid gland, the largest salivary gland, is located in front of the ear on each side of the face. Although most parotid tumors are benign, they are heterogeneous in nature and carry a risk of recurrence or malignant transformation. Consequently, precise characterization of parotid tumors is essential for accurate diagnosis and appropriate management.²² Accurate preoperative characterization is critical, as benign lesions are generally managed with conservative intervention, whereas malignant tumors may necessitate more invasive options, including radical neck dissection. Misdiagnosis risks both overtreatment of benign disease and undertreatment of malignancies, underscoring the need for precise diagnostic tools.

This systematic review aimed to summarize the evidence on the diagnostic value of US-based AI models for differentiating benign from malignant SGTs. Across the included studies, AI models generally performed well, with reported sensitivities ranging from 81 to 100%, specificities from 66.7 to 100%, and AUCs between 0.86 and 0.96. These findings suggest that AI-augmented ultrasound has potential as a reliable diagnostic tool in the preoperative evaluation of SGTs.

Clinical background information and US findings both play pivotal roles in reaching an accurate diagnosis. Zhang et al combined US and clinical data and achieved the highest performance in our review, with a sensitivity of 97.2%, specificity of 94.4%, accuracy of 95.8%, and an AUC of 0.958.²² Similarly, Li et al reported strong performance using MobileNetV3, with accuracy and AUC exceeding 0.87 and 0.94, respectively, while also surpassing radiologists’ performance.³⁴ He et al introduced a US-based ensemble machine learning model that achieved high internal (AUC 0.891) and external (AUC 0.833) validation results and provided positive and negative predictive value estimates that reinforce its clinical reliability.³³ Su et al likewise developed a machine learning model integrating clinical and ultrasound features to differentiate adenoid cystic carcinoma from other salivary gland tumors, achieving high diagnostic performance with external validation.³⁸ By achieving accuracies and AUC values comparable to those of radiologists, these models may serve as supportive adjuncts, potentially enhancing diagnostic confidence, reducing operator dependency, and supporting timely differentiation between benign and malignant SGTs. Importantly, their reliable sensitivity and predictive values suggest a role in guiding surgical planning and reducing the risks of both overtreatment and undertreatment.

Complementing US-based AI studies, a multicenter retrospective study aimed to develop and validate DL models on contrast-enhanced CT to assist radiologists in distinguishing benign from malignant parotid tumors. Using data from 573 histopathology-confirmed cases across two centers, the authors trained six CNN architectures on arterial-phase CT images and constructed a baseline SVM model integrating clinical-radiological features with handcrafted radiomics signatures. The performance of senior and junior radiologists with and without assistance from the optimal model was compared. MobileNetV3 delivered the best performance and significantly exceeded the SVM, and model assistance improved the clinical benefit and overall efficiency of a junior radiologist.²⁰ In support of these findings, Zheng et al evaluated CT-based radiomics models for differentiating benign from malignant parotid tumors in a cohort of 388 patients. Radiomics features were extracted from non-contrast, arterial, and venous phase CT images, and models including Logistic Regression, SVM, and Random Forest were developed. The Random Forest model demonstrated the highest diagnostic performance, with an AUC of 0.91 in the test cohort, while combining radiomics with clinical features further improved predictive accuracy.³⁹ These results suggest that AI-augmented imaging, not only with ultrasound but also with CT, may serve as a non-invasive, reliable tool for preoperative tumor characterization.

The limitations of the available studies must be acknowledged. Most studies were retrospective and single-center, with relatively small sample sizes. Substantial heterogeneity in AI architectures, imaging protocols, and reported performance metrics precluded formal meta-analysis and limited direct comparability across studies. Therefore, the current findings should be interpreted with caution, given the absence of quantitative pooling.

Future research should prioritize prospective, multicenter trials with standardized ultrasound acquisition protocols and transparent reporting of AI development and validation. Incorporating advanced techniques such as elastography, radiomics, and multimodal data fusion (for example, combining clinical and imaging data) could further enhance diagnostic reliability.

5. Conclusion

This systematic review suggests that AI-assisted interpretation of ultrasound imaging may achieve high diagnostic performance in differentiating benign from malignant salivary gland tumors (SGTs), and in some studies may outperform radiologist assessment. Our findings are consistent with recent developments in the field and highlight the potential role of AI in supporting preoperative decision-making. Nonetheless, the heterogeneity of methodologies, limited external validation, and lack of prospective data warrant cautious interpretation and indicate the need for further rigorous evaluation before clinical implementation.

Footnotes

Acknowledgments

The authors acknowledge the use of generative artificial intelligence tools (ChatGPT, OpenAI) for assistance with language editing and clarity. The authors take full responsibility for the content of this manuscript.

ORCID iDs

Abdullah Binghaith

Noura Farhan Alanazi

Mohammed Sulaiman Alsayyari

Hassan Alshurafa

Ethical Considerations

Ethical approval was not required for this study as it is a systematic review of previously published literature.

Consent to Participate

Patient consent was not required as no individual patient data or identifiable information were included in this systematic review. The authors have nothing to report.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data sharing is not applicable to this article as no new data were generated or analyzed in this study.

Trial Registration

This systematic review was registered with the International Prospective Register of Systematic Reviews (PROSPERO) under registration number CRD420251082909.

Permission to Reproduce Material

No previously published material requiring permission was reproduced in this manuscript.

References

1. Steuer, Hanna, Viswanathan, et al. The evolving landscape of salivary gland tumors. CA Cancer J Clin. 2023;73(6):597-619. doi:10.3322/CAAC.21807.

2. Pérez-De-Oliveira, Sousa-Neto, Vargas. Epithelial salivary gland neoplasms in pediatric patients: A comprehensive review. Med Oral Patol Oral Cir Bucal. 2025;30(3):e440-e445. doi:10.4317/MEDORAL.26983.

3. Quixabeira Oliveira, Pérez-De-Oliveira, Robinson, et al. Epithelial salivary gland tumors in pediatric patients: An international collaborative study. Int J Pediatr Otorhinolaryngol. 2023;168:111519. doi:10.1016/J.IJPORL.2023.111519.

4. Mckenzie, Lockyer, Singh, Nguyen. Salivary gland tumours: an epidemiological review of non-neoplastic and neoplastic pathology. Br J Oral Maxillofac Surg. 2023;61(1):12-18. doi:10.1016/J.BJOMS.2022.11.281.

5. Pusztaszeri, Deschler, Faquin. The 2021 ASCO guideline on the management of salivary gland malignancy endorses FNA biopsy and the risk stratification scheme proposed by the Milan System for Reporting Salivary Gland Cytopathology. Cancer Cytopathol. 2023;131(2):83-89. doi:10.1002/CNCY.22678.

6. Seethala. New Entities and Concepts in Salivary Gland Tumor Pathology: The Role of Molecular Alterations. Arch Pathol Lab Med. 2024;148(11):1183-1195. doi:10.5858/ARPA.2023-0001-RA.

7. Nishida, Kusaba, Kawamura, Oyama, Daa. Histopathological Aspects of the Prognostic Factors for Salivary Gland Cancers. Cancers (Basel). 2023;15(4):1236. doi:10.3390/CANCERS15041236.

8. Geiger, Ismaila, Beadle, et al. Management of Salivary Gland Malignancy: ASCO Guideline. J Clin Oncol. 2021;39(17):1909-1941. doi:10.1200/JCO.21.00449.

9. Ramírez-Pérez, González-García, Hernández-Vila, Monje-Gil, Ruiz-Laza. Is fine-needle aspiration a reliable tool in the diagnosis of malignant salivary gland tumors? J Craniomaxillofac Surg. 2017;45(7):1074-1077. doi:10.1016/J.JCMS.2017.03.019.

10. Tryggvason, Gailey, Hulstein, et al. Accuracy of fine-needle aspiration and imaging in the preoperative workup of salivary gland mass lesions treated surgically. Laryngoscope. 2013;123(1):158-163. doi:10.1002/LARY.23613.

11. Yariv, Popovtzer, Wasserzug, et al. Usefulness of ultrasound and fine needle aspiration cytology of major salivary gland lesions. Am J Otolaryngol. 2020;41(1):102293. doi:10.1016/J.AMJOTO.2019.102293.

12. Rammeh, Romdhane, Ksentini, et al. Accuracy of fine-needle aspiration cytology in the diagnosis of salivary gland masses according to the Milan reporting system and to an in-house system. Diagn Cytopathol. 2021;49(4):528-532. doi:10.1002/DC.24682.

13. Kurasawa, Sato, Saito, et al. The accuracy of fine needle aspiration cytology in the clinical diagnosis of minor salivary gland tumours. Int J Oral Maxillofac Surg. 2021;50(11):1408-1412. doi:10.1016/J.IJOM.2021.02.001.

14. Kim, Hyeon, Ryu, et al. Diagnostic accuracy of fine needle aspiration cytology for high-grade salivary gland tumors. Ann Surg Oncol. 2013;20(7):2380-2387. doi:10.1245/S10434-013-2903-Z.

15. Kong, Han. The diagnostic role of ultrasonography, computed tomography, magnetic resonance imaging, positron emission tomography/computed tomography, and real-time elastography in the differentiation of benign and malignant salivary gland tumors: a meta-analysis. Oral Surg Oral Med Oral Pathol Oral Radiol. 2019;128(4):431-443.e1. doi:10.1016/J.OOOO.2019.06.014.

16. Cao, Tong, et al. Artificial intelligence in thyroid ultrasound. Front Oncol. 2023;13:1060702. doi:10.3389/FONC.2023.1060702.

17. Khamparia, Bharati, Podder, et al. Diagnosis of breast cancer based on modern mammography using hybrid transfer learning. Multidimens Syst Signal Process. 2021;32(2):747-765. doi:10.1007/S11045-020-00756-7.

18. Pham, Teh, Chatzopoulou, Holmes, Coulthard. Artificial Intelligence in Head and Neck Cancer: Innovations, Applications, and Future Directions. Curr Oncol. 2024;31(9):5255-5290. doi:10.3390/CURRONCOL31090389.

19. Mastella, Calderoni, Manco, et al. A systematic review of the role of artificial intelligence in automating computed tomography-based adaptive radiotherapy for head and neck cancer. Phys Imaging Radiat Oncol. 2025;33:100731. doi:10.1016/J.PHRO.2025.100731.

20. Ning, Wang, et al. Deep learning-assisted diagnosis of benign and malignant parotid tumors based on contrast-enhanced CT: a multicenter study. Eur Radiol. 2023;33(9):6054-6065. doi:10.1007/S00330-023-09568-2.

21. Page, McKenzie, Bossuyt, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. doi:10.1136/BMJ.N71.

22. Zhang, Zhu, Huang, et al. A deep learning model for the differential diagnosis of benign and malignant salivary gland tumors based on ultrasound imaging and clinical data. Quant Imaging Med Surg. 2023;13(5):2989-3000. doi:10.21037/QIMS-22-950.

23. Wei, Xia, et al. Deep learning-assisted diagnosis of benign and malignant parotid gland tumors based on automatic segmentation of ultrasound images: a multicenter retrospective study. Front Oncol. 2024;14:1417330. doi:10.3389/FONC.2024.1417330.

24. Jiang, Chen, Zhou, et al. Deep learning-assisted diagnosis of benign and malignant parotid tumors based on ultrasound: a retrospective study. BMC Cancer. 2024;24(1):510. doi:10.1186/S12885-024-12277-8.

25. Liu, Xiao, Yang, et al. Diagnosis of parotid gland tumors using a ternary classification model based on ultrasound radiomics. Front Oncol. 2025;15:1485393. doi:10.3389/FONC.2025.1485393.

26. Cheng, Chiang HHK. Diagnosis of Salivary Gland Tumors Using Transfer Learning with Fine-Tuning and Gradual Unfreezing. Diagnostics (Basel). 2023;13(21):3333. doi:10.3390/DIAGNOSTICS13213333.

27. Cheng, Liao, Chiang. Diagnosis of Salivary Gland Tumors Using Ultrasound Radiomics. Ultrasound Med Biol. 2025;51(5):815-822. doi:10.1016/J.ULTRASMEDBIO.2025.01.008.

28. Wang, Gao, Yin, Wen, Sun, Han. Differentiation of benign and malignant parotid gland tumors based on the fusion of radiomics and deep learning features on ultrasound images. Front Oncol. 2024;14:1384105. doi:10.3389/FONC.2024.1384105.

29. Shan, Yang, Liu, Sun, Chen, Zhu. Machine Learning Differentiates Between Benign and Malignant Parotid Tumors With Contrast-Enhanced Ultrasound Features. J Oral Maxillofac Surg. 2025;83(2):208-221. doi:10.1016/J.JOMS.2024.10.018.

30. Liao, Cheng, Chan. Machine Learning on Ultrasound Texture Analysis Data for Characterizing of Salivary Glandular Tumors: A Feasibility Study. Diagnostics (Basel). 2024;14(16):1761. doi:10.3390/DIAGNOSTICS14161761.

31. Wang, Sen, et al. Neural network combining with clinical ultrasonography: A new approach for classification of salivary gland tumors. Head Neck. 2023;45(8):1885-1893. doi:10.1002/HED.27396.

32. Zhou, Zhu, Pan. Ultrasound-based deep learning to differentiate salivary gland tumors. Oral Surg Oral Med Oral Pathol Oral Radiol. 2025;140(2):227-236. doi:10.1016/J.OOOO.2025.03.014.

33. Zheng, Peng, et al. An ultrasound-based ensemble machine learning model for the preoperative classification of pleomorphic adenoma and Warthin tumor in the parotid gland. Eur Radiol. 2024;34(10):6862-6876. doi:10.1007/S00330-024-10719-2.

34. Zou, Zhou, Long, Liu, Yao. Deep Learning Based on Ultrasound Images Differentiates Parotid Gland Pleomorphic Adenomas and Warthin Tumors. Ultrason Imaging. 2025;47(3-4):107-114. doi:10.1177/01617346251319410.

35. Liu, Miao, Qian, et al. Deep learning based ultrasound analysis facilitates precise distinction between parotid pleomorphic adenoma and Warthin tumor. Front Oncol. 2024;14:1337631. doi:10.3389/FONC.2024.1337631.

36. Yang, Hong, et al. Machine learning model for preoperative classification of stromal subtypes in salivary gland pleomorphic adenoma based on ultrasound histogram analysis. BMC Oral Health. 2025;25(1):898. doi:10.1186/S12903-025-06298-3.

37. Xia, Huang, et al. Ultrasound-Based Deep Learning Radiomics Models for Predicting Primary and Secondary Salivary Gland Malignancies: A Multicenter Retrospective Study. Bioengineering (Basel). 2025;12(4):391. doi:10.3390/BIOENGINEERING12040391.

38. Hong, et al. Machine learning model for diagnosing salivary gland adenoid cystic carcinoma based on clinical and ultrasound features. Insights Imaging. 2025;16(1):96. doi:10.1186/S13244-025-01974-Y.

39. Zheng, Zhou, Liu, Wen. CT-based radiomics analysis of different machine learning models for differentiating benign and malignant parotid tumors. Eur Radiol. 2022;32(10):6953-6964. doi:10.1007/S00330-022-08830-3.