Abstract
In recent years, the emergence of new-generation deep learning-based artificial intelligence (AI) tools has reignited enthusiasm about the potential of computer-assisted detection (CADe) and diagnosis (CADx) for screening mammography. For screening mammography, digital breast tomosynthesis (DBT) combined with acquired digital 2D mammography or synthetic 2D mammography is widely used throughout the United States. As of this writing in July 2024, there are six Food and Drug Administration (FDA)-cleared AI-based CADe/x tools for DBT. These tools detect suspicious lesions on DBT and provide corresponding scores at the lesion and examination levels that reflect likelihood of malignancy. In this article, we review the evidence supporting the use of AI-based CADe/x for DBT. The published literature on this topic consists of multireader, multicase studies, retrospective analyses, and two “real-world” evaluations. These studies suggest that AI-based CADe/x could lead to improvements in sensitivity without compromising specificity and to improvements in efficiency. However, the overall published evidence is limited and includes only two small postimplementation clinical studies. Prospective studies and careful postimplementation clinical evaluation will be necessary to fully understand the impact of AI-based CADe/x on screening DBT outcomes.
Introduction
The development of artificial intelligence (AI) applications in breast imaging has expanded at a rapid pace. As of this writing in July 2024, the U.S. Food and Drug Administration (FDA) has cleared more than 20 AI products for mammography, which include tools for computer-assisted detection (CADe) and diagnosis (CADx), triage (CADt), breast density assessment, and breast arterial calcification detection and quantification.1–3 Despite the recent heightened interest in CAD due to AI, it has a long-standing history in breast imaging, with the first CADe tool approved by the FDA in 1998. 4 Following reimbursement approval by the Centers for Medicare & Medicaid Services in 2002, CADe was rapidly implemented into clinical settings across the United States, with its use increasing from less than 5% in 2001 to 74% in 2008 and 83% in 2012.5–7
Based on early reader and single-center retrospective studies, CADe was initially thought to be a promising tool that would assist radiologists and improve interpretation accuracy. 8 Subsequent large-scale studies evaluating CADe in routine clinical practice showed that it fails to improve any screening performance metric.6,7 In particular, CADe has been criticized for generating multiple annotations per case, resulting in high rates of false-positives, which has led to radiologists disregarding most of its marks. 9 A postimplementation Breast Cancer Surveillance Consortium study, including 495,818 mammograms interpreted with CADe and 129,807 without CADe by 271 radiologists from 66 U.S. facilities, found that cancer detection rates were the same with and without CADe (4.1 per 1000). 7 Further research showed that use of CADe increased interpretation times by 19%. 10 Thus, despite early promise and enthusiasm, there have been no improvements in radiologist performance nor efficiency gains in clinical practice using traditional CADe tools applied to screening mammography.
The development of new-generation deep learning-based AI tools that learn from large datasets and automatically analyze mammographic images to extract features associated with breast cancer has reignited enthusiasm surrounding the potential impact of CAD on screening mammography8,11,12 (Fig. 1). AI, a computer science subdiscipline, encompasses both machine learning and deep learning.13,14 Machine learning refers to computers learning directly from inputted data without explicit instructions or programming. This field encompasses various techniques such as supervised, unsupervised, and semisupervised learning.14,15 Deep learning is a subdiscipline of machine learning relying on multilayer neural networks, such as the convolutional neural network, to learn complex patterns and representations from data.13,16,17 The FDA-cleared AI tools for mammography have predominantly been trained using supervised learning techniques and are based on deep learning methodologies. 18

Fig. 1. Comparison of traditional computer-assisted detection (CAD) with machine learning and new-generation artificial intelligence (AI)-based CAD with deep learning. Traditional CAD relies on expert knowledge for feature extraction and machine learning algorithms for classification, whereas new-generation AI-based CAD uses convolutional neural networks for both feature extraction and classification.
Digital breast tomosynthesis (DBT), approved by the FDA for clinical use in 2011, was rapidly adopted in the United States and is becoming the standard of care in mammography. 19 As of July 2024, 90% of breast imaging practices in the United States utilize DBT. 20 DBT consists of low-dose X-rays acquired along an arc and reconstructed to produce a volume rendering of the breast consisting of a stack of thin sections. 21 Radiologists interpret DBT images in combination with either digital 2D mammographic (DM) images, which are separately acquired, or synthetic 2D mammographic images, which are reconstructed from the DBT stack. With use of synthetic images, patients avoid the additional radiation exposure from a separate acquisition. Research studies conducted in the United States have demonstrated that combined DBT/DM results in higher detection rates of cancers and fewer false-positives than DM alone.22–24 DBT reduces the effects of superimposed breast parenchyma, which enables improved visualization of real lesions and reduces shadow summation that can prompt recall. 21 However, DBT has led to increased interpretation times due to the additional time needed by the reader to scroll through the reconstructed DBT stack of images. 25
The longer reading times required for DBT, workforce shortages, and double-reading protocols common in European countries have fueled interest in developing AI approaches that optimize reading efficiency while maintaining or improving performance.25,26 AI-based CADe/x devices have been developed that flag lesions within the reconstructed DBT image stack to detect and localize clinically significant lesions. This approach is intended to direct radiologists toward significant findings, thereby improving cancer detection rates and concurrently reducing interpretation times. This article reviews published evidence supporting the use of commercial FDA-cleared AI-based CADe/x algorithms for screening DBT.
AI Tools for Lesion Detection and Diagnosis on DBT
AI-based CADe/x systems encompass both computer-assisted detection (CADe) for lesion identification and diagnosis (CADx) for lesion characterization as likely benign or malignant. 27 As of this writing in July 2024, there are six FDA-cleared CADe/x applications for screening DBT (Table 1).1,2 These AI tools analyze the DBT stacks with or without the accompanying DM and/or synthetic images. The AI algorithms identify the presence of soft tissue lesions (architectural distortions, asymmetries, and masses) and calcifications that may indicate malignancy (Fig. 2). The systems also provide bounding boxes for suspicious lesions and assign a suspicion level to the finding reflecting the likelihood of malignancy (e.g., minimal/low/intermediate/high versus a score ranging from 0 to 10 or 0 to 100). Some models will not display markings under a certain threshold (e.g., if a lesion is scored on a scale of 0 to 100, the model will not display markings assigned a value of 10 or less). The higher the score, the more likely the detected finding is malignant. Some models then assign a likelihood of cancer score to each breast, and all assign a likelihood of cancer score to the examination based on the degree of suspicion of the findings.28–33
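The display behavior described above can be illustrated with a minimal sketch. It assumes a hypothetical 0 to 100 lesion scale with a display cutoff of 10, and it assumes that the examination-level score is simply the highest lesion-level score. The `Finding` class, the function names, and the aggregation rule are illustrative only and do not represent any vendor's implementation.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    kind: str   # finding type, e.g. "mass" or "calcification"
    score: int  # hypothetical lesion-level suspicion score, 0 to 100

DISPLAY_CUTOFF = 10  # assumed: markings scored 10 or less are not displayed

def displayed_findings(findings, cutoff=DISPLAY_CUTOFF):
    """Return only the findings whose score exceeds the display cutoff."""
    return [f for f in findings if f.score > cutoff]

def exam_score(findings):
    """Assumed aggregation: the examination score is the maximum lesion score."""
    return max((f.score for f in findings), default=0)

# Hypothetical findings for one examination
findings = [
    Finding("mass", 72),
    Finding("calcification", 8),
    Finding("asymmetry", 35),
]
shown = displayed_findings(findings)
print([f.kind for f in shown], exam_score(findings))
```

In this toy example, the calcification scored 8 is suppressed, while the examination-level score is driven by the most suspicious finding.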

Fig. 2. Examples of various display markings and scores generated by artificial intelligence algorithms. These simulated images are based on the display markings and scores generated by the algorithms listed in Table 1.
Table 1. Food and Drug Administration-Cleared Artificial Intelligence-Based Computer-Assisted Detection and Diagnosis Products for Digital Breast Tomosynthesis
References: 28–33.
Some models will mark the suspicious lesions with different shapes depending on the finding type. For example, masses, distortions, and asymmetries will be marked with a star, and calcifications will be marked with a triangle. 28 One model offers the option of displaying the lesion localization with a color heatmap or contour map. 29 The outputs are intended to be used as a concurrent reading aid for radiologists interpreting screening mammograms, where the radiologist confirms or dismisses the findings during interpretation.
For purposes of this review article, we searched PubMed for validation studies of the FDA-cleared AI-based CADe/x tools for DBT. Search terms included the names of the FDA-cleared AI algorithms (Table 1), “artificial intelligence,” and “digital breast tomosynthesis.” To the best of our knowledge, there are 15 published studies supporting the use of the FDA-cleared AI-based CADe/x tools for DBT (Table 2).34–48 Four of these studies are recent multireader studies with 240–260 cancer-enriched cases.34,38,43,44 One additional study was also a multireader study, but utilized single-view DBT images (bilateral mediolateral oblique images) rather than standard-of-care two-view DBT images (bilateral mediolateral oblique and craniocaudal images). 36 Two reader studies utilizing an older version of an FDA-cleared AI-based CADe/x tool for DBT are not reviewed here;45,46 the most recent reader study of that product is described in Table 2 and in the text. 34
Table 2. Summary of Published Studies on Artificial Intelligence-Based Computer-Assisted Detection and Diagnosis for Digital Breast Tomosynthesis
AI, artificial intelligence; AIR, abnormal interpretation rate; AUC, area under the receiver operating characteristic curve; CDR, cancer detection rate; DBT, digital breast tomosynthesis; DM, digital 2D mammography; NA, not applicable; NR, not reported.
One study is a retrospective simulation focused on the potential workload reduction with use of AI, 37 one study using the same mammography dataset as the prior study is a retrospective evaluation of screening DBT examinations focused on the stand-alone performance of AI-based CADe/x, 39 one study compares double-reading of DM images of biopsy-proven cancers to stand-alone AI applied to the corresponding DBT images, 35 and one study retrospectively evaluates different screening workflows using AI, double-read single-view DBT examinations, and double-read DM examinations. 40 To the best of our knowledge, only two studies have assessed the utility of AI-based CADe/x for DBT in the “real-world” setting.41,42 Two studies, one that details an annotation-efficient deep learning approach 47 and one that investigates the impact of patient characteristics on the performance of an AI algorithm interpreting negative screening DBT examinations, 48 do not report standard metrics and are therefore not reviewed here.
Radiologist performance with and without AI based on reader studies
Diagnostic accuracy is often assessed with the area under the receiver operating characteristic curve (AUC), which combines measures of sensitivity and specificity. In five enriched multireader, multicase studies, the average AUC of readers increased with use of AI from 0.795 to 0.852 (p < 0.01), 0.85 to 0.88 (p = 0.01), 0.833 to 0.863 (p = 0.0025), 0.87 to 0.93 (p < 0.001), and 0.90 to 0.92 (p = 0.003).34,36,38,43,44 In the four studies that reported reader-level results, the majority of readers had individual improvements in AUC (22 of 24, 11 of 14, 16 of 18, and 18 of 18).34,36,38,43 In one multireader, multicase study with 18 radiologists (9 general and 9 specialized in breast imaging), a larger AUC improvement was observed with general radiologists (0.84 to 0.92, p < 0.001) compared with breast imaging radiologists (0.89 to 0.93, p < 0.001). 43 Breast imaging radiologists had a higher AUC than general radiologists, but general radiologists with AI outperformed breast imaging radiologists without AI.
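For readers less familiar with the AUC, it can be interpreted as the probability that a randomly chosen cancer case receives a higher suspicion score than a randomly chosen noncancer case, with ties counted as one half. The following minimal sketch computes the AUC in this pairwise form; the function name and the suspicion scores are hypothetical and are not drawn from any of the reviewed studies.

```python
def auc(scores_cancer, scores_noncancer):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    wins = 0.0
    for c in scores_cancer:
        for n in scores_noncancer:
            if c > n:
                wins += 1.0   # cancer case ranked above noncancer case
            elif c == n:
                wins += 0.5   # ties count as one half
    return wins / (len(scores_cancer) * len(scores_noncancer))

# Hypothetical suspicion scores assigned by a reader (or an algorithm)
cancer_scores = [0.9, 0.8, 0.4]
noncancer_scores = [0.7, 0.3, 0.2, 0.1]

print(round(auc(cancer_scores, noncancer_scores), 3))
```

An AUC of 0.5 corresponds to chance-level ranking and an AUC of 1.0 to perfect separation of cancer and noncancer cases.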
All five multireader, multicase studies reported increases in sensitivity with the use of AI (77.0% to 85.0% [p < 0.01], 81.1% to 86.5% [p = 0.006], 74.6% to 79.2% [p = 0.016], 80.8% to 89.6% [p = 0.001], and 85.4% to 87.7% [p = 0.04]).34,36,38,43,44 In one multireader, multicase study, cancers were classified as hard, medium, or easy based on the number of radiologists who recalled the case without AI. 43 Hard cancers were defined as cases that were recalled by 50% or less of readers, medium as cases that were recalled by more than 50% but less than or equal to 75%, and easy as cases that were recalled by more than 75%. More than 80% of the “hard” cancers were false-negative examinations (i.e., during the course of routine clinical care, the examination was given a negative or benign assessment, but the patient was diagnosed with breast cancer within a year following the examination). The study found that use of AI improved reader sensitivities across all cases, but helped general radiologists most with medium and hard cases and breast imaging radiologists with hard cases.
Importantly, improvements in sensitivity with AI did not occur at the expense of specificity. In fact, in one of the multireader, multicase studies, specificity improved from 62.7% to 69.6% (p < 0.01). 34 In the multireader study with nine general radiologists and nine breast imaging radiologists, there was no significant difference in specificity without versus with AI use among general radiologists (79.6% vs 77.0%, p = 0.37), but specificity improved with AI use among breast imaging radiologists (70.6% vs 75.1%, p = 0.01). 43
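The hard, medium, and easy definitions above translate directly into a rule on the fraction of readers who recalled a case without AI. The sketch below is illustrative only; the function name is hypothetical, and integer comparisons are used so that the 50% and 75% boundaries are handled exactly.

```python
def difficulty(n_recalled: int, n_readers: int) -> str:
    """Classify a cancer case by the share of readers who recalled it without AI."""
    if n_readers <= 0 or not 0 <= n_recalled <= n_readers:
        raise ValueError("n_recalled must lie between 0 and n_readers")
    if 2 * n_recalled <= n_readers:        # recalled by 50% or less
        return "hard"
    if 4 * n_recalled <= 3 * n_readers:    # more than 50%, up to and including 75%
        return "medium"
    return "easy"                          # recalled by more than 75%

# Example with a panel of 18 readers
labels = {k: difficulty(k, 18) for k in (9, 10, 13, 14)}
print(labels)
```

With 18 readers, a cancer recalled by 9 readers falls in the hard group, one recalled by 10 to 13 readers in the medium group, and one recalled by 14 or more in the easy group.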
Four studies also evaluated the effect of AI-based CADe/x on interpretation time.34,36,38,44 Three of these studies reported reductions in interpretation time (64.1 s without AI vs 30.4 s with AI [p < 0.01], 41 vs 36 s [p < 0.001], and 54.4 vs 48.5 s [p < 0.001]),34,38,44 while one that included only single-view DBT images reported no change (45 vs 48 s, p = 0.35). 36 In the study that reported the largest decrease in interpretation time (64.1 s without AI to 30.4 s with AI, a reduction of 52.7%, p < 0.01), the AI output appeared when the study was first opened rather than being presented after the radiologist reviewed the images independently. 34 Another study that reported a reduction in interpretation time found that the largest reductions occurred for examinations with low AI scores. 38 In the study that reported no change in overall interpretation time, interpretation time was found to decrease by 8% for examinations with low AI scores and increase by 27% for examinations with high AI scores. 36
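The relative changes in interpretation time follow from simple arithmetic on the reported means. The sketch below recomputes them from the rounded values quoted above; the function name is hypothetical, and percentages recomputed from rounded means can differ slightly from those reported by the original studies.

```python
def percent_reduction(time_without_ai: float, time_with_ai: float) -> float:
    """Relative reduction in mean interpretation time, in percent."""
    return 100.0 * (time_without_ai - time_with_ai) / time_without_ai

# Mean interpretation times in seconds (without AI, with AI), as quoted in the text
reported_means = [(64.1, 30.4), (41.0, 36.0), (54.4, 48.5)]

reductions = [round(percent_reduction(a, b), 1) for a, b in reported_means]
print(reductions)
```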
The multireader, multicase studies focus on reader averages (e.g., average AUC of 0.852 with AI), and it is important to note that differences are observed across individual readers. For example, in a study with 24 readers, 14 improved in sensitivity, specificity, and interpretation time; 6 improved in specificity and interpretation time, but had no change in sensitivity (n = 3) or had reductions in sensitivity (n = 3); 3 improved in sensitivity and interpretation time, but not in specificity; and 1 improved in sensitivity and specificity with a longer interpretation time. 34 Of note, reader studies must be interpreted with caution since radiologist performance in reader studies may not apply to real-world settings. 49
Stand-alone performance of AI and potential workload reduction
In a study focused on the stand-alone performance of AI, an FDA-cleared AI-based CADe/x tool retrospectively analyzed 15,999 screening DBT examinations, including 98 screen-detected cancers and 15 interval cancers. 39 AI attained an AUC of 0.94 and noninferior sensitivity as a single or double reader. Of the 113 cancers, the AI system correctly localized all but six, and all six cases were interval cancers. However, AI had a higher abnormal interpretation rate or recall rate compared to single readers (9.2% vs 3.0%, p < 0.001) and double readers (16.7% vs 4.4%, p < 0.001).
The multireader, multicase studies report stand-alone performance of the AI systems as a secondary analysis and largely focus on the AUC metric. In three multireader studies, the stand-alone AUC of AI was higher than 12 of 14 radiologists reading without AI, 10 of 18 radiologists, and 18 of 18 radiologists.36,38,43 Two of the multireader studies reported stand-alone specificity of AI, which was lower than the specificity of radiologists reading without AI in one study (40.5% vs 62.7%) and higher in the other (89.6% vs 77.3%).34,44 A recent systematic review and meta-analysis, which included both FDA-cleared and noncleared tools, concluded that there are insufficient studies to evaluate the stand-alone performance of AI in screening DBT interpretation. 50
A retrospective simulation study, including 15,987 DBT examinations, concluded that an autonomous AI triaging approach could result in 72.5% less workload, noninferior sensitivity, and a lower abnormal interpretation rate or recall rate, compared with the standard-of-care double-reading approach. 37 In this simulation, the least suspicious examinations would not be interpreted by a radiologist, and the remainder would be interpreted by a radiologist (with examinations graded as very suspicious by AI recalled for additional imaging even if not recalled by a radiologist). At present, none of the AI-based CADe/x tools is approved for stand-alone interpretation, but use of AI to replace the second radiologist in double-reading screening programs is an area of active research, as further discussed below.
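The logic of such a triage simulation can be sketched as follows. The 1 to 10 score scale, both thresholds, the toy examinations, and the assumption that each radiologist-read examination receives a single read (versus two reads under double-reading) are hypothetical choices made for illustration and are not taken from the cited study.

```python
LOW_THRESHOLD = 2    # assumed: scores at or below this are not read by a radiologist
HIGH_THRESHOLD = 9   # assumed: scores at or above this are recalled regardless

def triage(ai_score: int, radiologist_recalls: bool) -> dict:
    """Route one examination through the simulated AI triage workflow."""
    if ai_score <= LOW_THRESHOLD:
        # Least suspicious examinations: no radiologist read, no recall
        return {"read": False, "recalled": False}
    # Remaining examinations are read; very suspicious ones are always recalled
    recalled = radiologist_recalls or ai_score >= HIGH_THRESHOLD
    return {"read": True, "recalled": recalled}

# Hypothetical examinations: (AI score, would the radiologist recall?)
exams = [(1, False), (2, False), (1, False), (3, False), (5, True),
         (2, False), (9, False), (4, False), (10, True), (1, False)]

results = [triage(score, recall) for score, recall in exams]
n_reads = sum(r["read"] for r in results)
n_recalled = sum(r["recalled"] for r in results)

baseline_reads = 2 * len(exams)  # double-reading: two reads per examination
workload_reduction = 1 - n_reads / baseline_reads
print(n_reads, n_recalled, workload_reduction)
```

In practice, the thresholds would need to be chosen and validated so that sensitivity remains noninferior to the standard-of-care workflow.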
Radiologist performance with AI in a real-world setting
There are two studies, to the best of our knowledge, that evaluate the potential real-world impact of an AI-based CADe/x algorithm on screening DBT, which will be reviewed in turn.41,42 In a retrospective analysis, mammography audit reports over a 12-month period for five radiologists working at three locations were reviewed; only one of the three locations had deployed an AI-based CADe/x tool. 42 A total of 7002 mammograms were interpreted without AI and 5883 with AI. The study reports nonsignificant increases in cancer detection rate (5.9 cancers per 1000 examinations vs 7.3 per 1000), positive predictive value 1 (PPV1) (4.8% vs 6.2%), PPV3 (22.9% vs 33.9%), and abnormal interpretation rate or recall rate (10.2% vs 11.2%) with use of AI. Limitations of this study are the potential confounding variables of radiologist reader (the five radiologists in the study did not rotate equally across the three locations) and patient demographics (which varied widely across the three locations). In addition, the location with the AI tool had rotating trainees, whereas the other locations did not. Of note, one of the five radiologists reported using the AI tool for only 50% of mammograms, reflecting that radiologists may or may not use CADe/x in practice depending on their confidence in the tool, confidence in their own independent interpretation, and/or user friendliness of the tool. 51
In the second real-world study, the performance of 10 radiologists practicing in an independent double-reading setting (i.e., no arbitration or consensus) at a single institution in Spain was compared before and after deployment of an AI-based CADe/x tool. 41 Interpretive performance reading 6949 screening DBT examinations after deployment of AI was compared with performance in a matched control group of 6953 screening DBT examinations before deployment of AI. The matching was based on patient age and breast density. The cancer detection rate was found to be significantly higher after AI deployment (9.6 cancers per 1000 examinations vs 5.8 per 1000, p < 0.05). The abnormal interpretation rate or recall rate was also higher with AI (5.7% vs 4.9%, p < 0.05). These results suggest that AI in real-world clinical settings could lead to higher cancer detection rates, which would presumably lead to fewer false-negative/interval cancers; however, the false-negative/interval cancers in this study are unknown due to limited follow-up.
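The audit metrics named above are simple ratios. The sketch below shows how they are typically derived from audit counts; the function name and the counts are hypothetical and are not the data of the cited study, and for simplicity the same cancer count is used in every numerator (i.e., all cancers are assumed to be diagnosed by biopsy after an abnormal screening interpretation).

```python
def audit_metrics(exams: int, abnormal: int, biopsies: int, cancers: int) -> dict:
    """Common screening audit metrics from raw counts."""
    return {
        "cdr_per_1000": 1000 * cancers / exams,   # cancer detection rate
        "air_percent": 100 * abnormal / exams,    # abnormal interpretation (recall) rate
        "ppv1_percent": 100 * cancers / abnormal, # cancers per abnormal screening interpretation
        "ppv3_percent": 100 * cancers / biopsies, # cancers per biopsy performed
    }

# Hypothetical audit counts
m = audit_metrics(exams=10_000, abnormal=1_000, biopsies=200, cancers=60)
print(m)
```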
AI Tools for Lesion Detection and Diagnosis on DM
DBT in conjunction with native or synthetic DM images is widely used for screening mammography throughout the United States; however, DM alone continues to be used in many parts of the world, including Europe. In Europe, mammograms are typically double-read, with disagreements arbitrated in a consensus setting. Although there are no AI-based CADe/x tools cleared for stand-alone interpretation at this time, the possibility of using AI to replace the second reader in European screening programs is an area of active research. Several of the key studies supporting the use of FDA-cleared AI-based CADe/x tools for screening DM are highlighted in this section.
The Mammography Screening with Artificial Intelligence (MASAI) trial is a randomized controlled trial at four screening sites in Sweden. 52 Participants are randomized to two arms as follows: (a) standard-of-care double-reading without AI and (b) AI-supported screening, in which the AI system score determines whether the mammogram is single-read or double-read with AI output available to all readers. This trial seeks to address two important questions as follows: one, whether AI can safely reduce screening workload and, two, the impact of AI on screening outcomes (with interval cancer rate being a primary outcome). In the first report of the MASAI trial in August 2023, which was published after 80,000 women were randomized to one of the two arms, the cancer detection rate was found to be 5.1 cancers per 1000 screened participants in the standard-of-care arm versus 6.1 per 1000 in the AI-supported screening arm (p = 0.052). 52 Recall rates were similar in both arms (2.0% in the standard-of-care arm versus 2.2% in the AI-supported screening arm). Importantly, with use of AI, workload was reduced by 44.3%. The trial results thus far offer strong evidence supporting the use of AI in double-reading programs; it is continuing with plans to assess interval cancer rates after 100,000 participants are enrolled and have two years of follow-up.
In ScreenTrustCAD, a prospective paired-reader study in Sweden, 55,581 women underwent screening mammograms, which were interpreted by two radiologists as per routine protocol. 53 For purposes of the study, the AI tool (to which the radiologists did not have access) acted as an independent reader. If any of the three readers (two radiologists or AI) interpreted the examination as abnormal, the examination was discussed in consensus, at which time the radiologists could review AI output. The primary analysis consisted of comparing cancer detection rates between double-reading by two radiologists and double-reading by one radiologist and AI (with the read of the first and not the second reader considered). Double-reading by one radiologist and AI led to a 4% higher cancer detection rate. Both this prospective study 53 and the aforementioned MASAI trial 52 report higher cancer detection rates with AI, suggesting that AI could lead to a reduction in interval cancer rates; however, this desirable outcome may not occur if more in situ cancers relative to invasive cancers are detected, as reported in the MASAI trial. 54 Both of the aforementioned prospective studies are based in Sweden.52,53 A third prospective study, titled Artificial Intelligence for Breast Cancer Screening in Mammography (AI-STREAM), is underway at multiple centers in Korea to compare radiologist accuracy with and without AI-based CADe/x. 55
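The paired-reader design of ScreenTrustCAD can be summarized as a small decision rule: an examination goes to consensus if any of the three readers flags it, and the primary comparison contrasts flags from the two radiologists with flags from the first radiologist plus AI. The sketch below is a simplified illustration with hypothetical function names and reads; it models only the flagging step, not the consensus decision or cancer outcomes.

```python
def goes_to_consensus(r1: bool, r2: bool, ai: bool) -> bool:
    """An examination is discussed in consensus if any reader flags it."""
    return r1 or r2 or ai

def flagged_by_two_radiologists(r1: bool, r2: bool) -> bool:
    """Standard double-reading: either radiologist flags the examination."""
    return r1 or r2

def flagged_by_radiologist_plus_ai(r1: bool, ai: bool) -> bool:
    """Comparison strategy: first radiologist and AI; the second read is ignored."""
    return r1 or ai

# Hypothetical reads: (radiologist 1, radiologist 2, AI)
reads = [
    (False, False, False),
    (True, False, False),
    (False, True, False),
    (False, False, True),
    (True, True, True),
]

n_consensus = sum(goes_to_consensus(*r) for r in reads)
n_double = sum(flagged_by_two_radiologists(r1, r2) for r1, r2, _ in reads)
n_rad_ai = sum(flagged_by_radiologist_plus_ai(r1, ai) for r1, _, ai in reads)
print(n_consensus, n_double, n_rad_ai)
```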
In the only head-to-head comparison, to the best of our knowledge, of commercial AI-based CADe/x tools for DM, in which three tools independently analyzed the same test set of 8805 mammograms (including 739 cancers) from women who underwent screening at an academic hospital in Sweden, the best-performing tool attained an AUC of 0.956 and a sensitivity of 81.9% (at the specificity of the radiologists). 56 Although this tool attained and even exceeded the performance of radiologists in some comparisons, the best sensitivity (88.6%) was achieved by pairing the best-performing tool with first-reader radiologists (which was higher than the sensitivity achieved by combining the first and second radiologists). The tool with the best performance had been trained with more mammograms than the other two tools (although it had not been trained on diverse datasets nor with different vendors), had used pixel-level (rather than image-level) annotations for training, and had used a convolutional neural network architecture with a high-capacity background; these features were hypothesized to contribute to its superior performance.
Clinical Implementation Challenges and Future Directions
To avoid repeating the limited impact that older CAD products had on mammography interpretation accuracy, attention to best practices in clinical implementation of new AI products is encouraged. AI-based CADe/x algorithms may be integrated into clinical workflows based on local need. 3 In double-reading practices, the AI algorithm could supplant the second reader. In batch reading practices, the algorithm could triage cases more likely to need additional imaging for immediate read before the patient leaves the center or triage more challenging cases to more experienced radiologists. Each of these scenarios presents challenges, which will likely be best addressed based on local practices and preferences. AI algorithms are costly, which may pose a limitation for practices with limited resources, and the integration of AI algorithms into clinical practice must lead to measured gains in quality and/or efficiency to justify the significant financial investment. 57 At this time, in the United States, payment for computer-assisted mammography interpretation is bundled with payment for the mammogram with no additional payment made for use of the AI algorithm itself.
As AI algorithms continue to improve, stand-alone AI interpretation tools will develop, allowing for transition from assisted to autonomous reading of mammograms by computers. This transition will require extensive regulatory work as AI-based CADe/x is not cleared by the FDA for stand-alone interpretation at this time, and the medicolegal and billing implications of stand-alone interpretation of imaging studies with AI remain unresolved. 58 Future directions include the incorporation of temporal analysis to improve AI algorithm performance (i.e., AI algorithm considers prior mammographic examinations in its interpretation), the use of customized neural network architectures for DBT, and validation of algorithms across diverse patient populations and mammography vendors.3,59 Research studies on AI-based CADe/x for DBT are relatively limited compared with AI-based CADe/x for DM. Specifically, prospective randomized studies and rigorous postimplementation clinical evaluation are needed to determine the impact of AI-based CADe/x for DBT on screening mammography performance metrics and patient outcomes.
Conclusion
Early studies show that AI-based CADe/x holds potential to improve not only the accuracy but also the efficiency of screening DBT. Multireader, multicase studies show that AI leads to improvements in AUC and sensitivity without compromising specificity. Interpretation times may also be reduced with use of AI, but this metric has been investigated only in the context of reader studies, which may not translate to real-world clinical practice. There are insufficient studies to assess the stand-alone performance of AI in screening DBT interpretation. Only two studies report radiologist performance with AI-based CADe/x in real-world settings, which suggest that cancer detection rates could be increased with use of AI; however, both studies are relatively small with methodology limitations. Prospective randomized trials, as are being done for DM, in addition to careful postimplementation clinical evaluation, are essential to comprehensively understand the impact of AI-based CADe/x on screening DBT outcomes.
Footnotes
Acknowledgments
The authors thank Susanne L. Loomis (Medical and Scientific Communications, Strategic Communications, Department of Radiology, Massachusetts General Hospital, Boston, MA) for creating Figures 1 and
.
Authors’ Contributions
L.R.L.: Conceptualization (lead); investigation (lead); visualization (lead); writing—original draft (lead); writing—review and editing (equal). C.D.L.: Conceptualization (supporting); writing—review and editing (equal). S.D.: Writing—review and editing (equal). K.K.: Writing—review and editing (equal). S.L.: Writing—review and editing (equal). M.B.: Conceptualization (lead); funding acquisition (lead); investigation (lead); project administration (lead); supervision (lead); visualization (lead); writing—original draft (lead); writing—review and editing (lead).
Author Disclosure Statement
M.B. is an expert panelist for Accolade, Inc., a consultant for Hologic, Inc., and a consultant for Lunit, Inc. M.B. has institutional research grants from Hologic, Inc., and Lunit, Inc. C.D.L. is cofounder of Clairity, Inc., and a consultant for Hologic, Inc. She has institutional research grants from Hologic, Inc., and GE HealthCare Technologies, Inc. The other authors report no disclosures.
Funding Information
This work was supported by the National Institutes of Health (
