Abstract
Background
The use of artificial intelligence (AI) in screen-reading of mammograms has shown promising results for cancer detection. However, less attention has been paid to the false positives generated by AI.
Purpose
To investigate mammographic features in screening mammograms with high AI scores but a true-negative screening result.
Material and Methods
In this retrospective study, 54,662 screening examinations from BreastScreen Norway 2010–2022 were analyzed with a commercially available AI system (Transpara v. 2.0.0). An AI score of 1–10 indicated the suspiciousness of malignancy. We selected examinations with an AI score of 10, with a true-negative screening result, followed by two consecutive true-negative screening examinations. Of the 2,124 examinations matching these criteria, 382 random examinations underwent blinded consensus review by three experienced breast radiologists. The examinations were classified according to mammographic features, radiologist interpretation score (1–5), and mammographic breast density (BI-RADS 5th ed. a–d).
Results
The reviews classified 91.1% (348/382) of the examinations as negative (interpretation score 1). All examinations (26/26) categorized as BI-RADS d were given an interpretation score of 1. Classification of mammographic features: asymmetry = 30.6% (117/382); calcifications = 30.1% (115/382); asymmetry with calcifications = 29.3% (112/382); mass = 8.9% (34/382); distortion = 0.8% (3/382); spiculated mass = 0.3% (1/382). For examinations with calcifications, 79.1% (91/115) were classified with benign morphology.
Conclusion
The majority of false-positive screening examinations generated by AI were classified as non-suspicious in a retrospective blinded consensus review and would likely not have been recalled for further assessment in a real screening setting using AI as a decision support.
Introduction
In 2022, breast cancer was the most common cancer and cause of cancer-related death among women worldwide, with an estimate of nearly 2.3 million new cases and 666,000 deaths (1). Early detection of the disease through mammographic screening has been shown to reduce the incidence of advanced disease, treatment burden, and disease-specific mortality (2,3). Most European and high-income countries have implemented organized mammographic screening in accordance with recommendations by international health authorities (4,5).
In the current screening setting with double reading, radiological accuracy varies, with sensitivity being affected by factors such as mammographic features (6,7), mammographic breast density (8), the radiologists’ annual reading volume, and their years of experience in screen-reading (9). The integration of artificial intelligence (AI) could help standardize and potentially increase accuracy of the screen-reading (10). Retrospective as well as prospective studies have shown promising performance of AI systems used in the interpretation procedure, with sensitivities approaching the level of expert radiologists (11–19). On the other hand, less attention has been paid to the potential false positives generated by AI. In a retrospective study from 2022, including 122,969 screening examinations from BreastScreen Norway, 94.0% (11,638/12,383) of the examinations with the highest AI score of 10 had a negative screening result (20). Selecting all these women for recall would yield an unacceptably high recall rate, 2–4 times greater than that currently reported in BreastScreen Norway (21). This illustrates the importance of handling negative examinations with high AI scores in a way that ensures acceptable recall rates without increasing the rate of false-positive screening results. To our knowledge, no studies have examined mammographic features in AI-generated false-positive screening examinations.
The aim of this retrospective blinded consensus review study was to investigate mammographic features in screening mammograms flagged with high AI scores, despite having a true-negative screening result. To explore whether having access to prior examinations influenced interpretation, the radiologists reviewed the mammograms either with or without prior examinations available for comparison.
Material and Methods
The study was approved by the Regional Committee for Medical and Health Research Ethics (REC no. 2018/2574) and has a legal basis in accordance with Articles 6 (1) (e) and 9 (2) (j) of the General Data Protection Regulation. The data were disclosed with legal basis in the Cancer Registry Regulations of December 2001 No. 47, section 3–1 and the Personal Health Data Filing System Act section 19 a to 19 h (22,23). The requirement for informed consent to participate was waived by the Regional Committee for Medical and Health Research Ethics.
In BreastScreen Norway, all women aged 50–69 years are invited to two-view biennial digital mammographic screening (21). All screening examinations are interpreted independently by two breast radiologists. Each breast is given an interpretation score between 1 and 5 as follows: 1 = negative for malignancy; 2 = probably benign; 3 = intermediate suspicion of malignancy; 4 = probably malignant; and 5 = high suspicion of malignancy. All examinations with a score ≥2 by either of the radiologists are discussed in a consensus meeting to determine whether the woman should be recalled for further assessment. During 2017–2021, the average attendance rate in the program was 75.5%, the recall rate 3.3%, the screen-detected cancer rate 0.64%, and the interval cancer rate 0.18% (21).
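The referral rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the program's software: the function name `needs_consensus` and the representation of each reading as a pair of per-breast scores are assumptions made for this example.

```python
from typing import Tuple

# Interpretation scores used in BreastScreen Norway, as listed in the text.
SCORE_LABELS = {
    1: "negative for malignancy",
    2: "probably benign",
    3: "intermediate suspicion of malignancy",
    4: "probably malignant",
    5: "high suspicion of malignancy",
}

# Hypothetical layout for one radiologist's reading: (left breast, right breast).
Reading = Tuple[int, int]


def needs_consensus(reader_1: Reading, reader_2: Reading) -> bool:
    """Return True if either radiologist scored either breast 2 or higher."""
    scores = (*reader_1, *reader_2)
    if any(s not in SCORE_LABELS for s in scores):
        raise ValueError("interpretation scores must be integers from 1 to 5")
    return max(scores) >= 2
```

Under this rule, an examination bypasses the consensus meeting only when both readers score both breasts as 1, which is the definition of the initial negative reading used for the study population.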
The AI system
We used Transpara version 2.0.0 (ScreenPoint Medical, Nijmegen, the Netherlands), which is CE marked and cleared by the U.S. Food and Drug Administration. The AI system scored each screening examination with a continuous value in the range of 0.0–10.0 and categorized examinations into 10 groups based on the highest overall exam-level continuous risk score. This is referred to as AI score and was in the range of 1–10, where 1–7 indicated low risk, 8–9 intermediate risk, and 10 a high risk of breast cancer. Areas of interest were marked with a region score of 1–100, with higher scores indicating a higher level of suspicion.
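As a minimal sketch, the grouping of the exam-level AI score into risk categories described above could be written as follows. The function name `risk_category` is hypothetical. The vendor's mapping from the continuous 0.0–10.0 value to the 1–10 AI score is not described in the text and is deliberately not reproduced here.

```python
def risk_category(ai_score: int) -> str:
    """Map an exam-level AI score (1-10) to the risk group used in this study."""
    if not isinstance(ai_score, int) or not 1 <= ai_score <= 10:
        raise ValueError("AI score must be an integer from 1 to 10")
    if ai_score <= 7:
        return "low"
    if ai_score <= 9:
        return "intermediate"
    return "high"  # only AI score 10; these examinations were studied here
```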
Study sample
Screening examinations among women following the biennial screening scheme and without breast cancer in at least four consecutive screening examinations in BreastScreen Norway, 2010–2022, were included in the overall study population (Fig. 1). Only examinations with an initial interpretation score of 1 by both radiologists and no cancer diagnosed at screening or in the interval before the next screening (i.e. true negative), with one prior and two subsequent true-negative examinations available, were eligible for the study population. The 54,662 examinations matching these criteria were analyzed by the AI system, and 2,124 examinations had an AI score of 10. From these 2,124 examinations, a random subsample of 400 examinations was selected for review. No woman was included in the final study sample more than once. The study sample was then randomly divided into two groups, A and B, with 200 examinations in each. In group A, prior screening mammograms were made available and presented to the reviewing radiologists for comparison with the current screening mammograms, whereas in group B, prior mammograms were not provided in the review. Ten examinations in group A and eight examinations in group B were excluded because AI scores and markings were missing due to technical issues, resulting in 190 examinations available for review in group A and 192 in group B.
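The sampling and allocation steps can be illustrated with a short Python sketch. It is not the procedure used by the authors, whose software and random seed are not reported. The function name `draw_review_groups`, the seed, and the placeholder identifiers are assumptions, and the input is assumed to contain at most one examination per woman.

```python
import random
from typing import Iterable, List, Tuple


def draw_review_groups(
    exam_ids: Iterable[int], n_sample: int = 400, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Draw a random subsample and split it into two equally sized groups."""
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    # random.sample draws without replacement and returns a random ordering,
    # so cutting the result in half gives a random allocation.
    sample = rng.sample(list(exam_ids), n_sample)
    half = n_sample // 2
    return sample[:half], sample[half:]


# Placeholder identifiers standing in for the 2,124 examinations with AI score 10.
eligible_ids = range(2124)
group_a, group_b = draw_review_groups(eligible_ids)

# Examinations lacking AI scores and markings were excluded afterwards.
reviewed_a = len(group_a) - 10
reviewed_b = len(group_b) - 8
```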

Fig. 1. Flow chart of the study sample.
Blinded consensus review
A retrospective consensus-based review was performed by a group of three internal breast radiologists with 11, 19, and 30 years of experience in breast radiology, respectively. The review was blinded; the radiologists were not informed about the screening outcome, only that the examinations had an AI score of 10. AI markings with region scores were available for the radiologists during the review for both study samples. If several AI markings were visible, only the AI marking with the highest region score was considered in the review.
For both study samples, the consensus review recorded the radiologists’ interpretation score (1–5), mammographic breast density (Breast Imaging Reporting and Data System 5th Edition [BI-RADS] a–d) (24) and mammographic feature (asymmetry, calcifications, asymmetry with calcifications, mass, distortion, spiculated mass) (24) of the AI-marked lesion. We used a modified BI-RADS classification to describe mammographic features: asymmetry = findings that represent deposits of fibroglandular tissue not conforming to the definition of a mass (Fig. 2); calcifications = deposits of calcium salts in the breast, which are radio-opaque on mammography (Fig. 3); asymmetry with calcifications = deposits of fibroglandular tissue not conforming to the definition of a mass combined with visible calcifications; mass = a space-occupying 3D lesion seen in two different projections; distortion = the normal architecture is distorted with no definite mass visible; and spiculated mass = a space-occupying 3D lesion with lines radiating from its margins. Calcifications were classified based on morphology (amorphous, benign, coarse heterogeneous, fine pleomorphic, fine linear/fine linear branching) and distribution (cluster, diffuse, linear, regional, segmental) (24).

Fig. 2. Screening mammograms with AI score 10 (region score 70), and a true-negative screening result. Interpreted as negative (interpretation score 1) and classified as an “asymmetry” in a retrospective blinded consensus review.

Fig. 3. Screening mammograms with AI score 10 (region score 81), and a true-negative screening result. Interpreted as negative (interpretation score 1) and classified as “calcifications” with benign morphology in a retrospective blinded consensus review.
Statistical analysis
Descriptive analyses were performed by statistical analysts M.B.B. and J.G. in Stata (StataCorp., College Station, TX, USA). Categorical variables were presented as frequencies and percentages. Continuous variables were presented as means ± standard deviation (SD) according to the distribution. Frequencies and percentages for all reviewed examinations (n = 382) were presented for interpretation score stratified by groups A and B, and for mammographic features stratified by mammographic breast density. The distribution of mammographic features for all reviewed examinations (n = 382) was compared descriptively to the distribution of mammographic features for all reported invasive cancers (both screen-detected and interval cancers) registered in the Cancer Registry of Norway from BreastScreen Norway in Rogaland County during 2017–2021 (n = 482) (21). Frequencies and percentages for examinations with calcifications (n = 115) were presented for morphology and distribution of the calcifications.
Results
From the initial study population of 234,771 screening examinations, 149,488 were excluded for having fewer than four consecutive screening examinations, 15,175 due to irregular attendance, 14,928 because the initial interpretation score was 2 or above by one or both radiologists, and 518 because breast cancer was detected (Fig. 1). A total of 10 examinations from group A and eight examinations from group B had to be excluded due to lack of AI scores and markings. The mean age of the women was 58.7 ± 4.4 years in group A and 59.1 ± 4.2 years in group B (Table 1).
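The exclusion counts above can be checked against the sample sizes reported in the Methods with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Counts reported in the Results and Methods sections (Fig. 1).
initial_population = 234_771
exclusions = {
    "fewer than four consecutive examinations": 149_488,
    "irregular attendance": 15_175,
    "interpretation score of 2 or above": 14_928,
    "breast cancer detected": 518,
}

# Examinations left for AI analysis after all exclusions.
eligible = initial_population - sum(exclusions.values())

# Random subsample, minus examinations lacking AI scores and markings.
sampled = 400
missing_ai_output = {"A": 10, "B": 8}
reviewed = sampled - sum(missing_ai_output.values())

print(eligible)  # 54662
print(reviewed)  # 382
```

Both results agree with the 54,662 eligible examinations and the 382 reviewed examinations reported in the study.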
Table 1. Age and distribution of BI-RADS mammographic breast density in group A (with prior mammograms presented for comparison) and group B (without prior mammograms available).
Values are given as n (%) or mean ± SD.
BI-RADS, Breast Imaging Reporting and Data System; SD, standard deviation.
Radiologists’ interpretation score
Of the 382 examinations, 348 (91.1%) were given an interpretation score of 1 by the radiologists in the review and 30 (7.9%) a score of 2 (Table 2). For group A, where prior mammograms were made available for the radiologists, 178/190 (93.7%) were given an interpretation score of 1, whereas this figure was 170/192 (88.5%) for group B, where prior mammograms were not provided. All 26 examinations with the highest mammographic breast density (BI-RADS d) were given an interpretation score of 1 by the reviewing radiologists (Supplementary Table 1).
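The proportions reported above follow directly from the counts. As a minimal sketch, they can be recomputed as shown below; the helper `pct` and the dictionary layout are illustrative choices, not part of the original Stata analysis.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal, as reported in the text."""
    return round(100 * numerator / denominator, 1)


# (examinations with interpretation score 1, examinations reviewed) per group.
score_1 = {
    "A (priors available)": (178, 190),
    "B (no priors)": (170, 192),
}

for group, (n, total) in score_1.items():
    print(group, pct(n, total))

# Pooling the two groups reproduces the overall figure for score 1.
pooled_n = sum(n for n, _ in score_1.values())
pooled_total = sum(total for _, total in score_1.values())
print("overall", pct(pooled_n, pooled_total))
```

The pooled counts (348 of 382) match the overall proportion of examinations given an interpretation score of 1.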
Table 2. Radiologists’ interpretation score* in a blinded consensus review of screening examinations with AI score 10 and a true-negative screening result (n = 382), with and without prior mammograms presented to the radiologists for comparison.
Values are given as n (%).
*The interpretation score system used in BreastScreen Norway: 1 = negative for malignancy; 2 = probably benign; 3 = intermediate suspicion of malignancy; 4 = probably malignant; and 5 = high suspicion of malignancy (21).
AI, artificial intelligence.
Mammographic features
Mammographic features of the AI-marked lesion were classified as asymmetry in 117/382 (30.6%) examinations, calcifications in 115/382 (30.1%), asymmetry with calcifications in 112/382 (29.3%), mass in 34/382 (8.9%), distortion in 3/382 (0.8%), and spiculated mass in 1/382 (0.3%) (Table 3). The one examination classified with a spiculated mass was annotated as postoperative changes unchanged from the prior mammogram by the reviewing radiologists. The three most common mammographic features (asymmetry, calcifications, and asymmetry with calcifications) accounted for 344/382 (90.1%). In comparison, among all invasive cancers in Rogaland County during 2017–2021, asymmetry, calcifications, and asymmetry with calcifications accounted for 130/482 (27.0%) (Supplementary Table 2).
Table 3. Classification of mammographic features by BI-RADS mammographic breast density in a blinded consensus review of screening mammograms with AI score 10 and a true-negative screening result.
Values are given as n (%). Percentages are calculated with the total number of examinations in each column as denominator.
AI, artificial intelligence; BI-RADS, Breast Imaging Reporting and Data System.
When considering calcifications and mammographic breast density, calcifications and asymmetries with calcifications accounted for 26/46 (56.5%) of examinations categorized as BI-RADS a, 77/165 (46.7%) of examinations categorized as BI-RADS b, 104/145 (71.7%) of examinations categorized as BI-RADS c, and 20/26 (76.9%) of examinations categorized as BI-RADS d (Table 3).
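The density-stratified counts can be cross-checked against the feature totals reported earlier. The sketch below uses only numbers from the text; the names `with_calcifications` and `pct` are illustrative.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal, as reported in the text."""
    return round(100 * numerator / denominator, 1)


# BI-RADS density category -> (examinations with calcifications or asymmetry
# with calcifications, all reviewed examinations in that category).
with_calcifications = {
    "a": (26, 46),
    "b": (77, 165),
    "c": (104, 145),
    "d": (20, 26),
}

by_density = {k: pct(n, total) for k, (n, total) in with_calcifications.items()}

# The strata should add up to the totals given for the whole sample:
# 115 calcifications + 112 asymmetry with calcifications, out of 382.
total_with_calc = sum(n for n, _ in with_calcifications.values())
total_reviewed = sum(total for _, total in with_calcifications.values())

print(by_density)
print(total_with_calc, total_reviewed)
```

The strata sum to 227 of 382 examinations, consistent with the 115 examinations with calcifications alone and the 112 with asymmetry and calcifications.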
Of the 115 examinations with calcifications alone, 91 (79.1%) were classified with benign mammographic morphology and 89 (77.4%) with a cluster distribution (Table 4). Combined, 71/115 (61.7%) examinations were classified as having both benign morphology and a cluster distribution. Benign and amorphous calcifications together accounted for 110/115 (95.7%) of the examinations with calcifications alone. All 12 examinations classified as having calcifications with linear distribution were annotated as benign vascular calcifications by the reviewing radiologists. None of the examinations with calcifications were classified as fine linear/fine linear branching.
Table 4. Classification of morphology and distribution of calcifications according to BI-RADS, in a blinded consensus review of screening mammograms with AI score 10 and a true-negative screening result.
Values are given as n (%). Percentages are calculated with the total number of examinations with calcifications (n = 115) as denominator.
AI, artificial intelligence; BI-RADS, Breast Imaging Reporting and Data System.
Discussion
In this retrospective blinded consensus review of 382 screening examinations with high AI scores but a true-negative screening result, 378 (99.0%) examinations were given an interpretation score of 1 (negative) or 2 (probably benign) by three experienced breast radiologists. Three out of four screening examinations with an interpretation score of 2 have been shown to be dismissed at consensus in BreastScreen Norway (21). Thus, in a hypothetical screening scenario where an AI system flags suspicious examinations for consensus review, most AI-generated false positives would likely be dismissed during consensus, and the women would not be recalled for further assessment.
When prior mammograms were available for the radiologists, 178/190 (93.7%) examinations were interpreted as negative, compared to 170/192 (88.5%) when prior mammograms were not provided. In another review study investigating breast cancer cases with low AI scores, most of the screen-detected cancers were classified with benign-looking features that were new or developing compared to prior examinations (25). Results from the present study complement those findings and highlight the value of having prior examinations available for comparison, both in the context of AI-generated false positives and false negatives.
In our review, only one examination was classified as a spiculated mass, which is considered the mammographic feature most suspicious for breast cancer, compared to 233/482 (48.3%) for the invasive cancers (21). Although differences in the distribution of mammographic features between examinations with and without cancer are expected, our results reinforce the notion that AI-generated false positives are unlikely to contribute to an increased rate of false-positive outcomes in population-based screening.
More than half of the reviewed mammograms showed calcifications (calcifications alone or asymmetries with calcifications). Among mammograms with calcifications alone, the majority (71/115, 61.7%) were small clusters of calcifications with benign morphology, with an interpretation score of 1, easily dismissed as negative by the reviewing radiologists. Our observations suggest that most AI-generated false positives due to calcifications would be considered non-suspicious and likely not be recalled in a future AI-supported screening setting with AI markings available to the radiologists.
Among AI-generated false positives in women with the highest mammographic breast density (BI-RADS d), a population known to have a high false-positive rate (8), most examinations featured calcifications, and all were assigned an interpretation score of 1 by the reviewing radiologists. Although the number of AI-generated false positives with BI-RADS d was small (n = 26), the findings indicate that AI-supported screening may not lead to unacceptably high recall rates among women with mammographically dense breasts.
For a population-based screening program to be cost-effective, the recall rate needs to be acceptable. Although AI-supported screening could yield higher cancer detection rates and reduced screening volume, even a small increase in recall rate would erode these advantages. In a Swedish randomized controlled trial (11,19) and a Swedish prospective population-based clinical trial (12) examining the use of AI in screening, the recall rates remained stable. Our results, based on retrospective data, support the results from these trials and may partly explain the apparently unaffected recall rates.
Depending on the specific setup of AI-supported screening, use of AI in screen-reading may result in an increased consensus rate (12,20). This means that a higher proportion of examinations must be dismissed at consensus to keep the recall rate at an acceptable level. Our review indicates that most potential AI-generated false positives are easily dismissed by human readers during consensus, and would therefore likely not contribute to an increased recall rate. Combined with other potentially beneficial effects of AI in screening (e.g. triaging), the net effect of AI on consensus and recall rates remains unknown, although both rates could potentially be reduced. Further, our observations indicate that the use of AI in the interpretation procedure requires human surveillance to avoid increased recall rates. Use of AI as a decision support tool might thus be the preferable approach.
The main strength of this study was the high data quality, as image data were merged with complete screening data from the Cancer Registry of Norway, which is close to 100% complete for solid cancers (26). However, the study has some limitations. These include the inherent weaknesses and laboratory effect associated with a retrospective consensus review of a homogeneous negative population, and the possibility that the results were affected by the experience level of the reviewing radiologists. Reviewing the same mammograms with less experienced or external radiologists may yield different results.
In conclusion, this retrospective study found that most screening mammograms flagged with high AI scores, despite having a confirmed true-negative screening result, exhibited non-suspicious mammographic features and were correctly interpreted as negative by the reviewing radiologists. Our findings indicate that false positives generated by AI likely would not result in an increased recall rate in a real screening setting where AI is used as decision support.
Supplemental Material
sj-docx-1-acr-10.1177_02841851251363697 - Supplemental material for Mammographic features in screening mammograms with high AI scores but a true-negative screening result
Supplemental material, sj-docx-1-acr-10.1177_02841851251363697 for Mammographic features in screening mammograms with high AI scores but a true-negative screening result by Henrik Wethe Koch, Marie Burns Bergan, Jonas Gjesvik, Marthe Larsen, Hauke Bartsch, Ingfrid Helene Salvesen Haldorsen and Solveig Hofvind in Acta Radiologica
Footnotes
Data availability
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplementary material
Supplementary material for this article is available online.
