Abstract
Background
There have been no reports on the diagnostic performance of deep learning-based automated detection (DLAD) for thoracic diseases in a real-world outpatient clinic.
Purpose
To validate DLAD for use at an outpatient clinic and analyze the interpretation time for chest radiographs.
Material and Methods
This is a retrospective single-center study. From 18 January 2021 to 18 February 2021, 205 chest radiographs with DLAD and paired chest CT from 205 individuals (107 men and 98 women; mean ± SD age: 63 ± 8 years) from an outpatient clinic were analyzed for external validation and observer performance. Two radiologists independently reviewed the chest radiographs by referring to the paired chest CT and made reference standards. Two pulmonologists and two thoracic radiologists participated in observer performance tests, and the total amount of time taken during the test was measured.
Results
The performance of DLAD (area under the receiver operating characteristic curve [AUC] = 0.920) was significantly higher than that of pulmonologists (AUC = 0.756) and radiologists (AUC = 0.782) without assistance of DLAD. With help of DLAD, the AUCs were significantly higher for both groups (pulmonologists AUC = 0.853; radiologists AUC = 0.854). A greater than 50% decrease in mean interpretation time was observed in the pulmonologist group with assistance of DLAD compared to mean reading time without aid of DLAD (from 67 s per case to 30 s per case). No significant difference was observed in the radiologist group (from 61 s per case to 61 s per case).
Conclusion
DLAD demonstrated good performance in interpreting chest radiographs of patients at an outpatient clinic, and was especially helpful for pulmonologists in improving performance.
Keywords
Introduction
Deep learning-based automated detection (DLAD) is built on deep learning, a branch of machine learning. Whereas conventional machine learning needs human-engineered patterns to extract results from raw data, DLAD identifies image patterns through a multi-layered convolutional neural network without human involvement (1–3). DLAD has been applied to many tasks in thoracic imaging, such as detection, classification and segmentation (1). Recently, favorable diagnostic accuracy of DLAD on chest radiographs has been reported (4–11). Good performance was noted in the detection of malignant pulmonary nodules (12) and major thoracic diseases (13), with areas under the receiver operating characteristic curve (AUROCs) of 0.990 and 0.983, respectively. However, many of those studies have limitations, such as data sets arbitrarily selected in terms of size and number (13) and intentional exclusion of indeterminate cases (12). In addition, some studies used disease-enriched data sets, so the reported accuracy of DLAD may not reflect real clinical practice. Although Lee et al. (14) reported good diagnostic accuracy for DLAD in lung cancer detection in a real-world health screening population, the incidence of thoracic disease in the general population is quite low and may not reflect the reality in clinical settings.
Recently, DLAD for chest radiographs was approved by the Food and Drug Administration (FDA) and was used and evaluated in an emergency department (15) and in a multicenter health screening cohort. Nam et al. (15) reported that DLAD demonstrated excellent performance, improved radiologist interpretation and shortened time-to-report for critical and urgent cases in an emergency department. Kim et al. (16) reported not only a high concordance rate with chest radiograph reports, but also a slightly increased reading time in a health screening cohort. However, DLAD has not been validated in the real-world setting of an outpatient clinic. Diagnostic performance measured in settings where chest radiographs are mostly normal may be overestimated relative to performance on outpatient chest radiographs of patients with respiratory diseases. Additionally, the interpretation time for chest radiographs at an outpatient clinic may differ from that in other settings. To date, there have been no reports on the diagnostic performance of DLAD for thoracic diseases in a real-world outpatient clinic. In the present study, we analyzed DLAD for use at an outpatient clinic and focused on practical issues regarding the interpretation time of chest radiographs.
Material and Methods
The institutional review board of Chonbuk National University Hospital approved this study (CUH-2021-08-008) and the requirement for informed consent was waived.
Study population and data collection
From 18 January 2021 to 18 February 2021, DLAD was applied in two chest radiograph machines. In total, 3268 chest radiographs from an outpatient clinic, which included results of DLAD, were assessed. Exclusion criteria were: (a) chest radiographs (n = 2806) that did not have corresponding chest computed tomography (CT) taken within 2 weeks; (b) extensive disease involvement in more than three lung zones (n = 38); (c) previous operation (n = 116); (d) more than 50% lung destruction (n = 92); and (e) interval changes between chest radiograph and chest CT (n = 11) (Fig 1). Finally, 205 chest radiographs and paired chest CT were analyzed for external validation and observer performance.

Flow diagram of patient inclusion and exclusion. CT, computed tomography; DLAD, deep learning-based automated detection algorithm; PA, Posterior-Anterior.
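The inclusion arithmetic described above can be checked with a short sketch. The counts are taken from the exclusion criteria in the text; the variable names and the dictionary layout are illustrative assumptions, not part of the study's own tooling.

```python
# Counts reported in the text; names are illustrative only.
total_radiographs = 3268

exclusions = {
    "no chest CT within 2 weeks": 2806,
    "extensive disease (more than three lung zones)": 38,
    "previous operation": 116,
    "more than 50% lung destruction": 92,
    "interval change between radiograph and CT": 11,
}

# Radiographs remaining after all exclusion criteria are applied.
remaining = total_radiographs - sum(exclusions.values())
print(remaining)  # 205
```

The result matches the 205 chest radiographs carried forward to the external validation and observer performance test.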
Deep learning-based automated detection algorithm
A commercially available deep learning-based automated detection algorithm (Lunit INSIGHT CXR 3, version 3.1.2.0; Lunit Inc., Seoul, Korea) was adopted to evaluate the chest radiographs. The algorithm was developed to detect eight major thoracic abnormalities: atelectasis, cardiomegaly, consolidation, fibrosis, nodule, pleural effusion, pneumoperitoneum and pneumothorax. Detailed information about this algorithm has been reported previously (15). DLAD provided an image-wise probability score of each abnormality from 0 to 1 with a localization map overlaid on the chest radiographs by dividing the lung into four zones (Fig 2).

An example of interpretation by the deep learning-based automated detection algorithm.
Reader study for the reference standard
Two types of reference standards were made: one based on the CT image and one based on the chest radiograph with reference to the CT image. Two radiologists (DEL and KJC, with 1 and 11 years of experience in reading chest radiographs, respectively) independently reviewed the chest radiographs by referring to the paired chest CT and made reference standards for each abnormality. When the two reviewers disagreed, a third reviewer (GYJ, with 25 years of experience in reading chest radiographs) provided the deciding opinion, so that the majority opinion was adopted. Among the eight abnormalities detected by DLAD, four categories were used for the observer performance test: consolidation or nodule, fibrosis or atelectasis, cardiomegaly, and pleural effusion. Because it is challenging to differentiate consolidation from a nodule and fibrosis from atelectasis on a chest radiograph, each pair was grouped into a single category. Consolidation or nodule was defined as a nodular opacity or a consolidation of more than 1 cm on the chest radiograph. Pleural effusion included pleural thickening.
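The grouping and adjudication rules above can be illustrated with a minimal sketch. It assumes each reviewer's reading is recorded as a set of finding names per radiograph; the function names `to_categories` and `adjudicate` and the data layout are hypothetical and do not describe the software actually used in the study.

```python
# Mapping from the DLAD findings to the four test categories named in the text.
# Pneumoperitoneum and pneumothorax are not part of the four categories.
CATEGORY_OF = {
    "consolidation": "consolidation or nodule",
    "nodule": "consolidation or nodule",
    "fibrosis": "fibrosis or atelectasis",
    "atelectasis": "fibrosis or atelectasis",
    "cardiomegaly": "cardiomegaly",
    "pleural effusion": "pleural effusion",
}


def to_categories(findings):
    """Collapse individual findings into the four observer-test categories."""
    return {CATEGORY_OF[f] for f in findings if f in CATEGORY_OF}


def adjudicate(reader_a, reader_b, reader_c):
    """Majority opinion per category: keep what both readers report,
    and let the third reader decide categories on which they disagree."""
    agreed = reader_a & reader_b
    disputed = reader_a ^ reader_b
    return agreed | (disputed & reader_c)


a = to_categories(["nodule", "cardiomegaly"])
b = to_categories(["cardiomegaly"])
c = to_categories(["consolidation"])
print(sorted(adjudicate(a, b, c)))
```

In this example the two readers agree on cardiomegaly and disagree on the consolidation or nodule category, which the third reader confirms, so both categories enter the reference standard.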
Observer performance test
Two pulmonologists (SYP and JSJ, with 16 and 12 years of experience, respectively) and two thoracic radiologists (SYA and HJY, with 11 and 7 years of experience in reading chest radiographs, respectively), who are primary physicians responsible for interpreting and evaluating chest radiographs in clinical practice, participated in the observer performance test. The observer performance test comprised two steps. In step 1, the four observers independently evaluated chest radiographs without the assistance of DLAD. They classified radiographs by the presence or absence of abnormal findings and also classified abnormal findings into four categories (consolidation or nodule, fibrosis or atelectasis, cardiomegaly, and pleural effusion) using free-hand annotation. A 1-month washout period was employed to mitigate observer memory of the first step. In step 2, observers re-evaluated the images with the assistance of DLAD, blinded to their decisions in step 1. Observers were requested to measure the total amount of time taken during the observer performance test in both steps.
Statistical analysis
Receiver operating characteristic (ROC) curve analysis was performed to assess the image-wise probability score, and the area under the receiver operating characteristic curve (AUROC) was used as the performance measure. In addition, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and accuracy were calculated. McNemar's test was used to calculate the P-values for sensitivity and specificity, and the P-values for PPV and NPV were calculated using a weighted generalized score statistic. AUROCs were compared using DeLong's test. Cohen's kappa (κ) coefficient was used to assess inter-reader agreement between the two radiologists. All statistical analyses were performed using R, version 3.6.2 (R Project for Statistical Computing, Vienna, Austria). P < 0.05 was considered statistically significant. The Bonferroni method was used to correct P-values.
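For readers who want to see how these measures are defined, the following is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not the study's analysis, which was performed in R: the function names are hypothetical, McNemar's test is shown in its exact binomial form (the text does not say which variant was used), and DeLong's test and the weighted generalized score statistic are not covered.

```python
from math import comb


def diagnostic_metrics(tp, fp, fn, tn):
    """Standard accuracy measures from a 2x2 confusion table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


def auroc(scores, labels):
    """AUROC as the probability that a positive case scores higher than
    a negative case, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mcnemar_exact(b, c):
    """Two-sided exact McNemar P-value from the discordant pair counts."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohen_kappa(reader_a, reader_b):
    """Chance-corrected agreement between two readers' labels."""
    n = len(reader_a)
    observed = sum(x == y for x, y in zip(reader_a, reader_b)) / n
    labels = set(reader_a) | set(reader_b)
    expected = sum(
        (reader_a.count(l) / n) * (reader_b.count(l) / n) for l in labels
    )
    return (observed - expected) / (1 - expected)


def bonferroni(p_values):
    """Multiply each P-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

As a usage example on made-up counts, `diagnostic_metrics(8, 2, 2, 88)` gives a sensitivity of 0.8 and an accuracy of 0.96, and `mcnemar_exact(0, 5)` gives a P-value of 0.0625.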
Results
Demographics
In total, 205 chest radiographs and corresponding chest CT from 205 individuals (107 men and 98 women; mean ± SD age: 63 ± 8 years) were included. The average interval between chest radiograph and corresponding CT was 4 days. The number of normal chest radiographs was 150 (73.2%), and the numbers of chest radiographs with involvement of two lung zones and one lung zone were 16 (7.8%) and 39 (19.0%), respectively. The Pulmonology Department ordered the highest number of chest radiographs (88/205; 42.9%), followed by Hematology and Oncology (26/205; 12.7%), Breast and Thyroid Surgery (22/205; 10.7%), Nephrology (13/205; 6.3%), Gastrointestinal Surgery (10/205; 4.9%), Healthcare Center (8/205; 3.9%), Orthopedic Surgery (8/205; 3.9%), Otorhinolaryngology (7/205; 3.4%), Gastroenterology (7/205; 3.4%) and others (16/205; 7.8%). The chest radiographs were ordered for various reasons, including surveillance of lung metastasis from primary cancers (83/205; 40.5%), follow-up of underlying conditions such as solitary pulmonary nodule, infectious lung diseases and bronchiectasis (59/205; 28.8%), evaluation of symptoms (27/205; 13.2%), screening for preoperative evaluation or healthcare checkup (20/205; 9.8%) and monitoring of lung cancer (16/205; 7.8%) (Table 1).
Clinical characteristics of individuals and chest radiographs.
Results of observer performance test
Pneumoperitoneum and pneumothorax were excluded from the results because only one pneumothorax case was included in the observer performance test. The sensitivity, specificity, PPV, NPV and accuracy of DLAD were 0.845, 0.990, 0.782, 0.993 and 0.984, respectively (Table 2). In the first step of the observer performance test, conducted without the aid of DLAD, the sensitivity, specificity, PPV, NPV and accuracy of both pulmonologists and radiologists were all significantly lower than those of DLAD. The performance of DLAD (AUROC = 0.920) was also significantly higher than that of pulmonologists (AUROC = 0.756) and radiologists (AUROC = 0.782) (Fig 3a).

Comparison of performance for the deep learning-based automated detection algorithm (DLAD), pulmonologists and radiologists. (a) In step 1, the area under the receiver operating characteristic curve (AUROC) of the DLAD was significantly higher than that of pulmonologists and radiologists. (b) In step 2, with the assistance of the DLAD, the performance of the DLAD remained significantly higher than that of all observers. The AUROCs of pulmonologists and radiologists were significantly improved in step 2. AUC, area under the receiver operating characteristic curve.
Results of observer performance test based on chest radiograph.
Note: Numbers in parentheses are the 95% confidence interval. PPV, positive predictive value; NPV, negative predictive value; DLAD, deep learning-based automated detection algorithm.
*Statistically significant difference compared to DLAD.
Statistically significant difference compared to step 1.
In the second step, conducted with the assistance of DLAD, the sensitivity, specificity, PPV, NPV and accuracy of both groups were all significantly improved compared to the observer performance test without the aid of DLAD (Figs 4 and 5). However, even with the aid of DLAD, the results in the radiologist group were significantly lower than those of DLAD, except for specificity and PPV. The same trend was observed when chest CT was used as the reference standard (see Appendix, Table A1). The AUROCs were also significantly improved for both groups (pulmonologists AUROC = 0.853; radiologists AUROC = 0.854) (Fig 3b; see also Appendix, Fig. A1).

Representative case of the observer performance test. Lung metastasis in a 59-year-old male. (a) The chest radiograph shows a nodular opacity (arrow) at the left upper zone, which was overlapped by a rib. Initially, three of four observers detected it. (b) The paired computed tomography reveals a small nodular lesion (arrow) at the left lingular segment. (c) The deep learning-based automated detection algorithm (DLAD) correctly localized the lesion. The one observer who missed it in step 1 detected it with the help of the DLAD.

False positive case of the observer performance test in a 51-year-old female patient with left breast cancer. (a) The chest radiograph shows no active lung lesion. In step 1, all observers interpreted it as normal. (b) On paired computed tomography, the postoperative state of the left breast was observed (arrow). (c) However, the deep learning-based automated detection algorithm (DLAD) misinterpreted it as consolidation. Referring to the result of the DLAD, two observers regarded it as a nodule in step 2.
Without the aid of DLAD, the mean amount of time taken to interpret chest radiographs was 67 s per case in the pulmonologist group and 61 s per case in the radiologist group. With the assistance of DLAD, the mean reading time was 30 s per case in the pulmonologist group, which decreased by more than 50%. However, the times in the radiologist group were not significantly different with or without the assistance of DLAD (61 s per case) (Fig 6).

The mean amount of time taken by the two observer groups. In step 1, the mean amount of time taken per case in interpreting chest radiographs was 67 s in the pulmonologist group and 61 s in the radiologist group. In step 2, the mean amount of time taken decreased by more than 50% in the pulmonologist group. However, no difference was observed in the radiologist group.
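The reported change in reading time can be reproduced from the per-case means given above. This is a small arithmetic sketch; the function name `percent_decrease` is illustrative.

```python
def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, as a percentage."""
    return (before - after) / before * 100


# Mean reading times in seconds per case, as reported in the text.
pulmonologists = percent_decrease(67, 30)
radiologists = percent_decrease(61, 61)

print(round(pulmonologists, 1))  # 55.2
print(round(radiologists, 1))    # 0.0
```

The pulmonologist group's decrease of about 55% is consistent with the description "more than 50%", while the radiologist group's mean was unchanged.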
Discussion
The results of the present study show that the sensitivities, specificities, PPVs, NPVs, accuracies and AUROC values of radiologists and pulmonologists in the observer performance test without the aid of DLAD were all significantly lower than those of DLAD. With the aid of DLAD, the results for both groups significantly improved. Moreover, the mean amount of time taken by pulmonologists to interpret chest radiographs decreased by more than half with the aid of DLAD. These results demonstrate that DLAD performs well, with high sensitivity and specificity, even in an outpatient clinic, and can improve interpretation efficiency, especially for clinicians.
We validated a commercially available DLAD in interpreting chest radiographs at an outpatient clinic and focused on practical issues in the real-time interpretation of chest radiographs. There have been many studies about chest radiographs using DLAD (4,5,10,12–14) and, recently, it was approved by the FDA. Several studies have focused on validating FDA-approved DLAD in emergency departments (15) or in health screening cohorts (16). As far as we know, no study has reported chest radiograph interpretation using FDA-approved DLAD in an outpatient clinic. We consider it essential to evaluate the value of DLAD on chest radiographs taken in an outpatient clinic because a comprehensive range of patients would be included. In addition, because clinicians often encounter chest radiographs immediately after patients are imaged (17), it is essential to help clinicians find abnormalities on chest radiographs more quickly and accurately. Therefore, we focused on how DLAD improves the reading accuracy of radiologists and clinicians and on the effect of DLAD on interpretation time. We observed that faster and more accurate interpretations were possible with the assistance of DLAD.
In the present study, the specificity and NPV of DLAD were 0.990 and 0.993, respectively, and were similar to results reported in other studies (13,14). A high NPV means the number of “false negatives” is small (18) and so, once DLAD interprets a chest radiograph as normal, clinicians can comfortably trust the result. This study identified two clinically significant false-negative cases: one in which the nodule was too small to be detected on the chest radiograph, and another in which the nodule was difficult to visualize because it overlapped the hilum. These findings suggest that the limitations of chest radiography, rather than those of DLAD, account for these false-negative cases. We consider that the workflow efficiency of clinicians will be improved by using FDA-approved DLAD in an outpatient clinic, especially for patients with normal chest radiographs, although further real-time prospective studies are needed.
The time required to read chest radiographs decreased by more than 50% with the aid of DLAD for pulmonologists, but did not differ significantly for radiologists (Fig 6). Kim et al. (16) reported that the average reading time per case with artificial intelligence support increased by 0.2 s for normal chest radiographs and decreased by 0.2 s for non-normal chest radiographs for radiologists, which was similar to the results of the present study. These results may indicate that DLAD contributes more to interpretation efficiency for pulmonologists than for radiologists. Pulmonologists tend to rely on the results of DLAD reading and accept them as they are, whereas radiologists seem to interpret the results more critically and want to re-confirm them (16). Based on these results, we expect that DLAD could help clinicians interpret chest radiographs more quickly and precisely in an outpatient clinic. We also anticipate that DLAD will eventually be helpful to radiologists after it is more established and validated. In addition, to address the issue of clinicians over-relying on DLAD (13), it is recommended that they establish their own interpretation of chest radiographs before consulting the DLAD results. Further studies are necessary to explore this approach in more detail and evaluate its effectiveness.
The present study has several limitations. First, this is a retrospective, single-center study with a small data set. Second, we excluded chest radiographs with extensive disease involvement in more than three lung zones or severe lung destruction. In a case with severe lung destruction, DLAD does not localize but covers the entire lung as an abnormal lesion, and we considered that including such cases would unnecessarily affect the accuracy and interpretation time of human readers. Third, pneumothorax was excluded from the performance analysis because there was only one case. With the help of DLAD, a radiologist and a pulmonologist were able to detect the pneumothorax that was missed in the first step (see Appendix, Fig A2). This suggests that the performance of DLAD may be higher than the results of the analysis indicate, and that DLAD could potentially reduce the probability of error. Fourth, interpretation time was checked retrospectively and might differ from the actual reading time at the clinic. We consider that future prospective studies focusing on the effectiveness in radiograph interpretation are warranted. Fifth, we did not validate DLAD in admitted patients or emergency room patients because anterior–posterior chest radiographs are frequent in those settings.
In conclusion, deep learning-based automated detection demonstrated good performance in interpreting chest radiographs of patients at an outpatient clinic. It was even more helpful for pulmonologists by enhancing work efficiency in outpatient chest radiograph interpretation.
Footnotes
Acknowledgements
We express special thanks to Sara Park for the statistical analysis. We also extend our gratitude to Hea Jin Yang for participating in the observer performance test.
Ethics approval
The institutional review board of Chonbuk National University Hospital approved this study (CUH-2021-08-008), and the requirement for informed consent was waived.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclose receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Research Foundation of Korea (grant number NRF-2021R1C1C1009818).
