Abstract
Tuberculosis (TB) is a major health issue with high mortality rates worldwide. Recently, a large body of artificial intelligence (AI) research has targeted TB to reduce the diagnostic burden. However, most of this research has been conducted in developed urban areas, and the feasibility of applying AI in low-resource settings remains unexplored. In this study, we apply an automated AI detection system to screen a large population in an underdeveloped area and evaluate the feasibility and contribution of applying AI to help local radiologists detect and diagnose TB using chest X-ray (CXR) images. First, we divide the image data into a training dataset of 2627 TB-positive and 7375 TB-negative cases and a testing dataset of 276 TB-positive and 619 TB-negative cases. Next, we build the AI system through image labeling and preprocessing, model training, and testing. A segmentation model named TB-UNet, which uses ResNeXt as the encoder of U-Net, is built to detect diseased regions. We use an AI-generated confidence score to predict the likelihood of each testing case being TB-positive. Then, we conduct two experiments to compare results between the AI system and radiologists with and without AI assistance. Study results show that the AI system yields a TB detection rate (sensitivity) of 85%, which is much higher than that of radiologists without AI assistance (62%). In addition, with AI assistance, the TB diagnostic sensitivity of local radiologists improves by 11.8 percentage points. Therefore, this study demonstrates that AI has great potential to help detection, prevention, and control of TB in low-resource settings, particularly in areas with fewer doctors and higher infection rates.
Introduction
Tuberculosis (TB) is a communicable disease, one of the top 10 causes of death worldwide, and the second leading cause of death from a single infectious source, after HIV/AIDS, with a mortality of approximately 1.2 million people in 2019 [1]. China has a high TB burden, accounting for 8.4% of global new cases and ranking 3rd worldwide in terms of incidence [1]. In Xinjiang, a remote region of north-western China, the average reported incidence of TB was 2.58 times that of the whole country from 2011 to 2015 [2]. The conventional techniques for the diagnosis of pulmonary tuberculosis mainly include sputum smear culture tests, skin tests, chest X-rays, and molecular diagnostic tests [3]. Among them, the chest X-ray (CXR), which produces images of the thoracic region, is a fast and low-cost screening method and the primary radiologic modality used. All patients ≥15 years old receive a CXR as a mandatory part of their appointment and treatment against TB [4]. However, remote regions have a shortage of experienced radiologists and a high burden of infected people.
Currently, deep learning (DL) has been widely and successfully adopted in many scenarios. Therefore, many researchers and companies have begun to apply DL techniques to TB diagnosis with CXRs to reduce the diagnostic burden [5–11]. Jaeger et al. extracted the lung region using a graph cut segmentation method and then computed a set of texture and shape features to classify CXRs as normal or abnormal [5]. Their system approached the performance of human experts: it achieved 78.3% accuracy, whereas radiologists achieved an accuracy of approximately 82%, and the radiologists’ false-positive rate was about half of the system’s rate. In contrast, Hwang et al. developed a deep learning based automatic detection (DLAD) algorithm [6]. The performance of the DLAD was validated using six external multicenter and multinational datasets. The algorithm showed significantly higher performance in both classification (0.993 versus 0.746–0.971) and localization (0.993 versus 0.664–0.925) when compared to all groups of physicians. From our literature review, we found that most DL applications were deployed in developed cities and regions, and the feasibility of applying DL for TB diagnosis on CXRs in low-resource settings remained unexplored.
In the remote area of northwestern China, the economic and climatic conditions are considerably inferior to those of the inland areas. Kashi Prefecture, located in the southwest of Xinjiang, is a unique region on the border of China, with many ethnic groups and wide ethnic differences. Kashi, a high TB epidemic region, accounts for more than a third of the annual incidence in Xinjiang [12]. Owing to the shortage of experienced radiologists and the high number of infected people, the application of DL techniques for TB diagnosis is urgently needed in this area. An artificial intelligence (AI) system could significantly improve health outcomes while evaluating the current system objectively and optimizing the system for further universalization.
In this study, we apply an automatic detection system for mass screening of large populations in the remote region of northwest China. We aim to evaluate the performance and feasibility of this developed AI diagnostic system to help radiologists in diagnosing TB from the data collected from the chest radiographs and the essential medical information of the patient. AI assistance is merged into a diagnostic workflow where radiologists could refer to predictions from the DL algorithm on a website and edit the radiographic reports that needed to be provided to patients. By evaluating the performance of the automatic detection system and the utility of the assistance to radiologists who work with the detection system, we demonstrate the significance of the AI assistant system when applied in low-resource regions.
The rest of the paper is organized as follows: Section 2 describes the data used, along with our method. Section 3 presents our results, and the last two sections contain the discussion and conclusion, respectively.
Materials and methods
Data
The AI model for TB screening that we apply to the low-resource area is established with CXRs from public datasets and hospitals in other areas of China, mainly from cities. The testing and evaluation procedures, however, are conducted on the data from low-resource regions.
The data used for model establishment are regarded as the training set. The training set consists of 2627 TB cases, 4416 normal cases, and 2959 non-TB abnormal cases. This total of 10,002 cases is collected from public datasets (CHNCXR [13] and MCUCXR [14]) and several Chinese hospitals located in Shenzhen and Guangzhou, which are developed cities of China.
The testing set consists of 895 cases from The First People’s Hospital of Kashi Area from March 2019 to July 2020, out of which 276 are positive TB cases, 577 are normal cases, and 42 are non-TB abnormal cases. Figure 1 shows CXR images of a TB case and a normal case from the testing set. For the statistical relevance of the experiment, testing data are collected at approximately the same ratio as the training set (the ratio of positive to negative is about 1:2). All positive TB cases of the test set are identified by the image-level gold standard, that is, unanimous assessments of CXRs from three senior radiologists after consultation, based on the WS 288–2017 (diagnosis for pulmonary tuberculosis) as issued by the National Health and Family Planning Commission (NHFPC) of the People’s Republic of China [15]. More complete information on the training and testing sets is summarized in Table 1.

Figure 1. Example CXRs in the testing set: (a) TB case; (b) normal case.
Table 1. Data used in the development of the AI model
The development of the AI model for detecting TB includes data preparation and model training. In the data preparation stage, CXRs of patients with TB are labeled with lesion masks by radiologists to be used for training and validation. In contrast, the testing CXRs are labeled only by category because the evaluation in this experiment only requires category results after screening. After image labeling and preprocessing, a segmentation model based on the U-Net is built to locate lesions. Finally, the confidence scores of TB are calculated using the mask that the AI model generates. The following subsections describe the specific procedures for model development.
Data preprocessing
The CXRs with lesions are manually labeled by three senior radiologists and used as the gold standard to train deep neural networks for identifying CXRs with abnormal lung findings. The senior radiologists, who are all deputy chief physicians, have more than 10 years of working experience in reading CXRs. They are authorities in the medical field, and their judgment is consistent with the pathological results. The details of the radiologists’ interpretation are described in Section 2.3. The senior radiologists label the images with 0 (normal), 1 (TB), or others (non-TB abnormal cases).
After the data are collected and labeled, the dataset is preprocessed. Images are padded, if necessary, to equal height and width, rescaled to 512×512 pixels, normalized from the pixel-value range [0, 255] to [0, 1], and then augmented by random changes in brightness, contrast, and saturation.
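The preprocessing steps above can be illustrated with a minimal NumPy sketch. It is not the authors' pipeline: the function names, the centered zero-padding, the nearest-neighbour resize, and the jitter ranges are all assumptions made for illustration. Saturation jitter is omitted because it has no effect on a single-channel grayscale array.

```python
import numpy as np

def pad_to_square(img: np.ndarray) -> np.ndarray:
    """Zero-pad a 2-D grayscale image so that height equals width (image centered)."""
    h, w = img.shape
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    out = np.zeros((side, side), dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out

def resize_nearest(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Nearest-neighbour resize of a square image (stand-in for a library resize)."""
    side = img.shape[0]
    idx = np.arange(size) * side // size  # source row/column for each output pixel
    return img[np.ix_(idx, idx)]

def preprocess(img: np.ndarray, size: int = 512) -> np.ndarray:
    """Pad, rescale to size x size, and map pixel values from [0, 255] to [0, 1]."""
    img = resize_nearest(pad_to_square(img), size)
    return img.astype(np.float32) / 255.0

def augment(img: np.ndarray, rng, max_brightness=0.1, max_contrast=0.2) -> np.ndarray:
    """Random brightness shift and contrast scaling (hypothetical ranges), clipped to [0, 1]."""
    shift = rng.uniform(-max_brightness, max_brightness)
    scale = rng.uniform(1.0 - max_contrast, 1.0 + max_contrast)
    mean = img.mean()
    return np.clip((img - mean) * scale + mean + shift, 0.0, 1.0)

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(300, 200), dtype=np.uint8)  # synthetic stand-in for a CXR
x = preprocess(raw)
x_aug = augment(x, rng)
print(x.shape, x_aug.shape)
```

In practice an interpolating resize from an imaging library would likely be preferred over nearest-neighbour sampling; the version here only keeps the sketch dependency-light.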
Model: TB-UNet
In this paper, we propose TB-UNet, a U-Net variant with a replaced encoder, to detect TB lesions on CXRs. Figure 2 shows the architecture of TB-UNet. U-Net is a popular network for medical image segmentation; it is a U-shaped fully convolutional neural network without fully connected layers [16]. In the structure of U-Net, the up-sampling path has a large number of feature channels, which allows the network to propagate context information to higher-resolution layers. Structurally, the compression and expansion paths are roughly symmetrical, forming a U-shape.

Figure 2. The architecture of TB-UNet.
Generally, the depth or width of a network is increased to achieve higher accuracy. However, a greater number of hyperparameters increases the design difficulty and the computational cost of the network. To address this problem, we replace the U-Net encoder with ResNeXt [17]. ResNeXt combines ideas from ResNet [18] and Inception [19]. Unlike Inception V4 [20], ResNeXt does not require manually designed, complex Inception structures; instead, it uses the same topology for each branch. The structure of ResNeXt is simple, with fewer hyperparameters than Inception V4, yet its performance is superior. Because branches with the same topology are better suited to graphics processing unit (GPU) hardware, ResNeXt also runs faster.
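To make the "same topology for each branch" idea concrete, the sketch below counts the weights of a bottleneck block whose parallel branches are implemented as a grouped 3×3 convolution. The channel widths (256-64-256 for a ResNet-style block, 256-128-256 with 32 groups for a ResNeXt-style block) follow the commonly cited 32×4d template from the ResNeXt paper; they are assumptions for illustration and are not the configuration reported for TB-UNet.

```python
def conv_params(c_in: int, c_out: int, kernel: int, groups: int = 1) -> int:
    """Number of weights in a 2-D convolution (bias terms ignored)."""
    assert c_in % groups == 0 and c_out % groups == 0
    # Each output channel only sees c_in / groups input channels.
    return kernel * kernel * (c_in // groups) * c_out

def bottleneck_params(c_io: int, width: int, groups: int) -> int:
    """1x1 reduce -> 3x3 (optionally grouped) -> 1x1 expand."""
    return (conv_params(c_io, width, 1)
            + conv_params(width, width, 3, groups)
            + conv_params(width, c_io, 1))

resnet_block = bottleneck_params(256, 64, groups=1)     # single wide branch
resnext_block = bottleneck_params(256, 128, groups=32)  # 32 identical branches
print(resnet_block, resnext_block)  # 69632 70144
```

Under these assumed widths, the two blocks have nearly the same parameter budget, and the only extra design choice in the grouped version is the number of groups (the cardinality).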
In this research, we use an updated network, TB-UNet, which combines the U-Net encoder/decoder structure with ResNeXt. In TB-UNet, each submodule of the encoder is replaced with a ResNeXt module. Therefore, TB-UNet can use the U-Net structure to learn low-dimensional and high-dimensional features simultaneously while deepening the network and improving optimization efficiency, without increasing the design difficulty or the computational cost.
The confidence score of TB, which is the likelihood that the patient has TB, is an important result in this experiment. The confidence score of the AI system is acquired from the region of interest (ROI) of the output of the TB-UNet model.
Let $V_{i,j}$ denote the pixel value in the $i$th row and $j$th column of the output mask, where $i \in \{0, 1, \ldots, 1023\}$ and $j \in \{0, 1, \ldots, 1023\}$. The confidence score $c$ ($c \in [0, 1]$) of the prediction generated by the method is defined as:
$$c = \frac{1}{\lVert R \rVert} \sum_{(i,j) \in R} V_{i,j}$$
where $R \subseteq \{0, 1, \ldots, 1023\} \times \{0, 1, \ldots, 1023\}$ is the set of pixel indices in the ROI and $\lVert R \rVert$ is the size of set $R$.
Confidence scores represent the possibility of having the TB disease and range from 0 to 1. A confidence score of 0 means that the patient is unlikely to have TB. The closer the confidence score is to 1, the more likely the patient is to have TB.
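As a minimal sketch, and assuming the score is the mean mask value over the ROI (which is what the definitions of V, R, and ∥R∥ suggest), the computation could look as follows. The function name, the tiny 3×3 mask, and the convention of returning 0 for an empty ROI are hypothetical; how the ROI itself is selected is not specified here.

```python
def confidence_score(mask, roi):
    """Mean of mask values V[i][j] over the ROI index set R.

    mask: 2-D list of per-pixel values in [0, 1] (model output).
    roi:  set of (row, column) index pairs.
    Returns 0.0 for an empty ROI (an assumed convention, not from the paper).
    """
    if not roi:
        return 0.0
    return sum(mask[i][j] for i, j in roi) / len(roi)

# Toy 3x3 mask; the real output mask is indexed 0..1023 in each dimension.
mask = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.7],
    [0.0, 0.8, 0.6],
]
roi = {(1, 1), (1, 2), (2, 1), (2, 2)}
c = confidence_score(mask, roi)
print(round(c, 3))  # mean of 0.9, 0.7, 0.8, 0.6
```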
The experiment mainly aims to compare the results between the AI system, radiologists without AI, and radiologists with AI assistance. Six radiologists from low-resource regions participate in this experiment: three senior radiologists and three junior radiologists. The senior radiologists have 11 to 14 years of working experience in reading CXRs. They are all Deputy Chief Physicians and take responsibility for the labeling in this experiment. The junior radiologists have one to three years of CXR diagnostic experience; their diagnoses are compared with those of the AI system, and they receive its assistance during the testing procedure. As designed, every junior radiologist gives their own independent diagnosis on the entire testing set; that is, they divide cases into seven categories: 0 (normal cases), 1–5 (TB cases at five severity levels), and non-TB cases. The five TB severity levels (faint, mild, moderate, severe, and critical) are multiplied by 0.2 to obtain confidence scores of 0.2, 0.4, 0.6, 0.8, and 1, respectively.
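The mapping from a radiologist's category to a confidence score can be sketched as below. The severity names and the factor 0.2 come from the text; the function name, the `"non-TB"` label, and the choice to map non-TB abnormal cases to 0.0 (treating them as TB-negative) are assumptions for illustration.

```python
SEVERITY_NAMES = {1: "faint", 2: "mild", 3: "moderate", 4: "severe", 5: "critical"}

def radiologist_confidence(category):
    """Convert a reading category into a TB confidence score in [0, 1].

    0        -> 0.0 (normal)
    1..5     -> level * 0.2 (TB severity level)
    "non-TB" -> 0.0 (assumed: abnormal but TB-negative)
    """
    if category == 0 or category == "non-TB":
        return 0.0
    if category in SEVERITY_NAMES:
        return round(category * 0.2, 1)  # round away floating-point noise
    raise ValueError(f"unknown category: {category!r}")

for level, name in SEVERITY_NAMES.items():
    print(level, name, radiologist_confidence(level))
```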
The three junior radiologists diagnose the CXRs of the testing set twice. The first time, they read the CXRs and write the reports on the website without AI assistance. In particular, the report area beside the image is empty, and only templates of different symptoms can be referred to. Fourteen days later, they read the images for the second time and write reports on the same website. This time, however, the reports have already been generated by the AI model, and the radiologists only need to edit them according to their judgment of the CXRs. Figure 3 shows examples of the AI system applied to one normal case and one TB case. The CXR is shown on the left, and the AI analysis result is shown on the right. The result includes the cardiothoracic ratio, TB suspected degree, abnormality degree, lesion information (number, type, and location), and the diagnostic report.

Figure 3. Illustration of AI results in one normal case (a) and one TB case (b).
To avoid the interference of short-term memory on the experiment, the testing set is shuffled prior to the second reading of the CXRs by the junior radiologists, and the interval between the two readings has been set to 14 days.
Performance comparison
AI versus radiologists
First, we use standard screening indicators to evaluate the performance of the AI system and the average diagnosis of the radiologists. Compared with the average manual diagnosis, the AI system achieves substantially higher sensitivity (85.7%) and accuracy (91.0%), at the cost of a higher false positive rate (FPR). The receiver operating characteristic (ROC) curves of the AI system and the average manual diagnosis are shown in Fig. 4, and Table 2 compares the indicators. Although the specificity decreases slightly, the detection rate (sensitivity) increases from 62.7% to 85.7%. This substantial increase can considerably strengthen TB screening, which in turn benefits TB prevention and treatment.
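For reference, the indicators discussed above can be computed from ground-truth labels and confidence scores as in the following sketch. The threshold of 0.5 and the ten toy cases are hypothetical and are not the study's operating point or data. The area under the ROC curve is computed through its rank interpretation, the probability that a randomly chosen positive case scores higher than a randomly chosen negative one.

```python
def screening_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity, FPR, and accuracy at a given score threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 4 TB-positive and 6 TB-negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.1, 0.0]
metrics = screening_metrics(y_true, y_score)
auc = roc_auc(y_true, y_score)
print(metrics, auc)
```

Lowering the threshold trades specificity for sensitivity, which is the trade-off described in the comparison above.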

Figure 4. Two ROC curves for (a) AI system; (b) Manual average (alone).
Table 2. Performance of manual average and AI system
Second, radiologists with and without AI assistance are compared based on the confidence score introduced in Section 2.2.3. The average confidence scores of the three radiologists are calculated in the two experiments, and a t-test is used to analyze the two groups of confidence scores. A p-value below 0.05 is regarded as statistically significant. In this study, the p-value for the average confidence score is 0.007, which indicates that the difference in the average confidence score of diagnoses with and without AI assistance is statistically significant. Consequently, AI prediction results have an impact on the manual diagnosis.
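The text does not state which t-test variant was used. Because the same cases are read twice, one plausible choice is a paired t-test on per-case average confidence scores, sketched below with the standard library only. The function name and toy scores are hypothetical, and the two-sided p-value uses a normal approximation to the t distribution, which is only reasonable for large samples such as the 895-case testing set.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_t_test(without_ai, with_ai):
    """Paired t statistic and approximate two-sided p-value.

    The p-value uses the standard normal distribution in place of the
    t distribution (an approximation that holds for large n).
    """
    diffs = [b - a for a, b in zip(without_ai, with_ai)]
    n = len(diffs)
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(n))
    p_approx = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p_approx

# Toy per-case average confidence scores for the two readings (not study data).
without_ai = [0.2, 0.0, 0.6, 0.4, 0.0, 0.8, 0.2, 0.0]
with_ai = [0.4, 0.0, 0.8, 0.4, 0.2, 0.8, 0.4, 0.0]
t_stat, p_value = paired_t_test(without_ai, with_ai)
print(round(t_stat, 3), round(p_value, 4))
```

For small samples, or for an exact p-value, a statistics package that implements the t distribution would be the better tool.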
Moreover, we evaluate the significance of AI assistance to radiologists in low-resource areas. The comparison of indicators of radiologists with and without AI assistance is shown in Fig. 5 and Table 3. With the assistance of the AI system, the average manual diagnosis achieves higher sensitivity (74.5%) than without AI (62.7%). Although the FPR rises slightly, this is acceptable for detection of TB because high sensitivity is more important for prevention and control of infectious diseases. As long as the FPR remains in a controllably low range, the increase in sensitivity is of great significance for low-resource areas to reduce missed TB cases.
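As a worked check of the sensitivity figures above, the gain can be expressed both as an absolute difference and relative to the unassisted baseline. The last line is only an illustrative estimate: it applies the average gain to the 276 TB-positive testing cases, and it is not an integer because the sensitivities are averaged over three radiologists.

```latex
\[
74.5\% - 62.7\% = 11.8 \text{ percentage points}, \qquad
\frac{74.5 - 62.7}{62.7} \approx 0.188 \ (\text{about } 18.8\% \text{ relative gain})
\]
\[
0.118 \times 276 \approx 32.6 \text{ additional TB cases detected per reader, on average}
\]
```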

Figure 5. Comparison of ROC curves.
Table 3. Performance of average manual diagnosis with and without AI assistance
In addition to accuracy, efficiency is also an essential aspect of TB screening, particularly in areas where radiologists have a heavy workload. We therefore compare the time the three junior radiologists need in the two readings of the testing set: the first reading, in which they write each report from an empty report area with only symptom templates available, and the second reading, in which they only edit the report already generated by the AI model.
The time for reading a CXR and writing the report is measured from the moment the radiologist begins to read the CXR to the moment the radiologist submits the report. Figure 6 shows a boxplot of the time required for one CXR. With AI assistance, all radiologists take approximately half of the average time taken without AI assistance. The specific times are listed in Table 4. The overall average time decreases from 38.83 s to 15.93 s. Hence, diagnosis with AI assistance not only improves the accuracy, but also greatly improves work efficiency, saves time, and speeds up TB screening.
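The reported averages can be turned into an absolute and a relative time saving per CXR; the arithmetic below uses only the two figures given in the text.

```latex
\[
38.83\,\mathrm{s} - 15.93\,\mathrm{s} = 22.90\,\mathrm{s}, \qquad
\frac{22.90}{38.83} \approx 0.59, \qquad
\frac{15.93}{38.83} \approx 0.41
\]
```

That is, the overall average reading time with AI assistance is about 41% of the unassisted time, a reduction of roughly 59%.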

Figure 6. Boxplot of time.
Table 4. Comparison of reading time
Discussion
Despite the introduction of AI in medicine in recent years, the need for applications and assistance in primary hospitals, and even in remote or poor areas, remains. In this study, the developed AI detection system is applied to Kashi Prefecture. Kashi is located on the border of China, its population is diverse, and its medical and economic conditions contrast with those in eastern China. The annual incidence of TB in the region of Kashi is equivalent to that in a high-incidence province in east China. Kashi also has a shortage of radiologists. According to the statistics from The First People’s Hospital of Kashi, there are 72 radiologists and about one million CXRs to be read annually. In addition, approximately 40 radiologists are on duty every day and are responsible for reading CXRs and diagnosing lung disease. In some rural hospitals of Kashi, no one is available who can read CXRs, and technicians have to send CXRs to doctors in urban hospitals for diagnosis. This study provides a solution to this serious problem and confirms its high efficiency, effectiveness, and relative accuracy.
A major obstacle in the implementation of AI is that the laboratory results are often quite different from the results in clinical practice [21]. On the one hand, owing to the different manners in which technicians operate equipment and the slow network speed of the hospital, there is a practical gap between the laboratory design and clinical application scenarios. Therefore, the improvement in work efficiency brought by AI may not be as great as imagined. On the other hand, the differences in equipment and operator habits across hospitals result in data sources (images) with very different formats, brightness, and feature distributions. Thus, a model established on the data from one or several hospitals will not necessarily achieve a satisfactory performance in another hospital. In this study, we design an experiment based on our proposed AI system to verify whether such an inconsistency exists in our system. Our training and testing sets come from different hospitals, located mainly in developed cities in southern China and in an underdeveloped area of northwestern China, respectively. No testing data are exposed during the training procedure. In the results, both the comparison of radiologists versus AI and the comparison of with AI versus without AI show a high detection rate for the AI system. Compared with radiologists alone, the AI system raises the detection rate (sensitivity) from 62.7% to 85.7% with a slight increase in false positives. With the assistance of the AI system, the average manual diagnosis achieves an increase of 11.8 percentage points in sensitivity. Although the FPRs in the two experiments rise slightly, this is acceptable for detection of TB because high sensitivity is more significant for prevention and control of infectious diseases. When the false positive rate remains in a controllably low range, the increase in sensitivity is of great significance for low-resource areas to reduce missed TB cases.
However, we are also aware that, in future studies, the number of false positives needs to be decreased. False positive detection from AI will be further analyzed to investigate the effect of non-TB abnormality in the future. These effects of false positive detection will also be investigated to determine how radiologists may react to them.
Conclusion
In this study, we demonstrate that the diagnostic ability of the AI system for TB is comparable to that of radiologists in the low-resource region and, in some respects, even exceeds that of junior radiologists there. The AI detection rate of TB (85%) is higher than that of the radiologists (62%) in the low-resource region. In addition, with the assistance of the AI system, the TB diagnosis sensitivity of radiologists from the low-resource region improves by 11.8 percentage points. Therefore, AI has great potential to help the detection, prevention, and control of infectious diseases in low-resource settings, particularly in areas with few doctors and many infected people.
Acknowledgments
The research was supported by the Shenzhen Science and Technology Program (Grant Nos. KQTD 2017033110081833 and JSGG20201102162802008) and the National Key Research and Development Program of China (Grant No. 2019YFE0121400), and partially supported by the Regional Collaborative Innovation Project in Xinjiang Uygur Autonomous Region (Grant No. 2020E01012), The First People’s Hospital of Kashi ‘Pearl River Scholars-Tianshan Talents’ Cooperative Expert Studio Innovation Team Project (Grant No. KDYY202017), and the provincial and ministerial joint construction of the open project of the State Key Laboratory of High Incidence and Prevention in Central Asia (Grant No. SKL-HIDCA-2020-KS3).
Conflicts of interest
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.
