Abstract
BACKGROUND:
Although rheumatoid arthritis (RA) causes destruction of articular cartilage, early treatment significantly improves symptoms and delays progression. It is important to detect subtle damage for an early diagnosis. Recent software programs are comparable with the conventional human scoring method regarding detectability of the radiographic progression of RA. Thus, automatic and accurate selection of relevant images (e.g. hand images) among radiographic images of various body parts is necessary for serial analysis on a large scale.
OBJECTIVE:
In this study we examined whether deep learning can select target images from a large number of stored images retrieved from a picture archiving and communication system (PACS) including miscellaneous body parts of patients.
METHODS:
We selected 1,047 X-ray images including various body parts and divided them into two groups: 841 images for training and 206 images for testing. The training images were augmented and used to train a convolutional neural network (CNN) consisting of 4 convolution layers, 2 pooling layers and 2 fully connected layers. After training, we created software to classify the test images and examined the accuracy.
RESULTS:
The image extraction accuracy was 0.952 and 0.979 for unilateral hand and both hands, respectively. In addition, all 206 test images were perfectly classified into unilateral hand, both hands, and the others.
CONCLUSIONS:
Deep learning showed promise for efficient automatic selection of target X-ray images of RA patients.
Introduction
Rheumatoid arthritis (RA) is a systemic inflammatory disease characterized by destructive synovitis [1, 2]. Synovial inflammation promotes an immune response that causes articular cartilage degradation leading to joint space narrowing (JSN) [3, 4]. Early diagnosis and treatment of RA can avert or substantially slow progression of joint damage in up to 90% of patients, thereby preventing irreversible disability [5]. Clinically, radiographic assessment is still the most promising tool for joint damage assessment. The gold standard for assessing the radiographic progression of RA is currently the Sharp/van der Heijde scoring method (SvdH), in which radiographic images of the hands and feet are scored by subjectively assessing 38 joints of the hand and foot [6].
A number of RA computer-aided diagnosis algorithms have been reported [7–12], and several software programs [10–12] are comparable with the conventional human scoring method in detecting the radiographic progression of RA. Our goal is to create an RA diagnostic system that allows objective, quantitative and fully automated assessment of bone destruction using such software. Such a system should support prevention and early detection of disease as well as appropriate medication decisions. Furthermore, it is expected to reduce the burden of health care costs. In such a system, the accurate selection of target images (e.g. hand images) among radiographic images of various body parts, including chest, abdomen, spine, foot, knee, shoulder and hip, is of real value only if it is also done automatically, because manual collection is tedious, time-consuming, error-prone and expensive.
One method of automatic image collection is to select images in the picture archiving and communication system (PACS) using the digital imaging and communications in medicine (DICOM) tags. Classification with DICOM tags is sometimes not useful because there is a lack of uniformity in tagging among medical institutions. For example, hand images may be registered as “HAND” or “EXTREMITY”, or foot images registered as “EXTREMITY”.
Deep learning with a convolutional neural network (CNN) has recently gained attention for its high performance in image recognition. With this technique, the images themselves can be used in the learning process; feature extraction in advance is not required because important features are learned automatically [13]. The purpose of this study was to automate the extraction of unilateral hand and bilateral hand radiographs using deep learning.
Materials and methods
Basic idea
Multi-class pattern recognition is the problem of building a system that accurately maps an input feature space to an output space of more than two pattern classes. One way to implement multi-class classification is to combine simple binary classifiers, for example in a “one-vs-one” scheme [14]. For many classification algorithms, the extension from two-class to multi-class pattern classification is non-trivial and often leads to unexpected complexity or weaker performance. A popular approach to improving learning efficiency on a class-imbalanced dataset is to decompose the multi-class imbalanced problem into a series of binary classification problems [15]. In this study, as shown in Fig. 1, we created two models: Model 1 to separate unilateral hand images from all other images, and Model 2 to separate both-hands images from all other images.

Basic idea of our study.
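The decomposition into two binary models can be illustrated with a minimal sketch of how their outputs might be merged into one of three labels. The paper does not state the merging rule, so the threshold of 0.5, the tie-break and the function name `combine_binary_predictions` are assumptions made for illustration only.

```python
from typing import Literal

Label = Literal["unilateral_hand", "both_hands", "other"]


def combine_binary_predictions(p_unilateral: float, p_both: float,
                               threshold: float = 0.5) -> Label:
    """Merge two binary classifier outputs into one three-class label.

    p_unilateral: Model 1 probability that the image is a unilateral hand.
    p_both:       Model 2 probability that the image shows both hands.
    threshold:    hypothetical decision threshold (not given in the paper).
    """
    hit_unilateral = p_unilateral >= threshold
    hit_both = p_both >= threshold
    if hit_unilateral and hit_both:
        # Tie-break (assumption): trust the more confident model.
        return "unilateral_hand" if p_unilateral >= p_both else "both_hands"
    if hit_unilateral:
        return "unilateral_hand"
    if hit_both:
        return "both_hands"
    # Neither model claims the image, so it belongs to the other body parts.
    return "other"
```

With this rule, an image that neither model accepts falls into the "other" class, which is one plausible way for two binary models to yield a three-way result.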
This section describes the architecture of the feature extraction and classification system, as well as the training procedure used in this study. CNNs have achieved state-of-the-art results for the recognition of handwritten digits [16] and for the detection of faces [17]. They are deployed in commercial systems to read checks [18] and to recognize faces in video surveillance and public safety management [19].
Figure 2 shows our CNN architecture as well as detailed information about the size of each layer. As depicted in Fig. 2, our network contains 8 layers, namely 4 convolutional layers, 2 max-pooling layers and 2 fully-connected layers. The first layer, called the input layer, represents input data such as individual pixel intensities [20]. The first layer of our network is 252×252 pixels. The images are converted to 252×252 pixels before entering the convolution layer (i.e., immediately after loading the image, before it is augmented). Pixels are interpolated by bilinear interpolation.

Illustration of our CNN Architecture. The network’s input dimension is 187500 neurons and the remaining 8 layers have 2000000, 3936256, 984064, 1968128, 1905152, 476288, 512 and 2 neurons respectively. Note: CNN –Convolutional Neural Network.
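The resizing step can be sketched as follows for a single-channel image stored as a list of rows. This is an illustrative implementation, not the authors' code: it uses the "align corners" convention, in which corner pixels map to corner pixels, whereas image libraries may use a different sampling convention and give slightly different values. The function name `resize_bilinear` is hypothetical.

```python
def resize_bilinear(image, out_h, out_w):
    """Resize a 2D grayscale image (list of rows) by bilinear interpolation.

    Assumption: 'align corners' sampling, so output corners equal input corners.
    """
    in_h, in_w = len(image), len(image[0])
    # Factors that map output coordinates back to input coordinates.
    scale_y = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    scale_x = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * scale_y
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0  # vertical weight of the lower neighbor
        row = []
        for j in range(out_w):
            x = j * scale_x
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0  # horizontal weight of the right neighbor
            top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
            bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Usage: bring an arbitrary radiograph-like array to the 252x252 input size.
small = [[0, 10], [20, 30]]
resized = resize_bilinear(small, 252, 252)
```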
In our architecture, we used four convolutional layers. An input of dimension w×l×c from a hidden layer $h^{(n-1)}$ was convolved with k different kernels of dimension s×s×c, where w and l are the width and the height of the input, respectively, c is the number of feature maps in the hidden layer $h^{(n-1)}$ and s is the filter size. The number of feature maps in the hidden layer $h^{(n)}$ is therefore k. The convolution was applied to all s×s local regions of the input, also called receptive fields, and the step between neighboring regions is called the stride [21]. The kernel values in the convolutional layers were initialized with the Glorot uniform initializer [22].
As seen in Fig. 2, the first convolution layer has 32 kernels of size 3×3 with a stride of 1, which yields 250×250×32 feature maps. The second convolution layer has 64 kernels of size 3×3×32 with a stride of 1, which yields 248×248×64 feature maps. A convolutional layer is commonly followed by a nonlinear mapping applied in an activation layer. An activation layer is simply a nonlinear function applied to each pixel value. In our work we used the Rectified Linear Unit (ReLU) [23], i.e., f(x) = max(0, x). Krizhevsky et al. showed that ReLU is useful in practice, and Waseem Abbas et al. tried different activation functions and found ReLU to be the optimal activation function in their case [24, 25].
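The sliding of one kernel over its receptive fields can be sketched in a few lines. For brevity, this example handles a single input channel (c = 1) and a single kernel (k = 1) without padding; as in most deep learning libraries, the kernel is not flipped, so the operation is strictly a cross-correlation. The function name `conv2d_valid` is hypothetical.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide one s x s kernel over a 2D image without padding.

    Returns a feature map with side length (n - s) // stride + 1.
    """
    h, w = len(image), len(image[0])
    s = len(kernel)
    out = []
    for top in range(0, h - s + 1, stride):
        row = []
        for left in range(0, w - s + 1, stride):
            # Weighted sum over one s x s receptive field.
            acc = 0.0
            for di in range(s):
                for dj in range(s):
                    acc += image[top + di][left + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


# Usage: a 3x3 kernel with stride 1 shrinks each side by 2 pixels.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
feature_map = conv2d_valid(image, ones)
```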
The most common pooling operation is max pooling, which outputs the maximum value in a local neighborhood of each feature map and discards all the other values. It progressively reduces the spatial dimensions of the given feature maps, and thus decreases the number of pixels to process in the next layers of the network, while maintaining information important for the task at hand [26]. We used a kernel size of 2 and a stride of 2. In Fig. 2, the size of the feature map in the second convolution layer is reduced from 248×248×64 to 124×124×64 after the first max-pooling layer. The output of the second max-pooling layer is a set of 476,288 neurons, i.e. feature maps of size 61×61×128, reduced from the previous feature maps of size 122×122×128.
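The layer sizes quoted above can be checked with the standard output-size formula, (n + 2p − s) / stride + 1. The sketch below is a consistency check, not the authors' code. One detail is an assumption: the third convolution is given a padding of 1 and 128 kernels, inferred from the count of 1,968,128 (= 124×124×128) in the Fig. 2 caption; the text itself does not state this.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Side length after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Side length after max pooling."""
    return (size - kernel) // stride + 1


def layer_shapes(input_size=252):
    """Return (name, side, channels) for each convolution and pooling layer."""
    shapes = []
    n = conv_out(input_size)
    shapes.append(("conv1", n, 32))    # 250 x 250 x 32
    n = conv_out(n)
    shapes.append(("conv2", n, 64))    # 248 x 248 x 64
    n = pool_out(n)
    shapes.append(("pool1", n, 64))    # 124 x 124 x 64
    n = conv_out(n, padding=1)         # assumption: padding keeps 124 x 124
    shapes.append(("conv3", n, 128))
    n = conv_out(n)
    shapes.append(("conv4", n, 128))   # 122 x 122 x 128
    n = pool_out(n)
    shapes.append(("pool2", n, 128))   # 61 x 61 x 128
    return shapes


for name, side, channels in layer_shapes():
    print(name, side, channels, side * side * channels)
```

Under these assumptions, the printed neuron counts agree with the six convolution and pooling sizes listed in the Fig. 2 caption.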
The dropout technique [27] sets the outputs of hidden-layer neurons to zero with a certain probability during training. It reduces complex co-adaptations of neurons and forces them to learn more robust features.
The fully connected layers at the output produce the required class prediction. The number of parameters required to define a network depends upon the number of layers, the neurons in each layer and the connections between neurons [27]. In this study, each model distinguishes two classes (unilateral hand versus other parts in Model 1, both hands versus other parts in Model 2); therefore, the number of neurons in the output layer is two. The output of the last fully connected layer is finally converted by a softmax function to a probability for each class [28].
We prepared 1,047 X-ray images including various body parts and divided them into two groups: 841 images for training and 206 images for testing. Furthermore, 20% of the training images were used for validation. Ideally, the model should be evaluated on samples that were not used to build or fine-tune the model, so that they provide an unbiased sense of model effectiveness. The training data were used to train the CNN and the test data were used to measure the accuracy of the trained CNN [29].
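A minimal sketch of dropout on a list of activations is given below. It uses the common "inverted" variant, which rescales the surviving activations so that their expected sum is unchanged; the paper does not state which variant or which dropout rate was used, so both are assumptions here.

```python
import random


def dropout(activations, rate, rng=None, training=True):
    """Zero each activation with probability `rate` (0 <= rate < 1).

    Inverted dropout (assumption): kept values are divided by (1 - rate).
    At inference time (training=False) the input is returned unchanged.
    """
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Usage with a hypothetical rate of 0.5 and a fixed seed for repeatability.
hidden = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(hidden, rate=0.5, rng=random.Random(0))
```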
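The softmax conversion of the two output neurons into class probabilities can be written as a short sketch. Subtracting the maximum before exponentiation is a standard numerical-stability step and does not change the result; the example logits are made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw output-layer values into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: two output neurons, e.g. "hand" versus "other" (hypothetical logits).
probabilities = softmax([2.0, 0.0])
```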
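The hold-out of 20% of the training images for validation can be sketched as below. The paper does not say whether the split was random or how the count was rounded, so the shuffling, the seed and rounding down are assumptions; with 841 training images this sketch holds out 168 and keeps 673.

```python
import random


def split_train_validation(items, validation_fraction=0.2, seed=0):
    """Shuffle the items and hold out a fraction for validation.

    Assumptions: random split with a fixed seed, validation count rounded down.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Usage: image indices stand in for the 841 training radiographs.
train_ids, val_ids = split_train_validation(range(841))
print(len(train_ids), len(val_ids))
```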
The 841 training images consisted of 213 unilateral hand images, 104 bilateral hand images and 524 images of other body parts. A major challenge in the medical imaging domain is how to cope with small datasets and the limited number of annotated samples. Researchers attempt to overcome this challenge by using data augmentation. Generally, data augmentation involves cropping, rotating, translating, scaling and mirroring the labeled samples. The performance of many CNN applications is limited by the availability of data, a problem that is often addressed with data augmentation techniques [30]. For each training image, we generated “duplicated” images that were rotated, zoomed in/out, shifted, sheared, channel shifted, or flipped, as shown in Table 1. Both the original images and the duplicates were fed into the CNN. As can be seen in Table 2, in Model 1, which classifies unilateral hand images and other parts images, the number of images was increased to 2,343 and 3,140, respectively. In Model 2, which classifies bilateral hands and other parts, the data size was increased to 1,144 images and 1,572 images, respectively. Figure 3 shows example images of both training and test datasets in three classes (i.e., unilateral hand, both hands, and other parts).
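The idea of generating "duplicated" training images can be illustrated with two of the simplest transformations, a horizontal flip and a shift. This is a sketch on a 2D list of pixel values; the actual augmentation ranges are those of Table 1 and are not reproduced here, and the helper names and the one-pixel shift are hypothetical.

```python
def flip_horizontal(image):
    """Mirror each row of a 2D image."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels` (>= 0), filling vacated columns."""
    width = len(image[0])
    pixels = min(pixels, width)
    return [[fill] * pixels + row[:width - pixels] for row in image]


def augment(image):
    """Return the original image followed by its augmented duplicates."""
    return [image, flip_horizontal(image), shift_right(image, 1)]


# Usage: one labeled image yields three training samples with the same label.
sample = [[1, 2, 3], [4, 5, 6]]
augmented = augment(sample)
```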
The range of augmentation methods
Image augmentation
AUG, augmentation.

Example images of training and test datasets.
The study was conducted in compliance with the Declaration of Helsinki and approved by the local ethics committee. Informed consent was obtained in the form of opt-out on the website.
All targeted hand radiographs were taken in a posterior-anterior view by digital X-ray equipment (DR-155HS2-5; Hitachi) under the following standard conditions: X-ray aluminum filter thickness = 1.5 mm; tube voltage = 42 kV; tube current = 100 mA; exposure time = 0.02 sec; film focus distance = 100 cm. The center of the X-ray beam was at the MCP joint of the middle finger for unilateral imaging and at the center between the tips of the thumbs for bilateral imaging. Radiographs were displayed as DICOM images with 2010 by 1490 pixels at 12-bit grayscale resolution.
Results
Using the above-mentioned two models, we created software that classifies images into unilateral hand, both hands, and other images, and then verified its accuracy. Table 3 summarizes the performance of our two models for binary classification. The table shows that our CNNs distinguish unilateral hand images from other images with 95.17% accuracy and both-hands images from other images with 97.89% accuracy. Our models converged to a high accuracy after a few dozen iterations, which took about two and a half hours. Figure 4 shows examples of misclassified images of both hands (a) and a unilateral foot (b).
CNN classification accuracy rate
Note: CNN –Convolutional Neural Network.

Examples of misclassified images. The both-hands image (a) may have been misclassified as other parts because it is rotated by 90 degrees. The unilateral foot image (b) may have been misclassified because the number of unilateral foot images is small and the image resembles a unilateral hand image.
Classification results of our software are summarized in Table 4. Our software achieves an accuracy of 100% for classification into the three different types of images. From this confusion matrix, we can see that our CNN can classify each image with very high accuracy.
Confusion matrix showing the classification accuracy of our software
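For readers who wish to reproduce this kind of summary, a confusion matrix and the overall accuracy can be computed as in the sketch below. The class names and the toy labels are illustrative only and are not the study data.

```python
from collections import Counter

CLASSES = ("unilateral_hand", "both_hands", "other")


def confusion_matrix(true_labels, predicted_labels, classes=CLASSES):
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Toy example (not the study data): one "other" image is misclassified.
truth = ["unilateral_hand", "both_hands", "other", "other"]
predicted = ["unilateral_hand", "both_hands", "other", "unilateral_hand"]
matrix = confusion_matrix(truth, predicted)
print(matrix, accuracy(matrix))
```

A perfect classifier, as reported for the 206 test images, corresponds to a matrix whose off-diagonal entries are all zero.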
Discussion
Radiographic assessment of bone erosion and JSN using traditional scoring methods has been used for joint damage assessment in RA. To accelerate the assessment, several visual scoring methods have been proposed to quantify the joint damage on the radiographs of patients with RA. As found in previous literature, automated software can deal with a huge number of images in a short time. However, when collecting images, it is not easy to select only the images needed for analysis from the many past images of different body parts stored for a patient on the DICOM server. In this study, we created software that classifies unilateral hand, bilateral hands and other images, and verified its accuracy.
We used two binary classification networks to solve a multi-class imbalanced problem. We created a CNN consisting of 4 convolutional layers, 2 max-pooling layers and 2 fully-connected layers. We prepared 1,047 X-ray images including various body parts and divided them into two groups, one for training and the other for testing. Data augmentation was used to mitigate the problem of a small dataset with a limited number of annotated samples. We found that our CNNs identified unilateral hand and bilateral hand images with high accuracy. This software enables users to select the images needed for analysis in a more efficient, time-saving and inexpensive way.
To the best of our knowledge, this is the first study to use a CNN to select hand radiographs from among radiographs of many different body parts. Rajkomar et al. successfully classified chest radiographs as frontal or lateral [31]. Furthermore, Kim et al. attempted to develop and test the performance of a deep convolutional neural network for the automated classification of frontal chest radiographs into anteroposterior or posteroanterior views [32]. These works applied CNNs to preselected images, namely chest radiographs. In contrast, effective classification of hand images out of miscellaneous images in a PACS addresses a real clinical need.
This study may also contribute to establishing an automatic system to diagnose RA or detect its progression. Unique aspects of this study include the limited set of target images for selection (unilateral hand, both hands, and other parts) and a “real life” dataset collected from an active rheumatology clinic where such an automatic system is needed. By limiting the target images to be selected, we believe the performance of the software was improved even for cases with advanced finger joint deformity. This study has several limitations. (1) We restricted each model to two-class classification; other deep learning CNN models, including those relying on multi-class recognition or classification, were not tested because this was beyond the scope of this study. (2) The augmentation methods and their parameter settings were not chosen by objective criteria. (3) Our models were tested on images derived from one facility and have not been verified with images from other facilities. Although data augmentation is an important preprocessing method that has been shown to be effective in training highly discriminative deep learning models [33], further exploration of its parameters is needed and could improve classification performance. For practical operation, this software must be highly versatile. In the future, it will be necessary to verify its accuracy on images taken at other medical institutions.
In conclusion, this study indicates that deep learning makes it possible to automatically collect target X-ray images for RA image analysis. Our method can efficiently retrieve relevant images of rheumatic patients. Future work will take advantage of other functions of CNNs. For example, segmentation may help to discern lesions in abnormal hand images of RA patients, such as articular erosion, joint space narrowing and joint deformity, making further use of progress in medical image analysis.
