Abstract
BACKGROUND:
Knee osteoarthritis (KOA) is the most common type of osteoarthritis (OA). Physicians diagnose it using the standard Kellgren-Lawrence (KL) grading system, which places KOA on a five-grade scale from 0 (normal) to 4 (severe OA).
OBJECTIVES:
In this paper, we propose a transfer learning approach in which a very deep wide residual network (WRN-50-2) is fine-tuned on plain X-ray radiographs from the Osteoarthritis Initiative (OAI) dataset to learn the KL severity grading of KOA.
METHODS:
We propose a data augmentation approach for the OAI data that reduces class imbalance and overfitting by augmenting only certain KL grades, depending on their number of plain radiographs. We then evaluate the model on an independent test set of original plain radiographs from the OAI dataset.
RESULTS:
Experimental results showed good generalization in predicting the KL grade of knee X-rays, with an accuracy of 72% and a precision of 74%. Moreover, using Grad-CAM, we observed that the network selected distinctive features that explain the predicted KL grade of a knee radiograph.
CONCLUSION:
This study demonstrates that the proposed model outperforms several related works, and that it can be further improved to help radiologists make more accurate and precise diagnoses of KOA in future clinical practice.
Introduction
Osteoarthritis (OA) is an observable result of joint inflammation that causes activity limitations and physical disabilities, especially in older adults [1]. This anatomical impairment may cause irreversible damage to the joint cartilage and the surrounding bones. Knee osteoarthritis (KOA) is the most common form of knee joint inflammation, and it mainly affects the elderly, aged 60 years and older. KOA is a leading cause of disability in America, where half of people over 65 years of age have OA in at least one knee joint [2]. Moreover, it is estimated that by 2030 more than 20% of American residents will be at risk of having OA [3]. Thus, there is an urgent need to reduce this number by staging OA more efficiently, in order to develop the right treatment for every stage.
Joint pain, swelling, and stiffness are the most prominent symptoms of OA. Over time, these symptoms may worsen and prevent elderly patients from performing daily activities such as walking, climbing stairs, and bending. Eventually, this reduction in activity may lead to disability, which incurs a loss of productivity [4].
Early detection and rapid diagnosis of knee osteoarthritis are essential for providing behavioral interventions and medical treatments that can prolong a patient’s healthy life [5]. The main hallmarks of KOA are joint space narrowing (JSN), osteophyte formation, and subchondral sclerosis. Magnetic Resonance Imaging (MRI) reflects the 3D structure of the knee well. However, this imaging technique is found mainly in large health centers and hospitals, and MRI tests are quite expensive, which makes MRI unsuitable for routine KOA diagnosis. Hence, X-ray imaging is the alternative for KOA diagnosis due to its cost-efficiency, simplicity, safety, and wider availability [6]. The Kellgren and Lawrence (KL) grading system [7], which was approved by WHO (2006), is used to grade KOA severity on radiographs. The KL system divides knee OA into 5 grades: 0 (Normal), 1 (Doubtful), 2 (Possible JSN), 3 (Definitive JSN sclerosis), and 4 (Severe sclerosis). Figure 1 shows the KL grades with X-ray examples.

Knee joint samples with Kellgren-Lawrence grading. A grade of 0 indicates no evidence of osteoarthritis (red); a grade of 1 indicates the possibility of joint space narrowing (red) and osteophyte formation (green); a grade of 2 indicates definite osteophyte formation (green) and possible joint space narrowing (red); a grade of 3 indicates multiple osteophytes (green), definite joint space narrowing (red), and sclerosis (blue); a grade of 4 indicates end-stage OA, marked by severe sclerosis (blue), joint space narrowing and sometimes bone-on-bone contact (red), and large osteophytes (yellow).
Knee X-rays are usually inspected by physicians, who rely on their experience to assign KL grades. In some cases, different physicians give different grades for the same X-ray. Moreover, the same physician can give different grades for the same X-ray inspected at different times [8]. This variability limits the reliability of manual KL grading, since misdiagnosis affects grading accuracy. Therefore, relying on a physician’s assessment alone is not enough to grade KOA. In this context, deep learning and machine learning have been employed in several studies for knee osteoarthritis severity grading. Various approaches have been proposed, with researchers competing to design deep networks that achieve higher accuracy in classifying KOA KL grades. These methods are mainly based on deep neural networks with different architectures and depths. However, the low performance achieved so far by several studies suggests that new methods should be investigated. When a very deep neural network fails to learn and converge on a complex classification task, it may not be learning the distinctive features and representations that would allow it to perform well. A main reason may be the network’s depth. It is well known that going deeper does not necessarily solve a complex classification task, because of the vanishing gradient and diminishing feature reuse problems. During training, the gradient is not guaranteed to flow through the residual block weights. In that case, only a few blocks of the model learn useful information, or many blocks each learn and share very little information that may not be useful for the final classification task, i.e., the extracted features do not represent their corresponding classes.
Thus, one way to avoid this dilemma is to use wide residual learning instead of going deeper, which mitigates the vanishing gradient and diminishing feature reuse problems and allows the network to learn better representations of the images. Moreover, wide residual networks can perform significantly better than plain deep networks while having fewer layers, and are consequently faster to train. Hence, in this paper, we apply a simple transfer learning approach with the wide residual network WideResNet-50-2, denoted WRN-50-2, to the KL severity grading of knee osteoarthritis. The network is trained and tested on the Osteoarthritis Initiative (OAI) dataset [9] and shows promising results.
The rest of the paper is organized as follows. Section two reviews state-of-the-art works on KL severity grading of knee osteoarthritis. Section three, materials and methods, presents a general discussion of wide residual learning and the WRN-50-2, in addition to the dataset used. Section four presents the experimental results, section five presents the results and discussion, and section six concludes the work.
The severity of knee osteoarthritis is mainly graded by analyzing variations in the joint space width and the formation of osteophytes in the joints. Several methods have been applied to detect these changes in order to determine the correct grade of a knee OA. Antony et al. [11] employed a deep convolutional neural network (CNN) to quantify knee OA severity. The authors built a CNN from scratch and trained it on knee X-ray radiographs from the MOST dataset. As a new approach, the paper combined categorical cross-entropy and mean-squared-error loss functions with an optimized weighting ratio, for multi-class classification and mean-squared-error computation, respectively. When tested on the OAI dataset [9], the approach achieved state-of-the-art performance in grading knee OA.
In another study, Tiulpin et al. [10] proposed splitting the knee X-ray radiograph into lateral and medial sides, representing the left and right parts of the knee. The two resulting images are fed into a Siamese network consisting of two deep convolutional neural networks that share the same weights. The outputs of the two networks are then passed through a concatenation layer and a fully connected layer for classification. This approach achieved state-of-the-art results, with an accuracy of 66.7% on the test set.
A novel study for KOA grading was also proposed by Chen et al. [8]. This study utilizes a convolutional neural network with a novel ordinal loss to accurately grade the KOA severity. The paper uses a transfer learning approach to fine-tune the pre-trained models (DenseNet and Inception V3) to classify the KOA plain radiographs into 5 different grades. Upon training and testing, the authors found that the use of ordinal loss leads to better performance of the pre-trained models reaching an accuracy of 69.7% and a mean absolute error of 0.344.
Another study based on a Siamese network was proposed in 2020 by Li et al. [12] for the continuous detection and evaluation of KOA severity change. The method uses a Siamese network of two ResNet101 models, a Euclidean distance function, and a contrastive loss function. It measures the change between a pair of knee X-rays from one patient, with a binary label indicating change or no change in the disease severity category. The method showed promising results for detecting disease severity change in medical imaging.
Other deep learning frameworks have also been proposed to stage KOA using KL grades. Norman et al. [13] presented an automated algorithm based on densely connected convolutional neural networks for the staging of KOA. Their results showed that such a system can serve as a computer-aided tool that helps radiologists produce more accurate and precise KL grade diagnoses of KOA.
In this paper, we fine-tuned a pre-trained WideResNet-50-2 (WRN-50-2) model to automatically stage KOA using KL grades, and achieved strong performance in terms of accuracy, precision, and recall. We first customized the OAI dataset and applied a balancing algorithm based on data augmentation techniques to obtain a dataset with a similar number of images in all 5 grades. Most related works used ensembles and more complex conceptual frameworks to solve this classification problem, whereas our work shows that transfer learning with the right deep model, such as WideResNet-50-2, can also outperform more complex models in grading KOA. Figure 4 shows the pipeline of knee severity grading using WRN-50-2.

Illustration of the data augmentation techniques. For every grade, the original image is kept, and the augmentation techniques are applied at a certain rate to create more complex images with more varied features.

Wide residual blocks. (a) Basic-wide (b) bottleneck-wide. Conv 3×3, F/4×k indicates that this is a convolutional layer with 3×3 kernel size and F/4×k filters.

WRN-50-2 scheme. This architecture includes several stages starting from the input images which are augmented and then fed into the pre-trained model to be classified as one of the grades using a Softmax function.
Dataset and preprocessing
The knee X-ray radiographs used for evaluation were obtained from the Osteoarthritis Initiative (OAI) dataset [9]. OAI is a multi-center, longitudinal, observational study of knee OA, publicly available for research purposes. It contains a total of 4130 plain X-ray radiographs of left and right knees (age = 61.2±9.2, BMI = 28.6±4.8, male:female = 1886:2618). We used the algorithm proposed by Antony et al. [11] to detect the knees in an image and split each image into left and right knee images. In total, we obtained 8260 knee images, which were shuffled randomly. These data were then split into training and testing sets as shown in Table 1.
OAI dataset before augmentation
Table 1 shows the number of knee images per grade. The grades are imbalanced, i.e., some grades have more images than others. Hence, to avoid misdiagnosis of specific classes and overfitting, we applied a data augmentation algorithm to the training images of grades 1, 3, and 4, as they have fewer images than grades 0 and 2. The data augmentation techniques used to obtain a more robust network are shown in Fig. 2. We used mirroring, shift translation, rotation, zooming, and noise addition. These techniques were chosen after inspecting the original OAI images, which vary substantially in contrast, relative zoom, side of the knee, and position of the joint space within the image. Thus, adding augmented versions of the original training images better prepares the model to predict new knee images with combinations of these attributes that differ from those in the original dataset. Note that data augmentation is applied only to some KL grades, depending on their number of images. For example, 40% of the images of grade 1 were selected randomly and augmented, while the rest remained unaugmented. The same procedure was applied to the other two grades (3 and 4), with different proportions of images selected for augmentation depending on the number of images in the grade. This augmentation algorithm was also intended to improve the robustness of our network when tested on real images under various conditions, such as shifted, noisy, and rotated images, in addition to the original OAI testing images.
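The per-grade random selection described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the file names, the image count, and the fixed seed are assumptions, and only mirroring is shown as an example augmentation on an image stored as a nested list.

```python
import random


def select_for_augmentation(image_ids, rate, seed=0):
    """Randomly pick a fraction `rate` of one grade's images to augment.

    The remaining images of that grade stay unaugmented.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    n_selected = round(len(image_ids) * rate)
    return set(rng.sample(image_ids, n_selected))


def mirror(image):
    """Horizontal mirroring of an image given as a list of pixel rows."""
    return [row[::-1] for row in image]


# Hypothetical file names for grade 1; the 40% rate is the one quoted in the text.
grade1_images = [f"grade1_{i:04d}.png" for i in range(1000)]
selected = select_for_augmentation(grade1_images, rate=0.40)
unaugmented = [name for name in grade1_images if name not in selected]
```

The rates for grades 3 and 4 are not given in the text, so they are left out here; they would be passed through the same `rate` parameter.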
Wide residual networks [14], or WRNs for short, were proposed to solve problems of residual networks [15]. Residual networks are very deep and thin and suffer from diminishing feature reuse, which makes them very slow to train. According to circuit complexity theory, shallow circuits can require exponentially more components than deeper circuits. Hence, residual networks are designed to be as deep and thin as possible in order to have fewer parameters.
The diminishing feature reuse (or loss in information flow [16]) problem is similar to the vanishing gradient problem, but in the forward direction [17]. Input features or feature maps computed by previous layers are washed out by repeated convolutions with randomly initialized weights. Hence, it becomes hard for the following layers to identify and learn meaningful gradient directions. Residual networks attempt to solve this problem by using identity mappings between layers, which allow the network to pass features from previous layers to the following layers without obstruction. However, gradients are not guaranteed to flow through the residual block weights during training, so some blocks may avoid learning. It is noted that, most probably, only a few blocks learn useful representations, or many blocks share very little information and contribute little to the final goal [14].
WRNs differ from residual networks in one main respect: they are wider, i.e., they have more filters. Figure 3 shows the basic and bottleneck blocks of WRNs, with k representing the widening factor. Each block contains multiple convolutional layers. For example, the bottleneck-wide block is composed of 3 convolutional layers. Conv 3×3, F/4×k indicates a convolutional layer with a 3×3 kernel and F/4×k filters. If k = 1, the block is a residual block; otherwise, it is a wide residual block.
To create a WRN, several blocks are cascaded. The notation WRN-d-k indicates a WRN with depth d and widening factor k. In this paper, WRN-50-2-bottleneck is used, and its configuration is shown in Table 3. WRN-50-2-bottleneck outperformed ResNet-152 in accuracy on the ImageNet dataset while having 3× fewer layers and being faster.
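As a small worked example of the F/4×k notation, the sketch below computes the number of filters of the 3×3 convolution in a bottleneck block. The function name is illustrative, and F is assumed to be the number of output filters of the block.

```python
def inner_width(F, k):
    """Filters of the 3x3 convolution in a bottleneck block: F/4 x k.

    F: number of output filters of the block (assumed meaning of F).
    k: widening factor (k = 1 gives the ordinary residual bottleneck).
    """
    return (F // 4) * k


# Last group of a 50-layer network, where the block outputs 2048 filters.
resnet50_width = inner_width(2048, k=1)   # ordinary bottleneck
wrn50_2_width = inner_width(2048, k=2)    # wide bottleneck with k = 2
```

With F = 2048 this gives 512 filters for k = 1 and 1024 for k = 2, which matches the 2048-512-2048 versus 2048-1024-2048 channel layout that the torchvision documentation gives for ResNet-50 and Wide ResNet-50-2.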
Transfer learning of WRN-50-2
In this paper, we adopted the original pre-trained deep neural network WideResNet-50-2 (WRN-50-2), trained on ImageNet. Transfer learning is applied by restructuring the pre-trained network for the purpose of grading the severity of knee osteoarthritis from plain X-ray radiographs. Transfer learning is a popular approach that speeds up training, since the network has already learnt patterns and parameters from a different classification task.
The transfer learning scheme used for grading KOA is shown in Fig. 4. We use a simple approach to adapt the pre-trained WRN-50-2 model to our new KOA grading task: the classification part of the pre-trained model (i.e., the part consisting of the fully connected layers) is removed and replaced with a new classifier of three layers, the last of which has 5 output nodes representing the five KL grades: 0 (Normal), 1 (Doubtful), 2 (Possible JSN), 3 (Definitive JSN sclerosis), and 4 (Severe sclerosis). The classifier receives the activations from the feature extraction part of the network; these activations are flattened and fed into two fully connected (FC) layers that are added to the model. The first FC layer consists of 100 nodes, while the second consists of 5, representing the 5 classes. Finally, the five activations of the second FC layer are fed into a Softmax layer that produces the probability of each class. The class with the highest probability is selected as the final predicted class.
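The final step of this scheme, turning the five activations of the last FC layer into class probabilities and a predicted grade, can be illustrated without any deep learning framework. The sketch below is only an illustration; the activation values are made up, and the function names are not taken from the authors' code.

```python
import math

KL_GRADES = [
    "0 (Normal)",
    "1 (Doubtful)",
    "2 (Possible JSN)",
    "3 (Definitive JSN sclerosis)",
    "4 (Severe sclerosis)",
]


def softmax(activations):
    """Convert raw activations into probabilities that sum to 1."""
    shift = max(activations)  # subtract the maximum for numerical stability
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def predict_grade(activations):
    """Return (index of the most probable grade, list of probabilities)."""
    probs = softmax(activations)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Made-up activations of the 5-node FC layer for one radiograph.
grade, probs = predict_grade([0.1, 0.2, 3.0, 0.5, -1.0])
label = KL_GRADES[grade]
```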

Performance analysis. The curve on the left is the variation of Loss (error) with respect to number of epochs used in training the model. The one on the right is the confusion matrix of the model during the testing stage.
We used the following evaluation metrics to assess the performance of our fine-tuned model:
N: the number of correctly classified testing images,
T: the total number of testing images.
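The accuracy formula itself does not appear in the text; with N and T as defined above, it is presumably the usual ratio of correctly classified images to all testing images:

```latex
\mathrm{Accuracy} = \frac{N}{T}
```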
For a better evaluation and a fair comparison with the literature, we also used Precision, Recall, and F1-Score, which give a fuller picture of the model’s performance in classifying the KL grade of KOA. These metrics are computed from the following quantities:
TP: True positive; it indicates the number of correctly predicted positive grades with respect to their true labels.
TN: True negative; it indicates the number of correctly predicted negative grades with respect to their true labels.
FP: False positive; it indicates the number of incorrectly predicted positive grades.
FN: False negative; it indicates the number of incorrectly predicted negative grades.
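The sketch below shows how these four counts, and the standard Precision, Recall, and F1-Score built from them, can be derived per grade from a confusion matrix in a one-vs-rest fashion. It assumes the usual textbook definitions, since the formulas are not given in the text, and the 3-class matrix is a toy example rather than the paper's results.

```python
def per_class_metrics(cm):
    """One-vs-rest metrics; cm[i][j] counts images of true class i predicted as j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"tp": tp, "tn": tn, "fp": fp, "fn": fn,
                        "precision": precision, "recall": recall, "f1": f1})
    return results


# Toy 3-class confusion matrix (illustrative numbers only).
toy_cm = [[8, 2, 0],
          [1, 7, 2],
          [0, 1, 9]]
metrics = per_class_metrics(toy_cm)
accuracy = sum(toy_cm[i][i] for i in range(3)) / sum(map(sum, toy_cm))
```

How the per-grade values are averaged into a single reported number (macro or weighted) is not stated in the text, so no averaging is shown.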
The employed WRN-50-2 model was fine-tuned on the OAI dataset. We split this dataset into training and test sets and then augmented the training set, taking into account the number of images of every grade (Table 2). This augmentation of the training set was meant to create more complex images and improve the robustness of the KOA grading model when tested in real-life conditions and under different circumstances. The augmented images are first resized to 224×224×3 and normalized to the 0–1 range to reduce the computational cost. The model was trained on 10999 of the available X-ray radiographs and tested on 2796 images.
OAI dataset after augmentation
Structure of WRN-50-k. k indicates the network width; in the original architecture (i.e., ResNet-50 [18]), k = 1. Groups of convolutions are shown in brackets; down-sampling is performed using a 2×2 stride by the first convolutional layer in groups conv3, conv4, and conv5.
The PyTorch framework was used to load and fine-tune the employed network. Training and testing were carried out on a computer equipped with a GeForce GTX 1640Ti Graphical Processing Unit (GPU). The model was trained with a cross-entropy cost function and Stochastic Gradient Descent (SGD), using a small learning rate of 0.0001 and a batch size of 64. We trained the model for 30 epochs, after testing several values and finding that the model overfitted when the number of epochs exceeded 30.
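To make the scale of this training schedule concrete, the sketch below collects the quoted hyperparameters in a small configuration object and derives the number of SGD updates. The class and field names are illustrative, and the calculation assumes that the last, smaller batch of each epoch is kept rather than dropped.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    num_train_images: int = 10999   # training set size quoted in the text
    batch_size: int = 64
    learning_rate: float = 1e-4
    epochs: int = 30

    @property
    def steps_per_epoch(self):
        # Assumption: the final partial batch is used, hence the ceiling.
        return math.ceil(self.num_train_images / self.batch_size)

    @property
    def total_steps(self):
        return self.steps_per_epoch * self.epochs


cfg = TrainConfig()
last_batch_size = cfg.num_train_images - (cfg.steps_per_epoch - 1) * cfg.batch_size
```

Under these assumptions each epoch has 172 updates (the last on 55 images), for 5160 updates over 30 epochs.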
Figure 5 shows the training curve of the model (left) and the testing confusion matrix (right). The model performed differently for every grade, mainly because of the number of images and the complexity of each grade. For instance, for Grade 1, the rate of correctly classified images (TP and TN) achieved by the model is 57%, while it is 92% for Grade 4.
A full breakdown of the testing performance metrics, namely Precision, Recall, F1-Score, and Accuracy, is shown in Table 4. The model showed good generalization, with an accuracy of 72% when tested on X-ray radiographs that include images under different conditions inspired by the complexities found in plain knee X-ray radiographs. The model reached its best performance on Grades 0 and 4 in terms of Precision, Recall, and F1-Score. This indicates that the model has more difficulty in correctly diagnosing the mid-range KL grades (1, 2, and 3).
Testing performance results
To fully assess the performance of our model, we also calculated the mean area under the curve (AUC) for images without KOA evidence (Grade 0) versus all other grades (Grades 1–4), which are considered abnormal. The results are shown in the ROC curve (Fig. 6) of the model when tested on 2796 X-ray radiographs. We considered this binary classification of Grade 0 versus the other grades to evaluate the potential of our model in distinguishing healthy knees from those which may have implications. As seen, the model achieves a mean AUC of 92% on the testing set consisting of the original OAI X-ray images.
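One way to obtain such a binary AUC from a five-class model is sketched below. It is an assumption about the procedure, not the authors' code: the "abnormal" score is taken as one minus the predicted probability of Grade 0, and the AUC is computed with the rank-based definition (the probability that a random abnormal knee scores higher than a random normal one, with ties counted as one half).

```python
def abnormal_score(probs):
    """Score for 'any KOA' from 5 class probabilities: 1 - P(Grade 0)."""
    return 1.0 - probs[0]


def binary_auc(labels, scores):
    """Rank-based AUC; labels are 1 for abnormal (Grades 1-4), 0 for Grade 0."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two normal and two abnormal knees with made-up scores.
toy_auc = binary_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of images, which is acceptable for a test set of a few thousand radiographs.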

Receiver Operating Characteristic (ROC) curve demonstrating the performance of the model in detecting the presence of radiographic KOA at all grades.
For a more insightful interpretation of the model performance, we also visualize the activation maps that show the areas the model focused on to make its grading decision for every image (Fig. 7). To compute and visualize these activations, we used gradient-weighted class activation mapping (Grad-CAM), which shows the regions associated with a predicted class using heatmaps, in which a jet colormap shows the highest-activation regions as deep red and the lowest-activation ones as deep blue.
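For reference, the standard Grad-CAM definition is summarized below. Here y^c is the score of class c before the Softmax, A^k is the k-th feature map of the chosen convolutional layer, and Z is the number of spatial positions in that map. Which layer of WRN-50-2 was used is not stated in the text, so the choice of layer is left open.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}},
\qquad
L^{c}_{\text{Grad-CAM}} = \mathrm{ReLU}\!\left( \sum_{k} \alpha_k^{c} A^{k} \right)
```

The resulting map is upsampled to the input resolution and overlaid on the radiograph as a heatmap.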

Localizations obtained with the Grad-CAM technique based on the WRN-50-2 model on testing KOA X-ray images. The first row shows the original knee images, while the second row shows the corresponding overlaid class activation maps.
Results comparison
Predicting disease progression is a critical task, as it helps patients identify their disease early and find suitable treatment at the right time [1, 5]. This study aims to present a simple yet efficient pre-trained model, fine-tuned to outperform some more complex architectures proposed for KL grading of knee osteoarthritis [6, 13] from plain X-ray radiographs. In this work, we showed that a WRN-50-2 architecture yielded an overall AUC of 92%, Accuracy of 72%, Precision of 74%, Recall of 73%, and F1-Score of 36% (Table 4) for KL grading of knee X-ray radiographs. This performance is promising, since the model achieved results in line with other related studies, as shown in Table 5, where we compare our model’s accuracy to that of related models tested on the OAI dataset [9]. We selected accuracy as the comparison metric because it was reported by most studies.
Comparison with other related works
Results in Table 5 show that our fine-tuned model’s accuracy is in line with other state-of-the-art studies in grading knee osteoarthritis X-ray images. Despite the simplicity of the proposed transfer learning approach, it shows that, with a good dataset and a complex augmentation technique, very deep networks can be used successfully in diagnostic radiology research. In short, our work makes the following contributions:
We outperformed several state-of-the-art techniques for automatic KOA severity grading from plain X-ray radiographs, achieving higher accuracy [6, 13] on the OAI dataset.
We showed that choosing the appropriate pre-trained model through transfer learning can be better than resorting to very complex architectures [10, 13] when grading KOA from X-rays.
We presented an activation map using Grad-CAM, which can be an additional tool for diagnosing KOA.
We released our code and trained model publicly for reproducibility.
Regular deep residual networks (ResNets) [15] have been shown to outperform other deep networks (AlexNet [20], GoogleNet [21], DenseNet [22], VGGs [23], etc.). However, this was achieved at the cost of increasing the number of layers, which reduces feature reuse and increases training and computation time in general. Wide residual networks (WRNs) were then proposed to solve these problems by widening the residual blocks instead of adding more layers. Experimentally, WRNs have shown better image classification performance in terms of training time, and even accuracy, on datasets such as CIFAR [24], SVHN [25], and COCO [26].
In our work, we also wanted to understand whether our wide residual model (WRN-50-2) can outperform other deep networks such as AlexNet, DenseNet201, and ResNet50. Thus, we trained these three models on the same dataset (OAI). The testing results in Table 6 show that the wide residual learning model outperforms all the other deep networks reported.
Comparison with other deep networks
While our pre-trained WideResNet-50-2 model shows very promising generalization capability on the OAI and augmented datasets compared to other KL grading schemes, we believe that there is still room for improvement in terms of correct KL grading (Accuracy) and visualization tools (Grad-CAM). The achieved accuracy is 72% over the whole dataset, which is most likely due to the augmentation techniques we used; these let the model learn complex, real-life knee osteoarthritis features that helped boost its performance. On the one hand, this can still be improved by using more X-ray images that show more complex examples of every KL grade, varying in contrast, relative zoom, side of the knee, and position of the joint space within the image. On the other hand, building a better and more robust model architecture can also improve the KOA KL grading results. An ensemble of two or more WideResNet-50-2 models may also be a good option for this problem as a future avenue.
Code Availability
Source code will be available at https://github.com/abdulkader902017/KneeOsteo.WRN-50-2.
CRediT authorship contribution statement
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
