Abstract
Knee osteoarthritis presents a significant health challenge for many adults globally. At present, there are no pharmacological treatments that can cure this condition. The primary method for managing the progression of knee osteoarthritis is early identification. Currently, X-ray imaging serves as a key modality for predicting the onset of osteoarthritis. Nevertheless, the traditional manual interpretation of X-rays is susceptible to inaccuracies, largely due to the varying levels of expertise among radiologists. In this paper, we propose a multimodal model based on pre-trained vision and language models for Kellgren-Lawrence (KL) grading of knee osteoarthritis severity. Using a Vision Transformer (ViT) and Bidirectional Encoder Representations from Transformers (BERT) to extract image and text embeddings helps the Transformer encoders extract more distinctive hidden states, which facilitates the learning process of the neural network classifier. The multimodal model was trained and tested on the OAI dataset, and the results compared favorably with related work. On the test set of X-ray images, the model achieved an overall accuracy of 82.85%, a precision of 84.54%, and a recall of 82.89%.
Introduction
Osteoarthritis (OA) of the knee is a form of arthritis characterized by joint deformities that lead to pain and functional impairment. 1 Epidemiological research indicates that OA is a significant contributor to disability on a global scale. 2 Beyond pharmacological treatments and physiotherapy, total knee joint replacement surgery is often the sole option available to enhance the quality of life for patients with promising prognoses. 3 Key characteristics of osteoarthritis include the narrowing of the joint space and the presence of bony spurs.4–6
Typically, healthcare professionals assess knee joint OA through plain radiographs, which are instrumental in evaluating the progression of osteophytes, the reduction in knee joint space width, subchondral geodes, and subchondral bone sclerosis.5,7 The Kellgren-Lawrence (KL) grading system serves as the standard for evaluating OA based on these criteria. X-ray imaging remains crucial for diagnosing and monitoring health, with experts increasingly utilizing digital X-ray images for OA assessment. 8
Several studies have been conducted to identify and quantify joint space width (JSW), yet there exists considerable variability in their findings.3,6,9–15 Machine learning (ML) based computer-aided methodologies hold the potential to address diagnostic challenges by facilitating the automatic assessment of knee OA severity.10–15 Furthermore, the traditional manual diagnostic process for OA is often slow and labor-intensive, highlighting the necessity for computer-assisted solutions that help clinicians diagnose knee OA severity more reliably and automatically.
Multimodal learning has emerged as a powerful paradigm in healthcare, leveraging information from multiple modalities (e.g., images, text, audio) to improve diagnostic accuracy and clinical decision-making. Several recent studies have explored the potential of multimodal models in various healthcare applications. 13 One notable study proposed a stochastic multimodal Transformer for post-traumatic stress disorder (PTSD) detection using video data. By combining information from audio, visual, and textual modalities, their model achieved state-of-the-art results in PTSD classification.
In this paper, we developed a Transformer based multimodal approach for the knee osteoarthritis severity KL grading from x-ray radiographs. This model framework integrates cutting-edge techniques such as Vision Transformer (ViT) 16 and Bidirectional Encoder Representations from Transformers (BERT) 17 to extract image and text embeddings from x-ray images and their corresponding text descriptions. The essence of this framework lies in its utilization of two Transformer encoders, each trained separately with different input modalities, to derive modality-specific representations or feature vectors for each input modality. This approach enables the model to discern the unique characteristics of each modality without being influenced by the other, ultimately enhancing performance on downstream tasks that necessitate comprehensive understanding of both modalities.
After the training of these Transformer encoders is completed, they are integrated to create a unified multimodal representation. This integration is accomplished by concatenating the hidden states from both encoders, which are then utilized as inputs for a separate multilayer perceptron (MLP). The MLP is designed to learn to grade knee osteoarthritis through the application of stochastic gradient descent. 18
The contributions of this work are as follows:
This paper proposes a multimodal approach that leverages vision and language models to facilitate the learning process of a neural network in assessing knee OA severity by KL grade.
This paper provides a simpler inference framework that can use images only to assess knee OA KL grades from plain radiographs.
The trained multimodal model achieved remarkable performance on the test set when compared to the state-of-the-art architectures.10–15,18–23
This paper also investigates a data augmentation technique to prevent the negative effects of data imbalance while training the model.
Additionally, this paper presents a Grad-CAM based model interpretation to help assess and interpret the model's knee OA grading.
Figure 1

The Kellgren-Lawrence (KL) grading system for knee osteoarthritis classifies the severity of the condition based on two fundamental criteria: the degree of narrowing of the joint space and the level of osteophyte development.
Review
Schmidt et al. 7 identified that the measurement of the joint space between the tibia and femur can be achieved by analyzing the edge pixels of the two bones. The assessment of knee osteoarthritis (OA) encompasses two essential processes: (1) determining the width of the knee joint space and (2) evaluating the identified knee joint space area (7, 9). In a related study, Shamir et al. 9 introduced a template matching technique for quantifying knee joints; however, the overall accuracy in classifying knee OA was reported to be inadequate. Furthermore, Thomson et al. 10 performed an investigation into the width of the knee joint space, utilizing bone anatomy and texture analysis to identify the presence of osteophytes and radiographic OA, particularly focusing on Kellgren-Lawrence grade 2. A comprehensive evaluation of knee OA according to the Kellgren-Lawrence grading system is detailed in Table 1.
Knee osteoarthritis x-ray images grading based on deep learning models.
Mengko et al. 11 and Subramoniam et al. 12 manually identified the area of knee joint space width and extracted the joint space area using semi-automated CAD algorithms. In order to address the limitations, CAD tools must seek out features that are more effective. Their study focuses on assessing the severity of osteoarthritis in patients based on the severity predicted by medical professionals. Juefel et al. 13 employed deep learning methodologies across multiple domains, including pattern recognition, natural language processing (NLP), and signal analysis. Their research highlights the effectiveness of convolutional neural networks (CNNs) in biomedical image processing, demonstrating their capacity to learn functions proficiently in a wide range of applications.
Anthony et al. 14 employed the CaffeNet and VGG16 architectures for the classification of knee osteoarthritis grade. Lin et al. 15 implemented focal loss to enhance the accuracy of object detection. Tiulpin et al. 18 highlighted the significance of radiological characteristics by employing attention maps to support practitioners in their decision-making processes. Suresha et al. 19 utilized a model pre-trained on ImageNet and refined the region proposal and classification networks to optimize classification accuracy.
Bertalan et al. 20 utilized Entropy-based bone texture and fractal oriental texture algorithms to analyze the width of the knee joint space. Their findings suggest that fractal and texture analysis of bones can predict the early stages of osteoarthritis.
Tiulpin et al. 21 utilized a Siamese-based model to assess the similarity between two knee X-ray images with identified changes. The approach involves the selection of random seeds to extract the sub-regions of the left and right knee joints, followed by concatenating the predictions from these chosen regions. Through experimentation, this technique demonstrated encouraging outcomes in the severity grading of knee osteoarthritis using the Kellgren-Lawrence (KL) scale. On the other hand, Helwan et al. 22 employed a wide residual network for the grading of knee osteoarthritis, which outperformed other deep networks. Alshareef et al. 23 explored the potential of the Vision Transformer (ViT), in which images are split into patches before being fed to the model, helping it extract features from different parts of the X-ray radiographs.
Materials and methods
Dataset
The knee X-ray images utilized in this research were obtained from the Osteoarthritis Initiative (OAI) dataset as provided by Chen et al. 26 The OAI dataset, which originates from a decade-long observational study focused on the severity of knee osteoarthritis (OA), comprises Kellgren-Lawrence (KL) grades assessed by three radiologists specializing in musculoskeletal conditions. This dataset features a total of 8260 posterior-anterior (PA) fixed flexion X-ray images of the left and right knees, sourced from 4796 participants aged 45 to 79, encompassing both male and female patients. The original dataset is divided into a training set of 5778 images, a validation set of 826 images, and a test set of 1656 images, maintaining a ratio of 7:1:2. Notably, this dataset demonstrates a class imbalance across various KL grades. To tackle this issue, the research utilized a stratified fivefold cross-validation method on a collection of 6604 images, integrating the training and validation sets. Additionally, a training to validation data ratio of 4:1 was preserved for each KL grade, while the testing dataset was kept completely distinct from the training phase. This methodology guarantees a uniform label distribution within both the training and validation datasets, thereby effectively addressing concerns associated with class imbalance.
Table 2 presents a summary of the composition of the dataset utilized in this study.
Total number of images in the dataset and their train-test split.
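The stratified fivefold split described above can be illustrated with a minimal, standard-library sketch. The function name `stratified_kfold`, the round-robin assignment, and the toy label counts are assumptions made for illustration; they are not the authors' code or the real OAI grade distribution.

```python
import random
from collections import Counter, defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose folds keep per-grade proportions."""
    rng = random.Random(seed)
    by_grade = defaultdict(list)
    for idx, grade in enumerate(labels):
        by_grade[grade].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_grade.values():
        rng.shuffle(indices)
        # Deal each grade's indices round-robin so every fold gets about 1/k of them.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, val_idx


# Hypothetical, imbalanced toy labels (not the real OAI grade counts).
toy_labels = [0] * 50 + [1] * 25 + [2] * 40 + [3] * 20 + [4] * 10
splits = list(stratified_kfold(toy_labels, k=5))

for fold, (train_idx, val_idx) in enumerate(splits):
    val_counts = Counter(toy_labels[i] for i in val_idx)
    print(fold, len(train_idx), len(val_idx), dict(sorted(val_counts.items())))
```

With k = 5, each fold holds out one fifth of every grade, which gives the 4:1 training-to-validation ratio per KL grade mentioned in the text.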
Data augmentation
As illustrated in Table 2, there is a discrepancy in the number of knee images across different grades. Specifically, some grades have a higher number of images compared to others. In order to prevent misdiagnosis and overfitting in specific classes, we implemented a data augmentation algorithm to increase the number of training images for grades 1, 3, and 4, which had fewer images than grades 0 and 2. The data augmentation techniques utilized to enhance the robustness of the network are depicted in Figure 2. These techniques include mirroring, shift translation, rotation, zooming, and noise addition. The selection of these techniques was influenced by the observation of significant variations in the OAI original images in terms of contrast, zoom level, knee orientation, and joint space position within the image. 22 By generating augmented versions of the original training images, the model is better equipped to predict new knee images with diverse combinations of these attributes that are not present in the original dataset. Additionally, this augmentation algorithm was employed to improve the resilience of our network when evaluated on real images exhibiting various conditions, including shifts, noise, and rotations, alongside the original OAI testing images. Figure 3 shows the number of training images per grade.

Data augmentation of knee image of grade 3.

Number of images (train set) before image augmentation. The difference in the number of images per grade shows the imbalance of the dataset.
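The five augmentation operations named above (mirroring, shift translation, rotation, zooming, and noise addition) can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions: the image is treated as a single-channel array scaled to [0, 1], interpolation is nearest-neighbour, and all function names and parameter values are hypothetical rather than the settings used in the study.

```python
import numpy as np


def mirror(img):
    """Horizontal (left-right) flip."""
    return img[:, ::-1]


def shift(img, dy, dx):
    """Translate by (dy, dx) pixels, filling exposed borders with zeros."""
    h, w = img.shape
    out = np.zeros_like(img)
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0], max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


def rotate(img, degrees):
    """Rotate about the image centre using nearest-neighbour inverse mapping."""
    h, w = img.shape
    theta = np.deg2rad(degrees)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # For every output pixel, locate the source pixel it is sampled from.
    src_y = np.rint(np.cos(theta) * (ys - cy) + np.sin(theta) * (xs - cx) + cy).astype(int)
    src_x = np.rint(-np.sin(theta) * (ys - cy) + np.cos(theta) * (xs - cx) + cx).astype(int)
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out


def zoom(img, factor):
    """Zoom in (factor >= 1) by centre-cropping and resizing back to the original size."""
    h, w = img.shape
    ch, cw = int(round(h / factor)), int(round(w / factor))
    top, left = (h - ch) // 2, (w - cw) // 2
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(h) * ch // h
    cols = np.arange(w) * cw // w
    return crop[np.ix_(rows, cols)]


def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)


rng = np.random.default_rng(0)
image = rng.random((64, 64))  # stand-in for a normalised knee radiograph
augmented = [
    mirror(image),
    shift(image, 3, -2),
    rotate(image, 10),
    zoom(image, 1.2),
    add_noise(image, 0.05, rng),
]
```

Applying such transforms only to the under-represented grades is one way to even out the per-grade training counts.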
The proposed multimodal model architecture
This study utilizes a multimodal framework to construct a model capable of identifying knee osteoarthritis KL grade within a knee radiograph (Figure 4). The presented model processes raw input RGB images and textual data, extracting multimodal representations that are sufficiently comprehensive to support various downstream applications. The model is designed to operate within a joint embedding space that encapsulates the input images, facilitating the learning of interrelations between distinct text and image modalities. This joint embedding space is established by integrating the hidden states or feature vectors of both image and text, which are acquired through Transformer encoders. 27

The framework architecture for knee osteoarthritis grading. The framework consists of image and text modalities. The image modality uses ViT to extract the image embedding, while the text modality uses BERT to extract the corresponding text embedding. The image and text embeddings are then fed into two Transformer encoders, which learn to extract the hidden states of each modality. Finally, the image and text features are concatenated and fed into a three-layer neural network (MLP) to be classified.
The image embeddings are derived from the knee X-ray images using a pre-trained Vision Transformer (ViT), whereas the text embeddings are derived from the raw text using a language Transformer model, specifically BERT. The Vision Transformer is constructed based on the Transformer architecture, 16 a neural network type that is highly effective in processing sequential data. The text embeddings are directly extracted using a pre-trained version of the BERT model. 17 Conversely, a pre-trained version of ViT was fine-tuned for the extraction of image embeddings. Subsequently, the Transformer encoders were trained separately with a cross-entropy loss function to extract the hidden states from the text and image embeddings.
Figure 4 illustrates the model's use of two distinct Transformer encoders, each dedicated to a specific modality: image and text. Each encoder independently processes its designated input data, thereby extracting representations that are unique to its modality. Subsequently, these representations are integrated across modalities within a semantically hierarchical common space, resulting in a cohesive representation that encapsulates the interrelations between the two modalities.
The image encoder, ViT, processes raw knee images to derive a series of feature vectors from each individual image. These feature vectors are subsequently input into the Transformer encoder, which utilizes self-attention mechanisms and positional encoding to develop a hierarchical representation of the image sequence. In parallel, the text encoder, BERT, receives a sequence of words or subwords and transforms this input into a series of token embeddings. These embeddings are then directed into a second Transformer encoder, which constructs a representation of the text that effectively captures the semantic and syntactic interrelations among the words.
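The fusion step, in which the hidden states of the two encoders are concatenated and passed to a three-layer MLP, can be sketched as a plain NumPy forward pass. The sketch makes several assumptions that the text does not specify: both hidden states are taken to be 768-dimensional vectors, the hidden widths (256 and 64) and ReLU activations are hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def init_mlp(sizes, rng):
    """Random (weight, bias) pairs for consecutive layer sizes; illustration only."""
    return [(rng.normal(0.0, 0.02, (m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]


def fuse_and_classify(image_hidden, text_hidden, params):
    """Concatenate modality features, then apply the MLP to get KL-grade probabilities."""
    h = np.concatenate([image_hidden, text_hidden], axis=-1)
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return softmax(h @ w + b)


rng = np.random.default_rng(0)
batch, dim = 4, 768                                  # assumed hidden-state size
image_hidden = rng.normal(size=(batch, dim))         # placeholder for image-encoder output
text_hidden = rng.normal(size=(batch, dim))          # placeholder for text-encoder output
params = init_mlp([2 * dim, 256, 64, 5], rng)        # three weight layers, five KL grades
probs = fuse_and_classify(image_hidden, text_hidden, params)
predicted_grade = probs.argmax(axis=-1)
```

Concatenation keeps the two modality-specific representations intact and leaves it to the MLP to learn how they interact.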
Implementation details
The proposed architecture was implemented using the TensorFlow framework. All experimental procedures were conducted on a GeForce GTX 1640Ti graphics processing unit (GPU). This multimodal approach to extracting rich spatial features from knee plain radiographs is based solely on two transformer embedding models. The image encoder is constructed using the ViT-B/16 architecture, which has been pre-trained on ImageNet-1K (IN1k), 28 processing input images of size 248 × 248 pixels and producing output representations of dimension D_s = 768. The second transformer encoder is derived from a pre-trained BERT model. The ViT was fine-tuned on the knee osteoarthritis dataset using Stochastic Gradient Descent (SGD) over 10 epochs, incorporating a 5-epoch warmup 29 and cosine-annealed learning rate decay. Once trained, the fully connected layers were removed, and the image embeddings were extracted.
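A learning-rate schedule with a 5-epoch warmup followed by cosine annealing over a 10-epoch run can be sketched as below. The linear shape of the warmup, the decay towards zero, and the base rate of 0.001 are assumptions made for illustration; the text does not state the base rate used for ViT fine-tuning.

```python
import math


def lr_at_epoch(epoch, base_lr=0.001, warmup_epochs=5, total_epochs=10):
    """Linear warmup to base_lr, then cosine annealing towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_epoch(epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

The warmup avoids large early updates to the pre-trained weights, and the cosine phase lowers the step size smoothly as training converges.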
Following this, the text and image embeddings were fed into the transformer encoders, which were trained independently for 50 epochs using SGD, with a weight decay of 1 × 10⁻⁵, a learning rate of 0.001, and a 5-epoch warmup. During the training phase, the transformer encoders were provided with 6604 image feature vectors and their corresponding text embeddings, extracted from the ViT and BERT respectively, while the remaining 1656 image and text embeddings were reserved for testing. Each transformer encoder is structured with 12 layers.
Similarly, the MLP was trained on the concatenated feature vectors from the transformer encoders. This three-layer neural network was trained using SGD for 10 epochs, with a batch size of 32 and a learning rate of 0.001. The hyperparameters were selected according to their effect on the MLP's performance on a held-out validation set. Various configurations were evaluated, and the one yielding the highest performance on the selected evaluation metric is presented in this study.
Results and discussion
Text and image embeddings extraction
The ViT 16 and BERT 17 models were first used to extract image and text embeddings. As previously mentioned, the ViT was fine-tuned on a part of the data to extract embeddings. We chose a pre-trained Vision Transformer (ViT) model, specifically ViT-B/16, which has been trained on a large dataset (ImageNet). This pre-trained model serves as a strong starting point for fine-tuning. The pre-trained ViT model was fine-tuned on the knee osteoarthritis (OA) X-ray dataset. This process involved optionally freezing the initial layers of the ViT to retain the pre-learned features and only fine-tuning the later layers, which are more task-specific. A classification head was added to the ViT model, consisting of a fully connected layer followed by a softmax activation function, to predict the Kellgren-Lawrence (KL) grade of the knee X-ray images.
The model was trained using the training set, and its performance was validated on the validation set. Metrics such as accuracy, precision, recall, and F1-score were monitored to evaluate the model's performance. Regularization techniques, including dropout and L2 regularization, were applied to prevent overfitting, and early stopping was implemented to halt training when the validation loss stopped improving.
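To extract image embeddings, the ViT model first divides the input image into non-overlapping patches, each of size P × P pixels. Each patch is then flattened, effectively transforming the image into a sequence of N patch vectors:
$$x \in \mathbb{R}^{H \times W \times C} \;\longrightarrow\; \left[x_p^1, x_p^2, \ldots, x_p^N\right], \qquad x_p^i \in \mathbb{R}^{P^2 C}, \quad N = \frac{HW}{P^2}$$
A learnable linear projection E is applied to each patch to obtain a token embedding:
$$z^i = x_p^i E, \qquad E \in \mathbb{R}^{(P^2 C) \times D}$$
The self-attention operation can be formulated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
where Q, K, and V are the query, key, and value matrices obtained from the token sequence through learned linear projections, and d_k is the dimension of the keys.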
Once the model was fine-tuned, it was used to extract image embeddings from the knee X-ray images by passing them through the ViT model and extracting the output from the penultimate layer (before the classification head).
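The patching, linear projection, and self-attention steps of the ViT can be sketched in NumPy as follows. This is a single-head toy illustration under stated assumptions: the 32 × 32 image, the embedding width of 8, and the random weights are hypothetical, and the class token, position embeddings, multi-head attention, and layer normalisation of a full ViT are omitted.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into N = HW / patch^2 flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)


def self_attention(z, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a token sequence z of shape (N, D)."""
    q, k, v = z @ w_q, z @ w_k, z @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))           # toy RGB image
patch, dim = 16, 8                        # 16 x 16 patches, toy embedding width
patches = patchify(image, patch)          # shape (4, 768): 16 * 16 * 3 values per patch
E = rng.normal(0.0, 0.02, (patches.shape[1], dim))
tokens = patches @ E                      # one D-dimensional token per patch
w_q, w_k, w_v = (rng.normal(0.0, 0.1, (dim, dim)) for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)
```

Each row of `weights` shows how strongly one patch attends to every other patch, which is what lets the encoder relate distant regions of the radiograph.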
On the other hand, to generate word embeddings via BERT, it was necessary to tokenize the input text into distinct words or subwords through the BERT tokenizer as shown in Figure 5. Subsequently, the tokenized input was processed by the BERT model to produce a series of hidden states. These hidden states can be employed to derive word embeddings for each word in the input text by calculating the dot product between the hidden states and a learned weight matrix.

Tokenization and text embeddings extraction process using BERT.
The word embeddings generated by BERT are particularly advantageous due to their context-sensitive nature, allowing the embedding of a word to differ based on the surrounding context in which it is found. This stands in contrast to numerous other word embedding techniques, which produce a static embedding for each word, irrespective of its contextual usage.
Figure 5 shows the tokenization and text embeddings extraction process using BERT where RCLS represents the embedding of the token [CLS], RGrade corresponds to the embedding of the token Grade, R0 denotes the embedding of the token 0, and so forth. This methodology allows us to derive the representation for each individual token, which fundamentally consists of contextualized word (token) embeddings. For instance, in this study the pre-trained BERT-base model was used in which each token's representation is characterized by a dimensionality of 768.
To derive the representation for the entire sentence, we prepend the [CLS] token to the start of the sentence. The embedding of the [CLS] token encapsulates the representation of the entire sentence. Consequently, we can disregard the embeddings of all other tokens and use the embedding of the [CLS] token as the sentence representation. Therefore, the representation of the sentence ‘Grade 0, Normal’ is effectively the representation of its [CLS] token.
In a comparable manner, we can calculate the vector representation for all sentences within our training dataset. Once we have obtained the sentence representations for every entry in our training set, we can input these representations into a Transformer encoder and then into an MLP classifier to classify them into 5 different grades.
In order to explore the semantic relationships among specific target words, word embeddings were derived from a pre-trained BERT model. These embeddings serve as high-dimensional vector representations that encapsulate both semantic and syntactic information. The process involved feeding the target words into the model and extracting the corresponding representations from the output layer. To enhance visualization and facilitate comparison, Principal Component Analysis (PCA) was employed for dimensionality reduction (See Figure 6). This technique effectively projects the high-dimensional embeddings into a lower-dimensional space, typically comprising two or three dimensions, while retaining the maximum variance present in the data.

BERT word embeddings. The words Normal, Mild, Doubtful, Severe, Moderate, 0, 1, 2, 3, and 4 were selected, and BERT was used to extract their embedding vectors.
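The PCA projection used for this kind of visualisation can be sketched with a singular value decomposition. The embeddings below are random placeholders of width 768, not real BERT outputs, and the function name `pca_project` is hypothetical; the sketch only shows how high-dimensional vectors are reduced to two coordinates for plotting.

```python
import numpy as np


def pca_project(embeddings, n_components=2):
    """Project rows onto the top principal components; also return explained-variance ratios."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
words = ["Normal", "Doubtful", "Mild", "Moderate", "Severe", "0", "1", "2", "3", "4"]
embeddings = rng.normal(size=(len(words), 768))  # random placeholders, NOT real BERT vectors
coords, ratio = pca_project(embeddings)

for word, (x, y) in zip(words, coords):
    print(f"{word:>8s}: ({x:+.2f}, {y:+.2f})")
```

The explained-variance ratios indicate how much of the original spread survives in the two plotted dimensions, which is worth checking before reading structure into the plot.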
Network performance evaluation
Following the extraction of text and image embeddings, both were processed through distinct Transformer encoders that utilized stochastic gradient descent (SGD) and cross-entropy loss for the purpose of deriving text and image features, commonly referred to as hidden states. Subsequently, these derived features were directed through a multimodality projection head, after which the final combined hidden states were input into a multilayer perceptron (MLP) designed to classify the knee osteoarthritis X-ray images. To enhance generalizability, k-fold cross-validation (CV) with k set to 5 was implemented for a thorough evaluation. The evaluation metrics reflect the average values across all folds. Figure 7 shows the accuracy and loss variations over the number of epochs during the training process of the neural network model.

Training loss and accuracy variations over the number of epochs. This shows the training performance of the MLP.
The testing phase requires some changes to the multimodal architecture. During testing, the model receives X-ray images as inputs with no textual description, so the two-modality architecture used during training cannot be applied as is. Therefore, during testing and inference, the text modality is removed, and the images are fed into the ViT, which was already trained to extract embeddings; these embeddings then pass through the trained Transformer encoder, which extracts the hidden states of the images, as shown in Figure 8. Finally, these features are fed into the MLP, which was trained on images and their textual descriptions, to be classified as one of the five KL grades. Table 3 shows the training/fine-tuning time taken for every model or block in our framework.

The testing and inference model structure after being trained.
Computational time of every model/block training process.
Figure 9 illustrates the confusion matrix of the MLP during testing. This network was trained using the extracted text and image embeddings as input data to grade knee osteoarthritis radiographs. As shown in Table 4, the MLP achieved an accuracy of 82.85%, a precision of 84.54%, and a recall of 82.89% during the testing phase. These metrics were obtained on fold 3, which yielded the best performance.

The confusion matrix of the MLP during testing.
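Accuracy, precision, and recall can all be derived from a confusion matrix, as the sketch below shows. The 3 × 3 matrix is a made-up toy example rather than the study's results, and macro-averaging over classes is an assumption, since the text does not state which averaging scheme was used.

```python
import numpy as np


def metrics_from_confusion(cm):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual = cm.sum(axis=1)     # row sums: how often each class truly occurs
    precision = np.divide(true_pos, predicted, out=np.zeros_like(true_pos), where=predicted > 0)
    recall = np.divide(true_pos, actual, out=np.zeros_like(true_pos), where=actual > 0)
    return {
        "accuracy": true_pos.sum() / cm.sum(),
        "precision": precision.mean(),  # macro average (assumed)
        "recall": recall.mean(),        # macro average (assumed)
    }


# Hypothetical toy confusion matrix, not the one reported in Figure 9.
toy_cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
toy_metrics = metrics_from_confusion(toy_cm)
print(toy_metrics)
```

Off-diagonal entries next to the diagonal correspond to confusions between adjacent classes, which is the pattern to look for when reading a KL-grade confusion matrix.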
Total number of images in the dataset and their train-test split.
Comparative analysis
The advancement of disease prediction is recognized as an essential endeavor, as it enables patients to identify their conditions and access appropriate treatments in a timely manner.1,5 This research introduces a multimodal approach that leverages a language model (BERT) and a vision model (ViT) for the extraction of text and image embeddings, respectively. The proposed model is designed to surpass several other intricate architectures that have been proposed for the KL grading of knee osteoarthritis 14,18,22–26 based on plain X-ray radiographs. Table 5 shows an accuracy-based comparative analysis of our model versus other related models that used the same training dataset.
Comparison with other related works.
Model interpretation
To enhance the understanding of the model's performance, we visualize the activation maps that illustrate the specific areas the model concentrated on when determining the grading for each image (see Figure 10). The computation and visualization of these activations were achieved through the application of gradient-weighted class activation mapping (Grad-CAM). 30 This technique highlights the regions likely linked to a predicted class by employing heatmaps, where the jet colormap represents the areas of highest activation in deep red and the areas of lowest activation in deep blue.

The Grad-CAM technique applied to the proposed model generated localizations on the knee OA X-ray images during testing. The first row shows the original images of different KL grades, the second row shows their corresponding heatmaps extracted using Grad-CAM, and the third row shows the original images overlaid with their corresponding heatmaps.
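The core Grad-CAM computation can be sketched in a few lines of NumPy: the gradients of the class score are averaged spatially to weight each feature channel, the weighted channels are summed, and a ReLU keeps only positive evidence. The sketch assumes that the activations and gradients are already available as (H, W, K) arrays; for a Transformer-based model this would mean reshaping patch tokens back into a spatial grid, which is an assumption not detailed in the text. The toy values are hypothetical.

```python
import numpy as np


def grad_cam(activations, gradients):
    """activations, gradients: (H, W, K) feature maps and d(class score)/d(activations)."""
    weights = gradients.mean(axis=(0, 1))                 # one importance weight per channel
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)  # weighted sum, then ReLU
    peak = cam.max()
    return cam / peak if peak > 0 else cam                # scale to [0, 1] for display


# Toy example: channel 0 supports the class, channel 1 argues against it.
activations = np.zeros((4, 4, 2))
activations[1, 2, 0] = 2.0
activations[3, 0, 1] = 1.0
gradients = np.ones((4, 4, 2))
gradients[..., 1] = -1.0

heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In practice the normalised map is upsampled to the image size and rendered with a colormap such as jet before being overlaid on the radiograph.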
Discussion
In this work, we proposed a multimodal model based on vision-language architecture that integrates the use of ViT, BERT and transformer encoders for the knee osteoarthritis (OA) KL severity grading from plain radiographs.
Evaluating the severity of knee OA and assigning a Kellgren-Lawrence (KL) grade through X-ray images presents considerable challenges, largely influenced by the clinician's judgment in determining the KL grade. Consequently, the precision of such assessments can vary markedly based on the clinician's level of expertise, leading to potential inaccuracies. To mitigate the reliance on subjective clinical evaluations and decrease the likelihood of misdiagnosis from X-ray images, models based on artificial intelligence may offer significant advantages. An AI-driven model could serve as a supportive resource, enhancing clinicians’ ability to accurately assess the severity of knee OA as depicted in X-ray images. This motivation underpins our research, which aims to develop a valuable tool for clinical practice.
The evaluation of the model on the test set comprising X-ray images demonstrated an overall accuracy of 82.85%, alongside a precision of 84.54% and a recall of 82.89%, as detailed in Table 4 for the KL severity grading of knee radiographs. This level of performance is regarded as encouraging, as it aligns with the findings of other pertinent studies, which are compared in Table 5. This table illustrates the performance of our model against other models assessed on the OAI 26 dataset, specifically in terms of accuracy. We chose accuracy as the primary metric for comparison due to its prevalence in the literature. The results presented in Table 5 indicate that our model, trained on text-image data, may surpass the accuracy of several leading studies in the assessment of knee osteoarthritis severity from X-ray images. Moreover, we compared the performance of our model to the same model trained with only the image modality, and the results are shown in Figure 11. The positive effect of the text modality on the trained model is evident.

Image versus image-text model performance comparison.
Despite the accuracy achieved by this model in the severity grading of knee OA, it still has some limitations. To gain deeper insights into the model's performance, we conducted an error analysis, focusing on common misclassification patterns. Our findings reveal that the model tends to struggle with distinguishing between adjacent KL grades, particularly between grades 0, 1, and 2, as shown in the confusion matrix (Figure 9). This is likely due to the similarity between normal knees and knees with a small degree of osteophyte development, and it is consistent with the challenges reported in other studies. 31
Moreover, the elimination of the text modality in the testing/inference phase may result in a minor decline in performance; however, we contend that the model's robust architecture, established through multimodal training, equips it to manage the unimodal testing environment proficiently. This assertion is supported by recent research on diffusion-based models, which illustrates the capability of models trained on multimodal datasets to excel in unimodal contexts. 32
Overall, the simplicity of our proposed multimodal model training methodology demonstrates that, when supported by a robust dataset and an intricate augmentation technique, it is feasible to effectively leverage vision (ViT) and language models (BERT) in diagnostic radiology research. This approach can yield a valuable instrument to assist clinicians in assessing the severity of knee osteoarthritis.
Conclusion
This study presents a novel multimodal approach for assessing knee osteoarthritis (OA) severity using the Kellgren-Lawrence (KL) grading system. By leveraging advanced vision and language models, specifically Vision Transformer (ViT) and Bidirectional Encoder Representations from Transformers (BERT), our framework demonstrates significant improvements in the automatic grading of knee OA from X-ray images. The proposed multimodal model achieves an impressive overall accuracy of 82.85%, with a precision of 84.54% and a recall of 82.89% on the test set. These results surpass those of previous state-of-the-art architectures, highlighting the effectiveness of our approach in addressing the challenges associated with knee OA assessment. Key contributions of this work include:
The development of a unique multimodal framework that combines vision and language models to enhance the learning process for knee OA severity grading.
The creation of a simplified inference framework capable of assessing knee OA severity using only X-ray images.
The implementation of data augmentation techniques to mitigate the negative effects of data imbalances during model training.
The utilization of Grad-CAM for model interpretation, providing insights into the decision-making process of the AI system.
Our research demonstrates the potential of advanced machine learning techniques in improving the accuracy and efficiency of knee OA diagnosis. By automating the grading process, this approach can potentially reduce the variability in diagnoses caused by differences in radiologist expertise and expedite the assessment of knee OA severity. Future work could focus on further refining the model, expanding its applicability to other imaging modalities, and conducting clinical trials to validate its performance in real-world healthcare settings.
Footnotes
Informed consent statement
Not applicable.
Author contributions
Conceptualization: A.H. and M.K.S.M.; methodology: A.H., S.S., and A.A.; formal analysis and investigation: A.H. and S.S.; writing (original draft preparation): A.H., M.K.S.M., and S.S.; writing (review and editing): A.H., M.K.S.M., A.R., and A.M.S.M.; supervision: A.H.; validation: A.H. All the authors contributed to the study. All authors have read and approved the final manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Institutional review board statement
Not applicable.
Data availability statement
The data presented in this study will be made available on reasonable request to the corresponding author.
Disclaimer/publisher's note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
