Explainable health prediction from facial features with transfer learning

Abstract

In recent years, Artificial Intelligence (AI) has been widely deployed in the healthcare industry. AI technology enables efficient and personalized healthcare systems for the public. In this paper, transfer learning with a pre-trained VGGFace model is applied to identify symptoms of illness from the facial features of a person. As the way a deep learning model arrives at a decision is not transparent, this paper investigates the use of Explainable AI (XAI) techniques for obtaining explanations for the predictions made by the model. Three XAI techniques are studied: Integrated Gradient, Explainable region-based AI (XRAI) and Local Interpretable Model-Agnostic Explanations (LIME). XAI is crucial to increase the model’s transparency and reliability for practical deployment. Experimental results demonstrate that the attribution methods can give proper explanations for the decisions made by highlighting important attributes in the images. The facial features that account for positive and negative class predictions are highlighted appropriately for effective visualization. XAI can help to increase the accountability and trustworthiness of the healthcare system, as it provides insights into how a conclusion is derived by the AI model.

Keywords

Explainable AI; health prediction; transfer learning; deep learning

1 Introduction

Recently, there has been a significant rise of interest in implementing machine learning and deep learning methods in medical and healthcare systems. A number of studies [1–6] have shown successful Artificial Intelligence (AI) deployment in advanced healthcare and medical systems. AI techniques have achieved unprecedented performance in detecting and analyzing observable features such as tumors in medical imaging. Microsoft CustomVision, a service provided by Microsoft Azure Cognitive Services, has been trained to identify COVID-19 pneumonia and pneumonia of other etiologies from chest x-rays (CXR) [7]. That work achieved 97% accuracy in distinguishing COVID-19 from non-COVID-19 pneumonia based on the CXR of the patients. Besides, AI systems have proven successful in diagnosing rare cardiac disease from electrocardiogram (ECG) signals. A deep learning model trained with human ECG samples is able to detect abnormalities associated with Phospholamban, which are known to cause sudden cardiac death [8]. The method obtained promising accuracy compared to human-level performance.

Despite their huge success, existing AI systems face a great challenge: the outcomes furnished by the AI models are not readily interpretable. Sophisticated AI models like deep neural networks can contain hundreds of hidden layers with millions of parameters. The size of the network makes it hard for scientists to understand the operations within the architecture. Hence, a deep neural network is often considered a black-box model. The public will be hesitant to adopt AI technology, especially for important decisions such as those in precision medicine, without knowing how the conclusion is inferred by the AI model. Therefore, it is of paramount importance to study Explainable AI (XAI) [9] to overcome this limitation of existing AI systems. XAI is vital for Responsible Artificial Intelligence [10], which is an important concept for the deployment of AI models in practical applications. XAI aims to enhance the model’s confidence level, fairness, safety and security, trustworthiness, transferability, interactivity, informativeness and accessibility, as well as the ease of looking for causality in the data.

In this paper, Convolutional Neural Networks (CNN) [11] are combined with transfer learning [12] to predict the health condition of a person from facial features. A CNN is a deep neural network that consists of many convolutional layers and is commonly deployed for image processing and classification tasks. Transfer learning, on the other hand, carries the pre-trained weights and well-defined structure learnt on an initial task over to a new task at hand. Many studies have shown that transfer learning with fine-tuning can achieve very high accuracy with a limited dataset [4, 13–16]. Thus, healthcare applications like rare disease diagnosis, where relatively few samples of ill patients are available, benefit from this technique.

Furthermore, we explore XAI to achieve meaningful interpretation of the deep learning model. Three XAI techniques, namely Integrated Gradient [17], XRAI [18] and LIME [19], are investigated. We systematically examine and evaluate the level of transparency achieved by each technique, and we provide a comparison to highlight the strengths and weaknesses of each approach. The contributions of this paper are two-fold. First, an integrated transfer learning approach with fine-tuning is proposed for health status prediction based on facial features. Second, different XAI techniques are investigated for the positive and negative attributions behind the decisions made.

The remainder of the paper is organized as follows. Section 2 reviews the existing work on medical system development with artificial intelligence. Section 3 describes the proposed method and Section 4 the XAI techniques studied. Sections 5 and 6 report the experimental results and the XAI results, respectively, and the conclusions are drawn in the final section.

2 Literature review

In the past few years, great advances have been made by AI practitioners in clinical healthcare systems. Deep learning and machine learning approaches have driven the success of these systems by discovering hidden patterns in patients’ data. A healthcare system is convenient to use if an accurate diagnosis can be made from suspicious features on a person’s face. A number of studies have applied deep learning and machine learning techniques to examine health status from facial features [1–6].

Wang and Luo [1] presented their work on detecting observable disease symptoms from faces by integrating semi-supervised anomaly detection with computer vision algorithms. The proposed method relied on statistical models of common facial features such as skin color and tone. The research contributed to the analysis of facial features regardless of race, age or gender. The dataset was composed of 8278 pictures for training and 237 pictures for testing, covering more than 20 diseases. The images were obtained from VA Medical Center and UCSD School of Medicine. A semi-supervised detection technique was used for categorization. Multiple processes were unified into a single automatic procedure to detect the symptoms of different diseases, evaluated with 31 sets of experiments under different threshold settings. The overall performance of the experiments was an area under the ROC curve (AUC) of 0.846.

According to research carried out by Henderson et al. [2], cues from the face are essential for an accurate assessment of a person’s health. In the study, they introduced a set of facial features comprising men’s facial masculinity, facial symmetry, facial adiposity, skin condition, and facial expression. They found that these factors contributed to the accurate detection of health status based on facial features. They investigated the difference between three-dimensional and two-dimensional facial images for facial BMI scores. Three-dimensional images were captured using a three-dimensional camera, with 68 women and 50 men as participants. For two-dimensional images, pictures of 67 women were taken using a camera. Using three-dimensional images, they found that curvature had a positive correlation with health status, while a more downturned mouth tended to receive a lower health rating. Besides, a downturned mouth correlated with a higher facial BMI score. In two-dimensional images, skin color, especially skin yellowness, was found to correlate with health status, and mouth curvature was positively correlated with perceived health status. The authors concluded that color was a significant contributor to the prediction of health status. They also noted that there would be subtle variation in apparent health no matter how normal a facial image looked.

Kong et al. [3] demonstrated the possibility of detecting acromegaly from facial photographs by applying machine learning methods. Acromegaly patients typically show facial symptoms such as widened teeth spacing, enlargement of the frontal bone, lips and nose (tissue swelling), forehead prominence, skin thickening and prognathism. The data used in the research were obtained from the neurosurgery inpatient departments of some general hospitals in China. A total of 527 acromegaly patients were included. Several machine learning algorithms, namely Generalized Linear Models, Support Vector Machines, K-nearest neighbors (KNN), Convolutional Neural Network and Forest of randomized trees, were used. These algorithms were also integrated as an ensemble method, which achieved higher classification accuracy and increased the generalizability of the model. CNN obtained the best result, with a positive predictive value of 96%, a negative predictive value of 92%, a sensitivity of 91% and a specificity of 96%.

Singh and Kisku [4] worked on rare genetic disease detection from two-dimensional facial images by applying transfer learning. Twelve rare genetic diseases were included in the research. The dataset was obtained from the eLife database, with a total of 1567 images. Diseases were identified from the abnormalities and symptoms visible on the patients’ faces. In previous works, conventional classifiers were deployed because of the limited dataset size, and these models were likely to miss essential features that contribute to accurate results. To overcome the limited training sample problem, the authors applied transfer learning with the VGGFace model with ResNet50 architecture and employed an SVM classifier on the top layer. Although only a small amount of data was used, a high recognition accuracy of 98.1% was achieved with the transfer learning approach.

Majtner et al. [5] proposed an approach for non-invasive detection of diabetic complications via analysis of facial color variation patterns. Most type 2 diabetes mellitus patients have at least one sign of complication. Diabetes patients commonly show a skin manifestation in the form of blood flushing on the face, also known as rubeosis faciei. The study aimed at early and non-invasive detection of diabetic complications. The team collected the data with a recording setup consisting of a laptop and a camera on a tripod. The dataset was obtained from two groups. The first group, DM, comprised twenty patients with insulin-treated diabetes and an average age of 63±10 years. The second group, group C, contained ten patients with an average age of 60±6 years. A total of 114 video files were acquired from group DM and 60 video files from group C. The training dataset consisted of 60 DM and 30 C videos, whereas the testing dataset consisted of 54 DM videos and 30 C videos. The input videos were preprocessed with color normalization and head movement tracking. The team detected skin patches in order to track changes in facial redness. Linear discriminant analysis (LDA) and support vector machine (SVM) were then used to classify the features extracted from the skin patches. The SVM classifier reported an accuracy of 92.86% on binary classification, with a sensitivity of 100% and a specificity of 80%. However, the work was considered a pilot study, as only a small number of participants was involved.

Ee et al. [6] investigated the possibility of predicting the development of clinical depression in a person by analysing facial images. Two approaches, namely eigenface (PCA) and fisher face (PCA + LDA), were applied, with a nearest neighbour (NN) classifier. The accuracies obtained in the study were 51% and 61% for person-independent and person-dependent classification, respectively. Video recordings from a clinical database were used. Two types of family interaction sessions were involved, namely problem-solving interaction and event-planning interaction. During each session, the family was asked to discuss a topic that reflected the type of interaction. The video recordings were labelled by a psychologist, based on the Living-In-Family-Environments (LIFE) coding system. Approximately 30,000 images per person were recorded per video. The SMQT face detector was utilised to detect the facial images and select the images for analysis. Although the accuracies obtained were relatively low, there was an interesting finding: problem-solving interaction always had a higher prediction result than event-planning interaction. This can be related to the fact that a controversy is more likely to be elicited during a problem-solving session. In other words, there is a higher chance of having behavior control of facial expressions during a recording session. Table 1 presents a summary of the existing works on health prediction from facial features using machine learning and deep learning approaches.

Table 1
Summary of existing works

Authors	Subject of Study	Method	Dataset	Reported performance
Wang and Luo [1]	Observable diseases	Semi-supervised learning	Dataset from VA Medical Center and UCSD School of Medicine	AUC = 0.846
Henderson et al. [2]	Health status	2D and 3D facial image analysis	Three-dimensional images of 68 women and 50 men; two-dimensional images of 67 women	–
Kong et al. [3]	Acromegaly	Generalized Linear Models, SVM, KNN, CNN and Forest of randomized trees	527 acromegaly patients (254 women and 273 men)	Sensitivity = 91%; Specificity = 96%
Singh and Kisku [4]	Rare genetic diseases	Transfer learning with VGGFace, SVM	12 rare genetic diseases with a total of 1567 images	Accuracy = 98.1%
Majtner et al. [5]	Diabetic complications	LDA, SVM	Two groups: group DM, twenty patients with insulin-treated diabetes and an average age of 63±10 years; group C, ten patients with an average age of 60±6 years. 114 video files from group DM and 60 video files from group C	Accuracy = 92.86%
Ee et al. [6]	Depression in adolescents	PCA, LDA	Video recordings from a clinical database, annotated as problem-solving and event-planning interactions with the LIFE coding system	Accuracy = 61%

3 Proposed method

3.1 Data pre-processing

In this paper, the Haar Cascade Classifier [20] is used to detect and crop faces from the facial images. The Haar Cascade Classifier is a well-known object detection approach, and it can detect frontal faces reliably. Figure 1 depicts some sample faces cropped using Haar Cascade. The cropped images are resized to 200×200 pixels. After that, data augmentation [21, 22] is performed to increase the number of image samples. Data augmentation helps to enhance data diversity and improve the model training process. The techniques applied include horizontal flip, 90-degree rotation, random brightness, Gaussian noise, and hue adjustment. Figure 2 depicts some examples of the augmented images.
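The augmentation step can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes a grayscale image stored as nested Python lists (so hue adjustment is omitted), and the function names, the brightness range of 0.7 to 1.3 and the noise level of 5.0 are hypothetical choices made for illustration only.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image: rows of pixel values in [0, 255]


def clamp(value: float) -> float:
    """Keep a pixel value inside the valid range [0, 255]."""
    return min(255.0, max(0.0, value))


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness(img: Image, factor: float) -> Image:
    """Scale every pixel by `factor`, then clamp."""
    return [[clamp(p * factor) for p in row] for row in img]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise with standard deviation `sigma`, then clamp."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def augment(img: Image, rng: random.Random) -> List[Image]:
    """Return one augmented copy per technique (assumed parameter values)."""
    return [
        horizontal_flip(img),
        rotate_90_clockwise(img),
        adjust_brightness(img, rng.uniform(0.7, 1.3)),
        add_gaussian_noise(img, 5.0, rng),
    ]


sample = [[10.0, 20.0], [30.0, 40.0]]
augmented = augment(sample, random.Random(0))
```

Each original image yields four additional samples here; in practice an image library would operate on color arrays, and the parameter ranges would be tuned to the data.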

Fig. 1

(a) Original Image (b) Face detected by Haar Cascade Classifier (c) Cropped Face.

Fig. 2

Samples of augmented data.

3.2 Integrated feature extractor with VGGFace

In this study, only a limited number of sick face images is available. Transfer learning is known to be efficient in learning a target task by transferring the knowledge of a similar task. Therefore, VGGFace [23] is used as an integrated feature extractor. VGGFace is a neural network model built on the VGG-Very-Deep-16 CNN architecture. VGGFace is chosen over other pre-trained models like VGG-16/VGG-19 and Inception V3 because it was trained on face images, which minimizes the domain divergence from the target task. The model has a deeper convolutional architecture than AlexNet and uses smaller filter sizes. Each series of convolutional layers is followed by a max-pooling layer, except for the last one, which is followed by two fully-connected layers identical to AlexNet. The output of the last fully-connected layer represents the VGG image descriptor. The theoretical background and explanation of the model can be found in [23].

In this paper, the VGGFace model is used as an integrated feature extractor with a CNN model. The pre-trained model effectively acts as a weight initializer for the CNN model. By transferring the pre-trained weights from VGGFace, the CNN model benefits from a set of prominent facial characteristics learnt from a large number of face images. The network architecture of the proposed model is shown in Fig. 3. The layers of the VGGFace model are frozen, and the facial features extracted by VGGFace are fed to the CNN model. The network is designed in this way to allow the general features learnt by VGGFace to pass through the network. Fine-tuning is then performed on the CNN layers to extract more refined, task-specific features from these general facial features.
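The idea of freezing a pre-trained extractor and training only the layers on top of it can be sketched with a toy example. This is not the VGGFace architecture: the "frozen" extractor below is a fixed random projection followed by tanh, the trainable part is a single logistic unit, and the data, labels and hyperparameters are all hypothetical stand-ins.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def frozen_features(x, frozen_w):
    """Frozen extractor: tanh of a fixed linear projection (never updated)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in frozen_w]


def mean_loss(head_w, head_b, feats, labels):
    """Mean binary cross-entropy of the trainable head on extracted features."""
    total = 0.0
    for phi, y in zip(feats, labels):
        p = sigmoid(sum(w * f for w, f in zip(head_w, phi)) + head_b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def train_head(feats, labels, lr=0.5, epochs=200):
    """Full-batch gradient descent on the head only; the extractor is untouched."""
    dim = len(feats[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for phi, y in zip(feats, labels):
            err = sigmoid(sum(wi * f for wi, f in zip(w, phi)) + b) - y
            grad_w = [g + err * f for g, f in zip(grad_w, phi)]
            grad_b += err
        w = [wi - lr * g / len(labels) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(labels)
    return w, b


rng = random.Random(0)
frozen_w = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
frozen_copy = [row[:] for row in frozen_w]  # kept to verify nothing changed

inputs = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(40)]
labels = [1 if x[0] + x[1] > 0 else 0 for x in inputs]  # synthetic labels

feats = [frozen_features(x, frozen_w) for x in inputs]
loss_before = mean_loss([0.0] * 4, 0.0, feats, labels)
head_w, head_b = train_head(feats, labels)
loss_after = mean_loss(head_w, head_b, feats, labels)
```

Because only the head parameters receive gradient updates, the features can be computed once and reused across epochs, which is one practical reason frozen extractors suit small datasets.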

Fig. 3

Integrated feature extractor with VGGFace.

4 Explainable AI

Explainable AI (XAI) [10] refers to techniques used to unveil the black-box nature of neural networks. XAI is intended to provide models and methods with explanations while maintaining the model’s high level of performance [19]. XAI increases the reliability of the model by showing reasons and evidence for why the model can be trusted. A good XAI technique is able to demonstrate the model’s explainability, interpretability, understandability and comprehensibility. At the same time, it increases the model’s transparency, which is vital because a model can only be considered transparent if it is understandable. With these characteristics, XAI enables people to trust, understand, and better manage AI technology.

In this paper, three XAI approaches, namely Integrated Gradient [17], XRAI [18] and LIME [19], are examined. Integrated Gradient and XRAI are attribution-based approaches that tell which features in the input images are responsible for a certain decision. This is important for users to understand how the model behaves, so that they know how to improve it. On the other hand, LIME is a model-agnostic approach that provides local model explanations. All three methods provide post-hoc explanations, in which understandable information is communicated about how a developed model yields an output for a given input. The details of each XAI method are delineated in the subsequent sections.

4.1 Integrated gradient

Integrated Gradient is an attribution method used to highlight the input features that contribute most to the trained model’s prediction [17]. It is simple to compute, requiring only a few calls to the gradient operation. This method is suitable for both regression and classification problems. The technique satisfies the axioms of Sensitivity and Implementation Invariance, as well as the axiom of Completeness. The salient inputs are determined by varying the model’s input from the baseline to the original input. This method can be applied to different deep neural networks by attributing the predicted outcome of a model to the input, and a strong theoretical justification has been provided.

Baseline selection is important in Integrated Gradient, as a good baseline is decisive for obtaining reasonable attributions. A baseline for which the model outputs a near-zero score, e.g. a black image, is recommended. The integral in Integrated Gradient is approximated by a summation: the gradients at points taken at small intervals along the path from the baseline to the input are summed up. Given a model function F, the approximation is computed as follows:

$IG_{i}^{approx} (x) = (x_{i} - x_{i}^{'}) \times \sum_{k = 1}^{m} \frac{\partial F (x^{'} + \frac{k}{m} \times (x - x^{'}))}{\partial x_{i}} \times \frac{1}{m}$ (1)

where m is the number of steps in the Riemann approximation of the integral, x is the input, x’ is the baseline, and i indexes the input features. As the sum is computed in a loop, k denotes the current step, k = 1, 2, ... , m. The number of steps should be increased if the approximation error is higher than 5%.
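The Riemann approximation can be made concrete with a small sketch. It assumes a toy model F(x) = sigmoid(w · x + b) whose gradient is available in closed form; the weights, input and all-zero baseline are hypothetical values, and a real image model would obtain the gradients from automatic differentiation instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def model(x, w, b):
    """Toy differentiable model F(x) = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def model_grad(x, w, b):
    """Analytic gradient of F with respect to each input feature x_i."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, m=50):
    """Riemann-sum approximation of Integrated Gradients with m steps."""
    grad_sum = [0.0] * len(x)
    for k in range(1, m + 1):
        # Point on the straight path from the baseline to the input.
        point = [xb + (k / m) * (xi - xb) for xi, xb in zip(x, baseline)]
        grad_sum = [s + g for s, g in zip(grad_sum, model_grad(point, w, b))]
    # Average the gradients (factor 1/m) and scale by (x_i - x'_i).
    return [(xi - xb) * s / m for xi, xb, s in zip(x, baseline, grad_sum)]


w, b = [1.0, -2.0, 0.5], 0.0              # hypothetical model parameters
x, baseline = [1.0, 0.5, 2.0], [0.0, 0.0, 0.0]

attributions = integrated_gradients(x, baseline, w, b)
delta = model(x, w, b) - model(baseline, w, b)
relative_error = abs(sum(attributions) - delta) / abs(delta)
```

By the completeness axiom, the attributions should sum to F(x) − F(x’); `relative_error` measures how far the approximation is from that, and a larger m reduces it.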

4.2 XRAI

XRAI is another attribution method, specialized for image inputs. It combines Integrated Gradient with over-segmentation and region selection to find the attributions in the images [18]. The attribution is determined at the region level rather than at the pixel level. XRAI has been shown to produce better results than other saliency methods for common models [18]. It can be applied to any deep neural network model.

XRAI first over-segments the image using Felzenszwalb’s graph-based method. It uses Integrated Gradients for pixel-level attribution, computed with both black and white baselines. Using both baselines matters because Integrated Gradient is insensitive to pixels whose values are close to the baseline. Starting with an empty mask, XRAI calculates the attribution gain of each region and selects the regions with the highest gain. The resulting ranking of region importance is displayed as a color gradient.
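The move from pixel-level to region-level attribution can be illustrated with a deliberately simplified sketch. It is not the full XRAI algorithm, which over-segments at several scales and grows the mask greedily; here a single hypothetical segmentation is given, and regions are simply ranked by their mean pixel attribution.

```python
def rank_regions(attributions, segments):
    """Rank segment ids by mean pixel attribution, highest first.

    attributions: 2D list of pixel attributions (e.g. from Integrated Gradients).
    segments:     2D list of the same shape holding a segment id per pixel.
    """
    totals, counts = {}, {}
    for attr_row, seg_row in zip(attributions, segments):
        for value, seg_id in zip(attr_row, seg_row):
            totals[seg_id] = totals.get(seg_id, 0.0) + value
            counts[seg_id] = counts.get(seg_id, 0) + 1
    density = {seg_id: totals[seg_id] / counts[seg_id] for seg_id in totals}
    order = sorted(density, key=density.get, reverse=True)
    return order, density


# Hypothetical 3x4 attribution map and segmentation with three regions.
pixel_attributions = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.1, 0.0],
    [0.0, 0.0, 0.3, 0.3],
]
segment_ids = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
]

order, density = rank_regions(pixel_attributions, segment_ids)
```

In this example region 0 ranks first, followed by region 2 and then region 1, so a region-level overlay would emphasize the top-left block even though individual pixels elsewhere carry some attribution.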

4.3 LIME

Unlike Integrated Gradient and XRAI, which are network/model specific, Local Interpretable Model-agnostic Explanations (LIME) can be applied to any machine learning model without knowing its underlying processing or internal representation [19]. It is an explanation method for the predictions of a classifier or regressor. It identifies an interpretable model, defined over interpretable attributes, that is faithful to the classifier or regressor. LIME is locally faithful: the explanation must correspond to the model’s behavior in the vicinity of the instance being predicted. LIME generates local explanations by perturbing the input within its neighborhood and observing how the predictions change as the input changes.

LIME uses sparse linear explanations. Let g be the explanation model, $x \in ℝ^{d}$ the instance to be explained, and $f : ℝ^{d} \to ℝ$ the model being explained. Let π_x (z) denote a proximity measure between an instance z and x, which defines the locality around x. The locally weighted square loss L is then defined as

$L (f, g, π_{x}) = \sum_{z, z^{'} \in Z} π_{x} (z) {(f (z) - g (z^{'}))}^{2}$ (2)

where z’ is a perturbed sample in the interpretable representation, z is the corresponding sample in the original feature space, and f(z) is the model’s prediction for z. π_x (z) can be an exponential kernel $\exp (\frac{- D (x, z)^{2}}{σ^{2}})$ defined on some distance function D with width σ.
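A small numerical sketch of the locally weighted square loss may help. It assumes a Euclidean distance for D and uses a hypothetical black-box model f, a hypothetical linear explanation g and three hand-picked perturbed samples; none of these values come from the experiments in this paper.

```python
import math


def euclidean(a, b):
    """Distance function D(x, z); Euclidean distance is an assumed choice."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def exponential_kernel(distance, sigma):
    """Proximity weight pi_x(z) = exp(-D(x, z)^2 / sigma^2)."""
    return math.exp(-(distance ** 2) / (sigma ** 2))


def weighted_square_loss(f, g, x, samples, sigma):
    """Sum of pi_x(z) * (f(z) - g(z'))^2 over pairs (z, z')."""
    return sum(
        exponential_kernel(euclidean(x, z), sigma) * (f(z) - g(z_prime)) ** 2
        for z, z_prime in samples
    )


def black_box(z):
    """Hypothetical non-linear model being explained."""
    return z[0] * z[1]


def linear_explanation(z_prime):
    """Hypothetical sparse linear explanation over binary interpretable features."""
    return 0.5 * z_prime[0] + 0.5 * z_prime[1]


x = [1.0, 1.0]
# Each pair holds a sample z in the original space and its interpretable form z'.
samples = [
    ([1.0, 1.0], [1, 1]),
    ([1.0, 0.0], [1, 0]),
    ([0.0, 0.0], [0, 0]),
]

loss = weighted_square_loss(black_box, linear_explanation, x, samples, sigma=1.0)
```

Only the second sample contributes here, since the linear model matches the black box on the other two; its squared residual of 0.25 is down-weighted by the kernel because the sample lies at distance 1 from x.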

4.4 XAI techniques comparison

Each of the XAI techniques has its advantages and disadvantages. In this section, we systematically analyze the strengths and limitations of each approach. Integrated Gradient is easy to implement, as no new training or instrumentation is needed. It is also widely applicable, for example to textual and image data. However, Integrated Gradient suffers from inconsistency in saliency map generation owing to randomness in the baseline used. Moreover, it does not give a global understanding of the model, as only attributions are explained. Finally, Integrated Gradient tends to be insensitive to pixels whose values are close to the baseline.

XRAI addresses a problem of Integrated Gradient, namely that using a dark image as the baseline reduces the attribution of dark pixels. XRAI, on the other hand, is more sensitive to pixel coloration. Besides, XRAI can be used with any deep neural network-based model. Yet, XRAI is recommended for image models only, as it localizes attributions at the region level.

LIME, on the other hand, is a simpler local model: its implementation is open source and easy to understand. Besides, it supports structured and image data for explanation. Nevertheless, LIME is slower to compute, as it requires passing multiple perturbed samples through the model. Moreover, the explanation may not be faithful if the model is highly non-linear, as LIME uses sparse linear models for explanation. Table 2 presents the comparison among the techniques.

Table 2
A comparison among XAI techniques

XAI Technique	Pros	Cons
Integrated Gradient	•Easy to implement	•Inconsistent saliency maps
	•Widely applicable	•Does not provide global understanding of the model
		•Insensitive to pixels that are close to the baseline’s values
XRAI	•Addresses the baseline insensitivity of Integrated Gradient	•Recommended for image models only
	•Widely applicable with DNN	
LIME	•Simpler local model	•Slower because of sample perturbation
	•Supports structured and image data	•Does not work well for highly non-linear models

5 Experimental results

5.1 Data collection

In this study, two categories of images, namely normal faces and faces with symptoms of illness, are collected. For normal and healthy-looking faces, the data are obtained from the UTKFace dataset [24]. Among the 20,000 face images in the dataset, 1000 images are selected by manual inspection for use in this study. Only clear, high-quality faces are selected. The number is limited to 1000 images to avoid a class imbalance problem. Some examples of healthy face images are shown in Fig. 4.

Fig. 4

Samples of healthy face images.

There is no publicly available dataset of sick face images portraying symptoms like flu, fever or a runny nose. Therefore, the sick face images are collected by performing online searches using keywords like shortness of breath, sore throat, fever and runny nose. The searches were performed on platforms like Pixabay, Shutterstock and Google Images, and the facial images were downloaded manually. The images available for people with facial symptoms of illness are very limited. We managed to collect 1000 images containing the relevant symptoms. Figure 5 illustrates some sample sick face images used in this study.

Fig. 5

Samples of sick face images.

5.2 Transfer Learning with VGGFace

Transfer learning is applied with VGGFace in the experiment. The model is used as an integrated feature extractor. The layers of VGGFace are frozen and its output is fed as input to the CNN model. The CNN model is further fine-tuned to obtain more refined facial features. We obtained an accuracy of 97% at the 10th epoch in the experiment. The loss graph is shown in Fig. 6.

Fig. 6

Loss per epoch.

Besides, hyperparameter tuning is performed to choose the optimal hyperparameters. The results of varying the optimizers, activation functions, learning rates, and dropout rates are presented in Figs. 7 to 10. The search space is configured as follows: (1) Optimizer: Adam, SGD, Adagrad, Adadelta, RMSprop, (2) Learning rate: 0.001 to 0.40, (3) Activation function: Sigmoid, Softmax, Tanh, and (4) Dropout rate: 0.30 to 0.70. The best-performing values are fixed and used for the subsequent tests. Notably, the following settings provided the best performance: optimizer: Adam, activation function: Sigmoid, learning rate: 0.05, and dropout rate: 0.70. With these settings, the model is then fine-tuned and trained again for a better result.

Fig. 7

Optimizers and accuracy.

Fig. 8

Activation functions and accuracy.

Fig. 9

Learning rate and loss.

Fig. 10

Dropout rates and accuracy.

Table 3 shows the comparison of the standalone CNN and the proposed integrated VGGFace with CNN model. The results clearly show that the model with transfer learning outperforms the standalone CNN model, most markedly in the early epochs. This observation demonstrates the benefit of the transfer learning approach. The knowledge learnt from a large set of facial images by VGGFace has indeed helped to boost the system’s accuracy. Moreover, the model achieves an Area Under Curve (AUC) score of 0.926. Figure 11 illustrates the Receiver Operating Characteristic (ROC) curve of the model.

Table 3

Comparison of standalone CNN model and VGGFace with CNN model

Model	Epoch	Accuracy
CNN	1	0.6870
	5	0.8247
	10	0.9310
	15	0.9490
VGGFace with CNN	1	0.9270
	5	0.9625
	10	0.9700
	15	0.9690

Fig. 11

ROC Curve.
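For readers unfamiliar with the AUC reported above, the following sketch shows one standard way to compute it: as the fraction of (sick, healthy) pairs in which the sick sample receives the higher score, with ties counted as one half. The labels and scores are hypothetical and are not taken from the experiments.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical predictions: 1 = sick, 0 = healthy.
example_labels = [1, 1, 1, 0, 0, 0]
example_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
example_auc = roc_auc(example_labels, example_scores)
```

In this example eight of the nine positive-negative pairs are ordered correctly, giving an AUC of 8/9. An AUC of 1.0 would mean every sick face is scored above every healthy face, while 0.5 corresponds to random ranking.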

Explainable AI is applied to get the model’s explanations on why a decision is made. The AI Platform provided by Google is used. The model trained with transfer learning is used as the input model to the XAI techniques.

6 Results of explainable AI

6.1 Integrated gradient

The Integrated Gradient method highlights and visualizes the pixels that contribute to the predicted outcome. The details are provided at the pixel level, which is useful for granular attributions. The clipping values are adjusted for different outcomes. Clipping filters out noise, which makes the strong attributions easier to visualize. Besides, different polarity settings are tested. The polarity option selects which attributions are highlighted: a positive polarity setting highlights the areas with the most substantial influence on the positive prediction, whereas a negative polarity highlights the areas that do not lead to positive predictions. Results with positive and negative polarities are depicted in Fig. 12.
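The effect of the polarity and clipping settings can be sketched as a simple post-processing step on a list of attributions. This is an assumed, simplified interpretation for illustration and not the implementation of the platform used in the experiments; the function names, the nearest-rank percentile and the default thresholds are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def filter_attributions(attrs, polarity="positive", lower_pct=50.0, upper_pct=100.0):
    """Keep attributions of one sign, drop weak ones, and cap strong ones.

    Returns non-negative magnitudes; zero means "not highlighted".
    """
    sign = 1.0 if polarity == "positive" else -1.0
    # Magnitude of attributions with the requested sign; others become zero.
    kept = [max(0.0, sign * a) for a in attrs]
    nonzero = [v for v in kept if v > 0.0]
    if not nonzero:
        return kept
    low = percentile(nonzero, lower_pct)
    high = percentile(nonzero, upper_pct)
    # Values below the lower bound are treated as noise; values above are capped.
    return [0.0 if v < low else min(v, high) for v in kept]


pixel_attrs = [0.05, -0.4, 0.9, 0.2, -0.1, 0.6]  # hypothetical attributions
positive_view = filter_attributions(pixel_attrs, polarity="positive")
negative_view = filter_attributions(pixel_attrs, polarity="negative")
```

Raising `lower_pct` removes more low-magnitude pixels, which mirrors adjusting the clipping value when a highlighted map looks noisy.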

Fig. 12

(a) Result with positive polarity, (b) Result with negative polarity.

From the results obtained, each clipping value setting yields different highlighted areas. The clipping value is adjusted when there is too much noise in the highlighted pixels. In Fig. 12 (a), the eyes and nose are highlighted as positive attributions for the decision made, whereas in Fig. 12 (b) the cheek area is highlighted as the negative attribution. This is reasonable, as symptoms like a reddish nose and swollen eyes commonly appear in ill face images.

Apart from that, a sanity check is performed to evaluate whether the explanation is valid. The check is based on the approximation error, which should be below 5% to affirm that the result is sound. The lower the error, the higher the confidence in the explanations. We observe that the approximation errors obtained are less than 5% on average. Some approximation errors are shown in Fig. 13.
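One common way to define such an approximation error follows from the completeness axiom: the attributions should sum to the difference between the model output for the input and for the baseline. The sketch below assumes this definition, which may differ in detail from the one used by the platform, and the attribution values and predictions are hypothetical.

```python
def approximation_error(attributions, f_input, f_baseline):
    """Relative gap between the summed attributions and the change in output."""
    delta = f_input - f_baseline
    if delta == 0:
        raise ValueError("input and baseline predictions are identical")
    return abs(sum(attributions) - delta) / abs(delta)


# Hypothetical attributions for one image and the two model outputs.
example_attributions = [0.30, 0.12, -0.05, 0.08]
error = approximation_error(example_attributions, f_input=0.96, f_baseline=0.50)
passes_sanity_check = error < 0.05  # 5% threshold from the text
```

Here the attributions sum to 0.45 against an output change of 0.46, an error of roughly 2%, so the explanation would pass the check; a failing check would suggest increasing the number of integration steps.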

Fig. 13

Result for approximate error in each class using IG.
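The text does not state how the approximation error is computed. One plausible definition, assumed here, follows from the completeness property of Integrated Gradient: the attributions should sum to the difference between the model output for the image and the output for the baseline, so the relative gap between the two can serve as the error. The function names, the sample attributions and the sample outputs below are hypothetical.

```python
def approximation_error(attributions, output, baseline_output):
    """Relative gap between the summed attributions and the output change."""
    delta = output - baseline_output
    if delta == 0.0:
        raise ValueError("input and baseline outputs are identical")
    return abs(sum(attributions) - delta) / abs(delta)


def passes_sanity_check(error, threshold=0.05):
    """Apply the 5% threshold used in the text."""
    return error < threshold


# Attributions that sum to the output change (0.88 - 0.50 = 0.38).
good = approximation_error([0.30, 0.12, -0.04], output=0.88, baseline_output=0.50)
# Attributions that fall short of the output change by 0.03.
poor = approximation_error([0.30, 0.09, -0.04], output=0.88, baseline_output=0.50)
print(good, passes_sanity_check(good))
print(poor, passes_sanity_check(poor))
```

Under this assumed definition, a large error usually means that the path integral was approximated too coarsely, and increasing the number of integration steps is the usual remedy.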

6.2 XRAI

The XRAI method explains a prediction by highlighting important regions, and it indicates their importance with a color gradient. It works better with natural images. The method provides a high-level summary that shows how the attributions relate to regions of the image. No clipping value is set in this experiment because the color gradient already shows the crucial areas. The yellow color in the image marks the regions with the most influence on the positive class prediction. Figure 14 shows the results with different overlay settings.

Fig. 14

Result with different overlay settings.
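The region-based idea can be illustrated with a simplified sketch: pixel-level attributions are aggregated over image segments, and the segments are ranked by their mean attribution. This is not the full XRAI algorithm; the fixed four-quadrant segmentation, the function names `rank_regions` and `top_region_mask`, and the sample values are assumptions made to keep the example small.

```python
def rank_regions(attributions, segments):
    """Rank segment ids by mean attribution (attribution density)."""
    totals, counts = {}, {}
    for attr_row, seg_row in zip(attributions, segments):
        for a, s in zip(attr_row, seg_row):
            totals[s] = totals.get(s, 0.0) + a
            counts[s] = counts.get(s, 0) + 1
    density = {s: totals[s] / counts[s] for s in totals}
    return sorted(density, key=density.get, reverse=True), density


def top_region_mask(segments, ranking, area_fraction=0.25):
    """Select top-ranked regions until about area_fraction of pixels is covered."""
    n_pixels = sum(len(row) for row in segments)
    chosen, covered = set(), 0
    for s in ranking:
        if covered >= area_fraction * n_pixels:
            break
        chosen.add(s)
        covered += sum(row.count(s) for row in segments)
    return [[1 if s in chosen else 0 for s in row] for row in segments]


# Hypothetical 4x4 image split into four quadrant segments (ids 0 to 3).
segments = [[0, 0, 1, 1],
            [0, 0, 1, 1],
            [2, 2, 3, 3],
            [2, 2, 3, 3]]
# Hypothetical pixel-level attributions, e.g. from Integrated Gradient.
attributions = [[0.1, 0.0, 0.5, 0.4],
                [0.0, 0.1, 0.6, 0.5],
                [0.0, 0.0, 0.2, 0.1],
                [-0.1, 0.0, 0.1, 0.2]]

ranking, density = rank_regions(attributions, segments)
mask = top_region_mask(segments, ranking, area_fraction=0.25)
print(ranking)
print(mask)
```

Because whole regions are ranked instead of single pixels, the output is a coarser map, which matches the high-level summary described above.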

Unlike Integrated Gradient, XRAI highlights the attributes with different gradients. From Fig. 14, facial attributes such as the eyes and mouth are highlighted as the explanation for the decision made. However, the nostril area is highlighted as the most important feature. Hence, the results do not seem very reliable, as the nostril area is not a distinguishing feature for differentiating a healthy face from an ill face (unlike the skin surface of the nose, which appears reddish due to frequent rubbing of the nose).

We perform the sanity check again to evaluate the result. As with Integrated Gradient, the approximation error is below 5%, which indicates an accurate result. The approximation errors for some samples are shown in Fig. 15.

Fig. 15

Results for approximated error in each class using XRAI.

6.3 LIME

In LIME, we choose to highlight the super-pixels with positive weight for a decision made. The images are randomly picked together with their predicted classes. Figures 16(a) and 17(a) highlight the areas of the top three features that contribute to the positive class prediction. Figures 16(b) and 17(b) show the top six positive and negative attributions towards the negative class prediction, where green and red represent the positive and negative attributions, respectively. Figures 16(c) and 17(c) show the heatmaps for the results. The visualized responses indicate that the model can explain the prediction outcomes reasonably.

Fig. 16

Explanations with a positive class prediction: (a) Top three positive class attributions, (b) Top six positive and negative class attributions, (c) Heatmap of the facial attributions.

Fig. 17

Explanations with a negative class prediction: (a) Top three positive class attributions, (b) Top six positive and negative class attributions, (c) Heatmap of the facial attributions.

For example, the reddish cheek highlighted on the face suggests that the person has a fever. The highlighted areas correspond to the sick symptoms on the faces.
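The selection of super-pixels with positive weight can be sketched as follows. This is a simplified illustration of the LIME idea and not the library implementation: the black-box model (`black_box`, `TRUE_EFFECT`), the five super-pixels, the exponential kernel with `kernel_width`, and the small ridge penalty are all assumptions. Random subsets of super-pixels are hidden, the black box is queried on each perturbed sample, and a locally weighted linear surrogate is fitted whose coefficients act as the super-pixel weights.

```python
import math
import random

N_SUPERPIXELS = 5
# Hypothetical black box: super-pixels 0 and 3 push towards "sick", 1 against.
TRUE_EFFECT = [2.0, -1.5, 0.0, 1.0, 0.0]


def black_box(mask):
    """Toy probability of the sick class given which super-pixels are visible."""
    z = sum(e * m for e, m in zip(TRUE_EFFECT, mask)) - 1.0
    return 1.0 / (1.0 + math.exp(-z))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def lime_weights(n_samples=500, kernel_width=0.75, ridge=1e-3, seed=0):
    """Fit a locally weighted linear surrogate; return one weight per super-pixel."""
    rng = random.Random(seed)
    dim = N_SUPERPIXELS + 1  # +1 for the intercept column
    xtwx = [[0.0] * dim for _ in range(dim)]
    xtwy = [0.0] * dim
    for _ in range(n_samples):
        mask = [rng.randint(0, 1) for _ in range(N_SUPERPIXELS)]
        hidden = mask.count(0) / N_SUPERPIXELS  # distance from the original image
        w = math.exp(-(hidden ** 2) / kernel_width ** 2)
        row = mask + [1]
        y = black_box(mask)
        for i in range(dim):
            xtwy[i] += w * row[i] * y
            for j in range(dim):
                xtwx[i][j] += w * row[i] * row[j]
    for i in range(N_SUPERPIXELS):  # ridge penalty on the weights only
        xtwx[i][i] += ridge
    return solve(xtwx, xtwy)[:N_SUPERPIXELS]


weights = lime_weights()
# Keep at most three super-pixels, positive weights only, strongest first.
top_positive = sorted((i for i, w in enumerate(weights) if w > 0),
                      key=lambda i: weights[i], reverse=True)[:3]
print(weights)
print(top_positive)
```

Super-pixels with negative weights would be drawn in red and those with positive weights in green when both polarities are displayed, as in Figs. 16(b) and 17(b).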

6.4 Comparison

In this section, we compare the three methods used for explanations. Figure 18 shows some sample images and the results of applying the different XAI techniques to them. The explanation methods produce different visualizations, and each technique highlights different areas. However, all the explanation methods can identify appropriate regions related to sick symptoms. For example, they tend to highlight areas like red noses, red cheeks and swollen eyes as attributes that are accountable for sickness. These symptoms indicate potential sickness, such as red noses for flu, red cheeks for high temperature and fever, and swollen eyes for tiredness and an unwell body condition. Although the techniques do not highlight the same regions in all the explanations, the selected areas are reasonable (such as areas close to the nose, eyes and cheeks). Nevertheless, some highlighted regions, like the nostril area, do not yield a high confidence level for the explanation. This is a possible area for future improvement.

Fig. 18

Comparison of the results of the different XAI techniques.

6.5 Discussions

This section summarizes the main findings in this study:

Due to the limited number of images available for sick faces, the collected data may contain biases. This problem can be alleviated with transfer learning using the VGGFace model, which contains knowledge learnt from a large number of face images.

An integrated transfer learning approach with a CNN model is proposed to harness the general facial features learnt by VGGFace. The CNN model is further fine-tuned to extract more intrinsic features pertaining to healthy and sick faces.

Explainable AI is essential for giving the reasons why a decision is made. The XAI techniques highlight important regions such as the eyes, mouth, nose and cheeks that show symptoms of sickness, e.g. tired eyes and irritated skin.

There are also shortcomings in the explanations provided. For example, an unrelated area such as the nostril region, which does not contribute to the sickness symptoms, is highlighted. Weighting the importance of the feature attributions when deriving the results is left for future study.

7 Conclusion

A health prediction system based on facial features is crucial for future healthcare development. It can significantly benefit society, as the health condition of a person can be conveniently inferred from the appearance of the face. The results obtained from this study show that it is possible to detect symptoms of illness from faces. The proposed integrated transfer learning approach can determine with high accuracy whether someone is sick. Besides, the use of Explainable AI with different methods is investigated to seek explanations by visualizing the relevant attributions in the images. This helps to deal with the black-box nature of neural network models and gives the public more evidence for accepting the decisions made by the AI model. The application of Explainable AI can increase the transparency of the healthcare system, which is crucial in practical deployment. With observable explanations, the AI system can gain wider acceptance from the public.

In future work, more data will be collected to improve the experimental significance, covering diversity in race, ethnicity and age. Different data pre-processing techniques will also be explored to better recognize the sickness symptoms. Moreover, ad-hoc XAI techniques can be applied to the DL models. It would be beneficial to explore more XAI models, as each model could yield a different perspective on or interpretation of the result. A thorough analysis will be conducted to identify the combination of techniques that yields the most favorable performance.

References

1. Wang and Luo, Detecting Visually Observable Disease Symptoms from Faces, J Bioinform Sys Biology 2016 (2016), 13. doi:10.1186/s13637-016-0048-7.

2. Henderson A.J., Holzleitner I.J., Talamas S.N. and Perrett D.I., Perception of health from facial cues, Phil. Trans. R. Soc. B 371 (2016), 20150380. doi:10.1098/rstb.2015.0380.

3. Kong, Gong, Su, Howard and Kong, Automatic Detection of Acromegaly From Facial Photographs Using Machine Learning Methods, EBioMedicine 27 (2018), 94–102. doi:10.1016/j.ebiom.2017.12.015.

4. Singh and Kisku D.R., Detection of Rare Genetic Diseases using Facial 2D Images with Transfer Learning, in: 2018 8th International Symposium on Embedded Computing and System Design (ISED), IEEE, Cochin, India, 2018, 26–30. doi:10.1109/ISED.2018.8703997.

5. Majtner, Nadimi E.S., Yderstræde K.B. and Blanes-Vidal, Non-invasive detection of diabetic complications via pattern analysis of temporal facial colour variations, Computer Methods and Programs in Biomedicine 196 (2020), 105619. doi:10.1016/j.cmpb.2020.105619.

6. […], Ooi, Low, Lech and Allen, Prediction of clinical depression in adolescents using facial image analysis, in: International Workshop on Image Analysis for Multimedia Interactive Services, 2011, 1–4.

7. Borkowski A.A., Viswanadhan N.A., Thomas L.B., Guzman R.D., Deland L.A. and Mastorides S.M., Using Artificial Intelligence for COVID-19 Chest X-ray Diagnosis, medRxiv (2020), 2020.05.21.20106518. doi:10.1101/2020.05.21.20106518.

8. Lopes R.R., Bleijendaal, Ramos L.A., Verstraelen T.E., Amin A.S., Wilde A.A.M., Pinto Y.M., de Mol B.A.J.M. and Marquering H.A., Improving electrocardiogram-based detection of rare genetic heart disease using transfer learning: An application to phospholamban p.Arg14del mutation carriers, Computers in Biology and Medicine 131 (2021), 104262. doi:10.1016/j.compbiomed.2021.104262.

9. Han S.-H., Kwon M.-S. and Choi H.-J., EXplainable AI (XAI) approach to image captioning, The Journal of Engineering 2020 (2020), 589–594. doi:10.1049/joe.2019.1217.

10. Barredo Arrieta, Díaz-Rodríguez, Del Ser J., Bennetot, Tabik, Barbado, Garcia, Gil-Lopez, Molina, Benjamins, Chatila and Herrera, Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI, Information Fusion 58 (2020), 82–115. doi:10.1016/j.inffus.2019.12.012.

11. LeCun, Bengio and Hinton, Deep learning, Nature 521 (2015), 436–444. doi:10.1038/nature14539.

12. Wan, Yang, Huang, Zeng and Liu, A review on transfer learning in EEG signal analysis, Neurocomputing 421 (2021), 1–14. doi:10.1016/j.neucom.2020.09.017.

13. Das and Chandran, Transfer Learning with Res2Net for Remote Sensing Scene Classification, in: 2021 11th International Conference on Cloud Computing, Data Science Engineering (Confluence), 2021, 796–801. doi:10.1109/Confluence51648.2021.9377148.

14. […] S.Y., Ahn, Lee and Kang S.-J., Transfer Learning-based Vehicle Classification, in: 2018 International SoC Design Conference (ISOCC), 2018, 127–128. doi:10.1109/ISOCC.2018.8649802.

15. Ramdan, Heryana, Arisal, Kusumo R.B.S. and Pardede H.F., Transfer Learning and Fine-Tuning for Deep Learning-Based Tea Diseases Detection on Small Datasets, in: 2020 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), 2020, 206–211. doi:10.1109/ICRAMET51080.2020.9298575.

16. Ghosal and Sarkar, Rice Leaf Diseases Classification Using CNN With Transfer Learning, in: 2020 IEEE Calcutta Conference (CALCON), 2020, 230–236. doi:10.1109/CALCON49167.2020.9106423.

17. Sundararajan, Taly and Yan, Axiomatic Attribution for Deep Networks, ArXiv:1703.01365 [Cs]. (2017). http://arxiv.org/abs/1703.01365 (accessed October 20, 2020).

18. Kapishnikov, Bolukbasi, Viégas and Terry, XRAI: Better Attributions Through Regions, ArXiv:1906.02825 [Cs, Stat]. (2019). http://arxiv.org/abs/1906.02825 (accessed October 20, 2020).

19. Ribeiro M.T., Singh and Guestrin, “Why Should I Trust You?”: Explaining the Predictions of Any Classifier, ArXiv:1602.04938 [Cs, Stat]. (2016). http://arxiv.org/abs/1602.04938 (accessed October 29, 2020).

20. Viola and Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, IEEE Comput. Soc, Kauai, HI, USA, 2001, I-511-I-518. doi:10.1109/CVPR.2001.990517.

21. Mikołajczyk and Grochowski, Data augmentation for improving deep learning in image classification problem, in: 2018 International Interdisciplinary PhD Workshop (IIPhDW), 2018, 117–122.

22. Shorten and Khoshgoftaar T.M., A survey on Image Data Augmentation for Deep Learning, J Big Data 6 (2019), 60. doi:10.1186/s40537-019-0197-0.

23. Parkhi O.M., Vedaldi and Zisserman, Deep Face Recognition, in: Proceedings of the British Machine Vision Conference 2015, British Machine Vision Association, Swansea, 2015, 41.1–41.12. doi:10.5244/C.29.41.

24. Zhang, Song and Qi, Age Progression/Regression by Conditional Adversarial Autoencoder, ArXiv:1702.08423 [Cs]. (2017). http://arxiv.org/abs/1702.08423 (accessed October 17, 2020).