Facial expression recognition for stress detection: A Conv-XGBoost Algorithm approach

Abstract

Facial Emotion Recognition (FER) is a powerful tool for gaining insights into human behaviour and well-being by quantifying a wide range of emotions, especially stress, through the analysis of facial images. Detecting stress using FER entails examining subtle facial cues, such as changes in eye movements, brow furrowing, lip tightening, and muscle contractions. To ensure effectiveness and real-time processing, FER approaches based on deep learning and artificial intelligence (AI) techniques have been created using edge modules. This research introduces a novel approach for identifying stress, leveraging the Conv-XGBoost Algorithm to analyse facial emotions. The proposed model undergoes rigorous evaluation using key metrics such as the F1 score, validation accuracy, precision, and recall to assess its real-world reliability and robustness. This analysis and validation demonstrate the model’s practical utility in facial analysis. Integrating the Conv-XGBoost Algorithm with facial emotion analysis represents a promising and highly accurate solution for efficient stress detection. The method surpasses existing approaches in the literature and demonstrates significant potential for practical applications based on well-validated data.

Keywords

Stress, emotion recognition, Conv-XGBoost, deep learning, facial expression

1 Introduction

Facial expressions are a powerful form of non-verbal communication. Expressions can often convey emotions and feelings more effectively than words. When people are stressed, their facial expressions can provide immediate insight into their emotional state, even if they choose not to express it verbally. Many facial expressions associated with stress, such as furrowed brows, a tensed jaw, or a pained expression, are recognized as universal signs of distress. This universality makes them valuable indicators in stress detection, regardless of cultural or language differences. Stress detection using Facial Expression Recognition (FER) is one of several stress detection methods. Stress can be detected from physiological and psychological parameters. The psychological state of a human is gauged by emotions, which can be positive or negative. Positive emotions help individuals stay focused and achieve their goals, while negative emotions create stress and lead to psycho-physiological disorders. There are seven basic emotions which can be identified by facial expression, namely happiness, sadness, surprise, anger, disgust, fear, and neutral [1]. Facial expressions are integral to stress, psychology and mental health research. With care, FER can be applied in clinical settings to diagnose and assess anxiety disorders, post-traumatic stress disorder (PTSD), and depression.

Advances in machine learning and artificial intelligence have made it possible to develop systems that can automatically detect and analyze facial expressions to assess stress levels. Recently, deep learning has been the subject of substantial research to enhance the performance of facial expression detection systems [2, 3]. Deep learning tools have proven highly effective in emotion detection from various data sources, including text, speech, and images. Hugging Face Transformers, OpenSMILE, OpenCV, TensorFlow, DeepFace, and NVIDIA Deep Learning AI are a few of the deep learning tools used in emotion detection [4]. Convolutional Neural Networks (CNNs) are a class of deep neural networks designed for tasks involving images and visual data. CNNs have evolved, leading to various architectures designed for different tasks. The most potent CNN models are DeepLab, YOLO, SqueezeNet, Xception, and ResNet [5], and ideas for improving the CNN model, such as adding further convolution calculations and increasing network depth, are being developed.

According to a survey conducted in India for the Fit Report 22–23, titled Game Changing Health and Wellbeing Revolution in India, a startling 24% of Indians experience stress. The survey seeks answers to questions such as how the COVID pandemic has influenced mental health conditions and whether stress levels increased or decreased during the pandemic. Surprisingly, the survey shows that around 26% of Indians are stressed solely because of their present job conditions. So, there is an urgent need to implement a stress detection mechanism for the well-being of an organization. Stress can be detected through various stressors in the body [6], facial expressions, social media chats and other means. CNNs are excellent at automatically learning hierarchical features from facial images, which are a rich source of non-verbal cues and emotional information. By leveraging a CNN-based FER model, stress-induced facial expressions can be mapped to corresponding emotional states, aiding in stress detection. FER is a good choice for stress detection at the organization level because of its ease of implementation and non-invasive nature. The proposed work implements the Convolutional-XGBoost model (Conv-XGBoost) for facial expression recognition. The Conv-XGBoost model offers a powerful and versatile approach by introducing an enhanced classification stage over conventional convolutional models. The model was validated using the Facial Expression Recognition dataset. The seven basic emotions were mapped into two classes, positive and negative, which help to detect whether a person is stressed. The model’s performance is compared with CNN and ResNet-50 models, and the proposed Conv-XGBoost model shows remarkable validation accuracy. The pipeline architecture of the proposed model is shown in Figure 1.

Fig. 1

Emotions to Stress Detection.
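
The mapping from the seven basic emotions to a binary stress decision can be illustrated with a minimal sketch. The text does not state which emotion falls into which group, so the split below (and the names `NEGATIVE_EMOTIONS`, `POSITIVE_EMOTIONS` and `is_stressed`) is an assumption made purely for illustration.

```python
# Hypothetical grouping of the seven basic emotions into two valence classes.
# The paper does not list the grouping; this split is an assumption.
NEGATIVE_EMOTIONS = {"anger", "disgust", "fear", "sadness"}
POSITIVE_EMOTIONS = {"happiness", "surprise", "neutral"}


def is_stressed(emotion: str) -> bool:
    """Return True when the predicted emotion belongs to the negative group."""
    label = emotion.strip().lower()
    if label in NEGATIVE_EMOTIONS:
        return True
    if label in POSITIVE_EMOTIONS:
        return False
    raise ValueError(f"unknown emotion label: {emotion!r}")
```

Under this assumed grouping, a predicted label of "fear" would be flagged as stressed, while "happiness" would not.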

The remainder of the article is organized as follows. Section 2 reviews the literature on facial expression techniques, and Section 3 covers the background research. The general concept of the proposed system is explained in Section 4, and the experimental findings are presented in Section 5. The final section concludes this study and outlines its future scope.

2 Literature review

For millennia, both scientists and philosophers have been intrigued by facial expressions. For stress prediction, an efficient classifier is essential. Several studies have been conducted to date in which emotion detection was based on classifiers such as Fuzzy systems [7], Multi-Layer Perceptron [8], Convolutional Neural Networks [9] and many more. For psychiatrists, various detection tools are available: the State-Trait Anxiety Inventory (STAI) [10], Profile of Mood States (POMS) [11], Depression, Anxiety and Stress Scale-21 (DASS-21) [12], as well as cortisol level analysis, hormone level analysis and the Cornell Medical Index (CMI) [13]. Hence, the selection of classifiers plays a crucial role in AI-related tasks, especially in the field of machine learning. Machine learning has been used in a wide range of fields, including healthcare, remote sensing, seismic data analysis, geological exploration and structural health monitoring [14].

Numerous studies have demonstrated the effectiveness of stress detection from facial expressions. Mehrabian et al. [15] stated that 7% of information is conveyed verbally, 38% is conveyed by voice modulation, pace and rate of speech, and 55% is conveyed through facial expression. This highlights the importance of facial expression in identifying emotions. The findings substantiate that, by correctly classifying emotions, stress levels in an individual can be analysed. Developing an emotion detection model for stress can be valuable in various fields, including mental health, healthcare, and stress management. The proposed model aims to implement an effective emotion detection model for stress prediction. During the development of any stress detection model, it is important to consider ethical and privacy concerns, as well as the need for robust and accurate algorithms. Finally, any such model should be developed and used in a way that respects the individual’s privacy and consent while providing valuable insights and support.

The research findings in facial expression analysis using several classification techniques and numerous datasets are summarized in Table 1. For the literature survey, machine learning and deep learning techniques were considered. Using deep learning [16], emotions are classified from facial expressions. A multimodal approach was developed by considering emotions along with physiological signals. The signals were collected using Blood Volume Pulse sensors. The model shows an accuracy of 81.545% with physiological signals, 99.9% with facial expression and 86.2% with a combination of physiological and facial datasets. A FER system with BiLSTM and CNN [17] tries to avoid the dying ReLU problem by using the ELU activation function. The research shows the overfitting problem and how data augmentation can solve it. The accuracy of the model is 81.82% before augmentation and 99.43% after augmentation. A Committee neural network [18] was introduced for mood recognition. The work incorporates two separate networks: specialized and generalized committee networks. With these two networks, an integral committee was proposed. The integral committee shows an accuracy of 90.43%, with 255 of 282 images correctly classified.

Table 1
Various Datasets and Classifiers in FER

Reference	Dataset	Technique	Findings
Oh, S and Kim et al.	Real time dataset	Deep Learning	Four basic emotions were considered and a multimodal approach using facial and physiological signals was used.
Febrian R et al. [17]	CK+	BiLSTM and CNN	Data augmentation was done to avoid overfitting and min-max normalization with ReLU activation function.
Kulkarni S. S. et al. [18]	CK	Committee neural network	Seven basic emotions were taken and real and binary facial parameters were considered.
Wawage and Yogesh [19]	Real time dataset	Convolutional Neural Network	For aiding drivers, Viola and Jones algorithm was used.
Viegas C et al. [20]	Real time dataset	LDA, NB, DT and RF	Stress detection from Facial Action Units was done.
Ucar et al. [21]	CK, Japanese Female Facial Expression dataset	Curvelet transform	Binary classification model where action units from video frames were considered.
Gay V et al. [22]	Real time dataset	Spatial approach	Application for assisting children with autism.
Barlett M.S et al. [23]	CK, DFAT	SVM, AdaSVM	Application for human computer interaction in which filters with Haar function were used.
Hasani and Mahoor et al. [24]	CK+, MMI, FERA, DISFA	3D inception-ResNet, LSTM	Extracted temporal and spatial relationships between facial images across several video frames.
Chao L et al. [25]	EMOTIW	CNN, MTCNN	Feature extraction using ResNet-64, Multitask cascaded convolutional networks for face detection.
Duncan D et al. [26]	CK+, JAFFE	Transfer learning	Facial expressions were identified using a pre-trained CNN model.
Han X and Du Q et al. [27]	LFW FACE	ConvNet, FaceNet	Based on deep learning in the realm of biometrics, facial features were the emphasis.

An emotion prediction system for car drivers was developed using CNN [19] with real-time data. For this CNN model, the researchers used transfer learning to address problems such as training time and the construction of convolution layers. Features like class loss to entropy loss were considered, and the model achieved an accuracy of 87%. Stress detection can be done with Facial Action Units (FAU) [20] using video data. With the action units, the model performed binary classification using a variety of straightforward classifiers and obtained an accuracy of 74% for subject-independent classification and 91% for subject-dependent classification. Using the Curvelet transform [21], FER can be done with a radial basis function-based online sequential extreme learning machine. This model tries to overcome the time problem in finding the hidden nodes by introducing spherical clustering. For autistic children, an aid was developed to understand their emotions by analyzing facial expressions [22]. The model made its prediction by identifying the arousal and valence levels. As an application towards human-computer interaction, a real-time FER system was developed [23]. Classification was performed using a Support Vector Machine (SVM) and a combined approach using SVM and AdaBoost called AdaSVM. The model shows an accuracy of 93.3%.

The literature analysis shows a clear need for a stress detection model for assessing stress at the organizational level that is comfortable and easy to use. As the economy relies heavily on the productivity of organizations, an adequate algorithm for identifying stress is highly relevant. Using these benchmark references, the proposed work develops an emotion detection system for organizational well-being that can overcome the drawbacks of existing methods. The model uses periodic analysis for stress detection and is easy to use. In addition, an efficient emotion classification using FER was performed, which led to an accurate stress prediction.

3 Background study

The Convolutional eXtreme Gradient Boosting method is a pioneering deep learning architecture for classification problems. Convolutional XGBoost achieves cutting-edge performance with excellent accuracy by combining the strengths of a Convolutional Neural Network (CNN) [28] and eXtreme Gradient Boosting (XGBoost) [29]. The careful way in which these two models are combined is one of the main strengths of Convolutional XGBoost [30, 31]. Data scientists frequently utilize XGBoost, a scalable machine-learning system for tree boosting that limits overfitting and model complexity. So, a deep insight into CNN, XGBoost and the combined version is essential for understanding the proposed methodology.

3.1 Convolutional neural network (CNN)

Convolutional neural networks (CNNs) have emerged due to contemporary advancements in Artificial Intelligence and support novel image-processing technology that could supplant conventional methods [32]. The model has demonstrated outstanding performance in several computer vision tasks, including image classification, object identification, and image segmentation. The fundamental idea behind this model is to use convolutional layers to learn hierarchical patterns and features from the input data automatically. These patterns can represent different levels of abstraction. They range from simple features like edges and texture to more complex features like shapes and objects. A CNN consists of:

- Convolutional Layer: This layer applies convolutional operations to the raw input data, which helps to detect local features in the input data.
- Activation Function: An activation function is applied element-wise after convolution to introduce non-linearity into the network.
- Pooling Layer: Pooling layers downsample the spatial dimensions of the input, reducing the computational complexity and the risk of overfitting.
- Fully Connected Layer: A set of convolutional and pooling layers may be followed by one or more fully connected layers that carry out classification or regression tasks using the high-level features discovered by the preceding layers.
- Softmax Layer: For classification tasks, a softmax layer is often used at the network’s end to convert the network’s outputs into class probabilities.
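
As a small illustration of the last component, the softmax conversion from raw scores to class probabilities can be sketched in plain Python. The function name and the three example scores are illustrative assumptions, not values from this study.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for three classes.
probs = softmax([2.0, 1.0, 0.1])
```

The largest score receives the largest probability, and the ordering of the classes is preserved.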

The mathematical formulation of a CNN can be briefly explained as follows. For an input grey image J, after applying filter K, the output of the l^th layer for the i^th feature map is obtained from the output of the preceding layer by Equation (1).
$y_{i}^{(l)} = \phi \left( B_{i}^{(l)} + \sum_{j = 1}^{f^{(l - 1)}} K_{i, j}^{(l)} * y_{j}^{(l - 1)} \right)$
(1)

Here $\phi$ denotes the activation function, $B_{i}^{(l)}$ the bias matrix, $K_{i, j}^{(l)}$ the filter connecting the j^th feature map of layer (l - 1) to the i^th feature map of layer l, and $f^{(l - 1)}$ the number of feature maps in layer (l - 1). $y_{j}^{(l - 1)}$ is the j^th feature map generated by convolution in the previous layer. The final layer of the CNN model is a fully connected layer, which takes as its input the output of the preceding pooling layer stretched into a single column vector. As in a multilayer perceptron, if l denotes a fully connected layer and the preceding layer holds $f_{1}^{(l - 1)}$ feature maps, each of size $f_{2}^{(l - 1)} \times f_{3}^{(l - 1)}$ [33], then the i^th unit of the l^th layer is calculated using Equation (2).
$y_{i}^{(l)} = \phi \left( \sum_{p = 1}^{f_{1}^{(l - 1)}} \sum_{q = 1}^{f_{2}^{(l - 1)}} \sum_{r = 1}^{f_{3}^{(l - 1)}} w_{i, p, q, r}^{(l)} \left( y_{p}^{(l - 1)} \right)_{q, r} \right)$
(2)
where $w_{i, p, q, r}^{(l)}$ denotes the weight that connects the unit at position (q, r) in the p^th feature map of the (l - 1)^th layer to the i^th unit of the l^th layer. $\phi$ is the same activation function and $y_{p}^{(l - 1)}$ is the p^th feature map of the preceding layer.
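
To make Equation (1) concrete, the following is a minimal pure-Python sketch of one output feature map. It assumes stride 1, no padding, a scalar bias instead of a bias matrix, ReLU as the activation, and the unflipped (cross-correlation) form of the filter operation that deep learning libraries commonly use. All function names are illustrative.

```python
def relu(v):
    """Activation used in this sketch (an assumption for illustration)."""
    return v if v > 0.0 else 0.0


def correlate2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[r + a][c + b]
                 for a in range(kh) for b in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_feature_map(prev_maps, kernels, bias, activation=relu):
    """Equation (1) for one output map: phi(B + sum_j K_j * y_j)."""
    partial = [correlate2d_valid(y, k) for y, k in zip(prev_maps, kernels)]
    rows, cols = len(partial[0]), len(partial[0][0])
    return [[activation(bias + sum(p[r][c] for p in partial))
             for c in range(cols)]
            for r in range(rows)]


# Hypothetical 3 x 3 input map and 2 x 2 filter.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, -1]]
feature_map = conv_feature_map([image], [kernel], bias=5.0)
```

Each filter response here is -4, so with a bias of 5 every entry of the 2 x 2 output map equals 1 after the activation.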
3.2 Extreme gradient boosting

Extreme Gradient Boosting (XGBoost) is a robust and prevalent machine-learning technique that belongs to the gradient boosting class and was proposed by Chen and Guestrin [34]. XGBoost is a classification technique based on Classification And Regression Trees (CART). Here, the sum of the prediction scores of the individual trees represents the final forecast for a sample with class label y_i, as shown in Equation (3).

$\hat{y}_{i} = \sum_{k = 1}^{K} f_{k} (x_{i}), \quad f_{k} \in F$
(3)

Here x_i denotes a member of the training set, f_k (x_i) is the leaf score assigned by the k^th tree, K is the total number of trees and F is the space of all CARTs.
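
The additive prediction in Equation (3) can be sketched with a toy ensemble. The three one-split trees and their leaf scores below are hand-picked assumptions for illustration. They are not learned by boosting and do not come from this study.

```python
def stump(feature_index, threshold, left_score, right_score):
    """Build a one-split regression tree f_k that returns a leaf score."""
    def f_k(x):
        return left_score if x[feature_index] < threshold else right_score
    return f_k


# Hypothetical ensemble of K = 3 trees with hand-picked leaf scores.
trees = [
    stump(0, 0.5, -0.4, 0.6),
    stump(1, 1.0, 0.1, -0.2),
    stump(0, 0.8, -0.1, 0.3),
]


def predict(x, ensemble):
    """Equation (3): sum the leaf scores of all trees for sample x."""
    return sum(f_k(x) for f_k in ensemble)


y_hat = predict([0.7, 2.0], trees)
```

For the sample [0.7, 2.0], the trees contribute 0.6, -0.2 and -0.1, giving a raw score of about 0.3.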

3.3 Convolutional extreme gradient boosting (Conv-XGBoost)

Convolutional Extreme Gradient Boosting (Conv-XGBoost) is an improved version of the Extreme Gradient Boosting algorithm, a popular and effective gradient boosting formulation for machine learning. Like the original XGBoost technique, Conv-XGBoost is intended to rapidly train decision tree models for applications requiring classification, ranking, and regression. Conv-XGBoost reduces model complexity and the number of parameters needed for prediction by combining the benefits of CNNs and Extreme Gradient Boosting. To implement this functionality, CNNs without pooling or fully connected layers are used, and the final layer is XGBoost. As a result, there is a lower chance of overfitting, and the model is more effective and more straightforward to train. The architectural model of Conv-XGBoost consists of six essential layers that perform the key functionalities. These layers carry out two main functions: the first is feature learning, and the second is the prediction of class labels. The Conv-XGBoost architecture is shown in Figure 2.

Fig. 2

Convolutional Extreme Gradient Boost Architecture.

The feature learning part is performed by the first three layers, namely the input layer, the data pre-processing layer and the convolution layers. The key features are learned from the training data in this part. The model’s initial layer, known as the input layer, is in charge of handling input. We presume that a training data set, X, is made up of tuples (x_j, y_j), where j is the data set index. Here x_j denotes the feature matrix of dimension $\sqrt{N} \times \sqrt{N}$, and y_j is the class label corresponding to x_j. The critical layers in this part are the convolutional layers, which apply a convolution and an additive bias to the input data and are in charge of feature learning.

Class label prediction is done in the second part. As the learned features are used to make predictions, the input to the prediction part must be a vector. Thus, the tensor obtained from feature learning is transformed into a vector at the reshape layer. In the final stage, class label prediction is performed by the class prediction layer using Extreme Gradient Boosting.
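
The role of the reshape layer can be sketched as a simple flattening step. The nested-list tensor and the function name below are illustrative assumptions. A real implementation would normally rely on the tensor library’s own reshape operation.

```python
def flatten(tensor):
    """Recursively unroll a nested list (tensor) into a flat feature vector."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


# Hypothetical 2 x 2 x 2 output of the convolution layers.
feature_tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
feature_vector = flatten(feature_tensor)
```

The resulting eight-element vector is the kind of input a tree-based classifier expects for each sample.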

4 Facial expression recognition using Conv-XGBoost Algorithm

The proposed FER detection technique has two main parts: The Convolutional Model followed by an XGBoost Model. The output of the convolutional model is collected by an intermediate layer called the dense layer, which controls the input to the XGBoost model. The detailed algorithmic procedure is described in Algorithm.

Intermediate features extracted from the trained CNN model help to leverage the interpretability and generalization capabilities of gradient boosting. The proposed model creates an intermediate layer model to extract features from the my_dense layer, which is the output layer of the CNN. These features serve as input to the XGBoost model. XGBoost, a robust gradient-boosting algorithm, is used to build an emotion classification model. We initialize an XGBoost classifier with the objective set to multi:softprob, which is the key parameter setting for multi-class classification. The classifier is trained on the intermediate features extracted from the CNN model, using metrics like log loss. This metric can also be used as the evaluation tool. The XGBoost Algorithm in the proposed model has two significant steps: Initializing the XGBoost Classifier and Training the XGBoost Model.

4.1 Initializing the XGBoost classifier

The steps involved in the initialization phase are:

1. Initializing an XGBoost classifier using the XGBClassifier class. This classifier is suitable for multi-class classification tasks.
2. Setting key parameters:
objective = multi:softprob: This parameter specifies that the objective is multi-class classification with soft probabilities. It means the classifier will output class probabilities for each sample.
num_class = 7: Indicates the number of classes in the emotion classification task.
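
The two settings above can be collected in a small parameter dictionary. This is only a sketch: the helper name `build_xgb_params` is hypothetical, and the xgboost package itself is deliberately not imported so that the snippet stays self-contained.

```python
NUM_EMOTION_CLASSES = 7


def build_xgb_params(num_classes=NUM_EMOTION_CLASSES):
    """Collect the settings named in the text into one parameter dictionary."""
    if num_classes < 2:
        raise ValueError("a classifier needs at least two classes")
    return {
        "objective": "multi:softprob",  # per-class probabilities for each sample
        "num_class": num_classes,       # seven basic emotions in this study
    }


xgb_params = build_xgb_params()
# With the xgboost package installed, the dictionary could be unpacked as
# xgboost.XGBClassifier(**xgb_params); that call is not executed here.
```

Keeping the settings in one place makes it easy to reuse the same configuration when the number of emotion classes changes.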

4.2 Training the XGBoost model

The steps involved in training the XGBoost model are:

1. The XGBoost classifier is trained using intermediate features extracted from the trained CNN model.
2. The training data for XGBoost consists of these intermediate features, and the labels represent the emotions associated with the images.
3. Evaluation metric: eval_metric = mlogloss specifies that the model should be evaluated based on the multi-class logarithmic loss (cross-entropy loss) during training.
4. eval_set: the model’s performance is evaluated on the same dataset used for training. This can help to monitor overfitting during training.
5. verbose = True: Setting verbose to True displays training progress, including evaluation metrics for each boosting round.
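
The multi-class logarithmic loss named in step 3 can be written out directly. The sketch below is a plain-Python illustration with made-up labels and probabilities. It is not the library’s internal implementation, and the clipping constant `eps` is an assumption.

```python
import math


def multiclass_log_loss(labels, probabilities, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for y, probs in zip(labels, probabilities):
        p = min(max(probs[y], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(labels)


# Hypothetical true labels and predicted class probabilities.
labels = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]
loss = multiclass_log_loss(labels, probabilities)
```

A lower value means that the classifier places more probability on the correct emotion class, and a perfect prediction drives the loss towards zero.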

Following the algorithmic procedure explained above, the time complexity of the algorithm can be effectively reduced to $O (L d^{2} m n p q)$, where L denotes the number of layers in the network and d the number of output channels, for an input matrix of size (m×n) and a filter of size (p×q). The detailed layer-wise action of the CNN followed by the XGBoost classifier is described below.
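
As a rough worked example of this bound, the product inside the big-O term can be evaluated for a 48×48 input. The layer count of four matches the convolutional model described in this section, while the channel count of 32 is an assumed value chosen only to show how the filter size scales the cost.

```python
def conv_operation_estimate(num_layers, channels, height, width, k_h, k_w):
    """Evaluate L * d^2 * m * n * p * q, the term inside the big-O bound."""
    return num_layers * channels ** 2 * height * width * k_h * k_w


# Hypothetical setting: 4 layers, 32 channels, 48 x 48 input.
ops_3x3 = conv_operation_estimate(4, 32, 48, 48, 3, 3)  # 84,934,656
ops_5x5 = conv_operation_estimate(4, 32, 48, 48, 5, 5)  # 235,929,600
```

Moving from a 3×3 to a 5×5 filter multiplies the estimate by 25/9, which is why the filter size matters for the overall cost.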
4.3 Convolutional model

The Conv-XGBoost algorithm has four convolutional layers, and after preprocessing, the image is given to the initial input layer. The input layer defines the shape of the input data. In the model, the input images are grayscale images with a size of 48×48 pixels and a single channel (indicated by a channel value of 1). The specification of the image given to the input layer after preprocessing is shown in Table 2. Preprocessing was performed using the Keras utility ImageDataGenerator from TensorFlow. The images were converted to grayscale, and the resulting images were resized and rescaled. This preprocessed data was used in the training and testing datasets.

Table 2
Image given to input layer

Selected Class	Happy
Image Format	JPEG
Image Model	L
Image Size	(48,48)
Image Width	48
Image Height	48
Number of Channels	1

Using the Keras ImageDataGenerator, which builds batches of tensor image data with real-time data augmentation, the image is normalized to the range between 0 and 1. The Keras ImageDataGenerator receives the original data as input, randomly modifies it, and returns a result that contains only the newly altered data. Data augmentation is also done using this Keras method to improve the generalization of the model as a whole. The proposed CNN architecture consists of convolutional layers, batch normalization, activation functions, max-pooling, and dropout layers. The complete methodological framework of the Conv-XGBoost algorithm’s first half, the convolution model, is shown in Figure 3. The four convolutional layers process the input image, while a dense layer collects the result.

Fig. 3

CNN Framework.
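
The normalization to the range 0 to 1 described in this section amounts to dividing each 8-bit pixel intensity by 255. The sketch below shows only that arithmetic on a made-up patch. It does not reproduce the Keras generator, where such scaling is typically requested through its `rescale` argument.

```python
def rescale_image(pixels, max_value=255):
    """Map 8-bit grayscale intensities to floats in the range [0, 1]."""
    return [[value / max_value for value in row] for row in pixels]


# Hypothetical 2 x 3 patch of an 8-bit grayscale image.
patch = [[0, 51, 102], [153, 204, 255]]
scaled = rescale_image(patch)
```

After rescaling, black maps to 0.0 and white maps to 1.0, which keeps the network inputs in a small, consistent range.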

4.4 Extreme gradient boost model (XGBoost Model)

Extreme Gradient Boosting, or XGBoost, is a well-known and effective machine learning method for supervised learning problems, especially in regression and classification. This gradient boosting framework is acclaimed for its effectiveness, speed, and performance across various machine learning challenges and practical applications. XGBoost incorporates L1 (Lasso) and L2 (Ridge) regularisation terms into the objective function to prevent overfitting. Regularization helps to control the complexity of the learned model.

4.4.1 Convolutional layers

A series of learnable filters is applied to the input image by the convolutional layers, which aids the model’s ability to recognize regional patterns and characteristics. A convolutional layer takes images as input and utilizes trainable weights and biases to distinguish between images. These biases and weights are applied in the hidden layers. The main objective of the convolutional layers is feature extraction [35]: they automatically recognize and extract valuable features from the incoming data.

The model uses multiple convolutional layers with different filter sizes and activation functions. In CNN, convolution is accomplished by applying a filter or kernel to an input image. Here, we use two filters of size 3×3 and 5×5. The values in the input represent the intensity or amplitude of the signal or the image’s pixel values.

4.4.2 Normalization

An output feature map represents the result that the convolution operation produces. This feature map records the relationship between the source data and the filter at each location. The output feature map’s size depends on factors like the input size, filter size, stride, and padding. Batch normalization is applied after each convolutional layer. This helps to stabilize and accelerate training by normalizing the activations of the previous layer.
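
The dependence of the output size on input size, filter size, stride and padding follows the standard formula floor((n + 2p - k) / s) + 1. The sketch below applies it to the 48×48 input and the 3×3 and 5×5 filters mentioned in this section. The stride and padding values are assumptions, since the text does not state them.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size after convolution: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - filter_size) // stride + 1


size_3x3 = conv_output_size(48, 3)               # 46 with no padding
size_5x5 = conv_output_size(48, 5)               # 44 with no padding
size_same = conv_output_size(48, 3, padding=1)   # 48 with one-pixel padding
```

With one pixel of padding, a 3×3 filter preserves the 48×48 spatial size, while unpadded filters shrink the map slightly.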

4.4.3 Non-linearity

Convolving a filter over the input image affects the image dimensions, and the resulting output passes through one more stage known as a nonlinearity function or activation function. The network’s nonlinearity is implemented by determining whether to activate a neuron in response to an input. The activation used is the Rectified Linear Unit (ReLU), a simple and widely used activation function defined by Equation (4).

$F (x) = \max (0, x)$
(4)

where x indicates the input. The output of the function is 0 for negative inputs and equals x otherwise, so it ranges from 0 up to the input value.
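
Equation (4) translates directly into code. The input values below are arbitrary examples chosen to show both the clamped and the pass-through cases.

```python
def relu(x):
    """Equation (4): pass positive inputs through and clamp the rest to zero."""
    return max(0.0, x)


activations = [relu(v) for v in [-2.0, -0.5, 0.0, 1.5, 3.0]]
```

Negative inputs become 0.0 while positive inputs are returned unchanged.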

4.4.4 Pooling layers

The size of the convolved features can be reduced efficiently by introducing a pooling layer after the convolution layer. As pooling substantially reduces dimensionality, the computational power required for data processing is lowered. Pooling has many benefits, such as translation invariance, feature abstraction, preventing overfitting, improving computational stability and integrating local features. The pooling technique used is max-pooling [36] with a pool size of (2,2). Max-pooling layers downsample the spatial dimensions of the feature maps created by the convolutional layers.
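
Max-pooling with a (2,2) pool size can be sketched in a few lines. The function below assumes non-overlapping windows (stride 2) and drops any trailing odd row or column. The 4×4 feature map is a made-up example.

```python
def max_pool_2x2(feature_map):
    """Downsample by taking the maximum of each non-overlapping 2 x 2 window."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


# Hypothetical 4 x 4 feature map.
fmap = [[1, 3, 2, 0],
        [4, 6, 1, 2],
        [7, 2, 9, 4],
        [1, 0, 3, 8]]
pooled = max_pool_2x2(fmap)
```

The 4×4 map is reduced to 2×2, keeping only the strongest response in each window.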

4.4.5 Dropout layers

Dropout layers are included after specific convolutional layers. Dropout is a regularisation strategy that, during each training iteration, randomly sets a portion of the input units to zero. As a result, there is less reliance on particular neurons, which helps to prevent overfitting. It forces the network to learn more robust and distributed representations. Dropout can be seen as a form of ensemble learning.
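
The random zeroing described above can be sketched as follows. This uses the common "inverted" variant, in which surviving units are rescaled by 1 / (1 - rate) during training. Whether the proposed model uses this variant, and the rate of 0.5, are assumptions for illustration.

```python
import random


def dropout(values, rate, rng):
    """Zero each unit with probability `rate` and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
units = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(units, 0.5, rng)
```

Each output is either zero or twice the original value, so the expected activation stays the same while individual neurons are silenced at random.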

4.4.6 Fully connected layers

The model flattens the feature maps into a one-dimensional vector after they pass through the convolutional and max-pooling layers. The flattened vector is passed through fully connected layers. These layers learn global patterns and relationships in the feature space. The final layer consists of seven neurons, each representing one of the possible emotion classes. This layer uses a softmax activation function to transform the outputs of the model into class probabilities. The complete architectural diagram of the proposed Conv-XGBoost model is shown in Figure 4.

Fig. 4

Convolutional XGBoost Architecture for emotion classification.

5 Experimentation and discussions

The Facial Expression Recognition dataset, which is a publicly available dataset, was considered for the algorithm analysis [37]. The dataset has seven emotions, labelled as disgust, anger, fear, happy, neutral, surprise and sad, and covers more than three thousand different human faces, including males, females and infants. The Conv-XGBoost algorithm uses 200 samples from each of the seven primary emotion classes for training and 100 from each for validation. Thus, a total of 1400 images were used as training data and 700 images for validation. The details of data usage for the Conv-XGBoost algorithm are shown in Table 3.

Table 3
METADATA

Class Label	Class	Training Images	Validation Images	Total Images
0	Neutral	200	100	300
1	Happy	200	100	300
2	Fear	200	100	300
3	Angry	200	100	300
4	Surprise	200	100	300
5	Disgust	200	100	300
6	Sad	200	100	300

All image dimensions were set to (48×48×1): the height and width are 48×48, and the channel count is 1 for greyscale. Thus, all images were resized, and the number of classes was seven. Sample emotions from the Facial Expression Recognition dataset are shown in Figure 5. The training images were given to the Conv-XGBoost classifier, where the CNN model is trained and validated using the input data generators. To achieve the best results, various models and numbers of epochs were run and tested in the Conv-XGBoost classifier. Two types of experiments were conducted: the first determines the accuracy of facial expression recognition, while the second computes the precision of stress detection. The performance analysis parameters considered were accuracy, F1 score, precision, and recall; the F1 score is given by Equation (5).

\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (5)

The F1 score combines precision and recall into a single metric for assessing a model’s accuracy. It is particularly useful for imbalanced datasets, in which one class considerably outnumbers the others. As Equation (5) shows, the F1 score is the harmonic mean of precision and recall. Precision and recall depend on the True Positive (TP), False Positive (FP) and False Negative (FN) counts.
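As a minimal sketch of how these quantities relate, the snippet below computes precision as TP/(TP+FP), recall as TP/(TP+FN), and the F1 score of Equation (5) from raw counts. The function name and the example counts are hypothetical and are not taken from the experiments reported here.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from raw TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # Equation (5): harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for a single emotion class (illustration only)
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a class with high precision but low recall still receives a modest F1 score.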

Fig. 5

Seven basic emotions from the Facial Expression Recognition dataset.

5.1 Model comparison

The model evaluation was done in two phases to analyse performance: first with the basic CNN architecture and second with the Conv-XGBoost architecture. To compare efficiency, the model was also compared with the ResNet-50 model, since ResNet is considered one of the strongest deep learning models. ResNet comes in several variants, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152. The 50-layer ResNet uses the bottleneck building block. A bottleneck residual block, also referred to as a "bottleneck", uses 1×1 convolutions to cut down the number of parameters. This makes each layer’s training significantly faster, and instead of a stack of two layers, the block employs three layers. The network begins with a convolution of 64 kernels of size 7×7 with a stride of 2. The result is passed to bottleneck blocks made of 1×1, 64-kernel, 3×3, 64-kernel and 1×1, 256-kernel convolutions; repeating this three-layer block three times generates nine layers in this step. The further layer generation is shown in Figure 6.
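The parameter saving from the 1×1 convolutions can be illustrated with a short weight count. This is only a sketch under stated assumptions: the block input is taken to have 256 channels, biases and normalization parameters are ignored, and the helper name `conv_weights` is hypothetical.

```python
def conv_weights(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


IN_CHANNELS = 256  # assumed width of the block input

# Bottleneck block: 1x1 reduce, 3x3 on the reduced width, 1x1 restore
bottleneck = (
    conv_weights(1, IN_CHANNELS, 64)
    + conv_weights(3, 64, 64)
    + conv_weights(1, 64, 256)
)

# Plain block for comparison: two stacked 3x3 convolutions at full width
plain = 2 * conv_weights(3, IN_CHANNELS, 256)

print(bottleneck, plain, round(plain / bottleneck, 1))
```

Under these assumptions the three-layer bottleneck needs 69,632 weights against 1,179,648 for the two-layer plain block, which is why the deeper ResNet variants can add layers without a proportional growth in parameters.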

Fig. 6

ResNet-50.

In this experiment, each model was trained on the Facial Expression Recognition dataset for 50 and for 100 epochs. One epoch is one complete pass of the learning algorithm over the training dataset, so the number of epochs sets how many times the complete dataset is processed. In these experiments, accuracy increased with the number of epochs. Each model was evaluated, and performance parameters such as training and validation accuracy, F1 score, and training and validation loss were calculated.

With the basic CNN architecture, the model reaches a training accuracy of 95.20% and a validation accuracy of 83.14%. On the same dataset, the ResNet-50 model reaches a training accuracy of 96.40% and a validation accuracy of 86.13%. Labelled training data is used to train the model to categorize images into the emotion categories. During training, the model adjusts its parameters to reduce the categorical cross-entropy loss, which gauges the discrepancy between predicted and actual class probabilities. The model’s weights and biases are updated iteratively using backpropagation and the Adam optimizer. The evaluation results for the CNN model are shown in Table 4. Here, model loss is a statistic that penalizes the model when it fails to predict the input data accurately.
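To make the loss concrete, the sketch below evaluates the categorical cross-entropy for a single image with a one-hot true label, where the loss reduces to the negative logarithm of the probability assigned to the true class. The probability vectors are hypothetical softmax outputs over the seven classes of Table 3, not values produced by the models in this study.

```python
import math


def categorical_cross_entropy(true_class, probs):
    """Loss for one sample: negative log of the probability of the true class."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probs[true_class], eps))


# Hypothetical predicted probabilities, ordered by the class labels 0-6 of Table 3
confident = [0.02, 0.90, 0.01, 0.02, 0.02, 0.01, 0.02]
uncertain = [0.20, 0.30, 0.10, 0.10, 0.10, 0.10, 0.10]

HAPPY = 1  # class label of "Happy" in Table 3
loss_confident = categorical_cross_entropy(HAPPY, confident)
loss_uncertain = categorical_cross_entropy(HAPPY, uncertain)
print(round(loss_confident, 4), round(loss_uncertain, 4))
```

Both predictions pick the correct class, but the less confident one is penalized more heavily, which is the behaviour the optimizer exploits when it adjusts the weights.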

Table 4

Result of evaluation matrix using CNN Model

Evaluation Matrix	Result
Training_loss	0.02008
Training_accuracy	0.9520
Validation_accuracy	0.8314
Validation_loss	0.6952

After training, the XGBoost model can make predictions for emotion classification. Several performance indicators are computed to assess the model’s efficacy and accuracy in classifying emotions. The proposed algorithm, Conv-XGBoost emotion recognition, reaches a training accuracy of 99.92%, higher than that of the basic CNN architecture. The performance results of Conv-XGBoost are shown in Table 5. Among the three models, the Conv-XGBoost model shows the highest validation accuracy, 93.02%, and the comparison of the three models is shown in Figure 9.

Table 5

Result of evaluation matrix using Conv-XGBoost Model

Evaluation Matrix	Result
Training_loss	0.00102
Training_accuracy	0.99927
Validation_accuracy	0.93021
Validation_loss	0.34002

Fig. 7

Accuracy and loss for 50 epochs.

Fig. 8

Accuracy and loss for 100 epochs.

Fig. 9

Validation Comparison

A detailed comparison of the three models, CNN, ResNet-50 and Conv-XGBoost, is shown in Table 6. The models were compared at 50 and 100 epochs in terms of training and validation accuracy. The CNN model shows validation accuracies of 78.25% and 83.14% for 50 and 100 epochs, respectively. The ResNet-50 model shows 82.92% and 86.13%, while the proposed model shows 89.68% and 93.02%, which substantiates the performance of the Conv-XGBoost model.

The proposed model shows a precision, which measures the fraction of positive predictions that are correct, of 99.92%; a recall, which measures the capacity to detect positive examples, of 99.881%; and an F1 score, which balances precision and recall, of 99.92%. The performance graphs for the model are shown in Figure 7, which compares training and validation accuracy and loss for 50 epochs, and in Figure 8, which does the same for 100 epochs.

The performance of the model is summarised using the confusion matrix, a fundamental tool in the evaluation of classification models. It provides a comprehensive summary of a classification algorithm’s performance by breaking down the number of correct and incorrect predictions made by the model. The confusion matrix comparing the model’s predictions with the true labels for the validation dataset is shown in Figure 10.
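A confusion matrix of this kind can be built by counting (true label, predicted label) pairs. The sketch below assumes integer class labels and uses a small hypothetical three-class example for brevity; it does not reproduce the values of Figure 10.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels (illustration only)
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
# Correct predictions lie on the diagonal
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)

for row in cm:
    print(row)
print(f"accuracy={accuracy:.3f}")
```

Off-diagonal entries show which classes are confused with each other, information that a single accuracy figure hides.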

To evaluate how accurately Conv-XGBoost detects stress from the facial expressions identified in the first experiment, a follow-up experiment was carried out. Conv-XGBoost examines the facial expression and classifies each input face as either stressed or non-stressed. Happiness, neutrality, and surprise are categorised as positive emotions in the Facial Expression Recognition dataset, whereas sadness, anger, disgust, and fear are classified as negative emotions. Hence the seven basic emotions are grouped into two classes, positive and negative. Later, the classifier considers a two-input module, and the stress detection module classifies whether the person is stressed or non-stressed.
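The grouping described above can be sketched as a simple lookup from the class labels of Table 3 to a stress label. The sketch assumes that positive emotions map to non-stressed and negative emotions to stressed; the names `EMOTIONS`, `POSITIVE` and `stress_label` are hypothetical and do not come from the implementation used in this study.

```python
# Class labels 0-6 in the order given in Table 3
EMOTIONS = ["Neutral", "Happy", "Fear", "Angry", "Surprise", "Disgust", "Sad"]

# Positive emotions as grouped in the text; all remaining emotions are negative
POSITIVE = {"Happy", "Neutral", "Surprise"}


def stress_label(class_index):
    """Map a predicted emotion class index to a binary stress label."""
    emotion = EMOTIONS[class_index]
    return "non-stressed" if emotion in POSITIVE else "stressed"


for index, emotion in enumerate(EMOTIONS):
    print(index, emotion, stress_label(index))
```

With this mapping, three of the seven classes fall into the non-stressed group and four into the stressed group, so any error in the seven-class prediction that crosses the group boundary directly becomes a stress detection error.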

Table 6

Performance Comparison of three models

Model	Epochs	Validation accuracy (%)	Training accuracy (%)
CNN	50	78.25	82.39
CNN	100	83.14	95.20
ResNet-50	50	82.92	86.75
ResNet-50	100	86.13	96.40
Conv-XGBoost	50	89.68	95.69
Conv-XGBoost	100	93.02	99.92

Fig. 10

Confusion matrix of proposed model.

The same experimental setup was used for the stress detection phase. The second experiment evaluates how accurately the proposed Conv-XGBoost model assigns images of facial expressions to the appropriate stress categories. The model shows an accuracy of 99.7% in identifying stress. Because the emotions are mapped into two primary classes, positive and negative, predicting each emotion and assigning it to the correct class is crucial for stress prediction. The method creates a duplicate of the training dataset in order to avoid unintentionally altering the original dataset. Stress prediction would be a periodic assessment process in which a continuous week of data is analysed for stress detection.

6 Conclusion

Stress detection and management are vital components of maintaining physical and mental well-being in today’s fast-paced and demanding world. With advancements in technology and a growing awareness of the detrimental effects of chronic stress, various methods and tools can now be employed to identify and address stress effectively. This research presents a detailed study of the CNN and XGBoost architectures, which informs the development of the Conv-XGBoost algorithm for emotion detection. An accurate stress detection model was developed by analysing various existing algorithms for the implementation.

A detailed study was carried out to build an accurate emotion detection model using the Facial Expression Recognition dataset. The Conv-XGBoost algorithm shows an accuracy of 99.9%, higher than the other state-of-the-art methods considered. The convolution layers, implemented using TensorFlow with fine-tuning and Leaky ReLU activation, produced accurate predictions. For the performance evaluation, the proposed method was compared with the basic CNN model and ResNet-50 on the same Facial Expression Recognition dataset. The three models were compared at 50 and 100 epochs, and the Conv-XGBoost model shows validation accuracies of 89.68% and 93.02%, respectively. As the emotion detection part is crucial, it is important that the algorithm performs well in classifying the seven basic emotions into positive or negative emotion classes. This emotion classification makes the stress prediction straightforward. The Conv-XGBoost model shows an accuracy of 99.7% in stress detection. This analysis supports the efficiency of the proposed model for stress prediction.

This work can be further extended by implementing a real-time stress prediction model with extended cloud storage provision. The extension can be used to assist elderly people in a smart home environment with real-time monitoring. Real-time stress detection from motion pictures can also be incorporated, which has a wide range of applications in intelligent healthcare, criminology and other fields.

References

1. Tao, Martinez, Compound facial expressions of emotion, Proceedings of the National Academy of Sciences of the United States of America 111(15) (2014).

2. Georgescu M.I., Ionescu R.T., Popescu, Local learning with deep and handcrafted features for facial expression recognition, IEEE Access 7 (2019).

3. Brownlee, A Gentle Introduction to Deep Learning for Face Recognition, Deep Learning for Computer Vision, 2019.

4. Grekow, Emotion detection using feature extraction tools. In Foundations of Intelligent Systems: 22nd International Symposium, ISMIS 2015, Lyon, France, October 21-23, 2015, Proceedings 22 (2015), pp. 267–272. Springer International Publishing.

5. Singh R.R., Conjeti, Banerjee, A comparative evaluation of neural network classifiers for stress level analysis of automotive drivers using physiological signals, Biomedical Signal Processing and Control 8(6) (2013), pp. 740–754.

6. Kumar G.S., Ankayarkanni, Comparative Study on Mental Stress Detection Using Various Stressors and Classification Techniques. In 2022 International Conference on Advancements in Smart, Secure and Intelligent Computing (ASSIC) (2022), pp. 1–6. IEEE.

7. Widiastuti, Nugroho A.K., Nurhayati, Nugraheni D.M.K., Classification of stress levels of medical workers in knowing performance levels in productivity based on fuzzy logic, AIP Conference Proceedings 2738(1) (2023). AIP Publishing.

8. Liu, Stress detection using deep neural networks, BMC Medical Informatics and Decision Making 20 (2020), pp. 1–10.

9. Raval, Stress detection using convolutional neural network and internet of things, Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12(12) (2021), pp. 975–978.

10. Spielberger C.D., Sydeman S.J., Owen A.E., Marsh B.J., Measuring anxiety and anger with the State-Trait Anxiety Inventory (STAI) and the State-Trait Anger Expression Inventory (STAXI), Lawrence Erlbaum Associates Publishers, 1999.

11. Keedwell, Snaith R.P., What do anxiety scales measure?, Acta Psychiatrica Scandinavica 93(3) (1996), pp. 177–180.

12. Oei T.P., Sawang, Goh Y.W., Mukhtar, Using the depression anxiety stress scale 21 (DASS-21) across cultures, International Journal of Psychology 48(6) (2013), pp. 1018–1029.

13. Abramson J.H., The Cornell Medical Index as an epidemiological tool, Am. J. Public Health Nations Health 56 (1966), pp. 287–298.

14. Shao, Wang, Lai, Prediction of coal mine gas emission based on hybrid machine learning model, Earth Science Informatics 16(1) (2023), pp. 501–513.

15. Mehrabian, Communication without words, Communication Theory 6 (2008), pp. 193–200.

16. Kim D.K., Comparative analysis of emotion classification based on facial expression and physiological signals using deep learning, Applied Sciences 12(3) (2022), pp. 1286.

17. Febrian, Halim B.M., Christina, Ramdhan, Chowanda, Facial expression recognition using bidirectional LSTM-CNN, Procedia Computer Science 216 (2023), pp. 39–47.

18. Kulkarni S.S., Reddy N.P., Hariharan S.I., Facial expression (mood) recognition from facial images using committee neural networks, Biomedical Engineering Online 8(1) (2009), pp. 1–12.

19. Wawage, Deshpande, Real-time prediction of car driver’s emotions using facial expression with a convolutional neural network-based intelligent system, International Journal of Performability Engineering 18(11) (2022), pp. 791.

20. Viegas, Lau S.H., Maxion, Hauptmann, Towards independent stress detection: A dependent model using facial action units. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI) (September 2018), pp. 1–6. IEEE.

21. Ucar, Demir, Güzeliş, A new facial expression recognition based on curvelet transform and online sequential extreme learning machine initialized with spherical clustering, Neural Computing & Applications 27(1) (2016), pp. 131–142.

22. Gay, Leijdekkers, Agcanas, Wong, CaptureMyEmotion: Helping autistic children understand their emotions using facial expression recognition and mobile technologies, 2013.

23. Bartlett M.S., Littlewort, Fasel, Movellan J.R., Real Time Face Detection and Facial Expression Recognition: Development and Applications to Human Computer Interaction. In 2003 Conference on Computer Vision and Pattern Recognition Workshop 5 (2003), pp. 53–53. IEEE.

24. Hasani, Mahoor M.H., Facial expression recognition using enhanced deep 3D convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (2017), pp. 30–40.

25. Tan, Zhang, Wang, Zeng, Peng, Qiao, Group emotion recognition with individual facial emotion CNNs and global image based CNNs. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (2017), pp. 549–552.

26. Duncan, Shine, English, Facial emotion recognition in real time, Computer Science (2016), pp. 1–7.

27. Wang, Guo, Research on face recognition based on deep learning. In 2021 3rd International Conference on Artificial Intelligence and Advanced Manufacture (AIAM) (2021), pp. 540–546. IEEE.

28. Bhatt, Patel, Talsania, Patel, Vaghela, Pandya, Modi, Ghayvat, CNN variants for computer vision: History, architecture, application, challenges and future scope, Electronics 10(20) (2021), pp. 2470.

29. Asselman, Khaldi, Aammou, Enhancing the prediction of student performance based on the machine learning XGBoost algorithm, Interactive Learning Environments (2021), pp. 1–20.

30. Khan M.S., Salsabil, Alam M.G.R., Dewan M.A.A., Uddin M.Z., CNN-XGBoost fusion-based affective state recognition using EEG spectrogram image analysis, Scientific Reports 12(1) (2022), pp. 14122.

31. Esha I.A., Rahman, Chowdhury S.K., Mim J.F., Multiclass emotion classification by using Spectrogram image analysis: A CNN-XGBoost fusion approach (Doctoral dissertation, Brac University), 2023.

32. Lecun, Bottou, Bengio, Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), pp. 2278–2324. doi: 10.1109/5.726791.

33. Stutz, Understanding convolutional neural networks, Seminar report, Fakultät für Mathematik, Informatik und Naturwissenschaften, Lehr- und Forschungsgebiet Informatik VIII Computer Vision (2014).

34. Chen, Guestrin, XGBoost: A Scalable Tree Boosting System, 2016. arXiv e-prints arXiv:1603.02754.

35. Hsieh C.-C., Hsih M.-H., Jiang M.-K., Cheng Y.-M., Liang E.-H., Effective semantic features for facial expressions recognition using SVM, Multimedia Tools and Applications 75(11) (2016), pp. 6663–6682.

36. Max-pooling dropout for regularization of convolutional neural networks. In Neural Information Processing: 22nd International Conference, ICONIP 2015, Istanbul, Turkey, November 9-12, 2015, Proceedings, Part I 22 (2015), pp. 46–54. Springer International Publishing.

37. https://www.kaggle.com/code/mh0386/facialemotionsdetection

