Abstract
This manuscript proposes an automated artifact detection and multimodal classification system for human emotion analysis from physiological signals. First, multimodal physiological data, including Electrodermal Activity (EDA), electrocardiogram (ECG), Blood Volume Pulse (BVP) and respiration rate signals, are collected. Second, a Modified Compressed Sensing-based Decomposition (MCSD) is used to extract the informative Skin Conductance Response (SCR) events of the EDA signal. Third, raw features (edge and sharp variations), statistical features and wavelet coefficient features of the EDA, ECG, BVP, respiration and SCR signals are obtained. Fourth, these features from all physiological signals are fed into a parallel Deep Convolutional Neural Network (DCNN), which reduces the dimensionality of the feature space by removing artifacts. Fifth, the fused artifact-free feature vector is obtained for the neutral, stress and pleasure emotion classes. Sixth, the artifact-free feature vector is used to train the Random Forest Deep Neural Network (RFDNN) classifier. The trained RFDNN classifier is then applied to classify the test signals into the emotion classes. By leveraging the strengths of both the RF and DNN algorithms, more comprehensive feature learning from multimodal physiological data is achieved, resulting in robust and accurate classification of human emotional activities. Finally, extensive experiments on the Wearable Stress and Affect Detection (WESAD) dataset show that the proposed system outperforms existing human emotion classification systems based on physiological data.
Keywords
Introduction
Nowadays, smartwatches are being developed to monitor physiological data and regulate emotions to improve human existence. These devices are small, compact, and cheap [1, 2]. Many researchers have encountered the identification of human emotions by collecting various data, including physiological signals, as the human sympathetic nervous system manages different emotions such as fear, rage, panic, and so on [3, 4]. The changes in an individual’s emotional levels typically reflect their psychological condition directly. These changes are identified by the EDA, which defines the variation of electrical characteristics of the skin due to sweat gland activity. In most cases, tonic and phasic components can be separated in the observed EDA skin conductance signal [5–7]. The tonic component comprises slowly varying activity and is called Skin Conductance Level (SCL). Conditions such as air temperature and humidity, as well as individual physiological characteristics like skin thickness, can affect the SCL signal. Aside from the SCL, shifts in the sensors’ placement can cause noticeable shifts in the skin’s conductance. These variations are referred to as motion artifacts. Baseline signal refers to the signal component caused by SCL variations and motion artifacts.
On the other hand, the phasic component or Skin Conductance Response (SCR) comprises quickly varying elements of the EDA signals [8–10]. Human stress analysis can benefit from the measurement of these EDA signal components related to user reaction to a stimulus [11–14]. Therefore, many techniques have been developed, such as a parameterized sigmoid-exponential model, impulse response models, quasi-deconvolution and a finite difference Bayesian decay method [15], to collect the SCR occurrences from the measured EDA signals. But these methods cannot eliminate the SCL signal components that disguise the SCR events.
Research gap
The above-mentioned issues can be solved by the Compressed Sensing-based Decomposition (CSD) of EDA signals [16], which models the recorded EDA signal as a superposition of the baseline signal (the signal element due to SCL variations and physical changes), the SCR components and noise. First, a pre-processing step was used to cast the recovery of SCR events from the baseline signal as a sparse deconvolution problem in the presence of bounded noise. Then, the CSD method was applied to estimate the SCR events with corresponding error bounds (i.e., the difference between the actual and the predicted SCR event signals) using a convex optimization algorithm. Though this method facilitates the recovery of true SCR events from the EDA signals, it has a significant computational complexity due to the use of convex optimization modeling.
So, a Modified CSD (MCSD) method [17] has been developed, which uses a computationally efficient matrix-free convex optimization to decompose EDA signals and recover the true SCR events. However, this method has two limitations. First, EDA signals are easily affected by noise and artifacts such as motion artifacts, electrode contact issues, or sweat evaporation variations. This degrades the accuracy of analyzing human emotional states, so it is crucial to preprocess the EDA data and mitigate the impact of these artifacts. Second, analyzing and classifying multiple human emotional states based solely on EDA data can be complex and time-consuming because of the ambiguity and overlapping nature of emotional experiences. It is therefore vital to integrate data from multiple physiological sources, such as ECG, to increase the accuracy of human emotion analysis.
Nowadays, automatic detection of artifacts and classification of emotional activity from multimodal physiological data is an interesting and challenging research area that lies at the intersection of various fields, such as signal processing, machine learning and affective computing. According to this context, this paper develops an automated artifact detection and a multimodal human emotion classification system to address the gaps in [16, 17]. To improve the human emotion classification accuracy, data from additional physiological sources such as respiration rate, BVP and ECG, are considered along with the EDA signal.
Contributions of the study
The major goal of this study is to process physiological signals collected from multiple sources (such as ECG, EDA, etc.) to identify artifacts that might distort the analysis and extract meaningful emotional information for enhancing classification performance. Hence, the major novelties/contributions of this study are the following: Initially, different physiological signals such as the EDA, ECG, BVP and respiration rate signals are taken as the multimodal dataset. The MCSD method is utilized to obtain the SCR events of EDA signals. For physiological signals, the statistical features associated with the amplitude as well as the first and second derivatives are extracted. Besides, a Discrete Haar Wavelet Transform (DHWT) is applied to extract additional features called wavelet coefficients that define the abrupt variations in the different physiological signals. Then, those extracted features are given to the parallel DCNN that determines the probability values to choose the most relevant feature vectors, i.e., artifact-free features for each emotion class (e.g., neutral, stress and pleasure). This prevents redundant features and reduces the search space to achieve better solutions. The selected feature vectors are fused to obtain the artifact-free feature vector. Moreover, the fused artifact-free feature vector can be used to train the RFDNN classifier for achieving the human emotional activity classification. The trained classifier is later applied to accurately classify the test signals into neutral, stress and pleasure.
The novelty of this research involves the following: (i) considering multimodal physiological data rather than specific data for emotional classification; (ii) developing a parallel DCNN instead of a single DCNN for automatically detecting artifacts from the different physiological data; (iii) integrating RF and DNN models, which enables more comprehensive and effective feature learning. Thus, this multimodal system provides a more comprehensive view of emotional responses, leading to more accurate and nuanced emotion classification.
This article continues with the following structure: Section 2 covers the literature survey. Section 3 explains the proposed automated artifact detection with a multimodal classification system for human emotion analysis. The experimental outcomes are depicted in Section 4. Finally, the study’s implications for the future are discussed in Section 5.
Literature survey
Greco et al. [18] devised a method to analyze EDA using a convex optimization method by maximizing posterior probability. The model’s definition of SCR included the phasic component, the tonic component, and additive white Gaussian noise, which accounted for measurement errors and artifacts in EDA. But it was not appropriate for a huge number of signals due to the high computational complexity.
Kelsey et al. [19] presented K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and isolation forest algorithms to automatically detect and eliminate artifacts in EDA signals by employing curve fitting and sparse recovery schemes. However, these algorithms were only suitable for limited signals, whereas the accuracy was less for a large number of signals.
Zhang et al. [20] suggested an unsupervised identification of motion artifacts in wrist-measured EDA data collected in both a laboratory-based environment and a real-world setting. In this model, both supervised and unsupervised algorithms were tested to define, on a large-scale dataset, the accuracy of motion artifact detection. But the accuracy was not high since it cannot handle abrupt changes in the EDA signals.
Gashi et al. [21] presented an automatic method for detecting shape artifacts in long-term EDA signals captured in the wild. First, the EDA signals were collected and annotated by the experts. Then, the EDA signals were segmented by a fixed window and features were extracted from each segment. These features were given to the XGBoost classifier to identify artifacts. But the accuracy of separately identifying a signal that contains both a clean SCR and an artifact was very low because of the fixed window.
Wei et al. [22] developed an EEG-based emotion recognition system that uses Variational Mode Decomposition (VMD) for feature extraction and a deep neural network (DNN) for classification, all without the need for a human subject. Using Empirical Mode Decomposition (EMD) and VMD, Intrinsic Mode Functions (IMFs) were extracted from the EEG data. For each IMF, the peak power spectral density value and its first difference were calculated. The DNN was then trained to recognize emotions based on these features. Nonetheless, the recognition accuracy was low because of the limited data, and additional kinds of physiological data were needed.
Gannouni et al. [23] described a method for emotion recognition from EEG signals. First, the zero-time windowing approach was employed to extract instantaneous spectral features utilizing the numerator group-delay function to easily detect epochs in all emotional states. Then, the Quadratic Discriminant Recurrent Neural Network (QDRNN) was performed to classify emotional states. However, accuracy was less because it considered only a limited signal. A multimodal scheme was required to improve accuracy by utilizing other physiological signals.
Cavallo et al. [24] presented a mood classification based on physiological factors. First, physiological factors such as ECG, EDA, and brain activity signals were collected. Then, such signals were classified by the SVM, decision tree, and KNN to detect individuals’ moods. But the accuracy was less because these algorithms did not handle disturbances like noise or motion artifacts in the collected physiological signals.
Hossain et al. [25] presented a machine-learning model for automatic motion artifact detection on EDA signals. First, EDA signals were collected and labeled by the experts. Then, the signals were segmented, and various features were extracted. Such features were learned by the GradBoost algorithm to detect motion artifacts in EDA signals. But these algorithms were not able to train more signals, resulting in less accuracy.
Dogan et al. [26, 28] presented a prime pattern and tunable q-factor wavelet transform to decompose the EEG signals into different sub-bands. After that, new features were created from the decomposed sub-bands using the Tetromino scheme. Moreover, the most discriminative features were chosen by the maximum Relevance Minimum Redundancy (mRMR) and the chosen features were classified by the SVM classifier. But its accuracy was less and training complexity was high while using more multimodal data.
Limitations of the literature
From the literature, it is observed that most studies rely on computational algorithms to identify artifacts in the EDA signals. They use only a small number of instances, i.e., small-scale datasets. Also, they concentrate only on certain physiological signals for human emotion or mood classification. In contrast with earlier studies, this study focuses on developing deep-learning methods to improve the accuracy of automatically detecting artifacts and classifying human emotions from large-scale multimodal physiological data.
Proposed methodology
In this section, an automated artifact detection and emotion classification system using multimodal physiological signals is briefly explained. It consists of three phases: dataset acquisition, identification of artifacts and classification of human emotions. The dataset acquisition phase obtains the physiological signal dataset from an open source. The artifact detection phase comprises several processes, including SCR recovery using MCSD, feature extraction and artifact detection using the parallel DCNN. The classification phase uses the RFDNN classifier to classify human emotional activities into neutral, stress and pleasure. A brief description of these processes is given in the following subsections.
Dataset description
Initially, a publicly accessible multimodal database called Wearable Stress and Affect Detection (WESAD) [26] is obtained. The Trier Social Stress Test is used as the stress stimulus on 15 participants (12 men and 3 women) during the data collection procedure. The participants were graduate students; pregnancy, heavy smoking, mental disorders, infectious diseases, and cardiovascular diseases were exclusion criteria. The 15 subjects examined had an average age of 27.5±2.4 years. Each subject's data is linked to several self-reports that capture the subjective experience during an affective stimulus.
This dataset includes high-resolution physiological modalities such as ECG, EDA, EMG, body temperature, respiration rate, and BVP, along with triaxial acceleration signals. The data were recorded by two devices: a chest-worn device (RespiBAN Professional), which samples its channels at 700 Hz, and a wrist-worn device (Empatica E4). The RespiBAN is worn around the subject's chest. A respiratory inductive plethysmograph sensor records respiration. A standard three-point ECG records the ECG data.
The chest EDA is recorded on the rectus abdominis, which allows the individual to move as freely as practicable. In addition, all subjects wore the Empatica E4 on their non-dominant hand to record BVP (64 Hz) and EDA (4 Hz). The recorded data is stored locally for further processing and then transferred to the computer.
Artifact detection phase
Feature extraction
In this phase, the EDA signals in the dataset are decomposed by the MCSD [17] to recover true SCR events. Then, for each raw physiological signal and the SCR signal, statistical and wavelet coefficient features are extracted. The DHWT is applied to determine the wavelet coefficient features, which indicate rapid changes in the signal. The DHWT decomposes the signal into coefficients at successive scales, corresponding to 4 Hz, 2 Hz, and 1 Hz. Because the Haar detail coefficients capture how much successive points of the signal differ, they are useful for recognizing edges and sharp variations. Three levels of coefficients are computed over the complete EDA signal, and statistics are then computed on the coefficients within each interval. Table 1 provides the statistical and wavelet coefficient features extracted from the raw EDA and SCR.
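The Haar decomposition described above can be illustrated with a minimal pure-Python sketch. It is not the authors' MATLAB implementation: the function names `haar_step` and `haar_features`, the choice of summary statistics (maximum, mean and standard deviation of the detail coefficients), the padding rule and the sample values are all assumptions made for illustration.

```python
import math
import statistics


def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    signal = list(signal)
    if len(signal) % 2:  # assumption: pad odd-length input by repeating the last sample
        signal.append(signal[-1])
    root2 = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_features(signal, levels=3):
    """Summary statistics of the detail coefficients at each decomposition level."""
    features = {}
    approx = list(signal)
    for level in range(1, levels + 1):
        approx, detail = haar_step(approx)
        magnitudes = [abs(d) for d in detail]
        features[f"d{level}_max"] = max(magnitudes)
        features[f"d{level}_mean"] = statistics.fmean(magnitudes)
        features[f"d{level}_std"] = statistics.pstdev(detail)
    return features


# Hypothetical 2-second EDA excerpt sampled at 4 Hz with one sharp rise.
eda = [0.50, 0.50, 0.51, 0.51, 0.90, 0.92, 0.52, 0.52]
features = haar_features(eda)
```

In this toy excerpt the sharp rise falls between two level-1 sample pairs, so it shows up most strongly in the level-2 detail coefficients, which is why statistics are collected at several levels rather than one.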
Extracted EDA features
Along with the EDA data features, features from other physiological data are also extracted to enhance the accuracy of artifact detection and emotion classification. Table 2 presents the statistical and wavelet features extracted from the raw ECG and BVP signals.
Extracted ECG/BVP features
The R-peak is identified using the Daubechies eighth-layer wavelet coefficients. The Q and S peaks are located as the negative peaks found by sliding a window to the left and right of the R-peak, respectively. The P-peak is the largest value in the window to the left of the Q-peak, and the T-peak is the largest value in the window to the right of the S-peak.
The onset and offset of every point are calculated, and these points are used to extract morphological features. The mean is obtained from the aggregate of all the data points in each ECG signal, and the deviation quantifies how far a value lies from the mean. The complexity of the time series is gauged by Sample Entropy (SE). It improves on approximate entropy by avoiding the bias caused by counting self-matches. Higher SE values imply a more complex signal, while lower SE values indicate a more regular one. Consider the ECG signal points $(y_1, y_2, \ldots, y_P)$ of length $P$; the SE can be estimated as:
Where
In Equation (3),
In Equation (5), j and k range from 1 to P - mτ with k ≠ j. If the distance D [Y (j) , Y (k)] between any pair of corresponding measurements of Y (j) and Y (k) is less than or equal to r, then the two signal segments of window length m are considered similar. The Fractal Dimension (FD) characterizes the fractal properties of the signal. A rectangular window of fixed width W is used to calculate the localized FD for each sample in the signal with the variance FD estimator. According to the following power law, the variance σ² of the amplitude increments of a time series y (n) over a time increment is proportional to that time increment raised to the power 2H:
$$\operatorname{var}\left[y(n_2) - y(n_1)\right] \propto \left|n_2 - n_1\right|^{2H} \qquad (6)$$
In Equation (6), H denotes the Hurst exponent and var (·) represents the variance. By setting $\Delta n = |n_2 - n_1|$ and $(\Delta y)_{\Delta n} = y(n_2) - y(n_1)$, the value of H is estimated from the log-log plot of $\operatorname{var}(\Delta y)_m$ against $\Delta n_m$, where $\Delta n_m$ is the discrete time increment at the $m$-th scale and $\operatorname{var}(\Delta y)_m$ is the variance of the amplitude increments at that scale. The gradient s of the line fitting the log-log plot is estimated using the least squares procedure, and the Hurst exponent follows from Equation (7):
$$H = \frac{s}{2} \qquad (7)$$
The Hurst exponent H is associated with the FD (D) as follows:
$$D = E + 1 - H \qquad (8)$$
Here, E is the Euclidean dimension, which is equal to 1 for ECG signals. Therefore, Equation (8) can be rewritten as:
$$D = 2 - H \qquad (9)$$
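The two complexity features of this subsection, sample entropy and the variance-based fractal dimension, can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the distance D is taken to be the Chebyshev (maximum) distance, the defaults m = 2, r = 0.2 and τ = 1 and the lag set (1, 2, 4, 8) are arbitrary, and the relations H = s/2 and D = 2 - H are the standard variance fractal dimension relations for a one-dimensional signal.

```python
import math
import random


def sample_entropy(y, m=2, r=0.2, tau=1):
    """SampEn = -ln(A / B): B and A count template pairs (k != j) within
    tolerance r for template lengths m and m + 1."""
    n_templates = len(y) - m * tau  # j = 1 .. P - m*tau, as in the text

    def count_matches(length):
        templates = [[y[j + i * tau] for i in range(length)]
                     for j in range(n_templates)]
        count = 0
        for j in range(n_templates):
            for k in range(n_templates):
                if k == j:
                    continue  # self-matches are excluded
                # assumption: D[.,.] is the Chebyshev (maximum) distance
                if max(abs(a - b) for a, b in zip(templates[j], templates[k])) <= r:
                    count += 1
        return count

    b, a = count_matches(m), count_matches(m + 1)
    if a == 0 or b == 0:
        return float("inf")  # no matches found: reported as infinite entropy
    return -math.log(a / b)


def variance_fractal_dimension(y, lags=(1, 2, 4, 8)):
    """Return (D, H) with H = s / 2 and D = 2 - H, where s is the least-squares
    slope of log var(dy) against log dn."""
    xs, ys = [], []
    for dn in lags:
        increments = [y[i + dn] - y[i] for i in range(len(y) - dn)]
        mean = sum(increments) / len(increments)
        var = sum((d - mean) ** 2 for d in increments) / (len(increments) - 1)
        xs.append(math.log(dn))
        ys.append(math.log(var))  # requires var > 0 at every lag
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (v - y_bar) for x, v in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    hurst = slope / 2.0
    return 2.0 - hurst, hurst


# Synthetic checks: a periodic signal and a seeded random walk (H expected near 0.5).
periodic = [0.0, 1.0] * 10
rng = random.Random(0)
walk = [0.0]
for _ in range(5000):
    walk.append(walk[-1] + rng.gauss(0.0, 1.0))
fd_walk, hurst_walk = variance_fractal_dimension(walk)
```

A perfectly periodic signal gives a sample entropy of zero, while an ordinary random walk should give a Hurst exponent close to 0.5 and therefore a fractal dimension close to 1.5.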
The heart/pulse rate is calculated using the BVP signal along with the related mean and standard deviation values. The difference between two R peaks in the ECG/BVP signal is used to calculate heart rate, which is typically expressed as the number of R peaks per minute or the number of heart beats per minute. Table 3 displays the features extracted from the raw respiration signals.
Extracted respiration signal features
Respiration rate, expressed in breaths per minute, describes the speed at which breathing takes place. The respiratory rate is also calculated using the RR intervals. Before computing the features on the respiration signal, the RR time series, defined at the R-peak times, is resampled to an evenly spaced series at a sampling frequency of 10 Hz. This time series is band-pass filtered from 0.1 to 0.45 Hz. Normalizing the filtered time series reduces the impact of a single deep breath: the series is divided by the magnitude of the local maximum that is exceeded by 25% of all other local maxima, i.e., the 75th percentile of the local maxima. Only complete cycles in the normalized time series whose local maxima exceed 0.3 are used to calculate the mean respiration rate. This threshold determines which respiratory cycles are acceptable. In addition, the mean and standard deviation of inhalation and exhalation are computed. The multimodal physiological signals' features are thus extracted.
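The normalization and thresholding steps can be sketched as below, assuming the series has already been resampled to 10 Hz and band-pass filtered. The function names, the use of `statistics.quantiles` for the upper quartile, and the simplification that a breathing cycle is the gap between two consecutive accepted peaks are assumptions made for this sketch; the input is a synthetic sine wave, not WESAD data.

```python
import math
import statistics


def local_maxima(x):
    """Indices of samples larger than the previous and not smaller than the next."""
    return [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] >= x[i + 1]]


def mean_respiration_rate(filtered, fs=10.0, threshold=0.3):
    """Mean respiration rate in breaths per minute from a filtered series."""
    peaks = local_maxima(filtered)
    # Upper quartile of the peak heights: the value exceeded by 25% of the maxima.
    q3 = statistics.quantiles([filtered[i] for i in peaks], n=4)[2]
    normalized = [v / q3 for v in filtered]
    accepted = [i for i in peaks if normalized[i] > threshold]
    # Simplification: one cycle is the gap between consecutive accepted peaks.
    cycle_s = [(b - a) / fs for a, b in zip(accepted, accepted[1:])]
    return 60.0 / statistics.fmean(cycle_s)


# Synthetic stand-in for a band-pass-filtered series: 0.25 Hz breathing for 60 s.
fs = 10.0
resp = [math.sin(2.0 * math.pi * 0.25 * i / fs) for i in range(600)]
rate = mean_respiration_rate(resp, fs=fs)
```

Dividing by the upper quartile rather than the global maximum keeps one unusually deep breath from shrinking every other cycle below the 0.3 acceptance threshold.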
Once all the features are extracted, the artifacts are detected by a parallel DCNN that constructs the feature vectors using the extracted features for each physiological signal related to each emotion class. The convolutional, pooling and dropout modules of the proposed parallel 1D-CNN are followed by one or more Fully-Connected (FC) layers that produce the feature representation. The convolutional layer first converts the higher-resolution characteristics retrieved from the signal into coarser, more complex information. Every convolutional layer may use multiple convolutional kernels, each followed by an activation function such as ReLU.
Convoluted data from the preceding level is included in the feature map (fm), which is the result of the convolutional layers, as follows:
In Equation (10), input is a N × N data sample from the preceding layers, kernel is a p × p filter that represents ω and ⊗ is the convolution function, which is defined as:
Where
In Equations (12),
In Equation (13), P (·) is the probability function and z is the output of the softmax layer. The feature vectors with the maximum probability are chosen as the most relevant feature vectors for each emotion class, whereas the feature vectors with the lowest probability are regarded as artifacts and discarded from further processing. Then, the chosen feature vectors are fused to create the artifact-free feature vector of each participant.
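The selection rule can be illustrated with a small pure-Python sketch of a 1D convolution, a ReLU, and a softmax-based ranking. Everything concrete here is hypothetical: the kernel, the candidate names, the scores (which a trained DCNN would produce), and the `keep` parameter that decides how many top-ranked candidates are retained. The sketch shows the data flow only and is not the trained network of Table 4.

```python
import math


def conv1d_valid(signal, kernel):
    """Sliding dot product ('valid' mode), the operation CNN layers call convolution."""
    width = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(width))
            for i in range(len(signal) - width + 1)]


def relu(values):
    return [max(v, 0.0) for v in values]


def softmax(z):
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def select_relevant(candidates, scores, keep=1):
    """Keep the `keep` candidates with the highest softmax probability;
    the remainder are treated as artifacts and dropped."""
    probs = softmax(scores)
    order = sorted(range(len(candidates)), key=lambda i: probs[i], reverse=True)
    kept = sorted(order[:keep])
    return [candidates[i] for i in kept], probs


feature_map = relu(conv1d_valid([1.0, 2.0, 4.0, 3.0], [-1.0, 1.0]))
# Hypothetical candidates and network scores for one emotion class.
candidates = ["eda_vec", "ecg_vec", "bvp_vec"]
selected, probs = select_relevant(candidates, scores=[2.0, 0.1, -1.0], keep=1)
```

Here `feature_map` holds the rectified responses of a difference kernel, and `selected` contains only the candidate with the highest class probability.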
Ablation Study: The ablation study analyzes the performance of a DCNN by modifying its layer design and configuration, considering task behaviors and related problems. By training the model multiple times, it can identify potential reductions in performance and solve them by modifying network components or hyperparameter. The WESAD dataset is used for each trial. According to this, the details of the DCNN architecture used for the proposed artifact detection scenario are presented in Table 4. Also, the framework of parallel DCNN for artifact detection is shown in Fig. 1.
Details of DCNN layers

Overall framework of artifact detection using parallel DCNN.
Accordingly, the time complexity of the DCNN model for artifact detection is defined by
In Equation (14), k is the Conv kernel size, n is the number of training data and d is the input dimension. Similarly, the space complexity of the DCNN model for artifact detection is defined by
In Equation (15), P is the number of parameters in the DCNN and M is the memory needed to store the intermediate feature maps for a single layer.
Once the artifact-free feature vectors are created, such feature vectors are classified by the RFDNN into different emotional states. The RFDNN combines both random forest and DNN for efficient classification and high accuracy. So, an ensemble model, i.e., utilizing random forest over DNN, can generate predictive performance from each of its base learners as opposed to just a unified projected possibility value. This RFDNN has two parts: a forest part and a DNN part. Under the guidance of training results, the forest portion can learn the sparse descriptions from the provided real SCR events and real R-peak signals, and the DNN component can categorize human emotional activity. In the forest section, independent decision trees are constructed, and those trees are then combined to create the forest.
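The two-part structure described above, a forest whose tree outputs become the inputs of a DNN, can be sketched as a forward pass in pure Python. This is only a structural illustration under loud assumptions: the trees are hand-written depth-1 stumps rather than a trained random forest, the weights are arbitrary untrained values, and the layer sizes (5 trees, 4 hidden units, 3 classes) are chosen only to respect the narrowing layout of the network.

```python
import math


def stump(feature_index, threshold):
    """Depth-1 decision tree; a stand-in for one trained tree of the forest."""
    return lambda x: 1.0 if x[feature_index] > threshold else 0.0


def forest_matrix(samples, trees):
    """Row i holds the M tree outputs for sample i, i.e. one row of F."""
    return [[tree(x) for tree in trees] for x in samples]


def dense(inputs, weights, biases):
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(v, 0.0) for v in values]


def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def rfdnn_forward(f_row, w_hidden, b_hidden, w_out, b_out):
    """Forest outputs -> ReLU hidden layer -> softmax class probabilities."""
    hidden = relu(dense(f_row, w_hidden, b_hidden))
    return softmax(dense(hidden, w_out, b_out))


# Hypothetical forest of M = 5 stumps over two-feature inputs.
trees = [stump(0, 0.5), stump(1, 0.5), stump(0, 0.1), stump(1, 0.95), stump(0, 0.7)]
samples = [[0.2, 0.9], [0.8, 0.1]]
F = forest_matrix(samples, trees)

# Untrained, hand-picked weights: 5 inputs -> 4 hidden units -> 3 classes.
w_hidden = [[0.5, -0.2, 0.1, 0.0, 0.3],
            [-0.4, 0.6, 0.2, 0.1, -0.1],
            [0.3, 0.3, -0.5, 0.2, 0.0],
            [0.0, -0.1, 0.4, -0.3, 0.2]]
b_hidden = [0.0, 0.1, 0.0, -0.1]
w_out = [[0.2, -0.3, 0.5, 0.1],
         [-0.1, 0.4, 0.0, 0.3],
         [0.3, 0.1, -0.2, -0.4]]
b_out = [0.0, 0.0, 0.0]
class_probs = [rfdnn_forward(row, w_hidden, b_hidden, w_out, b_out) for row in F]
```

In a real system the forest would be fitted first, and the DNN weights would then be learned from the resulting forest matrix by minimizing the cross-entropy loss.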
In this RFDNN model, a forest
In this case, M stands for the overall number of trees in the forest, and $\theta = \{\theta_1, \ldots, \theta_M\}$ represents the variables in
In Equation (17),
Here, $F = (f_1, \ldots, f_M)^T$ stands for the forest matrix, which consists of feature vectors with n observations and M tree estimations; Ψ stands for all the DNN variables; $Z_{out}$ and $Z_k$, $k = 1, \ldots, g - 1$, are the hidden neurons; and $W_{out}$, $W_k$ are the matching weight matrices, each with its bias vector. The numbers of hidden neurons $h_{in}$ and $h_k$, $k = 1, \ldots, g$, together with the input scale M and the number of categories $h_{out}$, determine the dimensions of Z and W. In a typical network, the hidden neurons are distributed as $h_{in} = M > h_1 > h_2 > \ldots > h_{out}$. Also, σ (·) is the activation function, i.e., $\sigma_{ReLU}(y) = \max(y, 0)$, and l (·) is the softmax function, which can be rewritten as:
Where
All weights and biases must be calculated as D-CNN constants. A Stochastic Gradient Descent (SGD) approach can be used to retrain this RFDNN by decreasing the cross-entropy loss operation as:
In Equation (26),

Overall framework of RFDNN-based human emotional activity classification.

Structure of RFDNN classifier.
Accordingly, the time complexity of the RFDNN for the emotion classification model is defined by
In Equation (27), n is the number of trees, m is the number of features, L is the number of layers in the DNN and N is the number of neurons in each layer. Also, the space complexity of the proposed RFDNN for the emotion classification model is defined by
In this section, the effectiveness of the parallel DCNN-based artifact detection and RFDNN-based multimodal classification methods on the WESAD dataset is assessed in MATLAB 2019b. The WESAD dataset is randomly split into training and test sets with a ratio of 7 : 3 (i.e., record-wise cross-validation) to individually evaluate the performance of parallel DCNN and RFDNN methods.
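The 7 : 3 random split can be sketched as follows. The experiments were run in MATLAB, so this Python snippet is only an illustration; the function name `split_records`, the fixed seed and the use of integer record identifiers are assumptions.

```python
import random


def split_records(record_ids, train_fraction=0.7, seed=0):
    """Shuffle record identifiers and split them into train and test lists."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


# Hypothetical set of 100 records.
train_ids, test_ids = split_records(range(100))
```

Splitting on record identifiers keeps every record entirely in either the training or the test set.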
Performance assessment for artifact detection methods
Figure 4 shows examples of actual and artifact-free signals from the WESAD dataset. The performance of the proposed parallel DCNN for artifact detection is compared with existing artifact detection methods such as KNN [19], XGBoost [21], and GradBoost [25] on the WESAD dataset. The comparison is prepared in terms of Artifact Power Attenuation (APA), Normalized Mean Squared Error (NMSE), Standard Deviation (STD), and mean R-square. APA: It is calculated as:
NMSE: It is calculated as:
In Equation (30), y (n) is the observed signal.
STD: The STD value of the actual signal and the artifact-free signal is calculated by
In Equation (31), $y_i$ is the value of the sample.
Mean R-square: It is calculated as:
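As the metric equations are not reproduced in this text, the sketch below uses commonly used forms of three of them and should be read as an assumption rather than as the paper's exact definitions: NMSE as squared error divided by the energy of the observed signal, STD as the sample standard deviation (N - 1 denominator), and R-square as one minus the ratio of residual to total sum of squares. APA is omitted because its definition cannot be inferred from the surrounding text. The sample values are invented.

```python
import math


def nmse(observed, estimated):
    """Squared error normalised by the energy of the observed signal (assumed form)."""
    error = sum((y - e) ** 2 for y, e in zip(observed, estimated))
    return error / sum(y ** 2 for y in observed)


def sample_std(values):
    """Sample standard deviation with an N - 1 denominator (assumed form)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def r_squared(observed, estimated):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(observed) / len(observed)
    residual = sum((y - e) ** 2 for y, e in zip(observed, estimated))
    total = sum((y - mean) ** 2 for y in observed)
    return 1.0 - residual / total


# Invented observed signal and artifact-free estimate.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]
scores = {
    "nmse": nmse(observed, estimated),
    "std": sample_std(estimated),
    "r2": r_squared(observed, estimated),
}
```

Lower NMSE and an R-square closer to 1 both indicate that the artifact-free estimate stays close to the observed signal.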

Results of raw and artifact-free physiological signals. (a) EDA, (b) ECG, (c) BVP and (d) respiration rate.
Table 5 presents the performance results for the proposed parallel DCNN and existing methods on the WESAD dataset for artifact detection.
Performance analysis for artifact detection methods
Figure 5 portrays the efficiency of the proposed parallel DCNN and existing artifact detection methods, including XGBoost [21], GradBoost [25] and KNN [19] on the WESAD dataset. It is noted that the proposed parallel DCNN accomplishes a greater detection performance in terms of APA, STD, m_R2 and NMSE values.

Comparison of existing and proposed artifact detection methods on WESAD dataset.
Table 6 presents the time and space complexities of proposed and existing artifact detection methods on the WESAD dataset. This shows that the proposed parallel DCNN has less time and space complexities compared to the existing artifact detection methods.
Time and space complexities of different artifact detection methods on WESAD dataset
The performance of the proposed RFDNN for emotional activity classification is compared with the DNN [22], RNN [23], decision tree [24], SVM [24] and KNN [24] classification methods on the WESAD dataset. The comparison is made in terms of precision, recall, f-score, accuracy and the Receiver Operating Characteristics (ROC) curve. Accuracy: It is the percentage of correct classifications over the total data instances tested.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (33)$$
In Equation (33), TP represents the number of stress instances classified as stress, TN represents the number of neutral/pleasure instances classified as neutral/pleasure, FP represents the number of neutral/pleasure instances classified as stress and FN represents the number of stress instances classified as neutral/pleasure. Precision: It is the fraction of instances classified as stress that are truly stress, computed from TP and FP.
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (34)$$
Recall: It is the fraction of stress instances that are correctly classified, computed from TP and FN.
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (35)$$
F-score (F): It is calculated by
$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (36)$$
ROC curve: It is used to evaluate the classification performance. It defines the relationship between the FP Rate (FPR) and TP Rate (TPR).
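These measures can be computed from predicted and true labels with a few lines of Python. The sketch treats stress as the positive class and neutral/pleasure as the negative class, following the TP, TN, FP and FN definitions above; the function names and the toy label lists are hypothetical.

```python
def binary_counts(y_true, y_pred, positive="stress"):
    """Count TP, TN, FP and FN with `positive` as the class of interest."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is also the TPR of the ROC curve
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f_score": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # FPR of the ROC curve
    }


# Toy labels, not WESAD results.
y_true = ["stress", "stress", "neutral", "pleasure", "stress", "neutral"]
y_pred = ["stress", "neutral", "neutral", "pleasure", "stress", "stress"]
counts = binary_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
```

Repeating the computation with "pleasure" as the positive class gives the corresponding one-vs-rest figures for that class; an ROC curve is traced by recomputing the TPR and FPR while the decision threshold on the classifier score is varied.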
Performance analysis for emotion classification methods
Table 7 presents the performance results for the proposed RFDNN and existing methods on the WESAD dataset for emotion classification.
Figure 6 illustrates the efficiency of the proposed RFDNN and existing emotional activity classification methods, including decision tree [24], SVM [24], KNN [24], DNN [22] and RNN [23] on the WESAD dataset. It is noted that the proposed RFDNN classifier accomplishes a greater classification performance because of learning the most relevant artifact-free features of all physiological signals for all the emotion classes.

Comparison of existing and proposed emotional activity classification methods on WESAD dataset.
Table 8 presents the ROC results for the proposed RFDNN and existing methods on the WESAD dataset for classifying stress class.
ROC analysis for classifying stress class
Table 9 presents the ROC results for the proposed RFDNN and existing methods on the WESAD dataset for classifying pleasure class.
ROC analysis for classifying pleasure class
Figures 7 and 8 present the ROC curve results of classifying stress and pleasure class instances using various classification methods on the WESAD dataset. It is noted that at an FPR of 0.6, the TPR of the RFDNN is 33.33%, 25%, 16.28%, 11.11% and 5.26% higher than that of the decision tree, SVM, KNN, DNN and RNN methods, respectively, for classifying the stress class. Also, the TPR of the RFDNN is 72.41%, 36.99%, 17.65%, 8.7% and 2.04% higher than that of the decision tree, SVM, KNN, DNN and RNN methods, respectively, for classifying the pleasure class.

ROC curve for existing and proposed emotional activity classification methods on WESAD dataset for stress class.

ROC curve for existing and proposed emotional activity classification methods on WESAD dataset for pleasure class.
Thus, it concludes that the RFDNN can efficiently increase the accuracy of classifying human emotional activities compared to the other methods.
Table 10 presents the time and space complexities of proposed and existing emotional activity classification methods on the WESAD dataset. This shows that the proposed RFDNN has less time and space complexities compared to the existing methods for classifying human emotional activities.
Time and space complexities of different human emotional activity classification methods on WESAD dataset
Generally, in human emotion classification using physiological data, many technical problems can arise that affect the reliability of the datasets used. Some of them include data preprocessing problems, artifacts and noise, less generalizability, etc. The proposed parallel DCNN and RFDNN can ensure the quality and reliability of the datasets such as WESAD. The parallel DCNN can preprocess the dataset that includes physiological data to alleviate the artifacts, noise, or other undesired signals. This results in effective training of the RFDNN classifier for classifying different emotional states of humans with higher accuracy. Thus, both parallel DCNN and RFDNN can achieve greater reliability in the datasets used for human emotion classification.
Limitations of the study
The proposed RFDNN cannot learn long-term dependencies from all physiological data. Such long-term dependencies can be used to understand how stress and emotional states evolve over a period. Also, it did not consider latent emotional patterns that impact the classification of new emotional states beyond predefined classes. It cannot handle class imbalance problems that influence classification accuracy.
Conclusion and future work
In this study, an automated artifact detection and multimodal classification system was developed using physiological data for human emotion analysis. At first, multimodal physiological signals were collected. The MCSD was applied to extract the SCR events from the EDA signals. For SCR and other physiological signals, statistical and wavelet coefficient features were extracted and given to the parallel DCNN for finding the most relevant feature vectors per each emotion class. Such feature vectors were then fused to get the artifact-free features, which were learned by the RFDNN classifier for emotion classification. At last, extensive experiments were conducted using the WESAD dataset in MATLAB 2019b to measure the proposed parallel DCNN and RFDNN method’s performance against existing methods. The test results of the parallel DCNN show that it achieved 0.0851 APA, 0.1866 STD, 0.9106 m-R2 and 0.0763 NMSE compared to the existing methods for artifact detection. Similarly, the RFDNN achieved 93.48% precision, 94.93% recall, 94.21% f-score and 95.24% accuracy compared to the existing methods for emotional classification.
Future work will develop forest CNN with Long Short-Term Memory (LSTM) network to enhance the classification accuracy. Also, other datasets can be used to analyze the model performance.
