A novel stochastic deep conviction network for emotion recognition in speech signal

Abstract

Deep learning is widely regarded as one of the most powerful methods in computer vision, with applications such as image recognition, robot navigation systems, and self-driving cars. Recent developments in neural networks have led to efficient end-to-end architectures for human activity representation and classification. In light of these developments, there is growing interest in methods that are less expensive in computation and memory. This paper presents an optimized end-to-end approach named the stochastic deep conviction network (SDCN). It comprises a deep learning method, the deep belief network (DBN), and two supervised machine learning algorithms, the support vector machine (SVM) and the decision tree (DT), each paired with an optimization technique, for speech emotion identification. First, pre-processing is performed and features are automatically extracted from the input speech signal by the DBN. Speech signal features lose most of the information, and performance cannot be guaranteed, because dynamic interactions can generate countless emotion-specific experiences that share the same core feeling state but differ in perceptual inclination; the DBN is therefore used to provide more robust features. The next step is to classify the emotions in the training phase, for which an SVM classifier that performs two-class classification is chosen. To enhance this classification, defects must be reduced and the best discrimination of the extracted features must be obtained, so the particle swarm optimization (PSO) technique is added to the SVM classifier in the training phase. To reduce overfitting and the risks of relying on a single classifier, a DT is used in the testing phase for the exact identification of emotions (anger, disgust, fear, happiness, neutral, and sadness), which yields better performance than a single classifier.
A drawback of the decision tree is that it can increase the computation time. To eliminate this defect, the whale optimization (WO) technique is added to the decision tree to reduce the complexity of the system, which in turn lessens the time taken to recognize the emotion in the speech signal. The proposed SDCN system thereby improves the recognition rate. In this work, the MATLAB environment is used to perform speech emotion recognition. With the proposed technique, the achieved accuracy of emotion detection is above 95% and the recognition rate for the various emotions exceeds 98%, with a computation time of 23 seconds, which has not been achieved so far by any other existing technique.

Keywords

Stochastic deep conviction network, restricted Boltzmann machine, particle swarm optimization, support vector machine, whale optimization

1 Introduction

Emotions are personal, and people interpret them differently. Emotion is difficult to define, and it can also be conveyed through facial images and other cues [1]. Emotion recognition from facial images has a drawback: each action unit corresponds to a specific facial muscle movement, and action units are used both individually and in combination to define a particular facial expression, so different appearances may cause confusion and lower the recognition accuracy [2]. Thus, in search of another modality, the human speech signal comes into the picture. Emotions play a vital role throughout every person's life. At least 38% of communication is conveyed through the emotion in the speech signal. Automatic classification of emotion is provided by speech emotion identification (SEI), which is applied in various fields [3, 4]. A medical robot has also been designed using SEI; it supports an enhanced wellness program by constantly monitoring the patients' emotional state [5] and providing indicative ideas [6]. SEI is implemented with machine learning and includes both speech feature extraction and classification. Because of poor generalization, feature extraction has become an issue for all classification methods [7]. The extracted features must minimize the distances between samples of the same emotion class and maximize the distances between samples of different emotion classes [8]. Good performance cannot be obtained if the features are not well defined. In practice, the most powerful speech features cannot be clearly determined across emotions. The features extracted from speech include pitch and energy contours, which are directly affected by factors such as the speaker's speaking style, phrasing, and speaking rate [9].
A classifier cannot easily be tuned to a new speech signal, because correct classification boundaries are hard to obtain when features overlap across utterances. These difficulties can be overcome by deep learning methods, which can automatically discover several levels of representation in speech signals [10–13]. Moreover, deep learning is preferred here because it does not require manual feature extraction. Thus, most recognition tasks are handled successfully by these techniques. The challenge these methods face, however, is the complicated decision boundary of the classification [14–19]. Common ensemble classifiers include boosting-based approaches, bagging-based approaches, random subspace methods [20, 21], and so forth. A single classifier [22] cannot reduce overfitting, whereas combining different classifiers has the potential to do so; even so, the expected performance is not obtained when these methods are applied to speech emotion recognition. To solve this issue, a novel deep learning architecture is necessary. In [23], a genetic algorithm, particle swarm optimization, and artificial bee colony optimization are used to optimize the number of hidden layers and the neurons per hidden layer of an artificial neural network for maximum speech recognition accuracy. In [24], the opposition artificial bee colony optimization technique is proposed to enhance accuracy. Minimax and pole-zero constant optimization methods are proposed in [25] to improve system performance. Optimization techniques have been used by various researchers to increase accuracy [26–30]. The WO algorithm provides a well-optimized output because it can learn and correct problems found in the output of the classifier by itself.

Although the available strategies perform emotion recognition from speech, they still depend on subject-matter expertise to select features from an input; with machine learning algorithms, this dependence can be removed through optimized learning strategies and automatic feature extraction. A classifier cannot easily be tuned to incoming voice signals, because correct classification boundaries are hard to obtain when features overlap across utterances. These difficulties can also be overcome by deep learning methods, which can automatically determine numerous levels of representation in speech signals. The proposed network combines numerous successive frames to form high-dimensional features, automatically pre-processes the incoming signal, and proceeds to the next step.

This paper is organized as follows. The introduction presents speech emotion identification, and Section 2 describes the related work. The proposed framework is explained in Section 3. The experimental results are evaluated in Section 4, and Section 5 provides a brief conclusion.

2 Related works

Hook et al. [31] in 2019 reported modest performance in speech-based emotion identification with a small set of features. The outcome and the proposed features seem promising. However, distinguishing anger from happiness was found to be difficult in the SEI field for both machines and humans.

Zhao et al. [32] in 2019 proposed one-dimensional and two-dimensional (1D and 2D) convolutional neural network (CNN) and long short-term memory (LSTM) networks to identify speech emotion. They examine learning local correlations and global contextual information from sources such as raw audio clips and log-mel spectrograms. Although the proposed methodology improves performance in speech emotion identification, various aspects still need improvement.

Badshah et al. [33] in 2019 presented a CNN with rectangular kernels to recognize emotions in speech. The proposed technique could be improved further if more labeled data were gathered, so that deeper CNNs with rectangular kernels could be trained effectively with an enhanced recognition rate.

Pu et al. [34] in 2019 have demonstrated a robust principal component analysis (RPCA) which decomposes a data matrix into a superposition of a low-rank matrix and a sparse matrix under certain incoherent conditions. In this paper, a nonlinear generalization of RPCA is proposed that uses two autoencoder networks to achieve such a decomposition, in which one autoencoder accounts for the low-rank component and the other for the sparse components.

Gupta et al. [35] proposed a novel CNN architecture with a spatial pyramid pooling (SPP) layer, which works on variable-length feature representations of speech signals to classify the emotions in various speech inputs. The benefit of the proposed method is that it accepts variable-length feature representations of speech signals as input. Its restriction is that it requires a CNN model that produces feature maps of varying size. Shen et al. [36] in 2019 explained a cloud removal procedure based on multisource data fusion to overcome this limitation. Building on temporal-based approaches, which employ a cloud-free image as reference, this method further introduces two auxiliary images with similar wavelengths and close acquisition dates to the reference and target (contaminated) images into the reconstruction process.

Wei et al. [37] in 2019 have proposed a novel speech emotion recognition algorithm based on an improved stacked kernel sparse deep model, which is based on auto-encoder, denoising auto-encoder and sparse auto-encoder to improve the Chinese speech emotion recognition. The first layer of the structure uses a denoising autoencoder to learn a hidden feature with a larger dimension than the dimension of the input features, and the second layer employs a sparse auto-encoder to learn sparse features. Finally, a wavelet-kernel sparse SVM classifier is applied to classify the features. The proposed algorithm is evaluated on the testing dataset, which contains the speech emotion data of spontaneous, non-prototypical, and long-term accuracy.

Huang et al. [38] in 2019 have described a novel sub-band spectral centroid weighted wavelet packet cepstral coefficients (W-WPCC) for robust speech emotion recognition. The W-WPCC feature is computed by combining the sub-band energies with sub-band spectral centroids via a weighting scheme to generate noise-robust acoustic features. Deep belief networks (DBNs) are artificial neural networks having more than one hidden layer, which are first pre-trained layer by layer and then fine-tuned using the backpropagation algorithm. The well-trained deep neural networks are capable of modeling complex and non-linear features of input training data and can better predict the probability distribution over classification labels.

3 Proposed methodology

Speech emotion recognition enables interaction between humans and machines. The speech signal is taken as the input and processed to determine the emotion it carries. Dynamic interactions can generate numerous emotions, which may cause the loss of important features; to extract effective features, a deep learning algorithm is therefore adopted in the proposed work. It identifies emotions by extracting progressively higher-level features from the data; moreover, it removes the need for a subject-matter expert and performs automatic feature extraction. A classifier cannot easily be tuned to incoming voice signals, because correct classification boundaries are hard to obtain when features overlap across utterances. These difficulties can be overcome by deep learning methods, which can automatically determine numerous levels of representation in speech signals. To solve the above problem, the stochastic deep conviction network (SDCN) is introduced here. It is an ensemble of a deep learning method, the DBN, and two machine learning techniques, the support vector machine and the decision tree, combined with the optimization techniques PSO and WO. The proposed network includes a process that forms high-dimensional features, automatically pre-processes the incoming signal, and proceeds to the next step. In the training phase, feature extraction is performed on the training set. First, only low-level features are extracted; these serve as input to the DBN, which then extracts high-level features. To improve the classification of different emotions, the DBN is combined with an SVM optimized by PSO, which provides the best training on the extracted high-level features. After that, the different emotions are classified using the trained model.
In the testing phase, a DBN-inspired decision tree with whale optimization is used, which categorizes N emotions from the trained voice signal. The overall framework introduces the SDCN method for speech emotion identification, combined with a generic stochastic subspace to improve the identification of emotion from speech. The overall framework is shown in Fig. 1.

Fig. 1

Block diagram of the proposed system.

The proposed SDCN system extracts features automatically using the DBN and then applies a training phase and a testing phase. In the training phase, the SVM classifier uses the extracted speech emotion features to perform classification, and its output falls into two classes. Thus, only two categories can be distinguished, in two dimensions (2D). Moreover, features that lie near the boundary may lead to less accurate output, so these defects are reduced using the PSO technique [39], which evaluates the fitness value only after updating the velocity and position of each particle. The optimized output obtained in this way is the trained output. It is fed into the testing phase, which decides the emotion using the decision tree. All the emotion categories in the speech (anger, disgust, fear, happiness, neutral, and sadness) can be differentiated, but the process is not fast, because complications arise when several emotions are classified. The decision tree therefore becomes complex during this process. These complexities are reduced using the whale optimization technique [40], whose whale search mechanism increases overall system efficiency, reliability, and security. The optimized output obtained indicates that the proposed method is well suited to recognizing emotions from voice input data.

3.1 Input voice data

Speech is a significant way to communicate between individuals which will include various emotions. The tone of the voice can also represent the emotional state of a person.

The short time discrete input speech signal is given by $S (n) = \sum_{k = 1}^{p} a_{k} S (n - k) + Gv (n)$ (1)

Here S(n) is the input signal, v(n) is the excitation, G is the gain parameter, a_k are the prediction coefficients, and p is the prediction order (an integer).
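As an illustration of Equation 1, the following minimal sketch generates a short signal from given prediction coefficients, gain, and excitation. The function name `synthesize`, the zero initial history, and the numerical values are assumptions made for this example; they are not taken from the experiments.

```python
def synthesize(a, G, v):
    """Generate S(n) = sum_{k=1..p} a[k-1] * S(n-k) + G * v(n), with S(n) = 0 for n < 0."""
    p = len(a)
    S = []
    for n in range(len(v)):
        # Weighted sum of the p previous samples (missing history is treated as zero).
        past = sum(a[k - 1] * S[n - k] for k in range(1, p + 1) if n - k >= 0)
        S.append(past + G * v[n])
    return S


# Hypothetical values: a second-order predictor driven by a unit impulse.
a = [0.5, -0.25]
signal = synthesize(a, G=2.0, v=[1.0, 0.0, 0.0, 0.0])
print(signal)
```

With an impulse excitation, the output is the impulse response of the all-pole model defined by the coefficients a_k, scaled by G.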

3.1.1 Extraction of features

The common parameters which can be computed from the speech signal are the vocal quality which includes voice type, glottal attack, resonance, pitch, loudness, respiratory dynamics, and vocal registers. The parameters of voice given in Table 1 are said to be multidimensional in nature.

Table 1
Parameters of the voice signal

Physical	Acoustic	Perceptual
Rate of vibration	Frequency (Hz)	Pitch
Amplitude of vibration	Intensity (dB)	Loudness
Periodicity of vibration	Perturbation-jitter, shimmer	Quality
Complexity of vibration	Range	Flexibility

The input voice signal must undergo two primary steps: pre-processing and feature extraction. Pre-processing is normally performed to remove unwanted noise, but it is not applied as a separate step in this work, because the proposed SDCN method automatically performs pre-processing and dimensionality reduction.

3.2 Proposed stochastic deep conviction network (SDCN)

Deep learning plays an important role in extracting different speech emotion features and obtaining all the parameters. However, tuning these parameters is a very expensive process. To avoid searching for optimal parameters, an ensemble learning framework is used. Ensemble learning alone does not improve speech emotion recognition, so a random subspace is built in to train the base classifier of the ensemble. Even then, the description of the speech signal is imperfect, and the low-level features in the subspaces degrade the ensemble classifier's performance. To avoid this problem, a stochastic deep conviction network is introduced. The overall proposed framework is shown in Fig. 2.

Fig. 2

Framework of SDCN.

Initially, a DBN, which is built from a large number of restricted Boltzmann machines (RBMs), is used to extract the high-level features, so that a high-level representation suited to speech emotion recognition can be learned. A voice signal taken as input may contain unwanted noise that does not help in recognizing the emotion, so this noise must be eliminated. After that, features are extracted automatically with the help of a greedy layer-wise learning strategy. A generic stochastic subspace is introduced to improve emotion identification by training the chosen base classifier. In the training phase, the DBN is combined with the SVM–PSO classifier, which selects the required parameters and classifies the emotions of the speech. Finally, the emotions of the input speech signal are labeled.

The general deep conviction network structural probability distribution function is given below, $P (d, c) = \frac{e^{- E (d, c)}}{\sum_{d, c} e^{- E (d, c)}}$ (2)

Here d gives the features of the input layer and is the state of the detectable layer. The concealed layer c aims to find dependencies between observed variables. W_ij represents the weight between the detectable unit d_i and the concealed unit c_j. P (d, c) is the joint probability distribution of (d, c) and is given by the Gibbs distribution.

The energy function is given by $E (d, c) = - \sum_{i, j} W_{ij} d_{i} c_{j} - \sum_{i} b_{i} d_{i} - \sum_{j} a_{j} c_{j}$ (3)

where W_ij represents the weight between the detectable unit d_i and the concealed unit c_j, and b and a represent the offsets of the detectable units and the concealed units, respectively, in the RBM. These parameters are learned from the training data by the DBN. Training and testing with these parameters are time-consuming, and the features of different emotions may mislead the model and produce inaccurate results.
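The energy function and the Gibbs distribution of Equations 2 and 3 can be checked numerically on a very small RBM. The sketch below is only illustrative: it assumes binary units, enumerates every state to compute the normalizing sum (feasible only for a handful of units), and uses hypothetical toy parameters.

```python
import itertools
import math


def energy(d, c, W, b, a):
    """E(d, c) = -sum_ij W[i][j] d[i] c[j] - sum_i b[i] d[i] - sum_j a[j] c[j]."""
    interaction = sum(W[i][j] * d[i] * c[j] for i in range(len(d)) for j in range(len(c)))
    visible_bias = sum(b[i] * d[i] for i in range(len(d)))
    hidden_bias = sum(a[j] * c[j] for j in range(len(c)))
    return -interaction - visible_bias - hidden_bias


def joint_probability(d, c, W, b, a):
    """P(d, c) = exp(-E(d, c)) / Z, with Z summed over every binary (d, c) pair."""
    n_vis, n_hid = len(b), len(a)
    Z = sum(
        math.exp(-energy(dd, cc, W, b, a))
        for dd in itertools.product([0, 1], repeat=n_vis)
        for cc in itertools.product([0, 1], repeat=n_hid)
    )
    return math.exp(-energy(d, c, W, b, a)) / Z


# Hypothetical toy parameters: 2 detectable units, 1 concealed unit.
W = [[1.0], [-1.0]]
b = [0.0, 0.0]
a = [0.0]
print(energy((1, 0), (1,), W, b, a))
print(joint_probability((1, 0), (1,), W, b, a))
```

States with lower energy receive higher probability, which is why a positive weight W_ij favors the detectable unit d_i and the concealed unit c_j being active together.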

The traditional deep network has various shortcomings, such as limited accuracy and long computation time, so a stochastic deep conviction network (SDCN) is used here, in which the DBN's shortcomings are overcome by an additional optimized classifier for both training and testing. The sections below describe the working of each part of the SDCN.

3.2.1 Training phase of proposed SDCN

In the training phase of the SDCN, the SVM classifier and the PSO optimization technique are introduced. The SVM classifier is among the most dominant machine learning algorithms for data classification. It maximizes the margin between the boundary points of the classes and the separating hyperplane. Training the SVM classifier amounts to solving a quadratic programming problem, which avoids the convergence problem. The SVM classifier builds a linear model that depends only on the support vectors to determine the decision function. If the training data are linearly separable, the classifier finds the optimal hyperplane that separates the data without error. The SVM is regarded as one of the best classifiers because of its maximum-margin output, its ability to handle very high-dimensional samples, and its convergence behavior. SVMs attain significantly higher search accuracy than traditional query refinement schemes. Even so, the SVM classifier has disadvantages, such as the lack of a clear relationship between the distance from the margin and the posterior class probability. Another issue is that, in its original formulation, it is confined to input vectors of fixed dimension. The optimum separating hyperplane can be found by minimizing ||w||² under the constraint y_i (w . x_i + b) ≥ 1, i = 1, 2, …, n, where w is the normal vector of the hyperplane and b is the bias. Adding margin slack variables ξ_i allows a controlled violation of the constraints.

The determination of the optimum hyperplane requires solving the optimization problem given by: $\min \frac{1}{2} \| w \|^{2}$ (4)

s.t. y_i (w . x_i + b) ≥ 1, i = 1, 2, 3, …, n

The new optimization problem is given as: $\min \frac{1}{2} \| w \|^{2} + c \sum_{i = 1}^{n} \xi_{i}$ (5)

s.t. y_i (w . x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, 2, 3, …, n

c is a free parameter (also known as the regularization parameter or penalty factor).

Selecting the hyperplane in the SVM classifier consumes additional time, so the hyperplane is selected optimally here, as indicated by Equations 4 and 5. The condition in Equation 5 replaces Equation 4 of the general SVM classifier to obtain the exact hyperplane.
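To make the role of the slack variables and the penalty factor c concrete, the sketch below evaluates the objective of Equation 5 for a fixed hyperplane. It assumes that each slack takes its smallest feasible value, ξ_i = max(0, 1 − y_i (w . x_i + b)); the function name and the sample points are hypothetical.

```python
def soft_margin_objective(w, b, X, y, c):
    """Return (0.5 * ||w||^2 + c * sum(xi), xi) with xi_i = max(0, 1 - y_i (w . x_i + b))."""
    slacks = []
    for x_i, y_i in zip(X, y):
        margin = y_i * (sum(w_k * x_k for w_k, x_k in zip(w, x_i)) + b)
        slacks.append(max(0.0, 1.0 - margin))
    norm_sq = sum(w_k * w_k for w_k in w)
    return 0.5 * norm_sq + c * sum(slacks), slacks


# Hypothetical 2D samples with labels +1 / -1; the second sample lies inside the margin.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
objective, slacks = soft_margin_objective(w=(1.0, 0.0), b=0.0, X=X, y=y, c=10.0)
print(objective, slacks)
```

Only the sample that violates the margin contributes a non-zero slack, and a larger c makes such violations more costly relative to the margin term.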

A final limitation is that this classifier only assigns classes; it does not give a dependable measure of the probability that the classification is correct. Equation 6 represents the positive SVM classification. $Y_{i} = \arg \max \left( \sum_{i = 1}^{16} \alpha_{i} y_{i} K (z, z_{i}) + b \right)$ (6)

Here Y_i describes the speech emotion features, b represents the bias, arg max denotes the argument of the maximum, K is the kernel, y_i is the non-linear function, and α_i are the Lagrange multipliers.
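The quantity inside Equation 6 can be sketched for the two-class case as follows. This is an illustration under stated assumptions rather than the configuration used in the experiments: a Gaussian (RBF) kernel is assumed, y_i is treated as the class label (+1 or −1) of the support vector z_i, which is the usual convention, and the support vectors and multipliers are hypothetical.

```python
import math


def rbf_kernel(z, z_i, gamma=1.0):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(z, z_i)))


def decision_value(z, support_vectors, alphas, labels, b):
    """Sum_i alpha_i * y_i * K(z, z_i) + b, the quantity inside Equation 6."""
    return sum(
        alpha * y_i * rbf_kernel(z, z_i)
        for alpha, y_i, z_i in zip(alphas, labels, support_vectors)
    ) + b


# Hypothetical support vectors, multipliers, and labels.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
alphas = [1.0, 1.0]
labels = [1, -1]
score = decision_value((0.0, 0.0), support_vectors, alphas, labels, b=0.0)
print("positive" if score > 0 else "negative")
```

The sign of the decision value gives the predicted class; its magnitude is not a calibrated probability, which is the limitation noted above.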

Figure 3 represents the structure of the SVM classifier. x(i-1), x(i-2), ... x(i-p) are the extracted features fed into the SVM classifier for training. In our approach, spectral features, prosodic features, and Hu moments for weighted spectral features (HuSWF) are combined. The spectral features comprise linear predictor cepstral coefficients (LPCC), zero crossings with peak amplitudes (ZCPA), and perceptual linear predictive (PLP) coefficients [39]. Prosodic features are often used together with spectral features in speech emotion recognition, as they complement them well within the SVM classifier; these features pass through a non-linear function (Ø) and a kernel function (K). Finally, the predicted output is determined. In this paper, the SVM classifier is executed along with the PSO optimization technique. Y_i is the optimized output, shown as gbest in Fig. 4.

Fig. 3

Structure of SVM classifier.

Fig. 4

Process Flow of PSO Optimization.

Figure 4 shows the process that takes place when the PSO optimization algorithm is inserted inside a procedure.

Each particle’s velocity is updated using this Equation:

$l_{i} (t + 1) = a \, l_{i} (t) + e_{1} s_{1} (t) [x_{pbest} (t) - x_{i} (t)] + e_{2} s_{2} (t) [x_{gbest} (t) - x_{i} (t)]$ (7)

i is the particle index

a is the inertial coefficient

e₁,e₂ are acceleration coefficients, 0 ≤ e₁, e₂ ≤ 2

s₁, s₂ are random values, 0 ≤ s₁, s₂ ≤ 1, regenerated at every velocity update

Each particle’s position is updated using Equation 8: $x_{i} (t + 1) = x_{i} (t) + l_{i} (t + 1)$ (8)

The PSO technique has two phases: an initialization phase and a cycle phase. It begins by initializing each particle with a random position and velocity vector. After initialization, the fitness value is calculated for every particle position. If fitness (p) is better than fitness (pbest), then pbest = p. From the acquired pbest values, the best one is chosen as gbest. Both the position and the velocity are then updated. This cycle, called the cycle phase, repeats until the final gbest is selected as the optimal solution. The optimized output is then provided to the testing phase. An advantage of PSO over other techniques is that it has fewer algorithmic parameters to specify. The well-optimized data is taken as the trained data.
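The initialization and cycle phases described above can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes that a lower fitness is better, starts from zero velocities, draws the random factors s₁ and s₂ from [0, 1), and uses hypothetical values for the swarm size, the inertial coefficient a, and the acceleration coefficients e₁ and e₂. A simple sphere function stands in for the classifier-based fitness.

```python
import random


def pso(fitness, dim, n_particles=20, iterations=100, a=0.7, e1=1.5, e2=1.5, seed=0):
    """Minimise `fitness` with the velocity and position updates of Equations 7 and 8."""
    rng = random.Random(seed)
    # Initialization phase: random positions, zero velocities (an assumption).
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]  # l_i in Equation 7
    pbest = [p[:] for p in x]
    pbest_fit = [fitness(p) for p in x]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    # Cycle phase: update velocity, position, pbest, and gbest.
    for _ in range(iterations):
        for i in range(n_particles):
            for k in range(dim):
                s1, s2 = rng.random(), rng.random()  # regenerated at every update
                vel[i][k] = (a * vel[i][k]
                             + e1 * s1 * (pbest[i][k] - x[i][k])
                             + e2 * s2 * (gbest[k] - x[i][k]))
                x[i][k] += vel[i][k]
            fit = fitness(x[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = x[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = x[i][:], fit
    return gbest, gbest_fit


def sphere(p):
    """Stand-in fitness with its minimum of 0 at the origin."""
    return sum(v * v for v in p)


best, best_fit = pso(sphere, dim=2)
print(best, best_fit)
```

In the proposed system the fitness would be derived from the SVM classification result rather than from a test function; here a fixed iteration budget replaces the stopping rule for simplicity.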

Figure 5 shows the flow graph of the hybrid form combining the SVM classifier and the PSO optimization technique. The training samples are fed to feature extraction, and after high-level feature extraction, classification is performed on the extracted features using the SVM as the base classifier. The classification output is then optimized using the PSO optimization technique. The hybrid form provides the class label of the input sample as its output. The aim of this combination is to reduce the time required, and its effectiveness can be estimated by comparison with the SVM classifier alone.

Fig. 5

Flow process representing the hybrid form combining SVM classifier and PSO optimization technique.

The experimental results show that both the sound verification and the training time of the proposed combination are improved over using the SVM classifier alone. Furthermore, this hybrid form converges more rapidly than traditional gradient descent. Compared with the regular algorithm, both time and accuracy improve. The output of the SVM classifier is passed to PSO, because the SVM classifier has a few defects that the PSO optimization technique rectifies. The optimized output is gbest. $F (t) = gbest$ (9)

Let ‘F(t)’ be the output of the training phase. This output is fed to the testing phase as its input. Using the SVM classifier thus avoids the quadratic programming and convergence problems, and combining this classification method with the optimization process reduces the time required and increases the effectiveness.

3.2.2 Testing phase for the proposed stochastic deep conviction network

In the testing phase, a hybrid form combining the decision tree and the whale optimization method is introduced. Decision tree models are powerful analytical models that are simple to analyze, visualize, implement, and score. They are also good at handling variable interactions and at modeling convoluted decision boundaries by piecewise approximations. However, decision trees have disadvantages, including complexity and long run times. The cost of training makes decision tree analysis an expensive choice. To reduce these disadvantages, the whale optimization method is added, so that speech emotion can be recognized with high effectiveness and reduced complexity. Let the output of the decision tree be S_i. The whale optimization method is a recent metaheuristic optimization technique based on the hunting mechanism of whales. Compared with various other techniques, the WO algorithm provides a well-optimized output because it can learn and correct problems found in the output of the decision tree by itself. This ability is limited in traditional techniques, so the whale optimization method is preferred in this paper. The position update around the best search agent is given by the equations below, $\vec{D} = | \vec{C} \cdot \vec{s}^{*} (t) - \vec{s} (t) |$ (10) $\vec{s} (t + 1) = \vec{s}^{*} (t) - \vec{A} \cdot \vec{D}$ (11)

where $\vec{A}$ and $\vec{C}$ are coefficient vectors, $\vec{s}^{*}$ is the position vector of the best solution found so far, $\vec{s}$ is the position vector, and t indicates the current iteration. $\vec{A} = 2 \vec{a} \cdot \vec{r} - \vec{a}$ (12) $\vec{C} = 2 \cdot \vec{r}$ (13)

where the value of $\vec{a}$ is decreased linearly from 2 to 0 over the algorithm iterations, and $\vec{r}$ is a random vector in [0, 1].

The second phase is the exploitation phase, which is accomplished by two techniques: the shrinking encircling technique and the spiral updating position. $\vec{s} (t + 1) = \vec{D}' \cdot e^{bl} \cdot \cos (2 \pi l) + \vec{s}^{*} (t)$ (14)

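Here $\vec{D}'$ is formulated as $\vec{D}' = | \vec{s}^{*} (t) - \vec{s} (t) |$, b is a constant that defines the shape of the logarithmic spiral, and l is a random number in [−1, 1]. The two techniques are combined as $\vec{s} (t + 1) = \begin{cases} \vec{s}^{*} (t) - \vec{A} \cdot \vec{D} & \text{if } p < 0.5 \\ \vec{D}' \cdot e^{bl} \cdot \cos (2 \pi l) + \vec{s}^{*} (t) & \text{if } p \geq 0.5 \end{cases}$ (15)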

Here p is a random number in the interval [0, 1]. In global search, the new position is formulated as in Equations (16) and (17). $\vec{M} = \vec{D} = | \vec{C} \cdot \vec{s}_{rand} - \vec{s} |$ (16) $\vec{N} = \vec{s} (t + 1) = \vec{s}_{rand} - \vec{A} \cdot \vec{M}$ (17)

Here $\vec{s}_{rand}$ represents the position of a randomly chosen whale. After the whale optimization method is completed, $\vec{N}$ is obtained as the output.

Thus the effectiveness and accuracy are increased, and the time consumption and complexity are reduced.
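The encircling, spiral, and global search updates of Equations 10 to 17 can be sketched as follows. Several choices here are assumptions made for illustration: a lower fitness is taken as better, A and C are drawn as one scalar per whale instead of a full vector, global search around a random whale is used when |A| ≥ 1 (as in the standard whale optimization formulation), and the spiral constant b, the population size, and the iteration count are hypothetical. A sphere function stands in for the fitness derived from the decision tree.

```python
import math
import random


def whale_optimization(fitness, dim, n_whales=20, iterations=200, b=1.0, seed=0):
    """Minimise `fitness` with the updates of Equations 10 to 17 (simplified sketch)."""
    rng = random.Random(seed)
    whales = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_whales)]
    best = min(whales, key=fitness)[:]
    best_fit = fitness(best)

    for t in range(iterations):
        a = 2.0 * (1 - t / iterations)      # decreases linearly from 2 towards 0
        for i, s in enumerate(whales):
            A = 2 * a * rng.random() - a    # Equation 12 (scalar simplification)
            C = 2 * rng.random()            # Equation 13 (scalar simplification)
            p = rng.random()
            l = rng.uniform(-1, 1)
            if p < 0.5:
                # |A| < 1: encircle the best whale (Equations 10 and 11);
                # otherwise search around a random whale (Equations 16 and 17).
                ref = best if abs(A) < 1 else whales[rng.randrange(n_whales)]
                new = [ref[k] - A * abs(C * ref[k] - s[k]) for k in range(dim)]
            else:
                # Spiral updating position (Equation 14).
                new = [abs(best[k] - s[k]) * math.exp(b * l) * math.cos(2 * math.pi * l) + best[k]
                       for k in range(dim)]
            whales[i] = new
            fit = fitness(new)
            if fit < best_fit:
                best, best_fit = new[:], fit
    return best, best_fit


def sphere(p):
    """Stand-in fitness with its minimum of 0 at the origin."""
    return sum(v * v for v in p)


best, best_fit = whale_optimization(sphere, dim=2)
print(best, best_fit)
```

As a decreases, |A| shrinks, so the search moves from exploration around random whales to exploitation around the best solution found so far.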

4 Result and discussion

Experiments are performed on the input speech database. An input speech signal can carry any one of the following emotions: anger, disgust, fear, happiness, neutral, and sadness. The proposed system recognizes which of these emotions is present in the input speech signal.

Figure 6 shows the waveform of an incoming speech signal from the input speech database, used for experimentation and for extracting the required features. The novel idea introduced here is to use the SVM classifier in the training phase and the decision tree method in the testing phase. The trained model is capable of identifying the emotion in the speech signal with high accuracy and efficiency. The output is enhanced using PSO optimization, which boosts the system performance. The complexity of the output obtained from the decision tree is reduced by the whale optimization technique.

Fig. 6

Input Speech Database waveform.

4.1 Speech emotion databases

To validate the proposed SDCN, experiments are performed on a speech database. The Berlin emotional speech database in German (EMODB) [28] is one of the best-known databases used for speech emotion identification. Six emotion classes are present in this database. The number of samples in each class is as follows: anger (127), fear (69), disgust (46), happiness (71), neutral (79), and sadness (62).
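The class counts listed above are unbalanced, which matters when accuracy figures are interpreted. The short sketch below computes the total number of samples and the share of each class from those counts; the variable names are chosen for illustration only.

```python
# Class counts as listed for the six emotion classes of the database.
counts = {
    "anger": 127,
    "fear": 69,
    "disgust": 46,
    "happiness": 71,
    "neutral": 79,
    "sadness": 62,
}

total = sum(counts.values())
# Share of each class in percent, rounded to one decimal place.
shares = {emotion: round(100 * n / total, 1) for emotion, n in counts.items()}
largest = max(counts, key=counts.get)

print(total)
print(shares)
print(largest)
```

The share of the largest class is the accuracy that a classifier would reach by always predicting that class, which gives a simple baseline for the recognition rates reported in this section.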

4.2 Simulation work using our proposed technique

The input data undergo pre-processing and feature extraction simultaneously using the proposed SDCN, and the extracted features are then fed to the SVM classifier. Because this classifier has a few defects, they are rectified by adding the PSO optimization technique. The aim of this combination is to reduce the time taken. The SVM-only approach is compared with the combination of SVM and PSO in order to estimate its effectiveness. The experimental results indicate that both the sound verification and the training time of the suggested combination are improved. Furthermore, the combination was found to converge more rapidly than traditional gradient descent. Communication is effective only if the emotion is not misinterpreted and the reply is quick, but the complexity of the decision tool can cause a delay. This defect is addressed using whale optimization, which reduces the complexity of the system and so lessens the time taken to recognize the emotion in the speech signal. Thus the proposed SDCN, together with the enhancing techniques involved, fulfils the requirements for recognizing the emotions in speech.

Figure 7 shows the data classified using an SVM classifier, which uses a threshold value to separate the given data. The two classes are shown in different colors. Values above the threshold are shown in green, above the hyperplane, and values below the threshold are shown in red, below the hyperplane, which makes the classification clearly visible. Here x₁ and x₂ are the features used for classification.
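A decision rule of this kind can be sketched as the sign of a linear score over the two features. The snippet below is an illustration under stated assumptions: a linear hyperplane w·x + b = 0 is assumed, and the weights `w` and bias `b` are hypothetical values chosen for the example, not parameters reported in the paper.

```python
def linear_svm_predict(x, w, b):
    """Classify a feature vector x by the side of the hyperplane w.x + b = 0.

    Returns (label, score): label is +1 on or above the hyperplane
    (the green region in the description) and -1 below it (the red region).
    """
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    label = 1 if score >= 0 else -1
    return label, score

# Hypothetical hyperplane x1 + x2 - 2 = 0 in the (x1, x2) feature plane.
w_example = (1.0, 1.0)
b_example = -2.0

print(linear_svm_predict((2.0, 1.0), w_example, b_example))   # point above the hyperplane
print(linear_svm_predict((0.5, 0.5), w_example, b_example))   # point below the hyperplane
```

The magnitude of the score indicates how far a point lies from the decision boundary, so points with a score close to zero are the ones near the threshold.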

Fig. 7

Classification of positive and negative classes using SVM classifier and PSO optimization technique.

To show the classification clearly, a hyperplane in 2D is used so that the classification remains visible even when a data point lies close to the threshold value. In Fig. 8 the decision boundary represents the threshold value. Here x₁ and x₂ are the features used for classification. Data points near the threshold value are represented by dots.

Fig. 8

Hyperplane in 2D which represents the classifications properly.

The graph in Fig. 9 displays the initial data set of features, plotted as fitness value against iteration. The required features are extracted automatically using DCN and are represented by different symbols in order to differentiate the emotions present in the input speech signal. Figure 10 shows that fear is the dominant emotion in the input speech signal, whereas anger and disgust make up a much smaller share.

Fig. 9

Initial data sets of features are represented.

Fig. 10

Regions of various emotions, including anger, disgust, fear, and happiness, are represented.

Figure 11 shows training and testing points for various emotions, which include anger, fear, disgust, happiness, neutral, surprise and sadness. Here Et is the emotion threshold and En denotes the total emotion features. Based on this classification the actual emotion can be identified, but the SVM classifier alone does not give sufficiently accurate output; hence PSO optimization is applied to the output of the SVM classifier.

Fig. 11

Training and Test points Emotion classification.

When PSO optimization is applied, the fitness value is calculated only after the velocity and position of each particle have been updated. The fitness values evaluated at each iteration form the convergence curve plotted in Fig. 12. The different colors in the plot indicate the curves of fitness value against iteration for the different emotions.
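The update order described above (velocity, then position, then fitness) and the resulting convergence curve can be sketched as follows. This is a generic particle swarm loop under assumptions, not the configuration used in the paper: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the search bounds and the toy sphere fitness function are all hypothetical, and fitness is assumed to be minimized.

```python
import random

def pso_minimize(fitness, dim, n_particles=20, n_iters=50,
                 w=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimal PSO loop; returns (best position, best fitness, convergence curve)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                      # personal best positions
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]     # global best so far
    curve = []
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # 1) update velocity, 2) update position
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            # 3) evaluate fitness only after the particle has moved
            fit = fitness(pos[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
        curve.append(gbest_fit)                      # one point per iteration
    return gbest, gbest_fit, curve

# Hypothetical usage on a toy sphere function.
best, best_fit, curve = pso_minimize(lambda p: sum(x * x for x in p), dim=2, n_iters=30)
```

Plotting `curve` against the iteration index gives a convergence curve of the kind described; because the best fitness is only ever replaced by a better one, the curve never increases.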

Fig. 12

The convergence curve is plotted between fitness value and iterations.

Each prediction falls into one of four outcomes: true positive (TP), true negative (TN), false positive (FP) and false negative (FN). A true positive means that the class is actually present and is also predicted. A true negative means that the class is neither actually present nor predicted. A false positive means that the class is predicted but is not actually present. A false negative means that the class is actually present but is not predicted. With the enhancement methods included, optimized outputs are obtained that show a clear improvement in performance. This is verified by the precision, recall and accuracy of the applied techniques, shown in Fig. 13.
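For a multi-class problem such as emotion recognition, the four outcomes are usually counted per emotion in a one-vs-rest fashion. The sketch below assumes that convention; the function name and the toy label lists are illustrative and are not data from the experiments.

```python
def one_vs_rest_counts(y_true, y_pred, emotion):
    """Count TP, FP, FN, TN for one emotion, treating all other emotions as negative."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == emotion and actual == emotion:
            tp += 1      # emotion present and predicted
        elif predicted == emotion:
            fp += 1      # emotion predicted but not present
        elif actual == emotion:
            fn += 1      # emotion present but not predicted
        else:
            tn += 1      # emotion neither present nor predicted
    return tp, fp, fn, tn

# Hypothetical toy labels for illustration only.
actual_labels = ["anger", "anger", "fear", "sadness"]
predicted_labels = ["anger", "fear", "fear", "sadness"]
print(one_vs_rest_counts(actual_labels, predicted_labels, "fear"))
```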

Fig. 13

Bar chart to display the percentage of precision, recall, and accuracy obtained by using the proposed method.

4.2.1 Performance analysis of different parameters for different emotions

The performance of the proposed method is improved in comparison with existing methods. The following graphs show precision, recall, F-score and accuracy for the various emotions. These measures are calculated from TP, FP, FN and TN as follows.

$Precision = TP / (TP + FP)$ (18)

$Recall = TP / (TP + FN)$ (19)

$F\text{-}Score = 2 * (Recall * Precision) / (Recall + Precision)$ (20)

$Accuracy = (TP + TN) / (TP + FP + FN + TN)$ (21)
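The four measures translate directly into code. The sketch below follows Eqs. (18) to (21); returning 0.0 when a denominator is zero is an assumed convention, and the counts in the usage line are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, f_score, accuracy) from the four outcome counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0          # Eq. (18)
    recall = tp / (tp + fn) if (tp + fn) else 0.0             # Eq. (19)
    if recall + precision:
        f_score = 2 * (recall * precision) / (recall + precision)  # Eq. (20)
    else:
        f_score = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0            # Eq. (21)
    return precision, recall, f_score, accuracy

# Hypothetical counts for a single emotion class.
precision, recall, f_score, accuracy = classification_metrics(tp=90, fp=10, fn=10, tn=90)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f_score={f_score:.2f} accuracy={accuracy:.2f}")
```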

Figure 14 displays the precision, recall, F-score and accuracy for all the emotions mentioned. Figure 14(a) shows precision, the fraction of samples assigned to an emotion that actually belong to it. Since precision is high for all emotions, few samples are wrongly assigned to any emotion class. In Fig. 14(b) the recall value is plotted for the various emotions. Recall is the number of relevant items retrieved divided by the total number of existing relevant items; here, it is the fraction of samples of a given emotion that are correctly recognized.

Fig. 14

Various Emotions with (a) Precision; (b) Recall; (c) F-Measure and (d) Accuracy.

From Fig. 14(b), it is clear that the recall value is high for all emotions. In Fig. 14(c) the F-score is plotted, based on the recall and precision values of the previous graphs. If the class distribution is uneven, the F-score is more useful than accuracy. In Fig. 14(d) accuracy is plotted, which shows that the proposed method achieves high accuracy for every emotion. Communication is effective only if the emotion is not misinterpreted and the reply is quick. The results show that the accuracy of emotion detection is above 95% using the proposed SDCN.

4.3 Performance analysis and comparison of the proposed method with the existing techniques

In contrast to other existing methods, the proposed methodology consists of an inbuilt deep learning architecture, SDCN, which automatically filters and pre-processes the incoming speech signal and extracts the required features. The approach uses the SVM classifier in the training phase on the input voice data and the decision tree method in the testing phase on the trained data, which makes it possible to identify the emotion in the speech signal with high accuracy and efficiency. The output is enhanced using PSO optimization, which boosts the performance of the system. The complexity of the output obtained from the decision tree is reduced by the whale optimization technique. The optimized output has a precision of 96.94%, a recall of 97.42% and an accuracy of 98.21%, which indicates that the proposed method recognizes emotions from voice input more effectively than the compared methods. When the proposed method is compared with existing methods such as CNN and LSTM [41], the tabulated results show that the proposed method scores highest in every category.

From Fig. 15, it is clear that CNN and LSTM have accuracy, precision, F-score and recall values in the 80s, whereas the proposed method gives values in the 90s, which shows that the proposed system improves all of these values. CNN, a class of DNN, consists of several convolution and max-pooling layers followed by a fully connected network [45]. Although CNN works in the frequency domain to standardize acoustic variations for speech recognition, its computation cost and time are greater than those of the proposed SDCN technique.

Fig. 15

The (a) Precision, (b) Recall and (c) F-Score are plotted for various emotions by comparing the proposed method with CNN and LSTM [44].

LSTM is an artificial recurrent neural network (RNN). An LSTM cell comprises an input gate, an output gate and a forget gate. Although LSTM reduces long time-lag problems, there is no memory associated with the model, which causes a problem for sequential data such as time series.

Precision measures how many of the samples assigned to an emotion truly belong to it, and it is computed independently of the accuracy value. The graph in Fig. 15(a) plots the precision percentage for the various emotions (anger, disgust, fear, happiness, neutral, sadness), comparing the proposed method with the CNN and LSTM methods.

The recall value is plotted for the various emotions in Fig. 15(b). Recall is the number of relevant items retrieved divided by the total number of existing relevant items. For the existing methods the recall value is comparatively low, as can be seen in the graph plotted for CNN, LSTM and the proposed method. The F-score is plotted based on the recall and precision values obtained. If the class distribution is uneven, the F-score is more useful than accuracy. The F-score plot also shows that the proposed method gives the best outcome of the three.

The proposed method is also compared with the bagged ensemble of SVMs [42] and the AdaBoost ensemble of SVMs [43]. Figure 16 shows that the recognition accuracy is highest for the proposed method. The bagged ensemble of SVMs is used for classification problems in which a single SVM cannot comfortably manage very large data sets, but its recognition accuracy is lower than that of the proposed SDCN. The AdaBoost ensemble of SVMs [44] provides better performance on imbalanced classification problems, but the accuracy of the existing CNN [45], AdaBoost and bagged SVM algorithms is much lower than that of the proposed method. Communication is effective only if the emotion is recognized quickly, so the computation time must be as short as possible. This requirement is also fulfilled by the proposed method, as shown in Table 2.

Fig. 16

The recognition accuracy is plotted by comparing it with CNN [43], Bagged SVMs [41], AdaBoost [42], and the proposed method.

Table 2

Computation time and recognition rate for various emotions, comparing CNN, LSTM and the proposed method

	CNN		LSTM		Proposed
	Computation Time (sec)	Recognition Rate(%)	Computation Time (sec)	Recognition Rate(%)	Computation Time (sec)	Recognition Rate(%)
EMOTION	30	95	27	97	23	98
Anger	23	89	19	96	12	97
Disgust	45	91	43	93	31	96
Fear	33	91	30	94	21	96
Happiness	37	90	28	95	20	97
Neutral	32	88	28	92	21	95
Sadness	39	90	36	94	32	98

The computation rate indicates the time taken to accomplish the proposed work. Here emotion recognition is performed by the combination of DBN with the optimized SVM, so the process finishes with greater accuracy and less computation time. In the previous strategies, CNN and LSTM, only a single classification method is used, so they take more time to complete. Table 2 gives the numerical values and Fig. 17 gives the graphical representation.

The graph in Fig. 17 shows that the computation time of the proposed SDCN system is comparatively low. When CNN or LSTM is used, more time is needed for computation during emotion recognition.

Fig. 17

Computation time is plotted for various emotions by comparing CNN, LSTM and the proposed novel SDCN method.

The recognition rate is determined from the correct and false decisions made. It is given by Equation 7 below.

Recognition rate = (no. of correctly identified voice samples / total no. of voice samples) * 100 (7)
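The recognition rate formula above can be sketched as a small helper. The function name and the sample counts in the usage line are hypothetical and are not taken from the experiments.

```python
def recognition_rate(correct, total):
    """Percentage of voice samples that were identified correctly."""
    if total <= 0:
        raise ValueError("total number of voice samples must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    return 100.0 * correct / total

# Hypothetical example: 98 of 100 voice samples identified correctly.
print(recognition_rate(98, 100))
```

Here `total` is the sum of the correct and the false decisions, so the number of false decisions is simply `total - correct`.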

The total number of voice samples is the sum of the correct and false decisions. Figure 18 shows the recognition rates for the different techniques. From all the graphs and tables obtained, it is clear that the best performance is provided by the proposed system, and Tables 3 and 4 show the substantial improvement in accuracy compared with various other methods.

Fig. 18

The recognition rate is plotted for various emotions by comparing CNN, LSTM, and proposed novel SDCN method.

Table 3

Comparison of Accuracy for different emotions using Berlin database

Accuracy (%)
EMOTION	CNN [41]	CNN-RNN [42]	CF-BPNN [43]	Proposed
Anger	84.21	91.42	100	97.35
Disgust	68.55	89.9	77.8	96.31
Fear	25.33	82.18	100	97.43
Happiness	36.36	68.26	88.9	98.32
Neutral	42.11	88.16	77.8	95.32
Sadness	83.08	99.66	88.9	97.32
Average	56.22	86.86	88.9	97.1

Table 4

Comparison of Average Accuracy using Berlin database

Author Name	Classifier	Average Accuracy (%)
Lalitha et al. (2019) [44]	CF-BPNN	88.9
Yang and Lugger (2010) [45]	Bayesian GMM	73.5
Hassan and Damper (2012) [46]	SVM with linear kernel	87.7
Zao et al. (2014) [47]	GMM	80.1
Deb and Dandapat (2017) [48]	ELM	85.1
Proposed	SDCN	97.1

Figure 19 compares the accuracy for each emotion between the proposed deep learning method and existing works such as CNN [39], CNN-RNN [42], and the cascading feed forward-back propagation neural network (CF-BPNN) [44]; the proposed work attains the best average accuracy. Similarly, Fig. 20 compares the average accuracy with different classifiers such as CF-BPNN [46], Bayesian GMM, SVM with linear kernel, Gaussian mixture model (GMM) and extreme learning machine (ELM) [48]; the proposed work attains the highest accuracy of all. The proposed work also shows better results when compared with [49]. It can therefore be concluded that the proposed SDCN identifies the different emotions more accurately than the existing works compared here.

Fig. 19

Performance Analysis of Different Emotion Accuracy level with other Deep Architecture.

Fig. 20

Performance Analysis of Different emotion Average Accuracy level with other deep Architecture.

5 Conclusion

This paper proposes a novel SDCN ensemble method for speech emotion identification. The method offers several benefits. Firstly, since random subspaces are used, it is able to mitigate dimensionality difficulties. Secondly, when the deep conviction network is applied on random subspaces and a larger training database is available, it has the potential to attain improved performance. Thirdly, using the SVM classifier as the base classifier replaces the concrete emotion label with the probability of a testing sample for each emotion. DCN can handle uncertain information through fusion with the classifier used. Finally, DCN is also able to identify complicated emotions from speech samples. In many earlier papers the same classification technique is used in both the training and testing phases; in this paper different techniques are proposed, so that the overfitting problem is reduced and different emotions can be identified from speech. The accuracy is thereby improved considerably. The proposed SDCN, together with the enhancing techniques involved, fulfils the requirements for recognizing the emotions in speech. The accuracy of emotion detection is above 95%, and the recognition rate obtained is about 98% with a computation time of 23 seconds, which none of the existing works compared here reaches. Although the proposed method improves speech emotion recognition in several respects, the diversity of the ensemble is not taken into consideration; this should be addressed to further improve the performance of the approach.

References

1. Fong and Westerink, Affective computing in consumer electronics, IEEE Transactions on Affective Computing 3(2) (2012), 129–131.

2. Khan, Siddiqi, Khan M.U.G., Wahla S.Q. and Samyan, Geometric positions and optical flow based emotion detection using MLP and reduced dimensions, IET Image Processing 13(4) (2019), 634–643.

3. Harimi, AhmadyFard, Shahzadi and Yaghmaie, Anger or joy? Emotion recognition using nonlinear dynamics of speech, Applied Artificial Intelligence 29(7) (2015), 675–696.

4. Sun and Wen, Ensemble the softmax regression model for speech emotion recognition, Multimedia Tools and Applications 76(6) (2017), 8305–8328.

5. Park J.S., Kim J.H. and Oh Y.H., Feature vector classification-based speech emotion recognition for service robots, IEEE Transactions on Consumer Electronics 55(3) (2009), 1590–1596.

6. France D.J., Shiavi R.G., Silverman and Wilkes, Acoustical properties of speech as indicators of depression and suicidal risk, IEEE Transactions on Biomedical Engineering 47(7) (2000), 829–837.

7. Huang Z.W., Xue W.T. and Mao Q.R., Speech emotion recognition with unsupervised feature learning, Frontiers of Information Technology & Electronic Engineering 16(5) (2015), 358–366.

8. Wang, Ruan and An, Projection-optimal local Fisher discriminant analysis for feature extraction, Neural Computing and Applications 26(3) (2015), 589–601.

9. Morgan, Deep and wide: Multiple layers in automatic speech recognition, IEEE Transactions on Audio, Speech, and Language Processing 20(1) (2011), 7–13.

10. Hinton, Deng, Yu, Dahl, Mohamed A.R., Jaitly, Senior, Vanhoucke, Nguyen, Kingsbury and Sainath, Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Processing Magazine 29 (2012).

11. Deng, Hinton and Kingsbury, New types of deep neural network learning for speech recognition and related applications: An overview, In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), 8599–8603.

12. Deng, Li, Huang J.T., Yao, Yu, Seide, Seltzer, Zweig, He, Williams and Gong, Recent advances in deep learning for speech research at Microsoft, In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (2013), 8604–8608.

13. , Ma and Lee K.A., Spoken language recognition: from fundamentals to practice, Proceedings of the IEEE 101(5) (2013), 1136–1159.

14. Cui, Goel and Kingsbury, Data augmentation for deep convolutional neural network acoustic modelling, In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2015), 4545–4549.

15. Hinton G.E. and Salakhutdinov R.R., Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507.

16. Hubel D.H. and Wiesel T.N., Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, The Journal of Physiology 160(1) (1962), 106–154.

17. Jones, Computer science: The learning machines, Nature News 505(7482) (2014), 146.

18. Bengio, Learning deep architectures for AI, Foundations and Trends® in Machine Learning 2(1) (2009), 1–127.

19. Bengio, Deep learning of representations: Looking forward, In International Conference on Statistical Language and Speech Processing (2009), 1–37.

20. Hinton, Deng, Yu, Dahl, Mohamed A.R., Jaitly, Senior, Vanhoucke, Nguyen, Kingsbury and Sainath, Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Processing Magazine 29 (2012).

21. Tamilselvan and Wang, Failure diagnosis using deep belief learning based health state classification, Reliability Engineering & System Safety 115 (2013), 124–135.

22. Hinton G.E., A practical guide to training restricted Boltzmann machines, In Neural Networks: Tricks of the Trade (2012), 599–619.

23. Shukla, Jain and Dubey R.K., Increasing the performance of speech recognition system by using different optimization techniques to redesign artificial neural network, Journal of Theoretical and Applied Information Technology 97(8) (2019), 2404–2415.

24. Shukla and Jain, A novel system for effective speech recognition based on artificial neural network and opposition artificial bee colony algorithm, International Journal of Speech Technology 22(4) (2019), 959–969.

25. Jain, Gupta and Jain, Analysis and design of digital IIR integrators and differentiators using minimax and pole-zero and constant optimization methods, ISRN Electronics 2013 (2013), 1–14.

26. Jain, Gupta and Jain, Linear phase second order recursive digital integrators and differentiators, Radioengineering 21(2) (2012), 712–717.

27. Gupta and Jain, Wideband digital integrator and differentiator, IETE Journal of Research 58(2) (2012), 166–170.

28. Jain, Gupta and Jain, The design of the IIR differintegrator and its application in edge detection, Journal of Information Processing Systems 10(2) (2014), 223–239.

29. Jain, Gupta and Jain, Design of half sample delay recursive digital integrators using trapezoidal integration rule, International Journal of Signal & Imaging Systems Engineering 9(2) (2016), 126–134.

30. Gupta, Jain and Kumar, Novel class of stable wideband recursive digital integrators and differentiators, IET Signal Processing 4(5) (2010), 560–566.

31. Hook, Noorozi, Toygar and Anbarjafari, Automatic speech based emotion recognition using paralinguistics features, Bulletin of the Polish Academy of Sciences Technical Sciences 67(3) (2019).

32. Zhao, Xia and Lijiang, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, Biomedical Signal Processing and Control 47 (2019), 312–323.

33. Badshah A.M., Nasir, Noor, Ahmad, Jamil, Lee M.Y., Soonil and Baik, Deep features-based speech emotion recognition for smart affective services, Multimedia Tools and Applications 78(5) (2019), 5571–5589.

34. , Panagakis and Pantic, Learning low rank and sparse models via robust autoencoders, In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2019), 3192–3196.

35. Gupta, Kishalaya, Dinesh and Thenkanidiyoor, Recognition from varying length patterns of speech using CNN-based segment-level pyramid match kernel-based SVMs, In 2019 National Conference on Communications (NCC) (2019), 1–6.

36. Shen, Wu, Cheng, Aihemaiti, Zhang and Li, A spatiotemporal fusion based cloud removal method for remote sensing images with land cover changes, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12(3) (2019), 862–874.

37. Wei and Zhao, A novel speech emotion recognition algorithm based on wavelet kernel sparse classifier in stacked deep auto-encoder model, Personal and Ubiquitous Computing (2019), 1–9.

38. Huang, Tian, Wu and Zhang, Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition, Journal of Ambient Intelligence and Humanized Computing 10(5) (2019), 1787–1798.

39. Wen, Li, Huang, Li and Xun, Random deep belief networks for recognizing emotions from speech signals, Computational Intelligence and Neuroscience (2017).

40. Wang and Sung, AdaBoost with SVM-based component classifiers, Engineering Applications of Artificial Intelligence (2008).

41. Lili, Longbiao, Dang, Zhang, Guan and Li, Speech emotion recognition by combining amplitude and phase information using convolutional neural network, Proc Interspeech (2018), 1611–1615.

42. Badshah A.M., Ahmad, Rahim and Baik S.W., Speech emotion recognition from spectrograms with deep convolutional neural network, In 2017 International Conference on Platform Technology and Service (PlatCon), Busan, South Korea (2017), 1–5.

43. Lim, Jang and Lee, Speech emotion recognition using convolutional and recurrent neural networks, In 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju (2016), 1–4.

44. Lalitha, Tripathi and Gupta, Enhanced speech emotion detection using deep neural networks, International Journal of Speech Technology 22(3) (2019), 497–510.

45. Yang and Lugger, Emotion recognition from speech signals using new harmony features, Signal Processing 90(5) (2010), 1415–1423.

46. Hassan and Damper R.I., Classification of emotional speech using 3DEC hierarchical classifier, Speech Communication 54(7) (2012), 903–916.

47. Zao, Cavalcante and Coelho, Time-frequency feature and AMS-GMM mask for acoustic emotion classification, IEEE Signal Processing Letters 21(5) (2014), 620–624.

48. Deb and Dandapat, Emotion classification using segmentation of vowel-like and non-vowel-like regions, IEEE Transactions on Affective Computing 99 (2017), 1.

49. Jain and Shukla, Accurate speech emotion recognition by using brain-inspired decision-making spiking neural network, International Journal of Advanced Computer Science and Applications 10(12) (2019).


Table 1

Parameters of the voice signal

Physical	Acoustic	Perceptual
Rate of vibration	Frequency (Hz)	Pitch
Amplitude of vibration	Intensity (dB)	Loudness
Periodicity of vibration	Perturbation-jitter, shimmer	Quality
Complexity of vibration	Range	Flexibility