Attention mechanism based LSTM in classification of stressed speech under workload

Abstract

To improve the robustness of speech recognition systems, this study classifies stressed speech caused by psychological stress under multitasking workloads. Because stressed speech is transient and ambiguous, stress characteristics are not present in every segment of an utterance labeled as stressed. In this paper, we propose a multi-feature fusion model based on the attention mechanism to measure the importance of each segment for stress classification. Through the attention mechanism, each speech frame is weighted to reflect its correlation with the actual stressed state, and multiple channels of features characterizing stressed speech are fused to classify speech under stress. The proposed model further adapts SpecAugment to the feature spectrum for data augmentation, to mitigate the small-sample problem of stressed speech. In the experiments, we compared the proposed model with traditional methods on the CASIA Chinese emotion corpus and the Fujitsu stressed speech corpus, and the results show that the proposed model performs better in speaker-independent stress classification. Transfer learning is also applied to speaker-dependent classification of stressed speech, and the performance is improved. Compared with traditional methods, the attention mechanism shows an advantage for continuous speech under stress in authentic contexts.

Keywords

Attention mechanism, speech under stress, multi-feature fusion, SpecAugment, transfer learning

1. Introduction

Speech is the most significant mode of communication among human beings and, captured through a microphone sensor, a potential method for human-computer interaction (HCI) [1]. Speech recognition technology has progressed significantly thanks to advances in large-scale data and deep learning. However, in real applications the accuracy of speech recognition is influenced by various factors. Variation in speech under stress, caused by internal and external environmental factors, is a major contributor to the decline in speech recognition performance [2]. Therefore, stress classification is mainly applied to enhance the robustness of speech recognition systems and to support anomaly detection.

The use case of the stressed speech classification model within the whole speech analysis system is shown in Fig. 1. The model gives the system the ability to detect anomalies caused by workloads and improves the robustness of automatic speech recognition.

Stressed speech arises from physiological and psychological factors. The physiological variation is mainly caused by pathological factors, including (1) diseases of the vocal cords and (2) lesions in the vocal tract involving the occlusal muscles, labial muscles, lingual muscles and facial symmetry. On the psychological side, speech under stress usually refers to variations in phonation caused by workloads, specific emotions, sleep deprivation and perceived threat. For example, a speaker may focus on a certain task, with speaking serving only as a secondary task. Here, due to the load on the brain as well as mental stress, the speaker experiences an abnormal mental state characterized by nervousness and absent-mindedness, which greatly affects his/her phonation. This paper is devoted to the classification of stressed speech under multitasking workloads.

A recognition/classification system for speech under stress has two stages: a feature extraction stage and a classification stage [3]. Much research has been devoted to feature extraction. Spectral features, including linear prediction coefficients (LPC) and mel frequency cepstral coefficients (MFCC) [4], have played a pivotal part in stressed speech analysis in recent years. A method to automatically obtain an optimized filter bank for stress recognition in speech was proposed in [5]. A new feature extraction method using a Fourier model for out-of-breath speech was presented in [3]. Various classifiers have also been used to recognize the stressed state or different emotions in speech, such as hidden Markov models (HMM), Gaussian mixture models (GMM) and support vector machines (SVM) [6, 7, 8, 9]. Recently, Besbes proposed a method to extract advanced acoustic features from stressed speech signals and employed multi-class SVMs with different kernels to recognize speech under stress, achieving favorable results [10]. Bandela applied a GMM to emotion recognition in stressed speech with a combined feature containing the Teager energy operator (TEO) and LPC [11]. Dumpala employed a deep neural network (DNN) to analyze the breathing sounds of speakers and showed that increasing the number of hidden layers improves performance [12]. Mustaqeem presented a CNN model that extracts normalized features from the speech spectrogram and feeds them into a deep bi-directional long short-term memory (BiLSTM) network to recognize the final emotional state [13]. Badshah [14] transformed emotional speech classification into image classification by combining a convolutional neural network (CNN) with spectrograms. Roza [15] designed a curriculum for training a DNN for speech emotion recognition that used the disagreement between evaluators as a measure of difficulty for the classification task. Zhang [16] proposed a multiscale deep convolutional long short-term memory (LSTM) framework for spontaneous speech emotion recognition, in which a deep LSTM is applied to learned segment-level CNN features for utterance-level emotion recognition. However, the above research did not take into account the instantaneous character of stressed speech and ignored the significance of key frames for stressed speech under workloads.

Figure 1.

The use-case of stressed speech classification in the whole speech analysis system.

Although the above algorithms and models have been successfully applied in emotion recognition, traditional methods face numerous challenges with speech under stress. Because stressed speech under multitasking workloads is transient and ambiguous, it is not certain that stressed speech is produced simply by exerting pressure on the speakers. Therefore, speech labeled as stressed contains not only truly stressed segments but also many neutral speech segments, and each part of the stressed speech is of varying importance for classification. However, traditional machine learning algorithms and deep learning networks neither represent nor quantify this difference in importance among segments: stress-irrelevant neutral segments and vital stress-relevant segments contribute equally to the result. In addition, because stressed speech contains a large number of neutral segments, the number of truly stressed segments is relatively small, so stressed speech classification suffers from a small-sample problem. Hence, traditional models tend to overfit.

This paper discusses stressed speech caused by psychological stress under multitasking workloads and proposes a stress classification framework built on a multi-feature fusion LSTM with an attention mechanism. In this study, speech frames are weighted using the attention mechanism to reflect their different correlations with the actual stressed state, which addresses the problem of ambiguous sample labels. Furthermore, multiple channels of features representing the stressed speech are fused for classification, and a modified SpecAugment and transfer learning are proposed to address the small-sample problem.

Our major contributions in this paper are documented below:

We propose a classification learning framework for stressed speech under multitasking workloads, which improves the robustness of automatic speech recognition. The framework models the continuous temporal signal of stressed speech with a time-series learning mechanism. In addition, the small-sample problem in stressed speech is addressed by a modified SpecAugment and transfer learning.

We propose a learning strategy for stress classification based on the attention mechanism. In view of the transient nature and ambiguity of stressed speech under multitasking workloads, the model effectively learns the weights of different frames to solve the problem of quantifying stressed states in continuous speech.

Our study demonstrates successful quasi end-to-end neural network learning for stress classification under multitasking workloads.

We tested the proposed framework on different datasets and evaluated it from different perspectives. The proposed framework achieved accuracies of 80.0% and 86.7% on the Fujitsu stressed speech corpus and the CASIA dataset, respectively, outperforming traditional methods and showing generalization across datasets.

The rest of the paper is organized as follows. Section 2 introduces related work on attention mechanisms and the long short-term memory model. Section 3 elaborates the proposed framework for stressed speech recognition. Section 4 discusses the experimental results of the proposed technique and comparative evaluations against traditional methods. Finally, Section 5 presents the conclusions and the future work of the proposed framework.

2. Related work

2.1 Long-Short Term Memory Model

Owing to the lack of public datasets for stressed speech under multitasking workloads, few studies have applied the long short-term memory (LSTM) model to stress classification. However, LSTM has made progress in speech emotion recognition (SER). Since Wöllmer [17] first applied LSTM to SER, its application in SER has developed continuously in recent years, mainly along two streams: distinguishing features [18, 19, 20] and emotion recognition [21, 22, 23].

The LSTM network includes an input layer, an output layer, and several recursive hidden layers. The recursive hidden layer consists of several memory modules, containing one or more self-connected memory units as well as three gates that control information flow: the input gate, output gate, and forget gate. The structure of LSTM is shown in Fig. 2.

Figure 2.

The internal structure of LSTM. $i_{t}$ , $f_{t}$ , $c_{t}$ and $o_{t}$ represent the input gate, forget gate, memory unit and output gate, respectively; $\phi$ is the hyperbolic tangent nonlinear function, and $\otimes$ denotes the element-wise product between vectors.

The input sequence is shown as $X=(x_{1},x_{2},x_{3}\ldots x_{T-1},x_{T})$ , and the recursive hidden layer calculates the activation values of the three gates and the memory unit in sequence according to the time $t=1\sim T$ . The calculation formula at time $t$ is:

Input gate:

$\displaystyle i_{t}=\sigma({W_{ix}x_{t}+W_{ih}h_{t-1}+W_{ic}c_{t-1}+b_{i}})$ (1)

Forget gate:

$\displaystyle f_{t}=\sigma({W_{fx}x_{t}+W_{fh}h_{t-1}+W_{fc}c_{t-1}+b_{f}})$ (2)

Memory unit:

$\displaystyle c_{t}=f_{t}\otimes c_{t-1}+i_{t}\otimes\phi({W_{cx}x_{t}+W_{ch}h_{t-1}+b_{c}})$ (3)

Output gate:

$\displaystyle o_{t}=\sigma({W_{ox}x_{t}+W_{oh}h_{t-1}+W_{oc}c_{t-1}+b_{o}})$ (4)

Hidden layer output:

$\displaystyle h_{t}=o_{t}\otimes\phi({c_{t}})$ (5)

Here, the weight matrix $W_{.x}$ connects the input $x_{t}$ from the previous hidden layer to the memory module; the weight matrix $W_{.h}$ connects the output $h_{t-1}$ of the current hidden layer at the previous moment to the memory module at the current moment; $W_{.c}$ is a diagonal matrix connecting the three gates and the memory unit inside the memory module; $b_{.}$ is the offset vector; and $\sigma$ is the sigmoid activation function. The hidden layer output $h_{t}$ at the current moment is used as the input of the next hidden layer. The network output layer contains a matrix transformation and the SoftMax normalization function, and the output of the normalization function serves as the posterior probability of the output results.
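To make the recursion in Eqs. (1)-(5) concrete, the following is a minimal NumPy sketch of a single time step, not the implementation used in this study. The function names `lstm_step` and `init_params`, the parameter dictionary layout, and the random initialization are assumptions introduced only for illustration. Because each $W_{.c}$ is diagonal, it is stored here as a vector and applied element-wise.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step with diagonal gate-to-cell connections, Eqs. (1)-(5)."""
    # Diagonal matrices W_.c are stored as vectors, so W_.c c becomes w * c.
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ih"] @ h_prev + p["w_ic"] * c_prev + p["b_i"])  # Eq. (1)
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fh"] @ h_prev + p["w_fc"] * c_prev + p["b_f"])  # Eq. (2)
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_ch"] @ h_prev + p["b_c"])  # Eq. (3)
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_oh"] @ h_prev + p["w_oc"] * c_prev + p["b_o"])  # Eq. (4)
    h_t = o_t * np.tanh(c_t)                                                             # Eq. (5)
    return h_t, c_t

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights, zero biases."""
    rng = np.random.default_rng(seed)
    p = {}
    for g in "ifco":
        p[f"W_{g}x"] = rng.normal(0.0, 0.1, (hidden_dim, input_dim))
        p[f"W_{g}h"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim))
        p[f"b_{g}"] = np.zeros(hidden_dim)
    for g in "ifo":
        p[f"w_{g}c"] = rng.normal(0.0, 0.1, hidden_dim)
    return p
```

Running `lstm_step` over $t=1\sim T$ while feeding each returned `h_t` and `c_t` back in as `h_prev` and `c_prev` yields the sequence of hidden outputs that the later attention layer operates on.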

Due to the transient nature and ambiguity of stressed speech, a sentence labeled as stressed as a whole contains not only truly stressed segments but also some neutral phonation segments. Hence, the silent and neutral segments in continuous stressed speech have an adverse effect on the final recognition decision.

2.2 Attention mechanism

Attention mechanism is widely used in natural language processing (NLP) and automatic speech recognition (ASR) [24, 25, 26, 27, 28]. Researchers have also applied attention mechanism to speech emotion recognition (SER). Mirsamadi [29] proposed a novel strategy for emotion recognition which used local attention in order to focus on specific regions of a speech signal that are more emotionally salient. Huang [30] explored a convolutional attention mechanism to learn the utterance structure relevant to the task.

The attention mechanism is a model simulating the attention system of the human brain, which was proposed by Treisman and Gelade [31]. Moreover, it is viewed as a combination function, highlighting the impact of key inputs on the output by calculating the probability distribution of attention. Most attention mechanisms are based on the Encoder-Decoder abstraction framework, especially in the field of natural language processing, as shown in Fig. 3.

Figure 3.

Abstract Encoder-Decoder framework. The model maps a variable-length input ${X}=({x_{1},x_{2},\ldots x_{n}})$ to a variable-length output $Y=({y_{1},y_{2},\ldots y_{m}})$ .

The model maps a variable-length input ${X}=({x_{1},x_{2},\ldots x_{n}})$ to a variable-length output ${Y}=({y_{1},y_{2},\ldots y_{m}})$ . The Encoder transforms the variable-length input sequence $X$ into an intermediate semantic representation ${C}={f}({x_{1},x_{2},\ldots x_{n}})$ through a non-linear transformation. The task of the Decoder is to predict and generate the output $y_{i}=g({y_{1},y_{2},\ldots y_{i-1},C})$ at time $i$ based on the intermediate semantic representation $C$ of the input sequence $X$ as well as the previously generated $y_{1},y_{2}\ldots y_{i-1}$ , where ${f}()$ and ${g}()$ are both non-linear transformation functions. As the traditional Encoder-Decoder framework does not discriminate among the elements of the input $X$ , Bahdanau et al. introduced an attention mechanism to solve this problem [32], shown in Fig. 4.

Figure 4.

Schematic diagram of the attention mechanism.

Here, $s_{t-1}$ is the hidden state of the decoder at time $t-1$ , $y_{t}$ is the target word, and $C_{t}$ is the context vector. Accordingly, the hidden state at time $t$ is:

$\displaystyle s_{t}={f}({s_{t-1},y_{t-1},C_{t}})$ (6)

The context vector $C_{t}$ depends on the hidden layer representation of the input sequence at the encoding end and can be expressed as a weighted sum:

$\displaystyle C_{t}=\sum_{k=1}^{T}a_{t,k}h_{k}$ (7)

where $h_{k}$ represents the hidden vector of the $k$-th word on the Encoder side. It contains information about the entire input sequence but focuses on the part around the $k$-th word. ${T}$ is the length of the input, and $a_{t,k}$ is the attention distribution coefficient of the $k$-th word in the Encoder with respect to the $t$-th word in the Decoder. The probability $a_{t,k}$ is computed as:

$\displaystyle a_{t,k}=a({s_{t-1},h_{k}})$ (8)

$\displaystyle A_{t,k}=\frac{{\exp}({a_{t,k}})}{\sum_{j=1}^{T}{\exp}({a_{t,j}})}$

Here, $a(\cdot)$ is an alignment model that measures the alignment degree (influence degree) of the word at encoder position $k$ with respect to the word at decoder position $t$ , and $A_{t,k}$ is the resulting normalized attention weight. The alignment model $a$ is usually parameterized as a feedforward neural network, trained jointly with the rest of the system.
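As an illustration of Eqs. (7) and (8), the sketch below computes the normalized attention weights and the context vector for one decoder step. The additive form of the alignment model, $v^{\top}\tanh(W_{s}s_{t-1}+W_{h}h_{k})$, and the names `additive_attention`, `W_s`, `W_h` and `v` are assumptions chosen for this example; the text only states that the alignment model is a feedforward network.

```python
import numpy as np

def additive_attention(s_prev, h, W_s, W_h, v):
    """Attention weights and context vector for one decoder step.

    s_prev: (D,)   previous decoder state s_{t-1}
    h:      (T, E) encoder hidden vectors h_1..h_T
    W_s: (D, A), W_h: (E, A), v: (A,) are parameters of the assumed alignment model.
    """
    # Alignment scores a(s_{t-1}, h_k) for every encoder position k.
    scores = np.tanh(s_prev @ W_s + h @ W_h) @ v      # shape (T,)
    # Softmax over encoder positions, shifted for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()           # normalized weights
    # Context vector as the weighted sum of encoder states, Eq. (7).
    context = weights @ h                             # shape (E,)
    return weights, context
```

Since the weights form a probability distribution over encoder positions, the context vector is a convex combination of the encoder states.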

3. Method

The proposed stress classification framework is shown in Fig. 5. It is divided into four parts: speech preprocessing, feature selection, data augmentation and classification.

Figure 5.

The proposed framework for stress classification.

3.1 Attention-based multi-feature fusion model (Att_MF-LSTM)

Since stressed speech is instantaneous in character, a sentence labeled as stressed as a whole contains not only truly stressed segments but also some normal phonation segments. Each segment of the stressed speech therefore has a different importance for stress classification. Traditional methods assume that all frames contribute equally to stress classification, whereas the attention mechanism enables the model to learn to pay different amounts of attention to frames at different times.

Compared to the traditional recurrent neural network (RNN), the structure of LSTM selectively retains the memory of the previous node while selectively absorbing new memory at each node, thereby taking context into account. As a time-series signal, speech at any moment is related to the states before and after it, and this dependence becomes harder to capture as the length of the speech grows. The attention mechanism is able to alleviate this problem. A traditional LSTM uses only the output of the last node for stress detection, whereas the combination of the attention mechanism and LSTM retains the outputs of the intermediate nodes and learns the weights of different speech frames through the attention mechanism, achieving selective learning of the input speech vector.

In order to improve the classification performance while taking computational cost into account, a stress classification model for multi-feature fusion LSTM based on the attention mechanism (Att_MF-LSTM) is proposed. In this study, multiple features are selected as model inputs so that stressed speech is described more comprehensively.

Figure 6.

Overall architecture of the proposed multi-feature fusion LSTM based on the attention mechanism for stressed speech classification.

Figure 7.

Feature selection based on attention mechanism, where similarity between output of hidden layer and output of LSTM at last moment is calculated to distinguish frames. Then feature selection is performed according to variance of $z_{t}$ .

The structure of Att_MF-LSTM is shown in Fig. 6. Feature selection refers to the selection of effective features from the original features in order to reduce the dimensionality of the data [33]. Each feature is input into an LSTM model, and the weight of each frame is calculated through the attention mechanism. Accumulated information is most abundant in the output of the LSTM at the last moment, so this output should receive a large weight in the attention mechanism. This study therefore takes the output of the LSTM at the last moment as a reference to ensure that it obtains a large weight, as shown in Fig. 7. The attention score of each speech frame is calculated as:

$\displaystyle z_{t}=\textit{Attention}({h_{t},O_{T}})=O_{T}\times({h_{t}\times W_{t}})^{H}$ (9)

where $O_{T}\in R^{B\times N}$ is the output of the LSTM at the last moment, $B$ is the batch size, $N$ is the number of hidden units and $T$ denotes the last frame. $h_{t}\in R^{B\times N}$ is the output of the hidden layer at frame $t(t\in[{1,T}])$ . $W_{t}\in R^{N\times N}$ is a trainable weight matrix. $H$ denotes the transpose operator. $z_{t}$ indicates the similarity between $h_{t}$ and $O_{T}$ in the time dimension. A larger $z_{t}$ means that the frame-level speech feature at frame $t$ is more significant and has a greater impact on the classification result.
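The score in Eq. (9) can be sketched in NumPy as follows. This is one possible reading rather than the authors' code: the product $O_{T}\times(h_{t}\times W_{t})^{H}$ is a $B\times B$ matrix, and the sketch assumes that the score of each utterance in the batch is the corresponding diagonal entry, that is, the dot product between that utterance's final output and its own projected hidden output. The function name `attention_scores` and the use of a single weight matrix shared by all frames are likewise assumptions.

```python
import numpy as np

def attention_scores(h, W):
    """Frame-level attention scores in the spirit of Eq. (9).

    h: (T, B, N) hidden outputs for T frames, batch size B, N hidden units
    W: (N, N)    trainable weight matrix (assumed shared across frames)
    Returns z with shape (T, B): one score per frame and utterance.
    """
    o_last = h[-1]            # O_T, the LSTM output at the last frame, (B, N)
    projected = h @ W         # h_t W for every frame, (T, B, N)
    # Dot product of O_T[b] with (h_t W)[b], i.e. the diagonal of O_T (h_t W)^H.
    return np.einsum("bn,tbn->tb", o_last, projected)
```

With `W` set to the identity matrix, the score of the last frame reduces to the squared norm of $O_{T}$, which is consistent with the last moment serving as the reference.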

Due to the transient nature and ambiguity of stressed speech, only a few frames play a decisive role in the recognition result, such as laughter and speech segments that become lighter and slower. In stress classification, the attention mechanism highlights the important frames while ignoring insignificant ones. For each feature, we calculate the weight of each frame through the attention mechanism described above. If the variance of the frame weights of a feature is large, the attention mechanism for that feature is more likely to identify key frames. On the other hand, if the frame weights are close to each other, every frame has a similar effect on the recognition result, which means that the attention mechanism for that feature cannot distinguish the key frames of stressed speech. Therefore, the variance of the weights can be used for feature selection based on the attention mechanism. We consider the different features within a feature set, calculate the variance of the attention weights over all frames in the dataset, and perform feature selection according to the magnitude of the variance, as represented in Fig. 5.

The variance is calculated as follows:

$\displaystyle V=\sum_{t=1}^{T}\left({z_{t}-\frac{\sum_{t=1}^{T}z_{t}}{T}}\right)^{2}$ (10)

where $z_{t}$ is calculated as in Eq. (9) and $T$ is the number of frames over all speech. Finally, the features with larger variances are selected to simplify the feature set.
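The selection rule can be sketched as follows. Eq. (10) as written is the sum of squared deviations of the scores from their mean (it is not divided by $T$), and the sketch follows that form. The function names `score_variance` and `rank_features` and the dictionary input are hypothetical and introduced only for this example.

```python
import numpy as np

def score_variance(z):
    """Eq. (10): sum of squared deviations of the attention scores from their mean."""
    z = np.asarray(z, dtype=float)
    return float(np.sum((z - z.mean()) ** 2))

def rank_features(scores_by_feature):
    """Sort feature names by decreasing variance of their attention scores.

    scores_by_feature: dict mapping a feature name to the 1-D array of
    attention scores z_t collected over all frames for that feature.
    """
    variances = {name: score_variance(z) for name, z in scores_by_feature.items()}
    return sorted(variances, key=variances.get, reverse=True)
```

Keeping the first few names returned by `rank_features` corresponds to selecting the features whose attention scores vary most across frames.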

After feature selection and data augmentation, three speech features $x_{1},x_{2},x_{3}$ are input into their respective LSTM models. The three features are learned separately before the attention mechanism. As shown in Fig. 5, $x_{t}^{i}\in R^{1\times M_{i}}$ is the input feature of the LSTM, where $i$ denotes the $i$-th speech feature, $t$ denotes the $t$-th frame of speech, 1 indicates that the input is the feature of one frame, and $M_{i}$ is the dimension of the $i$-th feature; all features are aligned in time. We then calculate the weight of each frame for the three speech features through the attention mechanism. The attention score of each speech frame is calculated separately for the three speech features as:

$\displaystyle z_{t}^{i}=\textit{Attention}({h_{t}^{i},O_{T}^{i}})=O_{T}^{i}\times({h_{t}^{i}\times W_{t}})^{H}$ (11)

Here, $i$ denotes the $i$-th speech feature, and the other parameters have the same meaning as in Eq. (9).

$z_{t}^{i}$ is normalized through the SoftMax function, mapped to a weight $a_{t}^{i}$ between 0 and 1:

$\displaystyle a_{t}^{i}=\frac{\exp(z_{t}^{i})}{\sum_{t=1}^{T}{\exp(z_{t}^{i})}}$ (12)

Different attention focus for three speech features results in different weights. The weights learned from the three speech features are further normalized to obtain the final weight of t-th frame in speech:

$\displaystyle a_{t}=\textit{Normalize}(a_{t}^{1},a_{t}^{2},a_{t}^{3})=\alpha\cdot a_{t}^{1}+\beta\cdot a_{t}^{2}+\delta\cdot a_{t}^{3}$ (13)

where $\alpha$ , $\beta$ , and $\delta$ are the normalization coefficients ( $\alpha+\beta+\delta=1$ , $0<\alpha,\beta,\delta<1$ ). The coefficients are determined according to the importance of the corresponding speech features, which is measured by the variance used in feature selection.

Each hidden state is then multiplied with the normalized weights, and features are fused by performing matrix concatenation.

$\displaystyle H_{t}=\textit{concat}({a_{t}\cdot h_{t}^{1},a_{t}\cdot h_{t}^{2},a_{t}\cdot h_{t}^{3}})$ (14)

Finally, the feature fusion matrix is input into the embedding linear layers, mapped to the classification result:

$\displaystyle\text{Output}=\text{embedding}({H_{1},H_{2},\ldots H_{T}})$ (15)

Therefore, the model essentially uses the output of all hidden nodes in LSTM to obtain information concerning the entire speech. Meanwhile, the importance weights for each speech frame of different features are learned through attention mechanism, which is beneficial to stress classification.
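The fusion steps in Eqs. (12)-(14) can be sketched as below for a single utterance. The attention scores $z_{t}^{i}$ are taken as given inputs, the default coefficients are the values $\alpha=0.5$, $\beta=0.3$, $\delta=0.2$ reported in Section 4.3, and the name `fuse` and the argument layout are assumptions made for illustration. The final embedding layer of Eq. (15) is not included.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # shifted for numerical stability
    return e / e.sum()

def fuse(hidden, scores, coeffs=(0.5, 0.3, 0.2)):
    """Weight and concatenate the hidden outputs of three feature streams.

    hidden: list of three arrays h^i with shape (T, N_i)
    scores: list of three arrays z^i with shape (T,)
    Returns the frame weights a (T,) and the fused matrix H (T, N_1 + N_2 + N_3).
    """
    per_feature = [softmax(z) for z in scores]                     # Eq. (12)
    a = sum(c * w for c, w in zip(coeffs, per_feature))            # Eq. (13)
    H = np.concatenate([a[:, None] * h for h in hidden], axis=1)   # Eq. (14)
    return a, H
```

Because each per-feature weight vector sums to one and the coefficients sum to one, the combined frame weights `a` also sum to one over the frames of the utterance.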

3.2 Data augment based on feature spectra

Training models requires a large amount of data. However, the data set in this study poses some problems, such as difficulty in collection and high subjectivity of sample labeling. Moreover, the speech data labeled as stressed only includes a small number of truly stressed speech segments, resulting in a small training sample size. Small sample sizes often lead to overfitting of the model, making the model lack generalization and reliability.

In order to avoid overfitting of the network, a data augmentation method based on feature spectra is proposed. SpecAugment [34] is extended to spectra other than the log mel spectrum, and data augmentation is performed by time warping, time masking and feature channel masking. The algorithm is as follows.

Algorithm 1. SpecAugment Based on Spectra
Require: Feature matrix D $\in R^{M\times N}$ , where M is the time dimension and N is the number of feature channels
1. Randomly initialize time masking parameter: ${\alpha}\in\text{int}({[{0.1\times M,0.2\times M}]})$
2. Randomly initialize feature channel masking parameter: ${\beta}\in\text{int}({[{0.1\times N,0.2\times N}]})$
3. Randomly initialize time warping parameter: ${\gamma}\in\text{int}({[{0.05\times M,0.1\times M}]})$
4. i $\leftarrow$ 0, j $\leftarrow$ 0, k $\leftarrow$ 0
5. while i $<{\alpha}$ do
6. Randomly initialize a $\in$ [1, M]
7. Mask the features of frame a: D[a,:] $\leftarrow$ 0
8. i $\leftarrow$ i $+$ 1
9. end
10. while j $<{\beta}$ do
11. Randomly initialize b $\in$ [1, N]
12. Mask the features of channel b: D[:,b] $\leftarrow$ 0
13. j $\leftarrow$ j $+$ 1
14. end
15. while k $<{\gamma}$ do
16. Randomly initialize c $\in$ [1, M] and d $\in$ [1, M]
17. Warp in time by swapping frame c and frame d: D[c,:] $\leftrightarrow$ D[d,:]
18. k $\leftarrow$ k $+$ 1
19. end

This method enhances the robustness of the model against distortion of the time series and partial loss of feature channels. Moreover, the augmentation is applied directly to the features, so it can be performed dynamically without noticeably affecting the training speed. A simplified example is shown in Fig. 8, where the matrix represents the features of a speech segment and each element denotes the n-th dimensional feature of the m-th frame. In this example, the features of the first and third frames are exchanged in time, and the 2nd feature dimension and the 2nd frame are masked.
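Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' code: the function name `spec_augment` and the optional `rng` argument are assumptions, indices are 0-based, the bounds are computed with integer division (M // 10 in place of int(0.1 x M), and so on), and the same frame or channel may be drawn more than once, as the pseudocode allows.

```python
import numpy as np

def spec_augment(D, rng=None):
    """Return an augmented copy of the feature matrix D (M frames x N channels)."""
    rng = np.random.default_rng() if rng is None else rng
    D = np.array(D, dtype=float)      # work on a copy; the input is not modified
    M, N = D.shape
    # Steps 1-3: draw how many frames to mask, channels to mask, and swaps to do.
    alpha = rng.integers(M // 10, M // 5 + 1)
    beta = rng.integers(N // 10, N // 5 + 1)
    gamma = rng.integers(M // 20, M // 10 + 1)
    for _ in range(alpha):            # time masking: zero a random frame
        D[rng.integers(M), :] = 0.0
    for _ in range(beta):             # channel masking: zero a random channel
        D[:, rng.integers(N)] = 0.0
    for _ in range(gamma):            # time warping: swap two random frames
        c, d = rng.integers(M), rng.integers(M)
        D[[c, d], :] = D[[d, c], :]
    return D
```

Because the function returns a new array on every call, it can be applied on the fly to each training batch, which matches the dynamic use described above.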

4. Experimental evaluation

4.1 Database and experimental method

4.1.1 Fujitsu stressed speech corpus

The stressed speech corpus used in this paper was collected by Fujitsu and contains speech samples from telephone conversations during which the speakers perform different tasks. Three tasks are introduced to simulate stressed speech caused by psychological stress: (1) solving logical puzzles; (2) spotting differences; (3) gambling games. These tasks are performed by the speaker while talking with the operator. The logical puzzle task is shown in Table 1. During the phone call, the speaker is required to give a logical answer to the puzzle and explain the reasoning according to the given hints. The spotting-differences task is shown in Fig. 9. The speaker is asked to examine the differences between two pictures and is required to answer questions at the same time. While the speaker is solving logical puzzles and spotting differences, the remaining time is shown on the display to exert time pressure, and the speaker must answer the questions within the given time. Gambling games are used to assess the speakers’ desire for monetary gain. In a gambling game, the goal of the speaker is to win the target amount. If the speaker loses all the money, the speaker must borrow money from the operator over the phone in order to continue playing the game.

Table 1
Logic puzzle task. During the phone call, the speaker is required to fill out the form according to the given hints

Position	Left	Right
Character
Sports

Figure 8.

A simplified example of data augment based on feature spectra.

Figure 9.

Spotting differences. During the phone call, the speaker is required to find the differences between the two pictures above.

In this corpus, four different conversations are collected for each speaker. The first conversation concerns a relaxed topic without any task. In the second and third conversations, the speaker is required to complete a task under workload; the speaker is put under pressure and must focus on the task within the limited time. Finally, the fourth conversation concerns a light topic and does not involve any task. The corpus is divided into neutral and stressed speech and was collected from 100 people, including 50 men and 50 women. We assume that the stressed speech is generated under the workload conditions and that the neutral speech is collected during the relaxed discussion. Eleven speakers whose speech under stress is clearly different from their relaxed speech are selected through subjective evaluation. Hence, the corpus used here contains 156 stressed speech examples and 61 neutral speech examples in total.

4.1.2 CASIA Chinese emotion corpus

Emotional speech is another type of stressed speech. The CASIA Chinese emotion corpus is provided by the Institute of Automation, Chinese Academy of Sciences. The speech data of this corpus were recorded by four subjects (i.e., two men and two women) in a clean recording environment (SNR of about 35 dB), with 16 kHz sampling and 16-bit quantization. It has 9600 short utterances covering six emotional states (i.e., sad, angry, fear, surprise, happy, and neutral). In order to compare with the experimental results on the Fujitsu corpus, the CASIA corpus is also divided into two classes: neutral speech and stressed speech. Speech with the emotions sadness, anger, fear, surprise and happiness is classified as stressed speech, and neutral speech forms the other category, so that there are 8352 short utterances of stressed speech but only 1248 short utterances of neutral speech.

4.1.3 Experimental method

The experimental process is shown in Fig. 10. First, the features chosen by feature selection are extracted from the original corpus, and SpecAugment based on feature spectra is then applied to augment the extracted features. Next, the features are input to the classification model, which divides the corpus into two categories (i.e., stressed and neutral). Finally, we compute the recognition rate (accuracy) from the confusion matrix of the experimental results and use it to evaluate the experiments.

Figure 10.

The experimental flow of stress classification. In the confusion matrix, TP and TN denote correctly classified stressed and neutral samples, while FP and FN denote misclassified ones. Specifically, FP means that neutral samples are wrongly classified as stressed, while FN is the opposite.
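For reference, the recognition rate can be computed from the four confusion-matrix counts as the fraction of correctly classified samples. The sketch below uses this standard definition of accuracy; the function name `accuracy` is introduced only for this example.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples in a binary confusion matrix."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total
```

Note that on a class-imbalanced corpus this single number can be dominated by the majority class, so the individual counts remain informative.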

4.2 Speech preprocessing

In preprocessing, we performed endpoint detection to eliminate long periods of silence in speech, and applied pre-emphasis to the high frequencies of speech to remove the influence of lip radiation and enhance the high-frequency resolution of the speech. The pre-emphasis is realized by an FIR high-pass digital filter with the transfer function:

$\displaystyle H(z)=1-0.98z^{-1}$ (16)

The sampling frequency was 16,000 Hz, and a Hamming window was used for framing. The frame size used in the experiments was 32 ms, with a frame shift of 16 ms.
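The pre-emphasis filter of Eq. (16) and the framing step can be sketched as follows. The function names `preemphasis` and `frame_signal` are hypothetical, endpoint detection is omitted, and trailing samples that do not fill a complete frame are simply dropped, which is an assumption since the text does not state how the tail is handled.

```python
import numpy as np

def preemphasis(x, coeff=0.98):
    """Apply H(z) = 1 - coeff * z^-1, i.e. y[n] = x[n] - coeff * x[n-1]."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - coeff * x[:-1])

def frame_signal(x, fs=16000, frame_ms=32, shift_ms=16):
    """Split x into overlapping frames and apply a Hamming window to each."""
    x = np.asarray(x, dtype=float)
    frame_len = fs * frame_ms // 1000     # 512 samples at 16 kHz
    shift = fs * shift_ms // 1000         # 256 samples at 16 kHz
    n_frames = 1 + (len(x) - frame_len) // shift
    # Row r holds the sample indices of frame r.
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

At 16 kHz, a 32 ms frame with a 16 ms shift corresponds to 512-sample frames that overlap by half.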

4.3 Feature selection

Seven features are selected to form a feature set: Mel Frequency Cepstral Coefficients (MFCC), Filterbank Energies (FBANK) [35], Spectral Subband Centroids (SSC) [36], Energy, Linear Prediction Cepstral Coefficients (LPCC), Pitch (f0) and Zero-crossing rate. The features are then evaluated with the attention mechanism and sorted according to the variance of their attention weights. The results are shown in Table 2.

Table 2
Variance of the attention mechanism weights for different speech features

Ranking	Speech feature	Variance
1	Mel Frequency Cepstral Coefficients	1.834458e-06
2	Filterbank Energies	1.949985e-07
3	Spectral Subband Centroids	2.591459e-08
4	Energy	2.591459e-08
5	Linear Prediction Cepstral Coefficients	2.203480e-09
6	Pitch(f0)	5.358589e-10
7	Zero-crossing rate	5.606927e-10

The results indicate that three speech features, MFCC, FBANK and SSC, are more suitable as inputs for stress classification. The coefficients in Eq. (13) were assigned according to the variance: $\alpha=$ 0.5 for MFCC, $\beta=$ 0.3 for FBANK, and $\delta=$ 0.2 for SSC. The feature extraction steps of MFCC, FBANK and SSC are shown in Fig. 11. Among them, MFCC is currently the most widely used acoustic feature and reflects the auditory characteristics of the human ear [35]. Compared with MFCC, the extraction process of FBANK omits the DCT module and therefore retains more of the original speech information. However, these two types of features are less successful on noisy speech. SSC, in contrast, uses the higher-amplitude sections of the frequency spectrum (such as formants) [36], which are less affected by noise. Hence, SSC is often used as a supplement to cepstrum features in many experiments.

Figure 11.

The feature extraction steps of MFCC, FBANK and SSC.
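The relationship between the three features can be illustrated with a short sketch that starts from the power spectrum of one frame. It is not the authors' pipeline: the mel formula, the number of filters, the unnormalised DCT-II and the use of mel-spaced subbands for SSC are common choices assumed here, and all function names are hypothetical.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_bins, sample_rate):
    """Triangular filters spaced evenly on the mel scale (n_bins = n_fft // 2 + 1)."""
    top = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2.0) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank, bin_hz

def fbank(power, bank, eps=1e-10):
    """FBANK: log energy in each mel filter."""
    return [math.log(sum(w * p for w, p in zip(row, power)) + eps) for row in bank]

def mfcc(log_energies, n_ceps):
    """MFCC: DCT-II of the log filterbank energies (unnormalised)."""
    n = len(log_energies)
    return [sum(e * math.cos(math.pi * c * (i + 0.5) / n)
                for i, e in enumerate(log_energies))
            for c in range(n_ceps)]

def ssc(power, bank, bin_hz, eps=1e-10):
    """SSC: power-weighted mean frequency inside each subband."""
    centroids = []
    for row in bank:
        num = sum(f * w * p for f, w, p in zip(bin_hz, row, power))
        den = sum(w * p for w, p in zip(row, power)) + eps
        centroids.append(num / den)
    return centroids

# Flat toy spectrum for a 512-point frame at 16 kHz (257 frequency bins).
power = [1.0] * 257
bank, bin_hz = mel_filterbank(26, 257, 16000)
fb = fbank(power, bank)
cc = mfcc(fb, 13)
sc = ssc(power, bank, bin_hz)
```

FBANK stops after the log filterbank energies, MFCC applies one more DCT step to them, and SSC replaces the energy in each subband with the frequency at which that energy is centred.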

To verify the effectiveness of the feature selection algorithm, the seven features mentioned above were extracted from the Fujitsu stressed speech corpus and fed separately into the attention-based LSTM, with 2 layers and 512 hidden nodes. The classification accuracies are shown in Fig. 12.

Figure 12.

The classification accuracy of the LSTM model based on attention mechanism with each feature as input separately.

The experimental results show that the recognition accuracy of each feature is basically consistent with the result of feature selection. Although LPCC was close to SSC in classification accuracy, MFCC and FBANK performed clearly better, which basically verifies the hypothesis behind the feature selection.

Furthermore, we compared the above algorithm with traditional feature selection algorithms, including correlation analysis (CA), information gain (IG) and the forward selection algorithm (FS). On this feature set, CA selected MFCC, LPCC and f0; IG selected MFCC, FBANK and LPCC; and FS selected MFCC, FBANK and SSC. MFCC is the first choice of all three feature selection algorithms, which is consistent with the experimental results in Fig. 12. Moreover, the FS algorithm produced the same result as our method. The features selected by each algorithm were input into Att_MF-LSTM to perform the classification. Results are shown in Fig. 13.

Figure 13.

Comparison of our method and traditional feature selection algorithms for stressed speech classification.

According to the experimental results, FS and our algorithm achieve the same recognition performance because the selected features are identical, and both are better than the other two algorithms. In Fig. 13, CA is clearly inferior to the other three algorithms because of the generally low correlation and discrimination among the features. Therefore, we followed the result of our feature selection, and the MFCC, FBANK and SSC features were selected as the input of the model.

Finally, we compared the proposed features with previous feature sets. Spectrogram and reduced GeMAPS (rGeMAPS) were chosen as the inputs of the emotion recognition models in [14] and [37], respectively. Therefore, we compare the proposed features with MFCC, spectrogram, and rGeMAPS. The spectrogram is input into the CNN model proposed in [14], and the other three features are input into the proposed Att_MF-LSTM. Results are shown in Fig. 14.

Figure 14.

Comparison of the proposed features and previous feature sets for stressed speech classification.

We found that the proposed features are slightly better than rGeMAPS. The major difference between our features and rGeMAPS is SSC, which further suggests that the SSC feature is a beneficial supplement to cepstrum features. Moreover, MFCC, as a component of the proposed features, achieves satisfactory classification performance on its own.

4.4 Comparison with state-of-the-art methods on the CASIA corpus

To verify the generalization of the model to a different corpus, we performed the experimental evaluation on the CASIA corpus. Dividing CASIA into two classes led to an imbalance of samples: stressed speech made up 87% of the data, which made it difficult to correctly classify neutral speech, which made up only 13% of the corpus. To reduce the influence of the unbalanced samples, we replaced the standard cross-entropy loss function with the focal loss function, defined as follows [38]:

$\displaystyle\text{FL}({p_{t}})=-a_{t}({1-p_{t}})^{\gamma}\log({p_{t}})$ (17)

The main differences between the focal loss function and the standard cross-entropy loss function are the weighting factor $a_{t}$ and the modulating factor $({1-p_{t}})^{\gamma}$. The weighting factor $a_{t}$ makes the model more sensitive to the small-sample class by giving it a higher misclassification cost, and the modulating factor parameter ${\gamma}$ smoothly adjusts the rate at which the weight of the large-sample class is reduced [35].
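Eq. (17) can be written down directly for the binary case. The sketch below is illustrative: the function name is hypothetical, and the convention that $a_{t}=a$ for the class labelled 1 and $1-a$ for the other class follows the focal loss paper [38] and is an assumption about how it was applied here.

```python
import math

def focal_loss(p, y, a=0.25, gamma=3.0, eps=1e-12):
    """Binary focal loss FL(p_t) = -a_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of class 1 and y is the true label (0 or 1).
    Assumed convention: a_t = a when y == 1 and 1 - a when y == 0; which class
    is labelled 1 is left to the caller.
    """
    p_t = p if y == 1 else 1.0 - p
    a_t = a if y == 1 else 1.0 - a
    return -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

# A confidently correct prediction is down-weighted far more than a wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

In this sketch, setting `gamma=0.0` removes the modulating factor, so the expression reduces to a class-weighted cross-entropy.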

Focal loss functions with different weighting and modulating factors were applied to the proposed model for experimental comparison. The experimental results show that the best classification performance is obtained when $\gamma=$ 3, $a=$ 0.25, as shown in Table 3.

Table 3

Influence of the modulating factor parameter $\gamma$ and weighting factor $a_{t}$ on accuracy and recall for stress classification

${\gamma}$	${a}$	Accuracy (%)	Recall rate
0	0	86.0	1.0
0.1	0.75	86.1	0.97
0.2	0.75	86.1	0.95
0.5	0.5	86.3	0.95
1.0	0.25	86.4	0.92
2.0	0.25	86.5	0.89
3.0	0.25	86.7	0.87
5.0	0.25	86.3	0.91

When ${\gamma}=$ 0, ${a}=$ 0 (the first row of Table 3), the focal loss function is equivalent to the standard cross-entropy function. With $\gamma=$ 3, $a=$ 0.25, the classification accuracy improves by 0.7% and the recall rate drops from 1.0 to 0.87 (Table 3), which indicates that the neutral speech contributes more to model training and that overfitting to the stressed speech is alleviated. The recall rate with the original cross-entropy loss function is 100%, which means that the model classifies all the stressed speech correctly and that all errors occur when neutral speech is classified as stressed; that is, the classification accuracy for neutral speech is relatively low. The classification performance for neutral and stressed speech is therefore clearly different. Focal loss reduces this gap in the model's ability to classify the two types of speech and greatly improves the recognition rate of neutral speech, to 83%. Because of the unbalanced samples in the CASIA corpus, neutral speech constitutes only a small proportion of all data, so the overall accuracy does not change significantly. However, when applied to a data set with balanced samples across emotional classes, a significant improvement in accuracy is achieved.
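The interplay between accuracy and recall on an unbalanced corpus can be made concrete with a small helper. The counts below are invented to mirror an 87%/13% class split and are not taken from the experiments; the function name is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision and F1 with 'stressed' as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, recall, precision, f1

# Illustrative counts only: 87 stressed and 13 neutral utterances,
# with every utterance predicted as stressed.
always_stressed = binary_metrics(tp=87, fp=13, fn=0, tn=0)
```

A classifier that labels everything as stressed reaches a recall of 1.0 and an accuracy equal to the share of stressed samples while never recognizing neutral speech, which is why a drop in recall can accompany a better balance between the two classes.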

In the next evaluation, MFCC, FBANK, and SSC features were extracted from the speech data. Data augmentation was performed, and the speech features of the four speakers were divided into stressed and neutral classes. We used 4-fold cross-validation under the speaker-independent condition: in each fold, three sessions are used for training and one session for testing. The parameter configuration of the proposed model is shown in Table 4. The tanh activation function in the traditional LSTM model was replaced by the softsign activation function. The softsign curve is smoother and its derivative changes more slowly than that of tanh, which better mitigates the vanishing gradient problem.

Table 4

Att_MF-LSTM model configuration parameters

Configuration	Parameter	Configuration	Parameter
Epoch	40	Lstm_num_layers	2
Learning rate	0.0001	Lstm_hidden_size	512
Batch_size	20	Activation function	Softsign
Optimizer	Adam	Regularization	Dropout (0.75)
Weight regularization	L2
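The difference between the two activation functions can be checked numerically. The sketch below compares the derivatives of softsign, $x/(1+|x|)$, and tanh at a few input values chosen for illustration.

```python
import math

def softsign(x):
    return x / (1.0 + abs(x))

def softsign_grad(x):
    """Derivative of softsign: 1 / (1 + |x|)^2, which decays polynomially."""
    return 1.0 / (1.0 + abs(x)) ** 2

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which decays exponentially."""
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 1.0, 3.0, 6.0):
    print(f"x={x:>4}: softsign'={softsign_grad(x):.5f}  tanh'={tanh_grad(x):.5f}")
```

Both derivatives equal 1 at the origin, but for large inputs the softsign gradient stays orders of magnitude larger than the tanh gradient, which is the property relevant to vanishing gradients.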

Next, we compared the effectiveness of the proposed framework with traditional models, including SVM, GMM, spectrogram $+$ CNN, spectrogram $+$ resnet34, LSTM $+$ CNN, LSTM and LSTM based on attention mechanism (Att_LSTM). LSTM is chosen as the baseline because the proposed model is an improvement of the LSTM model. The parameter configuration of the models is shown in Table 5, and focal loss was used as the loss function for all models except SVM and GMM. The comparison results are shown in Figs 15 and 16 and Table 6.

Table 5

Configuration parameters for traditional models

Model	Configuration	Parameter
SVM [10]	Kernel	RBF
	Approach	One-against-one
	Speech Feature	MFCC
GMM	Number	3
	Covariance_type	full
	Speech Feature	MFCC
CNN [14]	Convolutional Layers	4
	Pooling Layers	4
	Linear Layers	2
	Kernel (Layer $=$ 1, 2, 3)	3 $\times$ 3
	Kernel (Layer $=$ 4)	2 $\times$ 2
	Speech Feature	Spectrogram
Resnet	Layers	34
	Speech Feature	Spectrogram
CNN $+$ LSTM	Convolution Layers	3
	Pooling Layers	3
	Linear Layers	2
	Kernel	3 $\times$ 3
	Lstm_num_layers	1
	Lstm_hidden_size	128
	Speech_Feature	MFCC
LSTM (baseline)	Layers	2
	LSTM_hidden_size	512
Att_LSTM	LSTM_num_layers	2
	LSTM_hidden_size	512
	Speech Feature	MFCC

Figure 15.

Confusion matrix of models for speaker-independent stress classification on CASIA stressed speech corpus.

Table 6

Performance of the speaker-independent stress classification on CASIA stressed corpus with demographic feature (%)

Model	ACC	RECALL	PRECISION	F1_SCORE
SVM	83.8	96.2	86.3	91.0
GMM	83.7	96.2	86.3	91.0
CNN	85.1	96.4	86.4	91.1
Resnet	85.3	96.5	86.3	91.2
LSTM $+$ CNN	85.5	96.5	86.5	91.2
LSTM (baseline)	85.3	96.4	86.4	91.2
Att_LSTM	85.8	96.6	86.4	91.2
Att_MF-LSTM	86.8	96.9	86.3	91.3

Figure 15 presents the class-level accuracy of the proposed model and the traditional models as confusion matrices of true and predicted labels. As with the proposed model, focal loss also effectively alleviated the problem of imbalanced samples in CNN, LSTM and the other models. Figure 16 shows the classification accuracy of each model. The proposed model classifies stressed and neutral speech with an accuracy of 86.7% and improves on the traditional models, so it is well suited to the CASIA corpus. Further, by comparing the classification rates of LSTM and Att_LSTM, we found that the attention mechanism contributes to the improvement but is not a major factor. This is largely because each example in the CASIA corpus consists of one simple sentence with only one emotion type. Each frame in a CASIA example is consistent with the emotional label of the example, so the frames differ little in their influence on the classification result, which weakens the function of the attention mechanism.

Table 7

Performance of the speaker-independent stress classification on Fujitsu stressed corpus with demographic feature (%)

Model	ACC	RECALL	PRECISION	F1_SCORE
SVM	70.7	93.3	86.0	89.5
GMM	69.9	93.1	85.9	89.4
CNN	71.7	93.7	85.9	89.6
Resnet	72.0	93.7	86.0	89.7
LSTM $+$ CNN	75.3	94.7	85.9	90.1
LSTM (baseline)	74.2	94.3	85.9	89.9
Att_LSTM	77.7	95.3	85.9	90.4
Att_MF-LSTM	80.5	95.9	86.0	90.7

Figure 16.

Comparison of Att_MF-LSTM model and other methods in CASIA, where Model 1 is SVM, Model 2 is GMM, Model 3 is CNN, Model 4 is Resnet, Model 5 is LSTM $+$ CNN, Model 6 is LSTM, Model 7 is Att_LSTM and Model 8 is Att_MF-LSTM.

Figure 17.

Confusion matrix of models for speaker-independent stress classification on Fujitsu stressed speech corpus.

Figure 18.

Comparison of Att_MF-LSTM model and other methods with the Fujitsu stressed speech corpus, where Model 1 is SVM, Model 2 is GMM, Model 3 is CNN, Model 4 is Resnet, Model 5 is LSTM $+$ CNN, Model 6 is LSTM, Model 7 is Att_LSTM and Model 8 is Att_MF-LSTM.

4.5 Comparison with state-of-the-art methods on the Fujitsu corpus

4.5.1 Evaluation under speaker-independent condition

Speaker-independent refers to the classification of neutral and stressed speech without distinguishing speakers. In this evaluation, MFCC, FBANK, and SSC features were extracted from the speech data. Data augmentation was performed, and the speech features of different speakers were mixed to form a speaker-independent data set. The standard cross-entropy function was used because the samples in this corpus are balanced.

Next, we again compared the effectiveness of the proposed framework with traditional models, including SVM, GMM, spectrogram $+$ CNN, spectrogram $+$ resnet34, LSTM $+$ CNN, and LSTM. The parameter configuration of these models is the same as in Section 4.4, as shown in Tables 4 and 5. The comparison results are shown in Figs 17 and 18 and Table 7.

According to the experimental results, the proposed Att_MF-LSTM method improves classification accuracy compared with the other models. The comparative experiments lead to the following conclusions.

First, neural networks perform better than the traditional machine learning algorithm SVM. The results show that overfitting does not occur when SVM is trained without data augmentation, and its classification rate was almost the same before and after data augmentation. The GMM model showed a similar behavior in the experiment. This indicates that traditional machine learning algorithms are prone to a performance bottleneck in classification, and data augmentation does not lead to higher classification rates. Neural networks, in contrast, can clearly learn from more data, and an increase in data improves the recognition rate.

Compared with the Spectrogram $+$ CNN and Spectrogram $+$ Resnet34 models, the classification rate of the models that include LSTM increases by 3.3%, indicating that LSTM performs well on temporal stressed speech signals under multitasking workloads. In the CNN models, the speech signal is converted to a spectrogram and an image classification method is used for stress classification. Furthermore, since stress is transient, the spectrogram of stressed speech contains obvious features only in limited local areas, which makes it difficult to distinguish stressed speech from neutral speech by image classification.

In addition, the results show that the accuracy of the Att_LSTM model increases by 3.6% compared to LSTM $+$ CNN, which indicates that the attention mechanism enables the model to focus on the key frames for stress in speech. The Att_MF-LSTM model further improves the classification performance by 2.8%, confirming that multi-feature fusion allows the weight assignment to be more reasonable.

The attention mechanism proved more effective on the Fujitsu stressed speech corpus, as shown by comparing the classification accuracy of LSTM and Att_LSTM in Figs 16 and 18. This is largely because each example in the CASIA corpus consists of one simple sentence with only one emotion type. All speech segments of each CASIA sample are therefore speech under the labeled emotional state. In contrast, the samples in the Fujitsu stressed speech corpus are continuous speech. A continuous utterance labeled as “stressed” includes not only the stressed segments produced under workload, but also some neutral segments. The utterance alternates between these two states, which is closer to a natural context. Since the transition between the neutral and stressed states is instantaneous and ambiguous, it is difficult to make a clear frame-level division. The experimental results lead us to believe that the effect of the attention mechanism will be more pronounced for continuous stressed speech in a natural context. Learning the importance of different frames is a preliminary solution to the fuzzy representation of stressed speech.
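The idea of weighting frames by importance can be illustrated with a generic softmax attention pooling step. This is a simplified sketch and not the formulation defined earlier in the paper: the dot-product scoring and the `query` vector are assumptions, and the frame vectors are toy values.

```python
import math

def attention_pool(hidden_states, query):
    """Softmax-weighted average of frame-level vectors.

    hidden_states: list of T frame vectors (e.g. LSTM outputs), each of length D.
    query: hypothetical learned vector of length D used to score each frame.
    Returns (weights, context), where the weights sum to 1 over the T frames.
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hidden_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, hidden_states))
               for d in range(dim)]
    return weights, context

# Toy input: the second frame aligns with the query, so it gets the largest weight.
frames = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
weights, context = attention_pool(frames, query=[1.0, 1.0])
```

When an utterance mixes stressed and neutral segments, pooling of this kind lets the frames that resemble the stressed pattern dominate the utterance-level representation instead of being averaged away.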

Furthermore, frame-level speech labels cannot be obtained because of the subjectivity involved in stress labeling, and the ambiguity of the labels lowers the classification rate to some extent. Therefore, we expect that if an objective evaluation were available for frame-level labeling, the classification performance could be further improved.

4.6 Ablation studies

To verify the effectiveness of the selected parameters and the generalization of the model, we compared the effects of different frame lengths and different training optimizers on the classification accuracy. We performed experiments with frame lengths of 20 ms, 25 ms, 30 ms and 32 ms. Except for the frame length, the experimental conditions are the same as the Att_MF-LSTM conditions in 4.4.1, and the results are shown in Table 8.

Table 8
Comparison of features with different frame lengths for stressed speech classification, where time represents the model training time

Frame length (ms)	Accuracy (%)	Time (hour)
20	80.3	2.554
25	79.9	2.553
30	80.4	2.554
32	80.5	2.554

The experimental results show that the feature with a frame length of 32 ms achieves the best classification accuracy, while the influence of the frame length on the training time is negligible. Furthermore, we compared various optimizers, with the results shown in Table 9.

Table 9

Comparison of classification results with different optimizers, where time represents the model training time

Optimizer	Accuracy (%)	Time (hour)
SGD	80.3	2.647
Adadelta	80.5	2.579
Adagrad	80.5	2.576
Adam	80.5	2.554

The results show that the Adam optimizer was the best choice in terms of both accuracy and training time. Moreover, the Adam optimizer itself is quite robust to the selection of hyperparameters.

4.7 Evaluation under speaker-dependent condition

In this evaluation, speaker-dependent stress classification was performed with the proposed model. Because speakers have different physical and psychological characteristics, which lead to differences in vocal properties as well as in stress expression, we train a model for each speaker to perform classification. In traditional classification learning tasks, two basic assumptions ensure the accuracy and reliability of the classification model: (1) the training samples and testing samples are independent and identically distributed; (2) enough training samples are available to train the model. However, the original database does not contain enough speaker-dependent samples to obtain a reliable classification model. As shown in Fig. 19, the classification rate on the training set is above 95%, but the average classification accuracy on the test set is only 66.4%, which indicates that the model overfits: its generalization ability is weak and it cannot reliably classify unseen samples.

Transfer learning was employed to address the small number of speaker-dependent stressed speech samples. In the experiment, a target speaker is selected, and samples from the remaining ten speakers are used for pre-training, after which the model is transferred to the target speaker by fine-tuning. 4-fold cross-validation was also used to divide each speaker’s corpus into a training corpus and a test corpus, and we tested the classification accuracy of the model on each speaker’s test corpus. The experimental results for each speaker are shown in Fig. 19. Following transfer learning, the classification rate on the test set increases to 74.6%, which shows that the generalization ability is improved. The problems of overfitting and small sample size have been solved.
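The data partitioning behind this procedure can be sketched as follows. Only the splitting logic is shown; the pre-training and fine-tuning steps themselves are not reproduced. The function name, the dictionary layout, the interleaved fold assignment and the toy speaker ids are all hypothetical.

```python
def transfer_learning_splits(samples_by_speaker, target, n_folds=4):
    """Yield (pretrain, finetune_train, finetune_test) splits for one target speaker.

    samples_by_speaker maps a speaker id to a list of samples (hypothetical layout).
    Pre-training uses every speaker except the target; the target speaker's own
    samples are split into n_folds folds for fine-tuning and testing.
    """
    pretrain = [s for spk, items in samples_by_speaker.items()
                if spk != target for s in items]
    own = samples_by_speaker[target]
    folds = [own[i::n_folds] for i in range(n_folds)]  # simple interleaved folds
    for k in range(n_folds):
        test = folds[k]
        train = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield pretrain, train, test

# Toy corpus with made-up speaker ids and integer placeholders for utterances.
corpus = {"A": list(range(8)), "B": list(range(100, 106)), "C": list(range(200, 204))}
splits = list(transfer_learning_splits(corpus, target="A"))
```

Keeping the target speaker out of the pre-training pool ensures that the reported test accuracy reflects data the model has never seen in either stage.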

Figure 19.

Experimental comparison before and after transfer learning, where F1 $\sim$ F6 represent 6 female speakers and M1 $\sim$ M4 represent 4 male speakers.

The speaker-dependent experiments reflect the different degrees of phonetic variation among speakers under multitasking workloads. The stress classification rates varied across speakers, which suggests that speakers express stress differently because of their physiological and psychological states. In the figure, F denotes female and M denotes male. Following transfer learning, the average classification rate is 75.6% for women and 73.1% for men, which indicates that speakers of different genders may differ in how they express stress.

Furthermore, the results show no direct relationship between the duration of stressed speech and the classification rate. However, the generalization ability of the pre-trained model shows a linear relationship with the number of pre-training speech samples: the model generalizes better when the pre-training samples are sufficient. Therefore, even for a single speaker with a small number of data samples, satisfactory classification results are achieved through transfer learning, and the training time is shortened.

5. Conclusion

In this study, we consider the differing importance of speech frames for stress classification and propose a multi-feature fusion framework based on the attention mechanism. LSTM hidden node information and multi-feature fusion are used to improve the attention mechanism. Compared to the traditional models, the speaker-independent stress classification rate is improved by 5.2%. In the speaker-dependent experiments, the proposed method alone is less effective because of the small-sample problem, which leads to overfitting. Combined with transfer learning, the proposed model is verified to generalize better, and the small sample size problem is solved. Future work will further explore the fuzzy representation of stressed speech under workload and enhance the generalization and robustness of stressed speech recognition.

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities under grant B200202205, the National Key Research and Development Program under grant 2018AAA0100800, the Key Research and Development Program of Jiangsu under grants BK20192004, BE2018004-04, BE2017071, and BE2017647, the National Natural Science Foundation of China under grants 61501170, 41876097, and 61401148, the Open Research Fund of State Key Laboratory of Bioelectronics, Southeast University under grant 2019005, and the State Key Laboratory of Integrated Management of Pest Insects and Rodents under grant IPM1914.

References

1. Mustaqeem and Kwon, A CNN-assisted enhanced audio signal processing for speech emotion recognition, Sensors (Basel, Switzerland) 20(1) (2019), 183.
2. Evans P.C. and Annunziata, Industrial Internet: Pushing the Boundaries of Minds and Machines, General Electric, Tech. Rep., 2012.
3. Deb and Dandapat, Fourier model based features for analysis and classification of out-of-breath Speech, Speech Commun. 90 (2017), 1–14.
4. Sezgin M.C., Gunsel and Kurt G.K., Perceptual audio features for emotion detection, EURASIP J. Audio, Speech, Music P. 2012(1) (Dec. 2012), 1–21. doi: 10.1186/1687-4722-2012-16.
5. Vignolo L.D., Feature optimisation for stress recognition in speech, Pattern Recogn. Lett. 84 (Jul. 2016), 1–7.
6. Bandela S.R. and Kumar T.K., Emotion Recognition of stressed Speech Using Teager Energy and Linear Prediction features, in: 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT), Mumbai, IN, 2018, pp. 422–425.
7. Mower, Matarić and Narayanan, A framework for automatic human emotion classification using emotion profiles, IEEE T. Audio Spe. 19(5) (Jul. 2011), 1057–1070. doi: 10.1109/TASL.2010.2076804.
8. Attabi and Dumouchel, Anchor models for emotion recognition from speech, IEEE T. Affect. Comput. 4(3) (Jul. 2013), 280–290. doi: 10.1109/T-AFFC.2013.17.
9. Kotti and Paternò, Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema, Int. J. Speech Techn. 15(2) (Jun. 2012), 131–150.
10. Besbes and Lachiri, Multi-class SVM for stressed speech recognition, in: 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Monastir, TN, 2016, pp. 782–787.
11. Bandela S.R. and Kumar T.K., Emotion recognition of stressed speech using teager energy and linear prediction features, in: ICALT, Mumbai, IN, 2018, pp. 422–425.
12. Dumpala S.H. and Kopparapu S.K., Improved speaker recognition system for stressed speech using deep neural networks, in: 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, 2017, pp. 1257–1264.
13. Mustaqeem S.M. and Kwon, Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM, IEEE Access 8 (2020), 79861–79875.
14. Badshah A.M., Ahmad, Rahim et al., Speech Emotion Recognition from Spectrograms with Deep Convolutional Neural Network, in: 2017 International Conference on Platform Technology and Service (PlatCon), IEEE, 2017.
15. Lotfian and Busso, Curriculum Learning for Speech Emotion Recognition From Crowdsourced Labels, in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, no. 4, April 2019, pp. 815–826. doi: 10.1109/TASLP.2019.2898816.
16. Zhang, Zhao and Tian, Spontaneous speech emotion recognition using multiscale deep convolutional LSTM, IEEE Transactions on Affective Computing (2019), (99), p. 1-1.
17. Martin et al., Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies, in: Proc. Annu. Conf. Int. Speech Commun. Assoc., 2008, pp. 597–600.
18. Lee and Tashev, High-level Feature Representation using Recurrent Neural Network for Speech Emotion Recognition, in: International Speech Communication Association, Sept. 2015.
19. Zhao, Mao and Chen, Speech emotion recognition using deep 1D & 2D CNN LSTM networks, in: Biomedical Signal Processing and Control, Vol. 47, Jan. 2019, pp. 312–323.
20. Rengaswamy, Reddy M.K., Rao K.S. and Dasgupta, Robust f0 extraction from monophonic signals using adaptive sub-band filtering, in: Speech Communication, Vol. 116, Jan. 2020, pp. 77–85.
21. Xie, Liang, Huang, Zou and Schuller, Speech Emotion Classification Using Attention-Based LSTM, in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 27, no. 11, Nov. 2019, pp. 1675–1685. doi: 10.1109/TASLP.2019.2925934.
22. Son L.H., Kumar, Sangwan S.R., Arora, Nayyar and Abdel-Basset, Sarcasm Detection Using Soft Attention-Based Bidirectional Long Short-Term Memory Model With Convolution Network, in: IEEE Access, Vol. 7, 2019, pp. 23319–23328. doi: 10.1109/ACCESS.2019.2899260.
23. Meftah A.H., Mathkour, Kerrache and Alotaibi Y.A., Speaker Identification in Different Emotional States in Arabic and English, in: IEEE Access, Vol. 8, 2020, pp. 60070–60083. doi: 10.1109/ACCESS.2020.2983029.
24. Cheng, Shen, Sun and Liu, Agreement-based joint training for bidirectional attention-based, in: International Joint Conference on Artificial Intelligence, Vol. 16, 2016, pp. 2761–2767.
25. Radev D.R., Hovy and McKeown, Introduction to the special issue on summarization, in: Computational Linguistics, December 2002. doi: 10.1162/089120102762671927.
26. Golub and He, Character-level question answering with attention, in: the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1598–1607.
27. Chorowski, Bahdanau, Serdyuk, Cho and Bengio, Attention-based models for speech recognition, 2015. doi: 10.1016/0167-739X(94)90007-8.
28. Bahdanau, Chorowski, Serdyuk, Brakel and Bengio, End-to-end attention-based large vocabulary speech recognition, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 4945–4949. doi: 10.1109/ICASSP.2016.7472618.
29. Mirsamadi, Barsoum and Zhang, Automatic speech emotion recognition using recurrent neural networks with local attention, in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 2227–2231. doi: 10.1109/ICASSP.2017.7952552.
30. Huang and Narayanan S.S., Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition, in: 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, 2017, pp. 583–588. doi: 10.1109/ICME.2017.8019296.
31. Treisman A.M. and Gelade, A feature-integration theory of attention, Cognitive Psychology 12(1) (1980), 97–136.
32. Bahdanau, Cho and Bengio, Neural machine translation by jointly learning to align and translate, Computer Science, 2014.
33. Zhou, Wang et al., GA-SVM based feature selection and parameter optimization in hospitalization expense modeling, Applied Soft Computing, 2018, 75.
34. Park D.S. et al., SpecAugment: A simple data augmentation method for automatic speech recognition, in: Interspeech, Graz, AT, 2019, pp. 2613–2617.
35. Zulfiqar, Muhammad and M.E.A.M., A speaker Identification System Using MFCC features with VQ Technique, in: 2009 Third International Symposium on Intelligent Information Technology Application, Shanghai, CN, 2009, pp. 115–118.
36. Paliwal K.K., Spectral subband centroid features for speech recognition, in: Proc. ICASSP, Vol. 2, 1998, pp. 617–620.
37. Neumann and Vu N.T., Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech, 2017.
38. Lin T.Y., Goyal, Girshick et al., Focal loss for dense object detection, IEEE Transactions on Pattern Analysis & Machine Intelligence 2017, PP(99): 2999-3007.
