Abstract
English differs from Chinese in that it emphasizes syllable stress, so the recognition of emotion in English speech plays an important role in learning English. This study uses transfer learning as the technical basis for English speech emotion recognition. The acoustic model based on weight transfer can be trained with two different strategies: single-stage training and two-stage training. The performance of an English speech emotion recognition model based on a CNN is compared with that of the model proposed in this paper, and the comparison data are plotted as statistical graphs. The results show that transfer learning has certain advantages over other algorithms in English speech emotion recognition. In subsequent research on teaching and real-time translation equipment, transfer learning can be applied to English models.
Introduction
Speech Emotion Recognition (SER) is an important branch of speech research. It is a technique for determining a speaker’s emotional state and involves many core issues in speech research, such as signal processing, feature extraction and pattern recognition. Emotion recognition is based on theoretical research into human emotion. Darwin, the founder of evolutionary theory, held that emotion is a broad trait shared by all human beings across ethnic groups and cultures [1]. Other experts and scholars have studied human emotions in more detail and found that different cultures share common expressions of emotion. These emotions can be divided into happiness, sadness, anger, etc. [2], a division that has become the theoretical basis of the field of emotion recognition. Moreover, emotion, as a complex psychological activity of human beings, expresses a person’s true thoughts, is a predictor of the person’s next move, and can even reflect a person’s health. Because emotion is cross-cultural and inter-ethnic and is of great significance to people, research on emotion recognition has received more and more attention.
Emotion, as a complex set of human physiological and psychological activities, is a research hotspot in many disciplines such as psychology and physiology. At the same time, emotion recognition, as an important part of intelligent human-computer interaction systems, is of great significance for improving machine intelligence, making machines friendlier, and building smart cities. In addition, there are many other application scenarios for emotion recognition. In the field of education, emotion recognition technology recognizes students’ emotional changes during learning from signals such as changes in their facial expressions, so as to judge their concentration and their understanding of the content taught by the teacher. The judgment of the machine can also be fed back to teachers to help them improve the teaching process [3]. For example, if the machine judges that the students are mainly confused and frustrated, the content taught may be too difficult or the teaching method may be inappropriate. The teacher should then reflect on whether the content is too difficult or whether the teaching method needs to be improved.
English differs from Chinese in that it emphasizes syllable stress, so the recognition of emotion in English speech plays an important role in learning English. English speech emotion recognition aims to make the machine understand human language and convert speech information into the corresponding text. Speech emotion recognition is a prerequisite for intelligent information processing, has very important research value and practical significance, and brings many conveniences to people’s lives [4].
This study uses transfer learning as a technical support to study English speech emotion recognition, so as to provide reference for English learners, and also provide theoretical research reference for real-time English translation.
Related work
Speech recognition predates the computer. In the 1920s, the world’s first speech recognition machine, “Radio Rex”, came out. Modern automatic speech recognition technology was born at AT&T Bell Labs in the 1950s. Based on the formant information of the vowels in spoken digits, the researchers developed a system that could recognize numbers [5], although it could only recognize ten isolated English digits spoken by a specific person. Later, the application of the dynamic time warping (DTW) method to speech recognition [6] allowed utterances of different lengths to be aligned in time, which further stimulated the rapid development of speech recognition technology and laid a solid foundation for continuous speech recognition. Since then, research on speech recognition has moved from the recognition of isolated words to Large Vocabulary Continuous Speech Recognition (LVCSR). From the 1980s to the 1990s, methods based on probability and statistics replaced methods based on template matching, which marked the beginning of the rapid development of speech recognition technology. The Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and n-gram language models emerged one after another and gradually improved the performance of large vocabulary continuous speech recognition systems. After that, speech recognition systems built on the GMM-HMM acoustic model and the n-gram language model [7] achieved a key breakthrough in performance and good results for large vocabulary continuous recognition. By the beginning of the 21st century, speech recognition systems based on GMM-HMM were relatively mature, but their performance still could not reach a practical level. During this period, the development of speech recognition technology was unable to break through the bottleneck.
With the development of deep learning and the rise of neural networks, speech recognition technology achieved further breakthroughs and entered a period of rapid progress. The literature [8] proposed a layer-by-layer initialization training method for Deep Belief Networks (DBN). The literature [9] proposed an acoustic model based on DBN-HMM, which fits the posterior probability of the HMM states through the DBN; trained on the TIMIT data set, it achieved good results. The literature [10] proposed a hybrid acoustic model based on a context-sensitive deep neural network (DNN) and a hidden Markov model. Compared with the GMM-HMM acoustic model, the performance of this model was significantly improved. After that, the emergence of neural networks with various structures and their widespread application in speech recognition brought speech recognition technology to its peak. Long Short-Term Memory (LSTM) and Time Delay Neural Network (TDNN) models can model long-term speech signals and have gradually become the mainstream acoustic models in speech recognition systems. In addition, the Convolutional Neural Network (CNN) can effectively cope with the diversity of speech signals through convolution, which improves the recognition rate of acoustic models to a certain extent [11].
The “learning to learn” method proposed in the literature [12] is the prototype of the idea of transfer learning. The Defense Advanced Research Projects Agency (DARPA) defined a classification system based on transfer learning as one that can apply the knowledge it has already learned to other tasks. Since then, transfer learning technology has been widely researched and applied. The research in the literature [13] showed that transfer learning can effectively solve the problem of data sparseness by applying the knowledge learned by one model to another model. With the development of machine learning and deep learning, deep neural network models have been used in various classification fields. One of their advantages is that, because a multi-layer network structure extracts features layer by layer, these features can be migrated layer by layer as needed. The literature [14] proposed that the features learned by hidden layers closer to the input layer (lower layers) are more general, and the features of hidden layers closer to the output layer (higher layers) are more specific, which has played a key guiding role in the development of transfer learning. The application of transfer learning in speech recognition is mainly reflected in two methods: domain adaptation and pre-training. The literature [15] proposed a domain adaptation technique based on the linear hidden network (LHN) and the fDLR method, which achieved good results in speech recognition systems. The literature [16] used deep belief networks to model phoneme sequences and proposed that the features of the middle layers of the acoustic model are not specific to the training data.
The unsupervised pre-training method proposed in the literature [17] and the supervised pre-training method using the extra-domain data used in the literature [18] have been successively applied to speech recognition systems, which has further developed the application of transfer learning in speech recognition. The literature [19] used the multi-language trained DNN model as the initial model of the low-resource target language, which has achieved a significant improvement in the target domain. In terms of cross-language transfer, the DNN-based acoustic model of English speech recognition proposed in the literature [20] has achieved better results than the baseline system in the target domain French. Although the speech recognition technology has been greatly promoted by deep learning, the performance of speech recognition systems with small data sets has not yet reached the level required by people. How to improve the performance of the acoustic model of low-resource speech recognition system is the main content of this study, and transfer learning is an effective method to solve this problem.
Autoencoder
An autoencoder (AE) is a neural network model that learns the hidden features of data by being trained to make the network output equal to the input. It is currently used for pre-training deep neural networks and is one of the main network structures of deep learning. The learning process of the autoencoder is unsupervised, that is, no labeled data samples are needed for training. The process of learning an efficient feature representation from the input data is called encoding, and the process of reconstructing the original data from the learned hidden features is called decoding. After the autoencoder has extracted features from the data, they can be passed to a supervised learning model to complete the required task. In this process, the autoencoder acts as a feature extractor that increases or decreases the dimension of the data to obtain a suitable and efficient hidden feature representation. The basic structure of the autoencoder is similar to a three-layer neural network, that is, it includes an input layer, a hidden layer, and an output layer, as shown in Fig. 1. An input sample is {x_1, x_2, x_3, ⋯, x_i}, and the forward propagation formula of the autoencoder is shown in Equations (1) and (2). Among them, S is the activation function, which is generally a sigmoid function, W_1 is the weight between the input layer and the hidden layer, W_2 is the weight between the hidden layer and the output layer, b_i is the bias term, and h_{W,b}(x) is the activation value of the output layer. The training target of the autoencoder is an output equal to the input, and the final learning result is shown in formula (3) [21–23].

Fig. 1. Basic structure diagram of autoencoder.
The autoencoder is trained to make the output as close to the input as possible, and different numbers of hidden units yield hidden features of different dimensions, so that the dimension of the original data can be increased or reduced. By stacking multiple autoencoders together and adding constraints, we can learn efficient representations of the data at different levels. Therefore, the autoencoder can be used for pre-training in deep learning, extracting useful data features that help subsequent supervised learning.
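The encode and decode passes described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the layer sizes, the random initialization range, the helper names (`affine`, `autoencoder_forward`, `reconstruction_error`) and the squared-error measure are chosen for the example and are not taken from the paper.

```python
import math
import random


def sigmoid(v):
    """Sigmoid activation, the usual choice for S in the text."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, bias, inputs):
    """Weighted sum plus bias for every unit of a layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def autoencoder_forward(x, W1, b1, W2, b2):
    """Encode x into hidden features, then decode them back to input size."""
    hidden = [sigmoid(z) for z in affine(W1, b1, x)]
    output = [sigmoid(z) for z in affine(W2, b2, hidden)]
    return hidden, output


def reconstruction_error(x, output):
    """Half squared error between input and reconstruction (assumed measure)."""
    return 0.5 * sum((o - xi) ** 2 for o, xi in zip(output, x))


def init_matrix(rows, cols, rng):
    """Small random weights; the range is an arbitrary illustrative choice."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_in, n_hidden = 4, 2              # 4 inputs compressed to 2 hidden features
W1, b1 = init_matrix(n_hidden, n_in, rng), [0.0] * n_hidden
W2, b2 = init_matrix(n_in, n_hidden, rng), [0.0] * n_in

x = [0.9, 0.1, 0.8, 0.2]
hidden, output = autoencoder_forward(x, W1, b1, W2, b2)
print(len(hidden), len(output), reconstruction_error(x, output))
```

With fewer hidden units than inputs, as here, the hidden vector is a reduced-dimension representation; choosing more hidden units than inputs would give an expanded one. Training would then adjust the weights to lower the reconstruction error.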
The recurrent neural network (RNN) is an improved version of the traditional neural network and is used to analyze time series data. The traditional artificial neural network generally refers to the feedforward neural network. In a feedforward neural network, information starts at the input layer, each neuron receives the output of the previous stage and passes its own output to the next stage, and the result finally reaches the output layer. There is no information feedback anywhere in the network. Because of this structural limitation, the network has no temporal characteristics: it can only process each input sample separately, and there is no correlation between inputs. Therefore, it is often not effective on time series problems. Many tasks are time-related, so we need to mine their temporal feature information. Building on the traditional artificial neural network, the RNN introduces a directed loop that connects the hidden nodes into a cycle. This loop structure helps the network carry information across time steps and exhibit dynamic temporal behavior, so that it can mine the characteristics of a time series and finally make a classification or prediction.
The structures of the traditional artificial neural network and the RNN are shown in Fig. 2. The main difference between them is that the RNN has an additional information transmission channel in the hidden layer, which gives the network a temporal characteristic. At the current time step, the network receives not only the information of the current sample but also the information of the previous time step, so that it can handle time series problems well.

Fig. 2. Comparison of the structure of traditional neural network (left) and RNN (right).
The RNN in Fig. 2 can be expanded into Fig. 3. The forward propagation formula of the RNN can be derived from Fig. 3, as shown in formulas (4)–(7). Among them, x is the network input, y is the network output, w is the weight of the network to connect each neuron, z is the weighted sum of neurons, f is the activation function, s is the neuron activation value, t represents the current time of the network, I is the input vector dimension, i is the input vector subscript, H is the hidden layer unit dimension, h′ is the hidden layer unit vector index at time t - 1, g is the hidden layer unit vector index at time m, and o is the output vector index. From the structure diagram and the forward propagation formula, it can be derived that the value s_t of the RNN hidden layer depends not only on the current input x_t, but also on the value s_{t-1} of the previous hidden layer. The current output of the RNN is affected by inputs such as x_t, x_{t-1}, x_{t-2}, ⋯. Therefore, the network can update the weights through training, extract the useful information from the entire time series, and obtain a good classification or prediction effect.

Fig. 3. Expanded view of RNN structure.
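The dependence of the hidden value on both the current input and the previous hidden value can be illustrated with a small forward pass. The sketch below is an assumption-laden illustration rather than the paper's formulas: it uses tanh as the hidden activation, a linear output layer, no bias terms, and hypothetical names (`rnn_forward`, `W_in`, `W_rec`, `W_out`).

```python
import math


def rnn_forward(inputs, W_in, W_rec, W_out, s0):
    """Run a simple recurrent layer over a sequence.

    inputs : list of input vectors, one per time step
    W_in   : hidden-by-input weights, W_rec : hidden-by-hidden weights
    W_out  : output-by-hidden weights, s0 : initial hidden state
    Returns the hidden states and the outputs for every time step.
    """
    states, outputs = [], []
    s_prev = s0
    for x_t in inputs:
        s_t = []
        for h in range(len(W_rec)):
            # Weighted sum of the current input and the previous hidden state.
            z = sum(W_in[h][i] * x_t[i] for i in range(len(x_t)))
            z += sum(W_rec[h][hp] * s_prev[hp] for hp in range(len(s_prev)))
            s_t.append(math.tanh(z))
        y_t = [sum(W_out[o][h] * s_t[h] for h in range(len(s_t)))
               for o in range(len(W_out))]
        states.append(s_t)
        outputs.append(y_t)
        s_prev = s_t
    return states, outputs


# One input unit, one hidden unit, one output unit.
W_in, W_rec, W_out = [[1.0]], [[0.5]], [[1.0]]
states, outputs = rnn_forward([[1.0], [0.0]], W_in, W_rec, W_out, s0=[0.0])
print(states, outputs)
```

In this example the second input is zero, yet the second hidden state is not, because it still carries information from the first time step through the recurrent weight.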
For the least squares problem, the gradient descent method is the most classic optimization method. In this method, the minimum of the function is approached iteratively, step by step, along the direction of gradient descent, which yields the minimum of the loss function and the corresponding model parameters. The gradient descent method can be applied when the loss function is explicit or the error is available. For example, for a multi-layer feedforward neural network, the parameters of the output layer can be updated using gradient descent and the error of the output layer. However, no error is directly available for the hidden layers, so the gradient descent method cannot be used on them directly. In this study, the chain rule is used to propagate the error to the hidden layers in the reverse direction of the network, and then the gradient descent method is used to update the parameters. This training method is called the back propagation (BP) algorithm. A multi-layer feedforward neural network trained with the error back propagation algorithm is currently the most widely used network and is also known as a BP neural network. At present, the algorithms used to train the various deep neural networks are basically BP algorithms. RNNs are trained with Back Propagation Through Time (BPTT). The idea of this method is similar to the BP algorithm. Since the RNN is a neural network model based on time series, the idea of the BPTT algorithm is to back-propagate the accumulated residuals starting from the last time step. This section introduces the back propagation algorithm and the back propagation through time algorithm.
We assume that the training sample set is {(x_1, y_1), ⋯, (x_m, y_m)}, which means that there are m samples in total, and the complex nonlinear model of the neural network is represented by h_{W,b}(x). Among them, W, b respectively represent the weight and bias terms of the neural network, and the loss function of the network is expressed by the mean square error, as shown in formula (9).
According to the gradient descent method, the parameter W, b is updated according to equations (10) and (11) in each iteration. η represents the learning rate of the network, and
Here, the chain rule for differentiating composite functions is used, as shown in Equations (14) and (15). Among them, z is the weighted sum and a is the activation value.
We set
For other layers, namely l < n_l, using the chain rule, the following results can be obtained:
Therefore, when l < n_l,
When l < n_l
Therefore, according to formulas (12) and (13), the recursive formula of the BP algorithm is obtained, and the parameters are updated through repeated iterations during the neural network training process.
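The recursion described above (compute the output-layer residual, propagate it backwards with the chain rule, then take a gradient descent step) can be sketched for a network with one hidden layer. This is a minimal illustration under assumptions that are not from the paper: a single training sample, sigmoid activations, a half squared-error loss, and hypothetical function names and starting weights.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, W1, b1, W2, b2):
    """Forward pass; returns hidden and output activations."""
    a1 = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a2 = [sigmoid(sum(w * a for w, a in zip(row, a1)) + b)
          for row, b in zip(W2, b2)]
    return a1, a2


def loss(x, y, W1, b1, W2, b2):
    """Half squared error for one sample (assumed form of the loss)."""
    _, a2 = forward(x, W1, b1, W2, b2)
    return 0.5 * sum((a - t) ** 2 for a, t in zip(a2, y))


def backprop(x, y, W1, b1, W2, b2):
    """Gradients of the loss with respect to W1, b1, W2, b2."""
    a1, a2 = forward(x, W1, b1, W2, b2)
    # Output-layer residual: (a - y) * f'(z), with f'(z) = a * (1 - a).
    delta2 = [(a - t) * a * (1.0 - a) for a, t in zip(a2, y)]
    # Hidden-layer residual: residuals of the next layer propagated
    # backwards through W2, times the local derivative.
    delta1 = [sum(W2[k][j] * delta2[k] for k in range(len(delta2)))
              * a1[j] * (1.0 - a1[j]) for j in range(len(a1))]
    grad_W2 = [[d * a for a in a1] for d in delta2]
    grad_W1 = [[d * xi for xi in x] for d in delta1]
    return grad_W1, delta1, grad_W2, delta2


def sgd_step(x, y, W1, b1, W2, b2, eta):
    """One gradient descent update with learning rate eta."""
    gW1, gb1, gW2, gb2 = backprop(x, y, W1, b1, W2, b2)
    W1 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W1, gW1)]
    W2 = [[w - eta * g for w, g in zip(wr, gr)] for wr, gr in zip(W2, gW2)]
    b1 = [b - eta * g for b, g in zip(b1, gb1)]
    b2 = [b - eta * g for b, g in zip(b2, gb2)]
    return W1, b1, W2, b2


x, y = [0.5, -0.2], [1.0]
W1, b1 = [[0.1, 0.2], [-0.3, 0.4]], [0.0, 0.0]
W2, b2 = [[0.5, -0.5]], [0.0]

initial_loss = loss(x, y, W1, b1, W2, b2)
params = (W1, b1, W2, b2)
for _ in range(200):
    params = sgd_step(x, y, *params, eta=0.5)
final_loss = loss(x, y, *params)
print(initial_loss, final_loss)
```

A common sanity check for such a derivation is to compare an analytic gradient entry with a finite-difference estimate of the same entry; the two should agree closely if the residual recursion is implemented correctly.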
The BPTT algorithm is similar to the BP algorithm and is a time-based error back propagation algorithm. Based on formulas (4)–(7), we use vectors to represent simplified formulas and add bias terms to obtain a simplified RNN forward propagation model as shown in formulas (22)–(25). Among them, f_h is the activation function of the hidden layer, which is generally tanh, and f_o is the activation function of the output layer, which is generally softmax. Since the parameters of the RNN are shared at every moment, the parameters that need to be updated are W, U, V, b, and c.
According to the gradient descent method, the iterative update formula for W is shown in (26)-(27). Similarly, the update formula for the parameters U, V, b, and c can be obtained.
We set the loss function of each sample of RNN as J(W, U, V, b, c, x_k, y_k). Because RNN is a propagation process with time sequence, the network has losses at every moment. If it is assumed that the sequence has a total of T moments, then
In this paper, the training steps of the acoustic model of English speech emotion recognition system based on transfer learning are as follows:
M randomly initialized hidden layers and a softmax layer are added to the first n layers of the migrated pre-trained model. According to the obtained English-aligned label data and the 2-state FSA topology, two different training strategies are used to train the migrated chain TDNN model. The English feature vector is used as input. The acoustic model structure of the English speech emotion recognition system based on transfer learning is shown in Fig. 4:

Fig. 4. Model structure diagram.
The left is the pre-trained model, and the right is the acoustic model based on weight migration. The model includes both migrated layers and randomly initialized hidden layers. Therefore, depending on whether the migrated layers and the newly added layers are trained simultaneously, the acoustic model based on weight migration has two different training strategies: single-stage training and two-stage training, the latter of which includes fine-tuning of the network. The single-stage training strategy trains the model’s migrated layers and the newly added randomly initialized layers at the same time, using different learning rates. Generally, the migrated layers use a smaller learning rate during training, and the newly added layers require a larger learning rate. Two-stage training requires two complete training stages. In the first stage, the weights of the migrated layers are fixed, and the newly added randomly initialized hidden layers are trained with a larger learning rate. The second stage uses a smaller learning rate to fine-tune the entire network.
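The two strategies differ only in which layers are updated at each stage and with what learning rate, which can be expressed as per-layer learning-rate plans. The sketch below is purely illustrative: the layer names, the learning-rate values and the convention that a rate of 0.0 means "frozen" are assumptions made for the example, not settings reported in the paper.

```python
def single_stage_rates(layers, lr_transferred, lr_new):
    """One stage: all layers train together, with a smaller rate for
    transferred layers and a larger rate for newly added layers."""
    return {name: (lr_transferred if transferred else lr_new)
            for name, transferred in layers}


def two_stage_rates(layers, lr_new, lr_finetune):
    """Two stages: first freeze transferred layers (rate 0.0) and train only
    the new layers, then fine-tune the whole network with a small rate."""
    stage1 = {name: (0.0 if transferred else lr_new)
              for name, transferred in layers}
    stage2 = {name: lr_finetune for name, _ in layers}
    return [stage1, stage2]


# Hypothetical layer list: (name, was the layer transferred?)
layers = [
    ("tdnn1", True),
    ("tdnn2", True),
    ("new_hidden1", False),
    ("softmax", False),
]

single = single_stage_rates(layers, lr_transferred=0.0001, lr_new=0.001)
stage1, stage2 = two_stage_rates(layers, lr_new=0.001, lr_finetune=0.0001)
print(single)
print(stage1)
print(stage2)
```

In a real toolkit these plans would be passed to the optimizer as per-layer learning-rate factors; how that is done depends on the framework in use.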
Because the original speech signal contains a lot of noise, the noise will seriously affect the classification effect. In this paper, a wavelet transform with good directivity and translation invariance is used to denoise the signal, which can effectively smooth the noise and retain the EEG signal characteristics. While denoising, this study uses the “db5” wavelet basis to perform a 6-level decomposition that decomposes the original EEG signal into α, β, and γ bands, and then uses the sparse group lasso-Granger algorithm to extract features from each band. Fig. 5 compares the waveform of the original EEG signal of one channel of subject 1 before and after denoising. It can be seen that the denoised signal retains the characteristics of the original signal well.

Fig. 5. Comparison of signal graphs before and after denoising.
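The general pattern of wavelet denoising (decompose into levels, shrink the detail coefficients, reconstruct) can be sketched as follows. Note the substitutions: this example uses the Haar wavelet rather than the “db5” basis named in the text, because Haar needs no filter tables, and it uses soft thresholding with an arbitrary threshold, which the text does not specify. The function names and the test signal are also assumptions made for the illustration.

```python
import math
import random

SQRT2 = math.sqrt(2.0)


def haar_step(signal):
    """One Haar analysis level: pairwise averages and differences."""
    approx = [(signal[i] + signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / SQRT2
              for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse_step(approx, detail):
    """Invert one Haar level."""
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / SQRT2)
        out.append((a - d) / SQRT2)
    return out


def soft_threshold(coeffs, thr):
    """Shrink coefficients towards zero by thr (assumed shrinkage rule)."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]


def denoise(signal, levels, thr):
    """Multi-level Haar decomposition, detail shrinkage, reconstruction."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(soft_threshold(detail, thr))
    for detail in reversed(details):
        approx = haar_inverse_step(approx, detail)
    return approx


rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * n / 64.0) for n in range(64)]
noisy = [c + rng.gauss(0.0, 0.2) for c in clean]
denoised = denoise(noisy, levels=6, thr=0.3)   # 6 levels, as in the text
print(denoised[:4])
```

With a threshold of zero the procedure reconstructs the input exactly, so the threshold alone controls how much fine-scale content is removed. A library such as PyWavelets would be the usual way to apply the db5 basis itself.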
The database of the English speech emotion recognition system mainly includes two parts: an English speech emotion recognition database and an English speech recognition database. The database is divided into a speech library and a text library. The speech library consists of the audio files of the training set and the test set, which do not overlap. The text library includes the labeled corpus corresponding to the speech library, the training corpus of the language model and the pronunciation dictionary.
An English speech signal contains a lot of information. Feature extraction removes useless interference and extracts the information that is meaningful for speech recognition as input to the acoustic model. Speech recognition places the following requirements on feature extraction: high resolution in the time and frequency domains, clear discrimination between different phonemes, and robustness to noise and to differences between speakers.
The main point of speech production is that the sound we emit is filtered by the shape of the vocal tract, and different shapes produce different spectral envelopes, namely different phonemes. The essence of MFCC is to represent this envelope accurately. The extraction process is as follows: (1) Analog-to-digital conversion: the English analog signal is sampled and converted into a digital signal that the computer can process. (2) Pre-emphasis: the high-frequency components of the English digital signal are boosted to improve the signal-to-noise ratio. (3) Framing and windowing: a sliding fixed-length Hamming window is used to window and frame the pre-emphasized English digital signal to obtain short-term stationary speech segments. To make the transition from frame to frame smooth, the frame length is set to 25 ms and the frame shift is set to 10 ms. (4) Fast Fourier Transform: the power spectrum of the signal is obtained from the spectrum of each windowed frame of the English digital signal. (5) Mel filter: the power spectrum is passed through a Mel filter bank and the logarithm is taken. (6) Cepstral coefficients: the discrete cosine transform is used to obtain the Mel frequency cepstral coefficients of the English digital signal, which form the MFCC feature vector of a frame of speech.
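Steps (2) and (3) of the extraction process, together with the Mel scale used in step (5), can be sketched as follows. The 25 ms frame length and 10 ms frame shift come from the text; the 16 kHz sampling rate, the pre-emphasis coefficient of 0.97, the common 2595·log10(1 + f/700) Mel formula and all function names are assumptions made for the example.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """Boost high frequencies: y[n] = x[n] - coeff * x[n-1] (coeff assumed)."""
    return signal[:1] + [signal[n] - coeff * signal[n - 1]
                         for n in range(1, len(signal))]


def hamming(length):
    """Hamming window of the given length."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Cut the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, shift):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


def hz_to_mel(freq_hz):
    """A commonly used Hz-to-Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


sample_rate = 16000                      # assumed sampling rate
signal = [math.sin(2.0 * math.pi * 440.0 * n / sample_rate)
          for n in range(sample_rate)]   # one second of a 440 Hz tone
frames = frame_signal(pre_emphasis(signal), sample_rate)
print(len(frames), len(frames[0]), hz_to_mel(1000.0))
```

At 16 kHz a 25 ms frame holds 400 samples and a 10 ms shift is 160 samples, so consecutive frames overlap by 240 samples, which is what smooths the transition between frames. The remaining steps (FFT, Mel filter bank, logarithm, DCT) would operate on each of these windowed frames.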
Before the emergence of deep neural network acoustic models, acoustic models based on GMM-HMM dominated the field of speech recognition. The HMM model contains a double random process, which expresses the relationship between the speech frames to be recognized through the transition between states and accurately simulates the timing. The acoustic model of English emotion recognition based on GMM-HMM is shown in Fig. 6.

Fig. 6. Acoustic model of English emotion recognition.
There are some problems to be solved in current speech emotion recognition. First, deep learning methods have become more and more popular due to their powerful feature learning capabilities, and, generally speaking, increasing the network depth improves performance on a machine learning task. However, the limited size of emotion data sets greatly limits the complexity of the deep learning model. At present, most speech emotion recognition research uses network models with only a few layers, and the data sets need to be specially designed. At the same time, models that are making progress in other fields, such as ResNet, cannot be used effectively. Second, current research mainly focuses on methods that work across emotion data sets. However, because emotion labels are difficult to obtain, this approach still has its limitations. In contrast, labeled data is abundant in other speech fields, a typical example being speaker recognition. Because speaker labels are easier to collect than emotion labels, labeled speaker data is plentiful. For example, the commonly used speaker data set VoxCeleb2 is much larger than current emotion data sets. If speaker data can be used as auxiliary information for emotion recognition, the problem of insufficient emotion data can be greatly alleviated. This paper uses the speaker labels in the VoxCeleb2 data set and deep learning models to learn the parts shared by speaker and emotion information. On this basis, the model is tested, and the performance of the English speech emotion recognition model based on a CNN is compared with that of the model proposed in this paper. The model in this paper is named TEOTL.
In the experiment, recognition accuracy and recognition speed were compared on 50 sets of speech emotion data. The recognition accuracy results are shown in Table 1, and the recognition speed results are shown in Table 2. The corresponding statistical graphs are shown in Figs. 7 and 8, respectively.
Table 1. Statistical table of model recognition accuracy
Table 2. Statistical table of model recognition speed

Fig. 7. Statistical diagram of model recognition accuracy.

Fig. 8. Statistical diagram of model recognition speed.
According to the recognition accuracy results in Fig. 7, the accuracy curve of the model constructed in this paper lies above the accuracy curve of the CNN model, and the gap is close to 20%. The model proposed in this paper reaches an accuracy close to 90%, which is much higher than that of the traditional recognition model. This shows that the model proposed in this paper has a certain practical basis and can be applied in practice in the future.
According to the recognition speed results in Fig. 8, the recognition time curve of the model constructed in this paper lies below the recognition time curve of the CNN model, and the gap is close to 100 ms. The model proposed in this paper therefore recognizes speech emotion considerably faster than the traditional recognition model, which again supports its practical applicability.
As can be seen from Figs. 7 and 8, the model proposed in this paper has certain advantages over the traditional model. Moreover, transfer learning has certain advantages over other algorithms in English speech emotion recognition. In the follow-up teaching and real-time translation equipment research, we can consider applying transfer learning to the model.
Conclusion
Economic and cultural exchanges between China and the world are becoming closer and closer, so the study of English speech emotion recognition systems has very important significance and value. The focus of this paper is to model the acoustics of the English speech emotion recognition system and to improve the performance of the acoustic model through transfer learning. M randomly initialized hidden layers and a softmax layer are added to the first n layers of the migrated pre-trained model. According to the obtained English-aligned label data and the 2-state FSA topology, two different training strategies are used to train the migrated chain TDNN model. The database of the English speech emotion recognition system constructed in this paper mainly includes two parts: an English speech emotion recognition database and an English speech recognition database. The database is divided into a speech library and a text library. The speech library consists of the audio files of the training set and the test set, which do not overlap, and the text library includes the labeled corpus corresponding to the speech library, the training corpus of the language model and the pronunciation dictionary. The model proposed in this paper has certain advantages over traditional models, and transfer learning has certain advantages over other algorithms in English speech emotion recognition. In subsequent research on teaching and real-time translation equipment, we may consider applying transfer learning to the English model.
