Speech emotion recognition of Chinese elderly people

Abstract

Speech emotion recognition has been studied extensively in recent years. However, little research has focused on the emotions of the elderly, especially the lonely elderly. In this paper, we propose a six-layer Wavelet Packet Coefficients (WPC) model for speech emotion recognition of the Chinese elderly. Six-layer Wavelet Packet Coefficients, Mel Frequency Cepstrum Coefficient (MFCC) and Fourier Parameter (FP) features are each extracted from a speech emotion database of the Chinese elderly. Experimental results show that the six-layer wavelet packet coefficient features are effective for recognizing emotions from speech. In particular, combining these three features improves the recognition rates for the elderly.

Keywords

Six Layer Wavelet Packet Coefficients; Mel Frequency Cepstrum Coefficient; Fourier Parameter; elderly

1. Introduction

Social computing has developed rapidly in recent years as a new interdisciplinary field spanning social science, management science and computational science [7,15,24]. The percentage of the elderly using social media has increased substantially in recent years [9], yet little research has been done to understand the emotions expressed through social media. The elderly, especially the lonely elderly, need more attention, so recognizing their emotions is a significant research problem. Speech emotion recognition is an important task for understanding speech in social media [16]. However, recognizing emotion from speech, especially for the elderly, is a very challenging problem [31]. Although interest in speech emotion recognition has grown over the last decade [22,25,27,29,31], most research focuses on children or the young. Researchers have established corpora such as children's emotional corpora [26] and young people's emotional corpora [6] and have studied these emotions. However, little research has focused on emotional corpora of the elderly and emotion recognition of the elderly, especially of the Chinese elderly. Therefore, in this paper, we study and analyze emotion recognition of Chinese elderly people in depth. The contributions of this paper are as follows:

We established our own speech emotion corpus of the Chinese elderly, which expands the research on emotion in the Chinese elderly.

We built a six-layer wavelet packet coefficient model for emotional speech signals.

The six-layer Wavelet Packet Coefficients (WPC) [30], the Mel Frequency Cepstrum Coefficient (MFCC) [22], and the Fourier Parameter (FP) features [29] of the Chinese elderly were each extracted for speech emotion recognition.

The rest of this paper is organized as follows. Section 2 briefly introduces related work. Section 3 details the three major types of feature extraction (WPC, MFCC and FP) and the multiple feature fusion for speech emotion of the Chinese elderly. Section 4 discusses the experimental results. Finally, conclusions are drawn in Section 5.

2. Related work

In speech recognition research, feature extraction and selection strongly affect recognition performance. If the extracted features are not representative, good recognition results cannot be obtained, whereas effective features achieve better results.

Many studies have examined how to extract feature parameters from emotional speech, mainly considering time, amplitude, frequency and formant structure. In 1972, Withams [14] found that emotion has a great influence on the pitch contour. In 1981, Williams and Stevens [32] characterized different emotional states by analyzing the mechanism of speech production, the physiological role of the nervous system and the corresponding physiological response. In 1996, Dellaert et al. [5] proposed a classification method based on pitch-frequency-related information. In 2003, Kwon et al. [13] used SVM and HMM to recognize speech emotion and found that pitch frequency and energy are effective parameters; the classification accuracy for four kinds of emotion was 70%. In 2006, Dimitri et al. [28] found that the regional characteristics of the vocal tract have a definite influence on emotion and achieved good results. In 2010, Guven et al. [8] used SVM as a classifier to recognize emotion on a German corpus, and their experimental results based on prosodic features were better than those of previous studies. In [10], both acoustic and lexical features were extracted for emotion recognition, and the experimental results showed that late fusion of acoustic and lexical features achieves a recognition accuracy of 69.2% for four emotions. Despite these contributions, most existing speech emotion recognition work focuses on children and young people and rarely involves the elderly.

In this paper, we propose a six-layer wavelet packet method to extract features from the speech of the Chinese elderly. MFCC and FP features are also extracted. In particular, we propose a combination of these features for speech emotion recognition. Applying the proposed methods to extract multiple features from the speech emotion database of the Chinese elderly, we verify that the recognition results of combined features are better than those of single features.

3. Multiple feature parameter extraction

3.1. Experimental database

In this paper, we use a new speech emotion database of Chinese elderly people (SECEPDB) [31]. The speech is sampled at 16 kHz with 16-bit quantization. The database contains a total of 480 emotional speech signals, produced by 11 actors in different emotional states. It covers seven types of emotion (angry, anxiety, boredom, disgust, happy, neutral and sad). The emotional distribution of the database is shown in Fig. 1.

Fig. 1.

Distribution of emotion in SECEPDB.

3.2. Pre-processing

Before feature extraction, the emotional speech is pre-processed. The transfer function of the pre-emphasis filter is $H (z) = 1 - μ z^{- 1}$, where μ is the pre-emphasis coefficient and $μ = 0.9375$. Each frame contains 256 sampling points, the frame shift is 128 sampling points, and a Hamming window is applied to each frame.
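
The pre-processing chain described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the Hamming window uses the common 0.54/0.46 definition, and dropping trailing samples that do not fill a whole frame is an assumption, since the text does not say how the last frame is handled.

```python
import math


def pre_emphasize(signal, mu=0.9375):
    """Apply y[n] = x[n] - mu * x[n-1], i.e. H(z) = 1 - mu * z^-1."""
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - mu * signal[n - 1])
    return out


def hamming(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, frame_len=256, hop=128):
    """Cut the signal into overlapping frames and window each frame.

    Assumption: samples after the last complete frame are discarded.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames


# Toy input: 1024 samples of a 440 Hz tone at the 16 kHz sampling rate.
sample_rate = 16000
toy = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(1024)]
frames = frame_signal(pre_emphasize(toy))
print(len(frames), len(frames[0]))
```

With a frame length of 256 and a shift of 128, adjacent frames overlap by half, so a 1024-sample signal yields seven complete frames.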

3.3. Emotional feature extraction

WPC, MFCC and FP features are extracted for speech emotion recognition of Chinese elderly people. The three types of emotional feature extraction are described below.

The wavelet transform [3] is a modern spectral analysis tool that can capture both the local frequency characteristics and the local time-domain features of a signal, so it is well suited to processing even non-stationary signals.

The orthogonal wavelet transform further decomposes only the low-frequency part of the signal, while the high-frequency part is left undecomposed. The wavelet packet transform, however, also provides a finer decomposition of the high-frequency part, and this decomposition has neither redundancy nor omission. The wavelet packet transform therefore gives a better time-frequency analysis of signals that contain a large amount of middle- and high-frequency information.

The subspace $μ_{j}^{n}$ is defined as the closure of the space spanned by the function $μ_{n} (t)$, and $μ_{j}^{2 n}$ as the closure of the space spanned by the function $μ_{2 n} (t)$. These functions satisfy Eq. (1). $\begin{matrix} (1) & \begin{array}{l} μ_{2 n} (t) = \sqrt{2} \sum_{k \in \mathbb{Z}} h (k) μ_{n} (2 t - k), \\ μ_{2 n + 1} (t) = \sqrt{2} \sum_{k \in \mathbb{Z}} g (k) μ_{n} (2 t - k) . \end{array} \end{matrix}$

Here, $g (k) = {(- 1)}^{k} h (1 - k)$. When $n = 0$, the functions are given directly by Eq. (2). $\begin{matrix} (2) & \begin{array}{l} μ_{0} (t) = \sum_{k \in \mathbb{Z}} h (k) μ_{0} (2 t - k), \\ μ_{1} (t) = \sum_{k \in \mathbb{Z}} g (k) μ_{0} (2 t - k) . \end{array} \end{matrix}$

In multiresolution analysis, the scaling function $φ (t)$ and the wavelet basis function $ψ (t)$ satisfy the two-scale equations in Eq. (3). $\begin{matrix} (3) & \begin{array}{l} φ (t) = \sum_{k \in \mathbb{Z}} h (k) φ (2 t - k), \\ ψ (t) = \sum_{k \in \mathbb{Z}} g (k) φ (2 t - k) . \end{array} \end{matrix}$

The wavelet packet is defined by taking the orthogonal scaling function as $μ_{0} (t) = φ (t)$. The wavelet packet decomposition algorithm is given by Eq. (4). $\begin{matrix} (4) & \begin{array}{l} d_{n}^{k, 2 l} = \sum_{m} d_{m}^{k + 1, l} h_{m - 2 n}, \\ d_{n}^{k, 2 l + 1} = \sum_{m} d_{m}^{k + 1, l} g_{m - 2 n} . \end{array} \end{matrix}$

The wavelet packet reconstruction algorithm is given by Eq. (5). $\begin{matrix} (5) & d_{n}^{k + 1, l} = \sum_{m} d_{m}^{k, 2 l} h_{m - 2 n} + \sum_{m} d_{m}^{k, 2 l + 1} g_{m - 2 n} . \end{matrix}$

We now study the relevance of the wavelet packet coefficient features to speech emotion recognition of Chinese elderly people. We use the Daubechies wavelet filter Db2 with 6 levels of decomposition, which yields 64 wavelet packet coefficients. Here, 6 statistical values of each wavelet packet coefficient were extracted. The WPC feature vector comprises the amplitude, the first-order difference and the second-order difference. Their mean, maximum, minimum, median and standard deviation lead to a 5760-dimensional WPC feature vector in total.
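
The decomposition step of Eq. (4) with the Db2 filter can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the frame is extended periodically at its borders, the high-pass filter is the index-shifted form $g_{k} = {(- 1)}^{k} h_{3 - k}$ of the relation $g (k) = {(- 1)}^{k} h (1 - k)$, and the output nodes are in natural (filter-bank) order rather than sorted by frequency.

```python
import math

SQRT3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Db2 low-pass filter h (four taps, orthonormal).
H = [(1 + SQRT3) / NORM, (3 + SQRT3) / NORM, (3 - SQRT3) / NORM, (1 - SQRT3) / NORM]
# Matching high-pass filter g, g[k] = (-1)^k * h[3 - k].
G = [(-1) ** k * H[len(H) - 1 - k] for k in range(len(H))]


def split(coeffs, filt):
    """One step of Eq. (4): out[n] = sum_m coeffs[m] * filt[m - 2n].

    Assumption: the input is extended periodically at its borders.
    """
    size = len(coeffs)
    return [sum(filt[k] * coeffs[(2 * n + k) % size] for k in range(len(filt)))
            for n in range(size // 2)]


def wavelet_packet(frame, levels=6):
    """Return the 2**levels terminal nodes of a full wavelet packet tree."""
    if len(frame) % (2 ** levels) != 0:
        raise ValueError("frame length must be a multiple of 2**levels")
    nodes = [list(frame)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            next_nodes.append(split(node, H))  # low-frequency child (2l)
            next_nodes.append(split(node, G))  # high-frequency child (2l + 1)
        nodes = next_nodes
    return nodes


# Toy 256-sample frame, matching the frame length used in pre-processing.
frame = [math.sin(0.05 * n) + 0.3 * math.cos(0.7 * n) for n in range(256)]
nodes = wavelet_packet(frame, levels=6)
print(len(nodes), len(nodes[0]))
```

For a 256-sample frame, six levels give 64 terminal nodes of four coefficients each. Because the filters are orthonormal, the total energy of the terminal coefficients equals that of the input frame, which is a convenient sanity check.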

Fig. 2.

Structure of speech emotion recognition system.

MFCC was first introduced and applied to speech recognition in [4], and it has also been widely used in speech emotion recognition [2,11,12,18]. The Mel frequency scale is designed according to the characteristics of human hearing, reflecting how the ear responds to different frequencies. In this study, MFCC features were extracted for comparison with the proposed WPC features. For emotion recognition, MFCC features usually include the mean, maximum, minimum, median and standard deviation. All speech signals were first filtered by a high-pass filter with a pre-emphasis coefficient of 0.97. Here, the first 12 coefficients were extracted. The MFCC feature vector comprises the amplitude, the first-order difference and the second-order difference. Their mean, maximum, minimum, median and standard deviation lead to a 180-dimensional MFCC feature vector in total.

We extracted a set of FP features from speech signals as detailed in [29]. Here, the first 120 harmonic coefficients were extracted. The FP feature vector comprises the amplitude (H), the first-order difference ( $Δ H$ ) and the second-order difference ( $Δ Δ H$ ). Their minimum, maximum, mean, median and standard deviation were also computed, giving a total of 1800 FP features for speech emotion recognition of Chinese elderly people.
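
The dimensionalities quoted for MFCC and FP follow from coefficients × 3 contours × 5 statistics: 12 × 3 × 5 = 180 and 120 × 3 × 5 = 1800. The sketch below illustrates that bookkeeping. The function names are hypothetical, and two details are assumptions because the text does not specify them: the differences are taken between adjacent frames, and the standard deviation is the population form. The 5760-dimensional WPC vector is numerically consistent with 64 × 6 × 3 × 5, but that reading is our interpretation rather than something the text states.

```python
import statistics


def diff(seq):
    """Adjacent-frame difference (assumed form of the first-order difference)."""
    return [b - a for a, b in zip(seq, seq[1:])]


def five_stats(seq):
    """Mean, maximum, minimum, median and (population) standard deviation."""
    return [statistics.mean(seq), max(seq), min(seq),
            statistics.median(seq), statistics.pstdev(seq)]


def contour_features(contour):
    """15 values for one coefficient tracked over the frames of an utterance.

    The contour needs at least three frames so the second difference exists.
    """
    d1 = diff(contour)
    d2 = diff(d1)
    return five_stats(contour) + five_stats(d1) + five_stats(d2)


def utterance_features(contours):
    """Concatenate the 15 statistics of every coefficient contour."""
    features = []
    for contour in contours:
        features.extend(contour_features(contour))
    return features


# Toy data: 12 coefficient contours over 10 frames, as for MFCC.
toy_mfcc = [[float(i + j) for j in range(10)] for i in range(12)]
print(len(utterance_features(toy_mfcc)))
```

Replacing the 12 contours with 120 gives the 1800-dimensional FP vector in the same way.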

4. Experiment results

4.1. Speech emotion recognition system structure

The structure of the speech emotion recognition system is shown in Fig. 2. The features extracted from the sample data are six-layer wavelet packet coefficients (WPC), Mel-Frequency Cepstral Coefficients (MFCC) and Fourier Parameters (FP). Principal Component Analysis (PCA) [19] is used for feature dimension reduction, and SVM is selected as the classifier. We first pre-process the voice signal and then extract the WPC, MFCC and FP speech features. All extracted feature sets are normalized. Because the feature dimension is high, we use PCA to reduce it, and finally use SVM to recognize emotions [17].
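
The normalization and fusion steps that precede PCA and SVM can be sketched as follows. The text specifies neither the normalization scheme nor the fusion scheme, so both choices here are assumptions made for illustration: each feature is z-score normalized over the utterances, and the feature sets are fused by concatenating the per-utterance vectors. The function names are hypothetical, and the PCA and SVM stages are left out.

```python
import statistics


def zscore_columns(rows):
    """Normalize every feature (column) to zero mean and unit variance.

    Assumption: z-score normalization; constant columns are mapped to 0.
    """
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]


def fuse(*feature_sets):
    """Feature-level fusion by concatenating per-utterance vectors.

    Assumption: every feature set lists the same utterances in the same order.
    """
    count = len(feature_sets[0])
    if any(len(fs) != count for fs in feature_sets):
        raise ValueError("all feature sets must cover the same utterances")
    fused = []
    for i in range(count):
        vector = []
        for fs in feature_sets:
            vector.extend(fs[i])
        fused.append(vector)
    return fused


# Toy data: two utterances with 2-dimensional "WPC" and 1-dimensional "MFCC".
toy_wpc = [[1.0, 2.0], [3.0, 6.0]]
toy_mfcc = [[10.0], [20.0]]
fused = fuse(zscore_columns(toy_wpc), zscore_columns(toy_mfcc))
print(fused)
```

Normalizing each set before concatenation keeps a high-dimensional set with large values from dominating the fused vector purely through scale.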

4.2. Support Vector Machine

Classification is a crucial step in speech emotion recognition. A variety of classification techniques have been used in this field, including the Hidden Markov Model (HMM) [1], the Gaussian Mixture Model (GMM) [20], Artificial Neural Networks (ANN) [23] and the Support Vector Machine (SVM) [17]. Speech emotion recognition is a pattern recognition problem. GMM and HMM methods require a large number of emotional speech samples to train the model of each emotion. The ANN method is limited in how far it can improve the robustness and accuracy of emotion recognition. SVM has good generalization ability because it handles small-sample, nonlinear and high-dimensional pattern recognition problems well [17]. Many studies [10,17,29] applied SVM as the classifier and achieved better performance. Therefore, we selected SVM as the classifier for speech emotion recognition.

The SVM maps the sample space into a high-dimensional or even infinite-dimensional (Hilbert) feature space through a mapping function, whose inner products are computed by a kernel function. In this way, a nonlinear problem can be solved in a higher-dimensional space. A commonly used kernel function is the Gaussian radial basis function (RBF) kernel. The key to solving the nonlinearly separable problem is to construct the optimal classification hyperplane, which amounts to finding the optimal weights and bias. Given the training samples $\{(X^{1}, d^{1}), (X^{2}, d^{2}), \dots, (X^{p}, d^{p}), \dots, (X^{P}, d^{P})\}$, the cost function of the weights W and the slack variables is minimized [5]: $\begin{matrix} (6) & Φ (W, ε) = \frac{1}{2} W^{T} W + C \sum_{p = 1}^{P} ε_{p} . \end{matrix}$

The constraints are $d^{p} (W^{T} X^{p} + b) ⩾ 1 - ε_{p}$, $p = 1, 2, \dots, P$, where C is the penalty factor. The slack variables $ε_{p} ⩾ 0$, $p = 1, 2, \dots, P$, introduced in Eq. (6), measure the deviation of a sample point from the ideal linearly separable condition. The calculation of the optimal weights and bias determined by the above sample set can be converted to its dual problem. When the optimal hyperplane is constructed in the feature space, only inner products in the feature space are used, as in Eq. (7). Equation (8) is the optimal classification decision function. $\begin{matrix} (7) & K (X, X^{p}) = Φ^{T} (X) Φ (X^{p}) = \sum_{j = 1}^{M} Φ_{j} (X) Φ_{j} (X^{p}), \quad p = 1, 2, \dots, P, \end{matrix}$ $\begin{matrix} (8) & f (X) = \mathrm{sgn} \left[ \sum_{p = 1}^{P} a_{0, p} d^{p} K (X^{p}, X) + b_{0} \right] . \end{matrix}$
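
For reference, the dual problem mentioned above can be written out explicitly. The following is the standard soft-margin SVM dual in the notation of this section; the Lagrange multipliers $a_{p}$ are symbols introduced here for illustration, and their optimal values are the coefficients that weight each kernel term in the decision function of Eq. (8).

```latex
\max_{a} \; Q(a) = \sum_{p=1}^{P} a_{p}
  - \frac{1}{2} \sum_{p=1}^{P} \sum_{q=1}^{P}
    a_{p} a_{q} \, d^{p} d^{q} \, K(X^{p}, X^{q})
\quad \text{subject to} \quad
\sum_{p=1}^{P} a_{p} d^{p} = 0,
\qquad 0 \le a_{p} \le C, \quad p = 1, 2, \dots, P.
```

Only the training samples with $a_{p} > 0$ (the support vectors) contribute to the decision function, and the penalty factor C enters the dual only as the upper bound on each multiplier.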

We ran the SVM classifier with ten rounds of cross validation [21] (the training set and validation set are randomly separated each time, but the ratio remains the same) and took the average values as the final speech emotion recognition results.
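
The evaluation protocol described above, ten random splits with a fixed ratio followed by averaging, can be sketched as follows. This is an illustration under assumptions: the split ratio is not given in the text, so the 0.8 default is a placeholder, and `train_and_score` is a hypothetical callback standing in for training the SVM and returning its validation accuracy.

```python
import random


def repeated_random_split(samples, labels, train_and_score,
                          rounds=10, train_ratio=0.8, seed=0):
    """Average the validation score over several random splits.

    Assumptions: train_ratio=0.8 is a placeholder (the text gives no ratio);
    train_and_score(x_train, y_train, x_val, y_val) is a hypothetical
    callback that trains a classifier and returns its validation accuracy.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    n_train = int(round(train_ratio * len(indices)))
    scores = []
    for _ in range(rounds):
        rng.shuffle(indices)  # new random partition, same ratio every round
        train_idx, val_idx = indices[:n_train], indices[n_train:]
        scores.append(train_and_score(
            [samples[i] for i in train_idx], [labels[i] for i in train_idx],
            [samples[i] for i in val_idx], [labels[i] for i in val_idx]))
    return sum(scores) / len(scores)


def majority_baseline(x_train, y_train, x_val, y_val):
    """Toy stand-in for the classifier: always predict the majority label."""
    majority = max(set(y_train), key=y_train.count)
    return sum(1 for y in y_val if y == majority) / len(y_val)


# Toy data with 480 items, the size of the database used in this paper.
toy_samples = list(range(480))
toy_labels = [i % 2 for i in range(480)]
print(repeated_random_split(toy_samples, toy_labels, majority_baseline))
```

Fixing the seed makes the ten partitions reproducible, which helps when the same splits are to be reused across feature sets.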

Table 1
Speech emotion recognition based on SVM

Extracted features	Average recognition accuracy
WPC	73.77%
FP	71.24%
MFCC	42.55%
WPC + FP	74.61%
WPC + MFCC	79.99%
FP + MFCC	73.78%
WPC + FP + MFCC	75.03%

4.3. Experimental results

In this paper, we study seven types of emotion (angry, anxiety, boredom, disgust, happy, neutral and sad) in the speech emotion database of Chinese elderly people (SECEPDB). We extract common features, namely six-layer Wavelet Packet Coefficients, Fourier Parameters and Mel Frequency Cepstrum Coefficients, and fuse them. The recognition results are shown in Table 1.

As shown in Table 1, the recognition rates of WPC, MFCC and FP were 73.77%, 42.55% and 71.24%, respectively. The recognition rate of WPC is higher than those of MFCC and FP, which shows that WPC features are more suitable for emotion recognition of elderly people; the corresponding confusion matrix is shown in Table 2. In addition, WPC + MFCC gives the best result, with a recognition rate of 79.99%. The recognition rate of WPC + MFCC is higher than those of WPC + FP + MFCC and WPC + FP. That is to say, using more features does not necessarily yield a higher emotion recognition rate.

Table 2, Table 3 and Table 4 show the confusion matrices of speech emotion recognition with WPC, MFCC and FP features, respectively. From the three confusion matrices, we find that the recognition rates of angry and neutral emotions are higher, and those of boredom and disgust emotions are lower. In addition, angry is easily confused with anxiety and boredom.

Table 2
Average confusion matrix of WPC (%)

	Angry	Anxiety	Boredom	Disgust	Happy	Neutral	Sad
Angry	65.85	1.63	0.00	0.81	30.89	0.81	0.00
Anxiety	2.91	72.82	0.00	0.00	24.27	0.00	0.00
Boredom	21.43	0.00	65.00	0.00	14.29	0.00	0.00
Disgust	12.90	0.00	0.00	41.94	41.94	3.23	0.00
Happy	0.00	0.00	0.00	0.00	98.67	1.33	0.00
Neutral	1.00	0.00	0.00	0.00	14.00	82.00	3.00
Sad	4.00	0.00	0.00	0.00	23.33	13.33	60.00

Table 3

Average confusion matrix of MFCC (%)

	Angry	Anxiety	Boredom	Disgust	Happy	Neutral	Sad
Angry	73.98	11.38	1.63	0.81	3.25	8.00	0.81
Anxiety	44.66	25.24	0.00	1.94	2.91	25.24	0.00
Boredom	64.29	0.00	0.00	0.00	7.14	28.57	0.00
Disgust	72.00	16.13	0.00	6.45	3.23	3.23	0.00
Happy	30.26	18.42	0.00	0.00	28.95	23.00	0.00
Neutral	26.00	7.00	0.00	0.00	9.00	58.00	0.00
Sad	43.33	20.00	0.00	6.67	0.00	16.67	13.33

Table 4

Average confusion matrix of FP (%)

	Angry	Anxiety	Boredom	Disgust	Happy	Neutral	Sad
Angry	81.11	1.63	0.00	0.00	10.00	6.00	0.00
Anxiety	12.62	66.02	0.97	0.97	15.53	3.88	0.00
Boredom	6.67	0.00	46.67	20.00	13.33	13.33	0.00
Disgust	9.68	0.00	3.23	41.94	41.94	3.23	0.00
Happy	9.33	0.00	0.00	0.00	82.67	8.00	0.00
Neutral	3.00	0.00	0.00	1.00	17.00	75.00	4.00
Sad	13.33	3.33	0.00	3.33	3.33	31.00	46.67

Figure 3 shows the recognition performance for the seven emotion types using WPC, MFCC and FP features on the Chinese elderly database. The recognition rates of anxiety, boredom, happy, neutral and sad are the best when using WPC features. With WPC features, the recognition rates of anxiety, happy and neutral emotions are relatively high, at 72.82%, 98.67% and 82%, respectively. In addition, the highest recognition rate for angry emotion is obtained with FP features, at 82.11%, while the recognition rate of angry emotion is 73.98% with MFCC features.

Fig. 3.

Speech emotion recognition results of different features on seven emotion types.

As shown in Fig. 4, FP + MFCC gives the best recognition performance for angry emotion, WPC + MFCC gives the best recognition rate for anxiety emotion, and WPC + FP + MFCC gives the best recognition rate for boredom emotion. The recognition rate of happy emotion is the highest, while that of disgust emotion is the lowest. The recognition rate of happy emotion exceeds 98% with each of WPC + FP, WPC + MFCC and WPC + FP + MFCC. The best recognition rates of neutral and sad emotions are achieved by the WPC + MFCC feature set. The combination of WPC + MFCC features gives the best average performance for recognizing emotion in the database of Chinese elderly people.

Fig. 4.

Speech emotion recognition results of different feature combination on seven emotion types.

The above is an emotional analysis for Chinese elderly people. Different emotional features lead to different emotion recognition results. In the emotional domain of the elderly, many directions remain to be explored, such as how the emotions of the elderly differ across genders, regions and races. Here, because of limited conditions, we have only explored the emotions of the elderly for different genders in the existing database. As shown in Fig. 5, emotion recognition for women is generally better than for men. On the one hand, this may be because the database contains more emotional utterances from older women; on the other hand, elderly women may express emotion more clearly in daily life. In addition, Fig. 5 shows that the recognition results of angry and happy emotions are better for elderly men, while those of anxiety and neutral emotions are better for elderly women. This accords with the characteristic that the emotions of men are stronger and those of women are more moderate.

Fig. 5.

Speech emotion recognition result of different genders based on the wavelet packet coefficients of six layers.

5. Conclusion

In this paper, we proposed six-layer Wavelet Packet Coefficient (WPC) features for speech emotion recognition of the Chinese elderly. Besides WPC, Mel-Frequency Cepstral Coefficients (MFCC) and Fourier Parameters (FP) were also extracted. Experimental results show that the proposed WPC features attain better performance for emotion recognition. Furthermore, the combination of these three features attains better performance than any single feature. In the future, we will extract more helpful emotion features for elderly people.

Acknowledgements

This work was supported by the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (No. 201700014), Anhui Provincial Natural Science Foundation (No. 1708085MF167), Fundamental Research Funds for the Key Research Program of Chongqing Science & Technology Commission (No. cstc2017rgzn-zdyf0064), the Chongqing Provincial Human Resource and Social Security Department (No. cx2017092), the Central Universities in China (No. CQU0225001104447) and the National Natural Science Foundation of China (No. 61672157). Any correspondence should be made to Kunxia Wang.

References

1. Blunsom, Hidden Markov models, Lecture Notes 15 (2004), 18–19.

2. S.E. Bou-Ghazale and J.H.L. Hansen, A comparative study of traditional and newly proposed features for recognition of speech under stress, IEEE Transactions on Speech and Audio Processing 8 (4) (2000), 429–442. doi:10.1109/89.848224.

3. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Transactions on Information Theory 36 (5) (1990), 961–1005. doi:10.1109/18.57199.

4. Davis and Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (4) (1980), 357–366. doi:10.1109/TASSP.1980.1163420.

5. Dellaert, Polzin and Waibel, Recognizing emotion in speech, in: Fourth International Conference on Spoken Language Processing (ICSLP 96). Proceedings, Vol. 3, IEEE, 1996, pp. 1970–1973. doi:10.1109/ICSLP.1996.608022.

6. N.C. Ebner, Riediger and Lindenberger, FACES: A database of facial expressions in young, middle-aged, and older women and men: Development and validation, Behavior Research Methods 42 (1) (2010), 351–362. doi:10.3758/BRM.42.1.351.

7. J.M. Epstein, Generative Social Science: Studies in Agent-Based Computational Modeling, Princeton University Press, 2006.

8. Guven and Bock, Speech emotion recognition using a backward context, in: 2010 IEEE 39th Applied Imagery Pattern Recognition Workshop (AIPR), IEEE, 2010, pp. 1–5.

9. C.J. Hutto, Bell, Farmer et al., Social media gerontology: Understanding social media usage among older adults, Web Intelligence 13 (1) (2015), 69–87.

10. Jin, Li, Chen et al., Speech emotion recognition with acoustic and lexical features, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2015, pp. 4749–4753.

11. Kamaruddin, Wahab and Quek, Cultural dependency analysis for understanding speech emotion, Expert Systems with Applications 39 (5) (2012), 5115–5133. doi:10.1016/j.eswa.2011.11.028.

12. K.V.K. Kishore and P.K. Satish, Emotion recognition in speech using MFCC and wavelet features, in: 2013 IEEE 3rd International Advance Computing Conference (IACC), IEEE, 2013, pp. 842–847. doi:10.1109/IAdCC.2013.6514336.

13. O.W. Kwon, Chan, Hao et al., Emotion Recognition by Speech Signals (INTERSPEECH), 2003.

14. C.M. Lee and S.S. Narayanan, Toward detecting emotions in spoken dialogs, IEEE Transactions on Speech and Audio Processing 13 (2) (2005), 293–303. doi:10.1109/TSA.2004.838534.

15. M.J. Liberatore and G.J. Titus, The practice of management science in R&D project management, Management Science 29 (8) (1983), 962–974. doi:10.1287/mnsc.29.8.962.

16. Lim, Jang and Lee, Speech emotion recognition using convolutional and recurrent neural networks, in: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), IEEE, 2016, pp. 1–4.

17. Y.L. Lin and Wei, Speech emotion recognition based on HMM and SVM, in: Proceedings of 2005 International Conference on Machine Learning and Cybernetics, Vol. 8, IEEE, 2005, pp. 4898–4901.

18. Lugger, M.E. Janoir and Yang, Combining classifiers with diverse feature sets for robust speaker independent emotion recognition, in: 2009 17th European Signal Processing Conference, IEEE, 2009, pp. 1225–1229.

19. Mackiewicz and Ratajczak, Principal components analysis (PCA), Computers and Geosciences 19 (1993), 303–342. doi:10.1016/0098-3004(93)90090-R.

20. C.E. Rasmussen, The infinite Gaussian mixture model, Advances in Neural Information Processing Systems 12 (2000), 554–560.

21. Refaeilzadeh and Tang, Cross-validation, in: Encyclopedia of Database Systems, Springer, 2009, pp. 532–538.

22. Sato and Obuchi, Emotion recognition using mel-frequency cepstral coefficients, Information and Media Technologies 2 (3) (2007), 835–848.

23. R.J. Schalkoff, Artificial Neural Networks, McGraw-Hill Higher Education, 1997.

24. Schuler, Social computing, Communications of the ACM 37 (1) (1994), 28–29. doi:10.1145/175222.175223.

25. Song, Transfer linear subspace learning for cross-corpus speech emotion recognition, IEEE Transactions on Affective Computing (2017).

26. Steidl, Automatic Classification of Emotion Related User States in Spontaneous Children’s Speech, University of Erlangen–Nuremberg, Erlangen, 2009.

27. Trigeorgis, Ringeval, Brueckner et al., Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network, in: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2016, pp. 5200–5204. doi:10.1109/ICASSP.2016.7472669.

28. Ververidis and Kotropoulos, Emotional speech recognition: Resources, features, and methods, Speech Communication 48 (9) (2006), 1162–1181. doi:10.1016/j.specom.2006.04.003.

29. Wang, An, B.N. Li et al., Speech emotion recognition using Fourier parameters, IEEE Transactions on Affective Computing 6 (1) (2015), 69–75. doi:10.1109/TAFFC.2015.2392101.

30. Wang, An and Li, Speech emotion recognition based on wavelet packet coefficient model, in: 2014 9th International Symposium on Chinese Spoken Language Processing (ISCSLP), IEEE, 2014, pp. 478–482.

31. Wang, Z.B. Zhu, Wang et al., A database for emotional interactions of the elderly, in: 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), IEEE, 2016, pp. 1–6.

32. C.E. Williams and K.N. Stevens, Emotions and speech: Some acoustical correlates, The Journal of the Acoustical Society of America 52 (4 (Part 2)) (1972), 1238–1250.
