Recognition of human emotion with spectral features using multi layer-perceptron

Abstract

For emotion recognition, the features extracted from speech samples of the Berlin emotional database are pitch, intensity, log energy, formants, and mel-frequency cepstral coefficients (MFCC) as base features, with power spectral density added as a function of frequency. Seven emotions are considered in this study: anger, neutral, happiness, boredom, disgust, fear and sadness. Temporal and spectral features are used to build the AER (Automatic Emotion Recognition) model. The extracted features are analyzed with a Support Vector Machine (SVM) and with a multilayer perceptron (MLP), a class of feed-forward ANN classifiers, to classify the different emotional states. We observed 91% accuracy for the Angry and Boredom emotional classes using SVM and more than 96% accuracy using ANN, with an overall accuracy of 87.17% using SVM and 94% using ANN.

Keywords

Multilayer perceptron, ANN, Support Vector Machine

1. Introduction

Automatic Emotion Recognition is a challenging area of human-computer interaction, since emotion plays a significant role in normal human interactions. In the recent past, emotion detection from speech signals has gained much attention, and it remains a challenging task for intelligent decision making. A successful solution to this difficult problem would facilitate a wide range of important applications. There have been many studies of emotional speech, but most of them are not in regional languages. Researchers have worked with a wide variety of databases, such as Emo-DB (a database of German emotional speech developed at the Department of Acoustic Technology of the Berlin Technical University) [2], IITKGP-SESC (a Telugu speech database built by IIT Kharagpur and developed at All India Radio (AIR), Vijayawada, India) [3], Surrey Audio-Visual Expressed Emotion (SAVEE) [4], and DES (the Danish emotional speech database developed by Aalborg University, Denmark) [5, 6]. Many researchers have proposed important speech features that carry emotion information, such as energy, pitch [7], formants, Linear Prediction Coefficients (LPC) [8], Linear Prediction Cepstrum Coefficients (LPCC) [9], and Mel-Frequency Cepstrum Coefficients (MFCC) [7] and their first derivatives. Furthermore, many researchers have explored several classification methods, such as Artificial Neural Networks (ANN) [5], Gaussian Mixture Models (GMM), Hidden Markov Models (HMM) [10, 6], the Maximum Likelihood Bayesian classifier (MLC), Kernel Regression, K-Nearest Neighbors (KNN) [11] and Support Vector Machines (SVM) [12]. The analysis of both prosody-related and spectral features is necessary for the evaluation of emotion recognition: 5–50 LPC coefficients have been studied as spectral features, whereas the mean value of pitch (F0), intensity, sound pressure, and Power Spectral Density (PSD) have been studied as prosody-related features.
The human ability to recognize emotion from speech was also studied and compared with machine classifiers. Initially, a listening test on sample sentences was conducted to identify the speaker's emotion based on auditory impressions, and a mean opinion score was collected. The speaker's emotion in the sample sentences was then identified with SVM and ANN [9] using pitch, and subsequently the PRAAT [13, 14, 15] software package was used to extract the pattern of acoustic parameters for the sample sentences. The paper is organized as follows. Section 2 reviews the speech database. Section 3 briefly describes feature extraction for the speech emotion recognition system. Section 4 describes the system architecture, and Section 5 gives a short review of the classifiers used for emotion recognition. Section 6 presents the results with discussion, and the last section concludes the paper.

2. Speech database

The database used in this paper is the Berlin Emotional Database, a well-known German corpus. It is an open-access resource widely used in the field of speech emotion recognition (SER). The database contains 535 speech files in seven emotional classes: Happy, Neutral, Angry, Sad, Fear, Boredom and Disgust, with 71, 79, 127, 62, 69, 81 and 46 utterances respectively. It was recorded by 10 professional actors, 5 male and 5 female. Table 2 lists the number of samples for each category [16]. Each short-term speech frame is 20 ms to 30 ms long, and to our understanding the corpus yields more than 1080 segments of 2 seconds each.

3. Feature extraction

Here, we categorize the emotion of each speech utterance in the Berlin database. Each short-term frame is 20 ms long (frames are generally 20 ms to 30 ms). We divide each utterance into 60 ms segments with 20 ms time shifts. Contemporary research favors long-time features for the analysis of emotions, since long-time features correlate with emotions better than short-time ones. In our experiments long-time features are therefore considered, because performance is degraded by short-time features. We consider the most basic features: pitch, energy and formants (F0, F1, F2, F3). Since the change in acoustic features is also related to emotional states, we include the energy difference as an additional feature [17]. The acoustic features employed [18, 19, 16] are as follows.
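The segmentation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_into_segments` and the 16 kHz sampling rate are hypothetical choices, not details taken from the paper.

```python
def split_into_segments(samples, sample_rate, segment_ms=60, shift_ms=20):
    """Split a 1-D sample sequence into overlapping fixed-length segments.

    segment_ms and shift_ms default to the 60 ms / 20 ms values in the text.
    """
    seg_len = int(sample_rate * segment_ms / 1000)   # samples per segment
    shift = int(sample_rate * shift_ms / 1000)       # samples between segment starts
    segments = []
    start = 0
    # Keep only complete segments; a trailing partial segment is dropped.
    while start + seg_len <= len(samples):
        segments.append(samples[start:start + seg_len])
        start += shift
    return segments


# Assumed example: one second of silence sampled at 16 kHz.
signal = [0.0] * 16000
segments = split_into_segments(signal, sample_rate=16000)
```

With these assumed values each segment holds 960 samples and consecutive segments overlap by two thirds of their length.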

3.1 Feature selection

The intuition behind using acoustic and prosodic features is to summarize the intentional variations observed in human speech. The acoustic features are:
1. Maximum and minimum counter ascent energy.
2. Mean and median values of energy.
3. Mean and median of the decline in energy values.
4. Maximum of pitch frequency.
5. Mean and median of pitch frequency.
6. Maximum duration of pitch in terms of frequency.
7. Mean and median of the first formant.
8. Rate of change in formants.
9. Speed in voice frames.

Algorithm 1: Emotion extraction using speech prosodic and spectral features.
Begin
For each file f in Q
  P = Preprocessing(f)
  V = VAD(P)
  S = Segmentation(V)
  For each segment s in S
    M = Mfcc(s)
    F0 = Pitch(s)
    I = Intensity(s)
    L = Ltas(s)
    Features = [F0, pitch mean, I, intensity average, L, meanMfcc, medianMfcc, stdMfcc, class label]
  End For
End For
TrainMLP(Features)
End
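The last step of the inner loop in Algorithm 1, where per-frame MFCC values are summarized and combined with the prosodic values, might look like the sketch below. The helper names `summarize_mfcc` and `build_feature_vector`, the ordering of the vector, and the toy numbers are all assumptions made for illustration; the paper does not specify them.

```python
import statistics


def summarize_mfcc(frames):
    """Column-wise mean, median and standard deviation of per-frame MFCC rows."""
    columns = list(zip(*frames))  # one tuple per coefficient
    return {
        "mean": [statistics.mean(c) for c in columns],
        "median": [statistics.median(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],  # population std (assumption)
    }


def build_feature_vector(pitch_mean, intensity_mean, ltas_mean, mfcc_frames, label):
    """Concatenate prosodic values, MFCC statistics and the class label."""
    stats = summarize_mfcc(mfcc_frames)
    return ([pitch_mean, intensity_mean, ltas_mean]
            + stats["mean"] + stats["median"] + stats["std"]
            + [label])


# Toy input: three frames with two coefficients each (made-up values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
vector = build_feature_vector(210.0, 65.0, 30.0, frames, "ANG")
```

In a real pipeline the pitch, intensity and LTAS values would come from a tool such as PRAAT; here they are passed in as plain numbers.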

Figure 1.

Speech emotion recognition system.

MFCC (Mel-frequency cepstral coefficients) [20, 21]: Mel-frequency analysis of speech is based on human perception experiments. The human ear acts as a bank of filters that are non-uniformly spaced on the frequency axis and concentrate only on certain frequency components. In pre-emphasis, high-frequency components are amplified to balance the spectrum and improve the SNR. The signal is then split into frames of 20 ms to 40 ms with a 1/3 overlap between frames, a Hamming window is applied to each frame, and the DCT is applied to extract the MFCC. MFCC are the most commonly used features because they capture linguistic content well while discarding background noise. The Mel spectrogram is an acoustic time-frequency representation of sound: the power spectral density $p(f,t)$, sampled at a number of points around equally spaced times $t_{i}$ and frequencies $f_{j}$. The Mel frequency scale is defined as

$\displaystyle\textit{mel}=2595\times\log_{10}\left(1+\frac{\textit{hertz}}{700}\right)$ (1)

Inverse

$\displaystyle\textit{hertz}=700\left(10^{\textit{mel}/2595}-1\right)$ (2)
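As a quick check of Eqs. (1) and (2), the two conversions can be written as a pair of Python functions. The function names are illustrative only; the inverse is taken to be the algebraic inverse of Eq. (1), with the exponent mel/2595.

```python
import math


def hertz_to_mel(hertz):
    """Eq. (1): map a frequency in Hz onto the Mel scale."""
    return 2595.0 * math.log10(1.0 + hertz / 700.0)


def mel_to_hertz(mel):
    """Eq. (2): inverse mapping from Mel back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Round trip: converting to Mel and back should recover the input frequency.
recovered = mel_to_hertz(hertz_to_mel(440.0))
```

By construction of the scale, 1000 Hz maps to roughly 1000 mel, which is a convenient sanity check.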

The coefficient $c_{k}$ in each frame of the MFCC object results from applying the DCT to the spectral values $p_{j}$ of the Mel spectrogram for the corresponding frame:

$\displaystyle c_{k-1}=\sum\limits_{j=1}^{N}p_{j}\cos\left(\frac{\pi(k-1)(j-0.5)}{N}\right)$ (3)

where $N$ is the number of spectral values and $p_{j}$ is the power in dB of the $j^{\rm th}$ spectral value.
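The cosine transform of Eq. (3) can be illustrated with a direct, unoptimized Python loop. This sketch assumes that the sum runs over the $N$ spectral values of one frame and that no extra scaling factor is applied; the function name is hypothetical.

```python
import math


def mel_cepstral_coefficients(p, num_coefficients):
    """Apply the cosine transform of Eq. (3) to one frame of Mel-spectral values.

    p holds the N spectral values p_1..p_N (power in dB) of a single frame.
    Returns [c_0, c_1, ..., c_{num_coefficients-1}].
    """
    n = len(p)
    coeffs = []
    for k in range(1, num_coefficients + 1):
        total = 0.0
        for j in range(1, n + 1):
            total += p[j - 1] * math.cos(math.pi * (k - 1) * (j - 0.5) / n)
        coeffs.append(total)
    return coeffs


# A flat spectrum puts all of its energy in c_0; higher coefficients vanish.
flat = mel_cepstral_coefficients([1.0, 1.0, 1.0, 1.0], 3)
```

The flat-spectrum case is a useful check because the cosine terms cancel exactly for every coefficient above the zeroth.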

After the MFCC values are extracted, a matrix is generated in which each row represents a frame and each column represents an extracted coefficient. This matrix is fed to the neural network.
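The pre-emphasis and Hamming-window steps mentioned earlier in this section can be sketched as follows. The pre-emphasis coefficient 0.97 is a commonly used value and is an assumption here, since the paper does not state one; the function names are illustrative.

```python
import math


def pre_emphasize(samples, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def hamming_window(length):
    """Symmetric Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (length - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def window_frame(frame):
    """Multiply one frame element-wise by a Hamming window of the same length."""
    return [x * w for x, w in zip(frame, hamming_window(len(frame)))]


emphasized = pre_emphasize([1.0, 1.0, 1.0])
window = hamming_window(5)
```

The window tapers each frame towards its edges, which reduces spectral leakage before the spectrum is computed.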

4. System implementation

Recognizing a speaker's emotion is one of the main challenges in speech technology. The design of an SER (Speech Emotion Recognition) system mostly contains three modules, on which our discussion relies: speech acquisition, feature extraction from the signals, and training the machine with the feature sets and classifying emotions through SVM and ANN. The input to the system is a .wav file, as shown in Fig. 1, from any emotional database. Here we use the Berlin emotion database, which contains emotional speech utterances in different emotional states. From the given input, temporal and spectral features such as pitch, intensity, MFCC and spectral density are extracted. In the next step, a “.data” file containing the class labels is fed into the SVM and ANN classifiers. The machine is trained to classify all possible emotions.

After the system has been trained on acted speech, a real-time speech signal is given as input to the “.model” file, and the system automatically predicts the emotional state with reference to the training sets. Training is repeated iteratively until accurate results are obtained.

4.1 Architecture

Figure 1 shows the architecture used for emotion recognition. It mainly comprises two phases: the first is used for training and the second for testing. In the training phase, a model is built from the pitch, intensity and MFCC features extracted from the utterances. The extracted MFCC features are fed to emotion recognition models such as SVM (Support Vector Machine) and ANN (Artificial Neural Network). The testing phase determines the emotion class of the corresponding input utterance. This is done by the decision block, which selects the class with the highest probability under the trained model. We propose a methodology rooted in our previous studies [2] to address the confusion problem: many statements can be uttered with different emotions. In a supervised learning setting, this involves identifying the different sources of ambiguity between classes in regional languages within the state.

5. Classification model

5.1 SVM

The SVM is a powerful classifier widely used in pattern recognition that uses linear and nonlinear hyperplanes to classify data. In its basic form it is a binary nonlinear classifier that efficiently predicts whether an input vector x belongs to class 1 or class 2. For a given set of separable data, the goal is to find the optimal decision function. This is done by maximizing the margin, defined as the distance between the closest sample and the decision boundary. Based on statistical learning theory, the SVM performs classification by constructing hyperplanes in a multidimensional space that separate the different class labels. SVMs are applied in various fields because of their high accuracy and flexibility, capacity to accommodate a large number of attributes, ease of training, and ability to model complex real-world problems.
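For reference, the maximum-margin idea can be written in the standard textbook form for linearly separable data. The notation below ($w$, $b$, training pairs $(x_{i},y_{i})$ with $y_{i}\in\{+1,-1\}$, kernel $K$, multipliers $\alpha_{i}$) is assumed for illustration and is not taken from the paper.

```latex
% Standard hard-margin SVM; all symbols are assumed notation.
\begin{align}
  f(x) &= \operatorname{sign}\left(w^{\top}x + b\right)
       && \text{decision function} \\
  \min_{w,\,b}\ & \tfrac{1}{2}\lVert w\rVert^{2}
       \quad \text{subject to} \quad y_{i}\left(w^{\top}x_{i} + b\right) \geq 1,
       \quad i = 1,\ldots,n
       && \text{margin maximization} \\
  d_{\min} &= \frac{1}{\lVert w\rVert}
       && \text{distance from the closest sample to the boundary} \\
  f(x) &= \operatorname{sign}\Big(\sum_{i=1}^{n}\alpha_{i}y_{i}K(x_{i},x) + b\Big)
       && \text{nonlinear (kernel) form}
\end{align}
```

Minimizing $\lVert w\rVert$ under these constraints is equivalent to maximizing the distance between the closest sample and the decision boundary, which is the criterion described above.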

Figure 2.

Multilayer Perceptron with a hidden layer.

5.2 ANN

Artificial Neural Networks are fundamentally feed-forward neural network models, and the Multi-Layer Perceptron (MLP) is a derived class that uses the backpropagation technique in supervised learning. An MLP can distinguish data that are not linearly separable. The network contains at least three layers: one input layer, at least one hidden layer, and an output layer. Each layer contains multiple neurons, also called nodes, and there are connections (edges) between adjacent layers. Figure 2 shows the fully connected neural net, in which each node is connected to every node in the adjacent layer. The nodes at the input layer are denoted $X_{1}$, $X_{2}$, and so on.

Algorithm 2: Perceptron Training Algorithm
Let $X$ be the set of training examples
$S_{x}=X_{1},X_{2},\ldots,X_{k}$ is a training sequence on $X$
Begin
Let $w_{k}$ be the weight vector at step $k$
Choose $w_{0}$ arbitrarily
At each step $k=$ 0, 1, 2, $\ldots$
Classify $x_{k}$ using $w_{k}$
If $x_{k}$ is correctly classified, take $w_{k+1}=w_{k}$
CASE 1: $x_{k}$ is misclassified and belongs to class 2
$w_{k+1}=w_{k}-c_{k}x_{k}$
CASE 2: $x_{k}$ is misclassified and belongs to class 1
$w_{k+1}=w_{k}+c_{k}x_{k}$
End

All connections are associated with weights $W_{0}$, $W_{1}$, $W_{2}$, $W_{3}$, $\ldots$, $W_{n}$. The input is passed to the hidden layer without any computation. The activation function of each node in the hidden layer is the sigmoid, and the training process was run several times. At the output layer, the errors are estimated and propagated back through the network with the backpropagation algorithm. The parameter settings of the artificial neural network architecture are shown in Table 1.

The sequence $c_{k}$ should be chosen according to the data. Large constant values make training unstable, whereas small values increase training time. However, $c_{k}=c_{0}/k$ will work for any positive $c_{0}$.
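A minimal Python sketch of the perceptron training loop in Algorithm 2 with the decaying step size $c_{k}=c_{0}/k$ is given below. It uses the standard perceptron update (add the scaled sample for a misclassified class-1 example, subtract it for class 2), encodes the two classes as +1 and -1, and starts the step counter at 1 to avoid division by zero. These choices, the function names and the toy data are assumptions for illustration.

```python
def train_perceptron(samples, labels, c0=1.0, max_epochs=100):
    """Train a two-class perceptron; labels are +1 (class 1) or -1 (class 2)."""
    w = [0.0] * (len(samples[0]) + 1)   # weight vector including a bias term
    k = 1                               # step counter, starts at 1 so c0 / k is defined
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            xa = [1.0] + list(x)        # augment the input with a constant 1 for the bias
            score = sum(wi * xi for wi, xi in zip(w, xa))
            predicted = 1 if score > 0 else -1
            if predicted != y:
                c_k = c0 / k            # decaying step size c_k = c_0 / k
                w = [wi + y * c_k * xi for wi, xi in zip(w, xa)]
                errors += 1
            k += 1
        if errors == 0:                 # stop once every sample is classified correctly
            break
    return w


def predict(w, x):
    """Classify one sample with the trained weight vector."""
    xa = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xa)) > 0 else -1


# Toy, linearly separable data (made up for this example).
samples = [(2.0, 1.0), (1.0, 3.0), (-1.0, -1.0), (-2.0, -0.5)]
labels = [1, 1, -1, -1]
weights = train_perceptron(samples, labels)
```

On this toy set a single update is enough: the first sample is misclassified by the zero weight vector, and the updated weights separate all four points.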

6. Experimental results and discussions

To evaluate performance we used the Berlin Emotional Database described in Section 2, with its 535 speech files in seven emotional classes recorded by 10 professional actors (5 male, 5 female). Table 2 lists the number of samples for each category [16].

Table 1
Parameter setting of MLP architecture

Parameter	Value
Minimum error	$10^{-10}$
Activation function	sigmoid
Iteration	1000
Learning rate(L)	0.1
Momentum rate(M)	0.2
Input attributes	14
Hidden layers	1
Hidden nodes(H)	10
Training Epochs(N)	1000

Table 2

Sample size in Berlin-DB

Emotion	Male	Female
HAP	27	44
NEU	39	40
ANG	60	67
SAD	25	37
FEA	36	33
BOR	35	46
DIS	11	35
Count	233	302

Performance index: The metric used to measure the accuracy of the classifier is the confusion matrix, a visualization tool commonly used in supervised learning that makes it easy to see the errors a classifier makes when predicting the original class instances. The matrix is shown in Table 3. Each row represents instances of an original class and each column represents a predicted class.

Table 3

Confusion matrix by using ANN

Emotions	Hap	Neu	Ang	Sad	Fear	Bor	Dis
Hap	68	1	0	0	2	0	0
Neu	0	75	0	3	0	1	0
Ang	0	0	127	0	0	0	0
Sad	1	0	2	57	0	1	1
Fea	4	1	1	0	61	0	2
Bor	0	0	0	1	1	78	1
Dis	3	0	2	2	1	0	38
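The per-class recall, precision and F-measure, as well as the overall accuracy, can be derived directly from the confusion matrix in Table 3. The sketch below uses the usual definitions (rows are actual classes, columns are predicted classes); the function names are illustrative, and specificity is left out because its exact definition is not given in the text.

```python
LABELS = ["Hap", "Neu", "Ang", "Sad", "Fea", "Bor", "Dis"]

# Confusion matrix from Table 3: rows = actual class, columns = predicted class.
CONFUSION = [
    [68, 1, 0, 0, 2, 0, 0],
    [0, 75, 0, 3, 0, 1, 0],
    [0, 0, 127, 0, 0, 0, 0],
    [1, 0, 2, 57, 0, 1, 1],
    [4, 1, 1, 0, 61, 0, 2],
    [0, 0, 0, 1, 1, 78, 1],
    [3, 0, 2, 2, 1, 0, 38],
]


def per_class_metrics(matrix):
    """Return (recall %, precision %, F-measure) for each class."""
    n = len(matrix)
    metrics = []
    for i in range(n):
        tp = matrix[i][i]
        row_total = sum(matrix[i])                        # actual instances of class i
        col_total = sum(matrix[r][i] for r in range(n))   # predictions of class i
        recall = 100.0 * tp / row_total
        precision = 100.0 * tp / col_total
        f_measure = 2 * precision * recall / (precision + recall)
        metrics.append((recall, precision, f_measure))
    return metrics


def overall_accuracy(matrix):
    """Percentage of all utterances that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


metrics = per_class_metrics(CONFUSION)
accuracy = overall_accuracy(CONFUSION)
```

The diagonal of the matrix sums to 504 out of 535 utterances, which reproduces the reported overall ANN accuracy of about 94.2%.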

In our first experiment, we used SVM for classification. The classifier was trained several times with the extracted features as input, and it was tested using open and closed testing [25]. In closed testing the entire dataset is used; in open testing, 90% of the data is used for training and the remainder for testing. Gender recognition is 100%, with 233 male and 302 female utterances in closed testing and 28 and 26 in open testing. Table 4 shows the results obtained with the SVM classifier, where the emotions neutral and angry have recognition accuracies of 89.87% and 91.34%.
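The 90/10 split used for open testing can be sketched as below. The paper does not say whether the split was shuffled or stratified, so the seeded random shuffle, the function name `open_test_split` and the use of integer utterance identifiers are assumptions made for illustration.

```python
import random


def open_test_split(items, train_fraction=0.9, seed=0):
    """Shuffle the items with a fixed seed and split them into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 535 utterances, identified here simply by their index.
utterance_ids = list(range(535))
train_ids, test_ids = open_test_split(utterance_ids)
```

With 535 utterances this leaves 54 for testing, which agrees with the 28 and 26 open-test utterances reported above.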

Table 4

Results obtained with SVM classifier

Emotion class	Recall (responsiveness) (%)	Precision (%)	Specificity (%)	F-measure	Overall accuracy (%)
HAP	70.4	67.6	95.47413793	69	70.42253521
NEU	89.9	91	98.90350877	90.4	89.87341772
ANG	91.3	86.6	96.32352941	88.9	91.33858268
SAD	77.4	92.3	99.2	84.2	77.41935484
FEAR	68.1	78.3	97.21030043	72.9	83.92857143
BOR	90.1	81.1	96.47577093	85.4	90.12345679
DIS	87	85.1	98.77300613	86	86.95652174

In the second experiment, a multilayer perceptron (MLP), a class of feed-forward ANN [9], is used for classification with the backpropagation algorithm. The weighted sum computed by each node of the network can be written as

$\displaystyle n=\sum\limits_{i=1}^{m}x_{i}w_{i}+b$ (4)

where $n$ is the weighted sum, $x_{i}$ is the $i$th input, $w_{i}$ is the $i$th weight, $b$ is the bias, and $i$ is the index running from 1 to the number of input attributes $m$. Fourteen attributes are used as input attributes. To classify emotions from speech, the MLP classifier is trained with the extracted features. The network consists of an input layer, one hidden layer of ten neurons and an output layer. The activation function of each node in the hidden layer is the sigmoid, and training was run for 1000 epochs with a batch size of 100. The overall accuracy achieved is 94.2%, with 100% prediction accuracy for the emotional class Angry. Table 5 shows the results obtained with the MLP, with a slight improvement in recognizing the emotion classes Neutral and Angry, each above 94.93%.
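The forward pass of such a network (14 inputs, one hidden layer of ten sigmoid nodes, and an output layer) can be sketched as follows. The weighted sum follows Eq. (4) with the bias added once per node. The seven output nodes (one per emotion), the sigmoid at the output layer, the random initial weights and all function names are assumptions, and training by backpropagation is not shown.

```python
import math
import random


def sigmoid(z):
    """Logistic activation used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_sum(inputs, weights, bias):
    """Eq. (4): sum of inputs times weights, plus a single bias term."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias


def forward(inputs, hidden_layer, output_layer):
    """Propagate one feature vector through the hidden and output layers."""
    hidden = [sigmoid(weighted_sum(inputs, w, b)) for w, b in hidden_layer]
    return [sigmoid(weighted_sum(hidden, w, b)) for w, b in output_layer]


def random_layer(num_inputs, num_nodes, rng):
    """Create (weights, bias) pairs with small random values (untrained)."""
    return [([rng.uniform(-0.5, 0.5) for _ in range(num_inputs)],
             rng.uniform(-0.5, 0.5))
            for _ in range(num_nodes)]


rng = random.Random(0)
hidden_layer = random_layer(14, 10, rng)   # 14 input attributes, 10 hidden nodes
output_layer = random_layer(10, 7, rng)    # assumed: one output node per emotion
scores = forward([0.0] * 14, hidden_layer, output_layer)
predicted_class = scores.index(max(scores))
```

The predicted emotion is taken as the index of the largest output score, mirroring the decision block that selects the class with the highest probability.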

Table 5

Results obtained with ANN (MLP)

Emotion class	Recall (responsiveness) (%)	Precision (%)	Specificity (%)	F-measure	Overall accuracy (%)
HAP	95.8	89.5	99.13978495	92.5	95.77464789
NEU	94.9	97.4	99.56236324	96.2	94.93670886
ANG	100	96.2	99.02200489	98.1	100
SAD	91.9	90.5	98.73417722	91.2	91.93548387
FEAR	88.4	93.8	99.57173448	91	88.4057971
BOR	96.3	97.5	99.56043956	96.9	96.2962963
DIS	82.6	90.5	99.3877551	86.4	82.60869565

In this work, two classifiers, SVM and ANN, are used for pattern recognition. The recognition accuracy was obtained for all seven emotions of the Berlin dataset and is shown in Fig. 3. The ANN achieves slightly higher accuracy than the SVM over the long-time feature set without degrading performance.

Figure 3.

Recognition accuracy obtained with ANN and SVM.

6.1 Comparison with state-of-art

For comparison with other state-of-the-art approaches we selected M. Ahmad [26] and T.M. Rajisha [16], as they use low-level descriptors of the speech signal as the input of the neural network. Their feature representation schemes differ: one uses pitch in Hz and its smoothed contour, energy, ZCR, Mel-frequency coefficients, formants, jitter, shimmer and SNR for voice quality, whereas the other considers only MFCC [16] features with an optional overlap of 50%. Moreover, they chose various databases, although [26] used the same EMO-DB benchmark database that we picked for our experiments. Their results show that the highest recognition accuracy, about 91.2%, was obtained with the Multi-layer Perceptron classifier. The proposed method achieved 94% average accuracy on the same dataset by using deeper kernels.

Figure 4.

Prediction accuracy of various methods with SVM, MLP.

Figure 4 shows the prediction accuracy of various methods with two well-known classifiers, SVM and MLP. The results show that models based on low-level descriptors reach recognition accuracies ranging from 78% to 91.2%, while the proposed model achieved 87.17% with the SVM classifier and 94% with the MLP classifier.

7. Conclusion and future work

The experiments show that the recognition rate of emotions using spectral features is slightly improved compared with prosodic features alone. We propose an algorithm that first extracts prosodic and spectral features; recognition accuracy is higher when prosodic and spectral features are combined, and the recognition rate with energy, pitch and MFCC is higher. We conclude that prosodic and spectral features together provide combinations that improve the recognition of human speech emotion without degrading performance. The experiments also show that anger consistently has higher recognition accuracy. More work is needed to improve the system so that it extracts more effective features even when the speech is compressed.

References

1. Pramod Reddy and Felix Burkhardt B.W., Miriam Kienast, “Berlin emo-db”, 1999. [Online]. Available: http://emodb.bilderbar.info/docu/.
2. Pramod Reddy and Vijayarajan, Extraction of emotions from speech-A survey, Int J Appl Eng Res 12(16) (2017).
3. Bhaykar, Manjunath K.E., Yadav and Rao K.S., Speaker Independent Emotion Recognition from Speech using Combination of Different Classification Models.
4. Banda and Robinson, Noise analysis in audio-visual emotion recognition, Cl Cam Ac Uk, 2011, pp. 3–6.
5. Ververidis and Kotropoulos, Emotional speech recognition: Resources, features, and methods, Speech Commun 48(9) (2006), 1162–1181.
6. Engberg I.S. and Hansen A.V., Documentation of the Danish emotional speech database DES, Internal AAU report, Center for Person Kommunikation, Denmark, 1996, vol. 22.
7. Vijayalakshmi and Leema A.A., Real-time Speech Emotion Recognition Using Support Vector Machine.
8. Kwon, Chan, Hao and Lee, Emotion recognition by speech signals, Eighth Eur. Conf. Speech Commun. Technol., 2003, pp. 125–128.
9. Ewender, Hoffmann and Pfister, Nearly perfect detection of continuous F0 contour and frame classification for TTS synthesis, Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, 2009, pp. 100–103.
10. Schuller, Rigoll and Lang, Hidden Markov model-based speech emotion recognition, 2003 IEEE Int. Conf. Acoust. Speech, Signal Process. 2003. Proceedings. (ICASSP ’03), vol. 2, 2003, pp. 401–404.
11. Chen, You, Song and Liu, An enhanced speech emotion recognition system based on discourse information, Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 3991 LNCS, 2006, pp. 449–456.
12. Milton, Professor, Sharmy Roy and Selvi S.T., SVM scheme for speech emotion recognition using MFCC feature, Int J Comput Appl 69(9) (2013), 975–8887.
13. Styler, Using Praat for Linguistic Research, Savevowels, 2013, pp. 1–70.
14. Boersma and van Heuven, Speak and unSpeak with Praat, Glot Int 5(9–10) (2001), 341–347.
15. Boersma, The Use of Praat in Corpus Research, 2013.
16. Rajisha T.M., Sunija A.P. and Riyas K.S., Performance Analysis of Malayalam Language Speech Emotion Recognition System Using ANN/SVM, Procedia Technol 24 (2016), 1097–1104.
17. Reddy A.P. and Vijayarajan, Extraction of Emotions from Speech-A Survey, 12(16) (2017), 5760–5767.
18. Shahzadi, Recognition of emotion in speech using spectral patterns, Malaysian J 26(2) (2013), 140–158.
19. Rao K.S. and Koolagudi S.G., Robust Emotion Recognition using Spectral and Prosodic Features, 2013.
20. Boersma and Kovacic, Spectral characteristics of three styles of Croatian folk singing, J Acoust Soc Am 119(3) (2006), 1805–1816.
21. Lalitha, Geyasruti, Narayanan and Shravani, Emotion detection using MFCC and cepstrum features, Procedia Comput Sci 70 (2015), 29–35.
22. Cheng and Duan, Speech emotion recognition using gaussian mixture model, Proc 2nd Int Conf Comput Appl Syst Model, 2012, pp. 1222–1225.
23. Mao and Chen, Speaker independent emotion recognition based on SVM/HMMs fusion system, ICALIP 2008 – 2008 Int Conf Audio, Lang Image Process Proc, 2008, pp. 61–65.
24. Rabiner L.R., A tutorial on hidden Markov models and selected applications in speech recognition, Proc IEEE 77(2) (1989), 257–286.
25. Nicholson, Takahashi and Nakatsu, Emotion recognition in speech using neural networks, Neural Comput Appl 9 (2000), 290–296.
26. Ahmad M.A., Artificial Neural Network vs. Support Vector Machine For Speech Emotion Recognition, 21(6) (2016), 167–172.
