Emotion recognition from speech using wavelet packet transform and prosodic features

Abstract

Emotion distinguishes human beings from machines: machines are emotionless, whereas human beings are not. If the emotion of a speaker is recognized, others can interact accordingly. This paper presents a new approach for recognizing six basic emotions (happy, anger, fear, sadness, boredom and neutral) from speech signals more effectively. To recognize the emotion of a speaker, the pitch value and two wavelet packet feature vectors derived from the speech signal are used. Principal Component Analysis (PCA) is applied to reduce the dimension of the feature vectors. Random Forest (RF) and Support Vector Machine (SVM) classifiers are trained separately on these reduced feature vectors. The experimental results show that the accuracy of emotion recognition is 86.11% with the Random Forest classifier and 84.41% with the SVM classifier. Experimentally, it is also found that clean speech of 1 sec duration is sufficient to recognize the emotion of the speaker.

Keywords

Pitch, emotions, speech recognition, SVM, Random Forest (RF)

1 Introduction

Speech is a common and easy medium of conversation among human beings. Emotions change the semantics of speech. It is sometimes necessary to identify the emotion, gender, age, etc. of the speaker. Correct recognition of a speaker's emotion improves performance in many areas such as human-computer interaction, E-learning, call center systems, counseling and health care systems. By detecting the mental state of drivers from their speech, monitoring systems may alert the drivers of nearby vehicles. By recognizing the mood of online students, tutors working in an E-learning system may change their teaching style accordingly. During conversations, counselors use the responses received from their clients to analyze the clients' emotional state. By detecting the anger of a client from speech, a call center representative may give priority to that call. Recognition of emotions from human speech is a growing research area, mainly aimed at better human-computer interaction. It allows computers and robots to perform their tasks effectively by understanding the emotion of the speakers during conversations. Features extracted from speech for emotion recognition fall into three categories [2]: excitation features (jitter, shimmer, etc.), vocal tract features (formants, bandwidth, etc.) and prosodic features (pitch, intensity, intonation, etc.). Prosodic features capture the tune and rhythm of speech. Source excitation features, which are generated at the time of excitation of the speech signal, contain information about the vibrations of the speaker's glottal folds. Vocal tract features contain information about vocal tract vibrations. A speech signal is the convolution of the excitation source and the vocal tract information.

The state of the art approaches for emotion recognition from speech are as follows:

In 2004, Yildirim et al. [22] recognized four emotions (sadness, anger, happiness and neutral) with an accuracy of 67% by fusing spectral energy, speech prosody and vowel articulation features from speech recorded in English. They also claimed that the acoustic properties of happiness and anger, and of sadness and neutral, are similar in nature. In 2005, Vidrascu and Devillers [14] classified five emotions in speech signals taken from a French database into two sets. One set contains the negative emotions (fear, anger, sadness, hurt) and the other contains the neutral emotion. Lexical, spectral, prosodic, disfluency (pauses) and paralinguistic (crying, laughter) features extracted from recorded call center conversations were combined. The recognition accuracy was 80% using a Support Vector Machine (SVM) classifier. In 2006, Hoque et al. [15] classified four emotions into two categories, positive emotions (flow, delight/happy) and negative emotions (frustration/anger, confusion/boredom), using spectral and prosodic features extracted from speech. The accuracy of emotion recognition on speech signals taken from the DES speech database was 83.3% with a Random Forest classifier. In 2007, M. Lugger and B. Young [16] recognized six emotions (anger, happiness, sadness, boredom, anxiety/disgust and neutral) using the Berlin emotional database with a Bayes classifier. They used supra-segmental features (pitch, energy, duration) and segmental features (voice quality, zero crossing rate and articulation) for this task. The maximum accuracy claimed was 82%, obtained when the sequential floating forward selection (SFFS) algorithm was used to select a subset of the available features. In 2007, K.S.J. Cichosz [12] recognized six emotions (anger, fear, sadness, boredom, joy and neutral) from speech signals taken from two databases, a Polish database and a German database, with accuracies of 76.30% and 74.39% respectively. A binary tree classifier was used for classification.
In 2007, Mansoorizadeh and Charkari [17] recognized emotions from speech signals using frames of duration 10 msec and voiced segments (10 to 12 in number). In the frame-based approach, the minimum duration of speech required to recognize the emotion was 1.5 sec, whereas in the voiced-segment-based approach, 10 or more voiced segments were required. In 2008, Zhang [20] recognized four emotions (anger, joy/happy, sadness and neutral) from speech signals taken from a Chinese natural emotional speech corpus. Using only prosodic features, the accuracy claimed was 75%; adding voice quality features to the prosodic features increased the accuracy to 76%. In 2009, Iliou and Anagnostopoulos [23] also used prosodic features (pitch, energy and duration) for emotion recognition. They recognized seven emotions (anger, boredom, disgust, fear, happy, neutral and sad) from speech signals taken from the Berlin database using an Artificial Neural Network (ANN) with an accuracy of 51%. In 2009, Bozkurt et al. [7] recognized six emotions (anger, boredom, fear, happy, neutral and sad) from speech signals taken from the INTERSPEECH-2009 emotion challenge corpus using a Hidden Markov Model (HMM) classifier. They extracted spectral, prosodic and HMM-based features from the speech signals, and the accuracy claimed was 63.03%. In 2012, Dorota Kaminska and Adam Pelikant [6] extracted the fundamental frequency (F0) feature from speech signals taken from a Polish emotional database. They claimed 70% accuracy with a K-Nearest Neighbor (KNN) classifier. They used Plutchik's cone-shaped 3D model to represent the intensities of different emotions. In 2013, Gayatri Lokhande and K.S. Bapat [9] recognized seven emotions (anger, disgust, fear, happy, neutral, sadness and surprise) from speech signals taken from the IITKGP-SESC database.
They divided the emotions into three categories, fast (anger, disgust and fear), normal (happy and neutral) and passive (boredom and sad), based on experiments performed using the extracted features: speaking rate, vocal tract characteristics and MFCCs. They claimed 82% accuracy with a Gaussian mixture model (GMM) classifier. In 2014, Chien Shing et al. [4] extracted the MFCC feature from speech signals taken from the eNTERFACE and RML databases, with accuracies of 75.89% and 68.57% respectively. An RBF Neural Network classifier was used to classify the emotions. In 2014, Farah Chenchaha and Zied Lachiri [8] extracted wavelet packet energy and entropy features and then applied the extracted features on the Mel, Bark and ERB scales. The maximum accuracy claimed was 78.5% with the SAVEE database and 50.06% with the IEMOCAP database. In 2015, Fayek et al. [10] used a DNN classifier to recognize emotion from one-second frames of the raw speech spectrogram. They claimed an accuracy of 60.53% in classifying six emotions in speech signals taken from the eNTERFACE database and 59.7% in classifying seven emotions in speech signals taken from the SAVEE database. In 2015, Lanjewar et al. [18] extracted spectral MFCC [19], wavelet features and pitch from speech signals to recognize six emotions (happy, angry, neutral, surprise, fearful and sad). A GMM classifier recognized the angry emotion with the highest accuracy, 92%, and the surprise emotion with 25%. In 2015, Rajisha et al. [24] extracted Mel Frequency Cepstral Coefficients (MFCCs), Short Time Energy (STE) and pitch features from speech signals taken from a Malayalam language database. The claimed accuracies were 88.4% with ANN and 78.2% with SVM.

The literature survey shows that the pitch feature has been widely used for emotion recognition. Wavelet packet transform features contain both time domain and frequency domain information. Therefore, wavelet packet transform features along with pitch are used to recognize six types of emotions (happy, anger, fear, sadness, boredom and neutral) in the proposed work. Support Vector Machine (SVM) and Random Forest (RF) classifiers are used for the recognition of emotions in the experiments.

The rest of the paper is organized as follows: Section 2 describes the methods used for extracting features for recognizing emotions. Section 3 describes the classifiers used. Section 4 presents the proposed block diagram and algorithms. Section 5 presents the experimental results, followed by Section 6, which compares the results obtained from the proposed approach with state of the art approaches. Section 7 concludes the paper.

2 Speech signal features and their extraction methods

In the literature, features such as spectral energy, pitch, energy, duration and voice quality have been used for emotion recognition with ANN, HMM, SVM, Bayes and Random Forest classifiers. In the proposed approach, pitch and wavelet packet transform features (wavelet packet energy and wavelet packet entropy) are extracted from the speech signals. Support Vector Machine and Random Forest classifiers are used in the proposed approach.

2.1 Features extraction

In the proposed approach, two features, pitch and wavelet packet transform (wavelet packet energy and wavelet packet entropy), are extracted from the preprocessed speech signals. The sizes of the pitch feature and the wavelet packet feature vector are 1 and 383 respectively. Figure 1 shows the process of forming the combined feature vector of size 384.

Fig.1

Extraction of features used for emotion recognition.

2.1.1 Pitch extraction

The pitch value of the speech signal is determined using the Subharmonic-to-Harmonic Ratio (SHR) [25] method. This method uses a logarithmic frequency scale and a spectrum shifting technique to sum up the amplitudes of harmonics and subharmonics. Some emotions have a high pitch value and some have a low pitch value. Experimentally, it is observed that pitch value is correlated with emotion.

2.1.2 Wavelet Packet Transform (WPT)

Wavelet Transform (WT) provides information in both the time and frequency domains. In other words, WT is localized in both time and frequency [26]. Where a signal has a discontinuity, WT generates only a few significant coefficients around the discontinuity, whereas the Fourier Transform (FT) generates a large number of coefficients with large magnitudes. The whole process of decomposing a signal can be expressed as a tree in which the left branch represents the low pass filtering output and the right branch represents the high pass filtering output [5]. In WPT, both the low and the high frequency bands are decomposed further at each level.

In MATLAB, the “wpdec” function is used for the decomposition of the speech signal. $T = \text{wpdec}(x, 7, \text{'db3'}, \text{'shannon'});$ (1)

This function decomposes the speech signal x up to depth 7 with the Daubechies-3 filter (db3) and Shannon entropy. T is the wavelet tree generated after the decomposition of the speech signal. The following two wavelet packet features are extracted from each frame of the speech signal.

Wavelet Packet Energy

The energy associated with each frame of speech signal is calculated using MATLAB function: $E = wenergy (T);$ (2)

For a wavelet packet tree T, “wenergy(T)” returns a vector E, which contains the percentages of energy corresponding to the terminal nodes of the tree T. The size of wavelet packet energy vector E is 128.

Wavelet Packet Entropy

The wavelet packet entropy associated with each frame is extracted using MATLAB function

$nodes = allnodes (T);$ (3)

The allnodes function returns a column vector containing the indices of all nodes of tree T. The nodes are numbered from top to bottom and from left to right. The index of the root node is 0.

$P = \text{read}(T, \text{'ent'}, nodes);$ (4)

T is the wavelet packet tree and ‘ent’ is the keyword used for entropy. The size of wavelet packet entropy feature vector P is 255.
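The feature sizes quoted above follow from the shape of a full binary tree of depth 7. The short Python sketch below only checks that arithmetic; the function name is hypothetical and it does not perform any wavelet decomposition.

```python
def wavelet_packet_tree_sizes(depth):
    """Node counts of a full binary wavelet packet tree of the given depth."""
    terminal_nodes = 2 ** depth        # leaves on the deepest level
    all_nodes = 2 ** (depth + 1) - 1   # root plus every node on levels 1..depth
    return terminal_nodes, all_nodes


# Depth 7, as used in Equation (1)
energy_size, entropy_size = wavelet_packet_tree_sizes(7)
combined_size = 1 + energy_size + entropy_size  # 1 pitch value + energy + entropy
print(energy_size, entropy_size, combined_size)  # 128 255 384
```

The energy vector has one entry per terminal node (128), the entropy vector has one entry per node of the tree (255), and adding the single pitch value gives the combined size of 384.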

2.2 Feature reduction

The sizes of the pitch feature and the wavelet packet transform feature vector are 1 and 383 respectively, giving a combined feature vector of size 384. A vector of this size needs to be reduced, which is done using a feature reduction technique. Here, Principal Component Analysis (PCA) is used to reduce the dimension of the feature vector.

Principal Component Analysis (PCA)

PCA is a technique for reducing the dimension of feature vectors. It also reduces the redundancy present in the feature vectors by projecting the original feature vector onto a new subspace of reduced dimension [11]. It is a linear transformation in which high dimensional data are transformed into low dimensional data while retaining maximum variance.

In PCA, the principal components (PCs) are the projection directions along which the variance is maximum. There may be more than one PC, but all must be orthogonal to each other, and each must capture the maximum remaining variance. The objective is to obtain new reduced features that are uncorrelated and have maximum variance. The reduced features can then be classified efficiently into the required number of classes with the least misclassification error.

3 Classifiers used for emotion recognition

In the literature, it is found that SVM and Random Forest are widely used for emotion recognition.

3.1 SVM

Support Vector Machine (SVM) is used for classification. It handles a large number of features efficiently and also tackles overfitting. Overfitting occurs because of large data variations. The decision surface is the boundary that separates the training points of different classes. The points closest to the decision surface are known as support vectors. Classification accuracy of SVM is inversely proportional to the number of support vectors. When the training data sets are not linearly separable, they are transformed into a new feature space in which they become linearly separable [3].

3.2 Random forest

To classify speech signals into various emotions, multiple decision trees are created at training time. A higher number of decision trees improves the classification accuracy. Random Forest (RF) takes the final decision by combining the decisions of all the individual decision trees through a polling/voting approach [1]. The average weighted sum of these decisions is used by the random forest for classification.

The bagging technique, or bootstrap aggregating [21], is used in random forest to randomly select subsets, with replacement, from the training set. Generally, the size of each new subset is two-thirds of the original size.
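The two ingredients described above, bootstrap sampling and voting, can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, the vote is an unweighted majority vote rather than a weighted sum, and no decision trees are actually trained.

```python
import random
from collections import Counter


def bootstrap_sample(items, rng):
    """Draw len(items) elements from items uniformly, with replacement."""
    return [rng.choice(items) for _ in items]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
training_ids = list(range(420))   # 420 training signals, as in Section 5
sample = bootstrap_sample(training_ids, rng)

# Fraction of distinct training signals that ended up in this bootstrap sample
distinct_fraction = len(set(sample)) / len(training_ids)
print(round(distinct_fraction, 2))

# Hypothetical predictions of five trees for one test signal
print(majority_vote(["anger", "happy", "anger", "fear", "anger"]))  # anger
```

When a bootstrap sample of the same size as the training set is drawn with replacement, roughly 63% of the distinct training signals appear in it on average, which is close to the two-thirds figure mentioned above.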

4 Proposed approach

The EMO-DB emotional speech database is used here for emotion recognition. This database contains speech samples in the German language spoken by 5 male and 5 female speakers in six different emotions: anger, boredom, fear, happy, neutral and sad. In the proposed approach, a total of 600 speech signals are taken, 100 for each emotion. All the speech signals are in .WAV format, with durations varying from 1.87 seconds to 3 seconds. Each speech signal contains only one emotion. The speech signals are sampled at a sampling frequency of 16000 Hz. Each sample is represented by 18 bits.

4.1 Block diagram

The block diagram of the proposed approach, shown in Fig. 2, has the following steps: preprocessing of the speech signals (removal of the direct-current (DC) effect, normalization and silence removal), feature extraction (pitch and wavelet features), feature reduction (PCA), training of the classifiers with 70% of the speech signals and testing of the trained classifiers with 30% of the speech signals.

Fig.2

Block Diagram of proposed approach for emotion recognition.

4.2 Proposed algorithm

First of all, speech signals are preprocessed using steps 4.2.1, 4.2.2 and 4.2.3 explained in block diagram. The proposed algorithm contains four sub algorithms. These four sub algorithms are explained in Algorithms 4.2.1, 4.2.2, 4.2.3 and 4.2.4.

Algorithm 4.2.1
Algorithm for Removal of DC effect and Normalization

INPUT: Speech signal S of size N × 1, i^th sample of speech signal S (i), total number of samples of speech signal N and sampling frequency of speech signal f_s.

OUTPUT: Normalized signal S_n.

Ensure:

   (1) max (.) represents a function that returns maximum value of input vector.

   (2) sum stores the sum of amplitude values of all the samples of speech signal

   (3) mean is the average value of all samples of speech signal

   (4) max1 stores the maximum absolute value of amplitudes of samples of speech signal

   (5) abs () represents a function that returns absolute value of samples

1: Initialize sum = 0
2: for i ← 1 to N do
3:   sum = sum + S (i)
4: end for
5: mean = $\frac{sum}{N}$
6: for i ← 1 to N do
7:   S (i) = S (i) - mean ▷ Remove DC effect
8: end for
9: for i ← 1 to N do
10:   abs_signal (i) = abs (S (i)) ▷ Absolute value of signal
11: end for
12: max1 = max (abs_signal)
13: for i ← 1 to N do
14:   $S_{n} (i) = \frac{S (i)}{max1}$
15: end for
16: return S_n
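Algorithm 4.2.1 can be sketched in a few lines of Python. This is an illustrative translation, not the authors' implementation; the function name is hypothetical, and the guard for an all-zero signal is an added assumption that the algorithm does not mention.

```python
def remove_dc_and_normalize(signal):
    """Subtract the mean (DC effect) and scale so the peak magnitude is 1."""
    n = len(signal)
    mean = sum(signal) / n                     # steps 1 to 5
    centered = [s - mean for s in signal]      # steps 6 to 8
    peak = max(abs(s) for s in centered)       # steps 9 to 12
    if peak == 0:
        return centered                        # constant input: nothing to scale
    return [s / peak for s in centered]        # steps 13 to 15


normalized = remove_dc_and_normalize([1.0, 3.0, 2.0, 2.0])
print(normalized)  # [-1.0, 1.0, 0.0, 0.0]
```

After this step every signal has zero mean and a peak absolute amplitude of 1, so signals recorded at different levels become comparable.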

Algorithm 4.2.2

Algorithm for Silence Removal

INPUT: F_i is the i^th frame of size 30 msec, N is the total number of samples of the speech signal and f_s is the sampling frequency of the speech signal.

OUTPUT: Speech signal without silence part S.

Ensure:

(1) k is the number of samples in a particular frame
(2) nf is the number of frames in the speech signal
(3) E_i is the short term energy of the i^th frame
(4) A_i is the average energy of the i^th frame
(5) N_i is the normalized energy of the i^th frame

1: Number of samples in each frame k = f_s × 0.030 ▷ 30 msec frame, 480 samples at 16000 Hz
2: Number of frames in speech signal nf = $\frac{N}{k}$
3: for i ← 1 to nf do
4:   for j ← 1 to k do
5:     E_i = E_i + (F_i (j))² ▷ E_i is initialized to 0
6:   end for
7: end for
8: for i ← 1 to nf do
9:   $A_{i} = \frac{E_{i}}{k}$
10:   $N_{i} = \frac{A_{i} * 100}{\max (\text{average energy of all frames})}$
11: end for
12: Remove every frame i for which N_i < 2
13: Combine remaining frames and reconstruct speech signal S
14: return S
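The silence removal procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and its default parameters are hypothetical, the frame length is taken as 30 msec (480 samples at 16000 Hz), and any samples left over after the last full frame are simply dropped.

```python
def remove_silence(signal, fs=16000, frame_ms=30, threshold=2.0):
    """Drop frames whose normalized average energy is below the threshold."""
    k = int(fs * frame_ms / 1000)       # samples per frame (480 at 16 kHz)
    nf = len(signal) // k               # number of complete frames
    frames = [signal[i * k:(i + 1) * k] for i in range(nf)]

    # Average energy A_i of each frame
    avg_energy = [sum(x * x for x in frame) / k for frame in frames]
    peak = max(avg_energy, default=0.0)
    if peak == 0:
        return []                       # no frames, or every frame is silent

    kept = []
    for frame, a in zip(frames, avg_energy):
        normalized = a * 100.0 / peak   # N_i, on a 0 to 100 scale
        if normalized >= threshold:
            kept.extend(frame)          # keep non-silence frames in order
    return kept


loud = [0.5] * 480
quiet = [0.001] * 480
cleaned = remove_silence(loud + quiet + loud)
print(len(cleaned))  # 960
```

In this toy input the middle frame has a normalized energy far below 2, so it is removed and the two loud frames are joined.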

Algorithm 4.2.3

Algorithm for Windowing

INPUT: F_i is the i^th frame of size 30 msec, S is the speech signal without silence part and f_s is the sampling frequency of the speech signal.

OUTPUT: S_w: Windowed speech signal without silence part.

Ensure:

(1) k is the number of samples in a particular frame
(2) nf is the number of frames in the speech signal
(3) Nh is the total number of samples in the Hamming window
(4) pi = 3.14

1: Hamming window w(n) = $0.54 - 0.46 \cos \left( \frac{2 * pi * n}{Nh - 1} \right)$, n = 0, 1, ..., Nh - 1
2: for i ← 0 to nf - 1 do
3:   S_w ((i * k + 1) : (i + 1) k) = S ((i * k + 1) : (i + 1) k) * w (n) ▷ Sample-by-sample product for frame i
4: end for
5: return S_w
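A minimal Python sketch of the windowing step is given below. The function names and the `hop` parameter are hypothetical. With the default `hop` equal to the frame length the frames do not overlap, as in the indexing of Algorithm 4.2.3; setting `hop` to half the frame length is one possible way to obtain a 50% overlap.

```python
import math


def hamming(nh):
    """Hamming window of nh samples: 0.54 - 0.46 * cos(2*pi*n / (nh - 1))."""
    if nh == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (nh - 1)) for n in range(nh)]


def window_frames(signal, k, hop=None):
    """Split signal into frames of k samples and multiply each by the window."""
    hop = k if hop is None else hop
    w = hamming(k)
    frames = []
    for start in range(0, len(signal) - k + 1, hop):
        frame = signal[start:start + k]
        frames.append([s * wn for s, wn in zip(frame, w)])
    return frames


w = hamming(5)
print([round(v, 2) for v in w])  # [0.08, 0.54, 1.0, 0.54, 0.08]
print(len(window_frames([1.0] * 10, 5)))         # 2 non-overlapping frames
print(len(window_frames([1.0] * 10, 5, hop=2)))  # 3 overlapping frames
```

The window tapers each frame towards 0.08 at both ends, which reduces the discontinuity at the frame boundaries.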

Algorithm 4.2.4

Algorithm for PCA

INPUT: X is the matrix containing training data sets.

OUTPUT: AA is the reduced feature set matrix.

Ensure:

(1) k is the index of the first retained eigenvector
(2) eigVecs is the eigenvector matrix
(3) eigVals is the eigenvalue matrix
(4) sort(): sorts the vector in increasing order
(5) flipud(): reverses the order of the vector

1: cov = X^T * X
2: [eigVecs, eigVals] = eig (cov)
3: diagonal = diag (eigVals)
4: [diagonal, index] = sort (diagonal)
5: index = flipud (index)
6: kk = length (index)
7: ss = sum (diagonal)
8: mm = kk - 1
9: for j ← mm - 1 down to 1 do
     if (sum (diagonal (j : kk))/ss >= 0.99) then
10:      k = j
11:      break
     end if
12: end for
13: reduced_dim = kk - k + 1 ▷ Number of eigenvectors in columns k to kk
14: reduced_eigVecs = eigVecs (: , k : kk)
15: reduced_eigVecs
16: AA = reduced_eigVecs′ * X
17: return AA
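The component selection rule in steps 4 to 13 of Algorithm 4.2.4 can be isolated in a short Python sketch. It assumes the eigenvalues of the covariance matrix have already been computed elsewhere; the function name and the example eigenvalues are hypothetical, and the search starts from the single largest eigenvalue.

```python
def components_for_variance(eigenvalues, target=0.99):
    """Smallest number of leading components whose variance share meets the target."""
    ordered = sorted(eigenvalues, reverse=True)   # largest variance first
    total = sum(ordered)
    if total <= 0:
        raise ValueError("eigenvalues must have a positive sum")
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues: the three largest carry 99.5% of the total variance
print(components_for_variance([5.0, 0.02, 3.0, 0.03, 1.95]))  # 3
```

Only the eigenvectors belonging to the selected eigenvalues are then kept to project the 384-dimensional feature vectors onto the reduced space.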

It has been noticed that some speech signals are shifted from the origin. This shift is known as the DC effect. After the removal of this DC effect, the waveform passes through the origin. For better analysis of speech signals, the waveform must pass through the origin [13]. The removal of the DC effect and the normalization procedure are described in Algorithm 4.2.1. The samples of the speech signals have different ranges of amplitude values [6]. To analyze the speech signals, they must be normalized to a specific range. After normalization, the peak amplitudes of all the speech signals become the same. The normalization procedure is also described in Algorithm 4.2.1.

A speech signal may also contain silence parts. These silence parts have no significant role in feature extraction. Silence parts are removed from the speech signal using Algorithm 4.2.2. The normalized speech signal is divided into frames of size 30 msec (480 samples). Experimentally, it is found that the normalized energy of silence frames is less than 2. Therefore, the threshold value for normalized energy is set to 2. After the silence frames are removed, the non-silence frames are combined. The results are shown in Figs. 3–6. Figure 3 presents the normalized speech signal. Figure 4 presents the normalized energy of each frame. Figure 5 presents the frames having normalized energy greater than or equal to 2. The speech signal with no silence part is shown in Fig. 6. The speech signal is then multiplied by a Hamming window to remove the discontinuity at the start and end of the waveform. Windowing is the process of multiplying the signal S (without silence part) by a Hamming window of size 30 msec with an overlap of 50% of the samples [13]. This procedure is described in Algorithm 4.2.3. The resultant signal is known as the windowed signal S_w.

Fig.3

Normalized Speech Signal.

Fig.4

Normalized energy of each frame.

Fig.5

Frames having normalized energy >=2.

Fig.6

Speech signal after removal of silence part.

Pitch and wavelet packet transform features are extracted from the windowed speech signal using the methods defined in Sections 2.1.1 and 2.1.2. The wavelet packet transform features are wavelet packet energy and wavelet packet entropy, with vector sizes of 128 and 255 respectively. The total size of the wavelet packet feature vector is 383. Pitch, wavelet packet energy and wavelet packet entropy features are combined to make a feature vector of size 384. This process is applied to all the speech signals. PCA is then applied to the extracted feature data sets of all the speech signals to reduce their dimension. The reduced dimensions are selected so that they retain at least 99% of the variance [11], which keeps the misclassification error low.

70% of the reduced feature vectors are used for training and the remaining 30% are used for testing. SVM and Random Forest classifiers are trained separately. Based upon the results obtained in experiment 2, it is found that the performance of Random Forest is better than that of the SVM classifier. Therefore, Random Forest may be used in future work.

5 Experimental results

Experiments have been performed on speech signals taken from the EMO-DB database. This database contains speech signals for all six emotions. We extract two features, pitch and wavelet packet transform, from all the speech signals after preprocessing. The feature vector size is 1 for pitch and 383 (128 + 255) for the wavelet packet transform feature. A total of 600 speech signals (100 speech signals for each emotion) are taken from the EMO-DB database. 70% of the speech signals (420 in total, 70 per emotion) are used for training the classifiers, while 30% of the speech signals (180 in total, 30 per emotion) are used for testing the classifiers. The following experiments have been performed for testing.

5.1 Experiment-1:

Total 180 speech signals are taken from database for testing where 30 speech signals are of one emotion. Half of speech signals belong to male speakers and remaining belong to female speakers. There are total 90 speech signals spoken by male speakers and remaining 90 by female speakers. Pitch values of testing speech signals (180 speech signals) are obtained using subharmonic to harmonic ratio method. Average pitch value is calculated for all the speech signals belonging to male and female speakers separately. Following observations are obtained:

It has been observed that, with the help of pitch alone, the six emotions can only be classified into two classes. One class contains emotions whose pitch values are higher than the pitch value of neutral speech and the other class contains emotions whose pitch values are lower than that of neutral speech. The average pitch value of neutral speech thus partitions the whole set into two classes: one class contains the anger, fear and happy emotions, whose pitch values are greater than that of the neutral emotion, while the other contains the boredom and sad emotions, whose pitch values are lower than that of the neutral emotion.

Table 1 shows the average pitch value for male and female speakers. The average pitch values for the anger, fear and happy emotions are greater than the average pitch value of the neutral emotion, while the average pitch values of the boredom and sad emotions are lower than the average pitch value of the neutral emotion.

Table 1
Average pitch value of each emotion for male and female speakers

Emotions	Anger	Fear	Happy	Neutral	Boredom	Sad
Male Speakers	222.55	183.55	219.3	148.7	118.9	135.65
Female Speakers	293.5	261.8	246.55	190.15	184	164.4

Figure 7 shows the average pitch value of speech signals for each emotion spoken by both male and female speaker. The average pitch value of neutral speech signals for both male and female speakers divides remaining emotions into two classes. The average pitch values obtained from neutral speech of male and female speakers are mentioned in Fig. 7.

Fig.7

Pitch value in different emotions for male speakers and female speakers.

Table 2 shows the average pitch value irrespective of gender of speakers. For each emotion, average pitch value is calculated for all the speech signals. Following observations are made: The average pitch values of each of anger, fear and happy emotions are greater than average pitch value of the neutral emotion while the average pitch values of each of boredom and sad emotions are lesser than average pitch value of neutral emotion.

Table 2

Average pitch value of each emotion for speakers irrespective of male or female

Emotions	Anger	Fear	Happy	Neutral	Boredom	Sad
Male or Female Speakers	258.025	222.675	232.92	169.42	151.45	150.025
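The observations drawn from Tables 1 and 2 can be checked with a short Python sketch. The pitch values are copied from Table 1; the function name is hypothetical. The sketch also shows that the Table 2 values coincide, up to rounding, with the mean of the male and female rows of Table 1.

```python
MALE = {"anger": 222.55, "fear": 183.55, "happy": 219.3,
        "neutral": 148.7, "boredom": 118.9, "sad": 135.65}
FEMALE = {"anger": 293.5, "fear": 261.8, "happy": 246.55,
          "neutral": 190.15, "boredom": 184.0, "sad": 164.4}


def split_by_neutral(pitch):
    """Group emotions by whether their average pitch is above or below neutral."""
    neutral = pitch["neutral"]
    higher = sorted(e for e, v in pitch.items() if v > neutral)
    lower = sorted(e for e, v in pitch.items() if v < neutral)
    return higher, lower


# Gender-independent averages, computed as the mean of the two rows of Table 1
combined = {e: (MALE[e] + FEMALE[e]) / 2 for e in MALE}

for name, table in (("male", MALE), ("female", FEMALE), ("combined", combined)):
    print(name, split_by_neutral(table))
```

In all three cases the same split appears: anger, fear and happy lie above the neutral pitch, while boredom and sad lie below it. Pitch alone therefore separates two groups of emotions but not the individual emotions within a group.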

Figure 8 shows the average pitch values of all the speech signals for each emotion, irrespective of male or female speaker. The average pitch value of the speech signals belonging to neutral speech is marked in Fig. 8.

Fig.8

Average pitch value in different emotions for speakers.

5.2 Experiment-2:

The results obtained from experiment 1 show that the pitch value alone cannot recognize the specific emotion of the speaker. Some other features are additionally required for correctly recognizing the emotions. The wavelet packet transform features, wavelet packet energy and wavelet packet entropy, are used along with pitch for recognizing the emotions. The combined feature vector is of size 384 (1 for pitch, 128 for wavelet packet energy and 255 for wavelet packet entropy). Two classifiers, SVM and RF, are used for recognizing the emotions from the speech signals. A total of 420 speech signals are used for training each classifier. Pitch and wavelet packet transform features are extracted from all 420 speech signals and then PCA is used to reduce the feature vectors. The SVM and RF classifiers are trained using the reduced feature vectors. The remaining 180 speech signals (30 speech signals for each emotion) are used for testing the trained classifiers. Both trained classifiers are tested with the reduced feature vectors and error matrices are generated. The error matrices for the two trained classifiers are shown in Tables 3 and 5, which show how many emotions are correctly identified. Tables 4 and 6 show the recognition accuracy of each emotion with the SVM and Random Forest classifiers respectively. Table 7 shows the overall recognition accuracy.

Table 3
Error Matrix of SVM classifier using Pitch and Wavelet Packet Transform features

Count	Predicted Emotions
	Anger	Boredom	Fear	Happy	Neutral	Sad	Total speech signals for each emotion
Anger	27	0	1	1	1	0	30
Boredom	0	24	1	1	2	2	30
Fear	1	2	24	1	1	1	30
Happy	1	0	2	26	1	0	30
Neutral	1	1	1	1	25	1	30
Sad	0	1	1	1	1	26	30

Table 4

Accuracy of each emotion with SVM classifier

Emotions	Accuracy (%)
Anger	90
Boredom	80
Fear	80
Happy	86.6
Neutral	83.3
Sad	86.6

Table 5

Error Matrix of Random Forest classifier using Pitch and Wavelet Packet Transform features

Count	Predicted Emotions
	Anger	Boredom	Fear	Happy	Neutral	Sad	Total speech signals for each emotion
Anger	28	0	0	1	1	0	30
Boredom	0	25	2	1	1	1	30
Fear	1	1	25	1	1	1	30
Happy	2	0	0	27	1	0	30
Neutral	2	0	2	1	24	1	30
Sad	0	1	1	1	1	26	30

Table 6

Accuracy of each emotion with Random Forest classifier

Emotions	Accuracy (%)
Anger	93.3
Boredom	83.3
Fear	83.3
Happy	90
Neutral	80
Sad	86.6

Table 7

Overall Accuracy Matrix

Classifier/Model	Accuracy (%)
Random Forest with 300 trees	86.11
SVM	84.41
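The per-emotion and overall accuracies can be recomputed from the error matrices with the Python sketch below. It assumes that the rows and columns of Tables 3 and 5 follow the same emotion order, so that the diagonal holds the correctly recognized signals; the function names are hypothetical.

```python
EMOTIONS = ["anger", "boredom", "fear", "happy", "neutral", "sad"]

SVM = [[27, 0, 1, 1, 1, 0],
       [0, 24, 1, 1, 2, 2],
       [1, 2, 24, 1, 1, 1],
       [1, 0, 2, 26, 1, 0],
       [1, 1, 1, 1, 25, 1],
       [0, 1, 1, 1, 1, 26]]

RF = [[28, 0, 0, 1, 1, 0],
      [0, 25, 2, 1, 1, 1],
      [1, 1, 25, 1, 1, 1],
      [2, 0, 0, 27, 1, 0],
      [2, 0, 2, 1, 24, 1],
      [0, 1, 1, 1, 1, 26]]


def per_class_accuracy(matrix):
    """Percentage of each row's signals that fall on the diagonal."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Percentage of all test signals that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


for name, matrix in (("SVM", SVM), ("Random Forest", RF)):
    print(name, round(overall_accuracy(matrix), 2))
# SVM 84.44
# Random Forest 86.11
```

The Random Forest matrix gives 155 correct signals out of 180, which matches the 86.11% in Table 7. The SVM matrix gives 152 out of 180, about 84.44%, which differs marginally from the 84.41% reported in Table 7.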

5.3 Experiment-3

The objective of this experiment is to determine the minimum duration of speech signal required to correctly recognize the speaker’s emotion. Samples of increasing length are extracted from the starting point of the speech signal for testing the trained classifier. The recognition accuracy for samples of different lengths is given in Table 8. An accuracy of 83.95% is achieved by taking 16000 samples (duration 1 sec) from the starting point of the speech. The accuracy for fewer than 16000 samples is lower than 83.95%. Hence, a duration of 1 second (16000 samples) is sufficient to recognize the emotion.

Table 8
Accuracy of extracted speech samples of different length from starting

Number of samples taken from starting point of speech signals	Accuracy (%)
500	11.1
1500	29.5
2500	32.69
5000	42.65
8000	72.39
12000	81.67
14000	82.29
16000	83.95
18000	83.67
20000	84.13
22000	84.05
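The reading of Table 8 can be made explicit with a small Python sketch. The table values are copied from above; the function name and the tolerance of 0.5 percentage points are assumptions introduced only for this illustration and are not part of the experiment.

```python
FS = 16000  # sampling frequency in Hz

# (number of samples from the start of the signal, accuracy in %), from Table 8
TABLE_8 = [(500, 11.1), (1500, 29.5), (2500, 32.69), (5000, 42.65),
           (8000, 72.39), (12000, 81.67), (14000, 82.29), (16000, 83.95),
           (18000, 83.67), (20000, 84.13), (22000, 84.05)]


def shortest_sufficient_prefix(table, tolerance=0.5, fs=FS):
    """Shortest prefix whose accuracy is within `tolerance` points of the best."""
    best = max(acc for _, acc in table)
    for n_samples, acc in sorted(table):
        if acc >= best - tolerance:
            return n_samples, n_samples / fs
    raise ValueError("table is empty")


print(shortest_sufficient_prefix(TABLE_8))  # (16000, 1.0)
```

Under this assumed tolerance, 16000 samples (1 second at 16000 Hz) is the shortest prefix whose accuracy is close to the best value in the table, and longer prefixes add very little.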

5.4 Experiment-4

The objective of this experiment is to determine the location in the speech signal from which samples should be taken so that the emotion can be recognized correctly. Three locations are taken for the experiment at the interval of 8000 samples from the starting point of the speech signal. From each location, samples of size 2000, 4000 and 6000 are collected for the experiment. The results are shown in Table 9. It is clear from the table that the samples (around 6000) taken after an interval of 1 sec (16000 samples) from the start of the speech give the best result. This confirms that samples from the middle part of the speech signal give better accuracy than samples taken from the starting point.

Table 9
Accuracy of extracted speech samples of different length from marked locations

– Number of samples extracted Accuracy (%)

Beginning of speech signal (start from 0^th sample) 2000 32.7

4000 41

6000 71.2

Mid point of the speech (start from 16000^th sample) 2000 83.2

4000 84.1

6000 84.72

3/4^th length of speech (start from 24000^th sample) 2000 75.2

4000 74.9

6000 81.3

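Starting location	Number of samples extracted	Accuracy (%)
Beginning of speech signal (start from 0th sample)	2000	32.7
Beginning of speech signal (start from 0th sample)	4000	41
Beginning of speech signal (start from 0th sample)	6000	71.2
Mid point of the speech (start from 16000th sample)	2000	83.2
Mid point of the speech (start from 16000th sample)	4000	84.1
Mid point of the speech (start from 16000th sample)	6000	84.72
3/4th length of speech (start from 24000th sample)	2000	75.2
3/4th length of speech (start from 24000th sample)	4000	74.9
3/4th length of speech (start from 24000th sample)	6000	81.3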
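
The segment extraction behind Table 9 can be sketched as follows. This is only an illustration. It assumes the signal is a plain list of samples that is long enough to cover every requested segment. The start offsets and segment sizes are those listed in Table 9, while the function and variable names are hypothetical.

```python
# Start offsets (in samples) and segment sizes as listed in Table 9
START_OFFSETS = {"beginning": 0, "mid point": 16000, "3/4 length": 24000}
SEGMENT_LENGTHS = (2000, 4000, 6000)


def extract_segment(signal, start, length):
    """Return `length` consecutive samples of `signal` beginning at index `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be non-negative and length positive")
    if start + length > len(signal):
        raise ValueError("signal is too short for the requested segment")
    return signal[start:start + length]


def segment_grid(signal):
    """Build every (location, length) segment evaluated in the experiment."""
    return {
        (name, length): extract_segment(signal, start, length)
        for name, start in START_OFFSETS.items()
        for length in SEGMENT_LENGTHS
    }


if __name__ == "__main__":
    dummy_signal = list(range(32000))  # placeholder samples, not real speech
    grid = segment_grid(dummy_signal)
    print(len(grid), len(grid[("mid point", 6000)]))
```

Each of the nine segments would then be passed through the same feature extraction and trained classifier to fill in one row of the table.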

5.5 Experiment-5

The objective of this experiment is to confirm the anger of the speaker. The ratio of the silence part to the non-silence part of a speech signal depends on the emotion, so here we examine the relationship between this ratio and the emotion. Experiments have been performed on all the test speech signals taken from the EMO-DB database. Equation (5) is used to calculate the ratio. Speech signals with the anger emotion contain a larger number of silence frames than speech signals of the remaining emotions.

$\text{Ratio} = \frac{\text{number of frames in silence part}}{\text{number of frames in non-silence part}}$ (5)

Table 10 shows the average ratio calculated from speech signals of male and female speakers for each emotion. Table 11 shows the average ratio calculated over all speakers, irrespective of gender. It is clear from these tables that anger is confirmed when the average ratio is highest.
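
The ratio in equation (5) can be illustrated with a minimal sketch. The silence detector used here is an assumption: a frame is treated as silent when its mean squared amplitude falls below a fixed threshold. The frame length (320 samples, i.e. 20 ms at 16 kHz) and the threshold value are also hypothetical choices and are not taken from the experiment.

```python
def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame of frame_len samples."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]


def silence_ratio(signal, frame_len=320, energy_threshold=1e-4):
    """Equation (5): number of silence frames divided by number of non-silence frames.

    Assumption: a frame counts as silence when its energy is below
    energy_threshold; the source does not specify this detector here.
    """
    energies = frame_energies(signal, frame_len)
    silent = sum(1 for e in energies if e < energy_threshold)
    non_silent = len(energies) - silent
    if non_silent == 0:
        raise ValueError("signal contains no non-silence frames")
    return silent / non_silent


if __name__ == "__main__":
    # Synthetic signal: 2 silent frames followed by 4 constant-amplitude frames
    toy_signal = [0.0] * 640 + [0.5] * 1280
    print(silence_ratio(toy_signal))
```

Averaging this ratio over all signals of one emotion gives one entry of the kind reported in Tables 10 and 11, up to whatever scaling the tables use.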

Table 10

Average ratio of silence to non-silence part for male and female speakers

Emotions	Anger	Fear	Happy	Neutral	Boredom	Sad
Male Speakers	91.33	61.96	87.56	85.36	73.1	68.63
Female Speakers	89.76	40.93	69.6	69.53	64.7	87.8

Table 11

Average ratio of silence to non-silence part for all speakers

Emotions	Anger	Fear	Happy	Neutral	Boredom	Sad
Speakers	90.54	51.44	78.58	77.44	68.98	78.21

6 Comparison of results with state of the art work

In 2009, Iliou and Anagnostopoulos [23] extracted prosodic features (pitch, energy and duration) from speech signals taken from the EMO-DB database to recognize emotions. They recognized seven emotions (anger, boredom, disgust, fear, happy, neutral and sad) from speech signals with a recognition accuracy of 51%.

We have performed our experiment on speech signals with six emotions taken from the same database. A prosodic feature (pitch) and wavelet packet transform features are extracted. Because wavelet features contain both time and frequency domain information, combining them with pitch gives more accurate results. The recognition accuracy in our experiment is 86.11%.

In 2009, Bozkurt et al. [7] recognized the same set of emotions (anger, boredom, fear, happy, neutral and sad) from speech signals as we do. They reported a recognition accuracy of 63.03% with an HMM classifier, while our accuracy is 86.11% with the Random Forest classifier.

7 Conclusion

Emotion recognition is a growing research area because it supports the smooth working of many application areas. In this paper, six basic human emotions are recognized from speech signals. Emotion recognition using information from both the time and frequency domains is analyzed by combining prosodic and wavelet features. Random Forest is found to be a more appropriate classifier than SVM. The accuracy obtained here is better than that of the state of the art works available in the literature that use the same EMO-DB database and the same set of emotions.

Some properties of male and female speakers are different, and in future we plan to use more of these properties to enhance the accuracy. To assess the robustness of the proposed system, we will also test our approach on a database containing real life and noisy speech signals.

References

1. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32.
2. Gold and Morgan, Speech and audio signal Processing, John Wiley and Sons, 2003, ent….
3. Burges C.J., A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (1998), 121–167.
4. Chien et al, A new approach of audio emotion recognition, in: Expert Systems with Applications 41 (2014).
5. Daubechies I., Ten lectures on wavelets, SIAM, Philadelphia, PA, 1992.
6. Kaminska and Pelikant, Recognition of Human Emotion from a Speech Signal Based on Plutchik’s Model, in: Intl Journal of Electronics and Telecommunications 58(2) (2012), 165–170.
7. Bozkurt et al, Improving automatic emotion recognition from speech signals, 10th Annual Conference of the International Speech Communication Association (INTERSPEECH), Brighton, UK, (2009), 324–327.
8. Chenchaha and Lachiri, Speech emotion recognition in acted and spontaneous context, 6th International Conference on Intelligent Human Computer Interaction, IHCI (2014), 139–145.
9. Lokhande and Bapat K.S., Emotion recognition in speech, in: International Journal of Electrical, Electronics and Data Communication, ISSN: 2320-2084, 1(9) (2013).
10. Fayek H.M. et al, Towards real-time speech emotion recognition using deep neural networks, in: 9th International Conference on Signal Processing and Communication Systems (ICSPCS-2015), 2015.
11. Jolliffe I.T., Principal Component Analysis, Springer-Verlag, New York, 1986.
12. Cichosz K.S.J., Emotion recognition in speech signal using emotion extracting binary decision trees, Affective Computing and Intelligent Interfaces ACII, 2007, Lisbon, Doctoral Consortium.
13. Rabiner L.R., Juang B.H., Fundamentals of Speech Recognition, 1, Pearson Education, Delhi, 2003.
14. Vidrascu and Devillers, Real life emotion representation and detection in call centres data, LNCS, vol 3784, ACII, Berlin: Springer, 2005, pp. 739–746.
15. Hoque M.E. et al, Robust recognition of emotion from speech, Intelligent virtual agents, Lecture notes in computer science, Berlin: Springer, 2006, pp. 42–53.
16. Luggar and Yang, The relevance of voice quality features in speaker independent emotion recognition, ICASSP, Honolulu, Hawaii, USA, New York, IEEE Press, 2007, pp. IV17–IV20.
17. Mansoorizadeh and Charkari N.M., Speech emotion recognition: comparison of speech segmentation approaches, in: IKT 2007, 2007.
18. Lanjewar R.B. et al, Implementation and comparison of speech emotion recognition system using Gaussian Mixture Model (GMM) and K-Nearest Neighbor (K-NN) techniques, Proceedings of International Conference on Advances in Computing, Communication and Control, ICAC3 2015, 2015, pp. 50–57.
19. Lalitha et al, Emotion Detection using MFCC and Cepstrum Features, 4th International Conference on Eco-friendly Computing and Communication Systems, 2015, 29–35.
20. Zhang, Emotion recognition in Chinese neutral speech by combining prosody and voice quality features, Advances in neural networks, Lecture notes in computer science, Berlin: Springer, 2008, pp. 457–464.
21. Skurichina et al, Bagging, boosting and random subspace method for linear classifiers, Pattern Analysis and Applications 5(2) (2002), 121–140.
22. Yildirim et al, An acoustic study of emotions expressed in speech, International Conference on Spoken Language Processing (ICSLP 2004), Jeju Island, Korea, 2004.
23. Iliou, Anagnostopoulos C.N., Statistical evaluation of speech features for emotion recognition, Fourth International Conference on Digital Telecommunications, Colmar, France, 2009, 121–126.
24. Rajisha T.M. et al, Performance Analysis of Malayalam Language Speech Emotion Recognition System using ANN/SVM, in: International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST-2015).
25. Sun, Pitch determination and voice quality analysis using subharmonic to harmonic ratio, IEEE, 2002.
26. Peng Z.K. and Chu F.L., Application of the wavelet transform in machine condition monitoring and fault diagnostics: A review with bibliography, Mechanical Systems and Signal Processing.