Recognition of speech emotion using custom 2D-convolution neural network deep learning algorithm

Abstract

Speech emotion recognition has become central to most human-computer interaction applications in the modern world. The growing need to develop emotionally intelligent devices has opened up many research opportunities. Most researchers in this field have applied handcrafted features and machine learning techniques to recognise speech emotion. However, these techniques require extra processing steps, and handcrafted features are usually not robust. They are also computationally intensive, and the curse of dimensionality lowers their discriminating power. Research has shown that deep learning algorithms are effective for extracting robust and salient features from datasets. In this study, we have developed a custom 2D-convolution neural network that performs both feature extraction and classification of vocal utterances. The neural network has been evaluated against a deep multilayer perceptron neural network and a deep radial basis function neural network using the Berlin database of emotional speech, the Ryerson audio-visual emotional speech database and the Surrey audio-visual expressed emotion corpus. The described deep learning algorithm achieves the highest precision, recall and F1-scores when compared to other existing algorithms. It is observed that there may be a need to develop customized solutions for different language settings depending on the area of application.

Keywords

Deep learning; emotion recognition; feature extraction; learning algorithm; neural network; speech emotion

1. Introduction

Speech Emotion Recognition (SER) is an open research agenda that is laden with many challenges, which have inspired the present study. Emotion is a difficult word to define [1], but some researchers have defined it as a mixture of physiological responses and inner thoughts of people [2]. It plays a pivotal role in influencing the rational actions and decisions that human beings make on a daily basis [3]. Automatic Speech Emotion Recognition (ASER) has recently become popular because of its many useful applications in Human Computer Interaction (HCI) [4]. The recognition of emotion from speech has recently become an active research theme for applications such as smart health care [5], smart home [6], smart entertainment [7] and other important services. For example, SER can be used to monitor the emotion of agents in a call center in order to control their behavior when talking to customers [8]. Various methods have been developed for processing speech emotion in recent years [9]. Nevertheless, SER is a complex task because it is difficult to accurately identify the correct emotion for a given situation [10]. Furthermore, the distinction between different emotional tones is quite narrow [11]. Other important factors that make SER difficult include the dependency of emotions on language and culture and the selection of relevant features that are required to perform effective emotion recognition [12].

Many traditional speech recognition methods have been proposed in the literature over the past decade [13], but most of them use handcrafted features [14] such as prosodic and spectral features [15]. The intrinsic drawback of using handcrafted features is that they sometimes require extra processing steps such as feature selection [16, 17]. This presents a pressing need to replace the current standard methods with models that can adaptively learn low-level features from raw data and high-level features from low-level ones in a hierarchical manner. Such models should be able to remove the over-reliance on the choice of features and other preprocessing steps [18]. Recent research has shown that deep learning models can solve some of the above-mentioned problems [19]. In particular, deep neural networks have become popular in speech processing tasks such as speaker identification [20], gender identification [21], Alzheimer recognition [22] and many more. A growing interest has been noticed in the application of deep neural networks in speech analysis and language analysis. However, all of this raises the question of which method can be used to effectively extract salient features from raw speech files.

In this study, therefore, the use of custom 2D-CNN model is described as a feature extraction and speech classification tool for recognising emotion from one second frames in raw spectrogram data. Spectrograms of speech emotion are used in this study for feature generation because they contain acoustic and semantic features that could improve the accuracy of emotion classification task [23]. Moreover, we have performed a comparative analysis of the proposed 2D-CNN deep learning algorithm against Deep Multi-Layer Perceptron (DMLP) and Deep Radial Basis Function Neural Network (DRBFNN) [24]. Experimental results are reported for three benchmark databases of Berlin Database of Emotional Speech (EMO-DB) [25], Ryerson Audio-Visual Database of Emotional Speech (RAVDESS) [26] and Surrey Audio-Visual Expressed Emotion (SAVEE) [27]. The distinctive contributions of this study to the existing research on speech recognition are the following:

• The application of DRBFNN and DMLP as feature extraction and classification tools for raw spectrograms to achieve high accuracy values without applying a feature selection algorithm.
• The validation of DRBFNN as an extremely fast algorithm for training raw spectrum data.
• The discovery that DMLP is effective in recognising angry, disgust, happy, neutral and sad emotion states.
• The development of a custom 2D-CNN speech recognition model that can recognise eight basic emotion states with high precision, accuracy, recall and F1-score.
• The description of a 2D-CNN algorithm that can detect disgust and fear emotion states better than the DRBFNN and DMLP based methods.

Related work is discussed in Section 2 to demonstrate the originality and relevance of this study. The material and methods of this study are given in Section 3. The study results are presented in Section 4. Conclusions of the present study and possible future extensions are discussed in Section 5.

2. Related work

Previously reported studies are reviewed to demonstrate the uniqueness and relevance of this study and to identify a possible gap. Yoon et al. [28] proposed a model that uses two Bi-directional Long Short-Term Memory (BLSTM) networks to obtain a salient representation of vocal utterances. They proposed an attention mechanism referred to as multi-hop. They trained their model to automatically infer the correlation between the modalities and obtained a weighted average accuracy of 76.5%. Fayek et al. [29] proposed a deep learning model based on CNN that uses spectrograms as input features of speech signals. The proposed model achieved 64.78% accuracy using the Interactive Emotional dyadic MOtion CAPture (IEMOCAP) database [30]. In [31], an attention-based CNN Long Short-Term Memory (LSTM) fully connected Dense Neural Network (DNN) layer (CNNLSTM-DNN) model was introduced and tested on the IEMOCAP dataset, where it achieved an average accuracy of 87.2%. The authors conducted experiments using five-fold cross validation and the results were consistent for neutral, happiness, sadness, surprise and questioning emotions. In a similar vein, the authors of [32] employed a variant of the WaveNet encoder to extract features from speech utterances and used a deep Residual Network (ResNet) to extract features from video signals. They fused these features using Multimodal Compact Bilinear (MCB) pooling to form a joint feature vector for the speech signal. The LSTM network was used to determine voice activity and they obtained an average accuracy of 91.52%.

A sparse autoencoder-based feature transfer learning method for SER was proposed in [33]. Several databases were used, including the EMO-DB and eNTERFACE database, which achieved an accuracy of 59.1% [34]. Zhu et al. [2] developed a novel classification method by combining Deep Belief Networks (DBN) and Support Vector Machine (SVM). They performed feature extraction using DBN and fed the extracted features to their DBN ensemble model, which achieved 95.8% classification accuracy using the Chinese Academy of Sciences (CAS) database. Zhang et al. [35] developed two CNN models for classifying emotions using video and audio files. The first CNN model was used to extract features from the mel-spectrograms generated from audio signals. The second one was used to process the corresponding video recordings. The extracted features were fused and fed to a deep fully connected neural network to classify anger, joy, sadness, surprise, fear and disgust emotions using the Ryerson Multimedia Research Lab (RML) dataset, obtaining an accuracy of 74.32%. In [36], an end-to-end model for emotion recognition from raw audio signals was proposed, wherein the features were extracted from raw audio files using CNN. These features were then fed to an LSTM network to capture the temporal information in the data using the REmote COLlaborative and Affective (RECOLA) interactions dataset. The proposed method outperformed handcrafted features on the RECOLA database, achieving an accuracy of 74.1% in recognising arousal.

An end-to-end audiovisual fusion system that jointly learns to extract features directly from pixels and audio waveforms and performs classification using Bidirectional Gated Recurrent Units (BGRUs) has been reported [37], which obtained 98% accuracy in recognizing speech. Kim et al. [38] proposed 3-Dimensional CNNs (3D CNNs) to learn spectro-temporal features for recognizing speech emotions. They designed the 3D CNNs to learn short and long-term spectro-temporal features with a moderate number of parameters. They did a comparative analysis of the proposed method and other state-of-the-art methods using seven datasets, including RECOLA and EMO-DB. Their proposed model improved the accuracy of detecting the sad emotion state by 17%. Fayek et al. [23] presented a SER system that makes use of Deep Neural Networks (DNNs). Their novel approach was used to recognize emotions from one second frames of raw speech spectrograms and achieved a classification accuracy of 60.53% using the eNTERFACE dataset. They also achieved a classification accuracy of 59.7% when they tested their model on the SAVEE dataset. In [39], a stochastic gradient descent optimized DNN was proposed using only three emotion states of angry, neutral and sad from the EMO-DB dataset. The authors used raw data without applying feature selection techniques and their trained model achieved an overall validation accuracy of 96.97%. Han et al. proposed a novel approach in [40], where they used segment-level features such as Mel-Frequency Cepstral Coefficients (MFCC), pitch period and harmonics-to-noise ratio, together with utterance-level features, to recognize speech emotions. They used a DNN to produce emotion probabilities for all the speech segments. These probabilities were then used to create the utterance-level features that were fed to the Extreme Learning Machine (ELM) based classification algorithm. They applied these techniques on the IEMOCAP database [30], achieving an accuracy of 54.3%. In [41], it was observed that CNN is sensitive to the sequence of images and that the deep learning model learns a dictionary of features that are portable across various languages. Furthermore, the researchers discovered that combining a Recurrent Neural Network (RNN) with a deep CNN can result in a notable increase in accuracy. In their work they applied Multiple Kernel Learning (MKL) to CNN to achieve a classification accuracy of 96.55% with feature selection.

Several methods have been suggested by various researchers for the purpose of recognising emotion through speech. Some authors in the literature have developed deep learning ensemble algorithms to improve classification accuracy and results were indeed promising. However, the problem with this assortment of methods as described in the literature is that they are more concerned with classification accuracy. The nature of the problem of speech emotion recognition requires more with regards to performance benchmarks. The analysis of SER models requires a profound evaluation that can be achieved through the use of precision, recall, F1-score and processing time. Most of the work in the literature was silent in this regard, yet the use of the above mentioned performance benchmarks is key to the evaluation of SER models.

Furthermore, the evaluation of a SER model requires an in-depth analysis of each of the emotions involved. From the literature survey, it is apparent that recognising fear and disgust remains a significant problem. The importance of emotions can vary based on the application domain. For example, fear is an important emotion when developing a SER model for a police call centre. In the same vein, disgust is very important in environments that require an evaluation of customer satisfaction. In addition, some of the methods suggested in the literature involve the use of transfer learning with pre-trained deep models such as ResNet. The problem with these models is that, if they are used to develop new applications on a new set of data that they have never been exposed to, the developed model will inherit certain innate problems of transfer learning models. In particular, the models may need time to learn new data patterns because they are pre-trained on Google images that do not include audio spectrogram images.

In addition, handcrafted features have been widely used in speech processing and have yielded good results. However, they are not suitable for recognising emotion through speech when the data is huge. The advent of deep learning models has brought remarkable improvements in the processing of large amounts of data. Consequently, this inspired us to develop a custom 2D-CNN deep learning model that is robust enough to recognize disgust and fear emotion states with high precision, recall, F1-score and accuracy.

3. Material and methods

The material for this study comprises the experimental datasets that were employed to validate the proposed custom 2D-CNN algorithm. The datasets are EMO-DB, RAVDESS and SAVEE, which are subsequently discussed. The study methods follow three main phases, which are spectrogram generation (preprocessing), feature extraction and feature classification [42]. Spectrogram generation involves the conversion of raw audio data to a spectrogram representation. Feature extraction involves the extraction of salient features from the spectrogram. Feature classification is the last phase, which involves a comparative analysis of the proposed algorithm against the existing Deep MLP (DMLP) and Deep RBFNN (DRBFNN) learning methods, implemented with Keras and TensorFlow in Python. Python gives the flexibility to prescribe the number of layers, the optimizers to use and many other settings to adjust an MLP or RBFNN, thus transforming it into a deep neural network (DNN) learning algorithm. Figure 1 illustrates the set of phases that were followed in conducting the experiments of this study. The figure depicts the components of audio data, spectrogram generation, feature extraction and feature classification, which are subsequently described.

Figure 1.

Flow chart of the proposed study methods.

3.1 Experimental databases

The databases that were utilized in this study to validate the performance of the proposed custom 2D-CNN algorithm are EMO-DB [25], RAVDESS [26] and SAVEE [27].

3.1.1 Emodb

The acted EMO-DB speech corpus [25] consists of 535 German vocal utterances. It is a multiclass speech emotion database that consists of seven different acted emotions, which are anger, joy, sadness, neutral, boredom, disgust and fear. The dataset is a collection of the voices of ten professional native German-speaking actors, five females and five males, who were asked to simulate these emotions. The simulations were recorded in an anechoic chamber using high-quality recording equipment. The recordings were produced at a sampling rate of 16 kHz with a 16-bit resolution and mono channel. The version of EMO-DB used in this study consists of 340 audio files. Each of the audio files is approximately 3 seconds long. The angry class consists of 127 audio files as shown in Fig. 2. The happy class has a total of 72 audio files, while 79 vocal utterances were recorded in the neutral class and the sad class consists of 62 audio files.

Figure 2.

Emodb dataset.

3.1.2 Ravdess

The RAVDESS speech corpus [26] is a collection of validated multimodal speeches and songs. It consists of speeches of 24 professional gender-balanced actors with 12 females and 12 males. These actors were recorded speaking two similar statements in a neutral North American accent. This database comprises 8 speech emotion classes, which are angry, happy, neutral, calm, sad, surprised, fear and disgust expressions. The recorded vocal utterances of the 24 actors are available in three modality formats, which are Audio-only (16 bit, 48 kHz .wav), Audio-Video (720p H.264, AAC 48 kHz, .mp4), and Video-only (no sound). The angry, calm, disgust, fear, happy and sad emotion classes consist of 192 audio files each, while the surprise class contains a total of 184 files as shown in Fig. 3. Only 96 files were recorded in the neutral emotion state.

Figure 3.

Ravdess dataset.

3.1.3 Savee

The SAVEE dataset [27] is an audio-visual database that comprises 480 British English vocal utterances. These vocal utterances were recorded from four male actors in seven different emotions, which are anger, disgust, fear, happiness, sadness, surprise and neutral. The recordings were done in an advanced media laboratory with high quality audio-visual equipment. The neutral class consists of 120 audio files and all the other remaining classes have a total of 60 audio files each as illustrated in Fig. 4.

Figure 4.

Savee dataset.

Figure 5.

EMO-DB spectrograms showing four emotion states.

Spectrograms of speech emotions were generated in this study to provide input to the 2D-CNN, DMLP and DRBFNN classification algorithms for feature extraction, instead of using handcrafted features [43] such as MFCCs. Spectrograms are spectral representations of speech signals [44]. The use of speech spectrogram data transforms the SER problem into an image processing task, for which different image processing methods abound. The short-time Fourier transform spectrograms were generated using Hamming window frames as done in [18]. The Hamming window frames used were of length 25 milliseconds. A stride of 12.5 milliseconds was applied to the original speech waveforms, including both voiced and unvoiced components. Spectrograms representing spectral magnitudes for blocks of frames of time duration of 1 second with a stride of 10 milliseconds between blocks were calculated using Librosa, which is an audio processing Python package. The audio files of the experimental databases were converted to spectrograms by applying the above processes, and the results are shown in Figs 5–7.
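The framing described above can be sketched in a few lines of Python. The sketch is a minimal illustration under stated assumptions and is not the implementation used in this study: it relies on NumPy rather than Librosa, the function name `stft_magnitude` and its parameters are introduced here for illustration, and the 1 second blocking step is left out. It applies a 25 ms Hamming window with a 12.5 ms stride and returns the magnitude of the short-time Fourier transform.

```python
import numpy as np


def stft_magnitude(signal, sample_rate, win_ms=25.0, hop_ms=12.5):
    """Magnitude spectrogram with shape (frequency bins, frames).

    Hypothetical helper: a Hamming window of win_ms milliseconds is moved
    over the waveform with a stride of hop_ms milliseconds.
    """
    win_len = int(round(sample_rate * win_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < win_len:
        raise ValueError("signal is shorter than one analysis window")

    n_frames = 1 + (len(signal) - win_len) // hop_len
    window = np.hamming(win_len)

    # Cut the waveform into overlapping frames, one frame per row.
    frames = np.stack(
        [signal[i * hop_len: i * hop_len + win_len] for i in range(n_frames)]
    )

    # Real-input FFT of every windowed frame; keep only the magnitude.
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.abs(spectrum).T


# Synthetic input (an assumption for the demo): a 1 second, 1 kHz tone
# sampled at 16 kHz, the sampling rate reported for EMO-DB.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * t)

spec = stft_magnitude(tone, sample_rate)
print(spec.shape)
```

At 16 kHz the window spans 400 samples and the stride 200 samples, so a 1 second signal yields 79 frames of 201 frequency bins each. A real pipeline would read the audio files from disk and rescale the resulting image before classification.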

Figure 6.

Ravdess spectrograms.

Figure 7.

Savee spectrograms.

3.2 Feature extraction

Feature extraction is the process of converting raw data to a data set that presents a reduced number of attributes containing the most discriminatory information. The original goal of the process was to reduce the data dimensionality, remove redundant information and convert data to a format that is more appropriate for subsequent classification. Research has shown that a classification model is as good as the features used in the classification process. Most of the solutions suggested in the literature employ the end-to-end technique in which raw audio files are fed into classification models. Based on the literature survey we observed that this method yields low accuracy scores because of the inefficient feature extraction process involved. This is why we have decided to evaluate our custom 2D-CNN deep learning algorithm using spectrograms as our core features. The spectrogram generation process is important because it has helped in generating the required spectrograms that were fed to the deep learning models for feature classification.

3.3 Feature classification

CNNs are one of the most popular techniques used in various feature classification tasks such as text processing and image classification [45]. The networks apply a series of filters to raw pixel data to extract and learn higher level features in images. These are the features that a CNN model can use to classify images [46]. CNNs were introduced towards the end of the 20th century, but did not gain popularity because of high computational cost. They later came into the limelight in 2012, after AlexNet won the ImageNet challenge [47]. The networks have recently become popular because of the increased availability of Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs) that offer the required processing power [48]. A CNN is a special form of feed-forward neural network, which comprises numerous filter steps and one classification step. These filter steps consist of four different types of layers, which are the pooling layer, batch normalization layer, convolutional layer and activation layer. The classification stage is composed of a number of fully connected layers and a classification layer.
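A single filter step of the kind described above can be illustrated with a short NumPy sketch. It is an assumption-laden illustration rather than the network of this study: the activation is taken to be ReLU, batch normalization is omitted, the convolution uses no padding, and the helper names `conv2d_valid`, `relu` and `max_pool` are hypothetical.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image without padding (stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


def relu(x):
    """Assumed activation: negative responses are set to zero."""
    return np.maximum(x, 0.0)


def max_pool(x, size=2):
    """Non-overlapping max pooling; incomplete border blocks are dropped."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Random stand-ins for a 28 x 28 grey-scale spectrogram and a 4 x 4 filter.
rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.standard_normal((4, 4))

feature_map = conv2d_valid(image, kernel)
pooled = max_pool(relu(feature_map))
print(feature_map.shape, pooled.shape)
```

In a trained network the kernel values are learned, and many such kernels run in parallel to produce one feature map per filter.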

Figure 8.

Basic architecture of custom 2D-CNN algorithm.

Figure 8 shows the basic architecture of the custom 2D-CNN for processing spectrogram images of emotion. In this study, we have used the Keras Application Programming Interface (API) with TensorFlow as the backend to develop the proposed custom 2D-CNN. Since the spectrograms generated in the preprocessing stage were in RGB format, we converted them to grey scale for efficient processing. We have implemented a deep neural network consisting of convolutional and fully connected layers to classify speech emotion from three labeled datasets, which are EMO-DB, RAVDESS and SAVEE. The labeled datasets consist of spectrogram images of size 28 $\times$ 28 $=$ 784 pixels. For each of the three datasets, the test set was created using 33% of the data. The designed architecture of the proposed custom 2D-CNN had 4 convolution layers, 3 max-pooling layers, 2 fully connected layers and an output layer, as listed in Table 1.
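The grey-scale conversion and the 33% hold-out split mentioned above could look roughly as follows. This is a sketch under assumptions, not the code of this study: the luminance weights (0.299, 0.587, 0.114) are a common convention that the text does not specify, the helper names are hypothetical, and random arrays stand in for the spectrogram images and their labels.

```python
import numpy as np


def rgb_to_grey(images):
    """Collapse the RGB axis of an (n, height, width, 3) array."""
    weights = np.array([0.299, 0.587, 0.114])  # assumed luminance weights
    return images @ weights


def split_train_test(features, labels, test_fraction=0.33, seed=0):
    """Shuffle once, then hold out test_fraction of the samples."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(test_fraction * len(features)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], features[test_idx],
            labels[train_idx], labels[test_idx])


# Placeholder data: 340 RGB images of 28 x 28 pixels with 4 class labels,
# matching the size of the EMO-DB subset described earlier.
rng = np.random.default_rng(1)
images = rng.random((340, 28, 28, 3))
labels = rng.integers(0, 4, size=340)

grey = rgb_to_grey(images)
x_train, x_test, y_train, y_test = split_train_test(grey, labels)
print(grey.shape, x_train.shape, x_test.shape)
```

A stratified split, which keeps the class proportions equal in both subsets, may be preferable for the unbalanced classes reported here; the plain shuffle above is the simplest variant.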

It is extremely difficult to distinguish speech emotion visually through observing the spectrogram images because the difference between spectrograms for each emotion state is small. Consequently, 2D-CNN is applied to extract features from raw spectrogram images of speech emotion. The parameter settings of the proposed custom 2D-CNN are listed in Table 1. The filter size used to develop the proposed model is 4*4 for the first two convolutional layers. The remaining two layers employed a 3*3 filter size. A 2*2 max-pooling was applied for each convolutional layer. The first fully connected layer had 256 units while the second one had 128 units. The softmax classifier was used at the end to identify speech emotion states.

Table 1

Detailed architecture of the proposed custom 2D-CNN algorithm

Layer type	Number of filters	Size of feature map	Size of kernel	Stride	Padding
Image input layer	–	28*28*1	–	–	–
Convolution Layer 1	32	28*28*32	4*4	2	2
Max Pooling Layer 1	1	14*14*1	2*2	1	0
Convolution Layer 2	64	28*28*64	4*4	2	2
Max Pooling Layer 2	1	14*14*64	2*2	1	0
Convolution Layer 3	128	28*28*128	3*3	2	2
Max Pooling Layer 3	1	14*14*128	2*2	1	0
Convolution Layer 4	128	28*28*128	3*3	2	2
Full Connection Layer 1	128	256*1	–	–	–
Full Connection Layer 2	–	128*1	–	–	–
Output Layer	–	8*1	–	–	–
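How the spatial size of a feature map evolves through a stack of convolution and pooling layers can be traced with a few lines of plain Python. The sketch below follows the layer order and filter counts of Table 1, but it assumes 'same' padding with unit stride for the convolutions and non-overlapping 2*2 pooling. These are assumptions that do not come from the stride and padding columns of Table 1, so the printed sizes are illustrative only. The function name `trace_shapes` is hypothetical.

```python
# Layer order follows Table 1: (name, kind, number of filters).
LAYERS = [
    ("conv1", "conv", 32),
    ("pool1", "pool", None),
    ("conv2", "conv", 64),
    ("pool2", "pool", None),
    ("conv3", "conv", 128),
    ("pool3", "pool", None),
    ("conv4", "conv", 128),
]


def trace_shapes(height=28, width=28, channels=1, pool=2):
    """Return [(layer name, (height, width, channels)), ...] for the stack."""
    shapes = []
    for name, kind, filters in LAYERS:
        if kind == "conv":
            # Assumed 'same' padding with stride 1: only the depth changes.
            channels = filters
        else:
            # Assumed non-overlapping pooling: stride equals the pool size.
            height = (height - pool) // pool + 1
            width = (width - pool) // pool + 1
        shapes.append((name, (height, width, channels)))
    return shapes


shapes = trace_shapes()
for name, shape in shapes:
    print(f"{name}: {shape}")

# Length of the vector handed to the first fully connected layer.
final_h, final_w, final_c = shapes[-1][1]
flattened = final_h * final_w * final_c
print("flattened features:", flattened)
```

Under these assumptions the 28*28 input shrinks to 14*14, 7*7 and 3*3 after the three pooling layers, and the last convolution hands 3*3*128 values to the fully connected layers.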

Figure 9.

DRBFNN validation accuracy on EMODB.

4. Results and analysis

In this study, we have built a custom 2D-CNN deep learning algorithm consisting of convolutional, pooling and fully connected layers. The algorithm was trained with a standardized input dataset cleared of silent segments using mini-batches of size 28. Moreover, it was experimentally tested against two other algorithms based on RBFNN and MLP on the EMODB, RAVDESS and SAVEE benchmark speech emotion databases. This benchmarking provides an in-depth analysis that informs the development of SER systems across a variety of platforms such as mobile devices and desktop computers. The standard performance metrics of processing time, validation accuracy, validation loss, precision, recall and F1-score were used in the experimental study.

Table 2

Accuracy analysis of three classifiers on speech emotion spectrograms dataset

Classifier	EMODB	RAVDESS	SAVEE
DRBFNN	97.050	98.220	93.750
DMLP	94.110	93.950	97.910
2D-CNN	93.510	98.220	90.200

Figure 10.

DRBFNN validation accuracy on RAVDESS.

Figure 11.

DRBFNN validation accuracy on SAVEE.

Figure 12.

DMLP validation accuracy on EMODB.

Figure 13.

DMLP validation accuracy on RAVDESS.

Figure 14.

DMLP validation accuracy on SAVEE.

Figure 15.

2D-CNN validation accuracy on EMODB.

Figure 16.

2D-CNN validation accuracy on RAVDESS.

Figure 17.

2D-CNN validation accuracy on SAVEE.

The study experiments were conducted on a computer with an i7 2.3 GHz processor and 8 GB of RAM. Table 2 and Figures 9 to 17 show the validation accuracies obtained using the three classifiers investigated. DRBFNN obtained the highest validation accuracy score of 97.05% on the EMODB database. The 2D-CNN yielded the lowest validation accuracy of 93.51%, while 94.11% accuracy was achieved using DMLP, as shown in Figs 15 and 12. The results changed when the classifiers were applied on the RAVDESS database, wherein the highest validation accuracy of 98.22% was achieved by both 2D-CNN and DRBFNN. DMLP produced the lowest validation accuracy in this regard, as illustrated in Fig. 13. However, it obtained the highest score of 97.91% when the classifiers were applied to the SAVEE database. This time 2D-CNN was outperformed by DRBFNN, because it obtained an accuracy score of 90.2% while the latter achieved 93.75% validation accuracy, as shown in Figs 17 and 11.

Table 3

Validation loss analysis of three classifiers on speech emotion spectrograms dataset

Classifier	EMODB	RAVDESS	SAVEE
DRBFNN	0.085	0.063	0.156
DMLP	0.112	0.263	0.124
2D-CNN	0.121	0.119	0.387

Figure 18.

DRBFNN test loss on EMODB.

Figure 19.

DRBFNN test loss on RAVDESS.

Table 4

Training time in seconds of three classifiers

Classifier	EMODB	Number of runs	RAVDESS	Number of runs	SAVEE	Number of runs
DRBFNN	7.1	1	57.4	1		1
DMLP	19.8	1	66	1	38.4	1
2D-CNN	1170	3	25200	5	14400	4

Figure 20.

DRBFNN test loss on SAVEE.

The reported results in Table 3 and Figs 18 to 26 show the test losses obtained using the three classifiers. DRBFNN recorded the lowest test loss of 0.085 when the classifiers were applied to the EMODB. The 2D-CNN obtained test loss of 0.121 while DMLP achieved 0.112. When the classifiers were applied on the RAVDESS dataset, the test losses recorded show that DRBFNN (0.063) performed better than the other classifiers. For the SAVEE database, DMLP recorded the lowest test loss score of 0.124 while the highest test loss was obtained using the 2D-CNN. The combined validation accuracy and test loss results show that DRBFNN is the overall best performing classifier. This is because it achieved the highest accuracy scores on EMODB and RAVDESS databases and it yielded the lowest test loss scores on the same databases.

Figure 21.

DMLP test loss on EMODB.

Figure 22.

DMLP test loss on RAVDESS.

Figure 23.

DMLP test loss on SAVEE.

Figure 24.

2D-CNN test loss on EMODB.

Figure 25.

2D-CNN test loss on RAVDESS.

Figure 26.

2D-CNN test loss on SAVEE.

Table 4 shows that DRBFNN is the fastest classifier in terms of training time, because it processed over a thousand spectrograms in just 57.4 seconds on the RAVDESS database. When DRBFNN was implemented on the three datasets, the training was done in a single run. DMLP also recorded a short processing time of 66 seconds on the RAVDESS database and likewise ran once, which suggests that it can be implemented in real-time systems. However, 2D-CNN was quite taxing in terms of training time. It processed the RAVDESS spectrograms for 25200 seconds, which translates to 7 hours. Moreover, the deep learning model ran five times to improve the accuracy and this took 35 hours, which translates to a day and 11 extra hours.

Tables 5–7 present the precision, recall and F1-score obtained by the three classifiers on the EMO-DB, SAVEE and RAVDESS benchmark databases. On RAVDESS (Table 7), the proposed custom 2D-CNN obtained 100% precision in every emotion state except calm and fear, and it recorded the highest overall precision, recall and F1-score of 98% each. On SAVEE (Table 6), the highest scores were achieved using DMLP, where 98% was obtained for each benchmark metric. DMLP was seen to be more effective in recognising angry, disgust, happy, neutral and sad emotion states, because the model obtained perfect precision scores of 100% for these states. In addition, DMLP obtained the best precision score in recognising the angry emotion state. On EMO-DB (Table 5), DRBFNN was efficient in recognising the happy emotion state (F1-score of 96%) and performed better than all the other classifiers in detecting happy emotion. However, 2D-CNN achieved the best overall precision of 97%, recall of 97% and F1-score of 97% on this database, with perfect precision scores of 100% in recognizing neutral and sad emotion states.

Table 5

Precision, recall and F1-score analysis on Emodb spectrograms

Emotion	DRBFNN			DMLP			2D-CNN
	Precision	Recall	F1-score	Precision	Recall	F1-score	Precision	Recall	F1-score
Angry	89	89	89	100	96	98	96	96	96
Happy	100	92	96	86	100	92	93	93	93
Neutral	73	85	79	100	84	91	100	100	100
Sad	96	92	94	85	100	92	100	100	100
MEAN	90	90	90	95	94	94	97	97	97

Table 6

Precision, recall and F1-score analysis on Savee spectrograms

Emotion	DRBFNN			DMLP			2D-CNN
	Precision	Recall	F1-score	Precision	Recall	F1-score	Precision	Recall	F1-score
Angry	88	78	82	100	100	100	92	100	96
Disgust	100	78	88	100	100	100	92	100	96
Fear	67	67	67	92	92	92	92	92	92
Happy	93	81	87	100	92	96	92	92	92
Neutral	14	50	22	100	100	100	100	96	98
Sad	88	62	73	100	100	100	92	100	96
Surprised	75	100	86	92	100	96	92	79	85
MEAN	84	79	80	98	98	98	94	94	94

Table 7

Precision, Recall and F1-score analysis on Ravdess spectrograms

Emotion	DRBFNN			DMLP			2D-CNN
	Precision	Recall	F1-score	Precision	Recall	F1-score	Precision	Recall	F1-score
Angry	76	48	59	97	97	97	100	100	100
Calm	52	77	62	87	94	90	95	100	97
Disgust	78	60	68	100	95	97	100	100	100
Fear	64	69	66	100	97	99	92	100	96
Happy	70	61	65	89	100	94	100	95	97
Neutral	55	35	43	100	86	93	100	95	97
Sad	77	64	70	87	87	87	100	97	99
Surprised	54	76	63	94	91	93	100	97	99
MEAN	66	64	64	94	94	94	98	98	98
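The per-class precision, recall and F1-score reported in Tables 5–7 follow the standard definitions, which can be computed from a confusion matrix. The sketch below is a minimal illustration in plain Python: the function name `per_class_metrics` and the two-class confusion matrix are hypothetical and do not come from the experiments of this study.

```python
def per_class_metrics(confusion):
    """Return [(precision, recall, f1), ...], one tuple per class.

    confusion[i][j] counts the samples of true class i that were
    predicted as class j.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, wrong
        fn = sum(confusion[k]) - tp                       # true k, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results


# Hypothetical counts for two classes (rows: true class, columns: predicted).
confusion = [
    [8, 2],
    [1, 9],
]

metrics = per_class_metrics(confusion)
for label, (p, r, f) in zip(["class 0", "class 1"], metrics):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Averaging the per-class values gives a summary row comparable to the MEAN rows of the tables, although the text does not state whether those means are weighted by class size.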

Table 8 shows a comparative analysis of the results obtained in this study and results from related work. In particular, Table 8 shows that features extracted from raw spectrograms have more discriminative power, especially when fed to artificial neural network models. The proposed 2D-CNN algorithm performs better than others, especially when compared to those algorithms that utilize CNN with spectrogram features. However, the apparent limitation of the custom 2D-CNN algorithm is its long learning time. Certain authors have argued that faster learning algorithms may acquire more training samples, but only when they are more effective will they achieve higher performance on unseen testing data [49]. This work places a strong emphasis on the tradeoff between method efficiency and method accuracy. Improving the processing speed of the custom 2D-CNN is a promising direction for future work.

Table 8

Comparison of the proposed method with related methods from the literature

Reference	Features	Classifiers	Corpus	Accuracy (%)
[33]	Spectrograms	Sparse autoencoder-based	eNTERFACE	59.50
[32]	Spectrograms	WaveNet encoder, ResNet, LSTM	Custom database	91.52
[31]	Spectrograms	Attention-based CNN-LSTM-DNN	IEMOCAP	87.20
[2]	Spectrograms	SVM and DBN	Chinese Academy of Sciences (CAS)	95.80
[35]	Spectrograms and Videos	CNN	RAVDESS	74.32
[36]	Spectrograms	CNN	RECOLA	74.10
[37]	Spectrograms and Videos	Bidirectional Gated Recurrent Units (BGRUs) applied to audio waveforms	Lip Reading in the Wild (LRW) database	98.00
[39]	Spectrograms	A stochastic gradient descent optimized DNN	EMO-DB	96.67
[41]	Spectrograms	Combined CNN and RNN	Multimodal sentiment analysis	96.55
[50]	MFCC, Fourier Parameter (FP), fundamental Frequency (F0), energy and Zero-Crossing Rate (ZCR)	SVM	Berlin speech database	88.88
[51]	Weighted Spectral Features based on Hu moments (HuWSF)	SVM	SAVEE	89.32
[52]	Glottal Compensation to Zero Crossings with Maximal Teager energy operator (GCZCMT)	SVM	Berlin speech database	84.45
[53]	eGeMAPS features	MLP neural network	RAVDESS	87.95
Proposed method	Spectrograms	DRBFNN	RAVDESS	98.22
Proposed method	Spectrograms	DMLP	RAVDESS	93.95
Proposed method	Spectrograms	Custom 2D-CNN	RAVDESS	98.22

5. Conclusion

In this paper, we have introduced a custom 2D-CNN deep learning algorithm and compared its performance with that of a deep learning based DMLP and a conventional DRBFNN. Three popular speech emotion databases were used in this study: EMODB, RAVDESS and SAVEE. These databases were chosen because they include different languages and accents. We carried out a series of experiments to develop a model that recognizes emotion states such as anger, happiness, sadness, calmness, fear, neutral, disgust and surprise. One of the main objectives of this study was to explore the performance of deep learning models and DRBFNN in extracting features for classifying various emotion states. The results show that the proposed custom 2D-CNN model is effective in recognising almost all of the above-mentioned emotions. However, the proposed model is resource intensive, as shown by its long processing time. In contrast, DRBFNN and DMLP are extremely fast, which makes them suitable for computing environments with low hardware specifications, such as handheld devices. However, they may not be a viable option when classifying vast amounts of data [19].

In addition, we have observed that the classifiers investigated in this study perform differently when applied to corpora in different languages. For example, DRBFNN achieved its highest precision of 100% in recognising happiness in utterances spoken in German, while achieving its lowest precision of 70% for the same emotion state in North American English. This suggests that a speech emotion recognition model may need to be customized according to the target market and application. The major limitation of this study is that the experiments were conducted using acted speech datasets. Furthermore, the speech datasets were free of noise because the recordings were made with state-of-the-art recording equipment. Results obtained on acted data may therefore differ substantially from results on real data. Recently, new features such as breath have been used to recognize emotion [54]. In future work, we would like to explore this technique further and improve the processing speed of the 2D-CNN. Moreover, we would like to evaluate CNN as a feature extraction tool in speech emotion recognition systems.

Acknowledgments

This work was supported by Durban University of Technology through the ICT and Society Research Group.

References

1. Lisetti C.L., Affective computing, Pattern Anal Appl 1(1) (1998), 71–73.

2. Zhu, Chen, Zhao, Zhou and Zhang, Emotion recognition from Chinese speech for smart affective services using a combination of SVM, Sensors 17 (2017), 1127–1138.

3. Scherer K.R., Vocal communication of emotion: A review of research paradigms, Speech Commun 40(1–2) (2003), 227–256.

4. Ashrafidoost, Recognizing emotional state changes using speech processing, 2016, pp. 0–5.

5. Vadovsky and Paralic, Parkinson’s disease patients classification based on the speech signals, 2017 IEEE 15th Int Symp Appl Mach Intell Informatics, 2017, pp. 000321–000326.

6. Hossain M.S. and Muhammad, An emotion recognition system for mobile applications, IEEE Access 5 (2017), 2281–2287.

7. Gomes and El-Sharkawy, Implementation of i-vector algorithm in speech emotion recognition by using two different classifiers: Gaussian mixture model and support vector machine, International Journal of Advanced Research in Computer Science and Software Engineering 6(9) (2016), 8–16.

8. Irastorza and Torres M.I., Analyzing the expression of annoyance during phone calls to complaint services, 7th IEEE Int Conf Cogn Infocommunications, CogInfoCom 2016 – Proc., no. CogInfoCom, 2017, pp. 103–107.

9. Hamzah, Jamil, Samah K.A.F.A., Mangshor N.N.A., Sabri and Roslan, Comparing statistical classifiers for emotion classification, 2017 7th IEEE Int Conf Syst Eng Technol ICSET 2017 – Proc., no. October, 2017, pp. 183–188.

10. Narendra N.P., Airaksinen, Story and Alku, Estimation of the glottal source from coded telephone speech using deep neural networks, Speech Commun 106(December 2018) (2019), 95–104.

11. Guidi, Gentili, Scilingo E.P. and Vanello, Analysis of speech features and personality traits, Biomed Signal Process Control 51 (2019), 1–7.

12. Basu, Chakraborty, Bag and Aftabuddin, A review on emotion recognition using speech, 2017 Int Conf Inven Commun Comput Technol, no. Icicct, 2017, pp. 109–114.

13. Bastos-Filho T.F., Ferreira, Atencio A.C., Arjunan and Kumar, Evaluation of feature extraction techniques in emotional state recognition, 4th Int Conf Intell Hum Comput Interact Adv Technol Humanit IHCI 2012, 2012.

14. Thakur, Adetiba, Olugbara O.O. and Millham, Experimentation using short-term spectral features for secure mobile internet voting authentication, Math Probl Eng 2015 (2015), pp. 1–21.

15. Zvarevashe and Olugbara O.O., Gender voice recognition using random forest recursive feature elimination with gradient boosting machines, in 2018 International Conference on Advances in Big Data, Computing and Data Communication Systems, icABCD 2018, 2018, pp. 1–6.

16. Lim, Jang and Lee, Speech emotion recognition using convolutional and recurrent neural networks, 2016 Asia-Pacific Signal Inf Process Assoc Annu Summit Conf APSIPA 2016, 2017.

17. Yogesh C.K. et al., A new hybrid PSO assisted biogeography-based optimization for emotion and stress recognition from speech signal, Expert Syst Appl 69 (2017), 149–158.

18. Stolar M.N., Lech, Bolia R.S. and Skinner, Real time speech emotion recognition using RGB image classification and transfer learning, 2017 11th Int Conf Signal Process Commun Syst ICSPCS 2017 – Proc, vol. 2018-Janua, 2018, pp. 1–8.

19. Zatarain-Cabada, Barron-Estrada M.L., González-Hernández and Rodriguez-Rangel, Emotion recognition using a convolutional neural network, Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics), vol. 10633 LNAI, 2018, pp. 208–219.

20. Xia and Liu, DBN-ivector framework for acoustic emotion recognition, Proc Annu Conf Int Speech Commun Assoc INTERSPEECH, vol. 08-12-Sept, 2016, pp. 480–484.

21. Junger et al., Sex matters: Neural correlates of voice gender perception, NeuroImage 79 (2013), 275–287.

22. Luz, Longitudinal monitoring and detection of Alzheimer’s type Dementia from spontaneous speech data, 2017 IEEE 30th Int Symp Comput Med Syst, 2017, pp. 45–46.

23. Fayek H.M., Lech and Cavedon, Towards real-time speech emotion recognition using deep neural networks, 2015 9th Int Conf Signal Process Commun Syst ICSPCS 2015 – Proc., 2015.

24. Adetiba, Olugbara O.O. and Taiwo T.B., Identification of pathogenic viruses using genomic cepstral coefficients with radial basis function neural network, Adv Intell Syst Comput 419(2015) (2016).

25. Alonso J.B., Cabrera, Medina and Travieso C.M., New approach in quantification of emotional intensity from the speech signal: Emotional temperature, Expert Syst Appl 42(24) (2015), 9554–9564.

26. Livingstone S.R. and Russo F.A., The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English, PLoS One 13(5) (2018).

27. Haq and Jackson P.J.B., Speaker-dependent audio-visual emotion recognition, in Auditory-Visual Speech Processing, 2009, pp. 1–6.

28. Yoon, Byun, Dey and Jung, Speech emotion recognition using multi-hop attention mechanism, IEEE Int Conf Acoust Speech Signal Process, 2019, pp. 2822–2826.

29. Fayek H.M., Lech and Cavedon, Evaluating deep learning architectures for speech emotion recognition, Neural Networks 92 (2017), 60–68.

30. Busso et al., IEMOCAP: Interactive emotional dyadic motion capture database, 2008, pp. 335–359.

31. Hifny and Ali, Efficient Arabic emotion recognition using deep neural networks, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6710–6714.

32. Ariav and Cohen, An end-to-end multimodal voice activity detection using WaveNet encoder and residual networks, IEEE J Sel Top Signal Process, 2019, p. 1.

33. Deng, Zhang and Marchi, Sparse autoencoder-based feature transfer learning for speech emotion recognition, in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, 2013, pp. 511–516.

34. Martin, Kotsia, Macq and Pitas, The eNTERFACE’05 audio-visual emotion database, in Proceedings of the 22nd International Conference on Data Engineering Workshops (ICDEW’06), 2006, no. 1.

35. Zhang, Huang and Gao, Multimodal deep convolutional neural network for audio-visual emotion recognition, 2016, pp. 281–284.

36. Bonfiglio, Osservatorio Europeo. Il diritto alla salute dei cittadini europei e degli immigrati extracomunitari nell’ordinamento svizzero, Cittadinanza Eur (1) (2017), 121–135.

37. Petridis, Stafylakis, Cai, Tzimiropoulos and Pantic, End-to-end audiovisual speech recognition, ICASSP, IEEE Int Conf Acoust Speech Signal Process Proc, vol. 2018-April, 2018, pp. 6548–6552.

38. Kim, Truong K.P., Englebienne and Evers, Learning spectro-temporal features with 3D CNNs for speech emotion recognition, 2017 7th Int Conf Affect Comput Intell Interact ACII 2017, vol. 2018-Janua, 2018, pp. 383–388.

39. Harár, Burget and Dutta M.K., Speech emotion recognition with deep learning, 4th Int Conf Signal Process Integr Networks, 2017, pp. 137–140.

40. Han and Tashev, Speech emotion recognition using deep neural network and extreme learning machine, in Proc. INTERSPEECH 2014, September 2014, pp. 223–227.

41. Poria, Chaturvedi and Cambria, Convolutional MKL based multimodal emotion recognition and sentiment analysis, in 2016 IEEE 16th International Conference on Data Mining (ICDM), 2016, pp. 439–448.

42. Abe B.T., Olugbara O.O. and Marwala, Hyperspectral image classification using random forests and neural networks, Proc World Congr Eng Comput Sci I (2012), 522–527.

43. Adetiba and Olugbara O.O., Lung cancer prediction using neural network ensemble with histogram of oriented gradient genomic features, Sci World J, vol. 2015, 2015.

44. Zhao and Wang, A survey on automatic emotion recognition using audio big data and deep learning architectures, Proc – 4th IEEE Int Conf Big Data Secur Cloud, BigDataSecurity 2018, 4th IEEE Int. Conf. High Perform. Smart Comput. HPSC 2018 3rd IEEE Int. Conf. Intell. Data Secur, 2018, pp. 139–142.

45. Okuboyejo D.A., Olugbara O.O. and Odunaike S.A., Automating skin disease diagnosis using image classification, World Congr Eng Comput Sci WCECS 2013, Vol II, vol. Ao, 2013, pp. 850–854.

46. Weißkirchen, Böck and Wendemuth, Recognition of emotional speech with convolutional neural networks by means of spectral estimates, 2017 7th Int Conf Affect Comput Intell Interact Work Demos, ACIIW 2017, vol. 2018-Janua, no. November, 2018, pp. 50–55.

47. Deng, Eyben, Schuller and Burkhardt, Deep neural networks for anger detection from real life speech data, 2017 Seventh Int Conf Affect Comput Intell Interact Work Demos, 2017, pp. 1–6.

48. Chernykh and Prikhodko, Emotion recognition from speech with recurrent neural networks, no. ICCES, 2017, pp. 333–336.

49. Saito P.T.M., Nakamura R.Y.M., Amorim W.P., Papa J.P., de Rezende P.J. and Falco A.X., Choosing the most effective pattern classification model under learning-time constraint, PLoS One 10(6) (2015).

50. Wang B.N., Zhang and Li, Speech emotion recognition using Fourier parameters, IEEE Trans Affect Comput 6(1) (2015), 69–75.

51. Sun, Wen and Wang, Weighted spectral features based on local Hu moments for speech emotion recognition, Biomed Signal Process Control 18 (2015), 80–90.

52. Ying and Xue-Ying, Characteristics of human auditory model based on compensation of glottal features in speech emotion recognition, Futur Gener Comput Syst 81 (2018), 291–296.

53. Shaqra F.A., Duwairi and Al-Ayyoub, Recognizing emotion from speech based on age and gender using hierarchical models, Procedia Comput Sci 151(2018) (2019), 37–44.

54. Liu, Hussain M.J. and Liu, I sense you by breath: Speaker recognition via breath biometrics, IEEE Trans Dependable Secur Comput 5971 (2017), 1–15.
