Abstract
The ability to distinguish emotions in speech plays an important role in emotion-based studies, and many algorithms have been proposed to classify the kind of emotion. However, there is no measure of the fidelity of the emotion under consideration, primarily because most readily available annotated datasets are produced by actors rather than recorded in real-world scenarios. The predicted emotion therefore lacks an important aspect called authenticity, that is, whether the emotion is actual or stimulated. In this research work, we develop a hybrid convolutional neural network algorithm based on transfer learning and style transfer to classify both the emotion and its fidelity. The model is trained on features extracted from a dataset that contains stimulated as well as actual utterances. We compare the developed algorithm with conventional machine learning and deep learning techniques using accuracy, precision, recall and F1 score, and it outperforms them on these metrics. The research aims to dive deeper into human emotion and build a model that understands it as humans do; the model achieves precision, recall and F1 score values of 0.994, 0.996 and 0.995 for speech authenticity and 0.992, 0.989 and 0.99 for speech emotion classification, respectively.
Introduction
Experiments on audio signals have been conducted since it first became possible to capture them in 1877. Since then, algorithms have been developed to store the received audio signal as efficiently as possible, reproduce it when required, transmit it over long distances with minimal loss and remove noise. The next step in audio processing was feature extraction, which led to speech emotion recognition. In the last decade, large amounts of data have been generated and used to improve applications, leading to a better user experience and a significant impact on businesses. The combination of plentiful data and accelerated hardware has made it possible to develop these models.
Much research has focused on extracting the desired features and preparing a suitable model. In earlier research, the features were based on the spectral and prosodic properties of the signal [9]. These features included formants, energy, spectral coefficients, spectral centroid, spectral skewness, spectral roll-off, spectral flatness and spectral spread. The features are preprocessed with principal component analysis, and a cubic support vector machine is evaluated with k-fold cross-validation. Another approach uses a recurrent neural network with modulation spectral features and speaker normalization [11, 21]. A comparison of different models with different configurations shows that the recurrent neural network (RNN) outperforms support vector machine (SVM) based classification. A scale to visualize the features and the emotion is provided by the Schlosberg space, which has three orthogonal axes: valence, arousal and potency. Valence describes the positivity or negativity of the emotion, arousal denotes the activation or excitement of the speaker, and potency denotes the strength of the emotion [22]. A benchmark of the kinds of models that can be used to analyze emotions on a standard corpus, such as that of the deployed customer-care application HMIHY, paves the way for model selection [23]. The benchmark ranks the models in increasing order of accuracy as follows: interpolated language model classifier, mutual information-based feature selection classifier, kernel-based classifier with one-best and kernel-based classifier with lattices. Several machine learning models, such as k-nearest neighbors and a Gaussian mixture model, can be combined with techniques like bagging and boosting to produce an information fusion [12]. An important finding from using articulation features with a maximum likelihood Bayes classifier is that simple classifiers can lead to overfitting [14].
A more effective hidden Markov model based on Gaussian mixture models can be trained with iterative feature normalization, which also discriminates between neutral, negative and positive emotions [5]. Another line of classification compares the style of the spoken language, such as the dialects in different databases [1]. These styles have a large impact on the model and should be taken into account to build a robust model. To distinguish between stimulated emotions, there are two main features: voice quality parameters and dynamic changes [16, 18]. There have been many attempts at using neural networks and other machine learning models [6, 20], among which the hidden Markov model (HMM) has been widely praised for its ability to deal with states in an audio signal [7, 17]. The main feature on which most other research, and this project, is based is Mel frequency cepstral coefficients (MFCC), in combination with HMM and discriminant analysis [2, 26]. An MFCC-based feature extraction algorithm to identify emotions from speech and an SVM classification algorithm to conduct speech emotion recognition (SER) have been proposed, and the authors concluded that this approach was highly effective for automatic emotion prediction in speaker-independent experiments [28]. Guizzo, E. et al. proposed a Multi Time Scale (MTS) CNN layer for speech emotion recognition, which was effective and well suited to small datasets [29]. Vryzas, N. et al. proposed a CNN approach for continuous speech recognition on continual time frames; this method recognizes speech more accurately than an SVM model [30]. Kwon, S. proposed a CNN architecture to increase the accuracy and reduce the computational complexity of SER, in which a dynamic adaptive threshold method is used to enhance the speech signals, which are then converted to spectrograms [31]. Rachman, F.H. et al. proposed psycholinguistic lyric feature extraction and Fast Fourier Transform (FFT) based audio feature extraction, used in a hybrid approach that synchronizes structural and lyrical features to detect the emotion of a song. The authors achieved an F1 score of 0.82 [32].
In this research work, a combination of MFCC and LPCC is used to select the features that have the most impact when determining the type of emotion. We have developed a model that can classify both the emotion and its fidelity using different machine learning and deep learning techniques. We have also developed an application to demonstrate the usage of the proposed model. The model is trained on features extracted from a dataset that contains stimulated as well as actual utterances. The research aims to dive deeper into human emotion and build a model that understands it as humans do. The work could therefore prove beneficial for people with speech-related issues, helping them lead an independent life.
The paper is organized as follows: Section 2 describes the dataset, data preprocessing, feature extraction steps and the SpectroNet algorithm. Section 3 covers the classification algorithms. Finally, the results and discussion are presented in Section 4.
Materials and methodology
An overview of the proposed approach is shown in Fig. 1. The first step is to preprocess the data, removing noise so that the actual and the stimulated audio are brought to the same scale. After preprocessing, various features are extracted from the speech signal. The features are visualized as spectrograms, and SpectroNet classifies both the authenticity of the audio signal and the type of emotion. The step-by-step process is explained in the following subsections.

Flow diagram on the overview of model.
The SUSAS dataset from the Robust Speech Processing Laboratory at the University of Colorado-Boulder provides labelled audio files [15, 25]. The dataset comprises 16000 utterances recorded under actual and stimulated scenarios, where the person under test was an Apache helicopter pilot. The actual emotion was recorded during a flight, while the stimulated emotion was recorded during flight training. The dataset was captured using a microphone with a 16-bit A/D converter at a sample rate of 8 kHz, and each utterance has a duration of 1 to 2 seconds. The dataset provides a clean, labelled set of audio files in SPH format, a linear PCM based audio format commonly used in speech recognition research, with a 1024-byte human-readable ASCII header. The Toronto Emotional Speech Set (TESS) consists of 200 words spoken by two actresses aged 64 and 26 [27]. Each word was recorded in seven different emotions (anger, happiness, disgust, neutral, surprise, fear and sadness), giving a total of 2800 recordings in English.
Data pre-processing
The audio signals in the dataset were recorded using a microphone with a 16-bit A/D converter at a sample rate of 8 kHz under real-world scenarios. The recordings were in-flight conversations during a test or an actual flight [15], so the recorded audio is susceptible to noise. To remove the noise, the data was first converted to the .wav file format. A typical adult male voice has a fundamental frequency of 85 to 180 Hz, while a typical adult female voice is somewhat higher, at around 165 to 255 Hz. Thus, most of the information falls in the range of 50 to 350 Hz. A fifth-order FIR bandpass filter was designed for this frequency range and used to remove the noise. To prepare the data, MFCC and LPCC features are extracted for the machine learning models and the Mel spectrogram for the deep learning models. The sample length for MFCC and LPCC varied from 1 to 3 seconds, while the Mel spectrogram was computed over 3-second samples for the specialized CNN. All the MFCC and LPCC features are extracted using the librosa library in Python at a sample rate of 16000 Hz.
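The bandpass step can be illustrated with a minimal sketch. The paper does not state how the fifth-order FIR filter was designed, so the windowed-sinc method with a Hamming window, the function names `fir_bandpass` and `fir_filter`, and the 8 kHz test tone are all assumptions made for illustration only.

```python
import math

def fir_bandpass(num_taps, low_hz, high_hz, fs):
    """Windowed-sinc band-pass FIR design with a Hamming window (assumed method)."""
    f1, f2 = low_hz / fs, high_hz / fs          # cutoffs in cycles per sample
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2.0 * (f2 - f1)             # limit of the sinc difference at k = 0
        else:
            ideal = (math.sin(2 * math.pi * f2 * k)
                     - math.sin(2 * math.pi * f1 * k)) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def fir_filter(taps, signal):
    """Direct-form FIR filtering; the output has the same length as the input."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            if i - j >= 0:
                acc += h * signal[i - j]
        out.append(acc)
    return out

FS = 8000                                        # recording rate given for the dataset
taps = fir_bandpass(6, 50.0, 350.0, FS)          # fifth order means six taps
tone = [math.sin(2 * math.pi * 200 * n / FS) for n in range(800)]  # synthetic 200 Hz tone
filtered = fir_filter(taps, tone)
```

With only six taps the transition bands are very wide, so a filter of this order attenuates out-of-band noise only gently; a longer filter would be needed for sharp cutoffs.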
Feature extraction
Mel frequency cepstral coefficients (MFCC)
The step-by-step process of MFCC extraction is shown in Fig. 2a. The first step is to decide the window size. If the window is longer than required, the signal changes too much within the frame; if it is shorter than required, there are not enough samples to generate a meaningful spectral estimate. A window size of 20 to 40 ms is therefore ideal when dealing with audio signals [19]. Once the original signal has been partitioned into frames, the next step is to compute the periodogram spectral estimate of each frame. To mimic the cochlea, which perceives different frequencies depending on the position of the hair cells, we use Mel filter banks. Each filter indicates how much energy exists near a given frequency, and these energies can later be summed to determine the amount of energy within a frame. The Mel scale determines the position and width of each filter bank. This scale relates the perceived frequency of a pure tone to its actual frequency, so applying it helps us mimic what the human ear perceives.

Flow diagrams for Feature extraction procedures; a. MFCC, b. LPCC.
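The framing step of MFCC extraction can be sketched as follows. The 25 ms frame length lies within the 20 to 40 ms range discussed above; the 10 ms hop and the function name `frame_signal` are assumptions, since the paper does not give them.

```python
def frame_signal(samples, fs, frame_ms=25.0, hop_ms=10.0):
    """Split a signal into overlapping fixed-length frames (no padding)."""
    frame_len = int(round(fs * frame_ms / 1000.0))
    hop_len = int(round(fs * hop_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames

# One second of placeholder samples at 16 kHz, the rate used for feature extraction.
one_second = [0.0] * 16000
frames = frame_signal(one_second, fs=16000)
```

Each frame would then be windowed and passed to the periodogram and Mel filter bank stages.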
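Equation 1 converts the frequency to the Mel scale:
$$M(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$
Equation 2 converts the Mel scale back to frequency:
$$f = 700\left(10^{M/2595} - 1\right) \tag{2}$$
(where: M → Mel scale value; f → frequency in hertz)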
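A minimal sketch of the frequency-to-Mel conversion and its inverse is given here. It assumes the widely used constants 2595 and 700 for the Mel scale; the helper `mel_filter_centres`, which spaces filter centres evenly on the Mel scale, is a hypothetical name introduced for illustration.

```python
import math

def hz_to_mel(f_hz):
    """Frequency in hertz to Mel (common 2595 * log10 form, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse mapping from Mel back to hertz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_centres(low_hz, high_hz, n_filters):
    """Centre frequencies (Hz) of filters spaced evenly on the Mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centres = mel_filter_centres(0.0, 8000.0, 26)   # 26 filters is an illustrative choice
```

Because the spacing is uniform in Mel, the centres are dense at low frequencies and sparse at high frequencies, which is how the filter bank approximates human pitch perception.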
The logarithm of the filter bank energies is taken because the human ear perceives volume on a roughly logarithmic scale; in general, doubling the perceived volume requires 8 times the original power. The final step is to compute the discrete cosine transform of the log filter bank energies to de-correlate them, as the filter banks all overlap. To increase the performance of the model, we add coefficients that describe the trajectories of the MFCCs over time, namely the delta (differential) and delta-delta (acceleration) coefficients.
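The delta coefficients can be sketched with the regression formula commonly used for this purpose; the paper does not specify its exact delta computation, so the formula, the window width of 2 frames and the edge handling (repeating the first and last frame) are assumptions.

```python
def delta(frames, width=2):
    """Delta features: d_t = sum_n n * (c[t+n] - c[t-n]) / (2 * sum_n n^2).

    `frames` is a list of coefficient vectors, one per frame. Frames beyond
    the edges are replaced by the first or last frame (assumed edge handling).
    """
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        row = []
        for c in range(len(frames[0])):
            num = 0.0
            for n in range(1, width + 1):
                ahead = frames[min(t + n, last)][c]
                behind = frames[max(t - n, 0)][c]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out

# A coefficient that rises by 1 per frame has a delta of 1 away from the edges.
ramp = [[float(t)] for t in range(6)]
d1 = delta(ramp)        # delta (differential)
d2 = delta(d1)          # delta-delta (acceleration)
```

Applying the same function twice gives the acceleration coefficients, and the three sets are usually concatenated into a single feature vector per frame.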
LPCC gives us an imitation of the human vocal tract. Linear prediction evaluates the signal by approximating the formants, removing their effects from the speech signal and estimating the intensity and frequency of the residue. The purpose of using LPCCs is to take the speaker's vocal tract characteristics into account while performing automatic emotion recognition. The basic idea behind linear predictive analysis is that the nth speech sample can be estimated by a linear combination of its previous p samples, as shown in Equation (3). Figure 2b shows the step-by-step process of LPCC.
$$s(n) \approx \sum_{k=1}^{p} a_k \, s(n-k) \tag{3}$$
(where: s is the speech signal; a_1, a_2, ..., a_p are the predictor coefficients).
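The linear prediction idea can be sketched with the autocorrelation method and the Levinson-Durbin recursion, followed by the standard recursion from predictor coefficients to cepstral coefficients. The paper does not say which solver it uses, so this choice and the function names are assumptions; the decaying test signal is synthetic.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n - k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def lpc(x, order):
    """Predictor coefficients a_1..a_p with s(n) ~ sum_k a_k * s(n - k)."""
    r = autocorrelation(x, order)
    a = [0.0] * (order + 1)                 # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                       # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)                # remaining prediction error
    return a[1:]

def lpc_to_cepstrum(a, n_ceps):
    """c_m = a_m + sum_{k=1}^{m-1} (k/m) * c_k * a_{m-k}, with a_m = 0 for m > p."""
    p = len(a)
    c = []
    for m in range(1, n_ceps + 1):
        value = a[m - 1] if m <= p else 0.0
        for k in range(1, m):
            if 1 <= m - k <= p:
                value += (k / m) * c[k - 1] * a[m - k - 1]
        c.append(value)
    return c

# Synthetic signal obeying s(n) = 0.5 * s(n - 1), so a_1 should be close to 0.5.
signal = [0.5 ** n for n in range(60)]
coeffs = lpc(signal, 2)
ceps = lpc_to_cepstrum(coeffs, 3)
```

For this signal the second predictor coefficient is close to zero, which reflects that one past sample already explains the sequence.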
To observe the difference between emotions, the same utterance was spoken by a set of speakers and visualized with the help of MFCC spectrograms. Figure 3 shows the Mel spectrogram for the same word uttered with different emotions. Aggressive emotions tend to pack visibly more energy than polite emotions, which motivates developing a classifier that distinguishes the actual emotion from the stimulated emotion for each signal.

Mel power Spectrograms for various emotions.
Another important visualization is the comparison of actual and stimulated utterances of the same word in the same emotion. Figure 4 shows the actual and stimulated signal spectrograms for the two emotions, anger and neutral.

Actual and Stimulated Signal Mel power Spectrograms; a. Anger, b. Neutral.
Treating audio as an image may seem absurd, but it is reasonable when both time-domain and frequency-domain information can be extracted from the image. In SpectroNet, parallel pre-processing blocks process the input audio sample simultaneously, extract features and merge them into a single layer for further classification.
The Mel spectrogram is arguably one of the best ways to analyze low-frequency signals such as the human voice. To use this information, we have designed a convolutional neural network inspired by transfer learning and style-transfer deep learning approaches. Transfer learning involves using a pre-trained neural network and adjusting the final layers to match the number of classes to be predicted. A popular trend in style-transfer learning is to use pre-trained convolutional neural networks such as VGG-16, AlexNet and GoogLeNet. The content image and the style image are both fed into the convolutional neural network, and the result is extracted from a few of the intermediate layers. In our case, layers from VGG-16 were chosen for style-transfer learning, as shown in Fig. 5. After a thorough analysis of the classification results and the computation time, five layers of the VGG16 network were chosen to extract the output. The output images from these layers are then combined with weights from 0.5 to 2.5, one per image, reflecting the detail each brings into the final image. The weights are chosen so that the finer details are superimposed on the coarser details without losing their significance. They were selected by experimentation, increasing the weights in the order in which the layers occur in the convolutional neural network, and increasing the weight in steps of 0.5 was found to give the best results. These images are resized to 50×50. This unit forms the pre-processing block of the classifier.

Pre trained VGG16 layer.
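The weighting of the five layer outputs can be sketched as below. The weights 0.5 to 2.5 in steps of 0.5 come from the text; combining the maps by an element-wise weighted sum is an assumption (stacking the weighted maps as channels would be another reading), and the tiny 2×2 maps stand in for the resized layer outputs.

```python
def combine_feature_maps(maps, weights):
    """Element-wise weighted sum of equally sized 2-D feature maps."""
    rows, cols = len(maps[0]), len(maps[0][0])
    combined = [[0.0] * cols for _ in range(rows)]
    for fmap, w in zip(maps, weights):
        for r in range(rows):
            for c in range(cols):
                combined[r][c] += w * fmap[r][c]
    return combined

# Weights increase by 0.5 in the order the layers occur in the network.
LAYER_WEIGHTS = [0.5, 1.0, 1.5, 2.0, 2.5]

# Placeholder maps: five 2x2 maps of ones instead of real VGG16 layer outputs.
maps = [[[1.0, 1.0], [1.0, 1.0]] for _ in LAYER_WEIGHTS]
combined = combine_feature_maps(maps, LAYER_WEIGHTS)
```

Deeper layers receive larger weights in this scheme, so their contribution is not swamped by the coarser early-layer responses.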
Figure 6 shows the layers from which the pre-processing block is built: the initial image and the outputs of the five chosen layers are scaled into a 56×56×10 image. The output from each of these layers is reshaped to 56×56 and scaled to preserve the low-level details. These layers help extract the relevant features without increasing the training load.

Pre-processing block for SpectroNet.
The output of the pre-processing block is flattened and passed to a multi-layer perceptron (MLP) with the Rectified Linear Unit as its activation function, which in turn is connected to a softmax classifier, as shown in Fig. 7. The entire computation can therefore be summarized in three steps. First, the audio signal is captured and preprocessed for noise removal and feature extraction using MFCC and LPCC. Second, the spectrogram image formed after preprocessing is fed into the network. Third, the softmax layer at the end of the network performs the classification and gives the label corresponding to the emotion. Since the emotions are classified into 7 major categories, the output is a 7×1 vector holding the probability of each emotion.

SpectroNet Architecture.
For authenticity classification, the softmax layer maps the output of the MLP to two classes, actual or stimulated. The entire process is shown step by step in Algorithm 1.
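The classification head described above (a ReLU multi-layer perceptron followed by softmax) can be sketched as a forward pass. The layer sizes and weight values here are placeholders chosen for illustration, not the trained SpectroNet parameters, and `classify` is a hypothetical name.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def softmax(v):
    m = max(v)                                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, hidden_w, hidden_b, out_w, out_b):
    """Flattened features -> ReLU hidden layer -> softmax class probabilities."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))

# Placeholder parameters: 4 input features, 3 hidden units, 7 emotion classes.
features = [0.2, -0.1, 0.4, 0.3]
hidden_w = [[0.1, 0.2, -0.1, 0.0], [0.0, -0.3, 0.2, 0.1], [0.5, 0.1, 0.0, -0.2]]
hidden_b = [0.0, 0.1, -0.1]
out_w = [[0.1 * (i + 1), -0.05 * i, 0.02] for i in range(7)]
out_b = [0.0] * 7
probs = classify(features, hidden_w, hidden_b, out_w, out_b)
```

Using two output rows instead of seven gives the actual-versus-stimulated head; the predicted label is the index of the largest probability.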
This research initiative focuses on two types of classification: machine learning models applied to MFCC and LPCC features, and deep learning models applied to MFCC spectrogram images, to differentiate between actual and stimulated emotions.
Machine learning models
A. Naïve Bayes Classifier
Naïve Bayes is a machine learning model based on Bayes' theorem and is used to classify entities based on their feature values. Bayes' theorem is given in Equation (4).
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{4}$$
Bayes' theorem gives the probability of A given that B has already occurred. Here, A is the hypothesis and B is the observed evidence. The features are assumed to be independent, with no correlation between them, which is why the model is called naïve.
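A Gaussian naïve Bayes classifier can be sketched in a few lines to show how the independence assumption is used: the class score is the log prior plus a sum of per-feature Gaussian log-likelihoods. The toy two-dimensional feature vectors and the function names are assumptions for illustration, not the MFCC and LPCC data of the study.

```python
import math

def fit_gaussian_nb(X, y):
    """Per class: prior, per-feature mean and variance."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9   # small floor avoids zero variance
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Pick the class with the largest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: two well-separated clusters standing in for extracted features.
X = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [3.0, 4.0], [3.1, 3.9], [2.9, 4.2]]
y = ["actual"] * 3 + ["stimulated"] * 3
nb_model = fit_gaussian_nb(X, y)
```

Summing log-likelihoods over features is exactly where the model treats the features as independent given the class.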
B. Support Vector Machine (SVM)
SVM is a supervised machine learning algorithm that uses a separating hyperplane for classification. It is a discriminative classifier: labelled training data is fed to the learning model, which outputs an optimal hyperplane that categorizes a new feature set. In 2-D space, this hyperplane is a line that divides the plane into two parts, with each class lying on one side. The hyperparameters of the model are the kernel, the regularization factor and the gamma factor. The kernel can be linear, which uses the inner product of the input vectors with the support vectors in the training data, or it can be a polynomial or an exponential kernel, which enhances the power of SVMs by computing the separation in a higher-dimensional space.
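The kernels mentioned above can be written as small functions to make their differences concrete. The default values for `degree`, `coef0` and `gamma` are illustrative assumptions, and the exponential kernel is shown in its radial basis function (RBF) form.

```python
import math

def linear_kernel(x, z):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """(x . z + coef0) ** degree"""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); larger gamma makes the kernel more local."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

u, v = [1.0, 2.0], [3.0, 4.0]
k_lin = linear_kernel(u, v)
k_poly = polynomial_kernel(u, v, degree=2)
k_rbf = rbf_kernel(u, v)
```

Each kernel value acts as a similarity between a sample and a support vector, so swapping the kernel changes the shape of the decision boundary without changing the training procedure.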
C. Random Forest Classifier
A decision tree is a set of rules that determine the probability of an event, arranged in decreasing order of entropy gain at each level. A random forest, as the name suggests, is a large number of comparatively uncorrelated models that operate together and, when bagged, outperform their weaker constituents. Such techniques are called ensemble learning methods: each component is a machine learning model trained on a subset of the features, and the final outcome is chosen by majority vote.
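Three ingredients of the method described above can be sketched independently of any particular tree implementation: the entropy used to rank splits, the bootstrap sampling behind bagging, and the majority vote over the trees. The function names and the toy labels are assumptions for illustration.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def bootstrap_sample(rows, rng):
    """Sample with replacement, same size as the input (bagging)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Final class is the label predicted by most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)                       # fixed seed for repeatability
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
winner = majority_vote(["anger", "neutral", "anger"])
mixed = entropy(["actual", "actual", "stimulated", "stimulated"])
```

A split is preferred when it reduces the entropy of the child nodes the most, and each tree in the forest would be grown on its own bootstrap sample.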
A. AlexNet
AlexNet is a deep convolutional neural network (CNN) for image classification [13]. It has eight layers in total, of which five are convolution layers and the remaining three are fully connected layers that feed a softmax layer with 1000 class labels. It has a total of 62.3 million parameters. The main attribute of this architecture is that it applies the Rectified Linear Unit (ReLU) non-linearity after every convolution and fully connected layer, which reduces the training time of the model. In addition, dropout layers applied before the first and second fully connected layers tackle the problems of overfitting and vanishing gradients, which could harm the classification ability of the network. Over time, it has proven to be one of the most efficient CNN architectures for complex image classification problems. After training, AlexNet achieves 99.2% accuracy for authenticity classification and 97.14% accuracy for emotion classification.
B. VGG16
VGG16 is a deep convolutional neural network used for image classification tasks [24]. As the name suggests, it has a total of 16 layers, consisting of convolution layers, max-pooling layers, activation layers and fully connected layers. It takes a fixed-size 224×224 RGB image as input. Instead of having a large number of hyperparameters, VGG16 uses convolution layers with 3×3 filters, a stride of 1 and same padding, and max-pooling layers with 2×2 filters and a stride of 2. It follows this pattern of convolution and max-pooling layers consistently throughout the architecture. At the end, it has two fully connected layers and a softmax layer that produces the final output. After training, VGG16 achieves 98.4% accuracy for authenticity classification and 98.7% accuracy for emotion classification.
C. ResNet
The problem with most networks is that the vanishing gradient problem worsens as depth increases. It directly affects the weights: they stop updating, no learning takes place, the accuracy saturates and the network has trouble converging. Residual Networks (ResNet) solve the vanishing gradient problem by adding skip connections to a plain CNN, which add the output of an earlier layer to that of a later layer [8]. After training, ResNet achieves a validation accuracy of 96.4% for authenticity classification and 84.21% for emotion classification.
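The skip connection can be illustrated with a minimal residual block. This sketch uses a single dense layer as the residual function for brevity, whereas ResNet itself uses convolutional layers; the names and values are assumptions for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def residual_block(x, weights, bias):
    """y = ReLU(F(x) + x), where F is a dense layer with the same width as x."""
    fx = dense(x, weights, bias)
    return relu([f + xi for f, xi in zip(fx, x)])

x = [1.0, -2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
y = residual_block(x, zero_w, [0.0] * 3)    # F(x) = 0, so only the skip path contributes
```

Even when the learned function contributes nothing, the input still passes through the skip path, which is what keeps gradients flowing in deep networks.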
D. Implementation of the model
The training data for the AlexNet, VGG16 and ResNet models consists of Mel spectrogram images of stimulated and real speech for authenticity classification, and of different kinds of emotions for speech emotion classification. Each architecture is then trained with its last layer set to 2 output units for authenticity classification and 7 output units for emotion classification.
Results and discussion
The accuracy of the SpectroNet model was 99.27% with 9 pre-processing blocks and 99.5% with 15 blocks. Scalability and real-time deployment depend on the hardware as well as on optimization of the audio I/O libraries. It has been argued that spectrograms are not the right input for speech signal processing because of the sequential nature of the speech signal, and this criticism still applies to the current model. An attempt has been made to capture this sequential nature by placing the blocks sequentially. A prospect for this network is to change the architecture to include long-term and short-term memory, since a speech signal is sequential; this would help in determining both the type of emotion and its authenticity.
In this experiment, we compare the SpectroNet approach to speech analysis with existing machine learning models and state-of-the-art deep learning models for classifying voice signals as actual or stimulated. For the supervised machine learning models, MFCC and LPCC coefficients are computed as feature values for both stimulated and actual voice signals. The normalized data is then divided into a 75% training set and a 25% validation set. The first method, a Naïve Bayes classifier with a Gaussian distribution, achieves 72.2% accuracy and was evaluated for authenticity classification only. Next, an SVM model with the regularization parameter C (penalty parameter) set to 10 and a radial basis function (RBF) kernel achieves 97.04% accuracy for authenticity classification and 98.02% for emotion classification. The last model is a random forest classifier with 100 trees, which achieves 94.02% accuracy for authenticity classification and 98.4% for emotion classification. The values are shown in Table 1.
Hyper parameters of ML algorithms and the selection of values
For the deep learning models, the voice signals are pre-processed by removing noise, and an MFCC spectrogram is plotted for each stimulated and actual voice signal and saved in JPEG format. The dataset is again divided into a 75% training set and a 25% validation set. The state-of-the-art models mentioned above are trained on these images.
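The 75%/25% partition used for both model families can be sketched as a shuffled split. The fixed seed, the function name `train_val_split` and the placeholder file names are assumptions; the paper does not state how the split was drawn.

```python
import random

def train_val_split(items, val_fraction=0.25, seed=0):
    """Shuffle a copy of the items and split off a validation set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# Placeholder sample identifiers standing in for feature rows or image paths.
samples = ["utt_%03d" % i for i in range(100)]
train, val = train_val_split(samples)
```

Keeping the seed fixed makes the partition repeatable, so different models can be compared on the same validation samples.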
For SpectroNet, to understand the trade-off between the number of pre-processing blocks and speed and accuracy, we tried different numbers of blocks and changed the size of the MLP input layer accordingly. The accuracies for different numbers of pre-processing blocks are given in Table 2.
Number of pre-processing blocks of SpectroNet vs accuracy
The training time and the forward-pass time rise rather exponentially with this hyperparameter. Each voice signal is filtered with an FIR filter to remove noise and retain the region of interest. The signal is then passed on to compute the MFCC, and the resulting spectrogram is stored as an image. SpectroNet classifies actual and stimulated emotions with an accuracy of 99.7% (the confusion matrix is shown in Fig. 8) and classifies the type of emotion in the audio signal with an accuracy of 99.4%. Since the dataset was hand-labelled by naive experts, it is reasonable to assume that SpectroNet achieves human-like accuracy in identifying the nature of the emotion.

Confusion matrix –SpectroNet based emotion authentication.
Tables 3 and 4 compare the SpectroNet model with the other models; it can be observed that SpectroNet outperforms all of them on every classification performance metric. The SpectroNet model has also been deployed for real-time classification on an Nvidia Jetson Nano.
Performance Comparison between algorithms for authenticity classification
Performance Comparison between algorithms for emotion classification
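The metrics used in the comparison follow the usual definitions and can be computed from the counts of a binary confusion matrix. The counts in this sketch are made up for illustration and are not the values behind the reported results; `binary_metrics` is a hypothetical name.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Illustrative counts only: 8 true positives, 2 false positives,
# 2 false negatives and 8 true negatives.
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=8)
```

For the seven-class emotion task, the same quantities would be computed per class and then averaged.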
The difference between passing an image that contains objects and passing the power distribution of a spectrogram through a convolutional neural network lies in the filtered response obtained at the output. For an image with objects, the output is a refined feature set that describes the object on a macro or micro scale, depending on the depth of the convolution layer. For a spectrogram, the output is a feature set that contains the accumulated power spectrum in different regions, which loses a lot of information. It therefore becomes difficult to extract features when the output of a convolutional layer describes the power of a combination of frequencies, and the loss due to phase cancellation is not accounted for in CNNs. This makes it difficult to separate simultaneous sounds in Mel spectrograms.
This research initiative has brought us one step closer to identifying stimulated and actual emotion in humans, especially in children. In the dataset used, the age of the speaker is not given for each utterance for ethical reasons, so classification by age group was not part of this study. However, according to the recent recommendations of Shivakumar, P.G. and Georgiou, P. [33], transfer learning approaches can be used effectively for children's speech recognition and for recognizing a child's emotion. Applying the implemented algorithm to children's speech is therefore expected to yield highly accurate emotion recognition systems. The final aim is a deep learning model that can identify real and stimulated emotions in real time for different types of voices, based on their pitch and loudness. The approach is inspired by style-transfer learning using pre-trained CNN architectures. The dataset for each emotion was passed to a standalone classifier, so the entire model comprises two classifiers: the first identifies the kind of emotion and the second determines whether the emotion is actual or stimulated. This cascaded model can then be deployed directly on embedded systems or servers for classification. Different models were deployed on an Nvidia Jetson Nano, and an exhaustive training approach was applied to find the best architecture and features for applications such as voice assistants and automated chatbots. When fed an audio signal sampled at 16 kHz, the model classified the emotion in 4 seconds for 48000 samples (3 seconds of audio), for a total of 7 seconds from capturing the audio signal to producing the result, which is sufficient for the purpose of emotion detection.
Funding
This study is financially supported by the
Conflict of interest:
The authors declare that they have no conflict of interests.
Acknowledgments
The authors thank VIT for providing ‘VIT SEED GRANT’ for carrying out the initial study for this research work.
