Investigation of automatic mixed-lingual affective state recognition system for diverse Indian languages

Abstract

Keywords

1 Introduction

2 State of the Art

3 Insights of database creation

Table 1
Frequency of emotion samples of each language in the created database

Sl. No Language Number of Emotion Samples

1 English 103

2 Hindi 64

3 Malayalam 121

4 Tamil 50

5 Kannada 50

6 Telugu 108

7 Marathi 16

8 Bengali 16

9 Gujarati 13

10 Konkani 12

11 Oriya 16

TOTAL 569

Table 2
Cepstral Feature Set [27]

Sl.No Speech Feature Size Sl. No Speech Feature Size

1 MFCC Functionals 6 11 Extended IMFCC Un-voiced Functionals 6

2 MFCC Voiced Functionals 6 12 LPC Functionals 6

3 MFCC Un-voiced Functionals 6 13 Functionals of MEDC, LFPC, LPCC 18

4 Extended MFCC Functionals 6 14 PLPC Functionals 6

5 Extended MFCC Voiced Functionals 6 15 MFPLPC Functionals 6

6 Extended MFCC Un-voiced Functionals 6 16 BFCC Functionals 6

7 IMFCC Functionals 6 17 RPCC Functionals 6

8 IMFCC Voiced Functionals 6 18 H-Coefficients 8

9 IMFCC Un-voiced Functionals 6 19 Functionals of Cceps and Rceps, Rceps_ph 18

10 Extended IMFCC Voiced Functionals 6 20 Skewness, Kurtosis, Variance, Frequency, Phase, Average Amplitude, Max Amplitude, Maximum at Pitch, Pitch, Entropy 11

4.4 Classification

5 Performance measures

6.1 Experimental work and result analysis of the proposed mixed-lingual affect recognition system across the emotion models

6.1.1 Proposed system on Model 1

Table 3
Confusion matrix for Model 1

Emotion a b c

a=Happy 145 26 9

b=Neutral 27 123 28

c=Sad 22 23 125

Table 4
Confusion matrix for Model 2

Emotion a b c d

a=Angry 157 5 9 9

b=Fear 30 105 20 17

c=Neutral 18 14 122 24

d=Sad 26 7 36 111

Table 5
Confusion matrix for Model 3

a b c d e f

a=Angry 137 5 21 5 3 9

b=Fear 18 98 18 14 16 8

c=Happy 30 13 98 20 6 13

d=Neutral 13 11 11 106 26 11

e=Sad 7 5 17 26 101 24

f=Surprise 16 5 13 17 13 108

6.2.1 Comparative performance analysis of Neutral and Sad emotions using the proposed model

6.2.2 Comparative performance analysis of the proposed mixed-lingual affect recognition system across all the three models

7 Conclusion and future outlook

References

Abstract

Automatic recognition of the human affective state from speech has been a focus of research for more than two decades. Today, in multilingual regions such as India and Europe, people communicate in many languages. However, most existing works propose strategies to recognize affect from databases that each comprise recordings in a single language. There is a strong demand for affective systems that serve mixed-language scenarios. Hence, this work focuses on a methodology to recognize the human affective state from speech samples in a mixed-language framework. A combination of cepstral and bi-spectral speech features, classified using random forest (RF), is applied to the task. This work is the first of its kind, with the proposed approach validated and found to be effective on a self-recorded database comprising speech samples from eleven diverse Indian languages. Six affective states are considered: angry, fear, sad, neutral, surprise and happy. Three affective models are investigated. The experimental results demonstrate that the proposed feature combination, together with data augmentation, enhances affect recognition.

Keywords

Affective state, cepstral, mixed-lingual recognition, Indian languages

1 Introduction

Affect is regarded as a physiological term that describes an emotion or state of mind of an individual. The human affective state is marked by a change of tone and is usually followed by a change in physiological behavior. Various factors are responsible for this condition, such as confidence, health related issues, loss in the family, break-up of personal relations and self-esteem [1]. Various modalities such as speech, facial expressions, heart rate, blood flow and body temperature can be used to recognize the human affective state. Of all these, the speech modality is the most popular, as recording and storing the data is easier [2][3]. In this work, various speech patterns are used to recognize the affective (also referred to as para-linguistic or emotional) state of the individual. The applications of the proposed work include call centers, on-line tutoring, robotics, e-learning, marketing, entertainment and law [4][5].

The important modules in any speech emotion recognition system are the database, pre-processing, feature extraction and classification. The first module, i.e. the database, consists of a collection of speech recordings in a specific language with different emotions. Berlin (German language recordings) [6], eNTERFACE (English language recordings) [7] and Savee (British English recordings) [8] are the pioneering databases on which many works have been published. Recently, the Baum-1a (Turkish language recordings) [9] and MELD (English language recordings) [10] databases have been released. The second module, pre-processing, mainly involves framing and windowing, as speech is a non-stationary signal. Various normalization techniques, and sometimes noise reduction mechanisms, are also part of this module [11][12]. The third module derives speech features or patterns. Prosodic, spectral, Teager energy operator based and voice quality features have been applied to date for affect recognition [13][14][15]. The extracted speech patterns are the input to the last stage, i.e. the classifier, which performs the recognition task. Classical classifiers such as the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM) and Support Vector Machine (SVM), deep learning classifiers such as convolutional neural networks (CNN) and recurrent neural networks (RNN), and deep learning based classifier ensembles such as auto-encoders, adversarial training and attention mechanisms have been used in the final module of classification [16][17][18].

This work is focused on speech based affective state recognition in a mixed-lingual environment of Indian background. The key contributions in this work are as follows:

Creation of an in-house mixed-lingual emotional database with twenty four speakers in eleven diverse Indian languages

A unique combination of cepstral and bi-spectral features is proposed for affect recognition in a mixed-lingual framework

Three different emotion models are investigated with the proposed mixed-lingual system, which attains considerable affect recognition performance

This article is organized as follows. The state of the art is discussed in Section 2, database creation is explained in Section 3 and an overview of the work is given in Section 4. The evaluation measures and the experimental work with analysis are presented in Sections 5 and 6. Finally, Section 7 concludes with future directions.

2 State of the Art

The literature reports work on affective state recognition mainly in single-corpus and multi-corpus contexts, where the system is analysed and tested on emotion samples from one language at a time. The works differ mainly in the type of features and classifiers used. In single-corpus affect recognition works, emotion samples from only one language are involved. A small set of prosodic features, consisting primarily of energy, intensity and pitch, was popular during the infancy of emotion recognition [19][20]. Spectral features such as Mel-frequency cepstral coefficients (MFCC) and linear prediction coefficients (LPC) contain emotion-specific information, as investigated by Ayadi et al. [21]. In another work, by Bang et al., prosodic and cepstral features combined with a data augmentation method proved significant for recognizing emotions from IEMOCAP speech samples [22]. A combination of Teager-Kaiser energy operator based features and empirical mode decomposition (EMD) was applied by Kerkeni et al. [23]. Prominent emotion features were selected independently for two databases, English and Spanish, using recursive feature selection.

In multi-corpus affect recognition works, the system is tested on more than one database or corpus. With more databases involved, researchers explored and applied large feature sets for affect recognition. The Interspeech 2010 feature set, with a feature vector size of 1582, was combined with neurogram features derived from voice samples of the Berlin and eNTERFACE databases and classified using SVM, yielding an accuracy of 89.2% (Berlin database) and 78.2% (eNTERFACE database) [24]. Deep multitask networks with shared hidden layers classified affect samples from nine distinct corpora using the GeMAPS (83 features) and ComParE (6373 features) feature sets [25]. Two heterogeneous emotion corpora of English and Japanese voice samples were classified using multi-task learning with the aid of Interspeech 2010 features [26].

However, affect recognition in speech with databases from a mixed-lingual context has emerged only recently. A combination of MFCC and SDC features of size 124, extracted from samples of IEMOCAP and FAU-AIBO, was classified using deep neural networks (DNN), yielding an accuracy of around 51% [23]. In another work, cepstral features of size 151 with a Random Forest classifier were applied to recognize emotions in a mixed-lingual context involving 2 to 5 databases. An accuracy higher than 80.0% was demonstrated in that work [27]. Although appreciable results are achieved in [27], the databases involved contain samples recorded under fixed, noise-free background conditions.

Summary of state of the art and research gaps:

Existing work on affect detection describes various methodologies. Combinations of prosodic and spectral features have been popular for recognizing affect. Among large feature sets, the Interspeech 2010 feature set is often applied in affective work involving more than one database. SVM has been effective among the traditional classifiers. Different deep neural networks have recently been applied for recognizing emotions. Although a lot of work has been reported on speech based affect recognition, work on mixed-lingual affect recognition is quite minimal. In India especially, where people communicate in so many languages, an affect recognition system that works on a mix of diverse languages is lacking. This work addresses the problem with a suitable methodology.

3 Insights of database creation

An in-house speech emotion database with samples in diverse Indian languages is created in this work. The voice samples were recorded on various hand held mobile devices, each with its own configuration in terms of sampling frequency and default settings. Additionally, a few samples came from mono channel recordings and the others from stereo recording settings on the mobiles. The speakers were undergraduate, post graduate and Ph.D. students in the age group of 20–28 years. Each voice sample lasted 2–4 sec. A total of 32 speakers with 24 male and 12 female students participated in the recording. The database consists of emotion samples in six categories, i.e. angry, fear, neutral, sad, happy and surprise, in 11 diverse Indian languages. A total of 675 samples were recorded. A perception test was conducted with 4 members as judges. Each voice sample was played and the emotion perceived by each member was recorded. A total of 106 samples were discarded after the perception test because the judges disagreed on the type of emotion perceived. The number of emotion samples from each language that passed the perception test and were included in the database for the proposed work is listed in Table 1.

Sl. No	Language	Number of Emotion Samples
1	English	103
2	Hindi	64
3	Malayalam	121
4	Tamil	50
5	Kannada	50
6	Telugu	108
7	Marathi	16
8	Bengali	16
9	Gujarati	13
10	Konkani	12
11	Oriya	16
	TOTAL	569

4 Overview of the proposed mixed-lingual affective recognition

An overview of the methodology in this work for recognizing affective state in an environment of mixed languages is shown in Fig. 1.

Fig. 1

Block diagram of the proposed work.

As depicted in Fig. 1, the proposed system consists of four major phases, i.e. pre-processing, acoustic/speech feature extraction, data augmentation and finally classification. Each of these phases is discussed below.

4.1 Pre-processing

This phase involves framing and windowing of the voice samples, since speech is a non-stationary signal. A Hamming window of length 30 ms with an overlap of 10 ms is used.

The window function w(n) is defined in equation (1). $w(n) = 0.54 - 0.46 \cos\left(\frac{2 \pi n}{N - 1}\right), \quad 0 \leq n \leq N - 1$ (1) where ‘N’ is the length of the window.

The windowed frame y(n) of the input speech signal x(n) is expressed in equation (2). $y(n) = x(n) w(n)$ (2)
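The framing and windowing step can be sketched in a few lines of Python with NumPy. This is only an illustration of equations (1) and (2), not the authors' code: the function name `frame_and_window`, the 16 kHz sampling rate and the random test signal are assumptions made here, since the recordings in this work come from devices with differing sampling frequencies.

```python
import numpy as np

def frame_and_window(x, fs, frame_ms=30.0, overlap_ms=10.0):
    """Split x into overlapping frames and apply a Hamming window (eqs. 1 and 2)."""
    N = int(round(fs * frame_ms / 1000.0))           # window length in samples
    hop = N - int(round(fs * overlap_ms / 1000.0))   # frame shift in samples
    n = np.arange(N)
    w = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (N - 1))   # equation (1)
    if len(x) < N:
        return np.empty((0, N))                      # signal shorter than one frame
    num_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[s:s + N] for s in hop * np.arange(num_frames)])
    return frames * w                                # equation (2), frame by frame

fs = 16000                                           # assumed sampling rate
x = np.random.default_rng(0).standard_normal(2 * fs) # 2 s placeholder signal
frames = frame_and_window(x, fs)
print(frames.shape)
```

With these assumed settings each frame holds 480 samples and consecutive frames start 320 samples apart, so a trailing partial frame is simply dropped.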

4.2 Acoustic/speech feature extraction

The acoustic/speech features considered in this work have two major constituents: a set of cepstral speech feature functionals and bi-spectral features. The cepstral feature set consists mainly of features derived from Mel, Bark and inverted Mel filter banks, besides modified H-coefficients and additional parameters, as shown in Table 2.

Sl.No	Speech Feature	Size	Sl. No	Speech Feature	Size
1	MFCC Functionals	6	11	Extended IMFCC Un-voiced Functionals	6
2	MFCC Voiced Functionals	6	12	LPC Functionals	6
3	MFCC Un-voiced Functionals	6	13	Functionals of MEDC, LFPC, LPCC	18
4	Extended MFCC Functionals	6	14	PLPC Functionals	6
5	Extended MFCC Voiced Functionals	6	15	MFPLPC Functionals	6
6	Extended MFCC Un-voiced Functionals	6	16	BFCC Functionals	6
7	IMFCC Functionals	6	17	RPCC Functionals	6
8	IMFCC Voiced Functionals	6	18	H-Coefficients	8
9	IMFCC Un-voiced Functionals	6	19	Functionals of Cceps and Rceps, Rceps_ph	18
10	Extended IMFCC Voiced Functionals	6	20	Skewness, Kurtosis, Variance, Frequency, Phase, Average Amplitude, Max Amplitude, Maximum at Pitch, Pitch, Entropy	11

The cepstral set comprises 151 functionals. The functionals considered are maximum, minimum, mean, standard deviation, variance and median. These features were found to play a major role in mixed-lingual affective recognition when the majority of the speech samples in the corpus belong to non-Indian languages [27]. The experiments performed in this work show that the features of Table 2 alone were not sufficient to recognize the affective state in a mixed-lingual Indian corpus with diverse languages. Hence, additional features had to be added to the cepstral feature set.
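As a small illustration of the six functionals named above, the sketch below reduces one feature track to six numbers. How the coefficients of each feature type are pooled into a single track is not specified here, so the input `track` is a hypothetical stand-in, and the population form of the standard deviation and variance is likewise an assumption.

```python
import numpy as np

def six_functionals(values):
    """Return [max, min, mean, standard deviation, variance, median] of a 1-D track."""
    v = np.asarray(values, dtype=float)
    # np.std and np.var use the population form (ddof=0) by default
    return np.array([v.max(), v.min(), v.mean(), v.std(), v.var(), np.median(v)])

track = [1.0, 2.0, 3.0, 4.0, 10.0]   # hypothetical per-frame feature values
f = six_functionals(track)
print(f)
```

Applied to each of the six-valued entries in Table 2, such a reduction gives a fixed-length vector regardless of how many frames the utterance contains.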

Extraction of Bi-spectral features

The bi-spectrum is the two-dimensional Fourier transform of the third-order cumulant function, as represented in equation (3). $P(f_a, f_b) = E[X(f_a) X(f_b) X^{*}(f_a + f_b)]$ (3) where P(f_a, f_b) is the bi-spectrum at the frequency pair (f_a, f_b), X(f) denotes the Fourier transform, * represents the complex conjugate and E[.] denotes the expectation operation [28]. The bi-spectrum of a speech signal contains redundant data.

Hence, from the non-redundant area (Ω) Bi-spectral features are selected as depicted in Fig. 2.

The frequencies depicted in Fig. 2 are normalized by the Nyquist frequency. Equations (4) to (13) describe how the bi-spectral speech features are obtained. The average magnitude of the bi-spectrum is given in equation (4).

Fig. 2

Non-redundant region.

$Amp = \frac{1}{p} \sum_{\Omega} |P(f_a, f_b)|$ (4) where ‘p’ is the number of points in the region Ω [29].

The weighted center of the bi-spectrum (WCOB) is evaluated using equations (5) to (8). $g_{1n} = \frac{\sum_{\Omega} m \, P(m, n)}{\sum_{\Omega} P(m, n)}$ (5) $g_{2n} = \frac{\sum_{\Omega} n \, P(m, n)}{\sum_{\Omega} P(m, n)}$ (6) $g_{3n} = \frac{\sum_{\Omega} m \, |P(m, n)|}{\sum_{\Omega} |P(m, n)|}$ (7) $g_{4n} = \frac{\sum_{\Omega} n \, |P(m, n)|}{\sum_{\Omega} |P(m, n)|}$ (8) where m and n are the frequency bin indices in the region, g_1n and g_2n represent the WCOB, and g_3n and g_4n are the WCOB computed from absolute values [29].

The log amplitude summation (T_a) of the bi-spectrum is given by: $T_{a} = \sum_{\Omega} \log(|P(f_a, f_b)|)$ (9)

The log amplitude summation over the diagonal elements (T_b) of the bi-spectrum is given by: $T_{b} = \sum_{\Omega} \log(|P(f_d, f_d)|)$ (10)

The amplitudes of the diagonal elements (T_c, T_d, T_e) for the spectral moments of order one and two are given by: $T_{c} = \sum_{d = 1}^{N} d \, \log(|P(f_d, f_d)|)$ (11) $T_{d} = \sum_{d = 1}^{N} (d - T_{c})^{2} \log(|P(f_d, f_d)|)$ (12) $T_{e} = \sum_{\Omega} \sqrt{m^{2} + n^{2}} \, |P(m, n)|$ (13)

A total of 6 features are extracted: the average magnitude of the bi-spectrum and five features derived from the log amplitudes of the bi-spectrum. Thus, combining the cepstral features of size 151 with the bi-spectral features of size 6 yields a total of 157 coefficients for each speech signal.
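The bi-spectral features can be sketched with NumPy as below. This is a hedged illustration rather than the authors' implementation. Several choices are assumptions made here: the direct (frame-averaged) estimator for equation (3), the triangular region 0 ≤ n ≤ m, m + n ≤ half-band taken as Ω (Fig. 2 is not reproduced), the reading that the six features are Amp and T_a to T_e, the small constant `eps` that guards the logarithm, and the function names.

```python
import numpy as np

def bispectrum(frames):
    """Direct estimate of P(fa, fb) = E[X(fa) X(fb) X*(fa + fb)], averaged over frames."""
    nfft = frames.shape[1]
    X = np.fft.fft(frames, axis=1)                   # shape (K, nfft)
    idx = np.arange(nfft // 2)
    s = idx[:, None] + idx[None, :]                  # bin index of fa + fb (< nfft)
    triple = X[:, idx, None] * X[:, None, idx] * np.conj(X[:, s])
    return triple.mean(axis=0)                       # expectation over the frames

def bispectral_features(P, eps=1e-12):
    """Amp (eq. 4) and T_a ... T_e (eqs. 9 to 13) over an assumed triangular region."""
    half = P.shape[0]
    m, n = np.meshgrid(np.arange(half), np.arange(half), indexing="ij")
    omega = (n <= m) & (m + n <= half)               # assumed non-redundant region
    mag = np.abs(P) + eps                            # eps avoids log(0)
    amp = mag[omega].mean()                          # eq. (4)
    t_a = np.log(mag[omega]).sum()                   # eq. (9)
    d = np.arange(1, half // 2 + 1)                  # diagonal bins inside the region
    log_diag = np.log(mag[d, d])
    t_b = log_diag.sum()                             # eq. (10)
    t_c = (d * log_diag).sum()                       # eq. (11)
    t_d = ((d - t_c) ** 2 * log_diag).sum()          # eq. (12)
    t_e = (np.sqrt(m ** 2 + n ** 2) * mag)[omega].sum()   # eq. (13)
    return np.array([amp, t_a, t_b, t_c, t_d, t_e])

rng = np.random.default_rng(0)
demo_frames = rng.standard_normal((20, 64))          # placeholder windowed frames
P = bispectrum(demo_frames)
feats = bispectral_features(P)
print(P.shape, feats.shape)
```

Short 64-sample frames are used only to keep the intermediate array small; the intermediate product grows with the number of frames times the square of the half-band size.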

4.3 Data augmentation

The synthetic minority oversampling technique (SMOTE), a popular data augmentation method introduced by Chawla et al., is applied in this work [30]. This method creates synthetic samples of the minority classes in the feature space. Each synthetic sample is created from a minority sample and one of its k nearest neighbours. Here, k is set to 5.
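The interpolation idea behind SMOTE can be sketched as follows. This is a minimal NumPy illustration written for this passage, not the reference implementation of [30] and not necessarily the routine used in this work; the function name `smote_like`, the random placeholder data and the seed are assumptions.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points on the segments joining minority samples
    to one of their k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # pairwise Euclidean distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                   # a sample is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]             # k nearest neighbours per sample
    base = rng.integers(0, len(X_min), size=n_new)   # which sample to start from
    neigh = nn[base, rng.integers(0, k, size=n_new)] # which neighbour to move towards
    gap = rng.random((n_new, 1))                     # position along the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(12, 157))              # placeholder 157-dim feature vectors
synthetic = smote_like(X_minority, n_new=20, k=5)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the range spanned by that class in each feature dimension. The sketch needs more than k minority samples to be meaningful.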

4.4 Classification

This is the last phase, in which the affective state of the speech sample is recognized using a random forest (RF) classifier [27]. An RF is formed from multiple decision trees with similar distributions. Several decision trees are built during training, and the output of every individual tree is taken into account. Each tree is trained on a subset of the data obtained by bagging. In this work, a total of ten trees are used.
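A ten-tree random forest with bagging can be set up as in the sketch below. It assumes scikit-learn, which is not named in this work beyond a general reference to Python, and it uses random placeholder data in place of the 157-dimensional feature vectors. The five-fold split mirrors the cross validation reported in Section 6; the class count, sample count and seeds are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
n_per_class, n_feat = 60, 157                        # placeholder sizes
X = np.vstack([rng.normal(loc=c, size=(n_per_class, n_feat)) for c in range(3)])
y = np.repeat(np.arange(3), n_per_class)

# ten trees, each grown on a bootstrap (bagged) sample of the training data
clf = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
y_pred = cross_val_predict(clf, X, y, cv=cv)         # out-of-fold predictions
cm = confusion_matrix(y, y_pred)                     # rows: actual, columns: predicted
print(cm)
```

When oversampling is combined with cross validation, applying it inside each training fold keeps synthetic neighbours of test samples out of the training data; the order used in this work is not stated.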

5 Performance measures

To evaluate the performance of the proposed mixed-lingual affective recognition system, recall, precision, F-score and the confusion matrix are used in this work [31]. These measures are defined as follows:

Recall refers to the fraction of relevant instances that are retrieved. $Recall (\%) = \frac{True Positive (TP)}{True Positive (TP) + False Negative (FN)} \times 100$ (14) where TP is the count of samples predicted positive that are actually positive and FN is the count of samples predicted negative that are actually positive.

Precision refers to the fraction of retrieved samples that are relevant. It is also referred to as the positive predictive value. $Precision (\%) = \frac{True Positive (TP)}{True Positive (TP) + False Positive (FP)} \times 100$ (15) where FP is the count of samples predicted positive that are actually negative.

F-Score refers to the harmonic mean of recall and precision. $F\text{-}Score (\%) = 2 \times \frac{Recall \times Precision}{Recall + Precision}$ (16) where recall and precision are expressed in percent as in equations (14) and (15).

Confusion matrix refers to a table that lists, for each actual class, how many test samples were assigned to each predicted class; the diagonal entries are the correctly recognized samples.
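The measures above can be computed directly from a confusion matrix. The sketch below uses the Model 1 counts from Table 3 and assumes that rows are actual classes and columns are predicted classes, a reading consistent with the 80.6% recall reported for happy (145 of 180). The function name `per_class_metrics` is introduced here for illustration.

```python
labels = ["Happy", "Neutral", "Sad"]
cm = [[145, 26, 9],      # actual Happy
      [27, 123, 28],     # actual Neutral
      [22, 23, 125]]     # actual Sad

def per_class_metrics(cm):
    """Return (recall, precision, F-score) in percent for each class."""
    n = len(cm)
    results = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                          # rest of the row
        fp = sum(cm[r][i] for r in range(n)) - tp     # rest of the column
        recall = 100.0 * tp / (tp + fn)
        precision = 100.0 * tp / (tp + fp)
        f_score = 2.0 * recall * precision / (recall + precision)
        results.append((recall, precision, f_score))
    return results

metrics = per_class_metrics(cm)
for name, (r, p, f) in zip(labels, metrics):
    print(f"{name}: recall={r:.1f}% precision={p:.1f}% F-score={f:.1f}%")
```

Because recall and precision are already in percent, the F-score comes out in percent without a further factor of 100.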

6 Experimental work, results and analysis

The experimental work was conducted by extracting cepstral and bi-spectral features, making a total of 157 coefficients for each speech signal. Data augmentation is applied, followed by classification using various classifiers implemented in Python [32]. Of these, random forest was found to give the best results for the proposed mixed-lingual work. Five-fold cross validation was applied to analyse the system performance. The work is performed and investigated in two phases as follows:

Experimentation and analysis of the proposed mixed-lingual affect recognition system on three emotion models

Comparative analysis on the emotions and across the models

6.1 Experimental work and result analysis of the proposed mixed-lingual affect recognition system across the emotion models

The proposed mixed-lingual affect recognition system is applied to three different emotion models, referred to as Model 1, Model 2 and Model 3, created from the in-house database recorded in this work, and the performance of the system is analysed in each case. The three models are as follows:

Model 1 comprises three emotions, i.e. happy, sad and neutral. Recognition of the emotions in this model is quite crucial in the investigation of the mood swings of an individual.

Model 2 consists of four emotions, i.e. angry, fear, neutral and sad. Recognition of these emotions is important when investigating the psychological state of an individual.

Model 3 has all six emotion classes, i.e. angry, fear, neutral, sad, happy and surprise. This model serves to recognize the emotions most frequently encountered in daily life.

Hence, each model suits a different application. The experimentation is carried out to analyse the performance and suitability of the proposed approach for each model.

6.1.1 Proposed system on Model 1

The proposed mixed-lingual affective system is applied to the voice samples of Model 1. The performance of emotion recognition is shown by the confusion matrix in Table 3 and the performance measures in Fig. 3. It is observed that happy samples are recognized best, while neutral and sad are at times misclassified as one another. Happy is recognized with a recall of 80.6%, while sad and neutral samples exhibit a moderate recall of around 70%. Precision, recall and F-score are close to or above 70% for all three emotion classes. The averages of the three measures are consistent, at around 74% when computed from Table 3.

Emotion	a	b	c
a=Happy	145	26	9
b=Neutral	27	123	28
c=Sad	22	23	125

Fig. 3

Performance of the proposed mixed-lingual affect recognition system across Model-1.

6.1.2 Proposed system on Model 2

The proposed mixed-lingual system is applied to recognize the emotions of Model 2, with the results reported in the confusion matrix of Table 4 and the performance measures shown in Fig. 4. Among the four emotions, angry is recognized in most of the voice samples. Fear is moderately well recognized but often misclassified as angry. Similarly, there is some confusion between the neutral and sad emotions.

Emotion	a	b	c	d
a=Angry	157	5	9	9
b=Fear	30	105	20	17
c=Neutral	18	14	122	24
d=Sad	26	7	36	111

Fig. 4

Performance of the proposed mixed-lingual affect recognition system across Model 2.

From the performance measures in Fig. 4, recall is highest for angry at 87.2%, while neutral is the next best recognized emotion at around 70%. Precision is highest for fear, reaching around 80%, which indicates that few samples of other emotions are falsely predicted as fear (few false positives). Lastly, the F-score remains around 70% or higher for all emotions except sad. The average rates of all three measures are around 70%.

6.1.3 Proposed system on Model 3

The proposed system is applied to Model 3, with the results given in the confusion matrix of Table 5 and the three performance measures shown in Fig. 5.

Fig. 5

Performance of the proposed mixed-lingual affect recognition system across Model 3.

From the confusion matrix of Table 5, angry is the best recognized of the six emotions, and surprise is the next most reliably predicted. However, the neutral and sad emotions are misclassified as one another, while a few fear samples are recognized as angry or happy. The recall for angry is higher than 75%. Similarly, the precision for fear is around 75%, since few samples of other emotions are falsely predicted as fear (few false positives). For the rest of the emotions, all three performance measures remain around 60.0%.

Further, with eleven diverse languages, the proposed model recognizes all six emotions at rates of around 60% or higher, which shows the effectiveness of the proposed method.

6.2 Comparative analysis performed on the emotions and models

This section discusses the performance of the proposed system in recognizing the most confusable emotions, sad and neutral, across all three models. These two emotions are often difficult for humans to distinguish as well. Next, the averages of the performance measures across the three models are compared.

6.2.1 Comparative performance analysis of Neutral and Sad emotions using the proposed model

The neutral and sad emotions, usually referred to as low emotions, are important to recognize in any application. With three models investigated in this work, the performance measures of both emotions in the presence of the other emotions are depicted in Fig. 6. With the first model, both emotions are recognized at around 70% or higher in the presence of the third emotion, happy. In Model 2, the presence of the two other emotions, angry and fear, resulted in sad being misclassified as neutral slightly more often. However, with Model 3, in spite of the presence of four other emotions, both sad and neutral are fairly well recognized by the proposed model, with rates of around 60%.

6.2.2 Comparative performance analysis of the proposed mixed-lingual affect recognition system across all the three models

The average recall, precision and F-score for the three models investigated in this work are compared in Fig. 7. Within each model, the averages of the three evaluation measures are almost the same. The best averages, of around 74%, are achieved with Model 1 and its set of three emotions. As the number of emotions increased to four and six with Model 2 and Model 3, confusion among the emotion classes increased during classification, causing a slight reduction in the recognition rates. The key observation is that even with all six emotions considered in Model 3, the averages are around 60%, which is still a good performance for the proposed mixed-lingual affect recognition system, since diverse Indian languages and speakers are involved in the mixed-sample recordings.

7 Conclusion and future outlook

This article provides an investigation of a mixed-lingual affect recognition system for different Indian languages. In contrast to the huge feature sets applied in the literature for affect recognition, a compact set of 157 coefficients of cepstral and bi-spectral features forms the speech feature set in this work. Of the various classifiers applied, the random forest classifier is found suitable for the mixed-lingual environment. Diversity in the speakers and in the language of the speech emotion samples is considered during the in-house data creation. Experimental results are demonstrated using the performance measures of recall, precision and F-score. All three metrics are consistent and considerable across the three models investigated. The proposed methodology, combining the proposed feature set with data augmentation, has proven effective, since emotion samples captured with simple hand held devices were recognized. In future, additional features could be added to the proposed feature set for enhanced recognition rates. Also, emotion samples in other Indian languages could be added to the database and the proposed method could be validated further.

References

1. http://canwetalk.ca/about-mental-illness/factors-affectingmental-health/

2. Sloten J.V., Verdonck, Nyssen and Haueisen, Influence of mental stress on heart rate and heart rate variability, International Federation for Medical and Biological Engineering Proceedings, 2008, pp. 1366–1369.

3. Bakker, Pechenizkiy and Sidorova, What’s your current stress level? Detection of stress patterns from GSR sensor data, In Proceedings of ICDM, 2011, pp. 573–580.

4. Gowda R.K., Nimbalker, Lavanya, Lalitha and Tripathi, Affective computing using speech processing for call centre applications, In: International conference on advances in computing, communications and informatics (ICACCI), Udupi, 2017, pp. 766–71.

5. Lalitha and Tripathi G.D., Enhanced speech emotion detection using deep neural networks, Int J Speech Technol 22(3) (2018), 1–14.

6. Burkhardt, Paeschke, Rolfes, Sendlmeier W.F. and Weiss, A database of German emotional speech, Interspeech, ISCA (2005), pp. 1517–1520.

7. Martin, Kotsia, Macq and Pitas, The eNTERFACE 05 audio-visual emotion database, In: 22nd international conference on data engineering workshops (ICDEW’06), Atlanta, GA, USA, 2006, pp. 8–8.

8. Jackson and Haq, Surrey audio-visual expressed emotion (Savee) database, Guildford, UK: University of Surrey, 2014.

9. Zhalehpour, Onder, Akhtar and Erdem C.E., BAUM-1: a spontaneous audio visual face database of affective and mental states, IEEE Transactions on Affective Computing, vol. 8(3), pp. 300–313.

10. Poria, Hazarika, Majumder, Naik, Cambria and Mihalcea, MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, (2019), 527–536.

11. Sefara T.J., The Effects of Normalisation Methods on Speech Emotion Recognition, 2019 International Multidisciplinary Information Technology and Engineering Conference (IMITEC), Vanderbijlpark, South Africa, 2019, pp. 1–8.

12. Anagnostopoulos C.-N., Theodoros and Ioannis, Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011, Artif Intell Rev 43(2) (2015), 155–177.

13. Lalitha and Gupta, An encapsulation of vital non-linear frequency features for various speech applications, Journal of Computational and Theoretical Nanoscience 17(1) (2020), 303–307.

14. Daneshfar, Jahanshah and Neekabadi, Speech emotion recognition using hybrid spectral-prosodic features of speech signal/glottal waveform, metaheuristic-based dimensionality reduction, and Gaussian elliptical basis function network classifier, Applied Acoustics, (2020).

15. Zvarevashe and Olugbara, Ensemble Learning of Hybrid Acoustic Features for Speech Emotion Recognition, Algorithms 13 (2020), 70.

16. Shahin, Nassif A.B. and Hamsa, Emotion recognition using hybrid Gaussian mixture model and deep neural network, IEEE Access 7 (2019), 26777–26787.

17. Lalitha and Tripathi, Emotion detection using perceptual based speech features, 2016 IEEE Annual India Conference (INDICON), Bangalore, 2016, pp. 1–5.

18. Lee C.M., Yildirim, Bulut, Kazemzadeh, Busso, Deng, Lee S.S. and Narayanan, Emotion recognition based on phoneme classes, Int Conf Spoken Lang Process (2004), 205–211.

19. El Ayadi, Kamel M.S. and Karray, Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recogn 44(3) (2011), 572–587.

20. Bang, Hur, Kim, Lee, Han, Banos, Kim J.I. and Lee, Adaptive data boosting technique for robust personalized speech emotion in emotionally-imbalanced small-sample environments, Sensors 18 (2018), 3744.

21. Leila, Youssef, Kosai, Mohamed and Catherine, Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO, Speech Communication 2019, 114.

22. Jassim W.A., Paramesra and Harte, Speech emotion classification using combined neurogram and INTERSPEECH 2010 paralinguistic challenge features, IET Signal Proc 11(5) (2017), 587–595.

23. Zhang, Liu, Weninger and Schuller, Multi-task deep neural network with shared hidden layers: Breaking down the wall between emotion representations, In: IEEE International conference on acoustics, speech and signal processing (ICASSP), New Orleans, LA, 2017.

24. Lee, The generalization effect for multilingual speech emotion recognition across heterogeneous languages, In: ICASSP 2019 – 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5881–5885.

25. Zhang, Liu, Weninger and Schuller

26. T.-T., Chang S.-F. and Sun, Blind detection of photomontage using higher order statistics, International Symposium on Circuits and Systems IEEE 685 (2004), 688–691.

27. Sreeram, Gupta, Zakariah and Alotaibi, Investigation of multilingual and mixed-lingual emotion recognition using enhanced cues with data augmentation, Applied Acoustics 170 (2020), 107519. 10.1016/j.apacoust.2020.107519.

28. Bachu, Kopparthi, Adapa and Barkana, Voiced/Unvoiced Decision for Speech Signals Based on Zero-Crossing Rate and Energy, Elleithy K. (eds), Advanced Techniques in Computing Sciences.

29. Dua, Acharya R.U. and Chua C.K., Classification of epilepsy using high-order spectra features and principle component analysis, Journal of Medical Systems 36 (2012), 1731–1743.

30. Chawla N.V., Bowyer W.K., Hall O.L. and Kegelmeyer W.P., SMOTE: synthetic minority over-sampling technique, J Artif Intell Res (2002), 321–357.

31. Goutte and Gaussier, A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation, Lecture Notes in Computer Science 3408 (2005), 345–359. 10.1007/978-3-540-31865-1_25.

32. https://www.python.org
