Machine hearing system for teleconference authentication with effective speech analysis

Abstract

The speaker identification in Teleconferencing scenario, it is important to address whether a particular speaker is a part of a conference or not and to note that whether a particular speaker is spoken at the meeting or not. The feature vectors are extracted using MFCC-SDC-LPC. The Generalized Gamma Distribution is used to model the feature vectors. K-means algorithm is utilized to cluster the speech data. The test speaker is to be verified that he/she is a participant in the conference. A conference database is generated with 50 speakers. In order to test the model, 20 different speakers not belonging to the conference are also considered. The efficiency of the model developed is compared using various measures such as AR, FAR and MDR. And the system is tested by varying number of speakers in the conference. The results show that the model performs more robustly.

Keywords

K-means MFCC SDC LPC GGD AR FAR MDR

1. Introduction

The speech wave extracted from the spoken language conveys several kinds of information. In speech production viewpoint, this signal conveys the linguistic information, which indicates the meaning the speaker wishes to impart, who is speaking and the emotion of the speaker [2, 4, 5, 21]. Whereas from the speech perception viewpoint, the environment of speech production and transmission can also get conveyed. The information to be transmitted in the form of speech signal involves complex encoding scheme but can be easily decoded by the human ears. This ability of humans has driven many researchers in developing systems that can extract the speech signals automatically and process the information. The processed information helps in recognition of the speaker.

Any speaker recognition systems have three specific tasks which are to be performed, viz., the speaker identification, speaker verification, speaker segmentation [28, 30]. The Speaker identification involves two processes, the open set, and the closed set. As the initial step, a database is generated with speech voices from different speakers of both genders and the aim of speaker recognition is to determine whether the individual belongs to a database or not. In the closed set mode, even called as, multi-classification problem, the speech recognition is done based on the voices from the generated databases, else if the voice is outside the database then it is said to be in open-set mode; the voices considered for open-set mode are called imposters. This open set mode is very much useful in applications like security/forensics, where the speech evidence can be used for the identification of the suspects or voice verification or the authentication on speaker detection [3, 6, 11, 14].

Figure 1.

Training process for speaker recognition in teleconference applications.

Speaker verification is the second task which aims at the verification of a person basing on the speech signal. This can be a binary decision problem. In this process, the Claimly’s voice is to be tested with the large group of voices and the authentication is to be generated. This system has applications in control telephone and banking services.

The final task, speaker segmentation/clustering has wide applications in multi-speaker scenarios. In order to recognize a speaker, the test speaker voice is considered, processed and is compared with that of the existing voices in the data set. But may not be as simple as assumed, because, in most of the speaker recognition applications, it is always assumed that the speech from a particular individual is only available for processing, the speech signals, in most of the cases are interlined with other individual speech signals [12, 13, 18, 19]. Hence in order to identify a particular individual’s speech from the database, it is, therefore, necessary to divide the speech samples into clusters or segments, so that the homogenous speech samples can be identified [8, 15, 16]. These samples are then labeled using the speaker’s identity. These models of segmenting and identifying the homogenous voices have many advantages, including far speaker identification, identifying a speaker in the conference/meeting and consumer electronic devices [17, 27, 29]. In the process of speaker segmentation, the voice samples are indexed which helps the retrieval process simpler.

In this paper, we focus on the identification of the speaker from the group of speakers from the conference meeting. A lot of reviews are projected in this area of speaker segmentation. Many challenging issues are focused on dividing the voice speech into homogenous segments. The challenges include discriminating the boundaries between the speakers, identification of the regions/portions of the individual speakers and identifying the unique speakers from the voice database. The boundary location problem assumed to be the segmentation problem, region/portion identification of the individual speaker is assumed to be the clustering problem.

The above picture describes training process of conference speech, which includes clustering and obtaining MFCC-SDC-LPC feature vector combination [1, 2, 15], and then obtained probability density function (PDF) of the generalized gamma distribution [22, 24]. The model for any segmentation and clustering system is presented above. The voices from the meeting are considered and after processing the speech samples, the feature vectors are identified and the numerical equivalents are obtained from the speech sample [6]. The segmentation process is carried out by dividing the database into homogenous segments basing on the identity of each speaker. The environmental and transition channel conditions are also considered. The clustering process aims at identifying the homogeneous clusters basing on the speaker’s identity i.e. in this process we group all the speech that pertains to an individual speaker [7, 9, 10].

The speech segmentation and clustering process are useful in other applications such as far-speaker identification and identifying the 3 ${}^{\text{rd}}$ person in a telephone conversation with speaker identification in a meeting environment.

A meeting is defined as a gathering of a group of like-minded people to interact with each other and exchange their views regarding a particular issue. The set of people participating in the meeting need not be in the same location, video conferencing; tele-metering can also be viewed as the cases under consideration

The Fig. 2 is the schematic view of the Teleconference authentication system, aims to build speaker segmentation and clustering system for a meeting environment. Speaker segmentation segments the voice database into audio streams, with the intention of finding the changes in the acoustic points that may be due to the difference in the background or due to the change of the speaker.

Figure 2.

A schematic view of the teleconference authentication system.

The continuous speech signal is considered, and the boundaries of the speech are identified to categorize the speech labels are classes. The speech signals of all the speakers involved in the meeting are recorded and the speech samples are synthesized/processed. The feature vectors are identified using spectral coefficients. MFCC coefficients and MFCC-LPC are used for the feature extraction [16, 20]. MFCC is preferred because of its ability to recognize the noise in noisy environments and also in short frames. In order to eliminate the silence part/white speech, segmentation process will be useful. The individual speaker voice from the overlapping sequence is overcome by considering an overlapping window of 10 ms, 25 ms; 30 ms varying the speech frame length of 10 frames each. The meeting database considered for the study is a recorded data of a continuous speech of 5 min. the testing purpose, we have considered the data of 0.5 sec.

In order to segment the data, effectively we have to model the voice-parameters accurately and to differentiate the voices/speakers, maximum likelihood criteria will be of more advantageous and K-Means algorithm is utilized for the segmentation process. Many model-based techniques have been used for speaker segmentation and clustering, GMMs which GMM based approach is of most preferred. But the main disadvantage of GMM is that GMM segments the data effectively when the shape of the speech signal under consideration is symmetric and the range is infinite [15, 23, 25]; but most of the speech signals, in reality, are non-Gaussian in shape i.e. either Lepty-kurtic or Platy kurtic and the range of the speech signal is not infinite. Hence to model the parameters most effectively maximum posterior algorithms are mostly preferred. Hence in this paper, generalized gamma distribution is utilized. For any segmentation process, the initial clusters are to be identified. In the process we have utilized the histogram process, to identify the number of peaks, which helps to identify the number of segments. Once the segments are identified, the k-means algorithm is utilized to segment the data into clusters of homogenous groups. The initial estimates that are obtained from the k-means algorithm are updated using E-M algorithm and the updated model parameters are used for the labeling of the speakers.

An Experiment is conducted in mat lab environment by considering a meeting environment sequence recorded for 10 min duration. The efficiency of the model developed is contemplated to that of the existing model based on GMM [31, 32]. The performance evaluation is carried out using measures such as Acceptance Rate (AR), False Acceptance Rate (FAR), and Missed Detection Rate (MDR). The results show that the model developed obtains good results.

The probability density function of each speech spectrum of speakers $X_{1},X_{2},\ldots,X_{c}$ from $C$ different speakers uttering the speeches of $X_{1},X_{2},\ldots,X_{c}$ among registered speakers $S=\{1,2,\ldots,C\}$ is identified by using the Generalized Gamma Distribution $\lambda_{k}$ .

$\displaystyle\hat{S}=\arg^{\max}_{1\leqslant k\leqslant C}p(|{\lambda}_{k}|X_{% 1},X_{2},\ldots,X_{c})$ (1) $\displaystyle=\arg^{\max}_{1\leqslant k\leqslant C}{{\frac{p(|{\lambda}_{k}|X_% {1},X_{2},\ldots,X_{c})\cdot P(\lambda k)}{p(X_{1},X_{2},\ldots,X_{c})}}}$ (2)

By using Bayes’ rule equal prior probability (i.e. $P(\lambda_{k})=1/C)$ , and the conditional independence between different speech inputs $X_{1},X_{2},\ldots,X_{c}$ given speaker model $\lambda_{p}$ , and not in that $p(X_{1},X_{2},\ldots,X_{c})$ is the same for all speakers, equation1 can be simplified as,

$\displaystyle\hat{S}=\arg^{\max}_{1\leqslant k\leqslant C}\prod^{C}_{c=1}{p(X_% {C}|{\lambda}_{k})}$ (3)

The speaker’s identity can be determined by the sum of the hypothesis log probability scores obtained from speech samples from $C$ Speakers. However, in a meeting environment, the log probability score itself is expected to degrade in Eq. (3), i.e. its reliability cannot be guaranteed. In addition, a wide range of factors that cause such as noise, speaker direction and distance may affect each speech sample differently. The integration is therefore greatly affected by the incorrect classification in such cases. We are therefore proposing a new method of integration to re-score the scores of the hypothesis, measure and combine the distance between them.

The Mel frequency cepstral coefficients (MFCC) are used to represent the speaker identification characteristics. In the setup used, a Mel-scale filter bank is used to process the magnitude spectrum from a short frame. The outputs of the log energy filter are the cosine transformed into cepstral coefficients. Each frame repeats the process resulting in several feature vectors [1]. We assume that each of the Mel frequency cepstral coefficients follows a generalized gamma distribution. The entire speech spectrum of each speaker can therefore be characterized as M-component Generalized Gamma Distribution. Using Generalized Gamma Distribution, the probability density function of each individual speech spectra speaker is modeled. In order to identify an individual speaker, from the meeting situations, we have to cluster the data.

2. Literature review on speaker recognition models

The speaker identification problem is defined as the determination of a speaker identity from his/her voice. Speech recognition is used in real-world human language applications, such as information retrieval. It is the most common means of the communication because the information contains the fundamental role in conversation. From the conversation or speech, it converts an acoustic signal that is captured by a microphone or a telephone, to a set of words. A set of word can either be the final result or it can then apply the synthesis to pronouns into sounds, which means speech-to-speech. Its means that, speech recognition can serve as the input to further linguistic processing in order to achieve speech understanding. Speech recognition systems are characterized by speaking style, speaking mode, environment, vocabulary, acoustic model, language model, perplexity, SNR and transducer.

A speaker identification system is said to be open-set if it can determine whether the given testing utterance belongs to the set of enrolled speakers or not. Otherwise, it is called a closed-set speaker identification system. In order to recognize the speaker, different classification techniques are presented in literature.

Classification refers to deciding the class label (the unknown identity) of the testing signal. In closed-set speaker identification systems, classification is performed by assigning a score for each class that attempts to measure how likely the corresponding speaker produced the given testing utterance. A decision is made in favor of the speaker whose model provides the highest matching score. The classifier performance is measured by its ability to predict the true labels of unknown testing utterances as well as the time need for making a decision. In order to have a good classification performance, training examples (utterances) with known labels are used to estimate the classifier parameters.

The main approaches for speaker identification include Unsupervised learning models, Supervised learning models, Dimensionality reduction techniques and other models based on Acoustic speaker recognition, Feature vector based and speaker Diarization.

2.1 Speaker recognition models based on unsupervised techniques

In this approach, a model is constructed for each speaker. The training examples (feature vectors) of each speaker are used to train the corresponding model. In the testing phase, each model calculates a likelihood score with respect to the given testing utterances. This approach is called unsupervised learning because, when each speaker model is trained, the corresponding class label information is not used. Various unsupervised modeling approaches include k-NN, vector quantization (VQ), GMM and HMM.

The advantage of unsupervised learning algorithms over supervised methods is that; new speakers can be added easily to the identification system without the need to retrain other speaker models; another advantage of unsupervised learning algorithms is in the flexibility of adding and removing speakers from the system. Unsupervised learning algorithms are often characterized by the decision function that is used to measure the match between a given testing utterance and a certain speaker model. While distance metrics are utilized with the NN, the dynamic time warping (DTW),and the VO classification methods, a probabilistic likelihood is used with statistical classifiers such as the GMM and the HMM.

In this section we briefly review some of the Speaker recognition models based on Unsupervised Techniques

2.2 Speaker recognition based on nearest neighbor

The NN classification method is a conceptually simple classification technique that is found to be efficient in many pattern classification problems. The training phase just consists of storing all the training data vectors with their corresponding labels. To classify a testing example, the closest $k$ training examples are found, and a decision is made to the class that is most common in those k neighbors. Since a speech utterance is represented by a set of feature vectors, a distance metric between two sets of feature vectors should be defined.

3. Generalized gamma distribution

The probability density function of the generalized gamma distribution is given by

$\displaystyle f(x,k,c,a,b)=\frac{c{(x-a)}^{ck-1}e^{-{\left(\frac{x-a}{b}\right% )}^{c}}}{b^{ck}\Gamma(k)}$ (4)

where $a, b, c, k$ are called the gamma variants and $c, k$ are called shape parameters such that $c,k>0$ . $a$ is called location parameter, $b$ is called shape parameter with $a,b>0$ . In our approach we used generalized gamma distribution for model building and to compare probability density function (PDF) using EM –algorithm.

Figure 3.

Recorded waveform of the conference audio.

Figure 4.

One cluster of conference speech.

4. Estimation of the model parameters using EM algorithm

We obtain estimates of model parameters via EM algorithm in this section. It is mandatory to estimate the speaker model’s parameters effectively for effective speaker identification model. To estimate the parameters EM Algorithm is used that maximizes the model’s likely hood function for a sequence I; let $x_{i}=(x_{1},x_{2},\ldots,x_{t})$ be the training vectors drawn from the speech of a speaker and are characterized by the Generalized Gamma Distribution’s probability density function as given in Eq. (1).

The likelihood function of sample observation $z_{1},z_{2},$ $\ldots,z_{n}$ drawn from a speech signal with a probability function

$\displaystyle h(z,\theta)=\sum^{K}_{l=1}{{\alpha}_{i}g_{i}(Z_{s}},\theta)$ (5)

is given by

$\displaystyle L(\theta)=\prod^{N}_{s=1}{\left(\sum^{K}_{l=1}{{\alpha}_{i}g_{i}% (Z_{s},\theta)}\right)}$ (6)

The updated equations for the initial parameters $\alpha^{(l+1)},c^{(l+1)},a^{(l+1)},b^{(l+1)}$ and $k^{(l+1)}$ are given in the Eqs (7) to (11).

For updating ${\alpha}^{(l+1)},c^{(l+1)},a^{(l+1)},b^{(l+1)}$ and $k^{(l+1)}$ we have

$\displaystyle{{\alpha}_{k}}^{(l+1)}=\frac{1}{N}\sum^{N}_{s=1}{E^{(l)}[t_{k}(Z_% {s},{\theta}^{(l)})]}$ (7) $\displaystyle c^{(l+1)}=\frac{1}{\frac{1}{f}\frac{\partial f}{\partial c}-k{% \log\left(\frac{x-a}{b}\right)}+\frac{{(x-a)}^{c}}{b^{c}{\log\left(\frac{x-a}{% b}\right)}}}$ (8) $\displaystyle a=x-\frac{1-ck}{\frac{1}{f}\frac{\partial f}{\partial a}-\frac{c% }{b}{\left(\frac{x-a}{b}\right)}^{c-1}}$ (9) $\displaystyle b^{(l+1)}=\frac{ck}{\frac{c}{b^{c+1}}{(x-a)}^{c}-\frac{1}{f}% \frac{\partial f}{\partial b}}$ (10) $\displaystyle k^{(l+1)}=1+\frac{\left[\int^{\infty}_{0}{e^{-t}}{(\log_{e}t)t^{% k-1}{\rm d}t}\right]}{\gamma(k-1)\left[c\log\left(\frac{x-a}{b}\right)-\frac{1% }{f}\frac{\partial f}{\partial k}\right]}$ (11)

To develop the model of speaker identification, in a teleconferencing situation, estimating the speaker model parameters is necessary. In order to estimate the parameters in the model, we consider the EM algorithm that maximizes the model’s probability function for a speaker I from the teleconference $x_{i}=(x_{1},x_{2},\ldots,x_{t})$ group of speakers. The likelihood function of the Generalized Gamma Distribution is given as

$\displaystyle h(z,\theta)=\sum^{K}_{l=1}{{\alpha}_{i}g_{i}(Z_{s}},\theta)$ (12)

where $h(z,\theta)$ is as given in the above Eq. (8), The probability function contains the number of components M that can be determined from the clustering algorithm and for this purpose K-Means algorithm is used. The K-Means algorithm requires the initial number of components that can be taken by diagramming the Mel frequency cepstral coefficients associated with the speaker and the number of peaks. Once the K- has been assigned, the EM algorithm can be applied with updated equations to refine the parameters.

4.1 K-Means algorithm

A popular approach to segmenting speech samples into K-Clusters is the K-Means clustering [34]. The steps to be taken are:

Step 1: Step 1:
Start with an initial value of $k=$ Segment number.
Step 2:
Choose the number of $k$ clusters with the initial $V_{i}$ centroids cluster $i=1,2,\ldots,k$ ;
Step 3:
By assigning each speech sample $x_{j}$ to the closest Cluster centroid $V_{i}$ using the selected distance measure, for example, partition the input speeches into $k$ clusters. Defined Euclidean distance as:

$\displaystyle d_{ij}=\|x_{j}-V_{i}\|$
Step 4:
Calculate a cluster assignment matrix $U$ representing the partition of speech samples to the $i^{\text{th}}$ cluster with the binary membership value of the $j^{\text{th}}$ speech sample such that:

$\displaystyle U=[\mu_{ij}],$

where,

$\displaystyle u_{ij}\in\{0,1\}∼{}∼{}\forall i,j$ $\displaystyle\sum^{k}_{i=1}{u_{ij}}=1∼{}∼{}\forall j\text{ and }0<\sum^{n}_{j=% 1}{u_{ij}}<n∼{}∼{}\forall i.$
Step 5:
Use the membership values to recalculate the centroids

$\displaystyle V_{i}=\frac{\displaystyle\sum^{n}_{j=1}{u_{ij}x_{j}}}{% \displaystyle\sum^{n}_{j=1}{u_{ij}}}∼{}∼{}\forall i.$
Step 6:
If there is no change in the cluster centroids or assignment matrix from the previous iteration, stop go to step 3 otherwise.

The speech signals are extracted using the MFCC coefficients. In the situations of identifying a speaker in the conference or group meeting, it is needed to cluster/segment the speech data set, so that the individual speaker having the same frequency speech signal or feature vectors are pooled together, i.e., the likelihood of each speaker is to be modeled from the M-speakers. K-Means algorithm is utilized for this purpose.

The K-Means algorithm requires the initial value of K, false estimation of the K value results in either overestimation or underestimation. To overcome this situation, the initial value of K- is determined by plotting the Cepstrum of the Mel cepstral coefficients associated with each speaker, and for one speech sample, it is shown in Fig. 5.

After obtaining the value of K, the EM algorithm is used for the updating of the initial estimates, given in the Eq. (7).

The steps for the speaker recognition are given below

Figure 5.
Mel frequency cepstral coefficients.

1.
In order to initialize the model parameters, EM algorithm can be effectively used. Obtain the range of each feature vector by estimating the minimum and the maximum frequency value of each speaker. The initial value of the $\alpha$ can be obtained by using ${\alpha}_{i}={1/M}$ , where $M$ is the value obtained from the mel frequency of the histogram of the speech signal from the continuous waves.
2.
Apply the K-Means algorithm, described in Section-3. After estimating the initial values of K from the histogram of MFCC. Segment the speech data into K-clusters. Arrange the speech samples either randomly or serially. The procedure for the K-Means algorithm is given below.

a)
Choose the speech signals, train the speech data for all the M-speech samples, take the MFCC coefficients
b)
Assume that there are N-samples and train the remaining (N-M) samples by computing the centroid of each sample and assign the samples to the component with the minimum centroid. If there are N-samples and M-coefficients, then the samples ( $C_{1}$ - $C_{M}$ ) are classified, such that each sample contains a single speech pattern.
c)
The mean is calculated as Mean ${}^{(n)}=1/N^{M}$ $\left(\sum X^{(}M)_{i}\right)$ where $X^{(M)}_{i}$ is the $i^{\text{th}}$ speech pattern that belongs to the cluster $C_{i}$ . After each assignment, recalculate the centroid, until a convergent value is obtained and the initial values are obtained. After estimating the initial estimates, the EM algorithm is used to obtain the final estimates and the Generalized Gamma Distribution is used to model the parameters.

5. Tele conference authentication system: Apraoch

After obtaining the speech sample of the all the speakers in the participating in the meeting, the main purpose is to identify who is speaking or who is the speaker when a speech voice is generated by considering the test voice and comparing with that of the database of the speaker in the meeting and another feature that can be included is that when the speech test voice is obtained, it is compared with that of the voices of the speakers in the meeting and we can conclude whether the speaker is participating in the meeting or not [26, 28, 30, 33]. For the effective recognition of the speaker, the following algorithm is adopted using Generalized Gamma Distribution as model building tool to store voice template.

Step 1: Step 1:
Record continuous speech in the conference.
Step 2:
Cluster the speech using a k-means clustering algorithm, call the clusters as $c_{1},c_{2},\ldots,$ $c_{n}$ .
Step 3:
Train the clusters after extracting feature using MFCC-SDC-LPC and then obtain probability density function of the generalized gamma distribution.
Step 4:
Perform step3 for test speeches (including conference speakers and non-conference speakers).
Step 5:
If the test speech PDF is matched with the trained PDF (range of min and max values of PDF).

Then speaker is accepted (conference-speaker), otherwise reject (non-conference speaker).
6. Experimentation

The experimentation is conducted using 50 speakers involved in a meeting scenario. The Speech samples of the individual speakers are obtained and stored in.WAV format. The preprocessing is done for eliminating the white speeches and noise removal. The MFCC cepstral coefficients for each speech signal of every speaker is obtained and stored in the database. The following sequence of graphs indicates every stage of the speech signal for training and to acquire probability density function (PDF) for comparing with test speech PDF. And also tested this system with TIMIT dataset available in literature.

Figure 6.

Speech Cepstrum after application of MFCC-SDC-LPC.

Figure 7.

Probability density function of the generalized gamma distribution.

The Fig. 3 is speech waveform after recorded in.wav format. This spectrum further clustered for identifying conference speaker and non-conference speaker.

The Fig. 4 is one cluster of conference speech after cluster the speech into k-clusters.

This clustered speech further calculated MFCCs for feature extraction.

The Fig. 5 is Mel frequency cepstral coefficients (MFCC) of the individual cluster, these MFCCS are further converted to shifted delta coefficients (SDC) and then given to calculate LPC for extracting low frequencies.

Figure 8.

Performance of speaker recognition system with AR, FAR, MDR.

The Fig. 6 indicates speech Cepstrum after applying MFCC-SDC-LPC, This feature vector further given to generalized gamma distribution for computing PDF.

Figure 7 indicates probability density function of the generalized gamma distribution. This PDF will be used for comparing trained speech and test speech for taking a decision about acceptance.

7. Performance evaluation

In order to evaluate the performance of the developed model various metrics such as Acceptance Rate (AR), False Acceptance Rate (FAR), and Missed Detection Rate (MDR) are considered. The various formulas for evaluating the metrics are given below.

$\displaystyle\text{Acceptance rate}=$ $\displaystyle\left(\frac{\text{total no of accepted}}{\text{total no of % speakers}}\right)\times 100$ (13) $\displaystyle\text{False acceptance rate}=$ $\displaystyle\left(\frac{\text{total no of speakers-total no of sccepted}}{% \text{total no of speakers}}\right)\times 100$ (14) $\displaystyle\text{Misssed detection rate}=$ $\displaystyle\left(\frac{\text{total no of missed to recognize}}{\text{total % no of speakers}}\right)\times 100$ (15)

The developed model is tested for accuracy using the above metrics.

The Fig. 8 shows performance of teleconference authentication system with AR, FAR, MDR metrics. Also shows the Comparison between existing Gaussian Mixture Model (GMM) and Generalized Gamma Distribution (GGD) and it is observed that the Generalized Gamma Distribution gives better performances.

8. Conclusion

Further analysis is needed for to Speaker Diarization to establish who said when and for the tasks of speaker segmentation and 230 indexing of multi-speaker speech and further dimensions are to be developed to assess the behavior of the speaker in the absence of an interviewer. These methods are to be addressed effectively for the development of a robust speaker identification system.

References

Adiwijaya

M.N.A.

Mubarok

M.S.

Untari Novia

and Nhita

, A comparative study of MFCC-KNN and LPC-KNN for hijaiyyah letters pronounciation classification system, in: 2017 5th International Conference on Information and Communication Technology (ICoIC7), IEEE, 2017, pp. 1–5.

Kumar

and Hemani

, Effective preprocessing of speech and acoustic features extraction for spoken language identification, Smart Technologies and Management for Computing, Communication, Controls, Energy and Materials (ICSTM), IEEE, 2015, pp. 81–88.

Dixit

Vidwans

and Sharma

, Improved MFCC and LPC algorithm for bundelkhandi isolated digit speech recognition, 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), Chennai, 2016, pp. 3755–3759.

Srinivasan

, Speaker identification and verification using vector quantization and mel frequency cepstral coefficients, Journal of Applied Sciences, Engineering, and Technology 4(1) (2012), 33–40.

Corneliu Octavian

and Gavat

, Feature extraction modeling & training strategies in continuous speech recognition for roman language, in: EU Proceedings of IEEE Xplore, EUROCN, 2005, pp. 1424–1428.

Calvo

J.R.

Fernández

and Hernández

, Channel/handset mismatch evaluation in a biometric speaker verification using shifted delta cepstral features. in: CIARP 2007, LNCS Rueda

Mery

Kittler

, eds, Springer, Heidelberg, Vol. 4756, 2007, pp. 96–105.

Madhu

and Goerge

, Automatic language identification for seven indian languages using higher level features, Signal Processing, Informatics, Communication, and Energy Systems (SPICES), IEEE, 2017, pp. 1–6.

Gonzalez

D.R.

and de Lara

J.R.C.

, Speaker verification with shifted delta cepstral features: Its pseudo-prosodic behaviour, proc I Iberian SLTech, 2009.

Valente

Vijayasenan

and Motlicek

, Speaker diarization of meetings based on speaker role n-gram models, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, 2011, pp. 4416–4419.

10.

Domes

Schulze

and Herpertz

S.C.

, Emotion recognition in borderline personality disorder – a review of the literature, Journal of Personality Disorders 23(1) (2009), 6–9.

11.

González

D.R.

Arias

S.C.

and de Lara

J.R.C.

, Single channel speech enhancement based on zero phase transformation in reverberated environments, The Proceedings of REVERB Challenge, 2014.

12.

Wang

Leung

Lee

and Li

, Shifted-delta MLP features for spoken language recognition, IEEE Signal Processing Letters 20(1) (2013), 15–18.

13.

Freng

Ramabhadran

Hansen

J.H.

and Williams

J.D.

, Trends in speech and language processing, IEEE Signal Processing Magazine 29(1) (2011), 177–179.

14.

Razik

Enac

C.S.

Fohr

Mella

and Parlangeau-Valles

, Comparision of two speech/Music segmentation systems for audio indexing on Web, in: Proc WMSCI’03, Florida, USA, 2003.

15.

Calvo

Fernndez

and Hernndez

, Channel/handset mismatch evaluation in biometric speaker verification using shifted delta cepstral features, Proc of CIARP 2007, LNCS 4756, 2007, pp. 96–105.

16.

Veena

K.V.

and Mathew

, Speaker identification and verification of noisy speech using multitaper MFCC and Gaussian Mixture models, 2015 International Conference on Power, Instrumentation, Control and Computing (PICC), Thrissur, 2015, pp. 1–4.

17.

Patel

and Prasad

R.K.

, Speech recognition and verification using MFCC & VQ, Int J Emerg Sci Eng (IJESE) 1(7) (2013), 137–140.

18.

Lin

Y.L.

and Wei

, Speech emotion recognition based on HMM and SVM, in: 2005 International Conference on Machine Learning and Cybernetics, IEEE 8 (2005), 4898–4901.

19.

Kockmann

and Burget

, Application of speaker- and language identification state-of-the-art techniques for emotion recognition, Speech Communication 53(9–10) (2011), 1172–1185.

20.

Kos

Vlaj

and Kačič

, Speaker’s gender classification and segmentation using spectral and cepstral feature averaging, in: 2011 18th International Conference on Systems, Signals and Image Processing, IEEE, 2011, pp. 1–4.

21.

Navyasri

RajeswarRao

DaveeduRaju

and Ramakrishnamurthy

, Robust features for emotion recognition from speech by using Gaussian mixture model classification, in: International Conference on Information and Communication Technology for Intelligent Systems, Springer, Cham, 2017, pp. 437–444.

22.

Singh

Khan

R.A.

and Shree

, MFCC and prosodic feature extraction techniques: A comparative study, International Journal of Computer Applications 54(1) (2012).

23.

Singh

P.P.

and Rani

, An approach to extract feature using MFCC, IOSR Journal of Engineering 4(8) (2014), 21–25.

24.

Torres-Carrasquillo

P.A.

Singer

Kohlerand

M.A.

Greene

R.J.

Reynolds

D.A.

and Deller

J.R.

, Jr, Approches to language identification using gausian mixture models and shifted delta cepstral features, Proc of ICSLP (2002), 89–92.

25.

Koolagudi

Rastogi

and Rao

, Identification of language using mel-frequency cepstral coefficients (MFCC), Procedia Engineering 38 (2012), 3391–3398.

26.

Jothilakshmi

Ramalingam

and Palanivel

, A hierarchical language identification system for Indian languages, Digital Signal Processing 22(3) (2012), 544–553.

27.

Sadjadi

and Hansen

, Mean Hilbert envelope coefficients (MHEC) for robust speaker and language identification, Speech Communication 72 (2015), 138–148.

28.

Shanthi

T.S.

and Lingam

, Review of feature extraction techniques in automatic speech recognition, International Journal of Scientific Engineering and Technology 2(6) (2013), 479–484.

29.

Agrawal

Shruti

A.K.

and Krishna

C.R.

, Prosodic feature based text dependent speaker recognition using machine learning algorithms, International Journal of Engineering Science and Technology 2(10) (2010), 5150–5157.

30.

Swamy

and Ramakrishnan

K.V.

, An efficient speech recognition system, Computer Science & Engineering: An International Journal (CSEIJ) 3(4) (2013), 21–27.

31.

Kinnunen

Koh

, L.Wang Li

and Chang

E.S.

, Temporal discrete cosine transform: Towards longer term temporal features for speaker verification, in: Proc Fifth Internat Symposium on Chinese Spoken Language Processing (ISCSLP 2006), Singapore, 2006, pp. 547–558.

32.

Torres-Carrasquillo

P.A.

Singer

Kohler

M.A.

Greene

R.J.

Reynolds

D.A.

and Deller

J.R.

, Jr, Approches to language identification using gausian mixture models and shifted delta cepstral features, in: Proc of ICSLP, 2002, pp. 89–92.

33.

Kumar

Vydana

and Vuppala

, Significance of GMMUBM based modelling for indian language identification, Procedia Computer Science 54 (2015), 231–236.

34.

Anguera

Bozonnet

Evans

N.W.D.

Fredouille

Friedland

and Vinyals

, Speaker diarization: A review of recent research, IEEE Transactions on Audio, Speech, and Language Processing 20(2) (2012).