Metaheuristic adapted convolutional neural network for Telugu speaker diarization

Abstract

Speaker diarization plays a pivotal role in speech technology. In general, speaker diarization is the mechanism of partitioning an input audio stream into homogeneous segments based on the identity of the speakers. It improves the readability of automatic transcriptions because it structures the audio stream into speaker turns and often provides the true speaker identity. In this research work, a novel speaker diarization approach is introduced in three major phases: Feature Extraction, Speech Activity Detection (SAD), and Speaker Segmentation and Clustering. Initially, Mel Frequency Cepstral Coefficient (MFCC) based features are extracted from the collected input audio stream (Telugu language). Subsequently, in SAD, the music and silence signals are removed. Then, the acquired speech signals are segmented for each individual speaker. Finally, the segmented signals are subjected to the speaker clustering process, where an Optimized Convolutional Neural Network (CNN) is used. To make the clustering more appropriate, the weight and activation function of the CNN are fine-tuned by a new Self Adaptive Sea Lion Algorithm (SA-SLnO). A comparative analysis is made to exhibit the superiority of the proposed speaker diarization work. Accordingly, the accuracy of the proposed method is 0.8073, which is 5.255, 2.45%, and 0.075, superior to the existing works.

Keywords

Speaker diarization, segmentation, clustering, Telugu language, MFCC, optimization, CNN

1. Introduction

Humans express their ideas and information through speech, which is a tool for communication. Recently, a massive amount of audiovisual content has been generated from diverse sources such as radio broadcasts, meetings, TV channels and lectures [10, 11, 12, 13, 14, 15, 16]. Owing to technological development, virtually unlimited storage capacity is available for generating, storing and delivering audiovisual content at low cost. In this context, an affordable and suitable content management model is needed to retrieve and search the information. In the case of speech, the amount of data is bulky and manual handling of the data is complex. Therefore, it is necessary to have an automatic human language processing model that can efficiently search, index and access the information sources. Compared with text documents, it is challenging to search and assess the information content of audio due to the computational complexity and time consumption involved. Automated processing is therefore essential for searching, indexing and accessing the contents spoken at a particular period of time. Moreover, multi-party dialogues (where more than two people are engaged in a conversation) pose a serious problem when assigning temporal segments of speech to speakers. One obvious solution to this issue is speaker diarization [17, 18, 19].

Speaker diarization refers to splitting the audio signal into homogeneous segments based on the identity of the speakers. It is utilized for annotating the audio stream with “who spoke what and when” [20, 21, 22, 23, 24]. It is well suited for information retrieval in many real-life applications such as speaker recognition, auxiliary video segmentation, multimedia summarization, and telephone and broadcast meetings. The audio sources in all these applications include speech signals from speakers, non-speech signals, music, channel characteristics and background noise. In a speaker diarization system, speech activity detection, speaker segmentation and speaker clustering are considered the crucial components. In the speech activity detection phase, the speech signals are separated from non-speech signals such as music and noise [42]. In the segmentation phase, the audio stream is split into uniform segments on the basis of the speaker identities [25, 26]. In the speaker clustering process, the segmented speech belonging to the same speaker is associated together.

Recently, speaker diarization for the Dravidian languages has been gaining more attention. Telugu is the 3rd most spoken language in India and ranks 15th in the list of most-spoken languages world-wide. It is spoken by 75 million people. Therefore, this research work intends to develop an automatic Telugu speaker diarization system [7]. The literature on existing speaker diarization techniques reports that the success of speaker diarization depends on the extracted features that convey the speaker-specific information and on the appropriate modeling of these features with machine learning models. Features such as MFCC, LPCC and PLP are most commonly extracted from the audio signals. Further, Bayesian analysis, Artificial Neural Networks (ANN) and Gaussian Mixture Models (GMM) are often utilized for the final classification. Because the existing classifiers suffer from increased computational complexity, false accepts (false alarms), false rejects (missed detections) and a high Diarization Error Rate (DER), the models must be refined with tuning mechanisms such as optimization algorithms [34, 35, 36, 30, 31, 32, 33, 41, 46] to obtain appropriate results. Hence, this paper aims to present a metaheuristic adapted CNN for Telugu speaker diarization. Moreover, the presented approach attains a lower error rate on the basis of online decoding, which makes it particularly appropriate for real-time applications. It is also useful for doctors in Electronic Health Record (EHR) documentation.

The major contributions of this research work are:

•
An optimized CNN based speaker clustering process is introduced in this research work, where the weight and activation function of the model are optimized by a new algorithm.
•
A new Self Adaptive Sea Lion Algorithm (SA-SLnO) is proposed for solving the optimization issue mentioned in this work.

The rest of the paper is arranged as follows: Section 2 addresses the literature on speaker diarization. Section 3 gives an architectural description of the proposed Telugu speaker diarization. The feature extraction, speech activity detection and speaker segmentation are described in Section 4. The results acquired with the proposed work are discussed in Section 5, and the paper is concluded in Section 6.
2. Literature review

2.1 Related works

In 2017, Ramaiah and Rao [1] developed a novel speaker diarization system by amalgamating Holoentropy with the eXtended Linear Prediction using autocorrelation Snapshot (HXLPS) and a Deep Neural Network (DNN). The HXLPS was developed by blending the Holoentropy concept with the XLPS. The Voice Activity Detection (VAD) method separated the non-speech signals from the speech signals. Then, the Universal Background Model (UBM) was utilized for obtaining the i-vector representation corresponding to each of the segments. The labels were assigned to the speakers’ signals using the DNN. The outcomes exhibited better diarization performance in terms of DER.

In 2018, Zewoudie et al. [2] proposed the utilization of Voice-Quality Features (V-QF) for i-vector and GMM based speaker diarization systems. The long-term and the short-term cepstral features were extracted using the proposed voice-quality features. The speaker clustering was accomplished via the delta dynamic features. As a result, a substantial DER improvement was observed in the proposed work.

In 2019, Karim et al. [3] introduced an efficient heuristic model by combining the concepts of the “Differential Evolution (DE) algorithm and K-means algorithm”. This was done to solve the optimal non-hierarchical clustering problem in the speaker diarization task, especially in the speaker clustering phase. The speaker classification was optimized with two criteria: the trace within criterion (TRW) and the variance ratio criterion (VRC). The classifications of clusters that needed to be optimized were encoded in the DE algorithm. The evolutionary operators were applied to rearrange the centres in the population. The outcomes of the proposed work exhibited the best efficiency.

In 2020, Diez et al. [4] presented a complete analysis of speaker diarization with a Bayesian Hidden Markov Model. The authors described the model’s inference and provided an evaluation in terms of the “sensitivity and interactions” of all model parameters. They introduced a speaker regularization coefficient (SRC) to control the count of inferred speakers. In addition, the evaluation was done with the CALLHOME and DIHARD datasets. The variational inference was driven out of local optima by employing a naive speaker model merging strategy.

In 2017, Gebru et al. [5] developed a novel spatiotemporal Bayesian Fusion based diarization model for several participants engaged in a multi-party dialogue. A novel Audio-Visual Fusion (A-VF) method was introduced, in which Binaural Spectral Features (BSF) were extracted from a microphone pair. These features were mapped onto the image using a supervised audio-visual alignment technique. Finally, the binaural spectral features were classified using a semi-supervised clustering method.

In 2020, Park et al. [6] proposed a new Spectral Clustering Framework (SCF) for speaker diarization, in which the parameters of the clustering algorithm were auto-tuned. The Normalized Maximum Eigengap (NME) values were utilized in the proposed framework to estimate the number of clusters.

In 2020, Sethuram et al. [7] developed a new speaker diarization or indexing model for the Telugu language. The features were extracted using MFCC and the clustering process was done using an Optimized ANN. In the Optimized ANN, the training was done using a hybrid of the Artificial Bee Colony (ABC) and the Lion Algorithm (LA). As a consequence, an enhancement in the accuracy rate was observed.

Table 1
Features and challenges of existing speaker diarization approaches

Author [Citations]	Methodology	Features	Challenges
Ramaiah and Rao [1]	HXLPS and DNN	• Achieves lower DER • Enhances the readability of the speech transcription	• False alarm rate can be improved • The tracking distance can be improved
Zewoudie et al. [2]	GMM- and i-vector	• Provides the minimum variance • Provide a substantial relative DER	• DER values increase with the increase in the count of iterations • Over-clustering problem was not solved
Karim et al. [3]	Hidden Markov models	• Have a high-confidence level • Achieves maximum accuracy value (95%)	• Needs to avoid the local optimum • Need to avoid premature convergence. • The training and testing times can be reduced
Diez et al. [4]	Bayesian Hidden Markov Model	• Drives the variational inference from local optima • Effectively reduces the frame rate • The Variational Bayes inference was speedup	• Sensitivity of the diarization performance can be increased. • Lower convergence
Gebru et al. [5]	Spatiotemporal Bayesian Fusion	• Can deal with simultaneously speaking participants • Handles fragmented tracks robustly	• Higher computational complexity • Applicable only to limited number of speakers
Park et al. [6]	Normalized Maximum Eigengap	• Avoids hyper-parameter tuning • Applicable for optimized set	• Poor estimation of the number of clusters • Noisy similarity values need to be reduced
Sethuram et al. [7]	ANN-ABC-LA	• Minimized error rate • Attained minimized DER rate	• The loss in data can be reduced • The Time of execution can be improved • The accuracy can be improved
Cyrta et al. [8]	Recurrent convolutional neural network	• Reduces diarization error rate	• Overall accuracy can be improved

In 2017, Cyrta et al. [8] proposed a new deep learning-based speaker diarization approach. In the context of speaker diarization, they learned low-level speaker embeddings using a recurrent CNN architecture. The experimental evaluation of the proposed model exhibited a reduction in the diarization error rate.

In 2019, Gupta et al. [43] presented an integrated scheme that produces the opinionated performance, on the basis of extractive summaries and graphics, from a large set of mobile reviews. The presented scheme consists of three major stages: computation of the sentiment polarity of all aspects, generation of the opinionated performance based on extractive and graphical summaries, and recognition of aspects in the known field. Further, the proposed scheme was executed on a dataset of three mobile reviews and attained higher recall as well as precision when compared with the baseline method.

In 2020, Gupta et al. [44] utilized diverse conventional classification techniques for identifying human emotion and carried out a comprehensive comparative analysis on the basis of mathematical and statistical results. Moreover, an optimal model, DSCNN, was presented to improve the human emotion assessment scheme; it offered an un-weighted accuracy of 61% for the raw spectrogram and 79% for the clean spectrogram.

In 2017, Piryani et al. [45] presented an aspect-level sentiment analysis of movie reviews. Rule-based linguistic schemes were offered that recognize the aspects in movie reviews, locate the opinion about each aspect, and calculate the sentiment polarity of that opinion. Further, the experiment was carried out on two movie datasets. The outcome showed higher accuracy and was promising for use in an integrated opinion profiling system.

2.2 Review

The recent speaker diarization works are discussed in the literature section, and their features and challenges are summarized in Table 1. The HXLPS and DNN deployed in [1] attained a lower DER, but the count of false alarms generated was higher. The GMM and i-vector approach in [2] provides the minimum variance; its major drawback is that the DER tends to increase with the count of iterations. The Hidden Markov models in [3] have a high confidence level, but the training and testing times need to be reduced to make this approach more significant. The Bayesian Hidden Markov Model utilized in [4] reduces the frame rate effectively and hence increases the accuracy of the diarization; on the other hand, the sensitivity of the diarization performance can be improved. Further, the Spatiotemporal Bayesian Fusion in [5] can deal with multiple speakers who speak at the same time, but its major drawback is a higher computational complexity. The Normalized Maximum Eigengap in [6] avoids hyper-parameter tuning; however, the noisy similarity values need to be reduced to make it more accurate. The ANN+ABC-LA in [7] exhibits lower error rates, but the loss in data needs to be reduced to make this approach attractive for large-scale applications. The RCNN utilized in [8] is capable of discriminating between larger corpora of speakers, but the overall accuracy of the speaker diarization is lower. These drawbacks together motivate the current research work to develop a better speaker diarization model with higher accuracy and lower DER.

Figure 1.

Overall architecture of the proposed speaker diarization approach.

3. Proposed Telugu speaker diarization: An architectural description

3.1 Architecture of proposed Telugu speaker diarization

Speaker diarization belongs to the speech processing category. Here, the speaker is identified along with the boundary/frame of the speech spoken (i.e. who spoke when) by automatically partitioning the audio signal into homogeneous segments. Speaker diarization is the combination of speaker segmentation and speaker clustering. Telugu is the 3rd most spoken language in India and is the official language of Pondicherry, Yanam, Telangana and Andhra Pradesh. As Telugu is also the 15th most spoken language world-wide, the current research work intends to develop a novel Telugu speaker diarization model with the aid of a deep learning approach and optimization tactics. The steps followed in the proposed work are depicted below:

Step 1 (Data Collection): Initially, the input audio stream corresponding to multiple Telugu speakers is collected. Let the collected raw speech data (audio signal) be denoted as $D$ and let the $N$ speakers taking part in the conversation be denoted as $P=\left\{{P_{1},P_{2},\ldots,P_{N}}\right\}$ .

Step 2 (Feature Extraction): The MFCC features $\left(F\right)$ are extracted from $D$ .

Step 3 (Speech Activity Detection): Here, the silence and the music are removed from the extracted features $\left(F\right)$ . The silence removed signal is denoted as $S$ and the music removed signal is denoted as $M$ .

Step 4 (Speaker Segmentation): Subsequently, the silence removed audio $S$ and the music removed audio $M$ are subjected to segmentation, and the speaker segmented audio is denoted as Seg.

Step 5 (Speaker clustering): The segmented audio signal Seg is clustered into $k$ clusters using the optimized CNN, whose weight $\left(W\right)$ and activation function $\left(A\right)$ are optimized via the SA-SLnO algorithm. This clustering is accomplished with respect to the identity of the speaker. The speaker clustered audio signal is denoted as $C=C_{1},C_{2},\ldots C_{c}$ where $1\leqslant c\leqslant M$ . At the end of clustering, the diarized output is acquired. The framework of the proposed speaker diarization model is shown in Fig. 1.

4. Feature extraction, speech activity detection and speaker segmentation

4.1 Feature extraction

MFCC is the renowned audio feature extraction method with less computational complexity. The key objectives of MFCC are exhibited below:

•
The extracted features $\left(F\right)$ are made independent.
•
Adjusts the frequency and the loudness of sound perceived by the humans.
•
Can capture the “dynamics of phones”.
•
Vocal fold excitation (pitch information) is removed.

The filtering stages of MFCC are shown in Fig. 2. These stages are elaborated below.

Figure 2.
General architecture of MFCC.

The collected raw speech signal is subjected to the several stages of filters of MFCC [37] for feature extraction.

Pre-emphasis: The input data $D$ is subjected to a pre-emphasis filter, which balances the spectrum of voiced sounds by boosting the high-frequency components of the speech signal. The transfer function of the pre-emphasis filter is shown in Eq. (1), where $a$ is the parameter that controls the slope of the filter.

$\displaystyle H(Z)=1-a.Z^{-1}$ (1)

Framing: The pre-emphasised signal is split into smaller segments of equal size (20–30 ms), referred to as frames $\left(\textit{frame}\right)$ . This is often employed for slowly time-varying or quasi-stationary signals so that the acoustic characteristics are stable within each frame.

Windowing: The frames $\left(\textit{frame}\right)$ are multiplied with a “hamming window” to lessen the discontinuity of the signal.

Fast Fourier Transform (FFT): The time domain signal acquired after windowing is converted into frequency domain by FFT. The outcome from FFT is a periodogram or a spectrum.
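The pre-emphasis, framing, windowing and FFT stages described so far can be sketched in NumPy as below. This is only an illustrative sketch: the function names, the pre-emphasis coefficient `a=0.97`, the 25 ms frame with 10 ms hop and the FFT size of 512 are assumptions chosen for the example and are not values reported in this work.

```python
import numpy as np

def preemphasis(signal, a=0.97):
    """Time-domain form of H(z) = 1 - a z^-1: y[n] = x[n] - a * x[n-1]."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping frames (one per row).

    Samples after the last complete frame are dropped.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def power_spectrum(frames, n_fft=512):
    """Multiply each frame by a Hamming window and return its periodogram."""
    windowed = frames * np.hamming(frames.shape[1])
    spectrum = np.fft.rfft(windowed, n=n_fft, axis=1)
    return (np.abs(spectrum) ** 2) / n_fft
```

For a one-second signal sampled at 16 kHz, these assumed settings give 98 frames of 400 samples each, and each frame yields 257 periodogram bins.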

Mel-Frequency Cepstrum (MFC): It is based on the “linear cosine transform of a log power spectrum” on a nonlinear Mel scale of frequency. The mapping from the frequency to the Mel scale is computed using Eq. (2).

$\displaystyle\textit{Mel}_{\textit{freq}}=2595\,\log_{10}\left(1+\frac{\textit{freq}}{700}\right)$ (2)

where freq (Hz) is the frequency of the signal and the perceived frequency is denoted as $\textit{Mel}_{\textit{freq}}$ . In the frequency as well as the time domain, the MFCC can be computed by employing a filter bank (“triangular shaped filter or Hanning filter”).
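As an illustration of Eq. (2), the following sketch converts between Hz and the Mel scale and builds a bank of triangular filters that are equally spaced on the Mel scale. The helper names and the example settings (26 filters, FFT size 512, 16 kHz sampling) are assumptions for illustration only; the construction of the filter bank is a common textbook recipe and is not claimed to be the exact one used here.

```python
import numpy as np

def hz_to_mel(freq_hz):
    """Eq. (2): perceived (Mel) frequency for a frequency in Hz."""
    return 2595.0 * np.log10(1.0 + np.asarray(freq_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    """Inverse of Eq. (2)."""
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres equally spaced on the Mel scale.

    Returns an array of shape (n_filters, n_fft // 2 + 1).
    """
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for p in range(1, n_filters + 1):
        left, center, right = bins[p - 1], bins[p], bins[p + 1]
        for k in range(left, center):       # rising edge of the triangle
            bank[p - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge of the triangle
            bank[p - 1, k] = (right - k) / (right - center)
    return bank
```

Multiplying a periodogram (one row per frame) by the transpose of this bank gives the filter outputs that are passed to the DCT stage.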

Discrete Cosine Transform (DCT): The DCT is applied to the transformed Mel frequency coefficients to produce the set of cepstral coefficients. In general, the DCT is employed for reducing the redundancy present in the audio information and for enhancing the speed of the system. The DCT is computed using Eq. (3).

$\displaystyle C_{\textit{coeff}}(m)=\sum\limits_{p=1}^{P}\log_{10}\left(s(p)\right)\cos\left({\frac{\pi\,m\,(p-0.5)}{P}}\right);\quad m=0,1,\ldots,J-1$ (3)

Here, $C_{\textit{coeff}}(m)$ is the $m^{th}$ cepstral coefficient and $J$ is the count of MFCC. In addition, $s(p)$ is the computed outcome from the $p^{th}$ triangular shaped filter or Hanning filter and $P$ is the count of triangular Mel weighting filters. The features extracted from MFCC are denoted as $F$ , which is subjected to the SAD process.
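A minimal sketch of the DCT stage is given below, reading Eq. (3) as a cosine transform over the logarithms of the $P$ filter-bank outputs. The function name and the small floor used to avoid taking the logarithm of zero are assumptions added for the example.

```python
import numpy as np

def cepstral_coefficients(filter_energies, n_coeffs):
    """Cepstral coefficients from filter-bank outputs, following Eq. (3).

    filter_energies: array of shape (n_frames, P) with the filter outputs s(p).
    n_coeffs: number of coefficients J to keep (m = 0, ..., J - 1).
    Returns an array of shape (n_frames, J).
    """
    s = np.asarray(filter_energies, dtype=float)
    P = s.shape[1]
    log_s = np.log10(np.maximum(s, 1e-10))     # floor avoids log10(0)
    m = np.arange(n_coeffs)[:, None]           # coefficient index 0..J-1
    p = np.arange(1, P + 1)[None, :]           # filter index 1..P
    basis = np.cos(np.pi * m * (p - 0.5) / P)  # shape (J, P)
    return log_s @ basis.T
```

When every filter output equals 10, the logarithm is 1 for all filters, so the zeroth coefficient equals $P$ and all higher coefficients vanish, which is a quick sanity check of the cosine basis.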
4.2 SAD: Music and silence removal

In this research work, two decoupled stages are utilized for implementing the SAD subsystem. Initially, the silence signals are removed from $F$ with the aid of repetitive classification followed by energy based bootstrapping. In public meetings, external music signals might also interfere and degrade the quality of the speech. Therefore, the “music as well as extra perceptible non-speech signals” are removed. The silence removal and music removal mechanisms are described in detail below:

•
Silence removal: In this research work, the silence (i.e. the non-audible part of the signal) is removed from $F$ using the Short Time Energy (STE) combined with 19 MFCC features and the 1st and 2nd order derivatives of the signal. The bootstrap segmentation allots a confidence value to both the speech and silence classes. The bootstrap silence model is trained with Gaussian mixtures of size 4 on the 60 dimensional feature space, and the speech model is trained with the same size. As a result, all the frames of $F$ are split into two classes: speech and silence. As the count of iterations increases, the silence signals are modelled with a 60 dimensional Gaussian and the speech signals with a GMM. The optimal outcomes are obtained for the speech signal at GMM size $=$ 30 and for non-speech at GMM size $=$ 16. At the end, the non-audible signals such as silence and pauses are removed, while the audible signals such as speech and music remain. Therefore, the silence removed signal $S$ is subjected to music removal.
•
Music removal: The bootstrap segmentation prevents the loss of informative speech, and so it is utilized in this research work. Features such as STE and Zero Crossing Rate (ZCR) are extracted from $S$ , and the high confidence frames of the speech and music classes are first utilized for training the bootstrap segmentation. The same iterative classification that is utilized for silence removal is utilized here for generating the speech and music classes. As a result, the music signal is removed from the speech signal, and the speech within music segments is refined to acquire the ensured speech signal $M$ . In the iterative classification step, the STE is not used. The final speech outcome from the SAD is denoted as $y$ . Then, the speaker segmentation takes place.
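The two frame-level features named above, STE and ZCR, and the idea of seeding the bootstrap models with high confidence frames can be sketched as follows. This is a simplified illustration under stated assumptions: the function `bootstrap_labels` and its percentile thresholds are hypothetical and stand in for the energy based bootstrapping, whose exact thresholds are not specified here; the GMM training and iterative reclassification are not shown.

```python
import numpy as np

def short_time_energy(frames):
    """Sum of squared samples in each frame (one frame per row)."""
    frames = np.asarray(frames, dtype=float)
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    """Fraction of adjacent sample pairs whose sign differs, per frame."""
    frames = np.asarray(frames, dtype=float)
    signs = np.where(frames >= 0, 1, -1)   # zero is treated as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def bootstrap_labels(energy, low_pct=10.0, high_pct=90.0):
    """Hypothetical seeding rule for the bootstrap models.

    Frames at or below the low percentile are labelled 0 (silence), frames at
    or above the high percentile are labelled 1 (speech), the rest -1
    (undecided, to be resolved by the iterative classification).
    """
    energy = np.asarray(energy, dtype=float)
    lo, hi = np.percentile(energy, [low_pct, high_pct])
    labels = np.full(len(energy), -1)
    labels[energy <= lo] = 0
    labels[energy >= hi] = 1
    return labels
```

Only the confidently labelled frames would be used to train the initial class models; the undecided frames are assigned in later iterations.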

4.3 Speaker segmentation

The Window-growing-based segmentation (WinGrow) is a popular distance-based segmentation approach, which is employed in this research work for measuring the distance between two audio segments. The Bayesian Information Criterion (BIC) is utilized for evaluating two hypotheses: (a) in the feature space, the feature vectors of the two segments ( $M$ and $S$ ) form a single Gaussian cluster (GC), and (b) the feature vectors of each segment form a distinct Gaussian cluster. The distance measure is $\Delta\textit{BIC}$ , which is the difference between the two evaluation scores. Initially, a small analysis window is placed at the beginning of the audio stream. If no change point is detected, the search range is increased.

Figure 3.

Speaker clustering with optimized CNN.

(a) One-change-point detection: Initially, a single speaker change is searched for from the starting point of the audio stream by fixing the search window size $w$ . If a variation is identified, the evaluation continues from the next frame. In this estimation, the $\Delta\textit{BIC}$ distance is evaluated for all frames that are within the search window. When the maximal size of window $w_{\max}$ exceeds the threshold value $\psi$ , the frame is declared a change point. On the other hand, when $w_{\max}$ is below $\psi$ , the frame is considered to be a non-variation and so $w$ is increased. This process is repeated until a change is identified. In this way, the change points in the audio signal are identified and the respective positions of the speakers are located in the original audio. In previous research, the segmentation process was carried out based on two major criteria: $\Delta\textit{BIC}$ based change recognition with $\psi$ , and combination of the segments having a positive $\Delta\textit{BIC}$ score. The major reason behind this implementation was to overcome the over-segmentation problem. However, the computational complexity of the system was increased. Therefore, only the maxima that exceed $\psi$ are allowed for further processing. The over-segmentation is thereby minimized, and the mathematical formula for the $\Delta\textit{BIC}$ computation is stated in Eq. (4). Here, Mat is the sample covariance matrix of the feature vectors corresponding to the input audio stream and the dimension of the feature vector is denoted as $d$ . The outcome from the segmentation phase is denoted as seg.

$\displaystyle\Delta\textit{BIC}\left({y_{i}}\right)=M\log\left|\textit{Mat}\right|-M_{1}\log\left|\textit{Mat}_{1}\right|-M_{2}\log\left|\textit{Mat}_{2}\right|-\frac{\gamma}{2}\left({d+\frac{1}{2}d\left({d+1}\right)}\right)\log M$ (4)
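The $\Delta\textit{BIC}$ score of Eq. (4) for one candidate change point can be sketched as follows. The sketch assumes that $M$, $M_{1}$ and $M_{2}$ are the numbers of feature vectors in the whole window and in the two sub-segments, that the third covariance term refers to the second sub-segment, and that the penalty uses the feature dimension $d$; the function name and the default `gamma=1.0` are illustrative choices.

```python
import numpy as np

def delta_bic(features, split, gamma=1.0):
    """Delta-BIC of Eq. (4) for splitting `features` at row index `split`.

    features: array of shape (M, d), one feature vector per row.
    With this sign convention, larger values favour modelling the two
    sub-segments with separate Gaussians.
    """
    x = np.asarray(features, dtype=float)
    M, d = x.shape
    first, second = x[:split], x[split:]

    def weighted_logdet(segment):
        # Sample covariance (maximum-likelihood form) and its log-determinant.
        cov = np.cov(segment, rowvar=False, bias=True).reshape(d, d)
        _, logdet = np.linalg.slogdet(cov)
        return len(segment) * logdet

    penalty = 0.5 * gamma * (d + 0.5 * d * (d + 1)) * np.log(M)
    return (weighted_logdet(x) - weighted_logdet(first)
            - weighted_logdet(second) - penalty)
```

In a window-growing search, this score would be evaluated for every candidate split inside the current window, and the largest value compared with the threshold $\psi$.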

4.4 Optimized CNN

In this work, the optimized CNN [39, 40] is used to cluster the segmented speech signal seg into $k$ clusters (Fig. 3). The CNN architecture has three major types of layers: “convolutional layer, pooling layer, and fully-connected layers”. In the convolutional layer, each neuron of the present layer is connected to a region of neurons in the previous layer, and this region is referred to as the neuron’s receptive field. Further, several kernels combine together to form the complete feature maps. Mathematically, the $g^{th}$ feature map on the $l^{th}$ layer at location $\left({u,v}\right)$ is represented as $B_{u,v,g}^{l}$ , as shown in Eq. (5).

$\displaystyle B_{u,v,g}^{l}=\left(\textit{Wt}_{g}^{l}\right)^{T}\textit{seg}_{u,v}^{l}+\textit{bias}_{g}^{l}$ (5)

where $\textit{Wt}_{g}^{l}$ and $\textit{bias}_{g}^{l}$ are the weight vector and the bias of the $g^{th}$ feature map on the $l^{th}$ layer, and $\textit{seg}_{u,v}^{l}$ is the input patch on the $l^{th}$ layer at location $\left({u,v}\right)$ . Here, the value of the weight Wt is fine-tuned by the new SA-SLnO algorithm. Moreover, the algorithm is responsible for selecting the appropriate activation function among “sigmoid, tanh and ReLU” as per the defined objective (fitness). The nonlinear features are detected by this activation function, which is denoted as $\textit{act}(\bullet)$ . The activation value in the $g^{th}$ feature map on the $l^{th}$ layer at location $\left({u,v}\right)$ is represented as $\left({\textit{act}_{u,v,g}^{l}}\right)$ and is computed for the convolutional features $B_{u,v,g}^{l}$ using Eq. (6).

$\displaystyle\textit{act}_{u,v,g}^{l}=\textit{act}\left({B_{u,v,g}^{l}}\right)$ (6)

Further, the shift-invariance is achieved by the pooling layer $\textit{pool}\left(\bullet\right)$ and this is accomplished by means of reducing the “resolution” of the feature map. The pooling function is computed for $\textit{act}_{u,v,g}^{l}$ using Eq. (7) and this is symbolized as $I_{u,v,g}^{l}$ .

$\displaystyle I_{u,v,g}^{l}=\textit{pool}\left({\textit{act}_{u,v,g}^{l}}\right),\forall\left({u,v}\right)\in\Re_{u,v}$ (7)

Here, $\Re_{u,v}$ is the local neighborhood around the location $\left({u,v}\right)$ . The output layer is the final layer of the CNN, where the softmax operator performs the final detections. The CNN’s loss function $\left(\textit{Loss}\right)$ is computed as per Eq. (8), where $Q$ denotes the count of input-output relations $\left\{{\left({\textit{seg}^{\left(q\right)},I^{\left(q\right)}}\right);q\in\left[{1,\ldots,Q}\right]}\right\}$ . Here, $\theta$ denotes the overall parameters, $\textit{seg}^{\left(q\right)}$ is the $q^{th}$ input data, $I^{\left(q\right)}$ is the corresponding target label and the output of the CNN is represented as $O^{\left(q\right)}$ . Figure 3 exhibits the speaker clustering model with the optimized CNN.

$\displaystyle\textit{Loss}=\frac{1}{Q}\sum\limits_{q=1}^{Q}{l\left({\theta;I^{\left(q\right)},O^{\left(q\right)}}\right)}$ (8)
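The operations in Eqs (5) to (8) can be illustrated for a single feature map with the NumPy sketch below. It is a simplified illustration, not the network used in this work: a single 2-D kernel, max pooling as the pooling function and cross-entropy as the per-sample loss $l$ are assumptions made for the example, as are all function names.

```python
import numpy as np

# Candidate activation functions among which the optimizer selects.
ACTIVATIONS = {
    "sigmoid": lambda b: 1.0 / (1.0 + np.exp(-b)),
    "tanh": np.tanh,
    "relu": lambda b: np.maximum(b, 0.0),
}

def conv_feature_map(seg, weight, bias):
    """Eq. (5): inner product of the kernel with each input patch, plus bias."""
    kh, kw = weight.shape
    out_h, out_w = seg.shape[0] - kh + 1, seg.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for u in range(out_h):
        for v in range(out_w):
            out[u, v] = np.sum(weight * seg[u:u + kh, v:v + kw]) + bias
    return out

def max_pool(act, size=2):
    """Eq. (7) with max pooling over non-overlapping size x size blocks."""
    h, w = (act.shape[0] // size) * size, (act.shape[1] // size) * size
    blocks = act[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def mean_cross_entropy(probs, labels):
    """Eq. (8) with cross-entropy as the per-sample loss.

    probs: (Q, n_classes) softmax outputs; labels: Q integer target labels.
    """
    probs = np.asarray(probs, dtype=float)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.maximum(picked, 1e-12))))
```

A forward pass for one feature map is then `max_pool(ACTIVATIONS[name](conv_feature_map(seg, weight, bias)))`, where `name` is the activation chosen by the optimizer.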

4.5 Fitness evaluation (objective) and solution encoding

The fitness or objective of the speaker clustering is defined in Eq. (9), where Loss indicates the error function of the CNN as defined in Eq. (8). The input solution to the proposed algorithm consists of the weight $\left(\textit{Wt}\right)$ and the activation function $\left(\textit{act}\right)$ of the CNN; fine-tuning these values ensures a precise speaker clustering process. The input solution given to the proposed SA-SLnO is shown in Fig. 4.

$\displaystyle\textit{Obj}=\textit{Min(Loss)}$ (9)
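One possible way to encode a candidate solution and evaluate the objective of Eq. (9) is sketched below. The layout is an assumption made for illustration (the exact encoding is given only in Fig. 4): the flat vector is taken to hold the weights first, followed by a single gene in $[0,1)$ that selects the activation function. The names `decode_solution`, `fitness` and `loss_fn` are hypothetical.

```python
import numpy as np

ACTIVATION_NAMES = ("sigmoid", "tanh", "relu")

def decode_solution(solution, weight_shape):
    """Split a flat candidate into CNN weights and an activation choice.

    Assumed layout: prod(weight_shape) weight values followed by one gene
    in [0, 1) that is mapped onto the three activation names.
    """
    solution = np.asarray(solution, dtype=float)
    n_weights = int(np.prod(weight_shape))
    weights = solution[:n_weights].reshape(weight_shape)
    gene = float(np.clip(solution[n_weights], 0.0, 1.0 - 1e-9))
    activation = ACTIVATION_NAMES[int(gene * len(ACTIVATION_NAMES))]
    return weights, activation

def fitness(solution, weight_shape, loss_fn):
    """Objective value to be minimized: the CNN loss for this candidate.

    loss_fn(weights, activation) is a user-supplied callable that runs the
    CNN with the given weights and activation and returns the loss.
    """
    weights, activation = decode_solution(solution, weight_shape)
    return loss_fn(weights, activation)
```

The optimizer then searches for the candidate with the smallest `fitness` value.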

4.6 Proposed SA-SLnO algorithm

The traditional Sea Lion Algorithm (SLnO) is based on the hunting behavior of sea lions. It is good at solving complex optimization problems, but suffers from slow convergence and low minimization accuracy. Therefore, in this research work, an improved version of SLnO referred to as SA-SLnO is introduced for achieving more accurate speaker diarization by fine-tuning act and Wt of the CNN. The steps followed in SA-SLnO are described below:

Figure 4.

Solution encoding.

Step 1 (Initialization): The population Pop of search agents (sea lions) is initialized. The current iteration is denoted as iter and the maximal count of iterations is denoted as $\max(\textit{iter})$ . The leader sea lion’s sound speed is set as $\overrightarrow{\textit{Sp}_{\textit{leader}}}=1$ . The speed of the sound made by the sea lion in water is denoted as $\overrightarrow{V_{1}}$ , and $\overrightarrow{V_{2}}$ is the speed of the sound made by the sea lion in air.

Step 2: Compute the fitness of the search agents using Eq. (9).

Step 3: while $\textit{iter}<\max(\textit{iter})$ , proceed to the following steps else terminate the algorithm.

Step 4: On the basis of the objective function (minimization of loss in CNN), shown in Eq. (9), compute the fitness of the overall iteration and identify the best fitness $\textit{best}_{\textit{fit}}$ .

Step 5: If $\overrightarrow{\textit{Sp}_{\textit{leader}}}=1$ and if $\textit{abs(z)}<1$ , then move to the detect and track phase of the traditional SLnO. The sea lions make use of their whiskers to identify the location, dimension and form of the target (prey). Once a prey is identified, the sea lion calls its associates in the group to join it in the hunting process. This behaviour of hunting the prey is expressed in Eq. (10).

$\displaystyle\overrightarrow{\textit{Dist}}=\left|{2.\overrightarrow{G}.\overrightarrow{\textit{Pos}}(\textit{iter})-\overrightarrow{\textit{SL}}(\textit{iter})}\right|$ (10)

Here, the distance between the sea lion and the prey (target) is denoted as $\overrightarrow{\textit{Dist}}$ and $\overrightarrow{\textit{SL}}(\textit{iter})$ is the sea lion’s position vector. In addition, $\overrightarrow{\textit{Pos}}(\textit{iter})$ is the position vector of the target prey. In the next iteration $(\textit{iter}+1)$ , the sea lion moves closer to the target prey. This is mathematically expressed in Eq. (11).

$\displaystyle\overrightarrow{\textit{SL}}(\textit{iter}+1)=\overrightarrow{\textit{Pos}}(\textit{iter})-\overrightarrow{\textit{Dist}}.\overrightarrow{z}$ (11)

Step 6: If $\overrightarrow{\textit{Sp}_{\textit{leader}}}1$ and $\textit{abs}(z)>1$, then move to the exploitation phase of the traditional SLnO. The exploitation phase has two sub-phases: the dwindling encircling technique and the circle updating position. The behavior of the sea lion in the dwindling encircling technique depends on the parameter $\overrightarrow{z}$. The value of $\overrightarrow{z}$ is decreased linearly from 2 to 0, which moves the sea lion towards the prey so that it encircles it. The mathematical formula for the circle updating position is shown in Eq. (12).

$\displaystyle\overrightarrow{\textit{SL}}(\textit{iter}+1)=\left|\overrightarrow{\textit{Pos}}(\textit{iter})-\overrightarrow{\textit{SL}}(\textit{iter})\right|.\textit{Cos}(2\pi t)+\overrightarrow{\textit{Pos}}(\textit{iter})$ (12)

Here, $\left|\overrightarrow{\textit{Pos}}(\textit{iter})-\overrightarrow{\textit{SL}}(\textit{iter})\right|$ is the distance between the sea lion and the best solution. In addition, $t$ denotes a random number in the range $-1$ to 1. The $\textit{Cos}(2\pi t)$ term represents the sea lion hunting the prey by moving along the encircling path.

Step 7: If $\overrightarrow{\textit{Sp}_{\textit{leader}}}\neq 1$, then perform the exploration phase of the traditional SLnO. In the exploitation phase, the positions of the sea lions are updated on the basis of the best search agent. In the exploration phase, when $\overrightarrow{z}>1$, the global optimal solution is sought by performing a global search using Eqs (13) and (14), respectively.

$\displaystyle\overrightarrow{\textit{Dist}}=\left|2.\overrightarrow{G}.\overrightarrow{\textit{SL}}_{\textit{rand}}(\textit{iter})-\overrightarrow{\textit{SL}}(\textit{iter})\right|$ (13)

$\displaystyle\overrightarrow{\textit{SL}}(\textit{iter}+1)=\overrightarrow{\textit{SL}}_{\textit{rand}}(\textit{iter})-\overrightarrow{\textit{Dist}}.\overrightarrow{z}$ (14)

Here, $\overrightarrow{\textit{SL}}_{\textit{rand}}(\textit{iter})$ is a sea lion selected at random from the current population.

Step 8: Find the best fitness of the current iteration by computing the objective function. The resultant best solution is denoted as $\textit{best}(\textit{after})_{\textit{fit}}$. If $\textit{best}(\textit{after})_{\textit{fit}}>\textit{best}_{\textit{fit}}$, then change the value of $\textit{Sp}_{\textit{leader}}$ to 2 and repeat the algorithmic steps from Step 2.

Figure 5.

Performance Evaluation of the proposed and existing works for speaker diarization under Test case 1 (a) Accuracy, (b) DER, (c) FDR and (d) FNR.

Step 9: Terminate.
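The steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the function names `track` and `sa_slno`, the box bounds, the population size and the toy objective are hypothetical; $\overrightarrow{G}$ is assumed to be a uniform random number per dimension; the prey position $\overrightarrow{\textit{Pos}}$ is assumed to be the best solution found so far; the circle update of Eq. (12) is assumed to apply whenever $\textit{Sp}_{\textit{leader}}=1$ and $\textit{abs}(z)\geq 1$; and the text does not say whether $\textit{Sp}_{\textit{leader}}$ is ever reset, so the sketch leaves it at 2 once switched.

```python
import math
import random


def track(target, sl, z, rng):
    """Eqs. (10)-(11) and (13)-(14): Dist = |2.G.target - SL|, SL_new = target - Dist.z."""
    new = []
    for tgt_d, sl_d in zip(target, sl):
        g = rng.random()  # assumption: G is a uniform random number in [0, 1)
        dist = abs(2.0 * g * tgt_d - sl_d)
        new.append(tgt_d - dist * z)
    return new


def sa_slno(fitness, dim, pop_size=20, max_iter=100, lower=-1.0, upper=1.0, seed=0):
    """Minimize `fitness` over the box [lower, upper]^dim; returns (best_pos, best_fit)."""
    rng = random.Random(seed)
    # Step 1: initialize the population and the leader's sound speed.
    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(pop_size)]
    sp_leader = 1
    # Steps 2 and 4: evaluate the fitness and record the best agent (assumed prey).
    fit = [fitness(sl) for sl in pop]
    best_fit = min(fit)
    best_pos = list(pop[fit.index(best_fit)])

    for it in range(max_iter):  # Step 3
        z = 2.0 * (1.0 - it / max_iter)  # z decreases linearly from 2 towards 0
        for i in range(pop_size):
            sl = pop[i]
            if sp_leader == 1 and abs(z) < 1:
                # Step 5: detect and track, Eqs. (10)-(11).
                new = track(best_pos, sl, z, rng)
            elif sp_leader == 1:
                # Step 6 (assumed condition): circle updating position, Eq. (12).
                t = rng.uniform(-1.0, 1.0)
                c = math.cos(2.0 * math.pi * t)
                new = [abs(p - s) * c + p for p, s in zip(best_pos, sl)]
            else:
                # Step 7: exploration around a random sea lion, Eqs. (13)-(14).
                rand_sl = pop[rng.randrange(pop_size)]
                new = track(rand_sl, sl, z, rng)
            # Keep the agent inside the search box (an assumption of this sketch).
            pop[i] = [max(lower, min(upper, v)) for v in new]

        # Step 8: compare the best fitness after the update with the stored one.
        fit = [fitness(sl) for sl in pop]
        after_fit = min(fit)
        if after_fit > best_fit:
            sp_leader = 2  # self-adaptive switch to exploration
        else:
            best_fit = after_fit
            best_pos = list(pop[fit.index(after_fit)])
    return best_pos, best_fit  # Step 9


def sphere(x):
    """Toy objective standing in for the CNN loss of Eq. (9)."""
    return sum(v * v for v in x)


best_pos, best_fit = sa_slno(sphere, dim=3, pop_size=10, max_iter=30, seed=1)
```

In the diarization setting, `fitness` would evaluate the CNN loss for a candidate encoding of weights and activation function; the sphere function is used here only so that the sketch runs on its own.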

5. Results and discussion

5.1 Simulation procedure

The proposed speaker diarization model based on the SA-SLnO optimized CNN was implemented in MATLAB and the acquired results were recorded. The collected Telugu audio signal is split into five test cases: Test case 1, Test case 2, Test case 3, Test case 4 and Test case 5. The presented model (SA-SLnO with CNN) is compared with existing models, namely ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA. The evaluation is made in terms of accuracy, False Discovery Rate (FDR), diarization error (DER), False Negative Rate (FNR), and False Positive Rate (FPR). All the evaluations are done by varying the learning percentage (LP) from 40 to 80.
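As a point of reference, the measures reported in the tables of this section can be computed from confusion-matrix counts with their standard definitions. The sketch below is an assumption about how such measures are usually obtained, not a transcription of the MATLAB code used in this work; the function name `diarization_measures` and the example counts are hypothetical.

```python
import math


def diarization_measures(tp, tn, fp, fn):
    """Standard confusion-matrix measures; assumes every denominator is non-zero."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "fpr": fp / (fp + tn),   # equals 1 - specificity
        "fnr": fn / (fn + tp),   # equals 1 - sensitivity
        "npv": npv,
        "fdr": fp / (fp + tp),   # equals 1 - precision
        "f1_score": 2 * precision * sensitivity / (precision + sensitivity),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }


# Hypothetical counts, used only to exercise the function.
measures = diarization_measures(tp=40, tn=45, fp=5, fn=10)
```

The complementary pairs (FPR and specificity, FNR and sensitivity, FDR and precision) each sum to one, which is a quick consistency check when reading the result tables.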

5.2 Performance analysis for Test case 1

In this section, the performance analysis is done for Test case 1 with respect to accuracy, FDR, diarization error, FNR, and FPR. The corresponding results are shown in Fig. 5. The accuracy of speaker diarization is the key parameter that decides the potential of the proposed work. The accuracy of the presented work is higher for certain variations in the learning percentage. At LP $=$ 40, the presented work is 6.6%, 5.3% and 2.687% better than the extant ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. In addition, DER minimization is the major objective, and it has not been met in most prior research. In contrast, the DER of the presented work is lower than that of the extant works. At LP $=$ 80, the presented work is 60%, 23.7% and 16% better than the ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. Likewise, the FNR of the presented work is 13.3%, 3.7% and 3.84% better than that of the ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. The overall performance of the presented and the existing works for Test case 1 is shown in Table 2. Here, the accuracy of the presented work is 0.71515, which is the highest among the compared works; the existing works reach accuracies between 0.70919 and 0.71336. Further, the error minimization is also attained by the proposed work, which is evident from Table 2.

Table 2
Overall performance of the proposed and existing works for speaker diarization: Test case 1

Measures	ANN+ABC-LA [7] based speaker diarization	SLnO [29] based speaker diarization	CNN+ABC-LA [7] based speaker diarization	CNN+SA-SLnO based speaker diarization
Accuracy	0.71204	0.70919	0.71336	0.71515
Sensitivity	0.71204	0.70919	0.71336	0.71515
Specificity	0.71204	0.70919	0.71336	0.71515
Precision	0.71204	0.70919	0.71336	0.71515
FPR	0.28796	0.29081	0.28664	0.28485
FNR	0.28796	0.29081	0.28664	0.28485
NPV	0.71204	0.70919	0.71336	0.71515
FDR	0.28796	0.29081	0.28664	0.28485
F1_score	0.71204	0.70919	0.71336	0.71515
MCC	0.42407	0.41837	0.42672	0.4303

Figure 6.

Performance Evaluation of the proposed and existing works for speaker diarization under Test case 2 (a) Accuracy, (b) DER, (c) FDR and (d) FNR.

5.3 Performance analysis of proposed work over existing works for Test case 2

For Test case 2, the performance of the presented work and the existing works is compared in terms of accuracy, FDR, diarization error, FNR, and FPR. The outcomes are graphically shown in Fig. 6. As seen in Fig. 6a, the accuracy of the presented work is higher, and hence it is said to be suitable for speaker diarization. In terms of accuracy, the presented work is 23%, 3.8% and 5.12% better than the ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. In addition, the DER of the presented work is the lowest for every variation in the learning percentage. The DER at LP $=$ 40 is 7.14%, 5.7% and 18.75% better than that of the ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. In addition, the FDR of the presented work is much lower than that of the existing works for every variation in LP. The FDR of the presented work lies in the range 30–40, which is the lowest. In addition, the FNR of the presented work is lower for all variations of LP, and the lowest FNR is recorded by the presented work at LP $=$ 80. The overall performance of the proposed and the existing works for Test case 2 is shown in Table 3.

Table 3
Overall performance of the proposed and existing works for speaker diarization: Test case 2

Measures	ANN+ABC-LA [7] based speaker diarization	SLnO [29] based speaker diarization	CNN+ABC-LA [7] based speaker diarization	CNN+SA-SLnO based speaker diarization
Accuracy	0.70099	0.7385	0.7586	0.75887
F1_score	0.55138	0.60774	0.6379	0.6383
FDR	0.44849	0.39226	0.3621	0.3617
FNR	0.44875	0.39226	0.3621	0.3617
FPR	0.22414	0.19613	0.18105	0.18085
MCC	0.32715	0.41162	0.45685	0.45745
NPV	0.77586	0.80387	0.81895	0.81915
Precision	0.55151	0.60774	0.6379	0.6383
Sensitivity	0.55125	0.60774	0.6379	0.6383
Specificity	0.77586	0.80387	0.81895	0.81915

Figure 7.

Performance Evaluation of the proposed and existing works for speaker diarization under Test case 3 (a) Accuracy, (b) DER, (c) FDR and (d) FNR.

5.4 Performance analysis of proposed work over existing works for Test case 3

The presented Telugu speaker diarization model is compared with the existing works for Test case 3, and the results acquired in terms of accuracy, FDR, diarization error, FNR, and FPR are shown in Fig. 7. At LP $=$ 80, the presented work is 6.25%, 3.52% and 2.35% better than the existing ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. In addition, the lowest DER is recorded by the presented work, and it is 12.5%, 30% and 41.6% better than that of the ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. The FDR and FNR of the presented work are the lowest and hence achieve the objective of error minimization. The overall performance measures, inclusive of positive, negative and other measures, for both the proposed and the existing models are shown in Table 4. The accuracy of the presented work is 0.911, which is 3.17%, 1.57% and 0.34% better than the existing ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively.

Table 4
Overall performance of the proposed and existing works for speaker diarization: Test case 3

Measures	ANN+ABC-LA [7] based speaker diarization	SLnO [29] based speaker diarization	CNN+ABC-LA [7] based speaker diarization	CNN+SA-SLnO based speaker diarization
Accuracy	0.88256	0.89718	0.90834	0.91152
F1_score	0.87415	0.89718	0.90834	0.91152
FDR	0.06088	0.10282	0.091657	0.088476
FNR	0.18241	0.10282	0.091657	0.088476
FPR	0.05276	0.10282	0.091657	0.088476
MCC	0.77152	0.79436	0.81669	0.82305
NPV	0.94724	0.89718	0.90834	0.91152
Precision	0.93912	0.89718	0.90834	0.91152
Sensitivity	0.81759	0.89718	0.90834	0.91152
Specificity	0.94724	0.89718	0.90834	0.91152

Figure 8.

Performance Evaluation of the proposed and existing works for speaker diarization under Test case 4 (a) Accuracy, (b) DER, (c) FDR and (d) FNR.

Table 5

Overall performance of the proposed and existing works for speaker diarization: Test case 4

Measures	ANN+ABC-LA [7] based speaker diarization	SLnO [29] based speaker diarization	CNN+ABC-LA [7] based speaker diarization	CNN+SA-SLnO based speaker diarization
Accuracy	0.83767	0.87374	0.87573	0.87663
F1_score	0.82619	0.87374	0.87573	0.87663
FDR	0.11326	0.12626	0.12427	0.12337
FNR	0.22661	0.12626	0.12427	0.12337
FPR	0.09834	0.12626	0.12427	0.12337
MCC	0.68081	0.74748	0.75146	0.75325
NPV	0.90166	0.87374	0.87573	0.87663
Precision	0.88674	0.87374	0.87573	0.87663
Sensitivity	0.77339	0.87374	0.87573	0.87663
Specificity	0.90166	0.87374	0.87573	0.87663

Figure 9.

Performance Evaluation of the proposed and existing works for speaker diarization under Test case 5 (a) Accuracy, (b) DER, (c) FDR and (d) FNR.

5.5 Performance analysis of proposed work over existing works for Test case 4

Figure 8 shows the performance evaluation of both the presented and the existing works. The accuracy measure decides the selection of the diarization approach. In Fig. 8a, at LP $=$ 80, the accuracy of the presented work is 9.09%, 6.8% and 3.4% better than that of the ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. Thus, from the evaluation, it is clear that the presented work is suitable for speaker diarization. In addition, the objective of the current research work is the minimization of error, and this is verified from the error analysis. The DER of the presented work is close to that of the CNN+SLnO and CNN+ABC-LA based models, with only a small variation, but the presented work still shows the lower DER values. In addition, the FDR and FNR of the presented work are lower, as seen in Fig. 8c and d, respectively. The overall performance of the proposed and the existing works for Test case 4 is shown in Table 5. The negative performance values recorded by the presented work are FDR $=$ 0.123, FPR $=$ 0.123 and FNR $=$ 0.123. Its FNR is the lowest of all compared works, and its FDR and FPR are lower than those of the SLnO and CNN+ABC-LA based models; ANN+ABC-LA reports a lower FDR (0.113) and FPR (0.098) but a much higher FNR (0.227). Thus, from the overall evaluation, the presented work keeps all three error rates low at the same time and hence reaches the objective of error minimization.

Table 6
Overall performance of the proposed and existing works for speaker diarization: Test case 5

Measures	ANN+ABC-LA [7] based speaker diarization	SLnO [29] based speaker diarization	CNN+ABC-LA [7] based speaker diarization	CNN+SA-SLnO based speaker diarization
Accuracy	0.76491	0.78748	0.80668	0.8073
Sensitivity	0.64516	0.68122	0.71003	0.71096
Specificity	0.82479	0.84061	0.85501	0.85548
Precision	0.64803	0.68122	0.71003	0.71096
FPR	0.17521	0.15939	0.14499	0.14452
FNR	0.35484	0.31878	0.28997	0.28904
NPV	0.82479	0.84061	0.85501	0.85548
FDR	0.35197	0.31878	0.28997	0.28904
F1_score	0.64659	0.68122	0.71003	0.71096
MCC	0.47047	0.52183	0.56504	0.56644

Table 7

Overall DER performance of the proposed speaker diarization and existing works for Test case 1, Test case 2, Test case 3, Test case 4 and Test case 5

Methods	Test case 1	Test case 2	Test case 3	Test case 4	Test case 5
ANN+ABC-LA [7] based speaker diarization	0.22391	0.69427	0.42318	0.30346	0.60193
SLnO [29] based speaker diarization	0.11775	0.7077	0.44104	0.27551	0.69369
CNN+ABC-LA [7] based speaker diarization	0.12833	0.63325	0.40538	0.25606	0.5522
CNN+SA-SLnO based speaker diarization	0.10523	0.61996	0.39355	0.25037	0.53304

5.6 Performance analysis of proposed work over existing works for Test case 5

The outcomes acquired under Test case 5 are exhibited in Fig. 9. The accuracy of the presented work is higher for every variation in the LP, which shows that it is suitable for speaker diarization. In the case of DER, the proposed work attains a lower value. At LP $=$ 80, the DER of the presented work is 3.2%, 25% and 9.09% better than that of the existing ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. In addition, the FNR and FDR of the presented work are the lowest for each variation in LP. From the overall evaluation, it is clear that the presented work has attained the lowest errors and hence is apt for speaker diarization. Table 6 shows the overall performance evaluation of the presented and the existing works for Test case 5. The accuracy, being the key measure that shows the significance of the presented work, is higher than that of the existing works. The accuracy of the presented work is 0.8073, which is 5.25%, 2.45% and 0.07% better than the existing ANN+ABC-LA, CNN+SLnO and CNN+ABC-LA based speaker diarization models, respectively. Thus, from the overall evaluation, the presented work is well suited for speaker diarization.
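The accuracy improvements quoted for Test case 5 can be reproduced from the accuracy row of Table 6. The formula below is inferred from the reported numbers rather than stated in the text: the gain appears to be the difference to the baseline expressed as a percentage of the proposed model's accuracy. The function name `relative_gain` is hypothetical.

```python
def relative_gain(proposed, baseline):
    """Percentage by which `proposed` exceeds `baseline`, relative to `proposed`."""
    return 100.0 * (proposed - baseline) / proposed


# Accuracy row of Table 6 (Test case 5).
proposed_accuracy = 0.8073
baseline_accuracy = {
    "ANN+ABC-LA": 0.76491,
    "SLnO": 0.78748,
    "CNN+ABC-LA": 0.80668,
}

gains = {name: relative_gain(proposed_accuracy, acc)
         for name, acc in baseline_accuracy.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}% lower accuracy than the proposed model")
```

The same calculation applied to the accuracy row of Table 4 gives values matching the 3.17%, 1.57% and 0.34% quoted for Test case 3, which supports this reading of the percentages.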

5.7 DER performance

DER is the common measure discussed in most of the literature [1, 2, 5, 6, 7, 8]. Among these works, very few have reached the required level of DER, and those that did fell short on diarization accuracy. Taking this as a challenge, the proposed speaker diarization work achieves both higher accuracy and lower DER. The DER for all the test cases is shown in Table 7. The proposed speaker diarization work achieves the lowest DER in all five test cases. The DER achieved by the presented work is 0.10523 for Test case 1, 0.61 for Test case 2, 0.39 for Test case 3, 0.25 for Test case 4 and 0.53 for Test case 5. Among all these, the lowest DER is recorded under Test case 1. From this evaluation, it is shown that the presented work is better, with a lower DER and a higher accuracy rate.
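The claim that the proposed model has the lowest DER in every test case can be checked directly against the values of Table 7. The short sketch below only re-reads the table; the dictionary keys are shortened method labels chosen for this example.

```python
# DER values from Table 7, ordered Test case 1 to Test case 5.
der = {
    "ANN+ABC-LA": [0.22391, 0.69427, 0.42318, 0.30346, 0.60193],
    "SLnO": [0.11775, 0.7077, 0.44104, 0.27551, 0.69369],
    "CNN+ABC-LA": [0.12833, 0.63325, 0.40538, 0.25606, 0.5522],
    "CNN+SA-SLnO": [0.10523, 0.61996, 0.39355, 0.25037, 0.53304],
}

# Method with the lowest DER in each test case.
lowest_per_case = [min(der, key=lambda method: der[method][case])
                   for case in range(5)]

# Test case (1-based) in which the proposed model reaches its lowest DER.
proposed = der["CNN+SA-SLnO"]
best_case = proposed.index(min(proposed)) + 1

print(lowest_per_case)
print(best_case)
```

Both statements in the paragraph hold for the tabulated values: the proposed model is the per-column minimum in all five cases, and its own minimum falls in Test case 1.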

6. Conclusion

A novel speaker diarization approach was introduced in this paper, following three major phases: Feature Extraction, SAD, and the Speaker Segmentation and Clustering process. Initially, from the input audio stream (Telugu language), the MFCC based features were extracted. Subsequently, in SAD, the music and silence signals were removed. Then, the acquired speech signals were segmented for each individual speaker. Finally, the segmented signals were subjected to the speaker clustering process, where the Optimized CNN was used. To make the clustering more appropriate, the weight and activation function of the CNN were fine-tuned by a new SA-SLnO. A comparative analysis was then made to exhibit the superiority of the proposed speaker diarization work. The proposed speaker diarization work achieves the lowest DER in all five test cases. The DER achieved by the presented work is 0.10523 for Test case 1, 0.61 for Test case 2, 0.39 for Test case 3, 0.25 for Test case 4 and 0.53 for Test case 5. Among all these, the lowest DER is recorded under Test case 1. However, the presented model cannot use visual features, such as head-pose tracking and head-pose estimation. In future, our work will be extended to integrate richer visual features to improve the detection of speech turns based on gaze. In addition, to deal with more multifaceted scenarios involving tens of participants, we will consider wearable devices or distributed sensors.

Nomenclature

Abbreviation	Description
MFCC	Mel Frequency Cepstral Coefficient
SAD	Speech Activity Detection
LPCC	Linear Predictive Cepstral Coefficients
ANN	Artificial Neural Network
PLP	Perceptual Linear Predictive
GMM	Gaussian Mixture Model
HXLPS	Holoentropy with the eXtended Linear Prediction using autocorrelation Snapshot
DNN	Deep Neural Network
VAD	Voice Activity Detection
UBM	Universal Background Model
DER	Diarization Error Rate
DE	Differential Evolution
TRW	Trace Within Criterion
VRC	Variance Ratio Criterion
LA	Lion Algorithm
ABC	Artificial Bee Colony
MSR	Missed Speech Rate
FASR	False Alarm Speech Rate
BIC	Bayesian Information Criterion
FDR	False Discovery Rate
FPR	False Positive Rate
FNR	False Negative Rate
SA-SLnO	Self Adaptive Sea Lion Algorithm
SLnO	Sea Lion Algorithm
MCC	Matthews Correlation Coefficient
NPV	Negative Predictive Value
V-QF	Voice-Quality Features
A-VF	Audio-Visual Fusion
BSF	Binaural Spectral Features
SCF	Spectral Clustering Framework
NME	Normalized Maximum Eigengap
CNN	Convolutional Neural Network
FFT	Fast Fourier Transform
MFC	Mel-Frequency Cepstrum
DCT	Discrete Cosine Transform
STE	Short Time Energy
ZCR	Zero Crossing Rates

References

1. Ramaiah, Rao. Speaker diarization system using HXLPS and deep neural network, Alexandria Engineering Journal, 2017.
2. Zewoudie, Luque, Hernando. The use of long-term features for GMM- and i-vector-based speaker diarization systems, EURASIP Journal on Audio, Speech, and Music Processing, 2018.
3. Karim, Salah, Adnen. Hybridization DE with K-means for speaker clustering in speaker diarization of broadcasts news, International Journal of Speech Technology, August 2019.
4. Diez, Burget, Landini, Cernocký. Analysis of speaker diarization based on bayesian HMM with eigenvoice priors, IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 28, 2020.
5. Gebru, Horaud. Audio-visual speaker diarization based on spatiotemporal bayesian fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
6. Park, Han, Kumar, Narayanan. Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Processing Letters, Vol. 27, 2020.
7. Sethuram, Prasad, Rao. Optimal trained artificial neural network for Telugu speaker diarization, Evolutionary Intelligence, 2020.
8. Cyrta, Trzcinski, Stokowiec. Speaker diarization using deep recurrent convolutional neural networks for speaker embeddings, arXiv:1708.02840v2 [cs.SD], 15 Sep 2017.
9. Ferràs, Madikeri, Bourlard. Speaker diarization and linking of meeting data, IEEE/ACM Transactions on Audio, Speech, and Language Processing. Nov. 2016; 24(11): 1935-1945.
10. Ahmad, Zubair, Alquhayz. Speech enhancement for multimodal speaker diarization system, IEEE Access. 2020; 8: 126671-126680.
11. Diez, Burget, Landini, Černocký. Analysis of speaker diarization based on bayesian HMM with eigenvoice priors, IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2020; 28: 355-368.
12. Hansen JHL. Active learning based constrained clustering for speaker diarization, IEEE/ACM Transactions on Audio, Speech, and Language Processing. Nov. 2017; 25(11): 2188-2198.
13. Le Lan, Charlet, Larcher, Meignier. An adaptive method for cross-recording speaker diarization, IEEE/ACM Transactions on Audio, Speech, and Language Processing. Oct. 2018; 26(10): 1821-1832.
14. Gebru, Horaud. Audio-visual speaker diarization based on spatiotemporal bayesian fusion, IEEE Transactions on Pattern Analysis and Machine Intelligence. 1 May 2018; 40(5): 1086-1099.
15. Park, Han, Kumar, Narayanan. Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap, IEEE Signal Processing Letters. 2020; 27: 381-385.
16. Jati, Georgiou. Neural predictive coding using convolutional neural networks toward unsupervised learning of speaker characteristics, IEEE/ACM Transactions on Audio, Speech, and Language Processing. Oct. 2019; 27(10): 1577-1589.
17. Han, Song, Zhao. Retrieval of TV talk-show speakers by associating audio transcript to visual clusters, IEEE Access. 2017; 5: 20512-20523.
18. Sun, Gao, Fang, Lee. A speaker-dependent approach to separation of far-field multi-talker microphone array speech for front-end processing in the CHiME-5 Challenge, IEEE Journal of Selected Topics in Signal Processing. Aug. 2019; 13(4): 827-840.
19. Stöter, Chakrabarty, Edler, Habets EAP. CountNet: estimating the number of concurrent speakers using supervised learning, IEEE/ACM Transactions on Audio, Speech, and Language Processing. Feb. 2019; 27(2): 268-282.
20. Shahin, Zafar, Ahmed. The automatic detection of speech disorders in children: Challenges, opportunities, and preliminary results, IEEE Journal of Selected Topics in Signal Processing. Feb. 2020; 14(2): 400-412.
21. Shahin, Zafar, Ahmed. The automatic detection of speech disorders in children: Challenges, opportunities, and preliminary results, IEEE Journal of Selected Topics in Signal Processing. Feb. 2020; 14(2): 400-412.
22. Ban, Alameda-Pineda, Girin, Horaud. Variational bayesian inference for audio-visual tracking of multiple speakers, IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2019.2953020.
23. Kumano, Otsuka, Ishii, Yamato. Collective first-person vision for automatic gaze analysis in multiparty conversations, IEEE Transactions on Multimedia. Jan. 2017; 19(1): 107-122.
24. Moore, Evers, Naylor. Direction of arrival estimation in the spherical harmonic domain using subspace pseudointensity vectors, IEEE/ACM Transactions on Audio, Speech, and Language Processing. Jan. 2017; 25(1): 178-192.
25. Lapidot, Shoa, Furmanov, Bonastre J-F. Generalized Viterbi-based models for time-series segmentation and clustering applied to speaker diarization, Computer Speech & Language, 2017.
26. Teimoori, Razzazi. Incomplete-data-driven speaker segmentation for diarization application; a help-training approach, Circuits, Systems, and Signal Processing. 2019; 38: 2489-2522.
27. Alhlffee. MFCC-based feature extraction model for long time period emotion speech using CNN, Revue d’Intelligence Artificielle. April 2020; 34(2).
28. Muzzaffar, Goyal, Butt. Speaker Change Detection for News Audio: A Practical Approach, 2017.
29. Masadeh, Mahafzah, Sharieh. Sea lion optimization algorithm, International Journal of Advanced Computer Science and Applications (IJACSA). 2019; 10(5).
30. Rajakumar. Optimization using lion algorithm: A biological inspiration from lion’s social behavior, Evolutionary Intelligence, Special Issue on Nature inspired algorithms for high performance computing in computer vision. 2018; 11(1-2): 31-52. doi: 10.1007/s12065-018-0168-y.
31. Rajakumar. Static and adaptive mutation techniques for genetic algorithm: A systematic comparative analysis, International Journal of Computational Science and Engineering. 2013; 8(2): 180-193. doi: 10.1504/IJCSE.2013.053087.
32. Rajakumar. Lion algorithm for standard and large scale bilinear system identification: A global optimization based on Lion’s social behavior, 2014 IEEE Congress on Evolutionary Computation, Beijing, China, July 2014, pp. 2116-2123. doi: 10.1109/CEC.2014.6900561.
33. Rajakumar, George. A new adaptive mutation technique for genetic algorithm, In Proceedings of IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), Coimbatore, India, 18–20 December 2012, pp. 1-7. doi: 10.1109/ICCIC.2012.6510293.
34. Darekar, Dhande. Emotion recognition from speech signals using DCNN with hybrid GA-GWO algorithm, Multimedia Research. 2019; 2(4): 12-22.
35. Brajula, Bibin Prasad. ODFF opposition and dimension based firefly for optimal reactive power dispatch, Journal of Computational Mechanics, Power System and Control. 2018; 1(1): 1-10.
36. Brajula, Praveena. Energy efficient genetic algorithm based clustering technique for prolonging the life time of wireless sensor network, Journal of Networking and Communication Systems. 2018; 1(1): 1-9.
37. et al. MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications, IEEE Access. 2020; 8: 48720-48730. doi: 10.1109/ACCESS.2020.2979799.
38. Muzzaffar, Goyal, Butt. Speaker Change Detection for News Audio: A Practical Approach. 2017; 4.
39. Arul, Sivakumar, Marimuthu, Chakraborty. An approach for speech enhancement using deep convolutional neural network, Multimedia Research. 2019; 2(1): 37-44.
40. Sarkar. Optimization assisted convolutional neural network for facial emotion recognition, Multimedia Research. 2020; 3(2).
41. Dhivakar, Mohana. A survey on privacy preservation recent approaches and techniques, International Journal of Innovative Research in Computer and Communication Engineering. 2014; 2(11): 6559-6566.
42. Arachchige, Sathsara. The Impact Of Outbound Training (OBT).
43. Gupta, Singh, Mukhija, Ghose. Aspect-based sentiment analysis of mobile reviews, Journal of Intelligent & Fuzzy Systems. 2019; 36(5): 4721-4730.
44. Gupta, Juyal, Singh, Killa, Gupta. Emotion recognition of audio/speech data using deep learning approaches, Journal of Information and Optimization Sciences. 2020; 41(6): 1309-1317.
45. Piryani, Gupta, Singh, Ghose. A linguistic rule-based approach for aspect-level sentiment analysis of movie reviews, In Advances in Computer and Computational Sciences. Springer, Singapore, 2017; 201-209.
46. Mohsin, Abdalla. Optimization driven adam-cuckoo search-based deep belief network classifier for data classification, IEEE Access. 2020; 8: 105542-105560.
