SVM based Voice Activity Detection by fusing a new acoustic feature PLMS with some existing acoustic features of speech

Abstract

Voice activity detection (VAD) identifies the presence or absence of human speech in a frame of a given speech signal. Speech is easily identified in a clean signal, but detection accuracy decreases as the signal-to-noise ratio (SNR) decreases. Robust VAD improves the efficiency of speech-based automated applications such as speech enhancement, speaker identification and hearing aid devices. In this paper, a new speech feature, “Peak of Log Magnitude Spectrum (PLMS)”, is introduced and used for VAD. PLMS, along with three existing acoustic features (MFCC, RASTA-PLP and formant frequency), is used to train an SVM classifier for VAD. Experimentally, it is found that the PLMS coefficients play the most prominent role. It is also observed that the accuracy of the trained SVM classifier is higher than that of other state of the art methods (Sohn VAD and VAD G.729).

Keywords

VAD PLMS SVM MFCC RASTA-PLP Formant Frequency

1 Introduction

Voice Activity Detection (VAD) is a binary classification problem: every frame of a signal either contains human speech or does not. In a clean (noise-free) speech signal, voice can be identified easily using simple acoustic features such as zero crossing rate (ZCR) and energy. Identification becomes difficult when the signal-to-noise ratio (SNR) is low. For this reason, robust voice activity detection is considered one of the most open and challenging tasks in speech signal processing. An efficient and robust VAD improves the output SNR of existing speech enhancement algorithms by improving their noise estimation. Several approaches to VAD exist in the literature, including traditional algorithmic approaches and machine-learning based approaches. A traditional algorithmic approach takes its decision on the basis of one or two acoustic features. A machine-learning based VAD uses multiple features for its decision, which makes it more resistant to noise than traditional approaches. Such VADs also integrate naturally with other speech processing systems such as speech and speaker recognition. Since a machine-learning based VAD fuses multiple acoustic features, the selection of features plays a crucial role. Fusing multiple speech features with similar properties may or may not increase the performance of the system. A large number of features may produce better results, but it also increases the time required for training, modelling and testing, which limits applicability in real-life applications such as hearing aid devices and online audio chat. Hence, features should be selected to enhance accuracy without much affecting the time required.
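As a point of reference for the simple features mentioned above, the following is a minimal sketch of a ZCR and energy based frame decision. The function names, the thresholds and the rule "energetic and low ZCR means speech" are assumptions made for illustration only; they are not taken from this paper and would need tuning on real data.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame) if frame else 0.0


def simple_vad(frame, energy_threshold=0.01, zcr_threshold=0.5):
    """Hypothetical rule: a frame is speech if it is energetic and not noise-like.

    Both thresholds are illustrative assumptions, not values from the paper.
    """
    return (short_time_energy(frame) > energy_threshold
            and zero_crossing_rate(frame) < zcr_threshold)
```

A rule of this kind works on clean recordings but degrades quickly once the noise energy approaches the speech energy, which is the motivation for the multi-feature approach studied here.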

As per the source-filter theory of speech production, human speech is produced by the convolution of an excitation sequence with the vocal tract filter characteristics [2]. Sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth etc. [1]. The phonemes of human speech are shaped by the speaker's vocal tract, which distinguishes speech from sounds (noise) generated by other sources such as CAR, TRAIN, BABBLE and AIRPORT. Therefore, features capturing the vocal tract characteristics are more suitable for VAD. In the cepstral (quefrency) domain, the first few values of every speech frame contain the information about vocal tract characteristics. Features such as formant frequency, MFCC and RASTA-PLP capture vocal tract characteristics.

Hence, the motivation of this work is to select a suitable set of speech features for VAD so that the training time of the SVM is reduced and the efficiency of the system is enhanced. In this paper, a new feature of human speech, PLMS, is introduced and examined for this task. PLMS coefficients are log magnitude spectrum values of a speech frame of duration 15 to 20 ms. Experimentally, it is found that combining this feature with MFCC and formant frequency improves the overall performance (in terms of training time and accuracy) of SVM based VAD.

The major contributions of this paper are as follows:

A new speech feature, PLMS, is introduced.

An appropriate reduced set of speech features for efficient and robust real-time VAD is defined.

The rest of this paper is organized as follows. Section 2 presents the literature survey for VAD. Procedures for extracting the selected acoustic features of speech are discussed in Section 3, followed by Section 4, where the parameters used for measuring performance are explained. The experiments are analysed in Section 5 and the paper is concluded in Section 6.

2 Literature Survey for Voice Activity Detection (VAD)

A number of researchers have worked on VAD using different acoustic features such as spectral correlation and energy entropy. In this section, a brief review of state of the art approaches to VAD is presented.

A human speech signal carries two main kinds of information: the excitation sequence and the vocal tract characteristics. Some researchers have used only excitation sequence features of speech [15] for VAD. Features such as spectral divergence, spectral correlation, low-frequency ultrasound, single frequency filtering and energy entropy [5, 25] are also used for the same purpose. Some researchers have used long-term or suprasegmental features such as the long-term spectral divergence measure (LTSDM) and long-term signal variability (LTSV) [17] for VAD and claimed that they perform effectively even at low SNR. LTSDM measures the spectral divergence between speech and noise over a longer duration [11]. Features based on spectral energy in different frequency bands and on different scales, such as the Bark scale or the Mel scale, have been widely used for this purpose. On the Mel scale, the spectral energy feature is known as MFCC [23]. Features of voiced sounds (which show quasi-periodic behavior and dynamic spectral characteristics over short utterances) have been extracted and used for VAD by many researchers [7, 9].

Statistical methods based on the log-likelihood ratio test [14], optimum likelihood ratio test [12], low-variance spectrum estimate [3], Laplace distribution [19], Gaussian distribution [4, 14], Gamma distribution [10] or a combination of them [28] are also popular for VAD in the literature. The computational efficiency of these methods is high, but their accuracy decreases drastically with decreasing SNR.

With traditional approaches that use only a few features, the performance of a VAD system is not consistent across SNR values and noise categories, because an individual feature captures only specific characteristics of human speech. Fusion of multiple features improves the performance of VAD. Many machine learning methods such as artificial neural networks [24], Support Vector Machine (SVM) [13, 23] and Deep Belief Network (DBN) [27] use multiple feature fusion for VAD.

The machine learning based approach is one of the most popular and successful approaches to VAD. It is further divided into two categories: supervised and unsupervised. When labeled speech data is used for training, the method is supervised; otherwise it is unsupervised. In an unsupervised method, dimensionality reduction is first performed on the extracted features. The resulting set of features is then used for training and validation of the classifier. Principal component analysis [21], non-negative matrix factorization [18] and spectral decomposition of the graph Laplacian [20] are popular methods for dimensionality reduction. The performance of classifiers trained in this way is found to be unsatisfactory in the presence of noise such as Babble at low SNR.

Supervised classifiers for VAD perform well in almost all types of noise when trained with a sufficient amount of labeled data.

Recently, deep neural networks [22, 30] have been introduced for VAD. Zhang and Wu [27] used ten different acoustic features with a total of 273 coefficients. They claimed that the performance of DBN is better than that of SVM in terms of accuracy. X.L. Zhang et al. proposed a new method for VAD using a deep neural network that boosts contextual information. They claimed that when this model is trained with many types of noise and a wide range of signal-to-noise ratios, its performance is good even for unknown test cases [31]. S. Shahsavari et al. also proposed speech activity detection using deep neural networks and claimed that the performance of the system increases by adding context information [32]. L. Jie and Y. Datao proposed an enhanced speech based joint statistical probability distribution function for VAD. They claimed that the performance of the proposed approach is better than that of the baseline methods even in non-stationary noise conditions [33]. P. Sertsi et al. proposed VAD based on LSTM recurrent neural networks and the modulation spectrum. They claimed that the performance of the proposed system is better than that of conventional baseline methods for both seen and unseen types of noise [34].

3 Features used

In this section, the motivation behind introducing a new speech feature "PLMS" and the algorithm for extracting it are discussed. The other features of human speech used for VAD in this paper are also discussed.

3.1 Peak of Log Magnitude Spectrum (PLMS)

It has been assumed in the literature that three formant values, which carry the information of the vocal tract, are present in any speech frame of duration 15-25 ms. We analyzed the pattern of the amplitudes corresponding to each formant value for human speech and also for noise. We observed that the amplitudes corresponding to all formant values of human speech show different characteristics from those of real world noises. Hence, we introduce three PLMS coefficients, defined as the three amplitude values corresponding to the three formant values of a speech frame of duration 15-25 ms. The PLMS coefficients of a speech frame are shown in Fig. 1.

Fig.1

PLMS feature coefficient of a speech frame corresponding to formant values.

Experimentally, it is also found that this feature is useful for VAD irrespective of language. Absolute values of PLMS coefficients for a frame are found to be in increasing order.

First PLMS coefficients of Airport noise at 0 dB and 10 dB are shown in Figs. 2 and 3 respectively.

Fig.2

PLMS feature coefficient corresponding to first formant.

Fig.3

PLMS feature coefficient corresponding to first formant.

Algorithm 1 is used for extracting the PLMS coefficients of speech.

Algorithm 1

Algorithm for PLMS:

INPUT: Speech signal S of size N₁ × 1.

OUTPUT: PLMS of size N × 3.

Ensure:

(1) S is a vector of size N₁ × 1. It contains N₁ samples of speech signal S.

(2) N is the total number of frames.

(3) fs is the sampling frequency of speech signal S.

(4) N₂ is the number of samples in a frame.

(5) N = $\frac{N_{1}}{N_{2}}$ .

(6) F_i represents the i^th frame.

(7) L is defined as a low time liftering window of size $\frac{N_{2}}{2}$ whose first fifteen values are 1 and remaining are 0.

Procedure PLMS (S)

 for i ← 1 to N do

  y₁ ← fft (F_i) (Fast Fourier transform of frame F_i).

  y₂ ← log(|y₁|)

  y₃ ← ifft (y₂) (Inverse Fourier transform of y₂).

  Assign first $\frac{N_{2}}{2}$ elements of vector y₃ to vector y₄.

  for j ← 1 to $\frac{N_{2}}{2}$ do

   y₅^j ← y₄^j × L^j (y_p^j and L^j represent the j^th element of y_p and L).

  end for

  y₆ ← first fifteen elements of y₅.

  y₇ ← fft (y₆, fs) (Assign fs point fft of y₆ to y₇).

  Assign absolute values of the first $\frac{fs}{2}$ elements of vector y₇ to vector y₈.

  l ← 1

  for k ← 2 to $\frac{fs}{2} - 1$ do

   if y₈^k is a local maximum then

    location (l) ← y₈^k

    l ← l + 1

   end if

  end for

  Assign first three elements of location to the i^th row of PLMS.

 end for

 return PLMS

end procedure
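The per-frame steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the helper names `dft`, `idft` and `plms_frame` are hypothetical, a naive DFT stands in for the fft so that the example needs only the standard library, the parameter `n_points` plays the role of the fs-point transform, and a small `eps` is added before the logarithm to avoid log(0).

```python
import cmath
import math


def dft(x, n=None):
    """Naive n-point DFT; the input is truncated or implicitly zero-padded to n."""
    n = len(x) if n is None else n
    x = list(x)[:n]
    return [sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(v * cmath.exp(2j * cmath.pi * k * t / n) for k, v in enumerate(spectrum)) / n
            for t in range(n)]


def plms_frame(frame, n_points, n_lifter=15, n_peaks=3, eps=1e-12):
    """Return up to n_peaks peak amplitudes of the liftered log magnitude spectrum."""
    spectrum = dft(frame)                                   # y1
    log_mag = [math.log(abs(v) + eps) for v in spectrum]    # y2 (eps is an assumption)
    cepstrum = [c.real for c in idft(log_mag)]              # y3
    liftered = cepstrum[:n_lifter]                          # y4..y6: low-time liftering
    smooth = dft(liftered, n_points)                        # y7
    envelope = [abs(v) for v in smooth[: n_points // 2]]    # y8
    peaks = [envelope[k] for k in range(1, len(envelope) - 1)
             if envelope[k - 1] < envelope[k] > envelope[k + 1]]
    return peaks[:n_peaks]


# Example: a synthetic 64-sample frame made of two sinusoids.
frame = [math.sin(2 * math.pi * 0.05 * t) + 0.5 * math.sin(2 * math.pi * 0.2 * t)
         for t in range(64)]
coefficients = plms_frame(frame, n_points=256)
```

Because the liftering window keeps only the first fifteen cepstral values, multiplying by L and then taking the first fifteen elements reduces to a single slice. A frame with fewer than three spectral peaks yields fewer than three coefficients; how Algorithm 1 handles that case is not specified in the text.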

The first PLMS coefficients of the speech signal at 0 dB SNR with Babble, Car, Exhibition, Restaurant and Station noise are shown in Figs. 4–8 respectively. The gap between the coefficients of noisy speech and those of noise is found to be almost the same for all types of noise. The gap between the respective pair of PLMS coefficient vectors increases with increasing SNR, as seen from Figs. 2 and 3. The same behavior is also observed with the other types of noise used for the experiments in this paper.

Fig.4

PLMS feature coefficient corresponding to first formant.

Fig.5

PLMS feature coefficient corresponding to first formant.

Fig.6

PLMS feature coefficient corresponding to first formant.

Fig.7

PLMS feature coefficient corresponding to first formant.

Fig.8

PLMS feature coefficient corresponding to first formant.

3.2 Mel Frequency Cepstral Coefficient (MFCC)

Mel Frequency Cepstral Coefficients (MFCCs) are among the most widely used speech features and are applied in automated speech processing tasks such as speech and speaker recognition and speech synthesis. These coefficients contain the information about the vocal tract envelopes that represent spoken phonemes. Davis and Mermelstein [16] proposed this feature in 1980. Algorithm 2 explains the process of extracting these coefficients.

Algorithm 2

Algorithm for MFCC:

INPUT: Speech signal S of size N₁ × 1, minimum frequency f_min, maximum frequency f_max and number of frequency bands nofb.

OUTPUT: Mel-Frequency Cepstral Coefficients (MFCC).

Ensure:

(1) S is a vector of size N₁ × 1. It contains N₁ samples of speech signal S.

(2) N is the total number of frames.

(3) fs is the sampling frequency of speech signal S.

(4) N₂ is the number of samples in a frame.

(5) N = $\frac{N_{1}}{N_{2}}$ .

(6) F_i represents the i^th frame.

(7) m_min, m_max represent the minimum and maximum mel values corresponding to f_min, f_max.

(8) fb represents frequency bank of size 1 × (nofb + 2).

(9) mb represents mel bank of size 1 × (nofb + 2).

(10) mfb is defined as mel frequency bank. It is a matrix of size $(\frac{n}{2} + 1) \times nofb$ .

(11) eb is energy in mel frequency bank.

Procedure MFCC (S, f_min, f_max, nofb)

 for i ← 1 to N do

  Assign the periodogram estimate of the power spectrum of frame F_i, computed with an n-point fft, to the i^th column of matrix pgm (n must be a power of 2 and greater than or equal to N₂).

 end for

 Map f_min and f_max into the mel scale using the following two formulas:

 $m_{\min} = 1125 \times \ln (1 + \frac{f_{\min}}{700})$ .

 $m_{\max} = 1125 \times \ln (1 + \frac{f_{\max}}{700})$ .

 Linearly divide m_min to m_max into (nofb + 2) values and store them in mel bank mb.

 for i ← 1 to nofb + 2 do

  Map each mel value into the frequency scale using the following formula and store it in frequency bank fb.

   ${fb}^{i} = 700 \times (\exp (\frac{{mb}^{i}}{1125}) - 1)$ (fbⁱ, mbⁱ are the i^th values of fb and mb respectively).

  Round fbⁱ to the nearest fft-bin and store it in fftbinbank (i).

 end for

 for i ← 2 to nofb + 1 do

  x ← fftbinbank (i - 1)

  y ← fftbinbank (i + 1)

  for j ← x to y do

   if j ≤ fftbinbank (i) then

     $mfb (j, i - 1) \leftarrow \frac{(j - fftbinbank (i - 1))}{(fftbinbank (i) - fftbinbank (i - 1))}$

   else

     $mfb (j, i - 1) \leftarrow \frac{(fftbinbank (i + 1) - j)}{(fftbinbank (i + 1) - fftbinbank (i))}$

   end if

  end for

 end for

 Calculate the energy in each bank by multiplying the transpose of pgm by mfb and assign it to eb.

 y ← dct (log (eb)) (dct is discrete cosine transformation).

 $MFCC \leftarrow y (\frac{nofb}{2} \times N)$

 return MFCC

end procedure
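The mel mapping and the triangular filter bank at the core of Algorithm 2 can be sketched as below. This is a hedged illustration, not the authors' code: the function names and the example parameters (20 bands, a 512-point fft, fs = 8000 Hz) are assumptions, and the filter bank is stored as nofb rows of n/2 + 1 weights, which is the transpose of the mfb matrix defined above.

```python
import math


def hz_to_mel(f):
    """Mel value of a frequency in Hz, using the 1125 * ln(1 + f / 700) mapping."""
    return 1125.0 * math.log(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)


def mel_filter_bank(f_min, f_max, nofb, n_fft, fs):
    """Return nofb triangular filters, each a list of n_fft // 2 + 1 weights."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (nofb + 1)
    mel_bank = [m_min + i * step for i in range(nofb + 2)]      # mb
    freq_bank = [mel_to_hz(m) for m in mel_bank]                # fb
    bins = [int(round(f * n_fft / fs)) for f in freq_bank]      # fftbinbank
    n_bins = n_fft // 2 + 1
    filters = [[0.0] * n_bins for _ in range(nofb)]
    for i in range(1, nofb + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):                           # rising slope
            filters[i - 1][k] = (k - left) / (centre - left)
        for k in range(centre, right + 1):                      # falling slope
            filters[i - 1][k] = (right - k) / (right - centre) if right > centre else 1.0
    return filters


# Example with assumed parameters: 20 bands between 0 Hz and 4000 Hz.
bank = mel_filter_bank(0.0, 4000.0, nofb=20, n_fft=512, fs=8000)
```

Multiplying a frame's periodogram by each filter, taking the logarithm of the resulting band energies and applying a DCT would complete the MFCC computation described in the algorithm.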

3.3 RASTA-PLP, Formant Frequency and AMS

We use the same procedure as X.L. Zhang and J. Wu [27] to extract the RASTA-PLP and AMS features of human speech. Formant frequencies are calculated with the same procedure as PLMS, because the PLMS coefficients are extracted at the locations corresponding to each formant value.

4 Parameters used for measuring the performance

False positive rate (fpr), true positive rate (tpr), average accuracy and time for extracting the features, modeling, training and testing of the classifiers are used for measuring the performance of the proposed system. Suppose in a given speech signal S, N is the total number of frames out of which N1 number of frames contain human speech while N2 number of frames do not include human speech (N = N1 + N2). Classifier classifies frames as shown in Table 1.

Table 1
Classification results of classifier

Frame	Human speech (Obtained)	Non human speech (Obtained)
Human speech (Actual)	M1	N1 - M1
Non human speech (Actual)	M2	N2 - M2

Parameters tpr, fpr and accuracy of a trained classifier are defined in Equations (1), (2) and (3) respectively.

$tpr = \frac{M1}{M1 + (N1 - M1)} = \frac{M1}{N1}$ (1)

$fpr = \frac{M2}{M2 + (N2 - M2)} = \frac{M2}{N2}$ (2)

$accuracy = \frac{M1 + (N2 - M2)}{N}$ (3)
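Equations (1) to (3) can be computed directly from per-frame labels. The sketch below assumes that labels are encoded as 1 for human speech and 0 for non human speech; the function name `vad_metrics` and this encoding are illustrative choices, not part of the paper.

```python
def vad_metrics(actual, predicted):
    """Return (tpr, fpr, accuracy) following Equations (1), (2) and (3)."""
    n1 = sum(1 for a in actual if a == 1)           # N1: frames with speech
    n2 = len(actual) - n1                           # N2: frames without speech
    m1 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # M1
    m2 = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # M2
    tpr = m1 / n1 if n1 else 0.0
    fpr = m2 / n2 if n2 else 0.0
    accuracy = (m1 + (n2 - m2)) / len(actual)
    return tpr, fpr, accuracy


# Example: five frames, three of which contain speech.
tpr, fpr, accuracy = vad_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

In this example M1 = 2, M2 = 1, N1 = 3 and N2 = 2, so tpr = 2/3, fpr = 1/2 and accuracy = 3/5.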

5 Experimental analysis

MATLAB R2015b has been used to extract the desired acoustic features of the speech signal for VAD. All experiments are conducted on Windows 8.1 on an Intel Core i7 processor with 4 GB of physical memory. The appropriateness of the selected features for VAD is confirmed by comparing the accuracy with two standard state of the art methods, VAD G.729 and Sohn VAD.

5.1 Dataset Used

The NOIZEUS [29] database is used to measure the performance of the proposed approach. It contains 30 IEEE sentences (spoken by three male and three female speakers). These sentences have been mixed with eight types of real world noise (Airport, Babble, Car, Exhibition hall, Restaurant, Railway station, Street and Train) at four different SNRs (0 dB, 5 dB, 10 dB and 15 dB). The added noises have been taken from the AURORA database. To verify the appropriateness of our selected features for VAD, experiments are also done using the IndicTTS [35] dataset, in which sentences are spoken in 13 different Indian languages by male and female speakers belonging to different regions of India.
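To make the SNR levels concrete, the following sketch scales a noise signal so that the ratio of average speech power to average noise power matches a target value in dB. It is a generic illustration and an assumption on our part: the function names are hypothetical, and the sketch is not necessarily the procedure used to build the NOIZEUS recordings.

```python
import math


def power(x):
    """Average power of a signal."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10 * log10(P_speech / P_noise) equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Example with synthetic signals standing in for speech and noise.
speech = [math.sin(0.1 * t) for t in range(200)]
noise = [((t * 37) % 11 - 5) / 5.0 for t in range(200)]
mixed, scaled = mix_at_snr(speech, noise, snr_db=5.0)
measured_snr = 10.0 * math.log10(power(speech) / power(scaled))
```

At 0 dB the scaled noise carries as much average power as the speech, which is why the lowest SNR condition is the hardest one in the experiments.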

To simulate real-life environments, short sentences (≤10 sec) of the database are concatenated to create signals of duration between 200 sec and 300 sec. The ratio of human speech to non-human speech (noise) is kept between 2:3 and 3:2 in these signals.

The SVM classifier is trained in six phases. In the first five phases, the classifier is trained using combinations of three or four features from PLMS, MFCC, RASTA-PLP and formant frequency, as shown in Table 2. Here, PLMS (3 × 1) represents 3 coefficients of PLMS per frame. Similar notation is used for the other features.

Table 2
Description of features used in different Phases of the experiment

Phase	Features Used
Phase1	PLMS (3 × 1), MFCC (20 × 1), RASTA-PLP (17 × 1)
Phase2	MFCC (20 × 1), RASTA-PLP (17 × 1), Formant Frequency (3 × 1)
Phase3	RASTA-PLP (17 × 1), Formant Frequency (3 × 1), PLMS (3 × 1)
Phase4	Formant Frequency (3 × 1), PLMS (3 × 1), MFCC (20 × 1)
Phase5	PLMS (3 × 1), MFCC (20 × 1), Formant Frequency (3 × 1), RASTA-PLP (17 × 1)
Phase6	Pitch (1 × 1), DFT (16 × 1), DFT₈ (16 × 1), DFT₁₆ (16 × 1), MFCC (20 × 1), MFCC₈ (20 × 1), MFCC₁₆ (20 × 1), LPC (12 × 1), RASTA-PLP (17 × 1) and AMS (135 × 1)

Here, the size of the feature vector for a frame varies between 23 × 1 (in Phase 3, 23 = 17 (RASTA-PLP) + 3 (Formant Frequency) + 3 (PLMS)) and 43 × 1 (in Phase 5, 43 = 3 (PLMS) + 20 (MFCC) + 3 (Formant Frequency) + 17 (RASTA-PLP)). In the sixth phase of the experiment, a feature vector of size 273 × 1 is used to train the SVM classifier. To extract these features, the input speech signal is divided into frames of duration 25 ms with a frame shift of 10 ms. To confirm the effectiveness of the approach, the results are compared with two standard methods: VAD G.729 and Sohn VAD.
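The feature vector sizes quoted above follow directly from Table 2, and the framing parameters determine how many feature vectors a signal produces. The short sketch below reproduces this arithmetic. The dictionary and function names are illustrative, the sampling frequency in the example is an assumed value, and the frame count formula assumes that only complete frames are kept.

```python
FEATURE_SIZES = {"PLMS": 3, "MFCC": 20, "RASTA-PLP": 17, "Formant Frequency": 3}

PHASES = {
    "Phase1": ["PLMS", "MFCC", "RASTA-PLP"],
    "Phase2": ["MFCC", "RASTA-PLP", "Formant Frequency"],
    "Phase3": ["RASTA-PLP", "Formant Frequency", "PLMS"],
    "Phase4": ["Formant Frequency", "PLMS", "MFCC"],
    "Phase5": ["PLMS", "MFCC", "Formant Frequency", "RASTA-PLP"],
}

# Phase 6 uses its own list of ten features (sizes from Table 2).
PHASE6_SIZES = {"Pitch": 1, "DFT": 16, "DFT8": 16, "DFT16": 16, "MFCC": 20,
                "MFCC8": 20, "MFCC16": 20, "LPC": 12, "RASTA-PLP": 17, "AMS": 135}


def phase_dimension(phase):
    """Number of coefficients per frame in the given phase."""
    if phase == "Phase6":
        return sum(PHASE6_SIZES.values())
    return sum(FEATURE_SIZES[name] for name in PHASES[phase])


def frame_count(n_samples, fs, frame_ms=25, shift_ms=10):
    """Number of complete frames for a 25 ms window and a 10 ms shift."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // shift


dimensions = {p: phase_dimension(p) for p in list(PHASES) + ["Phase6"]}
frames_in_one_second = frame_count(n_samples=8000, fs=8000)  # fs is an assumed value
```

The per-frame dimensions are 40, 40, 23, 26, 43 and 273 for Phases 1 to 6, so the Phase 6 vector is more than six times larger than the largest of the reduced sets.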

VAD G.729 is an ITU-T standard that is applied to reduce the transmission rate during silence periods. Sohn VAD [20] takes its decision using statistical methods. The process for training the SVM classifier and testing the frames of a new input speech signal is shown in Fig. 9.

Fig.9

Procedure to train SVM classifier and test new speech signal.
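The train-then-test flow of Fig. 9 can be illustrated with a small linear SVM trained by subgradient descent on the regularised hinge loss. This is only a stand-in under stated assumptions: the paper does not specify the kernel or solver used in MATLAB, the function names and hyperparameters are hypothetical, and the two-dimensional toy vectors replace the real per-frame feature vectors (labels are +1 for speech and -1 for non-speech).

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimise lam/2 * |w|^2 + mean hinge loss by full-batch subgradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:                      # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (speech) or -1 (non-speech) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Toy training set standing in for labelled frame features.
X_train = [[2, 2], [3, 1], [1, 3], [-2, -2], [-3, -1], [-1, -3]]
y_train = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X_train, y_train)
labels = [predict(w, b, x) for x in X_train]
```

In the setting of this paper, each row of the training matrix would be the concatenated feature vector of one frame (for example 23 values in Phase 3), and the trained model would then be applied frame by frame to a new signal.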

5.2 Result analysis

In this subsection, the experimentally obtained accuracy of the trained SVM classifier is compared with the accuracy of existing state of the art approaches to VAD. The accuracies of the SVM classifier in Phase1 to Phase6, VAD G.729 and Sohn VAD at four SNRs (0 dB, 5 dB, 10 dB and 15 dB) with Babble noise are shown in Fig. 10. From Fig. 10, it is found that the maximum accuracy is obtained in Phase4 at 0 dB, 10 dB and 15 dB SNR. The accuracies of Phase4 and Phase6 are comparable at 5 dB SNR. Among the first five phases of the experiment, the accuracy of the classifier is lowest in Phase2 (when PLMS coefficients are not used for training) at 0 dB, 10 dB and 15 dB SNR. These two observations confirm the importance of PLMS coefficients for VAD.

Fig.10

Accuracy of SVM classifier (in presence of Babble noise) in different phases of the experiment and VAD G.729.

The accuracies of the SVM classifier in the first five phases, except Phase2, are almost the same for noisy speech mixed with Car noise at 0 dB, 5 dB, 10 dB and 15 dB SNR, as shown in Fig. 11. It is also found that the accuracy is lowest in Phase2, where PLMS coefficients have not been used for training. These observations again support the importance of PLMS coefficients for VAD. The accuracy of the SVM classifier is found to be highest in Phase3 when the noisy speech contains Exhibition, Restaurant, Station or Street noise at 0 dB, 5 dB, 10 dB and 15 dB SNR, as shown in Figs. 12–15. Among the first five phases, the accuracy of the classifier is lowest in Phase 2.

Fig.11

Accuracy of SVM classifier (in presence of Car noise) in different phases of the experiment and VAD G.729.

Fig.12

Accuracy of SVM classifier (in presence of Exhibition noise) in different phases of the experiment and VAD G.729.

Fig.13

Accuracy of SVM classifier (in presence of Restaurant noise) in different phases of the experiment and VAD G.729.

Fig.14

Accuracy of SVM classifier (in presence of Station noise) in different phases of the experiment and VAD G.729.

Fig.15

Accuracy of SVM classifier (in presence of Street noise) in different phases of the experiment and VAD G.729.

The accuracies of the SVM classifier in all six phases except Phase 2 are almost the same, and the accuracy is lowest in Phase 2, for noisy speech mixed with Train noise at 0 dB, 5 dB, 10 dB and 15 dB SNR, as shown in Fig. 16. It is concluded that the SVM classifier, when trained with the selected speech features in the first five phases except Phase 2, achieves higher VAD accuracy than the other state of the art methods (VAD G.729 and Sohn VAD). This supports the appropriateness of the selected features. It is also observed that the average performance of the SVM classifier is lowest in Phase 2 (when PLMS coefficients are not used for training). Hence, it can be concluded that the newly proposed PLMS feature plays a prominent role in VAD.

Fig.16

Accuracy of SVM classifier (in presence of Train noise) in different phases of the experiment and VAD G.729.

The accuracies of the different phases of the proposed SVM based VAD and of the state of the art methods (G.729 and Sohn) for real world noises (Babble, Car, Exhibition, Restaurant, Station, Street and Train) are listed in Table 3. From Table 3 it can be observed that, among the first five phases, accuracy is lowest in Phase 2 for most of the noises at 0 dB SNR, which again confirms the effectiveness of PLMS coefficients for VAD. It can also be concluded from Table 3 that PLMS coefficients are effective even at low SNR values.

Table 3

Accuracy(%) comparison of referenced VADs (Sohn and G.729) with proposed SVM based VADs in different Phases

Noise type	SNR	G.729	Sohn	Phase1	Phase2	Phase3	Phase4	Phase5	Phase6
Babble	0 dB	61.2	54.09	81.63	74.32	79.45	83.46	81.65	83.33
	5 dB	65.03	48.92	85.82	68.77	75.63	85.16	63.66	85.74
	10 dB	65.24	42.20	99.73	99.19	99.62	99.1	99.7	88.57
	15 dB	80.55	33.94	99.73	99.23	99.71	98.98	99.72	97.48
Car	0 dB	77.09	84.20	77.64	39.88	78.02	77.27	77.71	58.73
	5 dB	85.83	82.85	78.05	57.89	78.13	77.65	77.99	67.88
	10 dB	87.21	81.56	73.52	77	72.06	72.2	72.93	46.81
	15 dB	82.78	75.79	99.43	98.33	94.76	93.69	99.59	71.67
Exhibition	0 dB	65.27	62.64	81.13	56.62	93.84	75.42	82.73	83.78
	5 dB	72.23	62.84	84.77	55.83	98.75	76.91	86.34	92.22
	10 dB	87.39	63.97	99.98	94.75	99.83	99.94	99.97	74.46
	15 dB	84.49	64.24	99.98	99.5	99.83	99.76	99.96	99.64
Restaurant	0 dB	51.03	38.88	38.46	33.81	63.98	36.9	39.46	61.87
	5 dB	59.14	51.79	48.2	36.6	75.66	44.95	48.76	66.68
	10 dB	60.03	63.21	97.57	82.99	86.11	98.4	98.42	67.67
	15 dB	54.58	47.29	71.99	89.97	75.99	71.74	71.77	71.71
Station	0 dB	83.03	82.23	56.96	55.93	79.81	56.64	60.22	62.58
	5 dB	87.17	82.80	79.98	55.92	88.25	77.14	81.01	67.27
	10 dB	76.35	83.48	99.99	74.06	99.88	99.94	99	63.86
	15 dB	89.99	83.49	87.14	93.26	88.37	82.92	87.56	85.15
Street	0 dB	67.77	61.73	62.89	60.99	80.97	58.31	63.21	66.47
	5 dB	53	62.40	53.52	46.81	92.18	44.83	54.95	69.52
	10 dB	61.88	62.08	99.78	71.24	98.11	96.87	99.84	72.38
	15 dB	74.44	70.07	72.02	71.09	87.38	71.75	71.87	87.41
Train	0 dB	61.59	69.10	91.13	68.82	99.6	85.22	94.12	81.34
	5 dB	64.68	67.44	98.99	71.31	99.92	97.76	99.99	86.17
	10 dB	65.57	66.98	98.99	98.87	99.97	99.98	99.99	85.15
	15 dB	66.55	66.81	99.5	99.88	99.98	99.98	99.99	99.99

The true positive rate (tpr) and false positive rate (fpr) of the different phases of the proposed SVM based VAD and of the state of the art methods (G.729 and Sohn) for real world noises (Babble, Car, Exhibition (Exh), Restaurant (Rest), Station (Stn), Street and Train) are listed in Table 4. From Table 4 it can be observed that, among the first five phases, tpr is lowest in Phase 2 for most of the noises. This again supports the effectiveness of PLMS coefficients for VAD.

Table 4

tpr and fpr comparison of referenced VADs (Sohn and G.729) with proposed SVM based VADs in different Phases

		G.729		Sohn		Phase1		Phase2		Phase3		Phase4		Phase5		Phase6
Noise type	SNR	tpr	fpr	tpr	fpr	tpr	fpr	tpr	fpr	tpr	fpr	tpr	fpr	tpr	fpr	tpr	fpr
Babble	0 dB	33.58	38.08	57.08	48.01	67.64	1.8	55.24	0.08	64.27	0.08	73.06	3.78	67.7	0.10	72.32	0.00
	5 dB	39.79	05.65	55.21	44.33	78.68	0.3	52.88	0.76	63.25	0.18	78.16	2.18	45.29	0.52	74.19	0.00
	10dB	43.32	39.85	54.69	67.01	99.98	0.18	99.99	0.13	99.98	0.17	99.99	0.08	99.98	0.00	98.34	0.24
	15dB	41.82	02.09	54.57	81.30	99.98	0.55	99.98	1.74	99.95	0.58	99.93	2.25	99.98	0.60	99.42	0.00
Car	0dB	17.45	0.49	65.23	1.75	66.18	0.02	12.09	0.18	66.75	0.02	65.52	0.04	66.27	0.00	61.38	0.24
	5 dB	29.85	0.42	59.49	0.54	66.8	0.00	40.84	5.15	66.93	0.01	66.21	0.03	66.70	0.00	58.36	0.48
	10dB	32.22	0.43	57.84	1.24	53.81	0.00	60.03	0.19	51.56	0.01	51.52	0.01	52.77	0.00	48.34	0.28
	15dB	34.84	0.96	56.88	12.33	99.07	0.10	97.91	1.12	91.26	0.68	88.97	0.16	99.35	0.10	49.87	0.46
Exh	0dB	28.43	29.12	54.95	32.27	71.5	0.09	34.52	0.31	91.36	1.34	63.21	0.79	73.91	0.09	76.58	1.20
	5 dB	37.11	24.33	53.02	29.60	77.05	0.16	33.34	0.03	98.29	0.34	65.43	0.68	79.43	0.16	88.34	0.26
	10dB	39.29	4.50	52.63	25.99	99.99	0.02	91.53	0.91	99.93	0.31	99.99	0.10	99.99	0.04	64.63	0.50
	15dB	41.71	10.11	52.54	27.18	99.96	0.00	99.55	0.56	99.96	0.32	99.59	0.10	99.99	0.04	99.36	0.00
Rest	0dB	37.50	54.86	53.35	71.93	36.56	2.34	32.58	1.98	45.96	0.89	50.73	1.08	84.45	0.08	54.36	1.38
	5 dB	40.23	46.13	52.60	48.84	38.38	1.56	34.54	2.13	63.35	0.35	34.38	2.15	33.36	1.20	50.02	0.86
	10dB	41.19	48.99	52.49	28.27	96.23	0.62	71.15	1.27	77.04	1.71	98.72	2.02	97.8	0.75	55.23	0.00
	15dB	43.09	54.05	52.44	56.66	50.44	0.00	83.37	1.45	57.74	0.29	50.18	0.00	50.04	0.00	49.93	1.24
Stn	0 dB	28.44	0.94	65.14	5.35	34.88	0.00	33.32	0.00	69.46	0.03	34.43	0.08	39.81	0.00	48.90	1.80
	5 dB	34.39	0.49	63.19	1.90	69.72	0.00	33.32	0.00	82.24	0.02	65.43	0.01	71.27	0.00	50.90	1.26
	10dB	37.10	19.50	68.37	6.42	99.98	0.00	99.99	0.00	74.86	0.00	99.91	0.14	99.91	0.02	49.95	0.00
	15dB	38.81	2.79	57.90	0.18	77.25	0.00	88.34	0.33	79.42	0.00	69.78	0.00	77.98	0.00	73.73	0.00
Street	0 dB	26.03	24.13	58.51	36.21	42.65	0.13	39.81	0.33	71.28	1.31	36.68	2.20	43.16	0.17	56.38	2.47
	5 dB	34.25	49.49	60.02	36.29	51.86	2.35	43.57	3.12	88.35	0.03	43.76	2.38	51.78	0.56	54.28	0.68
	10dB	36.43	39.56	54.72	32.61	99.63	0.02	49.99	0.20	96.85	0.19	94.59	0.05	99.75	0.02	61.75	1.32
	15dB	37.26	22.96	54.31	18.27	50.49	0.00	49.98	0.53	77.7	0.04	49.94	0.00	50.22	0.00	77.72	0.00
Train	0 dB	30.10	35.65	58.88	24.59	85.91	0.05	50.36	0.00	99.75	0.32	77.6	1.90	90.68	0.08	68.82	2.48
	5 dB	36.31	28.34	55.29	24.59	99.95	0.00	56.62	0.00	99.93	0.09	96.63	0.03	99.98	0.00	79.98	0.00
	10dB	37.53	22.18	54.37	24.59	99.98	0.06	99.99	0.00	99.99	0.00	99.99	0.00	99.99	0.00	80.08	0.00
	15dB	42.38	11.54	54.02	24.59	99.99	0.03	99.98	0.01	99.99	0.00	99.99	0.00	99.99	0.00	99.98	0.00

Table 5

Average CPU time (in sec) comparison in different Phases of the experiment

	Phase1	Phase2	Phase3	Phase4	Phase5	Phase6
F.E.	31	31.3	29	30.4	32.7	38
Training	4	4	4	4	4	34
Test	23	23	22.6	23	23.3	26
Total	58	58.3	55.6	57.4	60	98

The times required for (i) extracting the selected features of the speech signal, (ii) training the SVM classifier, (iii) testing the SVM classifier and (iv) the complete process are shown in Fig. 17. Feature extraction (F.E.) time, training time, testing time and total time are almost the same in the first five phases and higher in Phase 6 of the experiment, as listed in Table 5. Phase 6 requires more time because of the larger feature vector for each speech frame.

Fig.17

Time required in feature extraction (F.E.), training of SVM classifier, testing time and total time for different phases of the experiment.

From Table 5 it can be concluded that the total time required for VAD in each of the first five phases is less than the total time required in Phase 6. Hence, the reduced feature sets are more useful for real-life applications such as Internet telephony and hearing aid devices.
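The totals in Table 5 and the relative saving of each reduced feature set with respect to Phase 6 can be recomputed from the listed component times. The sketch below only restates the values of Table 5; the dictionary layout and function names are illustrative.

```python
# (feature extraction, training, test) times in seconds, copied from Table 5.
TIMES = {
    "Phase1": (31.0, 4.0, 23.0),
    "Phase2": (31.3, 4.0, 23.0),
    "Phase3": (29.0, 4.0, 22.6),
    "Phase4": (30.4, 4.0, 23.0),
    "Phase5": (32.7, 4.0, 23.3),
    "Phase6": (38.0, 34.0, 26.0),
}


def total_time(phase):
    """Sum of feature extraction, training and test time for a phase."""
    return sum(TIMES[phase])


def saving_vs_phase6(phase):
    """Fraction of the Phase 6 total time saved by the given phase."""
    return 1.0 - total_time(phase) / total_time("Phase6")


savings = {phase: saving_vs_phase6(phase) for phase in TIMES if phase != "Phase6"}
```

For Phase 3, for example, the total is 55.6 sec against 98 sec for Phase 6, a saving of roughly 43%. Most of the difference comes from training, which takes 4 sec in the first five phases and 34 sec in Phase 6.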

6 Conclusions and future work

The applicability of VAD depends upon the total time taken by the classifier to distinguish frames with and without human speech. Hence, the selection of features plays an important role in the robustness of machine-learning based VAD. In this paper, a new speech feature, PLMS, is introduced, and its effectiveness for VAD is verified experimentally. Experiments are performed in six phases to select the appropriate combination of features for VAD. It is found that the combination of PLMS, RASTA-PLP and formant frequency (Phase 3 in Table 2) is the best for the proposed SVM-based VAD, and that the combination of PLMS, MFCC and formant frequency (Phase 4) is the best for Babble noise. The feature sets of both Phase 3 and Phase 4 take less time than that of Phase 6, which makes them appropriate for real-life applications such as hearing aid devices and online audio chat.

In future, the proposed feature PLMS can be applied in other speech processing areas such as speech recognition, speaker identification and language identification to increase accuracy and reduce the computational time required.


Table 1
Classification results of classifier

Frame	Human speech (Obtained)	Non-human speech (Obtained)
Human speech (Actual)	M1	N1 - M1
Non-human speech (Actual)	M2	N2 - M2
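
The classification counts tabulated above are enough to derive frame-level scores. The sketch below is illustrative only: it reads N1 and N2 as the numbers of actual speech and non-speech frames, and M1 and M2 as the numbers of those frames that the classifier labels as human speech. The function name `vad_scores` and the measure names are assumptions and may differ from the definitions used in Section 4.

```python
def vad_scores(m1, n1, m2, n2):
    """Frame-level scores from the confusion-matrix counts.

    n1: actual human-speech frames, of which m1 are labelled human speech.
    n2: actual non-human-speech frames, of which m2 are labelled human speech.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive")
    if not (0 <= m1 <= n1 and 0 <= m2 <= n2):
        raise ValueError("m1 and m2 must lie within [0, n1] and [0, n2]")
    correct = m1 + (n2 - m2)  # diagonal of the table
    return {
        "accuracy": correct / (n1 + n2),
        "speech_hit_rate": m1 / n1,    # speech frames detected as speech
        "false_alarm_rate": m2 / n2,   # non-speech frames detected as speech
    }


# Hypothetical counts, not results from the paper.
scores = vad_scores(m1=90, n1=100, m2=20, n2=100)
```

With these hypothetical counts, 170 of the 200 frames lie on the diagonal of the table, so the accuracy is 0.85.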
