Supervised speech enhancement based on deep neural network

Abstract

In real-world situations, speech signals reaching our ears are usually degraded by background noise. These distortions are detrimental to speech quality and intelligibility and cause serious problems for many speech-related applications, such as automatic speech recognition and speaker identification. To deal with background noise distortions, we propose a strategy in which speech enhancement is conducted using supervised deep neural network models. The models are trained to learn a mapping from the features of noisy speech to the ideal ratio mask (IRM). The estimated IRM is then applied to the noisy speech to obtain an enhanced version of the degraded speech. The mean square error (MSE) is used as the objective cost function. Additionally, global variance equalization is performed as a post-processing step to equalize the variances of the features. Systematic evaluations and comparisons show that the proposed supervised method substantially improves objective metrics of speech quality and intelligibility and significantly outperforms the competing and baseline speech enhancement methods. Finally, the proposed method is examined in a speaker identification task under noisy conditions, where it leads to the highest speaker identification rates when compared to the competing and baseline speech enhancement methods.

Keywords

Speech enhancement; deep neural networks; supervised learning; global variance; quality; intelligibility

1 Motivation

There are many forms of human communication, for example nonverbal and text-based communication. Speech is, however, the most effective and efficient form for humans. Through speech, we are able to convey instructions, emotions, etc. The usefulness of speech has led to a variety of speech processing applications. Successful use of these applications is, however, considerably hampered in the presence of background noise, because the noise signals overlap and mask the useful speech signals. To deal with the overlapping background noise, a speech enhancement strategy is essential in order to make noisy speech more understandable and pleasant. Speech enhancement converts noisy speech signals into enhanced speech signals with better perceptual quality and intelligibility. The motivation behind this research work is to deal effectively with background noise and to produce enhanced speech of high quality and intelligibility.

2 Introduction

Speech enhancement, which fundamentally suppresses background noise and thereby improves the quality and intelligibility of noisy speech, has various applications, such as automatic speech recognition (ASR), speaker identification (SI) and hearing aids. Generally, single-channel speech enhancement (SCSE) methods are categorized into two broad classes: unsupervised and supervised. In unsupervised SCSE methods, statistical models are used to estimate the clean speech from noisy speech signals without prior knowledge of the noise type or speaker identity. Therefore, in this class, no supervision and no classification of the signals as speech or noise is necessary. Unsupervised methods are robust when the noise sources are stationary; however, their noise suppression capability in nonstationary conditions is limited. In supervised SCSE methods, on the other hand, models are considered for both the speech and noise signals. The parameters of the models are learned by training on signal samples (speech and noise), an interfacing model is defined by combining the individual speech and noise models, and speech enhancement is then performed. Unsupervised SCSE methods such as spectral subtraction (SS) [1], the Wiener filter (WF) [2], log-minimum mean squared error (LMMSE) estimation [3] and others [4–10] are usually not very effective in low-SNR and nonstationary noisy situations [11]. Model-based methods have shown promising outcomes in such adverse conditions. For example, the methods in [12, 13] proposed probabilistic interfacing models between sources, based on prior learning, and demonstrated considerable performance gains. An important line of work is nonnegative matrix factorization (NMF), a model-based method for noise suppression [11, 14], in which noisy speech is modeled as a sum of nonnegative source bases. However, generalization to unknown noise sources is a fundamental problem for these methods, which are usually successful only for structured interfering sources, e.g. competing speakers [15].

3 Related literature

SCSE can also be addressed as a supervised learning problem [16]. Early methods commonly used a multilayer perceptron (MLP) in the time domain or frequency domain to map a mixed segment to a speech segment [17–19]. These methods used shallow neural networks (SNNs) and small training sets; consequently, they could not demonstrate the full capability and potential of supervised speech enhancement. In the past decade, prolific research in computational auditory scene analysis (CASA) has been revitalized and put to use in supervised speech enhancement methods. In [20], a classifier is trained to estimate the ideal binary mask (IBM) for binaural speech separation. In this method, a maximum a posteriori classifier is trained with two binaural features, i.e. inter-aural time differences and inter-aural intensity differences, to classify time-frequency (T-F) units as speech-dominant or noise-dominant. This method produced large gains in speech intelligibility under matched training and test conditions. Seltzer et al. [21] applied a Bayesian classifier to estimate and eliminate noise-dominant T-F units for robust ASR. For CASA-based speech separation, the authors in [22] trained sub-band MLPs to classify T-F units as speech-dominant or noise-dominant. In the mel-spectral domain, the authors in [20, 23] used a Gaussian mixture model (GMM) for IBM estimation. For normal-hearing listeners, at low SNRs and with comparable training and test noise segments, this method demonstrated considerable improvements in speech intelligibility.

Supervised speech enhancement comprises three fundamental components: training targets, learning machines and acoustic features [24]. The first proposed training target was the IBM, which was inspired by the masking phenomenon in auditory perception. The IBM assigns a value of 1 to speech-dominant T-F units and 0 otherwise. Tests have demonstrated that the IBM greatly improves speech intelligibility for both normal-hearing and hearing-impaired listeners [25–28]. Similar to the IBM, the target binary mask (TBM) [29] classifies T-F units by comparing the target speech with a reference speech-shaped noise (SSN), and has also been shown to improve speech intelligibility. Instead of a binary decision for each T-F unit, a soft decision leads to the concept of the IRM [30–32]. Compared to the IBM, the IRM improves speech quality [33]. The two estimation problems differ: IBM estimation is a classification problem, whereas IRM estimation is a regression problem. Besides masking-based targets, mapping-based targets have also been used in supervised speech enhancement algorithms. Mapping-based targets are T-F representations of clean speech, e.g. the log-spectrum. Although mapping-based targets seem more straightforward, prevailing studies have shown that they underperform masking-based targets, primarily in terms of speech intelligibility [33, 34].

Learning machines are critical for supervised speech enhancement methods. Wang and Wang [35] first used deep neural networks (DNNs) for supervised speech enhancement and showed large improvements over previous conventional methods. In each sub-band, a DNN is trained and tested to extract high-level features, which are then used to estimate the T-F mask. Acoustic features provide the discriminative information needed to estimate the T-F mask. Acoustic features including the mel-frequency cepstral coefficient (MFCC), perceptual linear prediction (PLP) [36], relative spectral transform PLP (RASTA-PLP) [37] and gammatone frequency cepstral coefficient (GFCC) [38, 39] have recently been used in supervised SCSE methods.

In this paper, a supervised SCSE method is proposed. The main contributions are threefold. First, we propose a DNN-based supervised speech enhancement framework to enhance noisy speech. Second, we use global variance equalization as a post-processing step to equalize the variances of the features and to reduce over-smoothing residual errors. Third, we examine the robustness of the proposed speech enhancement method in a speaker identification task. Comparisons show that the proposed method substantially improves objective metrics of speech quality and intelligibility and significantly outperforms the competing and baseline methods.

The remainder of the paper is organized as follows. Section 4 describes the proposed supervised method. The experimental settings and results are presented in Sections 5 and 6. Speaker identification results are discussed in Section 7. The advantages are listed in Section 8. Finally, the summary and concluding remarks are presented in Section 9.

4 Proposed speech enhancement framework

Supervised speech enhancement treats the task as a supervised learning problem, so that the mapping is determined entirely from the input data. The proposed method contains four modules: feature extraction, training, enhancement and waveform reconstruction. In the training stage, the DNN model is trained using features of the noisy speech and the underlying clean speech signals. The candidate acoustic feature sets include PLP, RASTA-PLP, MFCC, GFCC and the amplitude modulation spectrogram (AMS). We selected the combination of RASTA-PLP, MFCC and AMS features, coupled with their delta features. An auto-regressive moving average (ARMA) filter [40, 41] is applied to smooth the temporal trajectories of the extracted features in order to improve speaker identification rates: $\hat{F}(t) = \frac{\hat{F}(t - k) + \cdots + F(t) + \cdots + F(t + k)}{2k + 1}$ (1)

Here F(t) denotes the feature vector at time frame t, $\hat{F}(t)$ is the filtered feature vector and k is the order of the filter. A second-order ARMA filter is used, following [42]. Dropout is used to improve the performance of the DNN. The short-time Fourier transform (STFT) is applied to the noisy input speech signal, which is modeled as: $y(t) = s(t) + n(t)$ (2)

Here s(t) and n(t) denote the clean speech and noise signals. After computing the STFT of the noisy speech, the frequency-domain representation of y(t) is: $Y(ω) = S(ω) + N(ω)$ (3)

Here ω denotes the frequency band. In the enhancement stage, the trained DNN is fed with noisy features to estimate the coefficients of the IRM, which is defined as: $IRM(t, ω) = \left(\frac{X^{2}(t, ω)}{X^{2}(t, ω) + N^{2}(t, ω)}\right)^{α}$ (4) Equivalently, in terms of the local signal-to-noise ratio $SNR(t, ω) = X^{2}(t, ω) / N^{2}(t, ω)$: $IRM(t, ω) = \left(\frac{SNR(t, ω)}{SNR(t, ω) + 1}\right)^{α}$ (5)

Here X²(t, ω) and N²(t, ω) represent the speech and noise energies in each T-F unit, and α is a parameter that tunes the mask. We tested various values of α and found α = 0.5 to be the best choice. The clean speech magnitude is estimated by multiplying the estimated mask with the noisy speech magnitude. Finally, the inverse STFT and overlap-add are applied to reconstruct the enhanced speech.
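As an illustration of the feature smoothing in Eq. (1) and the mask in Eqs. (4) and (5), the following NumPy sketch may help. It is not the authors' implementation: the function names are hypothetical, frames within k of the utterance boundary are left unfiltered, a small constant is added to the denominator of the mask, and the toy energies are made up.

```python
import numpy as np

def arma_smooth(features, k=2):
    """Eq. (1) on a (frames, dims) array: average the k previously filtered
    frames, the current raw frame and the k following raw frames."""
    raw = np.asarray(features, dtype=float)
    out = raw.copy()
    # Assumption: the first and last k frames are left unfiltered.
    for t in range(k, len(raw) - k):
        out[t] = (out[t - k:t].sum(axis=0) + raw[t:t + k + 1].sum(axis=0)) / (2 * k + 1)
    return out

def ideal_ratio_mask(speech_energy, noise_energy, alpha=0.5):
    """Eq. (4), evaluated element-wise over T-F units."""
    eps = np.finfo(float).eps  # assumption: avoids 0/0 in empty units
    return (speech_energy / (speech_energy + noise_energy + eps)) ** alpha

def irm_from_snr(snr, alpha=0.5):
    """Eq. (5), with snr = speech_energy / noise_energy on a linear scale."""
    return (snr / (snr + 1.0)) ** alpha

# Toy 2x2 grid of T-F units (made-up energies).
speech_energy = np.array([[4.0, 1.0], [0.0, 9.0]])
noise_energy = np.array([[4.0, 3.0], [1.0, 1.0]])
mask = ideal_ratio_mask(speech_energy, noise_energy)

# Enhancement step: scale the noisy magnitude by the mask.
# sqrt(X^2 + N^2) is only a stand-in for the noisy STFT magnitude.
noisy_magnitude = np.sqrt(speech_energy + noise_energy)
enhanced_magnitude = mask * noisy_magnitude
```

With α = 0.5 and this stand-in for the noisy magnitude, the masked output equals the square root of the speech energy, which makes the role of the mask as a gain function easy to inspect.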

Fig.1

(A): The dimension-independent and dimension-dependent global variances of the estimated and corresponding clean speech features, (B): Global variances after performing equalization.

4.1 Post-processing using global variance equalization

The over-smoothing residual errors result in a muffled sound when the estimated speech is compared with the underlying clean version. To mitigate this problem, global variance equalization is performed in order to equalize and lift the global variances (GVs) of the features. Global variance equalization is a type of histogram equalization, which plays a vital role in density matching [43]. It is demonstrated in [44] that using global variance information during voice conversion can considerably improve subjective quality. The dimension-dependent global variances of the estimated and underlying clean speech features are denoted GV_EF(d) and GV_CF(d) and are given by: ${GV}_{EF}(d) = \frac{1}{N} \sum_{n=1}^{N} \left(\hat{S}_{n}(d) - \frac{1}{N} \sum_{n=1}^{N} \hat{S}_{n}(d)\right)^{2}$ (6) ${GV}_{CF}(d) = \frac{1}{N} \sum_{n=1}^{N} \left(S_{n}(d) - \frac{1}{N} \sum_{n=1}^{N} S_{n}(d)\right)^{2}$ (7)

Here $\hat{S}_{n}(d)$ is the d-th element of the DNN output vector at the n-th frame, $S_{n}(d)$ is the corresponding clean speech feature, and N is the total number of speech frames in the training data. The dimension-independent global variances of the estimated and underlying clean speech features are denoted GV_EF-DI and GV_CF-DI and, with D the dimension of the feature vector, are given by: ${GV}_{EF-DI} = \frac{1}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \left(\hat{S}_{n}(d) - \frac{1}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \hat{S}_{n}(d)\right)^{2}$ (8) ${GV}_{CF-DI} = \frac{1}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} \left(S_{n}(d) - \frac{1}{N D} \sum_{n=1}^{N} \sum_{d=1}^{D} S_{n}(d)\right)^{2}$ (9)
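The global variances in Eqs. (6) to (9) are population variances taken either per feature dimension or over all entries at once. A minimal NumPy sketch, assuming the features are stored as an N × D array; the function names and the synthetic data are hypothetical, and the "estimated" features are just a scaled copy used to mimic over-smoothing:

```python
import numpy as np

def gv_per_dimension(features):
    """Eqs. (6)-(7): variance over the N frames, one value per dimension d."""
    frame_mean = features.mean(axis=0)
    return ((features - frame_mean) ** 2).mean(axis=0)

def gv_dimension_independent(features):
    """Eqs. (8)-(9): a single variance over all N*D entries."""
    return ((features - features.mean()) ** 2).mean()

rng = np.random.default_rng(0)
clean = rng.normal(size=(500, 8))   # made-up clean features, N=500, D=8
estimated = 0.5 * clean             # over-smoothed stand-in for DNN output

gv_cf = gv_per_dimension(clean)
gv_ef = gv_per_dimension(estimated)
gv_cf_di = gv_dimension_independent(clean)
gv_ef_di = gv_dimension_independent(estimated)
```

In this toy case every estimated variance is a quarter of the clean one, which is the kind of gap that the equalization step is meant to close.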

Fig.2

Over-smoothing analysis; spectrograms of speech utterance degraded by white noise at 0 dB. Estimated DNN output (right), the clean utterance (middle) and the noisy utterance (left)

The GVs of the estimated and the underlying clean speech spectra across the frequency bands are shown in Fig. 1(A). The GVs of the estimated features are lower than those of the clean features, which indicates that the estimated speech spectra are over-smoothed. Moreover, in low-SNR conditions, the over-smoothing becomes worse and the formant peaks are suppressed. Over-smoothing in the high-frequency bands leads to muffled speech. Figure 2 shows spectrograms of a speech utterance degraded by white Gaussian noise at 0 dB. Substantial over-smoothing can be observed: the formant peaks are suppressed, mainly in the frequency bands between 2000 and 4000 Hz, which produces muffled speech. To reduce the over-smoothing, the equalization factors are defined as: $μ(d) = \sqrt{\frac{{GV}_{CF}(d)}{{GV}_{EF}(d)}}, \quad λ(d) = \sqrt{\frac{{GV}_{CF-DI}}{{GV}_{EF-DI}}}$ (10)

The parameters μ(d) and λ(d) are the equalization factors used to equalize the dimension-dependent and dimension-independent GVs, respectively, of the estimated and underlying clean speech feature vectors. The equalization factors are learned and updated automatically from the training data. The features of all input utterances are normalized to zero mean and unit variance. The output of the DNN is transformed back as: $\bar{S}(d) = \hat{S}(d)\, σ(d) + m(d)$ (11)

The parameters σ(d) and m(d) are the variance and mean of the speech features. The equalization factors lift the variances of the output speech as: $\bar{S}(d) = \hat{S}(d)\, ψ\, σ(d) + m(d)$ (12)

The parameter ψ can be either μ or λ. The equalization factors sharpen the formant peaks of the estimated speech and suppress the residual noise, and hence significantly improve the overall speech quality and intelligibility.
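The equalization in Eqs. (10) to (12) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the data are synthetic, and the de-normalization parameters σ(d) and m(d) are set to 1 and 0 so that the effect of ψ is isolated.

```python
import numpy as np

def equalization_factors(clean, estimated):
    """Eq. (10): per-dimension factor mu(d) and dimension-independent lambda."""
    mu = np.sqrt(clean.var(axis=0) / estimated.var(axis=0))
    lam = np.sqrt(clean.var() / estimated.var())
    return mu, lam

def equalize(estimated_normalized, psi, sigma, mean):
    """Eq. (12): scale the normalized DNN output by psi, then de-normalize."""
    return estimated_normalized * psi * sigma + mean

rng = np.random.default_rng(1)
clean = rng.normal(size=(1000, 4))   # made-up clean features
estimated = 0.5 * clean              # over-smoothed stand-in for DNN output

mu, lam = equalization_factors(clean, estimated)
# Assumption for this toy case: sigma(d) = 1 and m(d) = 0.
restored = equalize(estimated, mu, sigma=1.0, mean=0.0)
```

After equalization with μ(d), the per-dimension variances of the restored features match those of the clean features; using the scalar λ instead applies one common gain to all dimensions.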

5 Experimental setting

We used 720 utterances from the IEEE open speech repository [45] for training. The core test set comprised 200 utterances from unknown speakers of both genders. To match the sampling rate of the noise sources, all speech utterances are resampled to 8 kHz. We used twelve noise sources from the AURORA dataset [46] during training and testing: airport, babble, car, coffee shop, exhibition hall, factory, restaurant, street, subway, train, white noise and pink noise. Except for white and pink noise, the other 10 noise sources are considered nonstationary. The duration of each noise source is approximately 5 minutes. To construct the training set, the first half of each noise source is mixed with the training utterances at –10 dB, –5 dB, 0 dB, 5 dB and 10 dB SNR. The test mixtures are created by mixing the test utterances with the second half of each noise source.
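Mixing an utterance with noise at a prescribed SNR amounts to scaling the noise so that the speech-to-noise power ratio hits the target. The sketch below shows one common way to do this; the function name is hypothetical, the signals are synthetic, and simply truncating the noise to the utterance length (rather than drawing a random segment) is an assumption, since the text does not say how segments were selected.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the mixture has the requested SNR in dB."""
    noise = noise[:len(speech)]  # assumption: noise is at least as long as speech
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(2)
speech = rng.normal(size=8000)    # 1 s of stand-in "speech" at 8 kHz
noise = rng.normal(size=16000)    # stand-in noise recording

mixture = mix_at_snr(speech, noise, snr_db=-5.0)
achieved_snr_db = 10.0 * np.log10(np.mean(speech ** 2) / np.mean((mixture - speech) ** 2))
```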

DNNs are learning machines that have been shown to perform well in SCSE [34, 48]. Here, the DNN uses three hidden layers; the outer layer contains 1024 sigmoid activation units. The standard backpropagation algorithm is used to train the networks. To better deal with the mismatch between training and testing conditions, dropout regularization is used to improve the generalization of the network. The dropout rate is set to 0.2, and no unsupervised pretraining is used during the entire process. The adaptive gradient descent algorithm is coupled with a momentum term η to optimize the DNN. For the first 5 epochs, the momentum rate is fixed at 0.6; afterwards it is increased to 0.8 and kept fixed. The mean squared error (MSE) is used as the objective cost function. Figure 3 shows the cost function values at various epochs.
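The momentum schedule and the MSE cost described above can be illustrated on a toy problem. The sketch below is not the training code of the paper: it uses a plain gradient step with momentum in place of the adaptive gradient algorithm, a made-up three-parameter regression target instead of a DNN, and a learning rate of 0.1 that is not given in the text.

```python
import numpy as np

def momentum_for_epoch(epoch, switch_epoch=5, early=0.6, late=0.8):
    """Momentum schedule from the text; epochs are counted from 0."""
    return early if epoch < switch_epoch else late

def mse(prediction, target):
    """Mean squared error used as the cost function."""
    return float(np.mean((prediction - target) ** 2))

target = np.array([1.0, -2.0, 0.5])   # made-up regression target
weights = np.zeros(3)
velocity = np.zeros(3)
learning_rate = 0.1                    # assumption, not given in the text

initial_loss = mse(weights, target)
for epoch in range(50):
    grad = 2.0 * (weights - target) / weights.size   # gradient of the MSE
    velocity = momentum_for_epoch(epoch) * velocity - learning_rate * grad
    weights = weights + velocity
final_loss = mse(weights, target)
```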

Fig.3

Objective Cost Function (MSE) values at different epochs in Airport, Exhibition Hall and Subway noisy environments.

For objective evaluation, the short-time objective intelligibility (STOI) measure [49] is used to predict speech intelligibility. STOI is based on the correlation between the clean and enhanced speech utterances and has been shown to correlate highly with human speech intelligibility. To evaluate objective speech quality, the perceptual evaluation of speech quality (PESQ) measure is used [50]. Like STOI, PESQ scores are obtained by comparing the enhanced speech with the underlying clean version. STOI scores range from 0 to 1, whereas PESQ scores range from –0.5 to 4.5. The SNR is the earliest metric used to evaluate speech enhancement methods. However, the standard SNR does not correlate well with speech quality, because averaging over the entire signal length may obscure crucial content. To handle this problem, the SNR is computed over short segments and then averaged; this is referred to as the segmental SNR (SNRSeg). We use this metric to examine the noise suppression in the reconstructed speech. The weighted spectral slope (WSS) and log-likelihood ratio (LLR) are used to measure the distance between the enhanced and the underlying clean speech signals. The competing methods are selected from three different classes of single-channel speech enhancement: (a) NMF [11] from the model-based class, (b) nonnegative robust principal component analysis (NRPCA) [51] from the matrix-decomposition class, and (c) an improved version of OM-LSA [3], denoted LMMSE, from the statistical class. The proposed DNN-based speech enhancement method is referred to as DNN_P.
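The segmental SNR described above can be sketched as follows. The frame length (256 samples, i.e. 32 ms at 8 kHz), the use of non-overlapping frames and the clipping of per-frame values to [–10, 35] dB are assumptions borrowed from common practice, not settings stated in the text; the function name and test signals are hypothetical.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0):
    """Average of per-frame SNRs (in dB) between clean and enhanced signals."""
    eps = np.finfo(float).eps
    n_frames = len(clean) // frame_len
    frame_snrs = []
    for i in range(n_frames):
        start, stop = i * frame_len, (i + 1) * frame_len
        signal = clean[start:stop]
        error = signal - enhanced[start:stop]
        snr_db = 10.0 * np.log10((np.sum(signal ** 2) + eps) / (np.sum(error ** 2) + eps))
        # Assumption: clip each frame so silent or perfect frames do not dominate.
        frame_snrs.append(np.clip(snr_db, min_db, max_db))
    return float(np.mean(frame_snrs))

rng = np.random.default_rng(3)
clean = rng.normal(size=2048)                      # stand-in clean signal
slightly_noisy = clean + 0.1 * rng.normal(size=2048)
very_noisy = clean + 1.0 * rng.normal(size=2048)

score_good = segmental_snr(clean, slightly_noisy)
score_bad = segmental_snr(clean, very_noisy)
```

A signal closer to the clean reference receives the higher score, which is the sense in which a higher SNRSeg reflects stronger noise suppression.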

6 Analysis and results validation

To fully evaluate and validate the performance of the proposed method at all input SNRs, mean values over the noise sources are reported here.

6.1 Comparison with competing methods

The proposed method is first evaluated in terms of STOI, which is used extensively as a performance metric in speech enhancement and provides a measure of the overall intelligibility of the enhanced speech. Higher STOI scores imply better performance. Figure 4 shows the STOI scores obtained with DNN_P and the competing methods in 60 situations (12 noise types × 5 SNR levels). The outcomes are averaged over 200 speech utterances. According to Fig. 4, DNN_P consistently outperformed the competing methods; the only exceptions were subway and train noise at SNRs of –10 dB and –5 dB, where we deem that all processing methods performed well. On average, however, the proposed method surpasses the competing methods at all SNR levels. All noise sources led to high intelligibility scores (STOI > 85%) for SNR ≥ 5 dB, but considerable differences in STOI scores are found at low SNRs. The proposed method led to the best overall prediction score: 94.04%. Note from Fig. 4 that, in comparison with NMF, NRPCA and LMMSE, DNN_P achieved the best average STOI scores for the ten nonstationary and two stationary noise sources. For instance, at –5 dB SNR the predicted scores with airport noise improved from 55.14% with LMMSE and 59% with NRPCA to 74.34% with DNN_P. Similarly, the average predicted scores with white and pink noise improved from 56.20% and 62.80% with noisy speech to 78.03% and 85.6% with DNN_P, respectively. The overall average STOI scores across all noises for DNN_P are 60.89%, 71.39%, 79.96%, 87.73% and 92.82% at –10 dB, –5 dB, 0 dB, 5 dB and 10 dB, respectively. These results confirm the advantage of DNN_P in terms of intelligibility.

Fig.4

Objective Speech Intelligibility Rates using STOI Measure.

Table 1

Average PESQ against competing methods

SNR (dB)	Noisy	NMF	LMMSE	NRPCA	DNN_P
SNR -10	1.20	1.30	1.42	1.46	1.58
SNR -5	1.41	1.52	1.53	1.59	1.83
SNR 0	1.71	1.85	1.89	1.94	2.26
SNR 5	2.03	2.18	2.26	2.31	2.59
SNR 10	2.31	2.51	2.57	2.62	2.85
Average	1.73	1.87	1.93	1.98	2.22

Table 2

Average PESQ at different SNRs with different DNN Layers

SNR (dB)	Noisy	NMF	LMMSE	NRPCA	DNN₁	DNN₂	DNN₃	DNN₄
SNR -10	1.20	1.30	1.42	1.46	1.53	1.57	1.58	1.57
SNR -5	1.41	1.52	1.53	1.59	1.77	1.80	1.83	1.82
SNR 0	1.71	1.85	1.89	1.94	2.14	2.23	2.26	2.23
SNR 5	2.03	2.18	2.26	2.31	2.54	2.58	2.59	2.59
SNR 10	2.31	2.51	2.57	2.62	2.79	2.81	2.85	2.84
Average	1.73	1.87	1.93	1.98	2.15	2.19	2.22	2.21

Table 3

Average SNRSeg at different SNRs with different DNN Layers

SNR (dB)	Noisy	NMF	LMMSE	NRPCA	DNN₁	DNN₂	DNN₃	DNN₄
SNR -10	–6.17	–5.06	–3.82	–3.44	–2.94	–2.93	–2.90	–2.93
SNR -5	–5.56	–4.07	–2.37	–2.03	–1.19	–1.18	–1.15	–1.17
SNR 0	–3.76	–2.01	0.51	0.73	1.43	1.41	1.43	1.42
SNR 5	–1.30	0.96	2.18	2.33	3.97	3.95	3.96	3.93
SNR 10	2.40	3.45	4.93	5.02	6.40	6.39	6.39	6.35
Average	–2.87	–1.34	0.28	0.52	1.53	1.52	1.54	1.52

The PESQ metric is used to evaluate the performance of DNN_P in terms of speech quality. PESQ has been found to correlate highly with subjective judgments of speech quality [52] and reflects the perceptual quality of the enhanced speech. A higher PESQ score indicates better performance. Table 1 shows the results in terms of PESQ. The results demonstrate that DNN_P performs better than the competing methods in all noisy conditions, especially at low SNRs (–10 dB and –5 dB). Note from Table 1 that, compared to NMF, NRPCA and LMMSE, DNN_P achieved the best average PESQ scores for all noise sources. For example, at –10 dB SNR the predicted score with street noise improved from 1.25 with noisy speech to 1.69 with DNN_P (ΔPESQ_street = 0.44). Similarly, at –10 dB SNR the predicted scores with white and pink noise improved from 1.18 and 1.19 with noisy speech to 1.41 and 1.48 with DNN_P (ΔPESQ_white = 0.23 and ΔPESQ_pink = 0.29), respectively. The overall average PESQ scores across all noises for DNN_P are 1.58, 1.83, 2.26, 2.59 and 2.85 at –10 dB, –5 dB, 0 dB, 5 dB and 10 dB, respectively. The comparison in Table 2 shows that a DNN with three hidden layers (DNN₃) achieves the highest PESQ at every SNR level (tied with DNN₄ at 5 dB). Large improvements over noisy speech are achieved with DNN_P. The highest and lowest improvements in PESQ are observed at 5 dB (ΔPESQ = 0.56) and –10 dB (ΔPESQ = 0.38), respectively.
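The per-SNR PESQ improvements quoted above follow directly from the Noisy and DNN_P columns of Table 1. A short sketch of that arithmetic (the variable names are illustrative only):

```python
# Average PESQ from Table 1, indexed by input SNR in dB.
snr_levels = [-10, -5, 0, 5, 10]
pesq_noisy = [1.20, 1.41, 1.71, 2.03, 2.31]
pesq_dnn_p = [1.58, 1.83, 2.26, 2.59, 2.85]

# Improvement of DNN_P over unprocessed noisy speech at each SNR.
delta_pesq = {
    snr: round(enhanced - noisy, 2)
    for snr, noisy, enhanced in zip(snr_levels, pesq_noisy, pesq_dnn_p)
}

best_snr = max(delta_pesq, key=delta_pesq.get)    # SNR with largest gain
worst_snr = min(delta_pesq, key=delta_pesq.get)   # SNR with smallest gain
```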

The SNRSeg metric is also used to evaluate DNN_P. SNRSeg reflects the degree of noise suppression, and a higher SNRSeg indicates better performance. Table 3 provides the average SNRSeg results for DNN_P and the competing methods. According to these results, DNN_P outperforms the competing methods in all 60 conditions and achieved the highest average SNRSeg scores for all noise sources. The average score improved from –2.87 with noisy speech to 1.54 with DNN_P. The overall average SNRSeg scores across all noises for DNN_P are –2.90, –1.15, 1.43, 3.96 and 6.39 at –10 dB, –5 dB, 0 dB, 5 dB and 10 dB, respectively. The highest and lowest improvements in SNRSeg are observed at 5 dB (ΔSNRSeg = 5.26) and –10 dB (ΔSNRSeg = 3.27, from –6.17 to –2.90 in Table 3). From these comparisons, it is clear that DNN_P outperforms the competing methods at all SNRs and provides a higher degree of noise suppression. Figure 5 shows the SNRSeg improvements.

Fig.5

SNRSeg improvements analysis.

To further assess the performance of DNN_P, all the methods are evaluated in terms of the LLR metric. LLR measures the distance between the clean and enhanced speech signals. Unlike PESQ and SNRSeg, lower LLR scores indicate better performance. The LLR results achieved using DNN_P and the competing methods are given in Table 4. The scores show that DNN_P produces lower LLR values than the competing methods in all test conditions. The overall average results demonstrate that DNN_P performed better than the competing methods: the average distance improved from 1.42 with noisy speech to 0.55 with DNN_P.

Table 4

Average LLR at different SNRs with different DNN Layers

SNR (dB)	Noisy	NMF	LMMSE	NRPCA	DNN₁	DNN₂	DNN₃	DNN₄
SNR -10	2.31	1.85	1.39	1.29	0.97	0.91	0.90	0.91
SNR -5	1.72	1.49	1.16	1.05	0.83	0.77	0.77	0.79
SNR 0	1.37	1.21	0.99	0.79	0.56	0.50	0.51	0.51
SNR 5	1.02	0.93	0.79	0.71	0.41	0.39	0.36	0.38
SNR 10	0.70	0.68	0.65	0.61	0.30	0.29	0.25	0.27
Average	1.42	1.23	0.99	0.89	0.61	0.57	0.55	0.57

The last speech quality measure used to compare DNN_P with the competing methods is the WSS metric. As with LLR, a lower WSS indicates better performance. Table 5 gives the WSS results for DNN_P and the competing methods. As the results show, DNN_P surpasses the competing methods in all noisy conditions. Overall, the DNN with three hidden layers (DNN₃) gave the best average performance across SNRs in terms of PESQ, SNRSeg, LLR and WSS.

Table 5

Average Normalized WSS at different SNRs with different DNN Layers

SNR (dB)	Noisy	NMF	LMMSE	NRPCA	DNN₁	DNN₂	DNN₃	DNN₄
SNR -10	0.69	0.66	0.63	0.62	0.53	0.54	0.51	0.52
SNR -5	0.65	0.60	0.56	0.55	0.50	0.51	0.48	0.49
SNR 0	0.54	0.52	0.51	0.50	0.45	0.46	0.43	0.44
SNR 5	0.44	0.41	0.40	0.38	0.29	0.30	0.26	0.26
SNR 10	0.34	0.33	0.31	0.30	0.21	0.22	0.19	0.20
Average	0.53	0.50	0.48	0.47	0.39	0.40	0.37	0.38

We examined time-varying spectrograms to evaluate DNN_P in terms of residual noise and speech distortion. Figure 6 presents the spectrograms of DNN_P and the competing methods. A sample speech signal is degraded by airport noise at –5 dB SNR. The spectrogram of DNN_P is depicted in Fig. 6(F), where the harmonic spectra of the vowels are preserved in the output speech. Moreover, the spectra also reveal a fine structure in regions of speech activity. In speech-pause regions, DNN_P outperforms the competing methods in removing background noise. The weak harmonic structures in the high-frequency subbands are better preserved. Therefore, the perceptual quality of the speech produced by DNN_P is better than that of the competing methods. Residual noise is evident in the spectrograms of the output speech of the competing methods, shown in Fig. 6(C)-(E), and is noticeably reduced in the speech of DNN_P, shown in Fig. 6(F).

Fig.6

Time-varying Spectral Analysis. (A) clean speech, (B) Noisy Speech: Degraded by –5 dB Babble Noise, (C) Processed by LMMSE, (D) Processed by NMF, (E) Processed by NRPCA and (F) Processed by DNN_P.

6.2 Comparison with baseline DNN

We compared the proposed method with the DNN-based single-channel supervised method proposed in [33], referred to as the baseline DNN and denoted DNN_B. Table 6 presents the PESQ and SNRSeg results achieved by DNN_P and DNN_B at different input SNRs across the twelve noise sources. The results demonstrate that DNN_P outperforms DNN_B in all noisy conditions. The equalization factors effectively sharpen the formant peaks and significantly improve the overall perceptual speech quality, as confirmed by the higher PESQ scores. The higher SNRSeg scores indicate that the equalization factors also helped to suppress noise compared to DNN_B. The equalization factors effectively lifted the variances of the DNN output speech signals, as shown in Fig. 1(B). Table 7 presents the LLR and normalized WSS results achieved by DNN_P and DNN_B at different input SNRs across the twelve noise sources. The lower LLR and WSS values indicate that DNN_P outperforms DNN_B and that the spectral distance between the estimated and clean speech is smaller; the estimated speech is a close replica of the underlying clean speech. These evaluation results demonstrate the improved performance of DNN_P and verify its effectiveness. Figure 7 shows the STOI scores obtained with DNN_P and DNN_B over the 60 situations, averaged over 200 sentences; DNN_P outperformed the baseline DNN_B consistently in all 60 situations. The spectrograms of DNN_P and DNN_B are inspected to assess the residual noise and speech distortion. Figure 8 shows time-varying spectrograms of DNN_P and DNN_B for a speech utterance degraded by babble noise at –5 dB SNR. The spectrogram of DNN_P is depicted in Fig. 8(D), where the harmonic spectra of the vowels are preserved. Moreover, the spectrogram also shows a fine structure during speech activity. During speech pauses, DNN_P outperforms DNN_B in eliminating the background noise, and the weak harmonic structures in the high-frequency subbands are retained. Hence, the perceptual quality of the speech produced by DNN_P is better than that of DNN_B. The residual noise is noticeably reduced in the enhanced speech of DNN_P shown in Fig. 8(D). Speech components with weak energy are maintained, which yields less speech distortion.

Fig.7

Average objective speech Intelligibility Rates for noisy, DNN_B and DNN_P.

Fig.8

Time-varying Spectral Analysis. (A) Clean speech, (B) Noisy Speech: Degraded by –5 dB babble noise, (C) Processed by DNN_B, (D) Processed by DNN_P.

Table 6

Average PESQ and SNRSeg scores at all Noise Types

PESQ				SNRSeg
SNR (dB)	Noisy	DNN_B	DNN_P	SNR (dB)	Noisy	DNN_B	DNN_P
SNR -10	1.20	1.41	1.58	SNR -10	–6.17	–3.14	–2.90
SNR -5	1.41	1.64	1.83	SNR -5	–5.56	–1.39	–1.15
SNR 0	1.71	2.03	2.26	SNR 0	–3.76	1.21	1.43
SNR 5	2.03	2.42	2.59	SNR 5	–1.30	3.63	3.96
SNR 10	2.31	2.69	2.85	SNR 10	2.40	6.18	6.39
Average	1.73	2.04	2.22	Average	–2.87	1.29	1.54

Table 7

Average LLR and Normalized WSS scores at all Noise Types

LLR				Normalized WSS
SNR (dB)	Noisy	DNN_B	DNN_P	SNR (dB)	Noisy	DNN_B	DNN_P
SNR -10	2.31	1.07	0.90	SNR -10	0.69	0.62	0.51
SNR -5	1.72	0.92	0.77	SNR -5	0.65	0.57	0.48
SNR 0	1.37	0.67	0.51	SNR 0	0.54	0.51	0.43
SNR 5	1.02	0.49	0.36	SNR 5	0.44	0.36	0.26
SNR 10	0.70	0.39	0.25	SNR 10	0.34	0.28	0.19
Average	1.42	0.71	0.55	Average	0.53	0.46	0.37

7 Speaker identification

Noisy speech enhancement has been studied as a way to make speaker identification systems more robust [53]. To evaluate how well DNN_P and the baseline methods improve identification accuracy in nonstationary noisy conditions, speaker identification experiments are performed. In general, a speaker identification system consists of a training phase and a testing phase [54]. During the training phase, the system extracts sets of speech features and generates models for the speakers. During the testing phase, speech features are obtained from the test utterances and compared to the speaker models. The key objective of the speaker identification task is to determine which enrolled speaker produced a given test utterance. The literature shows that speaker identification systems based on mel-frequency cepstral coefficient (MFCC) features and Gaussian mixture speaker models (GMMs) are commonly used because of their high recognition accuracy for clean speech utterances [55]. However, their performance can be severely degraded when the test speech signals are corrupted by acoustic noise [56]. In the experiments described here, DNN_P-based speech enhancement is applied to provide noise robustness to the speaker identification system. The MFCC feature vectors are extracted from the enhanced versions of the noisy speech utterances.
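The testing phase of a GMM-based identifier reduces to scoring the test features under each speaker model and picking the best-scoring speaker. The sketch below assumes diagonal-covariance GMMs and average per-frame log-likelihood scoring, which is a common choice but is not specified in the text; the function names, the two single-component "speaker models" and the synthetic test frames are all hypothetical.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of (T, D) frames under a
    diagonal-covariance GMM with M components."""
    diff = frames[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    log_weighted = log_gauss + np.log(weights)                     # (T, M)
    peak = log_weighted.max(axis=1, keepdims=True)                 # log-sum-exp
    log_frame = peak[:, 0] + np.log(np.exp(log_weighted - peak).sum(axis=1))
    return float(log_frame.mean())

def identify_speaker(frames, speaker_models):
    """Return the name of the model with the highest log-likelihood."""
    scores = {name: gmm_log_likelihood(frames, *model)
              for name, model in speaker_models.items()}
    return max(scores, key=scores.get)

# Two made-up speakers, each modeled by a one-component GMM in 2-D.
speaker_models = {
    "speaker_A": (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    "speaker_B": (np.array([1.0]), np.array([[3.0, 3.0]]), np.array([[1.0, 1.0]])),
}
rng = np.random.default_rng(4)
test_frames = rng.normal(loc=3.0, scale=1.0, size=(50, 2))  # drawn near speaker_B
predicted = identify_speaker(test_frames, speaker_models)
```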

7.1 MFCC extraction

Following acquisition and pre-processing, the speech utterances are divided into short-time overlapping frames. The FFT is then applied to each speech frame and the spectral envelopes are obtained using Mel-scaled bandpass filters. The frequencies on the Mel scale (f_MEL) are related to the frequencies on the linear scale (f_Hz) by: $f_{MEL} = 1127 \ln \left(1 + \frac{f_{Hz}}{700}\right)$ (13)
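As a small worked sketch of Eq. (13), the conversion and its inverse can be written as below. The logarithm is taken to be the natural logarithm, which is the form consistent with the constant 1127 (it maps 1000 Hz to roughly 1000 Mel). The helper `mel_center_frequencies`, the function names and the 0 to 8000 Hz range in the usage line are illustrative assumptions rather than settings stated in the paper.

```python
import math

def hz_to_mel(f_hz):
    """Map a frequency in Hz to the Mel scale, following Eq. (13) with a natural log."""
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(f_mel):
    """Inverse mapping from Mel back to Hz."""
    return 700.0 * (math.exp(f_mel / 1127.0) - 1.0)

def mel_center_frequencies(num_filters, f_low_hz, f_high_hz):
    """Center frequencies in Hz of filters spaced uniformly on the Mel scale."""
    mel_low = hz_to_mel(f_low_hz)
    mel_high = hz_to_mel(f_high_hz)
    step = (mel_high - mel_low) / (num_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(num_filters)]

# 26 filters over an assumed 0-8000 Hz band.
centers = mel_center_frequencies(26, 0.0, 8000.0)
```

Because the spacing is uniform in Mel, the resulting center frequencies are dense at low frequencies and progressively sparser at high frequencies.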

The Mel scale is frequently used in speaker identification systems because it closely approximates the frequency resolution of the human auditory system. Given K filters in the Mel-frequency filterbank [64] and the log-energy output E_k of the k-th filter, the MFCC coefficients are computed as:

$\begin{matrix} {MFCC}_{j} = \sum_{k = 1}^{K} E_{k} \cos \left[ j \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right], \\ j = 1, 2, 3, \ldots, M \end{matrix}$ (14)

where M is the number of cepstral coefficients. We selected K = 26 for MFCC extraction, as adopted in the literature [54, 57].
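The cosine transform in Eq. (14) can be sketched directly from the filterbank log-energies. This sketch assumes that the normalizing term in the cosine argument equals the number of filters K, which is the usual form of this transform; the function name and the choice of 13 coefficients in the usage line are hypothetical and are not values given in the paper.

```python
import math

def mfcc_from_log_energies(log_energies, num_coeffs):
    """Apply the cosine transform of Eq. (14) to K filterbank log-energies."""
    num_filters = len(log_energies)  # K in Eq. (14)
    coeffs = []
    for j in range(1, num_coeffs + 1):
        total = 0.0
        for k in range(1, num_filters + 1):
            angle = j * (k - 0.5) * math.pi / num_filters
            total += log_energies[k - 1] * math.cos(angle)
        coeffs.append(total)
    return coeffs

# A flat log-energy profile across K = 26 filters carries no spectral shape,
# so every coefficient with j >= 1 is (numerically) zero.
flat_coeffs = mfcc_from_log_energies([1.0] * 26, 13)
```

The flat-spectrum case is a convenient sanity check: only variation across the filter outputs contributes to the coefficients.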

7.2 Experiments and results

The speaker identification experiments are performed with 100 speakers from the TIMIT database. Out of the 10 speech utterances per speaker, seven are used to train the speaker models and the remaining three are used for testing. All testing utterances are contaminated by twelve noise sources at SNR values of 0 dB, 5 dB, 10 dB and 15 dB. The tests are conducted considering a confidence interval of 95%. The feature vectors (composed of MFCC coefficients) of the enhanced speech are extracted from 32 ms frames with 50% overlap. Figure 9 shows the speaker identification rates for four noise types. The identification rates using clean utterances are about 98.8%. For speech processed by DNN_P, the identification rates varied from 86.65% with airport noise at 15 dB SNR to 31.11% at 0 dB SNR. Similarly, for speech processed by DNN_B, the identification rates varied from 76.61% with airport noise at 15 dB SNR to 23.13% at 0 dB SNR. Figure 9 demonstrates that DNN_P outperforms the competing methods.

The average identification accuracy improved from 33.84% with noisy speech utterances to 68.42% with utterances processed by the DNN_P method (Table 8), a gain of 34.58 percentage points. Among the individual noisy conditions, a notable improvement is achieved for car noise at 10 dB SNR, from 37.22% to 72.11%, a gain of 34.89 percentage points. Relative to the additional DNN-based baseline method (DNN_B), DNN_P improved the overall identification accuracy by 6.74 percentage points. Although it is outperformed by DNN_P, DNN_B also improved the identification performance for all the noise sources. In contrast, NMF and LMMSE degraded the identification accuracies for all the noise sources, whereas NRPCA improved the average identification rates for a few noise sources. Notably, the DNN-based methods, which obtained the best STOI scores (see Fig. 4), also yielded the best identification rates. Table 8 shows the average speaker identification rates over all noise sources.
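The framing used for feature extraction (32 ms frames with 50% overlap) can be sketched as follows. The 16 kHz sampling rate in the usage lines is an assumption for illustration, as are the function and parameter names; trailing samples that do not fill a whole frame are simply dropped in this sketch.

```python
def frame_signal(samples, sample_rate_hz, frame_ms=32.0, overlap=0.5):
    """Split a signal into fixed-length frames with fractional overlap."""
    frame_len = int(round(sample_rate_hz * frame_ms / 1000.0))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

# One second of placeholder samples at an assumed 16 kHz sampling rate:
# each frame holds 512 samples and consecutive frames start 256 samples apart.
signal = list(range(16000))
frames = frame_signal(signal, 16000)
```

Each frame would then be windowed and passed through the FFT and Mel filterbank before the cepstral coefficients are computed.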

Fig. 9

Speaker identification accuracies obtained with DNN_P, DNN_B, NRPCA, NMF, and LMMSE for four noise sources.

Table 8

Speaker identification rates (in %) over 12 noise sources

Processing method	0 dB	5 dB	10 dB	15 dB	Average rate
Un-Proc	16.12	21.01	41.03	57.21	33.84
LMMSE	20.13	29.12	45.42	61.01	38.92
NMF	23.32	31.02	43.12	60.22	39.42
NRPCA	29.22	35.10	49.23	68.44	45.49
DNN_B	38.33	53.21	72.02	83.19	61.68
DNN_P	41.03	62.32	80.54	89.81	68.42
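As a quick check of the "Average rates" column, the per-SNR values in Table 8 can be averaged directly. The sketch below only re-uses the numbers printed in the table; the variable names are illustrative. Gains are expressed in percentage points, that is, as differences between two percentages.

```python
# Per-SNR identification rates (%) copied from Table 8, ordered 0, 5, 10, 15 dB.
rates = {
    "Un-Proc": [16.12, 21.01, 41.03, 57.21],
    "LMMSE":   [20.13, 29.12, 45.42, 61.01],
    "NMF":     [23.32, 31.02, 43.12, 60.22],
    "NRPCA":   [29.22, 35.10, 49.23, 68.44],
    "DNN_B":   [38.33, 53.21, 72.02, 83.19],
    "DNN_P":   [41.03, 62.32, 80.54, 89.81],
}

# Mean over the four SNR levels for each processing method.
averages = {method: sum(values) / len(values) for method, values in rates.items()}

# Differences in percentage points relative to unprocessed speech and to DNN_B.
gain_over_noisy = averages["DNN_P"] - averages["Un-Proc"]
gain_over_baseline = averages["DNN_P"] - averages["DNN_B"]

for method, avg in averages.items():
    print(f"{method:8s} {avg:6.2f}")
print(f"DNN_P vs Un-Proc: {gain_over_noisy:.2f} points")
print(f"DNN_P vs DNN_B:   {gain_over_baseline:.2f} points")
```

With the tabulated values, the DNN_P average is about 68.42%, which is roughly 34.58 points above unprocessed speech and 6.74 points above DNN_B.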

8 Advantages of the proposed method

Finally, we highlight the advantages of the proposed method over the baseline DNN and the competing methods. (A) The proposed method uses the same features and DNN structure in all processes, which considerably reduces its computational complexity. (B) An auto-regressive moving average (ARMA) filter is used to smooth the temporal trajectories of the extracted features, which improves speaker identification. (C) A combination of RASTA-PLP, MFCC and AMS acoustic features is used during feature extraction, which makes the speech enhancement framework more robust. (D) The residual errors caused by over-smoothing are reduced by using Global Variance Equalization as a post-processing step; the equalization enhances the perceptual quality and intelligibility of the speech. With these additional features, the proposed method outperforms the baseline DNN, as the results clearly show.
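Points (B) and (D) can be illustrated with a minimal sketch. It assumes an MVA-style ARMA recursion, in which each smoothed value averages the previous m smoothed values with the current and next m raw values, and a single global scaling factor for variance equalization. The order m = 2, the function names, and the choice to rescale deviations around the mean are illustrative assumptions and may differ from the exact formulation used in the paper.

```python
import math

def arma_smooth(track, order=2):
    """Smooth one feature trajectory over time with a simple ARMA recursion."""
    n = len(track)
    smoothed = list(track)  # boundary frames are left unchanged
    for t in range(order, n - order):
        past = sum(smoothed[t - order:t])        # previously smoothed outputs
        future = sum(track[t:t + order + 1])     # current and upcoming raw inputs
        smoothed[t] = (past + future) / (2 * order + 1)
    return smoothed

def global_variance_equalize(estimated, reference_variance):
    """Rescale estimated features so their global variance matches a reference."""
    mean = sum(estimated) / len(estimated)
    est_var = sum((x - mean) ** 2 for x in estimated) / len(estimated)
    if est_var == 0.0:
        return list(estimated)  # nothing to rescale
    scale = math.sqrt(reference_variance / est_var)
    return [mean + scale * (x - mean) for x in estimated]

# An isolated spike in a feature trajectory is spread out and attenuated.
spike = [0.0] * 4 + [5.0] + [0.0] * 4
smoothed_spike = arma_smooth(spike)

# Over-smoothed estimates (low variance) are stretched to a reference variance.
equalized = global_variance_equalize([1.0, 2.0, 3.0, 4.0], reference_variance=5.0)
```

The two steps act in opposite directions by design: smoothing removes frame-to-frame jitter in the input features, while equalization restores the dynamic range that regression-based estimates tend to lose.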

9 Summary and conclusions

In this paper, a DNN-based speech enhancement method is presented, since comparative evaluations suggest that supervised machine learning frameworks offer extensive approximation potential. A series of experiments is carried out to examine the deep neural network structures in terms of intelligibility, quality and speaker identification rates. A combination of RASTA-PLP and AMS features is used. The features are smoothed with an auto-regressive moving average filter so that the distortion levels are appropriately controlled to provide improved speech intelligibility and identification. For better generalization, dropout and Global Variance Equalization techniques are used. To validate the effectiveness of the proposed method, its performance is compared with the NMF, NRPCA and LMMSE methods. For further validation, DNN_P is compared with the baseline DNN_B. Several numbers of hidden layers are considered to examine the performance of DNN_P. DNN_P with three hidden layers yielded comparable or even improved results in terms of STOI, PESQ, SNRSeg, LLR and WSS. In the last step, we conducted speaker identification experiments. On the basis of the results and analysis, the following concluding remarks are drawn:

All noise sources led to high intelligibility rates (STOI > 85%) for SNR ≥ 0 dB. The best overall STOI prediction rate of 94.04% is achieved by the proposed method. Average STOI prediction rates of 65.25% and 72.11% are achieved by the proposed method at the low SNR conditions (–10 dB and –5 dB). The proposed method achieved the best average rates over all noise sources compared to the noisy speech and the competing methods.

In terms of PESQ, the proposed method consistently performed better than the competing methods in all noise conditions, especially at the low SNR levels of –10 dB and –5 dB. The overall average PESQ scores of the proposed method over all noise sources are consistently higher at all input SNR levels. The PESQ improvements at low SNRs (–10 dB and –5 dB) are significant compared to the competing methods.

The SNRSeg scores show that a high degree of noise is suppressed in the speech processed by the proposed method, which achieved higher SNRSeg scores at all input SNR levels. The average SNRSeg improved from –2.87 dB with noisy speech to 1.54 dB with the proposed method, which is a significant overall improvement.

In terms of LLR and WSS, the proposed method showed a smaller distance between the clean and enhanced speech, achieving lower values than the competing methods, which confirms that the enhanced speech is closer to the clean speech.

The identification rates using clean utterances are about 98.8%. The average identification accuracy improved from 33.84% with noisy speech utterances to 68.42% with utterances processed by the DNN_P method (Table 8), a gain of 34.58 percentage points. Among the individual noisy conditions, a significant improvement is achieved for car noise at 10 dB SNR, from 37.22% to 72.11%, a gain of 34.89 percentage points. Relative to the DNN-based baseline method (DNN_B), the overall identification accuracy improved by 6.74 percentage points.

References

1. Boll, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing 27(2) (1979), 113–120.

2. Scalart, Speech enhancement based on a priori signal to noise estimation, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), 1996.

3. Ephraim and Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 33(2) (1985), 443–445.

4. Saleem and Irfan, Noise Reduction Based on Soft Masks by Incorporating SNR Uncertainty in Frequency Domain, Circuits, Systems, and Signal Processing (2017), 1–22.

5. Cohen, Relaxed statistical model for speech enhancement and a priori SNR estimation, IEEE Transactions on Speech and Audio Processing 13(5) (2005), 870–881.

6. Cohen, Speech enhancement using a noncausal a priori SNR estimator, IEEE Signal Processing Letters 11(9) (2004), 725–728.

7. Cohen, Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator, IEEE Signal Processing Letters 9(4) (2002), 113–116.

8. Ephraim and Van Trees H.L., A signal subspace approach for speech enhancement, IEEE Transactions on Speech and Audio Processing 3(4) (1995), 251–266.

9. Hasan M.K., Salahuddin and Khan M.R., A modified a priori SNR for speech enhancement using spectral subtraction rules, IEEE Signal Processing Letters 11(4) (2004), 450–453.

10. Hu H.T. and Yu, Adaptive noise spectral estimation for spectral subtraction speech enhancement, IET Signal Processing 1(3) (2007), 156–163.

11. Mohammadiha, Smaragdis and Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization, IEEE Transactions on Audio, Speech, and Language Processing 21(10) (2013), 2140–2151.

12. Hershey J.R., Rennie S.J., Olsen P.A. and Kristjansson T.T., Super-human multi-talker speech recognition: A graphical modeling approach, Computer Speech & Language 24(1) (2013), 45–66.

13. Reddy A.M. and Raj, Soft mask methods for single-channel speaker separation, IEEE Transactions on Audio, Speech, and Language Processing 15(6) (2007), 1766–1776.

14. Virtanen, Gemmeke J.F. and Raj, Active-set Newton algorithm for overcomplete non-negative representations of audio, IEEE Transactions on Audio, Speech, and Language Processing 21(11) (2013), 2277–2289.

15. Wang and Wang, A structure-preserving training target for supervised speech separation, ICASSP, 2014.

16. Wang, On ideal binary mask as the computational goal of auditory scene analysis, in: Speech Separation by Humans and Machines, (2005), 181–197.

17. Tamura, An analysis of a noise reduction neural network, International Conference on Acoustics, Speech, and Signal Processing, ICASSP-89, 1989.

18. Tamura S.i. and Waibel, Noise reduction using connectionist models, International Conference on Acoustics, Speech, and Signal Processing, ICASSP-88, 1988.

19. Xie and Van Compernolle, A family of MLP based nonlinear spectral estimators for noise reduction, IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-94, 1994.

20. Roman, Wang and Brown G.J., Speech segregation based on sound localization, The Journal of the Acoustical Society of America 114(4) (2003), 2236–2252.

21. Seltzer M.L., Raj and Stern R.M., A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition, Speech Communication 43(4) (2004), 379–393.

22. Jin and Wang, A supervised learning approach to monaural segregation of reverberant speech, IEEE Transactions on Audio, Speech, and Language Processing 17(4) (2009), 625–638.

23. Kim, Lu, Hu and Loizou P.C., An algorithm that improves speech intelligibility in noise for normal-hearing listeners, The Journal of the Acoustical Society of America 126(3) (2009), 1486–1494.

24. Chen and Wang, DNN based mask estimation for supervised speech separation, in: Audio Source Separation, (2018), 207–235.

25. Brungart D.S., Chang P.S., Simpson B.D. and Wang, Isolating the energetic component of speech-on-speech masking with ideal time-frequency segregation, The Journal of the Acoustical Society of America 120(6) (2006), 4007–4018.

26. Li and Loizou P.C., Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction, The Journal of the Acoustical Society of America 123(3) (2008), 1673–1682.

27. Wang, Kjems, Pedersen M.S., Boldt J.B. and Lunner, Speech intelligibility in background noise with ideal binary time-frequency masking, The Journal of the Acoustical Society of America 125(4) (2009), 2336–2347.

28. Saleem, Shafi, Mustafa and Nawaz, A novel binary mask estimation based on spectral subtraction gain-induced distortions for improved speech intelligibility and quality, University of Engineering and Technology Taxila, Technical Journal 20(4) (2015), 36.

29. Kjems, Boldt J.B., Pedersen M.S., Lunner and Wang, Role of mask pattern in intelligibility of ideal binary-masked noisy speech, The Journal of the Acoustical Society of America 126(3) (2009), 1415–1426.

30. Hummersone, Stokes and Brookes, On the ideal ratio mask as the goal of computational auditory scene analysis, in: Blind Source Separation, (2014), 349–368.

31. Narayanan and Wang, Ideal ratio mask estimation using deep neural networks for robust speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.

32. Srinivasan, Roman and Wang, Binary and ratio time-frequency masks for robust speech recognition, Speech Communication 48(11) (2006), 1486–1501.

33. Wang, Narayanan and Wang, On training targets for supervised speech separation, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22(12) (2014), 1849–1858.

34. Spille, Ewert S.D., Kollmeier and Meyer B.T., Predicting speech intelligibility with deep neural networks, Computer Speech & Language 48 (2018), 51–66.

35. Hussain, Siniscalchi S.M., Lee C.-C., Wang S.-S., Tsao and Liao W.-H., Experimental Study on Extreme Learning Machine Applications for Speech Enhancement, IEEE Access 5 (2017), 25542–25554.

36. Wang and Wang, Towards scaling up classification-based speech separation, IEEE Transactions on Audio, Speech, and Language Processing 21(7) (2013), 1381–1390.

37. Hermansky, Perceptual linear predictive (PLP) analysis of speech, The Journal of the Acoustical Society of America 87(4) (1990), 1738–1752.

38. Hermansky and Morgan, RASTA processing of speech, IEEE Transactions on Speech and Audio Processing 2(4) (1994), 578–589.

39. Shao and Wang, Robust speaker identification using auditory features and computational auditory scene analysis, IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2008.

40. Zhao, Shao and Wang, CASA-based robust speaker identification, IEEE Transactions on Audio, Speech, and Language Processing 20(5) (2012), 1608–1616.

41. Chen C.-P. and Bilmes J.A., MVA processing of speech features, IEEE Transactions on Audio, Speech, and Language Processing 15(1) (2007), 257–270.

42. Chen, Wang and Wang, A feature study for classification-based speech separation at low signal-to-noise ratios, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(12) (2014), 1993–2002.

43. De La Torre, Peinado A.M., Segura J.C., Pérez-Córdoba J.L., Benítez M.C. and Rubio A.J., Histogram equalization of speech representation for robust speech recognition, IEEE Transactions on Speech and Audio Processing 13(3) (2005), 355–366.

44. Toda, Black A.W. and Tokuda, Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’05), 2005.

45. Rothauser, IEEE recommended practice for speech quality measurements, IEEE Trans. on Audio and Electroacoustics 17 (1969), 225–246.

46. Hirsch H.-G. and Pearce, The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, ASR-Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop, (2000).

47. Xu, Du, Dai L.-R. and Lee C.-H., A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23(1) (2015), 7–19.

48. Rashmirekha and Mohnaty M.N., Enhancement of speech using deep neural network with discrete cosine transform, Journal of Intelligent & Fuzzy Systems, Preprint, 1–8.

49. Taal C.H., Hendriks R.C., Heusdens and Jensen, An algorithm for intelligibility prediction of time–frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing 19(7) (2011), 2125–2136.

50. Rix A.W., Beerends J.G., Hollier M.P. and Hekstra A.P., Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’01), 2001.

51. Min, Zhang, Zou and Sun, Mask estimate through Itakura-Saito nonnegative RPCA for speech enhancement, IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), 2016.

52. Hu and Loizou P.C., Evaluation of objective quality measures for speech enhancement, IEEE Transactions on Audio, Speech, and Language Processing 16(1) (2008), 229–238.

53. Ciira W.M. and MacLaren Walsh, Joint speech enhancement and speaker identification using approximate Bayesian inference, IEEE Transactions on Audio, Speech, and Language Processing 19(6) (2011), 1517–1529.

54. Reynolds D.A. and Rose R., Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing 3(1) (1995), 72–83.

55. Davis and Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4) (1980), 357–366.

56. Reynolds D.A., Speaker identification and verification using Gaussian mixture speaker models, Speech Communication 17(2) (1995), 91–108.

57. Ming J., Hazen T.J., Glass J.R. and Reynolds D.A., Robust speaker recognition in noisy conditions, IEEE Transactions on Audio, Speech, and Language Processing 15(5), 1711–1723.