Learning time-frequency mask for noisy speech enhancement using gaussian-bernoulli pre-trained deep neural networks

Abstract

Speech enhancement is an important problem in many speech processing applications. Recently, supervised speech enhancement methods that use deep learning to estimate a time-frequency mask have shown remarkable performance gains. In this paper, we propose a time-frequency masking-based supervised speech enhancement method for improving the intelligibility and quality of noisy speech. We believe that a large performance gain can be achieved if deep neural networks (DNNs) are pre-trained layer-wise by stacking Gaussian-Bernoulli Restricted Boltzmann Machines (GB-RBMs). The proposed DNN is called the Gaussian-Bernoulli Deep Belief Network (GB-DBN) and is optimized by minimizing the error between the estimated and pre-defined masks. A non-linear Mel-scale weighted mean square error (L_MW-MSE) loss function is used as the training criterion. We examine the performance of the proposed pre-training scheme using different DNNs built on three time-frequency masks: the ideal amplitude mask (IAM), the ideal ratio mask (IRM), and the phase sensitive mask (PSM). The results in different noisy conditions demonstrate that DNNs pre-trained with the proposed scheme provide a consistent performance gain in terms of perceived speech intelligibility and quality. The proposed pre-training scheme is also effective and robust with noisy training data.

Keywords

Supervised speech enhancement, deep learning, deep belief networks, restricted Boltzmann machine, intelligibility, quality

1 Introduction

The development of robust methods in speech processing has attracted substantial attention. Robust processing addresses the problems raised by actual usage scenarios in order to maintain or improve the performance of various systems. Robustness can be introduced in the system pipeline either by replacing some stages, e.g. robust feature extraction, or by adding a method at a vital step, e.g. feature normalization. Speech enhancement appears as a pre-processing step and acts directly on the speech signal, either to improve speech intelligibility and quality or to feed a speech processing system, e.g. an automatic speech recognition (ASR) system. The aim is to design a robust speech enhancement method that performs well for both purposes; however, a simultaneous performance gain is a challenging task [1].

Speech enhancement has been the goal of research efforts for several decades. Substantial advances in understanding the acoustic distortion of speech signals have been made during this period, and many useful proposals have been presented for more effective enhancement methods. The problems and associated solutions differ depending on whether single-channel or multi-channel signals are considered. Single-channel speech enhancement has been widely explored, and its state-of-the-art methods are divided into two major classes: supervised and unsupervised speech enhancement. Unsupervised speech enhancement has usually been performed using statistical methods, which established several spectral filtering approaches such as spectral subtraction [2–4] and Wiener filtering [5, 6]. In addition, estimators of the clean speech signal, such as the Minimum Mean-Square Error (MMSE) estimator [7] and the log-spectral amplitude estimator [8], have been the motivation for various speech estimation methods, e.g. Minima Controlled Recursive Averaging (MCRA) [9], Multiplicatively-Modified Log-Spectral Amplitude (MM-LSA) [10], and Optimally-Modified Log-Spectral Amplitude (OM-LSA) [11]. Although the statistical methods were the state of the art for many years, they have a few main shortcomings that affect their performance, especially with non-stationary noise sources. Their formulations are based on unrealistic assumptions, such as uncorrelated spectral coefficients in speech frames, whereas the spectral coefficients are actually correlated across time and frequency [12]. These methods also rely on running estimates of the noise and clean speech variances, and such estimates are usually poor for non-stationary noisy speech.

Alternatively, learning approaches such as Gaussian Mixture Models (GMMs) [13, 14], Support Vector Machines (SVMs) [15], Non-negative Matrix Factorization (NMF) [16–18], and neural networks [19] have been developed and examined for speech enhancement. In recent years, speech enhancement has been treated as a supervised learning problem, originally motivated by time-frequency (T-F) masking in Computational Auditory Scene Analysis (CASA). In such methods, the trained learning machine either directly estimates the underlying clean speech or estimates a T-F mask, such as the ideal binary mask (IBM) or the IRM, which is applied to the T-F representation of the contaminated speech to obtain an estimate of the enhanced speech [20]. Data-driven methods offer a convenient way to capture the complex mechanism of acoustic speech distortion. Recently, many DNN frameworks have been developed with encouraging outcomes. From autoencoders to feed-forward DNNs, many frameworks have been evaluated on the speech enhancement task [21–26]. DNN-based enhancement involves three primary components: the learning algorithm, the training target, and the complementary acoustic features. Accordingly, DNN-based supervised speech enhancement methods can be categorized into two classes: (i) masking-based and (ii) mapping-based enhancement methods. This study deals with the masking-based enhancement method.

In the recent past, supervised speech enhancement methods using deep learning frameworks have achieved large performance gains and outperformed conventional speech enhancement methods based on signal processing techniques [27]. Wang and Wang [15] first suggested and implemented a deep neural network for binary classification. Their scheme notably outperformed earlier methods by using feed-forward deep networks with RBM pre-training for subband classification to estimate the IBM [15]. It has been demonstrated that a DNN trained on the IRM achieves even better results than one trained on the IBM and considerably improves speech quality and predicted speech intelligibility [28]. A unified Stochastic Gradient Descent (SGD) and Markov Chain Monte Carlo (MCMC) method has been proposed for RBM pre-training and demonstrated better outcomes [29]. Bayesian estimation methods have also been applied to RBM pre-training [30]. Alternative variants of the RBM, such as the Recurrent Temporal RBM [31], conditional RBM, Gaussian RBM, pointwise gated RBM, and cardinality RBM, have been developed by modifying the regular RBM framework. RBMs and their deep frameworks have been applied in several applications, e.g., feature learning and classification. A comprehensive overview of RBMs and their deep structures can be found in [32]. A Deep Auto-Encoder (DAE) has been proposed to enhance noisy speech by learning a mapping from the Mel-frequency power spectra of noisy speech samples [33]. A non-linear noise-aware regression DNN has been proposed for speech enhancement; it is based on spectral mapping and uses the log power spectra of noisy speech signals for better generalization to unseen noise [21]. Masking-based speech enhancement methods outperformed mapping-based methods; however, a large performance deterioration can occur because of the mismatch between noisy and clean speech. A masking-based deep speech enhancement method is proposed in [34], where two separate restoration layers are integrated to address this mismatch. In addition, a DNN-based multi-target approach has been proposed to estimate the target speech and the interfering noise; this dual-output method showed improved perceived speech quality [35]. Furthermore, a number of speech enhancement methods in the literature model the sources jointly in mixtures using a deep RNN framework [37, 38]. In [38], an RNN-based speech enhancement method is proposed that exploits the recurrent temporal RBM to explore the temporal correlation between speech frames. The idea has been extended by transforming the features of the input and output signals into elemental feature spaces. The proposed network is fine-tuned by a jointly optimized deep RNN with an additional masking layer that enforces a restoration constraint.

In this paper, we study the theoretical background of the learning algorithms used to train deep neural networks in order to formulate T-F masking-based speech enhancement. The essence of deep learning is the adaptation of the number of hidden layers so that the network can learn from the input speech features [39, 40]. In contrast to shallow neural networks (SNNs), if DNNs are trained directly with standard backpropagation, the errors propagate backward through the network and the gradient becomes vanishingly small, which impairs the weight updates in the earlier layers of a very deep network. This vanishing-gradient problem is recognized as one of the core challenges in deep network research. To address it, a robust multi-layer generative framework called the Deep Belief Network (DBN) has been proposed, in which the DNN is pre-trained layer-wise with multiple stacked RBMs [41]. After pre-training, backpropagation is applied. The present work differs from the previous studies. We formulate and examine three different T-F masking-based DNNs for single-channel supervised speech enhancement. A layer-wise pre-training scheme for DNNs is proposed by stacking GB-RBMs, which are found to handle noisy training data more efficiently than the typical RBM. The aim of this work is not to develop a state-of-the-art system, but rather to examine GB-RBM-based deep learning and to compare the performance of the GB-DBN with other speech enhancement approaches that use the regular RBM. Additional improvements can be obtained with more complex and robust deep networks. The main contributions of this study are summarized below:

A layer-wise pre-training scheme is proposed by stacking GB-RBMs, which are found to handle noisy data more efficiently than the typical RBM. Instead of binary values in the visible layer, the GB-RBM accepts real-valued data by replacing the binary visible units with real-valued Gaussian units, while the hidden layers remain binary. The parameters are initialized by layer-wise unsupervised pre-training using GB-RBMs. The output layer is then combined with the pre-trained network, and the parameters of the DNNs are fine-tuned through adaptive gradient descent and backpropagation. In the experiments, DNNs pre-trained with GB-RBMs outperformed DNNs that were randomly initialized or pre-trained with regular RBMs for speech enhancement.

Conventional DNN-based speech enhancement usually uses the mean square error (MSE) as the loss function. Since the MSE measures errors on a linear frequency scale, the error measurements are not aligned with human auditory perception. Here, DNNs are trained by applying Mel-frequency scaled gradients, which are more sensitive to the perceptually vital bands. The proposed GB-DBN is optimized by minimizing the error using the L_MW-MSE loss function. In the experiments, improved speech quality is obtained when the L_MW-MSE loss function is used.

The performance of the proposed pre-training scheme is examined using different DNNs built on the IRM, IAM, and PSM.

The remainder of the paper is organized as follows. Section 2 presents the design methodology of the proposed GB-DBN-based speech enhancement, with a detailed description of the T-F masks, feature extraction, DNN architectures, and loss function. Section 3 presents the experiments. Section 4 presents the results and discussion. Finally, concluding remarks are drawn in Section 5.

2 Proposed deep belief network with GB-RBM

We propose a supervised speech enhancement method that estimates T-F masks from input noisy speech features. For robust learning of the input acoustic features, the DNNs are formulated with a pre-trained GB-DBN, constructed by layer-wise stacking of multiple GB-RBMs. The input noisy speech y(n) is transformed into the Short-Time Fourier Transform (STFT) domain by computing the DFT of overlapping windowed frames. Time-frequency masks are estimated from the noisy acoustic features, and the estimated mask is applied to the squared spectral magnitude of the noisy speech, given as: $|\hat{S}(k, f)|^{2} = \max(\hat{M}(k, f), G_{\min}) \otimes |Y(k, f)|^{2}$ (1)

Where $\hat{M}(k, f)$ and $|\hat{S}(k, f)|^{2}$ denote the estimated mask and the estimated squared spectral magnitude of the output speech, respectively. The minimum gain G_min is a vital parameter used to reduce artifacts in the enhanced speech signal. The term |Y(k,f)|² is the squared spectral magnitude of the noisy signal y(n) at time-frame k and frequency bin f. Figure 1 depicts the proposed GB-DBN speech enhancement method.

Fig.1

Block diagram of Deep Speech Enhancement framework.
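The masking step of Eq. (1) can be illustrated with a minimal NumPy sketch. The function name `apply_mask` and the linear floor value in the usage line are hypothetical and chosen only for illustration; the element-wise product stands in for the ⊗ operator.

```python
import numpy as np

def apply_mask(noisy_power, mask_est, g_min):
    """Eq. (1): floor the estimated mask at g_min, then scale |Y|^2 element-wise.

    g_min is given here as a linear gain (assumption for this sketch).
    """
    gain = np.maximum(mask_est, g_min)
    return gain * noisy_power

# Usage with toy values: two frames, two frequency bins.
noisy_power = np.array([[4.0, 1.0], [9.0, 16.0]])
mask_est = np.array([[0.0, 0.5], [1.0, 0.05]])
enhanced_power = apply_mask(noisy_power, mask_est, g_min=0.1)
```

Mask values below the floor are raised to `g_min`, so no T-F unit is attenuated completely; this is the mechanism by which the minimum gain limits artifacts.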

The RBM is an energy-based model that is convenient for representing latent features that cannot be observed directly but are present in the data. The Bernoulli-Bernoulli RBM (BB-RBM) was introduced first; it defines a distribution over binary-valued visible variables $v \in \mathbb{B}^{V}$ and binary-valued hidden variables $h \in \mathbb{B}^{H}$ with undirected real-valued connection weights $W \in \mathbb{R}^{V \times H}$, as shown in Fig. 2(a), where V and H are the numbers of visible and hidden units, respectively. Instead of using binary values in the visible layer, RBMs have been extended to deal with real-valued data; this variant is known as the GB-RBM and is shown in Fig. 2(b). A GB-RBM with a visible layer and L hidden layers is parameterized by the weights W of the connections between the visible layer and the first hidden layer, the weights W^(l) between layers l and l + 1, the biases b of the visible layer, the biases b^(l) of each hidden layer l, and the standard deviations σ_i of the visible neurons. For a state $[v^{T}, h^{(1)T}, \ldots, h^{(L)T}]$ the energy is:

$\begin{matrix} E(v, h^{(1)}, \ldots, h^{(L)} | \lambda) = \sum_{i=1}^{N_{v}} \frac{(v_{i} - b_{i})^{2}}{2\sigma_{i}^{2}} - \sum_{i=1}^{N_{v}} \sum_{j=1}^{N_{1}} \frac{v_{i}}{\sigma_{i}^{2}} h_{j}^{(1)} w_{ij} - \\ \sum_{l=1}^{L} \sum_{j=1}^{N_{l}} b_{j}^{(l)} h_{j}^{(l)} - \sum_{l=1}^{L-1} \sum_{j=1}^{N_{l}} \sum_{k=1}^{N_{l+1}} h_{j}^{(l)} h_{k}^{(l+1)} w_{jk}^{(l)} \end{matrix}$ (2)

Fig.2

Schematics of (a) BB-RBM, and (b) GB-RBM.

The corresponding probability is given by: $P(v, h^{(1)}, \ldots, h^{(L)} | \lambda) = \frac{1}{Z(\lambda)} \exp(-E(v, h^{(1)}, \ldots, h^{(L)} | \lambda))$ (3)

Where N_v and N_l are the numbers of neurons in the visible layer and the l-th hidden layer, and Z(λ) is the normalization factor. Following [42], we use the GB-RBM parameterization that learns z_i = log σ_i instead of σ_i. Within a layer, the states of the neurons are independent of each other given the adjacent layers. The conditional probabilities of the hidden neurons are given as:

$\begin{matrix} P(h_{j}^{(1)} | v, h^{(2)}, \lambda) = \\ Sig(\sum_{i=1}^{N_{v}} \frac{v_{i}}{\sigma_{i}^{2}} w_{ij} + \sum_{k=1}^{N_{2}} h_{k}^{(2)} w_{jk}^{(1)} + b_{j}^{(1)}) \end{matrix}$ (5)

$\begin{matrix} P(h_{j}^{(l)} | h^{(l-1)}, h^{(l+1)}, \lambda) = \\ Sig(\sum_{i=1}^{N_{l-1}} h_{i}^{(l-1)} w_{ij}^{(l-1)} + \sum_{k=1}^{N_{l+1}} h_{k}^{(l+1)} w_{jk}^{(l)} + b_{j}^{(l)}) \end{matrix}$ (6)

Where Sig(·) denotes the sigmoid function and $N_{L+1} = 0$. For given training data, the aim of GB-RBM pre-training is to identify the parameters λ that maximize the probability of the training data. To identify the optimal λ, the stochastic gradient method is applied as: $\frac{\partial \eta}{\partial \lambda} \propto \left\langle \frac{\partial (-E(v^{(T)}, h | \lambda))}{\partial \lambda} \right\rangle_{d} - \left\langle \frac{\partial (-E(v, h | \lambda))}{\partial \lambda} \right\rangle_{m}$ (7)

Where $\langle \cdot \rangle_{d}$ and $\langle \cdot \rangle_{m}$ denote expectations over the data distribution $P(h | \{v^{(T)}\}, \lambda)$ and the model distribution $P(v, h | \lambda)$, respectively, and $\{v^{(T)}\}$ denotes the training samples. The GB-RBM handles real-valued data, which cannot be processed by the regular RBM because that model is designed for binary data. To this end, the GB-RBM uses Gaussian-distributed visible neurons and Bernoulli-distributed hidden neurons, whereas the BB-RBM uses binary values in both the hidden and visible neurons.
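As an illustration of the energy function and the hidden-unit conditional, the following NumPy sketch covers the special case of a single hidden layer (L = 1). It assumes the usual quadratic visible term (v_i - b_i)² / (2σ_i²); the function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gbrbm_energy(v, h, W, b_vis, b_hid, sigma):
    """Energy of a GB-RBM with one hidden layer (L = 1).

    Assumes the quadratic visible term (v_i - b_i)^2 / (2 sigma_i^2).
    """
    quadratic = np.sum((v - b_vis) ** 2 / (2.0 * sigma ** 2))
    interaction = (v / sigma ** 2) @ W @ h
    return quadratic - interaction - b_hid @ h

def hidden_prob(v, W, b_hid, sigma):
    """Probability that each hidden unit is on, given v (no second hidden layer)."""
    return sigmoid((v / sigma ** 2) @ W + b_hid)

# Usage with random toy parameters: 4 visible units, 3 hidden units.
rng = np.random.default_rng(0)
v = rng.normal(size=4)
W = rng.normal(size=(4, 3))
b_vis = rng.normal(size=4)
b_hid = rng.normal(size=3)
sigma = np.array([0.5, 1.0, 1.5, 2.0])
p_hidden = hidden_prob(v, W, b_hid, sigma)
```

Because units in the same layer are not connected, the probability of switching on one hidden unit equals the sigmoid of the energy drop caused by that switch, which gives a quick consistency check between the two functions.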

2.1 Time-frequency masks

Estimating a time-frequency mask is an important step towards estimating the magnitude spectrum of the clean speech signal. We trained different DNNs, formulated on three time-frequency masks: the IRM, IAM, and PSM.

2.1.1 Ideal ratio mask (IRM)

The IRM represents a soft version of the IBM and can be defined on the cochleagram or spectrogram [28] time-frequency representation of the noisy speech: $M_{S}^{IRM}(k, f) = \left(\frac{|S(k, f)|^{2}}{|S(k, f)|^{2} + |E(k, f)|^{2}}\right)^{\beta}$ (8)

Where |S(k,f)|² and |E(k,f)|² denote the squared magnitude spectra of the underlying clean speech signal and the noise, respectively, and β is a tuning parameter fixed to β=0.5. The IRM is a common time-frequency mask for speech enhancement methods [43–45] and is bounded by $0 ⩽ M_{S}^{IRM} (k, f) ⩽ 1$ for all time-frequency units.

2.1.2 Ideal amplitude mask (IAM)

Another useful time-frequency mask is IAM, also known as FFT-Mask [20] and is defined as: $M_{S}^{IAM} (k, f) = \frac{| S (k, f) |}{| Y (k, f) |}$ (9)

Using the IAM, the exact |S(k,f)| can be recovered from the magnitude spectrum of the noisy speech |Y(k,f)|. The IAM satisfies $0 ⩽ M_{S}^{IAM} (k, f) ⩽ \infty$, but we found that most T-F units satisfy $0 ⩽ M_{S}^{IAM} (k, f) ⩽ 1$.

2.1.3 Phase sensitive mask (PSM)

The IRM and IAM do not account for the phase difference between the clean and noisy speech signals. The PSM [46], however, takes this difference into account and is defined as: $M_{S}^{PSM}(k, f) = \frac{|S(k, f)| \cos(φ_{Y}(k, f) - φ_{S}(k, f))}{|Y(k, f)|}$ (10)

Where φ_Y and φ_S denote the phases of the noisy and underlying clean speech, respectively. Although the PSM is theoretically unbounded, we found that the majority of PSM values lie in the range $0 ⩽ M_{S}^{PSM} (k, f) ⩽ 1$.
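The three training targets of Eqs. (8) to (10) can be computed side by side, as in the following minimal NumPy sketch. It assumes additive mixing in the STFT domain (Y = S + E) and adds a small constant `eps` to avoid division by zero; both choices, and the function name, are assumptions of this sketch.

```python
import numpy as np

def ideal_masks(S, E, beta=0.5, eps=1e-12):
    """Return (IRM, IAM, PSM) from complex STFTs of clean speech S and noise E."""
    Y = S + E  # assumption: the noisy STFT is the sum of speech and noise
    s_pow = np.abs(S) ** 2
    e_pow = np.abs(E) ** 2
    irm = (s_pow / (s_pow + e_pow + eps)) ** beta                 # Eq. (8)
    iam = np.abs(S) / (np.abs(Y) + eps)                           # Eq. (9)
    psm = iam * np.cos(np.angle(Y) - np.angle(S))                 # Eq. (10)
    return irm, iam, psm

# Usage with random complex spectra: 5 frames, 8 frequency bins.
rng = np.random.default_rng(1)
S = rng.normal(size=(5, 8)) + 1j * rng.normal(size=(5, 8))
E = rng.normal(size=(5, 8)) + 1j * rng.normal(size=(5, 8))
irm, iam, psm = ideal_masks(S, E)
```

Note that the PSM equals the real part of the complex ratio S/Y, which is why it can become negative or exceed 1 when speech and noise are strongly out of phase.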

2.2 Acoustic feature extraction

Sets of complementary acoustic features are extracted from the input speech at the frame level. The acoustic features include 13-D relative spectral transformed perceptual linear prediction coefficients (RASTA-PLP), 31-D Mel-frequency cepstral coefficients (MFCC), 15-D amplitude modulation spectrogram (AMS), and 64-D Gammatone filter energies (GFE). All acoustic features are concatenated with their corresponding delta (Δ) and double-delta (ΔΔ) features and appended to the raw features. Finally, a set of 1845-D acoustic features is obtained, which is used in the training and enhancement phases. Additionally, an autoregressive moving average (ARMA) filter is used to smooth the acoustic feature trajectories [47]. All feature vectors are normalized to zero mean and unit variance.
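The feature dimensions can be checked with a short worked calculation. The four feature sets give 123 static coefficients per frame, or 369 with Δ and ΔΔ. One reading that reproduces the stated 1845-D size is a stack of 5 such frames (1845 = 5 × 369); this is an assumption of the sketch, since the text does not spell out the stacking. The `normalize` helper is likewise hypothetical.

```python
import numpy as np

BASE_DIMS = {"RASTA-PLP": 13, "MFCC": 31, "AMS": 15, "GFE": 64}

base = sum(BASE_DIMS.values())        # static coefficients per frame
with_deltas = 3 * base                # static + delta + double-delta
context_frames = 1845 // with_deltas  # frames needed to reach 1845-D (assumed stacking)

def normalize(features, eps=1e-8):
    """Zero-mean, unit-variance normalization per feature dimension (rows are frames)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Usage: normalize a toy feature matrix of 200 frames.
rng = np.random.default_rng(2)
toy_features = rng.normal(loc=3.0, scale=2.0, size=(200, with_deltas))
toy_normalized = normalize(toy_features)
```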

2.3 Network architecture

The selected parameter settings for DNN training are as follows. The deep network is optimized using the non-linear Mel-scale weighted MSE criterion. Each DNN has three hidden layers with 1024 neurons each. The rectified linear unit (ReLU) is used as the activation function in the hidden layers, since ReLU allows efficient optimization with quick learning [48]. The output layer uses the sigmoid activation function. The number of epochs for backpropagation training is set to 50. Moreover, dropout regularization is used to avoid overfitting, with a dropout rate of 0.2 and a batch size of 128. The scaling factor for adaptive stochastic gradient descent is set to 0.0010, and the learning rate is decreased linearly from 0.06 to 0.002. The momentum term is set to 0.4 for the initial few epochs and increased to 0.8 for the remaining epochs. The training process is illustrated in Fig. 3.

Fig.3

DNN training architecture to estimate IRM, IAM and PSM.
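The architecture and schedules described in this section can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the output size, the random initialization scale, the inverted-dropout formulation, and the `warmup_epochs` value are all hypothetical, since the text gives only "initial few epochs" and does not state the number of output bins.

```python
import numpy as np

def init_params(sizes, rng):
    """Small random weights and zero biases; sizes e.g. [1845, 1024, 1024, 1024, n_bins]."""
    return [(rng.normal(0.0, 0.01, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params, rng=None, dropout=0.2):
    """ReLU hidden layers and a sigmoid output; dropout is active only when rng is given."""
    a = x
    for W, b in params[:-1]:
        a = np.maximum(a @ W + b, 0.0)
        if rng is not None:
            keep = rng.random(a.shape) >= dropout
            a = a * keep / (1.0 - dropout)  # inverted dropout (assumption)
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(a @ W + b)))

def learning_rate(epoch, n_epochs=50, start=0.06, end=0.002):
    """Linear decay from start to end over the training epochs."""
    return start + (end - start) * epoch / (n_epochs - 1)

def momentum(epoch, warmup_epochs=5, low=0.4, high=0.8):
    """warmup_epochs is hypothetical; the text only says 'initial few epochs'."""
    return low if epoch < warmup_epochs else high

# Usage with a reduced toy network (20 inputs, three hidden layers of 8, 5 outputs).
rng = np.random.default_rng(3)
toy_params = init_params([20, 8, 8, 8, 5], rng)
toy_mask = forward(rng.normal(size=(4, 20)), toy_params)
```

The sigmoid output keeps every estimated mask value in (0, 1), which matches the range in which most IRM, IAM, and PSM values are reported to fall.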

2.4 Training criterion

In the conventional training stage, the MSE between the estimated and the target mask is minimized, which is given by: $L_{MSE} = \frac{1}{KF} \sum_{k=1}^{K} \sum_{f=1}^{F} |\hat{M}(k, f) - M(k, f)|^{2}$ (11)

In the proposed method, L_MSE in (11) is replaced with L_MW-MSE, which follows the Mel-frequency scale in order to weight the errors from different frequency bins selectively. Since the MSE measures errors on a linear frequency scale, the error measurements are not aligned with human auditory perception. Here, DNNs are trained by applying Mel-frequency scaled gradients, which are more sensitive to the perceptually vital bands. The Mel-frequency f_Mel is given as: $f_{Mel} = 2595 \log_{10}(1 + \frac{f}{700})$ (12)

Where f is the linear frequency. The significance of each spectral coefficient can be measured by the derivative of f_Mel at the corresponding frequency f_C, such that: $d(f_{C}) = \min\left(\left.\frac{d f_{Mel}}{d f}\right|_{f = f_{C}}, ξ\right)$ (13)

Where ξ is a constant used to set the minimum value of the weights. L_MW-MSE is determined by multiplying the normalized weights W(f) with the elements of L_MSE as follows: $L_{MW-MSE} = \frac{1}{KF} \sum_{k=1}^{K} \sum_{f=1}^{F} W(f) |\hat{M}(k, f) - M(k, f)|^{2}$ (14)

Where W(f) is given as: $W(f) = \frac{d(f_{C})}{\sum_{f=1}^{F} d(f_{C})}$ (15)

Compared to L_MSE, L_MW-MSE emphasizes the errors at frequencies that are more important for the human auditory system.
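The weighting of Eqs. (12) to (15) can be sketched in NumPy as below. The derivative of Eq. (12) is 2595 / (ln 10 · (700 + f)). The clipping step applies `min` exactly as Eq. (13) is printed; if ξ is intended as a floor on the weights, `np.maximum` would be the corresponding call. The bin frequencies and the value of ξ in the usage lines are hypothetical.

```python
import numpy as np

def mel_weights(freqs_hz, xi):
    """Normalized weights W(f) of Eq. (15), built from the Mel-scale derivative."""
    # Derivative of f_Mel = 2595 * log10(1 + f / 700) with respect to f
    deriv = 2595.0 / (np.log(10.0) * (700.0 + freqs_hz))
    d = np.minimum(deriv, xi)  # clipping as written in Eq. (13)
    return d / d.sum()

def mw_mse(mask_est, mask_ref, weights):
    """Eq. (14): squared errors weighted along the frequency axis (last axis)."""
    K, F = mask_est.shape
    return np.sum(weights * np.abs(mask_est - mask_ref) ** 2) / (K * F)

# Usage: 161 bins from 0 to 8 kHz (assumed grid), xi = 1.0 (hypothetical value).
freqs = np.linspace(0.0, 8000.0, 161)
weights = mel_weights(freqs, xi=1.0)
loss = mw_mse(np.full((3, 161), 0.5), np.zeros((3, 161)), weights)
```

Since the Mel derivative decreases with frequency, the resulting weights decrease monotonically from low to high bins, so errors in the low-frequency region contribute more to the loss.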

2.5 Generative training of deep network

Random initialization of the parameters usually underperforms during optimization. To circumvent this problem, an unsupervised pre-training scheme using RBMs is used to initialize the network parameters. The main benefit of pre-training is this initialization: during fine-tuning (back-propagation), less over-fitting occurs and the network is less likely to get stuck in poor local optima. The stacked RBMs form a generative graphical model composed of multiple layers of latent variables (hidden units), with connections between the layers but not between units within each layer. We adopted the greedy layer-wise pre-training approach [49]. It works as follows: first, the parameters of a hidden layer are trained locally with the unsupervised GB-RBM learning method; second, the trained layer is used as a feature extractor for the next layer. Each new layer stacked on top of the GB-RBM models the output of the previous layer and aims at extracting higher-level dependencies between the original input variables, thereby improving the ability of the network to capture the underlying regularities in the data. The greedy layer-wise training of a DBN with 3 hidden layers is depicted in Fig. 4. The GB-DBN in Fig. 4 contains a layer of visible units v, layers of hidden units [h⁽¹⁾, h⁽²⁾, h⁽³⁾], weights [W⁽¹⁾, W⁽²⁾, W⁽³⁾], and biases [b⁽¹⁾, b⁽²⁾, b⁽³⁾]. Hence, a DBN with three hidden layers is formed by training three GB-RBMs independently.

Fig.4

Flow diagram of Greedy layer-wise training scheme to construct GB-DBN in unsupervised approach.
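The greedy stacking procedure can be sketched as follows. This is a simplified illustration, not the training recipe of this work: it approximates the model expectation with one-step contrastive divergence (CD-1), fixes σ_i = 1 (plausible only because the features are normalized to unit variance), uses full-batch updates, and treats every layer with the same Gaussian-visible update. All function names and hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, rng, epochs=5, lr=0.01):
    """One RBM with Gaussian visible units (sigma fixed to 1), trained with CD-1."""
    n_samples, n_visible = data.shape
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities and samples given the data
        h0 = sigmoid(data @ W + b_hid)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        # Negative phase: mean reconstruction, then hidden probabilities again
        v1 = h0_sample @ W.T + b_vis
        h1 = sigmoid(v1 @ W + b_hid)
        # Data statistics minus reconstruction statistics
        W += lr * (data.T @ h0 - v1.T @ h1) / n_samples
        b_vis += lr * (data - v1).mean(axis=0)
        b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, hidden_sizes, rng):
    """Greedy stacking: each RBM is trained on the activations of the layer below."""
    layers, x = [], data
    for n_hidden in hidden_sizes:
        W, b_hid = train_rbm_cd1(x, n_hidden, rng)
        layers.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)  # features passed up to the next RBM
    return layers

# Usage on toy data: 32 samples, 10 features, three hidden layers.
rng = np.random.default_rng(4)
toy_data = rng.normal(size=(32, 10))
stack = pretrain_stack(toy_data, [8, 6, 4], rng)
```

The returned weights and hidden biases would then serve as the initial values of the corresponding DNN layers before supervised fine-tuning.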

2.6 Supervised fine-tuning

After pre-training, the parameters of the hidden layers are initialized. An output layer is then added to the trained DBN structure, which yields a fully pre-trained DNN. The complete structure is then fine-tuned by back-propagation with the training data. The Adam optimizer (adaptive moment estimation) [50] is used for fine-tuning, which provides the advantages of stochastic gradient descent with momentum. Additionally, batch normalization [51] is applied to improve the robustness of the DNN.

3 Experimental setting

To assess the performance of the proposed method, clean speech utterances are taken from the GRID corpus [52] and the IEEE corpus [53]. The GRID corpus contains 1000 speech utterances spoken by 34 different speakers (18 male and 16 female). The IEEE corpus contains 720 phonetically balanced utterances. To generate noisy stimuli for DNN training, we used 12 different noises from the AURORA database [54]: airport, babble, buccaneer, coffee-shop, destroyer-engine, destroyer-ops, f16, factory, hf-channel, pink, volvo, and white noise. The spectrograms of all the noise sources used for training are depicted in Fig. 5. The speech utterances and noises are re-sampled to 16 kHz. Acoustic features are extracted with a frame length of 20 ms and a frame shift of 10 ms. All clean utterances are mixed with the aforesaid noises at SNR levels ranging from -5 dB to +5 dB in 5 dB steps.

Fig.5

Spectrograms of 12 noise sources used to generate noisy training set.
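Mixing clean speech with noise at a prescribed SNR can be sketched as below. The helper `mix_at_snr` is hypothetical; it assumes the noise recording is at least as long as the utterance and defines the SNR over the whole signal.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add."""
    noise = noise[: len(speech)]  # assumption: noise is at least as long as speech
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Usage: one second of toy signals at 16 kHz, mixed at the three training SNRs.
rng = np.random.default_rng(5)
toy_speech = rng.normal(size=16000)
toy_noise = rng.normal(size=20000)
mixtures = {snr: mix_at_snr(toy_speech, toy_noise, snr) for snr in (-5, 0, 5)}
```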

3.1 Model training

According to the time-frequency mask, the training criterion, and the pre-training approach, we created various DNNs, listed in Table 1. All DNNs are named with the pattern DNN-<time-frequency mask>-<pre-training>. For example, DNN-IAM-RBM denotes a deep speech enhancement method whose training target is the IAM and whose parameters are pre-trained with the RBM. Similarly, a method that is pre-trained with neither RBM nor GB-RBM is denoted DNN-IAM-Random, meaning that the network parameters are randomly initialized. After pre-training, all DNN structures are trained in a supervised manner using back-propagation with the Adam optimizer. Dropout regularization is used to avoid overfitting [55], with a dropout rate of 0.2 and a batch size of 128. The time-frequency masks are constructed from the DFTs of the noisy and underlying clean speech utterances. The DFTs are computed by segmenting the time-domain speech signal into 20 ms frames with 50% overlap between contiguous frames after applying the Hanning window. G_min is set to 20 dB for all deep enhancement methods. All DNN structures listed in Table 1 are trained with the same training dataset, and the L_MW-MSE cost function is evaluated during training. The cost optimization curves over epochs for the different deep speech enhancement methods with the IAM training target are depicted in Fig. 6.
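The framing used to compute the DFTs (20 ms frames, 50% overlap, Hanning window, 16 kHz sampling) can be sketched in NumPy as follows. The function name is hypothetical, and the sketch drops any trailing samples that do not fill a complete frame, which is an assumption.

```python
import numpy as np

def stft_frames(signal, fs=16000, frame_ms=20, overlap=0.5):
    """Windowed, overlapping frames followed by a one-sided DFT per frame."""
    frame_len = int(fs * frame_ms / 1000)     # samples per frame
    hop = int(frame_len * (1.0 - overlap))    # frame shift in samples
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)        # shape (n_frames, frame_len // 2 + 1)

# Usage: one second of a toy signal.
rng = np.random.default_rng(6)
toy_stft = stft_frames(rng.normal(size=16000))
```

With these settings each frame has 320 samples and the frame shift is 160 samples (10 ms), which agrees with the frame length and frame shift used for feature extraction.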

Table 1
DNN Speech Enhancement Methods with all settings

DNN Speech Enhancement Methods	T-F Masks	Loss Function	Pretraining	Activation (Hidden Layer)	Activation (Output Layer)
DNN-IRM-Random	IRM	L_MW-MSE	No Pretraining	ReLU	Sigmoid
DNN-IAM-Random	IAM	L_MW-MSE	No Pretraining	ReLU	Sigmoid
DNN-PSM-Random	PSM	L_MW-MSE	No Pretraining	ReLU	Sigmoid
DNN-IRM-RBM	IRM	L_MW-MSE	RBM (DBN)	ReLU	Sigmoid
DNN-IAM-RBM	IAM	L_MW-MSE	RBM (DBN)	ReLU	Sigmoid
DNN-PSM-RBM	PSM	L_MW-MSE	RBM (DBN)	ReLU	Sigmoid
DNN-IRM-GB-RBM	IRM	L_MW-MSE	GB-RBM	ReLU	Sigmoid
DNN-IAM-GB-RBM	IAM	L_MW-MSE	GB-RBM	ReLU	Sigmoid
DNN-PSM-GB-RBM	PSM	L_MW-MSE	GB-RBM	ReLU	Sigmoid

Fig.6

Cost optimization curves of different Deep speech enhancement methods.

3.2 Objective evaluation metrics

Four objective evaluation measures are used to examine the performance of the proposed deep speech enhancement methods. Short-time objective intelligibility (STOI) and extended STOI (ESTOI) are used to predict speech intelligibility. Perceptual evaluation of speech quality (PESQ) is used to measure perceived speech quality. Moreover, the segmental SNR (SNRSeg) is used to examine the noise reduction in the enhanced speech. Each metric is briefly described below:

PESQ: The ITU-T P.862 standard objective evaluation metric predicts the perceptual quality of speech by providing an output score from -0.5 to 4.5, where a higher score implies better speech quality. PESQ scores correlate with the mean opinion score (MOS) of subjective listening tests. PESQ [56] was initially developed for narrow-band (8 kHz) telephone speech and was later extended in ITU-T P.862.2 [57] to wide-band (16 kHz) systems. The PESQ extension has been widely used to evaluate speech enhancement methods.

STOI: An objective metric that predicts the intelligibility of speech with an output score in the range 0 to 1, where a higher value implies better intelligibility. The STOI score is based on the correlation between the underlying clean speech and the processed speech in short-time overlapping segments [58].

ESTOI: An objective metric that predicts speech intelligibility with an output score from 0 to 1, where a higher score implies better intelligibility [59].

SNRSeg: The signal-to-noise ratio is one of the earliest metrics used to evaluate speech enhancement methods. However, the SNR correlates poorly with speech quality because it is averaged over the entire signal. Since speech signals are highly non-stationary, rapid variations in the signal can distort the SNR average, and an average over the entire signal can mask vital speech components. To avoid this problem, the SNR is calculated in segments and then averaged, which is known as the segmental SNR (SNRSeg) [1].
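The segment-wise averaging behind SNRSeg can be sketched as below. The non-overlapping 320-sample frames and the clamping of each frame SNR to [-10, 35] dB are common conventions adopted here as assumptions; they are not taken from this work, and the function name is hypothetical.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=320,
                  snr_min=-10.0, snr_max=35.0, eps=1e-10):
    """Mean of per-frame SNRs in dB, each clamped to [snr_min, snr_max]."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        err = s - enhanced[i * frame_len:(i + 1) * frame_len]
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(err ** 2) + eps))
        snrs.append(np.clip(snr, snr_min, snr_max))
    return float(np.mean(snrs))

# Usage: a constant toy signal with a constant error of 0.1 in every sample.
toy_clean = np.ones(3200)
toy_enhanced = toy_clean + 0.1
toy_snrseg = segmental_snr(toy_clean, toy_enhanced)
```

In the toy usage every frame has a signal-to-error power ratio of 100, so each frame SNR, and therefore the average, is 20 dB.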

4 Results and discussions

We first examined the performance of the deep speech enhancement methods in terms of speech intelligibility, using STOI and ESTOI at -5 dB, 0 dB and 5 dB SNR levels. STOI and ESTOI measure the overall intelligibility of the enhanced speech; higher scores indicate better performance. Table 2 shows the STOI and ESTOI scores of all deep speech enhancement methods, averaged over all noise sources and training utterances. It is evident from Table 2 that the proposed training scheme, with every training-target, outperformed the RBM and random training schemes consistently at all SNR levels. The margin is smallest at 5 dB SNR, where all deep speech enhancement methods performed very well. The deep speech enhancement with the IAM training-target and GBRBM (DNN-IAM-GBRBM) provided the highest intelligibility scores (STOI≥77% and ESTOI≥67%) for all noise sources at SNR≥-5 dB. Overall, the best STOI and ESTOI scores are achieved with DNN-IAM-GBRBM: 89.36% and 84.19%, respectively. Similarly, the IRM training-target with GBRBM (DNN-IRM-GBRBM) provided the second best intelligibility scores (STOI≥74.16% and ESTOI≥66.26%) for all noise sources at SNR≥-5 dB. The highest intelligibility scores with DNN-IRM-GBRBM are 84.13% and 82.14%, respectively. DNN-PSM-GBRBM outperformed the RBM and random training schemes; however, it underperformed DNN-IRM-GBRBM and DNN-IAM-GBRBM. The highest intelligibility scores with DNN-PSM-GBRBM are 81.90% and 80.35%, respectively.

For example, the average intelligibility scores are improved from 51.61% and 41.40% with unprocessed noisy speech to 77.32% and 67.16% with DNN-IAM-GBRBM (ΔSTOI=25.71% and ΔESTOI=25.76%) at -5 dB SNR. Similarly, the average scores are improved from 68.25% and 57.74% with DNN-IRM-RBM to 74.16% and 66.26% with DNN-IRM-GBRBM (ΔSTOI=5.91% and ΔESTOI=8.52%) at -5 dB SNR. Finally, the average scores are improved from 60.10% and 49.01% with DNN-PSM-Random to 70.67% and 64.39% with DNN-PSM-GBRBM (ΔSTOI=10.57% and ΔESTOI=15.38%) at -5 dB SNR.
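The quoted improvements are plain differences between Table 2 entries at -5 dB SNR. The short Python sketch below recomputes them as a check; the dictionary layout and the helper name `improvement` are illustrative assumptions, while the numbers are copied from Table 2.

```python
# Average (STOI %, ESTOI %) at -5 dB SNR, copied from Table 2.
scores_minus5_db = {
    "Noisy Unprocessed": (51.61, 41.40),
    "DNN-IRM-RBM": (68.25, 57.74),
    "DNN-IRM-GBRBM": (74.16, 66.26),
    "DNN-IAM-GBRBM": (77.32, 67.16),
    "DNN-PSM-Random": (60.10, 49.01),
    "DNN-PSM-GBRBM": (70.67, 64.39),
}

def improvement(baseline, proposed, table=scores_minus5_db):
    """Return (delta STOI, delta ESTOI) in percentage points, rounded to 2 decimals."""
    base_stoi, base_estoi = table[baseline]
    prop_stoi, prop_estoi = table[proposed]
    return (round(prop_stoi - base_stoi, 2), round(prop_estoi - base_estoi, 2))

for baseline, proposed in [
    ("Noisy Unprocessed", "DNN-IAM-GBRBM"),
    ("DNN-IRM-RBM", "DNN-IRM-GBRBM"),
    ("DNN-PSM-Random", "DNN-PSM-GBRBM"),
]:
    print(baseline, "->", proposed, improvement(baseline, proposed))
```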

Table 2
Average Predicted Objective Scores for all DNN-Speech Enhancement Methods

Deep Speech Enhancement Methods	STOI in %				ESTOI in %
	-5dB	0dB	5dB	Avg	-5dB	0dB	5dB	Avg
Noisy Unprocessed	51.61	67.11	78.90	68.87	41.40	59.67	73.47	58.18
DNN-IRM-Random	62.52	74.10	82.21	73.61	50.15	65.28	77.24	64.22
DNN-IRM-RBM	68.25	75.79	83.31	75.39	57.74	68.45	80.24	68.81
DNN-IRM-GBRBM	74.16	78.34	84.13	78.60	66.26	70.59	82.14	72.99
DNN-IAM-Random	64.27	78.73	86.10	76.36	54.80	67.40	79.25	67.15
DNN-IAM-RBM	70.81	79.48	87.84	79.37	59.38	70.10	82.50	70.66
DNN-IAM-GBRBM	77.32	80.66	89.36	82.44	67.16	72.47	84.19	74.60
DNN-PSM-Random	60.10	71.44	79.31	70.28	49.01	64.75	76.10	63.28
DNN-PSM-RBM	63.56	72.43	80.02	72.03	55.12	65.46	78.20	66.26
DNN-PSM-GBRBM	70.67	75.84	81.90	76.13	64.39	68.80	80.35	71.18

Note: The shaded boxes show the best performance.

We examined the performance of the proposed deep speech enhancement methods in terms of speech quality, using PESQ at -5 dB, 0 dB and 5 dB SNR levels. Table 3 shows the PESQ scores of all speech enhancement methods, averaged over all noise sources and training utterances. It is clear from Table 3 that the proposed training scheme, with every training-target, outperformed the RBM and random training schemes consistently at all SNR levels. The deep speech enhancement with the IAM training-target and GBRBM (DNN-IAM-GBRBM) achieved the best scores (PESQ≥1.99) for all noise sources at SNR≥-5 dB. Overall, the best PESQ score, 2.66, is achieved with DNN-IAM-GBRBM. The IRM training-target with GBRBM (DNN-IRM-GBRBM) achieved the second best scores (PESQ≥1.94) for all noise sources at SNR≥-5 dB; its highest score is 2.48. Again, DNN-PSM-GBRBM outperformed the RBM and random training schemes in terms of PESQ; however, it underperformed DNN-IRM-GBRBM and DNN-IAM-GBRBM. The highest PESQ score with DNN-PSM-GBRBM is 2.41. The average PESQ scores are improved from 1.39 with unprocessed noisy speech to 1.94 with DNN-IRM-GBRBM (ΔPESQ=0.55) at -5 dB SNR. Similarly, the average PESQ scores are improved from 1.95 with DNN-IAM-Random to 2.39 with DNN-IAM-GBRBM (ΔPESQ=0.44) at 0 dB SNR. Lastly, the average PESQ scores are improved from 2.33 with DNN-PSM-RBM to 2.41 with DNN-PSM-GBRBM (ΔPESQ=0.08) at 5 dB SNR.

We also examined the performance of the proposed deep speech enhancement methods in terms of noise reduction in the enhanced speech, using SNRSeg at -5 dB, 0 dB and 5 dB SNRs. Table 3 shows the average SNRSeg results of all deep speech enhancement methods. The average predicted SNRSeg scores with the proposed DNN structures are consistently higher than those of the competing structures for all noise sources and SNR levels, which indicates that the proposed methods efficiently reduced the noise in the enhanced speech.

Figure 7 shows the average STOI, ESTOI, PESQ and SNRSeg improvements achieved with all speech enhancement methods. The proposed DNN structures outperformed the competing structures consistently; the margin is smallest at 5 dB SNR, where all DNN structures performed well. Table 4 compares the performance against conventional speech enhancement methods. The competing methods include Non-negative Matrix Factorization (NMF) [16], Non-negative Dynamical System (NNDS) [60], Non-negative Robust Principal Component Analysis (NRPCA) [61], and Deep Denoising Autoencoder (DDAE) [62]. The results in Table 4 indicate a consistent improvement in all measured parameters. To validate the superiority of the proposed DNN structures at high SNRs, we have conducted a one-way analysis-of-variance (ANOVA) statistical analysis (at 0 dB and 5 dB). All the statistical tests are conducted at the 95% confidence level.

Table 3

Average Predicted Objective Scores for all DNN-Speech Enhancement Methods

Deep Speech Enhancement Methods	PESQ				SNRSeg
	-5dB	0dB	5dB	Avg	-5dB	0dB	5dB	Avg
Noisy Unprocessed	1.39	1.51	1.97	1.62	0.54	1.21	3.36	1.70
DNN-IRM-Random	1.73	1.99	2.28	1.98	0.86	2.53	4.43	2.60
DNN-IRM-RBM	1.82	2.02	2.37	2.07	1.11	2.25	4.55	2.63
DNN-IRM-GBRBM	1.94	2.26	2.48	2.22	1.55	2.87	5.03	3.15
DNN-IAM-Random	1.78	1.95	2.45	2.06	1.05	2.73	4.71	2.83
DNN-IAM-RBM	1.86	2.18	2.54	2.19	1.41	2.84	4.95	3.06
DNN-IAM-GBRBM	1.99	2.39	2.66	2.34	1.68	3.07	5.21	3.32
DNN-PSM-Random	1.68	1.86	2.28	1.77	0.96	2.48	4.26	2.51
DNN-PSM-RBM	1.79	1.99	2.33	2.03	1.00	2.19	4.45	2.54
DNN-PSM-GBRBM	1.88	2.21	2.41	2.16	1.47	2.76	4.94	3.05

Note: The shaded boxes show the best performance.

Fig.7

Average PESQ, STOI, ESTOI and SNRSeg improvements for 12 noise sources with three time-frequency masks (training-targets).

Differences between the achieved scores are deemed statistically significant if the probability (P_value) is lower than 0.05 (P < 0.05) and the F_value is higher than the critical value of the F-distribution (F_value > F_Critical). Table 5 reports the statistical tests at the 95% confidence level, for which F_Critical is 3.09. It is clear from Table 5 that the P_values of all the proposed DNN structures are less than 0.05 and the F_values are higher than 3.09, which indicates the statistical significance of the proposed structures over the competing methods.
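For readers unfamiliar with the test, the F_value of a one-way ANOVA is the ratio of the between-group mean square to the within-group mean square. The following Python sketch computes it from scratch; the function name `one_way_anova_f` and the toy scores are hypothetical and are not the data behind Table 5. The P_value would additionally require the F-distribution, which is omitted here.

```python
def one_way_anova_f(groups):
    """Return (F_value, df_between, df_within) for a list of score lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of each score around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical per-utterance scores for three methods (made-up numbers).
toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_value, df_b, df_w = one_way_anova_f(toy_groups)
print(f_value, df_b, df_w)
```

The computed F_value is then compared with the critical value for the same pair of degrees of freedom; the value 3.09 quoted in the text applies to the degrees of freedom of the reported experiments, not to this toy example.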

Finally, we have conducted a time-varying spectral analysis to evaluate the performance of the deep speech enhancement methods. For this analysis, we mixed a clean speech utterance with babble noise at 0 dB SNR. The spectrograms of the different DNN structures with their training-targets and training schemes are depicted in Fig. 8. It is evident from the spectrograms that the harmonic spectra of the vowels and the formant peaks are sustained. Moreover, the spectrograms of the proposed speech enhancement methods reveal an excellent structure during speech activity. The spectra in the speech-pause areas show that the proposed DNN structures reduce residual noise more effectively, and the weak harmonics in the high-frequency bands are well structured. Therefore, the proposed DNN structures achieved better perceived speech quality. The utterances with weak energies are better preserved, which yields less speech distortion. Hence, the proposed DNN structures achieved improved speech intelligibility. Figure 9 shows the spectrograms of the proposed speech enhancement methods with various training-targets and training schemes for a clean speech utterance mixed with factory noise at -5 dB SNR.

Table 4

Average Predicted Objective Scores for the Proposed and competing methods

Methods	STOI	ESTOI	PESQ	SNRSeg
Noisy	68.87	58.18	1.97	1.69
DNN-Random	73.41	64.88	2.33	1.94
DNN-RBM	75.59	69.57	2.41	2.10
DNN-GBRBM	79.05	72.92	2.52	2.24
NMF [16]	72.03	35.96	1.47	2.05
NNDS [60]	74.50	48.51	1.64	2.14
NRPCA [61]	72.43	43.71	1.63	2.10
DDAE [62]	78.31	54.48	1.76	2.19

Table 5

Statistical Analysis of Average Objective Scores at High SNRs at the 95% confidence level (F_Critical = 3.09, P_Critical = 0.05)

Comparison: Deep Speech Enhancement Methods	STOI				PESQ
	0dB		5dB		0dB		5dB
	P_Value	F_Value	P_Value	F_Value	P_Value	F_Value	P_Value	F_Value
GBRBM-IRM ⟶ Noisy	<0.001	21.3	<0.001	20.7	<0.001	65.7	<0.001	78.4
GBRBM-IRM ⟶ Random	<0.001	33.2	<0.002	39.3	<0.001	58.2	<0.003	46.5
GBRBM-IRM ⟶ RBM	<0.002	26.0	<0.005	31.3	<0.001	20.5	<0.005	18.5
GBRBM-IAM ⟶ Noisy	<0.001	47.1	<0.001	51.2	<0.001	17.1	<0.001	36.8
GBRBM-IAM ⟶ Random	<0.001	50.3	<0.004	48.6	<0.002	19.4	<0.004	46.2
GBRBM-IAM ⟶ RBM	<0.001	45.2	<0.006	40.4	<0.001	21.1	<0.002	33.4
GBRBM-PSM ⟶ Noisy	<0.001	10.3	<0.001	21.1	<0.001	8.7	<0.001	9.5
GBRBM-PSM ⟶ Random	<0.001	16.4	<0.004	19.0	<0.001	14.1	<0.003	16.2
GBRBM-PSM ⟶ RBM	<0.003	21.1	<0.009	18.3	<0.002	18.6	<0.003	19.3

Fig.8

Spectrograms of an utterance processed by the DNN-IAM-GBRBM, DNN-IRM-GBRBM, and DNN-IPSM-GBRBM systems trained with IAM, IRM and IPSM. The input utterance is corrupted by factory noise at -5 dB.

Fig.9

Spectrograms of utterance processed by proposed deep speech enhancement trained by GB-RBM and regular RBM with IAM, IRM and IPSM training targets. The input utterance is corrupted by babble noise at 0 dB.

5 Summary and conclusions

In this study, we have presented time-frequency masking-based supervised deep learning structures for speech enhancement. We have shown that a significant performance gain can be achieved if the DNN structures are layer-wise pre-trained by stacking GB-RBMs. We have estimated three training-targets (IRM, IAM and PSM) with the GBRBM training scheme, and the proposed deep structures are named GB-DBN. All the DNN structures are optimized by minimizing the error using the L_MW-MSE objective cost function. We have observed that a DBN built from regular RBMs is not very effective in supervised speech enhancement, and have extended the concept to present an effective training scheme. In a regular RBM, the visible and hidden units are binary-valued, which constrains how well real-world data can be represented. Secondly, the RBM underperforms when trained on severely noisy data. We have therefore extended the regular RBM to the GB-RBM: instead of using binary values in the visible layer, the GB-RBM deals with real-valued data. The proposed deep structures are evaluated with four objective metrics: STOI, ESTOI, PESQ and SNRSeg. We also conducted a one-way ANOVA analysis to show the statistical significance at high SNRs. Based on the results, the following conclusions are drawn:

The DNNs pre-trained as a GB-DBN with stacked GB-RBMs are optimized better than the DNNs pre-trained with the regular RBM and the DNNs without any pre-training.

It is concluded that, in terms of speech intelligibility, the proposed training scheme with different training-targets outperformed the RBM and random training schemes consistently at all SNR levels. The average intelligibility scores are improved from 51.61% and 41.40% with noisy speech to 77.32% and 67.16% with DNN-IAM-GBRBM (ΔSTOI=25.71% and ΔESTOI=25.76%) at -5 dB SNR. Also, the average intelligibility scores are improved from 68.25% and 57.74% with DNN-IRM-RBM to 74.16% and 66.26% with DNN-IRM-GBRBM (ΔSTOI=5.91% and ΔESTOI=8.52%) at -5 dB SNR. Lastly, the average scores are improved from 60.10% and 49.01% with DNN-PSM-Random to 70.67% and 64.39% with DNN-PSM-GBRBM (ΔSTOI=10.57% and ΔESTOI=15.38%) at -5 dB SNR.

It is concluded that in terms of the speech quality, the proposed training scheme with different training-targets outperformed the RBM and random training schemes consistently at all SNR levels. The average PESQ is improved from 1.39 with noisy speech to 1.94 with DNN-IRM-GBRBM (ΔPESQ=0.55) at -5 dB SNR. Similarly, the average PESQ scores are improved from 1.95 with DNN-IAM-Random to 2.39 with DNN-IAM-GBRBM (ΔPESQ=0.44) at 0 dB SNR. Lastly, the average PESQ scores are improved from 2.33 with DNN-PSM-RBM to 2.41 with DNN-PSM-GBRBM at 5 dB SNR.

We also concluded that the proposed deep speech enhancement methods achieved better noise reduction in the enhanced speech, as measured by SNRSeg at -5 dB, 0 dB and 5 dB SNRs. The average predicted SNRSeg scores with the proposed DNN structures are consistently higher than those of the competing DNN structures.

It is concluded from the one-way ANOVA that the P_values of all the proposed DNN structures are less than 0.05 and the F_values are higher than the critical value of 3.09, which indicates the statistical significance of the proposed structures over the competing methods.

The spectrogram analysis concluded that the harmonic spectrums of the vowels and the formant peaks are sustained. Also, the spectrograms of the proposed methods showed an excellent structure in speech activity areas. The enhanced spectrograms of the proposed method also showed less residual noise during the speech-pause areas. The weak harmonics in high-frequency bands are well structured.

To conclude, the proposed training scheme (GB-RBM) with various training-targets (IRM, IAM and PSM) outperformed the competing schemes and provided high speech quality and intelligibility. In future work, we will attempt to further improve the performance of the proposed deep speech enhancement methods by incorporating phase information. We will also systematically examine the acoustic feature set to find more robust acoustic features for training the masking-based enhancement methods.

References

1.

Loizou, Philipos C. Speech Enhancement: Theory and Practice. CRC Press, 2013.

2.

Boll, Steven, Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing 27(2) (1979), 113–120.

3.

Saleem, Nasir, Sher Ali, Usman Khan and Farman Ullah, Speech enhancement with geometric advent of spectral subtraction using connected time-frequency regions noise estimation, Research Journal of Applied Sciences, Engineering and Technology 6(6) (2013), 1081–1087.

4.

Hu, Hwai-Tsu, Fang-Jang Kuo and Hsin-Jen Wang, Supplementary schemes to spectral subtraction for speech enhancement, Speech Communication 36(3–4) (2002), 205–218.

5.

Scalart, Pascal. “Speech enhancement based on a priori signal to noise estimation.” 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings. Vol. 2. IEEE, 1996.

6.

Xia, Bingyin and Changchun Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification, Speech Communication 60 (2014), 13–29.

7.

Ephraim, Yariv and David Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 32(6) (1984), 1109–1121.

8.

Ephraim, Yariv and David Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 33(2) (1985), 443–445.

9.

Cohen, Israel, Speech enhancement using a noncausal a priori SNR estimator, IEEE Signal Processing Letters 11(9) (2004), 725–728.

10.

Malah, David, Richard V. Cox and Anthony J. Accardi. “Tracking speech-presence uncertainty to improve speech enhancement in non-stationary noise environments.” 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258). Vol. 2. IEEE, 1999.

11.

Cohen, Israel, Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator, IEEE Signal Processing Letters 9(4) (2002), 113–116.

12.

Cohen, Israel and Sharon Gannot. “Spectral enhancement methods.” Springer Handbook of Speech Processing. Springer, Berlin, Heidelberg, 2008, 873–902.

13.

Burshtein, David and Sharon Gannot, Speech enhancement using a mixture-maximum model, IEEE Transactions on Speech and Audio Processing 10(6) (2002), 341–351.

14.

Hao, Jiucang, Te-Won Lee and Terrence J. Sejnowski, Speech enhancement using Gaussian scale mixture models, IEEE Transactions on Audio, Speech, and Language Processing 18(6) (2009), 1127–1136.

15.

Wang, Yuxuan and DeLiang Wang, Towards scaling up classification-based speech separation, IEEE Transactions on Audio, Speech, and Language Processing 21(7) (2013), 1381–1390.

16.

Mohammadiha, Nasser, Paris Smaragdis and Arne Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization, IEEE Transactions on Audio, Speech, and Language Processing 21(10) (2013), 2140–2151.

17.

Saleem, Nasir, Muhammad Irfan Khattak and Muhammad Shafi, Unsupervised speech enhancement in low SNR environments via sparseness and temporal gradient regularization, Applied Acoustics 141 (2018), 333–347.

18.

Chung, Badeau, Plourde and Champagne, Training and compensation of class-conditioned NMF bases for speech enhancement, Neurocomputing 284 (2018), 107–118.

19.

Wang, DeLiang and Jitong Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26(10) (2018), 1702–1726.

20.

Wang, Yuxuan, Arun Narayanan and DeLiang Wang, On training targets for supervised speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(12) (2014), 1849–1858.

21.

Xu, Yong, et al. A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 23(1) (2015), 7–19.

22.

Xu, Yong, et al. An experimental study on speech enhancement based on deep neural networks, IEEE Signal Processing Letters 21(1) (2013), 65–68.

23.

Ram, Rashmirekha and Mihir Narayan Mohanty, Deep neural network based speech enhancement, Cognitive Informatics and Soft Computing. Springer, Singapore (2019), 281–287.

24.

Saleem, Nasir, et al. Deep neural network for supervised single-channel speech enhancement, Archives of Acoustics 44 (2019).

25.

Saleem, Nasir, Muhammad Irfan Khattak and Abdul Baser Qazi, Supervised speech enhancement based on deep neural network, Journal of Intelligent & Fuzzy Systems Preprint (2019), 1–15.

26.

Elbaz, Dan and Michael Zibulevsky. “End to End Deep Neural Network Frequency Demodulation of Speech Signals.” Future of Information and Communication Conference. Springer, Cham, 2018.

27.

Kolbæk, Morten, et al. Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25(1) (2017), 153–167.

28.

Narayanan, Arun and DeLiang Wang. “Ideal ratio mask estimation using deep neural networks for robust speech recognition.” 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.

29.

Fischer, Asja and Christian Igel. “An introduction to restricted Boltzmann machines.” Iberoamerican Congress on Pattern Recognition. Springer, Berlin, Heidelberg, 2012.

30.

Aoyagi, Miki, Learning coefficient in Bayesian estimation of restricted Boltzmann machine, Journal of Algebraic Statistics 4(1) (2013).

31.

Sutskever, Ilya, Geoffrey E. Hinton and Graham W. Taylor. “The recurrent temporal restricted boltzmann machine.” Advances in neural information processing systems, 2009.

32.

Zhang, Nan, et al. An overview on restricted Boltzmann machines, Neurocomputing 275 (2018), 1186–1199.

33.

Lu, Xugang, et al. Speech restoration based on deep learning autoencoder with layer-wised pretraining, Thirteenth Annual Conference of the International Speech Communication Association, 2012.

34.

Chen, Zhuo, et al. “Improving Mask Learning Based Speech Enhancement System with Restoration Layers and Residual Connection.” INTERSPEECH, 2017.

35.

Grais, Emad M., Mehmet Umut Sen and Hakan Erdogan. “Deep neural networks for single channel source separation.” 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

36.

Huang, Po-Sen, et al. Joint optimization of masks and deep recurrent neural networks for monaural source separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23(12) (2015), 2136–2147.

37.

Qazi, Khurram Ashfaq, et al. A hybrid technique for speech segregation and classification using a sophisticated deep neural network, PloS One 13(3) (2018), e0194151.

38.

Samui, Suman, Indrajit Chakrabarti and Soumya K. Ghosh. “Deep Recurrent Neural Network Based Monaural Speech Separation Using Recurrent Temporal Restricted Boltzmann Machines.” INTERSPEECH, 2017.

39.

LeCun, Yann, Yoshua Bengio and Geoffrey Hinton. Deep learning, Nature 521(7553) (2015), 436–444.

40.

Hinton, Geoffrey, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine 29(6) (2012), 82–97.

41.

Hinton, Geoffrey E., Simon Osindero and Yee-Whye Teh, A fast learning algorithm for deep belief nets, Neural Computation 18(7) (2006), 1527–1554.

42.

Cho, KyungHyun, Alexander Ilin and Tapani Raiko. “Improved learning of Gaussian-Bernoulli restricted Boltzmann machines.” International conference on artificial neural networks. Springer, Berlin, Heidelberg, 2011.

43.

Kolbæk, Morten, et al. Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 25(10) (2017), 1901–1913.

44.

Delfarah, Masood and DeLiang Wang, Deep learning for talker-dependent reverberant speaker separation: an empirical study, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27(11) (2019), 1839–1848.

45.

Zhang, Xiao-Lei and DeLiang Wang, A deep ensemble learning method for monaural speech separation, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24(5) (2016), 967–977.

46.

Erdogan, Hakan, et al. “Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks.” 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.

47.

Chen, Jitong, Yuxuan Wang and DeLiang Wang, A feature study for classification-based speech separation at low signal-to-noise ratios, IEEE/ACM Transactions on Audio, Speech, and Language Processing 22(12) (2014), 1993–2002.

48.

Nair, Vinod and Geoffrey E. Hinton. “Rectified linear units improve restricted boltzmann machines.” Proceedings of the 27th international conference on machine learning (ICML-10), 2010.

49.

Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.” Advances in neural information processing systems, 2007.

50.

Kingma, Diederik P. and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).

51.

Ioffe, Sergey and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift.” arXiv preprint arXiv:1502.03167 (2015).

52.

Cooke, Martin, et al. An audio-visual corpus for speech perception and automatic speech recognition, The Journal of the Acoustical Society of America 120(5) (2006), 2421–2424.

53.

Rothauser E.H., IEEE recommended practice for speech quality measurements, IEEE Trans. on Audio and Electroacoustics 17 (1969), 225–246.

54.

Hirsch, Hans-Günter and David Pearce. “The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions.” ASR2000-Automatic Speech Recognition: Challenges for the new Millenium ISCA Tutorial and Research Workshop (ITRW), 2000.

55.

Srivastava, Nitish, et al. Dropout: a simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15(1) (2014), 1929–1958.

56.

Recommendation, ITU-T. “Perceptual evaluation of speech quality (PESQ), An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.” Rec. ITU-T P. 862, (2001).

57.

Rec, I. T. U. T. “P. 862.2: Wideband extension to recommendation P. 862 for the assessment of wideband telephone networks and speech codecs.” International Telecommunication Union, CH–Geneva, (2005).

58.

Taal, Cees H., et al. An algorithm for intelligibility prediction of time–frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing 19(7) (2011), 2125–2136.

59.

Jensen, Jesper and Cees H. Taal, An algorithm for predicting the intelligibility of speech masked by modulated noise maskers, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(11) (2016), 2009–2022.

60.

Févotte, Le Roux and Hershey J.R., Non-negative dynamical system with application to speech and audio. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (May 2013), pp. 3158–3162. IEEE.

61.

Min, Zhang, Zou and Sun, Mask estimate through Itakura-Saito nonnegative RPCA for speech enhancement. In 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC) (September 2016), pp. 1–5. IEEE.

62.

Liu H.P., Tsao and Fuh C.S., Bone-conducted speech enhancement using deep denoising autoencoder, Speech Communication 104 (2018), 106–112.