Speech enhancement algorithm of improved OMLSA based on bilateral spectrogram filtering

Abstract

In this paper, a bilateral spectrogram filtering (BSF)-based optimally modified log-spectral amplitude (OMLSA) estimator for single-channel speech enhancement is proposed. It can significantly improve the performance of OMLSA, especially in highly non-stationary noise environments, by using bilateral filtering (BF), a widely used technique in image and visual processing, to preprocess the spectrogram of the noisy speech. When the speech spectrogram is treated as an image, BSF can sharpen details and remove unwanted textures or background noise while preserving edges. The a posteriori signal-to-noise ratio (SNR) of the OMLSA algorithm is then estimated after applying BSF to the noisy speech. In addition, to reduce the computational cost, a fast and accurate BF is adopted, which reduces the complexity to O(1) for each time-frequency bin. Finally, the proposed algorithm is compared with the original OMLSA and other classic denoising methods under various noise types and input SNRs, in terms of objective evaluation metrics such as segmental SNR improvement and perceptual evaluation of speech quality. The results show the validity of the improved BSF-based OMLSA algorithm.

Keywords

Speech enhancement, bilateral filtering, optimally modified log-spectral amplitude, bilateral spectrogram filtering, spectrogram

1 Introduction

Speech is an important carrier of human communication, but its quality is inevitably degraded by background noise, especially in adverse acoustic environments, which may not only decrease speech intelligibility but also cause auditory fatigue. Generally, speech enhancement (SE) techniques aim to extract clean speech from noisy speech signals by suppressing or eliminating the noise, so that the intelligibility and/or the quality of speech can be improved [1]. Besides speech communication, SE techniques have in recent years also often been used as front-end processing in speech recognition, speech coding and intelligent communication equipment [2].

During the development of speech enhancement techniques, classical algorithms such as spectral subtraction, Wiener filtering, the subspace method and minimum mean square error estimation have continued to attract research attention [3–6]. The subspace algorithm is good at suppressing musical noise while balancing speech distortion against residual noise. In [7], the psychoacoustic masking effect was incorporated into the subspace algorithm to improve its performance, and the resulting subspace speech enhancement based on auditory masking (SSE-AM) achieved a certain success. However, the subspace method relies on matrix operations and generally incurs a large computational load, which is impractical for real-time speech enhancement. In [8], an a priori SNR estimator based on united speech presence probabilities (PSNR-USPP) was proposed, which improves the tracking performance of the a priori SNR by combining the maximum likelihood (ML) estimator with the decision-directed SNR estimator, and reduces musical noise at the same time. However, when the noise power spectral density changes abruptly, the model parameter estimation of PSNR-USPP does not adapt well, and estimating the power spectral density of non-stationary noise remains a problem.

It is worth mentioning that the statistical model based on short-time spectral amplitude (STSA) is the most widely used among statistical model-based methods. Taking the minimum mean square error (MMSE) estimator as an example, the MMSE-based STSA estimator proposed by Ephraim and Malah [9] can effectively suppress musical noise. They later proposed an MMSE-based short-time log-spectral amplitude (LSA) estimator [10], because log-spectra match the properties of human hearing better. However, because of speech presence uncertainty, the multiplicative gain modifier of MMSE-LSA is not optimal, so Cohen proposed the MMSE-based optimally modified log-spectral amplitude (OMLSA) estimator [11]. OMLSA can adapt to adverse noise environments, avoid the musical noise problem, and protect weak speech components. The gain function of OMLSA depends on the speech presence probability. However, when the speech presence probability is expected to be zero, i.e., in silent periods, the value of the gain function is often not equal to zero, because the noise power spectral density is underestimated in non-stationary noise environments. Therefore, OMLSA inevitably leaves residual noise, and further improvement is necessary.

Bilateral filtering (BF) is a nonlinear image filtering technique proposed by Tomasi and Manduchi [12]. BF utilizes not only the spatial proximity information (geometric distance) of pixels to smooth out the noise, but also the gray similarity information to make the filter effectively preserve the image edge texture. Because of this property, BF has become widely used in many computer vision and image applications [13, 14].

To the best of our knowledge, BF has not yet been applied to speech signal processing or speech enhancement. In this paper, the bilateral filtering (BF) technique is used for the first time to preprocess the noisy speech spectrogram for speech denoising, which we call bilateral spectrogram filtering (BSF). Here, the spectrogram of clean speech is regarded as a clean image, each time-frequency (TF) bin represents a pixel, and the normalized noisy speech spectrogram is regarded as the corresponding image corrupted by noise. Thus, BSF can sharpen details, remove unwanted textures or background noise, and preserve the edges of a speech spectrogram. In addition, to reduce the computational cost, a fast and accurate BF is adopted, which reduces the complexity to O(1) for each time-frequency bin [12, 15]. The main contribution of the proposed improved OMLSA algorithm is therefore that BSF is introduced as a preprocessing scheme, and the a posteriori SNR is then estimated from the BSF-enhanced speech. Experimental results show that, combined with BSF, the performance of OMLSA is improved in terms of segmental signal-to-noise ratio improvement (SegSNRI) and perceptual evaluation of speech quality (PESQ), especially in adverse acoustic environments.

2 Improved OMLSA speech enhancement algorithm

2.1 Overview

The additive noise signal d (n) and the clean speech signal x (n) are assumed to be independent and uncorrelated, and the noisy speech signal y (n) can be modeled as: $y (n) = x (n) + d (n)$ (1)

Then the short-time Fourier transform (STFT) of Equation (1) can be given by: $Y (k, l) = X (k, l) + D (k, l)$ (2) where k and l are, respectively, the frequency bin index and the frame index. Y (k, l), X (k, l) and D (k, l) represent, respectively, STFT of y (n), x (n) and d (n).

Taking advantage of the speech presence probability, OMLSA is the optimally modified log-spectral amplitude estimator based on MMSE-LSA. The block diagram of the improved OMLSA, which combines OMLSA with BSF, is illustrated in Fig. 1, where $\hat{Y} (k, l)$ is the initial estimate of the amplitude spectrum of the enhanced speech produced by the bilateral spectrogram filtering described in detail in Section 3. In the first step, the noisy speech signal is transformed from the time domain into the frequency domain to extract the amplitude spectrum for processing, while the phase spectrum is kept unchanged for speech reconstruction. This rests on the assumption that phase information is not important for human auditory perception, so only an estimate of the magnitude or the power of the speech is required [16]. After the spectrogram has been preprocessed as an image, the bilateral spectrogram filter is applied to obtain the filtered output. Following the image enhancement processing mechanism [17], the spectrogram is then enhanced to sharpen its edges and suppress most of the background noise, which gives a rough noise reduction. The algorithm then performs a fine noise reduction adjustment: the initial amplitude spectrum estimate of the enhanced speech is used to calculate the noise spectrum by the improved minima controlled recursive averaging (IMCRA) method [18] and to estimate the a posteriori SNR of OMLSA, from which the spectral gain function of the proposed algorithm is obtained. Finally, the estimated spectrum of the clean speech is obtained, and the phase of the original noisy speech together with an overlap-add method is used to reconstruct the waveform of the estimated clean speech.

Fig. 1

Block diagram of the improved OMLSA algorithm.

2.2 Spectral gain function of MMSE-LSA

The purpose of MMSE-LSA is to calculate a spectral gain function and apply it as $\hat{X} (k, l) = G_{H_{1}} (k, l) \times Y (k, l)$ , where $\hat{X} (k, l)$ represents the estimated spectrum of the clean speech signal, obtained by solving the following minimization problem: $\min \{ E [{(\log | X (k, l) | - \log | \hat{X} (k, l) |)}^{2}] \}$ (3) where min{ · } denotes minimization and E [·] denotes expectation. Solving Equation (3) gives [10]: $G_{H_{1}} (k, l) = \frac{ξ (k, l)}{ξ (k, l) + 1} \exp \left( \frac{1}{2} \int_{μ (k, l)}^{+ \infty} \frac{e^{- t}}{t} dt \right)$ (4) where $μ (k, l) = \frac{γ (k, l) ξ (k, l)}{1 + ξ (k, l)}$ , which is determined by the a priori SNR ξ (k, l) and the a posteriori SNR γ (k, l). The a priori SNR ξ (k, l) can be calculated by: $ξ (k, l) = α G_{H_{1}}^{2} (k, l - 1) γ (k, l - 1) + (1 - α) \max \{ γ (k, l) - 1, 0 \}$ (5) where α is a weighting factor that controls the tradeoff between noise reduction and speech distortion. Here α is empirically set to 0.92. In addition, the a posteriori SNR γ (k, l) is estimated as $\hat{γ} (k, l) = {| \hat{Y} (k, l) |}^{2} / {\hat{λ}}_{d} (k, l)$ , where ${\hat{λ}}_{d} (k, l)$ represents the estimated noise variance, which is obtained by the improved minima controlled recursive averaging (IMCRA) method [18]. $\hat{Y} (k, l)$ is the estimated noisy speech spectrum, calculated by BSF (to be defined in Section 3).
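As an illustration of Equations (4) and (5), the following minimal Python sketch evaluates the MMSE-LSA gain and the decision-directed a priori SNR for a single time-frequency bin. The function names, the small floor on μ and the numerical evaluation of the exponential integral (power series for small arguments, continued fraction otherwise) are assumptions made for this example and are not taken from the paper.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp_integral_e1(x: float) -> float:
    """E1(x) = integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("E1 is only evaluated for x > 0 in this sketch")
    if x <= 1.0:
        # Power series: -gamma - ln(x) + sum_{k>=1} (-1)^(k+1) x^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 60):
            term *= -x / k          # term is now (-x)^k / k!
            total -= term / k
            if abs(term) < 1e-17:
                break
        return -EULER_GAMMA - math.log(x) + total
    # Continued fraction, evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1e300
    d = 1.0 / b
    h = d
    for i in range(1, 200):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-15:
            break
    return h * math.exp(-x)


def lsa_gain(xi: float, gamma: float) -> float:
    """Eq. (4): G_H1 = xi / (1 + xi) * exp(0.5 * E1(mu))."""
    # Assumption: mu is floored to keep E1 finite when xi or gamma is zero.
    mu = max(gamma * xi / (1.0 + xi), 1e-10)
    return xi / (1.0 + xi) * math.exp(0.5 * exp_integral_e1(mu))


def decision_directed_xi(prev_gain: float, prev_gamma: float,
                         gamma: float, alpha: float = 0.92) -> float:
    """Eq. (5): a priori SNR from the previous frame and the current one."""
    return alpha * prev_gain ** 2 * prev_gamma + (1.0 - alpha) * max(gamma - 1.0, 0.0)
```

For large μ the exponential integral vanishes and the gain approaches ξ/(1 + ξ); for smaller μ the exponential factor raises the gain above that value.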

2.3 Improved OMLSA algorithm

The proposed improved OMLSA algorithm utilizes the speech absence probability based on a binary hypothesis model: $H_{0} (k, l) : Y (k, l) = D (k, l)$ (6) $H_{1} (k, l) : Y (k, l) = X (k, l) + D (k, l)$ (7) where H₀ (k, l) indicates speech absence, i.e., the input signal contains only noise, and H₁ (k, l) indicates speech presence, i.e., the input signal contains both speech and noise. Under H₁ (k, l), the spectral gain function of the OMLSA algorithm is $G_{H_{1}} (k, l)$. Under H₀ (k, l), the gain factor is set to the minimum threshold G_min in order to suppress the musical residual noise. Experiments show that better performance can be achieved when G_min = 0.06. p (k, l) represents the speech presence probability, which can be computed by the Bayes rule: $p (k, l) = {\left\{ 1 + \frac{q (k, l)}{1 - q (k, l)} (1 + ξ (k, l)) \exp (- μ (k, l)) \right\}}^{- 1}$ (8) where q (k, l) denotes the a priori speech absence probability (SAP), which can be computed by the method in [11] that exploits the strong correlation of speech presence in neighbouring frequency bins of consecutive frames.

Accordingly, the spectral gain function of the improved OMLSA algorithm can be given by: $G (k, l) = {G_{H_{1}} (k, l)}^{p (k, l)} G_{\min}^{1 - p (k, l)}$ (9)
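Equations (8) and (9) can be sketched for a single time-frequency bin as follows. This is only an illustrative Python fragment: the function names are hypothetical, and the a priori speech absence probability q is simply passed in rather than estimated by the method of [11].

```python
import math


def speech_presence_probability(xi: float, mu: float, q: float) -> float:
    """Eq. (8): conditional speech presence probability p(k, l)."""
    if q >= 1.0:
        return 0.0  # speech is assumed absent with certainty
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * math.exp(-mu))


def omlsa_gain(g_h1: float, p: float, g_min: float = 0.06) -> float:
    """Eq. (9): geometric weighting of G_H1 and G_min by p and 1 - p."""
    return (g_h1 ** p) * (g_min ** (1.0 - p))
```

With p = 1 the gain reduces to G_H1, and with p = 0 it falls back to the floor G_min, which matches the two hypotheses H₁ and H₀.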

The estimated magnitude of the clean speech can be written as: $\hat{X} (k, l) = G (k, l) | \hat{Y} (k, l) |$ (10)

In the end, the phase information of noisy speech and inverse FFT (IFFT) are used to reconstruct the enhanced speech signal in the time domain.

3 Bilateral spectrogram filtering

3.1 Spectrogram preprocessing

The spectrogram is obtained by calculating the logarithmic envelope of the power spectral density (PSD) of the speech signal, and is given by: $X (k, l) = 10 \log_{10} ({| Y (k, l) |}^{2})$ (11)

There are two reasons for preprocessing and normalizing the spectrogram. On the one hand, Eq. (11) reduces the dynamic range of the PSD, so that the distribution of light and shadow, which corresponds to the mean and variance of the spectrogram, becomes more uniform. On the other hand, to facilitate image processing, the speech spectrogram needs to be mapped to the gray-value range 0–255 [19, 20].

If the normalized spectrogram is denoted by $\tilde{X}$ , its elements are obtained by: $\tilde{X} (k, l) = \bar{X} (k, l) / \max \{ \bar{X} \}$ (12) where $\max \{ \bar{X} \}$ extracts the maximum value of the matrix $\bar{X}$ . The elements of $\bar{X}$ are given by: $\bar{X} (k, l) = \hat{X} (k, l) - \min \{ \hat{X} \}$ (13) where $\min \{ \hat{X} \}$ extracts the minimum value of the matrix $\hat{X}$ . The elements of $\hat{X}$ are obtained as follows: $\hat{X} (k, l) = \begin{cases} X (k, l), & \text{if } X (k, l) ⩾ X_{\min} \\ X_{\min}, & \text{otherwise} \end{cases}$ (14)

Here, since pixel grayscale values generally range from 0 to 255, the speech spectrogram is mapped to image grayscale values by defining $X_{\min} = \max \{ X \} - 255$ , where the elements of the matrix $X$ are determined by Equation (11).
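The preprocessing chain of Equations (11) to (14) can be sketched in Python as below. The spectrogram is assumed to be stored as a nested list indexed as [k][l]; the function name, the `dynamic_range_db` parameter and the small floor that avoids taking the logarithm of zero are assumptions introduced for this example.

```python
import math


def normalize_spectrogram(power, dynamic_range_db=255.0):
    """Map a power spectrogram |Y(k,l)|^2 to the normalized X_tilde of Eq. (12)."""
    eps = 1e-12  # assumption: floor that keeps log10 finite for empty bins
    # Eq. (11): logarithmic power spectrum in dB.
    x_db = [[10.0 * math.log10(max(p, eps)) for p in row] for row in power]
    # X_min = max{X} - 255, so the retained dynamic range spans 255 levels.
    x_min = max(max(row) for row in x_db) - dynamic_range_db
    # Eq. (14): clip everything below X_min.
    x_hat = [[max(v, x_min) for v in row] for row in x_db]
    # Eq. (13): shift so that the smallest value becomes zero.
    lowest = min(min(row) for row in x_hat)
    x_bar = [[v - lowest for v in row] for row in x_hat]
    # Eq. (12): divide by the maximum.
    peak = max(max(row) for row in x_bar)
    if peak == 0.0:
        return [[0.0 for _ in row] for row in x_bar]
    return [[v / peak for v in row] for row in x_bar]
```

For example, a 2 × 2 power spectrogram with values 1, 10, 100 and 1000 corresponds to 0, 10, 20 and 30 dB, and is mapped to 0, 1/3, 2/3 and 1.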

3.2 Bilateral spectrogram filtering

Originally introduced by Tomasi and Manduchi [12], bilateral filters are edge-preserving operators that have found widespread use in many computer vision and graphics tasks, such as image denoising [21, 22], texture editing and relighting [23], tone management [24], demosaicking [25], stylization [26], optical-flow estimation [27] and stereo matching [28].

To the best of our knowledge, the BF method has not yet been introduced for speech denoising. Based on the BF method, we propose BSF for speech denoising. For speech signals, the filter input $\tilde{P}$ is obtained by recursively smoothing the normalized spectrogram $\tilde{X}$ of the noisy speech, and its elements are computed by [29]: $\tilde{P} (k, l) = β \tilde{P} (k, l - 1) + (1 - β) \tilde{X} (k, l)$ (15) where β ∈ (0, 1) is the forgetting factor, which is related to the inter-frame stability. The larger the value of β, the harder it is to track non-stationary signals; a typical value is 0.9.
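The recursive smoothing of Equation (15) amounts to a first-order recursion along the frame index for each frequency bin. A minimal Python sketch is given here; the function name and the initialization of the first frame with the unsmoothed value are assumptions, since the initial condition is not stated in the text.

```python
def smooth_over_frames(x_tilde, beta=0.9):
    """Eq. (15): P(k,l) = beta * P(k,l-1) + (1 - beta) * X(k,l), per frequency bin k."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    smoothed = []
    for row in x_tilde:              # one row per frequency bin k
        out = []
        for value in row:            # frames l = 0, 1, 2, ...
            if out:
                out.append(beta * out[-1] + (1.0 - beta) * value)
            else:
                out.append(value)    # assumption: P(k,0) = X(k,0)
        smoothed.append(out)
    return smoothed
```

With β = 0.5, an impulse [1, 0, 0] in one frequency bin decays as [1, 0.5, 0.25], which shows how a larger β makes the output follow abrupt changes more slowly.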

Let ${\tilde{P}}_{u}$ denote the spectral component of the smoothed spectrogram $\tilde{P}$ at time-frequency bin u = (k, l), and let ${\tilde{Q}}_{u}$ denote the spectral component of the filtered spectrogram $\tilde{Q}$ at u. According to [12], the filter output of BSF can be expressed by: ${\tilde{Q}}_{u} = \frac{1}{W_{u}} \sum_{v \in Ω} G_{σ_{s}} (∥ u - v ∥) G_{σ_{r}} (| {\tilde{P}}_{u} - {\tilde{P}}_{v} |) {\tilde{P}}_{v}$ (16) $W_{u} = \sum_{v \in Ω} G_{σ_{s}} (∥ u - v ∥) G_{σ_{r}} (| {\tilde{P}}_{u} - {\tilde{P}}_{v} |)$ (17) where v = (k₀, l₀) is a neighbouring time-frequency bin in a small rectangular window centered at u, Ω is the set of these adjacent TF bins, and W_u denotes the normalization factor. $G_{σ_{s}}$ and $G_{σ_{r}}$ are, respectively, the spectral proximity factor and the grayscale similarity factor, given by: $G_{σ_{s}} (∥ u - v ∥) = \exp \left( - \frac{{(k - k_{0})}^{2} + {(l - l_{0})}^{2}}{2 σ_{s}^{2}} \right)$ (18) $G_{σ_{r}} (| {\tilde{P}}_{u} - {\tilde{P}}_{v} |) = \exp \left( - \frac{{({\tilde{P}}_{u} - {\tilde{P}}_{v})}^{2}}{2 σ_{r}^{2}} \right)$ (19) where σ_s denotes the standard deviation of the spectral distance, while σ_r denotes the standard deviation of the grayscale value. They control the radial range of the spatial-domain filtering kernel and of the grayscale filtering kernel, respectively. σ_s and σ_r are adjustable parameters that directly determine the performance of BF.
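Equations (16) to (19) can be illustrated with a direct (brute-force) bilateral filter over a square window. The Python sketch below is not the fast O(1) algorithm adopted in the paper; it costs one pass over the window per bin and is only meant to show how the two Gaussian factors interact. The function name and the `radius` parameter (half-width of the window Ω) are hypothetical.

```python
import math


def bilateral_spectrogram_filter(p, sigma_s, sigma_r, radius):
    """Direct evaluation of Eqs. (16)-(19) on a spectrogram p[k][l]."""
    n_freq, n_frames = len(p), len(p[0])
    q = [[0.0] * n_frames for _ in range(n_freq)]
    for k in range(n_freq):
        for l in range(n_frames):
            weighted_sum = 0.0
            w_u = 0.0  # normalization factor of Eq. (17)
            # The window is truncated at the spectrogram borders.
            for k0 in range(max(0, k - radius), min(n_freq, k + radius + 1)):
                for l0 in range(max(0, l - radius), min(n_frames, l + radius + 1)):
                    dist2 = (k - k0) ** 2 + (l - l0) ** 2
                    g_s = math.exp(-dist2 / (2.0 * sigma_s ** 2))      # Eq. (18)
                    diff = p[k][l] - p[k0][l0]
                    g_r = math.exp(-diff ** 2 / (2.0 * sigma_r ** 2))  # Eq. (19)
                    weighted_sum += g_s * g_r * p[k0][l0]
                    w_u += g_s * g_r
            q[k][l] = weighted_sum / w_u  # w_u >= 1 because the centre bin has weight 1
    return q
```

When σ_r is small compared with the height of an edge, bins on the other side of the edge receive almost no weight and the edge is preserved; when σ_r is large, the filter approaches an ordinary Gaussian blur and the edge is smeared.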

The bilateral filter has turned out to be a versatile tool that has found widespread application in image processing, computer graphics and computer vision. However, a direct computation of BF requires $O (σ_{s}^{2})$ operations per pixel. In this paper, we adopt the fast and accurate BF proposed by Chaudhury [30], in which Gauss polynomials of a fixed approximation order N are used, so that the rounding error is small and the BF technique can be implemented efficiently in hardware [31]. As a result, the complexity of the proposed BSF algorithm is reduced to O (1).

The enhanced speech spectrogram can be obtained by using BSF [32], expressed as: $Y (k, l) = \max \{ \tilde{X} (k, l) - η \tilde{Q} (k, l), 0 \}$ (20) where half-wave rectification is used to estimate the enhanced speech spectrogram. $\tilde{Q} (k, l)$ , i.e. ${\tilde{Q}}_{u}$ in Equation (16), denotes the spectral component at a TF bin of the BSF output $\tilde{Q}$ , and η is an adjustable subtraction factor that mainly affects the amount of speech distortion and of noise suppression. When η = 0, $Y (k, l) = \tilde{X} (k, l)$ holds, so that all the pixels are unchanged. When η is large enough, $Y (k, l) = 0$ holds, so that all the pixels become zero. Therefore, η should not be too large; reasonable values range from 0.9 to 1.2.

Finally, the PSD represented by the enhanced spectrogram $Y (k, l)$ can be converted to the initial estimate amplitude spectrum of the enhanced speech $\hat{Y} (k, l)$ , given by: $\hat{Y} (k, l) = {[10^{0.1 | Y (k, l) |}]}^{1 / 2}$ (21)
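Equations (20) and (21) can be written out literally as in the following Python sketch. The function names are hypothetical. The text does not spell out how the normalized values are mapped back to a dB scale before Equation (21) is applied, so the second function simply assumes that its input is already expressed in dB.

```python
import math


def enhance_spectrogram(x_tilde, q_tilde, eta=1.1):
    """Eq. (20): half-wave rectified subtraction of the scaled BSF output."""
    return [[max(x - eta * q, 0.0) for x, q in zip(x_row, q_row)]
            for x_row, q_row in zip(x_tilde, q_tilde)]


def db_to_amplitude(y_db):
    """Eq. (21): amplitude = sqrt(10 ** (0.1 * |Y|)), with Y assumed to be in dB."""
    return [[math.sqrt(10.0 ** (0.1 * abs(v))) for v in row] for row in y_db]
```

For instance, a bin at 20 dB is converted to an amplitude of 10, and any bin where η times the filter output exceeds the input is set to zero by the rectification.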

4 Performance evaluation

In order to validate the performance of the improved OMLSA speech enhancement algorithm based on bilateral spectrogram filtering, experiments were carried out in different noise environments. Four types of noise are selected from the Noisex-92 database [33]: stationary white Gaussian noise, non-stationary white Gaussian noise (obtained by increasing the white Gaussian noise by 15 dB in some periods), factory noise and babble noise. The clean speech comes from the TIMIT standard speech database [34]. Speech data from 30 speakers are selected, of which 50% are male utterances and 50% are female utterances. The clean speech signal and the four kinds of noise are mixed to obtain noisy speech with input SNRs of –5 dB, 0 dB, 5 dB and 10 dB [35]. The sampling rate of all signals is 16 kHz, the frame length is 16 ms, the frame shift is 8 ms, and the window function is a Hanning window. Other parameters used in the algorithm are determined experimentally as: σ_r = 40, σ_s = 16, Ω = 9 $σ_{s}^{2}$ , η = 1.1.
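The mixing step can be sketched as follows. The exact mixing procedure is not described in the text, so this Python fragment assumes the common approach of scaling the noise so that the ratio of average clean-speech power to average noise power over the whole utterance equals the target input SNR; the function and constant names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 16000
FRAME_LEN = SAMPLE_RATE_HZ * 16 // 1000    # 16 ms frame -> 256 samples
FRAME_SHIFT = SAMPLE_RATE_HZ * 8 // 1000   # 8 ms shift  -> 128 samples


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that the mixture has the requested global input SNR."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(d * d for d in noise) / len(noise)
    if p_clean == 0.0 or p_noise == 0.0:
        raise ValueError("signals must have non-zero power")
    # Choose the scale so that p_clean / (scale^2 * p_noise) = 10^(snr_db / 10).
    scale = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + scale * d for s, d in zip(clean, noise)]
```

At a sampling rate of 16 kHz, the stated frame length and frame shift correspond to 256 and 128 samples, i.e. 50% overlap between consecutive frames.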

All test data are processed and analyzed by the following four speech enhancement algorithms: M1: subspace speech enhancement based on auditory masking (SSE-AM) [7]; M2: a priori SNR estimator based on united speech presence probabilities (PSNR-USPP) [8]; M3: OMLSA (the original method, before the improvement proposed in this paper) [11]; and M4: the proposed improved OMLSA based on bilateral spectrogram filtering (IOM-BSF).

In this section, two objective evaluation indices are used to test and compare the four algorithms mentioned above. The first is segmental SNR improvement (SegSNRI) [36, 37]. Since the speech signal is only short-term stationary, its SNR differs from one time period to another, so the segmental SNR can be used to evaluate speech enhancement performance. The speech is first divided into frames, the SNR is then calculated for each frame, and finally the frame SNRs are averaged. Specifically, the SegSNR can be calculated from the following equation:

$SegSNR = \frac{1}{L_{x}} \sum_{l = 0}^{L_{x} - 1} 10 \log_{10} \left( \frac{\sum_{n = 0}^{N_{W} - 1} x^{2} (n + l N_{W} / 2)}{\sum_{n = 0}^{N_{W} - 1} {[x (n + l N_{W} / 2) - \hat{x} (n + l N_{W} / 2)]}^{2}} \right)$ (22) where L_x denotes the number of frames of an utterance and N_W denotes the length of the Hanning window. x (n) denotes the clean speech signal and $\hat{x} (n)$ denotes the enhanced speech produced by the algorithm. SegSNR is closely related to the speech distortion and the residual noise. Hence, SegSNRI is a universal and important index for measuring the performance of the algorithm.
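Equation (22) can be sketched in Python as below, using frames of length N_W with a shift of N_W/2. The function names and the small floor that avoids division by zero are assumptions. The improvement measure is taken here as the SegSNR of the enhanced speech minus the SegSNR of the noisy input, which is a common definition but is not spelled out in the text.

```python
import math


def segmental_snr(clean, estimate, frame_len=256):
    """Eq. (22): mean per-frame SNR in dB, frames of N_W samples shifted by N_W / 2."""
    hop = frame_len // 2
    n_frames = (len(clean) - frame_len) // hop + 1
    if n_frames < 1:
        raise ValueError("signal is shorter than one frame")
    eps = 1e-12  # assumption: floor that keeps the ratio and the logarithm finite
    total = 0.0
    for l in range(n_frames):
        start = l * hop
        signal_energy = sum(clean[n] ** 2 for n in range(start, start + frame_len))
        error_energy = sum((clean[n] - estimate[n]) ** 2
                           for n in range(start, start + frame_len))
        total += 10.0 * math.log10(max(signal_energy, eps) / max(error_energy, eps))
    return total / n_frames


def segmental_snr_improvement(clean, noisy, enhanced, frame_len=256):
    """SegSNRI, assumed here to be SegSNR(enhanced) - SegSNR(noisy)."""
    return (segmental_snr(clean, enhanced, frame_len)
            - segmental_snr(clean, noisy, frame_len))
```

As a quick check, a constant clean signal of amplitude 1 estimated as 0.9 has an error energy of 1% of the signal energy in every frame, which gives a SegSNR of 20 dB.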

The second objective index is perceptual evaluation of speech quality (PESQ) [38]. It results from integrating the perceptual analysis measurement system (PAMS) and perceptual speech quality measure (PSQM) 99, an enhanced version of PSQM. PESQ is specified in ITU-T Recommendation P.862, which replaced P.861, the recommendation that specified PSQM and measuring normalizing blocks (MNB) [38, 39]. The higher the PESQ score, the better the subjective speech quality. In addition, PESQ is a good approximation to the mean opinion score (MOS) [40], which is a subjective testing tool.

Figure 2 shows the SegSNRI scores of the four methods under comparison. As is evident from Fig. 2, under the various noise environments the SegSNRI score of the speech enhanced by the proposed IOM-BSF algorithm is basically the highest: it is superior to that of the original OMLSA algorithm and also to those of the other traditional speech enhancement algorithms, indicating a good overall level of noise suppression and voice quality improvement. Notably, the margin is relatively large at low SNR (–5 dB), which shows that the IOM-BSF algorithm suppresses noise better in adverse noise environments.

Fig. 2

Performance measures (SegSNRI) of four methods under different input SNRs and different noise environment types.

Figure 3 shows the PESQ improvement scores (PESQI) of the four methods. The PESQI score of the IOM-BSF algorithm is the highest for white noise, non-stationary white noise and babble noise, which indicates that the subjective speech quality is better in both stationary and non-stationary noise environments. It also illustrates that the IOM-BSF algorithm leaves less residual noise after processing. The PSNR-USPP algorithm is also an optimization algorithm based on a statistical model, so it is used for comparison. It achieves good results for factory noise, owing to the introduction of an a posteriori SNR without frame delay, which improves the tracking performance of the a priori SNR [8]. Overall, Figs. 2 and 3 show that the IOM-BSF algorithm is more robust, which demonstrates the effectiveness and superiority of the proposed algorithm.

Fig. 3

Performance measures (PESQI) of four methods under different input SNRs and different noise environment types.

For a more intuitive, subjective comparison, the waveforms and spectrograms of the processed speech are also compared. Figure 4 highlights the ability of the proposed IOM-BSF technique to produce less residual noise and less speech distortion than SSE-AM, PSNR-USPP and OMLSA under factory noise with 5 dB input SNR.

Figure 4 demonstrates that, as the noise becomes more non-stationary, the spectrograms produced by the three competing algorithms deteriorate, because in this case the noise spectrum estimate cannot follow the dramatic changes of the background noise in time. In addition, the spectrogram of SSE-AM shows excessive distortion. The IOM-BSF algorithm performs better owing to the edge-preserving image enhancement property of BSF: it removes most of the noise and the blurred areas. In particular, in the area circled in the figure, the residual noise is lower and the quality of the enhanced speech is better.

Fig. 4

Waveforms and spectrograms for noisy speech corrupted with factory noise at 5 dB enhanced by 4 algorithms.

5 Conclusion

Inspired by bilateral image filtering, which can remove noise while preserving edges, we proposed a BSF-based OMLSA algorithm for speech enhancement. Using BSF as a preprocessing scheme on the noisy speech, we estimate the noise power spectral density by IMCRA from the BSF-enhanced speech signal, and then apply OMLSA to further reduce the residual noise. The proposed algorithm is tested and compared with classical speech enhancement algorithms in various noise environments. Experimental performance evaluations indicate that, in terms of averaged SegSNRI, IOM-BSF improves on the original OMLSA by 1.65, 2.91, 3.17 and 2.27 dB under input SNRs of –5 dB, 0 dB, 5 dB and 10 dB, respectively. In terms of averaged PESQI, IOM-BSF improves on the original OMLSA by 0.33, 0.39, 0.30 and 0.12 under the same four conditions, respectively.

Evaluation results also show that the proposed algorithm outperforms the other competing methods in terms of the amount of noise reduction while the speech distortion remains acceptable, especially in adverse acoustic environments where the input SNR is extremely low or highly non-stationary noise exists. Hence, we conclude that introducing bilateral image filtering as a preprocessing scheme in speech enhancement is effective.

Acknowledgments

This work was supported in part by the National Science Fund of China (grant 11974086), in part by the Postgraduate Innovative Capability Cultivation Program by Guangzhou University (grant 2018GDJC-M20), in part by the Special Innovation Project of Department of Education of Guangdong Province (grant 2017KTSCX141), in part by the Open Fund of National Environmental Protection Engineering and Technology Center for Road Traffic Noise Control, and in part by the Guangzhou Science and Technology Plan Project (grant 201904010468).

References

1. Loizou P.C., Speech enhancement: theory and practice, (second edition), CRC Press, Boca Raton, FL, USA, (2017).

2. Benesty and Cohen, Single-channel speech enhancement in the time domain, Canonical Correlation Analysis in Speech Enhancement, Springer, Cham, (2018).

3. Upadhyay and Jaiswal R.K., Single channel speech enhancement: Using Wiener filtering with recursive noise estimation, Procedia Computer Science (2016).

4. Benesty, Introduction, Fundamentals of Speech Enhancement, Springer, Berlin, Germany, (2018).

5. Loizou P.C. and Kim, Reasons why current speech-enhancement algorithms do not improve speech intelligibility and suggested solutions, IEEE Transactions on Acoustics Speech and Signal Processing 19(1) (2011), 47–56.

6. , Yang, Zhang, Yan, Hi, Akagi and Loizou P.C., Comparative intelligibility investigation of single-channel noise-reduction algorithms for Chinese, Japanese, and English, The Journal of the Acoustical Society of America 129(5) (2011), 3291–3301.

7. Jabloun and Champagne, Incorporating the human hearing properties in the signal subspace approach for speech enhancement, IEEE Transactions on Acoustics Speech and Signal Processing 11(6) (2003), 700–708.

8. Zheng, Zhou and Li, A modified a priori SNR estimator based on the united speech presence probabilities, Journal of Electronics & Information Technology 30(7) (2008), 1680–1683.

9. Ephraim and Malah, Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 32(6) (1984), 1109–1121.

10. Ephraim and Malah, Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing 33(2) (1985), 443–445.

11. Cohen, Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator, IEEE Signal Processing Letters 9(4) (2002), 113–116.

12. Tomasi and Manduchi, Bilateral filtering for gray and color images, Proc. IEEE International Conference on Computer Vision (ICCV), Bombay, India, (1998), 839–846.

13. Knaus and Zwicker, Progressive image denoising, IEEE Transactions on Image Processing 23(7) (2014), 3114–3125.

14. Chaudhury K.N. and Rithwik, Image denoising using optimally weighted bilateral filters: A sure and fast approach, Proc. IEEE International Conference on Image Processing (ICIP), (2015), 108–112.

15. Chaudhury K.N., Sage and Unser, Fast O(1) bilateral filtering using trigonometric range kernels, IEEE Transactions on Image Processing 20(12) (2011), 3376–3382.

16. Wan E.A. and Nelson A.T., Networks for speech enhancement, in Handbook of Neural Networks for Speech Processing, S. Katagiri, Ed. Norwell, MA, USA: Artech House, (1998).

17. Buades, Coll and Morel J.M., A non-local algorithm for image denoising, Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 2 (2005), 60–65.

18. Cohen, Noise spectrum estimation in adverse environments: Improved minima controlled recursive averaging, IEEE Transactions on Speech and Audio Processing 11(5) (2003), 466–475.

19. , Sun, Zhang and Li, Direction-aware neural style transfer with texture enhancement, Neurocomputing 370 (2019), 39–55.

20. , Shi, Lu and Wang, Quantum circuit design for several morphological image processing methods, Quantum Information Processing 18(12) (2019), 364.

21. Buades, Coll and Morel J.M., A review of image denoising algorithms, with a new one, Multiscale Modeling & Simulation 4(2) (2005), 490–530.

22. Yang, Yang, Davis and Nister, Spatial-depth super resolution for range images, Proc. IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, (2007), 1–8.

23. B.M., Chen, Dorsey and Durand, Image-based modeling and photo editing, Proc. 28th Annual Conference on Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, (2001), 433–442.

24. Durand and Dorsey, Fast bilateral filtering for the display of high-dynamic-range images, Proc. 29th Annual Conference on Computer Graphics and Interactive Techniques, San Antonio, Texas, USA, (2002), 257–266.

25. Ramanath and Snyder W.E., Adaptive demosaicking, Journal of Electronic Imaging 12(4) (2003), 633–643.

26. Winnemöller, Olsen S.C. and Gooch, Real-time video abstraction, ACM Transactions on Graphics (TOG) 25(3) (2006), 1221–1226.

27. Xiao, Cheng, Sawhney, Rao and Isnardi, Bilateral filtering-based optical flow estimation with occlusion detection, Proc. European Conference on Computer Vision, Springer, Berlin, Heidelberg, (2006), 211–224.

28. Yang, Wang, Yang, Stewénius and Nistér, Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling, IEEE Transactions on Pattern Analysis and Machine Intelligence 31(3) (2008), 492–504.

29. Zheng, Deleforge, Li and Kellermann, Statistical analysis of the multichannel Wiener filter using a bivariate normal distribution for sample covariance matrices, IEEE/ACM Transactions on Audio, Speech and Language Processing 26(5) (2018), 951–966.

30. Chaudhury K.N. and Dabhade S.D., Fast and provably accurate bilateral filtering, IEEE Transactions on Image Processing 25(6) (2016), 2519–2528.

31. Muller J.M., Elementary functions: Algorithms and implementation, Birkhauser Boston, (2006).

32. Hao, Pan, Guo, Hong and Wang, Image detail enhancement with spatially guided filters, Signal Processing 120 (2016), 789–796.

33. Varga and Steeneken H.J.M., Assessment for automatic speech recognition: II. NOISEX-92: a database and an experiment to study the effect of additive noise on speech recognition systems, Speech Communication 31(2) (2014), 11–20.

34. Garofolo J.S., Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database, Gaithersburg, MD, Nat Inst of Standards and Technology (NIST), (1993).

35. Wang, Yang, Yan, Huang and Sang, Speech Enhancement Algorithm of Binary Mask Estimation Based on a Priori SNR Constraints, Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), (2018), 937–943.

36. Hansen J.H.L. and Pellom B.L., An effective quality evaluation protocol for speech enhancement algorithms, Proc. International Conference on Spoken Language Processing (ICSLP), Sydney, Australia, (1998), 1–4.

37. Peng, Tan Z.H., Li and Zheng, A perceptually motivated LP residual estimator in noisy and reverberant environments, Speech Communication 96 (2018), 129–141.

38. Rix A.W., Beerends J.G., Hollier M.P. and Hekstra A.P., Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake, USA, (2001), 749–752.

39. Wang, Liu, Zheng and Li, Spectral subtraction based on two-stage spectral estimation and modified cepstrum thresholding, Applied Acoustics 74(3) (2013), 450–458.

40. and Loizou P.C., Subjective comparison and evaluation of speech enhancement algorithms, Speech Communication 49(7-8) (2007), 588–601.