Abstract
In Speech Enhancement (SE), a major challenge is suppressing non-stationary noise, as well as white noise, in real-time application scenarios. Many techniques have been developed for enhancing vocal signals; however, they do not suppress non-stationary noise effectively, and they consume considerable time and resources. To address this, the Sliding Window Empirical Mode Decomposition and Hurst (SWEMDH)-based SE method was developed, in which the speech signal is decomposed into Intrinsic Mode Functions (IMFs) over a sliding window and the noise factor in each IMF is chosen based on its Hurst exponent. The least corrupted IMFs are then used to restore the vocal signal. However, this technique is not suitable for white noise scenarios. Therefore, in this paper, a Variant of Variational Mode Decomposition (VVMD) combined with the SWEMDH technique is proposed to reduce the complexity in real-time applications. The key objective of the proposed SWEMD-VVMDH technique is to select the IMFs based on the Hurst exponent and then apply the VVMD technique to suppress both low- and high-frequency noisy factors in the vocal signals. Initially, the noisy vocal signal is decomposed into IMFs using the SWEMDH technique. Then, the Hurst exponent is computed to select the IMFs with low-frequency noisy factors, and Narrow-Band Components (NBCs) are computed to select the IMFs with high-frequency noisy factors. VVMD is then applied to the sum of all chosen IMFs to remove both low- and high-frequency noisy factors. Thus, the speech signal quality is improved under non-stationary noises as well as additive white Gaussian noise. Finally, the experimental outcomes demonstrate a significant improvement in the speech signal under both non-stationary and white noise conditions.
Keywords
Introduction
In the digital world, removing audio distortion from noisy vocal signals is essential for enhancing them. Several researchers have developed speech improvement methods and algorithms for reducing the noise in vocal signals [1]. In non-stationary surroundings, the foremost issue in speech improvement is estimating the noise statistics correctly. Traditional estimators are based on Voice Activity Detectors (VAD) [2, 3]: the energy spectrum of the noise is obtained as a smoothed update of its previous values, acquired during speech pauses. This method offers reasonable precision for stationary background noise; however, it cannot track a time-varying spectrum accurately. The difficulty of tracking non-stationary noise becomes more evident for lengthy speech segments and low Signal-to-Noise Ratio (SNR) [4]. These conditions have been addressed with various power spectrum-based techniques [5, 6].
In recent years, a Time-Frequency (TF)-based vocal improvement technique [7] has been proposed using the Empirical Mode Decomposition (EMD) scheme [8]. In general, EMD decomposes the vocal signal into a sequence of oscillatory IMFs and a residual [9]. This scheme does not require a set of basis functions for analyzing the target signal, and it is not restricted to stationary signals. Accordingly, Zao et al. [10] proposed a novel EMD-based vocal improvement method for non-stationary noisy surroundings. In this method, the noise factors of each IMF were identified and selected by their Hurst exponent. The IMF selection and the vocal restoration were executed frame by frame, taking into consideration the feature and precision target estimates. Nonetheless, this method has high time and resource consumption. In addition, it did not provide a considerable enhancement under Babble noise.
Therefore, the SWEMDH technique [11] was proposed, based on computing the EMD in a relatively small window that slides along the time axis. The window size depends on the frequency spectrum of the vocal signal. Potential discontinuities in the IMFs between windows were avoided by assigning the total number of modes and the number of filtering iterations a priori. The number of filtering iterations must be set for each component and depends on the sampling frequency, the analyzed signal, its complexity and its band. It was computed by decomposing the signal windows with the general algorithm and taking the typical number of filtering iterations for each mode. However, this technique was not effective in white noise surroundings.
Hence, in this article, VVMD combined with the SWEMDH technique is proposed to reduce the complexity in real-time applications. Initially, the SWEMDH technique is applied to decompose the noisy vocal signal into IMFs. After that, the Time Delay Estimation (TDE)-based VVMD is applied to the sum of the chosen IMFs. The main aim of this SWEMD-VVMDH technique is to choose the IMFs based on the Hurst exponent and then apply the VVMD technique, which is appropriate for reducing the low- and high-frequency noise components in the vocal signal. Thus, the vocal signal quality is further enhanced under different noises such as additive white Gaussian noise, street noise, babble noise and airport noise.
The remainder of this article is organized as follows: Section II surveys the related research on SE techniques. Section III explains the methodology of the SWEMD-VVMDH technique. Section IV demonstrates the performance of the SWEMD-VVMDH technique compared to the SWEMDH technique. Section V concludes the research work.
Related works
Swami et al. [12] focused on adaptive scales for computing the Continuous Wavelet Transform (CWT) coefficients and adaptive thresholding of these coefficients for SE. In this technique, the adaptive scales and thresholds were chosen based on the noise level of the noisy speech signal. Then, the CWT coefficients were soft-thresholded using the adaptive thresholds. However, the threshold values must be adapted separately for the speech regions, and the technique is limited to single-microphone recordings.
Mai et al. [4] proposed a novel technique, namely Extended-DATE (E-DATE), that extends the d-Dimensional Amplitude Trimmed Estimator (DATE) for noise power spectrum estimation in SE. It relies on the assumption that the Short-Time Fourier Transform (STFT) of a noisy speech signal is sparse, in the sense that the transformed vocal signal can be represented by a comparatively small number of coefficients with large amplitudes in the time-frequency domain. However, it needs advanced techniques to increase the quality of the reconstructed speech signals. Tavares and Coelho [13] proposed a novel time-domain SE method for non-stationary acoustic noise. In this technique, the noisy factors were identified and attenuated, and these factors were then employed for defining a noise range threshold. However, it performs frame-by-frame noise estimation and requires threshold values for removing the noise from the speech signals.
Ji et al. [14] proposed a robust noise Power Spectral Density (PSD) estimator for binaural SE in time-varying noise environments. Initially, the noise PSD was acquired by an eigenvalue of the input covariance matrix. After that, the basic estimator was developed via an approximation method. Then, an eigenvalue compensation method was proposed for enhancing the precision of noise PSD approximation. However, computational cost was moderately increased. Messaoud and Bouzid [15] proposed a novel articulation decision algorithm based on the Multi-scale Product (MP) features to filter the noisy signals. In this algorithm, enhanced subspace decomposition and a multi-scale principal component analysis were performed on the noisy vocal signals and the unvoiced segments of similar signal, respectively. However, computational complexity was high.
Figure 1. Block diagram of the SWEMD-VVMDH technique.
Ghahabi et al. [16] proposed a robust VAD to filter the non-speech segments based on the zero-order Baum-Welch statistics, which are compared with a robust threshold value. However, the Equal Error Rate (EER) was high. Dwijayanti et al. [17] proposed an improvement of speech dynamics for VAD using a Deep Neural Network (DNN). The dynamics were indicated by vocal time candidates computed by a heuristic system from the patterns of the first and second derivatives of the input signals. Then, these candidates, combined with the log power spectra, were given as input to the DNN to obtain the VAD results. However, the performance of the VAD degraded when it eliminated subbands and their neighbors.
Saleem and Ijaz [18] proposed an SE algorithm based on a low-rank and sparse matrix decomposition scheme that uses a gammatone filterbank and the Kullback-Leibler divergence to decompose the noisy vocal magnitude spectra into low-rank noise and sparse vocal segments. In this technique, the noise signals were treated as low-rank factors and the vocal signals as sparse elements. Then, a modified SE algorithm was developed for separating the speech and noise magnitude spectra by imposing rank and sparsity constraints, and the improved time-domain speech was created from the sparse matrix. However, the speech recognition accuracy was reduced when the background noise was not preserved well.
Proposed methodology
In this section, the proposed SWEMD-VVMDH technique is described briefly. VVMD is mainly used to minimize both the low-frequency and the high-frequency noise, whereas SWEMDH alone fails to extract the high-frequency components accurately. VVMD can also be applied to the entire length of the noisy speech signal, enhancing it under white noise and real-time noisy environments. Similarly, NBCs are used for removing unwanted wideband noise from the speech signal; the benefit of selecting narrow-band components is a lower noise bandwidth and therefore better sensitivity and range. Figure 1 illustrates the fundamental processes in this technique.
The steps in this SWEMD-VVMDH technique for speech signal enhancement are described below:
Initially, the noisy vocal signal is decomposed into a set of IMFs by the SWEMD technique. After that, the VVMD technique is applied to the sum of all chosen IMFs. The VVMD technique is performed on each chosen NBC with its input parameter to obtain the DC components, which are eliminated. Finally, an improved speech signal is obtained from the sum of all the residual factors.
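The IMF selection relies on the Hurst exponent of each IMF. The estimator used in the original implementation is not specified here, so the following is only a minimal sketch that assumes the aggregated-variance method: for a stationary series, the variance of block means of size m scales roughly as m^(2H-2), and H is read off the slope of a log-log fit. The function name `hurst_aggvar` and the block sizes are illustrative choices, not taken from the paper.

```python
import numpy as np

def hurst_aggvar(x, block_sizes=(4, 8, 16, 32, 64, 128, 256, 512)):
    """Estimate the Hurst exponent with the aggregated-variance method.

    Assumption: this estimator stands in for whatever Hurst estimator
    the original work used; it is not the authors' implementation.
    """
    x = np.asarray(x, dtype=float)
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(x) // m
        if n_blocks < 8:          # too few blocks for a stable variance
            continue
        # Mean of each non-overlapping block of length m
        means = x[:n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_var.append(np.log(means.var(ddof=1)))
    # Var(block mean) ~ m^(2H - 2), so slope = 2H - 2
    slope, _ = np.polyfit(log_m, log_var, 1)
    return 1.0 + slope / 2.0

# Example: white noise should give H close to 0.5, a random walk close to 1
rng = np.random.default_rng(0)
white = rng.standard_normal(32768)
walk = np.cumsum(white)
print(hurst_aggvar(white), hurst_aggvar(walk))
```

In a pipeline of this kind, the estimate would be computed per IMF and compared against a threshold to decide which IMFs are treated as noise-dominated; the threshold value itself would have to come from the original method.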
The SWEMDH method is used to obtain the IMFs from an input signal
VVMD technique
The main purpose of VVMD is decomposing the signal
In Eq. (3.2),
Using Eq. (3.2), the NBCs and their related centre frequencies are determined by the alternating direction method of multipliers, which is described in Algorithm 3.2.
Initialize
Update
Update
Dual ascent;
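For orientation, the initialize / update / update / dual-ascent loop has the same shape as the standard Variational Mode Decomposition (VMD) iteration. The sketch below implements that standard VMD on the one-sided spectrum, not the VVMD variant proposed in this paper, whose exact update rules are not reproduced here. The function name `vmd_sketch`, the penalty `alpha`, the dual step `tau` and the uniform initialization of the centre frequencies are all assumptions made for illustration.

```python
import numpy as np

def vmd_sketch(signal, n_modes=2, alpha=2000.0, tau=0.0, tol=1e-7, max_iter=500):
    """Standard VMD via ADMM (illustrative; not the paper's VVMD variant)."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    freqs = np.fft.fftfreq(n)              # normalized frequency, cycles/sample
    pos = freqs >= 0                       # keep the non-negative half only
    f_hat = np.where(pos, np.fft.fft(x), 0.0)

    # Initialize mode spectra, centre frequencies and Lagrange multiplier
    u_hat = np.zeros((n_modes, n), dtype=complex)
    omega = 0.5 * np.arange(n_modes) / n_modes
    lam = np.zeros(n, dtype=complex)

    for _ in range(max_iter):
        u_prev = u_hat.copy()
        for k in range(n_modes):
            # Update mode k: Wiener-like filter of the residual around omega[k]
            others = u_hat.sum(axis=0) - u_hat[k]
            u_hat[k] = (f_hat - others + lam / 2) / (
                1 + 2 * alpha * (freqs - omega[k]) ** 2)
            u_hat[k][~pos] = 0.0
            # Update centre frequency: power-weighted mean frequency of mode k
            power = np.abs(u_hat[k]) ** 2
            omega[k] = np.sum(freqs * power) / (np.sum(power) + 1e-30)
        # Dual ascent on the reconstruction constraint (tau = 0 disables it)
        lam = lam + tau * (f_hat - u_hat.sum(axis=0))
        change = np.sum(np.abs(u_hat - u_prev) ** 2) / (
            np.sum(np.abs(u_prev) ** 2) + 1e-30)
        if change < tol:
            break

    # Back to real-valued modes: double the one-sided spectrum except DC
    spec = 2.0 * u_hat
    spec[:, 0] = u_hat[:, 0]
    modes = np.real(np.fft.ifft(spec, axis=1))
    return modes, omega

# Example: two tones at 0.05 and 0.2 cycles/sample
t = np.arange(1000)
x = np.cos(2 * np.pi * 0.05 * t) + np.cos(2 * np.pi * 0.2 * t)
modes, omega = vmd_sketch(x, n_modes=2)
print(omega)
```

The sketch omits the mirror extension of the signal that practical VMD implementations usually apply to reduce boundary effects, so it is best read as a description of the update order rather than as a drop-in tool.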
The NBCs in the frequency domain and their centre frequencies are given by,
Then, the mean and standard deviation from Eq. (4) are computed for estimating the centre frequencies
In Eqs (5) and (6),
Initialize
Update
Update
Dual ascent for all
In the restoration fidelity term, the centre frequencies
Results and discussion
In this section, the SWEMD-VVMDH technique is implemented in MATLAB 2014a, using its audio tools, a VVMD function and a Hurst exponent function, and its effectiveness is compared with that of the SWEMDH technique. In this experiment, the TIMIT database [19] is used as the source of speech signals, which were directly digitized at a sample rate of 20 kHz using a Digital Sound Corporation (DSC) 200 with the anti-aliasing filter at 10 kHz. The speech was then digitally filtered, debiased and downsampled to 16 kHz. From this database, a subset of 12 speakers (7 men and 5 women, 120 vocal signal segments in total) is chosen at random. Of the 10 signals available per speaker, eight are combined and used for training the speaker models, whereas the other two are kept for testing. Each of
SNR: It defines the ratio of the average power of the speech signal
It can be rewritten as:
In Eq. (8),
Mean Square Error (MSE): It defines the cumulative squared error between the restored and actual speech signal. It is computed as:
In Eq. (9),
Peak Signal-to-Noise Ratio (PSNR): It is the ratio of the peak signal power to the noise power. It is calculated as:
Mean Absolute Error (MAE): It defines the absolute deviation between the restored speech signal and the actual signal. It is calculated as:
Perceptual Evaluation of Speech Quality (PESQ): It is used for evaluating the end-to-end quality, characterizing the listening quality as perceived by users.
Weighted Spectral Slope Measure (WSSM): It is defined as the time-averaged WSSM where only the good frames are averaged. It measures the weighted differences of spectral slope over 25 critical frequency bands between the two corresponding signal frames.
In Eq. (13),
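The four signal-level measures above (SNR, MSE, PSNR and MAE) can be sketched in a few lines. The formulas below are the common textbook definitions and are an assumption: the normalization in the paper's own equations may differ, and the function name `speech_metrics` is hypothetical. PESQ and WSSM need a perceptual model and critical-band analysis and are not covered.

```python
import numpy as np

def speech_metrics(clean, restored):
    """Textbook SNR, MSE, PSNR and MAE between a clean and a restored signal.

    Assumes both signals have the same length and are not identical
    (a zero error would make the logarithms undefined).
    """
    clean = np.asarray(clean, dtype=float)
    restored = np.asarray(restored, dtype=float)
    error = clean - restored
    mse = np.mean(error ** 2)                     # mean squared error
    mae = np.mean(np.abs(error))                  # mean absolute error
    # Signal power over residual-error power, in dB
    snr_db = 10.0 * np.log10(np.mean(clean ** 2) / mse)
    # Peak power of the clean signal over the MSE, in dB
    psnr_db = 10.0 * np.log10(np.max(np.abs(clean)) ** 2 / mse)
    return {"MSE": mse, "MAE": mae, "SNR_dB": snr_db, "PSNR_dB": psnr_db}

# Example: a restored signal at half the amplitude of the clean one
m = speech_metrics([1.0, -1.0, 1.0, -1.0], [0.5, -0.5, 0.5, -0.5])
print(m)
```

Here the peak in PSNR is taken as the largest absolute sample of the clean signal; other conventions use the full-scale value of the sample format instead.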
Table 1. MSE comparison.
Figure 2. MSE comparison.
Table 1 gives the comparison results of MSE for EMDH, SWEMDH and VVMDH using different acoustic noises that corrupt the speech signal during transmission.
Figure 2 shows the MSE values for EMDH, SWEMDH and VVMDH under different acoustic noises. From this analysis, it is identified that the proposed VVMDH approach yields a lower MSE than the EMDH and SWEMDH approaches. For example, in the Babble noise environment with an SNR of 15 dB, the MSE of VVMDH is 81.01% lower than that of the EMDH approach and 72.78% lower than that of the SWEMDH approach. This is achieved by facilitating the backward error correction and by the ability to cope properly with noise.
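The text does not state how these percentages are computed. Assuming they are relative changes with respect to the baseline method, the reduction for a lower-is-better measure such as MSE would be:

```latex
\[
\Delta_{\%} \;=\; \frac{\mathrm{MSE}_{\text{baseline}} - \mathrm{MSE}_{\text{VVMDH}}}{\mathrm{MSE}_{\text{baseline}}} \times 100
\]
```

Under this assumption, a reduction of 81.01% means that the VVMDH error is 18.99% of the EMDH error, and a reduction of 72.78% means that it is 27.22% of the SWEMDH error.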
Table 2. MAE comparison.
Figure 3. MAE comparison.
Table 3. SNR comparison (in dB).
Figure 4. SNR comparison.
Table 2 gives the comparison results of MAE for EMDH, SWEMDH and VVMDH using different acoustic noises that corrupt the speech signal during transmission.
Figure 3 shows the MAE values for the existing EMDH and SWEMDH approaches and the proposed VVMDH approach under different acoustic noises. The proposed VVMDH approach achieves a lower MAE than the existing EMDH and SWEMDH approaches. For example, in the Babble noise environment with an SNR of 15 dB, the MAE of VVMDH is 64.29% lower than that of EMDH and 26.06% lower than that of SWEMDH. This is due to the proper trade-off for errors between the relevant bands and their corresponding modes, which are estimated simultaneously by the VVMDH approach.
Table 3 gives the comparison results of SNR for EMDH, SWEMDH and VVMDH using different acoustic noises that corrupt the speech signal during transmission.
Figure 4 shows the SNR values for EMDH, SWEMDH and VVMDH under different acoustic noises. From this analysis, it is identified that the proposed VVMDH approach yields a higher SNR than the EMDH and SWEMDH approaches. For the Babble noise environment with an SNR of 15 dB, the SNR of VVMDH is 4.132736 dB, which is higher than those of the EMDH and SWEMDH approaches, whose SNR values are 2.018528 dB and 2.632736 dB, respectively. This is because the narrow-band properties corresponding to the current estimate of the mode's center frequency are ensured with respect to the signal estimation residual of all other modes.
Table 4 gives the comparison results of PSNR for EMDH, SWEMDH and VVMDH using different acoustic noises that corrupt the speech signal during transmission.
Table 4. PSNR comparison (in dB).
Figure 5. PSNR comparison.
Table 5. PESQ comparison.
Figure 6. PESQ comparison.
Table 6. WSSM comparison.
Figure 7. WSSM comparison.
Figure 5 shows the PSNR values for the existing EMDH and SWEMDH approaches and the proposed VVMDH approach under different acoustic noises. The proposed VVMDH approach achieves a higher PSNR than the existing EMDH and SWEMDH approaches. For example, in the Babble noise environment with an SNR of 15 dB, the PSNR of the proposed VVMDH approach is 43.67% higher than that of the EMDH approach and 7.06% higher than that of the SWEMDH approach. This is because the VVMDH approach is more robust to sampling and noise than the EMDH and SWEMDH approaches.
Table 5 gives the comparison results of PESQ for EMDH, SWEMDH and VVMDH using different acoustic noises that corrupt the speech signal during transmission, and Fig. 6 shows its graphical representation. From this analysis, it is identified that the proposed VVMDH approach yields a higher PESQ than the EMDH and SWEMDH approaches. For example, in the Babble noise environment with an SNR of 15 dB, the PESQ of the proposed VVMDH approach is 9.12% higher than that of the EMDH approach and 3.71% higher than that of the SWEMDH approach. This is achieved by accurately tuning the center frequency of both low- and high-frequency harmonics, which are detected at acceptable quality and recovered almost without errors.
Table 6 gives the comparison results of WSSM for EMDH, SWEMDH and VVMDH using different acoustic noises that corrupt the speech signal during transmission.
Figure 7 shows the WSSM values for EMDH, SWEMDH and VVMDH under different acoustic noises. From this analysis, it is identified that the proposed VVMDH approach yields a higher WSSM than the EMDH and SWEMDH approaches. For the Babble noise environment with an SNR of 15 dB, the WSSM of VVMDH is 1.211347, which is higher than those of the EMDH and SWEMDH approaches, whose WSSM values are 0.390662 and 1.142055, respectively. Thus, it is concluded that the proposed VVMDH approach achieves better performance, i.e., higher PSNR, SNR, PESQ and WSSM with lower MSE and MAE, than the EMDH and SWEMDH approaches.
Conclusion
In this article, a SWEMD-VVMDH technique is proposed for enhancing the speech signal quality by eliminating both low- and high-frequency noisy factors in real-time applications. Initially, the SWEMDH method is applied to decompose the input signal into a number of IMFs based on the sliding window. Then, the Hurst exponent is computed to identify the IMFs with low-frequency noisy components, which are discarded from the input signal. After that, the VVMD technique is applied to the sum of all IMFs, considering the NBCs and their centre frequencies, to identify the high-frequency noisy components, which are also removed from the signal. Thus, the speech signal quality is enhanced by removing both non-stationary and white noise components efficiently. Finally, the experimental outcomes show that the VVMDH technique performs better than the SWEMDH technique for SE under both non-stationary and white noise backgrounds. The SWEMD-VVMDH technique provides better performance than the SWEMDH technique in terms of the MSE, MAE, WSSM, SNR, PSNR and PESQ measures, which correlate with speech quality and perceptual abilities.
