Abstract
In this paper, a two-step approach using Variational Mode Decomposition (VMD) and the ℓ1 trend filter is proposed for enhancing speech signals degraded by white Gaussian noise. In the first step, VMD decomposes the noisy speech signal into Intrinsic Mode Functions (IMFs) corresponding to different frequency components of the signal. In the second step, the ℓ1 trend filter retrieves the speech information by filtering out the noisy sub frames. The noisy sub frames are identified using a threshold based on the noise variance in the corresponding IMFs. The proposed method is tested on speech signals degraded by white Gaussian noise in the range 10 dB–30 dB. Its performance is compared with two well-known speech enhancement techniques, Spectral Subtraction (SS) and Minimum Mean Square Error (MMSE) estimation, using subjective and objective quality measures. The proposed two-step VMD-ℓ1 trend filter approach achieves better performance than the considered methods.
Introduction
Speech enhancement plays a major role in real-time systems involving human-machine interaction. It is the process of reducing background noise distortion in a speech signal while retaining the signal quality. The presence of noise affects both the intelligibility and the perceptual quality of the speech signal. This paper focuses on the enhancement of speech signals degraded by white Gaussian noise at varying noise levels. The random nature of the noise and the non-stationary nature of the speech signal make the task of speech enhancement more challenging. The noisy speech signal y can be expressed as
$$y(t) = s(t) + n(t)$$
where $s(t)$ is the clean speech signal and $n(t)$ is the additive noise.
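The additive-noise setting can be illustrated with a minimal sketch that corrupts a signal with white Gaussian noise at a chosen SNR. The function name `add_white_noise` and the toy sinusoid standing in for speech are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def add_white_noise(clean, snr_db, rng=None):
    """Return clean + white Gaussian noise scaled to the requested SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=float)
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, 1.0, size=clean.shape)
    # Rescale so that the realised noise power matches the target exactly.
    noise *= np.sqrt(noise_power / np.mean(noise ** 2))
    return clean + noise

# Toy stand-in for a speech signal (assumption): a 200 Hz tone sampled at 16 kHz.
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 200 * t)
noisy = add_white_noise(clean, snr_db=15, rng=np.random.default_rng(0))
```

Rescaling the noise after it is drawn makes the measured SNR equal to the requested value rather than only approximately equal, which is convenient when comparing methods at fixed noise levels.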
Speech enhancement methods are chosen according to the type of noise affecting the speech signal and the level of background distortion. In this paper, a two-step approach using Variational Mode Decomposition (VMD) and the ℓ1 trend filter is proposed to enhance speech signals affected by white Gaussian noise [1, 15]. VMD is a data decomposition method based on the calculus of variations that decomposes a signal into sub-signals known as Intrinsic Mode Functions (IMFs). Each IMF represents the signal information of a different spectral component, with the lower frequency bands occurring in the initial modes. In the first step of the proposed method, the noisy speech signal is decomposed into its IMFs using VMD. The resulting IMFs contain both speech and noise information, with the level of white Gaussian noise varying from one IMF to another. In the second step, the noisy IMFs are filtered separately using the ℓ1 trend filter to retrieve the speech information. The filtering is based on the noise variance obtained from the silence portion of the corresponding IMF, and the enhanced speech signal is obtained by combining the filtered IMFs. Objective quality measures are used to evaluate the reduction in signal distortion (Csig), the reduction in background intrusiveness due to noise (Cbak), the overall quality of the signal (Covl), the SNR over speech frames (SNRseg) and the Perceptual Evaluation of Speech Quality (PESQ) [14, 17]. The rest of the paper is structured as follows. Section 2 describes the proposed method along with the concepts behind VMD and the ℓ1 trend filter. Section 3 discusses the results obtained from the proposed method and compares them with other enhancement methods; it also tests the proposed method on speech signals degraded by color Gaussian noise. Section 4 presents the conclusion and future work.
The proposed method is a two-step approach that uses VMD and the ℓ1 trend filter to remove noise from the signal. The following steps are involved in enhancing a speech signal degraded by white Gaussian noise using the VMD-ℓ1 trend filter method.
1. Decompose the noisy speech signal y(t) into Intrinsic Mode Functions using Variational Mode Decomposition.
2. Apply the ℓ1 trend filter over the sub frames of each IMF to retrieve the speech information.
3. Obtain the enhanced speech signal by combining the ℓ1 trend filtered IMFs.
During decomposition of the noisy speech signal, most of the speech information is retained in the initial IMFs. In the second stage, filtering over sub frames with the ℓ1 trend filter reduces the background noise distortion to a minimum level while retaining the signal quality. The enhanced speech signal is then produced from the ℓ1 trend filtered IMFs.
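The overall structure of the steps above can be sketched as a small driver function. The callables `decompose` and `filter_imf` are hypothetical placeholders for the VMD stage and the per-IMF ℓ1 trend filtering stage; the trivial stand-ins in the usage lines only demonstrate the data flow and are not the actual decomposition or filter.

```python
import numpy as np

def enhance(noisy, decompose, filter_imf):
    """Two-step enhancement: decompose, filter each IMF, recombine.

    decompose(noisy)   -> array of shape (K, N), one IMF per row (hypothetical callable)
    filter_imf(imf, k) -> filtered IMF of length N (hypothetical callable)
    """
    imfs = np.asarray(decompose(noisy), dtype=float)
    filtered = np.stack([filter_imf(imf, k) for k, imf in enumerate(imfs)])
    # Step 3: the enhanced signal is the sum of the filtered IMFs.
    return filtered.sum(axis=0)

# Stand-ins for illustration only: split the signal into two scaled copies
# and pass every "IMF" through unchanged.
y = np.arange(8.0)
out = enhance(y, lambda s: np.stack([0.25 * s, 0.75 * s]), lambda imf, k: imf)
```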
The decomposition of a signal x using Variational Mode Decomposition results in a set of sub-signals $x_1, x_2, \ldots, x_k$ in the temporal domain [7, 11]. The sub-signals are Intrinsic Mode Functions with sparse bandwidth that together can reproduce the input signal. The IMFs represent the signal information corresponding to the different frequency components present in the original signal. The calculus of variations forms the basis of Variational Mode Decomposition: the objective is to find IMFs with sparse bandwidth that reproduce the input signal x. The optimization problem is defined as a bandwidth minimization and is solved using the Alternating Direction Method of Multipliers (ADMM). An IMF is defined as an amplitude- and frequency-modulated signal given by
$$x_k(t) = A_k(t)\cos\big(\phi_k(t)\big) \tag{5}$$
The signal $x_k(t)$ given in Equation 5 is purely harmonic, with amplitude $A_k(t)$ and instantaneous frequency $\phi_k'(t)$, the time derivative of its phase $\phi_k(t)$. It represents the intrinsic mode function corresponding to mode k and is centered around the frequency $\omega_k$ in the spectral domain. An IMF can be regarded as purely harmonic over a sufficiently long interval, and it has a limited bandwidth. The bandwidth of an IMF depends on two factors: the maximum deviation of the instantaneous frequency $\phi_k'(t)$ and its rate of change. The bandwidth of the intrinsic mode function is found by first creating an analytic signal using the Hilbert transform $\mathcal{H}$ of the mode function,
$$x_{k,A}(t) = x_k(t) + j\,\mathcal{H}\{x_k(t)\} \tag{6}$$
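Another way of representing the analytic signal $x_{k,A}(t)$ in Equation 6 is
$$x_{k,A}(t) = \left(\delta(t) + \frac{j}{\pi t}\right) * x_k(t) \tag{7}$$
where $\delta(t)$ is the Dirac delta and $*$ denotes convolution.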
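From the analytic signal (Equation 7), the bandwidth of the mode function $x_k(t)$ can be obtained by shifting its spectrum to baseband with the factor $e^{-j\omega_k t}$, taking the derivative with respect to time and then taking the squared L2 norm. Minimizing the sum of these bandwidths over all modes gives the constrained problem
$$\min_{\{x_k\},\{\omega_k\}} \; \sum_k \left\| \partial_t\!\left[\left(\delta(t) + \frac{j}{\pi t}\right) * x_k(t)\right] e^{-j\omega_k t} \right\|_2^2 \quad \text{subject to} \quad \sum_k x_k(t) = x(t)$$
The constraint ensures that the input can be recovered from the IMFs. Converting the constrained problem into an unconstrained one, using a quadratic penalty and a Lagrangian multiplier $\lambda$, gives
$$\mathcal{L}\big(\{x_k\},\{\omega_k\},\lambda\big) = \alpha \sum_k \left\| \partial_t\!\left[\left(\delta(t) + \frac{j}{\pi t}\right) * x_k(t)\right] e^{-j\omega_k t} \right\|_2^2 + \left\| x(t) - \sum_k x_k(t) \right\|_2^2 + \left\langle \lambda(t),\; x(t) - \sum_k x_k(t) \right\rangle$$
where $\alpha$ weights the bandwidth term.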
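The quadratic penalty (second term) enforces convergence and the Lagrangian multiplier enforces the constraint for input reproducibility. The above problem is solved using the ADMM algorithm, which breaks the problem into smaller sub-problems and solves them separately. Initially, the value of $x_k$ is estimated using the initial values of $\omega_k$ and $\lambda$. The update for the mode function $x_k$ is given, in the spectral domain, by
$$\hat{x}_k^{\,n+1}(\omega) = \frac{\hat{x}(\omega) - \sum_{i \neq k} \hat{x}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha\,(\omega - \omega_k)^2}$$
where the hat denotes the Fourier transform and $\alpha$ is the bandwidth penalty weight.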
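The update for $x_k$ is obtained by filtering the residual of the mode function using a Wiener filter with the signal prior $1/(\omega - \omega_k)^2$. In the next step, $\omega_k$ is estimated using the updated $x_k$ and the initial value of $\lambda$. The update equation for the center frequency $\omega_k$ is given by
$$\omega_k^{\,n+1} = \frac{\int_0^{\infty} \omega\,\big|\hat{x}_k(\omega)\big|^2\,d\omega}{\int_0^{\infty} \big|\hat{x}_k(\omega)\big|^2\,d\omega}$$
where $\hat{x}_k(\omega)$ is the Fourier transform of $x_k(t)$.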
The multiplier $\lambda$ is then estimated using the updated values of $x_k$ and $\omega_k$. The process is repeated until the convergence criterion is met.
The convergence is sensitive to the presence of noise in the input signal. For denoising, it is not necessary to recover the complete signal, and the noise can be reduced during decomposition. This is achieved by dropping the Lagrangian multiplier, which enforces input reproducibility, from the above process. Figure 1 depicts the decomposition of a clean speech signal into its Intrinsic Mode Functions, IMF1 to IMF5. It can be seen from the figure that the first mode, IMF1, corresponds to the low frequency information in the signal and that the frequency increases in IMF2 and so on. The VMD parameters (the total number of IMFs, alpha and tau) are varied during decomposition of the noisy speech signal so that the initial IMFs carry more speech information than the final IMFs, allowing maximum speech information to be retrieved during ℓ1 trend filtering.
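The alternating updates of the modes, center frequencies and multiplier described above can be sketched in the spectral domain as follows. This is a simplified illustration, not the implementation used in the paper. It assumes uniformly spaced initial center frequencies, applies no boundary mirroring, and uses the illustrative defaults `alpha=2000` and `tau=0`, where `tau=0` corresponds to dropping the Lagrangian multiplier update. Frequencies are expressed in cycles per sample.

```python
import numpy as np

def vmd(signal, K, alpha=2000.0, tau=0.0, n_iter=500, tol=1e-7):
    """Simplified VMD sketch: returns (modes, center_freqs), sorted by frequency."""
    x = np.asarray(signal, dtype=float)
    N = len(x)
    freqs = np.arange(N) / N                 # bin frequencies in cycles/sample
    half = N // 2
    # One-sided (analytic) spectrum: keep DC..Nyquist, double the interior bins.
    weight = np.zeros(N)
    weight[0] = 1.0
    weight[1:half] = 2.0
    weight[half] = 1.0 if N % 2 == 0 else 2.0
    x_hat = np.fft.fft(x) * weight

    u_hat = np.zeros((K, N), dtype=complex)  # spectra of the modes
    omega = 0.5 * np.arange(K) / K           # assumed uniform initial center frequencies
    lam = np.zeros(N, dtype=complex)         # spectrum of the Lagrangian multiplier
    pos = slice(0, half + 1)
    eps = 1e-12

    for _ in range(n_iter):
        u_prev = u_hat.copy()
        for k in range(K):
            # Residual after removing the other modes, Wiener-filtered around omega[k].
            residual = x_hat - u_hat.sum(axis=0) + u_hat[k]
            u_hat[k] = (residual + lam / 2.0) / (1.0 + 2.0 * alpha * (freqs - omega[k]) ** 2)
            # Center frequency = power-weighted mean frequency of the mode.
            power = np.abs(u_hat[k, pos]) ** 2
            omega[k] = np.sum(freqs[pos] * power) / (np.sum(power) + eps)
        # Multiplier update; with tau = 0 the multiplier stays at zero.
        lam = lam + tau * (x_hat - u_hat.sum(axis=0))
        change = np.sum(np.abs(u_hat - u_prev) ** 2) / (np.sum(np.abs(u_prev) ** 2) + eps)
        if change < tol:
            break

    modes = np.real(np.fft.ifft(u_hat, axis=1))  # real part of each analytic mode
    order = np.argsort(omega)
    return modes[order], omega[order]

# Toy input (assumption): three tones at 0.02, 0.1 and 0.3 cycles/sample.
n = np.arange(1000)
x = np.cos(2 * np.pi * 0.02 * n) + 0.8 * np.cos(2 * np.pi * 0.1 * n) + 0.6 * np.cos(2 * np.pi * 0.3 * n)
modes, centers = vmd(x, K=3)
```

On this toy input the returned center frequencies should settle near the three tone frequencies, with the lowest frequency in the first mode, mirroring the ordering of IMF1 to IMF5 described in the text.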

Fig. 1. Clean speech waveform (top) and its corresponding IMFs (bottom).

Fig. 2. (a) Clean speech waveform (top) and its IMFs (bottom). (b) Noisy speech waveform (top) and its IMFs (bottom). (c) Enhanced speech waveform using VMD-ℓ1 trend filter method (top) and its IMFs (bottom).
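The objective of the ℓ1 trend filter is to estimate the underlying trend of a signal that has two components, a slowly varying signal component and a rapidly varying noise component. For a signal degraded by noise, the ℓ1 trend filter estimates the underlying trend, or signal information, from the noisy input. The ℓ1 trend filter is similar to the Hodrick-Prescott (H-P) filter but penalizes the variation of the slowly varying trend using the ℓ1 norm. The ℓ1 trend filter is defined as
$$\hat{x} = \arg\min_{x} \; \frac{1}{2}\sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} \left| x_{t-1} - 2x_t + x_{t+1} \right|$$
where y is the input signal, x is the estimated trend and $\lambda \geq 0$ is the regularization parameter.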
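A minimal numerical sketch of the ℓ1 trend filter is given below. It assumes the standard objective 0.5·‖y − x‖² + λ‖Dx‖₁, with D the second-difference matrix, and solves it approximately with a generic ADMM iteration on dense matrices. The solver choice, the function names and the default `rho = lam` are assumptions for illustration; the paper does not state which solver was used. The helper `lambda_max` computes the value of λ at and above which the estimate becomes the best affine fit.

```python
import numpy as np

def second_difference_matrix(n):
    """(n-2) x n matrix D with rows [1, -2, 1]."""
    D = np.zeros((n - 2, n))
    idx = np.arange(n - 2)
    D[idx, idx] = 1.0
    D[idx, idx + 1] = -2.0
    D[idx, idx + 2] = 1.0
    return D

def lambda_max(y):
    """Smallest lambda for which the l1 trend estimate is the best affine fit."""
    y = np.asarray(y, dtype=float)
    D = second_difference_matrix(len(y))
    return float(np.max(np.abs(np.linalg.solve(D @ D.T, D @ y))))

def l1_trend_filter(y, lam, rho=None, n_iter=2000):
    """Approximately minimise 0.5*||y - x||^2 + lam*||D x||_1 using ADMM."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = second_difference_matrix(n)
    rho = lam if rho is None else rho        # assumed default penalty parameter
    A_inv = np.linalg.inv(np.eye(n) + rho * D.T @ D)
    z = np.zeros(n - 2)                      # auxiliary variable standing for D x
    u = np.zeros(n - 2)                      # scaled dual variable
    x = y.copy()
    for _ in range(n_iter):
        x = A_inv @ (y + rho * D.T @ (z - u))
        Dx = D @ x
        v = Dx + u
        z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # soft threshold
        u = u + Dx - z
    return x

# Toy input (assumption): a noisy triangular, piecewise linear signal.
rng = np.random.default_rng(0)
clean = np.concatenate([np.linspace(0.0, 1.0, 50), np.linspace(1.0, 0.0, 50)])
noisy = clean + 0.1 * rng.normal(size=clean.size)
trend = l1_trend_filter(noisy, lam=10.0)
```

Dense matrices keep the sketch short but scale poorly; for long signals a banded or sparse solver would be the more practical choice.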
The regularization parameter λ is calculated based on λmax, and the trend estimate depends entirely on it. As λ → 0, the filtered output converges to the original data, and as λ → ∞ it converges to an affine fit of the data, in other words the underlying linear trend in the signal. Hence the ℓ1 trend filter is best suited to piecewise linear signals. Since speech is a non-stationary signal, applying the ℓ1 trend filter directly cannot completely remove the noise without losing speech information. Noise also persists in the lower frequency bands of the signal, as shown in Fig. 3(c). Hence, in the proposed method, the ℓ1 trend filter is applied to the IMFs based on noise variance.
Each IMF is divided into sub frames, and each sub frame is categorized as either speech dominant or noise dominant. The noise variance is estimated from the initial silence region of the corresponding IMF, and a threshold is fixed to categorize the sub frames. If the sub frame variance is above the threshold, the sub frame is speech dominant and its signal information is retrieved using the ℓ1 trend filter; if it is below the threshold, the sub frame is noise dominant and is suppressed. The noise dominant frames represent the silence regions in the speech. The IMFs obtained from ℓ1 trend filtering are combined to give the enhanced output speech signal. The threshold value is chosen experimentally for each noise level based on the improvement in the quality measures of the enhanced output.
Fig. 4 compares the clean speech signal, the noisy speech signal (15 dB), the ℓ1 trend filtered signal and the enhanced speech signal obtained from the proposed method. From Fig. 4(a) and (c), it is clear that the filter does not completely eliminate the noise during trend estimation. However, it can be observed that the small variations in the clean speech are well recovered in the filtered output, thus retaining the quality of the speech signal.
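The variance-based categorization of sub frames can be sketched as follows. The frame length, the number of initial silence samples and the multiplicative `threshold_scale` are hypothetical parameters, since the paper only states that the threshold is chosen experimentally. The `trend_filter` argument is any callable that filters one frame; the identity function is used here purely as a placeholder for the ℓ1 trend filter.

```python
import numpy as np

def filter_imf_by_frames(imf, frame_len, n_silence, threshold_scale, trend_filter):
    """Filter speech-dominant sub frames and suppress noise-dominant ones.

    n_silence: number of initial samples assumed to contain no speech.
    threshold_scale: hypothetical tuning factor applied to the noise variance.
    trend_filter: callable applied to each speech-dominant frame.
    """
    imf = np.asarray(imf, dtype=float)
    noise_var = np.var(imf[:n_silence])      # noise variance from the silence region
    threshold = threshold_scale * noise_var
    out = np.zeros_like(imf)                 # noise-dominant frames stay at zero
    for start in range(0, len(imf), frame_len):
        frame = imf[start:start + frame_len]
        if np.var(frame) > threshold:        # speech-dominant frame
            out[start:start + frame_len] = trend_filter(frame)
    return out

# Toy IMF (assumption): low-level noise with a tone burst in samples 400..599.
rng = np.random.default_rng(0)
imf = 0.05 * rng.normal(size=800)
imf[400:600] += np.sin(2 * np.pi * 0.05 * np.arange(200))
out = filter_imf_by_frames(imf, frame_len=100, n_silence=200,
                           threshold_scale=3.0, trend_filter=lambda f: f)
```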

Fig. 3. (a) Clean speech waveform (top) and spectrogram (bottom). (b) Noisy speech signal for white Gaussian noise (15 dB) waveform (top) and spectrogram (bottom). (c) ℓ1 trend filtered waveform (top) and spectrogram (bottom). (d) Enhanced speech waveform obtained by VMD-ℓ1 trend filter (top) and spectrogram (bottom).

Fig. 4. Comparison plot of (a) Clean speech waveform. (b) Noisy speech waveform (15 dB). (c) ℓ1 trend filtered output of noisy speech signal and (d) Enhanced speech obtained from proposed method.
The speech signals used for testing the proposed method are taken from the cmu-arctic database [16]. White Gaussian noise is added to the clean speech signals from the database at different noise levels (10 dB–30 dB). The proposed method is compared with standard speech enhancement methods, namely Spectral Subtraction and Minimum Mean Square Error estimation, and also with the ℓ1 trend filter alone. One of the drawbacks of spectral subtraction is the presence of musical noise in the enhanced speech. The proposed method overcomes this by applying the enhancement to the noisy IMFs, in other words to the signal information corresponding to different frequency bands. The clean speech, noisy speech, enhanced speech and their respective IMFs are shown in Fig. 2(a), (b) and (c). From Fig. 2(b), it is clear that the noise level varies across the IMFs, with IMF1 having the most speech information and the least noise and IMF5 having the highest level of noise. The IMFs of the enhanced speech signal show maximum retrieval of speech information in all the modes. Similarly, the waveforms and corresponding spectrograms of the clean, noisy (15 dB), ℓ1 trend filtered and enhanced speech signals are given in Fig. 3(a), (b), (c) and (d) respectively. Comparing Fig. 3(a) and (c), it is clear that the noise is not completely removed by directly applying the ℓ1 trend filter; it is removed by applying the VMD-ℓ1 trend filter method. The performance of the enhanced speech signals is analysed using subjective and objective quality measures. The subjective quality measure is the Mean Opinion Score (MOS) test.
Objective quality measures
The objective quality measures estimated are Csig (signal quality), Cbak (background distortion), Covl (overall quality), segmental Signal to Noise Ratio (SNRseg) and PESQ (Perceptual Evaluation of Speech Quality). One of the two criteria in speech enhancement is noise removal, which can be analysed using SNRseg and Cbak. The segmental SNR of the noisy speech signal and of the outputs of the speech enhancement techniques is given in Table 1 and Fig. 5(a). From the figure, it is clear that the proposed method improves on the other methods. Even though the ℓ1 trend filter captures the underlying trend in a signal, some noise in the lower frequency bands still persists, as shown in Fig. 3(c). This signifies the importance of VMD, which decomposes the signal into IMFs of varying center frequencies so that the speech structure is captured in the IMFs with less noise. The noise is then filtered out of each IMF separately using the ℓ1 trend filter. The Cbak of the noisy speech signal and of the outputs of the enhancement techniques is given in Table 3 and Fig. 6(a). It is clear from the plot that the proposed method performs better than all the above mentioned methods, including ℓ1 trend filtering. The second and equally important criterion for a speech enhancement technique is that the speech quality should be retained during the process, which can be analysed using the signal quality measures Csig and Covl and the PESQ score. The proposed method shows a good improvement in speech signal quality (Csig) and overall quality (Covl) compared with the other methods. The Csig and Covl for the ℓ1 trend filter closely follow those of the proposed method, which shows the effect of the ℓ1 trend filter in extracting the speech signal structure from the IMFs affected by white Gaussian noise. The signal quality measures Csig and Covl are given in Tables 2 and 4, and they are compared in Figs. 5(b) and 6(b) respectively.
The Covl of the enhanced speech from the proposed method is better than that of the ℓ1 trend filtered output. The intelligibility of the enhanced signal can be analysed using the PESQ measure, and the corresponding comparison is made in Table 5 and Fig. 7. The PESQ of the proposed method is better than that of all the other methods, including ℓ1 trend filtering. Based on these results, the proposed method for speech enhancement exhibits better performance than the standard methods considered. The results also show that the ℓ1 trend filter extracts the speech information without much data loss when the filter operation is controlled by the noise variance. However, this has a limitation: the voiced portion can be easily retrieved, but the unvoiced portion, due to its random nature, is not completely retrieved at higher noise levels.
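As an illustration of the noise-removal measure discussed above, segmental SNR can be sketched as the mean of per-frame SNR values. The frame length of 256 samples and the clamping of each frame to the range −10 dB to 35 dB are commonly used conventions assumed here; the paper does not state the settings it used.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, min_db=-10.0, max_db=35.0):
    """Mean per-frame SNR in dB, with each frame clamped to [min_db, max_db]."""
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    eps = 1e-12                              # guards against division by zero
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = s - enhanced[start:start + frame_len]
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        values.append(np.clip(snr, min_db, max_db))
    return float(np.mean(values))

# Toy check (assumption): an error equal to 10% of the clean amplitude gives 20 dB.
t = np.arange(4096)
clean = np.sin(2 * np.pi * 0.01 * t)
score = segmental_snr(clean, 1.1 * clean)
```

Clamping prevents silent or perfectly reconstructed frames from dominating the average.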

Fig. 5. Comparison of segmental SNR (a) and Csig (b) of enhanced speech obtained using Spectral subtraction, MMSE and also the ℓ1 trend filter with the proposed method at different noise levels.

Fig. 6. Comparison of Cbak (a) and Covl (b) of enhanced speech obtained using Spectral subtraction, MMSE and also the ℓ1 trend filter with the proposed method at different noise levels.

Fig. 7. Comparison of PESQ of enhanced speech obtained using Spectral subtraction, MMSE and also the ℓ1 trend filter with the proposed method at different noise levels.
Table 1. Comparison of the SNRseg of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of white Gaussian noise
Table 2. Comparison of the Csig of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of white Gaussian noise
Table 3. Comparison of the Cbak of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of white Gaussian noise
Table 4. Comparison of the Covl of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of white Gaussian noise
Table 5. Comparison of the PESQ of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of white Gaussian noise
The subjective evaluation of the proposed method is conducted using the MOS test, since the perceptual quality of the enhanced signal is important in speech enhancement. In this test, eight speech signals from the database are taken at SNR values of 10, 15, 20, 25 and 30 dB. The enhanced speech signals are scored with reference to the clean and noisy speech signals. Ten people aged between 20 and 30, with no history of hearing disorders, participated in the test. The participants listened to the speech signals distorted by varying levels of background noise and to the enhanced signals obtained from the spectral subtraction, MMSE, ℓ1 trend filter and VMD-ℓ1 trend filter methods. Based on the improvement over the noisy signal, the participants gave a score on the 5-point MOS scale. The score reflects the overall quality of the signal, taking into account the reduction in the distortion of speech quality caused by noise contamination. The resulting scores are averaged, and the comparison with the other techniques is shown in Fig. 8. The VMD-ℓ1 trend filter method achieves better performance in enhancing speech signals distorted by white Gaussian noise than the spectral subtraction, MMSE and ℓ1 trend filter methods.

Fig. 8. MOS comparison of the enhanced speech signal obtained using Spectral Subtraction (SS), Minimum Mean Square Error (MMSE) estimation, ℓ1 trend filter and the proposed VMD-ℓ1 trend filter method.
In addition to white Gaussian noise, the performance of the VMD-ℓ1 trend filter method on speech signals degraded by color noise is analysed [13]. Unlike white noise, the degradation of the speech spectrum by color noise varies with frequency. The experiments are conducted on speech signals degraded by blue noise in the range of 10–30 dB for part of the cmu-arctic database. For blue noise, the power spectral density grows linearly with frequency. The objective quality measures Csig, Cbak, Covl, segmental SNR and PESQ obtained for the VMD-ℓ1 trend filter method are compared with those of the spectral subtraction, MMSE and ℓ1 trend filter methods. From the analysis of segmental SNR and Cbak given in Tables 6 and 8, it is clear that the VMD-ℓ1 trend filter method reduces the level of background distortion in the speech signal. The retrieval of speech information from the noisy speech signal is essential in retaining the quality of the speech signal. The comparison of the quality measures Csig and Covl in Tables 7 and 9 shows that the VMD-ℓ1 trend filter method retrieves the speech signal effectively compared with the other techniques. The PESQ of the enhanced speech obtained by the VMD-ℓ1 trend filter method, given in Table 10, shows comparable performance to the other techniques.
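Blue noise with the property stated above can be generated by shaping the spectrum of white Gaussian noise, as in the following sketch. The function name `blue_noise` and the normalization to unit variance are assumptions for illustration; the paper does not describe how its blue noise was produced.

```python
import numpy as np

def blue_noise(n, rng=None):
    """Gaussian noise whose power spectral density grows linearly with frequency."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.normal(size=n)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n)               # 0 .. 0.5 cycles/sample
    spectrum *= np.sqrt(freqs)               # amplitude ~ sqrt(f), so power ~ f
    noise = np.fft.irfft(spectrum, n=n)
    return noise / np.std(noise)             # normalise to unit variance

noise = blue_noise(16384, np.random.default_rng(0))

# Average power in the upper half of the band relative to the lower half;
# for a PSD proportional to f the expected ratio is about 3.
psd = np.abs(np.fft.rfft(noise)) ** 2
mid = len(psd) // 2
ratio = psd[mid:].mean() / psd[1:mid].mean()
```

The unit-variance output can then be scaled to a target SNR in the same way as white noise before it is added to the clean speech.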
Table 6. Comparison of the SNRseg of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of color noise
Table 7. Comparison of the Csig of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of color noise
Table 8. Comparison of the Cbak of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of color noise
Table 9. Comparison of the Covl of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of color noise
Table 10. Comparison of the PESQ of enhanced speech obtained using Spectral subtraction, MMSE, ℓ1 trend filter and the proposed method at different levels of color noise
Speech enhancement is one of the most researched areas due to its relevance in real-time speech recognition systems. The proposed VMD-ℓ1 trend filter method is a two-step approach to enhancing speech signals degraded by white Gaussian noise. In the first step, the speech signal is decomposed into IMFs, with more speech information retained in the initial modes. In the second step, the ℓ1 trend filter retrieves the speech information by filtering out the noise sub frames in each IMF. The experiments were conducted on speech signals degraded by both white and color Gaussian noise with SNRs from 10 dB to 30 dB. The performance was evaluated using subjective and objective quality measures, on which the VMD-ℓ1 trend filter achieves better performance than the other methods. In the VMD-ℓ1 trend filter method, VMD retrieves the speech signal structure from the noisy speech signal in the IMFs, and the noise in the non-speech regions is filtered out by the ℓ1 trend filtering. The work can be extended to other noise types such as factory noise, car noise and train noise.
Acknowledgments
The author would like to thank Mr. Sachin Kumar, Research Scholar, Centre for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, for his help in understanding the Variational Mode Decomposition concept.
