Abstract
In an increasingly digital world, users demand signals of high quality and accuracy. In practice, almost all signals are analog in nature and contaminated with noise. This paper considers the speech signal, which varies from person to person and from time to time and must be enhanced for applications in engineering, medicine and social settings. Enhanced versions of a signal can be produced by reducing its noise and redundant data. As speech is nonstationary, it is first preprocessed and normalized. The spectral domain is the most suitable for analyzing speech and is used here, with the Discrete Cosine Transform (DCT-II) as the transform. Because DCT-II has advantages over other transforms and is simpler to compute, its coefficients are used as input to a Deep Neural Network (DNN) model that reduces the noise and enhances the signal, so that signals from any environment, in any amount, can be enhanced with this model. A total of 100 sentences were collected from 5 male and 5 female speakers, each of whom uttered 10 sentences. Although many researchers have applied DCT-II and DNNs to signal feature extraction and image classification, using them for speech enhancement is the novelty of this work. The results are better than those of earlier methods, and the approach can be used in real-time applications. The results section presents a visual inspection along with comparison values, and the quality measures show the efficacy of the method.
Introduction
Speech enhancement in noisy environments has been a desirable and challenging research problem for several decades. Speech enhancement algorithms improve the quality and increase the signal-to-noise ratio (SNR) of the signal. They have been applied in a wide range of areas, such as mobile telephony, video conferencing, speech recognition and hearing aid design. Different algorithms have been proposed to provide satisfactory performance [1].
Spectral subtraction is a well-known algorithm that estimates the clean signal spectrum from the noisy spectrum by suppressing noise components in the frequency domain [2]. Several modifications of this algorithm have been made to obtain an enhanced speech signal. Because speech is time varying, several adaptive algorithms have been designed for speech enhancement: Least Mean Squares (LMS), Normalized LMS, Recursive Least Squares (RLS) and State Space RLS (SSRLS) have all been applied to noisy speech [3, 4]. Various methods are available for estimating the coefficients of the clean speech signal. The Wiener filter is one of them and is applied to minimize the Mean Square Error (MSE) between the input and the enhanced speech [5, 6].
For a nonstationary signal, it is difficult to determine the linear filter that minimizes the MSE. The resulting linear filters are normally time variant, so they cannot be implemented using Fourier Transform (FT) based techniques. The FT is better suited to the analysis and processing of time-invariant signals. For time-varying signals, the Discrete Cosine Transform (DCT) estimates the signal with a smaller MSE due to its higher upper bound [7, 8]. Ram et al. designed a fractional DCT filter [9] for speech enhancement. Similarly, the Discrete Wavelet Transform (DWT) is used for signal analysis and processing, and can also be utilized for signal denoising and recognition [10].
In recent years, machine learning approaches have attracted more attention in the field of speech analysis. Neural networks provide a new direction in speech recognition and synthesis [11, 12]. The Adaptive Linear Neuron (ADALINE) is utilized for noise cancellation of the speech signal; following its learning principle, the weights and bias of the network are updated by the Least Mean Squares (LMS), or Widrow-Hoff, rule [13]. Convolutional Neural Networks (CNN) and Artificial Neural Networks (ANN) are also used for enhancing reverberated speech. In big data analysis, the convergence rate decreases as the number of hidden layers increases; in addition, the time consumption and the cost of computing the error are higher [14–17]. To avoid these problems, Neural Networks (NN) with random weights [18] and an NN based particle filter [19] have been proposed. The Fractional DCT (FrDCT) coefficients are real and have a binary phase value that depends only on the sign of the coefficient, which gives the signal a better noise margin. FrDCT-ADALINE [20] and WT-NN [21] have been applied to improve the quality of the speech signal.
Deep learning based enhancement methods learn a mapping function that reconstructs the clean speech signal from the noisy input signal. When a sufficient amount of training data is available, these methods are more capable and effective [22]. A regression based Deep Neural Network (DNN) was proposed by Y. Xu et al. [23, 24]. For enhancing real-time signals, the DNN provides better results in terms of SNR and Segmental SNR (SegSNR). A recurrent neural network based Kalman filter [25], Improved Least Mean Square Adaptive Filtering (ILMSAF) [26] and DNN based linear predictive parameter estimation [27] are some effective examples. The DNN also improves the intelligibility of speech signals, which may help cochlear implant users [28]. Instead of a plain feed-forward NN, skip connections can be added so that the DNN learns the ideal ratio mask between the inputs and outputs [29].
The paper is organized as follows: Section 1 introduces the work. Section 2 presents the DCT-II based DNN (DCT-DNN) for speech enhancement. Section 3 discusses the results for different speech signals. Finally, Section 4 concludes the work.
Materials and methods
Experimental setup
In this work, utterances were recorded from 5 male and 5 female speakers, each of whom uttered 10 sentences, giving a total of 100 sentences. Five types of noise, namely Street, PC fan, Water cooler, Babble and Drilling, are taken from the softcopy accompanying Speech Enhancement by Loizou. All 100 utterances are degraded with each of these noise types at four SNR levels: 0 dB, 5 dB, 10 dB and 15 dB. All the noisy signals and one condition of clean signals are used to build the training set. A total of 3900 frames are used to train the DNN model. Two further noise types, Train and Airport, are used to evaluate the mismatch condition in the testing phase. All experiments are performed in the MATLAB environment.
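Degrading an utterance at a prescribed SNR amounts to scaling the noise before adding it. The following is a minimal sketch of that step, written in Python with NumPy rather than the MATLAB environment used for the experiments. The function names, the synthetic stand-in signals and the choice to repeat a short noise recording until it covers the utterance are assumptions made for illustration, not details taken from the experimental setup.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    clean = np.asarray(clean, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so it covers the whole utterance (assumption).
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db / 10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

def measured_snr_db(clean, noisy):
    """SNR in dB of `noisy` with respect to the known clean signal."""
    residual = noisy - clean
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(residual ** 2))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000)  # stand-in for one utterance
noise = rng.standard_normal(3000)                         # stand-in for one noise recording
snr_levels = [0, 5, 10, 15]
noisy = {snr: mix_at_snr(clean, noise, snr) for snr in snr_levels}

# 100 utterances x 5 noise types x 4 SNR levels
n_noisy_utterances = 100 * 5 * len(snr_levels)
```

Under this reading of the setup, degrading every utterance with every noise type at every level gives 2000 noisy signals.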
Transform domain analysis
Speech is a quasi-stationary signal, and the FT alone is not sufficient to analyze it. Although the FT is a simple tool, it cannot decompose a signal jointly in time and frequency with a scaling property. The DWT is an alternative choice, but it introduces ambiguity. To overcome these drawbacks, DCT-II is preferred. In DCT-II the information is divided into N × N blocks, whereas the DWT is based on approximation and detail coefficients. In terms of quality the two transforms are equal, but DCT-II gives better performance, and its coefficients can be used directly as features. The Discrete Fourier Transform (DFT) is a complex transform that encodes both magnitude and phase, whereas DCT-II relies only on even symmetry. The inherent periodicity of the DFT causes boundary discontinuities, while the higher energy compaction of DCT-II makes it more advantageous than the DFT. Hence DCT-II is chosen as the better transform for representing the voice signal to be analyzed. The schematic diagram in Fig. 1 represents the basic idea behind DCT-II.
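The energy compaction argument can be illustrated on a toy frame. The sketch below, in Python with NumPy, counts how many of the largest coefficients are needed to hold 99% of the energy of a ramp, a signal whose two ends do not match and which therefore shows the boundary discontinuity of the DFT. The ramp, the 99% threshold and the helper names are assumptions chosen for illustration; the comparison says nothing about the speech data used in this work.

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal DCT-II of a 1-D signal, written from the definition."""
    n = len(x)
    t = np.arange(n)
    m = t.reshape(-1, 1)
    basis = np.cos(np.pi * (2 * t + 1) * m / (2 * n))  # rows indexed by m
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

def coeffs_for_energy(coeffs, fraction=0.99):
    """Smallest number of largest-magnitude coefficients holding `fraction` of the energy."""
    energy = np.sort(np.abs(coeffs) ** 2)[::-1]
    cumulative = np.cumsum(energy) / np.sum(energy)
    return int(np.searchsorted(cumulative, fraction) + 1)

frame = np.linspace(0.0, 1.0, 256)              # a ramp: first and last samples differ
dft = np.fft.fft(frame) / np.sqrt(len(frame))   # unitary scaling, comparable to the DCT
n_dft = coeffs_for_energy(dft)
n_dct = coeffs_for_energy(dct2_ortho(frame))
print(n_dft, n_dct)
```

For this frame the DCT-II needs fewer coefficients than the DFT to reach the same energy fraction, which is the behaviour the paragraph above refers to.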

Steps in discrete cosine transform.
For signal analysis, the speech signals are sampled at 8 kHz. The frame duration is 32 ms, with a 256-point Hamming window and a frame shift of 50%. Instead of the DFT, the DCT-II of each overlapping windowed frame is calculated. A total of 128 DCT-II samples are used to train the DNN. Different noise signals are added to the recorded speech signals, and the resulting noisy signals are used to test the results. One noisy speech sample is used to show the frequency distributions of the DFT and DCT-II. Figure 2 shows a speech signal after framing. The DFT and DCT-II are applied to the signal of each frame, and Figures 3 and 4 show the coefficients of the DFT and DCT-II respectively.

A framed speech signal.

Discrete Fourier Transform of a framed speech signal.

Discrete Cosine Transform of a framed speech signal.
For a signal $x(t)$ of length $x_l$, the DCT-II is defined as
$$X(m) = w(m) \sum_{t=0}^{x_l - 1} x(t) \cos\left[\frac{\pi (2t + 1) m}{2 x_l}\right]$$
where $m = 0, 1, 2, \ldots, x_l - 1$, and
$$w(m) = \begin{cases} \sqrt{1 / x_l}, & m = 0 \\ \sqrt{2 / x_l}, & 1 \le m \le x_l - 1 \end{cases}$$
The inverse DCT is defined as
$$x(t) = \sum_{m=0}^{x_l - 1} w(m)\, X(m) \cos\left[\frac{\pi (2t + 1) m}{2 x_l}\right]$$
where $t = 0, 1, 2, \ldots, x_l - 1$.
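The framing and transform steps can be sketched as follows, in Python with NumPy. The sketch uses the framing parameters stated above (8 kHz sampling, 256-point Hamming window, 50% frame shift) and the orthonormal form of the DCT-II. The random stand-in signal, the helper names and the choice to discard an incomplete final frame are assumptions made for illustration.

```python
import numpy as np

FS = 8000              # sampling rate in Hz
FRAME_LEN = 256        # 32 ms at 8 kHz
HOP = FRAME_LEN // 2   # 50% frame shift

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that X = C @ x and x = C.T @ X."""
    t = np.arange(n)
    m = t.reshape(-1, 1)
    c = np.cos(np.pi * (2 * t + 1) * m / (2 * n))
    c *= np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)   # the m = 0 row uses the smaller weight
    return c

def frame_signal(x, frame_len=FRAME_LEN, hop=HOP):
    """Split x into overlapping Hamming-windowed frames, one frame per row.

    Samples that do not fill a complete final frame are discarded (assumption).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])

rng = np.random.default_rng(0)
signal = rng.standard_normal(FS)   # one second of stand-in audio
frames = frame_signal(signal)
C = dct_matrix(FRAME_LEN)
coeffs = frames @ C.T              # DCT-II of every frame
restored = coeffs @ C              # inverse DCT of every frame
```

Because the matrix is orthonormal, the inverse transform is its transpose and the energy of each frame is preserved in its coefficients.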
After this analysis, DCT-II was found to be more suitable, and its coefficients are retained for further use. These coefficients are fed to the DNN for analysis and enhancement, as explained in the following subsection.
The structure of the DNN based speech enhancement method is shown in Fig. 5. The method is divided into two stages, a training stage and a testing stage. A set of noisy and clean speech samples is prepared in the training stage.

Structure of Deep Neural Network.
A feed-forward architecture is adopted for the DNN (Fig. 6). Three hidden layers with 1024 hidden units each and 32 output units are considered, so a total of 3072 (= 1024 × 3) hidden units are trained on the noisy features. Both the noisy and the clean speech features are nonlinear, so the clean speech features are estimated using nonlinear activation functions. Rectified Linear Units (ReLUs) are the activation functions of the hidden units, and the sigmoid function is used for the output units. The objective of DNN based supervised learning is to determine the mapping from noisy features to clean features. For an input noisy sentence, the clean features are approximated frame by frame from the output of the trained DNN. The speech features can then be transformed back to a time domain signal. Network pretraining and regularization are applied to improve the system.
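A forward pass through a network of this shape can be sketched as follows, in Python with NumPy. The layer widths (three hidden layers of 1024 units, 32 outputs) and the activation functions follow the description above. The input size of 128, the random initialization and the helper names are assumptions made for illustration, and no training is performed.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_network(layer_sizes, rng):
    """Small random weights and zero biases for consecutive layer sizes."""
    return [(0.01 * rng.standard_normal((n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """ReLU on every hidden layer, sigmoid on the output layer."""
    h = x
    for w, b in params[:-1]:
        h = relu(h @ w + b)
    w, b = params[-1]
    return sigmoid(h @ w + b)

rng = np.random.default_rng(0)
N_INPUT = 128                                   # assumed input size (hypothetical)
layer_sizes = [N_INPUT, 1024, 1024, 1024, 32]   # 3 hidden layers, 32 output units
params = init_network(layer_sizes, rng)
batch = rng.standard_normal((4, N_INPUT))       # four stand-in feature vectors
output = forward(params, batch)
n_hidden_units = sum(layer_sizes[1:-1])         # 1024 * 3
```

The sigmoid output layer keeps every output strictly between 0 and 1, which is what allows the output to be read as a mask.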

Feed Forward Speech Enhancement frameworks with 3 hidden layers.
After training the DNN, the ideal mask is estimated for the test signal by propagating its feature representation for all frames through the network. The Train and Airport noises and the clean signals are used for testing. The output of the DNN is interpreted as the estimated mask for the input frame. The mask is then applied to the DCT-II feature vector of the noisy speech signal, multiplying all 256 noisy signal samples of a frame. All DCT-II feature vectors are then concatenated, the overlapping frames are added, and the frames are finally synthesized into a time domain signal.
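The masking and synthesis steps can be sketched as follows, in Python with NumPy. Each windowed frame is transformed with the DCT-II, multiplied element-wise by a mask, transformed back and overlap-added. The `mask_fn` argument is a hypothetical stand-in for the trained DNN, and dividing by the summed window is an assumption made here so that an all-ones mask returns the input unchanged; the synthesis used in this work may differ.

```python
import numpy as np

FRAME_LEN, HOP = 256, 128

def dct_matrix(n):
    """Orthonormal DCT-II matrix C, so that X = C @ x and x = C.T @ X."""
    t = np.arange(n)
    m = t.reshape(-1, 1)
    c = np.cos(np.pi * (2 * t + 1) * m / (2 * n)) * np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)
    return c

def enhance(noisy, mask_fn):
    """Mask the DCT-II coefficients of each frame and overlap-add the result.

    mask_fn maps a vector of 256 coefficients to a vector of 256 gains and
    stands in for the trained network (hypothetical).
    """
    window = np.hamming(FRAME_LEN)
    C = dct_matrix(FRAME_LEN)
    n_frames = 1 + (len(noisy) - FRAME_LEN) // HOP
    out = np.zeros(len(noisy))
    window_sum = np.zeros(len(noisy))
    for i in range(n_frames):
        start = i * HOP
        frame = noisy[start:start + FRAME_LEN] * window
        coeffs = C @ frame
        masked = mask_fn(coeffs) * coeffs            # element-wise mask
        out[start:start + FRAME_LEN] += C.T @ masked  # inverse DCT, overlap-add
        window_sum[start:start + FRAME_LEN] += window
    covered = window_sum > 0       # samples not covered by any frame stay zero
    out[covered] /= window_sum[covered]
    return out

rng = np.random.default_rng(0)
noisy = rng.standard_normal(1024)
identity_mask = lambda coeffs: np.ones_like(coeffs)
restored = enhance(noisy, identity_mask)
```

With the all-ones mask the output equals the input, which is a convenient check that the analysis and synthesis chain is lossless before a learned mask is plugged in.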
The DCT-II is applied to all the noisy signals, and the resulting coefficients are fed to the DNN to generate the discrete cosine features of the enhanced signal. The noisy phase is not considered for training. The block diagram of the proposed method is shown in Fig. 7. To train the DNN on the noisy spectra, multiple restricted Boltzmann machines (RBMs) are stacked [30, 31]. The input and output features are obtained from the noisy and clean DCT-II spectra. For a window length of 256, DCT-II produces 256 independent spectral components. Because the DCT-II coefficients are real, the binary phase values depend only on the sign of the coefficients.

Proposed Block Diagram of DNN for Speech Enhancement.
Like other transforms, DCT-II decorrelates the signal data. After decorrelation, each transform coefficient can be encoded independently without losing compression efficiency. The input vector of noisy speech is $X_k = [X_{f,k-\tau}, \ldots, X_{f,k}, \ldots, X_{f,k+\tau}]'$ and the output vector of clean speech is $Y_k = [Y_{f,k-\tau}, \ldots, Y_{f,k}, \ldots, Y_{f,k+\tau}]'$.
Here k is the frame index, f is the frequency bin, τ is the window length and L is the number of hidden layers of the network.
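Building a vector that spans frames k - τ to k + τ can be sketched as follows, in Python with NumPy. Here τ is read as the number of neighbouring frames taken on each side of frame k, which is how it appears in the vector definition above. Padding the edges by repeating the first and last frame, and the function name, are assumptions made for illustration.

```python
import numpy as np

def stack_context(features, tau):
    """Concatenate each frame with its tau neighbours on both sides.

    features: array of shape (n_frames, n_bins).
    Returns an array of shape (n_frames, (2 * tau + 1) * n_bins).
    Edge frames are padded by repeating the first or last frame (assumption).
    """
    padded = np.pad(features, ((tau, tau), (0, 0)), mode="edge")
    n_frames = features.shape[0]
    return np.hstack([padded[i:i + n_frames] for i in range(2 * tau + 1)])

features = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 bins each
stacked = stack_context(features, tau=1)
```

In this toy case each row of `stacked` holds the previous, current and next frame side by side, so the input dimension grows by a factor of 2τ + 1.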
For L hidden layers, the DNN model is expressed with the nonlinear function $\sigma(\cdot)$ and can be represented as
$$\hat{Y}_k = \sigma\left(W_J\, \sigma\left(\cdots \sigma\left(W_1 X_k + b_1\right) \cdots\right) + b_J\right)$$
where $(W_1 \ldots W_J)$ and $(b_1 \ldots b_J)$ are the weighting matrices and bias vectors respectively. The vector $\hat{Y}_k$ carries the features of the restored speech corresponding to the noisy counterpart $X_k$. The sigmoid function is used here as the nonlinear function, and is expressed as
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The objective function I of the DNN model is optimized as shown below, with the parameter set $(W_1 \ldots W_J;\ b_1 \ldots b_J)$:
$$I = \frac{1}{K} \sum_{k=1}^{K} \left\| \hat{Y}_k - Y_k \right\|_2^2 + \delta \sum_{j=1}^{J} \left\| W_j \right\|_2^2$$
where K is the total number of training samples and δ is the regularization term. In the DNN configuration, three hidden layers are considered with 39 frame expansions, and each hidden layer has 1024 hidden units.
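One common form of such an objective, the mean squared error over the K training samples plus a weight penalty scaled by δ, can be evaluated as in the sketch below, in Python with NumPy. The squared L2 penalty on the weight matrices and the small hand-made arrays are assumptions made for illustration; the exact penalty used in this work is not recoverable from the text.

```python
import numpy as np

def objective(predicted, target, weights, delta):
    """Mean squared error over K samples plus an L2 penalty on the weights."""
    k = predicted.shape[0]                        # number of training samples
    mse = np.sum((predicted - target) ** 2) / k
    penalty = delta * sum(np.sum(w ** 2) for w in weights)
    return mse + penalty

predicted = np.array([[0.5, 0.5], [1.0, 0.0]])    # stand-in network outputs
target = np.array([[0.0, 0.5], [1.0, 1.0]])       # stand-in clean features
weights = [np.array([[1.0, -1.0]]), np.array([[2.0]])]
value = objective(predicted, target, weights, delta=0.1)
```

A larger δ pulls the weights towards zero at the cost of a looser fit to the clean features.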
The DNN consists of 3 hidden layers with 1024 nodes each, so a total of 3072 (= 1024 × 3) hidden units are used to test the results. Another 50 arbitrarily chosen utterances from the database are taken to build the test set for every combination of noise type and noise level. Two further noise types, Train and Airport, are used to evaluate the mismatch condition.
For each RBM layer, the number of epochs is set to 30. The learning rate is 0.005 for pre-training and 0.1 for fine tuning. 80 epochs are run with a mini-batch size of 100. All the input features of the DNN are normalized. Fig. 8 shows the clean signal ‘I am going to college’ and Fig. 9 shows the noisy signal (Drilling noise). One test result (the enhanced signal) is shown in Fig. 10. Due to space limitations, not all outputs are presented here. 20% of the database signals are tested with the proposed method and the results are reported.

Recorded speech signal ‘I am going to college’.

Noisy speech signal (recorded speech + Drilling noise).

Enhanced signal after DCT-DNN method.
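Two objective measures are considered for quality assessment.
Perceptual Evaluation of Speech Quality (PESQ)
It is a standard objective technique for testing speech quality. It predicts the rating that a panel of listeners would give the voice quality, on a scale from 1 (poor) to 5 (excellent).
Segmental SNR (SNRSeg)
This parameter is also used to measure the sound quality of a speech signal. For a frame length of L, the SNRSeg is calculated as
$$\mathrm{SNRSeg} = \frac{10}{n} \sum_{i=0}^{n-1} \log_{10} \frac{\sum_{j=Li}^{Li+L-1} s^2(j)}{\sum_{j=Li}^{Li+L-1} \left[ s(j) - \hat{s}(j) \right]^2}$$
where n is the number of active speech frames, $s(j)$ is the clean signal and $\hat{s}(j)$ is the enhanced signal.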
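The segmental SNR can be computed frame by frame as in the sketch below, in Python with NumPy. The per-frame SNR is averaged in dB over non-overlapping frames. Treating a frame as inactive when its clean energy falls below a small threshold, the threshold value and the toy signals are assumptions made for illustration; practical implementations often use a voice activity detector instead.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, eps=1e-12):
    """Mean of the per-frame SNRs in dB over non-overlapping active frames."""
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        e = s - enhanced[i * frame_len:(i + 1) * frame_len]
        signal_energy = np.sum(s ** 2)
        if signal_energy < eps:      # skip frames without speech energy (assumption)
            continue
        snrs.append(10.0 * np.log10(signal_energy / (np.sum(e ** 2) + eps)))
    return float(np.mean(snrs))

clean = np.sin(2 * np.pi * 200 * np.arange(1024) / 8000)  # stand-in clean signal
enhanced = 1.1 * clean                                    # 10% amplitude error
score = segmental_snr(clean, enhanced)
```

In this toy case the error is one tenth of the signal amplitude in every frame, so each frame contributes 20 dB and so does the average.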
Networks with different numbers of hidden layers (L = 1, 2, 3) are considered to test the results. Table 1 shows the PESQ scores for different noise types at an SNR of 5 dB. The proposed method is compared with DNN models having 1, 2 and 3 hidden layers. The minimum PESQ score is 1.28, for DNN1, whereas the maximum score of 3.85 is obtained for DCT-DNN3. As the number of hidden layers increases, a higher PESQ value is obtained.
PESQ scores on the test set for different types of noise. (DNN* represents the hidden layer number)
Table 2 shows the SNRSeg values for different numbers of hidden layers. The maximum improvement in SNR is 1.96 dB (= 8.92 dB - 6.96 dB), obtained for PC fan noise, where the SegSNR is 6.96 dB for DNN3 and 8.92 dB for DCT-DNN3. The DCT-DNN method removes more noise and leaves less residual noise in the enhanced speech. The results are verified with signals from the TIMIT database. A maximum PESQ score of 1.96 is obtained in [32]; the proposed method improves this score to 3.85. Listening tests also confirm the better sound quality obtained with the proposed method.
Segmental SNR (dB) for different types of noise with Improvement in SNR (dB)
In most cases, DCT-II is used for image compression, image coding and speech compression, while the DNN is mostly used for classification and detection. The novelty of this work lies in using these methods in a new way, for speech enhancement that can be used in real-time applications. The work can be extended by exploring other transforms, and the same approach can be applied at the utterance level for speech signals. In a similar way, it can be applied to different types of images.
