Abstract
To address excessive noise in speech signals during calls on mobile devices in China, this study combines the Wiener filter with a generative adversarial network (GAN) to form the IGAN algorithm. First, a Wiener filter regularization algorithm is introduced to construct a preprocessing model for the speech signal; the preprocessing model is then fused with the GAN to construct the denoising model. Finally, performance analysis and simulation experiments are carried out on the model. In the comparison of IGAN with five traditional algorithms, when the SNR is increased to 17.5 dB, the MOS and PESQ scores of IGAN reach 4.9 and 3.5 respectively; DNN is second only to IGAN, and the other algorithms perform poorly. The number of iterations and the loss values of DNN and IGAN are then compared. When the network begins to converge on the speech signal, the loss value of DNN is 1.132, while that of IGAN is about 0.573, roughly half, which shows that IGAN builds the model with a smaller loss value. IGAN tends to converge after about 200 iterations, and its average peak SNR reaches up to 33.85 dB, an increase of nearly 1.02 dB. These results indicate that, among the compared algorithms, IGAN has the best denoising performance for network speech signals, improves denoising efficiency, and yields a denoised signal that fits the clean signal more closely, so that mobile devices can better serve the people.
Introduction
In daily life, people communicate through mobile devices, but the voice signals received by these devices are mostly noisy. Because speech is now an important means of human-computer interaction, speech denoising technology, which increases the clarity and intelligibility of network voice signals during processing, is particularly important [1, 2]. To alleviate these problems, this study combines the Wiener filter preprocessing method with the Generative Adversarial Network (GAN) to form an Improved Generative Adversarial Network (IGAN), applies it to an Internet speech denoising system, and builds models for experiments, with the aim of giving people a good experience and enhancing competitiveness. Wiener filtering can remove long-band noise in signals and improve intelligibility; GAN is a powerful deep learning model suitable for unsupervised learning tasks such as speech denoising [3, 4]. During the experiment, the generative model and the discriminative model compete against each other to produce good data [5]. The two methods are combined, and speech signal waveforms and objective evaluation indices are used to compare IGAN with traditional classical algorithms and to verify the value of the proposed method. This offers a new approach to the current problems of speech denoising in China, promoting the use of AI to reduce the impact of noise on speech, improve speech quality, better serve the people, and support business development.
Related works
With the rapid development of artificial intelligence, deep learning algorithms are expanding into new application fields, and more and more scholars have applied GANs to research on speech data recognition. To obtain the desired indicators of local liver function from dynamic contrast-enhanced magnetic resonance imaging, Simeth and Cao [6] developed a GAN based on a neural network. For training, 30 liver exams were collected from 22 patients; data were typically acquired every 13 seconds over about 16 minutes. Robustness to change is encouraged by recording and comparing data. For data augmentation, arterial and portal venous inputs were generated using a GAN based on a dual-input, two-compartment pharmacokinetic model of intrahepatic gadoxetate. Tests show that the method analyzes the acquired data better, with good results. Jung and Kim [7] made reasonable improvements to the GAN, which has been widely used in fields such as image synthesis and speech generation. To address the image blur and mode collapse that occur when the model is applied, they proposed a mapping generation model using a self-organizing generative network, with the vector values generated in the process used as latent vectors. After training, the results show that the improved model is easier to train, does not exhibit mode collapse, and generates clearer images. Hua et al. [8] proposed a GAN-driven deep distributional Q network (GAN-DDQN), in which the distribution of action values is compared with the estimated minimum value. A mechanism is also proposed to stabilize the resulting values and to prevent misuse of the distribution values obtained by the algorithm.
The superiority of the proposed GAN-DDQN and dueling GAN-DDQN is verified by extensive simulations. Haque [9] proposed an external classifier GAN (EC-GAN) model to better handle small-scale tasks. The algorithm combines a GAN with a semi-supervised algorithm to improve the use of supervision in the data. The results show that EC-GAN is very effective, especially for small data sets. Wang and Fang [10] put forward several strategies to address changes in the security authentication mechanism of communication equipment. The results show that their model achieves higher learning accuracy over iterations.
Jin et al. [11] proposed an image denoising method for assessing the true condition shown in chest radiographs of patients and for defining the priority of the disease. Batch normalization is used to address the performance degradation caused by additional network layers, and gap learning of noise distributions is used. The results show that depthwise separable convolution is very effective for the convergence rate of the network model, and the training time is greatly shortened. Chen et al. [12] proposed to improve the GAN-based image repair algorithm by using dual-discriminator deep learning. They found that the algorithm enhances image identification and solves the over-fitting caused by too many eigenvalues. The final results show that the proposed algorithm adapts better to the recognition of multiple image types. Wu et al. [13] proposed a deep-learning-based Parameter Pruning (PP) technique that removes redundant parts of the neural network so as to balance the denoising performance of speech enhancement in practical scenarios against computational cost. In addition, Parameter Quantization (PQ) is used to reduce the size of the network by representing its weights with fewer values. The results show that the compressed speech enhancement (SE) model produced by the PP and PQ techniques is only 10.03% of the size of the original model, with some loss of performance. That is, PP and PQ can be used in devices with limited storage and computing resources. Saleem [14] proposed a Deep Neural Network (DNN) based supervised speech enhancement algorithm with a less aggressive Wiener filter as an additional DNN layer, and applied this algorithm to speech enhancement in practice. During processing, the network separates the speech signal into acoustic features and clean signals.
The framework outperforms competing speech enhancement methods. Sun et al. [15] proposed and tested a GAN-based X-ray image denoising method to meet the fast, low-dose detection requirements of security inspection and medical imaging. The images used in the study were acquired with a Digital Radiographic (DR) imaging system. The results show that the new image denoising method effectively removes statistical noise in X-ray images while keeping image edges sharp. Compared with traditional Convolutional Neural Network (CNN) based methods, the proposed method generates more believable images and retains more detail.
To sum up, networks are developing very rapidly, and how to denoise and enhance network speech has attracted extensive attention from researchers. Artificial intelligence algorithms are also widely used in various fields and bring many conveniences. However, there is not much research on applying GANs to speech denoising. For this reason, this study applies the GAN algorithm to Internet speech denoising, improves the algorithm, and builds a model to raise denoising efficiency. New ideas for network speech denoising methods are urgently needed, and this research aims to provide them.
Improved GAN algorithm and construction of Internet speech denoising model
Preprocessing mechanism for network speech signal denoising
Network speech denoising algorithms are roughly divided into two types: traditional denoising methods and speech denoising methods based on deep learning. The GAN adopted in this study is a popular direction in artificial intelligence. A GAN is composed of a generative network (G) and a discriminative network (D). Its purpose is to generate fake data samples by learning a model from real data, and to make the distribution of the fake samples as close as possible to that of the real samples [16, 17]. The schematic diagram of the generative adversarial network is shown in Fig. 1.
GAN operation principle diagram.
Figure 1a, it
In actual modeling, in addition to the model diagram, a mathematical model of the generative adversarial network is needed. The distribution of noise data is, and the distribution of
In Eq. (1), Max denotes maximizing the discriminative ability of the discriminative network, and Min denotes maximizing the probability that the generated data are taken as real data.
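For reference, the Max and Min description above matches the standard GAN minimax objective. The block below states that textbook form; the symbols $p_{\mathrm{data}}$, $p_{z}$, $x$ and $z$ are the usual notation and are assumed here, since the original equation is not reproduced in this text.

```latex
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z\sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

Here $D$ is trained to raise $V$ by separating real samples $x$ from generated samples $G(z)$, while $G$ is trained to lower $V$ by making $D(G(z))$ close to 1.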
Speech signal and gain function to get denoised speech signal. The power spectrum is obtained by updating the speech noise power with Voice Activity Detection (VAD), which judges whether the signal meets the threshold and whether it is a pure voice signal. The equation is shown in Eq. (2).
In Eq. (2), it
Principle frame diagram of VAD.
In Fig. 2, the preprocessing method extracts the distribution of the noisy speech signal, calculates a threshold, and compares the two to determine whether the speech signal is pure. The input signal of the Wiener filter consists of a noise signal and a clean signal, see Eq. (3).
In Eq. (3), it
In Eq. (4), the properties and meanings of the corresponding parameters remain unchanged. The required a priori SNR and a posteriori SNR are shown in Eqs (5) and (6).
In Eqs (5) and (6), it
In Eq. (7),
In Eq. (8),
In Eq. (9),
Overall algorithm flow chart of Wiener filter for noisy speech signal.
In Fig. 3, the spectrum is obtained by Fourier transform of noise, the first 25 ms is regarded as the initial power of speech noise, and VAD is used to update the noise power
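The flow described for Fig. 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `wiener_gains`, the smoothing constants `alpha` and `beta`, the threshold `vad_threshold`, the energy-based VAD rule, and the decision-directed estimate of the a priori SNR are all choices made here for illustration. The input is assumed to be a list of per-frame power spectra that have already been computed by a Fourier transform.

```python
def wiener_gains(noisy_power, n_init_frames=1, alpha=0.98, beta=0.9, vad_threshold=2.0):
    """Return per-frame, per-bin Wiener gains for a sequence of power spectra.

    noisy_power: list of frames; each frame is a list of |Y(k)|^2 values.
    n_init_frames: leading frames assumed to contain noise only
                   (standing in for the first 25 ms mentioned in the text).
    alpha, beta, vad_threshold: illustrative values, not taken from the paper.
    """
    eps = 1e-12  # avoids division by zero
    n_bins = len(noisy_power[0])
    init = noisy_power[:n_init_frames]
    # Initial noise power: average of the leading noise-only frames.
    noise = [sum(f[k] for f in init) / len(init) for k in range(n_bins)]
    prev_clean = [0.0] * n_bins  # clean-speech power estimate of the previous frame
    gains = []
    for frame in noisy_power:
        # A posteriori SNR: noisy power over estimated noise power.
        gamma = [frame[k] / (noise[k] + eps) for k in range(n_bins)]
        # Simple energy-based VAD (assumption): a low average SNR means noise only,
        # so the noise power estimate is updated by recursive smoothing.
        if sum(gamma) / n_bins < vad_threshold:
            noise = [beta * noise[k] + (1 - beta) * frame[k] for k in range(n_bins)]
            gamma = [frame[k] / (noise[k] + eps) for k in range(n_bins)]
        # A priori SNR via the decision-directed rule (assumption).
        xi = [alpha * prev_clean[k] / (noise[k] + eps)
              + (1 - alpha) * max(gamma[k] - 1.0, 0.0) for k in range(n_bins)]
        # Wiener gain: xi / (1 + xi), always between 0 and 1.
        gain = [x / (1.0 + x) for x in xi]
        prev_clean = [gain[k] ** 2 * frame[k] for k in range(n_bins)]
        gains.append(gain)
    return gains
```

Multiplying each noisy spectrum by the corresponding gain and applying the inverse Fourier transform would give the preprocessed speech; frames classified as noise receive gains near zero, while frames with strong speech receive larger gains.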
Speech denoising algorithm is the cross entropy function, which has too many shortcomings [18]. To address unstable training results, constraints are added to the GAN in advance to form a conditional GAN (cGAN), so that unsupervised learning becomes supervised learning, and additional information is added to both (G) and (D).
In Eq. (10), the meaning of the parameters is the same as in Eq. (1). In addition, because of the low quality of the network output signal, some studies have proposed Least Squares Generative Adversarial Networks (LSGANs). This method changes the cross-entropy loss function of the original algorithm to a least-squares loss function, see Eq. (11).
In Eq. (11), the meaning of each parameter is the same as in the equations above. In the experiment, to reduce the large gap between the real samples and the generated samples obtained by the LSGANs algorithm, a norm of a classical loss function is added to the GAN in the field of
The cost function of the generative network of LSGANs is then obtained, as shown in Eq. (13).
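The least-squares losses and the norm-regularized generator cost described above can be sketched as follows. This is an illustration under assumptions rather than the paper's Eqs. (10) to (13), which are not reproduced in this text: the target labels 1 for real and 0 for fake, the choice of an L1 norm, the weight `lam`, and the function names are all hypothetical.

```python
def _mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares loss for D: push D(real) toward 1 and D(fake) toward 0."""
    real_term = 0.5 * _mean((d - 1.0) ** 2 for d in d_real)
    fake_term = 0.5 * _mean(d ** 2 for d in d_fake)
    return real_term + fake_term


def lsgan_generator_loss(d_fake, generated, clean, lam=100.0):
    """Least-squares adversarial loss for G plus an L1 penalty (assumed norm).

    d_fake: discriminator outputs on generated samples.
    generated, clean: generated and reference waveform samples.
    lam: hypothetical weight of the L1 term.
    """
    adversarial = 0.5 * _mean((d - 1.0) ** 2 for d in d_fake)
    l1 = _mean(abs(g - c) for g, c in zip(generated, clean))
    return adversarial + lam * l1
```

The squared terms penalize discriminator outputs in proportion to their distance from the target label, which is the motivation usually given for replacing the cross-entropy loss, and the L1 term pulls the generated waveform toward the clean reference.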
Because speech denoising algorithms produce different kinds of results, their evaluation criteria also differ. This study uses both subjective and objective evaluation criteria [19]. Subjective evaluation generally adopts the Mean Opinion Score (MOS), whose score weighting equation is shown in Eq. (14).
In Eq. (14),
In Eq. (15),
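As an illustration of the MOS weighting mentioned in the evaluation paragraph above, the sketch below computes a count-weighted average of listener ratings. It assumes the usual five-point opinion scale and a simple weighted mean; the paper's own Eq. (14) is not reproduced in this text, so the function name and the input format are hypothetical.

```python
def mean_opinion_score(rating_counts):
    """Weighted mean of opinion ratings.

    rating_counts: dict mapping a score (assumed 1 to 5) to the number of
    listeners who gave that score.
    """
    total = sum(rating_counts.values())
    if total == 0:
        raise ValueError("at least one rating is required")
    return sum(score * count for score, count in rating_counts.items()) / total


# Hypothetical panel: three listeners rate 5, one rates 4, one rates 3.
example_mos = mean_opinion_score({5: 3, 4: 1, 3: 1})
```

Each score is weighted by the share of listeners who chose it, so a few low ratings pull the MOS down in proportion to how many listeners gave them.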
Flow chart of network voice signal denoising.
In Fig. 4, after the noisy speech enters the denoising system, Wiener filtering is used to preprocess it. After this step, the data center system receives a signal with improved clarity, which is fed into the IGAN algorithm. The IGAN then removes the residual noise to obtain the final denoised speech, and the process ends.
Comparative analysis of speech denoising performance under different algorithms
To evaluate the performance of the IGAN denoising method, five traditional classical speech denoising methods are compared with IGAN. The five traditional methods are Spectral Subtraction (SS), Wiener, CNN, DNN and MSS-map. Appropriate clean speech data sets and noisy speech data sets are selected for the experiments [20]. First, the denoising effects of the different algorithms are compared under three noise conditions, as shown in Fig. 5.
Comparison of denoising effect scores of different algorithms.
Figure 5 shows that the denoising scores of the different methods vary considerably across conditions. Figure 5a shows the score as a function of SNR under the living noise condition, where the scores of all algorithms trend upward. At the beginning of the experiment, the scores of SS and Wiener follow a similar trend and begin to overlap at an SNR of about 7.5 dB. The other algorithms each rise in their own way; the score of IGAN is significantly higher than those of the other algorithms and reaches a maximum of 4.0 at an SNR of 17.5 dB, followed by DNN. Figure 5b shows the MOS score under mixed-voice cafe noise, and Fig. 5c shows the PESQ score in the office environment. These panels show that, as the SNR increases, IGAN handles noise in speech better than the other algorithms: when the SNR reaches 17.5 dB, the MOS and PESQ scores of IGAN reach 4.9 and 3.5 respectively, and its overall performance is also the best. The iterative behavior is compared next, as plotted in Fig. 6a and b.
Iteration curve of DNN and IGAN under the same conditions.
Iteration curve of DNN and IGAN under the same conditions.
The iteration graph in Fig. 6 is generated under the condition of noise
In Fig. 7, DNN begins to converge after about 400 iterations, while IGAN begins to converge after about 200 iterations. With IGAN, the average peak SNR of the denoised speech increases by nearly 1.02 dB and reaches up to 33.85 dB, a remarkable denoising effect. This means that IGAN not only converges faster and adapts more quickly, but also has a clear advantage in the value of the loss function and achieves a better peak SNR while converging faster. All of this shows that IGAN is very effective in improving Internet speech denoising.
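The text reports an average peak SNR without giving its formula. A common definition, assumed here, is ten times the base-10 logarithm of the squared peak amplitude over the mean squared error between the clean and denoised signals. The sketch below implements that definition; the function name `psnr_db` and the default `peak=1.0` (waveforms normalized to the range -1 to 1) are assumptions.

```python
import math


def psnr_db(clean, denoised, peak=1.0):
    """Peak SNR in dB between a clean reference and a denoised signal.

    peak: assumed maximum absolute amplitude of the signal.
    """
    if len(clean) != len(denoised) or not clean:
        raise ValueError("signals must be non-empty and of equal length")
    mse = sum((c - d) ** 2 for c, d in zip(clean, denoised)) / len(clean)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)
```

Averaging this value over all test utterances would give an average peak SNR that can be compared between methods, in the way the reported gain of nearly 1.02 dB is.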
In the comprehensive performance comparison of the denoising ability of each algorithm, three noise conditions (cafe, living and office) are used to examine the relationship between the loss value and the number of iterations in the experimental simulation with IGAN, as shown in Fig. 8.
Loss curve of IGAN training process under different noise conditions.
Waveform diagram of voice signal before and after IGAN processing.
Figure 8 shows the loss values obtained when the IGAN algorithm denoises the speech signal under cafe, office and living noise. At the start of the experiment, the loss value under each noise condition is relatively high. As the number of iterations increases, the loss under each condition first decreases and then rises between about 80 and 100 iterations, peaking at about 100 iterations, where the loss values are 0.569, 0.552, and 0.515 for cafe, office and living noise respectively. After that, the loss drops sharply and begins to level off at about 200 iterations. The temporary rise occurs because the algorithm needs to adapt when it is applied to the Internet speech denoising system; by about 400 iterations the loss value remains stable. Over the whole process, the loss value under each noise condition stays below 0.60 and stabilizes within 400 iterations, finally reaching a balance with the system. This further indicates that the loss of speech denoising during IGAN training is small and that the method works well. To evaluate the denoised network speech more intuitively, the waveform of the reconstructed speech signal is drawn in Fig. 9.
In Fig. 9, a segment of noisy speech is used as test data. The signal is first preprocessed by the Wiener filter and then input into the IGAN network, and the result is compared with that of DNN. The classic DNN algorithm denoises poorly and leaves more residual noise, whereas the waveform after IGAN processing has fewer burrs and visibly smoother signal edges. This indicates that the proposed method processes noisy speech signals better than DNN. To represent the superiority of IGAN accurately, the objective evaluation criteria adopted in this study are used to evaluate the processed speech. The objective evaluation results are shown in Table 1.
Objective evaluation results
Table 1 shows that the index scores of the network speech signals processed by the SS and Wiener methods are relatively low, indicating that these two methods denoise poorly. CNN and MSS-map are significantly better than those two, but their index scores are still lower than those of DNN and IGAN. Compared with DNN, IGAN raises PESQ from 1.66 to 1.75, COVL from 2.20 to 2.29, and CSIG from 2.62 to 2.89, so the quality of Internet voice improves significantly. In addition, CBAK increases from 2.41 to 2.47, and SSNR and SDNR increase from 4.83 and 10.20 to 5.21 and 12.28, respectively, which shows that the speech is also more intelligible than with DNN and once again verifies the superiority of IGAN.
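To put the DNN and IGAN figures quoted above on a common footing, the relative gain of each index can be computed directly. The sketch below uses only the values stated in the text; the dictionary layout and the function name `relative_gain_percent` are illustrative.

```python
# Index values for DNN and IGAN as quoted from Table 1 in the text.
dnn = {"PESQ": 1.66, "COVL": 2.20, "CSIG": 2.62, "CBAK": 2.41, "SSNR": 4.83, "SDNR": 10.20}
igan = {"PESQ": 1.75, "COVL": 2.29, "CSIG": 2.89, "CBAK": 2.47, "SSNR": 5.21, "SDNR": 12.28}


def relative_gain_percent(baseline, improved):
    """Percentage change of each index relative to the baseline."""
    return {k: 100.0 * (improved[k] - baseline[k]) / baseline[k] for k in baseline}


gains = relative_gain_percent(dnn, igan)
for name, value in gains.items():
    print(f"{name}: {value:.2f}%")
```

On these numbers the gains are uneven: PESQ rises by about 5.4%, while SDNR rises by about 20.4%, so the improvement over DNN is largest in the distortion-related index.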
Separating noise from the pure signal is the core of speech noise reduction. Taking Internet speech as its background, this study uses a GAN algorithm based on the Wiener filter to denoise speech. First, the performance of IGAN is compared with that of traditional algorithms, and then simulation experiments are carried out to verify the effectiveness of speech denoising. The results show that the loss value under every noise type trends downward when the different algorithms are used to denoise speech under three different environmental conditions. When the SNR is 17.5 dB, the MOS and PESQ scores of the voice signal processed by the IGAN algorithm are 4.9 and 3.5, and the DNN score is second only to IGAN. Compared with DNN, the speech signal processed by IGAN contains less residual noise. In the objective evaluation, compared with DNN, IGAN raises PESQ from 1.66 to 1.75, COVL from 2.20 to 2.29, and CSIG from 2.62 to 2.89, so the quality of Internet voice improves significantly. In addition, CBAK increases from 2.41 to 2.47, and SSNR and SDNR increase from 4.83 and 10.20 to 5.21 and 12.28, respectively, which shows that IGAN is also better than DNN in terms of speech intelligibility and again verifies its advantages. To sum up, the IGAN algorithm proposed in this study effectively reduces voice noise and improves the quality of voice signals, and its overall performance is the best among the compared methods. However, there is currently little research on combining Wiener filtering with generative adversarial network algorithms for speech denoising or other fields, so expanding the scope of application and making full use of the algorithm's advantages remains the next direction of research.
