An improved machine learning algorithm for text-voice conversion of English letters into phonemes

Abstract

Text-to-voice conversion is a core technology of intelligent translation systems and intelligent teaching systems, and it is of great significance to English teaching and its extension. However, there are still certain problems with the characteristics of factors in current text-to-voice conversion. In order to improve the efficiency of text-to-voice conversion, this study improves the traditional machine learning algorithm and proposes an improved model that combines a statistical language model, factor analysis, and support vector machines. The model is constructed as a training module and a testing module. It combines statistical methods and rule-based methods in a unified framework to make full use of English language features and to achieve automatic conversion between letter strings and phonetic features. In addition, in order to meet the needs of English text-to-voice conversion, this study builds a framework model, analyzes its performance, and designs a control experiment to compare the performance of the model. The research results show that the method proposed in this paper has a certain effect.

Keywords

Machine learning; improved algorithm; phonetic conversion; English; text-to-voice conversion

1 Introduction

With the growing maturity of voice recognition and voice synthesis technologies, natural voice processing has become a focal issue in the field of voice signal processing. The continuous upgrading of hardware chips makes it possible to process complex voice in real time. At present, the natural human-computer interaction interface has become the focus of voice technology research. In order to improve the naturalness of the human-computer interaction interface, speaker voice conversion technology is gradually developed. The purpose of speaker voice conversion is to modify the voice of one speaker (source speaker), so that the modified voice sounds like what another speaker (destination speaker) said. This modification requires changing the speaker characteristics contained in the voice and retaining the semantic information expressed by the voice. Researchers use speaker voice conversion technology to make machine voice with the characteristics of a specific speaker and make the application of voice technology more user-friendly [1].

As a frontier branch in the field of voice signal processing, speaker voice conversion technology is difficult, but its research work has high value. The research of speaker voice conversion has made certain contributions to almost every field of voice signal processing, such as voice analysis, voice synthesis, voice recognition and speaker recognition. Specifically, its significance has the following aspects [2]: (1) Speaker voice conversion inevitably requires a detailed analysis of the voice, such as studying the relationship between the changes in the parameters of the vocal source and the rhythmic characteristics of the voice, and studying the influence of the formant location, width and amplitude on the dialogue sound. This work is not only beneficial to voice synthesis, but also improves the naturalness of the text-to-voice system. Moreover, it may also promote the advancement of voice coding technology and improve the performance of voice enhancement systems. (2) The research on speaker normalization methods will promote speaker adaptation research in voice recognition. In modern voice recognition technology, a very important research topic is the voice recognition of non-specific speakers. When different speakers speak the same phoneme, the acoustic parameters of the voice differ because of differences in physiological parameters. One of the purposes of studying speaker voice conversion is to make the acoustic parameters of the source voice and the acoustic parameters of the destination voice as consistent as possible. (3) It is helpful to promote the development of speaker recognition technology. Speaker recognition does not pay attention to the semantic content of the voice signal but hopes to mine the speaker characteristics in the voice signal to achieve speaker identification and recognition. 
Through the study of speaker voice conversion, the semantic information in the voice signal can be separated from the speaker information, which is helpful to find important features that affect speaker recognition [3]. In summary, speaker voice conversion technology has important research value. Fortunately, through the continuous efforts of researchers in phonetics, the technology has been developed to some extent.

2 Related work

The literature [4] proposed a relatively mature speaker voice conversion system. The system uses vector quantization technology and uses the codebook to represent the spectral characteristics of different speakers to train the speaker’s voice to establish the mapping relationship between the spectral envelope, energy and pitch frequency between different speakers. After that, the voice parameters are converted through codebook mapping, and finally the converted voice is synthesized with a linear predictive synthesizer. The literature [5] proposed a similar method, and it only replaced the codebook mapping with a multi-layer neural network for the conversion model. As a result, the quality of the converted voice has been greatly improved, and this inspiring result has promoted the further study of the speaker’s voice conversion. The literature [6] proposed to replace the linear predictive synthesizer with the pitch synchronous superposition algorithm to convert the residual signal. In this method, pitch synchronization and superposition techniques are used to align and convert the excitation signals in time, and the method of multiple linear regression and dynamic frequency rounding is used to convert the spectral envelope. Since the limitation of the traditional linear predictive synthesizer is broken, the quality of the voice converted by this method has been further improved. The literature [7] proposed to use the harmonic plus noise model to convert the time length and pitch frequency. The system uses the Gaussian mixture model to classify the speaker feature parameter space and estimates the mixed linear transfer function according to the principle of minimum mean square error. The feature parameter trajectory converted by this method is continuous, so it is more effective than the feature parameter conversion by vector quantization.

In recent years, there have been many excellent text-to-voice conversion software products, such as InterPhonic of IFLYTEK, Microsoft's Microsoft Voice SDK, and Jietong Huasheng voice synthesis software. These can realize the function of outputting voice in real time after inputting text. Moreover, the clarity and intelligibility of Chinese voice synthesized by such software can reach a high level. However, due to technical limitations, there are still some deficiencies when synthesizing Chinese voice, such as the lack of different age and gender characteristics, limited expression of tone and speed, and the lack of personal emotional color and language style [8]. With the advancement of voice synthesis technology, people try to add various subtle and complex emotions and language styles to the synthesized voice to improve its naturalness, enhance its expressiveness, and thereby promote the development of human-computer interaction. In order to express the style and emotion of the speaker, the tone, speed, etc. of the synthesized voice should all exhibit certain characteristics. This requires in-depth analysis of Chinese rhythm, based on a large number of research results on Chinese rhythm and intonation, so as to reflect the expressive power of voice in synthesis and enhance the naturalness of synthesized voice [9]. There are many research methods for improving voice expressiveness. For example, literature [10] proposed expressive voice acoustic modeling based on a neural network. The modeling method uses the context parameters and the PAD emotion label values of emotional voice as input parameters, uses the acoustic features of emotional voice as output parameters, and uses the STRAIGHT algorithm to modify the acoustic features of neutral voice to obtain converted emotional voice. 
The literature [11] proposed a PSOLA-based Chinese-to-voice conversion system that changes the pitch and length of words to adjust the prosody at the word level and sentence level to achieve high definition and naturalness. The literature [12] uses a grammar and operation to obtain the mode adjustment of the entire text from the emotional color and style color attributes of each word. Furthermore, through the mode adjustment, the literature obtained the corresponding reference values of pitch. Finally, through this reference value, the literature adjusted the voice rate and scale of the synthesized voice, thereby making the synthesized voice smoother and more natural, enriching the expressive power of the synthesized voice, and improving the quality of voice synthesis. Some studies have suggested that Chinese stress is closely related to parameters such as pitch and duration of voice [13]. Moreover, the pause reflects the tightness of the connection between syllables and between syllables and silent segments [14]. In addition, the voice rate can be adjusted by changing the delay time according to different information frames [15]. The literature [16] proposes to use the prosody features of Mandarin Chinese, the method of dynamically synthesizing patterns and the time-domain pitch synchronization overlay method (TD-PSOLA) to modify both time and fundamental frequency scales, which can achieve higher-quality prosody voice synthesis. To address the insufficient naturalness of voice synthesis, the literature [17] proposed an improved method for smooth transition between voice units and a method of combining words on the basis of word segmentation. Moreover, it analyzed the law of tone processing in voice synthesis, and finally applied these methods and laws to Chinese text-to-voice conversion and obtained a relatively good conversion effect. 
The literature [18] collected news simulcast voices and established a corpus to synthesize high-quality voices with personal characteristics of news anchors.

3 Application of statistical language model in language recognition

For the phoneme recognition result of a voice, in order to determine which language it belongs to, we can judge it according to the maximum likelihood estimation (MLE), and it is defined as [19]: $L^{*} = \underset{L}{argmax} \sum_{W} f (O | W, L, Λ) P (W | L)$ (1)

Among them, O represents this segment of voice, and Λ represents the acoustic model. f (O|W, L, Λ) refers to the probability of obtaining the phoneme sequence W for the voice segment O under language L and acoustic model Λ. Meanwhile, P (W|L) represents the probability of occurrence of the phoneme sequence W in a certain language L.

In the case of using the best phoneme sequence output, $W^{*} = \underset{W}{argmax} f (O | W, L, Λ)$ (2)

is used to replace possible paths. Since the same phoneme recognizer is used, f (O|W, L, Λ) is independent of language L. Finally, Equation (1) is simplified to: $L^{*} = \underset{L}{argmax} P (W^{*} | L)$ (3)

Statistical language model is a general term. The two models commonly used in language recognition are the N-gram language model and the binary decision tree model.

The N-gram language model is the most widely used language model. In the field of voice recognition, it is mainly used to predict the next occurrence of a word given a known sequence of words, which can greatly improve the accuracy of voice recognition. Moreover, inspired by voice recognition, people naturally introduce the N-gram language model into language recognition, which is mainly used to calculate the occurrence probability of a given phoneme sequence in a language. The main idea of the N-gram language model is to use an N - 1-order Markov process to approximate the generation process of the symbol sequence. For example, for sequence W^* = w₁w₂ ⋯ w_T, the probability is:

$\begin{matrix} P (W^{*}) = P (w_{1} w_{2} \dots w_{T}) = \\ \prod_{i = 1}^{T} P (w_{i} | w_{1} w_{2} \dots w_{i - 1}) = \\ \prod_{i = 1}^{T} P (w_{i} | h_{i}) \approx \prod_{i = N}^{T} P (w_{i} | w_{i - N + 1} w_{i - N + 2} \dots w_{i - 1}) \end{matrix}$ (4)

Among them, the history before w_i is denoted by h_i = w₁w₂ ⋯ w_i-1. The last step of the equation is to use the Markov process to approximate that w_i is only related to the preceding N - 1 symbols, which can greatly simplify the calculation. In practical applications, the logarithm of the formula is usually taken, and the multiplication is converted to addition. In language recognition, depending on the size of the training corpus, a 3-gram language model or a 4-gram language model is usually used. In general, the more training corpus, the higher the order of the language model obtained by training (usually no more than 4), and the better the language recognition performance.

To train an N-gram language model is to estimate the conditional probability P (w_i|w_i-N+1w_i-N+2 ⋯ w_i-1). The simplest and most commonly used method is maximum likelihood estimation (MLE): $P_{MLE} (w_{i} | w_{i - N + 1} w_{i - N + 2} \dots w_{i - 1}) = \frac{count (w_{i - N + 1} w_{i - N + 2} \dots w_{i - 1} w_{i})}{count (w_{i - N + 1} w_{i - N + 2} \dots w_{i - 1})}$ (5)

It can be seen that the calculation is extremely simple, and it is only necessary to count the number of occurrences of different collocations in the training corpus. If the phoneme set contains V phonemes, then the N-gram language model has V^N parameters that need to be estimated.
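The counting procedure of Equation (5) and the decision rule of Equation (3) can be illustrated with a minimal sketch. The function names, the toy phoneme strings, and the probability floor used for unseen N-grams are assumptions made for illustration; they are not part of the system described in this paper, and a real system would use proper smoothing.

```python
import math
from collections import Counter

def train_ngram(sequences, n):
    """Count n-grams and their (n-1)-symbol histories, as in Equation (5)."""
    ngram_counts, history_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            history = tuple(seq[i - n + 1:i])
            ngram_counts[history + (seq[i],)] += 1
            history_counts[history] += 1
    return ngram_counts, history_counts

def log_prob(seq, model, n, floor=1e-6):
    """Sum of log P(w_i | previous n-1 symbols), the log form of Equation (4).

    `floor` is a hypothetical stand-in for smoothing of unseen n-grams.
    """
    ngram_counts, history_counts = model
    total = 0.0
    for i in range(n - 1, len(seq)):
        history = tuple(seq[i - n + 1:i])
        count = ngram_counts.get(history + (seq[i],), 0)
        p = count / history_counts[history] if count > 0 else floor
        total += math.log(p)
    return total

def identify_language(seq, models, n):
    """Equation (3): choose the language whose model scores the sequence highest."""
    return max(models, key=lambda lang: log_prob(seq, models[lang], n))

# Toy phoneme strings, invented for illustration only.
models = {
    "lang_a": train_ngram([list("abababab"), list("ababab")], 2),
    "lang_b": train_ngram([list("aabbaabb"), list("bbaabbaa")], 2),
}
print(identify_language(list("abab"), models, 2))
```

Working in the log domain turns the product of Equation (4) into a sum, which avoids numerical underflow on long phoneme sequences.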

If the order of the language model can be determined dynamically according to the distribution of the actual data, and the same language model can contain different orders, then the above two problems can be effectively solved. Therefore, a binary decision tree model was introduced.

In the N-gram language model, the probability of w_i appearing later can be predicted according to N - 1 historical phonemes h_i = w₁w₂ ⋯ w_i-1. Each phoneme in w_i-N+1w_i-N+2 ⋯ w_i-1 is called a predictor, for a total of N - 1 predictors. Figure 1 is a schematic diagram of a binary decision tree, which consists of three parts:

Internal nodes (corresponding to circles in the figure). When asking a binary question to a predictor, such as “whether the value of a predictor is in a certain phoneme subset”, the system only needs to answer “yes / no”.

Leaf node (corresponding to the ellipse in the figure). It stores the probability that the leaf is followed by each phoneme.

Branch. It refers to a path to a child node (internal node or leaf node) obtained by answering the question raised by the internal node.

Fig. 1

Schematic diagram of the binary decision tree.

After a historical phoneme string h_i = w₁w₂ ⋯ w_i-1 is given, the probability of being followed by w_i can be calculated by the following steps: starting from the root node, the questions on internal nodes are answered. Then, according to the value of the corresponding predictor, the algorithm goes to the corresponding child node, and this process is repeated until the algorithm goes to a leaf node to find the probability of the predicted w_i stored in the leaf node.

The binary decision tree model is used to calculate the logarithmic form of Equation (4): $\begin{matrix} log P (W) = log P (w_{1} w_{2} \dots w_{T}) = \\ log \prod_{i = 1}^{T} P (w_{i} | w_{1} w_{2} \dots w_{i - 1}) = \\ \sum_{i = 1}^{T} log P (w_{i} | h_{i}) \approx \sum_{i = 1}^{T} log P (w_{i} | l_{i}) = \\ \sum_{w \in P, l \in BT} log P (w | l) Count (l, w) \end{matrix}$ (6)

l_i is the corresponding leaf node on the binary decision tree found according to the value of w_i’s predictor, and P (w_i|l_i) is the probability of w_i appearing behind this leaf.
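The traversal described above can be sketched as follows. The dictionary-based node layout, the helper names, and the toy tree are assumptions chosen for illustration; the paper does not specify how the tree is stored or how its questions are selected.

```python
import math

def internal(position, subset, yes, no):
    """Internal node: asks whether the predictor at `position` is in `subset`."""
    return {"position": position, "subset": set(subset), "yes": yes, "no": no}

def leaf(probs):
    """Leaf node: stores the probability of each phoneme following this leaf."""
    return {"probs": probs}

def find_leaf(tree, history):
    """Walk from the root to a leaf by answering each node's yes/no question."""
    node = tree
    while "probs" not in node:
        # position 1 is the most recent phoneme w_{i-1}, position 2 is w_{i-2}, ...
        value = history[-node["position"]]
        node = node["yes"] if value in node["subset"] else node["no"]
    return node

def sequence_log_prob(tree, seq, order):
    """Log form of Equation (6); the first `order` symbols are skipped for simplicity."""
    total = 0.0
    for i in range(order, len(seq)):
        probs = find_leaf(tree, seq[i - order:i])["probs"]
        total += math.log(probs[seq[i]])
    return total

# Toy tree over the phoneme set {a, b} with two predictors (invented values).
tree = internal(1, {"a"},
                yes=leaf({"a": 0.1, "b": 0.9}),
                no=internal(2, {"a"},
                            yes=leaf({"a": 0.8, "b": 0.2}),
                            no=leaf({"a": 0.5, "b": 0.5})))
print(sequence_log_prob(tree, list("aba"), 2))
```

Because different branches may ask about different numbers of predictors, a single tree can mix effective model orders, which is the property motivating its use here.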

4 Factor analysis in language recognition

For telephone voice, phoneme recognizers based on the NN/HMM architecture are more suitable. Although the phoneme recognizer of the GMM/HMM architecture performs slightly worse in language recognition, some of its techniques are still very desirable. For example, adaptation or feature transformation techniques can be used to eliminate the influence of speakers and channels: in a GMM/HMM-based language recognition system, Wade Shen proposed adopting CMLLR (Constrained Maximum Likelihood Linear Regression) and VTLN (Vocal Tract Length Normalization).

The training set is x = { x_i } , i = 1, 2, 3, ⋯ , n, x_i represents a sample instance, x_i ∈ R^D, and D represents the feature dimensions of the sample. The corresponding covariance matrix is $C = \sum_{i = 1}^{n} (x_{i} - \bar{x}) {(x_{i} - \bar{x})}^{T}$ (7)

Among them, $\bar{x}$ is the sample mean, and the corresponding principal components can be obtained by performing eigenvalue decomposition on the covariance matrix. The first principal component is the eigenvector associated with the largest eigenvalue, the second is associated with the second largest eigenvalue, and so on.
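As a minimal sketch of this procedure, the following code builds the matrix of Equation (7) and extracts its leading eigenvectors. It assumes NumPy is available; the function names and the toy data are illustrative only.

```python
import numpy as np

def principal_components(x, k):
    """Top-k eigenvalues and eigenvectors of the matrix C in Equation (7)."""
    centered = x - x.mean(axis=0)
    scatter = centered.T @ centered              # sum_i (x_i - mean)(x_i - mean)^T
    eigvals, eigvecs = np.linalg.eigh(scatter)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]        # largest eigenvalue first
    return eigvals[order], eigvecs[:, order]

def project(x, components):
    """Coordinates of each sample on the selected principal components."""
    return (x - x.mean(axis=0)) @ components

# Toy data lying exactly on the line y = 2x (invented for illustration).
x = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
vals, vecs = principal_components(x, 2)
print(project(x, vecs[:, :1]).ravel())
```

Projecting each voice segment onto the first two components in this way gives the kind of two-dimensional scatter plot used for visual inspection.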

To verify this in a language recognition system based on phoneme recognition, we analyze the phoneme recognition features of specific voice segments from the LDC CallFriend CD1 data provided by NIST.

According to the analysis, the English part contains 10 conversation-style phone recordings; each conversation lasts 5-30 minutes, and the recordings contain 20 speakers and various channels. Moreover, these voices are cut into 851 small segments, and the effective voice length of each segment is not less than 30 s. These features are analyzed by principal component analysis, the first two principal components are taken, and the scatter plot is drawn as shown in Fig. 2.

Fig. 2

Two-dimensional representation of voice.

It can be seen from Fig. 2 that the first component depicts the difference between male and female voices, which shows that our existing phoneme recognizer fails to remove the influence of noise such as speaker and channel differences. At the same time, it also illustrates the possibility of modeling such noise and the necessity of removing its influence.

The mathematical model of factor analysis is described as follows:

$\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{m} \end{bmatrix} = \begin{bmatrix} μ_{1} \\ μ_{2} \\ \vdots \\ μ_{m} \end{bmatrix} + \begin{bmatrix} α_{11} & α_{12} & \dots & α_{1 s} \\ α_{21} & α_{22} & \dots & α_{2 s} \\ \vdots & \vdots & & \vdots \\ α_{m 1} & α_{m 2} & \dots & α_{ms} \end{bmatrix} \begin{bmatrix} F_{1} \\ F_{2} \\ \vdots \\ F_{s} \end{bmatrix} + \begin{bmatrix} ɛ_{1} \\ ɛ_{2} \\ \vdots \\ ɛ_{m} \end{bmatrix}$ (8)

or $X = μ + AF + ɛ$ (9)

Among them, X is the observation vector, μ is the mean vector, and F is the common factor, an unobservable hidden variable that follows the N (0, 1) distribution. Matrix A is called the factor load matrix, which reflects the relationship between the observed variables and the factors. Meanwhile, ɛ is the special factor or noise factor, which indicates the part that cannot be explained by the s hidden factors. It is generally assumed that ɛ follows the normal distribution N (0, ψ).

At present, language recognition research is mainly aimed at telephone channel voice, so the difference between speaker and channel is the main factor affecting the performance of language recognition.

For a GMM-UBM acoustic model with m mixture components and d-dimensional feature vectors, it can be considered that each Gaussian component reflects a specific pronunciation. In this way, for the u-th sentence of the language L, each frame can be allocated to a Gaussian component by calculating the state occupancy rate of the frame relative to each Gaussian component. For the c-th Gaussian, the mean M_uc (L) of all the frames assigned to this Gaussian is calculated. At the same time, it can be considered that M_uc (L) represents the characteristics of a particular pronunciation in this sentence. The matching degree of M_uc (L) and the c-th Gaussian component reflects how well the acoustic feature of the specific pronunciation in this sentence matches the corresponding Gaussian component in the GMM-UBM acoustic model, which can be measured by its likelihood: $p (M_{uc} (L)) = \frac{1}{{(2 π)}^{d / 2} {| Σ_{c} |}^{1 / 2}} \exp [- \frac{1}{2} {(M_{uc} (L) - μ_{c})}^{T} Σ_{c}^{- 1} (M_{uc} (L) - μ_{c})]$ (10)

Among them, M_uc (L) is a d-dimensional column vector, μ_c is the mean of the c-th Gaussian, and Σ_c is the covariance matrix of the c-th Gaussian, which is a d × d diagonal matrix. It can be seen that the value of p (M_uc (L)) is determined by M_uc (L) - μ_c. In other words, M_uc (L) - μ_c reflects the influence of noise on this particular pronunciation.
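Because the covariance matrix is diagonal, the likelihood in Equation (10) reduces to a sum over feature dimensions. The sketch below evaluates it in the log domain; the function name and the toy values are illustrative assumptions.

```python
import math

def diag_gaussian_log_likelihood(m_uc, mean_c, var_c):
    """Log of Equation (10) for a Gaussian with diagonal covariance.

    m_uc, mean_c: d-dimensional vectors; var_c: the d diagonal entries of the covariance.
    """
    d = len(m_uc)
    log_det = sum(math.log(v) for v in var_c)    # log |Sigma_c| for a diagonal matrix
    mahalanobis = sum((x - mu) ** 2 / v for x, mu, v in zip(m_uc, mean_c, var_c))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahalanobis)

# The further M_uc(L) drifts from the component mean, the lower the likelihood.
at_mean = diag_gaussian_log_likelihood([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
shifted = diag_gaussian_log_likelihood([1.0, -1.0], [0.0, 0.0], [1.0, 1.0])
print(at_mean, shifted)
```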

In order to reflect the influence of noise on all m pronunciations, we stack all m vectors M_uc (L) - μ_c into a d × m-dimensional supervector M_u (L) - M (L). If it is assumed that the influence of noise can be described by a lower-dimensional noise space, then M_u (L) can be expressed as $M_{u} (L) - M (L) = Λ X_{u} (L)$ (11)

Λ is the noise load matrix, which reflects the influence of the noise subspace on the characteristic parameters, and X_u (L) is the noise factor, which follows the N (0, 1) distribution.

In a language recognition system based on word graph output, noise also affects how well the acoustic features match the phoneme model in the phoneme recognizer. However, due to the complexity of the decoding process, we cannot directly describe how well the acoustic features corresponding to a phoneme in a certain segment of voice match the corresponding phoneme model.

Here, p (W|l) is the probability of the path W in the word graph l, which can be seen as a representation of the degree of matching between the acoustic features of the phonemes present in the voice and the phoneme model: the higher the degree of matching, the greater p (W|l); conversely, the lower the degree of matching, the smaller p (W|l). Therefore, the N-gram probability obtained from the word graph statistics reflects the influence of the above matching degree.

For the output of the word graph of a voice in language L (numbered as the u-th sentence), we construct the following vector representation in factor analysis: $M_{u} (L) = {[log (p (d_{1} | l)), log (p (d_{2} | l)), \dots, log (p (d_{m} | l))]}^{T}$ (12)

Compared with the vector representation of bag of N-grams, it mainly uses the logarithmic form.

Since there are speaker and channel differences in each voice, we assume that the feature vector represented by (12) is obtained by adding a bias to the feature vector of the corresponding language. Specifically, the feature vector M_u (L) of the u-th sentence of language L is expressed as: $M_{u} (L) = M (L) + Λ X_{u} (L) + ɛ_{u} (L)$ (13)

Among them, M (L) is the mean value of the feature vectors of all training corpora of language L. Λ represents the noise space and is used to describe the distribution of noise in low dimensions; it is an m × s matrix, where m is the dimension of the feature vector and s is the dimension of the noise space. X_u (L) is the noise factor, which is the s-dimensional vector projected from M_u (L) to the noise space Λ and conforms to the N (0, 1) distribution. ɛ_u (L) represents the residual, which conforms to the N (0, ψ) distribution.

For language L, there are U (L) sentences of voice M₁ (L) , M₂ (L) , ⋯ , M_U(L) (L), and the corresponding noise factors are X₁ (L) , X₂ (L) , ⋯ , X_U(L) (L). Then, the likelihood of all sentences is $P_{all} (Λ, Ψ) = \prod_{i = 1}^{U (L)} \int p (M_{i} (L) | X_{i} (L), Λ, Ψ) N (X_{i} (L) | 0, I) dX_{i} (L)$ (14)

Parameter (Λ, Ψ) is estimated using MLE criterion to maximize the likelihood of all training data: $(Λ^{*}, Ψ^{*}) = \underset{Λ, Ψ}{arg max} (P_{all} (Λ, Ψ))$ (15)

The parameter estimation uses the EM algorithm, which estimates the implicit variable X_i (L) from the initial value of parameter (Λ, Ψ) or the value of the previous iteration and the training data. Then, the new parameter (Λ, Ψ) value is estimated based on the implicit variable X_i (L) and the training data, and iteratively repeated until convergence. The specific algorithm steps are as follows:

1) Initial parameter model: the initial Λ is randomly generated, M (L) is the mean of the feature vectors of all training corpora from language L, and Ψ is initially the variance of all training corpora;

2) E (expectation): According to the model parameters (Λ, Ψ) from the last iteration and the training data, the first-order and second-order statistics of the noise factor X_i (L) of each sentence are estimated: $E (X_{i} (L) | M_{i} (L)) = {(I + Λ^{T} Ψ^{- 1} Λ)}^{- 1} Λ^{T} Ψ^{- 1} (M_{i} (L) - M (L))$ (16) $E (X_{i} (L) X_{i}^{T} (L) | M_{i} (L)) = {(I + Λ^{T} Ψ^{- 1} Λ)}^{- 1} + E (X_{i} (L) | M_{i} (L)) E (X_{i}^{T} (L) | M_{i} (L))$ (17)

3) M (maximization): The new parameter (Λ, Ψ) is updated according to the estimated first and second order statistics of the noise factor X_i (L):

$Λ = (\sum_{L} \sum_{i = 1}^{U (L)} (M_{i} (L) - M (L)) E (X_{i}^{T} (L) | M_{i} (L))) {(\sum_{L} \sum_{i = 1}^{U (L)} E (X_{i} (L) X_{i}^{T} (L) | M_{i} (L)))}^{- 1}$ (18) $Ψ = \frac{1}{\sum_{L} U (L)} diag [\sum_{L} \sum_{i = 1}^{U (L)} (M_{i} (L) - M (L)) {(M_{i} (L) - M (L))}^{T} - \sum_{L} \sum_{i = 1}^{U (L)} (M_{i} (L) - M (L)) E (X_{i}^{T} (L) | M_{i} (L)) Λ^{T}]$ (19)

4) Steps 2) to 3) are repeated until the algorithm converges. Generally, the algorithm can end after 3-4 iterations.
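The four steps above can be sketched compactly with NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: it handles a single language (so the outer sum over L disappears), expects feature vectors that are already centered by M (L), stores Ψ as a vector of diagonal entries, and uses the standard factor-analysis posterior mean, which includes the projection Λ^T Ψ^{-1}. The function name, the variance floor and the default iteration count are hypothetical.

```python
import numpy as np

def em_factor_analysis(centered, s, iterations=4, seed=0):
    """EM for the model M_i - M = Lambda X_i + eps, with diagonal Psi.

    centered: (n, m) array whose rows are M_i(L) - M(L).
    Returns the loading matrix Lambda (m, s) and the diagonal of Psi (m,).
    """
    n, m = centered.shape
    rng = np.random.default_rng(seed)
    loading = rng.normal(scale=0.1, size=(m, s))    # step 1: random initial Lambda
    psi = centered.var(axis=0) + 1e-6               # step 1: variance of the training data
    for _ in range(iterations):
        # Step 2 (E): posterior statistics of the noise factors.
        weighted = loading / psi[:, None]                        # Psi^{-1} Lambda
        cov = np.linalg.inv(np.eye(s) + loading.T @ weighted)    # (I + Lambda^T Psi^{-1} Lambda)^{-1}
        ex = centered @ weighted @ cov                           # rows are E(X_i | M_i)^T
        exx_sum = n * cov + ex.T @ ex                            # sum_i E(X_i X_i^T | M_i)
        # Step 3 (M): update Lambda and Psi.
        cross = centered.T @ ex                                  # sum_i (M_i - M) E(X_i^T | M_i)
        loading = cross @ np.linalg.inv(exx_sum)
        psi = (np.sum(centered ** 2, axis=0) - np.sum(cross * loading, axis=1)) / n
        psi = np.maximum(psi, 1e-6)                              # keep variances positive
    return loading, psi
```

A fixed iteration count mirrors the observation that a few iterations are usually enough; a convergence test on the likelihood could be substituted.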

For a test sentence, we hope to remove the noise-affected part of its feature vector and retain its language information. The specific formula is as follows: $M'_{u} (L) = M_{u} (L) - Λ X_{u} (L)$ (20)

The noise factor is obtained by formula (16). Since the language of the test sentence is unknown, M (L) is replaced by the mean value of the feature vectors of all training corpora.
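The compensation step can be sketched as follows, assuming NumPy and a noise space that has already been trained. The function name and the toy parameter values are hypothetical, and the noise factor is estimated with the standard factor-analysis posterior mean, using the global training mean in place of M (L).

```python
import numpy as np

def compensate(m_u, mean_all, loading, psi):
    """Equation (20): subtract the estimated noise component from a test vector.

    m_u: (m,) test feature vector; mean_all: (m,) mean over all training data;
    loading: (m, s) noise load matrix Lambda; psi: (m,) diagonal of Psi.
    """
    weighted = loading / psi[:, None]                                   # Psi^{-1} Lambda
    cov = np.linalg.inv(np.eye(loading.shape[1]) + loading.T @ weighted)
    x_u = cov @ weighted.T @ (m_u - mean_all)                           # estimated noise factor
    return m_u - loading @ x_u, x_u

# Toy values (invented): one noise direction along the first feature dimension.
loading = np.array([[1.0], [0.0]])
psi = np.array([1.0, 1.0])
cleaned, x_u = compensate(np.array([2.0, 3.0]), np.zeros(2), loading, psi)
print(cleaned, x_u)
```

Only the component lying in the noise space is removed; dimensions that the load matrix does not touch pass through unchanged.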

In order to verify the effectiveness of the factor analysis method proposed in this paper, we test on NIST LRE 2007 and experiment on PRLM. The training data comprise the LDC CallFriend CD1 data, OHSU and some new language training data in the 2007 test. The noise space Λ is also estimated from these training data, and the number of noise factors is 10. Bag of N-grams features are obtained from word graph statistics and smoothed by the UBM: $p' (d_{i} | l) = 0.3 p (d_{i} | l) + 0.7 p (d_{i} | UBM)$ (21)
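The interpolation in Equation (21) is a simple linear mix of the per-utterance N-gram probabilities with the UBM probabilities. A minimal sketch, with a hypothetical function name and invented toy values:

```python
def smooth_with_ubm(p_utterance, p_ubm, weight=0.3):
    """Equation (21): interpolate utterance N-gram probabilities with the UBM."""
    return [weight * p + (1.0 - weight) * q for p, q in zip(p_utterance, p_ubm)]

# An N-gram that never occurs in the utterance still receives non-zero probability.
smoothed = smooth_with_ubm([1.0, 0.0], [0.5, 0.5])
print(smoothed)
```

Since both inputs sum to one, the smoothed vector also sums to one, and every entry stays strictly positive as long as the UBM entry is positive, so the logarithm in the feature vector remains defined.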

A 3-gram PRLM is adopted for training and testing. Because the PRLM system has sufficient training data when training the language model, the language model is considered to be independent of speaker and channel, and the system only compensates for the development and test sets. The experimental results in each case are shown in Table 1.

Table 1

Comparison of experimental results of the factor analysis method and the baseline system on NIST LRE 2007

EER/%	30s	10s	3s
Baseline system	3.61	9.73	24.17
Factor analysis	2.97	9.42	24.51

It can be seen from Table 1 that the factor analysis method is effective for long-duration tests, and the system performance is improved by 17.8% relative to the 30 s PRLM baseline system. However, there is little improvement on the 10 s test, and the performance on the 3 s test is slightly reduced. There are two main reasons for this situation. First, because the noise factor training process uses 30 s long voice, training and test conditions do not match during short-duration voice testing. Second, it is difficult to reliably extract and estimate the speaker and channel information contained in short-duration voice, so the performance improvement is small and compensation may even be counter-productive.

5 Application of support vector machine in language recognition based on phoneme layer information

When introducing support vector machines into the language recognition task based on phoneme layer information, the kernel function must meet the following two requirements:

1) The kernel function can measure the similarity of two voices;

2) The kernel function should be easy to calculate and store.

The structure of the kernel function is derived from the perspective of measuring the similarity of two voice segments. First, we need to select appropriate features to describe a piece of voice, and we use the “bag of N-grams” method, drawing on the idea of bag of words in text classification. If it is assumed that the phoneme sequence W represents a possible path in the word graph l, the probability that the corresponding N-gram d_i appears is $p (d_{i} | l) \approx \frac{count (d_{i} | l)}{\sum_{j = 1}^{M} count (d_{j} | l)}$ (22) $count (d_{i} | l) = E_{W} [count (d_{i} | W)] = \sum_{W} p (W | l) count (d_{i} | W)$ (23)

Among them, count (d_i|W) represents the number of times a certain N-gram d_i appears in the sequence W, and p (W|l) represents the probability of the path W in the word graph l. When putting the probabilities of all N-grams into a vector, we get the “bag of N-grams” feature vector S: $S = {[p (d_{1} | l), p (d_{2} | l), \dots, p (d_{M} | l)]}^{T}$ (24)
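The expected counts and the resulting feature vector can be sketched as follows. For illustration, the word graph is assumed to be given as an explicit list of paths with their probabilities, whereas a real lattice would be processed with forward-backward statistics; the function names and toy paths are hypothetical.

```python
from collections import Counter

def ngrams(seq, n):
    """All consecutive n-grams of a phoneme sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def expected_counts(paths, n):
    """Equation (23): counts weighted by path probability p(W | l).

    paths: list of (phoneme_sequence, path_probability) pairs.
    """
    counts = Counter()
    for seq, prob in paths:
        for gram in ngrams(seq, n):
            counts[gram] += prob
    return counts

def bag_of_ngrams(paths, n, vocabulary):
    """Equations (22) and (24): normalized expected counts in a fixed N-gram order."""
    counts = expected_counts(paths, n)
    total = sum(counts[g] for g in vocabulary)
    return [counts[g] / total if total > 0 else 0.0 for g in vocabulary]

# Two competing paths through a toy word graph (invented for illustration).
paths = [(["a", "b", "a"], 0.6), (["a", "b", "b"], 0.4)]
vocabulary = [("a", "b"), ("b", "a"), ("b", "b")]
features = bag_of_ngrams(paths, 2, vocabulary)
print(features)
```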

Next, we will discuss how to measure the similarity between two sentences. If the word graph structures corresponding to two specific voice segments are assumed to be l₁ and l₂, then the similarity between l₁ and l₂ can be expressed as: $K (l_{1}, l_{2}) = \sum_{i = 1}^{M} p (d_{i} | l_{1}) log (\frac{p (d_{i} | l_{2})}{p (d_{i} | all)})$ (25)

Among them, p (d_i|all) represents the probability of d_i appearing in the entire training corpus. When the argument x of the logarithmic function log(x) is around 1, we can make the first-order approximation log(x) ≈ x - 1. Formula (25) can then be approximated as $\begin{matrix} K (l_{1}, l_{2}) \approx \sum_{i = 1}^{M} p (d_{i} | l_{1}) \frac{p (d_{i} | l_{2})}{p (d_{i} | all)} - \sum_{i = 1}^{M} p (d_{i} | l_{1}) \\ = \sum_{i = 1}^{M} p (d_{i} | l_{1}) \frac{p (d_{i} | l_{2})}{p (d_{i} | all)} - 1 \\ = \sum_{i = 1}^{M} \frac{p (d_{i} | l_{1})}{\sqrt{p (d_{i} | all)}} \frac{p (d_{i} | l_{2})}{\sqrt{p (d_{i} | all)}} - 1 \end{matrix}$ (26)

After removing the constant term, the similarity of the two voices can be finally expressed as $K (l_{1}, l_{2}) = \sum_{i = 1}^{M} \frac{p (d_{i} | l_{1})}{\sqrt{p (d_{i} | all)}} \frac{p (d_{i} | l_{2})}{\sqrt{p (d_{i} | all)}}$ (27)

The higher the score, the more similar the N-gram combinations that appear in the two voice segments; that is, the two voice segments are more similar and more likely to belong to the same language.

Observing the expression of the kernel function, we can see that it first weights the probability of each N-gram and then calculates the inner product of the weighted vectors. The reason is that the distribution of N-grams is very uneven, and the probability of some N-grams appearing is far greater than that of others. Therefore, if we do not use weighting but simply use the “bag of N-grams” vectors to calculate the inner product, the size of the inner product will be dominated by the N-grams with a higher probability of occurrence. In that case, the information contained in the lower-probability N-grams will not be reflected, which reduces the accuracy. Therefore, the role of weighting is to adjust the influence of each N-gram so that the distinguishing information contained in each N-gram is reflected as far as possible.
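The weighting and inner product can be sketched as below. The sketch assumes the symmetric form of the kernel, in which both probability vectors are divided by the square root of the corpus-wide probability p (d_i|all); the function names and toy probabilities are illustrative.

```python
import math

def weighted_vector(p, p_all):
    """Scale each N-gram probability by 1 / sqrt(p(d_i | all))."""
    return [pi / math.sqrt(qi) for pi, qi in zip(p, p_all)]

def kernel(p1, p2, p_all):
    """Inner product of the two weighted bag-of-N-grams vectors."""
    v1 = weighted_vector(p1, p_all)
    v2 = weighted_vector(p2, p_all)
    return sum(a * b for a, b in zip(v1, v2))

# Toy probabilities (invented): the first N-gram is frequent in the whole corpus.
p_all = [0.5, 0.25, 0.25]
p1 = [0.5, 0.3, 0.2]
p2 = [0.5, 0.1, 0.4]
print(kernel(p1, p2, p_all), kernel(p1, p1, p_all))
```

Because the kernel is an ordinary inner product of pre-weighted vectors, each utterance only needs its weighted vector stored once, which keeps the kernel easy to calculate and store.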

Table 2 shows the performance of the third-order SVM on LRE07, with the N-Gram statistics obtained from lattices. The kernel function takes the form of formula (22). The SVM is trained with the Torch SVM toolkit using a one-versus-rest strategy. The training corpus contains 22,231 voice segments, each with an effective voice length of at least 30 s. On 30 s test segments the SVM model outperforms the 3-gram language model (EER of 3.29% versus 3.61%), while its performance on 10 s and 3 s segments is slightly worse. The reason is that the training set uses 30 s samples, so shorter test durations create a mismatch between training and test conditions. Because the two models rely on different training strategies, they are strongly complementary: fusing them improves on either single system at every duration, most notably at 30 s, where the EER falls to 2.33%.

Table 2

Comparison of the performance of the third-order SVM and the 3-Gram language model on LRE07

EER (%)	30 s	10 s	3 s
3-Gram language model	3.61	9.73	24.17
3rd order SVM	3.29	10.06	24.62
Fusion	2.33	7.26	21.38
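
Table 2 reports the equal error rate (EER), the operating point at which the miss rate and the false-alarm rate coincide. The paper does not describe how the EER was computed, so the following Python sketch is only one simple way to approximate it from detection scores; the function name and the example scores are hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping a threshold over all observed scores
    and returning the point where miss and false-alarm rates are closest."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # A target trial is missed when its score falls below the threshold.
        miss = sum(s < threshold for s in target_scores) / len(target_scores)
        # A non-target trial is a false alarm when it reaches the threshold.
        false_alarm = (sum(s >= threshold for s in nontarget_scores)
                       / len(nontarget_scores))
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer

# Hypothetical detector scores (not from the paper).
targets = [0.9, 0.8, 0.7, 0.4]
nontargets = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {100 * eer:.1f}%")
```

A production evaluation would typically interpolate between thresholds instead of picking the closest observed point, which matters when the number of trials is small.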

6 Model building

The experimental system is composed of two parts: training and testing. The composition of the system is shown in Fig. 3. In the training stage, word features are first extracted from the corpus; after feature selection, an iterative algorithm learns the weights of the features. In the test stage, feature extraction is performed first, then the probability of the sound being converted into each word in the context is calculated, and the result of the phonetic conversion is obtained.
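The text does not name the model or the iterative algorithm. Purely as an illustration of the test stage, the sketch below assumes a log-linear (maximum-entropy-style) model in which each candidate phoneme for a letter is scored by summing the learned weights of its active context features, followed by a softmax. The feature templates, the function names and the weight values are all hypothetical.

```python
import math

def extract_features(word, index, phoneme):
    """Hypothetical context features for the letter at `index`,
    paired with a candidate phoneme."""
    prev_letter = word[index - 1] if index > 0 else "<s>"
    next_letter = word[index + 1] if index + 1 < len(word) else "</s>"
    letter = word[index]
    return [
        f"cur={letter}|ph={phoneme}",
        f"prev={prev_letter}|cur={letter}|ph={phoneme}",
        f"cur={letter}|next={next_letter}|ph={phoneme}",
    ]

def phoneme_probabilities(word, index, candidates, weights):
    """Softmax over the summed weights of the active features."""
    scores = {
        ph: sum(weights.get(f, 0.0) for f in extract_features(word, index, ph))
        for ph in candidates
    }
    peak = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {ph: math.exp(s - peak) for ph, s in scores.items()}
    norm = sum(exp_scores.values())
    return {ph: v / norm for ph, v in exp_scores.items()}

# Made-up weights standing in for the output of the training stage.
weights = {
    "cur=c|ph=K": 1.0,
    "cur=c|ph=S": 0.5,
    "cur=c|next=e|ph=S": 2.0,
    "cur=c|next=a|ph=K": 1.5,
}

probs_cat = phoneme_probabilities("cat", 0, ["K", "S"], weights)
probs_cell = phoneme_probabilities("cell", 0, ["K", "S"], weights)
print(probs_cat)
print(probs_cell)
```

With these weights the letter "c" is more likely to map to K before "a" and to S before "e", which shows how context features shift the conversion probability.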

Fig. 3

Structure diagram of phonetic conversion.

The block diagram of the text-to-voice conversion system is shown in Fig. 4. A text-to-voice conversion system generally includes three core parts: text analysis, prosody control and voice synthesis. The text analysis stage must solve the following problems: removal of unrecognizable symbols from the input text, word segmentation, determination of the syntactic structure, disambiguation of polyphonic characters, and recognition of numbers, years and abbreviations. The prosody control module has a great impact on the quality of the final synthesized voice. It controls intonation, pauses, stress and speaking rate, thereby improving the naturalness of the synthesized voice. Based on the results of text analysis and prosody analysis, the voice synthesis module applies a suitable synthesis method, such as formant parameter synthesis, waveform stitching synthesis or unit selection synthesis, and finally synthesizes the voice. This study focuses on the prosody control module in order to improve the quality of the synthesized voice.
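A few of the text analysis tasks listed above (symbol removal and the recognition of numbers, years and abbreviations) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system described in the paper: the abbreviation table, the year range and all function names are hypothetical, and word segmentation, syntactic analysis and polyphone disambiguation are not covered.

```python
import re

# Hypothetical abbreviation table; a real system would need a larger,
# context-sensitive lexicon ("st." can also mean "saint").
ABBREVIATIONS = {"dr.": "doctor", "mr.": "mister", "st.": "street"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else "-" + ONES[ones])

def year_to_words(year):
    """Read a four-digit year as two two-digit groups."""
    high, low = divmod(year, 100)
    if low == 0:
        return number_to_words(high) + " hundred"
    if low < 10:
        return number_to_words(high) + " oh " + ONES[low]
    return number_to_words(high) + " " + number_to_words(low)

def normalize(text):
    """Expand abbreviations, strip unknown symbols, verbalize numbers."""
    tokens = []
    for raw in text.lower().split():
        if raw in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[raw])
            continue
        token = re.sub(r"[^a-z0-9'-]", "", raw)  # drop unrecognizable symbols
        if not token:
            continue
        if token.isdigit():
            value = int(token)
            # Assumption: four-digit tokens from 1100 to 1999 are years.
            if len(token) == 4 and 1100 <= value <= 1999:
                token = year_to_words(value)
            elif value < 100:
                token = number_to_words(value)
        tokens.append(token)
    return " ".join(tokens)

result = normalize("Dr. Smith moved to 42 Baker St. in 1984 ©")
print(result)
```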

Fig. 4

Block diagram of text-to-voice conversion system.

After the system is built, its performance is evaluated on 40 sets of data. A deep learning model serves as the comparison baseline: the deep learning model is denoted DL, and the model proposed in this study is denoted PML. First, the text-to-voice conversion speed of the two models is compared. The results are shown in Fig. 5 and Table 3.

Fig. 5

Comparison diagram of text-to-voice conversion speed.

Table 3

Comparison table of text-to-voice conversion speed

	DL	PML		DL	PML
1	107.7	72.3	21	140.4	45.5
2	146.6	61.5	22	131.9	47.9
3	142.0	63.9	23	143.6	38.2
4	96.0	39.7	24	139.3	44.1
5	118.5	67.5	25	116.0	36.3
6	102.7	54.0	26	139.0	69.1
7	116.3	54.8	27	141.5	72.1
8	140.6	67.6	28	125.6	44.1
9	108.5	35.5	29	139.6	72.9
10	143.5	63.9	30	125.8	56.1
11	149.1	64.2	31	106.7	48.4
12	124.6	37.6	32	100.0	36.8
13	110.6	69.5	33	96.4	50.6
14	105.3	45.2	34	134.9	49.9
15	127.9	69.4	35	131.5	65.5
16	136.6	60.7	36	105.2	50.1
17	103.2	58.2	37	121.7	58.4
18	103.6	64.7	38	144.2	55.2
19	134.7	60.1	39	96.4	71.6
20	118.0	35.5	40	133.3	70.9
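
As a quick check on the figures in Table 3, the following Python sketch computes the mean value for each model and the ratio between them. The numbers are copied from the table as printed (the paper does not state the unit); the variable names are assumptions.

```python
from statistics import mean

# Values from Table 3, runs 1-40 in order.
dl = [107.7, 146.6, 142.0, 96.0, 118.5, 102.7, 116.3, 140.6, 108.5, 143.5,
      149.1, 124.6, 110.6, 105.3, 127.9, 136.6, 103.2, 103.6, 134.7, 118.0,
      140.4, 131.9, 143.6, 139.3, 116.0, 139.0, 141.5, 125.6, 139.6, 125.8,
      106.7, 100.0, 96.4, 134.9, 131.5, 105.2, 121.7, 144.2, 96.4, 133.3]
pml = [72.3, 61.5, 63.9, 39.7, 67.5, 54.0, 54.8, 67.6, 35.5, 63.9,
       64.2, 37.6, 69.5, 45.2, 69.4, 60.7, 58.2, 64.7, 60.1, 35.5,
       45.5, 47.9, 38.2, 44.1, 36.3, 69.1, 72.1, 44.1, 72.9, 56.1,
       48.4, 36.8, 50.6, 49.9, 65.5, 50.1, 58.4, 55.2, 71.6, 70.9]

mean_dl = mean(dl)
mean_pml = mean(pml)
ratio = mean_pml / mean_dl

# Number of runs in which PML has the lower value.
faster_runs = sum(p < d for d, p in zip(dl, pml))

print(f"mean DL = {mean_dl:.1f}, mean PML = {mean_pml:.1f}, ratio = {ratio:.2f}")
print(f"PML lower in {faster_runs} of {len(dl)} runs")
```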

As shown in Fig. 5 and Table 3, the proposed model is faster than the deep learning model in every one of the 40 runs. On average, the time required for text-to-voice conversion by the proposed model is roughly half of that required by the deep learning model. Next, the accuracy of text-to-voice conversion is compared.

As can be seen from Fig. 6 and Table 4, the text-to-voice conversion accuracy of the model proposed in this study lies between 85% and 95%, which is high enough to provide a practical basis for deployment in a system. By contrast, the accuracy of the deep learning model lies between about 45% and 68%. The model proposed in this study is therefore superior in the accuracy of text-to-voice conversion.

Fig. 6

Comparison diagram of accuracy of text-to-voice conversion.

Table 4

Comparison table of accuracy of text-to-voice conversion (%)

	DL	PML		DL	PML
1	51.3	88.6	21	51.4	85.6
2	47.1	87.7	22	47.4	92.1
3	62.1	85.4	23	60.1	89.6
4	62.9	91.3	24	51.9	90.1
5	45.7	86.1	25	67.8	87.8
6	60.2	93.5	26	56.9	89.4
7	67.2	93.9	27	56.9	87.7
8	51.9	86.7	28	56.4	90.2
9	45.8	90.3	29	56.1	88.0
10	45.1	91.5	30	58.2	87.6
11	65.0	94.6	31	58.2	91.6
12	54.5	89.0	32	65.7	89.9
13	64.3	91.9	33	48.1	85.2
14	46.4	89.8	34	61.3	90.3
15	58.8	92.4	35	53.1	89.1
16	62.3	91.5	36	45.3	91.4
17	50.3	93.9	37	53.2	94.6
18	63.0	90.4	38	59.1	93.6
19	67.9	93.3	39	63.1	92.5
20	50.3	93.1	40	54.6	92.0
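
The accuracy ranges quoted for the two models can be read directly from Table 4. The short Python sketch below does this, and also reports the smallest per-run margin between the models; the values are copied from the table and the names are assumptions.

```python
# Accuracy values (%) from Table 4, runs 1-40 in order.
dl_acc = [51.3, 47.1, 62.1, 62.9, 45.7, 60.2, 67.2, 51.9, 45.8, 45.1,
          65.0, 54.5, 64.3, 46.4, 58.8, 62.3, 50.3, 63.0, 67.9, 50.3,
          51.4, 47.4, 60.1, 51.9, 67.8, 56.9, 56.9, 56.4, 56.1, 58.2,
          58.2, 65.7, 48.1, 61.3, 53.1, 45.3, 53.2, 59.1, 63.1, 54.6]
pml_acc = [88.6, 87.7, 85.4, 91.3, 86.1, 93.5, 93.9, 86.7, 90.3, 91.5,
           94.6, 89.0, 91.9, 89.8, 92.4, 91.5, 93.9, 90.4, 93.3, 93.1,
           85.6, 92.1, 89.6, 90.1, 87.8, 89.4, 87.7, 90.2, 88.0, 87.6,
           91.6, 89.9, 85.2, 90.3, 89.1, 91.4, 94.6, 93.6, 92.5, 92.0]

def value_range(values):
    """Return the (minimum, maximum) pair of a list of accuracies."""
    return min(values), max(values)

dl_range = value_range(dl_acc)
pml_range = value_range(pml_acc)

# Smallest advantage of PML over DL across the paired runs.
smallest_margin = min(p - d for d, p in zip(dl_acc, pml_acc))

print("DL range: ", dl_range)
print("PML range:", pml_range)
print(f"smallest per-run margin: {smallest_margin:.1f} points")
```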

7 Conclusion

Converting English letters into phonemes has a wide range of applications, including teaching and intelligent text recognition. On this basis, this study analyzes the application of an improved machine learning algorithm to the conversion of English letters into phonemes. The system consists of two parts: training and testing. In the training phase, word features are first extracted from the corpus and, after feature selection, an iterative algorithm learns the weights of the features. In the test phase, feature extraction is performed first, then the probability of the sound being converted into each character in the context is calculated, and the result of the phonetic conversion is obtained. A text-to-voice conversion system generally includes three core parts: text analysis, prosody control and voice synthesis. Based on the results of text analysis and prosody analysis, the voice synthesis module applies a suitable synthesis method, such as formant parameter synthesis, waveform stitching synthesis or unit selection synthesis, and finally synthesizes the voice. Finally, the performance of the proposed model is analyzed through comparative experiments against a deep learning model. The results show that the proposed algorithm converts faster and more accurately than the comparison model and can be applied in practice.
