Design of English text-to-speech conversion algorithm based on machine learning

Abstract

English text-to-speech conversion is a key topic in modern computer technology research. Its main difficulty is that feature recognition during conversion produces large errors, which makes the conversion algorithm hard to apply in practical systems. To improve the efficiency of English text-to-speech conversion, this article applies machine learning: after the pitch of the original voice waveform is labeled, the rhythm is modified through PSOLA, and the C4.5 algorithm is used to train a decision tree that judges the pronunciation of polyphones. To evaluate the performance of the pronunciation discrimination method based on part-of-speech rules and of HMM-based prosody hierarchy prediction in speech synthesis systems, this study constructs a system model in which the waveform stitching method and PSOLA are used to synthesize the sound. For words whose main stress cannot be determined from morphological structure, the stress label is learned by machine learning methods. Finally, the performance of the algorithm is evaluated and analyzed through control experiments. The results show that the proposed algorithm performs well and has practical value.

Keywords

Machine learning; English text-to-speech conversion; improved algorithm; simulation

1 Introduction

Making computers speak like humans and understand speech is an important topic of modern computer technology research. Moreover, with the development of network and multimedia technology, applications urgently require voice technology to be developed further and in greater depth. Text-to-speech (TTS) conversion is the key to a computer’s ability to “speak”. Its function is to convert text, whether generated by the computer itself or input externally (such as documents, emails, or text recognized by scanning), into natural and fluent voice output through voice processing technology. In other words, it enables the computer to read text aloud fluently, so that people can understand the content by “listening” [1]. TTS is an interdisciplinary frontier technology that involves linguistics, phonetics, psychology, acoustics, digital signal processing, multimedia technology and other fields. Moreover, the synthesized speech must be highly intelligible and natural.

Early research mainly used the parameter synthesis method. This method uses digital signal processing to model human vocalization as a source that simulates the glottal state and excites a time-varying digital filter characterizing the resonances of the vocal tract. The source may be a periodic pulse sequence, which represents vocal cord vibration in voiced sounds, or a random noise sequence, which represents voiceless sounds. Adjusting the filter parameters is equivalent to changing the shape of the oral cavity and the vocal tract, so that different sounds can be produced. Adjusting the period or intensity of the excitation pulse sequence changes the pitch, stress, and similar properties of the synthesized speech [2].

The research of TTS technology has a history of more than two hundred years. It can be traced back to the mechanical speech synthesizer developed by Kratzenstein in 1779, which could synthesize some vowels by mechanical means. However, truly practical TTS technology developed alongside computer technology and digital signal processing. In the decades after the birth of the computer, TTS technology has moved from parameter synthesis to splicing synthesis and from rule-driven to data-driven methods [3], and the quality of synthesized speech has continued to improve. Today, TTS technology is widely used in many industries.

In recent years, with the increasing memory capacity of computers and the increasing speed of processors, speech synthesis based on large corpora has gradually become the focus of research. In this method, a large amount of continuous speech is recorded in advance, and speech units of various sizes (whole sentences, phrases, words, syllables, etc.) are segmented by software and stored in the speech corpus. During synthesis, a unit selection algorithm chooses from the corpus the speech units most suitable for the current context and splices them together. Since the synthesis units all come from natural recorded speech, clear speech with extremely high naturalness can be obtained as long as appropriate units are selected.

2 Related work

Speech synthesis technology involves many fields, such as acoustics, linguistics, digital signal processing and multimedia technology, and it is a focus of today’s international technological competition [3]. English TTS systems first appeared in the 1960s. Today, well-known computer companies such as L & H, Lucent, IBM, and Microsoft have developed TTS systems in multiple languages. The synthesis systems of Microsoft and Bell Labs can read English documents aloud with a freely adjustable tone, and their output is realistic, clear and smooth [4]. Since the 1980s, some domestic research institutions have conducted a lot of research on Chinese TTS. The first to carry out this work was the Institute of Acoustics of the Chinese Academy of Sciences. Later, the Institute of Languages of the Chinese Academy of Social Sciences, Tsinghua University, University of Science and Technology of China, Northern Jiaotong University and other institutions successively carried out research on Chinese TTS. At the same time, Taiwan Jiaotong University, Taiwan University and Bell Labs also developed Chinese TTS systems. In recent years, with the support of the national “863” intelligent computer theme, Chinese TTS technology has made great progress. Tsinghua University, the University of Science and Technology of China, the Chinese Academy of Sciences and other institutions have advanced the field, and some research results have been turned into products and applied in practice, such as the Sonic system of Tsinghua University, the KD-863 Chinese text-to-speech conversion system of the University of Science and Technology of China, the Chinese 1vrS system of Hangzhou Sanhui Company, the embedded TTS Chinese speech system of Jietong Company, and the KDZOO Chinese-to-text conversion system of Xunfei Company [6]. Among these, the speech synthesized by some systems is relatively close to natural human speech, but a “machine flavor” can still be heard. The quality of speech synthesis depends mainly on the clarity and naturalness of the speech. In some cases, it must also convey changes in emotion, and the broad development prospects of speech synthesis technology will certainly place higher requirements on this [7].

The literature [8] provides a mechanical speech synthesizer, which can synthesize some vowels by mechanical means. However, the truly practical TTS technology developed with the development of computer technology and digital signal processing technology. In the decades after the birth of the computer, the development of TTS technology has mainly experienced the development process from parameter synthesis to splicing synthesis, and from rule-driven to data-driven, and the quality of synthesized speech continues to improve. Today, TTS technology has been widely used in many industries. In the early research, the parameter synthesis method is mainly used. This method uses digital signal processing technology to treat the human vocalization process as a source that simulates the glottal state and to excite a time-varying digital filter that characterizes the resonance characteristics of the channel. This source may be a periodic pulse sequence, which represents vocal cord vibration in the case of voiced sounds, or a random noise sequence, which represents voiceless sounds. By adjusting the parameters of the filter to be equivalent to changing the shape of the mouth and the vocal tract, the purpose of controlling different sounds is achieved. By adjusting the period or intensity of the excitation source pulse sequence, the pitch and stress of the synthesized speech can be changed [9]. As long as the excitation source and filter parameters are correctly controlled, this model can flexibly synthesize various sentences [10]. The more well-known systems implemented by the parametric synthesis method are the parallel formant synthesizer in the literature [11], the serial / parallel formant synthesizer in the literature [12], and the DECtalk system invented by DEC in 1987. For these systems, as long as the parameters are carefully adjusted, the synthesizer can synthesize clearer speech. 
Although the parameter synthesis method requires little storage and is convenient and flexible in use, it is difficult to extract formant parameters accurately, and the overall sound quality of the synthesized speech rarely meets the practical requirements of a TTS system [13]. With the introduction of PSOLA (pitch synchronous overlap-add), synthesis based on waveform stitching has gradually become the mainstream of TTS research at home and abroad [14]. The waveform stitching method stores a certain amount of recorded human speech in the computer in advance, selects the appropriate speech units from this store during synthesis, and then inserts pauses of different lengths at different positions of the sentence according to the prosodic features to generate highly natural sentences.

Although the synthesis method based on a large corpus improves the sound quality of synthesized speech, it also has some disadvantages. For example, most of the construction of the original speech database requires manual work, and the construction period is too long. Recently, a new synthesis method with a high degree of automation has emerged, namely Trainable TTS [15]. Trainable TTS is established through a set of automated processes that train on pre-recorded speech to form the required synthesis system. Generally, training is performed on models or parameters. In the field of speech signal processing, the Hidden Markov Model (HMM) [16] is the most widely used modeling method. Previously, HMMs were mainly used for speech recognition, where their application is quite mature. At present, most Trainable TTS systems are modeled with HMMs. Some well-known universities and research institutions, such as NIT, Microsoft [17] and IBM [18], have proposed implementation technologies based on Trainable TTS. Like splicing-based speech synthesis, Trainable TTS was not accepted by the industry when it was first introduced. The main reason was that the model training algorithm was immature, so the quality of the synthesized speech was not very good. Later, the model training algorithm was improved, and the introduction of the STRAIGHT analysis synthesizer [19] substantially improved the synthesis quality of Trainable TTS. Compared with the previous three methods, while ensuring high-quality synthesized sound, the most obvious advantage of trainable speech synthesis is that it can automatically build a language-independent speech synthesis system in a short time with little manual intervention. This also provides a good platform for multilingual speech synthesis [20]. In addition, Trainable TTS does not place high demands on hardware storage capacity or computing power during synthesis, which makes it very suitable for embedded environments [21].

3 PSOLA algorithm

The PSOLA algorithm is built around pitch synchronization. In PSOLA, working with complete pitch periods is a necessary condition for the continuity of the spectrum and the waveform. Therefore, before the rhythm is modified with PSOLA, the pitch of the original voice waveform needs to be labeled. Speech generally consists of voiced and unvoiced sounds. Voiced sounds are periodic, whereas unvoiced sounds are close to white noise. It is therefore generally sufficient to label the pitch of the voiced signal and to set the pitch period of the unvoiced signal to a constant, which keeps the algorithm consistent. Pitch labeling records the starting position of the labels, the number of pitch periods, and the sequence of starting positions of each pitch period. After the original waveform of a synthesis unit is labeled, the PSOLA algorithm can insert, delete and modify waveform segments according to the pitch period. The PSOLA algorithm is completed in three steps [22–24]:

The original waveform is analyzed, and a non-parametric intermediate representation is generated. That is, the labeled original speech signal is multiplied with the window function synchronized with the pitch to obtain some analysis short-term speech signals with overlap.

The intermediate representation is modified. That is, the obtained analysis short-term speech signal is transformed in the time domain, such as adjusting the fundamental frequency, duration, and amplitude to determine the relationship between the analysis short-term speech signal and the synthesized short-term speech signal, and a short-term synthesized speech signal sequence synchronized with the target pitch curve is obtained.

The short-term synthesized speech signal sequence is adjusted to be synchronized with the target pitch period, and the short-term synthesized speech signal sequence is overlapped and added to obtain a synthesized speech waveform.
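The first of these steps, pitch-synchronous analysis, can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this study: it assumes a constant pitch period, a Hann window spanning two pitch periods centred on each pitch label, and the hypothetical helper names `hann` and `analysis_segments`.

```python
import math

def hann(length):
    """Symmetric Hann window of odd length; the centre value is 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def analysis_segments(x, marks, period):
    """Cut one windowed short-term signal per pitch label.

    Each segment spans `period` samples on either side of its label,
    so neighbouring segments overlap by one pitch period.
    """
    win = hann(2 * period + 1)
    segments = []
    for t in marks:
        seg = []
        for k in range(-period, period + 1):
            n = t + k
            sample = x[n] if 0 <= n < len(x) else 0.0  # zero-pad outside x
            seg.append(sample * win[k + period])
        segments.append(seg)
    return segments

# Toy voiced signal (assumed data): pitch period of 8 samples,
# with one pitch label every 8 samples.
period = 8
x = [math.cos(2 * math.pi * n / period) for n in range(64)]
marks = list(range(8, 57, period))
segments = analysis_segments(x, marks, period)
```

Each entry of `segments` is one overlapping short-term analysis signal; the later steps re-position and add these segments.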

There are three forms of the PSOLA algorithm: time-domain pitch-synchronous overlap-add (TD-PSOLA), frequency-domain pitch-synchronous overlap-add (FD-PSOLA) and linear-prediction pitch-synchronous overlap-add (LP-PSOLA). The three algorithms differ in how they transform the short-term signal waveform and in their complexity, but the working principle is the same. The differences are as follows:

TD-PSOLA processes the waveform in the time domain and changes the duration by superimposing or deleting some short-term analysis voice signals. The algorithm of this method is relatively simple and efficient, so it is widely used.

FD-PSOLA first applies the Fourier transform to the short-term speech signal to obtain its short-term spectrum and spectral envelope, modifies and recombines the two to match the required synthetic fundamental frequency, and then uses the inverse Fourier transform to generate the synthesized short-term speech signal. This algorithm has high computational complexity.

LP-PSOLA is a combination of TD-PSOLA and LPC. It is mainly used for TD-PSOLA processing of multi-pulse excitation in LPC speech synthesis. LP-PSOLA can use different size windows to estimate the spectral shape and rhythm adjustment.

As the simplest and most widely used pitch-synchronous overlap-add method, TD-PSOLA has many forms. The following describes TD-PSOLA in the sense of spectral equality and TD-PSOLA in the sense of minimum mean square error.

(1) TD-PSOLA algorithm in the sense of spectral equality

For the speech signal x(n), the short-time signal is obtained as: $x_{s} (n) = x (n) w (s - n)$ (1)

In formula (1), w(n) is a window function, which can be a Hanning window or a Hamming window. x_s(n) is the signal sequence obtained by windowing x(n) at s.

Applying the Fourier transform to x_s(n) gives: $F [x_{s} (n)] = X_{s} (e^{jw}) = \sum_{n = - \infty}^{+ \infty} x (n) w (s - n) e^{- jwn}$ (2)

In formula (2), s = t_m, m = 1, 2, ⋯ , M. Among them, t_m is the pitch label of the speech signal x (n). Therefore, there is the following formula. $x_{m} (n) = x (n) w (t_{m} - n), m = 1, 2, \dots, M$ (3)

Formula (3) is a pitch-synchronous analysis of the speech signal x(n), that is, x(n) is converted into a set of windowed sequences at the pitch labels t_m.

We assume that the pitch labels of the final synthesized speech signal y(n) are t_q, q = 1, 2, ⋯ , N. The purpose of TD-PSOLA is to find the synthesized speech signal y(n) with pitch labels t_q that corresponds to the speech signal x(n) with pitch labels t_m while keeping the spectral envelope (timbre) unchanged. The mapping relationship between t_q and t_m can be expressed as: $t_{m} = \varphi (t_{q})$ (4)

t_m is the pitch label of the analysis speech signal, and t_q is the pitch label of the synthesized speech signal. The relationship between t_m and t_q is not necessarily one-to-one, but may be many-to-one or one-to-many.
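The mapping in formula (4) can be illustrated with a small Python sketch. The source does not specify how φ is built; the version below is an assumption that maps each synthesis label t_q to the analysis label nearest to t_q divided by a duration scale factor. The function name `map_marks` and the example labels are hypothetical.

```python
def map_marks(analysis_marks, synthesis_marks, time_scale):
    """For every synthesis label t_q, return the index m of the analysis
    label t_m closest to t_q / time_scale (ties go to the earlier label)."""
    mapping = []
    for t_q in synthesis_marks:
        target = t_q / time_scale
        m = min(range(len(analysis_marks)),
                key=lambda i: abs(analysis_marks[i] - target))
        mapping.append(m)
    return mapping

analysis_marks = [100, 200, 300, 400]

# Doubling the duration at the same pitch period: each analysis label
# is reused for two synthesis labels.
stretched = map_marks(analysis_marks, [200, 300, 400, 500, 600, 700, 800], 2.0)
# stretched == [0, 0, 1, 1, 2, 2, 3]

# Halving the duration at the same pitch period: every other analysis
# label is skipped.
compressed = map_marks(analysis_marks, [50, 150], 0.5)
# compressed == [0, 2]
```

Stretching reuses analysis segments and compression drops them, which is why the relationship between t_m and t_q is not one-to-one in general.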

Spectral equality means that the Fourier transform of each windowed sequence of the synthesized speech signal is equal to the Fourier transform of the corresponding windowed sequence of the analyzed speech signal. If it is assumed that t_q corresponds to t_m, then the windowed sequence of the speech signal x(n) at t_m is x_{t_m}(n), and the windowed sequence of the synthesized speech signal y(n) at t_q is y_{t_q}(n). In order to compare the Fourier transforms of the two windowed sequences, it is necessary to move the centers of the analysis window and the synthesis window to 0 at the same time, then: $\begin{aligned} x_{t_{m}}^{'} (n) &= x_{t_{m}} (n + t_{m}) = x (n + t_{m}) w (- n) \\ y_{t_{q}}^{'} (n) &= y_{t_{q}} (n + t_{q}) = y (n + t_{q}) w (- n) \end{aligned}$ (5)

The analysis window and the synthesis window use the same window function w(n). We assume that under spectral equality, the Fourier transform of y_{t_q}(r + t_q) is equal to the Fourier transform of x_{t_m}(r + t_m) in each frequency component. According to the shift theorem, we can obtain: $e^{j w_{k} t_{m}} X_{t_{m}} (e^{j w_{k}}) = e^{j w_{k} t_{q}} Y_{t_{q}} (e^{j w_{k}})$ (6)

The synthesized speech signal y(n) is obtained by superimposing all the short-term synthesized speech signals y_{t_q}(n), q = 1, 2, ⋯ , N, namely $y (n) = \sum_{t_{q}} y_{t_{q}} (n)$ (7)

By writing each y_{t_q}(n) as an inverse Fourier transform, we can obtain: $\begin{aligned} y (n) &= \sum_{t_{q}} \left[ \frac{1}{N} \sum_{k = 0}^{N - 1} Y_{t_{q}} (e^{j w_{k}}) e^{j w_{k} n} \right] \\ &= \sum_{t_{q}} \left[ \frac{1}{N} \sum_{k = 0}^{N - 1} e^{j w_{k} t_{q}} Y_{t_{q}} (e^{j w_{k}}) e^{j w_{k} (n - t_{q})} \right] \end{aligned}$ (8)

By substituting Equation (6) into (8), we can obtain: $\begin{matrix} y (n) = \sum_{t_{q}} [\frac{1}{N} \sum_{k = 0}^{N - 1} e^{{jw}_{k} t_{m}} X_{t_{m}} (e^{{jw}_{k}}) e^{{jw}_{k} (n - t_{q})}] \\ = \sum_{t_{q}} [\frac{1}{N} \sum_{k = 0}^{N - 1} e^{- {jw}_{k} (t_{q} - t_{m})} X_{t_{m}} (e^{{jw}_{k}}) e^{{jw}_{k} n}] \end{matrix}$ (9)

The inverse Fourier transform of X_{t_m}(e^{jw}) is: $x_{t_{m}} (n) = \frac{1}{N} \sum_{k = 0}^{N - 1} X_{t_{m}} (e^{j w_{k}}) e^{j w_{k} n}$ (10)

Through the shift theorem, we can obtain: $x_{t_{m}} (n - (t_{q} - t_{m})) = \frac{1}{N} \sum_{k = 0}^{N - 1} e^{- j w_{k} (t_{q} - t_{m})} X_{t_{m}} (e^{j w_{k}}) e^{j w_{k} n}$ (11)

Through formulas (3), (9), (11), we can obtain: $\begin{aligned} y (n) &= \sum_{t_{q}} x_{t_{m}} (n - (t_{q} - t_{m})) \\ &= \sum_{t_{q}} w (t_{m} - (n - (t_{q} - t_{m}))) x (n - (t_{q} - t_{m})) \\ &= \sum_{t_{q}} w (t_{q} - n) x (n - (t_{q} - t_{m})) \end{aligned}$ (12)

Formula (12) is the speech synthesis formula of time-domain pitch-synchronous overlap-add in the sense of spectral equality. This synthesis method ensures that the original speech signal x(n) and the synthesized speech signal y(n) are equal in the spectral sense. In order to make up for the energy lost during the adjustment of the fundamental frequency, an energy compensation factor α_q is introduced here. Therefore: $\begin{aligned} y (n) &= \sum_{t_{q}} \alpha_{q} w (t_{q} - n) x (n - (t_{q} - t_{m})) \\ &= \sum_{t_{q}} \alpha_{q} x_{t_{m}} (n - t_{q} + t_{m}) \end{aligned}$ (13)

The above is the process of the TD-PSOLA algorithm in the sense of spectral equality.
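The overlap-add of formula (13) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's implementation: the pitch period is constant, w(n) is a symmetric Hann window spanning two pitch periods, α_q is a single constant `alpha`, and the name `td_psola` and the toy signal are hypothetical.

```python
import math

def td_psola(x, analysis_marks, synthesis_marks, mapping, period, alpha=1.0):
    """Overlap-add in the sense of formula (13).

    For every synthesis label t_q, the segment of x centred on the mapped
    analysis label t_m is windowed, scaled by alpha and added around t_q.
    """
    size = 2 * period + 1
    # Symmetric Hann window spanning two pitch periods (assumed choice).
    win = [0.5 - 0.5 * math.cos(2 * math.pi * i / (size - 1))
           for i in range(size)]
    y = [0.0] * (synthesis_marks[-1] + period + 1)
    for t_q, m in zip(synthesis_marks, mapping):
        t_m = analysis_marks[m]
        for k in range(-period, period + 1):
            src, dst = t_m + k, t_q + k
            if 0 <= src < len(x) and 0 <= dst < len(y):
                y[dst] += alpha * win[k + period] * x[src]
    return y

period = 8
x = [math.sin(2 * math.pi * n / period) for n in range(64)]
marks = list(range(8, 57, period))  # analysis labels t_m

# Identity mapping: overlapping Hann windows sum to one between the first
# and last label, so y reproduces x there.
y_same = td_psola(x, marks, marks, list(range(len(marks))), period)

# Raise the pitch: place synthesis labels every 6 samples instead of 8
# and reuse the nearest analysis segment for each of them.
new_marks = list(range(8, 57, 6))
nearest = [min(range(len(marks)), key=lambda i: abs(marks[i] - t))
           for t in new_marks]
y_high = td_psola(x, marks, new_marks, nearest, period)
```

With identical analysis and synthesis labels the output matches the input between the outermost labels, which is a convenient sanity check before changing pitch or duration.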

(2) TD-PSOLA in the sense of minimum mean square error

We let a given t_q correspond to a given t_m. The windowed sequence of the analysis speech signal x(n) at t_m is x_{t_m}(n), and the windowed sequence of the synthesized speech signal y(n) at t_q is y_{t_q}(n), that is: x_{t_m}(n) = w₁(t_m - n) x(n), y_{t_q}(n) = w₂(t_q - n) y(n). Here, w₁(n) and w₂(n) are the analysis window and the synthesis window, respectively. The distance measure between x(n) and y(n) is defined as: $\begin{aligned} D [y (n), x (n)] &= \sum_{t_{q}} \frac{1}{2 \pi} \int_{- \pi}^{\pi} {\left| e^{j w t_{m}} X_{t_{m}} (e^{jw}) - e^{j w t_{q}} Y_{t_{q}} (e^{jw}) \right|}^{2} dw \\ &= \sum_{t_{q}} \sum_{n = - \infty}^{+ \infty} {\left[ w_{1} (t_{m} - (n + t_{m})) x (n + t_{m}) - w_{2} (t_{q} - (n + t_{q})) y (n + t_{q}) \right]}^{2} \\ &= \sum_{t_{q}} \sum_{n = - \infty}^{+ \infty} {\left[ w_{1} (t_{q} - n) x (n - t_{q} + t_{m}) - w_{2} (t_{q} - n) y (n) \right]}^{2} \end{aligned}$ (14)

Minimizing the distance measure D[y(n), x(n)] requires: $\frac{\partial D [y (n), x (n)]}{\partial y (n)} = 0$ (15)

$y (n) = \frac{\sum_{t_{q}} w_{1} (t_{q} - n) w_{2} (t_{q} - n) x (n - t_{q} + t_{m})}{\sum_{t_{q}} w_{2}^{2} (t_{q} - n)}$ (16)

As before, in order to make up for the energy loss, it is also necessary to introduce an energy compensation factor α_q, that is: $y (n) = \frac{\sum_{t_{q}} \alpha_{q} w_{1} (t_{q} - n) w_{2} (t_{q} - n) x (n - t_{q} + t_{m})}{\sum_{t_{q}} w_{2}^{2} (t_{q} - n)}$ (17)

In formula (17), the value of α_q is related to the length of the selected window. Generally, a window length τ of 2p to 4p is taken, where p is the pitch period. When τ = 2p, α_q = 1. In the above formula, the mapping relationship between t_m and t_q is given by formula (4). Since the algorithm does not modify the short-time signal, the analysis window and the synthesis window can be the same, namely w₁(n) = w₂(n). The synthesis formula is therefore a simple superposition process, except that the synthesis window function is now the square of w₂(n). The denominator is a time-varying normalization factor. Under wide-band conditions its value is a constant, and under narrow-band conditions it is also approximately constant. With this approximation, the synthesis formula becomes: $y (n) = \sum_{t_{q}} \alpha_{q} x_{t_{m}} (n - t_{q} + t_{m})$ (18)

The above is the synthesis process of TD-PSOLA in the sense of minimum mean square error. Formula (18) and formula (13) are the same. This shows that when certain constraints are met, TD-PSOLA in the sense of minimum mean square error is the same as TD-PSOLA in the sense of spectral equality.

4 C4.5 algorithm

In this paper, the C4.5 algorithm is used to train a decision tree for judging the pronunciation of polyphones. Generally speaking, a decision tree contains two types of nodes: leaf nodes and decision nodes. A leaf node points to a classification result, and a decision node generates a new branch or subtree. The C4.5 algorithm was proposed by Quinlan and is a widely used decision tree classifier. It comprises a decision tree construction algorithm and a rule generation algorithm. C4.5 works by deriving the classifier model from a large amount of training data.

A classic example of a decision tree is to use a decision tree to predict whether the weather on a certain day is suitable for golf. There are two results of the decision: play and notplay. In the process of decision-making, there are many factors that affect the decision result, and the set of these factors can be called a problem set. The problem set of the golf problem is shown in Table 1.

Table 1
Golf question set

Attribute name	Continuity of values	Value range
OUTLOOK	Discrete	sunny, overcast, rain
TEMPERATURE	Continuous	Positive integer
HUMIDITY	Continuous	Positive integer
WIND	Discrete	Yes, No

C4.5 algorithm is used to train the above data, and the decision tree shown in Fig. 1 is obtained.

Fig. 1

Decision tree for the golf problem.

The following conditions need to be met when applying C4.5 algorithm for classification:

The data to be classified needs to be described by a set of attributes.

The possible classification results are defined in advance.

The results of the classification are discrete.

The result of the construction is a set of rules or decision trees, and the classification is described by some logical expressions.

In the process of constructing the decision tree, the C4.5 algorithm compares the information gain of each descriptive attribute and selects the attribute with the largest information gain to perform the classification. The decision at each classification step is related to the selected target classification. The best measure of classification uncertainty is information entropy: $S = - \sum_{i} P_{i} \log (P_{i})$ (19)

If it is assumed that there are two classes P and N, and the record set S contains x records of class P and y records of class N, then the amount of information needed to determine which class any record in S belongs to is: $Info (S) = Info (S_{p}, S_{n}) = - \left( \frac{x}{x + y} \cdot \log \frac{x}{x + y} + \frac{y}{x + y} \cdot \log \frac{y}{x + y} \right)$ (20)
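Formula (20) can be checked numerically with a short Python sketch. The base of the logarithm is not stated in the text; base 2 (information in bits) is assumed here, and the function name `info` and the record counts are illustrative only.

```python
import math

def info(x, y):
    """Information (entropy in bits) of a set with x records of class P
    and y records of class N, following formula (20)."""
    total = x + y
    result = 0.0
    for count in (x, y):
        if count > 0:  # the term 0 * log(0) is taken as 0
            p = count / total
            result -= p * math.log2(p)
    return result

mixed = info(9, 5)      # about 0.940 bits
balanced = info(7, 7)   # 1.0 bit: maximum uncertainty for two classes
pure = info(14, 0)      # 0.0 bits: all records belong to one class
```

The information is largest when the two classes are equally frequent and falls to zero when the set is pure, which is what makes it usable as a splitting criterion.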

We assume that the variable D is the root node and that it divides the record set S into the subsets (S₁, S₂, ⋯ , S_k), where each S_i has x_i records belonging to class P and y_i records belonging to class N. Then, the amount of information needed to classify within all subsets is: $Info (D, S) = \sum_{i = 1}^{k} \frac{x_{i} + y_{i}}{x + y} Info (S_{ip}, S_{in})$ (21)

If variable D is selected as a classification node, its information gain must be greater than that of every other variable. The information gain of D is: $Gain (D) = Info (S) - Info (D, S)$ (22)

The definition of the information gain function is: $Gain (D, S) = Info (S) - Info (D, S)$ (23)

Among them, with p_i = |C_i| / |S| denoting the proportion of records in S that belong to class C_i: $Info (S) = I (p_{1}, p_{2}, \dots, p_{k}) = I \left( \frac{| C_{1} |}{| S |}, \frac{| C_{2} |}{| S |}, \dots, \frac{| C_{k} |}{| S |} \right) = - (p_{1} \cdot \log p_{1} + p_{2} \cdot \log p_{2} + \dots + p_{k} \cdot \log p_{k})$ (24) $Info (D, S) = \sum_{i = 1}^{n} (| S_{i} | / | S |) \cdot Info (S_{i})$ (25)
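Formulas (23) to (25) can be sketched in Python as follows. The records are illustrative values in the style of the golf example and are not data from this study; base-2 logarithms and the names `entropy` and `info_gain` are assumptions made for the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(S) of formula (24), in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(records, attribute, target):
    """Gain(D, S) = Info(S) - Info(D, S), formulas (23) and (25)."""
    subsets = {}
    for r in records:
        subsets.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(s) / len(records) * entropy(s)
                    for s in subsets.values())
    return entropy([r[target] for r in records]) - remainder

# Illustrative records (assumed): (OUTLOOK, WIND, decision).
rows = [
    ("sunny", "no", "notplay"), ("sunny", "yes", "notplay"),
    ("overcast", "no", "play"), ("rain", "no", "play"),
    ("rain", "no", "play"), ("rain", "yes", "notplay"),
    ("overcast", "yes", "play"), ("sunny", "no", "notplay"),
    ("sunny", "no", "play"), ("rain", "no", "play"),
    ("sunny", "yes", "play"), ("overcast", "yes", "play"),
    ("overcast", "no", "play"), ("rain", "yes", "notplay"),
]
records = [{"OUTLOOK": o, "WIND": w, "PLAY": p} for o, w, p in rows]

gain_outlook = info_gain(records, "OUTLOOK", "PLAY")  # about 0.247
gain_wind = info_gain(records, "WIND", "PLAY")        # about 0.048
best = max(["OUTLOOK", "WIND"], key=lambda a: info_gain(records, a, "PLAY"))
```

On these records OUTLOOK has the larger gain, so it would be chosen as the classification node ahead of WIND.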

In the process of learning with C4.5, the user does not need to know a lot of relevant knowledge, but the user must understand which attributes can affect the results of the C4.5 classification. Before learning, the user should create a corresponding problem set based on the content to be learned. The problem set is a collection of attributes that affect the classification result. To determine the pronunciation of a polyphonic character is equivalent to classifying the pronunciation of the polyphonic character, so it is necessary to select some attributes that can affect the pronunciation of the polyphonic character to establish the problem set.

5 System construction

The prototype system consists of three modules: text processing module, prosody processing module, and voice synthesis module. Among them, the word segmentation processing in the prosody processing module is completed by ICTCLAS. The main purpose of constructing the prototype is to evaluate the performance of pronunciation discrimination method based on part-of-speech rules and HMM-based prosody hierarchy prediction in speech synthesis systems. Therefore, the prototype system uses pronunciation discrimination method based on part-of-speech rules to solve the problem of polyphonic characters in the process of grapheme to phoneme conversion and uses HMM-based prosodic hierarchical structure prediction to extract the prosodic information of the text. In the sound synthesis module, the system uses the waveform stitching method and PSOLA to synthesize sound, and the speech library uses the Keda Xunfei Xiaojing speech library. The block diagram of the prototype system is shown in Fig. 2.

Fig. 2

Structure of the grapheme to phoneme conversion system.

For words that cannot be distinguished by morphological structure, the label learning can be done by machine learning methods. Primary and secondary stresses are considered together, and attributes are extracted for the syllable of a word. In this article, the labeling of the primary and secondary stress is carried out separately, and it is assumed that there is only one primary stress for a word. Taking a three-syllable word as an example, the main stress may appear in three positions. If the main stress occurs in the first syllable, the word is labeled as 0, and if it appears in the second syllable, it is labeled as 1, and so on.

During learning, attributes will be extracted for the entire word. Since the main stress can appear in multiple positions, for a two-syllable or multi-syllable word, we do not know which syllable or which syllable component is more important for determining the position of the main stress. Therefore, for the sake of safety, we take all syllables of a word into consideration to determine the position of the main stress.

Taking the disyllabic word balance as an example, if the onset (first sound), nucleus (core) and coda (tail sound) of every syllable of a word are extracted as attributes, 6 attributes can be extracted, as shown in Table 2.

Table 2

The attributes extracted for each syllable of balance

O_0	b
N_0	æ
C_0	0
O_1	l
N_1	ǝ
C_1	ns
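The attribute layout of Table 2 can be reproduced with a short Python sketch. It assumes that each syllable is already available as an (onset, nucleus, coda) triple and that an empty slot is written as the string "0"; the function name `syllable_attributes` and the input format are hypothetical.

```python
def syllable_attributes(syllables):
    """Flatten (onset, nucleus, coda) triples into O_i / N_i / C_i attributes.

    Empty slots are encoded as "0".
    """
    attrs = {}
    for i, (onset, nucleus, coda) in enumerate(syllables):
        attrs[f"O_{i}"] = onset or "0"
        attrs[f"N_{i}"] = nucleus or "0"
        attrs[f"C_{i}"] = coda or "0"
    return attrs

# "balance" split into two syllables; the first syllable has no coda.
balance = [("b", "æ", ""), ("l", "ǝ", "ns")]
attrs = syllable_attributes(balance)
```

A word with k syllables yields 3k attributes in this layout, so shorter words give smaller instances.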

Each attribute is identified by the initial letter of its English name (O for onset, N for nucleus, C for coda) and the index of the corresponding syllable in the word; for example, C_0 denotes the coda of the first syllable. If an attribute has no corresponding content, for example when the first syllable has no coda, its value is 0, as for C_0 here. The distribution of words across syllable counts is extremely uneven. Taking the lexicon used in this system as an example, the distribution of words by number of syllables is shown in Table 3.

Table 3

Distribution of words by number of syllables in the lexicon

Syllables	Word count	Proportion
1	4779	16.58%
2	11620	40.32%
3	7348	25.49%
4	3631	12.60%
5	1242	4.31%
6	184	0.64%
7	16	0.06%
8	2	0.01%
Total	28822	100.00%

As can be seen from Table 3, apart from monosyllabic words, the fewer syllables a group of words has, the greater its share of the lexicon. After excluding monosyllabic words, the remaining words can be learned in groups by syllable count, that is, all two-syllable words are learned as one group, all three-syllable words as another, and so on. The fewer the syllables, the fewer the attributes in each instance, which can greatly simplify the learning process and improve the learning effect.

In this study, the transformation-based learning method is used to learn the label of the main stress, and the learning through this method is an error-driven greedy search process. In each round of learning, the labeled text-to-speech stream is compared with the correct text-to-speech stream, and the wrongly labeled text-to-speech stream is converted using a conversion formula. The learning process based on the transformational learning method is shown in Fig. 3.

Fig. 3

Learning process based on transformational learning.

It can be seen from Fig. 3 that the whole learning process has two main steps: initial labeling and label learning.

Initial labeling means that every word in the unlabeled text-to-speech stream is first assigned a default label. Taking disyllabic words as an example, among all 11,620 disyllabic words, the main stress of 9180 words is on the first syllable, accounting for 79% of the total. Therefore, the initial labeling of disyllabic words labels all disyllabic words in the learning corpus as 0, that is, the main stress is on the first syllable. In the subsequent label learning process, for each conversion template, the text-to-speech symbols after the initial labeling are compared with the correct text-to-speech symbols. For the words whose main stress is labeled incorrectly, a candidate conversion formula is then generated.

Then, the second round of learning is conducted. First, the rule “N_0 = ǝ => P = 1” just learned is applied to the learning corpus, that is, all disyllabic words whose first-syllable core is the vowel /ǝ/ are relabeled as 1. The above learning process is then repeated until no candidate conversion formula has a score greater than 0. At this point, the learning is over.
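
The greedy, error-driven loop described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: it assumes a single conversion template of the form "attribute = value => P = label", scores a candidate as the number of words it corrects minus the number it breaks, and runs on five hand-made toy instances. The names `score`, `learn` and `inst` are hypothetical.

```python
KEYS = ("O_0", "N_0", "C_0", "O_1", "N_1", "C_1")


def inst(*values):
    """Toy helper: build one disyllabic instance from six attribute values."""
    return dict(zip(KEYS, values))


def score(rule, instances, labels, gold):
    """Net gain of a rule: words it corrects minus words it breaks."""
    attr, value, new_label = rule
    gain = 0
    for attrs, current, correct in zip(instances, labels, gold):
        if attrs.get(attr) != value or current == new_label:
            continue  # the rule does not fire, or it changes nothing
        if correct == new_label:
            gain += 1
        elif correct == current:
            gain -= 1
    return gain


def learn(instances, gold, initial_label=0):
    """Initial labeling followed by greedy rule learning."""
    labels = [initial_label] * len(instances)
    rules = []
    while True:
        # Candidates are generated only from wrongly labeled words.
        candidates = sorted({
            (attr, value, correct)
            for attrs, current, correct in zip(instances, labels, gold)
            if current != correct
            for attr, value in attrs.items()
        })
        if not candidates:
            break
        best = max(candidates, key=lambda r: score(r, instances, labels, gold))
        if score(best, instances, labels, gold) <= 0:
            break  # no candidate improves the labeling any further
        rules.append(best)
        attr, value, new_label = best
        labels = [new_label if attrs.get(attr) == value else current
                  for attrs, current in zip(instances, labels)]
    return rules, labels


# Toy corpus (illustrative only): balance, about, again, happy, machine.
instances = [
    inst("b", "æ", "0", "l", "ǝ", "ns"),
    inst("0", "ǝ", "0", "b", "aʊ", "t"),
    inst("0", "ǝ", "0", "g", "e", "n"),
    inst("h", "æ", "0", "p", "i", "0"),
    inst("m", "ǝ", "0", "ʃ", "iː", "n"),
]
gold = [0, 1, 1, 0, 1]  # index of the stressed syllable

rules, labels = learn(instances, gold)
print(rules, labels)
```

On this toy corpus the first round selects the rule that relabels words whose first-syllable core is /ǝ/, after which no wrongly labeled word remains and learning stops.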

6 System performance test

The experimental evaluation is conducted in an open and centralized manner, and the evaluation indices are those used in the international Chinese word segmentation competition. Each index is defined as follows:

Accuracy (P) = the number of correctly segmented words in the segmentation result / the number of all segmented words in the segmentation result × 100%.

Recall rate (R) = the number of correctly segmented words in the segmentation result / the number of all segmented words in the standard answer × 100%.

F-measure = 2PR / (P + R).
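
The three indices can be computed as in the sketch below. It assumes that a segmented word counts as correct when its character span matches a span in the standard answer exactly; this matching criterion, the function names `spans` and `prf`, and the example strings are assumptions made for illustration.

```python
def spans(words):
    """Turn a list of segmented words into a set of (start, end) spans."""
    result, start = set(), 0
    for word in words:
        result.add((start, start + len(word)))
        start += len(word)
    return result


def prf(predicted, reference):
    """Return accuracy P, recall R and F-measure, all in percent."""
    pred, ref = spans(predicted), spans(reference)
    correct = len(pred & ref)  # words segmented exactly as in the answer
    p = 100.0 * correct / len(pred)
    r = 100.0 * correct / len(ref)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical example: one of two predicted words matches the answer.
p, r, f = prf(["text", "tospeech"], ["text", "to", "speech"])
print(p, r, f)
```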

In this paper, training is carried out on a small-scale text corpus, and other materials are selected for open testing. The comparison between the algorithm proposed in this paper and the traditional algorithms is shown in Table 4 and Fig. 4.

Table 4
Comparison table between the algorithm proposed in this paper and traditional algorithm

Algorithm	P (%)	R (%)	F (%)
Forward maximum matching	87.1	85.1	86.2
Reverse maximum matching	88.5	87.2	88.4
Binary grammar	92.2	92.5	92.5
Proposed algorithm	94.1	92.9	93.4

Fig. 4

Comparison diagram between the algorithm proposed in this paper and traditional algorithm.

It can be seen from the above results that the algorithm proposed in this paper improves on the traditional algorithms in both accuracy and recall rate: it reaches 94.1% accuracy and 92.9% recall, compared with 92.2% and 92.5% for the best traditional algorithm (binary grammar).

In terms of word segmentation speed, the algorithm proposed in this paper is compared with other word segmentation algorithms as shown in Table 5 and Fig. 5.

Table 5

Comparison table of word segmentation speed

Algorithm	Word segmentation speed (4D/min)
Forward maximum matching	6.7
Reverse maximum matching	4.8
Binary grammar	3.5
Proposed algorithm	8.5

Fig. 5

Comparison diagram of word segmentation speed.

From Table 5 and Fig. 5, it can be seen that the algorithm proposed in this study is faster than the traditional algorithms in word segmentation (8.5 compared with 3.5 to 6.7 in the units of Table 5), and this speed fully meets the needs of English text-to-speech conversion.

Through the above analysis, it can be seen that the algorithm proposed in this paper has advantages in accuracy, recall and recognition speed on small-scale text, and can meet the needs of the English text-to-speech conversion system. Next, the algorithm is applied in a practical experiment on 50 groups of text data, each containing 100 texts, so that the algorithm performs 5000 identifications in total, which is sufficient to test its performance. The binary grammar algorithm is used as a control, and the results are shown in Table 6 and Fig. 6.

Fig. 6

Algorithm performance test chart.

Table 6

Algorithm performance test table

Group	Binary grammar	Paper model	Group	Binary grammar	Paper model
1	72.0	87.5	26	56.2	89.5
2	57.6	89.7	27	50.5	92.0
3	70.2	90.2	28	63.8	90.5
4	54.2	89.5	29	50.9	90.9
5	58.8	85.2	30	55.8	91.3
6	49.1	89.4	31	57.8	91.6
7	58.6	91.8	32	58.2	91.7
8	70.8	88.3	33	57.4	85.1
9	67.2	88.0	34	70.1	85.7
10	65.8	91.0	35	51.5	91.5
11	71.8	86.1	36	68.0	86.0
12	53.8	85.1	37	68.7	87.6
13	54.4	91.7	38	62.7	87.6
14	49.0	87.3	39	55.9	87.8
15	70.8	88.7	40	63.0	87.0
16	48.2	85.9	41	68.3	88.7
17	59.6	91.7	42	65.3	88.2
18	58.6	91.5	43	49.7	90.4
19	62.7	87.7	44	68.7	89.3
20	68.8	90.6	45	51.9	88.8
21	48.6	85.8	46	65.2	88.4
22	56.2	91.2	47	60.3	85.5
23	68.6	89.8	48	56.9	90.2
24	59.4	91.7	49	67.5	87.4
25	70.2	90.2	50	48.6	85.9

It can be seen from Table 6 and Fig. 6 that the algorithm proposed in this paper performs well: its accuracy rate is above 85% in all 50 groups and is higher than that of the binary grammar algorithm in every group, which shows that the algorithm can be applied to the grapheme-to-phoneme conversion system.
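
The reading of Table 6 can be checked with a few lines of code. The values below are copied from Table 6 in group order; the variable names are illustrative, and the script only summarizes the reported figures without adding new results.

```python
from statistics import mean

# Accuracy per group (1 to 50), copied from Table 6.
binary_grammar = [
    72.0, 57.6, 70.2, 54.2, 58.8, 49.1, 58.6, 70.8, 67.2, 65.8,
    71.8, 53.8, 54.4, 49.0, 70.8, 48.2, 59.6, 58.6, 62.7, 68.8,
    48.6, 56.2, 68.6, 59.4, 70.2, 56.2, 50.5, 63.8, 50.9, 55.8,
    57.8, 58.2, 57.4, 70.1, 51.5, 68.0, 68.7, 62.7, 55.9, 63.0,
    68.3, 65.3, 49.7, 68.7, 51.9, 65.2, 60.3, 56.9, 67.5, 48.6,
]
paper_model = [
    87.5, 89.7, 90.2, 89.5, 85.2, 89.4, 91.8, 88.3, 88.0, 91.0,
    86.1, 85.1, 91.7, 87.3, 88.7, 85.9, 91.7, 91.5, 87.7, 90.6,
    85.8, 91.2, 89.8, 91.7, 90.2, 89.5, 92.0, 90.5, 90.9, 91.3,
    91.6, 91.7, 85.1, 85.7, 91.5, 86.0, 87.6, 87.6, 87.8, 87.0,
    88.7, 88.2, 90.4, 89.3, 88.8, 88.4, 85.5, 90.2, 87.4, 85.9,
]

# The paper model stays above 85% and beats the control in every group.
always_above_85 = min(paper_model) > 85.0
always_better = all(p > b for p, b in zip(paper_model, binary_grammar))

print(min(paper_model), max(paper_model), round(mean(paper_model), 2))
print(min(binary_grammar), max(binary_grammar), round(mean(binary_grammar), 2))
print(always_above_85, always_better)
```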

7 Conclusion

In order to study the prediction of intonation phrase boundaries, this paper first applies part-of-speech tagging to the text corresponding to the speech corpus and labels the middle phrases and intonation phrases according to the actual pauses in the speech corpus. Within an intonation phrase, chunks are taken as the basic unit, and all chunks are labeled with the above two labels. This makes the boundaries of intonation phrases easy to identify, so that the problem of segmenting intonation phrases is transformed into a chunk labeling problem. Since the vocabulary of English is extremely large, it is impractical in English speech synthesis to build a large text-to-speech database in order to obtain the text-to-speech transcription of words. Therefore, we convert English words directly into text-to-speech symbols through the algorithm, that is, text-to-speech conversion. Through comparative experiments, the performance of the proposed algorithm is evaluated in terms of recognition speed, recall rate and accuracy rate. The results show that the algorithm proposed in this paper has good performance and a certain practical value.

Acknowledgments

2019–2020 educational reform project of Education Department of Inner Mongolia Autonomous Region, Research on the Present Situation and Strategy of Chinese Cultural Input and Output in College English Teaching. Item number: 2019 NMGJ006.

References

1. Hossain M.S. and Muhammad, Healthcare Big Data Voice Pathology Assessment Framework, IEEE Access 43(9) (2016), 15–26.

2. Hill A.K., Rodrigo Cárdenas, Wheatley J.R., et al., Are there vocal cues to human developmental stability? Relationships between facial fluctuating asymmetry and voice attractiveness, Evolution & Human Behavior 38(2) (2017), 249–258.

3. Woźniak and Połap, Voice recognition through the use of Gabor transform and heuristic algorithm, Nephron Clinical Practice 63(2) (2017), 159–164.

4. Haderlein, Döllinger, Matoušek, et al., Objective voice and speech analysis of persons with chronic hoarseness by prosodic analysis of speech samples, Logopedics Phoniatrics Vocology 41(3) (2015), 106–116.

5. Nidhyananthan S.S., Muthugeetha and Vallimayil, Human Recognition using Voice Print in LabVIEW, International Journal of Applied Engineering Research 13(10) (2018), 8126–8130.

6. Malallah F.L., Saeed K.N.Y.M.G., Abdulameer S.D., et al., Vision-Based Control By Hand-Directional Gestures Converting To Voice, International Journal of Scientific & Technology Research 7(7) (2018), 185–190.

7. Morgan, Contact effects on voice-onset time in Patagonian Welsh, Acoustical Society of America Journal 140(4) (2016), 3111–3111.

8. Mohan, Hamilton, Grasberger, et al., Realtime voice activity and pitch modulation for laryngectomy transducers using head and facial gestures, Journal of the Acoustical Society of America 137(4) (2015), 2302–2302.

9. Kang T.G. and Kim N.S., DNN-Based Voice Activity Detection with Multi-Task Learning, IEICE Transactions on Information & Systems E99.D(2) (2016), 550–553.

10. Choi H.-N., Byun S.-W. and Lee S.-P., Discriminative Feature Vector Selection for Emotion Classification Based on Speech, Transactions of the Korean Institute of Electrical Engineers 64(9) (2015), 1363–1368.

11. Oki Takuro, Scene Text Localization Using Object Detection Based on Filtered Feature Channels and Crosswise Region Merging, Growth & Change 21(3) (2015), 61–76.

12. Kamble R.R. and Kodavade D.V., Relevance Feature Search for Text Mining using FClustering Algorithm, International Journal of Computer Sciences & Engineering 6(7) (2018), 223–227.

13. Maruthupandi and Devi K.V., Multi-label text classification using optimised feature sets, International Journal of Data Mining Modelling & Management 9(3) (2017), 237.

14. Pandi Maruthu, Devi K.V. and Rajendran, Efficient Feature Extraction for Text Mining, Advances in Natural & Applied Sciences 10(4) (2016), 64–73.

15. , Zhao and Han, A Fingerprint Feature Extraction Algorithm based on Optimal Decision for Text Copy Detection, International Journal of Security & Its Applications 10(11) (2016), 67–78.

16. Soleymanpour and Marvi, Text-independent speaker identification based on selection of the most similar feature vectors, International Journal of Speech Technology 20(1) (2016), 1–10.

17. Mojaveriyan, Ebrahimpourkomleh and Mousavirad S.J., IGICA: A Hybrid Feature Selection Approach in Text Categorization, International Journal of Intelligent Systems Technologies & Applications 8(3) (2016), 42–47.

18. Aghdam M.H. and Heidari, Feature Selection Using Particle Swarm Optimization in Text Categorization, Journal of Artificial Intelligence & Soft Computing Research 5(4) (2015), 38–43.

19. Robati, Zahedi and Fayazi Far, Feature Selection and Reduction for Persian Text Classification, International Journal of Computer Applications 109(17) (2015), 1–5.

20. Hussain, et al., Estimating Virtual Trust of Cognitive Agents Using Multi Layered Socio-fuzzy Inference System, Journal of Intelligent & Fuzzy Systems 37(2) (2019), 2769–2784.

21. Zia, Abbas and Akhtar M.P., Evaluation of Feature Selection Approaches for Urdu Text Categorization, International Journal of Intelligent Systems Technologies & Applications 07(6) (2015), 33–40.

22. Zia Tehseen, Akhter Muhammad Pervez and Abbas Qaiser, Comparative Study of Feature Selection Approaches for Urdu Text Categorization, Malaysian Journal of Computer Science 28(2) (2015), 93–109.

23. Dong and Hou, A Useful Method for Analyzing Incomplete and Inconsistent Information: Paraconsistent Soft Sets and Corresponding Decision Making Methods, Journal of Intelligent & Fuzzy Systems 37(1) (2019), 901–912.

24. , Jin X.Z. and LiHua, Text recognition algorithm based on text features, International Journal of Multimedia & Ubiquitous Engineering 11(5) (2016), 209–220.