Application of machine learning and image target recognition in English learning task

Abstract

In detecting pronunciation errors, artificial intelligence speech recognition mostly judges the accuracy of grammar or sentences, and there has been little research on judging pronunciation itself, so pronunciation cannot be corrected effectively. This study analyzes the application of image target recognition in English learning tasks. The task-based approach emphasizes the process of English learning rather than the result, emphasizes purposeful communication and the expression of meaning, encourages learners to speak, and requires that English learning activities and tasks be realistic. In addition, this paper introduces a DNN adaptation technique based on KL divergence regularization to adapt the acoustic model. Finally, the algorithm of this study is compared experimentally with the traditional algorithm. The results show that the algorithm recognizes confusable phonemes better than traditional algorithms, which supports the introduction of error correction algorithms into education networks. Using the platform of an autonomous learning center, students can improve their English level through training by completing tasks chosen by teachers or by themselves.

Keywords

Deep learning; image target recognition; DNN algorithm; English learning

1 Introduction

On the one hand, the reform of English teaching reflects the new requirements for foreign language talents brought about by China's reform and opening up. For individuals, it is reflected in the way learners' needs for language learning change with the needs of society. On the other hand, the reform of teaching methods reflects a deeper understanding of the nature of language and of the theory and practice of language teaching and learning. At the same time, as more and more people master the technology of the network information age, language teaching has expanded in both space and time, and the application of new technology in teaching has, in turn, given people a new perspective on the understanding of language. Network task-based teaching is the concrete embodiment of these new needs, new technologies and new theories in teaching. Task-based language teaching emphasizes the learning process rather than the results. It emphasizes purposeful communication and the expression of meaning, encourages learners to speak, and requires that language learning activities and tasks be realistic. The purpose of language learning is to use language in life. Teaching activities and tasks progress gradually from easy to difficult, a principle that treats language learning as a process of acquiring skills. Network task-based teaching combines network and information technology with the task-based teaching mode and focuses on a certain topic, so that task-based teaching in the traditional classroom is diversified in terms of information quantity, reliability, communication form and task type, and the effectiveness of the tasks is greatly extended.

Faced with the rapid growth of English learning needs, numerous language training schools, teaching methods and teaching tools have emerged one after another. However, the teaching resources and learning methods for spoken English have always been a major obstacle to the improvement of spoken English in China. On the one hand, teachers with a high level of spoken English are extremely scarce in China. Even in schools in developed regions, very few teachers speak the foreign language well enough to guide students accurately. On the other hand, Chinese and English pronunciation habits are very different. Chinese speakers are therefore influenced by their mother tongue when they speak English and make many unpredictable mistakes [1]. Although some schools use multimedia teaching, traditional multimedia teaching is one-directional, and students can only follow the recordings played by the multimedia system. The multimedia system therefore gives students no corrective guidance or feedback and provides no effective interaction [2].

In recent years, computer and Internet technologies have developed rapidly, and personal computers have entered a vast number of households, greatly facilitating people's lives. In addition, research on speech recognition technology has made great progress, and in terms of both recognition speed and recognition accuracy the technology has become practical. As a result, computer-aided pronunciation training (CAPT) systems based on speech recognition technology are more and more widely used in education [3]. With a CAPT system, users can practice spoken foreign-language pronunciation at any time and in any place in a relaxed and enjoyable environment. The CAPT system can evaluate the user's pronunciation in a targeted manner and provide timely feedback. This feedback encodes the knowledge of human pronunciation experts and does not mislead users, so it effectively overcomes the problems of teachers' limited oral English and the lack of teaching resources, and it greatly improves the effectiveness of and users' enthusiasm for learning a foreign language. It is therefore of great application value and research significance to study how to perform pronunciation evaluation and error correction effectively and to guide learners toward more accurate foreign language pronunciation [4].

The main contribution of this paper is to analyze the application of deep learning and image target recognition in English learning tasks. In addition, this paper introduces a DNN adaptation technique based on KL divergence regularization to adapt the acoustic model. Finally, the algorithm of this study is compared experimentally with the traditional algorithm. The results show that the algorithm recognizes confusable phonemes significantly better than traditional algorithms, which provides a theoretical basis for introducing error correction algorithms into deep neural networks. Using the platform of an autonomous learning center, students can improve their English level through training by completing tasks chosen by teachers or by themselves.

This paper is organized as follows. Related work is introduced in Section 2. The evaluation method of pronunciation quality under the DNN framework is presented in Section 3. The improved deep learning algorithm is presented in Section 4. The model test analysis is given in Section 5. The analysis of results is given in Section 6. Finally, conclusions are given in Section 7.

2 Related work

Automatic speech recognition is an important research topic and an important means of human-computer interaction, and it has been studied since the invention of the computer [5]. The earliest speech recognition system dates back to the 1950s. In 1952, Davis et al. of AT&T Bell Laboratories developed the first speech recognition system, which was capable of recognizing the 10 English digits. In the 1960s, researchers in the former Soviet Union proposed the Dynamic Programming (DP) and Dynamic Time Warping (DTW) algorithms, which solved the problem of aligning speech signal templates with speech instances and laid a solid foundation for modern automatic speech recognition [6]. In the 1970s, with the introduction of the speech feature parameter LPC (linear predictive coding), speech recognition using related features such as pitch was put into practice. In the 1980s and 1990s, research on speech recognition reached a climax. More and more research institutions in industry and academia joined the field, and major companies launched their own speech recognition applications. These include the Hearsay [7] and Harpy [8] systems at Carnegie Mellon University, the HWIM system at BBN [9], Bell Labs' speech recognition research for telecom services, and IBM's research on phonographs. After the introduction of the HMM, speech recognition research moved from simple template-based systems to systems based on probabilistic models. The theory and practice of HMMs in speech recognition matured steadily, and the HMM has been the most important method of speech recognition ever since [10]. During this period, the Artificial Neural Network (ANN) also received more and more attention from speech recognition researchers all over the world. Finally, Cambridge University released and open-sourced the HTK (Hidden Markov Model Toolkit) software that it had developed, which greatly facilitated speech recognition research by other organizations and launched a research boom based on the HMM [11]. Since the beginning of the 21st century, research on automatic speech recognition has made great progress in both depth and breadth. Some new acoustic models outside the HMM framework have received preliminary study. For example, the Conditional Random Field (CRF) and HTM [12] have attracted increasing attention from researchers working on the optimization of acoustic models. Over decades of research, automatic speech recognition has evolved from recognizing only isolated words to large-vocabulary continuous speech recognition of natural language, and recognition accuracy has increased year after year [13].

Optimizing the acoustic model is very important for improving the performance of a pronunciation error detection system. The acoustic model is the key component of the speech recognition algorithm: if it cannot accurately reflect the acoustic characteristics of the speech units, the performance of the pronunciation error detection system cannot be guaranteed. First, discriminative training is the main method for optimizing acoustic models in current speech recognition systems. Gales et al. of Cambridge University applied MMI-based discriminative training to the TIMIT corpus and showed that, compared with traditional ML training, the phoneme-level error rate of the speech recognition system dropped from 29.4% to 25.3% [14]. In China, researchers at HKUST and the University of Science and Technology of China have used a speaker adaptation method based on constrained maximum likelihood linear regression to optimize the acoustic model in the pronunciation error detection system. This optimization addresses the lack of robustness caused by differing speaker characteristics in the training database, thereby improving the performance of the pronunciation error detection system [15].

3 Evaluation method of pronunciation quality under the DNN framework

3.1 Traditional GOP algorithm

In 2000, Witt and Young proposed the Goodness of Pronunciation (GOP) algorithm in the GMM-HMM speech recognition system for evaluating phoneme pronunciation quality and detecting pronunciation errors. Prior to this, all pronunciation quality assessments were made at the word and phrase levels. The method assumes that the phoneme sequence of the read text is known. Given the speech segment x^(p), the GOP of the phoneme p is defined as [16]: $GOP(p) = \frac{1}{d} \log p(p | x^{(p)})$ (1) $= \frac{1}{d} \log \frac{p(x^{(p)} | p)\, p(p)}{\sum_{q \in Q} p(x^{(p)} | q)\, p(q)}$ (2) $\approx \frac{1}{d} \log \frac{p(x^{(p)} | p)\, p(p)}{\max_{q \in Q} p(x^{(p)} | q)\, p(q)}$ (3)

Here, d is the duration of the pronunciation, p(q) is the prior probability of phoneme q, p(x^(p)|q) is the likelihood of the observed feature x^(p) under the model of q, and Q is the set of all phonemes. To simplify the calculation, Equation (3) approximates the sum over all phonemes by the maximum of the joint probability of phoneme and speech segment. The approximate GOP calculation is shown in Fig. 1. After the sequence of acoustic features of the sentence is extracted, forced alignment of the acoustic feature sequence with the accompanying text gives the start and end times and the likelihood values of the phonemes to be read, that is, the numerator term of Equation (3). In parallel, a phoneme-level speech recognition pass with no grammatical constraints gives the likelihood value of each possible phoneme, that is, the denominator term of Equation (3). Once the GOP score of each phoneme is obtained, it is compared with the system threshold to judge whether the pronunciation is correct [17].

Fig. 1

Schematic diagram of the structure of the pronunciation error detection system based on GOP algorithm.

3.2 Extension of GOP algorithm in DNN-HMM system

First, the GOP algorithm is extended to the DNN-based pronunciation error detection system, and the calculation of the GOP is further simplified. Assuming in Equation (3) that every phoneme has the same prior probability, the GOP reduces to [18]: $GOP(p) = \frac{1}{d} \log \frac{p(x^{(p)} | p)}{\max_{q \in Q} p(x^{(p)} | q)}$ (4) $= \frac{1}{d} \left\{ \log p(x^{(p)} | p) - \max_{q \in Q} \log p(x^{(p)} | q) \right\}$ (5)
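As a minimal sketch of Equation (5), the GOP can be computed from the segment log-likelihoods of the candidate phonemes. The function names, the use of the frame count as the duration d, the example log-likelihood values and the threshold are illustrative assumptions, not values from the paper.

```python
def gop_score(loglik_by_phone, target, num_frames):
    """Equation (5): duration-normalised log-likelihood ratio.

    loglik_by_phone maps each candidate phoneme q to log p(x | q) for the
    segment; num_frames stands in for the duration d (an assumption).
    """
    best = max(loglik_by_phone.values())
    return (loglik_by_phone[target] - best) / num_frames


def is_mispronounced(gop, threshold):
    """Flag the phoneme when its GOP falls below the system threshold."""
    return gop < threshold


# Hypothetical log-likelihoods for a 10-frame segment whose text phoneme is ong1.
logliks = {"ong1": -52.0, "ang1": -47.0, "eng1": -60.0}
score = gop_score(logliks, "ong1", num_frames=10)
flag = is_mispronounced(score, threshold=-0.3)
```

When the target phoneme is itself the most likely candidate, the score is 0, which is the maximum value the GOP of Equation (5) can take.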

Assume that the speech segment corresponding to the phoneme p obtained by forced alignment is x^(p), that its start and end times are t_s and t_e, and that the hidden state sequence is s* = { s_ts, ⋯ , s_te }. The likelihood function p (x^(p)|p ; t_s, t_e) is then expressed as: $p(x^{(p)} | p; t_{s}, t_{e}) \approx \max_{s} p(x^{(p)}, s | p; t_{s}, t_{e}) = \pi_{s_{t_{s}}} \prod_{t = t_{s} + 1}^{t_{e}} A_{s_{t-1} s_{t}} \prod_{t = t_{s}}^{t_{e}} p(x_{t}^{(p)} | s_{t})$ (6) $\approx \prod_{t = t_{s}}^{t_{e}} p(x_{t}^{(p)} | s_{t})$ (7) $= \prod_{t = t_{s}}^{t_{e}} p(s_{t} | x_{t}^{(p)})\, p(x_{t}^{(p)}) / p(s_{t})$ (8) $\propto \prod_{t = t_{s}}^{t_{e}} p(s_{t} | x_{t}^{(p)}) / p(s_{t})$ (9)

Here, π is the initial probability distribution over the states, A is the state transition probability matrix, $p (s_{t} | x_{t}^{(p)})$ is the output of the DNN, which represents the posterior probability of senone s_t, and p (s_t) is the prior probability of senone s_t. In the traditional GMM-HMM system, the observation likelihood $p (x_{t}^{(p)} | s_{t})$ is calculated from a mixture model containing more than a dozen Gaussian components. In the DNN-HMM system, $p (s_{t} | x_{t}^{(p)})$ is the output of the DNN, and its distribution is obtained through discriminative training of multi-layer nonlinear transformations. As a discriminative model, the DNN can obtain a more accurate posterior probability distribution.

In addition to the forced alignment process of the entire sentence level to obtain the start and end time of each phoneme, the GOP calculation needs to perform a forced alignment process inside the phoneme for other possible phonemes within the starting and ending time period given by the current phoneme. After that, a segmentation of the senone level is obtained, and then the likelihood function value p (x^(p)|q ; t_s, t_e) is calculated, thereby obtaining a phoneme that maximizes the likelihood function [19].
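The scaled likelihood of Equation (9) is usually evaluated in the log domain. The following is a minimal sketch under the assumption that the DNN posteriors are available as one list per frame and that the forced alignment gives one senone index per frame; the function name and the toy numbers are hypothetical.

```python
import math


def segment_log_likelihood(posteriors, priors, alignment):
    """Log of Equation (9), up to an additive constant.

    posteriors[t][s] is the DNN output p(s | x_t), priors[s] is p(s),
    and alignment[t] is the senone s_t assigned to frame t.
    """
    total = 0.0
    for frame_post, s_t in zip(posteriors, alignment):
        total += math.log(frame_post[s_t]) - math.log(priors[s_t])
    return total


# Toy example with three senones and a three-frame segment.
posteriors = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]]
priors = [0.5, 0.3, 0.2]
alignment = [0, 0, 1]
loglik = segment_log_likelihood(posteriors, priors, alignment)
```

Repeating this computation for every candidate phoneme q over the same interval [t_s, t_e] gives the log-likelihoods needed by Equation (5).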

4 Improved deep learning algorithm

4.1 Improved GOP algorithm

When a phoneme in a sentence is mispronounced, the triphone structure causes the error in the current phoneme to affect the GOP calculation of the correctly pronounced phonemes on its left and right. For example, for the word "China", the pinyin is zhong1guo2, and the corresponding triphone sequence is sil - zh + ong1, zh - ong1 + g, ong1 - g + uo2, g - uo2 + sil, where sil is the silence model at the beginning and end of the word. If the pronunciation error ong1 → ang1 occurs, it affects not only the forced alignment result of the current triphone zh - ong1 + g, but also the forced alignment and posterior probability calculation of the preceding sil - zh + ong1 and the following ong1 - g + uo2. The posterior probability of the senone corresponding to sil - zh + ong1 is pulled down, and the posterior probability of the senone corresponding to sil - zh + ang1 is pulled up. As a result, the GOP value of the phoneme zh is pulled down. Similar observations hold for the triphone ong1 - g + uo2 [20].
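The context effect described above can be made concrete with a small sketch that expands a phoneme sequence into triphones and lists which triphones change when one phoneme is mispronounced. The helper name and the compact label format (written here without spaces) are illustrative assumptions.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phoneme sequence into left-centre+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


correct = to_triphones(["zh", "ong1", "g", "uo2"])
wrong = to_triphones(["zh", "ang1", "g", "uo2"])

# Positions whose triphone label differs after the error ong1 -> ang1.
changed = [i for i, (a, b) in enumerate(zip(correct, wrong)) if a != b]
```

A single substituted phoneme changes three of the four triphone labels, which is why the error leaks into the scores of the neighbouring phonemes zh and g.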

To address this problem, the GOP calculation based on Equation (9) is modified. The corrected calculation is expressed as: $\log p(x^{(p)} | p; t_{s}, t_{e}) = \sum_{t = t_{s}}^{t_{e}} \log \left( \sum_{s \in set_{p}} p(s | x_{t}^{(p)}) \right)$ (10)

Here, set_p is the set of all senones in the decision tree in which senone s_t is located, $p (s | x_{t}^{(p)})$ is the output of the DNN, and s_t is the senone assigned to frame x_t by forced alignment. When the triphone is used as the modeling unit, each state of a phoneme builds an independent decision tree according to the context and performs state clustering. In the following, we analyze the newly defined GOP algorithm from the perspective of decision trees.

Figure 2 illustrates the decision tree for the third valid state of phoneme zh. Each leaf node in the tree represents a clustered state (senone), whose posterior probability is learned by the DNN. Here, it is assumed that each leaf node represents a different senone. The value of each leaf node is defined as the posterior probability of the corresponding senone, and the value of an intermediate node is defined as the sum of the values of its child nodes. The value of the root node zh _ s3 is then the sum of the posterior probabilities of all leaf nodes in the decision tree. Assuming that the senone corresponding to x_t obtained by forced alignment is zh _ s3.2, the inner sum in Equation (10) is the value of the root node zh _ s3. Returning to the above example, when Equation (10) is used to calculate the posterior probability of each state of zh, it considers not only the senone corresponding to the expected context sil - zh + ong1 but also the senones corresponding to the actually pronounced context sil - zh + ang1 and to other possible contexts. Therefore, the sensitivity of p (x^(p)|p ; t_s, t_e) to mispronunciations in the context is greatly reduced. On the other hand, if the current phoneme and its context are pronounced correctly, the posterior probability obtained by the DNN is dominated by the current senone s_t, so $\sum_{s \in set_{p}} p (s | x_{t}^{(p)})$ and $p (s_{t} | x_{t}^{(p)})$ are very close in value. Therefore, Equation (10) does not affect the GOP value of a correctly pronounced phoneme. Another difference from Equation (9) is that Equation (10) omits the prior probability p (s) of each senone. Subsequent experiments found that the accuracy of error detection is not sensitive to the prior term [20].

Fig. 2

Schematic diagram of the decision tree of the third valid state of phoneme zh.

The posterior probability defined by Equation (10) essentially unties the state clustering of the current senone s_t and yields the posterior probability of the context-independent state, which is equivalent to computing the GOP with a monophone model structure. To keep the segmentation of each phoneme accurate, the triphone is still used as the HMM modeling unit, and forced alignment is performed at the sentence level to obtain the time boundaries of each phoneme and its corresponding states. The GOP of each phoneme is then calculated using Equations (5) and (10). For convenience, this method is referred to as DNN-GOP2 below.
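A minimal sketch of the pooled log-likelihood of Equation (10) is given below. It assumes that the mapping from each senone to the other senones of its decision tree is available as a dictionary; that data structure, the function name and the toy posteriors are hypothetical.

```python
import math


def pooled_log_likelihood(posteriors, alignment, tree_members):
    """Equation (10): sum over frames of the log of the summed posteriors.

    tree_members[s] lists every senone in the decision tree that contains
    senone s, i.e. the set written set_p in the text.
    """
    total = 0.0
    for frame_post, s_t in zip(posteriors, alignment):
        total += math.log(sum(frame_post[s] for s in tree_members[s_t]))
    return total


# Senones 0 and 1 share one decision tree; senone 2 belongs to another.
tree_members = {0: (0, 1), 1: (0, 1), 2: (2,)}

# A context error moves probability mass from the aligned senone 0 to senone 1.
posteriors = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]
alignment = [0, 0]

pooled = pooled_log_likelihood(posteriors, alignment, tree_members)
single = sum(math.log(p[s]) for p, s in zip(posteriors, alignment))
```

In this toy case the single-senone score is much lower than the pooled score, illustrating why pooling over the tree makes the phoneme score less sensitive to a mispronounced neighbour.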

4.2 Adaptive technique of DNN model based on KL divergence regularization

In information theory and probability theory, the KL divergence (Kullback-Leibler divergence, KLD) is commonly used to measure the difference between two probability distributions. For two probability distributions Q and P on the discrete random variable x, the KL divergence is defined as [21]: $D_{KL}(Q \| P) = \sum_{i} Q(i) \log \frac{Q(i)}{P(i)}$ (11)

Here, Q usually denotes the true or theoretical distribution of the variable x, and P is its approximate distribution. If x is a continuous random variable, the KL divergence is defined as: $D_{KL}(Q \| P) = \int_{-\infty}^{\infty} Q(x) \log \frac{Q(x)}{P(x)}\, dx$ (12)

The meaning of the KL divergence can be interpreted intuitively using the character encoding example from Shannon's information theory. For a given character set $X = \{x_{i}\}_{i = 1}^{N}$ with probability distribution P, the probability of character x_i appearing is P (i). A character encoding can be designed such that the average number of bits required to represent a string drawn from the character set is minimal. The average number of bits per character under this optimal coding is the entropy of the probability distribution P: $H(P) = - \sum_{i = 1}^{N} P(i) \log P(i)$ (13)

On the same character set, assume that there is another probability distribution Q. If the optimal coding for the distribution P is used to encode characters that follow the distribution Q, the average number of bits used is larger than the ideal. The KL divergence measures the average number of extra bits used per character [22]: $D_{KL}(Q \| P) = - \sum_{i = 1}^{N} Q(i) \log P(i) - \left( - \sum_{i = 1}^{N} Q(i) \log Q(i) \right) = \sum_{i = 1}^{N} Q(i) \log \frac{Q(i)}{P(i)}$ (14)

As its physical interpretation suggests, the KL divergence is non-negative, that is, D_KL (Q||P) ≥ 0. The theoretical derivation is as follows [23]: $D_{KL}(Q \| P) = \sum_{i = 1}^{N} Q(i) \log \frac{Q(i)}{P(i)}$ (15) $= E_{Q} \left[ - \log \frac{P(i)}{Q(i)} \right]$ (16) $\geq - \log E_{Q} \left[ \frac{P(i)}{Q(i)} \right]$ (17) $= - \log \sum_{i = 1}^{N} Q(i) \cdot \frac{P(i)}{Q(i)}$ (18) $= - \log \sum_{i = 1}^{N} P(i) = 0$ (19)

Here, the inequality between Equation (16) and Equation (17) follows from Jensen's inequality.

Because f (x) = - log(x) is a convex function, Jensen's inequality gives: E (f (x)) ≥ f (E (x)).

Therefore, D_KL (Q||P) ≥ 0, and D_KL (Q||P) = 0 holds only if the distributions Q and P are identical.
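The coding interpretation of Equation (14) and the non-negativity result can be checked numerically. The sketch below uses base-2 logarithms so that the values are in bits; the two distributions are arbitrary illustrative choices, and the functions assume P(i) > 0 wherever Q(i) > 0.

```python
import math


def entropy(q):
    """Equation (13) applied to q: average bits under the optimal code for q."""
    return -sum(qi * math.log2(qi) for qi in q if qi > 0)


def cross_entropy(q, p):
    """Average bits when symbols drawn from q are coded with the code for p."""
    return -sum(qi * math.log2(pi) for qi, pi in zip(q, p) if qi > 0)


def kl_divergence(q, p):
    """Equation (11): extra bits per symbol caused by coding q with p's code."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)


Q = [0.5, 0.25, 0.25]
P = [0.25, 0.25, 0.5]

extra_bits = kl_divergence(Q, P)             # 0.25 bits
difference = cross_entropy(Q, P) - entropy(Q)  # also 0.25 bits, as in Equation (14)
```

Here the optimal code for Q needs 1.5 bits per character on average, while the code designed for P needs 1.75 bits, and the gap of 0.25 bits is exactly the KL divergence.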

Under the minimum cross-entropy training criterion, the loss function on the training data $S = \{x_{n}, y_{n}\}_{n = 1}^{N}$ is defined as: $\bar{E}(θ) = - \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{N^{L}} y_{n}^{(k)} \cdot \log p(k | x_{n})$ (20)

Here, N^L is the dimension of the DNN output layer, p (k|x_n) is the output of the k-th node of the DNN when the input feature is x_n, and $y_{n}^{(k)}$ is the label indicating the senone class of the n-th sample: it is 1 when the sample belongs to the k-th senone and 0 otherwise.

The KL divergence measures the difference between two probability distributions. In DNN model adaptation, the two distributions are the posterior probability distributions over the senones for each frame of data before and after adaptation. Assume that the speaker-independent DNN model before adaptation is DNN^SI, that its posterior probability for the k-th senone is p^SI (k|x_n), and that the posterior probability after adaptation is p (k|x_n). The KL divergence on a frame of data is then expressed as [24]: $D_{KL}(p^{SI}(x_{n}) \| p(x_{n})) = \sum_{k = 1}^{N^{L}} p^{SI}(k | x_{n}) \cdot \log \frac{p^{SI}(k | x_{n})}{p(k | x_{n})}$ (21) $= \sum_{k = 1}^{N^{L}} p^{SI}(k | x_{n}) \cdot \log p^{SI}(k | x_{n}) - \sum_{k = 1}^{N^{L}} p^{SI}(k | x_{n}) \cdot \log p(k | x_{n})$ (22) $\propto - \sum_{k = 1}^{N^{L}} p^{SI}(k | x_{n}) \cdot \log p(k | x_{n})$ (23)

Because the model parameters of DNN^SI are known, $\sum_{k = 1}^{N^{L}} p^{SI}(k | x_{n}) \cdot \log p^{SI}(k | x_{n})$ is a constant and can be ignored in parameter estimation.

Yu et al. defined the loss function on the adaptation data set as the weighted sum of the minimum cross-entropy loss and the KL divergence [25]: $\hat{E}(θ) = (1 - ρ) \cdot \bar{E}(θ) + ρ \cdot \frac{1}{N} \sum_{n = 1}^{N} D_{KL}(p^{SI}(x_{n}) \| p(x_{n}))$ (24) $\propto (1 - ρ) \cdot \left( - \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{N^{L}} y_{n}^{(k)} \cdot \log p(k | x_{n}) \right) + ρ \cdot \left( - \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{N^{L}} p^{SI}(k | x_{n}) \cdot \log p(k | x_{n}) \right)$ (25) $= - \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{N^{L}} \hat{p}(k | x_{n}) \cdot \log p(k | x_{n})$ (26)

Here, $\hat{p}(k | x_{n}) = (1 - ρ) \cdot y_{n}^{(k)} + ρ \cdot p^{SI}(k | x_{n})$ and ρ is the weight coefficient. Clearly, the smaller ρ is, the smaller the dependence on the original model and the greater the dependence on the adaptation data.

Comparing the loss functions $\bar{E} (θ)$ and $\hat{E} (θ)$ before and after adaptation, defined by Equations (20) and (26), shows that the two have the same form and differ only in the definition of the target probability distribution. Before adaptation, the target distribution learned by the DNN is the forced-alignment label $y_{n}^{(k)}$; after adaptation, the target distribution is the linear interpolation of $y_{n}^{(k)}$ and the posterior probability p^SI (k|x_n) of the reference model. Therefore, the adapted model parameters can be solved iteratively with the backpropagation algorithm.
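The interpolated targets and the loss of Equation (26) can be sketched as follows. This is an illustration only: the function names and the toy posteriors are assumptions, and a real system would compute the same quantities inside its DNN training toolkit.

```python
import math


def interpolated_targets(one_hot, si_posteriors, rho):
    """Target distribution (1 - rho) * y + rho * p_SI for one frame."""
    return [(1.0 - rho) * y + rho * p for y, p in zip(one_hot, si_posteriors)]


def cross_entropy_loss(targets, outputs):
    """Equation (26): mean over frames of -sum_k target_k * log p(k | x_n)."""
    total = 0.0
    for frame_targets, frame_outputs in zip(targets, outputs):
        total -= sum(
            t * math.log(p) for t, p in zip(frame_targets, frame_outputs) if t > 0
        )
    return total / len(targets)


# One frame, three senones: hard label on senone 1 and hypothetical posteriors.
one_hot = [0.0, 1.0, 0.0]
si_post = [0.2, 0.7, 0.1]        # speaker-independent model output
adapted_post = [0.1, 0.8, 0.1]   # current output of the model being adapted

soft = interpolated_targets(one_hot, si_post, rho=0.5)
loss = cross_entropy_loss([soft], [adapted_post])
```

With rho = 0 the targets reduce to the hard forced-alignment labels of Equation (20), and with rho = 1 the adapted model is trained only to reproduce the speaker-independent posteriors.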

5 Model test analysis

5.1 Data collection

Two Chinese databases are used in this paper. One is a Mandarin test database recorded by speakers with standard Mandarin pronunciation; it is denoted MPE and is used for training the acoustic models. The other is a Mandarin learner database, which is used to assess the error detection performance of each system. This database was provided by the Singapore Institute of Science, Technology and Research and is named iCALL.

The MPE database consists of 140 speakers. According to the Chinese Putonghua Proficiency Test, the speakers are divided into four levels: first-class A, first-class B, second-class A, and second-class B or below. The scores and number of people at each level are shown in Table 1. The data of the 110 speakers at the first-class A and first-class B levels are used to train the standard pronunciation model, and the data of the 30 speakers at the second-class A level and at second-class B and below are used to test the acoustic model. The total duration of the training data is 41 hours, and that of the test data is 2.9 hours.

Table 1
Grade distribution of Mandarin level

	First-class A	First-class B	Second-class A	Second-class B
Score	[97,100]	[92,97]	[87,92]	[0,87]
Number of people	10	100	20	10
Number of men	5	50	10	5
Number of women	5	50	10	5

In this experiment, sentences containing inserted or omitted readings are removed. To compare the pronunciation error detection and diagnosis performance of the different systems, the iCALL database is divided into three parts: a training set, a verification set and a test set. The number of speakers, the number of sentences and the total duration of each data set are shown in Table 2. In the entire data set, mispronunciations account for 19.2% of all phonemes.

Table 2

Statistics of each data set in the iCALL database

iCALL	Training set	Verification set	Test set
The number of speakers	221.45	30.9	61.8
The number of sentences	55186.37	7,763	15351.12
Total time length (hours)	82.503	11.33	22.454

This paper uses precision, recall, and accuracy to evaluate the performance of the pronunciation error detection system. They are defined as follows: $Precision = \frac{N_{M}}{N_{D}} \times 100\%$ (27) $Recall = \frac{N_{M}}{N_{H}} \times 100\%$ (28) $Accuracy = \frac{N_{M} + N_{C}}{N} \times 100\%$ (29)

Here, N_M is the number of true mispronunciations detected by the system; N_D is the total number of pronunciations flagged as errors by the system; N_H is the total number of mispronunciations marked manually; N_C is the number of correct pronunciations that the system judges to be correct; N is the total number of phonemes in the set.
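Equations (27) to (29) translate directly into code. In the sketch below the counts are invented for illustration (they are not results from the paper); they describe 1000 phonemes of which 192 are manually marked as mispronounced.

```python
def detection_metrics(n_m, n_d, n_h, n_c, n):
    """Precision, recall and accuracy in percent, per Equations (27)-(29)."""
    precision = 100.0 * n_m / n_d
    recall = 100.0 * n_m / n_h
    accuracy = 100.0 * (n_m + n_c) / n
    return precision, recall, accuracy


# Hypothetical counts: the system flags 200 phonemes, 150 of them are real
# errors; 758 of the 808 correct phonemes are accepted as correct.
precision, recall, accuracy = detection_metrics(
    n_m=150, n_d=200, n_h=192, n_c=758, n=1000
)
```

For these counts the precision is 75.0%, the recall is 78.125% and the accuracy is 90.8%.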

5.2 Model diagnosis results

In the GMM-HMM training, the feature vector extracted by the front end is a 39-dimensional HTK-format feature vector, with a window length of 25 milliseconds and a frame shift of 10 milliseconds. Each vowel is expanded with tones, and the expanded phoneme set contains 183 different phonemes. The triphone is the basic modeling unit, and the pronunciation of each triphone is described by an HMM containing 3 valid states. The states are tied by a clustering algorithm based on decision trees and the maximum likelihood criterion. After state tying, 3002 different senones are formed, and the observation probability of each senone is described by a Gaussian mixture model containing 32 components. The GMM-HMM model was trained with HTK.

In the DNN-HMM model training, the feature vector extracted by the front end includes a 3-dimensional F0 vector in addition to the 39-dimensional HTK-format features. The input feature window of the DNN is 11 frames long (the previous 5 frames, the current frame and the following 5 frames), so the input layer dimension is (39 + 3) × 11 = 462. The DNN contains 3 hidden layers, each with 2048 nodes, the activation function is the sigmoid function, and the output layer dimension is 3002. The DNN is trained with the minimum cross-entropy criterion of Equation (20), using training tools developed internally by Microsoft Research Asia.
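The context window can be illustrated with a short frame-splicing sketch. Repeating the first and last frames at the utterance boundaries is an assumption made here for illustration; the text does not state how the edges are handled, and the function name is hypothetical.

```python
def splice_frames(frames, left=5, right=5):
    """Concatenate each frame with its left and right context frames.

    Edge frames are repeated at the utterance boundaries (an assumption).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# 20 dummy frames of 39 + 3 = 42 features; frame t is filled with the value t.
frames = [[float(t)] * 42 for t in range(20)]
spliced = splice_frames(frames)
input_dim = len(spliced[0])  # 42 features x 11 frames = 462
```

With 42 features per frame and an 11-frame window, each spliced input vector has 42 × 11 = 462 elements.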

The speech recognition results of the acoustic models on the MPE test data set are shown in Table 3. Two DNN models are trained in this paper, one using only the 39-dimensional MFCC feature and the other additionally embedding the 3-dimensional F0 feature. The second column is the frame-level classification error rate (FER) measured on the DNN models. After embedding the F0 information, the FER on the test data set decreased from 47.69% to 40.99%, an absolute reduction of 6.70%. The third column shows the recognition error rate at the phoneme level. This paper does not use a complex language model but a free phoneme loop without grammatical constraints, that is, each phoneme can be followed by any phoneme with the same probability. As can be seen from the table, through the discriminative training of the DNN, the PER (Phone Error Rate) decreased from 41.65% to 28.47%, an absolute reduction of 13.18%. After embedding the F0 information in the DNN model, the PER is further reduced by 4.79% absolute, to 23.68%. It can be seen that the discriminative training of the DNN and the F0 information are both very important for Chinese speech recognition.

Table 3
Identification results of MPE test data sets

	Frame error rate	Phone error rate
GMM-HMM(MFCC)	——	41.65%
DNN-HMM(MFCC)	47.69%	28.47%
DNN-HMM(MFCC+F0)	40.99%	23.68%
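The error-rate reductions can be recomputed from the entries of Table 3. The helper below returns both the absolute reduction (in percentage points) and the relative reduction; the relative figures are derived here and are not quoted in the paper.

```python
def reductions(before, after):
    """Absolute (percentage points) and relative (percent) error reduction."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Values taken from Table 3.
per_dnn = reductions(41.65, 28.47)  # GMM-HMM -> DNN-HMM(MFCC), phone error rate
per_f0 = reductions(28.47, 23.68)   # DNN-HMM(MFCC) -> DNN-HMM(MFCC+F0), PER
fer_f0 = reductions(47.69, 40.99)   # DNN-HMM(MFCC) -> DNN-HMM(MFCC+F0), FER
```

The absolute reductions are 13.18, 4.79 and 6.70 percentage points respectively, corresponding to relative reductions of roughly 31.6%, 16.8% and 14.0%.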

To better evaluate the performance of each system, the Receiver Operating Characteristic (ROC) curve of precision and recall is often used to describe the overall performance of the system.

Figure 3 depicts the error detection ROC curves of the four scoring algorithms, GMM-CM, DNN-CM, DNN-GOP1 and DNN-GOP2, on the iCALL test set. The horizontal axis is the accuracy rate and the vertical axis is the recall rate. By adjusting the system threshold, the recall rate under different accuracy conditions can be obtained. The black square on each curve marks the operating point at which the accuracy rate equals the recall rate. When a single value is needed to describe the overall performance of an error detection system, the accuracy or recall rate at that operating point is often used. It can be seen from Fig. 3 that the ROC curves of the four systems do not cross, so the performance ranking of the systems is clear. The three DNN-based systems are significantly superior to the traditional GMM-HMM system. Compared with GMM-CM, DNN-CM increases the accuracy and recall rate of error detection, mainly owing to the DNN’s powerful discriminative learning ability and the embedding of F0 information in the DNN model. Among the three DNN-based error detection algorithms, the two GOP algorithms are in turn significantly better than the confidence evaluation algorithm (DNN-CM).

Fig. 3

Sound error detection results.
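
The threshold sweep behind curves of this kind, and the operating point at which the two rates are equal, can be sketched as follows. The sketch assumes that the "accuracy rate" on the horizontal axis is the precision of the flagged errors and that a lower score means a more suspicious phoneme; both are assumptions, as are the function names, and the toy scores are not taken from the paper.

```python
import numpy as np

def precision_recall_points(scores, is_error):
    """Sweep a threshold over the scores; phonemes at or below it are flagged."""
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    points = []
    for threshold in np.unique(scores):
        flagged = scores <= threshold
        true_pos = np.sum(flagged & is_error)
        precision = true_pos / flagged.sum()   # flagged phonemes that are real errors
        recall = true_pos / is_error.sum()     # real errors that were flagged
        points.append((float(threshold), float(precision), float(recall)))
    return points

def equal_operating_point(points):
    """Point on the curve where precision and recall are closest to equal."""
    return min(points, key=lambda p: abs(p[1] - p[2]))

# Toy example: six phoneme scores with manual error labels (1 = mispronounced).
toy_points = precision_recall_points([-5, -4, -3, -2, -1, 0], [1, 1, 0, 1, 0, 0])
print(equal_operating_point(toy_points))
```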

To further evaluate the improved GOP algorithm, the two GOP algorithms are also compared on the diagnosis of false pronunciations. Pronunciation error detection determines whether each phoneme in the sentence is pronounced correctly, whereas the diagnosis of a false pronunciation goes a step further and identifies which phoneme the wrong pronunciation was most likely mistaken for. Such a diagnosis is very effective feedback for the user. By summarizing common types of pronunciation errors and providing targeted training, users can learn more about their pronunciation habits, correct false pronunciations, and improve the efficiency of oral learning.

For the i-th false pronunciation in the iCALL test data set, we assume that its observed feature sequence is $X_{te}^{te} (p_{i})$, the canonical phoneme that should have been pronounced is $p_{canonical}^{(i)}$, the phoneme actually spoken according to the manual annotation is $p_{spoken}^{(i)}$, and the N most likely mispronounced phonemes recommended by the system, ranked by the value of the likelihood function, are $\{q_{1}^{(i)}, q_{2}^{(i)}, \ldots, q_{N}^{(i)}\}$. Then, the TopN error rate is defined as:

$\text{TopN error} = \frac{1}{N_{H}} \sum_{i = 1}^{N_{H}} \left[ 1 - \delta \left( p_{spoken}^{(i)} \in \{ q_{1}^{(i)}, q_{2}^{(i)}, \ldots, q_{N}^{(i)} \} \right) \right] \times 100\%$ (30)

Here, $N_{H}$ is the number of all manually annotated false pronunciations, and δ(·) is a 0-1 indicator function that equals 1 when the condition is met and 0 otherwise. The lower the TopN error rate, the better the error diagnosis performance of the system.
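
The definition in Equation (30) translates directly into code. The sketch below assumes that each false pronunciation comes with the annotated spoken phoneme and a candidate list already ranked by the system; the function name and the toy phoneme labels are illustrative only.

```python
def top_n_error(spoken, ranked_candidates, n):
    """TopN error rate (%) following Equation (30).

    spoken            -- annotated spoken phoneme for each false pronunciation
    ranked_candidates -- for each false pronunciation, the system's candidates,
                         most likely first
    n                 -- number of top candidates considered
    """
    misses = sum(1 for phoneme, candidates in zip(spoken, ranked_candidates)
                 if phoneme not in candidates[:n])
    return 100.0 * misses / len(spoken)

# Toy example with three false pronunciations (labels are made up).
toy_spoken = ["zh", "s", "l"]
toy_ranked = [["z", "zh", "j"], ["s", "sh", "x"], ["n", "r", "d"]]
for n in (1, 2, 3):
    print(f"Top{n} error: {top_n_error(toy_spoken, toy_ranked, n):.2f}%")
```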

The two GOP algorithms defined by Equations (13) and (14) were compared on the diagnosis of false pronunciations using the iCALL test set. In the implementation, only the likelihood function values defined by Equations (13) and (14) need to be calculated. To keep the naming consistent, the false pronunciation diagnosis systems corresponding to Equations (13) and (14) are still denoted DNN-GOP1 and DNN-GOP2, respectively. The results are shown in Fig. 4 and Table 4. In the figure, the horizontal axis is the value of N and the vertical axis is the corresponding TopN error rate. As can be seen from the figure, DNN-GOP2 is superior to DNN-GOP1 at every value of N and reduces the TopN error rate by about 2 percentage points. The table gives the specific error rates of the two systems from Top1 to Top5. The entire phoneme set contains 182 different phonemes in addition to the silence model. Therefore, a single random guess at the mispronounced phoneme would be correct with a probability of less than 0.6%, whereas with the DNN-GOP2 algorithm the Top1 error rate is 52.43% and the Top3 error rate is close to 25%.

Fig. 4

Statistical diagram of the results of the diagnosis.

Table 4

Statistical table of pronunciation diagnosis results

	DNN-GOP1	DNN-GOP2
Top1 error	54.18%	52.43%
Top2 error	36.77%	34.40%
Top3 error	27.91%	25.54%
Top4 error	22.25%	19.98%
Top5 error	18.44%	16.38%
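
To put Table 4 in perspective, the chance-level TopN error for the 182-phoneme set can be computed directly. The sketch assumes that a random system proposes N distinct phonemes drawn uniformly from the set, which is an illustrative assumption rather than a baseline evaluated in the paper.

```python
NUM_PHONEMES = 182  # phoneme set size excluding the silence model, as stated in the text

def chance_top_n_error(n, num_phonemes=NUM_PHONEMES):
    """TopN error (%) when N distinct phonemes are guessed uniformly at random."""
    return 100.0 * (1.0 - n / num_phonemes)

# DNN-GOP2 column of Table 4.
dnn_gop2_error = {1: 52.43, 2: 34.40, 3: 25.54, 4: 19.98, 5: 16.38}
for n, observed in dnn_gop2_error.items():
    print(f"Top{n}: chance {chance_top_n_error(n):.2f}%, DNN-GOP2 {observed:.2f}%")
```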

The results of the false pronunciation diagnosis on the iCALL test set again verify the superiority of the improved GOP algorithm. For this reason, the GOP algorithm in the DNN-HMM system mentioned hereafter refers to the DNN-GOP2 algorithm.

In this paper, the regularized DNN model adaptive technique based on KL divergence is used to adaptively adjust the parameters in the standard pronunciation model.
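
A minimal sketch of KL-divergence regularized adaptation is given below. It follows the commonly used formulation in which the training target for each frame is an interpolation between the hard alignment label and the posterior of the unadapted model, so that the adapted model is penalized for drifting away from the original one. The weight name `rho` and the function names are assumptions; the exact formulation used in this paper is the one defined in its method section.

```python
import numpy as np

def kld_regularized_targets(one_hot_labels, unadapted_posteriors, rho):
    """Interpolate hard alignment labels with the unadapted model's posteriors.

    rho = 0 ignores the unadapted model (plain fine-tuning);
    rho = 1 ties the targets entirely to the unadapted model's outputs.
    """
    one_hot_labels = np.asarray(one_hot_labels, dtype=float)
    unadapted_posteriors = np.asarray(unadapted_posteriors, dtype=float)
    return (1.0 - rho) * one_hot_labels + rho * unadapted_posteriors

def adaptation_loss(targets, adapted_posteriors, eps=1e-12):
    """Frame-averaged cross-entropy between the targets and the adapted model."""
    adapted_posteriors = np.asarray(adapted_posteriors, dtype=float)
    return float(-np.mean(np.sum(targets * np.log(adapted_posteriors + eps), axis=1)))
```

A larger `rho` keeps the adapted model closer to the standard pronunciation model, which is generally the safer choice when little adaptation data is available.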

When modelling the iCALL dataset, the reference text is the manual transcription of what was actually spoken, rather than the prompt text that accompanies the recording.

Since the training and test data used in this part of the experiment are taken from the standard pronunciation library, the phoneme-level pronunciation in each voice file can be considered to match its annotation file. Therefore, the system in this section can only be evaluated by the correct acceptance rate, which is called the recognition rate here. The test data set covers a total of 25 phonemes. The results are listed in Table 5 and plotted in Fig. 5.

Table 5

Statistical Table of Phoneme Recognition Rate

Phoneme	DNN-GOP1	DNN-GOP2
1	71.11%	91.50%
2	54.46%	82.00%
3	26.39%	63.50%
4	72.60%	77.42%
5	49.30%	94.30%
6	71.11%	92.18%
7	54.46%	93.89%
8	26.39%	82.78%
9	72.60%	91.67%
10	49.30%	83.83%
11	52.88%	68.07%
12	47.48%	88.33%
13	73.46%	83.00%
14	46.22%	83.50%
15	56.59%	80.00%
16	80.28%	89.05%
17	76.47%	78.11%
18	59.26%	77.11%
19	69.44%	75.13%
20	71.11%	91.50%
21	54.46%	82.00%
22	26.39%	63.50%
23	72.60%	77.42%
24	49.30%	94.30%
25	59.10%	75.50%

Fig. 5

Statistical diagram of phoneme recognition rate.
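
Because every token in this test set is a correct pronunciation, the recognition rate reduces to the share of tokens that the system accepts. A minimal sketch follows, assuming that a token is accepted when its GOP score reaches a decision threshold; the threshold rule and the function name are illustrative assumptions.

```python
def recognition_rate(gop_scores, threshold):
    """Percentage of known-correct phoneme tokens accepted as correct."""
    accepted = sum(1 for score in gop_scores if score >= threshold)
    return 100.0 * accepted / len(gop_scores)

# Toy scores for one phoneme (made-up values): three of four tokens are accepted.
print(recognition_rate([-0.2, -1.5, -0.1, -0.4], threshold=-1.0))
```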

Taken together, the above results show that DNN-GOP2 recognizes confusing phonemes significantly better than DNN-GOP1. This conclusion provides a strong basis for introducing deep neural networks into error correction algorithms, and also verifies the feasibility of introducing MLP neural networks into error correction algorithms.

6 Analysis of results

Tasks assigned by teachers are issued with students’ learning needs in mind, in order to achieve certain expected teaching objectives and effects, and students are required to complete them within a prescribed time. In terms of scope, teachers can assign tasks to the whole class or to only some students. After assigning the same tasks to the whole class, the teacher can assign additional tasks suited to students with different needs, which helps to teach students in accordance with their aptitude. Such tasks are purposeful: they are based on the students’ learning goals and are intended to achieve a particular teaching aim or training effect. Finally, the tasks are time-limited. Students should complete them within the prescribed time, which improves the efficiency of task completion.

The tasks that students choose for themselves can be divided into goal-oriented tasks, free tasks and non-learning tasks. Goal-oriented tasks are tasks chosen by students to achieve a certain learning goal. For example, some students practice past exam questions on the learning platform to prepare for an English test. Other students who take part in speech contests watch videos of English speeches on the learning platform and improve their level through learning and imitation. Free tasks are learning tasks in listening, speaking, reading or writing that students choose according to their preferences and knowledge base. Non-learning tasks are tasks that have nothing to do with learning. Usually, after a period of study, students stop working on learning tasks because of fatigue or decreased motivation and turn to listening to English songs or watching movies.

Although the performance of the speech correction system has been greatly improved by the introduction of the neural network, it still needs further improvement. First, the neural network is only applied to the speech correction algorithm in the late stage of the CAPT system, so it is not yet integrated into the CAPT system’s earlier processing of the user’s speech. If the neural network could be applied to the forced alignment step, it would have a favorable impact on the segmentation accuracy of the system. Therefore, how to introduce neural networks into forced alignment to improve the accuracy of phoneme segmentation still needs further study. Second, the neural network model used in this paper is a static network model, which cannot by itself process the temporal dynamics of speech signals. Although the regularized feature vector used as the input of the neural network reflects the temporal dynamics of the speech signal to a certain extent, this description is relatively rigid and cannot accurately reflect the true dynamic characteristics of the speech signal. Therefore, how to describe the dynamic characteristics of speech signals by improving the structure of the neural network model is also worth further research.

7 Conclusion

In the process of autonomous learning, for teachers and students alike, the choice of tasks plays a very important role in achieving the effects and goals of English learning. Therefore, when assigning tasks, teachers must choose tasks that benefit students’ learning and stimulate their interest in learning. During teaching, the type and quantity of tasks should be adjusted continually to meet students’ new needs for skills and knowledge. Teachers should also guide students to choose learning tasks reasonably, actively follow their learning results, and give corresponding feedback. With the joint efforts of teachers and students, clear learning tasks will improve students’ abilities.

This study analyzes pronunciation error detection, extends the GOP algorithm to a DNN-based pronunciation error detection system, and improves the GOP algorithm by further simplifying its calculation. Two Chinese databases are used in this paper. One is a Mandarin test database recorded by speakers with very standard Mandarin pronunciation; it is denoted MPE and is used to train the acoustic models. The other is a Mandarin learning database, which is used to evaluate the error detection performance of each system. This paper uses accuracy and recall to evaluate the performance of the pronunciation error detection system. In addition, this paper uses the DNN model adaptation technique based on KL divergence regularization to adaptively adjust the parameters of the standard pronunciation model. Experiments on the performance of the algorithm show that its ability to recognize confusing phonemes is significantly better than that of the traditional algorithm. This conclusion provides a strong basis for introducing deep neural networks into error correction algorithms, and also verifies the feasibility of introducing MLP neural networks into error correction algorithms.
