Abstract
In learning English, spoken language is particularly important, and it is the aspect that concerns most English learners. However, because traditional teaching resources are limited and an environment for oral practice is often lacking, many learners find it difficult to improve their spoken English effectively. To address this, this study builds a smart English recognition system based on the support vector machine. The support vector machine is introduced to characterize speech signals, and feature fusion is used to map the complex nonlinear relationships between features. The resulting model can accurately identify the syllables and pronunciations in words. Moreover, the large-scale, speaker-independent corpus used in this article can represent the general characteristics of spoken-language learners.
Introduction
With the increasing trend of global integration, more and more people hope to learn and fluently master one or several foreign languages to communicate more conveniently. English is clearly one of the most important international languages among the languages of the world. Moreover, English is an important communication tool in international politics, military affairs, economics, science and technology, culture, trade, transportation and other fields. With the continuous expansion of China’s opening up to the outside world, the continuous advancement of science and technology, and the continuous improvement of its international status, it is urgent to train a large number of specialized talents who are proficient in foreign languages in order to accelerate China’s “four modernizations” process and enable China to play a greater and more active role in international affairs [1]. Therefore, mastering English has important practical significance and far-reaching historical significance for achieving these goals.
The use of computer-assisted language learning (CALL) systems to support reading, writing and listening has achieved remarkable results, and the related technologies are relatively mature. The biggest advantage of a CALL system over traditional teachers is that it can provide one-to-one tutoring and can evaluate and give feedback on the learner’s pronunciation through a scoring mechanism. In the scoring mechanism, the computer uses speech processing technology to evaluate the learner’s pronunciation practice in place of an expert and provides feedback on the pronunciation level [2]. However, the scoring mechanisms in CALL systems only began to develop at the end of the 20th century. Until now, almost all research has focused on extracting the acoustic features of speech, that is, on evaluating segmental pronunciation features, while ignoring the perceptual and suprasegmental characteristics of speech. As a result, the correlation between machine scores and expert scores is not high, and researchers have been exploring better scoring methods and mechanisms to make computer scores highly correlated with expert scores [3].
Therefore, building on the speech recognition mechanism of a spoken English self-learning system, this study evaluates the learner’s pronunciation in the segmental, suprasegmental and perceptual domains of the speech signal, improves the correlation between machine scores and expert scores, and thereby improves CALL technology. Moreover, this study addresses the current shortage of spoken English teachers, helps learners improve their oral English skills quickly, and provides technical support for changing the traditional mode of English teaching, which has practical significance and value for cultivating highly qualified talents.
Related work
Speech recognition is a technique in which a computer converts a speech signal into the corresponding text. It belongs to the fields of multidimensional pattern recognition and intelligent computer interfaces. The research goal of speech recognition is to enable computers to “understand” human spoken language [4].
Bialous S A [5] used a computer to identify English vowels and isolated words, which marked the beginning of computer speech recognition. At the end of the 1960s, linear prediction techniques for speech signals and dynamic time warping techniques emerged, which solved the problems of feature extraction and of matching utterances of unequal length. Research with these techniques focused on recognizing isolated words and building a template for each word as a whole. The hidden Markov model is a typical statistical model. It describes well the long-term time variation and short-term stationarity of speech signals and made the development of large-vocabulary continuous speech recognition systems possible. After the 1990s, artificial neural networks became a new approach to speech recognition. Their adaptability, parallelism, non-linearity, fault tolerance and learning ability promoted the further development of speech recognition technology and helped move speech recognition systems from the laboratory into practical use [6]. At present, the ViaVoice speaker-independent continuous speech recognition system developed by IBM has been successfully introduced to the market and has received wide acclaim. In addition, the Chinese Academy of Sciences, Tsinghua University, and the Belgian company L&H have successfully launched speech recognition systems [7].
Speech recognition technology is a very important human-computer interaction technology with a very wide range of applications and market prospects. Its most direct application is to assist learners in acquiring language skills. With the continuous development of speech technology, computer-aided language learning (CALL) systems can be expected to keep improving [8]. According to Michael Levy’s definition, computer-aided language learning refers to “the research and learning of computer applications in language teaching and learning” [9]. The CALL system has many advantages that language teachers cannot match. For example, CALL can provide one-on-one communication for individualized teaching; it can manage learning progress and evaluate learning effects more objectively; and, compared with teachers, it never loses patience and is not limited by learning time. Of course, because of the limitations of current technology, the automatic pronunciation evaluation function and the interactive learning methods of CALL systems are limited. Therefore, a CALL system cannot completely replace the teacher. The development of computer-aided language learning can be divided into three stages according to the development of language learning theory and computer technology [10].
The first stage was behaviorist computer-assisted language learning (behavioristic CALL), which began in the United States in the 1950s. This stage was based on behaviorist theory and ran on mainframe computers. It mainly supplied a large amount of practice material to a large number of learners, serving only as a tool that delivered instructions and materials. In addition, it was limited to words, grammatical structures and simple graphics, so to some extent it reduced learners’ interest in learning. The second stage was communicative computer-assisted language learning (communicative CALL). In this stage, microcomputers were widely used, the software had a degree of interactivity, and more attention was paid to the use of language forms. This stage gave learners access to a realistic and varied communication environment, emphasized communicative practice, and applied images, text, sound and video in the learning process. The third stage is integrative computer-assisted language learning (integrative CALL). In the 21st century, the development of multimedia, network and speech processing technologies has placed higher requirements on CALL. This stage pays more attention to applying multimedia and human-computer interaction technology in language learning in order to meet learners’ needs for personalized learning and targeted guidance.
A complete CALL system involves fields such as phonetics, natural linguistics, psychology, and digital signal processing. Speech processing technology plays a decisive role in the design and implementation of CALL systems. The main speech technologies used in CALL systems are high-quality speech compression coding [11], speech synthesis [12], speech recognition [13], spoken dialogue technology [14], and emotion recognition [15]. However, research on CALL systems based on speech technology is still in its infancy. It has attracted the attention of many researchers, and the related work helps to improve the status quo of foreign language teaching and to solve existing problems in oral teaching.
VC dimension and structural risk minimization induction (SRM)
The VC dimension is an important theoretical cornerstone of learning from finite samples. It offers a way to overcome the “curse of dimensionality” and has become an important criterion for measuring the complexity of a classification algorithm. The VC dimensions of two important sets of functions are defined as follows [16]:
VC dimension of a set of indicator functions: The VC dimension of the indicator function set Q (z, α) , α∈ ∧ is the largest number h of sample vectors that the functions in the set can divide into positive and negative classes in all 2^h possible ways. This maximum h measures the shattering capability of the indicator function set with respect to the sample vectors. If the function set Q (z, α) , α∈ ∧ can shatter sets containing any number n of sample vectors, its shattering capability is infinite, that is, VC → ∞.
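The shattering condition can be checked by brute force on a small case. The sketch below is an illustration only and is not taken from the paper: it assumes a hypothetical family of oriented threshold functions on the real line (x ≥ t or x < t) and counts how many of the 2^h labelings it realizes on h points.

```python
def labelings(points, classifiers):
    """Return the set of label tuples that the classifiers realize on the points."""
    return {tuple(h(x) for x in points) for h in classifiers}


def is_shattered(points, classifiers):
    """True if every one of the 2^h labelings of the h points is realized."""
    return len(labelings(points, classifiers)) == 2 ** len(points)


def oriented_thresholds(points):
    """Illustrative indicator family on the real line: x >= t or x < t.

    Thresholds placed below, between and above the sorted points are enough
    to produce every labeling this family can realize on those points.
    """
    xs = sorted(points)
    cuts = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    family = []
    for t in cuts:
        family.append(lambda x, t=t: 1 if x >= t else 0)
        family.append(lambda x, t=t: 1 if x < t else 0)
    return family


two = [0.0, 1.0]
three = [0.0, 1.0, 2.0]
print(is_shattered(two, oriented_thresholds(two)))
print(is_shattered(three, oriented_thresholds(three)))
```

For this assumed family, two points can be shattered and three distinct points cannot, so its VC dimension is 2.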
VC dimension of a set of real functions: We assume that Q (z, α) , α∈ ∧ is a set of real functions bounded above and below by the constants C and D (C and D can each be ±∞). A β-level indicator of the real function set Q (z, α) marks the values of z for which a function in the set exceeds the level β. A single real function cannot shatter samples by itself, so the full set of level indicators of the real function set is used to characterize its shattering ability.
At the same time, the real function set Q (z, α) , α∈ ∧ and its level β are considered. The set of indicators for this function set can be expressed by Equation (1):
In the formula, θ (u) is a step function:
Therefore, the VC dimension of the real function set Q (z, α) , α∈ ∧ is the VC dimension of the β-level indicator set corresponding to the Equation (1).
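The two displayed expressions referred to above (the indicator set and the step function) are not reproduced in this text. As a reading aid, the standard form from statistical learning theory that matches the description is sketched here; it is an assumed reconstruction, with Λ written for the parameter set that the text prints as ∧ and β ranging between the lower and upper bounds of the function set.

```latex
\[
I(z, \alpha, \beta) = \theta\bigl(Q(z, \alpha) - \beta\bigr),
\qquad \alpha \in \Lambda,
\]
\[
\theta(u) =
\begin{cases}
0, & u < 0, \\
1, & u \ge 0.
\end{cases}
\]
```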
The VC dimension of a classifier can effectively control the model complexity in SRM, but unfortunately it is difficult to evaluate the VC dimension exactly in most cases. To avoid this problem, it is estimated on an artificial data set by fitting a theoretical formula to the observed error rates. For example, the VC dimension of the set of linear functions
The structural risk minimization (SRM) principle minimizes not only the empirical risk but also the confidence risk. The confidence risk is related to two factors. (1) The number of samples n: as the number of samples increases, the confidence risk curve gradually shifts downward, while the empirical risk curve gradually moves up. (2) The VC dimension of the function set: as the VC dimension increases, the empirical risk curve gradually moves down, while the confidence risk curve gradually moves up. The two risks therefore conflict, and the goal is to find the minimum of their sum. One solution in statistics is to divide the function set Q (z, α) into a sequence of nested function subsets S i whose VC dimensions are bounded and ordered. The sum of the empirical risk and the confidence risk can then be minimized by selecting among the subsets S i , that is, by finding the optimal bound on the actual risk. Figure 1 is a typical example of using this method to balance the empirical risk and the confidence risk. In the figure, S1 ⊆ S2 ⊆ ⋯ ⊆ S n is a sequence of nested function subsets, and h1 ≤ h2 ≤ ⋯ ≤ h n are their VC dimensions.

Finding the actual risk optimal boundary based on SRM.
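The trade-off between the two risks can be illustrated numerically. The sketch below assumes the widely quoted VC confidence term sqrt((h (ln(2n/h) + 1) − ln(η/4)) / n), which the paper does not state, and uses made-up empirical risks for five nested subsets; it only shows how the subset with the smallest bound would be selected.

```python
import math


def confidence_term(h, n, eta=0.05):
    """Assumed VC confidence term: sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n)."""
    return math.sqrt((h * (math.log(2.0 * n / h) + 1.0) - math.log(eta / 4.0)) / n)


def select_subset(emp_risks, vc_dims, n, eta=0.05):
    """Return (index, bound) of the nested subset S_i with the smallest risk bound."""
    bounds = [r + confidence_term(h, n, eta) for r, h in zip(emp_risks, vc_dims)]
    best = min(range(len(bounds)), key=bounds.__getitem__)
    return best, bounds[best]


# Hypothetical values for S_1, ..., S_5: empirical risk falls as the VC dimension grows.
vc_dims = [2, 5, 10, 50, 200]
emp_risks = [0.30, 0.18, 0.10, 0.08, 0.07]

index, bound = select_subset(emp_risks, vc_dims, n=1000)
print(index, round(bound, 3))
```

With these illustrative numbers the minimum falls on an intermediate subset: the smallest subsets underfit, and the largest ones pay too much in the confidence term.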
The basic idea of the SVM classifier is to empirically select a nonlinear map that satisfies the Mercer condition and to implicitly transform the input vector X into the high-dimensional feature space F (as shown in Fig. 2). Because the vectors in the high-dimensional space are sparse and linearly separable, the classification problem of the low-dimensional space can be solved there, and the classification result in the high-dimensional space is then reflected back in the low-dimensional sample space.

Theoretical model of nonlinear SVM.
Taking binary classification as an example, we assume that the training sample set is S = ((y1, x1) , (y2, x2) , ⋯ , (y I , x I )), where I is the total number of samples, x i is an n-dimensional vector, and y i ∈ {+1, −1} is the category of the sample x i . The equation of the classification hyperplane can then be defined as y = ω T x + b, where ω is the weight vector and b is the offset. The optimal hyperplane (ω, b) is the plane that maximizes the geometric margin γ between the nearest sample vectors of the two classes. The geometric margin γ can be expressed by (3):
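The expression for γ is not reproduced in this text. A minimal sketch, assuming the usual definition (the smallest value of y i (ω T x i + b) / ‖ω‖ over the training set) and a made-up four-point sample set, compares the margins of two candidate hyperplanes:

```python
import math


def geometric_margin(samples, w, b):
    """Smallest signed distance y_i * (w . x_i + b) / ||w|| over the training set."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for y, x in samples
    )


# Toy two-class set in the plane, written as (y_i, x_i) pairs like S in the text.
S = [(+1, (2.0, 2.0)), (+1, (3.0, 1.0)), (-1, (0.0, 0.0)), (-1, (-1.0, 1.0))]

margin_diagonal = geometric_margin(S, w=(1.0, 1.0), b=-2.0)  # hyperplane x1 + x2 = 2
margin_vertical = geometric_margin(S, w=(1.0, 0.0), b=-1.0)  # hyperplane x1 = 1
print(margin_diagonal, margin_vertical)
```

Both hyperplanes separate the toy set, but the first has the larger margin, which is the quantity the optimal hyperplane maximizes.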
At this point, solving the minimum value of the objective function can be transformed into the convex quadratic programming (QP) of the optimization formula (4):
The Lagrangian function of the convex quadratic programming problem is constructed with the Lagrange multiplier method:
In the formula, α = (α1, α2, ⋯ , α l ) T , α i ≥ 0 are the Lagrange multipliers. The partial derivatives with respect to ω and b are taken:
Setting these derivatives to zero gives the extreme value conditions:
By substituting the extreme value conditions into the original Lagrangian function (5), the following formula is obtained:
The constraint used to construct the objective function with the Lagrange multiplier method is 1-y [(ω · x) + b] ≤ 0. Since this constraint is an inequality rather than an equality, the condition under which the classical Lagrange multiplier method finds an extremum is not satisfied, and the solution found in this way is not necessarily optimal. We therefore first write the generalized Lagrangian function of the original problem and then define its dual problem. The dual problem represented by Equation (8) is:
The dual problem above is further transformed into:
Using the sequential minimal optimization (SMO) algorithm to solve the above quadratic programming problem, the optimal solution α = (α1, α2, ⋯ , α l ) T can be obtained, and from it the optimal classification hyperplane (ω, b).
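The displayed equations of this derivation are not reproduced in this text. As a reading aid, the standard hard-margin derivation that matches the description (margin maximization, multipliers α i ≥ 0, and the constraint 1 − y [(ω · x) + b] ≤ 0) is sketched here. It is an assumed reconstruction; the layout and numbering of the original equations are not known.

```latex
\begin{align*}
&\min_{\omega, b}\ \tfrac{1}{2}\lVert \omega \rVert^{2}
  \quad \text{s.t.} \quad 1 - y_i(\omega^{T} x_i + b) \le 0,\ \ i = 1, \dots, l, \\
&L(\omega, b, \alpha) = \tfrac{1}{2}\lVert \omega \rVert^{2}
  + \sum_{i=1}^{l} \alpha_i \bigl[ 1 - y_i(\omega^{T} x_i + b) \bigr],
  \qquad \alpha_i \ge 0, \\
&\frac{\partial L}{\partial \omega} = \omega - \sum_{i=1}^{l} \alpha_i y_i x_i = 0,
  \qquad
  \frac{\partial L}{\partial b} = -\sum_{i=1}^{l} \alpha_i y_i = 0, \\
&\max_{\alpha}\ \sum_{i=1}^{l} \alpha_i
  - \tfrac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j\, x_i^{T} x_j
  \quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0,\ \ \alpha_i \ge 0 .
\end{align*}
```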
The role of the kernel function is to implicitly transform the original data set from the low-dimensional, linearly inseparable sample space to the high-dimensional feature space. In this feature space, the data set S becomes sparse and linearly separable, so the nonlinear data set S can be separated easily with a linear learner. Common kernel functions are:
Linear kernel:
Polynomial kernel:
RBF kernel (Gaussian kernel):
Sigmoid kernel:
In the formulas, γ, d and r are kernel parameters. Each of these kernel functions has its own advantages, but the Gaussian kernel is the most widely used because of its strong generalization ability. In applied research, choosing the type of kernel function and tuning each parameter during training still require some experience.
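The kernel formulas themselves are not reproduced in this text. The sketch below assumes the common parameterization that uses the same three parameters γ, d and r (as in libsvm-style definitions); the function names and default values are illustrative.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=3):
    """K(u, v) = (gamma * u . v + r) ** d"""
    return (gamma * dot(u, v) + r) ** d


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=1.0, r=0.0):
    """K(u, v) = tanh(gamma * u . v + r)"""
    return math.tanh(gamma * dot(u, v) + r)


x_i, x_j = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x_i, x_j), polynomial_kernel(x_i, x_j, r=1.0, d=2))
print(rbf_kernel(x_i, x_j, gamma=0.5), sigmoid_kernel(x_i, x_j, gamma=0.1))
```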
Introducing kernel functions into nonlinear classification problems replaces the inner product of two vectors in the low-dimensional space with the equivalent inner product in the feature space, K (x i , x j ) = φ (x i ) · φ (x j ), where φ is a mapping from the original space X to the feature space F. At this point, the objective function of the nonlinear SVM can be further represented by the kernel function K (·) as:
In Equation (11), the Lagrangian multipliers corresponding to the sample vectors x i and x j are α i and α j , respectively. The penalty factor C is a parameter introduced by soft-margin optimization, which indicates the degree of penalty when a sample is misclassified. This measure takes into account the influence of noise and outliers on the classification hyperplane.
The SMO algorithm is used to solve the optimal solution α = (α1, α2, ⋯ , α l ) T , and α is substituted into
After ω is substituted into the classification hyperplane equation, the value of the decision function of each sample is calculated by Equation (12). If f (x) > 0, it is a positive class sample. However, if f (x) < 0, it is a negative class sample.
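Equation (12) is not reproduced in this text. A minimal sketch of the decision rule, assuming the usual kernel form f (x) = Σ α i y i K (x i , x) + b and a hand-picked pair of support vectors with a linear kernel (all values are illustrative):

```python
def decision_value(x, support, b, kernel):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(alpha * y * kernel(x_i, x) for alpha, y, x_i in support) + b


def classify(x, support, b, kernel):
    """+1 when f(x) > 0, otherwise -1 (f(x) = 0 is sent to the negative class here)."""
    return 1 if decision_value(x, support, b, kernel) > 0 else -1


def linear_kernel(u, v):
    return sum(a * c for a, c in zip(u, v))


# Hypothetical solution: (alpha_i, y_i, x_i) triples with sum_i alpha_i * y_i = 0.
support = [(0.5, +1, (1.0, 0.0)), (0.5, -1, (-1.0, 0.0))]
b = 0.0

print(classify((3.0, 5.0), support, b, linear_kernel))
print(classify((-0.2, 7.0), support, b, linear_kernel))
```

Only samples with α i > 0 contribute to the sum, which is why the support vectors are all that needs to be stored after training.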
After the raw data are mapped to a high-dimensional space with the kernel function method, the possibility of linear separability is greatly increased, but some cases remain difficult to handle. For example, if there is noise in the data set, the positions of the original data points will shift. Such shifted sample points are often referred to as outliers. Outliers severely affect the location of the classification hyperplane, and Fig. 3 is a notable example. The yellow triangle circled in red is an outlier: it lies far from the category region it should belong to and causes the hyperplane to be squeezed toward the other side. If we ignore this point and find the classification hyperplane with the largest geometric margin, a better classification effect can be achieved. For the outlier problem, the constraint can be rewritten as: y i (ω T x i + b) ≥ 1-ξ i , i = 1, ⋯ , I. In the formula, the value of the slack variable ξ i indicates how far the data point deviates from its group. The larger the value, the farther the point is from the group. The formula for solving the optimal solution after introducing the slack variable ξ i is expressed as:
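The soft-margin formula is not reproduced in this text. The sketch below assumes the standard objective, ½‖ω‖² + C Σ ξ i with ξ i = max(0, 1 − y i (ω T x i + b)), and evaluates it for a fixed hyperplane on a made-up sample set that contains one outlier.

```python
def slack(y, x, w, b):
    """xi_i = max(0, 1 - y_i * (w . x_i + b)); zero for points outside the margin."""
    return max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))


def soft_margin_objective(samples, w, b, C):
    """0.5 * ||w||^2 + C * sum_i xi_i for a fixed hyperplane (w, b)."""
    penalty = sum(slack(y, x, w, b) for y, x in samples)
    return 0.5 * sum(wi * wi for wi in w) + C * penalty


# Toy (y_i, x_i) pairs; the last positive sample sits deep in the negative region.
samples = [
    (+1, (2.0, 0.0)),
    (+1, (3.0, 1.0)),
    (-1, (-2.0, 0.0)),
    (+1, (-1.5, 0.5)),  # outlier
]
w, b = (1.0, 0.0), 0.0

print([slack(y, x, w, b) for y, x in samples])
print(soft_margin_objective(samples, w, b, C=1.0))
```

A larger penalty factor C makes the same outlier cost more, which pushes the optimizer toward moving the hyperplane; a smaller C lets the hyperplane ignore it.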

Effect of outliers on the location of the classification hyperplane.
One-against-rest (OAR). For a k-class classification problem, k binary classifiers are first established. Each binary classifier computes a decision function (i.e., obtains a separating hyperplane) that distinguishes the samples of its own class from the samples of all other classes. The final classification of a sample is the class whose classifier claims it, and the label of that class is assigned to the sample. However, the method fails when several classifiers simultaneously claim an unknown sample for their own class.
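The ambiguity mentioned above is easy to reproduce. In the sketch below the decision values are made up, and breaking ties by the largest decision value is a common heuristic that is assumed here rather than taken from the paper.

```python
def one_against_rest(decision_values):
    """Return (label, claimants) from a dict mapping class label to f_k(x).

    claimants lists every class whose binary classifier returned f_k(x) > 0;
    label is the class with the largest decision value (assumed tie-break).
    """
    claimants = [label for label, f in decision_values.items() if f > 0]
    label = max(decision_values, key=decision_values.get)
    return label, claimants


# Hypothetical outputs of k = 3 binary classifiers for one sample.
label, claimants = one_against_rest({"A": 0.8, "B": 0.3, "C": -1.2})
print(label, claimants)
```

Here two classifiers claim the sample, so the plain rule is ambiguous and only the assumed tie-break yields a single label.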
One-against-one (OAO). A binary classifier is trained between every pair of categories in the multi-class problem. Since the class (i, j) classifier and the class (j, i) classifier are the same classifier, only k (k-1)/2 binary classifiers need to be trained for a k-class problem. When the binary classifier C ij is trained, the training samples belonging to category i are treated as the positive class of the two-class problem, and those belonging to category j are treated as the negative class. The final category of a sample is determined by voting: the category label with the most votes is the category to which the sample belongs. Current research on multi-class problems shows that the OAO method is superior to other methods in training effectiveness and prediction accuracy. However, when the sizes of the class data sets are unbalanced, the method does not perform well.
In decision-tree-based multi-class SVM techniques (DDAG and DAGSVM), the decision directed acyclic graph for a k-class classification problem contains k (k-1)/2 nodes, and each node trains a binary classifier. Figure 4 depicts the solution process of the DDAG algorithm using a four-class problem:
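The voting step can be sketched as follows. The pairwise classifier here is a hypothetical stand-in (it picks the nearer of two one-dimensional class centres) and takes the place of the trained binary SVMs.

```python
from collections import Counter
from itertools import combinations


def one_against_one(x, classes, pairwise):
    """Vote over all k(k-1)/2 class pairs; pairwise(i, j, x) returns the winner, i or j."""
    votes = Counter(pairwise(i, j, x) for i, j in combinations(classes, 2))
    return votes.most_common(1)[0][0], votes


# Hypothetical stand-in for trained binary SVMs: one 1-D centre per class.
centres = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}


def nearest_centre(i, j, x):
    return i if abs(x - centres[i]) <= abs(x - centres[j]) else j


label, votes = one_against_one(1.9, [1, 2, 3, 4], nearest_centre)
print(label, dict(votes))
```

For k = 4 the six pairwise classifiers each cast one vote, and the class with the most votes is returned.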

The process of finding the optimal class by the DDAG algorithm [taking four classifications as an example].
DDAG is equivalent to operating on a list, and each node of the graph can exclude only one class from the list at a time. The list {1, 2, 3, 4} is initialized with all category labels, and each OAO binary classifier is built from the first and last elements of the current list. With the root node (1VS4), excluding category 1 gives the new list {2, 3, 4}, and excluding category 4 gives the new list {1, 2, 3}. The first and last elements of each new list then define the next OAO classifiers, (2VS4) and (1VS3). These steps are repeated until a leaf node is reached, where only one category remains in the list. In the figure above, classifying a sample in the four-class problem requires evaluating 3 decision nodes, so a k-class problem requires evaluating (k-1) decision nodes.
Experiments with the DDAG algorithm show that the main factors affecting classification accuracy are the margin at the decision nodes and the size of the graph, whereas the dimension of the input space is not an influencing factor. Therefore, J. Platt et al. proposed DAGSVM, an improved DDAG algorithm that builds larger-margin hyperplanes in the high-dimensional feature space (as shown in Fig. 5), thereby improving classification accuracy.
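The list operation described above can be written in a few lines. As before, the pairwise classifier is a hypothetical stand-in based on one-dimensional class centres rather than a trained SVM.

```python
def ddag_classify(x, classes, pairwise):
    """Walk the DDAG: test first vs last label, drop the loser, repeat until one is left."""
    remaining = list(classes)
    path = []
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        path.append((first, last))
        if pairwise(first, last, x) == first:
            remaining.pop()       # class `last` is excluded
        else:
            remaining.pop(0)      # class `first` is excluded
    return remaining[0], path


# Hypothetical stand-in for the trained binary classifiers.
centres = {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0}


def nearest_centre(i, j, x):
    return i if abs(x - centres[i]) <= abs(x - centres[j]) else j


label, path = ddag_classify(1.9, [1, 2, 3, 4], nearest_centre)
print(label, path)
```

The path starts at the root (1VS4) and visits exactly k − 1 = 3 nodes, whereas voting over all pairs would evaluate all six classifiers.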

Binary classifiers for the root node of DDAG processing four classification problems.
Cascaded support vector machine (Cascade SVM)
With the advent of big data and cloud computing era, training data sets are growing at an exponential rate. Therefore, training data sets with single-node SVMs is far from meeting the requirements of data storage and real-time processing. Therefore, many scholars have begun to study methods of block or group training. The basic idea of the method is to train each block in parallel using multiple nodes, and then integrate the support vectors obtained by each block training according to the established rules, and then obtain the global optimal solution. This “divide and conquer” idea not only greatly reduces the scale of the problem but also effectively shortens the training time, but the prediction accuracy is often not guaranteed.
In response to the above problems, Graf et al. proposed a two-way cascading full-feedback structural model (the Cascade SVM model, as shown in Fig. 6). The model guarantees a certain prediction accuracy while computing in parallel and has become a milestone among distributed parallel SVM algorithms. The processing of the Cascade SVM is as follows. First, the original training sample set TD is randomly partitioned into N subsample sets (N = 8 in the figure). These subsample sets are assigned to the child nodes of the first layer, and a single-node SVM trainer such as libsvm is run in parallel on the child nodes to obtain their support vectors (SV1-SV8). The support vectors of these child nodes are then combined in pairs to form new training sample sets, which are used as input for the second layer of training. The number of nodes participating in the second layer is only 4 (that is, the number of nodes is halved at each layer). This process is repeated until the number of child nodes in the last layer is 1, at which point one round of training ends. If the support vectors (SV15) obtained by the last layer reach the global optimum or meet the preset condition, the iteration stops and the final training model is given. Otherwise, the result is fed back to the first layer for the next iteration, and so on, until the final support vectors reach the global optimum.

Model diagram of two-way Cascade SVM.
From this description of the Cascade SVM framework, a conclusion can easily be drawn: the iterative training of each layer on the sub-datasets is a process of culling non-support vectors layer by layer while retaining the support vectors. If all non-support vectors are culled during training, the support vectors that finally remain form the global optimal solution of the original training sample set.
Compared with other classifiers, the SVM classifier has the advantages of high prediction accuracy and of needing no additional training step to obtain the final classification result. SVM classifiers are therefore a preferred method in many fields. However, as the size of the training sample set expands, the computation time of SVM training increases dramatically, and the calculation time is often four times the total number of samples. Alham et al. proposed the MRSVM parallel algorithm framework for this problem. The algorithm trains a single sub-sample set on a single node using the sequential minimal optimization (SMO) algorithm, and multiple Mapper and Reducer compute nodes in the cluster execute in parallel to complete the optimization of the entire training sample set. The design idea is to divide the original training sample set into N subsample sets by random partitioning. Each subsample set is the input dataset of one Mapper, and multiple Mappers optimize their SVMs in parallel. Then, the outputs are merged two by two and trained to obtain support vectors, and the combined support vectors are used as the input for the next layer of training. This process is repeated until the number of nodes in the last layer is 1, at which point one round of level training ends. If the support vectors obtained by the last layer reach the global optimum, the iteration stops. Otherwise, the support vectors obtained in the last layer are combined with the sub-sample sets of the first layer and the process is repeated until the global-optimum stopping condition is reached. The experimental results on image annotation show that the algorithm maintains a certain level of prediction accuracy while significantly shortening the training time.
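The layer-by-layer culling can be illustrated with a deliberately simple stand-in for the SVM trainer. The sketch below assumes separable one-dimensional data with all negatives to the left of all positives, where the support vectors of the maximum-margin threshold are just the largest negative and the smallest positive sample. It shows one forward pass only; the feedback loop of the full model is omitted, and the number of parts is assumed to be a power of two.

```python
import random


def train_1d(samples):
    """Stand-in trainer for separable 1-D data given as (x, y) pairs.

    Returns the support vectors of the maximum-margin threshold: the largest
    negative sample and the smallest positive sample present in `samples`.
    """
    neg = [s for s in samples if s[1] == -1]
    pos = [s for s in samples if s[1] == +1]
    support = []
    if neg:
        support.append(max(neg))
    if pos:
        support.append(min(pos))
    return support


def cascade_pass(samples, n_parts, train):
    """One cascade pass: train on each part, then merge support vectors pairwise."""
    shuffled = samples[:]
    random.shuffle(shuffled)
    layer = [train(shuffled[i::n_parts]) for i in range(n_parts)]
    while len(layer) > 1:
        layer = [train(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]


random.seed(0)
data = [(-random.uniform(0.5, 5.0), -1) for _ in range(40)]
data += [(random.uniform(0.7, 6.0), +1) for _ in range(40)]

cascade_sv = cascade_pass(data, 8, train_1d)
print(sorted(cascade_sv) == sorted(train_1d(data)))
```

In this toy setting a single pass already recovers the global support vectors, because a globally extreme point is also extreme in whichever part it lands in. With a real SVM a point discarded early can matter later, which is why the full model feeds the result back to the first layer.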
The MRSVM parallel algorithm model successfully applies the SVM classifier to the MapReduce functional programming framework, and effectively shortens the overall training time without significantly affecting the prediction accuracy. In the algorithm framework, the number of Mappers that perform the optimized subsample set in parallel in the first layer is equal to the number of original training sample set partitions, and each Mapper is responsible for optimizing one subsample set, and multiple Mappers of one layer are executed in parallel. Moreover, Mapper only outputs those sample sets whose Lagrangian coefficient α i is greater than 0, that is, the corresponding support vector. The output of the last layer includes not only the Lagrangian coefficient α i (0 ≤ α i ≤ C) and the sample x i corresponding to the coefficient, but also the offset b. Then, the formula

MRSVM frame diagram.
The database used in this paper is the ISLE database, which was recorded by 23 German and 23 Italian speakers, all adult non-native learners of English. The recordings are 16 kHz, 16-bit, mono, little-endian. Each utterance in the database is annotated at the phoneme level, including the start and end time of each phoneme, the name of the phoneme, and whether each vowel phoneme is accented. The annotation file is aligned at three levels (words, phonemes and accented syllables), as shown in Fig. 8.

Annotation of corpus statements.
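Audio in the stated format (16 kHz, 16-bit, mono, little-endian) can be decoded with the standard library alone. The sketch below assumes headerless PCM bytes and annotation times expressed in seconds; neither detail is confirmed by the text, and the helper names are illustrative.

```python
import struct

SAMPLE_RATE = 16000  # Hz, as stated for the corpus


def decode_pcm16le(raw):
    """Decode headerless 16-bit little-endian mono PCM bytes into a list of ints."""
    count = len(raw) // 2
    return list(struct.unpack("<%dh" % count, raw[:2 * count]))


def slice_segment(samples, start_s, end_s, rate=SAMPLE_RATE):
    """Return the samples between two annotation times given in seconds (assumed unit)."""
    return samples[int(round(start_s * rate)):int(round(end_s * rate))]


decoded = decode_pcm16le(b"\x01\x00\xff\xff\x00\x80")
print(decoded)

one_second = list(range(SAMPLE_RATE))
phoneme = slice_segment(one_second, 0.25, 0.5)
print(len(phoneme))
```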
In order to study which features are most effective for identifying accents, we perform various calculations on the features separately, and the different parameters obtained respectively give corresponding experimental results on the test set. These include duration (Duration); energy maximum (EM), energy average (EA), energy change rate (ECR); fundamental frequency maximum (PM), fundamental frequency average (PA), and fractal dimension value change rate (FDCR). The experimental results are shown in Table 1, and the statistical graph is shown in Fig. 9.
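Three of the listed features can be computed directly from the samples of one syllable. The sketch below assumes short-time energy is the mean squared amplitude over 25 ms frames with a 10 ms hop at 16 kHz; these framing choices are assumptions. ECR, PM, PA and FDCR are left out because their exact definitions are not given here.

```python
def frame_energies(samples, frame_len=400, hop=160):
    """Short-time energy (mean squared amplitude) of each frame; framing is assumed."""
    last_start = max(len(samples) - frame_len, 0)
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, last_start + 1, hop)
    ]


def syllable_features(samples, rate=16000):
    """Duration in seconds, energy maximum (EM) and energy average (EA) of one segment."""
    energies = frame_energies(samples)
    return {
        "Duration": len(samples) / rate,
        "EM": max(energies),
        "EA": sum(energies) / len(energies),
    }


# Hypothetical 100 ms segment of constant amplitude, used only to exercise the code.
features = syllable_features([1000] * 1600)
print(features)
```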
Linear discriminant recognition results for accented syllable recognition

Linear discriminant recognition result of accented syllable recognition.
It can be seen from Table 1 and Fig. 9 that the method proposed in this study is significantly better than the min-max normalization method. This also supports the premise of RankNet ordering that only vowel phonemes within the same word are comparable. Under INwMS normalization, the energy maximum EM, the energy average EA, the energy change rate ECR and the fractal dimension change rate FDCR discriminate relatively well when a single feature is used for accent recognition. However, the accent recognition rate is still low. The reason may be that a traditional single speech feature does not capture what accents have in common: it may play a prominent role in some corpora but have little effect in others. Therefore, a better feature fusion algorithm is needed. The RankNet feature fusion algorithm, which is based on an artificial neural network, merges the features to better reflect what all accents have in common, thus improving the accent recognition rate.
To eliminate the dependence of the algorithm on the training data, we randomly assign the training, validation and test sets. After repeating the training more than 100 times, we obtained a stable accent recognition model and tested it on independent test corpora. The experimental results are shown in Table 2 and Fig. 10.
Accented syllable recognition results of fusion features

Accented syllable recognition results of fusion features.
It can be seen from Table 2 that the accent recognition rate is greatly improved after feature fusion, which indicates that the feature fusion algorithm proposed in this study can well reveal the commonality of accent. On the other hand, for the independent open test corpus, the results identified by the neural network model reached 20.59% and 19.39%, respectively, indicating that the model is robust, and we can get reliable results for oral assessment.
In English recognition, word-based or phoneme-based recognition cannot be applied directly to this problem. After careful and comprehensive research, we transformed the recognition problem into ranking the stress of syllables within a word and ranking the stress of words within a continuous sentence. Moreover, this study accurately extracts the phonetic features of phonemes and, following RankNet’s ranking theory, takes speech feature vectors as input and pre-annotated stress levels as labels to train a ranking model.
Traditional speech evaluation methods cannot describe the chaotic features of speech signals or fully approximate the complex nonlinear relationships between features, so the exact location of stress cannot be determined accurately. This paper introduces the support vector machine to characterize the speech signal and, through feature fusion, maps the complex nonlinear relationships between features to establish a smart English recognition system based on the support vector machine. The model can accurately identify the syllables and pronunciations in words. Moreover, the large-scale, speaker-independent corpus used in this article can represent the general characteristics of spoken-language learners.
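The pairwise ranking idea can be sketched with the standard RankNet formulation: the model assigns a score to each syllable, the probability that syllable i is more stressed than syllable j is the logistic function of the score difference, and training minimizes the cross-entropy against the annotated order. The scores below are made up, and the scoring network itself is not shown.

```python
import math


def ranknet_probability(s_i, s_j):
    """Modelled probability that item i should rank above item j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))


def ranknet_loss(s_i, s_j, target):
    """Cross-entropy between the target (1 if i is more stressed than j, else 0) and the model."""
    p = ranknet_probability(s_i, s_j)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Hypothetical scores for two syllables of the same word; syllable i carries the stress.
print(ranknet_probability(2.0, 0.0))
print(ranknet_loss(2.0, 0.0, target=1), ranknet_loss(0.0, 2.0, target=1))
```

Pairs would be formed only between syllables of the same word, in line with the observation that only those are comparable.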
Acknowledgment
Study on the Standardization of English Translation of Public Signs in Tourist Attractions in Shanxi Province(GH-16162).
