Abstract
Existing diagnostic evaluation models of English learning struggle to account for the subjective factors in student learning, which makes many of them difficult to apply in English teaching practice. To improve the effect of English learning, this study combines machine learning technology with the needs of English evaluation to build a diagnostic evaluation model of English learning. It compares random forest, Bayesian network, decision tree, perceptron, K-nearest neighbor and multi-model fusion methods, and selects the best-performing algorithm for diagnostic analysis. The model constructed in this paper mainly evaluates and judges the errors in students’ English learning. The proposed method is validated through a controlled experiment, and the results show that it has a certain positive effect on English learning performance.
Introduction
China’s growing global influence has prompted many people in China to learn English as a second language in order to connect with international standards, and this trend will continue. Many computer-assisted learning tools have been developed for learning English, but there are very few evaluation tools. In particular, there is a lack of tools that automatically detect and correct English grammatical errors. For example, although Microsoft Word has integrated English spelling and grammar checking functions for many years, these tools are still quite rudimentary [1]. English is not only an old language but also a widely used one, and over the course of its development it has accumulated many differences from other languages [2]. English expressions also contain many repetitions, so it is very common for non-native English speakers to make various grammatical errors in writing. The goal of English Grammar Error Diagnosis (CGED) is to establish a system that can automatically diagnose errors in English sentences. With the proliferation of foreign language learners, such automatic diagnosis will be of great benefit. Grammar checking has been studied for many years and has made great progress, but studies that combine grammar checking, error detection and correction are relatively few, and this area has only progressed recently. Although languages differ in many ways, they also share similarities, such as fixed collocations of words, so experience gained in natural language processing can still be applied. The CGED task gives Natural Language Processing (NLP) researchers an opportunity to build and develop grammar error recognition systems, compare their research results, and exchange methods. At the same time, it benefits both English learners and the teaching activities of teachers [3].
At present, the main problem in scoring English learning is that it is too superficial. The commonly used scoring methods are the analytical method and the impression method. Apart from organized examinations, daily homework is usually judged independently by the English teacher: following one of these two methods, several aspects of the composition are scored and then combined into a total score. The scoring criteria of the analytical method are the accuracy of grammar, vocabulary and spelling. Its disadvantage is that it pays too much attention to local content and too little to the coherence of the text and the cultivation of thinking ability. The scoring standard of the impression method is the impression that the elements, content, structure, language accuracy and fluency of the composition make on the teacher. However, the resulting score does not reflect the strengths and weaknesses of the student’s abilities in each of these elements. The SOLO classification evaluation method emphasizes not only students’ knowledge but also their cognitive development, so it helps to improve their thinking skills and strategies in English continuation writing. It also pays more attention to how students complete learning tasks and which skills they use. According to the five levels of the theory, the extended abstract level under each mode of thinking corresponds to the pre-structural level under the next higher mode of thinking. Therefore, in the practice of English continuation writing, appropriate higher-level teaching activities and goals can be designed according to students’ current learning ability, which makes teaching more targeted and improves the efficiency of English learning.
Related work
In the field of grammatical error correction, more work is concentrated on the English language. The detection and correction of grammatical errors in shared tasks have attracted a large number of English NLP researchers, and participants have adopted different methods to conduct research [4]. For example, the literature [5] uses manual rules, statistical models, translation models, and language models to conduct research. The use of a large number of English materials and annotated corpora allows scholars to use English more thoroughly. However, English research resources are far from enough, and there are few studies related to English grammatical error correction. The literature [6] proposed two types of language models to detect word order, omission, and redundant error types, which correspond to the three types of English grammar error diagnosis. The literature [7] proposed a first-order probability inductive learning algorithm for misclassification, which is superior to some basic classifiers. The literature [8] proposed a sentence-level judgment system that integrates multiple predefined rules and N-gram-based statistical features. The literature [9] proposed several methods including CRF and SVM, and performed frequency learning from a large n-gram corpus to detect and correct sequencing errors. In the past research, there have been some new ideas and results for error diagnosis. The work of the literature [10] includes manual construction of rules and automatic generation of rules, and the latter is similar to the frequent patterns in the training corpus. Literature [11] used a parallel corpus on the web, the language exchange website, to train statistical machine translation. Zampieri and Tan used a news corpus as a reference corpus and used frequent n-grams to detect errors in the data provided by the English grammar error diagnosis task. In their work, they also used a reference corpus as a source of n-gram frequencies. 
The collocation research proposed in the literature [12] also shows that grammatical error detection has been greatly improved. Since 2014, English Grammar Error Diagnosis (CGED) has been a shared task of NLP-TEA and has become a focus of research and competition among scholars and major platforms. The literature [13] achieved the best detection of grammatical errors, with an F1 score of 0.6884. For recognition-level assessment, the system needs to recognize the type of error in a given sentence, and most systems cannot effectively identify the possible grammatical errors in input sentences. The test results show that the system developed by the Confucius Institute at Rutgers University (CIRU) has the best recognition performance, with F1 = 0.4333 [14]. The literature [15] used the LSTM-CRF model as the main design basis and adopted three integration strategies to improve system performance. At the recognition level and location level, the system scored the highest 1191. Since then, subsequent researchers have used the Bi-LSTM model as the basic model for English grammar error diagnosis. The literature [16] regarded grammatical error diagnosis as a machine translation (MT) task, while the literature [17] studied rule-based models and language models. The results of the competition show that treating grammatical error diagnosis as an MT task achieves good results, whereas the performance of rule-based models and language models is not satisfactory.
Traditional machine learning models
A decision tree is a commonly used machine learning algorithm that uses a tree structure for classification. Each leaf node in the decision tree represents a class. Each internal node represents a test on an attribute: after a sample passes through this node, the result of the test decides which attribute is compared next, that is, which internal node is visited next [18].
Each path from the root node to a leaf node corresponds to one judgment process and shows which attributes are tested in turn. Every decision in the tree starts from the root node and follows a path down to a leaf node, and that leaf node is the final classification result [19]. The first step in generating a decision tree is feature selection, that is, finding the current optimal attribute. The choice of attributes determines the quality of the tree. In feature selection, we hope that splitting on a feature divides the samples so that, as far as possible, the samples in each subset belong to the same category. Two measures are generally used to evaluate this uncertainty. One is information entropy:
$$\operatorname{Ent}(D) = -\sum_{k=1}^{|\gamma|} p_k \log_2 p_k$$
Another method is the Gini coefficient:
$$\operatorname{Gini}(D) = 1 - \sum_{k=1}^{|\gamma|} p_k^2$$
Here D is the set of training samples at the current node, $p_k$ is the proportion of samples in D that belong to class k, and |γ| is the number of labels of the training data. The two criteria are read in the same way: the larger the value, the lower the purity, and the smaller the value, the higher the purity.
After the splitting feature of the current node is determined, the child nodes are generated recursively from top to bottom. Generation stops when a leaf node is reached, that is, when the data can no longer be split. After the decision tree is generated, a pruning step is also needed to prevent overfitting and improve generalization ability; pruning also reduces the size of the tree.
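The feature-selection step above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the helper names (`entropy`, `gini`, `split_impurity`, `best_feature`) and the toy data are assumptions made here for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini coefficient of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_impurity(rows, labels, feature, measure):
    """Weighted impurity of the subsets produced by splitting on one feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    n = len(labels)
    return sum(len(g) / n * measure(g) for g in groups.values())

def best_feature(rows, labels, measure=entropy):
    """Feature whose split leaves the lowest weighted impurity."""
    return min(rows[0].keys(),
               key=lambda f: split_impurity(rows, labels, f, measure))

# Hypothetical toy data: two categorical features per sentence
rows = [
    {"tense": "past", "article": "yes"},
    {"tense": "past", "article": "no"},
    {"tense": "present", "article": "yes"},
    {"tense": "present", "article": "no"},
]
labels = ["error", "error", "ok", "ok"]
print(best_feature(rows, labels))
```

In this toy data the `tense` feature separates the two classes completely, so its weighted impurity is zero under either measure and it is chosen as the split.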
Random forest is an ensemble learning method of the Bagging type. It obtains the final result by combining multiple weak classifiers through voting or averaging, which gives the overall model higher accuracy and better generalization performance. Random forest is flexible and does not depend heavily on parameter tuning. It integrates multiple decision trees into one classifier, so its basic unit is the decision tree. Each tree is generated according to the following rules:
We assume that the size of the training data is N. For each tree, N training samples are drawn at random with replacement from the training data, and this new set becomes the actual training data for that tree. Sampling with replacement yields a different training set for each tree, and therefore different decision trees. Secondly, in the random forest, not all features are used in each tree. We assume that the dimension of each sample is M and set a constant m (m ≪ M). Each time the tree splits, m features are randomly extracted from the M features, and the best of these m features is selected for the split.
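The two sources of randomness described above, together with the final vote, can be sketched as follows. The function names and the fixed seed are assumptions for illustration; training of the individual trees is omitted.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) samples with replacement (one training set per tree)."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]

def candidate_features(num_features, m, rng):
    """Pick m distinct feature indices out of M to consider at one split."""
    return rng.sample(range(num_features), m)

def majority_vote(predictions):
    """Combine the class predictions of all trees by voting."""
    return max(set(predictions), key=predictions.count)

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
data = list(range(10))                 # stand-in for N = 10 training samples
sample = bootstrap_sample(data, rng)   # may contain repeated samples
feats = candidate_features(8, 3, rng)  # M = 8, m = 3
print(sample, feats, majority_vote(["ok", "error", "ok"]))
```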
In Bayesian networks, the conditional independence between variables can be used to decompose the joint distribution into multiple probability distributions with lower complexity [20].
The three connection forms of Bayesian network are shown in Fig. 1.

Bayesian network connection form.
Sequential connection:
When Z is unknown, x and y are not independent of each other.
When Z is known, x and y are independent of each other.
Decentralized connection [21]:
When Z is unknown, x and y are not independent of each other.
When Z is known, x and y are independent of each other.
Aggregation connection:
When Z is unknown, there is no relationship between x and y, so x and y are independent of each other.
When Z is known, x and y are no longer independent of each other.
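The independence statements for the connection forms can be checked numerically on small binary distributions. The sketch below covers the sequential connection (x → Z → y) and the aggregation connection (x → Z ← y); all probability tables are hypothetical values chosen only for illustration.

```python
from itertools import product

def joint_chain(px, pz_x, py_z):
    """Joint P(x, z, y) for the sequential connection x -> z -> y."""
    return {(x, z, y): px[x] * pz_x[x][z] * py_z[z][y]
            for x, z, y in product((0, 1), repeat=3)}

def joint_collider(px, py, pz_xy):
    """Joint P(x, z, y) for the aggregation connection x -> z <- y."""
    return {(x, z, y): px[x] * py[y] * pz_xy[(x, y)][z]
            for x, z, y in product((0, 1), repeat=3)}

def independent(joint, given_z=None, tol=1e-9):
    """True if x and y are independent, marginally or given z = given_z."""
    zs = (0, 1) if given_z is None else (given_z,)
    # Table of P(x, y), unnormalized when conditioning on z
    table = {(x, y): sum(joint[(x, z, y)] for z in zs)
             for x, y in product((0, 1), repeat=2)}
    total = sum(table.values())
    for (x, y), mass in table.items():
        p_x = sum(table[(x, b)] for b in (0, 1)) / total
        p_y = sum(table[(a, y)] for a in (0, 1)) / total
        if abs(mass / total - p_x * p_y) > tol:
            return False
    return True

# Hypothetical probability tables, chosen only for illustration
chain = joint_chain([0.3, 0.7], [[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.1, 0.9]])
collider = joint_collider([0.3, 0.7], [0.4, 0.6],
                          {(0, 0): [0.9, 0.1], (0, 1): [0.5, 0.5],
                           (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]})

print(independent(chain), independent(chain, given_z=1))
print(independent(collider), independent(collider, given_z=1))
```

For these tables, x and y are dependent in the sequential connection until Z is fixed, whereas in the aggregation connection they are independent until Z is observed, at which point observing Z couples them.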
Frequent itemsets: It refers to itemsets, sequences or substructures that frequently appear in the data set. Among them, the support refers to the frequency of a certain set in all transactions.
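First of all, the frequent item sets of all symptoms in all electronic medical records whose support exceeds a specified level are counted and saved in additional files. The Apriori algorithm is used, which mainly relies on two measures, support and confidence:
$$\operatorname{support}(X) = \frac{|\{T \in D : X \subseteq T\}|}{|D|}, \qquad \operatorname{confidence}(X \Rightarrow Y) = \frac{\operatorname{support}(X \cup Y)}{\operatorname{support}(X)}$$
where D is the set of all transactions and X and Y are item sets.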
The Apriori algorithm is based on two main laws [22–24]:
If an item set is a frequent item set, then all its subsets are also frequent item sets;
If an item set is not a frequent item set, then all its supersets are not frequent item sets.
In this work, we mainly use support to find frequent item sets with high support. Then, we randomly add several symptoms to each electronic medical record. Each electronic medical record is denoted T. If there is a frequent item set M such that the size of the intersection |T ∩ M| is greater than n, where n is an integer that can be set by the user, several items are randomly selected from M and added to T. After the data are supplemented in this way, the performance of the Bayesian network improves to a certain extent.
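A level-wise search that uses support and the two Apriori laws can be sketched as follows. This is a minimal illustration with hypothetical transactions, not the procedure applied to the records in the study.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: support}."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support([i], transactions) >= min_support]
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s, transactions)
        k += 1
        # Join frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: a candidate with an infrequent subset cannot be frequent
        level = [c for c in candidates
                 if all(frozenset(sub) in result for sub in combinations(c, k - 1))
                 and support(c, transactions) >= min_support]
    return result

# Hypothetical transactions, used only for illustration
records = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = frequent_itemsets(records, min_support=0.6)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a minimum support of 0.6, all single items and all pairs are frequent in this toy data, while the full triple appears in only two of five transactions and is rejected.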
The perceptron is a simple artificial neural network. The goal of perceptron learning is to learn a hyperplane that places the two classes of data on opposite sides. It is mainly used for binary classification and is trained by gradient descent on the misclassified samples.
We assume that $X = (X_1, X_2, \cdots, X_n)$ is one training sample. A hyperplane is defined by a set of real numbers $w = (w_1, w_2, \cdots, w_n)$, at least one of which is non-zero, and a constant $b$; all points on the hyperplane satisfy the linear equation
$$w \cdot X + b = 0$$
If
$$w \cdot X + b > 0$$
then X is considered to be a positive example, that is, Y is equal to 1; otherwise Y is equal to -1. It is easy to see that a correctly classified sample satisfies $Y (w \cdot X + b) > 0$.
The goal of the perceptron is to learn this hyperplane. The loss function of the perceptron is based on the total distance from all misclassified points to the hyperplane. The distance from a single misclassified point $X_i$ with label $Y_i$ to the hyperplane can be calculated by the following formula:
$$-\frac{1}{\|w\|} Y_i (w \cdot X_i + b)$$
Because $Y_i (w \cdot X_i + b)$ is negative for a misclassified point, a negative sign is added in front so that the distance is positive. The factor $1/\|w\|$ is a positive scaling constant that does not change which points are misclassified, so it is dropped.
Therefore, the loss function can be obtained as:
$$L(w, b) = -\sum_{X_i \in M} Y_i (w \cdot X_i + b)$$
Among them, M is the set of misclassified points, and the loss is used to measure the quality of the perceptron model. The training process of the perceptron continuously reduces the distance between the misclassified points and the hyperplane, and so gradually reduces the number of misclassified examples.
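The training procedure can be sketched as the classic perceptron update rule, where each misclassified sample moves the hyperplane towards itself. The weight vector is written `w` and the constant term `b` here; the learning rate, epoch limit and toy data are assumptions for illustration.

```python
def train_perceptron(samples, labels, lr=1.0, epochs=100):
    """Learn w and b by correcting one misclassified sample at a time."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:          # misclassified or on the hyperplane
                # Gradient step on the loss term -y * (w . x + b)
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:                      # every sample is on the correct side
            break
    return w, b

def predict(w, b, x):
    """Return 1 for the positive side of the hyperplane, otherwise -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical linearly separable data
samples = [(3.0, 3.0), (4.0, 3.0), (1.0, 1.0)]
labels = [1, 1, -1]
w, b = train_perceptron(samples, labels)
print(w, b, [predict(w, b, x) for x in samples])
```

The loop stops as soon as a full pass produces no misclassification, which is guaranteed to happen for linearly separable data.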
In order to overcome the shortcomings of one-hot coding, many researchers have proposed different algorithms to replace it. The words in the text are pre-trained into word vectors before entering the neural network. The idea is to map each word to a vector of fixed dimension through an algorithm.
Word embeddings are also called distributed vectors. At present, word2vec offers two training methods, Hierarchical Softmax and Negative Sampling, and two commonly used model architectures, CBOW (Continuous Bag of Words) and Skip-Gram. These two models are opposite processes. The training input of the CBOW model is the word vectors of the context words of a specific word, and the output is the word vector of this specific word.
The Skip-Gram model follows the opposite idea: the input is the word vector of a specific word, and the output is the word vectors of the context corresponding to that word.
First, the CBOW model is introduced. Figure 2 shows the structure of the algorithm.

CBOW model structure.
Input layer: The input is C context words, and each word is one-hot encoded. We assume that the number of different words is V, so each word is represented by a V-dimensional vector.
Hidden layer: All words need to be multiplied by $W_{V \times N}$, which is a shared matrix. After all the context word vectors $x_1, \cdots, x_C$ are multiplied by the shared matrix and then averaged, the following result is obtained:
$$h = \frac{1}{C} W_{V \times N}^{\top} (x_1 + x_2 + \cdots + x_C)$$
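Output layer: In the output layer, we need to calculate the probability of $W_i$ given its context, that is, $P(W_i \mid W_1 \cdots W_{i-1} W_{i+1} \cdots W_C)$. Because the hidden-layer projection $h$ has integrated the context of $W_i$, the output of the hidden layer is multiplied by the output weight matrix $W'_{N \times V}$ to give a score $u_j$ for each word, and softmax normalization yields:
$$u = W'^{\top}_{N \times V}\, h, \qquad P(W_i \mid W_1 \cdots W_{i-1} W_{i+1} \cdots W_C) = \frac{\exp(u_i)}{\sum_{j=1}^{V} \exp(u_j)}$$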
In terms of the loss function, the cross-entropy function is generally selected, and the weight is updated using the backpropagation gradient descent algorithm. The W obtained by the final training is the word vector we need.
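The CBOW forward pass and its cross-entropy loss can be sketched without any external library. The matrix names `W_in` (V × N) and `W_out` (N × V) and all numeric values are assumptions for illustration; the backward pass is omitted.

```python
import math

def cbow_forward(context_ids, W_in, W_out):
    """Return softmax probabilities over the vocabulary for C context words."""
    C = len(context_ids)
    N = len(W_in[0])
    V = len(W_out[0])
    # A one-hot vector times W_in selects one row, so the hidden layer
    # is simply the average of the rows of the context words.
    h = [sum(W_in[i][d] for i in context_ids) / C for d in range(N)]
    # Score of every vocabulary word: u_j = sum_d h_d * W_out[d][j]
    u = [sum(h[d] * W_out[d][j] for d in range(N)) for j in range(V)]
    top = max(u)                              # subtract max for numerical stability
    exp_u = [math.exp(s - top) for s in u]
    z = sum(exp_u)
    return [e / z for e in exp_u]

def cross_entropy(probs, target_id):
    """Loss for one training example whose true center word is target_id."""
    return -math.log(probs[target_id])

# Hypothetical weights for V = 4 words and N = 2 dimensions
W_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
W_out = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
probs = cbow_forward([0, 2], W_in, W_out)
print(probs, cross_entropy(probs, target_id=1))
```

After training, each row of `W_in` would serve as the vector of the corresponding word.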
Next, we introduce the Skip-Gram model, as shown in Fig. 3.

Skip-Gram model structure.
Input layer: The input layer is the center word. The Skip-Gram model predicts the context from a given specific word, so the input is the one-hot code of that word, a V-dimensional vector, where V is the number of different words.
Hidden layer: The output of the hidden layer is the input vector $x$ multiplied by the matrix $W_{V \times N}$, that is:
$$h = W_{V \times N}^{\top}\, x$$
Output layer: Given the specific word, the output layer calculates the probability of each context word. The output of the output layer is a vector; after softmax processing, the top k context words with the highest probability are selected.
For the generation of training data for the Skip-Gram model, we use an example to illustrate how word vectors are obtained. Suppose there is a sentence “I am a Chinese student”. First, we select a specific word, “Chinese”, as the input word. Next, we define the size of the window, which specifies the number of context words taken on each side of the input word. In this article, we assume a window size of 2 (2 words on the left and 2 words on the right). The words contained in the window are then [am, a, Chinese, student], and three training pairs, [Chinese, a], [Chinese, am] and [Chinese, student], can be generated. The output probability of the model represents the probability that a context word appears together with the given input word in the training corpus. For example, if training samples like [Chinese, student] occur far more often in a corpus than [Chinese, am], then in the output, student will have a higher probability than am.
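The window-based generation of training pairs can be sketched as follows; the function name `skipgram_pairs` is an assumption for illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (input word, context word) pairs for every position."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # window is clipped at the sentence edges
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                          # the input word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "I am a Chinese student".split()
all_pairs = skipgram_pairs(sentence, window=2)
chinese_pairs = [p for p in all_pairs if p[0] == "Chinese"]
print(chinese_pairs)
```

For the input word “Chinese” this produces the same three pairs as the example, because only one word lies to its right in the sentence.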
In the medical data, the entities are related to one another: for example, a disease causes symptoms, and an examination confirms a disease, so a network is formed between different entities. If only word2vec is used to vectorize the graph, much of the information in the graph structure is ignored. Learning vector representations of the vertices of the network can extract its structural features. Currently, there are several ways to vectorize graphs, namely deepwalk, node2vec and struc2vec.
Deepwalk: The nodes in the network are treated as words, and random walks are used to obtain node sequences that represent the local characteristics of a node in the graph. A random walk simulates free movement such as that of molecules, for example the spread of a smell in the air or the diffusion of ink in water. A random walk on a graph starts from a given node and moves to one of its neighbor nodes at random. On reaching the neighbor, that node is used as the new starting point, and this step is repeated. If the degrees of the vertices follow a power law distribution, then the number of occurrences of nodes in the sequences also follows a power law distribution. The power law distribution refers to:
$$y = c\, r^{-\alpha}$$
where y is the frequency of occurrence of nodes with degree r, and c and α are positive constants. That is to say, in general, there are fewer nodes with large degrees and more nodes with small degrees. The sequence of nodes obtained by the walk also follows the power law distribution, and nodes with larger degrees are more likely to appear in the sequence.
We assume that the sequence obtained from the vertex $V_i$ is $W_{V_i}$, and in this sequence $W_{k+1}$ is obtained by a random step from $W_k$. After that, the sequences obtained by random walk are processed by the Skip-gram algorithm introduced above, and the node vectors are generated using the traditional word2vec method.
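Generating the walk sequences that are later fed to Skip-gram can be sketched as follows. The graph is given as adjacency lists; the function names, walk length and number of walks per node are assumptions for illustration.

```python
import random

def random_walk(graph, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph[walk[-1]]
        if not neighbors:                      # dead end: stop early
            break
        walk.append(rng.choice(neighbors))     # step to a random neighbor
    return walk

def build_corpus(graph, walks_per_node, length, seed=0):
    """Start several walks from every node; each walk acts as one 'sentence'."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)                     # visit start nodes in random order
        for node in nodes:
            corpus.append(random_walk(graph, node, length, rng))
    return corpus

# Hypothetical undirected graph as adjacency lists
graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
corpus = build_corpus(graph, walks_per_node=2, length=5)
print(corpus[:3])
```

Each walk can then be treated exactly like a sentence when training a Skip-gram model.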
Node2vec: node2vec is a further improvement of deepwalk, which mainly changes the method of random walk by using a biased random walk. The method combines depth-first and breadth-first search strategies. The breadth-first strategy mainly extracts local features, and the depth-first strategy can reflect the connection between more distant nodes. In the random walk of the algorithm, an original node $V_i$ is given, $W_{V_i}$ is a sequence starting from $V_i$, and $W_k$ is the k-th node in $W_{V_i}$. The node $W_{k+1}$ conforms to the following distribution:
$$P(W_{k+1} = x \mid W_k = v) = \begin{cases} \pi_{vx} / Z, & (v, x) \in E \\ 0, & \text{otherwise} \end{cases}$$
where $\pi_{vx}$ is the unnormalized transition probability from v to x, Z is the normalization constant and E is the edge set. If the transition probability is set directly to the edge weight, it is not possible to effectively learn the entire network structure and take more distant nodes into account. In this model, two parameters p and q adjust the random walk from one node to another:
$$\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}, \qquad \alpha_{pq}(t, x) = \begin{cases} 1/p, & d_{tx} = 0 \\ 1, & d_{tx} = 1 \\ 1/q, & d_{tx} = 2 \end{cases}$$
Here $w_{vx}$ is the weight of the edge between vertex v and vertex x, t is the node visited before v, and $d_{tx}$ is the shortest-path distance between t and x.
p controls the probability of revisiting the vertex just visited, and it only takes effect when $d_{tx} = 0$, which means that x is the node t that was visited immediately before v. If p is relatively high, the probability of returning to the node just visited becomes low; otherwise the probability becomes high.
q controls whether the walk moves inward or outward. When q > 1, the walk tends to stay close to the node t, which is similar to breadth-first search. When q < 1, there is a high probability that the walk moves away from t, which is similar to depth-first search.
The sequence $W_{V_i}$ generated by the biased random walk is input into the Skip-gram algorithm to generate the vector of the node.
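The effect of p and q on one step of the walk can be sketched as follows, with t the previous node, v the current node and x a candidate next node. Unit edge weights and the small graph are assumptions for illustration, and the first step, which has no previous node, is chosen uniformly as a simplification.

```python
import random

def alpha(p, q, t, x, graph):
    """Search bias for moving to x when the walk came from t."""
    if x == t:                  # distance 0 from t: step back to the previous node
        return 1.0 / p
    if x in graph[t]:           # distance 1 from t: stay in the neighborhood of t
        return 1.0
    return 1.0 / q              # distance 2 from t: move away from t

def transition_probs(graph, weights, t, v, p, q):
    """Normalized probabilities of stepping from v to each neighbor x."""
    pi = {x: alpha(p, q, t, x, graph) * weights[(v, x)] for x in graph[v]}
    z = sum(pi.values())
    return {x: value / z for x, value in pi.items()}

def biased_walk(graph, weights, start, length, p, q, rng):
    """Second-order random walk controlled by p and q."""
    walk = [start]
    while len(walk) < length and graph[walk[-1]]:
        v = walk[-1]
        if len(walk) == 1:
            walk.append(rng.choice(graph[v]))   # simplification for the first step
        else:
            probs = transition_probs(graph, weights, walk[-2], v, p, q)
            nodes = list(probs)
            walk.append(rng.choices(nodes, weights=[probs[x] for x in nodes])[0])
    return walk

# Hypothetical graph: x1 is a common neighbor of t and v, x2 is only adjacent to v
graph = {"t": ["v", "x1"], "v": ["t", "x1", "x2"], "x1": ["t", "v"], "x2": ["v"]}
weights = {(a, b): 1.0 for a in graph for b in graph[a]}
probs = transition_probs(graph, weights, t="t", v="v", p=2.0, q=0.5)
walk = biased_walk(graph, weights, "t", 6, p=2.0, q=0.5, rng=random.Random(0))
print(probs, walk)
```

With p = 2 and q = 0.5 the unnormalized weights for t, x1 and x2 are 0.5, 1 and 2, so the step that moves away from t is the most likely one.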
Struc2vec: It mainly converts each vertex into a vector according to its structural role in the graph. In this method, two non-adjacent nodes may also be similar, because structural similarity is not limited to neighbors: a distant node may have a similar local structure. The algorithm focuses on vectorizing the graph structure. To do so, it transforms the graph into a hierarchical structure, which is mainly completed by the following steps:
Step 1: The evaluation method of the structural similarity of two nodes needs to be determined. $R_k(u)$ represents the set of vertices at distance k from vertex u, and $s(S)$ represents the sequence obtained by sorting all nodes in a vertex set S, such as $R_k(u)$, according to their degrees. In addition, the distance between two sequences needs to be defined; we use $g(D_1, D_2)$ to denote the distance between two sequences $D_1$ and $D_2$. If $f_k(u, v)$ is the structural distance between node u and node v considering neighborhoods at a distance no greater than k, then $f_k(u, v)$ satisfies:
$$f_k(u, v) = f_{k-1}(u, v) + g\big(s(R_k(u)), s(R_k(v))\big), \qquad k \ge 0, \quad f_{-1}(u, v) = 0$$
Step 2: In this algorithm, we need to transform the structure of the graph into a hierarchical structure, so we need to build a multi-layer graph. Each k corresponds to a separate complete graph over all vertices. We assume that in the k-th layer, the weight of the edge between two vertices u and v is:
$$w_k(u, v) = e^{-f_k(u, v)}$$
The graphs of different layers are connected by directed edges between copies of the same vertex. That is to say, the vertex u in layer k is connected to the vertex u in layers k + 1 and k - 1. The weights of these edges are defined as follows:
$$w(u_k, u_{k+1}) = \log\big(\Gamma_k(u) + e\big), \qquad w(u_k, u_{k-1}) = 1$$
$\Gamma_k(u)$ refers to the number of edges incident to the vertex u in the k-th layer whose weight is greater than the average edge weight of that layer.
Step 3: A biased random walk is used to generate the context of each vertex. The context is generated from the above multi-layer graph. Through the random walk algorithm, a vertex sequence can be obtained by walking in the multi-layer graph, and this sequence is the context. In the sampling process, the first step is to decide whether to walk in the current layer or to move to another layer. If the walk stays in the current layer, the probability of walking from the vertex u to the vertex v in the k-th layer is:
$$p_k(u, v) = \frac{e^{-f_k(u, v)}}{Z_k(u)}, \qquad Z_k(u) = \sum_{v \ne u} e^{-f_k(u, v)}$$
The denominator $Z_k(u)$ is the normalization factor. If the walk does not continue on the current layer, the probability of choosing layer k + 1 or layer k - 1 is as follows:
$$p_k(u_k, u_{k+1}) = \frac{w(u_k, u_{k+1})}{w(u_k, u_{k+1}) + w(u_k, u_{k-1})}, \qquad p_k(u_k, u_{k-1}) = 1 - p_k(u_k, u_{k+1})$$
Step 4: After obtaining the context sequences through the random walk in the previous step, we turn them into vectors. The problem is thus transformed into a word2vec problem, and we use the Skip-gram algorithm to obtain the vector of each node.
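Steps 1 and 2 of the structural-similarity method above can be sketched as follows. The sketch assumes dynamic time warping with the cost max(a, b) / min(a, b) − 1 as the sequence distance g, assumes that every node has degree at least 1, and stops accumulating when one of the two rings is empty; the function names and the path graph are illustrative only.

```python
import math
from collections import deque

def rings(graph, u, max_k):
    """Nodes at hop distance exactly k from u, for k = 0..max_k (BFS)."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        if dist[v] == max_k:
            continue
        for x in graph[v]:
            if x not in dist:
                dist[x] = dist[v] + 1
                queue.append(x)
    return [[v for v, d in dist.items() if d == k] for k in range(max_k + 1)]

def degree_sequence(graph, nodes):
    """Degrees of the given nodes in ascending order."""
    return sorted(len(graph[v]) for v in nodes)

def dtw(a, b):
    """Dynamic time warping distance between two degree sequences."""
    d = [[math.inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = max(a[i - 1], b[j - 1]) / min(a[i - 1], b[j - 1]) - 1.0
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]

def structural_distance(graph, u, v, max_k):
    """Cumulative structural distances [f_0, f_1, ...] between u and v."""
    ru, rv = rings(graph, u, max_k), rings(graph, v, max_k)
    f, total = [], 0.0
    for k in range(max_k + 1):
        if not ru[k] or not rv[k]:     # assumption: stop when a ring is empty
            break
        total += dtw(degree_sequence(graph, ru[k]), degree_sequence(graph, rv[k]))
        f.append(total)
    return f

def layer_weight(f_k):
    """Edge weight between two vertices in layer k of the multi-layer graph."""
    return math.exp(-f_k)

# Hypothetical path graph a - b - c - d - e
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
print(structural_distance(graph, "a", "e", 2))   # the two end points of the path
print(structural_distance(graph, "a", "c", 1))   # an end point and the middle node
```

The two end points of the path are not adjacent, yet their distance is zero at every level, which illustrates how distant nodes can be structurally identical.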
In order to study the performance of the diagnostic evaluation model of English learning, a controlled experiment is designed. The model constructed in this paper mainly evaluates and judges the errors in students’ English learning. When a problem is detected, it is output and shown to the student. The model can provide students with a learning reference, help them find deficiencies in their learning in time, improve on their weaknesses, and correct ineffective learning methods.
In this study, a control group and an experimental group were set up, with 50 students in each. The students come from two different classes of the same grade in a university and are taught by the same English teacher, with consistent teaching methods and teaching hours. The control group does not use the model constructed in this article for assisted learning, while the experimental group does. After a semester of teaching, the final exam results are compared.
First of all, in order to confirm that the initial English levels of the control group and the experimental group are basically the same, the same test paper is used to assess both groups at the beginning of the semester. The resulting distribution of scores is shown in Table 1.
Statistical table of score distribution before the experiment
It can be seen from Table 1 and Fig. 4 that the experimental group and the control group in this study are basically the same in the distribution of scores before the experiment, and the number and scores of the experimental group and the control group in different score segments are basically the same. Therefore, the English levels of the two classes in this study before the experiment are basically the same.

Statistical diagram of the distribution of scores before the experiment.
The model proposed in this study is then applied as the intervention for the experimental group, and the final results are shown in Table 2 and Fig. 5.
Statistical table of score distribution after the experiment

Statistical diagram of the distribution of scores after the experiment.
It can be seen from the distribution of results after the experiment that the academic performance of the experimental group is clearly higher than that of the control group. The comparative analysis indicates that the English diagnostic evaluation model can improve learning efficiency and thereby improve English performance. The comparison diagrams of the English scores of the experimental group and the control group before and after the experiment are shown in Figs. 6 and 7.

Comparison of academic performance of students in the experimental group before and after the experiment.

Comparison of academic performance of students in the control group before and after the experiment.
It can be seen from Figs. 6 and 7 that the academic performance of the students in the experimental group increased to a certain extent after training with the model proposed in this study, while the performance of the control group after the experiment was lower than before the experiment. This supports the conclusion that the method proposed in this article has a certain effect on the improvement of English learning achievements.
With the proliferation of foreign language learners, automatic diagnosis of English grammatical errors (CGED) will be of great benefit. Grammar checking has been studied for many years and has made great progress. This article analyzes the theoretical basis and deficiencies of the traditional machine learning algorithms used to solve CGED tasks, proposes improved algorithm models, and studies the performance of the proposed diagnostic evaluation model of English learning through a controlled experiment. The model mainly evaluates and judges the errors in students’ English learning. When a problem is detected, it is output and shown to the student. The model can provide students with a learning reference, help them find deficiencies in their learning in time, improve on their weaknesses, and correct ineffective learning methods. The controlled experiment compares scores both between groups and over time. The comparison between the experimental group and the control group shows that the students in the experimental group performed better than those in the control group during the experimental period. The comparison of each group’s scores before and after the experiment shows that the method proposed in this article does have a certain effect on the improvement of English learning performance.
