Abstract
Task degree has become an important indicator of the intensity and quality of students’ English learning, and differences in task degree affect learning outcomes differently. To recognize task degree in English classroom teaching, this study draws on deep learning and on the actual conditions of English classroom teaching, and distinguishes individual persons through student positioning and feature recognition. The characteristics of English learning scoring are then used to judge each student’s learning state. A shallow convolutional neural network based on the TensorFlow architecture is designed for image recognition, and GPU acceleration is used to reduce the training time required for large data volumes. The task results are evaluated with a scoring method, and the performance of the algorithm is analyzed experimentally. By setting the categories of sensitive targets, the system can mark sensitive targets in the input scene image according to the target location perception results. The results show that the proposed method is effective to a certain degree.
Introduction
With the rapid development of higher education, its reform has continued to advance. To meet the demand of China’s social and economic development in the new era for versatile graduates with English ability, public English teaching has begun a new round of reform. The “Teaching Requirements” clearly state that, in order to improve students’ future professional competitiveness and sustainable development, public English courses should cultivate students’ ability to use English and help them develop an interest in English learning and the ability to learn independently while mastering effective learning strategies [1]. Cultivating students’ interest in English learning and improving their ability to learn English has thus become one of the important goals of public English teaching in higher education.
Studies have shown that college students generally lack interest in English learning and the ability to learn English independently, which has become a major obstacle to English teaching and learning in higher vocational education. There are many reasons for this. For example, the public English teaching mode in colleges and universities is still outdated and lacks effective teaching methods, and teaching remains teacher-centered. Because the students’ English foundation is weak, teachers need to spend a lot of time explaining basic high school English [2]. This tedious, inflexible and uninnovative teaching model has seriously dampened students’ enthusiasm and interest in learning. At present, vocational students are aware of the importance of English learning, and teachers are constantly trying to improve their teaching methods. However, because most vocational college students have a weak English foundation, little interest in learning, weak self-directed learning and few learning methods, the quality of public English teaching in higher vocational schools is still unsatisfactory. It is therefore urgent to improve the effectiveness of public English teaching and learning in higher vocational education, to cultivate students’ interest in English learning and to improve their ability to learn English independently [3].
For college students, learning is mainly independent, and teachers help students establish learning plans and learning goals through task-based teaching. In practice, task degree has become an important indicator of the intensity and quality of students’ English learning, and differences in task degree affect learning outcomes differently. On this basis, this study uses image recognition technology to analyze the English learning task degree from the perspective of deep learning and explores a method for recognizing students’ learning state, so that the teaching process can be controlled effectively.
Related work
Since the concept of deep learning was proposed, it has been highly valued by researchers and technology companies around the world. Deep learning is mainly applied to three fields: image recognition, speech recognition and natural language processing, which are also difficult problems for traditional machine learning. The following is a brief introduction to the research status of deep learning in these fields. Image recognition was the earliest application field of deep learning. Peng X et al. proposed the convolutional neural network (CNN) [4]. A CNN is usually composed of convolutional layers, downsampling layers and fully connected layers. For a long time afterwards, CNNs achieved good results on small-scale problems such as handwritten digit recognition but had little success in large-scale image processing. This changed in 2012, when Peng X used a deeper and more complex network structure to win the ILSVRC-2012 competition, a breakthrough for the research and application of deep learning in image recognition [5]. In 2014, Microsoft’s artificial intelligence project Adam challenged Google Brain. Microsoft reported that Adam outperformed Google Brain on the ImageNet 22K benchmark. Adam can identify the breed of a dog in a phone photo and whether an insect in a photo is poisonous. In China, the Baidu Deep Learning Institute has reached the top international level in image recognition. Taking face recognition as an example, Google’s recognition error on 6000 pairs of faces is 0.37%, while Baidu’s is only 0.16%. Baidu’s cross-age face recognition technology has been applied to finding lost children: by comparing the photos uploaded by the child and by the parents, it judges whether two people photographed at different ages are the same person.
The technology successfully identified its first lost child in April 2017 [6]. Speech recognition is another important research area for deep learning. Deep learning expert Hinton and Microsoft speech recognition expert Mahmood Z began collaborating in 2009, and in 2011 launched the first speech recognition system based on a deep neural network [7]. It reduced the error rate of speech recognition by more than a dozen percentage points, the biggest breakthrough in the field in more than a decade. With deep learning, the speech recognition model can fully explore and describe the correlations between sample features, and then combine the speech features of multiple consecutive frames into a high-dimensional feature for training the deep neural network. In November 2012, Microsoft publicly demonstrated a fully automatic simultaneous interpretation system based on deep learning at the “21st Century Computing Conference”. The speaker gave a speech in English, and the computer converted it into Chinese speech in a similar voice. The result was very fluent, with an error rate of only 7%. In addition, Google, a major player in artificial intelligence, has also begun to use deep neural networks to model speech [8], and at the Google I/O conference in 2012 it first released the Google Now voice assistant (later upgraded to Google Assistant). Google Now dominates the current smart assistants. At the same time, the Baidu Deep Learning Institute in China is also in a leading international position in speech recognition. In 2014 it developed Deep Speech, a deep learning speech recognition system that achieves nearly 81% recognition accuracy in noisy environments such as a restaurant, while the highest recognition rate among Bing, Google and wit.ai is only 65%. In addition, Deep Speech is 9 percentage points more accurate than the top academic speech recognition model [9].
Deep learning has made progress in image recognition and speech recognition, while research in natural language processing has been relatively slow. In 2003, Professor Nikitin M Y et al. [10] of the University of Montreal in Canada proposed training a language model with a neural network. While learning a distributed representation of each word, the model also learns the probability function of word sequences expressed in that representation. Experimental results on the Brown and AP News corpora show that the method outperformed the best N-gram model of the time. In 2011, Bashbaghi S et al. [11] proposed a new neural-network-based learning method that enriches the original information by embedding symbolic representations into a more flexible continuous vector space. This embedding method makes it easier to use data of various sizes in prediction and information retrieval, and it is equally applicable to WordNet and Freebase. Gu J [12] and others at Oxford University proposed a dynamic convolutional neural network (DCNN) for semantic modeling of sentences. The network uses dynamic k-max pooling, a global pooling operation over linear sequences, and achieves excellent performance on small-scale binary and multi-class sentiment prediction and on six-way classification. Moreover, it reduces the error rate by 25% relative to the baseline in Twitter sentiment prediction. Deep learning has not yet made far-reaching progress in natural language processing. However, language as a symbol system is generated and processed entirely by the human brain, and the artificial neural network is constructed by imitating the human brain. Deep learning therefore still has great research potential in natural language processing.
Theoretical analysis
Shallow learning and deep learning
From the 1980s to the present, in terms of the hierarchical structure of models, machine learning has gone through roughly two stages: shallow learning and deep learning. Among the machine learning methods based on statistical models, the most representative early example is the artificial neural network known as the multilayer perceptron (MLP), trained with the back propagation (BP) algorithm. Subsequently, shallow learning models such as the support vector machine (SVM), maximum entropy methods (such as logistic regression), principal component analysis (PCA) and k-means appeared [13]. Most of these early shallow structures have only one layer of hidden nodes or none at all.
(1) Limitations of shallow learning algorithms
Frank Rosenblatt first proposed a perceptron with a single layer of computing units in 1957. This shallow structure is easy to compute, but it cannot effectively solve nonlinear problems. Minsky later argued that the perceptron could not solve many basic problems, such as the binary exclusive OR (XOR) problem. As shown in Fig. 1, XOR is a typical nonlinear classification problem. In two-dimensional space, a single-layer perceptron cannot solve it, because the XOR classes cannot be separated by a linear classifier.

Binary XOR gate.
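The non-separability of XOR can be checked with a short sketch. It is illustrative only: the threshold unit, the search grid and the hand-chosen two-layer weights are assumptions made for this example and are not taken from the paper.

```python
from itertools import product

# Truth table of binary XOR.
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Threshold activation of a classic perceptron."""
    return 1 if z > 0 else 0

def perceptron(w1, w2, b, x):
    """Single-layer perceptron: one linear threshold unit."""
    return step(w1 * x[0] + w2 * x[1] + b)

def solves_xor(w1, w2, b):
    return all(perceptron(w1, w2, b, x) == y for x, y in XOR.items())

# Coarse grid search over weights and bias (assumed range -4.0 .. 4.0).
grid = [i / 2 for i in range(-8, 9)]
single_layer_hits = sum(solves_xor(w1, w2, b)
                        for w1, w2, b in product(grid, repeat=3))

def two_layer(x):
    """Two threshold units (OR, AND) feeding one output unit."""
    h_or = step(x[0] + x[1] - 0.5)
    h_and = step(x[0] + x[1] - 1.5)
    return step(h_or - h_and - 0.5)

print(single_layer_hits)                   # no single unit reproduces XOR
print([two_layer(x) for x in XOR])         # the two-layer network does
```

The grid search finds no single linear threshold unit that reproduces the truth table, while adding one hidden layer is enough, which is the motivation for the BP network introduced next.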
To solve the binary XOR problem in two-dimensional space, Rumelhart, McClelland and others proposed the BP neural network in 1986. The BP neural network is also a shallow learning model, containing only one hidden layer. Fig. 2 shows a two-layer BP neural network model [14].

BP network structure.
However, a two-layer BP network cannot effectively solve the XOR problem of multiple variables, and many complex functions cannot be represented by shallow structures. For example, a polynomial written as a product of sums can be evaluated with $O(mn)$ operations, but the same polynomial expanded into a sum of products requires $O(n^{m})$ operations [15]. If a shallow neural network is used to represent this polynomial, it may need an exponential number of parameters. In this case, a neural network with multiple hidden layers is needed.
With the advent of the era of big data, shallow learning structures struggle to fully exploit the intrinsic representation of massive data. Researchers have therefore turned to deep learning structures, hoping to find deep networks with strong expressive ability that can learn more useful information from big data.
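The gap between the two forms of the polynomial can be made concrete with a small sketch. The coefficient values and the choice of m = 4 sums with n = 3 terms each are assumptions for illustration.

```python
from itertools import product
from math import prod

def factored(rows):
    """Product of sums: about m*n additions and multiplications."""
    return prod(sum(row) for row in rows)

def expanded(rows):
    """Sum of products: one monomial per way of picking one term
    from each sum, so n**m monomials in total."""
    terms = [prod(choice) for choice in product(*rows)]
    return sum(terms), len(terms)

# m = 4 sums, each with n = 3 terms (illustrative values).
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [1, 1, 2]]
value, n_terms = expanded(rows)
print(factored(rows), value, n_terms)
```

Both forms give the same value, but the expanded form enumerates 3^4 = 81 monomials, and that count grows exponentially with the number of sums.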
(2) Purpose of deep learning exploration
Researchers have sound reasons for studying deep learning. Two broad ideas exist for representing knowledge. One is the top-down idea of the so-called “expert system”, defined by a large number of if-then rules [16]. The other is the artificial neural network, which is a bottom-up approach. In image processing, the data are complex (images undergo rotation, scale transformation, illumination change and so on) and large in number, so shallow structures cannot effectively solve these complex nonlinear problems. It is therefore necessary to find an effective computational model for representing complex functions [17]. As shown in Fig. 3, building a complex function with a shallow network structure requires more unknown parameters and more computation. With the hierarchical representation of a multi-layer network structure, however, the problem becomes simpler.

Complex functions with multiple layers of simple expression.
Deep learning does not have a strict formal definition. It is a general term for a family of multi-layer network models and is part of machine learning. Its basic feature is that it imitates the transmission and processing of information between neurons in the brain. As shown in Fig. 4, for a computational model to be classed as a neural network, it usually needs a large number of interconnected nodes with the following two characteristics. First, each node processes the weighted input values from neighboring nodes through a specific activation function. Second, weights define the strength of the information transmitted between nodes, and the algorithm adjusts these weights through continuous self-learning [18].

Neural network model.
The essence of deep learning is to express abstract data features by using multiple layers of nonlinear elements, or to express complex functions through layered nonlinear network structures. In image processing, it is necessary to learn the characteristics of an image from a large number of samples.
An autoencoder is a neural network that reconstructs its input as faithfully as possible, so it must extract the important features that represent the input and find the main components of the original data, much like principal component analysis. The algorithm proceeds as follows. First, unlabeled data are given and features are learned in an unsupervised manner. Second, the features generated by the encoder are fed to the next layer, and the network is trained layer by layer. Finally, the network is fine-tuned with supervised learning [19]. A classifier can then be added to classify the data, but the autoencoder has certain limitations in image classification. Because it uses a discriminant model, it is difficult to sample the input sample space, which makes it difficult for the network model to capture its internal representation.
Unlike traditional neural networks, which are discriminant models, the DBN (Deep Belief Network) is a probabilistic generative model. A generative model establishes a joint distribution over the observed data and the labels and evaluates both, while a discriminant model evaluates only the labels. However, applying the traditional back propagation algorithm to a DBN runs into the following problems: (1) a labeled data set must be provided for training, which adds a great deal of work; (2) the learning process is slow; (3) an inappropriate choice of parameters causes learning to converge to a local optimum.
Compared with traditional methods, convolutional neural networks have the advantages of strong generalization ability, simultaneous feature extraction and classification, and strong applicability. Therefore, it has become one of the research hotspots in the field of deep learning. This chapter will comprehensively analyze the convolutional neural network from the following aspects:
(1) Neural Networks
First, a neural network consisting of only one “neuron” is introduced, as shown in Fig. 5:

Simple neural network.
The corresponding formula is as follows:
$$h_{W,b}(x) = f\left(W^{T}x + b\right) = f\left(\sum_{i=1}^{3} W_i x_i + b\right)$$
In the formula, $x_1$, $x_2$, $x_3$ and the intercept term $+1$ are the inputs, the output is $h_{W,b}(x)$, and $f(\cdot)$ is the “activation function”. The sigmoid function and the hyperbolic tangent function (tanh) are commonly used as activation functions; their formulas are as follows:
$$f(z) = \frac{1}{1 + e^{-z}}, \qquad f(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$
The sigmoid function and the tanh function are shown in Figs. 6 and 7, respectively:

Sigmoid function.

Hyperbolic tangent function.
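The single-neuron mapping described above, a weighted sum of the inputs plus an intercept passed through an activation function, can be sketched in a few lines. The input, weight and bias values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w, b, f=sigmoid):
    """Output of one neuron: f(sum_i w_i * x_i + b)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

# Hypothetical inputs x1, x2, x3, weights and intercept weight b.
x = [0.5, -1.0, 2.0]
w = [0.1, 0.2, 0.3]
b = -0.4

out_sigmoid = neuron(x, w, b)              # value in (0, 1)
out_tanh = neuron(x, w, b, f=math.tanh)    # value in (-1, 1)
print(out_sigmoid, out_tanh)
```

The two activations are closely related: tanh(z) equals 2·sigmoid(2z) − 1, so tanh is a rescaled sigmoid centered at zero.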
It can be seen that the input-output mapping of a single “neuron” is essentially a logistic regression. A neural network model is formed by joining many such “neurons” in a hierarchical structure. Fig. 8 shows a neural network with one hidden layer.
The corresponding formula is as follows:
$$a^{(2)} = f\left(W^{(1)}x + b^{(1)}\right), \qquad h_{W,b}(x) = f\left(W^{(2)}a^{(2)} + b^{(2)}\right)$$
Here $a^{(2)}$ denotes the activations of the hidden layer, and $W^{(l)}$ and $b^{(l)}$ are the weights and intercepts of layer $l$.
Similarly, the network can be extended to 2, 3, 4, 5, ... hidden layers. Training is similar to that of logistic regression. However, because the network has multiple layers, the chain rule must be used to compute the derivatives with respect to the hidden-layer nodes; this is the back propagation algorithm.
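As a minimal sketch of the chain-rule computation, the following code runs a forward pass through a network with one hidden layer and back-propagates a squared error for a single sample. The layer sizes, the weight values and the squared-error loss are assumptions for illustration; the paper does not specify them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Hidden activations a and scalar output h."""
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    h = sigmoid(sum(w * ai for w, ai in zip(W2, a)) + b2)
    return a, h

def loss(x, y, W1, b1, W2, b2):
    """Squared error for one sample (assumed loss)."""
    _, h = forward(x, W1, b1, W2, b2)
    return 0.5 * (h - y) ** 2

def backprop(x, y, W1, b1, W2, b2):
    """Gradients of the loss with respect to W1 and W2 via the chain rule."""
    a, h = forward(x, W1, b1, W2, b2)
    delta_out = (h - y) * h * (1 - h)            # dL/dz at the output
    grad_W2 = [delta_out * ai for ai in a]
    # Propagate the output error back through each hidden unit.
    delta_hid = [delta_out * w * ai * (1 - ai) for w, ai in zip(W2, a)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hid]
    return grad_W1, grad_W2

# Hypothetical sample and weights: 3 inputs, 2 hidden units, 1 output.
x, y = [1.0, 0.5, -1.5], 1.0
W1 = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
b1 = [0.0, 0.1]
W2 = [0.5, -0.6]
b2 = 0.2
grad_W1, grad_W2 = backprop(x, y, W1, b1, W2, b2)
```

A finite-difference check on any single weight is a simple way to confirm that the analytic gradient is correct.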
(2) Local perception
In image processing, an image is usually represented as a vector of pixels; a 1000 × 1000 image, for example, can be represented as a vector of length 1000000. If the hidden layer has the same number of units as the input layer, that is, 1000000, then the number of parameters between the input layer and the hidden layer is $10^{12}$, which makes the network very difficult to train. The number of parameters must therefore be reduced to speed up training.
Convolutional neural networks reduce the number of parameters in two ways. The first is local perception. In an image, nearby pixels are closely correlated, while distant pixels are only weakly correlated. People’s perception of the outside world also generally proceeds from local to global. Each neuron therefore only needs to perceive a local region of the image rather than the whole image, and the local information can be integrated at a higher layer to obtain global information. This is shown in Fig. 9, where Figure (a) is a fully connected network and Figure (b) is a locally connected network.

Neural network with a layer of hidden layer.

Full link and local link neural network.
In Figure (b), if each neuron is connected to only a 10 × 10 pixel region, there are 1000000 × 100 = $10^{8}$ weights, which is one ten-thousandth of the number in the fully connected case.
(3) Parameter weight sharing
The second method of reducing the number of parameters is weight sharing. After local connection the number of parameters may still be too large. In the local connection above, each of the 1000000 neurons has 100 parameters. If all 1000000 neurons share the same 100 parameters, only 100 parameters are needed.
Features are extracted by convolving the image with these 100 parameters. Feature extraction of this kind is independent of position in the image; that is, the statistics of the various parts of the image are assumed to be the same, so features learned on one part of the image can also be used on another part. For example, a small block, such as 8 × 8, is randomly selected from a large image as a sample, and the features learned from this 8 × 8 sample are applied as detectors at any position of the image. In particular, each detector can be convolved with the original large image to obtain the activation value of that feature at every position.
Fig. 10 shows the convolution of a 3 × 3 convolution kernel with an image of 5 × 5 pixels. Each convolution is a feature extraction step that picks out the parts of the image that match the kernel.
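The three parameter counts discussed so far (full connection, local connection and weight sharing) can be reproduced with simple arithmetic, using the 1000 × 1000 image and 10 × 10 receptive field from the text:

```python
pixels = 1000 * 1000          # input vector length for a 1000 x 1000 image
hidden_units = 1_000_000      # hidden layer as large as the input layer
receptive_field = 10 * 10     # each neuron sees a 10 x 10 patch

full_connection = pixels * hidden_units             # every unit sees every pixel
local_connection = hidden_units * receptive_field   # every unit sees one patch
weight_sharing = receptive_field                    # all units share one patch of weights

print(full_connection, local_connection, weight_sharing)
print(full_connection // local_connection)          # reduction factor of local connection
```

Local connection reduces the count by a factor of 10,000, and weight sharing then reduces it from 10^8 to 100 regardless of the image size.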

Image convolution process.
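A minimal sketch of this sliding-window operation is given below. The 5 × 5 image and 3 × 3 kernel values are illustrative and are not taken from Fig. 10; as is usual in CNNs, the kernel is applied without flipping (cross-correlation) and without padding.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise
    products at each position (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

# Illustrative 5 x 5 binary image and 3 x 3 kernel.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 3 × 3 kernel on a 5 × 5 image produces a 3 × 3 feature map, and the same nine weights are reused at every position.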
(4) Multiple convolution kernels
When there are only 100 parameters as described above, there is only one 10 × 10 convolution kernel. The features extracted in this way are not sufficient, so multiple convolution kernels can be added to learn more features. As shown in Fig. 11, each color represents a different convolution kernel, and each convolution kernel maps the image to a new image. For example, two convolution kernels generate two images, which can be viewed as different channels of one image, as shown in Fig. 12:

Multi-convolution kernel neural network.

Four-channel convolution process.
Fig. 12 shows the convolution operation on four channels. Since there are two convolution kernels, two output channels are generated. Note that each convolution kernel has a separate set of weights for each of the four input channels. Ignoring w2 and considering only w1, the convolution results at (i, j) on the four channels are added and passed through the activation function to give the output value of w1 at (i, j).
Therefore, when 2 channels are obtained by convolving 4 channels, the number of parameters is 4 × 2 × 2 × 2. Here 4 is the number of input channels, the first 2 is the number of output channels, and 2 × 2 is the size of the convolution kernel.
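The multi-channel case can be sketched as follows. The constant channel values, the kernel weights and the use of tanh as the activation are assumptions for illustration; only the 4-input, 2-output, 2 × 2 layout comes from the text.

```python
import math

def conv_at(channels, kernel, i, j):
    """Response of one kernel at (i, j): the window products are
    summed over all input channels."""
    k = len(kernel[0])
    return sum(channels[c][i + di][j + dj] * kernel[c][di][dj]
               for c in range(len(channels))
               for di in range(k) for dj in range(k))

def multi_channel_conv(channels, kernels, activation=math.tanh):
    """One output channel per kernel; each kernel holds one k x k
    weight slice per input channel."""
    k = len(kernels[0][0])
    out_h = len(channels[0]) - k + 1
    out_w = len(channels[0][0]) - k + 1
    return [[[activation(conv_at(channels, kern, i, j))
              for j in range(out_w)]
             for i in range(out_h)]
            for kern in kernels]

def n_params(in_channels, out_channels, k):
    """Number of weights, ignoring biases."""
    return in_channels * out_channels * k * k

# Four 3 x 3 input channels (channel c is filled with the value c + 1).
channels = [[[c + 1] * 3 for _ in range(3)] for c in range(4)]
# Two kernels w1 and w2, each with four 2 x 2 slices (hypothetical weights).
w1 = [[[0.1, 0.1], [0.1, 0.1]] for _ in range(4)]
w2 = [[[-0.1, 0.0], [0.0, 0.1]] for _ in range(4)]

output = multi_channel_conv(channels, [w1, w2])
print(len(output), n_params(4, 2, 2))
```

The output has two channels, and the weight count 4 × 2 × 2 × 2 = 32 matches the figure given in the text.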
(5) Pooling
It is very difficult for a classifier to learn from a very large number of input features, and it is prone to over-fitting. To address this, features at different positions can be aggregated, for example by computing the average or maximum value of a feature over a region of the image. These aggregated statistics have much lower dimension and are less prone to over-fitting. This aggregation operation is called pooling.
(6) Multi-layer convolution
The features learned by a single convolutional layer are generally local. In practical applications, multiple convolutional layers are usually needed; the more convolutional layers there are, the more global the learned features become. In 2010, Alex’s CNN model (ImageNet-2010) won the championship in the ImageNet LSVRC image classification competition. The model uses a two-GPU model-parallel structure in its convolutional layers, with the model parameters divided into two parts. It splits the parameters of several layers across the GPUs, trains them on the same data, and feeds the results directly to the next layer. This paper improves on Alex’s CNN structure to obtain better classification results. The improved CNN structure is shown in Fig. 13:
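A minimal sketch of max pooling and average pooling over non-overlapping regions is shown below. The 4 × 4 feature map and the 2 × 2 pooling size are illustrative assumptions.

```python
def pool2d(feature_map, size, reduce=max):
    """Aggregate each non-overlapping size x size block with `reduce`."""
    return [[reduce([feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)])
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

def mean(values):
    return sum(values) / len(values)

# Illustrative 4 x 4 feature map.
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

max_pooled = pool2d(fmap, 2)          # [[6, 4], [8, 9]]
avg_pooled = pool2d(fmap, 2, mean)    # [[3.75, 2.25], [4.5, 4.0]]
print(max_pooled, avg_pooled)
```

Either choice reduces the 16 input values to 4, which is the dimensionality reduction the text refers to.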

CNN network structure.
The parameters of the improved structure are largely the same as those of ImageNet-2010, except that there is only one fully connected layer at the end, followed by a softmax layer. The fully connected layer is used to represent an image; the outputs of the fourth convolutional layer and of the third max pooling layer are both fed into it, so that local and global features can be learned simultaneously. This allows the structure to learn global and local features together, which improves classification accuracy and speeds up training.
Taking advantage of TensorFlow’s open-source availability and its support for convolutional neural network learning, this study designs a shallow convolutional neural network for image recognition based on the TensorFlow architecture and uses GPU acceleration to reduce the training time required for large data volumes. The network is designed with two main purposes. The first is to establish a standardized network structure and then train and evaluate it. The second is to provide the basis for building a larger and more complex model later in the paper.
Cifar-10 is a database for general object recognition collected by two of Hinton’s students, Alex Krizhevsky and Ilya Sutskever. The database contains 60,000 color images in 10 classes, divided into 50,000 training images and 10,000 test images. Classifying the cifar-10 data set is an important benchmark problem in machine learning; the task is to classify a set of 32 × 32 color images. The images cover 10 categories: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships and trucks, as shown in Fig. 14:

Cifar-10 image class.
Cifar-10 was chosen because it is complex enough to test the image recognition and classification capabilities of the open-source TensorFlow architecture, and the results can be extended to larger data sets. At the same time, because the data set is small, training is fast, which makes it well suited to testing new algorithms and techniques.
The model designed in this study is a multi-layer structure in which convolutional layers and nonlinear layers alternate several times. These layers are finally connected to the softmax classifier through fully connected layers, as shown in Fig. 15. The convolutional neural network implemented on the TensorFlow platform is trained on the CIFAR-10 dataset. After more than an hour of training on a GPU, the model achieved an accuracy of up to 85%. The model contains 1,068,298 learnable parameters and requires approximately 19.5 M multiplication operations to classify an image. The training process is described in detail below.

Network model.
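The paper does not list the layer sizes behind the stated 1,068,298 parameters. The sketch below shows one configuration that reproduces this total; the layer shapes (two 5 × 5 convolutions with 64 filters, a 6 × 6 × 64 feature map after pooling, fully connected layers of 384 and 192 units, and a 10-way softmax) are assumptions chosen because they match the count, not a confirmed description of the network in Fig. 15.

```python
def conv_params(k, c_in, c_out):
    """Weights plus one bias per output channel for a k x k convolution."""
    return k * k * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_in * n_out + n_out

# Assumed layer shapes (hypothetical, see the note above).
layers = {
    "conv1": conv_params(5, 3, 64),
    "conv2": conv_params(5, 64, 64),
    "fc3": dense_params(6 * 6 * 64, 384),
    "fc4": dense_params(384, 192),
    "softmax": dense_params(192, 10),
}

total = sum(layers.values())
for name, count in layers.items():
    print(f"{name:8s} {count:>9,d}")
print(f"{'total':8s} {total:>9,d}")
```

Under these assumptions the first fully connected layer accounts for most of the parameters, which is typical of small convolutional networks of this kind.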
The algorithm flow of the training phase of the network model is as follows:
Input: training set; error threshold ε; maximum number of iterations num
Output: network weights
Steps:
Initialization: network weights W ← N(0, 1); offset value b ← const
Step 1. Do
Step 2. Randomly select a batch from the training set and input it into the convolutional neural network.
Step 3. Forward-propagate the training samples, with GPU acceleration, and obtain the network output by layer-by-layer calculation.
Step 4. If the classification error is less than the error threshold ε or the number of iterations equals the maximum number of iterations num,
Step 5. break;
Step 6. else
Step 7. calculate the error, back-propagate it, and update the network weights.
Step 8. Until all batches have been trained.
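The control flow of the training phase can be sketched as follows. To keep the example self-contained, a small logistic-regression model stands in for the convolutional network and no GPU is involved; the toy data, learning rate and batch size are assumptions for illustration only.

```python
import math
import random

def train(train_set, eps, num, batch_size=4, lr=0.5, seed=0):
    """Mini-batch training loop following the steps above.
    A logistic-regression model stands in for the CNN (assumption)."""
    rng = random.Random(seed)
    dim = len(train_set[0][0])
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]   # W <- N(0, 1)
    b = 0.1                                         # b <- const
    for it in range(1, num + 1):                    # at most num iterations
        batch = rng.sample(train_set, batch_size)   # Step 2: random batch
        # Step 3: forward propagation.
        preds = [1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
                 for x, _ in batch]
        # Step 4: classification error on the batch.
        error = sum((p > 0.5) != bool(y) for p, (_, y) in zip(preds, batch)) / batch_size
        if error < eps:
            break                                   # Step 5
        # Step 7: back-propagate the error and update the weights.
        for k in range(dim):
            w[k] -= lr * sum((p - y) * x[k] for p, (x, y) in zip(preds, batch)) / batch_size
        b -= lr * sum(p - y for p, (_, y) in zip(preds, batch)) / batch_size
    return w, b, it

# Toy, linearly separable data: ((x1, x2), label).
train_set = [((1.0, 1.0), 1), ((2.0, 0.5), 1), ((0.5, 2.0), 1), ((1.5, 1.5), 1),
             ((-1.0, -1.0), 0), ((-2.0, -0.5), 0), ((-0.5, -2.0), 0), ((-1.5, -1.5), 0)]

w, b, iterations = train(train_set, eps=0.01, num=200)
print(iterations)
```

Note that the stopping test uses the error of the current batch only, as in the listed steps; a held-out validation error would be a more robust criterion in practice.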
The algorithm flow in its test phase is as follows:
Input: test set
Output: classification result
Step:
Initialization: network weights W, b ← trained network values
Step 1. Do
Step 2. Randomly select a batch from the test set and input it into the convolutional neural network.
Step 3. Forward-propagate the test samples, with GPU acceleration, and obtain the network output by layer-by-layer calculation.
Step 4. Compare the labels with the classification results and count the correct classifications.
Step 5. Until all batches have been tested.
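The test phase amounts to a batched accuracy count, sketched below. The `predict` function is a hypothetical stand-in for the trained network, the test samples are made up for illustration, and batches are taken in order rather than at random so that each sample is counted exactly once.

```python
def evaluate(predict, test_set, batch_size):
    """Run the model batch by batch and return the fraction of
    samples whose predicted class matches the label."""
    correct = 0
    for start in range(0, len(test_set), batch_size):
        batch = test_set[start:start + batch_size]
        outputs = [predict(x) for x, _ in batch]          # forward pass
        correct += sum(out == label
                       for out, (_, label) in zip(outputs, batch))
    return correct / len(test_set)

def predict(x):
    """Hypothetical trained classifier: class 1 if x1 + x2 > 0."""
    return int(x[0] + x[1] > 0)

# Made-up test samples: ((x1, x2), label); the last label disagrees
# with the classifier on purpose.
test_set = [((1, 1), 1), ((-1, -1), 0), ((2, -1), 1), ((-2, 1), 0), ((0.5, 0.5), 0)]

accuracy = evaluate(predict, test_set, batch_size=2)
print(accuracy)
```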
The English learning task degree perception system in this paper mainly analyzes image data. Two kinds of images are used: visible-light color images and radar grayscale images. Through perceptual processing and statistical analysis of the targets in the input image data, the system provides two main functions. The first is to analyze the target objects and densely distributed targets in the scene according to their perceived categories and positions in images of English teaching and student learning, and to give a textual description of the targets in the scene. The second is to label sensitive targets, that is, targets of interest set by the researcher, based on the perception results.
For the English learning task perception system, the more types of targets it can perceive, the more robust it is. In some scenarios the system is required to perceive the whole scene, identifying and locating all the targets in it. In others it does not need to analyze every target and only needs to identify and label the sensitive categories that the researchers are interested in. By setting the categories of sensitive targets, the system can mark sensitive targets in the input scene image according to the target location perception results.
Fig. 16 shows an example of an English class. Images or videos of this kind are used as input to the algorithm constructed in this study. The image is first preprocessed, and the result is shown in Fig. 17.
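The two functions described above, a textual summary of all perceived targets and the marking of sensitive categories only, can be sketched as a filter over detection results. The `Detection` structure, the category names, the boxes and the score threshold are all hypothetical; the paper does not describe its data format.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One perceived target (hypothetical structure)."""
    label: str
    box: tuple      # (x, y, width, height) in pixels
    score: float    # detector confidence

def mark_sensitive(detections, sensitive, min_score=0.5):
    """Keep only confident detections whose category is sensitive."""
    return [d for d in detections
            if d.label in sensitive and d.score >= min_score]

def describe(detections):
    """Short textual description of the targets in the scene."""
    counts = {}
    for d in detections:
        counts[d.label] = counts.get(d.label, 0) + 1
    return ", ".join(f"{n} x {label}" for label, n in sorted(counts.items()))

# Made-up perception results for one classroom image.
detections = [
    Detection("student", (40, 60, 30, 50), 0.92),
    Detection("student", (120, 58, 28, 52), 0.81),
    Detection("teacher", (300, 20, 40, 90), 0.88),
    Detection("desk", (35, 110, 60, 30), 0.40),
]

marked = mark_sensitive(detections, sensitive={"student"})
print(describe(detections))
print([d.box for d in marked])
```

Changing the `sensitive` set switches the system between full-scene perception and labeling only the categories of interest.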

English learning class.

Character background separation.
As shown in Fig. 17, all the people in the English classroom are separated from the background. On this basis, each student’s characteristics are determined from the English learning state, English learning scores are obtained through facial expression recognition and motion recognition, and the task degree results are derived. Feature recognition requires contour segmentation, which divides the overall image into images of individual persons; the result is shown in Fig. 18.

Character outline recognition.
On this basis, the recognition score is combined with the final test score of the student, and the results are shown in Table 1. The closer the correlation in Table 1 is to 1, the higher the accuracy of the system.
Statistics of system test results
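The paper does not state which correlation measure is used. Assuming the Pearson correlation coefficient, the comparison between recognition scores and final test scores could be computed as in the sketch below. The score values are placeholders for illustration and are not the data of Table 1.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Placeholder scores (not from the paper).
recognition_scores = [62, 70, 75, 81, 90]
exam_scores = [60, 72, 74, 85, 88]

r = pearson(recognition_scores, exam_scores)
print(round(r, 3))
```

A value close to 1 indicates that students ranked highly by the recognition score also tend to score highly on the final test.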
Correlation and scores were plotted as statistical charts for comparative analysis, and the results are shown in Fig. 19.

Comparative analysis of correlation and academic performance.
It can be seen from Table 1 and Fig. 19 that the scores are basically consistent with the test scores of the students. Therefore, the algorithm of this study can effectively identify the students’ English learning tasks and has certain practical effects.
From the perspective of deep learning, this study uses image recognition technology to analyze the English learning task degree and explores a method for recognizing students’ learning state, so that the teaching process can be controlled effectively. A shallow convolutional neural network based on the TensorFlow architecture is designed for image recognition, and GPU acceleration is used to reduce the training time required for large data volumes. The model is a multi-layer structure in which convolutional and nonlinear layers alternate several times, and these layers are finally connected to the softmax classifier through fully connected layers. Through continued training, neural networks with large numbers of units and connections enable computers to recognize natural images, video and speech with potentially complex structure, much as humans do. Through perceptual processing and statistical analysis of the targets in the input image data, the English learning task perception system realizes two main functions: person positioning and scoring. By setting the categories of sensitive targets, the system can mark sensitive targets in the input scene image according to the target location perception results. The results show that the proposed method is effective to a certain degree.
Acknowledgments
This paper was supported by Special Research Projects on Teaching Reform of Aba Teachers’ University in 2019, The Reform of English Lexicology Teaching Model from the Perspective of Multimodality, No: 201901014.
