Abstract
At present, applying image recognition technology to English teaching is a teaching innovation that meets the needs of the times. Therefore, based on machine learning neural networks and image super-resolution, this study conducts an innovative analysis of the English teaching mode. Taking the current state of the English classroom as its starting point, this paper uses classroom characteristics as the basis for teaching innovation and constructs a feature recognition model suited to current English teaching. Moreover, this paper reconstructs an initial high-resolution image from the low-resolution image by the sparse representation method, and then establishes a mixed-sample ridge regression model to re-estimate the high-frequency components of the initial high-resolution image, so that the various behavioral characteristics of students in the English classroom can be recognized. In addition, this paper builds a verification test. The results show that the proposed algorithm has a certain effect and can provide a theoretical reference for subsequent related research.
Introduction
Strengthening foreign language teaching and improving its level and quality have become a top priority in the education reform of all countries, and the cultivation of international talents has become a major national policy of the 21st century. An international talent, who combines professional skills, cross-cultural communication, and a foreign language, is what national development requires in the era of globalization and informatization [1]. With the continuous deepening of the comprehensive reform of higher education, connotative development and quality improvement have become important tasks of higher education. The development of higher education has multiple characteristics. On the one hand, it prepares young people who are about to enter society so that they can adapt as soon as they start work. At the same time, it gives individuals a long-term ability to apply English, improves their employment opportunities, and meets social needs. On the other hand, higher education teaches students broader knowledge, guides them patiently step by step, and opens the door to innovation for them [2].
The English curriculum of colleges and universities has always been a public foundation course. As a means of improving the English level and ability of college students, English education has become increasingly prominent in college teaching. The globalized economy has transformed China’s purely domestic market into a global market, and the amount of foreign capital and the number of well-known multinational companies in China are also increasing [3]. Therefore, the market demand for English talents is not limited to the English language itself but extends to English-proficient talents in various professions. This requires colleges and universities to change the way they teach English, so that students become more familiar with professional English and strengthen their English communication and application skills [4]. Improving the effectiveness of public English classroom teaching in colleges and universities can therefore help college students build strong enthusiasm for English learning and, at the same time, raise the level of public English teaching in Chinese universities, thus enhancing the practicality and effectiveness of English.
With the continuous improvement of urban construction at home and abroad, cameras are becoming more and more widespread. Moreover, concepts such as the safe city, intelligent transportation, the smart home, and intelligent education have been proposed. In this context, video image processing technology has also made great progress. The main task of video image processing is to understand and analyze the information in the acquired video images [5]. In this field, the detection and recognition of objects in a video scene draw on knowledge from multiple disciplines such as image processing and pattern recognition. The recognition of human behavior is a major research direction in video-based image processing. As more and more scholars work on it, a large number of algorithms have been proposed and applied to human behavioral motion recognition. The main purpose of human behavior recognition is to identify and analyze human behavioral motion information in video sequence images. Generally speaking, human behavioral motion recognition is the classification of human actions in video scenes [6]. Based on this, applying image recognition technology to English teaching is a teaching innovation that meets the needs of the times. Therefore, based on machine learning neural networks and image super-resolution, this study conducts an innovative analysis of the English teaching mode.
Related work
The Video Surveillance and Monitoring (VSAM) project, established by the US Defense Advanced Research Projects Agency, led by Carnegie Mellon University, and joined by many American universities and research institutes, mainly aims to locate and segment the human body. On this basis, it can track and detect the human body, which can be used for scene analysis and understanding in battlefield and civil video surveillance [7]. The Chinese Academy of Sciences has carried out relatively deep and mature research on behavioral motion recognition. Its researchers applied their results to the daily training of the national diving, gymnastics, and trampoline teams, guiding and correcting athletes’ movements through a behavioral motion recognition and analysis system, and achieved good results. Their research mainly concerns the positioning, tracking, identification, and comprehensive analysis of moving targets in video captured by various video equipment [8]. The main purpose of the research is to intelligently analyze, record, and warn of irregular behaviors and abnormal actions against basic behavioral standards in daily life and work, so that timely safety measures can be taken. Behavioral motion recognition based on video images involves three key steps: foreground moving target detection, behavioral action feature extraction, and behavioral action classification and recognition. Moving target detection finds the position of the moving target in the video sequence images. Behavioral motion feature extraction transforms the images in a video sequence and then obtains features that can describe the behavioral actions in the sequence. Behavioral action classification and recognition identifies and classifies behavioral actions based on the image features acquired from the video sequence.
Detecting and locating the moving target area in the video sequence images is the first step of behavioral motion recognition. Moreover, whether the moving target area can be located and detected successfully directly affects the performance of the behavioral motion recognition system. At present, there are five mainstream moving target detection algorithms: the background difference method, the frame difference method, the codebook method, the Gaussian mixture model, and the ViBe method. Moving object detection based on video sequences is, in essence, image foreground detection. Du Z et al. used maximum likelihood estimation to detect and segment moving objects in the image. Stauffer et al. used a background update algorithm based on a Gaussian mixture model to model the pixels in the image; in actual use, the background model is continuously updated to obtain the foreground target [9]. Lotz J M et al. obtained the moving target from the difference of three video frames [10]. Mhlolo M K et al. combined the inter-frame difference method and the background subtraction method to eliminate the influence of varying illumination, thus achieving illumination adaptation [11]. As more and more moving target detection algorithms are introduced, the most suitable algorithm for an actual application can be selected according to the advantages, disadvantages, and applicable scenarios of each. The extraction of human behavioral features from video sequence images is a key step in behavioral motion recognition, and the quality of the behavioral features directly affects the final classification and recognition results. Because of this critical role, human behavioral feature extraction has been a core topic of behavior recognition research over the past four decades.
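As a rough illustration of the frame difference idea mentioned above, the following Python sketch marks a pixel as foreground when the middle frame differs from both of its neighbors. It is a minimal sketch, not the implementation of any cited work; the function name `three_frame_difference` and the threshold value are assumptions made for illustration.

```python
import numpy as np

def three_frame_difference(prev, curr, nxt, threshold=25):
    """Boolean foreground mask from three consecutive grayscale frames.

    A pixel is kept only if the middle frame differs from BOTH neighbors
    by more than `threshold` (an assumed, hand-picked value).
    """
    prev, curr, nxt = (f.astype(np.int32) for f in (prev, curr, nxt))
    diff_back = np.abs(curr - prev) > threshold
    diff_forward = np.abs(nxt - curr) > threshold
    return diff_back & diff_forward

# Toy sequence: a bright 2x2 block moving two pixels to the right per frame.
frames = np.zeros((3, 6, 8), dtype=np.uint8)
for t in range(3):
    frames[t, 2:4, 1 + 2 * t:3 + 2 * t] = 200

mask = three_frame_difference(frames[0], frames[1], frames[2])
```

In this toy sequence the mask coincides with the position of the block in the middle frame. With slower motion the two difference images overlap less, which is one known weakness of plain frame differencing.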
In the human behavior feature extraction stage, local features are the most widely used and achieve comparable results, compared with global features such as optical flow. There are two widely used local image representations: space-time interest points and local descriptors. The space-time interest point detectors commonly used in behavioral motion recognition include [12] the Harris3D detector, the Cuboid detector, the convexity-based detector, and the Hessian detector. Unlike space-time interest points, a local descriptor is a feature obtained by first dividing the image into individual blocks and then pairing them. Moreover, it represents the image through local feature description, thereby eliminating background interference and providing invariance to translation, rotation, and scaling. Caballero D et al. first proposed space-time interest points (STIPs) [13]. In their algorithm, the Harris3D detector is used to detect corners in three-dimensional space-time, yielding a series of feature points in the image. Homer R et al. proposed using the Cuboid detector to detect space-time interest points. In the video sequence, it applies Gaussian filters in both the temporal and the spatial dimension, and then adjusts the local minimum by changing the space-time scale in the neighborhood of the interest point, thereby adjusting the number of space-time interest points [14]. Forrest J et al. studied convexity-based detection [15]. Wlemsb et al. proposed a scale-invariant method for detecting space-time interest points using the Hessian detector. The number of feature points obtained by space-time interest point detection is generally small, which limits how well they represent behavioral actions.
For local descriptors [16], Dalal et al. first proposed the HOG descriptor. The method computes histograms of gradient directions over local regions of the image and then combines the features of all local regions to form the feature of the whole image [17]. Another study uses histogram of oriented gradients (HOG) feature descriptors to represent behavioral actions in an image, although the resulting feature is relatively large [18]. Juneja et al. proposed the HOG3D descriptor by extending the two-dimensional HOG feature into three-dimensional space-time, using a regular polyhedron to quantize the orientation of the space-time gradient [19].
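The gradient direction histogram described above can be illustrated with a simplified Python sketch. It omits the block normalization and interpolation of the full HOG descriptor, and the names `cell_orientation_histogram` and `hog_like_descriptor`, the cell size, and the bin count are assumptions chosen for illustration only.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient directions in one cell."""
    gy, gx = np.gradient(cell.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0      # unsigned, in [0, 180)
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def hog_like_descriptor(image, cell_size=8, n_bins=9):
    """Concatenate per-cell histograms and L2-normalize the result."""
    h, w = image.shape
    hists = [
        cell_orientation_histogram(image[r:r + cell_size, c:c + cell_size], n_bins)
        for r in range(0, h - cell_size + 1, cell_size)
        for c in range(0, w - cell_size + 1, cell_size)
    ]
    vec = np.concatenate(hists)
    return vec / (np.linalg.norm(vec) + 1e-12)

# A horizontal intensity ramp has a purely horizontal gradient,
# so all of the energy falls into the first orientation bin of each cell.
ramp = np.tile(np.arange(16.0), (16, 1))
descriptor = hog_like_descriptor(ramp)
```

The descriptor length grows with the number of cells times the number of bins, which is consistent with the remark above that such features are relatively large.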
Image pyramid based on image block redundancy
Assuming I_0 is a low-resolution input image, the different layers of the image pyramid can be constructed according to formula (1).
In the formula, I_{i-1} represents the image of the (i−1)-th layer in the image pyramid, I_i represents the image of the i-th layer, * represents the convolution operation, H_i represents the blur kernel, and D_σ is a down-sampling operation with a scaling factor of σ. As shown in Fig. 1, the images I_i (i = −1, −2, ⋯, −a) in the lower layers are created first, where a is determined by the desired magnification, and then the images of the upper layers are constructed using the similarity of image blocks within and across scales. In the figure, I_1 and I_2 represent high-resolution layers in the image pyramid, and I_{-1} and I_{-2} represent low-resolution layers. P_s represents a source image block in the low-resolution input image. Image blocks P_1 and P_2 similar to P_s can be found by comparing Euclidean distances, and they correspond to the image regions R_1 and R_2, respectively. Therefore, two factors determine the regions D_1 and D_2 in the high-resolution layers: (1) the image area where the source image block P_s is located; (2) the layer number of the similar image block found (that is, P_1 corresponds to −1, and P_2 corresponds to −2). The pixel values of the final regions D_1 and D_2 are obtained by enlarging and copying the regions R_1 and R_2.

Schematic diagram of the creation of an image pyramid.
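The construction of the low-resolution layers and the block search can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: it assumes that formula (1) has the form I_{i-1} = D_σ(I_i * H_i) with a Gaussian blur kernel, and the names `build_low_res_layers`, `most_similar_patch`, `scale`, and `blur_sigma` are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    radius = int(3 * sigma + 0.5)
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def blur(image, sigma):
    """Separable Gaussian blur with edge padding (stands in for I * H)."""
    k = gaussian_kernel_1d(sigma)
    pad = len(k) // 2
    padded = np.pad(image.astype(np.float64), pad, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def build_low_res_layers(i0, n_layers=2, scale=2, blur_sigma=1.0):
    """Return [I_0, I_{-1}, ..., I_{-n}]: blur, then down-sample by `scale`."""
    layers = [i0.astype(np.float64)]
    for _ in range(n_layers):
        layers.append(blur(layers[-1], blur_sigma)[::scale, ::scale])
    return layers

def most_similar_patch(source_patch, layer):
    """Exhaustive search for the block with the smallest Euclidean distance."""
    p = source_patch.shape[0]
    best_dist, best_pos = np.inf, None
    for r in range(layer.shape[0] - p + 1):
        for c in range(layer.shape[1] - p + 1):
            d = np.sum((layer[r:r + p, c:c + p] - source_patch) ** 2)
            if d < best_dist:
                best_dist, best_pos = d, (r, c)
    return best_pos, float(np.sqrt(best_dist))

rng = np.random.default_rng(0)
layers = build_low_res_layers(rng.random((32, 32)), n_layers=2, scale=2)
position, distance = most_similar_patch(layers[1][3:8, 4:9].copy(), layers[1])
```

The exhaustive search is quadratic in the layer size and is only meant to make the matching criterion explicit; a practical system would use an approximate nearest-neighbor search.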
Feature extraction
The image
The high frequency detail component
In the formula, n represents the total number of blocks extracted.
Considering the robustness of the sample selection, the low-pass filtering operation is performed on the self-sample image
In the formula (4),
In the formula (5), δ_c represents the most similar external sample number corresponding to the sample
F_l represents the set of all
In the equation (6), F_l(δ_c) denotes a feature numbered δ_c extracted from all
Since F_L is a feature matrix composed of four features, when the selected image block is large, the dimension of F_L is too high. Therefore, it is necessary to use Principal Component Analysis (PCA) to reduce the dimensionality of F_L. The PCA projection matrix can be calculated by:
In equation (7), λ_ϑ is the ϑ-th eigenvalue of F_L^T F_L, v_ϑ is its corresponding eigenvector, λ is the vector of all eigenvalues, Sort() is an operation that sorts the vector from small to large, and ψ is the PCA projection matrix. F_L is reduced in dimension with the PCA projection matrix and then used to train the low-resolution dictionary.
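One way to compute such a projection matrix is sketched below in Python. The sketch assumes that samples are stored as rows (so that F^T F is a d × d matrix), that no mean is subtracted, and that components are kept until a chosen share of the eigenvalue energy is retained; the function name `pca_projection_matrix` and the `energy` parameter are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def pca_projection_matrix(features, energy=0.999):
    """features: (n_samples, d). Returns psi with shape (d, k)."""
    # eigh returns the eigenvalues of the symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(features.T @ features)
    eigvals = np.clip(eigvals, 0.0, None)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalue first
    cumulative = np.cumsum(eigvals[order]) / eigvals.sum()
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(order))
    return eigvecs[:, order[:k]]

# Toy data that lives in a 2-D subspace of a 5-D feature space.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5))
psi = pca_projection_matrix(features)
reduced = features @ psi                         # (200, k) low-dimensional features
```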
In the formula, Φ_L and α are the low-resolution dictionary and the corresponding sparse coefficients, respectively, ∥·∥_F represents the Frobenius norm, the vector α_ω represents the ω-th column vector in α, and the constant τ controls the sparsity. The high-resolution dictionary Φ_H is obtained by the following formula (9).
In the formula, g represents a matrix of sparse coefficients obtained by equation (8). The solution of the formula (9) is given by the following formula (10).
Formula (10) is a pseudo inverse representation.
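A pseudo-inverse solution of this kind can be sketched in Python. The sketch assumes that formula (9) is the least-squares problem of minimizing ∥X_H − Φ_H g∥_F over Φ_H, where X_H holds high-resolution training blocks as columns; the names `high_res_dictionary`, `hr_patches`, and `sparse_codes` are hypothetical.

```python
import numpy as np

def high_res_dictionary(hr_patches, sparse_codes):
    """hr_patches: (d_h, n); sparse_codes: (K, n). Returns Phi_H of shape (d_h, K).

    Least-squares solution X_H g^+, where g^+ is the Moore-Penrose pseudo-inverse.
    """
    return hr_patches @ np.linalg.pinv(sparse_codes)

# Synthetic check: patches generated from a known dictionary are recovered.
rng = np.random.default_rng(1)
true_dictionary = rng.normal(size=(6, 4))
codes = rng.normal(size=(4, 50)) * (rng.random((4, 50)) < 0.5)   # sparse-ish codes
patches = true_dictionary @ codes
estimated = high_res_dictionary(patches, codes)
```

When the coefficient matrix has full row rank, the pseudo-inverse equals g^T (g g^T)^{-1}, which is the familiar closed form of this least-squares problem.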
Sparse representation reconstruction
The low-resolution input image Y is given. We first use the bicubic kernel to interpolate Y with the magnification factor s. Moreover, we use f_1 = [1, 0, −1], f_2 = [1, 0, −1]^T, f_3 = [−1, 0, 2, 0, −1] and f_4 = [−1, 0, 2, 0, −1]^T to filter the interpolated image to obtain its gradient and Laplacian features, and then extract the image blocks from the feature images into a feature matrix.
In the formula, Q is a bicubic interpolation operation, R∘ is an image block extraction operation defined in equation (3), and * is a convolution operation defined in Equation (1).
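The filtering and block extraction step can be sketched in Python as follows. The sketch takes an already interpolated image as input (the bicubic interpolation Q is left out), uses zero padding at the borders, and stacks the four filtered blocks vertically; the names `filter_1d`, `extract_patches`, and `lr_feature_matrix`, and the block size, are assumptions for illustration.

```python
import numpy as np

F1 = np.array([1.0, 0.0, -1.0])               # gradient filter f1 (f2 is its transpose)
F3 = np.array([-1.0, 0.0, 2.0, 0.0, -1.0])    # Laplacian-type filter f3 (f4 is its transpose)

def filter_1d(image, kernel, axis):
    """Convolve every row (axis=1) or column (axis=0) with a 1-D kernel.

    mode="same" keeps the image size; each dimension must be at least 5 pixels.
    """
    return np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), axis, image)

def extract_patches(image, size=3, step=1):
    """Collect size x size blocks as columns of a (size*size, n_blocks) matrix."""
    h, w = image.shape
    blocks = [image[r:r + size, c:c + size].ravel()
              for r in range(0, h - size + 1, step)
              for c in range(0, w - size + 1, step)]
    return np.stack(blocks, axis=1)

def lr_feature_matrix(interpolated, size=3, step=1):
    img = interpolated.astype(np.float64)
    maps = [filter_1d(img, F1, 1), filter_1d(img, F1, 0),
            filter_1d(img, F3, 1), filter_1d(img, F3, 0)]
    return np.vstack([extract_patches(m, size, step) for m in maps])

ramp = np.tile(np.arange(8.0), (8, 1))         # intensity grows from left to right
features = lr_feature_matrix(ramp, size=3)
```

Because four feature maps are stacked, each column of the result is four times as long as a raw block, which is why a dimensionality reduction step is useful afterwards.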
Since the dimension of F_LR is high, we use the PCA projection matrix ψ to reduce the dimension of F_LR, and then use the orthogonal matching pursuit (OMP) algorithm to solve for the sparse coefficients over the low-resolution dictionary.
In the formula, ϕ_ρ is the ρ-th column vector of the sparse coefficient matrix ϕ.
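A basic form of orthogonal matching pursuit for a single feature vector is sketched below in Python. It assumes a dictionary with unit-norm columns and a fixed number of nonzero coefficients as the stopping rule; the function name `omp` and the parameter `n_nonzero` are illustrative, and the paper's exact stopping criterion is not reproduced here.

```python
import numpy as np

def omp(dictionary, signal, n_nonzero):
    """Greedy sparse coding: dictionary (d, K), signal (d,). Returns coefficients (K,)."""
    residual = signal.astype(np.float64).copy()
    support, solution = [], np.zeros(0)
    for _ in range(n_nonzero):
        correlations = np.abs(dictionary.T @ residual)
        if support:
            correlations[support] = 0.0          # never pick the same atom twice
        best = int(np.argmax(correlations))
        if correlations[best] < 1e-12:           # residual already explained
            break
        support.append(best)
        # Re-fit all selected atoms jointly (the "orthogonal" step).
        solution, *_ = np.linalg.lstsq(dictionary[:, support], signal, rcond=None)
        residual = signal - dictionary[:, support] @ solution
    coefficients = np.zeros(dictionary.shape[1])
    if support:
        coefficients[support] = solution
    return coefficients

# Sanity check with an orthonormal dictionary, where recovery is exact.
rng = np.random.default_rng(2)
atoms, _ = np.linalg.qr(rng.normal(size=(16, 16)))
true_coefficients = np.zeros(16)
true_coefficients[[3, 7, 11]] = [2.0, -1.5, 1.0]
recovered = omp(atoms, atoms @ true_coefficients, n_nonzero=3)
```

Applying `omp` to each column of the reduced feature matrix would yield one column of the sparse coefficient matrix per image block.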
According to the sparse representation theory, the initial high resolution image
In the formula, in contrast to R∘, R^{-1}∘ is the operation of restoring the image blocks to the image, and β is the weight parameter.
The initial high-resolution image
High frequency components play an irreplaceable role in super-resolution reconstruction. In order to re-estimate the high-frequency components of the initial high-resolution image
In the formula,
In the formula,
In the formula,
In order to accurately describe the relationship between the adaptive hybrid sample low frequency
The solution of the formula (18) is given by the following formula (19).
In the formula, I is the identity matrix. After that, we use the ridge regression coefficient Λ_o and the mixed sample high frequency
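The closed-form ridge regression solution referred to here can be sketched in Python. The sketch assumes the usual objective, minimizing ∥H − ΛL∥_F² + λ∥Λ∥_F² over Λ, with low-frequency and high-frequency sample features stored as columns of L and H; the names `ridge_regression_coefficients`, `low_freq`, `high_freq`, and `lam` are hypothetical, and formulas (18) and (19) themselves are not reproduced.

```python
import numpy as np

def ridge_regression_coefficients(low_freq, high_freq, lam=0.1):
    """low_freq: (d_l, n); high_freq: (d_h, n). Returns a (d_h, d_l) mapping.

    Closed form: H L^T (L L^T + lam * I)^(-1), computed with a linear solve.
    """
    gram = low_freq @ low_freq.T + lam * np.eye(low_freq.shape[0])
    return np.linalg.solve(gram, low_freq @ high_freq.T).T

# Synthetic check: a known linear mapping is recovered when lam is tiny.
rng = np.random.default_rng(3)
true_mapping = rng.normal(size=(3, 5))
low = rng.normal(size=(5, 200))
high = true_mapping @ low
estimated_mapping = ridge_regression_coefficients(low, high, lam=1e-8)
predicted_high = estimated_mapping @ low
```

The term λI keeps the matrix invertible even when the sample features are strongly correlated, which is the usual reason for preferring ridge regression over ordinary least squares.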
The high-resolution image after high-frequency re-estimation by the AMSRR model can be expressed as:
In the formula, G_l is a 7 × 7 Gaussian low-pass filter with a standard deviation of 1.6, and γ is the weight coefficient.
The core idea of non-local mean is to search for a similar image block in a large range of neighborhoods around a target image block and establish a non-local relationship between the target block and the non-local similar image block. This section extends the search area from non-local to the entire image area to form a global mean (GLM). Assuming that
In Equation (22),
In the formula, w is the weight control factor. The value of u is the sum of all the weights, which is calculated by the following equation (24).
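The weighting idea behind the global mean can be sketched in Python. The sketch assumes the common non-local means form, in which each weight is exp(−∥P_i − P_j∥² / w²) divided by the sum u of all such terms; the exact form of equations (22) to (24) is not reproduced, and the names `global_mean_weights` and `patches` are hypothetical.

```python
import numpy as np

def global_mean_weights(patches, target_index, w=10.0):
    """patches: (n, p) flattened image blocks. Returns weights over all blocks.

    Blocks that look like the target block receive large weights, and the
    normalizer u makes the weights sum to one.
    """
    target = patches[target_index]
    squared_dist = np.sum((patches - target) ** 2, axis=1)
    raw = np.exp(-squared_dist / (w ** 2))
    u = raw.sum()
    return raw / u

# Two identical dark blocks and one very different bright block.
patches = np.array([[1.0, 1.0, 1.0],
                    [1.0, 1.0, 1.0],
                    [100.0, 100.0, 100.0]])
weights = global_mean_weights(patches, target_index=0)
estimate = weights @ patches[:, 1]     # weighted estimate of the center pixel
```

Searching the whole image instead of a local window makes the cost grow with the square of the number of blocks, so a practical version would restrict or subsample the candidates.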
We assume that g_o denotes a vector containing the weight
According to the above definition, the prediction error
In addition, we use the gradient descent rule to strengthen the global reconstruction constraint, project
Using the gradient descent method, the solution of equation (26) can be given by iterative equation.
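A gradient descent iteration of this kind can be sketched in Python under simplifying assumptions. The sketch takes the global reconstruction constraint to be that the down-sampled estimate should match the low-resolution input Y, models down-sampling as block averaging without an explicit blur, and uses a fixed step size; equation (26) itself is not reproduced, and all function and parameter names are hypothetical.

```python
import numpy as np

def downsample(x, s):
    """Block-average down-sampling by an integer factor s."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def downsample_adjoint(r, s):
    """Adjoint of block averaging: spread each value evenly over an s x s block."""
    return np.kron(r, np.ones((s, s))) / (s * s)

def enforce_global_constraint(x0, y, s=2, step=1.0, n_iter=100):
    """Gradient descent on 0.5 * ||y - downsample(x)||^2 starting from x0."""
    x = x0.astype(np.float64).copy()
    for _ in range(n_iter):
        residual = y - downsample(x, s)
        x += step * downsample_adjoint(residual, s)   # negative gradient direction
    return x

rng = np.random.default_rng(4)
low_res = rng.random((4, 4))
initial = np.zeros((8, 8))                            # deliberately poor initial estimate
refined = enforce_global_constraint(initial, low_res, s=2)
```

With this down-sampling model the residual shrinks by a factor of 1 − step / s² per iteration, so the iteration converges for any step below 2s².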
The key technology of this system is the recognition of learner behavior; that is, by processing and analyzing the data collected by the sensors, the computer system can understand individual actions. Somatosensory measurement technology is increasingly mature. For example, the Kinect somatosensory device introduced by Microsoft can measure distance with its infrared sensing device and provide partial skeleton detection information. Kinect somatosensory equipment was therefore selected for behavior recognition in this paper. After detecting a critical behavior, the system needs to capture a video image of the location of the event for later tracking and verification. Therefore, a programmable PTZ camera with autofocus is selected for video capture.
Based on the above requirements analysis, the overall design of the system is carried out. The overall requirements are as follows: the learner behavior measurement system collects a variety of data from modern devices such as somatosensory devices in the classroom, and then processes, identifies, and stores the data for subsequent statistics and analysis in classroom teaching evaluation. At the same time, the performance of the equipment and the functions of the application system must be taken into account in the measurement process, so that the system is simple and easy to use. The system function diagram is shown in Fig. 2.

Schematic diagram of system deployment.
According to the overall requirements above, the system can be divided into three sub-modules: the data acquisition module, the detection and identification module, and the data analysis module. The flow and relationship between the functional modules are shown in Fig. 3. The following introduces the basis for dividing the modules and the key technical points and difficulties of each module.
For multi-person behavior recognition in the classroom scene, placing the somatosensory device directly in front of the subjects causes occlusion. Therefore, a scheme is designed in which the somatosensory equipment is suspended from the classroom ceiling. By hanging the Kinect, measuring its coverage, and adjusting its angle through the system, the occlusion problem in the multi-person situation can be effectively solved. The design of the intelligent classroom experimental scenario is based on the following requirements: the testing equipment in the classroom cannot interfere with the teaching process; the Kinect detection range covers all seats; when the classroom is full, the seats in each Kinect’s field of view do not occlude each other, and each Kinect’s field of view contains no more than 6 seats; and the PTZ zoom camera can obtain a face image of at least 200×200 pixels within its zoom range. The scene diagram designed through field measurement and calculation is shown in Fig. 4, and the equipment and layout are shown in Table 1:
Classroom equipment layout table

Schematic diagram of the function module.

Plan view of the experimental scene.
Kinect is able to detect the skeletal joint points of users within its field of view. In the classroom monitoring scenario of this paper, it is necessary to distinguish three student postures: hand raised, sitting, and squatting.
As shown in Fig. 6, the palm points, whose detection accuracy is not high, are eliminated, and the upper limb joint points of the human body are simplified to the nine remaining points. These nine points constitute the structure of the upper limbs and reflect changes in body posture. Because body height differs between people, different people performing the same hand-raising action have different upper limb structure vectors in the same coordinate system. Therefore, the upper limb structure vector alone cannot be used as a feature vector, and a new auxiliary feature vector needs to be proposed for posture determination. By comparing three different postures, it is found that the upper limb structure vectors in different postures exhibit different correlations, especially in the vector angles and the modulus ratios.

Schematic diagram of the skeleton in three dimensions.

Schematic diagram of the sitting skeleton.
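The idea of using vector angles and modulus ratios as body-size-independent features can be illustrated with a short Python sketch. The joint names, the choice of vectors, and the function `arm_features` are assumptions made for illustration; the paper's actual nine-point feature definition is not reproduced here.

```python
import math

def vector(a, b):
    """Vector pointing from joint a to joint b (3-D coordinates)."""
    return tuple(q - p for p, q in zip(a, b))

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def angle_deg(u, v):
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def arm_features(shoulder, elbow, wrist, spine_shoulder, spine_mid):
    """Angles and a length ratio that do not change when the skeleton is scaled."""
    arm = vector(shoulder, wrist)
    torso = vector(spine_mid, spine_shoulder)
    return {
        "elbow_angle": angle_deg(vector(shoulder, elbow), vector(elbow, wrist)),
        "arm_torso_angle": angle_deg(arm, torso),
        "arm_torso_ratio": norm(arm) / norm(torso),
    }

# Hypothetical joint positions (meters) for a raised right hand.
joints = {
    "shoulder": (0.20, 0.50, 2.0), "elbow": (0.25, 0.70, 2.0), "wrist": (0.25, 0.95, 2.0),
    "spine_shoulder": (0.0, 0.50, 2.0), "spine_mid": (0.0, 0.20, 2.0),
}
features = arm_features(**joints)

# The same pose for a person 30% larger gives the same features.
scaled = {name: tuple(1.3 * c for c in p) for name, p in joints.items()}
scaled_features = arm_features(**scaled)
```

A raised arm points roughly along the torso direction, so a small arm-to-torso angle is one plausible cue for the hand-raising posture, whatever the height of the student.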
The student participation ranking obtained from the above system monitoring is shown in Table 2. The difficulty ranking of the questions is shown in Table 3.
Student Participation Ranking Table
Question difficulty ranking
The time-sharing chart of the number of students attending the class and looking up is shown in Fig. 12.
The sitting skeleton feature vector and the modulus ratio are shown schematically in Figs. 7 and 8, respectively. This system also involves the recognition of the head-down posture. However, by observing and calculating the skeleton data we collected, we found that if the neck joint does not move significantly downward, it is difficult for the system to judge whether a student is lowering the head or sitting upright. Therefore, a head deflection angle detection function is introduced in this module. The function depends on the interface officially provided by Microsoft, as shown in Fig. 9: the detection range of Yaw (turning the head left and right) is −45° to 45°, the detection range of Roll (tilting the head left and right) is −90° to 90°, and the detection range of Pitch (nodding the head up and down) is −45° to 45°.

Schematic diagram of the sitting skeleton feature vector.

Schematic diagram of the modulus ratio.

Kinect2.0 head deflection angle detection.
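A simple rule that uses the head deflection angles to separate a lowered head from an upright one can be sketched in Python. The thresholds, the sign convention (negative pitch taken to mean looking down), and the function name `classify_head_state` are assumptions for illustration; only the detection ranges come from the description above.

```python
PITCH_RANGE = (-45.0, 45.0)   # detection ranges stated in the text
YAW_RANGE = (-45.0, 45.0)
ROLL_RANGE = (-90.0, 90.0)

def classify_head_state(pitch, yaw, roll=0.0, down_threshold=-20.0, yaw_limit=30.0):
    """Return 'untracked', 'head_down', 'turned_away', or 'facing_front'.

    Assumes negative pitch means the head is lowered; both thresholds are
    hypothetical and would need calibration on real classroom data.
    """
    in_range = (PITCH_RANGE[0] <= pitch <= PITCH_RANGE[1]
                and YAW_RANGE[0] <= yaw <= YAW_RANGE[1]
                and ROLL_RANGE[0] <= roll <= ROLL_RANGE[1])
    if not in_range:
        return "untracked"
    if pitch <= down_threshold:
        return "head_down"
    if abs(yaw) >= yaw_limit:
        return "turned_away"
    return "facing_front"

states = [classify_head_state(p, y) for p, y in [(-30, 0), (0, 40), (5, 10), (60, 0)]]
```

Combining this rule with the skeleton features covers the case described above, where the neck joint barely moves although the head is lowered.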
This system is a student behavior measurement system based on a PC, Kinect 2.0 somatosensory devices, and a Hikvision PTZ camera in a small classroom scene. Since each Kinect can only recognize the skeleton information of 6 people, the system can also serve large classrooms by adding more Kinect devices, but at a higher cost. The raw data collected in the experiment are processed and identified by the PC and then transmitted to and stored in the server database. The deployed experimental scene is shown in Fig. 10.

Experimental scene diagram.
Two programmable PTZ zoom cameras are deployed in the classroom, suspended at the front of the classroom 3 m above the ground, and are used to dynamically obtain detailed color images. Four Kinect 2.0 somatosensory devices are suspended from the classroom ceiling, 2.5 m above the ground. Each Kinect covers a quarter of the classroom and is used to detect student behavior in its area. A global fixed camera is mounted behind the classroom to detect the position of the classroom on the podium.
First, after each student settles into position in his or her seat and is recognized by the Kinect, a class question-answering session is simulated, as shown in Fig. 11, which shows the test interface of our student behavior detection system.

Ready.
In the student lecture test, the raw data shown in Fig. 12 are aggregated over time periods according to the calculation method in the previous design. The resulting per-period counts of students who are listening attentively during the simulated class are also shown in Fig. 12. In Fig. 12, we can see that the number of students attending to the lecture and looking up is low in the 90–160 s and 210–260 s periods, which is consistent with our experimental design. This experiment also shows that our system can detect some negative states of students in the classroom and provides a method for counting the students who are attending to the lecture. It therefore has reference value for the evaluation of classroom activity.

Time-based statistics of the number of students attending lectures.
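The time-period statistics can be illustrated with a small Python sketch. The record format (timestamp in seconds, student identifier, attentive flag), the 10 s period length, and the majority rule are assumptions for illustration, since the paper's calculation method is not reproduced here.

```python
from collections import defaultdict

def attentive_counts(records, bin_seconds=10):
    """Count, for each time period, the students who were attentive in most samples.

    records: iterable of (timestamp_s, student_id, is_attentive).
    Returns a list whose i-th entry is the count for period i.
    """
    votes = defaultdict(lambda: [0, 0])       # (period, student) -> [attentive, total]
    for timestamp, student, attentive in records:
        key = (int(timestamp // bin_seconds), student)
        votes[key][0] += int(attentive)
        votes[key][1] += 1
    if not votes:
        return []
    counts = [0] * (max(period for period, _ in votes) + 1)
    for (period, _), (n_attentive, n_total) in votes.items():
        if 2 * n_attentive > n_total:          # strict majority of samples
            counts[period] += 1
    return counts

records = [
    (1, "A", True), (5, "A", True),
    (2, "B", True), (6, "B", False), (8, "B", False),
    (12, "A", True), (12, "B", True),
]
per_period = attentive_counts(records)
```

Plotting such a list against time would give a chart of the kind described above, in which periods with a low count indicate that many students were looking down.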
By reviewing the literature, this paper summarizes the development status of classroom observation based on video technology, as well as existing research methods and video analysis software. The literature review shows that mature coding systems already exist, at home and abroad, for classifying teachers’ behavior in the classroom, which provides a theoretical and methodological basis for observing students’ classroom behavior. Secondly, studying the steps by which video technology is used to analyze classroom video not only gives a deeper understanding of video-based classroom observation, but also shows that the identification process can be made more intelligent. Therefore, this paper combines video technology with pattern recognition technology, which can simplify the process of data collection and analysis and can improve the accuracy of analysis. In addition, a classroom behavior recognition system integrating device control, data acquisition, recognition, recording, analysis, and display is realized. At the same time, this paper uses the system to conduct individual experiments on learners and experiments on classroom activity evaluation, and gives three experimental indicators: the personal evaluation of each learner, the difficulty evaluation of each question, and the overall evaluation of classroom activity. Moreover, the experiments yield conclusions that have reference value for classroom teaching evaluation.
