Abstract
A variety of factors affect English classroom teaching, which prevents teachers from effectively grasping students’ learning status and learning situation. In particular, classroom management is more difficult during online teaching. In order to improve the effectiveness of English classroom teaching, based on the human-computer interaction algorithm and facial identification algorithm, this paper effectively recognizes the human-computer interaction process and classroom learning status of students in online teaching, and eliminates the image background according to the actual teaching needs. Moreover, by extracting and fusing the time sequence information and spatial information of the motion in the video, a spatiotemporal feature image capable of expressing a dynamic typical feature is obtained. In addition, this paper uses system algorithms to judge the status and feed back the identification results to the teacher’s teaching terminal equipment, which is convenient for timely teaching adjustment. Finally, this paper analyzes the effectiveness of the algorithm through simulation experiments. The research results show that the algorithm constructed in this paper has good performance.
Introduction
Today’s university teaching is completely different from the past. Multimedia teaching equipment is installed in every classroom, and the campus network reaches every office, classroom and student apartment. We use multimedia systems and the Internet to obtain more, and more timely, information. In particular, the Internet has become a major tool for students to acquire knowledge. Teachers’ teaching content and teaching methods have also undergone tremendous changes with the birth of the Internet. Teaching content may be drawn from the Internet at any time, which keeps courses novel and up to date. Teaching is no longer limited to forty minutes in the classroom; modern information technology now extends it beyond class hours. As a result, various teaching aid systems have emerged. The English online auxiliary teaching system is one of them, and it is used to help students self-study English courses [1].
Research on the English online auxiliary teaching system aims to develop and implement an auxiliary system for after-class English teaching and teaching management according to the characteristics of English course learning. This auxiliary system can meet the needs of most schools for English teaching: it helps students use more of their spare time to learn English, communicate with teachers online, and increase their interest in English learning. The research has the following significance. In any school, teaching quality cannot be guaranteed by 40 minutes in the classroom alone. Students have a large amount of extra-curricular time, and it needs to be managed and put to full use. It can be said that whoever makes good use of extra-curricular time will ensure the quality of teaching. The purpose of developing the English online auxiliary teaching system is to make full use of students’ spare time. Colleges and universities have always attached great importance to the cultivation of students’ self-learning ability, because this ability equips students to cope independently in the future. The cultivation of self-learning ability does not depend solely on the students themselves, but also requires guidance from school teachers. The English online auxiliary teaching system can help students develop self-study ability and a sound method for learning language content [2]. The level of modernization of teaching has always been a reflection of the comprehensive level of schools. Therefore, major universities attach great importance to the modernization of teaching and use it to manage the school curriculum scientifically. The English online auxiliary teaching system is developed to provide a means for the modernization of English courses.
The teaching process of online education is difficult to control, and teachers cannot observe dozens of students at the same time. Therefore, it is necessary to apply human-computer interaction and facial identification technology to recognize the status of students and improve the effectiveness of online English teaching [3].
Related work
The most fundamental purpose of human-computer interaction evaluation technology research is to realize the usability of system engineering. Usability is the effectiveness, efficiency, and subjective satisfaction of a product for a specific user for a specific use under a specific use environment. Usability emphasizes the behavior of products in real life and the way they are used by users. It is not simply a matter of looking at a product from the inside, but of seeing from the outside how users use and interact with it [4]. In product development, people have usually focused on what the product is used for rather than on how the product is used to do things. Therefore, from the beginning of product design, we must seriously consider the user experience, rather than focusing only on technological innovation and ignoring the human factor. Products with good usability not only enable developers to reduce development costs, increase credibility and improve product competitiveness, but also improve user efficiency and satisfaction [5]. Usability research abroad started early and is extensive. Usability plays a very important role in the product development process. The usability evaluation research of the virtual simulation human-computer interaction interface is inextricably linked with various industrial departments of a country. The UK Department of Trade and Industry (DTI) has a research and development plan called “Usability: Start Now!”. IBM, Microsoft, etc. have dedicated usability labs or usability research groups. Studies have shown that investment in usability accounts for 10% of the total investment, and the usability of systems redesigned based on usability studies has increased by an average of 135%. A hard-to-use system will keep users away, and once a user has a bad experience with a system, it is difficult for that user to accept using it again. Many task tests indicate that 35% of operations fail when users use the system.
Users operate a system in order to solve a problem. If the efficiency with which users solve the problem is greatly improved, the possibility that they will use the system again will increase. The usability of the human-computer interaction interface is a concern that runs through the whole design, and a user-centered design method should be used [6]. The most direct results of improved usability are increased utilization of the system, higher efficiency of users in completing tasks with the system, and higher user satisfaction. To measure the usability of the human-computer interaction interface of a system, we need to determine a usability index through evaluation. This index can be a specific quantitative value or a satisfaction rating [7]. Therefore, usability research must study usability assessment. From the perspective of the user interface development process, human-computer interaction interface evaluation can be roughly divided into two categories: one is evaluation during the design process, called formative evaluation, and the other is the final evaluation made after the user interface is completed, called summative evaluation. Both types of evaluation play an important role in the development process and are an integral part of the entire human-computer interaction interface design. Among them, formative evaluation emphasizes the use of open methods in the development process, such as interviews, questionnaires, attitude surveys and scale techniques [8]. In contrast, most summative evaluations use strict quantitative measures, such as response time and error rate. Research abroad on the usability evaluation of human-computer interaction in systems started earlier and is extensive, for example the usability evaluation of digital libraries [9].
Related research has also developed rapidly in recent years, and there have been studies on the usability evaluation of digital museums and e-commerce systems [10]. The literature [11] identified the central task as improving the usability of IT and the market impact of human-computer interaction products. The research that has been done includes cross-cultural user research, mobile business and mobile phone company research, and a comparative evaluation of the relevance, page coverage, dead link rate, timeliness, professional segmentation, and classification functions of results under the standard search methods of search engines such as Google, hooChina, Baidu, and Zhongsou. The literature [12] conducted usability testing of Microsoft’s Chinese products and cooperated with Microsoft to explore ways to train usability professionals. The China EU Usability Research Center promotes usability concepts and usability engineering practices in China and provides more usable products to users in China and around the world [13]. Its customer base is large and includes many well-known domestic and foreign companies such as IBM, HP, eBay, Sina, Tsinghua Tongfang, and Volkswagen (Germany). The center offers many courses, including human-computer interaction technology, usability engineering, user interface design, and accessibility and universal design. Jinshan Company established the first software usability research laboratory in China, and the China Branch of the Usability Experts Association was established [14]. The article [24] addresses issues such as the enormous volume of big data and proposes the concept of SmartBuddy to build a smart environment utilizing human behaviors and human factors. The article [25] discusses the development of a directed acyclic graph for video coding algorithms for motion estimation in parallel reconfigurable computing systems.
Moreover, the partitioning algorithm plays a major part in speeding up video processing. The article [26] deals with exploiting IoT and big data analytics using the Hadoop environment in real-time scenarios. The implementation of an IoT-based smart city is accomplished by the above-mentioned processes. The article [27] centers on IoT and its significant role in making human activities and efforts more sophisticated. That paper also deals with the collection of various data from different resources that are connected to the web. The literature [28] addresses different issues in the field of vehicle communication by proposing a common unified and distributed spectrum sensing model [29]. The application of the shared cognitive paradigm minimizes conflict and various unknown problems [30].
Summary of image foreground segmentation technology
The direct subtraction method is one of the simplest foreground segmentation methods. It assumes that the background is stationary and does not change with time, and it does not consider the change in background illumination caused by the entry of the foreground. The method is simple and fast, but its disadvantages are also obvious and its scope of application is limited. When the background objects and the ambient light are constant, this method can achieve good results. Foreground segmentation with this method generally takes two steps [15]:
1. The background is sampled. When sampling the background, there is no foreground object in the environment. I_t(x, y) is used to represent the pixel value at point (x, y) at time t. If the sampling time is a, the image at this time is expressed as I_a(x, y), and the background image is expressed as:
2. Background subtraction. When a foreground object enters and the background needs to be removed to segment the foreground, let the current time be b; the segmented foreground is then:
Based on the above two steps, this paper conducted a gesture foreground segmentation experiment based on direct subtraction. The experimental results are shown in Fig. 1.

Experimental results of gesture foreground segmentation based on direct subtraction.
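The two steps of direct subtraction described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this paper: the function names and the fixed `threshold` value are assumptions introduced here, since the exact form of the subtraction formula is not reproduced in this text.

```python
import numpy as np

def sample_background(frame_a):
    """Step 1: store a frame captured with no foreground object as the background."""
    return frame_a.astype(np.float32)

def subtract_background(frame_b, background, threshold=30.0):
    """Step 2: mark pixels whose absolute difference from the background
    exceeds a threshold (the threshold value is an assumption)."""
    diff = np.abs(frame_b.astype(np.float32) - background)
    return diff > threshold

# Toy example: a uniform background and a brighter 2x2 "hand" region.
background = sample_background(np.full((4, 4), 100, dtype=np.uint8))
frame = np.full((4, 4), 100, dtype=np.uint8)
frame[1:3, 1:3] = 200
mask = subtract_background(frame, background)
```

Because the background is never refreshed, any later change in lighting or in the background objects shows up in `mask` as false foreground, which is the limitation noted above.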
The median and mean background model is also a simple background modeling method, but unlike the direct subtraction method, it takes the dynamic change of the background into account. When performing real-time background segmentation, the background model is also updated in real time. Its basic idea is to collect the M video frames before the current moment and calculate the background model from them. The background model is shown in formula (3) and formula (4) [16]:
Among them, formula (3) is the mean background model, formula (4) is the median background model, and I_i(x, y) represents the gray value at (x, y) at moment i. Here, median() is the function that takes the median value of the pixels within a certain time range.
Single Gaussian modeling is generally divided into three stages:
1. Training stage. At this stage, the background features are extracted by learning from a sequence of video frames, and on this basis a statistical model that describes the background features is established. In a single Gaussian background model, the pixel values at a fixed location in the background image are assumed to follow a Gaussian distribution, that is [17]:
Among them, the values of μ and σ² are:
μ and σ² are the average gray value and the gray value variance, respectively, of the point with coordinate (x, y) over a certain period of time T.
2. Detection stage. At this stage, the foreground needs to be segmented in real time according to the background model established in the training stage. For each pixel, the background/foreground decision is generally given by formula (8):
Among them, th is the threshold. The threshold is not fixed and may differ for each pixel; it is often taken as three times the variance, that is, 3σ².
3. Update stage. The background is not constant and is likely to change or shift over time with the ambient light, background objects, etc. Therefore, as time passes, the background model needs to be updated to adapt to the new background environment. The update formula of the background model is given as formula (9) [18]:
B_{t-1}(x, y) represents the background image at time t−1, F_t(x, y) represents the current frame, and ρ represents the background update rate. The larger the value of ρ, the faster the background image is updated.
Based on the above three steps, this paper conducted a gesture foreground segmentation experiment based on a single Gaussian background model. The experimental results are shown in Fig. 2.

Single Gaussian modeling experiment results.
The four panels in Fig. 2 are screenshots of real-time gesture segmentation video frames. Among them, figure (a) and figure (c) are the original image and the segmented image of Gaussian background modeling gesture segmentation, respectively, and figure (b) and figure (d) are the original image and the segmented image when an object in the background environment moves. It can be seen from the experimental results that single Gaussian background modeling overcomes the problem that the background model cannot be updated in time when the background environment undergoes simple changes. However, because the shadow area under the hand is too dark, the segmentation of this area is still not ideal [19].
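The three stages of single Gaussian background modeling can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code behind Fig. 2: the function names, the default update rate, and the running-average form of the update, B_t = (1 − ρ) B_{t−1} + ρ F_t, are assumptions introduced here, chosen to be consistent with the description that a larger ρ updates the background faster. The threshold follows the 3σ² rule stated in the text.

```python
import numpy as np

def train(frames):
    """Training stage: per-pixel mean and variance over T background frames."""
    stack = np.stack(frames).astype(np.float64)
    return stack.mean(axis=0), stack.var(axis=0)

def detect(frame, mu, var, k=3.0):
    """Detection stage: a pixel is foreground when |I - mu| exceeds the
    per-pixel threshold th = k * var (k = 3 follows the text)."""
    return np.abs(frame.astype(np.float64) - mu) > k * var

def update(background, frame, rho=0.05):
    """Update stage: blend the current frame into the background.
    The running-average form is an assumption."""
    return (1.0 - rho) * background + rho * frame.astype(np.float64)

# Toy example: noisy background around gray level 100, then a bright object.
rng = np.random.default_rng(0)
frames = [100.0 + rng.normal(0.0, 2.0, size=(8, 8)) for _ in range(50)]
mu, var = train(frames)

test_frame = mu.copy()
test_frame[2:5, 2:5] += 80.0          # simulated foreground region
mask = detect(test_frame, mu, var)
new_background = update(mu, test_frame, rho=0.1)
```

With a small ρ the object is absorbed into the background slowly; with a large ρ a hand that stays still would quickly be treated as background, so the rate is a trade-off.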
The most common use of the RGB color space is the display system. The color raster graphic display uses different R, G, and B values to drive the corresponding electron guns to emit electrons and excite the corresponding phosphors of the three channels on the display screen, thereby emitting different colors of light. The scanner is also based on the RGB color space to collect color information of color manuscripts. Different components in the scanner will absorb the color components from different channels in the RGB space and use it to represent the color of the manuscript.
As shown in Fig. 3, each point in the cube represents a color, which is described by its parameter values along the three dimensions R, G, and B. The origin has the coordinate (0, 0, 0), which corresponds to black.

RGB color space color cube.
The correlation between the three components of the RGB color space is relatively high, and the distance between colors in this space does not match the differences perceived by human vision, so it is not suitable for color segmentation. This section explores another color space, the YCbCr color space, which is the color model of CCIR601 encoding and is widely used in color TV displays. In the YCbCr color space, Y represents the luminance component, Cb is the blue chrominance component, and Cr is the red chrominance component. The color conversion formula from RGB space to YCbCr color space is shown in formula (10) [20]:
The YCbCr color space separates the brightness information and the chroma information of a color. On the Cb-Cr plane, skin color gathers well in a small area. Converting colors to the YCbCr color space has the following benefits:
1. In the YCbCr color space, brightness is isolated in a single dimension. When color information is analyzed without considering brightness, only the two dimensions Cb and Cr need to be analyzed, which reduces the dimension of the color space and the computational complexity.
2. The composition principle of the YCbCr color space is similar to human visual perception.
3. Skin color clusters well in the YCbCr color space: it is not scattered, and gathers in a small area.
4. The YCbCr color format is widely used in many fields, such as displays and video compression coding.
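As an illustration of the RGB to YCbCr conversion, the sketch below uses the full-range BT.601 (CCIR601) coefficients as found in JPEG/JFIF. Whether formula (10) uses exactly these coefficients or the studio-range variant is not confirmed by this text, so the matrix and the function name `rgb_to_ycbcr` should be read as assumptions.

```python
import numpy as np

# Full-range BT.601 coefficients (assumed; formula (10) may use another variant).
_M = np.array([
    [0.299, 0.587, 0.114],            # Y
    [-0.168736, -0.331264, 0.5],      # Cb
    [0.5, -0.418688, -0.081312],      # Cr
])
_OFFSET = np.array([0.0, 128.0, 128.0])

def rgb_to_ycbcr(rgb):
    """Convert RGB values in [0, 255] to YCbCr.
    Accepts a single (3,) color or an (H, W, 3) image."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return rgb @ _M.T + _OFFSET

# The mean skin color reported in this paper, in RGB.
y, cb, cr = rgb_to_ycbcr([149.23, 124.21, 99.90])
```

Under these coefficients the mean skin color lands inside the Cb and Cr ranges reported for the skin cluster, and dropping `y` leaves a two-dimensional chroma point for the clustering analysis.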
In order to better cluster human skin color, we exclude the interference of brightness, consider only the chroma information, and find the skin color clustering area in the Cb-Cr plane. To this end, the color values of a large number of skin regions need to be collected, and the skin color region in the Cb-Cr plane is then determined on this basis. In order to explore the clustering performance of the YCbCr color space for skin color, a total of 88 hand images under different environments were collected, and 113 patches of human skin were manually cropped, totaling 6007466 pixels. Through calculation and statistics, the average gray value of these pixels is 128.95, and the average values of the three RGB channels are 149.23, 124.21, and 99.90, where R, G, B ∈ [0, 255]. These pixels are converted to YCbCr space, the brightness information Y is discarded, and the distribution of skin color in the Cb-Cr plane is shown in Fig. 4 [21]:

The distribution of skin color pixels collected in this paper in the Cb-Cr plane.
It can be seen from Fig. 4 that skin color clusters well in the Cb-Cr plane. It is mainly distributed in the range of 80–130 on the Cb axis and in the range of 125–170 on the Cr axis, and its shape resembles an irregular quadrilateral. In order to represent this area more accurately, this paper fits it with a quadrilateral, as shown in Fig. 5 [22]:

Fitting skin color area with quadrilateral.
In this paper, the quadrilateral with the four vertices (65, 152), (118, 115), (135, 130), and (111, 175) is selected to represent the skin color area, so the expression of the quadrilateral area is:
Among them, x represents the Cb axis, y represents the Cr axis. According to this skin color area, this paper establishes the corresponding skin color model and tests the skin color area segmentation effect on this basis, as shown in Fig. 6 [23]:

The effect of segmentation based on YCbCr skin color model.
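A skin color test based on the fitted quadrilateral can be sketched as a point-in-convex-polygon check on the Cb-Cr plane. The vertex coordinates are those given in the text; the function name `in_skin_region` and the cross-product formulation are assumptions introduced for illustration, and the paper's own expression for the region may be written differently (for example, as four line inequalities).

```python
import numpy as np

# Vertices (Cb, Cr) of the skin color quadrilateral, in the order given in the text.
SKIN_QUAD = np.array([[65, 152], [118, 115], [135, 130], [111, 175]], dtype=np.float64)

def in_skin_region(cb, cr, quad=SKIN_QUAD):
    """Return a boolean mask that is True where (cb, cr) lies inside the quadrilateral.
    Assumes the polygon is convex and its vertices are listed in a consistent order."""
    cb = np.asarray(cb, dtype=np.float64)
    cr = np.asarray(cr, dtype=np.float64)
    inside = np.ones(cb.shape, dtype=bool)
    n = len(quad)
    for i in range(n):
        x0, y0 = quad[i]
        x1, y1 = quad[(i + 1) % n]
        # Sign of the cross product tells on which side of the edge the point lies.
        cross = (x1 - x0) * (cr - y0) - (y1 - y0) * (cb - x0)
        inside &= cross >= 0
    return inside

# A skin-like chroma value and a strongly blue one.
result = in_skin_region(np.array([111.62, 255.5]), np.array([142.49, 107.3]))
```

Applied to the Cb and Cr channels of a whole image, the same function yields the binary skin mask that is then cleaned up by smoothing and morphological filtering.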
In order to explore how well the skin color model removes the shadow under the hand that remains after the single Gaussian background modeling described earlier, this paper uses figure (d) in Fig. 2 as the input image and applies the skin color model to perform a skin color region segmentation experiment. The experimental results are shown in Fig. 7:

Further extraction of gesture foreground based on skin color segmentation in YCbCr color space.
Color images are usually represented by three components, R, G, and B, but there is a high correlation between them. When color is modeled directly in RGB, the desired effect often cannot be obtained. In contrast, the correlation between the three components of the HSV (Hue, Saturation, Value) color space is relatively low, and the space is closer to human visual perception. The HSV color model is a color space created by A. R. Smith in 1978. As shown in Fig. 8, the angle between the radius of the cone bottom and the 0° line represents Hue, the radius of the cone bottom represents Saturation, and the height of the cone represents Value. Their value ranges are 0∼360, 0∼255, and 0∼255, respectively.

HSV color space.
Each color in the HSV space corresponds to a relatively consistent Hue, so that the Hue can be used to separate color regions. Compared with RGB space, HSV space can well meet the characteristics of color uniformity and integrity. The transformation from RGB space to HSV space is as follows:
First, RGB is normalized so that R, G, B ∈ [0, 1]. The conversion formula from RGB space to HSV space is:
We set:
Normally, the V value of a color image is not zero, so H remains meaningful, and the H channel determines the hue information of an object. When detecting skin color, the saturation S and the brightness V have little effect. Therefore, we do not consider the information of these two dimensions and analyze only the H dimension.
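The RGB to HSV conversion described above can be sketched with the standard hexcone formulas. The function name `rgb_to_hsv_degrees` is an assumption, and the code may differ in notation from the paper's conversion formulas; the result is cross-checked against Python's standard `colorsys.rgb_to_hsv`. Here S and V are returned in [0, 1]; multiplying them by 255 gives the ranges shown for Fig. 8.

```python
import colorsys

def rgb_to_hsv_degrees(r, g, b):
    """Convert r, g, b in [0, 255] to (H in degrees, S in [0, 1], V in [0, 1])."""
    r, g, b = r / 255.0, g / 255.0, b / 255.0   # normalize to [0, 1]
    v = max(r, g, b)
    delta = v - min(r, g, b)
    s = 0.0 if v == 0 else delta / v
    if delta == 0:
        h = 0.0                                  # hue is undefined for grays; use 0
    elif v == r:
        h = 60.0 * (((g - b) / delta) % 6)
    elif v == g:
        h = 60.0 * ((b - r) / delta + 2)
    else:
        h = 60.0 * ((r - g) / delta + 4)
    return h, s, v

# Hue of the mean skin color reported in this paper.
h, s, v = rgb_to_hsv_degrees(149.23, 124.21, 99.90)
ref_h = colorsys.rgb_to_hsv(149.23 / 255, 124.21 / 255, 99.90 / 255)[0] * 360.0
```

For skin detection only `h` would be kept, in line with the decision above to ignore S and V.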
In order to establish a good skin color model, some scholars collected skin color images of Asian, European, and American people from the Internet, computed the H value distribution of these pixels, and then established a skin color model on this basis. A good segmentation effect was achieved for the face, hands, and bare body parts. In this paper, the segmentation object is limited to the hands in a specific environment. Therefore, in order to achieve a more accurate skin color segmentation effect, the skin color images in this paper are also collected in a specific environment, and only hand skin color images are collected, covering different gestures, different lighting (morning, noon, and evening), different experimental environments (changes of the background desktop), etc. In this paper, a total of 88 hand images in different environments were collected, and 113 patches of human skin were manually cropped, totaling 6007466 pixels. Through calculation and statistics, the average gray value of these pixels is 128.95, and the average values of the three RGB channels are 149.23, 124.21, and 99.90, where R, G, B ∈ [0, 255]. This result is converted to HSV space.
Image smoothing and morphological filtering
During the acquisition or transmission process, the image is often affected by various factors, which inevitably affect the image quality, such as noise and distortion. Therefore, in order to improve image quality, we often do some processing on noisy images, such as image smoothing, morphological filtering, etc.
Smoothing is also called blurring. It suppresses high-frequency signals in the image and highlights low-frequency signals. It is a simple and common image processing method. Image smoothing has many uses, the most common being to reduce noise or distortion in the image. The smoothing process is usually carried out by a filter. Common filters include the mean filter, the Gaussian filter, and the median filter.
The neighborhood operator determines the output value of a pixel from the pixel values around the corresponding pixel of the input image: the output value is the weighted sum of the surrounding pixels. The specific calculation process is shown in formula (18).
Among them, f(x, y), h(x, y), and g(x, y) represent the input image, the filter, and the output image, respectively. The above process can be simply written as:
Mean filtering is the simplest filtering operation. The output is the average value of the corresponding pixels in the input kernel window, that is, the weighting coefficients at all positions in the filter take the same fixed value:
Mean filtering averages the pixels in the neighborhood, so it suppresses the high-frequency part of the image and makes the image blurred and smooth, but it also has inherent defects. If a noise pixel differs greatly from the nearby pixel values, the averaged output values in the area around the noise are also pulled toward it, so the noise is spread rather than removed. Mean filtering therefore destroys some details in the image while denoising and cannot remove the noise well.
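The neighborhood operator and the mean filter can be sketched as a weighted sum over a sliding window. This is a minimal sketch: the function names, the odd kernel size, and the edge-replication border handling are assumptions introduced here and are not specified in the text.

```python
import numpy as np

def apply_filter(image, kernel):
    """Neighborhood operator: each output pixel is the weighted sum of the
    input pixels under the kernel window (odd kernel size assumed)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Replicate border pixels so the output has the same size as the input.
    padded = np.pad(image.astype(np.float64), ((ph, ph), (pw, pw)), mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + h, j:j + w]
    return out

def mean_kernel(k):
    """Mean filter: all k*k weights share the same value 1/(k*k)."""
    return np.full((k, k), 1.0 / (k * k))

# A single noisy pixel on a flat image.
image = np.zeros((5, 5))
image[2, 2] = 90.0
smoothed = apply_filter(image, mean_kernel(3))
```

In this toy case the isolated noise pixel is not removed: its value is spread over the 3×3 neighborhood, which illustrates the defect of mean filtering discussed above.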
Gaussian filtering is a linear smoothing filter that is commonly used to eliminate Gaussian noise. The Gaussian filter is a filter that selects weights according to the shape of the Gaussian function. The one-dimensional zero-mean Gaussian function is as follows:
Among them, the parameter σ determines the width of the Gaussian function. For two-dimensional images, a two-dimensional zero-mean Gaussian function is commonly used as a filter. The two-dimensional zero-mean Gaussian function is as follows:
The specific steps of Gaussian filtering are similar to the mean filtering, but the filter is different. It is very effective for suppressing the noise that follows the normal distribution.
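A Gaussian filter kernel can be built by sampling the two-dimensional zero-mean Gaussian function on a grid and normalizing the weights so that they sum to one. The function name `gaussian_kernel` and the odd `size` parameter are assumptions for illustration; the constant factor in front of the exponential is omitted because the normalization cancels it.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sample exp(-(x^2 + y^2) / (2 sigma^2)) on a size x size grid centered
    at the origin (odd size assumed) and normalize the weights to sum to 1."""
    half = size // 2
    ax = np.arange(-half, half + 1, dtype=np.float64)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

narrow = gaussian_kernel(5, 1.0)
wide = gaussian_kernel(5, 2.0)
```

The kernel is then applied in the same sliding-window manner as the mean filter. A larger σ gives a flatter kernel (here `wide` has a smaller center weight than `narrow`) and therefore stronger smoothing.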
Mathematical morphology is an image analysis discipline based on lattice theory and topology. Common operations include erosion, dilation, opening and closing, skeleton extraction, and so on. This section briefly introduces the most basic morphological operations, dilation and erosion. Their main uses are removing noise, finding the maximum and minimum regions of the image, and calculating the image gradient.
Dilation is the process of convolving an image with a convolution kernel. The kernel can be square, circular, etc., and acts as a mask or template. Dilation finds the maximum value of all pixels under the kernel. The specific steps are as follows: first, the kernel traverses the original image; then the local maximum pixel value of the original image in the region covered by the kernel is taken as the output pixel value at that point; finally, the traversal is completed to obtain the output image. Repeated dilation causes the highlighted areas in the image to grow outward along their borders, hence the name. The calculation formula of image dilation is shown in formula (23). Among them, src(x, y) is the original image, dilate(x, y) is the dilated image, and max() takes the maximum of the pixels in the kernel window.
Erosion is the opposite of dilation. The process is the same: the convolution kernel traverses the original image. The difference is that erosion takes the local minimum pixel value of the original image in the kernel area as the output. Repeated erosion causes the highlighted areas in the image to shrink continuously along their borders, as if they were being eroded, hence the name. The calculation formula of image erosion is shown in the following formula. Among them, src(x, y) is the original image, erode(x, y) is the eroded image, and min() takes the minimum of the pixels in the kernel window.
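Dilation and erosion with a square kernel can be sketched as a sliding-window maximum and minimum. The helper `_windows`, the 3×3 default kernel, and the edge-replication border handling are assumptions introduced for illustration, not details taken from the paper.

```python
import numpy as np

def _windows(src, ksize):
    """Stack every shifted copy of the image covered by a ksize x ksize kernel."""
    pad = ksize // 2
    padded = np.pad(src, pad, mode="edge")
    h, w = src.shape
    return np.stack([padded[i:i + h, j:j + w]
                     for i in range(ksize) for j in range(ksize)])

def dilate(src, ksize=3):
    """Each output pixel is the maximum of the input pixels in the kernel window."""
    return _windows(src, ksize).max(axis=0)

def erode(src, ksize=3):
    """Each output pixel is the minimum of the input pixels in the kernel window."""
    return _windows(src, ksize).min(axis=0)

# A single bright pixel grows into a 3x3 block and shrinks back again.
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 255
grown = dilate(img)
restored = erode(grown)
```

Eroding first and then dilating (opening) removes isolated bright specks such as this single pixel, which is one way the noise-removal use of these operations shows up in mask post-processing.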
Based on the system constructed above, this study verifies the effect of the system. This paper takes an English classroom lesson as an example and uses an online identification system as the experimental platform in actual teaching. Moreover, this paper combines the proposed human-computer interaction and facial identification algorithms and tests the effectiveness of the system algorithm by artificially setting the human-computer interaction actions and student status. Firstly, the human-computer interaction actions are identified by the system and compared with the human identification method. The results obtained are shown in Table 1 and Fig. 9.
Statistical table of human-computer interaction action identification
It can be seen from the statistical diagram of human-computer interaction action identification shown in Fig. 9 that the system has a better identification effect on human-computer interaction actions and a higher accuracy rate, while the accuracy rate of human-based identification is lower. Moreover, in the actual teaching, when only the teacher recognizes the student status, not only the efficiency is poor, but also the teaching effect is affected. It can be seen that the algorithm proposed in this paper has a certain effect in the teaching system.

Statistical diagram of human-computer interaction action identification.
Next, this paper analyzes the accuracy of facial identification, recognizes facial expressions through the system, and also conducts comparative verification by combining human identification methods. The results are shown in Table 2 and Fig. 10.
Statistical table of facial identification accuracy

Statistical diagram of facial identification accuracy.
It can be seen from Fig. 10 that the facial identification accuracy of the algorithm is close to 100% (red part in the figure), while the accuracy of human observation is lower (yellow part in the figure). It can be seen that the algorithm constructed in this paper has good performance and can effectively improve the efficiency of English teaching.
As the teaching scale of colleges and universities continues to expand, the task of teaching management is also increasing. English courses are public courses for the whole school, so English teachers have a heavy teaching load. How to achieve the teaching purpose within the stipulated course hours and improve the teaching quality of English courses has always been a difficult problem in college teaching management. This has forced colleges and universities to take further measures to improve their teaching management systems to meet the needs of college English teaching management. This paper proposes a face identification method for human-computer interaction in a complex background environment. By extracting and fusing the time sequence information and spatial information of the action in the video, a spatiotemporal feature image that can express a dynamic typical feature is obtained. Different actions have very different spatiotemporal feature images, so different feature images can be used to express different actions. Finally, deep learning is used to construct a convolutional neural network that is trained on the feature images in order to understand and recognize different actions and facial expressions. The research results show that the algorithm constructed in this paper has good performance.
Acknowledgment
The research in this paper was supported by Social and Science Fund of Hunan Province: Research on categorization Perception of wh-words under the influence of linguistic experience (NO. 18WLH21).
