Abstract
English vocabulary recognition has applications in both learning and daily life. Existing English vocabulary recognition models are limited by a variety of factors, which complicate the recognition process and lower recognition accuracy. To improve English vocabulary recognition, this paper draws on natural language processing algorithms and corpus systems and proposes a multi-feature fusion adaptive kernel correlation filter tracking algorithm that addresses the shortcomings of kernel correlation filtering algorithms. Building on the KCF (kernel correlation filter) algorithm, the paper improves it in three respects: feature fusion, adaptive change of the update rate, and scale detection. In addition, the paper explores whether recognizing vocabulary at different rhythms affects the reaction time and accuracy of second-language vocabulary recognition when subjects are tested under experimental conditions with similar characters and different voices. The results show that the model constructed in this paper performs well in the recognition of English words.
Introduction
English teaching involves two aspects: teaching and learning. However, most domestic research on English teaching has long focused on the “teaching” of teachers while ignoring the “learning” of students. As one of the three basic elements of English language learning for junior high school students, vocabulary is the basis for skills such as listening, speaking, reading, and writing, and it runs through the entire process of English learning. Moreover, the effectiveness of vocabulary learning directly affects the effectiveness of students’ English learning [1].
Vocabulary learning is the key difficulty in junior middle school students’ English learning, and a lack of vocabulary is often the main obstacle that restricts their English performance. In long-term English vocabulary teaching, teachers often adopt a fixed mode of “teacher reads, students follow”. Moreover, teachers often ignore students’ individual differences and learning styles, and fail to adopt different teaching methods that would guide students with different learning styles to the vocabulary learning method that suits them best. Over time, boring vocabulary explanations and monotonous memorization make vocabulary learning difficult and inefficient, and students gradually lose their enthusiasm for and interest in learning English vocabulary. Therefore, the current state of English vocabulary memorization and learning among junior high school students deserves the attention of educators and researchers and warrants further research [2].
Due to the rapid development of multimedia technology, visual learning is gaining more and more attention. With the emergence and rapid development of visual information carriers, visual learning continues to progress and develop and gradually attracts the attention of educational researchers. In the current education field, research on visual learning mainly focuses on teaching research based on visual learning and the educational application of visual technology. These related studies enlighten people to carry out visual learning research from multiple angles and aspects. In addition, with the emergence of multimedia teaching environments, diversified multimedia information technologies have increasingly adapted to the multiple intelligences of human beings and strengthened the connection with learning.
It is necessary for teachers to understand and master the differences in students’ perceptual learning styles. This paper combines perception style, visual learning and English vocabulary learning, and presents vocabulary in different visual forms combined with the current multimedia teaching environment. Moreover, this paper studies the differences in vocabulary memory of different presentation forms of learners with different visual image perception capabilities and visual text perception capabilities, and further explores the impact of individual differences in learners’ visual perception on English vocabulary memory [3].
Related work
Since changes in word frequency correspond to changes in word meaning, frequency-based methods can capture word meaning changes. Although this approach is relatively rough, it is simple to implement and reflects the corresponding phenomenon intuitively. The literature [4] conducted statistical tests on the frequency of political, social and emotional words to identify and describe the characteristics of an era. The literature [5] used corpus G to calculate the frequency of each word in each time period and tracked the changes in word frequency over time. Moreover, it used the word frequency model to detect changes in vocabulary and analyzed the social background of the time. For example, the frequency of the word “her” increased sharply around the 1960s, possibly because of the rise and popularity of the feminist movement. The literature [6] collected a news corpus spanning nearly 60 years, applied natural language processing technology, studied changes in word frequency and word meaning based on frequency, cumulative frequency and Shannon entropy, and developed a diachronic retrieval system for modern Chinese vocabulary. Moreover, it displayed diachronic information about the vocabulary, such as frequency, in the form of line graphs and histograms, which represent the diachronic trend of the vocabulary intuitively. The literature [7] compiled influential and representative newspapers and periodicals from the late Qing Dynasty to the Republic of China into a corpus. By analyzing the frequency changes of different words with the same meaning across time periods, it investigated the development of modern Chinese political terminology, which attracted wide attention in the diachronic field.
Quantitative observation and analysis of vocabulary in big data can provide a new approach for application and research in the field of world culture. At the same time, the rise of online social media provides an important platform for studying the temporal and spatial changes of vocabulary. The literature [7] analyzed important political events through the word frequency model, used large volumes of text data to compute word frequency statistics for related words, and verified and supplemented them along the diachronic time axis. Moreover, it used the word frequency model to quantitatively study related products of human culture and spirit, and analyzed them more accurately and comprehensively according to the laws and characteristics of vocabulary change. The literature [8] proposed an information-theoretic model to detect language changes and to compare languages in different periods in order to identify vocabulary changes. The model can identify vocabulary changes and shows that the speed of change is not uniform. This measurement technique can therefore be extended to the task of speeding up queries over simple abstracts from American Geographical Journals. The literature [9] built a Meme Tracker system that counts the vocabulary extracted from a large collection of news articles and reflects trends in American politics, economy, and culture according to the changes in word frequency over time. The literature [10] analyzed the frequency and context of vocabulary in a diachronic corpus and explored how word meaning changes under the influence of social events, cultural evolution and the political environment. Baker found that the meaning of the word “money” is relatively stable, and that the words “family” and “children” in the English corpus have attracted more and more attention. Vocabulary is the basic unit of human language.
Meanwhile, the word frequency model has achieved remarkable results in the study of language evolution and social changes, and it can reflect its characteristics intuitively and simply.
The literature [11] proposed an LDA-based model to identify and analyze word meaning changes; it assigns different weights to different feature sets and optimizes the number of fixed word meanings, the number of layers, and the size of the context window. The literature [12] used clustering methods based on Topics-over-Time and K-means to identify vocabulary that moves from topics or clusters in one time period to topics or clusters in another. It also linked the experimental results to potential diachronic events of the time for analysis. The literature [13] proposed using a non-parametric topic model for word sense recognition and preprocessed the examples differently, deleting the target word from each example and setting a mark in its place. It also adjusted the hyperparameters of the topic model to optimize word sense recognition on the evaluation set, and did not use positional or dependency features. The literature [14] applied topic models to identify the meaning of target words and showed that sense induction can be used to automatically detect words that suddenly acquire new meanings and the time period in which these meanings appear.
The literature [15] proposed an automatic detection method for word meaning changes based on a distributional similarity model. In this method, the distributional properties of each word are described by a vector space model, and each word is associated with its context vector. The literature [16] applied typical calculation algorithms to a diachronic corpus and tracked the development and change of word meanings. It took the other words in the sentence containing the target word as features and calculated their weights using pointwise mutual information (PMI). In addition, it used SSI to describe the stability of a word's meaning, that is, it judged the change of word meaning by evaluating the proportion of words in a set of similar word meanings. Since many words and phrases are characteristic of a specific time period, collecting and analyzing them reveals the changes of the times and the trajectory of social development. The literature [17] used a large-scale, chronologically ordered diachronic corpus to explore diachronic phenomena in vocabulary, studied the unknown relationship between language use and time periods or eras, and analyzed the significant changes in the distribution of vocabulary in the Google Books Ngram corpus and their relationship with emotional vocabulary. Moreover, it used the Welch test, run test, least squares method, Ratio, Spearman coefficient, Kendall test and other related statistical methods to analyze important politically related vocabulary. It found a connection between vocabulary changes and real historical events, which reveals the dynamic interests within society. The literature [18] advocated the view that word meaning changes continuously and proposed a framework for continuous automatic detection of word meaning change.
The framework takes time as the unit, measures the state of a word in each time unit with entropy to form a time series, and uses curve fitting to obtain the pattern of vocabulary change over that series. Moreover, the framework can successfully identify patterns of vocabulary change, such as the expansion and contraction of word meaning, the generation of new words, metaphorical changes, and metonymic changes. It therefore provides a feasible platform for the investigation of word meaning changes. The literature [22] addresses various problems in vehicular communication by proposing a combined centralized and distributed spectrum sensing model; applying the cooperative cognitive paradigm minimizes conflicts and multiple unknown problems. The literature [23] discusses the problem of vast volumes of big data and introduces the SmartBuddy idea of an adaptive and smart world that incorporates human activity and human dynamics. The literature [24] describes the development of a directed acyclic graph for motion-estimation video coding algorithms in parallel reconfigurable computing systems, in which the partitioning algorithm also plays a major role in speeding up image production. The article [25] deals with leveraging IoT and big data analytics in real-time applications using the Hadoop platform; these processes enable the deployment of an IoT-based smart city. The article [26] centers on IoT and its major role in supporting human practices and endeavors. It also deals with collecting information from the different assets that are connected to the web [27, 28].
Principle of correlation filter tracking algorithm
The concept of correlation originated in the field of signal processing, where it describes the relationship between two signals. Researchers later introduced it into image detection and classification. There are two types of correlation: autocorrelation and cross-correlation. In this paper, “correlation” refers to cross-correlation. For two signals f and g, the cross-correlation is defined as [19]:
In the formula, f* represents the complex conjugate of the signal f, and τ represents the displacement variable. Intuitively, the correlation describes the similarity of the two signals at displacement τ. In practice, the computation is usually carried out in the frequency domain, where the above formula becomes:
In the formula, F (t) and G (t + τ) represent the Fourier transforms of f (t) and g (t + τ) respectively, $\mathcal{F}^{-1}$ represents the inverse Fourier transform, and ⊙ represents the element-wise (point) multiplication operator. Through the Fourier transform, the correlation operation in the time domain is converted into a point multiplication in the frequency domain, which reduces the amount of calculation and improves the calculation speed.
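The equivalence between the two forms can be checked numerically. The following is a minimal NumPy sketch, assuming the discrete circular version of the definition, $c[\tau] = \sum_t f^*[t]\, g[(t+\tau) \bmod n]$, and the corresponding frequency-domain form $c = \mathcal{F}^{-1}(\hat{f}^* \odot \hat{g})$; the function names are illustrative and do not come from the paper.

```python
import numpy as np

def cross_correlation_direct(f, g):
    """Circular cross-correlation: c[tau] = sum_t conj(f[t]) * g[(t + tau) mod n]."""
    n = len(f)
    # np.roll(g, -tau)[t] equals g[(t + tau) mod n]
    return np.array([np.sum(np.conj(f) * np.roll(g, -tau)) for tau in range(n)])

def cross_correlation_fft(f, g):
    """Same quantity via the correlation theorem: IFFT(conj(FFT(f)) * FFT(g))."""
    return np.fft.ifft(np.conj(np.fft.fft(f)) * np.fft.fft(g))

rng = np.random.default_rng(0)
f = rng.standard_normal(64)
g = np.roll(f, 5)  # g is f displaced by 5 samples
c = cross_correlation_fft(f, g)
print(int(np.argmax(np.real(c))))  # index of the strongest correlation
```

The direct version costs O(n²) operations while the FFT version costs O(n log n), which is the speed-up the text refers to.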
Similarly, when the idea of correlation filtering is applied to target tracking, it is to find a filter template to maximize the correlation between the template and the target, so that when it acts on the tracking target, the response obtained is maximized [20].
The algorithm constructs a correlation filter, searches for candidate areas, finds the maximum response value, and uses the area with the maximum response value as the predicted tracking target area. Moreover, the algorithm converts the calculation in the time domain to the frequency domain, which greatly improves the operation speed. Taking the MOSSE algorithm as an example, we analyze the principle of the correlation filtering tracking algorithm. We set f as the input image, h as the correlation filter, and g as the output response. The principle of the correlation filtering tracking algorithm is expressed as follows [21]:
After transforming the above formula to the frequency domain, the formula is as follows:
Compared with formula (3), the convolution operation in formula (4) is transformed into a point multiplication operation, which reduces the amount of calculation and improves the algorithm speed. The formula for solving the correlation filter is as follows:
In order to improve the robustness of the correlation filter, we also consider the m image samples of the target to obtain the objective function:
The solution of the above formula is as follows:
Through formula (7), the correlation filter can be solved. The obtained correlation filter is correlated with the next frame of image to obtain the response graph. The point with the largest value in the response map is the predicted position in the next frame of image.
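The training and detection steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's implementation: it assumes the usual MOSSE closed form $H^* = \sum_i G_i \odot F_i^* / \sum_i F_i \odot F_i^*$, a Gaussian-shaped desired response, and a small constant `eps` added to the denominator for numerical stability. All function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired output g: a 2-D Gaussian peaked at the target centre."""
    ys, xs = np.indices(shape)
    return np.exp(-((ys - center[0]) ** 2 + (xs - center[1]) ** 2) / (2 * sigma ** 2))

def train_mosse(samples, g, eps=1e-5):
    """Closed-form filter over m samples, computed in the frequency domain."""
    G = np.fft.fft2(g)
    num = np.zeros_like(G)
    den = np.zeros_like(G)
    for f in samples:
        F = np.fft.fft2(f)
        num += G * np.conj(F)
        den += F * np.conj(F)
    return num / (den + eps)  # eps is an assumed guard against division by zero

def detect(H_conj, image):
    """Response map g = IFFT(F . H*); its peak is the predicted position."""
    response = np.real(np.fft.ifft2(np.fft.fft2(image) * H_conj))
    return np.unravel_index(np.argmax(response), response.shape)

rng = np.random.default_rng(0)
target = rng.standard_normal((32, 32))            # stand-in for a target patch
g = gaussian_response(target.shape, (16, 16))
samples = [target + 0.01 * rng.standard_normal(target.shape) for _ in range(4)]
H_conj = train_mosse(samples, g)
moved = np.roll(target, shift=(3, -2), axis=(0, 1))  # target moved by (3, -2)
print(detect(H_conj, moved))
```

Because a cyclic shift of the input only multiplies its spectrum by a phase factor, the response peak moves by the same displacement as the target, which is what makes the peak usable as a position estimate.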
In addition to the MOSSE algorithm, the general framework of the correlation filter tracking algorithm is shown in Fig. 1:

Correlation filtering target tracking framework.
The general framework can be summarized as follows:
(1) Initialization: The algorithm obtains the initial position of the target and extracts the target features. It then applies a cosine window to reduce the boundary effect of the samples generated by cyclic shifts. Finally, it combines the features with the expected output to train the correlation filter.
(2) Feature extraction: The algorithm extracts the features of the search area centered on the target position in the previous frame, applies the cosine window to the extracted features, and computes their Fast Fourier Transform (FFT).
(3) Target prediction: The algorithm applies the trained correlation filter to the features of the search area to obtain the response map, and takes the position with the largest response value as the predicted target position.
(4) Model update: The algorithm extracts features at the predicted target location and updates the correlation filter.
Throughout the algorithm, calculations in the time domain are converted into the frequency domain through the FFT, which greatly improves the speed of the target tracking algorithm.
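The cosine-window step used in the initialization and feature extraction stages can be illustrated as follows. This minimal sketch assumes a Hann window (NumPy's `np.hanning`) as the cosine window and a single-channel patch; the function names are hypothetical.

```python
import numpy as np

def cosine_window(height, width):
    """2-D cosine (Hann) window: outer product of two 1-D Hann windows."""
    return np.outer(np.hanning(height), np.hanning(width))

def apply_window(patch):
    """Taper the patch so its borders fade to zero before the FFT."""
    return patch * cosine_window(*patch.shape)

patch = np.ones((8, 8))         # stand-in for an extracted feature patch
windowed = apply_window(patch)
spectrum = np.fft.fft2(windowed)  # frequency-domain features used by the filter
```

Tapering the borders to zero suppresses the artificial discontinuities that appear where a cyclically shifted patch wraps around.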
The kernel correlation filter algorithm obtains a large number of training samples by cyclically shifting a base sample, which gives the sample matrix a circulant structure. By exploiting the properties of the circulant matrix, the algorithm trains the classifier and detects the target rapidly.
To express the derivation more clearly, the following takes a single-channel one-dimensional sample as an example; the results can be extended to two-dimensional images. We use an n × 1 vector $x = [x_1, x_2, x_3, \cdots, x_n]^T$ to represent the base sample. To obtain the sample matrix required for training, we cyclically shift the vector x. The cyclic shift operator P, which is a permutation matrix, performs this operation on x. Its form is as follows:
The product $Px = [x_n, x_1, x_2, \cdots, x_{n-1}]^T$ of the cyclic shift operator P and the vector x shifts x cyclically to the right by one element, and the product $P^u x$ shifts x cyclically to the right by u elements. If u is negative, $P^u x$ shifts x cyclically to the left by |u| elements. For the vector x, the complete set of samples obtained by cyclic shifts is expressed as follows:
The visualization result of the cyclic shift operation of the one-dimensional vector is shown in Fig. 2, and the partial result of the cyclic shift operation of the two-dimensional image is shown in Fig. 3.

Cyclic shift of one-dimensional vector.

Vertical cyclic shift of two-dimensional image.
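The cyclic shift operator for the one-dimensional case can be made concrete with a short NumPy sketch. The helper names are illustrative; the only assumption is that P is the n × n permutation matrix that moves the last element of x to the front.

```python
import numpy as np

def shift_operator(n):
    """Permutation matrix P with P @ x = [x_n, x_1, ..., x_{n-1}]^T."""
    return np.roll(np.eye(n, dtype=int), 1, axis=0)

def cyclic_shift(x, u):
    """P^u x: shift right by u elements (left by |u| when u is negative)."""
    P = shift_operator(len(x))
    if u >= 0:
        return np.linalg.matrix_power(P, u) @ x
    # The inverse of a permutation matrix is its transpose.
    return np.linalg.matrix_power(P.T, -u) @ x

def all_shifts(x):
    """Complete sample set {P^u x | u = 0, ..., n-1}, one shift per row."""
    return np.stack([cyclic_shift(x, u) for u in range(len(x))])

x = np.array([1, 2, 3, 4])
print(shift_operator(4) @ x)   # one-element right shift
print(cyclic_shift(x, -1))     # one-element left shift
```

In practice `np.roll(x, u)` gives the same result without forming P; the explicit matrix is shown only to mirror the notation of the text.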
Using cyclic shifts, we obtain the circulant matrix X. The formula is as follows:
The first row of the circulant matrix X is the base sample x, and each subsequent row is obtained by cyclically shifting the previous row one element to the right. Circulant matrices have some special properties. One important property is that every circulant matrix can be diagonalized by the discrete Fourier transform. The formula is as follows:
In the formula, F is a constant matrix independent of x, which represents the discrete Fourier transform matrix.
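The diagonalization property can be verified numerically. The sketch below assumes the standard form $X = F\,\mathrm{diag}(\hat{x})\,F^H$, where $\hat{x}$ is the DFT of the base sample and F is the unitary DFT matrix; the helper names are illustrative.

```python
import numpy as np

def circulant(x):
    """Circulant matrix whose i-th row is x cyclically shifted right by i."""
    return np.stack([np.roll(x, i) for i in range(len(x))])

def dft_matrix(n):
    """Unitary DFT matrix F, scaled so that F^H F = I."""
    return np.fft.fft(np.eye(n)) / np.sqrt(n)

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
X = circulant(x)
F = dft_matrix(len(x))
x_hat = np.fft.fft(x)                          # DFT of the base sample
X_rebuilt = F @ np.diag(x_hat) @ F.conj().T    # F diag(x_hat) F^H
print(np.allclose(X_rebuilt, X))
```

Since F does not depend on x, every operation on X reduces to element-wise operations on $\hat{x}$, which is the basis of the simplifications that follow.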
The kernel correlation filter tracking algorithm treats target tracking as a ridge regression problem. It trains a Regularized Least-Squares (RLS) classifier to obtain a set of parameters w such that the function $f(z) = w^T z$ minimizes the regularized risk over all training samples. The formula is as follows:
In the formula, w represents the parameter vector, $x_i$ represents the i-th training sample, $y_i$ represents the regression value of the i-th training sample, and λ represents the regularization parameter. Solving the above formula gives the closed-form solution of formula (12). The formula is as follows:
In the formula, each row of the matrix X represents a training sample, each element $y_i$ of the vector y represents the regression value of the corresponding training sample, and I represents the identity matrix. The complex-valued form of the above formula is as follows:
In the formula, $X^H$ represents the Hermitian transpose of X. Directly solving formula (14) involves a matrix inversion, so the amount of calculation is relatively large. Therefore, we use the circulant matrix results from the previous section to simplify the ridge regression problem. First, using formula (11) to simplify formula (14), we get the following result:
In the formula, $F^H F = I$. Then, we can get the following result:
In the formula,
We substitute formula (17) into formula (14) to simplify and obtain the following formula:
Equation (19) is the solution to the ridge regression problem after taking the cyclic shift sample as the training sample in the frequency domain.
The above derivation shows that, by making full use of the properties of the circulant matrix, the algorithm needs only the original base sample to solve the ridge regression problem and does not need to cyclically shift the samples explicitly. The model it obtains is nevertheless the one trained on all cyclically shifted samples.
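The saving can be illustrated by solving the same ridge regression twice: once with the explicit n × n sample matrix and once with element-wise operations on DFTs. The sketch below is written for real-valued data and NumPy's FFT sign convention, with the rows of X taken as right shifts of x. Under these assumptions the frequency-domain solution takes the form $\hat{w} = \hat{x} \odot \hat{y} / (\hat{x}^* \odot \hat{x} + \lambda)$; which factor in the numerator carries the conjugate depends on the chosen convention, so the form printed in a given paper may differ by a conjugation. Function names are illustrative.

```python
import numpy as np

def ridge_direct(x, y, lam):
    """w = (X^T X + lam I)^{-1} X^T y, with X built from all cyclic shifts of x."""
    n = len(x)
    X = np.stack([np.roll(x, i) for i in range(n)])
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def ridge_fourier(x, y, lam):
    """Same solution from element-wise operations on the DFTs of x and y."""
    x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
    w_hat = x_hat * y_hat / (np.conj(x_hat) * x_hat + lam)
    return np.real(np.fft.ifft(w_hat))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)   # base sample
y = rng.standard_normal(16)   # regression targets, one per shift
print(np.allclose(ridge_direct(x, y, 0.1), ridge_fourier(x, y, 0.1)))
```

The direct route needs an O(n³) linear solve, whereas the Fourier route needs only a few FFTs, that is, O(n log n) operations.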
To handle samples that are not linearly separable in the low-dimensional space, an important idea is to use a kernel function to map the samples to a high-dimensional space in which they become linearly separable, which converts the problem into a linear classification problem. Kernel functions are an important technique and are used in a variety of machine learning algorithms.
We map the features of the sample to a high-dimensional space and define the mapping function as φ (x). According to the representer theorem, the parameter w can be expressed as the following formula:
In the formula, α is a column vector of size m × 1. The problem of solving for w is thus converted into the problem of solving for α. We use the kernel function $k(x_i, x_j)$ to represent the inner product $\langle \varphi(x_i), \varphi(x_j) \rangle$ of the mapped features of two samples. Common kernel functions include the Gaussian kernel, the polynomial kernel, and the linear kernel. We then collect all $k(x_i, x_j)$ into a kernel matrix K of size m × m, where:
The solution of the original ridge regression problem after introducing the kernel function is:
With the kernel function, the algorithm does not need to map the sample features explicitly from the low-dimensional space to the high-dimensional space; it uses the mapped features implicitly to classify the samples linearly in the high-dimensional space. Although the above formula gives a more concise solution, solving the classifier still involves a matrix inversion, so the computational cost grows as the number of samples increases. To solve this problem, the kernel correlation filter tracking algorithm also introduces the circulant matrix into the kernel ridge regression problem. For common kernel functions, such as polynomial, Gaussian, and linear kernels, the kernel matrix obtained from data C (x) with a cyclic structure is still a circulant matrix. Therefore, in formula (22), the kernel function matrix K can also be reduced to:
In the formula, $k^{xx}$ represents the first row of the kernel function matrix K.
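The effect of the circulant kernel matrix can be checked with a small example. The sketch below assumes a Gaussian kernel, the dual ridge solution $\alpha = (K + \lambda I)^{-1} y$, and its frequency-domain counterpart $\hat{\alpha} = \hat{y} / (\hat{k}^{xx} + \lambda)$ as known from the kernel correlation filter literature. The names, the kernel bandwidth, and the brute-force computation of $k^{xx}$ are illustrative choices.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=2.0):
    """k(a, b) = exp(-||a - b||^2 / sigma^2)."""
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_autocorrelation(x, sigma=2.0):
    """k^{xx}: kernel between x and each of its cyclic shifts (first row of K)."""
    return np.array([gaussian_kernel(x, np.roll(x, m), sigma) for m in range(len(x))])

def train_direct(x, y, lam, sigma=2.0):
    """alpha = (K + lam I)^{-1} y with the full kernel matrix over all shifts."""
    n = len(x)
    shifts = [np.roll(x, i) for i in range(n)]
    K = np.array([[gaussian_kernel(a, b, sigma) for b in shifts] for a in shifts])
    return np.linalg.solve(K + lam * np.eye(n), y)

def train_fourier(x, y, lam, sigma=2.0):
    """Same alpha from an element-wise division in the frequency domain."""
    k_hat = np.fft.fft(kernel_autocorrelation(x, sigma))
    return np.real(np.fft.ifft(np.fft.fft(y) / (k_hat + lam)))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
y = rng.standard_normal(16)
print(np.allclose(train_direct(x, y, 0.01), train_fourier(x, y, 0.01)))
```

Only the n-element vector $k^{xx}$ has to be stored and transformed, so the matrix K never needs to be formed or inverted.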
When the tracking algorithm detects and tracks a target, it first extracts sample features from the selected candidate regions. It then uses the previously trained classifier to calculate the response value of each candidate in turn. Finally, it selects the candidate sample with the largest response value as the current target area. When there are many candidate samples, evaluating them one by one inevitably slows the algorithm down. Therefore, the algorithm also uses the circulant matrix to generate candidate samples and so achieves rapid detection over the search area. We use $K^z$ to denote the kernel matrix between all training samples and all candidate samples. The training samples are obtained by cyclically shifting the base sample x, and the candidate samples are obtained by cyclically shifting the base sample z. Therefore, each element of $K^z$ can be expressed as $k(P^{i-1} z, P^{j-1} x)$. This kernel matrix is also a circulant matrix, so it can be expressed as follows:
For the regression values of all candidate samples constructed based on the basic sample z, the calculation formula is as follows:
In the formula, f (z) contains the regression values of all candidate samples constructed from the base sample z. The position of the candidate sample with the maximum value is the predicted position. Because the kernel matrix $K^z$ is a circulant matrix, f (z) can be expressed using the properties of circulant matrices as follows:
Different from the traditional algorithm, the kernel correlation filter tracking algorithm uses a cyclic shift operation to construct candidate samples, and only one calculation is required to obtain the response values of all candidate samples. Among them, the position of the candidate sample with the largest response value is the position of the predicted target, thereby realizing rapid detection and tracking of the target.
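The single-pass detection can be compared with a candidate-by-candidate evaluation in a short sketch. It assumes a Gaussian kernel and the response formula $\hat{f}(z) = \hat{k}^{xz} \odot \hat{\alpha}$ from the kernel correlation filter literature, where $k^{xz}$ holds the kernel values between x and every cyclic shift of z. The function names, the bandwidth, and the Gaussian-shaped training target are illustrative choices.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=2.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def kernel_correlation(x, z, sigma=2.0):
    """k^{xz}[m]: kernel between x and z cyclically shifted right by m."""
    return np.array([gaussian_kernel(x, np.roll(z, m), sigma) for m in range(len(x))])

def train(x, y, lam=0.01, sigma=2.0):
    """Dual coefficients alpha from the frequency-domain closed form."""
    k_hat = np.fft.fft(kernel_correlation(x, x, sigma))
    return np.real(np.fft.ifft(np.fft.fft(y) / (k_hat + lam)))

def detect_fourier(alpha, x, z, sigma=2.0):
    """Responses of all n candidates at once: IFFT(FFT(k^{xz}) * FFT(alpha))."""
    k_hat = np.fft.fft(kernel_correlation(x, z, sigma))
    return np.real(np.fft.ifft(k_hat * np.fft.fft(alpha)))

def detect_direct(alpha, x, z, sigma=2.0):
    """Reference: evaluate f(P^i z) = sum_j alpha_j k(P^i z, P^j x) one by one."""
    n = len(x)
    return np.array([
        sum(alpha[j] * gaussian_kernel(np.roll(z, i), np.roll(x, j), sigma) for j in range(n))
        for i in range(n)
    ])

n = 32
rng = np.random.default_rng(0)
x = rng.standard_normal(n)                       # base training sample
d = np.minimum(np.arange(n), n - np.arange(n))   # cyclic distance to shift 0
y = np.exp(-0.5 * (d / 2.0) ** 2)                # target peaked at zero shift
alpha = train(x, y)
z = np.roll(x, 3)                                # the target moved by 3 elements
response = detect_fourier(alpha, x, z)
print(int(np.argmax(response)))
```

The peak index i is the cyclic shift of z that best re-aligns it with the trained template, so a target displaced by 3 elements yields a peak at n − 3.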
In the target tracking stage, the tracking scene changes constantly, and the appearance of the target may change because of deformation, scale change, and rotation. To adapt to these changes in appearance, the algorithm needs to update the target model. For the kernel correlation filter tracking algorithm, the update formula at the t-th frame is as follows:
In the formula, $x_t$ and $\alpha_t$ respectively represent the cumulatively learned target appearance and filter template at the t-th frame, $x_{t-1}$ and $\alpha_{t-1}$ represent the same quantities at the (t - 1)-th frame, and x and α respectively represent the target appearance and filter template calculated in the current frame.
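A minimal sketch of this update follows, assuming the linear-interpolation form commonly used with kernel correlation filters, $\alpha_t = (1-\eta)\,\alpha_{t-1} + \eta\,\alpha$ and $x_t = (1-\eta)\,x_{t-1} + \eta\,x$. The symbol η for the update rate, its default value, and the function name are assumptions made for illustration.

```python
import numpy as np

def update_model(alpha_prev, x_prev, alpha_new, x_new, eta=0.02):
    """Blend the previous model with the current-frame estimate.

    eta is the update (learning) rate; 0.02 is an assumed illustrative value.
    """
    alpha_t = (1.0 - eta) * alpha_prev + eta * alpha_new
    x_t = (1.0 - eta) * x_prev + eta * x_new
    return alpha_t, x_t

alpha_prev, x_prev = np.array([1.0, 1.0]), np.array([0.0, 2.0])
alpha_new, x_new = np.array([3.0, 5.0]), np.array([4.0, 2.0])
print(update_model(alpha_prev, x_prev, alpha_new, x_new, eta=0.5))
```

A small η makes the model change slowly and resist brief occlusions, while a large η follows appearance changes quickly but risks drifting; an adaptive update rate adjusts this trade-off frame by frame.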
This experiment mainly explores whether recognizing vocabulary at different rhythms affects the reaction time and accuracy of vocabulary recognition under experimental conditions in which the words are similar in shape but differ in sound. The study adopted a 3×2 mixed experimental design, and the experimental flowchart is shown in Fig. 4. The mean response time and correct rate of the system under the conditions with similar characters and different voices are shown in Table 1. The main effect analysis results for system reaction time under the same conditions are shown in Table 2.

Experimental flowchart.
The average number of response time and correct rate of the system in the experimental conditions with similar characters and different voices
Statistical table of main effect analysis results of system reaction time in the experimental conditions with similar characters and different voices
The obtained data are screened, and all data from subjects whose correct rate is below 50% are deleted. A very low correct rate suggests either that the subject did not take the experiment seriously or that the subject was unfamiliar with the vocabulary. Data of the former kind are meaningless, and data of the latter kind deviate from the purpose of this study, which concerns word recognition, and would turn it into an investigation of vocabulary memory instead. Therefore, we delete these data and further screen the data of subjects whose correct rate exceeds 50%. Moreover, this paper deletes all data with a response time of less than 500 ms. The reason is that the vocabulary recognition task in this study requires recognizing the meaning of two words and judging whether they are related, and the average response time is more than 1400 ms. We therefore believe that a response time under 500 ms probably reflects guessing, that is, the subject does not really recognize the vocabulary and only answers randomly. If a subject has more than 5 responses under 500 ms, we consider that the subject may have answered randomly throughout the experiment, and that subject's data are deleted as a whole. The screened data are then submitted to analysis of variance, and the results are as follows:
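The screening rules described above can be written down as a short function. This sketch uses only the thresholds stated in the text (50% correct rate, 500 ms, more than 5 fast responses); the data layout, field names, and function name are hypothetical, and treating a correct rate of exactly 50% as acceptable is an assumption.

```python
def screen_subjects(trials_by_subject, min_accuracy=0.5, min_rt_ms=500, max_fast_trials=5):
    """Apply the three screening rules and return the retained trials per subject.

    trials_by_subject maps a subject id to a list of trials, each a dict with
    'rt_ms' (response time in milliseconds) and 'correct' (bool).
    """
    kept = {}
    for subject, trials in trials_by_subject.items():
        if not trials:
            continue
        accuracy = sum(1 for t in trials if t["correct"]) / len(trials)
        fast = sum(1 for t in trials if t["rt_ms"] < min_rt_ms)
        # Rule 1: drop subjects whose correct rate is below 50%.
        # Rule 3: drop subjects with more than 5 responses under 500 ms.
        if accuracy < min_accuracy or fast > max_fast_trials:
            continue
        # Rule 2: drop the individual responses under 500 ms.
        kept[subject] = [t for t in trials if t["rt_ms"] >= min_rt_ms]
    return kept

example = {
    "s1": [{"rt_ms": 1400, "correct": True}] * 8 + [{"rt_ms": 300, "correct": True}] * 2,
    "s2": [{"rt_ms": 1500, "correct": False}] * 6 + [{"rt_ms": 1500, "correct": True}] * 4,
    "s3": [{"rt_ms": 200, "correct": True}] * 6 + [{"rt_ms": 1300, "correct": True}] * 4,
}
print({s: len(t) for s, t in screen_subjects(example).items()})
```

In this example only the first subject is retained, with the two fast responses removed; the second fails the correct-rate rule and the third exceeds the limit on fast responses.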
We will draw a statistical diagram, as shown in Figs. 5 and 6.

The reaction time of the system on vocabulary detection in the experimental conditions with similar characters and different voices.

The correct rate of the system on vocabulary detection in the experimental conditions with similar characters and different voices.
The main effect analysis results of the system response time in the experimental conditions with similar characters and different voices are shown in Table 2 and Fig. 7.

The main effect analysis results of the reaction time under different conditions of similar characters.
We take the reaction time as the dependent variable and analyze the main effects of game rhythm and word pair correlation. The results show that game rhythm has a significant effect on reaction time (F = 74.46, p < 0.05), word pair correlation has no significant effect on reaction time (F = 0.41, p > 0.05), and their interaction has no significant effect on reaction time (F = 0.12, p > 0.05). Table 3 and Fig. 8 show the results of the variance analysis of the correct rate under the conditions with similar characters and different voices. The recognition results for high-definition English words are shown in Table 4.
Statistics table of the results of the variance analysis of the correct rate under different conditions of similar glyphs

Statistical diagram of the results of the variance analysis of the correct rate in the experimental conditions with similar characters and different voices.
Statistics table of the recognition results of high-definition English words
This paper takes the correct rate as the dependent variable and analyzes the main effects of game rhythm and word pair correlation. The results show that game rhythm has a significant effect on the correct rate (F = 43.91, p < 0.01), word pair correlation has no significant effect on the correct rate (F = 3.08, p > 0.05), and their interaction has no significant effect on the correct rate (F = 1.71, p > 0.05).
On the basis of the above analysis, this paper conducts a data simulation analysis. First, it analyzes English vocabulary recognition by running the system constructed in this paper on 72 sets of high-clarity English words. Because English test characters displayed in different fonts are difficult to recognize, recognition results obtained on this basis have a certain reliability. The recognition results are shown in Table 4 and Fig. 9.

Statistical diagram of the recognition results of English words with higher clarity.
The above recognition results show that the model constructed in this paper performs well in the recognition of English vocabulary, so it can be applied to the recognition of clear English vocabulary without being restricted by fonts and writing styles. Next, this paper recognizes blurred English text, such as text on buildings and in old books. The results are shown in Table 5 and Fig. 10.
Statistics table of the recognition results of fuzzy English words

Statistical diagram of the recognition results of fuzzy English words.
Through the above analysis, it can be seen that the method proposed in this paper performs well in English vocabulary recognition and can be applied in practice.
This paper explores English vocabulary recognition when vocabulary is presented in different visual forms. Starting from the shortcomings of the original kernel correlation filtering algorithm, it proposes a multi-feature fusion adaptive kernel correlation filter tracking algorithm and verifies it through experiments. Building on the KCF algorithm, the paper improves it in three respects: feature fusion, adaptive change of the update rate, and scale detection. Finally, the paper applies the algorithm to an English vocabulary recognition system platform and optimizes it to demonstrate the practicality of the algorithm. The experiment mainly explores whether recognizing vocabulary at different rhythms affects the reaction time and accuracy of subjects' second-language vocabulary recognition under experimental conditions with similar characters and different voices. The results show that the model constructed in this paper performs well in the recognition of English vocabulary and can be applied without being restricted by fonts and writing styles.
