Abstract
Corpus software offers functions such as keyword retrieval, context co-occurrence, word list generation and word frequency statistics. It can quickly and accurately provide corpus data and information such as word-formation, collocation, context and word frequency. In this paper, the author analyzes the application of deep learning and target visual detection in online English vocabulary teaching. Deep learning is a class of machine learning algorithms that use multi-layer non-linear mappings to obtain high-level abstract representations of data. By extracting features from information, the identifiable components in an image can be extracted. The results show that applying a corpus in College English vocabulary teaching can promote students’ autonomous use of the corpus in English vocabulary learning. In the simulation experiment, system performance is improved through parameter selection, and the classification accuracy exceeds 90%. A corpus enables students to learn real, natural language and master natural collocations. At the same time, it helps students understand the semantic and pragmatic norms of words in communication and recognize the characteristics of register variants. Future research can use MapReduce technology to accelerate training, save training time and test more hyperparameters.
Introduction
The information management of educational resources has gradually moved from traditional paper-based management to automated management. The emergence of Web education resources will effectively improve the management level and service capabilities of educational resources [1]. According to the 12th Five-Year Plan of the Ministry of Education, many educational resources will be integrated into an overall education cloud platform, and the management of educational resources will become truly informatized, networked and paperless.
Vocabulary is the most basic material of language, so vocabulary teaching is one of the key points of English teaching. Corpora have attracted the attention of language teaching circles in recent years because of their auxiliary learning functions, such as vocabulary retrieval, synonym comparison, phrase collocation and register query, which can help students grasp a large English vocabulary quickly and accurately. This paper explores the practical implementation of online corpora in College English vocabulary teaching, with a view to contributing to college English vocabulary teaching [2]. A corpus is a large-scale, computer-based electronic text database that collects a large amount of real language data by random sampling according to certain linguistic principles. It stores language materials that actually occur in language use, and it is dynamic: material is added or deleted as the language develops. The auxiliary learning functions of online corpora include vocabulary retrieval, synonym comparison, phrase collocation, sentence structure, word frequency query and register query. By searching for the usage of words in real contexts, students can grasp the multiple meanings of a word and avoid misusing it. J. Sinclair (1999) once pointed out that a corpus presents only real language examples, which can easily mislead learners; a corpus can provide the real context of word use and variants of language use, reflecting the change and development of language. In this way, learners can avoid boring and blind learning by working from real discourse. In the past, grammar and vocabulary were treated as unconnected, but online corpora build a bridge between them, helping learners identify synonyms and familiarize themselves with the structure of words.
In autonomous learning, we can choose appropriate words to express our thoughts for specific syntactic structures according to communicative needs, so as to achieve the goal of learning and applying [3].
Research on and applications of corpus-based English vocabulary teaching have yielded some achievements, but most existing studies focus on theoretical and technical aspects [14]. There are fewer studies on the application of corpus-based English vocabulary teaching in the classroom, and fewer empirical studies on how to integrate corpora into the English classroom. This leaves room for research on English vocabulary acquisition in a corpus-driven online environment. Based on the theory of mobile agents, this paper also studies methods for discovering Web education resources.
The main contribution of this paper is analyzing the application of deep learning and target visual detection in English vocabulary online teaching. Deep learning is a kind of machine learning algorithm which includes multi-layer non-linear mapping and tries to obtain high-level abstract representation of data. By extracting features from information, the identifiable components in the image can be extracted. From the survey of the effect of using corpus in vocabulary teaching, we can see that the effect of using corpus in vocabulary teaching is generally positive in terms of students’ learning attitude, degree of involvement, vocabulary learning effect, as well as the cultivation of autonomous learning ability and habits.
This paper is organized as follows. Related work is introduced in Section II. Semantic similarity filtering and the unsupervised machine learning method are presented in Section III. The educational resource discovery model is presented in Section IV. The experimental analysis of the algorithm is given in Section V. Corpus vocabulary teaching is discussed in Section VI. Finally, conclusions are given in Section VII.
Related work
Big data is beginning to enter people’s lives. In the education industry, big data activities are also becoming more frequent [5]. Big data is no longer a distant and unfamiliar concept; it has four “V” characteristics, namely large quantity (Volume), wide variety (Variety), immeasurable potential (Value), and fast, dynamic flow (Velocity) [6]. The processing of education big data applies data mining technology to the education industry in order to mine data for analysis and processing. Educational data mining is the process of extracting useful information from data in the education system using data mining technology, which can help students and teaching workers [7].
The EDM community website defines educational data mining as follows [8]: educational data mining emerges from data mining and focuses on the processing and analysis of specific types of data in the education system, so that their role in and influence on education can be studied, analyzed and understood in depth [9]. With the rapid development of the Internet, the education industry has also entered the informatization stage, and various software systems have entered all aspects of the education field, so a large amount of educational data has been accumulated. Educational data mining focuses on how to use these data to extract potential and usable information, help teaching workers analyze problems in the teaching process, and help students analyze problems in the learning process [10]. Bienkowski et al. describe the goals of educational data mining as follows: creating learning models to predict student learning behavior, where the model contains various information about the student, such as knowledge, attitude and motivation; discovering models of students’ learning content and finding the best teaching sequence; studying the effects of different types of assisted learning software; and integrating student models, learning models and assisted instructional software models to improve student learning efficiency [11].
Educational data can be used to predict student dropouts, academic performance, graduate employment and so on. With the prediction results, different learning guidance can be provided for students in different situations, and the relationship between students and the curriculum can be found, so that the curriculum becomes more reasonable and personalized, which helps to stimulate students’ enthusiasm for learning and promote personal growth [12]. Educational data mining differs from general data mining: it requires not only knowledge of data mining but also knowledge of education and psychology [13, 14].
The emergence of the Internet has provided new opportunities for traditional foreign language teaching and is an important direction of quality education. In recent years, the rapid development of modern information technology based on computer multimedia and network technology has had a tremendous impact on the traditional teaching mode, bringing not only a revolution in the field of educational technology but also a revolution in teaching ideas and educational thought [15]. Web-based autonomous learning means that learners actively use and regulate their meta-cognition, motivation and behavior to learn online courses using the various resources of the network. Enabling students to become independent, autonomous and effective learners is the fundamental goal of education. The introduction of network multimedia teaching will undoubtedly have a positive impact on broadening students’ horizons, enriching language learning materials, drawing on and absorbing excellent cultural corpora and improving students’ comprehensive quality [16, 17].
Web-based learning has incomparable advantages over traditional teaching. In today’s foreign language teaching at all levels, there are many attempts at web-based teaching. Teachers and researchers are gradually exploring the rules behind these attempts, trying to find the best fit for web-based teaching. Online English reading differs from paper-based reading in several respects [18, 19]. Compared with other sources, the network has abundant resources and a larger information capacity, which better satisfies readers’ choice of information; the content is detailed and can provide in-depth elaboration; the content is new and mostly concerns hot topics, which sustains readers’ interest in learning; and there are many links and illustrations, which make access more convenient. As in English vocabulary software, an online word-picking function can also be widely used. In addition, it is very convenient to search relevant Chinese web pages.
The analysis of synonyms in corpus linguistics mainly examines the differences in frequency distribution of synonyms in different register, counts the significant degree of co-occurrence of collocations and keywords, observes the collocation features of synonyms in retrieval lines, and reveals the linguistic features of different types of collocation, collocation relations and semantic prosody. The corpus index can provide abundant usages and contexts for synonyms, which enables researchers to compare and grasp the subtle semantic and pragmatic differences between synonyms and describe them objectively and comprehensively.
Therefore, starting from the measurable and combinable nature of Web education resources, we establish a universal evaluation system for the credibility measurement of educational resources according to the diversified needs of educational subjects, and improve the accuracy of the acquisition of educational resources. At the same time, we design an efficient education resource ranking algorithm to adapt to the dynamic characteristics of the heterogeneous network environment, improve the ability to actively adapt to the environment, and provide credible support and technical support for educational resource users.
Theoretical analysis
Semantic similarity filtering
The similarity calculation of resource semantic is to calculate the degree of similarity between the intrinsic meanings of two resources. It has been widely used in information integration [20], information recommendation and filtering [21], data mining and other fields, and has become a hot spot in information technology research today. Moreover, there are many different methods for calculating similarity, such as cosine formula, Pearson correlation coefficient, and conditional probability [22]. The following related definitions are given for the semantic similarity calculation of Web education resources.
Let $E = \{e_1, e_2, \ldots, e_i, \ldots, e_n\}$ be a set of Web education resources of different types and versions, taken as the set of resource screening objects.
After feature extraction of the Web education resources, the feature item set $F = \{f_1, f_2, \ldots, f_i, \ldots, f_n\}$ is obtained. The resource vector of $e_i$ can be expressed as $V_i = \{w_1, w_2, \ldots, w_i, \ldots, w_n\}$, where $w_i$ is the weight of feature item $f_i$. The weight is computed from word frequencies, where $f(f_i, e_{th})$ is the word frequency of feature item $f_i$ in the name or title $e_{th}$ of the Web education resource.
For two resources $e_1$ and $e_2$ in the Web education resource set, the resource similarity is given by the cosine formula
$$\mathrm{sim}(e_1, e_2) = \frac{\sum_{i=1}^{n} w_{1i}\, w_{2i}}{\sqrt{\sum_{i=1}^{n} w_{1i}^{2}}\,\sqrt{\sum_{i=1}^{n} w_{2i}^{2}}}$$
where $w_{1i}$ is the weight of feature item $f_i$ in the Web education resource $e_1$, and $w_{2i}$ is the weight of feature item $f_i$ in the Web education resource $e_2$. Since the weights are non-negative, $\mathrm{sim}(e_1, e_2) \in [0, 1]$, and if $e_1 = e_2$, then $\mathrm{sim}(e_1, e_2) = 1$. The level of semantic similarity between Web educational resources represents the degree of similarity between the two resources.
Firstly, a semantic similarity-based filtering algorithm, HA_SA, is adopted, which mainly considers two aspects: (1) how to calculate the weight of each feature item in a resource from the feature vector extracted from that resource; (2) how to calculate the semantic similarity between two Web educational resources in order to filter resources. The heterogeneity and sheer volume of Web education resources make the problem more complicated.
The algorithm uses the vector space model to filter the Web education resources. It quantizes each resource into a set of feature vectors, separately counts the word frequency of each feature item in the resource name (title) and in the text, and calculates the weight of the feature vector. Finally, the similarity between two Web education resources is calculated according to the cosine formula. The execution flow and pseudo code of the algorithm are shown in Fig. 1. If the similarity of a resource is lower than a preset threshold, the resource is filtered out of the resource class.

Fig. 1. Algorithm flow of text filtering.
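The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the HA_SA implementation: the weighting scheme (raw term frequency, with title occurrences counted `title_boost` times), the whitespace tokenizer, the use of a single reference resource, and the threshold value are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def feature_weights(title, text, title_boost=2.0):
    """Term-frequency weights; title hits count title_boost times (assumed scheme)."""
    weights = Counter()
    for term in title.lower().split():
        weights[term] += title_boost
    for term in text.lower().split():
        weights[term] += 1.0
    return weights


def cosine_similarity(w1, w2):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm1 * norm2)


def filter_resources(reference, candidates, threshold=0.3):
    """Keep the (title, text) candidates whose similarity reaches the threshold."""
    ref_w = feature_weights(*reference)
    kept = []
    for title, text in candidates:
        if cosine_similarity(ref_w, feature_weights(title, text)) >= threshold:
            kept.append((title, text))
    return kept
```

Because the weights are non-negative, the similarity stays in [0, 1], so a single threshold in that range is enough to decide which resources remain in a class.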
Reinforcement learning, as an unsupervised machine learning method, is widely used to model the behavior of animals and humans. In this paper, Web education resource users are used as learning subjects, and Web education resource environment is used as a learning environment to filter educational resources through continuous learning feedback [23–25]. The principle of reinforcement learning is that the Agent performs some action on the environment, changes the state of the environment and obtains the reward signal given by the environment to strengthen the mapping relationship between a certain state and the optimal action strategy. By repeatedly performing this process, the Agent can obtain the ability to give an optimal action strategy in any environment.
A single-agent MDP consists of four elements (S, A, P, R), where S is the state set of the agent, A is the action set of the agent, P is the state transfer function, and R is the reward function [26].
The learning process of each agent is described as a quadruple (A, R, N, λ). A represents the action, R represents the reward value, N represents the degree of satisfaction, and the threshold λ represents the degree of learning that the process needs to achieve. During the learning process of the user agent, reward values are continuously returned, and the filtering mechanism of the resource is determined according to the size of the reward value. The traditional Q learning algorithm is used in this paper for agent learning.
The Q learning algorithm estimates a value function (the Q value) over state-action pairs $(s, a)$. In the Q learning process, the Agent learns according to the Q function and does not need to wait for the task to complete. The update formula is as follows:
$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$$
Among them, $\alpha$ is the learning step size, $\gamma$ is the discount factor, $\alpha, \gamma \in [0, 1]$, $r$ is the reward value returned by the user, and $s'$ is the state reached after taking action $a$ in state $s$.
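As an illustration of the update rule, the following sketch applies one step of standard tabular Q learning. The state names, the action names and the parameter values are hypothetical and chosen only for the example; the paper does not specify them.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q(state, action)."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q_table = defaultdict(float)   # unseen (state, action) pairs start at 0
actions = ["keep", "filter"]   # hypothetical action names
q_update(q_table, "s0", "keep", reward=1.0, next_state="s1", actions=actions)
```

Each call uses only the current transition, which is why the agent can learn online without waiting for the task to complete.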
The model is divided into two layers, which are progressively refined from bottom to top. The first level of screening is performed through the semantic similarity calculation of educational resources, which yields multiple sub-categories of Web education resources. Within each sub-category, the agent’s learning ability is used to respond to the user’s needs in real time, so as to re-screen the Web education resource sub-category. Reducing the number of candidate educational resources in this way better meets the needs of users.
This algorithm uses the Q learning method of the reinforcement learning model to filter Web education resources. The user of the Web education resource acts as an Agent and, starting from the initialized Q values of the state-action pairs, continuously learns appropriate actions until the desired degree of learning is reached. The execution flow and pseudo code of the algorithm are shown in Fig. 2 and Algorithm 2. Among them, $\lambda$ indicates the degree of learning that the user agent is expected to reach, and $N_i$ indicates the degree of satisfaction. If the degree of satisfaction is lower than the degree of learning, the state transition is performed according to the Boltzmann function, and the resources selected by the user agent in this action are filtered out.
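The Boltzmann-based state transition mentioned above is commonly realized as softmax action selection over Q values. The sketch below shows that common form; the temperature parameter and the function names are assumptions for illustration, since the paper does not give the exact expression it uses.

```python
import math
import random


def boltzmann_probabilities(q_values, temperature=1.0):
    """Softmax over Q values; a higher temperature gives more exploration."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def boltzmann_choice(actions, q_values, temperature=1.0, rng=random):
    """Sample one action with probability proportional to exp(Q / temperature)."""
    probs = boltzmann_probabilities(q_values, temperature)
    return rng.choices(actions, weights=probs, k=1)[0]
```

With a low temperature the agent almost always picks the action with the highest Q value, while a high temperature makes the choice close to uniform.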
Rank model based on fuzzy set
Given a domain $U$ and a mapping $u_A: U \to [0, 1]$, let $A = \{u_A(u) \mid u \in U\}$; then $A$ is called a fuzzy set on the domain $U$. The function $u_A$ is the membership function of the fuzzy set $A$ on the universe $U$, and $u_A(u)$ is the membership of $u$ in the fuzzy set $A$. The membership function is defined in terms of the following quantities: $u_i$ is a word in the set of words representing the educational resources or the content to be queried, $n$ is the number of $u_i$, $N(u_i)$ is the number of occurrences of $u_i$ in the educational resources, and the value of $u_A(u_i)$ is the membership of $u_i$ in the fuzzy set $A$. Euclid ambiguity determines the content of the query by membership and can represent the degree of fuzzy relevance of educational resources. Euclid ambiguity is defined as:
The Rank algorithm based on fuzzy sets can be expressed as:
Simple aggregation (RSS) efficiently aggregates the educational resources of related sites according to the needs of users and “pushes” the predetermined information (title, abstract, content) to the user’s desktop. Since RSS technology is driven by the needs of the user, custom user tags are required. The RSS tag is defined as:
Among them, $Tag_i$ is the $i$-th key attribute describing Web education resources, and $C[Tag_i]$ is the user-defined attention to $Tag_i$, with $0 \le C[Tag_i] \le 1$.
RSS technology needs to calculate the proportion of educational resource tags based on user-defined tags. Assuming that the educational resources are $ER = \{er_1, er_2, \ldots, er_n\}$, and the number of occurrences of $Tag_i$ in resource $er_i$ is $N_{er_i}[Tag_i]$, the weight of the RSS tag $Tag_i$ in the resource can be expressed as
After that, according to the user’s custom label weight, the user’s most satisfactory educational resources are calculated and pushed to the user. Moreover, user satisfaction is defined as
Among them, the size of $W_{er_i}$ reflects the level of user satisfaction with Web education resources. Then, the Rank algorithm of RSS technology can be expressed as:
The main steps of the above algorithm are as follows. First, the word set $U = \{u_1, u_2, \ldots, u_t\}$ that can represent a resource is extracted from the educational resource set $ER$. Then, the membership degree of each search term is calculated according to the membership function. At the same time, based on the query content and its proportion, Euclid ambiguity is used to calculate the degree to which the query content represents the educational resource. Next, according to the RSS tags and tag weights defined by the current user, the degree of association between the tags and each retrieved educational resource is calculated. Finally, after a comprehensive comparison, the retrieved documents are sorted. The Web education resource Rank algorithm based on fuzzy sets and RSS is divided into two algorithms: Algorithm 1 orders resources through the membership function and Euclid ambiguity of the fuzzy set, and its time complexity is $O(n^2)$. Algorithm 2 uses the principle of RSS technology to push the resources that users are satisfied with, and its time complexity is $O(n)$.
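The steps above can be sketched as follows. The exact formulas are not reproduced in this text, so every formula in the sketch is an assumption: membership is taken as the relative frequency of a term, "Euclid ambiguity" is taken to be the standard quadratic (Euclidean) index of fuzziness, the tag weight is the share of a tag among all tag occurrences in a resource, and satisfaction is the attention-weighted sum of tag weights. The resource and tag names in the usage note are hypothetical.

```python
import math


def membership(counts):
    """Assumed membership: relative frequency of each term in a resource."""
    total = sum(counts.values())
    return {term: (n / total if total else 0.0) for term, n in counts.items()}


def euclidean_fuzziness(memberships):
    """Quadratic index of fuzziness: scaled distance to the nearest crisp set."""
    values = list(memberships.values())
    if not values:
        return 0.0
    squared = sum((m - (1.0 if m >= 0.5 else 0.0)) ** 2 for m in values)
    return 2.0 / math.sqrt(len(values)) * math.sqrt(squared)


def tag_weights(tag_counts):
    """Assumed tag weight: share of each tag among all tag occurrences."""
    total = sum(tag_counts.values())
    return {tag: (n / total if total else 0.0) for tag, n in tag_counts.items()}


def satisfaction(user_attention, tag_counts):
    """Assumed satisfaction: attention-weighted sum of the resource's tag weights."""
    weights = tag_weights(tag_counts)
    return sum(user_attention.get(tag, 0.0) * w for tag, w in weights.items())


def rank_by_satisfaction(user_attention, resources):
    """Return resource names sorted by descending satisfaction score."""
    def score(name):
        return satisfaction(user_attention, resources[name])
    return sorted(resources, key=score, reverse=True)
```

For example, with `user_attention = {"fuzzy sets": 0.9, "RSS": 0.9, "sorting": 0.6}` and two resources whose tag counts are `{"fuzzy sets": 3, "sorting": 1}` and `{"RSS": 1, "labels": 3}`, the first resource is ranked ahead of the second because more of its tags match what the user cares about.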
The experiment is mainly based on the China Knowledge Network (CNKI) dataset, and the ranking results of China Knowledge Network are compared in terms of resource label weight, user satisfaction and ambiguity. First, we searched for “fuzzy sets” in China Knowledge Network, and the results contained a total of 2,937 records. Then, several records were randomly selected as the experimental test data set. Part of the randomly selected data set is shown in Table 1.
Table 1. Data set and its ranking results in CNKI
Experiment 1 Influence analysis experiment of ambiguity.
The content given by the user is “fuzzy set” and “membership function”, and the proportions of the query content are 0.8 and 0.6 respectively. Algorithm 1 is then used to calculate how strongly the query content is fuzzily associated with each educational resource.
The bar graph in Fig. 3 shows the relevance between the content of the user query and the fuzzy representation of each resource. A higher bar indicates that the resource is closer to the educational resource that the user wants to query, and a lower bar indicates a weaker association between the resource and the query content. A value of zero means that the educational resource has no association with the query content.

Fig. 2. Execution flow of the Q learning algorithm.

Fig. 3. The ambiguity of the content of Web education resources queried by users.
Since an end user of educational resources only obtains resources for a specific area in a given period of time, the user-defined labels can be obtained and the proportion of each user label can be given. The proportion of the resource tags in the experimental data can then be calculated, and the educational resources that the user is satisfied with can be gathered. We assume that the user-defined RSS tags are {educational resources, fuzzy sets, RSS, sorting, labels, membership functions, inclusion degrees, 0, 0, 0}, and that their corresponding weights are {1, 0.9, 0.9, 0.6, 0.4, 0.6, 0.7, 0, 0, 0}. The calculated resource label weights of the experimental data are shown in Fig. 4:

Fig. 4. Calculation results of resource label weight.
Figure 4 shows a comparison between the weight of each educational resource and the proportion of user-defined labels. The closer the distance between the weight of each label and the weight of the user-defined label, the more satisfied the resource user is.
Based on the query content in the fuzzy set, the ambiguity with which the query content represents each educational resource and the user satisfaction are compared comprehensively. The user satisfaction calculated from the experimental data is shown in Fig. 5. The dataset sorting results in Fig. 5 indicate that the correlation between user tags and user satisfaction is not strong and needs to be adjusted.

Fig. 5. User satisfaction.
Figure 6 shows the corresponding ranking result obtained by the algorithm of this paper. According to Fig. 6, there is an inconsistency between the ranking results in China Knowledge Network and the ranking of the educational resources that users need. Some educational resources that users need are ranked low in China Knowledge Network. With the algorithm of this paper, however, the ranking of the resources that users need can be improved. Therefore, the algorithm of this paper better matches users’ personalized needs.

Fig. 6. Results of the comprehensive ranking.
A corpus is a large body of original text stored on a computer, or processed text tagged with linguistic information. It is a large collection of language materials, mainly used to observe, analyze and study the characteristics of the target language. The characteristics of a corpus are as follows: firstly, it holds a large amount of information, and some large corpora contain tens of millions or even hundreds of millions of words; secondly, the examples in a corpus are natural and up to date, and reflect trends in language development in a timely way; thirdly, corpus software supports keyword retrieval, context co-occurrence, word list generation, word frequency statistics and so on. It can provide corpus data and information quickly and accurately, such as word-formation, collocation, context and word frequency.
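Two of the corpus functions listed above, word frequency statistics and context co-occurrence (keyword-in-context retrieval), can be illustrated with a short Python sketch. The tokenizer is deliberately simple and the function names are hypothetical; real corpus tools handle tokenization, lemmatization and very large texts far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer for illustration."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text, top_n=None):
    """Word list with frequencies, most frequent first."""
    return Counter(tokenize(text)).most_common(top_n)


def concordance(text, keyword, width=3):
    """Keyword-in-context rows: up to `width` tokens on each side of every hit."""
    tokens = tokenize(text)
    key = keyword.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok == key:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows
```

Aligning the rows returned by `concordance` on the keyword gives the familiar concordance view in which learners can compare how a word behaves across many authentic contexts.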
Vocabulary teaching is one of the earliest and most fruitful areas in which corpus resources and research tools have been used in foreign language teaching. The application of corpora in English vocabulary teaching has the following advantages. Firstly, the use of a corpus in vocabulary teaching conforms to the constructivist view of learning. In corpus-based vocabulary learning, learners do not passively receive information from teachers and textbooks, but construct language knowledge through their own observation, analysis, induction and comprehension of a large amount of authentic corpus material. Secondly, the corpus vocabulary teaching method helps to cultivate students’ autonomous learning ability. Having mastered the tools and methods of corpus use, students can learn independently at any time and place without relying too much on teachers and textbooks. Thirdly, a corpus enables students to learn real and natural language and master natural collocations. Finally, a corpus helps students understand the semantic and pragmatic norms of words in communication and the characteristics of register variants. Corpus vocabulary teaching is shown in Fig. 7, and word frequency statistics are shown in Fig. 8.

Fig. 7. Corpus vocabulary teaching.

Fig. 8. Word frequency statistics.
To enable students to make better use of corpus resources for autonomous learning of English vocabulary, two routine learning tasks are arranged. The first is to use the corpus to find the meanings and usages of key words in the text. The specific steps are as follows: 1) teachers point out the key words of Section A in each unit; 2) students form four groups, each of which is responsible for finding the meanings (including the meaning used in the text and other meanings) and usages (collocation, syntactic structure, etc.) of some of the words, and for extracting typical examples from the corpus; 3) team members exchange in class the meanings and usages of the vocabulary they have looked up; 4) teachers check the learning effect.
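The collocation search that students carry out in this task can be approximated by counting the words that appear near a key word. The sketch below uses raw co-occurrence counts within a fixed window, which is an assumption made for simplicity; corpus tools usually rank collocates with association measures rather than raw counts, and the function name and window size here are hypothetical.

```python
import re
from collections import Counter


def collocates(text, keyword, window=2, top_n=5):
    """Count words occurring within `window` tokens of each keyword occurrence."""
    tokens = re.findall(r"[a-z']+", text.lower())
    key = keyword.lower()
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != key:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Count every neighbour in the window except the keyword position itself.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts.most_common(top_n)
```

A student checking the word "decision" against a set of example sentences would see which verbs most often occur next to it, and could compare that with the verb used in their own writing.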
The second task is for students to use the corpus to correct their own vocabulary errors in writing. When reviewing students’ compositions, the teacher does not correct the errors one by one, but marks the types of errors with different symbols, including wording errors, collocation errors and morphological errors. After class, students use the corpus to search for the keywords (misused words), summarize the correct usage, and correct the errors in the composition. Finally, the teacher checks the effect of the correction and provides necessary help. Most students think that the method of using a corpus to learn vocabulary is effective. 70% of the students agreed or strongly agreed that “I have learned a lot of vocabulary knowledge from other students in the same group”; 65% agreed or strongly agreed that “learning vocabulary through corpus enables me to master vocabulary usage better”; 61% agreed or strongly agreed that “I can correct most of the vocabulary errors in my composition by searching the corpus”; and 57% agreed or strongly agreed that “in general, the method of learning vocabulary with corpus works well”. From these results, we can see that although most students are positive about the effect of the corpus on vocabulary learning, there are still quite a few students who have difficulties and need the help and guidance of teachers.
This paper studies the intelligent acquisition methods of trusted educational resources in a heterogeneous network environment, and provides a specific and effective method for obtaining Web education resources for the majority of educators, thus improving the training level of students.
The application of corpora in College English vocabulary teaching should enable students to understand what a corpus is and its role in vocabulary learning, demonstrate how to use a corpus to learn vocabulary, provide students with corpus tools and resources, and promote students’ autonomous use of corpora for English vocabulary learning. The survey of the effect of using a corpus in vocabulary teaching shows that the effect is generally positive in terms of students’ learning attitude, degree of involvement, vocabulary learning effect, and the cultivation of autonomous learning ability and habits. This further illustrates the feasibility of using corpora in vocabulary teaching. However, there are also some problems in the teaching process. In addition, this report is only an attempt at and investigation of using a corpus in vocabulary teaching over one semester. The methods and means of using corpora need to be further improved.
Acknowledgments
The study was supported by “The Core Teacher Project of Henan Universities of China in 2017: Research on Talent-training Mode of English Plus More under ‘Three Districts and One Cluster’, (Grant No. 2017GGJS194)”.
