Abstract
With the development and application of the Semantic Web, users are no longer satisfied with basic metadata descriptions such as titles and link texts, or with string-matching search results. They expect a resource description to convey the latent semantic information contained in a document, such as its central ideas and topics, its teaching methods, and the relationships between knowledge and concepts. This research starts from the demand for semantic annotation of resources in the construction and sharing of resource libraries, and uses the LDA model to semantically model the document resources in a library and mine the latent topics in each document. Following a “document-topic-keyword” scheme, the semantic description of teaching resources at different levels enriches the metadata attributes and content of resources, adding related topics and corresponding keywords for disciplines, teaching content, teaching methods, and so on, in support of resource retrieval and sharing. The experimental results show that the LDA model can catalogue teaching resources from a macro perspective, and that applying LDA to teaching resources covering the same teaching content can mine their inherent semantic topic features and detailed differences. The final performance analysis verifies the advantages of the LDA model for parallel computing in a big data environment.
Introduction
In the context of educational big data, new resource integration technologies and platforms and new governance mechanisms for teaching resources are needed. These should cover not only resource construction and management standards, sharing rights, and intellectual property rights, but also technologies for resource collection, storage, labeling, and sharing [1]. Managing, sharing, and promoting online teaching resources by automated and intelligent technical means is an important part of the research and practice of knowledge management in educational big data [2]. The development, promotion, and application of metadata standards for teaching resources standardize the practice of teaching resource construction and provide a metadata description framework for the storage and sharing of teaching resources. Web crawler-based resource aggregation and sharing technology uses the page title, keywords, hyperlink text, and other information obtained during web page parsing as annotations of online teaching resources, and provides users with keyword-based search services [3]. With the development and application of Web technology, users are no longer satisfied with basic metadata descriptions such as titles and link texts, or with string-matching search results [4]. They hope that a resource description can convey the latent semantic information in a document, such as its subject ideas, teaching methods, and knowledge and concept relations [5]; examples are the basic method of Chinese teaching, the central idea of a text, and the teaching method for a specific knowledge point. Semantic annotation of the internal text content of resources better reflects the characteristics of the resources themselves, gives a more accurate description, and makes it easier for users to choose and use them.
In addition, online teaching resources are already extremely rich and are growing rapidly, so an automated, intelligent method for labeling and aggregating resources is also needed [6].
Labeling teaching resource information is the most tedious and most important part of constructing teaching resources. The accuracy, completeness, and availability of resource information greatly affect the sharing and utilization of resources. For this reason, the educational resource construction norms and other domestic standards stipulate that the attribute labeling of educational resources includes required attribute data and optional attribute data. The required metadata include representation, title, language, description, and keywords [7]; of these, title, description, and keywords are directly related to resource content and teaching applications. Manual labeling can be precise, but it also suffers from deviations in personal understanding, selective filtering, and a heavy labeling workload. Now that network resources are abundant, automated means of metadata labeling are urgently needed.
With the rise of Semantic Web technology and the growth of the Web and of the number of network resources, semantic analysis of the content of document resources has been sought after by users, and its research and application have become a hot topic. Ma S et al. [8] proposed an open, shared theme resource directory strategy that automatically analyzes and labels the semantics of the title, text, and links of Web pages and then builds a directory index. The theme parameters generated in the semantic labeling are the objects that users search for. Roy S et al. [9] proposed a resource similarity calculation method based on Linked Data Semantic Distance (LDSD) technology, which takes into account the attributes of resources and user satisfaction factors in the calculation. This method has shown better performance than comparable resource similarity calculations in tests on the DBpedia resource library and in a music recommendation system. Semantic Web technology provides technical support for automated metadata annotation of large-scale network teaching resources. Saravanan et al. [10] presented an analytic investigation of resource division and its significance for processor performance and power in multi-core systems. Osamah et al. [11] proposed a modified algorithm for extending the lifetime of wireless sensor networks. Ayman et al. [12] proposed a model based on adaptive intelligent processing for various types of networks. Nieto et al. [13] proposed a machine learning approach to decision making in support of academic performance at institutions.
This research starts from the demand for semantic annotation of resources in the construction and sharing of resource libraries, and uses the LDA model to semantically model the document resources in a library and mine the latent topics in each document. Following a “document-topic-keyword” scheme, the semantic description of teaching resources at different levels enriches the metadata attributes and content of resources, adding related topics and corresponding keywords for disciplines, teaching content, teaching methods, and so on, in support of resource retrieval and sharing.
Semantic modeling of LDA teaching resources
The LDA (Latent Dirichlet Allocation) model is a probabilistic topic model for modeling discrete data sets (such as document sets), as described by Rani M et al. [14]. The model assumes that a document is generated by multiple topics in different proportions and that each topic is a distribution over words, which gives a three-layer “document-topic-keyword” probability model. The resulting “topic-keyword” matrix is dynamically generated semantic metadata, which reveals the internal topics and keywords of resources at the level of semantic content. At the same time, this semantic metadata reveals the knowledge structure relationships between resources, which are key elements for constructing resource ontologies and concept maps [15]. The LDA model performs well in topic modeling of large-scale document resources and is widely applied in document semantic analysis, topic modeling, and resource recommendation. Maier D et al. [16] combined a collaborative resource filtering algorithm with an LDA-based probabilistic topic model to develop a recommendation system that suggests new and old scientific and technological literature that may interest users. Ni Y et al. [17] proposed a topic model with semantic relationship constraints for processing large volumes of product review text; it is better at finding fine-grained feature words, emotion words, and the semantic correlations between product features, from which product feature levels and users’ emotional tendencies are obtained. Zhang N et al. [18] used the LDA method to map text content to the topic space and eliminated spam based on the topic and user characteristics of the text. For the filtered information, they proposed new quantitative analysis and evaluation methods for social networking sites from the perspectives of users, topics, and communities.
In addition, the LDA model is often used for text classification, resource topic modeling, and resource recommendation. Gordon J and Aguilar S [19] combined the probabilistic topic extraction of the traditional LDA model with co-word network analysis and proposed the CA-LDA (LDA Model with Co-Word Analysis) model. CA-LDA adds co-word network analysis to the traditional LDA model, using the topology parameters of the co-word network as weights to control the distribution of vocabulary over topics, so that words with high co-occurrence and high frequency are extracted preferentially. It achieved good results in processing hot-topic literature data in large-scale traffic law research. Gurcan F et al. [20] used the LDA model to model the topics of a document set and generate the latent semantics of each document. At the same time, they aggregated the profiles of users with common social activities and then calculated the users’ tendency of demand for document resources and the similarity between document topics, thereby recommending document resources to the users in a group. The research framework and process based on the LDA model are shown in Fig. 1.

Research framework and process based on LDA model.
In this study, the experimental documents are lesson plan texts extracted from web pages; the document set is denoted D, and d is one document in it. The goal is to find the “small” topics and related keywords hidden in document d through LDA semantic analysis and modeling, and to mark these “small” topics and keywords as semantic metadata of the document. For example, Fig. 2 shows the LDA “document-topic-keyword” three-layer model, in which document d has n “small” topics t1 ∼ tn and topic ti consists of m keywords wi1 ∼ wim. The topics ti and the corresponding keywords wij form a “topic-keyword” two-dimensional matrix obtained by LDA semantic modeling. In the semantic annotation of online teaching resources, the topics ti and keywords wij obtained by semantic modeling can be used as semantic metadata of document d to provide resource retrieval services for teachers and students.

LDA “document-topic-word” three-layer model.
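As an illustration of how the three-layer structure can be turned into semantic metadata, the following is a minimal sketch that assumes the “topic-keyword” matrix and the topic proportions of one document are already available as arrays. The function names `top_keywords` and `annotate`, the toy vocabulary, and all numbers are hypothetical and are not taken from the experiments.

```python
import numpy as np

def top_keywords(topic_word, vocabulary, m):
    """Return the m most probable words of each topic, most probable first."""
    topic_word = np.asarray(topic_word, dtype=float)
    order = np.argsort(-topic_word, axis=1)[:, :m]
    return [[vocabulary[j] for j in row] for row in order]

def annotate(doc_topic_row, topic_keywords, n):
    """Build metadata for one document from its n strongest topics."""
    row = np.asarray(doc_topic_row, dtype=float)
    strongest = np.argsort(-row)[:n]
    return [
        {"topic": int(t), "weight": float(row[t]), "keywords": topic_keywords[int(t)]}
        for t in strongest
    ]

# Hypothetical toy input: 2 topics over a 4-word vocabulary.
vocabulary = ["function", "interval", "grammar", "reading"]
topic_word = [[0.50, 0.30, 0.15, 0.05],
              [0.06, 0.04, 0.50, 0.40]]
doc_topic_row = [0.8, 0.2]

keywords = top_keywords(topic_word, vocabulary, m=2)
metadata = annotate(doc_topic_row, keywords, n=1)
print(metadata)
```

Each metadata entry keeps the topic index, its weight in the document, and the topic keywords, so a retrieval service could match a query against the keywords and rank documents by weight.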
LDA is an unsupervised machine learning technique that can be used to identify topic information hidden in large-scale document sets or corpora. The method treats each document as a word frequency vector, converting text into numerical information that is easy to model. Each word of a document is obtained by “selecting a certain topic with a certain probability, and then selecting a certain word from this topic with a certain probability”. The occurrence probability of each word in a document can therefore be expressed as formula (1):
p(word | document) = Σ_topic p(word | topic) × p(topic | document)    (1)
The “document-word” matrix represents the word frequency of each word in each document, that is, its probability of occurrence; the “topic-word” matrix represents the occurrence probability of each word in each topic; the “document-topic” matrix represents the probability of each topic in each document. Given a series of documents, segmenting the documents and calculating the frequency of each word in each document yields the “document-word” matrix on the left-hand side. Document semantic modeling is the process of training on the left-hand matrix to learn the two matrices on the right-hand side, namely the “topic-word” matrix and the “document-topic” matrix.
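The relationship between the three matrices can be checked numerically. The sketch below assumes row-stochastic matrices (each row sums to 1) and uses hypothetical toy values for 3 documents, 2 topics, and 4 words; the variable names are illustrative only.

```python
import numpy as np

# p(topic | document): one row per document (hypothetical values).
doc_topic = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

# p(word | topic): one row per topic (hypothetical values).
topic_word = np.array([
    [0.4, 0.4, 0.1, 0.1],
    [0.1, 0.1, 0.3, 0.5],
])

# p(word | document) = sum over topics of p(word | topic) * p(topic | document),
# which is a matrix product when written for all documents and words at once.
doc_word = doc_topic @ topic_word

print(doc_word)
print(doc_word.sum(axis=1))  # each row is again a probability distribution
```

Training runs in the opposite direction: only the “document-word” side is observed, and the two factors are estimated from it.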
The learning process is described as follows. For specific coding implementation, please refer to [12] and [13]. For each word position in a document, draw a topic from the topic distribution of the document. Then draw a word from the word distribution corresponding to the topic just drawn. Repeat this process until every word in the document has been traversed.
Through the above document learning process, you can analyze the distribution of topics and words in the document set and any one of the documents.
The LDA model is a document generation model, and its probability graph structure is shown in Fig. 3. The topic distribution θ of each document in the document set follows a Dirichlet distribution with parameter α; in a document with a given θ, the topic z of any word follows a multinomial distribution with parameter θ. The word w_ij is determined by its topic z_ij and the parameter β.

LDA model structure.
The k-dimensional Dirichlet random variable θ is controlled by the distribution parameter α:
Given the parameter α, β the document D distribution probability can be described by the following formula:
On the entire training set, there are:
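In the standard formulation of LDA, the three quantities referred to above (the Dirichlet density of θ, the probability of one document, and the probability of the whole training set) are usually written as follows. The notation is an assumption taken from the common presentation of the model: k is the number of topics, w denotes the N words of one document, and the training set D contains M documents indexed by d.

```latex
% Dirichlet density of the topic proportions
p(\theta \mid \alpha) =
  \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}
  \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}

% Marginal probability of one document
p(\mathbf{w} \mid \alpha, \beta) =
  \int p(\theta \mid \alpha)
  \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right)
  d\theta

% Probability of the entire training set
p(D \mid \alpha, \beta) =
  \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
  \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right)
  d\theta_d
```

The integral over θ has no closed form, which is why approximate inference such as Gibbs sampling is used for parameter estimation.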
The steps of the document generation process under the LDA model are as follows: (1) select the number of words N in the document, where N follows a Poisson distribution; (2) select the topic proportion θ, which follows the Dirichlet distribution with parameter α; (3) for each of the N words, select a topic z_ij from the multinomial distribution with θ as the parameter, and then select a word w_ij at random according to p(w_ij | z_ij, β), the multinomial conditional probability of w_ij given z_ij with β as a parameter.
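The generation process above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: β is taken to be a topic-by-word probability matrix, the Poisson mean is passed in as the hypothetical parameter `mean_length`, and all toy values are invented.

```python
import numpy as np

def generate_document(alpha, beta, mean_length, rng):
    """Sample one document (as word indices) from the LDA generative process.

    alpha: length-K Dirichlet parameter of the topic proportions.
    beta: K x V matrix; row k is the word distribution of topic k.
    mean_length: Poisson mean of the document length (hypothetical name).
    """
    beta = np.asarray(beta, dtype=float)
    num_topics, vocab_size = beta.shape
    n_words = rng.poisson(mean_length)        # document length N
    theta = rng.dirichlet(alpha)              # topic proportions of this document
    # One topic per word position, drawn from the multinomial with parameter theta.
    topics = rng.choice(num_topics, size=n_words, p=theta)
    # One word per position, drawn from the word distribution of its topic.
    words = np.array([rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int)
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.full(2, 0.5)                       # hypothetical symmetric prior, K = 2
beta = [[0.7, 0.2, 0.1, 0.0],                 # hypothetical topic 0
        [0.0, 0.1, 0.2, 0.7]]                 # hypothetical topic 1
theta, topics, words = generate_document(alpha, beta, mean_length=20, rng=rng)
print(theta, topics, words)
```

Inference reverses this procedure: the words are observed, and θ and the topic assignments are the unknowns to be estimated.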
The difference between the LDA model and other topic models can be explained by the geometry of a latent space. All topic models generate documents by controlling the word distribution. Any word distribution can be regarded as a point on a simplex, which we call the word simplex. A topic model finds k points on the word simplex, and these points span a sub-simplex, which we call the topic simplex. The specific model structure is shown in Fig. 3.
The topic model, also known as a layered model, is a three-layer Bayesian network model whose three layers are the document, the topic, and the word. It is a probabilistic model of document generation, in which a document is produced by a random process: before a document is generated, its topic distribution is generated first. Each time a word is to be produced, a topic is first drawn at random from the topic distribution of the document [21], and a word is then drawn at random from the word distribution of that topic and placed into the document; the words produced in this way make up the document. The structure of the LDA model can be seen in Fig. 3.
When dealing with automatic semantic labeling problems, one can concatenate a small number of highly discriminative features of different types, and then either use feature selection to keep the features that discriminate best between classes or apply dimensionality reduction directly to reduce the feature dimension. In the labeling problem, several kinds of features can therefore be preprocessed into one low-dimensional feature, without having to feed multiple features into the model directly.
The traditional method of automatic semantic region annotation directly trains a simple classifier to establish the connection between the visual features of image regions and keywords. When an unknown image is annotated, the trained classifier is used to classify and label the visual features of its regions [22]. This approach ignores the relationships between image regions and cannot distinguish image regions that have similar visual features but different keyword tags; that is, it cannot solve the problem of image ambiguity.
Experimental data
Starting from the need to automatically generate semantic metadata descriptions for online teaching resources, this study performed topic relevance filtering based on the open source Nutch crawler and collected 10,000 documents from the seed website, including exercises, lesson plans, reading materials, education news, and other categories. After preprocessing the web pages, extracting the text, and categorizing the documents, 9442 lesson plans were finally selected as experimental documents, denoted as D.
Experimental environment
This article uses an open source Hadoop cluster with a total of 8 nodes: one NameNode and 7 DataNodes. The hardware is ordinary desktop PCs, each with a dual-core 2.53 GHz processor and 4 GB of memory, running the CentOS 6.5 operating system. The LDA algorithm runs as a distributed Map/Reduce job, with 14 Map tasks and 14 Reduce tasks. Parameter estimation in the experiment uses the Gibbs sampling algorithm from the MCMC family, with the initial number of topics set to K = 50, α = 50 / K, and β = 0.01 [19].
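For orientation, a single-machine collapsed Gibbs sampler for LDA can be sketched as follows. This is an illustrative sketch, not the Map/Reduce implementation used in the experiment: the function name `gibbs_lda`, the toy corpus, and the iteration count are assumptions, and alpha and beta are treated as symmetric Dirichlet priors in the sense of the settings α = 50 / K and β = 0.01.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, num_topics, alpha, beta, iterations, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), num_topics))    # words of doc d assigned to topic k
    topic_word = np.zeros((num_topics, vocab_size))  # times word w is assigned to topic k
    topic_total = np.zeros(num_topics)               # words assigned to topic k overall
    assignments = []

    # Random initial topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = rng.integers(num_topics, size=len(doc))
        assignments.append(z_doc)
        for w, z in zip(doc, z_doc):
            doc_topic[d, z] += 1
            topic_word[z, w] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the current assignment from the counts.
                doc_topic[d, z] -= 1
                topic_word[z, w] -= 1
                topic_total[z] -= 1
                # Conditional distribution of the topic given all other assignments.
                weights = ((doc_topic[d] + alpha) * (topic_word[:, w] + beta)
                           / (topic_total + vocab_size * beta))
                z = rng.choice(num_topics, p=weights / weights.sum())
                assignments[d][i] = z
                doc_topic[d, z] += 1
                topic_word[z, w] += 1
                topic_total[z] += 1

    # Smoothed point estimates of the "document-topic" and "topic-word" matrices.
    theta = (doc_topic + alpha) / (doc_topic + alpha).sum(axis=1, keepdims=True)
    phi = (topic_word + beta) / (topic_word + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Hypothetical toy corpus over a 4-word vocabulary, with K = 2 for brevity.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3]]
num_topics = 2
theta, phi = gibbs_lda(docs, vocab_size=4, num_topics=num_topics,
                       alpha=50 / num_topics, beta=0.01, iterations=50)
print(theta.shape, phi.shape)
```

A distributed version would typically partition the documents across Map tasks and merge the topic-word counts in the Reduce step; that design is not shown here.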
Experimental process
The research experiment is divided into three parts. First, semantic modeling is performed on all documents in the system to compare the effects of LDA on document semantic classification and keyword distribution. Secondly, multiple documents of a certain teaching content are selected for semantic modeling to examine the difference in details of LDA in the semantic modeling of the same teaching content document. Finally, from the perspective of application performance requirements, the performance of LDA under the Map / Reduce parallel computing framework for large-scale document semantic modeling is verified to meet the needs of semantic modeling and labeling of actual applications for online teaching resources.
Directory classification and labeling experiment
In the semantic classification labeling experiment, all 9442 lesson plan documents were used, irrespective of grade and subject. LDA semantic modeling produced 50 topics; the first 4 topics and their keyword distributions are shown in Table 1. In Table 1, Topic1 is a topic related to teaching in all grades and disciplines. Keywords such as “teaching”, “object”, “teaching” and “discussion” reflect the characteristics of instructional design in the lesson plans and the attention paid to it. Topic2 is a topic related to mathematics and function teaching. Keywords such as “relationship”, “linearity”, “change” and “interval” describe the key and difficult points of function teaching and the changes in function teaching. Topic3 concerns English language ability and teaching methods in English language teaching, independent of grade. In Topic4, “naughty” and “hyperactive” are issues that receive particular attention in early childhood education.
LDA classification labeling “theme-keyword” distribution
Examination of the remaining topics, Topic5 to Topic50, shows that each one focuses on a particular subject. For example, Topic5 is a distribution of physics keywords such as “Newton”, “Pressure”, and “Object”, and Topic6 consists of keywords from information technology courses such as “drawing picture”, “PPT”, and “Maker”. Further down the list, the topic characteristics gradually weaken, and so does the representativeness of the keyword distributions. From the perspective of classifying document resources, the feature keywords in the “topic-keyword” distribution obtained by LDA modeling can be used as semantic metadata of document resources and in classification retrieval catalogs on resource sharing platforms. In addition, the common characteristics of teaching resources and the key and difficult points of teaching are marked, which provides users with richer resource retrieval and navigation services.
To further verify the semantic mining effect of the LDA model on the details of teaching resources, the study randomly selected 6 texts, counted the number of teaching plans for each text, and conducted LDA modeling experiments on the texts that had more than 50 teaching plans. Table 2 shows the first 4 topics and corresponding keywords obtained after topic modeling of 300 teaching plan documents. It can be seen that Topic0 describes the parent-child relationship in noun form, and Topic1 describes the teaching goals and requirements and the learning requirements that students should meet. Topic2 is a series of actions and behaviors that express the parent-child relationship in verb form; it mines the features of the genre and description of the text [23, 24]. The small topics and keywords within these lesson plans describe the characteristics of each lesson plan in multiple details, and provide a richer and more precise internal semantic description for resource metadata.
Theme–Word distribution
The “theme-keyword” correspondence matrices obtained from LDA topic modeling of the other texts show similar topic features and keyword distribution characteristics. These keywords capture the teaching requirements, teaching methods, central ideas, and other internal details of the teaching resources, and provide rich semantic information. Both teachers and students can find the resources they need from these detailed “theme-keywords” and grasp the key and difficult points of teaching.
Another advantage of the LDA topic model is its ability to model topics over large-scale document collections. The experiment randomly selected 5,000, 10,000, and 20,000 documents from the teaching plan document library and ran LDA modeling on each data set with different numbers of parallel computing nodes, 3 times per configuration, taking the average running time. The running time on a single node is used as the benchmark, and the speedup ratio is the ratio of this reference time to the time used for LDA modeling with multiple nodes. Table 3 lists the time and speedup ratio of parallel computation on multiple nodes when LDA modeling is performed on corpora of different sizes. Figures 4 and 5 are visual representations of Table 3.
LDA calculation time and speedup for different nodes and number of documents
It can be clearly seen from Figs. 4 and 5 that, for the same number of parallel nodes, a larger corpus takes longer to model with LDA. As the number of parallel computing nodes increases, the modeling time becomes shorter and the gap between the times of the three corpora becomes smaller; that is, the LDA modeling speedup ratio increases with the number of parallel computing nodes. For the same number of computing nodes, the time spent on LDA computation does not grow linearly with the number of documents, and the computational performance is better on larger data sets.
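The speedup calculation can be sketched as follows, assuming the conventional definition in which the averaged single-node time is divided by the averaged multi-node time, so that larger values mean faster runs. The timings below are hypothetical placeholders and are not the values reported in Table 3.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of running times."""
    return sum(values) / len(values)

def speedup(single_node_seconds, multi_node_seconds):
    """Speedup ratio: single-node baseline time divided by multi-node time."""
    if multi_node_seconds <= 0:
        raise ValueError("running time must be positive")
    return single_node_seconds / multi_node_seconds

# Hypothetical running times in seconds, three runs per node count.
runs = {
    1: [1200.0, 1180.0, 1220.0],
    4: [400.0, 410.0, 390.0],
    7: [300.0, 310.0, 290.0],
}

baseline = mean(runs[1])
speedups = {nodes: speedup(baseline, mean(times)) for nodes, times in runs.items()}
print(speedups)
```

Averaging the three runs before taking the ratio keeps a single slow run from distorting the reported speedup.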

LDA calculation time at different nodes and number of documents.

LDA acceleration ratio at different nodes and document counts.
Since the LDA model can handle multiple sets of features, we use the optimal vocabulary capacity obtained in the LDA model-related experiments as the vocabulary capacity of the LDA model. Figure 6 shows a study of the number of iterations and the convergence of the LDA model. Three of the experimental groups use pairwise combinations of the features, and a fourth group uses all three features.

Average correct rate of different location categories.
The first column of Fig. 7 shows how the average correct rate of labeling categories with the LDA model changes with the number of model iterations, and the second column shows how the total labeling accuracy changes with the number of iterations. It can be seen from the figure that when the number of topics k is small (k = 10), the convergence curve is relatively volatile and not smooth, and the labeling effect is poor. As k increases, the convergence curve becomes smooth and the labeling effect improves, but once the saturation number of topics is exceeded, the labeling effect declines again. When color and position features are used, setting the number of topics to 40 gives the best results. The figure also shows that the convergence speed has no obvious relationship with the number of topics, and the model essentially converges within 8 iterations. Convergence is therefore fast, and the time cost of the model is small.

The total accuracy of different positions.
We also compared the labeling effects using different kinds of features. The results are shown in Table 4. The two parameters, the number of topics and the vocabulary capacity, take the corresponding optimal values obtained in the above experiments.
Comparison of different groups of feature labeling effects
The final labeling effect of the LDA model using a single group of features is shown in Fig. 8 for comparison. Each of the four groups of experiments using the LDA model was run 5 times and the results were averaged. It can be seen that the LDA model using two sets of features is significantly better than the LDA model using a single set of features, and that using all three sets of features achieves the best effect.

Comparison of average and overall labeling effects.
We also studied the relationship between the number of feature groups used by the LDA model and the stability of the labeling. Since each group of experiments was conducted 5 times, we can obtain the standard deviation over the 5 runs. The results are shown in Table 5. As can be seen from the right side of Fig. 8, compared with using one set of features, the variation of the accuracy rate is smallest when two sets of features are used, and the variation of both evaluation indicators is also smallest. Therefore, the two-group feature model is more stable.
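The stability comparison amounts to computing the spread of the accuracy over repeated runs. The sketch below assumes the sample standard deviation is the intended measure; the accuracy values are hypothetical placeholders and are not the values reported in Table 5.

```python
from statistics import mean, stdev

# Hypothetical accuracies from 5 repeated runs per configuration.
accuracy_runs = {
    "one feature group": [0.70, 0.74, 0.66, 0.72, 0.68],
    "two feature groups": [0.78, 0.79, 0.77, 0.78, 0.78],
}

# Mean accuracy and sample standard deviation for each configuration.
stability = {name: (mean(runs), stdev(runs)) for name, runs in accuracy_runs.items()}

for name, (avg, spread) in stability.items():
    print(f"{name}: mean={avg:.3f}, std={spread:.4f}")
```

A smaller standard deviation across runs indicates that the labeling result depends less on the random initialization of the model.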
LDA model annotation stability
The development and application of Semantic Web technology provides new ideas and methods for the construction of basic education resources. LDA semantic modeling can mine the hidden topics and keywords in document resources, providing basic metadata for the construction and sharing of resource libraries. Beyond the basic metadata, this semantic metadata provides more reference information for teachers and students who use teaching resources. Only by finding an efficient way to manage and organize massive semantic data can the efficiency of work related to semantic annotation be greatly improved. The ultimate goal of this work is an automatic semantic annotation, retrieval, and management method that is based on semantic concepts and conforms to the human cognitive mechanism. Semantic tagging of educational resources is a critical step in constructing such a semantics-based tagging system. Compared with automatic labeling, manual labeling is inefficient and can only serve as a way to generate a training set for semantic labeling; the information obtained by overall labeling is not as rich as the information obtained by regional labeling.
The experimental results show that the “document-topic-keyword” semantic model obtained by LDA modeling of teaching resource documents can enrich the metadata description of resources at the document semantic level, and that it performs well in a parallel computing environment. The research value of this paper is twofold. First, LDA topic modeling is used to automatically perform semantic analysis and topic modeling on the documents in the resource library, providing technical support for the semantic annotation of the resources in the library. Second, semantic-based topic modeling extends the methods for generating resource description metadata based on Dublin Core and basic education metadata standards, and provides more semantic information about the inside of a resource. The shortcoming of this research is that LDA semantic modeling is still at the stage of experimental research and needs to be tested in practice. The resources involved in the research are teaching plan documents; multimedia resources such as audio, video, and Flash animation still need to be addressed. In addition, because online teaching resources increase every day, further work is needed on timely data processing with the LDA model.
