Abstract
How to better grasp students’ learning preferences amid the rapid development of engineering, science and technology, so as to guide them toward high-quality learning, is an important research topic in educational technology today. To this end, this paper applies the LDA (Latent Dirichlet Allocation) model to mine the text of responses to a survey on students’ self-perception evaluation. The results show that the LDA model can extract terms from text, fuzzily identify groups of students at different levels, present potential logical relationships between the groups, and support further analysis of the learning preferences of students at different levels for IT courses. Based on the students’ learning needs, this paper proposes recommendations for improving students’ learning effectiveness. The LDA-based method used in this paper is a feasible and effective way to assess students’ learning dynamics, as it generates cognitive content about students’ learning and allows for the timely discovery of students’ learning expectations and cutting-edge dynamics.
Introduction
Latent Dirichlet Allocation (LDA), first proposed by Blei et al. [1] in 2003, is a data analysis technique commonly used to study the thematic structure of a collection of texts. LDA utilizes the relationships between words, topics, and texts to solve the problem of semantic mining in text clustering [2]. The algorithm combines inductive and statistical methods, making it particularly suitable for exploratory and descriptive analysis [3]. Visualizing fitted LDA models of educational data is challenging because of their high dimensionality: LDA is typically applied to thousands of documents, which are modeled as mixtures of tens (or hundreds) of topics, which are themselves modeled as distributions over thousands of terms [1]. Nevertheless, as one of the emerging techniques for educational text data mining, LDA topic modeling has great potential for text mining and for finding relationships between text documents.
The LDA topic modeling method can associate words with similar meanings and then perform high-quality automatic clustering of a large amount of text based on the meaning and description of the sentences [4], which also allows it to distinguish polysemous words. In other words, LDA topic modeling in this paper analyzes not only sentences containing keywords or phrases related to the subject of information technology but also the students’ level of learning in that subject. In addition, LDA topic modeling is an unsupervised approach based on fuzzy classification [5]. This enables it to mine and analyze a large amount of text that has not yet been tagged and to obtain the potential themes in the text [6]. It also means that the LDA topic model does not require the researcher to construct a framework for evaluating IT disciplines in advance. Based on this approach, the researcher performs fuzzy calculations and classifications on the textual data to determine the meaning and label of each topic [7], which reflect the underlying perspectives students hold on the learning process of the IT discipline.
Recently, scholars have begun to focus on the practice and visualization of topic models fitted using LDA in the field of education [8, 9]. This work mainly compares educational hotspots, current status, future work and application algorithms. For example, Kim and Im [10] used LDA models to explore changes in research trends in virtual reality-based education, revealing the main topics of this research at different times and suggesting directions for its development. Alrumiah and Al-Shargabi [11], on the other hand, applied Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and manual evaluation to educational video datasets, and found that LDA-based summaries outperformed TF-IDF and LSA-generated summaries on data from educational videos. Many scholars, however, overlook the data about students within the educational system. Therefore, to better mine educational data for students’ potential learning topics, this paper obtains its data through open-ended questions.
The text produced by students answering open-ended questions is a special type of text. Unlike the assessment of open-ended tasks [12] and questionnaires [13], it is a process text of knowledge learning and knowledge expression produced on the premise that students have a certain knowledge structure. The text mainly contains three levels of content: subject knowledge or domain expertise in textual form; the representation or viewpoint formed by the student answering the question on the learning process of domain knowledge; and the cognition and understanding related to the learning process of certain knowledge. Meanwhile, del et al. proposed GradeAid, a framework for automated short-answer scoring, which differs from the LDA model in that it jointly analyzes lexical and semantic features of student answers through regressors [14]. To a certain extent, that approach neglects some of the underlying concerns in the latent data, whereas the main role of LDA is soft clustering, in which each data point can belong to more than one cluster. This soft clustering allows for a fuzzy component of categorization, which has the advantage of uncovering latent content. Therefore, the text of students’ answers to open-ended questions is an important tool and vehicle for collaborative pedagogical research.
The core of LDA topic modeling of open text lies in the objective meaning of the records produced by students in their learning activities. It supports collaboration among teachers and across educational sectors, helps students enhance their professional competence and literacy, and effectively improves the rigor and effectiveness of student learning. Therefore, this paper proposes to mine and analyze students’ open-ended response texts using LDA-based topic models in an attempt to answer several fundamental questions about the fitted topic models: What is the distribution of student groups presented by the terms in each theme? How generalized is each theme? How do these themes relate to each other?
This paper answers these questions using different visual components, some of which are original and some of which are borrowed from previous tools. The aim is to provide a reference for the application of AI technology in the field of education and to contribute to the development of new teaching methodologies, teaching evaluation and testing, and the formation of a two-way interaction mechanism between teachers and students.
This paper is organized as follows. In Section 2, we review some of the work related to the analysis of the thematic model. Section 3 describes the methodology used to analyze the studied cases. In Section 4, the results of the study in this paper are described in detail. Finally, Section 5 details the conclusions of our work and aims to provide discussion points for future work.
Literature review
In this section, we focus on describing and analyzing some studies related to topic modeling and text mining to analyze data about student learning preferences in open-ended questions.
Some researchers have argued that data from open-ended text questionnaires are more difficult to analyze than data from single-option questions because they require human intervention during data processing [15]. Manual intervention refers to the researcher’s division of respondents’ answers into dimensions based on their relevance to important key phrases or words. As the work becomes more complicated, manual intervention becomes harder to carry out, both for the text questionnaire itself and because additional human coders must be brought in, making data analysis more difficult [15, 16].
Some scholars have also expressed doubts about the validity of single-choice survey analysis tools or methods for the following reasons: (1) they assume that respondents interpret the content of survey items in the same way as the developers of the tools; (2) such standardized tools usually reflect the opinion of the expert who developed the tool about the knowledge of a certain dimension, which, being subjective, necessarily biases the research results [17]. In contrast to these standardized analysis methods, open-ended questionnaires allow respondents to express their own understanding of particular knowledge. Thus, open-ended survey documents allow for fuller analysis and assessment of respondents’ viewpoints on a particular piece of knowledge. In addition, when a topic modeling approach is applied to data mining of open-ended texts, different processing methods lead to different conclusions. For example, Ozgelen et al. [18] used the method of Lederman et al. [19] for data analysis, whereas Mesci [20] coded the data according to the classification method proposed by Khishfe and Abd-El-Khalick [21] but found that the results would be limited. Therefore, to reduce the impact of the choice of classification method, this paper adopts the LDA topic modeling approach to process and analyze the data.
Research on data mining for student learning is underdeveloped in the field of educational research. Erkens et al. [22] performed text mining by developing an automated grouping and visualization tool that is geared towards exploring and visualizing students’ cognitive information and can improve collaborative learning in the classroom. Xing et al. [23], to explore the semantics of students’ scientific argumentation patterns, applied an LDA classification method to mine text and interpret students’ self-assessments in a statement-evidence-reasoning framework. Their results show that these uncertainties can be explained by applying this classification. Yang et al. [24] analyzed educational text mining from a macro perspective, proposed an educational text mining workflow, identified the bibliographic information of texts, and used a three-step approach (text source selection, text mining technique application, and educational information discovery) to apply text mining in educational research. However, that study does not give a detailed practical process or evaluation index system, which remains to be developed in educational text mining research.
To date, scholars have conducted a large number of investigations on students’ IT learning. For example, Sood and Saini [25] proposed a topic model analysis of student performance predictions and comments to provide stimulating comments and video suggestions for prospective students and to predict students’ academic performance. Zhao et al. [26] explored deep learning topics based on Stack Overflow to compare the problems faced by users of different deep learning frameworks, analyzed the trend of each topic, and concluded that gradient propagation is the most popular of all deep learning topics for students, whereas object detection is the most difficult. Research has recognized that students’ attitudes toward IT learning play a central role in their understanding of the subject and that improving this understanding is intrinsic to helping them learn it [27]. Topic modeling, by contrast, reflects the whole learning process of students: it starts from the texts and calculates their optimal topic composition. This means that researchers can infer students’ essential view of the subject directly from the classification results of the LDA model, thus avoiding the risk that differences in analysis methods affect the results.
Methodology
The topic modeling approach is an unsupervised machine learning technique that scans a set of documents, detects patterns of words and phrases in them, and automatically clusters groups of words and similar expressions that best characterize a set of documents. The underlying rules of this approach are mined from a given text using data mining techniques and used for association analysis, classification and prediction.
Specifically, once data collection is complete, the data are preprocessed, which includes data selection, data cleaning, data organization and data normalization. Applying LDA after preprocessing greatly improves the quality and efficiency of data mining because it does not require predefined human labels for training, which is why it is known as “unsupervised” machine learning. To some extent, it performs a fuzzy categorization of the data and discovers potential or valuable information in a large amount of data. The steps are shown in Fig. 1.

Fig. 1. Data mining flow chart.
In our practical research study, the answers to the open-ended questions of the student self-assessment were systematically analyzed (Fig. 2). In stage (A) we give the steps for obtaining and testing the research data. The collected data were randomly tested to ensure the validity of the experimental data, which is practically relevant for the achievement of the research purpose. The text mining data source in phase (B) is derived from the raw data processed in (A). In this phase, we use text mining techniques to explore the association and difference between different elements in the text and to show the information that is difficult to find in the text. In addition, this phase involves the application of LDA topic modeling methods.

Fig. 2. Methodology of topic modeling and topic network.
In the third phase (C) of the research methodology, we used a network topic model to construct each text as a network. Each topic represents a network node, and the correlation between nodes represents the strength of the association between them. The model determines the most influential topic terms in a given network based on the feature values of topic term co-occurrence [28]. Data visualization techniques are then used to present the different topic groups for topic classification, with the different types of topic word groups representing the important topics of the text and the relationships between topics, respectively. In the final stage (D), experts in the research field identify and interpret the topics automatically classified by the computer, with the main emphasis on analyzing the correlation between the various topic types.
The main work at this stage was to acquire and examine the text to provide scientific support for the subsequent thematic modeling process. In addition, the text was pre-processed.
Data collection and processing
In this stage, the questionnaire that carries the original data was designed. It covered students’ knowledge of what they had learned in the IT subject, their perception of the subject, and their self-development plans. In addition, the respondents need to be analyzed and the data collection scheme planned in order to obtain a scientific and credible corpus. This provides good data conditions for the later text mining and topic modeling.
Data pre-processing
Data pre-processing is an important part of ensuring the quality of research data, and it is especially important in unstructured text analysis [28, 29]. In this study, the preprocessing of raw data is divided into the following steps: (1) screen the collected data, treating responses that are blank or whose content does not meet the requirements as unusable; (2) eliminate data noise by removing punctuation marks, function words, subjective words, extra spaces, numbers, etc. from all text; (3) convert the text content in the CSV files to UTF-8 encoding so that Chinese characters are recognized.
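The screening and noise-removal steps can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the stop-word list and the function names `clean_response` and `preprocess` are hypothetical, English sample responses are used for readability, and the word segmentation that Chinese text would require is omitted.

```python
import re

# Hypothetical stop-word list for illustration; the study's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and"}

def clean_response(text, stop_words=STOP_WORDS):
    """Lower-case a response, strip punctuation and digits, and drop stop words."""
    text = text.lower()
    # Replace punctuation, digits and underscores with spaces.
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    tokens = text.split()  # split() also collapses extra spaces
    return [t for t in tokens if t not in stop_words]

def preprocess(responses):
    """Clean every response and discard those that end up empty."""
    cleaned = (clean_response(r) for r in responses)
    return [tokens for tokens in cleaned if tokens]

raw = [
    "The grasp of information knowledge is not deep enough.",
    "   ",
    "I hope to learn more network and information technology in 2022!",
]
docs = preprocess(raw)
```

Here the blank second response is screened out, and the remaining two are reduced to lists of content words.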
Text mining and topic modeling
Building a terminology matrix
In Natural Language Processing (NLP) [30], all terms of a text are treated as a bag of words, and the bags of words together form a terminology matrix made up of many terms. The terminology document matrix is one of the components of the corpus and serves as the data input source for the topic model [31]. All the data in the corpus are discrete, so the logical structure of the data need not be considered when they are input; they are mixed for analysis in the process of topic modeling.
Based on the definition of the terminology document matrix, the number of terms in the raw data is relatively large. However, the number of meaningful terms found when mining the text may be much smaller. In this case, we can reduce the dimensionality of the terminology matrix to reduce error without losing the important relationships among the terms.
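As an illustration of the terminology document matrix and of one simple way to reduce its dimensionality, the following sketch counts terms per document and drops terms that occur in too few documents. The function name `build_term_matrix` and the `min_doc_freq` cut-off are assumptions made for this example; the reduction criterion actually used in the study is not specified.

```python
from collections import Counter

def build_term_matrix(docs, min_doc_freq=1):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i.

    Terms appearing in fewer than min_doc_freq documents are dropped, which
    is one simple way to shrink the matrix (an illustrative choice).
    """
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(t for t, df in doc_freq.items() if df >= min_doc_freq)
    index = {term: j for j, term in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            if term in index:
                row[index[term]] = count
        matrix.append(row)
    return vocab, matrix

docs = [
    ["knowledge", "information", "learn"],
    ["information", "technology", "learn", "learn"],
    ["knowledge", "problem"],
]
vocab, matrix = build_term_matrix(docs, min_doc_freq=2)
```

With `min_doc_freq=2`, the terms that occur in only one document are removed, leaving three columns instead of five.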
Determining the number of topics
The number of topics is determined from the term matrix data used by the LDA topic model. Determining the number of topics in the LDA model is a difficult problem; most studies set it by repeated trial or estimate it from the size of the data, which is not well justified. To determine the number of topics on a principled basis, this paper fuzzily determines the optimal number by computing a weighted value of the maximum average distribution probability of the topic words and the average similarity probability of the topic words [32], with a view to mining more latent topic information. The specific process is as follows:
① Parameter settings: d represents a copy of the text, n represents the number of texts, r represents a certain topic, w represents the topic word, and k represents the number of topics. Let $d_n$ be a text in the text set D, as shown in Equation (1), and $r_k$ be a subset of the topic set R, as shown in Equation (2). E represents the maximum average distribution probability of topics and texts, T represents the average similarity between topics, and G represents the weighted topic similarity between topics.
② The number of topics is set to k, and an initial model is obtained. The distribution probability of topics and texts and the distribution probability of words on topics are calculated by the LDA model.
③ The maximum average method is used to obtain the maximum average distribution probability of topics and texts, as shown in Equation (3), from which the E value is derived.
④ The mean topic-to-topic similarity is calculated using the Cosine Similarity Theorem [33]. t represents the topic matrix, j represents the number of words in the text, and m represents the total number of words in the corpus after de-duplication. Equation (4) calculates the similarity between two topics mainly based on the Cosine Similarity Theorem.
The method of calculating the average similarity of a one-dimensional array is adopted in this paper to obtain the similarity between multiple topics [34], as shown in Equation (5), and finally, obtain the T value.
⑤ The E and T values are weighted to form the G value, as shown in Equation (6).
⑥ Adjust the k value and retrain on the text. Repeat from step ② and take as the optimal k the value for which G is largest.
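The selection procedure above can be sketched in Python as follows. Because Equations (3) to (6) are not reproduced here, the exact forms are assumptions: E is taken as the mean over documents of the largest topic probability, T as the mean pairwise cosine similarity between topic-word vectors, and G as a weighted sum that rewards high E and low T. The `weight` parameter, the function names, and the toy matrices are hypothetical, and the document-topic and topic-word matrices are assumed to come from LDA models that have already been fitted for each candidate k.

```python
import math

def max_mean_probability(doc_topic):
    """E (assumed form): mean over documents of the largest topic probability."""
    return sum(max(row) for row in doc_topic) / len(doc_topic)

def cosine(u, v):
    """Cosine similarity between two topic-word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_topic_similarity(topic_word):
    """T: average cosine similarity over all distinct topic pairs."""
    k = len(topic_word)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0
    return sum(cosine(topic_word[i], topic_word[j]) for i, j in pairs) / len(pairs)

def weighted_score(doc_topic, topic_word, weight=0.5):
    """G (assumed form): rewards confident assignments and dissimilar topics."""
    e = max_mean_probability(doc_topic)
    t = mean_topic_similarity(topic_word)
    return weight * e + (1 - weight) * (1 - t)

def best_k(models):
    """models maps k -> (doc_topic, topic_word) from already fitted LDA models."""
    return max(models, key=lambda k: weighted_score(*models[k]))

# Toy matrices standing in for fitted models with k = 2 and k = 3.
models = {
    2: ([[0.9, 0.1], [0.2, 0.8]], [[1, 0, 0], [0, 1, 0]]),
    3: ([[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]], [[1, 1, 0], [1, 1, 0], [0, 0, 1]]),
}
chosen = best_k(models)
```

In this toy case k = 2 is preferred, because its documents are assigned to topics more decisively and its topics do not overlap, whereas two of the three topics in the k = 3 model are identical.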
As a statistical method for grouping texts by topic, LDA is predicated on the assumption that each document is a function of latent topic variables. Because of its excellent grouping ability, it has been widely used in computer science research in recent years, especially in text mining and information retrieval [35]. The basic idea of the algorithm is that each text is formed by a random mixture of latent topics and each topic is a discrete distribution over words, so all the model has to do is define the probability of each word appearing in a particular topic. LDA is therefore considered an unsupervised conceptual modeling method for data. The detailed flow of the algorithm is given in Fig. 3. LDA treats each of the M texts as a collection of N words whose latent topics follow a probability distribution drawn from a Dirichlet prior. Here α is the Dirichlet parameter governing the topics in a text; Z represents the topic assigned to a given word; and W represents the words that appear in the text. In this study, LDA is used to mine the topics that appear in the text.

Fig. 3. Schematic of LDA algorithm.
The LDA modeling methodology assumes that each word belongs to a certain topic and that each text contains at least one topic. Under this premise, the parameter α determines the topic distribution of each document. A higher value of α makes the topic distribution of a document more homogeneous, that is, spread more evenly across topics. Conversely, a lower value of α concentrates each document on a few topics, so the distribution is less homogeneous.
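The effect of α on the homogeneity of the topic distribution can be checked with a small simulation. The sketch below draws per-document topic proportions from a symmetric Dirichlet distribution using only the Python standard library; the values of α, the number of simulated documents, and the function names are illustrative assumptions rather than settings from the study.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_dominant_share(alpha, k=12, n_docs=500, seed=0):
    """Average probability of the most likely topic across simulated documents."""
    rng = random.Random(seed)
    samples = (sample_dirichlet(alpha, k, rng) for _ in range(n_docs))
    return sum(max(theta) for theta in samples) / n_docs

low = mean_dominant_share(alpha=0.1)    # documents concentrate on few topics
high = mean_dominant_share(alpha=10.0)  # documents spread evenly over topics
```

A small α yields documents dominated by one or two topics, so `low` is much larger than `high`, for which the twelve topics receive nearly equal shares.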
Topic network modeling
While assigning words to topics in the third step of stage (B), LDA converts the topic matrix into binary form and thus identifies the topic with which each text has the highest correlation coefficient. This generates a matrix of correlation coefficients between texts and topics, which is presented as a heat map and reflects the relationships between topics [36].
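One possible reading of this step is sketched below: the document-topic matrix is binarized with a threshold, the dominant topic of each document is identified, and topics are linked by the number of documents they share. The threshold of 0.2, the function names, and the co-occurrence weighting are assumptions made for illustration and are not taken from [36].

```python
def binarize(doc_topic, threshold=0.2):
    """Mark a document-topic link as 1 when its probability reaches the threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in doc_topic]

def dominant_topics(doc_topic):
    """Index of the most probable topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]

def topic_cooccurrence(binary):
    """Count, for every topic pair, the documents linked to both topics."""
    k = len(binary[0])
    weights = [[0] * k for _ in range(k)]
    for row in binary:
        active = [j for j, flag in enumerate(row) if flag]
        for a in active:
            for b in active:
                if a != b:
                    weights[a][b] += 1
    return weights

doc_topic = [
    [0.70, 0.25, 0.05],
    [0.10, 0.60, 0.30],
    [0.45, 0.45, 0.10],
]
binary = binarize(doc_topic)
edges = topic_cooccurrence(binary)
```

The resulting `edges` matrix can be read as the weighted adjacency matrix of a topic network: the first two topics share two documents, while the first and third share none.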
Relevance of the identified topic
The purpose of this phase is to analyze the grouped topics, describe the algorithmic grouping, and find a label that reflects the substantive content of the topic set. The expert’s knowledge and understanding of the topic is an important basis for the interpretation of the analyzed topics.
Results of the case study
Before processing the data for the study, a manual tagging process was performed on the corpus. This process focused on general reading, testing, and filtering of students’ open-ended answers to identify the relevant topics that students mentioned in their answers. The most representative markers were identified for each topic in the second stage of the analysis. Table 1 shows the topics and markers identified during the topic selection process.
Topic and term markers identified in the topic modeling analysis
The results obtained by applying the proposed methodology in the case study are detailed below.
In this case study, a text database was constructed during data acquisition, consisting mainly of 350 responses to open-ended questions. The questions were as follows: “Tell me about your shortcomings in the process of learning the IT subject?” and “What do you hope to learn in the IT subject?” These questions oriented students toward their IT learning and self-expectations, which in turn led to more convergent responses. The questions were administered in the Student Learning Assessment Survey from September to December 2022, and the text data from the student responses were initially screened to obtain 162 valid texts, which were stored in a CSV file. To better present student responses for interpretation purposes, we give the following two sample student responses:
“There are some limitations in thinking, not very innovative, there are some technical speed is too slow, too unskilled. I hope to learn more network and information technology, and learn to use knowledge to solve real-life problems”
“The grasp of information knowledge is not deep enough. In the follow-up study to be able to have more knowledge of information technology”
Text mining and topic modeling
Following the manual processing of the corpus, we pre-processed it, i.e., transcoding, setting stop words, and removing meaningless symbols, spaces, etc. The document terminology matrix of the corpus contained 1956 terms and 163 documents. To better identify the important terms in the corpus, we used the term frequency-inverse document frequency (TF-IDF) index for weighted analysis in the LDA model [37]. TF-IDF is a common weighting technique in information retrieval and data mining. After the important terms of the corpus are determined, the probability distribution of the terms over each topic is calculated. Applying the term distribution frequencies to the procedure for determining the number of topics gives an optimal number of topics of K = 12. The number of topics in the LDA model is therefore fixed at this value, and the corresponding term frequencies are output as shown in Table 2.
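For readers unfamiliar with TF-IDF weighting, a minimal sketch is given here. It assumes the basic unsmoothed variant (relative term frequency multiplied by the natural logarithm of N/df); the variant actually used in the study is not stated, and the function name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["knowledge", "improve", "improve"],
    ["knowledge", "operation"],
    ["knowledge", "hope"],
]
weights = tf_idf(docs)
```

A term such as “knowledge”, which occurs in every document, receives a weight of zero under this variant, whereas a term concentrated in one document, such as “improve”, is weighted highly.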
The top 10 most likely terms for each topic (scale values retain three valid numbers)
The analysis of Table 2 shows that the higher the proportion of a term in a topic, the stronger the relationship between the term and that topic. In addition, some terms appear in more than one topic, for example, “improve” (in topics 2, 3, 6, 8, 9, 10, and 12) and “knowledge” (in topics 1, 4, and 5). The reason is that the same terms, phrases, or keywords express different meanings in different contexts [38]. The more frequently such terms appear, the more concentrated the aggregation effect is in the topic cluster, and the closer the distribution of word frequencies is to the binomial distribution.
Another feature of the LDA topic modeling approach is to mark each document as a mixture of topics. That is, each document has a probability of belonging to each topic. This can be interpreted by applying a relational graph in which each document is connected to a certain number of topics.
In the LDA topic modeling of the text documents, the probability of occurrence of each topic in each document was obtained, and a network relating documents and topics was constructed from it (Fig. 4). The figure presents a heat map of the document (row) and topic (column) correlations, where the darker the color, the stronger the correlation coefficient. Taking Fig. 4 and Table 2 together, it can be roughly seen that among all the topics, Topic 1 and Topic 12 have the highest number of occurrences and relevance. For example, the top-ranked terms in Topic 1 are “knowledge”, “ability”, “communication” and “problem”, while the top terms in Topic 12 are “expression” and “problem-solving ability”. These terms are closely related to each other and show a certain hierarchical relationship, i.e., students’ responses must first go through the process of Topic 1 to reach the level of Topic 12. Therefore, the probability of both topics appearing in one document is low, and there must be a difference in level between the documents in which these two topics appear; documents in which the same topic appears come from two or more students with similar levels of development, literacy and competence.

Fig. 4. Heatmap representation of the matrix relating documents (rows) to topics (columns).
In the topic network generation stage, our visual interactive network has two core functions: it provides word segmentation for the text data and presents the model so that researchers can adjust thresholds and thus obtain better visualizations. The network model allows each topic to be analyzed to reveal its most relevant terms. In addition, bilateral network models allow a better description of the relationships between the identified topics [36]. The terms in the projected documents are analyzed with respect to common themes, and the projections link the topics in the documents and provide weights between the topics. For example, Fig. 5 presents the distances between all topics (left) and the top 30 terms most relevant to topic 1 (right); the light blue bars on the right represent the frequency of each term in the whole corpus and the red bars represent the frequency of each term in the selected topic.

Fig. 5. Topic 1 is highlighted. On the left are the relative positions and sizes of the topic groups (K = 12); the size of a topic represents its universality, while the distance between topics reflects the differences between them. On the right are the 30 most relevant words in topic 1.
In Fig. 5 (left), topic 1 is represented by the largest circle and lies close to the other topics, indicating that students’ responses are relatively similar yet span different dimensions. The terms of topics 3 and 4 are at the same level. Topic 5 is relatively distant from the other topics, and its terms have different characteristics from those of the other topics. In contrast, topics 2, 6, 7, 8, 9, 10, 11, and 12 share some prominent terms, such as “lift”, “cheer” and “hope”. Figure 5 (right) shows the terms associated with Topic 1, with the highest association being “knowledge” and “understanding”. Analyzing the terms in Topic 1, we found that students’ responses in this dimension mainly focused on the expression of basic knowledge and skills of IT subjects, and since this topic accounted for a larger proportion of the responses, most of the students’ responses were attributed to it. In contrast, the terms in topics 3 and 4 are more specific to the learning outcomes of IT subjects: terms such as “more”, “mastery”, “achievement” and “work” better indicate the learning demands of students, who have certain requirements for learning IT subjects. The terms in topics 2, 6, 7, 8, and 9 mainly revolve around the expression of insufficient understanding of IT subject knowledge, and the main terms are “operation”, “not enough”, “ability”, “cheer”, and “work”. The terms “cheer up”, “hope” and “future” show how students perceive their own learning: although their mastery of basic knowledge is insufficient, their enthusiasm for learning IT subjects is still high. In response, teachers should try to adopt different teaching methods in real teaching so that, driven by strong interest, the learning effect and enthusiasm of such students can be sustained.
Topics 10, 11, and 12 differ from the other topics in that they use terms such as “life”, “no”, “thinking about problems”, “solving problems”, “self-reliance” and “self-learning skills”. This group of students has different views on the purpose of learning the subject, so the span is relatively large: it covers students’ disapproval of basic knowledge as well as their need for IT subjects to solve real-life problems. Therefore, instructional design and teaching policies in real-life teaching should include IT applications or subject competitions as a way to stimulate the development and transformation of these topics.
After determining the connotations of the terms in each topic, the topics were clustered and analyzed using the LDA modeling method. We depict the relationships between topics in Fig. 6(a) to reveal the relationships that exist between the elements of the IT disciplinary self-assessment system. The algorithm we used in this step is based on marginality, similar to the community structure algorithm [39]; it calculates the number of shortest paths to determine distance, as a way to describe the structure between topics (see Fig. 6(b)).

Fig. 6. Topic network clustering diagram.
Table 2 lists the 12 relevant topics in the case study and the top ten associated terms, which were derived from the corpus analysis by applying the LDA algorithm. To better analyze the content of the topics in Table 2, we grouped these topics to extract information based on the clusters presented in Fig. 6(b). The clustering specifics are as follows.
Topics 2, 5, 6, 8, and 9 refer to the learning enhancement of basic knowledge of IT subjects, which is the most common learning need among students. As can be seen in Fig. 6, these topics are highly prevalent and account for a relatively large share of the overall case study, reflecting the larger number of students in this cluster. The interpretation of the terms and documents in this category reveals that students’ demands for strengthening basic knowledge are generally reflected in teaching exercises, post-class consolidation, learning interactions, and knowledge understanding. As can be seen from the structural diagram in Fig. 6(b), Topic 6 contributes very little to this cluster, which is confirmed in the topic modeling section of the text. To meet the learning demands of students in this subgroup, improve teaching effectiveness and enhance students’ learning satisfaction, teachers should organize after-school topic learning activities, learning interest groups and exercises related to the central knowledge points. Such initiatives can, to a certain extent, strengthen learning communication and meet the above learning demands.
The grouping of topics 3 and 4 corresponds mainly to hands-on learning strategies, i.e., using computers to carry out hands-on operations on the learning content and to consolidate experience. In Fig. 6(a), the relationship between topics 4 and 6 and topics 5, 8, and 9 can be clearly seen. This is because hands-on learning and experiential learning of basic knowledge are closely linked in a logical sequence. This grouping contains strong terms such as “operation”, “application” and “technology”, which indirectly reflect students’ interest in and motivation for practical learning. When teachers implement practical teaching initiatives based on students’ demonstrated interests, teachers and students can communicate better and improve the teaching-learning process, thereby enhancing student learning.
Topic 1 alone represents a separate cluster. This topic is particularly worth emphasizing because it marks the point of difference from average students, whose learning purposes and requirements are limited to mastering basic knowledge and being satisfied with their learning. The students in Topic 1 have the motivation to apply their knowledge to real-world problems and gradually emphasize thinking about problems and applying their knowledge, which leads to a marked increase in their learning thinking and creative abilities. To enhance the learning literacy of these students, teachers should provide opportunities for them to solve real-world problems, such as participating in mathematical modeling competitions, information literacy competitions, and so on. In the process of solving real-world problems, students further consolidate course knowledge and develop their problem-solving literacy.
In the case study, we grouped Topic 7 with topics 10, 11, and 12 because of the close connection between the term “life” in Topic 7 and the three other terms “problem-solving”, “thinking about problems” and “learning skills”, which can be seen in Fig. 6(b). The terms in this grouping have the greatest relevance to real life and correspond to the group of students with the strongest observational and problem-solving skills among all the students surveyed. Charles Gidling once said: “Problem identification is often more important than problem-solving”, and students in this grouping are more capable of independent learning and problem-solving. Because the disciplinary literacy of such students is already higher when they learn the basics, the dimensions involved in teachers’ teaching organization strategies and teaching methods will be different. That is, the instructional content should be interdisciplinary to help such students think about and solve problems from different perspectives [40].
Conclusion
In the field of education, many pedagogical researchers have continuously investigated teachers, students, and teaching processes, primarily to obtain data on teaching and learning and then extract valuable information. However, using open-ended questions to obtain data poses a challenge: organizing and analyzing the collected data takes a lot of work and time, and because the human knowledge structure is evolving all the time, there is a risk that conclusions no longer explain the prevailing level by the time they are drawn. This is the focus of this paper, in which we propose applying a methodology based on topic modeling and text network modeling to open-ended question data. This method gives a better picture of students’ autonomous thoughts and their attitudes toward the researchers, and it collects valuable information. It shortens the time and reduces the workload of data analysis, and it can handle the large amount of textual information generated during open-ended question surveys.
In this paper, the process of the topic modeling approach is analyzed and practiced in detail. It differs from other related studies in adding the choice of the number of topics, designed to enrich the textual analysis of open-ended questions, and in using the topic network structure as a complementary tool for topic modeling. In addition, we give a case study designed to analyze students’ answers to open-ended questions and thus assess their satisfaction and learning preferences in the IT subject area. The study found that the student groups fell mainly into three categories: poor, intermediate and excellent students, whose expressions could be differentiated by the three dimensions of the terms “hope”, “cheer” and “practice”, with a binomial distribution structure formed among the groups. In addition, we found potential student groups, i.e., those in the developmental stage, such as those represented by Topic 1 and Topic 7. At the same time, each topic analyzed through fuzzy classification is persuasive in a certain region, but the generality of the topics could be improved, given the influence of the region of data collection. Finally, the study found through network modeling analysis that these topics show a community distribution relationship (see Fig. 6(b)), and that the topics can also transform into one another, which has a greater impact on individual students’ preferences.
Based on the results of the study, the following recommendations were made to facilitate students’ development:
(1) Adopting instructional assessment based on students’ existing levels
Multidimensional evaluation factors should be designed to increase students’ behavioral willingness to engage in learning. Research has found that students’ effort expectations and achievement goals have a significant impact on behavioral willingness to learn, which in turn requires that students have specific evaluation criteria for learning outcomes and clear learning task drivers. To address this issue, teachers should systematically analyze students’ existing levels in advance, combine students’ learning characteristics with the teaching content, and conduct precise teaching based on learning analytics. In particular, online learning is inseparable from the guidance of teachers, who need to support students in terms of knowledge leadership, emotional communication and information services. Specifically, teachers can use mathematical software to analyze learner characteristics, learning motivation, interpersonal relationships, learning styles, and effort goals based on educational big data, and diagnose and analyze student learning characteristics, so as to adopt different teaching methods for students in different classes, enhance students’ perceived practicality of and identification with online courses, and stimulate students’ willingness to learn actively.
(2) Building an environment for course teacher-student interaction
In the environment of the rapid development of engineering and science and technology, universities should continuously and effectively promote the innovation of teaching methods. Firstly, the university should establish scientific and effective mechanisms covering top-level design, policies and regulations, faculty, and supporting teaching platforms. Secondly, it should build an efficient classroom characterized by teacher leadership, teacher-student interaction and technical support.
(3) Enriching learning resources to stimulate the inner drive to learn
As the main body of learning, students have intrinsic motivation, ability, and literacy that affect their learning behavior, and they need to manage their own learning to achieve better teaching and learning results. The online learning platform in the teaching environment can provide students with more convenient technical conditions and stimulate their interest, which can help them achieve better self-management and continuously adjust their learning methods and progress through summaries and reflections, so as to achieve the expected learning goals. Teachers should provide students with abundant course learning resources to improve the efficiency of students’ access to information, while taking care that the learning tools are not too cumbersome to use; otherwise students’ effort is likely to decrease, reducing their self-efficacy for mobile learning. In addition, teachers can guide students to prepare learning materials and plan their courses through pre-course task distribution; use teaching platform data statistics to observe students’ completion of learning resources and provide guidance for students to adjust their learning paths; and encourage students to actively participate in course discussions, scenario presentations, and other activities by establishing a diversified assessment system, rather than relying solely on final assessments for grades.
In Section D (Relevance of the identified topic), the topics are discussed in groups after the spatial structure between the topics has been determined, and constructive strategies are proposed. From that discussion, it can be seen that the suggestions we came up with can better meet the learning aspirations of students in different groupings and help achieve an adaptive learning environment. Based on the results of the case study in this paper, we suggest that topic classification should be able to classify topics along multiple variable dimensions, such as gender, age, family education level, and major. Taking these influencing factors into account should improve the quality of topic clustering and yield more accurate results.
In addition, we found some limitations in the study. Since the topic modeling approach is mainly based on the commonality of terms appearing in the text, a small amount of raw data causes some terms to appear infrequently in the text. This means that the results of the study were affected by the shortage of data to a certain extent, which can be seen from Topic 6 in Fig. 4. It is also the reason why the number of topics in this paper is set to K = 12, the critical value of the optimal solution for fuzzy computation.
In future work, we anticipate conducting a larger user study to further understand how topic interpretation can be facilitated in a suitable LDA model to improve the generalizability and relevance of topics. We also note the need to visualize correlations between topics, as this can provide insight into what is happening at the document level without the need for ground-truth labeled documents. Finally, we will seek a way to visualize a large number of topics (e.g., 50–100 topics) compactly.
Acknowledgments
We would like to thank No.2 High School of Duyun for supplying the raw survey data necessary for the completion of this study.
This work was supported by the Guizhou Educational Science Planning Project under Grant No. 2023B036 and the Qiannan Prefecture Educational Science Planning Project under Grant Nos. 2023B005, 2023B007, and 2023B044.
