Abstract
With the increase in needs for personalized learning of online students, knowledge tracing (KT), a technique aimed at tracing the state of a student’s knowledge mastery and predicting performance in future exercises, has become a hot topic in personalized learning research. The behavioral features exhibited during students’ learning process bear information that impacts the state of a student’s knowledge mastery. To study the influence of learning behaviors on students’ knowledge mastery state in the learning process, we propose a Precise Modeling of Learning Process based on
Keywords
Introduction
At present, with the growing demand for personalized learning from online students, the mining and analysis of massive learning data has become an urgent task. Acquiring students’ knowledge needs from data [1] and tracing their state of knowledge mastery in real time [2] are crucial to realizing adaptive personalized learning [3]. KT has thus become a research hotspot in educational data mining [4], aiming at dynamically tracing students’ knowledge mastery state and predicting their future learning performance. The core task is to automatically trace changes in students’ knowledge mastery state over time from their historical attempted exercises, in order to accurately predict their performance in forthcoming exercises. This prediction is the basis for choosing an action that meets students’ learning needs, such as recommending learning resources [5], providing tutorials, planning learning paths [6, 7], training weak knowledge points [8], or monitoring students’ knowledge mastery level [9].
Numerous researchers have studied KT techniques, mainly Bayesian Knowledge Tracing (BKT) [10], Deep Knowledge Tracing (DKT) [11], Dynamic Key-Value Memory Networks for Knowledge Tracing (DKVMN) [12], and their variants [13–18]. However, these studies rely mainly on the results of historical attempted exercises and neglect the impact of students’ learning behaviors on their state of knowledge mastery, even though these behaviors also bear on the students’ learning ability. For instance, the length of time between learning sessions may affect how much of a learnt concept a student forgets. Conversely, repeated learning of a knowledge point may strengthen its retention.
Moreover, in educational psychology, many scholars have noticed the memory decline phenomenon in humans due to forgetting. The Ebbinghaus forgetting curve shows [19, 20] that students will forget what they have learned, i.e., memory decline. Moreover, according to the memory trace decline theory [21], the mastery level of students’ original knowledge point affects the magnitude of memory decline. We therefore take the time interval between last learning and the mastery level of existing knowledge points as the factors affecting memory decline. In addition, the presence of an enhancement phenomenon is worth considering. Learning theory emphasizes [22] that if a student learns the same knowledge point repeatedly, the understanding of it will be enhanced. This paper thus presents the time interval between learning identical knowledge points, the number of repetitions of the same knowledge point, and the level of mastery of existing knowledge points as the factors affecting memory enhancement.
Literature research also reveals that some variants of classical KT models have improved predictive performance by incorporating exercise answer behaviors. A case in point is DKT+F [14], which incorporates behavioral features to model a student’s forgetting behavior based on DKT. CF-DKD [16] and LFKT [17] investigate KT models that abstract a student’s learning and forgetting behaviors by incorporating behavioral features based on DKVMN. DKVMN-LA [23] integrates the behavioral features into a model of historical exercises and uses them as input to train DKVMN. However, the above works do not involve research on the contribution and degree of influence of different behavioral features on a student’s knowledge mastery state in the learning process. In this study, we have explored this issue by calculating attention weight for behavioral features using multi-head attention networks.
Furthermore, existing KT studies establish two ways of updating the state of students’ knowledge mastery when considering their historical attempted exercises: global update [12], as is the case with DKT which updates the mastery state of students for all knowledge points, and partial update, evident in DKVMN and LFKT, where the mastery state of knowledge points related to an exercise is updated according to the knowledge relevance weight of the exercise and the knowledge point. In view of this, we have made the following improvements to the update method of knowledge mastery state. In detail, based on the memory rules embodied in educational psychology and learning theory, we assume that students’ learning process can be subdivided into three links: memory decline, memory enhancement and memory update. In the subsequent learning process modeling, different update strategies are designed for each link to trace the change in students’ knowledge mastery state in the learning process. In the memory decline link, a student’s mastery state of all knowledge points is modeled by global update. The mastery state of the student’s current exercise-related knowledge points is then updated in the memory enhancement link, according to the relevance weight of the knowledge points. For the memory update link, the hidden layer of LSTM is used only to update the knowledge mastery state of the knowledge points corresponding to the current exercise. This allows for accurate update of changes in the knowledge mastery state of a student after attempting exercise and reduces the loss of information due to the presence of excessive knowledge points [24]. In addition, memory networks are used to model the changes in the knowledge mastery state caused by the learning intervals. This caters for the inability of the RNN hidden layer to reflect time interval when modeling a student’s knowledge mastery state [16].
Moreover, to improve the prediction accuracy of the KT model, we investigate the existing research on the potential feature information of exercises. DKVMN enhances the accuracy of model prediction by using the embedding vectors of the exercises themselves as latent features of the exercises, while CF-DKD [16] combines the answer results of exercises with cognitive features to predict performance in future exercises. JKT [25] applies GCN to knowledge tracing and fuses the exercise-exercise subgraph and the concept-concept subgraph to obtain exercise embeddings and concept embeddings, respectively; while learning the relationship between exercises and concepts, it preserves the original graph structure information to the maximum extent. Inspired by these works, when predicting students’ performance in future exercises, we consider not only a student’s potential state of knowledge mastery for an exercise and its corresponding knowledge point, but also the response time in historical exercises. In other words, the response time in attempted exercises on the same knowledge point is used to estimate the response time of future exercises. Different students take different amounts of time to answer the same exercise, reflecting their different learning abilities and adaptability to the difficulty of the exercise. Therefore, to improve prediction accuracy, we consider the student’s response time in historical exercises as an influential factor in predicting performance in future exercises.
To address the shortcomings in existing works, and inspired by research results from educational psychology, a new KT model is proposed, with the following as its main contributions.
A knowledge tracing model named MBFKT is designed for modeling students’ learning process, derived from a combination of multi-head attention networks, memory networks, and recurrent neural networks.
Three memory links, namely the memory decline link, the memory enhancement link, and the memory update link, are modeled in the learning process according to the student’s learning behaviors, and an update strategy is designed for each link.
Multi-head attention networks focus on the contribution and degree of influence of different behavioral features on the student’s knowledge mastery state in MBFKT.
A more efficient method of combining behavioral features is followed, in which research results in educational psychology are incorporated; the rules of learning and forgetting are thus explained, and the interpretability of MBFKT is enhanced.
MBFKT makes use of the response time in attempted exercises to improve the information model of predicted exercises, thus improving prediction accuracy.
The rest of this paper is organized as follows. Section 2 discusses the related works on KT. Section 3 explains the relevant concepts and notations and gives the problem definition. The framework of MBFKT and its computational process are described in detail in Section 4. Section 5 shows the results and analysis of MBFKT comparison experiment, and a wrap-up of the whole paper is given in Section 6.
Related works
In recent years, KT techniques have developed rapidly, while early works such as BKT, DKT, and DKVMN [10–12] retain simpler models. Besides, important factors such as students’ answer behaviors, potential features of the exercises, and attention mechanisms, are not applied. This has allowed room for subsequent development of some successful variants stemming from these classical models. The related works are described as follows.
Knowledge tracing considering a student’s behavioral features
KT studies have been conducted to show that incorporating the behavioral features of a student when attempting exercises into KT models can improve the interpretability and predictive performance to a certain extent. For instance, Khajah et al. [26] extended BKT by incorporating human cognitive factors, thus improving the model prediction accuracy. Qiu et al. [27] considered the interval between students’ learning of the same knowledge point since their last repetition. They added the new day’s marker to BKT to model the forgetting behavior that occurred after a one-day interval. However, the resultant model could not account for forgetting behavior for shorter periods. Khajah et al. [28] improved BKT by applying the number of repetitions of the student learning the knowledge point to estimate the probability of forgetting, enhancing the accuracy of the model prediction.
Yang et al. [13] used a tree-based classifier to preprocess students’ behavioral features when answering exercises and implicitly embedded them into DKT to enhance the model performance. Nagatani et al. [14] improved DKT by considering the number of repetitions when learning the same knowledge point, the time interval since the last learning of the same knowledge point, and the time interval since the last learning. However, the model ignored the effect of the original state of the student’s mastery of the knowledge point on the changes to occur on it. Sun et al. [15] extended the behavioral features of students during answering exercises to DKVMN to achieve better prediction results. Huang et al. [16] obtained learning features and forgetting features from a student’s behavioral features, such as the number of repetitions of the same knowledge point, the time interval since the last learning of the same knowledge point, and the time interval since the last learning. These, in combination with memory networks, were applied on DKVMN to improve its performance. In literature [17], factors influencing learning and forgetting were mined from a student’s behavioral features and combined with memory networks and recurrent neural networks to model forgetting and learning behaviors, update changes of knowledge mastery state, and improve model prediction accuracy. Literature [29] was an extension of SAINT, which improved the prediction performance by applying two time features to response embedding: the elapsed time (the time for students to answer questions) and the lag time (the time interval between adjacent learning).
Although these extended models have good interpretability, some limitations remain. To begin with, the mining of students’ behavioral features is not comprehensive. Besides, different behavioral features influence the modeling of changes in students’ knowledge mastery state to different degrees, and the methods for combining behavioral features require further investigation. In contrast, our model explores the effects of different behavioral features and of several combination methods on students’ knowledge mastery state, and focuses on the influence of each behavioral feature so as to trace changes in the knowledge mastery state dynamically.
Knowledge tracing considering the potential features of the exercises
A student’s performance in an exercise depends not only on the mastery of the knowledge points examined in the exercises but also on other potential features of the exercises, such as the difficulty of the exercises. Therefore, the prediction accuracy of the KT model can be improved by considering information about potential features that affect the predicted performance on the exercises.
Huang et al. [16] argued that among the factors influencing a student’s ability to answer exercises correctly was his or her cognitive features (learning and forgetting), and integrated cognitive features and the knowledge mastery state of the exercises into a vector to improve the prediction accuracy. Literature [12, 17] used the embedding vectors of the exercises themselves as potential features of the exercises for performance prediction. Liu et al. [18] mined potential features of exercises, such as knowledge points and exercised content, from the text of the exercises, and combined them with the student’s learning history for knowledge tracing. Sun et al. [23] combined students’ behavioral features with their learning ability to obtain a new representation of the exercise, a step that boosted the performance of KT. Pardos et al. [30] improved accuracy in predicting students’ performance in future exercises by incorporating the information on difficulty of the exercises.
In fact, in performance prediction, different students attempt the same exercise and achieve different results, reflecting differences in the students’ learning ability. Likewise, a single student answers exercises that examine the same knowledge point with different results, reflecting the variability in the difficulty of the exercises. To capture this phenomenon, the student’s response time in an exercise is integrated into the model of the student’s learning features, so that the adaptability of the student’s learning ability to the increasing difficulty of the exercises can be considered.
Knowledge tracing with attention mechanism
The attention mechanism has been widely used in different fields, such as speech recognition and image processing. Simply put, the attention mechanism learns an attention weight vector. Most of the current work on KT that incorporates the attention mechanism focuses on exercises, neglecting the exploration of a student’s behavioral features.
Abdelrahman et al. [31] utilized the attention mechanism to improve DKVMN by focusing on a student’s records of attempted exercises when answering similar exercises. Inspired by the Transformer architecture and further developments in natural language processing, the attention mechanism was applied in literature [32–34] in a deep knowledge tracing model to capture relationships between exercises and their relevance to students’ knowledge state. Choi et al. [35] argued that the Transformer-based attention layer was too shallow. To dig deeper into the complex relationship between exercises and answer results, the effectiveness of multi-head attention was demonstrated through several experiments.
In our study, the attention mechanism focuses on students’ behavioral features. We apply multi-head attention networks to calculate attention weight for behavioral features to enhance the influence of important behavioral features on students’ knowledge mastery state during historical learning process.
Knowledge tracing considering other factors
Further efforts in KT research have discovered more factors affecting personalized learning, as discussed below.
CKT [36] measured students’ prior knowledge from their records of attempted exercises and designed hierarchical convolutional layers for extracting learning rates to personalize the modeling of the knowledge mastery state. GKT [37] combined the graph neural network with the knowledge tracing task. It encoded the students’ knowledge mastery state as the embedding of graph nodes, and updated the students’ knowledge mastery state according to the embedded feature vector and the knowledge graph structure. Bi-CLKT [24] transformed the traditional KT problem into a graph form and trained the model with huge volumes of unlabeled data through contrastive learning. The embeddings of the exercises and knowledge points were obtained by node-level and graph-level GCNs, and connected to the prediction layer as attributes of each exercise so as to predict the exercise answer performance. GIKT [38] used graph convolutional networks to obtain relational embeddings of exercises and knowledge points with which to train the model for tracing a student’s knowledge mastery state. Finally, a historical review module was designed to solve the long-term dependence problem of the sequence, and an interaction prediction module was created to improve the accuracy of the prediction.
Problem definition
In this study, $S$ is defined as the set of students, $K$ as the set of knowledge points, and $E$ as the set of exercises. Each student learns independently without affecting the others. $X = \{x_0, x_1, x_2, \ldots, x_t\}$ is the sequence of a student’s attempted exercises at different times. An attempted exercise is expressed as the seven-tuple $x_t = (e_t, k_t, r_t, \Delta ST_t, \Delta RT_t, \Delta CT_t, \Delta FT_t)$, representing a student’s response to an exercise $e_t$ ($e_t \in \{e_1, e_2, e_3, \ldots, e_{|E|}\}$) at time $t$. $k_t$ ($k_t \in K$) is the knowledge point corresponding to the exercise $e_t$. The student’s answer is given by the binary variable $r_t$ ($r_t \in \{0, 1\}$): $r_t = 1$ when the student answers the exercise correctly, and $r_t = 0$ otherwise. The attempted exercise $x_t$ also includes the behavioral features exhibited by the student while attempting it: $\Delta ST_t$, the time interval since the last learning at time $t$; $\Delta RT_t$, the time interval since the last learning of the same knowledge point; $\Delta CT_t$, the number of times the same knowledge point has been repeatedly studied; and $\Delta FT_t$, the response time of the exercise. In this study, the units of $\Delta ST$, $\Delta RT$, and $\Delta FT$ are minutes, and the unit of $\Delta CT$ is the number of times. The notations used in this paper are shown in Table 1.
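As a minimal sketch of the seven-tuple defined above, one attempted exercise can be held in a small record type. The class name `Interaction`, the field names, and the sample values are hypothetical and chosen only for illustration; they do not come from the MBFKT implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Interaction:
    """One attempted exercise x_t = (e_t, k_t, r_t, dST_t, dRT_t, dCT_t, dFT_t)."""
    exercise_id: int      # e_t
    skill_id: int         # k_t, knowledge point of the exercise
    correct: int          # r_t, 1 if answered correctly, else 0
    since_last: float     # dST_t, minutes since the last learning
    since_same: float     # dRT_t, minutes since the last learning of the same knowledge point
    repeat_count: int     # dCT_t, number of times the knowledge point has been studied
    response_time: float  # dFT_t, response time in minutes

    def __post_init__(self) -> None:
        # r_t is a binary variable by definition.
        if self.correct not in (0, 1):
            raise ValueError("r_t must be 0 or 1")


# A student's sequence X is an ordered list of interactions (made-up values).
sequence: List[Interaction] = [
    Interaction(12, 3, 1, 0.0, 0.0, 1, 0.5),
    Interaction(15, 3, 0, 30.0, 30.0, 2, 1.2),
]
```

Keeping the three time-based features in minutes, as the definition prescribes, avoids unit conversions later when the features are embedded.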
Notations
The key memory network is the matrix $M^{skill}$ ($d_{skill} \times |K|$), which denotes the $|K|$ knowledge embedding representations in its entire knowledge space. The value memory networks is the matrix
Based on the memory rules embodied in educational psychology and learning theory, and combined with the behavioral features of students when answering questions and their performance in answering exercises, this study models a student’s learning process as three memory links: memory decline link, memory enhancement link, and memory update link. These are used to dynamically trace the changes in a student’s knowledge mastery state, according to his or her behavioral features when attempting exercises, and performance in those exercises. The knowledge mastery state at a time t - 1 when a student, stu, starts attempting a given exercise is denoted as
Given a sequence of a student’s attempted exercises $X = \{x_0, x_1, \ldots, x_t\}$, MBFKT can achieve the two objectives below: (1) keeping track of the dynamic changes in the state of a student’s knowledge mastery; and (2) predicting the student’s performance in the next exercise $e_{t+1}$ by calculating the probability of the student giving a correct answer, $p_{t+1}(r_{t+1} = 1 \mid e_{t+1}, X)$.
Our proposed model (MBFKT) is a time series model that uses multi-head attention networks to focus on the contribution and degree of influence of different behavioral features on a student’s knowledge mastery state. The potential state of a student’s knowledge mastery is dynamically stored in the form of a memory network, and after each answer the state of the knowledge point corresponding to the current exercise is precisely updated using recurrent neural networks, thereby achieving the two objectives mentioned in Definition 3.
In this study, the framework of MBFKT at a time t is constructed as shown in Fig. 1. MBFKT consists of five functional modules, namely calculating knowledge relevance weight, calculating the degree of influence of behavioral features, modeling the learning process, exercise answer performance prediction, and knowledge mastery level output. Each of these is distinguished by differently colored connecting lines and solid wireframes, from which both the data interaction between modules and the change of a student’s knowledge mastery state in the three memory links (the blue dashed boxes in the figure) are observable.

Fig. 1. Framework of MBFKT at timestamp t.
To make this framework better understood, Algorithm 1 summarizes the major steps of the whole process. The main idea of our algorithm is to model students’ learning process and predict the performance of the exercises. Then, the next few subheadings describe in detail the calculation process of the five functional modules in MBFKT framework.
In this module, we aim at obtaining the relevance relationships between knowledge points in the same course. The relevance of each knowledge point to the current exercise is weighted, and the weights for all knowledge points are computed. The process commences by mapping the knowledge point $k_t$ corresponding to the exercise $e_t$ onto a vector space: the knowledge point is multiplied by the embedding matrix $E \in \mathbb{R}^{|K| \times d_{skill}}$ to obtain the embedding vector $KP_t \in \mathbb{R}^{d_{skill}}$ of the knowledge point corresponding to the exercise at time $t$. Next, the inner product of $KP_t$ and each single knowledge embedding $M^{skill}(i)$ is calculated. The Softmax value of the inner product is then calculated to obtain the knowledge relevance weight $w_t(i)$, which is fed into the memory enhancement link and the performance prediction module. The calculation is shown in Equation (1):
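The computation described in this paragraph (an inner product between the knowledge point embedding and each key memory slot, followed by a Softmax over all slots) can be sketched in plain Python as follows. This is an illustrative sketch written from the prose description only; the function names and the toy vectors are assumptions, and the key memory is stored here as a list of |K| row vectors for convenience.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relevance_weights(kp_t: Sequence[float],
                      key_memory: Sequence[Sequence[float]]) -> List[float]:
    """Return w_t(i) for every knowledge slot i.

    kp_t       : embedding of the current knowledge point (length d_skill)
    key_memory : |K| knowledge embeddings, each of length d_skill (assumed layout)
    """
    scores = [sum(a * b for a, b in zip(kp_t, key)) for key in key_memory]
    return softmax(scores)


# Toy example with d_skill = 2 and |K| = 3 (made-up numbers).
weights = relevance_weights([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]])
```

Because of the Softmax, the weights are positive and sum to one, so they can be read as a soft assignment of the exercise to the knowledge points.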
Prior to modeling the learning process (the changes undergone by a student’s knowledge mastery state), the degree of influence of behavioral features on the student’s knowledge mastery state is computed using multi-head attention networks, as in the following steps.
The behavioral features $\Delta ST_t$, $\Delta RT_t$, and $\Delta CT_t$ of a student are extracted from his or her records of attempted exercises at time $t$. These three features, being scalar quantities, are then mapped onto the vector space through the embedding matrices $A$, $B$, and $C$, respectively, to obtain the behavioral feature embeddings $ST_t$, $RT_t$, and $CT_t$, which are then input to the multi-head attention networks (MHA) and weighted using attention weights to obtain the vector $M_t$. $M_t$ consists of the vectors $d_t$ and $a_t$, where $d_t$ contains information about the behavioral feature $\Delta ST_t$, while $a_t$ contains information about the behavioral features $\Delta RT_t$ and $\Delta CT_t$. The vectors $d_t$ and $a_t$ are used in modeling the memory decline link and the memory enhancement link, respectively, as shown in Equations (2)–(5).
As shown in Equation (2), this study applies scaled dot-product attention as each attention head and sets the number of heads to 3. The outputs of the heads are concatenated and then linearly transformed to obtain the multi-head attention output. Each head thus focuses on a single section, allowing MBFKT to handle the degree of influence of the various behavioral features on the student’s knowledge mastery state.
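For readers unfamiliar with the building block, the following is a minimal sketch of a single scaled dot-product attention head, softmax(QK^T / sqrt(d_k))V, in plain Python. Treating the three behavioral feature embeddings as the queries, keys, and values at once (self-attention) is an assumption made here for illustration; the learned projections, the three-head split, and the final linear transform of MBFKT are not reproduced.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: Sequence[Vector],
                                 keys: Sequence[Vector],
                                 values: Sequence[Vector]) -> List[List[float]]:
    """One attention head: each query attends over all keys."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of the query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        attn = softmax(scores)
        # Attention-weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(attn, values))
                        for j in range(len(values[0]))])
    return outputs


# Hypothetical 2-dimensional embeddings standing in for ST_t, RT_t, CT_t.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(features, features, features)
```

Each output row is a convex combination of the value vectors, so a feature that receives a larger attention weight contributes more to the resulting representation.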
In this module, the output of multi-head attention network is used to model changes in the potential state of a student’s knowledge mastery state, by transforming it into three links, that is memory decline link, memory enhancement link, and memory update link, as hinted in the earlier sections of this paper. The following experiments compare performance of various combinations of behavioral features in memory links, and explain the rules of learning and forgetting, as well as ways of increasing the interpretability of MBFKT.
Memory decline link
The memory decline link models the decline of all knowledge points of a student according to the information contained in the behavioral feature $\Delta ST_t$.
The memory enhancement link updates the mastery state of the knowledge points associated with the current exercise based on information from the behavioral features $\Delta RT_t$ and $\Delta CT_t$.
In this study, LSTM serves as the network layer; its gating mechanism appreciably reduces the long-term dependency problem of the sequence. The memory update link specifically updates the mastery state of the knowledge point corresponding to the current exercise based on its answer. This is implemented by locating the knowledge point $k_t$ corresponding to the exercise $e_t$ attempted by the student, together with its position index in the student’s knowledge mastery state matrix
One of MBFKT’s tasks is to predict a student’s performance ($p_{t+1}$) in a forthcoming exercise ($e_{t+1}$) at time $t+1$, based on the student’s sequence of attempted exercises $X = \{x_0, x_1, x_2, \ldots, x_t\}$. As mentioned in Section 2.2, research on modeling the information of predicted exercises in the field of KT is still being refined. Our study incorporates the students’ response time in the exercises, in accordance with literature [12]. Specifically, the student’s average response time over all historical exercises with the same knowledge point is taken as the student’s predicted response time in a forthcoming exercise. Finally, three factors that influence performance in future exercises are added to the behavioral features factor: the exercise’s latent knowledge mastery state ($M_s$), the exercise’s knowledge point embedding ($KP$), and the exercise’s response time embedding ($FT$). The specific calculation process is shown in Fig. 2.
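The response-time estimate described above, the mean response time over a student's past exercises on the same knowledge point, can be sketched as below. The function name, the `(skill_id, response_time)` pair layout, and the fallback value for a knowledge point with no history are assumptions for illustration; the source does not say how an unseen knowledge point is handled.

```python
from typing import Iterable, Tuple


def predicted_response_time(history: Iterable[Tuple[int, float]],
                            skill_id: int,
                            default: float = 0.0) -> float:
    """Mean response time (minutes) over past exercises on `skill_id`.

    history : (skill_id, response_time_minutes) pairs for one student
    default : value returned when the skill was never attempted
              (an assumption, not specified in the paper)
    """
    times = [rt for k, rt in history if k == skill_id]
    return sum(times) / len(times) if times else default


# Made-up history: two attempts on knowledge point 3, one on knowledge point 7.
history = [(3, 0.5), (3, 1.5), (7, 2.0)]
estimate = predicted_response_time(history, 3)
```

The estimate is then embedded (as $FT_{t+1}$) and used alongside the other two factors in the prediction module.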

Fig. 2. Exercise answer performance prediction.
First, the potential knowledge mastery state of the candidate exercise is calculated using the knowledge relevance weight $w_{t+1}$, obtained as described in Section 4.1. Weighting and summation of the knowledge relevance weight, $w_{t+1}$, with the student’s knowledge mastery state embeddings
A student’s performance in a forthcoming exercise $e_{t+1}$ depends not only on the student’s potential knowledge mastery state $M_s$ for this exercise, but also on its personalized information modeling. Our research considers the knowledge point embedding $KP_{t+1}$ corresponding to the exercise and the response time embedding $FT_{t+1}$ of the exercise. The combined vector $[M_s, KP_{t+1}, FT_{t+1}]$ is obtained by vector concatenation and fed to a fully connected layer with the Tanh activation function. The output vector $h_{t+1}$ then contains the potential knowledge mastery state of the student for the exercise as well as the personalized information modeling. The detailed calculation is shown in Equation (14):
Finally, the vector $h_{t+1}$ is fed into a fully connected layer with a Sigmoid activation function to obtain the probability $p_{t+1}$ of the student correctly answering the candidate exercise $e_{t+1}$, which is calculated as shown in Equation (15).
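The two-layer prediction head described in the last two paragraphs (concatenation, a fully connected Tanh layer, then a fully connected Sigmoid layer) can be sketched in plain Python as follows. It is written from the prose only: the weight and bias names `w1`, `b1`, `w2`, `b2`, their shapes, and the toy dimensions are assumptions, and a trained model would learn these parameters rather than set them by hand.

```python
import math
from typing import List, Sequence


def linear(x: Sequence[float],
           weight: Sequence[Sequence[float]],
           bias: Sequence[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def predict_correct_probability(m_s: List[float],
                                kp_next: List[float],
                                ft_next: List[float],
                                w1, b1, w2, b2) -> float:
    """Return p_{t+1}, the probability of a correct answer to e_{t+1}."""
    combined = m_s + kp_next + ft_next                 # [M_s, KP_{t+1}, FT_{t+1}]
    h_next = [math.tanh(z) for z in linear(combined, w1, b1)]
    logit = linear(h_next, w2, b2)[0]                  # single output unit
    return 1.0 / (1.0 + math.exp(-logit))              # Sigmoid


# Toy sizes: three 2-dimensional inputs, hidden width 4 (made-up numbers).
w1 = [[0.1, -0.2, 0.3, 0.0, 0.2, -0.1] for _ in range(4)]
b1 = [0.0, 0.1, -0.1, 0.0]
w2 = [[0.5, -0.5, 0.25, 0.25]]
b2 = [0.0]
p_next = predict_correct_probability([0.2, 0.4], [1.0, 0.0], [0.3, 0.3],
                                     w1, b1, w2, b2)
```

The Sigmoid output lies strictly between 0 and 1, so it can be compared directly with the binary label $r_{t+1}$ when computing a loss or an evaluation metric.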
The primary function of the knowledge mastery level output module is to output the mastery level value, indexed by time $t$, of each knowledge point in the student’s knowledge mastery state matrix
The knowledge mastery level output module and the exercise answer performance prediction module use the same network structure, but we use
Each knowledge point in
MBFKT requires a total loss function optimized by loss backpropagation for parameters such as the embedding matrices $A$ to $F$, the knowledge embedding matrix $M^{skill}$, the knowledge mastery state matrix
In this study, $M^{skill}$ and
Datasets
In order to verify the performance of MBFKT, sufficient experiments have been conducted on three real online learning datasets. The detailed statistical data of each dataset is shown in Table 2, and the detailed data analysis is shown in Fig. 3.

Fig. 3. Distribution graph of students’ number of exercises.
Table 2. Datasets information statistics
We select the following four benchmark models for comparison experiments to verify the performance of our proposed MBFKT.
Evaluation metrics and Experimental setup
To achieve the best performance of the models on different datasets and to compare their final experimental results uniformly, we chose the following evaluation metrics and set the experimental parameters of each model separately.
Evaluation metrics
In this paper, the Area Under the Curve (AUC) and Accuracy (ACC) are used as evaluation metrics, which are widely used in the field of KT [11–14]. In general, higher values of both metrics indicate better predictive performance.
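Both metrics can be computed directly from the predicted probabilities and the binary labels. The sketch below uses the pairwise (rank-based) definition of AUC and a 0.5 decision threshold for ACC; the threshold and the toy data are assumptions for illustration, since the text does not state how probabilities are turned into class predictions.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], probs: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of predictions that match the labels (threshold is assumed)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def auc(labels: Sequence[int], probs: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
labels = [1, 0, 1, 0]
probs = [0.9, 0.4, 0.35, 0.2]
acc_value = accuracy(labels, probs)
auc_value = auc(labels, probs)
```

AUC is independent of the decision threshold, which is one reason it is commonly reported alongside ACC in KT work. The pairwise loop is quadratic in the number of samples, so a sort-based implementation is preferable on full datasets.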
Parameter settings
For model parameter settings, DKT, DKVMN, DKT+F, and MBFKT are all set with a batch size of 32, using the Adam optimizer and a learning rate of 0.001. Among them, the size of the LSTM hidden layer of the recurrent neural networks is 200 for DKT and DKT+F. The dimension of the hidden vector of the key-value memory matrix is set to 32 for DKVMN, and the key and value memory matrix columns are set to 266, 102, and 1458 on the datasets ASSISTments2012, ASSISTments2017, and Slepemapy.cz, respectively. For our MBFKT setting on the three datasets ASSISTments2012, ASSISTments2017, and Slepemapy.cz, the number of columns in the matrices $M^{skill}$ and
Experimental environment
All experiments are based on the PyTorch framework implemented in Python on a Linux server with two 2.60GHz Intel Xeon E5-2690 v4 CPUs and four Tesla GPUs. Moreover, to ensure fairness, all models have been optimized to obtain the best performance.
Experimental results and analysis
In this subsection, the proposed model MBFKT is explored for its advantages over the benchmark model. The effectiveness of behavioral features in improving MBFKT performance is verified by conducting experiments in four aspects. That is, the effect of different embedding vector dimensions on MBFKT, the effect of different combination methods of behavioral features on MBFKT, comparison experiments with the benchmark model, and ablation experiments, respectively.
The effect of different embedding vector dimensions on MBFKT
The hyperparameter settings for the dimension of the knowledge embedding vector $d_{skill}$ and the dimension of a student’s knowledge mastery state embedding vector $d_{stu}$ are selected by observing the AUC values of MBFKT under different dimensional settings. To reduce the number of parameters, we set $d = d_{skill} = d_{stu}$; the experimental results are detailed in Table 3.
Table 3. AUC values of MBFKT with different embedding vector dimensions on the three datasets
Table 4. Combination methods of behavioral features
Table 3 shows that the best dimension differs across datasets. On the ASSISTments2012 dataset, the average AUC value is 0.8414 when $d = d_{skill} = d_{stu} = 32$, which is higher than for the other dimension settings. On the ASSISTments2017 dataset, the average AUC value is 0.8329 when $d = d_{skill} = d_{stu} = 16$, which is higher than for the other dimension settings. On the Slepemapy.cz dataset, the average AUC value is 0.7445 when $d = d_{skill} = d_{stu} = 8$, which is higher than for the other dimension settings. The comparison shows that when the dimension $d$ is set too low, the learning ability of the model is limited, and when $d$ is set too high, the model has too many parameters and is prone to overfitting. Therefore, we set $d = d_{skill} = d_{stu} = 32$ for ASSISTments2012, $d = d_{skill} = d_{stu} = 16$ for ASSISTments2017, and $d = d_{skill} = d_{stu} = 8$ for Slepemapy.cz.
To explore how the behavioral features (ST, RT, CT) affect a student’s knowledge mastery state in the memory decline link and the memory enhancement link, we compare three methods of combining the behavioral features in MBFKT and identify the best one. The details of the three combination methods are shown in Table 4.
The experimental results are shown in Table 5, where the three combination methods correspond to MBFKT1, MBFKT2, and MBFKT, respectively. Combination method 3 obtains the best AUC on all three datasets. This leads us to conclude that different behavioral features affect different links of a student’s knowledge mastery state: ST affects the memory decline link, while RT and CT affect the memory enhancement link. This result is consistent with the memory phenomena described in educational psychology theory and supports modeling a student’s learning process as distinct memory links. It is also in line with the way people learn and forget, and it explains the change in a student’s knowledge mastery state during the learning process.
AUC effect of different combination methods of behavioral features on MBFKT
One of the primary tasks of KT is to predict students’ performance on future exercises. In the experiments, we randomly select 80% of the data in each dataset as the training set and use the remaining 20% as the test set. We verify the improvement in the prediction performance of the proposed model MBFKT through comparison experiments with the benchmark models.
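A random 80/20 split of this kind can be sketched as below. The text does not state whether the unit of splitting is a student sequence or a single interaction, so `records` here is a generic list; the function name `split_train_test` and the fixed seed are assumptions made only for this illustration.

```python
import random


def split_train_test(records, train_fraction=0.8, seed=0):
    """Randomly split records into a training list and a test list."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be in (0, 1)")
    shuffled = list(records)               # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Usage with 100 dummy record ids.
train, test = split_train_test(range(100))
```

Fixing the seed keeps the same split across the compared models, which makes their scores directly comparable.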
The AUC and ACC of MBFKT and the four benchmark models on the three datasets are shown in Table 6. The experimental results of BKT are taken from the literature [17, 24], those of DKT and DKT+F are from our own reproduction, and those of DKVMN are obtained with open-source code available online. Table 6 shows that the AUC and ACC of MBFKT are better than those of the benchmark models on two of the datasets. On the Slepemapy.cz dataset, however, MBFKT is not the best model, which we attribute to the very large number of knowledge points in this dataset. This result indicates that MBFKT outperforms the benchmark models in predicting students’ performance on future exercises, but that an excessive number of knowledge points may degrade its performance.
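For reference, the two metrics reported here can be computed from binary correctness labels and predicted probabilities as in the sketch below. It uses only the standard definitions of AUC and accuracy; the function names and the 0.5 decision threshold for ACC are assumptions, since the text does not state the threshold used.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of thresholded predictions that match the 0/1 labels."""
    if len(labels) != len(scores) or len(labels) == 0:
        raise ValueError("labels and scores must be non-empty and equal in length")
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)


def auc(labels, scores):
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen correct answer
    receives a higher score than a randomly chosen incorrect one,
    counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Usage on four dummy predictions.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
example_auc = auc(y_true, y_score)
example_acc = accuracy(y_true, y_score)
```

The pairwise form is quadratic in the number of samples and is meant for clarity; a rank-based implementation would be preferable on full datasets.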
Comparison experiments with the benchmark model
The comparison results show that BKT is limited by modeling a student’s mastery of a knowledge point as a discrete binary variable, and it therefore has the lowest predictive performance. DKT models a student’s overall knowledge mastery state but cannot model the mastery state for each knowledge point, and it does not model the memory decline and enhancement links. Hence, the prediction performance of DKT is lower than that of MBFKT on two datasets. On the Slepemapy.cz dataset, however, DKT performs slightly better than MBFKT, possibly because DKT copes better than MBFKT with a large number of knowledge points. The performance of DKVMN is lower than that of DKT. Because DKVMN constructs a student’s knowledge mastery state separately for each knowledge point, information about the relationships between knowledge points may be lost [24]. This effect is more pronounced on the Slepemapy.cz dataset, which contains 1458 knowledge points. Both DKVMN and MBFKT can model a student’s mastery of a single knowledge point, but DKVMN ignores the changes in the knowledge mastery state caused by a student’s behavioral features during learning. Although DKT+F considers a student’s exercise answering behaviors, its prediction performance is lower than that of MBFKT on two datasets owing to the limitations of the RNN. In contrast, MBFKT integrates the behavioral features of students through the multi-head attention mechanism and also models the learning process as three links (memory decline, memory enhancement, and memory update), which captures the change in students’ knowledge mastery state more accurately. Finally, when predicting performance on an exercise, MBFKT further refines the information modeling of the exercise by integrating its response time, which improves prediction performance.
On the whole, MBFKT achieves the best prediction performance on two of the three datasets, and the experimental results show a clear improvement over the benchmark models there.
This subsection tests the validity of the behavioral features used in this study and examines how differently they affect a student’s knowledge mastery state. In addition, we experimentally verify the performance gains contributed by the multi-head attention networks and by the improved method of updating the knowledge mastery state.
(1) Validity analysis of the exercise’s response time factor
To investigate whether the exercise’s response time factor helps predict performance on future exercises, we conduct ablation experiments on this factor in DKVMN and MBFKT. Here, DKVMN+FT is DKVMN extended with the response time FT of the predicted exercises, and MBFKT-FT is MBFKT without the response time FT of the predicted exercises. The experimental results are shown in Table 7.
Validity analysis of the exercise’s response time factor
The results in Table 7 show that adding the exercise’s response time factor to DKVMN affects the datasets to different degrees, reflecting the variability of the datasets. The improvement of DKVMN+FT over DKVMN is most pronounced on the Slepemapy.cz dataset. In contrast, adding the personalized feature of a student’s exercise response time to MBFKT improves the AUC over MBFKT-FT more markedly on all datasets. This indicates that MBFKT captures the modeling information of the exercises better than DKVMN, and that incorporating the exercise’s response time factor into the information modeling of the exercises enhances the predictive performance of MBFKT.
(2) Extended study of the exercise’s response time factor
To further investigate the effect of the exercise response time factor, we add MBFKT’ as a comparison experiment. MBFKT’ is a variant of MBFKT that uses the real response time FT’ of the predicted exercises. The two models differ only in the exercise’s response time factor. The experimental results are shown in Table 8.
Extended study of the exercise’s response time factor
MBFKT uses the response time of historical exercises to approximate the real response time of the predicted exercises, and this estimate is not accurate enough. MBFKT’ is therefore set to use the student’s real response time to the exercises. The experimental results in Table 8 show that MBFKT’ further improves on MBFKT on all three datasets. This again verifies the validity of the exercise’s response time factor and shows that our model still has room for improvement.
(3) Different behavioral features affect the variability of a student’s knowledge mastery state
To eliminate the influence of the estimated response time of the exercises on other factors, the remaining ablation experiments in this paper are based on the real response time FT’ of the predicted exercises. To examine how differently the behavioral features affect a student’s knowledge mastery state, we conduct the ablation experiments reported in Table 9, where MBFKT’-ST is MBFKT’ without the behavioral feature ST, and MBFKT’-RTCT is MBFKT’ without the behavioral features RT and CT.
Comparison of the results of different behavioral features affecting student’s knowledge mastery state
The results in Table 9 show that ablating different behavioral features decreases the model AUC to different degrees. The model that considers more behavioral features has a higher AUC and better performance. In other words, a more comprehensive set of behavioral features models the student’s knowledge mastery state more accurately.
(4) Impact of multi-head attention networks on model performance
To investigate the effect of multi-head attention networks on model performance, we conduct the ablation experiments shown in Table 10, where MBFKT’-MHA is MBFKT’ without multi-head attention networks.
Analysis of the validity of multi-head attention networks
The results in Table 10 show that MBFKT’-MHA performs worse than MBFKT’ because it lacks an attention mechanism. With the help of multi-head attention networks, MBFKT’ learns how strongly each behavioral feature influences the knowledge mastery state, which improves its prediction performance.
(5) Impact of different knowledge mastery state update methods on model performance
Finally, to investigate the effects of different knowledge mastery state update methods, we conduct the ablation experiments shown in Table 11. MBFKT’-WT is a variant of MBFKT’ that updates students’ knowledge mastery state in the memory decline link using the knowledge relevance weight updating method.
Comparison of the results of different knowledge mastery update methods
The results in Table 11 show that using the global update in the memory decline link performs better than the update based on knowledge relevance weights. A plausible explanation is that a long interval since a student’s last learning session leads to memory decline for all knowledge points, not just the relevant ones.
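The difference between the two update styles can be illustrated with a toy sketch. This is not the update rule of MBFKT: the multiplicative decay, the scalar `decay` rate, and the function names are all assumptions chosen only to show that a global update shrinks every knowledge slot, whereas a relevance-weighted update leaves unrelated slots untouched.

```python
def decay_global(memory, decay):
    """Shrink every knowledge slot by the same factor (1 - decay)."""
    return [[v * (1.0 - decay) for v in slot] for slot in memory]


def decay_weighted(memory, decay, weights):
    """Shrink slot i in proportion to its relevance weight weights[i].

    A weight of 0 leaves the slot unchanged; a weight of 1 applies
    the full decay.
    """
    if len(weights) != len(memory):
        raise ValueError("one relevance weight is required per slot")
    return [[v * (1.0 - decay * w) for v in slot]
            for slot, w in zip(memory, weights)]


# Toy mastery memory with two knowledge slots of two dimensions each.
memory = [[1.0, 1.0], [1.0, 1.0]]
after_global = decay_global(memory, decay=0.5)
after_weighted = decay_weighted(memory, decay=0.5, weights=[1.0, 0.0])
```

In this toy setting the second slot is forgotten only under the global update, which mirrors the interpretation that a long break affects all knowledge points.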
This section compares the experimental results of the different ablation models on all datasets. The most evident pattern is that the variant of MBFKT that considers more influencing factors has a higher AUC and better performance. Based on the above experiments, we draw the following conclusions. First, the validity of the response time factor of the exercises is verified, and refining the information modeling of the exercises is a further research direction for improving the model’s prediction performance. Second, the changes in students’ knowledge mastery state are modeled better when behavioral features are considered more comprehensively. Third, different behavioral features influence the knowledge mastery state to different degrees, as we demonstrate by applying multi-head attention. Fourth, our improved method of updating the knowledge mastery state improves the performance of MBFKT.
To verify another primary task of KT, tracing the dynamic changes in a student’s knowledge mastery state
For in-depth analysis, we select a student’s historical exercise interaction sequence to trace the trend of the knowledge mastery level changes over 30 time steps. We visualize the memory slots of five potential knowledge points in the student’s

Graph of changes in a student’s knowledge mastery level (left) and radar graph of learning ability (right).
In Fig. 4 (left), the first column represents the student’s initial mastery level of the five knowledge points. The student’s knowledge mastery level changes gradually at each time step rather than alternating between mastery and non-mastery. Specifically, the student’s potential knowledge mastery level increases (decreases) each time the student answers an exercise correctly (incorrectly). In Fig. 4 (right), after answering 30 exercises, the student has made significant progress on knowledge point k1, while their learning ability for knowledge point k5 has slightly decreased. The student’s learning ability for knowledge point k3 has remained weak, indicating that the student has not been able to master k3 during the learning process. Therefore, after tracing 30 exercises, the student has mastered knowledge points k1, k2, k4, and k5, whereas knowledge point k3 needs to be strengthened.
Based on the theory of educational psychology and on students’ behavioral features in the learning process, this study designs a new KT model named MBFKT, which is built on a multi-head attention network, a memory network, and a recurrent neural network. MBFKT can trace changes in students’ knowledge mastery state and predict their performance in future exercises. We model the learning process of a student as a memory decline link, a memory enhancement link, and a memory update link. This approach is consistent with the rules of human memory and appreciably increases the interpretability of MBFKT. We also give a method for extracting behavioral features from students’ historical attempted exercises and incorporate these features into MBFKT. Extensive experiments on three datasets are conducted to verify the performance of MBFKT against the benchmark models. Finally, we visualize the tracing results of a student’s knowledge mastery state, from which the student’s weak knowledge points can be identified and targeted remedies proposed. The experimental results demonstrate the validity and interpretability of our proposed model.
Future work can be explored in two directions. On the one hand, to improve the interpretability of KT, graph neural networks can be used to learn the structural information of a knowledge graph and build a knowledge tracing model [38, 39]. On the other hand, to address the long-term dependence problem, the traditional RNN can be replaced by better time series models to further improve the prediction performance of KT.
Acknowledgment
This research is supported by the Shandong Postgraduate Education Quality Improvement Plan (Grant No.: SDYJG19075), the Shandong Education Teaching Research Key Project (Grant No.: 2021JXZ010), the Qingdao City Philosophy and Social Science Planning Project (Grant No.: QDSKL2201132), the National Statistical Science Research Project (Grant No.: 2021LY053), the Shandong Province Education Science Planning Innovation Literacy Project (Grant No.: 2022CYB280), and the Shandong University of Science and Technology Education and Teaching Research Stars Plan Project (Grant No.: QX2022ZD07).
