Precise modeling of learning process based on multiple behavioral features for knowledge tracing

Abstract

With the growing need for personalized learning among online students, knowledge tracing (KT), a technique aimed at tracing the state of a student’s knowledge mastery and predicting performance in future exercises, has become a hot topic in personalized learning research. The behavioral features exhibited during students’ learning process carry information that affects the state of a student’s knowledge mastery. To study the influence of learning behaviors on students’ knowledge mastery state in the learning process, we propose a Precise Modeling of Learning Process based on Multiple Behavioral Features for Knowledge Tracing model (MBFKT), which models a student’s learning process by making use of these behavioral features. MBFKT first processes these features through multi-head attention networks, memory networks, and recurrent neural networks to model students’ learning process as three memory links: the memory decline link, the memory enhancement link, and the memory update link. A separate update strategy is designed for each memory link, and the performance of numerous possible combinations of behavioral features in the memory links is compared, so that the rules of learning and forgetting can be explained. Furthermore, we study the contribution and degree of influence of different behavioral features on a student’s knowledge mastery state and use the findings to improve MBFKT, thus enhancing prediction accuracy. Experiments on real online education datasets and comparisons with existing benchmark methods show that MBFKT has evident advantages in prediction performance while offering good interpretability.

Keywords

personalized learning; knowledge tracing; multiple behavioral features; memory links; educational data mining

1 Introduction

With the growing demand for personalized learning from online students, mining and analyzing massive learning data has become an urgent task. Acquiring students’ knowledge needs from data [1] and tracing their state of knowledge mastery in real time [2] are crucial to realizing adaptive personalized learning [3]. KT has thus become a research hotspot in educational data mining [4], aiming at dynamically tracing students’ knowledge mastery state and predicting their future learning performance. The core task is to automatically trace changes in students’ knowledge mastery state over time from their historical attempted exercises, in order to accurately predict their performance in forthcoming exercises. This prediction is the basis for choosing actions that meet students’ learning needs, such as recommendation of learning resources [5], provision of tutorials, planning of learning paths [6, 7], training of weak knowledge points [8], and monitoring of students’ knowledge mastery level [9].

Numerous researchers have studied KT techniques, mainly Bayesian Knowledge Tracing (BKT) [10], Deep Knowledge Tracing (DKT) [11], Dynamic Key-Value Memory Networks for Knowledge Tracing (DKVMN) [12], and their variants [13–18]. However, these studies hinge mainly on the results of historical attempted exercises and neglect the impact of students’ behaviors on their state of knowledge mastery, even though these behaviors also bear on students’ learning ability. For instance, the length of time between learning sessions may affect how much of a learnt concept a student forgets, whereas repeated learning of a knowledge point may strengthen its retention.

In educational psychology, many scholars have noted that human memory declines through forgetting. The Ebbinghaus forgetting curve [19, 20] shows that students forget what they have learned, i.e., memory declines. According to the memory trace decline theory [21], a student’s original mastery level of a knowledge point affects the magnitude of memory decline. We therefore take the time interval since the last learning and the mastery level of existing knowledge points as the factors affecting memory decline. An enhancement phenomenon is also worth considering: learning theory emphasizes [22] that if a student learns the same knowledge point repeatedly, the understanding of it is enhanced. This paper thus takes the time interval between learning identical knowledge points, the number of repetitions of the same knowledge point, and the mastery level of existing knowledge points as the factors affecting memory enhancement.

Literature research also reveals that some variants of classical KT models have improved predictive performance by incorporating exercise answer behaviors. A case in point is DKT+F [14], which incorporates behavioral features to model a student’s forgetting behavior based on DKT. CF-DKD [16] and LFKT [17] investigate KT models that abstract a student’s learning and forgetting behaviors by incorporating behavioral features based on DKVMN. DKVMN-LA [23] integrates the behavioral features into a model of historical exercises and uses them as input to train DKVMN. However, the above works do not involve research on the contribution and degree of influence of different behavioral features on a student’s knowledge mastery state in the learning process. In this study, we have explored this issue by calculating attention weight for behavioral features using multi-head attention networks.

Furthermore, existing KT studies update the state of students’ knowledge mastery from their historical attempted exercises in one of two ways: global update [12], as in DKT, which updates the student’s mastery state for all knowledge points, and partial update, as in DKVMN and LFKT, where the mastery state of the knowledge points related to an exercise is updated according to the knowledge relevance weight between the exercise and the knowledge point. In view of this, we make the following improvements to the update method of the knowledge mastery state. Based on the memory rules embodied in educational psychology and learning theory, we assume that students’ learning process can be subdivided into three links: memory decline, memory enhancement, and memory update. In the subsequent learning process modeling, a different update strategy is designed for each link to trace the change in students’ knowledge mastery state. In the memory decline link, a student’s mastery state of all knowledge points is modeled by global update. In the memory enhancement link, the mastery state of the knowledge points related to the student’s current exercise is then updated according to the relevance weight of the knowledge points. In the memory update link, the hidden layer of the LSTM is used to update only the knowledge mastery state of the knowledge point corresponding to the current exercise. This allows the change in a student’s knowledge mastery state after attempting an exercise to be updated accurately and reduces the loss of information caused by an excessive number of knowledge points [24]. In addition, memory networks are used to model the changes in the knowledge mastery state caused by learning intervals, which compensates for the inability of the RNN hidden layer to reflect time intervals when modeling a student’s knowledge mastery state [16].

Moreover, to improve the prediction accuracy of the KT model, we investigate existing research on the potential feature information of exercises. DKVMN enhances prediction accuracy by using the embedding vectors of the exercises themselves as latent features of the exercises, while CF-DKD [16] combines the answer results of exercises with cognitive features to predict performance in future exercises. JKT [25] applies GCN to knowledge tracing and fuses the exercise-exercise subgraph and the concept-concept subgraph to obtain exercise embeddings and concept embeddings, respectively. While learning the relationship between exercises and concepts, it preserves the original graph structure information to the maximum extent. Inspired by this work, when predicting students’ performance in future exercises, we consider not only a student’s potential state of knowledge mastery for an exercise and its corresponding knowledge point, but also the response time in historical exercises. In other words, the response time in attempted exercises on the same knowledge point is used to estimate the response time of future exercises. Students take different amounts of time to answer the same exercise, reflecting their different learning abilities and adaptability to the difficulty of the exercise. Therefore, to improve prediction accuracy, we treat the student’s response time in historical exercises as an influential factor in predicting performance in future exercises.

To address the shortcomings in existing works, and inspired by research results from educational psychology, a new KT model is proposed, with the following as its main contributions.

A knowledge tracing model named MBFKT is designed for modeling students’ learning process, derived from a combination of multi-head attention networks, memory networks, and recurrent neural networks. Three memory links, namely memory decline link, memory enhancement link, and memory update link, are modeled in the learning process as per student’s learning behaviors, and update strategies are designed for each link, respectively.

Multi-head attention networks focus on the contribution and degree of influence of different behavioral features on the student’s knowledge mastery state in MBFKT. A more efficient method of combining behavioral features is followed, where research results in educational psychology are combined, thus the rules of learning and forgetting are explained, and the interpretability of MBFKT is enhanced.

MBFKT makes use of the response time in attempted exercises to improve the information model of predicted exercises, thus improving prediction accuracy.

The rest of this paper is organized as follows. Section 2 discusses the related works on KT. Section 3 explains the relevant concepts and notations and gives the problem definition. The framework of MBFKT and its computational process are described in detail in Section 4. Section 5 shows the results and analysis of MBFKT comparison experiment, and a wrap-up of the whole paper is given in Section 6.

2 Related works

In recent years, KT techniques have developed rapidly. Early works such as BKT, DKT, and DKVMN [10–12] use comparatively simple models and do not take into account important factors such as students’ answer behaviors, potential features of the exercises, and attention mechanisms. This has left room for the subsequent development of successful variants of these classical models. The related works are described as follows.

2.1 Knowledge tracing considering a student’s behavioral features

KT studies have been conducted to show that incorporating the behavioral features of a student when attempting exercises into KT models can improve the interpretability and predictive performance to a certain extent. For instance, Khajah et al. [26] extended BKT by incorporating human cognitive factors, thus improving the model prediction accuracy. Qiu et al. [27] considered the interval between students’ learning of the same knowledge point since their last repetition. They added the new day’s marker to BKT to model the forgetting behavior that occurred after a one-day interval. However, the resultant model could not account for forgetting behavior for shorter periods. Khajah et al. [28] improved BKT by applying the number of repetitions of the student learning the knowledge point to estimate the probability of forgetting, enhancing the accuracy of the model prediction.

Yang et al. [13] used a tree-based classifier to preprocess students’ behavioral features when answering exercises and implicitly embedded them into DKT to enhance the model performance. Nagatani et al. [14] improved DKT by considering the number of repetitions when learning the same knowledge point, the time interval since the last learning of the same knowledge point, and the time interval since the last learning. However, the model ignored the effect of the original state of the student’s mastery of the knowledge point on the changes to occur on it. Sun et al. [15] extended the behavioral features of students during answering exercises to DKVMN to achieve better prediction results. Huang et al. [16] obtained learning features and forgetting features from a student’s behavioral features, such as the number of repetitions of the same knowledge point, the time interval since the last learning of the same knowledge point, and the time interval since the last learning. These, in combination with memory networks, were applied on DKVMN to improve its performance. In literature [17], factors influencing learning and forgetting were mined from a student’s behavioral features and combined with memory networks and recurrent neural networks to model forgetting and learning behaviors, update changes of knowledge mastery state, and improve model prediction accuracy. Literature [29] was an extension of SAINT, which improved the prediction performance by applying two time features to response embedding: the elapsed time (the time for students to answer questions) and the lag time (the time interval between adjacent learning).

Although these extended models have good interpretability, some limitations remain. First, the mining of students’ behavioral features is not comprehensive. Second, different behavioral features influence the modeling of changes in students’ knowledge mastery state to different degrees, and the methods of combining behavioral features require further investigation. In contrast, our model explores the effects of different behavioral features and numerous combination methods on students’ knowledge mastery state, and focuses on the influence of each behavioral feature so as to trace changes in the knowledge mastery state dynamically.

2.2 Knowledge tracing considering the potential features of the exercises

A student’s performance in an exercise depends not only on the mastery of the knowledge points examined in the exercises but also on other potential features of the exercises, such as the difficulty of the exercises. Therefore, the prediction accuracy of the KT model can be improved by considering information about potential features that affect the predicted performance on the exercises.

Huang et al. [16] argued that among the factors influencing a student’s ability to answer exercises correctly was his or her cognitive features (learning and forgetting), and integrated cognitive features and the knowledge mastery state of the exercises into a vector to improve the prediction accuracy. Literature [12, 17] used the embedding vectors of the exercises themselves as potential features of the exercises for performance prediction. Liu et al. [18] mined potential features of exercises, such as knowledge points and exercised content, from the text of the exercises, and combined them with the student’s learning history for knowledge tracing. Sun et al. [23] combined students’ behavioral features with their learning ability to obtain a new representation of the exercise, a step that boosted the performance of KT. Pardos et al. [30] improved accuracy in predicting students’ performance in future exercises by incorporating the information on difficulty of the exercises.

In performance prediction, different students attempting the same exercise achieve different results, reflecting differences in their learning ability. Likewise, one student may obtain different results on exercises that examine the same knowledge point, reflecting the variability in the difficulty of the exercises. To account for this, the student’s response time in an exercise is integrated into the model capturing the student’s learning features, so that the adaptability of the student’s learning ability to the increasing difficulty of the exercises can be considered.

2.3 Knowledge tracing with attention mechanism

The attention mechanism has been widely used in different fields, such as speech recognition and image processing. Simply put, an attention mechanism learns an attention weight vector. Most current work on KT that incorporates an attention mechanism focuses on exercises and neglects the exploration of students’ behavioral features.

Abdelrahman et al. [31] utilized the attention mechanism to improve DKVMN by focusing on a student’s records of attempted exercises when answering similar exercises. Inspired by the Transformer architecture and further developments in natural language processing, the attention mechanism was applied in literature [32–34] in a deep knowledge tracing model to capture relationships between exercises and their relevance to students’ knowledge state. Choi et al. [35] argued that the Transformer-based attention layer was too shallow. To dig deeper into the complex relationship between exercises and answer results, the effectiveness of multi-head attention was demonstrated through several experiments.

In our study, the attention mechanism focuses on students’ behavioral features. We apply multi-head attention networks to calculate attention weight for behavioral features to enhance the influence of important behavioral features on students’ knowledge mastery state during historical learning process.

2.4 Knowledge tracing considering other factors

Further efforts in KT research have discovered more factors affecting personalized learning, as discussed below.

CKT [36] measured students’ prior knowledge from their records of attempted exercises and designed hierarchical convolutional layers to extract learning rates, so as to personalize the modeling of the knowledge mastery state. GKT [37] combined graph neural networks with the knowledge tracing task: it encoded the students’ knowledge mastery state as the embeddings of graph nodes and updated that state according to the embedded feature vectors and the knowledge graph structure. Bi-CLKT [24] transformed the traditional KT problem into a graph form and trained the model on large volumes of unlabeled data through contrastive learning. The embeddings of the exercises and knowledge points were obtained by node-level and graph-level GCNs and connected to the prediction layer as attributes of each exercise so as to predict exercise answer performance. GIKT [38] used graph convolutional networks to obtain relational embeddings of exercises and knowledge points with which to train the model for tracing a student’s knowledge mastery state. In addition, a historical review module was designed to solve the long-term dependence problem of sequences, and an interaction prediction module was created to improve prediction accuracy.

3 Problem definition

In this study, S is defined as the set of students, K as the set of knowledge points, and E as the set of exercises. Each student learns independently, without affecting the others. X={x₀, x₁, x₂, …, x_t} is the sequence of a student’s attempted exercises at different times. An attempted exercise is expressed as the seven-tuple x_t=(e_t,k_t,r_t,ΔST_t,ΔRT_t,ΔCT_t,ΔFT_t), representing a student’s response to an exercise e_t (e_t ∈ { e₁, e₂, e₃, …, e_|E| }) at time t. k_t (k_t ∈ K) is the knowledge point corresponding to the exercise e_t. The student’s answer is given by the binary variable r_t (r_t ∈ { 0, 1 }): r_t = 1 when the student answers the exercise correctly, and r_t = 0 otherwise. The attempted exercise x_t also includes the behavioral features exhibited by the student while attempting it: ΔST_t, the time interval since the last learning at time t; ΔRT_t, the time interval since the last learning of the same knowledge point; ΔCT_t, the number of times that the same knowledge point has been repeatedly studied; and ΔFT_t, the response time of the exercise. In this study, ΔST, ΔRT, and ΔFT are measured in minutes, and ΔCT is a count. The notations used in this paper are shown in Table 1.

Table 1
Notations

Notation	Description
S	the set of students
K	the set of knowledge points
E	the set of exercises
X	the sequence of a student’s attempted exercises at different times
x_t	the student’s response to an exercise (seven-tuple)
e_t	an attempted exercise in X
k_t	the knowledge point corresponding to the exercise e_t
r_t	the student’s answer to exercise e_t
ΔST_t	the time interval since the last learning
ΔRT_t	the time interval since the last learning of the same knowledge point
ΔCT_t	the number of times that the same knowledge point is repeatedly studied
ΔFT_t	the response time of the exercise
M^skill	the knowledge embedding matrix
$M_{t}^{stu}$	the knowledge mastery state matrix
w_t	the knowledge relevance weight
M_t	the degree of influence of behavioral features
v_t	the embedding of answer information (k_t, r_t)
KP_t	the embedding of k_t
ST_t	the embedding of ΔST_t
RT_t	the embedding of ΔRT_t
CT_t	the embedding of ΔCT_t
FT_t	the embedding of ΔFT_t
P_t+1	the prediction of student’s performance in a forthcoming exercise e_t+1
value_t	the student’s mastery level of each knowledge point at the end of learning at time t
A to F	the embedding matrices used in this paper
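
As an illustration of the seven-tuple defined in this section, the following Python sketch derives ΔST_t, ΔRT_t, and ΔCT_t from a timestamped answer log. It is a minimal sketch under stated assumptions, not the preprocessing used in this paper: the names `Interaction` and `build_sequence`, the layout of the log rows, the use of 0 for a first occurrence, and the reading of ΔCT_t as the count of earlier attempts on the same knowledge point are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Interaction:
    """One seven-tuple x_t = (e_t, k_t, r_t, dST_t, dRT_t, dCT_t, dFT_t)."""
    exercise: int
    knowledge_point: int
    correct: int      # r_t: 1 if answered correctly, else 0
    delta_st: float   # minutes since the previous interaction of any kind
    delta_rt: float   # minutes since the previous interaction on the same knowledge point
    delta_ct: int     # earlier interactions on the same knowledge point (assumed reading)
    delta_ft: float   # response time in minutes


def build_sequence(log: List[Tuple[float, int, int, int, float]]) -> List[Interaction]:
    """Turn rows (start_minute, exercise, knowledge_point, correct, response_minutes)
    into the sequence X. First occurrences get an interval of 0 (assumed convention)."""
    last_any: Optional[float] = None
    last_by_kp: Dict[int, float] = {}
    count_by_kp: Dict[int, int] = {}
    sequence: List[Interaction] = []
    for start, e, k, r, ft in sorted(log):
        d_st = 0.0 if last_any is None else start - last_any
        d_rt = 0.0 if k not in last_by_kp else start - last_by_kp[k]
        d_ct = count_by_kp.get(k, 0)
        sequence.append(Interaction(e, k, r, d_st, d_rt, d_ct, ft))
        # Remember this interaction for the intervals of later ones.
        last_any = start
        last_by_kp[k] = start
        count_by_kp[k] = d_ct + 1
    return sequence
```

For a log in which knowledge point 10 is practiced at minute 0 and again at minute 90, with another knowledge point practiced at minute 30, the third interaction receives ΔST = 60, ΔRT = 90, and ΔCT = 1 under these conventions.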

Definition 1. Key-Value Memory Networks

The key memory network is the matrix M^skill (d_skill × |K|), which holds the |K| knowledge embedding representations of the entire knowledge space. The value memory network is the matrix $M_{t}^{stu} (d_{stu} \times | K |)$ , which denotes the knowledge mastery state matrix in a student’s knowledge space at time t [12]. The |K|-dimensional vector value_t represents the student’s mastery level of each knowledge point at the end of learning at time t, where each dimension of the vector lies between 0 and 1. A value close to 0 indicates a low level of knowledge mastery, and a value close to 1 indicates a high level of knowledge mastery.

Definition 2. Memory Links

Based on the memory rules embodied in educational psychology and learning theory, and combined with the behavioral features of students when answering questions and their performance in answering exercises, this study models a student’s learning process as three memory links: memory decline link, memory enhancement link, and memory update link. These are used to dynamically trace the changes in a student’s knowledge mastery state, according to his or her behavioral features when attempting exercises, and performance in those exercises. The knowledge mastery state at a time t - 1 when a student, stu, starts attempting a given exercise is denoted as $M_{t - 1}^{stu}$ . This state initially goes through the memory decline link, whose operations are dependent on the learning interval, to obtain the student’s knowledge mastery state ${\tilde{M}}_{t - 1}^{stu}$ . It is then taken through the memory enhancement link, where the knowledge mastery state $M_{t}^{stu_M}$ , at the time t, when the student starts attempting the exercise, is obtained. Finally, in the memory update link, the student’s knowledge mastery state after answering the exercise is updated to $M_{t}^{stu}$ , based on his or her answer result at a time, t. The matrices $M_{t - 1}^{stu}$ , ${\tilde{M}}_{t - 1}^{stu}$ , $M_{t}^{stu_M}$ , and $M_{t}^{stu}$ must be of the same shape and represent the student’s knowledge mastery state in different links, respectively.

Definition 3. Knowledge Tracing

Given a sequence of a student’s attempted exercises X={x₀, x₁, …, x_t}, MBFKT can achieve the two objectives below.

Keeping track of the dynamic changes in the state of a student’s knowledge mastery, $M_{t}^{stu}$ , over time.

Prediction of the student’s performance in the next exercise, e_t+1, by calculating the probability of the student giving a correct answer in the next exercise e_t+1, using the equation p_t+1 (r_t+1 = 1|e_t+1, X).

4 Proposed model

Our proposed model (MBFKT) is a time series model that uses multi-head attention networks to focus on the contribution and degree of influence of different behavioral features on a student’s knowledge mastery state. The potential state of a student’s knowledge mastery is dynamically stored in a memory network, and after each answer the state of the knowledge point corresponding to the current exercise is precisely updated by recurrent neural networks, thereby achieving the two objectives mentioned in Definition 3.

In this study, the framework of MBFKT at a time t is constructed as shown in Fig. 1. MBFKT consists of five functional modules, namely calculating knowledge relevance weight, calculating the degree of influence of behavioral features, modeling the learning process, exercise answer performance prediction, and knowledge mastery level output. Each of these is distinguished by differently colored connecting lines and solid wireframes, from which both the data interaction between modules and the change of a student’s knowledge mastery state in the three memory links (the blue dashed boxes in the figure) are observable.

Fig. 1

Framework of MBFKT at timestamp t.

To make this framework easier to follow, Algorithm 1 summarizes the major steps of the whole process. The main idea of the algorithm is to model students’ learning process and predict their performance on exercises. The following subsections describe in detail the calculation process of the five functional modules in the MBFKT framework.

4.1 Calculating knowledge relevance weight

In this module, we aim at obtaining the relevance relationships between knowledge points in the same course. The relevance of each knowledge point to the current exercise is weighted, and the weights are computed for all knowledge points. The process starts by mapping the knowledge point k_t corresponding to the exercise e_t onto a vector space: k_t is multiplied by the embedding matrix E ∈ R^{|K|×d_skill} to obtain the embedding vector KP_t of the knowledge point corresponding to the exercise at time t, where KP_t ∈ R^{d_skill}. Next, the inner product of KP_t and each knowledge embedding M^skill (i) is calculated, and the Softmax of these inner products gives the knowledge relevance weight w_t (i), which is fed into the memory enhancement link and the performance prediction module. The calculation is shown in Equation (1):

$w_{t} (i) = Softmax ({KP}_{t}^{T} M^{skill} (i)) ,$ (1)

where $Softmax (z_{i}) = e^{z_{i}} / \sum_{j} e^{z_{j}}$ and w_t (i) ∈ [0, 1].
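
Equation (1) can be illustrated with a few lines of plain Python. This is a minimal sketch rather than the implementation used in this paper: the function names are hypothetical, and M^skill is stored here as a list of |K| key vectors (one row per knowledge point, i.e., the transpose of the d_skill × |K| layout), which is an assumption made only for readability.

```python
import math
from typing import List


def softmax(z: List[float]) -> List[float]:
    """Softmax with the maximum subtracted first for numerical stability."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def relevance_weights(kp_t: List[float], m_skill: List[List[float]]) -> List[float]:
    """Equation (1): w_t(i) = Softmax(KP_t^T M^skill(i)) over all knowledge points i.

    kp_t    -- embedding of the knowledge point of the current exercise (length d_skill)
    m_skill -- one key vector per knowledge point (|K| rows of length d_skill)
    """
    scores = [sum(a * b for a, b in zip(kp_t, key)) for key in m_skill]
    return softmax(scores)
```

The returned weights are non-negative and sum to 1, and the knowledge point whose key vector has the largest inner product with KP_t receives the largest weight.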

4.2 Calculating the degree of influence of behavioral features

Prior to modeling the learning process (the changes undergone by a student’s knowledge mastery state), the degree of influence of behavioral features on the student’s knowledge mastery state is computed using multi-head attention networks, as in the following steps.

Algorithm 1 MBFKT algorithm
Input: the sequence of a student’s attempted exercises at different times, X
Output: the prediction of the student’s performance p_t+1, the student’s mastery level value_t
1: Initialize the knowledge embedding matrix M^skill
2: Initialize the knowledge mastery state matrix $M_{t}^{stu}$
3: for epoch ← 1 to T do
4:   for each x_t ∈ X do
5:     ST_t, CT_t, RT_t, FT_t, KP_t, v_t ← Embedding process via embedding matrices A to F
6:     w_t ← Calculating knowledge relevance weight via Equation (1)
7:     M_t ← Calculating the degree of influence of behavioral features via Equations (2)-(5)
8:     $M_{t - 1}^{stu}$ , ${\tilde{M}}_{t - 1}^{stu}$ , $M_{t}^{stu_M}$ , $M_{t}^{stu}$ ← Modeling the learning process via Equations (6)-(12)
9:     p_t+1 ← Exercise answer performance prediction via Equations (13)-(15)
10:     value_t ← Knowledge mastery level output via Equations (16)-(18)
11:   end for
12:   Model optimization via Equation (19)
13: end for
14: return p_t+1, value_t

The behavioral features ΔST_t, ΔRT_t, and ΔCT_t of a student are extracted from his or her records of attempted exercises at time t. These three features, being scalar quantities, are mapped onto the vector space through the embedding matrices A, B, and C, respectively, to obtain the behavioral feature embeddings ST_t, RT_t, and CT_t. The embeddings are then input to the multi-head attention networks (MHA) and weighted using attention weights to obtain the vector M_t. M_t consists of the vectors d_t and a_t, where d_t contains information about the behavioral feature ΔST_t, while a_t contains information about the behavioral features ΔRT_t and ΔCT_t. Vectors d_t and a_t are used in modeling the memory decline link and the memory enhancement link, respectively, as shown in Equations (2)–(5).

$Attention (q, k, v) = softmax (\frac{q k^{T}}{\sqrt{d_{k}}}) v .$ (2)

$MHA (q, k, v) = Concat ({head}_{1}, \dots, {head}_{h}) W^{o} , \quad \text{where } {head}_{i} = Attention (q W_{i}^{q}, k W_{i}^{k}, v W_{i}^{v}) .$ (3)

$M_{t} = MHA (q, k, v) .$ (4)

$M_{t} = [d_{t}, a_{t}] .$ (5)

Here, the vector [ST_t, RT_t, CT_t], which contains the combined behavioral features of a student, is set to q; k and v are generated by random initialization; and $W_{i}^{q}, W_{i}^{k}, W_{i}^{v}$ (i = 1, …, h) and W^o are the parameter matrices.

As shown in Equation (2), this study uses scaled dot-product attention for each attention head and sets the number of heads to 3. The outputs of the heads are concatenated and then linearly transformed to give the output of the multi-head attention. Each head can thus focus on a different part of the input, which lets MBFKT capture the degree of influence of the various behavioral features on the student’s knowledge mastery state.
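
The attention computation of Equations (2) and (3) can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names, the representation of matrices as nested lists, and the shapes used in the usage note (q with one row per behavioral feature, k and v with n randomly initialized rows) are assumptions, since the exact tensor shapes are not given in this section. How M_t is split into d_t and a_t in Equation (5) is likewise not specified here, so the sketch stops at M_t.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m: Matrix) -> Matrix:
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Equation (2): softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head_attention(q: Matrix, k: Matrix, v: Matrix,
                         w_q: List[Matrix], w_k: List[Matrix],
                         w_v: List[Matrix], w_o: Matrix) -> Matrix:
    """Equation (3): project per head, attend, concatenate, then apply W^o."""
    heads = [attention(matmul(q, wq), matmul(k, wk), matmul(v, wv))
             for wq, wk, wv in zip(w_q, w_k, w_v)]
    concat = [sum((head[r] for head in heads), []) for r in range(len(q))]
    return matmul(concat, w_o)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random initialization, as assumed here for k, v and the parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
```

With three heads, an embedding size d = 4, and a head size d_k = 2 (all hypothetical values), calling `multi_head_attention` on a 3 × 4 query built from [ST_t, RT_t, CT_t] returns a 3 × 4 matrix playing the role of M_t in Equation (4).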

4.3 Modeling the learning process

In this module, the output of the multi-head attention network is used to model changes in the potential state of a student’s knowledge mastery through three links: the memory decline link, the memory enhancement link, and the memory update link, as introduced in the earlier sections of this paper. The subsequent experiments compare the performance of various combinations of behavioral features in the memory links, explain the rules of learning and forgetting, and examine ways of increasing the interpretability of MBFKT.

4.3.1 Memory decline link

The memory decline link models the decline of all of a student’s knowledge points based on the information contained in the behavioral feature ΔST_t. $M_{t - 1}^{stu} (i)$ is a vector containing the state of mastery of the i-th knowledge point by a student, stu, at time t - 1. It is concatenated with the vector d_t, which contains memory decline information, to obtain a vector D_t (i). The student’s memory decline vector, $M_{d_{t}} (i)$ , is then obtained through a fully connected layer with a Sigmoid activation function. The process of updating the student’s knowledge mastery state for the i-th knowledge point at time t - 1 to obtain the knowledge mastery state after memory decline, ${\tilde{M}}_{t - 1}^{stu} (i)$ , is laid out in Equations (6)–(8). $D_{t} (i) = [M_{t - 1}^{stu} (i), d_{t}] .$ (6) $M_{d_{t}} (i) = Sigmoid ({MD}^{T} D_{t} (i) + b_{md}) .$ (7) ${\tilde{M}}_{t - 1}^{stu} (i) = M_{t - 1}^{stu} (i) (1 - M_{d_{t}} (i)) .$ (8) Where, MD^T and b_md are the parameter matrix and bias of the fully connected layer, respectively.
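The decline step in Equations (6)–(8) can be sketched for a single knowledge point as follows. This is a hedged illustration with hypothetical dimensions and weights; the product in Equation (8) is assumed to be element-wise, and the function and variable names are not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def memory_decline(m_prev, d_t, weight, bias):
    """Equations (6)-(8) for one knowledge point i.

    m_prev: mastery vector M_{t-1}^{stu}(i)
    d_t:    decline vector taken from the attention output
    weight: rows of MD^T, one row per output dimension
    bias:   b_md
    """
    concat = m_prev + d_t                                   # Eq. (6): concatenation
    gate = [sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(weight, bias)]                # Eq. (7): decline vector
    return [m * (1.0 - g) for m, g in zip(m_prev, gate)]    # Eq. (8): element-wise decay

# Hypothetical values: 2-dimensional mastery and decline vectors.
m_prev = [0.8, 0.4]
d_t = [1.0, -1.0]
weight = [[0.0] * 4, [0.0] * 4]   # zero weights, so the gate is sigmoid(0) = 0.5
bias = [0.0, 0.0]
decayed = memory_decline(m_prev, d_t, weight, bias)
```

Because the Sigmoid gate lies in (0, 1), each non-negative mastery entry can only shrink in this step, which matches the role of the link as a forgetting mechanism.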

4.3.2 Memory enhancement link

The memory enhancement link updates the mastery state of the knowledge points associated with the current exercise based on information from the behavioral features ΔRT_t and ΔCT_t. $M_{t - 1}^{stu} (i)$ is a vector containing the state of mastery of the i-th knowledge point by a student, stu, at time t - 1. It is concatenated with the vector a_t, which contains memory enhancement information, to obtain a vector A_t (i). The student’s memory enhancement vector, $M_{a_{t}} (i)$ , is then obtained through a fully connected layer with a Tanh activation function. Finally, the knowledge mastery state, ${\tilde{M}}_{t - 1}^{stu} (i)$ , of the i-th knowledge point of the student after memory decline processing, is updated using the knowledge relevance weight, w_t (i), calculated in Section 4.1, to become the current knowledge mastery state $M_{t}^{stu_M} (i)$ , which concludes the memory enhancement process. The process is shown in Equations (9)–(11). $A_{t} (i) = [M_{t - 1}^{stu} (i), a_{t}] .$ (9) $M_{a_{t}} (i) = Tanh ({MA}^{T} A_{t} (i) + b_{ma}) .$ (10) $M_{t}^{stu_M} (i) = {\tilde{M}}_{t - 1}^{stu} (i) (1 + w_{t} (i) M_{a_{t}} (i)) .$ (11) Where, MA^T and b_ma are the parameter matrix and bias of the fully connected layer, respectively.
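Equations (9)–(11) can be illustrated in the same style. The sketch below assumes an element-wise product in Equation (11) and a scalar knowledge relevance weight w_t(i); all numeric values and names are hypothetical.

```python
import math

def memory_enhance(m_prev, m_decayed, a_t, relevance, weight, bias):
    """Equations (9)-(11) for one knowledge point i.

    m_prev:    M_{t-1}^{stu}(i), the mastery vector before decline
    m_decayed: the mastery vector after the memory decline link
    a_t:       enhancement vector taken from the attention output
    relevance: scalar knowledge relevance weight w_t(i)
    weight:    rows of MA^T; bias: b_ma
    """
    concat = m_prev + a_t                                        # Eq. (9)
    gain = [math.tanh(sum(w * x for w, x in zip(row, concat)) + b)
            for row, b in zip(weight, bias)]                     # Eq. (10)
    return [m * (1.0 + relevance * g)
            for m, g in zip(m_decayed, gain)]                    # Eq. (11)

# Hypothetical inputs.
m_prev = [0.8, 0.4]
m_decayed = [0.4, 0.2]
a_t = [1.0, 0.5]
weight = [[0.0] * 4, [0.0] * 4]   # zero weights, so the gain is tanh(bias)
bias = [0.5, 0.5]

enhanced = memory_enhance(m_prev, m_decayed, a_t, 0.5, weight, bias)
unrelated = memory_enhance(m_prev, m_decayed, a_t, 0.0, weight, bias)
```

With a relevance weight of 0 the state is left unchanged, so only knowledge points related to the current exercise are strengthened.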

4.3.3 Memory update link

In the study, LSTM plays the role of the network layer, and appreciably reduces the long-term dependency problem of the sequence, through its unique gating mechanism. The memory update link will particularly update the mastery state of the knowledge point corresponding to the current exercise based on its answer. This is implemented by locating the knowledge point, k_t, corresponding to the exercise, e_t, attempted by the student, together with its position index in the student’s knowledge mastery state matrix $M_{t}^{stu_M}$ , which is the number of the knowledge point k_t. The answer information is then embedded in vector, v_t, as feedback about the learning effect, in order to accurately update the knowledge mastery state of the corresponding knowledge point of the exercise. The knowledge mastery state, $M_{t}^{stu_M} (index)$ , of the knowledge point before the student attempts the exercise is updated to the knowledge mastery state $M_{t}^{stu} (index)$ of the knowledge point after attempting the exercise, which is calculated as in Equation (12). $M_{t}^{stu} (index) = LSTM (v_{t}, M_{t}^{stu_M} (index)) .$ (12)

4.4 Exercise answer performance prediction

One of MBFKT’s tasks is to predict a student’s performance (p_t+1) on a forthcoming exercise (e_t+1) at time t + 1, based on the student’s sequence of attempted exercises X={x₀, x₁, x₂, …, x_t}. As mentioned in Section 2.2, research on modeling information of predicted exercises in the field of KT is still being refined. Our study incorporates the students’ response time in the exercises, in accordance with literature [12]. To be specific, the student’s average response time over all historical exercises with the same knowledge point is taken as the student’s predicted response time for a forthcoming exercise. Finally, three factors that influence performance in future exercises are considered in addition to the behavioral features: the exercise’s latent knowledge mastery state (M_s), the exercise’s knowledge point embedding (KP), and the exercise’s response time embedding (FT). The specific calculation process is shown in Fig. 2.

Fig. 2

Exercise answer performance prediction.
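The response time estimate described above, the average response time over historical exercises with the same knowledge point, can be sketched as follows. The record layout and the fallback value for a knowledge point with no history are assumptions made for illustration.

```python
def predicted_response_time(history, knowledge_point, default=0.0):
    """Average response time over past exercises on the same knowledge point.

    history: list of (knowledge_point_id, response_time_seconds) tuples
             (a hypothetical record layout).
    default: value returned when the knowledge point has no history
             (an assumption; the source does not specify this case).
    """
    times = [rt for kp, rt in history if kp == knowledge_point]
    return sum(times) / len(times) if times else default

# Hypothetical interaction history of one student.
history = [(1, 10.0), (2, 30.0), (1, 20.0)]
rt_kp1 = predicted_response_time(history, 1)   # mean of 10.0 and 20.0
rt_kp3 = predicted_response_time(history, 3)   # unseen knowledge point
```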

First, the potential knowledge mastery state for the candidate exercise is calculated. The knowledge relevance weight, w_t+1, is obtained as described in Section 4.1 and is used to compute a weighted sum of the student’s knowledge mastery state embeddings $M_{t + 1}^{stu_M}$ . This yields the student’s comprehensive mastery embedding of the knowledge points associated with the candidate exercise, e_t+1, at the beginning of the exercise, that is, the student’s potential knowledge mastery state, M_s, for the forthcoming exercise, as in Equation (13). $M_{s} = \sum_{i = 1}^{| K |} (w_{t + 1} (i) M_{t + 1}^{stu_M} (i)) .$ (13)

A student’s performance in a forthcoming exercise, e_t+1, not only depends on the student’s potential knowledge mastery state, M_s, for this exercise, but also on its personalized information modeling. Our research considers the knowledge point embedding corresponding to the exercise KP_t+1 and the response time embedding of the exercise FT_t+1. The combined vector [M_s, KP_t+1, FT_t+1] is obtained by vector concatenation and fed to the fully connected layer with Tanh activation function. Then the output vector h_t+1 contains the potential knowledge mastery state of the student for the exercise as well as the personalized information modeling. The detailed calculation is shown in Equation (14): $h_{t + 1} = Tanh (W_{t}^{T} [M_{s}, {KP}_{t + 1}, {FT}_{t + 1}] + b_{t}) .$ (14) Where, $W_{t}^{T}$ and b_t are the parameter matrix and bias of the fully connected layer, respectively.

Finally, the vector h_t+1 is fed into a fully connected layer with a Sigmoid activation function to obtain the probability p_t+1 of the student’s correctly answering the candidate exercise e_t+1, which is calculated as shown in Equation (15). $p_{t + 1} = Sigmoid (W_{s}^{T} h_{t + 1} + b_{s}) .$ (15) Where, $W_{s}^{T}$ and b_s are the parameter matrix and bias of the fully connected layer, respectively.
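Equations (13)–(15) can be put together in a small sketch. The dimensions, weights, and helper names below are hypothetical and chosen only so that the arithmetic is easy to follow; this is not the authors' code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(x, weight, bias, activation):
    """Fully connected layer: activation(W x + b), with W given as a list of rows."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]

def weighted_mastery(relevance, mastery):
    """Equation (13): sum over knowledge points of w_{t+1}(i) * M_{t+1}^{stu_M}(i)."""
    dim = len(mastery[0])
    return [sum(w * row[j] for w, row in zip(relevance, mastery))
            for j in range(dim)]

def predict(relevance, mastery, kp, ft, w_t, b_t, w_s, b_s):
    m_s = weighted_mastery(relevance, mastery)
    h = dense(m_s + kp + ft, w_t, b_t, math.tanh)   # Eq. (14) on [M_s, KP, FT]
    return dense(h, w_s, b_s, sigmoid)[0]           # Eq. (15): probability of a correct answer

# Hypothetical setup: 2 knowledge points, 2-dim mastery, 1-dim KP and FT embeddings.
relevance = [0.25, 0.75]
mastery = [[1.0, 0.0], [0.0, 1.0]]
kp, ft = [0.5], [0.2]
w_t = [[0.1, 0.2, 0.3, 0.4], [0.0, 0.0, 0.0, 0.0]]
b_t = [0.0, 0.0]
w_s = [[1.0, 1.0]]
b_s = [0.0]

m_s = weighted_mastery(relevance, mastery)
p_next = predict(relevance, mastery, kp, ft, w_t, b_t, w_s, b_s)
```

The same two layers are reused by the knowledge mastery level output in Section 4.5, with zero vectors in place of KP and FT.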

4.5 Knowledge mastery level output

The primary function of the knowledge mastery level output module is to output the mastery level value_t of each knowledge point in the student’s knowledge mastery state matrix $M_{t}^{stu}$ at time t. First, the student’s knowledge mastery state $M_{t}^{stu}$ for all knowledge points is obtained by modeling the learning process as in Section 4.3, and the knowledge mastery state vector $M_{t}^{stu} (i)$ , a column of $M_{t}^{stu}$ , is selected by the one-hot vector β_i. This gives the student’s knowledge mastery state for the i-th knowledge point, as shown in Equation (16). $M_{t}^{stu} (i) = β_{i} M_{t}^{stu} .$ (16)

The knowledge mastery level output module and the exercise answer performance prediction module use the same network structure, but we use 0 vectors to pad the input content embedding of KP and FT to ignore the information of exercises. The detailed calculation is shown in equations (17) and (18). $y_{t} (i) = Tanh (W_{t}^{T} [M_{t}^{stu} (i), 0, 0] + b_{t}) .$ (17) ${value}_{t} (i) = Sigmoid (W_{s}^{T} y_{t} (i) + b_{s}) .$ (18)

Each knowledge point in $M_{t}^{stu}$ is calculated in turn, and the final knowledge mastery level value_t of each knowledge point of the student is obtained.

4.6 Model optimization

MBFKT is trained by backpropagating a single total loss to optimize its parameters, including the embedding matrices A to F, the knowledge embedding matrix M^skill, the knowledge mastery state matrix $M_{t}^{stu}$ , and the weights and biases of the neural networks. Specifically, MBFKT is optimized by minimizing the cross-entropy loss between students’ true answer result (r_t) and the model’s predicted probability (p_t) of correctly answering the exercise. The loss function is shown in Equation (19). $L = - \sum_{t} (r_{t} log p_{t} + (1 - r_{t}) log (1 - p_{t})) .$ (19)
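As a worked illustration of Equation (19), the loss can be computed directly from a sequence of true answers and predicted probabilities. The clipping constant `eps` is an added safeguard against log(0) and is not part of the original formula.

```python
import math

def cross_entropy_loss(targets, probs, eps=1e-12):
    """Equation (19): -sum_t [r_t log p_t + (1 - r_t) log(1 - p_t)]."""
    total = 0.0
    for r, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)   # assumption: clip to avoid log(0)
        total -= r * math.log(p) + (1 - r) * math.log(1.0 - p)
    return total

# Hypothetical answers (1 = correct) and predicted probabilities.
uninformed = cross_entropy_loss([1, 0], [0.5, 0.5])   # equals 2 * ln 2
confident = cross_entropy_loss([1, 0], [0.9, 0.1])
```

A model that assigns higher probability to the observed answers obtains a lower loss, which is what minimization drives toward.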

In this study, M^skill and $M_{t}^{stu}$ are initialized with the Kaiming normal distribution, that is, M^skill ∼ He (0, σ) and $M_{t}^{stu} \sim He (0, σ)$ . The Adam optimizer, with weight decay, is used to accelerate convergence. The learning rate is initialized to 0.01 and updated dynamically by exponential decay with a decay parameter of 0.75. Further details are given in Section 5.
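The exponential decay of the learning rate can be sketched as below, using the initial value 0.01 and the decay parameter 0.75 stated above. How often the decay is applied is not specified in the text; one decay step per epoch is assumed here.

```python
def exponential_decay(initial_lr, decay, step):
    """Learning rate after `step` decay steps: initial_lr * decay ** step."""
    return initial_lr * decay ** step

# Assumption: the decay is applied once per epoch.
schedule = [exponential_decay(0.01, 0.75, epoch) for epoch in range(4)]
```

Under this assumption the learning rate falls to roughly 0.0056 after two epochs.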

5 Experiments and analysis of results

5.1 Datasets

In order to verify the performance of MBFKT, sufficient experiments have been conducted on three real online learning datasets. The detailed statistical data of each dataset is shown in Table 2, and the detailed data analysis is shown in Fig. 3.

Fig. 3

Distribution graph of students’ number of exercises.

Table 2

Datasets information statistics

Dataset	Exercise	Skill	Student	Interaction	Interaction per Student
ASSISTments2012	179,999	266	46,674	6,123,270	131.2
ASSISTments2017	3,162	102	1,709	942,816	551.6
Slepemapy.cz	1,458	1,458	87,952	10,087,305	6,918.6

ASSISTments2012 1 : The dataset is obtained from the ASSISTments online education platform. In the experiment, students with too few records of attempted exercises (fewer than 2) are removed. After pre-processing, the dataset includes 46674 students, 266 knowledge points, and a total of 6123340 exercise records with 17999 exercises, with an average of 131 exercises per student. We randomly select one fifth of the data as experimental data.

ASSISTments2017 2 : The dataset is from the same system as ASSISTments2012. The pre-processed dataset includes 1709 students, 102 knowledge points, and a total of 942816 exercise records with 3162 exercises, with an average of 551 exercises per student. All data are selected as experimental data.

Slepemapy.cz 3 : The dataset is pre-processed, and it includes 87952 students, 1458 knowledge points, and a total of 1087305 exercise records with 1458 exercises, with an average of 6918 exercises per student. We randomly select one fifth of the data as experimental data.

5.2 Comparison method

We select the following four benchmark models for comparison experiments to verify the performance of our proposed MBFKT.

BKT [10] uses binary variables to model students’ knowledge mastery state and uses Hidden Markov Models to trace changes in students’ knowledge mastery state and to predict the exercise answer performance.

DKT [11] represents the exercises with knowledge points and uses the hidden layer of RNN to trace the changes in students’ knowledge mastery state to predict the exercise answer performance.

DKVMN [12] extends the MANN by using key-value memory networks to model students’ learning process. The key memory networks store the knowledge embedding, and the value memory networks store students’ knowledge mastery state. And the results of the modeled knowledge mastery state are used to predict the performance of the unanswered exercises.

DKT+F [14] is improved based on DKT. Its predictive performance is enhanced by incorporating students’ forgetting behavioral features in the input and exercise performance prediction parts of the model, respectively.

5.3 Evaluation metrics and Experimental setup

To achieve the best performance of each model on the different datasets and to compare their final experimental results uniformly, we choose the following evaluation metrics and set the experimental parameters of each model separately.

5.3.1 Evaluation metrics

In this paper, the Area Under the Curve (AUC) and Accuracy (ACC) are used as evaluation metrics, which are widely used in the field of KT [11–14]. In general, higher values of both metrics indicate better predictive performance.
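For reference, both metrics can be computed from true answers and predicted probabilities as in the sketch below. The 0.5 decision threshold for ACC is an assumption, and the pairwise AUC shown here is quadratic in the number of samples, so it is meant only to illustrate the definition.

```python
def accuracy(labels, probs, threshold=0.5):
    """Fraction of predictions that match the labels (threshold is an assumption)."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)

def auc(labels, probs):
    """Probability that a random positive is scored above a random negative.

    Ties count as 0.5.
    """
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for four interactions.
labels = [1, 0, 1, 0]
probs = [0.9, 0.4, 0.35, 0.2]
acc_value = accuracy(labels, probs)
auc_value = auc(labels, probs)
```

In this toy example three of the four positive-negative pairs are ranked correctly and three of the four thresholded predictions are right, so both metrics equal 0.75.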

5.3.2 Parameter settings

For model parameter settings, DKT, DKVMN, DKT+F, and MBFKT are all set with a batch size of 32, using Adam optimizer and a learning rate of 0.001. Among them, the size of the LSTM hidden layer of recurrent neural networks is 200 for DKT and DKT+F. The dimension of the hidden vector of key-value memory matrix is set to 32 for DKVMN, and the key and value memory matrix columns are set to 266, 102, and 1458 on the datasets ASSISTments2012, ASSISTments2017 and Slepemapy.cz, respectively. For our MBFKT setting in three different datasets ASSISTments2012, ASSISTments2017, and Slepemapy.cz, the number of columns in the matrices M^skill and $M_{t}^{stu}$ are set to 266, 102, and 1458 respectively, and the hidden vector size is set to 32, 16, and 8 respectively.

5.3.3 Experimental environment

All experiments are based on the PyTorch framework implemented in Python on a Linux server with two 2.60GHz Intel Xeon E5-2690 v4 CPUs and four Tesla GPUs. Moreover, to ensure fairness, all models have been optimized to obtain the best performance.

5.4 Experimental results and analysis

In this subsection, the advantages of the proposed model MBFKT over the benchmark models are explored, and the effectiveness of behavioral features in improving MBFKT performance is verified through experiments in four aspects: the effect of different embedding vector dimensions on MBFKT, the effect of different combination methods of behavioral features on MBFKT, comparison experiments with the benchmark models, and ablation experiments.

5.4.1 The effect of different embedding vector dimensions on MBFKT

The hyperparameter settings for the dimension of the knowledge embedding vector d_skill and the dimension of a student’s knowledge mastery state embedding vector d_stu are selected by observing the AUC values of MBFKT under different dimensional settings. In order to reduce the number of parameters, d = d_skill = d_stu is set, and the experiment results are detailed in Table 3.

Table 3
AUC values of MBFKT with different embedding vector dimensions on the three datasets

d	ASSISTments2012	ASSISTments2017	Slepemapy.cz
8	0.7990	0.7941	0.7445
16	0.7996	0.8329	0.7388
32	0.8414	0.8009	0.6864

Table 4

Combination methods of behavioral features

Method	Memory decline	Memory Enhancement
Combination method 1	[ST, RT, CT]	[ST, RT, CT]
Combination method 2	[ST, RT]	[CT]
Combination method 3	[ST]	[RT, CT]

From Table 3, we can learn that different datasets correspond to different situations. In the ASSISTments2012 dataset, when d = d_skill = d_stu = 32, the average AUC value is 0.8414, which is higher than other dimension settings. In the ASSISTments2017 dataset, when d = d_skill = d_stu = 16, the average AUC value is 0.8329, which is higher than other dimension settings. In the Slepemapy.cz dataset, when d = d_skill = d_stu = 8, the average AUC value is 0.7445, which is higher than the other dimensional settings. The comparison shows when the dimension d is set too low, the learning ability of the model is low. When the dimension d is set too high, it leads to too many parameters of the model, which is easy to cause overfitting. Therefore, for the ASSISTments2012 dataset, d = d_skill = d_stu = 32 is set. For the ASSISTments2017 dataset, d = d_skill = d_stu = 16 is set. And for the Slepemapy.cz dataset, d = d_skill = d_stu = 8 is set.
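The selection rule described above amounts to taking, for each dataset, the dimension with the highest AUC in Table 3. A small sketch using the values from Table 3 (the dictionary layout is an illustrative choice):

```python
# AUC values copied from Table 3, keyed by dataset and embedding dimension d.
auc_by_dim = {
    "ASSISTments2012": {8: 0.7990, 16: 0.7996, 32: 0.8414},
    "ASSISTments2017": {8: 0.7941, 16: 0.8329, 32: 0.8009},
    "Slepemapy.cz": {8: 0.7445, 16: 0.7388, 32: 0.6864},
}

# For each dataset, pick the dimension whose AUC is highest.
best_dim = {name: max(scores, key=scores.get)
            for name, scores in auc_by_dim.items()}
```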

5.4.2 The effect of different combination methods of behavioral features on MBFKT

In order to explore the effects of different behavioral features (ST, RT, CT) on a student’s knowledge mastery state in the memory decline link and the memory enhancement link, the effects of three combination methods of behavioral features on MBFKT are compared to find the best combination method. The details of the three combination methods are shown in Table 4.

Combination method 1: We integrate all three behavioral features of answering exercises to model the changes in students’ knowledge mastery state during the time interval between two learning sessions, i.e., in both the memory decline link and the memory enhancement link [17].

Combination method 2: Starting from the two features of time interval and number of repetitions, we integrate ST and RT behavioral features to model a student’s memory decline link and use the behavioral feature CT alone to model the student’s memory enhancement link.

Combination method 3: Inspired by educational psychology theory, the behavioral feature ST is alone used to model a student’s memory decline link, and the behavioral features RT and CT are integrated to model the student’s memory enhancement link.

The experimental results are shown in Table 5, and the above three combination methods correspond to MBFKT₁, MBFKT₂, and MBFKT, respectively. In this study, Combination method 3 obtains the best AUC on the three datasets. This leads us to conclude that the different behavioral features affect different links of a student’s knowledge mastery state. ST affects the memory decline link, RT and CT affect the memory enhancement link. This result is consistent with the memory phenomenon embodied in educational psychology theory, which shows that it is correct for us to model a student’s learning process as different memory links. At the same time, it is in line with the rules of people’s learning and forgetting and explains the change in student’s knowledge mastery state in the learning process.

Table 5
AUC effect of different combination methods of behavioral features on MBFKT

Model (ours)	ASSISTments2012	ASSISTments2017	Slepemapy.cz
MBFKT₁	0.8054	0.8292	0.7254
MBFKT₂	0.8007	0.8160	0.7303
MBFKT	0.8414	0.8329	0.7445

5.4.3 Comparison experiments with the benchmark model

One of the primary tasks of KT is to predict students’ future exercise answer performance. In the experiments, we randomly use 80% data in the dataset as the training set and the other 20% as the test set. The improvement of the prediction performance of the proposed model MBFKT is verified by conducting comparison experiments with the benchmark models.
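A random 80/20 split of this kind can be sketched as follows. The text does not state whether the split is made over students or over individual interactions; splitting a list of per-student records with a fixed seed is an assumption made here for illustration.

```python
import random

def train_test_split(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed for reproducibility (assumption)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical student identifiers.
students = list(range(10))
train_set, test_set = train_test_split(students)
```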

The AUC and ACC results of MBFKT compared with the four benchmark models on the three datasets are shown in Table 6. The experimental results of BKT are taken from the literature [17, 24], those of DKT and DKT+F are from our own reproduction, and those of DKVMN are from open-source code available online. Table 6 shows that the AUC and ACC of MBFKT on the two ASSISTments datasets are better than those of the benchmark models. On the Slepemapy.cz dataset, however, MBFKT is not optimal, which we attribute to the excessive number of knowledge points in this dataset. This result indicates that MBFKT outperforms the benchmark models in predicting students’ future exercise answer performance, but an excessive number of knowledge points may affect its performance.

Table 6
Comparison experiments with the benchmark model

Model	ASSISTments2012		ASSISTments2017		Slepemapy.cz	
	AUC	ACC	AUC	ACC	AUC	ACC
BKT	0.6732	0.6310	0.5620	0.5550	0.7017	0.6730
DKT	0.6939	0.7264	0.7247	0.6212	0.7650	0.8159
DKVMN	0.6944	0.7125	0.6620	0.6717	0.6725	0.7815
DKT+F	0.7220	0.7416	0.7258	0.6108	0.7695	0.8143
MBFKT(Ours)	0.8412	0.7937	0.8329	0.7678	0.7445	0.8080

The comparison experiment results show that BKT has limitations in modeling a student’s mastery of a knowledge point as a discrete binary variable; therefore, BKT has the lowest predictive performance. DKT models a student’s overall knowledge mastery state but cannot model the student’s mastery state for each knowledge point. It also lacks modeling of the memory decline and enhancement links. Hence, the prediction performance of DKT on the two ASSISTments datasets is lower than that of MBFKT. However, DKT performs slightly better than MBFKT on the Slepemapy.cz dataset, possibly because DKT copes better than MBFKT with the large number of knowledge points in this dataset. The performance of DKVMN is lower than that of DKT. Because DKVMN constructs a student’s knowledge mastery state for each knowledge point, information about the relationship between knowledge points may be lost [24]. This phenomenon is more obvious for the Slepemapy.cz dataset, which contains 1458 knowledge points. Both DKVMN and MBFKT can model a student’s mastery of a single knowledge point. However, DKVMN ignores the changes in the knowledge mastery state brought about by a student’s behavioral features during the learning period. Although DKT+F considers a student’s exercise answer behaviors when learning, its prediction performance is lower than that of MBFKT on the two ASSISTments datasets due to the limitations of RNN. In contrast, MBFKT not only integrates the behavioral features of students in the learning process based on the multi-head attention mechanism, but also models the learning process of students as three links: memory decline, memory enhancement, and memory update, which can more accurately model the change of students’ knowledge mastery state in the learning process. Finally, when predicting the performance on the exercises, the information modeling of the exercises is further improved by integrating the exercises’ response time, which improves the prediction performance of MBFKT. On the whole, MBFKT achieves the best prediction performance on the two ASSISTments datasets, and its improvement over the benchmark models there is substantial.

5.4.4 Ablation experiments

This subsection tests the validity of the behavioral features mentioned in the study, and the variability of different behavioral features affecting a student’s knowledge mastery state. In addition, we experimentally verify the model performance improvement by the multi-head attention networks and the improved method of updating the knowledge mastery state.

(1) Validity analysis of the exercise’s response time factor

To investigate the validity of the response time factor of the exercise in the prediction performance of the future exercise, we conduct experiments by ablating the exercise’s response time factor in DKVMN and MBFKT. Among them, DKVMN+FT is DKVMN that considers the response time FT of the predicted exercises, and MBFKT-FT is MBFKT that does not consider the response time FT of the predicted exercises. The specific experimental results are shown in Table 7.

Table 7
Validity analysis of the exercise’s response time factor

Model	ASSISTments2012	ASSISTments2017	Slepemapy.cz
	AUC	AUC	AUC
DKVMN	0.6944	0.6620	0.6725
DKVMN+FT	0.6922	0.6684	0.6980
MBFKT-FT	0.8183	0.8152	0.7229
MBFKT(Ours)	0.8412	0.8329	0.7445

According to the results in Table 7, the impact of adding the exercise’s response time factor to DKVMN is not consistent across datasets because of the variability of the datasets; DKVMN+FT improves most noticeably over DKVMN on the Slepemapy.cz dataset. In contrast, after adding the personalized feature of a student’s exercise response time to MBFKT, the AUC on all datasets improves more significantly compared with MBFKT-FT. This indicates that MBFKT can better capture the modeling information of the exercises than DKVMN, and that the predictive performance of MBFKT can be enhanced by incorporating the exercise’s response time factor into the information modeling of the exercises.

(2) Extended study of the exercise’s response time factor

To further investigate the effect of the exercise response time factor, we add MBFKT’ as a comparison experiment. MBFKT’ is an MBFKT variant that uses the real response time FT’ of the predicted exercises. The two models differ only in the exercise’s response time factor. The specific experimental results are shown in Table 8.

Table 8

Extended study of the exercise’s response time factor

Model (Ours)	ASSISTments2012	ASSISTments2017	Slepemapy.cz
	AUC	AUC	AUC
MBFKT	0.8412	0.8329	0.7445
MBFKT’	0.8495	0.8532	0.7756

According to the experimental results in Table 8, we can learn that although MBFKT considers the response time of historical exercises to approximate the real response time of the predicted exercises, the results are not accurate enough. So MBFKT’ is set to consider the student’s real response time to the exercises. The experimental results show that MBFKT’ is further improved compared to MBFKT on the three datasets. It verifies the validity of the exercise’s response time factor again and shows that our model still has room for improvement.

(3) Different behavioral features affect the variability of a student’s knowledge mastery state

In order to eliminate the influence of the estimated response time of the exercises on other factors in the ablation experiments, the following ablation experiments in this paper are conducted based on the real response time FT’ of the predicted exercises. To focus on the variability of different behavioral features affecting a student’s knowledge mastery state, we conduct the ablation experiments in Table 9, where MBFKT’-ST is MBFKT’ without the behavioral feature ST, and MBFKT’-RTCT is MBFKT’ without the behavioral features RT and CT.

Table 9

Comparison of the results of different behavioral features affecting student’s knowledge mastery state

Model (Ours)	ASSISTments2012	ASSISTments2017	Slepemapy.cz
	AUC	AUC	AUC
MBFKT’-ST	0.8371	0.8236	0.6891
MBFKT’-RTCT	0.8478	0.8342	0.7710
MBFKT’	0.8495	0.8532	0.7756

Based on the results in Table 9, we can learn that ablating different behavioral features of the student causes various degrees of decrease in the model AUC. The model that considers more influences of behavioral features has higher AUC and better performance. In other words, more comprehensive behavioral features can more accurately model the student’s knowledge mastery state.

(4) Impact of multi-head attention networks on model performance

To investigate the effect of multi-head attention networks on the model performance, we conduct ablation experiments, as shown in Table 10. Where MBFKT’-MHA is MBFKT’ without considering multi-head attention networks.

Table 10

Analysis of the validity of multi-head attention networks

Model (Ours)	ASSISTments2012	ASSISTments2017	Slepemapy.cz
	AUC	AUC	AUC
MBFKT’-MHA	0.8443	0.8414	0.7630
MBFKT’	0.8495	0.8532	0.7756

Based on the results in Table 10, we can learn that MBFKT’-MHA leads to lower model performance than MBFKT’ due to the lack of attention mechanism. The prediction performance of MBFKT’ can be improved by learning the influence degrees of different behavioral features on the knowledge mastery state with the help of multi-head attention networks.

(5) Impact of different knowledge mastery state update methods on model performance

Finally, to investigate the effects of different knowledge mastery state update methods, we conduct the ablation experiments shown in Table 11. MBFKT’-WT is MBFKT’ for updating students’ knowledge mastery state by using the knowledge relevance weight updating method in the memory decline link.

Table 11

Comparison of the results of different knowledge mastery update methods

Model (Ours)	ASSISTments2012	ASSISTments2017	Slepemapy.cz
	AUC	AUC	AUC
MBFKT’-WT	0.8222	0.8007	0.7125
MBFKT’	0.8495	0.8532	0.7756

Based on the results in Table 11, we can learn that the memory decline link using the global update is better than the one based on the knowledge relevance weight. A long interval since a student’s last learning session can lead to memory decline for all knowledge points, not just the relevant ones.

This section compares the experimental results of different ablation models on all datasets. The most intuitive phenomenon is that MBFKT that considers more influence factors has a higher AUC and better performance. Based on the above experiments, the conclusions are as follows.

The validity of the response time factor of the exercises is verified. Refining the information modeling of the exercises can be another research direction to improve the model’s prediction performance.

The changes in students’ knowledge mastery state can be better modeled by considering behavioral features more comprehensively. And the influence degrees of different behavioral features on the knowledge mastery state are different, as we demonstrate by applying multi-head attention.

Our improved method of updating the knowledge mastery state can improve the performance of MBFKT.

5.5 Visualization of knowledge tracing results

Another primary task of KT is tracing the dynamic changes in a student’s knowledge mastery state $M_{t}^{stu}$ over time, that is, outputting the student’s mastery level of each knowledge point in real time. To verify this task, we visualize the KT results, which can quickly locate the student’s weak knowledge points, help him or her fill the gaps, and support personalized learning assistance.

For in-depth analysis, we select a student’s historical exercise interaction sequence to trace the trend of the knowledge mastery level changes over 30 time steps. We visualize the memory slots of five potential knowledge points in the student’s $M_{t}^{stu}$ . Fig. 4 (left) shows the changes in student’s knowledge mastery level. Its value is in the range of 0 to 1. The larger the value, the darker the color. That is, the higher the student’s mastery of this knowledge point. Over time, the student’s knowledge mastery level changes. Fig. 4 (right) is a radar plot of the student’s learning ability, depicting the corresponding ability values before (S1) as well as after (S1’) the student learns.

Fig. 4

Graph of changes in a student’s knowledge mastery level (left) and radar graph of learning ability (right).

In Fig. 4 (left), the first column represents the student’s initial mastery level of the five knowledge points. The student’s knowledge mastery level changes with each time step, rather than alternating between mastery and non-mastery. Specifically, the student’s potential knowledge mastery level increases (decreases) each time the student answers an exercise correctly (incorrectly). In Fig. 4 (right), after answering 30 exercises, the student has made significant progress on knowledge point k₁, while his or her learning ability for knowledge point k₅ has slightly decreased. The student’s learning ability for knowledge point k₃ has remained weak, indicating that he or she has not been able to master knowledge point k₃ in the learning process. Therefore, after tracing 30 exercises, the student has mastered knowledge points k₁, k₂, k₄, and k₅, whereas knowledge point k₃ needs to be strengthened.

6 Conclusion

Based on the theory of educational psychology, combined with the information of students’ behavioral features in the learning process, this study designs a new KT model named MBFKT, which is based on a multi-head attention network, a memory network, and a recurrent neural network. MBFKT can trace changes in students’ knowledge mastery state and predict their performance in future exercises. We model the learning process of a student in the form of a memory decline link, a memory enhancement link, and a memory update link, which is consistent with the rules of human memory and appreciably increases the interpretability of MBFKT. We also give a method of extracting behavioral features from students’ historical attempted exercises and incorporate information about behavioral features into MBFKT. Extensive experiments are conducted on three datasets, and the performance of MBFKT relative to the benchmark models is verified. Finally, we visualize the tracing results of a student’s knowledge mastery state, from which the student’s weak knowledge points can be identified and appropriate remedies proposed. The experimental results demonstrate the validity and interpretability of our proposed model.

We think future work can be explored in two directions. First, to improve the interpretability of KT, graph neural networks can be used to learn the structural information of a knowledge graph when building a knowledge tracing model [38, 39]. Second, to address the long-term dependency problem, the traditional RNN can be replaced by better time series models to further improve the prediction performance of KT.

Acknowledgment

This research is supported by the Shandong Postgraduate Education Quality Improvement Plan (Grant No. SDYJG19075), the Shandong Education Teaching Research Key Project (Grant No. 2021JXZ010), the Qingdao City Philosophy and Social Science Planning Project (Grant No. QDSKL2201132), the National Statistical Science Research Project (Grant No. 2021LY053), the Shandong Province Education Science Planning Innovation Literacy Project (Grant No. 2022CYB280), and the Shandong University of Science and Technology Education and Teaching Research Stars Plan Project (Grant No. QX2022ZD07).

References

1. Zeng, Zhao, and Liang, Course ontology-based users knowledge requirement acquisition from behaviors within e-learning systems, Computers & Education 53(3) (2009), 809–818.

2. Liu, Shen, Huang, Chen, and Zheng, A survey of knowledge tracing, arXiv preprint arXiv:2105.15106, 2021.

3. Xie, Chu H.-C., Hwang G.-J., and Wang C.-C., Trends and development in technology-enhanced adaptive/personalized learning: A systematic review of journal publications from 2007 to 2017, Computers & Education 140 (2019), 103599.

4. Abdelrahman, Wang, and Nunes B.P., Knowledge tracing: A survey, arXiv preprint arXiv:2201.06953, 2022.

5. Schiaffino, Garcia, and Amandi, eTeacher: Providing personalized assistance to e-learning students, Computers & Education 51(4) (2008), 1744–1754.

6. Nabizadeh A.H., Gonçalves, Gama, Jorge, and Rafsanjani H.N., Adaptive learning path recommender approach using auxiliary learning objects, Computers & Education 147 (2020), 103777.

7. Diao, Zeng, Li, Duan, Zhao, and Song, Personalized learning path recommendation based on weak concept mining, Mobile Information Systems 2022 (2022).

8. Huo, Wong D.F., Ni L.M., Chao L.S., and Zhang, Knowledge modeling via contextualized representations for LSTM-based personalized exercise recommendation, Information Sciences 523 (2020), 266–278.

9. Hooshyar, Huang Y.-M., and Yang, GameDKT: Deep knowledge tracing in educational games, Expert Systems with Applications 196 (2022), 116670.

10. Ding and Larson E.C., Incorporating uncertainties in student response modeling by loss function regularization, Neurocomputing 409 (2020), 74–82.

11. Piech, Bassen, Huang, Ganguli, Sahami, Guibas L.J., and Sohl-Dickstein, Deep knowledge tracing, Advances in Neural Information Processing Systems 28 (2015).

12. Zhang, Shi, King, and Yeung D.-Y., Dynamic key-value memory networks for knowledge tracing, in: Proceedings of the 26th International Conference on World Wide Web (2017), pp. 765–774.

13. Yang and Cheung L.P., Implicit heterogeneous features embedding in deep knowledge tracing, Cognitive Computation 10(1) (2018), 3–14.

14. Nagatani, Zhang, Sato, Chen Y.-Y., Chen, and Ohkuma, Augmenting knowledge tracing by considering forgetting behavior, in: The World Wide Web Conference (2019), pp. 3101–3107.

15. Sun, Zhao, Ma, Yuan, He, and Feng, Mutibehavior features based knowledge tracking using decision tree improved DKVMN, in: Proceedings of the ACM Turing Celebration Conference-China (2019), pp. 1–6.

16. Huang, Yang, Li, Xie, Geng, and Zhang, A dynamic knowledge diagnosis approach integrating cognitive features, IEEE Access 9 (2021), 116814–116829.

17. …, Wei, Zhang, Du, and Yu, LFKT: A deep knowledge tracking model integrating learning and forgetting, Journal of Software (2021).

18. Liu, Huang, Yin, Chen, Xiong, Su, and Hu, EKT: Exercise-aware knowledge tracing for student performance prediction, IEEE Transactions on Knowledge and Data Engineering 33(1) (2019), 100–115.

19. Ebbinghaus, Memory: A contribution to experimental psychology, Annals of Neurosciences 20(4) (2013), 155.

20. Murre J.M. and Dros, Replication and analysis of Ebbinghaus forgetting curve, PloS One 10(7) (2015), e0120644.

21. Bailey C.D., Forgetting and the learning curve: A laboratory study, Management Science 35(3) (1989), 340–352.

22. Pelánek, Modeling students’ memory for application in adaptive educational systems, International Educational Data Mining Society (2015).

23. Sun, Zhao, Li, Ma, Sutcliffe, and Feng, Dynamic key-value memory networks with rich features for knowledge tracing, IEEE Transactions on Cybernetics (2021).

24. Song, Li, Lei, Zhao, Chen, and Mian, Bi-CLKT: Bi-graph contrastive learning based knowledge tracing, Knowledge-Based Systems 241 (2022), 108274.

25. Song, Li, Tang, Zhao, and Guan, JKT: A joint graph convolutional network based deep knowledge tracing, Information Sciences 580(4698) (2021).

26. Khajah, Wing, Lindsey, and Mozer, Integrating latent-factor and knowledge-tracing models to predict individual differences in learning, in: Educational Data Mining 2014, Citeseer, 2014.

27. Qiu, Qi, Lu, Pardos Z.A., and Heffernan N.T., Does time matter? Modeling the effect of time with Bayesian knowledge tracing, EDM (2011), 139–148.

28. Khajah, Lindsey R.V., and Mozer M.C., How deep is knowledge tracing? arXiv preprint arXiv:1604.02416, 2016.

29. Shin, Shim, Yu, Lee, Kim, and Choi, SAINT+: Integrating temporal features for EdNet correctness prediction, in: LAK21: 11th International Learning Analytics and Knowledge Conference (2021), pp. 490–496.

30. Pardos Z.A. and Heffernan N.T., KT-IDEM: Introducing item difficulty to the knowledge tracing model, in: International Conference on User Modeling, Adaptation, and Personalization, Springer (2011), pp. 243–254.

31. Abdelrahman and Wang, Knowledge tracing with sequential key-value memory networks, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (2019), pp. 175–184.

32. Ghosh, Heffernan, and Lan A.S., Context-aware attentive knowledge tracing, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2020), pp. 2330–2339.

33. Pandey and Karypis, A self-attentive model for knowledge tracing, 2019.

34. Pandey and Srivastava, RKT: Relation-aware self-attention for knowledge tracing, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management (2020), pp. 1205–1214.

35. Choi, Lee, Cho, Baek, Kim, Cha, Shin, Bae, and Heo, Towards an appropriate query, key, and value computation for knowledge tracing, in: Proceedings of the Seventh ACM Conference on Learning @ Scale (2020), pp. 341–344.

36. Shen, Liu, Chen, Wu, Huang, Zhao, Su, Ma, and Wang, Convolutional knowledge tracing: Modeling individualization in student learning process, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (2020), pp. 1857–1860.

37. Nakagawa, Iwasawa, and Matsuo, Graph-based knowledge tracing: Modeling student proficiency using graph neural network, in: IEEE/WIC/ACM International Conference on Web Intelligence (2019), pp. 156–163.

38. Yang, Shen, Qu, Liu, Wang, Zhu, Zhang, and Yu, GIKT: A graph-based interaction model for knowledge tracing, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer (2020), pp. 299–315.

39. Wen, Kang, Zeng, Duan, Chen, and Li, Session-based recommendation with GNN and time-aware memory network, Mobile Information Systems 2022 (2022).


Table 3
AUC values of MBFKT with different embedding vector dimensions d on the three datasets

| d | ASSISTments2012 (AUC) | ASSISTments2017 (AUC) | Slepemapy.cz (AUC) |
|---|---|---|---|
| 8 | 0.7990 | 0.7941 | 0.7445 |
| 16 | 0.7996 | 0.8329 | 0.7388 |
| 32 | 0.8414 | 0.8009 | 0.6864 |
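As a small worked reading of Table 3, the sketch below picks, for each dataset, the embedding dimension with the highest AUC. The AUC values are copied from the table; the dictionary layout and the helper name `best_dimension` are illustrative assumptions, not part of the original experiments.

```python
# AUC values copied from Table 3; inner keys are the embedding dimension d.
table3 = {
    "ASSISTments2012": {8: 0.7990, 16: 0.7996, 32: 0.8414},
    "ASSISTments2017": {8: 0.7941, 16: 0.8329, 32: 0.8009},
    "Slepemapy.cz": {8: 0.7445, 16: 0.7388, 32: 0.6864},
}


def best_dimension(auc_by_d):
    """Return the (d, AUC) pair with the highest AUC."""
    return max(auc_by_d.items(), key=lambda item: item[1])


best = {name: best_dimension(aucs) for name, aucs in table3.items()}
for name, (d, auc) in best.items():
    print(f"{name}: best d = {d} (AUC = {auc:.4f})")
```

The best dimension differs across the three datasets (32, 16, and 8 respectively), so a single fixed d would not be optimal for all of them.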

Table 5
AUC effect of different combination methods of behavioral features on MBFKT

| Model (ours) | ASSISTments2012 (AUC) | ASSISTments2017 (AUC) | Slepemapy.cz (AUC) |
|---|---|---|---|
| MBFKT₁ | 0.8054 | 0.8292 | 0.7254 |
| MBFKT₂ | 0.8007 | 0.8160 | 0.7303 |
| MBFKT | 0.8414 | 0.8329 | 0.7445 |

Table 7
Validity analysis of the exercise’s response time factor

| Model | ASSISTments2012 (AUC) | ASSISTments2017 (AUC) | Slepemapy.cz (AUC) |
|---|---|---|---|
| DKVMN | 0.6944 | 0.6620 | 0.6725 |
| DKVMN+FT | 0.6922 | 0.6684 | 0.6980 |
| MBFKT-FT | 0.8183 | 0.8152 | 0.7229 |
| MBFKT (Ours) | 0.8412 | 0.8329 | 0.7445 |
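To make the comparison in Table 7 concrete, the following sketch computes the AUC difference between each model with and without the response time factor. It assumes, from the model names only, that "DKVMN+FT" is DKVMN with the factor added and "MBFKT-FT" is MBFKT with the factor removed; the helper name `auc_gain` is hypothetical and the AUC values are copied from the table.

```python
datasets = ("ASSISTments2012", "ASSISTments2017", "Slepemapy.cz")

# AUC values copied from Table 7, in the dataset order above.
table7 = {
    "DKVMN": (0.6944, 0.6620, 0.6725),
    "DKVMN+FT": (0.6922, 0.6684, 0.6980),
    "MBFKT-FT": (0.8183, 0.8152, 0.7229),
    "MBFKT(Ours)": (0.8412, 0.8329, 0.7445),
}


def auc_gain(with_factor, without_factor):
    """AUC difference per dataset (variant with the factor minus variant without)."""
    pairs = zip(table7[with_factor], table7[without_factor])
    return {name: round(a - b, 4) for name, (a, b) in zip(datasets, pairs)}


# Assumed reading of the names: "+FT" adds the factor, "-FT" removes it.
mbfkt_gain = auc_gain("MBFKT(Ours)", "MBFKT-FT")
dkvmn_gain = auc_gain("DKVMN+FT", "DKVMN")

print("MBFKT gain:", mbfkt_gain)
print("DKVMN gain:", dkvmn_gain)
```

Under this reading, the factor improves MBFKT on all three datasets by roughly 0.02 AUC, while its effect on DKVMN is mixed: slightly negative on ASSISTments2012 and positive on the other two datasets.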