Supervised machine learning models for student performance prediction

Abstract

Educational Data Mining has become an effective technique for revealing relationships hidden in educational data and for predicting students’ learning outcomes. One can analyze data extracted from the students’ activity, educational and social behavior, and academic background. The outcomes are the following: a personalized learning procedure, a feasible engagement with students’ behavior, and a predictable interaction of the students with the learning processes and data. In the current work, we apply several supervised methods to predict the students’ academic performance. We show that a voting procedure over the three most accurate classifiers, each used with its default parameters, can produce better results than any single tuned learning algorithm.

Keywords

Prediction of the students’ performance, contact sessions, distance learning, combination of classification algorithms, machine learning

1. Introduction

Nowadays, many universities offer innovative and high-quality education via the distance learning method. The adoption of digital technologies in education, and more generally in society, establishes a necessity for data and learning analytics. As the students’ needs vary, the way the distance learning courses are offered should be adapted to these needs and also to the students’ prior knowledge. In conventional education, the instructors and the auxiliary staff guide and assist the students through the learning procedure by using different types of learning material [1, 2, 3]. In distance learning, the students schedule their study time by themselves, and the successful completion of each course depends on the students’ effort but also on additional parameters, like their educational background, the psychological support of their family, and their job obligations. Many institutions that apply the distance learning methodology include in their teaching procedure some contact sessions, that is, non-obligatory face-to-face meetings with lecturers and other students where questions are asked and issues are clarified [4].

The performance prediction is useful information, especially when it concerns students who fail their quizzes and final exams, as the instructors can intervene during the academic year and adapt the applied educational methodology to these students’ learning style.

The prediction of the students’ performance is an active topic in the field of machine learning, and Educational Data Mining is a viable solution in this domain. The goal is the analysis of the students’ learning behavior as well as the prediction of their performance. Educational issues can be addressed with machine learning methodologies [5], and supervised classification methods are a viable approach for this field.

In [6], the authors use Excel Pivot Tables to visualize the learners’ participation in a blended learning course. The proposed methodology is useful for discovering data that affect the prediction of learners’ performance and the optimization of the course organization. Thus, a tutor is able to advise the students during the learning procedure. The authors in [7] study the opportunity that Learning Management Systems give a data scientist to record the students’ learning activity. This potential helps students, faculty, and administration to improve important issues such as the learning procedure, the course design, and the educational methodology. The cited work aims to analyze learners’ behavior and performance in order to reveal abilities useful for enhancing the functionality of a distance learning course. In this study, we extend the above methodologies by using data mining and artificial intelligence in order to predict the students’ performance.

2. Research questions

1st Research Question: Is it possible to predict the students’ performance when the only available data derive from the students’ participation in the Contact Sessions and from their grades in the submitted projects? We tried to answer the 1st research question by applying machine learning methodologies to a dataset containing the aforementioned information.

2nd Research Question: Is it feasible to discover the important parameters of the data, in order to predict the students’ performance via machine learning algorithms? To answer this research question, we experimented with various prediction methods and we used game theory to find a unique solution by combining these methods. We were able to personalize the feedback to the students’ learning tactics. Our proposed methodology allows the instructor to intervene early in the learning procedure by alerting the at-risk students.

The rest of the paper is structured as follows: Section 3 presents recent works regarding the application of classification methods in education. A detailed description of the dataset and the attributes used in this study is presented in Section 4. We define the algorithms in Section 5. We present our experiments and our results in Section 6. The significance of each parameter in the proposed methodology is presented in Section 7. In the same section, the methodology for predicting the students’ grades in their final exams is also presented. Finally, in Section 8, we communicate our conclusions and we present our future research focus.

3. Related work

A decision tree classifier is built in [8] for detecting students at risk of failing in an early undergraduate computer programming course. The authors use data related to students’ assessment marks, their activity in an automatic marking system, and their participation in discussion forums, in order to identify the low performers early. By using only the assessment attributes, the authors achieve an accuracy of 83%, while by using all attributes they reach an accuracy of 87%.

The authors in [9] investigate the effectiveness of familiar supervised techniques (Naïve Bayes, SVM, J48 decision tree, and multilayer neural networks) for the early prediction of at-risk students. As a risk, the study considers the eventual failure in two introductory programming courses at a Brazilian university. The first dataset refers to 262 students attending a 10-week distance course and the second one refers to 161 students attending a 16-week on-campus course. Both datasets contain several demographic, social, and university attributes. The experimental results show that EDM methods are effective in the early identification of low performers. Moreover, the effectiveness is enhanced after data preprocessing and fine-tuning of the algorithms.

A behavioral analysis of the students and the effect of their behavior on their performance are presented in [10]. The authors utilize data derived from smartcards. The metric for the performance estimation is based on the grade of each course, and the poor performers are considered to be at risk. The methods used to predict the performance include Decision Tree, Random Forest, Ridge Regression, and other algorithms. In order to estimate students’ similarity, the authors consider the presence of students in the same place (library, dormitories, teaching buildings, etc.) within a short time interval (one minute). The sleep pattern is also taken into account, as students who wake up later perform worse. Moreover, the authors study the effect in different majors and during different semesters.

A study of various contingency tables is presented in [11]. The authors use these tables to predict the students’ performance. Specifically, the performance in online exercises during the semester is used to predict the students’ performance in paper-based exams at the end of the semester. It is highlighted that the number of a student’s attempts to answer the online exercises is associated with the grades in the final exams.

The application of machine learning techniques in education is demonstrated in [12]. The authors analyze a case study on the prediction of the students’ grades. Demographic data and grades in the written assignments are used to predict students’ performance with a regression method. In addition, the authors provide a novel software tool appropriate for tutors.

In a recent study [13], machine learning methodologies such as Artificial Neural Network, K-Nearest Neighbor, K-Means Clustering, Naïve Bayes, Support Vector Machine, Logistic Regression, and Decision Tree are presented. The authors implement the prediction of the at-risk students in Weka. This prediction takes place before the students start their new academic program, i.e., it relies on pre-start data. It is highlighted that the Auto-Weka tool [14] performs the selection of the algorithm automatically and with better accuracy.

An overview of studies on the prediction of the students’ performance is given in [15]. Some researchers use the cumulative grade point average (CGPA) and internal assessment as data sets. The most common approaches for this prediction are Decision Trees and Neural Networks.

According to [16], we are able to predict the students’ failure in distance learning by using the most common classification algorithms. The authors suggest the development of a framework, which can transfer, manage and implement Machine Learning to tackle the imbalance problem in the final examinations.

A drawback of binary decision trees is addressed in [17]. The authors consider that the sharing of datasets among various organizations gives rise to this drawback. The goal is to preserve the privacy of sensitive patterns. Furthermore, the authors compare their proposed methodology with other methodologies such as output perturbation or cryptography. The results lead the authors to conclude that the proposed method is better. In addition, with the proposed methodology, the decision tree rules are hidden with minimal impact on the other rules.

Moreover, the authors in [18] suggest that privacy-preserving record linkage can be addressed with private blocking techniques. The proposed technique is based on sorted nearest neighborhood clustering and it is twice as fast as other existing private blocking methodologies. We also use decision trees and k-Nearest Neighbor algorithms in our experiments.

Data coming from the learning management system Moodle are used in [19]. In addition, data have been collected from questionnaires. The goal of this cited work is to predict the students’ success or failure to pass the lesson. For this purpose, the authors use four different methods (visualization of the variables determined by the questionnaires, C4.5 decision tree, class association rules, and k-means implementation). The R language and the Weka tool are used for the implementation of the proposed methodologies.

Moreover, the authors in [20] suggest combining classification algorithms in order to enhance the prediction accuracy of students’ performance. For the combination, the authors use the voting methodology with the Naive Bayes, 1-NN, and WINNOW algorithms.

In [21] multimodal techniques are employed for the analysis of the learning activities in a laboratory classroom. Video technology is utilized for this aim. The techniques offer great research possibilities for the investigation of various classroom and pedagogical conditions. These possibilities are appropriate for the optimization of classroom teaching and learning. The data come from audio and video recordings.

In [22], a student model is proposed for a web-based course on the programming language C. The system in which this student model is incorporated attempts to identify the needs, background, and prior knowledge of the learners. This is done before the interaction between the system and the student, so that a better experience is available. Data analysis is used for the creation of the model. The data are collected from an initial implementation of the system, named ELaC.

Many aspects of Learning Analytics (LA) are covered in the chapters of [23]. Specifically, subjects such as LA in distance learning in postsecondary education, prediction of students’ performance in a blended introductory course, prediction of dropout students, and dashboards for project and problem-based learning are discussed in this book.

Some well-known algorithms are used in [24] to remove noisy and redundant data from the data set used in predictive data mining. It is highlighted that this methodology makes knowledge discovery an easier task.

Table 1
Dataset parameters used in the study

Parameter	Contact sessions	Projects	Final grade-status
Number	$i=$ 1, 2, 3, 4, 5	$i=$ 1, 2, 3, 4, 5	2
Type	Binary	Real	Binary
Values	0, 1	[0, 10]	0, 1
Meaning	Absence/presence in the i-th optional contact session	Grade of the i-th project	Final grade 0-fail, 1-pass

4. Dataset

In this study, we used a dataset provided by the Hellenic Open University (HOU), an institution which specializes in open and distance learning education and always seeks modern and innovative ways of communicating knowledge [25, 26, 27]. Our study focused on the year-long module called “Algorithms and Complexity”, an introductory module on algorithms in the Informatics graduate programme. We characterize each instance by the values of 10 variables (Table 1): the grades of the five submitted projects and the attendance of the five non-obligatory consulting sessions (also known as contact sessions (CS)).

The completion of the module requires the submission of five projects and a grade equal to or higher than five in the final exams. Students can sit the final exams of the module only if they have successfully completed four or five of the projects with a total score of twenty-five or more, each project being graded on a ten-point scale. During the academic year, students have the opportunity to attend, if they wish, five consulting sessions of four hours each.

Figures 1 to 10 illustrate our dataset, and the following observations apply to each pair of figures. In Figs 1, 3, 5, 7, and 9, the orange bar shows the percentage of students, over the total number of students, who attended the corresponding consulting session, while the blue bar shows the percentage who did not attend it. All these figures concern the students who passed the course. In Figs 2, 4, 6, 8, and 10, the orange bar denotes the range of the grades in the corresponding submitted project for the students who passed the final exams, and the blue bar denotes the range of the grades of the same project for the students who failed the final exams. The vertical line of each interval denotes the mean grade of the range.

In Fig. 1, we observe that the majority of the successful students have attended the first consulting session.

Figure 1.

Percentage of the successful students who have attended or not the contact session 1.

Figure 2.

Grades range of students who pass or fail the submitted project1.

Figure 3.

Percentage of the successful students who have attended or not the contact session 2.

Figure 4.

Grades range of students who pass or fail the submitted project2.

Figure 5.

Percentage of the successful students who have attended or not the contact session 3.

Figure 6.

Grades range of students who pass or fail the submitted project3.

Figure 7.

Percentage of the successful students who have attended or not the contact session 4.

Figure 8.

Grades range of students who pass or fail the submitted project4.

Figure 9.

Percentage of the successful students who have attended or not the contact session 5.

Figure 10.

Grades range of students who pass or fail the submitted project5.

Based on Fig. 2 it is apparent that the successful students who attended the first contact session have higher grades in the first submitted project than their fellow students who did not attend this particular session.

According to Fig. 3, many successful students attended the second contact session. The percentage of the students who passed the exams without attending the second contact session, over the total number of students who did not attend the second session, is low. Thus, only a small number of students were not positively influenced by attending the second consulting session.

In addition, as we observe in Fig. 4 all the successful students who attended the second session obtained higher grades than those who did not participate in it.

Similar observations can be made for those students who did and did not participate in the third contact session, as presented in Figs 5 and 6.

Figure 7 demonstrates that the successful students who attended the fourth contact session are almost as many as the successful students who did not.

As is apparent in Fig. 8, the grades of the students who participated in the fourth consulting session are higher than the grades of those students who did not participate in it. The difference between the mean grades is not negligible.

According to Fig. 9, the percentage of the successful students who attended the final consulting session is higher than the percentage of the successful students who did not. The successful students who attended contact sessions 1, 2, 3, and 5 outnumber those who attended contact session 4.

From Table 2, we observe that most students attended the first two Contact Sessions (CS). Furthermore, the students who attended CS2 are slightly fewer than the students who attended CS1.

Table 2

Combination of contact sessions participation and relevant statistics

	CS1	CS2	Total	MeanFinalGrade	MedianFinalGrade
0	1	1	35	5.728857	5.70
1	1	0	14	4.300000	5.00
2	0	1	10	3.590000	5.05
3	0	0	21	3.452381	5.00
	CS1	CS2	Total	PercentagePassModule	NumberPassModule
0	1	1	35	0.885714	31
1	1	0	14	0.642857	9
2	0	1	10	0.600000	6
3	0	0	21	0.523810	11
	CS1	CS5	Total	MeanFinalGrade	MedianFinalGrade
0	1	1	34	5.938529	5.90
1	0	1	5	5.800000	5.40
2	1	0	15	3.920000	5.00
3	0	0	26	3.053846	2.75
	CS1	CS5	Total	PercentagePassModule	NumberPassModule
0	0	1	5	1.000000	5
1	1	1	34	0.882353	30
2	1	0	15	0.666667	10
3	0	0	26	0.461538	12
	CS1	CS3	Total	MeanFinalGrade	MedianFinalGrade
0	1	1	34	5.991471	5.9
1	0	1	5	5.080000	5.6
2	1	0	15	3.800000	5.0
3	0	0	26	3.192308	4.0
	CS1	CS3	Total	PercentagePassModule	NumberPassModule
0	1	1	34	0.911765	31
1	0	1	5	0.800000	4
2	1	0	15	0.600000	9
3	0	0	26	0.500000	13
	CS3	CS5	Total	MeanFinalGrade	MedianFinalGrade
0	1	1	32	6.059687	5.9
1	0	1	7	5.285714	5.3
2	1	0	7	5.028571	5.0
3	0	0	34	3.029412	3.1
	CS3	CS5	Total	PercentagePassModule	NumberPassModule
0	1	1	32	0.906250	29
1	0	1	7	0.857143	6
2	1	0	7	0.857143	6
3	0	0	34	0.470588	16

Table 3

Grade value limit of project1 to project4 and the corresponding percentage of the successful students

	project3	Total	PercentagePassModule	NumberPassModule
0	$>=$ 6.5	58.0	0.862069	50.0
1	$<$ 6.5	22.0	0.318182	7.0
	project1	Total	PercentagePassModule	NumberPassModule
0	$>=$ 3.5	76.0	0.75	57.0
1	$<$ 3.5	4.0	0.00	0.0
	project2	Total	PercentagePassModule	NumberPassModule
0	$>=$ 5	73.0	0.739726	54.0
1	$<$ 5	7.0	0.428571	3.0
	project4	Total	PercentagePassModule	NumberPassModule
0	$>=$ 5	59.0	0.813559	48.0
1	$<$ 5	21.0	0.428571	9.0

Table 4

Participation of the students in each Contact Session and the corresponding mean project grade and percentage of passed students

CS1	TotalSubmissionProject1	MeanGradeProject1	MedianGradeProject1	PercentagePassModule	NumberPassModule
1	49.0	7.632653	8.0	0.816327	40.0
0	31.0	6.246667	6.2	0.548387	17.0
CS2	TotalSubmissionProject2	MeanGradeProject2	MedianGradeProject2	PercentagePassModule	NumberPassModule
1	45.0	8.108889	8.1	0.822222	37.0
0	35.0	6.754286	7.0	0.571429	20.0
CS3	TotalSubmissionProject3	MeanGradeProject3	MedianGradeProject3	PercentagePassModule	NumberPassModule
1	39.0	7.987179	8.2	0.897436	35.0
0	41.0	6.763415	7.3	0.536585	22.0
CS4	TotalSubmissionProject4	MeanGradeProject4	MedianGradeProject4	PercentagePassModule	NumberPassModule
1	29.0	6.606897	7.0	0.758621	22.0
0	51.0	6.033333	6.0	0.686275	35.0
CS5	TotalSubmissionProject5	MeanGradeProject5	MedianGradeProject5	PercentagePassModule	NumberPassModule
1	39.0	5.412821	5.4	0.897436	35.0
0	41.0	3.536585	3.3	0.536585	22.0

Figure 11.

Grades distribution of each submitted project.

The grades of the students who attended the fifth contact session are better than those of the students who were absent from this session, as we can observe in Fig. 10. It should also be noted that many students have a failing grade in the submitted project5, and all of them were absent from the corresponding contact session. This concerns almost half of the successful students, which can be attributed to the difficulty of the final project. It also suggests that half of the successful students made an effort to study the submitted projects by themselves, without the help of a third person.

In Figs 2, 4, and 8, there is a small overlap between the grades of the students who attended the first, second, and fourth contact sessions and the grades of the students who did not participate in these sessions. Figure 10 shows that this overlap is bigger for the fifth and final session. This can be attributed to the fact that many students in this class prefer only remote communication with their tutor and their fellow students. Alternatively, there is a chance that those students assigned their projects to third persons and, as a result, had no questions to pose.

By using and combining classification algorithms with this dataset, we are able to predict the students’ performance. Moreover, we can locate the most informative attributes for the prediction. In the following sections, we present a methodology with this goal.

Initially, we describe the algorithms with which we experiment. Furthermore, we compare the combinations of contact sessions in pairs. Each combination is represented by a binary number with two digits, where a digit equal to 1 denotes that the students were present in the corresponding contact session. For each pair, Table 2 reports the mean and the median grade in the final exams, the percentage of the successful students, and the number of those students.

Table 3 gives, for each project, a grade threshold and the percentage of students above and below that threshold who passed the module. For example, 86.2069% of the students who achieved a grade of at least 6.5 in project3 passed the module. Similarly, 75% of the students with a grade of at least 3.5 in project1, 73.9726% of those with a grade of at least 5 in project2, and 81.3559% of those with a grade of at least 5 in project4 passed the module.
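The percentages in Table 3 are pass rates within each grade group, that is, the number of students in the group who passed divided by the group size. A minimal sketch of this arithmetic follows; the counts are copied from Table 3, while the dictionary layout and the function name `pass_rate` are illustrative assumptions.

```python
# Counts copied from Table 3: (students in the group, students in the group who passed).
table3 = {
    "project1 >= 3.5": (76, 57),
    "project2 >= 5": (73, 54),
    "project3 >= 6.5": (58, 50),
    "project4 >= 5": (59, 48),
}


def pass_rate(total, passed):
    """Share of a grade group that passed the module."""
    return passed / total


rates = {name: pass_rate(total, passed) for name, (total, passed) in table3.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.4%}")
```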

In Table 4, we observe the percentage of the students who passed the final exams according to whether or not they had participated in each Contact Session. In addition, we observe the mean grade and the median grade of each submitted project and the number of the successful students according to their presence or absence in each Contact Session. The median is the value that separates the higher half of a sample, a population, or a probability distribution from the lower half.

In Fig. 11, we can observe the grade distribution of each submitted project according to the participation or not in the corresponding Contact Session.

5. Algorithms definition

A decision tree is a classification method which uses, for representation purposes, a collection of decision nodes connected by branches. Each decision node tests an attribute, and each outcome of the test produces a branch. The goal is to classify the instances according to the attributes they contain. In our approach, we use the CART decision tree [28]. It is a binary tree which produces two branches for each decision node. CART uses the Gini impurity function [29] as the attribute selection measure in order to build a decision tree. This function is calculated as one minus the sum of the squared probabilities of each class. During the training, the algorithm selects as the root node the attribute whose split yields the largest reduction in Gini impurity. Similarly, the split with the largest impurity reduction is selected at every subsequent node. When no decision nodes remain to be split, the complete tree has been grown.
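The split criterion can be illustrated with a short sketch. It computes the Gini impurity as one minus the sum of squared class proportions and searches one numeric attribute for the threshold with the lowest size-weighted impurity of the two children. The toy grades and labels, and the function names, are hypothetical; this is not the CART implementation used in the experiments.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighted by the size of each child."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


def best_threshold(values, labels):
    """Return (threshold, impurity) of the binary split with the lowest weighted impurity."""
    best = None
    for t in sorted(set(values)):
        left = [y for x, y in zip(values, labels) if x < t]
        right = [y for x, y in zip(values, labels) if x >= t]
        if not left or not right:
            continue  # a split must send instances to both branches
        score = weighted_gini(left, right)
        if best is None or score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data: project grades and pass (1) / fail (0) labels.
grades = [3.0, 4.5, 6.0, 7.0, 8.5, 9.0]
status = [0, 0, 0, 1, 1, 1]
print(gini_impurity(status))           # impurity of the unsplit node
print(best_threshold(grades, status))  # split that separates the classes best
```

Minimizing the weighted impurity of the children is equivalent to maximizing the impurity reduction relative to the parent node, since the parent impurity is the same for every candidate split.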

Naïve Bayes is a classifier that makes the simplest assumption about the dependency relationships among the features [29]: given the class, the features are conditionally independent. The class-conditional probability of an instance is therefore computed as the product of the conditional probabilities of its individual features.

The k-Nearest Neighbor is an algorithm that applies instance-based learning [30]. The algorithm stores the training data in order to classify a new, unclassified entry. To achieve this, the new entry is compared with the most similar entries in the training set, and it is assigned the class that is most common among its k nearest neighbors. Random points that are inserted after the training, also called noise, may alter the classification in their neighborhood. If one removes a data point from the training set and executes the algorithm to predict a class for this point, an incorrect prediction counts as a cross-validation error. We use 3NN in the experiments.
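A minimal sketch of 3NN classification with a leave-one-out error estimate is given below. It assumes Euclidean distance and a toy training set of (CS1 attendance, project1 grade) pairs; both the data and the function names are hypothetical and do not reproduce the experimental setup.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def leave_one_out_error(X, y, k=3):
    """Fraction of points misclassified when each one is held out in turn."""
    errors = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


# Hypothetical instances: (attended CS1, grade of project1) -> pass (1) / fail (0).
train_X = [(1, 8.0), (1, 7.5), (0, 6.0), (0, 3.0), (1, 9.0), (0, 4.0)]
train_y = [1, 1, 0, 0, 1, 0]

print(knn_predict(train_X, train_y, (1, 7.0)))
print(leave_one_out_error(train_X, train_y))
```

Note that the binary attendance flag and the grade are on different scales, so in practice the attributes would usually be rescaled before distances are computed.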

Logistic regression is a method that is useful for describing the relationship between a categorical response variable and a set of predictor variables. It can be used for binary (dichotomous) variables or for variables with more than two categories [31]. The latter case is called polychotomous logistic regression. In order to interpret the outcomes of the above methodologies, the odds ratio should be used. The data are processed in terms of logit functions and maximum likelihood estimation.
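The role of the odds ratio can be illustrated as follows. In a binary logistic model, the odds ratio associated with a binary predictor equals the exponential of its coefficient. The intercept and the coefficient below are hypothetical values chosen for illustration, not estimates from our dataset.

```python
import math


def sigmoid(z):
    """Inverse of the logit function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients of a model with one binary predictor (attendance of CS1).
intercept, beta_cs1 = -0.5, 1.2

p_absent = sigmoid(intercept)               # predicted pass probability when CS1 = 0
p_present = sigmoid(intercept + beta_cs1)   # predicted pass probability when CS1 = 1
odds_ratio = odds(p_present) / odds(p_absent)

print(p_absent, p_present)
print(odds_ratio, math.exp(beta_cs1))  # the two values coincide
```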

Neural networks are structured in layers. The first layer is called the input layer, the intermediate layers are called hidden layers, and the last layer is called the output layer. The training of a neural network determines the weights that achieve the maximal accuracy for new data inputs. This is done by iterating over the training data with the aim of improving the accuracy. At the end of the training, the model is able to estimate the output for a new data pattern. The Multi-Layer Perceptron (MLP) is a Feed-Forward NN with one or more hidden layers. A Feed-Forward NN allows the signal to propagate in one direction only, from input to output. In [32], the authors use Moth-flame Optimization to find the optimal weights of an MLP by minimizing the Mean Squared Error between the actual and the desired outputs, in order to obtain a high classification rate.

The goal of using an SVM is to find the separating hyperplane which has the largest margin. The margin is the separation between the two classes, and the larger the margin, the better the classifier generalizes. More specifically, the hyperplane is selected so that its distance from the nearest data points on each side is maximum. In this case, the corresponding linear classifier is called the maximum margin classifier. The linear SVM classifier is useful for classifying the most difficult patterns because of the optimal hyperplane separation property [33, 34].

The Random Forest [35] is a classification algorithm that considers a large number of different decision trees. The prediction is the class that receives the most votes from the individual trees. The procedure of selecting, for each individual decision tree, a random sample from the dataset with replacement is called bagging. Bagging and random feature selection are used in order to create a forest of decision trees whose prediction is more accurate than that of each individual tree.

The term boosting corresponds to a method that produces a more accurate prediction rule by combining rough and moderately inaccurate rules of thumb. The most popular algorithm is AdaBoost [36], in which each example is weighted. Initially, all the examples are assigned the same weight. In order to classify the examples, the algorithm considers the distribution of the weights. The boosting algorithms use a weak learner, and each of its predictions is either correct or incorrect. According to whether a prediction is correct or not, the algorithm adapts the weight of each example. In order to achieve the final classification, the algorithm combines the weak learners according to the weights determined during training.
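The weight adaptation can be sketched for a single round of the standard discrete AdaBoost update. The sketch assumes a weak learner whose weighted error lies strictly between 0 and 1; the function name `adaboost_round` and the five-example toy case are hypothetical.

```python
import math


def adaboost_round(weights, correct):
    """One AdaBoost round: return the learner's vote weight and the updated example weights.

    `weights` are the current example weights (summing to 1) and `correct[i]`
    tells whether the weak learner classified example i correctly.
    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Decrease the weight of correctly classified examples, increase the others.
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Five examples with equal initial weights; the weak learner misses the last one.
weights = [0.2] * 5
correct = [True, True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
print(alpha)
print(new_weights)
```

After the update, the misclassified examples carry half of the total weight, so the next weak learner is pushed to concentrate on them.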

According to [37], classification algorithms can be combined. Each classifier votes for each decision, and the selected class is the one with the largest number of votes. This selection is also called a consensus decision.
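The consensus decision can be sketched as hard (majority) voting over the predictions of several classifiers. The three prediction lists below are hypothetical placeholders for the outputs of three trained classifiers.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the class with the most votes among the classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def vote_all(per_classifier_predictions):
    """Combine the per-instance predictions of several classifiers by hard voting."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]


# Hypothetical predictions of three classifiers for four students (1 = pass, 0 = fail).
clf_a = [1, 0, 1, 1]
clf_b = [1, 1, 0, 1]
clf_c = [0, 0, 1, 1]

consensus = vote_all([clf_a, clf_b, clf_c])
print(consensus)  # -> [1, 0, 1, 1]
```

With an odd number of classifiers and two classes, a tie cannot occur, which is one practical reason for combining three classifiers.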

Figure 12.

Machine learning algorithms.

Figure 13.

Random forest feature importance.

Figure 14.

Permutation importance.

Figure 15.

Mean shapley values.

6. Experiments

We carry out the experiments using Scikit-learn, and we conduct them in ten successive steps. Initially, we separate the data into 10 groups of the same size. The 1st step includes CS1 as the only independent variable. In the 2nd step, the variable Project1 is added to the training set. The 3rd step additionally includes CS2, while in the 4th step the variable Project2 is added, and so on. The final, 10th step includes all the variables of the dataset.

In every iteration, one group is considered as the test data and the remaining groups form the training data set. Therefore, for each group, we calculate the accuracy of each classification algorithm trained on the remaining groups and tested on the current group. The accuracy of each algorithm is then obtained as the cross-validation estimate over all groups.
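The experimental protocol can be sketched in plain Python as follows. The sketch builds the ten incremental attribute subsets and a 10-fold split in which each group serves once as the test set. The helper names and the `fit_predict` callable are hypothetical; in practice, Scikit-learn utilities such as `KFold` and `cross_val_score` cover the same ground.

```python
def feature_steps():
    """Attribute subsets for the ten steps: CS1, then +Project1, then +CS2, and so on."""
    order = []
    for i in range(1, 6):
        order += [f"CS{i}", f"Project{i}"]
    return [order[:step] for step in range(1, len(order) + 1)]


def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop


def cross_val_accuracy(fit_predict, X, y, k=10):
    """Mean accuracy over k folds for a `fit_predict(train_X, train_y, test_X)` callable."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        preds = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in test]
        )
        hits = sum(p == y[i] for p, i in zip(preds, test))
        scores.append(hits / len(test))
    return sum(scores) / len(scores)


steps = feature_steps()
print(steps[0])   # attributes of the 1st step
print(steps[3])   # attributes of the 4th step
print(len(steps[-1]))  # the 10th step uses all ten attributes
```

With the 80 instances reported in Tables 3 and 4, each of the ten groups contains eight students.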

Hyperparameters are the parameters that the machine learning algorithms use to control the training procedure. These parameters determine the degree to which the model obtains the desired output during training. The best combination of hyperparameters is the one that maximizes or minimizes a chosen objective function. In order to find this combination, a random search over a grid of hyperparameter values is employed, and the performance of each candidate is computed by cross-validation. This method is applied to the machine learning algorithms listed in Fig. 12.

A voting generalization procedure over the three most accurate classifiers, each used with its default parameters, can produce better results than any single tuned learning algorithm.
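A random search over a grid can be sketched as follows. The grid, the parameter names, and the function `toy_score` are hypothetical; `toy_score` merely stands in for a cross-validated accuracy and is not a real measurement. Scikit-learn provides `RandomizedSearchCV` for this purpose.

```python
import random


def random_search(score_fn, grid, n_iter=10, seed=0):
    """Sample `n_iter` random combinations from `grid` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid for a k-nearest-neighbor classifier.
grid = {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]}


def toy_score(params):
    """Placeholder for a cross-validated accuracy; NOT a real measurement."""
    return 0.8 - 0.01 * abs(params["n_neighbors"] - 3)


best_params, best_score = random_search(toy_score, grid, n_iter=20)
print(best_params, best_score)
```

Unlike an exhaustive grid search, the random search evaluates only `n_iter` combinations, so its cost does not grow with the size of the grid.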

7. Feature importance

In Fig. 13, the significance of each parameter for making a prediction is highlighted in comparison with the significance of the other parameters. In this way, we identify the most relevant and the most irrelevant parameters for calculating the target variable with each predictive model. Moreover, we use the Random Forest algorithm [35], as implemented in scikit-learn [38], to compute the parameter importance. This algorithm is described in Section 5.

Using Feature Importance, we can identify the features of a model that are most or least significant for a prediction. Permutation Feature Importance can be applied even to a model that does not support native feature importance scores. Initially, we fit a model to our data. Then, for each feature in turn, we shuffle the values of that feature (column) and perform predictions on the shuffled data. This procedure is repeated a small number of times (up to 10). We then determine the mean importance score of each feature and the distribution of the scores across the repetitions. With this method, we can identify the features whose shuffling affects the performance the most. The performance metric can be the mean squared error for regression and the accuracy for classification. The Feature Importance of the Random Forest algorithm is illustrated in Fig. 13, while the Permutation Feature Importance obtained by shuffling the features is presented in Fig. 14.
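The shuffling procedure can be sketched in a few lines. The data and the threshold model below are invented for illustration: the first column determines the label and the second is pure noise. scikit-learn implements the same idea in `sklearn.inspection.permutation_importance`.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled, over n_repeats shuffles."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled_col = [row[col] for row in X]
            rng.shuffle(shuffled_col)
            X_perm = [row[:col] + [v] + row[col + 1:]
                      for row, v in zip(X, shuffled_col)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Invented toy data: column 0 drives the label, column 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [1 if row[0] >= 5 else 0 for row in X]

def threshold_model(row):
    """Hypothetical fitted model: predicts pass when the first feature is at least 5."""
    return 1 if row[0] >= 5 else 0

importances = permutation_importance(threshold_model, X, y)
print(importances)
```

Shuffling the informative column breaks its link with the label and lowers the accuracy, whereas shuffling the noise column leaves the predictions unchanged, so its importance is zero.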

With the methodology proposed in [39], we can measure the importance of each feature for various prediction methods, which makes the predictions of these methods easier to comprehend. The methodology uses game theory to find a unique solution: it compares the predictions obtained with and without each feature and requires the resulting attribution to satisfy three properties, namely local accuracy, missingness, and consistency. Specifically, we calculate the Shapley values from game theory by employing a weighted linear regression, as illustrated in Fig. 15; the above-cited work combines linear regression with Shapley values. The authors of [39] consider the mean to be the best least-squares point estimate for a set of given data elements. According to [39], this method offers a more accurate estimation with fewer evaluations of the original model than other sample-based estimations.
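For a handful of features, Shapley values can be computed exactly by averaging the marginal contribution of each feature over all orderings. The sketch below does this by brute force and is therefore not the weighted linear regression approximation of [39]; the model, its weights, the instance, and the background values are all invented for illustration.

```python
from itertools import permutations

def shapley_values(value_fn, players):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value_fn(with_p) - value_fn(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical instance and reference (background) values.
instance = {"Project1": 8.0, "Project3": 6.0, "Project4": 9.0}
background = {"Project1": 5.0, "Project3": 5.0, "Project4": 5.0}

def model(x):
    """Hypothetical linear model with invented weights."""
    return 0.3 * x["Project1"] + 0.4 * x["Project3"] + 0.3 * x["Project4"]

def value_fn(coalition):
    """Model output when the features in the coalition take the instance's
    values and all other features are set to the background values."""
    x = {name: (instance[name] if name in coalition else background[name])
         for name in instance}
    return model(x)

phi = shapley_values(value_fn, list(instance))
print(phi)
```

The attributions sum to the difference between the prediction for the instance and the prediction for the background, which is the local accuracy property. Exact enumeration grows factorially with the number of features, which is why approximations such as the one in [39] are used in practice.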

The results lead us to conclude that the most informative attributes according to all examined feature importance strategies are Project3, Project4, and Project1 grades.

7.1 Final grade estimation

In this subsection, we build a linear regression model for predicting the final grade in Eq. (1).

$$
\begin{aligned}
\text{Grade} = {} & -0.04627383618163403 \cdot \text{CS1} - 0.8730758399567464 \cdot \text{CS2} + 1.2796260611072536 \cdot \text{CS3} \\
& - 0.28953715932917123 \cdot \text{CS4} + 1.5004401810873296 \cdot \text{CS5} + 0.30833641460618305 \cdot \text{Project1} \\
& - 0.1879967461964284 \cdot \text{Project2} + 0.40704215094295476 \cdot \text{Project3} + 0.305525384328978 \cdot \text{Project4} \\
& - 0.09066485517580523 \cdot \text{Project5} - 1.3668410333581766
\end{aligned}
$$

Equation 1. Final Grade Prediction
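Eq. (1) can be evaluated directly, as in the following sketch. The coefficients and the intercept are copied from the equation; the function name and the example input are hypothetical, and the scale and coding of each variable are those of the dataset, which this sketch does not define.

```python
# Coefficients and intercept copied from Eq. (1).
COEFFICIENTS = {
    "CS1": -0.04627383618163403,
    "CS2": -0.8730758399567464,
    "CS3": 1.2796260611072536,
    "CS4": -0.28953715932917123,
    "CS5": 1.5004401810873296,
    "Project1": 0.30833641460618305,
    "Project2": -0.1879967461964284,
    "Project3": 0.40704215094295476,
    "Project4": 0.305525384328978,
    "Project5": -0.09066485517580523,
}
INTERCEPT = -1.3668410333581766

def predict_final_grade(student):
    """Evaluate Eq. (1) for a dict that maps each variable name to its value."""
    return INTERCEPT + sum(coef * student[name] for name, coef in COEFFICIENTS.items())

# Invented input: every variable set to 1, purely to exercise the formula.
example = {name: 1.0 for name in COEFFICIENTS}
print(predict_final_grade(example))
```

With all variables at zero the function returns the intercept, which is a quick check that the coefficients were transcribed into the right positions.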

To evaluate this regression, we separate the data into 10 subsets. We reserve one subset to test the model and use the other subsets to train it, and we repeat the regression so that each subset is used once as the test set. We record the prediction error of each run and calculate the average of the 10 recorded errors. This performance metric is called the cross-validation error, and the algorithm is called Repeated K-fold cross-validation. We run a 10-fold cross-validation and calculate the mean values of the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE). The RMSE is the square root of the mean squared difference between the observed values and the values that the model predicts. We report the MAE, the mean absolute difference between observed and predicted values, alongside the RMSE because it is less sensitive to outliers.

Mean Absolute Error: 1.7958424729267497

Root Mean Squared Error: 2.2347550840313413
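The two error metrics, and their different sensitivity to outliers, can be illustrated with a short sketch. The grades below are invented and unrelated to the values reported above.

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared difference between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Invented grades for illustration.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
with_outlier = [5.5, 5.5, 7.5, 2.0]  # the last prediction is far off

for name, preds in [("regular", predicted), ("outlier", with_outlier)]:
    print(name,
          mean_absolute_error(observed, preds),
          root_mean_squared_error(observed, preds))
```

When all errors have the same size, the two metrics coincide; a single large error raises the RMSE much more than the MAE, because the errors are squared before they are averaged.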

8. Discussion and concluding remarks

In this study, we investigate the effectiveness of supervised learning algorithms in predicting students’ performance (pass or fail) in a distance learning undergraduate course module at the HOU. The prediction focuses on the grades achieved in the students’ final exams. The prediction of students’ performance has been an interesting and highly important research topic for educational institutions in recent years. Identifying the low performers in a class as early as possible could lead to the development of personalized learning strategies that enhance the learning outcomes in accordance with each student’s learning profile.

To approach this matter with artificial intelligence techniques, we combine several machine learning algorithms to predict the students’ performance. The most significant factors for the prediction are the grades in Project3, Project4, and Project1. The performance of each algorithm is calculated with cross-validation. We calculate the mean importance score of each feature and the distribution of the scores across the iterations of the Feature Importance method. Hence, we are able to identify the features that affect the algorithm performance the most. Furthermore, we can find a unique solution by employing game theory and Shapley values calculated with a weighted linear regression. The difference between the observed and the predicted values of the model is measured with the Root Mean Squared Error and with the Mean Absolute Error.

We employ a generalization procedure based on voting. By combining the three most accurate classification algorithms, we show that the vote performs better than any single tuned learning algorithm.

There is a clear need for further insights into ascertaining the salient features that contribute to both a highly predictive and explanatory model. In future work, we will explore these features. Moreover, as a test benchmark, we will use data mining techniques to explore the students’ socialization during the contact sessions with associations among variables. In addition, we will cluster the students with similar performance and presence in the Contact Sessions with fuzzy classification. The sequences of presence in Contact Sessions and grades achieved in the submitted projects will be explored with hidden Markov models.

References

1. Siemens. Learning analytics: The emergence of a discipline. Am Behav Sci. 2013; 57(10): 1380–400. doi: 10.1177/0002764213498851.

2. Gosch, Andrews, Barreiros, Leitner, Staudegger, Ebner, et al. Learning analytics as a service for empowered learners: From data subjects to controllers. In: LAK21: 11th International Learning Analytics and Knowledge Conference. New York, NY, USA: ACM; 2021. doi: 10.1145/3448139.3448186.

3. Leitner, Khalil, Ebner. Learning analytics in higher education – A literature review. In: Learning Analytics: Fundaments, Applications, and Trends. Cham: Springer International Publishing; 2017. pp. 1–23. doi: 10.1007/978-3-319-52977-6_1.

4. Keegan, ed. Foundations of distance education. London and New York: Routledge Falmer Studies in Distance Education; 2000.

5. Zhao, Yan, Guo, Wang. Student achievement analysis and prediction based on the whole learning process. In: 2020 15th International Conference on Computer Science & Education (ICCSE). IEEE; 2020. doi: 10.1109/ICCSE49874.2020.9201865.

6. Alachiotis, Stavropoulos, Verykios. Learning analytics with excel in a blended learning course. International Conference in Open and Distance Learning. 2017; 9(A): 8. doi: 10.12681/icodl.1077.

7. Alachiotis, Stavropoulos, Verykios. Analyzing learners behavior and resources effectiveness in a distance learning course: A case study of the Hellenic open university. J Inf Sci Theory Pract. 2019; 7(3): 6–20. doi: 10.1633/JISTaP.2019.7.3.1.

8. Koprinska, Stretton, Yacef. Students at Risk: Detection and Remediation. EDM; 2015.

9. Costa, Fonseca, Santana, de Araújo, Rego. Evaluating the effectiveness of educational data mining techniques for early prediction of students’ academic failure in introductory programming courses. Comput Human Behav. 2017; 73: 247–56. doi: 10.1016/j.chb.2017.01.047.

10. Yao, Lian, Cao, Zhou. Predicting academic performance for college students: A campus behavior perspective. ACM Trans Intell Syst Technol. 2019; 10(3): 1–21. doi: 10.1145/3299087.

11. Ahadi, Hellas, Lister. A contingency table derived method for analyzing course data. ACM Trans Comput Educ. 2017; 17(3): 1–19. doi: 10.1145/3123814.

12. Kotsiantis. Use of machine learning techniques for educational proposes: A decision support system for forecasting students’ grades. Artif Intell Rev. 2012; 37(4): 331–44. doi: 10.1007/s10462-011-9234-x.

13. Zeineddine, Braendle, Farah. Enhancing prediction of student success: Automated machine learning approach. Comput Electr Eng. 2021; 89(106903): 106903. doi: 10.1016/j.compeleceng.2020.106903.

14. Kotthoff, Thornton, Hoos, Hutter, Leyton-Brown. Auto-WEKA: Automatic model selection and hyperparameter optimization in WEKA. In: Automated Machine Learning. Cham: Springer International Publishing; 2019. pp. 81–95. doi: 10.1007/978-3-030-05318-5_4.

15. Swathi, Soujanya KLS, Suhasini. Review on Predicting Student Performance. In: Lecture Notes in Electrical Engineering. Singapore: Springer Singapore; 2021. pp. 1323–30. doi: 10.1007/978-981-15-7961-5_120.

16. Gkontzis, Kotsiantis, Tsoni, Verykios. An effective LA approach to predict student achievement. In: Proceedings of the 22nd Pan-Hellenic Conference on Informatics – PCI ’18. New York, New York, USA: ACM Press; 2018. doi: 10.1145/3291533.3291551.

17. Feretzakis, Kalles, Verykios. Using minimum local distortion to hide decision tree rules. Entropy (Basel). 2019; 21(4): 334. doi: 10.3390/e21040334.

18. Vatsalan, Christen, Verykios. Efficient two-party private blocking based on sorted nearest neighborhood clustering. In: Proceedings of the 22nd ACM International Conference on Conference on Information & Knowledge Management – CIKM ’13. New York, New York, USA: ACM Press; 2013. doi: 10.1145/2505515.2505757.

19. Kotsiantis, Tselios, Filippidi, Komis. Using learning analytics to identify successful learners in a blended learning course. Int J Technol Enhanc Learn. 2013; 5(2): 133. doi: 10.1504/ijtel.2013.059088.

20. Kotsiantis, Patriarcheas, Xenos. A combinational incremental ensemble of classifiers as a technique for predicting students’ performance in distance education. Knowl Based Syst. 2010; 23(6): 529–35. doi: 10.1016/j.knosys.2010.03.010.

21. Chan MCE, Ochoa, Clarke. Multimodal learning analytics in a laboratory classroom. In: Intelligent Systems Reference Library. 2020.

22. Chrysafiadi, Virvou, Sakkopoulos. Optimizing programming language learning through student modeling in an adaptive web-based educational environment. In: Intelligent Systems Reference Library. 2020.

23. Virvou, Alepis, Tsihrintzis, Jain. Machine learning paradigms: Advances in learning analytics. In: Intelligent Systems Reference Library. 2020.

24. Alexandropoulos S-AN, Kotsiantis, Vrahatis. Data preprocessing in predictive data mining. Knowl Eng Rev. 2019; 34(e1). doi: 10.1017/s026988891800036x.

25. Paxinou, Panagiotakopoulos, Karatrantou, Kalles, Sgourou. Implementation and evaluation of a three-dimensional virtual reality biology lab versus conventional didactic practices in lab experimenting with the photonic microscope. Biochem Mol Biol Educ. 2020; 48(1): 21–7. doi: 10.1002/bmb.21307.

26. Paxinou, Georgiou, Kakkos, Kalles, Galani. Achieving educational goals in microscopy education by adopting virtual reality labs on top of face-to-face tutorials. Res Sci Technol Educ. 2020; 1–20. doi: 10.1080/02635143.2020.1790513.

27. Paxinou, Zafeiropoulos, Sypsas, Kiourt, Kalles. Assessing the impact of virtualizing physical labs. arXiv [cs.HC]; 2017. Available from: http://arxiv.org/abs/1711.11502.

28. Breiman. Classification and regression trees. Chapman and Hall/CRC; 1984.

29. Duda, Hart, Stork. Pattern classification. John Wiley & Sons; 2012. 146.

30. Larose. Data mining and predictive analytics. John Wiley & Sons; 2015. 462.

31. Hosmer DWJ, Lemeshow. Applied logistic regression. John Wiley & Sons; 2004. 47.

32. Yamany, Fawzy, Tharwat, Hassanien. Moth-flame optimization for training Multi-Layer Perceptrons. In: 2015 11th International Computer Engineering Conference (ICENCO). IEEE; 2015. doi: 10.1109/icenco.2015.7416360.

33. Suthaharan. Support Vector Machine. In: Machine Learning Models and Algorithms for Big Data Classification. Boston, MA: Springer US; 2016. pp. 207–35. doi: 10.1007/978-1-4899-7641-3_9.

34. Platt. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research; 1998. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.43.4376.

35. Breiman. Mach Learn. 2001; 45(1): 5–32.

36. Freund, Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997; 55(1): 119–39. doi: 10.1006/jcss.1997.1504.

37. Kittler, Hatef, Duin RPW, Matas. On combining classifiers. IEEE Trans Pattern Anal Mach Intell. 1998; 20(3): 226–39. doi: 10.1109/34.667881.

38. Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research. 2011; 12(Oct): 2825–30. doi: 10.5555/1953048.2078195.

39. Lundberg, Lee. In: Advances in Neural Information Processing Systems. Curran Associates, Inc. Available from: https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf.
