Abstract
Peer assessment is a method that has shown a positive impact on learners' cognitive and metacognitive skills. It also represents an effective alternative to instructor-provided assessment within computer-based education and, particularly, in massive online learning settings such as MOOCs. Various platforms have incorporated this mechanism as an assessment tool. However, most of the proposed implementations rely on the random matching of peers. The contributions introduced in this article are intended to step past the randomized approach by modeling learner matching as a many-to-many assignment problem and solving it with an appropriate combinatorial optimization algorithm. The adopted approach relies on a matching strategy that is also discussed in this article. Furthermore, we present two key steps on which both the matching strategy and the representation of the problem depend: 1) modeling the learner as an assessor, and 2) clustering assessors into categories that reflect learners’ assessment competency. A methodology for increasing the accuracy of peer assessment by weighting the scores given by learners is also introduced. Finally, compared to the random allocation of submissions, experiments with the approach have shown promising results in terms of the validity of assessments and the acceptance of peer feedback.
Introduction
Achieving consistency in educational interaction requires the effective application of the three pillars of the pedagogical cycle: teaching, learning and assessment. The last pillar, assessment, represents a significant challenge in many computer-based educational contexts and, in particular, in open learning environments such as MOOCs (Massive Open Online Courses). Indeed, within massive learning environments marked by a high student-to-teacher ratio, instructors are unable to offer timely and effective assessment to a significant number of learners (Suen, 2014).
Thus, in order to address this challenge, Peer Assessment (PA) represents an alternative whose application has shown many promising outcomes. It is a commonly used technique in education, which requires learners to provide scores and formative feedback on the work of their peers. This work may take the form of text, speech, behavior, research, projects, reports, etc. (Reinholz, 2016; Topping, 2018).
In fact, large-scale learning environments use PA to reduce the assessment burden on instructors, especially for examinations where the automatic assessment is unusable or invalid, as is the case for open questions, project tests, etc. (Elizondo-Garcia et al., 2019; Suen, 2014).
Alongside the facilitation of the task of assessment for instructors, this mechanism has a positive impact on the participating learners. Indeed, Baas et al. (2015) emphasize the importance of assessment activities in shifting to students the responsibility of managing their own learning. Besides, PA contributes effectively to improving learners’ ability to self-regulate their learning and to evaluate their progress and skills objectively. It also promotes learners’ engagement when it is adopted as an active learning approach (Reinholz, 2016; Tighe-Mooney et al., 2016; Wanner & Palmer, 2018).
From another point of view, assessment is not restricted to awarding scores, but also includes providing feedback on the performance of learners. Peer feedback is an integral part of PA, which helps overcome the perceived lack of feedback given to learners in settings like MOOCs. In fact, feedback allows learners to receive information about their work's strengths and weaknesses, as well as guidance for future progress (Huisman et al., 2019; Yuan & Kim, 2015).
Moreover, by providing their peers with feedback, learners improve their ability to follow more objective assessment standards, which is reflected in enhanced metacognitive mechanisms and in the quality of their work. It also leads to a better understanding of the subject matter and of the objectives of the learning process. Learners who review peers’ work often develop their analytical, deductive and expressive skills (Burkšaitienė, 2012; Lu & Zhang, 2012).
Additionally, PA is recommended for formative assessment. However, despite studies showing a substantial correlation between learner-provided and instructor-provided scores (Falchikov & Goldfinch, 2000; Marty et al., 2010; Topping, 1998), knowledge gaps and participants' limited appraisal experience raise questions regarding the validity of peers’ ratings and reviews.
Besides, within contexts such as MOOCs, the retention or dropout rates depend, inter alia, on the quality of evaluation and feedback provided to learners (Vickerman, 2009; Xiong & Suen, 2018; Yousef et al., 2015). Thus, in open learning environments, achieving more accurate rating and more effective feedback should represent a core priority for the application of PA.
Meeting these requirements implies the need to improve the implementation of PA. To this end, this article introduces a solution inspired by editorial management in peer-reviewed scientific journals. In fact, the credibility of these journals is closely related to the expertise of the editorial board in selecting the appropriate reviewers for the submitted manuscripts. We therefore aim to develop a kind of virtual editor that optimizes the assignment of assessment tasks to the participating learners in the context of PA implemented within online learning environments.
This perspective has led us to pose a fundamental question that is not commonly addressed in research on the same topic, namely: how to decide who should assess whose work in such a way as to increase the effectiveness of PA?
In order to answer this question, we step past the random distribution of peer submissions in the context of PA by introducing a novel approach for matching learners as assessors and as assessees. This approach relies on modeling the process of allocating peers’ submissions as a many-to-many assignment problem. Then, in order to find the best possible mapping, a combinatorial optimization algorithm is applied to solve the modeled problem.
In addition, two basic steps are required for modeling the allocation process. The first is defining the descriptive features of the learner as an assessor along with measuring her\his assessment competency. The second is the clustering of assessors into categories representing different levels of assessment competency. The proposed approach also calculates the scores to be assigned to the work of peers in a manner that takes into account the gaps in competency between the assessors.
Experiments with the adopted mechanism indicated a significant increase in the validity of peer-provided assessments, as well as in the consistency between the learners as assessors and as assessees, compared to the random allocation of submissions. Similarly, the rate of approval of peer feedback by participants increased substantially.
The remainder of this paper is structured as follows: Section 2 presents an overview of the related work on the techniques of allocating submissions within PA platforms and on the methods optimizing learner matching in the same context. The different components of the proposed approach are then introduced in Section 3, whereas Section 4 describes the implementation and the experimentation of the proposed contributions. Finally, the last section presents conclusions and perspectives for future work.
Background
The literature review discussed in this work is split into two key focuses: the first concerns the methodologies of allocating submissions (i.e. matching peers) within the systems implementing PA, while the second reflects on the methods used to optimize the matching of learners.
Techniques of Allocating Submissions in PA Platforms
Most of the literature relevant to peer assessment does not provide ample information about the mechanisms of allocating submissions (i.e. matching peers) (Staubitz et al., 2016). The following paragraphs describe the details mentioned regarding the allocation of assessment tasks in some systems implementing PA.
In fact, where such methodologies are mentioned, most of the platforms applying PA are limited to a random assignment of submissions (Cho & Schunn, 2007; Purchase & Hamer, 2017; Rice, 2015).
In Coursera, which is one of the main providers of online courses, the allocation of assessment tasks consists of randomly assigning to each assessor a set of peers’ submissions together with a number of submissions pre-assessed by experts (usually instructors). This process is part of a technique called Calibrated Peer Review (CPR). In CPR, an assessor's reliability is estimated by comparing the scores she\he awards to the pre-evaluated submissions with those provided by the instructor (Piech et al., 2013; Russell et al., 2017; Van Zundert et al., 2010).
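The calibration step described above can be illustrated with a minimal sketch. It assumes a 0 to 10 rating scale and uses the mean absolute gap to the instructor's scores as the reliability measure; the function name `calibration_reliability` and this particular measure are assumptions made for illustration, not the formula used by CPR or Coursera.

```python
def calibration_reliability(assessor_scores, instructor_scores, max_score=10.0):
    """Return a reliability in [0, 1]; 1 means perfect agreement with the instructor.

    Hypothetical measure: 1 minus the mean absolute gap, rescaled by max_score.
    """
    if len(assessor_scores) != len(instructor_scores) or not assessor_scores:
        raise ValueError("need one instructor score per calibrated submission")
    total_gap = sum(abs(a - t) for a, t in zip(assessor_scores, instructor_scores))
    mean_gap = total_gap / len(assessor_scores)
    return 1.0 - mean_gap / max_score


# Three pre-assessed submissions: the assessor is off by 1, 0 and 1 points.
print(calibration_reliability([7, 5, 9], [8, 5, 10]))
```

Under these assumptions, an assessor whose scores match the instructor's on every calibrated submission obtains a reliability of 1.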
As opposed to Coursera, Open edX assigns pre-evaluated assignments to the assessors in the context of a training stage preceding the formal assessment of peers’ submissions. Besides, in order to address situations where some learners have not assessed their assigned tasks, this platform suggests that the instructor request a higher number of assessments than is required for each submission (Edx, 2020).
In addition, PRAISE is a system implementing PA that relies on a distribution approach that aims at saving time between the submission of assignments and their assessment. Indeed, the allocation process starts as soon as the accumulated number of submitted assignments exceeds a specified minimum (Luxton-Reilly, 2009).
OASYS, on the other hand, is a platform that divides learners into three categories based on the outcome of a multiple-choice question test. Afterwards, the program matches each assessor with a number of learners that belong to different levels of performance (categories) (Bhalerao & Ward, 2001).
Furthermore, the allocation of assessment tasks in the models of Zeller (2000) and Staubitz et al. (2016) is based on the characterization of the submitted assignments by different levels of assessment priority. Both models give higher priority to the allocation of work that belongs to: 1) learners whose submissions have not yet reached the required number of assessments, and 2) assessors who have already completed the assessment tasks assigned to them.
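One way to read these two priority rules is as a sort key over the pool of pending submissions. The sketch below is only an illustration: the `Submission` record, the `next_submission` helper, the default of four required reviews and the order in which the two criteria are applied are all hypothetical and are not taken from Zeller (2000) or Staubitz et al. (2016).

```python
from dataclasses import dataclass


@dataclass
class Submission:
    author: str
    reviews_received: int         # assessments collected so far
    author_done_reviewing: bool   # the author has finished her or his own tasks


def next_submission(pool, assessor, required_reviews=4):
    """Pick the highest-priority submission for `assessor`, or None if none is left."""
    candidates = [s for s in pool
                  if s.author != assessor and s.reviews_received < required_reviews]
    if not candidates:
        return None
    # Assumed ordering: authors who completed their reviews come first,
    # then the submissions with the fewest assessments so far.
    return min(candidates,
               key=lambda s: (not s.author_done_reviewing, s.reviews_received))


pool = [Submission("ann", 3, False), Submission("bob", 1, True),
        Submission("cy", 0, False), Submission("dee", 4, True)]
print(next_submission(pool, "eve").author)
```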
From another perspective, the Open Feedback Studio model provides learners with the opportunity to share their work and receive multiple pieces of feedback in order to improve their performance (Gamage et al., 2018). This concept is widely used nowadays in social networks (Demir, 2018; Niu, 2019), primarily in a variety of art and design communities. Likewise, C. E. Kulkarni et al. (2015) propose a platform based on a similar approach with a focus on immediate feedback on the work of learners. In this platform, after the learner submits her\his draft, she\he has to review two randomly assigned submissions belonging to her\his peers before receiving feedback on her\his draft.
Yet, such an approach is more effective in fields where assessment tasks are not time consuming. This implies that receiving constructive feedback cannot be guaranteed in contexts where carrying out an assessment requires significant effort. Therefore, it seems necessary to consider the appropriate matching of assessors with each submission in a methodology that aims to ensure more accurate evaluation along with more constructive feedback.
Furthermore, PA is a practice that relies on learners' collective intelligence. According to Van den Berg et al. (2006), the PA process requires the assignment of each assessment task to a group of learners in order to offset the impact of unqualified assessors.
However, this methodology cannot be a complete solution to such a problem. As stated earlier, the random matching of learners does not imply a fair assignment and correct assessment. This is due to the fact that some submissions may be assigned to a group composed only of insufficiently competent assessors. In such circumstances, the integrity and reliability of the assessment process may be compromised, which may also adversely affect the motivation of learners.
Optimization of Learner Matching in PA
In the light of the relevant literature, a variety of approaches has been introduced to improve the effectiveness of PA. Yao & Suen (2016) have elaborated an overview of peer score pre-correction techniques. Most of the methods in this regard are focused on the weighting of assessors based on criteria such as their assessment validity. This mechanism has been applied in a certain number of methods, such as the aforementioned CPR. The same overview, on the other hand, also highlighted the post correction of assessors' scores. In fact, different methods that implement this methodology stand on statistical models to adjust the scores after being assigned by assessors.
Furthermore, the previously mentioned Open Feedback Studio model represents an alternative approach for maximizing the efficacy of PA that relies on enhancing assessors' motivation to engage in the process of reviewing their peers' work.
We believe that improving the matching between learners is a key factor for the effective implementation of peer assessment. In this article, we therefore focus on this area of optimization, which has been covered in a limited number of related works. Two methods stand out in this context. The first is based on Artificial Neural Networks and the second uses a Genetic Algorithm to provide an optimal matching of learners.
Indeed, Giannoukos et al. (2010) proposed a ranking of the most suitable reviewers for each author by means of artificial neural networks. In order to train the model, the usefulness of the reviewers' feedback serves as the output of the network, while the inputs are built from a combination of the authors' and reviewers' profiles. These two components (inputs and outputs) are determined by using an initial assessment of the authors' work through a random allocation of submissions. This initial step allows authors to evaluate the usefulness of the feedback they received.
Afterwards, the trained model predicts the usefulness of feedback for each possible pair of authors and reviewers. Then, for each author, prospective reviewers are ranked according to their expected usefulness, assuming they assess her\his submission. The actual matching for each author depends on the corresponding ranking of her\his potential reviewers and their availability.
On the other hand, the approach introduced by Crespo et al. (2005) stands on a genetic algorithm that generates an optimal learner matching in a way that maximizes the overall interest of peer mapping. In this context, the interest of mapping reflects the degree of desirability of a mapping that associates a learner as an author with another learner as a reviewer. The value of this interest is determined by calculating its similarity to a sample provided by the instructors containing the most desirable and undesirable mappings. As a result, the fitness of a given matching is equal to the sum of the interest values of the mappings engaged.
The algorithm operates on a population of initial solutions formed of a set of potential matchings generated randomly. Then, the approach consists of evaluating the fitness of each matching. Afterwards, using adapted genetic operators (selection, crossover and mutation), a new generation of possible solutions is produced depending on the evaluation of fitness. During the reproduction phase, some solutions are selected for mating, whereas the others are discarded. In fact, the higher the fitness of a matching, the higher its chance of being selected. The same process is repeated several times to produce new generations with a view to obtaining a final population consisting of the fittest solutions. Finally, the optimal matching is selected from the solutions of the last population.
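The loop described above can be sketched in a few lines. This is a generic genetic algorithm written for illustration, not the implementation of Crespo et al. (2005): it assumes a precomputed `interest` matrix (rather than one derived from instructor samples), gives each author exactly one reviewer, ignores reviewer workload, and uses tournament selection, one-point crossover and elitism as assumed operator choices.

```python
import random


def fitness(matching, interest):
    """Sum of the interest values of the (author, reviewer) pairs in a matching."""
    return sum(interest[author][reviewer] for author, reviewer in enumerate(matching))


def random_matching(n, rng):
    """One reviewer per author, never the author her or himself."""
    return [rng.choice([r for r in range(n) if r != a]) for a in range(n)]


def evolve(interest, pop_size=30, generations=50, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    n = len(interest)
    population = [random_matching(n, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda m: fitness(m, interest), reverse=True)
        next_gen = population[:2]                 # elitism: keep the two fittest
        while len(next_gen) < pop_size:
            # Selection: binary tournaments favour the fitter matchings.
            p1 = max(rng.sample(population, 2), key=lambda m: fitness(m, interest))
            p2 = max(rng.sample(population, 2), key=lambda m: fitness(m, interest))
            cut = rng.randrange(1, n)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            for a in range(n):                    # mutation: reassign a reviewer
                if rng.random() < mutation_rate:
                    child[a] = rng.choice([r for r in range(n) if r != a])
            next_gen.append(child)
        population = next_gen
    return max(population, key=lambda m: fitness(m, interest))


interest = [[0, 3, 1], [2, 0, 5], [4, 1, 0]]      # hypothetical interest values
best = evolve(interest)
print(best, fitness(best, interest))
```

Because each gene only encodes the reviewer of one author, crossover and mutation can never produce a self-review here; a version with reviewer workload limits would need a repair step after each operator.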
The methods presented in this subsection require prior intervention of instructors, either through an initial exercise of PA based on a random matching of peers as in the first method, or by the preparation of samples of the most desirable and undesirable matches as in the second. Under these methods, each assessor is often addressed on a case-by-case basis. However, in order to maximize the overall validity of the assessment procedure, it seems necessary to implement the matching of learners in a way that takes into account the competency of the assessor compared to the other peers. Besides, neither mechanism is designed to be effective in massive contexts.
On the other hand, although the use of neural networks is a significant contribution towards improving the matching of learners, it does not directly relate to the generation of the optimum assignment of assessment tasks to learners.
Thus, the introduction of the approach presented in the following section was also encouraged by the limitations and the scarcity of methods that seek to optimize learner matching in the context of PA.
The Proposed Approach for Online Peer Assessment
The mechanism for distributing submissions naturally affects the efficiency of the PA process. The assignment of assessment tasks requires a balanced matching of learners in order to increase and even out participants' chances of receiving valid assessments and to improve the overall effectiveness of the appraisal system.
The contributions introduced in this section focus mainly on the optimization of learner matching in the context of PA and, secondarily, on the relative weighting of assessors’ ratings.
Indeed, the goal of optimizing the matching of assessors with peers’ submissions has led us to consider modeling this process as a Many-to-Many Assignment Problem (MMAP). This form of modeling represents an extension of the Generalized Assignment Problem (GAP). The GAP is one of the traditional problems of combinatorial optimization, whose solution consists in proposing an optimal distribution of a set of tasks over a set of agents (Sethanan & Pitakaso, 2016). In our context, agents are the learners as assessors, and tasks refer to the learners as assessees.
Three key elements define a MMAP:
Each task is characterized by a number of agents required for its accomplishment.
Each agent has a maximum capacity of tasks that she\he can perform at the same time.
The execution of a task T by an agent A is characterized by a given performance.
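These three elements map naturally onto a small data structure. The sketch below is a minimal illustration in which the names `MMAPInstance`, `is_feasible` and `total_performance` are hypothetical; it only checks a candidate assignment and evaluates it, and does not solve the problem.

```python
from dataclasses import dataclass


@dataclass
class MMAPInstance:
    required: dict      # task -> number of agents needed to accomplish it
    capacity: dict      # agent -> maximum number of tasks she or he can take
    performance: dict   # (agent, task) -> performance of that pairing


def is_feasible(instance, assignment):
    """assignment is a set of (agent, task) pairs."""
    load = {agent: 0 for agent in instance.capacity}
    staffed = {task: 0 for task in instance.required}
    for agent, task in assignment:
        load[agent] += 1
        staffed[task] += 1
    within_capacity = all(load[a] <= instance.capacity[a] for a in load)
    fully_staffed = all(staffed[t] == instance.required[t] for t in staffed)
    return within_capacity and fully_staffed


def total_performance(instance, assignment):
    """Objective value of an assignment: the sum of its pairwise performances."""
    return sum(instance.performance[pair] for pair in assignment)


instance = MMAPInstance(
    required={"t1": 1, "t2": 2},
    capacity={"a": 2, "b": 1},
    performance={("a", "t1"): 0.9, ("a", "t2"): 0.4,
                 ("b", "t1"): 0.2, ("b", "t2"): 0.7},
)
assignment = {("a", "t1"), ("a", "t2"), ("b", "t2")}
print(is_feasible(instance, assignment), total_performance(instance, assignment))
```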
Besides, the specification of the three elements defining the problem and its resolution consists of:
The proposal of a model defining the agent represented by the learner as assessor.
The development of an allocation strategy to specify the types of missions and the workload to be assigned to each assessor. This step includes the categorization of assessors according to different levels of assessment competency.
The mathematical modeling of the problem and the development of a suitable algorithm to solve it.
On the other hand, based on the results of the matching step, the approach also provides an assessor relative weighting mechanism in order to increase the overall rating validity. Figure 1 summarizes the different components of the proposed approach.

Figure 1. Summary of the different stages of the proposed approach.
Representation of the Learner as Assessor
As with all human activities, and especially in the field of education, the specification of personal criteria that affect both the accomplishment and the influence of PA represents a challenging task. Abrache et al. (2018) proposed an assessor model that addresses the individual characteristics of the assessor according to three basic axes:
The grading and feedback competency, which includes, inter alia, the assessment accuracy and validity along with the feedback-providing skills. It also concerns the assessment expertise and the amount of training of the learner as assessor.
The assessor's performance as a learner. Typically, a highly performing learner may be a more accurate assessor (Piech et al., 2013).
The attitude, behavior and interaction within the platform. The more present the learner is in the course activities and forums, the more effectively she\he can be involved in PA.
Construction of Indicators
Bovo et al. (2013) have established a number of indicators for the profiling of learners based on the aggregation of their data. The set of indicators includes presence, content study, activity, performance, social aspect and tutor opinion. We build on this representation to model assessor features by incorporating additional PA-related indicators: mastery of assessment, background and perception of the assessment exercise.
The main advantage of using an indicator-based representation is the generic definition of the assessor characteristics so that her\his model has less dependency on any platform or on the availability of certain data. Such representation also allows the reduction of the size of the data vector that represents the assessor, which simplifies the processing procedures.
In addition, the construction of indicators takes place in two stages. The first is to select or combine the relevant attributes for each indicator from the assessor model mentioned above, while the second is based on the specification of weights representing the difference in influence of the attributes that constitute each indicator.
Within the relevant literature, there exist various methods of executing this weighting task, which include SWARA (Karabasevic et al., 2017), Rank Exponent weight, Inverse or Reciprocal weights and Rank Sum weight (Roszkowska, 2013). The latter method is used in this research because it allows assigning weights that closely correspond to the real differences between characteristics based on expert opinion.
In fact, the features undergo a min-max normalization in order to unify their scales. Then the indicators’ values are specified using the following equation:
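As an illustration of these two stages, the sketch below combines min-max normalization with rank sum weights, where the attribute of rank r among n attributes receives the weight 2(n + 1 - r) / (n(n + 1)). Treating the indicator as a weighted sum of its normalized attributes is an assumption of this example, and the function names are hypothetical.

```python
def min_max(values):
    """Rescale a list of raw values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_sum_weights(n):
    """Rank sum weights for n attributes ranked 1 (most influential) to n."""
    total = n * (n + 1) / 2
    return [(n - rank + 1) / total for rank in range(1, n + 1)]


def indicator_values(attribute_columns):
    """attribute_columns holds one list of raw values per attribute,
    ordered from the most to the least influential attribute.
    Returns one indicator value per learner (assumed weighted sum)."""
    normalized = [min_max(column) for column in attribute_columns]
    weights = rank_sum_weights(len(attribute_columns))
    n_learners = len(attribute_columns[0])
    return [sum(w * column[i] for w, column in zip(weights, normalized))
            for i in range(n_learners)]


# Two attributes observed for three learners (hypothetical raw values).
print(indicator_values([[10, 20, 30], [1, 1, 3]]))
```

Since the rank sum weights add up to 1 and the attributes are normalized, the resulting indicator also lies in [0, 1].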
Assessor’s Score
It is essential to outline the concept of the assessor’s score, before proceeding to the next subsection presenting the strategy of matching assessors with peers’ submissions.
The assessor’s score represents an estimation of the learner’s assessment competency. Indeed, Abrache et al. (2018) introduced this parameter to measure the efficiency of the assessment that can be provided by the learner, considering a number of influencing factors. Using the rank sum weight method, as in the previous step, the weight of each indicator is defined to reflect its influence on the overall assessment competency of the learner. Then, a similar equation is used to calculate the assessor’s score.
Figure 2 summarizes the different steps that shape the process of the representation of the learner as assessor.

Figure 2. The process of construction of the assessor’s features.
Strategy of Matching Assessors With Assessment Tasks
The matching strategy is a central factor of the proposed approach because it makes it possible, on the one hand, to specify the workload for which an assessor would be responsible and, on the other, to provide a guideline for the matching algorithm in order to minimize its complexity.
This strategy stands on the outcomes of the clustering of assessors into categories reflecting different levels of assessment competency. Indeed, the first part of this subsection focuses on the categorization techniques, and the second on the use of this categorization in the making of the matching strategy.
Categorization of Assessors
The aim of this part is to introduce the methods adopted to cluster learners into categories that represent different levels of assessment competency. Two methods were included with regard to the categorization of assessors. The first adapts the k-means algorithm, and the second is a clustering method that relies on discretization.
Adapted K-Means Clustering
Description of the Algorithm:
K-means is a widely used partitional clustering algorithm that aims at optimizing the clustering of instances by maximizing intra-cluster similarity and simultaneously minimizing inter-cluster similarity (Gan & Ng, 2017).
The k-means process can be summarized in five steps:
1) Specifying the number of clusters.
2) Initializing the clusters’ centroids.
3) Assigning each instance to the nearest centroid.
4) Repositioning each centroid as the mean of the instances that belong to the corresponding cluster.
5) Checking whether the centroids' positions have changed: if so, the algorithm goes back to step 3; otherwise, it stops.
Adaptation of the Similarity Measure:
The assignment of instances as well as the repositioning of centroids depend on a similarity measure that allows the calculation of the distance between different instances. Indeed, the Euclidean, Manhattan and Chebyshev distances are among the similarity measures found in the literature (Irani et al., 2016; Singh et al., 2013; Wierzchoń & Kłopotek, 2018).
In fact, the specification of the appropriate distance measure for a specific context relies on factors such as the characteristics of features and the knowledge of the domain. In some cases, standard distances do not allow the detection of similarities between instances in a way that satisfies the purpose of the application of the clustering mechanism (Kulkarni et al., 2015).
In the context of this work, the categorization is not only intended to create categories of assessors with similar characteristics, but more precisely to establish a hierarchy between these categories. This involves clusters that represent different levels of mastery of PA. Indeed, the standard distances listed above do not make it possible to achieve, and then verify, the targeted form of clustering. These considerations led to the need to adapt the algorithm using an appropriate similarity measure that relies on the gap between the scores of assessors. Therefore, in this work, the objective of the adapted k-means is to minimize the following function:
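A minimal sketch of such an adaptation is given below. It assumes that each assessor is represented only by her or his score, that the distance to a centroid is the absolute score gap, and that the quantity being minimized is the sum of squared gaps to the assigned centroid; these choices, the initialization and the names `kmeans_scores` and `objective` are assumptions of this example rather than the article's exact formulation.

```python
def kmeans_scores(scores, k, max_iter=100):
    """1-D k-means on assessor scores; returns (labels, centroids).

    Label 0 is the lowest-scoring category and k - 1 the highest, so the
    clusters form a hierarchy of competency levels.
    """
    ordered = sorted(scores)
    n = len(ordered)
    # Assumed initialization: evenly spaced positions in the sorted scores.
    centroids = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * n
    for _ in range(max_iter):
        # Step 3: assign each assessor to the centroid with the smallest score gap.
        labels = [min(range(k), key=lambda j: abs(s - centroids[j])) for s in scores]
        # Step 4: move each centroid to the mean score of its members.
        new_centroids = []
        for j in range(k):
            members = [s for s, label in zip(scores, labels) if label == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        # Step 5: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # Relabel so that categories are ordered by centroid score.
    order = sorted(range(k), key=lambda j: centroids[j])
    rank = {old: new for new, old in enumerate(order)}
    return [rank[label] for label in labels], [centroids[j] for j in order]


def objective(scores, labels, centroids):
    """Assumed objective: sum of squared gaps between scores and their centroid."""
    return sum((s - centroids[label]) ** 2 for s, label in zip(scores, labels))


scores = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9]
labels, centroids = kmeans_scores(scores, k=2)
print(labels, centroids, objective(scores, labels, centroids))
```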
Post-Processing of the Categories:
Additionally, during the matching of assessors with assessees’ submissions, the system must assign a reasonable amount of assessment tasks to each assessor, which represents an important pedagogical constraint that should be respected in the context of any PA implementation.
However, k-means can result in very imbalanced categories, implying greater complexity in determining an appropriate matching strategy. Thus, with a view to balancing the categories, the method adopts a post-processing mechanism based on a greedy algorithm that exchanges the assessors with scores close to the boundaries of each cluster. Figure 3 summarizes the process of categorizing assessors using the adapted k-means with post-processing.

Figure 3. Process of the adapted k-means clustering with post-processing.
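One possible greedy balancing scheme is sketched below. It is an assumption of this example, not the article's algorithm: assessors are taken to be sorted by score, categories are contiguous slices of that sorted list, and at each step one boundary assessor is passed along from the largest category towards the smallest one. The name `balance_categories` is hypothetical.

```python
def balance_categories(sizes, tolerance=1):
    """Greedy rebalancing of ordered category sizes.

    sizes[j] is the number of assessors in category j (categories ordered by
    score). Returns cut positions: category j covers positions
    cuts[j]:cuts[j + 1] of the score-sorted list of assessors.
    """
    tolerance = max(1, tolerance)          # a gap of 0 is not always reachable
    k = len(sizes)
    cuts = [0]
    for size in sizes:
        cuts.append(cuts[-1] + size)
    while True:
        current = [cuts[j + 1] - cuts[j] for j in range(k)]
        big = current.index(max(current))
        small = current.index(min(current))
        if current[big] - current[small] <= tolerance:
            return cuts
        # Shift every boundary between the two categories by one position, so
        # each category in between hands its boundary assessor to its neighbour.
        step = -1 if big < small else 1
        for j in range(min(big, small) + 1, max(big, small) + 1):
            cuts[j] += step


print(balance_categories([10, 2, 3, 1]))
```

Each step lowers the size of the largest category by one and raises the smallest by one, so the loop ends once the sizes differ by at most `tolerance`.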
Discretization-Based Clustering
The discretization is a mechanism by which the range of a continuous quantitative variable is divided into a finite number of disjoint intervals. A qualitative categorical variable is then constructed on the basis of this separation, where its values are labels that characterize the elements of each interval (Dougherty et al., 1995; Ramírez-Gallego et al., 2016).
Moreover, in the fields of machine learning and data mining, discretization methods can be applied in two forms: supervised and unsupervised. Supervised discretization takes into consideration the class information of the instances in order to specify the intervals that characterize the attributes to be discretized, whereas unsupervised discretization methods do not rely on that parameter (Yang et al., 2009).
In order to build balanced clusters, we relied on an unsupervised discretization method called equal frequency discretization (Dash et al., 2011). This method involves dividing the range of values into intervals, each of which contains an equal number of values. Figure 4 describes the proposed discretization-based clustering method that includes:

Calculating the assessor's score for each learner.
Sorting the assessors (i.e. instances) in ascending order according to their scores.
Dividing the range of values into intervals so that each interval contains a number of assessors approximately equal to the total number of assessors divided by the number of intervals.
Constructing one cluster per interval, and then assigning each assessor to the cluster that corresponds to the interval to which her\his score value belongs.

Figure 4. Process of discretization-based clustering.
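The steps above can be sketched as follows, assuming the assessor's scores have already been computed. The function name `equal_frequency_clusters` is hypothetical, and assessors with identical scores may end up in neighbouring categories because the split is made on sorted positions.

```python
def equal_frequency_clusters(scores, k):
    """Return a category index per assessor (0 = lowest scores, k - 1 = highest)."""
    n = len(scores)
    # Sort assessor indices in ascending order of score.
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [0] * n
    for position, i in enumerate(order):
        # Positions are cut into k consecutive groups of roughly n / k assessors.
        labels[i] = position * k // n
    return labels


scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
print(equal_frequency_clusters(scores, k=4))
```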
The specification of the number of clusters and their labels will be discussed in the following subsection.
Constraints on the Matching of Learners
The mechanism of learner matching requires the introduction of specific rules under which the method for setting up assessor groups should be implemented. The goal is to define the criteria that set out the likelihood of matching the owner of a submission with some group of selected assessors who would individually comment and rate her\his work.
First Constraint: Adequate Workload and Reduction of Complexity
In order to reduce the complexity of the problem, we specified four categories for all assessors: Superior Level, Advanced Level, Intermediate Level and Novice Level. The proposed form of categorization was grounded in a pedagogical assumption that this separation into levels may reflect the reality adequately. This was confirmed by a number of tests relying on the elbow method to assess the consistency of the clustering produced by the k-means algorithm.
On the other hand, in order to maintain adequate validity for the assessment exercise and to avoid overburdening the participants, the number of assessments necessary for each submission is set at 4 (Cho & Schunn, 2007). Likewise, each learner can evaluate between 1 and 6 submissions. This number varies according to the category to which the assessor belongs.
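These figures imply a simple counting condition: with 4 assessments per submission, the loads assigned to the categories must provide at least 4 assessments per learner on average. The sketch below checks this condition; the per-category loads and category sizes are hypothetical values chosen within the 1 to 6 range, not the ones used in the article.

```python
REVIEWS_PER_SUBMISSION = 4


def workload_is_sufficient(category_sizes, category_loads):
    """True if the assessors' total capacity covers every required assessment."""
    learners = sum(category_sizes.values())
    capacity = sum(category_sizes[c] * category_loads[c] for c in category_sizes)
    return capacity >= REVIEWS_PER_SUBMISSION * learners


# Hypothetical loads (submissions per assessor) and balanced category sizes.
loads = {"Superior": 6, "Advanced": 5, "Intermediate": 4, "Novice": 1}
sizes = {"Superior": 25, "Advanced": 25, "Intermediate": 25, "Novice": 25}
print(workload_is_sufficient(sizes, loads))
```

With such loads the condition holds only when the categories are roughly balanced, which is one reason why imbalanced clusters complicate the matching strategy.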
Second Constraint: Peers Matching and Assessment Consistency
Matching the peers as assessors and as assessees requires the inclusion of an additional parameter that represents an estimate of the consistency of matching an assessor with regard to a particular submission.
In fact, learners with high-scoring performances tend to offer greater validity as assessors (Piech et al., 2013; Topping, 2013). We therefore assume that the learners with the lowest capacities as assessors need the most effective feedback from the assessors with high capacities. The estimated consistency of matching the peers should therefore be useful for a methodology that aims to increase the likelihood that the least performing assessors will be able to obtain the best available feedback.
In order to apply the above hypothesis, the proposed measure of matching consistency relies on the difference between the assessors’ scores of the two learners involved (i.e. the assessor and the assessee). Indeed, the consistency of the assessment of a learner when rating her\his own submission is considered null. Otherwise, the normalized consistency of matching the peers is specified as follows:
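To make the idea concrete, the sketch below shows one measure with the properties stated above: it is null for self-assessment, depends only on the gap between the two assessors' scores, and is normalized to [0, 1]. The rescaling used, the assumption that scores lie in [0, 1] and the name `matching_consistency` are choices made for this example and should not be read as the article's formula.

```python
def matching_consistency(assessor_id, assessee_id, scores):
    """Hypothetical normalized consistency of matching an assessor with an assessee.

    scores maps a learner id to her or his assessor's score, assumed in [0, 1].
    """
    if assessor_id == assessee_id:
        return 0.0                      # self-assessment has a null consistency
    gap = scores[assessor_id] - scores[assessee_id]   # lies in [-1, 1]
    return (gap + 1.0) / 2.0            # assumed rescaling to [0, 1]


scores = {"strong": 0.9, "weak": 0.1}
print(matching_consistency("strong", "weak", scores))
print(matching_consistency("weak", "strong", scores))
```

Under this assumed measure, pairing a highly competent assessor with a less competent assessee scores higher than the reverse pairing, in line with the hypothesis stated above.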
Third Constraint: Assessors Motivation and Reliability of the System
The motivation of the learners to participate effectively in the PA process represents a key factor in the success of this activity (Wang et al., 2015). The assessors that belong to the superior level are the driving force and the main determinant of the validity of the PA exercise. They therefore need to be motivated by providing them with quality feedback and ratings.
On the other hand, reducing the impact of novice-level assessors (learners with minimal assessment skills) helps to improve the overall accuracy of the assessment process. This goes hand in hand with the need to involve these assessors in a limited number of assessment tasks in order to avoid depriving them of the benefits of such activity.
Figure 5 describes the configuration of peers matching. As mentioned earlier, four participants review each submission. Besides, in order to motivate the assessors, each learner would have at least 3/4 of her\his assessors belonging to categories equivalent to or higher than the one to which she\he has been assigned.

Figure 5. Strategy of matching assessors with assessees.
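The rules stated above (four reviews per submission, no self-review, at most six reviews per assessor, and at least 3/4 of a learner's assessors in an equal or higher category) can be checked mechanically. The following is a minimal sketch under stated assumptions: the function name, the data layout, and the single-letter category labels are hypothetical, and the per-category workload is not checked.

```python
from collections import Counter

# Assumed ordering of the four categories, from highest to lowest competency.
LEVEL = {"S": 3, "A": 2, "I": 1, "N": 0}

REVIEWS_PER_SUBMISSION = 4  # stated in the text
MAX_LOAD = 6                # upper bound on reviews per assessor, stated in the text


def check_allocation(allocation, category):
    """Return a list of violations of the matching strategy (empty if none).

    allocation: dict mapping each submission owner to the list of her/his assessors.
    category:   dict mapping each learner to "S", "A", "I" or "N".
    """
    problems = []
    load = Counter()
    for owner, assessors in allocation.items():
        load.update(assessors)
        if len(set(assessors)) != REVIEWS_PER_SUBMISSION:
            problems.append(f"{owner}: needs {REVIEWS_PER_SUBMISSION} distinct assessors")
        if owner in assessors:
            problems.append(f"{owner}: self-review is not allowed")
        # Count assessors whose category is equal to or higher than the owner's.
        peers_or_better = sum(
            LEVEL[category[a]] >= LEVEL[category[owner]] for a in assessors
        )
        if 4 * peers_or_better < 3 * len(assessors):
            problems.append(f"{owner}: fewer than 3/4 of assessors at or above own category")
    for assessor, count in load.items():
        if count > MAX_LOAD:
            problems.append(f"{assessor}: assigned {count} reviews, maximum is {MAX_LOAD}")
    return problems
```

A check of this kind is useful as a guard after any allocation step, whichever solver produced the allocation.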
Modeling the Matching of Learners
The basic hypothesis behind this work rests on the belief that, contrary to random distribution, the intelligent assignment of submissions to assessors provides a major opportunity for optimizing PA outcomes.
In this subsection, we introduce the method used to optimize peers matching, which consists of two steps: 1) the mathematical modeling of the allocation of submissions as a many-to-many assignment problem and 2) the presentation of the algorithm developed to propose the best possible allocation of assessment tasks.
Mathematical Modeling
As mentioned before, the objective behind solving a generalized assignment problem is to specify the best assignment of a set of tasks to a set of agents (Sethanan & Pitakaso, 2016). Indeed, to achieve an optimal distribution, the objective consists of maximizing the performance of the execution of all tasks (Murthy & Ransbotham, 2019).
Furthermore, different types of extensions were suggested in the same context, including the many to many assignment problem (MMAP) where agents can carry out multiple but different tasks.
To formalize the optimization of learner matching according to a MMAP representation, the learners as assessors are considered as the problem agents, while learners as assessees compose the set of tasks, as stated in the introduction of this section. In addition, the strategy of matching specifies the required number of agents for the accomplishment of each task, besides the maximum potential assessment load to be delegated to each assessor.
Moreover, the performance of the execution of a specific task by a specific agent refers to the estimated consistency of the matching of the corresponding assessor with the corresponding assessee. The objective function of the problem of learner matching can be formulated as follows:
With:
Subject to:
For each submission
For each assessor
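As an illustration only, a model of this kind can be written as follows. The notation is assumed here rather than taken from the article: $n$ is the number of learners, $c_{ij}$ the estimated consistency of matching assessor $i$ with the submission of assessee $j$, $x_{ij}$ a binary decision variable, and $L_i$ the maximum load of assessor $i$ as fixed by her/his category. The bounds 4, 1 and 6 come from the matching strategy described earlier.

```latex
\begin{aligned}
\max \quad & \sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij}\,x_{ij} \\
\text{with} \quad & x_{ij} \in \{0,1\}, \qquad x_{ii} = 0, \\
\text{subject to} \quad & \sum_{i=1}^{n} x_{ij} = 4 \quad \text{for each submission } j, \\
& 1 \le \sum_{j=1}^{n} x_{ij} \le L_i \le 6 \quad \text{for each assessor } i .
\end{aligned}
```

Setting $x_{ii} = 0$ is one way to encode the exclusion of self-review; an equivalent option is to make $c_{ii}$ prohibitive.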
Solving the Allocation Problem
We stand in the context of this contribution on the
The solution that provides the
Application and adaptation of the
The application of the The The algorithm does not include the option of preventing an agent from executing specific tasks, while the system does not consider the self-review of an assessor for her\his own submission.
To comply with the first requirement, the
For the second criterion, in order to prevent a task from being allocated to its owner, we have introduced to the
The integration of the new parameter implies some modifications in the algorithm. In fact, within its process of resolution, the
levels ← [ “S”; “A”; “I”; “N”]; //Each level represents the concerned submissions’ owners
current_step ← 1;//1 for Superior, 2 for the Advanced, 3 for Intermediate, and 4 for Novice).
prepareTasks(step,levels[i]);//Select the concerned assessment tasks
initializeProblem(
doAssignments(
current
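As a rough illustration of what any allocation procedure must respect (no self-review, a fixed number of reviews per submission, a bounded load per assessor), here is a greedy heuristic. It is a deliberately simpler, swapped-in technique and not the assignment algorithm used by the authors; the function name, parameters, and data layout are hypothetical.

```python
def greedy_allocation(consistency, capacity, reviews_needed=4):
    """Greedy heuristic for a many-to-many allocation of submissions.

    consistency: dict {(assessor, owner): estimated matching consistency}
    capacity:    dict {assessor: maximum number of reviews}
    Returns {owner: [assessors]}. An owner may end up with fewer than
    `reviews_needed` assessors: a greedy pass guarantees neither
    feasibility nor optimality.
    """
    remaining = dict(capacity)
    allocation = {owner: [] for _, owner in consistency}
    # Visit candidate pairs from the most to the least consistent.
    ranked = sorted(consistency.items(), key=lambda item: item[1], reverse=True)
    for (assessor, owner), _ in ranked:
        if assessor == owner:
            continue  # self-review is excluded
        if remaining.get(assessor, 0) == 0:
            continue  # assessor has no capacity left
        if len(allocation[owner]) >= reviews_needed:
            continue  # submission already has enough assessors
        allocation[owner].append(assessor)
        remaining[assessor] -= 1
    return allocation
```

Because the greedy pass can leave some submissions short of assessors, an exact method is preferable whenever the total consistency has to be maximal.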
Calculation of the Final Score
In order to improve the accuracy and validity of PA, the calculation of the overall submissions’ scores should reflect differences in the capacities of the participating assessors. The objective of this subsection is to introduce an additional contribution relating to a methodology for an efficient calculation of final peer scores.
Indeed, the weighting of assessors’ ratings is a technique that aims to reflect the difference in influence between assessors according to their competency. Lan et al. (2010) propose weighting each assessment criterion based on each participant's learning style. Other methods use an overall weighting of the assessors based on indexes such as the Credibility Index introduced by Xiong et al. (2014) or the Reviewer Competency Index (RCI), which is calculated as part of the Calibrated Peer Review approach (Yao & Suen, 2016).
On the other hand, within the field of multi-criteria decision-making, Roszkowska (2013) provides an overview of various techniques for deducing the relative weights of different factors according to their ranking. Among these methods, we applied the Rank Sum Weight method (RS), previously used in this paper, as part of a weighting technique for assessors. The formula for calculating weights in RS is as follows:
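For reference, the rank sum weights are commonly stated as below, where $n$ is the number of ranks and $r_j$ the rank of item $j$ (rank 1 being the best). The symbols are assumed here, and the article's own Formula 7 may use different notation.

```latex
w_j = \frac{n - r_j + 1}{\sum_{k=1}^{n} \left(n - r_k + 1\right)}
    = \frac{2\,(n + 1 - r_j)}{n\,(n + 1)}
```

The closed form on the right holds when every rank from 1 to $n$ is used exactly once; when only some ranks are assigned, the weights can be renormalized over the ranks actually present so that they sum to 1.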
In addition, the decisive element in the proposed technique is the ranking of each assessor. This task should be performed taking into account the competency gap between the assessors considering their categories and their scores.
Indeed, in order to represent effectively the contribution of each assessor, the first criterion for the ranking is the categorical membership of assessors, whereas the second stands on the assessor score-based sorting. Table 1 shows the division of ranks into four levels that correspond to the hierarchy of categories.
Table 1. Distribution of Ranks by Category.
Ranks 1 to 4 are exclusive to the assessors of the superior level category, the next four for the advanced level, and so on to the remaining two categories. Some rank numbers cannot be assigned, according to the requirements of the allocation strategy presented earlier.
For assessors belonging to the same category, ranks are assigned in order of the assessors’ scores. Figure 6 illustrates the correspondence between the allocation strategy, the ranking and the weights calculated using Formula 7.

Figure 6. Process of weighting peers’ rating.
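The ranking and weighting steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes 16 ranks in four bands of four (1 to 4 for the superior category, and so on), rank sum weights renormalized over the assessors of a single submission, and hypothetical names throughout.

```python
# First rank of each category band: ranks 1-4 superior, 5-8 advanced,
# 9-12 intermediate, 13-16 novice (assumed from the description of Table 1).
BAND_START = {"S": 1, "A": 5, "I": 9, "N": 13}
TOTAL_RANKS = 16


def final_score(ratings):
    """Weighted final score of one submission.

    ratings: list of (category, assessor_score, rating) tuples, one per assessor.
    """
    # Rank by category band first, then by assessor score (highest first).
    ordered = sorted(ratings, key=lambda r: (BAND_START[r[0]], -r[1]))
    used_in_band = {}
    weighted = []
    for category, _, rating in ordered:
        offset = used_in_band.get(category, 0)
        used_in_band[category] = offset + 1
        rank = BAND_START[category] + offset
        # Unnormalized rank sum weight: n - r + 1.
        weighted.append((TOTAL_RANKS - rank + 1, rating))
    total_weight = sum(weight for weight, _ in weighted)
    return sum(weight * rating for weight, rating in weighted) / total_weight
```

For example, one superior, two advanced and one novice assessor receive ranks 1, 5, 6 and 13, hence unnormalized weights 16, 12, 11 and 4.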
Implementation and Experimentation
In this section, we first present an implementation of the system applying the approach. The second part focuses on the experimentation of the contributions, and the presentation of some results.
System Implementation
The structure of the proposed system consists of two components. The first is built into the platform and processes learner data to generate assessor profiles, whereas the second is a web service that relies on these profiles to provide an optimal matching of learners. This architecture allows for easier adaptation when using the system in conjunction with different types of platforms. For its part, the proposed Peer Assessment API (PAAPI) consists of two levels that reflect the contributions of this research.
Figure 7 illustrates the suggested architecture of the system. Within the first level of the API, the processing consists of receiving assessor profiles contained in a JSON file and then clustering them into four categories. The second level implements the logic behind the optimal matching of learners, which relies on the categories formed at the first level. The proposed distribution is then sent to the platform, also in JSON format. Finally, the system calculates the final scores using the weighting technique.

Figure 7. Architecture of the implementation of the contributions.
Moreover, the application of the present contributions focuses principally on the second part of the architecture, in which PAAPI is a RESTful, Spring Boot-based web service (Gutierrez, 2019). The adaptation of K-means relies on the WEKA API (Waikato Environment for Knowledge Analysis API) (Frank et al., 2016), whereas the allocation algorithm has been implemented using the Java programming language.
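The first level of the API (JSON profiles in, four categories out) can be sketched as below. This is a simplified stand-in written in Python rather than the Java and WEKA implementation described above. It assumes each profile carries a single numeric `score` field, which is a hypothetical simplification, and it uses a plain one-dimensional k-means with evenly spaced initial centroids.

```python
import json

LABELS = ["N", "I", "A", "S"]  # categories from the lowest to the highest centroid


def kmeans_1d(values, k=4, iterations=50):
    """Plain 1-D k-means; centroids start evenly spaced between min and max."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (2 * c + 1) / (2 * k) for c in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda c: abs(value - centroids[c]))
            groups[nearest].append(value)
        # An empty group keeps its previous centroid.
        updated = [
            sum(group) / len(group) if group else centroids[c]
            for c, group in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def categorize(profiles_json):
    """Attach a category label to every profile and return the result as JSON."""
    profiles = json.loads(profiles_json)
    centroids = kmeans_1d([p["score"] for p in profiles])
    for profile in profiles:
        nearest = min(
            range(len(centroids)), key=lambda c: abs(profile["score"] - centroids[c])
        )
        profile["category"] = LABELS[nearest]
    return json.dumps(profiles)
```

Real assessor profiles are made of several indicators, so a production version would cluster on vectors rather than on one score.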
Experimentations
This work is part of research aimed at maximizing the validity of PA within online learning environments. Two experiments applying the proposed approach were conducted in two different contexts. The first is a massive open online course (MOOC), while the second is a PA exercise carried out by a group of university students.
The present experiments addressed the following questions:
1. What impact can be made on the validity of PA and the acceptance of peer feedback using the proposed matching approach?
2. Could the proposed implementation effectively apply the matching strategy?
3. Will the system effectively operate within massive contexts?
4. Standing on the validity of assessors, does the proposed clustering reflect the difference of assessment capacities between categories?
Retrospective Study on a MOOC Class
Methods
Participants
The sample involved 4066 students enrolled in the Stanford Lagunita edX-MOOC course entitled “Writing in Sciences” in fall 2014. The data consists of the traces of interactions between participants and the platform. The dataset was made available with the permission of the Center for Advanced Research through Online Learning (CAROL).
Materials
The process of allocating assessment tasks uses as inputs the assessors’ profiles, represented by the indicators. These profiles are established based on the traces that record the learners’ behaviors and interactions with the learning environment (Lafifi et al., 2010). We use the implementation of the proposed system to carry out the matching process. For comparison purposes, the system also provides the results of several random distributions.
Measures
The main measure in this experiment is the accumulated assessment consistency that results from the allocation of submissions. The extracted value is compared with those of random distributions. The application also requires verifying the processing duration.
These measures aim at giving an idea of the system's capacity to maximize overall consistency between learners, as well as the effectiveness of its implementation in a massive context.
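To make the measure concrete, the sketch below sums the estimated consistency over an allocation and builds a random baseline for comparison. The `consistency` lookup, the function names, and the circular-shift way of drawing a random allocation are assumptions for illustration; the article does not specify how its random distributions were generated.

```python
import random


def accumulated_consistency(allocation, consistency):
    """Sum the estimated consistency over every (assessor, owner) pair.

    allocation:  dict {owner: [assessors]}
    consistency: dict {(assessor, owner): estimated matching consistency}
    """
    return sum(
        consistency[(assessor, owner)]
        for owner, assessors in allocation.items()
        for assessor in assessors
    )


def random_allocation(learners, reviews_needed=4, seed=None):
    """Random baseline: shuffle the learners, then let each submission be
    reviewed by the next `reviews_needed` learners on the shuffled circle.
    Every learner reviews exactly `reviews_needed` submissions, never her/his own.
    """
    if len(learners) <= reviews_needed:
        raise ValueError("need more learners than reviews per submission")
    order = list(learners)
    random.Random(seed).shuffle(order)
    n = len(order)
    return {
        order[p]: [order[(p + step) % n] for step in range(1, reviews_needed + 1)]
        for p in range(n)
    }
```

Comparing `accumulated_consistency` for an optimized allocation against several seeded random baselines gives the kind of comparison reported in the results.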
Procedure
The study design followed a number of steps:
In addition, we stored the profiles of assessors in a MySQL database. Then, in order to test the interaction with PAAPI, we created a minimal learning platform based on PHP. The platform sent the profiles to the API in JSON format and, after the processing, received the proposed allocation in the same format.
The mechanism was applied to the entire sample to verify its usability in massive contexts. Afterwards, for comparison purposes, an optimal matching was generated for a sample of 2000 learners.
3.
Analysis
The first objective of the analysis is the representation and verification of the results of the application of the approach. Then, the sum of the estimated assessment consistencies for each assignment is compared to that of a set of random distributions. The membership of the assessors by categories is also considered in order to check the possible differences between the resulting clusters.
Results
To illustrate the results, we take as an example an assessor from the Advanced category. Figure 8 shows her/his assessor profile.

Figure 8. Example of an assessor’s profile.
In accordance with the previously outlined strategy, each submission is reviewed four times. In addition, an assessor clustered as Advanced should assess five submissions: two belonging to assessors from the same category (i.e. Advanced Level), along with three others owned by one Superior, one Intermediate and one Novice (see Figure 5). Figure 9 shows the matchings in which this learner is involved as assessor and as assessee, whereas Figure 10 illustrates the profiles of the owners of her/his allocated submissions.

Figure 9. Outcomes of the allocation of submissions related to the assessor presented in Figure 8.

Figure 10. Profiles of submissions owners listed in Figure 9.
The proposed mechanism aims to move beyond the random distribution of submissions, given its negative effect on the validity of assessment and the motivation of participants. To demonstrate the advantage of the introduced approach within massive learning platforms, we compared the outcome of the matching mechanism with the results of four random distributions over a sample of two thousand students. The overall consistency of the suggested matches in each distribution served as the criterion of comparison. The results illustrated in Figure 11 show a significant improvement in the cumulative assessment consistency of the distribution optimized by the proposed approach compared with that of the random distributions. In addition, the system manages to process a number of participants that can be considered massive within a reasonable time.

Figure 11. Chart representing the accumulated consistency resulting from the different allocations.
Prospective Study on a University Class
Methods
Participants
The second experiment was based on a PA exercise carried out by 20 students of a master's class in information systems management. In order to validate the approach, the students were separated into two groups: an experimental group and a control group. The subject of the exercise was the UML modeling of an online PA system.
For the allocation of submissions using the approach, it was necessary to build profiles of learners as assessors. In this regard, we relied on the participants' performance on previous online and classroom activities, as well as on the opinions of two teachers of the class in question.
Materials
Measures
The validity of the students' assessments is the fundamental measure in the context of this experiment. Its value is calculated by measuring the correlation between the score given by the learner and that awarded by the teacher.
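A minimal sketch of this validity measure follows, assuming the correlation is Pearson's r (the text does not name the coefficient). The function name is hypothetical.

```python
from math import sqrt


def pearson(learner_scores, teacher_scores):
    """Pearson correlation between learner-given and teacher-given scores."""
    if len(learner_scores) != len(teacher_scores) or len(learner_scores) < 2:
        raise ValueError("need two score lists of equal length, at least 2 items each")
    n = len(learner_scores)
    mean_l = sum(learner_scores) / n
    mean_t = sum(teacher_scores) / n
    covariance = sum(
        (l - mean_l) * (t - mean_t) for l, t in zip(learner_scores, teacher_scores)
    )
    var_l = sum((l - mean_l) ** 2 for l in learner_scores)
    var_t = sum((t - mean_t) ** 2 for t in teacher_scores)
    if var_l == 0 or var_t == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(var_l * var_t)
```

A value close to 1 means the learner ranks and spaces submissions much as the teacher does, even if the two use different absolute scales.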
Procedure
The procedures used in the experiment can be described according to a number of steps:
1. Prior to the publication of the assignment on the platform, an introductory online meeting was arranged with the participating learners to clarify the PA activity, its purpose and its impact on the learners' capacities.
2. Construction of the assessor profiles based on the performance and activities of the learners, in addition to the opinions of the teacher. Missing indicators are not taken into account, and weights are recalculated for the indicators for which data were provided.
3. Proposition of the optimal matching of learners and distribution of the responses received after a 10-day deadline. Each participant receives the rubric along with the assessment tasks assigned to her/him. This step takes into account the anonymity of the assessors and the assessees, which represents an important requirement within the framework of PA (Güler, 2017). 13 of the 20 participants were assigned their assessment tasks according to the approach presented in this article, while the remaining 7 students formed a control group for which submissions were randomly assigned. As mentioned earlier, this methodology allows for comparison of results and validation of the approach.
4. Calculation of the final scores using the relative weighting method described above. Learners were also asked to evaluate the quality of their received feedback, which was taken into account with a limited percentage as part of the final score.
Analysis
In this regard, the mean validity of the weighted scores of the experimental group is compared with that of the control group. The means of validity and quality of assessments are also compared between the categories of assessors. Quality, in this respect, is the average of the scores that assessees awarded to the assessments of each assessor.
Results
The second experiment is primarily intended to demonstrate the influence of the implemented approach on the validity of peer-provided assessments. Figure 12 displays the overall outcomes of the experiment.

Figure 12. Chart representing the validity and the quality of assessments.
Indeed, the comparison between the validity of the experimental and the control groups revealed a gap of approximately 10% in favor of the method introduced in this paper. A difference can also be observed in the mean assessment validity of the members of each proposed category, in accordance with the distribution strategy, even if it is not very significant between certain categories. On the other hand, there is a clear difference between the quality of the assessments of the lowest category and that of the other categories.
Discussion
The two previous experiments aimed to apply and test the contributions presented in this paper, first in a mass learning setting and second in the context of a university class.
The first observation to be mentioned in this context is the ability of the proposed system to manage a large number of learners within the context of a MOOC (more than 4000 learners) despite the complexity due to the algorithm applied for solving the assignment problem.
Concerning the university class, one of the hallmarks of this study is the high average of assessors’ validity, which exceeds 78%. This may motivate instructors to use this approach within similar contexts, given its credibility as an assessment tool as well as its positive impact on learners. Besides, most participants expressed a favorable impression of the activity as a whole.
In addition, the results showed the contribution of the approach in terms of optimizing learner matching. Indeed, the cumulative consistency of the matching between learners as assessors and as assessees improved significantly with the proposed approach in comparison with random distributions. This obviously requires more validation with respect to the formula used to estimate this matching consistency.
For their part, the two methods of clustering showed a great similarity at the level of categorization results. The second method based on the discretization is faster and easier to implement. However, clustering using k-means is very useful in choosing the number of categories as well as in studying the consistency of the resulting clusters.
In addition, the difference between the averages of validity and consistency of assessments across categories reflects the hierarchy of capacity levels among assessors. Finally, the difference in validity between the experimental group and control group in favor of the former allows a first validation of the approach proposed by the contributions presented in this work.
Conclusions and Perspectives
Peer assessment is a technique that helps to alleviate the burden of massiveness in computer-based education and, particularly, in remote and mass access online settings such as MOOCs. It also positively impacts the skills and autonomy of the learners engaged in its practice.
This mechanism has been adopted by many online learning platforms as an appraisal tool, while taking into account the importance of maintaining the quality of learner-provided assessments. However, most of these platforms have overlooked a fundamental question in this regard, namely: Who should assess whose work? This paper addressed this issue by modeling learner matching as a many to many assignment problem, and then proposing a technique for its resolution using an appropriate combinatorial optimization algorithm.
Besides, in order to specify the tasks, the agents, and the execution performances for such a problem, we rely on the clustering of learners into categories of assessors, so that the clusters reflect different levels of learner assessment competencies. On the other hand, before the mathematical formalization of the problem, the allocation strategy was established to specify the appropriate workload for each assessor and to reduce the complexity of the allocation process.
The application of the suggested method has shown promising results with respect to the accuracy and validity of peer-provided assessments and the acceptance of peer feedback. Such an approach requires further validation through the evaluation of its effects when implemented in a MOOC involving PA.
Some assumptions also need to be checked, such as the rationale behind the equation that predicts the consistency of the matching between participants. The system also requires a technique for dealing with unassessed submissions.
Acknowledgments
We thank Zhu, H., Liu, D., Zhang, S., Zhu, Y., Teng, L., & Teng, S., the authors of the paper entitled “Solving the many to many assignment problem by improving the Kuhn–Munkres algorithm with backtracking”, for their valuable assistance during the implementation of the approach of learner matching. We would also like to show our gratitude to the Center for Advanced Research through Online Learning (CAROL) for providing us with the data used for the test.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
