Abstract
We propose new methods for identifying and classifying aberrant response patterns (ARPs) by means of functional data analysis. These methods take the person response function (PRF) of an individual and compare it with the pattern that would correspond to a generic individual of the same ability according to the item-person response surface. ARPs correspond to atypical difference functions. The ARP classification is done with functional data clustering applied to the PRFs identified as ARP. We apply these methods to two sets of simulated data (the first is used to illustrate the ARP identification methods and the second demonstrates classification of the response patterns flagged as ARP) and a real data set (a Grade 12 science assessment test, SAT, with 32 items answered by 600 examinees). For comparative purposes, ARPs are also identified with three nonparametric person-fit indices (Ht, Modified Caution Index, and ZU3). Our results indicate that the ARP detection ability of one of our proposed methods is comparable to that of person-fit indices. Moreover, the proposed classification methods enable ARP associated with either spuriously low or spuriously high scores to be distinguished.
1. Introduction
Knowledge and skills reflect individual characteristics that are evaluated indirectly through a respondent’s performance on certain tasks that are grouped into a test. The respondent’s answers to the test items are summarized into an individual score that is interpreted as an indicator of ability. Inferences from scores will be valid only if the individual level of achievement can be correctly inferred from them (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014). Various reasons may be adduced for inferring that an individual score is invalid. For example, respondents may copy the answers they do not know for the most difficult items from a more competent examinee. In this case, their test score will overestimate their ability. In other cases, a test score may underestimate the abilities of examinees, for example, when capable examinees fail to pay sufficient attention to the easiest items and thus provide incorrect answers. Such situations give rise to aberrant response patterns (ARPs), as a result of which inferences made about examinees’ abilities based on their test score will not be valid.
Numerous person-fit indices have been proposed for identifying ARP (e.g., Meijer & Sijtsma, 2001), which to a greater or lesser extent efficiently identify response patterns that underestimate or overestimate the latent ability of the respondent being evaluated. However, none of these indices have been designed to identify directly the type of bias involved in estimating this latent ability. Several authors (Emons et al., 2004, 2005; Nering & Meijer, 1998; Sijtsma & Meijer, 2001; Walker et al., 2016) have argued for identifying the type of ARP from the person response function (PRF), a synonym for person response curve. A PRF provides the probability of a certain respondent giving the correct answer to each test item (Trabin & Weiss, 1983).
The objectives of this study are the identification and classification of ARP based on functional data analysis (FDA). Analyzing PRF as functional data was first mentioned by Emons et al. (2004) as a possibility for further research in person-fit analysis. Our proposal essentially consists in comparing the PRF of an individual with the PRF that would correspond to a generic individual of the same ability according to the item-person response surface (IPRS), which also takes into account the estimated item parameters and which is estimated on the basis of the responses of all examinees. We compute the difference between both functions for each individual. The further the difference function is from zero, the more aberrant is the response pattern of the individual it represents. Functional outlier detection techniques are used to identify ARP. Finally, functional cluster analysis is applied to the PRFs that are flagged as aberrant in order to classify them into different types of ARP.
This article is structured as follows: Section 2 introduces the FDA concepts employed throughout this article. Section 3 presents the IPRS, which are defined in this section from a functional perspective. Section 4 presents our proposed methods for estimating the global IPRS and the PRF, as well as for detecting and classifying ARP. In order to illustrate the performance of these methods, two simulation studies are discussed in Section 5. Our proposed methods for identifying and classifying ARP are applied to a real data example in Section 6. Final conclusions are discussed in Section 7.
2. FDA
2.1. Basic Concepts
FDA deals with the statistical description and modeling of samples in which a whole function is observed for each individual. For instance, in the Berkeley Growth Study, the heights of 39 boys and 54 girls were measured and registered at 31 specific time points from 1 to 18 years old. Each individual in the study contributed a complete function to the sample, namely his or her own growth curve. This is one of the examples included in Ramsay and Silverman (1997/2005), which constitutes the first FDA monograph and includes functional versions for a wide range of statistical tools. A recent general introduction to FDA can be found in Kokoszka and Reimherr (2017). Two packages in R deserve special mention because of their broad coverage of functional tools:
A functional random variable is a random variable
The parameters describing the probability distribution of a functional random variable
2.2. Functional Depth
The measure of depth is intended to quantify the centrality of an observation within a given sample (see Cuevas et al., 2007, for a comprehensive treatment of the concept of statistical depth for functional data). Depth measures are useful for detecting functional outliers, identified as functional data with the least depth (Febrero-Bande et al., 2008).
In this article, we use modal depth (through the function
where
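As a rough illustration of how a depth measure can single out a functional outlier, the following minimal sketch computes a kernel-based (modal) depth for curves sampled on a common grid. It is not the R routine used in this article: the Gaussian kernel, the fixed bandwidth `h`, the Riemann-sum approximation of the L2 distance, and the toy curves are all assumptions made for the example.

```python
import math

def l2_distance(x, y, dt):
    """Approximate the L2 distance between two curves sampled on a common grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) * dt)

def modal_depth(curves, dt, h):
    """Sum of kernel weights of the distances from each curve to all curves."""
    depths = []
    for x in curves:
        total = 0.0
        for y in curves:
            u = l2_distance(x, y, dt) / h
            total += math.exp(-0.5 * u * u)  # unnormalised Gaussian kernel (assumed)
        depths.append(total)
    return depths

# Toy sample (hypothetical): four similar curves and one atypical curve.
grid_step = 0.25
curves = [
    [0.0, 0.1, 0.2, 0.3, 0.4],
    [0.0, 0.1, 0.2, 0.3, 0.5],
    [0.1, 0.1, 0.2, 0.3, 0.4],
    [0.0, 0.2, 0.2, 0.3, 0.4],
    [1.0, 0.9, 0.8, 0.9, 1.0],  # atypical curve
]
depths = modal_depth(curves, grid_step, h=0.2)

# The curve with the least depth is the outlier candidate.
least_deep = min(range(len(curves)), key=lambda i: depths[i])
```

In practice the bandwidth is usually derived from the data (for instance, from a quantile of the pairwise distances) rather than fixed by hand as it is here.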
2.3. Clustering of Functional Data
The distance matrix between observations is the only information required for hierarchical clustering. Clustering functional data can therefore be performed as soon as a distance between functional data is defined. For instance, the L2 norm can be used. Another classical clustering method, namely k-means, also requires computation of the averages of the observations allocated to the same cluster. We use the function
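The two ingredients mentioned above, a distance matrix between curves and cluster averages, can be sketched as follows. This is only an illustration under stated assumptions: curves are sampled on a common grid, the L2 distance is approximated by a Riemann sum, the cluster average is the pointwise mean, and the initial centers are chosen by hand. The helper names are hypothetical and do not correspond to the R functions used in this article.

```python
import math

def l2_distance(x, y, dt):
    """Approximate L2 distance between two curves on a common grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) * dt)

def distance_matrix(curves, dt):
    """Pairwise distances: all that hierarchical clustering needs."""
    n = len(curves)
    return [[l2_distance(curves[i], curves[j], dt) for j in range(n)]
            for i in range(n)]

def pointwise_mean(curves):
    """Average curve of a cluster, computed grid point by grid point."""
    return [sum(vals) / len(vals) for vals in zip(*curves)]

def functional_kmeans(curves, centers, dt, n_iter=20):
    """Lloyd iterations with the L2 distance; `centers` are initial curves."""
    labels = []
    for _ in range(n_iter):
        # Assign each curve to its nearest center.
        labels = [min(range(len(centers)),
                      key=lambda k: l2_distance(x, centers[k], dt))
                  for x in curves]
        # Recompute each center as the mean of its members.
        new_centers = []
        for k in range(len(centers)):
            members = [x for x, lab in zip(curves, labels) if lab == k]
            new_centers.append(pointwise_mean(members) if members else centers[k])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Toy data (hypothetical): two decreasing and two increasing curves.
curves = [
    [0.9, 0.7, 0.5, 0.3],
    [0.8, 0.7, 0.4, 0.3],
    [0.2, 0.4, 0.6, 0.8],
    [0.3, 0.4, 0.7, 0.8],
]
dt = 1.0 / 3.0
D = distance_matrix(curves, dt)
labels, centers = functional_kmeans(curves, [curves[0], curves[2]], dt)
```

Because k-means depends on the initial centers, a real analysis would typically repeat the procedure from several random starts.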
3. IPRS
3.1. Definition and Estimation
For the sake of simplicity, we consider a unidimensional one-parameter item response theory model. Let us assume that an exam has m items that differ only in their (latent) difficulties
The IPRS is a function p defined from
A simple statistical model for the data coming from an IPRS is as follows. Let
The inference goal is to estimate abilities
We use the 1PLM with
When working with a nonparametric model, we assume that
where
We estimate the models in Equations 1 and 2 in the following way. First, the total score is computed for examinee i on the exam as
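To make the setting concrete, the sketch below evaluates a one-parameter logistic response surface and simulates binary responses from it, then computes each examinee's total score. It assumes the common parameterization with unit discrimination, which may differ from the scaling used in this article; the ability and difficulty values are made up for the example.

```python
import math
import random

def iprs_1pl(theta, b):
    """Probability of a correct answer under a 1PL model (unit discrimination assumed)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_responses(thetas, difficulties, rng):
    """Draw one Bernoulli response per examinee-item pair from the surface."""
    return [[1 if rng.random() < iprs_1pl(t, b) else 0 for b in difficulties]
            for t in thetas]

rng = random.Random(0)                       # fixed seed for reproducibility
difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical item difficulties
thetas = [-1.0, 0.0, 1.5]                    # hypothetical abilities

responses = simulate_responses(thetas, difficulties, rng)

# Total score of each examinee: number of correct answers.
total_scores = [sum(row) for row in responses]
```

The surface is increasing in ability and decreasing in difficulty, so an examinee's profile across items sorted by difficulty is a decreasing curve.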

Illustrative example. (A) Two-dimensional representation of the item-person response surface
3.2. ARP
The PRF for individual i is a function
Nevertheless, it is possible that the PRF for certain individuals do not coincide with their corresponding IPRS profiles. In such cases, we say that these individuals present an ARP. More formally, we say that individual i follows an ARP when the functions
Types of ARP Considered in This Study, Their Characteristics, and Operationalization for Simulating Them
An IPRS model including the ARP is as follows. Consider the difficulties bj
and the abilities

Illustrative example including 10 aberrant response patterns (ARPs) of the following types: cheaters (a1 and a2), lucky guessers (b1 and b2), random respondents (c1 and c2), careless respondents (d1 and d2), and creative respondents (e1 and e2). Panel A gives their PRF and Panel B gives the ARP profiles for one examinee of each type.
4. Implementation
4.1. Outline of the Procedure
We propose the following procedure for identifying examinees with ARP. First, the IPRS is estimated using data from all the individuals in such a way that the estimation is not significantly affected when ARPs are present. Second, the PRF of individual i is estimated using data from this individual only; this estimation will be affected by the presence of ARP. Third, for each individual, the corresponding profile derived from the estimated IPRS is compared with its individually estimated PRF. Individuals for whom both estimations are extremely different are identified as ARP. The estimation procedures are detailed in Section 4.2, and the ARP identification process is described in Section 4.3.
4.2. IPRS and PRF Estimation
Following the estimation strategy for the IPRS model introduced in Section 3, we propose using a specific type of nonparametric regression model: the generalized additive model (GAM; see, for instance, Wood, 2017), which assumes that
where s1 and s2 are smooth functions that can take any value in
In order to estimate an individual’s
For the illustrative example, Figure 3A shows the estimated IPRS and Figure 3B the PRF for the 100 individuals whose PRF was plotted in Figure 2. The ranking and relative order of the 10 individuals presenting ARP have changed with respect to Figure 2 because these individuals were chosen at random before their IPRS profiles were replaced by the ARP functions that correspond to abilities different from those they originally had. Figure 3B shows that there are more than 10 estimated PRF that could be declared ARP.

Continuation of the illustrative example. (A) Item-person response surface and (B) person response function (PRF) estimates of the
4.3. Detection of ARP
Rationale
We propose to compute the differences between
When individual i has an ARP,
ARP detection
We propose three methods for ARP detection, the first two of which are based on FDA. In particular, the first method consists in computing the modal depth of the functional data
The second one uses the estimated IPRS
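The idea of comparing log-likelihoods can be illustrated with a generic Bernoulli log-likelihood difference: the observed responses of an examinee are scored once under the individually estimated PRF and once under the profile derived from the IPRS. The sketch below is an assumption about the general form of such a statistic, not the exact statistic or threshold of this article; all probabilities and response vectors are hypothetical.

```python
import math

def bernoulli_loglik(responses, probs, eps=1e-9):
    """Log-likelihood of binary responses given per-item success probabilities."""
    total = 0.0
    for y, p in zip(responses, probs):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def loglik_difference(responses, prf_probs, iprs_probs):
    """Log-likelihood under the individual PRF minus that under the IPRS profile."""
    return (bernoulli_loglik(responses, prf_probs)
            - bernoulli_loglik(responses, iprs_probs))

# Items sorted from easiest to hardest (hypothetical values).
iprs_profile = [0.9, 0.8, 0.6, 0.4, 0.2]

typical = [1, 1, 1, 0, 0]
typical_prf = [0.9, 0.8, 0.6, 0.4, 0.2]   # PRF coincides with the profile

aberrant = [0, 0, 1, 1, 1]                # fails easy items, solves hard ones
aberrant_prf = [0.2, 0.3, 0.6, 0.8, 0.9]  # PRF far from the profile

d_typical = loglik_difference(typical, typical_prf, iprs_profile)
d_aberrant = loglik_difference(aberrant, aberrant_prf, iprs_profile)
```

A large difference indicates that the examinee's own curve explains the responses much better than the curve expected for that ability level; a cutoff for "large" would have to be calibrated, for example by resampling.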
Illustration
Figure 4A shows for the illustrative example the differences,

(A)
Given that the method based on the log-likelihood-ratio test statistic is a nonstandard way of detecting outliers, its performance is shown for the illustrative data. Figure 5 displays the scatter plot of

Scatter plot of
4.4. Classification of Identified ARP
Rationale
In order to characterize the PRF of individuals identified as ARP, we propose using clustering techniques for functional data (k-means and hierarchical clustering, as in Section 2.3). The working functional data set is formed by
where
where
Procedure
For functional k-means clustering, we used the function
Illustration
As an example, we worked with a simulated data set of
When applying the ARP detection method based on log-likelihood differences (using an ARP threshold based on a resampling procedure that mimics that of the
For the functional k-means clustering, the optimal number of clusters was
Cross Table of the True ARP Type by the Assigned Cluster by k-Means Clustering for the 49 Individuals Identified as ARP
Note. The cluster numbers correspond to those in Figure 6A.

Clustering with (A) k-means and (B) hierarchical clustering of the 49
Figure 6A gives a graphical representation of the discovered clusters. The thin gray dashed curves are the PRF of the 49 individuals identified as ARP. Each thick curve represents a summary of one cluster: They are the logit inverse transformation of the average of the logit transformation of the PRF for the individuals in each cluster. These summary curves had the expected shape: The two solid gray curves correspond to Clusters 1 and 2, including non-ARP respondents who differed in ability but shared the characteristic of moving too suddenly from easier items with correct responses to more difficult items with incorrect responses; the dot-dashed curve (corresponding to Cluster 3 comprising only three cheaters) increases abnormally for the difficult items (
The dendrogram resulting from hierarchical clustering is shown in Figure 7. It suggests that a cut defining three clusters is appropriate, leading to the results summarized in Table 3 and in Figure 6B. One may observe that one cluster with only non-ARP individuals was identified (Cluster 1, whose average is the gray solid line in Figure 6B). Two more clusters also appeared: Cluster 2 (2-dotted curve in Figure 6B) included all cheaters, guessers, and random respondents plus one non-ARP case, while Cluster 3 (long-dashed curve in Figure 6B) consisted of creative and careless respondents plus one non-ARP.
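The cluster summary curves plotted in Figure 6 are described as the inverse logit of the average of the logit-transformed PRFs in each cluster. A minimal sketch of that computation follows; the clipping constant `eps` (to keep the logit finite) and the toy curves are assumptions added for the example.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + math.exp(-z))

def cluster_summary(prfs, eps=1e-6):
    """Average the PRFs of one cluster on the logit scale, then map back."""
    summary = []
    for values in zip(*prfs):  # iterate over grid points (items)
        clipped = [min(max(p, eps), 1.0 - eps) for p in values]
        mean_logit = sum(logit(p) for p in clipped) / len(clipped)
        summary.append(inv_logit(mean_logit))
    return summary

# Two hypothetical PRFs from the same cluster, evaluated at three difficulties.
cluster_prfs = [
    [0.9, 0.5, 0.2],
    [0.8, 0.5, 0.1],
]
summary = cluster_summary(cluster_prfs)
```

Averaging on the logit scale keeps the summary curve inside (0, 1) and gives less weight to values already close to 0 or 1 than a plain arithmetic mean of probabilities would.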

Dendrogram for the hierarchical clustering applied to aberrant response patterns from simulated data.
Cross Table of the True ARP Type by the Assigned Cluster by Hierarchical Clustering for the 49 Individuals Identified as ARP
Note. The cluster numbers correspond to those in Figure 6B.
There was a high degree of concordance between the clusters obtained by k-means and those by hierarchical clustering, as shown in Table 4. The main difference was that Clusters 3, 4, and 5 obtained by k-means were grouped when using hierarchical clustering as Cluster 2.
Cross Table of the Cluster Composition Through k-Means and Hierarchical Clustering for the 49 Individuals Identified as ARP
Note. The cluster numbers correspond to those used in Figure 6.
5. Simulation Studies
5.1. Simulation of the ARP Detection
Method
We conducted a simulation study to evaluate the ARP identification power of the three different methods proposed in Section 4.3. Furthermore, we compared them with three well-known nonparametric person-fit statistics, which have been reported to be among the best at identifying aberrant-responding examinees by Karabatsos (2003): Sijtsma’s Ht, Harnisch & Linn’s Modified Caution Index (labeled as
Instead of the default value of 1,000, only 100 resamples were used in our simulations in order to reduce computing time. The other function parameters in the library
The design of the simulation study was in accordance with what is usual in this field (Rupp, 2013). A total of
Three different proportions (
Results
Figures 8 and 9 summarize the results of the simulation study for

Simulation results for n = 500: Sensitivity (probability of correctly detecting an aberrant response pattern).

Simulation results for n = 500: Specificity (probability of correctly detecting a normal response pattern).
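For reference, sensitivity and specificity as defined in the captions of Figures 8 and 9 can be computed from the true ARP status and the flags produced by a detection method. The sketch below uses made-up indicator vectors; returning NaN when a class is empty is a choice made for this example.

```python
def sensitivity_specificity(true_arp, flagged):
    """Return (sensitivity, specificity) from two 0/1 indicator sequences."""
    tp = sum(1 for t, f in zip(true_arp, flagged) if t and f)
    fn = sum(1 for t, f in zip(true_arp, flagged) if t and not f)
    tn = sum(1 for t, f in zip(true_arp, flagged) if not t and not f)
    fp = sum(1 for t, f in zip(true_arp, flagged) if not t and f)
    # Sensitivity: proportion of true ARP that are flagged.
    sens = tp / (tp + fn) if (tp + fn) > 0 else float("nan")
    # Specificity: proportion of normal patterns that are not flagged.
    spec = tn / (tn + fp) if (tn + fp) > 0 else float("nan")
    return sens, spec

# Hypothetical example with 3 true ARP among 8 examinees.
true_arp = [1, 1, 0, 0, 0, 0, 1, 0]
flagged = [1, 0, 0, 0, 1, 0, 1, 0]
sens, spec = sensitivity_specificity(true_arp, flagged)
```

In a simulation study, these two rates would be averaged over the replicated data sets of each condition.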
The main results from the simulation study can be summarized as follows:
The following general findings were valid for all the ARP detection methods:
The lower the proportion of ARP, the better its detection.
Cheaters and creative respondents were the easiest types of ARP to detect. This was particularly clear when looking at the sensitivity. This result was expected, given that cheaters and creative respondents were simulated with the largest deviations from normal patterns.
Regarding our suggested methods, the main results were the following:
In general, the method based on the log-likelihood-ratio test statistic (
For low proportion of ARP (5%), the methods based on functional depth (
These findings allowed us to compare our methods with the person-fit statistics:
Comparison of these four methods (in both sensitivity and specificity) enabled us to state that the sensitivity of
Discussion
The proposed ARP identification methods present better detection rates in conditions with a lower proportion of ARP. This is the usual result obtained with other indices: Detection rates decrease as the percentage of ARP increases (e.g., Karabatsos, 2003). Thus, under the simulated condition with a 25% presence of ARP, the detection rate of all ARP types is low. In general, however, under simulated conditions with a relatively low percentage of ARP (5% or 10%), the detection rates increase with all methods and for any type of ARP. This result is in accordance with those reported by Rupp (2013).
Calculating the difference between the log-likelihoods presents the best performance among the proposed methods. Its good functioning is not only due to a high ARP detection rate against false negatives (in general sensitivities above .90) but also due to a high detection rate of normal patterns against false positives (specificities above .95). The sensitivity of this index is lower when identifying characteristic patterns of guessers and careless respondents.
This result is not unexpected since the previously defined characteristic patterns of guessers and careless respondents deviate less from the normal patterns than those of cheaters and creative respondents. However, both cheaters and guessers deviate from the representative function of normal responses in the most difficult items, just as the careless and creative respondents deviate from the representative function of normal responses in the easiest items.
Among the proposed methods, the one based on the log-likelihood-ratio test statistic presents the best performance and is globally comparable to that of the three nonparametric person-fit indices we have used. In particular, it performs similarly to
5.2. Simulation of the Identified ARP Classification
Method
A simulation study was conducted to evaluate the identified ARP classification proposal in Section 4.4. We simulated exams using
We applied clustering methods to classify separately the two sets of identified ARP by either
When we analyzed all the simulated exams, it was not possible to replicate for each exam the detailed analysis conducted for the example in Section 4.4. Thus, automatic summaries of the clustering results were required. For each simulated exam, and for the sets of ARP identified by either
Number of columns K. This was the optimal number of clusters when doing either k-means or hierarchical clustering. The range of possible values of K was constrained to between 2 and 10.
Combined purity of each type of ARP. In order to evaluate when a specific type of ARP, say ti, was correctly allocated to one of the identified clusters while, at the same time, taking into account whether the cluster where it was mostly allocated was shared or not by other types of ARP, we computed the ARP combined purity:
where
To fix ideas, the combined purity of cheaters in Table 2 (k-means) was
which was attained at Cluster 4. On the other hand, the combined purity of cheaters in Table 3 (hierarchical clustering) was
which was attained at Cluster 2. Thus, cheaters were classified better by hierarchical clustering than by k-means.
Chi-square distance between types of ARP. In order to determine which types of ARP tended to be classified together, we computed the chi-square distance between the rows of the cross tables (e.g., Greenacre, 2016):
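As an illustration, the chi-square distance between two rows of a cross table, in its usual correspondence-analysis form, compares the row profiles and weights each squared difference by the inverse of the corresponding column mass. The sketch below follows that standard definition; the notation of this article may differ, and the toy table (ARP types in rows, clusters in columns) is hypothetical.

```python
import math

def chi_square_distance(table, i, k):
    """Chi-square distance between the profiles of rows i and k of a cross table."""
    grand = sum(sum(row) for row in table)
    n_cols = len(table[0])
    # Column masses: column totals divided by the grand total.
    col_mass = [sum(row[j] for row in table) / grand for j in range(n_cols)]
    row_i_total = sum(table[i])
    row_k_total = sum(table[k])
    total = 0.0
    for j, mass in enumerate(col_mass):
        if mass == 0:
            continue  # an empty cluster contributes nothing
        diff = table[i][j] / row_i_total - table[k][j] / row_k_total
        total += diff * diff / mass
    return math.sqrt(total)

# Hypothetical cross table: rows are ARP types, columns are clusters.
table = [
    [8, 2, 0],   # type A
    [4, 1, 0],   # type B: same profile as type A
    [0, 1, 9],   # type C
]
d_ab = chi_square_distance(table, 0, 1)
d_ac = chi_square_distance(table, 0, 2)
```

Two types that are spread over the clusters in the same proportions have distance zero, regardless of how many individuals each type contains.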
Results
Regarding the optimal number of clusters, K, Table 5 shows the summary statistics of
Summary Statistics of the
Given that the results on combined purity were similar for both

Average combined purity for each type of aberrant response pattern (ARP) in (A) k-means and (B) hierarchical clustering. ARP detection is conducted with the method based on log-likelihood ratio.
The matrices containing the average of chi-square distances between types of ARP throughout the

Average chi-square distance between clustering profiles of aberrant response patterns (ARPs) obtained by (A) k-means and by (B) hierarchical clustering. ARP detection was done with the method based on log-likelihood ratio. Black corresponds to the lowest distances and white to the greatest.
Discussion
No clustering method perfectly classifies the six types of responses considered. The k-means method tends to propose more (and less pure) clusters than the hierarchical method. In both cases, different types of ARPs are combined in the same cluster in a logical way. The hierarchical clustering method is good at distinguishing clusters with response patterns that are mainly associated with (a) spuriously high scores (cheaters, guessers, and random respondents), (b) spuriously low scores (careless and creative respondents), and (c) those with non-ARP (wrongly identified as such), which have a certain similarity to careless and creative respondents. The results obtained by the k-means method are less clear but still coherent. There is one pure cluster for non-ARP identified as such, another that is not so pure for cheaters, and two more that mainly group together guessing with random respondents, and careless with creative respondents, respectively. Once again, the responses associated with spuriously low scores tend to be grouped together, but now the responses associated with spuriously high scores are less drawn together than in hierarchical clustering.
6. Empirical Example
We applied our proposed methods for identifying and classifying ARP to the responses of 600 students to 32 items on a Grade 12 science assessment test (SAT12) which measured their knowledge on the topics of chemistry, biology, and physics. These data are available at the
Figure 12 describes this data set. It shows (a) the empirical cumulative distribution functions of the examinee proportion of correct answers (this variable seemed to be close to normality) and (b) the item proportion of examinees who answered it incorrectly (quite close to uniformity). These two features were used to estimate examinee abilities

Description of the SAT12 data set. Empirical cumulative distribution function of (A) examinee proportion of right answers and (B) item proportion of examinees who answered incorrectly.
In addition to the PRF, the IPRS for each examinee was estimated as in Section 4.2. Both sets of functions were used for ARP detection by means of the
The 86 cases identified as ARP were used in the classification step (as in Section 4.4). Both k-means and hierarchical clustering were applied, and the number of automatically determined clusters for both of them was three. The automatic selection for k-means yielded one very large cluster that was distributed nearly uniformly across the clusters of the hierarchical solution (the dendrogram is shown in Figure 13). To break up this mixed cluster, we explored the k-means solution for

Dendrogram for the hierarchical clustering applied to the 86 curves identified as aberrant response patterns in the SAT12 data set.
Clustering of 86 Curves Identified as ARP in the SAT12 Data Set. Crossing the Assigned Cluster by k-Means and Hierarchical Clustering
Note. The number of clusters for k-means is four. The cluster numbers correspond to those in Figure 14.
We now describe the ARP classification results corresponding to k-means with

Clustering of 86 curves identified as aberrant response pattern in the SAT12 data set. (A) k-means results with
Cluster A is represented with dotted curves in Figure 14. These curves underwent a sudden decline (far more pronounced than the average) resembling perfect Guttman patterns. They were detected as ARP by
Cluster B, represented with long-dashed curves in Figure 14, was composed of individuals with lower than average probabilities of giving the right answer to the items of low and medium difficulty, but the opposite happened for the most difficult items. A common characteristic was that the curves in this cluster decreased more slowly than the average. These examinees shared certain characteristics with careless and creative respondents.
Cluster C is represented with two-dotted curves in Figure 14. It corresponded to individuals with greater than average probabilities of giving the right answer to the items with medium and high difficulty, but the opposite happened for the easiest items. As in the previous cluster, the curves in this cluster decreased more slowly than the average. These examinees shared certain characteristics with cheaters and guessers.
7. General Discussion
An ARP detection and classification methodology is presented based on PRF. Regarding identification, our simulation experiments reveal that the approach denoted as
Regarding classification of identified ARP, we have used functional distances to perform functional k-means and hierarchical clustering. In both cases, the number of clusters was chosen automatically. Our results indicate that k-means tends to identify more clusters than hierarchical clustering and also that it is less stable than the latter because of its dependence on random initial centers for the k clusters.
We recommend that practitioners follow these steps (cf. empirical example):
Perform descriptive analyses (as in Figure 12) to visualize the distributions of abilities and difficulties.
Estimate the IPRS and the PRF nonparametrically.
Flag all ARP cases identified by either
Classify the flagged ARP using both functional k-means and hierarchical clustering, with automatic determination of the number of clusters.
Describe the final clusters representing the average curves as in Figure 14.
This strategy makes it possible to distinguish patterns that are associated with spuriously low scores from those associated with spuriously high scores. It even allows different types of ARPs to be detected among the high-scoring examinees.
It is possible to extend this work in two main directions. On the one hand, the number of items and respondents could be expanded. In our study, we simulated only 50 items and both 200 and 500 respondents. Subsequent studies should be carried out to analyze the extent to which the results of the study can be generalized to other evaluation conditions. On the other hand, in addition to the difficulty of the items, future analyses should consider their discrimination and the probability of random responses. Our proposal can be generalized to the 2PL model as follows. Consider that items can vary in difficulty b and discrimination a, with
Acknowledgments
We are very grateful to the associate editor and three anonymous reviewers for their valuable comments and suggestions that helped us to improve this work considerably.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: This investigation is partially supported by the Spanish Ministerio de Economía y Competitividad Grants EDU2013-41399-P (Eduardo Doval) and MTM2017-88142-P (Pedro Delicado).
