Abstract
The Rasch Poisson Counts Model is the oldest Rasch model developed by the Danish mathematician Georg Rasch in 1952. Nevertheless, the model has had limited applications in psychoeducational assessment. With the rise of neurocognitive and psychomotor testing, there is more room for new applications of the model where other item response theory models cannot be applied. In this paper, we give a general introduction to the Rasch Poisson Counts Model and then using data of an attention test walk the reader through how to use the “lme4” package in R to estimate the model and interpret the outputs.
Introduction
In most item response theory (IRT) models, the unit of analysis is the individual item. In such models, the probability that a person correctly answers an item or endorses certain categories is modeled. However, common IRT models need at least one parameter per item (any many more on polytomous IRT models), so they are relatively complex for situations where the same task or many simple tasks are given to examinees and aggregation of hits/misses is conducted. Such testing conditions arise in psychomotor testing (Spray, 1990), the testing of attention/processing speed (Baghaei, Ravand, & Nadri, 2019; Doebler & Holling, 2015), oral reading errors (Jansen, 1997a; Rasch, 1960/1980; Verhelst & Kamphuis, 2009), reading comprehension (Verhelst & Kamphuis, 2009), and divergent thinking (Forthmann et al., 2016). In these tests, examinees usually have to solve an unlimited (or at least very large) number of relatively easy items within a fixed period of time. Another example is identifying correctly spelled words in a long list of words. In such testing situations, the total scores (raw counts) or the total numbers of errors on the tasks are modeled instead of the individual attempts.
The Rasch Poisson Counts Model (RPCM, Rasch, 1960/1980) is a member of the family of Rasch models which was developed for tests where counts of errors or successes on a number of tasks are modeled instead of replies to individual items. Modeling the number of errors might be the only option when the number of potential successes is not well defined, say in classic oral reading tests where examinees are to read a passage aloud and the test administrator counts the number of errors (Rasch, 1960/1980). In such conditions, the total scores or the total number of errors on each block are assumed to be the realization of a Poisson process (e.g., Ross, 1983). That is, the number of correct checks (or errors) on each block for each person is assumed to be Poisson distributed and is the unit of analysis. 1
RPCM is a unidimensional latent trait model (Jansen, 1994) and enjoys all the elegant properties of Rasch models including separate person and item parameters, sufficiency of raw scores, and specifically objective comparison of persons and items (Masters & Wright, 1984). Masters and Wright (1984) demonstrate that the RPCM, binomial trials, rating scale model, and the partial credit model all originate from a logistic function and hence have the same algebraic basis. In all these models, simple counts of correct replies or successes are the sufficient statistics to estimate ability and difficulty parameters.
The Poisson distribution
To appreciate the RPCM, understanding the mathematical basis of the model is in order. The model rests on the Poisson probability function and its underlying assumptions. The Poisson probability function, named after the French mathematician Siméon Denis Poisson (1781–1840), expresses the probability of a number of events occurring in a fixed time period if the average number of events is known. The function is written as follows
The Poisson probability function is adequate in the following situation: The number of many potential events is recorded, each event has a very small probability of occurring, and the number of potential events might be unknown. In addition, the occurrence of one event does not have any impact on the probability of another event (stochastic independence). Suppose the average number of oral reading errors is 2.5 in a session (say this value has been computed using the past records of reading error frequency). Now we can compute the probability of making no errors (k = 0) or 1, 2, 3 . . . errors in a given reading session. Using the Poisson probability function, we have
That is, the probability of making no errors in a session is 0.08, the probability of one error is 0.20, and the probability of making two errors is 0.25.
Rasch Poisson Counts Model
The RPCM is historically the first member of the family of Rasch models developed by the Danish mathematician and statistician Georg Rasch. Despite being the first Rasch model introduced, it has had limited applications compared with other extensions of the model such as the dichotomous model or the polytomous models for rating data. In 1951, the Danish Ministry of Social Affairs assigned Rasch to the task of analyzing oral reading data collected over several years from a group of 125 students with reading problems. The complication that Rasch encountered was that the texts students had read in each testing administration differed in difficulty, making the monitoring of their reading development rather difficult. In a concrete formulation of this problem I imagined—in good statistical tradition—the possibility that the reading ability of a student at each stage, and in each of the two above-mentioned dimensions [accuracy and speed], could be characterized in a quantitative way—not through a more or less arbitrary grading scale, but by a positive real number defined as regularly as the measurement of a length (Rasch, 1977).
Rasch (1960/1980) used the Poisson probability function to model raw counts of errors on the reading tests. Here, the total numbers of misreadings on individual tests were modeled. The RPCM assumes that the total raw scores or errors Y vi of person v on part i of a test are independent and Poisson distributed with mean μvi. Note that in part of the RPCM literature, a collection of simple items or tasks, that is, a subtest, is referred to as an item, and the total scores on these subtests are modeled and not responses to individual items. Item parameters refer to the subtests. Time limits on the subtests are optional (Doebler, Doebler, & Holling, 2014).
The parameter μvi is the expected number of successes of person v on item i. The probability that person v gets a raw scores of y on item i is given by the Poisson function which is the multiplicative form of the model:
The probability to observe a score (or number of errors) yvi in equation (2) is obtained by replacing k in equation (1) by yvi and λ by μvi. The expectation μvi is a function of person’s ability and item’s difficulty and is assumed to have a multiplicative composition, that is, it is the product of person’s ability and item’s easiness:
Rasch (1960/1980) with some algebra demonstrated that it is possible to estimate the person parameter
We use the tilde symbol to indicate that log-parameters are used. Note that we can always obtain the original form in equation (3) by applying the exponential function, that is,
The additive specification is important because it allows to view the RPCM as a special case of a generalized linear mixed model, a large class of regression models, and to employ corresponding software.
Local independence and unidimensionality are assumed to hold for the RPCM like in other IRT models. That is, a test is supposed to measure a single trait, and the trials should be independent of each other conditional on a fixed level of ability as are the scores for different examinees. This assumption may be violated as learning and fatigue can affect attempts, especially those close to each other in time (Spray, 1990).
There are several estimation methods for the RPCM (Verhelst & Kampuis, 2009). As there are sufficient statistics for parameter estimation, conditional maximum likelihood (CML) estimation is feasible. However, no distribution of the person parameters is obtained in CML, which might be interesting especially when structural models for latent traits are of interest (Jansen, 2003). Treating the ability parameter as a Gamma distributed random variable, one can derive marginal maximum likelihood estimators for the person and item parameters (Jansen, 1997b).
Joint maximum likelihood estimation (JML) also exists to estimate the parameters of RPCM. However, formally each additional person adds a parameter to the model, which is statistically undesirable. In contrast, marginal maximum likelihood estimation (MML) imposes some distributional assumptions on the person parameters and as a consequence, the model’s likelihood is a function of the item parameters only (and maybe the parameters of the marginal distribution; Doebler & Holling, 2015). For MML estimation, the two-parameter Gamma distribution is typically specified for the person parameter with shape parameter c and scale parameter m. This is referred to as Gamma Poisson Counts Model (Jansen & van Duijn, 1992). Alternatively, a lognormal distribution is computationally feasible (e.g., Doebler & Holling, 2015) but analytically less tractable (Jansen, 1994). Since person parameters are non-negative, the Gamma distribution which is conjugate to the Poisson is very convenient from an algebraic perspective. Jansen and van Duijn (1992) demonstrated that imposing a Gamma distribution on the person parameters does not affect the point estimates of the item parameters, that is, JML, MML with Gamma marginal distribution, and CML result in the same item parameter estimates. The drawback of imposing a parametric distribution, however, is that the ability distribution might be misspecified (Jansen, 1994) which, nevertheless, does not affect point estimates of item parameters. Therefore, if only the difficulty of the tests is of interest, the distribution of the abilities can essentially be ignored. However, in other applications, the person parameters and their distribution are important (Jansen, 1994, 1997b).
The fit of the model is assessed by checking the ability of the model to correctly predict total scores, as they are sufficient statistic for estimating model parameters. As the RPCM is a log-linear model (cf. equation (3)), it can be considered as a generalized linear model (GLM; McCullagh & Nelder, 1989) or (from the MML perspective) as a generalized linear mixed model (GLMM; Brown & Prescott, 2015; Demidenko, 2013). Therefore, methods for checking GL(M)Ms and the diagnosis of under or overdispersion is available for RPCM too (Doebler & Holling, 2015). Assuming a Gamma distribution for the person parameters implies that the distribution of the row total scores is negative binomial (Jansen & van Duijn, 1992). This property can be used to evaluate the correctness of the Gamma assumptions (Jansen & van Duijn, 1992). Model checking within MML estimation entails comparing the observed distribution of total scores with the one predicted by the model.
Empirical example
In this section, RPCM is estimated using the lme4 package (Bates et al., 2017) in R (R Development Core Team, 2016). We employ the glmer function from the lme4 package. The additive form of the model from equation (5) has to be used, as glmer does not support the equivalent multiplicative form directly. Also, glmer can only handle a lognormal distribution of the person parameter, that is,
The data of 228 examinees to a selective attention test are used for demonstration. The test is constructed by Beyzaee (2017) after the model of Ruff 2 and 7 Selective Attention Test (Ruff, Evans, & Light, 1986). In this test, respondents have to cross out the digits 2 and 7 in three rows of randomly arranged digits and letters. The test contains 20 blocks each containing three lines. The time limit for each block is 15 seconds. Here, the task is very simple, and it is not reasonable to scale the test with the dichotomous Rasch model. An example block is given below.
Estimation with ‘lme4’
A prerequisite for RPCM analysis with lme4 is that the data should be in long format. Perhaps the simplest way to do this is with SPSS though also R includes this functionality in the reshape function. The data structure for the current analysis after converting it to long format is shown in Figure 1. “ID” is student identification number, “Grade” indicates each student’s grade at school, “Item” refers to the block of letters and numbers students had to check, “Hit” is the total number of correct checks on each item for each examinee, “Miss” is the number of items that test takers have failed to check, and “TL” is the hypothetical time limit (in seconds) set for each item. As explained before in RPCM, the raw counts of correct answers or the counts of errors within a set of items are modeled. Here, each block is considered an item, and the total number of correct checks of 2s and 7s are recorded under “Hit” and modeled as the unit of analysis.

The data structure for the attention test.
Preliminary analyses
To perform the analyses, first load the package:
import the data set (here from a text file, which is not the only option):
A summary of the variables in the data set can be obtained with the following function:
To make sure that R treats items as categorical variables run the following code:
The boxplot of the raw scores can be obtained with:
Test takers’ raw scores, that is, the sum of the hits, can be obtained with the following function:
A histogram of the raw score is produced by:
The basic function call to fit the RPCM is:
Note two peculiarities in the models syntax: (1) the -1 omits a regular intercept from the model. In combination with + Item, this yields item-wise intercepts, which are the easiness parameters (or in regression model terms: cell means coding). (2) The syntax + (1|ID) adds a random intercept on the person level, with mean 0 and unknown variance. The following function returns the item parameters, their standard errors, and information criteria:
Output 1: The output of RPCM analysis
In Output 1, the Akaike Information Criterion (AIC, Akaike, 1974) and Bayesian Information Criterion (BIC, Schwarz, 1978) and log item easiness parameters are given. The easiest item is Item 10 with easiness parameter 3.04, and the hardest item is Item 7 with an easiness estimate of 2.853. The standard deviation of the log person ability parameters is 0.15, and their mean is constrained to zero for model identification.
Confidence intervals for the item parameters can be obtained using the following code:
The confidence intervals are helpful to assess the amount of uncertainty in the estimation of item easiness parameters. Note that the first confidence interval is not for an easiness parameter but for the standard deviation of the person parameters. Since the lower bound is clearly separated from zero, there is variance in the person parameters which amounts to individual differences that the test is able to discern.
In attention or processing speed tests, the contents of the items do not differ much. Usually, the same items with little variation in content are repeated across the test. Since the structure and content of the blocks are the same, we can expect equal difficulty for the items. For comparison purposes, we can run a model that assumes equal difficulty for all the items but includes a person parameter (random intercept model):
Output 2: Output of RPCM analysis with equal item parameters
Outputs 1 and 2 show that the AIC and BIC for the model named “fit1” are smaller than those for “fit0.” That is, the model with equal item parameters does not fit as good as the model where different difficulty parameters are assumed for the items. Hence, the item difficulties vary across the items. To compare the fit of the two models with a hypothesis test, run the following code:
which gives Output 3.
Output 3: Comparison of the fit of the two models
Output 3 depicts the information criteria, the log-likelihood, and the deviance statistic (−2 × loglikelihood) for each model. A likelihood ratio test (LRT) with a chi-square test statistic is computed to compare the fit of the two models. The value of the chi square (174.6) statistic is significant at p < 0.001, df = 19. In other words, the model with different item parameters (fit1) significantly fits better than the model with equal item parameters (fit0).
To obtain predicted score for each item and person run the following code:
For graphical check of the model run the following code:
In Figure 2(a), we see the predicted values for each person on the x-axis and the Pearson-residuals on the y-axis. For good model fit, Pearson residuals have a mean of 0 and S.D. of 1.0. The more the points diverge from these values, the worse the fit. They are assumed to be roughly symmetrical (which is the case here), and they should (roughly) follow a normal distribution, at least when counts are not small. Here, they are too small for large values of the predicted values, that is, there is less variance than the model predicts. This is not necessarily detrimental and just means that the Poisson distribution predicts too much variance here, hence our data are underdispersed, potentially leading to more conservative inference (Zeviani, Ribeiro, Bonat, Shimakura, & Muniz, 2014). Additionally, we can plot the predicted values against the observed values and obtain a regression line (Figure 2(b)) with the following code:

Graphical overall model check. (a) Predicted scores against their Pearson residuals. (b) Predicted scores against observed scores.
Fit of the individual items can be evaluated by graphical inspection of the residuals. The following code can be run:
which returns Figure 3. The residuals are the difference between the observed scores and the scores predicted by the Poisson model. The boxes and the whiskers indicate the range of the residuals. The residuals should roughly span from −2 to 2, that is, approximately the 2.3% and 97.7% quantiles of a standard normal distribution that approximates the distribution of the residuals. Some outliers are expected. As shown in Figure 3, only Items 1 and 2 have a substantial amount of residuals which exceed ±2.

Graphical item fit check.
The item parameters given in Output 1 are on the log-scale (additive parameterization from equation (5)). By exponentiation, we can obtain the item parameters on the multiplicative scale (equation (3)):
The next function returns the person parameters:
Output 4 depicts the person parameters for the first 10 respondents out of the total of 228 respondents.
Output 4: Person parameters
Likewise, we can obtain the person parameters on the multiplicative scale using the following code:
To obtain the summary of person parameters, use the following functions, depending on which person parameters you obtained:
RPCM with separate time limits for items
Then run the following lines of code:
The time limits in the multiplicative form define the units of the easiness parameter. As the model is fit on the log-scale, the time limits need to be log transformed and added as an offset in the additive form of the model, that is, a known constant added to the regression equation. This amounts to the factor
Output 5: The output of RPCM with different time limits for the items
As you can see in Output 5, the item parameters change, but the information criteria are identical to those in “fit1,” because including the offset merely shifts parameters without a change in the log-likelihood. In other words, including the a priori known time limits does not change the model fit but merely enhances interpretation of the easiness parameters: On the multiplicative scale, they can now be interpreted as the expected number of points scored/errors made by a person of average ability (average ability being zero as person parameters are centered at zero) within the imposed time unit, that is, a second in this study. The code given to produce the statistics for “fit1” can be used to obtain the statistics for “fit2.”
Investigating item fit
To investigate item fit, we propose a simple chi-square type statistic: Let the set
If the model holds, then
For the attention data, Items 5 and 10 show misfit, that is, the chi-square values are larger than the .99 quantile of a chi-square distribution with five degrees of freedom.
We omit a reanalysis with these items removed and mention in passing that the asymptotic argument can be replaced by a parametric bootstrap in small samples.
Differential item functioning
When one or more groups are to be compared with IRT models, the comparison should be based on means of latent variables. The prerequisite for such a comparison is that the latent variables of both groups are on the same scale, otherwise one could merely be observing an artefactual difference (e.g., Holland & Thayer, 1988). As the scale is implied by the item difficulty parameters, it is vital to check that item difficulty parameters are identical, that is, the items do not function differentially or that measurement invariance (Millsap, 2011) holds.
We now discuss a differential item functioning (DIF) detection method in the context of the RPCM that builds on well-known approaches for existing IRT models. Similar to binary or ordinal IRT models, detecting DIF items is complicated by the fact, that latent means from the two groups cannot be assumed to be equal. We propose a modeling approach that first uses an LRT to investigate globally whether DIF is present. In the absence of DIF, group means can then be compared. The method also flags items as DIF items and by subsequently eliminating flagged items, a DIF-free item set can be obtained.
In the attention test example, 3rd and 4th graders have been tested (N3rd = 174, N4th = 54) and we investigate, whether there is a difference in latent means for these two groups. In a first step, we extend the model with different time limits (fit2) by group-specific intercepts:
Note that for numerical reasons, we have used a different optimizer (bobyqa) and a higher number of function evaluations (maxfun), which avoids mild convergence problems. The glmer function provides several alternative model fitting procedures (optimizers) and sometimes default values need to be adjusted to ensure model fitting proceeds smoothly.
Output 6: Differential Item Functioning: Baseline Model
The parameter of the Grade variable is the difference in latent means on the log-scale. Here, 4th graders are slightly less skilled. The item parameters are assumed to be equal in this model for both groups. As we have introduced the difference in latent means, the item parameters deviate slightly from this in fit1. We now add group by item interaction terms that represent the group-specific deviations in item difficulty:
Output 7: Differential Item Functioning: Interaction Model
The interaction of Grade and the individual items (the lines starting with Grade:ItemX) is the difference in item difficulty on log-scale of the 4th graders relative to the 3rd graders. For example, item 2 is slightly more difficult for the 4th graders (−0.094), but the difference is not significant. The difference of −0.170 for Item 3 is significant. Note, however, that with 20 items, a multiple testing problem exists and one should not interpret significant interaction terms before conducting a global DIF test:
By comparing the two models with an LRT, we test whether the interaction terms explain variability in the data, which amounts to a global DIF test. Here, we find:
Output 8: LRT for global DIF test
A p value of 0.10 results, indicating that the model with the interaction term does not explain more variability in the data, that is, the LRT provides no evidence for global DIF. It is hence not necessary to remove items to limit the amount of item level DIF. We can now proceed and test for group differences by comparing fit2 and fit3:
Output 9: Testing for Global DIF
We find no differences in latent means (p = 0.49), that is, 3rd and 4th graders have comparable latent attention ability. Alternatively, the z-test for the coefficient of Grade in fit3 can be used (which is also not significant, z = −0.69, p = 0.49).
In some scenarios, one wants to study DIF on the item level. By studying the interaction terms in fit4 item-wise, we see which items have been flagged by the procedure. Here, Items 3 and 10 are flagged, and we could proceed to purify the item set by removing the flagged item with the smaller p value first (item 10 here) and refitting the model of fit4 to the reduced item set to investigate the remaining item set. We caution, however, that this sequential procedure might be suboptimal in terms of the nominal alpha level.
Dispersion
In a Poisson model (and by implication in the RPCM), the assumption is that the variance of the dependent variable Y given covariates X is equal to its expectation, that is,
In reality, this equidispersion assumption is frequently violated. The φ coefficient is essentially the ratio of model implied variance to predicted mean (McCullagh & Nelder, 1989). It is expected to be 1. Three scenarios may occur:
φ is roughly equal to 1; the assumption of equal mean and variance is met. φ < 1 indicates underdispersion. This means that there is less variance in the data than the Poisson model predicts. This is a very commonly observed scenario in test data. This entails that confidence intervals calculated from a model fit are too wide. Also, in our context, the reliability is underestimated, that is, the test seems less reliable than it really is. Clearly, this is undesirable and needs to be addressed. Unfortunately, the problem of underdispersion has been widely ignored in the literature and dispersion modeling normally focuses on overdispersion. φ > 1 indicates overdispersion. This means that there is more variance in the data than the Poisson model predicts. Confidence intervals for model parameters are too narrow and reliability is overestimated. Again, this in undesirable.
There are some strategies to address under- and overdispersion: (1) deleting misfitting items or exclude persons (especially those without any mistakes might be the cause of an overly low variance), and (2) employing a kind of ad hoc statistical correction via a so-called quasi-Poisson-Regression. This has been advised for the case of overdispersion in the literature, but it is not a standard procedure in the random effects case, and (3) employing extensions of the Poisson-distribution, say the Conway-Maxwell-Poisson distribution (Boatwright, Borle, & Kadane, 2003; Shmueli, Minka, Kadane, Borle, & Boatwright, 2005) or the Gamma-count distribution (Zeviani et al., 2014). However, currently no standard procedures or software for these approaches are available.
Generally, overdispersion is considered to be worse than underdispersion, because if ignored statistical inference is anti-conservative: When overdispersion occurs, SEs from a Poisson model are spuriously too small, and when undersidpersion occurs, they are artificially too large. To estimate φ, the following code should be run:
[1] 0.531849
Which returns a φ equal to .53. The value of φ is way smaller than 1 which indicates underdispersion. The model is prone to underestimate reliability and inference on model parameters is conservative.
Reliability
As in other IRT models, reliability varies as a function of ability. Figure 4 shows reliability for different ability estimates. The horizontal axis depicts the ability continuum, and the vertical axis depicts the reliability estimate. The plot implies that the precision of the test changes as a function of the ability. The reliability estimates for different locations on the log ability scale can be read from plot. For example, the reliability for examinees with ability −1 on the log scale is .90 and for examinees with ability −.50 is. 94.

Reliability graph.
Note that we use a reliability estimate on log scale here, in contrast to Verhelst and Kamphuis (2009) and Doebler and Holling (2015). Specifically, from an estimate
Figure 4 was produced with the following lines of code:
Conclusion
In this article, the RPCM was reviewed and R functions were given to fit the model. The “lme4” package (Bates et al., 2017) in R was employed to estimate the model parameters. The data of 228 respondents to an attention test developed after the model of “Ruff 2 and 7” test (Ruff et al., 1986) were analyzed, and the outputs were interpreted. Item residuals and a chi-square type statistic showed that only two items misfit the model. Overall graphical and statistical model checks indicated that the attention data fit the RPCM, but the data were underdispersed, making inference conservative and biasing reliability estimates downward.
Instead of modeling the counts of the correctly checked items (Hit), we could have modeled the counts of missed items (Miss). The analysis of the counts of missed items, which is not reported here due to space limitations, had a better fit to the data. In the original application of the model by Georg Rasch, the oral misreadings or errors were also modeled instead of the correct words.
The MML approach as implemented with the glmer function here has several advantages and some practical and statistical limitations: On the plus side, the GLMM framework is very flexible. One can add covariates as fixed effects and incorporate additional random effects, for example, to model multilevel structures in the data that appear naturally in many (educational) applications, such as persons nested in classrooms which are maybe even nested in schools (e.g., Aiken, Mistler, Coxe, & West, 2015). Adding fixed effects is interesting to explain variation in ability, say to predict ability by gender, age, or experimental conditions. While we cannot cover all these techniques in this brief example, we refer the reader to the authoritative monograph on explanatory IRT models by de Boeck and Wilson (2004) and the tutorial by de Boeck et al. (2011) for IRT-models for dichotomous data. However, by assuming log-normal person parameters, the MML approach is a bit more restrictive than CML. From our experience, the differences in item parameter estimates are minor, that is, they are essentially only rescaled. Also, glmer can be computationally intensive for large data sets and complex models.
It is worth noting in this context that count data items are typically more informative of person ability than binary items. This has two consequences: Person parameter estimates are reliable, even when only a few items are used. Depending on an item’s mean and the variance of the person parameters, as few as three items will give reliable person parameter estimates. The second implication is that the rule of thumb sample size requirements are more modest compared to the dichotomous case, which includes the explanatory IRT models mentioned above.
