Abstract
Mokken scale analysis uses three types of scalability coefficients to assess the quality of (a) pairs of items, (b) individual items, and (c) an entire scale. Both the point estimates and the standard errors of the scalability coefficients assume that the sample ordering of the item steps is identical to the population ordering, but due to sampling error, the sample ordering may be incorrect and, consequently, the estimates and the standard errors may be biased. Two simulation studies were used to investigate the bias of the estimates and the standard errors of the scalability coefficients, as well as the coverage of the 95% confidence intervals. Distance between item steps was the most important design factor. In addition, sample size, number of items, number of answer categories, and item discrimination were included in the design. Bias of the standard errors was negligible. Bias of the estimates was largest when all item steps were identical in the population, especially for small sample sizes. Furthermore, bias of the estimates decreased as number of answer categories increased and as item discrimination decreased. Coverage of the 95% confidence intervals was close to .950, but for small sample size coverage deteriorated. Coverage also became poorer as number of items increased, in particular for dichotomous items.
Keywords
Mokken scale analysis (Mokken, 1971; Sijtsma & Molenaar, 2002) is used to construct tests and questionnaires. Among other model assessment methods, Mokken scale analysis uses an automated item selection procedure to partition a set of items into one or more scales, such that the items in a particular scale measure a common trait using a reasonable level of discrimination power to be controlled by the researcher (Sijtsma & Molenaar, 2002, p. 68). The item selection is based on a nonparametric item response theory (NIRT) model known as the monotone homogeneity model (Sijtsma & Molenaar, 2002, Chapter 2). Mokken scale analysis is used to construct tests in various research areas such as psychology, for assessing psychological distress and well-being (Watson, Wang, Thompson, & Meijer, 2014), depression and anxiety (Bech, Bille, Moller, Hellström, & Ostergaard, 2014), disability in activities of daily living (Kingston et al., 2012), learning disability (Murray & McKenzie, 2013), and sexual sadism (Nitschke, Osterheider, & Mokros, 2009).
Mokken scale analysis uses three types of scalability coefficients for assessing the quality of (a) item pairs, (b) individual items, and (c) a set of items. The item selection procedure uses the scalability coefficients as criteria for item set partitioning and as diagnostics for the strength of the scales. To compute a scalability coefficient, the sample ordering of the item steps (Molenaar, 1991) is needed. Because of sampling fluctuation, the sample ordering may be different from the population ordering, thus biasing the estimates of the scalability coefficients. The distortion may be more serious when distance between adjacent population item steps is small and sample size is small. Hence, scalability coefficients may either underestimate or overestimate their parameter values. For dichotomous items, based on statistical reasoning involving all the 2 × 2 tables, Sijtsma and Molenaar (2002, p. 56) suggested that bias is almost negligible for N > 200 when incidental pairs of item steps are close together, say, less than .02 units, and for N > 400 when many item steps are close together. From their discussion, it is clear that additional research may be needed to support accurate recommendations. Kuijpers, Van der Ark, and Croon (2013) analytically derived standard errors for each of the three scalability coefficients. The standard errors are based on the sample item step ordering, and a sample ordering different from the population ordering may produce positively or negatively biased standard error estimates. Consequently, confidence intervals may have an incorrect coverage.
Simulation studies were used to assess the magnitude of the bias in the scalability coefficient estimates, the standard error estimates, and the coverage of the confidence intervals. Because it is expected that a smaller distance between adjacent population item steps produces more reversals in the sample item step ordering, the authors investigated the effect of differences between sample and population item step orderings on the estimates of the scalability coefficients and their standard errors. Bias of the estimates and the standard errors, and the coverage of the confidence intervals were assessed under several conditions. The most important design factor was distance between population item steps; a smaller distance was expected to increase the probability that the sample and population item step orderings differ from each other. Other design factors were sample size, number of items, number of answer categories, and item discrimination.
This article is organized as follows. First, the authors discuss Mokken scale analysis and the scalability coefficients. Second, they explain the computation of the standard errors by means of marginal modeling. Third, they discuss Simulation Study 1 and fourth, a follow-up Simulation Study 2 that investigates surprising results from Study 1. Finally, they provide recommendations on how to use the standard errors.
Mokken Scale Analysis
The Monotone Homogeneity Model
Mokken scale analysis is based on the monotone homogeneity model (Mokken, 1971, Chapter 4; Sijtsma & Molenaar, 2002, pp. 22-23), which is an NIRT model for measuring respondents on an ordinal scale. Let θ denote the latent variable that underlies performance on the J items in the test. For dichotomous items, the monotone homogeneity model implies the stochastic ordering of θ by means of total score X+, which is the sum of the J item scores, denoted Xj (j = 1, . . . , J).
Fit of the monotone homogeneity model to the data is investigated by checking whether several of the model’s manifest properties are satisfied in the data. For example, the model implies that all interitem covariances are nonnegative in the population; hence, for a set of items to constitute a scale, the interitem covariances must be nonnegative. If not, the monotone homogeneity model is not the model that generated the data and must be rejected as an explanatory model. Nonnegativity of interitem covariances is investigated by evaluating whether the sample values of the J(J−1)/2 item pair scalability coefficients
Scalability Coefficients
Item steps and weighted Guttman errors
The scalability coefficients are based on the common item step ordering in each pair of items and the weighted sum of Guttman errors that is based on the item step ordering (Molenaar, 1991; also see Kuijpers et al., 2013). A single item j having z + 1 ordered answer categories has z ordered item steps: Xj≥1, Xj≥2, . . . , Xj≥z.
Two items i and j together have 2z item steps; the ordering of these 2z item steps is needed for estimating item pair coefficient Hij. To order the 2z item steps, one uses the z probabilities that a randomly chosen respondent passes an item step of item i, P(Xi≥x), and similarly the z probabilities for item j, P(Xj≥x). For item score Xj≥0, by definition we have P(Xj≥0) = 1, and this option is ignored. If in a particular item a less popular step is passed, by definition the more popular step is also passed.
Respondents may pass and fail item steps in an order that is inconsistent with the common item step ordering for the two items, so that some individuals pass a less popular item step while failing a more popular item step. Such an occurrence is referred to as a Guttman error (Guttman, 1950; Molenaar, 1991). Table 1 shows an example of the joint probabilities of having a score x on item a and a score y on item b, that is, P(Xa = x, Xb = y) with x, y = 0, 1, 2, 3. The marginal probabilities are defined by P(Xa = x) and P(Xb = y), and the cumulative probabilities by P(Xa≥x) and P(Xb≥y). For this example, the cumulative probabilities order the item steps by descending popularity as
Xb≥1, Xa≥1, Xb≥2, Xb≥3, Xa≥2, Xa≥3. (1)
Table 1. Cross-Tabulation of Probability of Obtaining Particular Item-Score Patterns.
Note. Probabilities of item-score patterns that are in agreement with the Guttman model are printed in boldface. Guttman weights are shown within parentheses.
Let index h enumerate the number of most popular item steps passed. Item-score patterns (0,0), (0,1), (1,1), (1,2), (1,3), (2,3), and (3,3) (in Table 1, corresponding probabilities are printed in boldface) are consistent with the Guttman (1950) model, because the h most popular item steps in Equation 1 were passed and the remaining 2z−h less popular steps were failed. The remaining item-score patterns are inconsistent with the Guttman model, and to arrive at any of these patterns, one or more Guttman errors are made. For example, someone who obtained item-score pattern (0,3) failed the more popular item step Xa≥1 but passed the less popular item steps Xb≥2 and Xb≥3.
Molenaar (1991) proposed weighting the sample frequencies of the Guttman errors (in Table 1, weights are shown within parentheses) depending on the degree to which the item step ordering was violated according to the Guttman model. The weight for a particular item-score pattern (Xi = x, Xj = y), denoted
Consider indicator vector
For each pair of 0s and 1s, Equation 2 counts how often a score 0 precedes a score 1 in vector
Different random samples produce item step orderings different from the population ordering, resulting in sample weights different from population weights. For example, for two different random samples containing 200 observations each, drawn from the population values in Table 1, Table 2 shows the joint frequencies for the two samples. In the first sample (Table 2, upper panel), the estimated item step ordering is identical to the population item step ordering (Table 1). As the estimated ordering is identical to the population ordering, the sample weights equal the population weights. In the second sample (Table 2, lower panel), the estimated item step ordering and the corresponding weights are different from the population values. Using weights different from population weights may result in biased estimates and standard errors. Molenaar (1991) showed that when two item steps have equal popularities, scalability coefficients have the same value irrespective of the sample ordering of the two item step popularities. This implies that whenever the item step ordering contains ties, the scalability coefficient has the same value irrespective of the item step popularity that occurs first in the ordering.
Table 2. Frequency Tables for Two Samples (N = 200) Drawn From the Distribution in Table 1.
Note. In Sample 1 (upper panel), the estimated item step ordering is identical to the population item step ordering. In Sample 2 (lower panel), the estimated item step ordering is different from the population ordering, resulting in different Guttman weights. Probabilities of item-score patterns that are in agreement with the Guttman model are printed in boldface. Guttman weights are shown within parentheses.
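The weighting scheme described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, and the cumulative probabilities are invented values (the entries of Table 1 are not used here), chosen only so that they yield the item step ordering implied by the Guttman-consistent patterns listed earlier. The weight of an item-score pattern is computed as the number of times a failed item step precedes a passed item step in the popularity ordering.

```python
def item_step_order(cum_a, cum_b):
    """Order the item steps of items a and b by descending popularity.

    cum_a[k] is P(Xa >= k + 1) and cum_b[k] is P(Xb >= k + 1).
    Ties are broken arbitrarily (item a first), which is an assumption.
    """
    steps = [("a", k + 1, p) for k, p in enumerate(cum_a)]
    steps += [("b", k + 1, p) for k, p in enumerate(cum_b)]
    steps.sort(key=lambda s: -s[2])  # stable sort, most popular step first
    return [(item, step) for item, step, _ in steps]


def guttman_weight(x, y, order):
    """Count how often a failed step precedes a passed step for pattern (x, y)."""
    score = {"a": x, "b": y}
    weight = 0
    failed_so_far = 0
    for item, step in order:
        if score[item] >= step:      # step passed: one error per earlier failure
            weight += failed_so_far
        else:                        # step failed
            failed_so_far += 1
    return weight


# Hypothetical cumulative probabilities (not the values of Table 1).
order = item_step_order([0.80, 0.30, 0.10], [0.90, 0.60, 0.50])
weights = [[guttman_weight(x, y, order) for y in range(4)] for x in range(4)]

# A hypothetical sample in which the popularities of Xb >= 3 and Xa >= 2 are reversed.
order_reversed = item_step_order([0.80, 0.52, 0.10], [0.90, 0.60, 0.50])
weights_reversed = [[guttman_weight(x, y, order_reversed) for y in range(4)]
                    for x in range(4)]
```

With the first ordering, pattern (0,3) receives weight 2, because the failed step Xa≥1 precedes the two passed steps Xb≥2 and Xb≥3. Under the reversed ordering, pattern (1,3) is no longer error free and pattern (2,2) becomes error free, which illustrates how a single reversal changes the weights.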
Scalability coefficients and their standard errors
For item pair (i, j), scalability coefficient
Here,
Item scalability coefficient
The monotone homogeneity model implies that 0 ≤
Total scale scalability coefficient H expresses the degree to which respondents can be ordered by means of a set of items (Sijtsma & Molenaar, 2002, p. 39), and is a weighted average of the J
The monotone homogeneity model implies that 0 ≤ H ≤ 1. Mokken (1971, p. 185) proposed that for a sufficiently reliable person ordering, .3 ≤ H ≤ 1. Hence, an item set for which H < .3 does not define a scale. Furthermore, a scale is defined to be weak if .3 ≤ H < .4, moderate if .4 ≤ H < .5, and strong if H ≥ .5. In the absence of Guttman errors, H = 1.
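As an illustration of how an item pair coefficient relates to weighted Guttman errors, the following Python sketch assumes the usual weighted-error formulation (Molenaar, 1991), in which the coefficient equals one minus the ratio of the weighted sum of observed joint probabilities to the weighted sum expected under marginal independence. The function names `pair_h` and `classify_scale` are hypothetical, the example tables are invented, and the labels returned by `classify_scale` simply restate the rules of thumb given above.

```python
def pair_h(joint):
    """Item pair coefficient from joint[x][y] = P(Xi = x, Xj = y)."""
    n_i, n_j = len(joint), len(joint[0])
    marg_i = [sum(row) for row in joint]
    marg_j = [sum(joint[x][y] for x in range(n_i)) for y in range(n_j)]

    # Item steps (item, step, popularity), most popular first.
    steps = [("i", s, sum(marg_i[s:])) for s in range(1, n_i)]
    steps += [("j", s, sum(marg_j[s:])) for s in range(1, n_j)]
    steps.sort(key=lambda t: -t[2])

    def weight(x, y):
        # Number of times a failed step precedes a passed step.
        score = {"i": x, "j": y}
        w = failed = 0
        for item, step, _ in steps:
            if score[item] >= step:
                w += failed
            else:
                failed += 1
        return w

    cells = [(x, y) for x in range(n_i) for y in range(n_j)]
    observed = sum(weight(x, y) * joint[x][y] for x, y in cells)
    expected = sum(weight(x, y) * marg_i[x] * marg_j[y] for x, y in cells)
    return 1.0 - observed / expected


def classify_scale(h):
    """Rules of thumb for H attributed to Mokken (1971) in the text."""
    if h < 0.3:
        return "no scale"
    if h < 0.4:
        return "weak"
    if h < 0.5:
        return "moderate"
    return "strong"


# Invented 2 x 2 tables for two dichotomous items.
h_example = pair_h([[0.3, 0.1], [0.2, 0.4]])   # one error cell with mass .1
h_guttman = pair_h([[0.4, 0.0], [0.1, 0.5]])   # no Guttman errors
```

In the first table the error cell holds probability .1, whereas .2 is expected under independence, so the coefficient is about .5. In the second table the error cell is empty and the coefficient equals 1.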
Biased
Kuijpers et al. (2013) used a two-step method based on categorical marginal models to derive asymptotic standard errors for each of the three scalability coefficients. First, data were collected in a frequency vector
Simulation Study 1
Scalability coefficients were computed using the sample item step ordering. However, due to sampling fluctuation, the ordering may be different from the population ordering, thus affecting the estimates and the standard errors. For small sample size and small distance between item steps, more reversals of item step pairs are expected to occur. A simulation study was conducted to investigate the effects of different factors on the bias of
Method
Simulation model
We simulated data using the graded response model (Samejima, 1969, 1972). This model is a parametric version and hence a special case of the monotone homogeneity model (Hemker, Sijtsma, Molenaar, & Junker, 1996). The graded response model defines the probabilities of scoring at least x, x = 0, 1, . . . , z, on item j by means of a logistic function with a discrimination parameter
By definition,
The values of discrimination parameter
Table 3. Location Parameters
Note. For each distance between consecutive item steps and for items, the table shows
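Data generation under the graded response model can be sketched as follows. This is an illustration under assumptions rather than the simulation code used in the study: it uses the standard logistic form of Samejima's model, the names `alpha` (discrimination) and `deltas` (location parameters) are chosen here for readability, the latent variable is assumed to be standard normal, and the location values in the usage line are invented.

```python
import math
import random


def grm_cumulative(theta, alpha, deltas):
    """Return [P(X >= 0), P(X >= 1), ..., P(X >= z)] for one item.

    deltas must be in ascending order so the probabilities decrease in x.
    """
    probs = [1.0]  # P(X >= 0) = 1 by definition
    for delta in deltas:
        probs.append(1.0 / (1.0 + math.exp(-alpha * (theta - delta))))
    return probs


def simulate_item_score(theta, alpha, deltas, rng):
    """Draw one item score: the number of steps whose probability exceeds u."""
    u = rng.random()
    return sum(1 for p in grm_cumulative(theta, alpha, deltas)[1:] if u < p)


def simulate_data(n, alpha, item_deltas, seed=1):
    """Simulate n respondents on len(item_deltas) items."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)  # assumption: standard normal latent variable
        data.append([simulate_item_score(theta, alpha, d, rng) for d in item_deltas])
    return data


# Two three-category items with invented location parameters.
data = simulate_data(200, 1.5, [[-0.5, 0.5], [-0.25, 0.75]])
```

Because a separate uniform draw is used for each item, the item scores are independent given the latent variable, as the model requires.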
Design
The design factors were varied as follows.
Discrimination parameter
Discrimination parameters equaled 1, 1.5, or 2. Keeping all other factors constant, item discrimination has a positive effect on the scalability coefficients (e.g., Sijtsma, 1988, Chapter 3). The effect of item discrimination on the bias in point estimates and standard errors was unknown.
Number of items (J)
Number of items equaled 2 or 3; J was small so as to keep the simulation study manageable. Small J does not limit the results, because H is a weighted mean of the pairwise J(J−1)/2
Number of answer categories (z + 1)
Items were dichotomous (z + 1 = 2) or polytomous (z + 1 = 3). Polytomous items are expected to produce more errors in the sample ordering and thus to produce more bias in the estimates of the scalability coefficients and their standard errors, and a poorer coverage of the 95% confidence interval.
Sample size (N)
Sample size was small (N = 50), medium (N = 200), large (N = 500), or very large (N = 1,500). As N decreases, each additional observation in an error cell has more influence on the sample item step ordering, which is expected to increase the bias in the point estimates and the standard errors and to deteriorate the coverage of the 95% confidence interval.
Distance between item steps
The greater the distance between two adjacent item steps, the more likely the sample item step ordering is correct. Distance between item steps had four levels, labeled None, Small, Moderate, and Large. Distance was varied by manipulating the location parameters
equaled the desired values in Table 4 for cumulative distribution
Table 4. Theoretical Cumulative Item Step Probabilities
Note. For each distance between consecutive item steps and for items, the table shows
Outcome variables
The outcome variables were bias of the estimates of scalability coefficient H, bias of the standard errors of
Bias of the estimates (bias)
Let
Bias of the standard errors (bias.se)
The authors first computed the standard deviation of the estimates of H, denoted sd(
Standard deviation sd(
Coverage of the 95% confidence interval
The authors first constructed a confidence interval for each qth replication, using
Table 5 shows parameter H, which was varied across design cells. Sample size does not affect parameter H. The simulation study was programmed in R (R Core Team, 2014), using the R-package mokken (Van der Ark, 2007, 2012) to compute
Table 5. Simulation Study 1: Population Values for Coefficient H, for All
Note.
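The three outcome variables can be computed from the replicated estimates along the following lines. The sketch assumes that bias is the mean estimate minus the population value, that bias.se is the mean estimated standard error minus the standard deviation of the estimates across replications, and that coverage is the proportion of Wald-based intervals containing the population value. The function name and the use of the sample standard deviation (denominator Q − 1) are choices made here for illustration, and the input values are invented.

```python
import statistics


def summarize_replications(h_true, estimates, std_errors, z_crit=1.96):
    """Summarize Q replications of an estimate of H and its standard error."""
    bias = statistics.mean(estimates) - h_true
    sd_est = statistics.stdev(estimates)           # empirical sd of the estimates
    bias_se = statistics.mean(std_errors) - sd_est
    covered = sum(
        1
        for h, se in zip(estimates, std_errors)
        if h - z_crit * se <= h_true <= h + z_crit * se  # Wald-based interval
    )
    return {
        "bias": bias,
        "bias.se": bias_se,
        "coverage": covered / len(estimates),
    }


# Three invented replications for a population value of H = .5.
summary = summarize_replications(0.5, [0.4, 0.5, 0.6], [0.1, 0.1, 0.01])
```

In this toy input the third interval is too narrow to contain .5, so two of the three intervals cover the population value.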
Results
The bias of

Bias in sample

Bias in sample
For most conditions, the bias of
Table 6. Results of Simulation Study 1 for Δ = 0 and Δ = 1.5.
Note. Coverage values outside the 95% Agresti–Coull interval [.946, .954] are printed in boldface. J = number of items; N = sample size; z + 1 = number of answer categories; bias = bias of estimates of H; bias.se = bias of standard errors of
Coverage of the 95% confidence intervals was almost equal to .950 in all conditions. To accurately interpret the values of the coverage, a 95% Agresti–Coull confidence interval was derived (Agresti & Coull, 1998). The interval was equal to [.946, .954]. In some conditions, coverage was just below the Agresti–Coull interval, but we consider these deviations negligible. Only for N = 50, irrespective of the value of discrimination parameter
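The Agresti–Coull interval around the nominal coverage can be reproduced with a few lines of Python. The function below implements the standard Agresti and Coull (1998) adjustment; the function name is hypothetical, and the number of replications in the usage line (10,000) is an assumption made for illustration only.

```python
import math


def agresti_coull_interval(successes, n, z=1.959964):
    """Agresti-Coull interval for a binomial proportion."""
    n_adj = n + z ** 2                          # adjusted number of trials
    p_adj = (successes + z ** 2 / 2.0) / n_adj  # adjusted proportion
    half_width = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return p_adj - half_width, p_adj + half_width


# Nominal coverage .950 with an assumed 10,000 replications.
lower, upper = agresti_coull_interval(9500, 10000)
```

Under that assumed replication count, the bounds round to .946 and .954, so observed coverage values outside this range differ from the nominal .950 by more than simulation error alone would suggest.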
Simulation Study 2
Study 1 showed that, compared with J = 2, bias was unaffected for J = 3, but these small test lengths seemed insufficient for ruling out bias effects for larger sets of items. Study 1 also showed that, inconsistent with the expectation, bias of
Study 1 showed that for
Table 7. Simulation Study 2: Population Values for Coefficient H.
Note. z + 1 = number of answer categories;
Table 8 shows results for
Table 8. Results of Simulation Study 2 for Δ = 0; for J = 10 (Left Panel) and z + 1 = 5 (Right Panel).
Note. Coverage values outside the 95% Agresti–Coull interval [.946, .954] are printed in boldface. J = number of items; z + 1 = number of answer categories; N = sample size; bias = bias of estimates of H; bias.se = bias of standard errors of
Figure 3 shows the coverage for J = 10 for varying number of answer categories (z + 1).
Figure 3. Coverage of 95% confidence intervals for J = 10 for varying number of answer categories.
Discussion
The estimates and the standard errors of Mokken’s scalability coefficients are based on the assumption that the sample item step ordering is identical to the population ordering. A violation of this assumption may bias the estimates and standard errors of scalability coefficients and coverage of the corresponding confidence intervals may be incorrect. The two simulation studies showed that bias of
Inconsistent with the expectations, bias of
For most conditions, coverage of the 95% confidence intervals was slightly under .950. For small N, large J, and high item discrimination
The coverage of 95% confidence intervals, even if not perfect, may be adequate for practical use, but may be improved if asymmetric confidence intervals are used. The Wald-based 95% confidence interval used in this study (i.e.,
The automated item selection procedure (Sijtsma & Molenaar, 2002, Chapter 4) in Mokken scale analysis uses only the sample scalability coefficients for selecting items into scales. However, ignoring the standard errors of the sample coefficients may be a source of selection error (Kuijpers et al., 2013). Future research may systematically investigate the influence of standard errors on the automated item selection procedure in Mokken scale analysis. A next step would be to implement the standard errors in the automated item selection procedure.
Acknowledgements
The authors thank Rudy Ligtvoet and Iris A. M. Smits for their comments on an earlier draft of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research of Renske E. Kuijpers was funded by the Netherlands Organization of Scientific Research (NWO), Grant 406-12-013.
