Abstract
Response time models (RTMs) are of increasing interest in educational and psychological testing. This article focuses on the lognormal model for response times, which is one of the most popular RTMs. Several existing statistics for testing normality and the fit of factor analysis models are repurposed for testing the fit of the lognormal model. A simulation study and two real data examples demonstrate the usefulness of the statistics. The Shapiro–Wilk test of normality and a z-test for factor analysis models were the most powerful in assessing the misfit of the lognormal model.
With the increasing popularity of computerized testing, which makes recording of response times straightforward, analysis of response times has become a rapidly expanding field of research. A common way to analyze response times is to include them in psychometric/statistical models that are referred to as response time models (RTMs). The use of RTMs has been suggested to improve precision of examinee ability estimates (e.g., Bolsinova & Tijmstra, 2018; van der Linden et al., 2010), to detect test fraud (e.g., Sinharay & Johnson, 2019; Qian et al., 2016; van der Linden & Guo, 2008), to detect speededness (e.g., Schnipke & Scrams, 1997), to improve test construction (e.g., van der Linden, 2007), and to test substantive theories about cognitive processes (e.g., van der Maas et al., 2011). Several RTMs have been suggested by, for example, Bolsinova and Tijmstra (2018); Klein Entink, van der Linden, and Fox (2009); E. Maris (1993); G. Maris and van der Maas (2012); Rasch (1960); Schnipke and Scrams (1997); Thissen (1983); van der Linden (2006); van der Linden (2007); van der Maas et al. (2011); and T. Wang and Hanson (2005). Extensive reviews of RTMs include De Boeck and Jeon (2019); Kyllonen and Zu (2016); Lee and Chen (2011); Schnipke and Scrams (2002); van der Linden (2009); and van Rijn and Ali (2017).
The lognormal model for response times (LNMRT) is arguably one of the most popular RTMs. The model was first suggested by Thissen (1983); was further developed by van der Linden (2006); and has been considered, either to analyze only the response times or to jointly analyze the response times and response accuracies, by several researchers including Bolsinova and Tijmstra (2018), Boughton et al. (2017), Glas and van der Linden (2010), Qian et al. (2016), Sinharay (2018), Sinharay and Johnson (2019), van der Linden (2007), van der Linden (2009), van der Linden and Glas (2010), and van der Linden and Guo (2008).
Research on model fit statistics for the LNMRT is scarce; Ranger and Kuhn (2014), Glas and van der Linden (2010), and van der Linden and Glas (2010) are among the few exceptions. To help fill that void, this article adapts several tools that have been used to assess the fit of other statistical models to the testing of item fit and of the local independence assumption for the LNMRT.
The next section includes a review of the LNMRT, existing approaches for estimation of the parameters of the model, and existing approaches for the assessment of fit of the model. The Method section includes discussions of the model fit statistics that we propose for assessing the fit of the LNMRT. The Simulation Study section includes an evaluation of the Type I error rate and the power of the statistics. The Real Data Examples section includes applications of the statistics to two operational data sets. Discussions and conclusions are provided in the last section.
Reviews of the Lognormal Model, Fit Statistics, and Normality Tests
The Lognormal RTM
The model
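Let us consider a test that includes J items. Let $t_{ij}$ denote the response time of examinee $i$ on item $j$, $j = 1, 2, \ldots, J$.
According to the LNMRT,
$$\log t_{ij} \sim N\left(\beta_j - \tau_i,\; \alpha_j^{-2}\right), \tag{1}$$
where $\tau_i$ is the speed parameter of examinee $i$, $\beta_j$ is the time-intensity parameter of item $j$, and $\alpha_j$ is the time-discrimination parameter of item $j$.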
In applications of the LNMRT, researchers have analyzed only response times using the stand-alone LNMRT (e.g., Finger & Chee, 2009; Sinharay, 2018; van der Linden, 2006) or jointly analyzed both response times and response accuracies using the LNMRT and an IRT model (e.g., Glas & van der Linden, 2010; van der Linden, 2007; van der Linden & Glas, 2010).
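As a rough illustration of the model, the sketch below simulates response times under the LNMRT. It assumes the standard parameterization of van der Linden (2006), in which the log response time of an examinee on an item is normal with mean equal to the item's time intensity minus the examinee's speed and standard deviation equal to the reciprocal of the item's time discrimination. The function name and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def simulate_lnmrt(taus, betas, alphas, rng):
    """Simulate response times (in seconds) under the lognormal model.

    Assumed parameterization: log t_ij ~ Normal(beta_j - tau_i, 1 / alpha_j**2).
    Returns one row of J response times per examinee.
    """
    times = []
    for tau in taus:
        row = []
        for beta, alpha in zip(betas, alphas):
            # random.gauss takes a mean and a standard deviation.
            log_t = rng.gauss(beta - tau, 1.0 / alpha)
            row.append(math.exp(log_t))
        times.append(row)
    return times


rng = random.Random(1)
betas = [3.5, 4.0, 4.5]    # hypothetical time intensities (log seconds)
alphas = [2.0, 2.0, 1.5]   # hypothetical time discriminations
taus = [rng.gauss(0.0, 0.3) for _ in range(2000)]  # hypothetical speeds
times = simulate_lnmrt(taus, betas, alphas, rng)

# Because the simulated speeds are centered at zero, the item means of the
# log response times should be close to the time intensities.
mean_logs = [
    sum(math.log(row[j]) for row in times) / len(times)
    for j in range(len(betas))
]
print([round(m, 2) for m in mean_logs])
```

Larger values of the speed parameter shorten the response times on every item, whereas a larger time discrimination makes the log response times on an item less variable around their mean.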
Estimation of the item parameters of the model
A Markov chain Monte Carlo algorithm was suggested by van der Linden (2006) to estimate the parameters of the LNMRT. Glas and van der Linden (2010) suggested an approach to compute the maximum likelihood estimates (MLEs) of the item parameters when the LNMRT is used along with the three-parameter logistic model to jointly analyze both response times and response accuracy. Finger and Chee (2009) showed how one can use factor analysis to obtain the marginal maximum likelihood estimates (MMLE) of the item parameters of the stand-alone LNMRT, and researchers such as Molenaar et al. (2015) showed how one can use factor analysis to obtain the MMLEs of the item parameters of the joint model involving the LNMRT and an IRT model. Under the LNMRT,
where
where
Therefore, we used the R package lavaan (v0.6-4; Rosseel, 2012), which performs factor analysis and structural equation modeling (SEM), to estimate the item parameters of the LNMRT. The code for using the lavaan package to compute the MMLEs for the LNMRT is provided in Appendix A. Molenaar et al. (2015) noted that the LNMRT, when used as a component of a joint model, can be fitted using standard SEM software packages such as EQS, Lisrel, Mplus, and Mx.
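To make the factor-analytic structure concrete, the following sketch computes simple moment-based estimates rather than the MMLEs that lavaan produces. It assumes that the log response time equals the item's time intensity minus the examinee's speed plus a normal error, and that speeds are normally distributed with mean zero, so that all item pairs share the same covariance (the variance of the speed parameter). The function `moment_estimates` and the simulated values are hypothetical; this is a stand-in for illustration, not the estimation method used in this article.

```python
import math
import random
from statistics import mean


def moment_estimates(log_times):
    """Moment-based estimates for the stand-alone lognormal model.

    Assumes log t_ij = beta_j - tau_i + e_ij with tau_i ~ N(0, sigma2) and
    e_ij ~ N(0, 1 / alpha_j**2), so that every pair of items shares the
    covariance sigma2. This is NOT the MMLE obtained from lavaan.
    """
    n, n_items = len(log_times), len(log_times[0])
    betas = [mean(row[j] for row in log_times) for j in range(n_items)]

    def cov(j, k):
        return sum((row[j] - betas[j]) * (row[k] - betas[k])
                   for row in log_times) / (n - 1)

    pairs = [(j, k) for j in range(n_items) for k in range(j + 1, n_items)]
    sigma2 = mean(cov(j, k) for j, k in pairs)
    # Residual variance of item j is var_j - sigma2; guard against values <= 0.
    alphas = [1.0 / math.sqrt(max(cov(j, j) - sigma2, 1e-8))
              for j in range(n_items)]
    return betas, alphas, sigma2


# Hypothetical data: 3,000 examinees, 5 items, speed SD 0.4, alpha = 2.
rng = random.Random(7)
true_betas, true_alpha, true_sigma = [3.5, 4.0, 4.2, 3.8, 4.5], 2.0, 0.4
log_times = []
for _ in range(3000):
    tau = rng.gauss(0.0, true_sigma)
    log_times.append([rng.gauss(b - tau, 1.0 / true_alpha) for b in true_betas])

betas, alphas, sigma2 = moment_estimates(log_times)
print([round(b, 2) for b in betas], [round(a, 2) for a in alphas], round(sigma2, 3))
```

Maximum likelihood estimation through factor analysis software uses the same covariance structure but weights the information in the data efficiently and provides standard errors, which the moment-based shortcut above does not.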
The Need to Assess the Fit of the LNMRT as a Stand-Alone Model
According to Standard 4.10 of the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014), evidence of model fit should be documented when model-based methods are used. Therefore, it is important to assess the fit of the LNMRT. Given that the LNMRT was first presented as a stand-alone model by van der Linden (2006) and that the LNMRT has been used as a stand-alone model in various applications by, for example, Boughton et al. (2017), Marianti et al. (2014), Qian et al. (2016), and Sinharay (2018), there is a need for more research on assessing the fit of the LNMRT as a stand-alone model. Also, a necessary condition for the joint model (consisting of the LNMRT and an IRT model) to fit both response times and response accuracies is that the LNMRT fits the response times. Thus, if a simple test of fit of the LNMRT as a stand-alone model shows misfit, then one may not need to fit the joint model and can instead proceed with another RTM.
Given the two observations that the effect of violating the normality assumption is often negligible (e.g., Scheffe, 1959, p. 337) and that all models are wrong but some are useful (Box & Draper, 1987, p. 54), one wonders whether the LNMRT is always useful, in terms of yielding accurate/valid inferences, despite its underlying normality assumption, or whether some types of misfit of the LNMRT have practical consequences and hence threaten the validity of the inferences from the LNMRT.
A substantial number of applications of the LNMRT involve the detection of test fraud. For example, Boughton et al. (2017), Fox and Marianti (2017), Marianti et al. (2014), Qian et al. (2016), Sinharay (2018), Sinharay and Johnson (2019), and van der Linden and Guo (2008) used the LNMRT to detect various types of test fraud. Specifically, person-fit analysis, which is one of the six types of statistical methods that are used in practice to detect test fraud according to Wollack and Schoenig (2018), was performed using the LNMRT in Marianti et al. (2014), Fox and Marianti (2017), and Sinharay (2018). Let us study the behavior of the Bayesian person-fit test statistic $l^t$ of Marianti et al. (2014) under misfit of the LNMRT. Klein Entink et al. (2009) showed that the LNMRT does not adequately fit data simulated from the Box–Cox normal model. A data set of response times of 5,000 examinees to 80 items was simulated from the Box–Cox normal model. Then the LNMRT was fitted to the data set, and the $l^t$ statistic (Marianti et al., 2014) was computed for all examinees. At the 5% level of significance, the $l^t$ statistic showed a statistically significant misfit for about 8% of the examinees, whereas the misfit percentage would be 5% if the LNMRT were fitted to data from the LNMRT or the Box–Cox normal model were fitted to the data set. Thus, about 3% more examinees (or 150 examinees in the sample of 5,000) would be erroneously flagged as possible cheaters when the poor-fitting LNMRT is used instead of the better-fitting Box–Cox normal model. If data with even a larger extent of misfit of the LNMRT are simulated (by setting
Existing Approaches for Assessment of Fit for the LNMRT
Schnipke and Scrams (1999) suggested the use of graphical plots and the root mean squared error between the observed and predicted cumulative distribution functions of response times to assess the fit of the LNMRT. Several model fit statistics have also been suggested in the context of applications of the LNMRT as a component of a joint model. These include several Lagrange multiplier statistics, one of which assesses item fit (Glas & van der Linden, 2010); the Lagrange multiplier statistic for assessing conditional independence of the responses and response times (van der Linden & Glas, 2010); and the item-fit statistics of Ranger and Kuhn (2014). Some of these methods can be adapted to applications of the LNMRT as a stand-alone model. The few model fit tools that have been suggested for the LNMRT as a stand-alone model include the Lagrange multiplier test for assessing conditional independence of the response times (van der Linden & Glas, 2010) and the Bayesian residuals based on the posterior predictive distribution of response times (van der Linden & Guo, 2008).
The item-fit statistic of Glas and van der Linden (2010) is designed to test the null hypothesis
where
where r is an appropriate cut point that divides the examinees into two groups of roughly equal sizes. The alternative hypothesis essentially states that the mean of the response time is larger (compared to what is expected under the LNMRT) for the slow examinees and smaller for the fast examinees, or vice versa. Then, the item-fit statistic of Glas and van der Linden (2010), henceforth denoted as the
To compute the item-fit statistic of Ranger and Kuhn (2014) for item j, one first fits the RTM to the available data and then divides the examinees into G groups based on their response times on an item. Then, one counts the number of examinees who belong to these groups—this results in a total of G counts. Let us denote the collection of these counts for item j as
that quantifies the extent of model fit in item j. Ranger and Kuhn (2014) proved that the distribution of $T_j$ is a multiple of a
In the Bayesian approach of van der Linden and Guo (2008), one determines whether the response time of an examinee–item combination is substantially different from what is expected under the model. van der Linden and Guo (2008) showed that in applications of the LNMRT as a stand-alone model, the posterior distribution of the predicted value of
If the absolute value of
Van der Linden and Glas (2010) stated that under violation of local independence for item pairs
where
where
The statistic
Even though there exist several tools for testing the fit of the LNMRT, there seems to be a need for further research on assessing model fit in applications of the LNMRT as a stand-alone model. For example, there exist no comparison studies of the fit statistics for the model. In addition, while there exist several
Tests of Normality
Adefisoye et al. (2016) pointed to the existence of more than 40 tests of normality. These tests can be classified into the following three categories: (a) tests based on the empirical distribution function, such as the Kolmogorov–Smirnov test (e.g., Lilliefors, 1967), the Anderson–Darling (AD) test (Anderson & Darling, 1954), and the Lilliefors test (Lilliefors, 1967); (b) tests based on moments, such as the Jarque–Bera test (Jarque & Bera, 1980) and those based on skewness and kurtosis (e.g., Mardia, 1970); and (c) tests based on correlation, such as the D’Agostino test (D’Agostino, 1971), the Shapiro–Wilk (SW) test (Shapiro & Wilk, 1965), and the Weisberg–Bingham test (Weisberg & Bingham, 1975).
In addition, as Thode (2002) noted,
There also exist several comparison studies of normality tests, including Adefisoye et al. (2016), Gan and Koehler (1990), Razali and Wah (2011), Shapiro et al. (1968), and Yazici and Yolacan (2007). Using data simulated from 45 different distributions and nine statistics, Shapiro et al. (1968) showed that the SW test statistic provided a generally superior measure of non-normality. Seber (1984, pp. 147–148) stated that among the tests of normality, the SW, AD, and D’Agostino test statistics are the most useful. Gan and Koehler (1990) compared the power of several normality tests and found the SW test to be the best overall test for assessing normality. Yazici and Yolacan (2007) found three tests, including the Jarque–Bera test, to be the most powerful in a comparison of 12 tests of normality but also found the SW test to be a superior omnibus indicator of normality. Razali and Wah (2011) compared the power of four tests (the SW, AD, Lilliefors, and Kolmogorov–Smirnov tests) and found the SW test to be the most powerful. Adefisoye et al. (2016) found that the test based on kurtosis is the most powerful for observations from a symmetric distribution and that the SW test is the most powerful for observations from an asymmetric distribution. The Nikulin–Rao–Robson (NRR) test statistic (Nikulin, 1973; K. C. Rao & Robson, 1974) has also been found to be more powerful than other tests of normality in detecting departures from normality (e.g., Voinov et al., 2009) and possesses several optimality properties (e.g., Singh, 1987; Voinov et al., 2013, p. 37). In the simulation study later in this article, the SW test, the AD test, and the NRR test are used to test for item fit. Brief descriptions of these three tests are provided below.
SW test
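In a test for normality based on observations $X_1, X_2, \ldots, X_n$, the SW test statistic is
$$W = \frac{\left(\sum_{i=1}^{n} a_i X_{(i)}\right)^2}{\sum_{i=1}^{n} \left(X_i - \bar{X}\right)^2},$$
where $X_{(i)}$ is the $i$th order statistic, $\bar{X}$ is the sample mean, and the coefficients are given by
$$(a_1, a_2, \ldots, a_n) = \frac{m^{\top} V^{-1}}{\left(m^{\top} V^{-1} V^{-1} m\right)^{1/2}},$$
where $m$ and $V$ are, respectively, the mean vector and the covariance matrix of the order statistics of a sample of size $n$ from the standard normal distribution. Small values of $W$ indicate a departure from normality.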
AD test
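The AD test (Anderson & Darling, 1954) based on observations $X_1, X_2, \ldots, X_n$ uses the statistic
$$A^2 = -n - \frac{1}{n} \sum_{i=1}^{n} (2i - 1) \left[ \log \Phi\left(Z_{(i)}\right) + \log\left(1 - \Phi\left(Z_{(n+1-i)}\right)\right) \right],$$
where $Z_{(i)}$ is the $i$th order statistic of the standardized observations $(X_i - \bar{X})/s$, $\bar{X}$ and $s$ are the sample mean and the sample standard deviation, and $\Phi$ is the standard normal cumulative distribution function. Large values of $A^2$ indicate a departure from normality.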
NRR test
To apply
The NRR
where
Method
In this section, several existing statistics for testing normality and the fit of factor analysis models are repurposed for testing item fit and the assumption of local independence of the LNMRT.
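As one illustration of repurposing a normality test for item fit, the sketch below computes the Anderson–Darling statistic for the log response times of a single item. It assumes that, under the LNMRT with normally distributed speed parameters, the log response times on an item are marginally normal across examinees. The function name, the simulated values, and the contamination scheme (10% of examinees answering in 10 seconds) are hypothetical. The small-sample adjustment factor is the commonly used one for the case in which both the mean and the variance are estimated.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def anderson_darling_normal(x):
    """Anderson-Darling statistic for normality with estimated mean and SD.

    Returns the adjusted statistic A*^2 = A^2 (1 + 0.75/n + 2.25/n^2).
    """
    n = len(x)
    m, s = mean(x), stdev(x)
    std_normal = NormalDist()
    # CDF values of the sorted, standardized observations, kept away from
    # 0 and 1 so that the logarithms below are always finite.
    p = [min(max(std_normal.cdf((v - m) / s), 1e-12), 1 - 1e-12)
         for v in sorted(x)]
    total = sum((2 * i + 1) * (math.log(p[i]) + math.log(1 - p[n - 1 - i]))
                for i in range(n))
    a2 = -n - total / n
    return a2 * (1 + 0.75 / n + 2.25 / n ** 2)


rng = random.Random(3)
# Hypothetical log response times of 1,000 examinees on a well-fitting item.
clean = [rng.gauss(4.0, 0.5) for _ in range(1000)]
# Same item, but the first 100 examinees answer in exactly 10 seconds.
contaminated = [math.log(10.0) if i < 100 else v for i, v in enumerate(clean)]

a_clean = anderson_darling_normal(clean)
a_bad = anderson_darling_normal(contaminated)
print(round(a_clean, 3), round(a_bad, 3))
```

The adjusted statistic is usually compared with a tabulated critical value (about 0.752 at the 5% level when both parameters are estimated), so the contaminated item would be flagged as misfitting while the clean item typically would not.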
Item Fit Analysis
Equation 1 implies that for a randomly chosen examinee, the marginal distribution of
In addition, for an item,
We also used
Test for Local Independence
Several statistics for testing the local independence of the item scores, such as the Q 3 statistic (Yen, 1984), are based on the correlation between the scores on a pair of items. Borrowing the idea, a test statistic based on the correlation between the logarithm of observed response times on a pair of items can be used to test the local independence of the response times.
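The idea of comparing an observed correlation with its model-implied counterpart can be sketched as follows. The sketch assumes that speeds are normally distributed with variance `sigma2` and that the error variance of item j is the reciprocal of the squared time discrimination, so that the implied correlation between the log response times on two items is the speed variance divided by the product of the two standard deviations. All function names are hypothetical, and the raw residual computed here is not standardized; the standard error needed for a z-type statistic has to come from the fitted factor analysis model.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def implied_correlation(sigma2, alpha_j, alpha_k):
    """Correlation of log response times implied by the assumed model."""
    var_j = sigma2 + 1.0 / alpha_j ** 2
    var_k = sigma2 + 1.0 / alpha_k ** 2
    return sigma2 / math.sqrt(var_j * var_k)


def residual_correlation(log_t_j, log_t_k, sigma2, alpha_j, alpha_k):
    """Observed minus model-implied correlation for one item pair."""
    observed = pearson(log_t_j, log_t_k)
    return observed - implied_correlation(sigma2, alpha_j, alpha_k)


# With speed variance 0.16 and alpha = 2 for both items, each item variance
# is 0.16 + 0.25 = 0.41, so the implied correlation is 0.16 / 0.41.
rho = implied_correlation(0.16, 2.0, 2.0)
print(round(rho, 4))
```

A residual that is large relative to its standard error suggests that the two items share dependence beyond what the common speed parameter explains.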
Equation 1 and the population distribution
Let
The population correlation
under the null hypothesis of the model fitting the data. Given the earlier discussion on how the LNMRT can be expressed as a factor analysis model, the
Simulation Study
Three sets of simulations were performed to study the properties of and compare the performances of the following model fit statistics: (a) SW statistic, (b) AD statistic, (c) NRR
Simulation of Data Under No Model Misfit
Data under no misfit were simulated under the LNMRT given by Equation 1. The true values of

Figure 1. Density plots of the logarithm of response times for 4 items.

Figure 2. Average power of the item-fit statistics, across test lengths, to detect the first type of item misfit (preknowledge).

Figure 3. Average power of the item-fit statistics, across test lengths, to detect the second type of item misfit (Box–Cox normal distribution).
Simulation of Data Under Some Item Misfit
In this set of simulations, the generated data sets included a majority of examinees whose response times followed the LNMRT given by Equation 1 (the response times of these examinees were simulated in a manner similar to that for data simulated under no model misfit) and a small fraction of aberrant/misfitting examinees whose response times did not follow the LNMRT. The percentage of aberrant examinees in a data set was assumed to be 2, 5, or 10. Each generated data set included a majority of items with no misfit and a few items with some misfit. The number of items with misfit was assumed to be 2, 4, and 6, respectively, for test lengths of 20, 40, and 60, which means that misfit is assumed to be present for 10% of the items. The items with misfit and the aberrant examinees were randomly chosen for each simulated data set.
Three types of item misfit were considered including those arising from some examinees having preknowledge of the item, the response times for the item being simulated from the Box–Cox normal model (Klein Entink et al., 2009), and the response times for the item representing a positive shift for incorrect responses.
To create the first type of misfit (item preknowledge), it was assumed that the response times of the aberrant examinees followed the LNMRT given by Equation 1 for the nonmisfitting (or noncompromised) items but were equal to 10, 20, or 30 seconds for the misfitting (or compromised) items (that constitute 10% of all items on the test). To create the second type of misfit (Box–Cox normal model), it was assumed that the response times of the aberrant examinees followed the LNMRT given by Equation 1 for the nonmisfitting items but were simulated from the Box–Cox normal model (Klein Entink et al., 2009) with
For each simulation condition, which is defined by a test length, a sample size, a percentage of aberrant examinees, and a specific magnitude of misfit (represented, e.g., by the time taken to answer the compromised items), the following steps were performed: (a) simulate a data set with mostly nonaberrant examinees and some aberrant examinees, (b) compute the MLEs of the item parameters for the data set, and (c) compute the item-fit statistics of the misfitting items using these MLEs.
The power of each item-fit statistic for each simulation condition was computed as the percentage of misfitting items that had a significant value of the statistic under that simulation condition.
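The power computation described above amounts to a rejection rate over the misfitting items. A minimal sketch, assuming that a p value is available for each misfitting item in a simulation condition and that the 5% significance level is used, is given below; the function name and the p values are hypothetical.

```python
def rejection_rate(p_values, alpha=0.05):
    """Percentage of p values that are significant at level alpha."""
    flagged = sum(1 for p in p_values if p < alpha)
    return 100.0 * flagged / len(p_values)


# Hypothetical p values of one item-fit statistic for four misfitting items.
power = rejection_rate([0.001, 0.04, 0.20, 0.60])
print(power)
```

Applied to the items without misfit, the same function gives the empirical Type I error rate of a statistic.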
Simulation of Data Under Violation of Local Independence
In this set of simulations, generated data sets included a majority of item–examinee combinations for which the response times followed the LNMRT given by Equation 1 (the response times of these examinees were simulated in a manner similar to that for data simulated under no model misfit) and some other item–examinee combinations for which local dependence was simulated in one of two ways.
To simulate the first type of local dependence, we assumed that 10%, 20%, or 40% of examinees suffer from speededness on one fifth of the items at the end of the test and respond in 10, 20, or 30 seconds to those items. To simulate the second type of local dependence, we simulated response times for 10 item pairs using the bivariate distribution given by Equation 6 for 10%, 20%, or 40% of examinees. The correlation
For each simulation condition represented by a test length, a sample size, a percentage of aberrant examinees, and a value of the number of items affected by speededness or a value of simulate a data set that involves violation of local independence, compute the MLEs of the item parameters for the data set, and compute the
Results for Data Simulated Under No Model Misfit
Except for the
Results for Data Simulated Under Some Item Misfit
Figure 1 shows the density plots of the logarithm of response times for 4 items from the simulation cases involving 5,000 examinees and 80 items. The LNMRT fits the data for Item 1 (circles on a dotted line), so that the logarithms of the response times follow the normal distribution for that item, and does not fit the other items. Item 2 (triangles on a dotted line) represents an item on which 10% of the examinees had preknowledge and answered the item in 5 seconds. The response times for Item 3 (plus symbols on a dotted line) were simulated from the Box–Cox normal model (Klein Entink et al., 2009). The response times for Item 4 (multiplication symbols on a dotted line) represent a shift of 5 seconds for the wrong responses. The figure shows that the distribution of the response times for Item 2 is bimodal. The figure also shows that, compared to the item with no misfit, the distribution of the logarithm of response times for Item 3 is shifted slightly toward the left and that for Item 4 is shifted slightly toward the right. In addition, the distributions for Items 2 through 4 have lower peaks than that of Item 1. The values of the NRR statistic are approximately 51, 2,805, 77, and 534 for the four items, the critical value at the 5% level being 66.3 (i.e., the 95th percentile of a
Other factors remaining the same, the power of each statistic was very similar over different test lengths. Figures 2 through 4, respectively, show the average power (averaging over the different test lengths) for detecting the three types of item misfit for different values of sample size (I), percentage of aberrant examinees, and the extent of misfit (denoted by the time taken to answer the compromised items,

Figure 4. Average power of the item-fit statistics, across test lengths, to detect the third type of item misfit (a shift in time for incorrect answers).
The power of the Lagrange multiplier item-fit statistic (
Figures 2 through 4 show that power increases with an increase in sample size, which is a favorable result for the item-fit statistics (e.g., C. R. Rao, 1973, p. 464). Power becomes larger as the percentage of aberrant examinees increases. Power mostly becomes larger as the extent of misfit (denoted by time or the The power of all the statistics is very close to 1 in the bottom right panel (i.e., for large samples and a large extent of item misfit) in each figure. No item-fit statistic uniformly has the largest power in all simulation cases, but the SW statistic comes close to achieving this distinction. The statistic consistently has the largest power for small sample sizes and small percentages of aberrant examinees. The large power of the SW statistic is in agreement with the superior performance of the SW statistic in several comparisons of normality tests (reviewed earlier). The NRR The $T_j$ statistic has the smallest power overall.
Results for Data Simulated Under Violation of Local Independence
Figure 5 shows the average values of power (averaging over the test lengths) of

Figure 5. Power to detect violation of local independence.
The power of both statistics increases as the sample size or the percentage of aberrant examinees increases.
For the sample size of 5,000, the power of
The power increases with a decrease in time (of responding under speededness) or an increase in
The
Real Data Examples
Example 1: A Mathematics Test
Let us consider a real data set that consists of responses and response times of 1,079 American test takers in Grade 8 on 40 mathematics items. The data were analyzed by van Rijn and Ali (2017) and were collected as part of a larger study. Thirty-two items are multiple choice, and eight are numeric entry. The items focus on basic topics in number, measurement, geometry, data analysis, and algebra and are dichotomously scored. The items were assembled in four different forms using blocks of 10 items, with different orders of the blocks to counterbalance order effects. The time limit was 90 minutes.
The test was administered under low-stakes conditions—so we computed the response time effort (RTE) measure of Wise and Kong (2005) for the data set to examine whether the examinees suffered from a lack of motivation. The RTE for an examinee is the proportion of items for which the response time of the examinee is above a cutoff. With a cutoff of 5 seconds, a value of .8 of the RTE means that an examinee took more than 5 seconds on 80% of the items (so, lower values of RTE mean less effort). With a cutoff of 5, 10, and 15 seconds, respectively, 0, 10, and 64 of the 1,079 examinees had an RTE less than .80. With a cutoff value of 10 seconds, the lowest RTE value found for the data set is .45, meaning that the corresponding examinee spent 10 seconds or less on 55% of the items. The number of examinees answering an item in less than 10 seconds ranged between 10 and 151 for the items. These numbers indicate that, overall, there is not enough evidence that many examinees suffered from a lack of motivation. Figure 6 shows the total time in minutes versus the raw score on the test—it shows almost no correlation (correlation coefficient = −.05) between the total time and raw score.
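The RTE computation described above can be sketched in a few lines. The function name and the response times below are hypothetical; the cutoff of 5 seconds and the threshold of .80 follow the description in the text.

```python
def response_time_effort(times, cutoff):
    """Proportion of items on which the response time exceeds the cutoff."""
    return sum(1 for t in times if t > cutoff) / len(times)


# Hypothetical response times (in seconds) of three examinees on 10 items.
examinees = [
    [12, 30, 4, 45, 3, 22, 18, 60, 9, 27],    # 8 of 10 items above 5 seconds
    [2, 3, 4, 50, 3, 2, 18, 4, 1, 27],        # 3 of 10 items above 5 seconds
    [40, 35, 28, 45, 33, 22, 18, 60, 19, 27], # all items above 5 seconds
]

rte = [response_time_effort(times, cutoff=5) for times in examinees]
low_effort = sum(1 for value in rte if value < 0.80)
print(rte, low_effort)
```

In this toy example, only the second examinee has an RTE below .80 and would be counted as showing low effort at the 5-second cutoff.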

Figure 6. Plot of total time versus raw score for the mathematics data.
The MLEs of
Figure 7 shows the normal probability plots of the log response times for 9 items (randomly chosen) among the 40 items. In each panel, a diagonal line is also shown for convenience—a curve close to the diagonal line would indicate a good fit of the LNMRT to the data. The figure shows that while the fit is not too bad in the right side of the panels, the curve drops well below the diagonal line toward the left of several panels. One cause of this drop is quick responding by several examinees. This is demonstrated in Figure 8. The left panel of the figure shows the standardized residuals

Figure 7. Normal probability plots for 9 items from the mathematics data.

Figure 8. The residuals and response times versus the estimated speed parameters for Item 8.
Figure 9 provides a deeper look at the relationship between the misfit and quick responding. The left panel shows a plot of the average per-item time (in seconds) of the examinees (x-axis) versus the values of a person-fit statistic using response times (Sinharay, 2018), whose larger values indicate more misfit (y-axis). A horizontal dashed line is provided at the critical value of the person-fit statistic at the 5% significance level; values of the statistic above this line are significant. The correlation between the two plotted quantities is −.25, which, together with the plot, indicates that more misfit is associated with quicker responding. In particular, the three quickest examinees (who appear on the extreme left of the left panel) all have significant values of the person-fit statistic. The right panel shows a plot of the average per-person time (in seconds) on the items (x-axis) for all examinees in the sample versus the response times (in seconds) on the items (y-axis) for one of the examinees who appear on the extreme left of the left panel; a diagonal (dashed) line is added to the plot. The panel shows that the examinee answered all the items except one faster than the other examinees did on average. Thus, Figure 9 shows that a part of the severe extent of item misfit in the data can be attributed to some examinees who responded quickly. However, several points above the horizontal line and toward the right of the left panel of Figure 9 show that some examinees took longer than average and yet had significant values of the person-fit statistic, so quick responding is not the only source of the misfit of the LNMRT to the data.

Figure 9. The relationship between misfit of the lognormal model for response times and quick responding.
In addition, the
The value of the

A plot of the values of the
Example 2: A Licensure Test
Two data sets from a licensure test were analyzed in several chapters of Cizek and Wollack (2017). We consider one of these data sets, which includes item scores and response times of 1,629 examinees on one test form with 170 operational items that are dichotomously scored. Sinharay and Johnson (2019) fitted a joint model that includes the two-parameter logistic IRT model and the LNMRT to this data set to detect item preknowledge. Figure 11 shows the total time in minutes versus the raw score on the test. There is a negative correlation (of −.25) between the total time and the raw score on the test, and, unlike for the mathematics data, the quickest examinees (say, those who took less than 100 minutes) obtained large raw scores. About 10 examinees seem to have spent considerably more time than the rest, but information on why that happened was unavailable to the authors; accommodation is a possible explanation.

Figure 11. Plot of total time versus raw score for the licensure data.
The MLEs of the
Figure 12 shows the normal probability plots of the log response times for 9 items (randomly chosen) among the 170 items. The figure shows some signs of departure of the log response times from a normal distribution, but the extent of departure is considerably less than that in Figure 7.

Figure 12. Normal probability plots for 9 items from the licensure data.
Figure 13 shows the density of the response times versus that of the best-fitting lognormal distribution for 2 items represented in Figure 12: Item 33 (Column 2 of the top row in Figure 12) and Item 164 (Column 3 of the middle row in Figure 12). These 2 items were picked because the model misfit appears mild for the former and severe for the latter. While the density of the response times is close to that of the lognormal distribution for Item 33 (left panel), there is a substantial gap between the two curves for Item 164 (right panel). Several individuals take longer than what is expected from the LNMRT for the latter item, for which the mean and standard deviation of the response time are about 35 and 31, respectively.

Figure 13. The density of the response times for 2 items.
Conclusions and Recommendations
This article focuses on the LNMRT and suggests the use of several statistics for assessing item fit and local independence of the LNMRT. A simulation study demonstrates that the suggested statistics have satisfactory Type I error rate and power, especially when compared to the existing fit statistics. In general, the SW statistic and the
The item-fit statistics are based on the classical tests for normality (e.g., Thode, 2002) and the NRR (Nikulin, 1973) test. The test for local independence is based on standardized residuals in structural equation models. The asymptotic null distributions of all the suggested statistics are known, and/or tables of critical values for the test statistics are publicly available. The simplicity of the suggested methodologies and their strong theoretical basis (in the form of asymptotic null distributions) promise to make them attractive to those interested in assessing the goodness of fit of RTMs.
The LNMRT was found to offer inadequate fit to two real data sets—one from a Grade 8 mathematics test and one from a licensure test. The percentages of statistically significant values of the item-fit statistics and
Researchers such as Gelman et al. (2014, p. 151) noted that finding an extreme p value and thus rejecting a model is never the end of an analysis. Therefore, a natural question in the context of this article is “What should a practitioner do when a misfit of the LNMRT is found?” It is possible to do several things if a misfit is found. First, as Gelman et al. (2014, p. 151) stated, one can look for other models, including extensions of the current model, that may improve the fit. In our context, the Box–Cox normal model for the response times is a possible extension of the LNMRT that was found to fit response time data better by Klein Entink et al. (2009); the extension of the LNMRT suggested by Bolsinova and Tijmstra (2018) is another possible candidate. Second, given that the data examples show the presence of some outlying/aberrant examinees, a simple extension may not fit the data, and one may need to fit a mixture RTM (that assumes one model for normal responses and another model for aberrant responses) such as that of C. Wang et al. (2018). Third, as noted by researchers such as Sinharay and Haberman (2014), practitioners should assess the practical significance of any model misfit; such an assessment aims to answer questions such as “Are the main inferences made from the model influenced by the model misfit?” and “Can the model, with its misfit, still be used for the present problem?” The assessment of practical significance is problem-specific and depends heavily on the purpose for which the model is being used. An example of such an analysis arises in an application of the LNMRT to person-fit analysis using response times (as in, e.g., Sinharay, 2018): suppose that one finds 10% misfitting examinees with the LNMRT and then applies the Box–Cox normal model (Klein Entink et al., 2009) to find the percentage of misfitting examinees; if only 5% of the examinees are found misfitting with the Box–Cox normal model, then the misfit of the LNMRT is practically significant.
Our article has several limitations. First, applications of the suggested statistics to more simulated and real data would provide more insight into these statistics. Second, extension of the suggested statistics to more complicated RTMs is a possible topic for future research. Third, other types of item-fit statistics for these models may be helpful. For example, the suggested statistics have low power under certain conditions, and research on finding more powerful statistics will be useful. Finally, research on finding effect sizes corresponding to the suggested statistics would be useful. One way to examine effect size would be to find out the practical significance of the misfit (Sinharay & Haberman, 2014).
Appendix A. R Functions to Compute the MLEs and the New Statistics
Appendix B. Type I Error Rates of the Statistics in the Simulation Study
Acknowledgments
The authors would like to thank the editors, Li Cai and Daniel McCaffrey; the associate editor, Rianne Janssen; and the three anonymous reviewers for several helpful comments that led to a significant improvement of the article. The authors would also like to thank Jodi Casabianca-Marshall, John Donoghue, and Rebecca Zwick for several helpful comments and James Wollack for generously sharing a data set that was used in the research that led to this article.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The authors prepared the work as employees of Educational Testing Service. Any opinions expressed in this publication are those of the authors and do not necessarily represent views of the Institute of Education Sciences, the U.S. Department of Education, or the Educational Testing Service.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305D170026.
