Abstract
In this article, a new model for test response times is proposed that combines latent class analysis and the proportional hazards model with random effects in a similar vein as the mixture factor model. The model assumes the existence of different latent classes. In each latent class, the response times are distributed according to a class-specific proportional hazards model. The class-specific proportional hazards models relate the response times of each subject to his or her work pace, which is considered as a random effect. The latent class extension of the proportional hazards model allows for differences in response strategies between subjects. The differences can be captured in the hazard functions, which trace the progress individuals make over time when working on an item. The model can be calibrated with marginal maximum likelihood estimation. The fit of the model can either be assessed with information criteria or with a test of model fit. In a simulation study, the performance of the proposed approaches to model calibration and model evaluation is investigated. Finally, the model is used for a real data set.
Introduction
Psychological assessment deals with the measurement of human characteristics such as ability, characteristic traits, attitudes, or preferences. The process of measurement is thereby based on the responses to standardized stimuli, which are usually short statements or isolated microproblems bundled together in a test. These responses are then mapped to the continuum of the characteristic that is supposed to underlie the test performance. The mapping requires a measurement model that describes the relation between the observable responses and the unobservable characteristic, the so-called latent trait. Standard measurement models are the logistic models that were originally developed for achievement tests (Baker & Kim, 2004). The application of a particular measurement model requires that the relation between the responses and the trait is the same for all individuals. This implies that all individuals respond in the same way, or more specifically, that the items trigger the same response processes in all subjects. This assumption, however, might be invalid in some cases. There is accumulating evidence that individuals differ in the way they process information and arrive at a response.
With respect to achievement tests, research findings indicate the existence of subgroups that differ qualitatively in their way of responding. In tests with few consequences for the test takers, there is sometimes a subgroup of individuals that guesses rapidly while the majority of the test takers strive to solve the items. This problem occurs in low-stakes testing where some subjects lack the motivation to make any effort (Bolt, Cohen, & Wollack, 2002; Meyer, 2010; Schnipke, 1999; Wise & DeMars, 2006). Qualitatively different ways of responding, however, can have more profound reasons. In diagnostic classification models, the tests are designed to distinguish subgroups of test takers who hold qualitatively different forms of misconceptions and thus rely on different mental operations when responding; see, for example, Bradshaw and Templin (2014). Although misconceptions become manifest in the response, the way individuals apply concepts also influences the response time (Lasry, Watkins, Mazur, & Ibrahim, 2013). Sometimes, individuals even differ in a more fundamental way. The dual processing theory, which was developed in cognitive psychology, states that there are two modes of information processing, an automated mode and a controlled mode. Individuals switch from controlled processing to automated processing with practice (Goldhammer et al., 2014). Automated processing is supposed to be faster than controlled processing.
Evidence for the presence of subgroups of test takers who respond in a different way can also be found in personality tests and attitudinal scales. An important topic in psychological assessment is the problem that some individuals respond on some other basis than the specific item content. Response styles like acquiescence or extreme responding can be mentioned here as well as the motivation to respond in a socially desirable way. Much effort has been spent on the identification of the affected subjects. In addition to the responses, the response times are indicative of the response style, and several researchers managed to separate different groups of responders by means of a response time analysis (Holden & Kroner, 1992; Hsu, Santelli, & Hsu, 1989; McIntyre, 2011; Schneider & Hübner, 1980). However, not all subjects that respond differently should be classified as aberrant responders. For attitudinal scales, it has been shown that individuals can be classified into two groups on the basis of whether they take a rapid peripheral route of information processing or a more central one (Mayerl, 2013). Along the same lines, one can distinguish strong and weak attitudes, a distinction that refers to the availability of the attitudes when completing the test. Attitudes are either retrieved or generated when responding to the test items, and it is the retrieved attitudes that are stable and predict behavior (Bassili, 1996; Fazio, 1995; Mulligan, Grant, Mockabee, & Monson, 2003). Retrieving attitudes is usually faster than generating them, and it has been shown that the response time can be used to separate the two response processes. Lately, the existence of different classes of respondents has also been noticed in the field of conjoint measurement. Latent class models have been used successfully for choice data to identify unobserved consumer segments differing in preference structures and decision-making strategies (Jedidi & Kohli, 2005; Swait & Adamowicz, 2001).
Summing up, this short overview illustrates three points. First, there is much evidence that individuals do not only differ in the trait level but sometimes also in the general way they process the items of the test. This allows for a grouping of the subjects into several latent classes. The number and nature of these classes depend on the content of the test and the situation in which the assessment takes place. Considering the overview given above, however, very often just two latent classes have to be distinguished. Depending on the context, these could be the classes of subjects that respond rapidly versus in a well-thought-out way, that use controlled versus automated processes, that answer honestly versus in a socially desirable way, or that take a peripheral route versus a central route of information processing.
Second, it is of diagnostic value to identify the presence of latent classes and to classify the subjects into the class they belong to. Each latent class is characterized by a different relationship between the response and the latent trait and therefore requires a class-specific measurement model to avoid wrong conclusions about the traits of the test takers. The problem of invalid trait inference, however, is only one aspect. The classification of subjects into different classes might be of diagnostic value itself in case one is able to identify rapid guessers, individuals with strong attitudes, or different consumer segments. Knowledge of the class membership allows for a more detailed characterization of the individuals, and class membership is known to serve as a moderator variable of the trait-behavior relationship; see, for example, the publications about attitude strength or social desirability.
And third, the response time might be a good indicator of the latent class membership. Very often, the time course of the response process differs significantly for different ways to approach an item. In the case of automated versus controlled processing or the central versus peripheral route, it is the response time and not so much the response that distinguishes the two solution processes. Rapid guessing becomes apparent not only in a large number of wrong responses but also in short response times. And even in the case of social desirability, it is the response time that has been discussed as the primary indicator of this response style. Not surprisingly, it is the response time that has very often been used to identify subgroups of test takers responding in a different way; see the references cited above. This is similar to experimental psychology, where the response time is considered the key quantity for analyzing the response process of subjects.
Consequently, in this article, we propose a model for response times in tests that allows for different response processes. The model combines two well-known approaches to response time modeling: latent class analysis and the proportional hazards model with random effects. The latent class model assumes that individuals can be grouped into different latent classes with distinct, class-specific response time models. The class-specific response time models are assumed to be proportional hazards models with random effects that account for individual differences in work pace. This approach parallels the mixture factor analysis model of Yung (1997) when the standard factor model is replaced by the proportional hazards model. Here, the proportional hazards model was chosen for two reasons. First, it is the most popular model for event times in biometrics because of its flexibility and semiparametric approach. Second, as highlighted by Wenger and Gibson (2004) and Ranger and Ortner (2013), the model has a precise psychological interpretation in terms of an information accumulation process and can be used to estimate the rate of information acquisition of the subjects. This is an advantage over more statistical models that are not closely related to elements of the solution process and lack a psychological interpretation.
The outline of the article is as follows. First, the proportional hazards model with random effects is described and an interpretation in terms of an information accumulation process is given. Then, the model is extended to a latent class version. Model estimation and approaches to model inference are described in a separate section. In this section, we also address ways to test the fit of the model. Finally, the model is applied to a real data set.
The Proportional Hazards Model and Its Interpretation
The proportional hazards model of Cox (1972) is the most popular model for the analysis of event times in statistics. It was introduced in psychology by Douglas, Kosorok, and Chewning (1999). Ranger and Ortner (2012) adapted the proportional hazards model to response times in tests by including a random effect representing the work pace of an individual. Similarly, the proportional hazards model was used by Wang, Chang, and Douglas (2013) and Wang, Fan, Chang, and Douglas (2013) as a special variant of the more general linear transformation model. A proportional hazards model with crossed random effects has been proposed by Loeys, Legrand, Schettino, and Pourtois (2014).
The basic quantity in the proportional hazards model is the hazard function. The hazard function is a function of time and is defined as
where
The hazard function determines the response time distribution as there is a one-to-one relation between the response time density and the hazard function. The assumption of different response time distributions over subjects and items implies that for each item g and each subject i there is a distinct hazard function
The baseline hazard function
Using the well-known relationship between the hazard function and the density function (see, for example, Klein and Moeschberger, 1997), the density of the response time distribution follows from Equation (2) as
where
A Mixture Proportional Hazards Model
The fusion of latent trait and latent class models has a long tradition in psychometrics. The basic idea of such models is the combination of continuous latent variables in form of the latent trait with discrete latent variables in form of a latent class membership (Vermunt, 2008). Several such models have been proposed for the responses from psychological tests. Mixture item response models assume different item response models for several not directly observable subgroups of the test takers (Magidson & Vermunt, 2001; Rost, 1990). Mixture factor models postulate different factor models for likewise not directly observable subpopulations (Varriale & Vermunt, 2012; Yung, 1997). Recently, combinations of latent class models with latent trait models have also been suggested for response time data in tests, predominantly in the context of rapid guessing (Meyer, 2010). One could also model test response data with one of the latent class factor models suggested in biometrics for event data (Almansa, Vermunt, Forero, & Alonso, 2014; Asparouhov, Masyn, & Muthen, 2006). However, instead of using one of the modeling approaches proposed in the past that either are based on the normal distribution or that can be used for discrete time data only, we propose a latent class latent trait model that is based on the proportional hazards model for continuous time in this article.
When combining latent trait and latent class analysis, two different strategies can be pursued, the k-mixtures of common latent trait models versus the k-mixtures of heterogeneous latent trait models (Yung, 1997). In the first approach, a common latent trait model applies to all subgroups of the population. The distribution of the latent trait, however, is different in the latent classes such that the marginal distribution of the latent trait across the latent classes follows a mixture of distributions. In the second approach, each subgroup can be distinguished by a group-specific latent trait model. The latent trait, however, is assumed to be standard normally distributed in each subgroup. It is the second approach that is followed here. More specifically, we assume that there are different subgroups of test takers that differ in their way of responding. Each way of responding can be characterized by a distinct rate of information acquisition, that is, by a distinct baseline hazard function. For the sake of simplicity, we assume just two subgroups. Two latent subgroups might be the first choice as very often there are two fundamentally different ways of responding; see the introduction for examples. Each test taker can be assigned to one of the two subgroups according to his/her way of responding. Note that this assumption implies that individuals do not change their way of responding during the test. The group membership of the individuals cannot be observed directly. Therefore, the subgroups are denoted as latent classes.
Represent by
Here,
where
In accordance with the standard assumption typically made in item response and structural equation models the distribution
where π₁ and π₂ denote the mixing proportions, that is, the probabilities of belonging to each of the two latent classes. Note that in case the latent class of the subjects was known, the model would simplify to a stratified survival model for multivariate response times, a so-called stratified frailty model.
In the present form, the model is not identified. Scale and location differences of the class-specific latent trait distributions
Equation (7) is similar to the k-mixtures of heterogeneous factor models of Yung (1997) with the exception that the class-specific distributions of the response times are not the multivariate normal distributions implied by the standard factor model but follow from the proportional hazards model with random effects. The model also bears resemblance to the multivariate multilevel model for continuous survival times described in Asparouhov et al. (2006). The main advantage of the proportional hazards framework is its flexibility and its interpretability in terms of an information accumulation process. The class-specific baseline hazard functions allow for a precise assessment of how the subgroups differ with respect to information processing. So far, only two latent classes have been considered. This is often the first choice as a model should be as parsimonious as possible and there are good reasons to assume just two fundamentally different ways of responding. This could imply a group guessing rapidly versus a group responding adequately or a group responding honestly versus a group concerned with impression management. Theoretically, the approach could be extended to more latent groups. The number of latent classes is only limited by the ability to calibrate the model. However, it might be the better strategy to prefer a simpler model over a more complex one that is hard to fit to the data (Hastie, Tibshirani, & Friedman, 2009).
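The structure of such a two-class mixture can be illustrated with a short numerical sketch. The code below is not the authors' implementation. It assumes a hazard of the form h0(t) · exp(beta · theta), conditional independence of the item response times given the latent trait, Weibull-type baseline hazards in place of the spline approximation, and a plain midpoint rule for the integral over the latent trait. All function names and parameter values are hypothetical.

```python
import math


def ph_density(t, theta, beta, shape, scale):
    """Density of a response time under a proportional hazards model.

    Assumed form (an illustration, not taken from the article):
    hazard(t | theta) = h0(t) * exp(beta * theta) with the Weibull-type
    baseline h0(t) = (shape / scale) * (t / scale) ** (shape - 1).
    """
    risk = math.exp(beta * theta)
    baseline_hazard = (shape / scale) * (t / scale) ** (shape - 1)
    cumulative_baseline = (t / scale) ** shape
    # density = hazard * survival function
    return baseline_hazard * risk * math.exp(-cumulative_baseline * risk)


def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def mixture_marginal_density(times, classes, n_grid=81, bound=6.0):
    """Marginal density of one subject's response times.

    classes: list of (mixing_proportion, item_params) pairs, where
    item_params holds one (beta, shape, scale) tuple per item.
    The latent trait is integrated out with a midpoint rule on
    [-bound, bound], separately within each latent class.
    """
    step = 2.0 * bound / n_grid
    total = 0.0
    for proportion, item_params in classes:
        integral = 0.0
        for k in range(n_grid):
            theta = -bound + (k + 0.5) * step
            conditional = 1.0
            for t, (beta, shape, scale) in zip(times, item_params):
                conditional *= ph_density(t, theta, beta, shape, scale)
            integral += conditional * standard_normal_pdf(theta) * step
        total += proportion * integral
    return total


# Hypothetical two-class setup with two items per class.
classes = [
    (0.7, [(0.5, 2.0, 1.0), (0.5, 2.0, 1.2)]),  # Weibull-type baseline
    (0.3, [(0.5, 1.0, 1.0), (0.5, 1.0, 1.2)]),  # exponential baseline
]
print(mixture_marginal_density([0.8, 1.1], classes))
```

Summing the logarithm of this quantity over subjects would give a marginal log-likelihood of the kind that is maximized during calibration.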
Implementation of the Model and Model Calibration
One of the key quantities of the model is the baseline hazard function. It can be interpreted as the rate at which an average individual in a latent class acquires information. Therefore, the class-specific baseline hazard functions should be the most informative quantity to compare the different solution processes in the two latent classes. The hazard function also determines the response time distribution. When the hazard function is constant, the response times are distributed according to an exponential distribution. Hazard functions that are proportional to a power of time, for example, linear or quadratic functions through the origin, result in Weibull distributed response times. When implementing the model, strong assumptions about the form of the class-specific baseline hazard functions should be avoided. In fact, the primary motivation to use the proportional hazards model is its potential to record the process of information acquisition in a flexible way. The required flexibility can be achieved by approximating the class-specific baseline hazard functions
A standard approach to piecewise continuous modeling consists in the usage of polynomial spline functions (Hastie et al., 2009). Polynomial spline functions are polynomials on each segment with continuous derivatives of a certain order across the whole range of their domain. Polynomial spline functions are usually generated by a linear combination of basis functions. In general, the B-spline basis is used. The resulting linear combination, however, is not necessarily positive, as is required for a baseline hazard function. Although the positivity of polynomial splines can be guaranteed by imposing restrictions on the weights of the linear combination (Etezadi-Amoli & Ciampi, 1987), this approach is not easy to implement. It is easier to approximate the baseline hazard function with an approach proposed by Ramsay (1988). The approach is based on a different basis, the so-called M-spline basis. The M-spline basis functions are likewise combined via a linear combination to a piecewise continuous polynomial. Due to characteristics of the basis functions, the positivity of the weights guarantees the positivity of the piecewise polynomial such that no complicated restrictions are necessary. A disadvantage, however, is the fact that the M-spline basis with positive weights is somewhat restricted as it does not cover the whole space of positive piecewise polynomials. This is usually not a problem in practice as the M-spline basis suffices for most shapes of the baseline hazard function; see Cai, Lin, and Wang (2011) and Younes and Lachin (1997) for an application of M-splines in response time modeling.
Because of space limitations, the approach based on M-splines will only be sketched here. More details can be found in Appendix B. Piecewise modeling of the class-specific baseline hazard functions requires that the time axis is divided into disjoint segments at knots that define the segment borders. In each segment, the hazard function is modeled by a distinct polynomial of degree m. The segment-specific polynomials, however, are subject to restrictions as the resulting function is required to be continuous with m − 1 continuous derivatives. A piecewise quadratic function, for example, is a quadratic function in each segment with continuity up to the first derivative. The continuity requirements guarantee the smoothness of the function. The piecewise function can be generated by a linear combination of basis functions, the M-spline basis. With q segments

M-spline basis functions for a piecewise quadratic function and three segments defined by segment borders at 3.33 and 6.66. An exemplified linear combination of the basis functions with positive weights yields the bold line that represents a potential approximation of the baseline hazard function.
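The construction shown in the figure can be reproduced with a small sketch based on the M-spline recursion of Ramsay (1988). The time range of 0 to 10, the weights, and all function names are assumptions made for this illustration; only the segment borders at 3.33 and 6.66 and the quadratic degree are taken from the caption.

```python
def mspline_basis(x, order, knots):
    """Evaluate all M-spline basis functions of a given order at x.

    order = polynomial degree + 1 (order 3 gives piecewise quadratics).
    knots: nondecreasing knot sequence in which the two boundary knots
    are repeated `order` times.
    Returns len(knots) - order values.
    """
    last = knots[-1]
    # Order 1: scaled indicator functions of the non-empty segments.
    values = []
    for i in range(len(knots) - 1):
        lo, hi = knots[i], knots[i + 1]
        in_segment = lo <= x < hi or (x == last and hi == last and lo < hi)
        values.append(1.0 / (hi - lo) if in_segment else 0.0)
    # Recursion raising the order one step at a time.
    for k in range(2, order + 1):
        raised = []
        for i in range(len(knots) - k):
            span = knots[i + k] - knots[i]
            if span == 0.0:
                raised.append(0.0)
            else:
                numerator = ((x - knots[i]) * values[i]
                             + (knots[i + k] - x) * values[i + 1])
                raised.append(k * numerator / ((k - 1) * span))
        values = raised
    return values


def baseline_hazard(x, weights, order, knots):
    """Linear combination of M-spline basis functions.

    Positive weights yield a positive function because every basis
    function is nonnegative.
    """
    basis = mspline_basis(x, order, knots)
    return sum(w * b for w, b in zip(weights, basis))


# Piecewise quadratic (order 3), three segments with borders at 3.33 and 6.66.
# The range [0, 10] and the weights are hypothetical.
full_knots = [0.0, 0.0, 0.0, 3.33, 6.66, 10.0, 10.0, 10.0]
weights = [0.2, 0.5, 1.0, 0.8, 0.4]

for t in (1.0, 5.0, 9.0):
    print(t, baseline_hazard(t, weights, 3, full_knots))
```

With two interior knots and order 3, the basis consists of five functions, so five positive weights have to be estimated per class-specific baseline hazard function in this configuration.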
The piecewise polynomial function is generated by a linear combination of the M-spline basis functions. The weights of the linear combination have to be restricted to be positive. This restriction guarantees the positivity of the resulting polynomial because all M-spline basis functions are positive. This allows the approximation of the baseline hazard functions as follows. In a first step, several time points have to be chosen for each item that divide the time axis into segments. Then, the M-spline basis functions have to be determined for each set of time points. The class-specific baseline hazard function
The application of the model requires the estimation of the unknown parameters, that is, the regression coefficients from the proportional hazards model, the mixing proportions, and the weights needed for the M-spline approximation of the class-specific baseline hazard functions. The knots that define the segments are regarded as given. Usually, one places the knots at equidistant quantiles or spreads them uniformly over the time continuum. The unknown parameters can be estimated with marginal maximum likelihood estimation as it is standard practice in latent trait modeling. Marginal maximum likelihood estimation is based on the marginal response time distribution that can be obtained as follows. First, one has to replace the baseline hazard functions in Equation (4) with the M-spline approximation
Denote by vector
where
Analysis of Model Fit
Before any substantive interpretation of the research findings can be attempted, two questions have to be investigated: the correct number of latent classes and the overall fit of the model to the data. Testing for the correct number of latent classes is not straightforward in mixture models. The standard approach of comparing nested models with different numbers of latent classes via a likelihood ratio test is invalid as the mixture parameters are on the boundary of the parameter space and the model is not identified under the null hypothesis (Lo, Mendell, & Rubin, 2001). In latent class analysis or mixture item response models, one usually uses information criteria to determine the correct number of latent classes. Popular information criteria are the Akaike information criterion (AIC) of Akaike (1992) and the Bayesian information criterion (BIC) of Schwarz (1978). Although their application is also not justified in the presence of boundary value parameters (Hughes & King, 2003), these measures or variants thereof usually perform well in practice, at least in larger samples (Bulteel, Wilderjans, Tuerlinckx, & Ceulemans, 2013; Henson, Reise, & Kim, 2007; Li, Cohen, Kim, & Cho, 2009; Nylund, Asparouhov, & Muthen, 2007). Therefore, in the present context we also advise the usage of information criteria for inferences about the number of latent classes.
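As a minimal sketch of this selection step, the following code computes the AIC and the BIC from a maximized log-likelihood and picks the number of latent classes with the lowest value. The standard definitions of the two criteria are used; the log-likelihoods, parameter counts, and function names are hypothetical and serve only as an illustration.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 log L + 2 p."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_subjects):
    """Bayesian information criterion: -2 log L + p log n."""
    return -2.0 * log_likelihood + n_params * math.log(n_subjects)


def select_n_classes(fits, n_subjects, criterion="aic"):
    """Return the number of classes with the lowest criterion value.

    fits: dict mapping the number of latent classes to a tuple
    (maximized log-likelihood, number of free parameters).
    """
    scores = {}
    for n_classes, (log_lik, n_params) in fits.items():
        if criterion == "aic":
            scores[n_classes] = aic(log_lik, n_params)
        else:
            scores[n_classes] = bic(log_lik, n_params, n_subjects)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fits of models with one, two, and three latent classes.
fits = {1: (-9050.0, 60), 2: (-8650.0, 121), 3: (-8620.0, 182)}
print(select_n_classes(fits, n_subjects=1000, criterion="aic"))
print(select_n_classes(fits, n_subjects=1000, criterion="bic"))
```

Because the BIC penalty grows with the logarithm of the sample size, the BIC penalizes additional latent classes more strongly than the AIC in samples of the size considered here.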
The overall fit of the model can be assessed with a test proposed by Ranger and Kuhn (2014). A test of model fit is usually superior to graphical checks in highly parameterized models. The test proposed by Ranger and Kuhn (2014) is a χ²-like test of goodness of fit similar to the χ²-test of Chernoff and Lehmann (1954). The test requires the binning of the response times at some prespecified time points
Simulation Study
In order to assess the proposed approach to model estimation a simulation study was conducted. This simulation study served two purposes, to assess the parameter recovery of the maximum likelihood estimator and to evaluate the proposed methods for inference. The simulation study was based on the response times in a test of 12 items. Two latent classes were assumed with mixture proportions of
The simulation samples were obtained as follows. In a first step, the respondents were randomly assigned to one of the two latent classes according to the Bernoulli distribution and the mixture proportions. Then, the latent trait values were drawn from the standard normal distribution for every fictitious test taker in the sample. Finally, the response times were generated for each test taker by using the proportional hazards model of the latent class the test taker belonged to. In this way, 250 samples with 1,000 subjects and 250 samples with 5,000 subjects were generated.
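The three sampling steps can be sketched as follows. The code assumes a hazard of the form h0(t) · exp(beta · theta) with a Weibull-type baseline in the first class and an exponential baseline in the second class; the mixing proportions, the regression coefficients, and the scale parameters are placeholders chosen for illustration, not the values of the simulation study.

```python
import math
import random


def simulate_response_times(n_subjects, class_params, proportions, seed=1):
    """Simulate response times from a two-class proportional hazards mixture.

    class_params[c] holds one (beta, shape, scale) tuple per item for class c.
    Assumed survival function (illustration only):
    S(t | theta) = exp(-(t / scale) ** shape * exp(beta * theta)).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        # Step 1: Bernoulli draw of the latent class.
        latent_class = 0 if rng.random() < proportions[0] else 1
        # Step 2: latent trait from the standard normal distribution.
        theta = rng.gauss(0.0, 1.0)
        # Step 3: response times by inverting the survival function.
        times = []
        for beta, shape, scale in class_params[latent_class]:
            u = 1.0 - rng.random()            # uniform on (0, 1]
            unit_exponential = -math.log(u)   # cumulative hazard at the event
            risk = math.exp(beta * theta)
            times.append(scale * (unit_exponential / risk) ** (1.0 / shape))
        data.append((latent_class, theta, times))
    return data


# Hypothetical setup: 12 items, Weibull-type class and exponential class.
class_params = [
    [(0.5, 2.0, 1.0)] * 12,   # class 1: shape 2
    [(0.5, 1.0, 1.0)] * 12,   # class 2: shape 1 (exponential)
]
sample = simulate_response_times(1000, class_params, proportions=(0.7, 0.3))
print(sample[0][0], round(sample[0][1], 3), [round(t, 3) for t in sample[0][2][:3]])
```

The inversion works because the cumulative hazard evaluated at the event time follows a unit exponential distribution under any continuous survival model.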
Simulation Study: Parameter Recovery
In a first step, the model was fit to the simulation samples. The model was implemented as described above. The time axis was divided into three segments at the knots
The integral over the normal distribution needed for marginal maximum likelihood estimation, see Equation (6), was approximated with Gauss-Hermite quadrature using 20 nodes. The marginal log-likelihood function was maximized with the EM-algorithm. The algorithm was implemented in the statistical software environment R (R Development Core Team, 2009). The scripts are available from the authors on request. As mixture models usually possess several local maxima, the performance of the estimator depends crucially on good initial values. The initial values were determined as follows. In a first step, the log response times were analyzed with the package mclust, which estimates a finite mixture model based on the multivariate normal distribution (Fraley, Raftery, & Scrucca, 2014). A two-class solution was enforced and each test taker was assigned to one of the two latent classes. Then the total response time of each subject was used as a proxy for the latent trait. A provisional latent trait estimate was generated by transforming the total response time into a quantile of the standard normal distribution via quantile matching. Using the preliminary class memberships and the provisional latent traits of the subjects, two proportional hazards models were estimated for every item, one for each of the two groups of respondents.
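The quantile matching step for the provisional latent trait estimates can be sketched as follows. The plotting position (rank − 0.5)/n, the handling of ties, and the direction of the mapping are assumptions of this illustration; whether long total response times should correspond to high or low trait values depends on the parameterization of the model.

```python
from statistics import NormalDist


def provisional_traits(total_times):
    """Map total response times to standard normal quantiles.

    The subject with the shortest total time receives the lowest quantile.
    Ties are broken by the order of appearance (an arbitrary choice).
    """
    n = len(total_times)
    order = sorted(range(n), key=lambda i: total_times[i])
    normal = NormalDist()  # standard normal distribution
    traits = [0.0] * n
    for rank, subject in enumerate(order):
        # The plotting position keeps the probabilities strictly inside (0, 1).
        traits[subject] = normal.inv_cdf((rank + 0.5) / n)
    return traits


# Hypothetical total response times of five test takers.
totals = [41.2, 18.7, 25.3, 60.9, 30.1]
print([round(z, 3) for z in provisional_traits(totals)])
```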
For each of the 500 (2 sample sizes × 250 replications) data sets the item parameters were estimated as described above. The EM-algorithm converged each time. Having estimated the item parameters, confidence intervals were calculated for the mixture proportion and the regression coefficients. For this purpose, the standard errors of the estimates were computed as the square roots of the diagonal entries of the inverse of the observed Fisher information matrix. Then, Wald-type confidence intervals were calculated. The results can be found in Table 1, which contains the true value, the average estimate, the standard error of estimation, and the relative coverage frequency of the confidence intervals for confidence level c = 0.95. Note that the results have been aggregated over items with the same item parameter (Items 1-4, 5-8, and 9-12).
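The computation of Wald-type confidence intervals and of their relative coverage frequency over replications can be sketched as follows. The estimates and standard errors in the example are hypothetical, and the function names are chosen for this illustration only.

```python
from statistics import NormalDist


def wald_interval(estimate, standard_error, level=0.95):
    """Wald-type confidence interval: estimate +/- z * standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * standard_error, estimate + z * standard_error


def coverage(true_value, estimates, standard_errors, level=0.95):
    """Relative frequency with which the intervals contain the true value."""
    hits = 0
    for estimate, standard_error in zip(estimates, standard_errors):
        lower, upper = wald_interval(estimate, standard_error, level)
        if lower <= true_value <= upper:
            hits += 1
    return hits / len(estimates)


# Hypothetical estimates of one regression coefficient from four replications.
estimates = [0.48, 0.55, 0.31, 0.52]
standard_errors = [0.05, 0.06, 0.05, 0.04]
print(wald_interval(estimates[0], standard_errors[0]))
print(coverage(0.50, estimates, standard_errors))
```

A coverage frequency close to the nominal level indicates that both the point estimates and the standard errors derived from the information matrix are trustworthy.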
Parameter Recovery of the Regression Coefficients.
Note. Results for
The results in Table 1 imply that the maximum likelihood estimator performs well. The estimates of the parameters are virtually unbiased. However, the estimates are slightly imprecise in samples of 1,000 subjects. Especially the estimates of the regression coefficients in the second latent class have a large standard error. This might be due to the smaller mixing proportion (
In addition to the recovery of the item parameters, we also investigated the recovery of the class-specific baseline hazard functions (momentary rate of information accumulation) and the cumulative baseline hazard functions (total accumulated information). In general, the functions could be recovered almost without bias. However, the standard error of estimation was rather large for the class-specific baseline hazard functions, especially in the class with the lower mixing proportion. This is not surprising as the hazard function is notoriously hard to estimate. More details concerning the estimation of the two quantities can be obtained from the authors on request.
Simulation Study: Model Fit
The performance of the information criteria and the test of model fit was evaluated in a second simulation study. For this purpose, different simulation conditions were defined by varying the number of latent classes of the data-generating model. In the condition with two latent classes, the response times were generated as before, mixing the Weibull distribution with the exponential distribution. In the condition with one latent class, all response times were drawn from an exponential distribution. This exponential distribution was identical to the one of the first simulation condition. In the condition with three latent classes, the response times were generated by mixing three distributions. The first and second distributions were the Weibull and exponential distributions of the first simulation study. The third distribution was similar to the Weibull distribution of the first simulation condition with the exception that all response times were shifted (increased) by the fixed amount of 0.5. The mixing proportions were 0.25, 0.50, and 0.25. The resulting distribution was still unimodal. Again, 250 data sets of 1,000 and 5,000 subjects were generated for each simulation condition.
The data sets were analyzed as described above. First, the mixture model for two latent classes was fit to each of the 2 × 3 × 250 data sets using the marginal maximum likelihood estimator. In addition, modified versions for one and three latent classes were fit to the data as well. However, only the data sets with 1,000 subjects were analyzed with the version for three latent classes as fitting the model was extremely time-consuming. Although the models were misspecified whenever the structure of the data set did not correspond to the assumed number of latent classes, all estimators converged. Having estimated the parameters, the AIC and BIC indices were calculated for all models and all data sets. The model with the lowest information criterion was chosen as the true one. The relative frequency of the choices is tabulated in Table 2 for the different simulation conditions and samples of 1,000 subjects. Note that the diagonal in each subtable denotes the rate with which the number of latent classes was identified correctly.
Relative Frequency With Which a Model Was Chosen as the Best Approximation by the AIC and BIC Indices for Generating Models With Different Number of Latent Classes in Samples of 1,000 Subjects.
Note. AIC = Akaike information criterion; BIC = Bayesian information criterion. Results based on 250 simulation samples.
The performance of the information criteria is suboptimal. The BIC fails at detecting the correct number of latent classes when two or more classes are present. This is somewhat surprising although it is known that the BIC sometimes performs poorly in more complex models (Bulteel et al., 2013). The AIC detects the presence of two latent classes but also fails at revealing the existence of three. The low detection rate in case of the third simulation condition is due to the good approximation of the marginal response time distribution by the two-class model. P-P plots that compared the correspondence of the true response time distribution with the one implied by the two-class model did not indicate the misspecification either. Given the variance of the Weibull distribution a shift of 0.5 appears negligible. As the AIC selects the model with the best trade-off between variance and bias it tends to favor the model with fewer parameters. Sample size adjusted versions of the criteria that reduce the penalty for the model complexity might improve the detection rate (Tofighi & Enders, 2007).
The performance of the proposed tests of model fit was analyzed as follows. In a first step, the two-class mixture model was fitted to each data set from the first simulation condition (two latent classes) and the third simulation condition (three latent classes). Then, the response times of every item were categorized into 9 time slices at 8 equally spaced thresholds ranging from 0.40 to 1.10. With the categorized response times, two tests of model fit were run. The first test was the test of item fit and compared the expected and observed numbers of responses in the time slices for each item separately. Such a test would be used during test development for item selection. The second test evaluated the congruence of observed and expected numbers of responses in all items jointly, thereby testing the overall fit of the model. Such a test is similar to the overall test of model fit in confirmatory factor analysis. The empirical rejection rates of the tests can be found in Table 3 for several nominal Type I error rates and the two simulation conditions. The reported relative frequencies are the empirical Type I error rate in the condition with two latent classes and the power of the test in the condition with three latent classes. Note that the results from the item-specific version of the test have been averaged over the items.
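The binning step and the comparison of observed and expected counts for a single item can be sketched as follows. A plain Pearson-type statistic stands in for the test of Ranger and Kuhn (2014), whose exact form and reference distribution are not reproduced here. The exponential model distribution and the simulated response times are assumptions made for this illustration.

```python
import math
import random
from bisect import bisect_right


def bin_counts(times, thresholds):
    """Count the response times falling into the slices defined by thresholds."""
    counts = [0] * (len(thresholds) + 1)
    for t in times:
        counts[bisect_right(thresholds, t)] += 1
    return counts


def expected_counts(n, thresholds, cdf):
    """Expected counts per slice under a model-implied distribution function."""
    edges = [0.0] + [cdf(c) for c in thresholds] + [1.0]
    return [n * (edges[k + 1] - edges[k]) for k in range(len(edges) - 1)]


def pearson_statistic(observed, expected):
    """Sum of squared deviations scaled by the expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Eight equally spaced thresholds from 0.4 to 1.1.
thresholds = [round(0.4 + 0.1 * k, 1) for k in range(8)]

# Hypothetical item: response times and model are both unit exponential.
rng = random.Random(7)
times = [rng.expovariate(1.0) for _ in range(1000)]
observed = bin_counts(times, thresholds)
expected = expected_counts(len(times), thresholds, lambda t: 1.0 - math.exp(-t))
print(observed)
print(round(pearson_statistic(observed, expected), 2))
```

In the mixture model, the model-implied distribution function would be the marginal response time distribution evaluated at the estimated parameters, which is why the reference distribution of the statistic requires the adjustment described by Ranger and Kuhn (2014).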
Table 3. Empirical Type I Error Rate (Condition LCA 2) and Power (Condition LCA 3) for Two Tests of Model Fit in Samples of 1,000 and 5,000 Subjects and Different Nominal Type I Error Rates α.
Note. The tests evaluate the adequacy of a two-class model either itemwise (Test: RT) or globally (Test: RT–G). In Condition LCA 2, the data were based on two latent classes and in Condition LCA 3 on three. Results are based on 250 simulation samples. Tests for item fit are averaged over items.
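The item-wise comparison of observed and expected counts can be sketched as follows. This is a plain Pearson statistic over time slices, not the author's test: the fractional degrees of freedom reported for the real data suggest a corrected reference distribution that is not reproduced here. The Weibull CDF is a hypothetical stand-in for the distribution implied by a fitted two-class model, and the response times are made up.

```python
import bisect
import math

def slice_counts(times, thresholds):
    """Count response times per time slice; k thresholds give k + 1 slices."""
    counts = [0] * (len(thresholds) + 1)
    for t in times:
        counts[bisect.bisect_right(thresholds, t)] += 1
    return counts

def expected_counts(cdf, thresholds, n):
    """Expected counts per slice under a model-implied CDF."""
    cum = [0.0] + [cdf(c) for c in thresholds] + [1.0]
    return [n * (hi - lo) for lo, hi in zip(cum, cum[1:])]

def pearson_statistic(observed, expected):
    """Sum of squared deviations, each scaled by its expected count."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Eight equally spaced thresholds from 0.4 to 1.1.
thresholds = [0.4 + 0.1 * i for i in range(8)]

# Hypothetical stand-in for the model-implied response time distribution.
def weibull_cdf(t, shape=2.0, scale=0.8):
    return 1.0 - math.exp(-((t / scale) ** shape))

times = [0.35, 0.52, 0.61, 0.66, 0.74, 0.79, 0.88, 0.95, 1.02, 1.30]  # made up
observed = slice_counts(times, thresholds)
expected = expected_counts(weibull_cdf, thresholds, len(times))
print(observed, round(pearson_statistic(observed, expected), 2))
```

A global version would sum such contributions over all items; how the dependence between items within a subject is handled there is left open in this sketch.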
As can be seen in Table 3, the tests adhere closely to the nominal Type I error rate. Note that for
Empirical Example
To demonstrate the usefulness of the model a real data set was analyzed. The data came from a mental calculation test given to second to fourth graders in a large-scale educational study. Here, only the first eight items that were given to all pupils were used. These items contained addition and subtraction problems that varied with respect to their difficulty. The items were constructed by systematically varying the number of digits (one vs. two) and the occurrence of a decade crossing (no decade crossing vs. decade crossing). The data set contained the responses and response times from 980 pupils.
The data were analyzed as described before. The response times were censored at roughly the 10% and 90% quantiles (q10 and q90) of each item. The time continuum between the censoring boundaries was partitioned into three segments by two knots. Then the proposed model was fitted to the data. Several starting values were used and the best solution was chosen. Versions of the model with one and three latent classes were estimated as well. The AIC indices were AIC1 = 18153.99, AIC2 = 1447.61, and AIC3 = 17473.15 for the one-, two-, and three-class versions. This result suggested the version with two latent classes as the best model. The results of the fit tests, however, called the adequacy of the model into question. The overall test of model fit based on eight equally spaced cut points indicated a significant deviation between the observed and the implied response time distributions (χ² = 40.70, df = 26.77, p = .04). Inspecting the single items revealed that only the first item exhibited misfit (χ² = 18.15, df = 4.01, p < .01). Therefore, this item was removed and the data analysis was repeated. Again, the version of the model with two latent classes was best (AIC1 = 17002.04, AIC2 = 16418.34, AIC3 = 16479.41). With the first item removed, the overall fit of the model was excellent (χ² = 22.75, df = 22.80, p = .46).
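The reported p values can be checked by evaluating the upper tail of a chi-square distribution at real-valued degrees of freedom. The sketch below assumes that the reference distribution is a chi-square with the reported fractional df, which this section suggests but does not state. It uses only the standard library; the helper names are hypothetical.

```python
import math

def regularized_upper_gamma(a, x, eps=1e-12, max_iter=1000):
    """Q(a, x) = Gamma(a, x) / Gamma(a) for a > 0 and x >= 0."""
    if x <= 0.0:
        return 1.0
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of the lower function P(a, x); Q = 1 - P.
        term = total = 1.0 / a
        for n in range(1, max_iter):
            term *= x / (a + n)
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefactor)

def chi2_p_value(statistic, df):
    """Upper-tail probability of a chi-square variate with real-valued df."""
    return regularized_upper_gamma(df / 2.0, statistic / 2.0)

# Test statistics and degrees of freedom as reported in the text.
for stat, df in [(40.70, 26.77), (18.15, 4.01), (22.75, 22.80)]:
    print(f"chi2 = {stat}, df = {df}: p = {chi2_p_value(stat, df):.4f}")
```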
The two latent classes had mixing proportions of
This interpretation was also supported by the estimates of the baseline hazard function. The shape of the functions differed considerably between the two latent classes. This points to qualitative differences in the solution process. Although the baseline hazard function cannot be estimated very precisely in small samples and should be interpreted with care, some patterns seemed to emerge. In the second latent class there was a sharp peak around 1.3 seconds that appeared independently of the item difficulty. The hazard functions of the first latent class did not exhibit this sharp increase. Given that the second latent class consists of the older subjects, one might conjecture that these subjects have switched from controlled to automated processing. Note that all items were rather easy and should be prone to automaticity.
Discussion
Latent trait models in general and item response models in particular are popular tools in psychological assessment. These models are based on the assumption that all individual differences are due to different levels of an underlying trait and therefore require that all individuals respond in the same way. However, sometimes there is also heterogeneity with respect to the response process. When this heterogeneity is ignored and a standard latent trait model is used as a measurement model for the underlying latent trait, the conclusions about the test takers may be wrong. The identification of subgroups that respond differently is important not only from a measurement perspective but also from a theoretical one and could improve test development and theory induction. Therefore, a model for the detection of different subgroups would be of practical value, especially if the model provides some insight into how the groups differ. One candidate for such a model is the mixture proportional hazards model with random effects proposed in this article.
The model combines latent class analysis and the proportional hazards model and inherits the strengths of both approaches. The latent class analysis allows for subgroups differing with respect to the response process. The proportional hazards model provides a model for the information accumulation process in each subgroup. This allows for a detailed comparison of the different response processes via the subgroup-specific baseline hazard function. Although the model is for response times only, it could easily be extended to a joint model for responses and response times. One would simply have to include a mixture item response model via the hierarchical framework of van der Linden (2007).
The model can be calibrated with marginal maximum likelihood estimation. The performance of the estimator is good, but the precise recovery of the item parameters requires a large sample. Large sample sizes are also needed for valid confidence intervals and a good performance of the proposed approaches to model inference. The correct number of latent classes can be recovered well when the number is less than three. Three latent classes are hardly ever detected, at least under the simulation conditions. This is partly due to the difficulty of fitting the three-class version of the model. Problems such as multiple local maxima of the marginal likelihood function are aggravated here. Furthermore, the model for two latent classes already provides a good approximation to the response time distribution. However, as the latent classes were not well separated, the simulation conditions were rather challenging. With a clearer separation of the latent classes, for example, when individuals either guess rapidly or answer conscientiously, the findings might be different.
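The usual remedy for several local optima is to start the optimizer from multiple starting values and keep the best solution. The following sketch illustrates the idea on a toy one-dimensional objective with two local maxima; the objective merely stands in for a marginal log-likelihood, and the plain gradient ascent is a hypothetical placeholder for the actual estimation routine.

```python
def hill_climb(gradient, start, step=0.01, n_iter=5000):
    """Plain gradient ascent; which local maximum it reaches depends on start."""
    x = start
    for _ in range(n_iter):
        x += step * gradient(x)
    return x

def multi_start(objective, gradient, starts):
    """Run the local search from every start and keep the best solution."""
    solutions = [hill_climb(gradient, s) for s in starts]
    return max(solutions, key=objective), solutions

# Toy objective with two local maxima; it stands in for a log-likelihood.
def toy_objective(x):
    return -(x * x - 1.0) ** 2 + 0.3 * x

def toy_gradient(x):
    return -4.0 * x * (x * x - 1.0) + 0.3

best, all_solutions = multi_start(toy_objective, toy_gradient,
                                  starts=[-1.5, -0.5, 0.5, 1.5])
print([round(s, 3) for s in all_solutions], "best:", round(best, 3))
```

Starts on the left end in the inferior local maximum, so a single run can be misleading; with more classes the number of such optima typically grows and more starts are needed.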
The model was used for the analysis of an empirical data set. This analysis was intended as a demonstration of a prototypical application of the model. In this application, pupils could be classified according to whether mathematical operations had already become automatized or not. The model has further potential applications, however. It can be used to detect rapid guessing or socially desirable responding. It could be used for the identification of strong attitudes that are related more closely to behavior. The detection of differential item functioning is also a potential field of application. The usefulness of the model stems from two sources. First, the model is flexible and avoids overly strong assumptions about the response time distribution. Second, it provides a substantive interpretation in psychological terms due to its relation to the process of information acquisition. Therefore, the model is not only a measurement model but also bears a resemblance to cognitive process models from experimental psychology. This is a new approach that might become popular in the future (van der Maas, Molenaar, Maris, Kievit, & Borsboom, 2011).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
