Abstract
Background:
Response shift (RS) can threaten the internal validity of pre–post designs. As RS may indicate a redefinition of the target construct, its occurrence in training evaluation is rather likely. The most common approach to deal with RS is to implement a retrospective pretest (then-test) instead of the traditional pre-test. In health psychology, an adapted measurement invariance approach (MIad) was developed as an alternative technique to study RS. Results produced by identifying RS with the two approaches were rarely studied simultaneously or within an experimental framework.
Objectives:
To study RS in two different treatment conditions and compare results produced by both techniques in identifying various types of RS. We further studied validity aspects of the then-test.
Research Design:
We evaluated RS by applying the then-test procedure (TP) and the measurement invariance approach (MIad) within an experimental design: Participants attended either a short-term or a long-term classroom management training program.
Subjects:
Participants were 146 student teachers in their first year of master’s study.
Measures:
Pre (before training), post, and then self-ratings (after training) on classroom management knowledge were administered.
Results:
Results indicated that the two approaches do not yield the same results. The MIad identified more RS, including group-specific RS, whereas the TP identified fewer RS and only little evidence of group-specific RS.
Conclusions:
Further research is needed to study the usability and validity of the respective approaches. In particular, the usability of the then-test seems to be challenged.
Introduction
A principal issue in evaluating student teachers’ learning is using valid methods to measure improvement (Moore & Tananis, 2009; Pratt, McGuigan, & Katzev, 2000). However, a direct and proximal assessment is not always feasible (Desimone, 2009). Instead, self-reports are often employed, which provide the basis for an indirect measurement of change via pre–post designs (Lucas & Baird, 2006). Although pre–post designs relying on self-report measures are well established in the area of training evaluation, there are two main drawbacks that jeopardize their internal validity: pretesting effects (e.g., priming effects, where pretesting acts as a treatment in its own right) and response shift bias (Aiken & West, 1990; Lam & Bengo, 2003). The phenomenon of response shift (RS) and empirical methods to identify its occurrence will be the focus of this study.
RS “refers to a change in the meaning of one’s self-evaluation of a target construct” (Sprangers & Schwartz, 1999, p. 1508). For example, RS is present if participants of a stress management training evaluate “stress” differently before versus after the training due to an altered appraisal of stress. However, for pre- and post-test scores to be comparable, a common metric must exist between the two sets of scores (Cronbach & Furby, 1970). Consequently, if RS is present, pre–post difference scores are invalid. Yet, disregarding RS may lead to wrong conclusions: an arbitrary rejection of an effective training (false-negative results, e.g., if RS masks underlying effects) or a long-term implementation of a less effective training (false-positive results, e.g., if RS led to overestimation of effects; Howard, 1980; Howard & Dailey, 1979). Accordingly, identification and prevention of RS are crucial for both researchers and evaluators.
Sprangers and Schwartz (1999, p. 1508) further differentiated the concept of RS and identified three types of RS: “(1) recalibration, that is, a change in the respondent’s internal standards of measurement; (2) reprioritization, that is, a change in the respondent’s values (i.e., reevaluation of the importance of [indicators] constituting the target construct); or (3) reconceptualization, that is, a redefinition of the target construct” (as cited in Oort, Visser, & Sprangers, 2009, p. 1127). Consequently, RS may not only represent biased measurement (due to history, maturation, or regression, cf. Brossart, Clay, & Willson, 2002) but may also be an outcome of the training itself: that is, treatment-induced RS (Brossart et al., 2002; Norman, 2003; Rosenman, Tennekoon, & Hill, 2011). Treatment-induced RS is of particular concern in evaluating educational training programs that explicitly aim at recalibration, reprioritization, and reconceptualization/redefinition of dependent variables (Gibbons, 1999; Nolte, Elsworth, Sinclair, & Osborne, 2009). A recalibration RS may be due to changed awareness or social comparison within the training; participants interpret items and their related responses differently postintervention (Gibbons, 1999; Hill & Betz, 2005). A reprioritization RS may be due to changed priority of certain facets of a construct and points to changes in the connotation of single indicators. Finally, a reconceptualization RS is most likely to occur if the target construct is abstract or unfamiliar preintervention or if the training program explicitly intends to elucidate concepts. By definition, treatment-induced RS cannot occur for untrained controls and may vary in different treatments (Aiken & West, 1990).
Numerous efforts have been undertaken in order to address the issue of RS bias (for an overview, cf. Barclay-Goddard, Epstein, & Mayo, 2009; Schwartz & Sprangers, 1999). The majority of approaches studying RS either belong to the group of design approaches (i.e., implementing additional measures) or to the group of statistical approaches (i.e., comparing separate components of pre- and postdata statistically). Design approaches are the most commonly used method for measuring RS within the field of educational research (Hill & Betz, 2005; Pratt et al., 2000; Sibthorp, Paisley, Gookin, & Ward, 2007), in particular the use of a retrospective pre-test (or then-test): In addition to—or instead of—the conventional pre-test, subjects rate their initial status at the end of the training program using the then-test. Differences between pre- and then-test are interpreted as RS bias (Hill & Betz, 2005). In organizational and recreation research, statistical approaches have evolved which investigate RS by means of factor analyses, growth curve modeling, structural equation modeling, or tests of measurement invariance (for an overview, see Schwartz & Sprangers, 1999).
The aim of this study was to address three different issues. First, we intended to adopt a statistical approach to study RS for an educational training evaluation context. Second, we wanted to further investigate validity aspects of the then-test. Third, we sought to study the sensitivity of the two approaches to identify differential effects (i.e., treatment-specific RS) within an experimental design.
Response Shift and the Then-Test Procedure
The then-test was developed to both identify and deal with RS (Hoogstraten, 1985). In particular, the then-test only aimed at identifying the recalibration RS via manifest comparison of pre- and then-test scores (Cantrell, 2003; Schwartz, Sprangers, Carey, & Reed, 2004); this comparison is referred to as the then-test procedure (TP) in the following. If the TP identifies RS, the then–post comparison serves as an alternative, “quasi-indirect” measurement of change (e.g., Cantrell, 2003; Drennan & Hyde, 2008; Holden, Barker, Rosenberg, & Onghena, 2008; Moore & Tananis, 2009; Pratt et al., 2000; Sibthorp et al., 2007). As then- and post-test are completed on the same occasion, it is assumed that subjects refer to the same internal metric and understanding of the target construct (Hill & Betz, 2005).
Hoogstraten (1982) introduced the TP to the educational training context. He argued for its validity because he found that both a task performance test and the then–post comparison showed evidence for subjects’ improvement, whereas the traditional pre–post comparison did not. Furthermore, since pre–then difference scores were positive, he concluded that subjects initially overestimated their performance. Numerous educational and medical studies replicated these early findings of overestimation at pretesting (e.g., Cantrell, 2003; Drennan & Hyde, 2008; Hill & Betz, 2005; Holden et al., 2008; Hoogstraten, 1985; Moore & Tananis, 2009; Pratt et al., 2000), finding larger effects for the then–post as opposed to the pre–post comparison (e.g., Cantrell, 2003; Pratt et al., 2000, Sibthorp et al., 2007). Moreover, then-tests tend to be better predictors for objective measures and other external validation criteria than traditional pre-tests (e.g., interview data or self-rated change; Ahmed, Mayo, Corbiere, et al., 2005; Cantrell, 2003; Hoogstraten, 1985; Pratt et al., 2000; Sibthorp et al., 2007), and they correlate less with social desirability measures (Aiken & West, 1990; Hill & Betz, 2005).
However, it is widely discussed that the TP may increase rather than reduce bias (Hill & Betz, 2005; Schwartz & Sprangers, 2010). Implicit theories of change might operate when pre- and then-test scores differ from each other: Subjects may understate their initial status in their retrospective ratings in order to reduce cognitive dissonance (e.g., effort justification bias; Aiken & West, 1990; Sibthorp et al., 2007). Also, then-tests might be confounded with recall bias, because subjects differ in their ability to remember prior status (Henry, Moffitt, Caspi, Langley, & Silva, 1994; Schwartz et al., 2004; Schwartz & Sprangers, 2010). Accordingly, a difference in pre–then scores may indicate mere invalidity of then-scores (e.g., due to poor memory) instead of an RS bias. Hence, in order to study the validity of the then-test, additional measures are necessary to find out whether pre–then differences are due to overestimation at pretesting, deliberate adjustment of then-scores, recall bias, or training participation (Norman, 2003). Furthermore, Nolte, Elsworth, Sinclair, and Osborne (2009) questioned whether then- and post-test share a common metric per se (Schwartz & Sprangers, 2010). They evaluated a chronic disease self-management course, where latent then–post comparison revealed even fewer invariant items than pre–post comparison. Consequently, although the TP has been established as a common method to deal with RS, there are serious concerns about its psychometric properties, in particular about its ability to identify RS.
Response Shift and the Measurement of Invariance
Various statistical techniques have evolved in order to adequately detect RS and its various types (e.g., Ahmed, Mayo, Corbiere, et al., 2005; Brossart et al., 2002; Gandhi, Ried, Huang, Kimberlin, & Kauf, 2013; Oort, 2005; Schmitt, 1982; cf. Schwartz & Sprangers, 1999). One of the most recent developments was to adapt the longitudinal measurement of invariance (MI) approach (Meredith, 1993, merging structural equation modeling and item response theory) for RS evaluation; this latent approach is referred to as MIad in the following. The main advantage of the MIad is that it analyzes testable assumptions and explicitly allows for the statistical comparison of separate components of the measurement model over time (i.e., number of common factors, pattern of factor loadings, factor loadings, item intercepts, and item residual variance; Visser, Oort, & Sprangers, 2005; Schwartz et al., 2004). To date, Oort’s (2005) MIad procedure has become most prevalent, since it allows for the testing of all different types of RS and maximizes statistical power (Gandhi et al., 2013; see also Table 1). Nolte et al. (2009) further specified Oort’s approach: They assigned the terminology of different types of RS to corresponding levels of measurement invariance, which had not been made explicit in Oort’s original RS model. Identification of RS and MI testing are combined as follows (starting from the lowest level of invariance):
Three-Step Procedure for Response Shift Detection, Invariance Hypothesis, and Attribution to Level of Measurement Invariance and Type of Response Shift.
Source. Adapted from Oort (2005, p. 593) and Nolte et al. (2009, p. 1176).
Note. MI = measurement of invariance; RS = response shift. aThe interpretation in terms of RS only applies to pre- versus post-test. For then–post comparison, the MI terminology is used.
Configural invariance (CI): The overall number of common factors and patterns of the common factor loadings are invariant across occasions. Differences between patterns indicate that the meaning of the construct has changed over time. If CI is not met, a reconceptualization RS has occurred.
Weak (also metric) invariance (WI): In addition to CI, factor loadings are invariant across occasions. Differences between item factor loadings indicate that a variable has become more indicative or less indicative of the latent construct. If WI is not met, a reprioritization RS has occurred.
Strong invariance (SI): In addition to WI, item intercepts are invariant across occasions. Differences between item intercepts indicate a recalibration in the mean structure. If SI is not met, a recalibration RS, more specifically, a uniform recalibration RS (Oort, 2005) has occurred.
Strict invariance (STI): In addition to SI, residual variances of items are invariant across occasions. Differences between residual variances indicate a recalibration to the covariance structure (item uniqueness) of the observed variables that cannot be attributed to a change in the common factor variances. If STI is not met, a nonuniform recalibration RS (Oort, 2005) has occurred.
Since at least SI needs to be met for an unbiased comparison of factor means over time (Brown, 2006; Meredith, 1993), only nonuniform recalibration RS still allows for a valid pre–post comparison. Accordingly, reconceptualization/redefinition is interpreted in the following as the most severe RS and nonuniform recalibration as the least severe RS.
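The correspondence between levels of invariance and types of RS described above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `Invariance`, `response_shift_type`, and `factor_means_comparable` are hypothetical, and the numeric coding (0 = no invariance up to 4 = strict invariance) is an assumption chosen for convenience. The sketch reports only the RS type tied to the first level of invariance that is not met.

```python
from enum import IntEnum
from typing import Optional


class Invariance(IntEnum):
    """Highest level of invariance met across occasions (assumed coding)."""
    NONE = 0        # configural invariance not met
    CONFIGURAL = 1  # same number of factors and loading pattern
    WEAK = 2        # plus equal factor loadings (metric)
    STRONG = 3      # plus equal item intercepts
    STRICT = 4      # plus equal residual variances


# RS type implied when the next-higher level is NOT met.
RS_BY_HIGHEST_LEVEL = {
    Invariance.NONE: "reconceptualization",
    Invariance.CONFIGURAL: "reprioritization",
    Invariance.WEAK: "uniform recalibration",
    Invariance.STRONG: "nonuniform recalibration",
    Invariance.STRICT: None,  # no RS detected
}


def response_shift_type(highest_level: int) -> Optional[str]:
    """Return the RS type implied by the highest invariance level met."""
    return RS_BY_HIGHEST_LEVEL[Invariance(highest_level)]


def factor_means_comparable(highest_level: int) -> bool:
    """Factor means can be compared over time only from strong invariance on."""
    return Invariance(highest_level) >= Invariance.STRONG


if __name__ == "__main__":
    for level in Invariance:
        print(level.name, response_shift_type(level), factor_means_comparable(level))
```

Under this coding, a scale that reaches only weak invariance would be flagged for uniform recalibration, and its pre–post factor means would not be treated as comparable.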
In spite of the recognized strengths of the MIad, there are certain drawbacks that may compromise its feasibility, reliability, and validity, for example, the requirement of relatively large sample sizes, or the question of how to handle reconceptualization, that is, when no level of invariance is achieved (Ahmed, Mayo, Corbiere, et al., 2005; Schwartz & Sprangers, 1999). As opposed to the commonly applied TP, empirical validity studies examining the MIad to determine RS are quite rare and predominantly refer to the questionnaire SF-36 (assessing health-related quality of life [HRQOL], cf. Schwartz et al., 2006). Validity checks were administered via comparison to TP results, interviews, and experimental research designs. The simultaneous investigation of MIad and TP revealed inconsistent findings: While Visser, Oort, and Sprangers (2005) found convergence in RS detected by both approaches for subjects with cancer, others reported divergence because they found scale recalibration with the TP and individual interviews but no RS conducting the MIad for poststroke subjects (Ahmed, Mayo, Wood-Dauphinee, Hanley, & Cohen, 2005). A single study explored RS with the MIad within a quasi-experimental framework: Ahmed, Mayo, Corbiere, et al. (2005) compared people who suffered from a stroke to a control group without stroke and found that neither group experienced RS, although an occurrence of RS in the stroke group was expected. Thus, although the MIad appears to be convincing in theory, empirical support for its validity is sparse.
Objectives and Research Hypotheses
Pre–post designs are prone to RS bias where the frame of reference, priorities of facets, or the target concept are not intra- and/or interindividually stable across time. Although the phenomenon and proper handling of RS were studied intensively, none of the developed approaches has become the “gold standard.” The two most common approaches are a latent framework (i.e., MIad) and a manifest framework (i.e., TP). Although testing of measurement invariance is common in empirical research, research into RS using the MIad was developed in the area of HRQOL and has not expanded to other fields such as educational and organizational research, where RS research initially started. However, the application of the MIad could strengthen the in-depth understanding of RS in the educational training context because it was developed to identify the different types of RS distinctly. Hence, our first objective was to implement the MIad approach within the field of educational training evaluation for the analyses of RS. The TP, in contrast, is well established, but its validity remains questioned. Accordingly, our second objective was to critically examine the then-test as a measure for both the identification of recalibration RS (validity of pre–then comparison to identify recalibration RS) and the quasi-indirect measure of change (validity of then–post comparison to measure change).
A precondition for studying identification methods for RS is to have assumptions about its occurrence. With regard to RS definition and the possibility of treatment-induced RS, our third objective was to further study the validity of the respective approaches via experimental variation in RS (experimental framework for studying RS). Recognizing that a treatment may affect RS (Oort et al., 2009), occurrence and type of RS should be distinct in different treatment conditions depending on treatment intensity (in terms of length and teaching methods applied). To date, RS has predominantly been studied within single-group designs (Pratt et al., 2000) or medical research where the treatment conditions differed in terms of surgery or medication (for an overview, see Schwartz et al., 2006). Therefore, systematic research on RS within training evaluation is rare. Based on these research objectives, we deduced four hypotheses that are presented in inverse order.
Experimental framework for studying RS
Group-based educational training programs where participants gain new insights into a target construct, discuss it, and also interact with each other raise the probability of RS between occasions (Gibbons, 1999; Nolte et al., 2009). Accordingly, we expect RS to occur after participation in an educational training program. Furthermore, we expect RS to be more severe for a group of individuals exposed to a long-term and more interactive training (group 2 [G2]) as compared to a group exposed to a short-term and less interactive training (group 1 [G1]). We hypothesize:
Validity of pre–then comparison to identify recalibration RS
The then-test was originally developed to measure recalibration RS, assuming that reprioritization and reconceptualization do not occur (Visser et al., 2005; Schwartz et al., 2004). If this is true, the TP and the MIad should detect the same RS: Wherever the TP identifies a significant pre–then difference, the MIad should identify either a uniform or a nonuniform recalibration RS. However, we question whether the TP is able to differentiate between recalibration, reprioritization, and reconceptualization RS. Rather, the manifest score building of pre- and then-scores is likely to confound those different types of RS. Hence, we hypothesize:
Validity of then–post comparison to measure change
The implicit assumption that subjects refer to the same set of internal standards for then- and post-test has yet to be tested and confirmed (Schwartz & Sprangers, 2010). To our knowledge, there is only Nolte et al.’s (2009) study, which examined levels of invariance for then–post comparison, and they found that then–post comparison did not reveal invariant measures. However, as Nolte et al. presented each then- and post item simultaneously, it may have reinforced, for example, effort justification bias. Thus, we hypothesize:
Method
Study Design and Setting
We applied an experimental design where eight study groups were randomly assigned to either the short-term intervention group (G1) or the long-term intervention group (G2). The training programs were implemented in a university course that aimed at improving knowledge on classroom management among master’s degree student teachers in their first year. Since classroom management is not a mandatory part of teacher training curricula, participants could have had some level of elementary knowledge on classroom management. However, as student teachers had obtained their bachelor’s degree at different universities, that level might have varied significantly among participants.
Participants provided pretest data at the beginning of the course (T1). Postmeasures were administered consecutively (post-test [T2], and then-test, i.e., the retrospective pre-test [T1r]). In total, 176 student teachers participated in the course. Only those subjects who completed all three questionnaires (pre-, post-, and then-test) were included in the analyses. Hence, the original sample was reduced to N = 146 (83% of the initial sample; 24 students did not attend the session where pretesting was administered, and 6 students did not attend the session where then- and post-testing was administered). As students without pretest also did not provide data on sample characteristics, comparison of attendees and nonattendees was not possible. The final sample consisted of N G1 = 81 and N G2 = 65 student teachers. Groups were comparable with regard to their age (M = 25 years, SD = 3.35), their teaching experience (median [Md] = 12 hr of own teaching), their motivations for vocational choice (Thiel & Blüthmann, 2009), and their teacher self-efficacy (Schwarzer & Hallum, 2008). G2 student teachers were more satisfied with their studies at pretesting (M G1 = 6.40, SD = 1.35, and M G2 = 6.67, SD = 1.20, on an 8-point scale, 8 indicating high satisfaction; Thiel & Blüthmann, 2009) and were less often female (G1 = 79%, G2 = 55%).
Treatment Conditions
The short-term training program in G1 consisted of two blocks (1 and 2), and the long-term training in G2 included an additional, interactive third block (1, 2, and 3). Block 1 covered the basics of classroom research (three sessions of 90 min each), Block 2 the basics of classroom management (two sessions of 90 min each), and Block 3 the practice of classroom management (three sessions of 90 min each).
Only Block 2 and Block 3 focused on improving classroom management (CM). In Block 2, students were introduced to the following aspects of CM strategies: establishing and maintaining expected behavior, directing learning activities, and solving conflicts (i.e., dealing with disruptive student behavior, using rules, providing structured and clear tasks, maintaining activity flow via procedures, group mobilization, and time management, building a working alliance, and handling conflicts among students). This was realized via lecture accompanied by a video presentation and self-reflection exercises. In Block 3, student teachers first analyzed staged video cases focusing on disruptive student behaviors in small groups in order to actively generate and discuss alternative strategies (two sessions) and then tried out new CM strategies in simulations using the microteaching method (one session, Grossman, 2005; Piwowar, Thiel, & Ophardt, 2013). The main focus of these three sessions was to meet disruptive student behavior (dealing with disruptions and referring to rules) while keeping the learning process running and sheltering the working alliance between students and the teacher. Both video analyses and microteaching required active participation of each student and included intensive collaborative discussion with peers.
The short-term intervention group G1 completed the post- and then-ratings after attending Block 1 and Block 2 (= five sessions) without participation in the interactive sessions in Block 3. The long-term intervention group G2 completed the post- and then-ratings after completing the whole training (Blocks 1, 2, and 3 = eight sessions).
Material
Knowledge on classroom management was assessed using the self-rating Selbsteingeschätztes Wissen im Klassenmanagement, self-rated knowledge on classroom management (SEWIK; Piwowar et al., 2013; Thiel, Ophardt, & Piwowar, 2013). The SEWIK constitutes a multidimensional assessment of CM knowledge containing the following dimensions: (a) clarity of program of action (CL), (b) conflicts among students (CO), (c) dealing with disruption (DD), (d) group mobilization (GM), (e) procedures (PR), (f) rules (RU), (g) time management (TM), and (h) working alliance (WA). Student teachers were instructed to refer to their theoretical knowledge only, regardless of whether they would succeed in a real classroom situation. Student teachers rated statements such as “I am familiar with various strategies for involving as many students as possible actively in the classroom” (GM) and “I am familiar with strategies for dealing with severe problem behavior” (DD) on an 8-point scale ranging from 1 (strongly disagree) to 8 (strongly agree). Each scale consisted of 3 items (24 items in total). Previous studies indicated SEWIK’s good predictive validity because it showed positive relationships between knowledge and performance as well as between knowledge and teaching experience: Student teachers had the lowest, preservice teachers had moderate, and in-service teachers had the highest knowledge on dimensions of classroom management (Thiel et al., 2013).
Data Analysis
In order to evaluate RS based on the manifest then-test framework TP, t-tests of individual difference scores (ΔT1 − T1r) were performed separately for both G1 and G2 (H0 = mean difference scores are zero). Difference scores provided an estimate of the direction and the magnitude of the RS effect. Positive difference scores indicate overestimation, and negative difference scores indicate underestimation (Schwartz & Sprangers, 2010). Effect size values of d = 0.2, 0.5, and 0.8 were considered “small,” “medium,” and “large,” respectively (Cohen, 1988). They were calculated by dividing means by the pooled standard deviation of change scores of the two groups.
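As a rough illustration of the manifest analysis described above, the following Python sketch computes pre–then difference scores (T1 − T1r), the within-group t statistic against a mean of zero, and an effect size d. The function names and the example ratings are hypothetical and not taken from the study data. The sketch further assumes the conventional pooled standard deviation, weighted by degrees of freedom, as one plausible reading of "pooled standard deviation of change scores of the two groups".

```python
import math
from statistics import mean, stdev


def pre_then_test(pre, then):
    """One-sample t-test of difference scores (pre minus then) against zero."""
    diffs = [p - t for p, t in zip(pre, then)]
    n = len(diffs)
    m = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    t_value = m / (sd / math.sqrt(n))
    return {"mean_diff": m, "sd_diff": sd, "t": t_value, "df": n - 1}


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled SD of two groups' change scores (assumed conventional formula)."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def effect_size(mean_diff, sd_pooled):
    """Effect size d: mean difference divided by the pooled SD."""
    return mean_diff / sd_pooled


if __name__ == "__main__":
    # Hypothetical ratings on the 8-point scale for four participants.
    pre = [4, 5, 3, 6]
    then = [5, 5, 5, 7]
    result = pre_then_test(pre, then)
    direction = "overestimation" if result["mean_diff"] > 0 else "underestimation"
    print(result, "-> initial", direction)
```

A positive mean difference would be read as initial overestimation and a negative one as initial underestimation, following the sign convention stated above.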
In order to evaluate RS based on the latent measurement invariance framework MIad, we compared pre- and post-test scores (T1 vs. T2) following Oort’s (2005) three-step, within-group procedure (separately for both G1 and G2, cf. Nolte et al., 2009). Oort has developed a fourth step to further identify latent change. As this was not the focus of the present article, we only ran through steps one to three:
Establishing a measurement model (configural invariance of occasions): The measurement model served as baseline model with identical factor patterns across measurement occasions and no imposed constraints. This step confirmed whether the hypothesized measurement model fits the data. If the fit was unsatisfactory, the analyses ended at this point, concluding that a reconceptualization of the construct occurred. If configural invariance was confirmed, we proceeded to Step 2.
Overall test of RS (strict invariance of occasions): In Step 2, all model parameter constraints across occasions in the sense of strict measurement invariance were imposed to test for overall RS. In case of an unsatisfactory fit, at least one type of RS had occurred (i.e., one model parameter was not invariant) and we proceeded to Step 3. If the fit was satisfactory and the fully constrained model was not significantly worse than the baseline model, we stopped the analysis without any occurrence of RS.
Response shift detection (test of level of measurement invariance): The third step started with the model of Step 2 and consisted of testing for specific types of RS: First, parameter constraints for factor loadings (reprioritization), then intercepts (uniform recalibration), and then residuals (nonuniform recalibration) were released in the hierarchy of measurement invariance (Gandhi et al., 2013; Nolte et al., 2009). The evaluation of fit indices, modification indices, and standardized residuals guided this step (Muthén & Muthén, 1998–2010). Each of the modified models was then compared to the baseline model using χ2 difference testing (Brown, 2006) in order to determine the final level of invariance.
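The comparison of a constrained model with the baseline model in Steps 2 and 3 rests on the χ2 difference test for nested models. The sketch below shows the arithmetic in plain Python. It is an illustration only: the function names and the fit values in the example are hypothetical, and the upper-tail probability is computed with a closed-form series for integer degrees of freedom rather than with the routines of any particular SEM package.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variate with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_k (x/2)^(k - 1/2) / Gamma(k + 1/2)
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def chi2_difference_test(chi2_constrained, df_constrained, chi2_baseline, df_baseline):
    """Chi-square difference test of a constrained model against its baseline."""
    delta_chi2 = chi2_constrained - chi2_baseline
    delta_df = df_constrained - df_baseline
    if delta_df <= 0:
        raise ValueError("the constrained model must have more df than the baseline")
    return delta_chi2, delta_df, chi2_sf(delta_chi2, delta_df)


if __name__ == "__main__":
    # Hypothetical fit values for a strict-invariance model vs. the baseline.
    d_chi2, d_df, p = chi2_difference_test(25.0, 13, 10.0, 8)
    verdict = "significantly worse" if p < 0.05 else "not significantly worse"
    print(f"delta chi2 = {d_chi2:.2f}, delta df = {d_df}, p = {p:.4f} ({verdict})")
```

A significant difference would indicate that the imposed constraints worsen the fit, so at least one of them would have to be released before the model is accepted.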
This procedure is summarized in Table 1; it also served to evaluate the invariance of then- and posttest scores (T1r vs. T2). However, while the statistical procedure applied to both pre–post and then–post comparison, the assignment of RS to the level of measurement invariance only applied to the former (pre–post comparison). For then–post comparison, the measurement invariance terminology is used. Because the sample size was too small to include the 24 SEWIK items in a single model, analyses were conducted at the subscale level (with 3 items per scale, cf. Tanaka, 1987). We evaluated whether assumptions for the latent analysis (e.g., multivariate normality, multicollinearity, outliers) were met. Covariances between the errors of scales were allowed because it improved the longitudinal model fit (Sörbom, 1975). Model evaluation was based on goodness-of-fit indices (Brown, 2006), that is, χ2 (p > .05 indicates model fit), root mean square error of approximation (RMSEA; ≤ .08 reasonable fit, ≤ .05 close fit), standardized root mean square residual (SRMR; ≤ .10 reasonable fit, ≤ .05 close fit; Vandenberg & Lance, 2000), and comparative fit index (CFI; ≥ .90 reasonable fit, ≥ .95 close fit). Chi-square difference testing was used for model comparison, as models were nested. The SPSS 19 software served to carry out the manifest analyses (t-tests); missing values were replaced by scale scores. The Mplus software (Version 6.0) served to carry out the latent analyses for the MIad; full information maximum likelihood estimation was used to handle missing values (default in Mplus, Muthén & Muthén, 1998–2010). Missing values were rare: SEWIK items for T1, T2, and T1r had missings of 0.26%, 0.03%, and 0.15%, respectively. For all statistical analyses, the α level was set at p < .05.
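The cutoff values listed above can be written down as a small classification helper. The sketch below is illustrative only: the function names and the labels "close", "reasonable", and "poor" are hypothetical, and each index is graded separately because the text does not specify how the indices were combined into an overall judgment.

```python
def grade(value, close, reasonable, higher_is_better=False):
    """Grade a single fit index against its 'close' and 'reasonable' cutoffs."""
    if higher_is_better:
        if value >= close:
            return "close"
        return "reasonable" if value >= reasonable else "poor"
    if value <= close:
        return "close"
    return "reasonable" if value <= reasonable else "poor"


def classify_fit(rmsea, srmr, cfi):
    """Apply the cutoffs named in the text to RMSEA, SRMR, and CFI."""
    return {
        "RMSEA": grade(rmsea, close=0.05, reasonable=0.08),
        "SRMR": grade(srmr, close=0.05, reasonable=0.10),
        "CFI": grade(cfi, close=0.95, reasonable=0.90, higher_is_better=True),
    }


if __name__ == "__main__":
    # Hypothetical fit indices for a single subscale model.
    print(classify_fit(rmsea=0.04, srmr=0.09, cfi=0.93))
```

Grading the indices one by one keeps visible which index signals misfit for a given model.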
Results
Descriptives
Table 2 shows means and standard deviations of T1, T1r, and T2, separated into groups and the assessed dimensions of knowledge on CM. In both pre-(T1) and then self-ratings (T1r), students’ mean initial knowledge on CM was moderate with the least knowledge on DD (M G1, T1 = 3.87, M G1, T1r = 4.23; M G2, T1 = 4.05, M G2, T1r = 3.48) and the highest knowledge on CL (M G1, T1 = 5.38, M G1, T1r = 5.35; M G2, T1 = 5.02, M G2, T1r = 5.03). Mean posttest (T2) scores indicated higher knowledge on all dimensions when compared to the traditional as well as the retrospective pretest scores in both groups: After the training, students reported the least knowledge on CO (M G1, T2 = 4.86, M G2, T2 = 5.18) and the highest knowledge on CL in G1 (M G1, T2 = 5.77) and DD in G2 (M G2, T2 = 6.12), respectively. Cronbach’s αs were good to excellent (Md T1 = .87, .78 < αT1 < .95; Md T2 = .87, .79 < αT2 < .95; and Md T1r = .94, .85 < αT1r < .96) and were higher in the postmeasures (post- and then-test) than in the pretest.
Means (M) and Standard Deviations (SD) for T1, T1r, and T2, Identification of Response Shift, and Level of Measurement Invariance.
Note. Means are based on 8-point scales, 1 = low declarative knowledge, 8 = high declarative knowledge. MI = measurement of invariance; T1 = pre-test; T1r = retrospective pre-test (then-test); T2 = posttest; ΔT1 − T1r = Pre–then difference scores; SEWIK = Selbsteingeschätztes Wissen im Klassenmanagement, self-rated knowledge on classroom management. Boldface indicates identified RS. aDimensions of the SEWIK questionnaire: CL = clarity of program of action, CO = conflicts among students, DD = dealing with disruptions, GM = group mobilization, PR = procedures, RU = Rules, TM = time management, WA = working alliance. bTraining conditions: G1 = short-term training group, G2 = long-term training group. cResponse shift (RS) identification: REC, n = nonuniform recalibration RS, REP = reprioritization RS, RED = redefinition/reconceptualization RS. dFinal level of measurement invariance: 0 = no invariance, 1 = configural invariance, 2 = metric invariance, 3 = strong invariance, 4 = strict invariance. Results of the preceding three-step procedure are shown in Table 3.
*p < .05.
Experimental Framework for Studying RS
RS as measured by the TP
Table 2 shows effect sizes d and the level of statistical significance of the pre–then difference scores (ΔT1 − T1r), tested via within-group t tests. The TP identified RS on three of the eight dimensions in both G1 and G2. In G1, identified RS indicates initial underestimation with regard to WA (d G1 = −.27, p < .05), DD (d G1 = −.26, p < .05), and CO (d G1 = −.21). In G2, identified RS indicates initial underestimation with regard to WA (d G2 = −.28, p < .05) and initial overestimation with regard to DD (d G2 = .33, p < .05) and GM (d G2 = .20). All effect sizes were small; difference scores were nonsignificant for CO and GM.
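The pre–then comparison described above can be illustrated with a minimal sketch. It assumes that d is the mean pre–then difference divided by the standard deviation of the difference scores, which is only one of several conventions and is not specified in the text. The function name `paired_effect` and the ratings are hypothetical and are not data from this study.

```python
import math
import statistics

def paired_effect(pre, then):
    """Return (mean difference, Cohen's d, t statistic) for pre - then scores."""
    diffs = [p - t for p, t in zip(pre, then)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the difference scores
    d = mean_diff / sd_diff            # assumption: standardize by SD of differences
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))  # within-group t statistic
    return mean_diff, d, t_stat

# Hypothetical ratings on the 8-point scale (NOT data from the study).
pre_scores = [4, 3, 5, 4, 3, 4]
then_scores = [5, 3, 6, 4, 4, 5]

mean_diff, d, t_stat = paired_effect(pre_scores, then_scores)
print(round(mean_diff, 3), round(d, 3), round(t_stat, 3))
```

With this sign convention, a negative d means that the then-ratings exceed the pre-ratings, which corresponds to initial underestimation; a positive d corresponds to initial overestimation.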
RS as measured by the MIad
Table 3 shows the results of Oort’s three-step procedure to identify levels of measurement invariance for pre–post comparison (T1 vs. T2). Model misfit was predominantly indicated by the RMSEA index. Table 2 again summarizes the identified types of RS and corresponding levels of measurement invariance.
Table 3. Three-Step Procedure for Pre–Post and Then–Post Comparison, Separated Into Groups.
Note. RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean square residual; G1 = short-term training group; G2 = long-term training group; MI = measurement invariance. Indexes that achieved a critical value are indicated in boldface. (a) Dimensions of the SEWIK questionnaire: CL = Clarity of Program of Action, CO = Conflicts among Students, DD = Dealing with Disruptions, GM = Group Mobilization, PR = Procedures, RU = Rules, TM = Time Management, WA = Working Alliance. (b) Steps for RS detection (also see Table 1): 1 = testing configural invariance, 2 = testing strict invariance, 3 = adapted model with highest level of invariance (either SI = strong invariance is met or WI = weak invariance is met) and item indices where restrictions were released. (c) χ2 Diff = comparison to the measurement model in Step 1. (d) Final level of measurement invariance after running the three-step procedure. Final level of measurement invariance is again summarized in Table 2.
The MIad identified RS on every dimension in both G1 and G2. The types of RS are as follows (listed in order of severity, starting from the least severe RS, which corresponds to the highest level of measurement invariance): In G1, four nonuniform recalibration RS occurred (with regard to Clarity, Dealing with Disruptions, Group Mobilization, and Rules), two reprioritization RS (Conflicts and Time Management), and two redefinition RS (Procedures and Working Alliance). In G2, two nonuniform recalibration RS occurred (Conflicts and Group Mobilization), and redefinition RS occurred on all six remaining dimensions. Thus, uniform recalibration RS was not present in either group. RS was more severe in G2 than in G1 on four of the eight dimensions, equally severe on three dimensions, and less severe on one dimension.
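The group comparison summarized above can be retraced with a short sketch. The RS types per dimension are taken from the preceding paragraph; the numeric severity ranks are an illustrative assumption that only encodes the stated ordering (nonuniform recalibration as least severe, redefinition as most severe), and the function name `compare_severity` is hypothetical.

```python
# Assumed ranks: they only encode the ordering of RS types given in the text.
SEVERITY = {"REC,n": 1, "REP": 2, "RED": 3}

# RS type per dimension for the pre-post comparison, as reported in the text.
g1 = {"CL": "REC,n", "DD": "REC,n", "GM": "REC,n", "RU": "REC,n",
      "CO": "REP", "TM": "REP", "PR": "RED", "WA": "RED"}
g2 = {"CO": "REC,n", "GM": "REC,n", "CL": "RED", "DD": "RED",
      "PR": "RED", "RU": "RED", "TM": "RED", "WA": "RED"}

def compare_severity(short_term, long_term):
    """Sort dimensions by whether RS is more, equally, or less severe in G2."""
    result = {"more_severe_in_G2": [], "same": [], "less_severe_in_G2": []}
    for dim in sorted(short_term):
        diff = SEVERITY[long_term[dim]] - SEVERITY[short_term[dim]]
        if diff > 0:
            result["more_severe_in_G2"].append(dim)
        elif diff == 0:
            result["same"].append(dim)
        else:
            result["less_severe_in_G2"].append(dim)
    return result

comparison = compare_severity(g1, g2)
print(comparison)
```

Under these assumptions, RS is more severe in G2 on CL, DD, RU, and TM, equally severe on GM, PR, and WA, and less severe on CO, which matches the four, three, and one split reported above.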
Validity of Pre–Then Comparison to Identify Recalibration RS
The comparison of RS identified by the TP and the MIad (i.e., significant group differences and identification of uniform or nonuniform recalibration RS, respectively, see Table 2) served to evaluate the validity of the TP to identify recalibration RS. In each group, the TP and the MIad converged on only one of the eight dimensions: DD in G1 and GM in G2. On all remaining dimensions, the two approaches diverged in RS identification.
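One way to read this convergence check is as a set intersection per group. The sketch below assumes that convergence means a dimension is flagged by the TP and also classified as recalibration RS by the MIad; the dimension sets are taken from the results reported above, and the variable names are hypothetical.

```python
# Dimensions on which the TP identified RS (pre-then comparison).
tp_identified = {"G1": {"WA", "DD", "CO"}, "G2": {"WA", "DD", "GM"}}

# Dimensions on which the MIad identified (nonuniform) recalibration RS.
miad_recalibration = {"G1": {"CL", "DD", "GM", "RU"}, "G2": {"CO", "GM"}}

# Assumed definition of convergence: flagged by both approaches.
converging = {
    group: tp_identified[group] & miad_recalibration[group]
    for group in tp_identified
}
print(converging)
```

Under this reading, the intersection contains a single dimension per group, DD in G1 and GM in G2.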
Validity of Then–Post Comparison to Measure Change
Table 3 shows the results of Oort’s three-step procedure to identify level of measurement invariance of then–post comparison (T1r vs. T2). Model misfit was predominantly indicated by the RMSEA index. Table 2 again summarizes the identified types of RS and corresponding levels of measurement invariance.
The MIad identified measurement invariance (i.e., at least strong invariance is met) on five of the eight dimensions in each group. The identified levels of measurement invariance are listed in descending order: Strict invariance was present on three of the eight dimensions (G1: CO, GM, WA; G2: GM, PR, WA), strong invariance was present on two dimensions (G1: CL, PR; G2: CL, TM), weak invariance was present on one dimension (both groups: DD), and no level of invariance was present on two dimensions (G1: RU, TM; G2: CO, RU). Although the frequency pattern of invariance levels was identical for the two groups, the levels did not apply to the same dimensions.
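The tally above can be checked with a few lines of code. The levels per dimension are those reported in the preceding paragraph, coded with the 0 to 4 scheme used for the final level of measurement invariance (weak invariance is treated as metric invariance, level 2); the function name `summarize` is hypothetical.

```python
from collections import Counter

# Coding: 0 = none, 1 = configural, 2 = metric (weak), 3 = strong, 4 = strict.
then_post_levels = {
    "G1": {"CO": 4, "GM": 4, "WA": 4, "CL": 3, "PR": 3, "DD": 2, "RU": 0, "TM": 0},
    "G2": {"GM": 4, "PR": 4, "WA": 4, "CL": 3, "TM": 3, "DD": 2, "CO": 0, "RU": 0},
}

def summarize(levels, threshold=3):
    """Count dimensions per level and list those at or above the threshold."""
    frequencies = Counter(levels.values())
    invariant = sorted(dim for dim, level in levels.items() if level >= threshold)
    return frequencies, invariant

freq_g1, invariant_g1 = summarize(then_post_levels["G1"])
freq_g2, invariant_g2 = summarize(then_post_levels["G2"])

print(invariant_g1)                  # dimensions with at least strong invariance in G1
print(invariant_g2)                  # dimensions with at least strong invariance in G2
print(freq_g1 == freq_g2)            # identical frequency pattern across groups
print(invariant_g1 == invariant_g2)  # but not the same dimensions
```

Both groups show five dimensions with at least strong invariance and the same frequency pattern across levels, while the dimensions concerned differ between the groups.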
Discussion
RS bias is of particular concern for research into the effectiveness of training programs (Gandhi et al., 2013; Pratt et al., 2000). The two most common approaches to study and account for RS are the TP and an adapted MI framework MIad. The former can be used instead of a traditional pretest to arrive at an “RS-adjusted” estimate of change (Ahmed, Mayo, Wood-Dauphinee, et al., 2005, p. 1131). The latter serves to identify invariant item parameters, account for possible RS, and assess unbiased change (Oort, 2005).
This study aimed at three objectives concerning the reliability and validity of the respective approaches. First, to implement the MIad in an educational training context. Second, to further examine then-test validity in terms of its ability to identify recalibration RS and its suitability as a quasi-indirect measurement of change (assuming then- and post-test are invariant). Third, to study differential RS effects depending on the kind of treatment.
Concordant with our first hypothesis, the TP identified RS in both G1 and G2: RS was present in three of the eight dimensions in both groups. However, the TP was not able to identify a greater number or a greater degree of RS in G2 than in G1, that is, treatment-specific RS (Hill & Betz, 2005). Likewise, our second hypothesis was verified only partially. The MIad identified RS, and it appeared to be treatment-specific in terms of the degree of RS. However, the number of RS did not differ between the groups because RS was present in all target dimensions in both groups. RS was more severe in G2 than in G1 (which means that the level of measurement invariance was lower in G2) but only in four of the eight dimensions. Accordingly, the application of an experimental framework showed that CM dimensions were most often reconceptualized in G2 as opposed to a more frequent change in the range of answers in G1. Apparently, the G2 treatment, during which students not only improved their knowledge of strategies to cope with and manage problem behavior but also interacted with peers and deployed learned strategies in role-plays, induced a profound reconceptualization of CM aspects, as was expected. In addition, uniform recalibration RS (i.e., changed mean structure of the observed variables) did not occur; reprioritization RS was rare and only present in G1. While HRQOL research repeatedly reported findings of these two latter shifts, it needs to be studied in future research whether changed mean structures and changed loadings are in fact an RS issue for well-constructed reflective constructs (Bollen & Lennox, 1991). In contrast, HRQOL is a formative construct (Donaldson, 2005): It is measured by cause indicators, where the indicators constitute the construct, as opposed to reflective indicators (also called effect indicators), where the indicators are equivalent manifestations of the latent construct (Bollen & Lennox, 1991; Vandenberg & Lance, 2000).
Recently introduced individualized latent approaches seem promising and could broaden our knowledge about interactions of the type of RS and the type of construct studied (Mayo, Scott, & Ahmed, 2009; Rosenman et al., 2011).
Concordant with our third hypothesis, the TP seemed unsuitable to identify recalibration RS when compared to recalibration RS identification by the MIad. Nor could patterns of RS be identified across the two approaches, for example, that pre–then RS was present on those dimensions that yielded any type of RS via MIad. Hence, there seems to be no systematic convergence of the TP and the MIad. This raises the question of what is assessed by the depicted methods and which explanatory variables lead to or do not lead to a shift in responses. Apparently, either the MIad is more sensitive in detecting RS, or the TP is too general to detect RS bias adequately. A possible explanation could be that the diverse types of RS on the item level interfere with each other, resulting in comparable manifest pre–then means although RS did occur. Consequently, the TP may not be reliable enough to detect treatment-specific RS (Hill & Betz, 2005). However, these findings are only first indicators for the complex effects of treatment type on RS, which we investigated in our study with two different treatment groups. In order to further our understanding of RS as a potential side effect in controlled trials, future studies should also incorporate a waiting control group or implement a design in which subjects completing then-tests are not administered pre-tests (cf. the study of Hoogstraten, 1982).
Finally, our fourth hypothesis that then–post comparisons are measurement invariant could be confirmed only partially. In both groups, only five of the eight dimensions met at least strong invariance of then–post comparison, and two dimensions did not meet any invariance level. Accordingly, then–post comparison was not consistently measurement invariant. It seems plausible that participants who succeeded in adequately remembering their status at pre-testing also referred to their former frame of reference and internal understanding of the construct, and used different scaling for then- and post-test although it is the same occasion. Comparing the findings elicited by our approach (post-test and then-test administered consecutively) with Nolte et al.’s (2009), who implemented the then-test in direct comparison to the post-test, reveals that the stability of both estimates may be strengthened by responding consecutively. However, our approach was still not sufficient to produce invariant measures. Research should systematically address this issue by varying procedures and instructions in order to find the optimal strategy for an invariant measurement of then- and post-scores (e.g., the mere announcement of external validation, or encouragement of explicit comparison to a known reference group or an ideal norm, see Aiken & West, 1990 or Norman, 2003). Also, implementation of individual techniques (such as personal interviews or direct assessment of change) could uncover why measures at the same occasion lead to invariant parameters and which indications could be helpful for test instruction to achieve this.
Provided that then-scores are valid indicators of initial knowledge, a practical implication of our study can be derived from the fact that the TP also identified underestimation of initial knowledge: In G1, all three RS effects indicate initial underestimation. In G2, one of the three RS effects still indicates underestimation. With regard to their relative mean levels, underestimation was present on those dimensions that were rated lower than others at pretesting. These findings contradict previous research, which predominantly reported overestimation at pretesting (e.g., Cantrell, 2003; Holden et al., 2008; Moore & Tananis, 2009) and suggest two conclusions: First, training programs have the ability to strengthen participants’ confidence in judging their own knowledge, as is shown by the initial underestimation of the working alliance. In retrospect, participants of both groups realized that they already knew a lot about this aspect of CM although they were not as confident when they had entered the course. This interpretation supports Hill and Betz’s (2005) conclusion about the then-test being an effective procedure to reinforce feelings of confidence and promote this type of reflection after the course. Second, these findings suggest a further treatment-specific effect: It is possible that only the interactive practices in G2 led to critical retrospective self-reflection, as they may have changed the frame of reference to a more realistic or an altered social norm, for example, with regard to dealing with disruptions (Norman, 2003). Both conclusions are of particular relevance for the composition of university courses and stress the importance of interactive opportunities to learn.
Limitations
The current study was limited in several ways. First, our research design only allowed for the comparison of two treatment groups. An additional waiting group or nontreatment control group could have broadened our understanding of treatment-induced RS and would have further facilitated the disentanglement of the RS from treatment effects (Schwartz & Sprangers, 2010). In addition, the time frame between pre- and postmeasures was not identical in the two groups. It is possible that identified group-specific findings may have, in part, occurred due to this fact (e.g., recall bias may have been present in G2 only).
Second, since our sample size was too small to test an overall model, it was necessary to run MIad analyses on a subscale level in order to achieve stable estimates (ratio of sample size to number of free parameters is at least 10:1, Bentler & Chou, 1987). Accordingly, it is crucial to further validate our results in a larger sample to test the overall model and find out whether estimates are robust. Moreover, larger samples could further serve to study predictors of RS (e.g., level of self-efficacy, gender or age, cf. Moore & Tananis, 2009; Rosenman et al., 2011).
Finally, the identified level of measurement invariance strongly depends on the chosen cutoff values of fit indices (Brown, 2006). Predominantly, constraints needed to be released because the RMSEA achieved a critical value. Hu and Bentler (1998) pointed to the fact that the RMSEA index is less preferable at small sample sizes (≤ 250). Although we have chosen a rather liberal cutoff value (≤ .08), it needs to be further analyzed whether larger sample sizes tend to identify less serious RS (i.e., uniform and nonuniform recalibration RS).
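The dependence of the RMSEA on sample size can be illustrated with a minimal sketch. It assumes the common single-group point estimate, RMSEA = sqrt(max(χ2 − df, 0) / (df(N − 1))); the text does not state which variant or software was used for the analyses. The function names and all numerical inputs are hypothetical and are not results from this study.

```python
import math

def rmsea(chi_square, df, n):
    """Single-group RMSEA point estimate (assumed formula, see lead-in)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

def exceeds_cutoff(value, cutoff=0.08):
    """True if the index is above the liberal cutoff of .08 named in the text."""
    return value > cutoff

# Hypothetical fit results (NOT from the study): same chi-square and df,
# evaluated at a small and at a larger sample size.
small_n = rmsea(chi_square=30.0, df=20, n=70)
large_n = rmsea(chi_square=30.0, df=20, n=250)

print(round(small_n, 4), exceeds_cutoff(small_n))
print(round(large_n, 4), exceeds_cutoff(large_n))
```

Holding the chi-square value and the degrees of freedom fixed is purely an arithmetic illustration: the same excess of chi-square over df translates into a larger RMSEA in the smaller sample, so a constraint that would be retained at N = 250 is flagged at N = 70.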
Conclusions
Contrary to current research practice, the evaluation of RS needs to be an integral part of program evaluations that rely on self-report measures because it is essential for interpreting treatment effects. If RS is not tested, we cannot be sure whether a change in observed test scores fully represents a change in the target construct only or also a change in the response behavior of the respondent (Oort, 2005; Vandenberg & Lance, 2000). However, there seems to be no single best approach to ascertain RS to date (Barclay-Goddard, Epstein, et al., 2009).
The current study evaluated the application of two procedures to identify RS: the TP and analyses based on the MIad. With regard to identification of RS, we were able to transfer the MIad of RS detection to concepts other than HRQOL (i.e., classroom management). However, applying the two approaches did not lead to the same results. In particular, the MIad seemed to be more sensitive to RS than the TP. In contrast, the TP identified statistically significant RS effects mainly on dimensions that both treatments particularly focused on and on dimensions in which the mean knowledge increased most after treatments. It appears that the TP partially confounds treatment effects and RS bias although it did not show group-specific RS, which was expected.
Finally, we present a concluding evaluation of the two approaches with regard to their validity for RS identification and their validity for the measurement of change. The MIad may be a useful approach to identify the different types of RS because it was able to detect the types of RS predominantly in the way we expected. However, more research is necessary to find out whether the MIad may be too sensitive and whether reprioritization RS and uniform recalibration RS are of minor importance when analyzing reflective constructs. In contrast, the results of our study challenge the validity of the TP to identify RS because it remained unclear which factors caused a pre–then difference and what exactly is assessed by the TP. Thus, while taking into account other aspects that may threaten its psychometric properties, we recommend only cautious application of the then-test for RS identification.
Although the MIad may be able to identify RS, its application for the measurement of change may render it practically impossible to have a meaningful comparison of pre- and post-scores if the two scores are not invariant (i.e., reconceptualization RS, reprioritization RS, or nonuniform recalibration RS occurred), which was the case in the present study. If this is the case, it may be advisable to use simple post-score comparison of treatment groups. In contrast, then–post comparison revealed higher levels of invariance in the majority of studied dimensions, yet not in all dimensions. Hence, the then-test appears to be adequate for a quasi-indirect measurement of change if pretesting effects are probable and recall bias or socially desirable answers are unlikely (cf. Hill & Betz, 2005). However, in then–post comparison it is likewise essential to test measurement invariance and whether violation of measurement invariance is small enough to justify at least partial invariance in order to obtain changes in latent means. Beyond that, it is advisable to use alternative and more sophisticated analytic techniques in addition to mere manifest change scores or group differences, which help to eliminate error sources (e.g., see Barclay-Goddard, Lix, Tate, Weinberg, & Mayo, 2009; King-Kallimanis, Oort, Visser, & Sprangers, 2009; Rapkin, 2009). Finally, since RS was found to differ across groups, it seems promising to analyze predictors of RS in general but also of the various types of RS in more detail (Barclay-Goddard, Epstein, et al., 2009).
Footnotes
Acknowledgments
We would like to thank the student teachers and the administrative staff involved in this research. We would also like to thank Marius Eckert and Franziska Pfitzner-Eden for editing our manuscript. Finally, we would like to thank the anonymous reviewers for their valuable comments and suggestions that greatly improved the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research project was funded by the BMBF (Bundesministerium für Bildung und Forschung; Federal Ministry of Education and Research of Germany, funding code 01JH920).
