Abstract
In this commentary, we welcome Schimmack’s reanalysis of Bar-Anan and Vianello’s multitrait multimethod (MTMM) data set, and we highlight some limitations of both the original and the secondary analyses. We note that when testing the fit of a confirmatory model to a data set, theoretical justifications for the choices of the measures to include in the model and how to construct the model improve the informational value of the results. We show that making different, theory-driven specification choices leads to different results and conclusions than those reported by Schimmack (this issue, p. 396). Therefore, Schimmack’s reanalyses of our data are insufficient to cast doubt on the Implicit Association Test (IAT) as a measure of automatic judgment. We note other reasons why the validation of the IAT is still incomplete but conclude that, currently, the IAT is the best available candidate for measuring automatic judgment at the person level.
A good operationalization of a theoretical construct allows for testing whether that construct is useful: Is it related to any other interesting constructs? Can it help explain, predict, or control behavior? For that reason, it is important to ask whether the Implicit Association Test (IAT) measures interindividual differences in automatic judgment. Even if the IAT were ruled out as a measure of interindividual differences in automatic judgment, it might still be useful for measuring intergroup differences in automatic judgment (e.g., in experimental designs), but the interpretation of some correlational evidence documented with the IAT would be called into question, calling for reevaluation of current knowledge about automatic judgment.
According to Schimmack (2021; this issue, p. 408), “there is little empirical support for the claim that the IAT measures implicit attitudes that are not accessible to introspection and that cannot be measured with self-report measures.” It has already been argued that the judgment measured by the IAT is accessible to introspection (Gawronski, Hofmann, & Wilbur, 2006), and we are unaware of any rebuttal of that claim. Does the IAT measure automatic judgment that typical self-report measures do not measure? In this commentary, we argue that although Schimmack’s reanalyses of our data (Bar-Anan & Vianello, 2018) are insufficient to cast doubt on the IAT as a measure of automatic judgment, other reasons for that doubt do exist.
Discriminant Validity in Bar-Anan and Vianello (2018)
We welcome Schimmack’s reanalysis of our multitrait multimethod (MTMM) data set. Our study included seven indirect measures (often called implicit measures) that were developed to measure automatic favorability judgment toward American political parties, White and Black people, and the self; several direct (self-report) measures of those judgments; and possible auxiliary measures of behaviors and intentions that might correlate with those judgments (e.g., voting in previous elections). We agree with Schimmack that an MTMM data set can make important contributions to the investigation of the construct supposedly measured by the IAT and other indirect measures. An MTMM study can test whether different techniques developed to measure automatic judgment capture a construct distinct from that captured by measures developed to measure deliberate judgment. As Schimmack argued, that would be evidence for discriminant validity.
When we analyzed our data (Bar-Anan & Vianello, 2018), we found that a dual model—in which indirect measures of each topic load only on automatic constructs and direct measures of each topic load only on deliberate constructs—was superior to alternative models that assumed that direct and indirect measures of each topic load on the same latent construct. Statistical inference is more convincing when it is informed by theory. Our analysis was informed by theories assuming that, at least on some topics, indirect measures capture a different construct than do direct measures. However, we were not always successful in following the decisions that a dual model would dictate. For instance, one of our measures was speeded rating of the stimuli used in the indirect measures (e.g., photos of Black and White people in the race measures). On the basis of theory and previous evidence (Ranganath, Smith, & Nosek, 2008), we expected speeded rating to load on the automatic constructs. However, the estimation of that model failed because the estimation algorithm did not converge (hence, the model was probably wrong for our data). We used a model that assigned the speeded rating to the deliberate constructs because that model converged successfully and showed good fit to the data. Like any deviation from a theory-driven prediction, that deviation decreases the confidence in the inference from our data. At the same time, researchers collect data to learn about the world, and deviations like this constitute novel information about the world. Replications and extensions of our study would help to further understand the significance of that specific deviation.
Schimmack mentioned another possible weakness of our study, stating that our models unrealistically assumed that “a single-method factor would account for correlations among all implicit measures and across attitude domains” (p. 402). It is indeed possible that the affect-misattribution procedure (AMP; Payne, Cheng, Govorun, & Stewart, 2005), evaluative-priming task (EPT; Fazio, Sanbonmatsu, Powell, & Kardes, 1986), and the sorting-paired-features (SPF) task (Bar-Anan, Nosek, & Vianello, 2009) should each load on a separate method factor, rather than on the same method factor as the IAT and its variants. Unfortunately, we were unable to test models with a large number of factors, probably because a lack of information in the data made those models empirically unidentifiable. Although we collected data from 23,215 participants, most of them completed only a small number of the measures, which resulted in a high level of data missingness.
As for estimating method variance across attitude domains, that is the very logic behind an MTMM design (Campbell & Fiske, 1959; Widaman, 1985): Method variance is shared across measures of different traits that use the same method (e.g., among indirect measures of automatic racial bias and political preferences). Trait variance is shared across measures of the same trait that use different methods (e.g., among direct and indirect measures of racial attitude). Separating the MTMM matrix into three separate submatrices (one for each trait), as Schimmack did in his article, misses a main advantage of an MTMM design.
To illustrate the limitation of analyzing only data that pertain to one trait in the MTMM data set, consider Schimmack’s (2021) reanalysis of the racial attitude submatrix (Fig. 3, p. 408). Schimmack argued that the model successfully estimated an IAT method factor from a few indirect measures of the same attitude—IAT, brief IAT (BIAT; Sriram & Greenwald, 2009), and the go/no-go association task (GNAT; Nosek & Banaji, 2001). Yet exactly the same scores were used in the same model to estimate the latent trait factor (automatic racial bias). Two other indirect measures (AMP and EPT) were added as indicators of automatic racial bias. The model did not include a method factor for the AMP and/or EPT. Hence, the maximum-likelihood solution would assign to the IAT method factor all the variance that the IAT, BIAT, and GNAT share among them but not with the AMP and the EPT. That variance might include trait variance because the AMP and EPT are not perfect measures of the trait factor. Further, method variance that the IATs might share with the AMP or EPT will be accounted for by the latent trait factor. The purpose of the MTMM design is to avoid those possible threats to the analysis and the interpretation.
In Schimmack’s reanalysis of our data set, just as in our original analysis, theoretical and practical justifications for the modeling decisions are important for estimating the confidence in the conclusions inferred from the results. For instance, Schimmack chose to omit one of the indirect measures—the SPF—from the models, to include the Modern Racism Scale (McConahay, 1983) as an indicator of political evaluation, and to omit the thermometer scales from two of his models. We assume that Schimmack had good practical or theoretical reasons for his modeling decisions; unfortunately, however, he did not report those reasons. That information would help readers estimate their confidence in the conclusions inferred from his results. As we have argued about our own analysis decisions, results are more convincing when they are informed by theory, although practical statistical reasons (such as the failure of a model to converge) are also informative and important, even if they require stronger statistical validation (e.g., with replication). From what we know about Schimmack’s reanalyses of our data at this point, we do not agree that they cast doubt on the conclusions from our original analysis. We adhere to the conclusion that our study does indeed provide discriminant-validity evidence for the IAT and other indirect measures. Yet we have noted some limitations in our study. Further MTMM research on new data is needed for stronger conclusions.
Predictive Validity in Bar-Anan and Vianello (2018)
A novel, potentially excellent aspect in Schimmack’s reanalysis is that he used outcome measures that were included in our study. These measures were included for a comparison of the psychometric properties of the seven indirect measures (Bar-Anan & Nosek, 2014). Testing relations between the IAT and measures of other theoretically related variables can justify inferences made on test scores, provide evidence about the importance of that construct and, ultimately, help understand what construct the IAT measures.
Multimeasure studies of predictive validity can increase confidence that the predictive validity is due to the construct that multiple indirect measures were developed to capture. Unfortunately, however, the lack of theoretical or practical justifications for his modeling decisions might decrease the potential validity of Schimmack’s conclusions. In addition to the omission of the SPF, the model Schimmack used for testing political evaluation was missing past voting as a criterion measure. Most importantly, the models for testing race evaluation and self-esteem lacked a method factor for the direct measures. Method variance, if present, influenced the trait factor because it was not controlled for, possibly inflating the deliberate judgment–criterion relationship. In research that aims to compare the predictive validity of direct and indirect measures, controlling for method variance in one set of measures but not in the other seriously undermines the interpretability of the results. That seems especially relevant in the present research because the direct measures were more methodologically similar to the outcome measure (self-reported behaviors and intentions) than the indirect measures.
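The concern about uncontrolled method variance can be illustrated with a small covariance calculation. The sketch below is ours and purely illustrative: it assumes a standardized direct measure and a standardized self-reported outcome that both depend on a trait factor and, optionally, on a shared self-report method factor that is independent of the trait. All loadings are hypothetical values chosen for the illustration, not estimates from the data set.

```python
def implied_correlation(trait_loading, method_loading,
                        outcome_trait_path, outcome_method_path):
    """Model-implied correlation between a direct measure D and an outcome Y.

    Assumed (hypothetical) structure, all variables standardized:
        D = trait_loading * T + method_loading * M + e
        Y = outcome_trait_path * T + outcome_method_path * M + u
    T (trait) and M (method) are independent with unit variance, and the
    residuals e and u are independent of everything else.
    """
    # Each standardized variable cannot explain more than its total variance.
    for a, b in ((trait_loading, method_loading),
                 (outcome_trait_path, outcome_method_path)):
        if a ** 2 + b ** 2 > 1.0:
            raise ValueError("loadings imply explained variance above 1")
    # Covariance comes from the trait route plus the shared-method route.
    return trait_loading * outcome_trait_path + method_loading * outcome_method_path


# Hypothetical values: the outcome depends only weakly on the trait.
no_shared_method = implied_correlation(0.7, 0.4, 0.3, 0.0)
shared_method = implied_correlation(0.7, 0.4, 0.3, 0.4)

print(f"correlation without shared method variance: {no_shared_method:.2f}")
print(f"correlation with shared method variance:    {shared_method:.2f}")
```

Under these assumed values, the observed correlation rises even though the trait path to the outcome is unchanged. A model without a method factor for the direct measures would have to attribute that extra covariance to the deliberate trait factor.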
For this commentary, we added a novel analysis to address the potential weaknesses in Schimmack’s analyses. The model depicted in Figure 1 specified an “IAT-method factor” that accounted for method variance in the IAT, BIAT, GNAT, single-target IAT (ST-IAT; Karpinski & Steinman, 2006), and SPF and a “Priming method factor” that accounted for method variance in EPT and AMP. Compared with the model with a single-method factor for all indirect measures, the two-method-factors model showed a trivial increase in fit, evidenced by a very small change in comparative fit index (ΔCFI = .001) and a null change in root mean square error of approximation (ΔRMSEA = 0.00). Nonetheless, the difference in fit between the two models was significant, Δχ²(2) = 21.56, p < .001. Most importantly, comparing the results with those from the models proposed by Schimmack, this model shows that when different decisions were made and when the full MTMM design was used to estimate both direct and indirect method variances across three relatively independent traits (attitude domains), the automatic latent variable did show incremental predictive validity in both the race and political domains (see Table 1). Our results show that the automatic race-evaluation factor predicted contact over and above its deliberate counterpart. Furthermore, our results show that deliberate, but not automatic, political evaluation reliably predicted voting intentions and that automatic, but not deliberate, political evaluation was related to past voting behavior. In this model, the IAT and the BIAT always had the highest loadings on the latent trait factor, and hence they showed the best validity among all seven indirect measures. Interestingly, these results did not change when we collapsed the two method factors into a single “indirect method” factor (for the AMOS 25 syntax and output files of both models, see https://osf.io/jp4nv/).
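For readers who want to check the reported chi-square difference test, the p value follows directly from the chi-square survival function. The sketch below is only a check of the arithmetic, not part of the original analysis (which was run in AMOS). The helper name is hypothetical, and it relies on the closed-form survival function that holds for even degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df.

    For df = 2k the survival function has the closed form
        exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this helper only supports positive even df")
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Values reported in the text: delta chi-square = 21.56 on 2 degrees of freedom.
p_value = chi2_sf_even_df(21.56, 2)
print(f"p = {p_value:.2e}")
```

With a sample this large, even a very small improvement in fit produces a significant difference test. This is why the text reads the significance test alongside the negligible changes in CFI and RMSEA.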

Fig. 1. Six-correlated-traits, three-correlated-methods model (6CT3CM) plus outcomes. χ²(679) = 1719.39, p < .0001; comparative fit index = .97; root mean square error of approximation = 0.008, 95% confidence interval = [0.008, 0.008]. Correlations among outcomes’ residuals were estimated but are not depicted for clarity. MRS = Modern Racism Scale; IAT = Implicit Association Test; BIAT = brief IAT; GNAT = go/no-go association task; ST-IAT = single-target IAT; SPF = sorting-paired-features task; EPT = evaluative-priming task; AMP = affect-misattribution procedure.
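The RMSEA in the caption can be approximately recovered from the chi-square statistic, its degrees of freedom, and the sample size. The sketch below uses the conventional point-estimate formula and assumes N = 23,215, the total sample reported earlier. Because most participants completed only a subset of the measures, the software may have applied a different effective N or a missing-data correction, so any agreement should be read as approximate.

```python
import math


def rmsea_point_estimate(chi2, df, n):
    """Conventional RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    if df <= 0 or n <= 1:
        raise ValueError("df must be positive and n must exceed 1")
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# chi-square and df are taken from the caption; the sample size is an assumption.
rmsea = rmsea_point_estimate(chi2=1719.39, df=679, n=23215)
print(f"RMSEA = {rmsea:.4f}")
```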
Table 1. Predictive Validity Estimates of Race and Political Evaluations
We do not wish to overstate the importance of our novel findings, which were inspired by Schimmack’s ideas and analyses. Our grouping of the indirect measures into two method factors is not the only grouping possible; other groupings might find different results. In our view, our novel findings are helpful mostly in demonstrating that inference from current evidence is not yet clear cut. More validation research in search of strong and consistent evidence is much needed.
Does the IAT Measure Automatic Judgment?
Schimmack focused on evidence for discriminant validity in a handful of studies, but there is much more evidence that the IAT, as a measure of interindividual differences, has incremental validity beyond self-reported judgment. Three meta-analyses (Greenwald, Poehlman, Uhlmann, & Banaji, 2009; Kurdi, Seitchik, et al., 2019; Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013) independently concluded that, when the judgment pertained to the intergroup domain, the incremental predictive validity of the IAT over self-reported judgment was slightly higher than the incremental validity of self-reported judgment over the IAT. That evidence suggests that, at least for some behaviors, inferences made on individual IAT scores (e.g., people with high scores on a race IAT are more likely to discriminate against Black people) are indeed valid, and that direct measures alone would lead researchers to make less valid inferences. Controlling for measurement error, the superiority of automatic judgment over deliberate judgment in predicting intergroup behavior was smaller, but still present (Kurdi, Seitchik, et al., 2019).
The evidence that the IAT captures variance that is not measurement error and is not captured by direct measures of the same domain is evidence of discriminant validity. It suggests that some IATs, under some conditions, measure a construct other than self-reported judgment. Is that construct automatic judgment? Automatic judgment refers to judgment characterized by at least one automaticity feature (Gawronski & De Houwer, 2014). Automaticity is a general term that can refer to many process features, including unintentional, unstoppable, unconscious, fast, and efficient (Bargh, 1994; Moors, 2016). A process analysis (e.g., Conrey, Sherman, Gawronski, Hugenberg, & Groom, 2005) of the factors that influence the IAT score might help to show that the IAT is sensitive to automatic effects of stimuli related to judgment attributes (e.g., good, bad) and to attitude objects (e.g., Europeans, Americans). Yet, how would we know that these automatic effects are the cause of the IAT’s unique shared variance with behavior (after controlling for self-reported judgment)? The IAT might be sensitive to both automatic and nonautomatic factors and to both judgment processes and nonjudgment processes (e.g., Conrey et al., 2005; Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007). In many studies, the IAT and the self-report measures differed in many respects and not only in their potential sensitivity to automatic judgment (Gawronski, 2019; Payne, Burkley, & Stokes, 2008). Therefore, it is challenging to find evidence that the IAT measures automatic judgment processes that also influence other behaviors. Such evidence would increase the confidence that the IAT is a useful measure of automatic judgment.
Perhaps the best evidence that the IAT is a useful measure of interindividual differences in automatic judgment can come from research that examines the prediction of behaviors characterized by different degrees of automaticity. Informative research would target contexts that motivate people to reject their automatically activated judgment (Fazio, 2007; Gawronski & Bodenhausen, 2011). In such contexts, the IAT is expected to be more strongly related to behaviors conducted under conditions of automaticity than to controlled behaviors, whereas direct measures should show the opposite pattern. Relevant examples have been published (e.g., Dovidio, Kawakami, & Gaertner, 2002; McConnell & Leibold, 2001). For instance, the IAT predicted choice under high cognitive load better than choice under low cognitive load, whereas self-reported judgment showed the opposite pattern (Friese, Hofmann, & Wänke, 2008). A finding such as this strongly suggests that the IAT is a very good candidate for the measurement of automatic judgment. However, studies that provide this kind of evidence have not been sufficiently replicated yet (for more discussion, see Axt, Bar-Anan, & Vianello, 2020) and did not distinguish between automatic and deliberate components of the IAT effect.
The available evidence about the IAT’s incremental validity, its sensitivity to automatic processes, and the initial demonstrations that—under theoretically relevant conditions—it predicts automatic behavior better than controlled behavior suggest that the IAT is the best candidate for measurement of automatic judgment. The confidence that the available evidence provides in the IAT’s validity is probably greater than the confidence warranted for most measures in social psychology because the discipline largely favors face validity (whether the measure appears reasonable) over validation studies. Yet meta-analytical evidence that the IAT is, overall, not more related to automatic behavior than to controlled behavior (Kurdi, Seitchik, et al., 2019), the lack of replications of the demonstrations that have found such evidence when theory expects that discrepancy, and the lack of research that routinely employs process dissociation models to score the IAT leave room for doubt about whether the IAT is a good measure of automatic judgment. Acknowledging that there is a lack of evidence that the IAT captures automatic judgment processes that also influence other behaviors would be helpful for directing more effort into searching for that evidence.
Just as Churchill would recommend that countries adopt democracy because, despite its many weaknesses, it is better than all the other forms of government that have been tried, given the currently available evidence, we recommend that researchers who wish to study automatic judgment use the IAT as their first choice of measurement. Findings based on the IAT should incite conceptual replications with other indirect measures—mostly, the SPF, EPT, and the AMP. Finding the same results with a few of those measures (and, hopefully, with new and improved measures) would bolster the confidence that a finding indeed pertains to an automatic-judgment construct. Finding different results with different measures would contribute useful data for the understanding of those measures. Until stronger validity evidence is documented, researchers (and reviewers!) must accept that science is hard (“National Science Foundation . . .,” 2002) and that the exact meaning of the findings might change in the future, as evidence about the validity of indirect measures accumulates.
