The Implicit Association Test: A Method in Search of a Construct

Abstract

In 1998, Greenwald, McGhee, and Schwartz proposed that the Implicit Association Test (IAT) measures individual differences in implicit social cognition. This claim requires evidence of construct validity. I review the evidence and show that there is insufficient evidence for this claim. Most important, I show that few studies were able to test discriminant validity of the IAT as a measure of implicit constructs. I examine discriminant validity in several multimethod studies and find little or no evidence of discriminant validity. I also show that validity of the IAT as a measure of attitudes varies across constructs. Validity of the self-esteem IAT is low, but estimates vary across studies. About 20% of the variance in the race IAT reflects racial preferences. The highest validity is obtained for measuring political orientation with the IAT (64%). Most of this valid variance stems from a distinction between individuals with opposing attitudes, whereas reaction times contribute less than 10% of variance in the prediction of explicit attitude measures. In all domains, explicit measures are more valid than the IAT, but the IAT can be used as a measure of sensitive attitudes to reduce measurement error by using a multimethod measurement model.

Keywords

personality, individual differences, social cognition, measurement, construct validity, convergent validity, discriminant validity, structural equation modeling

Twenty-one years ago, Greenwald, McGhee, and Schwartz (1998) published one of the most influential articles in personality and social psychology. With 4,717 citations in Web of Science, it is the third-most-cited article in the Journal of Personality and Social Psychology. The high citation count attests to the popularity of the Implicit Association Test (IAT). The IAT had even more impact outside psychology; millions of visitors to the IAT website have taken an IAT and received feedback about their IAT scores.

Despite its popularity, relatively little is known about the construct validity of the IAT. The last assessment of construct validity is more than 10 years old (Nosek, Greenwald, & Banaji, 2007), and it is time to reexamine the construct validity of the IAT for several reasons. First, construct validity of psychological measures depends on many factors that can change over time and new data may provide new insights into the construct that is being measured. Therefore, construct validation is a continuing process (Messick, 1995). Second, previous assessments of construct validity were made by the authors of the IAT. As Cronbach (1989) pointed out, construct validation is better examined by independent experts than by authors of a test because “colleagues are especially able to refine the interpretation, as they compensate for blind spots and capitalize on their own distinctive experience” (p. 163). A third reason is that the developers of the IAT are experts in experimental social psychology, a discipline that focuses on situational manipulations that change mean scores, whereas stable individual differences are often treated as measurement error. In contrast, construct validation originated in personality research in which the focus is on stable personality dispositions that do not directly correspond to observable behaviors (Cronbach, 1971). Thus, it can be particularly valuable to examine the construct validity of the IAT from the outsider perspective of personality psychology (Walker & Schimmack, 2008) and psychometrics (Borsboom, 2006). Finally, previous assessments of construct validity concluded that the IAT is valid without quantifying validity. In validation research, it is not sufficient to reject the null hypothesis that a test is absolutely invalid (Cronbach & Meehl, 1955). It is of utmost importance to determine how much of the variance in IAT scores is valid variance and how much of the variance is due to measurement error, especially when IAT scores are used to provide individualized feedback.

What Is the Construct?

The IAT is a simultaneous classification task. Participants have to classify pairs of stimuli into mutually exclusive categories. For example, the flower-insect IAT requires participants to distinguish flowers (e.g., a rose) and insects (e.g., mosquito) and to distinguish positive (e.g., peace) and negative (e.g., death) stimuli that are unrelated to flowers and insects. The classification task is performed twice. Once, flowers and positive stimuli are paired on one response key, and insects and negative stimuli are paired on the other response key. The other time, flowers and negative stimuli are paired, and insects and positive stimuli share the same response key. Without any prior evaluations of flowers or insects, performance on this task is expected to be equal for both conditions. However, if participants have some preexisting associations of flowers with positive evaluations and insects with negative evaluations, their preexisting associations can interfere with the classification task. As a result, responses are faster when the preexisting associations match the pairing of categories on a response key. IAT scores reflect the interference effect from preexisting associations on the classification task.

Superficially, the IAT is similar to other interference tasks, such as the well-known Stroop task, in which color words interfere with the classification of colors. Perfect performance (e.g., by illiterate individuals) would show no interference effects. However, for individuals who can read the color words, interference occurs. Just like the Stroop test, the IAT shows highly reliable interference effects when the majority of participants have preexisting associations (e.g., most people like flowers and do not like insects). Few people doubt that the IAT can reveal average differences in associations (e.g., on average, people like flowers more than insects). Thus, there is no disagreement about the validity of the IAT “as a general method for measuring relative association strengths” (Greenwald, Banaji, & Nosek, 2015, p. 553).

There is also good evidence that the IAT can reveal group differences in associations. For example, German fans have positive associations with the German soccer team, and Brazilian fans have positive associations with the Brazilian team. The two groups are expected to have different interference effects on a Germany–Brazil IAT. In support of this prediction, empirical studies show that the IAT reliably shows group differences in associations when groups have opposing attitudes (Greenwald et al., 1998).

Showing reliable group differences in the strength of associations is not sufficient to demonstrate validity of the IAT as a measure of individual differences, which was the purpose for developing the IAT as stated in the title of Greenwald et al.’s (1998) seminal article “Measuring Individual Differences in Implicit Cognitions.” This claim requires additional evidence that the IAT can measure differences between individuals, not just differences in group means. For example, when Malcolm Gladwell received an IAT score that reflected a moderate preference for White people (Banaji & Greenwald, 2013), validity implies that Malcolm Gladwell does indeed have a preference for White people.

An even more important requirement for validity is that IAT scores reflect individual differences in implicit cognitions. Banaji and Greenwald (2013) elaborated on this claim in no uncertain terms. For example, they stated that “The Implicit Association Test has enabled us to reveal to ourselves the contents of hidden-bias blindspots” (p. 33) and that the IAT is “a method that gives the clearest window now available into a region of the mind that is inaccessible to question-asking methods” (p. xiii). On the basis of these claims, the IAT has attracted “a surge of interest from scientists and the public as a tool to uncover unconscious attitudes” (Izuma, Kennedy, Fitzjohn, Sedikides, & Shibata, 2018, p. 343).

Not everybody shares the view that the IAT is a valid measure of individual differences in implicit attitudes. Payne, Vuletich, and Lundberg (2017) argued for a situational model of attitudes that change from situation to situation. If attitudes are not stable over time, it makes little sense to use the IAT as a measure of individual differences. Thus, this radical reinterpretation of the IAT as a measure of temporarily accessible cognitions dramatically shifts the construct that the IAT was supposed to measure. However, this radical reinterpretation of the IAT is an isolated view. Most researchers regard the IAT as a valid measure of enduring attitudes that vary across individuals (De Houwer, Teige-Mocigemba, Spruyt, & Moors, 2009; Gawronski & Bodenhausen, 2017; Kurdi & Banaji, 2017; Rae & Greenwald, 2017).

There is also no consensus in the literature on whether the IAT measures something different from explicit measures. On the one hand, Greenwald and colleagues consistently argued that the IAT measures constructs that cannot be measured with self-report measures (Banaji & Greenwald, 2013; Greenwald et al., 1998; Nosek et al., 2007). Others proposed that the IAT is simply an implicit or indirect measure of the same constructs that are measured with explicit measures. “The automatic activation of attitudes should not be equated with individuals’ lack of awareness of the attitude” (Samayoa & Fazio, 2017, p. 273). Others appeared to be agnostic about the implicitness of the constructs and to use the term implicit exclusively to distinguish the IAT and rating scales (De Houwer et al., 2009). Recently, Greenwald and Banaji (2017) also expressed concerns about their earlier assumption that IAT scores reflect unconscious processes: “Even though the present authors find themselves occasionally lapsing to use implicit and explicit as if they had conceptual meaning, they strongly endorse the empirical understanding of the implicit–explicit distinction” (p. 862).

In conclusion, although there is general consensus to make a distinction between explicit measures and implicit measures, it is not clear what the IAT measures. There are at least four possibilities. The first possibility is that the IAT is not a valid measure of personality attributes because there are no stable attributes that influence performance on the IAT (Payne et al., 2017). The second possibility is that there are stable attributes but that the IAT is a poor measure of these attributes (Falk & Heine, 2015; Walker & Schimmack, 2008). A third possibility is that the IAT is a valid measure of attributes that can also be measured with explicit measures (Samayoa & Fazio, 2017). Finally, the IAT may be a measure of implicit attitudes that cannot be measured with explicit measures, as originally proposed by Greenwald et al. (1998).

To complicate matters further, the validity of the IAT may vary across attitude objects. After all, the IAT is a method, just like Likert scales are a method, and it is impossible to say that a method is valid (Cronbach, 1971). For example, the IAT may be a valid measure of political preferences (Bar-Anan & Vianello, 2018), whereas it may be an invalid measure of implicit self-esteem (Bosson, Swann, & Pennebaker, 2000; Falk, Heine, Takemura, Zhang, & Hsu, 2015). Finally, reviews of validity have failed to quantify validity, but such quantification matters for the interpretation of IAT scores. If only a small portion of the variance in IAT scores were valid, it would be possible that thousands of visitors to the IAT website, including Malcolm Gladwell, received false feedback about their attitudes because measurement error biased their scores.

Construct Validation

The basic principles of construct validation were outlined in two seminal articles (Campbell & Fiske, 1959; Cronbach & Meehl, 1955; see also Cronbach, 1971). First, a measure needs to demonstrate convergent validity with other measures. Importantly, these other measures have to be independent of each other; that is, they cannot share common method variance. For example, the correlation between two repeated measures of the same IAT or between two highly similar IATs provides information about reliability, but the correlation cannot be used to assess convergent validity because shared method variance can contribute to the reliable variance.

Convergent validity can be examined informally by inspecting correlation tables. However, a better way to examine convergent validity uses factor analysis (Cronbach, 1971; Schimmack, 2010, in press). If several measures reflect a common construct, they should load on a common factor. The higher the loading on the factor, the greater the validity of the measure. The main advantage of factor analysis is that factor loadings provide quantitative information about the amount of valid variance in measures.

Confirmatory factor analysis (CFA) of multimethod data also provides valuable information about the conceptual distinction between explicit and implicit attitudes. If the IAT and other implicit measures reflect attitudes that are different from attitudes that are measured with explicit measures, implicit measures should be more highly correlated with each other than with explicit measures (Campbell & Fiske, 1959). In a CFA model, discriminant validity is revealed by examining the correlation between one factor that represents the shared variance among implicit measures and one factor that represents the shared variance among explicit measures.

Figure 1 shows a two-factor model that follows from dual-attitude models that distinguish between explicit attitudes and implicit attitudes. In this model, the observed correlation between the IAT and an explicit measure of attitudes is a function of three parameters: (a) the validity of the IAT as a measure of implicit attitudes, (b) the validity of an explicit measure as a measure of explicit attitudes, and (c) the correlation or overlap between explicit and implicit attitudes. The model can also include criterion variables to further examine construct validity. According to dual-attitude models, controlled behaviors should be influenced by explicit attitude, whereas automatic behaviors are influenced by implicit attitudes.
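The dependence of the observed correlation on these three parameters can be made concrete with a small numerical sketch. It assumes a standardized version of the model in Figure 1 with no shared method variance between the IAT and the explicit measure, so that the implied correlation is the product of the two loadings and the factor correlation. The function name and all parameter values are hypothetical and chosen only for illustration; they are not estimates from any study.

```python
def implied_correlation(iat_loading, explicit_loading, factor_correlation):
    """Model-implied correlation between the IAT and an explicit measure.

    Assumes standardized variables and no shared method variance, so the
    correlation is the product of the two loadings and the factor correlation.
    """
    return iat_loading * factor_correlation * explicit_loading


# Hypothetical scenarios (illustrative values only).
# (a) Two valid measures of largely distinct attitudes.
distinct_constructs = implied_correlation(0.8, 0.8, 0.25)
# (b) A weak IAT that measures the same attitude as the explicit measure.
weak_iat = implied_correlation(0.2, 0.8, 1.0)

print(round(distinct_constructs, 2), round(weak_iat, 2))
```

Both hypothetical scenarios imply the same observed correlation, which is why a single IAT-explicit correlation cannot by itself separate low validity from distinct constructs.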

Fig. 1.

Hypothetical dual-attitude model.

At present, relatively little is known about the contribution of these three parameters to observed correlations in hundreds of monomethod studies. Take the self-esteem literature as an example. It is a robust finding that correlations between Rosenberg’s (1965) self-esteem scale and the self-esteem IAT are close to zero (Bosson et al., 2000; Izuma et al., 2018). This finding has been interpreted in two different ways. One interpretation is that self-ratings and the IAT are valid and that there is very little overlap between individuals’ consciously accessible self-esteem and implicit self-esteem (Greenwald & Farnham, 2000; Izuma et al., 2018). Others argued that the low correlation shows that the IAT has low validity as a measure of implicit self-esteem or explicit self-esteem (Falk et al., 2015). A third possibility would be that explicit self-esteem measures are invalid, but nobody has made this suggestion because of the ample evidence that Rosenberg’s self-esteem scale has at least moderate validity as a measure of self-esteem (Falk et al., 2015; Schimmack & Diener, 2003; Simms, Zelazny, Yam, & Gros, 2010).

In sum, construct validation research requires multimethod data, and a valid measure needs to demonstrate convergent and discriminant validity (Campbell & Fiske, 1959; Cronbach, 1971; Schimmack, 2010). A valid measure needs to demonstrate convergent validity with independent measures of the same construct (e.g., IAT and evaluative priming), and it needs to demonstrate discriminant validity with independent measures of a distinct construct (e.g., IAT and feeling thermometer). That is, same-construct/cross-method correlations have to be higher than cross-construct/cross-method correlations. In the context of factor analysis, a valid individual difference measure of implicit attitudes needs to load on a common factor with other implicit measures, and the factor loading of the IAT on this factor provides information about the amount of valid variance in IAT scores. If the IAT is a measure of implicit attitudes that are distinct from explicit attitudes, a two-factor model should fit the data better than a one-factor model, and the correlation between the implicit and explicit factors should show that each factor has unique variance.
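The comparison of same-construct and cross-construct correlations can be sketched as a simple check. This is a strict reading of the Campbell and Fiske (1959) criterion, written as a hypothetical helper; the correlations below are invented for illustration and are not taken from any study.

```python
def shows_discriminant_validity(same_construct_r, cross_construct_r):
    """Strict check: every same-construct/cross-method correlation must
    exceed every cross-construct/cross-method correlation."""
    return min(same_construct_r) > max(cross_construct_r)


# Hypothetical correlations (illustrative values only).
implicit_with_implicit = [0.25, 0.30]  # e.g., IAT with evaluative priming
implicit_with_explicit = [0.28, 0.35]  # e.g., IAT with feeling thermometer

result = shows_discriminant_validity(implicit_with_implicit, implicit_with_explicit)
print(result)
```

With these invented values the check fails, because the implicit measures correlate no more strongly with each other than with the explicit measures. A latent-variable model is still needed to quantify the overlap between the factors.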

A Critical Review of Greenwald et al.’s (1998) Original Article

Greenwald et al. (1998) were keenly aware that an introduction of a new measure requires psychometric evidence of its validity. Throughout their article, the authors mentioned psychometric criteria such as convergent and discriminant validity. The emphasis on discriminant validity made it clear that Greenwald et al. were introducing the IAT not only as a new method of measurement but also as a method that can measure something new that could not be measured with explicit measures.

Although the Greenwald et al. (1998) article repeatedly claimed support for the construct validity of the IAT as a measure of implicit attitudes, closer inspection shows that the studies could not examine discriminant validity. The IAT was the only implicit measure in these studies, and monomethod studies cannot distinguish between low validity and high discriminant validity.

In Study 1 in Greenwald et al. (1998), 32 participants showed low correlations between IAT scores and explicit ratings of flowers and insects. The authors suggested that this finding provided evidence of discriminant validity: “This conceptual divergence between the implicit and explicit measures is of course expected from theorization about implicit social cognition” (p. 1470). A comparison of this statement with the model in Figure 1 shows the problem with this statement. A low correlation between the implicit and explicit factors is only one possible explanation for the observed low correlation. Another explanation would be low validity of the self-ratings. The third possibility is that the IAT scores are unreliable or invalid. Thus, the low correlation is theoretically expected from a dual-attitude model, but it does not provide empirical evidence for it.

Experiment 2 in Greenwald et al. (1998) used the IAT with 17 Korean and 15 Japanese American students to assess their attitudes toward Koreans versus Japanese. In this study, Greenwald et al. found “unexpectedly the feeling thermometer explicit rating was more highly correlated with the IAT measure (average r = .59) than it was with another explicit attitude measure, the semantic differential (r = .43)” (p. 1473). This finding is unexpected from a dual-attitude model, but Greenwald et al. did not consider it further.

Study 3 in Greenwald et al. (1998) introduced the race IAT to measure prejudice with the IAT in a sample of 26 participants. In this small sample, IAT scores were only weakly and not significantly correlated with explicit measures. The authors realized that this finding was open to multiple interpretations. “Although these correlations provide no evidence for convergent validity of the IAT, nevertheless—because of the expectation that implicit and explicit measures of attitude are not necessarily correlated—neither do they damage the case for construct validity of the IAT” (Greenwald et al., 1998, p. 1475).

Although the Greenwald et al. (1998) article provided no meaningful psychometric evidence regarding construct validity as a measure of implicit constructs, the discussion section implied that evidence of discriminant validity was obtained: “It is clear that these implicit-explicit correlations should be taken not as evidence for convergence among different methods of measuring attitudes but as evidence for divergence of the constructs represented by implicit versus explicit attitude measures” (p. 1477).

In conclusion, the seminal IAT article (Greenwald et al., 1998) introduced the IAT as a measure of implicit constructs that cannot be measured with explicit measures, but it did not really test this dual-attitude model. An alternative explanation for the results in the Greenwald et al. (1998) study is that the IAT is a poor measure of individual differences in attitudes.

Construct Validity in 2007

In 2007, Greenwald and colleagues (Nosek et al., 2007) assessed construct validity of the IAT and found that multimethod studies showed weak correlations of the IAT with other implicit measures. The authors offered two explanations for this finding. First, they pointed out that implicit measures have low reliability. However, Nosek et al. (2007) did not realize that low reliability is a problem for the construct validity of the IAT because reliability sets the upper limit for validity of a measure. Nosek et al. conflated measures and constructs when they wrote that “the relations among implicit measures (and between implicit measures and other variables) will be underestimated to the extent that they are unreliably assessed” (p. 277). What is being underestimated is the relationship among constructs, not the validity of the measure, and the validity of the measure is the primary focus of construct validation research. “Little practical interest attaches to the correlation of the hypothetical constructs save as these indicate the highest value correlations measures could [emphasis added] reach” (Cronbach, 1971, p. 499). Thus, even if low convergent validity is caused by low reliability, it poses a problem for the validity of the IAT as a measure of individual differences.
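The statement that reliability sets the upper limit for validity follows from classical test theory: valid variance is part of the reliable variance, so the proportion of valid variance cannot exceed the reliability coefficient, and the correlation of a measure with its construct cannot exceed the square root of reliability. A minimal sketch, using a hypothetical reliability value that is not an estimate from the studies reviewed here:

```python
import math


def max_validity(reliability):
    """Upper bounds on validity implied by a reliability coefficient.

    Returns a tuple: (maximum correlation with the construct,
    maximum proportion of valid variance).
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return math.sqrt(reliability), reliability


# Hypothetical reliability of .50 for an implicit measure.
max_loading, max_valid_variance = max_validity(0.50)
print(round(max_loading, 2), max_valid_variance)
```

These are ceilings only. Systematic method variance is reliable but not valid, so actual validity can fall well below them.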

The second explanation for low correlations among implicit measures is that different implicit measures rely on different cognitive processes to measure implicit attitudes. This systematic variance due to unique cognitive processes would further attenuate convergent validity. Although this statement is true, it is also a threat to the validity of the IAT as a measure of individual differences in attitudes (Campbell & Fiske, 1959). Task-specific variance that is unique to the measurement procedure of the IAT is method variance that influences scores on the IAT independently of the attitude that is being measured (Klauer, Voss, Schmitz, & Teige-Mocigemba, 2007). Thus, task-specific factors that produce reliable individual differences in IAT scores also pose a problem for the construct validity of the IAT as a measure of attitudes.

Nosek et al. (2007) also suggested that existing studies support the dual-attitude model: “Structural equation modeling revealed that the best-fitting models represented the IAT and self-report as related but distinct constructs” (p. 278). Closer inspection of the evidence for this claim shows that these structural equation models did not use a multimethod approach (Nosek & Smyth, 2007). That is, the implicit attitude factor was measured only by the IAT, and the explicit attitude factor was measured only by a single self-report measure. The structural equation model corrected only for random measurement error and did not take systematic measurement error in IAT scores and explicit ratings into account. Thus, the studies do not meet the requirement that each construct has to be measured by multiple independent measures that do not share method variance (Campbell & Fiske, 1959).

In conclusion, the 2007 review of construct validity (Nosek et al., 2007) revealed major psychometric challenges for the construct validity of the IAT, which explains why some researchers have concluded that the IAT cannot be used to measure individual differences (Payne et al., 2017). The review also revealed that most studies were monomethod studies that could not examine convergent and discriminant validity. Subsequently, I use published multimethod studies to estimate the validity of IAT scores to measure individual differences in racial bias (four studies), self-esteem (two studies), and political orientation (one study).

Multimethod Studies of the Race IAT

Cunningham, Preacher, and Banaji (2001)

Cunningham et al. (2001) conducted the first multimethod study of racial attitudes. Participants were 93 students with complete data. Each student completed a single explicit measure of prejudice, the Modern Racism Scale (McConahay, 1986), and three implicit measures: the standard race IAT (Greenwald et al., 1998), a response-window IAT (Cunningham et al., 2001), and a response-window evaluative priming task (Fazio, Sanbonmatsu, Powell, & Kardes, 1986). The assessment was repeated on four occasions, each 2 weeks apart. I used the published correlation matrix to reexamine the discriminant validity of implicit and explicit attitude measures.

One limitation of the Cunningham et al. (2001) study was that the authors used only a single explicit method. As a result, it is not possible to create an explicit factor and examine discriminant validity with the model in Figure 1 because repeated measures with the same scale share method variance. To address this limitation, I fitted a single-factor model and examined the loadings of the implicit measures on the common factor. If implicit measures share variance that is not shared with explicit measures, they should have consistently higher loadings on the single factor than the explicit measure. Figure 2 shows that this is not the case.

Fig. 2.

Reanalysis of Cunningham, Preacher, and Banaji (2001). IAT = implicit association test; RW = response window; RT = response time.

Factor loadings for the Modern Racism Scale ranged from .35 to .45 (M = .40). Factor loadings for the standard IAT ranged from .43 to .54 (M = .47). Factor loadings for the response-window IAT ranged from .41 to .69 (M = .51). The evaluative priming measures had the lowest factor loadings, ranging from .13 to .47 (M = .29). The factor loading for the first standard IAT-RT suggests that 16% of the variance in IAT scores can be attributed to racial attitudes.
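Loadings translate into proportions of valid variance by squaring, which the short sketch below applies to the mean loadings reported above. Squaring a mean loading is only an approximation (it is not the same as averaging the squared loadings across occasions), so the percentages are rough summaries and need not match the value for any single occasion.

```python
# Mean standardized factor loadings reported for the single-factor model.
mean_loadings = {
    "Modern Racism Scale": 0.40,
    "standard IAT": 0.47,
    "response-window IAT": 0.51,
    "evaluative priming": 0.29,
}

# Squaring a standardized loading gives the proportion of variance in the
# measure that is explained by the common factor.
valid_variance = {name: loading ** 2 for name, loading in mean_loadings.items()}

for name, share in valid_variance.items():
    print(f"{name}: {share:.0%}")
```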

Another noteworthy finding is that a single factor accounted for correlations among all measures on the same occasion and across measurement occasions. This finding shows that there were no true changes in racial attitudes over the course of this 2-month study. This finding is important because Cunningham et al.’s (2001) study is often cited as evidence that implicit attitudes are highly unstable and malleable (e.g., Payne et al., 2017). This interpretation is based on the failure to distinguish random measurement error and true change in the construct that is being measured (Anusic & Schimmack, 2016). Although Cunningham et al.’s results suggest that the IAT is a highly unreliable measure, the results also suggest that the racial attitudes that are measured with the race IAT are highly stable over periods of weeks or months.

In conclusion, after correcting for measurement error, the construct variances in explicit and implicit measures are highly related and difficult to distinguish. The main reason for low correlations between scores on the Modern Racism Scale and the IAT is that both measures have only modest validity as measures of racial preferences. The factor loading for the standard IAT suggests that only about 20% of the variance in IAT scores reflects racial preferences, whereas the remaining variance is due to random measurement error and systematic method variance. There is no support for a dual-attitude model with distinct explicit and implicit attitudes. A major design flaw of the study was that explicit attitudes were assessed with a single measure, which made it impossible to fit a dual-attitude model to the data.

Bar-Anan and Vianello (2018)

The most recent and extensive multitrait, multimethod validation study of the IAT was published in 2018 (Bar-Anan & Vianello, 2018). In the abstract, the authors claimed that the results provided clear support for the validity of the IAT as a measure of implicit cognitions, including implicit self-esteem. “The evidence supports the dual-attitude perspective, bolsters the validation of 6 indirect measures, and clears doubts from countless previous studies that used only one indirect measure to draw conclusions about implicit attitudes” (Bar-Anan & Vianello, 2018, p. 1264). Closer inspection of the authors’ results shows that this claim exaggerates the evidence for discriminant validity. Their structural equation model made the unrealistic assumption that a single method factor would account for correlations among all implicit measures and across attitude domains. However, the race IAT and the parallel Brief Implicit Association Test for race (race BIAT) are likely to share more method variance than the race IAT and the self-esteem evaluative priming task. Given this unrealistic assumption, it is not surprising that loadings on the method factors tended to be low; many loadings were less than .2, which would suggest that there is no systematic method variance in implicit measures.

Another problem was that even a strong correlation of r = .91 between the implicit and explicit political orientation factors was considered positive evidence for a dual-attitude model, which ignores the effect size of this correlation.

I fitted a dual-attitude model to the racial-attitude measures to examine the validity of the race IAT as a measure of racial attitudes. Another advantage of this data set was the inclusion of a contact measure. This measure makes it possible to examine the predictive validity of implicit and explicit measures of prejudice because racial attitudes are expected to be correlated with intergroup contact (Fig. 3).

Fig. 3.

Reanalysis of Bar-Anan and Vianello (2018) racial attitude measures. IAT = implicit association test; AMP = affect misattribution procedure; GNAT = go/no-go association task.

The most important finding is the high correlation between the explicit and implicit factors, ρ = .86. With an estimated sampling error of σ = .05, the 95% confidence interval ranges from .77 to .96. Although this interval does not include a value of 1, the strength of this correlation shows that most of the variance in the explicit and implicit factors is shared. Moreover, only the explicit factor predicts contact. Thus, there is no evidence that implicit measures have incremental predictive validity in these data. The results also show that IAT variants share some common method variance because they rely on similar cognitive processes to measure attitudes. On the basis of the parameter estimates, we see that .44² = 19% of the variance in the standard IAT reflects individual differences in racial attitudes, and 86% of this variance is shared with the explicit factor. Thus, only a small portion of the valid variance might reflect unique implicit attitudes. Bar-Anan and Vianello’s (2018) large study of construct validity also provides little evidence for the original claim that the IAT measures a new construct that cannot be measured with explicit measures and confirms the estimate from Cunningham et al. (2001) that about 20% of the variance in IAT scores reflects variance in racial attitudes.
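The interval and the variance share reported above can be checked with a few lines of arithmetic. The sketch below assumes a simple normal-theory (Wald) interval computed from the rounded estimate and sampling error; the function name is hypothetical, and small differences from the reported bounds are to be expected because the inputs are rounded.

```python
def wald_interval(estimate, standard_error, z=1.96):
    """Approximate 95% confidence interval, truncated to the range of a correlation."""
    lower = max(-1.0, estimate - z * standard_error)
    upper = min(1.0, estimate + z * standard_error)
    return lower, upper


# Factor correlation and sampling error as reported in the text (rounded).
lower, upper = wald_interval(0.86, 0.05)

# Squared loading of the standard IAT on the implicit factor.
iat_valid_variance = 0.44 ** 2

print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
print(f"valid variance in the standard IAT: {iat_valid_variance:.0%}")
```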

Greenwald, Smith, Sriram, Bar-Anan, and Nosek (2009)

A third multimethod study of the race IAT examined predictive validity of the race IAT as a measure of racial bias in voting. Greenwald et al. (2009) wanted to predict voting intentions in the historic 2008 election, in which U.S. voters had the opportunity to elect the first Black president. Although the outcome is now a historic fact, it was uncertain before the election how much Barack Obama’s racial background would influence White voters. There was also considerable concern that voters might not reveal their true feelings to pollsters. This situation provided a great opportunity to demonstrate that the IAT has incremental predictive validity. According to the abstract of the article, the results confirmed this prediction.

The implicit-race-attitude measures (Implicit Association Test and Affect Misattribution Procedure) predicted vote choice independently of the self-report race attitude measures, and also independently of political conservatism and symbolic racism. These findings support construct validity of the implicit measures. (Greenwald et al., 2009, p. 242).

These claims were based on results of multiple regression analyses: “When entered after the self-report measures, the two implicit measures incrementally explained 2.1% of vote intention variance, p = .001”; when political conservatism was also included in the model, “the pair of implicit measures incrementally predicted only 0.6% of voting intention variance, p = .05” (Greenwald et al., 2009, p. 247).

I tried to reproduce these results with the published correlation matrix and failed to do so. I contacted Anthony Greenwald, who provided the raw data, but I was unable to recreate the sample size of 1,057. Instead, I obtained a similar sample size of 1,035. Performing the analysis on this sample also produced nonsignificant results (IAT: b = −0.003, SE = 0.044, t = 0.070, p = .944; affect-misattribution procedure, or AMP: b = −0.014, SE = 0.042, t = 0.344, p = .731). Thus, there is no evidence for incremental predictive validity in this study.

More important, Greenwald et al. (2009) used the IAT and the AMP as implicit measures and included race and voting behavior as criterion variables. Thus, it was possible to fit a two-factor model to the data with the IAT and AMP as indicators of an implicit attitude factor. The model had good fit, χ²(14) = 44.61, comparative fit index (CFI) = .989, root mean square error of approximation (RMSEA) = .039, 90% CI = [.027, .053].

The most important finding was a strong correlation between the implicit and explicit factors, ρ = .97, σ = .06, 95% CI = [.85, 1.0]. Thus, once more the data do not support the dual-attitude model. Inspection of the factor loadings shows a stronger loading of the IAT on the implicit factor than in the previous studies with about 30% valid variance. It is not immediately obvious why validity would be stronger in this sample. Finally, the model shows that political orientation is the strongest predictor of voting intentions. However, the explicit attitude factor predicts voting intentions above and beyond political orientation. This finding suggests that Obama’s race influenced the 2008 vote. However, explicit measures were able to detect this bias, and adding the IAT did not increase prediction of the race effect on voting in the 2008 U.S. election (Fig. 4).

Fig. 4.

Analysis of Greenwald, Smith, Sriram, Bar-Anan, and Nosek (2009). IAT = implicit association test; AMP = affect misattribution procedure.
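The interval reported for the correlation between the implicit and explicit factors in this reanalysis is consistent with a simple normal-theory interval (estimate ± 1.96 × standard error) that is truncated at the largest admissible value of a correlation. The sketch below assumes that this is how the interval was formed; the function name wald_ci and the truncation step are illustrative assumptions, not a description of the software output.

```python
def wald_ci(estimate, se, z=1.96, lower_bound=-1.0, upper_bound=1.0):
    """Normal-theory interval, clipped to the admissible range of a correlation.

    Assumption: the clipping at +/-1 is added here for illustration.
    """
    lo = max(lower_bound, estimate - z * se)
    hi = min(upper_bound, estimate + z * se)
    return lo, hi


# Factor correlation and sampling error reported for the Greenwald et al. (2009) data
lo, hi = wald_ci(0.97, 0.06)
print(f"95% CI = [{lo:.2f}, {hi:.2f}]")  # prints "95% CI = [0.85, 1.00]"
```

Because the upper limit reaches 1, these data cannot rule out that the two factors are identical.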

Axt (2018)

A study by Axt (2018) used a more limited multimethod approach. However, this study is noteworthy because it examined the construct validity of the IAT in a large data set with 540,723 IAT scores from participants who received feedback about their scores on the Project Implicit website. Because millions of participants have received feedback about their racial attitudes from this website, it is particularly interesting to examine the validity of the IAT scores that are used to inform individuals about their implicit bias or automatic preferences.

One limitation of Axt’s (2018) data set is that the IAT is the only implicit measure. However, it included race of participant as a validation criterion, which made it possible to fit a single-factor structural equation model to the data. The model assumes that there is no shared method variance between explicit and implicit measures, which is consistent with the results of the larger multimethod studies. The model also assumes that the IAT and explicit measures measure a common construct, which is consistent with the lack of discriminant validity in the previously described studies. With three indicators, the model is just identified and has perfect fit (Fig. 5).

Fig. 5.

Analysis of Axt (2018). IAT = implicit association test. Values in parentheses are standard errors.

Given Axt’s (2018) 540,723 respondents, sampling error is very small, σ = .002, and parameter estimates can be interpreted as true scores in the population of Project Implicit visitors. A comparison of the factor loadings shows that explicit ratings are more valid than IAT scores. The factor loading of the race IAT on the attitude factor once more suggests that about 20% of the variance in IAT scores reflects racial attitudes. The results also show that racial attitudes differ by participants’ race. Multiplying the path coefficient from race to the attitude factor and the factor loading of the IAT shows that 8% of the valid variance is due to between-group differences in IAT scores, and 12% of valid variance is due to individual differences within racial groups. Thus, if we focus on individual differences in racial attitudes among White respondents, the valid variance in IAT scores on the Project Implicit website is estimated to be 12%.
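The split of valid variance into a between-group and a within-group part follows from path tracing in a standardized model: the squared factor loading gives the total valid variance in IAT scores, and the squared product of the race path and the loading gives the part that is explained by race. The sketch below is only an illustration of this arithmetic. The loading is implied by the estimate of about 20% valid variance, whereas the path coefficient is back-calculated so that the reported 8% results; it is an assumption and is not read from Fig. 5.

```python
import math


def decompose_valid_variance(loading, path_from_group):
    """Split valid variance in an indicator into between- and within-group parts.

    loading: standardized loading of the indicator on the attitude factor.
    path_from_group: standardized path from group membership to the factor.
    """
    total_valid = loading ** 2
    between = (path_from_group * loading) ** 2  # variance explained by group
    within = total_valid - between              # individual differences within groups
    return total_valid, between, within


loading = math.sqrt(0.20)  # implied by about 20% valid variance
path = math.sqrt(0.40)     # ASSUMPTION: back-calculated, not taken from Fig. 5
total, between, within = decompose_valid_variance(loading, path)
print(f"total={total:.2f}, between={between:.2f}, within={within:.2f}")
# prints "total=0.20, between=0.08, within=0.12"
```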

Discussion of race IAT studies

Multimethod studies consistently provide no evidence that explicit measures and the IAT measure distinct constructs. Either a single-attitude model fits the data or most of the variance in separate explicit and implicit factors is shared. This finding contradicts Greenwald et al.’s (1998) original interpretation of low correlations between the race IAT and explicit measures as evidence that the IAT measures implicit race bias. The present results suggest that moderate validity of explicit measures and low validity of the IAT account for low explicit-IAT correlations in monomethod studies. The valid variance in IAT scores is estimated to be between 16% and 30%. The remaining variance is random measurement error and systematic measurement error that is unique to the IAT and similar implicit tasks. Two data sets also examined incremental predictive validity. There was no evidence that IAT scores added to the prediction of contact or voting behaviors above and beyond the variance that is shared with explicit measures. This finding is not surprising given the lack of discriminant validity, which is a prerequisite for incremental predictive validity.

Multimethod Studies of the Self-Esteem IAT

As noted previously, the validity of the IAT depends on the construct that is being measured. One application of the IAT has been the attempt to measure implicit self-esteem. The notion of implicit self-esteem implies that some people may harbor unconscious feelings of self-doubt that are hidden behind a display of high self-confidence. For example, on May 7, 2019, I took the self-esteem IAT on the Project Implicit website and was asked to “click here to discover your hidden thoughts about a range of mental health topics.” The informed consent form also suggests that attitudes split into conscious and unconscious evaluations: “We are interested in evaluations that occur outside conscious control or awareness. Thus, some people are surprised by the results of the study” (https://implicit.harvard.edu/implicit/user/pimh/index.jsp; note that the text of both pages has been changed since May 7, 2019).

Bosson et al. (2000) were the first to examine the convergent validity of the self-esteem IAT with other implicit measures of self-esteem and found weak correlations. Since then, other critical articles have been published (Gawronski, LeBel, & Peters, 2007; Walker & Schimmack, 2008). The most recent and extensive review was conducted by Falk and Heine (2015), who found that “the validity evidence for the IAT in measuring [implicit self-esteem] is strikingly weak” (p. 6). They also pointed out that implicit measures of self-esteem “show a remarkably consistent lack of predictive validity” (p. 6). To examine the construct validity of the self-esteem IAT more thoroughly, I fitted CFA models to two multimethod data sets.

Falk et al. (2015)

Falk et al. (2015) conducted the most comprehensive validation study of the self-esteem IAT. The biggest advantage of the study was the inclusion of informant ratings of self-esteem, which makes it possible to model method variance in self-ratings (Anusic, Schimmack, Pinkus, & Lockwood, 2009) and validate self-ratings of self-esteem (Simms et al., 2010). I tried to fit a two-factor model to the data, but this was impossible because the implicit measures failed to show evidence of convergent validity. Thus, I fitted a single-factor model to the data to examine the validity of the self-esteem IAT as a measure of self-esteem. The model had good fit, χ²(67) = 115.85, comparative fit index (CFI) = .964, RMSEA = 0.050 (90% CI = [0.034, 0.065]). The self-ratings had high loadings on the self-esteem factor. Self-esteem is also equally related to high levels of positive affect and low levels of negative affect. Most important, the self-esteem IAT and the other implicit measures have low and nonsignificant loadings on the self-esteem factor (Fig. 6).

Fig. 6.

Analysis of Falk, Heine, Takemura, Zhang, and Hsu (2015). IAT = implicit association test.

Bar-Anan and Vianello (2018)

Bar-Anan and Vianello (2018) also examined convergent and discriminant validity of the self-esteem IAT. One limitation of this study was that it did not include informant ratings of self-esteem. Nevertheless, I was able to fit a two-factor model to the data, and the model was identified and had good fit, χ²(22) = 30.76, CFI = .994, RMSEA = 0.006 (90% CI = [0.000, 0.010]). The results for this data set differ from the previous results in that the explicit and implicit measures showed evidence of discriminant validity, although both factors clearly overlapped with a correlation of ρ = .58. Because sampling error was considerable, σ = .10, the 95% CI ranged from .39 to .76. However, the study confirmed that IAT scores have low validity, with an estimate of 14% valid variance in self-esteem IAT scores. The Rosenberg self-esteem scale also had a surprisingly low amount of valid variance in this study at only 25% valid variance. Thus, low validity contributes considerably to low observed correlations between IAT scores and explicit self-esteem measures (Fig. 7).

Fig. 7.

Analysis of Bar-Anan and Vianello (2018) self-esteem. IAT = implicit association test; AMP = affect misattribution procedure; PO = political orientation; GNAT = go/no-go association task.
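To see how strongly low validity attenuates observed correlations in the self-esteem data, one can compute the correlation that a simple measurement model implies for two observed measures: the factor correlation multiplied by the two factor loadings. The sketch below is a rough illustration under the assumption that the two measures share no method variance; the resulting value is model implied, not a statistic reported by Bar-Anan and Vianello (2018).

```python
import math


def implied_observed_r(factor_r, valid_var_x, valid_var_y):
    """Correlation between two observed measures implied by the factor
    correlation and the proportion of valid variance in each measure.

    Assumption: no shared method variance between the two measures.
    """
    return factor_r * math.sqrt(valid_var_x) * math.sqrt(valid_var_y)


# Factor correlation .58; 14% valid variance in the IAT, 25% in the Rosenberg scale
r = implied_observed_r(0.58, 0.14, 0.25)
print(f"implied observed r = {r:.2f}")  # prints "implied observed r = 0.11"
```

A factor correlation of .58 therefore shrinks to an observed correlation of about .11, which would look like strong evidence of distinct constructs in a monomethod study.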

Conclusion about self-esteem

Low observed correlations between implicit and explicit measures of self-esteem in monomethod studies have been misinterpreted as evidence that explicit and implicit measures of self-esteem measure distinct constructs (Greenwald & Farnham, 2000). Multimethod studies that control for measurement error show much stronger correlations or no evidence for discriminant validity. More important, convergent validity across implicit measures is low, and even if some form of implicit or unconscious self-esteem existed, the self-esteem IAT is a poor measure of implicit self-esteem. These results are not new (Bosson et al., 2000) but have been ignored by proponents of the self-esteem IAT as a valid measure of implicit self-esteem (Greenwald & Farnham, 2000).

Multimethod Study of Political Orientation

It is interesting to examine the validity of the political orientation IAT because previous studies have shown higher correlations between a political orientation IAT and explicit measures of political orientation. This finding has been interpreted as evidence that correlations between implicit and explicit attitudes can vary across attitude objects (Greenwald & Banaji, 2017). However, in light of the previous findings that implicit and explicit attitude factors are highly correlated, it is possible that correlations in this domain are stronger because the IAT is a more valid measure of political orientation. To test this hypothesis, I fitted a two-factor model to the multimethod data from Bar-Anan and Vianello (2018). The model has good fit to the data, χ²(49) = 90.38, CFI = .992, RMSEA = 0.008 (90% CI = [0.005, 0.010]; Fig. 8).

Fig. 8.

Analysis of Bar-Anan and Vianello (2018). IAT = implicit association test; AMP = affect misattribution procedure; GNAT = go/no-go association task; SC = single category.

The results for the correlation between the explicit and implicit factors are similar to the previous results. Again, the correlation is high but significantly different from 1, ρ = .89, σ = .015. More important, the factor loading of the IAT on the implicit factor is much higher than for self-esteem or racial attitudes, suggesting over 50% of the variance in political orientation IAT scores is valid variance, π = .79, σ = .016. The loading of the self-report on the explicit factor was also higher, π = .90, σ = .010. Finally, voting intentions are strongly predicted by the explicit factor and unrelated to the implicit factor. Thus, although there is some unique variance in the implicit factor, it is not clear what this unique variance represents.
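The two quantities in this paragraph can be checked with simple arithmetic: the proportion of valid variance is the squared standardized loading, and discriminant validity requires that the factor correlation be reliably smaller than 1. The sketch below assumes a normal-theory z statistic for the distance between the estimate and a comparison value; the helper names are illustrative.

```python
def valid_variance(loading):
    """Proportion of variance in a measure explained by its factor."""
    return loading ** 2


def z_against(estimate, se, null_value):
    """Normal-theory z statistic for the distance from a comparison value."""
    return (estimate - null_value) / se


iat_valid = valid_variance(0.79)
self_report_valid = valid_variance(0.90)
print(f"IAT: {iat_valid:.0%}, self-report: {self_report_valid:.0%}")
# prints "IAT: 62%, self-report: 81%"

z_one = z_against(0.89, 0.015, 1.0)   # is the factor correlation below 1?
z_zero = z_against(0.89, 0.015, 0.0)  # is it above 0?
print(f"z vs. 1 = {z_one:.1f}, z vs. 0 = {z_zero:.1f}")
```

The estimate is many standard errors away from both 0 and 1, so the factors overlap strongly without being identical.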

The results for political orientation confirm that it is impossible to make general statements about the validity of the IAT, just as it is impossible to make general statements about the validity of Likert ratings. In the domain of political orientation, Likert ratings and the IAT appear to have high validity as measures of a common construct. At the same time, there is little evidence to suggest that some people who are consciously Democrats unconsciously favor the Republican party and vice versa.

Variation of Implicit–Explicit Correlations Across Domains

Hundreds of studies have examined the correlation of the IAT with a corresponding explicit measure. Meta-analysis of these monomethod studies (one measure for one construct) documented reliable differences in these correlations across attributes (Hofmann, Gawronski, Gschwendner, Le, & Schmitt, 2005). Correlations are lowest for self-esteem, somewhat higher for the race IAT, and even higher for political orientation and consumer preferences.

Different interpretations of this finding have been offered. For example, Greenwald and Banaji (2017) suggested,

The plausible interpretations of the more common pattern of weak implicit–explicit correlations are that (a) implicit and explicit measures tap distinct constructs or (b) they might be affected differently by situational influences in the research situation (cf. Fazio & Towles-Schwen, 1999; Greenwald et al., 2002) or (c) at least one of the measures, plausibly the self-report measure in many of these cases, lacks validity. (p. 868)

Contrary to these speculations, evidence from multimethod studies suggests that the key factor that influences correlations of the IAT with explicit measures is the validity of the IAT, which is low for self-esteem and high for political orientation. This novel discovery raises an interesting new question about the IAT, namely, why is the IAT a more valid measure of political orientation than self-esteem?

I propose that one cause of variation in the validity of the IAT is related to the proportion of respondents with opposing attitudes (e.g., pro-Democrat vs. pro-Republican). As noted earlier, there is good evidence that the IAT can show mean differences between groups. Republicans find the IAT task easier when Republican is paired with good, and Democrats find the IAT easier when Democrat is paired with good. Thus, the IAT can be valid because it reflects group membership even if it is a poor measure of attitudes within each group. If we separate valid variance into variance that reflects the direction of an attitude (group membership) and variance caused by differences in the strength of attitudes (individual differences within groups), it is possible that the high validity of the political orientation IAT reflects polarized attitudes rather than high validity for the measurement of individual differences in the strength of attitudes (Iyengar & Westwood, 2015).

To test this hypothesis, I dichotomized IAT scores with zero as a neutral point. I then ran a hierarchical regression with the dichotomous predictor first and then followed by both predictors to examine the incremental contribution from reaction times. Table 1 shows that neither dichotomous scores nor reaction times predicted explicit self-esteem in either study. Dichotomous scores explained a modest amount of variance in explicit racial attitude ratings, and reaction times added some explained variance. Most interesting was the result for political orientation: A full 38% of the variance in explicit ratings was explained by the dichotomous measure, whereas reaction times added only 7%, which is similar to the results for racial attitudes. Thus, the validity of IAT scores increases with the polarity of attitudes. This finding suggests that the IAT is good in classifying individuals into opposing groups but has low validity as a measure of individual differences in the strength of attitudes.

Table 1.

Explained Variance by Direction and Strength of Attitude

Data set	Citation	Dichotomous scores	Incremental reaction time
Self-esteem study 1	Falk, Heine, Takemura, Zhang, & Hsu (2015)	.016	.000
Self-esteem study 2	Bar-Anan & Vianello (2018)	.009	.002
Race study 1	Bar-Anan & Vianello (2018)	.041	.060
Race study 2	Greenwald, Smith, Sriram, Bar-Anan, & Nosek (2009)	.073	.057
Race study 3	Axt (2018)	.045	.030
Political orientation	Bar-Anan & Vianello (2018)	.380	.070
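The two columns of Table 1 correspond to the two steps of the hierarchical regression: the variance explained by the dichotomized score alone and the increment when the continuous score is added. The sketch below shows one way to compute both quantities from correlations, using the standard formula for the squared multiple correlation with two predictors. It runs on simulated data with polarized attitudes; the simulation parameters and function names are assumptions for illustration and do not reproduce any data set in Table 1.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def direction_vs_strength(iat, explicit):
    """Return (R^2 for dichotomized scores, incremental R^2 for continuous scores)."""
    direction = [1.0 if score > 0 else 0.0 for score in iat]  # zero as neutral point
    r_yd = pearson(explicit, direction)
    r_yi = pearson(explicit, iat)
    r_di = pearson(direction, iat)
    r2_step1 = r_yd ** 2
    # Squared multiple correlation with two correlated predictors
    r2_step2 = (r_yd ** 2 + r_yi ** 2 - 2 * r_yd * r_yi * r_di) / (1 - r_di ** 2)
    return r2_step1, r2_step2 - r2_step1


# Simulated, polarized attitudes (illustration only; all parameters are assumptions)
rng = random.Random(1)
iat_scores, explicit_ratings = [], []
for _ in range(2000):
    group = rng.choice([-1.0, 1.0])            # direction of the attitude
    attitude = group + rng.gauss(0.0, 0.5)     # strength varies within groups
    iat_scores.append(0.5 * attitude + rng.gauss(0.0, 0.5))
    explicit_ratings.append(attitude + rng.gauss(0.0, 0.5))

r2_direction, r2_increment = direction_vs_strength(iat_scores, explicit_ratings)
print(f"dichotomous: {r2_direction:.3f}, incremental: {r2_increment:.3f}")
```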

Discussion

What do IATs measure?

Construct validation is a difficult and iterative process because scientific evidence can alter the understanding of constructs. In this spirit, Banaji and Greenwald (2013) recognized that they did “not have the luxury of believing that what appears true and valid now will always appear so” (p. xv). In 2013, Banaji and Greenwald proposed that the IAT measures implicit constructs that are difficult to measure with explicit measures (Banaji & Greenwald, 2013; Greenwald et al., 1998; Nosek et al., 2007). This view was based on low to moderate correlations between IAT scores and explicit measures, which were interpreted as evidence of discriminant validity. However, this interpretation was premature. The present results suggest that measurement error alone is often sufficient to explain these low correlations. Thus, there is little empirical support for the claim that the IAT measures implicit attitudes that are not accessible to introspection and that cannot be measured with self-report measures.

Of course, absence of evidence is not the same as evidence that implicit attitudes do not exist or never influence behavior. However, empirical claims about implicit attitudes require valid measures of implicit attitudes, and IAT scores fail to provide this required information. For 21 years, the lack of discriminant validity has been overlooked because psychologists often fail to take measurement error into account and do not clearly distinguish between measures and constructs. Thus, findings obtained with the IAT were often treated as if they provide insights into implicit attitudes. For example, on the basis of the observation that IAT scores change in response to experimental manipulations, Lai et al. (2016) wrote that “implicit preferences are malleable” (p. 1002). However, they also noticed that none of these effects lasted and that the manipulations did not change scores on explicit attitude measures. This finding was interpreted as “short-term malleability in implicit preferences” (p. 1002).

The present results provide a much simpler explanation for their findings. Although the actual racial preferences did not change, the experimental manipulations introduced temporary changes in the cognitive processes that are used to measure attitudes with the IAT (Klauer et al., 2007). This parsimonious explanation is also consistent with the longitudinal multimethod study of racial attitudes that showed no real changes in racial attitudes over a 2-month period (Cunningham et al., 2001). It also explains why the most effective manipulation of IAT scores is an explicit instruction to fake IAT scores (Lai et al., 2016), whereas interventions that are designed to change actual attitudes show inconsistent effects (Forscher, Mitamura, Dix, Cox, & Devine, 2017; Joy-Gaba & Nosek, 2010; Van Dessel, De Houwer, Gast, & Tucker Smith, 2015). Moreover, publication bias in experimental IAT studies makes it difficult to identify paradigms that produce replicable intervention effects (Forscher et al., 2019).

In the future, researchers need to be more careful when they make claims about constructs on the basis of a single measure such as the IAT because measurement error can produce misleading results. To ensure that results can be generalized from a specific measure to actual constructs, it is desirable to incorporate a multimethod approach into the measurement of attitudes. Even if studies use a monomethod approach, a series of monomethod studies should demonstrate that results are consistent across different measures. Otherwise, sampling error will produce an unreliable pattern of dissociations between measures that masks consistent patterns at the level of constructs.

Researchers should avoid terms, such as implicit attitude or implicit preferences, that make claims about constructs simply because attitudes were measured with an implicit measure. Even Greenwald and Banaji (2017) are trying to avoid writing about implicit constructs, suggesting that it is time to put the ideas that the IAT measured unconscious processes and that individuals harbor hidden implicit attitudes to rest.

Implications for dual-attitude models

The present results also have implications for theoretical models of attitudes because measures such as the IAT have influenced how social psychologists think about attitudes (Bohner & Dickel, 2011). Research with implicit measures has led to the creation of dual-attitude models. The key postulate of dual-attitude models is that implicit and explicit attitudes have separate mental representations (Petty, Briñol, & DeMarree, 2007). One finding that seemed to support dual-attitude models was the low correlation between implicit measures such as the IAT and explicit measures of attitudes (Bar-Anan & Vianello, 2018). Thus, the present results of low discriminant validity pose a problem for dual-attitude models and favor models that consider attitudes as a single construct (Fazio & Olson, 2003).

A single construct does not imply that there is a fixed representation of object valence that is stored in memory. In different situations, different representations may be activated and produce inconsistency in implicit and explicit measures. The same is true for personality traits and behaviors (Schimmack, 2010). The notion of a single attitude implies only that there are individual differences in the disposition to like or dislike an attitude object across different situations: Chocolate lovers love chocolate most of the time but not all of the time.

The key distinction between this single-attitude model and a dual-attitude model is the hypothesis that any unique variance in IAT scores is unique to the IAT and does not generalize to other implicit measures or prediction of specific behaviors in real-world situations. Some researchers have used incremental predictive validity of IAT scores as evidence for a dual-attitude model (Greenwald et al., 2009). However, at present, evidence of incremental predictive validity is inconclusive. First, explicit measures also have measurement error. Even two explicit measures with moderate validity would show incremental predictive validity without demonstrating that there are two explicit attitudes. Thus, incremental validity supports dual-attitude models only if the IAT predicts criterion variables after controlling for measurement error in explicit measures. As shown in the reanalysis of Greenwald et al.’s (2009) data, evidence for incremental predictive validity is elusive when measurement error in explicit measures is taken into account. A recent meta-analysis by Kurdi et al. (2019) suggested that IAT scores have an average incremental predictive validity of r = .14. However, this estimate did not control for systematic measurement error in explicit measures. Kurdi et al. (2019) also noted that most studies were extremely underpowered to examine incremental predictive validity. Thus, there is currently a lot of uncertainty about the ability of IAT scores to predict real-world outcomes above and beyond explicit measures (Carlsson & Agerström, 2016; Oswald, Mitchell, Blanton, Jaccard, & Tetlock, 2013, 2015).

Method variance in the IAT

One contribution of this article is to make a clear theoretical distinction between systematic measurement error and valid variance in IAT scores. However, the structural equation models only demonstrated that systematic measurement error contributes to variance in IAT scores. They did not reveal the cognitive processes that produce method variance in IAT scores. This question has been addressed in studies that decompose variance in IAT scores into different processing components (Klauer et al., 2007). One component reflects a general slowing in the more difficult incompatible trials. Variation in this component for a political orientation IAT showed convergent validity with explicit ratings of political orientation (Klauer et al., 2007). In contrast, another component reflects whether participants slow to avoid errors or do not slow and make more errors (speed-accuracy trade-off). This component was not related to explicit attitudes but shared variance with two conceptually unrelated IATs. The authors also speculated that setting of the speed-accuracy parameter is influenced by transient factors. Thus, this method variance may produce internal consistency without producing stability over time. This research supports the interpretation of reliable variance that is shared across IATs as method variance. Moreover, this work may be useful to improve the validity of the IAT by minimizing variance due to method factors and trying to maximize the variance that reflects attitudes.

Method variance in explicit measures

The structural equation models also showed evidence of measurement error in self-ratings. Although some of this variance is simply random measurement error, retest correlations show that systematic measurement error also plays a role (Cunningham et al., 2001). One source of unique method variance in explicit measures is item content. For example, Axt (2018) listed over 20 different measures of racial attitudes. Correlations of these measures with a simple direct preference rating can range from r = .45 for the American National Election Survey scale (Payne et al., 2010) to r = .34 for the Modern Racism Scale (McConahay, 1986), r = .34 for the Symbolic Racism Scale (Sears, 1988), and r = .33 for Bayesian racism (Uhlmann, Brescoll, & Machery, 2010). Thus, even explicit measures share less than 50% of variance with each other, and it is not surprising that correlations of the race IAT with a single explicit measure are low.

Axt (2018) found that out of all attitude measures, a direct rating of racial preferences correlated most highly with the IAT. Given the current finding that explicit measures and the IAT measure a common construct, this finding suggests that direct measures have the highest construct validity as measures of attitudes. Other measures, such as the Symbolic Racism Scale, blend racial attitudes with other constructs such as political orientation, as shown in the analysis of Greenwald et al.’s (2009) study.

An interesting question is whether some of the shared variance among explicit measures is also systematic method variance. One plausible candidate for method variance in self-report measures is socially desirable responding. It is a common assumption that response biases attenuate the validity of self-report measures. The present results provide little support for this hypothesis. The multimethod models showed no shared variance between direct ratings and the Modern Racism Scale (Bar-Anan & Vianello, 2018, data) or the Symbolic Racism Scale (Greenwald et al., 2009, data). This finding suggests that concerns about socially desirable responding are overblown. At least in anonymous survey situations, explicit ratings provide valid information about individual differences even for sensitive attitude objects, and the validity of explicit ratings is higher than the validity of the IAT.

How well does the IAT measure what it measures?

Studies with the IAT can be divided into applied studies (A studies) and basic studies (B studies). B studies use the IAT to study basic psychological processes. In contrast, A studies use the IAT as a measure of individual differences. Whereas B studies contribute to the understanding of the IAT, A studies require that IAT scores have construct validity. Thus, B studies should provide quantitative information about the psychometric properties for researchers who are conducting A studies. Unfortunately, 21 years of B studies have failed to do so. For example, after an exhaustive review of B studies, De Houwer et al. (2009) concluded that “IAT effects are reliable enough to be used as a measure of individual differences” (p. 363). This conclusion is not helpful for the use of the IAT in A studies because (a) no quantitative information about reliability is given and (b) reliability is necessary but not sufficient for validity. Height can be measured reliably, but it is not a valid measure of happiness.

This article provides the first quantitative information about validity of three IATs. There is no clear evidence of construct validity for the self-esteem IAT (Falk et al., 2015). The race IAT has about 20% valid variance and even less valid variance in studies that focus on attitudes of members from a single group. The political orientation IAT has over 50% valid variance, but most of this variance is explained by group differences and overlaps with explicit measures of political orientation. Although validity of the IAT needs to be examined on a case-by-case basis, the results suggest that the IAT has limited utility as a measurement method in A studies. It is either invalid or the construct can be measured more easily with direct ratings. The most promising use of the IAT is to use it as a complementary method and use the shared variance between IAT scores and explicit measures to control for measurement error in both methods. This approach is similar to the use of self-ratings and informant ratings for the measurement of personality traits (Anusic et al., 2009; Schimmack, 2010).

Implications for the use of IAT scores in assessment

Personality psychologists routinely use measures with moderate validity to test personality theories (Schimmack, 2010). Personality measures do not require high validity for this purpose. However, high validity is important for personality assessment (i.e., when test scores of individuals are used to draw inferences about their internal attributes). The use of psychological tests for personality assessment is regulated by the American Psychological Association Ethical Principles of Psychologists. These guidelines state that “Psychologists administer, adapt, score, interpret, or use assessment techniques, interviews, tests, or instruments in a manner and for purposes that are appropriate in light of the research on or evidence of the usefulness and proper application of the techniques” (American Psychological Association, 2017). The low construct validity of IATs raises some concerns about the use of the IAT for the assessment of individual attitudes (e.g., Malcolm Gladwell’s racial preferences).

Over a million visitors of the Project Implicit website have received feedback about their IAT scores with the interpretation that they harbor some potentially unconscious racial biases against African Americans (Banaji & Greenwald, 2013). Given the modest validity of the race IAT and the lack of evidence for discriminant validity, it seems problematic that respondents are not informed about the possibility that their test scores are likely to be invalid. One way to communicate this possibility to visitors of the Project Implicit site is to compute a confidence interval with a range of values that are consistent with an IAT score (Carter & Feldt, 2001).

Several formulas exist for such confidence intervals, but they all take the standard deviation, error probability, and reliability of a measure into account. However, reliability does not account for bias due to systematic measurement error. Thus, I suggest replacing the reliability coefficient with the validity coefficient. For example, if we assume that 20% of the variance in scores on the race IAT is valid variance, the 95% CI for IAT scores from Project Implicit (Axt, 2018), using the D-scoring method, with a mean of .30 and a standard deviation of .46, ranges from −0.51 to 1.11. Participants who score at the mean level could therefore have an extreme pro-White bias (Cohen’s d = 1.11/.46 = 2.41) but also an extreme pro-Black bias (Cohen’s d = −.51/.46 = −1.10). It seems problematic to provide individuals with feedback that their IAT score may reveal something about their attitudes that is more valid than their beliefs.
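The arithmetic behind this interval can be retraced with a short sketch. It assumes the simplest of the available formulas, namely an interval centered on the observed score with a standard error of SD × √(1 − r), where the validity coefficient (.20) takes the place of the reliability r. The function name `validity_based_ci` and the critical value of 1.96 are illustrative choices rather than part of the original analysis; other formulas (e.g., ones centered on a regressed true-score estimate) would give different bounds.

```python
import math


def validity_based_ci(score, sd, validity, z=1.96):
    """Return (lower, upper) bounds around an observed score.

    `validity` is the assumed proportion of valid variance; it is used
    where a reliability coefficient would normally appear (an assumption
    of this sketch, following the suggestion in the text).
    """
    if not 0.0 <= validity <= 1.0:
        raise ValueError("validity must be a proportion between 0 and 1")
    # Standard error of measurement with validity substituted for reliability
    sem = sd * math.sqrt(1.0 - validity)
    half_width = z * sem
    return score - half_width, score + half_width


# Values quoted in the text: mean D score, its standard deviation,
# and the assumed share of valid variance in the race IAT.
mean_d, sd_d, valid_var = 0.30, 0.46, 0.20

lower, upper = validity_based_ci(mean_d, sd_d, valid_var)
print(f"95% CI: {lower:.2f} to {upper:.2f}")

# Express both bounds in standard deviation units (Cohen's d metric)
print(f"Upper bound in SD units: {upper / sd_d:.2f}")
print(f"Lower bound in SD units: {lower / sd_d:.2f}")
```

Under these assumptions the bounds round to −0.51 and 1.11, matching the interval reported above. Raising the assumed validity narrows the interval, which is why the choice between a reliability and a validity coefficient matters for the feedback given to individuals.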

In my opinion, the disclaimer on the website (https://implicit.harvard.edu/implicit/takeatest.html) is insufficient:

In reporting to you results of any IAT test that you take, we will mention possible interpretations that have a basis in research done (at the University of Washington, University of Virginia, Harvard University, and Yale University) with these tests. However, these Universities, as well as the individual researchers who have contributed to this site, make no claim for the validity of these suggested interpretations.

Instead, visitors of the Project Implicit website should be given information about the psychometric properties of IAT scores.

Conclusion

Social psychologists have always distrusted self-report, especially for the measurement of sensitive topics such as prejudice. Many attempts were made to measure attitudes and other constructs with indirect methods. The IAT was a major breakthrough because it has relatively high reliability compared with other methods. Thus, creating the IAT was a major achievement that should not be underestimated merely because the IAT lacks construct validity as a measure of implicit constructs. Even creating an indirect measure of attitudes is a formidable feat. However, in the early 1990s, social psychologists were enthralled by work in cognitive psychology that demonstrated unconscious or uncontrollable processes (Greenwald & Banaji, 1995). Implicit measures were based on this work, and it seemed reasonable to assume that they might provide a window into the unconscious (Banaji & Greenwald, 2013). However, the processes that are involved in the measurement of attitudes with implicit measures are not the personality characteristics that are being measured. There is nothing implicit about being a Republican or Democrat, gay or straight, or having low self-esteem. Conflating implicit processes in the measurement of attitudes with implicit personality constructs has created a lot of confusion. It is time to end this confusion. The IAT is an implicit measure of attitudes with varying validity. It is not a window into people’s unconscious feelings, cognitions, or attitudes.

Footnotes

Acknowledgements

A previous version of this article appeared as a blog post on .

Transparency

Action Editor: Laura A. King

Editor: Laura A. King

ORCID iD

Ulrich Schimmack

References

1. American Psychological Association. (2017). Section 9.02, In Ethical principles of psychologists and code of conduct. Retrieved from https://www.apa.org/ethics/code/index

2. Anusic, & Schimmack (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110, 766–781. doi:10.1037/pspp0000066

3. Anusic, Schimmack, Pinkus, & Lockwood (2009). The nature and structure of correlations among Big Five ratings: The halo-alpha-beta model. Journal of Personality and Social Psychology, 97, 1142–1156.

4. Axt, J. R. (2018). The best way to measure explicit racial attitudes is to ask about them. Social Psychological & Personality Science, 9, 896–906. doi:10.1177/1948550617728995

5. Banaji, M. R., & Greenwald, A. G. (2013). Blindspot: Hidden biases of good people. New York, NY: Delacorte Press.

6. Bar-Anan, & Vianello (2018). A multi-method multi-trait test of the dual-attitude perspective. Journal of Experimental Psychology: General, 147, 1264–1272. doi:10.1037/xge0000383

7. Bohner, & Dickel (2011). Attitudes and attitude change. Annual Review of Psychology, 62, 391–417. doi:10.1146/annurev.psych.121208.131609

8. Borsboom (2006). The attack of the psychometricians. Psychometrika, 71, 425–440. doi:10.1007/s11336-006-1447-6

9. Bosson, J. K., Swann, W. B., Jr., & Pennebaker, J. W. (2000). Stalking the perfect measure of implicit self-esteem: The blind men and the elephant revisited? Journal of Personality and Social Psychology, 79, 631–643. doi:10.1037/0022-3514.79.4.631

10. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. doi:10.1037/h0046016

11. Carlsson, & Agerström (2016). A closer look at the discrimination outcomes in the IAT literature. Scandinavian Journal of Psychology, 57, 278–287. doi:10.1111/sjop.12288

12. Carter, R. A., & Feldt, L. S. (2001). Confidence intervals for true scores: Is there a correct approach? Journal of Psychoeducational Assessment, 19, 350–364.

13. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education.

14. Cronbach, L. J. (1989). Construct validation after thirty years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy: Proceedings of a symposium in honor of Lloyd G. Humphreys (pp. 147–171). Urbana: University of Illinois Press.

15. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. doi:10.1037/h0040957

16. Cunningham, W. A., Preacher, K. J., & Banaji, M. R. (2001). Implicit attitude measures: Consistency, stability, and convergent validity. Psychological Science, 12, 163–170. doi:10.1111/1467-9280.00328

17. De Houwer, Teige-Mocigemba, Spruyt, & Moors (2009). Implicit measures: A normative analysis and review. Psychological Bulletin, 135, 347–368. doi:10.1037/a0014211

18. Falk, C. F., & Heine, S. J. (2015). What is implicit self-esteem, and does it vary across cultures? Personality and Social Psychology Review, 19, 177–198.

19. Falk, C. F., Heine, S. J., Takemura, Zhang, C. X., & Hsu (2015). Are implicit self-esteem measures valid for assessing individual and cultural differences. Journal of Personality, 83, 56–68. doi:10.1111/jopy.12082

20. Fazio, R. H., & Olson, M. A. (2003). Implicit measures in social cognition research: Their meaning and use. Annual Review of Psychology, 54, 297–327. doi:10.1146/annurev.psych.54.101601.145225

21. Fazio, R. H., Sanbonmatsu, D. M., Powell, M. C., & Kardes, F. R. (1986). On the automatic activation of attitudes. Journal of Personality and Social Psychology, 50, 229–238. doi:10.1037/0022-3514.50.2.229

22. Forscher, P. S., Lai, C. K., Axt, J. R., Ebersole, C. R., Herman, Devine, P. G., & Nosek, B. A. (2019). A meta-analysis of procedures to change implicit measures. Psychological Bulletin, 117, 522–559.

23. Forscher, P. S., Mitamura, Dix, E. L., Cox, W. T. L., & Devine, P. G. (2017). Breaking the prejudice habit: Mechanisms, timecourse, and longevity. Journal of Experimental Social Psychology, 72, 133–146. doi:10.1016/j.jesp.2017.04.009

24. Gawronski, & Bodenhausen, G. V. (2017). Beyond persons and situations: An interactionist approach to understanding implicit bias. Psychological Inquiry, 28, 268–272.

25. Gawronski, LeBel, E. P., & Peters, K. R. (2007). What do implicit measures tell us?: Scrutinizing the validity of three common assumptions. Perspectives on Psychological Science, 2, 181–193. doi:10.1111/j.1745-6916.2007.00036.x

26. Greenwald, A. G., & Banaji, M. R. (1995). Implicit social cognition: Attitudes, self-esteem, and stereotypes. Psychological Review, 102, 4–27. doi:10.1037/0033-295X.102.1.4

27. Greenwald, A. G., & Banaji, M. R. (2017). The implicit revolution: Reconceiving the relation between conscious and unconscious. American Psychologist, 72, 861–871. doi:10.1037/amp0000238

28. Greenwald, A. G., Banaji, M. R., & Nosek, B. A. (2015). Statistically small effects of the Implicit Association Test can have societally large effects. Journal of Personality and Social Psychology, 108, 553–561. doi:10.1037/pspa0000016

29. Greenwald, A. G., & Farnham, S. D. (2000). Using the Implicit Association Test to measure self-esteem and self-concept. Journal of Personality and Social Psychology, 79, 1022–1038. doi:10.1037/0022-3514.79.6.1022

30. Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The Implicit Association Test. Journal of Personality and Social Psychology, 74, 1464–1480.

31. Greenwald, A. G., Smith, C. T., Sriram, Bar-Anan, & Nosek, B. A. (2009). Race attitude measures predicted vote in the 2008 U.S. Presidential Election. Analyses of Social Issues and Public Policy, 9, 241–253.

32. Hofmann, Gawronski, Geschwendner, & Schmitt (2005). A meta-analysis on the correlation between the Implicit Association Test and explicit self-report measures. Personality and Social Psychology Bulletin, 31, 1369–1385. doi:10.1177/0146167205275613

33. Iyengar, & Westwood, S. J. (2015). Fear and loathing across party lines: New evidence on group polarization. American Journal of Political Science, 59, 690–707.

34. Izuma, Kennedy, Fitzjohn, Sedikides, & Shibata (2018). Neural activity in the reward-related brain regions predicts implicit self-esteem: A novel validity test of psychological measures using neuroimaging. Journal of Personality and Social Psychology, 114, 343–357. doi:10.1037/pspa0000114

35. Joy-Gaba, J. A., & Nosek, B. A. (2010). The surprisingly limited malleability of implicit racial evaluations. Social Psychology, 41, 137–146. doi:10.1027/1864-9335/a000020

36. Klauer, K. C., Voss, Schmitz, & Teige-Mocigemba (2007). Process components of the Implicit Association Test: A diffusion-model analysis. Journal of Personality and Social Psychology, 93, 353–368. doi:10.1037/0022-3514.93.3.353

37. Kurdi, & Banaji, M. R. (2017). Reports of the death of the individual difference approach to implicit social cognition may be greatly exaggerated: A commentary on Payne, Vuletich, and Lundberg. Psychological Inquiry, 28, 281–287. doi:10.1080/1047840X.2017.1373555

38. Kurdi, Seitchik, A. E., Axt, J. R., Carroll, T. J., Karapetyan, Kaushik, . . . Banaji, M. R. (2019). Relationship between the Implicit Association Test and intergroup behavior: A meta-analysis. American Psychologist, 74, 569–586. doi:10.1037/amp0000364

39. Lai, C. K., Skinner, A. L., Cooley, Murrar, Brauer, Devos, . . . Nosek, B. A. (2016). Reducing implicit racial preferences: II. Intervention effectiveness across time. Journal of Experimental Psychology: General, 145, 1001–1016. doi:10.1037/xge0000179

40. McConahay, J. B. (1986). Modern racism, ambivalence, and the Modern Racism Scale. In J. F. Dovidio & S. L. Gaertner (Eds.), Prejudice, discrimination, and racism (pp. 91–125). San Diego, CA: Academic Press.

41. Messick (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749. doi:10.1037/0003-066X.50.9.741

42. Nosek, B. A., Greenwald, A. G., & Banaji, M. R. (2007). The Implicit Association Test at age 7: A methodological and conceptual review. In J. A. Bargh (Ed.), Social psychology and the unconscious: The automaticity of higher mental processes. Frontiers of Social Psychology (pp. 265–292). New York, NY: Psychology Press.

43. Nosek, B. A., & Smyth, F. L. (2007). A multitrait-multimethod validation of the Implicit Association Test: Implicit and explicit attitudes are related but distinct constructs. Experimental Psychology, 54, 14–29. doi:10.1027/1618-3169.54.1.14

44. Oswald, F. L., Mitchell, Blanton, Jaccard, & Tetlock, P. E. (2013). Predicting ethnic and racial discrimination: A meta-analysis of IAT criterion studies. Journal of Personality and Social Psychology, 105, 171–192. doi:10.1037/a0032734

45. Oswald, F. L., Mitchell, Blanton, Jaccard, & Tetlock, P. E. (2015). Using the IAT to predict ethnic and racial discrimination: Small effect sizes of unknown societal significance. Journal of Personality and Social Psychology, 108, 562–571.

46. Payne, B. K., Krosnick, J. A., Pasek, Lelkes, Akhtar, & Tompson (2010). Implicit and explicit prejudice in the 2008 American presidential election. Journal of Experimental Social Psychology, 46, 367–374.

47. Payne, B. K., Vuletich, H. A., & Lundberg, K. B. (2017). The bias of crowds: How implicit bias bridges personal and systemic prejudice. Psychological Inquiry, 28, 233–248. doi:10.1080/1047840X.2017.1335568

48. Petty, R. E., Briñol, & DeMarree, K. G. (2007). The Meta-Cognitive Model (MCM) of attitudes: Implications for attitude measurement, change, and strength. Social Cognition, 25, 657–686. doi:10.1521/soco.2007.25.5.657

49. Rae, J. R., & Greenwald, A. G. (2017). Persons or situations? Individual differences explain variance in aggregated implicit race attitudes. Psychological Inquiry, 28, 297–300. doi:10.1080/1047840X.2017.1373548

50. Rosenberg (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.

51. Samayoa, J. A. G., & Fazio, R. H. (2017). Who starts the wave? Let’s not forget the role of the individual. Psychological Inquiry, 28, 273–277. doi:10.1080/1047840X.2017.1373554

52. Schimmack (2010). What multi-method data tell us about construct validity. European Journal of Personality, 24, 241–257. doi:10.1002/per.771

53. Schimmack (in press). The validity crisis in psychology. Meta-Psychology.

54. Schimmack, & Diener (2003). Predictive validity of explicit and implicit self-esteem for subjective well being. Journal of Research in Personality, 37, 100–106. doi:10.1016/S0092-6566(02)00532-9

55. Sears, D. O. (1988). Symbolic racism. In P. A. Katz & D. A. Taylor (Eds.), Eliminating racism: Profiles in controversy (pp. 53–84). New York, NY: Plenum.

56. Simms, L. J., Zelazny, Yam, W. H., & Gros, D. F. (2010). Self-informant agreement for personality and evaluative person descriptors: Comparing methods for creating informant measures. European Journal of Personality, 24, 3207–3221.

57. Uhlmann, E. L., Brescoll, V. L., & Machery (2010). The motives underlying stereotype-based discrimination against members of stigmatized groups. Social Justice Research, 23, 1–16.

58. Van Dessel, De Houwer, Gast, & Tucker Smith (2015). Instruction-based approach-avoidance effects: Changing stimulus evaluation via the mere instruction to approach or avoid stimuli. Experimental Psychology, 62, 161–169. doi:10.1027/1618-3169/a000282

59. Walker, S. S., & Schimmack (2008). Validity of a Happiness Implicit Association Test as a measure of subjective well-being. Journal of Research in Personality, 42, 490–497. doi:10.1016/j.jrp.2007.07.005