Abstract
The Taylor Aggression Paradigm (TAP) is a widely used laboratory aggression task, yet item response theory analyses of this task are nonexistent. To address this gap, we combined data from nine laboratory studies that employed the 25-trial version of the TAP (combined N = 1,856). One- and four-factor solutions for the TAP data exhibited evidence of measurement invariance across gender (men vs. women) and experimental provocation (negative vs. positive social feedback), as well as negligible instances of differential item functioning. As such, psychometric properties of the TAP were invariant across binary representations of gender and experimental provocation. Furthermore, trials following low and high provocation were the least informative and those following moderate provocation were the most informative. Scoring approaches to the TAP may benefit from giving greater weight to trials following moderate provocation. Overall, we find great utility in applying item response theory approaches to behavioral laboratory tasks.
Human aggression is a costly and complex phenomenon. Laboratory tasks that accurately and reliably assess aggression are needed to fully understand this behavior. The Taylor Aggression Paradigm (TAP; Taylor, 1967) and variants thereof have emerged as the primary approach to laboratory aggression assessment. Despite the large-scale adoption of this paradigm, several of its psychometric qualities remain largely unexamined. In what follows, we examined the measurement invariance of the TAP across men and women and across experimental provocation conditions. We then used item response theory (IRT) analyses to identify the invariance and informativeness of individual trials of the TAP.
The Taylor Aggression Paradigm
The TAP (Taylor, 1967) is one of the most widely used measures of aggressive behavior. In its most basic form, participants compete in a reaction-time game against an opponent and whoever loses each competition trial receives an aversive stimulus. Typically, the aversive stimuli take the form of electric shocks to the skin or a blast of harsh noise over a set of headphones. Aggression is quantified by the severity of the aversive stimulus that participants select for their opponent to receive if their opponent loses the competition and the participant wins. Most often, the TAP is administered as a computer program called the Competitive Reaction-Time Task and participants compete against a fictitious opponent who enacts a preprogrammed schedule of outcomes (i.e., wins and losses) and provocation (e.g., volume settings for noise blasts; Bushman & Baumeister, 1998). Participants are allowed to see their opponent’s settings as this acts as a within-person provocation manipulation. The intention behind this manipulation is that participants will be provoked into behaving aggressively if they see that their opponent tried to administer a high level of noise. This provocation schedule varies within-participants such that, typically, the earlier trials of the task are the “high provocation” trials (i.e., noise levels 8-10) and the provocation levels of the following trials are randomized.
Whether aggression can be accurately and reliably assessed in contrived laboratory settings such as those inherent to the TAP has been a point of considerable debate (e.g., Tedeschi & Quigley, 1996). However, the TAP has exhibited substantial evidence of construct, convergent, and external validity (Anderson & Bushman, 1997; Chester & Lasko, 2019; Giancola & Parrott, 2008; Giancola & Zeichner, 1995; Hyatt, Zeichner, et al., 2019; King & Russell, 2019). Valid measures must also exhibit other important yet underestimated psychometric properties, such as measurement invariance (Flake et al., 2017; Hussey & Hughes, 2020).
Measurement Invariance
An important property of any psychological measure of a latent construct is measurement invariance, which refers to whether a latent construct is assessed in the same manner across different groups (Coulacoglou & Saklofske, 2017; Millsap, 2011). If the invariance of a measure remains unestimated (an unfortunately common practice; Hussey & Hughes, 2020), then investigators cannot know whether a measure can be accurately employed across a diverse array of sampling populations or whether its validity and factor structure are specific to one or more of those populations (Flake et al., 2017).
Measurement invariance is empirically estimated via confirmatory factor analysis (CFA). Under this approach, there are four increasingly strict levels of measurement invariance (Bialosiewicz et al., 2013; Coulacoglou & Saklofske, 2017). First, configural invariance tests whether a measure’s latent factor structure replicates across groups. For example, configural invariance would be violated if men exhibited a two-factor structure and women exhibited a three-factor structure for a given measure. Second, metric invariance estimates whether members of different groups respond to each of the measure’s individual items in a similar way (i.e., they have similar factor loadings on each item). Third, scalar invariance estimates whether the groups have similar average intercepts for each item. Fourth and finally, strict invariance estimates whether error variance is consistent across groups. Scalar and strict forms of measurement invariance are categorized together as “strong” invariance and are often difficult to observe. Indeed, “strong” invariance is a highly constrained approach that assumes that measurement error is equivalent across groups, which is rarely achieved. When noninvariance is observed, additional analyses are needed to identify the specific sources of the between-group variability.
Item Response Theory and Differential Item Functioning (DIF)
The measurement invariance approaches summarized above are derived from classical test theory. Although these factor analytic approaches can examine the measurement invariance of overall measures, they are less suited to identifying the specific items that contribute to noninvariance than other methods such as IRT, which offers additional information and advantages that the CFA approach does not. IRT analyses examine the extent to which individual items from psychological measures capture the latent constructs they are intended to assess (Embretson & Reise, 2013). The process of applying the sequence of equality constraints in CFA can detect overall group differences in the measure itself; however, if the measure exhibits noninvariance at any level, a series of additional analyses (e.g., removing/adding individual constraints until partial invariance is achieved) is needed to determine the specific sources of noninvariance. Conversely, IRT approaches were designed to examine the functioning of a measure’s individual items and therefore yield more information per item compared with CFA. As such, if noninvariance is observed, the IRT approach is ideal for digging deeper into the specific sources of that noninvariance. Of particular relevance to the testing of measurement invariance are IRT analyses that examine DIF. DIF occurs when a given item captures the target latent construct differently between two or more subgroups (see the Method section for more detail). Yet what subgroups would be most important to examine for signs of measurement noninvariance and DIF?
Invariance Across Gender Identities
A grouping variable that is often invoked as differentially influencing aggression is gender. Self-identified men tend to be more physically aggressive than self-identified women (Archer, 2004), which is reflected in higher TAP scores observed for men (as compared with women; Zeichner et al., 2003). These gender differences in laboratory aggression may be due to real differences in these groups’ aggressive behavior or they may simply be due to the TAP exhibiting noninvariant psychometric properties for men compared with women. For example, it is possible that men and women approach the task in different ways or the task may simply be scaled differently for men versus women. Indeed, many contextual factors that are simulated by the TAP (e.g., opponent provocation) moderate the effect of gender on aggression (Arriaga & Aguiar, 2019; Bettencourt & Miller, 1996; Björkqvist, 2018). The crucial role of gender in the study of aggression necessitates that the measurement invariance of the TAP, by gender identity, be investigated.
Invariance Across Experimental Manipulations of Provocation
Although the TAP includes built-in provocation in the form of noise blasts delivered by the participant’s opponent, many studies also include experimental manipulations prior to the TAP that are intended to further provoke aggressive behavior. Examples of these added provocations include manipulations of being socially excluded (vs. included; e.g., Chester & DeWall, 2017) and receiving feedback from another person that is insulting (vs. complimentary; e.g., Chester & Lasko, 2019). These provocation manipulations reliably heighten aggression on the TAP (Chester & Lasko, 2019). However, these mean differences between provoked and unprovoked participants may be due, in part, to the effect of the provocation manipulation on the psychometric properties of the TAP itself and not the latent aggression construct. For example, an already-provoked participant might be more reactive to the provocation inherent to the TAP, creating a different behavioral profile than participants who were initially unprovoked. Measurement invariance analyses are able to examine this possibility.
Present Research
The TAP has been the subject of many efforts to ascertain its psychometric validity. Yet examinations of this paradigm’s measurement invariance across key grouping variables are still lacking. In what follows, we combined nine existing data sets on which we performed measurement invariance analyses using both confirmatory factor analyses and IRT according to a two-part preregistered plan (Part 1: https://osf.io/t59gu; Part 2: https://osf.io/j3nmt). We predicted that the TAP would exhibit measurement invariance across gender and experimental provocation groups, given the substantial evidence that the task exhibits considerable validity when these groups are combined (e.g., Chester & Lasko, 2019).
Method
Participants
The data used in the present analyses were from 1,886 undergraduate participants across nine separate studies. All participants were recruited from an introductory psychology subject pool and completed the study for course credit. The following exclusion criteria applied to all nine studies (except where noted): (a) participants must have been at least 18 years old and (b) participants could not have a hearing disorder or other hearing sensitivity. Thirty participants were manually removed from this original data set for either failing to indicate their gender or indicating a gender other than female or male, resulting in a final sample size of 1,856. Participant demographic information from each study is presented in Table 1. In Studies 1 and 9, 17-year-old students were permitted to participate via a parental consent waiver.
Separated and Combined Descriptive Statistics of Participant Demographics From Each Study.
Materials
Taylor Aggression Paradigm
In each of the nine studies, participants completed a computerized version of the 25-trial TAP (Bushman & Baumeister, 1998; Taylor, 1967). Participants were instructed that they would be competing in a reaction-time game against a same-sex stranger. In reality, participants were playing against the computer with preprogrammed wins and losses. In each of the task’s 25 trials, participants competed to press a button faster and a loud noise blast was delivered to the slower player. Participants chose the volume and duration of the noise their opponent would hear if the opponent lost. Similarly, participants were told that their opponent would choose the volume and duration of the noise that they (the participant) would hear. Volume levels ranged from Level 1 (60 dB) to Level 10 (105 dB), in addition to a nonaggressive option (Level 0). Duration ranged from Level 0 (0 seconds) to Level 10 (5 seconds), increasing in half-second increments. Wins and losses were randomized within participants according to the task’s default settings (see Figure 2; Bushman & Baumeister, 1998) and this pattern of randomization was held constant across participants.
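The mapping from response levels to stimulus settings described above can be illustrated with a short Python sketch. This is not the task's actual source code: the function names are hypothetical, and the even 5 dB spacing between volume levels is an assumption inferred only from the stated endpoints (Level 1 = 60 dB, Level 10 = 105 dB).

```python
from typing import Optional


def volume_db(level: int) -> Optional[float]:
    """Return the assumed noise volume (dB) for a TAP volume level from 0 to 10."""
    if not 0 <= level <= 10:
        raise ValueError("level must be between 0 and 10")
    if level == 0:
        return None  # Level 0 is the nonaggressive option (no noise)
    # Assumption: levels are evenly spaced between 60 dB and 105 dB.
    return 60.0 + 5.0 * (level - 1)


def duration_seconds(level: int) -> float:
    """Return the noise duration (seconds) for a TAP duration level from 0 to 10."""
    if not 0 <= level <= 10:
        raise ValueError("level must be between 0 and 10")
    # Stated in the text: half-second increments from 0 s to 5 s.
    return 0.5 * level
```

Under these assumptions, each trial yields two ordinal responses on a 0-10 scale, which is why the 25 trials produce the 50 items analyzed below.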
Procedure
Study 1
Participants reported their demographics and were then randomly assigned to either be rejected or not via a widely used social rejection paradigm called Cyberball (Williams & Jarvis, 2006). As social rejection is a form of provocation, this task served as the provocation manipulation. Cyberball took the form of a virtual ball-tossing game, which participants ostensibly played with two other same-sex students. In the rejection condition, participants were assigned to receive only three ball tosses at the beginning and then none, while their partners tossed the ball to each other (Chester et al., 2017). After the provocation manipulation, all participants completed the TAP against one of their Cyberball partners.
Study 2
Participants reported their demographics and then were randomly assigned to either be provoked or not via an essay feedback paradigm in which participants were instructed to write a brief essay about an important time in their life and then exchanged essays with an ostensible partner in a different room who would give them feedback on the essay. Participants were randomly assigned to receive either negative feedback (8/35 points, “One of the WORST essays I’ve EVER read!”) or positive (33/35 points, “Great essay!”) feedback (Bushman & Baumeister, 1998; Chester & DeWall, 2017). After the provocation manipulation, participants completed the TAP against their essay feedback partner.
Study 3
Participants reported their demographics and then were all provoked via the same essay feedback provocation paradigm as Study 2. After being provoked, participants completed the TAP against their essay feedback partner.
Study 4
Participants reported their demographics and then completed the TAP. No provocation manipulation was employed in this study.
Study 5
Participants reported their demographics, completed a battery of personality questionnaires, and then were all provoked via the same essay feedback provocation paradigm as previous studies. This manipulation followed the same procedure as the previous essay feedback studies with the exception that participants wrote about a time they were angry rather than about an important time in their life (Chester et al., 2015). After being provoked, participants completed the TAP.
Study 6
Participants reported their demographics and were then randomly assigned to either be provoked or not via the same Cyberball task as Study 1. After the provocation manipulation, all participants then completed the TAP.
Study 7
Participants were randomly assigned to either be provoked or not by the same essay feedback provocation paradigm as Study 2 and then completed the TAP.
Study 8
This study was nearly identical to Study 7 with the exception that the TAP was counterbalanced with two other behavioral aggression tasks.
Study 9
This study procedure was identical to Study 6.
Data Analysis
Confirmatory Factor Analysis
We fit a one-factor CFA with maximum likelihood estimation using the lavaan package (version 0.6-5; Rosseel, 2012) for R statistical software (version 3.6; R Core Team, 2019). Of the total 1,856 participants, 1,770 were used in the CFA due to listwise deletion of missing observations. We conducted Little’s MCAR test using the BaylorEdPsych v0.5 package for R statistical software. The results were nonsignificant, χ2(81) = 55.01, p = .99, and fewer than 5% of observations were missing (4%); therefore, no further steps were taken to address missing data. The CFA examined the fit of a model in which all 50 TAP items (25 trials × 2 settings per trial) were set to load onto a single latent “aggression” factor. One randomly chosen item’s factor loading was set to 1 to allow for intercept estimation. We decided on this initial single-factor structure because this is one of the most commonly used scoring strategies for the TAP (i.e., a single average across all trials; Chester & Lasko, 2019).
To test the measurement invariance of this factor model, we first ran an unconstrained CFA and then applied increasingly strict equality constraints onto the remaining 49 factor loading parameters. We then compared the model fit of each constrained model (with parameters set equal between men and women or provoked and unprovoked participants) with the model before it. In the first constrained model, only the factor loadings were set equal (i.e., metric invariance). In the second constrained model, both the loadings and intercepts were set equal across groups (i.e., scalar invariance). In the final constrained model, the loadings, intercepts, and residuals were set equal across groups (i.e., strict invariance). The fit indices we used to compare these models were χ2, root mean square error of approximation (RMSEA), comparative fit index (CFI), and the Tucker–Lewis index (TLI).
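The cumulative sequence of equality constraints can be summarized in a small Python sketch. It is only a bookkeeping illustration, not the lavaan code used for the analyses: the names are hypothetical, and whether each model comparison "passes" depends on the preregistered fit criteria, which are not reproduced here and are supplied as plain booleans.

```python
from typing import Dict, List, Optional, Tuple

# Parameter sets constrained to equality across groups at each level,
# in the order described in the text (each level adds to the previous one).
INVARIANCE_STEPS: List[Tuple[str, List[str]]] = [
    ("configural", []),
    ("metric", ["loadings"]),
    ("scalar", ["loadings", "intercepts"]),
    ("strict", ["loadings", "intercepts", "residuals"]),
]


def highest_supported_level(passed: Dict[str, bool]) -> Optional[str]:
    """Return the strictest level reached before the first failed comparison."""
    highest = None
    for level, _constraints in INVARIANCE_STEPS:
        if not passed.get(level, False):
            break  # stricter levels are not interpreted once one level fails
        highest = level
    return highest
```

For example, a pattern in which the configural and metric models are acceptable but the scalar model is not returns "metric" as the highest supported level.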
Item Response Theory and Differential Item Functioning
To determine which of the 50 TAP items were the individual sources of the noninvariance, we conducted DIF analyses using the mirt package (version 1.31; Chalmers, 2012) for R statistical software, following the two-step DIF analytic procedures outlined by Chalmers et al. (2016). In the first step, we tested all 50 TAP items simultaneously for potential DIF using the one-factor CFA models to impose equality constraints on each individual item’s factor loadings. We then categorized each item as either a “test item” that potentially exhibited DIF (if the χ2 invariance test for that item was statistically significant [i.e., p < .05]), or as an “anchor item” that did not potentially exhibit DIF (the χ2 invariance test for that item was not statistically significant [i.e., p > .05]). In the second step of the analysis, we reran the CFAs that applied equality constraints to each test item’s slopes and intercepts. If the χ2 invariance test for any test item was statistically significant (i.e., p < .05), we deemed that item as exhibiting DIF. Both of these steps are necessary to obtain accurate parameter estimates.
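The decision logic of the two-step procedure can be sketched as follows. This Python fragment only mirrors the classification rules stated above; it does not fit any models (the analyses themselves were run with mirt in R), and the item names and p values in the usage example are hypothetical.

```python
from typing import Dict, List, Tuple

ALPHA = 0.05  # significance threshold stated in the text


def split_items(step1_p: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Step 1: items with p < ALPHA become test items; all others become anchors."""
    test_items = [item for item, p in step1_p.items() if p < ALPHA]
    anchors = [item for item, p in step1_p.items() if p >= ALPHA]
    return anchors, test_items


def flag_dif(step2_p: Dict[str, float]) -> List[str]:
    """Step 2: test items that remain significant are deemed to exhibit DIF."""
    return [item for item, p in step2_p.items() if p < ALPHA]


# Hypothetical p values for three items, for illustration only.
anchors, test_items = split_items({"item_a": 0.40, "item_b": 0.01, "item_c": 0.03})
dif_items = flag_dif({"item_b": 0.20, "item_c": 0.002})
```

In this made-up example, two items are carried into the second step but only one of them is ultimately flagged, which is the same kind of attrition between steps that the two-step design is meant to allow.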
Item Response Theory and Differential Test Functioning (DTF)
To determine the magnitude of any DIF effects on the validity of the overall TAP, we also conducted DTF analyses (Chalmers et al., 2016). Instead of focusing on individual items, DTF estimates the impact of DIF on a measure’s aggregated score. To do so, these analyses compute two statistics to describe whether two groups, given equivalent levels of the latent trait (e.g., aggression), differ significantly on their expected TAP scores. The signed DTF (sDTF) statistic reflects overall measurement bias, across all items, in favor of one group over another at the omnibus (i.e., test) level. The unsigned DTF (uDTF) statistic represents the degree to which the expected TAP scores differ between two groups at varying levels of the latent trait. The latter is commonly represented visually via expected score curves, which plot a given range of expected TAP scores as a function of varying levels of the latent trait, separately for each group. The uDTF reflects the degree to which the curves for each group overlap with each other. If one or both of these DTF statistics are statistically significant, then this indicates nontrivial DTF due to the noninvariant items identified in the DIF analyses.
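To make the two statistics concrete, the following Python sketch computes a signed and an unsigned difference between two groups' expected score curves, each averaged over a standard normal latent distribution. It is a simplified illustration in the spirit of Chalmers et al. (2016), not the mirt implementation: it assumes dichotomous two-parameter logistic items rather than the 0-10 TAP responses, and all item parameters are hypothetical.

```python
import math
from typing import List, Tuple

Item = Tuple[float, float]  # (discrimination a, difficulty b); hypothetical values


def p_2pl(theta: float, a: float, b: float) -> float:
    """Probability of endorsing a two-parameter logistic item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def expected_score(theta: float, items: List[Item]) -> float:
    """Expected total score at theta: the sum of the item probabilities."""
    return sum(p_2pl(theta, a, b) for a, b in items)


def dtf(ref: List[Item], foc: List[Item], lo: float = -6.0, hi: float = 6.0,
        n: int = 241) -> Tuple[float, float]:
    """Return (signed, unsigned) expected-score differences weighted by a normal density."""
    step = (hi - lo) / (n - 1)
    thetas = [lo + i * step for i in range(n)]
    dens = [math.exp(-0.5 * t * t) for t in thetas]
    total = sum(dens)  # normalize so the weights sum to 1
    signed = unsigned = 0.0
    for t, d in zip(thetas, dens):
        diff = expected_score(t, ref) - expected_score(t, foc)
        signed += (d / total) * diff          # differences can cancel out
        unsigned += (d / total) * abs(diff)   # differences cannot cancel out
    return signed, unsigned
```

When the two curves cross (e.g., items with equal difficulty but different discrimination), the signed statistic can be near zero while the unsigned statistic is clearly positive, which is why both are reported.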
Results
Confirmatory Factor Analyses
The fit of the single-factor model was unexpectedly poor, χ2(1,175) = 21,934.44, p < .001, RMSEA = .10, standardized root mean square residual (SRMR) = .07, TLI = .66, and CFI = .67. Standardized factor loadings for all TAP items are displayed in Supplemental Table 1 (available online).
Measurement Invariance by Gender
According to our preregistered criteria, the single factor CFA model’s fit to the TAP data exhibited configural and metric invariance, but not scalar or strict invariance by gender (0 = male, 1 = female; Table 2).
Model Fit Statistics for Each of the Gender Invariance Models Using the One-Factor Structure.
Note. df = degrees of freedom; RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean squared residual; TLI = Tucker–Lewis index; Model comp. = model being compared with (e.g., comparing Model 1 [M1] with Model 2 [M2]).
Measurement Invariance by Provocation
The single factor CFA model’s fit to the TAP data again exhibited configural and metric invariance, but not scalar or strict invariance by provocation condition (1 = provoked, 0 = unprovoked; Table 3).
Model Fit Statistics for Each of the Provocation Invariance Models Using the One-Factor Structure.
Note. df = degrees of freedom; RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean squared residual; TLI = Tucker–Lewis index; Model comp. = model being compared with (e.g., comparing Model 1 [M1] with Model 2 [M2]).
Exploratory Factor Analyses
Due to the poor model fit of the initial single-factor CFA, we conducted an exploratory factor analysis (EFA) on all 50 TAP items to identify a more appropriate underlying structure of the TAP data using the stats (version 3.6.0; R Core Team, 2019) and nFactors (version 1.9.12; Raiche & Magis, 2010) packages for R statistical software. The factor solution was examined using both varimax and promax rotations of the factor loading matrix, which did not produce meaningfully different factor solutions. For the sake of simplicity, we therefore only present EFA results that used varimax rotation. Based on a parallel analysis conducted with the nFactors package, five factors from the EFA were initially retained, which explained 55% of the variance (Figure 1; Table 4).

Scree plot depicting the eigenvalues of each factor from the parallel analysis.
Results From the Horn’s Parallel Analysis of All 50 Taylor Aggression Paradigm Items.
We retained 34 of the 50 TAP items, each of which exhibited a factor loading exceeding |.40|. Sixteen TAP items were removed because they exhibited factor loadings below this threshold or because they exhibited cross-factor loadings within |.20|. Removing these items left the fifth factor with no items that exhibited sufficient factor loadings. Therefore, this fifth factor was eliminated and a four-factor solution was adopted (Figure 2).
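The retention rule can be written as a short Python sketch. It rests on an interpretive assumption that should be read with caution: "cross-factor loadings within |.20|" is taken here to mean that an item's second-largest absolute loading lies within .20 of its largest one. The function name and the loading values in the tests are hypothetical.

```python
from typing import Sequence


def retain_item(loadings: Sequence[float], min_primary: float = 0.40,
                min_gap: float = 0.20) -> bool:
    """Keep an item if its primary loading exceeds |.40| and no other factor
    loads within .20 of it (the second condition is an assumed reading)."""
    sizes = sorted((abs(x) for x in loadings), reverse=True)
    primary = sizes[0]
    secondary = sizes[1] if len(sizes) > 1 else 0.0
    return primary > min_primary and (primary - secondary) > min_gap
```

Under this reading, an item loading .62 on one factor and near zero on the others is retained, whereas an item loading .55 and .45 on two factors is dropped as cross-loading.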

Trials of the Taylor Aggression Paradigm categorized by the factor they loaded onto (in gray), alongside trials with problematic cross-factor loadings (in red).
Confirmatory factor analyses revealed that the four-factor solution derived from the EFA showed modest fit to the data, χ2(521) = 6278.29, RMSEA = .08, SRMR = .05, CFI = .86, TLI = .85. However, this model fit was substantially improved compared to our original, single-factor model, Δχ2(654) = 15656.10, p < .001. Standardized factor loadings for all TAP items are displayed in Supplemental Table 2 (available online).
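The degrees of freedom in these tests can be checked by counting parameters. The Python sketch below assumes a covariance-structure CFA in which each item loads on exactly one factor, one loading per factor is fixed to 1, factors are free to correlate, and there are no residual covariances; these assumptions are ours, although they reproduce the reported values.

```python
def cfa_df(n_items: int, n_factors: int) -> int:
    """Degrees of freedom of a simple-structure CFA under the stated assumptions."""
    moments = n_items * (n_items + 1) // 2            # unique variances and covariances
    free_loadings = n_items - n_factors               # one marker loading fixed per factor
    factor_variances = n_factors
    factor_covariances = n_factors * (n_factors - 1) // 2
    residual_variances = n_items
    free_parameters = (free_loadings + factor_variances
                       + factor_covariances + residual_variances)
    return moments - free_parameters


one_factor_df = cfa_df(50, 1)    # 50 items, single aggression factor
four_factor_df = cfa_df(34, 4)   # 34 retained items, four factors
```

This gives 1,175 for the one-factor model and 521 for the four-factor model, and their difference (654) matches the degrees of freedom of the reported Δχ2 test.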
Measurement Invariance by Gender
As with the single-factor model, the four-factor model’s fit to the 34 TAP items exhibited configural and metric invariance, but not scalar or strict invariance (Table 5).
Model Fit Statistics for Each Gender Invariance Model Using the Four-Factor Structure.
Note. df = degrees of freedom; RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean squared residual; TLI = Tucker–Lewis index; Model comp. = model being compared with (e.g., comparing Model 1 [M1] with Model 2 [M2]).
Measurement Invariance by Provocation
As in the single-factor model, the four-factor model’s fit to the TAP data exhibited configural and metric invariance, but not scalar or strict invariance (Table 6).
Model Fit Statistics for Each Provocation Invariance Model Using the Four-Factor Structure.
Note. df = degrees of freedom; RMSEA = root mean square error of approximation; CFI = comparative fit index; SRMR = standardized root mean squared residual; TLI = Tucker–Lewis index; Model comp. = model being compared with (e.g., comparing Model 1 [M1] with Model 2 [M2]).
Exploratory Differential Item Functioning Analyses
To determine which of the 50 TAP items were the individual sources of the noninvariance we observed in our prior analyses, we conducted DIF analyses.
DIF by Gender
In the first step of the DIF analyses, only one of the 50 TAP items (i.e., the duration of TAP Trial 6) initially exhibited potential DIF between men and women, Akaike information criterion (AIC) = −5.49, Bayesian information criterion (BIC) = −0.01, χ2(1) = 7.49, p = .006 (all 50 initial DIF test results are presented in Supplemental Table 3 [available online]; all 50 item information plots, separated by gender, are depicted in Supplemental Document 1 [available online]). The second phase of the DIF analysis revealed that this item no longer showed significant DIF, AIC = 1.22, BIC = 6.69, χ2(1) = 0.78, p = .377.
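The p values attached to these single-degree-of-freedom χ2 tests can be verified with the Python standard library. The sketch relies on the fact that a χ2 variate with 1 degree of freedom is the square of a standard normal variate; the function name is ours.

```python
import math


def chi2_sf_df1(x: float) -> float:
    """Upper-tail p value for a chi-square statistic with 1 degree of freedom."""
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)) for a standard normal Z.
    return math.erfc(math.sqrt(x / 2.0))


p_step1 = chi2_sf_df1(7.49)  # first-step test for the duration of Trial 6
p_step2 = chi2_sf_df1(0.78)  # second-step test for the same item
```

These evaluate to roughly .006 and .377, in agreement with the values reported above.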
Follow-up DTF analyses suggested that the effect of the one item that initially exhibited DIF was negligible on the overall TAP, sDTF = −1.25 (95% confidence interval [CI] [−17.36, 12.73]), uDTF = 7.51 (95% CI [4.94, 18.79]), omnibus p = .866. These inferential results were reflected in the largely overlapping expected score plots of men and women (Figure 3).

Expected Taylor Aggression Paradigm scores as a function of trait levels of aggression between men (0) and women (1).
DIF by Provocation
Unlike the gender-based DIF analyses, 47 of the 50 TAP items initially showed potential DIF between provoked and unprovoked participants (see Supplemental Table 4 [available online], for DIF results for each of the 50 TAP items and see Supplemental Document 2 [available online], for individual item information plots). However, the second phase of the DIF analysis revealed that only seven items ultimately showed significant DIF (Table 7; see Supplemental Document 2 [available online], for individual expected score plots). Six of these seven DIF items were localized to the last eight trials of the TAP (i.e., Trials 18-25).
Fit Statistics and Significance Tests for TAP Items That Showed Differential Item Functioning Between Provoked and Unprovoked Participants.
Note. TAP = Taylor Aggression Paradigm; AIC = Akaike information criterion; BIC = Bayesian information criterion.
Follow-up DTF analyses suggested that the effect of the items that exhibited DIF was negligible on the overall TAP, sDTF = −1.22 (95% CI [0.16, 0.79]), uDTF = 3.89 (95% CI [1.39, 6.98]), omnibus p = .204. These inferential results were reflected in the largely overlapping expected score plots of provoked and unprovoked participants (Figure 4).

Expected Taylor Aggression Paradigm scores as a function of trait levels of aggression between provoked (1) and unprovoked (0) participants.
Item Informativeness by Provocation (Exploratory)
We observed an interesting pattern in the item information curves that we computed as part of our DIF analyses, in which many trials exhibited curves that portrayed poor informativeness (e.g., Supplemental Figure S5 [available online]). Less informative items appeared to be (a) at the beginning or end of the task and (b) preceded by high or low opponent provocation on the previous trial. As part of subsequent exploratory analyses, we extracted area under the curve (AUC) values from each trial’s item information curve (collapsing across duration and volume settings and excluding the first TAP trial because it was not preceded by any noise blasts from the opponent). Plotting these 24 AUC values against the TAP opponent’s provocation level from the previous trial (averaging the preset duration and volume settings along their 0-10 continuum within each trial) revealed a curvilinear relationship between these two variables (Figure 5). Indeed, the informativeness of each trial was lower for trials that followed low and high provocation and was highest for trials that followed moderate provocation, peaking at the midpoint of provocation (i.e., 5).
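The AUC summary of an item information curve can be illustrated with a simplified Python sketch. It uses a dichotomous two-parameter logistic item as a stand-in for the 0-10 TAP responses, so it does not reproduce the models fitted in mirt; the function names, integration range, and parameter values are assumptions made for illustration.

```python
import math


def item_information_2pl(theta: float, a: float, b: float) -> float:
    """Fisher information of a two-parameter logistic item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)


def information_auc(a: float, b: float, lo: float = -10.0, hi: float = 10.0,
                    n: int = 2001) -> float:
    """Area under the item information curve, via the trapezoidal rule."""
    step = (hi - lo) / (n - 1)
    values = [item_information_2pl(lo + i * step, a, b) for i in range(n)]
    return step * (sum(values) - 0.5 * (values[0] + values[-1]))
```

For this simplified model the total area equals the discrimination parameter a, so a larger AUC corresponds to an item that separates low and high trait levels more sharply.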

The curvilinear association between the informativeness of TAP Trials 2-25 and their opponent’s provocation level on the previous trial.
Discussion
The TAP, and the many variants thereof, is an important measure of aggression in the laboratory. Yet, the TAP’s ability to measure aggression in a similar way across men and women and across experimentally provoked and unprovoked participants has remained uninvestigated. In this investigation, we estimated this unknown psychometric quality by combining TAP data from 1,856 participants across nine studies, testing the hypotheses that this paradigm would exhibit measurement invariance across gender and experimental provocation groups.
Factor Structure of the TAP
The initial one-factor solution to the TAP data exhibited poor model fit. This is somewhat in line with our previous research demonstrating that a multifactor structure may fit the data better than a single factor (Chester & Lasko, 2019). Subsequently, an EFA returned a four-factor solution. Unexpectedly, the model fit of this four-factor model was also relatively poor (although still an improvement on the one-factor model). These four factors did not appear to map well onto any structural aspects of the TAP that investigators have previously used to delineate different metrics of aggression (e.g., the first unprovoked trial vs. subsequent provoked trials; Lawrence & Hutchinson, 2013). It remains uncertain why the data structure aggregated into these four factors. The provocation levels among the factors did differ; however, these differences were modest (i.e., provocation point difference of ~1) and thus, we believe, are insufficient to warrant a distinction between “low,” “moderate,” and “high” provocation factors. Nonetheless, these interpretations are inherently subjective; as such, readers and other scholars are welcome to reinterpret our results to reach their own conclusions. More research is needed to test the replicability of the four-factor structure we observed. If it is replicated, the underlying reasons for this data structure and the implications it may have for improved quantification strategies for the TAP should then be investigated. Although the first TAP trial did not load as strongly onto the first factor as the other trials, this cannot be interpreted as conclusive evidence that it represents a separate construct, as the difference in the factor loadings was a modest one at best. Rather, our findings overall suggest that researchers’ treatment of the first, unprovoked TAP trial as a measure of a different construct than later trials may in fact be unwarranted. However, this interpretation remains subjective and additional research in this area is needed to confirm these conclusions.
Factor analyses also showed that the duration and volume settings from the same trial almost always loaded onto the same factor, suggesting that these two metrics are not meaningfully distinct from one another. Volume and duration settings also exhibited similar levels of DIF. Thus, the duration and intensity settings for the TAP’s aversive stimuli may largely be redundant and quantification strategies that aggregate across these two aggression modalities are employing a valid means of increasing the internal consistency of the TAP by combining indices of the same construct (Chester & Lasko, 2019). However, the use of both volume and duration indices requires participants to make twice as many responses (i.e., decisions) as they would if only one index were used. Given the well-established literature on decision fatigue (e.g., Pignatiello et al., 2020), this use of redundant settings on each trial inflicts meaningful costs on participants that may bias their responses. Therefore, future versions of the TAP might benefit from using only the duration or volume setting as a means to reduce participant burden without undermining the reliability or validity of the TAP.
Both the one-factor and four-factor solutions failed to exhibit adequate model fit to the TAP data, which may undermine the validity of our previous prescription to take a one-factor approach to scoring the TAP (Chester & Lasko, 2019). Even the four-factor solution, the “best” fitting model as determined by exploratory factor analyses, did not exhibit good model fit. This inability to find adequate model fit to the data is reflective of a broader psychometric trend, in which the confirmatory model fit of even the most widely used and well-validated questionnaires (e.g., the Big Five Inventory) often fails to reach adequacy in independent data sets (Hussey & Hughes, 2020). It may be that the experimental elements of the task (e.g., varying levels of provocation, wins, and losses) or the nested structure of the data (i.e., volume and duration items nested within trials) undermined the emergence of any coherent factor solution. Perhaps a hierarchical factor structure is more appropriate, necessitating a bifactor model. Future work is needed to determine the underlying reasons for this poor model fit, such as by systematically varying the experimental elements of the task and estimating their effects on the factor structure or by accounting for the TAP’s nested data structure (e.g., via exploratory structural equation modeling). Doing so would not only improve our psychometric understanding of the TAP but would also serve as an important example for future work on the structural validity of behavioral laboratory tasks. In the end, we are unable to determine the underlying reason for these poor model fit estimates and urge caution in interpreting our findings in light of these issues.
Invariance by Gender
We found that the TAP exhibited both configural and metric invariance across men and women, which supports our hypotheses and the validity of this task across both groups. However, we did not observe evidence of “strong” invariance (i.e., scalar and strict invariance). These latter forms of invariance are often difficult for scholars to obtain, given their harsh assumptions and standards (Asparouhov & Muthén, 2014; Davidov et al., 2018). Although not unrealistic or unachievable in all cases, strict invariance is a high threshold. Indeed, only 4% of a sample of 15 commonly used measures exhibited such equivalent factor structure, factor loadings, and item-level intercepts across groups (i.e., strict invariance; Hussey & Hughes, 2020). It may be particularly difficult to obtain this level of invariance with larger samples, as it becomes easier to detect smaller between-group differences (i.e., noninvariance). This may have been particularly true for the large sample that we employed, considering the results of our IRT analyses. Specifically, our DIF and DTF analyses suggested that any noninvariance across gender groups had little influence on the overall measure. As such, the preponderance of the evidence supports the claim that the TAP is a valid aggression measure across these gender identity groups and even suggests that men’s and women’s aggression may not be as different as many assume.
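The point that larger samples make small between-group differences easier to detect can be illustrated with a simple analogy. The sketch below is not the invariance test used in our analyses; it assumes a plain two-sample z test on a hypothetical standardized mean difference of 0.10 between two equal-sized groups with unit variance, and shows how the same difference moves from nonsignificant to significant as the sample grows.

```python
import math

def two_sample_z(d, n_per_group):
    """z statistic for a standardized mean difference d between two
    equal-sized groups with unit variance."""
    return d * math.sqrt(n_per_group / 2.0)

def two_sided_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

small_difference = 0.10  # hypothetical effect size, chosen for illustration

for n in (100, 1000, 10000):
    z = two_sample_z(small_difference, n)
    print(n, round(z, 2), round(two_sided_p(z), 4))
```

The size of the difference never changes across the three rows; only the sample size does, which is why a statistically detectable departure from invariance in a large sample need not be a practically important one.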
Invariance by Provocation
As with gender groups, experimentally provoked and unprovoked groups of participants exhibited configural and metric invariance, though not scalar or strict invariance. For reasons outlined above, we take these findings to be generally supportive of our prediction that provoked and unprovoked groups would exhibit invariance on the TAP. We further found that any noninvariance likely arose from nine items that largely occurred in later trials of the TAP, though these noninvariant items had a negligible influence on the overall TAP’s validity. It remains unclear why later trials would be more likely to exhibit noninvariance; perhaps experimental provocation exacerbates fatigue effects that present toward the end of the task. Accordingly, the invariance of the TAP across experimental provocation conditions might be improved by removing later trials, though further work is needed to empirically validate such possibilities. Alternatively, it is possible that greater provocation (i.e., louder noise blast settings by the opponent) near the end of the task, compounded by the provocation received prior to the task, created a ceiling effect that contributed to the noninvariance we observed for these trials. Nonetheless, these findings overall suggest that experimental provocation manipulations that precede the TAP do not invalidate this paradigm and that the TAP is a valid aggression measure across both unprovoked and provoked conditions.
Informativeness of TAP Trials Based on Prior Provocation
The provocation that is inherent to the task is intended to influence participants’ aggressive behavior. Yet our analyses revealed it may have another, unseen influence. The informativeness of each trial of the TAP (i.e., the extent to which each trial reveals meaningful information about the latent aggression construct) exists as a curvilinear function of provocation. Trials following relatively high or low provocation are less informative than those following moderate levels of provocation. This is likely due, in part, to ceiling and floor effects that arise from the human tendency toward reciprocation. For instance, if the opponent selected a 10, most participants then retaliated toward the upper ceiling of the aggression response range. Conversely, opponents who selected low levels of provocation tended to elicit a reciprocal response toward the floor of the aggression response range. After a moderate amount of provocation, by contrast, participants were free to respond broadly in either direction of the response scale. Furthermore, ambiguous levels of provocation allowed for individual differences in the tendency toward retaliatory escalation or conciliatory de-escalation to express themselves. Our findings suggest that the common practice of focusing only on TAP items that follow low or high provocation may be psychometrically unsound for samples composed of undergraduate student populations. Within this population, a high provocation-focused or low provocation-focused psychometric approach may obscure true underlying effects that are only detectable among the more informative trials of the TAP. Trials that involve high or low provocation may have more utility among clinical populations (e.g., borderline personality disorder, antisocial personality disorder) or aggressive forensic populations. More research is needed to determine how best to optimize quantification strategies for the TAP that take advantage of the differential informativeness of the trials, especially across population types (Hyatt, Chester, et al., 2019).
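The ceiling and floor account above can be illustrated with a minimal sketch of item information. For simplicity it assumes a dichotomous two-parameter logistic (2PL) item, which is a simplification for illustration and not necessarily the model fitted to the TAP data. The discrimination and location values are hypothetical, chosen so that one item is endorsed by almost everyone (as after high provocation), one by almost no one (as after low provocation), and one by about half of participants (as after moderate provocation).

```python
import math

def item_information(theta, a, b):
    """Fisher information of a 2PL item with discrimination a and
    location b, evaluated at latent trait level theta."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical (a, b) parameters; NOT estimates from the TAP data.
items = {
    "after_low_provocation": (1.5, 2.5),    # few respond aggressively (floor)
    "after_moderate_provocation": (1.5, 0.0),
    "after_high_provocation": (1.5, -2.5),  # most respond aggressively (ceiling)
}

# Information each item provides for a participant of average aggression.
theta = 0.0
info = {name: item_information(theta, a, b) for name, (a, b) in items.items()}

# One possible weighting scheme: weights proportional to information.
total = sum(info.values())
weights = {name: value / total for name, value in info.items()}

for name in items:
    print(name, round(info[name], 4), round(weights[name], 3))
```

Under these assumed parameters, the item located near the center of the trait distribution carries most of the information, which mirrors the suggestion that scoring approaches might give greater weight to trials following moderate provocation. Information-proportional weighting is only one of several ways such a scheme could be built.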
Limitations and Future Directions
The chief limitation of this project was that participants were all undergraduate students. Although college students are not devoid of aggressive tendencies, there are likely many differences between this population and broader swaths of humanity. Future research can test whether our findings will replicate in more representative samples or in clinical or forensic groups characterized by heightened aggression. Furthermore, the measurement invariance of the TAP across different cultures remains unknown. A valuable enterprise going forward will be to administer this measure across many cultures and test its psychometric invariance.
We also took a binary categorical approach to gender identity, which is well known to exist along a nonbinary continuum. We were forced into this position by the available data, which only asked participants to report whether they identified as male or female. Future research must examine the invariance of the TAP across nonbinary identities and model female and male gender identities as continuous spectra. It may also prove valuable to investigate invariance across biological sex categories and examine whether findings with this variable agree with or diverge from those for participants’ gender identities.
Additionally, we only assessed the sound blast version of the TAP, not other versions of the TAP that employ different modalities of aversive stimuli (e.g., shocks), due to the nature of the existing data we analyzed. Future research should thoroughly examine the psychometric properties of other aggression modalities, which should be subjected to the same methodological scrutiny.
Conclusions
Reducing aggression requires understanding it, and understanding aggression requires that it be accurately measured. The TAP is a premier aggression measure that enables experimenters to examine the personal and situational factors that make people more or less violent. Using a well-powered and preregistered approach, we demonstrated that the TAP exhibits measurement invariance across men and women and across experimentally provoked and unprovoked participants. We hope these results lend confidence to the use of the TAP to identify meaningful between-group differences in aggression that are not an artifact of poor psychometric properties. More broadly, we hope that investigators will continue to assess and improve the validity of laboratory aggression measures, in hopes that doing so will promote the ultimate reduction of harmful behaviors.
Supplemental Material
Supplemental material (sj-pdf-1-asm-10.1177_1073191121996450, sj-pdf-2-asm-10.1177_1073191121996450, and sj-pdf-3-asm-10.1177_1073191121996450) for Measurement Invariance and Item Response Theory Analysis of the Taylor Aggression Paradigm by Emily N. Lasko and David S. Chester in Assessment.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.