Abstract
First impressions are commonly assumed to be particularly important: Information about a person that we obtain early on may shape our overall impression of that person more strongly than information obtained later. In contrast to previous research, the present series of preregistered analyses uses actual person judgment data to investigate this so-called primacy effect: Perceivers (N = 1,395) judged the videotaped behavior of target persons (N = 200) in 10 different situations. Separate subsamples of about 200 perceivers each were used in moving from exploratory to increasingly confirmatory analyses. Contrary to our expectations, no primacy effect was found. Instead, judgments of the targets in later situations were more strongly associated with overall impressions, indicating an acquaintance effect. Relying on early information seems unreasonable when more comprehensive information is readily available. Early information may, however, affect perceivers’ behavioral reactions to the targets and thus their future interactions, if such interactions are possible.
The idea of a primacy effect in interpersonal impression formation, where “information presented early in a sequence has more influence on final judgments than information presented late in the sequence” (Tetlock, 1983, p. 286), has been entertained by psychology for quite some time (see Asch, 1946). Various explanations for such an alleged predominance of early impressions have been proposed: For example, perceivers may interpret information presented later in a sequence in accordance with their first impressions (e.g., because they want to be consistent or because they use early information as an anchor and fail to adjust afterward; e.g., Hogarth & Einhorn, 1992; Steiner & Rain, 1989). Perceivers may also pay less attention to the information presented later, believing that the most important information likely becomes available first or simply due to their own fatigue or boredom (e.g., Hendrick & Constantini, 1970; Stewart, 1965).
The relevant studies typically employed Asch’s (1946) classic paradigm: A sequence of trait adjectives (e.g., “kind,” “shallow”) used to describe a fictional target person is presented to participants, who are then asked to judge said target. While primacy effects were often evident (e.g., Anderson, 1965; Hendrick & Constantini, 1970; Sullivan, 2019), many studies also showed that different conditions (e.g., modes of responding) could influence whether primacy effects occurred, were absent, or were even replaced by recency effects (i.e., disproportionate impact of more recent information; e.g., Forgas, 2011; Richter & Kruglanski, 1998; Stewart, 1965).
Most notably, however, these studies did not investigate judgments of the target persons’ actual behavior. The present study does so—it uses real person judgment data (i.e., perceivers judge the actual behavior of targets across different situations). Based on how the primacy effect is commonly conceptualized (see above), we preregistered the following expectations: First, observations of a target that a perceiver makes early on will shape that perceiver’s overall judgment of that target more strongly than later observations. Such a main effect of presentation order is what the primacy effect refers to, in essence. Second, we expect a main effect of “aggregate size”: The contribution of individual judgments to an overall judgment becomes weaker the more of the former are averaged to obtain the latter. This assumption is simply logical and not particularly interesting. Aggregate size is important, however, because in smaller aggregates early information competes with less subsequent information, which may make the occurrence of a primacy effect more likely. Therefore, we expect that, third, aggregate size moderates the strength of the primacy effect, with the possibly disproportionate influence of early information becoming mitigated the more subsequent information is being presented.
Method
The present study uses part of a larger data set (Wiedenroth & Leising, 2020) pertaining to various aspects of behavior observation and interpersonal perception. From this larger data set, we use all data relevant to the possible existence of primacy effects. We report how we determined our sample size, all data exclusions, all experimental manipulations, and all measures relevant to the study. The supplements mentioned in the following (osf.io/fs6gh), as well as the data itself (osf.io/tnq8g), are available on the Open Science Framework (OSF).
Target Sample
Two hundred target persons were recruited from the general (German) population by multiple means (e.g., newspaper and online ads). This sample size was set a priori in the project’s grant proposal (LE 2151 / 6-1, German Research Foundation) and was based on expected effect sizes, including several effects that are not part of this paper. With this sample size of 200, effect sizes of |ρ| = .20 (two-tailed, α = .05) can be detected with more than 80% power (G*Power Version 3.1.9.2; Faul et al., 2014). We aimed for some representativeness (as compared to typical student samples) by trying to recruit equal numbers of male and female participants and equal numbers of participants above and below the age of 30. The final target sample comprised 98 male and 102 female targets between the ages of 17 and 80 years (M = 33.29, SD = 14.48).
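The power statement above can be checked approximately. The following is a minimal sketch that uses the Fisher z approximation for a test of ρ = 0; this is an assumption made for illustration and is not necessarily the routine that G*Power applies. The function names are hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def approx_power_correlation(rho: float, n: int, z_crit: float = 1.959964) -> float:
    """Approximate two-tailed power for testing rho = 0 (Fisher z approximation).

    z_crit is the critical value of the standard normal for alpha = .05.
    """
    # Expected test statistic under the alternative: atanh(rho) * sqrt(n - 3)
    shift = math.atanh(abs(rho)) * math.sqrt(n - 3)
    # Probability of landing in either rejection region
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)


power = approx_power_correlation(rho=0.20, n=200)
print(f"approximate power: {power:.3f}")
```

Under this approximation, the value for N = 200 and |ρ| = .20 lies slightly above .80, in line with the statement in the text.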
The targets were invited to the laboratory and videotaped individually in 20 different, standardized situations, which were supposed to make personality differences between them visible. Some of these situations were adapted from previous research (e.g., Borkenau et al., 2004; Leising et al., 2014). Each situation lasted a few minutes only and consisted of a specific task for the target to engage with. For example, targets were asked to sing a song of their own choice, to describe what they would do with a million Euro, or to plan a hypothetical party for 50 people (for the complete list of situations, see OSF Supplement A). To avoid order effects, we used Latin Squares (Williams, 1949) to determine the order in which the targets completed these situations. Targets received €40 for participating.
Perceiver Sample
The videotapes were then watched and judged by a different group of participants. These perceivers were randomly assigned to one of two different half-block designs: In the “between-target” condition, perceivers watched 10 different targets in the same situation, whereas in the “within-target” (WT) condition, perceivers watched one target in 10 different situations. The present study only uses data from the WT condition.
Each of 1,400 perceivers in the WT condition was supposed to watch one of the 200 targets in 10 different situations (using an online platform accessible through a personalized link that could only be used once). Perceivers were allocated randomly to a link that specified (in anonymized form) (a) the target, (b) the 10 situations in which that target was to be observed, and (c) the order in which those situations would be presented. Each of the 200 targets was randomly assigned seven different perceivers (thus the total of 1,400 perceivers). Within each group of seven perceivers judging the same target, the overlap among the 10 (out of 20) situations in which the individual perceivers got to see their respective target was systematically varied to enable an investigation of the effects of information overlap (Kenny, 1994), which is not relevant here but helpful to know for understanding the data structure. Situations and their presentation order were assigned to the perceivers using a combination of Latin Squares and randomization: First, the 200 targets were divided into 10 blocks of 20 targets and for each block, the 20 situations were distributed across the 20 targets using Latin Squares. This determined which situations were presented to each of the seven perceivers per target (e.g., the first perceiver was assigned the first 10 situations for their target from the Latin Square). Then, the order of the ten assigned situations was randomized for each perceiver.
Directly after watching “their” target in one of the 10 situations, the perceivers rated the target’s personality based on their behavior in that video. For these ratings, they used a list of 30 person-descriptive German adjectives from Borkenau and Ostendorf (1998; see OSF Supplement B), which will be abbreviated as “BO-30” in the following. In this measure, each of the Big Five personality domains (Agreeableness [A], Conscientiousness [C], Extraversion [E], Neuroticism [N], and Openness to Experience [O]; which was named Intellect in the original study) is assessed by six adjectives of which three load positively and three load negatively on the domain. Ratings were made on a 5-point scale ranging from 1 = does not apply at all (“trifft überhaupt nicht zu”) to 5 = applies exactly (“trifft genau zu”).
Criteria for data exclusion were set prior to data analysis. Perceivers had to be at least 18 years of age and they had to complete the entire assessment process to be included. Additional reasons for exclusion were, for example, knowing the target or responding carelessly (for more details, see Wiedenroth & Leising, 2020). With these criteria, the final perceiver sample included 1,395 instead of the 1,400 perceivers we had originally aimed for. Perceivers were recruited nation-wide through multiple channels (e.g., newspaper and Facebook ads). The final perceiver sample consisted of 528 male and 867 female participants between the ages of 18 and 76 (M = 28.27, SD = 10.54). They were compensated with €10 for their participation.
Analysis Plan
Our preregistered analysis plan (osf.io/f3he4) documents all data analytic steps that we undertook to test our three main hypotheses, from the earliest and rather exploratory analyses to the latest and strictly confirmatory ones. Altogether, we had access to seven independent samples of about 200 perceivers each (because each of the 200 targets was judged by seven different randomly assigned perceivers). Within each of these seven perceiver samples, each perceiver successively watched and judged one of the 200 targets in 10 different situations. The position of the individual situations in a perceiver’s sequence of judgments of the same target constitutes our primary predictor of interest.
All analyses presented below are based on domain-wise ratings: The perceivers’ ratings of the targets on the individual items of the BO-30 were aggregated into the five scale scores (A, C, E, N, and O) as outlined by Borkenau and Ostendorf (1998). This was first done for each rating of a target in an individual situation by a perceiver. The resulting scale scores were then further processed (e.g., averaged across situations) for subsequent analyses (see below). Expectations were the same for all five domains. Analyses were performed using SPSS (Version 25.0), R (Version 3.5.1; R Core Team, 2018), and RStudio (Version 1.1.463; RStudio Team, 2016) using the package boot for bootstrapping (Version 1.3-20; Canty & Ripley, 2017).
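The domain-wise aggregation described above can be illustrated with a short sketch. It assumes that a scale score is the mean of the six items of a domain after reverse-keying the three negatively loading items on the 5-point scale; the exact scoring rules are those of Borkenau and Ostendorf (1998), and the item labels used here are hypothetical.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 5  # endpoints of the 5-point rating scale


def scale_score(ratings, positive_items, negative_items):
    """Mean of one domain's items after reverse-keying the negative ones.

    ratings maps (hypothetical) item labels to ratings between 1 and 5.
    """
    kept = [ratings[item] for item in positive_items]
    # Reverse-keying on a 1-5 scale: 1 <-> 5, 2 <-> 4, 3 stays 3
    flipped = [SCALE_MIN + SCALE_MAX - ratings[item] for item in negative_items]
    return mean(kept + flipped)


# One perceiver's ratings of one target in one situation (made-up values)
ratings = {"e1": 4, "e2": 5, "e3": 4, "e4": 2, "e5": 1, "e6": 2}
extraversion = scale_score(ratings, ["e1", "e2", "e3"], ["e4", "e5", "e6"])
print(round(extraversion, 2))
```

In this sketch, one such score would be computed per perceiver, situation, and domain before any averaging across situations.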
Dependent variable
We computed aggregates (i.e., the mean) of personality ratings across an increasing number of situations (separately for each perceiver and Big Five domain). That is, we used the order in which the situations had been presented to a perceiver and aggregated this perceiver’s ratings of their target across the first two situations, the first three situations, and so on, resulting in nine different aggregates (Size 2–10) for each perceiver. The mean levels of these aggregates represent the overall impressions that the perceivers had of the targets, after absorbing different amounts of information.
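As a minimal sketch of this step, the nine aggregates for one perceiver and one domain can be obtained as running means over the ratings in presentation order. The function name and the example values are hypothetical.

```python
def cumulative_aggregates(ratings_in_order):
    """Map each aggregate size k (2..n) to the mean of the first k ratings."""
    aggregates = {}
    running_total = ratings_in_order[0]
    for k in range(2, len(ratings_in_order) + 1):
        running_total += ratings_in_order[k - 1]
        aggregates[k] = running_total / k
    return aggregates


# Scale scores of one perceiver for one domain, Positions 1-10 (made-up values)
example = [3.0, 3.5, 2.5, 4.0, 3.0, 3.5, 4.0, 3.0, 3.5, 4.0]
aggs = cumulative_aggregates(example)
print(sorted(aggs))  # aggregate sizes 2 through 10
```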
As a next step, we computed all uncorrected item–total correlations between Big Five ratings of the targets in individual situations (i.e., separately for situations shown in Position 1–10) and the aforementioned aggregates. More precisely, each individual judgment based on one specific situation was correlated with all aggregates to which it contributed. Using these uncorrected item–total correlations, we investigated how strongly an “item” (i.e., a situation) contributed to the perceivers’ overall impressions, which is what the primacy effect is about. For example, ratings based on the situation in which a perceiver had seen their target first (Position 1) were correlated with the aggregate of the first two situations, the first three situations, and so on; ratings based on the situation in Position 3 were correlated with the aggregate of the first three situations, the first four situations, and so on. Note that only two such correlations exist for aggregates comprising two situations, whereas 10 such correlations exist for aggregates comprising 10 situations. This resulted in 54 correlation coefficients per Big Five domain, all of which were based on the same set of perceiver–target dyads. These correlations were then Fisher’s Z-transformed, to enable using them as the dependent variable in our linear regression models.
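The following sketch shows one way to obtain the 54 Fisher Z-transformed item–total correlations for one domain. It runs on simulated, independent ratings, not on the study data, and all function and variable names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_total_z(dyads):
    """Fisher z of uncorrected item-total correlations.

    dyads: one list per perceiver-target dyad, holding the scale scores
    for Positions 1..n in presentation order.
    Returns a dict keyed by (position, aggregate_size).
    """
    n_positions = len(dyads[0])
    result = {}
    for size in range(2, n_positions + 1):
        totals = [sum(d[:size]) / size for d in dyads]
        # Only positions that contribute to the aggregate are correlated with it
        for position in range(1, size + 1):
            items = [d[position - 1] for d in dyads]
            result[(position, size)] = math.atanh(pearson(items, totals))
    return result


rng = random.Random(0)
simulated = [[rng.gauss(3.0, 1.0) for _ in range(10)] for _ in range(200)]
z = item_total_z(simulated)
print(len(z))  # 2 + 3 + ... + 10 combinations of position and aggregate size
```

Because each item is part of its own total, these correlations are positive even for unrelated ratings and shrink as the aggregate grows, which is why aggregate size has to be modeled alongside position.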
Model testing
To test our three hypotheses, we explored how this pattern of correlations could be predicted from (1) the position of the individual situations in the perceivers’ sequences of judgments, (2) aggregate size (i.e., the number of individual situations across which ratings were averaged to derive the perceivers’ overall judgments of the targets), and (3) their interaction. We also considered the possibility that the expected effects may be curvilinear rather than linear in nature.
All predictors were mean-centered. Our primary index of model fit was the determination coefficient R², which compares the amount of variance explained by a given model to the amount of total variance in the data. It is important to note that the determination coefficient only equals the squared Pearson correlation coefficient R² in some cases and that the latter cannot be used as a fit index when comparing models where all parameters are freely estimated with models where some coefficients are fixed (e.g., a given intercept or slope). To determine the significance of model fit and regression weights, we calculated 99% bootstrap confidence intervals (CIs; bias-corrected and accelerated). For this, the complete procedure (i.e., calculating the 54 item–total correlations and then fitting models to predict them) was bootstrapped with 100,000 resamplings being drawn from the 200 cases (i.e., the 200 target–perceiver dyads). To compare fit between models, the difference in R² and the bootstrap CI for this difference were calculated.
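The resampling logic can be sketched as follows. Note the simplifications, all of which are assumptions made for illustration: the sketch uses a plain percentile interval and not the bias-corrected and accelerated interval reported here, it draws 2,000 and not 100,000 resamples, and the bootstrapped statistic is a single Fisher z value and not the full pipeline of 54 correlations plus model fitting. The data are simulated and all names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def percentile_bootstrap_ci(cases, statistic, n_resamples=2000, level=0.99, seed=1):
    """Percentile bootstrap CI: resample whole cases with replacement."""
    rng = random.Random(seed)
    n = len(cases)
    estimates = sorted(
        statistic([cases[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_resamples)]
    upper = estimates[min(n_resamples - 1, int((1.0 - tail) * n_resamples))]
    return lower, upper


def first_position_z(dyads):
    """Fisher z of the item-total correlation: Position 1 with the Size-10 aggregate."""
    items = [d[0] for d in dyads]
    totals = [sum(d) / len(d) for d in dyads]
    return math.atanh(pearson(items, totals))


rng = random.Random(0)
dyads = [[rng.gauss(3.0, 1.0) for _ in range(10)] for _ in range(200)]
lower, upper = percentile_bootstrap_ci(dyads, first_position_z)
print(f"99% percentile CI: [{lower:.3f}, {upper:.3f}]")
```

The key design point carried over from the text is that the dyad is the resampling unit: every statistic is recomputed from scratch on each resampled set of dyads, so that the interval reflects the uncertainty of the complete procedure.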
Sample-wise approach
We used the first perceiver sample for exploratory analyses only, to obtain a first set of estimates regarding the sizes and shapes of the three hypothesized effects. In a sequence of preregistered analyses (to be explained in the next section), we then repeatedly specified, tested, and revised our respective models in an increasingly confirmatory manner, using new independent perceiver samples. Our overall goal was to obtain a single model capturing all three of the hypothesized effects as specifically as possible, along with cross-validated estimates of model fit.
Results
Exploration: Perceiver Samples 1–3
The first perceiver sample was used for exploration only. We calculated uncorrected item–total correlations for all 54 combinations of position and aggregate size as described above. Then, we experimented with fitting various regression models (see Table 1) to the overall pattern of correlations, employing different variants and combinations of predictors that might capture the three hypothesized effects (e.g., the position of the individual judgment, the squared position, the aggregate size of the overall judgment, the squared aggregate size, and the multiplicative interaction of position and aggregate size). For Perceiver Samples 2 and 3, we sequentially preregistered some of these models to be tested (see Table 2), while also performing additional exploratory analyses on both samples (preregistrations: osf.io/y3mna, osf.io/2mhs6).
Table 1. Regression Models.
a Model 9 is a cross-domain model that includes the Big Five domains as four dummy variables, while the fifth domain (here: Openness to Experience) serves as the reference category (dummy variables were created using dummy coding).
Table 2. Model Fit (R²) With Bootstrapped 99% CIs for Preregistered Models in the Exploration Phase (Perceiver Samples 2–3) and Confirmation Phase (Perceiver Samples 4–5).
Note. For all confirmatory analyses in Perceiver Samples 2–5, the table displays the number of the preregistered models and overall model fit (R²) with bootstrap CIs (type BCa) from bootstrapping the whole analysis procedure with 100,000 resamplings. The respective model equations can be found in Table 1. A = Agreeableness; C = Conscientiousness; E = Extraversion; N = Neuroticism; O = Openness to Experience; CI = confidence interval; BCa = bias-corrected and accelerated.
For the sake of parsimony, in the following, we focus on the overall conclusions to be drawn. There was no negative main effect of position, that is, no primacy effect overall. To the contrary, we found positive regression weights for position in simple models (using position as the only predictor) and in more complex models (e.g., using position, aggregate size, and their interaction as predictors). Analyses of scatterplots suggested that item–total correlations tended to increase with position at first, but around Positions 3–7 (depending on the personality domain) began to stagnate or even decrease somewhat again.
The negative main effect of aggregate size materialized as expected and across all personality domains: Individual observations contributed less to overall impressions the more observations were being averaged. Again, scatterplots offered some more insights: Item–total correlations decreased with aggregate size at first but then stabilized around Aggregate Size 6. Accordingly, including squared aggregate size as a predictor in the models increased the proportion of explained variance, with this predictor showing a positive effect. A positive interaction between position and aggregate size was also apparent, implying that the contribution of early information to overall impressions was particularly weak in larger aggregates.
Simultaneously including main effects of position and aggregate size, or these main effects and their interaction, also increased the proportion of explained variance. Overall, the most complete model (Model 7, which includes position and aggregate size as linear and squared predictors, as well as their interaction) showed the best fit in terms of R² across all personality domains (see Table 2), outperforming all other models by a wide margin. The incremental contribution of the interaction effects on top of the main effects (Model 7 vs. Model 8) was small, but the differences in R² were significant. Note that even though R² was our preregistered and decisive index of model fit, other indices that penalize nonparsimony (e.g., Akaike Information Criterion) also confirmed the superiority of Model 7.
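To make the structure of Model 7 concrete, the sketch below builds its five predictors for the 54 combinations of position and aggregate size. The verbal description of Model 7 is taken from the text; squaring and multiplying the mean-centered (and not the raw) predictors is an assumption, and the function and key names are hypothetical.

```python
def model7_predictors():
    """Predictor rows for the 54 (position, aggregate size) combinations."""
    cells = [(pos, size) for size in range(2, 11) for pos in range(1, size + 1)]
    mean_pos = sum(pos for pos, _ in cells) / len(cells)
    mean_size = sum(size for _, size in cells) / len(cells)
    rows = []
    for pos, size in cells:
        p = pos - mean_pos    # mean-centered position
        a = size - mean_size  # mean-centered aggregate size
        rows.append({
            "position": p,
            "position_sq": p * p,
            "aggregate": a,
            "aggregate_sq": a * a,
            "interaction": p * a,
        })
    return rows


rows = model7_predictors()
print(len(rows), sorted(rows[0]))
```

Each row would be paired with the Z-transformed item–total correlation of the same combination, which serves as the dependent variable.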
Finally, we also fitted a cross-domain model (Model 9): This model includes all five predictors from Model 7, plus four dummy variables representing the five personality domains (with O being the category of reference). These domain dummies were included to account for possible domain differences regarding the overall level of correlations. Experimenting with domain as a moderator of the other effects (e.g., an interaction of domain and position) had not resulted in consistent or meaningful contributions to model fit.
This cross-domain model also fit the data very well. The differences in model fit (for Perceiver Sample 3) between the cross-domain Model 9 and the domain-specific models of Type 7 ranged from −.09 for A to .02 for N. However, the 99% CIs for all these differences included zero, suggesting no clear superiority of one model over the other.
Confirmation: Perceiver Samples 4–5
After the first three rounds of analyses, the overall pattern of results seemed to have become decently consistent. Thus, in the next two rounds, we moved on to testing only the best-fitting model from the first rounds further—in its domain-specific version (Model 7) as well as in its cross-domain version (Model 9). We preregistered these analyses in identical form for Perceiver Samples 4 and 5 (osf.io/pmq5d).
Models 7 and 9 again fit the data very well (see Table 2) and results were consistent with the previous analyses. Comparing the fit of Model 7 across the five domains in Samples 4 and 5 did not yield a consistent pattern of domain differences, suggesting that this model fit the five domains about equally well. Differences in R² between the cross-domain model and the domain-separated models (Model 9 vs. Model 7) ranged from −.12 for E to .13 for A in Perceiver Sample 4, and from −.20 for A to .06 for E and N in Perceiver Sample 5. However, the CIs for all these differences again included zero, suggesting that no model was clearly superior over the other.
Cross-Validation: Perceiver Samples 6–7
Our final goal was to obtain a cross-validated model and a reliable estimation of model fit. So far, we had only used abstract linear regression models, where all coefficients were freely estimated. This will result in an optimal fit to the data at hand and thus in an overestimation of the actual effect sizes in the population (known as “overfitting”). We therefore performed proper cross-validations in the last two rounds of analysis: For this, we tested models with specified regression weights that we had derived from the previous samples (preregistration: osf.io/e34gy).
To use the most reliable slope estimates, based on the full amount of information available, we drew on the five samples analyzed thus far. That is, item–total correlations were first separately calculated per perceiver sample, and then averaged across the five samples, resulting in 54 averaged correlations. Then, our favored models (Models 7 and 9) were freely fitted to that average sample. The resulting regression weights were used to specify the slopes for six cross-validation models: one Model 7 per domain and one cross-domain Model 9. Intercepts remained unspecified as they were not part of our predictions. These models were then fitted to Perceiver Samples 6 and 7, respectively. Additionally, we once more fitted models with freely estimated coefficients to the two remaining perceiver samples because comparing them with the specified models allowed us to assess the impact of overfitting in this type of analysis.
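The fit of such a specified model can be sketched as follows: the slopes are fixed in advance, only the intercept is estimated from the new data, and fit is expressed as the determination coefficient 1 − SS_res / SS_tot. Estimating the intercept as the mean of the residuals is an assumption (it is the least-squares solution when the slopes are fixed); the toy data and names are hypothetical.

```python
def r_squared_fixed_slopes(predictors, outcome, slopes):
    """Determination coefficient for prespecified slopes and a free intercept.

    predictors: list of rows, one list of predictor values per observation.
    """
    linear = [sum(b * x for b, x in zip(slopes, row)) for row in predictors]
    # With fixed slopes, the least-squares intercept is the mean residual
    intercept = sum(y - l for y, l in zip(outcome, linear)) / len(outcome)
    mean_y = sum(outcome) / len(outcome)
    ss_res = sum((y - (intercept + l)) ** 2 for y, l in zip(outcome, linear))
    ss_tot = sum((y - mean_y) ** 2 for y in outcome)
    return 1.0 - ss_res / ss_tot


# Toy data that follow y = 1 + 2x exactly
x_rows = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

fit_correct = r_squared_fixed_slopes(x_rows, y, [2.0])
fit_too_flat = r_squared_fixed_slopes(x_rows, y, [1.0])
fit_wrong_sign = r_squared_fixed_slopes(x_rows, y, [-2.0])
print(fit_correct, fit_too_flat, fit_wrong_sign)
```

In this toy case the squared Pearson correlation between predicted and observed values equals 1 for all three slopes, whereas the determination coefficient drops, and even becomes negative, as the prespecified slope departs from the data. This is why the squared correlation is not a suitable fit index for specified models.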
Results are shown in Tables 3 and 4. All cross-validation models fit the two new samples significantly, and model fit was quite substantial with R² ranging from .41 to .71. Freely estimated models expectedly showed even better fit with R² ranging from .49 to .83. The R² of freely estimated models surpassed that of cross-validation models by .06 to .37. The effects of position, aggregate size, and their interaction were consistent with all previous analyses: There was a positive main effect of position (with a negative quadratic component), a negative main effect of aggregate size (with a positive quadratic component), and a positive interaction between position and aggregate size. These three effects are also illustrated in Figure 1. They represent the main takeaway from the present article.
Table 3. Fit of Cross-Validation (Specified) and Freely Estimated (Free) Models Applied to Perceiver Samples 6 and 7.
Note. Model equations can be found in Table 1. A = Agreeableness; C = Conscientiousness; E = Extraversion; N = Neuroticism; O = Openness to Experience.
Table 4. Coefficients of Cross-Validation (Specified) and Freely Estimated (Free) Models Applied to Perceiver Samples 6 and 7.
Note. Model equations can be found in Table 1. Aggregate = aggregate size; A = Agreeableness; C = Conscientiousness; E = Extraversion; N = Neuroticism; O = Openness to Experience. Italics: Coefficients were prespecified using the 54 averages of the correlations from Perceiver Samples 1–5.

Figure 1. Visualization of the effect pattern for the two cross-validation samples (left column: Perceiver Sample 6, right column: Perceiver Sample 7). Dots represent uncorrected item–total correlations (Z-transformed) between judgments of the targets in individual situations and the average of different numbers of situations (= overall judgments). The three rows separately display the effects of (A) the position of the individual situation in the perceivers’ sequence of judgments (informative regarding the primacy effect), (B) the aggregate size of the overall judgment, and (C) the interaction of position and aggregate size. Lines represent regression functions incorporating the following empirically supported sets of (mean-centered) predictors: (A) position plus squared position, (B) aggregate size plus squared aggregate size, and (C) position multiplied with aggregate size. Lines use coefficients from specified cross-validation Model 9 (see Table 4).
Discussion
Using a strong data set reflecting perceivers’ judgments of targets based on their actual behavior, the present study found no evidence for the existence of a primacy effect in person judgments. To the contrary, item–total correlations increased with the position in which the respective individual judgment was made. That is, judgments of the targets that were made later in the sequence were associated more strongly with the perceivers’ overall impressions. We were able to confirm and ultimately cross-validate this finding across a series of preregistered analyses with several independent sets of about 200 perceivers each.
It seems straightforward to interpret this finding in terms of increasing acquaintance: Observing targets for a longer time will yield a more realistic picture of their overall behavioral tendencies (Borkenau et al., 2004; Kenny, 1994). From an evolutionary perspective, it simply makes more sense for perceivers to pool all the information about a target that is currently available than to give greater weight to information that was obtained early on. That is because cumulative evidence will enable better predictions of a target’s future behavior, which may be the main purpose of person judgments altogether (Funder, 1991; Wessels et al., 2020). In fact, our findings once more suggest that people find it hard to impossible to ignore what they have previously learned about someone, when judging that person based on their behavior in a new situation (Leising et al., 2014): Despite our instruction to base their judgments on the targets’ behavior in the situation at hand, such spillover was clearly evident in our data. However, we did not find any evidence that this effect was particularly pronounced for the earliest information that the perceivers received.
We also found that the positive main effect of position had a negative quadratic component to it: From some point on, around the middle of the judgment sequence, associations between new judgments and overall impressions did not increase any further. Rather, they remained at largely the same level or even started to decrease somewhat. This may be explained (a posteriori) in terms of an increasing redundancy of the newly incoming information (Borkenau et al., 2004; Kenny, 1994) or in terms of perceiver fatigue. Our design did not permit a systematic disentangling of these two possibilities.
The present study purposely investigated social judgments only because that is what the alleged primacy effect (Tetlock, 1983) is about. The perceivers in our study had no means of influencing the targets’ behavior, as they did not personally interact with the targets. Under such circumstances, a primacy effect does not emerge, according to our study. Under different circumstances, however, when perceivers do get to interact with targets, the targets’ earliest behaviors may indeed have a disproportionate influence on the perceivers’ overall judgments, but such an influence may come about by way of a completely different mechanism: The perceivers’ earliest impressions of the targets may shape the perceivers’ future behaviors toward the targets (e.g., the questions that they ask), which may then shape the targets’ behaviors toward the perceivers, which may then shape the perceivers’ impressions even more, and so on (“self-fulfilling prophecy”; e.g., Rosenthal & Jacobson, 1968). Such effects have been extensively documented (e.g., Jussim, 2012; Jussim & Eccles, 1995; Rosenthal & Rubin, 1978; Snyder, 1984), though overall they may be relatively small and seem to dissipate over time (Jussim, 2012). Although these ideas have been around in the social psychology literature for quite some time, we think that recent technological developments make it possible only now to study them in appropriate depth. First and foremost, this concerns the possibility of letting participants interact with one another digitally (e.g., via Zoom), recording their respective behaviors, experimentally varying behavioral exchanges, assessing the perceivers’ impressions of the targets, and using these as predictors of the perceivers’ behaviors toward the targets, and of the targets’ responses. In our view, these are very promising lines of possible future research.
Finally, we are convinced that the present study demonstrates two valuable points regarding methodology: First, preregistration may actually be surprisingly flexible and accommodate all kinds of research ranging from very exploratory (as in the earliest stages of the present project) to strictly confirmatory (as in the last stage). We highly recommend the approach that we pursued in the present study, especially with research projects for which the optimal ways of assessing or analyzing the phenomena of interest have not been determined yet (which may be most research projects in psychology). In our experience, publicly documenting the development of one’s own theoretical thinking and its concrete operationalizations in terms of measurement and statistical analysis is a powerful tool that definitely increases the level of discipline and rigor in one’s scientific work. Second, cross-validation needs to be taken more seriously in psychological research. As our analyses show (Table 3), it does make quite a difference whether one simply “replicates” an effect (meaning that one finds a significant effect in the same direction once more) or whether one actually assesses how well an a priori specified model fits a new set of data. Only the latter approach will yield a realistic estimate of how good one’s model actually is.
Conclusion
The belief that first impressions matter disproportionately and may be hard to correct later on is held by many people, including psychologists. In psychological research, this “primacy effect” has been studied, but the present series of preregistered analyses was the first to use judgments of target persons’ actual behavior. It conclusively showed that the primacy effect does not exist with such data. Rather, later judgments were more predictive of a perceiver’s overall impression than early judgments, as one would expect based on increasing acquaintance. Early information may still be disproportionately important, but only if there is a chance for perceivers to react to the targets accordingly, thus shaping the future course of their interaction and the flow of information.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Preparation of this article was supported by the German Research Foundation (grant number LE2151/6-1 to Daniel Leising).
