Abstract
Simultaneous estimation of structural and measurement models in structural equation modeling (SEM) is not always tenable in small samples. In such cases, it may be necessary or advantageous to obtain scores. Current scoring recommendations draw predominantly from simulations with sample sizes greater than N = 200. This paper extends these recommendations to small N, directly comparing factor scores to sum scores. In addition, scores computed from an essentially tau-equivalent factor model are introduced as an alternative scoring option aimed at balancing the competing benefits of sum scores and factor scores. Findings largely suggest that factor scores from an essentially tau-equivalent factor model are advantageous when considering convergence and stability, even when the model is not supported by the data, so long as departures from its assumptions are not substantial. These scores are obtainable when congeneric factor models fail to converge, and their correlations with true scores are similar to those of typical factor scores in samples at or below N = 200.
Introduction
An appealing benefit of structural equation modeling (SEM) is its ability to simultaneously estimate both measurement and structural models, producing estimates of relations among latent variables and measurement model parameters in a single model. Despite these advantages, typical SEM methods require large samples for stable estimation, particularly with increasing model complexity (e.g., Gagne & Hancock, 2006; Rosseel, 2020). A potential solution for small-sample SEM involves a two-step procedure, wherein a measurement model is fit, factor scores are computed using parameter estimates from this measurement model, and then these scores—or a corrected covariance matrix of scores—are used as input to estimate the structural component of the model. This two-step process has been studied extensively in a variety of contexts, including regression (Bogaert et al., 2023; Croon, 2002; Skrondal & Laake, 2001), path analysis (Devlieger & Rosseel, 2017; Kelcey, 2019; Lu et al., 2011), models with latent-interaction effects (Cox & Kelcey, 2021), and multilevel models (Devlieger & Rosseel, 2020; Kelcey et al., 2021), among others.
Recently, methodologists have engaged in increased scholarship and heightened debate directly comparing estimated factor scores obtained from a confirmatory factor analysis (CFA) with traditional sum or mean scores (McNeish, 2023; McNeish & Wolf, 2020; Rhemtulla & Savalei, 2025; Sijtsma et al., 2024; Widaman & Revelle, 2023, 2024). Both options for obtaining scores have noted strengths and limitations. For one, sum scores can be expressed as a type of estimated factor score, albeit one from a highly constrained underlying factor model. Specifically, all factor loadings are set equal to 1.0, and all residual variances are set equal to one another. This implies all observed indicators are equally influenced by the latent factors they purport to measure. These conditions are often unrealistic in practice, making estimated factor scores from less constrained factor models more attractive in some cases. On the other hand, factor score estimates based on parameter estimates from a CFA are highly sample-dependent. The elements of the scoring formulae themselves will differ from sample to sample, reducing both consistency and generalizability across samples. Therefore, if the population generating model is best described by a common factor model with varying factor loadings and residual variances, sum scores may not fully capture all aspects of that model, but they are less prone to inconsistency due to sampling variability.
Critically, the cited methodological publications comparing estimated factor scores and sum scores involved simulation studies and empirical demonstrations that typically used samples of N = 200 or more. Furthermore, research investigating samples smaller than N = 200 has largely explored the method of Croon (2002), a promising alternative to simultaneous SEM in small samples, but this work has not emphasized comparisons to sum scores or considered alternative models with which to compute scores when sample size limitations preclude stable estimation of the measurement model in isolation (Bogaert et al., 2023; Kelcey, 2019). The present literature therefore lacks evidence-based recommendations for researchers on best practice when obtaining factor score estimates, considering sum scores and other alternatives, in the conditions where the impacts of sampling variability are most deleterious.
My goal is to fill this gap by offering pragmatic guidance on scoring methods in small samples. To accomplish this aim, I begin with a brief review of CFA to establish a shared notation to be used throughout. After this, I introduce formulae used to obtain factor score estimates from a CFA. Next, I propose an intermediary option between typical factor scores and sum scores, drawing on well-established psychometric tools and applying these to factor scoring. Finally, I evaluate the usability and performance of these scores in samples at or below N = 200 through a simulation demonstration.
Confirmatory Factor Analysis
The CFA aims to identify one or more substantively motivated latent factors that explain covariation among a larger set of observed indicators. Here, latent factors represent unmeasured or unmeasurable constructs (Bollen, 2002), such as personality traits, psychopathology, or cognitive processes. Observed indicators are measured variables believed to assess underlying levels of one or more latent factors. The relation of observed indicators to latent factors can be described by Equation 1
where
The model-implied mean and covariance structures resulting from Equation 1 are as follows:
where
Estimates of model parameters, namely,
Factor Scores and Sum Scores
Precise elements of
There are theoretically an infinite number of solutions for
Generally, factor score estimates can be obtained with the following expression:
where
Regression scores (Thomson, 1935, 1938) are some of the most frequently utilized, readily obtainable, and researched factor score estimates. They can be computed with the following matrix-valued function:
where
Sum scores are different from regression scores and other factor scores in that they do not require prior estimation of a CFA. They avoid some aspects of indeterminacy by defining scores as a simple function of equally weighted observed variables using all observed variance as opposed to partitioning variance into true score variance and error variance for each observed indicator uniquely (Widaman & Revelle, 2023). They can, however, be likened to factor scores computed from a highly constrained factor model. Specifically, when considered in the latent variable modeling framework, sum scores are factor scores computed from a model where all standardized factor loadings are set equal, implying equality of unstandardized loadings and residual variances (McNeish & Wolf, 2020). Practically speaking, this assumption requires equal weighting of all observed indicators, which can be challenging to meet.
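The link between sum scores and a constrained factor model can be sketched for a single factor. The notation below is an assumption made for illustration and may not match the symbols used elsewhere in this article: \lambda_j and \theta_j denote the unstandardized loading and residual variance of indicator j, \psi the factor variance, \mu_j the indicator mean, and p the number of indicators. Under these assumptions, the regression score is a weighted sum of centered indicators, and the weights become identical once loadings and residual variances are constrained to equality.

```latex
% One-factor regression score written as a weighted sum of centered indicators
\hat{\eta}_i = \sum_{j=1}^{p} w_j \,(y_{ij} - \mu_j),
\qquad
w_j = \frac{\psi\,\lambda_j/\theta_j}{1 + \psi \sum_{k=1}^{p} \lambda_k^{2}/\theta_k}

% With \lambda_j = \lambda and \theta_j = \theta for all j, every weight is the same constant
w = \frac{\psi\,\lambda/\theta}{1 + p\,\psi\,\lambda^{2}/\theta},
\qquad
\hat{\eta}_i = w \sum_{j=1}^{p} (y_{ij} - \mu_j)
```

In the constrained case the regression score is a positive rescaling of the centered sum score, so the two correlate perfectly with one another and identically with the true score.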
Importantly, although assumptions underlying sum scores may be practically untenable in some instances, they have important benefits. For one, sum scores ensure that the same scoring formulae are used across samples (Widaman & Revelle, 2023). This can prevent overfitting, in that instability in sample estimates will not deleteriously impact scoring formulae. In addition, because of the instability in sample estimates obtained from estimation procedures like ML that require large samples, choosing to use a consistent but potentially wrong weighting for observed indicators, such as 1.0, can be beneficial (Rhemtulla & Savalei, 2025; Uanhoro, 2019). This can be conceptually embedded in the broader framework of the bias-variance trade-off, wherein bias is systematically and intentionally introduced into sample estimates to improve stability and reduce variance within and across samples.
Continuing with the conceptual parallel to the bias-variance trade-off, it is not necessary to introduce a consistent degree of potential bias, such as unit weighting observed indicators and holding error variance constant. For example, many applications of the bias-variance trade-off involve tuning parameters or values that can be thoughtfully selected, which determine the extent to which sample estimates are influenced by the unique characteristics of the sample or intentionally biased to reduce variance. Therefore, factor scores computed from a congeneric factor model, wherein all factor loadings and error variances are freely estimated, can be likened to an extreme in variance, in that they are likely to differ substantially across samples, especially when sample sizes are small. Sum scores are then an extreme in bias in typical applications of the factor model, where equal weighting and error variances are not reasonably supported. Thus, a natural compromise between the two is the essentially tau-equivalent model (Graham, 2006).
The essentially tau-equivalent model imposes equal weighting of observed indicators, or equivalence of factor loadings within a latent factor, but does not require equal residual variances or equivalent standardized loadings. Although the essentially tau-equivalent model has been a standard psychometric tool for decades, it appears underutilized when the goal of measurement modeling is to estimate latent standing via factor scores, and it has received limited attention in methodological contributions in scoring. Any factor score that can be computed using parameter estimates from a congeneric CFA can also theoretically be computed using parameter estimates from an essentially tau-equivalent CFA. Furthermore, factor score estimates computed using parameter estimates from the essentially tau-equivalent model may be a more practical and stable scoring method in small samples, balancing the benefits and limitations of the aforementioned “extremes.”
Methods
The purpose of this investigation is to provide guidance and recommendations on scoring procedures in small to moderate samples (N
Data were simulated by manipulating five design characteristics: (a) interfactor correlation, (b) factor loadings and residual variances, (c) number of items per factor, (d) sample size, and (e) scoring procedure. These will be discussed in turn. Population generating values were selected to match a range of values commonly encountered in practice. First, true scores on two correlated latent factors were simulated from a multivariate normal distribution with variances of one and factor correlations of either .20, .40, or .60, representing moderate to large interfactor correlations. A two-correlated factors model was selected as a population generating model, given noted increases in sample size requirements when moving from a one-factor to a two-factor model but not when increasing above two factors (Wolf et al., 2013), suggesting two-factor models can be used to provide key insights into model performance of multifactor models.
Observed indicators were simulated as a function of simulated true scores according to Equation 1. Fixed values for
Critically, in all but one of these conditions, residual variances were completely determined by the associated factor loading value. To depart more completely from the essentially tau-equivalent model, two additional conditions were included in which factor loadings and error variances were sampled independently of one another, and factor loadings were not all equivalent. In one condition, factor loadings were randomly sampled from a range of values between .60 and .80. After this, residual variances were independently sampled from the range between .36 and .64. These values align with the third condition presented above, but include residual variance estimates that were not a function of factor loadings. Then, the same set of fixed loadings and residual variances was used throughout this condition. In all conditions, errors were drawn from a multivariate normal distribution with a diagonal variance/covariance matrix.
To distribute the range of loadings across observed indicators and explore conditions with a different number of indicators per factor, a total of 10 indicators were simulated per latent factor. For the condition with consistent loadings and randomly sampled error variances, a total of 10 error variances were randomly drawn, and these were assigned to each of the 10 indicators in turn. For the condition with factor loadings of .6 and .8 and residual variances dependent on factor loadings, loadings were assigned in an alternating fashion. For the condition with loadings from .40 to .80 and residual variance dependent on factor loadings, the five loadings were assigned to the first five indicators and then repeated for the last five. For the two conditions where loadings and error variances were sampled independently of one another, a total of 10 loadings and 10 error variances were randomly drawn, and these were assigned to each of the 10 indicators in turn. Then, CFAs were estimated using unstandardized simulated data and different numbers of simulated observed variables to vary number of observed indicators per factor: in one model, all 10 indicators per factor were included, leading to a two-factor model with a total of 20 observed variables, or 10 indicators per factor; in another model, only the first five indicators per factor were included, leading to a two-factor model with a total of 10 observed variables, or five indicators per factor.
In total, four CFAs were fit to each simulated data set: (a) a CFA with freely estimated loadings and five indicators per factor, (b) a CFA with freely estimated loadings and 10 indicators per factor, (c) an essentially tau-equivalent CFA with five indicators per factor, and (d) an essentially tau-equivalent CFA with 10 indicators per factor. Regression scores were computed from each CFA. Sum scores were also computed using either the first 5 indicators per factor or all 10. Sum scores were computed using unstandardized observed indicators, as is typical when observed variables are on similar metrics. Consequently, in the three conditions where residual variances were dependent on the loading values, observed variables were set to all have a mean of zero and standard deviation of one, meaning they were simulated as standardized. In the subsequent conditions with independent factor loadings and residual variances, observed variable means were still set to zero, but variances differed from one; however, the overall scales of distributions of observed variables were still roughly similar, making it plausible that an analyst would not standardize before computing sum scores.
Taken together, three scoring methods were applied across two numbers of indicators per factor, three interfactor correlations, and six conditions for factor loadings and residual variances. All design conditions were fully crossed.
Finally, all conditions were evaluated in each of eight sample sizes from N = 30 to N = 100, in increments of 10, for increased granularity in very small samples. Additional sample sizes of N = 150 and N = 200 were also explored to evaluate scores in moderate sample sizes. One thousand data sets were simulated per design cell.
Given the aim of evaluating the utility of scores themselves, three outcome variables were explored. First, I computed the proportion of properly converged solutions out of the total 1,000 iterations per design cell as a measure of the viability of score estimation. Nonconvergence was defined to include models that converged but resulted in inadmissible solutions due to implausible values such as negative residual variances and correlations greater than one. Given sum scores were computed without the estimation of a factor model, convergence was not considered for these.
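The convergence outcome can be expressed as a simple rule applied to each fitted model. The sketch below assumes that each solution has already been reduced to a convergence flag, a list of residual variance estimates, and a list of interfactor correlation estimates. The dictionary layout and function names are hypothetical, and treating a residual variance of exactly zero as admissible is an assumption.

```python
def is_properly_converged(converged, residual_variances, factor_correlations):
    """Return True only for converged solutions with admissible estimates."""
    if not converged:
        return False
    if any(v < 0 for v in residual_variances):        # Heywood case
        return False
    if any(abs(r) > 1 for r in factor_correlations):  # out-of-range correlation
        return False
    return True


def proper_convergence_rate(solutions):
    """Proportion of properly converged solutions in one design cell."""
    flags = [is_properly_converged(**s) for s in solutions]
    return sum(flags) / len(flags)


# Hypothetical solutions from four replications of one design cell.
solutions = [
    {"converged": True, "residual_variances": [0.6, 0.4, 0.5],
     "factor_correlations": [0.35]},
    {"converged": True, "residual_variances": [0.6, -0.1, 0.5],
     "factor_correlations": [0.41]},
    {"converged": True, "residual_variances": [0.6, 0.4, 0.5],
     "factor_correlations": [1.08]},
    {"converged": False, "residual_variances": [],
     "factor_correlations": []},
]

print(proper_convergence_rate(solutions))
```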
The majority of converged but inadmissible solutions occurred due to negative residual variances, or Heywood cases (Heywood, 1937). In practice, many analysts choose to manually fix problematic residual variances to zero or a very small constant. Although Heywood cases are often the symptom of a larger structural problem, this simple fix is commonly implemented (Cooperman & Waller, 2022). Therefore, for all solutions where the congeneric CFA resulted in one or more negative variance estimates, a follow-up investigation was conducted wherein implausible residual variance estimates were set to zero using the bounds = "pos.var" option in lavaan. Regression scores were then computed from the congeneric CFA, with any negative residual variances fixed to zero. These solutions were set aside to be separately assessed.
In addition to convergence, the correlation between score estimates and true scores was computed to explore the extent to which each scoring method provided an accurate estimate of true latent standing. In some cases, correlations were negative due to the choice of scaling indicator, so absolute values were taken to ensure all correlations were positive. To allow comparability against prior simulation work evaluating factor scores, Pearson correlations were computed (e.g., Curran et al., 2016; Rhemtulla & Savalei, 2025; Strauss & Curran, 2026). Furthermore, these correlations provide a useful measure of factor score reliability (Estabrook & Neale, 2013). Finally, standard deviations taken from the empirical sampling distribution of Pearson correlations with true scores were computed to assess variability.
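The second and third outcomes, the average absolute correlation with true scores and the standard deviation of its empirical sampling distribution, can be sketched for sum scores alone. The example below is illustrative and rests on several assumptions: a single factor (sum scores for one factor do not depend on the other), five indicators with hypothetical loadings of .60 and .80, residual variances of one minus the squared loading, and 500 rather than 1,000 replications to keep the run short. Function names are hypothetical.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def one_replication(n, loadings, rng):
    """Absolute correlation between sum scores and true scores in one sample."""
    true_scores = [rng.gauss(0, 1) for _ in range(n)]
    scores = [
        sum(lam * eta + rng.gauss(0, math.sqrt(1 - lam ** 2))
            for lam in loadings)
        for eta in true_scores
    ]
    return abs(pearson(true_scores, scores))


def summarize_abs_correlations(n, loadings, replications, seed):
    """Mean and SD of |r| across replications for one design cell."""
    rng = random.Random(seed)
    values = [one_replication(n, loadings, rng) for _ in range(replications)]
    return statistics.mean(values), statistics.stdev(values)


loadings = [0.6, 0.8, 0.6, 0.8, 0.6]   # hypothetical five-indicator condition
for n in (30, 200):
    mean_r, sd_r = summarize_abs_correlations(n, loadings, 500, seed=7)
    print(f"N = {n}: mean |r| = {mean_r:.3f}, SD = {sd_r:.3f}")
```

The same summary would apply to regression scores once a CFA has been fit to each simulated sample. In that case the absolute value matters more, because the sign of a factor can flip with the choice of scaling indicator.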
Results
Convergence
As described above, nonconvergence was defined to include solutions that converged but resulted in implausible parameter estimates. Most of these were Heywood cases or, to a much lesser extent, correlations among latent factors greater than 1.0. In general, convergence issues were trivial with 10 indicators per factor. Out of a total of 360,000 factor models where all 10 indicators were included, only 45 did not converge properly. Of these, 39 nonconverged solutions occurred in the CFA with freely estimated factor loadings, and six occurred in the essentially tau-equivalent CFA. Nonconvergence was also generally trivial at sample sizes of N = 100 or greater. With 10 indicators per factor, all solutions properly converged at these sample sizes. With five indicators per factor, only a single solution did not properly converge. Because nonconvergence was concentrated in conditions with five indicators per factor and sample sizes of N = 100 or fewer, only these are summarized in Figure 1.

Convergence Rates of Congeneric CFA and Essentially Tau-Equivalent CFA With Five Indicators per Latent Factor and Sample Sizes up to N = 100.
As expected, convergence rates were substantially higher for the essentially tau-equivalent CFA compared with the congeneric CFA, in most conditions and at small N. Differences were most pronounced when the range of factor loadings was more extreme. For example, with factor loadings ranging from .40 to .80, independent error variances, an interfactor correlation of .20, and a sample size of N = 30, the convergence rate for the congeneric CFA was .76 compared with .99 for the essentially tau-equivalent CFA. This increased to .87 for the congeneric CFA at N = 40. When factor loadings ranged from .40 to .80 and error variances were determined with the formula
Correlations With True Scores
Correlations of factor scores with true scores from appropriately converged solutions and sum scores with true scores are summarized in Figures 2 and 3. Correlations with true scores from solutions that resulted in Heywood cases will be considered in a subsequent section. Figures are separated by indicators per factor to maintain consistent y-axes within each figure. Figures display only correlations for one of the two simulated correlated factors, as there were no notable differences in findings across factors. In most conditions, patterns were consistent across different numbers of items per factor, but more pronounced with five items per factor compared with 10 items per factor, as seen in the reduced range of the y-axis in Figure 3 compared with Figure 2.

Correlations With True Scores With Five Indicators per Factor.

Correlations With True Scores With 10 Indicators per Factor.
Interestingly, in most conditions, correlations between regression scores and true scores did not differ substantially between the congeneric CFA and the essentially tau-equivalent CFA. At very small N, regression scores from the essentially tau-equivalent CFA had average correlations with true scores slightly smaller than regression scores from the congeneric CFA, but differences did not surpass the third decimal place. Therefore, I refer to regression scores generally, regardless of whether these were obtained from the congeneric CFA or essentially tau-equivalent CFA, in the following description of correlation patterns.
With a large and consistent population generating values for
As expected, larger differences among scoring methods emerged as the data-generating parameters departed from what is assumed under the sum score model (i.e., varied factor loadings and residual variances within a latent factor) and again as data-generating parameters further departed from the essentially tau-equivalent model (i.e., factor loadings varied and residual variances did not depend on unstandardized factor loadings). With factor loadings of .60 and .80 and residual variances of .64 and .36, respectively, correlations between sum scores and true scores were most similar to regression scores only at small N. As N increased, correlations between sum scores and true scores remained constant, whereas correlations between regression scores and true scores increased. With factor loadings ranging between .60 and .80 and residual variances independently ranging between .36 and .64, sum scores had marginally larger correlations with true scores at N = 30, but differences did not exceed .02 units. As the sample size increased, correlations with true scores converged across all three score types, except in the case where interfactor correlations were .60. With large interfactor correlations, regression scores had marginally higher correlations with true scores compared with sum scores as N increased, with differences not exceeding .01 units.
The largest differences across score types occurred when factor loadings were most varied. With factor loadings ranging from .40 to .80 and residual variances set to
Standard Deviations of the Empirical Sampling Distribution of Correlations With True Scores
Standard deviations about the averages presented in Figures 2 and 3 are summarized in Figures 4 and 5. Figures are separated by indicators per factor to maintain consistent y-axes within each figure. Again, figures display only values for one of the two simulated correlated factors, as there were no notable differences in findings across factors.

Standard Deviations of Correlations With True Scores With Five Indicators per Factor.

Standard Deviations of Correlations With True Scores With 10 Indicators per Factor.
Differences across scoring methods in the standard deviations of correlations with true scores were marginal in most conditions. The most visible differences emerged with five indicators per factor, factor loadings ranging from .40 to .80, and residual variances drawn independently from factor loadings and ranging from .36 to .85 (Figure 4). Here, at very small N, correlations between sum scores and true scores showed the least variability, followed by regression scores from the essentially tau-equivalent CFA, and finally by regression scores from the congeneric CFA. For example, at N = 30 and with interfactor correlations of .20, sum score correlations had a standard deviation of .06, regression score correlations from the essentially tau-equivalent CFA had a standard deviation of .07, and regression score correlations from the congeneric CFA had a standard deviation of .09.
Heywood Cases
Throughout the simulations, one of the most common reasons the congeneric factor model resulted in an inadmissible solution was one or more negative residual variances, or Heywood cases. These were initially flagged as nonconverged solutions and dropped from Figures 2 to 5; however, in practice, when analysts encounter Heywood cases, they often employ the pragmatic workaround of fixing any problematic residual variance estimate(s) to zero or some small constant. Therefore, an additional investigation was conducted to compare regression scores computed from a congeneric CFA where Heywood cases were addressed by fixing problematic values to zero with regression scores computed from a properly converged essentially tau-equivalent CFA. Comparisons to sum scores were also included.
Specifically, for all iterations where the congeneric CFA resulted in a Heywood case and the essentially tau-equivalent CFA converged appropriately, an additional congeneric CFA was fit wherein problematic residual variances were fixed to zero prior to obtaining factor score predictions. Correlations with true scores among these scores, regression scores from the essentially tau-equivalent CFA, and sum scores are summarized in Figure 6. These are not broken down by sample size, given that many sample sizes resulted in only a small number of Heywood cases. Furthermore, when factor loadings ranged from .60 to .80, there were fewer than approximately five Heywood cases in total beyond sample sizes of N = 50; therefore, for this condition, Figure 6 only displays results up to N = 50. When factor loadings ranged from .40 to .80, there were fewer than approximately five Heywood cases in total beyond sample sizes of N = 80. For this condition, sample sizes up to N = 80 are included in Figure 6. Population generating models with consistent

Distributions of Correlations With True Scores for Solutions Where the Congeneric CFA Resulted in a Heywood Case.
In all conditions, regression scores from a congeneric CFA with one or more Heywood cases fixed to zero produced the lowest correlations between scores and true scores, and the most variability in these correlations. For example, with factor loadings between .40 and .80 and independently defined residual variances, some iterations resulted in solutions where correlations between regression scores and true scores were at or below .25. Average correlations with true scores differed among the scoring procedures by as much as nearly .10 units. For example, with factor loadings ranging from .40 to .80 and residual variances determined independently from factor loadings, the average correlation between true scores and regression scores from the congeneric CFA was .69 (SD = 0.13) compared with .78 (SD = 0.09) for regression scores from the essentially tau-equivalent CFA and .79 (SD = 0.07) for sum scores. By comparison, average correlations did not substantially differ between regression scores from the essentially tau-equivalent CFA and sum scores, with maximum within-condition differences around .01 units.
Discussion
Although in many applications it may be advantageous to simultaneously estimate a structural and measurement model with multiple indicator latent factors, this is often untenable in small samples. In such cases, a potential solution involves obtaining scores, which represent estimates of latent standing on one or more latent factors, and separately modeling relations among latent factors; however, because standard scoring methods rely on sample estimates of often complex measurement structures, these are deleteriously impacted by sampling variability (Rhemtulla & Savalei, 2025; Uanhoro, 2019). Alternatively, sum scores (or mean scores) can be used. These assume equivalent factor loadings and residual variances, which may not align with the true data-generating process. They are, however, not influenced by unreliable sample estimates of measurement model parameters.
Between these two extremes, I proposed computing factor scores from an essentially tau-equivalent model, which imposes equality of unstandardized factor loadings, but not of residual variances. Regardless of the data-generating process, one can fit an essentially tau-equivalent CFA and obtain factor score predictions directly from parameter estimates of this constrained model. In small samples, these models are more likely to result in admissible solutions compared with congeneric factor models. The primary aim of this investigation was to evaluate the utility of different scoring methods in the presence of substantial sampling variability, which is also the condition in which scores may be necessary. These scoring methods were: (a) regression scores computed from a congeneric CFA, (b) regression scores computed from the essentially tau-equivalent CFA, and (c) sum scores.
Simulation results revealed three notable insights that can be translated into practical recommendations: (a) when sample sizes are small, and regardless of whether data conform to the assumptions of essential tau-equivalence, factor scores from the essentially tau-equivalent model correlate with true scores about as strongly as factor scores from the congeneric CFA and are generally obtainable even when the congeneric CFA does not properly converge; (b) sum scores are also comparable with factor scores from the essentially tau-equivalent model in very small samples, but in moderate samples they degrade mildly as a function of departure from their assumptions, with some exceptions; and (c) critically, absolute differences across scoring methods from properly converged solutions were largely marginal, except when comparing regression scores computed from a congeneric CFA where one or more Heywood cases were addressed by fixing problematic values to zero. In such cases, these regression scores were substantially less correlated with true scores than the considered alternatives.
Factor Scores From the Essentially Tau-Equivalent CFA
Factor scores from the essentially tau-equivalent model were found to be advantageous, even in data structures where essential tau-equivalence was not a feature of the underlying data-generating mechanism or would not be otherwise supported in practice. Most critically, they are obtainable when congeneric factor models fail to converge appropriately or result in implausible parameter estimates, such as negative residual variances or correlations greater than 1. Importantly, scores from the essentially tau-equivalent model exhibited similar correlations with true scores compared with regression scores from a properly converged congeneric CFA, and often larger correlations with true scores compared with sum scores. At low N, they had comparable variance of correlations with true scores compared with sum scores and less variance of correlations with true scores compared with factor scores from the congeneric CFA. At moderate N, they had comparable variance of correlations with true scores to factor scores from a congeneric CFA and less variance of correlations than sum scores. This implies it is possible to obtain factor score estimates even when a standard CFA does not properly converge, providing a key path forward for analysts in practice.
Increased stability across samples is an additional and important benefit of computing scores using parameter estimates from the essentially tau-equivalent CFA. Whereas factor loadings may vary substantially from sample to sample, essential tau-equivalence ensures that loadings, while not precisely equal across samples, will be equally weighted across samples. This layer of stability may be attractive for analysts wary of substantial sample-to-sample variance inherent in typical applications of factor scoring but desiring factor scores that will relate to true scores in similar patterns as typical factor scores computed using parameter estimates from a congeneric CFA.
Importantly, findings suggest that imposing essential tau-equivalence can be advantageous both when it is supported by the true data-generating process and when it is not. When essential tau-equivalence is a feature of the true measurement model, the added stability is widely defensible if the goal of analysis is to compute factor scores that represent the true data-generating process; however, even when fit degrades substantially as a product of equality constraints on lambda, the added stability can still be justifiable. In the present simulation, when population generating values of lambda varied, fit indices were expectedly below generally accepted cutoffs (e.g., comparative fit index [CFI] and Tucker-Lewis index [TLI] were below .90 on average, and root mean square error of approximation [RMSEA] was above .05 on average). In scenarios where essential tau-equivalence is not supported by the data, as evidenced by poor fit, analysts should be aware that imposing essential tau-equivalence can still be useful, but the argument for additional equality constraints conflicts with the goal of identifying the true population generating measurement model. Instead, the equality constraints are a tool to improve the stability of parameter estimates prior to score prediction and, in some cases, to produce a measurement model that does not result in one or more implausible parameter estimates. This may be similar to an argument for using sum scores when a population generating measurement model does not support equal factor loadings and residual variances, and is further similar to the intentional introduction of bias to reduce variance in the classic bias-variance trade-off.
Sum Scores
When population generating parameters adhered to the restrictions of the sum score model or departed mildly, sum scores performed comparably to regression scores from the essentially tau-equivalent CFA. With equivalent factor loadings or factor loadings ranging from .60 to .80, the observed benefits of regression scores were marginal. In such cases, sum scores are a justifiable alternative to more complex scoring methods. That said, when factor loadings ranged from .40 to .80 and residual variances were determined by factor loadings, implying that larger unstandardized factor loadings equate to larger standardized factor loadings, regression scores from the essentially tau-equivalent model had notably larger and more stable correlations with true scores than sum scores. In contrast, when factor loadings ranged from .40 to .80 and residual variances were not determined by factor loadings, implying that larger unstandardized factor loadings do not necessarily equate to larger standardized factor loadings, sum scores exhibited larger correlations with true scores and less variability in those correlations than regression scores from the essentially tau-equivalent CFA. Stated differently, regression scores from the essentially tau-equivalent CFA generally perform comparably to or better than sum scores except in the case of substantial departure from their assumptions, in which case sum scores are a more stable alternative than regression scores from a congeneric CFA.
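To make the criterion concrete, the correlation between a weighted composite and the latent variable can be computed directly from model parameters. The sketch below is a population-level illustration only: it assumes a one-factor model, treats the factor as the true score, uses five hypothetical items with loadings spanning .40 to .80, and compares unit weights (sum scores) with regression weights computed from the generating congeneric parameters rather than from an estimated essentially tau-equivalent model. It therefore does not reproduce the simulation and ignores sampling variability entirely.

```python
import numpy as np

def composite_factor_correlation(weights, loadings, residual_variances,
                                 factor_variance=1.0):
    """Correlation between the composite w'x and the factor (illustrative helper)."""
    w = np.asarray(weights, dtype=float)
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_variances, dtype=float)
    sigma = factor_variance * np.outer(lam, lam) + np.diag(theta)
    cov_with_factor = factor_variance * (w @ lam)   # Cov(w'x, eta)
    composite_variance = w @ sigma @ w              # Var(w'x)
    return cov_with_factor / np.sqrt(composite_variance * factor_variance)

# Hypothetical loadings; the number of items is an assumption.
lam = np.array([0.40, 0.50, 0.60, 0.70, 0.80])
# Residual variances determined by loadings (standardized indicators).
theta = 1.0 - lam**2

sigma = np.outer(lam, lam) + np.diag(theta)
unit_weights = np.ones_like(lam)              # sum score
regression_w = np.linalg.solve(sigma, lam)    # regression score, factor variance 1

r_sum = composite_factor_correlation(unit_weights, lam, theta)
r_regression = composite_factor_correlation(regression_w, lam, theta)

# With equal loadings and equal residual variances the two composites coincide.
lam_eq, theta_eq = np.full(5, 0.70), np.full(5, 0.51)
w_eq = np.linalg.solve(np.outer(lam_eq, lam_eq) + np.diag(theta_eq), lam_eq)
r_sum_eq = composite_factor_correlation(np.ones(5), lam_eq, theta_eq)
r_regression_eq = composite_factor_correlation(w_eq, lam_eq, theta_eq)

print(round(r_sum, 3), round(r_regression, 3))
print(round(r_sum_eq, 3), round(r_regression_eq, 3))
```

Because regression weights based on the generating parameters maximize this correlation among linear composites, any sum-score advantage observed in small samples reflects estimation error in the weights rather than a property of the population model.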
Absolute Differences
Within a design cell, the largest difference in correlations with true scores among scoring methods did not exceed .03 units, a practically small difference. Despite much debate about the relative merits of sum scores and factor scores, when samples are small and two-step processes are likely a necessary alternative to simultaneous estimation of structural and measurement models, obtaining an admissible solution to a measurement model is the most critical limiting factor. Accordingly, any score from a properly converged measurement model may be reasonably justifiable, including sum scores, which are not subject to convergence challenges. This corroborates past findings on unit weighting (Wainer, 1976). Therefore, the analyst’s decision to impose essential tau-equivalence before computing factor scores in practice is likely to be driven largely by convergence and by the desire for greater within- and across-sample consistency via equal weighting, noting that gains on other performance metrics are marginal.
To reiterate, nonconvergence was defined to include models that properly converged but resulted in one or more implausible parameter estimates, such as a negative residual variance, or Heywood case. In practice, an analyst may be tempted to fix a residual variance to zero or some small constant, fit a congeneric CFA, and then compute regression scores from that CFA. Absolute differences in average correlations with true scores, and in the standard deviations about these averages, were most pronounced when regression scores computed from a congeneric CFA with one or more negative variances fixed to zero were compared with the other scores considered. Specifically, differences did not exceed .03 units in the initial investigation, but comparisons involving scores computed from models where Heywood cases were addressed by fixing problematic residual variances to zero exhibited differences as large as .10 units. Therefore, if congeneric factor models result in inadmissible solutions, it is advantageous to impose essential tau-equivalence before computing factor scores, or to compute sum scores, rather than to address inadmissible solutions by fixing problematic values to some plausible constant.
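The definition of nonconvergence used here can be expressed as a simple admissibility check. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are taken to be estimated residual variances and (optionally) a factor correlation matrix extracted from whatever SEM software was used, and only the two kinds of implausible estimates named in the text are checked. The fallback mirrors the recommendation above and is a simplification, since the constrained model would itself need to be checked.

```python
import numpy as np

def is_admissible(residual_variances, factor_correlations=None):
    """Return False if any residual variance is negative (a Heywood case)
    or any factor correlation exceeds 1 in absolute value."""
    theta = np.asarray(residual_variances, dtype=float)
    if np.any(theta < 0):
        return False
    if factor_correlations is not None:
        r = np.asarray(factor_correlations, dtype=float)
        off_diagonal = r[~np.eye(r.shape[0], dtype=bool)]
        if np.any(np.abs(off_diagonal) > 1):
            return False
    return True

def recommended_scores(converged, residual_variances, factor_correlations=None):
    """Hypothetical decision helper: treat an inadmissible solution as
    nonconvergence instead of fixing the offending value to a constant."""
    if converged and is_admissible(residual_variances, factor_correlations):
        return "regression scores from the congeneric CFA"
    return "scores from the essentially tau-equivalent CFA, or sum scores"

# Hypothetical estimates: the second solution contains a negative residual variance.
print(recommended_scores(True, [0.42, 0.35, 0.51]))
print(recommended_scores(True, [0.42, -0.03, 0.51]))
```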
Limitations and Future Directions
As is typical in simulation research, population generating models were designed to adhere to the assumption of multivariate normality; however, due to increased sampling variance in small samples, many samples exhibited mild-to-moderate observed variable skew (absolute values as large as 2.15) and kurtosis (absolute values as large as 6.15). It is important to note that this represents skew and kurtosis resulting from sampling variability and not population-level nonnormality. Results are therefore generalizable to data exhibiting mild-to-moderate skew and kurtosis, assuming that skew and kurtosis are products of sampling variability. Results, however, cannot be generalized to more extreme skew and kurtosis or to nonnormality that is a product of sampling from a nonnormal population. This simulation has offered guidance regarding scoring methods when data are not ideal as a result of sample size, including when observed variable skew and kurtosis are present due to sampling variability. Future research should extend this work to other sources of nonideal data, such as substantial skew and kurtosis stemming from nonnormal population distributions. Furthermore, future research should consider other forms of data complexity common in both small and moderate samples, such as ordinal indicators. The utility of factor scores from ordinal data modeled using weighted least squares mean- and variance-adjusted (WLSMV) estimation may differ from the results presented in this manuscript.
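The distinction between sampling-induced and population-level nonnormality can be illustrated with a small Monte Carlo sketch. The sample sizes, number of replications, and seed below are hypothetical choices, and kurtosis is computed as excess kurtosis from moment estimators, which may differ from the estimator used in this study. Every sample is drawn from a standard normal population, so any observed skew or kurtosis is a product of sampling variability alone.

```python
import numpy as np

def sample_skew_kurtosis(x):
    """Moment-based sample skewness and excess kurtosis."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered**2)
    skew = np.mean(centered**3) / m2**1.5
    excess_kurtosis = np.mean(centered**4) / m2**2 - 3.0
    return skew, excess_kurtosis

def max_abs_skew_kurtosis(n, n_reps=2000, seed=0):
    """Largest absolute skew and excess kurtosis across replicated normal samples."""
    rng = np.random.default_rng(seed)
    stats = np.array([sample_skew_kurtosis(rng.standard_normal(n))
                      for _ in range(n_reps)])
    return np.abs(stats).max(axis=0)

# Hypothetical sample sizes: one small, one large.
small_n_extremes = max_abs_skew_kurtosis(n=30)
large_n_extremes = max_abs_skew_kurtosis(n=1000)

print("N = 30:  ", np.round(small_n_extremes, 2))
print("N = 1000:", np.round(large_n_extremes, 2))
```

The most extreme values across replications shrink as N grows, even though the population is normal in both conditions.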
A key limitation of this investigation is that scores were evaluated solely by their relation to true scores, and not by their ability to accurately capture structural relations among latent variables. As in past scoring research introducing alternative factor models from which to compute factor scores, it is a necessary and important step to begin by evaluating score performance in isolation (Curran et al., 2016; Strauss & Curran, 2026) before bringing these scores to subsequent models (Curran et al., 2018). Importantly, marginal differences noted in this study may lead to substantial differences in subsequent structural models. Future research should consider the benefits and limitations of these scores when used in subsequent models. In particular, it would be practically useful to investigate the utility of scores from an essentially tau-equivalent CFA when used in structural models, such as regression or path analysis, as these have not been critically evaluated in this context. This could be done by incorporating scores into the bias-correcting approach to factor score regression and path analysis (Croon, 2002) or the bias-avoiding approach to factor score regression (Skrondal & Laake, 2001). Alternative methods for small-sample SEM, such as the structural after measurement approach (Rosseel & Loh, 2024), have also recently been proposed. These do not require obtaining scores directly, but future work that evaluates small-sample scoring in structural models would benefit from comparisons to these methods. Particularly given clear recommendations for including sum scores in simulation research (Georgeson, 2025), practical small-sample modeling guidance that considers a wide range of models and methods is a promising direction for further inquiry.
Supplemental Material
Supplemental material, sj-pdf-1-epm-10.1177_00131644261441852 for Factor Scores in Small Samples: Recommendations and Solutions by Christian L. L. Strauss in Educational and Psychological Measurement
Footnotes
Ethical Considerations
Ethical approval was not required.
Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statements
Not applicable.
Supplemental Material
Supplemental material for this article is available online.
