Factor Scores in Small Samples: Recommendations and Solutions

Abstract

Simultaneous estimation of structural and measurement models in structural equation modeling (SEM) is not always tenable in small samples. In such cases, it may be necessary or advantageous to obtain scores. Current scoring recommendations draw predominantly from simulations with sample sizes greater than N = 200. This paper extends these recommendations to small N, directly comparing factor scores to sum scores. In addition, scores computed from an essentially tau-equivalent factor model are introduced as an alternative scoring option aimed at balancing the competing benefits of sum scores and factor scores. Findings largely suggest that factor scores from an essentially tau-equivalent factor model are advantageous when considering convergence and stability, even when not supported by the data, so long as departures from their assumptions are not substantial. They are obtainable when congeneric factor models fail to converge and have similar correlations with true scores compared with typical factor scores in samples at or less than N = 200.

Keywords

factor scores, sum scores, measurement, tau-equivalent

Introduction

An appealing benefit of structural equation modeling (SEM) is its ability to simultaneously estimate both measurement and structural models, producing estimates of relations among latent variables and measurement model parameters in a single model. Despite these advantages, typical SEM methods require large samples for stable estimation, particularly with increasing model complexity (e.g., Gagne & Hancock, 2006; Rosseel, 2020). A potential solution for small-sample SEM involves a two-step procedure, wherein a measurement model is fit, factor scores are computed using parameter estimates from this measurement model, and then these scores—or a corrected covariance matrix of scores—are used as input to estimate the structural component of the model. This two-step process has been studied extensively in a variety of contexts, including regression (Bogaert et al., 2023; Croon, 2002; Skrondal & Laake, 2001), path analysis (Devlieger & Rosseel, 2017; Kelcey, 2019; Lu et al., 2011), models with latent-interaction effects (Cox & Kelcey, 2021), and multilevel models (Devlieger & Rosseel, 2020; Kelcey et al., 2021), among others.

Recently, methodologists have engaged in increased scholarship and heightened debate directly comparing estimated factor scores obtained from a confirmatory factor analysis (CFA) with traditional sum or mean scores (McNeish, 2023; McNeish & Wolf, 2020; Rhemtulla & Savalei, 2025; Sijtsma et al., 2024; Widaman & Revelle, 2023, 2024). Both options for obtaining scores have noted strengths and limitations. For one, sum scores can be expressed as a type of estimated factor score, albeit one from a highly constrained underlying factor model. Specifically, all factor loadings are equivalent and set to 1.0, and all residual variances are set equal. This implies all observed indicators are equally influenced by the latent factors they purport to measure. These conditions are often unrealistic in practice, making estimated factor scores from less constrained factor models more attractive in some cases. On the other hand, factor score estimates based on parameter estimates from a CFA are highly sample-dependent. The elements of the formulae themselves will differ from sample to sample, reducing both consistency and generalizability across samples. Therefore, although sum scores may not fully capture all aspects of the true population generating measurement model, assuming the population generating model most closely aligns with the common factor model with varying factor loadings and residual variances, they are less prone to inconsistency due to sampling variability.

Critically, the cited methodological publications comparing estimated factor scores and sum scores involved simulation studies and empirical demonstrations utilizing samples typically of sizes at or greater than N = 200. Furthermore, research investigating samples less than N = 200 has largely explored the method of Croon (2002), a promising alternative to simultaneous SEM in small samples, but this work has not emphasized comparisons to sum scores or considered alternative models with which to compute scores when sample size limitations preclude stable estimation of the measurement model in isolation (Bogaert et al., 2023; Kelcey, 2019). There is a gap in the present literature providing evidence-based recommendations to researchers regarding best practice when obtaining factor score estimates, considering sum scores and other alternatives, when the impacts of sampling variability are most deleterious.

My goal is to fill this gap by offering pragmatic guidance on scoring methods in small samples. To accomplish this aim, I begin with a brief review of CFA to establish a shared notation to be used throughout. After this, I introduce formulae used to obtain factor score estimates from a CFA. Next, I propose an intermediary option between typical factor scores and sum scores, drawing on well-established psychometric tools and applying these to factor scoring. Finally, I evaluate the usability and performance of these scores in samples at and less than N = 200 through a simulation demonstration.

Confirmatory Factor Analysis

The CFA aims to identify one or more substantively motivated latent factors that explain covariation among a larger set of observed indicators. Here, latent factors represent unmeasured or unmeasurable constructs (Bollen, 2002), such as personality traits, psychopathology, or cognitive processes. Observed indicators are measured variables believed to assess underlying levels of one or more latent factors. The relation of observed indicators to latent factors can be described by Equation 1

y_{i} = ʋ + Λ η_{i} + ϵ_{i}

(1)

where $y_{i}$ contains responses on all observed indicators for person i, $ʋ$ contains observed indicator intercepts, $Λ$ contains factor loadings, $η_{i}$ contains true latent standing for person i, and $ϵ_{i}$ represents latent disturbances for person i, such that $ϵ_{i} \sim N (0, Θ_{ϵ})$.

The model-implied mean and covariance structures resulting from Equation 1 are as follows:

μ (θ) = ʋ + Λ α

(2)

Σ (θ) = Λ Ψ Λ' + Θ_{ϵ}

(3)

where $α$ contains latent variable means and $Ψ$ is the latent covariance matrix.
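As a rough illustration of Equations 2 and 3, the sketch below computes the model-implied mean vector and covariance matrix for a small two-factor model using only the Python standard library. The parameter values (loadings of .8, residual variances of .36, a factor correlation of .40, zero intercepts and latent means) and all function names are assumptions chosen for illustration; this is not the R code used in the study.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def implied_moments(nu, lam, alpha, psi, theta):
    """Return (mu, Sigma): mu = nu + Lambda alpha, Sigma = Lambda Psi Lambda' + Theta."""
    mu = [nu[i] + sum(lam[i][k] * alpha[k] for k in range(len(alpha)))
          for i in range(len(nu))]
    common = matmul(matmul(lam, psi), transpose(lam))
    sigma = [[c + t for c, t in zip(row_c, row_t)]
             for row_c, row_t in zip(common, theta)]
    return mu, sigma


# Illustrative (assumed) values: two factors with two indicators each.
lam = [[0.8, 0.0], [0.8, 0.0], [0.0, 0.8], [0.0, 0.8]]
psi = [[1.0, 0.4], [0.4, 1.0]]
theta = [[0.36 if i == j else 0.0 for j in range(4)] for i in range(4)]
nu = [0.0] * 4
alpha = [0.0, 0.0]

mu, sigma = implied_moments(nu, lam, alpha, psi, theta)
```

With these values, each indicator has an implied variance of .64 + .36 = 1, indicators of the same factor covary at .64, and indicators of different factors covary at .8 × .4 × .8 = .256.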

Estimates of model parameters, namely, $\hat{ʋ}$, $\hat{Λ}$, $\hat{Ψ}$, $\hat{α}$, and ${\hat{Θ}}_{ϵ}$, are typically obtained using maximum likelihood (ML) or full information maximum likelihood (FIML), which are iterative estimators designed to obtain the parameter estimates under which the observed data are most likely (for a detailed review, see Enders, 2022). Among other desirable properties, ML estimators are asymptotically efficient, meaning they are generally more accurate than other estimators given a properly specified model and a reasonably large sample size (e.g., Chen et al., 2023). This, in part, may explain the ubiquity of ML estimation in factor analysis and why ML estimators serve as default estimation procedures in most software with CFA capabilities; however, without a reasonably large sample size, the accuracy of parameter estimates used directly in formulae for common factor scoring methods will deteriorate.

Factor Scores and Sum Scores

Precise elements of $η_{i}$ , or individual latent standings, are not estimated in the CFA. Instead, mean values, variances, and covariances among all latent factors are determined. If an analyst’s goal is to obtain individual estimates of latent standing for visualization or for use in a subsequent structural model, estimates of $η_{i}$ must be computed after fitting a measurement model. These estimates are referred to as factor score estimates.

There are theoretically an infinite number of solutions for $η_{i}$ consistent with model parameters—a phenomenon often referred to as factor score indeterminacy (Steiger & Schönemann, 1978). This has led methodologists to propose multiple kinds of factor score estimates that take different approaches to solving the problem of indeterminacy and maintain different properties (Anderson & Rubin, 1957; Bartlett, 1937; Krijnen et al., 1996; Ten Berge et al., 1999; Thomson, 1935, 1938; Thurstone, 1935).

Generally, factor score estimates can be obtained with the following expression:

F_{η, i} = A_{η} y_{i}

(4)

where $F_{η, i}$ is the factor score estimate of $η$ for person i and $A_{η}$ is a matrix-valued function or an algebraic expression of matrices containing parameter estimates from a factor model.

Regression scores (Thomson, 1935, 1938) are some of the most frequently utilized, readily obtainable, and researched factor score estimates. They can be computed with the following matrix-valued function:

A_{η}^{R} = \hat{Ψ} \hat{Λ}' (Σ (\hat{θ}))^{- 1}

(5)

where $\hat{.}$ is used to denote sample estimates of population parameters. The left-hand side of Equation 5 is a series of weights for observed response variables in $y_{i}$ aimed to maximize prediction of $η_{i}$ . Alternative weighting formulae have been proposed, such as Bartlett scores (Bartlett, 1937) and Anderson and Rubin scores (Anderson & Rubin, 1957). Importantly, Bartlett scores are linear transformations of regression scores, and thus are perfectly correlated at the vector level, or when obtained from single-factor models (Thomson, 1938). Anderson and Rubin scores take an alternative approach, producing scores that more closely reproduce observed covariances among latent factors. Importantly, different scoring formulae can lead to different predictions of latent standing; however, past research suggests that many commonly used scoring procedures produce factor scores that are highly correlated with one another (e.g., Fava & Velicer, 1992).
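The regression weighting can be sketched numerically as follows. The example forms the weight matrix as the product of the latent covariance matrix, the transposed loading matrix, and the inverse of the model-implied covariance matrix, and then applies Equation 4. The parameter values stand in for sample estimates and are hypothetical, responses are assumed to be mean-centered (zero intercepts and latent means), and the helper functions are illustrative rather than taken from lavaan or the study's scripts.

```python
def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def regression_weights(lam, psi, theta):
    """Return A = Psi Lambda' Sigma^{-1}, with Sigma = Lambda Psi Lambda' + Theta."""
    lam_t = transpose(lam)
    common = matmul(matmul(lam, psi), lam_t)
    sigma = [[c + t for c, t in zip(row_c, row_t)]
             for row_c, row_t in zip(common, theta)]
    return matmul(matmul(psi, lam_t), inverse(sigma))


def factor_scores(weights, responses):
    """Apply F_i = A y_i to each response vector y_i."""
    return [[sum(w * v for w, v in zip(row, y_i)) for row in weights]
            for y_i in responses]


# Hypothetical "estimates": two factors, three indicators each.
loads = [0.6, 0.7, 0.8]
lam = [[l, 0.0] for l in loads] + [[0.0, l] for l in loads]
psi = [[1.0, 0.4], [0.4, 1.0]]
resid = [1.0 - l ** 2 for l in loads] * 2
theta = [[resid[i] if i == j else 0.0 for j in range(6)] for i in range(6)]

weights = regression_weights(lam, psi, theta)  # 2 x 6 weight matrix
scores = factor_scores(weights, [[1.0] * 6, [0.5, -0.2, 0.1, 0.0, 0.3, -1.0]])
```

Because the factors are correlated, each row of the weight matrix generally places nonzero weight on the indicators of the other factor as well as its own.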

Sum scores are different from regression scores and other factor scores in that they do not require prior estimation of a CFA. They avoid some aspects of indeterminacy by defining scores as a simple function of equally weighted observed variables using all observed variance as opposed to partitioning variance into true score variance and error variance for each observed indicator uniquely (Widaman & Revelle, 2023). They can, however, be likened to factor scores computed from a highly constrained factor model. Specifically, when considered in the latent variable modeling framework, sum scores are factor scores computed from a model where all standardized factor loadings are set equal, implying equality of unstandardized loadings and residual variances (McNeish & Wolf, 2020). Practically speaking, this assumption requires equal weighting of all observed indicators, which can be challenging to meet.
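The link between sum scores and regression scores can be made concrete for a single factor. The derivation below is a sketch under stated assumptions (one factor with variance $ψ$, $p$ indicators, diagonal residual covariance matrix, mean-centered responses); it applies the Sherman-Morrison identity to the regression weighting $Ψ Λ' Σ^{-1}$, and the symbols $a_j$ and $θ_j$ are introduced here for illustration only.

```latex
% Single factor with variance \psi, loadings \lambda_j, and residual
% variances \theta_j. Regression weight for indicator j:
a_j = \frac{\psi \, \lambda_j / \theta_j}
           {1 + \psi \sum_{k=1}^{p} \lambda_k^{2} / \theta_k}

% With \lambda_j = \lambda and \theta_j = \theta for all j, every
% indicator receives the same weight:
a_j = \frac{\psi \lambda}{\theta + p \, \psi \lambda^{2}}
\quad\Longrightarrow\quad
F_i = \frac{\psi \lambda}{\theta + p \, \psi \lambda^{2}} \sum_{j=1}^{p} y_{ij}
```

Under these equality constraints, the regression score is a constant multiple of the sum score, so the two are perfectly correlated. For instance, with $ψ = 1$, $λ = .8$, $θ = .36$, and $p = 5$, each weight is $.8 / 3.56 ≈ .225$.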

Importantly, although assumptions underlying sum scores may be practically untenable in some instances, they have important benefits. For one, sum scores ensure that the same scoring formulae are used across samples (Widaman & Revelle, 2023). This can prevent overfitting, in that instability in sample estimates will not deleteriously impact scoring formulae. In addition, because of the instability in sample estimates obtained from estimation procedures like ML that require large samples, choosing to use a consistent but potentially wrong weighting for observed indicators, such as 1.0, can be beneficial (Rhemtulla & Savalei, 2025; Uanhoro, 2019). This can be conceptually embedded in the broader framework of the bias-variance trade-off, wherein bias is systematically and intentionally introduced into sample estimates to improve stability and reduce variance within and across samples.

Continuing with the conceptual parallel to the bias-variance trade-off, it is not necessary to introduce a consistent degree of potential bias, such as unit weighting observed indicators and holding error variance constant. For example, many applications of the bias-variance trade-off involve tuning parameters or values that can be thoughtfully selected, which determine the extent to which sample estimates are influenced by the unique characteristics of the sample or intentionally biased to reduce variance. Therefore, factor scores computed from a congeneric factor model, wherein all factor loadings and error variances are freely estimated, can be likened to an extreme in variance, in that they are likely to differ substantially across samples, especially when sample sizes are small. Sum scores are then an extreme in bias in typical applications of the factor model, where equal weighting and error variances are not reasonably supported. Thus, a natural compromise between the two is the essentially tau-equivalent model (Graham, 2006).

The essentially tau-equivalent model imposes equal weighting of observed indicators, or equivalence of factor loadings within a latent factor, but does not require equal residual variances or equivalent standardized loadings. Although the essentially tau-equivalent model has been a standard psychometric tool for decades, it appears underutilized when the goal of measurement modeling is to estimate latent standing via factor scores, and it has received limited attention in methodological contributions in scoring. Any factor score that can be computed using parameter estimates from a congeneric CFA can also theoretically be computed using parameter estimates from an essentially tau-equivalent CFA. Furthermore, factor score estimates computed using parameter estimates from the essentially tau-equivalent model may be a more practical and stable scoring method in small samples, balancing the benefits and limitations of the aforementioned “extremes.”
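To see how the three models weight indicators differently, consider a single factor with variance ψ, for which the regression weight for indicator j reduces to ψ(λ_j/θ_j) divided by 1 + ψ Σ_k λ_k²/θ_k. The sketch below evaluates this closed form for hypothetical parameter values under the sum score model, the essentially tau-equivalent model, and the congeneric model. The single-factor simplification, the function name, and the numeric values are assumptions made for illustration.

```python
def single_factor_weights(loadings, residual_vars, factor_var=1.0):
    """Regression weights for one factor with a diagonal residual covariance matrix."""
    denom = 1.0 + factor_var * sum(l * l / t
                                   for l, t in zip(loadings, residual_vars))
    return [factor_var * (l / t) / denom
            for l, t in zip(loadings, residual_vars)]


# Hypothetical parameter values for five indicators.
# Sum score model: equal loadings and equal residual variances.
sum_score_model = single_factor_weights([0.8] * 5, [0.36] * 5)

# Essentially tau-equivalent: equal loadings, unequal residual variances.
tau_residuals = [0.36, 0.43, 0.50, 0.57, 0.64]
tau_equivalent = single_factor_weights([0.8] * 5, tau_residuals)

# Congeneric: loadings and residual variances both vary.
congeneric = single_factor_weights([0.4, 0.5, 0.6, 0.7, 0.8],
                                   [0.84, 0.75, 0.64, 0.51, 0.36])
```

In this sketch, the sum score model yields identical weights, the essentially tau-equivalent model yields weights inversely proportional to the residual variances, and the congeneric model lets weights depend on both loadings and residual variances. Only the residual variances, and one common loading per factor, must be estimated to form the tau-equivalent weights.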

Methods

The purpose of this investigation is to provide guidance and recommendations on scoring procedures in small to moderate samples (N $\leq$ 200), directly comparing scores from a congeneric CFA to sum scores, and introducing scores computed from the essentially tau-equivalent model as an interstitial option. All artificial data were simulated in R (R Core Team, 2023), and all factor models were estimated using lavaan (Rosseel, 2012). R-script examples for the simulation are available in the linked supplementary materials.

Data were simulated by manipulating five design characteristics: (a) interfactor correlation, (b) factor loadings and residual variances, (c) number of items per factor, (d) sample size, and (e) scoring procedure. These will be discussed in turn. Population generating values were selected to match a range of values commonly encountered in practice. First, true scores on two correlated latent factors were simulated from a multivariate normal distribution with variances of one and factor correlations of either .20, .40, or .60, representing moderate to large interfactor correlations. A two-correlated factors model was selected as a population generating model, given noted increases in sample size requirements when moving from a one-factor to a two-factor model but not when increasing above two factors (Wolf et al., 2013), suggesting two-factor models can be used to provide key insights into model performance of multifactor models.

Observed indicators were simulated as a function of simulated true scores according to Equation 1. Fixed values for $λ$ and $Θ_{ϵ}$ were selected to enhance comparability within design cells, but six different conditions were considered to demonstrate a range of population generating models, including a model in which sum scores were tenable and increasing in complexity until only the congeneric factor model was tenable. In one condition, all factor loadings were set to .8, and all residual variances were set to .36. This condition was selected to adhere most closely to the sum score model. In another, all factor loadings were set to .8, and residual variances were randomly sampled from a range of values between .36 and .64. This condition was selected to adhere most closely to the essentially tau-equivalent model. Two further conditions were selected to incrementally depart from what is assumed by sum scores. In another condition, factor loadings were simulated to be .6 and .8 (with residual variances of .64 and .36, respectively), and in an additional condition, factor loadings were simulated to be .40, .50, .60, .70, and .80 (with residual variances ranging from .36 to .84, determined by setting residual variances equal to $1 - λ^{2}$ ).

Critically, in all but one of these conditions, residual variances were completely determined by the associated factor loading value. To depart more completely from the essentially tau-equivalent model, two additional conditions were included in which factor loadings and error variances were sampled independently of one another, and factor loadings were not all equivalent. In one condition, factor loadings were randomly sampled from a range of values between .60 and .80. After this, residual variances were independently sampled from the range between .36 and .64. These values align with the third condition presented above, but include residual variances that were not a function of factor loadings. In the other condition, factor loadings were randomly sampled from a range of values between .40 and .80, and residual variances were independently sampled from the range between .36 and .84, aligning with the fourth condition presented above. Within each of these two conditions, the same set of sampled loadings and residual variances was then used throughout. In all conditions, errors were drawn from a multivariate normal distribution with a diagonal variance/covariance matrix.

To distribute the range of loadings across observed indicators and explore conditions with a different number of indicators per factor, a total of 10 indicators were simulated per latent factor. For the condition with consistent loadings and randomly sampled error variances, a total of 10 error variances were randomly drawn, and these were assigned to each of the 10 indicators in turn. For the condition with factor loadings of .6 and .8 and residual variances dependent on factor loadings, loadings were assigned in an alternating fashion. For the condition with loadings from .40 to .80 and residual variance dependent on factor loadings, the five loadings were assigned to the first five indicators and then repeated for the last five. For the two conditions where loadings and error variances were sampled independently of one another, a total of 10 loadings and 10 error variances were randomly drawn, and these were assigned to each of the 10 indicators in turn. Then, CFAs were estimated using unstandardized simulated data and different numbers of simulated observed variables to vary the number of observed indicators per factor: in one model, all 10 indicators per factor were included, leading to a two-factor model with a total of 20 observed variables, or 10 indicators per factor; in another model, only the first five indicators per factor were included, leading to a two-factor model with a total of 10 observed variables, or five indicators per factor.
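The data-generating steps for one condition can be sketched as follows, here for the condition with alternating loadings of .6 and .8 and residual variances of 1 − λ². The sketch draws two correlated true scores per person and then builds 10 indicators per factor from Equation 1 with zero intercepts. It is a standard-library Python approximation, not the study's R code; the function name, the seed, and the choice to reuse the same loading pattern for both factors are assumptions.

```python
import math
import random


def simulate_two_factor(n, loadings, residual_vars, factor_corr, rng):
    """Return (eta, y): true scores (n x 2) and indicators (n x 2p)."""
    eta, y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        # 2 x 2 Cholesky step: unit variances, correlation factor_corr.
        e1 = z1
        e2 = factor_corr * z1 + math.sqrt(1.0 - factor_corr ** 2) * z2
        row = []
        for true_score in (e1, e2):
            for lam, theta in zip(loadings, residual_vars):
                # y = lambda * eta + epsilon, with independent normal errors.
                row.append(lam * true_score + rng.gauss(0.0, math.sqrt(theta)))
        eta.append((e1, e2))
        y.append(row)
    return eta, y


loadings = [0.6, 0.8] * 5                       # alternating assignment
residual_vars = [1.0 - l ** 2 for l in loadings]
rng = random.Random(2024)                       # arbitrary seed
eta, y = simulate_two_factor(50, loadings, residual_vars, 0.40, rng)

# Five-indicator version: keep the first five indicators of each factor.
y_five = [row[0:5] + row[10:15] for row in y]
```

The first 10 columns of `y` belong to the first factor and the last 10 to the second, so the five-indicator models correspond to columns 0 to 4 and 10 to 14.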

In total, four CFAs were fit to each simulated data set: (a) a CFA with freely estimated loadings and five indicators per factor, (b) a CFA with freely estimated loadings and 10 indicators per factor, (c) an essentially tau-equivalent CFA with five indicators per factor, and (d) an essentially tau-equivalent CFA with 10 indicators per factor. Regression scores were computed from each CFA. Sum scores were also computed using either the first 5 indicators per factor or all 10. Sum scores were computed using unstandardized observed indicators, as is typical when observed variables are on similar metrics. Notably, in the three conditions where residual variances were dependent on the loading values, observed variables were set to all have a mean of zero and standard deviation of one, meaning they were simulated as standardized. In the subsequent conditions with independent factor loadings and residual variances, observed variable means were still set to zero, but variances differed from one; however, the overall scales of distributions of observed variables were still roughly similar, making it plausible that an analyst would not standardize before computing sum scores.

Taken together, there were a total of three scoring methods used across two different numbers of indicators per factor. There were a total of six conditions for factor loadings and residual variances and three different interfactor correlation values. All design conditions were fully crossed.

Finally, all conditions were evaluated in each of eight sample sizes from N = 30 to N = 100, in increments of 10, for increased granularity in very small samples. Additional sample sizes of N = 150 and N = 200 were also explored to evaluate scores in moderate sample sizes. One thousand data sets were simulated per design cell.
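The size of the fully crossed design can be checked with a short enumeration. The sketch below crosses the data-generating factors (loading and residual variance condition, interfactor correlation, and sample size); the condition labels are shorthand invented for this illustration.

```python
from itertools import product

# Shorthand labels (assumed) for the six loading / residual variance conditions.
loading_conditions = [
    "all .8, residuals .36",
    "all .8, residuals varied",
    ".6 and .8, residuals dependent",
    ".4 to .8, residuals dependent",
    ".6 to .8, residuals independent",
    ".4 to .8, residuals independent",
]
factor_correlations = [0.20, 0.40, 0.60]
sample_sizes = list(range(30, 101, 10)) + [150, 200]   # 8 small + 2 moderate
replications = 1000

cells = list(product(loading_conditions, factor_correlations, sample_sizes))
n_datasets = len(cells) * replications

# Each data set receives a congeneric and a tau-equivalent CFA
# at each of the two numbers of indicators per factor.
cfas_per_indicator_count = 2 * n_datasets
```

This gives 6 × 3 × 10 = 180 data-generating cells and 180,000 simulated data sets, so 360,000 factor models are fit at each number of indicators per factor, which matches the count of 10-indicator models reported in the Results.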

Given the aim of evaluating the utility of scores themselves, three outcome variables were explored. First, I computed the proportion of properly converged solutions out of the total 1,000 iterations per design cell as a measure of the viability of score estimation. Nonconvergence was defined to include models that converged but resulted in inadmissible solutions due to implausible values such as negative residual variances and correlations greater than one. Given sum scores were computed without the estimation of a factor model, convergence was not considered for these.

The majority of converged but inadmissible solutions occurred due to negative residual variances, or Heywood cases (Heywood, 1937). In practice, many analysts choose to manually fix problematic residual variances to zero or a very small constant. Although Heywood cases are often the symptom of a larger structural problem, this simple fix is commonly implemented (Cooperman & Waller, 2022). Therefore, for all solutions where the congeneric CFA resulted in one or more negative variance estimates, a follow-up investigation was conducted wherein implausible residual variance estimates were set to zero using the bounds = "pos.var" option in lavaan. Regression scores were then computed from the congeneric CFA, with any negative residual variances fixed to zero. These solutions were set aside to be separately assessed.
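The rule used to classify a solution as properly converged can be sketched as a simple check on the estimates. The function below assumes that an optimizer convergence flag, the estimated residual variances, and the estimated interfactor correlations have already been extracted from a fitted model; the function names and the example values are hypothetical and do not correspond to any lavaan output format.

```python
def is_admissible(converged, residual_vars, factor_corrs):
    """True only if the optimizer converged and all estimates are plausible."""
    if not converged:
        return False
    if any(v < 0.0 for v in residual_vars):       # Heywood case
        return False
    if any(abs(r) > 1.0 for r in factor_corrs):   # correlation beyond +/- 1
        return False
    return True


def convergence_rate(solutions):
    """Proportion of solutions in a design cell that are properly converged."""
    return sum(is_admissible(**s) for s in solutions) / len(solutions)


# Four hypothetical solutions from one design cell.
solutions = [
    {"converged": True, "residual_vars": [0.31, 0.42], "factor_corrs": [0.38]},
    {"converged": True, "residual_vars": [-0.02, 0.55], "factor_corrs": [0.41]},
    {"converged": True, "residual_vars": [0.29, 0.47], "factor_corrs": [1.03]},
    {"converged": True, "residual_vars": [0.35, 0.40], "factor_corrs": [0.22]},
]
rate = convergence_rate(solutions)
```

In this toy cell, the second solution is a Heywood case and the third has an out-of-range correlation, so both count as nonconverged even though the optimizer itself reported convergence.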

In addition to convergence, the correlation between score estimates and true scores was computed to explore the extent to which each scoring method provided an accurate estimate of true latent standing. In some cases, correlations were negative due to the choice of scaling indicator, so absolute values were taken to ensure all correlations were positive. To allow comparability against prior simulation work evaluating factor scores, Pearson correlations were computed (e.g., Curran et al., 2016; Rhemtulla & Savalei, 2025; Strauss & Curran, 2026). Furthermore, these correlations provide a useful measure of factor score reliability (Estabrook & Neale, 2013). Finally, standard deviations taken from the empirical sampling distribution of Pearson correlations with true scores were computed to assess variability.
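The two correlation-based outcomes can be illustrated with sum scores, which require no model estimation. The sketch below simulates repeated single-factor data sets, computes the absolute Pearson correlation between sum scores and true scores in each, and then summarizes the mean and standard deviation across replications. The single-factor setup, the use of 200 rather than 1,000 replications, the seed, and the function names are simplifying assumptions for illustration.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sum_score_replication(n, loadings, rng):
    """Simulate one data set; return |r| between sum scores and true scores."""
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    sums = [sum(l * e + rng.gauss(0.0, math.sqrt(1.0 - l * l)) for l in loadings)
            for e in eta]
    return abs(pearson(sums, eta))


rng = random.Random(1)  # arbitrary seed
cors = [sum_score_replication(30, [0.8] * 5, rng) for _ in range(200)]
mean_r = statistics.fmean(cors)  # average correlation with true scores
sd_r = statistics.stdev(cors)    # SD of the empirical sampling distribution
```

For five indicators with loadings of .8 and residual variances of .36, the population correlation between the sum score and the true score is 4 / √17.8 ≈ .948, so `mean_r` should fall close to that value, with `sd_r` describing how much the correlation varies from sample to sample at N = 30.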

Results

Convergence

Nonconvergence was defined to include solutions that properly converged, but resulted in implausible parameter estimates. Most of these were Heywood cases or, to a much lesser extent, correlations among latent factors greater than 1.0. In general, convergence issues were trivial with 10 indicators per factor. Out of a total of 360,000 factor models where all 10 indicators were included, only 45 did not converge properly. Of these, 39 nonconverged solutions occurred in the CFA with freely estimated factor loadings, and six nonconverged solutions occurred in the essentially tau-equivalent CFA. Nonconvergence was also generally trivial with sample sizes at or greater than N = 100. With 10 indicators per factor, all solutions properly converged when sample sizes were at or greater than N = 100. With five indicators per factor, only a single solution did not properly converge when sample sizes were at or greater than N = 100. Because nonconvergence was more common with sample sizes below N = 150 and five indicators per factor, only these are summarized in Figure 1.

Figure 1.

Convergence Rates of Congeneric CFA and Essentially Tau-Equivalent CFA With Five Indicators per Latent Factor and Sample Sizes up to N = 100.

As expected, convergence rates were substantially higher for the essentially tau-equivalent CFA compared with the congeneric CFA, in most conditions and at small N. Differences were most pronounced when the range of factor loadings was more extreme. For example, with factor loadings ranging from .40 to .80, independent error variances, an interfactor correlation of .20, and a sample size of N = 30, the convergence rate for the congeneric CFA was .76 compared with .99 for the essentially tau-equivalent CFA. This increased to .87 for the congeneric CFA at N = 40. When factor loadings ranged from .40 to .80 and error variances were determined with the formula $1 - λ^{2}$ , convergence rates were .80 for the congeneric CFA at N = 30 and .90 for the congeneric CFA at N = 40. When all factor loadings and error variances were set to a fixed and equivalent value (top panel of Figure 1), convergence rates did not differ notably between the two models.

Correlations With True Scores

Correlations of factor scores with true scores from appropriately converged solutions and sum scores with true scores are summarized in Figures 2 and 3. Correlations with true scores from solutions that resulted in Heywood cases will be considered in a subsequent section. Figures are separated by indicators per factor to maintain consistent y-axes within each figure. Figures display only correlations for one of the two simulated correlated factors, as there were no notable differences in findings across factors. In most conditions, patterns were consistent across different numbers of items per factor, but more pronounced with five items per factor compared with 10 items per factor, as seen in the reduced range of the y-axis in Figure 3 compared with Figure 2.

Figure 2.

Correlations With True Scores With Five Indicators per Factor.

Figure 3.

Correlations With True Scores With 10 Indicators per Factor.

Interestingly, in most conditions, correlations between regression scores and true scores did not differ substantially between the congeneric CFA and the essentially tau-equivalent CFA. At very small N, regression scores from the essentially tau-equivalent CFA had average correlations with true scores slightly smaller than regression scores from the congeneric CFA, but differences did not surpass the third decimal place. Therefore, I refer to regression scores generally, regardless of whether these were obtained from the congeneric CFA or essentially tau-equivalent CFA, in the following description of correlation patterns.

With large and consistent population generating values for $λ$, correlations with true scores were generally similar across all scoring methods. At N = 30, sum scores had marginally larger correlations with true scores, but differences did not exceed .01 units. When the population generating model was the essentially tau-equivalent model (condition labeled “.8, varied”), regression scores had larger correlations with true scores than sum scores as N increased, but absolute differences remained minimal.

As expected, larger differences among scoring methods emerged as the data-generating parameters departed from what is assumed under the sum score model (i.e., varied factor loadings and residual variances within a latent factor) and again as data-generating parameters further departed from the essentially tau-equivalent model (i.e., factor loadings varied and residual variances did not depend on unstandardized factor loadings). With factor loadings of .60 and .80 and residual variances of .64 and .36, respectively, correlations between sum scores and true scores were most similar to regression scores only at small N. As N increased, correlations between sum scores and true scores remained constant, whereas correlations between regression scores and true scores increased. With factor loadings ranging between .60 and .80 and residual variances independently ranging between .36 and .64, sum scores had marginally larger correlations with true scores at N = 30, but differences did not exceed .02 units. As the sample size increased, correlations with true scores converged across all three score types, except in the case where interfactor correlations were .60. With large interfactor correlations, regression scores had marginally higher correlations with true scores compared with sum scores as N increased, with differences not exceeding .01 units.

The largest differences across score types occurred when factor loadings were most varied. With factor loadings ranging from .40 to .80 and residual variances set to $1 - λ^{2}$, regression scores had consistently higher correlations with true scores than sum scores, with differences increasing as sample size increased to a maximum difference of .03 units. When residual variances were randomly sampled from a range of values between .36 and .84, patterns differed across the number of items per factor. With five items per factor, sum scores generally had larger correlations with true scores than regression scores with low N and low interfactor correlations. With increasing sample size and interfactor correlations, correlations between regression scores and true scores exceeded correlations between sum scores and true scores.

Standard Deviations of the Empirical Sampling Distribution of Correlations With True Scores

Standard deviations about the averages presented in Figures 2 and 3 are summarized in Figures 4 and 5. Figures are separated by indicators per factor to maintain consistent y-axes within each figure. Again, figures display only values for one of the two simulated correlated factors, as there were no notable differences in findings across factors.

Figure 4.

Standard Deviations of Correlations With True Scores With Five Indicators per Factor.

Figure 5.

Standard Deviations of Correlations With True Scores With 10 Indicators per Factor.

Differences in standard deviations of correlations with true scores were marginal in most conditions. The most visible differences emerged with five indicators per factor, factor loadings ranging from .40 to .80, and residual variances drawn independently from factor loadings and ranging from .36 to .84 (Figure 4). Here, at very small N, correlations between sum scores and true scores showed the least variability, followed by regression scores from the essentially tau-equivalent CFA, and finally by regression scores from the congeneric CFA. For example, at N = 30 and with interfactor correlations of .20, sum score correlations had a standard deviation of .06, regression score correlations from the essentially tau-equivalent CFA had a standard deviation of .07, and regression score correlations from the congeneric CFA had a standard deviation of .09.

Heywood Cases

Throughout simulations, one of the most common reasons the congeneric factor model resulted in an inadmissible solution was one or more negative residual variances, or Heywood cases.¹ These were initially flagged as nonconverged solutions and dropped from Figures 2 to 5; however, in practice, when analysts encounter Heywood cases, they often employ the pragmatic workaround of fixing any problematic residual variance estimate(s) to zero or some small constant. Therefore, an additional investigation was conducted to compare regression scores computed from a congeneric CFA where Heywood cases were addressed by fixing problematic values to zero to regression scores computed from a properly converged essentially tau-equivalent CFA. Comparisons to sum scores were also included.

Specifically, for all iterations where the congeneric CFA resulted in a Heywood case and the essentially tau-equivalent CFA converged appropriately, an additional congeneric CFA was fit wherein problematic residual variances were fixed to zero prior to obtaining factor score predictions. Correlations with true scores among these scores, regression scores from the essentially tau-equivalent CFA, and sum scores are summarized in Figure 6. These are not broken down by sample size, given that many sample sizes resulted in only a small number of Heywood cases. Furthermore, when factor loadings ranged from .60 to .80, there were fewer than approximately five Heywood cases in total beyond sample sizes of N = 50; therefore, for this condition, Figure 6 only displays results up to N = 50. When factor loadings ranged from .40 to .80, there were fewer than approximately five Heywood cases in total beyond sample sizes of N = 80. For this condition, sample sizes up to N = 80 are included in Figure 6. Conditions with consistent $λ$ and models using all 10 indicators were also not included, given the minimal number of Heywood cases.

Figure 6.

Distributions of Correlations With True Scores for Solutions Where the Congeneric CFA Resulted in a Heywood Case.

In all conditions, regression scores from a congeneric CFA with one or more Heywood cases fixed to zero produced the lowest correlations between scores and true scores, and the most variability in these correlations. For example, with factor loadings between .40 and .80 and independently defined residual variances, some iterations resulted in solutions where correlations between regression scores and true scores were at or below .25. Average correlations with true scores differed among the scoring procedures by up to nearly .10 units. For example, with factor loadings ranging from .40 to .80 and residual variances determined independently from factor loadings, the average correlation between true scores and regression scores from the congeneric CFA was .69 (SD = 0.13) compared with .78 (SD = 0.09) for regression scores from the essentially tau-equivalent CFA and .79 (SD = 0.07) for sum scores. By comparison, average correlations did not substantially differ between regression scores from the essentially tau-equivalent CFA and sum scores, with maximum within-condition differences around .01 units.
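One possible intuition for this pattern, offered as an illustrative sketch and not as an analysis from the simulation, is that a residual variance fixed at or near zero makes the regression weights concentrate almost entirely on that one indicator. The Python sketch below assumes a single-factor model with hypothetical population values (loadings of .40 to .80, residual variances equal to 1 minus the squared loading, factor variance of 1), holds all other parameters at those population values, and uses a small constant (1e-6) in place of zero. The function names are illustrative.

```python
import math

def regression_weights(loadings, residual_vars, factor_var=1.0):
    """One-factor regression (Thomson) weights: phi * Sigma^{-1} * lambda.

    Closed form via Sherman-Morrison, valid when every residual variance > 0.
    """
    ratios = [l / t for l, t in zip(loadings, residual_vars)]
    info = sum(l * r for l, r in zip(loadings, ratios))  # lambda' Theta^{-1} lambda
    return [factor_var * r / (1.0 + factor_var * info) for r in ratios]

def corr_with_true_score(weights, loadings, residual_vars, factor_var=1.0):
    """Population correlation between the composite sum(w_i * x_i) and the factor."""
    wl = sum(w * l for w, l in zip(weights, loadings))
    composite_var = wl ** 2 * factor_var + sum(
        w ** 2 * t for w, t in zip(weights, residual_vars)
    )
    return wl * factor_var / math.sqrt(factor_var * composite_var)

# Hypothetical population values (assumptions, not taken from the article).
true_loadings = [0.4, 0.5, 0.6, 0.7, 0.8]
true_residuals = [1.0 - l ** 2 for l in true_loadings]

# Weights from the population model versus weights after "fixing" the first
# indicator's residual variance to a small constant, as in a Heywood workaround.
w_full = regression_weights(true_loadings, true_residuals)
fixed_residuals = [1e-6] + true_residuals[1:]
w_fixed = regression_weights(true_loadings, fixed_residuals)

# Both composites are evaluated against the same (true) population model.
r_full = corr_with_true_score(w_full, true_loadings, true_residuals)
r_fixed = corr_with_true_score(w_fixed, true_loadings, true_residuals)
share_fixed = w_fixed[0] / sum(w_fixed)  # share of total weight on indicator 1

print(round(r_full, 3), round(r_fixed, 3), round(share_fixed, 5))
```

Under these assumptions, the second composite is essentially the first indicator rescaled, so its correlation with the true score is close to that indicator's standardized loading. In real Heywood cases the remaining estimates also change when the model is refit, so this is only a rough illustration of the mechanism.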

Discussion

Although in many applications it may be advantageous to simultaneously estimate a structural and measurement model with multiple-indicator latent factors, this is often untenable in small samples. In such cases, a potential solution involves obtaining scores, which represent estimates of individuals' standing on one or more latent factors, and separately modeling relations among latent factors; however, because standard scoring methods rely on sample estimates of often complex measurement structures, these scores are adversely affected by sampling variability (Rhemtulla & Savalei, 2025; Uanhoro, 2019). Alternatively, sum scores (or mean scores) can be used. These assume equivalent factor loadings and residual variances, which may not align with the true data-generating process. They are, however, not influenced by unreliable sample estimates of measurement model parameters.

Between these two extremes, I proposed computing factor scores from an essentially tau-equivalent model, which imposes equality of unstandardized factor loadings, but not of residual variances. Regardless of the data-generating process, one can fit an essentially tau-equivalent CFA and obtain factor score predictions directly from parameter estimates of this constrained model. In small samples, these models are more likely than congeneric factor models to result in admissible solutions. The primary aim of this investigation was to evaluate the utility of different scoring methods in the presence of substantial sampling variability, which is precisely the condition under which scores may be necessary. These scoring methods were: (a) regression scores computed from a congeneric CFA, (b) regression scores computed from the essentially tau-equivalent CFA, and (c) sum scores.
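To make the contrast between these scoring options concrete, the sketch below computes regression weights for a single-factor model with equal loadings and compares them with the unit weights implied by sum scores. It is a simplified illustration under stated assumptions (one factor, hypothetical parameter values, factor variance of 1), not the code used in the simulation, and `regression_weights` is a hypothetical helper name.

```python
def regression_weights(loadings, residual_vars, factor_var=1.0):
    """One-factor regression (Thomson) weights: phi * Sigma^{-1} * lambda.

    Uses the closed form
        w_i = phi * (l_i / t_i) / (1 + phi * sum_j l_j**2 / t_j),
    which holds when Sigma = phi * lambda * lambda' + diag(theta).
    """
    ratios = [l / t for l, t in zip(loadings, residual_vars)]
    info = sum(l * r for l, r in zip(loadings, ratios))
    return [factor_var * r / (1.0 + factor_var * info) for r in ratios]

# Hypothetical estimates from an essentially tau-equivalent fit:
# one common unstandardized loading, freely estimated residual variances.
common_loading = 0.6
residual_vars = [0.36, 0.50, 0.64, 0.75, 0.85]

tau_weights = regression_weights([common_loading] * 5, residual_vars)
sum_weights = [1.0] * 5  # sum scores weight every indicator equally

# Express each weight relative to the first indicator's weight.
relative_tau = [w / tau_weights[0] for w in tau_weights]
relative_sum = [w / sum_weights[0] for w in sum_weights]

print([round(w, 3) for w in relative_tau])
print(relative_sum)
```

With equal loadings, the relative regression weights depend only on the residual variances (each weight is inversely proportional to its indicator's residual variance), so indicators with less residual variance still receive more weight. Sum scores, by contrast, ignore residual variances entirely.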

Simulation results revealed three notable insights that can be translated into practical recommendations: (a) when sample sizes are small, and regardless of whether data conform to the assumptions of essential tau-equivalence, factor scores from the essentially tau-equivalent model correlate with true scores about as strongly as factor scores from the congeneric CFA and are generally obtainable even when the congeneric CFA does not properly converge; (b) sum scores are also comparable with factor scores from the essentially tau-equivalent model in very small samples, but in moderate samples they degrade mildly as a function of departure from their assumptions, with some exceptions; and (c) critically, absolute differences across scoring methods from properly converged solutions were largely marginal, except in comparisons involving regression scores computed from a congeneric CFA where one or more Heywood cases were addressed by fixing problematic values to zero. In such cases, these regression scores were substantially less correlated with true scores than the considered alternatives.

Factor Scores From the Essentially Tau-Equivalent CFA

Factor scores from the essentially tau-equivalent model were found to be advantageous, even in data structures where essential tau-equivalence was not a feature of the underlying data-generating mechanism or would not be otherwise supported in practice. Most critically, they are obtainable when congeneric factor models fail to converge appropriately or result in implausible parameter estimates, such as negative residual variances or correlations greater than 1. Importantly, scores from the essentially tau-equivalent model exhibited correlations with true scores similar to those of regression scores from a properly converged congeneric CFA, and often larger than those of sum scores. At low N, the variance of their correlations with true scores was comparable to that of sum scores and smaller than that of factor scores from the congeneric CFA. At moderate N, the variance of their correlations with true scores was comparable to that of factor scores from a congeneric CFA and smaller than that of sum scores. This implies it is possible to obtain factor score estimates even when a standard CFA does not properly converge, providing a key path forward for analysts in practice.

Increased stability across samples is an additional and important benefit of computing scores using parameter estimates from the essentially tau-equivalent CFA. Whereas factor loadings may vary substantially from sample to sample, essential tau-equivalence ensures that loadings, while not taking precisely the same value in every sample, are constrained to be equal across indicators in every sample. This layer of stability may be attractive for analysts who are wary of the substantial sample-to-sample variance inherent in typical applications of factor scoring but who want factor scores that relate to true scores in patterns similar to those of typical factor scores computed using parameter estimates from a congeneric CFA.

Importantly, findings suggest that imposing essential tau-equivalence can be advantageous both when it is supported by the true data-generating process and when it is not. When essential tau-equivalence is a feature of the true measurement model, the added stability is widely defensible if the goal of analysis is to compute factor scores that represent the true data-generating process; however, even when fit degrades substantially as a product of equality constraints on lambda, the added stability can still be justifiable. In the present simulation, when population generating values of lambda varied, fit indices expectedly failed to meet generally accepted cut-offs (e.g., comparative fit index (CFI) and Tucker-Lewis index (TLI) were below .90 on average, and root mean square error of approximation (RMSEA) was above .05 on average). In scenarios where essential tau-equivalence is not supported by the data, as evidenced by poor fit, analysts should be aware that imposing essential tau-equivalence can still be useful, but the argument for additional equality constraints is in conflict with the goal of identifying the true population generating measurement model. Instead, the equality constraints are a tool to improve stability of parameter estimates prior to score prediction and, in some cases, to produce a measurement model that does not result in one or more implausible parameter estimates. This may be similar to an argument for using sum scores when a population generating measurement model does not support equal factor loadings and residual variances, and is further similar to the intentional introduction of bias to reduce variance in the classic bias-variance trade-off.

Sum Scores

When population generating parameters adhered to the restrictions of the sum score model or departed mildly, sum scores performed comparably to regression scores from the essentially tau-equivalent CFA. With equivalent factor loadings or factor loadings ranging from .60 to .80, the observed benefits of regression scores were marginal. In such cases, sum scores are a justifiable alternative to more complex scoring methods. That said, when factor loadings ranged from .40 to .80 and residual variances were determined based on factor loadings, implying that larger unstandardized factor loadings equate to larger standardized factor loadings, regression scores from the essentially tau-equivalent model were notably improved in terms of correlations with true scores and stability of these correlations compared with sum scores. In contrast, when factor loadings ranged from .40 to .80 and residual variances were not determined by factor loadings, implying that larger unstandardized factor loadings do not necessarily equate to larger standardized factor loadings, sum scores exhibited larger correlations with true scores and less variability in those correlations than regression scores from the essentially tau-equivalent CFA. Stated differently, regression scores from the essentially tau-equivalent CFA generally perform comparably to or better than sum scores except in the case of substantial departure from their assumptions, in which case sum scores are a more stable alternative than regression scores from a congeneric CFA.
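The distinction between the two residual-variance conditions can be sketched numerically. The example below assumes that "determined based on factor loadings" means a residual variance of 1 minus the squared loading (with a factor variance of 1), and uses one hypothetical set of independently chosen residual variances within the .36 to .85 range. These values and the helper names are illustrative assumptions, not the generating values used in the simulation.

```python
import math

def standardized_loadings(loadings, residual_vars, factor_var=1.0):
    """Standardized loading: lambda * sqrt(phi) / sqrt(lambda^2 * phi + theta)."""
    return [
        l * math.sqrt(factor_var) / math.sqrt(l ** 2 * factor_var + t)
        for l, t in zip(loadings, residual_vars)
    ]

def rank_order(values):
    """Indices of the indicators sorted from smallest to largest value."""
    return sorted(range(len(values)), key=lambda i: values[i])

loadings = [0.4, 0.5, 0.6, 0.7, 0.8]

# Condition 1 (assumed form): residual variance = 1 - loading^2.
dependent_residuals = [1.0 - l ** 2 for l in loadings]
# Condition 2 (hypothetical values): residual variances unrelated to the loadings.
independent_residuals = [0.60, 0.40, 0.36, 0.85, 0.85]

std_dependent = standardized_loadings(loadings, dependent_residuals)
std_independent = standardized_loadings(loadings, independent_residuals)

print(rank_order(loadings))         # ordering of unstandardized loadings
print(rank_order(std_dependent))    # ordering of standardized loadings, condition 1
print(rank_order(std_independent))  # ordering of standardized loadings, condition 2
```

In the first condition the ordering of standardized loadings matches the ordering of unstandardized loadings. In the second, an indicator with a moderate unstandardized loading but little residual variance can have the largest standardized loading.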

Absolute Differences

Within a design cell, the largest difference in correlations with true scores among scoring methods did not exceed .03 units, a practically small difference. In spite of much debate about the relative merits of sum scores and factor scores, when samples are small and two-step processes are likely a necessary alternative to simultaneous estimation of structural and measurement models, obtaining an admissible solution to a measurement model is the most critical limiting factor. With that, any score from a properly converged measurement model may be reasonably justifiable, including sum scores that are not subject to convergence challenges. This corroborates past findings on unit weighting (Wainer, 1976). Therefore, the analyst’s decision to impose essential tau-equivalence before computing factor scores in practice is likely to be largely driven by convergence and by the desire for greater within- and across-sample consistency via equal weighting, noting that gains on other performance metrics are marginal.
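A population-level analogue of this comparison can be sketched for a single-factor model, where the correlation of a weighted composite with the true score has a closed form. The sketch assumes known (not estimated) parameters, hypothetical loadings of .40 to .80, residual variances equal to 1 minus the squared loading, and a factor variance of 1. Because the weights are treated as known, it does not capture the sampling variability that the simulation examines, and the function names are illustrative.

```python
import math

def corr_regression_scores(loadings, residual_vars, factor_var=1.0):
    """Population correlation of one-factor regression scores with the factor.

    Uses rho^2 = phi * info / (1 + phi * info), info = sum(l_i^2 / t_i).
    """
    info = sum(l ** 2 / t for l, t in zip(loadings, residual_vars))
    return math.sqrt(factor_var * info / (1.0 + factor_var * info))

def corr_sum_scores(loadings, residual_vars, factor_var=1.0):
    """Population correlation of the unit-weighted sum score with the factor."""
    total_loading = sum(loadings)
    total_var = total_loading ** 2 * factor_var + sum(residual_vars)
    return total_loading * math.sqrt(factor_var) / math.sqrt(total_var)

# Hypothetical unequal-loading model (assumed values).
loadings = [0.4, 0.5, 0.6, 0.7, 0.8]
residuals = [1.0 - l ** 2 for l in loadings]
r_reg = corr_regression_scores(loadings, residuals)
r_sum = corr_sum_scores(loadings, residuals)

# Hypothetical model satisfying the sum score assumptions
# (equal loadings and equal residual variances).
equal_loadings = [0.6] * 5
equal_residuals = [0.64] * 5
r_reg_equal = corr_regression_scores(equal_loadings, equal_residuals)
r_sum_equal = corr_sum_scores(equal_loadings, equal_residuals)

print(round(r_reg, 3), round(r_sum, 3))
print(round(r_reg_equal, 3), round(r_sum_equal, 3))
```

When loadings and residual variances are equal, the two correlations coincide. When they differ, optimally weighted scores can only match or exceed unit-weighted scores in the population, and the gap under these assumed values is modest.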

To reiterate, nonconvergence was defined to include models that properly converged but resulted in one or more implausible parameter estimates, such as a negative residual variance, or Heywood case. In practice, an analyst may be tempted to fix a residual variance to zero or some small constant, fit a congeneric CFA, and then compute regression scores from that CFA. Absolute differences in average correlations with true scores and standard deviations about these averages were most pronounced when comparing regression scores computed from a congeneric CFA with one or more negative variances fixed to zero with the other scores considered. Specifically, differences did not exceed .03 units in the initial investigation, but comparisons involving scores computed from models where Heywood cases were addressed by fixing problematic residual variances to zero exhibited differences as large as .10 units. Therefore, if congeneric factor models result in inadmissible solutions, it is advantageous to impose essential tau-equivalence before computing factor scores, or to compute sum scores, rather than to address inadmissible solutions by fixing problematic values to some plausible constant.

Limitations and Future Directions

As is typical in simulation research, population generating models were designed to adhere to the assumption of multivariate normality; however, due to increased sampling variance in small samples, many samples exhibited mild-to-moderate observed variable skew (absolute values as large as 2.15) and kurtosis (absolute values as large as 6.15). It is important to note that this represents skew and kurtosis resulting from sampling variability and not population-level nonnormality. Results are therefore generalizable to data exhibiting mild-to-moderate skew and kurtosis, assuming that skew and kurtosis are products of sampling variability. Results, however, cannot be generalized to more extreme skew and kurtosis, or to nonnormality that is a product of sampling from a nonnormal population. This simulation has offered guidance regarding scoring methods when data are not ideal as a result of sample size, including when observed variable skew and kurtosis are present due to sampling variability. Future research should extend this work to other sources of nonideal data, such as substantial skew and kurtosis stemming from nonnormal population distributions. Furthermore, future research should consider other forms of data complexity common in both small and moderate samples, such as ordinal indicators. The utility of factor scores from ordinal data modeled using weighted least squares mean- and variance-adjusted (WLSMV) estimation may differ from the results presented in this manuscript.

A key limitation of this investigation is that scores were evaluated solely by their relation to true scores, and not by their ability to accurately capture structural relations among latent variables. Similar to past scoring research introducing alternative factor models from which to compute and obtain factor scores, it is a necessary and important step to begin by evaluating score performance in isolation (Curran et al., 2016; Strauss & Curran, 2026) before bringing these scores to subsequent models (Curran et al., 2018). Importantly, marginal differences noted in this study may lead to substantial differences in subsequent structural models. Future research should consider the benefits and limitations of these scores when used in subsequent models. In particular, it would be practically useful to investigate the utility of scores from an essentially tau-equivalent CFA when used in structural models, such as regression or path analysis, as these have not been critically evaluated in this context. This could be considered by incorporating scores into the bias-correcting approach to factor score regression and path analysis (Croon, 2002) or the bias-avoiding approach to factor score regression (Skrondal & Laake, 2001). Alternative methods for small-sample SEM, such as the structural after measurement approach (Rosseel & Loh, 2024), have also recently been proposed. These do not require obtaining scores directly, but future work that evaluates structural models based on scores in small samples would benefit from comparisons to these methods. Particularly given clear recommendations for including sum scores in simulation research (Georgeson, 2025), research aimed at offering practical small-sample modeling guidance across a wide range of models and methods is well positioned for further inquiry.

Supplemental Material

Supplemental material, sj-pdf-1-epm-10.1177_00131644261441852 for Factor Scores in Small Samples: Recommendations and Solutions by Christian L. L. Strauss in Educational and Psychological Measurement

Footnotes

ORCID iD

Christian L. L. Strauss

Ethical Considerations

Ethical approval was not required.

Consent to Participate

Not applicable.

Consent for Publication

Not applicable.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statements

Not applicable.

Supplemental Material

Supplemental material for this article is available online.

Notes

References

Anderson, T. W., & Rubin (1957). Statistical inference in factor analysis. Proceedings of the Third Berkeley Symposium of Mathematical Statistics and Probability, 5, 111–150. https://doi.org/10.2307/2332891

Bartlett, M. S. (1937). The statistical conception of mental factors. British Journal of Psychology, 28(1), 97–104.

Bogaert, Loh, W. W., & Rosseel (2023). A small sample correction for factor score regression. Educational and Psychological Measurement, 83(3), 495–519. https://doi.org/10.1177/00131644221105505

Bollen, K. A. (2002). Latent variables in psychology and the social sciences. Annual Review of Psychology, 53, 605–634.

Chen, Moustaki, & Zhang (2023). On the estimation of structural equation models. In R. H. Hoyle (Ed.), Handbook of structural equation modeling (2nd ed., pp. 164–180). Guilford Press.

Cooperman, A. W., & Waller, N. G. (2022). Heywood you go away! Examining causes, effects, and treatments for Heywood cases in exploratory factor analysis. Psychological Methods, 27(2), 156–176. https://doi.org/10.1037/met0000384

Cox & Kelcey (2021). Croon’s bias-corrected estimation of latent interactions. Structural Equation Modeling: A Multidisciplinary Journal, 28(6), 863–874. https://doi.org/10.1080/10705511.2021.1922283

Croon (2002). Using predicted latent scores in general latent structure models. In G. A. Marcoulides & Moustaki (Eds.), Latent variable and latent structure models (pp. 195–224). Lawrence Erlbaum.

Curran, P. J., Cole, V. T., Bauer, D. J., Hussong, A. M., & Gottfredson (2016). Improving factor score estimation through the use of observed background characteristics. Structural Equation Modeling, 23(6), 827–844. https://doi.org/10.1080/10705511.2016.1220839

Curran, P. J., Cole, V. T., Bauer, D. J., Rothenberg, W. A., & Hussong, A. M. (2018). Recovering predictor–criterion relations using covariate-informed factor score estimates. Structural Equation Modeling: A Multidisciplinary Journal, 25(6), 860–875. https://doi.org/10.1080/10705511.2018.1473773

Devlieger & Rosseel (2017). Factor score path analysis: An alternative for SEM? Methodology, 13, 31–38. https://doi.org/10.1027/1614-2241/a000130

Devlieger & Rosseel (2020). Multilevel factor score regression. Multivariate Behavioral Research, 55(4), 600–624. https://doi.org/10.1080/00273171.2019.1661817

Enders, C. K. (2022). Applied missing data analysis. Guilford Publications.

Estabrook & Neale (2013). A comparison of factor score estimation methods in the presence of missing data: Reliability and an application to nicotine dependence. Multivariate Behavioral Research, 48(1), 1–27. https://doi.org/10.1080/00273171.2012.730072

Fava, J. L., & Velicer, W. F. (1992). An empirical comparison of factor, image, component, and scale scores. Multivariate Behavioral Research, 27(3), 301–322. https://doi.org/10.1207/s15327906mbr2703_1

Gagne & Hancock, G. R. (2006). Measurement model quality, sample size, and solution propriety in confirmatory factor models. Multivariate Behavioral Research, 41(1), 65–83. https://doi.org/10.1207/s15327906mbr4101_5

Georgeson, A. R. (2025). Deriving expected values of model parameters when using sum scores in simulation research. Structural Equation Modeling: A Multidisciplinary Journal, 32(1), 83–92. https://doi.org/10.1080/10705511.2024.2376330

Graham, J. M. (2006). Congeneric and (essentially) tau-equivalent estimates of score reliability: What they are and how to use them. Educational and Psychological Measurement, 66(6), 930–944. https://doi.org/10.1177/0013164406288165

Heywood, H. B. (1931). On finite sequences of real numbers. Proceedings of the Royal Society A: Mathematical, Physical, and Engineering Sciences, 134(824), 486–501. https://doi.org/10.1098/rspa.1931.0209

Kelcey (2019). A robust alternative estimator for small to moderate sample SEM: Bias-corrected factor score path analysis. Addictive Behaviors, 94, 83–98. https://doi.org/10.1016/j.addbeh.2018.10.032

Kelcey, Cox, & Dong (2021). Croon’s bias-corrected factor score path analysis for small- to moderate-sample multilevel structural equation models. Organizational Research Methods, 24(1), 55–77. https://doi.org/10.1177/1094428119879758

Krijnen, W. P., Wansbeek, & Ten Berge, J. M. F. (1996). Best linear predictors for factor scores. Communications in Statistics–Theory and Methods, 25(12), 3013–3025. https://doi.org/10.1080/03610929608831883

Lu, I. R. R., Kwan, Thomas, D. R., & Cedzynski (2011). Two new methods for estimating structural equation models: An illustration and a comparison with two established methods. International Journal of Research in Marketing, 28(3), 258–268. https://doi.org/10.1016/j.ijresmar.2011.03.006

McNeish (2023). Psychometric properties of sum scores and factor scores differ even when their correlation is 0.98: A response to Widaman and Revelle. Behavior Research Methods, 55(8), 4269–4290. https://doi.org/10.3758/s13428-022-02016-x

McNeish & Wolf, M. G. (2020). Thinking twice about sum scores. Behavior Research Methods, 52(6), 2287–2305. https://doi.org/10.3758/s13428-020-01398-0

R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Rhemtulla & Savalei (2025). Estimated factor scores are not true factor scores. Multivariate Behavioral Research, 60(3), 598–619. https://doi.org/10.1080/00273171.2024.2444943

Rosseel (2012). Lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36. https://doi.org/10.18637/jss.v048.i02

Rosseel (2020). Small sample size solutions for structural equation modeling. In Miočević & R. van de Schoot (Eds.), Small sample size solutions: A guide for applied researchers and practitioners (pp. 226–238). Routledge.

Rosseel & Loh, W. W. (2024). A structural after measurement approach to structural equation modeling. Psychological Methods, 29(3), 561–588. https://doi.org/10.1037/met0000503

Sijtsma, Ellis, J. L., & Borsboom (2024). Recognize the value of the sum score, psychometrics’ greatest accomplishment. Psychometrika, 89(1), 84–117. https://doi.org/10.1007/s11336-024-09964-7

Skrondal & Laake (2001). Regression among factor scores. Psychometrika, 66(4), 563–575. https://doi.org/10.1007/bf02296196

Steiger, J. H., & Schönemann, P. H. (1978). A history of factor indeterminacy. In Shye (Ed.), Theory construction and data analysis (pp. 136–178). Jossey-Bass.

Strauss, C. L. L., & Curran, P. J. (2026). An evaluation of aggregated and disaggregated approaches to multilevel factor scoring. Journal of Educational and Behavioral Statistics, 51, 281–309. https://doi.org/10.3102/10769986251321380

Ten Berge, J. M. F., Krijnen, W. P., Wansbeek, & Shapiro (1999). Some new results on correlation-preserving factor scores prediction methods. Linear Algebra and Its Applications, 289(1–3), 311–318. https://doi.org/10.1016/S0024-3795(97)10007-6

Thomson, G. H. (1935). The definition and measurement of “g” (general intelligence). Journal of Educational Psychology, 26(4), 241–262. https://doi.org/10.1037/h0059873

Thomson, G. H. (1938). Methods of estimating mental factors. Nature, 141, 246.

Thurstone, L. L. (1935). Vectors of mind. University of Chicago Press.

Uanhoro (2019). When data are not so informative, it pays to choose the sum score over the factor score. https://www.jamesuanhoro.com/post/2019/08/02/when-data-are-not-so-informative-it-pays-to-choose-the-sum-score-over-the-factor-score/

Wainer (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83(2), 213–217. https://doi.org/10.1037/0033-2909.83.2.213

Widaman, K. F., & Revelle (2023). Thinking thrice about sum scores, and then some more about measurement and analysis. Behavior Research Methods, 55(2), 788–806. https://doi.org/10.3758/s13428-022-01849-w

Widaman, K. F., & Revelle (2024). Thinking about sum scores yet again, maybe the last time, we don’t know, oh no . . .: A comment on McNeish (2023). Educational and Psychological Measurement, 84(4), 637–659. https://doi.org/10.1177/00131644231205310

Wolf, E. J., Harrington, K. M., Clark, S. L., & Miller, M. W. (2013). Sample size requirements for structural equation models: An evaluation of power, bias, and solution propriety. Educational and Psychological Measurement, 73(6), 913–934. https://doi.org/10.1177/0013164413495237
