Abstract
The problem of factor score indeterminacy implies that the factor and the error scores cannot be completely disentangled in the factor model. It is therefore proposed to compute Harman’s factor score predictor that contains an additive combination of factor and error variance. This additive combination is discussed in the framework of classical test theory. On this basis, a definition of reliability, standard error of measurement, and confidence intervals for the factor score predictor are proposed. It is argued that factor score predictor intervals should be used instead of single score predictors to account for the error term in the factor model. The calculation of reliabilities and factor score predictor intervals is illustrated by means of a small simulation study and an empirical example.
In several applied settings and especially in psychological assessment, individual factor scores are of interest. It has been pointed out that factor scores cannot be determined precisely, as the number of common and unique factor scores exceeds the number of observed variables (McDonald & Burr, 1967). This problem has been termed factor score indeterminacy, a problem with a long history of debate (e.g., Grice, 2001; Guttman, 1955; Lovie & Lovie, 1995). It has also been noted that the definition of the factors in the factor model precludes the possibility of expressing them as linear combinations of the observed variables (Guttman, 1955). Moreover, Schönemann and Steiger (1976) proved that factor score predictors that are constructed as weighted composites of the observed variables cannot be factors of the common factor model. Nevertheless, linear combinations have been proposed as substitutes or predictors for the factors, because there is a need for some individual scores in applied settings (e.g., Anderson & Rubin, 1956; Bartlett, 1937; Harman, 1976; Krijnen, Wansbeek, & Ten Berge, 1996; McDonald, 1981; Thurstone, 1935). The properties of some of these factor score predictors have been compared and discussed (Grice, 2001; Krijnen, 2006; McDonald & Burr, 1967). Another aspect that introduces indeterminacy into the factor model is factor rotation. Several methods of factor rotation are available and there is a paucity of literature offering recommendations for different purposes (e.g., Browne, 2001; Sass & Schmitt, 2010). Although the focus of the present article is on factor score indeterminacy, effects of factor rotation on factor score predictors are briefly considered in a small simulation study illustrating the main arguments presented here.
The present article starts from the well-known observation that linear combinations of observed variables as they have been proposed cannot be compatible with the basic equations of the common factor model. It is concluded that the only type of factor score predictor that would be in line with the basic equations of the common factor model is a factor score predictor that does not result in a single score for each participant, but in a score probability interval.
First, the definitions of the common factor model are presented. Second, the incompatibility of linear combinations of observed variables with the basic equations of the common factor model is shown. Third, Harman’s ideal variable factor score predictor is shown to combine factor and error variance. Fourth, this combined factor and error score predictor is considered under the assumptions of classical test theory. On this basis, a definition of reliability, standard error of measurement, and factor score predictor intervals are proposed. Finally, a simulation study and an applied example are given to illustrate the effect of considering the reliability and the factor score predictor intervals when factor score predictors are computed.
Definitions
The defining equation of the common factor model is
where
where
Results
Factor Score Predictors and the Equations of the Factor Model
The available factor score predictors based on linear combinations of observed variables (e.g., Anderson & Rubin, 1956; Bartlett, 1937; Harman, 1976; McDonald, 1981; Thurstone, 1935) can be described as
Proof. Replacing
Postmultiplying Equation 3 with
Solving Equation 4 for
This completes the proof.
Moreover, entering the weights given on the right-hand side of Equation 5 for
Proof. Replacing
This completes the proof.
Thus, Thurstone’s regression score predictor is compatible with the defining equation of the factor model, but it is incompatible with the fundamental factor theorem. It follows from Lemmas 1 and 2 that factor score predictors of the form
The Combined Factor–Error Score Predictor
According to Equation 1, the factor scores can be computed as
Inserting this factor score into Equation 2 is, of course, compatible with the fundamental factor theorem, as can be seen from
However, both
Equation 9 can be transformed into
which gives the combined factor–error score predictor
As Var
Although ε[
Factor Score Predictor Intervals
It is proposed to take this reliability explicitly into account and to compute the standard error of measurement and intervals for the combined factor–error score predictor
Although the following interval is conceived in analogy to classical test theory, it is termed factor score predictor interval in the present case to make clear that it refers to the factor and error score as defined in the factor model. The corresponding 90% “factor score predictor interval” containing
where 1.65 is taken from the standard normal distribution. The corresponding value for a 95% “factor score predictor interval” would be 1.96. To be compatible with the assumptions of the factor model, it is necessary to report the factor score predictor interval instead of reporting the single value for
Because the factor loadings are essential for the calculation of the reliability of
Simulation Study
A simulation study was conducted with SPSS 20 to investigate the consequences of using factor score predictor intervals in applied settings. The simulation study allows for an evaluation of the effect of sampling error on the reliability estimates for the factor–error score predictor and on the factor score predictor interval. The study comprised three sample sizes (150, 300, 900 cases), three levels of salient loadings (.40, .60, .80; there were five salient loadings per factor and the nonsalient loadings were fixed to 0), two numbers of factors (3, 6), and two levels of interfactor correlations (orthogonal vs. oblique). In the oblique population models, the correlation between the factors was .30. Table 1 contains the orthogonal and oblique population models based on three factors with salient loadings of .40. As the other population models differ from those in Table 1 only with respect to the salient loading size and the number of factors, further tables describing the population models would have been redundant. Overall, there were 36 conditions (3 sample sizes × 3 salient loading sizes × 2 numbers of factors × 2 levels of interfactor correlations), and for each condition, 1,000 samples were generated. Two common methods of orthogonal factor rotation were investigated for the data sets based on orthogonal population factors: Varimax rotation (Kaiser, 1958) and Equamax rotation (Saunders, 1962). Likewise, two common methods of oblique factor rotation were investigated for the data sets based on oblique population factors: Promax rotation with kappa = 4 (Hendrickson & White, 1964) and Oblimin rotation with delta = 0 (Jennrich & Sampson, 1966). Because each of the 36,000 samples was analyzed with two rotation methods, the simulation study was based on 72,000 factor analyses. Maximum likelihood estimation was used as the extraction method, and the number of factors for the sample analyses was specified according to the population models (3 or 6). 
As the purpose of the simulation study was the illustration of the use of factor score predictor intervals in applied settings, it was not regarded as necessary to cover different methods of factor extraction or estimation.
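The design can be enumerated in a few lines. The following Python sketch is only an illustration (the study itself was run in SPSS 20), and all variable names are hypothetical. It crosses the four design factors, attaches the two rotation methods that apply to each type of population model, and recovers the counts of 36 conditions and 72,000 factor analyses.

```python
from itertools import product

# Design factors as described in the text.
sample_sizes = [150, 300, 900]
salient_loadings = [0.40, 0.60, 0.80]
numbers_of_factors = [3, 6]
factor_structures = ["orthogonal", "oblique"]

# Two rotation methods were applied per type of population model.
rotations = {
    "orthogonal": ["Varimax", "Equamax"],
    "oblique": ["Promax (kappa = 4)", "Oblimin (delta = 0)"],
}
samples_per_condition = 1000

# Fully crossed design: one tuple per condition.
conditions = list(
    product(sample_sizes, salient_loadings, numbers_of_factors, factor_structures)
)
n_conditions = len(conditions)

# Every generated sample is analyzed once per applicable rotation method.
n_analyses = sum(
    samples_per_condition * len(rotations[structure])
    for (_, _, _, structure) in conditions
)

print(n_conditions, n_analyses)  # 36 72000
```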
Table 1. Orthogonal and Oblique Three-Factor Population Models Based on Salient Loadings of .40.
For each factor analysis, factor–error score predictors were computed according to Equation 10 and the corresponding 90% factor score predictor intervals were computed according to Equation 14. These intervals were used to calculate the upper and lower interval limits for each individual score on each factor. In the next step, the percentage of score discrimination per factor–error score predictor was calculated. To this end, the number of times that the upper limit of one individual score interval was smaller than the lower limit of another individual score interval on the same factor–error score predictor was counted. This indicates whether two scores can be discriminated according to the factor score predictor interval. This number was divided by the number of different individual scores for the same factor–error score predictor in the sample when no interval is applied, and the result was multiplied by 100. Thus, the percentage of score discrimination per factor–error score predictor indicates the amount of overdiscrimination that occurs when factor score predictor intervals are not taken into account. The index reaches 100% only when the error terms in the factor model are 0. An index of 50% would mean that only 50% of all the individual differences between scores on the factor–error score predictor can be interpreted when the 90% factor score predictor interval is applied. The index was averaged across the factors for the solution of each sample to give an overall index. Of course, wider factor score predictor intervals can be computed (e.g., 95% or 99%). However, for the purpose of illustrating the effects of number of cases, salient loading size, number of factors, and obliqueness, the 90% interval was regarded as sufficient. Reliabilities were calculated according to Equation 12.
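The score discrimination index can be sketched in Python as follows. This is one possible reading of the verbal description, not the original SPSS procedure: it assumes that the denominator is the number of pairs of individuals whose scores differ when no interval is applied, and that all individuals share the same standard error of measurement. The function name and arguments are hypothetical.

```python
import numpy as np

def score_discrimination(scores, sem, z=1.65):
    """Percentage of differing score pairs whose intervals do not overlap.

    Assumption: every score has the same interval half-width z * sem, so the
    upper limit of one interval lies below the lower limit of another exactly
    when the two scores differ by more than twice the half-width.
    """
    scores = np.asarray(scores, dtype=float)
    half_width = z * sem

    # Absolute differences for each unordered pair of individuals.
    i, j = np.triu_indices(len(scores), k=1)
    pair_diffs = np.abs(scores[i] - scores[j])

    # Pairs that differ when no interval is applied (denominator).
    n_different = np.count_nonzero(pair_diffs > 0)
    if n_different == 0:
        return float("nan")

    # Pairs that remain discriminable with the interval (numerator).
    n_separated = np.count_nonzero(pair_diffs > 2 * half_width)
    return 100.0 * n_separated / n_different

# Three illustrative scores: only the pairs (0, 3) and (1, 3) are separated.
print(score_discrimination([0.0, 1.0, 3.0], sem=0.5))
# Without measurement error, every differing pair is discriminated.
print(score_discrimination([0.0, 1.0, 3.0], sem=0.0))  # 100.0
```

With an error-free predictor the index is 100%, in line with the description above; any positive standard error can only lower it.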
As a first result of the simulation study, the reliabilities of the factor–error score predictors are presented (see Table 2). The effect of loading sizes on reliability is substantial. A comparison of the reliabilities from the samples of the simulation study with the reliabilities from the corresponding population model reveals that there is a considerable overestimation of reliabilities when sample sizes are small. However, even with a sample size of 300 cases, which might be regarded as sufficient for several applications, there was a reliability larger than .60 in the six-factor solutions, whereas the corresponding population reliability was only .49. Nevertheless, with a sample size of 900 cases, the reliabilities from the sample came reasonably close to the population reliabilities. The effect of overestimation of reliability was slightly more pronounced for oblique than for orthogonal models. There were no relevant differences between Varimax- and Equamax-rotation for the orthogonal solutions and between Promax- and Oblimin-rotation for the oblique solutions.
Table 2. Reliabilities of Factor–Error Score Predictor.
Note: Pop. = population reliability.
The percentage of population factor scores falling within the 90% factor score predictor interval (coverage rate) was also calculated (see Table 3). Under ideal conditions, 90% of the population factor scores should be found within the 90% factor score predictor interval. However, as becomes apparent from Table 2, the reliabilities were considerably overestimated when sample sizes were small and medium. In consequence, the size of the factor score predictor intervals was underestimated in these samples. The underestimation of the size of the factor score predictor intervals implies that fewer than 90% of the population factor scores are expected to be found within the 90% interval with small and medium samples. It was found that the percentage of population factor scores comes close to 90% for the sample size of 900 cases. The overall positive effect of loading size on coverage was present, but rather small. Moreover, coverage rates were more pronounced for orthogonal than for oblique solutions, whereas the number of factors had only minimal effects on coverage rates.
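The coverage rate and its dependence on the estimated standard error can be illustrated with a short Python sketch. The data below are a deliberately simplified stand-in for the simulation (normally distributed factor scores plus normally distributed error with a known standard deviation); they are not the population models of the study, and the function and variable names are hypothetical.

```python
import numpy as np

def coverage_rate(true_scores, predictor_scores, sem, z=1.65):
    """Percentage of true scores lying inside predictor +/- z * sem."""
    true_scores = np.asarray(true_scores, dtype=float)
    predictor_scores = np.asarray(predictor_scores, dtype=float)
    inside = np.abs(predictor_scores - true_scores) <= z * sem
    return 100.0 * inside.mean()

# Simplified illustration: predictor = factor score + error (assumed SD 0.5).
rng = np.random.default_rng(0)
true = rng.standard_normal(10_000)
error_sd = 0.5
predictor = true + error_sd * rng.standard_normal(10_000)

# With the correct standard error, coverage should be close to 90%.
coverage_correct = coverage_rate(true, predictor, sem=error_sd)
# An underestimated standard error (as follows from an overestimated
# reliability) narrows the interval and lowers the coverage.
coverage_underestimated = coverage_rate(true, predictor, sem=0.4)

print(coverage_correct, coverage_underestimated)
```

The second call mimics the mechanism described above: an overestimated reliability yields a standard error that is too small, and the resulting interval contains fewer than 90% of the factor scores.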
Table 3. Coverage Rates: Percentage of Population Factor Scores Within Factor Score Predictor Intervals.
Score discrimination was calculated for the factor–error score predictor in each sample. It was defined as the number of times that the upper limit of one individual score interval was smaller than the lower limit of another individual score interval, divided by the total number of different scores of the respective factor–error score predictor. This discrimination ratio is presented as a percentage in Table 4. The overestimation of reliabilities in small and medium samples also affected score discrimination: The percentage of score discrimination was larger for smaller samples. Moreover, the effect of loading size on discrimination was substantial. It should, however, be noted that even with factor loadings of about .80, only about 50% of the discriminations between individual scores were relevant when a 90% interval was used. This means that about 50% of the discriminations that might be interpreted when the scores are used without factor score predictor intervals were irrelevant.
Table 4. Percentage of Differences for Factor Score Predictor Intervals.
Example
A further illustration of the use of factor score predictor intervals is based on a study with the German version of the Eysenck Personality Questionnaire Revised (EPQ-R; Eysenck & Eysenck, 1991; Ruch, 1999). Overall, 1,223 German participants (828 females, 395 males; age in years: M = 34.12; SD = 12.71) filled in the EPQ-R together with other personality questionnaires. All participants gave written informed consent. Most of the participants (76.40%) were recruited by means of newspaper advertising, whereas university students (23.60%) were recruited through advertising in university courses. Participants received feedback on their individual personality trait scores.
According to Cattell’s (1966) scree test, the course of the eigenvalues of the unrotated maximum-likelihood factors indicated a three-factor solution (see Table 5). Three factors were retained for Promax (kappa = 4) rotation, also because the EPQ-R has been developed to measure Eysenck’s three-factor model of personality (Eysenck & Eysenck, 1991). The interfactor correlations were small and the overall factor pattern was not perfect but compatible with the intended three-factor model comprising factors for Neuroticism (N), Extraversion (E), and Psychoticism (P). However, many items did not load on a single factor, but had relevant secondary loadings (see Table 5). This is a situation where unit-weighted sum scales will not yield an optimal representation of the data. Therefore, researchers might compute the factor–error score predictor and might be interested in the corresponding reliabilities and in the factor score predictor interval.
Table 5. Maximum-Likelihood Three-Factor Solution of EPQ-R Items: Promax Factor Pattern.
Note: EPQ-R = Eysenck Personality Questionnaire Revised; N = Neuroticism; E = Extraversion; P = Psychoticism. Intended loadings were given in boldface. Items were marked with a prefix referring to the intended factor (n, e, p) and the number within the EPQ-R.
The reliabilities of the respective factor–error score predictors (see Equation 12) were .90 for N, .90 for E, and .80 for P. Although the factor loadings were rather small, the large number of variables per factor led to acceptable reliabilities. The corresponding standard errors of measurement (see Equation 13) were 0.32 for N and E and 0.45 for P. The upper and lower limits of the 90% factor score predictor intervals were calculated by multiplying the standard errors of measurement by 1.65 and then adding the result to, and subtracting it from, the corresponding factor–error score predictor. As an example, the factor score predictor interval for E and N is presented for two participants of the present study (see Figure 1). The standard deviations of the factor–error score predictors for E and N were 1.06. They were expected to be larger than 1, because according to Equation 11, the factor–error score predictor contains a sum of the true-score variance (which is 1) and the error score variance (which is smaller than 1). As the simulation study focused on the comparison of scores between individuals, the comparison of the E and N scores within individuals was considered in the present example. Although the difference between the E and N scores was nearly half a standard deviation for Participant 15, the factor score predictor intervals overlap (see Figure 1). This indicates that the difference between E and N should not be regarded as relevant for this participant. Only very large differences (considerably greater than 1, as for Participant 16) were interpretable when the factor score predictor interval was considered. It should be noted that the intraindividual differences between E and N depend also on the scores of the other participants of the study. 
For example, if the other participants had larger scores on N, the relative score of Participant 16 on N would have been lower, so that the difference between N and E might have been smaller for this participant. Moreover, intraindividual differences on personality dimensions might have specific theoretical relevance. For example, the relative magnitude of E and N on the individual level has been related to the concepts of impulsivity and anxiety (e.g., Gray, 1987).
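The interval computation in this example can be retraced with a small Python sketch. Two assumptions are made loudly here. First, the standard error of measurement is computed with the classical test theory formula SD × sqrt(1 − reliability) with SD set to 1; this reproduces the reported values of 0.32 and 0.45, but whether Equation 13 has exactly this form is an assumption. Second, the individual scores are hypothetical, because the scores of Participants 15 and 16 are only shown in Figure 1. Function names are hypothetical as well.

```python
import math

def sem_from_reliability(reliability, sd=1.0):
    """Classical test theory standard error of measurement (assumed form)."""
    return sd * math.sqrt(1.0 - reliability)

def predictor_interval(score, sem, z=1.65):
    """Lower and upper limit of the factor score predictor interval."""
    half_width = z * sem
    return (score - half_width, score + half_width)

def intervals_overlap(a, b):
    """True if two (lower, upper) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

print(round(sem_from_reliability(0.90), 2))  # 0.32 (N and E)
print(round(sem_from_reliability(0.80), 2))  # 0.45 (P)

# Hypothetical E and N scores that differ by 0.5: the 90% intervals overlap.
e_interval = predictor_interval(0.60, sem=0.32)
n_interval = predictor_interval(0.10, sem=0.32)
print(intervals_overlap(e_interval, n_interval))  # True

# Hypothetical scores that differ by 1.5: the intervals no longer overlap.
e_interval_far = predictor_interval(1.00, sem=0.32)
n_interval_far = predictor_interval(-0.50, sem=0.32)
print(intervals_overlap(e_interval_far, n_interval_far))  # False
```

With a standard error of 0.32, each 90% interval extends 1.65 × 0.32 ≈ 0.53 on either side of the score, so two scores on E and N must differ by more than about 1.06 before their intervals separate. This is in line with the observation that only differences considerably greater than 1 were interpretable.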

Figure 1. Example: The 90% factor score predictor interval for N and E for two participants.
Discussion
It has been shown that factor score predictors based on linear combinations of observed variables cannot be simultaneously compatible with the defining equation of the factor model and the fundamental factor theorem. It was therefore proposed to take the variability that is due to the measurement error explicitly into account when individual factor score predictors are to be computed. It was proposed to compute Harman’s (1976) factor score predictor, which is a sum of factor scores and error scores that represents a combined factor–error score predictor. Due to the similarity of the assumptions of the factor model and the assumptions of classical test theory, the additive combination of factor and error in the factor–error score predictor can be regarded as a combination of true score and error score. Accordingly, the formulas of classical test theory can be applied to compute reliability, standard error of measurement, and “factor score predictor intervals” for the combined factor–error score predictor. Instead of reporting a single factor–error score predictor for a person, it is proposed to report an individual factor score predictor interval that is based on the individual factor–error score predictor and the standard error of measurement. Thereby, the indeterminacy of factor score predictors is explicitly acknowledged and can be taken into account when individual scores based on factor analysis are used in applied settings. Thus, computing factor score predictor intervals is a consequence of, but not a solution to, the indeterminacy of factor score predictors.
The consequences of computing reliabilities for the factor–error score predictor and factor score predictor intervals were illustrated by means of a simulation study. The main result of the simulation study was that there was a considerable overestimation of reliabilities when sample sizes were small. Sample sizes of about 900 cases were necessary to keep the reliability estimates close to the population values. Overestimation of reliabilities in small samples was slightly more pronounced for oblique solutions than for orthogonal solutions. The reason for the overestimation of reliability that occurred in small and medium samples might be that factor models estimated from sample data are in part based on correlations that are due to sampling error. Thus, this effect might be regarded as an effect of capitalization on chance. The magnitude of this effect was considerable and should be explored in further studies. It should also be acknowledged, however, that the simulation study was conducted with just one set of models in which each manifest variable loads on only one factor. It can therefore not be excluded that the effect of capitalization on chance is smaller or does not occur in models with other loading patterns.
Considerably fewer than 90% of the population factor scores were found within the 90% factor score predictor interval when sample sizes were small and medium. This is probably a direct consequence of the overestimation of reliabilities in small samples, which leads to an underestimation of the sizes of the factor score predictor intervals. On the one hand, these results of the simulation study demonstrate that, in the samples, the population factor scores might be within the factor score predictor interval more rarely than one would expect from the population parameter values. Accordingly, the method presented here should be used with caution when sample sizes and factor loadings are small. On the other hand, this does not mean that factor score predictors should be used without factor score predictor intervals when sample sizes and loadings are small. On the contrary, the results imply that it might be reasonable to perform a simulation study with exactly the number of variables, the loadings, and the number of cases corresponding to the empirical data at hand, to estimate the coverage rates, and to adjust (enlarge) the size of the factor score predictor interval so that the intended percentage of population factor scores within the factor score predictor interval is obtained. Thus, the overestimation of reliabilities should be taken into account when sample sizes are small or medium. It can also be concluded that samples of at least 900 cases are necessary to obtain reasonable coverage rates.
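One way such an adjustment could be carried out is sketched below in Python. This is a hypothetical procedure, not one specified in the article: given factor scores and factor–error score predictors from a simulation that mimics the empirical data, it searches for the smallest multiplier of the estimated standard error that reaches the intended coverage. The simulated data are a simplified stand-in (normal factor scores, error SD of 0.5, an underestimated standard error of 0.4), and all names are assumptions.

```python
import math
import numpy as np

def adjusted_multiplier(true_scores, predictor_scores, sem_estimate, target=0.90):
    """Smallest multiplier m such that at least `target` of the simulated
    true scores lie within predictor +/- m * sem_estimate."""
    true_scores = np.asarray(true_scores, dtype=float)
    predictor_scores = np.asarray(predictor_scores, dtype=float)
    # Distance between predictor and true score in units of the estimated SEM.
    ratios = np.sort(np.abs(predictor_scores - true_scores) / sem_estimate)
    # The k-th smallest ratio covers at least a proportion `target` of cases.
    k = math.ceil(target * len(ratios))
    return float(ratios[k - 1])

# Simplified stand-in for a simulation matched to the empirical data.
rng = np.random.default_rng(0)
true = rng.standard_normal(2000)
predictor = true + 0.5 * rng.standard_normal(2000)  # assumed error SD of 0.5
sem_estimate = 0.4  # assumed underestimated standard error from a sample

multiplier = adjusted_multiplier(true, predictor, sem_estimate)
print(multiplier)  # larger than 1.65, because the SEM is underestimated
```

The resulting multiplier would replace 1.65 when the interval is computed for the empirical data. How well this works depends on how closely the simulated model matches the empirical one.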
Moreover, the overlap between the factor score predictor intervals reduces the number of interindividual differences that are interpreted. This is a direct consequence of taking the error term of the factor model into account. Even for solutions based on population main loadings of .80, only about 50% of the differences between individuals are interpreted on the basis of a 90% factor score predictor interval. As the factor score predictor interval reflects the effects of the error term in factor analysis on the scores, about 50% of the interindividual differences cannot be interpreted under the conditions of the present simulation study when the assumption that the factor model also contains an error term is taken into account. However, it would be inconsistent to have a term for measurement error in the factor model and to expect that the factor score predictor could be computed as if this measurement error did not occur. The overestimation of the reliabilities that has been found for small and medium sample sizes will, of course, also affect the number of interindividual differences that are interpreted on the basis of the factor score predictor interval. Therefore, the method presented here should be used for samples of at least about 900 cases, or an additional simulation study based on the parameters of the empirical model should be conducted to improve the factor score predictor interval.
Finally, an example was based on a factor analysis of the EPQ-R (Eysenck & Eysenck, 1991). It was shown that even when the main loadings are rather small, the reliabilities can be reasonably high when the number of items per factor is large. Moreover, it is unlikely that there has been a considerable overestimation of the reliabilities, as the sample size in the empirical example was larger than 900 cases. It was, moreover, shown that intraindividual comparisons of scores may turn out to be more realistic when factor score predictor intervals are considered. As a final remark, it is noted that taking the error term or uniqueness into account when the factor loadings are estimated is a main characteristic of the factor model. This implies that the error term should also be taken into account when factor score predictors are computed. In this sense, the factor score predictor interval is a direct consequence of the properties of the factor model.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
