Abstract
There is a lack of research on the effects of outliers on the decisions about the number of factors to retain in an exploratory factor analysis, especially for outliers arising from unintended and unknowingly included subpopulations. The purpose of the present research was to investigate how outliers from an unintended and unknowingly included subpopulation affected the decisions about the number of factors to retain using four commonly used methods separately. The results showed that all the decision methods could provide biased results and the number of factors could be inflated, deflated, or remain the same depending on the decision methods used and outlier conditions. The findings also revealed that symmetric outliers did not affect the three principal component analysis–based methods but affected chi-square (ML) sequential tests. Finally, sample size did not play a role in the effect of outliers.
Exploratory factor analysis (EFA) is a widely used statistical technique in the psychosocial, behavioral, and health sciences. However, the impact of outliers on decisions about the number of factors to retain is largely undocumented in the psychometric and methodological literature. Recently, Liu, Zumbo, and Wu (in press) demonstrated the impact of outliers on the decision about the number of factors. Their study focused on outliers that are errors in the data, that is, Liu and Zumbo’s (2007) first category of outlier sources. Liu and colleagues found that this type of outlier inflated, deflated, or had no effect on the number of factors retained, depending on the extent of outlier contamination and which decision method (e.g., parallel analysis, Kaiser–Guttman’s eigenvalues-greater-than-one, minimum average partial, or sequential chi-square tests) was used.
The purpose of the present article is to continue this line of research and investigate Liu and Zumbo’s (2007) second and third categories of outliers using a probabilistic mixture of distributions. With this purpose in mind, we first discuss the connection between the various sources of outliers and the models used for simulating them in psychometric studies. Next, two studies are reported. The first study demonstrates the impact of these types of outliers on the decision about the number of factors to retain. The second study is a follow-up to the first study, including a focused simulation study of a small correlation matrix and a report on the skewness and kurtosis of variables that were simulated using the same simulation design as in Study 1. The second study provides insight into how the outliers altered the properties of the correlation matrix. Readers interested in a review of the literature as well as a discussion of the need to study outliers in terms of deciding on the number of factors to retain should see Liu et al. (in press).
Various Sources of Outliers and the Models Used in Simulation Studies
As Zumbo and Zimmerman (1993) state, computer simulation (including Monte Carlo simulation) is an empirical method of experimental mathematics that is loosely defined as the mimicking of the rules of a model (in our case, a psychological or psychometric phenomenon) via random processes. The key concept in this definition is the correspondence of the psychometric or psychological process to how it is being mimicked in the simulation. For example, in the case of the study of outliers, it is important that the simulation method matches the source of outliers being considered. Below, we briefly review a taxonomy of sources of outliers and then address the simulation methods that can be used to mimic these outliers.
Sources of Outlier Contamination
In terms of psychometric analysis, Liu and Zumbo (2007) described three categories of possible sources of outliers in item responses, that is, in a univariate distribution of item responses. As noted above, the first category usually refers to “errors” in the data, instantiations of which include errors that occur during data collection, data recording, or data entry. The outliers generated from such sources are obviously illegitimate observations and should, when found, be corrected. This first category of outliers arises from mistakes and is hence specific to a particular data set, so that these outliers are a property of a sample but not of a population. For example, it typically does not make sense to talk about the number of typographical data entry errors in a population; however, it does make sense to talk about the number of such typos in a sample. Because of this characteristic, this type of outlier differs from the other two categories in the outlier generation models used in methodological and simulation studies, that is, deterministic or slippage simulation models.
The second category of outliers refers to the unpredictable measurement-related errors from participants, including guessing and inattentiveness during item responding, which may be caused by fatigue or participants’ lack of interest in participation. Another example of this category includes item misresponding, which happens when, for example, participants misunderstand the instructions or the descriptors on the response scale (e.g., Barnette, 1999). Unlike outliers in the first category, which are clearly errors and sample specific, outliers in the second category may, depending on the particular psychological processes in item responding, be considered either (sample-specific) errors or characteristics and propensities of respondents and hence a population characteristic. For example, misunderstanding the item response instructions may be due to something that reflects momentary inattention or an inherent inattentiveness by the respondent. The former is a sample-specific error (and hence akin to an error of the kind in the first category), whereas the latter is, by definition, a characteristic of respondents and hence may reflect a subgroup of inattentive respondents in the population of possible item respondents.
Liu and Zumbo’s (2007) third category of outliers occurs when researchers unknowingly recruit some individuals who are not members of the target population, resulting in a subpopulation for whom the measure operates differently than for the target population. Liu and Zumbo described an example of this in the context of self-concept research conducted with a student population wherein some study participants are from Asian countries for which self-concept may be a different construct. There are many examples of this sort of problem as evidenced by the growing number of articles on construct comparability and test adaptation (e.g., Hambleton, Merenda, & Spielberger, 2005).
Outliers from the third category as well as the second category (when they are a characteristic of respondents) reflect an unintended and unknowingly included (henceforth referred to as “unintended”) subgroup in one’s target population and are usually simulated via probability models. Although they represent different psychological phenomena, these outliers behave the same mathematically and hence can be simulated by the same outlier generation models.
Models Used in Simulation Studies
In the statistical literature, one sees reference to three common models for simulating outliers: deterministic, slippage, and mixture models. Deterministic and slippage models are typically used for the first category and for sample-specific errors in the second category, whereas mixture models are typically used for the second category of outliers that are a characteristic of respondents (and, hence, a population characteristic) and for the third category of outliers. For both the second and the third category, the mixture model is used to mimic unintended subpopulations. It should be noted, however, that the slippage model can, in particular instances, also be used to model unintended subpopulations, except that the number of outliers would then be fixed from replication to replication.
Deterministic model
The first category of outliers, errors in the data, has been simulated using a deterministic model (Barnett & Lewis, 1994). Because this type of outlier is sample specific, the number of outliers is fixed for a sample and rejection of the null hypothesis of no outliers is deterministically correct, as these outliers are obviously different from the majority of observations (Barnett & Lewis, 1994). One way to simulate outliers using a deterministic model is simply to alter the original data, either by multiplying raw scores by a constant or by adding a constant to them. Examples of this type abound in the literature and include EFA studies by Yuan, Marshall, and Bentler (2002) and Study 1 of Liu et al. (in press). In both examples, outliers were created by multiplying raw scores of one or more variables by a constant (2, 3, 4, or 5) for a certain proportion of subjects in a sample.
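The multiplication approach described above can be sketched in a few lines. This is a minimal illustration, not the code used in the cited studies: the function name `contaminate_deterministic`, the score distribution, and the seed are all hypothetical choices made for the example.

```python
import random

def contaminate_deterministic(scores, proportion, constant, rng):
    """Multiply a fixed share of raw scores by a constant (hypothetical helper)."""
    # The number of outliers is fixed by the design, not drawn at random.
    n_outliers = round(proportion * len(scores))
    chosen = set(rng.sample(range(len(scores)), n_outliers))
    altered = [x * constant if i in chosen else x for i, x in enumerate(scores)]
    return altered, chosen

rng = random.Random(1)                                 # arbitrary seed
scores = [rng.gauss(50, 10) for _ in range(200)]       # illustrative raw scores
contaminated, chosen = contaminate_deterministic(scores, 0.05, 3, rng)
print(len(chosen), "of", len(scores), "scores were multiplied by 3")
```

Because the count of altered cases is computed directly from the sample size, every replication with the same settings contains exactly the same number of outliers.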
Slippage model
Another common strategy for simulating outliers as errors in the data (i.e., the first category of outliers) is the slippage model. As with the deterministic model, the number of outliers in a sample is fixed from replication to replication in a simulation study; however, with the slippage model, these outliers arise from some probability distribution.
The slippage model has been widely discussed and used in the literature (e.g., Anscombe, 1960; Barnett & Lewis, 1978; Dixon, 1950; Liu et al., in press). In its general form, the null model (without outliers) is
$$H_0: X_1, X_2, \ldots, X_n \sim F.$$
The alternative model is
$$H_1: I \text{ of the } X_i \sim F \text{ and the remaining } P \sim G,$$
where I + P = n; F denotes a target distribution (sometimes called parent distribution); G denotes a contamination distribution with a different mean and/or variance; n denotes the total number of observations in a sample; I is the number of observations from a target distribution; and P is the number of observations from a contamination distribution. In the null model, all observations are assumed to come from the same population distribution. In the alternative model, a small number of observations are assumed to come from a contamination distribution, and the total number of observations in a sample is the sum of observations from a target distribution and from a contamination distribution (Balakrishnan & Childs, 2001).
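A minimal sketch of sampling under the slippage alternative follows, assuming normal F and G for illustration only. The function name `slippage_sample` and the parameter values are hypothetical; the point is that exactly P observations come from G in every replication.

```python
import random

def slippage_sample(n, n_contaminated, target, contamination, rng):
    """Draw I = n - P values from F and exactly P values from G (hypothetical helper)."""
    mu_f, sd_f = target
    mu_g, sd_g = contamination
    clean = [(rng.gauss(mu_f, sd_f), False) for _ in range(n - n_contaminated)]
    outliers = [(rng.gauss(mu_g, sd_g), True) for _ in range(n_contaminated)]
    return clean + outliers          # each entry is (value, came_from_G)

rng = random.Random(2)               # arbitrary seed
outlier_counts = []
for _ in range(5):                   # five replications
    sample = slippage_sample(250, 20, (0, 1), (3, 1), rng)
    outlier_counts.append(sum(flag for _, flag in sample))
print(outlier_counts)                # the count of outliers is identical in every replication
```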
Mixture of distributions
One can think of this model intuitively as mixing two different population distributions together. A psychological example may help make this concrete: people from Denmark typically rate their life satisfaction as much higher than people from Hungary (Organization for Economic Co-operation and Development, 2005). If we targeted people in Denmark for an investigation, but unknowingly also recruited a small group of people who just immigrated to Denmark from Hungary, the observations recruited are from a mixture of two populations and responses from Hungarian people might appear as outliers. To mimic this kind of outlier, one would use a mixture of distributions, which has been widely used in the research literature and also is used in the present research.
A mixture of two distributions is a general model, comprising two weighted probability distributions with positive weights that sum to one (Blischke, 1978). As the weights represent a probability distribution, the mixture is also a probability distribution. The two distributions thus mixed, depending on the parameter values for the mixing, represent different populations. The components of a mixture of distributions can be normal or nonnormal distributions (e.g., Poisson, negative binomial distributions). In the statistical and psychometric research literatures, a mixture of two normal distributions has been frequently used for simulating outliers. One of the best known and most widely used mathematical models is the mixture contamination model, also referred to as the mixed normal distribution, which was introduced by Tukey (1962) and later extended by Huber (1964), Mosteller and Tukey (1968), and Barnett and Lewis (1994). This mixture contamination model is the one used herein. It comprises two normal distributions: a target distribution with mean µ and standard deviation σ, N(µ, σ), denoted by F, and a contamination distribution whose mean and/or standard deviation differ from those of F, denoted by G.
Given a sample of n independent observations, Xi (i = 1, 2, . . . , n), the majority of the data points follow the target distribution F, in a proportion denoted by 1 − p, whereas a small fraction, p, follows the contamination distribution G. The mixed contamination model is a mixture of F and G. The null model is
$$H_0: X_i \sim F, \quad i = 1, 2, \ldots, n.$$
The alternative model is
$$H_1: X_i \sim (1 - p)F + pG, \quad i = 1, 2, \ldots, n, \tag{1}$$
where p, the probability that an observation arises from the contamination distribution G, must be less than one half and is often substantially less. If the proportion of outliers approaches one half, outlier treatments such as robust methods are no longer legitimate in practice, and the outliers should instead be modeled as another population. It is important to note that, in a simulation study with, for example, 100 replications, the proportion of the sample from the G distribution is itself a random variable whose expected value is the proportion p. That is, the proportion of outliers varies from sample to sample but on average will be p. This sample-to-sample variation is documented in the Results section as part of the description of our simulation method (Table 1).
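The random number of outliers under the mixture model can be illustrated with a short sketch. It assumes that each observation is assigned to G independently with probability p, which is the usual reading of the model; the function name `mixture_sample`, the parameter values, and the seed are hypothetical.

```python
import random

def mixture_sample(n, p, target, contamination, rng):
    """Draw n values; each comes from G with probability p, else from F."""
    sample = []
    for _ in range(n):
        from_g = rng.random() < p                    # random membership, not a fixed count
        mu, sd = contamination if from_g else target
        sample.append((rng.gauss(mu, sd), from_g))
    return sample

rng = random.Random(3)                               # arbitrary seed
proportions = []
for _ in range(100):                                 # 100 replications
    sample = mixture_sample(250, 0.15, (0, 1), (3, 1), rng)
    proportions.append(sum(flag for _, flag in sample) / 250)

mean_proportion = sum(proportions) / len(proportions)
print(min(proportions), mean_proportion, max(proportions))
```

The observed proportion differs from replication to replication, while its average across replications stays close to p.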
Table 1. Documenting Simulations Using Mixture of Distributions: Proportion of Outliers in Each Sample Across 100 Replications
Note. Pc = proportion of contamination in the population; n = sample size; SD = standard deviation.
Slippage models and the mixture contamination model share some similarities but have some fundamental differences. Barnett and Lewis (1994) pointed out that the number of outliers is fixed in a slippage model and outliers are regarded as fixed contamination, whereas the number of outliers is a random variable in a mixture contamination model and hence outliers are regarded as random contamination. It should be noted that a mixture of distributions model is not appropriate for simulating outliers from Liu and Zumbo’s (2007) first category but is appropriate for simulating outliers from their second and third categories. Because the first category of outliers consists of obvious (typographical) errors that are sample specific, the randomness of mixture of distributions models does not fit the fixed nature of these outliers. However, slippage models can be used for this kind of outlier because the number of outliers is fixed in each sample.
There are two contamination conditions: symmetric and asymmetric contamination. The contamination is symmetric if the population is a mixture of N(µ, σ) and N(µ, bσ), where b is a constant greater than one and hence generates a contamination distribution with a larger standard deviation (SD) than the parent distribution, which is called SD shift in this article. It is worth noting that if b is less than one, the condition of inliers should be considered instead of outliers, which is not of interest to the present study. The contamination is asymmetric when the population is a mixture of N(µ, σ) and N(µ + a, σ) or of N(µ, σ) and N(µ + a, bσ), where a is a constant and a ≠ 0. The mean and SD of F are usually defined as 0 and 1, respectively, that is, N(0, 1), so adding any nonzero value to zero, or subtracting it, will shift the mean of the contamination distribution away from the center of the population distribution and hence lead to asymmetric contamination. Therefore, a variety of contamination conditions can be generated by varying the three outlier factors, that is, the proportion of contamination, mean shift, and SD shift of the contamination distribution.
An example of outliers in a mixed contamination model is given in Figure 1. Figure 1A is a normal distribution, N(µ = 0, σ = 1). Figure 1B presents a case of symmetric outliers with 15% of outliers, consisting of a parent distribution N(µ = 0, σ = 1) and a contamination distribution N(µ = 0, σ = 3). Outliers are shown as long and heavy tails at each side of the distribution and result in a highly leptokurtic (peaked) distribution. Figure 1C demonstrates a case of asymmetric outliers with 15% of outliers, consisting of a parent distribution N(µ = 0, σ = 1) and a contamination distribution N(µ = 3, σ = 1). Outliers are shown as a heavy tail on one side of the distribution. Figure 1D shows another case of asymmetric outliers with a parent distribution N(µ = 0, σ = 1) and a contamination distribution N(µ = 3, σ = 3). Outliers make the distribution have a heavy tail on one side as well as a high peak.
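The labels symmetric and asymmetric can be tied to the moments of the mixture. The derivation below is a supplementary sketch, not taken from the cited sources; it assumes a target N(µ, σ), a contamination distribution N(µ + a, bσ), and mixing proportion p, and uses the standard formulas for the moments of a two-component mixture.

```latex
\begin{aligned}
E(X) &= \mu + pa, \\
\operatorname{Var}(X) &= (1 - p)\sigma^2 + p\,b^2\sigma^2 + p(1 - p)\,a^2, \\
E\!\left[(X - E(X))^3\right] &= p(1 - p)\,a\left[a^2(1 - 2p) + 3\sigma^2(b^2 - 1)\right].
\end{aligned}
```

With a = 0 the third central moment is zero and the mixture remains symmetric about µ, whatever the SD shift. With a ≠ 0, p < 1/2, and b ≥ 1, the bracketed term is positive, so the mixture is skewed in the direction of the mean shift.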
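The shapes described for the panels of Figure 1 can be checked against the theoretical skewness and kurtosis of each mixture. The sketch below is an illustration, not the authors' computation: the function name `mixture_shape` is hypothetical, and the proportion of 0.15 for the last panel is taken from the figure caption.

```python
def mixture_shape(p, mean_shift, sd_shift):
    """Skewness and kurtosis of (1 - p) * N(0, 1) + p * N(mean_shift, sd_shift)."""
    components = [
        (1.0 - p, 0.0, 1.0),           # weight, mean, SD of the target F
        (p, mean_shift, sd_shift),     # weight, mean, SD of the contamination G
    ]
    mixture_mean = sum(w * m for w, m, _ in components)
    m2 = m3 = m4 = 0.0
    for w, m, s in components:
        d = m - mixture_mean           # offset of the component mean from the mixture mean
        m2 += w * (d**2 + s**2)                          # second central moment
        m3 += w * (d**3 + 3 * d * s**2)                  # third central moment
        m4 += w * (d**4 + 6 * d**2 * s**2 + 3 * s**4)    # fourth central moment
    return m3 / m2**1.5, m4 / m2**2

panel_b = mixture_shape(0.15, 0, 3)    # symmetric outliers
panel_c = mixture_shape(0.15, 3, 1)    # asymmetric, mean shift only
panel_d = mixture_shape(0.15, 3, 3)    # asymmetric, mean and SD shift
for name, (skew, kurt) in [("B", panel_b), ("C", panel_c), ("D", panel_d)]:
    print(name, round(skew, 3), round(kurt, 3))
```

For the symmetric case the skewness is zero and the kurtosis is 39/4.84, about 8.06, compared with 3 for a normal distribution; both asymmetric cases have positive skewness.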

Figure 1. An example of symmetric and asymmetric outliers (proportion of contamination = 0.15)
Building on the findings of Liu et al. (in press), the purpose of the present research was to investigate how outliers, arising from an unintended and unknowingly recruited subpopulation (Liu and Zumbo’s second and third categories of outliers), affected the decisions about the number of factors to retain using four commonly used methods, and the most commonly used variants thereof, that is, parallel analysis (PA, using the PCA model and the 95th percentile of the 100 random data sets), Kaiser–Guttman’s (K-G) eigenvalues-greater-than-one, minimum average partial (MAP, see Velicer, 1976, for a detailed description of the procedure used herein), or sequential chi-square tests based on maximum likelihood estimation (
Study 1: Investigating the Effects of Outliers Generated Using the Mixture Contamination Model
Method
Study design
A Monte Carlo simulation study was used to investigate the effects of outliers on decisions about the number of factors by the four decision methods. This study systematically varied five factors with 100 replications for each outlier condition (i.e., simulation condition). These five factors are as follows:
Mean shift of a contamination distribution (0, 1.5, 3)
SD shift of a contamination distribution (1, 1.5, 3)
Proportion of contamination (i.e., proportion of the subjects from the contamination distribution; .01, .08, .15)
Sample size (250, 500, 1,000)
Number of variables with outliers (1, 6, 12, 24)
The study design is therefore a 3 × 3 × 3 × 3 × 4 completely crossed factorial design with 324 conditions, which also includes the no-outlier conditions (i.e., the comparison conditions), which have a mean shift of zero and an SD shift of one.
To ensure a systematic investigation of outlier effects, the selection of the magnitudes of three factors (mean shift, SD shift, and proportion of contamination), which are the parameters of a typical mixture contamination model, was guided by previous studies (Blair & Higgins, 1980; Liu & Zumbo, 2007; Mosteller & Tukey, 1968; Zumbo & Jennings, 2002). Following these studies, the present study adopted similar values of model parameters with some modifications to fit the purpose of the present study. The number of variables with outliers was also included in the present study as it was demonstrated to be an influential factor in determining the number of factors in Liu et al.’s (in press) study. In addition, sample size was found in the literature to affect the performance of the K-G rule as well as chi-square tests (e.g., Gorsuch, 1983; Hubbard & Allen, 1987; Zwick & Velicer, 1986). Hence, we included sample size as a factor in the present study.
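The crossed design can be enumerated directly from the factor levels listed above. This sketch only reproduces the bookkeeping of the design; the variable names are illustrative.

```python
from itertools import product

mean_shifts = [0, 1.5, 3]
sd_shifts = [1, 1.5, 3]
contamination_proportions = [0.01, 0.08, 0.15]
sample_sizes = [250, 500, 1000]
vars_with_outliers = [1, 6, 12, 24]

# Every combination of the five factors is one simulation condition.
conditions = list(product(mean_shifts, sd_shifts, contamination_proportions,
                          sample_sizes, vars_with_outliers))

# Cells with mean shift 0 and SD shift 1 contain no outliers (comparison cells).
no_outlier = [c for c in conditions if c[0] == 0 and c[1] == 1]
print(len(conditions), len(no_outlier))
```

Of the 324 cells, 36 (3 × 3 × 4) have a mean shift of zero and an SD shift of one, so the contamination distribution coincides with the target distribution there.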
Data generation
In line with the earlier work by Liu and Zumbo (2007), Liu et al. (in press), and the psychometric context of our study, the outliers are induced in the item responses, that is, the marginal distributions. Twenty-four continuous variables were simulated and therefore our findings apply equally to analyses of subscale scores or visual analogue item response data (Liu & Zumbo, 2007).
For the mixture contamination model in Equation (1), both the target and contamination data were generated based on the population correlation matrix from Holzinger and Swineford’s (1939) classic data set. The original data set consists of 24 psychological ability test scores from 301 junior high school students with a four-factor solution recommended by many researchers (e.g., Gorsuch, 1983; Harman, 1976; Liu et al., in press). As in Liu et al.’s studies, a four-factor solution based on maximum likelihood EFA was obtained using Holzinger and Swineford’s data. To give the reader a sense of the factorial solution that generated the implied correlation matrix, using maximum likelihood EFA along with PROMAX rotation, the average interfactor correlation was .46 and ranged from .40 to .55. In addition, the structure coefficients (i.e., the factor loadings) demonstrate some complexity (i.e., not precisely simple structure); however, every factor had at least eight loadings greater than .40, and the interpretation of the factors is in line with Gorsuch (1983). The resulting reproduced correlation matrix (i.e., the implied correlation matrix with “1s” on the diagonal rather than the reproduced communalities) was used as the population correlation matrix in the simulation to generate multivariate normal data sets with specified marginal means and SDs, depending on the experimental condition, that correspond to the target or contamination distribution in Equation (1). Multivariate normal data were generated in software R 2.12.1, using a method akin to the Kaiser and Dickman (1962) method wherein we used Cholesky decomposition rather than principal components analysis in the computation. Generating data from a model with a known (prespecified) number of factors allowed us to compare the number of factors obtained from different outlier conditions to a common criterion in the population: four factors.
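The Cholesky-based generation step can be sketched as follows. This is a standard-library Python illustration rather than the authors' R code, and it makes several assumptions: the 3 × 3 correlation matrix is made up for the example (it is not the Holzinger and Swineford matrix), and the helper names `cholesky`, `correlated_normals`, and `pearson` are hypothetical.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular L with L * L^T equal to a positive definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def correlated_normals(n, corr, means, sds, rng):
    """Multivariate normal rows with the given correlations, means, and SDs."""
    lower = cholesky(corr)
    k = len(corr)
    rows = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(k)]                  # independent N(0, 1)
        x = [sum(lower[i][j] * z[j] for j in range(i + 1)) for i in range(k)]
        rows.append([means[i] + sds[i] * x[i] for i in range(k)])
    return rows

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)

corr = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]      # hypothetical matrix
rng = random.Random(4)                                           # arbitrary seed
target = correlated_normals(20000, corr, [0, 0, 0], [1, 1, 1], rng)
contam = correlated_normals(20000, corr, [3, 3, 3], [1, 1, 1], rng)
r_target = pearson([row[0] for row in target], [row[1] for row in target])
contam_mean = sum(row[0] for row in contam) / len(contam)
print(round(r_target, 3), round(contam_mean, 3))
```

Changing only the marginal means and SDs, as in the second call, yields data for a contamination distribution that shares the correlation structure of the target.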
Outcome variable
In each of the 324 experimental conditions, and for each of the 100 replications, the number of factors to retain for the EFA was determined, separately, by the K-G rule, PA, MAP, and sequential
Analysis of the simulation results
Following the data analysis strategy used in Liu and Zumbo (2007) and Liu et al. (in press), five-way ANOVAs (3 × 3 × 3 × 3 × 4) were conducted with the number of factors retained as the dependent variable separately for each of the four decision methods, that is, the K-G rule, PA, MAP, and
The sequential
Results
Proportion of outliers in a given sample
As noted earlier, when using the mixture contamination model to simulate outliers, the proportion of outliers in a sample can vary across replications, that is, from sample to sample. To our knowledge, the central tendency and variability in sample-to-sample proportions of outliers have not been documented in simulation studies. To better understand these statistics, we recorded the proportion of outliers across 100 replications for a single variable.
Table 1 lists the central tendency (mean, median) and variability (SD, quartiles, minimum and maximum values) for the proportion of outliers for the various conditions in the current simulation study across the 100 replications. Starting from the far left in Table 1, one can find the population value of the proportion of contamination, the sample size, and then the seven descriptive statistics computed across the 100 replications. One can see that, as expected, the mean is equal to the population value of contamination in every case. However, also as expected, there is variability in the proportion of contamination across the samples, which depends on the sample size and the population proportion of contamination.
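The dependence of this variability on sample size and on the population proportion can be anticipated from the binomial distribution. The expressions below are a supplementary sketch that assumes each observation is assigned to the contamination distribution independently with probability p; K denotes the number of outliers in a sample of size n, a symbol introduced here for illustration.

```latex
K \sim \operatorname{Binomial}(n, p), \qquad
E\!\left(\frac{K}{n}\right) = p, \qquad
\operatorname{SD}\!\left(\frac{K}{n}\right) = \sqrt{\frac{p(1 - p)}{n}}
```

Under this assumption, p = .15 gives a theoretical SD of about .023 at n = 250 and about .011 at n = 1,000, so the spread shrinks with larger samples and grows as p moves toward one half.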
Results of the simulation study
Tables 2 to 5 present the results for the four decision methods (K-G, MAP, PA, and sequential

Figure 2. Three-way interactions of variables with outliers versus proportion of contamination by three levels of mean shift on the number of factors extracted by the K-G rule

Figure 3. Three-way interactions of variables with outliers versus proportion of contamination by three levels of mean shift on the number of factors extracted by the MAP approach

Figure 4. Three-way interactions of variables with outliers versus proportion of contamination by three levels of mean shift on the number of factors extracted by the PA approach

Figure 5. Three-way interactions of variables with outliers versus proportion of contamination by three levels of standard deviation shift on the number of factors decided by the sequential chi-square (ML) tests
Table 2 presents the results of the variance decomposition for the K-G rule, and Figure 2 shows the corresponding plot of the three-way interaction. With a mean shift of zero (i.e., symmetric outliers), the number of factors was not affected by outliers, which was also the case for the MAP and PA methods. With mean shift of 1.5 and 3 (i.e., asymmetric outliers), the change in the number of factors depended on the number of variables having outliers and the proportion of contamination. When mean shift was 1.5, the number of factors was not affected when one variable and all variables (24) had outliers, but was inflated (from 4 up to 5 factors) when 6 and 12 variables had outliers. With a mean shift of 3, the number of factors was not affected when only one variable had outliers, was inflated (from 4 up to 5 factors) when 6 and 12 variables had outliers, but deflated when all 24 variables had outliers (from 4 to an average of 2.7 factors). There was more deflation with an increase in the proportion of contamination.
Table 2. Variable Ordering for a Five-Way ANOVA on the Number of Factors Extracted by the K-G Rule
Note. R2 = 0.758. ANOVA = analysis of variance; K-G = Kaiser–Guttman rule; mean = mean shift of the contamination distribution; SD = standard deviation shift of the contamination distribution; pc = proportion of contamination in the population; n = sample size; vars = number of variables having outliers. The important main effects and/or interactions, as described in the Methods section, are listed in boldface.
Table 3 presents the results of the variance decomposition for the MAP method, with the corresponding plots in Figure 3. Figure 3 showed that the number of factors was not affected when the mean shift was 0 and 1.5, but inflated from 4 to 5 when the mean shift increased to 3 for the cases of 6 and 12 variables having outliers. The magnitude of the inflation increased with the increase of the proportion of contamination. It is worth noting that the number of factors retained was not affected when all variables had outliers in the MAP method, which was different from the K-G and PA methods.
Table 3. Variable Ordering for a Five-Way ANOVA on the Number of Factors Extracted by the MAP Approach
Note. R2 = 0.636. ANOVA = analysis of variance; MAP = minimum average partial; mean = mean shift of the contamination distribution; SD = standard deviation shift of contamination distribution; pc = proportion of contamination in the population; n = sample size; vars = number of variables having outliers. The important main effects and/or interactions, as described in the Methods section, are listed in boldface.
Table 4 presents the results of the variance decomposition for the PA method, and Figure 4 is the corresponding interaction plot. Similar to the performance of the K-G and MAP methods, the PA method was robust to symmetric outliers. In general, the PA method was accurate in retaining the number of factors in the presence of asymmetric outliers; however, it became dysfunctional when all variables had outliers: (a) the number of factors was deflated slightly when the mean shift was 1.5 and deflated dramatically when the mean shift increased to 3 and (b) the magnitude of deflation increased when the proportion of contamination increased.
Table 4. Variable Ordering for a Five-Way ANOVA on the Number of Factors Extracted by the PA Approach
Note. R2 = 0.824. ANOVA = analysis of variance; PA = parallel analysis; mean = mean shift of the contamination distribution; SD = standard deviation shift of the contamination distribution; pc = proportion of contamination in the population; n = sample size; vars = number of variables having outliers. The important main effects and/or interactions, as described in the Methods section, are listed in boldface.
Unlike the three PCA-based methods, for the sequential
Table 5. Variable Ordering for a Five-Way ANOVA on the Number of Factors Decided by the Chi-Square (ML) Test
Note. R2 = 0.946. ANOVA = analysis of variance; mean = mean shift of the contamination distribution; SD = standard deviation shift of the contamination distribution; pc = proportion of contamination in the population; n = sample size; vars = number of variables having outliers. The important main effects and/or interactions, as described in the Methods section, are listed in boldface.
It should be noted that nonconvergence was found for the sequential
Table 6. Percentage of Nonconvergent Replications With the Sequential Chi-Square (ML) Tests
Note. SD = standard deviation; pc = proportion of contamination in the population; vars = number of variables having outliers.
Study 2: Demonstrations of Effects of Outliers on Correlation Matrix and Kurtosis and Skewness of Item Responses
The purpose of Study 2 was to facilitate our understanding about why these decision methods performed differently in the presence of outliers. As Liu et al. (in press) pointed out, correlation matrices are the engine for the PCA-based methods and hence are the input data for them. Furthermore, skewness and kurtosis are related to the performance of the
Demonstration 1
Researchers have often ignored the effects of outliers on factor analysis, partly because they believed that a few outliers should not substantially change the correlation matrix and that, as such, a factor analysis should not be affected by outliers. The present small-scale simulation aimed to demonstrate how outliers may distort properties of a correlation matrix. Following Liu et al.’s (in press) study, we also used the original correlation matrix of the first four variables from Holzinger and Swineford’s (1939) classic data as the population correlation matrix for simulating multivariate normal data sets. For demonstration purposes, we only included extreme outlier conditions (mean shift = 3 and/or SD shift = 3) with either two or all four variables having outliers as well as a no-outlier condition. Across all outlier conditions, the proportion of contamination was 0.15. In Study 1, we did not find sample size effects; therefore, in this demonstration we ruled out this factor and used data sets with 100,000 observations so as to have population analogues.
To examine the change in the correlation matrix under outlier conditions, we used the matrix’s condition number to document whether it is ill-conditioned and to what degree, with a larger condition number indicating a more ill-conditioned matrix. The advantage of using the condition number is that, when the correlation matrix is close to being singular, we can still obtain a solution, which disguises the problem of ill-conditioning, but the condition number can reveal whether the matrix is ill-conditioned and whether its properties are distorted. The condition number is a product term,
Table 7 presents the results of the simulation with four rows and seven columns. The first column indicates the outlier conditions (i.e., mean shift and SD shift), the second column shows the resulting correlation matrix with only two variables having outliers, the third and fourth columns are the corresponding condition number and eigenvalues, the fifth column shows the resulting correlation matrix with all four variables having outliers, and the sixth and seventh columns are the corresponding condition number and eigenvalues.
Table 7. Demonstration of Changes in Correlation Coefficients, Condition Number, and Eigenvalues Using a Four-Variable Data Set in the Presence of Outliers (15%) Compared to the No-Outlier Condition
Note. M = mean; SD = standard deviation. Dashed lines are used to indicate which variables had outliers.
The top row presents the results for the correlation matrix in the no-outlier condition. The second row shows the results for the symmetric outlier condition, with no mean shift and an SD shift of 3. No substantial effects of outliers were found, whether two variables or all four variables had outliers. Some of the correlation coefficients were deflated to a small degree when two variables had outliers, but the condition number and the magnitude of the eigenvalues were not affected by symmetric outliers.
However, there were dramatic changes in the asymmetric outlier condition with mean shift only (mean shift = 3, SD shift = 1). Echoing the findings of Liu et al. (in press), we found that when two variables had outliers, the correlation coefficient between those two variables was inflated, whereas the remaining correlations were either deflated (when they paired a variable with outliers and a variable without) or unchanged (when they involved only variables without outliers). This complex pattern created an extra factor and increased the condition number from 3.415 (baseline) to 6.481. When all the variables had outliers, the correlation coefficients were all inflated, resulting in a more dominant (or salient) factor and a large increase in the condition number, from 3.415 to 11.363.
For the asymmetric outlier condition with both mean shift and SD shift (mean shift = 3, SD shift = 3), the effects of outliers were reduced to some degree. Compared with the mean-shift-only condition (mean shift = 3, SD shift = 1), the inflation in the correlation coefficients was smaller; the condition number dropped from 6.481 and 11.363 to 4.465 and 7.093, respectively; the second eigenvalue when two variables had outliers was no longer greater than one (it dropped from 1.009 to 0.955); and the largest eigenvalue when all variables had outliers decreased from 3.047 to 2.666.
The interesting finding here was that symmetric outliers did not affect the correlation matrix, whereas asymmetric outliers, especially in the mean-shift-only condition, distorted it, either creating an extra factor or producing a dominant factor that could reduce the number of factors retained when there was more than one factor. This helps explain why mean shift and the number of variables having outliers played important roles for the PCA-based methods whereas SD shift did not. Although they are all PCA-based, the K-G, MAP, and PA methods adopt different procedures, so some variation among them in determining the number of factors, as shown in Study 1, should not be surprising.
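The pattern summarized above can be reproduced algebraically under a simple mixture model. The derivation below is an illustrative sketch, not the authors' own. It assumes a standardized target population $N(\mathbf{0}, \mathbf{R})$, a contaminating subpopulation $N(\boldsymbol{\delta}, \mathbf{D}\mathbf{R}\mathbf{D})$ mixed in with proportion $p$, and the same mean shift $m$ and SD shift $s$ on every contaminated variable. The value $\rho = .30$ in the worked numbers is hypothetical and is not taken from Table 7.

```latex
% Covariance matrix of the two-component mixture
\Sigma_{\text{mix}} = (1-p)\,\mathbf{R} + p\,\mathbf{D}\mathbf{R}\mathbf{D}
                      + p(1-p)\,\boldsymbol{\delta}\boldsymbol{\delta}^{\top}

% All variables contaminated, symmetric outliers (m = 0, D = sI):
\Sigma_{\text{mix}} = (1 - p + p s^{2})\,\mathbf{R}
\quad\Longrightarrow\quad \rho^{*}_{ij} = \rho_{ij}

% All variables contaminated, general case, with
% a = 1 - p + p s^{2} and c = p(1-p) m^{2}:
\rho^{*}_{ij} = \frac{a\,\rho_{ij} + c}{a + c}

% Worked numbers for p = .15 and a hypothetical \rho_{ij} = .30
% m = 3, s = 1:  a = 1,    c = 1.1475,  \rho^{*} = 1.4475 / 2.1475 \approx .674
% m = 3, s = 3:  a = 2.2,  c = 1.1475,  \rho^{*} = 1.8075 / 3.3475 \approx .540
```

Under these assumptions, a pure SD shift on all variables rescales the covariance matrix without changing any correlation, a mean shift adds a common rank-one term that pulls every correlation toward 1, and adding an SD shift to the mean shift dilutes that rank-one term, which matches the smaller inflation seen when both shifts equal 3.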
Demonstration 2
Our findings from Study 1 revealed that, for the sequential
Table 8 comprises two parts: the upper part reports kurtosis and the lower part reports skewness. The kurtosis was inflated when the SD shifted from 1 to 1.5, and greatly inflated when the SD shift became 3. Mean shift affected kurtosis to some degree when the proportion of contamination was .01 or .08, but not much when it was .15. It should be noted that, with a mean shift of 3 and an SD shift of 3, the kurtosis was inflated to 8.62 at a proportion of contamination of .08 but to only 5.96 at a proportion of .15. That is, the higher proportion of contamination (.15) led to less inflation in kurtosis than the lower proportion (.08). As shown mathematically by Pena and Prieto (2001), symmetric outliers increase the kurtosis and a small proportion of asymmetric outliers also increases it, but a large proportion of asymmetric outliers can make the kurtosis smaller.
Table 8. Demonstration of Effects of Outliers on Kurtosis and Skewness
Note. Mean = mean shift; SD = standard deviation shift; Pc = proportion of contamination.
The lower part of Table 8 shows that, as expected, skewness was inflated when the mean shifted to 1.5 and 3 and the SD shifted to 1.5 and 3. The largest skewness was 1.98, for a mean shift of 3 and an SD shift of 3 with a .08 proportion of contamination. However, the inflation of skewness was not as large as the inflation of kurtosis. Hence, the inflation in kurtosis likely drove the inflation of the Type I error rate of the chi-square (ML) sequential tests in our simulation, in which SD shift was an influential factor. This might explain why SD shift played an important role in the inflation of the number of factors in sequential chi-square (ML) tests.
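The non-monotonic effect of the proportion of contamination on kurtosis can be checked with the closed-form moments of a two-component normal mixture. The sketch below is illustrative only. It assumes a single standardized variable drawn from (1 - p) N(0, 1) + p N(mean shift, SD shift squared) and reports excess kurtosis (kurtosis minus 3), which is an assumption about the convention used in Table 8. The function name is hypothetical.

```python
import math

def contaminated_normal_shape(p, mean_shift, sd_shift):
    """Skewness and excess kurtosis of (1 - p) * N(0, 1) + p * N(mean_shift, sd_shift**2),
    computed from the exact central moments of the mixture."""
    weights = (1.0 - p, p)
    means = (0.0, mean_shift)
    sds = (1.0, sd_shift)
    mu = sum(w * m for w, m in zip(weights, means))   # mixture mean
    m2 = m3 = m4 = 0.0
    for w, m, s in zip(weights, means, sds):
        d = m - mu                                    # component mean minus mixture mean
        m2 += w * (d**2 + s**2)                       # 2nd central moment (variance)
        m3 += w * (d**3 + 3 * d * s**2)               # 3rd central moment
        m4 += w * (d**4 + 6 * d**2 * s**2 + 3 * s**4) # 4th central moment
    skewness = m3 / math.pow(m2, 1.5)
    excess_kurtosis = m4 / m2**2 - 3.0
    return skewness, excess_kurtosis

# Mean shift = 3 and SD shift = 3 at two proportions of contamination.
skew_08, kurt_08 = contaminated_normal_shape(0.08, 3.0, 3.0)
skew_15, kurt_15 = contaminated_normal_shape(0.15, 3.0, 3.0)
print(skew_08, kurt_08, kurt_15)
```

Under these assumptions the population values come out at roughly 2.0 for skewness and 8.3 for excess kurtosis at a proportion of .08, and roughly 5.8 for excess kurtosis at .15. This is the same ordering as the simulated values in Table 8, although the simulated figures need not match these population values exactly.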
General Discussion
Common practices for dealing with outliers in data analysis are either to (a) remove or correct them if they are errors or (b) use a robust estimator if one is uncertain about the source of the outliers or uncertain whether outliers are present (e.g., outliers in high-dimensional data are very difficult to detect). However, it should be noted that not all outliers are typographical or data entry (or recording) errors. If outliers arise from subpopulations that were unintentionally and unknowingly included alongside the target population, they are not errors of the first kind described in Liu and Zumbo (2007) but rather, in that sense, legitimate observations from a subpopulation different from the target population of the study. The example we provided earlier of the study of life satisfaction in Denmark demonstrates the subtle issues of unintentionally and unknowingly invoking an assumption of measurement universality with heterogeneous populations involved in a multicultural and globalized assessment environment (Hambleton et al., 2005).
The purpose of the present research was to investigate the effects of outliers, arising from an unintentionally and unknowingly recruited subpopulation, on decisions about the number of factors to retain in an EFA using four decision methods separately. Four important findings are summarized as follows. First, the effects of outliers did not depend on the sample size. This is an important finding because many practitioners believe that having a larger sample size makes them immune to the effects of outliers, which has been shown herein (and elsewhere) to not be the case. Second, the performance of the three PCA-based methods (K-G, MAP, and PA) was not affected by symmetric contamination, but that of sequential
The present study, along with the earlier study by Liu et al. (in press), provides a broad picture of the effects of outliers on decisions about the number of factors to retain in an EFA study. When reading the extant literature or conducting an EFA study, readers should be aware that outliers in the item response distributions are likely to have a substantial impact on the conclusions, either inflating or deflating the number of factors retained depending on the decision methods used, the outlier sources (Liu & Zumbo, 2007), and the manifestation of outliers (e.g., asymmetric or symmetric outliers) in the sample. The take-home message in this line of research, however, is still the same: researchers are strongly encouraged to check for outliers and use robust methods in their day-to-day research practice (Huber, 1981; Wilcox, 2010, in press), because not doing so may lead to misleading empirical conclusions.
The present research has high fidelity with real data situations, which provides useful information for applied researchers. However, using real data as the basis for simulation also brings limitations. Because we mimicked only one real data situation, we did not vary the number of variables or the number of factors, nor did we manipulate different levels of factor loadings and factor correlations. We encourage future research to investigate these variables when examining the effects of outliers on the decision about the number of factors.
Footnotes
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) disclosed receipt of the following support for the research, authorship, and/or publication of this article: Bruno Zumbo wishes to acknowledge support from the Social Sciences and Humanities Research Council of Canada (SSHRC) and the Canadian Institutes of Health Research (CIHR) during the preparation of this work.
