Abstract
Muthén and Asparouhov introduced an approach for conducting Bayesian inference in the context of structural equation models that they termed Bayesian structural equation modeling (BSEM). In this article, we provide an overview of the BSEM technique, illustrate how this technique relates to confirmatory and exploratory factor analysis, and highlight several key problems with using the BSEM approach as it is currently advocated. Utilizing data from a large-scale study of entrepreneurial self-efficacy, we develop a modified approach for applying the BSEM technique in a manner that is more consistent with accepted principles of reflective measurement, factor analysis, and model selection. We devise a series of recommendations to guide future use of the BSEM technique to help ensure that mainstream use of this approach heralds the coming of a new day in measurement development rather than a false dawn.
Introduction
The development and dissemination of new statistical techniques often create challenges. New techniques sow promising seeds through their potential to (1) test hypotheses in ways more consistent with theory, (2) alleviate the need to make unrealistic modeling assumptions, and (3) subject previously untestable hypotheses to empirical scrutiny. On the other hand, the introduction of new techniques may reap a harvest of ambiguous findings if inappropriately utilized (MacCallum, Edwards, & Cai, 2012). These ambiguous findings can arise because applied scholars may fail to realize that supporters of new methods often base their advocacy on justifications that have not been thoroughly vetted (Rönkkö & Evermann, 2013).
One new statistical technique garnering attention is Muthén and Asparouhov’s (2012a: 316) approach to applying Bayesian inference in the context of structural equation models, which they term Bayesian structural equation modeling (BSEM). The BSEM technique leverages the ability of Bayesian estimation, conducted using the Gibbs sampler and Markov chain Monte Carlo (MCMC) resampling, to estimate models that would not be identified using frequentist estimators. At first glance, the BSEM technique appears to represent a methodological leap forward that would allow management scholars a more flexible approach for estimating the measurement models of our increasingly complex constructs and leveraging advantages of Bayesian inference (Kruschke, Aguinis, & Joo, 2012; Zyphur & Oswald, 2015). However, statisticians have expressed apprehension about the BSEM technique because of this approach’s ability to estimate far more parameters than possible when conducting exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) with frequentist estimators, which may result in models with limited theoretical meaning (Rindskopf, 2012) and generalizability (MacCallum et al., 2012). As such, there exists the concern that the uncritical use of the BSEM approach could, in fact, hamper scientific progress in the management discipline. More specifically, uncritical use of the BSEM technique could (1) allow the estimation of overly complex models that have little likelihood of performing well in both future samples from the same population (i.e., cross-validation; Browne, 2000) and different populations (i.e., generalized validation; Busemeyer & Wang, 2000); (2) enable the estimation of models that will fit any data structure, thus making the model not falsifiable and consequently useless (MacCallum, 2003; MacCallum, Roznowski, Mar, & Reith, 1994); and (3) provide a means to mask poor construct measurement with statistical sophistication.
The purpose of this article is to provide the management community with an overview of Muthén and Asparouhov’s (2012a) BSEM technique and, through a synthesis of the literature on factor analysis and model selection, develop a series of recommendations to guide future use of this approach. These recommendations address not only the concerns raised by statisticians about the BSEM technique but also additional issues with the BSEM technique identified in this essay. To achieve this purpose, we first provide an overview of the BSEM technique and compare it with more widely known factor analysis approaches. We use this comparison as a springboard to discuss the concerns articulated about the BSEM approach and demonstrate that advocates have utilized questionable logic to support the BSEM technique. We then illustrate a refined application of BSEM using data from a large-scale study of entrepreneurial self-efficacy (ESE) to demonstrate how this approach enables the identification of a measurement structure different from the one originally theorized. Unlike the examples to date of the BSEM technique, we utilize the technique’s flexibility as a diagnostic tool instead of a means to obtain the final measurement model simple structure. This modified approach, which is more consistent with Muthén and Asparouhov’s (2013) and van de Schoot, Kluytmans, Tummers, Lugtig, Hox, and Muthén’s (2013) recent work on measurement invariance testing, further incorporates principles from the model selection literature (Burnham & Anderson, 2004; Zucchini, 2000) to better enable scholars to evaluate the degree to which empirical data are consistent with each model. We further demonstrate how the estimation of all of the unique factor covariances, as illustrated in Muthén and Asparouhov (2012a), appears to ensure perfect model fit no matter the degree of model misspecification, calling into question the usefulness of this approach.
Lastly, we provide a series of recommendations for management scholars wishing to utilize the BSEM approach to help ensure that mainstream use of this technique heralds the coming of a new day in measurement development rather than a false dawn.
The BSEM Technique: Comparison With Other Approaches for Estimating Reflective Measurement Models
In the following discussion, we will assume that the researcher has a battery of p reflective measures designed to operationalize m common factors, each representing one dimension of a multidimensional reflective construct (Bollen, 1989). Assuming that the common factor model fits perfectly in the population, the population covariance matrix can be expressed as:
In Equation 1,
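As a sketch in conventional notation, the common factor model for p measures and m common factors is usually written as shown below. The symbols are an assumption that follows standard usage (e.g., Bollen, 1989) and may differ from the authors' own notation.

```latex
\Sigma = \Lambda \Phi \Lambda^{\top} + \Theta
% \Sigma  : p x p population covariance matrix of the observed measures
% \Lambda : p x m matrix of factor loadings
% \Phi    : m x m covariance matrix of the common factors
% \Theta  : p x p covariance matrix of the unique factors (diagonal in conventional CFA and EFA)
```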
To give management scholars an understanding of how the BSEM technique compares to CFA and to EFA with target (Browne, 2001; Myers, Ahn, & Jin, 2013) and analytical (Sass & Schmitt, 2010) rotation, we compare these approaches with regard to the estimation of factor loadings (see Table 1).
Table 1. Comparison of Muthén and Asparouhov’s Bayesian Structural Equation Modeling (BSEM) Technique Vis-à-Vis Other Factor Analysis Models With Regards to the Estimation of Factor Loadings
Several aspects of Table 1 are important. First, we wish to emphasize that Muthén and Asparouhov’s (2012a) BSEM technique is distinct from Bayesian estimation using the Gibbs sampler in conjunction with the MCMC algorithm, given that Bayesian estimation (and, consequently, Bayesian inference) is equally possible when estimating CFA and EFA models. Rather, the BSEM technique refers to the systematic approach outlined in their article whereby parameters theorized to be large (important) are given diffuse priors, whereas parameters theorized to be small (unimportant), such as cross-loadings, are given informative priors with a mean of zero and a small variance. Second, Table 1 demonstrates that the BSEM approach has weaker modeling assumptions regarding theorized unimportant cross-loadings than CFA. The modeling assumption that unimportant cross-loadings are zero in CFA is stronger than the modeling assumption in the BSEM approach because fixing a parameter to a specified value (e.g., zero) represents the application of very strong prior information which, in effect, prevents the data from updating previously held knowledge about that parameter (MacCallum et al., 2012). Third, the BSEM technique makes stronger modeling assumptions about the strength of unimportant factor loadings than EFA with target rotation. BSEM requires scholars to specify a prior distribution for unimportant loadings, which is not needed to conduct target rotation. Thus, management scholars can think of the BSEM approach as lying on a continuum between CFA and EFA with target rotation (Muthén & Asparouhov).
Where the BSEM approach significantly differs from factor analysis models estimated using frequentist estimators is that all unique factor covariances can be estimated. Following Muthén and Asparouhov (2012a), this is mathematically represented in Equation 2:
In Equation 2,
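As a hedged illustration of what freeing the unique factor covariances implies, the sketch below uses conventional notation (an assumption, not necessarily the authors' Equation 2) and counts parameters for a 19-indicator instrument such as the ESE scale analyzed later.

```latex
\Sigma = \Lambda \Phi \Lambda^{\top} + \Theta,
\qquad \Theta = [\theta_{jk}], \quad \theta_{jk} = \theta_{kj} \ \text{estimated for all } j \neq k
% Unique elements of \Sigma with p = 19:        p(p+1)/2 = 190
% Unique factor variances (diagonal of \Theta):  p        = 19
% Unique factor covariances (off-diagonal):      p(p-1)/2 = 171
% \Theta alone therefore carries 190 parameters, as many as \Sigma has unique elements,
% which is why such a model is not identified under frequentist estimation and
% relies on informative priors in the Bayesian case.
```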
Some Concerns About the BSEM Technique
This section reviews several concerns about the BSEM technique as currently advocated by Muthén and Asparouhov (2012a), Fong and Ho (2013), and Golay, Reverte, Rossier, Favez, and Lecerf (2013). We begin by reviewing the arguments put forth to justify the BSEM technique over CFA and highlight weaknesses with these justifications. Next, we summarize the concerns raised by others about the BSEM technique with regard to specification of informative priors, theoretical ambiguity when estimating unique factor covariances, and limited model generalizability. For parsimony, we focus on theoretical aspects of model development and eschew discussions about the intricacies of Bayesian estimation and instead refer the reader to the dialogue between MacCallum et al. (2012) and Muthén and Asparouhov (2012b).
Reasons Advocated for Using the BSEM Technique and Their Logical Flaws
Muthén and Asparouhov (2012a), Fong and Ho (2013), and Golay et al. (2013) argue that the BSEM technique is preferable to CFA because it allows scholars to relax the “unnecessarily strict” assumption made in CFA that theorized small cross-loadings are fixed to zero. They put forward three arguments for why fixing hypothesized small cross-loadings to zero is overly strict. The first argument is that measurement instruments are often not precise enough to warrant fixing cross-loadings to zero (Muthén & Asparouhov). The second argument is that small loadings may be theoretically permissible, with Golay et al. stating, “With typical ML-[maximum likelihood-]CFA, many cross-loadings between latent variables and measures are fixed to be exactly zero, which does not always faithfully reflect the researchers’ hypotheses. . . . Indeed, many small but nonzero cross-loadings could be equally compatible with theory” (2013: 498). The third argument is that common factor covariances will be inflated if these small cross-loadings are not estimated (Fong & Ho).
There are logical flaws with all three arguments. Consider the first argument that measurement theory may not be precise enough to warrant fixing cross-loadings to zero. The first flaw with this argument is that it fails to recognize that these cross-loadings may be the result of poor instrument development (e.g., use of double-barreled indicators); thus, estimation of cross-loadings allows researchers to utilize statistical sophistication to “dust under the rug” the fact that the measurement instrument is poorly constructed. The second flaw with this argument is that to utilize this justification, one must be willing to assume that cross-loadings are not of theoretical importance (i.e., they arise because of imprecise language or other sources not connected to the substantive constructs under examination), which then requires one to ask, “Why should we estimate these cross-loadings if indeed they are noise?” The clear answer is that modeling noise is inappropriate, as it renders meaningless measures of model fit and makes model comparison a spurious activity. The third flaw with this argument is that it is unclear why scholars would develop measurement instruments that are so imprecise that substantial cross-loadings with little theoretical justification occur. In other words, the argument that measurement imprecision automatically permits the estimation of cross-loadings is, in our opinion, highly problematic in that it could serve as an “escape hatch” for management scholars to justify poor measurement instrument design and, consequently, inhibit the development of improved instruments.
Consider next the argument that small cross-loadings should be estimated because they may be consistent with a researcher’s measurement theory. The logical flaw with this argument is that it is inconsistent with the purpose of factor analysis, which is to identify an easily interpretable measurement model simple structure (Thurstone, 1947), which inherently means that scholars should be concerned with identifying large loadings that are practically important (Cudeck & O’Dell, 1994). Simply stated, it makes little sense that scholars would want to purposefully design a measurement instrument such that some items exhibit small (i.e., unimportant) loadings. Rather, indicators should be selected so as to fully capture (tap) the sampling domains of the constructs under investigation (Little, 2013; Little, Lindenberger, & Nesselroade, 1999). It is of note that in both of these arguments, Muthén and Asparouhov (2012a), Fong and Ho (2013), and Golay et al. (2013) have provided blanket statements as justification for why such small loadings might be a part of a researcher’s theory without actually ever explaining what such theory might encompass.
Addressing the third argument is somewhat more complex. While we agree that high multicollinearity between common factors can threaten discriminant validity, our concern is that for researchers to utilize this rationale, they must implicitly assume that not estimating these parameters results in an incorrectly specified model. However, if the estimated cross-loadings do not differ from zero and/or are small enough to not be practically meaningful, then it is hard not to argue that noise has been introduced that artificially reduces the common factor covariances. The reader should keep in mind that when noise is added to two variables, the magnitude of the correlation (i.e., the standardized covariance) between these variables is attenuated (Spearman, 1904). Thus, this rationale is highly problematic because it justifies the use of statistical sophistication to cover up the more serious problem: low discriminant validity of operationalized constructs.
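The attenuation result invoked here (Spearman, 1904) can be checked with a small calculation. The sketch below is illustrative only: the function name and the numbers are hypothetical, and it assumes two unit-variance variables to which independent noise is added.

```python
import math

def observed_correlation(true_corr, noise_var_x, noise_var_y):
    """Correlation of two unit-variance variables after independent noise is added to each."""
    # Independent noise leaves the covariance (equal to true_corr here) unchanged
    # but inflates both variances, so the correlation shrinks toward zero.
    return true_corr / math.sqrt((1.0 + noise_var_x) * (1.0 + noise_var_y))

# Hypothetical true correlation of .60 under increasing amounts of noise
for noise in (0.0, 0.25, 1.0):
    print(noise, round(observed_correlation(0.6, noise, noise), 3))
```

The more noise each variable carries, the smaller the observed correlation becomes, even though the underlying relationship is unchanged.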
In addition, two more problems exist with the third justification for using the BSEM technique. First, common factor multicollinearity can result from using observed measures with low reliability (i.e., low communality), with Grewal, Cote, and Baumgartner noting, “Probably the most important safeguard against the damaging effects of multicollinearity is to make sure that all constructs are measured as reliably as possible” (2004: 527). Second, high correlations between common factors may suggest that a bifactor model (Holzinger & Swineford, 1937; Reise, 2012) is more appropriate than a correlated factors model. Thus, researchers should not automatically conclude that common factor multicollinearity is the result of CFA models requiring unnecessarily strict assumptions but may instead indicate problems with the observed measures or that a different measurement model may be more appropriate for the data.
Justification of Informative Prior Distributions
MacCallum et al. (2012) and Yuan and MacKinnon (2009) point out that the specification of informative priors is one of the most important steps when conducting Bayesian inference. In frequentist inference, the issue of specifying priors on parameters is greatly simplified because scholars have two choices: (1) freely estimate parameters, which is conceptually equivalent to assigning diffuse priors in a Bayesian estimation; or (2) fix parameters at a given value (most often zero), which is conceptually equivalent to assigning infinitely dense degenerate priors (MacCallum et al.). In contrast, when conducting Bayesian inference, scholars have the third option of specifying informative priors for various model parameters. Incorporating prior information can greatly improve the precision of parameter estimation, which is particularly pronounced in settings where scholars have a strong understanding of systematic noise arising from measurement (Gregory, 2005). However, the misspecification or misuse of priors can result in severely compromised models in that the data is not allowed to “speak for itself” because of the strength of the priors (Muthén & Asparouhov, 2012b; Yuan & MacKinnon).
As the issue of specifying prior distributions extends well beyond the BSEM technique, and indeed is a core issue with Bayesian inference in general, we refer readers to more general treatments in Gregory (2005), Kruschke et al. (2012), Yuan and MacKinnon (2009), and Zyphur and Oswald (2015). It should be noted that an argument in favor of Bayesian methods is that they allow researchers to make use of existing states of knowledge before incorporation of new data. This knowledge is expressed in the form of a “prior” distribution, which is then combined with the likelihood of the observed data to form a posterior distribution. The posterior distribution represents a revised or updated belief after taking new data into account. If there is a lack of existing knowledge on the subject of interest, the researcher can still rely on Bayesian techniques. In this case, vague or uninformative priors are used to represent that only weak prior knowledge is available.
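To make the updating logic concrete, the sketch below works through the simplest textbook case: a normal prior combined with a normally distributed estimate of known variance. It is not the estimation routine used by any SEM program, and the function name and numbers are hypothetical.

```python
def normal_posterior(prior_mean, prior_var, estimate, estimate_var):
    """Posterior mean and variance for a normal prior and a normal likelihood with known variance."""
    prior_precision = 1.0 / prior_var
    data_precision = 1.0 / estimate_var
    post_var = 1.0 / (prior_precision + data_precision)
    # The posterior mean is a precision-weighted average of prior mean and data estimate.
    post_mean = post_var * (prior_precision * prior_mean + data_precision * estimate)
    return post_mean, post_var

# Hypothetical case: the data alone suggest a loading of 0.30 (sampling variance 0.01).
informative = normal_posterior(0.0, 0.01, 0.30, 0.01)  # small-variance prior centered on zero
diffuse = normal_posterior(0.0, 1e6, 0.30, 0.01)       # nearly flat prior
print(informative, diffuse)
```

With the small-variance prior the estimate is pulled halfway back toward zero, whereas the diffuse prior leaves it essentially where the data put it. This is the sense in which a strong prior can keep the data from speaking.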
Estimation of Unique Factor Covariances
One of the principal concerns with the BSEM technique is that the estimation of the covariances between the unique factors results in solutions with limited theoretical interpretability. Rindskopf’s statement is apropos in the context of Muthén and Asparouhov’s (2012a) analysis of a male and female sample of the Big Five where there were 17 and 37 significant unique factor covariances respectively:
If I were a personality researcher, I do not think I would be happy with a “Big 5 plus moderate to small 27, give or take 10” theory. If there are supposed to be five factors, then the number of failures to fit the model (by adding extra correlations) should be small, and either their numerical value should be small or there should be a theoretical explanation for why these residuals are correlated. Of course, this theoretical explanation would be post hoc (if it were known ahead of time, the model would have included the expected parameters). In this case, the theoretical explanation would be tentative and would need to be corroborated on a different data set. (2012: 338)
Building on Rindskopf’s (2012) arguments, a second reason for apprehension with the estimation of all unique factor covariances is that this procedure will result in outstanding model fit for any specified model. For example, when Muthén and Asparouhov (2012a) estimate all unique factor covariances, the hypothesized structure of the Big Five fits perfectly based on the posterior predictive checking (PPC) criterion. Furthermore, all hypothesized large factor loadings specified with diffuse priors exhibited large loadings, whereas hypothesized small loadings specified with zero mean and small-variance informative priors were near zero. Similarly, when Fong and Ho (2013) estimate all unique factor covariances, all three of their different CFA models for the Hospital Anxiety and Depression Scale fit perfectly based on the PPC criterion. Thus, the concern is that estimating all unique factor covariances may result in perfect model fit even in the presence of fundamental model misspecifications. MacCallum captures this problem of models having the ability to fit any data structure (i.e., high flexibility) by stating:
In practice, if a highly flexible model fits observed data well, support is still weak, because the model would fit a wide range of data well. On the other hand, if an inflexible model were found to fit well in an empirical study, support for that model would be stronger. If two models were found to fit equally well, we should prefer the one that is less complex or flexible . . . the evaluation of a given model should take into account the capacity of the model to fit a wide array of data. And models should be devalued to the extent that they are able to achieve good fit to nearly any data. (2003: 133)
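The flexibility concern can be illustrated with a deliberately extreme sketch. It assumes a fully unconstrained unique factor covariance matrix (in BSEM these elements carry informative priors, so the effect is weaker in practice), and the data and loadings below are hypothetical.

```python
def implied_covariance(loadings, factor_var, theta):
    """Model-implied covariance of a one-factor model: lambda * phi * lambda' + Theta."""
    p = len(loadings)
    return [[loadings[i] * factor_var * loadings[j] + theta[i][j] for j in range(p)]
            for i in range(p)]

# Hypothetical sample correlations for four indicators that form two clear clusters,
# so a single-factor model is misspecified.
S = [[1.0, 0.7, 0.1, 0.1],
     [0.7, 1.0, 0.1, 0.1],
     [0.1, 0.1, 1.0, 0.7],
     [0.1, 0.1, 0.7, 1.0]]
loadings = [0.5, 0.5, 0.5, 0.5]  # deliberately wrong one-factor structure
p = len(S)

# With every element of Theta free, Theta can absorb whatever the factor misses.
theta = [[S[i][j] - loadings[i] * loadings[j] for j in range(p)] for i in range(p)]
fitted = implied_covariance(loadings, 1.0, theta)
max_residual = max(abs(S[i][j] - fitted[i][j]) for i in range(p) for j in range(p))
print(max_residual)
```

The largest residual is zero up to floating-point error even though the factor structure is wrong, which is the sense in which good fit from a highly flexible model offers only weak support.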
Loss of Generalizability Due to Complexity of Model
Another concern expressed about the BSEM technique is that it results in the estimation of a far greater number of parameters than if the same measurement model was estimated using CFA. While estimating more parameters is likely to improve model fit, MacCallum et al. express reservation by stating:
As attractive as this improved fit might be, it comes with a price. It is achieved through the relaxing of restrictions on parameters, much as fit would be improved in a conventional ML approach to SEM [structural equation modeling] by converting some fixed parameters into free parameters. The estimation of additional parameters brings with it an increase in estimation error in the sense that parameters are estimated with less precision as their number increase. . . . In addition, such an increase in estimation error raises in turn the possibility that stability and generalizability could be reduced. It is well recognized in conventional SEM that the cross-validity of solutions is reduced as the number of free parameters increases. . . . We anticipate that the same thing would occur in the BSEM approach, with degree and consequences remaining an open question for now. (2012: 344)
In other words, the fact that the BSEM technique results in the estimation of many more parameters raises concerns that model fit is achieved at the expense of the model capturing random sampling noise (Zucchini, 2000). Thus, improved fit is achieved at the cost of modeling idiosyncratic characteristics of the sample that are unlikely to generalize in subsequent samples. This concern is summarized well by Myung:
A complex model with many parameters, because of its extra flexibility, tends to capture these spurious patterns [caused by sampling error] more easily than a simple model with few parameters. Consequently, the complex model yields a better fit to the data, not because of its ability to more accurately approximate the underlying process but rather because of its ability to capitalize on sampling error. Therefore, choosing a model based solely on its fit, without appropriately filtering out the effects due to sampling error, will result in choosing an overly complex model that generalizes poorly to other data from the same underlying process. (2000: 202)
It is important to note at this juncture that our arguments are not to be interpreted as an outright condemnation of the BSEM approach. Indeed, expressing these considerations is critical for the BSEM technique to be best applied by management scholars. To illustrate the flexibility of the BSEM approach and to address some of the aforementioned concerns, in the next section, we illustrate several analyses using the BSEM approach. In presenting these examples, we aim to show how the BSEM technique can be leveraged to the advantage of the applied scholar, while simultaneously addressing these more fundamental modeling issues. This examination provides a basis for building recommendations to guide the proper and effective use of the BSEM technique.
Method
Research Setting
As part of a larger scale study on the antecedents and consequences of entrepreneurship, a questionnaire-based survey was administered to the alumni of a large U.S. university in the summer of 2011. Amongst the measures collected was a multidimensional ESE scale developed by McGee, Peterson, Mueller, and Sequeira (2009). The concept of self-efficacy has played a central role in theories of social learning and social cognition (Wood & Bandura, 1989). Self-efficacy can be adequately summarized as one’s belief in one’s ability to accomplish tasks within a domain. The expectations and motivation that arise from individuals’ self-efficacy have an influence on those individuals’ coping behaviors, expended effort, adversity tolerance, goal setting, and choice of actions (Bandura, 1977; Gist, 1987). When self-efficacy is used to appraise individuals’ belief in their personal capabilities related to the formation of a new venture, it is further delineated as ESE (Boyd & Vozikis, 1994). This specification of self-efficacy is based on the assumption that the entrepreneurial process involves a range of interrelated tasks that are unique to such a degree that they cannot be readily captured in a general measure of self-efficacy (Chen, Greene, & Crick, 1998).
An ESE scale developed by McGee et al. (2009) specified a five-factor CFA solution consisting of 19 indicators and comprising the dimensions of search, plan, marshal, implement finance, and implement people. This scale conceptualizes the process of entrepreneurship as a multistaged life cycle. Stevenson, Roberts, and Grousbeck (1985) proposed a process model that separates new venture creation into multiple phases: evaluating the opportunity, developing the business concept, acquiring needed resources, and managing the venture. During the searching phase (evaluating the opportunity, in the terms of Stevenson et al.), the entrepreneur develops a novel idea or identifies a market opportunity. As part of this process, entrepreneurs rely on their creativity and innovativeness to explore many alternatives. The planning phase (developing the business concept and assessing required resources) is focused on formalizing the entrepreneurial concept into an implementable plan that fits within the entrepreneur’s abilities and goals. During the marshaling phase (acquiring needed resources), the entrepreneur acts to gain control over the resources needed to implement the business. The implementing stage (managing and harvesting the venture) is focused on managing the venture and assuring its successful growth past incubation. In the McGee et al. scale, the implementing stage has been conceptualized as involving both an aspect of managing people (implementing people) and an aspect of managing the finances of the business (implementing finance).
The McGee et al. (2009) CFA model showed acceptable fit in their original article, with a comparative fit index (CFI) of .96, a Tucker-Lewis index of .95, a root mean square error of approximation (RMSEA) of .06, and strong factor loadings (range = 0.70–0.92). However, as we illustrate in the subsequent analyses, this model did not display acceptable fit in our study. This provided an opportunity to illustrate how the flexibility of the BSEM technique could be leveraged to develop a different measurement model structure that could be subsequently cross-validated.
Data and Sample
Approximately 70,000 potential respondents were queried for participation via e-mail, with 7,891 participants completing the survey instruments (a response rate of ~11.3%). As this study is primarily expository, we dropped any respondent who did not fully complete the ESE scale rather than rely on an imputation strategy, because imputation in a Bayesian context is a distinct area of inquiry that this article does not attempt to address (see Rubin, 1996). This listwise deletion strategy reduced the sample to 6,306 respondents. To focus the measurement model on a consistent population, we removed individuals who reported that they were retired, unable to work, or already an entrepreneur or self-employed. This resulted in a final sample size of 4,041 respondents. For analysis purposes, two random samples of 500 participants were drawn from this final sample in order to create sample sizes more in line with typical studies applying SEM. One sample was used for the refinement of the measurement structure and the second was used for purposes of cross-validation. It should be noted that the results presented here are relatively consistent across sample sizes ranging from 200 to 500.
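One way to draw the two samples is sketched below. It assumes the samples are non-overlapping, which suits cross-validation but is not stated explicitly above; the function name, the placeholder respondent ids, and the seed are all hypothetical.

```python
import random

def draw_two_samples(ids, size, seed):
    """Draw two non-overlapping simple random samples of `size` ids each."""
    rng = random.Random(seed)          # fixed seed so the split can be reproduced
    drawn = rng.sample(ids, 2 * size)  # sampling without replacement
    return drawn[:size], drawn[size:]

respondent_ids = list(range(4041))     # placeholder ids for the final sample of 4,041
calibration, validation = draw_two_samples(respondent_ids, 500, seed=2011)
print(len(calibration), len(validation))
```

Drawing both samples in a single pass without replacement guarantees that no respondent appears in both the refinement and the cross-validation sample.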
CFA
The first step in the analysis was to fit a CFA model to determine if the originally hypothesized measurement model displayed acceptable fit and discriminant validity. The raw data were used as input into Mplus Version 7 (Muthén & Muthén, 1998-2012). To provide comparability to the majority of published studies, we used the default ML estimator. The standardized factor loadings and common factor correlations for this model are displayed in Table 2.
Table 2. Confirmatory Factor Analysis (CFA) Measurement Model Fitted Using the Maximum Likelihood (ML) Estimator
Note: Asterisks denote significance at α = .05.
Fit statistics: χ² = 845.0, df = 142, root mean square error of approximation = .100, comparative fit index = .900, standardized root mean square residual = .075.
Table 2 indicates that while all items have high standardized loadings on their respective constructs, overall model fit is highly suspect. Specifically, the CFI (.900) is well below the .95 recommendation (Hu & Bentler, 1999), the RMSEA (.100) point estimate indicates unacceptable fit on the basis of Browne and Cudeck’s (1992) guidelines, and the standardized root mean square residual (SRMR) is relatively large at .075. Given the poor model fit, this provided the opportunity to explore the use of the BSEM technique to reconcile the failure of the posited CFA model.
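The reported degrees of freedom and RMSEA can be checked by hand. The sketch below assumes a simple-structure CFA with correlated factors and the common RMSEA formula that divides by N − 1 (some programs divide by N); the function names are illustrative.

```python
import math

def cfa_degrees_of_freedom(n_indicators, n_factors):
    """df for a simple-structure CFA with freely correlated factors (covariance structure only)."""
    moments = n_indicators * (n_indicators + 1) // 2
    # One loading and one unique variance per indicator, plus the factor covariances;
    # factor variances are assumed fixed to 1 for identification.
    free = 2 * n_indicators + n_factors * (n_factors - 1) // 2
    return moments - free

def rmsea(chi_square, df, n):
    """Point estimate of RMSEA using the N - 1 convention."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n - 1)))

print(cfa_degrees_of_freedom(19, 5))     # 142
print(round(rmsea(845.0, 142, 500), 3))  # 0.1
```

Both values agree with the fit statistics reported for Table 2 for 19 indicators, 5 factors, and a sample of 500.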
One might wonder at this juncture why one would switch to the BSEM approach rather than rely on the oft-used strategy of examining and updating a model based on modification indices (MIs) provided in most SEM programs. While the use of MIs is popular, they suffer from two important shortcomings. First, a step-by-step series of changes in response to large MIs has the tendency to capitalize on unique aspects of the current sample. More specifically, parameters freed in this manner have a low likelihood of cross-validating (MacCallum, Roznowski, & Necowitz, 1992), particularly in small samples (MacCallum, 1986). Second, MIs are an estimate of the amount the model’s chi-square statistic will change under the assumption that the estimates of the free parameters remain unchanged. However, if other misspecifications exist, freeing the parameter can result in changes to the remaining free parameters, which can result in (1) the newly freed parameter having a small value and/or (2) a smaller-than-anticipated change in the chi-square statistic (Steiger, 1990). In contrast, with the BSEM approach, the potential model changes that we observe are estimated jointly in a single analysis (Muthén & Asparouhov, 2012a).
Bayesian Analysis—Cross-Loadings
Given the overlap of the theoretical definitions of the facets of ESE, all posited zero loadings were given an informative prior with a mean of 0 and a standard deviation of 0.141. Since the setting of the prior is sensitive to the scale of the model, all observed variables were standardized prior to the analysis. The hypothesized large loadings were given Mplus’ default diffuse normal prior. Estimation was completed using the PX1 Gibbs sampler. Convergence was evaluated through examination of the potential scale reduction factor, trace plots of parameters, and autocorrelation plots of parameters. The median values for the estimated parameters are reported in Table 3. We further report the values for three model selection criteria: the Bayesian information criterion (BIC; Schwarz, 1978), the sample-size adjusted Bayesian information criterion (BICSSA; Nylund, Asparouhov, & Muthén, 2007; Sclove, 1987), and the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, & van der Linde, 2002).
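To see what a standard deviation of 0.141 implies for a standardized cross-loading, the sketch below computes the prior variance, the central 95% range, and the prior probability of a cross-loading larger than .30 in absolute value. The .30 cutoff is an illustrative assumption, not a threshold taken from the analysis.

```python
import math

prior_sd = 0.141
prior_var = prior_sd ** 2     # about 0.02
half_width = 1.96 * prior_sd  # half-width of the central 95% prior range, about 0.28

# Prior probability that a cross-loading exceeds .30 in absolute value
# under a normal prior with mean 0 and the standard deviation above.
z = 0.30 / prior_sd
tail_prob = math.erfc(z / math.sqrt(2.0))

print(round(prior_var, 4), round(half_width, 3), round(tail_prob, 3))
```

Under this prior, roughly 95% of the prior mass lies between about −.28 and .28, so sizable cross-loadings are treated as unlikely but are not ruled out.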
Table 3. Bayesian Model With Small-Variance Informative Priors Specified for Cross-Loadings
Note: Factor loadings in bold were freely estimated using diffuse priors. Daggers indicate 95% credibility interval does not contain zero.
Free parameters = 143; estimated free parameters = 112.2; posterior predictive checking = [231.2, 339.1]; Bayesian information criterion = 21,009; sample-size adjusted Bayesian information criterion = 20,555; deviance information criterion = 20,344; pseudo standardized root mean square residual = .028.
The PPC confidence interval for this model is 231.2 to 339.1, and the model’s BIC, BICSSA, and DIC are 21,009, 20,555, and 20,344, respectively. Though the PPC confidence interval indicates misfit, as argued by Gelman (2003) and Levy (2011), PPC can be viewed more as a diagnostic tool than as a measure of model fit per se. Under this philosophy, the PPC for the model where cross-loadings were estimated fits better than the model (results not reported) where all theorized cross-loadings were constrained to zero using degenerate priors (a model equivalent to the originally specified CFA model), resulting in a PPC confidence interval of 654.5 to 750.8. This degenerate priors model had a BIC of 20,998, a BICSSA of 20,786, and a DIC of 20,717. Comparing model fit statistics, BICSSA and DIC clearly favor the cross-loadings model, while the BIC’s more stringent penalty for increased model parameters leads it to favor the simpler model.
Owing to the complexity of this measurement model, it is not surprising that the proposed solution fails to totally replicate the data, even when allowing for cross-loadings, given that all models are to some degree incorrect (MacCallum, 2003). While fit statistics such as CFI and RMSEA are unavailable in a Bayesian framework (Levy, 2011), it is possible to calculate a proxy for the SRMR. When conducting Bayesian estimation, SRMR is actually a distribution (Levy), but Mplus does not yet have the capability of calculating SRMR during MCMC iterations to develop this distribution. Given this limitation, in this study, a point estimate of SRMR is calculated using the median point estimates of the various posterior parameter distributions, which we term pseudo SRMR (pSRMR). Details about the calculation of pSRMR are provided in Appendix B in the online supplement. Using the pSRMR to evaluate model fit, the model with degenerate priors demonstrates poor fit with a pSRMR of .075 (equivalent to SRMR from the CFA model estimated with ML), whereas for the model with cross-loadings, the pSRMR was .028, indicating this model, as expected, better replicates the observed data variance-covariance matrix.
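As a rough illustration of the kind of quantity a pSRMR represents, the sketch below computes a standardized root mean square residual from a sample covariance matrix and a model-implied covariance matrix. It assumes one common textbook definition of SRMR (residuals standardized by the sample variances and averaged over the lower triangle, diagonal included); the exact calculation in Appendix B may differ, and the matrices shown are invented for illustration only.

```python
import math

def srmr(observed, implied):
    """Standardized root mean square residual over the lower triangle
    (diagonal included) of two covariance matrices given as nested lists."""
    p = len(observed)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            # Standardize each residual by the observed standard deviations.
            scale = math.sqrt(observed[i][i] * observed[j][j])
            total += ((observed[i][j] - implied[i][j]) / scale) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))

# Hypothetical three-indicator example (values are not from the study).
sample_cov = [[1.00, 0.50, 0.40],
              [0.50, 1.00, 0.30],
              [0.40, 0.30, 1.00]]
implied_cov = [[1.00, 0.45, 0.42],
               [0.45, 1.00, 0.33],
               [0.42, 0.33, 1.00]]
print(round(srmr(sample_cov, implied_cov), 3))
```

In the setting described above, the implied matrix would be built from the median posterior estimates rather than typed in by hand.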
Returning to Table 3, there is evidence that there are three cross-loadings (S3 on plan, P1 on search, and P4 on marshal) that are statistically and practically significant (i.e., > 0.30). Examining the wording of these items, it becomes apparent that this might be a model in which items should be expected to show theoretically justifiable cross-loadings. The wording of Item S3, “Understand the process of how to satisfy the needs and wants of customers,” would appear to refer to both the constructs of search (identifying a new opportunity or market) and plan (formalizing the entrepreneurial concept). Likewise, Item P1, “Understand the need and demand for a new product or service,” is clearly related to plan for a new business but may also be considered by some to be a central aspect of identifying an opportunity (search). Lastly, Item P4, “Design an effective marketing/advertising campaign for a new product or service,” requires the ability to both plan and marshal (acquire needed resources).
Because these three cross-loadings have a strong theoretical justification and are practically important, the analysis was repeated with these three cross-loadings freely estimated by using diffuse priors and fixing the remaining cross-loadings to zero with degenerate priors, given that the 95% credibility intervals for these loadings contained zero or lacked practical significance. This newly specified model (see Table 4) provides a good fit to the data, with a pSRMR of .050. While the PPC confidence interval of 365.1 to 458.9 and DIC of 20,430 are inferior to the model with zero mean and small-variance priors specified on all cross-loadings, the BIC of 20,723 and BICSSA of 20,501 represent a marked improvement over the model with all cross-loadings estimated.
Modified Bayesian Model Allowing for S3, P1, and P4 to Load Onto Multiple Factors
Note: Factor loadings in bold were freely estimated using diffuse priors. Daggers indicate 95% credibility interval does not contain zero.
Free parameters = 70; estimated free parameters = 70.8; posterior predictive checking = [365.1, 458.9]; Bayesian information criterion = 20,723; sample-size adjusted Bayesian information criterion = 20,501; deviance information criterion = 20,430; pseudo standardized root mean square residual = .050.
Bayesian Analysis—Covarying Unique Factors
Muthén and Asparouhov’s (2012a) alternative approach to address the misfit of the original model is to estimate the covariances of the unique factors,
Bayesian Model With Informative Priors Specified for Factor Loadings and Correlated Unique Variances
Note: Factor loadings in bold were freely estimated using diffuse priors. Daggers indicate 95% credibility interval does not contain zero.
Free parameters = 314; estimated free parameters = 188.4; posterior predictive checking = [–58.6, 58.1]; Bayesian information criterion = 21,699; sample-size adjusted Bayesian information criterion = 20,703; deviance information criterion = 20,134; pseudo standardized root mean square residual = .018.
This model displays outstanding fit judged by the PPC criteria; the 95% PPC confidence interval contains zero, indicating the replicated data closely match the sample data, which is not surprising, given the enormous increase in the number of parameters. The pSRMR value (.018) further indicates excellent fit. However, the values of the model selection criteria (BIC = 21,699, BICSSA = 20,703, and DIC = 20,134) indicate that the model in Table 5 would not be selected over the model in Table 4 according to the BIC and BICSSA (which were 20,723 and 20,501, respectively). In addition, examination of the factor loadings reveals an important concern: With the unique factor covariances estimated, evidence that S3 and P4 each have two large factor loadings is not apparent, which would lead one to infer a different measurement structure. Furthermore, the outstanding model fit raises the concern highlighted earlier: Can estimation of all unique factor covariances allow researchers to obtain acceptable fit for the model they desire? To check this concern, we estimated the original model where all cross-loadings were fixed at zero using degenerate priors with the addition of allowing the estimation of all unique factor covariances. These results (not reported) lend credence to this concern: The PPC confidence interval contains zero [–60.0, 54.1], and the pSRMR for this model is excellent (.013). We will return to the implications of these results later in the recommendations provided in the Discussion section.
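The disagreement between the BIC and the BICSSA can be traced to their penalty terms. The sketch below assumes the standard definitions, BIC = -2LL + k ln(N) and BICSSA = -2LL + k ln((N + 2)/24), so that the gap between the two criteria depends only on the number of free parameters k and the sample size N. The sample size of 500 is an assumption made for illustration, as the calibration sample size is not stated in this passage; the gap is nearly insensitive to N at this scale.

```python
import math

def bic_minus_bicssa(n_parameters, sample_size):
    """Difference between the BIC and sample-size adjusted BIC penalties.

    BIC    = -2LL + k * ln(N)
    BICSSA = -2LL + k * ln((N + 2) / 24)
    """
    return n_parameters * (math.log(sample_size) - math.log((sample_size + 2) / 24))

ASSUMED_N = 500  # assumption: not stated in this passage

# (label, free parameters, reported BIC, reported BICSSA)
reported = [
    ("Table 3", 143, 21009, 20555),
    ("Table 4", 70, 20723, 20501),
    ("Table 5", 314, 21699, 20703),
]
for label, k, bic, bicssa in reported:
    gap = bic_minus_bicssa(k, ASSUMED_N)
    print(f"{label}: reported gap = {bic - bicssa}, implied gap = {gap:.1f}")
```

Because the gap grows linearly in the number of free parameters, the BIC penalizes the 314-parameter model far more heavily than the BICSSA does.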
Quantifying the Strength of Evidence in Favor of Each Model
In Table 6, we quantify the strength of the evidence in favor of the models reported in Tables 2 through 5 following procedures outlined in Burnham and Anderson (2004) for the BIC and BICSSA. These procedures (described in the note to Table 6) allow for the calculation of the posterior probabilities of each model given the data (Raftery, 1995).
Note: Number of estimated parameters treats parameters with uninformative (diffuse) and informative priors equally. LL refers to the log likelihood of the model, with values closer to zero indicating better fit. The Δi BIC value is calculated by identifying the model in the set with the smallest BIC and then subtracting this BIC value from the BIC of the given model in the set. The same procedure is completed to calculate Δi BICSSA. Pi is the posterior probability of each model given the data. Assuming that each model has an equal prior probability, Pi BIC is calculated by Pi BIC = exp(−Δi BIC / 2) / Σr exp(−Δr BIC / 2), where the sum runs over all models in the set.
The posterior probabilities calculated from the BIC and BICSSA in Table 6 indicate overwhelming support for the model with three estimated cross-loadings. As such, this model was selected to undergo cross-validation.
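The calculation of posterior model probabilities from BIC values can be sketched as follows, assuming equal prior model probabilities and the usual BIC weights described by Burnham and Anderson (2004). The model set below contains only the three Bayesian models whose BIC values are reported in the text, so the resulting probabilities need not match Table 6, which covers a larger set.

```python
import math

def bic_model_probabilities(bic_values):
    """Posterior model probabilities from BIC values under equal priors:
    P_i = exp(-delta_i / 2) / sum_r exp(-delta_r / 2)."""
    best = min(bic_values.values())
    weights = {name: math.exp(-(bic - best) / 2.0)
               for name, bic in bic_values.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

bics = {
    "small-variance priors on all cross-loadings (Table 3)": 21009,
    "three freely estimated cross-loadings (Table 4)": 20723,
    "unique factor covariances estimated (Table 5)": 21699,
}
probabilities = bic_model_probabilities(bics)
for name, prob in probabilities.items():
    print(f"{name}: {prob:.6f}")
```

With BIC differences in the hundreds, essentially all of the posterior probability falls on the model with the smallest BIC.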
Cross-Validating the Selected Model
In order to avoid the possibility that the previous results are an outcome of sample-specific characteristics, a calibration/validation strategy was implemented per the recommendations of MacCallum et al. (2012). The validation was completed by drawing a second independent sample of 500 observations and estimating the same model as in Table 4 across increasing measurement invariance constraints. We estimated these models as mixture models with known classes representing the two samples. The initial model is a configural model, which assumes only the same pattern of fixed and free parameters. The results in Table 7 reveal that this model fits both samples in a roughly equivalent manner, noting that the BIC of 41,618 and PPC of 825.2 to 960.7 are approximately twice the BIC of 20,723 and PPC of 365.1 to 458.9 of the single sample model. Next, more constraints were incorporated to test for weak factorial invariance (factor loadings held equivalent) and strong factorial invariance (factor loadings and indicator intercepts held equivalent). Examination of the BIC, BICSSA, and DIC criteria indicates substantial support for the strong invariance model, which results in our conclusion that the measurement model from Table 4 generalizes within our population. Furthermore, a separate analysis (not reported) using this new sample indicated that the measurement model in Table 4 displays far superior fit and generalizability compared with the originally hypothesized measurement model where S3, P1, and P4 were specified to load onto only one latent variable.
Comparison of Modified Bayesian Complexity Two Model Between Group 1 (Calibration) and Group 2 (Validation) Via Mixture Models
Note: PPC = posterior predictive checking; BIC = Bayesian information criterion; BICSSA = sample-size adjusted Bayesian information criterion; DIC = deviance information criterion; pSRMR = pseudo standardized root mean square residual.
Discussion
Management scholars are on the precipice of a watershed change as we move away from frequentist statistical inference to Bayesian statistical inference (Kruschke et al., 2012). One approach for conducting Bayesian inference is Muthén and Asparouhov’s (2012a) BSEM technique, which has already garnered interest in psychology (Fong & Ho, 2013; Golay et al., 2013) and management (Zyphur & Oswald, 2015). The BSEM approach affords researchers a degree of modeling flexibility previously unseen, but as with all new statistical techniques, it is critical to vet this technique to avoid reaping a harvest of ambiguous findings.
In this article, we have sought to provide management scholars with an overview of the BSEM approach to convey how this technique compares to other approaches for conducting factor analysis. We further summarized concerns expressed about the BSEM technique by Rindskopf (2012) and MacCallum et al. (2012) and, expanding on these concerns, highlighted logical problems with the justifications articulated for preferring the BSEM technique over traditional CFA. Given these concerns, we then demonstrated the use of the BSEM technique to identify the simple structure (Thurstone, 1947) for the ESE scale of McGee et al. (2009) and, subsequently, conducted a cross-validation analysis of this simple structure on a separate sample. Our analysis demonstrates how the decision to utilize informative priors to estimate hypothesized small factor loadings or unique factor covariances can result in the retention of a fundamentally different simple structure. Furthermore, our analysis provides evidence, consistent with findings from Fong and Ho (2013) and Muthén and Asparouhov (2012a), that suggests that estimation of all unique factor covariances will result in perfect model fit, calling into question the utility of this strategy.
Given the promise that we see in the BSEM technique to enable management scholars to push the boundaries of theory testing, we offer the following series of recommendations to guide the future use of the BSEM technique with regard to measurement model development. These recommendations are developed through a synthesis of the literature on factor analysis (Asparouhov & Muthén, 2009; Browne, 2001; Cudeck & O’Dell, 1994; Reise, Waller, & Comrey, 2000), model selection (Burnham & Anderson, 2004; Myung, 2000; Zucchini, 2000), and reflective measurement (Bagozzi, 2007; Bollen, 1989). Furthermore, we provide management scholars an understanding of when the use of the BSEM technique is likely to be most problematic by drawing theoretical predictions from MacCallum and Tucker’s (1991) conceptualization of the common factor model. We summarize the implications that their conceptualization has for using the BSEM technique in Appendix C (see the online supplement) and incorporate these implications in several of our recommendations. It should be noted that, given that our focus in this article is not on technical details of the BSEM technique, our recommendations focus on theoretical aspects of model development rather than statistical aspects of execution.
Recommendation 1: The BSEM Technique Is Not a Cure for Poor Indicator Development
During the review process, one of the reviewers astutely pointed out that a critical concern with the use of the BSEM technique as recommended by its advocates is that the BSEM approach’s flexibility will allow management scholars to obscure poor scale development efforts with statistical sophistication. For example, scholars could utilize the BSEM technique to obscure measurement problems that arise from using double-barreled, poorly worded, or overly complex indicators (Comrey, 1988; Smith & McCarthy, 1995). As such, we wish to reiterate that the BSEM technique does not alleviate the necessity to follow accepted techniques for developing indicators. If anything, the added flexibility of the BSEM technique makes such procedures even more critical.
Recommendation 2: Do Not Lose Sight of Simple Structure
Thurstone’s (1947) seminal work on factor analysis established the concept of simple structure as obtaining an easily interpretable, theoretically justified solution for
One of the most useful features of the BSEM approach is that significance tests are obtained for each estimated factor loading. Similar to Cudeck and O’Dell’s (1994) arguments in a frequentist framework, these significance tests are an invaluable assistance in helping to define a simple structure, and we strongly urge scholars using the BSEM technique to indicate which estimated factor loadings are statistically significant. This should serve as a first litmus test for defining simple structure, given that finding a loading’s credibility interval contains zero indicates lack of evidence that the parameter differs from zero in the population. However, beyond statistical significance, scholars must argue why a given statistically significant loading has practical significance to warrant its estimation. For example, in Muthén and Asparouhov’s (2012b) BSEM bifactor analysis of the Holzinger and Swineford (1937, Table 1: 353) data, only 3 of 77 cross-loadings estimated with informative priors were significant, and of these, the largest standardized loading is 0.183, indicating that the latent factor is explaining ~3.4% of the variance in the observed measures, which may not be practically significant to warrant its estimation as a free parameter in subsequent studies. Following McDonald (1999), we recommend a threshold of 0.30 from a fully standardized output as a practically meaningful cross-loading, though we wish to stress this guideline should not be interpreted dogmatically.
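The practical-significance argument rests on simple arithmetic: treating the squared standardized loading as the share of an indicator's variance attributable to a factor, as the passage does. The short sketch below applies this to the 0.183 loading and to the 0.30 guideline; the function name is ours.

```python
def variance_explained(standardized_loading):
    """Share of an indicator's variance attributable to one factor,
    taken here as the squared standardized loading."""
    return standardized_loading ** 2

for loading in (0.183, 0.30):
    share = variance_explained(loading)
    print(f"loading {loading:.3f} -> {share:.1%} of variance")
```

A loading of 0.183 corresponds to a little over 3% of the variance, whereas a loading at the 0.30 guideline corresponds to 9%.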
In short, we believe that management scholars should go beyond the current practice when using the BSEM technique of estimating hypothesized small cross-loadings using zero mean and small-variance priors, and upon obtaining a solution, presenting this solution as the simple structure. Rather, per Thurstone (1947) and Cudeck and O’Dell (1994), we urge scholars to examine the statistical and practical significance of these loadings and make decisions about the underlying simple structure to guide future usage of the measurement scale, as was illustrated in our cross-validation analysis. The utility of our recommendation is demonstrated by comparing the interpretability of Table 3 and Table 4: The simple structure in Table 4 is far more interpretable than the simple structure in Table 3. Importantly, this recommendation does not require that each observed measure be influenced by only one common factor as argued by Gerbing and Anderson (1988) and Hair, Black, Babin, and Anderson (2010).
Recommendation 3: Critically Examine Whether Multidimensional Indicators Are Necessary
The management community has avoided utilizing multidimensional indicators (i.e., indicators that are influenced by more than one common factor) because multidimensional indicators have been declared to be per se invalid (Gerbing & Anderson, 1988; Hair et al., 2010). However, multidimensional indicators are accepted in the broader methodological community as permissible (Browne, 2001; Edwards, 2010; Little, Rhemtulla, Gibson, & Schoemann, 2013) and have even been encouraged (Green & Yang, 2009; Marsh et al., 2009). With this being said, the use of multidimensional indicators can create challenges by potentially obscuring the empirical meaning of one or more operationalized constructs and reducing measurement model parsimony (Asparouhov & Muthén, 2009). To help ensure that the use of the BSEM technique does not result in overzealous estimation and retention of unanticipated cross-loadings, we provide an overview of key decisions that need to be made when unanticipated cross-loadings are identified (see Figure 1). We emphasize that once scholars reach the third decision in Figure 1, our recommendations should be interpreted as guidelines rather than strict yes/no decisions, given inherent subjectivity in the modeling process (Cudeck & Henly, 1991).

Flowchart Containing Guidelines for Evaluating Whether to Include a Cross-Loading and/or Delete a Multidimensional Indicator
As explained in the Recommendation 2 section, the first two litmus tests for retaining an unanticipated cross-loading are that this loading should be statistically and practically relevant. Assuming these two conditions are met, the next step is to evaluate whether a concise, theoretically sound explanation exists for the cross-loading. On the basis of our examination of the literature, we believe such justifications will likely be limited to situations when (1) the construct being measured is multidimensional, the subdimensions are themselves broad, and the subdimensions are theorized to be correlated (Reise, Morizot, & Hayes, 2007); or (2) the indicators have a complex structure akin to a testlet (Adams, Wilson, & Wang, 1997; Lee, Dunbar, & Frisbie, 2001). In the former case, an indicator may tap more than one subdimension of a multidimensional construct, which is what occurred in this study. In the latter case, which is likely to occur in the management context when developing scales to measure processes, an indicator is likely to covary with other indicators from the same temporal stage of the process in addition to covarying with other indicators from the same content domain. As such, in our hypothetical process scale, indicators would be influenced by a latent variable representing the content domain and a latent variable representing the temporal stage of the process. In contrast, if an indicator loads onto two common factors that are theorized to have distinct sampling domains, there is strong concern for interpretational confounding (Bagozzi, 2007; Burt, 1976), and removal of the indicator should be strongly considered.
Once it is determined that a sound theoretical justification exists for an indicator to be influenced by multiple common factors, the next step is to determine whether each of the common factors onto which the indicator loads has an independent cluster basis (McDonald, 1999). More specifically, McDonald states: A useful sufficient condition for identification in a multidimensional item response model (which carries over directly from well-established factor theory; e.g., Bollen, 1989) is that for each trait [reflective latent variable] there are at least two items measuring it that are factorially simple, in Thurstone’s (1947) classical terminology. This means that they [the items] measure only one trait, having zero loadings on all others. . . . If each trait has sufficient simple indicators to yield identifiability, it has an independent cluster basis. . . . An IC [independent cluster] basis allows each trait to be interpreted as the common attribute of its simple indicators; any complex items [items that have large loadings on more than one trait] can then be studied as mixtures of those attributes. (2000: 101-102)
In the ESE example, the common factors that have multidimensional indicators (search, plan, and marshal) all have an independent cluster basis, given that indicators S1 and S2 load only onto search, indicators P2 and P3 load only onto plan, and indicators M2 and M3 load only onto marshal. Once the presence of an independent cluster basis is established, the fifth step in Figure 1 is to determine if a multidimensional indicator uniquely taps a facet of a construct. For example, in our study, indicator P4 captures the extent to which the informant believes he or she has the capability to design a marketing campaign, which fits within the domain of the plan construct. Furthermore, none of the remaining indicators appear to tap into this facet of plan. As such, removal of P4 would result in a deficient operationalization of this construct (Little et al., 1999). However, if another indicator were present that tapped this facet of plan, then deleting P4 would not be expected to substantially alter the construct; thus, the benefit obtained from deleting this indicator may outweigh the benefit of retaining it. Note that a multidimensional indicator might tap a unique facet of one construct but be redundant in another construct. In this situation, one would generally opt to keep the indicator so as to avoid making the former construct deficient.
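The independent cluster basis check described above can be expressed as a short routine: for each common factor, count the indicators that load on that factor and no other, and require at least two. The sketch below is illustrative only; the function name is ours, and the loading pattern lists just the indicators named in the text rather than the full ESE scale.

```python
def independent_cluster_basis(pattern, minimum_simple=2):
    """Map each factor to True if it has at least `minimum_simple`
    indicators that load on that factor and no other."""
    factors = {factor for loads in pattern.values() for factor in loads}
    simple_counts = {factor: 0 for factor in factors}
    for loads in pattern.values():
        if len(loads) == 1:              # factorially simple indicator
            (only_factor,) = loads
            simple_counts[only_factor] += 1
    return {factor: count >= minimum_simple
            for factor, count in simple_counts.items()}

# Partial pattern: only the indicators mentioned in the text.
ese_pattern = {
    "S1": {"search"}, "S2": {"search"}, "S3": {"search", "plan"},
    "P1": {"plan", "search"}, "P2": {"plan"}, "P3": {"plan"},
    "P4": {"plan", "marshal"},
    "M2": {"marshal"}, "M3": {"marshal"},
}
print(independent_cluster_basis(ese_pattern))
```

For this partial pattern, search, plan, and marshal each retain two factorially simple indicators, which matches the conclusion stated in the text.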
Even if a multidimensional indicator appears to tap a unique, important facet of a construct, if the indicator has a low communality (proportion of variance explained by the common factors), removal of the indicator may be preferred. Numerous benefits exist for using observed measures with high communalities, which is why Asparouhov and Muthén state, “One can argue that it is more important to find accurate measurements than to find a pure set of measurements” (2009: 430). First, using low communality measures is the primary driver of model misfit and parameter instability (MacCallum & Tucker, 1991). Conversely, using high communality measures enables accurate factor extraction even with small samples (MacCallum, Widaman, Zhang, & Hong, 1999). Second, utilizing measures with high communalities reduces the negative consequences of common factor multicollinearity (Grewal et al., 2004). Inspection of the three indicators with cross-loadings in our study reveals that these indicators have high communalities (R² = .57 for S3, R² = .63 for P1, and R² = .52 for P4). As such, following the advice of Gerbing and Anderson (1988) and Hair et al. (2010) of adopting a per se strategy of deleting multidimensional indicators would be ill advised. As we do not know of any study that has addressed the question of what level of communality a multidimensional indicator should have to warrant inclusion, we tentatively recommend an R² threshold of .50, with the caveat that lower communality thresholds may be appropriate for initial scale development efforts.
The last step in Figure 1 is to evaluate whether characteristics of the analysis make it more likely that cross-loadings will arise as a result of capitalization on chance, such as small sample size and using indicators with low communalities. Before moving to the next recommendation, we want to point out one remaining issue with using multidimensional indicators: The use of multidimensional indicators precludes the use of observed scale scores for subsequent analysis because it would be inappropriate to include the same indicator in both scales when summing or averaging across indicators. As such, because of reduced practical utility, scholars may wish to identify new indicators that capture the same facet of a construct without cross-loading.
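The sequence of questions discussed in this recommendation can be loosely sketched as a screening function. This is our paraphrase of the steps described in the text, not a reproduction of Figure 1: the function name, argument names, and suggestion wording are hypothetical, and the cutoffs are the tentative guidelines given above rather than strict rules.

```python
def screen_cross_loading(significant, loading, has_theory, has_ic_basis,
                         taps_unique_facet, communality,
                         loading_cutoff=0.30, communality_cutoff=0.50):
    """Walk through the screening questions in order and return a
    suggestion. All cutoffs are guidelines, not strict decision rules."""
    if not significant:
        return "fix cross-loading to zero: credibility interval contains zero"
    if abs(loading) < loading_cutoff:
        return "fix cross-loading to zero: below practical-significance guideline"
    if not has_theory:
        return "consider removing indicator: no theoretical justification"
    if not has_ic_basis:
        return "consider removing indicator: no independent cluster basis"
    if not taps_unique_facet:
        return "consider removing indicator: facet covered by other indicators"
    if communality < communality_cutoff:
        return "consider removing indicator: low communality"
    return "retain cross-loading, then weigh capitalization on chance"

# P4-like case: the loading of 0.35 is a hypothetical value; the
# communality of .52 is the value reported for P4 in the text.
print(screen_cross_loading(True, 0.35, True, True, True, 0.52))
```

The final judgment about capitalization on chance (sample size, communalities, model complexity) is left to the researcher because it does not reduce to a single cutoff.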
Recommendation 4: Remember the Principles of Cross-Validation and Develop Models Accordingly
This recommendation is derived primarily from the MacCallum and Tucker (1991) framework. As discussed by Cudeck and Browne (1983), when researchers have small samples, simpler models (i.e., fewer parameters) have a greater likelihood of cross-validating and are thus favored over more complex models that capture idiosyncratic sample characteristics. Management scholars should be cognizant that the likelihood that the BSEM technique will inadvertently model idiosyncratic sample characteristics (i.e., sampling error) is greater when researchers have (1) small sample sizes, (2) low communality observed measures, and (3) complex models (e.g., the ratio of observed measures to common factors is small; MacCallum & Tucker). On the basis of simulation results by MacCallum et al. (1999), we expect the impact of sampling error to be most extreme in settings with (1) low communality indicators and complex models or (2) low communality indicators and small sample size. As such, management scholars (along with reviewers and editors) should be cautious about unexpected results identified using the BSEM technique when these results occur in research settings that are highly susceptible to modeling sampling error.
Given that the BSEM technique may be prone to modeling idiosyncratic sampling characteristics, cross-validation of solutions takes on added importance when scholars use this approach to make model-specific modifications (MacCallum et al., 2012). Apart from helping management scholars to better build a cumulative body of knowledge by ensuring that the most generalizable models are selected, cross-validating post hoc model-specific modifications identified using the BSEM technique forces scholars to parsimoniously define measurement model simple structure, thus reinforcing Recommendation 2.
Recommendation 5: Follow Sound Principles of Model Selection
In factor analysis, the goal of model selection is to identify a measurement model that effectively recreates the variance-covariance matrix of the observed indicators in a theoretically meaningful, parsimonious manner that generalizes to new samples (Cudeck & O’Dell, 1994; Reise et al., 2000). Accomplishing this task requires a measurement model to minimize overall discrepancy, which is the sum impact of discrepancy of approximation and discrepancy of estimation (Cudeck & Henly, 1991). Discrepancy of approximation arises from the inherent simplifications made when trying to approximate the population data generating mechanism (Zucchini, 2000), and discrepancy of estimation arises from estimating unknown population parameters with only information from a random sample (Myung, 2000). It is generally assumed that discrepancy of approximation is a fixed value, whereas discrepancy of estimation decreases as sample size increases (MacCallum et al., 1994; Preacher, Zhang, Kim, & Mels, 2013).
We make four suggestions to guide model selection when using the BSEM technique. First, we strongly caution scholars not to make model selection decisions solely on the basis of whether the PPC confidence interval contains zero as this procedure represents a test of “perfect fit” (Browne & Cudeck, 1992), which is unrealistic given that models are never perfect representations of the process that generated the data (MacCallum, 2003; Myung, 2000). Second, given the aforementioned lack of frequentist fit indices like RMSEA, we recommend discrepancy of approximation be evaluated using the PPC confidence interval and, until a distribution of SRMR is available, pSRMR. Third, to evaluate overall discrepancy, we recommend scholars report the BIC, BICSSA, and DIC, which can be calculated from standard outputs. Given that the BIC and BICSSA treat all estimated parameters equally while the DIC adjusts the number of estimated parameters, in part, on the strength of parameter priors, the BIC and BICSSA are likely to select more parsimonious models. Given the substantial theoretical (Raftery, 1995; Schwarz, 1978) and empirical (Nylund et al., 2007; Vrieze, 2012) support for the BIC and BICSSA, coupled with strong theoretical opposition to the DIC (see the rejoinders to Spiegelhalter et al., 2002), we recommend scholars place greater weight on the decisions from the BIC and BICSSA. Furthermore, given greater concerns about estimating complex models in smaller samples, coupled with the BICSSA having known limitations in smaller samples (i.e., N < 200; Nylund et al.), we recommend that the greatest weight be given to the BIC. Fourth, when comparing multiple measurement models, we recommend scholars calculate the posterior probabilities of each model using procedures outlined by Burnham and Anderson (2004). However, scholars should note that these information criteria, like all statistics, have their own sampling distributions and, thus, are not infallible (Preacher & Merkle, 2012).
Thus, as noted by Cudeck and Henly: The problem of making comparisons inevitably arises when two or more structures apply to the same data and one wishes to evaluate their relative performance. Often the best that can be done . . . is to state clearly the criteria that are used in the comparison, in conjunction with descriptions of the models, characteristics of the data, and the purpose for which the models were constructed. Evaluations of this kind inevitably include elements of prior experience and individual preference. These personal points of view should be articulated. Many believe that such subjectivity in model development is somehow unscientific and undesirable, as if it could be avoided by pretending that it does not exist. (1991: 517)
Recommendation 6: Avoid Estimation of Unique Factor Covariances With Multidimensional Reflective Constructs Until Guidelines Become Available
Our sixth recommendation is that, until more information becomes available, scholars should avoid estimating the entire matrix of unique factor covariances (i.e., the
The first reason estimating the matrix of unique factor covariances should be avoided is that this procedure runs counter to the core goal of factor analysis, which is to “identify the major common factors and their pattern of effects on the variables. Given this objective, investigators must simply accept the fact that minor common factors will exist and will contribute to lack of fit of a parsimonious model” (MacCallum & Tucker, 1991: 507). Simply stated, some degree of misfit will be present when conducting factor analysis because models are inherently approximations of a complex reality. Box goes further to state, “Models, of course, are never true, but fortunately it is only necessary that they be useful. For this it is usually needful only that they not be grossly wrong” (1979: 2). As such, while estimating the matrix of unique factor covariances will improve model fit, given that this procedure is arguably the antithesis of factor analysis, we do not believe this approach has a defensible theoretical justification.
The second reason estimating all unique factor covariances should be avoided is the inherent assumption that all model misfit arises solely from sampling error and unimportant minor factors. This assumption is necessary to estimate these covariances and maintain that the hypothesized factor loadings and specified relationships between common factors still have theoretical meaning. This presumption allows the researcher to treat the matrix of unique factor covariances as a “vacuum” that captures all noise arising from model error and sampling error. Thus, estimation of the unique factor covariances is problematic if misfit is the result of failing to estimate theoretically meaningful effects, such as including an additional common factor. Illustrating this concern, the previous example demonstrated that estimating all unique factor covariances would have resulted in the researcher affirming that the originally hypothesized simple structure for the McGee et al. (2009) ESE scale fits the data perfectly. As such, our concern is that if estimation of the matrix of unique factor covariances becomes common practice, there will be a great increase in the likelihood that management scholars will find evidence in favor of their originally hypothesized measurement models when, in fact, the originally hypothesized measurement models were severely misspecified.
The third reason estimating all unique factor covariances should be avoided is that the PPC criterion and pSRMR are rendered “worthless” measures of fit, as estimation of all unique factor covariances allows the researcher’s specified model to replicate the data (covariance matrix). This phenomenon can be seen in Muthén and Asparouhov’s (2012a) Big Five example, Fong and Ho’s (2013) Hospital Anxiety and Depression example, and the models reported earlier. Appendix D in the online supplement further illustrates that estimating all unique factor covariances can enable a grossly ill-specified model to accurately replicate a sample’s variance-covariance matrix. Thus, to use Preacher’s (2006) terminology, estimation of all unique factor covariances appears to result in a model with an unparalleled degree of fitting propensity such that the model can fit any observed variance-covariance matrix. This is extremely concerning, as such a model cannot be disconfirmed and, thus, has no value (Myung, 2000) because, to quote MacCallum et al., “researchers must keep in mind the important principle of disconfirmability in model specification and evaluation. It is important that models be disconfirmable; that is, that it be possible for the model to be inconsistent with observed data so that we can evaluate the degree of such inconsistency, and so that we can make a decision as to whether a given model is sufficiently consistent with the observed data as to be deemed a plausible model” (1994: 9).
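The fitting-propensity concern can be illustrated numerically. The following is a minimal sketch, not the analysis reported in Appendix D: it uses simulated data, a plain (non-posterior) SRMR, and hypothetical names (`loadings_true`, `lam`, `theta_full`, `srmr`), and it ignores the informative priors that the BSEM technique places on the unique factor covariances. It shows only that once the full unique factor covariance matrix is free, a deliberately wrong one-factor model can reproduce a sample covariance matrix generated by two factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: six indicators, two correlated common factors.
loadings_true = np.array([
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.6],
])
factor_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
common_var = np.sum((loadings_true @ factor_cov) * loadings_true, axis=1)
sigma_true = loadings_true @ factor_cov @ loadings_true.T + np.diag(1.0 - common_var)

data = rng.multivariate_normal(np.zeros(6), sigma_true, size=500)
S = np.cov(data, rowvar=False)  # sample covariance matrix

# Grossly misspecified model: ONE factor with identical, arbitrary loadings.
# The loadings are scaled (an assumption made for this sketch) so that the
# residual matrix computed below stays positive definite.
min_eig = np.linalg.eigvalsh(S)[0]
lam = np.full((6, 1), np.sqrt(0.5 * min_eig / 6))


def srmr(sample_cov, implied_cov):
    """Standardized root mean square residual over the lower triangle."""
    d = np.sqrt(np.diag(sample_cov))
    resid = (sample_cov - implied_cov) / np.outer(d, d)
    idx = np.tril_indices_from(sample_cov)
    return float(np.sqrt(np.mean(resid[idx] ** 2)))


# (a) Conventional specification: diagonal unique factor covariance matrix.
theta_diag = np.diag(np.diag(S - lam @ lam.T))
srmr_diag = srmr(S, lam @ lam.T + theta_diag)

# (b) All unique factor covariances free: Theta absorbs whatever is left over.
theta_full = S - lam @ lam.T
srmr_full = srmr(S, lam @ lam.T + theta_full)
theta_is_pd = bool(np.all(np.linalg.eigvalsh(theta_full) > 0))

print(f"SRMR, diagonal Theta: {srmr_diag:.3f}")
print(f"SRMR, full Theta:     {srmr_full:.2e}")
print(f"Full Theta positive definite: {theta_is_pd}")
```

Under specification (a) the wrong model fits poorly, as it should. Under specification (b) the misfit vanishes even though the loadings are arbitrary, so the fit index no longer says anything about the hypothesized structure.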
Recommendation 7: Do Not Use the Rationales Articulated by BSEM Advocates When Justifying the Use of the BSEM Technique to Develop Measurement Models
Our final recommendation concerns how scholars justify (and editors and reviewers evaluate) the use of the BSEM technique to develop measurement models. As described earlier, the rationales provided by BSEM advocates that the BSEM technique is preferable to CFA are logically flawed and, thus, should not be used in the future. This naturally raises the question, When and how should the use of the BSEM technique be justified?
In the context of measurement models, the BSEM technique is most fruitfully applied by management scholars when (1) developing a new measurement instrument, (2) refining an existing measurement instrument by adding indicators, or (3) reconciling the failure of an established measurement instrument. When developing measurement instruments, the logical place for the BSEM technique is to serve as the follow-up to an initial EFA using analytic rotation (if the number of common factors is not posited a priori). Alternatively, if scholars have a strong sense of the number of common factors and primary loadings, the BSEM technique would be utilized for the initial analysis. When refining an existing measurement instrument by adding new indicators, management scholars would likely want to specify a model similar to Table 3 where each indicator’s primary loading is assigned a diffuse prior and hypothesized small loadings are assigned an informative prior with a mean of zero and a small variance. This specification will allow scholars to determine if the new indicators behave in the theorized manner while also permitting the detection of indicators that are multidimensional. After this initial analysis, scholars should follow the approach demonstrated in this essay to define a new simple structure. When reconciling the failure of an established measurement instrument, scholars could follow the approach demonstrated in this essay, as it leverages the flexibility of the BSEM approach to identify model misspecifications while adhering to principles of factor analysis and model selection. In all scenarios, after identifying the measurement model simple structure from the BSEM solution, scholars would then seek to cross-validate this simple structure to increase confidence that the simple structure generalizes (MacCallum et al., 2012).
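To make the prior specification described above concrete, the following minimal sketch shows what a zero-mean, small-variance normal prior implies for a hypothesized small loading. The prior variances, the six-indicator pattern, and all names (`CROSS_LOADING_PRIOR_VAR`, `PRIMARY_LOADING_PRIOR_VAR`, `prior_interval`, `pattern`) are assumptions chosen for illustration; they are not the values used in Table 3.

```python
from statistics import NormalDist

# Assumed, illustrative prior variances (not taken from Table 3).
CROSS_LOADING_PRIOR_VAR = 0.01     # informative: mean zero, small variance
PRIMARY_LOADING_PRIOR_VAR = 1e10   # diffuse: effectively uninformative


def prior_interval(variance, mass=0.95):
    """Central interval of a zero-mean normal prior with the given variance."""
    sd = variance ** 0.5
    upper = NormalDist(mu=0.0, sigma=sd).inv_cdf(0.5 + mass / 2)
    return (-upper, upper)


# Hypothetical pattern: rows are indicators, columns are common factors,
# 1 marks the hypothesized primary loading and 0 a hypothesized small loading.
pattern = [
    [1, 0], [1, 0], [1, 0],
    [0, 1], [0, 1], [0, 1],
]

# Matrix of prior variances that mirrors the hypothesized pattern.
prior_var = [
    [PRIMARY_LOADING_PRIOR_VAR if is_primary else CROSS_LOADING_PRIOR_VAR
     for is_primary in row]
    for row in pattern
]

low, high = prior_interval(CROSS_LOADING_PRIOR_VAR)
print(f"95% prior interval for a small loading: ({low:.3f}, {high:.3f})")
```

With the assumed variance of 0.01, the prior places about 95% of its mass on loadings between roughly -0.20 and 0.20. A posterior estimate that falls well outside this range despite the prior is the kind of signal that an indicator may be multidimensional.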
Furthermore, we encourage scholars to report BSEM solutions, such as our model in Table 3, because this information can help guide future scale development efforts by, for example, suggesting indicators which may require additional refinement and clarification.
Limitations
In this article, attention has been purposely focused on the theoretical meanings and implications inherent in Muthén and Asparouhov’s (2012a) BSEM technique. In limiting our focus, several additional issues had to be passed over quickly or left uncovered; for example, the actual implementation and execution of MCMC and the Gibbs sampler were left unexplored. However, several excellent sources are available for readers interested in the technicalities underlying these issues (see Asparouhov & Muthén, 2010; Dunson, Palomo, & Bollen, 2005; Edwards, 2010; Lee, 2007). Furthermore, the examination of measurement models was restricted to cross-sectional, reflective scales. Alternatives to such specifications exist, including the bifactor model, multiple indicators and multiple causes models, and various mixture models. With regard to longitudinal models, the issues addressed here are still pertinent, although additional complexities are introduced that deserve discussion (Little, 2013). One of these issues, which is also relevant in multigroup research designs, is the role of invariance (both measurement and temporal; see Appendix E in the online supplement). Muthén and Asparouhov (2013), along with other scholars, are currently examining issues related to the role of near invariance. Furthermore, for parsimony, we avoided discussion of structural models, though it should be noted that many of the issues with the use of the BSEM technique for measurement models also apply to structural models (Muthén & Asparouhov, 2012a).
Conclusions
Modeling covariance structures using Bayesian approaches, particularly the BSEM technique outlined by Muthén and Asparouhov (2012a), appears to herald the dawn of a second generation of covariance structure modeling (Kaplan & Depaoli, 2012). This article has explored and critiqued the particular logic and rationales underlying the application of the BSEM approach to measurement modeling. It should be kept in mind that the world of Bayesian statistics and inference, including Bayesian approaches to SEM, is much broader than the limited aspects that this article has addressed.
However, regardless of the technique used for modeling, as scholars, we have an obligation to contemplate the theoretical meaning that is directly or tacitly embedded in the modeling approach we elect. One of the most important questions addressed when starting the modeling process is the intent of the model: to be descriptive, to be predictive, or to act as a test of theory. Theory-testing models tend to focus on parsimony in order to provide generalizability, while predictive models are often targeted to a narrow population with a few primary predictors. Both of these models often sacrifice model fit in favor of clarity. On the other hand, descriptive models are often designed to replicate the data as accurately as possible, leading to a reduction in parsimony in order to minimize model misfit. Full adoption of the BSEM technique, in particular the practice of estimating the entire matrix of unique factor covariances, would appear to lean more towards the descriptive approach. The recommendations in this article offer a way of utilizing the BSEM technique’s flexibility while nonetheless developing a model that is readily compatible with a theory-testing approach.
To best leverage the flexibility provided by the BSEM technique, as a community, we must thoroughly vet the rationale underlying this approach to establish boundaries for using this added modeling flexibility to ensure that resultant models remain theoretically meaningful. This article has sought to address this issue by illustrating how the flexibility of BSEM, in conjunction with recommendations from factor analytic theory, can provide a means of transcending current approaches for reflective measurement while still conforming to timeless aspects of statistical modeling. To conclude, we would like to leave the reader with the following quotation from Cudeck and Henly, which provides an elegant summary of our core points: “In the study of mathematical models, the process of developing and justifying a model is the most fundamental of issues, because every other feature associated with the use of quantitative models is influenced by the final form of the structure. Yet no model is completely faithful to the behavior under study. Models usually are formalizations of processes that are extremely complex. It is a mistake to ignore either their limitations or their artificiality. The best one can hope for is that some aspect of a model may be useful for description, prediction, or synthesis. The extent to which this is ultimately successful, more often than one might wish, is a matter of judgment” (1991: 512).
Acknowledgements
The authors would like to acknowledge Michael Browne and Michael Edwards for their comments on earlier versions of this essay. We also thank the two reviewers and special issue editors for their comments and feedback during the review process. Any errors are the responsibility of the authors.
References
