Abstract
Meta-analyses are well known and widely implemented in almost every domain of research in management as well as the social, medical, and behavioral sciences. While this technique is useful for determining validity coefficients (i.e., effect sizes), meta-analyses are predicated on the assumption of independence of primary effect sizes, which might be routinely violated in the organizational sciences. Here, we discuss the implications of violating the independence assumption and demonstrate how meta-analysis could be cast as a multilevel, variance known (Vknown) model to account for such dependency in primary studies’ effect sizes. We illustrate such techniques for meta-analytic data via the HLM 7.0 software as it remains the most widely used multilevel analyses software in management. In so doing, we draw on examples in educational psychology (where such techniques were first developed), organizational sciences, and a Monte Carlo simulation (Appendix). We conclude with a discussion of implications, caveats, and future extensions. Our Appendix details features of a newly developed application that is free (based on R), user-friendly, and provides an alternative to the HLM program.
In the 1970s and 1980s, researchers working independently across scientific disciplines created the foundation for what has become modern-day meta-analysis (Glass & Smith, 1979; Hedges & Olkin, 1985; Rosenthal & Rubin, 1978; Schmidt & Hunter, 1977). Gene Glass (1976) is given credit for the term meta-analysis to “refer to statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings” (p. 3). In practice, the Hunter-Schmidt (H&S) meta-analytic technique is the most widely used approach in the organizational sciences (Hunter & Schmidt, 2004; Schmidt & Hunter, 2015). More recently, metaregression techniques that allow for modeling of random-effects, multiple substantive predictors, and methodological moderators (Borenstein, Hedges, Higgins, & Rothstein, 2009; Gonzalez-Mulé & Aguinis, 2017) or the Hedges-Olkin type meta-analyses (H&O; Borenstein et al., 2009; Hedges & Olkin, 1985) have also grown in terms of application.
Since its inception, meta-analysis has gained in popularity across scientific fields in the natural and social sciences (Borenstein et al., 2009; Schmidt & Hunter, 2015) and has become critical for the advancement of theory, evidence-based practice, and policymaking (Banks, Kepes, & Banks, 2012; Briner & Rousseau, 2011; Le, Oh, Shaffer, & Schmidt, 2007). The value of meta-analysis has been acknowledged in important practitioner documents, such as the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), as well as the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology [SIOP], 2003). As an indication of their value in academia, meta-analyses tend to be one of the highest cited types of studies (Judge, Cable, Colbert, & Rynes, 2007). Thus, meta-analytic effect size estimates can be considered the gold standard for informing future research, practice, and policymaking.
Despite the advances and benefits of existing meta-analytic techniques, these models still suffer from a critical limitation that could alter statistical conclusions. The complicating issue with all single-level meta-analytic techniques to date, be it H&S, H&O, metaregression, or something else, is that they all rest on an assumption of independence of primary effect sizes, much like single-level ordinary least squares regression studies (M. W. L. Cheung, 2014; S. F. Cheung & Chan, 2004; Marín-Martínez & Sánchez-Meca, 1999; Rosenthal & Rubin, 1986). That is, all these techniques assume that the primary effect sizes are independent of one another. This is quite a serious issue, and Romano and Kromrey (2009) summarized it as follows:
The assumption regarding independence of observations is commonly violated in meta-analytic research (Becker, 2000; Hedges & Olkin, 1985; Hunter & Schmidt, 1990)…. Furthermore, several studies have been conducted concerning the consequences of dependent observations in meta-analysis (e.g., Becker & Kim, 2002; Beretvas & Pastor, 2003; Cooper, 1979; Greenhouse & Iyengar, 1994; Hedges & Olkin, 1985; Landman & Dawes, 1982; Raudenbush, Becker, & Kalaian, 1988; Rosenthal & Rubin, 1986; Tracz, Elmore, & Pohlmann, 1992). In general, this body of research has indicated that ignoring the assumption of independence may affect statistical inferences. (italics added; p. 406)
Our first contribution is to conceptually and empirically demonstrate how modeling primary effect sizes in a meta-analysis via multilevel modeling techniques, where appropriate, leads to different statistical conclusions than traditional meta-analyses that assume independence of effect sizes. We demonstrate that when significant between-unit dependencies exist, effect size estimates as well as confidence bands and standard errors differ from their single-level meta-analytic counterpart. Our second contribution is in unpacking the complexities inherent to running multilevel models from meta-analytic input data (a special class of models known as Vknown models). Toward this end, we demonstrate the application of multilevel modeling with meta-analytic data as inputs via the HLM 7.0 software. We also include a newly developed application with a step-by-step guide for users so they can simply upload their effect size estimates and implement the techniques discussed in this article. Relatedly, the same analyses could be implemented via Stata, Mplus, and SAS; however, the implementation in these software packages is beyond the scope of our article, but the conceptual ideas translate. Third, we discuss extensions of our ideas to different nesting structures and cross-classification, ending with a how-to guide for readers. We expand on each of these contributions in the following.
Our first contribution is in initiating a scholarly dialogue regarding dependency of meta-analytic effect sizes and presenting ways to test for and model it where appropriate. This is an important step in the organizational sciences as researchers regularly use meta-analyses to build new theory, further empirical findings in a domain, and inform policy and practice, as noted earlier. Most of these meta-analyses assume effect size heterogeneity is random and test for it via the credibility interval or the Q-statistic. Our approach, however, contends that such effect size heterogeneity may not be random and scholars should model this heterogeneity via a multilevel model as detailed in the following. Further, this approach addresses some of the challenges noted by DeSimone, Köhler, and Schoen (2018), who cautioned that researchers rarely temper their interpretation of meta-analytic effect sizes by acknowledging the heterogeneity of effect sizes reported in the original meta-analytic studies. Relatedly, we explain in a following section how modeling moderators in meta-analyses is not equivalent to modeling dependence in effect sizes via nesting or cross-classification.
Further, our work goes beyond previous studies that address dependence of effect sizes in meta-analyses (M. W. L. Cheung, 2014; Konstantopoulos, 2011) in two important ways: First, we illustrate, via a meta-analysis previously published in the organizational sciences (de Wit, Greer, & Jehn, 2012), how and when statistical conclusions might change when effect size dependencies are accounted for. For example, in our following illustration, we demonstrate how multilevel modeling is warranted for the association between task conflict and intragroup performance but not relationship conflict and intragroup performance. Second, we offer a user-friendly tool that facilitates such applications as most previous works are not easily accessible to the average management scholar.
Our second contribution is in delineating a step-by-step illustration of testing for and modeling dependencies in a meta-analytic context. In illustrating how a typical meta-analytic model could be conceptualized and implemented as a multilevel model via our examples and the HLM 7.0 software, we assert that it could change conclusions regarding established associations as consumers of research (see discussion of our findings in a later section). As noted earlier, meta-analyses are extensively used when designing interventions, implementing practice-oriented guidelines, and policymaking. Discrepancies in magnitudes of estimates, standard errors, and confidence bands (or any other statistical artifact) cast doubt on the ability of our science to inform such outcomes. As such, it is no trivial issue to correct for such biases. Relatedly, the application of multilevel modeling is common in the organizational sciences. However, the implementation of such models to meta-analytic data, where primary data are unavailable, presents unique challenges, as we demonstrate via our examples. Moreover, we offer an easy-to-use application that allows users who are less familiar with multilevel modeling to implement analyses either in HLM or R. Our work thus prompts meta-analysts to first test for dependency in effect sizes and then implement techniques to model such dependencies where they exist.
Third, we discuss different nesting versus cross-classification structures for future meta-analytic data. Across the three contributions described previously, the overarching contribution from our work is in prompting the testing of the independence assumption in meta-analyses and translating the use of Vknown, multilevel models for meta-analyses so that it is easily accessible to the average management scholar. In the following sections, we begin with a brief overview of the extent of the dependency problem in meta-analysis and then introduce meta-analysis as multilevel models. Then, we illustrate the use of our techniques via previously published meta-analyses across two distinct domains: an example study used in Konstantopoulos (2011) from Research Synthesis Methods and de Wit et al. (2012) in the Journal of Applied Psychology.
Dependencies in Meta-Analyses
As stated previously, the modeling of dependence in meta-analyses via multilevel techniques is not yet common in management literature (see M. W. L. Cheung, 2014). While we focus on between-study dependencies in this article (e.g., multiple samples drawn from the same school district as in Konstantopoulos, 2011, as described in the following), within-study dependencies have been acknowledged in the applied psychology literature (e.g., S. F. Cheung & Chan, 2004; Marín-Martínez & Sánchez-Meca, 1999; Rosenthal & Rubin, 1986). Such within-study dependencies can arise from drawing multiple estimates from the same study. Undoubtedly, these estimates would not be independent due to common factors driving measurement, design, study context, and setting. S. F. Cheung and Chan (2004) commented on the extent of both between- and within-study dependencies in meta-analyses: We conducted an informal survey for the meta-analyses published in the period between 1991 and 2001 in the Journal of Applied Psychology. Out of the 49 meta-analyses identified, only the author of 1 of them stated explicitly that all the effect sizes were independent before conducting the meta-analysis. The authors of 17 of them did not state clearly whether there was a problem of dependent effect sizes. Authors of the remaining 31 meta-analyses stated that some of the effect sizes were dependent, usually because they were from one single sample. Of these 31 meta-analyses, 5 treated the dependent effect sizes as independent. Seventeen meta-analyses computed a within-sample average for dependent effect sizes (i.e., the sample wise procedure). Two meta-analyses computed a composite score correlation for each set of dependent effect sizes, and the remaining 7 either did not make it clear how the dependent effect sizes were handled or used a mixture of the above procedures. (p. 783)
To retrieve all possible meta-analyses, we used meta as a search term in the title of the article and set a start date of 2002 (S. F. Cheung & Chan, 2004, reviewed dependence in all meta-analyses up to 2001). This search yielded 287 articles, which included all published meta-analyses and methodological primers for meta-analysis. Of these, we retained 246 meta-analytic reviews that addressed substantive topics (i.e., were not methods papers) published in the six journals since 2002 (see Table 1): JAP (119), JOM (48), P-Psych (45), AMJ (17), JBV (11), and SMJ (6).
Table 1. Dependency Search in Meta-Analyses (2002-2018).
Within these meta-analyses, we searched for depend in the text of the articles to identify any meta-analyses that refer to dependence. Many meta-analysts refer to “independent” samples or studies simply to signal the number of unique samples (i.e., k) in their meta-analysis and not necessarily to consider the issue of dependent effect sizes. As such, we did not simply rely on the automated search and carefully scanned each of the returned articles (i.e., 246 articles) (a) to determine if the authors discussed dependence (both within- and between-study dependencies) in their data, and (b) when such dependency was discussed, we determined how many studies actually addressed within- and between-study dependencies in primary effect sizes.
As seen in Table 1, over half of published meta-analyses do not mention that their effect sizes are independent (i.e., 55% of 246). It is promising, however, to note that 43% of published meta-analyses had some discussion of dependence. Of the studies that did mention dependency, all discussed within-study dependencies (i.e., all 43% of the overall 246 addressed within-study dependency). This type of dependency is a function of multiple effect sizes being drawn from the same primary study, and these studies typically adopted a weighted average of the multiple effect sizes or composite correlations (e.g., S. F. Cheung & Chan, 2004). However, only 10 (i.e., 4%) of the overall 246 published meta-analyses addressed between-study dependence. Note that all the studies addressing between-study dependence also discussed within-study dependence and as such, would count in both categories.
This percentage is troubling as it identifies a clear omission of addressing between-study dependence in meta-analyses to date. We speculate that this neglect of between-study dependencies might be even worse in the field of management as a whole (i.e., less than 4%) if one considers that the journals in our search are some of the most empirically and methodologically rigorous. Even when significance testing might not be of interest, the omission of addressing between-study dependencies in meta-analyses is perplexing given that such dependencies affect routinely reported confidence bands of effect sizes. Furthermore, we will demonstrate that it could also bias the effect size estimate in some instances (see also M. W. L. Cheung, 2014; Goldstein, 2011; Hox, 2010; Konstantopoulos, 2011; Moeyaert, Ugille, Ferron, Beretvas, & Van den Noortgate, 2013; Raudenbush & Bryk, 2002; Snijders & Bosker, 2012; Van den Bussche, Van den Noortgate, & Reynvoet, 2009; Van den Noortgate et al., 2013, 2015; Wang, Parrila, & Cui, 2013).
As the multilevel paradigm in organizational research (for primary studies) is not new and plenty of writings address the implications of dependence in organizational data (e.g., Bliese, 2000; Bliese & Hanges, 2004; Dansereau et al., 1984; Gooty & Yammarino, 2011; Hox, 2010; Kenny & Judd, 1986; Kozlowski & Klein, 2000; Snijders & Bosker, 2012), we suspect that this neglect of dependence in meta-analyses is primarily due to omission in that most of these published meta-analyses do not acknowledge the independence assumption and the implications of violating it. Secondarily, we believe that a major impediment to its application has been the lack of user-friendly, step-by-step guides on how such models could be implemented and the lack of availability of associated software. For example, most of the methodological works in this area are driven by writings in educational psychology and behavior, with associated examples in Stata (e.g., Konstantopoulos, 2011). In the sections to follow, we first discuss a conventional meta-analysis and then step into multilevel modeling for meta-analytic data.
Traditional Meta-Analysis
All meta-analytic traditions seek to estimate an overall population parameter while estimating sampling error variance and the residual (Kepes, McDaniel, Brannick, & Banks, 2013). In traditional meta-analytic approaches, either a random- or fixed-effects model is typically employed (Borenstein et al., 2009; Hedges & Olkin, 1985; Hunter & Schmidt, 2004; Hunter, Schmidt, & Le, 2006), although there are also models characterized as mixed-effects (Raudenbush & Bryk, 1985). While the estimation of models in the H&O tradition can be random- or fixed-effects, only random-effects are used in the H&S tradition. Random-effects models tend to be the most commonly used estimation models given the underlying assumptions in most scientific research. The “bare bones” or basic meta-analysis in the H&S tradition uses a random-effects model with no corrections for artifactual variance. For instance, if the parameter estimate is constant across samples, then the best estimate of the true relationship in the population is a weighted average. Weights are determined by the number of participants or cases in each individual sample. The parameter estimate would be identified by
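(The display below is a reconstruction from the description above, read as a sample-size-weighted average; the symbol \bar{r} for the parameter estimate is an assumption.)

$$\bar{r} = \frac{\sum_i N_i r_i}{\sum_i N_i} \tag{1}$$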
In the H&S tradition, ri is the correlation for sample i, and Ni is the total number of participants in sample i (Hunter & Schmidt, 2004, p. 81). The variance across samples is then the frequency weighted average squared error:
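(A reconstruction consistent with the wording above, using the same weights Ni; the symbols s_r^2 for the observed variance and \bar{r} for the weighted mean correlation are assumptions.)

$$s_r^2 = \frac{\sum_i N_i \,(r_i - \bar{r})^2}{\sum_i N_i} \tag{2}$$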
While the bare bones approach can correct for random-sampling error (see a worked example in Hunter & Schmidt, 2004, pp. 83–88), H&S also allows for the correction of other types of artifactual variance, such as measurement error and range restriction. The corrected correlations have less precision, though, and a compound attenuation factor (Ai) is needed to account for bias due to artifacts in the weight (Wi):
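(In the H&S framework this weight is commonly written as the sample size multiplied by the squared compound attenuation factor. The display below is a reconstruction on that basis rather than a quotation.)

$$W_i = N_i A_i^2 \tag{3}$$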
A summary of the H&S approach, which is the most frequently used in management research, is as follows: (a) Equation 1 summarizes the overall parameter estimate, (b) Equation 2 indicates the degree of heterogeneity in effect sizes across samples or simply between-study variance, and (c) Equation 3 (and others in Hunter & Schmidt, 2004) details corrections for measurement error and range restriction.
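As a worked illustration of the bare bones quantities in Equations 1 and 2, the following minimal sketch computes the sample-size-weighted mean correlation and the weighted variance. The three (r, N) pairs are made-up values chosen only for the arithmetic, and the function name bare_bones is hypothetical; no artifact corrections are applied.

```python
# Hypothetical (made-up) correlations and sample sizes; not taken from any study.
samples = [
    (0.20, 100),
    (0.30, 200),
    (0.10, 50),
]


def bare_bones(samples):
    """Return the N-weighted mean correlation and the N-weighted variance."""
    total_n = sum(n for _, n in samples)
    # Weighted average of the correlations, weights = sample sizes.
    mean_r = sum(n * r for r, n in samples) / total_n
    # Frequency-weighted average squared deviation from the weighted mean.
    var_r = sum(n * (r - mean_r) ** 2 for r, n in samples) / total_n
    return mean_r, var_r


mean_r, var_r = bare_bones(samples)
print(f"weighted mean r   = {mean_r:.4f}")
print(f"weighted variance = {var_r:.6f}")
```

Larger samples pull the mean toward their own correlation, which is why the estimate here sits closer to 0.30 than a simple average of the three values would.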
The H&O tradition uses a different approach in its random-effects model. As sampling error does not account for all variance in sample effect sizes, weights are adjusted. Here, Vi reflects the within-study sampling variance, 1 / (Ni − 3), and τ² represents the between-sample variability, that is, the variability not accounted for by sampling error (Borenstein et al., 2009). Vi is unique to each effect size, and τ² (a portion of the effect size weight) is held constant across effect sizes.
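(The adjusted weight described here is conventionally the inverse of the total variance of each effect size. The display below is a reconstruction on that basis; the symbol W_i^{*} and the equation number are assumptions.)

$$W_i^{*} = \frac{1}{V_i + \tau^2} \tag{4}$$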
In both meta-analytic traditions, there are several ways to examine the presence and magnitude of heterogeneity, including the Q-statistic, the random-effects variance component τ² or σ²ρ, and credibility or prediction intervals (Kepes et al., 2013). There are also multiple techniques for characterizing the presence and magnitude of heterogeneity, including the I² statistic and the 75% rule as well as subgroup analyses (for a review of accounting for heterogeneity in both traditions, see Kepes et al., 2013). These statistics could be used as a starting point to determine if enough study-level heterogeneity exists, thus justifying the transition to multilevel modeling and higher-level moderators as described in the following. We show in the following, however, that the multilevel model goes a step further than Equations 1 through 4 in building in dependencies in effect sizes. While most equations described previously simply evaluate aggregate effect sizes and allow for sampling error, measurement error, and range restriction corrections, more recently, Gonzalez-Mulé and Aguinis (2017) and others have called for metaregression techniques, which are represented by:
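(The display below is a reconstruction from the symbol definitions that follow; x for the moderators is an assumed symbol.)

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_j x_{ji} + u_i + e_i \tag{5}$$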
in which yi represents the effect size estimate from the ith study, the betas are unstandardized regression coefficients associated with j boundary conditions (i.e., moderators), ui represents the random-effects variance component, and ei is the within-study variance. This equation in particular corresponds to a two-level random-effects model as shown in the following but does not account for dependencies among the primary yi values.
Meta-Analysis as a Three-Level Multilevel Model
A simple two-level hierarchical model can model random-effects via maximum likelihood estimation techniques, wherein primary effect size estimates are nested within studies. It is important to note, however, that the two-level hierarchical model is, in essence, a mixed-effects meta-analytic model as in Equation 5. We begin with illustrating those models in the following as a pathway to ultimately building the three-level model, which accounts for dependencies in between-study effect sizes. In the equations to follow, we adopt Raudenbush and Bryk (R&B, 2002) terminology, but it is important to note that we are dealing with effect sizes, not primary data. As such, the terminology is a bit different than in regular R&B models. The difficulty encountered in such models is that the raw data are rarely available to the meta-analytic researcher and thus missing from the multilevel models we seek to implement. Instead, effect size estimates (e.g., correlations, mean differences) are typically obtained, each of which is assumed to be independent when applying traditional meta-analytic techniques. The Level 1 null model using R&B (2002) terminology can be represented via:
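(The display below is a reconstruction from the symbol definitions that follow.)

$$d_j = \delta_j + e_j, \qquad e_j \sim N(0, V_j) \tag{6}$$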
where d is the standardized effect size (mean differences between experiment and control groups, correlations,2 etc.) from Studies 1 through j, δ is the population parameter of d, and e is the sampling error associated with the effect size estimate d, which is assumed to be normally distributed with a mean of zero and a known variance Vj. The sampling variance Vj is computed as:
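(The display below is an assumption rather than a recovered formula: it takes the effect sizes to be Fisher z-transformed correlations, which matches the 1 / (Ni − 3) expression given earlier for the H&O weights and depends only on the study sample size. Standardized mean differences use a different expression involving the two group sizes.)

$$V_j = \frac{1}{n_j - 3} \tag{7}$$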
where nj is the sample size in study j.
Each of these models estimates the effect size in the population (δj) as an aggregate of the primary studies’ effect sizes (dj) plus sampling (within-study) error (Vj). These models assume Level 1 variance (ej) is known. However, in a multilevel model, we do not assume that dj (i.e., standardized effect size estimate) is independent. In Equations 1 through 4, such an assumption is made, and we discussed the implications of violating that assumption earlier. Here, we model that intercept δ as an outcome at a higher level (i.e., Level 2) to account for dependencies as discussed earlier. Following typical hierarchical modeling terminology, the simplest (unconditional or null) Level 2 intercept as outcome model is:
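(The display below is a reconstruction from the definitions of γ0, Uj, and τ given in the surrounding text.)

$$\delta_j = \gamma_0 + u_j, \qquad u_j \sim N(0, \tau) \tag{8}$$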
which could be further simplified into a mixed-effects unconditional model as:
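(The display below is a reconstruction, obtained by substituting the Level 2 intercept model into the Level 1 model.)

$$d_j = \gamma_0 + u_j + e_j \tag{9}$$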
In Equation 9, γ0 represents the mean effect size estimate while not assuming independence of primary effect sizes. Uj is the random Level 2 error, which is assumed to be normally distributed with a mean of zero and a variance of τ. All of these values can now be computed as dj and ej are known. Such problems are also known as Vknown problems in R&B terminology due to Level 1 variance being known or fixed (Raudenbush & Bryk, 2002). Very often in practice, one might be interested in modeling the actual effect of specific boundary conditions or moderators, such as those discussed earlier (i.e., common study characteristics). These study characteristics could then be built in as additional predictors at Level 2 in the following way:
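(The display below is a reconstruction from the symbol definitions that follow; only the first two study characteristics are written out.)

$$\delta_j = \gamma_0 + \gamma_1 w_{1j} + \gamma_2 w_{2j} + \cdots + u_j \tag{10}$$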
w1 through wj are study characteristics, and each of the regression coefficients (γ1 through γj) represents the estimated effect of a study characteristic that the meta-analyst might wish to include. The combined model, then, is represented via:
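(The display below is a reconstruction, obtained by substituting the Level 2 model with predictors into the Level 1 model.)

$$d_j = \gamma_0 + \gamma_1 w_{1j} + \gamma_2 w_{2j} + \cdots + u_j + e_j \tag{11}$$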
A quick and cursory examination of Equations 5 and 11 reveals that they are essentially the same. This is why many meta-analysts and multilevel scholars have argued for meta-analyses being essentially two-level multilevel models. Up to this point, the models subsume within- and between-study dependencies under the error terms and incorporate moderators at Level 2. Despite these similarities, one key distinction between multilevel models and meta-analyses is the weighting matrix. Unlike many multilevel models, meta-analyses weight each study using some index of precision (i.e., the known variance referenced previously in Equation 7). These weights are applied to the various parameters in a model (i.e., intercepts, covariates, variance components) depending on the nesting structure (Konstantopoulos, 2011; Van den Noortgate et al., 2015). The variance-covariance matrix is block diagonal in the multilevel meta-analysis, whereas it is a diagonal matrix in the traditional, random-effects meta-analysis. Because studies are weighted differently, not only will the standard errors and confidence intervals differ across models with different nesting structures, but the parameter estimates themselves may also shift (Konstantopoulos, 2011). This reflects a substantial departure from the widely discussed effects of dependence in primary studies.3
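The contrast between a diagonal and a block-diagonal variance-covariance matrix can be made concrete with a small sketch. It assumes the variance components are already known (in practice they are estimated by maximum likelihood), uses made-up effect sizes, and relies on a hypothetical helper named pooled_mean. Within a higher-level unit the covariance block is the diagonal of (sampling variance + between-study variance) plus a shared between-unit variance on every entry; setting that shared variance to zero gives the ordinary random-effects weighted mean.

```python
from collections import defaultdict

# Hypothetical rows: (district, effect size d, known sampling variance v).
rows = [
    ("A", 0.30, 0.02), ("A", 0.35, 0.03), ("A", 0.28, 0.02), ("A", 0.40, 0.04),
    ("B", 0.05, 0.02),
    ("C", -0.10, 0.03),
]


def pooled_mean(rows, tau2_study, tau2_district):
    """Generalized least squares mean and standard error under nesting.

    Each district block is diag(v_j + tau2_study) + tau2_district * J,
    where J is a matrix of ones; blocks for different districts are
    uncorrelated. The block inverse is applied in closed form
    (Sherman-Morrison), so no matrix library is needed.
    """
    s = defaultdict(float)  # per district: sum of 1 / (v_j + tau2_study)
    t = defaultdict(float)  # per district: sum of d_j / (v_j + tau2_study)
    for district, d, v in rows:
        a = v + tau2_study
        s[district] += 1.0 / a
        t[district] += d / a
    num = sum(t[g] / (1.0 + tau2_district * s[g]) for g in s)
    den = sum(s[g] / (1.0 + tau2_district * s[g]) for g in s)
    return num / den, den ** -0.5


# Assumed variance components, chosen only for illustration.
two_level = pooled_mean(rows, tau2_study=0.01, tau2_district=0.0)
three_level = pooled_mean(rows, tau2_study=0.01, tau2_district=0.05)
print("diagonal covariance:       mean=%.3f se=%.3f" % two_level)
print("block-diagonal covariance: mean=%.3f se=%.3f" % three_level)
```

In this toy data set, district A contributes four similar studies. Once the shared district variance is acknowledged, those four studies no longer count as four independent pieces of evidence, so both the pooled estimate and its standard error change.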
Another key point needs to be made here regarding the effect of moderators (i.e., ws in Equation 11) versus higher-level nesting structures. The difference is as follows: Incorporating a moderator variable in mixed-effects, or two-level, meta-analysis (i.e., Equations 5 or 11) assumes that the primary effect sizes are independent. In such models, while the association (i.e., the effect size) is presumed to vary across levels of a categorical moderator, the effect sizes within and between each category are still assumed to be independent. Given the issues with violating the assumption of independent error terms, it seems prudent to delineate one’s nesting structure first and then test for moderators (much like a primary multilevel study). That is, even if one tested all of the appropriate moderators but overlooked a critical grouping variable (i.e., a substantial ICC1 at Level 3 that indicates significant between-unit variance), then it is possible that the statistical conclusions regarding moderator variables (i.e., ws in Equation 11) could be biased. This would be analogous to a “slopes as outcomes” random coefficient model where the effect of a second-level variable (e.g., study characteristic) depends on the higher-level grouping (i.e., third-level variable, like culture). One’s model would completely omit such nested effects if one relied exclusively on a metaregression or mixed-effects meta-analysis. Having discussed the distinction between moderators and nesting structures, we now return to building the three-level model. We first illustrate three-level models via the Konstantopoulos (2011) example as he reported findings from Stata, and we compute findings from HLM (and R at a later stage). This provides a useful opportunity to triangulate findings across multiple software packages (i.e., Stata, HLM, and R). 
Konstantopoulos drew his data from an earlier published meta-analysis (Cooper, Valentine, Charleton, & Melson, 2003), and the first 15 rows are illustrated in Table 2.
Table 2. Subsection of Konstantopoulos (2011) Data as an Example for HLM 7.0.
Note: District is Level 3 ID; study is both Level 1 and Level 2 ID as V-known problems in multilevel allow for variance at Level 1 to be fixed. We acknowledge that year appears to be confounded with district based on this subsection of the data. However, with the full data set, this is not the case as year does vary across districts.
Cooper et al. (2003) conducted a meta-analysis on the effect of modifying school calendars (e.g., shorter summer breaks) on student achievement. Thus, their primary effect sizes (standardized mean differences) were from studies on individual students embedded in schools that in turn were embedded in school districts. A total of 56 studies provided primary effect sizes. The Level 1 model with these studies is at the within-study level, and Level 2 is at the between-study level. This is analogous to almost every meta-analysis in the organizational sciences, where there is within-study variance and between-study variance as well. Certain study characteristics (e.g., year the study was published) might be of interest and could be modeled as Level 2 predictors (also see Equations 5 and 11). The most interesting part of such models, however, is that these effect size estimates within each district are likely related. Thus, school district could be modeled as a higher-level grouping factor. To account for this higher level of nesting, a third level could be added to the two-level model discussed earlier. The sources of variance in this three-level model now reflect a between-study within-district variance (i.e., Level 2 variance component) as well as a between-district variance (i.e., Level 3 variance component). Equation 8, which presents the intercept as outcome model at Level 2 with no predictors, is now modified to the Level 3 intercept as outcome model as follows (which is consistent with our previous discussion of first incorporating the third-level nesting structure prior to moderators or covariates):
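(The display below is a reconstruction from the symbols defined after the combined model; the Level 2 line and the random intercept symbol \gamma_{0g} are assumptions added so that the Level 3 line has something to act on.)

$$\delta_{jg} = \gamma_{0g} + u_{jg}, \qquad \gamma_{0g} = \beta_{00} + \eta_{0g} \tag{12}$$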
The combined model with no predictors at Levels 2 and 3 or the null (unconditional) three-level model is as follows:
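(The display below is a reconstruction from the symbol definitions that follow.)

$$d_{jg} = \beta_{00} + \eta_{0g} + u_{jg} + e_{jg} \tag{13}$$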
where g represents the Level 3 units, namely, districts, from 1 through 11 (see Table 2). ujg and η0g represent the random Level 2 and Level 3 variance components. The first step (much like a typical multilevel study) of estimating a null model based on Equation 13 is important in calculating the ICC(1) at each of the two levels: between-study within-district and between-districts levels (M. W. L. Cheung, 2014). That is, by using the variance estimates from each of the three levels, meta-analytic multilevel models can be used to account for the between-study and within-third level (e.g., district in our example) heterogeneity while also estimating intraclass correlation (ICC) in the effect sizes. These ICCs can be defined as:
and
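(The two displays below are reconstructions based on the interpretation given in the text, namely that each index is a share of the total between-study plus between-district variance. The symbols \tau^2_{(2)} and \tau^2_{(3)} for the Level 2 and Level 3 variance components are assumptions.)

$$ICC_{\text{Level 2}} = \frac{\tau^2_{(2)}}{\tau^2_{(2)} + \tau^2_{(3)}} \tag{14}$$

$$ICC_{\text{Level 3}} = \frac{\tau^2_{(3)}}{\tau^2_{(2)} + \tau^2_{(3)}} \tag{15}$$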
These indices can be interpreted as the proportion of the total variance of the effect size across studies due to the Level 2 and Level 3 categories (M. W. L. Cheung, 2014). For instance, in our example that uses district as a Level 3 grouping variable, Level 2 ICC(1) and Level 3 ICC(1) can be interpreted as the proportions of between-study within-district variation and between-district variation, respectively. In our example, these variance components are shown in Table 2. The Level 2 ICC(1) is 0.33, and the Level 3 ICC(1) is 0.66. These statistics indicate that 33% of the variance is between studies, within districts, whereas 66% of the variance resides between districts. While we are not aware of literature discussing the minimum thresholds for ICC(1) in three-level models, Woehr, Loignon, Schmidt, Loughry, and Ohland (2015) found that across group-level and organizational-level studies, the average ICC(1) estimate was 0.21 (SD = 0.15). Thus, 66% is a high enough estimate that investigation of a model wherein studies are nested within districts is justified. The two-level model depicted in Equation 11, with predictors (study characteristics) and incorporating the third level, now results in a conditional model as follows:
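(The display below is a reconstruction that assumes two study characteristics with fixed coefficients γ1 and γ2 and a random intercept that varies across districts; the exact notation of the original display is not recoverable from the text.)

$$\delta_{jg} = \gamma_{0g} + \gamma_1 w_{1jg} + \gamma_2 w_{2jg} + u_{jg}, \qquad \gamma_{0g} = \beta_{00} + \eta_{0g} \tag{16}$$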
The overall combined three-level model is:
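(The display below is a reconstruction under the same assumptions, obtained by substituting the Level 2 and Level 3 models into the Level 1 model.)

$$d_{jg} = \beta_{00} + \gamma_1 w_{1jg} + \gamma_2 w_{2jg} + \eta_{0g} + u_{jg} + e_{jg} \tag{17}$$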
Equation 17 encapsulates several pieces of information. The overall mean effect size estimate (β00) is adjusted for the district grouping factor (η0g). The betas represent the effect size estimate after adjusting for grouping factor and any boundary conditions of the effect size estimate. The gammas (γ1 and γ2) represent boundary conditions or study characteristics at Level 2 and are routinely used in most management research. In practice, the overall mean effect size estimate (β00) at Level 3 plus the coefficients for Level 2 moderators (γ1, γ2, etc.) are typically the most relevant parameters for most meta-analysts.
To facilitate future applications of multilevel meta-analyses, we provide a brief note on the data files and analyses via HLM 7.0. The first 15 rows of data from Konstantopoulos (2011) are reproduced in Table 2. The district column represents the Level 3 ID, while the study column represents the Level 2 ID. The effect size estimate, variance, and year (a study characteristic) columns are labeled as such. The key distinction from typical multilevel model data files is that the same data file is used as the input at all three levels. The file must be sorted by Level 3 ID (district in this case); otherwise, the MDM file (input for HLM) is not read. The rest of the procedure is similar to typical multilevel modeling via HLM 7.0, in which models are built step by step, beginning with the null or unconditional model, with predictors added in subsequent steps. In this case, since we are replicating Table 5 of Konstantopoulos (2011), we model year as a Level 2 predictor and grand mean center it prior to entering it in the equation. Estimation relies on maximum likelihood, which is discussed in extensive detail in Raudenbush and Bryk (2002) for the two-level model and Konstantopoulos (2011) for the three-level model. The findings from the null and conditional models are presented in Table 3.
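The two data preparation steps described above (sorting by the Level 3 ID and grand mean centering year) can be sketched in a few lines of Python. This is a minimal illustration, not part of the HLM workflow itself: the column names, the `prepare_rows` helper, and every data value below are hypothetical and are not taken from Konstantopoulos (2011).

```python
# Hypothetical rows: one record per effect size. All values are made up
# for illustration and do not come from Konstantopoulos (2011).
rows = [
    {"district": 18, "study": 7, "effect": 0.10, "variance": 0.02, "year": 1994},
    {"district": 11, "study": 2, "effect": -0.05, "variance": 0.03, "year": 1976},
    {"district": 12, "study": 5, "effect": 0.25, "variance": 0.01, "year": 1989},
    {"district": 11, "study": 1, "effect": 0.15, "variance": 0.04, "year": 1976},
]


def prepare_rows(rows):
    """Sort by Level 3 ID, then Level 2 ID, and grand mean center year."""
    grand_mean = sum(r["year"] for r in rows) / len(rows)
    ordered = sorted(rows, key=lambda r: (r["district"], r["study"]))
    # Add the centered predictor without mutating the input records.
    return [dict(r, year_gmc=r["year"] - grand_mean) for r in ordered]


prepared = prepare_rows(rows)
for r in prepared:
    print(r["district"], r["study"], r["effect"], r["variance"], r["year_gmc"])
```

The same sorted file would then serve as the input at every level, as the text notes.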
Table 3. HLM Findings.
Note: Numbers in parentheses are those reported in de Wit et al. (2012).
a. HLM does not report confidence intervals. These are from R.
From Table 3, HLM findings are largely similar to those in Konstantopoulos (2011), with one estimate that differs at the third decimal place. In the unconditional model, the overall effect size estimate, 0.184, now represents the effect size for school year modification and student achievement after adjusting for dependencies in data due to district- and study-level random variation. The variance components at both Level 2 (studies within-districts) and Level 3 (between-districts) were greater than zero and significant. In the following, we illustrate how these procedures could be applied to a meta-analysis in the organizational sciences and how published findings might change when dependencies are accounted for.
Illustrative Example From Organizational Science: de Wit et al. (2012)
In this section, we illustrate how an existing, published meta-analysis could be recast as a three-level multilevel model to demonstrate the utility of this approach. The current example, de Wit et al. (2012), is a meta-analysis on the association between intragroup conflict and various group outcomes (e.g., performance, satisfaction) published in the Journal of Applied Psychology. We chose this study primarily due to practical concerns (e.g., large sample size, multiple subgroups, and access to sufficient data), but the steps outlined could be applied to any meta-analysis. In their meta-analytic review, de Wit and colleagues investigated multiple associations and moderating variables. Here, for the sake of brevity, we focus on only one of their outcomes, group performance (see Appendix A of de Wit et al., 2012, for the data). We focus on two associations from their work to illustrate when casting a meta-analysis as a multilevel model is warranted and when it is not: task conflict with performance and relationship conflict with performance, respectively.
Furthermore, we model task type, which includes seven categories (creativity, decision making, production and service, project, mixed set of tasks, other, not applicable), as the hierarchical level within which studies are nested. Conceptually, effect sizes drawn from a particular type of task that groups perform might be nonindependent due to what Kozlowski and Klein (2000) label common fate. That is, all teams performing a creative task, for example, are confronted with different demands (e.g., leveraging expertise, communication patterns) compared to those performing project tasks. The study authors model task type as a moderator, and our modeling illustrates how a moderator and higher-level nesting are not equivalent. In summary, following our rationale and equations described previously, we modeled all effect size estimates for the task conflict-performance association at Level 1, which are then nested within studies at Level 2 and task type at Level 3.
We prepared the data file in exactly the same way as shown earlier for Konstantopoulos (2011). We had 95 effect sizes overall for the task conflict-performance association, with a mean effect size estimate of –0.07 and a standard deviation of 0.25. We then ran a three-level null or unconditional model with Level 1 variance set to computed variance (computed per Equation 7). The estimates from this model are in Table 3. Based on the Level 2 and Level 3 variance estimates (see Table 3), we computed ICC(1) at the study level and task type level per Equations 14 and 15 to be 0.81 and 0.19, respectively. These estimates indicate that 81% of the variance in effect sizes is between studies, and 19% is between the types of tasks a team is performing. While there are no established cutoffs for Level 3 ICC(1)s in Vknown models, we believe that 19% is substantial enough to warrant further modeling (for a discussion of meaningful ICCs, see Bliese, 2000; Woehr et al., 2015).
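The ICC(1) computation from null-model variance estimates can be sketched as below. The sketch assumes the ICCs are the usual proportions of the combined Level 2 and Level 3 variance (following M. W. L. Cheung, 2014). The function name and the variance components passed to it are hypothetical and are not the Table 3 estimates.

```python
def icc_three_level(tau2_level2, tau2_level3):
    """Return (Level 2 ICC(1), Level 3 ICC(1)) from null-model variance estimates.

    Each ICC is the share of the combined between-study (Level 2) and
    between-group (Level 3) variance attributable to that level.
    """
    total = tau2_level2 + tau2_level3
    if total <= 0:
        raise ValueError("Variance components must sum to a positive value.")
    return tau2_level2 / total, tau2_level3 / total


# Hypothetical variance components, chosen only to illustrate the arithmetic.
icc_level2, icc_level3 = icc_three_level(tau2_level2=0.04, tau2_level3=0.01)
print(round(icc_level2, 2), round(icc_level3, 2))
```

Because both indices share one denominator, they sum to 1, so a large Level 2 ICC(1) necessarily implies a small Level 3 ICC(1).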
From Table 3, the overall estimate is –0.07 (SE = 0.04, CI [–0.16 to 0.01]), which differs from the overall estimate (–0.01) and falls outside the confidence interval (–0.06 to 0.04) originally reported by de Wit et al. (2012). Further, this estimate is different from modeling task type as a moderator (de Wit et al., 2012, report these findings in Table 6) as these authors concluded that task type did not moderate the association between task conflict and performance. Modeling task type as a moderator versus a higher-level unit is thus conceptually and statistically distinct. The former relies on an assumption that an x-y association depends on values of z, whereas the latter suggests that each rating on x and y might not be independent and could be similar within categories of z. Thus, as noted previously, modeling the same variable as a moderator in a traditional meta-analysis versus as a Level 3 grouping variable involves different conceptual assumptions and may yield distinct empirical conclusions. Our Level 3 ICC(1) indicates that the association between task conflict and performance might differ when nested under task type, and the overall estimate is different from one that does not account for such dependencies. Furthermore, the confidence bands differ as well.
Next, we modeled the association between relationship conflict and performance while using the same nesting structure (i.e., effect sizes nested within studies, which were further nested within task types). Following the same procedures as those described previously for running null models, we found the ICC(1) (0.95) to be much larger at the study level (i.e., Level 2) in this instance than the ICC(1) (0.05) at the task type level (Level 3). In such instances, wherein the Level 3 ICC(1) is negligible (Bliese, 2000), we do not recommend proceeding with a multilevel model, as the association between relationship conflict and performance does not vary substantially across task types. However, for the sake of continuing our exposition, we also report these findings in Table 3. Here, the overall estimate is –0.16 (SE = 0.03, CI [–0.21 to –0.11]), which is nearly identical to the estimate reported by de Wit et al. (2012; Table 3).
Discussion
The current work challenges a fundamental assumption of one of the most popular statistical techniques in management research (i.e., meta-analyses). Specifically, there seems to be a widespread assumption or misconception in our field regarding independence of primary effect sizes between studies. As illustrated by our literature review (see Table 1), less than 5% of the work published in elite journals addresses or tests for this assumption. If such dependencies are present, however, they lead to erroneous conclusions regarding the confidence intervals, point estimates, and standard errors reported, which might ultimately degrade the confidence one can place in meta-analytic reports (Becker & Kim, 2002; Beretvas & Pastor, 2003; M. W. L. Cheung, 2014; S. F. Cheung & Chan, 2004; Cooper, 1979; Greenhouse & Iyengar, 1994; Hedges & Olkin, 1985; Landman & Dawes, 1982; Marín-Martínez & Sánchez-Meca, 1999; Raudenbush et al., 1988; Romano & Kromrey, 2009; Rosenthal & Rubin, 1986; Tracz et al., 1992). To help researchers address violations of the independence assumption in their meta-analyses, we presented an integration of multilevel and meta-analytic models and provided illustrative examples highlighting how modeling dependence in primary effect sizes results in differing effect size estimates and confidence bands. We also developed a web-based application to support further applications of this technique. Taken as a whole, we believe the current article helps move the field of organizational science forward in some important ways.
Based on the equations and examples introduced, we adopted a step-by-step model building approach, wherein conceptually identifying how and why effect sizes might be dependent is the first step. Researchers can then test a null model that allows one to compute the ICC(1) at Level 3. When the ICC(1) reveals a substantial amount of variance (see Equation 15), it is likely that ignoring such between-study dependencies will yield biased estimates and confidence bands. For example, our reanalysis of de Wit et al. (2012) demonstrates that not only does the effect size for task conflict and performance change when effect sizes are nested within task types, but the substantive conclusions may be called into question. de Wit et al., whose article has been cited over 700 times, found that task conflict was unique in that it had a negligible relationship with team performance, whereas other forms of conflict were consistently negatively related to this criterion. Our findings suggest that task conflict may in fact be negatively related to team performance. Thus, the paradoxical effects of task conflict that de Wit et al. report, and that subsequent researchers have sought to untangle, may be more ephemeral than originally suggested. This shift in findings has direct implications for how team researchers conceptualize team conflict and how practitioners and managers choose to cultivate “constructive controversy” in their groups. To be clear, we do not anticipate that such differences in findings and conclusions are unique to de Wit et al.’s study. Instead, we believe that such dependencies may be operating in a number of literatures.
We illustrated how such models could be interpreted and provided materials that facilitate the execution of such analyses. Specifically, we included materials to replicate findings via the HLM software and a newly developed application that is freely available at https://orgscience.uncc.edu/about-us/research/resources and is discussed further in the Appendix. Having introduced and demonstrated the utility of meta-analyses as a multilevel model, we now discuss forms of dependency and nesting structures other than those illustrated in our examples.
Effect sizes included in a meta-analysis could have complex forms of dependencies (M. W. L. Cheung, 2014). The most common nesting structure that management scholars are likely to encounter is one in which primary study effect sizes (Level 1) are nested within studies (Level 2), which are nested within another unit, such as district, hierarchical level, and so on (Level 3; Tanner-Smith, Tipton, & Polanin, 2016). In fact, across all examples in our work, we implemented this type of nesting structure. Many alternative nesting structures are plausible, however. One alternative form of nesting could be attributed to the presence of multiple dependent variables (Van den Noortgate et al., 2015). For instance, suppose that a meta-analyst is examining the effect of team processes on various measures of team effectiveness (e.g., team performance and team viability: LePine, Piccolo, Jackson, Mathieu, & Saul, 2008). If the two outcome variables are positively correlated and some studies measured both outcome variables, then we would expect between-study dependencies to emerge. Thus, in these circumstances, there is nesting due to the types of outcome measures researchers are using. Although we anticipate that the application of these models should unfold in a similar manner as discussed here, and our preliminary testing supports this assumption, future meta-analyses that explicitly adopt these types of multilevel models would be quite informative.
Until now, we have focused on nested random effects, as these are the most intuitive multilevel designs. However, as with primary data collections, meta-analyses may also feature cross-classified structures (Fernandez-Castilla et al., in press). These structures are warranted when multiple Level 1 units are “nested” within a higher-level grouping variable and thus group membership is not unique (see Gooty & Yammarino, 2011, for a review). For instance, such a cross-classified structure could occur when multiple effect sizes for the same association (say, corporate social responsibility with firm performance) within a primary study are drawn from three different databases. While the x-y association itself is the same conceptually, the estimate of that association differs based on which database it is drawn from. Thus, effect sizes could be simultaneously “nested” within databases and, say, a higher-level variable such as country. Interestingly, this structure may be common in the strategy or entrepreneurship literature given the nature of the data that are regularly used.
Finally, Vknown multilevel models can provide additional insights via the random-effects and variance estimates, using the ICC calculations we demonstrated earlier. Indeed, an important caveat to our work is that multilevel modeling of primary effect sizes is applicable when the Level 3 grouping factor reveals a substantial Level 3 ICC(1), as it did in Konstantopoulos (2011) and de Wit et al. (2012) for effect sizes pertaining to task conflict but not relationship conflict. Along with the ICC(1) indices, these models could provide estimates that are useful in interpreting an I² index (Higgins & Thompson, 2002). The I² index slightly extends the logic of the ICC indices, which originate in a multilevel modeling framework, by incorporating the known sampling variance that is estimated in a meta-analysis (Vj). Returning to the Konstantopoulos example, the I² for the three-level model would be .95. This suggests that 95% of the variance in effect sizes is due to characteristics of either Level 2 (study) or Level 3 (district) grouping variables. By subtracting the I² value from 1, one can determine the proportion of variance due to sampling error (i.e., random fluctuations across effect sizes). This value might be especially useful as it indicates how much of the heterogeneity in effect sizes is systematic (i.e., Level 2 or Level 3 grouping variables) and how much can be considered random (i.e., the “typical” within-study sampling variance, Vj).
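The I² logic described above can be sketched as follows. The sketch assumes the “typical” within-study sampling variance is computed with the Higgins and Thompson (2002) formula and that the three-level I² is the share of total variance attributable to Level 2 plus Level 3. Whether the authors' application computes it in exactly this way is an assumption, and the function names and input values are hypothetical.

```python
def typical_sampling_variance(variances):
    """Higgins-Thompson 'typical' within-study variance from known variances Vj."""
    k = len(variances)
    if k < 2:
        raise ValueError("At least two effect sizes are required.")
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    return (k - 1) * sum_w / (sum_w ** 2 - sum_w2)


def i_squared_three_level(tau2_level2, tau2_level3, variances):
    """Split total variance into Level 2, Level 3, and sampling proportions."""
    v_typical = typical_sampling_variance(variances)
    total = tau2_level2 + tau2_level3 + v_typical
    return {
        "level2": tau2_level2 / total,
        "level3": tau2_level3 / total,
        "total": (tau2_level2 + tau2_level3) / total,  # overall I-squared
        "sampling": v_typical / total,                  # 1 minus I-squared
    }


# Hypothetical inputs: four effect sizes sharing one sampling variance.
result = i_squared_three_level(0.03, 0.05, [0.02, 0.02, 0.02, 0.02])
print({name: round(value, 2) for name, value in result.items()})
```

When all sampling variances are equal, the typical variance reduces to that common value, which gives a quick sanity check on the formula.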
Limitations and Future Research
Although we believe our work presents an important integration of two existing analytic frameworks within management research, we should acknowledge a few limitations. First, we have not provided resources that help correct for measurement error and range restriction prior to entering the effect sizes into multilevel models, which is typically done in an H&S meta-analysis. We suspect that effect sizes could be corrected and then entered into the formulas we present here, but we have yet to vet those procedures. Second, although correcting for artifactual variance and considering the nested nature of data helps improve estimation of population parameters, the results are accurate only insofar as the data are free of bias. For example, publication bias and potential outliers can still introduce systematic and random bias into a meta-analysis (Banks et al., 2012; Kepes, Banks, McDaniel, & Whetzel, 2012). If sample-level publication bias exists or primary studies report biased estimates due to questionable research practices (O’Boyle, Banks, & Gonzalez-Mule, 2017), meta-analytic estimates could still be problematic to interpret.
Third, we were limited by the availability of data in our examples of meta-analyses. We believe future work could avoid such limitations when conducting multilevel meta-analyses by coding for a conceptual dependency while collecting data. Fourth, HLM only allows for full maximum likelihood (FML) estimation in three-level models at the time of writing this article. The benefits of restricted maximum likelihood (REML) have been discussed elsewhere, and REML is preferable for smaller sample sizes, as is the case with our Level 3 units. This could be seen as a limitation of our work as well, although FML is more conservative than REML. Further, we report REML findings for the same examples in the Appendix via our application.
It is also worth highlighting that we have focused our efforts on delineating the necessity and benefits of three-level, multilevel meta-analyses. Nevertheless, researchers may be confronted with nesting structures that extend well beyond three levels (e.g., four, five, six layers of between-study dependencies). At the time of writing this article, HLM could handle four levels, but many software programs are limited in handling more than three levels. We encourage meta-analysts to acknowledge these limitations when confronted with these types of nesting structures. An exciting future direction is to conduct multilevel meta-analyses from the outset rather than applying multilevel techniques to existing meta-analyses, as we did here. Specifically, we focused on the application of multilevel modeling to existing meta-analytic data. However, these techniques could be extended to meta-analyses using a levels perspective, wherein effect sizes are nested under studies, which are in turn nested within levels of analysis and could be coded for such structures a priori (e.g., intra- and interindividual, teams, etc.).
Finally, a review of the most influential meta-analyses in management could be warranted based on modeling dependence as we have shown here (DeSimone et al., 2018). This could have key implications for future theory testing, policymaking, and practice as well. For example, many meta-analyses of leader-member exchange (e.g., Dulebohn, Bommer, Liden, Brouer, & Ferris, 2012; Martin, Guillaume, Thomas, Lee, & Epitropaki, 2016) suggest follower-reported leader-member exchange is positively associated with performance ratings. Would this association hold if country or geographic region were included as a third-level grouping variable? Or if performance was self-, other-, or machine-rated? Thus, several key questions remain in the organizational sciences that could be more precisely answered via the techniques we demonstrate.
Conclusion
In conclusion, we recommend that future meta-analysts carefully think through the potential for dependency in their effect sizes at the project conceptualization stage, articulate various nesting structures (as there might be more than one), collect data and code for such structures, and ultimately, test for these dependencies. Put simply, where the Level 3 ICC(1) is appreciable, we recommend that scholars report findings from multilevel modeling, as we demonstrated here, because doing so leads to greater accuracy and a lower probability of errors in the confidence band of the effect size estimate. Ultimately, these practices will culminate in a more rigorous and precise science.
Supplemental Material
Appendix - Meta-Analyses as a Multi-Level Model
Appendix for Meta-Analyses as a Multi-Level Model by Janaki Gooty, George C. Banks, Andrew C. Loignon, Scott Tonidandel and Courtney E. Williams in Organizational Research Methods
Authors’ Note
George C. Banks and Andrew Loignon contributed equally to this work.
Acknowledgments
We thank Erik Gonzalez-Mule for helpful comments and feedback on certain aspects of this article. We are grateful to John Antonakis for helping us with the Monte Carlo simulation.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author was supported by a Belk College of Business Summer Research Grant in developing this work.