Abstract
The formulae for attenuation correction in meta-analysis treat reliabilities as if they were independent of each other. The current study puts this assumption of independence to the test by empirically examining the correlation among predictor and criterion reliability estimates across studies. Interdependence of reliabilities would result in either overestimation or underestimation of population correlations depending on the direction of the relationship between the reliabilities. We conducted two studies to examine the extent to which predictor and criterion reliabilities correlate across studies. Study 1 is based on 628 pairs of reliability estimates from 518 studies published in the Academy of Management Journal and the Journal of Applied Psychology between 2004 and 2011, while Study 2 is based on 564 pairs of reliability estimates from 347 studies included in a meta-analysis on perceived organizational support (POS) and some of its antecedents and outcomes. The findings in both studies show substantial correlations between predictor and criterion reliability coefficients across studies. Our article discusses important implications from these findings for future research and for the future conduct of meta-analyses.
Meta-analysis is useful in many areas of empirical research because it combines and integrates findings from the various studies that have been conducted on a given topic. More specifically, it summarizes findings across studies and corrects for common study artifacts that bias estimates of true relationships. While researchers have noted that meta-analysis cannot make lemonade out of lemons, meaning its results are influenced by the quality of the primary-level studies on which the meta-analysis is based, it will usually provide more accurate and reliable estimates than primary-level studies given its use of larger samples and corrections for measurement error and range restriction (Aguinis, Pierce, Bosco, Dalton, & Dalton, 2011).
But what if the process that we currently use for artifact correction is itself flawed? In that case, our estimates of the underlying population relationships among variables may be biased, resulting in over- or underestimation of true relationships. Such inaccurate estimation of relationships would have implications not only for science, but for practice as well. Meta-analysis results are widely used, for example in organizational decision making, in the construction and legal defense of personnel selection and training systems, and in estimating the utilities of interventions. If meta-analytic estimates are inaccurate, then the conclusions in our results and discussion sections are misleading, and our advice to organizations is misguided.
Current meta-analytic practice commonly involves, among other things, correction of mean observed effect sizes for unreliability in one or more of the variables involved (Cortina, 2003). The form of unreliability (e.g., internal consistency, interrater) varies from study to study, and sometimes even varies within study (e.g., correcting the predictor for internal inconsistency and the criterion for interrater unreliability). Regardless of the form of unreliability, current correction practices rely on formulae that assume independence of reliabilities. For example, Aguinis, Dalton, Bosco, Pierce, and Dalton (2011) report that 83.5% of the 196 meta-analyses included in their review of meta-analyses published in top management and organizational studies journals between 1982 and 2009 used the Hunter and Schmidt (1990, 2004) approach, which recommends the use of formulae that correct for study artifacts as if they were independent of one another.
There are, however, many factors that might influence the quality of measurement on both the predictor and criterion side. Some of these factors are of a statistical nature. First, the true relationship between the underlying constructs in a given study, that is, their shared true score variance, connects the predictor and criterion reliabilities for these two constructs. Given that reliability is defined as the proportion of observed variance that is true score variance, the correlation between true score variances in two correlated constructs will make their reliabilities correlate. Second, range restriction in the sample will affect the measurement of both the predictor and the criterion, and thus their reliabilities, creating covariation between them. Apart from statistical factors, there are also likely to be situational factors that create a correlation between reliability indices, such as the setting in which data were collected, the nature of the assessment, or the source of the data.
The presence of these factors should cause estimates of reliability to be related, resulting in violation of the assumption of independence in the commonly used meta-analysis formulae. If this assumption is violated, then the standard estimator of the corrected correlation is biased. The purpose of this article is to examine and test the assumption of independence among reliabilities by empirically estimating relationships among predictor and criterion reliabilities. To our knowledge, this is the first article to empirically address the concerns regarding lack of independence raised by previous research. Observed relationships among predictor and criterion reliabilities would have important implications for the interpretation of previous meta-analyses and the conduct of future meta-analyses.
After reviewing the common procedure for correction for artifacts, we discuss the underlying assumption of independence and the potential challenges that interdependence between artifacts would create. We provide evidence from two independent meta-analyses, one including 518 studies published in the Academy of Management Journal and the Journal of Applied Psychology between 2004 and 2011 and one including 347 studies on the relationship between POS and related variables, that shows that study artifacts are in fact substantially correlated. We conclude with a discussion of implications for future research and for the conduct of future meta-analyses.
Typical Correction for Artifacts
Consistent with recommendations by Hunter and Schmidt (2004), Raju, Burke, Normand, and Langlois (1991), and others for conducting meta-analysis, the artifacts that are most commonly corrected for are the reliability of the predictor ($r_{xx}$), the reliability of the criterion ($r_{yy}$), and range restriction. Cortina (2003) showed that 78.6% of the 1,647 meta-analyses that were published in the Journal of Applied Psychology through 1997 corrected for one or more of these artifacts. Out of these 1,294 meta-analyses, 275 (approximately 21%) corrected the sample size-weighted mean correlation for more than one of these artifacts. Aguinis, Dalton, et al. (2011) showed that 51.2% of 5,560 meta-analyzed effect sizes included in their review corrected for measurement error in the dependent variable and 48.3% of 5,346 meta-analyzed effect sizes corrected for measurement error in the independent variable. Only 9.5% of 5,541 meta-analyzed effect sizes were corrected for range restriction on the dependent variable and 11.6% of 5,568 meta-analyzed effect sizes were corrected for range restriction on the independent variable.
Traditional correction formulae allow one to correct for multiple artifacts simultaneously. Earlier formulae assumed statistical independence among range restriction estimates, estimates of predictor reliability, and estimates of criterion reliability (Mendoza & Mumford, 1987). Later work by Mendoza and Mumford (1987), Theron (1998), and others identified the mathematical connections between range restriction and reliability, thus leading to the development of correction formulae that take this interdependence into account. However, no such formulae exist for reliabilities, and given the panoply of sources of overlap among reliabilities, it may be that such formulae are not possible.
The assumption of independent reliabilities is based on one of the axioms of classical test theory that states that errors associated with one measure are uncorrelated with errors associated with another measure (e.g., Lord & Novick, 1968; Novick, 1966; Zimmerman & Williams, 1977). This independence assumption is a within-study phenomenon, meaning that errors associated with the measurement of the true score of variable X in a given study should be uncorrelated with errors associated with the measurement of the true score of variable Y in that same study. While this assumption has been contested by previous research (e.g., Cronbach, 1947; Novick, 1966; Rozeboom, 1966; Viswesvaran, Ones, & Schmidt, 1996), attempts to estimate it empirically are bound to fail. To estimate the correlation of errors within studies, one would have to compute the measurement error of variables X and Y as the difference between the true score and the observed score for each of the variables. Then one could correlate the error terms for the measurement of variables X and Y. However, we do not know the true score in the first place. Zimmerman and Williams (1977) have developed an adjusted formula for attenuation correction that takes correlations of measurement errors into account. But the fact that we cannot obtain empirical estimates for the errors reduces the utility of this work.
When considering the formulae used for the correction of attenuation in meta-analysis, however, we face a slightly different issue. While being conceptually based on the same rationale as the correction for attenuation in a single study case, the formulae in meta-analysis typically use a mean attenuation correction based on artifact distributions (e.g., Hunter & Schmidt, 2004; Pearlman, Schmidt, & Hunter, 1980). The question of independence then becomes a cross-study issue, and as such assumptions of independence based in classical test theory may be even more complicated.
In the next section, we review the correction for unreliability that is commonly used in the psychometric tradition of meta-analysis represented by the Hunter and Schmidt (1990, 2004) approach. As we pointed out earlier, according to Aguinis, Dalton, et al. (2011), this is the most commonly used approach in published meta-analyses in top management and organizational studies journals. We then discuss arguments for and against the assumption of independence at the cross-study level and derive research questions for the current article.
Correction for Unreliability
Variables are never measured without error. Measurement error masks the relationship between the true scores of the variables, usually resulting in an observed correlation that is weaker than the true score correlation. To better estimate the relationship between the true scores of two variables, researchers can either improve their measures or use formulae to correct for attenuation (Hunter & Schmidt, 2004; Pearlman et al., 1980).
Formula 1 expresses the general correction for attenuation for a single relationship between two variables X and Y.
The numerator represents the correlation between variables X and Y that is observed in a given study. The denominator contains the product of the two correction factors, that is, the square roots of the reliabilities of the measures of X and Y. With this formula, researchers “correct” for unreliability of the measures they used to assess X and Y, with the result being an estimate of the true score correlation between X and Y, that is, the correlation if they could be measured without error.
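Written out from the description above, the single-study correction takes the following standard form. The symbols are assumptions introduced for this sketch: $r_{xy}$ is the observed correlation, $r_{xx}$ and $r_{yy}$ are the reliabilities of the measures of X and Y, and $\hat{\rho}_{xy}$ is the estimated true score correlation.

```latex
\hat{\rho}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}}\,\sqrt{r_{yy}}}
```

As a hypothetical illustration (values not taken from any study), an observed correlation of .30 with reliabilities of .80 and .70 would yield $.30 / (\sqrt{.80}\,\sqrt{.70}) \approx .30 / .748 \approx .40$.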
Just as authors of primary studies wish to estimate correlations between variables measured without error, so too do meta-analysts wish to correct mean effect sizes for flaws such as unreliability. To correct for attenuation in meta-analysis estimates, mean attenuation factors for predictor and criterion unreliability are typically computed (Hunter & Schmidt, 2004). These mean attenuation factors are based on reliability information from the studies in the meta-analysis that report it. The sample-size-weighted mean correlation in meta-analysis is then corrected for predictor and criterion unreliability by dividing by these mean attenuation factors (see Formula 2).
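The procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the exact Hunter and Schmidt (2004) implementation: the function name, the unweighted averaging of the attenuation factors, and all input values are assumptions made for the example.

```python
from math import sqrt

def corrected_mean_correlation(rs, ns, rxx, ryy):
    """Correct a sample-size-weighted mean correlation for unreliability
    by dividing by mean attenuation factors (hypothetical helper)."""
    # Sample-size-weighted mean of the observed correlations.
    r_bar = sum(n * r for r, n in zip(rs, ns)) / sum(ns)
    # Mean attenuation factors: average square roots of the reported
    # reliabilities (unweighted averaging is an assumption of this sketch).
    a_bar = sum(sqrt(v) for v in rxx) / len(rxx)
    b_bar = sum(sqrt(v) for v in ryy) / len(ryy)
    return r_bar / (a_bar * b_bar)

# Hypothetical values for three studies (not taken from the article).
rs = [0.20, 0.30, 0.25]    # observed correlations
ns = [100, 200, 100]       # sample sizes
rxx = [0.81, 0.64]         # only two studies report predictor reliability
ryy = [0.64, 0.81, 0.49]   # all three report criterion reliability

rho_hat = corrected_mean_correlation(rs, ns, rxx, ryy)
print(round(rho_hat, 3))
```

Note that the two mean attenuation factors are computed separately and then multiplied, which is the step at which independence of the reliabilities across studies is implicitly assumed.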
By aggregating reliabilities across the studies included in the meta-analysis and representing reliability as a single aggregate measure, we should no longer be concerned solely with interdependence issues at the within-study level. We also need to be concerned with interdependence at the between-study level. As outlined above, unfortunately, we cannot empirically assess the correlation of errors at the within-study level. We can, however, empirically examine the interdependence of reliabilities at the between-study level to determine the severity of this issue. In the following section, we provide a discussion of the arguments that previous research has raised with regard to interdependence among reliability estimates across studies.
The Assumption of Independence of Reliabilities Across Studies
James, Demaree, and Mulaik (1986) explain that the validity generalization procedures underlying the psychometric approach to meta-analysis are built on the following structural equation: $r_i = \rho_i + e_i$. In this equation, $r_i$ is the observed score for the subject/sample, $\rho_i$ is the true score for the subject/sample, and $e_i$ is the random measurement or sampling error. As James et al. note, given that both the true score and the error are unknown, this equation is underidentified. So, to make use of this equation, validity generalization (VG) procedures need to make a set of assumptions to facilitate estimation of the true and error values. As James et al. summarize,
Given the basic structural equation $r_i = \rho_i + e_i$, it is assumed that (a) the mean error is zero within each study (population, situation), (b) $\rho_i$ and $e_i$ are unrelated across studies, and (c) $\sigma_r^2 = \sigma_\rho^2 + \sigma_e^2$ (cf. Callender & Osburn, 1980, 1982; Hunter, Schmidt, & Jackson, 1982). Furthermore, implicit in the use of several VG estimating equations is the assumption that the errors are normally distributed and/or the assumption that the within-study error variances are homogeneous. (p. 446)
Apart from these widely known and discussed statistical assumptions, however, Hunter and Schmidt’s (2004) approach to meta-analysis is based on further assumptions that are less well-known and that speak directly to the issue of independence among statistical artifacts. These assumptions are not of a statistical nature, but they nevertheless are at the heart of the artifact correction formulae that most organizational researchers use. In their book, Hunter and Schmidt (2004) wrote,
The algebraic formulas used in this meta-analysis make certain assumptions, namely, that artifact parameters are (1) independent of the actual correlation and (2) independent of each other. Examination of the substantive nature of artifacts suggests that the independence assumptions are reasonable; most artifact values are less a matter of scientific choice and more a function of situational and resource constraints. The constraints for different artifacts usually have little to do with each other. (p. 139)
This procedure assumes independence of artifactual variance sources across studies. Specifically, it assumes that there is no correlation across studies among criterion reliability, test reliability, and range restriction. (Note that this assumption is different from, and not contradictory to, the fact that there is a mathematical interaction among these effects within a given study.) (p. 405)
Since the tests used were typically commercially published instruments, their reliability is not likely to vary with criterion reliability or range restriction in specific studies. So the assumption of no correlation between test reliabilities and the other two sources of variance appears to be justified.
While the data for the Pearlman et al. meta-analysis consisted of studies in the area of personnel selection, in which commercially available tests were widely available, the same cannot be said for many other meta-analyses conducted on research topics that would not often use a commercially available test (e.g., organizational climate, psychological safety, organizational citizenship behaviors). It is reasonable to suppose that when researchers choose scales for their studies, they apply similar kinds of criteria for the selection of the measures for the predictor and the outcome. For example, how many items do the measures have (i.e., how long will it take participants to complete), has each measure been shown to be reliable in previous studies, or can the measures be administered simultaneously? Thus, while factors that lead to nonindependence might not stem from the construction of the scales per se, they may stem from the ways in which the tests are administered, the characteristics of the sample, or other situational constraints. So, in the end, whether predictor and criterion reliabilities are correlated is really an empirical question that should be answered by collecting the appropriate data.
Interestingly, the new meta-analytic approaches that were created to account for the fact that true scores and errors are correlated also include variance components to account for the correlation between statistical artifacts in their breakdown of total variance (see Mendoza & Reinhardt, 1991; Raju et al., 1991; Raju et al., 1998). However, none of the studies explains what the assumption of interdependence is based on. Furthermore, none of these studies has examined empirically the effect of correlated reliabilities across studies on the corrected correlations. Rather, the main purpose of these studies was to explore the utility of the new approaches as compared to other established meta-analytic techniques. The Monte Carlo simulations used in these articles focus on other interdependencies, mostly between the true score and the error score. There is a clear gap in our knowledge here of the magnitude of reliability correlations and the effect that correlated reliabilities have on our estimates of the corrected mean correlation. In the next section, we explore potential statistical and situational factors that can create interdependencies between predictor and criterion reliabilities across studies before examining empirically the question of correlated predictor and criterion reliabilities across studies.
Factors Contributing to Interdependence Between Predictor and Criterion Reliabilities
There are likely to be many factors that contribute to a correlation between predictor and criterion reliabilities across studies in a meta-analysis. We focus on two categories of factors, statistical and situational. To illustrate our argument, we use as an example a hypothetical meta-analysis of the relationship between the constructs conscientiousness and overall job performance. The meta-analysis consists of a number of original studies with different sample sizes and different measures of the focal constructs. Based on values taken from these studies, the meta-analyst computed the sample-size-weighted mean correlation and variance, correcting for sampling error in both estimates. In the next step, the researcher set out to correct for statistical artifacts as per Hunter and Schmidt (2004). Hence, our example meta-analysis is quite similar in design to many published meta-analyses.
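The first step of the hypothetical meta-analysis described above can be sketched as follows. This is a minimal sketch of a "bare-bones" analysis, assuming the commonly cited sampling error variance approximation $(1 - \bar{r}^2)^2 / (\bar{N} - 1)$; the function name and all input values are hypothetical.

```python
def bare_bones(rs, ns):
    """Sample-size-weighted mean correlation and variance, with the
    variance expected from sampling error removed (hypothetical helper)."""
    total_n = sum(ns)
    k = len(rs)
    # Sample-size-weighted mean correlation.
    r_bar = sum(n * r for r, n in zip(rs, ns)) / total_n
    # Sample-size-weighted observed variance of the correlations.
    var_obs = sum(n * (r - r_bar) ** 2 for r, n in zip(rs, ns)) / total_n
    # Approximate sampling error variance, based on the average sample size.
    n_bar = total_n / k
    var_err = (1 - r_bar ** 2) ** 2 / (n_bar - 1)
    # Residual variance, floored at zero.
    var_res = max(var_obs - var_err, 0.0)
    return r_bar, var_obs, var_err, var_res

# Hypothetical conscientiousness/performance correlations from three studies.
r_bar, var_obs, var_err, var_res = bare_bones([0.10, 0.30, 0.50], [100, 100, 100])
print(round(r_bar, 2), round(var_obs, 4), round(var_err, 4), round(var_res, 4))
```

The corrections for unreliability discussed in this article are applied to the outputs of this step.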
We first discuss two statistical factors that are likely to create a correlation between the predictor and criterion reliabilities: true score variance and range restriction. Overlap between true score variances creates an underlying connection between reliability estimates.1 An observed correlation between two variables, in our example conscientiousness and job performance, is a function of the underlying relationship between the true scores of these two variables. At the same time, the respective reliabilities of these two variables are also affected by the true scores in that they represent the proportion of true score variance in the observed variance for the variables. Thus, if there is an underlying relationship between the true score of conscientiousness and the true score of job performance, there should also be a correlation between their reliabilities given that the true scores in one variable covary with the true scores of the other.
A related, but slightly distinct, influence factor is range restriction, which could differentially affect reliabilities in primary studies (as it would also affect the true score variance and the observed correlation between the two variables). In our example, this could mean that one study included in our meta-analysis collected data on the relationship between conscientiousness and job performance from senior job incumbents. Given that these incumbents likely have progressed in the organization in part due to their conscientiousness and prior job performance, this data set would likely suffer from direct range restriction, thus suppressing reliabilities among other things. At the same time, our meta-analysis might include a second study in which data for conscientiousness and job performance were collected from a sample of interns or newly hired employees. This sample likely suffers much less from direct range restriction, and hence the reliabilities for both the predictor and the criterion would be higher. The facts that both reliabilities would be low in studies suffering from range restriction and that both reliabilities would be higher in studies not suffering from range restriction would create covariation of predictor and criterion reliabilities across studies in our meta-analysis.2
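The mechanism in this example can be illustrated numerically. The sketch below assumes the classical result that, if error variance is unaffected by selection, reliability in a restricted sample equals $1 - (1 - r)/u^2$, where $r$ is the unrestricted reliability and $u$ is the ratio of restricted to unrestricted standard deviations. The function name and all values are hypothetical.

```python
def restricted_reliability(r_unrestricted, u):
    """Reliability in a range-restricted sample, assuming error variance
    is the same in the restricted and unrestricted groups."""
    return 1 - (1 - r_unrestricted) / u ** 2

# Hypothetical unrestricted reliabilities for both measures.
rxx, ryy = 0.85, 0.85

# New hires: essentially no range restriction (u = 1.0 on both measures).
new_hires = (restricted_reliability(rxx, 1.0), restricted_reliability(ryy, 1.0))

# Senior incumbents: restricted on both measures (u values are assumptions).
incumbents = (restricted_reliability(rxx, 0.7), restricted_reliability(ryy, 0.8))

print([round(v, 3) for v in new_hires])
print([round(v, 3) for v in incumbents])
```

Under these assumptions, both reliabilities are lower in the restricted sample and both are higher in the unrestricted sample, which is the pattern that produces covariation across studies.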
A different line of argument in support of interdependencies between predictor and criterion reliabilities can be built based on work by James and colleagues (e.g., James, Demaree, Mulaik, & Ladd, 1992; James, Demaree, Mulaik, & Mumford, 1988; James et al., 1986), which focused on an examination of the influence of situational factors on VG procedures. James et al. (1992) suggest that traditional VG models establish the existence of situational moderators only indirectly by assessing whether substantial residual variance is left over after accounting for variance attributable to statistical artifacts. These models do not in fact model or test potential moderating effects. The residualization approach assumes that situational moderators affecting true validities are independent of statistical artifacts. However, James et al. assume that situational factors (such as organizational settings or biases; James et al., 1986, 1988) affect both the true variance and the artifacts, which makes the “artifacts” not actually artifacts but an expression of an omitted spurious effect of situational factors. These situational factors affecting true validities and statistical artifacts generate correlations between the artifacts and the true validity. This correlation violates the assumption of errors being uncorrelated with true scores and thus renders the residualization approach (as well as the formulae for corrected correlations and variance) flawed, as it cannot account for the interdependence.
Going one step further, we argue that the same situational moderators can affect both predictor and criterion reliabilities and create interdependencies among the statistical artifacts. Above and beyond modeling the correlation between true validity and statistical artifacts, the residualization approach would also need to account for correlations between the reliabilities and between other statistical artifacts to avoid overcorrection. Failure to account for the correlation between reliabilities would result in double correction for the same measurement error.
Applying this thinking to our example of a meta-analysis between conscientiousness and job performance, the type of organization could impact both the true correlation between these two variables and their respective reliabilities. For example, employees in organizations in which level of detail is important (e.g., accounting firms) might need to be more conscientious than employees in other organizations in which level of detail is not as important. This would mean that over time only the more conscientious employees remain and ascend within the organization, which leads to range restriction in our independent variable. This situational factor would then be associated with weaker validities for the conscientiousness/performance relationship in detail-oriented organizations than in other, nonrestricted, organizations. This restriction of range would also lead to smaller reliabilities for both the predictor and the criterion.
Alternatively, given that employees might be more detail-oriented and conscientious in detail-oriented organizations, they might devote more attention to completing the questionnaires provided to them. This in turn might generate more reliable measures of the two constructs than could be expected in other organizations. This would contribute to a correlation between the predictor and criterion reliabilities. In both cases, we would expect to see stronger correlations between the predictor and criterion reliabilities across primary-level studies, which in turn would lead the meta-analyst to overcorrect the sample-weighted mean correlation.
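The direction of the resulting bias can be sketched with a short derivation. The notation is introduced here for illustration: $a_i = \sqrt{r_{xx,i}}$ and $b_i = \sqrt{r_{yy,i}}$ are the attenuation factors in study $i$. The sketch assumes a single true score correlation $\rho$ that is independent of the artifacts, sampling error $e_i$ with mean zero, equally weighted studies, and mean attenuation factors $\bar{a}$ and $\bar{b}$ treated as fixed.

```latex
\begin{aligned}
r_i &= \rho\, a_i b_i + e_i \\
E(\bar{r}) &= \rho\, E(ab) = \rho \left[ \bar{a}\,\bar{b} + \operatorname{Cov}(a, b) \right] \\
E\!\left( \frac{\bar{r}}{\bar{a}\,\bar{b}} \right) &\approx \rho \left[ 1 + \frac{\operatorname{Cov}(a, b)}{\bar{a}\,\bar{b}} \right]
\end{aligned}
```

Under these assumptions, a positive covariance between the attenuation factors across studies inflates the corrected mean correlation, a negative covariance deflates it, and the correction is unbiased only when the covariance is zero.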
Present Study
In our discussion of the issue of interdependence of predictor and criterion reliabilities, we have argued that previous research has largely neglected the issue. Previous research has examined in detail other statistical assumptions of meta-analysis, such as the assumption of uncorrelated errors and true validities, and has largely concluded that meta-analytic approaches should be adapted to take these interdependencies into account. Previous research has also examined the interdependencies of reliabilities and range restriction, in particular criterion reliability and range restriction, and has offered altered formulae that take these interdependencies into account so as not to overcorrect the sample-size-weighted mean correlation. For example, research by LeBreton, Burgess, Kaiser, Atchley, and James (2003) shows that the effects of range restriction on reliability coefficients can be quite extensive and hence can greatly affect corrected correlations and, subsequently, the conclusions we draw based on these correlations.
However, no such examination exists for the interdependencies between predictor and criterion reliabilities. Without an empirical examination of this issue, we do not know the magnitude of the problem. Hence, the purpose of the current article is twofold: (a) to empirically examine whether predictor and criterion reliabilities are correlated across studies and (b) to identify the extent of their interdependency.
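Goal (a) amounts to correlating pairs of reliability estimates across studies. The sketch below assumes a plain, unweighted Pearson correlation; the exact estimator used in the analyses may differ, and the function name and reliability pairs are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Unweighted Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical (predictor reliability, criterion reliability) pairs,
# one pair per predictor/criterion combination coded from a study.
pairs = [(0.70, 0.75), (0.80, 0.78), (0.85, 0.88), (0.90, 0.86), (0.75, 0.72)]
rxx = [p[0] for p in pairs]
ryy = [p[1] for p in pairs]

r_rel = pearson(rxx, ryy)
print(round(r_rel, 2))
```

A value near zero would be consistent with the independence assumption, whereas a substantial positive or negative value would indicate interdependence of the reliabilities across studies.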
To accomplish these goals, we conducted two empirical examinations. In Study 1, we examined the correlation between predictor and criterion reliabilities across all relevant studies published in the Academy of Management Journal (AMJ) and the Journal of Applied Psychology (JAP) between January 2004 and October 2011. This resulted in a mix of studies across different research topics, variables, measures, samples, and conditions. This type of meta-analysis is very atypical. Usually, meta-analysts would focus on one research topic, such as the relationship between two specific variables, rather than collecting estimates across studies with different research topics. In the current study, however, we wanted to explore how much reliabilities would correlate when there was no consistent underlying relationship between the independent variable (IV) and the dependent variable (DV) across studies, that is, when the IV-DV relationships across studies should not correlate. This gives us a “baseline” condition, in which we can exclude the underlying effect that the true score correlation between the predictor and criterion constructs has on the correlation between the reliabilities across studies. In addition, given that there is no substantive, meaningful connection between the variables or research conditions across the original studies included in Study 1, we would not expect to see any relationships between the reliabilities, unless they represent situational factors or larger external influence factors, such as publication requirements or institutionalized norms for the conduct of research. Any correlations between the predictor and criterion reliabilities we find here represent the minimum correlation between reliabilities researchers should expect to have in their meta-analyses, even if the relationship between their IV and DV is nonexistent.
In Study 2, we collected original studies involving perceived organizational support (POS) and some of its antecedents and outcomes. In comparison to Study 1, Study 2 provides us with a more typical meta-analysis setting, in which the researcher assesses a substantive, meaningful underlying IV-DV relationship in a specific research topic area. We collected predictor and criterion reliability information from relevant original studies and examined the reliability correlations for different IV/DV combinations to examine the strength of their relationship. In this study, the effect of the true score correlation on the correlation between reliabilities is captured. Statistical and research design influence factors as well as situational factors are also likely to contribute to the correlation between the predictor and criterion reliabilities. Comparing results from the two studies allows us to compare the typical meta-analysis situation with a “baseline” situation and to obtain comprehensive estimates of relationships among reliabilities.
Study 1
Method
Compilation of Studies
The data collection process included an extensive manual search for published studies that included information about predictor and criterion reliability. To limit the vast number of available studies to a manageable pool, we chose to focus our search on two of the most renowned journals in the field of organization studies that predominantly publish empirical research: the Academy of Management Journal and the Journal of Applied Psychology. Our data set includes all applicable studies published in these journals between January 2004 and October 2011.
A study was included if it had reliability information for both the predictor measure and the criterion measure of interest in any given study. The particular research topic of a study did not matter at all, given that our concern is about the correlation between measurement errors of the predictor and the criterion measure that are supposedly unrelated to the theoretical constructs under examination. Part of the goal of this study was to include studies from as many different topics in organizational studies as possible to ensure that the findings are in fact not dependent on a particular research topic, specific theoretical constructs, or certain fields of research or discipline areas.
From each study we collected the following information: The type of reliability coefficient the study reported for the predictor and the criterion (i.e., internal consistency, test–retest, split-half, alternate forms, intrarater, or interrater), the magnitude of each reliability coefficient, information about the predictor and criterion constructs (i.e., which constructs were actually measured, for example, transformational leadership, job performance, emotional exhaustion), sample size, information about the type of sample (i.e., whether the sample consisted of students, employees, MBA students, or firms/organizations), the number of items used to measure the predictor and the criterion, whether the predictor and criterion information was provided by the same source or by different sources (i.e., the same person or different people), and how many raters provided information (for intra- and interrater reliability coefficients). Internal consistency reliability was indexed with Cronbach’s alpha. These values were coded as internal consistency reliability when respondents completed a measure about themselves to distinguish them from ratings of others. Intrarater reliability was also usually indexed with Cronbach’s alpha; however, the target of the questions was different from the person who completed the questionnaire (e.g., peer reports). Whereas these last two indices referred to consistency among items, interrater reliability refers to consistency among raters. We also coded for alternate forms reliability (correlation between two alternate forms of a questionnaire), split-half reliability (correlation between two halves of a test), and test–retest reliability (correlation between an original testing occasion and a retesting occasion with the same questionnaire after a certain amount of time had elapsed). 
However, because there were not enough studies that had used alternate forms, split-half, and test–retest reliability indices, we did not include them in our analyses (see below).
Four coders went through the journal issues to collect all the information from each study. The first author of the study provided detailed instructions to the three other coders and trained them on how to collect the information from the studies. All the information that needed to be collected from the studies was readily available from the methods sections. No subjective coding of any information was necessary. In the rare case that one of the three coders was uncertain about collecting the appropriate information from a study, the coder approached the first author of this study and they collected the necessary information from the article together. The first author conducted frequent checks of the studies coded by the other coders. Coding mistakes were rare, but when they were uncovered, the first author worked with the coder to remedy the mistakes and to provide further training to avoid these mistakes in the future.
We succeeded in locating 518 studies for a variety of predictors. From these 518 studies, we obtained 628 pairs of reliability estimates (see Table 1). We chose to focus on combinations of predictor and criterion reliability coefficients that had a sufficient number of studies available. As a result, our analysis focuses on the following five combinations of predictor and criterion reliabilities:
Criterion internal consistency reliability/predictor internal consistency reliability
Criterion internal consistency reliability/predictor intrarater reliability
Criterion intrarater reliability/predictor internal consistency reliability
Criterion intrarater reliability/predictor intrarater reliability
Criterion interrater reliability/predictor internal consistency reliability
Table 1. Number of Independent Reliability Estimates Obtained for All Predictor/Criterion Reliability Combinations.
Note: Number of studies the estimates were obtained from in brackets.
Data Analysis
Outlier analysis
In our study, we considered cases as outliers when their reliability coefficients for the predictor or the criterion were unusually large or small when compared to all the other values in the data set. We followed the approach for identifying outliers advocated by Aguinis, Gottfredson, and Joo (2013). They recommend a three-step approach that consists of the identification of error outliers, interesting outliers, and finally, influential outliers. Error outliers are deemed to have occurred due to inaccuracies, for example, due to coding errors, errors in computation, or errors in data manipulation. In the case of our meta-analysis, the only relevant error type would be due to coding errors, given that we obtained effect sizes from primary studies. So, if computation errors or errors in data manipulation happened in the primary studies, we would not be able to identify them. Aguinis et al. recommend two steps, identification of the potential error outlier and determination of whether the potential error outlier is an actual error outlier. For error identification, Aguinis et al. suggest using both single and multiple construct techniques by employing visual and quantitative tools.
In the current meta-analysis, we used the following visual tools: (a) We examined plots of the distributions of the reliability indices to identify cases that were in the tails of the distribution and thus very different from the rest of the values in the data set (i.e., single construct technique) and (b) we examined scatter plots of the bivariate relationships between predictor and criterion reliability to examine whether certain cases were situated far away from the main cloud of cases (i.e., multiple construct technique). Regarding quantitative tools, we chose to employ the following three-pronged approach for the quantitative identification of outliers: First, we checked for univariate outliers for each of the two reliability estimates (i.e., single construct technique). We conducted a standard deviation analysis and examined in more detail studies with reliability estimates that were more than 3 standard deviations above or below the mean. We chose 3 standard deviation units as a more conservative measure, given that our data are not normally distributed. Second, we checked for bivariate outliers using the difference-in-fit standardized (DFFITS) statistic. Third, we employed the approach suggested by Huffcutt and Arthur (1995), using the sample-adjusted meta-analytic deviancy (SAMD) statistic. The SAMD statistic uses a similar approach as the DFFITS statistic, but it takes the sample size on which each reliability estimate is based into account. We computed the SAMD statistic for each reliability estimate. For both the SAMD statistic and the DFFITS statistic we used scree plots to determine the values that are furthest away from the rest of the values (i.e., visual tool).
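The univariate and bivariate screens described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the function names and the toy reliability values are hypothetical, DFFITS is computed for a simple regression of criterion reliability on predictor reliability (an assumption about how the bivariate check might be set up), and the SAMD statistic is left out.

```python
import math
from statistics import mean, stdev


def flag_univariate(values, k=3.0):
    """Return indices of values more than k sample SDs from the mean."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]


def dffits(x, y):
    """DFFITS for every case in a simple regression of y on x."""
    n, p = len(x), 2  # p = number of parameters (intercept and slope)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    resid = [yi - (my + slope * (xi - mx)) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    out = []
    for xi, e in zip(x, resid):
        h = 1.0 / n + (xi - mx) ** 2 / sxx                      # leverage
        s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)   # error variance without this case
        t = e / math.sqrt(s2_deleted * (1.0 - h))               # externally studentized residual
        out.append(t * math.sqrt(h / (1.0 - h)))
    return out


# Hypothetical reliabilities: 20 well-behaved studies plus one odd case.
pred_rel = [0.80 + 0.005 * i for i in range(20)] + [0.95]
crit_rel = [r + (0.01 if i % 2 else -0.01)
            for i, r in enumerate(pred_rel[:20])] + [0.40]

print("univariate flags:", flag_univariate(crit_rel))
scores = dffits(pred_rel, crit_rel)
cutoff = 2 * math.sqrt(2 / len(pred_rel))  # common rule of thumb: 2 * sqrt(p / n)
print("DFFITS flags:", [i for i, d in enumerate(scores) if abs(d) > cutoff])
```

Flagged cases would then be checked against the original articles, as described in the text, before any decision about exclusion is made.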
Based on these analyses, we identified eight potential error outliers. To determine whether these were actual error outliers, we went back to the original articles to check for coding errors. Amongst these cases, we found one coding error, which we corrected.
In the next step, we explored the characteristics of these cases a bit further. We observed that several of the high SAMD statistics were found for cases with large sample sizes. Given that this statistic takes sample size into account, studies with large samples were particularly prone to receiving a high SAMD statistic, even when their effect sizes were well within the limits of 3 standard deviations from the mean. We also found several effect sizes that were above or below 3 standard deviation units from the mean. None of these outliers were “interesting,” that is, outliers that are not due to inaccuracy and that could further theoretical knowledge by being representations of unique cases. Consequently, we moved on to determining whether these outliers are influential outliers, that is, outliers that have an effect on our parameter estimates. Following Aguinis et al.’s (2013) recommendations, we ran the main statistical analyses first using all data and then using the data set from which outliers were excluded to determine the effect that the identified outliers had on our data analysis.
For the data sets containing the effect sizes for the predictor and criterion internal consistency combination, there was virtually no difference in the results when running the analyses with outlier scores included or excluded, so we report results based on the full data set. The same was the case for the data set containing the effect sizes for the predictor intrarater/criterion internal consistency reliability combination.
For the data sets containing the predictor internal consistency/criterion interrater reliability combination and for the predictor and criterion intrarater reliability combination, there was one study that was very influential due to its sample size of N = 9,627. Despite the fact that the reported reliability coefficients were well within the range of the reliability coefficients reported by other studies in the same data sets, the inclusion of this study changed the correlation coefficients substantially. With the study included in the data set, the correlation between predictor internal consistency and criterion interrater reliability was r = –.45. Without the study, the correlation was r = –.06 (correlation with Fisher’s z-transformed scores: r = .08). For the predictor and criterion intrarater reliability combination, the inclusion of this study resulted in a correlation of r = .12, while its exclusion resulted in a correlation of r = .16 (correlation with Fisher’s z-transformed scores: r = .16). The main influence here seems to stem from the large sample size. Given the lower likelihood of sampling error in a large data set, we should put more rather than less confidence in a large study like this. As such, we decided to keep the study in our data set.
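To see why a single study of this size can move a sample-size-weighted estimate so strongly, it helps to look at the share of the total weight each study carries. The sketch below assumes that each study is weighted by its sample size; all sample sizes except N = 9,627 are made up for illustration.

```python
def weight_shares(sample_sizes):
    """Fraction of the total sample-size weight carried by each study."""
    total = sum(sample_sizes)
    return [n / total for n in sample_sizes]


# Hypothetical sample sizes; only N = 9,627 is taken from the text.
sizes = [120, 250, 310, 180, 9627]
for n, share in zip(sizes, weight_shares(sizes)):
    print(f"n = {n:>5}: {share:.1%} of the total weight")
```

With weights like these, the large study dominates every weighted mean, standard deviation, and cross-product, even when its reliability values themselves are unremarkable.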
For the data set containing the predictor internal consistency/criterion intrarater reliability combination, we found one influential outlier where the reported criterion reliability was very low in comparison to the other criterion reliabilities in the data set. When the study was included, the correlation was r = .01 (correlation with Fisher’s z-transformed scores: r = –.05); when it was excluded, the correlation was r = .03 (correlation with Fisher’s z-transformed scores: r = –.04). The sample size for this study was also rather small (n = 84). Given that sampling error is likely to be larger in small samples than in large ones, we decided to eliminate this study from the data set. The conclusions were not significantly altered by the inclusion or exclusion of this study.
Main analysis
The procedure in the current meta-analysis is slightly different from other meta-analyses. Rather than determining the mean corrected correlation between two constructs across studies, the current meta-analysis involves computation of the correlation between reliability estimates obtained from the studies.
Being aware that nonnormal distributions of the data can affect correlation coefficients, we first assessed the shape of the predictor and criterion reliability distributions in our five data sets. The results are reported in Table 2. All of the underlying distributions were nonnormal as assessed by a Kolmogorov–Smirnov test for single samples, although most of the distributions did not show high values for skew or kurtosis. Most distributions were slightly skewed toward higher reliability scores, which can be expected given that studies with low reliabilities would have difficulties being published. This selective publication effect could also be seen in the fact that two of the five data sets did not show reliability scores below .70 (i.e., the predictor intrarater/criterion internal consistency and the predictor and criterion intrarater reliability combinations). However, the other three reliability combinations showed distributions of reliability scores that were much less affected by truncation.
Table 2. Correlation Coefficients, Means, and Standard Deviations for the Five Combinations of Predictor and Criterion Reliability in the “Baseline” Meta-Analysis in Study 1.
Note: **p < .01.
Recent work by Bishara and Hittner (2012) has examined the impact of violations of the assumption of normally distributed data on the significance testing of different correlation coefficients and on alternative approaches to normalize the data (e.g., through transformation and ranking). In their simulations, they found that Pearson’s r was quite robust with respect to Type I error rates, unless the sample size in the data was small or the underlying distributions of both the predictor and the criterion were highly leptokurtic. Small sample sizes were not an issue in our data sets. In addition, four out of five of our data sets did not show high positive levels of kurtosis (i.e., were not leptokurtic) for both the predictor and the criterion. The exception was the predictor and criterion intrarater reliability combination, which showed strongly elevated kurtosis levels for the predictor and elevated levels for the criterion. This is due to the inclusion of one large sample. As we mentioned earlier, we also ran the analyses with this large sample excluded. When excluding this study from the analysis, the kurtosis values were not elevated and the correlation between the reliability coefficients still reached significance. So, slightly elevated Type I error rates (i.e., around .07 as reported in Bishara & Hittner’s study) should not be a major concern here. Overall, Pearson’s r should provide an accurate estimate of the correlation between predictor and criterion reliability. Nevertheless, to compensate for the skew in the data, we transformed the obtained reliability coefficients with Fisher’s r-to-z transformations and also ran our analyses using these transformed scores. We report the obtained Pearson correlations for the observed and the transformed reliability coefficients in our tables.
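For readers who want to reproduce the transformation step, Fisher’s r-to-z transformation is simply the inverse hyperbolic tangent. The sketch below applies it to a few hypothetical reliability values; the function name is ours.

```python
import math


def fisher_z(r):
    """Fisher's r-to-z transformation: z = atanh(r) = 0.5 * ln((1 + r) / (1 - r))."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)


# Hypothetical reliability coefficients.
for rel in (0.70, 0.80, 0.90, 0.95):
    print(f"reliability {rel:.2f} -> z = {fisher_z(rel):.3f}")
```

Because the transformation stretches values close to 1 more than values further away, it spreads out the compressed upper end of a reliability distribution, which is where most published coefficients lie.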
We used the following variant of the sample-size-weighted mean correlation formula to compute the sample-size-weighted correlation between the reliabilities in our study,
where xi represents the predictor reliability coefficient, yi represents the criterion reliability coefficient, and ni represents the sample size of the study from which the reliability estimates were obtained.
Similarly, we adapted the formula to compute the standard deviation for this correlation coefficient to obtain a weighted estimate of the standard deviation.
An analogous formula was used to compute sd(y)weighted.
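The equations themselves are not reproduced in this version of the text. Under the definitions given above, a standard sample-size-weighted form would look as follows; the exact notation, and the use of the weighted means as centering values, are assumptions rather than a quotation of the original formulae.

```latex
\bar{x}_{w} = \frac{\sum_{i} n_i x_i}{\sum_{i} n_i}, \qquad
\bar{y}_{w} = \frac{\sum_{i} n_i y_i}{\sum_{i} n_i}

sd(x)_{weighted} = \sqrt{\frac{\sum_{i} n_i \,(x_i - \bar{x}_{w})^{2}}{\sum_{i} n_i}}

r_{weighted} = \frac{\sum_{i} n_i \,(x_i - \bar{x}_{w})(y_i - \bar{y}_{w})}
                    {\left(\sum_{i} n_i\right)\, sd(x)_{weighted}\; sd(y)_{weighted}}
```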
Using these formulae, we computed separate sample-size-weighted correlation coefficients for each of the five different combinations of predictor and criterion reliability listed above.
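A sample-size-weighted correlation of this kind can be computed in a few lines. The sketch below is one plausible implementation under the definitions given (xi and yi as reliabilities, ni as weights); the function names and the toy data are hypothetical, and the original analysis may differ in detail.

```python
import math


def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_sd(values, weights):
    """Weighted standard deviation around the weighted mean."""
    m = weighted_mean(values, weights)
    var = sum(w * (v - m) ** 2 for v, w in zip(values, weights)) / sum(weights)
    return math.sqrt(var)


def weighted_corr(x, y, n):
    """Sample-size-weighted Pearson correlation between x and y."""
    mx, my = weighted_mean(x, n), weighted_mean(y, n)
    cov = sum(w * (a - mx) * (b - my) for a, b, w in zip(x, y, n)) / sum(n)
    return cov / (weighted_sd(x, n) * weighted_sd(y, n))


# Hypothetical predictor reliabilities, criterion reliabilities, and sample sizes.
rxx = [0.82, 0.88, 0.91, 0.79, 0.93]
ryy = [0.80, 0.85, 0.90, 0.84, 0.89]
n = [150, 320, 95, 210, 480]
print(round(weighted_corr(rxx, ryy, n), 3))
```

Running the function once per data set, with that data set’s reliabilities and sample sizes, yields one weighted correlation per predictor/criterion reliability combination.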
Results
Table 2 summarizes the correlations, means, and standard deviations for the five combinations of predictor and criterion reliability coefficients. Across all five predictor/criterion reliability combinations, it is noteworthy that the mean reliability coefficients reported for the predictor are all very high although there was variability around these means. The mean predictor reliability ranges from .85 to .92 and the standard deviations from .03 to .07. The mean criterion reliability ranges from .61 to .88 and the standard deviations of the criterion reliability from .03 to .23. Based on these data, there seems to be a clear indication that studies published in AMJ and JAP on average show strong reliability estimates for both the predictor and the criterion but that there is some variability around these means. We have included a comparison of the means and standard deviations found in our study with findings reported in previous meta-analyses by Aguinis, Dalton, et al. (2011) and Viswesvaran et al. (1996) in an appendix.
The main finding of the current study is that the correlations between the predictor and criterion reliability estimates were significantly different from the previously assumed zero correlation for all five combinations, although they differed in size and direction. The strongest correlation between predictor and criterion reliability was observed for the predictor internal consistency/criterion interrater combination (r = –.45, p < .01; correlation with Fisher’s z-transformed scores: r = –.29). Somewhat smaller but still substantial were the correlations for the predictor and criterion intrarater reliability combination (r = .12, p < .01; correlation with Fisher’s z-transformed scores: r = .04), the predictor intrarater reliability/criterion internal consistency combination (r = –.12, p < .01; correlation with Fisher’s z-transformed scores: r = –.14), and the predictor and criterion internal consistency combination (r = .16, p < .01; correlation with Fisher’s z-transformed scores: r = .21). Only the predictor internal consistency/criterion intrarater reliability combination (r = .03, p < .01; correlation with Fisher’s z-transformed scores: r = –.04) shows an average correlation that is only slightly, albeit significantly, different from zero.
Although many of these values are small, it is important to note that journals reject articles because of low reliabilities. Specifically, a journal submission is in jeopardy unless all measures of central constructs have adequate reliabilities. Thus, the range of reliability values in Study 1 is likely to be severely restricted. If one assumes that reliabilities range from .50 to 1.0 and that journals only publish articles for which both rxx and ryy exceed .70, then the standard deviations of observed reliabilities would be reduced by 30% to 50%, which would in turn reduce the correlation between reliabilities by approximately 40% (Hunter & Schmidt, 1990). If the range of true reliabilities is assumed to be greater, then the restriction-induced reduction in reliability correlations would be even greater.
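A rough sense of the magnitudes involved can be obtained from the classic formula for direct range restriction on a single variable (often referred to as Thorndike’s Case II). This is a simplifying assumption: the scenario described above involves truncation on both reliabilities, so the sketch is only an approximation, and the unrestricted correlation of .30 is a hypothetical value.

```python
import math


def restricted_r(r_unrestricted, u):
    """Correlation expected after direct range restriction on one variable.

    u is the ratio of the restricted to the unrestricted standard deviation
    of the selected variable (0 < u <= 1).
    """
    return u * r_unrestricted / math.sqrt(1.0 + (u ** 2 - 1.0) * r_unrestricted ** 2)


# Hypothetical unrestricted correlation between reliabilities.
r_full = 0.30
for u in (0.7, 0.6, 0.5):  # SD reduced by 30%, 40%, 50%
    r_obs = restricted_r(r_full, u)
    print(f"u = {u:.1f}: r = {r_obs:.3f} ({1 - r_obs / r_full:.0%} smaller)")
```

For small to moderate correlations the proportional drop is close to the proportional drop in the standard deviation, which is in line with the order of magnitude stated in the text.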
Study 2
In our second study, we aimed to examine the issue of correlated reliabilities in a more typical meta-analysis situation, in which the researcher focuses on a specific research topic and examines a particular IV-DV relationship. The purpose of this examination was two-fold. On one hand, we wanted to see whether or not we would also find a correlation between predictor and criterion reliabilities across studies when looking at a set of original studies from the same topic area, that is, we wanted to see if we could replicate our results from Study 1. On the other hand, we wanted to explore whether the extent to which the reliabilities would correlate differed in this more typical scenario. As we argued earlier in the article, true score correlations in the underlying IV-DV relationship should contribute to the correlation between reliabilities. If we examine the same IV-DV relationship with (arguably) the same underlying true score correlation across studies, then the correlation of reliabilities across studies should be even higher than in Study 1.
Method
Compilation of Studies
Study 2 is based on a meta-analysis conducted by Kurtessis, Ford, Buffardi, and Stewart (2008), which examines the relationship between POS and its antecedents and outcomes. For this meta-analysis, Kurtessis et al. collected published and unpublished studies from the period between 1986 and 2011 by conducting electronic searches using ABI/INFORM, APA PsycNET, PsycINFO, ProQuest Research Library, Digital Dissertations, Google Scholar, and the Defense Technical Information Center. They used the following key terms to identify studies: perceived organizational support, organizational support, perceived support, and POS. They also used an alternate spelling, that is, organisational, which is frequently used in research on POS conducted in British Commonwealth nations, for each of the previously mentioned search terms. In addition, studies were identified by examining two prior meta-analyses of POS (Rhoades & Eisenberger, 2002; Riggle, Edmondson, & Hanson, 2009) and by collecting studies which cited one of several major source articles on POS (Eisenberger, Cummings, Armeli, & Lynch, 1997; Eisenberger, Fasolo, & Davis-LaMastro, 1990; Eisenberger, Huntington, Hutchison, & Sowa, 1986; Rhoades & Eisenberger, 2002).
Studies were included in the Kurtessis et al. meta-analysis based on several criteria. First, all studies had to report the information necessary for their analysis (e.g., sample, reliability coefficients, effect sizes) similar to the process outlined in the procedures for Study 1 in the current meta-analysis. Second, information could only be collected from studies that were available in English or that had an English translation. Third, measures of POS that were adapted for specific criteria (e.g., support for creativity) or where POS items were combined with items from other job attitudes measures (e.g., engagement) were also excluded as these measures reflect a construct that is distinct from POS. All studies were independently coded by two members of the Kurtessis et al. research team, and coding was compared to determine if discrepancies existed. Where coding discrepancies existed, they were resolved by referring to the original article or gathering input from the other members of the research team until a consensus about the correct coding was reached.
The Kurtessis et al. meta-analysis identified and coded a total of 492 articles containing 558 studies. Only a subset of these data were used in the current article. While the Kurtessis et al. meta-analysis focuses on the underlying relationship between POS and its associated variables, the current study is interested in the correlation between the reliabilities of the measures used to assess the predictor and the criterion variables in this meta-analysis. In particular, we examined a couple of different combinations of POS and its associated variables to examine different predictor and criterion reliability combinations, similar to Study 1, but also to examine different underlying IV and DV relationships to see how the correlation between the reliabilities would differ across different variable combinations. For example, we examined a collection of studies in which POS is the predictor, and some in which POS is the criterion (all of these were predictor and criterion internal consistency reliability combinations). We also examined the relationship between POS as the predictor and organizational citizenship behavior toward the organization (OCB-O) and performance as the respective outcome variables. For both OCB-O and performance we distinguished between self-rated (i.e., internal consistency reliability) and other-rated (intrarater reliability) to be able to compare the strength of the correlation between reliabilities for different predictor and criterion reliability combinations. We focused on the two most commonly used predictor and criterion reliability indices, that is, internal consistency and intrarater reliability.
From each original study we collected the following information: The type of reliability coefficient the study reported for the predictor and the criterion (i.e., internal consistency or intrarater), the magnitude of each coefficient, the observed correlation between the predictor and criterion constructs, sample size, information about the type of sample, the number of items used to measure POS, and whether the predictor and criterion information were provided by the same source or by different sources (for OCB-O and performance). All reliabilities again were indexed with Cronbach’s alpha.
Our compilation of studies resulted in the following data sets:
Predictor: POS (internal consistency)/criterion: affective commitment (internal consistency): 230 effect sizes
Predictor: POS (internal consistency)/criterion: job satisfaction (internal consistency): 140 effect sizes
Predictor: procedural justice (internal consistency)/criterion: POS (internal consistency): 69 effect sizes
Predictor: POS (internal consistency)/criterion: job performance self-rated (internal consistency): 31 effect sizes
Predictor: POS (internal consistency)/criterion: job performance other-rated (intrarater): 50 effect sizes
Predictor: POS (internal consistency)/criterion: OCB-O self-rated (internal consistency): 24 effect sizes
Predictor: POS (internal consistency)/criterion: OCB-O other-rated (intrarater): 20 effect sizes
Data Analysis
Outlier analysis
For the outlier analysis, we followed the same steps as described in Study 1, applied to each of the seven data sets that we created. In the POS–affective commitment data set, we identified four potential outliers. In the POS–job satisfaction data set, we identified three potential outliers. In the POS–performance data sets, we identified two potential outliers in the self-rated performance data set and five in the other-rated data set. In the POS–OCB-O data sets, we noted two potential outliers in the self-rated OCB-O data set and one in the other-rated data set. Finally, in the procedural justice–POS data set, we identified six potential outliers, none of which were coding errors.
Again, the high SAMD statistics were found for cases with large sample sizes. We also found several effect sizes that were more than 3 standard deviation units above or below the mean. We identified none of these outliers as interesting outliers. As in Study 1, we moved on to determining whether these outliers are influential outliers by running the analyses with and without these cases to determine the influence of these outliers on the results.
Excluding outliers made very little difference for the strength of the correlation between the predictor and criterion reliability in the following data sets: POS–affective commitment and POS–job satisfaction. In all other data sets, the exclusion of outliers made a substantial difference. As a consequence, we decided to report the results both with all studies included in the analyses and with the outliers in each data set excluded, as we did in Study 1.
Main analysis
For the main analysis, we followed the same steps as described in Study 1.
Results
Tables 3 through 6 summarize the correlations, means, and standard deviations for the relationship between predictor and criterion reliability coefficients across the different POS meta-analysis data sets. Across all combinations between POS and its antecedents and outcomes, the means of the reported reliabilities were all very high. Only the mean for the criterion reliability coefficient for self-rated OCB-O was lower at .74. Two mean reliabilities were above .90 (both for POS measures in the self- and other-rated OCB-O data sets), nine reliabilities were between .85 and .90, and another two reliabilities were at least .80. Thus, 79% (i.e., 11 out of the 14) of the observed mean reliabilities were above .85. This prevalence of high reliability indices mirrors our findings from Study 1. And as was the case with Study 1, restriction of range due to the screening out of articles with low reliabilities likely suppresses these correlations.
Table 3. Correlation Coefficients, Means, and Standard Deviations for the Relationship Between Predictor and Criterion Internal Consistency Reliability in the POS Meta-Analysis in Study 2 (All Studies Included).
Note: OCB-O = organizational citizenship behavior toward the organization; POS = perceived organizational support.
Table 4. Correlation Coefficients, Means, and Standard Deviations for Predictor and Criterion Reliability (Self-Rated and Other-Rated) in the POS Meta-Analysis in Study 2 (All Studies Included).
Note: OCB-O = organizational citizenship behavior toward the organization; POS = perceived organizational support.
Table 5. Correlation Coefficients, Means, and Standard Deviations for the Relationship Between Predictor and Criterion Internal Consistency Reliability in the POS Meta-Analysis in Study 2 (Outliers Excluded).
Note: OCB-O = organizational citizenship behavior toward the organization; POS = perceived organizational support.
Table 6. Correlation Coefficients, Means, and Standard Deviations for Predictor and Criterion Reliability (Self-Rated and Other-Rated) in the POS Meta-Analysis in Study 2 (Outliers Excluded).
Note: OCB-O = organizational citizenship behavior toward the organization; POS = perceived organizational support.
The main finding of the current study is that the correlations between the predictor and criterion reliability estimates were significantly different from the previously assumed zero correlation. In this second study, we analyzed the correlation between predictor and criterion reliabilities when there is a substantive underlying relationship between the IV and the DV, in our case between POS and some of its antecedents and outcomes. In this context, which represents a typical meta-analysis of organizational phenomena, we generally find much larger correlations between the predictor and criterion reliabilities.
For the analyses that included all data (see Tables 3 and 4), the largest correlation between predictor and criterion reliability of r = –.78 (p < .01; correlation with Fisher’s z-transformed scores: r = –.78) was observed for the POS/self-rated OCB-O relationship (i.e., a predictor and criterion internal consistency reliability combination). Also quite substantial were the correlations between predictor and criterion internal consistency reliability found for the POS/affective commitment relationship at r = .30 (p < .01; correlation with Fisher’s z-transformed scores: r = .29) and for the procedural justice/POS relationship at r = .34 (p < .01; correlation with Fisher’s z-transformed scores: r = .37). While still significant at the p < .01 level, the reliability correlations for the other POS data sets were smaller. The correlation between predictor and criterion internal consistency reliabilities for the POS/job satisfaction relationship was r = .07 (correlation with Fisher’s z-transformed scores: r = .04). For the POS/self-rated performance relationship (i.e., a predictor and criterion internal consistency reliability combination), it was r = –.08 (correlation with Fisher’s z-transformed scores: r = –.04). For the two predictor internal consistency and criterion intrarater reliability combinations, the correlation between the reliabilities was r = .12 (correlation with Fisher’s z-transformed scores: r = .01) for the POS/other-rated performance relationship and r = .08 (correlation with Fisher’s z-transformed scores: r = .07) for the POS/other-rated OCB-O relationship.
When excluding outliers (see Tables 5 and 6), the correlation between predictor and criterion internal consistency reliabilities for the POS/self-rated OCB-O relationship decreased, but remained very substantial at r = –.45 (p < .01; correlation with Fisher’s z-transformed scores: r = –.41). The correlation between reliabilities for the POS/affective commitment relationship slightly increased to r = .34 (p < .01; correlation with Fisher’s z-transformed scores: r = .31) and decreased for the procedural justice/POS relationship to r = .14 (p < .01; correlation with Fisher’s z-transformed scores: r = .22). There was no change in the correlation between predictor and criterion internal consistency reliabilities for the POS/job satisfaction relationship. The correlation between the reliabilities for the POS/self-rated performance relationship increased in strength to r = –.20 (correlation with Fisher’s z-transformed scores: r = –.12). For the POS/other-rated performance relationship (i.e., a predictor internal consistency and criterion intrarater reliability combination), the correlation between reliabilities changed direction (i.e., we now observed a negative correlation between the reliabilities) and increased in strength to r = –.13 (correlation with Fisher’s z-transformed scores: r = –.20). This is the only combination of POS and a related variable for which the direction of the correlation between reliabilities changed when excluding outliers. Finally, when excluding outliers, the correlation between reliabilities for the POS/other-rated OCB-O relationship (i.e., the second predictor internal consistency and criterion intrarater reliability combination) increased to r = .13 (correlation with Fisher’s z-transformed scores: r = .10).
Discussion
In the current study, we have provided empirical evidence that predictor and criterion reliabilities are correlated across studies. While previous research has provided evidence for interdependencies between reliabilities and range restriction (e.g., LeBreton et al., 2003; Mendoza & Mumford, 1987; Theron, 1998), to our knowledge no prior study has examined whether interdependencies also exist between predictor and criterion reliabilities. This is an important question, given that the underlying correction formulae in the Hunter and Schmidt (2004) approach assume independence between all study artifacts.
The current study also provides an estimate of the extent to which predictor and criterion reliabilities are correlated. In Study 1, we computed the correlation between predictor and criterion reliability estimates across all applicable studies published in two of the most highly regarded journals in the field of organization sciences, AMJ and JAP. This was our baseline condition. In Study 2, we reported findings from a meta-analysis on POS and its antecedents and outcomes. We computed the correlation between predictor and criterion reliability estimates for seven different data sets that correlated POS and another variable. In both studies, we found significant correlations between predictor and criterion reliabilities. In Study 1, correlations between predictor and criterion reliabilities in AMJ and JAP ranged from small (i.e., r = .03) to moderate (i.e., r = .16, r = –.12, and r = .12) to large (r = –.45). In Study 2, reliability correlations ranged from small (i.e., r = .07, r = .08, and r = –.08) to moderate (i.e., r = .12) to large (r = .30, r = .34, and r = –.78).
Noteworthy for both studies is that the medians of the reported reliabilities for the predictor and the criterion measures were very high. In Study 1, 2 of the 10 reported median reliability estimates are in the .90s. Another 6 out of the 10 are in the upper .80s, and 1 out of the 10 is in the upper .70s. Only one of the median reliability estimates is below .50. In Study 2, 7 out of the 14 reported median reliability estimates are in the .90s. Another 4 are in the upper .80s, and 2 are in the lower .80s. Only one median reliability coefficient is below .70. This indicates severe truncation in the reliability distributions, which leads to range restriction and subsequently a suppression of the correlations among reliabilities. In other words, we would assume that in studies that include a wider range of reliability indices, we would find even stronger correlations between the reliability coefficients.
What do these correlations mean for the meta-analyses that exist in our field? As noted in our earlier discussion of Formula 2, the Hunter and Schmidt (2004) correction formulae that meta-analysts currently use correct the mean observed correlation for predictor and criterion reliability independently. Because these two reliability estimates are correlated and the formulae do not take this correlation into account, we correct for their shared portion twice. Thus, we are currently overcorrecting for unreliability. As pointed out above, this is a threat to the accuracy of findings reported by meta-analyses. This is a crucial issue because meta-analysis is seen as a mechanism for the generation of unbiased and accurate estimates of relationships among variables.
However, at this point, we do not know whether the correlation between reliabilities and our failure to account for their interdependence actually leads to a substantial bias in meta-analytically derived corrected mean correlation coefficients. Previous research has provided arguments for two divergent views on whether the difference in estimates might be small or large. For example, Aguinis, Pierce, et al. (2011) argue that technical refinements in meta-analysis usually lead to very small substantive changes in the results and in our subsequent conclusions. Simulation research by Raju et al. (1998) has further shown that even under violations of the assumption of independence between the validity coefficient and study artifacts, the traditional VG models provided corrected mean correlations with reasonable accuracy. So, it could be that taking correlations between predictor and criterion reliabilities into account does not substantially change our findings or conclusions.
At the same time, research into the interdependence between reliabilities and range restriction demonstrates that correlations between study artifacts can have a substantial impact on findings and the conclusions we draw from them. Work by LeBreton et al. (2003), for example, shows that range restriction has a large effect on interrater reliability estimates, which are typically rather low (e.g., Viswesvaran et al., 1996). When using interrater reliability indices that are not affected by range restriction, interrater reliability estimates showed levels of between-source agreement similar to within-source agreement. As a consequence, LeBreton et al. suggested that conclusions about the meaning of low interrater reliability estimates for substantive research topics (in their case 360-degree feedback and its interpretation) should be revisited. Correlations between reliabilities could have similarly strong effects on conclusions drawn from meta-analysis.
In the current article, we find that some of the correlations between reliabilities are quite substantial, that is, in the .30s and .80s in absolute value. To explore the implications, let us look at the specific example of POS/self-rated OCB-O from our meta-analysis in Study 2. The mean observed correlation between these variables (uncorrected) is .36. The mean reliability for the predictor is .92, and the mean reliability for the criterion is .74. These reliabilities are correlated at r = –.80, so they share 64% of their variance. Correcting for predictor unreliability only, the partially corrected correlation between POS and self-rated OCB-O would be .375. Correcting for criterion unreliability in a second step, we would raise the partially corrected correlation to .44 if we did not take the correlation between the reliabilities into account. However, we have already accounted for the 64% of variance that the two reliabilities share. Under this simplification, 64% of that additional increase (.44 – .375 = .065), that is, .042, would be overcorrection.
In a second example, let us assume that a meta-analyst was interested in the relationship between learning goal orientation and training performance (assessed by raters). Let us assume that she found a mean observed correlation (uncorrected) between these two variables of .40. In this meta-analysis, the mean predictor reliability was found to be .80. Given that the meta-analyst did not consistently find information about the criterion reliabilities in the original studies, she decided to correct for criterion unreliability by using the estimate for interrater reliability from Viswesvaran et al.’s (1996) study, that is, r = .52. The results of our meta-analysis in Study 1 show that the correlation between predictor internal consistency and criterion interrater reliability is –.45. So, they share about 20% of their variance. Correcting for predictor unreliability, the partially corrected correlation between learning goal orientation and training performance would be .447. Correcting for criterion unreliability in a second step, as if the predictor and criterion reliabilities were uncorrelated, results in a corrected correlation coefficient of .62. However, 20% of the variance was already accounted for in the predictor unreliability correction. Under the same simplification, 20% of the additional increase (.62 – .447 = .173), that is, .035, would be overcorrection.
These are two grossly simplified examples. The actual effect of correlated reliabilities on the mean corrected correlation is unlikely to be this simple. However, due to the fact that an actual correction formula that can take the correlations between reliabilities into account has not yet been derived, this is the closest estimate of its actual effect that we can provide.
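The arithmetic behind the two simplified examples above can be sketched in a few lines of Python. The function name `sequential_correction` and its argument names are illustrative assumptions, not part of any established package. The rule that treats the squared reliability correlation as the doubly corrected share is the simplification described in the text, not a derived correction formula.

```python
import math

def sequential_correction(r_obs, rxx, ryy, r_rel):
    """Two-step attenuation correction plus the simplified overcorrection heuristic.

    r_obs : mean observed (uncorrected) correlation
    rxx   : mean predictor reliability
    ryy   : mean criterion reliability
    r_rel : correlation between predictor and criterion reliabilities
    """
    # Step 1: correct for predictor unreliability only.
    step1 = r_obs / math.sqrt(rxx)
    # Step 2: also correct for criterion unreliability,
    # treating the two reliabilities as independent.
    step2 = step1 / math.sqrt(ryy)
    # Heuristic from the text: the shared variance (squared correlation)
    # of the reliabilities marks the share of the second-step increase
    # that is counted as overcorrection. The sign of r_rel drops out.
    shared = r_rel ** 2
    overcorrection = shared * (step2 - step1)
    return step1, step2, shared, overcorrection

# Example 1: POS / self-rated OCB-O values from the text.
ex1 = sequential_correction(0.36, 0.92, 0.74, -0.80)
# Example 2: learning goal orientation / training performance values from the text.
ex2 = sequential_correction(0.40, 0.80, 0.52, -0.45)
```

With unrounded intermediate values, the first example gives an overcorrection of about .039 rather than the .042 that the text obtains from the rounded values .44 and .375. The second example gives about .035 either way.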
Arguably, these inflations of the corrected correlation do not seem to be very big. However, given that the core purpose of meta-analysis is to provide the most reliable, accurate, and stable estimates of relationships between variables, any technical refinement that leads to an improvement in our estimation of corrected correlations should be worthwhile. Even if the increase in accuracy is small, it gets us closer to the core purpose of meta-analysis. In addition, one can argue that small effect sizes can sometimes make a large difference. Cortina and Landis (2009) provide a detailed discussion of situations in which the existence of small effects can actually have large implications. They suggest, for example, that in situations in which we expect no effect, evidence for a small effect can be impressive. In addition, in situations in which we expect small effects to accumulate over time (e.g., a few more lives saved from more healthy living practices, or less energy consumed at the end of the year from consciously turning off lights when one leaves the room) small effects can have important outcomes.
In the case of meta-analyses, we might be able to extend these arguments to similar conditions. For example, if a meta-analyst found an effect, even a small one, in situations where research does not expect one, this should be a noteworthy finding. A case in point might be the current study, which finds nonzero correlations between predictor and criterion reliabilities when decades of research have assumed there are none. On the flip side, correcting for the correlation between study artifacts might also mean that effect sizes previously reported by meta-analyses decrease in size. This would, of course, also be important to know, given that many organizational decisions (such as the design of selection systems or trainings) are informed by meta-analytic evidence. Either way, enabling meta-analysts to make more accurate predictions seems to be a worthwhile cause.
Sources of Interdependence Between Reliability Estimates
James and colleagues (e.g., James et al., 1986; James et al., 1988) have argued that situational factors, such as the organizational settings in which data were collected or assessment biases, could have an influence on the interdependence between study artifacts and validity. We extend the argument to suggest that situational factors could also be the root of interdependencies between predictor and criterion reliabilities. For example, if a large number of primary studies in a meta-analysis were collected from similar samples, with a similar type of assessment, or in very similar settings, this could create an interdependence between the reliabilities across studies. One of the potential influencing factors mentioned frequently by prior research is the sample size of the primary study (e.g., James et al., 1986; Raju et al., 1998). Prior research has lamented that, while we correct the observed correlation for sampling error, we do not do the same for the study artifacts, even though they are influenced just as much by sampling error (e.g., Raju et al., 1991). In fact, the influence of sample size would be the same for the predictor and the criterion reliability in a given study. To assess this issue in the current study, we ran a post hoc analysis on the data from Study 1, using sample size as a moderator of the predictor/criterion reliability relationship. We found that all relationships between predictor and criterion reliabilities were moderated by sample size (see Table 7). In some cases, while significant, the interaction effect was smaller (e.g., for the predictor internal consistency/criterion interrater reliability combination, the predictor intrarater reliability/criterion internal consistency combination, and for the predictor and criterion intrarater combination). In some cases, however, the effect of sample size on the relationship between predictor and criterion reliability was quite substantive (e.g., for the predictor and criterion internal consistency combination and the predictor internal consistency/criterion intrarater reliability combination). All in all, the role of sampling error is something that future research should investigate further.
Table 7. Moderated Regressions With Sample Size as the Moderator for the Five Combinations of Predictor and Criterion Reliability.
Note: All variables were centered prior to analysis. **p < .01.
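A moderated regression of the kind summarized in Table 7 could be set up roughly as follows. This is a minimal sketch under stated assumptions, not the analysis reported here. It assumes that criterion reliability is regressed on predictor reliability, sample size, and their product. The function name `moderated_regression` is illustrative, and the data are simulated, hypothetical values generated with a known interaction so that the estimate can be checked.

```python
import numpy as np

def moderated_regression(rxx, ryy, n):
    """OLS of criterion reliability on centered predictor reliability,
    centered sample size, and their product (the moderation term)."""
    rxx = np.asarray(rxx, dtype=float)
    ryy = np.asarray(ryy, dtype=float)
    n = np.asarray(n, dtype=float)
    rxx_c = rxx - rxx.mean()  # center the predictor
    n_c = n - n.mean()        # center the moderator
    # Centering the outcome as well would change only the intercept.
    X = np.column_stack([np.ones_like(rxx_c), rxx_c, n_c, rxx_c * n_c])
    coefs, *_ = np.linalg.lstsq(X, ryy, rcond=None)
    names = ["intercept", "predictor_reliability", "sample_size", "interaction"]
    return dict(zip(names, coefs))

# Hypothetical, noise-free data with a built-in interaction of 0.002.
rng = np.random.default_rng(0)
rxx = rng.uniform(0.70, 0.95, size=200)
n = rng.uniform(50, 500, size=200)
rxx_c, n_c = rxx - rxx.mean(), n - n.mean()
ryy = 0.85 + 0.20 * rxx_c + 0.0001 * n_c + 0.002 * rxx_c * n_c

fit = moderated_regression(rxx, ryy, n)
```

The sketch returns point estimates only. Significance tests such as those flagged in the table note would require standard errors, which are omitted here for brevity.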
Further research design characteristics that might create correlations between predictor and criterion reliabilities across studies could include the number of items on the predictor and criterion tests, the number of raters used to assess interrater or intrarater reliability, the type of sample (e.g., student vs. organizational), or whether information for the predictor and the criterion measure was collected from the same source or from different sources. These are just a few examples of how research design characteristics might be at the root of interdependencies between predictor and criterion reliabilities across studies. Future research will need to assess the impact of each of these factors.
Where to Go From Here—Implications for Future Meta-analyses and Research
Implications for Conducting Meta-Analysis
The ultimate goal of this study was to discover the degree to which reliabilities are correlated and to identify the situations in which this seems to be particularly problematic. The next step for research in this domain would be to provide a formulaic solution to the problem of correlated predictor and criterion reliabilities. Future research should explore ways in which we could adapt existing meta-analysis formulae so that they take the correlation between study artifacts into account. We have explored possible avenues by consulting with statisticians, mathematicians, and members of the RMNET community of the Academy of Management. However, these endeavors have not led to the development of a solution. We trust that future research will be motivated by the findings in the current study to develop a viable formulaic solution.
Until then, we would like to suggest some changes to the reporting practices for meta-analyses that should be added to the best reporting practices suggested by Aguinis, Pierce, et al. (2011), Aytug, Rothstein, Zhou, and Kern (2012), and the Meta-Analysis Reporting Standards (MARS) of the American Psychological Association (2008). As we mentioned in the literature review, independence between study artifacts is an underlying assumption of commonly used meta-analytic formulae. Given that the current article is the first one to empirically refute this assumption, there is a lack of knowledge regarding the extent of this issue. It would be valuable to test the robustness of independence assumptions of study artifacts in future meta-analyses to obtain a better estimate of the underlying issue. Hence, we recommend that future meta-analyses compute and report the correlation between study artifacts across studies. This would give us an idea of the seriousness of the issue in substantive meta-analyses where there is actually a substantial correlation between the predictor and the criterion variables.
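The recommendation to compute and report the correlation between study artifacts could be implemented along the following lines. This is a sketch only. It assumes that the Fisher's z-transformed variant is obtained by applying the inverse hyperbolic tangent to each reliability coefficient before correlating. The function name `reliability_correlations` is illustrative, and the six pairs of reliabilities are hypothetical values, not data from either study.

```python
import numpy as np

def reliability_correlations(rxx, ryy):
    """Correlate predictor and criterion reliabilities across studies,
    once on the raw coefficients and once after Fisher's z-transformation."""
    rxx = np.asarray(rxx, dtype=float)
    ryy = np.asarray(ryy, dtype=float)
    # Pearson correlation of the raw reliability coefficients.
    r_raw = np.corrcoef(rxx, ryy)[0, 1]
    # Fisher's z is the inverse hyperbolic tangent; it requires |r| < 1.
    r_z = np.corrcoef(np.arctanh(rxx), np.arctanh(ryy))[0, 1]
    return r_raw, r_z

# Hypothetical reliabilities, one pair per primary study.
rxx_demo = [0.92, 0.88, 0.95, 0.81, 0.90, 0.86]
ryy_demo = [0.74, 0.80, 0.70, 0.85, 0.77, 0.79]
r_raw, r_z = reliability_correlations(rxx_demo, ryy_demo)
```

Reporting both values, as in the results above, lets readers see whether the skew of the reliability distributions affects the estimate.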
In addition, similar to the illustrations we provided at the beginning of the Discussion section, once meta-analysts know and report how much the predictor and criterion reliabilities correlate in their meta-analysis, they could then assess by how much they might be overcorrecting when assuming independence rather than interdependence between the reliabilities. This would give readers of the meta-analysis an idea of how much the corrected correlation coefficient in the given meta-analysis would change if they were to take the correlation between reliabilities into account.
This, of course, also means that meta-analysts need to become much more consistent in reporting their artifact corrections and in reporting the reliabilities they used for the correction. As Aytug et al. (2012) find in their review of 198 meta-analyses published in 11 different journals, only 50% of the studies that conducted artifact corrections actually reported the values they used for the correction. Aguinis, Dalton, et al.’s (2011) article echoes these findings. They note that 51.2% of the meta-analyses they reviewed reported the criterion reliability and specified the correction they used, while 48.3% of the reviewed meta-analyses reported the predictor reliability and specified the correction they used.
In addition, if future research were to establish the degree to which shared true score variance affects the correlation between predictor and criterion reliabilities across studies (one of our earlier suggestions for future research), then authors of future meta-analyses could compare their observed reliability correlations to the estimate based on shared true score variance alone to determine the extent to which situational factors such as research design characteristics might affect their findings. Furthermore, when future research finds a formulaic solution that takes the correlation between artifacts into account, then readers of meta-analyses that reported study artifact correlations can correct the reported mean corrected correlation after the fact and obtain a more accurate validity coefficient than previously possible.
Implications for Future Research
In addition to the implications for the future practice of meta-analysis, we would also like to highlight an important extension of the issue of correlated artifacts for future research, that is, interdependencies of predictor and criterion reliabilities with other study artifacts, such as range restriction. We have mentioned prior studies in the literature review that have developed formulae that take the relationship between range restriction and reliability into account (e.g., Mendoza & Mumford, 1987; Theron, 1998). To complicate matters further, previous research has shown that sample and measurement characteristics affect both reliabilities and range restriction (e.g., Capraro, Capraro, & Henson, 2001; Cortina, 1993; Crocker & Algina, 1986; Mendoza & Mumford, 1987; Thompson & Vacha-Haase, 2000). We have discussed that range restriction is likely to exacerbate issues with regard to interdependencies of reliability coefficients across studies. However, while we tried to obtain range restriction information from the primary studies in our meta-analysis in Study 1, there were simply not enough studies that reported information regarding range restriction. Future research should examine the effect of range restriction on the correlation between predictor and criterion reliabilities across studies. Furthermore, future research should try to develop an all-inclusive formula that takes into account the interconnectedness among the three most common artifacts: range restriction, predictor unreliability, and criterion unreliability.
Conclusion
Many researchers look to meta-analysis to answer questions that cannot be answered in one single study. Because meta-analysis results frequently have lasting implications for research and practice, it is imperative that the results are as accurate as can be. Therefore, our findings that study artifacts are not independent across studies, as previously assumed, have important implications for researchers conducting meta-analyses. Taking the correlation between the reliabilities into account is important to avoid overcorrecting the observed mean effect size. In this article, we have provided an estimate of the extent to which predictor and criterion reliabilities are correlated. In addition, we have provided implications for future research to generate a solution to the problem and suggested guidelines for the reporting of artifact correlations in future meta-analyses to provide an assessment of the robustness of the underlying assumption of study artifact independence.
Acknowledgments
We thank Christina Cregan, Neal Schmitt, and Larry James for their advice on a prior version of this article. We also thank the editor and three anonymous reviewers for their in-depth advice and recommendations, from which the article has benefitted tremendously.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
