Abstract
Validity coefficients for multicomponent measuring instruments are known to be affected by measurement error that attenuates them, affects associated standard errors, and influences results of statistical tests with respect to population parameter values. To account for measurement error, a latent variable modeling approach is discussed that allows point and interval estimation of the relationship of an underlying latent factor to a criterion variable in a setting that is more general than the commonly considered homogeneous psychometric test case. The method is particularly helpful in validity studies for scales with a second-order factorial structure, by allowing evaluation of the relationship between the second-order factor and a criterion variable. The procedure is similarly useful in studies of discriminant, convergent, concurrent, and predictive validity of measuring instruments with complex latent structure, and is readily applicable when measuring interrelated traits that share a common variance source. The outlined approach is illustrated using data from an authoritarianism study.
Keywords
Validity is an essential requirement of measurement in the behavioral, educational, and social sciences (e.g., McDonald, 1999). A main type of validity is criterion validity, which is subsumed under the comprehensive concept of construct validity (Messick, 1995). A commonly used index of criterion validity is the correlation coefficient between a test or scale score and a criterion variable, especially when both can be considered or treated as continuous (e.g., Raykov & Marcoulides, 2011; see also Crocker & Algina, 2006, for the discrete case). As is well known, however, measurement error usually attenuates this coefficient (cf. Raykov, Marcoulides, & Patelis, 2015, for a more general treatment). Accounting for this error in both the scale score and the criterion variable is therefore an essential step toward obtaining unbiased estimates of criterion validity coefficients.
Multicomponent instrument construction is a complicated multistage process (e.g., Raykov, 2012). Despite well-informed efforts aimed at attaining unidimensionality of the resulting instrument (such as a test, scale, inventory, or self-report; referred to also as “test” or “scale” below), important concerns about validity and construct underrepresentation tend to contribute to its more complex latent structure in empirical research. Oftentimes such a test may be tapping into two or more interrelated constructs that load on a second-order factor. For example, a mathematics ability test could consist of a part evaluating algebra ability, another assessing geometry ability, a third measuring trigonometry ability, and a fourth concerned with problems assessing abstract thinking ability, with all four abilities loading on a second-order factor representative of the targeted mathematics ability. In such situations, it may be difficult to argue for a wider use of the overall scale score (the unweighted or weighted sum of the individual instrument components), because the instrument is not unidimensional (see also below). For this reason, it would similarly be problematic to interpret the correlation of that scale score with a criterion variable of interest as a straightforward index of criterion validity.
The present article addresses these concerns using a latent variable modeling (LVM; e.g., Muthén, 2002) approach. An LVM procedure is discussed below that allows accounting for (a) the measurement error in the overall scale or test score as well as in the criterion variable(s), and (b) the second-order factorial structure of an instrument under consideration (see also Note 1). The method permits point and interval estimation of the correlation between the second-order factor and a criterion variable. With that feature of the approach, this correlation may be considered a latent criterion validity coefficient, which arguably represents an appropriate validity index in such complex latent structure settings. The procedure can be viewed as a generalization of the method of discriminant and convergent validity evaluation discussed in Raykov and Marcoulides (2011, chap. 8), which assumes unidimensionality of a measuring instrument under consideration, to the case of lack of unidimensionality that is characterized by a second-order factorial structure. The discussed method is illustrated on data from a study involving the measurement of a multidimensional concept of authoritarianism (Beierlein, Asbrock, Kauff, & Schmidt, 2014; Duckitt & Bizumic, 2013).
Background, Notation, and Assumptions
For the aims of the present article, we assume that a set of (approximately) continuous measures is given, denoted Y1, Y2, . . ., Yp (p > 1), which represent the components of a measuring instrument whose criterion validity is to be evaluated with respect to a prespecified variable. For this test or scale, we posit the following second-order factor structure
where
A Latent Variable Modeling Procedure for Evaluating Latent Criterion Validity of Measuring Instruments With Second-Order Factorial Structure
Equation (1) shows that it would not be appropriate, strictly speaking, to view as unidimensional the instrument comprising the components
From Equation (3), after straightforward algebra (e.g., on the individual observed variable equations), one notices that all observed variables loading, say, on ηj in Equation (1) share the (common) factors ξ and δj (j = 1, . . ., q); hence, the instrument consisting of the measures
or the weighted sum
using weights wj (j = 1, . . ., p), with a criterion variable cannot be meaningfully treated as a criterion validity index associated with this multicomponent measuring instrument.
A Latent Criterion Validity Coefficient
Equations (3) through (5) reveal further that, for any given value of the second-order factor ξ, the mean values of the sum scores X and W are linear functions of ξ. Hence, one can treat each of these sum scores as effectively measuring that higher-order factor ξ.
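One generic way to write the second-order structure and the sum scores referred to above is sketched next. The notation is partly assumed for illustration: the intercepts \alpha_k, loadings a_k and \gamma_j, residuals \varepsilon_k, the component index k, and the map j(k) from a component to its first-order factor are introduced here, whereas \eta_j, \xi, \delta_j, X, W, and the weights are symbols used in the text. The last line additionally assumes that the residuals have zero mean given \xi.

```latex
\begin{align*}
Y_k    &= \alpha_k + a_k\,\eta_{j(k)} + \varepsilon_k, & k &= 1,\dots,p,\\
\eta_j &= \gamma_j\,\xi + \delta_j,                    & j &= 1,\dots,q,\\
Y_k    &= \alpha_k + a_k\gamma_{j(k)}\,\xi + a_k\,\delta_{j(k)} + \varepsilon_k,\\
X      &= \sum_{k=1}^{p} Y_k, \qquad W = \sum_{k=1}^{p} w_k\,Y_k,\\
E(W \mid \xi) &= \sum_{k=1}^{p} w_k\alpha_k
                 + \Bigl(\sum_{k=1}^{p} w_k\,a_k\,\gamma_{j(k)}\Bigr)\,\xi .
\end{align*}
```

Setting all w_k = 1 gives the corresponding expression for X. The third line also makes visible why the instrument is not unidimensional: components of the same first-order factor share \delta_{j(k)} in addition to \xi.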
In order to account for measurement error in a criterion variable of interest, we treat it in the rest of this discussion as a latent variable (construct) ζ with positive variance that is evaluated by m indicators, Z1 through Zm (m > 1). The remainder also assumes, as indicated earlier, that the overall model resulting when Equations (1) and (2) are augmented by Z1, . . ., Zm and the pertinent measurement model for ζ (see Equation 7 for its formal definition) is identified as well as plausible for a studied population (see Note 1).
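The practical consequence of treating both the scale and the criterion at the latent level can be illustrated with a small simulation. The sketch below is not taken from the study: all loadings, variances, the latent correlation of −0.4, the sample size, and the function names are hypothetical values chosen for illustration, with q = 3 first-order factors, three components each, and m = 3 criterion indicators. It compares the correlation between ξ and ζ with the correlation between the unweighted sum scores of their indicators.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, rho=-0.4, seed=1):
    """Return (latent correlation, sum-score correlation) for simulated data.

    All numeric settings are hypothetical illustration values.
    """
    rng = random.Random(seed)
    xi_vals, zeta_vals, x_vals, z_vals = [], [], [], []
    for _ in range(n):
        # Second-order factor xi and latent criterion zeta with correlation rho.
        xi = rng.gauss(0.0, 1.0)
        zeta = rho * xi + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # Three first-order factors, each with three error-prone components.
        x_sum = 0.0
        for _ in range(3):
            eta = 0.8 * xi + rng.gauss(0.0, 0.6)
            for _ in range(3):
                x_sum += eta + rng.gauss(0.0, 1.0)
        # Three error-prone indicators of the criterion.
        z_sum = sum(zeta + rng.gauss(0.0, 1.0) for _ in range(3))
        xi_vals.append(xi)
        zeta_vals.append(zeta)
        x_vals.append(x_sum)
        z_vals.append(z_sum)
    return pearson(xi_vals, zeta_vals), pearson(x_vals, z_vals)


r_latent, r_observed = simulate()
print(f"latent correlation:    {r_latent:.3f}")
print(f"sum-score correlation: {r_observed:.3f}")
```

Under these settings the population sum-score correlation is about −0.30, noticeably closer to zero than the latent value of −0.4, which is the attenuation the latent approach is meant to avoid.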
For this setting, which is frequently of relevance in empirical social and behavioral research, the present article proposes to consider the correlation
where Corr(·,·) denotes correlation, as an appropriate criterion validity coefficient associated with the multi-component instrument comprising the measures
Point and Interval Estimation of Latent Criterion Validity
In order to estimate and obtain a confidence interval of the LCVC, one can make use of LVM. To this end, we consider the model
which extends the model defined in Equations (1) and (2) (with its assumptions mentioned earlier) by the measurement model for the criterion indicators
With model (7) in mind, the LCVC in Equation (6) is readily seen to be a nonlinear function of appropriate model parameters, namely, the ratio of the latent covariance of the second-order factor and the latent criterion to the product of their standard deviations, that is, the square roots of their variances (cf. Crocker & Algina, 2006). Hence, point and interval estimation of the LCVC proposed in this article becomes possible when model (7) is fitted to data and found plausible. This fitting is feasible using, for instance, the popular LVM software Mplus (Muthén & Muthén, 2017), and a point estimate of the LCVC (6) is obtained as a routine product of this process. A subsequent application of the monotone transformation-based approach to confidence interval estimation of the correlation coefficient in Equation (6), as described for instance in Raykov and Marcoulides (2011), furnishes a confidence interval (CI) for the LCVC of interest here. (This CI construction approach is readily applied using the R-function “ci.pc” in the last cited source, which is also presented in the appendix for completeness of the present discussion.) The Mplus source code accomplishing the LCVC evaluation is also provided in the appendix, where it is applied to the empirical data used in the next section to demonstrate the outlined criterion validity evaluation procedure.
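In symbols, the two steps just described can be sketched as follows. The notation \phi for the latent variances and covariance and SE for an estimated standard error is introduced here for illustration, and the sketch assumes that the monotone transformation in question is Fisher's z-transform combined with a delta-method standard error, shown for a 95% level.

```latex
\begin{align*}
\rho_{\xi\zeta} &= \frac{\phi_{\xi\zeta}}{\sqrt{\phi_{\xi\xi}\,\phi_{\zeta\zeta}}},\\
\hat z &= \tfrac{1}{2}\,\ln\frac{1+\hat\rho_{\xi\zeta}}{1-\hat\rho_{\xi\zeta}},
\qquad
SE(\hat z) \approx \frac{SE(\hat\rho_{\xi\zeta})}{1-\hat\rho_{\xi\zeta}^{\,2}},\\
(z_L,\ z_U) &= \bigl(\hat z - 1.96\,SE(\hat z),\ \hat z + 1.96\,SE(\hat z)\bigr),\\
(\rho_L,\ \rho_U) &= \bigl(\tanh z_L,\ \tanh z_U\bigr).
\end{align*}
```

Because tanh is monotone increasing and bounded by −1 and 1, the back-transformed limits always stay within the admissible range of a correlation, which a symmetric interval built directly on the estimate does not guarantee.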
Illustration on Empirical Data
For the aims of this section, we utilize data from a study concerned with examining right-wing authoritarianism (Duckitt & Bizumic, 2013), which comprised n = 163 members of an online panel representing a sample of German adults (internet users). Right-wing authoritarianism has been defined as a multidimensional concept covering three main traits that were identified by Adorno, Frenkel-Brunswik, Levinson, and Sanford (1950) as follows: (i) conformity to the governmental structures and state authorities (submissiveness), (ii) authoritarian aggression, and (iii) rigid support for traditions and established norms (conventionalism). According to Altemeyer (1981), the combination of these three traits leads to the concept of right-wing authoritarianism.
In the presently used empirical study, authoritarianism as a second-order factor was measured by the Short Scale of Authoritarianism (referred to as KSA-3; Beierlein et al., 2014). The scale consists of 9 indicators (items) and covers the above dimensions (i) through (iii), with three items per dimension, which are correspondingly referred to as Submission, Aggression, and Conventionalism in the rest of the section (see Table 1 for specifics regarding these items per dimension). The nine indicators of authoritarianism were verbally formulated in such a way that high scores (or agreement with their statements) were associated with a higher degree of authoritarianism. As a latent criterion variable, we utilize gender-role attitude that was evaluated by three items on “consequences for parenting”, and refer to it as Parenting in the remainder (see also Note 1). These items constitute a subscale of a psychometric battery evaluating gender-role attitudes that was included in the German General Social Survey (GGSS) study (cf. Braun, 2014). Two indicators (items) of Parenting were formulated in such a way that agreement (higher scores) expressed a liberal opinion and support for women's labor force participation, while one item was reversely formulated (in keeping with modern approaches to self-reporting, the wording of both the positive and negative items was always positive; Ferrando & Lorenzo-Seva, 2010). The text of all 12 items from the KSA-3 (Die Kurzskala Autoritarismus) scale and the Parenting subscale is provided in Table 1 (with an approximate translation from their original German version).
Table 1. KSA-3 (Die Kurzskala Autoritarismus) and Parenting Items (Authors’ Translation From German).
To illustrate the validity evaluation method outlined in the preceding sections of the present paper, we fit model (7) to the data from this study, with p = 9 and q = m = 3 (see the appendix for the needed Mplus source code and notes to it). This model includes a total of 5 latent constructs: the above three first-order factors of Aggression, Submission, and Conventionalism with three indicators each, their second-order factor Authoritarianism, and the criterion construct of Parenting. The indicators of Authoritarianism were each evaluated using a fully verbalized 5-point numeric rating scale, and the indicators of Parenting were evaluated using a 7-point numeric rating scale (see also Table 1). For the illustration purposes of this section, the 12 indicators are treated as approximately continuous measures, to which the robust maximum likelihood method of model testing and parameter estimation is applied (cf. DiStefano, 2002; see also Raykov & Marcoulides, 2011). To deal with a notable proportion of missing data in the construct indicators and counteract possible violations of the missing at random (MAR) assumption underlying this method, we also include as an auxiliary variable the score from the so-called Left-Right Self-Placement scale (LRSP; e.g., Enders, 2010). LRSP refers to the two familiar political orientations (with “left” being associated here with liberal positions, described for instance by loyalty and acceptance of other groups’ positions and low levels of concern with respect to social order and obligations; and “right” being associated with conservative political opinions, described for example by strong in-group orientations and high respect for social order, established norms, and hierarchies; e.g., Crawford & Pilanski, 2014; Haidt, Graham, & Joseph, 2009).
Since both right-wing authoritarianism and gender-role attitudes are conceptually incorporated in the left-right distinction (Knight, 1993), and several studies have found a notable relationship between them (e.g., Banaszak & Plutzer, 1993; Leone, Desimoni, & Chirumbolo, 2014), it was decided to use the LRSP scale score as an auxiliary variable (see also Enders, 2010).
The described model was found to be associated with the following tenable fit indices: chi-square (χ2) = 64.419, degrees of freedom (df) = 50, p-value (p) = .083, and root mean square error of approximation (RMSEA) = .042 with a 90% confidence interval [0, 0.069]. The resulting parameter estimates in it are presented in Table 2.
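The reported degrees of freedom and RMSEA can be cross-checked by simple counting. The sketch below rests on assumptions that are not stated in the text: one loading per factor is fixed at 1 for scale setting, the model has a mean structure with free indicator intercepts and latent means fixed at zero, the auxiliary variable does not change the degrees of freedom, and the RMSEA is computed with the sample size n in the denominator (some programs use n − 1) without any robust scaling adjustment. The dictionary keys and the function name `rmsea` are illustrative.

```python
import math

# Assumed parameterization (hypothetical, for counting purposes only).
n_indicators = 12  # 9 authoritarianism items + 3 parenting items

# Distinct variances/covariances plus means of the observed indicators.
moments = n_indicators * (n_indicators + 1) // 2 + n_indicators

free_parameters = {
    "first-order loadings (9, one fixed per factor)": 6,
    "second-order loadings (3, one fixed)": 2,
    "criterion loadings (3, one fixed)": 2,
    "indicator residual variances": 12,
    "first-order disturbance variances": 3,
    "second-order factor variance": 1,
    "criterion factor variance": 1,
    "covariance of second-order factor and criterion": 1,
    "indicator intercepts": 12,
}

df = moments - sum(free_parameters.values())


def rmsea(chi2, df, n):
    """Point estimate of the RMSEA from a chi-square value, df, and n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * n))


print("degrees of freedom:", df)
print("RMSEA:", round(rmsea(64.419, 50, 163), 3))
```

Under these assumptions the count gives 90 moments and 40 free parameters, matching the reported 50 degrees of freedom, and the RMSEA rounds to the reported .042 with either n or n − 1 in the denominator.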
Table 2. Parameter Estimates, Standard Errors, t Values, and Two-Tailed p Values Associated With Fitted Model (Software Output Format).
Note. S.E. = standard error; AGGR# = indicator of Aggression construct; UW# = indicator of Submission construct; T# = indicator of Conventionalism construct; RF# = indicator of Parenting construct; A = Authoritarianism construct; LCV = Latent Criterion Validity (LCVC in the text). The negative sign of RF12 is due to reversed scoring (see Table 1), and the negative sign of the LCV estimate indicates a tendency of high authoritarianism to be associated with lower support for labor force participation of women (see also main text).
In this plausible model, of particular interest is the LCVC estimate, which, as seen from Table 2, is −0.343, with a standard error of 0.123. Using the aforementioned monotone transformation-based approach to CI construction (see also the appendix), we obtain a 95% CI for the LCVC of (−0.558, −0.084). This relatively wide CI is not unexpected given the limited sample size of the empirical study used. At least as importantly, the CI, which does not include zero, indicates a significant but weak to moderate (linear) relationship between Authoritarianism, as measured by the employed nine-item scale, on the one hand, and the construct of Parenting as a criterion (latent) variable on the other hand. This correlation indicates a tendency of persons above average on Authoritarianism to be among those with scores on Parenting below its mean. In addition, as a validity-related coefficient, this correlation also indicates a considerable and expected discriminant validity of the used Authoritarianism scale with respect to the Parenting construct.
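The interval computation can be sketched in a few lines of Python. This is an illustrative analogue and not the authors' R function: it assumes the monotone transformation is Fisher's z-transform with a delta-method standard error, and the function name `fisher_z_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def fisher_z_ci(r, se, level=0.95):
    """Approximate CI for a correlation via Fisher's z (assumed method).

    r  : point estimate of the correlation (here, the LCVC)
    se : standard error of that estimate, as reported by the software
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                # 0.5 * ln((1 + r) / (1 - r))
    se_z = se / (1.0 - r * r)        # delta-method standard error of z
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the limits to the correlation metric.
    return math.tanh(z - crit * se_z), math.tanh(z + crit * se_z)


# Estimate and standard error reported in the text for the LCVC.
lower, upper = fisher_z_ci(-0.343, 0.123)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

With the rounded inputs −0.343 and 0.123, this sketch gives limits of about −0.559 and −0.084, which agrees with the reported interval up to a small difference in the third decimal of the lower limit, presumably because the reported interval was computed from unrounded estimates.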
Conclusion
This article was concerned with an LVM procedure for point and interval estimation of a proposed latent criterion validity coefficient for a multicomponent measuring instrument with a latent structure that is more complex than that of unidimensionality. The discussed approach is useful in empirical situations with second-order factorial structure of psychometric tests or scales under consideration, when a researcher is also interested in accounting for measurement error in criterion variables of concern. The method is applicable when the criterion variables are uncorrelated with the error terms in the individual components of a given test or scale (a testable condition that can be examined using, e.g., the approach in Raykov, Marcoulides, Gabler, & Lee, 2017). The outlined procedure may be of particular utility in studies of criterion, discriminant, convergent, concurrent, or predictive validity of relatively long but internally consistent tests that are, however, not homogeneous.
Several limitations of the discussed approach are worth pointing out here. As indicated earlier, the procedure assumes (approximately) continuous individual scale components and criterion variables. In the case of indicator normality, the use of maximum likelihood (ML) estimation is, as is well known, appropriate and yields ML estimates of the latent criterion validity coefficients of interest (e.g., Bollen, 1989). With up to mild deviations from normality that do not result from piling up of responses at a scale end for individual components, it may well be advisable to use instead the robust ML method (MLR; Muthén & Muthén, 2017), possibly also with components having as few as five to seven response options (e.g., DiStefano, 2002; see also the appendix). Further research on the robustness of the MLR method is needed, however, in such situations. With fairly large samples, weighted least squares (WLS) estimation is also available with nonnormal continuous instrument components (e.g., Bollen, 1989). Relatedly, the outlined validity evaluation procedure is best used with large samples, because its application rests on ML, robust ML, or WLS estimation, all of which are grounded in asymptotic statistical theory (e.g., Muthén, 2002). Future research is needed here as well, which may contribute to the development of guidelines for determining sample sizes at which one could rely on that large-sample theory.
As a third limitation, we assumed throughout that observations (studied persons) were independent, that is, not clustered or nested within (higher order) Level-2 units, such as schools, clinicians, interviewers, physicians, neighborhoods, cities, and so on. It may be hypothesized that the robust ML estimation method also has some robustness to limited violations of this classical independence assumption, especially when the degree of nonnormality is not pronounced. To our knowledge, however, there is not sufficient research in this area to establish the extent to which, and the conditions under which, such robustness can be relied upon.
Last but not least, the plausibility and identification of model (7) when used in applications of the procedure of this paper is essential, as indicated earlier (see also Note 1). When either of these two conditions is not satisfied, the discussed method cannot be generally recommended as it may yield misleading parameter estimates, standard errors, and statistical test results with regard to criterion validity of studied tests or scales. Lack of identification of the overall model may be expected with an insufficient number of indicators for any of the first-order factors, and may be resolved by adding appropriate parameter constraints that reflect substantively plausible parameter relationships in studied populations (e.g., Raykov & Marcoulides, 2006).
In conclusion, this article offers to empirical educational, behavioral, and social scientists a widely applicable means for point and interval estimation of criterion validity of multicomponent measuring instruments with second-order factorial structure, which permits also accounting for measurement error in associated overall sum scores (whether weighted or not) as well as in used criterion variables.
Footnotes
Appendix
Acknowledgements
This research was in part conducted while T. Raykov was visiting the Leibniz Institute for the Social Sciences, Mannheim, Germany. Thanks are due to B. Rammstedt for helpful support.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
