Abstract
In generalizability theory (G theory), one-facet models are specified to be additive, which is equivalent to the assumption that subject-by-facet interaction effects are absent. In this article, the authors first derive estimators of variance components (VCs) for nonadditive models and show that, in some cases, they are different from their counterparts in additive models. The authors then demonstrate and later confirm with a simulation study that when the subject-by-facet interaction exists, but the additive-model formulas are used, the VC of subjects is underestimated. Consequently, generalizability coefficients are also underestimated. Thus, depending on the nature of interaction effects, an appropriate model, either additive or nonadditive, should be used in applications of G theory. The nonadditive G theory developed in this article generalizes current G theory and uses data at hand to determine when additive or nonadditive models should be used to estimate VCs. Finally, the implications of the findings are discussed in light of an analysis of real data.
Introduction
Generalizability theory (G theory) investigates the dependability or reliability of the results from a sample of behavioral measurements at hand, and the results are then to be generalized to any valid measurements for decision making (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Cronbach, Rajaratnam, & Gleser, 1963; Shavelson & Webb, 1991). It aims to pinpoint all potential sources (specified by users) of variation in measurement scores by using ANOVA. The variance components (VCs) of identified sources of variation are the basic building blocks of G theory. An analysis based on G theory typically consists of two steps, a generalizability study and a decision study. A decision can be a relative or an absolute decision. Generalizability coefficients (G coefficients) measure the dependability or reliability of results that can be generalized from a sample of observations to a universe of admissible observations. For details about G theory, see Brennan (2001) and Shavelson and Webb (1991).
Model specifications in current G theory assume additivity such that “all effects in the model are uncorrelated” (Brennan, 2001, p. 23), which is statistically equivalent to assuming no interaction between the subject (e.g., person or essay) and the facet (e.g., item or rater) in a design with one facet. However, the assumption of additivity may not be satisfied in practice. In a multiple-rating design of performance-based assessment, each subject response is scored by multiple raters. Clearly, the subject-by-facet interaction may be present. For example, such an interaction may be introduced when some raters tend to award higher scores for certain responses, whereas other raters award lower scores for such responses. Another scenario that may also introduce interaction is when some raters tend to give polarized scores (very high scores for high performance students and very low scores for low performance students) in a performance-based assessment, whereas other raters tend to give toward-average scores.
In the presence of interaction, directly applying additive models in the analyses and failing to consider the interaction effects may lead to underestimation of the VC of subjects as the article will reveal later. Depending on the degree to which the VC is underestimated, negative estimated VCs may occur. Consequently, generalizability coefficients are also underestimated and may even be negative as will become clear in the next section.
Although the assumption of additivity has been widely accepted in G-theory applications, a more general linear model is the nonadditive model, which relaxes the assumption of no interaction (i.e., additivity). In this article, the differences and implications for additive and nonadditive model specifications are discussed. Nonadditive models are introduced to G theory to broaden the scope of current G-theory applications such that nonadditivity is also considered. The authors further show that the estimated VCs of subjects are different for the additive and nonadditive models. Thus, appropriate uses of the formulas rest on careful examinations of the underlying model assumptions and of the nature of subject-by-facet interaction effects.
The rest of the article is organized as follows: First, the difference between random and fixed effects is discussed, and the nonadditive model is introduced. Next, the estimators of VCs for one-facet nonadditive models, which include an additive model as a special case, are derived. Tukey’s test is introduced for testing nonadditivity, and a nonadditive G-theory approach is developed. Then, two simulation studies are conducted. The first simulation study shows that the nonadditive approach produces results nearly as accurate as those of the additive approach used in current G theory when data contain no interaction (i.e., an additive model is true). The second simulation study confirms that when the subject-by-facet interaction is present, the formulas for estimated VCs following the additive models can severely underestimate the VC for subjects, especially when the number of facet levels is small, but the results from the nonadditive approach are much improved. A real data set with one negative VC estimate based on current additive G theory is reanalyzed using the formulas for estimated VCs in nonadditive G theory developed in this article. The results show that all VC estimates are now positive as a result of using the nonadditive approach. Finally, the nonadditive approach and its implications for the G-theory framework are further discussed.
Models and VC Estimates
Suppose in a writing-proficiency study, each of 25 students writes an essay on the same topic, and each essay is graded by all three raters. This is a fully crossed
The subject effects are considered to be random in G theory, while the facet can be either fixed or random. Thus, the linear model used in G theory is either a random-effect or mixed-effect model. The difference between random and fixed facets is that the former treats, for example, the rater facet as random and therefore the generalizability is to a universe of similarly qualified raters, whereas the latter treats the rater facet as fixed and hence the derived inferences are limited to the raters recruited for the measurement design at hand. For details about guidelines for judging whether random or fixed effects should be used in a general linear model, see Gelman (2005, pp. 21-23), Kreft and de Leeuw (1998, Section 1.3.3), and Snijders and Bosker (2012, Section 4.3.1).
A random-effect linear model for a one-facet design in G theory is
where
Note that Scheffe (1959) considered all complete two-way layouts. This includes the special case where there is only one observation per cell, which is the case considered in G theory and in this article.
According to Scheffe’s theorem, the interaction term
It is a common belief that the subject-by-facet interaction and the pure error are totally indistinguishable because there is only one observation per cell in a one-facet design. Thus, these two components are always assumed to be uncorrelated with other components and combined together as a single error term in Model 1 in current G theory. That is, current G theory only uses random-effect additive models in one-facet designs and assumes there is no subject-by-facet interaction.
Tabachnick and Fidell (2007) pointed out that
[o]ne’s fondest hope in repeated measures is that there is no genuine (population) interaction, so that the error term is only random error. This is a forlorn hope with many real data sets, because of contingencies in the responses of the same cases as they are measured repeatedly over time. (p. 248)
The (genuine) interaction effects, if they exist, are correlated among themselves and/or correlated with other main effects and cannot be treated as error components in statistics. One type of interaction is that
In current G theory, whether a facet is treated as random or fixed is determined by the objective of a research study and thus, a facet is assumed to be random when a user intends to generalize results toward the underlying universe. In a writing-proficiency study, although raters are typically not randomly selected from the universe of all qualified raters, they are generally considered interchangeable with other raters in the universe. Thus, the rater facet is regarded as random (see Shavelson & Webb, 1991). In addition, the facet is always considered random in a one-facet design in current applications of G theory. However, this practice is not appropriate in certain scenarios (i.e., in the presence of interaction). The issue occurs due to the potential incongruence between actually selected facet levels (conditions of measurement, for example, raters) and the universe of admissible observations (e.g., all qualified raters). For example, to treat the rater facet as random in a writing-proficiency study, one crucial requirement for qualified raters is that they are able to grade students’ essays objectively (no interaction) according to Corollary 1. In reality, however, “qualified raters” are typically identified according to their academic degrees and work experience, and these qualifications do not necessarily guarantee objectivity in scoring. That is, students’ scores may be interfered with by such a nominally qualified rater’s preferences and thus, subject-by-facet interaction may exist. In such a case, it is not appropriate to use the random-effect additive model due to the interaction. In short, although the facet is conceptualized as random (no interaction) in a one-facet design, whether the facet can be treated as such in a statistical model is an empirical question, and therefore one cannot take it for granted in practice. 
From a statistical perspective, users need to check whether interaction effects exist in the data at hand, especially in a design where the subject and/or facet involve humans, before they can safely treat the facet as random in a one-facet linear model.
Three research issues should be investigated. First, given a data set from a one-facet design, is there any method to test whether subject-by-facet interaction effects are significant or not? Clearly, if the interaction effects are not significant, one can treat the facet as random and use current G theory to complete the rest of the data analyses. However, if the interaction effects are significant, then the next research question is “What are the consequences of continuing to treat the facet as random and applying current G theory to the data?” Finally, if the consequences are severe, one needs to develop an appropriate new method to deal with possibly existing interaction effects under a more general setting than current G theory.
According to Corollary 1, a random-effect linear model is additive and assumes no interaction in a one-facet design. As such, for a model to accommodate interaction effects, the facet has to be assumed fixed and thereby a mixed-effect nonadditive model should be applied.
Model 1 becomes a mixed-effect linear model when the facet is treated as fixed: The treatment effects
where
That is, the Additive Model 2 is equivalent to Model 1 when the facet is random or when the facet is fixed without any interaction. The advantage of using Equation 2 for an additive case is that it explicitly shows that no genuine interaction is considered in the model, whereas Model 1 may give readers the wrong impression that the interaction is always considered.
The regular notation of ANOVA is used in this article: The sums of squares (SS) are
The mean squares (MS) are obtained by dividing these SS by their corresponding degrees of freedom (df).
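The SS and MS for a one-facet design with one observation per cell can be sketched in a few lines. The following is a minimal illustration that uses the standard two-way ANOVA decomposition without replication; the function name `one_facet_anova` and the 4-by-3 score table are hypothetical and are not taken from the article's data.

```python
def one_facet_anova(scores):
    """Return SS, df, and MS for a persons-by-facet table with one score per cell.

    `scores` is a list of rows (one row per person, one column per facet level).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    facet_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Main-effect sums of squares: squared deviations of marginal means,
    # weighted by the number of cells behind each mean.
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in facet_means)
    # Residual: what is left after removing both main effects. With one
    # observation per cell it holds interaction and pure error together.
    ss_res = sum(
        (scores[p][j] - person_means[p] - facet_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )

    df = {"p": n_p - 1, "i": n_i - 1, "res": (n_p - 1) * (n_i - 1)}
    ss = {"p": ss_p, "i": ss_i, "res": ss_res}
    ms = {source: ss[source] / df[source] for source in ss}
    return {"ss": ss, "df": df, "ms": ms}


# Hypothetical scores: 4 persons (rows) rated by 3 raters (columns).
scores = [[2, 3, 3], [1, 1, 2], [3, 3, 3], [0, 1, 1]]
table = one_facet_anova(scores)
for source in ("p", "i", "res"):
    print(source, round(table["ss"][source], 4), table["df"][source],
          round(table["ms"][source], 4))
```

The three SS add up to the total sum of squares around the grand mean, which is a convenient check on any implementation.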
The expected MS for a one-facet model under the additive (for both random and mixed) and nonadditive (mixed) models are presented in Table 1. As the same notation is used, the formulas for both random- and mixed-effect models are the same when models are additive. Given a set of data, all estimation results based on an additive model with a fixed facet are the same as those with a random facet in any one-facet design. The only difference is in the interpretation of the analysis results. For readers’ convenience, Table 1 includes expected MS for the additive models under both Equations 1 and 2. Readers can easily find that the column for the Additive Model 1 in Table 1 is exactly the same as Table 3.1 in Shavelson and Webb (1991). The formulas for the expected MS for a within-subject design can also be found in experimental design books (e.g., Davis, 2002; Keppel & Wickens, 2004; Myers, 1979; Myers, Well, & Lorch, 2010; Scheffe, 1959).
Table 1. Expected MS in a One-Facet Design.
Note. MS = mean squares.
The formulas for estimated VCs under the Nonadditive Mixed-Effect Model 1 can be obtained by replacing the expected MS with the corresponding MS in Table 1 (i.e., by the moment matching method). Specifically, according to the last column of Table 1, let
By solving these three equations, the formulas for estimated VCs under the Nonadditive Mixed-Effect Model 1 are obtained:
In the case of an additive model, according to the fourth column of Table 1,
and the estimators of
The variances of interaction and pure error are confounded with each other in a design with only one observation in each cell. In general,
Tukey (1949) proposed a single-degree-of-freedom test for nonadditivity (interaction) in the two-way layout (i.e., one-facet design) with only one observation per cell. Note that the number of df for the residual sum of squares,
and
Myers (1979) showed that Tukey’s test is remarkably sensitive to average-by-effect interactions, but may not work in cases where some interaction effects were clearly present. Kirk (2013) also wrote that
Tukey’s test is sensitive to situations in which the scores for each block follow the same general trend, but the amount of change from
Myers (1979) presented a plot to summarize the pattern of interaction to which Tukey’s test is sensitive, and pointed out that “the nature of most interaction effects is such that the cross-product row-mean relation will have a linear component and Tukey’s test will therefore be serviceable” (see Myers, 1979, p. 186, Figure 7-2). Although the power of Tukey’s test varies with the pattern of interaction, it can control the rate of Type I errors at any significance level, which lends credence to using Tukey’s test in G-theory settings (see Lin, 2014).
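For readers who want to see the mechanics of the test, the sketch below computes the single-degree-of-freedom nonadditivity statistic in its usual textbook form: the nonadditivity SS is the squared sum of cross-products of the scores with the estimated row and column effects, divided by the product of the squared effects, and it is tested against the remainder of the residual SS. The function name `tukey_nonadditivity` and the example table (a multiplicative interaction plus a small deterministic perturbation) are hypothetical.

```python
def tukey_nonadditivity(scores):
    """Tukey's one-degree-of-freedom test for a table with one score per cell.

    Returns (F statistic, numerator df, denominator df, SS nonadditivity, SS residual).
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    # Estimated main effects: deviations of marginal means from the grand mean.
    a = [sum(row) / n_i - grand for row in scores]
    b = [sum(row[j] for row in scores) / n_p - grand for j in range(n_i)]

    cross = sum(scores[p][j] * a[p] * b[j] for p in range(n_p) for j in range(n_i))
    ss_nonadd = cross ** 2 / (sum(x * x for x in a) * sum(x * x for x in b))
    ss_res = sum(
        (scores[p][j] - grand - a[p] - b[j]) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    # One df goes to nonadditivity; the remainder serves as the error term.
    df_rem = (n_p - 1) * (n_i - 1) - 1
    f_stat = ss_nonadd / ((ss_res - ss_nonadd) / df_rem)
    return f_stat, 1, df_rem, ss_nonadd, ss_res


# Hypothetical 5-by-3 table: additive part plus 0.5 * (row effect) * (column effect),
# with a small deterministic perturbation standing in for pure error.
row_eff = [-2, -1, 0, 1, 2]
col_eff = [-1, 0, 1]
scores = [
    [5 + r + c + 0.5 * r * c + 0.1 * ((3 * p + 7 * j) % 5 - 2)
     for j, c in enumerate(col_eff)]
    for p, r in enumerate(row_eff)
]
f_stat, df1, df2, ss_nonadd, ss_res = tukey_nonadditivity(scores)
print(round(f_stat, 2), df1, df2)
```

The statistic is referred to an F distribution with 1 and (n_p - 1)(n_i - 1) - 1 degrees of freedom; if SciPy is available, `scipy.stats.f.sf(f_stat, df1, df2)` gives the p value.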
Let
where
The Nonadditive Approach
In this article, a conservative approach is adopted: First, Tukey’s test for nonadditivity is used to gauge whether any interaction exists: If
or
The above two estimates are not necessarily the same. In this article, the average of these two estimates is used as the default value for estimated
It should be noted that this nonadditive approach uses a nonadditive model only when Tukey’s test is significant. Thus, when applied to data without any interaction, the nonadditive approach is the same as the additive approach used in current G theory except for the cases where Tukey’s test makes Type I errors. When interaction effects are not significant, the facet can be treated as random.
Clearly,
By comparing Equation 6 of an additive model with Equation 7 of a nonadditive model, one can see that G theory currently only uses additive models in which
The formulas for estimated VCs under additive and nonadditive one-facet models are summarized and presented in Table 2. Note that the estimators of the VCs for
Table 2. Estimated VCs in a One-Facet Design.
Note. VC = variance component; MS = mean squares.
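As a concrete illustration of the additive-model column, the sketch below applies the familiar moment-matching estimators for a one-facet crossed design (the same ones given by Shavelson & Webb, 1991). The nonadditive estimators are not reproduced here. The function name and all MS values are hypothetical.

```python
def additive_vc_estimates(ms_p, ms_i, ms_res, n_p, n_i):
    """Moment-matching VC estimates for an additive one-facet model."""
    return {
        "p": (ms_p - ms_res) / n_i,   # subjects (persons)
        "i": (ms_i - ms_res) / n_p,   # facet (e.g., raters or items)
        "res": ms_res,                # residual
    }


# Hypothetical mean squares from a 4-person, 3-rater table.
example = additive_vc_estimates(ms_p=3.6389, ms_i=0.5833, ms_res=0.1389,
                                n_p=4, n_i=3)
print({k: round(v, 4) for k, v in example.items()})

# Hypothetical case in which MS for persons falls below the residual MS:
# the additive formula then returns a negative VC estimate for persons.
near_zero = additive_vc_estimates(ms_p=0.40, ms_i=1.20, ms_res=0.55,
                                  n_p=25, n_i=3)
print(round(near_zero["p"], 4))
```

The second call shows how a negative estimate arises mechanically: whenever the residual MS exceeds the MS for persons, the difference in the numerator is negative.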
The estimators of VCs for a fixed facet and for a random facet coincide when interactions between subject and facet are absent. However, when such an interaction exists, but an additive model is inadvertently used to analyze the data,
or expressed with population parameters as
Equation 8 can be used to evaluate the bias if the additivity is assumed as current G theory does. When a data set follows an additive model (i.e.,
The VCs are the building blocks of many important indexes such as the intraclass correlation coefficient (ICC) in ANOVA and the G coefficients in G theory. Readers can easily find the similarity between the ICC and G coefficients from the formulas presented below.
ICCs are used to measure the strength of association and statistical dependence:
for a fixed facet, and
for a random facet. For details about how to calculate estimated ICCs for fixed- and random-effect models, see Shrout and Fleiss (1979).
As mentioned in the previous section, an analysis in G theory typically consists of two steps, a generalizability study and a decision study, and a decision can be a relative or an absolute decision. The G coefficient for a relative decision is defined as
where
Here,
where
When
Notice that y = x / (x+c) is a monotone increasing function of x, when c > 0. Thus, if
and the one based on a nonadditive model (when interaction effects are significant) is
where
Similarly, the estimated G coefficients for an absolute decision based on additive and nonadditive models can be obtained. Clearly, the estimated ICC,
Ideally, the facet in a one-facet design should be treated as random because a G-theory user intends to generalize results to the universe of generalization. Thus, the existence of subject-by-facet interaction effects is not desirable in a generalizability study with one facet. If interaction effects are significant, depending on the nature of the facet, appropriate actions, such as intensive rater training, item modification, or form reconstruction, may be called for to eliminate possible causes of the interaction in any future studies, including the follow-up decision study. If possible, the generalizability study should be repeated independently; that is, data should be recollected so as to ensure that new data contain no interaction (e.g., essays are rescored by newly trained raters). Clearly, this is an ideal way to deal with situations where the initial data contain significant interaction effects. However, in practice, data recollection is typically not feasible and one has to use the original data set to do any analyses. In such a case, not all selected nominally “qualified raters” (conditions of measurement, in general) are truly qualified (admissible) and thus, the nonadditive model should be applied to isolate the interactions and better estimate the VC for subjects. Regardless of the presence or absence of interaction, it is reasonable to assume that the subject effect and the rater effect based on either nominally “qualified raters” or truly “qualified raters” will be the same. That is, nominally “qualified raters” are expected to produce the same VCs for subjects and raters in a Nonadditive Model 1 as those produced by truly “qualified raters” in a Random-Facet Additive Model 2. As such, the estimated VCs from a Nonadditive Model 1 can be regarded as the corresponding VC estimates for a Random-Facet Additive Model 2, the G coefficients can be estimated by Equation 14, and a decision study can be performed.
That is, a nonadditive model is used to better estimate the VC of subjects only when data contain interactions but data recollection is not feasible.
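The monotonicity argument used in this section can be checked numerically. The sketch below uses the standard textbook definitions of the G coefficients for relative and absolute decisions in a one-facet crossed design, not the article's numbered equations, and every VC value in it is hypothetical. It shows that, with the error term held fixed, a smaller estimate of the VC for subjects always yields a smaller coefficient.

```python
def relative_g(var_p, var_res, n_facet):
    """G coefficient for a relative decision: var_p / (var_p + var_res / n')."""
    return var_p / (var_p + var_res / n_facet)


def absolute_g(var_p, var_i, var_res, n_facet):
    """Index for an absolute decision: the facet VC also counts as error."""
    return var_p / (var_p + (var_i + var_res) / n_facet)


# Hypothetical VCs: the same residual and facet VCs, but two different
# estimates of the VC for subjects (the second one is underestimated).
full = relative_g(var_p=1.0, var_res=0.5, n_facet=3)
under = relative_g(var_p=0.6, var_res=0.5, n_facet=3)
print(round(full, 4), round(under, 4))

# The absolute coefficient can never exceed the relative one when the
# facet VC is nonnegative.
print(round(absolute_g(1.0, 0.2, 0.5, 3), 4))
```

Because x / (x + c) increases in x for c > 0, any downward bias in the estimated VC for subjects carries over directly to the estimated G coefficient.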
Simulation Studies
Two simulation studies were carried out to evaluate and compare the performance of the nonadditive approach developed in the preceding section with that of the additive approach used in current G theory. The first study tested the nonadditive approach on simulated data without any interaction (i.e., data generated from an additive model or
Simulation Settings and Data Generation
In both simulation studies, data sets were generated from Equation 1. Items were targeted to be scored as 0, 1, 2, or 3. The overall mean was set to be
In the first simulation study, the subject effects
In the second study, the subject effects
In both studies, the number of persons was selected to be
Both the additive approach of current G theory and the nonadditive approach developed in this article were applied to analyze each set of simulated data and results were summarized and compared. Note that these two approaches will produce exactly the same estimation results for
Simulation Study 1: Additive Model
In this study, data were generated from Equation 1 with random effects (i.e., an additive model) satisfying all of the assumptions of current G-theory models. The results are summarized in Table 3.
Table 3. Average Estimated Additivity Index and Biases of Estimates of the VC for Persons Based on 1,000 Replications When Data Contain No Interaction.
Note. VC = variance component;
The third column presents the averages of estimated
Simulation Study 2: Nonadditive Model
In this study, data were generated from Equation 1 with interaction and
Table 4. Average Biases of Three Estimates of the VC for Persons Based on 1,000 Replications.
Note. VC = variance component;
Table 5. Average RMSEs of Three Estimates of the VC for Persons Based on 1,000 Replications.
Note. RMSE = root mean square error; VC = variance component;
Table 6. Frequency of Negative Estimated VCs for Persons Among 1,000 Replications.
Note. VC = variance component;
Table 7. Average Biases of Estimated G Coefficients for a Relative Decision Based on 1,000 Replications.
Note.
Tables 4 and 5 present the average biases and RMSEs of estimated VCs for persons, respectively. The average biases and RMSEs of
Note that the magnitude of the bias from an additive model depends on
Table 5 shows that the RMSEs of
Table 6 shows that the frequency of getting negative
The estimated G coefficients for a relative decision,
In sum, the second simulation study confirms the theoretical results obtained in the preceding section that when data contain the subject-by-facet interaction, the VC estimator for subjects/persons based on the formula from the current G theory (i.e., additive approach) has a substantial negative bias when the number of facet levels is small. The nonadditive approach with Tukey’s test for nonadditivity can reduce the bias substantially in such cases. Although the simulation study shows that the estimator based on the nonadditive approach works well in terms of reducing bias and RMSE, there are still many cases with a relatively large bias of
Note that the bias of the estimated VCs based on additive models will become small when
A Real Operation Data Analysis
The data set was collected in summer 2009 during a standards-to-standards correspondence study in Mississippi (Chi, Lin, & Yap, 2009). One of the objectives of the study was to compare cognitive complexity between two sets of standards (descriptors). One set of 25 standards was treated as the anchor standards. The anchor standards are English language proficiency standards designed for English language learners (ELLs). These anchor standards describe linguistic skills and language demands in relation to ELLs being able to understand and convey content knowledge in mainstream classrooms. A panel of six reviewers
Before comparing the cognitive levels of the two sets of standards, Chi et al. (2009) examined the degree to which ratings from the panel of reviewers were reliable. The G coefficients for relative and absolute decisions were used to assess the reliability of the panel of reviewers. For the panel of six reviewers in the pre-Kindergarten to Kindergarten grade cluster,
The nonadditive approach was applied to reanalyze the data. The estimate of the additivity index based on Tukey’s method is 0.4988. According to Equation 7,
Discussion
This article first discusses the distinction between the fixed and random facets and the distinction between genuine and removable interactions in G theory. Then, it introduces Scheffe’s theorem to point out the fact that the random-effect linear G-theory model is actually assumed to be additive, which is equivalent to assuming that no subject-by-facet interaction exists in a one-facet design. Ideally, the facet in a one-facet design should be treated as random and subject-by-facet interactions should be avoided. However, in practice, additivity (no interaction) may be untenable and can be severely violated. In such a case, the VC estimator for subjects based on a random facet or an additive model is expected to have negative bias. As shown by Equation 8, the bias can be severe when the number of selected facet levels is small. Thus, prior to treating the facet as random in statistical analyses, it is important to examine whether interaction effects are significant. Finally, the article expands G theory to a more general setting in which both additive and nonadditive models are considered. Such research efforts are important in that interactions are likely to occur in operational settings, and when they do, applications of nonadditive G theory become necessary. In sum, when interactions are absent or removable, current G-theory methods will suffice, but when interactions exist, nonadditive G theory is needed to estimate VCs.
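The negative bias described above can be reproduced with a small Monte Carlo sketch. It is not the article's simulation design: scores here are continuous rather than 0 to 3, the facet effects are fixed and equally spaced, and the interaction effects are drawn as normal deviates and centered within each person so that they sum to zero across facet levels. All names and parameter values are hypothetical. Under this generating scheme the additive estimator of the VC for persons is biased by roughly minus the interaction variance divided by the number of facet levels.

```python
import random


def additive_person_vc(scores):
    """Additive-model estimate of the VC for persons: (MS_p - MS_res) / n_i."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    facet_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ss_res = sum(
        (scores[p][j] - person_means[p] - facet_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    return (ms_p - ss_res / ((n_p - 1) * (n_i - 1))) / n_i


def simulate_scores(rng, n_p, n_i, sd_p, sd_int, sd_e):
    """One persons-by-facet table with optional person-by-facet interaction."""
    facet_effects = [j - (n_i - 1) / 2 for j in range(n_i)]  # fixed, sum to zero
    table = []
    for _ in range(n_p):
        person = rng.gauss(0.0, sd_p)
        raw = [rng.gauss(0.0, sd_int) for _ in range(n_i)]
        centre = sum(raw) / n_i
        # Centering makes the interaction effects sum to zero within a person.
        row = [
            person + facet_effects[j] + (raw[j] - centre) + rng.gauss(0.0, sd_e)
            for j in range(n_i)
        ]
        table.append(row)
    return table


def mean_estimate(sd_int, reps=400, seed=2024):
    """Average additive estimate over `reps` tables (true person VC is 1.0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += additive_person_vc(simulate_scores(rng, 25, 3, 1.0, sd_int, 0.5))
    return total / reps


without_interaction = mean_estimate(sd_int=0.0)
with_interaction = mean_estimate(sd_int=1.0)
print(round(without_interaction, 3), round(with_interaction, 3))
```

With three facet levels and an interaction variance equal to the true VC for persons, the average additive estimate falls well below the true value of 1.0, whereas it stays close to 1.0 when no interaction is generated. Increasing the number of facet levels in `simulate_scores` shrinks the gap, in line with the dependence on the number of facet levels noted above.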
Given a data set, the nonadditive G-theory approach developed in this article starts with a mixed-effect model to deal with possible interaction effects. First, it uses Tukey’s test for nonadditivity to examine whether the subject-by-facet interaction exists. If the test result is significant, then the nonadditive approach estimates the additivity index based on Tukey’s method and estimates the VC for subjects using Equation 7; otherwise, it sets the additivity index to be one and uses an additive model or Equation 6 to estimate the VC for subjects just as current G theory does. After obtaining estimated VCs, the rest of the analysis remains the same as that in current G theory. Thus, the nonadditive approach generalizes current G theory to deal with the complication of VC estimates due to nonadditivity in data by adding the initial step of testing the subject-by-facet interaction.
In relation to the real data analysis in the previous section, the practical implications of nonadditive G theory are twofold. First, the notion of interaction/nonadditivity gives another possible explanation for negative VC estimates in practice. Additive G-theory models are shown to underestimate the VC for subjects in the presence of interaction in data. If the true VC is expected to be small, the estimated VC could be negative depending on the degree of such underestimation. Second, the use of the additivity index can serve as a quick tool to evaluate the quality of data. A small value in the additivity index would signal the presence of interaction between the subject (e.g., person or essay) and the facet (e.g., item or rater).
In general, it is important to check the nonadditivity (interaction) of any given data. Tukey (1949) investigated this issue and proposed a test for nonadditivity. Tukey decomposed the residual sum of squares into an interaction component representing nonadditivity and the remaining component as an error term for testing nonadditivity. Although Tukey’s test was successful in singling out the interaction effects in the second simulation study, it is not sensitive to all interactions. As Kirk (2013) pointed out, “Tukey’s procedure tests the (null) hypothesis that a special type of contrast–contrast interaction is equal to zero. For this contrast–contrast interaction, the coefficients of the contrast are the
Acknowledgements
The authors would like to thank Sarah Zhang, Xiaohong Gao, and four anonymous reviewers for their comments and suggestions.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
