Abstract
Although validity is acknowledged as the most fundamental consideration in psychological and educational testing, there is disagreement stemming from two different approaches: the ontological vs. the epistemological. This article introduces a flexible approach to test validity based on context-specific constructs, which can accommodate the two existing approaches conceptually. Based on the test foundation and rationale, a unique construct domain can be defined and measured by one test score given that the score meaning is valid. With the separation of the meaning and use validities and different types of score use (i.e., basic, extended, and joint), the proposed approach offers great flexibility for test validation. However, this paper only provides a skeleton for the flexible approach, and much more work is needed.
Although there is consensus that validity is the most fundamental consideration in psychological and educational testing, there is disagreement regarding what test validity is and how validation should proceed (e.g., Borsboom, Mellenbergh, & van Heerden, 2004; Cizek, 2012; Kane, 2001; Lissitz & Samuelsen, 2007). This disagreement stems from two different philosophical approaches (Hood, 2009): ontological vs. epistemological. Whereas the ontological approach has a long history that dates back to the early twentieth century (e.g., Kelley, 1927), the epistemological one is contemporary and represents the mainstream in the current testing industry (AERA, APA, & NCME, 1999, 2014). Hood (2009) related both approaches to scientific realism in the philosophy of science and emphasized that they can potentially complement each other. In recent conceptualizations of validity, one can also see attempts to draw ideas from both approaches (e.g., Cizek, 2012; Lissitz & Samuelsen, 2007). From the perspective of social and behavioral science, however, the two approaches represent two ends of the context-dependency continuum in the domain of observation, ranging from context-independent to context-inseparable. This article presents a flexible approach to test validity based on the context-specific construct, which can accommodate the two existing approaches conceptually. Note that while the terms construct and attribute are often used under the epistemological and ontological approaches, respectively, they are treated as equivalent or similar in this paper. The proposed approach provides, among other benefits, (a) unique score meaning and corresponding meaning validity within a specific context, (b) the differentiation of score meaning and score use, and (c) greater flexibility of test validation, especially for tests with ambitious scope and purpose. In the rest of this paper, the two existing approaches will be described and compared, followed by a detailed introduction of the proposed approach.
The article will conclude with a discussion on what can be done to make the philosophical approach concrete and practical in future work. Note that the accounts of the existing approaches in this paper substantially differ from those in Hood (2009) in two ways: (a) Hood emphasized the semantic, metaphysical, and epistemic aspects while this paper focuses on the psychometric and methodological comparisons and (b) Hood emphasized the complementary potential while this paper focuses on the differences between the two approaches.
The ontological approach
In the traditional or ontological approach, validity is the property of the test or test score; that is, a test is valid if it measures what it purports to measure (Cattell, 1946; Kelley, 1927). According to Borsboom et al. (2004), in this conception of test validity, the psychological attribute that the test purports to measure exists, and its variations causally produce variations in the test outcome (i.e., test score). The kernel of this ontological approach is the theoretically grounded attribute, that is, its theoretical existence without reference to any contextual factor and its causation to behavioral responses. With this understanding, one can establish a unique and fixed meaning for the test score without reference to a nomological network (Cronbach & Meehl, 1955), making score interpretation straightforward. When the test score is a valid measure of the psychological attribute, it implies both the theoretical existence of the attribute and its causality to the test score.
Test validity can be regarded as an ontological concept, like truth, and represents an ideal situation, whereas validation is the methodological process, like hypothesis testing, used to determine validity (Borsboom et al., 2004). Essentially, test validity is research validity, and validating a test is similar to validating research, both of which are confirmatory in nature. This conceptualization has additional implications: (a) test validity is substantially different from test validation, and the way that validation proceeds is of less concern given that the validity of a test is warranted; (b) there is no universal framework for test validation, as there are few common characteristics of the methodological processes across the vast world of measurement; and (c) test validity concerns only score interpretation (i.e., intended score meaning), not score use. From this perspective, psychometrics is treated as a natural extension of physical measurement, with psychological attributes (e.g., intelligence, emotion, and attitude) understood as analogous to physical attributes (e.g., length, temperature, and pressure).
This approach to test validity is elegant and concise but increasingly unpopular in psychological and educational testing. The first issue, with regard to practice, concerns the strong theoretical or empirical support required for the existence of the attribute and its causation to the behavioral responses; such support is logically prior to the testing process and serves as the test foundation. Although such support is widely available in the physical sciences, it is much less common in the social and behavioral sciences. The second issue is that, although some attributes can be largely context-independent, many others are intertwined with contextual factors such as age, place, or language. Accordingly, score meaning may not be independently defined and validated. Last, since test validity is not different from other research validity and there is no universal way to guide the validation process, there is little useful information to inform test construction and development.
The epistemological approach
In the epistemological approach, validity refers to the degree to which evidence, either empirical or theoretical, supports the interpretation and use of the test score (AERA et al., 1999, 2014; Kane, 2006; Messick, 1989). Validity is on a continuous (i.e., the degree of support) rather than a dichotomous (i.e., valid or invalid) scale (Zumbo, 2007). Score interpretation and use, as well as their social consequences, are a concern of validity. There can be multiple ways to interpret or use the same test score, given each is appropriately supported. The kernel of this conception is “an integrated evaluative judgment” (Messick, 1989, p. 13) of the adequacy of evidence and consequences, which brings methodology to center stage. As a result, the ontological question, “what is validity” is mixed with the methodological question, “how to validate” (Borsboom et al., 2004; Cizek, 2012). In fact, the terms validity and validation are often used similarly or interchangeably in the literature of this approach (e.g., Cronbach, 1988; Kane, 1992, 2006, 2013).
Moreover, validity or validation is considered unitary rather than fragmented under the concept of construct validity (Loevinger, 1957), whereby different kinds of evidence and social consequences are summarized and judged in an integrative way (Messick, 1989). Since construct is defined as “the concept or characteristics that a test is designed to measure” (AERA et al., 2014, p. 11), construct validity is test validity per se. However, the construct may not be theoretically grounded or causally bring about the test score. Instead, it can be considered as shorthand for regularities or patterns in the domain of observation, and score interpretation in terms of the construct can be circular (Kane, 2006). Construct or score interpretations are inseparable from the test purposes and the circumstances of the observable behavior (e.g., age, place). There are various types of validity evidence (i.e., content, response process, internal structure, relations to external variables, and consequences), which need to be integrated under the concept of construct validity. Such an approach makes it both necessary and possible for a unified framework to guide the validation process across various situations. Some researchers have further advocated the logic of evaluation argument as the basis of a unified framework and have proposed an argument-based framework for test validation (Cronbach, 1988; Kane, 1992, 2006). Through an “interpretation/use argument” (Kane, 2013), the framework provides an overall evaluation of score interpretation and use under construct validity.
Although pragmatic and useful for informing test development, this epistemological approach also raises certain concerns. The first concern is the lack of causation to the test score. When causality is excluded from validity, the approach runs the risk of slipping into the so-called weak program, where any evidence connected to the test score is considered relevant to validity, making validation purely empirical and score interpretation exploratory. The second concern is the lack of a clear ontological concept and theoretical rationale in construct validity. Hence validity becomes very open-ended, without a clear boundary as to where to start or stop. Moreover, the various types of validity evidence can overlap with each other. As a result, it is hard to conclude how much evidence one should accumulate to determine the degree of validity. In addition, it is difficult to tell what the critical problem or missing link might be when validity is low. The third concern is that different types of evidence are qualitatively different and can be incompatible under the unitary concept of construct validity (Cizek, 2012).
The flexible approach
The context-dependency continuum
Consistent with both existing approaches, the construct can be considered as what a test purports or is designed to measure. Accordingly, it is critical to position the role of the construct clearly in the conception of test validity. The constructs in the above two approaches can be positioned at two ends of the context-dependency continuum, ranging from context-independent to context-inseparable. On the independent end, the construct theoretically exists without reference to any contextual factor. This implies that valid measures of the construct will be reliable and meaningful independent of the circumstance. Accordingly, score meaning based on the theoretically grounded construct can be independently defined and validated regardless of the circumstance, and hence validity is a property of the test score. Moreover, the test purpose or intended uses of the test score are independent of the score meaning. On the inseparable end, the construct can be considered as merely shorthand for patterns or regularities of observable behavior in the domain of observation, and is inseparable from contextual factors. Construct or score meaning is partially shaped by the intended use of the test score. This implies that score interpretation is inseparable from the circumstances of intended use, and valid measures of the construct need to take into account the test purpose or score use. Accordingly, validity is a property of score inferences based on the intended interpretation and use, resulting in a unitary method of validation.
Context-specific construct
This paper introduces a flexible approach that positions the construct as context-specific. A construct is context-specific when it can be uniquely defined and measured by one test score within the boundary of a specific context. The boundary is shaped by the test foundation and rationale. The test foundation is the substantive content of the construct that is theoretically grounded and causally related to the test score. In contrast, the test rationale comes from legal, professional, educational, or clinical regulations or requirements which inform the necessity of the test score. For constructs with stronger theoretical support (e.g., executive functions, intelligence), the foundation can dominate the shaping of the boundary. For those with less theoretical support (e.g., accounting proficiency, math achievement), the rationale can prevail. That is, valid measures of the construct will be reliable and meaningful within the context. Score meaning can be uniquely defined and validated given that the test foundation and rationale are legitimate. From this perspective, one can argue that test validity is conditional.
The flexible approach can cover both ends of the context-dependency continuum effectively. Note that although the unobservable constructs play an important role in testing, the building blocks of the social and behavioral sciences are possible behaviors or the domain of observation. In the domain of observation, most constructs that a test is designed to measure can be operationalized uniquely under specific contexts, and both the test foundation and rationale can work together to shape the contextual boundary. This is still true on the inseparable end, given that the test rationale is legitimate. For instance, accounting certification tests can be based on accounting regulations in specific regions, which necessitate a certain degree of accounting proficiency in order to conduct a qualified practice. In this example, region and language can be appropriate contextual factors to help define accounting proficiency operationally. Even on the independent end, one can always extend the boundary to be context-independent eventually.
Validity of score meaning
To define a context-specific construct with unique score meaning, three elements need to be considered to form a unique construct domain: content, population, and scaling (Figure 1). All elements in the construct domain are based on the test foundation (i.e., causality) and rationale (i.e., necessity). The content element endows the construct domain with substantive content, operational definition, behavioral patterns, or regularities. In large-scale assessments, the content might be further partitioned into a finer grain size. For the score meaning to be valid, however, the internal structure or dimensionality of the content must be reflected by that of the test score. The population element identifies the target population with demographic or situational features (e.g., age, time, place, language). If the test foundation is more theoretically grounded, the population will be more general or less situational and vice versa. The scaling element informs the nature of scale of the construct (i.e., continuous vs. categorical), and the corresponding frame of reference for the score meaning (e.g., norm- or criterion-referenced). In the case of continuous or norm-referenced scaling, the normative group is implicitly defined by the population. In the case of categorical or criterion-referenced scaling, categorical or criterion levels need to be qualitatively prescribed and are external to the population. For instance, the score meaning of a math ability test can be referenced to external criteria defined as mastery, partial mastery, and nonmastery. When the structure of the test score is multidimensional and the scaling is criterion-referenced, the criterion levels for each dimension may need to be prescribed individually. With these three elements, the construct domain can be uniquely defined and corresponding score meaning can be validated.

Figure 1. Schematic diagram of the flexible approach.
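The two frames of reference named under the scaling element can be illustrated with a minimal sketch. It assumes a hypothetical normative group and hypothetical cut points for the mastery levels; none of these values, nor the function names, come from this paper. The same raw score is given a norm-referenced meaning (its standing within the population) and a criterion-referenced meaning (an externally prescribed level).

```python
from bisect import bisect_right


def percentile_rank(score, norm_scores):
    """Norm-referenced meaning: percentage of the normative group
    scoring at or below `score`."""
    ordered = sorted(norm_scores)
    return 100.0 * bisect_right(ordered, score) / len(ordered)


def criterion_level(score, cut_points, labels):
    """Criterion-referenced meaning: the prescribed level that contains
    `score`. Each cut point is the lowest score of the next level up."""
    return labels[bisect_right(cut_points, score)]


# Hypothetical normative group drawn from the target population.
norm_group = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]

# Hypothetical criterion levels, prescribed externally to the population.
labels = ["nonmastery", "partial mastery", "mastery"]
cut_points = [20, 30]

score = 25
print(percentile_rank(score, norm_group))          # 60.0
print(criterion_level(score, cut_points, labels))  # partial mastery
```

Changing the normative group alters the percentile rank but leaves the criterion level untouched, which mirrors the point that criterion levels are external to the population.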
Validity of score use
The validity of score meaning and that of score use are substantially differentiated, as every test score should be defined once (i.e., unique construct domain) but can be used in different ways (e.g., to make predictions or decisions; to diagnose). Test validity consists of meaning and use validities. Moreover, meaning validity is a prerequisite of use validity because a test score has to be meaningful before it can be useful. When score use depends solely on score meaning, no additional variable or factor is involved and this is considered the basic use. In this case, the use validity is essentially the meaning validity. Whenever any external variable or factor is involved, it constitutes an extended use and additional effort is required to validate the score use. When the test score is used to make decisions, for instance, one can distinguish between the criteria-driven and selection-driven cases. The former case relies on a cutoff based solely on score meaning and can be regarded as the basic use. Real-life examples are test-based professional or educational certificates. In the latter case, the cutoff is driven by external variables (e.g., gender, race) or factors such as the number or proportion of examinees that one can admit, and should be considered as an extended use. Real-life examples are test-based employment or educational admissions.
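The distinction between criteria-driven and selection-driven decisions can be made concrete with a minimal sketch. The examinee labels, scores, cutoff, and number of seats are hypothetical, as are the function names. In the first function the cutoff is fixed by score meaning alone (basic use); in the second the effective cutoff depends on an external factor, here the number of examinees that can be admitted (extended use).

```python
def criterion_driven_pass(scores, cutoff):
    """Basic use: pass/fail depends only on the score and a cutoff
    derived from score meaning."""
    return {name: s >= cutoff for name, s in scores.items()}


def selection_driven_admit(scores, n_seats):
    """Extended use: admission depends on an external factor, the
    number of available seats, so the effective cutoff varies with
    the applicant pool. Ties are broken by input order."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    admitted = {name for name, _ in ranked[:n_seats]}
    return {name: name in admitted for name in scores}


# Hypothetical examinees and scores.
scores = {"A": 82, "B": 74, "C": 69, "D": 91}

print(criterion_driven_pass(scores, cutoff=70))
# {'A': True, 'B': True, 'C': False, 'D': True}
print(selection_driven_admit(scores, n_seats=2))
# {'A': True, 'B': False, 'C': False, 'D': True}
```

Examinee B passes under the fixed cutoff but is not admitted when only two seats exist. The score meaning is the same in both cases, and only the use differs, which is why the second case calls for additional validation effort.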
When multiple test scores, each with its own construct domain, are used together, it constitutes a joint use. In this case, the meaning of each test score is validated individually, whereas their use is validated jointly. This provides an alternative approach to test validation: instead of validating different substantive contents with one construct domain (i.e., one test score), they can be divided first into multiple domains for validation of score meaning and then validated together for joint use. It can be especially helpful when the test foundation is not theoretically homogeneous across different substantive contents, which is not uncommon in large-scale assessments with ambitious purpose and scope that cover a wide range of contents. In language assessments, for instance, the language construct with one test score often consists of different skills (e.g., reading, writing, listening, and speaking), the uses of which are validated in a unitary way under the contemporary approach (Chapelle, Enright, & Jamieson, 2010). With the flexible approach, one can validate the score meaning of each skill test first before validating the joint use of multiple test scores. Considering that the theoretical foundations for the skills differ substantially, the flexible approach would be more reasonable.
Discussion
This paper introduces a flexible approach to test validity based on a context-specific construct, which can accommodate the two existing approaches conceptually. Based on the test foundation and rationale, a unique construct domain can be defined and measured by one test score given that the score meaning is valid. With the separation of the meaning and use validities and the three types of score use (i.e., basic, extended, and joint), the proposed approach offers great flexibility for test validation. Note that, more often than not, a test score is no longer used alone. Accordingly, a mix of extended and joint use can provide even more flexibility. In addition, one should distinguish between the validity and the consequences of score use. In brief, the use validity is essentially scientific, whereas the consequences of score use are largely value-laden.
Although a skeleton of the proposed approach is introduced in this article, much more work is needed for the approach to be full-fledged. In addition to finalizing the theoretical and conceptual details, more effort should be directed towards the practical and methodological elements, which are largely missing in this paper. Specifically, one can examine the possibility for a general framework to guide the validation process, which may explain the types of theoretical or empirical evidence needed and how they should be organized. Moreover, one can explore how to inform test development and to distinguish between test validation and regular research validation under the framework. Besides, it is desirable to assess ways to incorporate contextual factors into the construct domain. Among all contextual factors, time is the most influential, as most behavior patterns change substantially over time. If the time factor can be efficiently incorporated, the test score can be redefined periodically depending on how behavior patterns evolve. The redefining of score meaning is not gradual but, rather, episodic in a way similar to the paradigm shift of scientific progress (Kuhn, 1962). In this sense, one can validate the shift of score meaning similarly to a micro version of scientific progress.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Open Fund of the National Higher Education Quality Monitoring Data Center (Guangzhou), Grant No. M1601.
