Abstract
Goal attainment scaling (GAS) is an internationally recognized measure that is widely used in educational, counseling, and clinical settings to identify and evaluate relevant goals for an individual. The GAS is an unusual measure because its content, which consists of goals, is formed by the respondent and/or users in the process of completing the GAS. Using the unified view of validity as a guiding framework, this systematic review examines validation practices and how goals are represented in this measure. This review demonstrates that validation practices tend to focus on aspects that do not support the overall construct validity of the measure and tend to refer to validity as a property of the GAS measure or GAS scores. Several gaps in validity evidence and the various ways goals are conceptualized are described and discussed. The varying ways goals are considered suggest clarity is needed to enhance explanations and score meaning. This review urges researchers to consider ways validity and validation evidence can help verify the many claims that are made about this measure. Future validity research needs to consider application of a theoretical framework and response processes as key aspects of substantiating the construct measured by the GAS.
Introduction
Goal attainment scaling (GAS; Kiresuk & Sherman, 1968) is an internationally recognized measure that is used across many disciplines to identify and evaluate relevant goals for an individual (Kiresuk, Smith, & Cardillo, 1994). Although GAS originates from counseling and clinical settings, it is increasingly being used in educational settings, as goals and goal-setting are highly relevant to educational contexts and educational assessment (Kiresuk et al., 1994). The GAS is unlike typical measures because it lacks fixed content; that is, it consists of goals formed by the respondent and/or users in the process of completing the measure. The nonstandard format of the GAS has not deterred investigations of its measurement properties, such as validity and reliability evidence. This systematic review aims to understand validation practices for the GAS; specifically, how validity evidence is gathered and reported.
The GAS is often endorsed as an “individualized” measure because it allows users to develop and set personalized goals. At the time the GAS was first developed, it was used to specify individual goals for patients in a community mental health program. During this initial use, selection of goals for the GAS involved a committee or “goal selector” (p. 446) who determined a set of realistic goals for the patient, and then graded and scaled goals according to likely treatment outcomes (Kiresuk & Sherman, 1968). Later descriptions of the GAS modified this condition so goals were set either individually or collaboratively between a student and teacher, or client and practitioner (Kiresuk et al., 1994). Once goals are set, they are scaled to identify variations in goal attainment, typically by individuals who have knowledge of the treatment or intervention (e.g., a practitioner or teacher). Scaling a goal involves identifying variations in goal attainment that indicate movement above or below treatment or intervention expectations. The GAS measure assumes that users will bring relevant or prior knowledge of the treatment or intervention to determine what goals are realistic, and to grade and scale these goals appropriately. Therefore, using the GAS involves (a) assessing an individual’s (e.g., student or patient’s) skill level in a particular problem area, (b) developing and scaling a goal that is the intended result of a treatment or intervention, and (c) later scoring the goal based on perceived change. Altogether, the GAS is a unique measure, and a striking feature is that it has “no fixed content” (Kiresuk et al., 1994, p. 167), as users of the GAS determine both the goals and their scaling. Given that this measure lacks fixed content and has been used for many years, examining validation practices will provide insight into how validity information is gathered for this unique measure.
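The three steps described above (assessing, scaling, and later scoring a goal) can be sketched in code. The following is a minimal illustration and not a method taken from the reviewed studies. It assumes the five-level attainment scale commonly used with the GAS (-2 to +2, with 0 as the expected outcome) and the summary T-score formula commonly attributed to Kiresuk and Sherman (1968), with an assumed inter-goal correlation of 0.3. The class name, field names, and example goal are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List


@dataclass
class ScaledGoal:
    """One user-defined goal; all content is supplied by the GAS users."""
    description: str
    levels: Dict[int, str]  # attainment level (-2..+2) -> observable outcome
    weight: float = 1.0     # relative importance (assumed default)
    attained: int = 0       # level scored at follow-up; 0 = expected outcome


def gas_t_score(goals: List[ScaledGoal], rho: float = 0.3) -> float:
    """Summary T-score; rho is the assumed correlation among goal scores."""
    weighted_sum = sum(g.weight * g.attained for g in goals)
    sum_w = sum(g.weight for g in goals)
    sum_w_sq = sum(g.weight ** 2 for g in goals)
    return 50 + 10 * weighted_sum / sqrt((1 - rho) * sum_w_sq + rho * sum_w ** 2)


# Hypothetical goal, invented for illustration only
reading = ScaledGoal(
    description="Reads independently during class",
    levels={
        -2: "Much less than expected",
        -1: "Somewhat less than expected",
        0: "Expected level of attainment",
        1: "Somewhat more than expected",
        2: "Much more than expected",
    },
    attained=1,
)
score = gas_t_score([reading])
```

Under these assumptions, a single goal scored at the expected level yields a T-score of 50, and each level above or below expectation moves the score by 10 points. Because the level descriptions are written by users, the code fixes only the scale structure, which mirrors the "no fixed content" feature discussed above.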
Validity is defined as the justifications or explanations for variations in scores on a measure, and validation is the process of acquiring that information (Zumbo, 2009). Validity information provides evidence related to the content of tools that are used to measure phenomena, as well as the interpretations and inferences that are made from their scores. While validity provides critical information about measures, it has also been reported that validation practices are inconsistent, and that there is a disconnect between the practice of validation and validity theory (Shankar, Miller, Roberson, & Hubley, 2019; Zumbo & Chan, 2014). Previous evidence has noted an imbalance in validity evidence presented and a lack of explicit reference to a validity framework (Shear & Zumbo, 2014). As the meaning and language surrounding validity have changed over the years, this may also influence how validity evidence for the GAS is collected and reported since the original publication over 50 years ago (Kiresuk & Sherman, 1968). Common approaches for talking about validity include more than one view: modern (unified) validity theory and traditional (Trinitarian) validity theory (Guion, 1980; Newton & Shaw, 2013). The unified perspective was originally described by Cronbach and Meehl in 1955, and has evolved into a view that is currently endorsed by the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). This view of validity includes several sources of evidence, and the Standards identifies five sources, which are test content, response processes, internal structure, relations to other variables, and consequences (AERA et al., 2014). Within the unified view, validity came to be seen as centered around the construct, with the sources of evidence all contributing to the “whole of validity” evidence (Loevinger, 1957, p. 636), and the importance of building a nomological network for interpretations of scores (Cronbach & Meehl, 1955; Hubley & Zumbo, 2011). The notion of the construct lies at the core of this view, whereby the term construct describes an unobserved concept or behavior that can be operationalized through a measurement process. In contrast, the assumption in the Trinitarian view is that validity exists as different “types,” and this view sees validity as a property of a measure, so measures either do or do not have validity (Hubley & Zumbo, 1996). The tripartite view of validity has evolved and developed toward a more comprehensive view that considers validity as an integrative evaluative judgment, with validation as an ongoing process (Hubley & Zumbo, 2011; Messick, 1995). The unified view considers different types of validity (as historically considered), such as content and criterion related, as subsumed under construct validity (Messick, 1989b). This review uses the unified view of validity as a guiding framework for studying validation practices by recognizing that all validity evidence contributes toward an understanding of the construct. The unified view is also recognized by developers of the GAS (Kiresuk et al., 1994), who draw attention to the unified view of validity and discuss the importance of construct validation.
Of the sources of validity evidence identified in the Standards, test content and response processes are, arguably, foundational to the initial development and verification of a measurement instrument (AERA et al., 2014). Content-related evidence evaluates how well the content in the instrument represents the construct it is intending to measure (AERA et al., 2014; Haynes, Richard, & Kubany, 1995; Sireci, 1998), and one can think broadly of response processes as, “the mechanisms that underlie what people do, think, or feel when interacting with, and responding to, the item or task and are responsible for generating observed test score variation” (Zumbo & Hubley, 2017, p. 2), which in the case of the GAS is connected to a goal set by the user. Evidence based on test content and response processes are complementary in their objectives and in their descriptions (Padilla & Benítez, 2014). They both evaluate the representativeness of a measure and its elements in relation to the construct by evaluating response consistency (Messick, 1975; Vogt, King, & King, 2004). Together, these elements contribute toward an understanding of the meaning behind the GAS score (Messick, 1989a), and in particular what aspect of the goal construct the GAS intends to measure and how the GAS is interpreted among users.
Although the GAS is a measure that has variable content, some researchers have argued that evidence based on test content is a prerequisite for establishing other validity evidence (Vogt et al., 2004). Content-related evidence can be defined as how a test is related to the content it is intended to measure, as well as the degree to which a measure represents a specific construct for a certain assessment purpose (AERA et al., 2014; Haynes et al., 1995). When evidence based on test content is obtained, the content domain for a measure is evaluated, and feedback is received—and it is this process that justifies the content on the test, thereby judging the overall quality of a test (Sireci, 1998). Typically, test content evidence applies to the development and revision of instrument items, and the process includes specification of the construct of interest, review of test content, and consultation with experts (Haynes et al., 1995; Vogt et al., 2004). Since the GAS does not have a defined “universe of items” (Cronbach & Meehl, 1955, p. 282) and instead relies on a universe of goals set by users, evaluating construct validity and specifying the construct of interest becomes essential to understand its content and adequately define what one is trying to measure. It has been suggested by Kiresuk et al. (1994) that various types of evaluation are limited by the GAS format. They contend that the final score in the GAS provides information about validity, which can be directly interpreted to evaluate change given certain assumptions are met. One of these assumptions is that content-related evidence is “adequate” (Kiresuk et al., 1994, p. 245). As the content in the GAS is formed by users, evaluating test content must focus on the construct the GAS intends to measure to provide support and justification for subsequent score interpretations (Kane, 2006). 
Examining both the definition and operationalization of the construct are activities that gather content evidence in support of a measure’s construct validity (Sireci, 1998; Vogt et al., 2004). Therefore, understanding what goal construct is being identified will highlight how the construct in the GAS is being operationalized and provide validity evidence toward the meaning of its score (Anastasi, 1986; Messick, 1980; Tenopyr, 1977).
Correspondingly, response process evidence examines the congruence between the construct and individual processes through examination of both theory and empirical analyses (AERA et al., 2014; Messick, 1989a). Response processes have traditionally been investigated using cognitive processing methods, such as think-aloud or cognitive interviews (Padilla & Leighton, 2017). Examination of response processes has expanded to include aspects such as one’s behavior and motivations, to more fully understand what one is thinking or feeling as they interact with a measure (Leighton, Tang, & Guo, 2017). As well, aspects of emotion can be examined through various expressions (Leighton et al., 2017). In all, evidence based on response processes systematically assesses how respondents understand and process aspects of the construct that is measured by the GAS and can draw connections between the construct and individual responses. Along with evidence based on test content, these sources of validity evidence can provide justification for the goal construct the GAS purports to measure.
Synthesizing validity evidence provides a coherent account of the evidence supporting or disconfirming the intended interpretations from scores (O’Leary, Hattie, & Griffin, 2017). To appraise the way in which validity evidence is assembled, this review focuses on test content and response processes to examine representation of the goal construct measured by the GAS, and move beyond the individual test-takers’ behaviors that are traditionally used in validation research, toward an explanation-focused view of validity (Zumbo, 2017a, 2017b). By applying this position, this systematic review examines validation practices for the GAS by collecting information about how validity evidence has been reported over the period 1970 to 2018. Specifically, the purpose of this systematic review is to investigate how validity evidence for the GAS is assembled and then examine the available validity and reliability evidence.
Method
Search Strategy
The following six databases were searched for relevant research articles: PubMed, Embase, Cumulative Index to Nursing and Allied Health Literature (CINAHL), ERIC, PsycINFO, and Cochrane Database of Systematic Reviews. Library databases were examined using the search criteria: (a) keyword “goal attainment scaling” combined with (b) valid* (*denotes truncation to search for variations in the word). Peer-reviewed articles, written in English, published since January 1970 that describe the use of GAS with any human sample were selected. Articles over several decades were searched to gather all available literature on validation practices with the GAS. Reference lists of articles that met full-text inclusion criteria were reviewed to determine whether any additional articles should be retrieved.
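The effect of combining the keyword with the truncated term can be illustrated with a short sketch. The code below is an approximation under stated assumptions: it treats valid* as a prefix match at a word boundary and applies it to invented record titles. Actual database search syntax and matching behavior differ by platform, and the function and variable names are hypothetical.

```python
import re

# Prefix match approximating the truncated term valid*
VALID_STEM = re.compile(r"\bvalid\w*", re.IGNORECASE)
# Exact phrase match for the keyword
PHRASE = re.compile(r"goal attainment scaling", re.IGNORECASE)


def matches_search(text: str) -> bool:
    """True when the text contains the phrase AND a word starting with 'valid'."""
    return bool(PHRASE.search(text)) and bool(VALID_STEM.search(text))


# Hypothetical records, invented for illustration only
records = [
    "Validation of goal attainment scaling in a school sample",
    "Goal attainment scaling: reliability and validity",
    "Goal attainment scaling as an outcome measure",
    "Validity of a pain questionnaire",
]
hits = [r for r in records if matches_search(r)]
```

In this sketch, only the first two records satisfy both criteria. The third lacks any word beginning with "valid," and the fourth lacks the keyword phrase.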
Eligibility Criteria
For inclusion in this review, articles were reviewed in a step-wise manner by two reviewers, once duplicates were removed: (a) titles and abstracts were screened, followed by (b) a review of the full text to code and select final articles. Titles and abstracts were initially screened by the first author (S.S) and a second reviewer (S.K.M) to determine whether the article identified use of the GAS as a measurement tool and whether the abstract identified measurement properties of the GAS. As valid* was already searched, if any measurement properties were mentioned, the terms validity or validation would be contained in the full text. Altogether, this process was liberal and tended toward inclusion rather than exclusion. The focus was to include all articles and examine how validity was conceptualized and examined. Furthermore, experimental studies, reviews, and commentaries were all included to examine how validity evidence for the GAS was both investigated and interpreted. As a final screen, the first author (S.S) reviewed the full text before selecting final articles; in this last step, all articles were coded to examine how validation evidence was investigated or described in each corresponding article. The second reviewer (S.K.M) reviewed 20% of full-text articles to verify the coding and the article selection process, and to obtain a macro-level sense of the similarity of ratings between the two raters. Final articles were selected if the coding process verified validity evidence was examined or reviewed with the GAS. Only studies that examined validity as it pertained to measurement properties of the GAS were included, and articles that discussed social validity, ecological validity, or treatment validity were excluded. Furthermore, articles that only mentioned the GAS as “valid” but did not describe specifics about the validation process were excluded. 
Once data were read and coded, only those articles that described measuring validity or providing validity information for the GAS instrument were included. Figure 1 outlines details about article selection and reasons for exclusion in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol (Moher, Liberati, Tetzlaff, Altman, & PRISMA Group, 2009).

Figure 1. PRISMA flowchart for systematic review.
Coding
A data abstraction and coding form was developed by authors to gather validity and validation information from screened articles. A coding sheet that included two sections was used to assess the available validity evidence. Section A collected descriptive information about articles and sample characteristics, and noted which articles identified as a review by their synthesis of literature, and Section B collected information about validity and validation evidence.
Using a coding process similar to that of Chan et al. (2014) and Cizek, Rosenberg, and Koons (2008), Section B was organized using the following scheme and the five sources identified in the Standards (AERA et al., 2014): (a) test content, (b) relations with other variables, (c) internal structure, (d) consequences, and (e) response processes (see the Appendix for a sample of the coding sheet and definitions). As well, an “other validity” category was included to account for validity sources not described in the Standards (AERA et al., 2014), and reliability was also included, since reliability is a necessary condition for validity (Hubley & Zumbo, 2013). In particular, the type of reliability estimate (internal consistency, test–retest, or inter-rater) was noted to make a comparison with the amount of validity evidence that was sought and reported. Furthermore, to understand validity perspectives, this review coded for the following: (a) whether articles mentioned a unitary perspective of validity and (b) whether articles stated that validity was a property of the test or a property of test scores.
Data Analysis
Validity evidence that was gathered from coded articles was evaluated using a combination of inductive and deductive content analytic approaches (Elo & Kyngäs, 2008; Hsieh & Shannon, 2005; Mayring, 2000). Gathering descriptive information invoked a deductive approach, as information reported in research articles was noted. A deductive content analytic approach was used to collect information on all sources of validity evidence. For instance, gathering construct definitions provided evidence based on test content, and evaluating the use of theory and systematic testing of individual processes provided evidence for response processes. A deductive content analytic approach was also used when information regarding validation practices was applied to coding. Evaluating sources of validity evidence and validity perspectives included a combination of approaches; a deductive approach was used to collect information that was reported, and an inductive approach was used to interpret this information.
Inter-Rater Agreement on Coded Studies
The first author and second reviewer were in agreement for 50 of 60 data points for these 37 studies, and this represented 83.3% agreement. Any differences were discussed between the first and second rater to reach consensus before assigning codes to cases.
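The agreement figure above is a simple percent agreement, which can be reproduced as follows. This is a minimal sketch; the function names are hypothetical, and the paired-codes helper assumes each data point received exactly one code from each rater.

```python
from typing import Sequence


def percent_agreement(agreements: int, total: int) -> float:
    """Share of data points on which both raters assigned the same code."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * agreements / total


def percent_agreement_from_codes(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Same statistic, computed from two equal-length sequences of codes."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must code the same data points")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return percent_agreement(matches, len(rater_a))


# Figures reported in the text: 50 agreements out of 60 data points
reported = round(percent_agreement(50, 60), 1)  # 83.3
```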
Results
This review identified a total of 115 articles once duplicates were removed. After abstracts were screened and full text reviewed, a total of 37 articles were selected as examining or reporting validity evidence for the GAS (Figure 1). Of the selected articles, 10 identified themselves as a review and synthesized literature. Selected articles examined the GAS for a variety of reasons related to validity or validation, such as to gather information about validity and/or other measurement properties, examine the feasibility or utility of the GAS in certain settings, use GAS as an outcome measure, and review the GAS by itself or in comparison with other measures. The GAS was used with individuals who were patients or students. The samples included children, youth, adults, and older adults, and in one article the sample was nonspecific (Table 1).
Table 1. Article Description.
Reporting of Validity Evidence
The majority of articles in this review reported on or examined relations to other variables (89.2%), followed by evidence based on test content (51.4%; Table 2). Evidence based on consequences was reported in one article (2.7%), and no article reported information about internal structure or response processes.
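The reported percentages can be related back to the 37 included articles with a short calculation. The sketch below is illustrative only: the article counts are inferred from the reported percentages and are not stated in the text, so Table 2 remains the authoritative source. The function name is hypothetical.

```python
N_ARTICLES = 37  # number of included articles reported in the Results


def implied_count(percent: float, n_articles: int = N_ARTICLES) -> int:
    """Nearest whole number of articles consistent with a reported percentage."""
    return round(percent / 100 * n_articles)


# Percentages as reported in the text
reported_percent = {
    "relations to other variables": 89.2,
    "test content": 51.4,
    "consequences": 2.7,
}

# Inferred counts (an assumption, not figures taken from the article)
inferred = {source: implied_count(p) for source, p in reported_percent.items()}

# Round-trip check: each inferred count reproduces the reported percentage
consistent = all(
    round(100 * inferred[s] / N_ARTICLES, 1) == p
    for s, p in reported_percent.items()
)
```

A round-trip check of this kind is a quick way to confirm that reported percentages are consistent with a stated denominator.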
Table 2. Frequencies of Sources of Validity Evidence Reported.
Construct representation through test content and response processes
To gather information about evidence based on test content, we reviewed articles to see whether experts were consulted, a construct was identified, and a corresponding construct definition was provided. Content was evaluated by examining goal domains, agreement between experts, comparing goals on the GAS to the content from other reports or assessments, and expert opinion. Expert panels included practitioners, patients, students, family, team members, or individuals doing intake. Although evidence based on test content included some expert consultation, the construct measured by the GAS was not clearly identified. In several articles, there was no clear construct that the GAS was identified as measuring, and articles identified the GAS in two predominant ways (Table 3): (a) solely as a measure of goals (40.5%) or (b) as both a measure of goals and its own method or measurement technique (56.7%). One article stated the GAS construct lacked clarity and was nonspecific. While several articles mentioned a theory (13.5%) and one article identified a specific goal theory, no articles actually used a theoretical approach.
Table 3. Description of the GAS.
Note. GAS = goal attainment scaling.
To examine evidence based on response processes, this review evaluated whether theory was used to guide application of the GAS, and also whether response processes were empirically tested. Among the articles that mentioned a theory, only one article (i.e., Hurn, Kneebone, & Cropley, 2006) identified a specific goal theory. Furthermore, results indicate that no articles reported information about systematic testing of response processes, such as cognitive, motivational, or behavioral types of processing. Nonetheless, many articles mentioned observations or reflections on how goals were set, such as through negotiation or consensus.
Internal structure, relations with other variables, and consequences
No article reported evidence based on internal structure of the GAS, which is congruent with this type of measure since it lacks fixed content. Validity evidence based on relations with other variables was reported in varying ways, such as construct, convergent, concurrent, criterion, and predictive validity, as well as the nomological network. In addition, responsiveness or sensitivity to change was reported in more than half of the articles (67.6%). As well, the only article (i.e., Rockwood, 1994) that reported considering the unintended consequences of testing advocated for the individualized nature of the GAS as guarding patients against unintended consequences of other measures. Among reviewed studies that mentioned a score and its applied purpose, almost half of all articles (48.7%) discussed that the GAS score may be interpreted as a change score.
Other Evidence Related to Validity
Numerous articles (37.8%) also identified other sources of validity evidence that were outside the criteria identified, such as face validity, external validity, internal validity, and congruent validity.
Reliability
Although all included articles discussed validity evidence of the GAS, the majority of articles also reported or examined corresponding reliability evidence (94.6%). Inter-rater reliability was reported most frequently (73.0%), followed by test–retest (16.2%) and internal consistency (13.5%).
Validation Practices
Based on the validity evidence that was gathered, most studies tended to focus on relations with other variables and reliability evidence. Validity evidence was often treated as a set of distinct types, and studies tended to gather different types of validity. No article mentioned the unitary perspective of validity or any editions of the Standards (e.g., AERA et al., 2014), and one article identified the tripartite view (i.e., Gordon, Powell, & Rockwood, 1999) as its theoretical approach to validity. Validity was discussed as either a property of the GAS measure or the GAS scores.
Discussion
This review provides a glimpse into validation practices for the GAS measure by examining how this evidence was gathered and assessing available validity evidence. The 37 articles selected for this study verified that the GAS is used in a variety of settings and with a variety of samples (Table 1).
Validity and Validation Evidence of the GAS
Validity evidence was evaluated by all studies in a number of ways. Most validity evidence tended to focus on relations to other variables, or reliability. The concentration of data in these areas is not uncommon and is similar to the findings by Zumbo and Chan (2014). They note that the high concentration of this evidence brings some difficulty interpreting evidence and building a sound validity argument. They also note limited guidance from orientations to theory, including validity theory. Altogether, validity evidence was not gathered in any systematic way and reflected a piecemeal approach to validation.
Articles approached validity by gathering varying types of validity evidence and then reasoning that the GAS was either “valid” or “not valid.” Discussing validity in this way, as a property of the measure or its scores, implies that validity is seen as a fixed or immutable quality. This idea emerged in the 1940s, when validity was conceptualized as a static property that had to be proven or established (Goodwin & Leech, 2003). Newton and Shaw (2013), in their analysis of ways validity is discussed or thought about, suggested validity was referred to as if it were a property of a test for several reasons, such as intentional misuse, lack of awareness or misunderstanding, and genuine divergence from the view of validity as a property of interpretations. Newton and Shaw (2013) also identified 122 discrete validity labels intended to capture an aspect of validity for measurement. From these labels and the results from this review, it is apparent that articles do not consider validity as the interpretations from scores on a measure. As discovered in this review, the GAS has varying interpretations.
Interpretations of the GAS
The GAS was frequently identified as both a measure of goals and also its own measurement technique. Among articles that described the GAS solely as a measure of goals, articles either specified goals were individualized, or simply referred to goals in a broad or general manner. The term goal was used by itself or referred to related aspects of the goal construct, such as goal achievement, goal attainment, or goal-setting. It was difficult to decipher how varying aspects of the goal construct were distinguished since these terms were used interchangeably in relation to the GAS.
Goal constructs identified in the GAS
The results from this review draw attention to the various ways goals may be described in the GAS and some discrepancies between how aspects of the goal construct are discussed. One discrepancy is that individualized goals or goal achievement are not equivalent to the process of goal setting. However, studies included in this review readily moved from identifying the GAS as measuring goals to the GAS as a tool for goal setting and also evaluating goal achievement. Ostensibly, these aspects or dimensions of the goal construct were all viewed somewhat synonymously.
Although similarities exist between these various aspects of the goal construct, there are important distinctions that influence what outcomes are produced from the GAS, as well as the score meaning. Of the articles included in this review, only one article (i.e., Hurn et al., 2006) identified a construct definition for a goal. As noted by Elliot and Fryer (2008), the term goal is rarely defined in research, with the assumption that all readers understand the word similarly, but the term can take on different meanings. While it is not uncommon to consider the goal construct in a parsimonious way, separating its differential aspects has a number of advantages, such as limiting the confusion surrounding the construct, improving understanding of the influence of multiple goals, and minimizing assumptions (Austin & Vancouver, 1996).
As noted by Austin and Vancouver (1996), proliferation of various aspects of the goal construct makes its examination problematic, which is evidenced in this review. Presumably, the GAS has been assumed to measure different aspects of the goal construct, and no study examining validity evidence considered their distinctions. Improving the clarity surrounding the goal construct in the GAS can help determine clear functional properties of this measure, instead of wondering whether additional constructs may account for a particular behavior (Elliot & Fryer, 2008). Likewise, a number of factors affect goal outcomes, their structure, and process (Austin & Vancouver, 1996). For instance, the activation and pursuit of goals depends on one’s conscious desires, which can influence their thoughts, emotions, and behaviors (Fishbach & Ferguson, 2007), and goal commitment is only recognized when there is an investment of affect, cognitive resources, and behavior (Mann, de Ridder, & Fujita, 2013). Moreover, goal achievement is influenced by the nature of the task and how applicable the goal is toward it (Fishbach & Ferguson, 2007), as well as the context and one’s level of control (Austin & Vancouver, 1996). With a vast amount of psychological literature surrounding the goal construct, these aspects reflect a succinct view into some of the factors associated with this construct. Certainly, from a validity and validation standpoint, the varying constructs identified suggest clarity is needed to enhance explanations and score meaning.
Evidence of theory guiding definitions
Validity is a matter of inference and a process that provides information related to the meaning of scores, which in turn provides information about an outcome of interest. Understanding what inferences are made with test scores refers back to how theory is applied to justify the claims that are made regarding these scores (Kane, 2009). The only critique included in this review wondered, “How is the construct embedded in the theory?” (Cytrynbaum, Ginath, Birdwell, & Brandt, 1979, p. 33), and the results from this review indicate that this question remains unexamined and, therefore, unverified. Theoretical perspectives about the goal construct are rarely mentioned and notably absent in applications of the GAS. Among studies reviewed, only one article (i.e., Hurn et al., 2006) specifically mentioned a goal theory (i.e., Locke, 1968), but this was not operationalized in the respective study. Another study included in this review mentioned that theory relates to something clinicians consider during the goal-setting process; however, this too was not tested (e.g., Vu & Law, 2012). Using the GAS to set goals is a complex endeavor, and theoretical rationales can assist by providing guidance in action-planning this process (Scobbie, Dixon, & Wyke, 2011). In a Cochrane review that investigated the GAS and goal setting in rehabilitation medicine, the authors noted that only one study implemented goal setting in a way that was consistent with a theory (Levack et al., 2016). The results from this systematic review suggest that future research using the GAS needs to better implement a theory to guide establishment of a definition and its application.
Given that the GAS lacks fixed content and does not have scored items like a conventional measure, this added complexity stresses the importance of construct definitions and theory to guide how this construct is operationalized. By discounting how these aspects contribute to validity evidence, we can only be certain that we are assuming the GAS measures an aspect of the goal construct, not verifying it, which has consequences for users of the GAS. There is no shortage of applicable theories that relate to the goal construct (Austin & Vancouver, 1996), and theories of behavior change can inform goal-setting interventions (Scobbie et al., 2011), such as social cognitive theory (Bandura, 1997), goal setting theory (Locke & Latham, 2002), and the health action process approach (Schwarzer, 1992), all of which could be applicable to the GAS. Indeed, from a validation perspective, strong forms of construct validity evidence include a theory that is well-articulated and tested, which helps strengthen the nomological network and provide a sound validity argument (Cronbach & Meehl, 1955; Loevinger, 1957; Messick, 1989b; Zumbo, 2009).
The Need for Evidence Based on Response Processes
In addition to theory, this systematic review also examined whether individual interactions with the GAS were tested, as an aspect of response process evidence. No articles tested response processes, including the more commonly examined cognitive processes, such as think-aloud protocols or cognitive interviews (Padilla & Leighton, 2017). Nonetheless, articles did mention aspects related to how individuals interacted with the measure and how goals were set. Articles included in this review often stated whether goals were set collaboratively or alone and how final goals were determined (e.g., through negotiation or consensus). Several articles considered some aspects related to goal setting, but no article empirically tested and verified these aspects. For instance, two articles contemplated student or patient feelings and motives (McGaghie & Menges, 1975; Stolee et al., 2012), another considered patient or family concerns (Stolee, Stadnyk, Myers, & Rockwood, 1999), and another noted that conceptualizing goal setting can be difficult for a cognitively impaired patient (Krasny-Pacini, Evans, Sohlberg, & Chevignard, 2016). As well, one article stated that goal orientations can be influenced by an individual’s motivation (Kiresuk, Lund, & Larsen, 1982), and another mentioned that precision of goals was related to reporting and how goals were identified (Milne, Robert, Tang, Drummond, & Ross, 2009). It is noteworthy that these aspects were considered; however, only testing them, by investigating individual interactions with the GAS, can provide empirical evidence to support these claims.
The GAS Score and Its Meaning
Altogether, building a validity argument is a key aspect of strengthening score interpretations. Although reliability is a part of the validity argument and provides insight into the consistency of the GAS scores, it contributes minimally to the accuracy of the findings and is not a substitute for validity (Barry, Chaney, Piazza-Gardner, & Chavarria, 2014; Zumbo & Chan, 2014). Reliability was reported in almost all reviewed articles; however, it is not enough to justify the use of the GAS score. A fundamental feature of the validity argument and integrating validity evidence within a unitary concept of construct validity is how the construct is represented (Messick, 1995).
Almost all the studies included in this review stated that the GAS score measures change, and an applied purpose of the GAS score was to produce a change score. In most cases, change was measured with respect to student or patient progress and to compare program effectiveness, for example, “program success is measured in ‘goals achieved’” (Calsyn & Davidson, 1978, p. 306). In addition, it was not uncommon for studies to interpret a particular GAS score as an evaluation of change (i.e., improvement, no change, or deterioration). Articles discussed a number of different ways the GAS evaluates change and used the terms responsiveness, sensitivity to change, and change score to discuss or denote change over time. Reporting of change was highly variable and inconsistent. In a literature review that investigated how studies of treatment effectiveness and program evaluations measure change over time, the authors found that change scores are challenging to interpret and that estimates of responsiveness are difficult to compare (Beaton, Bombardier, Katz, & Wright, 2001). Beaton et al. (2001) also noted that the same terms were used, sometimes interchangeably, and emphasized that statistics such as responsiveness are highly contextualized. Responsiveness is a term that is widely used to denote change over time or sensitivity to change (Middel & van Sonderen, 2002; Terwee, Dekker, Wiersinga, Prummel, & Bossuyt, 2003) and refers to the ability of a measure to accurately detect change when it has occurred (Beaton et al., 2001). While change scores are also considered an indicator of change over time (Thomas & Zumbo, 2012), they may also refer more generally to any difference score (Cronbach & Furby, 1970). Although the GAS produces a score that incorporates different time points, the nature of the change needs to be understood so that interpretation of the GAS score as demonstrating change is reasonable and clear.
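To make concrete why the meaning of a GAS change score depends on what the underlying goals represent, the sketch below computes the summary T-score usually attributed to Kiresuk and Sherman (1968) and a simple difference score between two time points. It is an illustration under stated assumptions, not a procedure taken from the reviewed articles: the function name `gas_t_score`, the example attainment levels, the equal goal weights, and the conventional inter-goal correlation of 0.3 are all assumed here.

```python
import math

def gas_t_score(attainment, weights=None, rho=0.3):
    """Summary GAS T-score for one respondent (illustrative sketch).

    attainment: attainment level per goal, conventionally -2..+2.
    weights:    goal weights; equal weights are assumed if omitted.
    rho:        assumed inter-goal correlation (0.3 by convention).
    """
    if not attainment:
        raise ValueError("at least one goal is required")
    if weights is None:
        weights = [1.0] * len(attainment)
    if len(weights) != len(attainment):
        raise ValueError("one weight per goal is required")
    weighted_sum = sum(w * x for w, x in zip(weights, attainment))
    denominator = math.sqrt(
        (1 - rho) * sum(w * w for w in weights) + rho * sum(weights) ** 2
    )
    return 50 + 10 * weighted_sum / denominator

# Hypothetical respondent with three goals at two time points.
baseline = gas_t_score([-1, -1, -1])
follow_up = gas_t_score([0, 1, -1])
change = follow_up - baseline  # a plain difference score

# Two very different attainment patterns share one summary score.
mixed = gas_t_score([2, -2, 0])
expected = gas_t_score([0, 0, 0])
```

In this sketch, `mixed` and `expected` are equal even though one respondent far exceeded one goal and fell far short of another. The summary score alone therefore does not say which goals changed or whether the goals themselves were revised between time points, which is the interpretive gap discussed above.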
Thus, questions about whether the GAS is a suitable measure to evaluate change hinge on clarity about its construct definition and subsequent score meaning. Before one can determine whether the GAS is a measure of change and can effectively measure change in a goal construct, it is imperative to understand participants’ responses and how they align with the goal construct. As noted by one article included in this review, “initial exposure to goal setting may have allowed the person time to reflect, thereby possibly leading to a change in the goal areas” (Rushton & Miller, 2002, p. 776). There are a number of reasons goals may change, as well as factors that influence their achievement. Although many factors relate to the goal construct, perhaps most important is a well-identified gap between goal intentions and goal behavior, demonstrated in a meta-analysis by Webb and Sheeran (2006). This discrepancy suggests that intentions do not necessarily lead to actions, and actions need facilitation (Webb & Sheeran, 2006). Indeed, activation of a goal can dissipate once a goal has been reached or if an obstacle that cannot be overcome is encountered (Fishbach & Ferguson, 2007), and activation of multiple goals can shift over time (Austin & Vancouver, 1996). Thus, if the GAS is, as Kiresuk et al. (1994) maintain, a measure of one’s “perceived ability to change” (p. 245), how does one know whether the GAS is measuring change in the identified goal or quite simply a change in goal?
Strengthening Validity Evidence and Validation Practices
Ultimately, a test score cannot be interpreted if one does not know what the test is measuring (Sireci, 2012), and as shown through variations in the goal construct, “what the test is measuring” is unclear. Of the 37 articles included, 10 identified themselves as review articles of the GAS literature; however, none appraised interpretations of the GAS or drew connections to theory. Strengthening the validity argument for the GAS requires better testing procedures to justify the goal construct measured by the GAS and verify that the GAS does measure change. As well, consequences of score interpretation and use were considered in only one article (Rockwood, 1994), and more studies need to consider the applied purpose of the GAS score. Effectively, validation requires scientific inquiry alongside a rational argument to substantiate the score interpretation and use (Messick, 1995).
The variability noted among reviewed articles highlights that the GAS has many different interpretations. There is a lack of clarity regarding how the GAS is best interpreted, what specific construct the GAS measures, and whether the GAS measures a goal construct or whether the GAS is best regarded as its own measurement technique. An advantage of the unified view is that score interpretations infer a construct that underlies their score (Sireci & Sukin, 2013), and this logic can improve how the GAS is used and discussed. Given the GAS does not have items like a conventional measure that is scored, this added complexity stresses the importance of construct definitions and theory to guide how this construct is operationalized. Importantly, researchers need to gather input from students, patients, and/or families as part of response process information and validation efforts. Evidence based on response processes will enable researchers to link theoretic information and judgments about the content of a test with consistencies in item responses, thus improving explanations of score meaning and subsequent interpretations, as well as the consequences of testing (AERA et al., 2014; Messick, 1989a).
This review noted that validity evidence was often provided without explanations that enhance interpretations of the GAS measure. Articles did not employ the unitary perspective of validity that has been encouraged by the Standards (AERA et al., 2014), and did not regard validity as an integrative judgment (Messick, 1995; Zumbo, 2009). The way in which evidence was gathered and presented suggests some crucial changes are necessary to update measurement knowledge across disciplines, for better implementation and stronger collaborative practice. Perhaps most important in outlining a sound validity argument is for researchers to begin by identifying a validity theory to guide their validation approach. Researchers may legitimately choose to use one validity approach over another; however, in the absence of any guiding approach, and as shown in this review, validity evidence does not move toward the same objective. At the core of our findings is that the construct measured by the GAS is unclear and has not been substantiated by previous validity evidence. While differences will continue to exist between researchers and disciplines in choosing one view over the other, an obvious question to ask is, “How does validity evidence enhance our understanding of the GAS?” or any measure, for that matter.
Conclusion and Recommendations
This systematic review is a unique contribution to the interdisciplinary measurement literature and highlights some gaps in the accumulated validity evidence for the GAS, which is widely used across disciplines. This investigation goes beyond studies that simply conducted examinations of validity; we synthesize validation practices and highlight gaps in evidence that limit confidence in the GAS. Fundamentally, the inability to identify a clear goal construct for the GAS affects the ability to measure this construct reliably and suggests some core aspects that are problematic. This review demonstrates the importance of building a validity argument starting with identifying a validity approach, and points out the influence of theory and response processes to substantiate the construct in the GAS measure. Use of the Standards (AERA et al., 2014) is recommended as a decision-making tool to strengthen validation practices. Its use is encouraged to improve how validity evidence is considered and gathered; the Standards should not be mistaken for a check-box list of guidelines to follow mechanically.
In addition to investigating validation practices and validity evidence for the GAS measure, this review shows that validity evidence for test content and response processes are key pieces of evidence in establishing what construct the measure represents. This review found that no articles questioned whether approaches for examining the validity of measures with specific items are appropriate for the GAS, a measure in which the content is formed by the respondent and/or users during completion of the GAS. Instead, articles examined validity with the same procedures that are commonly used for measures with fixed items. Consequently, this review provides a first look into validation practices for measures without uniform content, and opens several opportunities for future validity research.
It is often emphasized that the GAS is a measure of change, and the score indicates change in goal attainment. Therefore, the GAS score has an applied purpose and a social consequence (Messick, 1989a). Whether the GAS score is used in educational or clinical settings, its score has meaning, and a judgment or interpretation is formed based on its value. To evaluate how plausible an interpretation is, “it is necessary to be clear about what the interpretation claims” (Kane, 1994, p. 431). This review points to areas for further improvement in validity evidence for the GAS and urges researchers to consider ways validation practices can help verify the many claims that are made about this measure.
Appendix
Defining and Coding for Validity Evidence.
| Source of validity evidence | Definition | Coding |
|---|---|---|
| Test content | The construct has been clearly identified and defined, and content experts were consulted. | (i) What construct was identified and if yes, what was the definition? |
| | | (ii) Were content experts consulted or mentioned? (yes/no) Experts were considered broadly, and may include teachers, therapists, patients, students, or family members. |
| Response process | Whether theory was examined or individual responses were systematically tested. | (i) Was theory used or mentioned and if yes, what was it? |
| | | (ii) Were individual responses systematically tested? (yes/no) If response processes were tested, how was this tested (e.g., cognitive)? If not, were interactions between individuals and the GAS measure considered? |
| Internal structure | Any statistical technique to determine whether the GAS reflects the construct it proposes to measure (e.g., factor analyses). | (i) Were any statistics that test for internal structure reported or measured? (yes/no) |
| Relations with other variables | Evidence for how the construct is related to other variables. Responsiveness and sensitivity to change (as a relation to its previous score) were also coded. | (i) Was this source of validity reported (yes/no) and if so, what was it called? These were coded as convergent, divergent, criterion-predictive, criterion-concurrent, criterion-group differences, generalizations, discriminant, nomological network, construct validity, other, unsure/not clear. |
| | | (ii) Was responsiveness or sensitivity to change reported? (yes/no) |
| Consequences | Included positive or negative consequences of GAS. Evidence that pertained to how the score was interpreted or other evidence of the score’s applied purpose and utility was noted (Messick, 1995). | (i) Were consequences reported? (yes/no) |
| | | (ii) What evidence was provided for the score’s applied purpose (e.g., as a change score)? |
Note. GAS = goal attainment scaling.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
