A Systematic Review of Validation Practices for the Goal Attainment Scaling Measure

Abstract

Goal attainment scaling (GAS) is an internationally recognized measure that is widely used in educational, counseling, and clinical settings to identify and evaluate relevant goals for an individual. The GAS is an unusual measure because its content, which consists of goals, is formed by the respondent and/or users in the process of completing the GAS. Using the unified view of validity as a guiding framework, this systematic review examines validation practices and how goals are represented in this measure. This review demonstrates that validation practices tend to focus on aspects that do not support the overall construct validity of the measure, and that articles tend to refer to validity as a property of the GAS measure or GAS scores. Several gaps in validity evidence and the various ways goals are conceptualized are described and discussed. The varying ways goals are considered suggest that clarity is needed to enhance explanations and score meaning. This review urges researchers to consider ways validity and validation evidence can help verify the many claims that are made about this measure. Future validity research needs to consider application of a theoretical framework and response processes as key aspects of substantiating the construct measured by the GAS.

Keywords

validity, scale development/testing, measurement, educational psychology, educational assessment

Introduction

Goal attainment scaling (GAS; Kiresuk & Sherman, 1968) is an internationally recognized measure that is used across many disciplines to identify and evaluate relevant goals for an individual (Kiresuk, Smith, & Cardillo, 1994). Although GAS originates from counselling and clinical settings, it is increasingly being used in educational settings, as goals and goal-setting are highly relevant to educational contexts and educational assessment (Kiresuk et al., 1994). The GAS is unlike typical measures because it lacks fixed content; that is, it consists of goals formed by the respondent and/or users in the process of completing the measure. The nonstandard format of the GAS has not deterred investigations of its measurement properties, such as validity and reliability evidence. This systematic review aims to understand validation practices for the GAS; specifically, how validity evidence is gathered and reported.

The GAS is often endorsed as an “individualized” measure because it allows users to develop and set personalized goals. At the time the GAS was first developed, it was used to specify individual goals for patients in a community mental health program. During this initial use, selection of goals for the GAS involved a committee or “goal selector” (p. 446) who determined a set of realistic goals for the patient, and then graded and scaled goals according to likely treatment outcomes (Kiresuk & Sherman, 1968). Later descriptions of the GAS modified this condition so goals were set either individually or collaboratively between a student and teacher, or client and practitioner (Kiresuk et al., 1994). Once goals are set, they are scaled to identify variations in goal attainment, typically by individuals who have knowledge of the treatment or intervention (e.g., a practitioner or teacher). Scaling a goal involves identifying variations in goal attainment that indicate movement above or below treatment or intervention expectations. The GAS measure assumes that users will bring relevant or prior knowledge of the treatment or intervention to determine what goals are realistic, and to grade and scale these goals appropriately. Therefore, using the GAS involves (a) assessing an individual’s (e.g., student or patient’s) skill level in a particular problem area, (b) developing and scaling a goal that is the intended result of a treatment or intervention, and (c) later scoring the goal based on perceived change. Altogether, the GAS is a unique measure, and a striking feature is that it has “no fixed content” (Kiresuk et al., 1994, p. 167), as users of the GAS determine both the goals and their scaling. Given that this measure lacks fixed content and has been used for many years, examining validation practices will provide insight into how validity information is gathered for this unique measure.
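The three steps above (assess, scale, score) can be illustrated with a short sketch. The text here does not give a scoring rule, so the code below rests on two labeled assumptions: the conventional five-level attainment scale running from -2 to +2, and the summary T-score formula commonly attributed to Kiresuk and Sherman (1968), with an assumed average intercorrelation among goal scores of 0.3. The function name and the example scores are hypothetical.

```python
from math import sqrt

# ASSUMPTION: conventional GAS attainment levels, where 0 is the expected
# outcome and negative/positive values fall below/above expectation.
ALLOWED_LEVELS = (-2, -1, 0, 1, 2)


def gas_t_score(scores, weights=None, rho=0.3):
    """Summary T-score for one person's scaled goals.

    ASSUMPTION: uses the formula commonly attributed to Kiresuk and
    Sherman (1968); rho is the assumed intercorrelation among goal scores.
    """
    if not scores:
        raise ValueError("At least one scored goal is required.")
    if weights is None:
        weights = [1.0] * len(scores)  # equal weighting by default
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length.")
    for level in scores:
        if level not in ALLOWED_LEVELS:
            raise ValueError(f"Attainment level {level} is outside -2..+2.")

    weighted_sum = sum(w * x for w, x in zip(weights, scores))
    denominator = sqrt(
        (1 - rho) * sum(w * w for w in weights) + rho * sum(weights) ** 2
    )
    return 50 + 10 * weighted_sum / denominator


# Hypothetical person with three equally weighted goals.
example = gas_t_score([1, 0, 2])
print(round(example, 1))  # 63.7
```

Under these assumptions, a person who attains exactly the expected level on every goal receives a score of 50, which is why the summary score is often read as a change score relative to expectation.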

Validity is defined as the justifications or explanations for variations in scores on a measure, and validation is the process of acquiring that information (Zumbo, 2009). Validity information provides evidence related to the content of tools that are used to measure phenomena, as well as the interpretations and inferences that are made from their scores. While validity provides critical information about measures, it has also been reported that validation practices are inconsistent, and that there is a disconnect between the practice of validation and validity theory (Shankar, Miller, Roberson, & Hubley, 2019; Zumbo & Chan, 2014). Previous evidence has noted an imbalance in validity evidence presented and a lack of explicit reference to a validity framework (Shear & Zumbo, 2014). Because the meaning and language surrounding validity have changed over the years, these shifts may also have influenced how validity evidence for the GAS has been collected and reported since the original publication over 50 years ago (Kiresuk & Sherman, 1968). Common approaches for talking about validity include more than one view: modern (unified) validity theory and traditional (Trinitarian) validity theory (Guion, 1980; Newton & Shaw, 2013). The unified perspective was originally described by Cronbach and Meehl in 1955, and has evolved into a view that is currently endorsed by the Standards for Educational and Psychological Testing¹ (American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). This view of validity includes several sources of evidence, and the Standards identifies five sources, which are test content, response processes, internal structure, relations to other variables, and consequences (AERA et al., 2014). Within the unified view, validity came to be seen as centered around the construct, with the sources of evidence all contributing to the “whole of validity” evidence (Loevinger, 1957, p. 636), and the importance of building a nomological network for interpretations of scores (Cronbach & Meehl, 1955; Hubley & Zumbo, 2011). The notion of the construct lies at the core of this view, whereby the term construct describes an unobserved concept or behavior that can be operationalized through a measurement process. In contrast, the assumption in the Trinitarian view is that validity exists as different “types,” and this view sees validity as a property of a measure, so measures either do or do not have validity (Hubley & Zumbo, 1996). The tripartite view of validity has evolved and developed toward a more comprehensive view that considers validity as an integrative evaluative judgment, with validation as an ongoing process (Hubley & Zumbo, 2011; Messick, 1995). The unified view considers different types of validity (as historically considered), such as content and criterion related, as subsumed under construct validity (Messick, 1989b). This review uses the unified view of validity as a guiding framework for studying validation practices by recognizing that all validity evidence contributes toward an understanding of the construct. The unified view is also recognized by developers of the GAS (Kiresuk et al., 1994), who draw attention to the unified view of validity and discuss the importance of construct validation.

Of the sources of validity evidence identified in the Standards, test content and response processes are, arguably, foundational to the initial development and verification of a measurement instrument (AERA et al., 2014). Content-related evidence evaluates how well the content in the instrument represents the construct it is intending to measure (AERA et al., 2014; Haynes, Richard, & Kubany, 1995; Sireci, 1998), and one can think broadly of response processes as, “the mechanisms that underlie what people do, think, or feel when interacting with, and responding to, the item or task and are responsible for generating observed test score variation” (Zumbo & Hubley, 2017, p. 2), which in the case of the GAS is connected to a goal set by the user. Evidence based on test content and response processes are complementary in their objectives and in their descriptions (Padilla & Benítez, 2014). They both evaluate the representativeness of a measure and its elements in relation to the construct by evaluating response consistency (Messick, 1975; Vogt, King, & King, 2004). Together, these elements contribute toward an understanding of the meaning behind the GAS score (Messick, 1989a), and in particular what aspect of the goal construct the GAS intends to measure and how the GAS is interpreted among users.

Although the GAS is a measure that has variable content, some researchers have argued that evidence based on test content is a prerequisite for establishing other validity evidence (Vogt et al., 2004). Content-related evidence can be defined as how a test is related to the content it is intended to measure, as well as the degree to which a measure represents a specific construct for a certain assessment purpose (AERA et al., 2014; Haynes et al., 1995). When evidence based on test content is obtained, the content domain for a measure is evaluated, and feedback is received—and it is this process that justifies the content on the test, thereby judging the overall quality of a test (Sireci, 1998). Typically, test content evidence applies to the development and revision of instrument items, and the process includes specification of the construct of interest, review of test content, and consultation with experts (Haynes et al., 1995; Vogt et al., 2004). Since the GAS does not have a defined “universe of items” (Cronbach & Meehl, 1955, p. 282) and instead relies on a universe of goals set by users, evaluating construct validity and specifying the construct of interest becomes essential to understand its content and adequately define what one is trying to measure. It has been suggested by Kiresuk et al. (1994) that various types of evaluation are limited by the GAS format. They contend that the final score in the GAS provides information about validity, which can be directly interpreted to evaluate change given certain assumptions are met. One of these assumptions is that content-related evidence is “adequate” (Kiresuk et al., 1994, p. 245). As the content in the GAS is formed by users, evaluating test content must focus on the construct the GAS intends to measure to provide support and justification for subsequent score interpretations (Kane, 2006). 
Examining both the definition and operationalization of the construct are activities that gather content evidence in support of a measure’s construct validity (Sireci, 1998; Vogt et al., 2004). Therefore, understanding what goal construct is being identified will highlight how the construct in the GAS is being operationalized and provide validity evidence toward the meaning of its score (Anastasi, 1986; Messick, 1980; Tenopyr, 1977).

Correspondingly, response process evidence examines the congruence between the construct and individual processes through examination of both theory and empirical analyses (AERA et al., 2014; Messick, 1989a). Response processes have traditionally been investigated using cognitive processing methods, such as think-aloud or cognitive interviews (Padilla & Leighton, 2017). Examination of response processes has expanded to include aspects such as one’s behavior and motivations, to more fully understand what one is thinking or feeling as they interact with a measure (Leighton, Tang, & Guo, 2017). As well, aspects of emotion can be examined through various expressions (Leighton et al., 2017). In all, evidence based on response processes systematically assesses how respondents understand and process aspects of the construct that is measured by the GAS and can draw connections between the construct and individual responses. Along with evidence based on test content, these sources of validity evidence can provide justification for the goal construct the GAS purports to measure.

Synthesizing validity evidence provides a coherent account of the evidence supporting or disconfirming the intended interpretations from scores (O’Leary, Hattie, & Griffin, 2017). To appraise the way in which validity evidence is assembled, this review focuses on test content and response processes to examine representation of the goal construct measured by the GAS, and move beyond the individual test-takers’ behaviors that are traditionally used in validation research, toward an explanation-focused view of validity (Zumbo, 2017a, 2017b). By applying this position, this systematic review examines validation practices for the GAS by collecting information about how validity evidence has been reported over the period 1970 to 2018. Specifically, the purpose of this systematic review is to investigate how validity evidence for the GAS is assembled and then examine the available validity and reliability evidence.

Method

Search Strategy

The following six databases were searched for relevant research articles: PubMed, Embase, Cumulative Index to Nursing and Allied Health Literature (CINAHL), Eric, PsycINFO, and Cochrane Database of Systematic Reviews. Library databases were examined using the search criteria: (a) keyword “goal attainment scaling” combined with (b) valid* (*denotes truncation to search for variations in the word). Peer-reviewed articles, written in English, published since January 1970 that describe the use of GAS with any human sample were selected. Articles over several decades were searched to gather all available literature on validation practices with the GAS. Reference lists of articles that met full-text inclusion criteria were reviewed to determine whether any additional articles should be retrieved.
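As a rough illustration of the search logic, the sketch below applies the two criteria (the phrase "goal attainment scaling" combined with the truncated term valid*) to a handful of records. It is not the query syntax of any of the databases listed, each of which has its own interface; the records and the function name are made up for the example.

```python
import re

# Hypothetical records; the text is invented for illustration only.
records = [
    {"id": 1, "text": "Goal attainment scaling: validity and reliability in rehabilitation."},
    {"id": 2, "text": "Goal Attainment Scaling was validated against a standardized measure."},
    {"id": 3, "text": "Goal attainment scaling as an outcome measure in day hospitals."},
    {"id": 4, "text": "Validation of a pain questionnaire."},
]

# Criterion (a): the exact keyword phrase, ignoring case.
PHRASE = re.compile(r"goal attainment scaling", re.IGNORECASE)
# Criterion (b): valid* matches any word starting with "valid"
# (valid, validity, validation, validated, ...).
TRUNCATED = re.compile(r"\bvalid\w*", re.IGNORECASE)


def matches_search(text: str) -> bool:
    """Return True only when both search criteria are met."""
    return bool(PHRASE.search(text)) and bool(TRUNCATED.search(text))


hits = [record["id"] for record in records if matches_search(record["text"])]
print(hits)  # [1, 2]
```

Record 3 mentions the measure but no validity term, and record 4 mentions validation but not the measure, so neither is retrieved.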

Eligibility Criteria

For inclusion in this review, articles were reviewed in a step-wise manner by two reviewers, once duplicates were removed: (a) titles and abstracts were screened, followed by (b) a review of the full-text to code and select final articles. Titles and abstracts were initially screened by the first author (S.S) and a second reviewer (S.K.M) to determine whether articles identified: use of the GAS as a measurement tool and the abstract identified measurement properties of the GAS. As valid* was already searched, if any measurement properties were mentioned, the terms validity or validation would be contained in the full text. Altogether, this process was liberal and tended toward inclusion rather than exclusion. The focus was to include all articles and examine how validity was conceptualized and examined. Furthermore, experimental studies, reviews, and commentaries were all included to examine how validity evidence for the GAS was both investigated as well as interpreted. As a final screen, the first author (S.S) reviewed the full text before selecting final articles; in this last step, all articles were coded to examine how validation evidence was investigated or described in each corresponding article. The second reviewer (S.K.M) reviewed 20% of full-text articles to verify the coding and the article selection process, and to obtain a macro-level sense of similarities of ratings between both raters. Final articles were selected if the coding process verified validity evidence was examined or reviewed with the GAS. Only studies that examined validity as it pertained to measurement properties of the GAS were included, and articles that discussed social validity, ecological validity, or treatment validity were excluded. Furthermore, articles that only mentioned the GAS as “valid” but did not describe specifics about the validation process were excluded. 
Once data were read and coded, only those articles that described measuring validity or providing validity information for the GAS instrument were included. Figure 1 outlines details about article selection and reasons for exclusion in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol (Moher, Liberati, Tetzlaff, Altman, & PRISMA Group, 2009).

Figure 1.

PRISMA flowchart for systematic review.

Coding

A data abstraction and coding form was developed by authors to gather validity and validation information from screened articles. A coding sheet that included two sections was used to assess the available validity evidence. Section A collected descriptive information about articles and sample characteristics, and noted which articles identified as a review by their synthesis of literature, and Section B collected information about validity and validation evidence.

Using a similar coding process as Chan et al. (2014) and Cizek, Rosenberg, and Koons (2008), Section B was organized using the following scheme and the five sources identified in the Standards (AERA et al., 2014): (a) test content, (b) relations with other variables, (c) internal structure, (d) consequences, and (e) response processes (see the Appendix for a sample of the coding sheet and definitions). As well, an “other validity” category was included to account for validity sources not described in the Standards (AERA et al., 2014), and reliability was also included, since reliability is a necessary condition for validity (Hubley & Zumbo, 2013). In particular, the type of reliability estimate (internal consistency, test–retest, or inter-rater) was noted to allow a comparison with the amount of validity evidence that was sought and reported. Furthermore, to understand validity perspectives, this review coded for the following: (a) whether articles mentioned a unitary perspective of validity and (b) whether articles stated that validity was a property of the test or a property of test scores.
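One way to picture the coding sheet is as a small record per article. The sketch below is an assumption about how such a record could be structured; the field names, the class name, and the two example entries are hypothetical and are not taken from the actual coding form in the Appendix.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional

# The five sources of validity evidence named in the Standards (AERA et al., 2014).
STANDARDS_SOURCES = (
    "test_content",
    "relations_with_other_variables",
    "internal_structure",
    "consequences",
    "response_processes",
)
RELIABILITY_TYPES = ("internal_consistency", "test_retest", "inter_rater")


@dataclass
class CodedArticle:
    """One row of a hypothetical coding sheet (Section A plus Section B)."""
    # Section A: descriptive information.
    citation: str
    is_review: bool
    sample: str
    # Section B: validity and validation evidence.
    sources: set = field(default_factory=set)
    other_validity: list = field(default_factory=list)
    reliability: set = field(default_factory=set)
    mentions_unitary_view: bool = False
    validity_property_of: Optional[str] = None  # "test", "scores", or None

    def __post_init__(self):
        unknown = (self.sources - set(STANDARDS_SOURCES)) | (
            self.reliability - set(RELIABILITY_TYPES)
        )
        if unknown:
            raise ValueError(f"Unknown codes: {sorted(unknown)}")
        if self.validity_property_of not in (None, "test", "scores"):
            raise ValueError("validity_property_of must be 'test', 'scores', or None")


def tally_sources(articles):
    """Count how many articles report each source of validity evidence."""
    counts = Counter({source: 0 for source in STANDARDS_SOURCES})
    for article in articles:
        counts.update(article.sources)
    return counts


# Made-up entries, not real coded articles.
coded = [
    CodedArticle("Hypothetical Study A", False, "Adults",
                 sources={"test_content", "relations_with_other_variables"},
                 reliability={"inter_rater"}, validity_property_of="test"),
    CodedArticle("Hypothetical Review B", True, "Children",
                 sources={"relations_with_other_variables"},
                 other_validity=["face validity"]),
]
source_counts = tally_sources(coded)
print(source_counts["relations_with_other_variables"])  # 2
```

Keeping the five Standards sources as a fixed vocabulary and routing everything else to a free-text "other validity" list mirrors the distinction drawn in the coding scheme.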

Data Analysis

Validity evidence that was gathered from coded articles was evaluated using a combination of inductive and deductive content analytic approaches (Elo & Kyngäs, 2008; Hsieh & Shannon, 2005; Mayring, 2000). Gathering descriptive information invoked a deductive approach, as information reported in research articles was noted. A deductive content analytic approach was used to collect information on all sources of validity evidence. For instance, gathering construct definitions provided evidence based on test content, and evaluating the use of theory and systematic testing of individual processes provided evidence for response processes. A deductive content analytic approach was also used when information regarding validation practices was applied to coding. Evaluating sources of validity evidence and validity perspectives included a combination of approaches; a deductive approach was used to collect information that was reported, and an inductive approach was used to interpret this information.

Inter-Rater Agreement on Coded Studies

The first author and second reviewer were in agreement for 50 of 60 data points for these 37 studies, and this represented 83.3% agreement. Any differences were discussed between the first and second rater to reach consensus before assigning codes to cases.
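The agreement figure above is simple percent agreement, which can be checked with a few lines. The function name is hypothetical; the counts (50 agreements out of 60 data points) come from the text.

```python
def percent_agreement(agreements: int, total: int) -> float:
    """Share of data points on which two raters assigned the same code."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= agreements <= total:
        raise ValueError("agreements must be between 0 and total")
    return 100.0 * agreements / total


agreement = percent_agreement(50, 60)
print(round(agreement, 1))  # 83.3
```

Percent agreement does not correct for agreement expected by chance. A chance-corrected index would need the full cross-tabulation of both raters' codes, which is not reported here.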

Results

This review identified a total of 115 articles once duplicates were removed. After abstracts were screened and full text reviewed, a total of 37 articles were selected as examining or reporting validity evidence for the GAS (Figure 1). Of the selected articles, 10 identified themselves as a review and synthesized literature. Selected articles examined the GAS for a variety of reasons related to validity or validation, such as to gather information about validity and/or other measurement properties, examine the feasibility or utility of the GAS in certain settings, use GAS as an outcome measure, and review the GAS by itself or in comparison with other measures. The GAS was used with individuals who were patients or students. The samples included children, youth, adults, and older adults, and in one article the sample was nonspecific (Table 1).

Table 1.

Article Description.

Author	Review article?	Article title	Sample
Bouwens, van Heugten, and Verhey (2009)	Yes	Review of Goal attainment scaling as a useful outcome measure in psychogeriatric patients with cognitive disorders.	Older adults
Calsyn and Davidson (1978)	No	Do we really want a program evaluation strategy based solely on individualized goals? A critique of goal attainment scaling.	Adults
Cusick, McIntyre, Novak, Lannin, and Lowe (2006)	No	A comparison of goal attainment scaling and the Canadian occupational performance measure for pediatric rehabilitation research.	Children
Cytrynbaum, Ginath, Birdwell, and Brandt (1979)	Yes	Goal attainment scaling: A critical review.	Adults and children
de Beurs et al. (1993)	No	Goal attainment scaling: An idiosyncratic method to assess treatment effectiveness in agoraphobia.	Adults
Donnelly and Carswell (2002)	Yes	Individualized outcome measures: A review of the literature.	Adults
Fisher and Hardie (2002)	No	Goal attainment scaling in evaluating a multidisciplinary pain management program.	Adults
Gaasterland, Jansen-van der Weide, Weinreich, and van der Lee (2016)	Yes	A systematic review to investigate the measurement properties of goal attainment scaling, toward use in drug trials.	Various
Gordon, Powell, and Rockwood (1999)	No	Goal attainment scaling as a measure of clinically important change in nursing-home patients.	Older adults
Heavlin, Lee-Merrow, and Lewis (1982)	No	The psychometric foundations of goal attainment scaling.	Nonspecific
Hurn, Kneebone, and Cropley (2006)	Yes	Goal setting as an outcome measure: A systematic review.	Adults and older adults
Jones et al. (2006)	No	Using goal attainment scaling to evaluate a needs-led exercise program for people with severe and profound intellectual disabilities.	Adults
Joyce, Rockwood, and Mate-Kole (1994)	No	Use of goal attainment scaling in brain injury in a rehabilitation hospital.	Adults
Kiresuk, Lund, and Larsen (1982)	No	Measurement of goal attainment in clinical and health care programs.	Adults and children
Krasny-Pacini, Evans, Sohlberg, and Chevignard (2016)	No	Proposed criteria for appraising goal attainment scales used as outcome measures in rehabilitation research.	Adults and children
Krasny-Pacini, Hiebel, Pauly, Godon, and Chevignard (2013)	Yes	Goal attainment scaling in rehabilitation: A literature-based update.	Adults and children
Malec (1999)	No	Goal attainment scaling in rehabilitation.	Adults
Mannion, Caporaso, Pulkovski, and Sprott (2010)	No	Goal attainment scaling as a measure of treatment success after physiotherapy for chronic low back pain.	Adults
McGaghie and Menges (1975)	No	Assessing self-directed learning.	Students
Palisano and Gowland (1993)	No	Validity of goal attainment scaling in infants with motor delays.	Adults
Palisano, Haley, and Brown (1992)	No	Goal attainment scaling as a measure of change in infants with motor delays.	Children
Rock (1987)	No	Goal and outcome in social work practice.	Adults and children
Rockwood (1994)	Yes	Setting goals in geriatric rehabilitation and measuring their attainment.	Older adults
Rockwood, Stolee, Howard, and Mallery (1996)	No	Use of goal attainment scaling to measure treatment effects in an antidementia drug trial.	Adults
Rushton and Miller (2002)	No	Goal attainment scaling in the rehabilitation of patients with lower-extremity amputations: A pilot study.	Adults
Sakzewski, Boyd, and Ziviani (2007)	Yes	Clinimetric properties of participation measures for 5- to 13-year-old children with cerebral palsy: A systematic review.	Children
Schlosser (2004)	No	Goal attainment scaling as a clinical measurement technique in communication disorders: A critical review.	Adults and children
Shefler, Canetti, and Wiseman (2001)	No	Psychometric properties of goal-attainment scaling in the assessment of Mann’s time-limited psychotherapy.	Adults
Steenbeek, Ketelaar, Galama, and Gorter (2007)	Yes	Goal attainment scaling in pediatric rehabilitation: A critical review of the literature.	Children
Stolee et al. (2012)	No	The use of goal attainment scaling in a geriatric care setting.	Older adults
Stolee, Rockwood, Fox, and Streiner (1992)	No	A multisite study of the feasibility and clinical utility of Goal Attainment Scaling in geriatric day hospitals.	Older adults
Stolee, Stadnyk, Myers, and Rockwood (1999)	No	An individualized approach to outcome measurement in geriatric rehabilitation.	Older adults
Turner-Stokes, Fheodoroff, Jacinto, Maisonobe, and Zakine (2013)	No	Upper limb international spasticity study: Rationale and protocol for a large, international, multicenter prospective cohort study investigating management and goal attainment following treatment with Botulinum Toxin A in real-life clinical practice.	Adults
Vu and Law (2012)	Yes	Goal-attainment scaling: A review and applications to pharmacy practice.	Adults
Willer and Miller (1976)	No	On the validity of goal attainment scaling as an outcome measure in mental health.	Adults
Woodward, Santa-Barbara, Levin, and Epstein (1978)	No	The role of goal attainment scaling in evaluating family therapy outcome.	Adults
Yip et al. (1998)	No	A standardized menu for goal attainment scaling in the care of frail elders.	Older adults

Reporting of Validity Evidence

The majority of articles in this review reported on or examined relations to other variables (89.2%), followed by evidence based on test content (51.4%; Table 2). Evidence based on consequences was reported in one article (2.7%), and no article reported information about internal structure or response processes.

Table 2.

Frequencies of Sources of Validity Evidence Reported.

Validity evidence	No. of articles (N)	%
Test content	19	51.4
Response processes	0	0
Internal structure	0	0
Relations to other variables	33	89.2
Consequences	1	2.7
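The percentages in Table 2 follow from the article counts and the 37 included articles, as the short check below shows. The variable names are hypothetical; the counts are taken from the table.

```python
N_ARTICLES = 37  # articles included in the review

# Number of articles reporting each source of validity evidence (Table 2).
counts = {
    "Test content": 19,
    "Response processes": 0,
    "Internal structure": 0,
    "Relations to other variables": 33,
    "Consequences": 1,
}

percentages = {
    source: round(100 * n / N_ARTICLES, 1) for source, n in counts.items()
}

for source, pct in percentages.items():
    print(f"{source}: {pct}%")
```

An article can report more than one source of evidence, so the categories are not mutually exclusive and the percentages do not sum to 100.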

Construct representation through test content and response processes

To gather information about evidence based on test content, we reviewed articles to see whether experts were consulted, a construct was identified, and a corresponding construct definition was provided. Content was evaluated by examining goal domains, agreement between experts, comparing goals on the GAS to the content from other reports or assessments, and expert opinion. Expert panels included practitioners, patients, students, family, team members, or individuals doing intake. Although evidence based on test content included some expert consultation, the construct measured by the GAS was not clearly identified. In several articles, there was no clear construct that the GAS was identified as measuring, and articles identified the GAS in two predominant ways (Table 3): (a) solely as a measure of goals (40.5%) or (b) as both a measure of goals and its own method or measurement technique (56.8%), and one article stated the GAS construct lacked clarity and was nonspecific. While several articles mentioned a theory (13.5%) and one article identified a specific goal theory, no articles actually used a theoretical approach.

Table 3.

Description of the GAS.

GAS description	No. of articles (N)	%
Measure of goals (includes goal achievement/attainment or goal-setting)	15	40.5
Measure of goals and its own measurement technique or approach	21	56.8
Nonspecific	1	2.7

Note. GAS = goal attainment scaling.

To examine evidence based on response processes, this review evaluated whether theory was used to guide application of the GAS, and also whether response processes were empirically tested. Among the articles that mentioned a theory, only one article (i.e., Hurn, Kneebone, & Cropley, 2006) identified a specific goal theory. Furthermore, results indicate that no articles reported information about systematic testing of response processes, such as cognitive, motivational, or behavioral types of processing. Nonetheless, many articles mentioned observations or reflections on how goals were set, such as through negotiation or consensus.

Internal structure, relations with other variables, and consequences

No article reported evidence based on internal structure of the GAS, which is congruent with this type of measure since it lacks fixed content. Validity evidence based on relations with other variables was reported in varying ways, such as construct, convergent, concurrent, criterion, and predictive validity, as well as the nomological network. In addition, responsiveness or sensitivity to change was reported in more than half of the articles (67.6%). As well, the only article (i.e., Rockwood, 1994) that reported considering the unintended consequences of testing advocated for the individualized nature of the GAS as guarding patients against unintended consequences of other measures. Among reviewed studies that mentioned a score and its applied purpose, almost half of all articles (48.7%) discussed that the GAS score may be interpreted as a change score.

Other Evidence Related to Validity

Numerous articles (37.8%) also identified other sources of validity evidence that were outside the criteria identified, such as face validity, external validity, internal validity, and congruent validity.

Reliability

Although all included articles discussed validity evidence of the GAS, the majority of articles also reported or examined corresponding reliability evidence (94.6%). Inter-rater reliability was reported most frequently (73.0%), followed by test–retest (16.2%) and internal consistency (13.5%).

Validation Practices

Based on the validity evidence that was gathered, most studies tended to gather validity evidence by focusing on relations with other variables and reliability evidence. Validity evidence was often gathered as types and studies tended to gather different types of validity. No article mentioned the unitary perspective of validity or any editions of the Standards (e.g., AERA et al., 2014), and one article identified the tripartite view (i.e., Gordon, Powell, & Rockwood, 1999) as their theoretical approach to validity. Validity was discussed as either a property of the GAS measure or the GAS scores.

Discussion

This review provides a glimpse into validation practices for the GAS measure by examining how this evidence was gathered and assessing available validity evidence. The 37 articles selected for this study verified that the GAS is used in a variety of settings and with a variety of samples (Table 1).

Validity and Validation Evidence of the GAS

Validity evidence was evaluated by all studies in a number of ways. Most validity evidence tended to focus on relations to other variables, or reliability. The concentration of data in these areas is not uncommon and is similar to the findings by Zumbo and Chan (2014). They note that the high concentration of this evidence brings some difficulty interpreting evidence and building a sound validity argument. They also note limited guidance from orientations to theory, including validity theory. Altogether, validity evidence was not gathered in any systematic way and reflected a piecemeal approach to validation.

Articles approached validity by gathering varying types of validity evidence and then reasoning that the GAS was either “valid” or “not valid.” Discussing validity in this way, as a property of the measure or its scores, implies that validity is seen as a fixed or immutable quality. This idea emerged in the 1940s, when validity was conceptualized as a static property that had to be proven or established (Goodwin & Leech, 2003). Newton and Shaw (2013), in their analysis of ways validity is discussed or thought about, suggested validity was referred to as if it were a property of a test for several reasons, such as intentional misuse, lack of awareness or misunderstanding, and genuine divergence from the view of validity as a property of interpretations. Newton and Shaw (2013) also identified 122 discrete validity labels intended to capture an aspect of validity for measurement. From these labels and the results from this review, it is apparent that articles do not consider validity as the interpretations from scores on a measure. As discovered in this review, the GAS has varying interpretations.

Interpretations of the GAS

The GAS was frequently identified as both a measure of goals and also its own measurement technique. Among articles that described the GAS solely as a measure of goals, articles either specified goals were individualized, or simply referred to goals in a broad or general manner. The term goal was used by itself or referred to related aspects of the goal construct, such as goal achievement, goal attainment, or goal-setting. It was difficult to decipher how varying aspects of the goal construct were distinguished since these terms were used interchangeably in relation to the GAS.

Goal constructs identified in the GAS

The results from this review draw attention to the various ways goals may be described in the GAS and to some discrepancies in how aspects of the goal construct are discussed. One discrepancy is that individualized goals and goal achievement are not equivalent to the process of goal setting. However, studies included in this review readily moved from identifying the GAS as measuring goals to treating the GAS as a tool for goal setting and for evaluating goal achievement. Ostensibly, these aspects or dimensions of the goal construct were all viewed somewhat synonymously.

Although similarities exist between these various aspects of the goal construct, there are important distinctions that influence what outcomes are produced from the GAS, as well as the score meaning. Of the articles included in this review, only one (Hurn et al., 2006) identified a construct definition for a goal. As noted by Elliot and Fryer (2008), the term goal is rarely defined in research, on the assumption that all readers understand the word similarly, yet the term can take on different meanings. While it is not uncommon to consider the goal construct in a parsimonious way, separating its differential aspects has a number of advantages, such as limiting the confusion surrounding the construct, improving understanding of the influence of multiple goals, and minimizing assumptions (Austin & Vancouver, 1996).

As noted by Austin and Vancouver (1996), proliferation of various aspects of the goal construct makes its examination problematic, which is evidenced in this review. Presumably, the GAS has been assumed to measure different aspects of the goal construct, and no study examining validity evidence considered their distinctions. Improving the clarity surrounding the goal construct in the GAS can help determine clear functional properties of this measure, instead of leaving one to wonder whether additional constructs may account for a particular behavior (Elliot & Fryer, 2008). Likewise, a number of factors affect goal outcomes, their structure, and process (Austin & Vancouver, 1996). For instance, the activation and pursuit of goals depend on one’s conscious desires, which can influence one’s thoughts, emotions, and behaviors (Fishbach & Ferguson, 2007), and goal commitment is only recognized when there is an investment of affect, cognitive resources, and behavior (Mann, de Ridder, & Fujita, 2013). Moreover, goal achievement is influenced by the nature of the task and how applicable the goal is toward it (Fishbach & Ferguson, 2007), as well as the context and one’s level of control (Austin & Vancouver, 1996). Given the vast psychological literature surrounding the goal construct, these aspects offer only a brief view of some of the factors associated with it. Certainly, from a validity and validation standpoint, the varying constructs identified suggest clarity is needed to enhance explanations and score meaning.

Evidence of theory guiding definitions

Validity is a matter of inference and a process that provides information related to the meaning of scores, which in turn provides information about an outcome of interest. Understanding what inferences are made with test scores refers back to how theory is applied to justify the claims that are made regarding these scores (Kane, 2009). The only critique included in this review asked, “How is the construct embedded in the theory?” (Cytrynbaum, Ginath, Birdwell, & Brandt, 1979, p. 33), and the results from this review indicate that this question remains unexamined and, therefore, unverified. Theoretical perspectives about the goal construct are rarely mentioned and notably absent in applications of the GAS. Among studies reviewed, only one article (Hurn et al., 2006) specifically mentioned a goal theory (i.e., Locke, 1968), but this theory was not operationalized in the respective study. Another study included in this review mentioned that theory relates to something clinicians consider during the goal-setting process; however, this too was not tested (Vu & Law, 2012). Using the GAS to set goals is a complex endeavor, and theoretical rationales can assist by providing guidance in action-planning this process (Scobbie, Dixon, & Wyke, 2011). In a Cochrane review that investigated the GAS and goal setting in rehabilitation medicine, the authors noted that only one study implemented goal setting in a way that was consistent with a theory (Levack et al., 2016). The results from this systematic review suggest that future research using the GAS needs to better implement a theory to guide establishment of a definition and its application.

Given that the GAS lacks fixed content and does not have scored items like a conventional measure, this added complexity stresses the importance of construct definitions and theory to guide how this construct is operationalized. When the contribution of these aspects to validity evidence is discounted, we can only assume, not verify, that the GAS measures an aspect of the goal construct, which has consequences for users of the GAS. There is no shortage of applicable theories that relate to the goal construct (Austin & Vancouver, 1996), and theories of behavior change can inform goal-setting interventions (Scobbie et al., 2011), such as social cognitive theory (Bandura, 1997), goal setting theory (Locke & Latham, 2002), and the health action process approach (Schwarzer, 1992), all of which could be applicable to the GAS. Indeed, from a validation perspective, strong forms of construct validity evidence include a theory that is well-articulated and tested, which helps strengthen the nomological network and provide a sound validity argument (Cronbach & Meehl, 1955; Loevinger, 1957; Messick, 1989b; Zumbo, 2009).

The need for evidence based on response processes

In addition to theory, this systematic review also examined whether individual interactions with the GAS were tested, as an aspect of response process evidence. No articles tested response processes, including the more commonly examined cognitive processes, using methods such as think-aloud or cognitive interviews (Padilla & Leighton, 2017). Nonetheless, articles did mention aspects related to how individuals interacted with the measure and how goals were set. Articles included in this review often stated whether goals were set collaboratively or alone and how final goals were determined (e.g., through negotiation or consensus). This review uncovered that several articles considered some aspects related to goal setting, but no article empirically tested and verified these aspects. For instance, two articles contemplated student or patient feelings and motives (McGaghie & Menges, 1975; Stolee et al., 2012), another considered patient or family concerns (Stolee, Stadnyk, Myers, & Rockwood, 1999), and another addressed the difficulty of conceptualizing goal setting for a cognitively impaired patient (Krasny-Pacini, Evans, Sohlberg, & Chevignard, 2016). As well, one article stated that goal orientations can be influenced by an individual’s motivation (Kiresuk, Lund, & Larsen, 1982), and another mentioned that precision of goals was related to reporting and how goals were identified (Milne, Robert, Tang, Drummond, & Ross, 2009). It is noteworthy that these aspects were considered; however, testing these considerations by investigating individual interactions with the GAS can provide empirical evidence to support these claims.

The GAS Score and Its Meaning

Altogether, building a validity argument is a key aspect of strengthening score interpretations. Although reliability is a part of the validity argument and provides insight into the consistency of the GAS scores, it contributes minimally to the accuracy of the findings and is not a substitute for validity (Barry, Chaney, Piazza-Gardner, & Chavarria, 2014; Zumbo & Chan, 2014). Reliability was reported in almost all reviewed articles; however, it is not enough to justify the use of the GAS score. A fundamental feature of the validity argument and integrating validity evidence within a unitary concept of construct validity is how the construct is represented (Messick, 1995).

Almost all the studies included in this review mentioned that the GAS score was measuring a change, and an applied purpose of the GAS score was to produce a change score. In most cases, change was measured with respect to student or patient progress and to compare program effectiveness, for example, “program success is measured in ‘goals achieved’” (Calsyn & Davidson, 1978, p. 306). In addition, it was not uncommon for studies to interpret a particular GAS score as an evaluation of change (i.e., improvement, no change, or deterioration). Articles discussed a number of different ways the GAS evaluates change and used the terms responsiveness, sensitivity to change, and change score to discuss or denote change over time; reporting of change was highly variable and inconsistent. In a literature review that investigated how studies of treatment effectiveness and program evaluations measure change over time, the authors found that there are challenges in interpreting change scores and difficulty in comparing estimates of responsiveness (Beaton, Bombardier, Katz, & Wright, 2001). Beaton et al. (2001) also noted that the same terms were used, sometimes interchangeably, and emphasized that statistics such as responsiveness are highly contextualized. Responsiveness is a term that is widely used to denote change over time or sensitivity to change (Middel & van Sonderen, 2002; Terwee, Dekker, Wiersinga, Prummel, & Bossuyt, 2003) and refers to the ability of a measure to accurately detect change when it has occurred (Beaton et al., 2001). While change scores are also considered an indicator of change over time (Thomas & Zumbo, 2012), they may also refer more generally to any difference score (Cronbach & Furby, 1970). Although the GAS produces a score that incorporates different time points, the nature of the change needs to be understood so that interpretation of the GAS score as demonstrating change is reasonable and clear.
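The distinction between a change score and a responsiveness statistic can be illustrated with a minimal sketch, which is not drawn from any reviewed article. It assumes hypothetical baseline and follow-up GAS scores for five individuals and uses the standardized response mean (mean change divided by the standard deviation of change), which is only one of several responsiveness indices in the literature. The function names and the data are illustrative assumptions.

```python
from statistics import mean, stdev


def change_scores(baseline, follow_up):
    """Return simple difference scores (follow-up minus baseline)."""
    if len(baseline) != len(follow_up):
        raise ValueError("baseline and follow_up must have the same length")
    return [post - pre for pre, post in zip(baseline, follow_up)]


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the sample standard deviation of change."""
    diffs = change_scores(baseline, follow_up)
    sd = stdev(diffs)  # needs at least two individuals
    if sd == 0:
        raise ValueError("change scores have zero variance")
    return mean(diffs) / sd


# Hypothetical scores for five individuals (illustrative data only).
baseline = [40.0, 38.0, 42.0, 40.0, 36.0]
follow_up = [50.0, 46.0, 48.0, 52.0, 40.0]

diffs = change_scores(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
print(diffs)
print(round(srm, 2))
```

A difference score of this kind is silent about what changed: the same value could reflect progress on a stable goal or a shift in the goal itself, which is why the nature of the change has to be established before the index is interpreted.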

Thus, questions about whether the GAS is a suitable measure to evaluate change hinge on clarity about its construct definition and subsequent score meaning. Before one can determine whether the GAS is a measure of change and can effectively measure change in a goal construct, it is imperative to understand participants’ responses and how they align with the goal construct. As noted by one article included in this review, “initial exposure to goal setting may have allowed the person time to reflect, thereby possibly leading to a change in the goal areas” (Rushton & Miller, 2002, p. 776). There are a number of reasons goals may change, as well as factors that influence their achievement. Although many factors relate to the goal construct, perhaps most important is a well-identified gap between goal intentions and goal behavior, demonstrated in a meta-analysis by Webb and Sheeran (2006). This discrepancy suggests that intentions do not necessarily lead to actions, and actions need facilitation (Webb & Sheeran, 2006). Indeed, activation of a goal can dissipate once a goal has been reached or if an obstacle that cannot be overcome is encountered (Fishbach & Ferguson, 2007), and activation of multiple goals can shift over time (Austin & Vancouver, 1996). Thus, if the GAS is, as Kiresuk et al. (1994) maintain, a measure of one’s “perceived ability to change” (p. 245), how does one know whether the GAS is measuring change in the identified goal or quite simply a change in goal?

Strengthening Validity Evidence and Validation Practices

Ultimately, a test score cannot be interpreted if one does not know what the test is measuring (Sireci, 2012), and as shown through variations in the goal construct, “what the test is measuring” is unclear. Of the 37 articles included, 10 identified themselves as review articles of the GAS literature; however, none appraised interpretations of the GAS or drew connections to theory. Strengthening the validity argument for the GAS requires better testing procedures to justify the goal construct measured by the GAS and verify that the GAS does measure change. As well, consequences of score interpretation and use were considered in one article (Rockwood, 1994), and more studies need to consider the applied purpose of the GAS score. Effectively, validation requires scientific inquiry alongside a rational argument to substantiate the score interpretation and use (Messick, 1995).

The variability noted among reviewed articles highlights that the GAS has many different interpretations. There is a lack of clarity regarding how the GAS is best interpreted, what specific construct the GAS measures, and whether the GAS measures a goal construct or is best regarded as its own measurement technique. An advantage of the unified view is that score interpretations infer a construct that underlies their score (Sireci & Sukin, 2013), and this logic can improve how the GAS is used and discussed. Because the GAS does not have scored items like a conventional measure, construct definitions and theory are especially important in guiding how this construct is operationalized. Importantly, researchers need to gather input from students, patients, and/or families as part of response process information and validation efforts. Evidence based on response processes will enable researchers to link theoretical information and judgments about the content of a test with consistencies in item responses, thus improving explanations of score meaning and subsequent interpretations, as well as the consequences of testing (AERA et al., 2014; Messick, 1989a).

This review noted that validity evidence was often provided without explanations that enhance interpretations of the GAS measure. Articles did not employ the unitary perspective of validity that has been encouraged by the Standards (AERA et al., 2014), and did not regard validity as an integrative judgment (Messick, 1995; Zumbo, 2009). The way in which evidence was gathered and presented suggests some crucial changes are necessary to update measurement knowledge across disciplines, for better implementation and stronger collaborative practice. Perhaps most important in outlining a sound validity argument is for researchers to begin by identifying a validity theory to guide their validation approach. Researchers may legitimately choose to use one validity approach over another; however, in the absence of any such approach, and as shown in this review, validity evidence does not move toward the same objective. At the core of our findings is that the construct measured by the GAS is unclear and has not been substantiated by previous validity evidence. While differences will continue to exist between researchers and disciplines in choosing one view over the other, an obvious question to ask is, “How does validity evidence enhance our understanding of the GAS?” or any measure, for that matter.

Conclusion and Recommendations

This systematic review is a unique contribution to the interdisciplinary measurement literature and highlights some gaps in the accumulated validity evidence for the widely used GAS across disciplines. This investigation goes beyond studies that simply conducted examinations of validity; we synthesize validation practices and highlight gaps in evidence that limit confidence in the GAS. Fundamentally, the inability to identify a clear goal construct for the GAS affects the ability to measure this construct reliably and suggests some core aspects that are problematic. This review demonstrates the importance of building a validity argument starting with identifying a validity approach, and points out the influence of theory and response processes to substantiate the construct in the GAS measure. Use of the Standards (AERA et al., 2014) is recommended as a decision-making tool to strengthen validation practices. Its use is encouraged to improve how validity evidence is considered and gathered, and should not be mistaken as a check-box list of guidelines one follows mechanically.

In addition to investigating validation practices and validity evidence for the GAS measure, this review shows that validity evidence for test content and response processes are key pieces of evidence in establishing what construct the measure represents. This review found that no articles questioned whether approaches used to examine validity for measures with specific items should be applied to the GAS, a measure in which the content is formed by the respondent and/or users during completion of the GAS. Instead, articles applied the same procedures as are commonly used to examine validity for measures with fixed items. Consequently, this review provides a novel look into measures without uniform content and opens several opportunities for future validity research.

It is often emphasized that the GAS is a measure of change, and the score indicates change in goal attainment. Therefore, the GAS score has an applied purpose and a social consequence (Messick, 1989a). Whether the GAS score is used in educational or clinical settings, its score has meaning, and a judgment or interpretation is formed based on its value. To evaluate how plausible an interpretation is, “it is necessary to be clear about what the interpretation claims” (Kane, 1994, p. 431). This review points to areas for further improvement in validity evidence for the GAS and urges researchers to consider ways validation practices can help verify the many claims that are made about this measure.

Appendix

Defining and Coding for Validity Evidence.

Source of validity evidence	Definition	Coding
Test content	The construct has been clearly identified and defined, and content experts were consulted.	(i) What construct was identified and if yes, what was the definition?
Test content		(ii) Were content experts consulted or mentioned (yes/no). Experts were considered broadly, and may include teachers, therapists, patients, students, or family members
Response process	Whether theory was examined or individual responses were systematically tested.	(i) Was theory used or mentioned and if yes, what was it?
Response process		(ii) Were individual responses systematically tested (yes/no)? If response processes were tested, how were they tested (e.g., cognitive)? If not, were interactions between individuals and the GAS measure considered?
Internal structure	Any statistical technique to determine whether the GAS reflects the construct it proposes to measure (e.g., factor analyses)	(i) Were any statistics that test for internal structure reported or measured? (yes/no)
Relations with other variables	Evidence for how the construct is related to other variables. Responsiveness and sensitivity to change (as a relation to its previous score) were also coded.	(i) Was this source of validity reported (yes/no) and if so, what was it called? These were coded as convergent, divergent, criterion-predictive, criterion-concurrent, criterion-group differences, generalizations, discriminant, nomological network, construct validity, other, unsure/not clear.
Relations with other variables		(ii) Was responsiveness or sensitivity to change reported? (yes/no)
Consequences	Included positive or negative consequences of GAS. Evidence that pertained to how the score was interpreted or other evidence of the score’s applied purpose and utility was noted (Messick, 1995).	(i) Were consequences reported? (yes/no)
Consequences		(ii) What evidence was provided for score’s applied purpose (e.g., as a change score)?

Note. GAS = goal attainment scaling.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Sneha Shankar

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. [AERA, APA, & NCME] (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.

Anastasi

(1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1-15.

Austin

J. T.

Vancouver

J. B.

(1996). Goal constructs in psychology: Structure, process, and content. Psychological Bulletin, 120, 338-375. doi:10.1037/0033-2909.120.3.338

Bandura

(1997). Self efficacy: The exercise of control. New York, NY: W.H. Freeman.

Barry

A. E.

Chaney

Piazza-Gardner

A. K.

Chavarria

E. A.

(2014). Validity and reliability reporting practices in the field of health education and behavior. Health Education & Behavior, 41, 12-18. doi:10.1177/1090198113483139

Beaton

D. E.

Bombardier

Katz

J. N.

Wright

J. G.

(2001). A taxonomy for responsiveness. Journal of Clinical Epidemiology, 54, 1204-1217.

Bouwens

van Heugten

C. M.

Verhey

F. R. J.

(2009). The practical use of goal attainment scaling for people with acquired brain injury who receive cognitive rehabilitation. Clinical Rehabilitation, 23, 310-320.

Calsyn

R. J.

Davidson

W. S.

(1978). Do we really want a program evaluation strategy based solely on individualized goals? Community Mental Health Journal, 14, 300-308.

Chan

E. K. H.

Munro

Huang

A. H. S.

Zumbo

B. D.

Vojdanijahromi

Ark

. (2014). Chapter 5 Validation Practices in Counseling: Major Journals, Mattering Instruments, and the Kuder Occupational Interest Survey (KOIS). In Validity and Valdiation in Social, Behavioral, and Health Sciences (pp. 67–87).

10.

Cizek

G J.

Rosenberg

S. L.

Koons

H. H

. (2008). Sources of validity evidence for educational and Psychological tests. Educational and Psychological Measurement, 68, 397-412. doi:10.1177/0013164407310130

11.

Cronbach

L. J.

Furby

(1970). How we should measure “change”: Or should we? Psychological Bulletin, 74, 68-80. doi:10.1037/h0029382

12.

Cronbach

L. J.

Meehl

P. E.

(1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302. doi:10.1037/h0040957

13.

Cusick

McIntyre

Novak

Lannin

Lowe

(2006). A comparison of goal attainment scaling and the Canadian occupational performance measure for paediatric rehabilitation research. Developmental Neurorehabilitation, 9, 149-157. doi:10.1080/13638490500235581

14.

Cytrynbaum

Ginath

Birdwell

Brandt

(1979). Goal attainment scaling: A critical review. Evaluation Quarterly, 3, 5-40. doi:10.1177/0193841X7900300102

15.

de Beurs

Lange

Blonk

R. W. B.

Koele

van Balkom

A. J. L. M.

Van Dyck

. (1993). Goal attainment scaling: An idiosyncratic method to assess treatment effectiveness in agoraphobia. Journal of Psychopathology and Behavioral Assessment, 15, 357-373. doi:10.1007/BF00965038

16.

Donnelly

Carswell

(2002). Individualized outcome measures: A review of the literature. Canadian Journal of Occupational Therapy, 69, 84-94. doi:10.1177/000841740206900204

17.

Elliot

A. J.

Fryer

J. W.

(2008). The goal construct in psychology. In Shah

J. Y.

Gardner

W. L.

(Eds.), Handbook of motivation science (pp. 235-250). New York, NY: Guilford Press.

18.

Elo

Kyngäs

(2008). The qualitative content analysis process. Journal of Advanced Nursing, 62, 107-115. doi:10.1111/j.1365-2648.2007.04569.x

19.

Fishbach

Ferguson

M. J.

(2007). The goal construct in social psychology. In Kruglanski

A. W.

Higgens

T. E.

(Eds.), Social psychology: Handbook of basic principles (pp. 334-352). New York, NY: Guilford Press. doi:10.1007/BF02333407.

20.

Fisher

Hardie

R. J.

(2002). Goal attainment scaling in evaluating a multidisciplinary pain management programme. Clinical Rehabilitation, 16, 871-877.

21.

Gaasterland

C. M. W.

Jansen-van der Weide

M. C.

Weinreich

S. S.

van der Lee

J. H.

(2016). A systematic review to investigate the measurement properties of goal attainment scaling, towards use in drug trials. BMC Medical Research Methodology, 16, Article 99. doi:10.1186/s12874-016-0205-4

22.

Goodwin

L. D.

Leech

N. L.

(2003). The meaning of validity in the New Standards for Educational and Psychological Testing: Implications for measurement courses. Measurement and Evaluation in Counseling and Development, 36, 181-191.

23.

Gordon

J. E.

Powell

Rockwood

(1999). Goal attainment scaling as a measure of clinically important change in nursing-home patients. Age and Ageing, 28, 275-281. doi:10.1093/ageing/28.3.275

24.

Guion

R. M.

(1980). On Trinitarian doctrines of validity. Professional Psychology: Research and Practice, 11, 385-398. doi:10.1037/0735-7028.11.3.385

25.

Haynes

S. N.

Richard

D. C. S.

Kubany

E. S.

(1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238-247. doi:10.1037/1040-3590.7.3.238

26.

Heavlin

W. D.

Lee-Merrow

S. W.

Lewis

V. M.

(1982). The psychometric foundations of goal attainment scaling. Community Mental Health Journal, 18, 230-241. doi:10.1007/BF00754339

27.

Hsieh

H.-F.

Shannon

S. E.

(2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15, 1277-1288. doi:10.1177/1049732305276687

28.

Hubley

A. M.

Zumbo

B. D.

(1996). A dialectic on validity: Where we have been and where we are going. Journal of General Psychology, 123, 207-215.

29.

Hubley

A. M.

Zumbo

B. D.

(2011). Validity and the consequences of test interpretation and use. Social Indicators Research, 103, 219-230.

30.

Hubley

A. M.

Zumbo

B. D.

(2013). Psychometric characteristics of assessment procedures: An overview. In Geisinger

K. F.

(Ed.), APA handbook of testing and assessment in psychology (Vol. 1, pp. 3-19). Washington, DC: American Psychological Association Press.

31.

Hurn

Kneebone

Cropley

(2006). Goal setting as an outcome measure: A systematic review. Clinical Rehabilitation, 20, 756-772. doi:20/9/756 [pii]\r10.1177/0269215506070793

32.

Jones

M. C.

Walley

R. M.

Leech

Paterson

Common

Metcalf

(2006). Using goal attainment scaling to evaluate a needs-led exercise programme for people with severe and profound intellectual disabilities. Journal of Intellectual Disabilities, 19, 317-335.

33.

Joyce

B. M.

Rockwood

Mate-Kole

C. C.

(1994). Use of Goal Attainment Scaling in brain injury in rehabilitation hospital. American Journal of Physical Medicine & Rehabilitation, 73, 10-14.

34.

Kane

M. T.

(1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425-461.

35.

Kane

M. T.

(2006). Content-related validity evidence. In Downing

S. M.

Haladyna

T. M.

(Eds.), Handbook of test development (pp. 131-153). Mahwah, NJ: Lawrence Erlbaum.

36.

Kane

M. T.

(2009). Validating the interpretations and uses of test scores. In Lissitz

R. W.

(Ed.), The concept of validity: Revisions, new directions and applications (pp. 39-64) Charlotte, NC: Information Age Publishing.

37.

Kiresuk

T. J.

Lund

S. H.

Larsen

N. E.

(1982). Measurement of goal attainment in clinical and health care programs. Drug Intelligence & Clinical Pharmacy, 16, 145-153.

38.

Kiresuk

T. J.

Sherman

R. E.

(1968). Goal attainment scaling: A general method for evaluating comprehensive community mental health programs. Community Mental Health Journal, 4, 443-453. doi:10.1007/BF01530764

39.

Kiresuk

T. J.

Smith

Cardillo

J. E.

(Eds.). (1994). Goal attainment scaling: Applications, theory, and measurement. New York, NY: Lawrence Erlbaum.

40.

Krasny-Pacini

Evans

Sohlberg

M. M.

Chevignard

(2016). Proposed criteria for appraising goal attainment scales used as outcome measures in rehabilitation research. Archives of Physical Medicine and Rehabilitation, 97, 157-170. doi:10.1016/j.apmr.2015.08.424

41.

Krasny-Pacini

Hiebel

Pauly

Godon

Chevignard

(2013). Goal attainment scaling in rehabilitation: A literature-based update. Annals of Physical and Rehabilitation Medicine, 56, 212-230. doi:10.1016/j.rehab.2013.02.002

42.

Leighton

J. P.

Tang

Guo

(2017). Response processes and validity evidence: Controlling for emotions in think aloud interviews. In Zumbo

B. D.

Hubley

A. M.

(Eds.), Understanding and investigating response processes in validation (pp. 137-158). Cham, Switzerland: Springer.

43.

Levack

W. M.

Weatherall

Hay-Smith

J. C.

Dean

S. G.

McPherson

Siegert

R. J.

(2016). Goal setting and strategies to enhance goal pursuit in adult rehabilitation: Summary of a Cochrane systematic review and meta-analysis. European Journal of Physical and Rehabilitation Medicine, 52, 400-416.

44.

Locke

E. A.

(1968). Toward a theory of task motivation and incentives. Organizational Behavior and Human Performance, 3, 157-189. doi:10.1016/0030-5073(68)90004-4

45.

Locke

E. A.

Latham

G. P.

(2002). Building a practically useful theory of goal setting and task motivation: A 35-year odyssey. The American Psychologist, 57, 705-717. doi:10.1037/0003-066X.57.9.705

46.

Loevinger

(1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694. doi:10.2466/pr0.1957.3.3.635

47.

Malec

J. F.

(1999). Goal attainment scaling in rehabilitation. Neuropsychological Rehabilitation, 9, 253-275. doi:10.1080/096020199389365

48.

Mann

de Ridder

Fujita

(2013). Self-regulation of health behavior: social psychological approaches to goal setting and goal striving. Health Psychology, 32, 487-498.

49.

Mannion

A. F.

Caporaso

Pulkovski

Sprott

(2010). Goal attainment scaling as a measure of treatment success after physiotherapy for chronic low back pain. Rheumatology, 49, 1734-1738. doi:10.1093/rheumatology/keq160

50.

Mayring

(2000). Qualitative content analysis. Forum: Qualitative Social Research, 1(2), 1-10. doi:10.1111/j.1365-2648.2007.04569.x

51.

Mcgaghie

W. C.

Menges

R. J.

(1975). Assessing self-directed learning. Teaching of Psychology, 2, 56-59.

52.

Messick

(1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955-956.

53.

Messick

(1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012-1027. doi:10.1037/0003-066X.35.11.1012

54.

Messick

(1989a). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 5-11.

55.

Messick

(1989b). Validity. In. Linn

R. L.

(Ed.), Educational measurement (3rd ed., pp. 13-103). New York, NY: Macmillan.

56.

Messick

(1995). Validity of psychological assessment. American Psychologist, 50, 741-749.

57.

Middel

van Sonderen

(2002). Statistical significant change versus relevant or important change in (quasi) experimental design: Some conceptual and methodological problems in estimating magnitude of intervention-related change in health services research. International Journal of Integrated Care, 2, 1-18.

58.

Milne

J. L.

Robert

Tang

Drummond

Ross

(2009). Goal achievement as a patient-generated outcome measure for stress urinary incontinence. Health Expectations, 12, 288-300. doi:10.1111/j.1369-7625.2009.00536.x

59. Moher, Liberati, Tetzlaff, Altman, D. G., & PRISMA Group. (2009). Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. PLoS Medicine, 6, e1000097. doi:10.1371/journal.pmed.1000097

60. Newton, P. E., & Shaw, S. D. (2013). Standards for talking and thinking about validity. Psychological Methods, 18, 301-319. doi:10.1037/a0032969

61. O’Leary, T. M., Hattie, J. A. C., & Griffin (2017). Actual interpretations and use of scores as aspects of validity. Educational Measurement: Issues and Practice, 36, 16-23. doi:10.1111/emip.12141

62. Padilla, J.-L., & Benítez (2014). Validity evidence based on response processes. Psicothema, 26, 136-144. doi:10.7334/psicothema2013.259

63. Padilla, J.-L., & Leighton, J. P. (2017). Cognitive interviewing and think aloud methods. In B. D. Zumbo & A. M. Hubley (Eds.), Understanding and investigating response processes in validation (pp. 211-228). Cham, Switzerland: Springer.

64. Palisano, R. J., & Gowland (1993). Validity of goal attainment scaling in infants with motor delays. Physical Therapy, 10, 651-658.

65. Palisano, R. J., Haley, S. M., & Brown, D. A. (1992). Goal Attainment Scaling as a measure of change in infants with motor delays. Physical Therapy, 72, 432-437.

66. Rock, B. D. (1987). Goal and outcome in social work practice. Social Work, 32, 393-398. doi:10.1093/sw/32.5.393

67. Rockwood (1994). Setting goals in geriatric rehabilitation and measuring their attainment. Reviews in Clinical Gerontology, 4, 141-149. doi:10.1017/S0959259800003737

68. Rockwood, Stolee, Howard, & Mallery (1996). Use of Goal Attainment Scaling to measure treatment effects in an anti-dementia drug trial. Neuroepidemiology, 15, 330-338.

69. Rushton, P. W., & Miller, W. C. (2002). Goal attainment scaling in the rehabilitation of patients with lower-extremity amputations: A pilot study. Archives of Physical Medicine and Rehabilitation, 83, 771-775.

70. Sakzewski, Boyd, & Ziviani (2007). Clinimetric properties of participation measures for 5 to 13 year old children with cerebral palsy: A systematic review. Developmental Medicine & Child Neurology, 49, 232-240.

71. Schlosser, R. W. (2004). Goal attainment scaling as a clinical measurement technique in communication disorders: A critical review. Journal of Communication Disorders, 37, 217-239. doi:10.1016/j.jcomdis.2003.09.003

72. Schwarzer (1992). Self-efficacy in the adoption and maintenance of health behaviors: Theoretical approaches and a new model. In Self-efficacy: Thought control of action (pp. 217-243). Washington, DC, US: Hemisphere Publishing Corp.

73. Scobbie, Dixon, & Wyke (2011). Goal setting and action planning in the rehabilitation setting: Development of a theoretically informed practice framework. Clinical Rehabilitation, 25, 468-482. doi:10.1177/0269215510389198

74. Shankar, Miller, W. C., Roberson, N. D., & Hubley, A. M. (2019). Assessing patient motivation for treatment: A systematic review of available tools, their measurement properties and conceptual definition. Journal of Nursing Measurement, 27(2), 177-209. doi:10.1891/1061-3749.27.2.177

75. Shear, B. R., & Zumbo, B. D. (2014). What counts as evidence: A review of validity studies in Educational and Psychological Measurement. In Zumbo & E. K. Chan (Eds.), Validity and validation in social, behavioral, and health sciences (1st ed., pp. 91-111). Cham, Switzerland: Springer.

76. Shefler, Canetti, & Wiseman (2001). Psychometric properties of goal-attainment scaling in the assessment of Mann’s time-limited psychotherapy. Journal of Clinical Psychology, 57, 971-979.

77. Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83-117.

78. Sireci, S. G. (2012). “De-constructing” test validation. Paper presented at the Annual Conference of the National Council on Measurement in Education as part of the symposium “Beyond Consensus: The Changing Face of Validity”, April 14, 2012, Vancouver, Canada.

79. Sireci, S. G., & Sukin (2013). Test validity. In K. F. Geisinger (Ed.), Test Theory and Testing Assessment in Industrial and Organizational Psychology, Vol. 1. APA handbook of testing and assessment in psychology (pp. 61-84). Washington, DC, US: American Psychological Association.

80. Steenbeek, Ketelaar, Galama, & Gorter, J. W. (2007). Goal attainment scaling in paediatric rehabilitation: A critical review of the literature. Developmental Medicine & Child Neurology, 49, 550-556. doi:10.1111/j.1469-8749.2007.00550.x

81. Stolee, Awad, Byrne, Deforge, Clements, & Glenny (2012). A multi-site study of the feasibility and clinical utility of Goal Attainment Scaling in geriatric day hospitals. Disability and Rehabilitation, 34, 1716-1726. doi:10.3109/09638288.2012.660600

82. Stolee, Rockwood, Fox, R. A., & Streiner, D. L. (1992). The use of goal attainment scaling in a geriatric care setting. Journal of the American Geriatrics Society, 40, 574-578. doi:10.1111/j.1532-5415.1992.tb02105.x

83. Stolee, Stadnyk, Myers, A. M., & Rockwood (1999). An individualized approach to outcome measurement in geriatric rehabilitation. The Journals of Gerontology, 54, M641-M647. doi:10.1093/gerona/54.12.M641

84. Tenopyr, M. L. (1977). Content-construct confusion. Personnel Psychology, 30, 47-54.

85. Terwee, C. B., Dekker, F. W., Wiersinga, Prummel, M. F., & Bossuyt, P. M. (2003). On assessing responsiveness of health-related quality of life instruments: Guidelines for instrument evaluation. Quality of Life Research, 12, 349-362.

86. Thomas, D. R., & Zumbo, B. D. (2012). Difference scores from the point of view of reliability and repeated-measures ANOVA: In defense of difference scores for data analysis. Educational and Psychological Measurement, 72, 37-43. doi:10.1177/0013164411409929

87. Turner-Stokes, Fheodoroff, Jacinto, Maisonobe, & Zakine (2013). Upper limb international spasticity study: Rationale and protocol for a large, international, multicentre prospective cohort study investigating management and goal attainment following treatment with botulinum toxin A in real-life clinical practice. BMJ Open, 3, 1-12. doi:10.1136/bmjopen-2012-002230

88. Vogt, D. S., King, D. W., & King, L. A. (2004). Focus groups in psychological assessment: Enhancing content validity by consulting members of the target population. Psychological Assessment, 16, 231-243.

89. Law, A. V. (2012). Goal-attainment scaling: A review and applications to pharmacy practice. Research in Social & Administrative Pharmacy, 8, 102-121. doi:10.1016/j.sapharm.2011.01.003

90. Webb, T. L., & Sheeran (2006). Does changing behavioral intentions engender behavior change? A meta-analysis of the experimental evidence. Psychological Bulletin, 132, 249-268. doi:10.1037/0033-2909.132.2.249

91. Willer & Miller (1976). On the validity of goal attainment scaling as an outcome measure in mental health. Public Health Briefs, 66, 1197-1198.

92. Woodward, C. A., Santa-Barbara, Levin, & Epstein, N. B. (1978). The role of goal attainment scaling in evaluating family therapy outcome. American Journal of Orthopsychiatry, 48, 41-49.

93. Yip, A. M., Gorman, M. C., Stadnyk, Mills, W. G. M., Macpherson, K. M., & Rockwood (1998). A standardized menu for goal attainment scaling in the care of frail elders. The Gerontologist, 38, 735-742.

94. Zumbo, B. D. (2009). Validity as contextualized and pragmatic explanation, and its implications for validation practice. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions and applications (Vol. 48, pp. 65-82). Charlotte, NC: Information Age. doi:10.1111/j.1745-3984.2011.00155.x

95. Zumbo, B. D. (2017a). On models and modeling in measurement and validation studies. In B. D. Zumbo & A. M. Hubley (Eds.), Understanding and investigating response processes in validation (pp. 363-370). Cham, Switzerland: Springer.

96. Zumbo, B. D. (2017b). Trending away from routine procedures, towards an ecologically informed “in vivo” view of validation practices. Measurement: Interdisciplinary Research and Perspectives, 15, 137-139. doi:10.1080/15366367.2017.1404367

97. Zumbo, B. D., & Chan, E. K. H. (Eds.). (2014). Validity and validation in social, behavioral and health sciences. Cham, Switzerland: Springer. doi:10.1007/978-3-319-07794-9

98. Zumbo, B. D., & Hubley, A. M. (Eds.). (2017). Understanding and investigating response processes in validation. Cham, Switzerland: Springer.