Abstract
The current article presents the findings from a systematic review of the available reliability and validity evidence supporting the use of criterion-referenced assessments based on the applied behavior analysis framework. We identified 46 studies that reported reliability and/or validity evidence for six assessments, 37 of which presented reliability evidence and 43 of which presented validity evidence. Additionally, we extracted and summarized information related to participant characteristics (e.g., age, sex, diagnosis), geographic location, and research setting (e.g., residential facility, home). Overall, we found conflicting support for the use of the assessments. When coupled with the reported usage by behavior analysis professionals, our findings suggest a misalignment between the reportedly used assessments and the number of published studies providing validity and/or reliability evidence. We found inconsistent use of measurement-related vocabulary and that many studies could have been strengthened by conducting different statistical analyses. We provide a summary of the studies and findings and offer recommendations for clinical practice and future measurement research.
Behavior analysts use a variety of assessments within their practice to (1) assess an individual’s strengths and weaknesses (Gould et al., 2011), (2) identify the function of an individual’s challenging behavior (Iwata et al., 1982), (3) develop goals (Sundberg, 2014), and (4) monitor progress toward goal attainment and desired outcomes (Sundberg, 2014). The behavior analysts who create the most effective treatment programs are able to use assessments to accurately determine a client’s current behavior repertoire and pair this repertoire with a curriculum tailored to that client’s gaps across all areas of functioning (Gould et al., 2011). Several forms of assessment have been used in the field to obtain crucial information to inform the development of treatment plans for individuals being served. The majority of behavior analysts (73%) serve individuals with a diagnosis of autism spectrum disorder (ASD; Behavior Analyst Certification Board [BACB], n.d.). ASD is a condition that can affect several areas of a child’s development, such as cognitive, social, and adaptive skills. ASD affects one in 44 children in the United States, which is a 55% increase since 2010 and a 240% increase since 2000 (Centers for Disease Control and Prevention [CDC], 2018). In light of the rise in prevalence and given the breadth of development areas affected, assessment processes should be comprehensive and address all major areas of human functioning, such as social, motor, language, daily living, play, and academic skills (Gould et al., 2011). Researchers and practitioners must utilize practices and assessments that have strong evidence supporting their use for these purposes in order to develop intervention plans that effectively target an individual’s skill deficits.
Assessment serves different purposes depending on the problem areas identified during the screening process. Function-based behavioral assessments (e.g., experimental functional analyses) provide vital information used for addressing challenging behaviors (Cooper et al., 2020), and skill-based assessments aid in the identification of an individual’s current skill level and potential areas of growth. Regardless of purpose, assessments based within the framework of applied behavior analysis (ABA) heavily rely on direct observation of an individual’s behavior. Over the last 30 years, several criterion-referenced assessments have been developed within the ABA framework that specifically target skill acquisition for individuals diagnosed with ASD. Criterion-referenced assessments compare observed behavior or performance against an externally established set of interpretive criteria, which differs from norm-referenced assessments that produce scores that should be interpreted in comparison to a norm group (i.e., the population for which the assessment is intended).
Behavior assessments, like cognitive and psychological assessments, are forms of measurement that are concerned with the methods used to systematically assign numbers to individuals that represent quantities of attributes or to categorize individuals with specified characteristics (Allen & Yen, 1979; Ghiselli et al., 1981; Nunnally & Bernstein, 1994). Measurement allows researchers to quantify an attribute of interest, such as a skill, in a consistent and systematic way. The Standards for Educational and Psychological Testing (2014) developed by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) provide criteria for the development and evaluation of measurement quality in assessments and assessment practices for both criterion- and norm-referenced assessments. These criteria apply to the development, administration, scoring, and interpretation for both norm- (e.g., intelligence, psychological, etc.) and criterion-referenced (e.g., achievement, skill, behavioral) assessments. According to the guidelines, an assessment must have sufficient evidence regarding the validity and reliability of the measure prior to widespread usage. One important difference between behavior measurement and cognitive/psychological measurement is that behavior assessment is conducted via direct observation to record the presence or absence of a behavior or skill, frequency of a behavior or skill, or duration of exhibited behaviors. This characteristic of behavior assessment has important implications for the type of reliability and validity evidence to be collected as well as how the evidence is collected. Procedures used to provide evidence for the validity of a psychological assessment may or may not be appropriate for a behavior assessment depending on the theorized nature of the phenomenon being measured.
Behavioral phenomena (e.g., manding, tacting) and cognitive phenomena (e.g., mathematics achievement, problem-solving ability) are often theorized to exist differently because the former may be directly observed whereas the latter cannot; this distinction has important implications for the analytic procedure used to provide validity evidence. We discuss this notion in more detail in the discussion section after presenting our study. Nevertheless, behavior analysts conducting these assessments have a responsibility to the client as well as their profession to select assessments that have sufficient evidence regarding the assessment’s reliability and validity (American Educational Research Association [AERA], 2014) for a specific use. The current guiding ethics of ABA call for (1) reliance on scientific knowledge when making scientific or professional judgments in human service provision and (2) a client’s right to effective treatment (BACB, 2014), both of which are based on the use of sound evidence. Furthermore, the code of ethics effective January 2022 states that behavior analysts should select and design assessments that (1) are consistent with behavioral principles, (2) are based on scientific evidence, and (3) meet the diverse needs, contexts, and resources of the client and stakeholders (Ethics code 2.13; BACB, 2020).
Validity can be broadly defined as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (AERA et al., 2014, p. 11). The definition and nature of validity have been widely debated for decades, but validity is now generally accepted as a unified concept that is traditionally supported by multiple types of evidence, such as evidence related to content, criterion, and construct validity. It should be noted that validity has to do with the underlying rationale for our interpretations and uses of scores (Bandalos, 2018). Researchers must also determine which data they will accept as evidence supporting that rationale. For the articles reviewed in this study, the authors classified the types of validity evidence they provided as content-, criterion-, or construct-related. We give a brief overview of these types next.
Content-related validity refers to the degree to which a domain is represented by the set of items or prompts included on an assessment. For example, suppose the ratings on a set of prompts are intended to support decisions about a client’s ability to tact. These responses/ratings form the basis for generalizing one’s tacting behaviors to other settings at other times. Assessors and/or test users should consider the following questions: (1) What evidence exists that those specific prompts are representative of tacting? (2) Do those specific prompts support generalizing to other settings at other times? (3) How many prompts are needed to support accurate decisions? (4) Are the prompts aligned with the theoretical definition of tacting? (5) Whose opinions on these issues matter? The answers to these questions form the basis for content-related validity evidence (Sireci & Faulkner-Bond, 2014).
Criterion-related validity is supported when data collected from an assessment align in an expected way with data collected from some other, meaningful criterion that has strong validity evidence. The criterion could be another assessment that was developed to measure phenomena that are theoretically similar (or dissimilar) or a future performance or outcome. For example, a researcher who just developed a new assessment for assessing manding should consider administering the new assessment along with an existing manding assessment (i.e., criterion) to determine if the two assessments yield similar results. The expectations for the relationships are often expressed by the magnitude and sign of a correlation coefficient. Criterion-related validity can be categorized on the basis of (1) whether the expected relationship between the assessment and criterion is positive (convergent validity) or non-positive (discriminant validity) and (2) whether the criterion is observed at about the same time as the assessment (concurrent validity) or at some point in the future (predictive validity).
Construct validity refers to the extent to which inferences about constructs are supported by evidence and are aligned with the intended use of an assessment. Constructs are psychological attributes that characterize individuals’ behaviors in home, work, school, and/or social settings (Crocker & Algina, 1986). That is, constructs are hypothetical concepts used to develop theories explaining human behavior. Due to their theoretical nature, constructs must be operationally defined in order to “establish some rule of correspondence between the construct and the observable behaviors that are legitimate indicants of the construct” (Crocker & Algina, 1986, p. 4). Following operational definitions, researchers must then choose (1) a set of tasks that will produce the behaviors of interest and (2) the conditions under which an individual can demonstrate optimal or typical behavior. ABA relies on direct observation of behavior, and in most cases the behaviors observed are taken as evidence of one’s present level of functioning in some domain. The domain in these instances can be considered a construct.
A second important concept related to measurement quality is reliability, which refers to the extent to which an assessment’s results can be interpreted in a consistent manner across different settings and situations (Johnson & Morgan, 2016). Reliability is a necessary prerequisite for establishing validity evidence because strong reliability evidence is associated with more confidence that scores obtained would be consistent had they been collected in different settings, at different times, or by different assessors. Consistency of information/results provided by an assessment can refer to consistency across time (e.g., test-retest), administrations of the assessment (e.g., equivalence), items/tasks within the assessment (e.g., internal consistency), and/or the rater/observer/scorer (e.g., interrater reliability/interobserver agreement). Evidence supporting an assessment’s reliability can and should be provided as a part of the validation process. The methods and types of evidence supporting an assessment’s validity and reliability are generally the same for both criterion- and norm-referenced assessments.
ABA Assessments Reportedly Used in Practice
Padilla (2020) surveyed nearly 1,500 behavior analysis professionals to collect information related to their use of assessments for treatment goals and curriculum planning, including which assessments they use, why those assessments were selected, and what types of training they have received to administer the assessments with fidelity. The VB-MAPP was the most widely used assessment for determining target goals for their clients, with 76% (n = 1,086) of respondents reporting using VB-MAPP by itself or in conjunction with another assessment. Other assessments that were reported to be widely used included Assessment of Basic Language Learning Skills-Revised (ABLLS-R; n = 638, 45%), Promoting the Emergence of Advanced Knowledge Relational Training System (PEAK; n = 197, 14%) and Vineland Adaptive Behavior Scales (n = 485, 34%). Additionally, 115 different assessments or forms of assessment were reported by respondents with the two most common being the Assessment of Functional Living Skills (AFLS; n = 172, 12%) and Essential for Living (EFL; n = 72, 5%). When asked why they selected these assessments, professionals reported their decision was based on the efficacy of use reported in available research literature. Although these professionals value the validity and/or reliability research regarding these assessments, limited research exists on these assessments (Padilla, 2020). Additional research is needed to identify and evaluate the amount and strength of available validity and reliability evidence for ABA-based assessments, including those widely used by ABA professionals.
Ackley et al. (2019) presented evidence for the use of ABA-based assessments designed for language development in individuals with ASD. Following an extensive search for these assessment materials, they identified 18 protocols. An additional search was completed to explore the academic literature to identify empirical articles supporting the reliability, validity, and effectiveness of the identified protocols. Their search revealed empirical support for the use of only four of the 18 protocols—ABLLS-R, PEAK, SKILLS®, and VB-MAPP. Though these authors conducted a thorough search regarding all protocols used with individuals who have ASD, ABA services are also frequently provided to individuals who do not have a diagnosis of ASD. Further, these researchers included non-standardized protocols in their review. Ackley et al. (2019) offer an important first step in understanding the reliability and validity evidence supporting the use of ABA-based assessments designed for language development. However, their study does not address the wide use of these instruments and it was limited to assessments related to language development. Additional information is needed to determine if an assessment is appropriate for a client, such as the characteristics and representativeness of the sample used for developing the assessment and conducting validity or reliability studies (e.g., disability or diagnosis, age, ethnicity, etc.).
Purpose of this Study
Clearly, the availability of validity and reliability evidence is a critical consideration when choosing an assessment for use as the basis for decision making and progress monitoring. There is also limited evidence for the validity and reliability of ABA assessments. The purpose of this paper is to present the evidence available in the academic literature for the reliability and validity of criterion-referenced assessments developed using the applied behavior analysis framework. Therefore, the guiding research question was:
(1) What validity or reliability evidence has been provided supporting the use of ABA-based criterion-referenced assessments used for skill building programs and curriculum development in ABA?
Methods
Search Procedures
Primary search
The following databases were utilized during the systematic search: Academic Search Complete, APA PsycARTICLES, APA PsycInfo, APA PsycTESTS, Education Research Complete, ERIC, MEDLINE, and Psychology and Behavioral Sciences Collection. The search was conducted using the keywords “reliability,” “validity,” “factor analysis,” “correlation,” “item response theory,” “structural equation modeling,” or “regression.” These terms were paired with the key terms “verbal behavior,” “applied behavior analysis,” “ABA,” “verbal behavior milestones assessment and placement program,” “VB-MAPP,” “assessment of basic language and learning skills,” “ABLLS-R,” “promoting the emergence of advanced knowledge,” “essential* for living,” “assessment of functional living skills,” or “AFLS,” using Boolean operators and truncation. The acronym “PEAK” was removed from the search terms because another assessment with the same acronym is used in the medical field, and including the acronym initially returned over 36,000 articles. The acronym “EFL” was removed for similar reasons. The search included studies available through October 2020.
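The pairing of measurement keywords with assessment keywords can be sketched as a single Boolean query. The Python snippet below is an illustration only: the exact query syntax, field restrictions, and database-specific settings used in the search are not reported here, so the helper names (or_group, build_query) and the quoting style are assumptions.

```python
# Illustrative sketch: combine two keyword groups into one Boolean query.
# Helper names and quoting style are assumptions, not the authors' script.

MEASUREMENT_TERMS = [
    "reliability", "validity", "factor analysis", "correlation",
    "item response theory", "structural equation modeling", "regression",
]

ASSESSMENT_TERMS = [
    "verbal behavior", "applied behavior analysis", "ABA",
    "verbal behavior milestones assessment and placement program", "VB-MAPP",
    "assessment of basic language and learning skills", "ABLLS-R",
    "promoting the emergence of advanced knowledge", "essential* for living",
    "assessment of functional living skills", "AFLS",
]


def or_group(terms):
    """Quote each term and join the group with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(measurement_terms, assessment_terms):
    """Require at least one term from each keyword group."""
    return or_group(measurement_terms) + " AND " + or_group(assessment_terms)


query = build_query(MEASUREMENT_TERMS, ASSESSMENT_TERMS)
print(query)
```

Keeping the two keyword groups as separate lists makes it straightforward to rerun the same measurement terms against a different set of assessment names.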
Secondary search
This initial search produced additional assessments not included in Padilla’s (2020) survey results. Thus, the authors conducted a second search that included assessment specific search terms for the additional assessments to ensure a more comprehensive search. For the secondary search, the original first key search terms (i.e., “reliability,” “validity,” “factor analysis,” “correlation,” “item response theory,” “structural equation modeling,” or “regression”) were paired with the terms: “assessment of basic learning abilities,” “ABLA,” “training and assessment of relational precursors and abilities,” “TARPA,” “verbal behavior assessment scale,” or “VerBAS.”
Inclusion and exclusion criteria
The primary and secondary searches were restricted to studies that were peer-reviewed, published in English, and included human participants. Additionally, each study must have included an assessment that was based in ABA or that evaluated participants’ present level of functioning regarding their verbal repertoire (e.g., verbal operants, verbal behavior). The study must also have assessed the validity or reliability of the assessment being utilized by reporting the validity (e.g., criterion, construct, content) or reliability (e.g., inter-rater, internal consistency, test-retest) evidence of the assessment. Studies that were not peer-reviewed (e.g., dissertations) or that utilized assessments for clinical diagnoses were excluded.
A total of 2,043 articles were identified in the search. After duplicate articles were removed, 1,879 articles remained. The titles and abstracts of the resulting 1,879 articles identified in the initial search were reviewed for inclusion and 18 met the criteria. A full text search was conducted on the 18 articles and 17 met the inclusion criteria for the review. Once the full text search was completed, an ancillary search of the reference lists in the included studies was conducted to identify additional articles for possible inclusion. The ancillary search resulted in the identification of 14 additional articles to be included in the review. PRISMA, or Preferred Reporting Items for Systematic Reviews and Meta-Analyses, is a reporting statement that includes a checklist and flowchart (Moher et al., 2009; Page et al., 2021) for documenting the steps of systematic literature reviews and meta-analyses. A PRISMA flowchart of the article elimination process for the entire search is shown in Figure 1. A secondary search was conducted with the measurement keywords combined with the new assessments (e.g., TARPA) identified from the primary search that were not a part of its search keywords. The secondary search resulted in the identification of 13 additional articles that met our inclusion criteria. A PRISMA flowchart of the article elimination process for the secondary search is shown in Figure 2. The primary, ancillary, and secondary searches combined resulted in 44 articles with a total of 46 studies to be included for review.

Figure 1. Article elimination process during the primary systematic literature search.

Figure 2. Summary of the secondary search process.
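The article counts reported for the primary, ancillary, and secondary searches can be checked with simple arithmetic. The following Python sketch uses only the counts stated in the text; the variable names are ours, and the derived exclusion counts are differences between reported totals rather than figures taken from the flowcharts.

```python
# Tally of the article counts reported for the three searches.
identified = 2043                  # articles returned by the primary search
after_duplicates_removed = 1879    # articles remaining after de-duplication
met_title_abstract_criteria = 18   # retained after title and abstract review
met_full_text_criteria = 17        # retained after full-text review
ancillary_additions = 14           # added from reference lists
secondary_additions = 13           # added from the secondary search

# Derived quantities (differences between the reported totals).
duplicates_removed = identified - after_duplicates_removed
excluded_at_screening = after_duplicates_removed - met_title_abstract_criteria
excluded_at_full_text = met_title_abstract_criteria - met_full_text_criteria

total_articles = (
    met_full_text_criteria + ancillary_additions + secondary_additions
)

print(f"Duplicates removed: {duplicates_removed}")
print(f"Excluded at title/abstract screening: {excluded_at_screening}")
print(f"Excluded at full-text review: {excluded_at_full_text}")
print(f"Total articles included: {total_articles}")
```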
Data Extraction and Analysis
After the systematic search was completed, each included article was examined to extract data related to each of the following categories: (a) assessment-specific characteristics, (b) participant characteristics, and (c) general study information. For assessment-specific characteristics, data were extracted on the specific assessment used (e.g., VB-MAPP, ABLLS-R), reliability estimates, validity evidence, and the statistical analyses utilized. For participant characteristics, data were extracted on the number of participants, age, sex, race, ethnicity, IQ reporting, diagnoses, and any other measures collected (e.g., verbal repertoire, physical abilities). For general study information, data were extracted on the geographic location of the study, assessor, and the setting in which the assessment was used. Each of these study details plays an important role in understanding the context(s) in which the data were collected and has implications for generalizability of findings to other settings.
The extracted data were analyzed and reported globally to summarize the total numbers and percentages across all included articles. That is, data were summarized for participant characteristics (e.g., age, sex), geographic location, assessor, and assessment setting across all studies (n = 46). For example, across the 46 studies, 1,502 participants were identified. Of the 1,502 participants, 60% (n = 897) were male, 28% were female (n = 414), and 12% (n = 191) were not reported. This percentage was calculated based on number of male participants across all 46 studies (n = 897) divided by the total number of participants (n = 1,502) multiplied by 100. Additionally, the data were analyzed for each individual assessment. In other words, the data from all articles related to an individual assessment type (e.g., all articles evaluating the reliability or validity of the PEAK) were analyzed and reported in terms of the total number and percentages. For example, across the 11 studies evaluating reliability or validity evidence for the PEAK, a total of 470 participants were identified, which accounted for 31% (470 ÷ 1,502) of the total participants across all 46 studies. Of the 470 participants identified in PEAK studies, 84% of the participants were male (n = 397) and 16% were female (n = 73). This percentage was calculated based on the number of male participants across all 11 PEAK studies (n = 397) divided by the total number of participants (n = 470) in the PEAK studies multiplied by 100.
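The percentage calculations described above can be reproduced in a few lines. This Python sketch uses the participant counts reported in the text; the percent helper and the choice to round to the nearest whole number are assumptions made for illustration.

```python
def percent(part, total):
    """Return part / total as a percentage rounded to the nearest integer."""
    return round(part / total * 100)


TOTAL_PARTICIPANTS = 1502   # all 46 studies
TOTAL_MALE = 897
TOTAL_FEMALE = 414

PEAK_PARTICIPANTS = 470     # the 11 PEAK studies
PEAK_MALE = 397
PEAK_FEMALE = 73

overall_male_pct = percent(TOTAL_MALE, TOTAL_PARTICIPANTS)
overall_female_pct = percent(TOTAL_FEMALE, TOTAL_PARTICIPANTS)
peak_share_pct = percent(PEAK_PARTICIPANTS, TOTAL_PARTICIPANTS)
peak_male_pct = percent(PEAK_MALE, PEAK_PARTICIPANTS)
peak_female_pct = percent(PEAK_FEMALE, PEAK_PARTICIPANTS)

print(overall_male_pct, overall_female_pct)
print(peak_share_pct, peak_male_pct, peak_female_pct)
```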
Inter-rater Agreement
Search reliability
Two authors independently conducted the electronic database search, abstract and title search, and full text search to assess the reliability of the search terms and inclusion criteria. After independently applying the inclusion and exclusion criteria to the resulting articles, inter-rater agreement was calculated by dividing the total number of agreements by agreements plus disagreements and multiplying by 100%. Inter-rater agreement for the search was 99%. Discussion among all authors was used to resolve the resulting discrepancies.
Data extraction reliability
Two authors independently summarized 30% (n = 14) of the studies included in the review to assess reliability of data extraction. Each study was summarized based on 18 items related to the assessment, participant characteristics, and general study information. Inter-rater agreement on data extraction was calculated by dividing the total number of agreements by the number of agreements plus disagreements and multiplying by 100%. Mean inter-rater agreement was 89%. Given the complexity of the studies, the researchers conducted consensus coding (Johnson et al., 2000) with a psychometrician until inter-rater agreement reached 100%.
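The percent agreement calculation used for both the search and the data extraction can be sketched as follows. The function name is ours, and the example counts (16 agreements and 2 disagreements across 18 items for one study) are hypothetical values chosen for illustration; the underlying agreement counts are not reported here.

```python
def percent_agreement(agreements, disagreements):
    """Agreements divided by (agreements + disagreements), times 100."""
    total = agreements + disagreements
    if total == 0:
        raise ValueError("At least one rated item is required.")
    return agreements / total * 100


# Hypothetical example: two raters agree on 16 of 18 extraction items.
example = percent_agreement(agreements=16, disagreements=2)
print(f"{example:.1f}%")
```

Percent agreement of this kind does not correct for agreement expected by chance, which is one reason it is often supplemented by consensus coding or a chance-corrected index.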
Assessments Identified
Six assessments were identified in the literature that had reliability and/or validity studies. The assessments are described below.
Assessment of basic learning abilities (ABLA)
The ABLA test, formerly called the AVC test of discrimination (Kerr et al., 1977), was developed in order to assess task acquisition skills for individuals with intellectual disability (Vause et al., 2003). The AVC was first developed by Nancy Kerr and Lee Meyerson and was later revised into the ABLA test by Kerr et al. (1977). It is now in its third rendition and is currently called the Assessment of Basic Learning Abilities-Revised (ABLA-R). The assessment is intended to assess an individual’s performance on six hierarchical levels of discrimination tasks: Level 1 (Imitation), Level 2 (Position Discrimination), Level 3 (Visual Discrimination), Level 4 (Visual Identity Match-to-Sample Discrimination), Level 5 (Visual Non-Identity Match-to-Sample Discrimination), and Level 6 (Auditory-Visual Combined Discrimination) (Awadalla et al., 2014). Casey and Kerr (1977) indicated that typically developing children should ideally pass all levels by the age of three. The ABLA can be used to match the learning ability to the difficulty of training tasks for children and adults with ASD or developmental disabilities (Martin et al., 2008). An individual’s ability to complete these discrimination tasks is foundational to learning higher-level skills related to academic and adaptive behaviors.
This assessment takes approximately 30 minutes to administer (Kerr et al., 1977). The test uses standardized prompting and reinforcement procedures for each level in an attempt to teach a student the discrimination tasks (Martin et al., 2008). Each task begins with the examiner demonstrating the correct response, physically guiding the examinee in providing the correct response, and then providing the examinee the opportunity to perform the response independently. Scoring begins after this initial trial and reinforcement (e.g., praise, edibles) is provided for correct responses. Incorrect responses are followed by error correction procedures (Barker-Collo et al., 1995), which include demonstration, a guided practice trial, and an opportunity for the examinee to respond independently (Martin et al., 2008). Trials for each task level continue until eight consecutive correct responses (passing criterion) or eight cumulative errors (failing criterion) occur (Barker-Collo et al., 1995).
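The pass/fail criteria for a single ABLA level can be expressed as a short stopping rule. The Python sketch below is a minimal illustration under stated assumptions: the function name and the "incomplete" outcome are ours, each scored trial is represented as a boolean (True for a correct response), and the prompting and error correction procedures are not modeled.

```python
def abla_level_outcome(responses, criterion=8):
    """Apply the stopping rule to scored trials (True = correct response).

    Returns "pass" after `criterion` consecutive correct responses,
    "fail" after `criterion` cumulative errors, and "incomplete" if the
    sequence ends before either criterion is met.
    """
    consecutive_correct = 0
    cumulative_errors = 0
    for correct in responses:
        if correct:
            consecutive_correct += 1
            if consecutive_correct == criterion:
                return "pass"
        else:
            consecutive_correct = 0   # an error resets the correct streak
            cumulative_errors += 1    # errors accumulate across the level
            if cumulative_errors == criterion:
                return "fail"
    return "incomplete"


print(abla_level_outcome([True] * 7 + [False] + [True] * 8))
print(abla_level_outcome([True, False] * 8))
```

Note the asymmetry in the rule: correct responses must be consecutive, whereas errors need only be cumulative.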
Assessment of basic language and learning skills-revised (ABLLS-R)
The ABLLS-R was developed by Partington in 2006 (Partington, 2010) and is based on the analysis of verbal behavior (Skinner, 1957). The ABLLS-R assesses 25 skill areas that address 544 skills from domains such as language, social interaction, self-help, academic, and motor abilities. This assessment is used primarily for children with autism spectrum disorders or other developmental disabilities. The ABLLS-R was developed to help identify the skills that are needed in a child’s repertoire, specifically those related to communication. The results are intended to inform the development of a comprehensive curriculum focused on the development of language. The skills addressed in the ABLLS-R are those that most typically developing individuals acquire prior to entering a formal education setting (e.g., kindergarten; Partington et al., 2018).
The ABLLS-R comprises two documents: the Scoring Instructions and IEP Development Guide, and the Protocol. The Guide includes information about the assessment such as its features, scoring guidelines, and goal development. The Protocol is used to score the individual’s performance on the tasks and contains a set of scoring grids for each domain assessed that can be used to track the individual’s progress.
Promoting the emergence of advanced knowledge (PEAK)
The PEAK Relational Training System (Dixon, 2014) was originally developed in 2008 and is based on Relational Frame Theory (Hayes et al., 2001). The PEAK assesses language and cognitive deficits experienced by individuals from special populations (Dixon et al., 2016). It consists of four learning modules, each of which contains 184 programs that are hierarchically ordered by complexity. The four modules are Direct Training (PEAK-DT), Generalization (PEAK-G), Equivalence (PEAK-E), and Transformation (PEAK-T). Altogether, the modules focus on verbal relations ranging from simple vocalizations to complex language using different methodologies including discrete-trial training, stimulus equivalence, and Relational Frame Theory (Dixon, Whiting et al., 2014). Assessments can be conducted via direct and indirect measures in numerous settings resulting in more comprehensive scores across environments (Dixon, 2014). All scores are reported in a pyramidal fashion with color codes to visually represent the current skill level of the individual. Each of the 184 programs for all PEAK modules has curricula that target specific skills for the individual.
Verbal behavior assessment scale (VerBAS)
The VerBAS is a 15-item questionnaire that assesses the three elementary verbal operants defined by Skinner in Verbal Behavior (i.e., mand, tact, and echoic; Skinner, 1957). The VerBAS was created by P. C. Duker in 1999 as a tool to assess communicative functions of individuals with intellectual and developmental disabilities across all ages. The results can be used to identify specific needs related to communication problems and assess the effects of subsequent training procedures for this population.
The VerBAS can be administered as an interview or be completed by a rater as a questionnaire. The interviewee or rater must have known the individual being assessed for a minimum of 6 months. The rater responds to each item by providing one of the following response alternatives: (a) never, indicated by 0; (b) sometimes, indicated by 1; (c) half of the time, indicated by 2; (d) often, indicated by 3; and (e) always, indicated by 4. The maximum score for each subscale is 20 and for the overall scale is 60. During administration of the assessment, the rater cannot observe or interact with the individual in question. The time required to administer the VerBAS is approximately 5 minutes (Didden et al., 2004).
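The VerBAS scoring scheme described above can be sketched in code. With 15 items rated from 0 to 4 and a maximum of 20 per subscale, each of the three subscales (mand, tact, echoic) must contain five items. The Python sketch below rests on one loud assumption: items are grouped into contiguous blocks of five per subscale, which is a hypothetical ordering used only for illustration and may not match the published questionnaire. The function name is also ours.

```python
SUBSCALES = ("mand", "tact", "echoic")
ITEMS_PER_SUBSCALE = 5   # implied by a subscale maximum of 20 with 0-4 ratings
VALID_RATINGS = range(5)  # 0 = never ... 4 = always


def score_verbas(ratings):
    """Sum 15 item ratings into three subscale scores and a total.

    ASSUMPTION: items 1-5, 6-10, and 11-15 map to mand, tact, and echoic.
    This ordering is hypothetical and used only for illustration.
    """
    expected = len(SUBSCALES) * ITEMS_PER_SUBSCALE
    if len(ratings) != expected:
        raise ValueError(f"Expected {expected} ratings, got {len(ratings)}.")
    if any(r not in VALID_RATINGS for r in ratings):
        raise ValueError("Each rating must be an integer from 0 to 4.")

    scores = {}
    for index, name in enumerate(SUBSCALES):
        start = index * ITEMS_PER_SUBSCALE
        scores[name] = sum(ratings[start:start + ITEMS_PER_SUBSCALE])
    scores["total"] = sum(ratings)
    return scores


print(score_verbas([4] * 15))
```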
Training and assessment of relational precursors and abilities (TARPA)
The TARPA is a computer-based protocol that was developed and piloted in 2010 by Moran and colleagues (Moran et al., 2010). It is used to assess key forms of responding that contribute to the development of generative verbal behavior in children diagnosed with autism spectrum disorder. The TARPA is based on Relational Frame Theory (Hayes et al., 2001). Specifically, the TARPA assesses participants’ (a) basic discriminations, (b) non-arbitrary conditional discriminations, (c) arbitrary conditional discriminations, (d) mutually entailed relational responding (e.g., deriving a symmetrical relation B → A from a trained relation A → B), (e) combinatorial entailed relational responding (e.g., deriving combinatorial relations A → C and C → A when trained on A → B and B → C), and (f) transfer of function (responding to stimuli in different appropriate ways based on their participation in a derived relation). Each of these stages is subdivided into multiple levels. Additionally, each level in Stages 4, 5, and 6 is further subdivided into training sections and derivation sections. The TARPA is administered on a computer with the child and administrator present. Each level of the assessment begins with a training element. During each training element, the administrator provides a demonstration trial in which they show the child how to respond correctly. It was developed to assess response generalization and novel responding specifically for children with autism (Moran et al., 2014).
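The difference between trained and derived relations in stages (d) and (e) can be made concrete with a toy example. The Python sketch below is not part of the TARPA software and makes simplifying assumptions: relations are represented as ordered pairs of stimulus labels, the function name is ours, and only one step of combination is derived.

```python
def derive_relations(trained):
    """Return (mutual, combinatorial) relations derived from trained pairs.

    A pair (a, b) stands for the trained relation a -> b.
    """
    trained = set(trained)

    # Mutual entailment: from a -> b, derive b -> a.
    mutual = {(b, a) for (a, b) in trained} - trained

    # Combinatorial entailment: from a -> b and b -> c, derive a -> c and c -> a.
    combinatorial = set()
    for (a, b) in trained:
        for (c, d) in trained:
            if b == c and a != d:
                combinatorial.add((a, d))
                combinatorial.add((d, a))
    combinatorial -= trained | mutual

    return mutual, combinatorial


mutual, combinatorial = derive_relations([("A", "B"), ("B", "C")])
print(sorted(mutual))
print(sorted(combinatorial))
```

In this representation, a derivation section of the assessment would probe the pairs in the returned sets, none of which were directly trained.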
Verbal behavior milestones assessment and placement program (VB-MAPP)
The VB-MAPP is based on the analysis of verbal behavior and the science of applied behavior analysis. The VB-MAPP includes five components. The first is the VB-MAPP Milestones Assessment, which is designed to provide a representative sample of a child’s existing verbal and related skills across three development age levels (0–18, 18–30, and 30–48 months). The VB-MAPP Milestones Assessment also encompasses the Early Echoic Skills Assessment (EESA) which measures an examinee’s ability to vocally imitate a variety of sounds. The overall EESA score contributes to echoic items in the VB-MAPP Milestones Assessment. Remaining components include the (1) VB-MAPP Barriers Assessment that considers both common learning and language acquisition barriers, (2) VB-MAPP Transition Assessment that assists in decisions regarding placement in a less restrictive educational environment, (3) VB-MAPP Task Analysis and Supporting Skills that provides a further breakdown of skills within the Milestones Assessment, and (4) VB-MAPP Placement and IEP Goals that aids in an all-inclusive intervention plan (Sundberg, 2014). The VB-MAPP is frequently used in clinical settings as a means of assessing new clients’ abilities, as well as a tool for monitoring developmental progress during the therapeutic process (Gould et al., 2011; Sundberg, 2014).
Results
As noted previously, each included article was examined to extract data related to each of the following categories: (a) assessment-specific characteristics, (b) participant characteristics, and (c) general study information.
Overall Findings
Forty-four articles were found, and two of the articles presented multiple studies within the same article. The multiple studies were considered separately here because they included different samples and analyses. Therefore, the total number of measurement-related studies returned using the search procedures described above was 46. These 46 studies presented reliability and/or validity evidence across six assessments: ABLLS-R (n = 2), PEAK (n = 11), TARPA (n = 5), ABLA (n = 24), VerBAS (n = 2) and VB-MAPP (n = 2).
Measurement studies
Thirty-seven studies (80%) presented reliability evidence supporting the use of the assessment; 32 studies presented evidence for one type of reliability, and five reported evidence for multiple types of reliability. Of the 37 studies, 32 (86%) reported inter-rater reliability, five (14%) reported internal consistency, and five (14%) reported test-retest. Fifteen of the studies that reported inter-rater reliability (54%) also reported an additional form of reliability more frequently used in the field of ABA called procedural reliability/integrity. In terms of the statistical analyses used as evidence for the assessments’ reliability, different procedures were used depending on the type of reliability index. For those studies that reported inter-rater reliability (n = 32), the most commonly reported estimate was percent agreement (n = 31, 97%) followed by the kappa index (n = 1, 3%), and intraclass correlation coefficient (ICC; n = 1, 3%). For the five studies that reported internal consistency, each used coefficient alpha (Cronbach, 1951) as the estimate. Of the five studies that reported test-retest reliability evidence, two (40%) used a correlation coefficient and three (60%) used ICC.
Of the 46 studies identified, 43 presented evidence for an assessment’s validity.
Three of the validity studies reported content-, 36 reported criterion-, and eight reported construct-related validity evidence. In the three studies reporting evidence for content-related validity, one (i.e., Usry et al., 2018) used an expert panel to rate how “essential” each skill was on the assessment and reported the content validity ratio (CVR; Lawshe, 1975), and the other two studies (i.e., Rowsey et al., 2015, 2017) used principal component analysis to evaluate content-related validity evidence. For those reporting criterion-related validity evidence, the criteria used for comparisons were assessments, subscales, or tasks evaluating verbal repertoire (n = 21, 58%), intelligence tests (n = 4, 11%), adaptive measures (n = 7, 19%), and achievement/academic related tasks (n = 5, 14%). Twelve studies (33%) used discrimination tasks, social skills assessments, or diagnostic classification as the external criterion. In some cases, multiple assessments were used for comparisons. For example, the Gilliam Autism Rating Scale (GARS), VB-MAPP, and ABLLS-R were used in one study each as the external criterion for validity evidence. As for the statistical analyses used for criterion-related validity evidence, 22 (61%) estimated correlation coefficients (e.g., Pearson, Spearman, Phi), five (14%) used a type of regression analysis (e.g., linear, quadratic, cubic, logistic), 15 (43%) used descriptive statistics (e.g., percentage, patterns of performance on assessment tasks), five (14%) used a t test, and one (3%) employed Fisher’s exact test. With respect to construct-validity evidence, three (38%) reported using principal components analysis, one (13%) used patterns of performance, and four (50%) used order analysis.
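For readers unfamiliar with the content validity ratio mentioned above, the following is a minimal sketch of how it is commonly computed under Lawshe's (1975) formulation, CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating an item "essential" and N is the panel size. The function name and the panel counts are hypothetical illustrations and are not taken from Usry et al. (2018).

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe-style CVR for one item.

    n_essential: number of panelists who rated the item "essential"
    n_panelists: total number of panelists who rated the item
    Returns a value between -1 (no one rates it essential) and
    +1 (everyone rates it essential); 0 means exactly half did.
    """
    if n_panelists <= 0:
        raise ValueError("n_panelists must be positive")
    if not 0 <= n_essential <= n_panelists:
        raise ValueError("n_essential must be between 0 and n_panelists")
    half = n_panelists / 2
    return (n_essential - half) / half


# Hypothetical panel of 10 experts rating three items.
for item, n_essential in [("item A", 9), ("item B", 5), ("item C", 0)]:
    print(item, content_validity_ratio(n_essential, 10))
```

An item endorsed as essential by 9 of 10 hypothetical panelists yields a CVR of 0.8, whereas an item endorsed by exactly half of the panel yields 0. Whether a given CVR is high enough to retain an item depends on the panel size and the critical values adopted by the researcher.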
Participant demographics
A total of 1,502 participants were represented across all studies where sample size was reported. The average sample size was approximately 33 (SD = 39) and ranged from 5 to 226. The sex of the participants was reported in 38 studies. Overall, about 60% (n = 897) of the participants were male, 28% were female (n = 414), and 12% (n = 191) were unknown because the sex of the participants was not reported. The percentage of male participants ranged from 29% to 100% with an average percentage of 67% (SD = 17%). Participants ranged in age from 6 months to 66 years across all studies. Twenty of the studies included participants with ASD (n = 456, 30%), 18 included participants with intellectual disability (n = 399, 27%), 14 included participants with other developmental disabilities (n = 372, 25%), two included typically developing participants (n = 63, 4%), six studies included those classified as “other” (n = 201, 14%), and one study did not include enough information to discern disability category. Only three studies (7%) provided the IQ scores of their participants, but 18 studies (39%) reported the level of intellectual and/or developmental functioning (e.g., severe, profound) for participants. Only two of the 46 studies (4%) reported the race/ethnicity of the participants.
ABLA
Measurement studies
A summary of ABLA articles is provided in Table 1. Twenty-four studies presented measurement properties of the ABLA. Through the extensive literature search, articles were included that addressed reliability or validity of the ABLA as well as the AVC, which was the first iteration of the ABLA.
Summary of ABLA Measurement Studies.
Note. ASD = autism spectrum disorder; DD = developmental disability; ID = intellectual disability; Other = other type of disability; nM = number of males; nF = number of females.
Twenty-two studies (92%) presented reliability evidence supporting the use of the assessment. One study (5%) presented test-retest reliability evidence in addition to inter-rater reliability, and 15 studies (68%) reported an additional form of reliability referred to as procedural reliability and/or procedural integrity. For those studies that reported inter-rater reliability and/or procedural reliability/integrity, all reported percent agreement (n = 22, 100%) as the reliability estimate.
Of the 24 studies identified, all presented evidence of some form of validity supporting the use of ABLA. Of these validity studies, 20 (83%) reported criterion-, three (13%) reported construct-, and one study (5%) presented criterion- and construct-related validity evidence. For the 21 studies reporting criterion-related validity evidence, the criteria used for comparisons were adaptive measures (n = 2, 10%), tasks or subscales evaluating verbal repertoire (n = 10, 48%), motor skills (n = 1, 5%), academic or classroom tasks (n = 2, 10%) and other criteria (n = 9, 43%) such as intelligence tests or discrimination tasks. A wide variety of statistical analyses were employed for criterion-related validity, depending on the types of data collected. With respect to studies providing construct-validity evidence, all reported using order analysis.
Participant demographics
Across the 24 studies, a total of 523 participants were represented. The average sample size was approximately 31 (SD = 13) and ranged from 9 to 54. The sex of the participants was reported in 17 of the 24 (71%) studies. These studies represent 332 participants, which accounts for 63% of the total participants. Fifty-eight percent of the participants were male (n = 191) and 42% were female (n = 141). The largest disability category represented was intellectual disability (n = 271, 52%) followed by other developmental disability (n = 174, 33%) and ASD (n = 32, 6%).
Study components
Regarding geographic location, 18 of the 24 (75%) studies were conducted in Canada, most reportedly in Manitoba. The locations for six (25%) of the studies were either not discernible from the information provided or not provided. With respect to the settings of the 24 included studies, the majority were conducted at a residential facility (n = 20, 83%). The most common assessor was an experimenter/researcher (n = 20, 83%) followed by other (e.g., direct staff, care worker; n = 5, 21%), research assistant (n = 5, 21%), and graduate student (n = 1, 4%).
ABLLS-R
Measurement studies
A summary of ABLLS-R articles is provided in Table 2. Two studies included an evaluation of the measurement properties of the ABLLS-R. Both studies evaluating the ABLLS-R (n = 2, 100%) presented reliability evidence; one study (50%) reported inter-rater reliability and the second study (50%) reported both an internal consistency index and the test-retest reliability index. One of the two studies reported evidence for the assessment’s validity, which was content-related validity (i.e., Usry et al., 2018).
Summary of ABLLS-R Measurement Studies.
Note. NT = neurotypical/typically developing; nM = number of males; nF = number of females; NW = number of White participants; NH = number of Hispanic participants.
Participant demographics
The sample in one of the studies evaluating the ABLLS-R comprised “experts,” that is, individuals who are not members of the populations the assessment is designed to evaluate (e.g., individuals with developmental delays or disabilities). In the single study conducted with members of the target population, the reported sample size was 50. Forty-two percent (n = 21) of the participants were male and 58% were female (n = 29). The authors reported that all participants were typically developing.
Study components
Because of the nature of the study including an expert panel with members from various regions of the country, one of the two ABLLS-R studies reported multiple study locations. This study contained multiple types of assessors that included professionals from education, school psychology, occupational therapy, behavior analysis, and speech language pathology. The second ABLLS-R study did not report a specific geographic region and was conducted in a clinical setting by educators or other professionals who knew the study participants.
PEAK
Measurement studies
A summary of PEAK articles is provided in Table 3. Eleven studies were identified that evaluated the measurement properties of the PEAK. Seven studies (64%) presented reliability evidence, with four (57%) reporting inter-rater reliability alone, one reporting inter-rater reliability and internal consistency, and one reporting inter-rater reliability and test-retest reliability. In terms of the statistical analyses used as evidence for the assessment’s reliability, different procedures were used depending on the type of reliability index. Of studies that evaluated inter-rater reliability, five (71%) reported using percent agreement and one (14%) used kappa.
Summary of PEAK Measurement Studies.
Note. ASD = autism spectrum disorder; DD = developmental disability; ID = intellectual disability; Other = other type of disability; nM = number of males; nF = number of females.
Out of the 11 studies evaluating the PEAK, 10 presented evidence for some form of validity. Of these 10 studies, eight (80%) reported criterion-related validity and two (20%) reported construct- along with content-related validity.
Participant demographics
A total of 470 participants were represented across the 11 PEAK studies, which accounts for 31% of the total participants across all 46 studies. The average sample size was approximately 43 (SD = 29; range = 13 to 98). Eighty-four percent of the participants were male (n = 397) and 16% were female (n = 73). Participants ranged in age from 2 to 22 years old across all studies. The race and ethnicity of the participants were not reported for any of the studies. The largest disability category represented was ASD (n = 325, 69%) followed by other developmental disability (n = 110, 23%) and intellectual disability (n = 11, 2%).
Study components
Nine of the 11 studies reported the location where the participants were assessed. The study locations included the Midwestern United States (n = 8, 73%) and Ontario, Canada (n = 1, 9%); two studies (18%) did not provide enough information to discern the study location. The assessors included graduate students (n = 4, 36%), experimenters (n = 3, 27%), educators (n = 1, 9%), staff workers (n = 1, 9%), and behavior analysts or registered behavior technicians (n = 2, 18%).
TARPA
Measurement studies
A summary of TARPA articles is provided in Table 4. Four articles were found that included five studies evaluating the measurement properties of the TARPA. Two of the studies presented reliability evidence and all five studies presented evidence for criterion-related validity. The criteria used for comparisons were adaptive measures (n = 2, 40%), assessments evaluating verbal repertoire (n = 4, 80%), social maturity (n = 1, 20%), autism rating scale (n = 1, 20%), and an abbreviated intelligence test (n = 1, 20%).
Summary of TARPA Measurement Studies.
Note. ASD = autism spectrum disorder; DD = developmental disability; NT = neurotypical/typically developing; nM = number of males; nF = number of females.
Participant demographics
Across the five studies, a total of 70 participants were represented. The average sample size was approximately 14 (SD = 12), and ranged from 5 to 35. Approximately 71% (n = 50) of the participants were male and 29% were female (n = 20). Participants ranged in age from 2 to 15 years old across all studies. Three of the studies were conducted with participants diagnosed with ASD (n = 50, 71%), one study with participants diagnosed with a developmental disability, and one study with participants classified as typically developing.
Study components
Four of the studies were conducted in Ireland. One study did not specify the location although the authors were affiliated with universities in Japan or Ireland. Four studies (80%) reported that the assessor included an experimenter and three included a parent/caregiver or “instructor.”
VerBAS
Measurement studies
A summary of VerBAS articles is provided in Table 5. Two studies were identified that evaluated the measurement properties of the VerBAS, both of which evaluated some form of reliability and validity. One study reported inter-rater reliability and an internal consistency index whereas the other study only reported internal consistency. The validity studies included the evaluation of construct- and criterion-related validity using multiple statistical procedures (e.g., MANOVA, principal components analysis).
Summary of VerBAS Measurement Studies.
Note. ASD = autism spectrum disorder; DD = developmental disability; ID = intellectual disability; Other = other type of disability; nM = number of males; nF = number of females.
Participant demographics
There was a sample size of 352 participants across both studies evaluating the VerBAS. There were 206 (59%) male participants and 146 (41%) female participants. Specific IQ scores were not reported, but the authors reported that all participants had severe or profound mental retardation. The race/ethnicity of the participants was not provided.
Study components
The geographic location was not reported for either study but both were conducted in a home setting or residential facility. Regarding types of assessors, one study included educators and staff members and the other study included caregivers and staff members.
VB-MAPP
Measurement studies
A summary of VB-MAPP articles is provided in Table 6. Two studies were identified that evaluated the measurement properties of the VB-MAPP. One study provided evidence for inter-rater reliability and criterion-related validity and the other study presented evidence for test-retest reliability. The researchers used percent agreement to analyze inter-rater reliability, ICC for test-retest reliability and patterns of performance for criterion-related validity. The criterion used for the validity study included an assessment of verbal repertoire skills.
Summary of VB-MAPP Measurement Studies.
Note. ASD = autism spectrum disorder; Other = other type of disability; nM = number of males; nF = number of females; nAA = African-American participants; nH = number of Hispanic participants.
Participant demographics
There was a total of 37 participants across both studies. For one study, the age range for participants was between 8 and 10 years of age with 84% male participants (n = 27) and 16% female participants (n = 5). Race/ethnicity and IQ were not reported for this study. All participants had ASD, two of whom had ASD along with another type of disorder/disability.
For the other study, all five participants were male (two African-American and three from a Hispanic or Latino background) between 2 and 6 years of age. Two were diagnosed with ASD and three had some reported type of developmental delay or medical condition which affected their development (e.g., lead exposure, chromosome 3p21.3 homozygous deletion). Neither the study location nor the participants’ IQs were reported in this study.
Study components
Only one of the two studies reported the geographic location which was in Southern California. Evaluators for this study included professionals who were either certified as a Board Certified Behavior Analyst (BCBA) or had obtained a master’s degree from a behavior analysis program and had conducted the assessment in the natural setting where clients engaged in their daily routines (e.g., home, daycare, community center). For the other study, graduate students conducted all evaluations in a university affiliated location.
Discussion
Assessments used for the purposes of making decisions about behavior goals and monitoring progress toward those goals have become increasingly important because inferences about individuals’ behavioral and verbal repertoire are based on the information provided by these assessments. The accuracy and consistency of inferences made about individuals are of the utmost importance. Therefore, the available evidence supporting the accuracy (i.e., validity) and consistency (i.e., reliability) of inferences should be considered prior to administration of an assessment to ensure that (1) the administration and related conditions (e.g., individual’s age/level of functioning, setting) are consistent with the intended use of the assessment and (2) the researcher/assessor/clinician can have confidence in decisions made based on the results. The purpose of our study was to summarize the validity and reliability evidence for criterion-referenced assessments developed using the ABA framework. Our study addressed an important gap in the literature by systematically identifying and summarizing not only the validity and reliability evidence but also the characteristics of the participants (e.g., age, diagnosis, sex), geographic location, research setting (e.g., residential facility, clinic), and assessors. Altogether, this provides more comprehensive information for understanding measurement quality for assessments used in ABA.
Although we identified 46 studies that provided information related to validity and reliability evidence for an ABA-based assessment, a key finding from our study is the misalignment between the assessments reported by behavior analysis professionals (Padilla, 2020) and the number of published validity and/or reliability studies in the literature. Three of the six assessments identified in the systematic literature review (i.e., ABLA, TARPA, VerBAS) were not reportedly used by those practicing in behavior analysis (Padilla, 2020). That is, 31 of the 46 identified studies were based on assessments that are not reported to be widely used, if at all. Furthermore, according to the practitioners surveyed in Padilla (2020), the VB-MAPP was the most commonly used assessment despite only having two studies identified as providing validity- or reliability-related evidence in the research literature. The ABLA had the highest number of measurement-related studies (n = 24) yet no respondents reported using this assessment. The PEAK had the second highest number of measurement-related studies (n = 11) but was the least reportedly used assessment in the field of behavior analysis (Padilla, 2020). Additional research is needed to provide strong evidence supporting the use of several other commonly reported assessments used in ABA, such as the VB-MAPP, ABLLS-R, AFLS, and EFL. Because these assessments have already been developed, we can only assume that the steps in the assessment design and development were sufficient, such as the definition of the construct being measured and the appropriateness of the test development (Sireci, 1998). Research on commonly used assessments can begin by evaluating an assessment’s content validity, as demonstrated in Usry et al. (2018), along with its test-retest and inter-rater reliability.
When a researcher decides to examine criterion-related validity evidence using two assessments, the researcher must ensure the comparison assessment (1) has supporting validity and reliability evidence and (2) relates to the target construct in a theoretically-supported manner.
An additional insight of this study is a strong need for both an evaluative tool for examining measurement-related studies of ABA assessments and high-quality validity studies focused on assessments in ABA. An important consideration when evaluating measurement-related studies is the statistical analysis employed to serve as evidence. Each analysis will and should depend on the specific data and characteristics of any particular study. However, our summary of published studies demonstrates a fairly wide variety of analytic approaches for examining measurement quality. The variability in analytic approach should not necessarily be viewed as a weakness of this collection of studies; however, each analytic approach comes with additional requirements of the data, distributional assumptions, and theoretical underpinnings that could possibly render differential, conflicting, or biased evidence quality across studies. We discuss this consideration in more detail below. No single study can provide complete evidence supporting an assessment’s use, so a set of strategically designed studies is recommended for any assessment. The salient point here is that an evaluative tool may benefit researchers and practitioners who are conducting their own or interpreting others’ validity and/or reliability studies. The adoption of guiding criteria for what constitutes sound evidence supporting an assessment would be useful and could address the considerations and recommendations below. Interested readers should see AERA (2014), American Psychological Association (2020), Johnson and Morgan (2016), and Padilla and Morgan (2022) for guidance on evaluating assessments.
Considerations and Recommendations
Appropriate estimates
The results of this review provide a cautionary tale even though the 46 studies conducted and disseminated represent an important step toward evaluating ABA-based assessments. Some additional insights are worth considering. Per the dimensions of ABA, decisions to select an assessment for use need to be based on data and research. In our systematic literature review, only the presence of measurement evidence was recorded, not the quality or strength of the evidence. For example, 31 of the 32 studies reporting inter-rater reliability used percent agreement as the estimate. Percent agreement has also been referred to as exact agreement and interobserver agreement (IOA). Despite its intuitiveness and ease of calculation, percent agreement is a notoriously biased estimate due to its potential capitalization on chance agreement (Cohen, 1960; Krippendorff, 1980). If, for instance, two observers randomly coded 15% of individuals’ performance as “Pass” and 85% of individuals’ performance as “Fail,” their percentage agreement would be about 75% due to chance alone (0.15 × 0.15 + 0.85 × 0.85 ≈ 0.75). Furthermore, percent agreement is also influenced by the amount and types of scores collected for calculating agreement (Hausman et al., 2022). An estimate that accounts for chance agreement, such as kappa or weighted kappa, may be more accurate. This type of estimate was used by only one study in our review. Measurement estimates based on generalizability theory should also be considered (Stemler, 2004) depending on the purpose of the study or analysis.
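To make the contrast between percent agreement and a chance-corrected estimate concrete, the following is a minimal sketch of percent agreement, expected chance agreement, and Cohen's (1960) kappa for two raters. The ratings are hypothetical and were constructed so that each rater codes 15% of cases as "Pass" and 85% as "Fail"; they do not come from any study in this review, and the function names are our own.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which the two raters assign the same code."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def chance_agreement(rater_a, rater_b):
    """Agreement expected by chance, from each rater's marginal proportions."""
    n = len(rater_a)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    return sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    p_o = percent_agreement(rater_a, rater_b)
    p_e = chance_agreement(rater_a, rater_b)
    if p_e == 1:
        # Both raters used a single identical category; kappa is undefined.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings for 20 cases: each rater codes 3 "Pass" and 17 "Fail".
rater_a = ["Pass"] * 3 + ["Fail"] * 17
rater_b = ["Fail"] * 2 + ["Pass"] * 3 + ["Fail"] * 15

print("percent agreement:", percent_agreement(rater_a, rater_b))
print("chance agreement: ", chance_agreement(rater_a, rater_b))
print("Cohen's kappa:    ", round(cohens_kappa(rater_a, rater_b), 3))
```

In this hypothetical data set the raters agree on 16 of 20 cases (80%), yet chance agreement is already about 75%, so kappa is only about 0.22. The same percent agreement can therefore correspond to very different levels of chance-corrected agreement depending on how unevenly the categories are used.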
Vocabulary
A note about vocabulary should be considered next. Several studies classified the type of validity evidence they provided as “predictive validity,” which was presumably based on the use of regression analysis. In regression analysis, many terms are used to refer to variables in the model. The outcome variable may be referred to as such but it may also be referred to as the dependent variable or criterion variable. The variables used to examine shared variability with the outcome variables are referred to as explanatory variables, independent variables, or predictors. The values returned by a regression equation are commonly referred to as “predicted values,” yet the outcome variables may not have been observed in the future (i.e., true prediction). In the context of criterion validity, the classification of evidence as “concurrent” or “predictive” depends on when the measure or variable used for comparison occurs. On the one hand, predictive validity evidence is based on comparisons between an assessment and some other variable that is collected in the future; the assessment is used to see if predictions made based on the assessment come to pass. On the other hand, concurrent validity evidence is based on comparisons between an assessment and some other variable collected at about the same point in time. For the studies we identified, the assessment and the comparison variable (e.g., parent ratings, staff ratings) occurred at about the same time. Therefore, the correct classification should most likely be concurrent validity evidence. The inconsistent terminology between regression and validity analyses does not diminish the value of the published work, but it may confuse some readers. Using established vocabulary from the field of measurement may increase the impact of validity research. There is also the possibility that the methods and analyses used do not provide accurate information regarding the quality of reliability and validity evidence.
In one or both instances, decisions to use an assessment based on incomplete or inaccurate evidence could be misguided.
Statistical Analyses
A final consideration worth noting relates to appropriate statistical analysis. Small samples were used in many of the studies we identified. Statistical analyses that rely on distributional assumptions may be biased in these samples. For example, the Pearson correlation, which was used in about a quarter of the studies we identified, is a measure of linear relationship and requires that both variables being analyzed follow a normal distribution. Both variables must also be continuous, or measured at least at the interval level, for the estimate to be accurate. In at least one study, the results reported were based on a linear relationship between variables when the relationship was clearly nonlinear. As a result, the reported correlation was too low, so the validity evidence appeared weaker than it should have been. Estimates like Spearman’s rank-order correlation, or some other estimate that requires a reduced set of assumptions, would be worth considering. The same is true for group comparisons and other types of analyses, not just correlation.
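The point about nonlinear relationships can be illustrated with a small sketch. The data below are hypothetical: the second variable increases strictly with the first, but along an exponential curve rather than a line. The sketch computes the Pearson correlation directly and Spearman's rank-order correlation as the Pearson correlation of the ranks (with tied values given their average rank); the function names are our own.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical scores: y rises monotonically with x, but not linearly.
x = list(range(1, 11))
y = [math.exp(v) for v in x]

print("Pearson r:   ", round(pearson_r(x, y), 3))
print("Spearman rho:", round(spearman_rho(x, y), 3))
```

Because the ordering of the two variables is identical, Spearman's coefficient equals 1, whereas the Pearson coefficient is noticeably lower. A researcher relying only on the Pearson estimate for data like these would understate the strength of the association.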
In several studies reporting construct validity evidence, the authors reported the use of principal components analysis, yet referred to “factors” in the paper. Although factor analysis and principal components analysis are mathematically similar, they are theoretically opposite one another. In factor analysis, the construct being measured is defined by theory and the items/tasks used are viewed as manifestations of the underlying construct (i.e., factor). In component analysis, the construct is defined by the specific items/tasks used to measure the construct. On a more technical note, a component is a combination of the specific items/tasks weighted to maximize the variability shared by those items/tasks. Because items/tasks in factor analysis are viewed as manifestations, the substantive meaning of the construct does not change should other items/tasks be used, such as could be the case if a new edition of an assessment is published. In contrast, because the items/tasks in component analysis define the underlying components, the component’s meaning changes should other tasks/items be used. Due to these important distinctions, factor analysis is the prevailing paradigm throughout the social sciences because it commonly aligns more closely with what researchers theorize or hypothesize about the phenomena under investigation. With respect to the studies that reported component analysis, we do not mean to imply that their analyses were or were not appropriate; instead, we contend that authors using this type of analysis as construct validity evidence should make a strong theoretical justification regardless of the analytic approach because the definitions and subsequent interpretations of the constructs depend on the analytic approach. The dissemination of the findings would also benefit from using terminology that is consistent with the chosen approach.
Limitations and Delimitations
There are several limitations of the current review that should be mentioned. Although we employed a comprehensive search strategy, it is possible that the 46 studies that we were able to identify do not represent the full corpus of studies designed to provide validity and reliability evidence for criterion-referenced assessments based in ABA. Studies published with different keywords or not using the reliability and/or validity vocabulary may not be reflected in our results, which would be a limitation. We provided all relevant details related to our search strategies in an attempt to mitigate this limitation.
To our knowledge, this review is the first to collect and summarize these measurement-related properties of criterion-referenced ABA-based assessments. As such, we chose to delimit our analysis to a summary of evidence that was or was not presented by the study authors. We did not evaluate the quality of evidence using any evaluative tool or rubric. Instead, we provide considerations for interpreting the results of the published studies so as to support the efforts that have been completed to date and identify areas for additional research moving forward.
Conclusions and Implications
Research
We hope that this review has demonstrated that this is a growing area of research in which important advances are being made, and that there are specific areas where additional research is needed. Numerous research teams are making progress in these areas, and our review and summary serve as a report on the current state of validity and reliability studies for ABA-based criterion-referenced assessments. A substantial amount of validity and reliability evidence is provided in the 46 studies identified, but the evidence in many cases may be incomplete, misclassified, or misaligned with the assessment’s reported usage. The implementation of the recommendations identified above would contribute to the continued growth of this research area. In many cases, the research designs from the studies we identified are appropriate for the research purposes stated by the study authors. The considerations noted above relate more to technical aspects of analysis. Thus, researchers engaging in validity and reliability studies should consult or collaborate with a measurement expert (e.g., psychometrician) and statistician. Additional research is needed in which the analytic methods and statistical analyses used for providing evidence of measurement quality are directly aligned with the sample being studied and the intended use of the assessment. For example, studies seeking to provide reliability evidence between raters/observers should be specifically designed as such and should report indexes, such as Cohen’s
Generalizability is a critical element of many types of research, including validity and reliability studies. Thus, an important reporting consideration that supports generalizability is a full description of study participants. Most of the studies reported some information about study participants, but many excluded the racial/ethnic backgrounds of the participants. This exclusion is a major weakness of the existing literature, and this information must be part of the design and reporting of future research. Participant characteristics, such as age, gender/sex, race/ethnicity, and (dis)ability, should be reported and incorporated into the design of validity research. Participant characteristics are crucial for evaluating the accuracy of assessments and/or effectiveness of interventions. Future research should attend to generalizability considerations of measurement research.
Future research should examine not only the validity and reliability evidence but also how these commonly used assessments align with the dimensions of ABA that guide practice and research in the field (Baer et al., 1968). Such alignment would both bolster the impact of measurement-related research and ultimately promote acceptance of these measurement principles among the wider behavior analysis community of researchers and practitioners. Finally, detailed information and research regarding the appropriateness of test development procedures (e.g., item/prompt development, criteria selection, alignment with theoretical framework) of available assessments would aid users in evaluating content validity.
Clinical Practice
In the field of ABA, there is widespread usage of criterion-referenced assessments to develop treatment plans for individuals with ASD (Padilla, 2020). The importance of the availability of sound evidence supporting the use of assessments, including those that are ABA-based, cannot be overstated. Data from assessments are used to make decisions about individuals’ intervention plans, with the aim of improving the quality of life of the individuals and their families. Therefore, every effort should be made to ensure that there is ample and compelling evidence to support any assessment’s use. In clinical settings, each component of a treatment plan (i.e., goals, teaching strategies, reinforcement schedules, and data collection procedures) is commonly based on the results of an assessment. The assessment results depend on the accuracy and appropriateness of scores/ratings of specific behaviors or responses that are elicited by the prompts, items, or tasks that comprise the assessment. Therefore, the potential effectiveness of therapy, at least as it relates to the treatment plan, is bound by the quality of the validity and reliability evidence on which the treatment plan is based. As a simple example, if an assessment has little or weak evidence supporting inter-rater reliability, then it cannot be determined with confidence how much of the assessment results reflects the individual being assessed and how much reflects subjectivity unique to the assessor.
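To make the inter-rater example concrete, the following minimal sketch computes one commonly used chance-corrected agreement index, Cohen’s kappa, for two raters scoring the same items. The ratings are hypothetical and the function name is our own; the choice of kappa here is an assumption for illustration, not a method reported by any specific study in this review.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical scores to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal proportions.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)

# Hypothetical scores (1 = skill demonstrated, 0 = not demonstrated) on ten items.
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
rater_b = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
kappa = cohens_kappa(rater_a, rater_b)
print("Cohen's kappa:", round(kappa, 3))
```

In this hypothetical case the raters agree on 8 of 10 items, but because some of that agreement would be expected by chance, the chance-corrected index is lower than the raw agreement rate of .80.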
Based on the findings of the study along with the considerations we present, the evidence supporting the use of these six assessments should be interpreted with caution. Practitioners in ABA need to have a basic understanding of the measurement-related properties associated with the assessments they use for the clients they serve. Such understanding will help guide them to select assessments based on research. It is reasonable to assume practitioners in the field prefer to use assessments based in ABA, as demonstrated in the Padilla (2020) survey; however, clinicians need to understand when and how to interpret assessment data when there is limited evidence to support the use of the assessment. In such cases, it is important to obtain information from a variety of sources (e.g., other assessment data, behavior observations, client history, professional judgment) to support assessment conclusions and treatment decisions. Insurance requirements also play a role in assessment selection (Padilla, 2020). Insurance companies tend to select norm-referenced assessments with validity and reliability evidence supporting their use, in general; however, the assessments used in ABA are often specific to certain skills and behaviors of individuals with specific conditions. Unfortunately, the assessment(s) selected by insurance companies may be based on a framework or intended use that differs from the basis of the behavioral therapy being provided. This demonstrates the strong and immediate need for additional research on criterion-referenced assessments in ABA and for advocacy for research-based assessments that form the basis for insurance mandates. Finally, the strength of reliability and validity evidence is an important factor when selecting assessments for practice, research, or insurance policies, but the selection must also be aligned with the code of ethics and dimensions central to the work of behavior analysts.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
