Psychometric Properties of the Assessment, Evaluation, and Programming System for Infants and Children

Abstract

The purpose of this article is to provide evidence of the technical adequacy of the Assessment, Evaluation, and Programming System–Third Edition (AEPS-3). The AEPS has long been identified as one of the most psychometrically sound early childhood curriculum-based assessments. In this article, results of three studies of technical adequacy are reviewed. First, a utility study was conducted to examine the degree to which teachers and providers found the AEPS-3 useful for its intended purposes (i.e., goal development and programming). Second, we examined the interrater reliability of the AEPS-3 by having teachers and providers view videotapes and score AEPS-3 items. Finally, a concurrent validity study was conducted, whereby a group of children were assessed using a norm-referenced assessment and the AEPS-3. Results of all three studies provide early evidence that the AEPS-3 is psychometrically sound.

Keywords

assessment, components of practice, measure development (psychometrics), research methods, young children, curriculum-based assessment, technical adequacy, measurement

Early childhood assessment procedures are guided by recommended practices from a variety of sources (Division for Early Childhood, 2014; National Association for the Education of Young Children, 2009). Across all of these sources, there is agreement that early childhood assessments should be socially valid, use authentic assessment methods, allow for collaboration between families and professionals, have evidence that supports their reliability and validity, and be sensitive to child change over time. The type of assessment tools that match these recommended practices most closely are curriculum-based assessments (CBAs). According to Grisham-Brown & Pretti-Frontczak (2011), a CBA contains test items that are aligned with the curriculum so that what is on the test can actually be taught. Many CBAs are also criterion-referenced, meaning that the purpose of the test is to measure a child’s performance over time by determining the extent to which a child meets the criterion or standard for each item on the test. Curriculum-based assessments are primarily used to assist teacher/providers/interventionists in knowing what to teach the children with whom they work. When working with young children who have disabilities, data from a CBA are used to identify goals or outcomes for the child’s Individualized Education Plan (IEP) or Individualized Family Service Plan (IFSP). Moreover, CBAs show a child’s progress over time, as children are compared with themselves and not with a normative sample of children. More recently, CBAs have been used to report children’s progress as part of state and federal accountability systems (Grisham-Brown, Pretti-Frontczak et al., 2008).

The Assessment, Evaluation, and Programming System for Infants and Children (AEPS^®; Bricker, 2002) is one of the most widely used CBAs in the field of early childhood intervention (Bagnato et al., 2010). The AEPS evolved from an informal meeting in 1974 of early childhood practitioners who were concerned about the assessment of children with disabilities. In the beginning, AEPS work focused on the age range from birth to 2 years and included assessing a large number of developmental milestones (up to 600). Throughout the years, while the work narrowed down the number of milestones to assess, the authors were continually asked to include an assessment for the preschool years. The second edition of AEPS, published in 2002, was a four-volume set with curriculum and assessments for ages birth to 3 years (Level I) and 3 to 6 years (Level II). Currently, AEPS is used in early intervention programs, Early Head Start and Head Start programs, state-funded pre-kindergarten programs, and child care programs. It is widely used across the United States as well as internationally (e.g., Bulgaria, Republic of Georgia, and Singapore).

Based on input from users, AEPS has been revised to better meet the needs of programs that serve young children. The third edition of the AEPS Test is one seamless assessment that covers the development of children up to 6 years of age (Bricker et al., in press). Based on results of a Rasch analysis of the second edition (Winchell, 2011), AEPS-3 was redesigned to include more items at the lower and upper ends of the test. Expanded content on Math and Literacy strengthens usefulness in current educational environments. Criteria for AEPS-3 test items are refined for clarity and interrater reliability, while more examples address cultural diversity. AEPS-3 scoring now includes the requirement of a scoring note for the “1” score, which represents emerging skills. A new, separate tool called Ready Set contains a selection of items across all developmental areas that support school readiness and children’s transition to kindergarten. The AEPS-3 curriculum also reflects major changes. The curriculum is now designed around three levels: Beginning, Growing, and Ready. Its structure and content are grounded in 18 routines and activities in which young children engage throughout their day. All activities at all levels provide ideas for differentiating instruction across three tiers of teaching strategies: universal, targeted, and specialized.

AEPS has been one of the most widely researched CBAs in the field of early childhood intervention (Bagnato et al., 2010). The psychometric properties of AEPS have been examined for more than 30 years. Research on earlier editions of AEPS has focused on the psychometric properties, utility, and degree to which the test can be used for purposes beyond program planning and progress monitoring. Studies of the test’s validity, reliability, and utility have been conducted, reported, and used to improve AEPS.

Validity

Concurrent validity studies between AEPS, which is a criterion-referenced test, and various norm-referenced tests have consistently found strong correlations. Bailey & Bricker (1986) examined congruence between an early version of AEPS and the Gesell. Bricker et al. (1990) found significant correlations between the AEPS, Gesell, and Bayley Scales of Infant Development. Gao & Grisham-Brown (2011) compared subscale scores of the Battelle and AEPS and found statistically significant correlations in the areas of social-communication and cognition. Finally, Macy et al. (2005) examined the congruence of AEPS with the Gesell and Battelle Developmental Inventory (BDI). The sensitivity of AEPS also has been examined. Hsia (1993) found that AEPS was sensitive to performance differences between children with and without disabilities. Bricker et al. (2003) found sensitivity levels between 77% and 100% for the Birth to 3 Years level of AEPS. Bricker et al. (2008) found extremely high sensitivity levels (i.e., above 90%) for the Birth to 3 Years level and sensitivity ranging between 76% and 100% for the 3 to 6 Years level of AEPS.

Reliability

Two studies of test–retest reliability have been conducted. Bailey & Bricker (1986) found stability over time on scores for the total sample of their study. Bricker et al. (1990) found similar results. Interrater reliability also has been examined on AEPS. Bailey and Bricker found consistent scoring between provider and parent ratings. In Bailey and Bricker, researchers found that interrater reliability scores were .966 on the entire scale and between .705 and .958 on individual domains, with Hsia (1993) finding similar results. Grisham-Brown et al. (2008) examined interrater agreement of Head Start teacher/providers and assistants and found 76% to 93% agreement across domains. Macy et al. (2005) found interrater reliability scores between .79 and .93. Internal consistency was evaluated in Bailey and Bricker with statistically significant correlations between strands and domains and total test and individual domains.

Utility

Six studies have examined the utility or treatment validity of AEPS. Most of these studies have examined the degree to which using AEPS influences early childhood intervention practices. Bailey and Bricker (1986) reported that providers found the test useful for developing educational plans. A later study, reported in Pretti-Frontczak & Bricker (2000) and Bricker & Pretti-Frontczak (1997), showed that teacher/providers who learned to use AEPS wrote high-quality goals for the children with whom they worked. Hamilton (1995) compared the quality of goals that teacher/providers wrote from AEPS with the quality of goals written from the Oregon Project and found that goals written with AEPS were of higher quality. In another comparative study, Notari and Drinkwater (1991) established that long-term goals and short-term objectives generated using AEPS were more functional and easier to integrate into routines than those generated from a computerized IEP program. In a third comparative study, Straka (1994) found that goals and objectives generated from AEPS were of higher quality than those generated from the Communication and Symbolic Behavior Scales. Finally, Gao and Grisham-Brown (2011) found that Head Start teacher/providers preferred AEPS over norm-referenced tests for classroom planning.

Corroboration of Decisions

Four studies have investigated the degree to which AEPS resulted in decisions similar to those generated from a norm-referenced assessment. Bricker et al. (2003) found that AEPS over-identified between 5% and 25% by age interval but under-identified between 0% and 8% of children deemed eligible by age interval. Macy et al. (2005) found an average of 94% agreement in eligibility when AEPS scores were compared with norm-referenced assessments (i.e., Gesell or BDI). Bricker et al. (2008) reported results similar to Bricker et al. (2003). In a study of children ages 4 to 66 months, Bricker et al. (2008) found AEPS over-identified between 9% and 30% by age interval but under-identified between 0% and 12% of children deemed eligible by age interval. Finally, Hallam et al. (2014) compared AEPS cut-scores with BDI (Newborg, 2005) standard deviation scores to determine the congruence between the decisions derived from each test. Results showed agreement on the developmental status of 78% of the children (i.e., 36 of 50 children “on track” and three of 50 children developmentally delayed).
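The 78% agreement figure attributed to Hallam et al. (2014) follows from the two counts quoted above. A minimal Python sketch of that arithmetic is given here; the function name `percent_agreement` is an illustrative choice and is not drawn from either instrument or from the cited study.

```python
def percent_agreement(agreeing_counts, total):
    """Return the percentage of cases on which two tests reach the same decision."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * sum(agreeing_counts) / total


# Counts quoted in the text: 36 children classified "on track" by both tests
# and 3 classified as developmentally delayed by both, out of 50 children.
hallam_agreement = percent_agreement([36, 3], 50)
print(f"{hallam_agreement:.0f}%")
```

The remaining 11 children are those on whom the two tests disagreed; the text does not say how those disagreements were distributed, so the sketch does not model them.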

The purpose of this study is to evaluate the psychometric properties of the third edition of AEPS. Specifically, the study examined the interobserver reliability, utility, and concurrent validity of AEPS-3.

Method

Subjects

Teacher/providers

Teacher/providers, therapists, and early interventionists in all 50 states use the second edition of AEPS. Many professionals who use AEPS work in inclusive programs that serve both children with disabilities and their typically developing peers. Therefore, the teachers/providers who participated in the present studies were drawn from AEPS users across the United States.

Interrater agreement

A total of 116 (i.e., 115 females and one male) providers from 14 sites across the United States took part in the AEPS-3 Field Study. The level of education among the providers was 1% high school diploma or General Educational Development (GED), 1% Associate’s degree, 20% Bachelor’s degree, and 78% Postgraduate/Graduate degree and above. The range of degrees earned by the participants included Interdisciplinary Early Childhood Education, Special Education, Child and Family Studies, Elementary Education, Speech Therapy, Communication Disorders and Science, Early Intervention, Occupational Therapy, Behavioral Sciences, General Education, and Psychology. The average years of experience for participants working with children from birth to 3 years old was 6.8 years (range = 0–30). For participants working with children 3 to 6 years old, it was 9.7 years (range = 0–34). Table 1 provides additional demographic information on the total sample of participants. Providers came from seven states: Kansas, Kentucky, Ohio, Oregon, Tennessee, Texas, and Virginia. Interrater reliability data were collected from the total sample population of providers (i.e., 116). Each provider received a stipend of US$25 for completing online training and completing the interrater reliability test.

Table 1.

Demographic Statistics for Participating Teacher/Providers (N = 116).

	f	%
Gender
Female	115	99
Male	1	1
Level of education
High school diploma/GED	1	1
Some college	0	0
Associate’s degree	1	1
Bachelor’s degree	23	20
Postgraduate/Graduate’s degree and above	91	78
Current job
Early childhood provider
Lead teacher	60	52
Teacher assistant	0	0
Home visitor	5	4
Early interventionist specialist
Home visitor	8	7
Speech/language pathologist	3	3
Occupational therapist	1	1
Evaluation team member	1	1
ECSE itinerant teacher/interventionist
Home visitor	2	2
Speech/language pathologist	5	4
Occupational therapist	3	3
Evaluation team member	1	1
Type of program
Home visiting	13	11
Early childhood or preschool classroom	72	62
Kindergarten	2	2
Other	1	1
Experience using AEPS
1 year or less	14	12
1–2 years	31	27
3–5 years	39	34
5 or more years	32	28
Children assessed using AEPS in past year
Less than 5	5	4
5–10	31	27
10–20	35	30
20 or more	45	39

Note. GED = General Educational Development; ECSE = Early Childhood Special Education; AEPS = Assessment, Evaluation, and Programming System.

Utility

A total of 11 providers from two sites participated in the utility study. Five participants were from Oregon, and six were from Kentucky. The site in Oregon is one of the nine regional programs contracted to provide Early Intervention/Early Childhood Special Education (EI/ECSE) services throughout the state. The site in Kentucky was a publicly funded preschool program serving preschoolers with disabilities and those who were at risk due to socioeconomic variables. Both sites required the use of the AEPS as their primary CBA. The level of education among the providers was Bachelor’s degree (N = 3) and Postgraduate/Graduate degree and above (N = 8). The range of degrees earned by the participants included Interdisciplinary Early Childhood Education, Special Education, Early Intervention, Early Childhood Special Education, and Social Work. The number of years of experience for participants working with children from birth to 6 years old was 2 to 30 years. The majority of participants had assessed over 30 children with AEPS over the past year (N = 6), while four had assessed between 10 and 20 children, and one between 5 and 10 children during the past year. Each provider received a stipend of US$100 and a set of books valued at US$100 for completing online training, assessing one or more children with the AEPS-3 test protocol, and completing the utility survey.

Concurrent validity

The concurrent validity study took place at one site in Kentucky. Eight teacher/providers participated in the study. Four teacher/providers worked in a preschool classroom, two worked in a toddler classroom, and two worked in an infant room. Three of the teacher/providers had master’s degrees, and five had bachelor’s degrees in interdisciplinary early childhood education. Years of experience ranged from less than 1 year to 9 years. Teacher/providers received US$25 for completing online training and interrater reliability testing as well as US$50 for each test protocol they completed (two for each child, BDI-2 and AEPS-3).

Children

Children birth to 6 years old with or without disabilities were recruited by their teacher/provider to participate in the field study. Each teacher/provider identified a minimum of four children in their program and sought consent from parents/guardians. Informed consent forms identified the purpose of the study, explained that teachers would observe children during activities to collect information to score AEPS-3, and explained how data would be used (e.g., concurrent validity).

Concurrent validity

Fifty children (i.e., 25 females and 25 males) took part in the concurrent validity study. The age range of children was between 12 and 65 months. The ethnicities of the children were Caucasian (68%), African American (4%), Hispanic (4%), Asian (14%), Other (2%), and Mixed (8%). Of the 50 children, 10 were receiving special education services, and 40 were not receiving services.

Procedures

Procedures for three studies are described here. The utility study was conducted first followed by the interrater reliability study and then the concurrent validity study. Those who participated in the concurrent validity study were a subset of participants who also participated in the interrater reliability study.

Utility

A utility study was completed prior to collecting any other AEPS-3 field test data. The purpose of the utility study was to gather information about AEPS-3 from a small sample of field test participants to (a) gain an understanding of the utility of the tool from providers working in the field by having them assess children using the AEPS-3 protocol, and (b) ensure there were no concerns with test content and scoring before collecting child data. For the utility study, participating providers were required to take the AEPS-3 online training (see details below) prior to collecting AEPS-3 data. Each provider was asked to complete AEPS-3 with at least one child. Together the 11 providers assessed 23 children between the ages of 4 and 83 months. Child assessment data were collected in each 12-month age interval (e.g., 12 months and younger, 25–36 months). The assessment observations took place in the child’s home or classroom setting. After completing AEPS-3, providers completed the Utility Survey, which asked questions about scoring, goals in each of the eight areas, and usefulness of AEPS-3 for its intended purposes.

Interrater agreement

Data for the interrater and concurrent validity studies were entered into a secure online web portal. Each provider completed an online training before completing the interrater reliability test. Once providers had successfully completed both the training and the interrater reliability test, they began data collection using a Child Observation Data Form (CODF) that included all test items, criteria, and examples, and then used an online data entry form to enter AEPS-3 Test results.

AEPS-3 training

The AEPS-3 online training module included a narrated slide presentation that described features of AEPS-3 that are the same as the second edition of AEPS, highlighted new features introduced in AEPS-3, and offered an overview of scoring rules and guidelines. In addition, the training provided scoring examples using embedded video clips of children in natural home and classroom settings to give participants an opportunity to observe and practice scoring AEPS-3 items. The online training was approximately an hour in length. Providers were able to go back and forth through the slides and had unlimited access to review the training module as needed after they completed it.

Interrater reliability test

The interrater reliability test was designed for two purposes: (a) to establish interrater reliability of the AEPS-3 Test; and (b) to ensure that participants could reliably score child observations using the 3-point scoring system (2, 1, 0) prior to collecting AEPS-3 field test data. A series of videotapes was created of children engaged in typical activities in their home and classroom environments. The interrater reliability test consisted of 37 video clips that included scoring observations for 68 AEPS-3 items across all areas and developmental levels. All 68 test items were observed and scored by AEPS authors and then reviewed for final agreement to establish the gold standard for reliability. For each item, the correct score was established along with a written rationale to explain why the child received the score based on the videotaped observation.

Prior to accessing the online interrater reliability test, participants completed the online AEPS-3 training described above. Instructions were included for review prior to taking the online test. Participants could take up to 4 hr to complete the test in one sitting. If they did not complete it within that timeframe, they could resume where they left off within 24 hr of starting the test. To establish interrater reliability, providers viewed video clips of children engaging in routine activities in home and classroom settings prior to scoring AEPS-3 items using the 3-point scoring options (2, 1, 0). Participants were encouraged to watch each clip at least twice but could watch each video clip as many times as they chose. Once all 68 items had been scored and submitted, the interrater reliability results were calculated by comparing the submitted item scores with the “correct” item scores established by AEPS-3 authors as described above. To pass the interrater reliability test and move on to data collection, participants were required to achieve an overall score of 80% or higher. After completing the interrater reliability test, providers received their test results immediately and could go back and retake the interrater reliability test if they did not achieve an overall score of 80%. To enhance understanding of AEPS-3 scoring, providers could review their item responses to the videotaped scenarios, the correct item scores, and a rationale for why the item was scored the way it was. Providers were permitted to take the test as many times as they needed to reach an overall score of 80%.
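The comparison of submitted scores with the gold-standard scores can be sketched in Python as follows. This is a minimal illustration, assuming that the overall score is the percentage of items on which the submitted score exactly matches the gold-standard score; the text does not spell out the formula, and the function names and example scores are hypothetical.

```python
PASSING_PERCENT = 80.0  # criterion stated in the text
VALID_SCORES = {0, 1, 2}  # the 3-point scoring options


def agreement_with_gold_standard(submitted, gold):
    """Percentage of items whose submitted score equals the gold-standard score.

    Assumption: exact-match percent agreement; not confirmed by the source.
    """
    if len(submitted) != len(gold):
        raise ValueError("score lists must be the same length")
    if not gold:
        raise ValueError("at least one item is required")
    if not set(submitted) <= VALID_SCORES or not set(gold) <= VALID_SCORES:
        raise ValueError("scores must be 0, 1, or 2")
    matches = sum(s == g for s, g in zip(submitted, gold))
    return 100.0 * matches / len(gold)


def passed(submitted, gold, criterion=PASSING_PERCENT):
    """True if the overall agreement reaches the passing criterion."""
    return agreement_with_gold_standard(submitted, gold) >= criterion


# Hypothetical 10-item example (the actual test used 68 items).
gold_scores = [2, 1, 0, 2, 2, 1, 0, 2, 1, 2]
rater_scores = [2, 1, 0, 2, 1, 1, 0, 2, 1, 2]  # differs on one item

overall = agreement_with_gold_standard(rater_scores, gold_scores)
print(overall, passed(rater_scores, gold_scores))
```

Exact-match agreement treats a 2-versus-1 disagreement the same as a 2-versus-0 disagreement; a weighted index would distinguish them, but nothing in the text indicates that one was used.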

Concurrent validity

The concurrent validity of the AEPS-3 was examined by administering both the AEPS-3 and the Battelle Developmental Inventory–Second Edition (BDI-2; Newborg, 2005) to the same children within 2 weeks. Teachers/providers completed the AEPS-3 through observations of children in their classrooms where they received early care and education. Teacher/providers had all received training on BDI-2 prior to collecting data for concurrent validity in their undergraduate or undergraduate certification coursework. All participating providers completed the AEPS-3 online training module and interrater reliability test prior to collecting data. After passing the interrater reliability test, providers began collecting AEPS-3 data and entering it into the online system. Participating providers were required to score and enter all items on the AEPS-3 using the Child Observation Data Recording Form (CODF). BDI-2 data were collected within 2 weeks of completing AEPS-3. The BDI-2 was used as the developmental criterion measure to examine the concurrent validity of the AEPS-3. The BDI-2 is a norm-referenced assessment for use with children birth to 7 years of age. The test measures five developmental areas: adaptive, personal-social, communication, motor, and cognitive. Teachers/providers administered the test through observation, interview, and primarily structured testing. The BDI-2 has been used as the criterion measure in two previous concurrent validity studies of AEPS (Gao & Grisham-Brown, 2011; Hallam et al., 2014).

Results

Utility

The utility of AEPS-3 was examined by having teacher/providers (N = 11) assess one or more children with AEPS-3 and then complete a Utility Survey. The survey was divided into three main sections: (I) Scoring, (II) Items and Criteria, and (III) Usefulness of AEPS for Intended Purposes. Questions from all sections of the survey were rated on a 4-point Likert-type scale (i.e., 4 = Strongly Agree; 3 = Agree; 2 = Disagree; and 1 = Strongly Disagree).

Scoring

Section I included a total of 13 questions about the 3-point scoring options (2, 1, 0) and the scoring notes (A = Assistance; I = Incomplete; C = Conduct; M = Modification; Q = Quality; and R = Report). Providers were asked to use the 4-point scale to rate whether the 3-point scoring options were easy to understand (M = 3.45), permitted accurate rating (M = 3.54), and if the scoring notes were easy to understand (M = 3.36) and enhanced the accuracy of rating of children’s performance on AEPS items (M = 3.18). In addition, providers were asked to rate if it was clear when to give each of the scoring options and when to add one of the notes. The range of mean scores for these nine items is M = 3.18–4.0. There were seven items for which one or two people responded Disagree and only one item marked Strongly Disagree. The Disagree and Strongly Disagree responses were all related to scoring notes, except for one item that asked about the clarity of giving a “2” score. The “Conduct” scoring note was rated the least clear with the only Strongly Disagree response. Most items in this section were rated Strongly Agree and Agree (11 items with 91% and above, and two items with 82% and above).

Items and criteria

Section II included all AEPS-3 goals (117) from the eight areas (Fine Motor, Gross Motor, Adaptive, Cognitive, Literacy, Math, Social-Communication, and Social-Emotional). Every goal was rated using the 4-point Likert-type scale on four dimensions: (1) Goal is Functional; (2) Goal is Teachable; (3) Goal is Easy to Understand; and (4) (Goal) Criterion is Easy to Understand.

Respondents rated the majority of AEPS-3 goals 4 (Strongly Agree) or 3 (Agree) across the four dimensions. Table 2 shows the number of items, by area, rated 4 (Strongly Agree) or 3 (Agree) by at least nine of the 11 respondents across the four dimensions. Four areas met this criterion for all items: Adaptive, Literacy, Social-Communication, and Social-Emotional. The remaining four areas (Fine Motor, Gross Motor, Cognitive, and Math) each had one item that fell below the criterion. Table 3 shows AEPS-3 goals that were rated 2 (Disagree) or 1 (Strongly Disagree) by two or more of the 11 respondents. Twelve of 117 AEPS-3 goals were rated 2 (Disagree) or 1 (Strongly Disagree) by two or more respondents on any of the four dimensions. For example, the Social-Communication item, follows person’s gaze to establish joint attention (B1), was rated 2 (Disagree) by one respondent and 1 (Strongly Disagree) by one respondent on the “Goal is Teachable” dimension.

Table 2.

Number of AEPS-3 Goals Rated 3 (Agree) or 4 (Strongly Agree) by 9 or More Respondents (N = 11).

AEPS-3 area and number of goals	Goal functional	Goal teachable	Goal easy to understand	Criterion easy to understand
Fine Motor (8)	8	7	8	8
Gross Motor (15)	15	14	15	15
Adaptive (15)	15	15	15	15
Cognitive (18)	18	18	18	17
Literacy (15)^a	15	15	15	15
Math (12)^b	11	11	12	12
Social-Communication (15)	15	15	15	15
Social-Emotional (19)	19	19	19	19

Note. AEPS-3 = Assessment, Evaluation, and Programming System–Third Edition.

^aLiteracy N = 10 for all but three items due to missing data. ^bMath N = 10 for 11 items and N = 9 for one item due to missing data.
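The counting rule behind Table 2 (a goal is counted on a dimension when at least nine of the 11 respondents rated it 3 or 4) can be sketched in Python as follows. The ratings and goal labels below are hypothetical and are not study data. The sketch also assumes that a missing response simply does not count toward the nine; how missing data were actually handled is not stated.

```python
MIN_AGREEING = 9  # "at least nine of the 11 respondents"


def meets_criterion(ratings, min_agreeing=MIN_AGREEING):
    """True if enough respondents rated the goal 3 (Agree) or 4 (Strongly Agree).

    Missing responses are passed as None and never count as agreement
    (an assumption; the source does not describe its missing-data rule).
    """
    agreeing = sum(1 for r in ratings if r is not None and r >= 3)
    return agreeing >= min_agreeing


def count_goals_meeting_criterion(ratings_by_goal):
    """Number of goals on one dimension that meet the criterion."""
    return sum(meets_criterion(r) for r in ratings_by_goal.values())


# Hypothetical ratings from 11 respondents on one dimension.
example_ratings = {
    "Goal X": [4, 4, 3, 3, 4, 3, 4, 4, 3, 2, 4],     # 10 agree
    "Goal Y": [4, 3, 2, 1, 4, 3, 2, 4, 3, 3, 4],     # 8 agree
    "Goal Z": [4, 4, 3, 3, 4, 3, 4, 4, 3, None, 4],  # 10 agree, 1 missing
}

goals_meeting = count_goals_meeting_criterion(example_ratings)
print(goals_meeting)
```

Running the same count once per dimension and per area would yield a table with the layout of Table 2.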

Table 3.

AEPS-3 Goals Rated 2 (Disagree) or 1 (Strongly Disagree) by Two or More of the 11 Respondents.

AEPS-3 area	Goal functional	Goal teachable	Goal easy to understand	Criterion easy to understand
Fine Motor	—	D1	—	—
Gross Motor	B4, B7, C3	B4, C1	—	—
Adaptive	—	A5	A5	A5
Cognitive	—	—	E1	E1, E2, E3
Literacy	—	—	—	—
Math	—	—	—	—
Social-Communication	—	B1	—	C1, C4
Social-Emotional	—	—	—	—

Note. AEPS-3 = Assessment, Evaluation, and Programming System–Third Edition.

Two items were rated 1 (Strongly Disagree) by two respondents, and one item was rated 2 (Disagree) by three respondents. No item had more than three respondents rating any of the dimensions 2 (Disagree) or 1 (Strongly Disagree). The two items rated 1 (Strongly Disagree) by two respondents are (a) the Fine Motor item, uses finger to interact with keys on electronic keyboard (D1), and (b) the Gross Motor item, Skips (B7). The one item rated 2 (Disagree) by three respondents was the Cognitive item, Expands simple observations and explorations into further inquiry (E1). All three items were new items on AEPS-3. One item in the Adaptive area, uses culturally appropriate social dining skills (A5), was the only item rated 2 (Disagree) by two respondents on three of the four dimensions (goal teachable, goal easy to understand, and criterion easy to understand).

Usefulness of AEPS for intended purposes

Five questions in Section III were rated using the 4-point Likert-type scale, one question asked respondents to indicate how long it took to complete AEPS-3, and a final question asked participants for their perspective on the strengths and weaknesses of AEPS-3.

The first five items, rated using the Likert-type scale, relate to the purposes for which the AEPS is intended to be used: (a) can be administered in authentic environments; (b) provides useful information for summarizing child strengths, and for writing present levels of development; (c) provides information for informing outcome data reporting to state and federal agencies; (d) can be used to monitor child progress; and (e) the use of I and/or A with a 1 score will provide useful information for progress monitoring. For all five items, nine or more of 11 respondents selected 4 (Strongly Agree) or 3 (Agree), indicating agreement that AEPS-3 is useful for its intended purposes. The range of time taken to complete AEPS-3 was between 1 and 3 hr, and the duration of the observation period was 1 to 8 days. Respondents shared that the amount of time needed depended on the age of the child. Strengths of AEPS-3 fell into two categories: new items/areas and scoring. For new items and areas, respondents indicated that they “liked” the additional literacy and math areas, language/literacy complexity, and inclusion of use of electronic devices (though inclusion of use of electronic devices was also identified as a weakness). Several respondents indicated they appreciated the addition of requiring the scoring notes I = Incomplete and/or A = Assistance to clarify a score of “1” for an emerging skill. One participant shared that using these notes could help with consistency in scoring between providers and be helpful to the next person scoring. One participant commented that they liked the item criteria and examples and found that the scoring notes help to clarify and inform scoring accuracy. The most consistent weakness identified was the length of the combined test for assessing children developmentally from birth to 6 years of age. Six of the 11 respondents made a note to this effect (e.g., “too many items,” “long and cumbersome,” and “I fear that using scale from 0–6 rather than 0–3 and 3–6 will get evaluators bogged down”), which provided useful feedback for developing administration guidelines for use outside the scope of a field test.

The Utility Survey served a formative purpose: to identify whether there were any substantive issues with test content and scoring before collecting child data. It is worth noting that a separate content validity study was conducted prior to the utility study (Macy et al., 2016). Once data from the Utility Survey were summarized, a group of AEPS-3 authors met on two occasions to review the information and determine any revisions needed prior to collecting field test data. Based on the survey results, no changes were made to the 3-point scoring options or the notes. The authors agreed to provide more examples of the scoring notes to improve clarity in the training materials. AEPS-3 items required few changes prior to data collection. One item was removed from the Literacy area, and eight items in total were revised across the Fine Motor, Gross Motor, Adaptive, Cognitive, Social-Communication, and Social-Emotional areas. Several examples were revised or expanded to help clarify test items further.

Interrater Reliability

Interrater reliability was examined by having teacher/providers watch video clips of children engaged in naturally occurring activities and routines and assign scores (2, 1, 0) to AEPS-3 items as observed in the clips. Participants were permitted to take the test multiple times until they reached acceptable reliability (i.e., 80%). However, for purposes of examining interrater reliability, only the first scoring attempt was used. While participants could take the test multiple times to meet the standard 80% criterion required prior to collecting AEPS-3 field test data, the majority of participants met the criterion of 80% correct on the first try (i.e., 89.79%). Teacher/provider interrater agreement with AEPS experts ranged from 66% to 100% with a grand mean of 89%. Table 4 shows the number of attempts along with the percent of participants who needed that number of attempts. These results suggest that participants with experience using AEPS can reliably score children’s performance on AEPS-3 after completing a brief online AEPS-3 training.

Table 4.

Interrater Reliability Based on Number of Attempts.

Attempt	Percent who reached criterion
1	89
2	9
4	1
6	1
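The 80% criterion described above can be illustrated with a short sketch. It assumes that agreement was computed as exact item-by-item agreement between a participant's scores and the expert scores, which the article does not state explicitly. The function name and the scores are hypothetical and are not data from the study.

```python
from typing import Sequence

CRITERION = 80.0  # percent agreement required before field testing (from the text)


def percent_agreement(rater: Sequence[int], expert: Sequence[int]) -> float:
    """Percentage of items on which the rater's score equals the expert's score."""
    if len(rater) != len(expert) or not rater:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(r == e for r, e in zip(rater, expert))
    return 100.0 * matches / len(rater)


# Hypothetical AEPS-style scores (2, 1, 0) for ten items; not data from the study.
expert_scores = [2, 2, 1, 0, 2, 1, 1, 0, 2, 2]
rater_scores = [2, 2, 1, 0, 2, 1, 0, 0, 2, 2]

agreement = percent_agreement(rater_scores, expert_scores)
meets_criterion = agreement >= CRITERION
print(f"{agreement:.0f}% agreement; meets criterion: {meets_criterion}")
```

In this made-up example the rater disagrees with the expert on one of ten items, so agreement is 90% and the criterion is met.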

Concurrent Validity

During the concurrent validity study, AEPS-3 scores were compared across developmental areas with scores on the BDI-2, a norm-referenced test used to evaluate early childhood developmental milestones (Newborg, 2005). The BDI-2 assesses children from birth to 7 years in the personal-social, adaptive, motor, communication, and cognitive areas.

Pearson’s correlations were calculated two ways. The first was with AEPS-3 scores for all items (i.e., scores were calculated with goals and objectives for each domain) and BDI-2 scores. The second way was with AEPS-3 goals only (i.e., scores were calculated without objectives) and BDI-2 scores.
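The two ways of calculating the correlations can be sketched as follows. This is a minimal illustration of the Pearson coefficient, not the authors' analysis code. All scores and variable names are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical domain scores for six children; not data from the study.
aeps_all_items = [12, 18, 25, 31, 40, 44]  # goals and objectives
aeps_goals_only = [4, 7, 6, 10, 9, 13]     # goals only
bdi2_domain = [35, 41, 52, 55, 68, 70]

r_all = pearson_r(aeps_all_items, bdi2_domain)
r_goals = pearson_r(aeps_goals_only, bdi2_domain)
print(f"all items: r = {r_all:.2f}; goals only: r = {r_goals:.2f}")
```

The same function is applied to each scoring method, so any difference between the two coefficients reflects only which AEPS-3 items contribute to the domain score.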

Table 5 summarizes Pearson correlation coefficients between AEPS-3 scores for all items and BDI-2 scores. The results indicate positive correlations in the weak to moderate range (r = .31 to .65) between AEPS-3 and BDI-2 domain scores. All are statistically significant except the correlation between the AEPS-3 fine motor domain and the BDI-2 adaptive domain (r = .24). The larger coefficients are found when comparing similar domains across the two tests. For example, AEPS-3 social-communication scores correlate more strongly with BDI-2 communication scores (r = .63) than with BDI-2 motor scores (r = .35). Similarly, AEPS-3 literacy and math scores show moderate positive correlations with the BDI-2 cognitive domain (r = .60 and .55, respectively) and weaker correlations with the BDI-2 motor domain (r = .44 and .32, respectively).

Table 5.

Correlation Results for AEPS-3 Domain Scores and BDI-2 Domain Scores (N = 50).

AEPS-3 domain	BDI-2 Adaptive	BDI-2 Cognitive	BDI-2 Communication	BDI-2 Motor	BDI-2 Personal-Social	Age (in months)
Adaptive (α = .55)	.40**	.46**	.57**	.33*	.41**	.92**
Cognitive (α = .66)	.41**	.50**	.62**	.33*	.48**	.86**
Fine Motor (α = .39)	.24	.31*	.45**	.32*	.36**	.65*
Gross Motor (α = .61)	.34*	.46**	.57**	.44**	.38**	.72**
Literacy (α = .70)	.50**	.60**	.65**	.44**	.50**	.89**
Math (α = .70)	.41**	.55**	.62**	.32*	.42**	.87**
Social-Communication (α = .77)	.44**	.48**	.63**	.35*	.52**	.85**
Social-Emotional (α = .63)	.49**	.54**	.64**	.42**	.57**	.80**

Note. Reliability (α) is shown in parentheses after each domain name. AEPS-3 = Assessment, Evaluation, and Programming System–Third Edition; BDI-2 = Battelle Developmental Inventory–Second Edition.

*p < .05. **p < .01.
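The significance markers can be checked against the usual t test for a correlation coefficient, t = r√(N − 2) / √(1 − r²). The sketch below assumes a two-tailed test at α = .05 with N = 50 (df = 48). The article does not state which test was used, and the critical value is taken from a standard t table rather than from the article.

```python
import math


def t_for_r(r: float, n: int) -> float:
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


N = 50             # sample size reported in the table title
T_CRIT_05 = 2.011  # assumed: two-tailed critical t for alpha = .05, df = 48

for r in (0.24, 0.31):
    t = t_for_r(r, N)
    verdict = "significant" if abs(t) > T_CRIT_05 else "not significant"
    print(f"r = {r:.2f}: t = {t:.2f} ({verdict} at .05)")
```

Under these assumptions, r = .24 gives t of about 1.71, below the critical value, and r = .31 gives t of about 2.26, above it. This matches the unmarked .24 and the single asterisk on .31 in the Fine Motor row.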

Table 6 summarizes the Pearson correlation coefficients between AEPS-3 domain scores based on goals only and BDI-2 domain scores. The coefficients in Table 6 are lower than those in Table 5. For example, the correlation between AEPS-3 social-communication scores and BDI-2 communication scores drops to r = .35. Although statistically significant, this correlation is in the weak range, as opposed to the moderate range (r = .63) obtained when both goals and objectives are used to establish domain scores. Correlations across all domains were higher when all test items were included. The results also indicated a stronger correlation between age and AEPS-3 domain scores with all items (r = .65 to .92) than between age and AEPS-3 domain scores with goals only (r = .34 to .84).

Table 6.

Correlation Results for AEPS-3 Domain Scores From Goals Only and BDI-2 Domain Scores (N = 50).

AEPS-3 domain	BDI-2 Adaptive	BDI-2 Cognitive	BDI-2 Communication	BDI-2 Motor	BDI-2 Personal-Social	Age (months)
Adaptive (α = .55)	.16	.28	.31*	.11	.11	.80**
Cognitive (α = .66)	.31*	.37*	.50**	.21	.42**	.84**
Fine Motor (α = .39)	.13	.18	.32*	.12	.07	.59**
Gross Motor (α = .60)	.09	.24	.25	.05	−.03	.59**
Literacy (α = .70)	.23	.27	.32*	.09	.50**	.76**
Math (α = .70)	.19	.29*	.35*	.05	.09	.69**
Social-Communication (α = .77)	.31*	.32*	.35*	.10	.20	.78**
Social-Emotional (α = .63)	.13	.14	.29	.02	.14	.34*

Note. Reliability (α) is shown in parentheses after each domain name. AEPS-3 = Assessment, Evaluation, and Programming System–Third Edition; BDI-2 = Battelle Developmental Inventory–Second Edition.

*p < .05. **p < .01.

Discussion

Results of three studies of the psychometric properties of the AEPS-3 support the technical adequacy of the test with regard to utility, interrater reliability, and concurrent validity. These results are similar to those of other research on the technical adequacy of earlier versions of AEPS (e.g., Bailey & Bricker, 1986; Gao & Grisham-Brown, 2011; Grisham-Brown et al., 2008; Hamilton, 1995). Specifically, findings from the utility study showed that participants found the AEPS-3 useful for program planning and progress monitoring. Studies of earlier versions of the AEPS found similar results regarding the usefulness of the AEPS for developing functional and meaningful goals for children (e.g., Bricker & Pretti-Frontczak, 1997; Pretti-Frontczak & Bricker, 2000). Likewise, previous studies showed that AEPS providers could administer the assessment with acceptable levels of interrater reliability (i.e., 80% or above; e.g., Grisham-Brown et al., 2008). Finally, some concurrent validity results from the present study are similar to previous research on concurrent validity. For example, Gao & Grisham-Brown (2011) found a positive correlation between the social-communication domain of the AEPS and the communication domain of the BDI-2 (r = .60, p < .0001). In this study of the AEPS-3, the social-communication area was also positively correlated with the communication area of the BDI-2 (r = .63, p < .01). Although some correlations were weak (e.g., AEPS-3 fine motor and BDI-2 motor) or not statistically significant (e.g., AEPS-3 fine motor and BDI-2 adaptive), evidence from second edition studies indicates that results from the AEPS are similar to results obtained from norm-referenced assessments (e.g., Bricker et al., 2003, 2008; Hallam et al., 2014). The weak or nonsignificant correlations between the AEPS-3 and the BDI-2 may also be explained by the way the AEPS-3 is structured. For example, in the AEPS-3, some items previously in the fine motor area are now located in the literacy area (e.g., writing).

Despite the positive results, the studies had a number of limitations. First, the findings of the utility study should be viewed with caution due to the small number of teachers/providers who participated (i.e., 11). Other studies of treatment validity have also used small samples, however. For example, Gao & Grisham-Brown (2011) assessed teachers’ preferences for using the AEPS over a norm-referenced tool with only five teachers. Second, results of the interrater reliability study should be viewed with caution because teachers scored only 20% of the items on the test and did so via videotapes. Collecting interrater reliability on all items of a test the size of the AEPS-3, and doing so in vivo, would be prohibitive given that the test is generally administered over a long period of time. Third, results of the concurrent validity study should be viewed cautiously due to the relatively small number of children in the study and the age range of children tested. While the sample size is smaller than targeted, results show weak to moderate correlations that are significant for all but one domain pairing when all items are scored, suggesting that the AEPS-3 is a concurrently valid measure of children’s developmental skills and abilities. The restricted age range is an issue because results are available only for children 12 to 65 months of age, whereas AEPS-3 can be used for children from birth to 6 years of age. Data collection and analyses continue, and future research should confirm whether AEPS-3 domains have validity with BDI-2 subdomains in a larger group of children across a broader age range. Finally, the studies presented in this article do not provide a full picture of all of the psychometric properties that should be examined in a newly revised test. Future research is needed on the test–retest reliability and internal consistency of the AEPS-3, among other properties. Despite these limitations, this study supports earlier reports that AEPS is among the most technically adequate early childhood CBAs (Bagnato et al., 2010).

Footnotes

Authors’ Note

Rebecca Crawford is now affiliated with Eastern Kentucky University, Richmond, Kentucky, USA. Michael Toland is now affiliated with University of Toledo, Toledo, Ohio, USA.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Jennifer Grisham

References

Bagnato, S. J., Neisworth, J. T., & Pretti-Frontczak (2010). LINKing authentic assessment and early childhood intervention: Best measures for best practices (2nd ed.). Brookes.

Bailey & Bricker (1986). A psychometric study of a criterion-referenced assessment instrument designed for infants and young children. Journal of the Division of Early Childhood, 10(2), 124–134.

Bricker (Ed.). (2002). Assessment, evaluation, and programming system for infants and young children (2nd ed.). Brookes.

Bricker, Bailey, & Slentz (1990). Reliability, validity, and utility of the Evaluation and Programming System: For Infants and Young Children (EPS-I). Journal of Early Intervention, 14(2), 147–160.

Bricker, Clifford, Yovanoff, Pretti-Frontczak, Waddell, Allen, & Hoselton (2008). Eligibility determination using a curriculum-based assessment: A further examination. Journal of Early Intervention, 31(1), 3–21.

Bricker, Dionne, Grisham, Johnson, J. J., Macy, Slentz, K. L., & Waddell (in press). Assessment, Evaluation, and Programming System for Infants and Young Children, Third Edition (AEPS®-3). Baltimore, MD: Paul H. Brookes Publishing Co.

Bricker & Pretti-Frontczak (1997). A study of psychometric properties of the assessment, evaluation, and programming test for three to six years [Unpublished report]. University of Oregon, Center on Human Development, Early Intervention Program.

Bricker, Yovanoff, Capt, & Allen (2003). Use of a curriculum-based measure to corroborate eligibility decisions. Journal of Early Intervention, 26(1), 20–30.

Division for Early Childhood. (2014). DEC recommended practices in early intervention/early childhood special education 2014. http://www.dec-sped.org/recommendedpractices

Gao & Grisham-Brown (2011). The use of authentic assessment to report accountability data on young children’s language, literacy and pre-math competency. International Educational Studies, 4(2), 41–53.

Grisham-Brown & Pretti-Frontczak (2011). Assessing young children using blended practices. Brookes.

Grisham-Brown, Pretti-Frontczak, & Hallam (2008). Measuring child outcomes using authentic assessment practice. Journal of Early Intervention, 30(4), 207–281.

Hallam, Lyons, Pretti-Frontczak, & Grisham-Brown (2014). Comparing apples and oranges: The mismeasurement of young children through the mismatch of assessment purpose and the interpretation of results. Topics in Early Childhood Special Education, 34(2), 106–115.

Hamilton (1995). The utility of the assessment, evaluation, and programming system in the development of quality IEP goals and objectives for young children, birth to three, with visual impairments [Unpublished doctoral dissertation]. University of Oregon.

Hsia (1993). Evaluating the psychometric properties of the assessment, evaluation, and programming system for three to six years: AEPS test [Unpublished doctoral dissertation]. University of Oregon.

Macy, Bricker, Dionne, Grisham-Brown, Johnson, Slentz, Waddell, Behm, & Shrestha (2016). Content validity analyses of qualitative feedback on the revised Assessment, Evaluation, and Programming System for Infants and Children (AEPS) test. Journal of Intellectual Disability—Diagnosis and Treatment, 3(4), 177–186.

Macy, Bricker, & Squires (2005). Validity and reliability of a curriculum-based assessment approach to determine eligibility for part C services. Journal of Early Intervention, 28(1), 1–16.

National Association for the Education of Young Children. (2009). Developmentally appropriate practice in early childhood programs serving children from birth through age 8.

Newborg (2005). Battelle Developmental Inventory (2nd ed.). Riverside.

Notari & Drinkwater (1991). Best practice for writing child outcomes: An evaluation of two methods. Topics in Early Childhood Special Education, 11(3), 92–106.

Pretti-Frontczak & Bricker (2000). Enhancing the quality of Individualized Education Plan (IEP) goals and objectives. Journal of Early Intervention, 23(2), 92–105.

Straka (1994). Assessment of young children for communication delays [Unpublished doctoral dissertation]. University of Oregon.

Winchell (2011). A critical examination of the technical adequacy of a curriculum-based assessment using Rasch analyses [Doctoral dissertation] (Accession No. kent132199273). OhioLINK.
