Abstract
Rasch and classification analyses on a field-test version of the third edition of the Assessment, Evaluation, and Programming System (AEPS-3), a curriculum-based assessment used to assess young children birth to age 6 years, were conducted. First, an evaluation of the psychometric properties of data from each developmental area of an AEPS-3 field-test version was conducted. Next, cutoff scores at 6-month age intervals were created and then the validity of the cutoff scores was evaluated. Results using Rasch modeling indicated acceptable model fit statistics with reasonable reliability estimates within each developmental area. Classification results showed cutoff scores accurately classified a high percentage of eligible children. Findings suggest that scores from a field-test version of the AEPS-3 are reliable within developmental areas. To the extent allowed by state criteria, early childhood interventionists could possibly use a new field-test version of the AEPS-3 to determine or corroborate eligibility for special education services.
The Individuals with Disabilities Education Improvement Act (IDEA, 2004) provides funding for early intervention/early childhood special education services to young children and their families under Part C and Part B/Section 619 based on a set of criteria for determining eligibility. These criteria rely primarily on comparing a child’s development against a group of children who share similar characteristics (e.g., age and language). In large measure, although exceptions exist for children with identified medical or genetic conditions, children who demonstrate developmental competence consistent with their chronological age are not eligible for publicly funded services, whereas those who do not demonstrate such competence are eligible to receive services (IDEA).
Historically, most states have relied on the use of conventional tests to determine young children’s eligibility for services (Danaher, 2011; IDEA, 2004; Macy et al., 2015; Shackelford, 2006). However, this practice has been criticized (de Sam Lazaro, 2017). First, administration of standardized tests must be undertaken by specialized personnel, which is costly (Bricker et al., 2003; Neisworth & Bagnato, 2004). Second, results are not useful for planning subsequent intervention or teaching content (Bagnato et al., 2011; Macy et al., 2005). Third, these tools are not likely to be sensitive to disability, cultural, or linguistic differences (de Sam Lazaro, 2017; Macy et al., 2015). Fourth, conventional test procedures do not permit modifications or adaptation for children with disabilities and thus may not accurately reflect the child’s skill repertoire (Bagnato, 2007; Macy et al., 2015). Finally, conventional tools are not typically administered in the child’s “natural” environment and often lack input from parents and primary caregivers (Bagnato, 2005; Macy et al., 2005).
Increasingly, calls are being made for more functional approaches to determining eligibility (cf. Division for Early Childhood [DEC] recommended practices, 2014). One of the more promising alternatives is the use of curriculum-based assessments (CBAs), which link test items to intervention or teaching content (Bagnato et al., 2011; Macy et al., 2005). Such measures can be reliably completed by service providers (e.g., teachers and interventionists) as children engage in play and daily activities (Bricker, 2002). In addition, many of these measures are designed to seek and include parent or caregiver input (DEC, 2014; Grisham et al., 2020; National Association for the Education of Young Children, 2009). Furthermore, these tools produce outcomes that are directly applicable to selecting Individualized Education Program (IEP) or Individualized Family Services Plan (IFSP) goals and intervention/teaching content (Johnson et al., 2015). These features make CBAs appealing to service providers for determining eligibility for services. Yet, their use for this purpose requires psychometric evidence that they accurately and reliably discriminate between typically developing children and those whose development is significantly delayed or deviant and requires intervention.
The Assessment, Evaluation, and Programming System for Infants and Children (AEPS®; Bricker, 2002) is a CBA that meets recommended practices and federal guidelines for assessing young children for Part C and Part B/Section 619 services (DEC, 2014; IDEA, 2004). The AEPS is one of the more widely used and studied CBAs in the field of EI/ECSE (Bagnato et al., 2010). As an authentic assessment, the AEPS is used to gather information through observations of children in their natural environments by familiar adults. Items on the AEPS test include skills that are functional and essential for children to participate in daily activities and routines, thereby making them viable goals on an IFSP or IEP. AEPS Test items link directly to curricular content for teaching and the assessment allows for monitoring child progress.
The initial development of the AEPS began in the mid-1970s to address the need for an appropriate tool for assessing young children with disabilities (Bricker, 2002). Since its inception, evidence has been built around the validity, reliability, and utility of scores derived from the AEPS (Bricker, 2002; Grisham et al., 2020). The second edition of the AEPS test (Bricker, 2002) has been used to determine eligibility of young children for services (Bricker et al., 2003, 2008). To determine eligibility status, AEPS data were collected on a large sample of children with and without disabilities (Bricker et al., 2008). Using these data, a Rasch analysis was conducted to establish cutoff scores for designated age intervals. These cutoff scores were then used to determine or corroborate eligibility for IDEA services.
Several revisions have been incorporated in the third edition of the AEPS (AEPS-3) designed to improve its usefulness to EI/ECSE professionals and caregivers, and to clarify procedures for use of the system (Bricker et al., in press). The underlying philosophy and framework for the system and its basic goals remain unchanged. The changes made to the AEPS system require collection of data to re-examine the validity, reliability and utility of the various components.
In this study, we evaluated the scales of a field-test version of the AEPS-3 and examined its use to determine eligibility for IDEA services. Establishing AEPS-3 cutoff scores required (a) collection of AEPS-3 performance and age data from samples of children with and without disabilities, (b) use of a Rasch model to calibrate the AEPS age-specific measures with a sample of typically developing children, and (c) use of measurement calibration and conditional standard errors to specify cutoff scores. The following research questions guided our work: (1) What are the psychometric properties of each developmental area of the AEPS-3? (2) What is the validity of the derived cutoff scores for birth to 6 years used to categorize children according to eligibility status?
Method
Sample Description
Data were collected from 874 children, of whom 47.1% (n = 412) were children without disabilities (ineligible) and 52.9% (n = 462) were children receiving IDEA services (eligible). Children ranged in chronological age from 2 months to 6 years and 11 months and consisted of 515 boys and 359 girls. The ineligible group included children who had passed a developmental screener (e.g., Ages and Stages Questionnaire, Squires & Bricker, 2009; Developmental Indicators for the Assessment of Learning, Mardell-Czudnowski & Goldenberg, 2011) and were not receiving IDEA services. The sample of ineligible and eligible children was recruited from states currently using the AEPS, based on data provided by the publisher of the AEPS (Brookes Publishing). Within those states, programs using the AEPS were identified by the publisher, and invitations were sent to program directors. Program types included home visiting, parent/toddler groups, child care centers, publicly funded prekindergarten programs, and Head Start programs. The final set of programs that participated were from seven states (Kansas, Kentucky, Ohio, Oregon, Tennessee, Texas, and Virginia). Each child was assessed by only one assessor, but most of the assessors (N = 131) completed the AEPS-3 for multiple children (M = 6.76, Mdn = 5, Mode = 4, Min = 1, Max = 36, SD = 5.16) during their participation in the study.
Field-Test Version of the AEPS-Third Edition
The field-test version of the AEPS-3 consists of eight developmental/content areas: (a) fine motor (FM; 8 items), (b) gross motor (GM; 15 items), (c) adaptive (15 items), (d) literacy (15 items), (e) math (12 items), (f) social-communication (15 items), (g) social-emotional (19 items), and (h) cognitive (18 items). Each area contains a set of strands that consist of a series of related items, referred to as goals, with accompanying objectives, arranged hierarchically as judged by the developers of the AEPS, with specific performance criteria for each item. The recommended method for collecting data to complete the AEPS is to observe a child in their usual environment. Each item is scored on a partial credit system ranging from 0 to 2. A score of 0 indicates that a child was not yet able to perform or meet the stated criteria. A score of 1 indicates that a child (a) partially met the specific criterion, (b) needed assistance, (c) inconsistently performed a skill, and/or (d) showed an emerging skill. A score of 2 indicates that a child consistently and independently met the goal criteria. A brief description of the AEPS field-test version items by developmental area is provided in online Supplemental Table S1. A detailed description of the changes that occurred in the AEPS-3 is provided in the commercially published AEPS materials (Bricker et al., in press).
Procedure
Once consent was obtained from participants and parents of children selected for participation, teachers/providers were given access to an online portal that contained all field-test materials. The portal contained a training module that described changes to the third edition, an efficacy of training check (Do participating teachers/providers score video vignettes similar to expert AEPS evaluators?), and a data entry form. Participants had access to PDFs of the child observation form that they could print and use to collect data on paper before entering it online.
To increase the reliability of the data collected for the study, all participants (assessors) who collected data were required to first view a 1-hour training module. The training module included a narrated slide presentation that described (a) the differences between the second and third editions and (b) the scoring rules and guidelines. In addition, the training module included embedded video clips that provided opportunities for participants to practice scoring the field-test version of the AEPS-3 items.
Once teachers/providers completed the training module, they were given access to an interrater agreement test. A series of 37 video clips of young children from birth to 6 years of age was included in the interrater test. The video clips contained 68 items across the eight areas that were scored by AEPS-3 authors for purposes of establishing the gold standard for reliability for those items. Participants watched the videos and then scored items embedded into the clip using the 2, 1, 0 scoring criteria. Participants could take up to 4 hours to take the test and then resume where they left off after 24 hours if they had not completed the test. After scoring all of the videos, participants received immediate feedback on their test results. Those who received 80% or higher on the interrater agreement test were allowed to move forward with data collection. If they did not reach the 80% criterion, participants were allowed to take the test as many times as they wanted until they reached 80%. Results of the interrater agreement study showed all participants reached 80% and that the average score following the first attempt was 89.79% (range = 66%–100%; Grisham et al., 2020).
After completing the interrater agreement test, teachers/providers were asked to collect observational data on children with whom they worked directly and for whom they had consent. Observations were to occur in the child’s natural environment (i.e., home or school). After making observations, teachers/providers compared their observations to the items on the field-test version of the AEPS-3 and assigned a score to the items they observed. This process (i.e., observation and score assignment) continued until all items were scored. Scores were entered into an online data entry form located on a secure website. Teachers/providers used unique codes assigned to each child participant to ensure the confidentiality of the child.
Data Analysis
The first step in the data analysis was to ensure the quality of items for each of the eight developmental areas. Specifically, a unidimensional Rasch rating scale model (RSM; cf. Andrich, 1978) analysis was conducted on the set of items within each area for the entire sample (i.e., both ineligible and eligible children were used for calibrating the AEPS-3 items) using Winsteps 4.5.3 (Linacre, 2020b). The rationale for using the RSM follows the recommendations of Linacre (2000). Of note, we did not take the nesting and rater variance into account and therefore considered each completed AEPS-3 as an independent observation when conducting analyses.
An underlying assumption of the RSM is unidimensionality. Accordingly, a principal component analysis of the standardized residuals (PCAR) was conducted for each developmental area to evaluate Linacre’s (2003, 2014, 2020a) three criteria for fundamental unidimensionality. First, the variance explained by the measure should be higher than 50%. Second, the eigenvalue of the first component of the standardized residuals should be less than 2.0. Third, the ratio of the variance explained by the Rasch dimension to the variance explained by the first contrast of the residuals should be high. If any concerns were identified from the previous steps regarding unidimensionality, a further inspection of the items at the top and bottom of the first contrast of the standardized residuals was conducted. Using recommendations provided by Bond and Fox (2007), if no apparent item clustering at high or low loadings can be identified, then unidimensionality can be deemed tenable. If, however, such clustering was observed, then an item content analysis was conducted to determine if there were meaningful construct relevant differences.
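The three screening criteria described above can be sketched as a small check applied to the PCAR summary for one developmental area. This is an illustrative sketch only: the class and function names are hypothetical, and because the text says only that the third ratio "should be high," the `min_ratio` default of 4.0 is an assumed threshold, not a value taken from Linacre or from the study.

```python
from dataclasses import dataclass


@dataclass
class PcarSummary:
    """PCAR summary for one developmental area (hypothetical container)."""
    variance_explained_pct: float      # variance explained by the measures, in percent
    first_contrast_eigenvalue: float   # eigenvalue of the first residual contrast
    rasch_to_contrast_ratio: float     # Rasch variance / first-contrast variance


def unidimensionality_flags(summary, min_ratio=4.0):
    """Evaluate the three criteria; min_ratio is an assumed threshold."""
    return {
        "variance_explained_gt_50": summary.variance_explained_pct > 50.0,
        "eigenvalue_lt_2": summary.first_contrast_eigenvalue < 2.0,
        "ratio_high": summary.rasch_to_contrast_ratio >= min_ratio,
    }


def needs_contrast_inspection(summary, min_ratio=4.0):
    """True when any criterion fails, i.e., when the items at the top and
    bottom of the first contrast should be inspected for clustering."""
    return not all(unidimensionality_flags(summary, min_ratio).values())
```

An area that fails any criterion is not rejected outright; under the procedure described above it is passed on to the inspection of item clustering in the first contrast and, if needed, to an item content analysis.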
To determine if items were performing in a manner consistent with the RSM, Infit and Outfit item indices were assessed. Items with Infit mean-square residual values or Outfit mean-square residual values outside of 0.5 to 1.5 were flagged as poor-fitting items (Bond & Fox, 2007; Boone, 2016) and considered for removal one item at a time (iteratively), until all items had acceptable fit to the RSM. A flagged item was retained if any of the following was true: (a) removing the item jeopardized the breadth of content in the already brief length of most of the developmental areas, (b) the degree of misfit for the item was considered tolerable (i.e., close to our expected Infit and Outfit range), (c) the item was in the upper or lower end of the developmental continuum, because prior research has suggested that more items on the AEPS are needed to ensure decisions can be made with more confidence in the tails of the developmental continuum, or (d) the item was deemed important for programming purposes. Once items were calibrated, person reliability and item reliability were evaluated. Person reliability measures the degree to which a developmental area test separates the children in the sample into enough levels, with values above 0.8 preferred. Item reliability measures how spread out items are along a given developmental area test continuum, with values above 0.9 preferred.
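For readers less familiar with these indices, the sketch below shows how Infit and Outfit mean-square values are conventionally computed for a single item and how the 0.5 to 1.5 screen can be applied. It is not the Winsteps implementation. The function names are hypothetical, the model-expected scores and model variances are assumed to come from a calibrated Rasch model, and treating values exactly at 0.5 or 1.5 as acceptable is an assumption.

```python
def item_fit(observed, expected, variance):
    """Return (infit, outfit) mean-square statistics for one item.

    observed: scored responses (0, 1, or 2), one per child
    expected: model-expected score for each child on this item
    variance: model variance of each child's response on this item
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals.
    outfit = sum(r / w for r, w in zip(sq_resid, variance)) / len(sq_resid)
    # Infit: information-weighted mean square (residuals weighted by variance).
    infit = sum(sq_resid) / sum(variance)
    return infit, outfit


def is_flagged(infit, outfit, low=0.5, high=1.5):
    """Flag an item when either index falls outside the low-to-high range."""
    return not (low <= infit <= high and low <= outfit <= high)
```

Because Outfit is an unweighted average, a few highly unexpected responses from children far from the item's location can inflate it, whereas Infit downweights those responses.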
The next step in the analysis was to create cutoff scores for each developmental area test at 6-month age intervals. Three-month age intervals were also considered, but the smaller sample sizes observed for narrower age intervals would have led to less stable cutoff scores. First, the average level of performance (observed score) on an area test for the ineligible group within each age interval was found. Second, using the observed score (rounded to the nearest integer), we identified the corresponding person (Rasch) measure and associated standard error of estimate (SEE). Third, the lower limit of the 95% confidence interval (CI) computed with the SEE was found (i.e., measure minus 1.96 times SEE) and the corresponding observed cutoff score was identified. This process was repeated for each age interval within each area test.
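The three steps above can be sketched as follows. This is a hedged illustration, not the authors' code: the score-to-measure table is invented for the example, the function names are hypothetical, and two details the text leaves open are filled with labeled assumptions (means are rounded half up, and the lower limit is mapped back to the largest raw score whose measure does not exceed it, falling back to the lowest raw score when none qualifies).

```python
import math

# Hypothetical raw-score-to-measure table for one area test:
# raw score -> (Rasch measure in logits, standard error of estimate).
EXAMPLE_TABLE = {
    0: (-4.0, 1.8), 1: (-2.8, 1.0), 2: (-2.0, 0.8), 3: (-1.3, 0.7),
    4: (-0.7, 0.6), 5: (-0.2, 0.6), 6: (0.3, 0.6), 7: (0.9, 0.7),
    8: (1.6, 0.8),
}


def round_half_up(x):
    """Round to the nearest integer, with halves rounded up (assumption)."""
    return math.floor(x + 0.5)


def cutoff_score(ineligible_scores, score_table, z=1.96):
    """Derive a raw cutoff score for one area test and one age interval."""
    # Step 1: average observed score of the ineligible group.
    mean_score = sum(ineligible_scores) / len(ineligible_scores)
    # Step 2: measure and SEE at the rounded observed score.
    measure, see = score_table[round_half_up(mean_score)]
    # Step 3: lower limit of the 95% CI, mapped back to a raw score.
    lower_limit = measure - z * see
    at_or_below = [s for s, (m, _) in score_table.items() if m <= lower_limit]
    return max(at_or_below) if at_or_below else min(score_table)
```

With the example table, `cutoff_score([6, 7, 5, 6, 7], EXAMPLE_TABLE)` gives a lower limit of 0.3 − 1.96 × 0.6 = −0.876 logits and a cutoff of 3. When the group mean is near the floor of the test, the lower limit can fall below every tabled measure, which under this sketch yields a cutoff of zero.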
Once cutoff scores for each developmental area by age interval were created, a classification analysis was conducted to create eligibility classification criteria to identify children who were not performing at the level expected of their same-age, typically developing peers (i.e., the ineligible group). Using the entire sample, we classified children as eligible if they had at least two AEPS-3 area scores at or below the cutoff scores for their age interval; those who did not were classified as ineligible. To examine the validity of the cutoff scores from the field-test version of the AEPS-3, sensitivity, specificity, false positive, false negative, accuracy, underidentification, and overidentification were calculated by cross-tabulation between known eligibility status, provided prior to data analysis, and identified eligibility classification based on the AEPS-3 cutoff scores. In addition, sensitivity and specificity were analyzed by means of a receiver operating characteristic (ROC) curve analysis. Area under the curve (AUC) values were tested for statistical significance and were considered excellent for AUC values between 0.9 and 1, good for AUC values between 0.8 and 0.9, fair for AUC values between 0.7 and 0.8, poor for AUC values between 0.6 and 0.7, and failed for AUC values between 0.5 and 0.6 (Metz, 1978).
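The classification rule can be sketched as below. The age binning follows the convention given in the note to the cutoff score table (0–6 = 0 through 6.49 months, 7–12 = 6.5 through 12.49 months, and so on up to 72.49 months). The function names, the plain-hyphen interval labels, and the cutoff values in the usage example are hypothetical and are not the published cutoffs.

```python
import math


def age_interval(age_months):
    """Map chronological age in months to a 6-month interval label."""
    if age_months < 0 or age_months >= 72.5:
        raise ValueError("age outside the 0-72 month range of the cutoff table")
    # Intervals break at 6.5, 12.5, ..., so shift by 0.5 before dividing.
    index = max(0, math.floor((age_months - 0.5) / 6))
    return "0-6" if index == 0 else f"{6 * index + 1}-{6 * index + 6}"


def classify(area_scores, area_cutoffs, min_areas=2):
    """Classify a child from area raw scores and same-interval cutoffs.

    Returns the classification and the list of flagged areas. A child is
    'eligible' when at least min_areas scores are at or below the cutoff.
    """
    flagged = [area for area, score in area_scores.items()
               if area in area_cutoffs and score <= area_cutoffs[area]]
    label = "eligible" if len(flagged) >= min_areas else "ineligible"
    return label, flagged
```

For example, with hypothetical cutoffs `{"FM": 9, "GM": 20, "Adaptive": 18, "Math": 6}` for the 43-48 interval, a child scoring `{"FM": 9, "GM": 25, "Adaptive": 17, "Math": 10}` is flagged on FM and Adaptive and is classified as eligible. Lowering `min_areas` to 1 trades more false positives for higher sensitivity.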
Results
Sample size per age interval for the combined sample ranged from 16 (0–6 months) to 122 (43–48 and 55–60 months; see online Supplemental Table S2). For the combined sample, most correlations between developmental area scores and chronological age (see online Supplemental Table S3) ranged from .77 (Social-Emotional) to .91 (Adaptive), except for Literacy (r = .66); all were statistically significant (p < .001; two-tailed). Although not hypothesized, these correlations are consistent with expectations that scores across all areas will improve as children grow older. Of note, correlations among developmental area scores ranged from .30 (Literacy and Social-Communication) to .91 (Cognitive with Adaptive).
Dimensionality Assessment
PCARs (see online Supplemental Table S4) conducted for each developmental area of the field-test version of the AEPS-3 showed a comparison of the raw variance explained by items with the unexplained variance in the first contrast had ratios that were not considered a concern for FM, GM, Adaptive, and Cognitive, but there was some concern for Literacy, Math, Social-Communication, and Social-Emotional. That is, the Rasch (primary) dimension within FM, GM, Adaptive, and Cognitive areas tended to dominate about 4.1 (Cognitive) to 11.6 (Gross Motor) times the secondary dimension, but the ratios for Literacy (3.5), Math (2.0), Social-Communication (3.8), and Social-Emotional (2.7) were concerning because this suggests that there may still be an unexplained dimension in each of these areas. Similarly, results show that the data for FM, GM, and Adaptive fulfill the criteria of Linacre (2020a) where variance explained by the measures should be >50% (i.e., 73.8% for GM to 74.9% for Adaptive) and the eigenvalue of the first component of the residuals should be <2.0 (i.e., 1.5 for FM and GM to 1.9 for Adaptive), but this was not upheld for Literacy, Math, Social-Communication, Social-Emotional, and Cognitive. Therefore, a further inspection of the items at the top and bottom of a standardized residual contrast 1 plot (not reported) was conducted for each area. Results show no apparent item clustering at high or low loadings for FM, GM, Adaptive, Literacy, and Social-Communication, but this is not necessarily true for Math, Social-Emotional, and Cognitive. However, for these areas, an item content review revealed no construct relevant differences between items at the top versus bottom of the contrast plot. Thus, considering all the evidence regarding dimensionality, results suggest that each developmental area of the field-test version of the AEPS-3 can be considered fundamentally unidimensional.
Model Fit Assessment
Tables 1 and 2 provide Infit and Outfit item indices across the eight developmental areas of the AEPS-3. After removing ill-fitting items using an iterative procedure, Infit indices for all items across all developmental areas fell within the acceptable range of 0.5 to 1.5. For example, consider the developmental area of Literacy, which originally consisted of 15 items. After the initial analysis, three items were flagged as ill-fitting (LITD1.0, LITA1.0, and LITE1.0; see online Supplemental Table S1 for a description of each item) because their Infit values fell beyond the acceptable range. Because LITD1.0 had the largest Infit value, it was removed and the items were recalibrated. In the second calibration of the Literacy items, the LITA1.0 item was flagged as the worst fitting item because its Outfit value fell beyond the acceptable range, so it was removed and the items were recalibrated once again. This time, the LITE3.0 item fell beyond the acceptable range. Once the LITE3.0 item was removed, all remaining items for the Literacy developmental area fell within the acceptable range for Infit indices. This same iterative procedure was repeated for each developmental area.
Table 1. Final Item Calibration and Fit Results by Developmental Area for the Field-Test Version of the AEPS-3.
Note. Infit = infit mean-square residual index; Outfit = outfit mean-square residual index; AEPS-3 = Assessment, Evaluation, and Programming System Test–Third Edition.
ᵃItem considered for removal during the item analysis process because Outfit statistic fell outside 0.5 to 1.5 but retained because Outfit is known to be overly sensitive to unexpected responses.
Table 2. Category Counts, Average Measures, Threshold Measures, and Fit Results by Developmental Area for the Field-Test Version of the AEPS-3.
Note. AEPS-3 = Assessment, Evaluation, and Programming System Test–Third Edition.
Regarding Outfit indices, most items had values within an acceptable range, but nine items across the eight developmental areas had Outfit values that fell just outside the acceptable range. An inspection of the ill-fitting items (flagged with a superscript a in Table 1) revealed that poor fit tended to occur when some children (<5%) with high abilities unexpectedly received a 0 or 1 score instead of 2 for some easy to approximately average difficulty items. However, Outfit is known to be overly sensitive to unexpected responses (i.e., a response from a person well beyond or below where an item measure is located; Linacre, 2002). To evaluate the impact of misfit, we replaced suspect responses with a missing value and ran a sensitivity analysis to see how the results changed. These sensitivity analyses showed that Outfit indices fell within the acceptable range. Based on this inspection, it was deemed that these ill-fitting items could be retained in the final item set for a field-test version of the AEPS-3. The final number of items retained for a field-test version of the AEPS-3 is provided in parentheses in Table 1 along with the original number of items (in brackets) analyzed per developmental area. Online Supplemental Table S1 also displays the original items considered for analysis.
Reliability Assessment
A review of the reliability measures (see online Supplemental Table S5) shows person reliability measures ranging from .74 (Fine Motor) to .91 (Adaptive); all areas except Fine Motor fall above the preferred level of .80. Item reliability measures for all AEPS-3 developmental areas were .99, which indicates that each test has a set of items that are adequately spread out across a given developmental area continuum (above .90 is preferred).
Eligibility Classification Assessment
Table 3 provides the cutoff scores by age intervals and developmental area for the field-test version of the AEPS-3. Based on the cutoff scores and the classification procedures outlined in the Data Analysis section, Table 4 summarizes the eligibility classification rates for sensitivity, specificity, false positive, false negative, accuracy, underidentification, and overidentification. Sensitivity ranged from a low of 57% (61–66 months age interval) to a high of 100% (0–6 months and 7–12 months age intervals). Specificity ranged from a low of 0% (0–6 months and 7–12 months age intervals) to a high of 81% (67–72 months age interval). False-positive rates (ranging from 19% to 87%) were generally higher than false-negative rates (ranging from 6% to 43%). Accuracy ranged from 31% to 84%.
Table 3. Cutoff Scores by Age Intervals and Developmental Area for the Field-Test Version of the AEPS-3.
Note. Age intervals in months were created as follows: 0–6 = 0 thru 6.49; 7–12 = 6.5–12.49; . . . 67–72 = 66.5–72.49. AEPS-3 = Assessment, Evaluation, and Programming System Test–Third Edition.
Table 4. Eligibility Classification Accuracy by Age Intervals for the Field-Test Version of the AEPS-3.
Note. Sensitivity = a/(a + c); specificity = d/(b + d); false-positive rate = b/(a + b); false-negative rate = c/(c + d); accuracy = (a + d)/(total N); underidentification = c/(total N); overidentification = b/(total N); a = true positive; b = false positive; c = false negative; d = true negative. All percentages are rounded to the nearest integer. AEPS-3 = Assessment, Evaluation, and Programming System Test–Third Edition.
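The rates defined in the note to Table 4 can be computed from the four cells of the cross-tabulation as sketched below. The function name is hypothetical. The sketch takes b as false positives and c as false negatives, the assignment under which a/(a + c) is sensitivity, and it follows the note's own definitions of the false-positive and false-negative rates (shares of the children classified eligible and ineligible, respectively), which differ from the more common 1 − specificity and 1 − sensitivity. Returning `None` for a zero denominator is an assumption.

```python
def classification_rates(a, b, c, d):
    """Compute classification rates from a 2 x 2 cross-tabulation.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    total = a + b + c + d

    def ratio(numerator, denominator):
        # Undefined rates (zero denominator) are reported as None.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(a, a + c),
        "specificity": ratio(d, b + d),
        "false_positive_rate": ratio(b, a + b),
        "false_negative_rate": ratio(c, c + d),
        "accuracy": ratio(a + d, total),
        "underidentification": ratio(c, total),
        "overidentification": ratio(b, total),
    }
```

When every child in an age interval is classified as eligible (c = d = 0), sensitivity is 1 and specificity is 0, and the false-negative rate is undefined.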
Finally, the areas under the ROC curve were statistically significant (p < .05) and had values that could be considered good to fair for most age intervals (13–18 age interval AUC = .77; 19–24 age interval AUC = .82; 25–30 age interval AUC = .73; 31–36 age interval AUC = .87; 37–42 age interval AUC = .87; 43–48 age interval AUC = .80; 49–54 age interval AUC = .78; 55–60 age interval AUC = .69; 61–66 age interval AUC = .71; 67–72 age interval AUC = .78). The AUC results indicate some promising applicability of the field-test version of the AEPS-3 as a screening test for IDEA services. However, the AUC values for the bottom two age intervals (0–6 and 7–12 age intervals; AUC = .68, p = .06) were not significant and could be described as poor. Depending on the number of false positives that are acceptable, the optimal (maximizing sensitivity) total number of developmental areas that should be flagged for services tends to vary between 1 and 2. Interestingly, the sensitivity rate for the four oldest age intervals can be increased to at least .80 by changing the number of developmental areas flagged from 2 to 1, but at the expense of doubling the false-positive rate.
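One way to picture the ROC analysis is to treat the number of developmental areas at or below their cutoffs as the score and to vary the number of flagged areas required for eligibility. The sketch below computes the AUC for such a count using the rank-based (Mann-Whitney) formulation and labels it with the bands given in the Data Analysis section. Using the flagged-area count as the ROC predictor is an assumption made for illustration, the function names are hypothetical, and because the stated bands share endpoints, treating each lower bound as inclusive is also an assumption.

```python
def auc_from_counts(eligible_counts, ineligible_counts):
    """AUC as the probability that a randomly chosen eligible child has
    more flagged areas than a randomly chosen ineligible child, counting
    ties as one half."""
    wins = 0.0
    for e in eligible_counts:
        for i in ineligible_counts:
            if e > i:
                wins += 1.0
            elif e == i:
                wins += 0.5
    return wins / (len(eligible_counts) * len(ineligible_counts))


def describe_auc(auc):
    """Label an AUC value with the descriptive bands used in the text."""
    for lower_bound, label in [(0.9, "excellent"), (0.8, "good"),
                               (0.7, "fair"), (0.6, "poor"), (0.5, "failed")]:
        if auc >= lower_bound:
            return label
    return "below 0.5 (outside the bands given in the text)"
```

For instance, flagged-area counts of `[3, 2, 4, 1]` for eligible children and `[0, 1, 2, 0]` for ineligible children give an AUC of 14/16 = .875, which falls in the good band.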
Discussion
In previous editions of the AEPS, psychometric evidence has been provided to support the reliability, internal structure, and eligibility classification accuracy of AEPS test scores (Bricker et al., 2003, 2008). Our analyses provided evidence for the underlying unidimensional structure of the final set of items within each AEPS developmental area test. Almost all 115 items in the field-test version of AEPS-3 had acceptable fit to the Rasch model within each developmental area. The final number of items retained after Rasch analyses was 93 (see Table 1 and online Supplemental Table S1). Furthermore, support for the reliability of a field-test version of the AEPS-3 developmental area test items was demonstrated, as all person reliabilities were at or above .74. The moderate to high person reliabilities for each developmental area suggest that the area tests could, if needed, each be used in isolation for other educational and clinical uses. Scores within each AEPS developmental area could be used for individual child evaluation. We also observed a moderate to strong relation between AEPS scores and chronological age (in months) within each developmental area. This finding is consistent with previous editions of the AEPS (Bricker et al., 2003, 2008).
In this study, we created cutoff scores at 6-month age intervals for each of the eight developmental areas of a field-test version of the AEPS-3. Then, the accuracy (validity) of the cutoff scores in identifying eligibility among children with and without disabilities was examined. Findings show the eligibility classification accuracies were consistent with those reported by Bricker et al. (2008), but caution should be used when making direct comparisons because of the changes across editions. As shown in Table 4, the cutoff scores for a field-test version of the AEPS-3 identified a high percentage of children who were eligible for services as eligible (sensitivity ranged from 57% to 100%) and a moderate to high percentage of children who were not eligible for services as not eligible (specificity ranged from 0% to 81%). Sensitivity was high (>80%) for most of the age intervals up through 48 months but tended to decrease for older age intervals (49–72 months). However, classification rates should be considered with caution for the 0- to 6-month and 7- to 12-month age intervals, given the small sample size within intervals and the poor AUC values. At these younger age intervals, a field-test version of the AEPS-3 tends to overidentify children as eligible for services. Per recommendation from an anonymous reviewer, we conducted a more nuanced examination of the overidentification classification accuracy by developmental area to provide insights for intended score use (i.e., eligibility determination). In doing so, we did not find any particular developmental area that consistently resulted in overidentification. However, we did confirm the overidentification we found at the lower age intervals in the overall classification results in Table 4, which is most likely due to the smaller sample sizes at these age intervals.
Therefore, in most cases, the cutoff scores generated herein will infrequently miss a child who is eligible for services, but they will identify more children as needing services. It is important to keep in mind that maximizing sensitivity (i.e., the probability that the cutoff scores approach will identify a child as eligible when they are known to be eligible) is desirable but comes at the expense of increasing false positives (i.e., identifying children as eligible for services when they may not need them). That is, a teacher who wishes to avoid missing children needing services and who is not concerned about false positives might prefer to use higher cutoff scores at the specific developmental area or use a lower total number of developmental areas to be flagged for services. These findings are consistent with other studies that have examined whether the AEPS cutoff scores result in similar decisions about eligibility to those generated from other assessment approaches (Hallam et al., 2014).
Limitations and Future Research
This study has several limitations worth mentioning. First, the study used a convenience sample of children rather than a random sample of eligible and ineligible children, which would help to generalize the findings. Future research should attempt to draw a larger, stratified random sample of eligible and ineligible children. Second, the sample sizes per age interval varied widely and limited our ability to make more robust decisions about the classification accuracy analyses. Future research should attempt to increase the sample size, especially at the lower age intervals, to increase our ability to assess the accuracy of the cutoff scores with the lowest age intervals. Third, although the Rasch analyses retained most items within each developmental area AEPS test, future research should be conducted to ascertain whether additional items could be added to the AEPS to improve the sensitivity of the test at differentiating children across the different developmental areas. A fourth issue is that several of the developmental area cutoff scores are zero. This suggests that there is some restriction in precision if performance “below 1” cannot be discerned within a developmental area. As expected, most of the cutoff scores of zero tended to belong to the lower two age intervals (0–6 and 7–12), except that cutoff scores of zero were also observed for Math at age intervals 13 to 18, 19 to 24, and 25 to 30 months. These cutoff scores of zero for Math result from the fact that the items written in the Math area were not intended to be used for assessing very young children. In addition, foundational math skills that young children might demonstrate are found in the cognitive area of the AEPS-3 (e.g., discrimination, classification, and comparisons). Moreover, overidentification and accuracy tended to be inflated and deflated, respectively, for the bottom three age intervals. These results further stress the restriction of the cutoff scores with younger age intervals.
Another limitation of this study is the use of 6-month cutoff scores. Although 3-month cutoff scores were examined, the sparseness of children in particular age bins worsened the classification accuracy estimates and made their accuracy questionable. Therefore, future research should test the sensitivity of cutoff scores at varying age bins with larger sample sizes, which would allow more nuanced decisions. Relatedly, it is not entirely certain whether the 6-month age span for creating cutoff scores is reasonable or whether it jeopardizes decisions about identifying a child as eligible, particularly for very young children. For the AEPS (second edition), Level 1 (birth to 3 years) cutoff scores were established at 3-month intervals (Bricker et al., 2008). Because the AEPS-3 is not divided into levels, 3- and 6-month cutoff scores could not be clearly delineated based on test level. Because development changes rapidly in very young children, evaluators should interpret results for that age group cautiously. As with all eligibility decisions, multiple sources of data should be collected and considered (Grisham-Brown & Pretti-Frontczak, 2011).
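To illustrate why sparse age bins make classification accuracy estimates hard to trust, the sketch below computes a Wilson score interval for an estimated sensitivity at two bin sizes. This is an illustration only and not an analysis performed in this study: the function name `wilson_interval` and the counts (4 of 5 versus 40 of 50 eligible children correctly classified) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (default z gives about 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: the same observed sensitivity (.80) in a sparse and a fuller age bin.
sparse_lo, sparse_hi = wilson_interval(4, 5)
full_lo, full_hi = wilson_interval(40, 50)

print(f"n = 5:  [{sparse_lo:.2f}, {sparse_hi:.2f}]")
print(f"n = 50: [{full_lo:.2f}, {full_hi:.2f}]")
```

With the same observed proportion, the interval for the sparse bin is several times wider than the interval for the fuller bin, which is one way to see why narrower age bins require larger samples.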
We also found that the correlation between Cognitive and Adaptive developmental area scores was quite high (r = .91). A correlation of this size means that a child who scores low in one area is very likely to score low in the other. This raises the question of whether these two developmental areas should be treated as a single entity. Finally, a different approach to setting cutoff scores than the one used here should be considered. One option would be to include all developmental area tests in a single ROC analysis and then find the combination of developmental area scores that optimizes sensitivity while allowing a tolerable rate of false positives. In general, future research should be conducted to further establish the credibility of these findings.
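The alternative cutoff-setting approach described above can be sketched as a constrained search. The example below is a minimal illustration under stated assumptions, not the procedure used in this study: it assumes a child is flagged when any area score is at or below that area's cutoff, it searches a small grid of candidate cutoffs for two areas, and all data, names, and the tolerated false-positive rate (`max_fpr`) are hypothetical.

```python
from itertools import product


def is_flagged(scores, cutoffs):
    """Assumed decision rule: flag if ANY area score is at or below its cutoff."""
    return any(score <= cutoff for score, cutoff in zip(scores, cutoffs))


def sensitivity_and_fpr(children, cutoffs):
    """children: list of (area_scores, truly_eligible) pairs."""
    tp = fn = fp = tn = 0
    for scores, eligible in children:
        flagged = is_flagged(scores, cutoffs)
        if eligible and flagged:
            tp += 1
        elif eligible:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return sensitivity, false_positive_rate


def best_cutoffs(children, candidate_cutoffs, max_fpr):
    """Return the most sensitive cutoff combination whose FPR does not exceed max_fpr."""
    best = None
    for cutoffs in product(*candidate_cutoffs):
        sens, fpr = sensitivity_and_fpr(children, cutoffs)
        if fpr > max_fpr:
            continue
        # Prefer higher sensitivity; break ties with the lower false-positive rate.
        if best is None or (sens, -fpr) > (best[1], -best[2]):
            best = (cutoffs, sens, fpr)
    return best


# Hypothetical toy data: (score in area 1, score in area 2), eligibility status.
children = [
    ((2, 5), True), ((6, 1), True), ((3, 3), True), ((8, 9), True),
    ((9, 8), False), ((7, 7), False), ((4, 9), False), ((10, 10), False),
]
candidate_cutoffs = [[2, 4, 6], [2, 4, 6]]  # candidate cutoffs per area

result = best_cutoffs(children, candidate_cutoffs, max_fpr=0.25)
print(result)
```

An exhaustive grid is feasible only for a few areas and candidate values. With all developmental areas included, a composite score or a fitted classification model might be more practical, and any selected cutoffs would need to be validated in an independent sample.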
Implications for Researchers and Practitioners
This study supports previous research on earlier versions of the AEPS, giving early childhood professionals confidence that a field-test version of the AEPS-3 provides useful information about young children’s development and is useful in making a variety of instructional decisions. Practitioners may also use the AEPS as a tool for eligibility determination. The AEPS-3 might be useful for determining young children’s need for more intensive interventions as part of a multi-tiered system of support. Given that the results show that each AEPS-3 developmental area test could be used in isolation, programs might use cutoff scores to make instructional decisions about which children need additional support in a particular area of development. For example, if a teacher finds that a small group of children scores below the cutoff score in the math area for their age group, more intensive targeted instruction could be provided.
Supplemental Material
Supplemental material, sj-pdf-1-tec-10.1177_0271121420981712 for Scale Evaluation and Eligibility Determination of a Field-Test Version of the Assessment, Evaluation, and Programming System Third Edition by Michael D. Toland, Jennifer Grisham, Misti Waddell, Rebecca Crawford and David M. Dueber in Topics in Early Childhood Special Education
Footnotes
Acknowledgements
Special thanks to the anonymous reviewers and editor for their helpful suggestions.
Authors’ Note
The views expressed in this paper are those of the authors and do not necessarily reflect the views or policies of the Early Intervention Management and Research Group (EMRG).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
Research funding was provided by the Early Intervention Management and Research Group (EMRG), a nonprofit corporation dedicated to the improvement of the AEPS.
Supplemental Material
Supplementary material for this article is available on the Topics in Early Childhood Special Education website along with the online version of this article.
References
