Abstract
When conducting item reviews, analysts evaluate an array of statistical and graphical information to assess the fit of a field test (FT) item to an item response theory model. The process can be tedious, particularly when the number of human reviews (HR) to be completed is large. Furthermore, such a process leads to decisions that are susceptible to human errors. A key finding from behavioral decision-making research has shown that a parametric model of human decision making often outperforms the decision maker himself. We exploit this finding by seeking a model to mimic how analysts integrate FT item level statistics and graphical performance plots to predict the analyst’s assignment of the item’s status. The procedure suggests a set of rules that achieves a desired level of classification accuracy, separating situations in which the evidence supports firm decisions from those situations that would likely benefit from HRs. Implementation of the decision rules accounts for an estimated 65% reduction in calibrations requiring HRs.
Good item parameter estimates are critical to the validity of a test based on item response theory (IRT). Approaches to assessing the quality of estimates obtained during an item calibration process have been a subject of numerous studies for well over three decades (e.g., Drasgow, Levine, Tsien, Williams, & Mead, 1995; Glas & Suarez-Falcon, 2003; Hambleton & Han, 2005; Orlando & Thissen, 2000, 2003; Smith, 2000; Stone & Zhang, 2003). These investigations have provided a range of tools for ensuring defensible psychometric functioning of items and tests. Among these tools are several test-item descriptive statistics, the results of several tests of item fit in common use, as well as graphical plots of an item’s observed responses or residuals from the theoretical model used to calibrate the item (e.g., Hambleton & Han, 2005; Hambleton, Swaminathan, & Rogers, 1991). Various descriptive statistics (e.g., percent correct, point-measure correlation) are employed as part of evaluating an item’s calibration because each provides a slightly different aspect of the item’s behavior within the sample of students that responded to it. Furthermore, item fit statistics (e.g., infit, outfit, standardized
While the forms of information collected about an item’s calibration status are empirical, objective, and repeatable, they do not by themselves result in a decision. For example, it is not always clear when item fit is adequate in any one case. What numerical value constitutes poor fit? When some indicators suggest good quality but others suggest caution, what should drive the decision? Clearly, HRs are subjective; even though the information being considered is from objective sources, their confluence in the form of a summary of field test (FT) performance can include ambiguous signals that lack defined rules upon which to act. Recognizing this, the integration of the available item calibration information involves the analyst’s judgment and will, in varying degrees, reflect the analyst’s knowledge and experience. How stable should we expect an analyst’s judgments to be? Would the decision made by another analyst be the same?
Best practice would seem to suggest that all FT items passing through the calibration process should undergo review by at least two experienced analysts. Judgments from several analysts would guard against an overreliance on the judgments of any one analyst. However, HRs can be expensive and inefficient, especially when large numbers of items are under consideration. They consume staff time and divert staff attention away from other priorities. To the extent that HRs can be reduced without a substantial increase in the mischaracterization of item status, the calibration process will be more efficient.
Consider how issues may arise in the case of item calibration for a computerized adaptive test (CAT). Assume that the field-testing procedure includes embedding two to three FT items in operational tests and presenting these items to examinees as part of each normal test administration. For each FT item, the “calibration engine” monitors the number of responses and computes a calibration after a threshold number of responses has been reached. Once a calibration has been computed, a report is generated that provides fit and descriptive statistics and a plot of item responses cast against the item’s theoretical response curve. These reports are presented to analysts to examine the empirical item response curves and accompanying psychometric indicators. Based on this information, analysts make determinations about how the item should be considered going forward. As indicated above, this process can be both time-consuming and subjective. Decisions on any given item may vary among analysts. Moreover, an individual analyst may render different decisions on a given item if it is presented on different occasions.
While response plots and the accompanying psychometric indicators are intended to serve as the primary focus, little is known about how those indicators are used in shaping the analyst’s decision. The absence of studies examining the accuracy and stability of human judgments about FT item performance or of the role such judgments play in the evaluation of items suggests several plausible explanations. One is that human review is considered as simply a perfunctory exercise tacked onto an item calibration and evaluation process; that is, it holds little sway in the decision to make an item operational. Another possibility is that decisions about items are based purely on statistical analyses of item performance data. Or it may be that the process of human review of FT performance data is felt to be so canonically understood that little would be gained by studying it further, or at all. Yet another possibility is that the role of inherently subjective human judgment inside an otherwise objective process, if not antithetical to the goals of the process, is at least difficult to reconcile. We suspect that there is some truth to each of these explanations and probably others.
The indifference to studying human judgments made in the context of professional practice is not unique to this area of psychometrics. Indeed, a similar aversion has been present in clinical psychology for roughly eight decades (Meehl, 1954). Since Meehl’s seminal book, the value, and in fact the superiority, of using actuarial or statistical models versus clinical methods to make judgments or predictions has been demonstrated convincingly (e.g., Dawes, 1979; Dawes, Faust, & Meehl, 1989; Einhorn, 1974; Goldberg, 1968). However, the recognition, acknowledgment, and acceptance of this evidence have been mixed (Dawes et al., 1989; Swets, Dawes, & Monahan, 2000). Other fields in which “clinical” judgment based on multiple sources of objective data and information plays a major role, such as personnel selection, criminology (e.g., predicting parole violations, recidivism), medicine (e.g., diagnosis, radiology, oncology), and education (e.g., predicting college success), have all shown similar findings (Dawes, 1993; Dawes et al., 1989; Swets et al., 2000). The overwhelming finding that actuarial models of human decision making consistently outperform clinical decision makers holds important implications for the practice of making judgments about FT item performance. There is little reason to believe that psychometricians or other trained analysts reviewing item calibration reports to make judgments about items’ operational status going forward would somehow be immune to the problems of judgment consistency and accuracy that have been so pervasive among professionals in other fields over the past 50 years (Dawes et al., 1989). Dawes (1979) has contended that people, especially experts, are much better at selecting and coding information than they are at integrating it, as in judgment tasks.
We argue here that the processes of making what are essentially “clinical” judgments about FT item performance will be well served by studying, if not their “weakest” aspect, then the aspect that has been virtually unaddressed.
More specifically, we wanted to know the extent to which analysts’ judgments of FT items’ status can be predicted from the statistical and graphical information used to make those judgments. From this motivation, four more narrowly focused questions emerge. They are:
To what extent does a model of the analyst replicate the analyst’s actual decisions?
To what extent does prediction accuracy vary between item subject areas?
What criteria would be required to minimize serious classification errors that would offset the benefits of using the prediction model in practice?
What changes in efficiency would be expected by embedding a model of analysts’ judgments into an item review process in which all items considered to be successfully calibrated undergo HR?
The importance of the current study lies in its initial focus on this point, namely, that a model of how analysts use data and graphical displays of item performance in judging items’ subsequent status may be used to improve the quality and efficiency of the calibration process.
Our approach to understanding how analysts employ item information in deciding an item’s subsequent status can be traced to well-established findings that a prediction model based on an expert’s behaviors tends to outperform the expert himself (viz., Armstrong, 2001; Dawes, 1979; Dawes et al., 1989; Einhorn, 1974; Goldberg, 1968; Libby, 1976; Weiss, Shanteau, & Harries, 2006). In this approach, predictive models are developed that effectively serve as substitutes for the decision maker’s actual judgments. Following this line of inquiry, a model of the expert is built in the first phase of the study from the data used by analysts to form evaluative decisions about items.
In the second phase of the study, decision meta-rules based on the predictive model are applied in place of the expert. These rules were “tuned” for the model’s operating characteristics. The expectation is that these rules, when integrated into the broader flow of the item calibration process, should improve the accuracy and consistency of judgments of FT item calibration status.
It should be understood from the outset that the focus of the study was on the judgments of the specific analysts who reviewed the item calibration reports. This, of course, limits generalizability of the results to the future judgments of these analysts when they judge similar (dichotomously scored, multiple choice) items. We would argue, however, that the steps undertaken in the methodology would be expected to generalize to other similar contexts and situations, that is, large-scale item calibration efforts that include human review as a key component in determining the fate of calibrated test items. Item types may differ, and the particular statistical and graphical information provided to analysts may vary, but the analytic procedures can be readily adopted from the methods below.
Method
Calibration Records
An extant set of item calibration records that had been reviewed by two experienced analysts over a recent 14-month period was used as the primary dataset. All records were generated from responses to field test items embedded in operational computerized adaptive tests (CATs) developed to assess K–10 student achievement in mathematics, reading, language usage, and general science. Each calibration record contained a set of statistics related to the item’s characteristics along with a plot of test takers’ responses to the item. The plot included the theoretical ogive corresponding to the dichotomous Rasch model plus the response plot for each answer option. Analysts used the combination of the graphical plot and accompanying statistics to make a trichotomous judgment about the item: accept (make operational), reject (remove from further consideration or make revisions before additional field-testing), or return the item to the FT status. How the analysts treated the available information was not strictly prescribed. They were asked simply to make a decision using both the visual and item fit information, guided by commonly employed thresholds for the fit statistics. Past records showed that roughly 45% of all items were returned to FT status at least one time, resulting in multiple calibration reports for a substantial proportion of items. For example, 10% of reading items were reviewed four to five times and 10% of the mathematics items were reviewed seven to eight times.
The need to validate the prediction model with a dataset composed of completely different items necessitated splitting the full dataset. Distinct items (N = 8,017) were randomly assigned to one of two groups: (a) a “training” group (n = 5,292; 66%) and (b) an evaluation group (n = 2,725; 34%). For the training group, only the item’s last cycle through the calibration/HR process was used. These records were considered to represent the most complete information available for making a decision about the status of a FT item. For the evaluation group, a single instantiation of the item’s presence in the calibration/HR process was randomly selected. These records were considered to represent those situations a reviewer is most likely to encounter. Table 1 provides a summary of how the reviews for the items broke out in each group in terms of the scale (subject) and HR decision.
Table 1. Item Reviews in Training and Evaluation Sets by Subject and Human Review Decision.
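The item-level split described above can be sketched in a few lines of Python. This is an illustration only, not the procedure actually run for the study: the function name, the record layout (the keys item_id and cycle), and the random seed are all assumptions introduced for the example.

```python
import random


def split_items(records, train_frac=0.66, seed=0):
    """Split distinct items into training and evaluation sets.

    records: list of dicts with hypothetical keys 'item_id' and 'cycle',
    where cycle counts an item's passes through calibration/HR (1, 2, ...).
    """
    rng = random.Random(seed)

    # Group calibration records by item so that whole items, not single
    # records, are assigned to a group.
    by_item = {}
    for rec in records:
        by_item.setdefault(rec["item_id"], []).append(rec)

    item_ids = sorted(by_item)
    rng.shuffle(item_ids)
    n_train = round(train_frac * len(item_ids))
    train_ids, eval_ids = item_ids[:n_train], item_ids[n_train:]

    # Training: only the item's last cycle through calibration/HR.
    train = [max(by_item[i], key=lambda r: r["cycle"]) for i in train_ids]
    # Evaluation: one randomly selected cycle per item.
    evaluation = [rng.choice(by_item[i]) for i in eval_ids]
    return train, evaluation
```

Assigning whole items rather than individual records keeps the two sets composed of completely different items, which is the property the validation design depends on.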
Each calibration record included a number of variables related to the item and its performance. The presence of these variables in the calibration record made them accessible to analysts to use in their evaluations of items. The variables, identified as “base” in Table 2, were used either as potential predictors or outcomes. If necessary, base variables were transformed to make them more suitable for the analyses. The first six entries in Table 2 are fit and fit-related statistics that were attached to each calibration record. Variable numbers 7 through 12 were simple descriptive variables the analysts reported as being useful. The final three predictor variables were constructed from some of the preceding variables. Dichotomous outcome variables were constructed from the status assigned by the human review process. These are included as the last two variables in Table 2.
Table 2. Variables Used in Logistic Regression.
Modeling the Analyst’s Decision Behavior
For the training set data, models were built for each subject area (mathematics, reading, language usage, and science) by regressing each of the dichotomously coded outcome variables (HR_Accept and HR_Reject) onto the set of predictors using a separate logistic regression procedure with stepwise selection for each outcome. The significance criterion for a variable to enter the model was set at .50, while the significance criterion for remaining in the model was set at .10. Individual probabilities for each outcome for each item were cross-validated using the leave-one-out procedure (Shtatland, Kleinman, & Cain, 2004) as implemented in SAS. Receiver operating characteristic (ROC) curve analyses were performed to judge the quality of prediction for each predictor set for each outcome variable. The focus for these analyses was on identifying the probability associated with decisions of high specificity (and thus, low false positive values) to help inform the rules used to govern the classification of items using predicted probabilities. To “score” the corresponding evaluation dataset, the estimated regression coefficients
Note that the set of predictors was allowed to vary by subject and HR status.
Classification Rules and Impact
Central to the analysis of the scored datasets was the treatment of the sets of probabilities coming from the two logistic regressions for each item. It was necessary to establish a set of rules to isolate probabilities from the “Accept” events (HR_Accept/HR_NOT Accept) and from the “Reject” events (HR_Reject/HR_NOT Reject) that would be used to define the predicted status for the item. The first method was based on a very simple rule for the probabilities; specifically, identifying the cell that contained the only probability of accepting (vs. not accepting), or rejecting (vs. not rejecting) the item that exceeded .5. This rule, referred to as the base decision method, is laid out in section A of Table 3.
Table 3. Assigning Predicted Status Using the Rules Applied for Each Focus on Probability.
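A minimal Python sketch of the base decision method follows. The function name and status labels are illustrative. The rule stated in the text covers the case in which exactly one of the two probabilities exceeds .5; routing the remaining cases (neither or both exceed .5) to "Return to FT" is an assumption made here, since the full grid in section A of Table 3 is not reproduced in the text.

```python
def base_decision(p_accept, p_reject, cut=0.5):
    """Predict an item's status from the two logistic-regression probabilities.

    p_accept: estimated probability of Accept (vs. not-Accept)
    p_reject: estimated probability of Reject (vs. not-Reject)
    """
    accept = p_accept > cut
    reject = p_reject > cut
    if accept and not reject:
        return "Accept"
    if reject and not accept:
        return "Reject"
    # ASSUMPTION: when neither or both probabilities exceed the cut,
    # the item is treated as "Return to FT".
    return "Return to FT"
```

For example, base_decision(0.8, 0.1) yields "Accept", while base_decision(0.3, 0.4) yields "Return to FT".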
A second method is being referred to as the optimal probability method. In this method, probabilities were isolated by using the information from the ROC curve-based classification tables (presented in the Results section, below). Specifically, Youden’s Index (as cited in Schisterman, Perkins, Liu, & Bondell, 2005) was adopted for this method. The sensitivity and specificity values from the analysis of a single event type (Accept or Reject) were added to determine the probability that represented optimal performance in terms of “true positives”—correct predictions of the observed event (Accept or Reject) in combination with “true negatives”—correct predictions of the nonevent (not-Accept or not-Reject). The probabilities isolated from the two event types define the cells in the mathematics and reading prediction grids shown in section B of Table 3.
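To make the selection concrete, the following Python sketch picks the probability level that maximizes Youden's index, J = sensitivity + specificity - 1, for one event type. Maximizing J is equivalent to maximizing the sum of sensitivity and specificity described above. The function name, the input layout, and the convention that a probability at or above the cutoff predicts the event are assumptions for illustration; the study read these values from ROC classification tables rather than from code like this.

```python
def youden_cutoff(probs, events, cutoffs):
    """Return (cutoff, J) for the cutoff maximizing Youden's index.

    probs:   predicted probabilities of the event (e.g., Accept)
    events:  1 if the event was observed, 0 otherwise
    cutoffs: candidate probability levels to evaluate
    Both event and nonevent cases must be present in the data.
    """
    best = None
    for c in cutoffs:
        pairs = list(zip(probs, events))
        tp = sum(1 for p, y in pairs if y == 1 and p >= c)
        fn = sum(1 for p, y in pairs if y == 1 and p < c)
        tn = sum(1 for p, y in pairs if y == 0 and p < c)
        fp = sum(1 for p, y in pairs if y == 0 and p >= c)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1
        # Keep the first cutoff that attains the largest J.
        if best is None or j > best[1]:
            best = (c, j)
    return best
```

The function would be applied once to the Accept probabilities and once to the Reject probabilities, giving the two isolated probabilities that define the prediction grid.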
Finally, the third method for isolating probabilities is being referred to as the maximum specificity method. This method was guided by a judgment that predictions resulting in possible serious misclassification (i.e., predicting “accept” for an item that human reviewers would “reject”) are errors that outweigh the importance of “correct” classifications according to the model (Swets et al., 2000). Under this method, probabilities were also identified using the ROC curve-based classification tables (presented in the Results section, below). For each event type, a primary probability level was isolated by using a high percentage of specificity (i.e., the ratio of true negatives to the sum of false positives and true negatives). Since high specificity is negatively related to the rate of false positives, misclassification errors would be less likely. However, as specificity percentages increase, the percentages of correct classifications decrease, potentially to the point of negating any advantages of using the procedure. To offset the negative effects of a low percentage of correct classifications, an arbitrary minimum for the percentage of correct classifications was adopted. Specifically, the primary isolated probability was the one associated with maximum specificity and a percentage of correct classifications that was within 5 percentage points of the maximum percentage of correct classifications.
At the same time, the rules were developed to acknowledge the possibilities for misclassification by building in two “hedge” status categories. These categories were established between the “reject” status and the “return to FT” status as well as between the “accept” status and the “return to FT” status by arbitrarily subtracting 15 percentage points from the primary isolated probability. The primary probabilities isolated from the two event types as well as the probabilities adopted to establish the “hedge” categories define the cells in the mathematics and reading prediction grids shown in section C of Table 3.
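The selection of the primary probability and the hedge band can be sketched as follows in Python. This is a hedged illustration: the function names, the row layout of the classification table, the inclusive treatment of the 5-point tolerance, and the band labels are assumptions, and the sketch covers a single event type only. How the Accept and Reject bands combine into the final grid is specified in section C of Table 3 and is not reconstructed here.

```python
def max_specificity_cutoff(table, tolerance=5.0):
    """Pick the primary probability level for one event type.

    table: rows of (prob_level, pct_correct, pct_specificity) taken from a
    ROC-based classification table (hypothetical layout).
    Returns the probability level with the highest specificity among rows
    whose percent correct is within `tolerance` points of the maximum.
    """
    best_correct = max(row[1] for row in table)
    eligible = [row for row in table if row[1] >= best_correct - tolerance]
    return max(eligible, key=lambda row: row[2])[0]


def band(p, primary, hedge_width=0.15):
    """Place one predicted probability relative to the primary cutoff.

    'firm'        : at or above the primary probability
    'with review' : within hedge_width below the primary probability
    'no call'     : below the hedge band
    """
    if p >= primary:
        return "firm"
    if p >= primary - hedge_width:
        return "with review"
    return "no call"
```

With a primary Accept probability of .70, for instance, an item with a predicted Accept probability of .60 would fall in the "with review" band and be routed to an analyst rather than accepted outright.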
Results
Beginning in this section, attention is predominantly limited to mathematics. Where it is deemed informative, results from reading items are also presented. It is worth noting, however, that corresponding results for reading, language usage, and science items follow very similar patterns to those observed for mathematics items. For example, descriptive statistics revealed that the training sets (one for Accept and one for Reject for each subject area) were quite comparable in values across most predictors. However, some differences were evident for predictors based on calibration sample size, namely T_sqrt, CntCat, and zChiSqr. For example, the mean number of responses used to calibrate items in reading was about 850 larger than that used for calibration in mathematics. Distributions and inter-correlations of these predictors for all FT items are available from the authors.
Logistic Regression Analyses
Logistic regressions were carried out separately for each subject area. The first portion of this section presents summaries of the stepwise analyses in the form of maximum likelihood (ML) estimates and their standard errors (SE) for the predictors included in the mathematics analysis. The ROC curve analyses based on the final step of the regression analyses are also provided. The second portion of this section presents tables of hit rates and confusion matrices from both the model building sample, as well as from the evaluation sample.
Table 4 provides the ML estimates and their SEs for predictors included in the two training models for mathematics. Across both models, eight predictors were judged as marginal contributors under conventional significance levels (i.e., p < .05) and were dropped from the final models. These included the correlation of the observed response to the expected response (roe), the percent of correct responses in the first quartile (Q1pct), linear trend (Trend), mean square error (MSE), proportion of correct responses (Pcorr_Ctr50), category of total item responses (CntCat), square root of total item responses (T_sqrt), and proportion of responses in the quartile bounding the calibration (CQpos) for the Reject model only. The lower portion of Table 4 provides summaries of the ROC curve analyses for the mathematics training set. This table reveals that both models were quite accurate (minimum AUC > .94) at predicting a target event (e.g., the item was Accept vs. not-Accept, or the item was Reject vs. not-Reject). The reader is reminded that the area under the ROC curve (AUC) is a reflection of prediction accuracy. Typical AUC values range from .5 (flip of a coin accuracy) to 1.0 (perfect accuracy). Convention holds that AUC > .90 indicates high accuracy.
Table 4. Maximum Likelihood Estimates and Standard Errors for the MATHEMATICS Training Models.
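For readers less familiar with the AUC, the short Python sketch below computes it directly from its rank interpretation: the probability that a randomly chosen event case receives a higher predicted probability than a randomly chosen nonevent case, with ties counted as one half. The function name and inputs are illustrative; the AUC values reported for the study came from the ROC analyses, not from this code.

```python
def auc(probs, events):
    """Area under the ROC curve via pairwise comparison.

    probs:  predicted probabilities of the event
    events: 1 if the event was observed, 0 otherwise
    Both event and nonevent cases must be present.
    """
    pos = [p for p, y in zip(probs, events) if y == 1]
    neg = [p for p, y in zip(probs, events) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # event case ranked above nonevent case
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))
```

A value of .5 corresponds to rankings no better than chance, and 1.0 to every event case being ranked above every nonevent case.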
Estimates of each item’s probabilities of Accept vs. not-Accept or of Reject vs. not-Reject were cross-validated using a leave-one-out procedure (Shtatland et al., 2004) as part of the prediction model building. This procedure is tantamount to leaving out the current observation and reestimating the parameters. Thus, for each item for each model, two probability estimates (one from the model and one from the model estimated without the current observation) were summarized for each outcome (the event and the nonevent). The mean differences between the probabilities computed with and without the leave-one-out procedure differed only at the fifth decimal place for all outcomes except for the Accept outcome for reading items, where differences appeared at the fourth decimal place.
The classification tables resulting from the ROC curve analyses for the two mathematics training models are presented together in Table 5. Classifications are based on the probabilities yielded by the cross-validation procedures. The bolded cell entries (maximum values) under the “Youden Index” column indicate the values used to isolate the two single probabilities (underlined cell under the “probability level” column) that were used to define the “predicted status” according to the “Optimal Probability” rule assigning method, the method specified in section B of Table 3. Similarly, the cells with bolded and underlined entries under the percentage “correct” and percentage “specificity” columns indicate the values used to isolate the two single probabilities (for an Accept event and for a Reject event) that were used to define the “predicted status” according to the “Maximum Specificity” rule assigning method, described above and specified in section C of Table 3.
Table 5. Classification Tables for the MATHEMATICS Training Models.
Figure 1 provides a graphic display of how the isolated probabilities from each of the three methods fell on their respective ROC curves for mathematics decisions (top two panels). For comparison, the corresponding results from the reading analyses are included in Figure 1 (bottom two panels). The horizontal axes of these curves (1 – specificity) represent false positives. True positives (sensitivity) are represented on the vertical axes. All plots show that the base method and the “optimal” method result in higher levels of true positives than the “maximum specificity” method. However, the plots also show that this result comes at the price of higher levels of false positives.

Figure 1. ROC curve plots for Mathematics and Reading item decisions.
Classification Accuracy and Classification Errors
Classification “accuracy” here refers to agreement between the observed status (Accept, Reject, return to FT) assigned by the analysts and the status “predicted” as a result of applying the specific method. For the training and evaluation datasets of mathematics and reading items across the first two forms of decision rules, accuracy was consistently found in the 83% to 87% range. The largest differences in accuracy between the training and evaluation datasets ranged from 0.7% (Mathematics-Base Probabilities Method) to 1.7% (Mathematics-Optimal Probabilities Method). A parallel comparison of accuracy between the training set and the evaluation set could not be completed for the third method of selecting a prediction status since that method introduced two additional predicted status categories—“Reject with review” and “Accept with review”—that were not in the options available to human reviewers.
Classification “errors,” that is, disagreements between the observed status assigned by analysts and predicted status assigned by models using the “Base Probabilities” method and the “Optimal Probabilities” method, are shown in the underlined cells of Table 6. Cells with bolded and underlined entries contain severe disagreements (e.g., accept vs. reject). The most severe of these, referred to here as Type I (model Accepts and HR Rejects) are in the upper right cells of each section of the table. Accepting these “errors” without further review would run the risk of adding poor items to a pool. The corresponding severe errors, referred to here as Type II, are those in which the model Rejects and HR Accepts (lower left cell of each section of Table 6). Accepting these “errors” without further review would run the risk of eliminating acceptable items from a pool—a waste of resources. Comparing the errors of the two types of rules reveals that the “Optimal Probabilities” rules resulted in lower levels of Type I errors but higher levels of Type II errors than the “Base Probabilities” rules. However, while these error rates would appear to be acceptably low for many prediction applications, they were deemed to be unacceptably high for the present application.
Table 6. Confusion Matrices for the Mathematics Training Set Items and Evaluation Set Items When Predictions Are Made Using the Base Probabilities Decision Rules and the “Optimal” Probabilities Decision Rules.
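The quantities summarized in these confusion matrices can be tallied with a short Python sketch such as the one below. The function name and status labels are illustrative, and expressing each error type as a proportion of all items is an assumption; the tables may use a different denominator.

```python
from collections import Counter


def agreement_summary(observed, predicted):
    """Summarize agreement between analyst and model statuses.

    observed:  statuses assigned by analysts (HR)
    predicted: statuses assigned by the model's decision rules
    Returns (accuracy, type_i_rate, type_ii_rate, counts), where counts
    maps (predicted, observed) pairs to frequencies.
    """
    n = len(observed)
    counts = Counter(zip(predicted, observed))
    accuracy = sum(v for (p, o), v in counts.items() if p == o) / n
    # Type I: model Accepts an item that HR Rejects (risk of adding poor items).
    type_i = counts[("Accept", "Reject")] / n
    # Type II: model Rejects an item that HR Accepts (risk of losing good items).
    type_ii = counts[("Reject", "Accept")] / n
    return accuracy, type_i, type_ii, counts
```

Running the same summary on the training and evaluation sets gives the parallel comparison of accuracy and severe disagreements that the confusion matrices report.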
Application of the “Maximum Specificity” rules with hedge categories had the effect of reducing the levels of severe (Type I and Type II) errors. These differences can be seen by comparing the “Accept” cells and the “Reject” cells in Table 7 with the corresponding cells in Table 6. Obviously, these reductions were realized by invoking the rules created to form the “with Review” hedge categories. The model “Accept” rates and “Reject” rates in Table 7 are slightly lower in most cases, even when they are combined with their adjacent hedge category.
Table 7. Confusion Matrices for the Training Set Items and Evaluation Set Items When Predictions Are Made Using the Maximum Specificity Probabilities Decision Rules With Hedge Categories Included.
Conclusions and Recommendations
The core idea used in this study is not new. It is well established in the literature focusing on modeling decision-making behavior (e.g., Swets et al., 2000). This study is the first we are aware of to apply these methods to understanding how FT item performance information is considered by human analysts. This is rather uncharted, or at least undocumented, territory. At the very least, application of these methods could provide test developers with a tool to monitor HR item judgments for consistency both within and across reviewers. In the best case, these methods provide a kernel from which a virtual HR process could be developed to substantially reduce the need for actual human reviews. Such a system would be very consistent by virtue of the underlying model and would be tractable when anomalies arise.
The principal question under consideration in this study was the extent to which analysts’ judgments of FT items’ operational worthiness can be predicted from the statistical and graphical information they use to make those judgments. The most direct answer to this question is that those judgments are strongly predictable and are quite accurate across content areas, though the models required to come to those judgments vary slightly by content area, judgment type, and the combinations of available variables used. Misclassification levels were reduced by up to about 50% from the low levels obtained when using what might be considered more conventional decision-making rules. This reduction was obtained by applying decision-making rules that incorporated explicit criteria to detect predictions based on what might be termed “less indicative” sets of probabilities. A full-scale adoption of these procedures would be expected to yield substantial improvements in efficiency in the evaluation portion of what might be termed a conventional calibration review process.
Data for the study came from calibration review records. Each record was a report on the calibration results for an item and was composed of statistical data commonly associated with IRT item calibration and additional descriptive statistical information provided to analysts. An analyst’s judgment about the item (i.e., to accept it, to reject it, or to return it to FT status) was the outcome of each encounter with an item, and was the dependent variable in the effort to build models of the analysts’ judgments. For each subject area, a two-part training model was constructed, each part using stepwise logistic regression to estimate the probability that the item was (a) placed into the “Accept” status or (b) placed into the “Reject” status. Results from these training models showed the constellation of item level information provided to analysts was highly predictive of their judgments of item status. The logistic regression results also revealed that the makeup of predictors selected for each model varied slightly across subject areas and across decision types (i.e., Accept, Reject).
To formulate the four probabilities (Accept, not-Accept, Reject, not-Reject) from the two models into a single prediction for each item, three sets of decision meta-rules were examined. The sets were developed from the estimated probabilities to form decision tables. The first set was based simply on using the probability of .5 as the cut point individually for both Accept and Reject. The second set used the optimal probabilities for each outcome as identified from the ROC stage of the analyses. The third set used the ROC analyses to find an effective balance between specificity (true negatives) and percent of correct classifications that would also minimize classification differences from the observed item status as assigned by analysts. Of these three methods, the third was considered to have the most practical advantage in that it led to substantially lower rates of misclassifications (putative Type 1 errors).
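One way to picture such a decision table is the sketch below. It is an illustration under stated assumptions, not the rule set used in the study: the function name, the default cut points, and the `hedge` band that separates firm decisions from the "with review" categories are all hypothetical. With both cut points at .5 and no hedge band, it corresponds to the first, simplest set of rules.

```python
def classify(p_accept, p_reject, accept_cut=0.5, reject_cut=0.5, hedge=0.0):
    """Combine the two model probabilities into one predicted item status.

    accept_cut, reject_cut, and hedge are illustrative parameters; in practice
    the cut points would come from ROC analyses of the training models.
    """
    accept_flag = p_accept >= accept_cut
    reject_flag = p_reject >= reject_cut

    if accept_flag and not reject_flag:
        # Firm only when the probability clears the cut point by the hedge band.
        return "Accept" if p_accept >= accept_cut + hedge else "Accept with review"
    if reject_flag and not accept_flag:
        return "Reject" if p_reject >= reject_cut + hedge else "Reject with review"
    # Neither model is confident, or the two models contradict each other.
    return "Return to FT"

# Hypothetical probabilities for four items.
print(classify(0.93, 0.02))             # prints "Accept"
print(classify(0.62, 0.10, hedge=0.2))  # prints "Accept with review"
print(classify(0.30, 0.35))             # prints "Return to FT"
print(classify(0.05, 0.91, hedge=0.2))  # prints "Reject"
```

Widening the hedge band sends more items to human review and fewer to firm decisions, which is the trade-off the third set of rules is meant to balance.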
From the results, a logical follow-up question might be, what changes to the item calibration process could be expected if the decision meta-rules used for the third method were implemented? The answer to this question can take a number of paths, depending on the level of confidence that one is willing to invest in the outcomes shown under the evaluation set in Table 8. A very high level of confidence, for example, may suggest that the “Accept with review” and “Reject with review” categories simply be ignored and treated as “Accept” and “Reject,” respectively. That is, accept any item falling in the “Accept” or “Reject” categories as if it had been viewed by a live analyst, making further human review unnecessary. This would leave about 18% of items that would be returned to the FT queue. These items could either be reviewed by an analyst to see if an adjustment is needed for the precalibration difficulty estimate or such adjustments could be made through a set of heuristics based on the observed data. Alternatively, the data in Table 8 could be treated with more caution. For example, items placed into the hedge categories (“Accept with review” and “Reject with review”) would always be sent to be reviewed by an analyst, while all “Accept” and all “Reject” items would be assigned these statuses without further review. Using this approach would result in roughly an additional 13% of items being reviewed. In the immediate term, the most prudent approach to the items categorized as “Return to FT” would be to continue to have them reviewed by analysts.
Table 8. Estimated Impact of Adopting the Recommended Approach.
Note. Underlined cells represent the differences in the numbers of items that would be reviewed under the proposed process compared to those that would be reviewed under the standard (all items reviewed) process. For example, for mathematics (Rej w/Rev + Return to FT + Acc w/Rev) = 462 items, which is 32.9% of all 1,405 items, or 67.1% fewer items that would have to be reviewed.
The latter option would appear to strike an appealing balance between accepting the models’ “judgments” and maintaining a level of human review when model probabilities indicate less certainty. In addition, this option would allow for the results of HRs to inform adjustments to the width of the hedge categories, if needed. The estimated impact of implementing this option is illustrated in the shaded area of Table 8.
The estimated ns for the “standard” process in each content area in Table 8 were computed from the percentages observed in each decision category from the complete set of 11,811 mathematics item reviews and 4,977 reading item reviews available for the study. This provides a rough estimate of how items are likely to be assigned by analysts using the standard review process. Corresponding values for the “proposed” process were taken directly from the “total” rows from the Evaluation set in Table 7. Examination of Table 8 reveals that the straight-out Reject or Accept decisions would increase from 32% to 65% in mathematics and from 45% to 71% in reading. These increases would, of course, be accompanied by equivalent decreases in the number of items that would trigger an HR, even assuming that items judged as “returned to FT” would require a human review. If effective heuristics could be developed to reset precalibration item difficulty estimates to better target FT items to examinees, the percentage of items requiring HR would be estimated to fall into the 12% to 15% range.
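The arithmetic behind the reduction reported in the note to Table 8 can be checked with a few lines. The function name is hypothetical; the counts are the mathematics figures given in that note (462 items still routed to an analyst out of 1,405).

```python
def review_reduction(n_total, n_still_reviewed):
    """Return (percent still reviewed, percent reduction in reviews), rounded to 0.1."""
    share = n_still_reviewed / n_total
    return round(100 * share, 1), round(100 * (1 - share), 1)

# Mathematics: Rej w/Rev + Return to FT + Acc w/Rev = 462 of 1,405 items.
pct_reviewed, pct_fewer = review_reduction(1405, 462)
print(pct_reviewed, pct_fewer)  # prints 32.9 67.1
```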
Future investigation in this area would benefit from the pursuit of several other intriguing questions that would extend or validate the performance of the predictive models. For example, would different analysts consider the same item report information in the same manner? Could the procedures of the pretest item calibration process that are embedded in an operational CAT be adjusted to provide less noisy FT data in the first place? A preliminary examination by He, McCall, Thum, and Hauser (2011) suggests that this is plausible. What different forms of information could be added to an item FT report that would improve model performance? What methods can be developed to extract statistical information from graphical plots that quantifies visual information in the plots themselves that is distinct from the statistical information used to create the plots? How can such information be of service to the item evaluation process? How small can training datasets be and still yield dependable results? Is there a “drift” over time in how the information is being employed? Presumably, changes in the decision-making behavior of analysts can be expected with increased experience. These and other changes in analyst behavior will need to be investigated in future research and reviews in order to mitigate any negative impact on the accuracy and validity of the processes used in this study.
Footnotes
Acknowledgements
The authors wish to thank NWEA for their support of this research.
Authors’ Note
The views expressed are solely those of the authors and do not necessarily reflect the positions of NWEA.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
