Abstract
There is a continuing tension in testing programs to equate forms and maintain score scales and at the same time allow for changing conditions in the educational system, such as curriculum shifts or practical limits on testing time. When such changes occur, psychometric staff members are challenged to develop linking methods that allow for comparable reporting but meet requirements for psychometric rigor. This article describes a method addressing such shifts in testing programs. The application of the method is demonstrated on a large-scale educational testing program that had changes in test length, content distribution, and decision-making process. The method used to accomplish the linkage was to develop a pseudo test from the items included in the longer test before the change that was designed to mimic the test after the change. The linking of the tests using the pseudo test process resulted in a percentage of successful students that was similar to the percentages obtained prior to the changes. The linked scores were treated as comparable rather than equated scores.
Introduction
The prominence and importance of linking scores obtained from different tests today in education are clear. The increased emphasis is consonant with the increased attention to accountability. Simply stated, schools are effective if growth from one year to another is evident. This requires multiple test forms that are administered to populations of students in the same grade but in different years or to populations of students in different grades in the same and different years. To be effective, the scores derived from these tests must be such that progress from one year to the next can be validly interpreted at the state/provincial, school board, and school levels and, if reporting is at the student level, in terms of what students know and can do.
The processes used to link the scores obtained from different test forms depend on the similarity of the forms and the psychometric characteristics of the forms. As similarity declines, different “conceptual” methods are used, moving from equating to scaling for comparability or scale concordance to projection or prediction (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Feuer, Holland, Green, Bertenthal, & Hemphill, 1999; Holland & Dorans, 2006; Kolen, 2004; Linn, 1993). Feuer et al. (1999) summarized the methods as follows:
same domain and same table of specifications for each alternative form (equating);
same domain but different tables of specifications for each form (scaling for comparability or scale concordance); and
different domains and different tables of specifications for each test (projection or prediction).
Equating is the most demanding and strongest form of linking (Linn, 1993; Mislevy, 1992); “a direct link is made between a score on one test and a score on another test” (Holland & Dorans, 2006). The strongest interpretations can be made with equating when the test items for each test form are referenced to the same table of specifications and the tests have the same psychometric properties. The resulting scores are considered interchangeable in the sense that a student’s score on one form would be within equating error on the equated form.
Scaling for comparability or establishing concordance is a more relaxed form of linking in which scores from different tests that measure similar but not necessarily equivalent achievement domains can be interpreted in similar ways (Holland & Dorans, 2006; Kolen, 2006; Linn, 1993; Mislevy, 1992). However, concordant scores are not interchangeable. Projection involves different domains and different test specifications. A relationship is determined between the scores to be predicted from one test (representative of the dependent variable) and the scores obtained on the second test (representative of the independent variable; Mislevy, 1992). The interpretation of the predicted scores is the weakest when compared with equated and concordant scores.
The first two conceptual models—equating and establishing concordance—are closer to each other than to projection. For example, two test forms constructed using the same table of specifications representative of the domain of interest would allow equated scores. Two test forms constructed using the same test specifications but different in length, administered in different modes (e.g., moving from paper-and-pencil to computer-based testing), or, perhaps, using different tables of specifications but referenced to the same domain would likely allow concordant scores, but the likelihood depends on how different the two forms are.
Liu and Walker (2007) examined the issue of equated versus concordant scores when changes were made to the College Board’s SAT Verbal Reading section. The College Board wanted the scores from the original and revised sections to be considered equated scores. The major changes involved replacement of analogy items by additional short reading passages and reduction in the total number of items from 78 to 67 and administration time from 75 minutes to 70 minutes. “The new section represents heavier reliance on a reading construct, with approximately 72% reading comprehension items, as compared with 51% in the OV (old verbal) section” (Liu & Walker, 2007). The same proportions of items were maintained at each difficulty level. Last, the name of the section was changed from Verbal Reading to Critical Reading. Based on the linking analyses they performed using field-trial data obtained from an equivalent groups design and from a counterbalanced single-group design, the means and standard deviations of the item difficulties on the new and old forms were “very close,” the observed score correlation was .91 between the new and old forms, and the reliabilities of the new and old forms were both .93. Consequently, the reduction in uncertainty (Dorans, 2004a) was 59% and the true score correlation was close to unity. The conditional standard errors of measurement (CSEMs) for the new and old forms were “very similar” between scale scores 350 and 800. Score equity results (see Liu, Cahan, & Dorans, 2006) revealed that although there was some lack of invariance across gender groups, the differences between scores using total sample linking and gender group linking were less than 5 score points or the difference that matters (Dorans, 2004b). 
Based on the results from the linking analyses of the field trial data, Liu and Walker (2007) concluded that “it appears that linking the CR and OV sections might meet the equating requirements” and indicated that when the new test is operational, the equity analyses can be expanded to include other subgroups, such as ethnic, language, and region groups (p. 133).
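The 59% reduction in uncertainty and the near-unity true score correlation cited above can be reproduced from the reported correlation (.91) and reliabilities (.93). The sketch below assumes the usual definitions: reduction in uncertainty as 1 − sqrt(1 − r²), and the true score correlation as the observed correlation corrected for attenuation. The function names are illustrative only and are not taken from Liu and Walker (2007).

```python
import math


def reduction_in_uncertainty(r: float) -> float:
    """Proportionate reduction in uncertainty for an observed correlation r."""
    return 1.0 - math.sqrt(1.0 - r * r)


def disattenuated_correlation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Observed correlation corrected for unreliability in both forms."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Values reported for the old and new SAT verbal sections.
riu = reduction_in_uncertainty(0.91)
true_r = disattenuated_correlation(0.91, 0.93, 0.93)

print(f"reduction in uncertainty: {riu:.3f}")
print(f"true score correlation:   {true_r:.3f}")
```

The first value rounds to the 59% reported, and the second is close to 1, which is consistent with the claim that the two sections measure nearly the same construct.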
Pommerich, Hansen, Harris, and Sconing (2004) illustrated a four-stage process to link ACT Composite and SAT I V + M (Verbal + Math) scores using unsmoothed and smoothed equipercentile methods and linear prediction. The first stage involves deciding if the linkage is one of equating, concordance, or prediction using the degree of content similarity across the tests to be linked, the strength of the relationship between the scores on the tests, and the similarity of performance across demographic groups for each test. The second stage consists of linking the scores using the selected linking process and computing summary measures. The third stage involves evaluating the quality of the linkage (e.g., size of CSEMs, degree of population invariance, decision consistency rates) and determining what to report (e.g., at individual level or group level). The fourth stage consists of making recommendations about how to interpret and use and not use the results (e.g., whereas equated scores are interchangeable, concordant and predicted scores are not).
In a similar vein, Luecht and Camara (2011) identified issues and linking designs for the Partnership for Assessment of Readiness for College and Careers. The intent of this consortium of 24 states is to design assessments that yield reliable scores that can be validly interpreted in the areas of English Language Arts and Mathematics across grades, years, and states. Within-state comparisons are being met by existing state assessments. However, “there has been no success in establishing comparability across states” (Luecht & Camara, 2011, p. 1). Luecht and Camara provided three different linking designs (Form A and Form B with a set of common items; a calibrated item bank leading to several different test forms, with sets of linking items between the bank and the forms; and calibrated task models with linking templates to the forms used). They pointed to the need for clear content requirements and statistical targets for each form to be linked regardless of design. They did not recommend against equating but pointed out that an alternative plan needs to be in place for comparing scores if equating is untenable. They then set forth criteria similar to those discussed by Liu and Walker (2007).
The purpose of the research reported in the present article was to assess a new linking procedure for the case in which the following changes were made in the absence of any field trials: (a) replacement of a conjunctive decision model in which students had to pass each of two tests, one in reading and the other in writing, with a compensatory decision model in which the students had to pass a test containing both reading and writing components; (b) reduction in testing time from 2 half days to 1 half day; (c) change in the proportions of the types of items used to assess the domain; and (d) change in the time of administration of the test from October to late March/early April of the school year. However, there was no change in the domain to be assessed, the table of specifications, and the inferences to be made. There was still the need to continue to identify the proportion of successful students and to monitor the change in the proportion of successful students at the provincial, school board, and school levels. Given the lack of time between the two administrations for field testing, a second new form of the Ontario Secondary School Literacy Test (OSSLT), called a pseudo test, was constructed from the operational test items included in the initial test that best “mimicked” the new revised test. Thus, the following research question was addressed in the present study: Would the use of a pseudo test as the link between the two forms of the test result in equated or concordant scores between the 2 years for both an English-language population and a French-language population?
Context of the Study
The English-language and French-language students attending public and private schools in the Province of Ontario are required to demonstrate their command of literacy based on instruction through the end of the ninth grade. To do so, they must pass the OSSLT, which is developed separately for both language groups and administered to Grade 10 students. Students who do not pass the OSSLT on their first attempt can retake the OSSLT in a following year or enroll in and pass a course designed to develop the required reading and writing skills.
Prior to 2006, 2 half days were required to complete the English-language and French-language OSSLT. Both reading and writing were assessed on both days and the test was administered in Fall. Separate scores for reading and for writing were computed, and students were required to pass both reading and writing in order to fulfill the literacy requirement (i.e., a conjunctive decision model).
In response to recommendations from a comprehensive external review of all the Education Quality and Accountability Office (EQAO) assessment programs (Wolfe, Childs, & Elgie, 2004), the time of administration of the OSSLT was moved from Fall to early Spring, the OSSLT was shortened so that it could be administered in 1 half day, and the separate reading and writing scores were replaced with one literacy score beginning with the next assessment in the 2005-2006 school year. Both reading and writing were assessed, with each component accounting for approximately half of the total score points so as to achieve the equal weighting reflected in the need to pass both reading and writing in the previous years. Students were required to achieve a passing score on the total test in order to fulfill the literacy requirement. This is a compensatory decision model because high performance on either reading or writing can compensate for low performance in the other area.
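The difference between the two decision models can be illustrated with a small sketch. The scores and cut-scores below are hypothetical and are not OSSLT values; they are chosen only to show a student who fails under the conjunctive rule but passes under the compensatory rule.

```python
def passes_conjunctive(reading: float, writing: float,
                       reading_cut: float, writing_cut: float) -> bool:
    """Pass only if BOTH component scores reach their own cut-scores."""
    return reading >= reading_cut and writing >= writing_cut


def passes_compensatory(reading: float, writing: float, total_cut: float) -> bool:
    """Pass if the combined score reaches a single cut-score."""
    return (reading + writing) >= total_cut


# Hypothetical student: strong in reading, weak in writing.
reading_score, writing_score = 80, 45

conjunctive_result = passes_conjunctive(reading_score, writing_score,
                                        reading_cut=50, writing_cut=50)
compensatory_result = passes_compensatory(reading_score, writing_score,
                                          total_cut=100)

print(conjunctive_result)   # writing is below its cut-score
print(compensatory_result)  # the reading surplus offsets the writing deficit
```

Because the two rules can classify the same student differently, a cut-score on the combined scale cannot simply be read off from the two separate 2004 cut-scores.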
However, both the original test and the new test were referenced to the same content domain and, other than changing the proportions of multiple-choice and open-response items, to the same table of specifications. Furthermore, the same inferences were to be drawn, namely, performance in literacy. Therefore, given the policy decision to continue reporting the percentages of successful students and to make comparisons with the previous year’s scores, the purpose of the present study was to develop and evaluate a linking procedure that would produce a cut-score after the changes that would be comparable to the cut-score before the changes so that trends in the percentage of successful students across time could be validly interpreted for the English-language and the French-language populations.
Definition of the Domain
The definition of literacy adopted for the OSSLT is as follows: For the purpose of the OSSLT, literacy comprises the reading and writing skills required to understand reading selections and communicate through a variety of written forms as expected in The Ontario Curriculum across all subjects up to the end of Grade 9. (EQAO, 2007, p. 10)
For reading, students must use strategies to interact with a variety of narrative, informational, and graphic text selections to construct and gain an understanding of the meaning of texts of different forms, demonstrate their understanding of explicit and implicit meaning, and connect their understanding of what they have read to their own personal experience and knowledge. For writing, students are prompted to write two short responses, an extended response expressing and supporting an opinion they formulate from a prompt, and an extended news report they develop from a prompt. Through their responses, students demonstrate their ability to communicate ideas and information clearly and coherently (EQAO, 2007, p. 10).
The definition and its explication for both languages have not changed since the OSSLT was first administered in 2002.
Items for each OSSLT (the reading selections and accompanying multiple-choice and open-response items and the writing prompts for the English- and French-language forms) are written to reflect the
levels of cognitive processing and employment of reading (e.g., finding explicit information, finding implicit information, and making connections) and writing strategies (formulating ideas given information and working with their own experiences),
appropriate levels of reading difficulty of the reading passages, and
range of reading and writing item difficulty.
Both English- and French-language versions of the OSSLT are independently constructed, but the number of items and points are the same. The items are written by trained teacher item writers following accepted item writing procedures and then vetted by EQAO staff. The items are then carefully reviewed by an Assessment Committee and by a separate Sensitivity Committee, with both committees comprising teachers, principals, and curriculum specialists representative of the school boards within Ontario. Revisions are made as needed. Care is taken to ensure that the reading passages and their accompanying items are comparable between successive years and call for the same cognitive processing.
The items are then field tested. A matrix sampling plan is used to distribute the field test items as embedded items in the previous year’s OSSLT so that the field test items take up no more than 20% of the testing time. Consequently, there is more than one form in each year. Each form contains the same operational items and different field test items placed in the same position within each operational form. Field-tested items with acceptable psychometric properties and that as a set represent the literacy domain comprise the operational form next year and serve as the common items to link the current year and previous year operational forms using the fixed common item parameter procedure. In this linking, the common items are treated as an external anchor in the previous year and as an internal anchor in the current year.
Comparability of 2004 and 2006 Test Forms
To meet the reduction in administration time, the number of items referenced to each element in the table of specifications changed between 2004 and 2006, and multiple-choice and short-response items for writing were introduced in 2006. The number and percentage of items and points for the three reading skills and the three writing tasks for the 2004 and 2006 OSSLT are provided in Table 1.
Table 1. Distribution of Items and Points Across Reading and Writing by Year
Note. MC = multiple-choice items; OR = open-response reading items; SW = short-writing items; LW = long-writing items; PT = pseudo test. R1 = understanding explicitly stated ideas and information; R2 = understanding implicitly stated ideas and information; R3 = making connections between ideas and information in a reading selection and personal knowledge and experience. k and n are the number of items and points, respectively; % is corresponding percentage. The difference between the number of items and the number of points for R1, R2, and R3 is because of the use of multiple-choice items, which were dichotomously scored (0, 1), and open-response items, some of which were dichotomously scored and others that were polytomously scored (0, 1, 2).
Although the reading selections and extended-response writing prompts used in 2004 and 2006 measured the same skills, the tests differed somewhat in their distributions of multiple-choice reading items, short-answer reading items, and, as mentioned above, multiple-choice and short-answer writing items. For example, the number of short-answer reading items administered in 2004 for R1 (explicitly stated ideas and strategies in the reading passage) was 16; no short-answer items for R1 were administered in 2006. Similarly, the numbers of short-answer reading items for the remaining two reading skills were reduced (24 to 3 for R2 [understanding implicitly stated ideas and information] and 20 to 3 for R3 [making connections between ideas and information in a reading selection and personal knowledge and experience]). The short-answer items for Reading were dichotomously scored and assessed the same material that the multiple-choice items assessed. Consequently, the percentages of multiple-choice items increased for the three reading skills. However, the multiple-choice and short-response reading items in the 2004 and 2006 tests were construct equivalent.
In the case of writing, the four extended-response items in the 2004 form were replaced with eight multiple-choice items that measured writing conventions, two short-answer items, and two extended-response items. The two short-answer writing items used in 2006, which required students to summarize short reading passages, took the place of a longer writing item in 2004 that required students to write a summary of a long reading passage. Two of the remaining extended-response items were the same in both years, and the third item was dropped in favor of the inclusion of the eight multiple-choice convention items.
Turning to the scoring of the open-response items, whereas the short-response reading items in the 2004 OSSLT were scored either right/wrong or 0/1/2, all of the short-answer reading items in the 2006 OSSLT were scored for the use of ideas and information from the reading prompt using a 3-point scoring rubric. For writing, 4-point rubrics were used to score the student responses for topic development, taking into account conventions, for the four extended-response items in the 2004 OSSLT. The new short-answer writing items in the 2006 OSSLT were scored using a 3-point rubric for topic development and a 2-point rubric for conventions, and the two extended-response items were scored using a 6-point rubric for topic development and a 4-point rubric for conventions. Taken together, the eight multiple-choice items that measured conventions in the 2006 OSSLT and the 2-point and 4-point rubrics for conventions used in 2006 replaced the “taking into account conventions” criterion that was applied when scoring the four extended-response items in the 2004 form.
The 2006 operational form was assumed to reflect accurately the weighting of the reading and writing components for the EQAO literacy construct. The 2004 form had two separately scored parts that were treated equally (students had to pass both parts). To achieve nearly equal weighting for Reading and Writing in 2006, the items and score points were distributed across content areas as shown in the lower half of Table 1. As shown, the points for the two extended-response items were doubled to achieve the equal weighting.
Thus, although the content and cognitive processing measured did not change, given changes in the decision model, the distribution of items, introduction of multiple-choice and short response items for writing, revision of the scoring rubrics, and the change in the time the test is administered, how should the linking be accomplished to yield comparable cut-scores that separate successful from unsuccessful students? If both versions of the test had been administered to the same group of students or randomly equivalent groups of students, classical linking methods such as equipercentile linking could be used. But that was not the case. The design corresponded to the common-item nonequivalent groups design given that students in 2 different years are tested. The linkage problem was solved with the pseudo test approach described in the next section.
Linking Test Form: Construction of the Pseudo Test
If a specific form of a test can be assumed to be well designed and well constructed, then the reported score on the test is expected to reflect the desired weighting of components and to be highly related to the construct. That means that the ordering of students using the reported score should be strongly related to the ordering that would be observed if the students’ locations on the construct itself were available. The implication of this observation is that the goal of linking in the present case is to estimate the location of the pass/fail point on the test form reporting scale in 2006 that corresponds to the location on the construct implied by previous decisions in 2004.
The 2004 OSSLT could not be used directly because the reading and writing components were considered separately, with two separate cut-scores. Thus, to achieve linking, a pseudo test form was created from the operational items included in the 2004 OSSLT so as to match the new 2006 OSSLT as closely as possible. The 2004 operational items that best “mimicked” the 2006 operational test in terms of content, cognitive processing, and difficulty were selected. As shown in the last row of each panel of Table 1, the type and number of items selected matched as closely as possible the type and number of items and points for each of the three reading skills and for writing in the 2006 OSSLT. In addition, the items selected needed to match as closely as possible the psychometric specifications (e.g., distribution of difficulty, information function) used to construct the 2006 form. The three long-writing items included in the pseudo test were each initially scored with a 4-point rubric in 2004, whereas the two long items included in the 2006 operational form were scored with a 6-point rubric for topic development and a 4-point rubric for conventions, which were then doubled to achieve equal weighting between reading and writing. Consequently, to obtain a score weight close to the score weight for the 2006 writing items, the scores of the three long-writing items in the pseudo test were multiplied by 4.
Comparing the values in the two 2004 pseudo test rows in Table 1 with the values in the corresponding rows for the 2004 OSSLT and the 2006 OSSLT reveals that the pseudo test provided a closer fit to the 2006 OSSLT than the 2004 OSSLT did and, as a consequence, better reflected the weighting of Reading and Writing in the 2006 operational form. Therefore, construction of the pseudo test from the operational items contained in the 2004 OSSLT was deemed successful. Once this task had been completed, the field test items that were embedded in the 2004 operational forms and that together formed the operational items for the 2006 OSSLT were added to the pseudo test in the same way as in 2004, thereby producing a number of pseudo test forms equal to the number of OSSLT forms administered in 2004.
Calibration and Scaling Sample
A set of exclusion rules was implemented for the selection of the calibration and linking samples for the English-language and French-language student populations. First, previously eligible students were removed. Then the following categories of first-time eligible students were excluded from the 2004 pseudo test and 2006 OSSLT linking samples:
Students with no work or incomplete work in a major section of the test
Students receiving accommodations, with one exception: extra time
Students who were exempted, deferred, or taking the Ontario Secondary School Literacy Course (OSSLC)
Students who were homeschooled
After the exclusion criteria were applied, the numbers of first-time eligible students in the scaling samples were 137,946 and 4,645 for the 2004 English-language and French-language pseudo tests, respectively, and 146,280 and 5,009 for the 2006 OSSLT, respectively. The fact that there were both English-language and French-language versions of the OSSLT that were separately constructed gave the opportunity to replicate the procedures with different tests and different sample sizes.
Calibration
A modified one-parameter item response theory model was used to calibrate the multiple-choice items. The modified partial credit model (PCM; Masters, 1982) was used to calibrate the open-response items. These models were selected for the OSSLT because of the small number of French-language students. Because of the number of versions of the test with different field test items embedded in the set of operational items, the number of French-language students who responded to each field test item varied from about 400 to 600. In contrast, the number of English-language students who responded to each field test item was approximately 7,000, with around 1,500 student responses to open-response items. However, EQAO
The calibrations were completed using PARSCALE 4.1 (Muraki & Bock, 2003). The a-parameter for the multiple-choice and open-response items was set to 0.588 (which brings
where
The PCM was used to estimate the item and ability parameters for the open-response items. The PCM is given by the following equation:
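A standard statement of the PCM with a common fixed slope, in the form usually associated with PARSCALE, is sketched below. The notation is assumed here and may differ from the authors' own, and the scaling constant D = 1.7 is an assumption that is not stated in the text; under it, a = 0.588 gives Da ≈ 1, so the expression reduces to the PCM of Masters (1982).

```latex
\[
P_{jk}(\theta) =
\frac{\exp\!\left[\sum_{v=0}^{k} D a\,(\theta - b_{jv})\right]}
     {\sum_{c=0}^{m_j} \exp\!\left[\sum_{v=0}^{c} D a\,(\theta - b_{jv})\right]},
\qquad k = 0, 1, \ldots, m_j ,
\]
% Convention: the v = 0 term is defined to be zero,
% i.e. \sum_{v=0}^{0} D a (\theta - b_{jv}) \equiv 0.
```

Here P_{jk}(θ) is the probability that a student with proficiency θ obtains score k on item j, m_j is the maximum score on item j, b_{jv} is the step parameter for the v-th score category of item j, and a is the common slope fixed at 0.588.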
The first analysis performed was to check the psychometric quality of the linking items. As mentioned earlier, in this linking, the common items were treated as an external anchor in the pseudo test and as an internal anchor in the 2006 OSSLT. This check was completed in two steps. First, the operational items and field test items selected for the pseudo test and the 2006 operational items were separately calibrated. Second, the item parameters of items common to the pseudo test and the 2006 OSSLT (i.e., the field test items selected for the 2006 OSSLT) were compared by plotting the pairs of item parameters. If the item parameters are stable across the two calibrations, the points should fall along a straight diagonal line. Items corresponding to points outside the 95% confidence band around this line were considered to be outliers and were dropped from scaling. Based on this analysis and taking into account the need to ensure representation of the literacy domain assessed, four English-language and five French-language multiple-choice reading items were excluded from linking because of changes to these items following field testing.
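A screening of this kind can be sketched as follows. The text does not say how the 95% confidence band was constructed, so the rule used here (flag an item when its difference in difficulty between the two calibrations lies more than 1.96 standard deviations from the mean difference) is an assumption, and the item labels and difficulty values are hypothetical.

```python
from statistics import mean, stdev


def flag_outlier_items(b_old: dict, b_new: dict, z: float = 1.96) -> list:
    """Return labels of common items whose difficulty shift is unusually large.

    Assumed rule (not from the source): an item is flagged when its
    difference b_new - b_old is more than z standard deviations away
    from the mean difference across all common items.
    """
    diffs = {item: b_new[item] - b_old[item] for item in b_old if item in b_new}
    center = mean(diffs.values())
    spread = stdev(diffs.values())
    return sorted(item for item, d in diffs.items() if abs(d - center) > z * spread)


# Hypothetical difficulty estimates from the two separate calibrations.
b_old = {"R01": -1.20, "R02": -0.80, "R03": -0.50, "R04": -0.20, "R05": 0.00,
         "R06": 0.30, "R07": 0.50, "R08": 0.80, "R09": 1.10, "R10": 1.40}
b_new = {"R01": -1.18, "R02": -0.83, "R03": -0.49, "R04": -0.20, "R05": -0.01,
         "R06": 0.33, "R07": 0.48, "R08": 0.81, "R09": 1.09, "R10": 2.40}

flagged = flag_outlier_items(b_old, b_new)
print(flagged)
```

In practice a flagged item would still be reviewed for content before being dropped, since removing it can weaken the representation of the domain in the anchor set.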
Linking Procedure
The forward-fixed parameter common-item nonequivalent group procedure was used to link the 2004 pseudo test and 2006 OSSLT scores. In this linking, the common items were treated as an external anchor in the pseudo test and as an internal anchor in the 2006 OSSLT. The process used to place the 2004 pseudo test and 2006 OSSLT on a common scale was as follows. First, the 2006 items were calibrated and the proficiency estimates for the 2006 population were computed. Second, the items in the pseudo test were recalibrated by fixing the item parameters for the field test items brought forward to form the 2006 OSSLT. Third, one-parameter proficiency estimates for the pseudo test form for the 2004 linking sample were calculated. Fourth, the cut-score in the distribution of θ values for the pseudo test was identified by finding the θ value in the pseudo test proficiency distribution at and above which the proportion of students was equal to the percentage of students who passed both the reading test and the writing test in 2004. Given the cut-score for the pseudo test from Step 4 is on the same item response theory proficiency scale as the 2006 assessment, this cut-score was used as the cut-score in the 2006 proficiency distribution. Last, the percentage of successful students at or above this cut-score was determined for 2006.
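Steps 4 through 6 of the procedure above can be sketched as follows. The proficiency values and the pass rate are hypothetical; the sketch assumes both sets of θ estimates are already on the common scale, and it ignores ties at the cut-score.

```python
def cut_score_from_pass_rate(thetas: list, pass_rate: float) -> float:
    """Find the theta at and above which the given proportion of students fall."""
    ordered = sorted(thetas, reverse=True)
    n_pass = round(pass_rate * len(ordered))
    if n_pass < 1:
        raise ValueError("pass_rate is too small for this sample size")
    # The lowest theta among the top n_pass students is the cut-score.
    return ordered[n_pass - 1]


def percent_at_or_above(thetas: list, cut: float) -> float:
    """Percentage of students whose theta is at or above the cut-score."""
    return 100.0 * sum(t >= cut for t in thetas) / len(thetas)


# Hypothetical theta estimates on the common scale.
pseudo_2004 = [-1.5, -0.9, -0.4, -0.1, 0.2, 0.5, 0.7, 1.0, 1.3, 1.8]
osslt_2006 = [-1.2, -0.6, -0.4, 0.0, 0.3, 0.6, 0.9, 1.1, 1.5, 2.0]

# Hypothetical proportion who passed both reading and writing in 2004.
cut = cut_score_from_pass_rate(pseudo_2004, pass_rate=0.8)
pct_2006 = percent_at_or_above(osslt_2006, cut)

print(cut, pct_2006)
```

The key design point is that the cut-score is located in the pseudo test distribution and then carried unchanged to the 2006 distribution, which is only meaningful because the fixed common item parameters place both on one scale.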
Scale scores were then formed. First, the cut-score was set to 300 and the theta score of −4.0 was set to 200. These values were used to calculate the slope and intercept for the linear transformation used to convert the theta values to scale scores. The scale scores were then rounded to the nearest 5 (e.g., 293 rounded to 295). These scores together with reasons for unsuccessful performance were provided to the unsuccessful students; no scores were provided to the successful students.
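The linear transformation follows from the two anchor points: the θ cut-score maps to 300 and θ = −4.0 maps to 200. The sketch below uses a hypothetical cut-score of θ = −0.5, since the actual value is not reported here, and rounds halves upward when rounding to the nearest 5.

```python
import math


def make_scale_transform(theta_cut: float, theta_floor: float = -4.0,
                         scale_cut: float = 300.0, scale_floor: float = 200.0):
    """Build the theta-to-scale-score conversion from the two anchor points."""
    slope = (scale_cut - scale_floor) / (theta_cut - theta_floor)
    intercept = scale_cut - slope * theta_cut

    def to_scale(theta: float) -> int:
        raw = slope * theta + intercept
        # Round to the nearest multiple of 5 (halves rounded up).
        return int(math.floor(raw / 5.0 + 0.5) * 5)

    return to_scale


# Hypothetical theta cut-score; the operational value is not given in the text.
to_scale = make_scale_transform(theta_cut=-0.5)

print(to_scale(-0.5))   # the cut-score itself
print(to_scale(-4.0))   # the lower anchor point
print(to_scale(-0.75))  # a student just below the cut-score
```

With this hypothetical cut-score, a θ of −0.75 converts to about 292.9 before rounding, which is reported as 295.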
Results
The descriptive statistics for the pseudo test and the 2006 literacy tests for the English-language students are provided in the top panel of Table 2 and for the French-language students in the bottom panel in terms of observed score percentages. Whereas the standard deviations of the pseudo test and the 2006 OSSLT were comparable, the mean for the English-language 2006 OSSLT was 5.8% higher than the mean for the English-language pseudo test and the mean for the French-language 2006 OSSLT was 2.1% higher than the mean for the French-language pseudo test. Furthermore, as expected, although the distributions for the pseudo test and for the 2006 OSSLT for both language groups were negatively skewed and leptokurtic given the OSSLT is a minimum competency test, the distributions for the 2006 OSSLT were more so. The values of Cronbach’s alpha for the pseudo test (.82 and .81 for the English-language and French-language pseudo tests, respectively) were a little lower than the values for the 2006 OSSLT (.87) for both language groups.
Table 2. Descriptive Statistics for the Pseudo Test and the 2006 OSSLT: Observed Score Metric
Note. OSSLT = Ontario Secondary School Literacy Test.
The percentages of successful English-language and French-language students for the 2004, 2006, 2007, and 2008 equating samples are summarized in Table 3 (English-language students appear in the left panel and French-language students in the right panel). As shown, the percentage of successful students decreased by 2.3% between the 2004 OSSLT and 2006 OSSLT equating samples for English-language students and by 3.3% for French-language students. The changes in these two percentages between years from the 2006 OSSLT through 2008 were much smaller, varying by 0.4% for the English-language equating samples and 2.3% for the French-language equating samples. Similar patterns of change were observed for male students and female students across the 4 years, but the change in the percentage of successful males between 2004 and 2006 was greater than the change in the percentage of successful females. However, given the relative stability across the years for the total, male, and female samples beginning with 2006, it appears that the use of the pseudo test constructed from the 2004 operational items and designed to mimic the 2006 OSSLT led to an appropriate estimate of the percentage passing the new OSSLT for both the English-language and French-language linking samples.
Percentage of Students Successful by Language, Gender, and Year
Plots of the values of the CSEMs for the scale scores are shown in Figure 1 for the 2004 OSSLT and the 2006 OSSLT equating samples for both languages. The gaps in the plots correspond to scale scores for which there were no students in one or both tests. To determine whether or not there were differences between the values of the CSEM for 2004 and 2006, the difference that matters (Dorans, 2004b, p. 57) was used. The scale score unit is 5 scale points; hence, the difference that matters is 2.5 points. The values of the CSEM for the 2004 and 2006 OSSLT are very similar within language. The differences between the two curves are less than the difference that matters within the scale score range 260 to 385 for the English-language OSSLT and within the scale score range 255 to 270 for the French-language OSSLT. Some precision was lost outside these ranges, with the largest difference being 4 scale points.

Comparison of CSEM between 2004 and 2006 OSSLT within language: Scale scores
Note. CSEM = conditional standard error of measurement; OSSLT = Ontario Secondary School Literacy Test; ENG = English; FRE = French.
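The difference-that-matters comparison described above can be sketched as follows. The sketch takes the difference that matters to be half of the scale score unit, as in the text (5 scale points, hence 2.5), and flags scale scores at which the two CSEM values differ by at least that amount. The function names and the CSEM values in the example are hypothetical and are not read from Figure 1.

```python
def difference_that_matters(score_unit):
    """Half of the reporting scale unit (2.5 for a 5-point unit)."""
    return score_unit / 2.0


def flag_csem_differences(csem_a, csem_b, score_unit):
    """Return {scale score: |CSEM difference|} where the gap is not below the DTM.

    csem_a, csem_b: dicts mapping a scale score to its CSEM on each test.
    Scale scores missing from either test (the gaps in the plots) are skipped.
    """
    dtm = difference_that_matters(score_unit)
    flagged = {}
    for score in sorted(csem_a.keys() & csem_b.keys()):
        diff = abs(csem_a[score] - csem_b[score])
        if diff >= dtm:
            flagged[score] = diff
    return flagged


# Invented values for illustration only.
csem_2004 = {250: 14.0, 300: 10.0, 350: 9.0, 390: 12.0}
csem_2006 = {250: 17.5, 300: 11.0, 350: 9.5, 395: 13.0}
print(flag_csem_differences(csem_2004, csem_2006, score_unit=5))  # {250: 3.5}
```

In this invented example only the lowest scale score is flagged, which mirrors the pattern reported in the text of precision being lost in the tails rather than in the middle of the scale.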
Discussion
As indicated by Liu and Walker (2007), when changes are made to a test that is administered on an annual basis and is to be used to measure progress, it is necessary to determine how to link the scores on the revised test to the scores on the original test. In the present case, the conjunctive decision model was replaced with a compensatory decision model, the time of year at which the OSSLT was administered was changed from Fall to early Spring, and the testing time was reduced by half. Consequently, although the inferences to be made and the domain to be assessed did not change, the table of specifications did change in terms of the distributions of the different item types used, and the scoring rubrics were revised.
Given that there was no transition time from the old to the new OSSLT in which to field test the new shortened version, the procedure used to link the 2004 OSSLT and the 2006 OSSLT was to develop a pseudo test from the operational items of the 2004 OSSLT that closely mirrored the 2006 OSSLT. The items that were field tested in 2004 as external anchor items and that were used to construct the 2006 OSSLT were added to the pseudo test to allow linking of the pseudo test and the 2006 OSSLT using the same FCIP linking procedure used by the EQAO to link its annual assessments.
The empirical results revealed that the percentages of successful students for the total sample and for the male and female samples were relatively stable across the years from 2006 through 2008 for both the English-language linking sample and the French-language linking sample. The distributions of CSEMs for the 2004 OSSLT linking sample and the 2006 OSSLT linking sample were similar, and with the exception of a few scores in the tails of the proficiency distributions, the differences were within the difference that matters for both languages.
There was, though, a drop in the percentage of successful students between 2004 and 2006. The reason appears to be an interaction between the change in the decision model, gender, and literacy area. Research has shown that male/female differences on open-response questions often do not parallel the male/female differences on multiple-choice questions in the same subject (Breland, Danos, Kahn, Kubota, & Bonner, 1994; Livingston, 2009; Livingston & Rupp, 2004; Mazzeo, Schmitt, & Bleistein, 1993; Persky, 2003; Salahu-Din, 2008; Willingham & Cole, 1997). In the present case, whereas the percentages of successful English-language and French-language females decreased, respectively, by 1.3% and 2.4% between 2004 and 2006, the percentages of successful English-language and French-language males decreased, respectively, by 3.7% and 4.2% between 2004 and 2006. Furthermore, the differences between the female and male students in 2004, 3.3% English-language and 8.5% French-language, were smaller than the corresponding differences in 2006, 6.1% and 10.3%. The English-language female and male students performed essentially equally well on the set of multiple-choice items included in the 2006 OSSLT (
It was initially thought that the percentage of successful students might increase given the change in the time at which the OSSLT was administered between 2004 and 2006 and the adoption of the compensatory decision model. It was felt that the students’ ability would be more developed in the Spring than in the Fall of the 10th grade. However, the percentage of successful students decreased rather than increased for both the English-language and French-language linking samples. The replacement of the conjunctive decision model with the compensatory decision model would normally lead to an increase in the number of successful students. Again, this did not happen in the present case, likely because the OSSLT is a minimum competency test, which led to negatively skewed distributions with the cut-score placed well into the stretched tail of the distributions, where there were fewer students.
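The expectation that a compensatory model passes more students than a conjunctive model can be illustrated with a small sketch. It assumes two component scores (labeled reading and writing here), separate component cut-scores under the conjunctive model, and a total cut-score equal to the sum of the component cut-scores under the compensatory model. These assumptions, the function names, and the scores are all hypothetical and are not the OSSLT cut-scores or data.

```python
def conjunctive_pass(reading, writing, reading_cut, writing_cut):
    """Successful only if each component meets its own cut-score."""
    return reading >= reading_cut and writing >= writing_cut


def compensatory_pass(reading, writing, total_cut):
    """Successful if the total meets the cut; strength in one area offsets the other."""
    return reading + writing >= total_cut


# Invented (reading, writing) scores for five students.
students = [(30, 30), (40, 18), (18, 40), (15, 15), (25, 20)]
reading_cut, writing_cut = 20, 20
total_cut = reading_cut + writing_cut  # assumption: total cut is the sum of the component cuts

n_conjunctive = sum(conjunctive_pass(r, w, reading_cut, writing_cut) for r, w in students)
n_compensatory = sum(compensatory_pass(r, w, total_cut) for r, w in students)
print(n_conjunctive, n_compensatory)  # 2 4
```

Under these assumptions every student who is successful under the conjunctive rule is also successful under the compensatory rule, so the count can only stay the same or rise. How much it rises depends on how many students sit near the cut-score with uneven component scores, which is consistent with the explanation above that few students were located in that region of the distribution.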
The pseudo test constructed from the 2004 operational items appeared to mirror the 2006 OSSLT successfully, the CSEMs for the 2004 and 2006 OSSLTs were similar, and the percentage of successful students was stable beginning in 2006. However, the percentage of successful students dropped between 2004 and 2006 because of the interaction between gender and item type. The question therefore became this: Can the 2004 and 2006 scores be treated as equated or concordant? Kolen and Brennan (2004) suggested four features to examine the degree of similarity between the two tests (pp. 433-436). In the context of this study, the inferences to be drawn did not change, the construct to be assessed was unchanged, and the populations were similar. The change in the time of the year at which the test was administered was not a factor. Likewise, the change from a conjunctive to a compensatory decision model was not a factor. The reliabilities were adequate, and the values of the CSEMs were similar. Whereas the dimensions of the table of specifications did not change, changes were made in the proportion of different item types, particularly for writing, which resulted in a decrease in the percentage of successful students between the two years that was, as expected, larger for male than for female students. Taken together, in the present context and using Kolen and Brennan’s degree of similarity, the results of the present study support the argument that the scores obtained in 2004 and the scores obtained in 2006 using the pseudo test should be considered not as equated scores but rather as concordant scores.
The pseudo test approach outlined in this article provides a viable method for establishing concordant scores when there is no time to field test a revised test, yet the scores from the revised test and the initial test are to be linked. However, care must be taken to ensure that there are a sufficient number of operational items in the previous test to construct a pseudo test that adequately represents the domain to be assessed in a reliable way, so that valid inferences from the linking can be made.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
