Abstract
American Sign Language (ASL) is one of the most commonly taught languages in North America. Yet, few assessment instruments for ASL proficiency have been developed, none of which have adequately demonstrated validity. We propose that the American Sign Language Discrimination Test (ASL-DT), a recently developed measure of learners’ ability to discriminate phonological and morphophonological contrasts in ASL, provides an objective overall measure of ASL proficiency. In this study, the ASL-DT was administered to 194 participants at beginning, intermediate, and high levels of ASL proficiency, a subset of whom (N = 57) was also administered the Sign Language Proficiency Interview (SLPI), a widely used subjective proficiency measure. Using Rasch analysis to model ASL-DT item difficulty and person ability, we tested the ability of the ASL-DT Rasch measure to detect participant proficiency group mean differences and compared its discriminant performance to the SLPI ratings for classifying individuals into their pre-assigned proficiency groups using receiver operating characteristic statistics. The ASL-DT Rasch measure outperformed the SLPI ratings, indicating that the ASL-DT may provide a valid objective measure of overall ASL proficiency. As such, the ASL-DT Rasch measure may provide a useful complement to measures such as the SLPI in comprehensive sign language assessment programs.
Introduction
American Sign Language
American Sign Language (ASL) is one of the most commonly taught second languages in American colleges and universities (Goldberg, Looney, & Lusin, 2015). It is widely used by both hearing and Deaf people and taught in numerous primary, secondary and postsecondary academic programs. It also is a native language for many individuals, and is considered the core language of the Deaf community in the United States (Padden & Humphries, 1988; following Padden and Humphries, uppercase Deaf denotes a community of people who share a common sign language and culture, and lowercase deaf denotes people with hearing loss).
Within the context of second language (L2) acquisition, the construct of language proficiency generally refers to the degree to which a learner has achieved native-like fluency overall and with particular reference to the domains of language reception and expression (i.e., the sub-skills of speaking, listening, reading and writing). With regard to ASL, the construct of language proficiency refers to overall ability in terms of language reception and expression. The assessment of language proficiency is necessary for appropriate course placement, measurement of educational attainments over time, research, and other academic purposes. Despite the popularity of ASL as a frequently taught L2, few measures of adults’ ASL proficiency have been described in the literature. One such measure, the Sign Language Proficiency Interview (SLPI), was developed at the National Technical Institute for the Deaf at Rochester Institute of Technology in the 1980s, and remains the most widely used measure of ASL proficiency today (Caccamise & Samar, 2009). The SLPI requires trained raters to use a scale composed of 11 levels of functional communication skills to evaluate a respondent’s ASL proficiency as displayed in a video recording of a conversation between an interviewer and the respondent. The SLPI is based on a well-known measure of L2 spoken language proficiency (Newell, Caccamise, Boardman, & Holcomb, 1983), the American Council on the Teaching of Foreign Languages’ Oral Proficiency Interview (OPI).
Although the SLPI is designed to allow the evaluator to observe a respondent’s proficiency across a broad range of linguistic domains including phonology, syntax, semantics, and pragmatics, a significant validity problem with the SLPI is its fundamentally subjective nature and susceptibility to situational and rater bias. This subjectivity and the lack of independent objective criterion measures of ASL proficiency have made it difficult to assess the reliability and validity of the SLPI within academic and professional settings. Only one psychometric study has been published, which indicates good interrater reliability and limited evidence of construct validity (Caccamise & Samar, 2009). Nevertheless, the greatest challenge to the validity of the SLPI remains that it relies on an interviewer and a highly subjective qualitative scoring rubric. Variability in the background and experience of raters and interviewers can substantially influence test results. For this reason, the validity of the SLPI ratings and similar measures such as the OPI is questionable in the absence of empirical evidence. A history and critique of the OPI has been presented by Chalhoub-Deville and Fulcher (2003), and many of their criticisms apply equally to the SLPI, especially validity questions regarding the use and interpretation of test scores.
It is clear that objective measures of language proficiency with demonstrated validity need to be developed for the assessment of proficiency in ASL and other signed languages. In the absence of objective, valid and reliable assessments, it is not possible to properly validate subjective proficiency measures such as the SLPI, or to confidently evaluate learners’ ASL proficiency or the effectiveness of sign language instructional programs. This is a serious problem for students, educators, and administrators. More objective measures of proficiency may provide relatively fast and unbiased sign language screening and assessment data for programmatic purposes, and may act as objective criterion measures against which existing and future subjective and naturalistic proficiency rating measures can be psychometrically evaluated.
In the present paper, we argue on the basis of known associations between phonological measures and overall language proficiency, and on the basis of validity data newly presented here, that the ASL Discrimination Test (ASL-DT), a direct measure of ASL phonological discrimination, provides a valid and reliable generalized measure of overall ASL proficiency. We use Rasch analysis to optimize the item content of the ASL-DT and demonstrate that ASL-DT Rasch person-ability scores outperform the single preeminent ASL proficiency measure in current use, the SLPI ratings, in discriminating various proficiency groups composed of individuals assigned to those groups based on their sociolinguistic, academic, and employment background. We also show that some categories of ASL linguistic contrasts presented on the ASL-DT are more difficult to discern than others in a manner that respects the known phonological difficulty hierarchy for ASL from previous research (Fischer, Delhorne, & Reed, 1999; Fischer & Tartter, 1985; Tartter & Fischer, 1982), confirming the ASL-DT’s sensitivity to the primary linguistic dimensions it was designed to directly measure. Finally, we report receiver operating characteristic (ROC) results that indicate moderate to excellent proficiency group classification accuracy by the ASL-DT Rasch measure for individuals, and we provide ROC evidence that the group classification accuracy of the ASL-DT Rasch measure outperforms the group classification accuracy of the SLPI rating. Our results suggest that the ASL-DT provides an objective and valid ASL proficiency assessment tool for screening and evaluating individuals for language-related pedagogic, research, and programmatic purposes.
L2 receptive language proficiency
Most tests of L2 receptive language proficiency measure the broad construct of listening comprehension, such as the listening sections of the Test of English as a Foreign Language (TOEFL) and the International English Language Testing System (IELTS). Similar to listening skills in a spoken language (Song, 2008; Vandergrift, 1999), ASL sign reception is an integrative skill involving signal recognition (phonological decoding) and higher-order components of comprehension. Comprehension is necessary for the development and use of all languages regardless of the communication channel, and depends closely on a learner’s acquired ability to discern similarity and dissimilarity at various levels of language structure, especially at the level of phonology (Bohn, 2002; Chen & Fon, 2007; Flege, 2002). The recognition of linguistic contrasts central to a language’s phonology (minimal pairs) is, by definition, directly associated with changes in lexical meaning. As such, learners’ ability to recognize fundamental linguistic similarity and dissimilarity (i.e., phonetic and morphophonological contrasts) directly influences their development of lexical and sentence recognition skills and therefore discourse comprehension and, by extension, their broader overall language proficiency. Consequently, although it is not the sole determinant of language proficiency, the ability to discriminate and recognize phonological and morphophonological contrasts is expected to correlate closely with individual differences in overall language proficiency.
Empirical evidence suggests that sensitivity to phonological contrasts is a fundamental component of language competence reflecting learners’ comprehension ability, as well as their overall level of language proficiency. Hagiwara and Kuzumaki (1982) and Okabayashi (1991) showed that the ability to discriminate speech sounds in the target language is related to listening comprehension in L2 acquisition, as well as L2 proficiency (Hagiwara & Kuzumaki, 1982). Cutler (2012) showed that phonemic misperception is associated with L2 word recognition difficulties and that phonological processing in an L2 tends to be fragile and inflexible in comparison to that in a first language (L1). Finally, Escudero (2009) has shown that difficulties in L2 sound perception lead to word learning and word recognition difficulties in the L2.
These associations between phonological processing and comprehension suggest that a valid and reliable test of phonological discrimination might provide a good proxy for overall language proficiency. Spoken languages having a written form are composed of four correlated components of language proficiency (speaking, listening, reading and writing), as has been demonstrated in studies involving the paper-based TOEFL (Educational Testing Service, 1992). In a recent study of the factor structure of the internet-based TOEFL, a single higher-order general factor (L2 ability) and four first-order factors corresponding to the four components of language proficiency were identified. The loadings for the four first-order factors on the higher-order general factor were as follows: Listening = 0.97; Reading = 0.91; Writing = 0.91; and Speaking = 0.78 (Sawaki, Stricker, & Oranje, 2009). These results confirm that the receptive and expressive dimensions of language processing are closely correlated components of overall language proficiency. Therefore, a single phonological receptive measure might be expected to provide a proxy for individual differences in overall language proficiency by virtue of these correlations across language dimensions.
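As a rough illustration of what the reported loadings imply, the sketch below squares each loading to get the share of variance that a first-order factor has in common with the general factor, and multiplies pairs of loadings to get the correlation between two skills that the model implies. It assumes the loadings are standardized and that the first-order factors are related only through the general factor; neither assumption is stated in the passage above, and the function names are illustrative.

```python
# Loadings on the higher-order factor as quoted from Sawaki, Stricker, & Oranje (2009).
LOADINGS = {"Listening": 0.97, "Reading": 0.91, "Writing": 0.91, "Speaking": 0.78}

def shared_variance(skill):
    """Proportion of a first-order factor's variance shared with the general
    factor, assuming a standardized loading."""
    return LOADINGS[skill] ** 2

def implied_correlation(skill_a, skill_b):
    """Model-implied correlation between two first-order factors, assuming
    they are related only through the general factor."""
    return LOADINGS[skill_a] * LOADINGS[skill_b]

for skill in LOADINGS:
    print(f"{skill}: shared variance = {shared_variance(skill):.3f}")
print(f"Listening-Speaking: r = {implied_correlation('Listening', 'Speaking'):.3f}")
```

Under these assumptions, even the weakest pairing (a receptive skill with Speaking) retains a substantial implied correlation, which is the property the proxy argument relies on.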
In languages which do not have a written form, such as ASL and other natural sign languages, language proficiency is composed only of a visual receptive and a manual expressive component. Assuming that the associations between phonological discrimination and language comprehension and between the receptive and expressive dimensions of language found in previous studies of other languages extend to ASL, the ability to discern phonological and morphophonological contrasts in ASL should provide a proxy for overall ASL proficiency. For example, in a study of an older version of the TOEFL, Pike (1979) found that subjective evaluations of expressive language (i.e., oral interviews and writing samples) correlated closely with objective scores on test sections evaluating receptive dimensions of language use and language knowledge (i.e., listening comprehension, reading comprehension, vocabulary, and grammar). Pike’s findings are indicative of the relationships among the sub-skills comprising language proficiency and lend support to the statement that “it is commonly recognized that these skills are interrelated; persons who are highly proficient in one area tend to be proficient in the other areas as well” (Educational Testing Service, 1992, p. 32).
ASL and other natural sign languages are distinguished from spoken languages primarily by the modality or channel of communication which is used to transmit linguistic information. Accordingly, features appropriate to the visual modality, such as spatial location, handshape, and movement, are used to instantiate ASL phonology, just as features appropriate to the auditory modality, such as oral place and manner of articulation and voicing, are used to instantiate spoken language phonology. The ASL-DT is designed around phonological and morphophonological contrasts in ASL which result in minimal pairs. These contrasts might be particularly challenging for hearing L2 learners of ASL whose L1 is a spoken language because they need to learn a completely different phonological system transmitted over a completely different communication channel. That is, the difference between the auditory and visual channels poses a special challenge for hearing L2 learners of ASL (Bochner et al., 2011). This challenge stems from the fact that positive transfer cannot occur between the native and target phonological systems when L1 is a spoken language and L2 is a sign language. Unlike L2 acquisition when both the native and target languages are spoken, there is no overlap between the two phonological systems. On the contrary, the challenge posed for learners is similar to a situation in which negative transfer (or a unique form of L1 interference) is present.
The perception of movement, handshape, orientation, and location has been well studied in ASL, and it has been shown repeatedly that for native signers viewing visually degraded presentations of ASL utterances, contrasts in movement and handshape are significantly more difficult to discern than contrasts in orientation and location (Fischer, Delhorne, & Reed, 1999; Fischer & Tartter, 1985; Tartter & Fischer, 1982). Moreover, the acquisition of ASL phonology and the ability to differentiate contrastive from noncontrastive differences in signed utterances represent a significant challenge for L2 learners (Bochner et al., 2011). Consistent with the results of studies conducted on native signers viewing degraded ASL utterances, Bochner and colleagues (2011) found that some categories of linguistic contrasts in ASL are more difficult than others for L2 learners. For example, L2 learners found contrasts in the phonetic categories of movement and handshape more difficult to discern than contrasts in the categories of orientation and location. A direct test of ASL phonology should therefore produce item scores that respect the known item difficulty hierarchy of ASL phonological contrasts.
This paper uses a validation sample of 194 individuals to investigate the validity and reliability of ASL-DT Rasch scores as a measure of ASL proficiency and examines the comparative validity of the ASL-DT Rasch measure and the SLPI rating using a subset of the validation sample for which data on both measures were available. The validation sample was divided into three groups of sign language users that differed in their average level of exposure to ASL, as indicated by their sociolinguistic background, sign language course participation history, degree program, or job function. These groups acted as a prima facie criterion measure to test the ability of the ASL-DT to detect group differences in average ASL proficiency and to statistically classify individuals into their pre-assigned proficiency groups. Successful discrimination among these three proficiency groups and good classification of individuals to pre-assigned proficiency groups would justify further studies to establish the construct and predictive validity of the ASL-DT Rasch measure with greater precision. In addition to the ASL-DT Rasch measure, we collected SLPI ratings and demographic data on the 57 new participants. This comparative validity subsample allowed us to compare the success of the ASL-DT Rasch measure versus the SLPI rating, the current standard ASL proficiency measure, for discriminating ASL proficiency group differences and for classifying individuals into their pre-assigned proficiency groups.
Method
Participants
A validation sample of 194 adults participated in this study. The sample comprised 137 participants from the Bochner et al. (2011) study who took the ASL-DT, combined with newly collected data from 57 participants. In addition to the ASL-DT Rasch measure, data on participants’ gender, age, ethnicity, age of acquisition of ASL, and SLPI ratings were collected from the subset of 57 individuals, termed the comparative validation subsample.
No objective, independent ASL proficiency test is available to act as a criterion measure to test the validity of the ASL-DT Rasch measure. The SLPI rating is fallible as a criterion measure because it is sensitive to poorly understood subjective factors whose effect on ratings cannot be reliably identified, measured, or controlled. Poor criterion performance against the SLPI might simply imply that these confounding subjective factors have strong, construct-irrelevant influence on SLPI ratings, not that the ASL-DT Rasch measure lacks validity as an objective measure. Therefore, we did not attempt to obtain a direct criterion measure of ASL skill at the level of individual participants in the validation sample. Instead, we relied on sociolinguistic, academic, and job function criteria associated with different typical levels of lifetime ASL exposure to operationally define three different ASL proficiency groups. The high proficiency group (n = 26) was composed of 16 hearing adults having at least one deaf parent (children of deaf adults or CODAs) and 10 Deaf native signers. Members of the intermediate and low proficiency groups were drawn from employees and students in three different ASL instructional programs at the Rochester Institute of Technology. The intermediate proficiency group (n = 30) was composed of students who were enrolled in intensive credit-bearing courses within a degree program designed to train professional ASL-English interpreters. The low proficiency group (n = 138) was composed of faculty and staff largely at early levels of ASL exposure who were enrolled in non-credit courses designed to build their basic functional skills in ASL and students enrolled in credit-bearing courses designed for learners classified as beginners.
The sociolinguistic, academic, and job criteria used to form the proficiency groups in this study have clear face validity as correlates of the typical amount of lifetime ASL exposure individuals in these groups had experienced. The Deaf native signers and hearing individuals born into a family with at least one Deaf parent in the high proficiency group experienced direct sign language exposure from birth. The intermediate proficiency group was composed of interpreting students enrolled in courses appropriate for learners at the intermediate level of ASL proficiency and were a select group of language-talented individuals motivated to intensively study ASL and pursue a professional degree in ASL-English interpretation, but were not themselves native signers. Based on course enrollment, the vast majority of participants in the low proficiency group had beginner-level proficiency. Some hearing faculty and staff participants in the low proficiency group were not enrolled in ASL courses at the time of data collection but were included because they had little prior exposure to ASL and their use of ASL was generally restricted to no more than a few hours per week.
Although the average level of ASL proficiency differed markedly among all three groups based on general selection criteria, individual participants’ proficiency levels within a given group likely varied greatly, and subjective observation indicated that the groups occasionally included individuals with ASL skills within the range of proficiency of the adjacent group. For example, one member of the low proficiency group had a Deaf spouse, and their ASL proficiency was likely more advanced than their group placement indicated. Similarly, one member of the high proficiency group had one Deaf parent, and their ASL proficiency was not native-like. While the presence of group overlap in participant proficiency is a source of uncontrolled error variance, this error would only make it more difficult to demonstrate the ASL-DT’s ability to discriminate among the ASL proficiency groups.
Demographic data on the 137 participants from the Bochner et al. (2011) study were not available. However, Table 1 displays the demographic summary statistics for the 57 participants in the comparative validity sample. The groups did not differ significantly in gender, and women formed the majority of participants in each group. The groups did differ significantly in age. The majority of participants in each group were white, non-Hispanic. Gender, age, and ethnicity were included as covariates in subsequent analyses to control variance related to these demographic characteristics. Years signing differed among the three groups. Student’s t-tests revealed that the high proficiency group reported more years signing than the intermediate and low proficiency groups, which did not differ significantly from each other. In principle, years signing is a correlate of proficiency. However, this should only be true if signing effort is sustained over those years. In the current validity sample, there is a confound between self-reported years signing and age. The high proficiency group was the youngest group and used sign in their daily lives from birth. The intermediate group was a few years older, acquired sign on average as adults, and also had professional and educational goals that involved sustained use and practice of sign as adults. However, the low proficiency group began to acquire sign much later on average, typically when they were hired at RIT, were much older than the other two groups, and varied widely in how sustained their sign efforts were depending on their professional responsibilities and personal relationships. Therefore, number of years reported signing per se is confounded with age and degree of sustained effort, and as such is a poor proxy for sign language proficiency. To verify this confound, we adjusted the number of years signing for age. Table 1 and follow-up Student’s t comparisons (all pairwise comparisons, p < .05) show that the least squares mean years signing decreased systematically as a function of proficiency group, as would be expected if signing effort were sustained equivalently during the number of years individuals reported signing. We use the age-adjusted years signing measure in analyses below as a convergent proxy for proficiency.
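One common way to adjust a group mean for a covariate is the classical analysis-of-covariance formula: each group's mean is shifted by the pooled within-group slope times the distance between that group's mean age and the overall mean age. The sketch below illustrates that formula only. The text does not specify the exact model used, and the data, group labels, and function name here are invented for illustration.

```python
def age_adjusted_means(data):
    """Covariate-adjusted group means with a pooled within-group slope.

    data maps a group label to a list of (age, years_signing) pairs.
    Returns a dict of adjusted mean years signing, evaluated at the overall mean age.
    """
    all_ages = [age for obs in data.values() for age, _ in obs]
    grand_age = sum(all_ages) / len(all_ages)

    sxy = sxx = 0.0
    group_means = {}
    for group, obs in data.items():
        mean_age = sum(a for a, _ in obs) / len(obs)
        mean_years = sum(y for _, y in obs) / len(obs)
        group_means[group] = (mean_age, mean_years)
        # Accumulate within-group cross-products and sums of squares.
        sxy += sum((a - mean_age) * (y - mean_years) for a, y in obs)
        sxx += sum((a - mean_age) ** 2 for a, _ in obs)

    slope = sxy / sxx  # pooled within-group regression slope of years on age
    return {
        group: mean_years - slope * (mean_age - grand_age)
        for group, (mean_age, mean_years) in group_means.items()
    }

# Invented toy data: group "A" has signed since birth, group "B" started as adults.
toy = {
    "A": [(20, 20), (22, 22)],
    "B": [(40, 10), (42, 12)],
}
print(age_adjusted_means(toy))
```

In the toy data the two groups differ in both raw years signing and age; after adjustment the gap between them widens, because the older group's years are discounted for the extra time they have had available.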
Table 1. Summary of demographic characteristics for the ASL proficiency groups in the comparative validity sample (N = 57).
It is important to note that reported age of acquisition of ASL was markedly different among the three groups. The high proficiency group reported acquiring sign at birth, the intermediate proficiency group as adolescents and young adults, and the low proficiency group as older adults. The confidence intervals among the three groups did not overlap. Age of acquisition is a fundamental life-course factor known to strongly determine language proficiency, with later age of acquisition associated with lesser adult language proficiency due to the roles of critical periods and cognitive decline factors (DeKeyser, 2012; Krashen, Scarcella, & Long, 1982; Newport, 2005; Uylings, 2006). Therefore, these results confirm that the groups in our comparative validity subsample (and likely in our validation sample as well given the similar group selection criteria) were strongly segregated from early to late acquisition ages and, by extension, were on average well stratified by three group proficiency levels from high to low proficiency.
Procedure
ASL Discrimination Test (ASL-DT: Bochner et al., 2011): The ASL Discrimination Test was administered to all participants. The test consists of 48 items, each of which contains two pairs of ASL sentences. The sentences range in length from three to nine signs. Each pair of sentences includes a standard sentence followed by a comparison sentence. The sentences in each item are identical except for one contrasting element. The contrasting element in each item represents a minimal pair. The respondent must decide if the sentences in each pair are the same or different from each other by circling “S” or “D” on an answer sheet. In effect, each sentence pair represents one trial and, for each item, the test-taker must respond to two trials. Any combination of “same” and “different” trials is possible, allowing for four potential response outcomes: S-D, D-S, D-D, and S-S. Although two responses are required for each item, an item is scored correct if and only if responses to both trials are correct. No partial credit is awarded. The reason each item consists of two trials with all combinations of response outcomes being possible is that two trials increase the difficulty of the task and reduce chance-level performance.
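The all-or-nothing scoring rule and its effect on guessing can be sketched as follows. The function names and the example answer key are illustrative and are not taken from the test materials.

```python
from itertools import product

def score_item(responses, key):
    """Return 1 only if both trial responses match the key, else 0 (no partial credit)."""
    if len(responses) != 2 or len(key) != 2:
        raise ValueError("each item consists of exactly two trials")
    return int(tuple(responses) == tuple(key))

def chance_rate(n_trials=2, n_options=2):
    """Probability of scoring an item correct by guessing each trial independently."""
    return (1 / n_options) ** n_trials

# Enumerate the four possible response patterns against a hypothetical key of S-D.
key = ("S", "D")
scores = {pattern: score_item(pattern, key) for pattern in product("SD", repeat=2)}
print(scores)
print("chance-level item score:", chance_rate())
```

Only one of the four response patterns earns credit, so random guessing yields an expected item score of 0.25 rather than the 0.50 expected from a single same/different trial.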
The assessment procedure is illustrated below in the representation of one item (two trials). In this example, an English gloss of the ASL sentence is presented with the contrasting element (minimal pair) underlined. The same standard sentence is used in each pair.
Trial 1 YOUR YOUR Trial 2 YOUR YOUR
The stimulus materials are divided into six categories, each of which contains eight items. Five categories represent contrasts in linguistic properties of ASL (i.e., the phonetic properties of movement, handshape, location and orientation, and a morphophonological category known as complex morphology; Bochner et al., 2011). The final category includes items with no contrasts (i.e., the standard and comparison stimuli are the same). This category is referred to as SAME because both trials are literally the same as the standard sentence. Additional examples of ASL linguistic contrasts (minimal pairs) appearing in the stimulus materials include: BET-AGREE (handshape), MOTHER-FATHER (location), and BALANCE-MAYBE (orientation).
The stimulus materials/sentences were digital video recordings of ASL utterances produced by three different native signers, two males and one female. One of the male signers produced the standard sentences; the female produced the first comparison sentence, and the other male produced the second comparison sentence. Noncontrastive (dialectal, stylistic, and idiosyncratic) variation in sign production is evident among the signers. This variation is intended to make the discrimination task more difficult. The reader is referred to Bochner et al. (2011) for further details pertaining to the assessment procedure and stimulus materials.
Sign Language Proficiency Interview (SLPI: Newell, Caccamise, Boardman, & Holcomb, 1983): The SLPI was administered to the 57 participants in the comparative validity subsample. The test protocol involves a trained interviewer engaging in a recorded conversation with the test taker. The conversation lasts approximately 20 minutes and covers three general topic areas: work/school, personal background, and hobbies. The content of the conversation varies depending on the test taker’s responses.
The recorded conversation is evaluated by a team of three independent raters. Using a metric consisting of an 11-point scale of “ASL communicative functioning,” raters evaluate the use of grammatical features (syntax and morphology), sign vocabulary, fluency and accuracy of sign production, and finger spelling. If the team’s ratings are within one point of each other, the raters proceed to discuss their evaluation and assign a score to the test taker. If the initial ratings are not within one point of each other, the team re-evaluates the video recording and, if the new ratings are within one point of each other, they proceed to discuss their evaluation and assign a score to the test taker. If the three ratings are not within one point of each other, the team determines whether the recording is ratable or not. If the recording is ratable, the team proceeds to discuss their evaluation and assign a score to the test taker. If the recording is not ratable, the test taker is provided with an opportunity to engage in another recorded interview.
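The rating workflow described above can be summarized as a small decision function. This is only a sketch: it assumes the 11 scale levels are coded as integers, it reads "within one point of each other" as a spread of at most one point between the highest and lowest rating, and the function names and step labels are hypothetical.

```python
def within_one_point(ratings):
    """True when the highest and lowest ratings differ by at most one point."""
    return max(ratings) - min(ratings) <= 1

def slpi_next_step(initial, rerated=None, ratable=None):
    """Return the next step for the rating team.

    initial: the three raters' first ratings.
    rerated: their ratings after re-evaluating the recording, if that has happened.
    ratable: the team's judgment of whether the recording can be rated, if made.
    """
    if within_one_point(initial):
        return "discuss and assign score"
    if rerated is None:
        return "re-evaluate recording"
    if within_one_point(rerated):
        return "discuss and assign score"
    if ratable is None:
        return "decide whether recording is ratable"
    return "discuss and assign score" if ratable else "offer new interview"

print(slpi_next_step([6, 7, 7]))
print(slpi_next_step([4, 6, 7]))
print(slpi_next_step([4, 6, 7], rerated=[4, 6, 7], ratable=False))
```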
Results
Rasch analysis
Rasch scaling analysis was performed on the validation sample’s (N = 194) responses to the 48 items contained in the ASL Discrimination Test. In Rasch measurement, fit statistics expose discrepancies between observed data and data expected by the measurement model. Fit statistics are calculated by comparing the observed and expected trace lines obtained for items (i.e., item characteristic curves or ICCs) after the difficulty parameters have been estimated. When the observed ICC departs from the expected ICC beyond a statistical reference value, there is indication that high proficiency respondents “fail” on an easy item or low proficiency respondents “succeed” on a difficult item. In general, items that produce unexpected response patterns of this nature distort or degrade the measurement system that is being created and, hence, are removed. Two of the 48 items had mean square fit statistics that exceeded a criterion value of 2.0 and were excluded from further analysis.
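To make the idea of a mean square fit statistic concrete, the sketch below computes the two standard variants for a single dichotomous item: the unweighted (outfit) and information-weighted (infit) mean squares. The text does not say which variant was used, and the abilities, difficulty, and response vectors are invented for illustration.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response for ability theta and item difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def outfit_mean_square(responses, thetas, b):
    """Unweighted mean of squared standardized residuals for one item."""
    total = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        total += (x - p) ** 2 / (p * (1.0 - p))
    return total / len(responses)

def infit_mean_square(responses, thetas, b):
    """Information-weighted mean square: summed squared residuals over summed variances."""
    num = den = 0.0
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        num += (x - p) ** 2
        den += p * (1.0 - p)
    return num / den

# Invented example: five persons ordered from low to high ability, item difficulty 0.
thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
expected_pattern = [0, 0, 1, 1, 1]    # able persons succeed, less able persons fail
reversed_pattern = [1, 1, 0, 0, 0]    # less able persons succeed, able persons fail

print(outfit_mean_square(expected_pattern, thetas, 0.0))
print(outfit_mean_square(reversed_pattern, thetas, 0.0))
```

Both statistics have an expected value near 1.0 when the data fit the model. The reversed pattern, in which less able persons succeed and abler persons fail, pushes the outfit value well above a 2.0 criterion.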
Person ability and item difficulty were computed for the remaining set of 194 persons and 46 items. In Rasch analysis, person ability and item difficulty are both measured in logarithmic units called logits that range from negative to positive values and indicate the probability of successful responses. Persons with negative logit values have relatively low ability, and persons with positive logit values have relatively high ability. Similarly, items with negative logit values are relatively easy while items with positive logit values are relatively hard.
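For reference, the dichotomous Rasch model links these logit measures to response probabilities as follows, where the symbols are introduced here for illustration: \theta_n is the ability of person n, b_i is the difficulty of item i, and X_{ni} = 1 denotes a correct response.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)},
\qquad
\ln \frac{P(X_{ni} = 1)}{1 - P(X_{ni} = 1)} = \theta_n - b_i
```

As a worked example, a person with ability 0.95 logits facing an item of difficulty 0 has a log-odds of success of 0.95, which corresponds to a probability of 1 / (1 + e^{-0.95}) ≈ 0.72. A person whose ability equals the item's difficulty has a success probability of exactly 0.50.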
The mean number of items correct was 30.7 (66.7%) with SD = 6.9 (14.4%) and a range of 12 to 44. The corresponding mean person ability measure was 0.95 logits (SD = 0.92) with a range of −1.28 to 3.57. Item difficulty values ranged from −1.92 (very easy) to 2.42 (very hard).
The person separation reliability (PSR) was 0.83, and the item separation reliability (ISR) was 0.97. These values are interpreted in the same way as standard reliability coefficients, and therefore they represent very good to excellent reliabilities. The PSR statistic indicates how well the test items are able to separate the ability levels of the persons tested. The ISR statistic indicates how well the test items are ordered in difficulty. The person and item separation statistics in Rasch measurement are useful tools during test development which allowed us to evaluate and refine the item content of the ASL-DT in order to maximize its power to discriminate individuals at different levels of ability. The separation statistics obtained in this study confirm that the sample contained participants with relatively good sign skills on average, but with a suitably wide range of ability. They also indicate that the ASL-DT includes items with a wide range of difficulty appropriate for testing individuals across a wide range of abilities.
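Separation reliability is conventionally defined as the proportion of observed variance in the measures that is not attributable to measurement error. The sketch below shows that definition together with two quantities that follow from a reported reliability: the separation index and the root mean square error implied by a given SD. It assumes this conventional definition applies to the values reported here; the function names and the toy data are illustrative.

```python
import math
from statistics import pvariance

def separation_reliability(measures, std_errors):
    """(Observed variance - mean error variance) / observed variance."""
    observed_var = pvariance(measures)
    error_var = sum(se ** 2 for se in std_errors) / len(std_errors)
    return (observed_var - error_var) / observed_var

def separation_index(reliability):
    """Ratio of error-adjusted SD to root mean square error, from a reliability."""
    return math.sqrt(reliability / (1.0 - reliability))

def implied_rmse(sd, reliability):
    """Root mean square error implied by an observed SD and a reliability."""
    return sd * math.sqrt(1.0 - reliability)

# Invented toy measures (logits) with a constant standard error of 0.5.
print(separation_reliability([-1.0, 0.0, 1.0, 2.0], [0.5] * 4))

# Quantities implied by the reported person statistics (PSR = 0.83, SD = 0.92).
print(round(separation_index(0.83), 2))
print(round(implied_rmse(0.92, 0.83), 2))
```

Under this definition, the reported person reliability of 0.83 corresponds to a separation index of roughly 2.2, that is, a spread of person abilities a little more than twice the typical measurement error.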
Criterion validity
Table 2 shows mean performance displayed in logits on the ASL-DT for each proficiency level. A one-way analysis of variance with participant groups as the factor was significant, F(2,191) = 75.0, p < .0001. Post hoc comparisons indicated significant differences between all combinations of participant groups.
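For readers who want to check the reported degrees of freedom or reproduce this kind of test on their own data, a minimal pure-Python sketch of a one-way analysis of variance follows. The toy scores are invented; only the sample sizes used in the degrees-of-freedom check come from this study.

```python
def anova_df(n_total, n_groups):
    """Between-groups and within-groups degrees of freedom."""
    return n_groups - 1, n_total - n_groups

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of lists of scores."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = anova_df(n_total, len(groups))
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Degrees of freedom for 194 participants in three proficiency groups.
print(anova_df(194, 3))

# Invented toy scores for three groups.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

The same arithmetic accounts for the item-level analyses: 46 retained items in six categories give 5 and 40 degrees of freedom, and a two-group split of those items gives 1 and 44.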
Table 2. ASL-DT means, SDs, ns, and post hoc pairwise comparisons for ASL proficiency groups.
Figure 1 shows mean item difficulty values (di) displayed in logits for each category of items. A one-way analysis of variance with item category as the factor was significant, F(5,40) = 3.1, p < .05. The results of post hoc tests indicated that the mean di for Location items differed significantly from the mean di for Movement items. Other contrasts were not significant. Classifying the item categories into two groups (i.e., the three theoretically easiest categories, Location, Orientation, and Handshape, vs. the three theoretically most difficult categories, Movement, Complex Morphology, and SAME; Bochner et al., 2011; Fischer, Delhorne, & Reed, 1999; Fischer & Tartter, 1985; Tartter & Fischer, 1982) resulted in a significant difference, F(1,44) = 15.39, p < .01.

Figure 1. Mean difficulty values (di) and standard errors for each category of items.
Figure 2 displays the ROC curves for group classification by the ASL-DT of the 194 individuals in the validation sample. Table 3 displays the percentage of accuracy of classification, as indicated by the area under the ROC curves, sensitivity (proportion of true positives), and specificity (proportion of true negatives) for the high, intermediate, and low groups. The Pre-assigned by Predicted classification matrix on the right displays the numbers of correct assignments and misassignments by the discriminant function. The majority of high proficiency participants (15 of 26) were classified as high proficiency, 10 were misclassified as intermediate proficiency, and only one was misclassified as low proficiency. The majority of low proficiency participants (105 of 138) were classified as low proficiency, 31 were misclassified as intermediate proficiency, and only two were misclassified as high proficiency. The majority of intermediate participants (17 of 30) were classified as intermediate proficiency, five were misclassified as low proficiency, and eight as high proficiency. These results indicate that the ASL-DT placed a majority of participants at each proficiency level within their pre-assigned proficiency group and placed the large majority of all participants within +/−1 proficiency group step from their pre-assigned group.
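The classification counts quoted above can be checked with a short calculation. The sketch below rebuilds the Pre-assigned by Predicted matrix from those counts and computes the proportion of each group placed in its pre-assigned group, the overall proportion of exact placements, and the proportion placed within one proficiency step. The variable and function names are illustrative, and these proportions are derived only from the counts in the text; they are not the ROC statistics reported in Table 3.

```python
GROUPS = ["low", "intermediate", "high"]

# Rows: pre-assigned group; columns: predicted group (same order as GROUPS).
# Counts are those quoted in the text for the validation sample (N = 194).
COUNTS = [
    [105, 31, 2],  # pre-assigned low (n = 138)
    [5, 17, 8],    # pre-assigned intermediate (n = 30)
    [1, 10, 15],   # pre-assigned high (n = 26)
]


def hit_rate(counts, i):
    """Proportion of group i that was placed in its pre-assigned group."""
    return counts[i][i] / sum(counts[i])


def share_within(counts, max_steps):
    """Proportion of all cases predicted within max_steps of their group."""
    total = sum(sum(row) for row in counts)
    close = sum(
        n
        for i, row in enumerate(counts)
        for j, n in enumerate(row)
        if abs(i - j) <= max_steps
    )
    return close / total


hit_rates = {group: hit_rate(COUNTS, i) for i, group in enumerate(GROUPS)}
exact_share = share_within(COUNTS, 0)
within_one_share = share_within(COUNTS, 1)

for group, rate in hit_rates.items():
    print(f"{group}: {rate:.1%} placed in pre-assigned group")
print(f"exact placement: {exact_share:.1%}")
print(f"within one step: {within_one_share:.1%}")
```

By these counts, 137 of the 194 participants were placed exactly in their pre-assigned group and 191 of 194 were placed within one step of it.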

Figure 2. ROC curves for the ASL-DT Rasch measure (N = 194).
Table 3. ROC statistics for the ASL proficiency groups. AUC = area under the curve.
Comparative validity of the ASL-DT Rasch measure and SLPI ratings
We compared the ability of the ASL-DT Rasch measure and the SLPI ratings to discriminate ASL proficiency group differences in a pair of ANCOVAs on the comparative validity subsample of 57 participants, using proficiency group as a between-subjects factor and gender, age, and ethnicity group as covariates to control for possible group differences in performance due to these demographic variables. Ethnicity was dichotomized as White non-Hispanic versus Other. One ANCOVA used the ASL-DT Rasch measure as the dependent measure and the other used the SLPI ratings as the dependent measure.
To ensure that these ANCOVAs had equal power at the outset to detect significant group differences, exactly the same set of subjects in the comparative validity sample was used in both ANCOVAs. SLPI ratings were not available for 11 participants. Eliminating these 11 participants from the analyses would have substantially reduced the statistical power to detect significant group differences. We therefore imputed SLPI ratings for these 11 participants. One common method of imputing missing values is to substitute the mean of the remaining scores within each group for the missing values. However, doing so would have placed the SLPI ratings at an artificial disadvantage compared with the ASL-DT Rasch measure, because the SLPI ratings unquestionably vary to some extent with individual differences in ASL proficiency within each broad proficiency group. Mean substitution would therefore selectively restrict the true degrees of freedom of the SLPI ratings, and the imputed group mean ratings would not improve the estimation of population group means and variances. Instead, we took advantage of the fact that the SLPI ratings and the ASL-DT Rasch measure share variance related to ASL proficiency and may also share variance related to other personal traits that could affect both sets of scores. We regressed the SLPI ratings onto the ASL-DT Rasch measure using the remaining 46 participants. We then used the regression equation to estimate the 11 missing SLPI ratings from the corresponding 11 ASL-DT Rasch measures. Because these 11 ASL-DT Rasch measures were free to vary during participant sampling, the 11 estimated SLPI ratings were similarly free to contribute to the estimation of the SLPI rating group means and variances.
This approach is statistically conservative in that the 11 imputed SLPI ratings preserve the statistical relationship between the SLPI ratings and the ASL-DT Rasch measure that exists in the comparative subsample and increase the power of the ANCOVAs to detect proficiency group differences on the SLPI through improved estimation. This method of imputing missing values based on regression parameters provides more powerful and accurate tests of the comparative validity of the SLPI ratings and the ASL-DT Rasch measure than does either eliminating the subjects with missing values altogether or substituting group mean values.
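The regression-based imputation described above can be sketched as follows. This is a minimal illustration only: the data values are synthetic and chosen for readability, the SLPI ratings are assumed to be coded numerically, a simple least-squares line is assumed as the regression model, and the function names fit_line and impute_missing are hypothetical rather than taken from the software used in the study.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y regressed onto x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def impute_missing(predictor, outcome):
    """Fill None entries in outcome with predictions from the complete cases."""
    complete = [(x, y) for x, y in zip(predictor, outcome) if y is not None]
    slope, intercept = fit_line(
        [x for x, _ in complete], [y for _, y in complete]
    )
    return [
        y if y is not None else intercept + slope * x
        for x, y in zip(predictor, outcome)
    ]


# Synthetic example values (NOT study data): ASL-DT measures in logits and
# numerically coded SLPI ratings, two of which are missing.
asl_dt = [-0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
slpi = [1.0, 2.0, None, 4.0, None, 6.0]

filled = impute_missing(asl_dt, slpi)
print(filled)
```

Unlike group-mean substitution, each imputed rating here depends on that participant's own ASL-DT measure, so the imputed values vary from person to person rather than collapsing onto a single group mean.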
The results of these two ANCOVAs are shown in Figure 3. Both the ASL-DT Rasch measure, F(2,51) = 8.0, p = .0010, and the SLPI ratings, F(2,51) = 5.1, p = .0094, showed a significant main effect of proficiency group. However, post hoc Student's t pairwise comparisons revealed that the ASL-DT Rasch measure significantly discriminated every group from every other group (all pairwise ps < .05), whereas the SLPI significantly discriminated the low proficiency group from the high and intermediate proficiency groups (both pairwise ps < .05) but failed to significantly discriminate the high and intermediate proficiency groups from each other.

Figure 3. Results of ANCOVAs conducted on the ASL-DT Rasch measure and SLPI ratings to detect differences among ASL proficiency groups.
As convergent evidence that the significant group effect in each of these ANCOVAs was related to ASL proficiency, we added age-adjusted years signing, a prima facie correlate of ASL proficiency, as a covariate in the ANCOVA models. After age-adjusted years signing was included, the proficiency group main effect became small and non-significant in both models (both ps > .1). Therefore, the performance differences displayed by the three ASL proficiency groups on the ASL-DT Rasch measure and the SLPI ratings appear to be related directly to underlying group differences in ASL proficiency per se and not to other unknown cognitive or perceptual factors that might potentially distinguish these groups.
Finally, we examined the ROC curves for the comparative validity subsample of 57 participants. Since this subsample is less than a third the size of the validation sample, the accuracy, sensitivity, and specificity values obtained from these analyses are less reliable population estimates than those presented above for the entire validation sample. Nevertheless they permit us to compare the relative overall performance of the ASL-DT Rasch measure and SLPI ratings within the limits of reliability for this reduced sample size. Figure 4 displays the ROC curves for group classification by the ASL-DT Rasch measure and the SLPI ratings for this subsample. Table 3, presented earlier, displays the percentage of accuracy of classification, sensitivity, and specificity for each proficiency group for the discriminant function based on the ASL-DT Rasch measure and for the discriminant function based on the SLPI separately. The Pre-assigned by Predicted classification matrix on the right of each analysis displays the numbers of correct assignments and misassignments by each discriminant function.

Figure 4. ROC curves for the ASL-DT Rasch measure and SLPI ratings for the comparative validity sample (N = 57).
For the ASL-DT Rasch measure, the majority of high proficiency participants (nine of 16) were classified into their pre-assigned proficiency group, six were misclassified as intermediate proficiency, and only one was misclassified as low proficiency. The majority of low proficiency participants (17 of 27) were classified into their pre-assigned proficiency group, eight were misclassified as intermediate proficiency, and only two were misclassified as high proficiency. Six of 14 intermediate participants were classified into their pre-assigned proficiency group, three were misclassified as low proficiency, and five as high proficiency. These results indicate that the ASL-DT placed a majority of the high proficiency participants and a majority of the low proficiency participants within their pre-assigned proficiency groups, and placed the vast majority of all participants within +/−1 proficiency group step from their pre-assigned group. This subsample pattern is similar to the pattern shown by the larger validation sample.
For the SLPI ratings, seven of 16 high proficiency participants were classified into their pre-assigned proficiency group, three were misclassified as intermediate proficiency, and six were misclassified as low proficiency. The majority of low proficiency participants (19 of 27) were classified into their pre-assigned proficiency group, none were misclassified as intermediate proficiency, and eight were misclassified as high proficiency. Only one of 14 intermediate participants was classified into the pre-assigned proficiency group, eight were misclassified as low proficiency, and five as high proficiency. These results indicate that the SLPI placed only a minority of the high and intermediate proficiency participants into their pre-assigned proficiency groups, and misclassified high proficiency and low proficiency participants two groups away from their pre-assigned proficiency group substantially more often than one group away.
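The two classification patterns just described can be compared side by side with a small tabulation. The sketch below encodes the counts quoted above for each measure and tallies the total number of misclassified participants and the number of low or high proficiency participants placed two proficiency steps away from their pre-assigned group. The names are illustrative; only the counts come from the text.

```python
# Rows: pre-assigned low, intermediate, high; columns: predicted low,
# intermediate, high. Counts are those quoted in the text (N = 57).
MATRICES = {
    "ASL-DT": [[17, 8, 2], [3, 6, 5], [1, 6, 9]],
    "SLPI": [[19, 0, 8], [8, 1, 5], [6, 3, 7]],
}


def misclassified(counts):
    """Number of cases predicted outside their pre-assigned group."""
    return sum(
        n
        for i, row in enumerate(counts)
        for j, n in enumerate(row)
        if i != j
    )


def two_step_errors(counts):
    """Low cases predicted high plus high cases predicted low."""
    return counts[0][2] + counts[2][0]


summary = {
    name: {
        "n": sum(sum(row) for row in counts),
        "misclassified": misclassified(counts),
        "two_step": two_step_errors(counts),
    }
    for name, counts in MATRICES.items()
}

for name, stats in summary.items():
    print(name, stats)
```

With these counts, the ASL-DT Rasch measure misclassified 25 of 57 participants, three of them by two steps, whereas the SLPI ratings misclassified 30 of 57, 14 of them by two steps.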
To verify the superiority of the ASL-DT Rasch measure to the SLPI ratings for the prediction of proficiency groups, we conducted an ordinal logistic regression, using age, gender, and race/ethnicity as covariates and including both the ASL-DT Rasch measure and the SLPI ratings as independent predictors. The three-level ordinal proficiency group variable was used as the criterion (dependent) measure. Only the ASL-DT Rasch measure emerged as a significant predictor of the likelihood of proficiency group membership, χ²(1) = 8.87, p = .0027. The SLPI did not contribute significant additional variance to the prediction, χ²(1) = 1.94, p = .1637. To demonstrate that the ASL-DT Rasch measure was predicting likelihood of proficiency group membership based specifically on sign language proficiency and not some unknown moderator variable, we repeated the analysis using a stepwise ordinal regression by entering age, gender, race/ethnicity, and age-adjusted years signing first, followed by the ASL-DT Rasch measure and the SLPI in both orders. Once age-adjusted years signing was entered, neither the remaining variance in the ASL-DT Rasch measure nor that in the SLPI ratings contributed significantly to the prediction. These results confirm that the shared variance between two very different measures of sign proficiency, namely age-adjusted years signing and the ASL-DT Rasch measure, was responsible for the ability of the ASL-DT Rasch measure to predict proficiency group membership.
Finally, a total of 36 individuals were misclassified either by the ASL-DT Rasch measure alone, the SLPI alone, or by both measures. The ASL-DT Rasch measure and the SLPI ratings agreed on the misclassification group for nine of those 36 cases. On the remaining 27 cases, the ASL-DT Rasch measure and the SLPI ratings disagreed on the predicted group. For 11 of those 27 cases, the SLPI ratings misclassified the cases, whereas the ASL-DT Rasch measure correctly classified the cases into their pre-assigned proficiency groups. For another seven of those 27 cases, the ASL-DT Rasch measure misclassified those cases and the SLPI ratings correctly classified them into their pre-assigned proficiency groups. Therefore, the ASL-DT Rasch measure more often correctly classified cases that the SLPI misclassified than the reverse.
In general, the agreement between the two tests in simultaneously predicting group membership was not very good, especially in the middle range of proficiency. For the low proficiency group, 55.6% of participants were correctly classified by both tests. For the intermediate proficiency group, 7.1% of participants were correctly classified by both tests. In the high proficiency group, 31.3% of participants were correctly classified by both tests.
Discussion
This study investigated the validity of the ASL-DT as a proxy measure of overall ASL proficiency. The results show that the data fit the Rasch model well. The PSR and ISR statistics showed good to excellent reliability, indicating that ASL-DT items span a wide range of difficulty and are appropriate for testing individuals across a wide range of abilities. Consistent with data on native signers’ phonological difficulty hierarchies for ASL utterances under degraded viewing conditions, the results also show that contrasts within some phonetic categories are more difficult to discern than contrasts within others, thereby providing additional evidence of validity. These results demonstrate that the original findings of Bochner et al. (2011), who used raw ASL-DT scores on a subset of the present data, remain reliable for a larger and more heterogeneous participant sample and a more psychometrically valid ASL-DT measure.
The ASL-DT Rasch measure based on the validation sample of 194 individuals showed moderate to strong accuracy for classifying individuals into their pre-assigned proficiency groups based on independent course placement and sociodemographic histories. For the high and intermediate proficiency groups, sensitivity (the ability to correctly classify individuals as members of these groups) was weak to moderate, whereas specificity (the ability to correctly classify individuals as not belonging to these groups) was excellent. By contrast, for the low proficiency group, sensitivity was moderate to good but specificity was only weak to moderate.
The ASL-DT Rasch measure generally outperformed the SLPI. Whereas the ASL-DT Rasch measure produced more evenly spaced proficiency group mean differences and reliably detected them statistically, the SLPI ratings failed to differ significantly between the intermediate and high proficiency groups. Furthermore, the ROC statistics on the comparative validity subsample suggest that the ASL-DT Rasch measure has substantially higher classification accuracy for individual cases than the SLPI ratings for the low and high proficiency groups, and closely comparable accuracy for the intermediate group. Five of the six sensitivity and specificity statistics across the three proficiency groups were higher for the ASL-DT than the SLPI. The SLPI misclassified 13 of the 14 (92.9%) intermediate proficiency participants into one of the other two proficiency groups, whereas the ASL-DT Rasch measure misclassified eight of the 14 (57.1%). Importantly, the SLPI ratings misclassified 30 of 57 (52.6%) participants, and of the 17 misclassified participants in the low and high proficiency groups, 14 (82.4%) were misclassified two proficiency steps away from their pre-assigned group. By contrast, the ASL-DT Rasch measure misclassified only 25 of 57 (43.9%) participants, and of the 17 misclassified participants in the low and high proficiency groups, only three (17.6%) were misclassified two proficiency steps away from their pre-assigned group. When the ASL-DT Rasch measure and the SLPI ratings were entered into an ordinal logistic regression, with age, gender, and race/ethnicity controlled, to predict the likelihood of proficiency group membership, the ASL-DT Rasch measure remained a significant predictor while the SLPI ratings contributed no additional significant prediction to the model.
Once age-adjusted years signing was added to the model, neither the ASL-DT Rasch measure nor the SLPI ratings contributed further to the prediction of the likelihood of proficiency group membership. These logistic regression results indicate that the ASL-DT Rasch measure alone was sufficient to account for the likelihood of membership in a proficiency group and that the variance in the ASL-DT Rasch measure that was responsible for the prediction was directly related to age-adjusted years signing, a prima facie proxy for sign proficiency.
Although these analyses appear to support the hypothesis that the ASL-DT Rasch measure outperforms the SLPI ratings for predicting overall sign proficiency, it is important to consider an alternative interpretation. One might argue that the SLPI is a superior measure of sign proficiency but that the sampling method used to pre-assign individuals to proficiency groups misclassified some individuals from the start. If this were the case, then the discriminant analysis of the SLPI ratings would be expected to fail to classify those individuals into their pre-assigned groups since their pre-assignment was based on circumstantial factors that were unprincipled with respect to ASL proficiency in the first place. In this case, it might appear plausible that the SLPI discriminant analysis would re-classify those individuals into correct alternative groups that would artifactually appear in the ROC analysis to be misclassifications. This explanation fails, however, because if the original pre-assigned proficiency groups had poor validity owing to any sizable influence of unprincipled circumstantial factors, and if the ASL-DT Rasch measure were, in fact, an inferior predictor of correct proficiency group membership vis-à-vis the SLPI ratings, then the ASL-DT should perform worse than the SLPI ratings in the discriminant analysis, not better. This is because there is no principled reason for the ASL-DT Rasch measure to be correlated with the circumstantial factors responsible for the true misclassifications in the first place. Hence, the ASL-DT Rasch measure would tend to more broadly distribute the truly misclassified participants over the three proficiency groups rather than concentrate them close to their nominal pre-assigned groups. In fact, it is the SLPI ratings that most broadly distributed the nominally misclassified participants, compared with the ASL-DT Rasch measure, often two groups away from their pre-assigned value. 
Furthermore, it is clear from the ordinal regression that the predictive power of the ASL-DT is directly related to sign language proficiency as assessed by age-adjusted years signing. Finally, the pattern of misclassifications was more extreme for the SLPI ratings than for the ASL-DT Rasch measure, involving particularly poor classification of intermediate participants to their pre-assigned group and many misclassifications of high and low proficiency participants two proficiency steps away from their pre-assigned group.
Collectively these considerations tend to rule out the alternative explanation that the SLPI ratings only appear to misclassify cases owing to invalid pre-assignment of some cases to their nominal proficiency group. We therefore interpret the collective results of this study to support the proposition that the ASL-DT Rasch measure provides an objective proxy measure of overall ASL proficiency with good to excellent reliability and relatively good validity compared with the SLPI ratings.
Although invalid pre-assigned group classifications cannot account for the inferior performance of the SLPI ratings compared with the ASL-DT Rasch measure, such participant selection misassignments could certainly have contributed to weakening the ROC parameter estimates in each of those analyses. We are aware that course placement is not always well matched to the true sign skills of individuals for various circumstantial, personal, and professional reasons, and it was our informal impression that some participants, especially in the low and intermediate groups, overlapped in their apparent functional sign skills. In addition, being a child of one or two deaf parents does not guarantee that a person’s exposure to ASL was native-like, since parents may occasionally be native users of sign languages other than ASL (with different phonological boundaries) or have less than native proficiency in ASL themselves. Unfortunately, we were not able to collect more detailed histories of ASL exposure and proficiency in the current study. Therefore, we suspect that a few of the misclassifications by the ASL-DT Rasch measure and the SLPI ratings in the current study were in fact correct classifications, and that the true accuracy, sensitivity, and specificity parameters for these measures may be somewhat higher than the current discriminant analyses indicate. Estimating the true accuracy, sensitivity, and specificity parameters of the ASL-DT Rasch measure is an issue for future research that employs more stringent language background criteria for pre-assigned group classification and finer group proficiency steps.
It is important to point out that our reliability and validity conclusions do not impugn the value of the SLPI as one measure of ASL proficiency or as an assessment tool for analyzing the nature of sign competencies and errors produced by individuals within a comprehensive communication development program. The SLPI has been developed carefully (Caccamise & Newell, 1995; Newell et al., 1983) to provide a rich description of sign language behaviors directly correlated with overall language proficiency and is therefore an important heuristic tool for individuals to develop personalized communication development plans and to monitor and demonstrate progress toward their learning objectives/goals. The ASL-DT Rasch measure does not provide this sort of feedback directly. However, the ASL-DT Rasch measure provides a convergent objective assessment tool that can help to overcome some of the subjective factors that can compromise the validity of SLPI ratings for some individuals under some testing conditions. Finally, expressive tasks, such as the SLPI, provide opportunities for the test takers to control their responses and avoid problematic or difficult constructions, at least to some extent. This feature of expressive tasks introduces another potential source of measurement error in test scores. In contrast, receptive tasks, such as the ASL-DT, force test takers to respond to items without permitting them to avoid challenging constructions.
Although the ASL-DT is a measure of learners’ ability to discriminate phonological and morphophonological contrasts, evidence has been presented that the test can serve as a proxy for measuring ASL proficiency. The nature of the discrimination tasks and how they relate to sign recognition (lexical identification) and comprehension, combined with data on intercorrelations among subskills of language proficiency (i.e., speaking, listening, reading, and writing) and the factor structure of a prominent language proficiency test (i.e., the internet-based TOEFL), support the contention that the ASL-DT can serve as a proxy for ASL proficiency. The results also demonstrate that the ASL-DT may be better able to discriminate among groups of participants than the SLPI. Finally, the results of Rasch analysis show that the ASL-DT has good internal consistency reliability for the assessment of individuals as indicated by a PSR statistic of 0.83. The PSR obtained in this study was good even though the test contained only 46 items. The PSR is likely to increase as additional items are added to the ASL-DT.
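The expectation that reliability will rise with test length can be illustrated with the classical Spearman-Brown prophecy formula. The sketch below is an approximation under stated assumptions, not a result of this study: it treats the PSR like a classical reliability coefficient and assumes that any added items would be of comparable quality to the current 46. The candidate test lengths are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


CURRENT_ITEMS = 46   # item count reported in the text
CURRENT_PSR = 0.83   # person separation reliability reported in the text

# Hypothetical test lengths chosen only for illustration.
projected = {
    n_items: spearman_brown(CURRENT_PSR, n_items / CURRENT_ITEMS)
    for n_items in (46, 69, 92)
}

for n_items, psr in projected.items():
    print(f"{n_items} items: projected PSR = {psr:.3f}")
```

Under these assumptions, doubling the item pool to 92 comparable items would raise the projected PSR from 0.83 to roughly 0.91.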
The results of this study, in particular the fit of test data to the Rasch model, support the development of a computer-based adaptive version of the ASL-DT. Such a test could play an important role in improving the quality of assessment in ASL pedagogy. Plans are currently being made to develop an online adaptive version of the ASL-DT. In particular, our test development plans include substantial enlargement of the current ASL-DT item pool and revision of a software application we have developed previously for delivery of the NTID Speech Recognition Test (https://apps.ntid.rit.edu/NSRT/) to accommodate the ASL-DT. The development of such a test would represent a significant advance in the assessment of natural sign languages such as ASL.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
