Abstract
Sign languages present particular challenges to language assessors in relation to variation in signs, weakly defined citation forms, and a general lack of standard-setting work even in long-established measures of productive sign proficiency. The present article addresses and explores these issues via a mixed-methods study of a human-rated form-recall sign vocabulary test of 98 signs for beginning adult learners of Swiss German Sign Language (DSGS), using post-test qualitative rater interviews to inform interpretation of the results of quantitative analysis of the test ratings using many-facets Rasch measurement. Significant differences between two expert raters were observed on three signs. The follow-up interview revealed disagreement on the criterion of correctness, despite the raters’ involvement in the development of the base lexicon of signs. The findings highlight the challenges of using human ratings to assess the production not only of sign language vocabulary, but of minority languages generally, and underscore the need for greater effort expended on the standardization of sign language assessment.
Keywords
With progress in the implementation of the Common European Framework of Reference (CEFR; Council of Europe, 2020), the importance of assessing adult learners of Swiss German Sign Language (Deutschschweizerische Gebärdensprache, DSGS) has increased in German Switzerland over the past ten years. However, at the international level, very few publications address language tests that can be applied to adult learners of sign languages of any kind. Some exceptions are the Sign Language Proficiency Interview (SLPI) for American Sign Language (ASL) (Newell et al., 1983), the Sentence Reproduction Test for ASL (Hauser et al., 2008), and the ASL Discrimination Test (Bochner et al., 2016). As yet, no operational test for adult learners of DSGS has been available, a gap that the present study addresses. The present study is part of the larger project “Scalable Multimodal Sign Language Technology for Sign Language Learning and Assessment (SMILE)” funded by the Swiss National Science Foundation (SNSF). The SMILE project aimed to apply automatic sign language recognition technology to a sign language vocabulary test for DSGS. The present study tackles the issues inherent to human-rated assessments of performance, the challenges associated with sign language assessment, and issues peculiar to minority languages for which only small pools of qualified raters are available. Indeed, reliable human measurements of people’s abilities to sign minority languages are greatly needed.
The issues surrounding human ratings of language performance have been and continue to be thoroughly researched. Rater behavior (and misbehavior) has been described from a multitude of angles, including experience level (e.g., Şahan & Razı, 2020), first language (L1) or second language (L2) status of raters (e.g., Gui, 2012), levels of fatigue (e.g., Ling et al., 2014), and linguistic or ethnic bias (e.g., Yan, 2014). Although interrater reliability and rater disagreement resolution in sign language assessment have received some research attention (e.g., Caccamise & Samar, 2009), the sources of disagreement have gone unaddressed, and the overwhelming majority of the work on rater behavior has been in the context of spoken or written language. This has left a large gap in the research literature with regard to sign languages, and especially minority sign languages.
With the present research we seek to address this gap by employing a mixed-methods approach to examine the results of a human-rated L1–L2 DSGS vocabulary translation test (Haug & Ebling, 2019; Haug et al., 2019a) in order to investigate areas and sources of disagreement between deaf expert raters in judgments of correctness of the signs produced by 20 adult learners of DSGS. Many-facets Rasch measurement was used to identify signs with significant disagreement between two deaf expert raters. This analysis served as the basis for a follow-up interview in which the raters discussed the sources of disagreement, with the aims of better understanding rater variation in judging sign languages and of identifying points for improvement in rater training. To the best of our knowledge, this is the first study investigating disagreement among raters of signed production, and it reveals some challenges with regard to rater variability that arise in assessments of sign languages generally, but especially of minority sign languages, which lack the research and human resources of some of the larger sign languages (e.g., ASL).
Literature review
Basic structure of sign languages
Manual and non-manual components of sign languages
An important feature in any sign language is the distinction between manual and non-manual components (e.g., Baker et al., 2016). Manual components are produced with the hands and arms; non-manual components are produced with the face (e.g., with mouth, cheeks, eyes, eyebrows, etc.), the head, and the upper torso (e.g., Sutton-Spence & Woll, 1999). For example, raised eyebrows can be applied to turn a declarative into an interrogative sentence, or eye gaze can be used to refer back to a previously established reference in signing space (Pfau & Quer, 2010).
For the work presented here, which is concerned with vocabulary only, the manual components of signing are the focus of evaluation. The manual components are typically divided into the subcomponents of handshape (the form of the hands, e.g., fist, flat hand, etc.), hand position (the orientation of the hand), location (where the manual activity is performed), and movement (an optional motion of the sign). These four subcomponents are comparable to phonemes in spoken languages in that they are capable of producing distinctions in meaning (Sandler, 2012).
Structure of the sign language lexicon
Johnston and Schembri (2007) suggested a model for the structure of the mental lexicon in sign languages, based on their research on Australian Sign Language (Auslan), which is still useful today. They split the mental lexicon into a native and a non-native sign language lexicon. The native or L1 lexicon consists of two subcategories, the conventional and the productive lexicon. The conventional (or, established) lexicon is made up of signs that show a stable form–meaning relationship; for example, the German Sign Language (Deutsche Gebärdensprache, DGS) sign for AUTO (“car”) can be used in different contexts without any change in meaning (König et al., 2012).
The productive lexicon is very different, and it is difficult to determine the exact number of signs in it. Sign forms that can be labeled as productive are produced and understood in a specific context to convey a specific meaning. The signs themselves are not conventionalized, although their sublexical units are. The sublexical units of productive signs are combined in a context-specific way to convey, for example, the meaning of “a person is approaching me.” To represent the concept of “person,” the signer needs to select a specific handshape (often a single upright index finger) and the location, movement, and orientation of the hand, then transmit the meaning of how and from where the person is approaching and with what kind of path (straight, wavy, etc.). Accordingly, when the sign is produced in a different location with a different direction and manner, the meaning can change from “a person came straight at me” to “meandered slowly away.” Because of the many possibilities for changing the parameters of the form, no citation or entry in the mental lexicon is possible, that is, no base form exists for productive signs. This is why productive forms, while used extensively in actual signing, often do not appear in sign language lexicons.
The non-native or L2 lexicon describes the parts of a sign language where, for example, loan signs from other sign languages are conceptualized, which (through the process of lexicalization) may eventually become part of the native/L1 conventional lexicon.
Only signs of the native/L1 conventional lexicon were included in the vocabulary test for the purpose of the present study. In theory, it would be possible to include signs that involve morphological changes to the lexical base form (Johnston & Schembri, 2007) to arrive at concepts comparable to word families. However, since such groups of signs are less clearly defined for sign languages than for spoken languages, it would be difficult to arrive at a definition of what a correctly produced sign in a DSGS vocabulary test would be. Although sign types are known to have a stable form–meaning relationship (König et al., 2012), the situation is further complicated by the fact that there exists little research on acceptable phonetic variations of signs.
Variants in sign languages
Research on British Sign Language (BSL; Fenlon et al., 2013) and ASL (Bayley et al., 2002) demonstrates that variants can be influenced by various factors, such as lexical frequency, the phonological environment of a sign, or social factors (e.g., gender, age, ethnicity, regional variation). However, a previous study on acceptable variants in DSGS (Ebling et al., 2018) was constrained in that it did not include factors such as those examined in the cited studies on BSL and ASL, as the focus of the DSGS study was on gathering initial evidence in the development of research-based rating criteria for the DSGS test.
The question of what constitutes an acceptable variant (or variants) of a DSGS sign is an important one for the continuing development of the scoring criteria in the present DSGS vocabulary form-recall test, and some work has already been completed. Ebling and colleagues (2018) researched acceptable variants of lexicon-like forms/signs in different users of DSGS (L1 and L2). The term “lexicon-like” refers to an unmodified form of a DSGS sign as it is used in the lexicon for DSGS (Boyes Braem, 2001). In the acceptability study, 11 deaf L1 and 19 hearing L2 adult signers were prompted with a DSGS gloss and an associated German example sentence and asked to produce the appropriate sign. The data were later analyzed by two deaf and one hearing sign language researcher, who placed the productions into six categories of acceptability ranging from “not an acceptable variant of the target in that it not only has an entirely different form but also a different meaning” through “identical to target sign.” See Table 1 for details.
Link between category assignments and test decisions (from Ebling et al., 2018).
These six categories ultimately served as the basis of the scoring instrument of the form-recall vocabulary test in the present study (cf. section Scoring instrument and rater training). Four of the categories (categories 1, 2, 4, and 5) were used as criteria to judge a produced sign as “correct,” while the remaining two (categories 3 and 6) were used to judge it as “incorrect.”
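The decision rule just described can be sketched as a simple lookup. This is only an illustration: the function name `score_from_category` and the constant names are hypothetical and do not come from the study's materials. Only the grouping of categories 1, 2, 4, and 5 as correct and categories 3 and 6 as incorrect is taken from the text.

```python
# Hypothetical helper, not part of the study's materials.
# Category numbers (1-6) follow the acceptability scheme of Ebling et al. (2018)
# as summarized in the text.
CORRECT_CATEGORIES = frozenset({1, 2, 4, 5})
INCORRECT_CATEGORIES = frozenset({3, 6})


def score_from_category(category: int) -> int:
    """Map an acceptability category to the dichotomous test decision (1 or 0)."""
    if category in CORRECT_CATEGORIES:
        return 1
    if category in INCORRECT_CATEGORIES:
        return 0
    # Anything outside 1-6 is not defined by the scheme, so fail loudly.
    raise ValueError(f"Unknown acceptability category: {category!r}")


# Scores for categories 1 through 6, in order.
scores = [score_from_category(c) for c in range(1, 7)]
print(scores)
```

Keeping the two sets explicit, rather than testing only for membership in the "correct" set, means an out-of-range category raises an error instead of being silently scored as incorrect.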
Vocabulary assessment
Because research on sign language vocabulary assessment is scant, sign language researchers must often refer instead to the field of written L2 vocabulary assessment. Fortunately, it offers a vast pool of analogous knowledge from which to draw.
Tests of vocabulary knowledge
Vocabulary assessment can be broadly separated into tests of vocabulary recognition and vocabulary recall (McLean et al., 2020). In recognition tests, test-takers are presented with a target word and are asked to demonstrate understanding of its meaning. In recall tests, however, the test-takers must produce the word from memory based on some stimulus. Depending on whether receptive or productive vocabulary knowledge is being tested, different forms of instruments are typically used. Frequently used forms for vocabulary assessments are, for example, checklists (often referred to as “Yes/No” or YN tests); matching tests, for example, in which a test-taker should fit a target word with other related words or short definitions; and translation tests from the L1 into the L2 or vice versa (Kremmel & Schmitt, 2016). A simple form of a translation test is the form-recall format, wherein an L1 word is given to the test-taker and he or she is asked to produce the L2 translation (McLean et al., 2020). One logistical challenge of this format, however, is the need for at least some human rating, which can result in an inconsistency between raters (Stewart, 2012).
Sign language vocabulary assessment
Most of the vocabulary test formats described above pose special difficulties for the measurement of sign language vocabulary. One issue is that there are no widely accepted, corpus-based frequency lists from which to draw a sample of signs of varying expected difficulty, especially in the case of under-resourced sign languages such as DSGS. Even relatively simple formats such as the YN test can be technologically challenging, requiring the use of a short video for each item, thereby eliminating one of the key benefits of that format: the ability to administer a very large number of items in a short period of time (McLean et al., 2020). Another challenge is that the absolute number of users of any sign language is relatively small, even for major sign languages such as ASL or BSL, and this becomes an even more serious problem for sign languages in smaller countries (Kotowicz et al., 2021; Mann et al., 2016). Furthermore, sign language vocabulary can only be tested via test-taker performances that must be assessed visually, and such testing must, therefore, take into account the various challenges and issues that arise from human-rated assessments of language.
Issues in developing sign language assessments
One aspect that makes the development of and research on sign language tests difficult for many sign languages is that most sign languages are under-researched and under-resourced, and, therefore, have no corpora or reference grammars available. This also holds for DSGS: No balanced and representative DSGS corpus exists, and no reference grammar has thus far been developed. Even specific linguistic studies that address issues for the development of vocabulary tests for DSGS are rare. Examples of sign languages for which corpora of different sizes have been compiled include BSL (Fenlon et al., 2014), Auslan (Johnston & Schembri, 2006), and DGS (Hanke, 2016). New Zealand Sign Language (NZSL) is an example of a well-researched sign language that has a reference grammar upon which test developers can draw (McKee, 2015) when developing tests. For all other sign languages, however, no such reference grammar is available (Palfreyman et al., 2015). An example of a sign language that has been widely researched overall is ASL, for which there are studies available addressing aspects such as basic word order (e.g., Valli & Lucas, 1995). Unfortunately, such resources are simply unavailable for DSGS.
Rater behavior
In addition to challenges associated with the peculiarities of sign languages generally and of minority sign languages such as DSGS in particular, the present study must also take into account issues surrounding the use of human raters. Human ratings of language can be thought of as an interaction of raters, rating criteria, and examinee performances (McNamara et al., 2019). Each of these contributes its own variation to the measurement, but as raters are the ones to ultimately assign scores, much research has been devoted to them. Of particular importance here, the processes by which raters assign scores have also received some attention in the literature.
Rater cognition
The cognitive process of rating can be conceptualized as a procession from interpretation of examinee performances, to evaluation of them via the scoring criteria, and finally to assigning a score (Eckes, 2015). Understanding how raters carry out this complex task is critically important to the validation of human rated assessments (Bejar, 2012; Myford, 2012).
Suto (2012), viewing raters as information processors, observed that rating requires the processing of multiple pieces of information utilizing the rater’s perception, memory, and categorization ability. Different raters may develop different rating strategies, and will often differ in what strategies they use even when giving the same score as another rater. Viewing raters as decision makers, Baker (2012) observed that the decision-making styles most frequently employed by raters were rational and intuitive styles. Rational styles were typified by comparison to the rating rubric, whereas intuitive styles were based on an overall feeling of the examinee’s level. Finally, the model of rater cognition that sees raters as social perceivers (Govaerts et al., 2013) conceptualizes the process of rating as one that is intimately tied to social factors. According to this model, rating is based upon constructs internal to the rater, built up through his/her experience. It is influenced by “understanding of (in)effective performance, personal goals, interactions with the ratee, as well as other factors in the social context of the assessment process” (p. 377).
What seems to unite all three of these models, however, is a recognition of the subjective nature of ratings. This element of subjectivity can be thought of as beneficial expertise, but can also be the cause of unwanted rater variability.
Rater variability
Individual rater variability represents the largest threat to the reliability, validity, and fairness of rater-mediated assessments (Wind & Peterson, 2018), as it can result in examinees of equal ability being given different scores by different raters. Raters may differ in overall leniency/severity, wherein an examinee rated by one rater may be given a very different score than if she had been rated by another. Raters may be inconsistent, especially over time (Ling et al., 2014; Wolfe et al., 2001), meaning that examinees of the same ability may be given different scores even by the same rater. Finally, different raters may have different interpretations of the rating scale, resulting in different distances between scores by raters (McNamara et al., 2019). Many-facets Rasch measurement (Eckes, 2015; McNamara et al., 2019; Wind & Peterson, 2018), which has become the predominant method of quantitatively investigating rater behavior (Wind & Peterson, 2018), can at least detect, if not fully mitigate, many rater variability issues statistically. Qualitative methods, however, are usually necessary to know how and why raters assign the scores they do (Eckes, 2015).
Raters in sign language assessment
Within the context of sign language assessment, little work has been published with regard to rater variability. Wang et al. (2015) carried out a mixed-methods study of ratings of an English–Auslan interpretation corpus. Interrater reliability and severity among the three expert raters were varied, with an overall interrater reliability (Pearson’s r) of only .66. More recently, Han and Xiao (2021) employed comparative judgment in an attempt to mitigate difficulties in rating Mandarin–Chinese Sign Language interpreting, demonstrating a Pearson’s r of .793 between the comparative judgment logit values and those from a many-facet Rasch measurement model of rubric-based scores for the same performances.
Disagreement between raters on sign language performance has been researched to some extent. Both the SLPI and the American Sign Language Proficiency Interview (ASLPI) for ASL have rating programs that employ rater negotiation when ratings diverge by more than one level, and may include ratings by as many as six raters to reach consensus (Caccamise & Samar, 2009; Gallaudet University, 2021). Both are based on the oral proficiency interview (OPI; Liskin-Gasparro, 1982). The SLPI has been adapted to several other sign languages, including DSGS (Haug et al., 2019b). However, no studies of ratings of individual signs by human raters generally, nor points of or reasons for rater disagreement specifically, exist in the published sign language literature.
Research questions
As issues of rater variability and severity have implications for test reliability, and in order to address the gap in the published sign language assessment research regarding the use of human raters generally, and rating of individual signs specifically, the following research questions are posed in the context of an L1–L2 form-recall test of DSGS vocabulary:
RQ1: Where do human raters of L1–L2 single-sign translations disagree with regard to acceptable translations from German to DSGS?
RQ2: What are the reasons for any rater disagreements observed?
Method
Ethics approval was granted by the Centre for Research and Development of the second author’s home institution. This included the approval of the consent form and of an information sheet about the goals of the project, including information on how the data will be collected, stored, and analyzed, and how long the data will be saved on the university’s systems.
For the present study, we employed an explanatory sequential mixed-methods design (Creswell & Plano Clark, 2011). Quantitative methods were used to investigate the psychometric properties of the L1–L2 DSGS Translation Test, while qualitative methods were used to aid in interpretation of the quantitative results.
Instrument design
Sampling of items
Because no large corpus exists for DSGS, it was not possible to create a corpus-based frequency list of DSGS signs, like those that exist for English, to be used as the basis of a vocabulary test. As a result, we made use of an initial list of 110 DSGS vocabulary items developed as part of the SMILE project funded by the Swiss National Science Foundation (Ebling et al., 2018). The items used in the test were selected from existing DSGS teaching materials (e.g., Boyes Braem, 2004) known to correspond to the CEFR level A1. The DSGS teaching materials are used in DSGS courses offered by the Swiss Federation of the Deaf. The lexicon of sign types available in the DSGS teaching materials numbers approximately 3800 (Boyes Braem, 2001). In order to decrease this number to 110, various linguistic criteria were applied; for example, signs for persons, organizations, and places were removed (Ebling et al., 2018).
In this manner, the 3800 lexical sign types from the DSGS teaching materials were decreased to a set of 110 items. As it is as yet undecided within sign linguistics whether the concept of parts of speech can be applied to sign languages (see Erlenkamp, 2001), the items were not balanced with respect to parts of speech, as is often done when sampling words for a vocabulary test of a spoken language.
The initial 110 items were evaluated in a pilot study, wherein recordings of 30 L1 and L2 users of DSGS performing the 110 items were collected. Twelve items were then removed because their glosses proved to be ambiguous; this was due to test-takers producing too many different sign variants for them. This resulted in a final list of 98 signs.
The DSGS Form-Recall Test
For the test, the 98 signs were presented in random order. The test was embedded in a Microsoft PowerPoint presentation on a computer. The instructions in written German were on the first slide. On each subsequent slide were a German word and a German sentence designed to disambiguate the meaning of the word (see Figure 1 for an example). The test-taker sat at a table facing a video camera. A laptop was placed to their side. The test administrator went through the slides; this way the test-taker could look directly at the video camera while signing.

Example of the translation test (Haug et al., 2019a).
Scoring instrument and rater training
The accuracy of the translation of the German word into DSGS was defined as the criterion of correctness. The criterion of correctness was informed by the work of Ebling et al. (2018) on acceptable variants of DSGS signs. A DSGS sign that was produced correctly was awarded a score of 1; an incorrect form (or no sign produced), a score of 0.
The video-recorded data were scored by two raters independently after they had received training on the use of the scoring instrument. Since both raters were involved in the previous study of DSGS variants (Ebling et al., 2018), they already had knowledge of the test and the scoring criteria.
Participants
Test-takers
A total of 20 test-takers participated in the study; 5 were male and 15 were female. They were between the ages of 24 and 55 (M = 39.3) at the time of testing. Nineteen of the test-takers were hearing; one had a cochlear implant and had acquired German as their first language, but learned DSGS as an adult. The L1 of the test-takers was in most cases a spoken language (e.g., a Swiss German dialect or Standard German; n = 18). Two test-takers had grown up with two spoken languages. All test-takers had learned DSGS as adults (range: 18–53 years old, M = 35.4).
Raters
Two deaf expert raters were employed in the present study. Although the number is small, it is important to remember the minority status of DSGS. The number of deaf users of the language is roughly 5500, and there are about 40 DSGS instructors in German Switzerland. As such, the present study employs approximately 5% of the total teaching population as raters.
Rater 1 (R1) was 54 years old and Rater 2 (R2) 45 at the time of the study; both were female. Both had access to DSGS from birth through deaf family members and both use DSGS daily in their private and work contexts. Both are certified sign language teachers and have been teaching different groups of learners, from beginners through to those studying sign language interpreting, for at least two decades. Both raters were also involved in the linguistic research (cf. Variants in sign languages) that informed the development of the rating criteria of the present test. As such, intensive rater training to familiarize them with the instrument was unnecessary as both were intimately involved with the development of the criteria and were deeply familiar with the test in addition to the sign language in question.
Data analysis
Quantitative data analysis
To address RQ1, data were analyzed quantitatively via many-facets Rasch measurement (MFRM) (Linacre, 1994) with the software package Facets (Linacre, 2020). A three-facet model was constructed of test-takers (N = 20), raters (N = 2), and signs (N = 98). Rather than relying on raw disagreement counts, Facets bias/interaction analyses were employed to identify signs upon which the raters’ disagreement resulted in significantly different logit values of difficulty, even when taking the raters’ overall severities into account.
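For orientation, the three-facet model described above can be written in the standard textbook form of the dichotomous many-facets Rasch model. The source does not print the equation, so the symbols below are assumptions chosen for illustration: $\theta_n$ for the ability of test-taker $n$, $\delta_i$ for the difficulty of sign $i$, $\alpha_j$ for the severity of rater $j$, and $\varphi_{ij}$ for a rater-by-sign interaction of the kind a bias/interaction analysis estimates.

```latex
% Three-facet dichotomous many-facets Rasch model (standard form;
% the notation is an assumption, not taken from the study).
% P_{nij1}: probability that rater j scores test-taker n's production of sign i as correct (1)
% P_{nij0}: probability of the same production being scored as incorrect (0)
\[
  \ln\!\left(\frac{P_{nij1}}{P_{nij0}}\right) = \theta_n - \delta_i - \alpha_j
\]
% A bias/interaction analysis can be thought of as adding a rater-by-sign term
% after the main effects have been estimated:
\[
  \ln\!\left(\frac{P_{nij1}}{P_{nij0}}\right) = \theta_n - \delta_i - \alpha_j - \varphi_{ij}
\]
```

Under this reading, a sign is flagged when the interaction terms for the two raters differ significantly, that is, when the sign is effectively harder under one rater than under the other even after each rater's overall severity $\alpha_j$ has been accounted for.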
Qualitative data analysis
RQ2 was addressed qualitatively via a semi-structured interview of the two raters. The goal of the interview was to investigate the raters’ decision processes that led to significantly different ratings of three signs discovered via the MFRM analysis. In preparation for the interview, a PowerPoint presentation was prepared which was divided into two sections. First, a video of a deaf L1 DSGS model producing the sign was displayed as a criterion. This was followed by video clips of each test-taker producing the sign in question. On each slide showing a test-taker’s video, a table was displayed with information on how R1 and R2 had rated that particular sign production.
The two raters watched the production first, then a discussion was initiated by the researcher regarding the different ratings of the test-taker’s production. The raters were asked the following questions:
Why did you rate that way?
Was it difficult to reach a rating decision?
Was the line between “correct” and “incorrect” clear?
Why do you think this sign in particular showed such different results in your ratings?
Can you recall any other signs that were similarly difficult?
The interview was conducted by the second author in DSGS and was video-recorded for later analysis. The interview lasted 30 minutes. For the analysis, the second author translated the interview from DSGS into English. The English translation served as the basis for the coding of the responses to the questions of the interview.
Results
Quantitative results (RQ1)
MFRM measures and model fit
Prior to addressing RQ1, reliability of the MFRM measures was investigated via Rasch statistics. Summary statistics for the MFRM model can be seen in Table 2. Average fit statistics were very near the expected value of 1, with relatively small standard deviations, indicating consistent fit statistics, with the exception of the Sign facet, which had two items with fit statistics above the recommended upper Infit cutoff of 1.5, and of the two, one exceeded the Outfit cutoff of 2.0 (Wright & Linacre, 1994). However, as removing these items had no effect on the test-taker or sign summary statistics, and slightly reduced the reliability of the Rater separation, they were retained in the analysis. The two signs, FREUND/KOLLEGE (“colleague”) and EI (“egg”), were both quite easy, with difficulty measures of −2.23 and −.28 logits, respectively. Interrater reliability as measured via a Rasch κ coefficient was quite high at .55, although the two raters could be separated into three levels of severity. Table 3 presents the severity and fit statistics for the raters.
MFRM summary statistics.
Note: MFRM = many-facets Rasch measurement.
Rater facet Rasch statistics.
Although both raters demonstrated good overall fit to the Rasch model, Rater 2 had some outlying ratings, as indicated by the larger OutfitMS value. Rater 2 was also significantly harsher than Rater 1 [t(3918) = −4.48; p < .001], although the effect size of the difference was extremely small (d = .14).
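The reported degrees of freedom and effect size can be checked with a short calculation. The sketch below rests on two assumptions that the text does not state: that every test-taker by sign combination was rated by both raters (complete data), and that d was obtained with the common conversion d = 2t/√df for two equal-sized groups. The function name `cohens_d_from_t` is hypothetical.

```python
import math


def cohens_d_from_t(t: float, df: int) -> float:
    """Approximate Cohen's d from a t statistic.

    Uses d = 2t / sqrt(df), which assumes two independent groups of equal size.
    Whether the study used this conversion is an assumption.
    """
    return 2.0 * t / math.sqrt(df)


# Design figures reported in the text: 20 test-takers, 98 signs, 2 raters.
n_ratings = 20 * 98 * 2   # 3920 ratings, assuming no missing data
df = n_ratings - 2        # two groups of ratings (one per rater)

d = cohens_d_from_t(-4.48, df)
print(df, round(abs(d), 2))  # prints: 3918 0.14
```

Both values agree with those reported, which suggests that the very large number of ratings, rather than a large severity gap, is what drives the significance of the difference between raters.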
Rater versus Sign bias analysis
RQ1 was addressed by examining the pairwise biases between the Rater and Sign facets. Table 4 presents the results of this analysis and the raw disagreement counts. Significant differences in ratings of three signs were observed. The signs were TELEFONIEREN (“to make a telephone call”), KOPIEREN (“to copy someone”), and UNSICHER (“unsure”). In all cases, Rater 2 was harsher, with contrast sizes ranging from 1.60 to almost four logits. According to the SLA-specific Plonsky and Oswald (2014) thresholds for effect sizes, the Cohen’s d statistics indicate a large effect for TELEFONIEREN, a small effect for KOPIEREN, and a medium effect for UNSICHER. The reasons for these differences are examined qualitatively in the following section.
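Two small helpers can make the magnitudes in this paragraph easier to interpret. The cutoffs of .40, .70, and 1.00 are the between-groups benchmarks proposed by Plonsky and Oswald (2014); whether the study applied exactly these cutoffs is an assumption, and the d values passed in below are hypothetical because the per-sign statistics appear only in Table 4. The function names are likewise illustrative.

```python
import math


def label_effect_size(d: float) -> str:
    """Label |d| using the between-groups benchmarks of Plonsky and Oswald (2014).

    Assumed cutoffs: small = .40, medium = .70, large = 1.00.
    """
    magnitude = abs(d)
    if magnitude >= 1.00:
        return "large"
    if magnitude >= 0.70:
        return "medium"
    if magnitude >= 0.40:
        return "small"
    return "below small"


def odds_ratio_from_logits(contrast: float) -> float:
    """Convert a difference in logits into a ratio of odds."""
    return math.exp(contrast)


# Hypothetical d values, for illustration only.
labels = [label_effect_size(x) for x in (1.20, 0.45, -0.80, 0.14)]
print(labels)

# The smallest contrast reported in the text is 1.60 logits.
smallest_ratio = odds_ratio_from_logits(1.60)
print(round(smallest_ratio, 2))
```

Read this way, even the smallest reported contrast of 1.60 logits corresponds to the odds of a "correct" rating differing by a factor of roughly five between the two raters for that sign.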
Pairwise bias report for Rater and Word.
Qualitative results (RQ2)
The goal of the semi-structured interview was to elucidate the causes of disagreement between the two raters regarding the three signs TELEFONIEREN, KOPIEREN, and UNSICHER. Across all three signs, the cause of disagreement between R1 and R2 was found to be that R2 had evaluated the signed productions more strictly than R1 and that she (R2) had accepted less deviation from the lexicon-like form of the sign.
Qualitative results for the sign TELEFONIEREN
The two raters disagreed in 9 out of 20 ratings of productions of the sign TELEFONIEREN (see Figure 2, top). After viewing and discussing the nine ratings, neither of the raters changed their rating decisions, except in one instance where R1 revised her rating from “correct” to “incorrect” because of an incorrect hand orientation in the sign (see Figure 2, lower left).

Lexicon-like and learner forms of the sign TELEFONIEREN.
While watching the cases of disagreement, R2 provided the following reasoning for her rating decision (quotes were translated by the second author from DSGS into English):
R2: Did I rate it wrong? No, I think the sign TELEFONIEREN needs to have contact on the cheek. Yes, the contact on the cheek is required. In a more casual context you can produce the sign without contact on the cheek. [. . .]. I used it as the baseline for myself how the sign is produced in a lexicon-like form [R2 referring to an online lexicon for DSGS, www.signsuisse.sgb-fss.ch]. That’s why I think it is incorrect. If I would judge the produced sign in a more casual context, it might be considered as OK, but not when I follow the model of the lexicon. Then it is not correct.
R1 responding to R2’s explanation:
R1: OK, I see, you compared the production of the learner to the model of the online lexicon. I did not consider the model as the template for myself. That’s why I rated this production as correct.
R1 interpreted the criterion of correctness more leniently than R2, accepting signs that were produced without contact on the cheek as correct. In all cases, the differing ratings of TELEFONIEREN stemmed from this difference: whether more informal productions were accepted, or whether a lexicon-like production was required for a rating of “correct” (Figure 2, lower right).
After watching another production that showed a similar form of TELEFONIEREN as the first, R2 asked R1:
R2: So what is in general the basis for our decision? The model of the lexicon or the use in a casual context? When you take the model as the basis for your decision, then the production is wrong, when you accept the use in a more casual context, I would judge it as Rater 1 (i.e., as correct).
R1 picked up on the same idea and asked R2:
R1: Imagine we would ask another deaf person, who is not involved in the current research project, if he or she thinks that the signs we are just discussing are right or wrong?
R2: The deaf person would consider the sign as correct.
This uncertainty over the correct form and correct rating decision was also expressed by R2 in the following rating rationale:
R2: I was just trying to stick as close as possible to the model from the lexicon.
Overall, the discussion of the disagreed ratings of TELEFONIEREN revealed possible reasons why R2 interpreted the criterion of correctness more strictly than R1. R2 based her rating decisions on the lexicon-like form of the sign, while R1 did not, accepting more informal or casual productions as correct.
Qualitative results for the sign KOPIEREN
The two raters disagreed in 7 out of 20 ratings for the sign KOPIEREN (see Figure 3, top); in all cases, R2 had judged the sign as incorrect and R1 as correct. In the context of the vocabulary test, the sign does not mean “to copy something” but “to copy someone,” as in the sentence, “My students should copy/repeat the signs I produce.” The two meanings are distinguished by hand orientation: in “to copy a paper” the palm is oriented downwards (Figure 3, lower left), while in “someone copies the sign I produce” the palm is oriented away from the signer (Figure 3, top). In five of the seven occurrences, R1 revised her ratings to “incorrect” after the interview and discussion with R2. However, no reason was given as to why she had initially rated the productions as correct. In all five cases in which R1 changed her ratings, the learners had produced the sign KOPIEREN with the meaning “to copy something” as opposed to “to copy someone.” For the first two occurrences, the causes of the disagreement were the same as for TELEFONIEREN; that is, R2 applied a highly prescriptive interpretation of the criterion of correctness and therefore rated more strictly than R1.

Figure 3. Lexicon-like and learner forms of the sign KOPIEREN.
R1 raised a possible reason why some learners signed “to copy something” rather than “someone”:
R1: The problem might be also that learners did not read the German sentence contextualizing the sign KOPIEREN properly. That might be an additional reason why so many signed “to copy something” instead of “to copy someone.” But I don’t know, just a possible explanation.
R2 raised an issue similar to one that was addressed during the discussion of TELEFONIEREN:
R2: The problem is also that the sign can be modified to express person agreement, realized by a change of the palm orientation. That raises the question which form is accepted for the context of this test.
Another example of this issue was found in one production of a learner who produced the sign KOPIEREN by holding the palm not away from the body (as in the target version) but to the side, thereby producing “Person A is copying Person B” (Figure 3, lower right), leading to similar rater disagreement.
Overall, the bulk of the discussion on the disagreed ratings of KOPIEREN revealed possible explanations similar to those for the sign TELEFONIEREN.
Qualitative results for the sign UNSICHER
The two raters disagreed in 8 out of 20 ratings for the sign UNSICHER; however, neither rater chose to revise her initial rating decision after discussion. In six of the eight productions on which the raters disagreed, the learner produced not the target sign but one of two semantically related signs with more or less the same meaning and a different form. R1 argued that she accepted these two different signs as correct because of the learners’ course level, fearing that the learners might not yet have learned the target sign:
R1: Oh, I see why learners produced different signs which have a similar meaning. They weren’t sure yet which one to use.
After watching and discussing all disagreed ratings, R1 provided the following rationale for her ratings:
R1: They are all signs that are not totally wrong; they have a different phonological form, but bear the same meaning.
Overall, the reasons for the different ratings of the sign UNSICHER differed from those for the first two signs. Learners used semantically related signs with a different phonological form, which R1 accepted as correct but R2 did not.
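The raw disagreement counts reported in the three subsections above can be summarized with a short Python sketch. The counts are taken from the text (9, 7, and 8 disagreements out of 20 productions each, with one, five, and zero ratings revised by R1). The class and field names are illustrative, and the calculation of remaining disagreements assumes that each revision by R1 brought her rating into line with R2's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignDisagreement:
    """Disagreement tally for one sign (names are illustrative)."""
    sign: str
    n_productions: int
    n_disagreements: int
    n_revised_by_r1: int

    @property
    def initial_rate(self) -> float:
        # Share of productions on which the two raters initially disagreed.
        return self.n_disagreements / self.n_productions

    @property
    def remaining(self) -> int:
        # Assumption: every revision by R1 resolved exactly one disagreement.
        return self.n_disagreements - self.n_revised_by_r1


# Counts as reported in the qualitative results.
REPORTED = [
    SignDisagreement("TELEFONIEREN", 20, 9, 1),
    SignDisagreement("KOPIEREN", 20, 7, 5),
    SignDisagreement("UNSICHER", 20, 8, 0),
]

if __name__ == "__main__":
    for row in REPORTED:
        print(f"{row.sign}: {row.initial_rate:.0%} initial disagreement, "
              f"{row.remaining} unresolved after discussion")
```

Under that assumption, the discussion resolved most of the disagreement for KOPIEREN, while the disagreement for TELEFONIEREN and UNSICHER was left largely or entirely intact.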
Discussion
The present study found that the test in question exhibited high reliability (.98) in separating test-takers into seven ability levels. Although human-rated form-recall tests can present challenges in the form of rater inconsistency (Stewart, 2012), the raters in the present study exhibited excellent internal consistency, as evidenced by the fit statistics very near the expected value of 1. There was also a very high level of rater agreement as evidenced by the high interrater reliability statistic (Rasch κ = .55). As such, the test demonstrates excellent psychometric properties, with highly consistent ratings between the two expert raters.
RQ1
RQ1 asked, “Where do human raters of L1–L2 single-sign translations disagree with regard to acceptable translations from German to DSGS?” Although rater disagreement was rare and seemingly insufficient to negatively impact the psychometric properties of the test, three signs, TELEFONIEREN, KOPIEREN, and UNSICHER, exhibited statistically significant disagreement between the two raters. Rater disagreement is hardly surprising and is common enough in other sign language tests such as the SLPI (Caccamise & Samar, 2009) and ASLPI (Gallaudet University, 2021) as to warrant the development of methods to moderate it. In the case of the present research, signs with significant disagreement comprised roughly 3% of the total, which indicates far more agreement than observed by Caccamise and Samar (2009) among first ratings, where 13.4% of ratings differed by more than one level prior to negotiation. However, given the fact that the two raters were both experts in DSGS education with similar backgrounds and nearly identical contexts, that they were directly involved in the early stages of development of the instrument, and that the rating task was the relatively simple judgment of whether a particular sign production was correct or incorrect, these disagreements warranted closer examination to determine the cause(s).
RQ2
RQ2 asked, “What are the reasons for any rater disagreements observed?” A semi-structured post-hoc interview of the raters revealed differing internal criteria of correctness for these three signs. Rater 1 (R1) was more willing to accept productions that would likely appear in colloquial use, but which deviated from the lexicon-like criteria for the signs in question, whereas Rater 2 (R2) was much stricter in this regard. In some cases, R1 was also willing to accept signs which were technically incorrect, but which she believed were close enough given the test-takers’ levels.
The findings regarding the disagreements between the two raters may serve as examples of differing rater cognition styles (Baker, 2012). R1 seems to have employed a more intuitive style, referring to an overall feeling of the examinee’s level, and making “correct/incorrect” decisions based on that. R2, in contrast, seems to have approached the rating task more rationally, referring to the lexicon-like criteria and using them to make sharp delineations between correct and incorrect productions of the signs. In her ratings of productions of UNSICHER, R1 also seemed to be operating as a social perceiver (Govaerts et al., 2013), drawing from her experience as a sign language teacher and changing her rating severity based on her internal understanding of what was an expected level of performance for the signer in question.
In most of the cases of disagreement, however, it is important to note that even the performances that R2 judged as incorrect would probably be accepted in normal conversational use or in various other contexts. In these cases it was abundantly clear what signs were intended, but the question was whether they should be considered “correct” or not. The disagreement in these cases was akin to that between raters of spoken language who focus on comprehensibility even when accuracy is low and those who are more prescriptive. It also has analogs in other form-recall vocabulary tests, wherein test developers or researchers must decide whether misspelled words are to be accepted and, if so, how close an incorrect spelling must be to the standard written form to be acceptable. The practicalities of sign languages, however, add yet another layer of complexity, in that the language is produced temporally in three-dimensional space using both the manual and non-manual channels (e.g., Baker et al., 2016). In this regard, productive tests of sign language vocabulary could be conceptualized as analogous to spoken vocabulary tests, which are extremely uncommon. Just as the production of phonemes in spoken languages is almost infinitely variable, signs can take on different meanings with relatively small changes in the production of phonemes, that is, handshape, position, orientation, location, and facial expression during production (Pfau & Quer, 2010), all of which add to the complexity of determining what should be accepted as a correct sign and what should not. These are perennial dilemmas for test developers, to which there are no easy solutions aside from referring to the test use case and judging, on the basis of the assessment goals, whether accuracy or comprehensibility is more important.
This problem is exacerbated when the language in question is a minority language that nonetheless requires standardized tests of various forms of ability. In such cases, the underlying theoretical work may not yet exist. What makes the rating of a DSGS vocabulary form-recall test difficult is the limitation of what linguists know (or rather don’t know) about the language and, consequently, what resources are available to inform test development. For example, there are no corpus-based frequency lists available to create a DSGS vocabulary test or, generally speaking, research on DSGS that could be used to inform the criterion of correctness. The investigation of what constitutes an acceptable variant of lexical signs in DSGS to inform the criterion of correctness (Ebling et al., 2018) was needed to develop a basic understanding of variants in DSGS, and we have so far only scratched the surface of this field of inquiry. Research on other sign languages, for example BSL (Fenlon et al., 2013), shows that variants depend on various factors, not only the factors that were investigated in the Ebling et al. (2018) study. Regardless of the paucity of information, however, DSGS is the L1 of a minority population, and sign language teachers, raters, and researchers require standardized training, assessment, and/or certification to ensure that the field of sign language assessment moves forward in both research and application.
The disagreements over the signs KOPIEREN and UNSICHER, however, bring to light another important consideration when using human rating: that of rater training. In the case of KOPIEREN, R1 accepted a different, but related sign, that of “to copy something,” as opposed to the target sign meaning “to copy someone.” The post-hoc interview in this case served as a kind of rater training session, allowing R1 to notice her mistake and revise her ratings, and was perhaps akin to the rating negotiation methods used in the SLPI and ASLPI as discussed previously (Caccamise & Samar, 2009; Gallaudet University, 2021). The disagreements surrounding UNSICHER were similar, in that R1 accepted signs other than the target due to her background as a teacher. Although her fit statistics do not indicate a problem with her ratings overall, further training and standardization of the criterion of correctness may have been necessary to clarify her task.
Finally, the findings serve to illustrate the difficulties associated with assessing performances of lesser-taught languages. First of all, investigating questions around rater (dis)agreement is new in the field of inquiry of sign language education. Without a larger and more experienced community of practice and a body of research to which they may refer, individual teachers, assessors, and researchers may develop their own internal criteria of correctness or linguistic development. However, although languages such as DSGS are minority languages, they nevertheless must be subjected to the same CEFR-based standard-setting processes as other European languages—a formidable task even for major languages, but one which is rendered all the more difficult with small populations of language users, teachers, and researchers.
Conclusion
The present research carries some implications for future development and use of the test described in the present study, and also for human-rated tests of productive signed vocabulary in general. The first is the demonstrated importance of careful rater training to ameliorate the problem of rater variability. Although it may seem straightforward to judge whether a produced sign is correct or not, the present research demonstrates that this may not be the case. It is important to ensure that all raters have a clear and shared understanding of which criteria to refer to when making judgments of acceptability.
The present research also serves to demonstrate the importance of using statistical methods such as MFRM to control for differing rater severity. R1 and R2 were significantly different in the severity of their judgments, but because MFRM can take rater severity into account, this difference did not negatively impact the psychometric properties of the test. Although the use of MFRM for human-rated language assessments is hardly new, there still remain a great many human-rated tests which do not use multiple ratings and/or methods such as MFRM to ensure that test-takers are given fair and accurate scores. Human raters are inherently subjective, even when making seemingly simple “correct/incorrect” judgments, and test developers are well-advised to take this into account.
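To make concrete how MFRM separates rater severity from test-taker ability, the following Python sketch implements the standard dichotomous many-facets Rasch model, in which the log-odds of a "correct" rating equal ability minus sign difficulty minus rater severity. It is a minimal illustration, not the analysis run in the present study: the function names and all parameter values are hypothetical, and the 1.6-logit severity gap is chosen only to echo the smallest contrast reported above.

```python
import math


def p_correct(ability: float, sign_difficulty: float, rater_severity: float) -> float:
    """Probability of a 'correct' rating under a dichotomous many-facets
    Rasch model: logit(P) = ability - sign_difficulty - rater_severity."""
    logit = ability - sign_difficulty - rater_severity
    return 1.0 / (1.0 + math.exp(-logit))


def log_odds(p: float) -> float:
    """Inverse of the logistic function: probability back to logits."""
    return math.log(p / (1.0 - p))


if __name__ == "__main__":
    # Hypothetical values in logits: one learner, one sign, two raters
    # whose severities differ by 1.6 logits.
    ability, difficulty = 1.0, 0.0
    lenient, strict = -0.8, 0.8

    p_lenient = p_correct(ability, difficulty, lenient)
    p_strict = p_correct(ability, difficulty, strict)
    print(f"lenient rater: {p_lenient:.2f}, strict rater: {p_strict:.2f}")
    # The gap in log-odds recovers the severity difference between raters.
    print(f"severity gap: {log_odds(p_lenient) - log_odds(p_strict):.2f} logits")
```

Because severity enters the model as its own parameter, the same learner ability is consistent with different raw ratings from a lenient and a strict rater, which is why the ability estimates need not be distorted by the difference between the two.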
Difficulty in selecting signs of appropriate levels for the target examinee population illustrates the need for the development of naturalistic corpus-based frequency lists for all sign languages, as they exist for spoken languages. The process of developing such lists for sign languages is daunting, as they would necessarily have to be created from carefully annotated video corpora. As AI-driven sign language recognition improves, however, this may soon become as easy as it has been for spoken languages.
Finally, the present research highlights the need for more research on and resources devoted to the teaching and assessment of minority languages, and especially sign languages. DSGS may be a minor language in the world at large, but it is the L1 of approximately 5,500 deaf sign language users in German Switzerland, and deserves greater attention.
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially funded by the Swiss National Science Foundation Sinergia project SMILE (Scalable Multimodal Sign Language Technology for Sign Language Learning and Assessment), Grant No. 160811.
