Abstract
Introduction:
The Gap–Kalamazoo Communication Skills Assessment Form (GKCSAF) is widely used in medical education, yet its reliability in real occupational therapy clinical settings remains unexplored. This study aimed to assess the intra-rater and inter-rater reliability, as well as random measurement error, of the GKCSAF in occupational therapy.
Method:
Five independent raters evaluated audio-recordings and transcripts of conversations involving 30 patients treated by 22 assessors (7 therapists and 15 students). Both direct and coded ratings were used.
Results:
For direct ratings, intra-rater reliability for the total score was moderate (intraclass correlation coefficient (ICC) = 0.76), but inter-rater reliability was poor (ICC = 0.31). The minimal detectable change percentage (MDC%) was acceptable for the same rater (17.8%) but not for different raters (38.3%). Weighted kappa values indicated poor to fair reliability (−0.01 to 0.34) for each domain score. Coded ratings showed moderate intra-rater reliability (ICC = 0.69) and poor inter-rater reliability (ICC = 0.22). MDC% was acceptable for the same rater (24.8%) but not for different raters (65.5%). Weighted kappa values indicated poor to fair reliability (−0.02 to 0.33) for each domain score.
Conclusion:
GKCSAF displays acceptable intra-rater but poor inter-rater reliability in occupational therapy clinical scenarios. Multiple raters are advised for enhanced reliability, while coding might not significantly enhance it. It is advisable to use the GKCSAF cautiously in occupational therapy education, ensuring adequate training, and possibly incorporating multiple raters for assessment consistency.
Introduction
Effective communication skills play a critical role in the practice of therapists and students, as they contribute to establishing a positive therapist–patient relationship and enhancing patient adherence, satisfaction, and comfort (Rezaie and Kendi, 2020; Sarsak, 2022). Communication skills are particularly important in occupational therapy because of their impact on professional readiness and patient care effectiveness (Brown et al., 2020; Yu et al., 2018, 2019). Evaluating the communication skills of therapists and students is essential for educators to assess their performance and identify areas for improvement.
Several measures have been developed to assess communication skills in clinician–patient conversations. For instance, the Calgary-Cambridge Observation Guide (CCOG) is a comprehensive tool that outlines 71 communication skills (Kurtz et al., 2017). However, the CCOG, even in its simplified 28-item version, requires a significant amount of time to complete, taking at least 25 minutes (Wijayanti et al., 2021).
Another widely used tool, the SEGUE Framework, assesses 25 behavior-based skills with a binary “yes/no” response scale to indicate the completion of a communication task (Makoul, 2001b). Although originally designed for use in routine patient encounters, the SEGUE Framework has received criticism for its limited applicability to complex patient cases, particularly those involving multiple chronic conditions or comorbidities (Street et al., 2005). The Four Habits Coding Scheme (4HCS) assesses four communication habits but requires rigorous rater training of up to 18 hours to achieve an inter-rater reliability of 0.72 (Krupat et al., 2006). The extensive training required limits the clinical applicability of the 4HCS.
In contrast, the Gap–Kalamazoo Communication Skills Assessment Form (GKCSAF), which encompasses nine domains and 26 sub-skills related to communication, each domain rated on a 5-point Likert-type scale (Makoul, 2001a; Peterson et al., 2014), has emerged as a widely utilized tool in medical education, with several advantages. Researchers established the GKCSAF’s face validity through the Delphi method, and its construct validity shows strong correlations with interpersonal and conversational competence, highlighting its relevance in evaluating communication skills (Brown et al., 2017; Schick et al., 2019). Healthcare professionals in medicine, nursing, and allied fields widely use the GKCSAF due to its advantages (Brown et al., 2016; Gupta et al., 2020; Lippe et al., 2020). First, it covers communication skills comprehensively through nine skill domains: “Build a relationship,” “Open the discussion,” “Gather information,” “Understand the patient’s and family’s perspective,” “Share information,” “Reach agreement,” “Provide closure,” “Demonstrate empathy,” and “Communicate accurate information.” Second, the GKCSAF can be completed within a practical timeframe of 7 minutes (Calhoun et al., 2009), making it suitable for fast-paced medical practice. Therefore, the GKCSAF holds promise as a potential measure to evaluate the communication skills of occupational therapists.
Reliability is a crucial psychometric property of any measurement tool, indicating the extent to which its results can be replicated. Reliability includes factors such as intra-rater reliability, inter-rater reliability, and random measurement error (Puckett et al., 2015). However, no studies to date have examined the intra-rater reliability and random measurement error of the GKCSAF in real clinical settings. Furthermore, although the inter-rater reliability of the GKCSAF has been reported in a “simulated environment” (Peterson et al., 2014), no studies have investigated its inter-rater reliability in the context of clinical occupational therapy practice. Clinical practice settings significantly differ from simulated environments, characterized by notable psychological stress and time pressure on clinical staff. Furthermore, the joint partnership and shared collaboration between occupational therapists and patients set them apart from other medical professions. Therefore, it is imperative to assess the inter-rater reliability of the GKCSAF, specifically in actual clinical occupational therapy practice. Thus, the primary objective of this study was to examine the intra-rater reliability, random measurement error, and inter-rater reliability of the GKCSAF in real occupational therapy practice.
Coding is a commonly employed research method in qualitative studies (Bylund and Makoul, 2005). It aids raters in identifying essential content, reduces scoring time, prevents the omission of crucial information, simplifies content complexity, and improves scoring accuracy (Linneberg and Korsgaard, 2019). However, limited research has focused on examining the reliability of coded ratings. Hence, the secondary objective of this study was to compare the reliability of direct with coded ratings. We hypothesized that the reliability indices of coded ratings would be higher than those of direct ratings. In addition, we examined intra-coder agreement and inter-coder agreement.
Methods
Source and content of conversations
To ensure a statistical power greater than 0.80 at a significance level of 0.05, a sample size of 30 conversations was determined through a priori statistical power analysis (Arifin, 2017). However, for our study, we used a total of 67 conversations, including both audio-recordings and transcripts. These conversations were collected as secondary data from two research projects conducted in the occupational therapy room at a medical center. The conversations encompassed various aspects, such as patient or family complaints, objective assessments by therapists, discussions of intervention goals/plans, and patient education. The conversations were audio-recorded and subsequently transcribed into written form by a professional transcription company. For our analysis, we focused on the main participants’ dialog, specifically involving occupational therapists or students and patients or primary caregivers. Speech from other participants, such as fieldwork supervisors, other medical personnel, the recorders, or friends of the patients, was not included in our analysis. The conversations were collected from March to December 2021.
Participants
Both research projects maintained consistent inclusion criteria for patients and evaluatees. To be eligible, patients had to be over 20 years old, without severe hearing loss, and capable of clear communication. In cases where the patient had aphasia, the caregiver who could communicate effectively participated on their behalf. Students were over 20 years of age and enrolled in “Occupational Therapy Fieldwork Level II.” Therapists were over 20 years of age and registered with the national regulatory board, with at least 3 years of professional experience. Ethical approval was obtained in 2021 from the National Taiwan University Hospital Ethics Committees (trial registration number: 202012266RINA), and all participants provided signed written informed consent.
Screening of conversations
To select conversations for analysis, 67 conversations (transcripts) were evaluated based on the following criteria. First, the conversations had to be from the first two sessions of new patients, as these initial sessions typically involved a substantial exchange of information between therapists and patients. Each session lasted approximately 30 minutes. Additionally, to ensure an adequate amount of communication, the total number of sentences exchanged between the patient and occupational therapists or students needed to exceed 58. The screening process encompassed both audio-recordings and transcripts.
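The two screening criteria above can be expressed as a simple filter. The following is a minimal sketch only: the `Conversation` record and its field names are hypothetical, and the way sentences were counted in the study is not specified beyond the threshold of 58.

```python
from dataclasses import dataclass

MAX_SESSION = 2      # only the first two sessions of a new patient qualify
MIN_SENTENCES = 58   # the sentence count must exceed this value


@dataclass
class Conversation:
    """Hypothetical record for one transcribed conversation."""
    conversation_id: str
    session_number: int   # 1 = first session of a new patient
    sentence_count: int   # sentences exchanged by patient and therapist/student


def is_eligible(conv: Conversation) -> bool:
    """Apply both screening criteria described in the text."""
    return (conv.session_number <= MAX_SESSION
            and conv.sentence_count > MIN_SENTENCES)


def screen(conversations):
    """Return only the conversations that pass screening."""
    return [c for c in conversations if is_eligible(c)]


if __name__ == "__main__":
    sample = [
        Conversation("c01", 1, 120),  # eligible
        Conversation("c02", 3, 200),  # excluded: not one of the first two sessions
        Conversation("c03", 2, 58),   # excluded: does not exceed 58 sentences
    ]
    print([c.conversation_id for c in screen(sample)])
```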
Measure
The GKCSAF comprises 9 domains of communication skills, encompassing a total of 26 specific sub-skills, with each domain containing two to five sub-skills for evaluation. For example, the “Build a relationship” domain contains three sub-skills: “Greets and shows interest in patient as a person,” “Use words that show care and concern throughout the interview,” and “Uses tone, pace, eye contact, and posture that show care and concern.” As described in the original article (Calhoun et al., 2009), the domain score was determined by considering the overall performance of all sub-skills within the domain. Individual scores were not assigned to the sub-skills, as they were used solely as descriptive guidelines. Each domain was assessed using a 5-point Likert scale, where a score of 1 indicated poor performance, and a score of 5 indicated excellent performance. Scores of 3 or higher were considered sufficient, while scores below 3 indicated a need for improvement. To provide clearer scoring criteria, an additional rating criterion was introduced: a score of 3 or higher was given if any sub-skill was demonstrated. The distinction between scores of 4 and 5 depended on the frequency and extent to which the sub-skills were demonstrated. The total scores across all nine domains ranged from 9 to 45 points.
Types of assessment
Two types of assessment methods were employed: direct ratings and coded ratings. In the direct ratings approach, raters scored the conversations directly after reading the original transcripts while simultaneously listening to the corresponding audio-recordings.
Coded ratings, on the other hand, involved the use of ATLAS.ti 9 software (Thomas Muhr at Technical University, Berlin). The transcripts were coded based on the domains of the GKCSAF using the software and subsequently scored according to the coded contents. During the coding process, raters also listened to the audio-recordings for reference. To provide an example of the coding process, let us consider a dialog where the therapist asked, “Is it fine for you that we will focus on balance training from now on?” and the patient responded, “Yes, I agree to focus on balance training first.” This dialog would be coded under the “Reach agreement” domain. Dialogs coded under the same domain were grouped together for the raters to assign scores.
Types of ratings
Five independent raters assessed the conversations using direct or coded ratings. Three raters (A, B, and C) used direct ratings, whereas two used coded ratings (D and E). The raters were randomly assigned using a computer-generated spreadsheet. All raters were research assistants with clinical experience in occupational therapy. The assessments were conducted individually.
Rater and coder training
Prior to the study, raters A, B, and C from the direct rating group were trained on using the GKCSAF. The training consisted of debriefing sessions and scoring practice. In the first round of practice, the raters rated three conversations and then held a consensus meeting. If the scores for a domain differed by more than one point, they discussed the reasons and adjusted their rating principles until reaching a consensus. This process was repeated for three rounds, with three conversations rated in each round, making a total of nine conversations assessed.
The training of coders comprised three phases. First, the coders (D & E) received a debriefing on the GKCSAF and the associated scoring criteria. Next, they practiced using the ATLAS.ti 9 coding software and familiarized themselves with the preset coding labels. Finally, they engaged in coding practice, where the coding labels corresponded to the sub-skills of all domains in the GKCSAF. For example, if the therapist’s dialog was, “We’ll take a moment to get to know you now. Then, we can plan your future treatment together,” it would be coded as “Third sub-skill of domain B: Explains and/or negotiates an agenda for the visit.” To support their learning, a codebook with examples was provided to all coders.
In the first round of coding practice, coders D and E independently coded three conversations and cross-checked their coding line by line. In case of any discrepancies in the coding for the same sentence, they refined the codebook until reaching a consensus. Following a process similar to the direct rating group, they completed three rounds of coding practice and consensus meetings. Nine conversations were used for practice, and 30 conversations were employed for formal assessments.
Formal assessments
In the direct ratings approach, all three raters (A, B, and C) independently read the transcripts and listened to the corresponding audio-recordings to provide their assessments. The order of the transcript ratings was randomized to minimize potential bias, such as rating earlier transcripts lower than later ones. Only rater A conducted a second assessment, which occurred after an 8-week interval. Regarding the coded ratings, raters D and E independently read the coded transcripts and provided their initial assessments. Rater D completed a second assessment, also after an 8-week interval. The study was double-blind with both raters and participants to mitigate potential bias.
Statistical analysis
The inter-coder agreement was assessed using the Holsti index in ATLAS.ti 9 (González-Prieto et al., 2023). The Holsti index quantifies the level of agreement among coders by calculating the proportion of words or segments that were consistently coded by all coders. The formula for calculating the Holsti index is as follows: Hs = (total number of coders × total number of words consistently coded by all coders/total number of words coded by all coders) × 100%. A higher Holsti index value indicates a higher level of inter-coder agreement. A minimum level of 75% agreement among the coders was considered necessary (Nandy and Sarvela, 1997).
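The formula above can be illustrated with a short sketch. It assumes that "total number of words coded by all coders" means the sum of each coder's coded word counts (the usual reading of the Holsti formula), and that each coder's output can be represented as a mapping from word position to code label. This representation is an assumption made for illustration and is not how ATLAS.ti 9 stores or computes the index.

```python
def holsti_index(codings):
    """Holsti index (%) for two or more coders.

    codings: list with one dict per coder, mapping a word position (int)
             to the code label assigned to that word. Uncoded words are
             simply absent from the dict. (Hypothetical representation.)
    """
    n_coders = len(codings)
    if n_coders < 2:
        raise ValueError("at least two coders are required")

    # Sum of the number of words each coder coded.
    total_coded = sum(len(c) for c in codings)
    if total_coded == 0:
        raise ValueError("no coded words")

    # Words coded by every coder ...
    shared_positions = set(codings[0]).intersection(*codings[1:])
    # ... and given the same label by every coder.
    agreed = sum(
        1 for pos in shared_positions
        if len({c[pos] for c in codings}) == 1
    )
    return 100.0 * n_coders * agreed / total_coded


if __name__ == "__main__":
    # Illustrative data only, not taken from the study.
    coder_d = {i: "Reach agreement" for i in range(10)}
    coder_e = {i: "Reach agreement" for i in range(5)}
    coder_e.update({i: "Gather information" for i in range(5, 8)})
    print(f"Holsti index: {holsti_index([coder_d, coder_e]):.1f}%")
```

With this reading, the index reaches 100% only when every coder codes exactly the same words with exactly the same labels.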
The inter-rater and intra-rater reliability of total scores were assessed using the intraclass correlation coefficient (ICC) with a two-way random-effects model and absolute agreement. ICC values between 0.90 and 0.99 were considered excellent, values between 0.80 and 0.89 were high, values between 0.60 and 0.79 were moderate, and values below 0.59 indicated poor reliability. The standard error of measurement (SEM) and minimal detectable change (MDC) were calculated to estimate the amount of random measurement error, with the MDC calculated based on a 95% confidence level. Additionally, the MDC% (MDC divided by the mean of all test scores of the sample multiplied by 100%) was calculated. An MDC% smaller than 30% was considered an acceptable level of random measurement error (Smidt et al., 2002).
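The quantities in this paragraph can be sketched as follows. The text does not state whether single or average measures were used or how the SEM was derived, so this example assumes the single-measure form ICC(2,1) of Shrout and Fleiss and the commonly used formulas SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM. All function names are illustrative, and the demonstration matrix is the textbook example from Shrout and Fleiss (1979), not study data.

```python
import math


def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: list of rows, one row per conversation, one column per rater
            (or per assessment occasion for intra-rater reliability).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def sem(sd, icc):
    """Standard error of measurement (assumed formula: SD * sqrt(1 - ICC))."""
    return sd * math.sqrt(1.0 - icc)


def mdc95(sem_value):
    """Minimal detectable change at the 95% confidence level."""
    return 1.96 * math.sqrt(2.0) * sem_value


def mdc_percent(mdc, mean_score):
    """MDC as a percentage of the sample mean score."""
    return 100.0 * mdc / mean_score


if __name__ == "__main__":
    # Shrout & Fleiss (1979) example: 6 targets rated by 4 judges.
    demo = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    print(f"ICC(2,1) = {icc_2_1(demo):.2f}")
    # An SEM of 2.1 (the intra-rater value reported for direct ratings)
    # corresponds to an MDC95 of about 5.8 points under these formulas.
    print(f"MDC95 for SEM 2.1 = {mdc95(2.1):.2f}")
```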
For each domain score, weighted kappa value and percentage of agreement were calculated. Weighted kappa results were interpreted as follows: values ⩽0 indicated no agreement, 0.01–0.20 indicated none to slight, 0.21–0.40 indicated fair, 0.41–0.60 indicated moderate, 0.61–0.80 indicated substantial, and 0.81–1.00 indicated almost perfect agreement (Landis and Koch, 1977).
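For illustration, a weighted kappa for two sets of 1 to 5 domain scores could be computed and interpreted along the following lines. The text does not say which weighting scheme was used, so linear weights are an assumption here, and the function names are hypothetical. The interpretation bands follow the cut-offs listed above.

```python
def weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5)):
    """Cohen's weighted kappa with linear disagreement weights (assumed)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    # Observed joint proportions and marginal proportions for each rater.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed vs. chance-expected disagreement.
    obs_dis = exp_dis = 0.0
    for i in range(k):
        for j in range(k):
            w = abs(i - j) / (k - 1)
            obs_dis += w * observed[i][j]
            exp_dis += w * marg_a[i] * marg_b[j]
    if exp_dis == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - obs_dis / exp_dis


def interpret_kappa(kappa):
    """Map a kappa value to the agreement bands used in the text."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "none to slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Illustrative scores only, not study data.
    first = [3, 4, 3, 5, 2, 4, 3, 3]
    second = [3, 3, 4, 5, 3, 4, 2, 3]
    kappa = weighted_kappa(first, second)
    print(f"weighted kappa = {kappa:.2f} ({interpret_kappa(kappa)})")
```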
Results
Number of conversations after screening
After screening, 28 out of 67 conversations were excluded. Among the excluded conversations, two were not the initial two sessions of treatment, and 26 conversations did not meet the criteria of having more than 58 sentences spoken by either the patient or therapist (or student). Consequently, 39 conversations were deemed eligible for analysis.
Demographic information of the participants
The included conversations involved dialog between 22 therapists/students and 30 patients (Table 1). The mean age of the therapists was 33.8 (±7.9) years; the mean age of the students was 22.5 (±0.5) years; the mean age of the patients was about 60 years (Table 1). On average, the raters had 3.3 (±5.4) years of clinical practice.
Characteristics of the patients, therapists, and interns.
IQR: interquartile range (first quartile to third quartile); SD: standard deviation.
Each therapist interacted with 1–3 patients; therapists interacted with 13 patients in total.
Each student interacted with 1–2 patients; students interacted with 17 patients in total.
Intra-rater reliability
Direct ratings
For the total score, the mean of the first and second assessments is reported in Table 2. The first and second assessments showed a slight score difference of 0.8 (±2.9). The ICC for the intra-rater reliability of the total score in direct ratings was moderate (0.76, 95% CI = 0.56–0.88). The SEM was 2.1. The MDC% was 17.8%.
ICC, SEM, and MDC of intra- and inter-rater reliability for the total score.
ICC: intra-class correlation coefficients; SEM: standard errors of measurement; MDC: minimal detectable change.
For each domain score, the median score and interquartile range are reported in Table 3. The weighted kappa values for seven domains were low, with only two domains demonstrating fair agreement. The percentages of agreement for all domains were moderate to substantial (31% to 70%). The domain of “Share information” had the highest percentage of agreement (72.2%), whereas the domains of “Understand the patient’s and family’s perspective” and “Reach agreement” had the lowest percentages of agreement (31.1% and 44.4%, respectively).
Weighted kappa and percentage of agreement of intra-rater reliability for each domain.
Coded ratings
The overall intra-coder agreement was moderate (61.1%). Intra-coder agreements for all domains varied (18.3%–80.2%, Table 4), with seven domains showing agreements lower than 75%.
Weighted kappa and percentage of agreement of inter-rater reliability for each domain.
The values represented the results for the pairs among raters A, B, and C.
The value represented the result between the raters D and E.
For the total score, the mean and the standard deviation of the first and second assessments are reported in Table 2. The ICC for the intra-rater reliability of the total score in coded ratings was moderate (0.69, 95% CI = 0.44–0.84). The SEM was 2.8, and the MDC% was acceptable at 24.8%.
For each domain score, the median and the interquartile range are reported in Table 3. The weighted kappa values for all domains ranged from low to fair (−0.02 to 0.28). The percentage of agreements for all domains also ranged from low to moderate (3.3%–43.3%). The domain “Build a relationship” had the highest percentage of agreement at 43.3%, whereas the domain “Understand the patient’s and family’s perspective” exhibited the lowest agreement among the nine domains, at 3.3%.
Overall, the results indicated moderate intra-rater reliability in both rating methods and an acceptable MDC% for the total score. However, the reliability and agreement for individual domains were mostly poor, with the domain of “Understand the patient’s and family’s perspective” showing the lowest agreement (3.3%).
Inter-rater reliability
Direct ratings
The rating scores and reliability indices of the three raters are reported in Table 2. The total score displayed poor inter-rater reliability, as indicated by the low ICC (0.31, 95% CI = 0.09–0.54). The SEM was 4.3, and the MDC% was high at 38.3%.
The domain scores and reliability indices are reported in Table 4. Among the nine domains, eight had poor weighted kappa values (−0.09 to 0.39) for paired raters. Seven domains had poor to fair percentages of agreement (16.7%–60.0%). The domain “Understand the patient’s and family’s perspective” had the lowest level of agreement (16.7%), whereas the domains “Share information” and “Gather information” had the highest levels of agreement, both showing moderate levels at 76.7%.
Coded ratings
The overall inter-coder agreement was low (24.8%). The inter-coder agreements for all domains were low (13.1%–43.7%, Table 4).
The rating scores and reliability indices of the two raters (D and E) are reported in Table 2. The ICC for the inter-rater reliability of the total score in coded ratings was poor (0.22, 95% CI = −0.10 to 0.54). The SEM was 6.4, and the MDC% was large at 65.5%.
The domain scores and reliability indices are reported in Table 4. The weighted kappa for each domain was poor (−0.02 to 0.33). The percentage of agreement for each domain was also poor (3.3%–43.3%). The domains “Build a relationship” and “Communicate accurate information” had the highest percentages of agreement (43.3% and 40.0%, respectively), whereas the domain “Understand the patient’s and family’s perspective” had the lowest percentage of agreement (3.3%).
Discussion
To date, no studies have examined the intra-rater reliability of the GKCSAF in real clinical settings. This study aimed to fill this knowledge gap. Our findings revealed moderate intra-rater reliability for the total score, both in direct and coded ratings. This suggests that the GKCSAF is a reasonably reliable tool for assessing the overall communication skills of occupational therapists, but only when the same rater uses it.
However, for each domain, the intra-rater reliability was mostly poor to fair. This indicates that raters had difficulty consistently rating communication skills within each domain over time. One potential reason for this inconsistency is the absence of comprehensive scoring guidelines within each domain of the GKCSAF. Although we did include a scoring criterion (“3 points” or higher was awarded if any sub-skill was demonstrated), it may not have been sufficiently clear. This lack of clarity might have led to raters struggling to differentiate between “4 points” and “5 points,” resulting in inconsistent scores. For instance, consider Domain F, “Reach agreement,” which comprises four sub-skills. It remains unclear how many sub-skills the evaluatees need to perform to earn “5 points” within one domain and whether certain sub-skills carry more weight than others. To address this, future research should consider adopting Brown et al.’s (2016) recommendation to assign individual scores for each sub-skill. Additionally, the individual sub-skill score could be rated based on the frequency of sub-skill behavior occurrence. For example, if an evaluatee demonstrates the sub-skill once, it would receive 3 points; twice, 4 points; and three times or more, 5 points. However, further research is essential to determine the appropriate weightings for each sub-skill.
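The frequency-based rule suggested above could look like the following sketch. It is illustrative only: the function name is hypothetical, and because the text does not say how a sub-skill that is never demonstrated should be scored, that case is deliberately left undefined here.

```python
def frequency_score(occurrences):
    """Score one sub-skill from how often it was demonstrated.

    Mapping taken from the example in the text:
    once -> 3 points, twice -> 4 points, three times or more -> 5 points.
    Zero occurrences returns None because the text gives no rule for it.
    """
    if occurrences < 0:
        raise ValueError("occurrences cannot be negative")
    if occurrences == 0:
        return None  # assumption: left undefined rather than guessed
    return min(occurrences + 2, 5)


if __name__ == "__main__":
    for count in (1, 2, 3, 6):
        print(count, "->", frequency_score(count))
```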
Our study revealed poor inter-rater reliability concerning the total score obtained from direct ratings. These results align with Brown et al.’s (2016) findings (ICC = 0.30) for radiologists. Conversely, Peterson et al. (2014) reported high inter-rater reliability in a simulated environment for pediatric residents and nurses (ICC = 0.84 (faculty raters), 0.88 (peer observer raters of participating residents/nurses)). It is worth noting that both Brown’s study and our own were conducted in authentic clinical environments, where patient–therapist conversations can be frequently interrupted and include unrelated dialogs, resulting in fragmented conversations (Fiscella et al., 2007). This fragmentation can make it challenging for different raters to maintain consistent impressions when quickly skimming through these dialogs, potentially leading to rating discrepancies. On the other hand, simulated environments are designed to be more straightforward and controllable, enabling raters to reach a higher level of agreement (Hodges et al., 1996). Thus, the discrepancy between our results and Peterson’s results may be attributed to the contrasting nature of clinical environments compared to simulated settings.
Poor inter-rater reliability can undermine the credibility of the evaluations, as variations in ratings given by different raters may lead to an inaccurate representation of an individual’s true skills. This inconsistency in scores poses a major challenge in accurately assessing an individual’s abilities and evaluating the effectiveness of communication skills training programs, especially when multiple raters are involved. Additionally, poor inter-rater reliability can increase the time and costs associated with rating communication skills, as additional training and supervision may be needed to improve reliability.
Regarding inter-rater agreement for each domain in the direct ratings, the domain “Understand the patient’s and family’s perspective” had the lowest percentage of agreement. This domain depends on subjective interpretations of the patient’s desires, beliefs, and values, which can be abstract and susceptible to influence by factors such as gender, culture, and age. Consequently, reaching a high level of agreement among raters on this domain proved to be challenging. Furthermore, the phrasing of the first sub-skill in this domain, “Asks about life events, circumstances, ...,” is too broad, potentially leading to overlap with the first sub-skill of domain C, “Using open-ended questions.” The wording could be improved by replacing “life events and circumstances” with more specific content, such as “recent changes in health status, social support, financial stressors.”
In our initial hypothesis, we anticipated that coded ratings would demonstrate higher reliability than direct ratings, based on the assumption that a structured coding process would enhance consistency. However, despite employing rigorous methods to train raters and establish coding consensus, including conducting multiple rater-debrief sessions, using codebooks, and practicing with nine conversations, the intra-rater and inter-rater reliability did not improve after coding. Our initial hypothesis was therefore not supported. The low agreement among coders can be attributed to three factors. First, the volume of content after coding remained quite high, with approximately 10 pages for a single conversation, potentially leading to rater fatigue. To mitigate coder fatigue, we propose standardizing the duration of each coding session to, for instance, 30 minutes. Additionally, we suggest giving coders the flexibility to finish coding an entire paragraph or sentence, even if it extends slightly beyond the set session time. This approach balances structured time management with the practical needs of comprehensive coding (O’Connor and Joffe, 2020). Second, the inconsistency in coding criteria for lengthy conversations posed a significant challenge. Some coders assigned a single code label to an entire long conversation, whereas others segmented it into multiple small units. This discrepancy in coding approaches might have led to the variation in the coded data. To improve inter-coder agreement, we propose developing specific guidelines for segmenting long conversations into smaller units. Clear criteria should be established to determine when to start and end a segment, such as topic changes, speaker turns, or time intervals. Third, brief responses were difficult to code consistently. When evaluating empathy, responses consisting of only one or two words should not be coded, as such short responses often lack sufficient context to understand their meaning. Implementing these two new coding rules would give a more standardized approach across all coders, potentially leading to improved inter-coder reliability.
Our study found that the random measurement error, as indicated by the SEM and MDC%, was within acceptable limits for intra-rater reliability. However, it was unacceptably large for inter-rater reliability. This substantial random measurement error in inter-rater reliability indicates that caution should be exercised when using the GKCSAF rating system, particularly in high-stake examinations. To enhance reliability, implementing multiple-rater evaluations could be beneficial, as suggested by Koo and Li (2016), as long as financial constraints do not hinder its implementation.
The study has at least three limitations. First, the raters evaluated the GKCSAF based on transcripts and audio recordings of conversations, rather than on video recordings. This method inherently omitted contextual elements such as environmental cues, non-verbal expressions, and body language that are integral to communication. Consequently, the assessment might lack a certain degree of authenticity and comprehensiveness that video recordings would provide. Therefore, incorporating video recording data in future research could enhance the comprehensiveness of the evaluation. Second, our study had a small sample size, possibly leading to limited statistical power. Third, the lack of specific guidelines for coders on segmenting long conversations potentially impacted the reliability of the coded ratings.
Future research recommendations
We propose the following directions for future research. First, to enhance the comprehensiveness of evaluations, future studies should incorporate video recording data. This would ensure that non-verbal aspects of conversations are also captured. Second, increasing the sample size to at least 50 participants would enhance statistical power. Third, there is a need to develop and implement clear guidelines for segmenting long conversations. Fourth, studies should include both novice and experienced therapists in the sample to establish the discriminant validity of the GKCSAF. Lastly, future studies should examine other important psychometric properties of the GKCSAF that fall outside the scope of this study, such as internal consistency and responsiveness.
Conclusion
Our study revealed that the GKCSAF demonstrated satisfactory intra-rater reliability but exhibited inadequate inter-rater reliability when applied in real-world occupational therapy scenarios. These outcomes imply that the GKCSAF might have limited practicality for occupational therapists due to its subpar inter-rater reliability. To enhance the accuracy of assessments, it is advisable to involve multiple raters. Additionally, we found that coded ratings did not yield better intra-rater or inter-rater reliability than direct ratings, suggesting that coding may not improve the reliability of the assessments.
Key findings
GKCSAF assesses communication skills and highlights the need for consistent raters to ensure reliability.
Coded ratings did not improve intra-rater or inter-rater reliability compared with direct ratings.
What our study has added
Recognizing inter-rater variability is crucial, and minimizing it may involve improved scoring guidelines or using mean scores from multiple raters.
Footnotes
Research ethics
Ethical approval was obtained in 2021 from the National Taiwan University Hospital Ethics Committees. Trial Registration number: 202012266RINA.
This research complied with principles of the World Medical Association Declaration of Helsinki Ethical Principles for Medical Research involving Human Subjects as amended October 2013 and this research was conducted with institutional or equivalent approvals consistent with the World Health Organization “Standards and operational guidance for ethics review of health-related research with human participants” (2011).
Consent
Ethical approval was obtained from the National Taiwan University Hospital Ethics Committees. Trial Registration number: 202012266RINA. All participants signed an informed consent form.
Patient and public involvement data
During the development, progress, and reporting of the submitted research, patient and public involvement in the research was not included at any stage of the research.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Science and Technology Council, Taiwan, Grant number: MOST110-2511-H002-016-MY2, principal investigator: Ching-Lin Hsieh and the National Science and Technology Council, Taiwan, Grant number: NSTC 112-2410-H-002-178, principal investigator: Sheau-Ling Huang.
Contributorship
CLH conceived the study. SCF and STT conducted the literature research. YCW and STT were responsible for data collection. Data analysis was carried out by YCW and STT. The original draft was written by SCF and SLH, with editing by MLL. Supervision of the analysis was provided by SCF, SLH, and CLH. Funding acquisition was managed by CLH and SLH. All authors have read and agreed to the published version of the manuscript.
