Abstract
BACKGROUND:
A framework for establishing the biopsychosocial profile of persons with low back pain, the Pain and Disability Drivers Management model (PDDM), has recently been proposed and validated. To facilitate its clinical integration, we developed the PDDM rating scale.
OBJECTIVES:
To determine the inter-rater agreement of the PDDM rating scale. A second objective was to determine if this inter-rater agreement varies according to the complexity of patients’ clinical presentation.
METHODS:
We recruited physiotherapists during one-day workshops on the PDDM. We asked each participant to assess two clinical vignettes using the rating scale. One vignette presented a typical clinical presentation (moderate level of difficulty) and one presented an atypical presentation (complex level of difficulty). We determined inter-rater agreement with the proportion of participants who gave the same answer for each PDDM domain.
RESULTS:
For the typical vignette, the inter-rater agreement per domain was moderate to good (between 0.54 and 0.97). For the complex vignette, the inter-rater agreement per domain was poor to good (between 0.49 and 0.81). The comparison between the two vignettes showed a significant difference (
CONCLUSION:
Overall performance indicates that the rating scale presents adequate agreement for clinical use, but specific domains require further development.
Introduction
Low back pain (LBP) is characterized by a very heterogeneous clinical presentation [1]. However, establishing the patient’s profile, guided by a biopsychosocial framework, can help account for this variability, target specific subgroups and improve clinical outcomes [2, 3]. Yet, integrating a true biopsychosocial approach into daily clinical practice is challenging for clinicians [4, 5, 6]. A diagnostic framework, such as the Pain and Disability Drivers Management model (PDDM) [7, 8], can serve to refine diagnosis [9] and to personalize rehabilitation approaches for patients with LBP [10].
To facilitate the integration of this model into clinical practice, we developed the PDDM rating scale [11]. Such a tool has the potential to positively impact the clinical integration of the model [10]. In a previous study, clinicians participating in a one-day workshop on the PDDM validated the content of the rating scale [11]. This rating scale allows clinicians to rapidly determine the severity of the five biopsychosocial domains of the PDDM [8]. At the end of the assessment process, clinicians can establish the patient’s biopsychosocial profile in order to recognize the main drivers of pain and disability for patients who suffer from LBP.
To ensure that the tool is readily applicable in clinical practice, its inter-rater agreement must be investigated. Inter-rater agreement reflects the observer variation when several raters assess the same patient [12]. In clinical practice, clinicians make decisions for individual patients, which calls for an absolute measure of agreement [12]. Such an “absolute” measure indicates whether, for the same patient, the profile established by one clinician agrees with the profile established by another clinician [12]. The probability of agreement allows clinicians to weigh the uncertainty that accompanies the patient’s biopsychosocial profile.
Thus, this study aimed to determine the inter-rater agreement of the newly developed PDDM rating scale through two clinical vignettes with different clinical presentations. A second objective was to compare the inter-rater agreement of the two vignettes in order to determine whether the complexity of the LBP presentation alters the inter-rater agreement of the scale.
Methods
Design and setting
A descriptive correlational design was used to determine the inter-rater agreement of each item of the rating scale [13]. We followed the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [14]. The study received approval from the Ethics Review Board of the Research Center at the Centre Hospitalier Universitaire de Sherbrooke (project #2021-3440).
Participants and sample size
We recruited physiotherapy professionals who participated in a one-day workshop on the integration of the PDDM into clinical practice in 2020. The inclusion criteria were: 1) being a licensed physiotherapist, 2) having participated in the workshop in 2020, and 3) consenting to the use of the data collected during the workshop. We needed at least 50 participants to determine the inter-rater agreement of our 5-item rating scale, because a ratio of 10 participants per item (10:1) is recommended [15, 16].
Workshop
The workshop consisted of a one-day presentation of the PDDM. The teaching material contained a theoretical component and an interactive practical component. The theoretical component focused on 3 objectives: (a) defining the 5 domains and identifying their elements, (b) collecting data for each domain, and (c) establishing the patient’s profile and care plan. The training took place in small groups, and the practical component consisted of analyzing two clinical vignettes with the newly developed PDDM rating scale. Due to the COVID-19 pandemic, we had to change the workshop format from face-to-face to online between the first and second workshops.
Clinical vignette development
Based on the framework of Skilling and Stylianides [17], we developed two vignettes that included three sources of information: 1) information from history taking, 2) information from the objective assessment, and 3) information from patient-reported outcome measures. This resulted in two clinical vignettes with different levels of complexity. The first one, with a “moderate” level, presented a typical clinical picture of a patient with LBP (i.e., mechanical pattern, radicular pain, no comorbidity, maladaptive cognitions and emotions, and no contextual driver). The second one, with a more complex level, presented an atypical clinical picture that required putting different sources of information into perspective to determine the severity of the five domains. The two clinical vignettes were developed by the authors and are available in Appendix 1.
The PDDM rating scale
The development of the PDDM rating scale is described in detail elsewhere [11]. The scale aims to rapidly establish the patient’s profile based on the five domains of the PDDM. Each domain is divided into two categories, reflecting the severity/complexity of problematic aspects. Category A relates to relatively common and modifiable drivers of pain and disability, whereas Category B relates to more complex and/or less modifiable elements.
Consistent with the model, the PDDM rating scale is a 5-item rating scale (1 item per domain), where each domain has four scoring options:
(A) Presence of at least 1 element from Category A, (B) Presence of at least 1 element from Category B, (A
At the end of the screening process, clinicians obtain their patient’s biopsychosocial profile leading to the identification of the patient’s own drivers of pain and disability, and thus refining the initial diagnosis. The rating scale is presented in Appendix 2.
Data collection
Participants’ characteristics
Participants answered a sociodemographic questionnaire which included age, gender, number of years of experience as a physiotherapy professional, frequency of use of standardized questionnaires, and previous training on LBP classification systems.
Inter-rater agreement
At the end of the theoretical course and training, the two clinical vignettes were successively distributed to the participants. The participants used the rating scale to establish the patient’s profile. The sequence of clinical vignette distribution was randomized among the different workshops, which split the sample: half of the sample began with the complex vignette and the other half began with the moderate one. Because of the COVID-19 pandemic, we collected data online after the first workshop. The online format allowed us to assess the time participants required to complete the scale. A time-consuming scale could be a limitation to its integration in a clinical setting.
Statistical analysis
Descriptive analyses (i.e., mean and standard deviation) were used to describe the characteristics of the participants and the time taken to complete the rating scale.
The GRRAS guidelines recommend reporting estimates of both reliability and agreement [14]. Because there was no reference score for each vignette, and because we had many observers (over 70 raters) and only two “patients” (vignettes), we chose to report a specific measure of observed agreement [12, 20]. To determine the inter-rater agreement of the PDDM rating scale, we used the proportion of participants who gave the same answer for each domain [10], with a 95% confidence interval (95% CI) for each proportion estimated according to Wilson’s method [21]. This was done for both clinical vignettes independently. In healthcare research, there is no accepted threshold for the interpretation of inter-rater absolute agreement [22]. Nevertheless, we interpreted the inter-rater agreement as follows: value
For our second objective, to determine whether the inter-rater agreement differed according to vignette complexity, we used Pearson’s Chi-square test [12]. A 95% CI was constructed around the estimated difference and a p-value was reported.
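As an illustration of the comparison described above, the following Python sketch computes a Pearson Chi-square statistic (1 degree of freedom, no continuity correction) for two agreement proportions, together with a confidence interval for their difference. It is not the authors' R code. The function name `compare_agreement` is hypothetical, the Wald interval for the difference is an assumption (the text does not state which interval was used), and the counts in the example are invented for illustration rather than taken from the study.

```python
from math import erfc, sqrt
from statistics import NormalDist


def compare_agreement(x1, n1, x2, n2, confidence=0.95):
    """Pearson chi-square test for two proportions, plus a Wald CI for the difference.

    x1, x2: number of raters who gave the most frequent answer for each vignette.
    n1, n2: number of raters for each vignette.
    Assumes the pooled proportion is strictly between 0 and 1.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Observed and expected counts of the 2x2 table (agree / disagree by vignette).
    observed = [x1, n1 - x1, x2, n2 - x2]
    expected = [n1 * pooled, n1 * (1 - pooled), n2 * pooled, n2 * (1 - pooled)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    # Wald interval for the difference (an assumption, see the lead-in).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return chi2, p_value, diff, (diff - z * se, diff + z * se)


# Hypothetical counts, NOT study data: 70 of 72 raters agree on one vignette,
# 40 of 72 on the other.
chi2, p_value, diff, (low, high) = compare_agreement(70, 72, 40, 72)
print(f"chi2 = {chi2:.2f}, p = {p_value:.2g}, "
      f"difference = {diff:.2f} [95% CI: {low:.2f} to {high:.2f}]")
```

Like any Pearson test on a 2x2 table, this sketch treats the two sets of ratings as independent samples.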
All analyses were performed using R 3.6.1, and OpenEpi was used for the 95% CI by Wilson’s method.
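The agreement measure used in this section, the proportion of raters who chose the most frequent answer for a domain with a Wilson 95% CI, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the R or OpenEpi workflow used in the study. The function names `modal_agreement` and `wilson_interval` are hypothetical, the interval is computed without continuity correction, and the ratings in the example are invented.

```python
from collections import Counter
from statistics import NormalDist


def modal_agreement(ratings):
    """Return (most frequent answer, its count, its proportion) for one domain.

    If several answers are tied, Counter.most_common keeps the one seen first.
    """
    answer, count = Counter(ratings).most_common(1)[0]
    return answer, count, count / len(ratings)


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * (p * (1 - p) / n + z**2 / (4 * n**2)) ** 0.5 / denom
    return centre - half_width, centre + half_width


# Hypothetical ratings for one domain from 72 raters (NOT the study data):
# 70 raters chose option "A" and 2 chose option "B".
ratings = ["A"] * 70 + ["B"] * 2
answer, count, proportion = modal_agreement(ratings)
low, high = wilson_interval(count, len(ratings))
print(f"{answer}: {proportion:.2f} [95% CI: {low:.2f}-{high:.2f}]")
```

With these invented counts the sketch gives 0.97 [95% CI: 0.90-0.99], which is of the same form as the per-domain results reported below. The counts were chosen for illustration and are not taken from the study tables.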
Results
Characteristics of the participants
We conducted four workshops (one in person and three online due to the COVID-19 pandemic), where we recruited 72 participants. Four participants did not respond to the sociodemographic questionnaire. The sample included 55 women (80.9%) and 13 men (19.1%) and the mean age was 40.2 (SD
Characteristics of the participants
Inter-rater agreement and confidence interval for each vignette
The highest inter-rater agreement (i.e. proportion of participants who gave the same answer) is highlighted in light grey.
Clinical vignette with a moderate difficulty level (typical clinical picture)
For the “typical” vignette, the nociceptive pain drivers domain obtained the highest agreement (0.97 [95% CI: 0.90–0.99]), which suggests good inter-rater agreement. The lowest agreement was observed for the nervous system dysfunction drivers domain (0.54 [95% CI: 0.43–0.65]), suggesting moderate inter-rater agreement. The three other domains presented moderate agreement, ranging between 0.65 and 0.78. The detailed results are presented in Table 2.
Clinical vignette with a complex difficulty level (atypical clinical picture)
For this “complex” vignette, we observed good inter-rater agreement for the comorbidity factors domain, which presented the highest agreement (0.81 [95% CI: 0.70–0.88]). The lowest agreement was observed for the nervous system dysfunction drivers domain (0.49 [95% CI: 0.38–0.61]), indicating poor inter-rater agreement. The three other domains presented moderate agreement, ranging between 0.51 and 0.58. The detailed results are presented in Table 2.
Comparison between the two clinical vignettes
Figure 1 presents the comparison of the proportions of the most frequently chosen answer per domain between the two clinical vignettes. We found a statistically significant difference for nociceptive pain drivers domain (
Comparison between the two clinical vignettes of the proportions of the most frequently chosen answer for each domain (point estimate and 95% confidence interval). 
The mean time to complete the second clinical vignette for the participants of online workshops (
Discussion
The objective of this study was to determine the inter-rater agreement of the PDDM rating scale and to inform future users about the variation in answers when the biopsychosocial profile is established with this newly developed tool. Overall, the inter-rater agreement of the scale can be interpreted as moderate, with the nervous system dysfunction drivers domain having the lowest agreement.
The complexity of the vignettes mainly influenced the results of the nociceptive pain drivers and cognitive-emotional drivers domains. The lower agreement for the complex vignette could be explained by a lack of knowledge of or familiarity with classification systems for LBP, namely the Treatment-Based Classification (TBC) system. Many participants classified the patient in a TBC subgroup when they analyzed the complex vignette, although the characteristics were not consistent with the TBC’s classifications. This hypothesis is supported by the low proportion of participants who reported previous training on classification systems (23.2%) and, more precisely, on the TBC (2.9%). The same reasons may explain the statistically significant difference for the cognitive-emotional drivers domain. Qualitative studies on physiotherapists’ assessment of psychosocial variables suggest that it is mainly based on “gut feeling” [4, 24]. Physiotherapists frequently fail to detect psychosocial drivers [25, 26]. These elements could explain the large dispersion of the answers for the cognitive-emotional drivers domain.
The difficulty in screening the nervous system dysfunction drivers domain may also be explained by the fact that the concept of nervous system sensitivity is still evolving, which can lead to confusion about its clinical integration [27]. It may also be explained by a misinterpretation related to significant sleep disturbances (an element pertaining to nervous system dysfunction). The complex vignette contained “pain sometimes wakes him up at night”, which may have led the clinicians to recognize the presence of sleep disturbances. Although the frequency of sleep disturbances is important to consider [28], in this case it was not deemed “significant”. These observations suggest that clinicians who want to use the PDDM rating scale would benefit from a good base knowledge of the concepts integrated in the PDDM.
Despite moderate to good inter-rater agreement, the results for the comorbidity factors domain raised an important point for clarification: this domain requires the patient’s report of a co-occurring “medical” diagnosis, whether it is a mental health comorbidity (depression, anxiety disorders, post-traumatic stress), a sleep disorder (insomnia) or a painful physical comorbidity (fibromyalgia, spondylarthritis, irritable bowel syndrome). Thus, each domain of the PDDM and its elements need to be defined more specifically to improve agreement.
Another important factor that could explain some of the low agreement findings is that approximately 70% of our sample reported rarely or never using questionnaires in their practice. Our participants’ unfamiliarity with the different outcome measures presented in the vignettes may also have affected the agreement for the nervous system dysfunction drivers and cognitive-emotional drivers domains. For both vignettes, the objective information was presented via physical examination results and questionnaires. Thus, we may have to integrate more content on questionnaires (outcome measures) and their interpretation during the workshop.
Our results seem generalizable and give useful information on the applicability of the PDDM rating scale in routine clinical practice. The characteristics of the participants show heterogeneity in experience, training and clinical routine. This heterogeneity will facilitate integration of the PDDM rating scale in a variety of clinical settings. The workshop could impact the ecological validity of our results. However, a workshop does not allow for an immediate change in participants’ knowledge and skills. The time needed to complete an instrument is a recurrent barrier when it is integrated into a new assessment procedure [29]. We provide information on completion time in a sample of participants without previous exposure. Thus, clinicians can plan their schedule to integrate the PDDM rating scale, knowing that familiarity with the tool should reduce this time.
Strengths and limitations
The sample size of this project exceeds the recommended ratio of 10 participants per item [15, 16]. Randomizing the order in which the vignettes were distributed prevented an information bias: this strategy prevented the results of the comparison between the two vignettes from being explained by a learning process [30].
The first limitation concerns familiarization with the new concepts and with the rating scale. We assessed absolute agreement during the training (workshop) and not after a learning period, which may be required to reach higher inter-rater agreement. The second limitation is the fatigue induced by the workshop and data collection. We opted to collect data during the workshop to reduce recall biases, but cognitive fatigue from a 7-hour workshop could also have affected the quality of responses and inter-rater agreement [31] and therefore increased the risk of errors [32]. The third limitation concerns the shift from face-to-face to online workshops. This change could have increased cognitive fatigue (online environment) and reduced participants’ engagement, as instructor-participant interactions were more difficult. The fourth limitation concerns the number of clinical vignettes. Even though the participants had two training vignettes during the workshop, the low number of clinical cases does not allow for stabilization in the learning process [33]. The choice of this low number of cases was mainly guided by the length of the workshop. However, with the pool of clinicians who attended our workshops, it will be easier for us to increase this number in a future study. While these limitations may have led to an underestimation of the inter-rater agreement, they also represent opportunities to improve the workshop.
The last limitation concerns the statistical method. For an ordinal scale, such as the one we studied, the GRRAS guidelines suggest the use of kappa statistics for reliability. However, kappa statistics are criticized because their interpretation is not intuitive and because they are a relative (versus absolute) measure of agreement [18]. Moreover, kappa statistics have often produced paradoxical results, where a low kappa is observed despite a high raw agreement [19]. Thus, we estimated the proportion of participants who gave the same answer for each domain. Although this method is preferable to kappa statistics in this context, it does not account for chance agreement [23].
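The paradox mentioned above, a low kappa despite high raw agreement, can be illustrated with a small Python sketch. It uses Cohen's kappa for two raters as a simplified stand-in, since the present study involved many raters and did not compute kappa. The function name `cohen_kappa` is hypothetical and the contingency table is invented for illustration.

```python
def cohen_kappa(table):
    """Cohen's kappa from a square contingency table of two raters' counts.

    table[i][j] is the number of cases rated i by rater 1 and j by rater 2.
    Returns (observed agreement, chance-expected agreement, kappa).
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_marginals = [sum(table[i]) / n for i in range(k)]
    col_marginals = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    expected = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return observed, expected, (observed - expected) / (1 - expected)


# Invented example, NOT study data: two raters agree on 90 of 100 cases,
# but almost all cases fall in the first category.
table = [[90, 5],
         [5, 0]]
observed, expected, kappa = cohen_kappa(table)
print(f"raw agreement = {observed:.2f}, "
      f"chance agreement = {expected:.3f}, kappa = {kappa:.3f}")
```

In this invented table the raw agreement is 0.90, yet kappa is slightly negative because the skewed category prevalence makes the chance-expected agreement (0.905) exceed the observed agreement.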
Conclusion
The inter-rater agreement of the PDDM rating scale was found to be moderate for a typical clinical presentation of a patient with low back pain in a sample of clinicians without prior exposure to the PDDM. However, inter-rater agreement was significantly lower for the nociceptive pain drivers and cognitive-emotional drivers domains when a “complex” vignette was presented to our participants. The PDDM rating scale can help integrate a biopsychosocial perspective into clinical practice. Clinicians must be aware that their ratings for certain domains may not be consistent with those of other therapists. Our observations represent opportunities to improve workshop content to increase the inter-rater agreement.
Footnotes
Acknowledgments
Florian Naye received a scholarship from the Université de Sherbrooke.
Conflict of interest
None to report.
Supplementary materials
The Appendices are available from https://dx.doi.org/10.3233/BMR-210125.
