Abstract
BACKGROUND:
A functional capacity evaluation (FCE) can provide a comprehensive, objective measure of a worker’s ability to meet work demands and so support return to work decision making. Research evidence of an FCE’s reliability and validity, drawn from more than one study and covering all test components across a diverse range of populations, is essential to ensure confidence in any FCE system.
OBJECTIVE:
This study aimed to establish the inter-rater reliability of the Valpar Joule functional capacity evaluation (FCE), for which there is currently limited published evidence of reliability.
METHODS:
Twelve healthy subjects were digitally recorded completing the initial protocol of the Valpar Joule. Assessments were rated separately by 3 raters and the results then compared.
RESULTS:
Inter-rater reliability was assessed using Intraclass Correlation Coefficients (ICC), with percentages of agreement and t-tests to determine bias. Reliability was high for determining last safe weight lifted in forceful tasks, with ICC > 0.90. Agreement ranged from 97.2% to 100% for determining reasons for terminating tests and from 97.2% to 98.6% for identifying maximum safe capacity, but full agreement on the last weight safely lifted in forceful tasks ranged only from 8.3% to 50%. Differences in identifying poor body mechanics in lifting were found between raters with different training and experience.
CONCLUSION:
Results demonstrated high inter-rater reliability for the Valpar Joule functional capacity evaluation in healthy adults. Further development of criteria identifying poor body mechanics and training in its use is recommended to increase evaluator objectivity.
Introduction
Sickness absence from work is prevalent in the UK and costs around £100 billion annually in “sickness benefit” payments [1, 2]. Current Government policy has a strong focus on reducing sickness absence and improving services to promote work health and wellbeing [3, 4]. While employers may wish absent employees to return to work (RTW) quickly, they may require detailed healthcare guidance on an employee’s abilities, performance limitations or necessary work adjustments in order to support safe RTW [1, 5].
Allied Health Professionals (AHPs) and General Practitioners (GPs) can help to provide employers with this information (Allied Health Professions Federation [2, 5–7]). However, objective information is required to support any recommendations of when and how absent workers can return safely [8]. A common component of work rehabilitation programmes which strives to provide objective data is the Functional Capacity Evaluation (FCE) [9]. FCEs are test batteries which provide a systematic, comprehensive and performance based measurement to determine an individual’s physical abilities to carry out work related tasks [10].
While it is not suggested that an FCE be used in isolation, studies have identified that FCEs can make a significant contribution to a comprehensive RTW process by comparing the injured worker’s physical capacity with their work demands to determine ability to RTW safely [11–13]. However, evidence of FCE reliability is vital to demonstrate that changes over time in a worker’s performance are due to variation in their abilities and not a result of FCE measurement error, and to ensure there is confidence in the assessment results [14, 15]. Methodological limitations in studies supporting FCE reliability have previously been highlighted [14, 16]. Therefore, while it is acknowledged that a number of other FCEs are available, this study aimed to further evaluate the inter-rater reliability of one FCE, the Valpar Joule Functional Capacity Evaluation [17, 18], and to identify any factors potentially affecting inter-rater reliability.
Method
Research design
A cross-sectional study was conducted to investigate the inter-rater reliability of three raters conducting the Valpar Joule Functional Capacity Evaluation.
Participants
In their systematic review of the rater reliability of FCEs, Gouttebarge et al. [16] rated most highly those studies involving more than 2 raters and 10 or more subjects. Therefore, a convenience sample of 12 healthy males and females of working age (18–65 years) was recruited from University staff and associates.
Subjects were screened using PARQ [19] and were excluded if they had a current or chronic injury or illness; an elevated resting blood pressure (>180/90); or an elevated heart rate (> (220 − client’s age) × 0.85), according to Valpar Joule procedural guidance [17].
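As an illustration only, the heart rate and blood pressure exclusion criteria above can be sketched in Python. The function names are hypothetical, and the sketch assumes that a blood pressure reading counts as elevated when either the systolic value exceeds 180 or the diastolic value exceeds 90; the guidance cited in the text does not state how the two components are combined. For a 40 year old subject, the heart rate limit works out at about 153 beats per minute.

```python
def max_safe_heart_rate(age_years):
    """85% of the age-predicted maximum heart rate (220 - age)."""
    return (220 - age_years) * 0.85


def is_excluded(age_years, heart_rate, systolic, diastolic,
                has_injury_or_illness=False):
    """Return True if a subject meets any exclusion criterion.

    Assumption (not stated in the source): blood pressure is treated as
    elevated when EITHER component exceeds its limit of 180/90.
    """
    elevated_bp = systolic > 180 or diastolic > 90
    elevated_hr = heart_rate > max_safe_heart_rate(age_years)
    return has_injury_or_illness or elevated_bp or elevated_hr


# Hypothetical subject: aged 40, heart rate 72, blood pressure 125/80
print(max_safe_heart_rate(40))
print(is_excluded(40, 72, 125, 80))
```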
Raters
Three raters, 1 primary and 2 silent, carried out the FCE assessments. The primary rater (rater 1) and a silent rater (rater 3) were both experienced occupational therapists who had completed an approved Valpar Joule training programme and had similar FCE experience (5 years). Due to the lack of another local Valpar Joule FCE rater, the other silent rater (rater 2) was an experienced physiotherapist with expertise in biomechanics and musculoskeletal rehabilitation but with no FCE rating experience.
Procedure
Ethical approval for this study was obtained from the University School Research Review Group.
Subjects were recruited via an email that included an information pack outlining the study, and they provided written informed consent prior to their participation. Subjects attended a one-off session and had initial demographic details recorded (age, sex, height, weight, and work status). Blood pressure (BP) and heart rate (HR) were recorded prior to and after assessment, with HR monitored during assessment, to ensure these did not exceed safe limits [17].
Each subject completed the V.J. FCE initial protocol [17], directed and rated by the primary rater. The protocol required the subject to complete a series of 8 forceful tasks (waist to waist lift; waist to floor lift; waist to above shoulder lift; bilateral carry; unilateral dominant hand carry; unilateral non-dominant carry; push; and pull tasks); 6 positional tasks (sitting; standing; kneeling; crouch; sustained mid-level reach; sustained elevated reach); and 8 repetitive tasks (walking; crawling; stair climb; ladder climb; balance; repetitive foot (right and left); fine motor co-ordination). Because push and pull tasks are tested using a force gauge, which produces an absolute number and therefore 100% agreement, these results are not discussed in this study and only 6 forceful tasks are reported on.
The primary rater observed each task completion, determining the last safe weight the subject lifted by identifying unsafe body mechanics or physiological signs; the reason each task was terminated; and the subject’s maximum safe capacity. Following studies by Legge and Burgess-Limerick [20] and Reneman et al. [21], the assessment was digitally recorded using video cameras from more than one angle, with sound recording, to maximise the “silent” raters’ view of a subject’s performance and their awareness of reported relevant factors such as the participant’s cardiopulmonary function recordings, perceived rate of exertion [22] or any subjective comments. Recordings were transferred to external hard drives and the two “silent” raters independently viewed and rated each subject’s assessment, blinded to each other’s and the primary rater’s ratings, reducing the likelihood of rater bias [23].
All test data was coded, stored and analysed in accordance with Robert Gordon University policy and the Data Protection Act [24] to ensure confidentiality and protect the anonymity of volunteers.
Data analysis
For each subject, the FCE results for each rater were statistically analysed to investigate inter-rater reliability and issues impacting on reliability. Comparison between all raters and pairs of raters was made to identify consistency and agreement or differences between raters. Analysis was performed using the Statistical Package for the Social Sciences (SPSS) for Windows [25].
The level of inter-rater reliability for determining last safe lift in the forceful tasks of the protocol, which produce ordinal data, was calculated using a two way mixed model intraclass correlation coefficient (ICC 2,1) to determine absolute agreement [26, 27]. To allow for comparison with other FCE studies, the levels of reliability scale provided by Gouttebarge et al. [16] in their review of FCEs was adopted, and levels of reliability were determined as: High for ICC > 0.90; Moderate for 0.75 ≤ ICC ≤ 0.90; Low for ICC < 0.75.
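As an illustration only, a single-measure, absolute-agreement ICC can be sketched in Python from the two-way analysis of variance mean squares, followed by the reliability bands described above. This is not the SPSS procedure used in the study; the function names and the example scores are hypothetical, and treating values of exactly 0.75 and 0.90 as moderate is an assumption about how the band boundaries are handled.

```python
def icc_absolute_agreement(ratings):
    """Single-measure, absolute-agreement ICC from two-way ANOVA mean squares.

    ratings: list of rows, one row per subject and one column per rater.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def reliability_level(icc):
    """Reliability bands as described in the text (boundary handling assumed)."""
    if icc > 0.90:
        return "high"
    if icc >= 0.75:
        return "moderate"
    return "low"


# Hypothetical last safe weights (lb) for 4 subjects scored by 3 raters
scores = [[20, 25, 20], [30, 30, 30], [15, 20, 15], [40, 45, 45]]
icc = icc_absolute_agreement(scores)
print(round(icc, 3), reliability_level(icc))
```

Because the rater mean square appears in the denominator, a rater who scores consistently higher or lower than the others reduces this coefficient, which is what distinguishes absolute agreement from consistency.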
It has been acknowledged that no single test provides a complete measure of reliability [28, 29]. Therefore, in addition to the ICC (2,1A), a 95% confidence interval (CI) was calculated for each ICC mean. Percentages of agreement were also used to determine agreement between all raters and pairs of raters [14]. Ranking was used to identify the number of ties or positive or negative ratings between pairs of raters [27], and t-tests were completed to determine rater bias [26].
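As an illustration only, the percentage of agreement between all raters and between pairs of raters, together with the counts of ties and positive or negative differences, can be sketched as follows. The function names and the example scores are hypothetical and are not the study data.

```python
from itertools import combinations


def percent_full_agreement(ratings):
    """Percentage of rows on which every rater gave the same score."""
    agreed = sum(1 for row in ratings if len(set(row)) == 1)
    return 100.0 * agreed / len(ratings)


def pairwise_summary(ratings):
    """For each pair of raters: % agreement and counts of ties and signs."""
    k = len(ratings[0])
    summary = {}
    for a, b in combinations(range(k), 2):
        diffs = [row[a] - row[b] for row in ratings]
        ties = sum(1 for d in diffs if d == 0)
        summary[(a + 1, b + 1)] = {          # raters numbered from 1
            "percent_agreement": 100.0 * ties / len(diffs),
            "ties": ties,
            "positive": sum(1 for d in diffs if d > 0),
            "negative": sum(1 for d in diffs if d < 0),
        }
    return summary


# Hypothetical last safe weights (lb) for 4 lifts scored by 3 raters
scores = [[20, 25, 20], [30, 30, 30], [15, 20, 15], [40, 45, 45]]
print(percent_full_agreement(scores))
for pair, result in pairwise_summary(scores).items():
    print(pair, result)
```

In this hypothetical example a predominance of negative differences for a pair would suggest that the first rater of the pair tends to score lower than the second, which is the kind of pattern the ranking is intended to expose.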
Percentages of agreement were also used to compare inter-rater reliability in raters’ scoring for tasks which produced nominal data. These were: the reasons for stopping all tasks in the Valpar Joule protocol (rated 1–3 as: 1. Task was fully completed; 2. Subject determined to be using unsafe body mechanics; or 3. Subject exceeded maximum safe exertion levels according to VAI guidelines [17] and as determined by physiological measures); and the maximum safe capacity of subjects in each of the forceful, positional and repetitive tasks in the Valpar Joule protocol [17] (rated 1–3 as: 1. No identified limitations; 2. Occasional; or 3. Rare).
Results
Subject demographics
A convenience sample of 12 healthy subjects (8 women, 4 men) was recruited. All subjects were employed either in teaching (n = 10) or manual work (n = 2). Subject ages ranged from 18 to 59 years with a mean of 40.58 years and standard deviation (SD) of 12.5 years. Height ranged from 150 cm to 183 cm with a mean of 168.33 cm and SD of 10.34 cm. Weight ranged from 56 kilograms (kg) to 98 kg with a mean of 76.25 kg and SD of 13.90 kg. Each subject completed all aspects of the protocol in the determined order, apart from one subject (subject 1) who was unable to complete the push task due to equipment failure.
Determination of last safe weight in forceful tasks
Intraclass correlation coefficients – all raters
The level of inter-rater agreement for all forceful tasks was high, with all ICC > 0.90 and narrow CIs. The widest interval was 0.738–0.987 for unilateral non-dominant carry and the narrowest was 0.939–0.997 for waist to floor lift (Table 1).
Percentages of agreement – all raters
Although ICC values above 0.90 are reported, the percentages of agreement show that, of the 72 possible scored lifts (6 lifts for each of 12 subjects), all 3 raters fully agreed on a subject’s last safe weight lifted in only 15/72 lifts (20.83%). Agreement between all 3 raters was highest in bilateral carry, for 6/12 subjects (50%). Full agreement was lowest for waist to floor lift, waist to above shoulder lift and unilateral dominant hand carry, where it was achieved for 1/12 subjects (8.3%) in each task. In contrast, the highest full disagreement was recorded for 4/12 subjects (33.3%) in waist to floor lifts (Table 1).
Because the size of each weight increment differs between tasks, the raters’ scores for last safe weight lifted differed by up to 4 incremental weights in some lifts, which equated to a difference of between 3 pounds of force for waist to floor lift and 16 pounds of force for bilateral, unilateral dominant and unilateral non-dominant hand carry.
Intraclass correlation coefficients – paired raters
In determining last safe lift, inter-rater reliability was determined as high for each pair of raters, with all ICC > 0.90 and the narrowest confidence interval ranging from −0.081 to 0.414 for the waist to waist score for pair 2 (raters 1 and 3) (Table 2).
Percentages of agreement – paired raters
Table 2 reports that the highest percentage of agreement for determining last safe weight lifted was between raters 1 and 3, who agreed on 45/72 (62.5%) lifts. Agreement was lower between the other pairings, with raters 1 and 2 agreeing on 24/72 (33.3%) last safe weight scores and raters 2 and 3 agreeing on 22/72 (30.5%) scores. Raters 1 and 3 achieved the highest agreement for any single task in waist to waist lift, where they agreed on 10/12 (83.3%) of subjects’ ratings.
Paired differences
Table 2 also shows a significant difference between rater 2’s scores and those of raters 1 and 3 in determining last safe weight lifted in 4/6 (66.7%) of the forceful tasks: waist to waist, waist to floor and waist to above shoulder lifts, and unilateral dominant hand carry. No significant difference in scores was noted between raters 1 and 3.
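As an illustration only, the kind of paired comparison used to test for rater bias can be sketched as a paired t statistic computed from the differences between two raters’ scores for the same subjects. The function name and the example scores are hypothetical and are not the study data; the resulting statistic would still need to be compared against the t distribution with the stated degrees of freedom to judge significance.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired scores from two raters.

    Undefined when every difference is identical (zero standard deviation).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical last safe weights (lb) for 5 subjects scored by two raters
rater_a = [20, 30, 15, 40, 25]
rater_b = [25, 30, 20, 45, 35]
t, df = paired_t_statistic(rater_a, rater_b)
print(round(t, 3), df)
```

A negative statistic here indicates that the first rater scored subjects lower on average than the second.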
Raters’ scores for a subject’s last safe weight lifted differed by up to 4 incremental weights, equating to a difference of up to 16 lb of force being reported as the subject’s safe lifting ability in some lifts.
Reasons for terminating tasks and maximum safe capacity in forceful, positional and repetitive tasks
A high percentage of agreement was identified between all raters, both for determining reasons for terminating tests (Table 3) and for identifying maximum safe capacity (Table 4), for each of the 12 subjects completing all 20 tasks in the protocol (240 tasks in total, as push and pull were not included): 100% (72/72 tasks) agreement for forceful tasks; 97.2% (70/72 tasks) agreement for positional tasks; and 98.6% (95/96 tasks) for repetitive tasks. The maximum safe capacity for forceful tasks is not presented in the results, as this was calculated from the score each rater gave for last safe weight lifted and results would be the same as those presented for last safe weight.
Where ratings for terminating tests differed, rater 3 had recorded 3 subjects as having completed a task, while the other raters accurately reported that 2 subjects had their test stopped due to unsafe body mechanics and 1 subject had asked for the test to be stopped (Table 3). This subsequently had an impact on the raters’ determination of these subjects’ maximum safe capacity, as rater 3 rated each of those subjects as having no limitations although tests were stopped for the reasons reported (Table 4). These differences might therefore be considered recording errors and should be taken into consideration when viewing these inter-rater reliability results.
Discussion
Inter-rater reliability
The aim of this study was to evaluate the inter-rater reliability of the Valpar Joule (V.J.) FCE and to consider any factors which affect reliability or its use as a clinically effective FCE tool. To allow comparison with other FCE inter-rater reliability studies, the methodology for determining inter-rater reliability described by Gouttebarge et al. [16] was adopted. The findings of this study identified that inter-rater reliability for determining the last safe weight lifted for each forceful task subtest of this FCE protocol [17] was high, as evaluated by ICC > 0.90, and with narrow confidence intervals, ranging from 0.738–0.987 for unilateral non-dominant carry to 0.939–0.997 for waist to floor lift (Table 1). Reasons for terminating tests and identifying maximum safe capacity were also identified as having high inter-rater reliability, as determined by percentages (%) of agreement, ranging from 97.2% to 100% for reasons for terminating tests and from 97.2% to 98.6% for identifying maximum safe capacity.
However, when the raters’ actual scores for the last safe weight lifted by each subject were analysed using percentages of agreement, substantial differences between raters were apparent. All raters fully agreed with each other’s ratings on only 20.83% of occasions (15/72) and fully disagreed on a similar number of occasions, 15.3% (11/72). Raters achieved full agreement most frequently for bilateral carry ratings (50% or 6/12) and least frequently for waist to floor lift, waist to above shoulder lift and unilateral dominant hand carry, where full agreement was achieved in each task on only one occasion (8.3% or 1/12).
Some significant differences were noted between pairs of raters. Rater 2 agreed with raters 1 and 3 in only 33.3% and 30.5% of total scores respectively. Rater 2 also recorded significantly higher weights which subjects could safely lift in most lifts, and a significant difference was identified in their scores compared to raters 1 and 3 in determining last safe weight lifted in 4/6 (66.7%) of the forceful tasks: waist to waist, waist to floor and waist to above shoulder lifts, and unilateral dominant hand carry.
This was in contrast with paired raters 1 and 3, who were in full agreement in 62.5% of their total scores, with agreement in individual tasks ranging between 33.3% and 83.3%. This is almost twice the level of agreement that was identified in their pairings with rater 2, and there was no significant difference between their ratings of last safe weight lifted in any lifting task.
Evaluator training
The impact of evaluator discipline or training on FCE ratings has been recognised as being relatively unknown [30]. Therefore, while this study’s primary aim was not to investigate the impact of training or discipline on the reliability of the V.J. FCE, it is of interest to note that raters 1 and 3, who were both occupational therapists with similar backgrounds, FCE training and experience in safe return to work decision making, showed no significant difference between their ratings. In contrast, significant differences were found between their ratings and those of rater 2, an experienced physiotherapist with extensive musculoskeletal and biomechanics experience but no V.J. FCE evaluator training, and raters 1 and 3 scored subjects as having significantly lower lifting capacity than rater 2 did. Subsequently, while the importance of an evaluator’s skill level is recognised, these findings would appear to support the view that an evaluator’s discipline, knowledge, training and experience can impact on their judgements, confidence and objectivity of test scoring, and subsequently on the inter-rater reliability of an FCE [30–33]. Psychosocial, environmental and cultural factors may also influence evaluators’ judgements [34, 35]. Therefore, it is suggested that these factors, and the evaluator’s clinical reasoning when conducting the FCE, would all benefit from further investigation in order to increase the inter-rater reliability of the Valpar Joule, improve the accuracy of assessment recommendations and support sustained return to work.
Criteria for determining safe body mechanics
While low percentages of full agreement between raters are reported in this study, it should be noted that when the raters were not in absolute agreement, they differed by only a small number of incremental weights when determining the last safe lift completed. Most frequently raters differed by only 1 incremental weight, but on occasion there were differences of 2–4 incremental weights. However, because the incremental weight differences in lifting and carry tasks depend on the work level [17] at which a subject is lifting, in some cases the difference in the raters’ decisions regarding a subject’s last safe weight lifted was 16 pounds of force. A difference of this size could have a significant impact on whether an individual is determined to meet the demands of their job and to be able to return to work safely. Therefore, it is essential for both employees and employers that improvements are made in evaluators’ observations and in the criteria for determining last safe weight lifted in this FCE, in order to reflect a worker’s lifting abilities most accurately, minimise any difference in ratings and facilitate safe return to work.
It could be suggested that this result reflects the views of Tuckwell, Straker and Barrett [36], who highlighted that lifting is the most subjective part of the FCE and requires determination of quality of posture and movement. It is acknowledged that one of the raters in this study, while an experienced physiotherapist, was not trained in the use of this FCE and therefore could have been expected to score significantly differently from the two trained raters. However, differences were also reported in percentages of agreement in all lifts between the two trained raters and in their subsequent determinations of a worker’s safe lifting ability. While evaluators’ clinical reasoning in FCE decision making has been identified as requiring further investigation [31], it has been identified that defining safe maximal lift [37] and developing sound FCE rating criteria could reduce subjectivity in testing, and that training on how to interpret criteria and consistent application of a rating scale can enhance objectivity [38]. King, Tuckwell and Barrett [33] concur, noting that objectivity can be promoted when procedures, variables for observation and scoring are all operationally defined. When reviewing which lifts achieved the most agreement or disagreement between raters, no particular pattern was established. Rater differences were not particular to any subject being tested, and no specific explanation for why there was greater agreement in some tasks was identified. Therefore, it is suggested that the rating criteria for determining poor body mechanics for the V.J. FCE should be developed to enhance inter-rater agreement on last safe weight lifted in forceful tasks and to minimise any discrepancies in the FCE results and subsequent return to work recommendations. This will help ensure the confidence of raters in their assessment results and subsequent recommendations [13].
Limitations
This small study used a convenience sample of healthy young adults and results cannot be generalised to the wider population or to individuals with specific health conditions [39]. Additionally, the effects of subject gender, weight or height were not taken into consideration in this study. Further research is now necessary to establish the inter-rater reliability of the V.J. FCE with injured workers or individuals whose conditions can change [31, 41]. Future research is also required to establish other forms of reliability and validity of the V.J. FCE to ensure there is confidence in the assessment results [14].
It is acknowledged that due to lack of availability, inter-rater agreement for the V.J. FCE could not be determined between 3 V.J. trained evaluators. However, in using an experienced physiotherapist alongside two V.J. trained evaluators, the high inter-rater reliability for aspects of the FCE was still determined and the value of the training for evaluators apparent. It would have also been of interest to involve a third Occupational Therapist untrained in V.J. evaluation to provide a comparison of reliability, based on the V.J. training.
Conclusion
This study investigated the inter-rater reliability of the Valpar Joule FCE and identified high inter-rater reliability for lifting and carrying tasks, with intraclass correlation coefficients of >0.90, and a high percentage of agreement between all raters of >90% for reasons for terminating tests and identification of a subject’s maximum safe working capacity. However, the findings also revealed a significant difference in scoring between pairs of raters for identifying last safe weight lifted in forceful tasks. The study highlighted apparent differences in raters’ views on criteria for determining poor body mechanics.
It appears that different training and experience may impact on objectivity of test scoring and subsequently on inter-rater reliability [30–33]. While it is acknowledged that the Valpar Joule provides training and suggested criteria, given the findings of this study and the consequences of FCE results, it is concluded that the objectivity of observations for lifting and carry tasks and the inter-rater reliability of the V.J. FCE could be further enhanced. Further consideration of factors which may improve objectivity is required. The development of more specific, clearly defined criteria for determining the presence of physical signs of poor body mechanics, and additional rater training to assist in their detection, are also recommended to improve rater skills and objectivity, minimise discrepancies in ratings and increase confidence in results for the V.J. FCE.
It was also recognised that the evaluators’ clinical reasoning when conducting FCEs and the effect of an evaluator’s experience, training, and incorporation of information other than just biomechanical factors would benefit from further investigation [31–33, 38]. How this subsequently impacts on assessment recommendations and successful, sustained return to work is also suggested as an area for future research to increase the accuracy of assessment recommendations.
Conflict of interest
The authors have no conflict of interest to declare.
Acknowledgments
Thank you to Elaine Stewart for her assistance in rating the FCE results, to all volunteers who participated in this study, and to Alex Wilson, statistician, for assistance in data analysis calculations.
