Inter-rater and test-retest reliability of computerized clinical vestibular tools

Abstract

BACKGROUND:

Clinical vestibular technology is rapidly evolving to improve objective assessments of vestibular function. Understanding the reliability and expected score ranges of emerging clinical vestibular tools is important to gauge how these tools should be used as clinical endpoints.

OBJECTIVE:

The objective of this study was to evaluate inter-rater and test-retest reliability intraclass correlation coefficients (ICCs) of four vestibular tools and to determine expected ranges of scores through smallest real difference (SRD) measures.

METHODS:

Sixty healthy graduate students completed two 1-hour sessions, at most a week apart, consisting of two video head-impulse tests (vHIT), computerized dynamic visual acuity (cDVA) tests, and a smartphone-assisted bucket test (SA-SVV). Thirty students were tested by different testers at each session (inter-rater) and 30 by the same tester (test-retest). ICCs and SRDs were calculated for both conditions.

RESULTS:

Most measures fell within the moderate ICC range (0.50–0.75). ICCs were higher for cDVA in the inter-rater subgroup and higher for vHITs in the test-retest subgroup.

CONCLUSIONS:

Measures from the four tools evaluated were moderately reliable. There may be a tester effect on reliabilities, specifically vHITs. Further research should repeat these analyses in a patient population and explore methodological differences between vHIT systems.

Keywords

Clinical tools, reliability, vHIT, cDVA, subjective visual vertical

1 Introduction

Advances in technology have significantly improved the ability of clinicians to objectively diagnose vestibular dysfunction and document the effectiveness of interventions. For example, the video head impulse test (vHIT) allows for evaluation beyond identifying a corrective saccade with the traditional head impulse test by measuring the ratio of eye to head movement over the entire head impulse. Similarly, head-mounted accelerometers during a Computerized Dynamic Visual Acuity task (cDVA) allow for more objective measurement and consistency of head movements during the task when compared to a simple metronome-assisted evaluation. Finally, new assessments of subjective visual vertical (SVV) have theoretically improved on the traditional SVV bucket test by replacing the angle finder / goniometer with computerized versions [22]. However, despite allowing for more objective testing, these technologies can impose their own biases and limitations. It is important that clinicians interested in using these tools to identify disease and as markers of intervention success are mindful of potential error. By understanding the innate range of scores on the tools, clinicians and researchers can more confidently conclude that changes in their patients larger than these ranges indicate actual change in performance.

The vHIT enables testing of the vestibulo-ocular reflex (VOR) generated from all six semicircular canals in response to high velocity head movements. The vHIT is not only used for diagnosis of vestibular hypofunction but may also be valuable to discern mechanisms of improvement following vestibular rehabilitation [8, 20]. Several vHIT systems, such as the EyeSeeCam® (Interacoustics, Middelfart, Denmark), typically require tightly fitting goggles. A system with an external camera only and no goggles was recently developed by Synapsys (Marseilles, France). As these tools utilize different methods of motion recording (the head-mounted goggles using an attached accelerometer and the external camera having access to only the movement of the eye in the camera), their respective gain and velocity calculations are different. To our knowledge, the reliability of these two systems has not been explored.

The cDVA allows for a functional measure of the VOR by measuring how well a patient uses the VOR to identify optotypes that appear during fast head movement and is commonly used to determine response to gaze stabilization exercise interventions [14]. Available cDVA systems conduct both the Dynamic Visual Acuity Test (DVAT) and the Gaze Stabilization Test (GST). The DVAT assesses the smallest optotype that can be seen with the head moving at a set velocity, while the GST assesses the maximum head velocity where the participant can still accurately identify a fixed-size optotype. Although the two tests are moderately correlated, they are thought to measure different constructs [21].

Reliability studies of the DVAT report variable reliability coefficients that also differ between younger adults and older adults [19, 21]. For instance, the NeuroCom InVision System (NeuroCom International, Inc., Clackamas, OR, USA) has demonstrated a wide range in inter-rater and test-retest reliability as measured by the intra-class correlation (ICC) and larger ICCs in older adults when compared to younger adults [18, 21]. These conflicting ICCs between groups have been partially attributed to the lack of inter-subject variability, known to contribute to lower ICCs, in young adults [13, 19]. Poor reliability has been attributed to participant fatigue, as the person being tested may struggle to achieve the exact head velocity required, thus causing the test to take more time [12]. To our knowledge, there has not been an evaluation of reliability in the Bertec® Visual Advantage cDVA system (Bertec, Columbus, OH).

Finally, the traditional SVV bucket test is an inexpensive method to determine how well an individual perceives when a straight line, tipped off-vertical, is set to true vertical in the absence of other visual or haptic cues, and shows a high test-retest Pearson correlation (r = 0.90) [5–7, 15]. It is important to note that Pearson correlations are a measure of association and not of agreement between two outcomes, unlike the ICC, and therefore proper reliability studies on any test should calculate ICCs instead [2]. While the angle finder and goniometer bucket tests are simple and easy to set up, computerized versions of this tool have been developed that utilize smartphone accelerometer apps and/or computer programs. Use of accelerometers instead of angle finders may provide higher accuracy and precision to true angles. To our knowledge, the reliability of a smartphone system that is used like the classic angle finder/goniometer bucket test has not been evaluated. In addition, the reliability of the variance of angle response on any SVV bucket test has not been evaluated, and recent studies using a rotary chair system to measure SVV have demonstrated that concussion may affect the variance of trials and not the average angle response [4].

The purposes of this study were to calculate inter-rater reliability, test-retest reliability, and the smallest real difference of four clinical tools measuring vestibular-related function: 1) Bertec Visual Advantage DVAT & GST; 2) EyeSeeCam® vHIT; 3) Synapsys vHIT; 4) A bucket test of SVV assisted by a smartphone accelerometer app.

2 Methods

2.1 Ethical considerations

The study was approved by the University of Alabama at Birmingham’s Institutional Review Board.

2.2 Participants and testers

A sample of convenience (n = 60) from the 1st and 2nd year University of Alabama at Birmingham (UAB) Doctor of Physical Therapy (DPT) students participated in this study. Participants committed to two 1-hour long testing sessions no more than one week apart. To be included, participants had to be 18–35 years of age, report normal vision with or without the use of corrective lenses, have no current or previous neurological diagnosis and no self-report of hearing loss, headache, migraine, neck pain, or dizziness. Participants were asked to refrain from taking medications known to affect the vestibular system (e.g. meclizine) as well as to refrain from drinking alcohol 24 hours prior to testing. If a participant had a headache or dizziness at the time of either session, they were not allowed to test that day. If, during testing, the participant developed a headache or dizziness rated 3/10 or greater, testing stopped immediately. If the headache and/or dizziness was rated 2/10 or lower, the participant was instructed to rest to allow symptoms to return to 0/10 before proceeding.

All testers were DPT second year students who trained using methodology recommended by the manufacturer for data collection until they could confidently perform all tests. Three testers collected data for the full study sample. To assess intertester reliability, the first 30 participants were tested by a different examiner on each tool at Session 1 and Session 2 (i.e., Examiner 1 was the tester for Tool A at Session 1 for Participant 1, but Examiner 2 was the tester for Tool A at Session 2). Each examiner was the tester for each tool a total of 20 times, and examiners were randomized for each participant/session (i.e., Examiner 2 was not always the tester for Tool A at Session 2 as in the example above). The second 30 participants were tested by the same tester on each tool (i.e., Examiner 1 was the tester for Tool A at Session 1 and Session 2 for Participant 31) to assess each tool’s test-retest reliability. Again, each examiner was the tester for each tool a total of 20 times.

Participants were tested in a quiet environment. To control for the effect of test order, the four tests were organized into eight predetermined orders. On the first test day, participants drew a number (1–8) to determine order of tests. Once a number had been drawn eight times, that order was removed. This randomization ensured that, if the battery was fatiguing, participants were not always fatigued on the same tools. The four clinical tools included: 1) Bertec Visual Advantage, 2) Synapsys vHIT, 3) EyeSeeCam® vHIT, and 4) Smartphone Accelerometer-Assisted SVV Bucket Test.
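The capped drawing procedure described above can be sketched in a few lines of Python. This is only an illustration of the rule "each of eight orders may be drawn at most eight times"; the function name `draw_test_orders`, its parameters, and the use of a seeded pseudo-random generator are assumptions made here, since participants in the study drew numbers by hand.

```python
import random
from collections import Counter


def draw_test_orders(n_participants, n_orders=8, max_uses=8, seed=None):
    """Assign each participant one of n_orders test orders at random,
    retiring an order once it has been drawn max_uses times."""
    rng = random.Random(seed)
    # Remaining draws available for each order number (1..n_orders).
    remaining = {order: max_uses for order in range(1, n_orders + 1)}
    assignments = []
    for _ in range(n_participants):
        if not remaining:
            raise ValueError("more participants than available draws")
        order = rng.choice(sorted(remaining))
        assignments.append(order)
        remaining[order] -= 1
        if remaining[order] == 0:
            del remaining[order]  # this order has been used up
    return assignments


orders = draw_test_orders(60, seed=1)
counts = Counter(orders)
print(len(orders), max(counts.values()) <= 8)
```

With eight orders capped at eight draws each, up to 64 participants can be assigned, which covers the 60 enrolled here.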

2.3 Computerized dynamic visual acuity

The Bertec Visual Advantage computerized dynamic visual acuity system (BVA-cDVA) was used to assess static visual acuity (SVA), visual processing speed, and either the DVAT or GST in yaw. Participants were not asked to complete both tasks to reduce fatigue effects that may influence performance on the DVAT/GST [13]. At their first session, participants randomly picked a 1 or 2 to determine whether they would complete the DVAT or GST. Once the DVAT or GST reached 15 participants for each set of 30 participants, the remaining participants up to n = 30 completed the other task so that there were n = 15 participants for both subtests for inter-rater and test-retest reliability calculations. Tests for SVA and visual processing speed were completed before the DVAT or GST. Both SVA and DVAT are presented in LogMAR units (the log base-10 of the smallest minute of visual angle the participant can correctly identify) rather than Snellen scores due to their continuous nature.
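For readers more used to Snellen notation, the LogMAR definition above can be illustrated with a short Python sketch. It assumes the standard relationship in which the minimum angle of resolution (in arcminutes) is the reciprocal of the Snellen fraction; the function name `snellen_to_logmar` is introduced here for illustration and is not part of the BVA-cDVA software.

```python
import math


def snellen_to_logmar(numerator: float, denominator: float) -> float:
    """Convert a Snellen fraction (e.g. 20/40) to LogMAR.

    The minimum angle of resolution (MAR, arcminutes) is the reciprocal
    of the Snellen fraction; LogMAR is its base-10 logarithm.
    """
    if numerator <= 0 or denominator <= 0:
        raise ValueError("Snellen terms must be positive")
    mar = denominator / numerator
    return math.log10(mar)


print(round(snellen_to_logmar(20, 20), 2))  # 20/20 vision
print(round(snellen_to_logmar(20, 40), 2))  # worse acuity, positive LogMAR
print(round(snellen_to_logmar(20, 10), 2))  # better acuity, negative LogMAR
```

Under this conversion, lower (more negative) LogMAR values indicate better acuity, and a value near –0.3 corresponds to roughly 20/10.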

The BVA-cDVA subtests require participants to identify the direction of a tumbling E optotype. For visual acuity tests, the size of the optotype changes based on participant response. For visual processing speed and the GST, the optotype remains the same size for each trial but is presented at different speeds (visual processing speed) or requires a certain yaw head velocity to be presented (GST). The BVA-cDVA utilizes an adaptive, threshold-search algorithm based on the Hughson-Westlake method to determine which optotype size, presentation time, or head velocity to use for subsequent trials [9]. After each trial, the software determines if the number of completed trials is enough to reliably determine the threshold with approximately equal numbers of correct and incorrect responses around that value. If not, trials continue. The BVA-cDVA has the following ceiling values: SVA and DVAT: –0.3 LogMAR, GST head velocity: 150 °/s, and visual processing speed: 30 ms.

The DVAT was conducted by having the tester move the participant’s head in the yaw plane sinusoidally at a constant velocity between 90 and 120 °/s as measured by a head-mounted accelerometer. When the participant’s head movement was within this range, the optotype was presented during a leftward or rightward head movement, depending on the current test direction, to test the participant’s ability to accurately identify the optotype during movement towards the testing side. The variables of interest were the dynamic acuity (LogMAR units) and the number of lines lost (equivalent to 0.1 LogMAR each) between SVA and DVAT conditions for both rightward and leftward rotations. The leftward DVAT was completed first for the first 30 participants and the rightward DVAT was completed first for the second 30 participants.
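The Lines Lost variable follows directly from the 0.1 LogMAR-per-line equivalence stated above. A minimal Python sketch, assuming Lines Lost is the dynamic acuity minus the static acuity divided by 0.1 (the function name and the example values are hypothetical, not study data):

```python
def lines_lost(static_logmar: float, dynamic_logmar: float,
               line_size: float = 0.1) -> float:
    """Number of acuity lines lost between static and dynamic testing.

    A positive result means acuity was worse (higher LogMAR) while the
    head was moving than while it was still.
    """
    if line_size <= 0:
        raise ValueError("line_size must be positive")
    return (dynamic_logmar - static_logmar) / line_size


# Hypothetical participant: static acuity -0.10, dynamic acuity 0.10 LogMAR.
print(round(lines_lost(-0.10, 0.10), 1))
```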

The GST was conducted by having the tester move the participant’s head at faster or slower speeds depending on their performance on their previous trial, with a ceiling velocity of 150 °/sec, to determine the upper limit of the participant’s VOR’s ability to stabilize their vision. The variable of interest was maximum velocity of head movement (°/sec) rightward and leftward with accurate identification of the presented optotype. The left side was tested first for the first 30 participants and the right side was tested first for the second 30 participants.

2.4 Video head impulse testing

EyeSeeCam® (E-vHIT) and Synapsys (S-vHIT) were used to assess the gain and asymmetry of the VOR in all semicircular canals using video head impulse test (10 trials per canal) using the instructions from each manufacturer. The E-vHIT uses a tight-fitting goggle with a camera placed over the left or right eye. For this study, the camera was kept over the left eye. Calibration was completed for both the eye-tracker and accelerometer. The E-vHIT calculates 6 total canal gains: the leftward and rightward lateral canal gains at 60 ms and 100 ms after head impulse and the left/right posterior/anterior canal gains at 100 ms after head impulse. The gain of the VOR in each canal was the variable of interest.

The S-vHIT uses a single remote camera connected to a base placed 1 meter in front of the participant and focused on the participant’s eyes. No calibration is necessary. The S-vHIT calculates the VOR gain for each canal and asymmetry.

For both tools, the following thresholds were used to determine whether a trial was valid or not: for lateral canals, acceleration = 1,500–4,000 °/s², velocity = 120–310 °/s; for posterior and anterior canals, acceleration = 1,000–4,000 °/s², velocity = 76–219 °/s. The peak velocity needed to occur within 80 ms of impulse onset. While covert saccades were recorded by the tools (≥ 5,000 °/s eye velocity after the peak velocity), these were not analyzed for this study due to a lack of pathology in our sample leading to very low numbers of corrective saccades. It is important to note that the E-vHIT validates trials through both the head-mounted accelerometer to calculate impulse acceleration and the eye-tracking camera to calculate eye velocity, whereas the S-vHIT relies solely on its eye-tracking camera to make all calculations.
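The acceptance thresholds above can be expressed as a simple validity check. The following Python sketch is illustrative only: the `Impulse` record, the field names, and the choice to treat the limits as inclusive are assumptions made here, and neither vendor's software is claimed to work this way.

```python
from dataclasses import dataclass

# Limits quoted in the text: (min, max) peak head acceleration in deg/s^2
# and peak head velocity in deg/s. "vertical" covers anterior and
# posterior canals.
LIMITS = {
    "lateral": {"acceleration": (1500.0, 4000.0), "velocity": (120.0, 310.0)},
    "vertical": {"acceleration": (1000.0, 4000.0), "velocity": (76.0, 219.0)},
}
MAX_TIME_TO_PEAK_MS = 80.0  # peak velocity must occur within 80 ms of onset


@dataclass
class Impulse:
    canal: str                # "lateral" or "vertical"
    peak_acceleration: float  # deg/s^2
    peak_velocity: float      # deg/s
    time_to_peak_ms: float    # ms from impulse onset to peak velocity


def is_valid_impulse(impulse: Impulse) -> bool:
    """Return True if the impulse meets all three acceptance criteria."""
    limits = LIMITS[impulse.canal]
    a_lo, a_hi = limits["acceleration"]
    v_lo, v_hi = limits["velocity"]
    return (a_lo <= impulse.peak_acceleration <= a_hi
            and v_lo <= impulse.peak_velocity <= v_hi
            and 0.0 <= impulse.time_to_peak_ms <= MAX_TIME_TO_PEAK_MS)


print(is_valid_impulse(Impulse("lateral", 2500.0, 200.0, 60.0)))
print(is_valid_impulse(Impulse("vertical", 1200.0, 100.0, 95.0)))
```

The first hypothetical impulse satisfies all limits; the second is rejected because its peak velocity occurs later than 80 ms after onset.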

2.5 Smartphone bucket test

The smartphone-assisted bucket test (SA-SVV) used a commercially available smartphone application developed by one author (Graham Cochrane, who was involved in the training of the tool but not the data collection). This test involved placing a bucket over the participant’s face, who was sitting with their head straight to remove potential body-tilt bias, with a computer-generated printed line on the inner base of the bucket [10]. A smartphone was mounted onto the outer base that records the angle of the phone in the phone’s yaw plane. The creator of the bucket attempted to align the smartphone to be parallel with the inner line; a perfect alignment may not have been made based on the average deviations presented. The participant was instructed to close their eyes and the examiner holding the bucket turned the line to a preset angle identified randomly from a list of 12, 15, 20, –12, –15, and –20 degrees, to ensure variable, random, preset angles [17]. The participant then opened their eyes and the bucket was slowly turned by the examiner to true vertical at a speed < 8°/s (measured by the phone which prompted the tester to slow down if exceeded). When the participant believed the line was perfectly vertical, they pressed a Bluetooth button that recorded the angle of the phone at that moment. This was repeated for 12 trials. The variable of interest for this test was average response (degrees from 0) and variance in response.
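The two SA-SVV outcome variables can be summarized from the recorded angles as follows. This Python sketch assumes signed deviations from true vertical in degrees and uses the sample variance (n − 1 denominator); the text does not state which variance estimator the app reports, and the trial values shown are hypothetical.

```python
import statistics


def summarize_svv(angles_deg):
    """Return (mean deviation, sample variance) of SVV responses in degrees.

    angles_deg holds one signed deviation from true vertical per trial.
    """
    if len(angles_deg) < 2:
        raise ValueError("at least two trials are needed for a variance")
    return statistics.mean(angles_deg), statistics.variance(angles_deg)


# Twelve hypothetical trials (degrees from vertical, signed).
trials = [0.5, -1.2, 0.3, -0.8, 1.1, -0.4, 0.0, -0.9, 0.7, -0.2, 0.4, -1.0]
mean_dev, var_dev = summarize_svv(trials)
print(f"mean = {mean_dev:.2f} deg, variance = {var_dev:.2f} deg^2")
```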

2.6 Examiner training

Before official testing began, the examiners were trained in video chat and in-person sessions by tool manufacturers, then given as much time as needed to become confident in the use of each tool. “Confidently” was defined through two methods: 1) Time to where the examiner could set up / break down the tool, set up a patient and correct parameters, and record results without needing input from other examiners; 2) Ability to complete a testing session on the tool while making 0–2 errors, defined as improper impulses per canal on E-vHIT and S-vHIT and/or too fast or too slow trials on the DVAT and the GST. The SA-SVV does not have easily identifiable errors on the examiner’s part besides moving the bucket too fast, which is monitored and can be fixed in the moment.

2.7 Statistics

All statistics were computed in SPSS v26. Descriptive statistics consist of means and standard deviations for normal distributions, and medians and interquartile ranges for skewed data (determined by visual inspection). Two-way mixed intra-class correlations (ICCs) were computed for both sets of 30 participants representing inter-rater absolute agreement and test-retest absolute agreement [11].
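For readers who want to see what an absolute-agreement ICC computes, the following Python sketch derives it from the two-way ANOVA mean squares. It assumes the single-measures form, often written ICC(A,1); the text does not state whether single or average measures were reported, and this is not the SPSS implementation itself. The function name and example scores are hypothetical.

```python
def icc_absolute_agreement(scores):
    """Single-measures, absolute-agreement ICC from a two-way layout.

    scores: list of rows, one per participant, each row holding that
    participant's score at every session (same number of sessions per row).
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 participants and >= 2 sessions, no gaps")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for participants (rows), sessions (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    return (ms_rows - ms_error) / denominator


# Session 2 is exactly one unit higher than session 1 for every participant.
shifted = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
print(round(icc_absolute_agreement(shifted), 3))
```

In this hypothetical example the two sessions are perfectly correlated, yet the absolute-agreement ICC falls below 1 because the systematic shift between sessions counts as disagreement.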

ICCs were interpreted as follows: 0.00–0.49 = Poor, 0.50–0.75 = Moderate, 0.76–0.90 = Good, 0.91–1.00 = Excellent [19].
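The interpretation bands above translate into a small lookup. In this Python sketch the ICC is rounded to two decimals before classification so that values falling between the printed bands (for example 0.755) land in one of them; that rounding step and the function name are assumptions made for illustration.

```python
def interpret_icc(icc: float) -> str:
    """Map an ICC in [0, 1] to the qualitative label used in this study."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC is expected to lie between 0 and 1")
    value = round(icc, 2)  # bands are stated to two decimals
    if value < 0.50:
        return "Poor"
    if value <= 0.75:
        return "Moderate"
    if value <= 0.90:
        return "Good"
    return "Excellent"


for example in (0.44, 0.75, 0.86, 0.91):
    print(example, interpret_icc(example))
```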

To improve potential clinical utility, ICCs were converted to smallest real differences (SRDs), also known as minimal detectable change scores. These SRD numbers help better represent the expected test-retest range of a clinical testing instrument [1]. Determining this range allows clinicians to better identify whether a change in performance on a measure is indicative of true change in the patient or a result of the error of a tool/examiner. The equation used to calculate the SRDs for this study was [3]:

\[ \mathrm{SRD} = SD_{\text{Test 1 and 2 combined}} \times \sqrt{1 - \mathrm{ICC}} \times 1.96 \times \sqrt{2} \]

As the SRD is determined by the standard deviation of the tool, it is only accurate for measures that are normally distributed. As such, SRDs were only calculated for non-skewed data.
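The SRD equation can be applied directly once a combined standard deviation and an ICC are available. The Python sketch below follows the formula as stated; how the combined standard deviation of Test 1 and Test 2 was obtained is not specified in the text, so it is taken here as an input. The function name and example values are hypothetical.

```python
import math


def smallest_real_difference(sd_combined: float, icc: float,
                             z: float = 1.96) -> float:
    """SRD = SD * sqrt(1 - ICC) * z * sqrt(2).

    SD * sqrt(1 - ICC) is the standard error of measurement; z = 1.96
    gives a 95% bound and sqrt(2) accounts for error in both sessions.
    """
    if sd_combined < 0:
        raise ValueError("standard deviation cannot be negative")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("ICC is expected to lie between 0 and 1")
    sem = sd_combined * math.sqrt(1.0 - icc)
    return sem * z * math.sqrt(2.0)


# Hypothetical measure with SD = 0.10 and ICC = 0.75.
print(round(smallest_real_difference(0.10, 0.75), 2))
```

For this hypothetical measure, a change smaller than about 0.14 units between sessions could not be distinguished from measurement error at the 95% level.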

A simple Student’s t-test was run to determine whether there was a systematic difference between the inter-rater and test-retest groups in their change in performance from session 1 to session 2 (session 1 score – session 2 score) for all normally distributed tool measures.
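The group comparison of change scores can be sketched as an independent-samples Student's t statistic with pooled variance. This Python example assumes the equal-variance form and returns only the statistic and degrees of freedom; obtaining a p-value would additionally require the t distribution (for instance from a statistics package). The function name and the change scores shown are hypothetical.

```python
import math
import statistics


def pooled_t_statistic(group_a, group_b):
    """Student's t statistic (equal variances assumed) and its df."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_error = math.sqrt(pooled_var * (1.0 / n_a + 1.0 / n_b))
    if std_error == 0:
        raise ValueError("no variability in either group")
    return (mean_a - mean_b) / std_error, df


# Hypothetical change scores (session 1 minus session 2) for each group.
inter_rater_change = [0.02, -0.01, 0.03, 0.00, -0.02]
test_retest_change = [0.01, 0.00, -0.03, 0.02, -0.01]
t_value, dof = pooled_t_statistic(inter_rater_change, test_retest_change)
print(f"t = {t_value:.3f}, df = {dof}")
```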

3 Results

3.1 Training

The testing team (2nd year DPT students) reported the following hours of training (per tester) needed to confidently perform each test before data collection began: BVA-cDVA: 3 hours, E-vHIT: 3 hours, S-vHIT: > 20 hours, SA-SVV: 1 hour. On the S-vHIT system, it was often difficult to obtain 10 accepted impulses per canal without making > 2 errors even after this training period.

3.2 Participant demographics

Participants were on average 24.1±1.9 years of age; n = 42 (70%) were female and n = 18 (30%) were male. n = 48 (80%) identified as white, 8 (13.3%) as black, 2 (3.3%) as Asian, 1 (1.7%) as Native American / Pacific Islander, and 1 (1.7%) as more than one race.

3.3 Testing

All 60 participants finished the first day protocol without an increase in headache or dizziness during the E-vHIT task and the bucket task. One individual reported an increase in dizziness (from 0 to 2) during the S-vHIT task on both Day 1 and Day 2. Four individuals reported an increase in dizziness during the BVA-cDVA tasks (0 to 1 or 0 to 2); two of these four reported increased dizziness on both days while the other two only reported dizziness on one of the days. All participants completed their second testing session within 1–7 days from the first session. One individual did not finish the second day protocol, reporting dizziness at 3/10 during the BVA-cDVA, their first tool of the day, which was unexpected based on the tolerance of other participants. Therefore, the test-retest reliability analyses were completed with 29 individuals instead of 30. Intertester reliability statistics are reported in Table 1 and test-retest reliability statistics are reported in Table 2 for all variables. There were no significant differences in change on any tool measure between the two groups (p-value = 0.13–0.94).

Table 1
Inter-rater reliability results

BVA-cDVA		Test 1 Results	Test 2 Results	ICC	SRD
DVAT	Static Visual Acuity	–0.13±0.15	–0.12±0.1	0.86	0.13
	Visual Processing Speed	30 (30–60)	30 (30–67.5)	0.54	-
	Dynamic Acuity L	0.01±0.14	0.00±0.12	0.86	0.14
	Dynamic Acuity R	0.01±0.14	0.01±0.11	0.91	0.10
	Lines Lost L	1.10±0.75	1.20±0.75	0.44	1.56
	Lines Lost R	1.10±0.94	1.10±0.60	0.45	1.62
GST	GST Speed L (°/s)	126±26	112.3±29.3	0.36	61.5
	GST Speed R (°/s)	127±25.6	121.3±31.9	0.75	40.2
S-vHIT	Lateral R Gain	1.00±0.05	1.00±0.04	0.43	0.09
	Lateral L Gain	1.01±0.03	1.01±0.04	0.68	0.06
	Lateral Asymmetry	1.00 (0.00–8.00)	1.00 (0.00–3.00)	0.00	-
	Anterior R Gain	0.96±0.09	0.94±0.10	0.65	0.16
	Anterior L Gain	0.99±0.07	1.01±0.08	0.58	0.14
	Anterior Asymmetry	2.00 (0.00–10.00)	3.00 (0.00–11.00)	0.09	-
	Posterior R Gain	0.93±0.09	0.94±0.08	0.60	0.15
	Posterior L Gain	0.9±0.10	0.88±0.17	0.67	0.22
	Posterior Asymmetry	3.00 (0.00–11.00)	3.00 (0.00–13.00)	0.26	-
E-vHIT	Lateral R Gain (60 ms)	1.08±0.12	1.04±0.11	0.00	0.32
	Lateral L Gain (60 ms)	1.10±0.10	1.06±0.13	0.49	0.23
	Lateral R Gain (100 ms)	1.12±0.08	1.08±0.09	0.20	0.21
	Lateral L Gain (100 ms)	1.10±0.08	1.09±0.09	0.27	0.20
	Anterior R Gain	1.06±0.16	1.04±0.14	0.56	0.28
	Anterior L Gain	1.31±0.21	1.31±0.22	0.44	0.45
	Posterior R Gain	1.36±0.22	1.37±0.24	0.54	0.43
	Posterior L Gain	1.03±0.19	1.04±0.16	0.66	0.28
SA-SVV	Avg Degree Deviation	–0.43±1.1	–0.28±1.1	0.69	1.70
	Degree Deviation Variance	1.6 (0.3–9.8)	2.1 (0.3–8.9)	0.72	-

Inter-rater reliability results (n = 30); Test 1 and Test 2 were completed by different testers. Data presented as Mean±Standard Deviation for normally distributed data and Median (Range) for skewed data. BVA-cDVA = Bertec Visual Advantage Computerized Dynamic Visual Acuity, DVAT = Dynamic Visual Acuity Test, GST = Gaze Stabilization Test, S-vHIT = Synapsys vHIT, E-vHIT = EyeSeeCam vHIT, SA-SVV = Smartphone Accelerometer Assisted Subjective Visual Vertical, ICC = Intra-Class Correlation Coefficient, SRD = Smallest Real Difference.

Table 2

Test-Retest Reliability Results

BVA-cDVA		Test 1 Results	Test 2 Results	ICC	SRD
DVAT	Static Visual Acuity	–0.13±0.09	–0.15±0.07	0.77	0.11
	Visual Processing Speed	30 (30–42.5)	30 (30–60.0)	0.00	-
	Dynamic Acuity L	0.04±0.11	–0.01±0.1	0.63	0.18
	Dynamic Acuity R	0.01±0.08	0.01±0.11	0.58	0.17
	Lines Lost L	1.7±0.73	1.3±0.68	0.00	1.96
	Lines Lost R	1.4±0.95	1.5±1.05	0.31	2.31
GST	GST Speed L (°/s)	133.8±21.2	139.7±16.6	0.00	52.77
	GST Speed R (°/s)	128.1±25.8	132±23.7	0.43	51.84
S-vHIT	Lateral R Gain	0.99±0.06	1.00±0.04	0.91	0.04
	Lateral L Gain	1.00±0.04	1.01±0.04	0.83	0.05
	Lateral Asymmetry	1.00 (0.00–9.00)	1.00 (0.00–3.00)	0.60	-
	Anterior R Gain	1.01±0.08	0.94±0.10	0.73	0.13
	Anterior L Gain	1.00±0.07	1.01±0.08	0.58	0.14
	Anterior Asymmetry	1 (0.00–6.00)	2.00 (0.00–7.00)	0.00	-
	Posterior R Gain	0.96±0.08	0.94±0.08	0.68	0.13
	Posterior L Gain	0.93±0.06	0.88±0.17	0.64	0.21
	Posterior Asymmetry	1.50 (0.00–12.00)	3.00 (0.00–10.00)	0.00	-
E-vHIT	Lateral R Gain (60 ms)	1.08±0.15	1.07±0.17	0.78	0.21
	Lateral L Gain (60 ms)	1.07±0.08	1.09±0.11	0.75	0.13
	Lateral R Gain (100 ms)	1.07±0.12	1.09±0.15	0.88	0.13
	Lateral L Gain (100 ms)	1.10±0.09	1.08±0.11	0.78	0.13
	Anterior R Gain	1.08±0.16	1.08±0.19	0.79	0.22
	Anterior L Gain	1.40±0.25	1.40±0.24	0.78	0.32
	Posterior R Gain	1.40±0.20	1.35±0.20	0.59	0.35
	Posterior L Gain	1.02±0.16	1.01±0.20	0.72	0.27
SA-SVV	Avg Degree Deviation	–0.34±1	–0.07±1.15	0.64	1.83
	Degree Deviation Variance	2.65 (0.23–9.05)	4.18 (0.29–16.23)	0.67	-

Test-retest reliability results (n = 29); Test 1 and Test 2 were completed by the same tester. Data presented as Mean±Standard Deviation for normally distributed data and Median (Range) for skewed data. BVA-cDVA = Bertec Visual Advantage Computerized Dynamic Visual Acuity, DVAT = Dynamic Visual Acuity Test, GST = Gaze Stabilization Test, S-vHIT = Synapsys vHIT, E-vHIT = EyeSeeCam vHIT, SA-SVV = Smartphone Accelerometer Assisted Subjective Visual Vertical, ICC = Intra-Class Correlation Coefficient, SRD = Smallest Real Difference.

3.4 Computerized dynamic visual acuity

The SVA and DVAT LogMAR data showed good to excellent inter-rater reliability and moderate to good test-retest reliability. Lines Lost for the DVAT had poor inter-rater reliability and test-retest reliability. The GST had poor to moderate inter-rater reliability and poor test-retest reliability.

3.5 Video head impulse testing

The E-vHIT VOR gains of the lateral semicircular canals at 60 ms and 100 ms had poor inter-rater reliability and moderate to good test-retest reliability. Posterior and anterior canal VOR gains had poor to moderate inter-rater reliability and moderate to good test-retest reliability. Left anterior and right posterior (LARP) gains were 30% larger on average than right anterior and left posterior (RALP).

The S-vHIT VOR gains of the lateral semicircular canals had poor to moderate inter-rater reliability and good to excellent test-retest reliability. Posterior and anterior canals had moderate inter-rater reliability and moderate test-retest reliability. Asymmetries calculated by S-vHIT had poor inter-rater reliability and poor to moderate test-retest reliability.

3.6 Smartphone bucket test

Average degrees from vertical had moderate inter-rater and test-retest reliability. The ICCs for variance were also moderate, but SRDs were not calculated because the data were strongly right-skewed.

4 Discussion

4.1 Computerized dynamic visual acuity

Overall, the measures from the BVA-cDVA with the strongest reliability were the SVA and DVAT LogMAR data. While clinicians typically utilize the resulting comparison of these two measures, Lines Lost, this measure did not show convincing ICCs. Despite low ICCs, the SRD was still 1.5–2.0 lines, which does not appear unreasonable as a goal endpoint for a treatment or therapy. It should be noted that the participants in this study did not have vestibular deficits and the mean Lines Lost was only one line. Therefore, the low inter-subject variability likely affected the ICC values [13]. Similarly, although the GST demonstrated low ICCs, the SRDs may guide clinicians who are using the GST to determine response to gaze stabilization training.

Our good to excellent reliability results differ from other studies that report poor reliability for the computerized DVAT [19]. The difference in our findings may be attributed to the fact that we required the examiner to assist the participant with turning the head, rather than having the participant turn their head independently. During the DVAT or GST, the optotype will not appear until the head velocity is within a certain range. Participants, especially patients, often have difficulty obtaining the correct velocity, which makes the test last longer and may result in fatigue. Comparing inter-rater to test-retest reliability suggests that inter-rater testing resulted in slightly higher ICCs but nearly identical SRDs.

4.2 Video head impulse testing

The E-vHIT appears to be the device most affected by change in examiner from one administration of the test to the next, with significantly different reliability scores when comparing inter-rater reliability and test-retest reliability. This indicates that either the E-vHIT is sensitive to the exact user or that it is the tool that requires the most training to produce similar results. Since the test-retest analysis was done on the second group of 30 participants, the testers were more skilled on the tools by that point. This does not coincide with the testers’ reported time to perceived confidence vs the S-vHIT, however. On the E-vHIT system, VOR gains were close to 1.00 as expected, except for the gains of the LARP canals, which were 30% larger than the other canals. Through discussion with the E-vHIT vendor experts, it is likely these higher gains were due to the accelerometer system and camera being fastened to the left side of the goggles. This pattern was not fully appreciated until multiple participants had been tested and the research team decided to stay consistent with our original testing paradigm. Clinicians should be mindful of this and collect their own normative data using standardized methodology. It is possible there may be an exact correction factor to account for this issue. Although the VOR gains were higher in LARP compared to RALP, reliability coefficients were similar. However, this was not explored further.

The ICCs for the S-vHIT system were similar for inter-rater and test-retest across the six canals. While the asymmetry ICCs were very low, it is important to note that the sample tested were all healthy individuals with no history of vestibular disease. Therefore, it is not surprising that few individuals had any pronounced asymmetry. As with the BVA-cDVA tasks, ICCs can be heavily influenced by low inter-subject variability; the low ICCs here should be attributed to a statistical weakness of the correlation and not an indication that asymmetries are unreliable [13]. Studies investigating asymmetries in a population with more variability (e.g. those with differing levels of unilateral vestibular hypofunction) will be better equipped to determine the reliability of asymmetry measures. It is important to emphasize that the S-vHIT was the most difficult for the testers to perform confidently and that, even after a 20+ hour training period and 120 examinations, the testers still felt as though they were learning nuances of the system.

4.3 Smartphone accelerometer assisted SA-SVV

Mean degree angle and variance in angle response had similar inter-rater and test-retest reliability ICCs, suggesting that changing examiners between administrations may not affect scores. The SA-SVV supports this consistency by providing the preset angle for each trial, monitoring the speed of the bucket rotation, and making sure the bucket and phone are parallel to the floor. Average deviations from the mean were tilted slightly from vertical (–0.36 and –0.21), likely indicating that our phone was not perfectly aligned with the visual line. While this bias did not affect the reliability calculations here since the same biased tool was used for both administrations, this is important to consider when using this system clinically. Assuming a population average of 0.0 degrees from vertical would allow a clinician to determine their bucket’s bias if there is any. This potential for error is not unique to the SA-SVV and needs to be considered in studies of and use of the traditional bucket test. The results suggest the ICC of the smartphone app is near, but less than, the Pearson correlation coefficients recorded for the traditional bucket test by Zwergal et al., greater than that of the bucket test measured by Michelson et al., and comparable to the virtual SVV system [12, 15]. As mentioned previously, ICC is a more robust measure of agreement than Pearson r, so a lower ICC here does not necessarily indicate the SA-SVV was less reliable than the traditional method [2]. As this particular app is used identically to the traditional angle finder / goniometer bucket test and requires only a smartphone instead of a computer and software, we believe it is more clinically feasible and cheaper than other available virtual SVV tools, and that it would require less training for clinicians shifting from the traditional angle finder / goniometer bucket test.

4.4 Feasibility

The testing battery of BVA-cDVA, E-vHIT and S-vHIT, and SA-SVV was well tolerated by participants and, from a fatigue point of view, reasonable to use for vestibular evaluation in a single clinic visit. However, the length of the visit or exacerbation of symptoms in patients with actual vestibular impairments may make using all tools unfeasible. In addition, the cost of all tools may be prohibitive. Only one individual was unable to complete both sessions due to their reported dizziness severity. The testers required 1 hour of training to confidently use the SA-SVV app, 3 hours to confidently use the BVA-cDVA and E-vHIT, and more than 20 hours to confidently use the S-vHIT. Clinicians well trained in the administration of vestibular testing may not require the same level of training that these students required.

5 Conclusions

The ICCs for the BVA-cDVA, E-vHIT and S-vHIT, and SA-SVV ranged from poor to excellent for both inter-rater and test-retest calculations in healthy young adults. The ICCs for vHIT tools demonstrated better test-retest reliability than inter-rater reliability, suggesting that a change in examiner may negatively affect the reliability of these vHIT tools. Some measures, such as Lines Lost for DVAT and gain asymmetry for vHITs, demonstrated very poor ICCs; these poor ICCs may have been driven by low inter-subject variability on these measures in our sample, which had no vestibular deficits. Future research should identify the reliability and SRD of these tools in children and older adults and should include populations with more variable performance, such as patients with vestibular deficits, to address the measures that showed low inter-subject variability here.

References

1. Beckerman, Roebroeck M.E., Lankhorst G.J., Becher J.G., Bezemer P.D., Verbeek A.L., Smallest real difference, a link between reproducibility and responsiveness, Qual Life Res 10(7) (2001), 571–578. doi: 10.1023/a:1013138911638

2. Berchtold, Test–retest: Agreement or reliability? Methodological Innovations 106(9), 1–7. doi: 10.1177/2059799

3. Chen H.M., Chen C.C., Hsueh I.P., Huang S.L., Hsieh C.L., Test-retest reproducibility and smallest real difference of 5 hand function tests in patients with stroke, Neurorehabil Neural Repair 23(5) (2009), 435–440. doi: 10.1177/1545968308331146

4. Christy J.B., Cochrane G.D., Almutairi, Busettini, Swanson M.W., Weise K.K., Peripheral Vestibular and Balance Function in Athletes With and Without Concussion, J Neurol Phys Ther 43(3) (2019), 153–159. doi: 10.1097/NPT.0000000000000280

5. Christy J.B., Payne, Azuero, Formby, Reliability and diagnostic accuracy of clinical tests of vestibular function for children, Pediatr Phys Ther 26(2) (2014), 180–189. doi: 10.1097/PEP.0000000000000039

6. Cohen H.S., Sangi-Haghpeykar, Subjective visual vertical in vestibular disorders measured with the bucket test, Acta Otolaryngol 132(8) (2012), 850–854. doi: 10.3109/00016489.2012.668710

7. Dai, Kurien, Lin V.Y., Mobile phone app Vs bucket test as a subjective visual vertical test: a validation study, J Otolaryngol Head Neck Surg 49(1) (2020), 6. doi: 10.1186/s40463-020-0402-3

8. Halmagyi G.M., Chen, MacDougall H.G., Weber K.P., McGarvie L.A., Curthoys I.S., The Video Head Impulse Test, Front Neurol 8 (2017), 258. doi: 10.3389/fneur.2017.00258

9. Hughson W.W.H., Manual for program outline for rehabilitation of aural casualties both military and civilian, Trans Am Acad Ophthalmol Otolaryngol 48(Supp) (1944), 1–15.

10. Kheradmand, Winnick, Perception of Upright: Multisensory Convergence and the Role of Temporo-Parietal Cortex, Front Neurol 8 (2017), 552. doi: 10.3389/fneur.2017.00552

11. Koo T.K., Li M.Y., A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research, J Chiropr Med 15(2) (2016), 155–163. doi: 10.1016/j.jcm.2016.02.012

12. Manago M.M., Schenkman, Berliner, Hebert J.R., Gaze stabilization and dynamic visual acuity in people with multiple sclerosis, J Vestib Res 26(5-6) (2016), 469–477. doi: 10.3233/VES-160593

13. Mehta, Bastero-Caballero R.F., Sun, et al., Performance of intraclass correlation coefficient (ICC) as a reliability index under various distributions in scale reliability studies, Stat Med 37(18) (2018), 2734–2752. doi: 10.1002/sim.7679

14. Michel, Laurent, Alain, Rehabilitation of dynamic visual acuity in patients with unilateral vestibular hypofunction: earlier is better, Eur Arch Otorhinolaryngol 277(1) (2020), 103–113. doi: 10.1007/s00405-019-05690-4

15. Michelson P.L., McCaslin D.L., Jacobson G.P., Petrak, English, Hatton, Assessment of Subjective Visual Vertical (SVV) Using the “Bucket Test” and the Virtual SVV System, Am J Audiol 27(3) (2018), 249–259. doi: 10.1044/2018_AJA-17-0019

16. Navari, Cerchiai, Casani A.P., Assessment of Vestibulo-ocular Reflex Gain and Catch-up Saccades During Vestibular Rehabilitation, Otol Neurotol 39(10) (2018), e1111–e17. doi: 10.1097/MAO.0000000000002032

17. Pagarkar, Barniou, Ridout, Luxon, Subjective Visual Vertical and Horizontal Effect of the Preset Angle, JAMA Otolaryngol Head Neck Surg 134(4) (2008), 394–401. doi: 10.1001/archotol.134.4.394

18. Portney L.G., Watkins M.P., Foundations of clinical research: applications to practice, Prentice Hall, New Jersey, 2000.

19. Riska K.M., Hall C.D., Reliability and Normative Data for the Dynamic Visual Acuity Test for Vestibular Screening, Otol Neurotol 37(5) (2016), 545–552. doi: 10.1097/MAO.0000000000001014

20. Sjogren, Fransson P.A., Karlberg, Magnusson, Tjernstrom, Functional Head Impulse Testing Might Be Useful for Assessing Vestibular Compensation After Unilateral Vestibular Loss, Front Neurol 9 (2018), 979. doi: 10.3389/fneur.2018.00979

21. Ward B.K., Mohammad M.T., Whitney S.L., Marchetti G.F., Furman J.M., The reliability, stability, and concurrent validity of a test of gaze stabilization, J Vestib Res 20(5) (2010), 363–372. doi: 10.3233/VES-2010-0371

22. Zwergal, Rettinger, Frenzel, Dieterich, Brandt, Strupp, A bucket of static vestibular function, Neurology 72(19) (2009), 1689–1692. doi: 10.1212/WNL.0b013e3181a55ecf