Abstract
BACKGROUND:
Wrist-Worn Activity Monitors (WWAMs) are low-cost, user-friendly devices which have become popular for monitoring physical activity. Their reliability and validity need investigation for accurate physical activity monitoring. We examined between-sessions and inter-device reliability of the WWAMs. In addition, we examined the criteria-related validity of the WWAMs against two gold standards, an Ankle-Worn Activity monitor (AWAM) and video.
METHODS:
Twenty volunteers participated in two sessions, one week apart. In each session, participants walked on a treadmill for five minutes at each of three speeds: 0.89 m/s (slow), 1.12 m/s (moderate) and 1.33 m/s (fast). Total step counts at each speed were obtained using one AWAM (stepWatch), three WWAMs (Fitbit Flex) and video. The Intraclass Correlation Coefficient (ICC) was calculated to determine the reliability and validity of the WWAMs.
RESULTS:
The WWAMs exhibited moderate to excellent between-sessions reliability (ICC = 0.69–0.90). The WWAMs demonstrated excellent inter-device reliability at each speed across both sessions (ICC = 0.91–0.98). The criteria-related validity of WWAMs compared to the AWAM, and video recording showed moderate to excellent agreement (ICC = 0.67–0.85) at each speed.
CONCLUSIONS:
WWAMs recorded steps consistently between-sessions and between-devices for treadmill walking among healthy adults at each speed but exhibited limited agreement for recording steps at each speed compared to AWAM and video.
Introduction
Physical inactivity is a pandemic challenge of the 21st century, costing health-care systems $53.8 billion a year worldwide [1, 2]. The United Nations General Assembly has recognized physical inactivity as one of the leading risk factors for the development of non-communicable diseases, which kill 38 million people each year [3, 4]. This has prompted health agencies like the World Health Organization (WHO) to develop a global strategy on physical activity, encouraging many to take up physical activity awareness initiatives [5]. Life expectancy increases in proportion to physical activity level, making a strong case for motivational efforts to promote an active lifestyle [6]. The U.S. Department of Health and Human Services recommends at least 150 minutes a week of moderate-intensity aerobic physical activity for substantial health benefits; this equates to walking 3000 steps in 30 minutes on five days each week [7, 8]. Research has shown that individuals either underestimate or overestimate their physical activity level, and that factors like social desirability and social approval can influence physical activity estimation [9, 10]. This underscores the importance of physical activity monitoring devices that are valid, accurate and reliable for evaluating current and changing physical activity levels. Such physical activity monitors can play a crucial role in mediating the relationship between physical activity level and health outcomes.
Various physical activity monitors have been used in recent studies to measure physical activity. One of the most commonly used devices, especially in a research environment, is the step activity monitor (SAM), which quantifies step count and step rate. It is a type of Ankle-Worn Activity Monitor (AWAM): a small (75×50×20 mm), lightweight (38 g), accelerometer-based step counter that is worn around the ankle and costs approximately $2000. High reliability and validity of SAM for counting steps have been demonstrated among various age groups and across many disease conditions [11–13]. It exhibited an accuracy of about 99.87% for counting steps among children and adults [11]. Studies showed SAM to have high accuracy or criteria-related validity for counting steps among individuals with diabetes (98.8%) [12], older adults with different walking abilities (ICC >0.87) [13], individuals with an amputation [12], individuals with a stroke (r = 0.96) [14] or traumatic brain injury (ICC = 0.97) [15]. SAM also exhibited excellent test-retest reliability with an error of 0.54%–0.65% [12]. Studies have found SAM to be accurate in both indoor or laboratory (96.06%) and outdoor (99.58%) environments [16]. Given its high reliability and validity, SAM is widely used for research purposes [17–20]. Even though SAM and other research-grade activity monitors are proven to be reliable and valid for tracking physical activity, they are expensive and require more time to set up and to retrieve data from. The higher cost of these research-grade physical activity monitors makes their use challenging in studies with large and heterogeneous samples. The data retrieval process, which requires specialized software and hardware, makes them less user-friendly. Also, SAM, a pager-sized instrument, needs to be worn at the ankle, which may not be aesthetically appealing to some wearers.
More recently, wearable consumer-grade physical activity monitors have emerged as an alternative for tracking physical activity. The popularity of these physical activity monitors is surging due to their low cost, aesthetic appeal, and user-friendly features, with the wrist-worn activity monitors (WWAMs) topping the list. Many WWAMs, such as Jawbone UP™ (JAWBONE, San Francisco, CA, USA), Garmin Vivofit (Garmin, Olathe, KS, USA) and Fitbit Flex (Fitbit Inc, CA), are available in the market for consumers. People use WWAMs to track their daily physical activity, monitor their physical activity progress over time and compare their activity level with others using social interactivity features. It is crucial that the WWAMs are reliable and valid for a meaningful assessment of physical activity. However, there is limited objective evidence available about the reliability and validity of WWAMs [13, 21]. If proven to be reliable and valid, the WWAMs can be used in research and clinical settings for monitoring physical activity. Use of WWAMs in these settings will make physical activity monitoring more economical, will save time because data retrieval is easy, and will enable measuring the physical activity of a large sample. In this study, we examined the inter-device reliability and between-session reliability of WWAMs in a laboratory setting. In addition, we tested the criteria-related validity of the WWAMs against two gold standards, an Ankle-Worn Activity Monitor (AWAM) and video. Previous studies have demonstrated excellent reliability and validity of AWAM, supporting its consideration as a gold standard [17–20]. To our knowledge, this is the first study to investigate the inter-device reliability of WWAMs.
Methods
All study protocols were approved by the Institutional Review Board of New York University.
Study design
This study used a repeated measures design to examine the inter-device reliability and between-sessions reliability of WWAMs, and the criteria-related validity of the WWAMs with reference to two gold standards, namely AWAM and video recording.
Participants
Twenty healthy adults (age 18–65 years) volunteered to participate in the study. Individuals were recruited within the university and adjacent community via posted statements, e-mails, and word of mouth. Individuals reporting any current pain, or a history of injury or pain in the lower extremity within the last six months, were excluded from the study.
Instrumentation
Three consumer-grade Wrist-Worn Activity Monitors (Fitbit Flex, Fitbit Inc, CA) were worn on the dominant wrist in a random order. An Ankle-Worn Activity Monitor (stepWatch, modus health LLC, Washington, D.C.) was strapped around the dominant ankle. Video recording of gait was done using a laptop (VLC media player, Version 2.2.6, MacBook Air) as participants walked on the treadmill (TR 4000i; Lifespan, Utah, USA).
Procedures
The study protocol was explained to the participants. All participants provided informed consent. Age, height, weight, and gender information were obtained from all participants. Each participant walked on the treadmill at three speeds: 0.89 m/s (slow), 1.12 m/s (moderate) and 1.33 m/s (fast). The speeds correspond to 2 miles per hour (mph), 2.5 mph and 3 mph respectively. These speeds were chosen because they are representative of walking speeds noted in clinical populations such as community-dwelling older adults [22, 23], individuals with diabetes [24, 25], individuals with musculoskeletal pain [26, 27], and individuals with neurologic impairments [28, 29]. The participants walked for five minutes at each speed wearing three WWAMs and an AWAM. The treadmill speed was increased from slow to fast, with a five- to ten-minute rest period between speed changes. The participants’ gait was recorded using VLC media player as they walked on the treadmill. The data collection setup is displayed in Fig. 1.
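The correspondence between the treadmill speeds in mph and m/s can be checked with a short conversion. The sketch below is illustrative only; the function name `mph_to_ms` is an assumption and not part of the study protocol. Note that the exact conversion of 3 mph rounds to 1.34 m/s, marginally above the 1.33 m/s reported for the fast speed.

```python
# Illustrative unit check (hypothetical helper, not from the study).
METRES_PER_MILE = 1609.344
SECONDS_PER_HOUR = 3600.0


def mph_to_ms(mph: float) -> float:
    """Convert a speed in miles per hour to metres per second."""
    return mph * METRES_PER_MILE / SECONDS_PER_HOUR


if __name__ == "__main__":
    for label, mph in [("slow", 2.0), ("moderate", 2.5), ("fast", 3.0)]:
        print(f"{label}: {mph} mph = {mph_to_ms(mph):.2f} m/s")
```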

Experimental setup showing (A) WWAM (Fitbit Flex); (B) AWAM (stepWatch); (C) Treadmill.
Total step count (i.e., the number of steps walked during five minutes at each speed) was the outcome variable. A single researcher performed all data collection procedures. The data from the respective consumer-grade WWAMs as well as the AWAM were retrieved using the appropriate software as per the manufacturer’s guidelines. A single researcher (EA) obtained step count data from the video recording through direct observation of the video.
Statistical analysis
We calculated intraclass correlation coefficients (ICC) to evaluate the inter-device reliability (ICC (2, k)) and between-sessions reliability (ICC (2, 1)) of the three WWAMs. The criteria-related validity of WWAMs was evaluated using ICC (2, k), with reference to two gold standards, the AWAM and video recording respectively. Bland-Altman plots were constructed to assess limits of agreement between WWAMs at each speed. ICC values >0.75 were interpreted as good, 0.40–0.75 as moderate, and <0.40 as poor reliability [30].
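For readers unfamiliar with the two ICC forms named above, the following is a minimal sketch of how ICC (2, 1) and ICC (2, k) can be computed from a participants-by-devices table using the standard two-way random-effects mean squares (Shrout and Fleiss convention). It is not the software used in this study; the function name `icc_two_way_random` and the step counts are hypothetical, chosen only for illustration.

```python
from statistics import mean


def icc_two_way_random(scores):
    """Return (ICC(2,1), ICC(2,k)) for an n-subjects x k-measurements table.

    Each row is one participant; each column is one device (or session).
    """
    n, k = len(scores), len(scores[0])
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)              # between-subjects mean square
    ms_cols = ss_cols / (k - 1)              # between-devices mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    icc_2_1 = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    icc_2_k = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return icc_2_1, icc_2_k


if __name__ == "__main__":
    # Hypothetical 5-minute step counts: 5 participants x 3 wrist devices.
    steps = [
        [480, 476, 483],
        [512, 505, 509],
        [455, 460, 451],
        [530, 526, 533],
        [498, 490, 495],
    ]
    single, average = icc_two_way_random(steps)
    print(f"ICC(2,1) = {single:.2f}, ICC(2,k) = {average:.2f}")
```

ICC (2, 1) describes the reliability of a single measurement, whereas ICC (2, k) describes the reliability of the mean of k measurements, so the latter is never lower than the former when the single-measure value is positive.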
Results
All 20 participants (17 females, age: 26±3 years; BMI: 24.0±4 kg/m²) completed the study protocol. Mean step counts from three WWAMs, an AWAM, and video recording are displayed in Fig. 2. It shows that the mean step counts from the three WWAMs are consistently lower than the mean step counts from the AWAM and video recording across the three speeds.
Inter-device reliability of the three WWAMs is summarized in Table 1 using ICCs with 95% confidence intervals. The WWAMs demonstrated good inter-device reliability at each speed across both sessions (ICC = 0.91–0.98). Between-sessions reliability is displayed in Table 2 using ICCs with 95% confidence intervals at each speed for the three WWAMs. Mean step counts from session 1 and session 2 for one WWAM are displayed in Fig. 3. The WWAMs exhibited moderate to good between-sessions reliability (ICC = 0.69–0.90).

Mean step count from session 1 and session 2. Columns are grouped by speed and depict three WWAMs, one AWAM, and video. Error bars indicate standard deviation.
Inter-device reliability (within-session) at each speed for the three wrist-worn physical activity monitors, summarized using Intraclass Correlation Coefficient (2, k)
ICC = Intraclass Correlation Coefficient; 95% CI = 95% Confidence Interval.
Between-sessions reliability at each speed for the three wrist-worn physical activity monitors, summarized using Intraclass Correlation Coefficient (2, 1)
ICC = Intraclass Correlation Coefficient; 95% CI = 95% Confidence Interval.

Mean representative step count from one WWAM, supporting moderate to good between-session reliability of WWAMs.
Criteria-related validity is displayed in Table 3 using ICCs with 95% confidence intervals at each speed for the three WWAMs with each gold standard, the AWAM and video analysis respectively. The criteria-related validity of WWAMs compared to the AWAM and video recording showed moderate to good agreement (ICC = 0.67–0.85) at each speed. Representative Bland-Altman plots assessing the limits of agreement between WWAM 1 and 2, and between WWAM 1 and 3, at fast speed are displayed in Fig. 4. The Bland-Altman plots indicated the absence of systematic bias between WWAMs.

Representative Bland-Altman plot between two WWAMs supporting good inter-device reliability and absence of systematic bias.
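The quantities underlying a Bland-Altman plot, namely the mean difference (bias) and the 95% limits of agreement, can be sketched as follows. This is an illustrative example under the usual assumption that limits are set at the bias ± 1.96 standard deviations of the paired differences; the function name `bland_altman_limits` and the step counts are hypothetical and do not reproduce the study data.

```python
from statistics import mean, stdev


def bland_altman_limits(device_a, device_b):
    """Return (bias, lower limit, upper limit) for paired measurements.

    bias is the mean of the paired differences (a - b); the limits of
    agreement are bias +/- 1.96 * SD of those differences.
    """
    diffs = [a - b for a, b in zip(device_a, device_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


if __name__ == "__main__":
    # Hypothetical step counts from two wrist devices on the same walks.
    wwam_1 = [480, 512, 455, 530, 498]
    wwam_2 = [476, 505, 460, 526, 490]
    bias, lower, upper = bland_altman_limits(wwam_1, wwam_2)
    print(f"bias = {bias:.1f} steps, limits = [{lower:.1f}, {upper:.1f}]")
```

A bias close to zero with most differences falling inside the limits is what is read as the absence of systematic bias between devices.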
Criteria-related validity of the WWAMs at each speed compared to AWAM and video, summarized using Intraclass Correlation Coefficient (ICC (2, k)) (session 2)
ICC = Intraclass Correlation Coefficient; CI = Confidence Interval.
Discussion
We examined the inter-device reliability and between-session reliability of wrist-worn activity monitors for treadmill walking across three different speeds. This study also tested wrist-worn activity monitors for validity against two criteria measures: a pre-validated ankle-worn physical activity monitor and visually identified step counts from the video recording. To our knowledge, this is the first study to investigate the inter-device reliability of WWAMs, especially across three speeds of treadmill walking.
Wrist-worn activity monitors have become popular due to their low cost and user-friendly features. Individuals use WWAMs to monitor physical activity and track improvement in physical activity level over time. The WWAMs need to be reliable and valid for such monitoring and comparison to be meaningful.
The findings of this study indicate that WWAMs displayed good inter-device reliability within a session (ICC (2, k) >0.75) at each speed, with excellent agreement among the three wrist-worn devices for measuring the total number of steps. Our findings are in agreement with Diaz et al. (2015), who reported similar results for the inter-device reliability of WWAMs worn on the right and left wrist, with a correlation coefficient of 0.90 for treadmill walking among healthy adults [21]. Another study by de Man et al. (2016) found similar results, with good inter-device reliability (ICC = 0.81) for a different version of WWAM (Fitbit Charge HR) for self-paced walking [31]. These findings may indicate that WWAMs with the same proprietary algorithm, placed at the same body location, show high inter-device reliability for counting steps under similar testing conditions. Our Bland-Altman plots support these findings and also show that WWAMs demonstrate a high level of agreement and absence of systematic bias for measuring the total number of steps [30]. These data may suggest that one WWAM can be substituted with another WWAM if required without affecting the outcome.
Our study also found that the three WWAMs displayed moderate to good between-sessions reliability, with ICC (2, 1) ranging from 0.69 to 0.90 at each speed. Similar results were found by Kooiman et al. (2015), who reported good test-retest reliability of various WWAMs (Fitbit Flex, Jawbone UP) for treadmill walking at 4.8 km/h (1.33 m/s), reflected in ICCs between 0.81 and 0.83 [32]. Our results are in agreement with recent literature and support the between-sessions reliability of WWAMs.
Regarding criteria-related validity, our findings showed that the WWAMs consistently underestimated the total number of steps compared to the criteria measures of the AWAM and steps counted from the video recording. WWAMs underestimated the total number of steps by approximately 7%, 9% and 12% at slow, moderate and fast speed respectively compared to the criteria measures. Our findings are consistent with the literature comparing steps measured by WWAMs with various criteria measures. Diaz et al. (2015) found WWAMs to underestimate steps by 11–16% during treadmill walking at speeds of 0.85 m/s and 1.34 m/s [21]. Floegel et al. (2016) found a higher percentage error: WWAMs underestimated the step count by 27% among unimpaired older adults walking 100 m overground at a self-selected speed [13]. The higher percentage error found by Floegel et al. (2016) may be due to the reduced gait speed and reduced arm swing noted in older adults, which in turn may affect the total number of steps measured by WWAMs [33].
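The percentage underestimation referred to above is the shortfall of the device count relative to the criterion count. A minimal sketch follows; the function name `percent_underestimation` and the counts are hypothetical and serve only to illustrate the arithmetic.

```python
def percent_underestimation(device_steps: float, criterion_steps: float) -> float:
    """Shortfall of the device count as a percentage of the criterion count.

    Positive values mean the device counted fewer steps than the criterion.
    """
    return 100.0 * (criterion_steps - device_steps) / criterion_steps


if __name__ == "__main__":
    # Hypothetical example: criterion (video) counts 500 steps, device counts 465.
    print(f"{percent_underestimation(465, 500):.1f}% underestimation")
```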
Our study showed WWAMs to have moderate to good criteria-related validity at each speed compared to the AWAM and video respectively, with ICC (2, k) ranging from 0.67 to 0.85. Floegel et al. (2016) found WWAMs to have poor criteria-related validity, with an ICC of 0.15. The low ICC may be due to their study sample, which comprised an older adult population who may have had lower gait velocity and arm swing [13, 33]. Kooiman et al. (2015) found WWAMs to have poor criteria-related validity, with an ICC of 0.22 for treadmill walking at a speed of 1.33 m/s among healthy adults [32]. The low ICCs for criteria-related validity reported by Floegel et al. (2016) and Kooiman et al. (2015) may be explained by the model used in these studies to calculate ICC. Model (2, k) used in this study allows us to generalize our findings to multiple devices. In contrast, model (2, 1) limits the generalizability of study findings to the devices tested.
As the WWAMs are used to monitor physical activity level over time, reliability has important implications for interpreting the magnitude of change in physical activity. The MDC 90 (Minimal Detectable Change, 90% confidence interval) can be used to infer the minimal change in step count that would indicate a change in physical activity with 90% confidence. MDC 90 was calculated for a 5-minute period at each speed. Knowledge of MDC 90 for such a short duration is significant because, in clinical settings, therapists have a limited amount of time to assess change in the functional ability of an individual. Studies have estimated the Minimal Detectable Change (MDC) for short-duration functional assessment tools such as the 6-minute walk test, 2-minute walk test and 4-meter walk test for ease of evaluation in clinical settings [23, 35]. Also, studies have shown that during daily activities, people on average walk in multiple shorter bouts rather than long bouts [36]. We calculated MDC 90 using the equation $\mathrm{MDC}_{90} = 1.65 \times \mathrm{SD}_{\text{test 1}} \times \sqrt{1 - \mathrm{ICC}} \times \sqrt{2}$, where $\mathrm{SD}_{\text{test 1}}$ is the standard deviation from the first test, together with information from Fig. 1 and Table 2 [30, 37]. Our findings suggest that a change, relative to the preceding record, of 41–56 steps at slow speed, 47–56 steps at moderate speed, and 56–93 steps at fast speed indicates a noticeable change in the physical activity of an individual.
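The MDC 90 equation can be read as the standard error of measurement, SD × √(1 − ICC), scaled by 1.65 (the z-value for 90% confidence) and by √2 (because a change involves two measurements). The sketch below applies it to made-up inputs; the function names and the SD value are assumptions for illustration and are not taken from the study's tables.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC)."""
    return sd * math.sqrt(1.0 - icc)


def mdc90(sd: float, icc: float) -> float:
    """Minimal detectable change at 90% confidence: 1.65 * SEM * sqrt(2)."""
    return 1.65 * standard_error_of_measurement(sd, icc) * math.sqrt(2.0)


if __name__ == "__main__":
    # Hypothetical inputs: first-session SD of 40 steps, ICC(2,1) of 0.69.
    print(f"MDC90 = {mdc90(40.0, 0.69):.1f} steps")
```

Higher reliability shrinks the MDC 90: at an ICC of 1 the detectable change is zero, whereas at lower ICCs a larger change in step count is needed before it can be distinguished from measurement error.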
In summary, various consumer-grade activity monitors are available to users in the market. Multiple factors, such as the cost of the device, its user-friendly features, aesthetic appeal and purpose of usage, influence the user’s choice of a device. The intended use of an activity monitor determines how stringent the acceptable reliability and validity must be. Our study makes an important contribution by examining the reliability and validity of the WWAMs, which will help prospective users make a choice.
Limitations
The study investigated the reliability and validity of a specific type of WWAM at three walking speeds in a laboratory setting over a short duration with a small sample size. However, the study allowed comparison of WWAMs with two gold standards, an AWAM and video recording, and provided information which is clinically relevant.
Conclusion
While WWAMs demonstrate good inter-device and between-sessions reliability across speeds, they should be used with caution in clinical and research settings due to concerns about their validity. As WWAMs are low-cost, user-friendly devices available to buyers at large, the results of this study can help healthy adults make informed use of these activity monitors to monitor, compare and improve physical activity levels.
Conflict of interest
None to report.
