Abstract
This study provides a new test of time-use diary methodology, comparing diaries with a pair of objective criterion measures: wearable cameras and accelerometers. A volunteer sample of respondents (n = 148) completed conventional self-report paper time-use diaries using the standard UK Harmonised European Time Use Study (HETUS) instrument. On the diary day, respondents wore a camera that continuously recorded images of their activities during waking hours (approximately 1,500–2,000 images/day) and also an accelerometer that tracked their physical activity continuously throughout the 24-hour period covered by the diary. Of the initial 148 participants recruited, 131 returned usable diary and camera records, of whom 124 also provided a usable whole-day accelerometer record. The comparison of the diary data with the camera and accelerometer records strongly supports the use of diary methodology at both the aggregate (sample) and individual levels. It provides evidence that time-use data could be used to complement physical activity questionnaires for providing population-level estimates of physical activity. It also implies new opportunities for investigating techniques for calibrating metabolic equivalent of task (MET) attributions to daily activities using large-scale, population-representative time-use diary studies.
1. Introduction
1.1. Background
Time-use diary methods are used for a range of research purposes in the social sciences. Economists use diary data to estimate extended national product measures, including the value of unpaid work (Goldschmidt-Clermont and Pagnossin-Aligisakis 1999). Sociologists use them to investigate parenting practices (Craig and Mullan 2011), sociability (Voorpostel, van der Lippe, and Gershuny 2009), and the division of domestic labor (Sullivan 2000). Some public and population health researchers use diaries as a data collection method (e.g., Brunner, Juneja, and Marmot 2001; Millward and Spinney 2011; Spinney, Millward, and Scott 2011; van der Ploeg et al. 2010), but they are not routinely used to estimate the extent and distribution of time devoted to physical activity (PA) across large populations. Rather, the convention has been to use various forms of physical activity questionnaires (PAQ) that include a battery of items asking respondents to recall the number of times they participated in specific activities over a specified period (the past week/month). One of the most routinely used PAQs is the International Physical Activity Questionnaire (IPAQ), or its Short Form (IPAQ-SF).
1.2. Objectives
This article reports results of the CAPTURE-24 1 project, which tests the quality of self-report time-use diary data. In this study, we deploy whole-day worn-camera and accelerometer evidence as criterion measures. This allows us to restrict questions of reliability and validity largely to issues relating to the coding of objectively timed events, avoiding the concerns related to respondents’ behaviors outlined in Fowler (2011) and Schaeffer and Dykema (2011). We use this evidence as a straightforward means of checking the duration of the activities recorded by respondents in their self-report time-use diaries.
Why not use the criterion measures themselves instead of the diary measures? For some purposes (e.g., dietary analysis), wearable cameras are appropriate (O’Loughlin et al. 2013), while for other topics (e.g., sleep), accelerometers are more suitable (Van Hees et al. 2015). However, the camera records involve substantial extra costs (a similarly funded diary study alone might have achieved 10 or more times the sample size discussed in this article). Some activity categories (e.g., sleep, PA; Pedišić, Dumuid, and Olds 2017; Willetts et al. 2018) can be inferred from accelerometer data, but we are at present unable to identify other activities of daily living from accelerometer evidence alone.
The underlying question is whether time-use diaries are an appropriate means of collecting data on durations of various types of activities (Bolger, Davis, and Rafaeli 2003). We start by deploying a large-scale survey (the 2014–15 UK National Time Use Study; UK TUS) to compare estimates of participation rates in purposive physical activity (PA) from time-use diaries with estimates derived from retrospective exercise participation questions from the same survey. Responses to retrospective participation questions are known to be biased due to respondents’ perceptions of social desirability (Bauman et al. 2009; Bernstein, Chadha, and Montjoy 2001; Shephard 2003; Troiano et al. 2012) and their attempts to enact normatively sanctioned identities (Brenner and DeLamater 2014). Lee and colleagues (2011) carried out a systematic review of the validity of a widely used standard battery of such questions (IPAQ-SF) and found that it strongly overestimated PA as measured by an objective criterion. Similarly, the 2014–15 UK TUS estimates show the retrospective questions produce participation rates approximately double those recorded in the time-use diaries. Do time-use diaries produce accurate estimates of activities of daily living? Does this vary between different types of activities?
The current project emerged in response to a study by Kelly and colleagues (2014) that compared travel behavior recorded by participants (n = 69) wearing an automated SenseCam wearable camera with their registrations in a variation of the UK National Travel Survey trip log for the same day. The CAPTURE-24 study expands this focus on travel behavior to include all activities of daily living—the entire range of paid and unpaid work, leisure time, sleep, and personal care activities. It is the first full-scale attempt to test the accuracy of continuous and comprehensive time-use diary records against objectively registered measures of daily activities recorded in real time.
2. Literature Review
There is a long history of methodological research into time-use diary validity (i.e., the degree of agreement among different measures; see Alwin 2011; Campbell and Fiske 1959). Most of this work examines the convergence of diaries with questionnaire-type time-use estimation methods; some of it compares diaries with objective criterion variables.
The seminal work in the former category was led by the Michigan Institute for Social Research and is associated with the 1975 U.S. National Time Use Study (Juster and Stafford 1985). Robinson and Godbey (1997), after reviewing examples of this type of methodological research (e.g., Hill 1985; Juster 1985; Robinson 1985), concluded that additional controlled studies needed to be undertaken to extend and refine the estimates. Subsequent, methodologically sophisticated approaches to non–criterion based tests (e.g., Kan and Pudney 2008) reiterated the view that diary approaches can be regarded as a gold standard, and the current high-level teaching text in this area (Belli, Stafford, and Alwin 2009) concurs. In their review, Brenner and DeLamater (2016) reported no definitive progress in establishing validity or reliability on other than a priori grounds. Without an adequate criterion variable, deductive arguments are mere speculation.
The CAPTURE-24 study follows the criterion variable route. The earliest direct test using a real-time activity record as an objective criterion deployed a video camera on top of a television set in 20 U.S. households (Bechtel, Achepohl, and Akers 1972). The video record provided evidence of participants’ time in front of the television when it was switched on for comparison with the associated diary record of television viewing. Anderson and colleagues (1985) compared parents’ reports of their children’s television viewing with a time-lapse camera record of the children’s behavior in front of the set. A second method of criterion comparison research used continuously worn motion sensors as the criterion against which to check the number of episodes of PA recorded in time-use diaries. Hofferth and colleagues (2008) used this method to validate time-use diary records of children’s PA, as did van der Ploeg and colleagues (2010) with a general population-representative sample.
The present study uses both wearable cameras and accelerometers. It provides a substantial advance on the existing literature, yielding comprehensive comparisons of diary data with the criterion measures, covering all the activities of daily living (rather than just television viewing or PA, as in previous criterion-based studies). The diary/camera pairings directly compare frequencies and durations in each daily activity, coded separately in the two records. The accelerometer data provide slightly less direct but still comprehensive comparisons of total daily PA estimated from the continuous accelerometer record, with estimates of the total daily PA in the diary and camera records gathered by attaching appropriate metabolic equivalent of task scores (METs) to each diary/camera event (Deyaert et al. 2017; Tudor-Locke et al. 2009). Our main focus here is examining durations of activities through the day.
2.1. Estimating PA from Time-Use Data
Figure 1 (an updated version of Gershuny 2012:258), which follows discussion in Juster and Stafford (1985), uses the 2014–15 UK TUS (Gershuny and Sullivan 2017) to illustrate the relationship between reported rates of purposive physical activity/exercise participation in the questionnaire and participation rates recorded in respondents’ randomly selected diary days (weighted to give an equal representation of days of the week).

Figure 1. Actual versus predicted daily participation (UK 2015 data, authors’ calculations).
Assuming that past participation rates indicate future participation probabilities, we suggest that any respondent who reported 14 or more instances of participation in the past month (i.e., more than three per week) would have a greater than .5 probability of participation on a randomly chosen day (reweighted, as in the previous paragraph). This type of reasoning gives us the “predicted participation” line. Diary evidence on participation in walking, cycling, running, and swimming shows participation rates between .13 and .22 for this group.
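The reasoning behind the "predicted participation" line can be sketched in a few lines of Python. This is only an illustration: the function name, the 28-day reference month, and the cap of one counted instance per day are assumptions made here, not the rule used to draw Figure 1.

```python
def predicted_daily_probability(times_in_past_month, days_in_month=28):
    """Convert a retrospective monthly count into a per-day probability.

    Assumes (hypothetically) a 28-day reference month and at most one
    counted instance of participation per day.
    """
    if times_in_past_month < 0:
        raise ValueError("count must be non-negative")
    return min(1.0, times_in_past_month / days_in_month)


# Monthly counts as they might be reported in a questionnaire item.
for count in (0, 4, 14, 28):
    print(count, predicted_daily_probability(count))
```

Under these assumptions, a respondent reporting 14 instances in the past month is expected to participate on about one randomly chosen day in two, which is the benchmark the diary participation rates are compared against.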
About 5 percent of respondents who report no walking and 2 percent of those who report no purposive PA in the previous month show some participation on the randomly chosen diary day. With these two exceptions, all diary participation rates are substantially below what would be expected based on the questionnaire responses. The average slope of the swimming, gym workouts, cycling, sport, walking, and running lines is about halfway between the x-axis and the prediction line, which corresponds well with Brenner and DeLamater’s (2014) “double the actual” estimation and supports findings from Lee and colleagues (2011).
Another shortcoming is the restricted coverage of most PAQs. All daily activities involve some level of PA, but many PAQs only cover a limited subset of prespecified activities. Some respondents’ main source of PA may be outside the range covered by the PAQ. For example, incidental daily moderate-to-vigorous activities (e.g., caring for babies and toddlers, home renovation, gardening) are not captured adequately by most PAQ items. Respondents’ detailed “own words” diary descriptions provide continuous coverage across all daily activities, resulting in a better-balanced estimation of the extent of different types of PA, although not their intensity.
These two issues with the PAQ approach together with the centrality of PA measurement to understanding obesity, diabetes, cardiovascular disease, and cancer (see Lee et al. 2012) provide—in addition to the many social science applications mentioned earlier—a strong public health–based motivation for the evaluation of time-use diary materials enabled by the CAPTURE-24 project.
3. Study Design and Methods
3.1. Ethical Considerations
The investigators developed a comprehensive ethical framework for conducting research using wearable cameras based on Kelly and colleagues (2013); the framework was approved by the appropriate University of Oxford ethics committee (IDREC). 2 Participants signed an informed consent form after a member of the research team had fully explained the study requirements. Investigators recommended participants check in advance that friends, family, and co-workers understood the nature of the study and were happy for them to take part. Participants were also advised of places where wearing the camera may not be appropriate (e.g., changing rooms, banks, schools). All cameras were encrypted and did not record sound or conversations. Participants were not permitted to keep any copies of their images.
3.2. Sample and Setting
The volunteer sample was drawn from the UK county of Oxfordshire. The research team invited participants via professional networks, free online advertisements, posters, social and sport clubs, word of mouth from other participants, and emails to an authorized list of willing research volunteers provided by a market research agency. Every effort was made to recruit a sample varying broadly across sex, age (18 years and older), and educational level (see Table 1). The original sample of 148 participants returned 124 complete diary, camera, and accelerometer records and 131 diary/camera pairs.
Table 1. Age, Sex, and Educational Composition of Diary-Camera Sample
3.3. Design
The study design and associated protocols were refined based on pilot study findings (n = 14; Kelly et al. 2015). Participants met with a member of the research team before and after the data collection day. In the initial meeting, researchers explained the project purpose, gained written informed consent, had participants complete a short demographic questionnaire (including self-reported height and weight to calculate body mass index [BMI]), and provided the three instruments (diary, camera, and accelerometer) and instructions on how to use them.
On the data collection day, participants completed a self-report time-use diary and wore the two passive data collection devices (camera and accelerometer). Shortly after the data collection day, participants met with a researcher for a post–data collection “reconstruction interview” and to report their experience of wearing the devices and completing the time-use diary. Participants received a £20 High Street shopping voucher after completing the interview.
3.4. Instruments, Devices, and Interview
3.4.1. Time-use diary
The study used the diary designed for the 2014–15 UK TUS, the UK version of the Harmonised European Time Use Study (HETUS; Eurostat 2009). The diary starts at 4:00 a.m. and covers 24 hours, in 10-minute intervals (timeslots), with 3 hours on each page (see Figure 2). Participants completed the diary in their own words across six fields or “domains”: primary activity, secondary activity, co-presence, location or travel mode, technology use, and enjoyment. Respondents were encouraged to record throughout the diary day, although, as in the majority of time-use surveys, most recording was done at the end of the day or early the following day. A one-day diary typically takes about 20 minutes to complete.

Figure 2. Example page of the UK version of the Harmonised European Time Use Study self-report time-use diary.
3.4.2. Autographer wearable camera
The Autographer wearable camera was developed by the Oxford Metrics Group and has been evaluated in several papers (e.g., Doherty et al. 2013). Participants wore the Autographer (on a lanyard or clipped to their clothing) for as long as possible during their waking hours—generally after showering in the morning until preparing for bed in the evening. The camera captures images automatically at 20- to 30-second intervals (medium capture rate) from the wearer’s point of view, but no sound is recorded. A privacy lens allows participants to halt image recording temporarily.
On a typical day, the camera captures 1,500 to 2,000 images. The average 16-hour battery life is sufficient to cover waking hours for most participants. The Autographer is not waterproof, so participants were asked not to wear the camera if they were engaged in contact or water-based sports. The camera functions best in good lighting conditions (i.e., daytime and indoors with sufficient illumination). Occasionally, participants’ clothing or hair can obscure the lens, or data may be lost when they turn the camera off (e.g., for privacy or unintentionally).
3.4.3. Axivity AX3 band accelerometer
The AX3, first released in 2012, is a continuous logging accelerometer designed for a range of applications, including PA monitoring and classification, motion analysis, and medical research (Doherty et al. 2017). The AX3 is compliant with the OpenMovement data format, has sufficient memory for 14 days of continuous logging at 100 Hz (512 MB), and is waterproof to 1.5 meters. It has an in-built, accurate clock and calendar that time-stamps the recorded acceleration data. The AX3 has configurable sample rates, adjustable sensitivity, and a low power mode.
Participants wore the accelerometer for at least 24 hours on their dominant hand (wrist), although many wore it for a day before and after the diary day, which provided an additional two days of sleep data. Because the AX3 has a long battery life and is robust and waterproof, participants were able to wear it while working, traveling, taking a bath or shower, sleeping, and playing most sports.
3.4.4. Reconstruction interview
Shortly after the data collection period (maximum four days), participants viewed the camera images in a face-to-face “reconstruction” interview, which took about 60 minutes. The main purpose of the interview was to clarify any unclear images and verify when one activity finished and the next began, which assists with the coding process. This method is similar to a recall diary, but it achieves higher validity due to the image prompts (see Cowburn et al. 2016). Before the interview, the investigator downloaded the images into a bespoke browser (Doherty, Moulin, and Smeaton 2011) and invited the participant to view and delete (in private) any unwanted images. Using the images as prompts, participants described their day while the interviewer kept detailed notes.
4. Data Coding
Because this study focuses on testing reliability, it was essential to code the diary and image data independently. Limited resources allowed only a single researcher to code the activities, so to avoid contamination, the diary and image coding exercises were carried out separately, approximately four months apart (first diaries, then images). The large number of respondents combined with the anonymity of the records meant the coder had no means of connecting particular diaries with the corresponding image files.
4.1. Time-Use Diary Coding
The HETUS diary instrument uses 10-minute intervals (timeslots). A time-use episode is a sequence of intervals during which there is no change in any of the six substantive domains. 3 The 10-minute intervals make it difficult for diarists to record brief (e.g., visiting the bathroom, checking text messages) or momentary (e.g., taking medication, using an ATM) activities occupying less than five minutes. Such short activities sometimes go unrecorded altogether and sometimes appear only in the secondary activity field. For each study participant, the final coded diary data file comprises a sequence of episodes of varying lengths, starting at 4 a.m., with a total duration of 1,440 minutes (Eurostat 2009).
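The episode definition above can be illustrated with a minimal Python sketch. It assumes each timeslot has already been coded as a tuple of domain values; the function name and the plain-word labels are hypothetical, and only two domains are shown for brevity.

```python
from itertools import groupby

SLOT_MINUTES = 10
SLOTS_PER_DAY = 144  # 24 hours of 10-minute slots, starting at 04:00


def slots_to_episodes(slots):
    """Collapse consecutive identical slot records into episodes.

    Returns (start_minute, duration_minutes, record) tuples, where
    start_minute is counted from the 04:00 start of the diary day.
    """
    if len(slots) != SLOTS_PER_DAY:
        raise ValueError("a diary day must contain exactly 144 slots")
    episodes = []
    start = 0
    for record, run in groupby(slots):
        duration = sum(1 for _ in run) * SLOT_MINUTES
        episodes.append((start, duration, record))
        start += duration
    return episodes


# Hypothetical day; a change in any domain value starts a new episode.
day = (
    [("sleep", "home")] * 18
    + [("eating", "home")] * 3
    + [("paid work", "workplace")] * 48
    + [("watching television", "home")] * 24
    + [("sleep", "home")] * 51
)
episodes = slots_to_episodes(day)
print(len(episodes), sum(duration for _, duration, _ in episodes))
```

Because every slot belongs to exactly one episode, the episode durations always sum to the 1,440 minutes of the diary day.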
The HETUS activity coding system is hierarchical to a three-digit level. 4 Primary and secondary activities are coded using the UK version of the standard HETUS activity classification (just under 300 different activities). Coders categorize the main and secondary activities as well as the location/mode of transport and other domains, and they determine the start and end times of these episodes.
4.2. Camera Image Coding
We applied the diary coding procedures to the raw camera images, with two exceptions. First, we used one-minute recording intervals (timeslots), giving the image data a finer granularity than the diary. Second, we did not use the enjoyment domain. For the comparisons discussed in the following sections, the one-minute intervals in the image files were concatenated into 10-minute intervals matching those of the diary.
Interview notes were essential to the coding process. Most participants had a few black or unclear images from using the privacy lens cover, inadvertently covering it with clothing, or being in low-light conditions, so the interviewer needed to identify what the respondent was doing when this occurred. The main reasons for covering the lens or turning the camera off were using the bathroom, reading confidential documents on the computer, attending medical appointments, and collecting children from school. The interview notes also allowed the coder to include additional domain information such as secondary activities, location, and the presence of others.
We developed a standard operating procedure for the image coding to aid replicability (see Figure 3). We identified activities as episodes and assigned a HETUS code if they continued for three or more images with no breaks (interruptions) of more than two images. Activities that lasted fewer than three images were grouped with the activity immediately preceding them. For example, 10 images of watching television → 2 images of food preparation → 25 images of watching television would be coded as a single episode of watching television. If the food preparation lasted three or more images, it would be coded as preparing food with watching television on either side. One limitation of the protocol is that it cannot assign preparing food or watching television as primary or secondary activities unless they were recorded as such in the interview notes.

Figure 3. The standard operating procedure for image coding.
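The three-image rule in the image coding procedure can be sketched as follows. This is a simplified illustration, not the coder's actual workflow: it assumes one provisional activity label per image, and it keeps a short run at the very start of the record as its own episode because there is no preceding activity to absorb it. The function name and labels are hypothetical.

```python
from itertools import groupby

MIN_IMAGES = 3  # shorter runs are absorbed by the preceding activity


def code_episodes(labels, min_images=MIN_IMAGES):
    """Turn per-image labels into (activity, image_count) episodes."""
    merged = []
    for label, run in groupby(labels):
        count = sum(1 for _ in run)
        # Absorb a short interruption, or rejoin an activity that
        # resumes after an interruption has been absorbed.
        if merged and (count < min_images or merged[-1][0] == label):
            merged[-1][1] += count
        else:
            merged.append([label, count])
    return [(label, count) for label, count in merged]


short_break = ["tv"] * 10 + ["food prep"] * 2 + ["tv"] * 25
long_break = ["tv"] * 10 + ["food prep"] * 3 + ["tv"] * 25
print(code_episodes(short_break))
print(code_episodes(long_break))
```

The first sequence collapses into a single television episode of 37 images, while the second yields three episodes because the food preparation reaches the three-image threshold.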
4.3. Accelerometer Data Extraction
For the accelerometer data processing, we followed procedures used by the UK Biobank accelerometer data processing expert group, including device calibration to local gravity and resampling to 100 Hz (http://www.ukbiobank.ac.uk; Doherty et al. 2017). We calculated the sample-level Euclidean norm of the acceleration in x/y/z axes, and removed machine noise using a fourth-order Butterworth low pass filter with a cutoff frequency of 20 Hz. To extract the activity-related component of the acceleration signal, we removed one gravitational unit from the vector magnitude, with remaining negative values truncated to zero. Consecutive stationary episodes lasting for at least 60 minutes were automatically identified as device nonwear time.
Accelerometer measures that represent total activity volume, such as average vector magnitude (i.e., movement per time interval relative to the center of the earth), are recommended as appropriate measures of PA. To describe PA intensity, we aggregated the sample-level data into 10-minute epochs for summary data analysis, maintaining the average vector magnitude value over the period (in milli-gravity units).
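The gravity-removal and epoch-averaging steps described above can be sketched in plain Python. This sketch deliberately omits device calibration, resampling, the Butterworth low-pass filter, and nonwear detection, so it is not the UK Biobank pipeline; the function names are hypothetical, and acceleration is assumed to be expressed in units of g.

```python
import math

SAMPLE_RATE_HZ = 100
EPOCH_SECONDS = 600  # 10-minute epochs


def activity_component_mg(x, y, z):
    """Vector magnitude minus one gravitational unit, truncated at
    zero, converted to milli-gravity units."""
    magnitude = math.sqrt(x * x + y * y + z * z)
    return max(0.0, magnitude - 1.0) * 1000.0


def epoch_means(samples, sample_rate_hz=SAMPLE_RATE_HZ,
                epoch_seconds=EPOCH_SECONDS):
    """Average the activity component over consecutive full epochs.

    Samples left over after the last full epoch are dropped.
    """
    per_epoch = sample_rate_hz * epoch_seconds
    values = [activity_component_mg(*sample) for sample in samples]
    return [
        sum(values[i:i + per_epoch]) / per_epoch
        for i in range(0, len(values) - per_epoch + 1, per_epoch)
    ]


# Tiny synthetic signal: a still device, then constant extra acceleration.
# A 2 Hz rate and 2-second epochs keep the example small.
signal = [(0.0, 0.0, 1.0)] * 4 + [(0.0, 0.0, 2.0)] * 4
print(epoch_means(signal, sample_rate_hz=2, epoch_seconds=2))
```

A stationary device reads one gravitational unit, so its activity component is zero; only acceleration beyond gravity contributes to the epoch average.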
5. Data Analysis and Results
We first compare estimates of sample means from the diary and camera records and then make individual-level comparisons of estimated durations of activities. We next consider the overall degree of similarity of the days depicted through these two measures (self-similarity) to answer the question, “Could we identify the diarist from the camera record?” We then turn to estimates of the relationship of the camera and the diary to the accelerometer record.
5.1. Aggregate Comparison of Diary and Camera Records
The 33 activities listed in Table 2 comprise activities coded to the two-digit level of the UK HETUS activity lexicon together with some amalgamation of activities associated with very small time expenditures. The aggregate mean times in coded activities from the camera data and the self-report time-use diaries are, in general, rather similar. Table 2 shows substantial and significant differences in only three activity categories: eating, reading, and watching television.
Table 2. Mean Daily Time in 33 Activities
*p < .05. **p < .005. ***p < .0005.
Note: N = 131 cases for camera and diary.
Figure 4 plots the 31 activity categories with durations less than 100 minutes. We excluded sleep and paid work, both with long durations, as they would distort the view and give a correlation coefficient indistinguishable from unity. We find a very strong association between the two measures as estimators of time-use at the aggregate level. If we take just the 31 two-digit activities as cases, we find a correlation coefficient between the diary and camera estimates of .975, which is a remarkably high level of association between a self-report estimate and a criterion measure. Compare, for example, the nearly 45° plot in Figure 4 with that shown in Figure 1.

Figure 4. Thirty-one activities less than 100 minutes.
5.2. Individual-Level Comparisons of Diary and Camera Reports
The similarity between the aggregate means of this quite detailed activity list is not entirely surprising. For example, perfect recall of the sequence of yesterday’s activities combined with mutually cancelling random errors in recall of the exact start or finish time of each element in the activity sequence might produce the generally nonsystematic unbiased mean estimates seen in Figure 4.
We thus turn from the comparison of aggregate mean time in activities across the sample to consider patterns of difference between the diary and camera estimates of total time in an activity at the individual level (i.e., moving from between- to within-individual comparisons). The main issue in assessing the quality of the diary record is whether we can find statistically significant differences between diary-based estimates of an individual’s total time in various activity categories and estimates derived from the (criterion) camera record. The t-tests in Table 2 show significant differences only in the case of time devoted to eating, other personal care, food management, reading, and school travel.
Table 2 also provides measures of the covariance (correlation coefficients) of the two measures. The correlation coefficients can provide an estimate of the extent of noise associated with recall errors in the start/finish times of diary activities, although it is not clear what should be considered a good correlation in this context. Some short duration categories, such as other paid work–related (mean 15 minutes in the camera record), resting and time out (mean 8 minutes), and listening to radio and recordings (2 minutes), have correlations less than .5. However, the major time-use categories (more than 60 minutes per day in the diary record) sleep, paid work, social activity, and watching television all have correlations greater than .65. Of the 33 activity categories, 9 have r ≥ .9, a further 7 have r ≥ .8, and a further 5 have r ≥ .7.
5.3. Simultaneous Activities and the Construction of Daily Narratives
It is not coincidental that the major activity categories of eating, watching television, and reading show the most substantial differences at both aggregate (sample) and individual (case) levels, as these activities are the most likely to occur simultaneously with other activities.
Most participants are accustomed to being asked, “What did you do today?” Answering this question trains individuals to construct narratives such as “arrived home from work, put the kettle on and made tea, then watched television.” These accounts are, in effect, “streams of behavior” in different environments, that is, sequences of activities that can be nested hierarchically (Barker 1963, 1968, 1978; Harms 2004). From the diarist’s perspective, simultaneous activities (e.g., drinking tea, glancing at the newspaper) may occur within and are evidently secondary to the main activity of watching television.
All simultaneous activities reported in the diaries and interviews were coded. However, if the respondent did not nominate the primary activity in the reconstruction interview, it was not always evident which activities were primary or secondary/simultaneous. In these cases, we made judgments to reconstruct the respondent’s behavior stream in a logical sequence. However, our judgments may have differed from the diarist’s subjective understanding of the particular activity. Interpreting images from the wearer’s perspective (i.e., facing outward) leads to other problems. A respondent eating a meal may turn to talk to a companion, causing the camera to face away from the plate for a few frames. The analyst, for lack of other evidence, may classify this as conversing even though the respondent would classify the primary activity as eating with talking as a secondary activity.
We illustrate these problems by considering the full accounts of three activities in the camera record (see Table 3). Eating as a primary activity occupies 55 minutes in the camera record, compared with 74 minutes in the diary. If all events in which eating is recorded as a secondary activity were reversed to place eating as the primary activity, then eating durations would double. Similarly, watching television, which occupies 75 minutes as a primary activity in the diary but only 64 minutes in the camera record, increases by 50 percent if television viewing events counted as secondary by the camera analyst are recoded as primary. Reading, by contrast, is frequently ancillary to other activities. For example, during a meal, a respondent may read the newspaper rather than converse. The newsprint may feature frequently in images alongside the plate of food, but from the diarist’s perspective, eating the meal is the main activity.
Table 3. Time-Reporting Hierarchy as Seen in the Camera Record (Minutes/Day)
5.4. Are There Reporting Differences by Educational Levels?
The issue here is not whether there are variations in the amounts of activity reported by respondents with different levels of educational attainment; we expect such differences. Rather, the question is whether the discrepancies between the camera and diary records themselves differ substantially by educational level. Put more directly, we need to establish whether highly educated respondents are more likely to under- or overreport particular sorts of activities in their diaries compared with the camera evidence. Table 4 compares camera minus diary differences expressed as a percentage of the diary mean estimates of time in the activities. In this analysis, we emphasize activities that occupy a relatively large proportion of the average day. Activities occupying 30 or fewer minutes per day have a relatively large number of zero-scores, meaning either the diary or the camera evidence is missing.
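The discrepancy measure described above, the camera minus diary difference as a percentage of the diary mean, can be sketched as follows. The function name is hypothetical, and the example uses the whole-sample primary-activity means for eating and watching television quoted earlier in the text rather than the education-specific figures.

```python
def percent_difference(camera_minutes, diary_minutes):
    """Camera minus diary, as a percentage of the diary mean.

    Returns None when the diary mean is zero, since the ratio is
    undefined in that case.
    """
    if diary_minutes == 0:
        return None
    return 100.0 * (camera_minutes - diary_minutes) / diary_minutes


# Whole-sample means in minutes per day: (camera, diary).
means = {"eating": (55, 74), "watching television": (64, 75)}
for activity, (camera, diary) in means.items():
    print(activity, round(percent_difference(camera, diary), 1))
```

A negative value indicates that the camera record shows less time in the activity than the diary does. Computing the same quantity separately within each educational group gives the kind of comparison the paragraph describes.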
Table 4. Is There a Reporting Bias from Educational Level?
Note: Activities with total camera durations of ≥30 minutes per day are in bold.
Most of the larger activities in Table 4 show reasonable correspondence between recording patterns of the higher educated and less educated participants; these differences mainly reflect the expected education-related variation. Among these activities, sleep, eating, paid work, cooking, reading, and watching television show similar patterns of difference between the two records. Household upkeep, gardening, and pet-care show larger differences, although with the same sign on the errors. Participatory activities (religious practice, volunteering) show a very large proportional difference in underreporting but from a small number of minutes. Only shopping, social entertainment, and leisure travel show large discrepancies in different directions. Among the shorter-mean duration activities, other paid work–related, helping other households, and playing games show substantially lower estimates in the diary records relative to the camera estimates among less educated respondents. Radio listening, resting, exercise, and exercise-related travel show higher levels of difference among less educated respondents.
5.5. Self-Similarity Analysis of Diary and Camera Records
We now consider similarities in overall patterns of time-use produced by the camera and diary pairs in a more holistic way. Our focus here is on the evaluation of aggregate durations in activities, and with the exception of a brief discussion in the following section, we reserve analysis of the similarity of timing of daily activities for discussion elsewhere. We therefore take the overall daily totals of time in activities as a more general test of how well the diaries represent individual respondents’ days. We use the compositional distance measure proposed by Robinson and Converse (1972), calculating generalized Euclidean distances (GEDs) between pairs of records. By considering each of the 33 activity categories as an independent dimension, we can define a 33-dimensional hypotenuse-equivalent as the square root of the sum of the squared differences between the paired camera and diary estimates of total time in each activity. The resulting “self-similarity” measure is the GED between the two time-use measures for a single respondent.
We can also calculate a similar GED between each of the 131 diary records and the camera records of each of the other 130 respondents, producing “general similarity” measures. The self- and general similarity measures together provide a 131 × 131 matrix of GEDs, each row corresponding to a diary record and each column to a camera record, with the major diagonal elements containing the self-similarity measures.
The ratio of the mean of the general similarity measures along a given row of the matrix to the self-similarity measure (the major diagonal cell) provides a goodness-of-fit indicator. We expect, given the extent of interpersonal variation in patterns of daily time-use, that the GED between any diary activity pattern and that of the corresponding camera should be smaller than GEDs between a diary and any other camera record; the major diagonal cell should, in general, show the minimum GED on any given row.
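As a minimal sketch of the similarity matrix described above, the following Python fragment computes GEDs between diary and camera daily totals and derives the row-wise quantities discussed here: the self-similarity, the mean general similarity, their ratio, and the row minimum. The function names and the three-respondent, four-activity toy data are illustrative assumptions, not the study's code or data; the analysis itself uses 131 respondents and 33 activity categories.

```python
import math

def ged(a, b):
    """Generalized Euclidean distance between two vectors of daily minutes."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ged_matrix(diaries, cameras):
    """Rows correspond to diary records, columns to camera records."""
    return [[ged(d, c) for c in cameras] for d in diaries]

def row_summary(matrix, i):
    """Goodness-of-fit quantities for respondent i (row i of the matrix)."""
    row = matrix[i]
    self_sim = row[i]  # major diagonal cell
    others = [v for j, v in enumerate(row) if j != i]
    general_mean = sum(others) / len(others)
    # Ratio of mean general similarity to self-similarity; larger is better.
    ratio = general_mean / self_sim if self_sim > 0 else float("inf")
    return {"self": self_sim, "general_mean": general_mean,
            "ratio": ratio, "row_min": min(row)}

# Hypothetical toy data: minutes per day in four activities (each row sums to 1,440).
DIARIES = [[480, 60, 480, 420], [420, 90, 0, 930], [540, 120, 300, 480]]
CAMERAS = [[470, 70, 480, 420], [430, 80, 10, 920], [500, 100, 340, 500]]

matrix = ged_matrix(DIARIES, CAMERAS)
summaries = [row_summary(matrix, i) for i in range(len(DIARIES))]
```

In this toy example, each diary is closest to its own camera record, so the diagonal cell is the row minimum and the ratio exceeds 1 for every respondent.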
Figure 5 reorders the rows and columns of the matrix in ascending order of the 131 self-similarity scores and for each case plots the mean of the general similarity indicator, the self-similarity indicator, and the minimum GED for the appropriate row of the matrix. The GED scores for each subject, roughly speaking, represent the sum of the deviations between the 33 time-use totals from camera/diary pairs; a GED of 100 units represents an average three-minute deviation for the 33 pairs, 200 represents a six-minute average deviation, and so on. With the exception of the single worst case, the self-similarity distance is smaller than the mean of the general similarity scores. Likewise, the self-similarity distance for most of the first 100 or so reordered cases is also the minimum GED. Beyond this point, we find an increasing number of cases where the overall time-use pattern in the diary record is more similar to someone else’s camera record than to the diarist’s own record.

Figure 5. Comparison of similarity of diary/camera pairs and distance of diaries to means of all other camera records.
As already noted, there are two likely explanations for the differences between the camera and diary pairs. The first is simply poor diary-keeping, which emphasizes the importance of checking diaries for missing data upon collection. The second is the difference between respondents’ own recorded sequences of primary activities and the more complex multiple-simultaneous-activity reality of the camera record and the coder’s decisions. Although beyond the scope of this article, this can be tested by observing the effects of reordering the multiple simultaneous activities recorded by coders in the camera records (e.g., in Table 4).
There are several documented indicators for diary quality (e.g., Fisher et al. 2015; Glorieux and Minnen 2009). These include (1) range of coverage in the daily record (i.e., its inclusion of necessary daily activities, such as eating and sleeping), (2) frequency of mentions of secondary or higher-order simultaneous activities, (3) amounts of missing time during the day, and (4) number of separate activities recorded in the diary. In this analysis, we use the latter two indicators. Removing “lower-quality” diaries (those with more than 60 minutes missing/unallocated time during the diary day and with fewer than 25 diary episodes) leaves 100 “higher-quality” diary records of the 131 total. Of these, 90 have self-similarity scores of no more than 15 units (i.e., average deviations of less than 30 seconds above the minimum for their case).
5.6. Aggregate Comparisons of Diary, Camera, and Accelerometer Measures
Table 5 groups the 33 two-digit activities into seven broad categories and compares the PA levels (accelerometer records in mg/minute) associated with each. The upper two panels of the table refer to the camera records. On the right are the means and standard deviations for all participants who completed diaries, and on the left are the higher quality diaries. Only a subset (n = 124) of the camera and diary sample returned usable accelerometer data. To maintain adequate numbers, we used a slightly less stringent criterion for diary quality, categorizing those with fewer than 70 minutes missing as “better” diaries. The lower two panels provide equivalent measures comparing the diary with the accelerometer records.
Table 5. Comparison of Accelerometer Means for Summary Activities, by Camera and Diary
Two findings emerge with some clarity from Table 5. The first is that both the camera and diary records show the expected differences in PA between broad types of activity. For example, in all four quadrants of Table 5, we find a roughly eightfold difference in PA between the sleeping and exercise categories. In particular, the same differentials emerge from the camera and diary records.
The second finding, with a single exception, is that there are insubstantial differences between the whole sample and the higher quality diary columns. The exception is purposive exercise (e.g., sports, walking), where diaries from the whole sample report higher levels of PA than do the high-quality diaries: 174 mg/minute versus 158 mg/minute for the camera records and 173 mg/minute versus 162 mg/minute for the self-report diary. The standard deviations of these means are large, which indicates these differences are not statistically significant. Although the precise mechanism is not clear, in both cases, the less densely recorded diary and camera sequences reveal somewhat more purposive exercise.
Table 6 compares the mean accelerometer scores, broken down by the camera and diary classification of each activity, for the more detailed two-digit activity classification. The rows of the table are placed in ascending order of the diary-based accelerometer scores. The ordering would differ only slightly (activities moving up or down by no more than a single rank) if it were reordered according to the equivalent camera coding. Scores derived from the camera- and diary-based coding have a correlation of .98. (We excluded exercise scores from our calculation of this correlation because, as distinct outliers, they would push the estimate upward.)
Table 6. Accelerometer Means by Two-Digit Activity Categories, Ordered by Camera Scores
Note: N = 124 cases for camera and diary.
5.7. Individual-Level Comparisons of Diary, Camera, and Accelerometer Measures
Just as we did for the two-way diary and camera analysis, we now turn from sample means to an individual-level analysis. For each timeslot (10-minute intervals through the day), we regress the mean accelerometer score on the camera- and diary-based activity coding. The timeslot is the case, yielding a potential data set of 17,856 (i.e., 124 × 144) cases for both the diary and camera records. Missing data reduced this total to 16,846 cases for the records that have valid camera, diary, and accelerometer measures for the same timeslot.
The simple ordinary least squares approach to this is a dummy variable regression, classifying each timeslot case by a vector of 32 indicator (0/1) variables representing the activity categories, at least 31 of which are zero for any given timeslot. The 33rd “default” activity category is represented by the case where none of the indicator variables is set to 1. The camera-based regression analysis of the whole data set produces a multiple correlation (R) coefficient of .493, the diary-based analysis only slightly less at .473. Considering that much of the variation in purposive exercise/PA relates to physiological, demographic, and socioeconomic variables (e.g., BMI, level of fitness, age, sex, employment status, social class) that can vary almost independently of the type of activity, these are reassuringly acceptable levels of association from the perspective of the reliability of the two alternative indicators (i.e., camera and diary) of the type of activity in the timeslot.
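The dummy variable regression can be sketched as follows, assuming a single categorical predictor (the activity code assigned to each timeslot). With an intercept and a full set of category indicators, the ordinary least squares fitted value for a timeslot is the mean accelerometer score of its activity category, so the multiple correlation R can be obtained from category means without a matrix solver. The function name, activity labels, and scores below are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import defaultdict

def dummy_regression_r(activity_codes, accel_scores):
    """Multiple correlation R from regressing scores on activity indicators.

    Fitted values under OLS with one indicator per non-default category
    equal the category means, so R^2 = 1 - SSE / SST.
    """
    groups = defaultdict(list)
    for code, score in zip(activity_codes, accel_scores):
        groups[code].append(score)
    fitted = {code: sum(v) / len(v) for code, v in groups.items()}

    grand_mean = sum(accel_scores) / len(accel_scores)
    sst = sum((y - grand_mean) ** 2 for y in accel_scores)
    sse = sum((y - fitted[c]) ** 2 for c, y in zip(activity_codes, accel_scores))
    r_squared = max(0.0, 1.0 - sse / sst)  # assumes scores are not all identical
    return math.sqrt(r_squared), fitted

# Potential number of timeslot cases: 124 respondents x 144 ten-minute slots.
potential_cases = 124 * 144

# Hypothetical toy timeslots (activity code, accelerometer score in mg/minute).
codes = ["sleep", "sleep", "work", "work", "exercise", "exercise"]
scores = [20, 24, 40, 48, 160, 180]
r, category_means = dummy_regression_r(codes, scores)
```

Running the same function once with camera-based codes and once with diary-based codes for the same timeslots would give the two R values that the text compares.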
However, assessing the reliability of the diary using the camera record as a criterion indicator requires a slightly different approach. It is important to know whether the diary measure is explaining the same part of the variation of the accelerometer record as the camera measure. We modeled this by allocating METs, using the Ainsworth Compendium (Ainsworth et al. 2011) as a reference, to the three-digit HETUS activity classification (Eurostat 2009). Our process broadly duplicates the work carried out by Tudor-Locke and colleagues (2009), who applied this to the American Time Use Survey (ATUS) activity lexicon. The raw correlations between the camera- and diary-derived METs scores and the accelerometer measure are .518 and .500, respectively.
Table 7 provides multiple correlation scores for Model 3, which uses both camera and diary estimated METs to predict accelerometer scores. The relatively small increment of prediction gained by adding the camera METs above the diary METs suggests the camera and diary are explaining the same components of the variation in the accelerometer record. Adding descriptors of the respondents (e.g., age, sex, educational attainment) improves the model performance, but we reserve further modeling of METs for another article.
Table 7. Diary- and Camera-Based Metabolic Equivalent of Task Scores (METs) as Predictors of Accelerometer Scores
p < .0005.
The main objective of this study was to validate diary estimates of activity durations, but this last result combined with the similarity of accelerometer scores between camera and diary records seen in Tables 5 and 6 provides a direct chain of inference to establish the general accuracy of time-of-day of activities in the diary. The camera timings of activities are objectively recorded, so the “same components” finding suggests the diary is identifying close to the same times for the activities as the camera record.
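The logic behind the “same components” argument can be illustrated with a nested comparison: fit the accelerometer score on diary METs alone, then on diary and camera METs together, and inspect the increment in the multiple correlation. The sketch below assumes simple linear models with an intercept and uses the standard two-predictor identity for R²; it does not reproduce the exact specifications in Table 7, and all names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def multiple_r(y, x1, x2):
    """Multiple correlation for OLS of y on x1 and x2 (with intercept).

    Uses R^2 = (r1^2 + r2^2 - 2*r1*r2*r12) / (1 - r12^2), which requires
    that x1 and x2 are not perfectly collinear.
    """
    r1, r2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    r_sq = (r1 * r1 + r2 * r2 - 2 * r1 * r2 * r12) / (1 - r12 * r12)
    return math.sqrt(min(1.0, max(0.0, r_sq)))

# Hypothetical timeslot-level values, for illustration only.
accel = [20, 25, 45, 60, 150, 35, 50, 170]              # mg/minute
diary_mets = [0.95, 1.0, 1.5, 2.5, 7.0, 1.3, 2.0, 7.0]
camera_mets = [0.95, 1.0, 1.8, 2.5, 7.0, 1.3, 2.3, 8.0]

r_diary = abs(pearson(accel, diary_mets))                # diary-only model
r_camera = abs(pearson(accel, camera_mets))              # camera-only model
r_both = multiple_r(accel, diary_mets, camera_mets)      # combined model
increment = r_both - r_diary
```

A small increment means the camera METs add little predictive information beyond the diary METs, which is the pattern the text interprets as both records explaining the same part of the accelerometer variation.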
6. Discussion
The overall purpose of the CAPTURE-24 project was to test the self-report diary method of capturing time-use information in a comprehensive way against records of activity that are sufficiently objective to be considered criterion tests. This is the first time in the social scientific or public health literature that such a test, covering all activities of daily life, has been carried out.
We demonstrate that self-report time-use diaries provide a good basis for accurate estimation of time-use patterns, without evidence of bias by educational level. By direct inference, we can therefore conclude that when collected from representative samples of respondents, time-use diaries can validly and reasonably reliably represent the time-use of large populations. This is an important advance on previous time-diary evaluation literature insofar as it relies not on a priori reasoning but on comparisons with objective criterion data.
Our results amplify, on a much broader basis, Kelly and colleagues’ (2014) conclusions comparing self-report trip logs to camera records of travel: Self-reports provide generally accurate and unbiased aggregate estimates of means of time in different activities, with random error at the level of individual observations, presumably related to recall error. The CAPTURE-24 study is the first to provide a clear test of the performance of conventional self-report time-use diaries against reasonably objective criterion measures covering the full range of daily activities.
The sample studied here is in no sense representative of any specific population. Despite our efforts to recruit a broad base of participants, the possibility of hidden bias toward unusually accurate diarists remains. However, our investigation of the relationship of educational levels to reporting provides no evidence of systematic bias from this source. There are issues with the type of time-use diary used in this study. Participant burden is higher with time-use diaries than with passive data collection devices such as cameras and accelerometers. Furthermore, the 10-minute intervals used by the HETUS standard time-use diary are too coarse to capture some activities, leading to ambiguity (e.g., multiple short activities versus simultaneously occurring activities within the same interval). We acknowledge that a single 24-hour period cannot represent “usual” behavior at an individual level.
Analyses show clear biases resulting from use of PAQ methods, which are sufficient to disqualify many studies relying solely on that method. However, PAQ approaches could be used alongside diaries to adjust diary estimates for longer-term participation frequencies and calibrate PAQ results to compensate for their biases (Gershuny 2012). The message of this study is that time-use diaries produce reliable results and should be used either alongside or instead of PAQ methods.
Acknowledgements
First, we wish to thank our participants who gave their time to contribute to this project. We also acknowledge Sven Hollowell, who helped prepare the accelerometer data.
Authors’ Note
Jonathan Gershuny and Teresa Harms are now with the ESRC Centre for Time Use Research (CTUR), Department of Social Science, at University College London, UK.
Funding
The authors acknowledge the support of the UK Economic and Social Research Council and European Research Council for funding Teresa Harms and Jonathan Gershuny (ESRC Centre for Time Use Research, Department of Social Science, University College London). The British Heart Foundation Centre of Research Excellence at Oxford (http://www.cardioscience.ox.ac.uk/bhf-centre-of-research-excellence) and the Li Ka Shing Foundation (http://www.lksf.org/) supported the work undertaken by Aiden Doherty, Emma Thomas, Karen Milton, and Charlie Foster, all at the British Heart Foundation Centre on Population Approaches for Non-communicable Disease Prevention, Nuffield Department of Population Health, University of Oxford. We also acknowledge the support provided by the University of Oxford Advanced Research Computing facility in carrying out this work.
