Abstract
Background:
The impact of thyroid disease on quality of life is an important disease aspect that is best investigated by patient-reported outcomes. Recent patient-reported outcomes research has raised concern about the validity of traditional retrospective questionnaires. Therefore, ecological momentary assessments of patients' subjective well-being have been introduced to avoid recall bias and improve contextual validity. Despite theoretical advantages, the measurement properties remain unsubstantiated. This study examines the relationship between the retrospective thyroid-related quality of life patient-reported outcome measure (ThyPRO) and a momentary (here-and-now) version of ThyPRO.
Methods:
Eighty-three newly diagnosed hyperthyroid patients expected to undergo treatment completed questions on their thyroid-related quality of life. A head-to-head comparison was performed between 12 momentary items from four multi-item ThyPRO scales, administered three times daily via a smartphone application over 28 days, and the original retrospective ThyPRO, administered on day 28. The measurement difference between recalled and momentary ratings was quantified for all four scales. Furthermore, correlations between the measures were investigated, and their agreement was explored using Bland–Altman plots. Finally, the study examined whether retrospective ratings were influenced by two forms of recall bias (the peak effect and the end effect).
Results:
Retrospective and mean momentary ThyPRO ratings were highly correlated (Pearson's correlations: 0.74–0.88). However, retrospective ratings provided significantly higher scores (i.e., worse quality of life) on all scales. Bland–Altman plots showed a skewed distribution, indicating low levels of agreement. Results supported a peak effect for retrospective ratings on tiredness but not for the remaining scales. Further, results supported end effects for retrospective ratings of emotional susceptibility and anxiety.
Conclusions:
Retrospective and mean momentary ThyPRO ratings correlated strongly, but retrospective ratings were higher, indicating more disease impact. The differences were of magnitudes normally deemed clinically relevant. Limited evidence supported peak and end effect bias for retrospective assessments. The two measurement modalities did not appear congruent and thus cannot be used interchangeably. When designing clinical studies, whether to use a momentary or retrospective measurement method may depend on the aim of measurement. Further prospective analyses are needed to compare any beneficial effects, for example in terms of higher precision or sensitivity to clinical change, of momentary assessments.
Introduction
Patient-reported outcomes are increasingly employed in a medical world where assessment of the quality of clinical care has been put into focus. To capture the impact of disease and treatment from the patient's perspective, a large number of patient-reported outcome measures (PROMs) have been developed. In thyroidology, the thyroid-related quality of life patient-reported outcome measure (ThyPRO) has been developed and extensively evaluated, including content validity, factorial validity, known groups validity, cross-cultural validity, internal consistency, test–retest reliability, and responsiveness (1–6). A systematic review by Wong et al. found the ThyPRO to have strong measurement properties and recommended ThyPRO as the preferred PROM for patients with benign thyroid disease (7). The only limitation of ThyPRO mentioned was the length of the questionnaire, calling for an abbreviated version, which has now been developed (8). Two additional PROMs were found to have adequate measurement properties: Graves' ophthalmopathy-specific Quality of Life (GO-QOL) (9–11) and Thyroid Treatment Satisfaction Questionnaire (ThyTSQ) (12,13). However, these PROMs target patients with Graves' ophthalmopathy and hypothyroidism, respectively.
Ecological momentary assessment is a recently introduced measurement method that may further improve the quality of patient-reported outcomes. Ecological momentary assessments investigate experiences in real time as they occur in the daily lives of patients by repeatedly asking them how they currently feel (14). The method was introduced in response to concerns about the validity of traditional retrospective PROMs. The argument was that memory is flawed and that people are unable to retrieve past experience accurately. Instead, mental shortcuts are used to reconstruct past experiences, which may introduce systematic bias and inaccuracy (15).
Ecological momentary assessment has mainly been used to study pain and fatigue, and in behavioral research (15). Compared to traditional retrospective assessments, the mean of momentary ratings usually produced lower ratings of symptom severity for both pain and fatigue (16–18). Studies using momentary assessments to evaluate pain have shown that respondents may pay excessive attention to the most intense as well as to the most recent pain when retrospectively rating an experience, while duration of the pain was given less attention (19–21). These types of recall bias are known as the peak effect and end effect, respectively.
In addition to avoiding recall bias, momentary assessments have other potential advantages. First, they provide ecological validity, that is, people are answering questions while living their everyday lives as opposed to answering questionnaires at a time dictated by convenience or even in the doctor's office (15,16). Second, repeated sampling enables investigation of symptoms over time, including daily fluctuations (14,15). Thus, ecological momentary assessments have the potential to provide more valid data on quality of life, and therefore a momentary version of the original ThyPRO was developed (22).
The objective of the present study was to examine the relationship between the original retrospective ThyPRO and momentary ThyPRO ratings and to evaluate the presence of recall bias in a standard retrospective PROM across four different scales.
Methods
Participants
Patients newly diagnosed with hyperthyroidism, including Graves' disease, toxic nodular goiter, and drug-induced thyrotoxicosis, were included. Patients were recruited from September 2014 to July 2017 from endocrine outpatient clinics at four Copenhagen University Hospitals: Rigshospitalet, Gentofte, Herlev, and Bispebjerg. Patients were eligible if they were ≥18 years of age, understood Danish, had serum thyrotropin (TSH) levels <0.1 mIU/L within the last month, and were scheduled to undergo treatment for hyperthyroidism (antithyroid drugs, radioiodine, or surgery). If treatment had already been initiated, patients had to have elevated free thyroxine (fT4; reference range 10–22 pmol/L) in addition to TSH <0.1 mIU/L as a marker of current disease activity. Exclusion criteria were pregnancy and major comorbidities suspected to impact quality of life substantially (e.g., cancer or congestive heart failure). Eligible patients were initially contacted by phone. Patients who were interested in participating then received written information via e-mail.
Collection of momentary and retrospective ratings
The ThyPRO assesses quality of life in patients with benign thyroid disease. It consists of 85 items summarized in 13 multi-item scales and one single-item scale. Each scale ranges from 0 to 100, with higher scores indicating worse quality of life. The original ThyPRO uses a four-week reference period, that is, it asks patients to summarize their experiences over the last four weeks. A momentary version of selected sections of the ThyPRO was developed. Items from the five scales previously found to be most responsive were evaluated by conducting cognitive interviews with patients from the target group. It was evaluated if items functioned in a momentary setting and if there were any problems with the new item versions (22). One scale was found to be incompatible with a momentary setting. To minimize response burden while allowing representation of sub-domains within each scale, three items from each of the remaining four multi-item scales were selected for this study. Items were chosen based on how well they functioned in a momentary setting, that is, if items were actually answered with a momentary reference period and the amount of problems detected during the cognitive interviews. The items and scales were: (i) Hyperthyroid Symptoms scale—“At this moment, do you have trembling hands?” “…are you experiencing palpitations (rapid heartbeats)?” and “…are you experiencing shortness of breath?”; (ii) Tiredness scale—“At this moment, are you tired?” “…are you exhausted?” and “…do you feel energetic?”; (iii) Anxiety scale—“At this moment, do you feel nervous?” “…do you feel tense?” and “…do you feel restless?”; and (iv) Emotional Susceptibility scale—“At this moment, do you have difficulty coping?” “…do you feel irritable?” and “…do you feel in balance?” (22).
The 12 momentary items were administered three times a day for a period of four weeks. On the last day, patients received an electronic ThyPRO survey via e-mail containing the same 12 items but with the original retrospective four-week reference period to cover the same four-week period by both measures.
Momentary assessments were collected using an Android smartphone application (app) specifically designed for the capture of momentary questionnaire data. If patients owned an Android smartphone, the app was installed on the patients' own devices; if not, they borrowed one. The app administered momentary questions three times a day at semi-randomized time points. A participant's waking hours were divided into three equal periods, and within each period, a prompt was issued at a random time. It featured auditory prompts and presented the items via the touch screen. If a notification was prompted at an inconvenient time, the assessment could be postponed by up to one hour. The system recorded the time and date of each data entry. Patients were able to set and adjust their diurnal rhythm in the app from day to day to ensure that every waking hour was represented. The app was integrated with a trial management system, PROgmatic (23), which enabled automatic distribution of retrospective questionnaires and daily monitoring of response rates.
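The semi-randomized prompting scheme described above can be illustrated with a minimal sketch. It assumes only what the text states (waking hours split into three equal periods, one prompt at a random time within each); the function name `daily_prompt_times`, the example waking hours, and the date are hypothetical and are not taken from the app itself.

```python
import random
from datetime import datetime, timedelta

def daily_prompt_times(wake, sleep, n_periods=3, rng=None):
    """Return one random prompt time inside each of n_periods equal blocks
    of the interval between wake and sleep (hypothetical helper)."""
    rng = rng or random.Random()
    block = (sleep - wake) / n_periods          # length of one period
    prompts = []
    for i in range(n_periods):
        start = wake + i * block                # start of period i
        offset = rng.random() * block.total_seconds()
        prompts.append(start + timedelta(seconds=offset))
    return prompts

# Hypothetical participant awake from 07:00 to 22:00 on an arbitrary day.
wake = datetime(2015, 3, 2, 7, 0)
sleep = datetime(2015, 3, 2, 22, 0)
prompts = daily_prompt_times(wake, sleep, rng=random.Random(1))
for p in prompts:
    print(p.strftime("%H:%M"))
```

Drawing one prompt per period guarantees coverage of the whole waking day while keeping the exact prompt time unpredictable for the participant. The one-hour postponement option is not modeled here.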
Data analysis plan
Both retrospective and momentary ThyPRO ratings were aggregated and presented on a scale level. Differences between retrospective and mean momentary ratings (i.e., the mean of all momentary assessments collected during the study period) were analyzed with a paired t-test. Patients with very low response rates (<35%) were excluded from analysis. For patients with less missing data, the mean was taken over all available assessments (equivalent to mean score substitution for missing data). The magnitude of the difference between the measurement methods was quantified by Cohen's d (mean difference/standard deviation). Previous studies using ThyPRO showed an average scale standard deviation of 20 (24–28), and this standard deviation was therefore used in the present study as well. Effect sizes were defined as small (0.2–0.5), moderate (0.5–0.8), and large (>0.8) (29). The level of agreement between the two measures was investigated using Bland–Altman plots, which plot the difference between retrospective and mean momentary score against their average for each participant (30). Preliminary analysis of the Bland–Altman plots showed that differences increased with higher mean score, resulting in a trumpet-shaped plot. To adjust for this skewness, a logarithmic data transformation was subsequently performed. Correlations between the two measures were calculated using Pearson's correlation coefficients.
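As an illustration of the agreement analysis, the following sketch computes the quantities behind a log-transformed Bland–Altman plot. It is not the SAS code used in the study. The base-10 logarithm, the use of 2 SD for the limits of agreement, the requirement that all scores are strictly positive, and the example scores are assumptions made for the sake of a runnable example.

```python
import math
from statistics import mean, stdev

def bland_altman_log10(retrospective, momentary):
    """Bias and limits of agreement for log10-transformed paired scores.
    Assumes every score is > 0 (the handling of zero scores is not covered)."""
    log_r = [math.log10(r) for r in retrospective]
    log_m = [math.log10(m) for m in momentary]
    diffs = [a - b for a, b in zip(log_r, log_m)]        # vertical axis
    avgs = [(a + b) / 2 for a, b in zip(log_r, log_m)]   # horizontal axis
    bias = mean(diffs)
    sd = stdev(diffs)
    return {
        "averages": avgs,
        "differences": diffs,
        "bias": bias,
        "lower": bias - 2 * sd,   # lower limit of agreement
        "upper": bias + 2 * sd,   # upper limit of agreement
    }

# Hypothetical scale scores (0-100) for five participants.
retro = [40, 25, 60, 15, 33]
moment = [30, 20, 50, 10, 30]
result = bland_altman_log10(retro, moment)
print(round(result["bias"], 3), round(result["lower"], 3), round(result["upper"], 3))
```

On the log scale, the limits describe the ratio of retrospective to momentary scores rather than their absolute difference, which is why they are back-transformed before interpretation.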
To evaluate the presence of a peak effect, scale scores were exponentiated (e^x) to give added weight to higher scores, consistent with the hypothesis that when judging a past experience, people pay strong attention to the most intense experiences (peaks) rather than simply averaging every moment of the experience.
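The exact computational steps of the exponential transformation are not detailed in the text. The sketch below is therefore only one possible reading, shown to illustrate why exponentiating before averaging gives added weight to peaks: momentary scores are exponentiated, averaged, and mapped back to the original scale with the natural logarithm. The function name `peak_weighted_mean` and the example scores are hypothetical.

```python
import math
from statistics import mean

def peak_weighted_mean(scores):
    """Average exp(score), then return to the score scale with log.
    Illustrative only; not necessarily the procedure used in the study."""
    top = max(scores)
    # Subtracting the maximum first keeps exp() in a safe numeric range
    # without changing the result.
    return top + math.log(mean([math.exp(s - top) for s in scores]))

# Hypothetical momentary scores with a single pronounced peak.
scores = [10, 10, 10, 70]
plain = mean(scores)
weighted = peak_weighted_mean(scores)
print(plain, round(weighted, 1))
```

With this weighting, a single high score dominates the summary, so the peak-weighted value lies far closer to the peak than the arithmetic mean does. If retrospective ratings track such a peak-weighted summary more closely than the plain mean, this is consistent with a peak effect.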
To test for the end effect, each participant's momentary scores were aggregated weekly (weeks 1, 2, 3, and 4), and correlations between the retrospective score and mean momentary scores of each week were compared. It was hypothesized that correlations would be higher for later weeks due to the end effect. All analyses were performed using SAS Enterprise Guide v7.1.
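The weekly aggregation and correlation step can be sketched as follows. The data are hypothetical, the analysis in the study was run in SAS, and the mapping of days to weeks (days 0–7 as week 1, days 8–14 as week 2, days 15–21 as week 3, and days 22–28 as week 4) is an assumption, since the text does not specify it.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson's correlation coefficient for two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def week_of(day):
    """Map study day (0-28) to week 1-4 (assumed mapping)."""
    return min(max(day - 1, 0) // 7 + 1, 4)

def weekly_means(day_scores):
    """day_scores: list of (day, score) pairs for one participant."""
    buckets = {1: [], 2: [], 3: [], 4: []}
    for day, score in day_scores:
        buckets[week_of(day)].append(score)
    return {w: mean(v) for w, v in buckets.items() if v}

# Hypothetical data: momentary (day, score) pairs and one retrospective score.
participants = [
    ([(3, 50), (10, 45), (17, 40), (24, 30)], 35),
    ([(3, 20), (10, 25), (17, 15), (24, 10)], 15),
    ([(3, 60), (10, 50), (17, 55), (24, 60)], 65),
    ([(3, 30), (10, 40), (17, 35), (24, 45)], 50),
]
retro = [r for _, r in participants]
weekly = [weekly_means(m) for m, _ in participants]
correlations = {w: pearson([p[w] for p in weekly], retro) for w in (1, 2, 3, 4)}
for w, r in correlations.items():
    print(f"week {w}: r = {r:.2f}")
```

Under the end effect hypothesis, the correlation with the retrospective score would be expected to rise from week 1 to week 4.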
Ethical considerations
The study was performed in accordance with the Declaration of Helsinki. The study protocol was reviewed by the local Ethical Committee (reg. no. H-A-2009-FSP23). According to Danish law, questionnaire studies do not require and thus cannot obtain formal approval by ethical committees. Informed consent was obtained from all individual participants. The study was approved by the Danish Data Protection Agency (local identifier at Rigshospitalet: 13-30-1092).
Results
Participants
A total of 273 eligible hyperthyroid patients with phone numbers available from the patient record were contacted, and 94 agreed to participate. The main reasons for non-participation were unwillingness to carry an extra phone among those without an Android phone, or being too busy. Others did not answer their phone or return calls after receiving the written information. Five participants withdrew from the study after initial consent by not responding to any of the momentary assessments. When contacted, these patients indicated that they had changed their mind. Three participants were excluded due to very low response rates (<35%), and three others did not answer the retrospective ThyPRO questionnaire. Analyses were performed on data from the remaining 83 participants. In accordance with thyroid epidemiology, the majority of participants were women (87%), with a median age of 49 years (Table 1). Twenty-five participants owned an Android smartphone, and the app was installed on their own device. The remaining 58 participants borrowed an Android smartphone with the app. Each participant received 87 notifications to answer momentary assessments during the study period. This added up to a total of 7221 notifications, of which 5908 were answered within the one-hour entry period, yielding a response rate of 82% in total. Participants using their own smartphone had a response rate of 86% (1876/2175), while participants borrowing a smartphone had a response rate of 80% (4032/5046). For baseline characteristics of nonparticipants and dropouts, see Supplementary Table S1 (Supplementary Data are available online at
Table 1. Clinical Characteristics
Footnotes to Table 1:
Information on fT4 was missing for five patients.
Weight-reducing product containing iodine.
TSH, thyrotropin (mIU/L); fT4, free thyroxine (pmol/L); ATD, antithyroid drug.
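The notification counts and response rates reported for the participants can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text; the variable names are illustrative.

```python
# Figures taken from the text: 83 participants, 87 prompts each.
participants_own, participants_borrowed = 25, 58
prompts_each = 87

sent_own = participants_own * prompts_each            # 2175
sent_borrowed = participants_borrowed * prompts_each  # 5046
sent_total = sent_own + sent_borrowed                 # 7221

answered_own, answered_borrowed = 1876, 4032
answered_total = answered_own + answered_borrowed     # 5908

rate_total = 100 * answered_total / sent_total
rate_own = 100 * answered_own / sent_own
rate_borrowed = 100 * answered_borrowed / sent_borrowed
print(round(rate_total), round(rate_own), round(rate_borrowed))
```

The subgroup counts add up to the reported totals, and the rounded percentages match the response rates given in the text.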
Descriptive analysis
Daily mean momentary scores from day 0 to 28 are shown in Figure 1 for all four scales. The Tiredness scale received the highest score, followed by the Emotional Susceptibility scale, while the scale with lowest score was the Anxiety scale. Figure 1 illustrates the change in quality of life over time: decreasing scores indicate improving quality of life during the four weeks. Figure 2 shows mean momentary scale scores over 28 days from six patients. The particular graphs were chosen as examples to illustrate some different trends observed in the study and to show how the measurements can be displayed to clinicians and patients in a clinical setting.

Figure 1. Daily mean momentary scores with confidence intervals over a four-week period for 83 patients with hyperthyroidism. Scale scores range from 0 to 100.
Means and standard deviations of momentary and retrospective ratings are shown in Table 2, as well as median, scoring range, correlations, and mean differences between the two measures. Table 2 shows that large parts of the scoring spectrum of both measures were used at some point during the study. Furthermore, Table 2 shows that the momentary ratings were to some extent skewed toward lower ratings. A log transformation did not alter this. A high proportion of the participants had scores on the lower end of the Anxiety scale and Hyperthyroid Symptoms scale (i.e., floor effects).
Table 2. Mean Scale Score Over 28 Days, Median, Correlation, and Mean Difference Between Retrospective and Momentary Ratings for 83 Patients with Hyperthyroidism
Footnotes to Table 2:
Small effect size (0.2–0.5).
Moderate effect size (0.5–0.8).
Correlation and mean difference significantly (p < 0.001) different from 0.
SD, standard deviation.
The retrospective ratings were strongly correlated with mean momentary ratings on all scales, ranging from 0.74 for the Hyperthyroid Symptoms scale to 0.88 for the Emotional Susceptibility scale (Table 2). However, momentary and retrospective ratings did not provide identical scores. Retrospective ratings were significantly higher (more symptoms) than mean momentary ratings on all scales. The smallest difference was found on the Tiredness scale, where retrospective ratings measured four points higher than momentary ratings [confidence interval (CI) 1–6], equivalent to a small difference (effect size 0.2–0.5), and the largest difference was found on the Hyperthyroid Symptoms scale, which measured 11 points higher [CI 8–13], equivalent to a moderate difference (effect size 0.5–0.8). On the Anxiety scale and the Emotional Susceptibility scale, retrospective ratings were seven points higher than momentary ratings [CI 4–9], equivalent to small differences.
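The effect sizes quoted above follow from dividing each mean difference by the assumed scale standard deviation of 20. The sketch below reproduces that arithmetic with the rounded differences reported in the text. How the shared category boundaries (0.5 and 0.8) are assigned is an assumption, as is the label "negligible" for values below 0.2.

```python
ASSUMED_SD = 20.0  # average ThyPRO scale SD taken from previous studies

def cohens_d(mean_difference, sd=ASSUMED_SD):
    """Cohen's d as mean difference divided by standard deviation."""
    return mean_difference / sd

def effect_label(d):
    """Classify |d|; the boundary handling is an assumption."""
    d = abs(d)
    if d < 0.2:
        return "negligible"
    if d < 0.5:
        return "small"
    if d <= 0.8:
        return "moderate"
    return "large"

# Rounded mean differences (retrospective minus momentary) from the text.
differences = {
    "Tiredness": 4,
    "Anxiety": 7,
    "Emotional Susceptibility": 7,
    "Hyperthyroid Symptoms": 11,
}
labels = {scale: effect_label(cohens_d(diff)) for scale, diff in differences.items()}
for scale, diff in differences.items():
    print(f"{scale}: d = {cohens_d(diff):.2f} ({labels[scale]})")
```

Because the differences are rounded to whole points here, the Tiredness value sits exactly on the lower boundary of the small category.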
In Bland–Altman plots (Fig. 3), differences tended to increase with higher mean score (more symptoms) on all scales, even after logarithmic transformation. This tendency was especially apparent in the Hyperthyroid Symptoms scale and the Anxiety scale. Graphically, the Tiredness scale showed the highest level of agreement, with relatively evenly distributed differences. However, inspection of the limits of agreement showed that the agreement was far from satisfactory. The limits of agreement were −0.11 and 0.18. The antilogs of these limits are 0.78 and 1.5, meaning that for 95% of cases, retrospective ratings will be between 0.78 and 1.5 times the mean momentary rating, that is, retrospective ratings will range from 22% below to 50% above mean momentary ratings. Upper limits for the remaining scales are even higher, with retrospective ratings reaching up to 100% above mean momentary ratings on the Hyperthyroid Symptoms scale.
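The back-transformation of the Tiredness limits of agreement can be reproduced directly. The base of the logarithm is not stated in the text; base 10 is assumed here because it reproduces the reported antilogs of 0.78 and 1.5, whereas the natural logarithm would not.

```python
# Limits of agreement on the log scale, as reported for the Tiredness scale.
lower_log, upper_log = -0.11, 0.18

# Assumed base-10 antilog: ratio of retrospective to mean momentary score.
lower_ratio = 10 ** lower_log
upper_ratio = 10 ** upper_log

percent_below = (1 - lower_ratio) * 100
percent_above = (upper_ratio - 1) * 100
print(round(lower_ratio, 2), round(upper_ratio, 2))
print(round(percent_below), round(percent_above))
```

Without intermediate rounding, the upper ratio is slightly above 1.5, so the figure of 50% in the text corresponds to the ratio after rounding to one decimal.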

Figure 3. Bland–Altman plots of the four ThyPRO scales. The horizontal axis is the mean score of retrospective and mean momentary ratings. The vertical axis is the difference in scores (retrospective minus momentary). The solid horizontal line is the mean difference, whereas the broken lines represent the limits of agreement (mean difference ± 2 SD of the differences). Scale scores were transformed logarithmically.
Peak effect and end effect
The transformation of scale scores to e^x had no effect on three scales; retrospective ratings were still significantly higher than mean momentary ratings (Hyperthyroid Symptoms: 31 vs. 19, p < 0.01; Emotional Susceptibility: 41 vs. 34, p < 0.01; Anxiety: 25 vs. 12, p = 0.02). However, after transforming the Tiredness scale, a nonsignificant mean difference was found between the two measures (51 vs. 50, p = 0.54).
Figure 4 shows Pearson's correlations between retrospective ratings and weekly momentary ratings for each of the four scales. Correlations for the Emotional Susceptibility scale and the Anxiety scale were both increasing, indicating a higher correlation for later assessments than for earlier assessments, which is consistent with the end effect hypothesis. The two remaining scales did not show a similar tendency.

Figure 4. Correlations with confidence intervals between retrospective ratings and weekly momentary ratings for 83 patients with hyperthyroidism.
Discussion
This is the first study to investigate ecological momentary assessments in the field of thyroidology. The methodology was designed to assess patient-reported outcomes more accurately than retrospective ratings, based on the hypothesis that momentary assessments are free from recall bias. However, knowledge about the relationship between classical retrospective measures and momentary measures is still limited. In this study, retrospective items and momentary items from the widely used ThyPRO survey were tested head-to-head in 83 hyperthyroid patients expected to undergo treatment. Retrospective and momentary measures showed strong correlations, suggesting that patients with high retrospective ThyPRO scores also had high momentary scores. However, the two measures provided significantly different results and low level of agreement across all four scales.
The study confirms and extends the findings from studies of pain and fatigue, that is, retrospective ThyPRO ratings were higher than mean momentary ratings (16–18). The magnitude of this difference varied depending on the scale. Cohen's d was used to examine the magnitudes, since the minimal important difference has not been determined for ThyPRO. Small effect sizes were found on three scales, and a moderate effect size was found on one scale. Thus, the observed differences were considered clinically relevant (29). A possible explanation for the higher retrospective ratings is that the momentary assessment sampling density of three per day was too low to capture relevant symptom peaks. However, other studies have shown the same discrepancy, despite using much higher sampling densities (16–18,20). Perhaps participants are more inclined to ignore or decline an assessment when symptoms are at their worst, which would result in missing peaks. With a response burden perceived as too high, some participants might rush through the questions, selecting the option indicating highest quality of life. Finally, a previous cognitive interview study converting retrospective items to momentary versions found that changing the reference period could change the way participants perceived the meaning of the item. For example, when a retrospective item assessing increased appetite was converted, the momentary version was understood as asking about being hungry (22). Items behaving this way were not chosen for the current study. However, some items may act differently in a real-world setting than in a cognitive interview situation, so a change in meaning may have been overlooked.
Bland–Altman plots showed low levels of agreement, perhaps indicating that retrospective and momentary ratings are measuring different concepts. Recent studies suggest that we are psychologically composed of more than one self, for example the experiencing self and the remembering self (31). Different ways of measuring a concept may evoke responses from different selves. Experience measured momentarily evokes the experiencing self and is more strongly correlated with physiological processes (e.g., stress response) than when measured retrospectively (32). On the other hand, experience measured retrospectively evokes the remembering self and was found to be more important to health decision making (20). The choice of measurement method should therefore be defined by the study aims and based upon which aspects of symptoms are most relevant. For example, if we are studying compliance with treatment, retrospective measures should most likely be the method of choice.
By transforming data to give added weight to higher scores, the study showed that retrospective ratings on the Tiredness scale may have been influenced by the peak effect, while ratings on the remaining scales were not. A possible explanation is that the Tiredness scale has the highest variability (largest standard deviation), which is why the higher scores (peaks) are relatively larger and thus more memorable compared to the other scales. Another explanation is that tiredness is a more consciously perceived outcome compared to anxiety and emotional susceptibility, which are more subconsciously perceived, perhaps making them less prone to peak effects. However, items from the Hyperthyroid Symptoms scale are likewise consciously perceived and should also be influenced by a peak effect if this were the sole explanation.
It was found that retrospective ratings on two of four scales may have been influenced by the end effect. The presence of such an effect entails that when a physician or researcher asks about symptoms over the previous month, the patient's response is actually mainly based on the most recent days. Momentary assessments were aggregated for each week to base the correlations on a sufficiently high number of assessments.
A strength of this study is that the method of ecological momentary assessments was tested using a validated disease-specific PROM with items from multiple scales, thereby increasing the generalizability of the results. In addition, it is important to note that data were collected while patients were undergoing treatment and changes in symptoms were expected, making the results transferable to a clinical trial setting. Finally, momentary assessments were collected using an innovative measurement technology allowing collection on patients' own mobile devices.
However, the study also has some limitations. It was not feasible to conduct the study with all ThyPRO items, which is why only a subset of items was investigated. However, the selection process was conducted thoroughly to select the most suitable items (22). The smartphone app only functioned on Android smartphones, making it necessary for more than two thirds of participants to borrow a project smartphone and others to decline the invitation. Although response rates were lower for the group borrowing a smartphone, response rates were still satisfactory. A few declined to participate in the study because it involved a smartphone in general. It is possible that this group would have responded differently and that the method may be difficult to implement in this subgroup. This study used a sampling density of three per day, which was considered sufficient to capture daily symptom variation and was found to interfere very little with daily activities in a previous study (33). Increasing the sampling density would make data more representative. However, this would simultaneously increase participant burden. It is possible that answering the same questions during the preceding 28 days could influence the response to the retrospective survey simply because the participant notices which response option is most frequently used. Perhaps this effect has made retrospective ratings more similar to momentary ratings and the measurement differences less pronounced. Figure 1 shows a trend toward improved quality of life during the 28 days. This was expected, since treatment had been initiated before or during the 28 days in the majority of patients. Relative to a situation with stable thyroid function and thus presumably stable quality of life, perhaps this change over time puts additional strain on memory, thereby making retrospective ratings less reliable. However, this is a real-life challenge that a measurement method should be able to withstand. 
The tested momentary items mainly involve specific disease manifestations, as well as general symptoms assessable at all times. In contrast, rare symptoms and context specific items (e.g., work function) were not included. For these aspects to be properly represented, the sampling density needs to be very high, illustrating that not all aspects may be appropriate for momentary assessment.
Future research on retrospective and momentary ThyPRO ratings should determine if the measurement differences are dependent on the length of the recall period. Broderick et al. showed that as the reporting period increased, the difference between retrospective and momentary pain ratings increased (18). However, investigations of sleep reports and interference with activities due to pain and fatigue did not show the same tendency (34,35). For surveillance of treatment effect, PROMs should be able to detect and respond to clinically relevant changes. Thus, the responsiveness of both measures should be investigated to find the most suitable measure for this task.
In conclusion, retrospective and mean momentary ThyPRO ratings had high correlations but provided significantly different results. Retrospective ratings were higher on the four tested scales compared to mean momentary ratings. The presence of the peak effect was supported when measuring tiredness but not in the remaining three scales. There was partial support for the presence of an end effect, which was found in two of four scales. The results do not put the validity of retrospective measurements in question. However, the observed differences in scores were of clinically relevant magnitudes. So, the two measures should not be used interchangeably. Research elaborating which method is most responsive to clinical change may determine which method is most suitable for comparative effectiveness and efficiency studies.
Acknowledgments
This study was funded by The Danish Agency for Science, Technology, and Innovation (grant 271-09-0143). The publication was funded by the Agnes and Knut Mørk Foundation. U.F.R.'s research salary is sponsored by a grant from Novo Nordisk Foundation.
Author Disclosure Statement
No competing financial interests exist.
References
