Abstract
Background:
The thyroid-related patient-reported outcome measure ThyPRO has become the gold standard for measuring thyroid-related quality of life and uses a 4-week recall period. The impact of the length of recall is unresolved. To minimize recall bias, the US Food and Drug Administration has argued in favor of short recall periods or measures describing current states. We investigated whether a 1-week recall version of ThyPRO was less prone to recall bias than the original ThyPRO, using averaged momentary ThyPRO measurements as the hypothesized true mean of patients' symptoms.
Methods:
Patients newly diagnosed with thyrotoxicosis were included (N = 122). During a 28-day study period, participants answered momentary questions three times daily via a smartphone, weekly retrospective surveys with a 1-week recall period, and the original survey with a 4-week recall period on day 28. Twelve ThyPRO items from four multi-item scales were used. Mean momentary ratings for each scale were compared with recall ratings of 1- and 4-week periods, respectively.
Results:
The mean momentary ratings were highly correlated with retrospective ratings, and the correlations remained rather constant when the reporting period was shortened from four weeks to one week. We found consistently lower scores (i.e., better thyroid-related quality of life) on momentary ratings compared with retrospective ratings. The mean differences between momentary ratings and retrospective ratings were similar for both recall periods. The original 4-week ThyPRO accurately summarized the mean of all 1-week ThyPROs.
Conclusions:
Shortening the recall period of ThyPRO from four weeks to one week was not associated with less recall bias within this subset of items. Nor did 1-week recall seem to compromise the accuracy of ThyPRO. Thus, either version of ThyPRO can be used in future studies.
Introduction
Increased focus on patient-centered care has driven the demand for accurately quantifying the impact of disease and treatment on patients' lives, known as health-related quality of life (HRQoL). To assess HRQoL, we rely on patients' self-reports, also known as patient-reported outcomes (PROs). PROs have become important endpoints in clinical studies and clinical practice, as well as in regulatory contexts (1,2). From a health care policy perspective, PROs are also used in cost-effectiveness evaluations of different interventions (3). This increasingly pertinent role of PROs necessitates PRO instruments of the highest quality. Several studies have suggested that retrospective PRO instruments may be inherently biased because respondents are unable to accurately remember past experiences (4–8). For example, people are predominantly influenced by the most recent experiences and the most prominent experiences (9–11). In response, the US Food and Drug Administration (FDA) has issued guidelines on PRO usage to support labeling claims (1), recommending items with short recall periods or items describing current states, based on the logic that the shorter the recall, the less bias, and thereby the higher the quality. It is clear that by removing recall from the equation, recall bias can be avoided. However, assessing current states is a demanding task for both trial personnel and participants since repeated measurements are needed to validly represent a longer time period, especially if symptoms are rare or fluctuating (4,12).
Presence of recall bias has been shown already one hour after an experience (10), and Broderick et al. have shown that correlations between momentary ratings and retrospective ratings of pain and fatigue gradually decline as recall increases from one day to three and seven days, indicating that recall bias is less prominent for shorter recall periods. However, this was not without ambiguity since correlations were higher for 28-day recall compared with 7-day recall (13).
The thyroid-related patient-reported outcome measure ThyPRO (14) has become the gold standard for measuring thyroid-related quality of life (15) and uses a 4-week reference period. We have previously shown that retrospective ratings and momentary ratings provide different results and should not be used interchangeably (16): for the four tested scales, retrospective ratings were higher on average (i.e., indicating more symptoms) than mean momentary ratings. This finding is in line with other studies on pain and fatigue (9,13,17,18). It remains to be elucidated whether the retrospective recall period can validly be shortened for ThyPRO and, if so, whether a shorter recall period reduces recall bias.
This study aims to investigate whether a 1-week recall period is superior to a 4-week recall period in terms of recall bias. For this purpose, we hypothesized that averaged momentary ThyPRO ratings were the true mean of patients' symptoms and compared mean differences and correlations between corresponding ratings with a 1-week version of the ThyPRO (ThyPRO-1W) and the original 4-week ThyPRO (ThyPRO-4W).
Methods
Participants
The present study is a continuation of a previous study of patients newly diagnosed with thyrotoxicosis, including Graves' disease, toxic nodular goiter, subacute thyroiditis, and drug-induced thyrotoxicosis (16). Patients were included between September 2014 and September 2018 from endocrine outpatient clinics at four Copenhagen University Hospitals: Rigshospitalet, Gentofte, Herlev, and Bispebjerg. Patients were eligible if they were at least 18 years of age, understood Danish, had serum thyrotropin (TSH) concentrations <0.1 mIU/L within the last month, and underwent or were scheduled to undergo treatment for thyrotoxicosis (antithyroid drugs, radioiodine, or surgery). If treatment had already been initiated, patients had to have elevated free thyroxine levels (fT4; reference range 12.0–22.0 pmol/L) in addition to a TSH <0.1 mIU/L, as a marker of current disease activity. Exclusion criteria were pregnancy and major comorbidities suspected to impact quality of life substantially (e.g., cancer or congestive heart failure). Eligible patients were initially contacted by phone, and patients interested in participation received written information via e-mail.
Assessment of PROs
The ThyPRO survey assesses HRQoL in patients with benign thyroid disease. It consists of 13 multi-item scales and 1 single-item scale. Each scale ranges from 0 to 100, with higher scores indicating more symptoms or more impact of thyroid disease. The original ThyPRO uses a 4-week recall period (hence, it is denoted as ThyPRO-4W in the present study). Due to the wording and retrospective nature of ThyPRO items, the 4-week recall period could be substituted with a 1-week recall period without any difficulty to construct a ThyPRO with a 1-week recall period (denoted as ThyPRO-1W). A momentary version of ThyPRO was previously developed, consisting of 12 items from four different scales (Hyperthyroid Symptoms, Tiredness, Anxiety, and Emotional Susceptibility) (16,19).
Momentary ThyPRO (ThyPRO-EMA [ecological momentary assessments]) ratings were collected for 28 days via a smartphone app (“EMA-P1”). Participants who owned an Android smartphone installed the app on their device, while participants without an Android smartphone borrowed one. The app featured repeated auditory prompts three times daily at semi-randomized time points notifying participants to complete a survey within the app. All 12 items were asked each time. If a notification occurred at an inconvenient time, the survey could be postponed, which would silence additional auditory prompts. From the time of notification, the survey remained open for completion for one hour only, after which it was removed. The app enabled adjustment of the diurnal rhythm from day to day, to ensure that every waking hour was represented. The app was integrated with a trial management system, PROgmatic (20), which enabled automatic distribution of retrospective questionnaires via e-mail and daily monitoring of response rates. The system recorded time and date of each data entry in the app.
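The scheduling logic of the app is not specified in this text. As a purely illustrative sketch, one way to obtain three semi-randomized prompts per day is to split the waking day into three equal windows and draw one random time within each. The function name, the waking hours, and the equal-window design below are assumptions made for illustration, not a description of EMA-P1; only the three daily prompts and the one-hour response window are taken from the text.

```python
import random
from datetime import datetime, timedelta


def daily_prompt_times(day_start, day_end, n_prompts=3, rng=None):
    """Draw one random prompt time in each of n_prompts equal-width windows.

    Hypothetical scheme: the study app's real algorithm is not described.
    """
    rng = rng or random.Random()
    window = (day_end - day_start) / n_prompts
    times = []
    for i in range(n_prompts):
        window_start = day_start + i * window
        # Uniform offset inside the current window, in seconds.
        offset = rng.random() * window.total_seconds()
        times.append(window_start + timedelta(seconds=offset))
    return times


# Assumed waking hours of 07:00 to 22:00 on an arbitrary date.
wake = datetime(2020, 1, 1, 7, 0)
sleep = datetime(2020, 1, 1, 22, 0)
prompts = daily_prompt_times(wake, sleep, rng=random.Random(42))

# Each survey closes one hour after its prompt, as stated in the text.
deadlines = [t + timedelta(hours=1) for t in prompts]
```

Shifting `day_start` and `day_end` from day to day would correspond to the adjustable diurnal rhythm mentioned above.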
On days 7, 14, 21, and 28, participants received an electronic ThyPRO survey via e-mail containing the same 12 items, but with a 1-week recall period (ThyPRO-1W). On day 28, an additional electronic ThyPRO survey with a 4-week recall period (ThyPRO-4W) was administered. Thus, the 28-day study period was covered by three different measures, as illustrated in Figure 1.

Figure 1. Overview of the four-week study period covered by three different measurement modalities: Momentary ratings (ThyPRO-EMA), retrospective ThyPRO-1Ws, and a retrospective ThyPRO-4W. The figure is constructed with hypothetical data for illustrative purposes. ThyPRO-1W, 1-week ThyPRO; ThyPRO-4W, 4-week ThyPRO; EMA, ecological momentary assessments.
Data analysis
Response rates of ThyPRO-EMA were calculated as the number of completed notifications divided by the total number of notifications. Participants with response rates below 35% were excluded from further analysis. Only retrospective ratings completed on the day of administration or the following day were included in the analyses.
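The two inclusion rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the participant identifiers, and the counts are hypothetical, and the total of 84 prompts simply reflects three prompts per day over 28 days. Only the 35% threshold and the same-day-or-next-day rule come from the text.

```python
from datetime import date

MIN_RESPONSE_RATE = 0.35  # exclusion threshold stated in the text


def response_rate(completed, sent):
    """Completed notifications divided by the total number of notifications."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return completed / sent


def included_participants(counts, threshold=MIN_RESPONSE_RATE):
    """Return ids whose momentary response rate is at or above the threshold.

    counts maps a participant id to a (completed, sent) tuple.
    """
    return sorted(
        pid for pid, (completed, sent) in counts.items()
        if response_rate(completed, sent) >= threshold
    )


def is_timely(administered, completed):
    """True if a retrospective survey was completed on the day it was sent or the next day."""
    return 0 <= (completed - administered).days <= 1


# Hypothetical counts: 84 prompts = 3 per day for 28 days.
counts = {"P01": (70, 84), "P02": (25, 84), "P03": (30, 84)}
kept = included_participants(counts)
```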
Both retrospective ratings and momentary ratings were calculated at scale level, that is, one score for each of the four ThyPRO three-item scales applied in the present study. The mean ThyPRO-EMA scores corresponding to each (retrospective) ThyPRO-1W period were calculated. Likewise, the mean ThyPRO-EMA score was calculated for the entire study period corresponding to (the retrospective) ThyPRO-4W. The means of ThyPRO-EMA scores were calculated using all available assessments for the relevant time period. For the purpose of the present study, momentary ratings were hypothesized as being the true measure of patients' symptoms. Thus, each retrospective ThyPRO measure was compared with the mean of all momentary ratings for the same period: the ThyPRO-4W was compared with the mean of all ThyPRO-EMA collected over the four-week study period, and each ThyPRO-1W rating was compared with the corresponding mean ThyPRO-EMA for that week. Additionally, the original (ThyPRO-4W) was compared with ratings collected by ThyPRO-1Ws, both each weekly ThyPRO-1W separately and the mean of all four. Mean measurement differences were tested using the paired t-test. Associations were assessed by calculating Pearson correlation coefficients between retrospective and mean momentary measurements. The mean correlation for all ThyPRO-1Ws was found using a minimum-variance unbiased estimator described by Olkin and Pratt (21).
In this estimator, r is a sample correlation, k is the number of individual sample correlations, and n_i is the sample size of the ith sample.
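The estimator itself is not reproduced in this text. As a hedged illustration, a commonly cited approximation to the Olkin and Pratt correction adjusts each correlation as G(r_i) = r_i [1 + (1 - r_i^2) / (2 (n_i - 3))] and then takes the sample-size-weighted mean over the k samples. Whether this is the exact form used in the study is an assumption: the exact estimator involves a hypergeometric function, and sources differ on the small constant in the denominator. The function names and the input values below are hypothetical.

```python
def olkin_pratt_adjust(r, n):
    """Approximate Olkin-Pratt bias correction of one sample correlation.

    Assumed form: r * (1 + (1 - r**2) / (2 * (n - 3))), valid for n > 3.
    """
    if n <= 3:
        raise ValueError("n must exceed 3 for this approximation")
    return r * (1 + (1 - r ** 2) / (2 * (n - 3)))


def mean_correlation(rs, ns):
    """Sample-size-weighted mean of the adjusted correlations of k samples."""
    if len(rs) != len(ns) or not rs:
        raise ValueError("rs and ns must be non-empty and of equal length")
    weighted_sum = sum(n * olkin_pratt_adjust(r, n) for r, n in zip(rs, ns))
    return weighted_sum / sum(ns)


# Hypothetical weekly correlations and sample sizes (k = 4); not study data.
weekly_r = [0.70, 0.66, 0.68, 0.72]
weekly_n = [120, 118, 119, 117]
r_bar = mean_correlation(weekly_r, weekly_n)
```

With sample sizes above 100, the correction changes each correlation only in the third decimal place, so the result is close to a simple weighted average.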
Ethical considerations
The study was performed in accordance with the Declaration of Helsinki. The study protocol was reviewed by the local ethics committee (reg. no. H-A-2009-FSP23). According to Danish law, questionnaire studies do not require and thus cannot obtain formal approval by ethical committees. Informed consent was obtained from all individual participants. The study was approved by the Danish Data Protection Agency (local identifier at Rigshospitalet: 13-30-1092).
Results
Participants
A total of 379 eligible patients with available contact information in their patient records were invited to participate in the study, 133 of whom accepted the invitation. Unwillingness was mainly due to three reasons. First, patients were too busy. Second, patients with iPhones were unwilling to carry a second phone. Third, some did not experience any subjective symptoms of thyrotoxicosis and thus felt that the study was of no relevance to them. Some patients did not respond after receiving the written information. Seven participants changed their mind shortly after their initial consent and withdrew from the study. Four participants were excluded due to momentary response rates below 35%. Data from the remaining 122 participants were analyzed. The majority were women (86%), with a median age of 49 years (range: 20–79 years) (Table 1). Response rates for the retrospective questionnaires were 98% and 95% for the ThyPRO-4W and ThyPRO-1W, respectively. Each participant received 84 momentary measurement notifications during the study period. On average, 79% of these were answered.
Table 1. Clinical Characteristics
Information on fT4 was missing for four patients.
Weight-reducing product containing iodine.
ATD, antithyroid drug; fT4, free thyroxine (pmol/L), reference range: 12.0–22.0 pmol/L; TSH, serum thyrotropin (mIU/L), reference range: 0.40–4.80 mIU/L.
Retrospective ratings compared with mean momentary ratings
Correlations between retrospective ratings and momentary ratings for the corresponding period are presented in Figure 2. The Hyperthyroid Symptoms scale had the lowest correlation coefficients: 0.64–0.75 (mean 0.67; confidence interval [CI 0.62–0.72]) for ThyPRO-1Ws and 0.66 [CI 0.54–0.75] for ThyPRO-4W. The Emotional Susceptibility scale had the highest correlation coefficients with a mean of 0.88 [CI 0.86–0.90] for ThyPRO-1W and 0.88 [CI 0.83–0.91] for ThyPRO-4W.

Figure 2. Correlations between mean momentary ratings and retrospective ratings. For each scale, the first four data points show the correlations between weekly mean momentary ThyPRO and the corresponding 1-week ThyPROs. The fifth data point, “mean,” shows the correlation between the mean momentary ThyPRO for the entire 4-week period and the mean of the four 1-week ThyPROs. Finally, the sixth data point shows the correlation between the mean momentary ThyPRO for the entire 4-week period and the 4-week ThyPRO. Please note that the Y-axis starts at 0.5.
Table 2 shows the mean rating of each ThyPRO-1W compared with the corresponding mean of ThyPRO-EMA. ThyPRO-1Ws scored between 4 and 13 points higher than the corresponding ThyPRO-EMA, that is, they measured more impact of disease. Momentary ratings were fairly stable over the period, with weekly ratings decreasing ≤2 points for all scales from week 1 to week 4. In the same period, ThyPRO-1Ws decreased between 4 and 7 points, indicating an improvement in HRQoL. Thus, there was a small improvement in the ThyPRO-1W compared with no change or an even smaller improvement in the ThyPRO-EMA. Table 3 shows the mean score of all ThyPRO-1Ws, as well as ThyPRO-4W and mean ThyPRO-EMA covering the entire study period. The differences between the ThyPRO-EMA and mean ThyPRO-1W were between 6 and 11 points, with higher ratings on ThyPRO-1Ws. Similarly, the differences between the ThyPRO-EMA and ThyPRO-4W were between 4 and 10 points.
Table 2. Scores in 1-Week ThyPROs and Mean Momentary ThyPRO Ratings (Ecological Momentary Assessments) for the Corresponding Week, and Differences Between 1-Week ThyPRO and Mean Momentary ThyPRO Ratings for the Corresponding Week
Mean difference significantly different from 0 (p < 0.005).
1W, 1-week ThyPRO; EMA, ecological momentary assessments; SD, standard deviation.
Table 3. Mean Score of All 1-Week ThyPROs, 4-Week ThyPRO, and Mean Momentary Ratings Covering the Entire Study Period, As Well As Differences Between the Measurement Methods
Mean difference significantly different from 0 (p < 0.005).
ThyPRO-1W, 1-week ThyPRO; ThyPRO-4W, 4-week ThyPRO; EMA, ecological momentary assessments.
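The paired comparisons summarized in Tables 2 and 3 rest on per-participant differences between a retrospective score and the mean momentary score for the same period. The sketch below shows that calculation for one scale, assuming a standard paired t statistic. The function name and the scores are hypothetical and are not study data. The p value is omitted because it requires the t distribution, which is not in the Python standard library.

```python
from math import sqrt
from statistics import mean, stdev


def paired_difference(retrospective, momentary):
    """Return (mean difference, t statistic, degrees of freedom) for paired scores."""
    if len(retrospective) != len(momentary):
        raise ValueError("paired samples must have equal length")
    diffs = [r - m for r, m in zip(retrospective, momentary)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    d_mean = mean(diffs)
    return d_mean, d_mean / (sd / sqrt(n)), n - 1


# Hypothetical 0-100 scale scores for five participants; not study data.
thypro_1w = [40, 55, 30, 62, 48]
ema_mean = [33, 47, 27, 50, 41]
mean_diff, t_stat, df = paired_difference(thypro_1w, ema_mean)
print(f"mean difference = {mean_diff:.1f}, t = {t_stat:.2f}, df = {df}")
```

A positive mean difference here corresponds to the pattern reported above: higher retrospective than momentary scores, that is, more reported impact of disease.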
1-week ThyPRO versus 4-week ThyPRO
Table 3 further shows the differences in scale scores measured by the ThyPRO-4W and mean ThyPRO-1W. Only ratings on the Tiredness scale differed significantly, with the mean ThyPRO-1W score being 3 points higher (effect size = 0.14). Figure 3 shows the difference between weekly ThyPRO-1Ws and the ThyPRO-4W. It shows that ThyPRO-1W ratings decreased over the study period. For weeks 1 and 2, ThyPRO-1W ratings were higher on three and two scales, respectively, compared with ThyPRO-4W. In contrast, ThyPRO-1W ratings were lower on two scales for week 4.

Figure 3. Differences between each of the 1-week ThyPROs and the 4-week ThyPRO. A positive difference in the four scale scores means more impact of disease when evaluated with the 1-week ThyPRO compared with the 4-week ThyPRO survey. *Significant difference between 1-week ThyPRO and 4-week ThyPRO surveys (p < 0.05).
Discussion
This is the first study to investigate the impact of length of retrospective reporting periods on thyroid-related quality of life, as measured by the ThyPRO questionnaire. ThyPRO was developed with a 4-week reporting period. However, in some instances, clinicians or researchers might wish to alter the reporting period to fit a certain protocol. The aim of the present study was to investigate whether a shorter reporting period could be applied without compromising the accuracy of ThyPRO, and further whether it could reduce recall bias, as suggested by the FDA. Retrospective ratings with two different reporting periods were compared with momentary assessments to explore whether a shorter reporting period would provide a better representation of momentary assessments, which in our study were regarded as the true measure of patients' symptoms.
The study shows that mean momentary assessments were highly correlated with retrospective ratings and, more importantly, that correlations remained rather constant when decreasing the reporting period from four weeks to one week. It has previously been shown that retrospective ratings indicated worse HRQoL (i.e., higher ratings) than momentary ThyPRO assessment, with differences of small-to-moderate effect sizes (16). Thus, the higher ratings for ThyPRO-4W compared with ThyPRO-EMA were already established. Interestingly, we found that the mean of ThyPRO-1Ws differed from the momentary mean to the same extent as the ThyPRO-4W did. According to the reported correlations and the relationships with momentary measurements, the ThyPRO-1W and ThyPRO-4W are equally good representations of aggregated momentary measurements.
ThyPRO-1W was also compared directly with ThyPRO-4W, which enabled us to investigate agreement between symptom levels assessed by ThyPRO-4W and ThyPRO-1Ws. Even though our results showed that ThyPRO-1W ratings decreased over the study period, indicating clinical improvement, the average symptom level was accurately summarized in the ThyPRO-4W: patients provided roughly the same ThyPRO-4W ratings as the mean of all four ThyPRO-1Ws (Table 3 and Fig. 3). Only the Tiredness scale showed a statistically significant but very small difference of three points. Therefore, our data do not support the notion that shortening the reporting period of the ThyPRO from four weeks to one week will reduce recall bias, within this subset of items, nor does a 1-week reporting period seem to compromise the accuracy of ThyPRO. Thus, either version of ThyPRO can be implemented in future studies. However, great caution should be taken when comparing the two versions, especially if symptoms are not in a steady state, for example, if surveys are administered during the beginning of the disease or during treatment. A comparison between a 4-week rating and a 1-week rating provided in the first week versus a 1-week rating provided in the last week would be very different because of the change in symptoms. It is worth noting that longer reporting periods allow for symptoms with rare occurrence to be incorporated in the rating, which is why the ThyPRO-4W might be preferable in situations where rare symptoms are investigated.
The high agreement found between ThyPRO-1W and ThyPRO-4W might, to some extent, reflect that the participants were taking part in a research project and were therefore more attentive to their responses. Participants may have tracked their changes more thoroughly than they would have otherwise, and they were thus capable of providing an accurate estimate of their average symptoms. However, if we follow this idea, it is surprising that momentary ratings were not closer to retrospective ratings since we would assume participants were attentive to these responses as well.
The majority of participants (57%) had already initiated treatment with antithyroid drugs at the time of inclusion. The remaining participants either initiated antithyroid drug treatment during the study period or were scheduled for later treatment (radioiodine or surgery). Consequently, some participants were expected to experience improvement in HRQoL during the study period. Others had perhaps already experienced improvement before our measurements began, while others had yet to experience improvement. This made it difficult to predict how much of an improvement we would find during the study period. Surprisingly, we found that momentary measurements indicated no particular change in HRQoL during the study period when aggregating scores weekly, while ThyPRO-1W results indicated an improvement in HRQoL. It is difficult to ascertain if this difference was due to higher responsiveness of the ThyPRO-1W or to retrospective ratings being more prone to expectations of a treatment effect, that is, participants remember their responses from last week and reason that they ought to feel a bit better, or whether there might be other explanations. In the present study, repeated momentary measurements were chosen as a measure free from recall bias; however, the method may not be free from other measurement errors. Repeated momentary measurements have been applied as a comparator in several previous studies, primarily based on theoretical considerations, but have not been validated as such in this specific study (9,13,17,18,22).
As described elsewhere (16), it is possible that when participants' symptoms are worst, the task of responding to the smartphone app becomes too inconvenient, and some might ignore notifications or run through the questions without consideration. This would lead to lower mean momentary ratings if participants are more inclined to answer when feeling good. This was one of the reasons why the relatively short response window of one hour was created, so that it would be difficult for participants to wait until feeling better to respond.
When using any PRO measure, reactivity is a risk, that is, the act of monitoring symptoms may affect symptom levels in either direction. When dealing with repeated momentary measures, participants are repeatedly confronted with possible symptoms, which could be considered a form of intervention and influence participants to perceive higher symptom levels and respond more negatively. Such an effect could potentially mask an otherwise actual improvement in HRQoL. However, in a previous study on momentary pain, reactivity was minimal (23).
Our results might suggest that, for investigating change in symptoms, the ThyPRO-1W is preferable to momentary ratings. To verify this, a future study should examine the responsiveness of the different types of measures to determine if one measure is superior in this regard. All three types of measurements should be administered before and after a demonstrable change. Based on accumulated knowledge from such studies evaluating a range of relevant changes, the most responsive measure should be used to investigate change in the future. It should also be explored if any differences between measurement methods exist depending on treatment phase, for example, if patients are in a hyperthyroid state, undergoing improvement, or in a stable phase. Defining these groups could be guided by either clinicians' assessments or hormone concentrations. Furthermore, it should be investigated whether the findings from our subset of items and scales can be extended to apply to the full ThyPRO survey as well as to other condition-specific PRO measures. Finally, it would be interesting to conduct a study with patients randomized to two groups, where one group responds to ThyPRO-1W and the other to ThyPRO-4W to avoid potential distortion of the ThyPRO-4W rating by the memory of previous ratings. However, such a study would require strict inclusion criteria and randomization procedures to make sure participants in each group are in the same phase of their thyroid illness since participants no longer act as their own control.
In conclusion, we found no change in the accuracy of the ThyPRO measure when shortening the reporting period from four weeks to one week, compared with momentary ratings. In addition, ThyPRO with a 4-week reference period provided scores that were similar to the mean of four corresponding ThyPROs with 1-week reference periods. Thus, our results indicate that both retrospective versions can be applied, and the choice may depend on the specific scientific needs.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
This study was funded by The Danish Agency for Science, Technology and Innovation (grant 271-09-0143). Research salary of U.F.-R. is sponsored by a grant from Novo Nordisk Foundation.
