Abstract
Background:
Thyroid diseases affect quality of life (QoL). The Thyroid-Related Patient-Reported Outcome (ThyPRO) is an international comprehensive well-validated patient-reported outcome, measuring thyroid-related QoL. The current version is rather long—85 items. The purpose of the present study was to develop an abbreviated version of the ThyPRO, with conserved good measurement properties.
Methods:
A cross-sectional (N = 907) and a longitudinal sample (N = 435) of thyroid patients were analyzed. A graded item response theory (IRT) model was fitted to the cross-sectional data. Short-form scales with three items were aimed for, by selecting items with best fit according to the IRT model, avoiding cross-culturally noninvariant items. Seven scales measuring mental and social well-being and function as well as one overall QoL impact item were analyzed in a bifactor model, to develop a supplementary composite score. Short-form scales were linked to original scales with IRT-based summed-score linking. Agreement between the short and long form was estimated by agreement plots, intraclass correlations, and mean score levels. Responsiveness was compared by relative validity indices, clinical validity by ability to detect clinically relevant differences, and test–retest reliability by intraclass correlation.
Results:
One four-item scale was not abbreviated and one two-item scale was omitted from the short form. For the 11 scales undergoing abbreviation, 10 with three and one with four items were developed. A bifactor model with good overall fit was fitted to the composite score, including the single QoL item. Responsiveness and clinical validity of the short-form scales were preserved, as was test–retest reliability (0.75–0.89). Short- versus long-form intraclass correlations were high (0.89–0.98), and the mean scale levels were similar.
Conclusions:
A 39-item version of the ThyPRO, with good measurement properties, was developed and is recommended for clinical use.
Introduction
The ThyQoL project was launched to address this shortage (14). The Thyroid-Related Patient-Reported Outcome (ThyPRO) instrument was developed as a comprehensive thyroid-related standalone PRO for patients with any benign thyroid disease (15–17). It was crucial to the developers that it covered all benign thyroid diseases in order for it to maintain content validity when patients “converted” from one diagnosis to the other because of treatment (e.g., patients with nontoxic goiter becoming hypothyroid and requiring thyroid hormonal substitution after thyroidectomy). The ThyPRO is now in use in many studies worldwide (e.g., Mishra et al. (7), Watt et al. (18), Watt et al. (19), Graf et al. (20), Fast et al. (21), Bukvic et al. (22), Bukvic et al. (23), and Winther et al. (24)). However, the current version is rather long (85 items) and is reported in numerous (i.e., 13) scales (Supplementary Appendix S1; Supplementary Data are available online at
The purpose of the present study was to develop an abbreviated version of the ThyPRO with good cross-cultural validity and with maximum preservation of favorable measurement properties in terms of construct and clinical validity, test–retest reliability, and responsiveness to relevant clinical treatments.
Methods
Study population
Data from two previously described patient populations were used (16,17,26). The cross-sectional sample comprised thyroid patients followed at or referred to two university hospital outpatient clinics (Copenhagen University Hospital Rigshospitalet and Odense University Hospital) in 2007–2008 (16) (Table 1). Thus, this sample comprised patients with newly diagnosed thyroid disease, as well as patients monitored during ongoing treatment. A subset of these, the retest subsample in Table 1, was evaluated twice, two weeks apart (17). The longitudinal sample comprised patients undergoing treatment for thyroid diseases at the abovementioned centers during 2008–2012, evaluated before and six months after treatment (26).
Data are in number (percent) or median (interquartile range [Q1–Q3]).
Only in longitudinal sample.
Patient-reported outcome measure
The ThyPRO measures a range of aspects of QoL relevant to patients with benign thyroid diseases, as identified during patient and expert interviews (14). It thus covers both physical symptoms specifically relevant to thyroid diseases, for example, symptoms of hyperthyroidism and goiter, and nonspecific aspects of high importance to patients with thyroid diseases, for example fatigue. The full-length ThyPRO consists of 85 items summarized in 13 scales, as well as a single item measuring overall impact of thyroid disease on QoL (Supplementary Appendix S1). Each item is rated on a 0–4 Likert scale from 0 = “no symptoms/problems” to 4 = “severe symptoms/problems.” The average score of items in a scale is divided by four and multiplied by 100 to yield thirteen 0–100 scales, with higher scores indicating worse health status.
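The scoring rule described above can be illustrated with a minimal Python sketch. The function name is hypothetical, and the sketch assumes that every item in the scale has been answered; handling of missing responses is not described here and is deliberately left out.

```python
from typing import Sequence


def thypro_scale_score(item_responses: Sequence[int]) -> float:
    """Return a 0-100 scale score from items rated 0-4.

    Follows the rule in the text: the average item score is divided
    by four and multiplied by 100. Higher scores indicate worse
    health status. Assumes complete responses (an assumption).
    """
    if not item_responses:
        raise ValueError("at least one item response is required")
    for response in item_responses:
        if response not in (0, 1, 2, 3, 4):
            raise ValueError(f"item response out of range 0-4: {response!r}")
    mean_item_score = sum(item_responses) / len(item_responses)
    return mean_item_score / 4 * 100
```

For example, a respondent answering 2 on every item of a scale would obtain a score of 50 under this rule.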
Abbreviation strategy
The analyses were conducted in three separate steps: (a) selection of items for the short form, including selection of scales for a composite score; (b) scoring of short scales; and (c) validation of the short form.
Item selection
The Hypothyroid scale already consists of only four items and has relatively low reliability, and it was therefore decided to retain it at full length in the abbreviated instrument. The Impaired Sex Life scale had a higher occurrence of missing responses in previous studies, indicating lower acceptability than the remaining scales. It was therefore decided to exclude it from the abbreviated version. For each of the remaining 11 scales, items previously shown to fit a unidimensional factor model (27) were analyzed using Samejima's graded item response theory (IRT) model (28,29). In case of significant item misfit at the p < 0.01 level, according to Orlando and Thissen's S-X² item fit index (30–33), a reduced model without the least fitting item was respecified, until a model without misfit was identified. Based on knowledge about the ThyPRO instrument, and on the IRT-derived item information functions (29), the best items were selected. Knowledge about ThyPRO stemmed from the initial qualitative content validation studies (14,15) and from other validation studies evaluating cross-cultural validity (34) and differential item functioning (DIF) according to diagnosis or sociodemographic characteristics (35). Since the latter DIFs were found to be minor, items with DIF were not excluded if content or measurement considerations advocated their preservation. The aim was to reduce each scale to three items, which was considered the optimal minimum number enabling meaningful evaluation of scale properties (e.g., dimensionality and DIF). In case of similar item characteristics and information curves, items were selected to cover core content of the construct, as well as important subaspects of the construct measured (e.g., including both positive and negative aspects, such as positive energy vs. fatigue).
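To make the role of the item information functions concrete, the following Python sketch computes category probabilities and item information under the graded response model in its standard logistic form. The function names and any parameter values passed to them are illustrative assumptions; the actual item parameters are those reported in Table 2, and the analyses in this study were run in IRTPRO, not with this code.

```python
import math
from typing import List, Sequence


def _cumulative_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """P(response >= k) for k = 0..m+1, with the boundary values 1 and 0."""
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def grm_category_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """Probability of each response category (0..m) at trait level theta.

    `a` is the discrimination and `thresholds` the ordered category
    thresholds b_1 < ... < b_m (four thresholds for a 0-4 item).
    """
    cum = _cumulative_probs(theta, a, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def grm_item_information(theta: float, a: float, thresholds: Sequence[float]) -> float:
    """Item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = _cumulative_probs(theta, a, thresholds)
    info = 0.0
    for k in range(len(thresholds) + 1):
        p_k = cum[k] - cum[k + 1]
        # Derivative of each cumulative logistic curve is a * P * (1 - P).
        dp_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p_k > 0:
            info += dp_k * dp_k / p_k
    return info
```

Evaluating `grm_item_information` over a grid of trait levels for each candidate item gives curves of the kind used to compare items: items with higher and broader curves measure more precisely over a wider range of the construct.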
Scoring composite and short-form scales
For simplicity in future reporting, it was decided to develop a supplemental composite summary score based on factor analysis. Scales measuring mental and social well-being and function (i.e., the scales concerning Tiredness, Cognitive Complaints, Anxiety, Depressivity, Emotional Susceptibility, Impaired Social Life, and Impaired Daily Life) have previously been shown to be highly correlated (16,27). These scales, together with the single overall QoL-impact item, were modeled as one general factor (see Fig. 1). The individual scales were modeled as subfactors; that is, a bifactor model was fitted in which each item was regressed on both the general factor and the subfactor representing its individual, abbreviated scale. The subfactors were specified as uncorrelated with each other and with the general factor (36–40).

Fig. 1. Bifactor modeling evaluating the composite scale, summarizing the seven short-form mental and social well-being and function scales (left-hand side). Factor loadings are presented at the relevant arrows. All items except one had higher loading on the general factor representing the composite score.
The individual short-form scales were scored, using the Orlando and Thissen IRT-based summed-score linking (41), where the scales are linked based on derived summed-score-to-IRT-score translation tables, to make scale levels on the abbreviated scales comparable to the original scales. For the purpose of this linking, all items in each of the original 11 scales were calibrated using the graded IRT model. Agreement between the short- and long-form scales was estimated by agreement plots, mean score levels, and intraclass correlation with empirical bootstrap confidence intervals (42,43).
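As a rough illustration of how a summed-score-to-IRT-score translation table can be turned into a rescaling table, the Python sketch below uses the Lord-Wingersky recursion to obtain the likelihood of each short-form summed score, takes the expected a posteriori (EAP) trait estimate for that score under a standard normal prior, and then reads off the expected 0–100 long-form score at that trait level. This is one plausible reading of the procedure and not a reproduction of the authors' SAS implementation; the function names, the quadrature grid, the prior, and the final mapping through the long-form expected score are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

# One item: (discrimination a, ordered thresholds b_1..b_m).
Item = Tuple[float, Sequence[float]]


def category_probs(theta: float, a: float, thresholds: Sequence[float]) -> List[float]:
    """Graded response model category probabilities at trait level theta."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]


def summed_score_likelihoods(theta: float, items: Sequence[Item]) -> List[float]:
    """Lord-Wingersky recursion: P(summed score = s | theta) for all s."""
    dist = [1.0]
    for a, thresholds in items:
        probs = category_probs(theta, a, thresholds)
        new = [0.0] * (len(dist) + len(probs) - 1)
        for s, p_s in enumerate(dist):
            for k, p_k in enumerate(probs):
                new[s + k] += p_s * p_k
        dist = new
    return dist


def expected_scale_score(theta: float, items: Sequence[Item]) -> float:
    """Expected 0-100 scale score at theta (expected sum / maximum sum * 100)."""
    expected_sum = sum(
        k * p for a, b in items for k, p in enumerate(category_probs(theta, a, b))
    )
    max_sum = sum(len(b) for _, b in items)
    return 100.0 * expected_sum / max_sum


def linking_table(short_items: Sequence[Item], long_items: Sequence[Item],
                  n_points: int = 81) -> List[float]:
    """Rescaled long-form-metric score for each short-form summed score."""
    grid = [-4.0 + 8.0 * i / (n_points - 1) for i in range(n_points)]
    prior = [math.exp(-0.5 * t * t) for t in grid]  # unnormalized N(0, 1)
    lik = [summed_score_likelihoods(t, short_items) for t in grid]
    table = []
    for s in range(len(lik[0])):
        post = [lik[i][s] * prior[i] for i in range(n_points)]
        eap = sum(t * w for t, w in zip(grid, post)) / sum(post)
        table.append(expected_scale_score(eap, long_items))
    return table
```

A consequence of this kind of linking is that the lowest and highest raw sums do not map to exactly 0 and 100, because the trait estimate for an extreme summed score is pulled toward the prior mean.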
Scale validation
Responsiveness of abbreviated scales was compared to the long-form scales with calculation of effect sizes (44,45) and relative validity indices (46) in patients undergoing relevant clinical treatment, as evaluated in a previous study (26). Responsiveness in specific diagnostic groups undergoing treatment (hyper- and hypothyroidism, respectively, treated to euthyroidism, and volume reduction of goiter) was also compared to the responsiveness previously reported for the long form (26). Clinical validity was tested by evaluating whether the short-form scales were able to differentiate among clinically relevant patient groups (sensitivity), similar to what was found previously for the long-form scales (17). DIF according to age was evaluated for the short form and compared to findings from the long form (35). Finally, test–retest reliability was evaluated among stable patients responding twice to the ThyPRO (17), using intraclass correlation (42,47–49). Confidence intervals (CI) were estimated by empirical bootstrap (43,50).
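The test–retest analysis can be sketched in Python as follows. The specific intraclass correlation form is not stated in this section, so the sketch assumes the two-way random-effects, absolute-agreement, single-measure form (often written ICC(2,1)) and a simple percentile bootstrap over patients; the function names, number of resamples, and seed are illustrative assumptions, and the original analyses were performed in SAS.

```python
import random
from typing import List, Sequence, Tuple

# (test score, retest score) for one patient.
Pair = Tuple[float, float]


def icc_2_1(pairs: Sequence[Pair]) -> float:
    """Two-way random-effects, absolute-agreement, single-measure ICC."""
    n, k = len(pairs), 2
    grand = sum(a + b for a, b in pairs) / (n * k)
    row_means = [(a + b) / k for a, b in pairs]
    col_means = [sum(p[j] for p in pairs) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for p in pairs for x in p)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


def bootstrap_ci(pairs: Sequence[Pair], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the ICC, resampling patients with replacement."""
    rng = random.Random(seed)
    estimates: List[float] = []
    for _ in range(n_boot):
        sample = [rng.choice(pairs) for _ in pairs]
        try:
            estimates.append(icc_2_1(sample))
        except ZeroDivisionError:
            # Degenerate resample with no variance at all; skip it.
            continue
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lo, hi
```

Resampling whole patients, and not individual measurements, keeps each test and retest score paired, which is what the reliability estimate depends on.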
Statistical analysis
Descriptive analyses, summed-score linking, sensitivity tests, responsiveness comparisons, DIF, and test–retest intraclass correlations were performed with SAS v9.3 (51). Bifactor models were estimated with Mplus v7.1 (52). IRT modeling was performed with IRTPRO (53,54). DIF was evaluated using the ordinal logistic regression approach (55).
Study approval and conduct
According to Danish law, PRO research does not require and thus cannot obtain approval by ethical committees, and a completed questionnaire is regarded as consent. The study was approved by the Danish Data Protection Agency (#2007-58-0015) and conducted in accordance with the Declaration of Helsinki.
Results
Item parameters from the IRT modeling are shown in Table 2. The Impaired Social Life scale only included three unidimensional items, and was thus not abbreviated further. In the original long-form scales—Goiter Symptoms, Hyperthyroid Symptoms, Cognitive Complaints, Anxiety, Social Life, and Appearance—all items had good fit to the IRT model. For the Eye Symptoms, Tiredness, Depressivity, Emotional Susceptibility, and the Daily Life scales, item-level misfit was eliminated through reduction of the scales (Table 2, middle column). Abbreviated scales with three items each were obtained for 10/11 scales, whereas the Hyperthyroid Symptoms scale had four items. The short form is presented in Supplementary Appendix S2.
Items flagged for nonunidimensionality in previous confirmatory factor analyses were omitted (+).
Items with item-level S-X² misfit (p < 0.01).
IRT, item response theory.
With specification of a method factor comprising the positively worded items, a bifactor model with acceptable overall fit (comparative fit index [CFI] = 0.97, root mean square error of approximation [RMSEA] = 0.085) was fitted to the composite score (Fig. 1). One item concerning memory problems had higher loading on its subfactor; all other items had higher loading on the general factor.
The scale transformations are presented for each scale in Table 3. Thus, a raw sum score of 0 on, for example, the Goiter Symptoms scale (answer “not at all” to all three items) should be rescaled to a value of 2, and a raw sum score of 12 (answer “very much” to all three items) should be rescaled to 84, which is the maximum score on the Goiter short-form scale.
The raw score (left column) is derived by simple summation of item values (0–4) separately for each scale. The corresponding final, IRT-linked rescaled short-form score is tabulated for each scale. Thus, when scoring short-form scales, items are summed (0–12 for 3-item scales and 0–16 for 4-item scales), and the rescaled score is derived from the table.
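In practice, scoring a short-form scale is a table lookup. The Python sketch below shows the mechanics; the function and constant names are hypothetical, and the dictionary contains only the two Goiter Symptoms entries quoted in the text (raw 0 to rescaled 2, raw 12 to rescaled 84). The remaining entries must be taken from Table 3 and are intentionally not guessed here.

```python
from typing import Mapping, Sequence


def rescale_short_form(item_values: Sequence[int], table: Mapping[int, int]) -> int:
    """Sum the item values (0-4 each) and look up the rescaled score."""
    raw_sum = sum(item_values)
    if raw_sum not in table:
        raise KeyError(f"no rescaled value available for raw sum {raw_sum}")
    return table[raw_sum]


# Only the two entries stated in the text; all other raw sums (1-11)
# must be filled in from Table 3 before this table is usable.
GOITER_KNOWN_ENTRIES = {0: 2, 12: 84}
```

With a complete table, the same function applies unchanged to the 4-item scales, whose raw sums run from 0 to 16.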
Agreement plots of the new, rescaled short-form scales versus the original long-form scales are presented in Figure 2. Good and uniform agreement was shown across the entire range of scores. In each plot, the mean [CI] long-form score and the mean short-form score for the cross-sectional sample are provided too. Only the Tiredness short-form mean score was outside the CI for the long-form scale. However, the difference between the two mean levels was only three points on the 0–100 scale.

Fig. 2. Agreement plots of short-form (horizontal axis) versus original long-form (vertical axis) scale levels with regression lines. For each scale, the mean score (95% CI) for the long form is presented in the upper-left corner, and the mean score for the short form in the lower right. A short-form mean outside the long-form CI is marked by an asterisk.
Effect sizes and responsiveness in groups of patients undergoing treatment were preserved in the short-form scales, as shown in Table 4. Only the short-form Anxiety scale had smaller effect size and slightly lower responsiveness than the original long-form. Conversely, the short-form Appearance scale had slightly higher effect size and responsiveness compared to the long form.
For each scale, the effect size for the new short form [CI] and the original long form is provided. To compare responsiveness directly, the F-value of the change for each version was calculated and the relative validity index calculated as F/F_max, so that the version with the best responsiveness has a relative validity index of 1. In addition, the test–retest reliability for the short form [CI] and the long form, as well as the intraclass correlation between the two versions, are presented to the right.
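The relative validity index described in this note is a simple ratio, sketched below in Python. The function name is hypothetical, and any F-values supplied to it in an example are made up for illustration; the observed values are those in Table 4.

```python
from typing import Dict


def relative_validity(f_values: Dict[str, float]) -> Dict[str, float]:
    """Divide each version's F-value by the largest F-value.

    The version with the best responsiveness therefore gets an
    index of exactly 1, and the other version a value below 1.
    """
    f_max = max(f_values.values())
    return {version: f / f_max for version, f in f_values.items()}
```

For instance, hypothetical F-values of 20 for the short form and 25 for the long form would give indices of 0.8 and 1, respectively.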
Test–retest reliability was similarly preserved in the short-form scales, with only the Appearance scale having significantly but marginally lower reliability (Table 4).
Very high short- versus long-form intraclass correlations were found (0.89–0.98; Table 4). Further, the previously shown discriminant validity was reproduced. Thus, for all short-form scales, the group expected to score higher, as specified in Watt et al. (17), did have significantly higher mean scores than the group expected to score low (data not shown). Responsiveness in the three diagnostic groups evaluated was identical to that demonstrated for the long form (26). DIF according to age in the short form was also identical to the small effects previously identified in the long form (35). For a given level of Tiredness and Depressivity, respectively, younger patients had a tendency to endorse positively worded items (“felt energetic” and “self-confident,” respectively). For the Impaired Social Life scale, younger patients had more conflicts with other people, compared with older patients, but again, the effect was small.
Discussion
The purpose of the present study was to develop an abbreviated version of the ThyPRO. This goal was achieved successfully. Based on previous validation studies and IRT modeling, an abbreviated version of the ThyPRO was developed containing (a) four physical symptom scales, two with three items (Goiter and Eye Symptoms) and two with four items (Hypo- and Hyperthyroid physical symptoms); (b) seven three-item scales about physical, mental, and social well-being and function; (c) one three-item scale concerning appearance; and (d) one single item about impact on overall QoL. Thus, the abbreviated version consists of 39 items, if all physical symptom scales are administered. Each of the 12 short-form scales and the single QoL item can be reported separately, but the seven well-being and function scales can also be summarized in one single composite score.
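The item count can be verified with a short tally. The Python sketch below lists the scales and item counts as described in this paragraph and in the Methods; the dictionary layout and the label used for the single item are illustrative choices.

```python
# Items per short-form scale, as described in the text.
SHORT_FORM_ITEMS = {
    # (a) physical symptom scales
    "Goiter Symptoms": 3,
    "Eye Symptoms": 3,
    "Hypothyroid Symptoms": 4,
    "Hyperthyroid Symptoms": 4,
    # (b) well-being and function scales
    "Tiredness": 3,
    "Cognitive Complaints": 3,
    "Anxiety": 3,
    "Depressivity": 3,
    "Emotional Susceptibility": 3,
    "Impaired Social Life": 3,
    "Impaired Daily Life": 3,
    # (c) appearance
    "Appearance": 3,
}
OVERALL_QOL_ITEMS = 1  # (d) single overall QoL-impact item

total_items = sum(SHORT_FORM_ITEMS.values()) + OVERALL_QOL_ITEMS
# Reduction relative to the 85-item long form, in percent.
percent_reduction = (85 - total_items) / 85 * 100
```

The tally gives 12 scales plus the single item, 39 items in total, which corresponds to a reduction of about 54% from the 85-item long form.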
The validation analyses showed that the abbreviated scales had very high agreement with the original long-form ones, including roughly similar mean levels, and comparable measurement quality. Thus, good test–retest reliability, responsiveness to clinical change, and sensitivity to relevant clinical differences were demonstrated. This preservation of good measurement properties in scales with far fewer items is interpreted as a result of selecting the items with the best measurement properties, under consideration of the conceptual model and content validity, thereby reducing random and systematic measurement error.
The primary strength of this study is the integration of several studies and methodologies in the item reduction process. Thus, several modalities within modern psychometrics (DIF, structural equation modeling for ordinal data, item response modeling) were applied within a firm clinical framework among patients from several clinical studies, including cross-cultural samples. Further, analyses were conducted within the original ThyQoL conceptual model (56) with focus on content validity. However, the final short-form has not been tested as a stand-alone form in an independent, novel clinical sample. This should be considered in future studies. Further, although the aim was to develop and test the instrument in a broad, heterogeneous sample, as specified in the introduction, and although the cross-sectional sample size was fairly large, it was not large enough to permit multigroup analyses (57) according to diagnosis. However, previous studies using an ordinal regression approach (55), less dependent on sample size, have shown only minimal DIF of the ThyPRO scales, according to diagnosis (35). Application of a short form may lead to loss of content validity. The extent to which this has occurred can only be evaluated in qualitative studies (58). However, since previous validation studies have confirmed that the individual scales are unidimensional, the potential loss should theoretically be minimal. As evident from the agreement plots, the short-form scales have fewer measurement points along the entire spectrum, and application of the short forms may also lead to poorer discrimination at the extremes. Another potential weakness was the fact that five of the scales were slightly modified (some items were omitted) to avoid item-level misfit in the IRT model. Since the rescaling was based on these IRT analyses, this may lead to weaker linking between the two versions. 
On the other hand, as mentioned, the correlation, agreement and mean levels between the two versions of each scale all supported the appropriateness of the present linking.
The applied approach is in line with recent recommendations for item reduction (25). When reviewing the available item reduction literature, the authors found that 55% of the studies had preserved scale structure and the median proportion of reduction was 57% (range 21–88%). The present study is close to this median reduction (from 85 to 39 items, i.e., 54% reduction). In 62% of the studies, only the long form was administered. Use of IRT methods was recommended as advantageous in the suggested guidelines, but was only applied in 11% of the studies.
The two-level scale-scoring approach, where both a composite score and the underlying more detailed subscales can be scored for the well-being and function scales, has also been adopted in previous studies. The most prominent is the most widely used short-form measure, the SF-36 Health Survey (59). Based on SF-36, eight domain scores as well as two Component Summaries can be derived (60), depending on the level of detail required in reporting. The scoring of the SF-36 summaries is based on results from principal component analyses, in contrast to the present study, where a simple summation approach was adopted for ease of scoring and reporting.
The short-form offers an advantageous measure, when reduction of respondent burden, potentially increasing response rates, is considered to outweigh the theoretical reduction of content validity and measurement detail. In an ongoing clinical trial among patients with Graves' disease (19,61), time to completion of 39 ThyPRO items was short (median 4 minutes; interquartile range 3–5 minutes), according to time-stamped electronic responses. This may be particularly relevant in longitudinal studies with multiple measurement points and when studying larger samples. Reporting the well-being and function scales as the composite score is recommended, when simplicity of reporting, combined with small measurement intervals and high precision, is the primary goal. When a detailed evaluation of physical, emotional, and social well-being and function is warranted, reporting the individual scales is recommended. In general, administering ThyPRO-39 as a whole is recommended to enhance content validity and comparability. However, it may be relevant to omit entire scales (not individual items), if considered irrelevant for a specific future study. For example, the Eye Symptoms scale may not be administered in a trial among patients with nontoxic goiter. Similarly, scales from the full-length ThyPRO may be selectively added to the ThyPRO-39, if considered of particular importance, for example the Sex Life scale, which is not included in ThyPRO-39.
ThyPRO-39 can be implemented in daily clinical practice (62). Patients may respond to the instrument, for example, prior to their appointment, either from home via an e-mail sent in advance, or in the waiting room. Scale scores might then be transferred to the electronic medical record and evaluated by the clinician, similarly to evaluation of, for example, thyroid function tests. These data can then be used for monitoring and communication purposes. Relevant problems (or lack thereof) may be rapidly identified and addressed, ideally with established thresholds and recommended actions and interventions, including referral to psychosocial intervention. For an ongoing clinical trial, real-time automatic monitoring of responses to ThyPRO has been implemented (61). In response to scores above (i.e., worse than) preset thresholds, e-mail alerts are generated and sent to clinical staff. A similar system could monitor responses in clinical practice. However, there is still a requirement for further research on this, for example evaluating how to communicate these results meaningfully to patients; establishment of alert thresholds; identification and evaluation of effectiveness of relevant interventions, among others.
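The threshold-based monitoring described above could be sketched in Python as follows. The function name is hypothetical and the threshold values in any example are placeholders: as noted in the text, alert thresholds have yet to be established, and the sketch does not reproduce the system used in the trial.

```python
from typing import Dict, List


def scales_needing_alert(scores: Dict[str, float],
                         thresholds: Dict[str, float]) -> List[str]:
    """Return the scales whose score is above (i.e., worse than) its threshold.

    Scales without a preset threshold are ignored. Scores are on the
    0-100 metric, where higher values indicate worse health status.
    """
    return sorted(
        name
        for name, score in scores.items()
        if name in thresholds and score > thresholds[name]
    )
```

The returned list could then be passed to whatever notification mechanism is in place, for example to generate an e-mail to clinical staff.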
In conclusion, this study has developed an abbreviated 39-item version of the thyroid-related QoL measurement instrument ThyPRO, with good measurement properties. It has high agreement with the long-form original version, and score levels on one form are comparable to score levels on the other. Function and well-being may be reported as a composite score or as individual scale scores. This abbreviated version, named the ThyPRO-39, is recommended for use in clinical studies, as a possible alternative to the original version. Scoring programs for use on various platforms are available from the first author for both versions.
Footnotes
Acknowledgments
This study has been supported by grants from the Danish Agency for Science, Technology, and Innovation: Council for Strategic Research and Council for Independent Research, Genzyme Corporation, Novo Nordisk Foundation by unrestricted research grants and Agnes and Knut Mørk's Foundation. Warm thanks to Sofie Larsen Rasmussen and Lene Frydenberg for help with identification and inclusion of patients; to Sofie Larsen Rasmussen and Kira Bang Bové for help with follow-up; to Sofie Larsen Rasmussen, Kim Æbelø, Emilie Birch, Thea Christophersen, and Sara Kehlet Watt for help with generation of clinical data; and to Selma Watt and Laura Siim Magnussen for data entry. Researchers interested in using the ThyPRO or the ThyPRO-39 may contact the corresponding author.
Author Disclosure Statement
None of the authors has any financial conflicts of interest to declare.
