Development of a Short Version of the Thyroid-Related Patient-Reported Outcome ThyPRO

Abstract

Background:

Thyroid diseases affect quality of life (QoL). The Thyroid-Related Patient-Reported Outcome (ThyPRO) is an international comprehensive well-validated patient-reported outcome, measuring thyroid-related QoL. The current version is rather long—85 items. The purpose of the present study was to develop an abbreviated version of the ThyPRO, with conserved good measurement properties.

Methods:

A cross-sectional (N = 907) and a longitudinal sample (N = 435) of thyroid patients were analyzed. A graded item response theory (IRT) model was fitted to the cross-sectional data. Short-form scales with three items were aimed for, by selecting items with best fit according to the IRT model, avoiding cross-culturally noninvariant items. Seven scales measuring mental and social well-being and function as well as one overall QoL impact item were analyzed in a bifactor model, to develop a supplementary composite score. Short-form scales were linked to original scales with IRT-based summed-score linking. Agreement between the short and long form was estimated by agreement plots, intraclass correlations, and mean score levels. Responsiveness was compared by relative validity indices, clinical validity by ability to detect clinically relevant differences, and test–retest reliability by intraclass correlation.

Results:

One four-item scale was not abbreviated and one two-item scale was omitted from the short form. Of the 11 scales undergoing abbreviation, 10 with three and one with four items were developed. A bifactor model with good overall fit was fitted to the composite score, including the single QoL item. Responsiveness and clinical validity of the short-form scales were preserved, as was test–retest reliability (0.75–0.89). Short- versus long-form intraclass correlations were high (0.89–0.98), and the mean scale levels were similar.

Conclusions:

A 39-item version of the ThyPRO, with good measurement properties, was developed and is recommended for clinical use.

Introduction

Diseases related to the thyroid gland are common, affecting around 10–15% of the adult population in most countries (1–3). Irrefutably, thyroid diseases affect quality of life (QoL) (4–9), work role function (10,11), as well as morbidity and mortality (12,13). Nonetheless, research focusing on QoL among these patients has been scarce, and until recently, no validated thyroid-specific patient-reported outcome (PRO) measuring thyroid-related QoL across different diseases has been available.

The ThyQoL project was launched to address this shortage (14). The Thyroid-Related Patient-Reported Outcome (ThyPRO) instrument was developed as a comprehensive thyroid-related standalone PRO for patients with any benign thyroid disease (15–17). It was crucial to the developers that it covered all benign thyroid diseases in order for it to maintain content validity when patients “converted” from one diagnosis to the other because of treatment (e.g., patients with nontoxic goiter becoming hypothyroid and requiring thyroid hormonal substitution after thyroidectomy). The ThyPRO is now in use in many studies worldwide (e.g., Mishra et al. (7), Watt et al. (18), Watt et al. (19), Graf et al. (20), Fast et al. (21), Bukvic et al. (22), Bukvic et al. (23), and Winther et al. (24)). However, the current version is rather long (85 items) and is reported in numerous (i.e., 13) scales (Supplementary Appendix S1; Supplementary Data are available online at www.liebertpub.com/thy). A shorter version for use as, for example, a secondary outcome in clinical trials and in daily clinical practice would further advance its applicability (25).

The purpose of the present study was to develop an abbreviated version of the ThyPRO with good cross-cultural validity and with maximum preservation of favorable measurement properties in terms of construct and clinical validity, test–retest reliability, and responsiveness to relevant clinical treatments.

Methods

Study population

Data from two previously described patient populations were used (16,17,26). The cross-sectional sample comprised thyroid patients followed at or referred to two university hospital outpatient clinics (Copenhagen University Hospital Rigshospitalet and Odense University Hospital), in 2007–2008 (16) (Table 1). Thus, this sample comprised patients with newly diagnosed thyroid disease, as well as patients controlled for ongoing treatment. A subset of these, the retest subsample in Table 1, was evaluated twice, at two-week intervals (17). The longitudinal sample comprised patients undergoing treatment for thyroid diseases at the abovementioned centers during 2008–2012, evaluated before and six months after treatment (26).

Table 1.

Clinical and Basic Characteristics of the Samples

	Cross-sectional sample (N = 907)	Test–retest subsample (N = 87)	Longitudinal sample (N = 435)
Sex, n (%):
Women	787 (87)	81 (93)	361 (83)
Men	120 (13)	6 (7)	74 (17)
Age in years, median (Q1–Q3)	51 (39–61)	47 (37–59)	54 (42–63)
Months from treatment initiation to completion of follow-up survey, median (Q1–Q3)			6.5 (6.5–6.8)
Diagnosis:
Nontoxic goiter	259 (29)		135 (31)
Toxic nodular goiter	145 (16)		98 (23)
Graves' hyperthyroidism	168 (19)		73 (17)
Graves' orbitopathy	94 (10)	9 (10)	25 (6)
Autoimmune hypothyroidism	199 (22)	24 (28)	86 (20)
Other thyroid diagnoses	42 (5)		18 (4)
Disease duration in months, median (Q1–Q3)	27 (4–79)	47 (21–76)	0.3 (0–4)
Treatment, n (%)
L-thyroxine	292 (32)	43 (50)	111 (26)
Anti-thyroid medication	162 (18)	20 (23)	86 (20)
Aspiration of thyroid cyst	0	0	4 (1)
Glucocorticoid pulse therapy	4 (0.4)	0	2 (0)
Other immunosuppressive treatment	0	0	4 (1)
Hemithyroidectomy	132 (14)	8 (10)	64 (15)
Total thyroidectomy			37 (9)
Radioactive iodine	114 (13)	14 (16)	127 (29)
No treatment (ever)	283 (31)	10 (11)	0
Thyroid function at baseline, median (Q1–Q3)
TSH mIU/L (undetectably low = 0)	1.1 (0.4–2.7)	1.1 (0.5–2.4)	0.42 (0.01–2.64)
Thyroxine (i.e., total T4) nmol/L	112 (94–132)	112 (95–128)	99 (80–133)
Thyroid function at follow-up*:
TSH mIU/L			1.37 (0.46–2.87)
Thyroxine (i.e., total T4) nmol/L			93 (77–112)

Data are in number (percent) or median (interquartile range [Q1–Q3]).

*Only in longitudinal sample.

Patient-reported outcome measure

The ThyPRO measures a range of aspects of QoL relevant to patients with benign thyroid diseases, as identified during patient and expert interviews (14). It thus covers both physical symptoms specifically relevant to thyroid diseases, for example, symptoms of hyperthyroidism and goiter, and nonspecific aspects of high importance to patients with thyroid diseases, for example fatigue. The full-length ThyPRO consists of 85 items summarized in 13 scales, as well as a single item measuring overall impact of thyroid disease on QoL (Supplementary Appendix S1). Each item is rated on a 0–4 Likert scale from 0 = “no symptoms/problems” to 4 = “severe symptoms/problems.” The average score of items in a scale is divided by four and multiplied by 100 to yield thirteen 0–100 scales, with higher scores indicating worse health status.
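The long-form scoring rule just described (mean item rating divided by four and multiplied by 100) can be sketched as follows. This is a minimal illustration, not the authors' scoring program; the function name `scale_score` is hypothetical, and handling of missing item responses, which the text does not describe, is deliberately left out.

```python
def scale_score(item_responses):
    """Return the 0-100 scale score for one respondent's item ratings (each 0-4).

    Higher scores indicate worse health status, as in the text.
    """
    if not item_responses:
        raise ValueError("at least one item response is required")
    for value in item_responses:
        if value not in (0, 1, 2, 3, 4):
            raise ValueError(f"item rating out of range: {value!r}")
    mean_rating = sum(item_responses) / len(item_responses)
    # Divide the mean by the maximum rating (4) and rescale to 0-100.
    return mean_rating / 4 * 100


# Example: ratings of 1, 2, and 3 on a three-item scale
print(scale_score([1, 2, 3]))
```

A respondent answering 0 to every item scores 0 and one answering 4 to every item scores 100, which gives the 0–100 range stated above.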

Abbreviation strategy

The analyses were conducted in three separate steps: (a) selection of items for the short form, including selection of scales for a composite score; (b) scoring of short scales; and (c) validation of the short form.

Item selection

The Hypothyroid scale already consists of only four items and has relatively low reliability, and it was therefore decided to retain it in full length in an abbreviated instrument. The Impaired Sex Life scale had a higher occurrence of missing responses in previous studies, indicating lower acceptability than the remaining scales. It was therefore decided to exclude it from the abbreviated version. For each of the remaining 11 scales, items previously shown to fit a unidimensional factor model (27) were analyzed using Samejima's graded item response theory (IRT) model (28,29). In case of significant item misfit at the p < 0.01 level, according to Orlando and Thissen's S-X² item fit index (30–33), a reduced model without the least fitting item was respecified, until a model without misfit was identified. Based on knowledge about the ThyPRO instrument, and on the IRT-derived item information functions (29), the best items were selected. Knowledge about ThyPRO stemmed from the initial qualitative content validation studies (14,15) and from other validation studies evaluating cross-cultural validity (34) and differential item functioning (DIF) according to diagnosis or sociodemographic characteristics (35). Since the latter DIFs were found to be minor, items with DIF were not excluded if content or measurement considerations advocated their preservation. The aim was to reduce each scale to three items, which was considered the optimal minimum number enabling meaningful evaluation of scale properties (e.g., dimensionality and DIF). In case of similar item characteristics and information curves, items were selected to cover core content of the construct, as well as important subaspects of the construct measured (e.g., including both positive and negative aspects, such as positive energy vs. fatigue).
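As a rough illustration of the graded model and the item information functions used for item selection, the sketch below computes category probabilities and item information for a single item. It assumes the logistic parameterization without a 1.7 scaling constant, with slope a and ordered thresholds b as tabulated in Table 2; the function names are illustrative, and this is not the IRTPRO implementation used in the study.

```python
import math


def cumulative_probs(theta, slope, thresholds):
    """P(X >= k | theta) for k = 0..K+1, padded with 1.0 (k = 0) and 0.0 (k = K+1)."""
    inner = [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]


def category_probs(theta, slope, thresholds):
    """Probability of each response category 0..K as differences of adjacent cumulative curves."""
    cum = cumulative_probs(theta, slope, thresholds)
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def item_information(theta, slope, thresholds):
    """Samejima item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = cumulative_probs(theta, slope, thresholds)
    # Derivative of each cumulative curve with respect to theta (zero at the padded bounds).
    deriv = [slope * p * (1.0 - p) for p in cum]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]
        if p_k > 0.0:
            info += (deriv[k] - deriv[k + 1]) ** 2 / p_k
    return info


# Long-form Goiter Symptoms parameters as listed in Table 2 (slope, thresholds 1-4)
PRESSURE_IN_THROAT = (3.7, [0.1, 0.8, 1.3, 2.2])
THROAT_PAIN_IN_EARS = (1.3, [1.2, 2.2, 3.1, 4.4])

for name, (a, b) in [("Pressure in throat", PRESSURE_IN_THROAT),
                     ("Throat pain felt in ears", THROAT_PAIN_IN_EARS)]:
    print(name, round(item_information(1.0, a, b), 2))
```

At a trait level of 1.0, the high-slope item carries considerably more information than the low-slope item, which is the kind of contrast that favors an item for a short form, alongside the content considerations described above.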

Scoring composite and short-form scales

For simplicity in future reporting, it was decided to develop a supplemental composite summary score based on factor analysis. Scales measuring mental and social well-being and function (i.e., the scales concerning Tiredness, Cognitive Complaints, Anxiety, Depressivity, Emotional Susceptibility, Impaired Social Life, and Impaired Daily Life, as well as the single overall QoL-impact item) have previously been shown to be highly correlated (16,27) and were modeled as one general factor (see Fig. 1). The individual scales were modeled as subfactors, that is, a bifactor model was fitted: each item was regressed on both the general factor and the subfactor (representing the individual, abbreviated scales). The subfactors were specified as uncorrelated with each other and with the general factor (36–40).
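The bifactor structure described above can be written compactly. The following is a generic formulation, not taken from the study's Mplus specification. The symbols (the latent response $x^{*}_{ij}$, loadings $\lambda$, factors $\eta$, and the scale index $s(j)$) are notation assumed here for illustration, and the method factor for positively worded items reported in the Results is left out.

```latex
% Latent response of person i to item j, where item j belongs to short-form scale s(j)
x^{*}_{ij} = \lambda^{G}_{j}\,\eta^{G}_{i} + \lambda^{S}_{j}\,\eta^{s(j)}_{i} + \varepsilon_{ij},
\qquad
\operatorname{Cov}\!\left(\eta^{G}, \eta^{s}\right) = 0,
\quad
\operatorname{Cov}\!\left(\eta^{s}, \eta^{s'}\right) = 0 \;\; (s \neq s')
```

Under this formulation, an item with a larger general-factor loading than subfactor loading mainly reflects what the scales share, which is the pattern that supports reporting a composite score.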

FIG. 1.

Bifactor modeling evaluating the composite scale, summarizing the seven short-form mental and social well-being and function scales (left-hand side). Factor loadings are presented at the relevant arrows. All items except one had higher loading on the general factor representing the composite score.

The individual short-form scales were scored, using the Orlando and Thissen IRT-based summed-score linking (41), where the scales are linked based on derived summed-score-to-IRT-score translation tables, to make scale levels on the abbreviated scales comparable to the original scales. For the purpose of this linking, all items in each of the original 11 scales were calibrated using the graded IRT model. Agreement between the short- and long-form scales was estimated by agreement plots, mean score levels, and intraclass correlation with empirical bootstrap confidence intervals (42,43).
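The summed-score-to-IRT-score translation underlying this linking can be sketched with the Lord–Wingersky recursion, which gives the likelihood of each summed score at a given trait level, followed by an expected a posteriori (EAP) estimate per summed score. This is a minimal sketch under stated assumptions: a standard normal population distribution, a simple equally spaced quadrature grid, and the short-form Goiter Symptoms parameters from Table 2. The function names are hypothetical, and the final conversion of IRT scores to the 0–100 metric in Table 3 is not detailed in the text and is not attempted here.

```python
import math


def category_probs(theta, slope, thresholds):
    """Graded-model category probabilities (logistic metric, no 1.7 constant assumed)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-slope * (theta - b))) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


def summed_score_likelihoods(theta, items):
    """Lord-Wingersky recursion: P(summed score = s | theta) for s = 0..max score."""
    dist = [1.0]  # before any item is added, the summed score is 0 with certainty
    for slope, thresholds in items:
        probs = category_probs(theta, slope, thresholds)
        new = [0.0] * (len(dist) + len(probs) - 1)
        for s, p_s in enumerate(dist):
            for k, p_k in enumerate(probs):
                new[s + k] += p_s * p_k
        dist = new
    return dist


def summed_score_eap(items, n_points=81, lower=-4.0, upper=4.0):
    """EAP trait estimate for every summed score, assuming a N(0, 1) population."""
    step = (upper - lower) / (n_points - 1)
    grid = [lower + i * step for i in range(n_points)]
    weights = [math.exp(-0.5 * t * t) for t in grid]  # unnormalised normal density
    max_score = sum(len(thresholds) for _, thresholds in items)
    numer = [0.0] * (max_score + 1)
    denom = [0.0] * (max_score + 1)
    for t, w in zip(grid, weights):
        for s, like in enumerate(summed_score_likelihoods(t, items)):
            numer[s] += t * like * w
            denom[s] += like * w
    return [n / d for n, d in zip(numer, denom)]


# Short-form Goiter Symptoms items as listed in Table 2 (slope, thresholds 1-4)
GOITER_SHORT = [
    (3.3, [0.1, 0.8, 1.3, 2.3]),  # Sense of fullness in neck
    (5.7, [0.1, 0.7, 1.2, 2.2]),  # Pressure in throat
    (2.1, [0.5, 1.2, 1.8, 2.7]),  # Discomfort swallowing
]

for raw, theta_hat in enumerate(summed_score_eap(GOITER_SHORT)):
    print(raw, round(theta_hat, 2))
```

Running the same procedure on the long-form calibration would place both versions on a common IRT metric, which is the idea behind making short-form score levels comparable to the original scales.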

Scale validation

Responsiveness of abbreviated scales was compared to the long-form scales with calculation of effect sizes (44,45) and relative validity indices (46) in patients undergoing relevant clinical treatment, as evaluated in a previous study (26). Responsiveness in specific diagnostic groups undergoing treatment (hyper- and hypothyroidism, respectively, treated to euthyroidism, and volume reduction of goiter) was also compared to the responsiveness previously reported for the long form (26). Clinical validity was tested by evaluating whether the short-form scales were able to differentiate among clinically relevant patient groups (sensitivity), similar to what was found previously for the long-form scales (17). DIF according to age was evaluated for the short form and compared to findings from the long form (35). Finally, test–retest reliability was evaluated among stable patients responding twice to the ThyPRO (17), using intraclass correlation (42,47–49). Confidence intervals (CI) were estimated by empirical bootstrap (43,50).

Statistical analysis

Descriptive analyses, summed-score linking, sensitivity tests, responsiveness comparisons, DIF, and test–retest intraclass correlations were performed with SAS v9.3 (51). Bifactor models were estimated with Mplus v7.1 (52). IRT modeling was performed with IRTPRO (53,54). DIF was evaluated using the ordinal logistic regression approach (55).

Study approval and conduct

According to Danish law, PRO research does not require and thus cannot obtain approval by ethical committees, and a completed questionnaire is regarded as consent. The study was approved by the Danish Data Protection Agency (#2007-58-0015) and conducted in accordance with the Declaration of Helsinki.

Results

Item parameters from the IRT modeling are shown in Table 2. The Impaired Social Life scale only included three unidimensional items, and was thus not abbreviated further. In the original long-form scales—Goiter Symptoms, Hyperthyroid Symptoms, Cognitive Complaints, Anxiety, Social Life, and Appearance—all items had good fit to the IRT model. For the Eye Symptoms, Tiredness, Depressivity, Emotional Susceptibility, and the Daily Life scales, item-level misfit was eliminated through reduction of the scales (Table 2, middle column). Abbreviated scales with three items each were obtained for 10/11 scales, whereas the Hyperthyroid Symptoms scale had four items. The short form is presented in Supplementary Appendix S2.

Table 2.

Results of IRT Analyses (Slopes a and Item Thresholds b1–b4) of the Original Long-Form Scales, the Scales That Were Modified Based on IRT Analyses of Long-Form Scales, and the Short-Form Scales

	Long form
	Initial model					Modified model					Short form
		Thresholds					Thresholds					Thresholds
Scale name Abbreviated item wording	Slope	1	2	3	4	Slope	1	2	3	4	Slope	1	2	3	4
Goiter symptoms
Sense of fullness in neck	2.9	0.1	0.8	1.4	2.3						3.3	0.1	0.8	1.3	2.3
Visible swelling in front of neck+
Pressure in throat	3.7	0.1	0.8	1.3	2.2						5.7	0.1	0.7	1.2	2.2
Pain in front of neck	2.0	1.1	1.7	2.4	3.8
Throat pain felt in ears	1.3	1.2	2.2	3.1	4.4
Lump in throat	3.3	0.0	0.7	1.2	2.0
Clear throat often	1.7	−0.2	0.7	1.4	2.5
Discomfort swallowing	3.5	0.4	1.0	1.5	2.3						2.1	0.5	1.2	1.8	2.7
Difficulty swallowing	2.9	0.7	1.4	1.8	2.5
Sense of suffocating	2.3	1.2	1.8	2.2	2.9
Hoarseness+
Hyperthyroid symptoms
Trembling hands	1.4	0.6	1.7	2.6	3.8						1.8	0.5	1.4	2.2	3.2
Increased sweating	1.8	−0.4	0.5	1.1	2.1						1.5	−0.4	0.5	1.1	2.2
Palpitations	1.7	−0.2	0.8	1.6	2.7						2.1	−0.2	0.7	1.5	2.5
Shortness of breath	1.5	−0.2	0.9	1.7	2.7
Sensitive to heat	1.7	0.1	0.9	1.7	2.6
Increased appetite	1.2	0.5	1.4	2.2	3.7
Loose stools+
Upset stomach	1.4	0.0	1.0	1.9	3.0						1.0	0.0	1.3	2.5	3.9
Eye symptoms
Watery eyes	1.2	0.2	1.3	2.1	3.1
Bags under the eyes	1.9	−0.2	0.7	1.4	2.2	1.2	0.2	1.3	2.1	3.1
Grittiness in eyes	1.9	0.5	1.3	2.1	3.0	1.9	−0.2	0.7	1.4	2.2	2.0	−0.2	0.7	1.4	2.2
Reduced sight	2.0	1.2	1.9	2.3	2.9	1.9	0.5	1.3	2.1	3.0	1.7	0.5	1.4	2.2	3.2
Pressure in eyes	^*2.2	0.9	1.6	2.2	2.7
Double vision	1.8	0.3	1.2	1.7	2.4	2.0	1.2	1.9	2.3	2.9
Pain in eyes	^*1.2	0.2	1.3	2.1	3.1	2.2	0.9	1.6	2.2	2.7
Sensitive to light	1.9	−0.2	0.7	1.4	2.2	1.8	0.3	1.2	1.7	2.4	1.7	0.3	1.2	1.8	2.5
Tiredness
Been tired	^*3.3	−1.2	−0.2	0.4	1.2	3.1	−1.2	−0.2	0.4	1.3	3.1	−1.2	−0.2	0.4	1.3
Been exhausted	4.1	−0.5	0.1	0.6	1.4
Difficulty getting motivated	^*3.9	−0.5	0.3	0.7	1.5	4.7	−0.5	0.3	0.7	1.5	4.7	−0.5	0.3	0.7	1.5
Felt worn out	^*4.5	−0.5	0.2	0.6	1.3
Full of life	^*2.6	−2.3	−0.9	0.1	1.0
Energetic	^*2.7	−2.4	−0.9	0.1	1.1	2.0	−2.6	−1.0	0.1	1.3	2.0	−2.6	−1.0	0.1	1.3
Able to cope with life	^*2.7	−2.4	−1.1	−0.1	0.9
Cognitive complaints
Problems remembering	3.2	−0.4	0.6	1.3	2.2						3.4	−0.4	0.6	1.3	2.2
Slow or unclear thinking	5.0	0.0	0.8	1.4	2.3						6.6	0.0	0.8	1.4	2.2
Difficulty finding words	2.8	−0.1	0.8	1.5	2.5
Been confused	2.7	0.2	1.1	1.8	2.8
Difficulty learning	3.7	0.2	0.9	1.6	2.4
Difficulty concentrating	3.8	−0.3	0.7	1.3	2.0						3.2	−0.3	0.7	1.3	2.0
Anxiety
Nervous	3.6	0.0	0.9	1.4	2.1
Afraid or anxious	3.4	0.3	1.1	1.6	2.3						^*3.2	0.3	1.1	1.6	2.3
Felt tension	3.5	−0.3	0.7	1.3	2.1						4.2	−0.2	0.7	1.3	2.0
Concerned being seriously ill+
Uneasy	4.3	−0.2	0.7	1.3	1.9						3.5	−0.2	0.7	1.3	2.0
Restless	2.3	−0.1	0.8	1.5	2.3
Depressivity
Sad	5.8	−0.3	0.7	1.2	1.9	7.5	−0.3	0.6	1.2	1.9	6.3	−0.3	0.7	1.3	1.9
Depressed	4.6	0.2	0.9	1.4	2.0
Discouraged	5.2	0.0	0.8	1.4	2.1	4.7	0.0	0.8	1.4	2.2
Crying easily	2.1	0.0	0.8	1.5	2.2
Unhappy	4.0	−0.2	0.7	1.3	1.9	3.8	−0.3	0.7	1.3	1.9	4.2	−0.2	0.7	1.3	1.9
Happy	^*1.8	−2.1	−0.2	0.9	2.4
Self-confident	^*1.6	−2.1	−0.4	0.8	2.0	1.4	−2.3	−0.4	0.8	2.1	1.3	−2.3	−0.4	0.8	2.2
Emotional susceptibility
Difficulty coping	2.3	−0.5	0.5	1.3	2.5	3.3	−0.5	0.5	1.2	2.2
Not like yourself	2.5	−0.1	0.7	1.3	2.2	3.2	−0.1	0.7	1.2	2.0
Easily stressed	2.6	−0.6	0.4	1.0	1.8	3.1	−0.6	0.3	0.9	1.7	2.7	−0.6	0.4	1.0	1.8
Mood swings	^*3.7	−0.5	0.4	1.0	1.7	2.7	−0.6	0.4	1.1	1.9	3.6	−0.5	0.4	1.0	1.8
Irritable	^*3.6	−0.6	0.4	1.1	1.9
Frustrated	4.3	−0.2	0.6	1.2	1.9
Angry	2.4	−0.1	0.9	1.5	2.3
Felt in control	^*1.7	−2.0	−0.3	0.8	2.0	1.6	−2.1	−0.3	0.8	2.1	1.5	−2.1	−0.3	0.9	2.1
Felt in balance	^*2.1	−1.9	−0.6	0.6	1.6
Impaired social life
Difficult being with other people	^*3.7	0.6	1.2	1.8	2.6						^*3.7	0.6	1.2	1.8	2.6
A burden to other people	4.1	0.7	1.3	1.9	2.7						4.1	0.7	1.3	1.9	2.7
Conflicts with other people	^*2.3	0.9	1.8	2.6	3.0						^*2.3	0.9	1.8	2.6	3.0
People lack understanding+
Impaired daily life
Difficulty managing daily life	4.9	0.3	1.0	1.5	2.2	4.6	0.3	1.0	1.5	2.3	4.9	0.3	1.0	1.5	2.2
Limit leisure activities	^*5.8	0.3	0.9	1.2	1.7
Difficulty participating in life	6.5	0.5	1.0	1.4	2.0	5.2	0.5	1.0	1.4	2.1	5.3	0.5	1.0	1.4	2.1
Difficulty getting around	2.7	0.7	1.3	1.7	2.3	2.8	0.7	1.3	1.7	2.3
Everything takes longer	^*2.8	−0.1	0.8	1.2	1.9	3.1	−0.1	0.7	1.2	1.9	2.8	−0.1	0.7	1.2	1.9
Difficulty managing job	3.3	0.5	1.1	1.4	1.9
Appearance
Disease affect appearance	2.9	−0.3	0.5	1.1	1.9						2.4	−0.3	0.5	1.2	2.0
Unsatisfied with appearance	7.1	0.2	0.7	1.1	1.7
Mask visible signs	2.2	1.2	1.6	2.0	2.7
Bothered by people looking	2.6	1.3	1.8	2.2	2.5						2.6	1.3	1.8	2.2	2.5
Influence on clothes worn	2.3	1.0	1.4	1.8	2.6						2.5	0.9	1.4	1.8	2.5
Felt too fat+

Items flagged for nonunidimensionality in previous confirmatory factor analyses were omitted (+).

*Items with item-level S-X² misfit (p < 0.01).

IRT, item response theory.

With specification of a method factor comprising the positively worded items, a bifactor model with acceptable overall fit (comparative fit index [CFI] = 0.97, root mean square error of approximation [RMSEA] = 0.085) was fitted to the composite score (Fig. 1). One item concerning memory problems had higher loading on its subfactor; all other items had higher loading on the general factor.

The scale transformations are presented for each scale in Table 3. Thus, a raw sum score of 0 on, for example, the Goiter Symptoms scale (answer “not at all” to all three items) should be rescaled to a value of 2, and a raw sum score of 12 (answer “very much” to all three items) should be rescaled to 84, which is the maximum score on the Goiter short-form scale.

Table 3.

Final Scores on the Short-Form Scales, After Linking via IRT Model to Original Long-Form Scales (Orlando and Thissen Summed-Score-to-IRT-Score Approach)

Final rescaled short-form score
Raw sum score	Goiter	Hyper	Eye	Tired	Cognition	Anxiety	Depressivity	Susceptibility	Social Life	Daily Life	Appearance
0	2	2	1	0	1	1	0	1	0	0	1
1	10	8	8	8	7	10	7	7	8	7	12
2	15	13	14	17	14	18	14	13	17	15	21
3	20	18	20	25	21	26	22	21	25	22	28
4	26	23	25	33	29	34	29	28	33	30	36
5	31	28	32	42	37	41	37	36	42	38	43
6	37	33	38	50	44	49	45	44	50	46	51
7	43	38	45	58	52	56	54	52	58	54	59
8	49	44	52	67	60	63	63	60	67	62	66
9	57	49	60	75	68	71	71	68	75	71	73
10	64	55	68	83	76	79	80	77	83	80	80
11	73	60	78	92	85	87	89	86	92	89	87
12	84	66	89	100	95	96	97	95	100	98	96
13		71
14		77
15		84
16		90

The raw score (left column) is derived by simple summation of item values (0–4) separately for each scale. The corresponding final, IRT-linked rescaled short-form score is tabulated for each scale. Thus, when scoring short-form scales, items are summed (0–12 for the 3-item scales and 0–16 for the 4-item scale), and the rescaled score is derived from the table.
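The table lookup described in this note can be sketched as below. The values are transcribed from Table 3; the dictionary and function names are hypothetical, and this is not the scoring program available from the authors. The Hypothyroid scale is absent because it was retained in full length and does not appear in Table 3.

```python
# Rescaled scores from Table 3; list position = raw sum score.
RESCALED = {
    "Goiter": [2, 10, 15, 20, 26, 31, 37, 43, 49, 57, 64, 73, 84],
    "Hyper": [2, 8, 13, 18, 23, 28, 33, 38, 44, 49, 55, 60, 66, 71, 77, 84, 90],
    "Eye": [1, 8, 14, 20, 25, 32, 38, 45, 52, 60, 68, 78, 89],
    "Tired": [0, 8, 17, 25, 33, 42, 50, 58, 67, 75, 83, 92, 100],
    "Cognition": [1, 7, 14, 21, 29, 37, 44, 52, 60, 68, 76, 85, 95],
    "Anxiety": [1, 10, 18, 26, 34, 41, 49, 56, 63, 71, 79, 87, 96],
    "Depressivity": [0, 7, 14, 22, 29, 37, 45, 54, 63, 71, 80, 89, 97],
    "Susceptibility": [1, 7, 13, 21, 28, 36, 44, 52, 60, 68, 77, 86, 95],
    "Social Life": [0, 8, 17, 25, 33, 42, 50, 58, 67, 75, 83, 92, 100],
    "Daily Life": [0, 7, 15, 22, 30, 38, 46, 54, 62, 71, 80, 89, 98],
    "Appearance": [1, 12, 21, 28, 36, 43, 51, 59, 66, 73, 80, 87, 96],
}


def rescaled_score(scale, item_responses):
    """Sum the item values (0-4) for one scale and look up the IRT-linked score."""
    table = RESCALED[scale]
    if any(value not in (0, 1, 2, 3, 4) for value in item_responses):
        raise ValueError("item values must be integers from 0 to 4")
    # A table with n + 1 entries covers raw scores 0..n, i.e. n / 4 items.
    if len(item_responses) * 4 != len(table) - 1:
        raise ValueError(f"wrong number of items for scale {scale!r}")
    return table[sum(item_responses)]


print(rescaled_score("Goiter", [4, 4, 4]))    # raw score 12
print(rescaled_score("Hyper", [1, 0, 2, 1]))  # raw score 4
```

The item-count check guards against scoring an incomplete scale; how missing responses should be handled is not stated in the text and is therefore not implemented.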

Agreement plots of the new, rescaled short-form scales versus the original long-form scales are presented in Figure 2. Good and uniform agreement was shown across the entire range of scores. In each plot, the mean [CI] long-form score and the mean short-form score for the cross-sectional sample are provided too. Only the Tiredness short-form mean score was outside the CI for the long-form scale. However, the difference between the two mean levels was only three points on the 0–100 scale.

FIG. 2.

Agreement plots of short-form (horizontal axis) versus original long-form (vertical axis) scale levels with regression lines. For each scale, the mean score (95% CI) for the long form is presented in the upper-left corner, and the mean score for the short form in the lower right. A short-form mean outside the long-form CI is marked by an asterisk.

Effect sizes and responsiveness in groups of patients undergoing treatment were preserved in the short-form scales, as shown in Table 4. Only the short-form Anxiety scale had smaller effect size and slightly lower responsiveness than the original long-form. Conversely, the short-form Appearance scale had slightly higher effect size and responsiveness compared to the long form.

Table 4.

Responsiveness and Test–Retest Reliability of the Short-Form Versus the Long-Form Scales

Scale	Effect size short scale	Effect size long scale	F short scale	F long scale	Relative validity index short scale	Relative validity index long scale	Test–retest reliability short scale	Test–retest reliability long scale	Intraclass correlation short vs. long scale
Goiter Symptoms	0.54 [0.47–0.60]	0.52	153	144	1	0.94 [0.84–1.06]	0.83 [0.74–0.90]	0.87	0.89 [0.88–0.90]
Hyperthyroid Symptoms	0.49 [0.42–0.55]	0.46	132	119	1	0.90 [0.79–1.02]	0.89 [0.82–0.93]	0.89	0.93 [0.92–0.94]
Eye Symptoms	0.25 [0.18–0.31]	0.21	41	31	1	0.76 [0.53–1.04]	0.78 [0.63–0.89]	0.86	0.89 [0.86–0.90]
Tiredness	0.52 [0.44–0.60]	0.52	133	131	1	0.98 [0.91–1.06]	0.84 [0.76–0.90]	0.85	0.98 [0.97–0.98]
Cognitive Complaints	0.23 [0.17–0.29]	0.21	35	33	1	0.94 [0.76–1.15]	0.84 [0.74–0.90]	0.88	0.96 [0.96–0.97]
Anxiety	0.46 [0.39–0.52]	0.53	111	147	0.76 [0.66–0.85]	1	0.75 [0.59–0.86]	0.77	0.96 [0.96–0.97]
Depressivity	0.25 [0.18–0.32]	0.25	31	33	0.94 [0.78–1.02]	1	0.86 [0.79–0.91]	0.88	0.96 [0.96–0.97]
Emotional Susceptibility	0.32 [0.25–0.39]	0.37	56	84	0.67 [0.51–0.83]	1	0.82 [0.74–0.88]	0.87	0.95 [0.94–0.96]
Impaired Social Life	0.22 [0.17–0.28]	0.20	31	26	1	0.84 [0.66–1.04]	0.80 [0.65–0.90]	0.84	0.95 [0.94–0.96]
Impaired Daily Life	0.37 [0.31–0.44]	0.38	85	88	0.97 [0.87–1.07]	1	0.82 [0.71–0.91]	0.83	0.98 [0.97–0.98]
Appearance	0.19 [0.12–0.27]	0.11	17	6	1	0.35 [0.10–0.59]	0.75 [0.61–0.85]	0.86	0.94 [0.93–0.95]
Composite scale	0.46		144				0.90 [0.84–0.94]

For each scale, the effect size for the new short form [CI] and the original long form is provided. To compare responsiveness directly, the F-value of the change for each version was calculated and the relative validity index calculated as F/F_max, i.e., the version with the best responsiveness has a relative validity index of 1. In addition, the test–retest reliability for the short form [CI] and the long form, as well as the intraclass correlation between the two versions, are presented to the right.
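The relative validity index defined in this note is a simple ratio and can be checked against the rounded F-values in Table 4. The sketch below is illustrative only; the function name is hypothetical, and because the tabulated F-values are rounded, recomputed indices may differ marginally from the published ones.

```python
def relative_validity(f_short, f_long):
    """Return (index for short form, index for long form), each computed as F / F_max."""
    f_max = max(f_short, f_long)
    return f_short / f_max, f_long / f_max


# F-values for the short and long versions, taken from Table 4
for scale, f_short, f_long in [("Anxiety", 111, 147),
                               ("Emotional Susceptibility", 56, 84),
                               ("Appearance", 17, 6)]:
    rv_short, rv_long = relative_validity(f_short, f_long)
    print(scale, round(rv_short, 2), round(rv_long, 2))
```

For Anxiety this gives 0.76 for the short form and 1 for the long form, matching the table; for Appearance the direction is reversed, with the long form at 0.35.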

Test–retest reliability was similarly preserved in the short-form scales, with only the Appearance scale having significantly but marginally lower reliability (Table 4).

Very high short- versus long-form intraclass correlations were found (0.89–0.98; Table 4). Further, the previously shown discriminant validity was reproduced. Thus, for all short-form scales, the group expected to score higher, as specified in Watt et al. (17), did have significantly higher mean scores than the group expected to score low (data not shown). Responsiveness in the three diagnostic groups evaluated was identical to that demonstrated for the long form (26). DIF according to age in the short form was also identical to the small effects previously identified in the long form (35). For a given level of Tiredness and Depressivity, respectively, younger patients had a tendency to endorse positively worded items (“felt energetic” and “self-confident,” respectively). For the Impaired Social Life scale, younger patients had more conflicts with other people, compared with older patients, but again, the effect was small.

Discussion

The purpose of the present study was to develop an abbreviated version of the ThyPRO. This goal was achieved successfully. Based on previous validation studies and IRT modeling, an abbreviated version of the ThyPRO was developed containing (a) four physical symptom scales, two with three items (Goiter and Eye Symptoms) and two with four items (Hypo- and Hyperthyroid physical symptoms); (b) seven three-item scales about physical, mental, and social well-being and function; (c) one three-item scale concerning appearance; and (d) one single item about impact on overall QoL. Thus, the abbreviated version consists of 39 items, if all physical symptom scales are administered. Each of the 12 short-form scales and the single QoL item can be reported separately, but the seven well-being and function scales can also be summarized in one single composite score.

The validation analyses showed that the abbreviated scales had very high agreement with the original long-form ones, including roughly similar mean levels, and comparable measurement quality. Thus, good test–retest reliability, responsiveness to clinical change, and sensitivity to relevant clinical differences were demonstrated. This preservation of good measurement properties in scales with far fewer items is interpreted as a result of selecting the items with the best measurement properties, under consideration of the conceptual model and content validity, thereby reducing random and systematic measurement error.

The primary strength of this study is the integration of several studies and methodologies in the item reduction process. Thus, several modalities within modern psychometrics (DIF, structural equation modeling for ordinal data, item response modeling) were applied within a firm clinical framework among patients from several clinical studies, including cross-cultural samples. Further, analyses were conducted within the original ThyQoL conceptual model (56) with focus on content validity. However, the final short-form has not been tested as a stand-alone form in an independent, novel clinical sample. This should be considered in future studies. Further, although the aim was to develop and test the instrument in a broad, heterogeneous sample, as specified in the introduction, and although the cross-sectional sample size was fairly large, it was not large enough to permit multigroup analyses (57) according to diagnosis. However, previous studies using an ordinal regression approach (55), less dependent on sample size, have shown only minimal DIF of the ThyPRO scales, according to diagnosis (35). Application of a short form may lead to loss of content validity. The extent to which this has occurred can only be evaluated in qualitative studies (58). However, since previous validation studies have confirmed that the individual scales are unidimensional, the potential loss should theoretically be minimal. As evident from the agreement plots, the short-form scales have fewer measurement points along the entire spectrum, and application of the short forms may also lead to poorer discrimination at the extremes. Another potential weakness was the fact that five of the scales were slightly modified (some items were omitted) to avoid item-level misfit in the IRT model. Since the rescaling was based on these IRT analyses, this may lead to weaker linking between the two versions. On the other hand, as mentioned, the correlation, agreement and mean levels between the two versions of each scale all supported the appropriateness of the present linking.

The applied approach is in line with recent recommendations for item reduction (25). When reviewing the available item reduction literature, the authors found that 55% of the studies had preserved scale structure and the median proportion of reduction was 57% (range 21–88%). The present study is close to this median reduction (from 85 to 39 items, i.e., 54% reduction). In 62% of the studies, only the long form was administered. Use of IRT methods was recommended as advantageous in the suggested guidelines, but was only applied in 11% of the studies.
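The item count and proportional reduction quoted above follow from the composition of the short form given at the start of the Discussion. The short check below simply restates that arithmetic; the dictionary name and labels are illustrative.

```python
# Number of items per component of the short form, as described in the Discussion
ITEMS_PER_COMPONENT = {
    "Goiter Symptoms": 3,
    "Eye Symptoms": 3,
    "Hypothyroid Symptoms": 4,
    "Hyperthyroid Symptoms": 4,
    "Seven well-being and function scales": 7 * 3,
    "Appearance": 3,
    "Overall QoL impact item": 1,
}

LONG_FORM_ITEMS = 85
short_form_items = sum(ITEMS_PER_COMPONENT.values())
reduction = (LONG_FORM_ITEMS - short_form_items) / LONG_FORM_ITEMS

print(short_form_items)           # total number of short-form items
print(round(reduction * 100))     # percentage reduction relative to 85 items
```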

The two-level scale-scoring approach, where both a composite score and the underlying more detailed subscales can be scored for the well-being and function scales, has also been adopted in previous studies. The most prominent is the most widely used short-form measure, the SF-36 Health Survey (59). Based on SF-36, eight domain scores as well as two Component Summaries can be derived (60), depending on the level of detail required in reporting. The scoring of the SF-36 summaries is based on results from principal component analyses, in contrast to the present study, where a simple summation approach was adopted for ease of scoring and reporting.

The short-form offers an advantageous measure, when reduction of respondent burden, potentially increasing response rates, is considered to outweigh the theoretical reduction of content validity and measurement detail. In an ongoing clinical trial among patients with Graves' disease (19,61), time to completion of 39 ThyPRO items was short (median 4 minutes; interquartile range 3–5 minutes), according to time-stamped electronic responses. This may be particularly relevant in longitudinal studies with multiple measurement points and when studying larger samples. Reporting the well-being and function scales as the composite score is recommended, when simplicity of reporting, combined with small measurement intervals and high precision, is the primary goal. When a detailed evaluation of physical, emotional, and social well-being and function is warranted, reporting the individual scales is recommended. In general, administering ThyPRO-39 as a whole is recommended to enhance content validity and comparability. However, it may be relevant to omit entire scales (not individual items), if considered irrelevant for a specific future study. For example, the Eye Symptoms scale may not be administered in a trial among patients with nontoxic goiter. Similarly, scales from the full-length ThyPRO may be selectively added to the ThyPRO-39, if considered of particular importance, for example the Sex Life scale, which is not included in ThyPRO-39.

ThyPRO-39 can be implemented in daily clinical practice (62). Patients may respond to the instrument, for example, prior to their appointment, either from home via an e-mail sent in advance, or in the waiting room. Scale scores might then be transferred to the electronic medical record and evaluated by the clinician, similarly to evaluation of, for example, thyroid function tests. These data can then be used for monitoring and communication purposes. Relevant problems (or lack thereof) may be rapidly identified and addressed, ideally with established thresholds and recommended actions and interventions, including referral to psychosocial intervention. For an ongoing clinical trial, real-time automatic monitoring of responses to ThyPRO has been implemented (61). In response to scores above (i.e., worse than) preset thresholds, e-mail alerts are generated and sent to clinical staff. A similar system could monitor responses in clinical practice. However, there is still a requirement for further research on this, for example evaluating how to communicate these results meaningfully to patients; establishment of alert thresholds; identification and evaluation of effectiveness of relevant interventions, among others.
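The threshold-based alerting described above can be sketched as a simple comparison of scale scores with preset limits. Everything specific in this example is an assumption made for illustration: the threshold values are hypothetical placeholders (the text stresses that alert thresholds remain to be established), and the function name and data layout are not taken from the trial system.

```python
# HYPOTHETICAL thresholds on the 0-100 scale, for illustration only;
# validated alert thresholds have not been established.
ALERT_THRESHOLDS = {"Anxiety": 60, "Depressivity": 60}


def scales_needing_alert(scores, thresholds=ALERT_THRESHOLDS):
    """Return the scales whose score is above (i.e., worse than) its preset threshold."""
    return sorted(
        name
        for name, score in scores.items()
        if name in thresholds and score > thresholds[name]
    )


# Example response: only scales with a configured threshold can trigger an alert
patient_scores = {"Anxiety": 71, "Depressivity": 45, "Tiredness": 92}
print(scales_needing_alert(patient_scores))
```

In a real deployment the returned list would feed whatever notification mechanism the clinic uses; that part is outside the scope of this sketch.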

In conclusion, this study has developed an abbreviated 39-item version of the thyroid-related QoL measurement instrument ThyPRO, with good measurement properties. It has high agreement with the long-form original version, and score levels on one form are comparable to score levels on the other. Function and well-being may be reported as a composite score or as individual scale scores. This abbreviated version, named the ThyPRO-39, is recommended for use in clinical studies, as a possible alternative to the original version. Scoring programs for use on various platforms are available from the first author for both versions.

Footnotes

Acknowledgments

This study has been supported by grants from the Danish Agency for Science, Technology, and Innovation (Council for Strategic Research and Council for Independent Research), Genzyme Corporation, Novo Nordisk Foundation (by unrestricted research grants), and Agnes and Knut Mørk's Foundation. Warm thanks to Sofie Larsen Rasmussen and Lene Frydenberg for help with identification and inclusion of patients; to Sofie Larsen Rasmussen and Kira Bang Bové for help with follow-up; to Sofie Larsen Rasmussen, Kim Æbelø, Emilie Birch, Thea Christophersen, and Sara Kehlet Watt for help with generation of clinical data; and to Selma Watt and Laura Siim Magnussen for data entry. Researchers interested in using the ThyPRO or the ThyPRO-39 may contact the corresponding author.

Author Disclosure Statement

None of the authors has any financial conflicts of interest to declare.

References

1. Vanderpump. 2011. The epidemiology of thyroid disease. Br Med Bull, 99:39–51.

2. Carle, Laurberg, Pedersen, Knudsen, Perrild, Ovesen, Rasmussen, Jorgensen. 2006. Epidemiology of subtypes of hypothyroidism in Denmark. Eur J Endocrinol, 154:21–28.

3. Carle, Pedersen, Knudsen, Perrild, Ovesen, Rasmussen, Laurberg. 2011. Epidemiology of subtypes of hyperthyroidism in Denmark: a population-based study. Eur J Endocrinol, 164:801–809.

4. Bianchi, Zaccheroni, Solaroli, Vescini, Cerutti, Zoli, Marchesini. 2004. Health-related quality of life in patients with thyroid disorders. Qual Life Res, 13:45–54.

5. Watt, Groenvold, Rasmussen, Bonnema, Hegedüs, Bjorner, Feldt-Rasmussen. 2006. Quality of life in patients with benign thyroid disorders. A review. Eur J Endocrinol, 154:501–510.

6. Elberling, Rasmussen, Feldt-Rasmussen, Hording, Perrild, Waldemar. 2004. Impaired health-related quality of life in Graves' disease. A prospective study. Eur J Endocrinol, 151:549–555.

7. Mishra, Sabaretnam, Chand, Agarwal, Verma, Mishra. 2013. Quality of life (QoL) in patients with benign thyroid goiters (pre- and post-thyroidectomy): a prospective study. World J Surg, 37:2322–2329.

8. Cramon, Bonnema, Bjorner, Ekholm, Feldt-Rasmussen, Frendl, Groenvold, Hegedus, Rasmussen, Watt. 2015. Quality of life in patients with benign nontoxic goiter: impact of disease and treatment response, and comparison with the general population. Thyroid, 25:284–291.

9. Bove, Watt, Vogel, Hegedüs, Bjoerner, Groenvold, Bonnema, Rasmussen, Feldt-Rasmussen. 2014. Anxiety and depression are more prevalent in patients with Graves' disease than in patients with nodular goitre. Eur Thyroid J, 3:173–178.

10. Nexo, Watt, Pedersen, Bonnema, Hegedus, Rasmussen, Feldt-Rasmussen, Bjorner. 2014. Increased risk of long-term sickness absence, lower rate of return to work, and higher risk of unemployment and disability pensioning for thyroid patients: a Danish register-based cohort study. J Clin Endocrinol Metab, 99:3184–3192.

11. Nexø, Watt, Cleal, Hegedüs, Bonnema, Rasmussen ÅK, Feldt-Rasmussen, Bjorner. 2015. Exploring the experiences of people with hypo- and hyperthyroidism. Qual Health Res, 25:945–953.

12. Brandt, Almind, Christensen, Green, Brix, Hegedüs. 2012. Excess mortality in hyperthyroidism: the influence of preexisting comorbidity and genetic confounding: a Danish nationwide register-based cohort study of twins and singletons. J Clin Endocrinol Metab, 97:4123–4129.

13. Thvilum, Brandt, Almind, Christensen, Hegedüs, Brix. 2013. Excess mortality in patients diagnosed with hypothyroidism: a nationwide cohort study of singletons and twins. J Clin Endocrinol Metab, 98:1069–1075.

14. Watt, Hegedüs, Rasmussen, Groenvold, Bonnema, Bjorner, Feldt-Rasmussen. 2007. Which domains of thyroid-related quality of life are most relevant? Patients and clinicians provide complementary perspectives. Thyroid, 17:647–654.

15. Watt, Rasmussen, Groenvold, Bjorner, Watt, Bonnema, Hegedüs, Feldt-Rasmussen. 2008. Improving a newly developed patient-reported outcome for thyroid patients, using cognitive interviewing. Qual Life Res, 17:1009–1017.

16. Watt, Bjorner, Groenvold, Rasmussen, Bonnema, Hegedüs, Feldt-Rasmussen. 2009. Establishing construct validity for the thyroid-specific patient reported outcome measure (ThyPRO): an initial examination. Qual Life Res, 18:483–496.

17. Watt, Hegedüs, Groenvold, Bjorner, Rasmussen, Bonnema, Feldt-Rasmussen. 2010. Validity and reliability of the novel thyroid-specific quality of life questionnaire, ThyPRO. Eur J Endocrinol, 162:161–167.

18. Watt, Hegedüs, Bjorner, Groenvold, Bonnema, Rasmussen, Feldt-Rasmussen. 2012. Is thyroid autoimmunity per se a determinant of quality of life in patients with autoimmune hypothyroidism? Eur Thyroid J, 1:186–192.

19. Watt, Cramon, Bjorner, Bonnema, Feldt-Rasmussen, Gluud, Gram, Hansen, Hegedüs, Knudsen, Bach-Mortensen, Nolsoe, Nygaard, Pociot, Skoog, Winkel, Rasmussen. 2013. Selenium supplementation for patients with Graves' hyperthyroidism (the GRASS trial): study protocol for a randomized controlled trial. Trials, 14:119.

20. Graf, Fast, Pacini, Pinchera, Leung, Vaisman, Reiners, Wemeau, Huysmans, Harper, Driedger, de Souza, Castagna, Antonangeli, Braverman, Corbo, Duren, Proust-Lemoine, Edelbroek, Marriott, Rachinsky, Grupe, Watt, Magner, Hegedüs. 2011. Modified-release recombinant human TSH (MRrhTSH) augments the effect of ¹³¹I therapy in benign multinodular goiter: results from a multicenter international, randomized, placebo-controlled study. J Clin Endocrinol Metab, 96:1368–1376.

21. Fast, Hegedus, Pacini, Pinchera, Leung, Vaisman, Reiners, Wemeau, Huysmans, Harper, Rachinsky, de Souza, Castagna, Antonangeli, Braverman, Corbo, Duren, Proust-Lemoine, Marriott, Driedger, Grupe, Watt, Magner, Purvis, Graf. 2014. Long-term efficacy of modified-release recombinant human thyrotropin augmented radioiodine therapy for benign multinodular goiter: results from a multicenter, international, randomized, placebo-controlled, dose-selection study. Thyroid, 24:727–735.

22. Bukvic, Zivaljevic, Sipetic, Diklic, Tausanovic, Paunovic. 2014. Improvement of quality of life in patients with benign goiter after surgical treatment. Langenbecks Arch Surg, 399:755–764.

23. Bukvic, Zivaljevic, Sipetic, Diklic, Tausanovic, Stojanovic, Stevanovic, Paunovic. 2015. Improved quality of life in hyperthyroidism patients after surgery. J Surg Res, 193:724–730.

24. Winther, Watt, Bjorner, Cramon, Feldt-Rasmussen, Gluud, Gram, Groenvold, Hegedus, Knudsen, Rasmussen, Bonnema. 2014. The chronic autoimmune thyroiditis quality of life selenium trial (CATALYST): study protocol for a randomized controlled trial. Trials, 15:115.

25. Goetz, Coste, Lemetayer, Rat, Montel, Recchia, Debouverie, Pouchot, Spitz, Guillemin. 2013. Item reduction based on rigorous methodological guidelines is necessary to maintain validity when shortening composite measurement scales. J Clin Epidemiol, 66:710–718.

26. Watt, Cramon, Hegedus, Bue, Joop, Krogh, Feldt-Rasmussen, Groenvold. 2014. The thyroid-related quality of life measure ThyPRO has good responsiveness and ability to detect relevant treatment effects. J Clin Endocrinol Metab, 99:3708–3717.

27. Watt, Groenvold, Deng, Gandek, Feldt-Rasmussen, Rasmussen, Hegedüs, Bonnema, Bjorner. 2014. Confirmatory factor analysis of the thyroid-related quality of life questionnaire ThyPRO. Health Quality Life Outcomes, 12:126.

28. Samejima. 1969. Estimation of Latent Ability Using a Response Pattern of Graded Scores (Psychometric Monographs No. 17). Richmond, VA: Psychometric Society.

29. Samejima. 1997. Graded Response Model. In: van der Linden, Hambleton (eds) Handbook of Modern Item Response Theory. Springer, New York, NY, pp 85–100.

30. Orlando, Thissen. 2000. Likelihood-based item-fit indices for dichotomous item response theory models. Appl Psychol Meas, 24:50–64.

31. Orlando, Thissen. 2003. Further investigation of the performance of S-X²: an item fit index for use with dichotomous item response theory models. Appl Psychol Meas, 27:289–298.

32. Kang, Chen. 2008. Performance of the generalized S-X² item fit index for polytomous IRT models. J Educ Meas, 45:391–406.

33. Kang, Chen. 2011. Performance of the generalized S-X² item fit index for the graded response model. Asia Pacific Educ Rev, 12:89–96.

34. Watt, Barbesino, Bjorner, Bonnema, Bukvic, Drummond, Groenvold, Hegedus, Kantzer, Lasch, Marcocci, Mishra, Netea-Maier, Ekker, Paunovic, Quinn, Rasmussen, Russell, Sabaretnam, Smit, Torring, Zivaljevic, Feldt-Rasmussen. 2015. Cross-cultural validity of the thyroid-specific quality-of-life patient-reported outcome measure, ThyPRO. Qual Life Res, 24:769–780.

35. Watt, Groenvold, Hegedüs, Bonnema, Rasmussen, Feldt-Rasmussen, Bjorner. 2014. Few items in the thyroid-related quality of life instrument ThyPRO exhibited differential item functioning. Qual Life Res, 23:327–338.

36. Reeve, Hays, Bjorner, Cook, Crane, Teresi, Thissen, Revicki, Weiss, Hambleton, Liu, Gershon, Reise, Lai, Cella. 2007. Psychometric evaluation and calibration of health-related quality of life item banks: plans for the Patient-Reported Outcomes Measurement Information System (PROMIS). Med Care, 45:S22–S31.

37. Anatchkova, Ware Jr, Bjorner. 2011. Assessing the factor structure of a role functioning item bank. Qual Life Res, 20:745–758.

38. McDonald. 1999. Test Theory. A Unified Treatment. Lawrence Erlbaum, Mahwah, NJ.

39. Gibbons, Hedeker. 1992. Full-information item bi-factor analysis. Psychometrika, 57:423–436.

40. Reise, Morizot, Hays. 2007. The role of the bifactor model in resolving dimensionality issues in health outcomes measures. Qual Life Res, 16:19–31.

41. Orlando, Sherbourne, Thissen. 2000. Summed-score linking using item response theory: application to depression measurement. Psychol Assess, 12:354–359.

42. Shrout, Fleiss. 1979. Intraclass correlations: uses in assessing rater reliability. Psychol Bull, 86:420–428.

43. Henderson. 2005. The bootstrap: a technique for data-driven statistics. Using computer-intensive analyses to explore experimental data. Clin Chim Acta, 359:1–26.

44. Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences. First edition. Lawrence Erlbaum, Hillsdale, NJ.

45. Kazis, Anderson, Meenan. 1989. Effect sizes for interpreting changes in health status. Med Care, 27:S178–S189.

46. Deng, Allison, Fang, Ash, Ware Jr. 2013. Using the bootstrap to establish statistical significance for relative validity comparisons among patient-reported outcome measures. Health and Quality of Life Outcomes, 11:1–12.

47. Tammemagi, Frank, Leblanc, Artsob, Streiner. 1995. Methodological issues in assessing reproducibility—a comparative study of various indices of reproducibility applied to repeat ELISA serologic tests for Lyme disease. J Clin Epidemiol, 48:1123–1132.

48. Deyo, Diehr, Patrick. 1991. Reproducibility and responsiveness of health status measures. Statistics and strategies for evaluation. Control Clin Trials, 12:142S–158S.

49. Fayers, Machin. 2000. Quality of Life: Assessment, Analysis and Interpretation. John Wiley, Chichester, United Kingdom.

50. Thorsen, Bjorner. 2010. Reliability of the Copenhagen Psychosocial Questionnaire. Scand J Public Health, 38:25–32.

51. SAS Institute, Inc. 2011. SAS/STAT 9.3 User's Guide. SAS Institute, Inc, Cary, NC.

52. Muthen, Muthen. 2010. Mplus User Guide. Sixth edition. Muthén & Muthén, Los Angeles, CA.

53. Paek, Han. 2013. IRTPRO 2.1 for Windows (Item Response Theory for Patient-Reported Outcomes). Appl Psychol Meas, 37:242–252.

54. 2011. IRTPRO User Guide. Scientific Software International, Lincolnwood, IL.

55. Zumbo. 1999. A Handbook on the Theory and Methods of Differential Item Functioning (DIF). Logistic Regression Modeling as a Unitary Framework for Binary and Likert-Type (Ordinal) Item Scores. First edition. Directorate of Human Resources Research and Evaluation, Department of National Defense, Ottawa, ON.

56. Watt. 2014. Thyroid-Specific Patient-Reported Outcome Measure (ThyPRO). In: Michalos (ed) Encyclopedia of Quality of Life and Well-Being Research. Springer, Dordrecht, pp 6637–6645.

57. Muthen. 1984. A general structural equation model with dichotomous, ordered categorical, and continuous latent variable indicators. Psychometrika, 49:115–132.

58. Terwee, Bot, de Boer, van der Windt, Knol, Dekker, Bouter, de Vet. 2007. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol, 60:34–42.

59. Ware Jr, Sherbourne. 1992. The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection. Med Care, 30:473–483.

60. Ware Jr, Kosinski, Bayliss, McHorney, Rogers, Raczek. 1995. Comparison of methods for the scoring and statistical analysis of SF-36 health profile and summary measures: summary of results from the Medical Outcomes Study. Med Care, 33:AS264–AS279.

61. Cramon, Rasmussen, Bonnema, Bjorner, Feldt-Rasmussen, Groenvold, Hegedus, Watt. 2014. Development and implementation of PROgmatic: a clinical trial management system for pragmatic multi-centre trials, optimised for electronic data capture and patient-reported outcomes. Clin Trials, 11:344–354.

62. Snyder, Aaronson, Choucair, Elliott, Greenhalgh, Halyard, Hess, Miller, Reeve, Santana. 2012. Implementing patient-reported outcomes assessment in clinical practice: a review of the options and considerations. Qual Life Res, 21:1305–1314.