Abstract
Purpose:
This project examined test-retest reliability and survey mode effects across single-item and multi-item measures among adolescent and young adult (AYA) cancer survivors.
Methods:
Forty-six AYAs randomly assigned to a survey mode (phone, online, or paper) completed the survey and were invited to complete the survey again 1 week later.
Results:
Mode effects were found on 6% of single-items and 25% of multi-item scores. Reliability was low for 52% of single-items and 8% of multi-item scores.
Conclusion:
Multi-item measures should generally be used over single-item measures due to better reliability, but single-item measures may be preferable when mode effects are large.
Introduction
One of the great debates in psychometrics and survey development is the use of single- versus multi-item measures. Currently, no consensus exists on whether single-item or multi-item measures are preferred. Single-item measures have a lower burden for both participants and scientists. 1 However, studies have shown that multi-item measures are more reliable and have better content, construct, and criterion validity. 2 While most patient-reported outcomes are multi-item, several single-item measures have been shown to be valid and reliable (e.g., self-rated health, visual analog scale),1,3,4 though this can differ by domain. 5 Previous research on single-item measures has focused on estimating validity and internal consistency for reliability.6,7 However, prior studies have shown that single-item measures are not always superior to multi-item measures.8–10
Test-retest reliability and mode effects tend to be less well studied when comparing single- and multi-item measures. Test-retest reliability reflects the reproducibility of the scale scores and how consistent the scores are over a specified period of time. 11 Mode refers to how the survey is completed, such as with a research assistant, on paper, or online, and mode effects occur when the mode is associated with statistically significant differences in participants' answers. 12 Identifying whether single- or multi-item measures maximize test-retest reliability and minimize mode effects could help reduce participant burden while maintaining study integrity.
The issue of whether to use single- or multi-item measures is particularly salient for adolescents and young adults (AYA) with a history of cancer. AYAs are in the age group most likely to own smartphones, 13 and completing multi-item measures on smartphones is challenging. 14 AYAs may experience several time-intensive developmental milestones (e.g., completing formal education, starting a career, forming relationships, and family planning) during and after cancer treatment. 15 Consequently, long, multi-item survey measures may be challenging for AYAs to complete. Single-item measures are also more feasible for screening and treatment monitoring in clinical care. 16 Given the potential benefits of single-item measures for AYAs, additional evaluation in this population is warranted.
The aim of the present analysis was to evaluate two psychometric properties of single- and multi-item measures among AYA cancer survivors: effects of survey mode (telephone with an interviewer, online, or mailed paper survey) and test-retest reliability. Most prior studies have used an observational design that compared participants' answers across whichever survey mode each participant used, which potentially limits internal validity and the ability to infer direct effects of survey mode on responses.17,18 In this study, we used a randomized, between-subjects design with participants randomly assigned to one of three survey modes. We also included repeated measures to assess test-retest reliability rather than attempting to estimate internal consistency reliability for single-item measures. The study was situated within an NIH-funded program project grant focused on clinical care gaps and unmet needs in AYA cancer survivors; 19 thus, the survey covered multiple domains of the cancer survivorship experience, enabling the investigation of survey mode effects and test-retest reliability for single-item and multi-item measures across these domains.
Methods
Study design
The Valuing Opinions and Insights from Cancer Experiences (VOICE) Program is focused on health care needs and utilization in cancer survivors diagnosed at ages 15–39 years. The activities discussed in this article were a part of survey development for the larger VOICE Program. This pilot study recruited a sample of AYA cancer survivors 2–10 years after diagnosis. A randomized, between-subjects experiment was used to examine differences by mode of completing the survey. Participants were assigned to groups using simple randomization. A repeated measures design was used to examine the test-retest reliability of each single-item and multi-item measure. Study activities were approved and overseen by the Kaiser Permanente Northern California Institutional Review Board.
Participants
Participants were recruited from Kaiser Permanente Northern California (KPNC), a large integrated health care system providing health care to over 4.6 million members in Northern California. Eligible individuals were 15–39 years of age at cancer diagnosis and 2–10 years post-cancer diagnosis when recruited into the study. To prevent overlap with the main VOICE Program survey, eligible individuals had cancer but were not diagnosed with one of the cancer types examined in the main study, which are the top 10 most common cancers in AYAs (leukemia, non-Hodgkin lymphoma, Hodgkin lymphoma, melanoma, sarcoma, colorectal, cervical, thyroid, testicular, and breast cancers). Eligible individuals were continuously enrolled at KPNC since their cancer diagnosis and at least 18 years old at recruitment. Participants did not have to be in remission and could have experienced a second primary cancer or recurrence.
Procedures
Potentially eligible participants (n = 302) were identified from electronic health records at KPNC. Potential participants received a series of emails, phone calls, and mailed paper letter invitations to participate in the study, consistent with the Dillman method. 20 Invitees were further screened to confirm eligibility, including willingness to be randomized to survey completion mode. Consented participants (n = 84) were then randomized to survey mode (mailed paper, online web-based, or interviewer-delivered by phone). One week after completing the first survey, participants completed the same survey by the same mode as the first survey. The 1-week interval was chosen to minimize recall bias and to reduce the likelihood of actual change in symptoms and quality of life that complicates reliability testing. Participants who did not complete the surveys after the initial invitation received up to five reminder attempts. Participants received $15 for completing the first survey and $25 for completing the second survey.
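As an illustration of simple randomization, the following minimal Python sketch assigns each consented participant to one of the three modes independently and with equal probability. The participant IDs, the seed, and the function name are hypothetical; the article does not describe the randomization procedure beyond calling it simple randomization. Because each assignment is independent, the three groups are not guaranteed to be the same size.

```python
import random
from collections import Counter

# The three survey modes named in the Procedures section.
MODES = ("paper", "online", "phone")


def simple_randomize(participant_ids, modes=MODES, seed=None):
    """Assign each participant to a mode independently with equal probability.

    This is simple (unrestricted) randomization: no blocking or
    stratification, so group sizes can differ by chance.
    """
    rng = random.Random(seed)
    return {pid: rng.choice(modes) for pid in participant_ids}


# Hypothetical IDs 1..84 standing in for the 84 consented participants.
assignments = simple_randomize(range(1, 85), seed=2022)
group_sizes = Counter(assignments.values())
print(group_sizes)
```

A blocked scheme would force equal group sizes; simple randomization trades that balance for a simpler procedure.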
Measures
Survey items and measures were drawn from previously evaluated measures 21 and large national surveys. Most items were drawn from the Medical Expenditure Panel Survey, 22 the Patient-Reported Outcomes Measurement Information System (PROMIS), 23 the Functional Assessment of Cancer Therapy measures, 24 and the Childhood Cancer Survivor Study (CCSS) insurance survey. 25 Survey items were grouped into seven domains (Fig. 1): cancer care; work, education, and finances; social support and well-being; health history; fertility and sexual health; demographics; and health behaviors. Except for demographics, which had only single-item measures, each domain had both single- and multi-item measures. A total of 148 items from the survey were evaluated in this study, with 68 of the 148 forming the 12 multi-item measures.

Fig. 1. Survey domains, number of single-items examined, and scores examined. The number of single-items examined was 148, the total across all seven domains. FACT, Functional Assessment of Cancer Therapy; FSDS, Female Sexual Dysfunction Scale; MEPS, Medical Expenditure Panel Survey; PROMIS, Patient-Reported Outcomes Measurement Information System; WAHS, Worry about Affording Health care Scale.
Statistical analyses
To examine the representativeness of the survey sample, we first compared completers, non-completers, refusers, and non-responders on characteristics available from the electronic health record for all four groups. Partial completers and completers were defined as those who were randomized and completed one and two surveys, respectively. Respondents were defined as completers and partial completers combined. Non-completers were defined as those who contacted the study team, completed the screening, and were randomized but did not complete the first survey. Refusers were people who contacted the study team and actively declined to participate. Non-responders were those who passively declined (did not contact the study team and did not participate). Ineligible persons were those who did not meet the eligibility criteria outlined above.
Analyses to identify mode effects and test-retest reliability were performed for each individual item and for each multi-item score. Mode effects analyses used the first survey and compared the three modes using ANOVAs for continuous items and measures and chi-squared tests for categorical or ordinal items. Mode effects were defined as statistically significant (p < 0.05) differences between modes. Test-retest reliability compared responses and scores on the first and second survey using intraclass correlation coefficients (ICCs) for scores and continuous items 26 and kappa statistics for categorical or ordinal items. 27 ICCs over 0.70 and kappa statistics over 0.60 were defined as showing good reliability, consistent with interpretation guidelines.26,28
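The test-retest statistics can be sketched in plain Python as follows. The data are hypothetical, and the ICC form is an assumption: the article does not state which ICC was computed, so the sketch uses the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1). Cohen's kappa is shown unweighted; an ordinal item might instead call for a weighted kappa. The thresholds are those given above.

```python
from collections import Counter

ICC_GOOD = 0.70    # threshold stated in the text for scores and continuous items
KAPPA_GOOD = 0.60  # threshold stated in the text for categorical or ordinal items


def cohen_kappa(first, second):
    """Unweighted Cohen's kappa between survey 1 and survey 2 responses."""
    n = len(first)
    observed = sum(a == b for a, b in zip(first, second)) / n
    c1, c2 = Counter(first), Counter(second)
    # Chance agreement from the marginal distribution at each time point.
    expected = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (observed - expected) / (1 - expected)  # undefined if expected == 1


def icc_2_1(first, second):
    """ICC(2,1) for two administrations (assumed form, see lead-in)."""
    n, k = len(first), 2
    grand = (sum(first) + sum(second)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(first, second)]
    col_means = [sum(first) / n, sum(second) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for x in list(first) + list(second))
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-participant mean square
    msc = ss_cols / (k - 1)               # between-occasion mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical responses from six participants to one categorical item.
item_t1 = ["yes", "yes", "no", "no", "yes", "no"]
item_t2 = ["yes", "no", "no", "no", "yes", "no"]
# Hypothetical multi-item scores from five participants.
score_t1 = [1, 2, 3, 4, 5]
score_t2 = [2, 3, 4, 5, 6]

kappa = cohen_kappa(item_t1, item_t2)
icc = icc_2_1(score_t1, score_t2)
print(f"kappa = {kappa:.2f}, good reliability: {kappa > KAPPA_GOOD}")
print(f"ICC   = {icc:.2f}, good reliability: {icc > ICC_GOOD}")
```

Because the absolute-agreement form penalizes a systematic shift between administrations, the second example yields an ICC below 1 even though the two sets of scores are perfectly correlated.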
Results
From July to October 2022, 302 AYA cancer survivors from KPNC were approached to participate. Seven of these people were found to be ineligible because they could not complete the survey in all three modes or indicated that they did not have cancer. Among those remaining and presumed eligible, 158 did not respond and 53 refused. Of the 84 survivors who enrolled in the study and were randomized, 38 did not complete the survey (non-completers). A total of 46 people completed the survey, with 39 completing both surveys and 7 completing only the first survey. As shown in Supplementary Table S1 (supplementary materials) and Table 1, non-completers, refusers, and non-responders did not differ from completers on age at diagnosis or cancer type. A greater proportion of the non-completers, refusers, and non-responders were Hispanic or Asian compared to completers. More than half of the 46 respondents had completed college (63%) and identified as female (56%) (not shown). Respondents were a mean of 6.3 years post-diagnosis, and most (61% of the entire sample) had undergone surgery for their cancer (not shown).
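The participant flow reported above can be checked with a few lines of arithmetic. The counts are taken from the text; the variable names are illustrative, and the percentages are derived here from those counts rather than reported by the authors.

```python
# Counts reported in the Results section.
approached = 302
ineligible = 7
non_responders = 158
refusers = 53
randomized = 84
non_completers = 38
completed_both = 39
completed_first_only = 7

presumed_eligible = approached - ineligible
respondents = completed_both + completed_first_only

# The reported groups should partition the people approached.
assert presumed_eligible == non_responders + refusers + randomized
assert randomized == non_completers + respondents

# Derived proportions (not stated in the article).
print(f"Presumed eligible: {presumed_eligible}")
print(f"Respondents: {respondents}")
print(f"Respondents / presumed eligible: {respondents / presumed_eligible:.1%}")
print(f"Respondents / randomized: {respondents / randomized:.1%}")
print(f"Completed retest / respondents: {completed_both / respondents:.1%}")
```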
Table 1. Characteristics of AYA Cancer Survivor Survey Participants by Survey Mode Randomization Group and Level of Participation (a)
(a) Characteristics are from the medical record.
This pilot survey did not include people diagnosed with the 10 most common AYA cancers [listed in methods] to prevent overlap with the main survey.
Mode effects
A small percentage of items showed effects by survey mode (Fig. 2). By domain, the proportion of single-items with mode effects ranged from 0% (fertility and sexual health; demographics; health behaviors) to 17% (cancer care). The overall proportion of items with mode effects was 6%. Mode effects were seen on items about receiving help or care discussions, mental health, and having heart conditions. In general, items with significant mode differences showed that participants completing surveys online or by paper were more likely to report worse health or less help than those completing the survey by phone. Of the 12 multi-item scores, 3 (25%; anxiety, sexual satisfaction, and quality of life) showed significant mode effects, with participants who completed the surveys by phone reporting better health than those completing the surveys online or by paper.
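The chi-squared comparison used for categorical items can be sketched as follows for the simplest case, a yes/no item compared across the three modes. The counts are hypothetical and are not study data. The p-value shortcut relies on the fact that a 3 x 2 table has 2 degrees of freedom, for which the chi-squared upper-tail probability is exactly exp(-x/2); other table sizes would need a statistics library.

```python
import math


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a contingency table (list of rows).

    Assumes every row and every column contains at least one response,
    so that no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail probability of a chi-squared variable with 2 df only."""
    return math.exp(-stat / 2)


# Hypothetical counts; rows are phone, online, paper; columns are yes, no.
table = [[12, 3], [7, 8], [6, 10]]
stat = chi_squared_statistic(table)
p = p_value_df2(stat)
print(f"chi-squared = {stat:.2f}, p = {p:.3f}, mode effect: {p < 0.05}")
```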

Fig. 2. Proportion of scores or items with mode effects and low test-retest reliability by domain.
Test-retest reliability
Many individual items showed low test-retest reliability (Fig. 2). Except for demographics, between 35% (health history) and 78% (social support and well-being) of items within a survey content domain had low reliability. Individual items with low reliability included the financial worry items from the Work, Education, and Finances domain and whether a care provider discussed certain topics at the time of diagnosis from the Cancer Care domain. Conversely, only one multi-item score (benefit finding in the Cancer Care domain; 8%) had low reliability. Cronbach’s alpha values for the multi-item measures are reported in Supplementary Table S2.
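For readers less familiar with Cronbach's alpha, the following sketch shows the standard formula, alpha = k/(k-1) x (1 - sum of item variances / variance of the total score). The item responses are hypothetical, and the sketch is not the authors' analysis code.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a multi-item measure.

    item_scores is a list with one inner list per item; each inner list
    holds that item's responses, one per respondent, in the same order.
    """
    k = len(item_scores)
    totals = [sum(responses) for responses in zip(*item_scores)]
    item_variance_sum = sum(variance(item) for item in item_scores)
    # Undefined if all respondents have the same total score.
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical three-item measure answered by five respondents.
items = [
    [3, 4, 2, 5, 4],
    [2, 4, 3, 5, 5],
    [3, 5, 2, 4, 4],
]
print(f"alpha = {cronbach_alpha(items):.2f}")
```

Alpha describes agreement among items at a single administration, so it complements rather than replaces the test-retest estimates reported here.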
Discussion
In this study of survey response characteristics in AYA cancer survivors, we found that mode effects were more common in multi-item measures, but multi-item measures had better test-retest reliability compared to single-item measures. Except for demographics, the results tended to be consistent across the other six content domains of the survey. Our findings support the use of multi-item measures over single-item measures with AYAs due to better reliability. However, single-item measures may be preferable when mode effects are large and when controlling for mode or limiting modes is not possible. For example, study results showed that phone surveys possibly had a social desirability effect compared to paper and electronic surveys. When phone surveys have to be used, single-item measures may be preferable to multi-item measures. Regardless, most studies should consider multi-item measures due to better test-retest reliability.
Consistent with prior studies,17,18 mode effects showed that participants reported worse health when completing the survey online or on paper than when completing it by phone. One potential reason for the mode effects was social desirability for the phone survey. Social desirability is more likely to happen with phone or in-person surveys;29–31 our finding of significant mode effects suggests that self-administered surveys on paper or through electronic platforms may be preferable. While not observed in this study, prior research has shown that there may be non-equivalence between electronic (web-based) surveys and paper surveys. These differences could result from electronic surveys more often being completed in the participant's natural setting, with less control over the format of the survey and potential influences from the environment.32–34 This study rigorously developed the survey tools to ensure equivalence across modes, and web and paper surveys were both completed by the participants in their natural environment. Both of these factors could explain the equivalence of paper and electronic surveys observed here.
The better reliability of multi-item scores was also consistent with previous studies, although some items would be expected to have lower estimated reliability because the question's timeframe refers to a state that changes over time (e.g., feelings right now, current financial worry). 1 The better test-retest reliability of multi-item scores is not surprising given that longer measures tend to have better internal consistency, another form of reliability.35,36 The choice of multi-item versus single-item measures should consider the specific characteristics of the study, including the number of constructs measured and previously documented mode effects.
Future studies are needed to examine how many items are necessary to gain the improved reliability of multi-item measures with AYAs. Many of the multi-item measures in this study had two or three items, suggesting that multi-item measures could consist of only a few items to balance the benefits of single- and multi-item measures. The two-item global health measures from the PROMIS have been shown to be reliable and valid, further supporting the potential utility of brief measures. 37 This could be particularly useful for studies of AYAs such as in this study of cancer survivors, as many AYAs access surveys via smartphones 13 and completing complex multi-item measures on smartphones is challenging. 14 Remote patient monitoring using PROs through various platforms (text, web-based, apps, and patient portals) is also becoming common within oncology, and brief PROs could be helpful for reducing patient burden while supporting these efforts. 38 AYAs also undergo developmental milestones that might require unique survey measures. Single-item measures may seem to be an appropriate way to assess these important milestones and associated health outcomes. However, using individual items from multi-item measures to assess these experiences could lead to more error in statistical analyses, requiring larger sample sizes. 39 When using multi-item measures in surveys of AYAs, the multi-item score should be prioritized in analyses instead of each individual item to improve reliability.
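One way to frame the question of how many items are needed is the Spearman-Brown prophecy formula from classical test theory, which this study did not apply. The sketch below assumes parallel items (equal reliability and equal inter-item correlations), an idealization that real measures only approximate; the reliability values are hypothetical.

```python
import math


def spearman_brown(single_item_reliability, n_items):
    """Predicted reliability of a score built from n_items parallel items."""
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


def items_needed(single_item_reliability, target):
    """Smallest number of parallel items predicted to reach the target."""
    r = single_item_reliability
    k = target * (1 - r) / (r * (1 - target))
    # Small tolerance so an exact integer is not rounded up by float error.
    return max(1, math.ceil(k - 1e-9))


# Hypothetical single-item reliability of 0.50 and the 0.70 ICC threshold.
for n in (1, 2, 3, 5):
    print(n, round(spearman_brown(0.50, n), 3))
print("items needed for 0.70:", items_needed(0.50, 0.70))
```

Under these assumptions the gain from each added item shrinks quickly, which is in line with the suggestion that measures of only a few items might capture much of the reliability benefit.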
Research studies, particularly those focused on health disparities, are moving toward using ecological momentary assessment (EMA), in which participants complete two or more brief surveys per day. 40 EMA studies may be particularly important for assessing the quality of life of AYAs as they experience the aforementioned developmental milestones that might complicate single surveys with long recall periods (i.e., 6–12 months). Most multi-item measures would be too long for EMA studies even if the timeframe was adjusted to reflect the past day or few hours. While multi-item measures should be used as developed unless modern psychometrics such as item response theory were used, additional studies are needed to adapt such measures to ultra-brief (2–3 items) versions that may be more appropriate for use with AYAs and especially for EMA studies. These measures could be useful both for EMA and for quickly assessing outcomes associated with AYAs’ developmental milestones.
The limitations of this study provide important context. The small sample size and number of multi-item measures limited the ability to compare across content domains. The sample size from this study may not have been large enough to reliably estimate effects for items with more than two response categories given the low frequency of some responses. Because this study was part of a larger project, the items and measures were limited to those needed for the parent study. Participants were also restricted to one health care system in one state, which might not reflect responses from AYA cancer survivors more generally. The need to be willing to complete the survey in all three modes could have led to the low response rate and limited generalizability. There were also some differences in completion by randomized mode, and this could have affected responses. Validity was also not assessed beyond examining mode effects. These limitations are balanced out by the major strengths of the study: randomization to mode, repeated measures with a short interval for test-retest reliability, and a longer survey with measures across multiple domains of the cancer survivorship experience.
Our findings suggest that multi-item measures should generally be used over single-item measures with AYAs due to better reliability, but single-item measures may be preferable when mode effects are large. Additional research is warranted to develop very brief multi-item measures and to examine how best to balance reliability with participant burden and mode effects. These very brief multi-item measures could then help promote research to improve quality of life and care experiences of AYAs.
Acknowledgments
The authors thank the patients of Kaiser Permanente Northern California for helping to improve care through the use of information collected through our electronic health record systems and the KPWHRI Survey Research Program, who were instrumental in conducting this research.
Authors’ Contributions
S.M.W.J.: Conceptualization, methodology, formal analysis, writing—original draft, writing—review and editing. T.H.M.K.: Conceptualization, supervision, funding acquisition, writing—review and editing. E.E.H.: Conceptualization, supervision, funding acquisition, writing—review and editing. H.B.N.: Conceptualization, supervision, funding acquisition, writing—review and editing. E.S.O’M.: Conceptualization, data curation, writing—review and editing. M.F.G.: Conceptualization, methodology, writing—review and editing. K.J.W.: Conceptualization, supervision, writing—review and editing. C.A.M.S.: Conceptualization, writing—review and editing. A.C.K.: Conceptualization, writing—review and editing. C.A.L.: Conceptualization, data curation, writing—review and editing. L.H.K.: Conceptualization, supervision, funding acquisition, writing—review and editing. J.C.: Conceptualization, methodology, supervision, funding acquisition, data curation, writing—review and editing.
Footnotes
Author Disclosure Statement
No competing financial interests exist.
Funding Information
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number P01CA233432. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Supplemental Material
References
