A comparison of methods for designing hybrid type 2 cluster-randomized trials with continuous effectiveness and implementation endpoints

Abstract

Hybrid type 2 studies are gaining popularity for their ability to assess both implementation and health outcomes as co-primary endpoints. These studies are often conducted as cluster-randomized trials (CRTs), and five design methods can validly power them: p-value adjustment methods, the combined outcomes approach, the single weighted 1-DF test, the disjunctive 2-DF test, and the conjunctive test. We compared these methods theoretically and numerically. Theoretical comparisons of the power equations allowed us to identify when one method is globally more or less powerful than another. We showed that p-value adjustment methods are always less powerful than both the combined outcomes approach and the single 1-DF test, and identified conditions under which the disjunctive 2-DF test is less powerful than the single 1-DF test. To further investigate when power advantages shift, we conducted a large-scale numerical study using our novel crt2power R package, which calculates power or sample size for CRTs with two continuous co-primary endpoints using these methods. Across 45,000 input scenarios, we found specific patterns: when treatment effects are unequal, the disjunctive 2-DF test tends to be most powerful; when treatment effects are equal, the single 1-DF test tends to dominate. Together, these comparisons offer practical guidance for powering hybrid type 2 studies.

Keywords

Hybrid type 2 studies, crt2power, co-primary outcomes, CRTs, implementation-effectiveness studies, cluster-randomized trials, multiple outcomes

1. Introduction

Cluster-randomized trials (CRTs) are studies in which the unit of randomization is a cluster rather than an individual. Clusters can be villages, towns, hospitals, wards of a hospital, and so on. CRTs can offer logistical and administrative convenience and reduce treatment group contamination, and they are advantageous when the intervention in question is best administered at a cluster level.¹ In implementation science, our fundamental research goal is to understand how best to deliver an intervention effectively. CRTs are often used in these endeavors because many implementation outcomes are measured and assessed at the cluster level. Effectiveness-implementation hybrid designs offer simultaneous assessment of a health (or effectiveness) outcome and an implementation outcome. On one end of the spectrum, hybrid type 1 studies consider the health outcome as the primary outcome and the implementation outcome as the secondary outcome. On the other end, hybrid type 3 studies consider the implementation outcome as the primary outcome and the health outcome as the secondary outcome. The focus of this article is the hybrid type 2 study, which considers both outcomes as co-primary outcomes.²

Hybrid type 2 studies are advantageous because they allow simultaneous analysis of an effectiveness outcome and an implementation outcome in one study. This added efficiency brings several statistical complications. The first is multiple testing, where one must control the overall type I error rate. How this rate is managed has important implications for the study design parameters that result from the power calculations. The second is clustering, which introduces correlations arising from the hierarchical structure of the data. In a recent study by Owen et al., five study design methods (specifically for sample size and power calculation) that can be applied to hybrid type 2 studies were identified through a literature review.³ These methods were applied to data motivated by a real-world hybrid study to showcase how the calculations could be conducted and how the considerations differ. However, to date, no formal analytic or simulation-based comparisons have been made across the five methods in the context of hybrid type 2 designs. To fill this gap, we aim to thoroughly examine the performance of these methods via theoretical comparisons and a numerical evaluation in which scenarios vary in their assumptions and input parameters.

To conduct the numerical evaluation, and to make these methods more widely available for practice with hybrid type 2 designs, we also created an R package called crt2power.⁴ This package allows users to calculate the statistical power or sample size requirements for CRTs with two continuous co-primary endpoints given a set of input parameters. It includes code for each of the five methods identified in Owen et al., namely the p-value adjustment methods, the combined outcomes approach, the single weighted 1-degree of freedom (DF) combined test, the disjunctive 2-DF test, and the conjunctive intersection–union test.³ Each method has three functions: (a) a function to calculate the statistical power (the probability that a given test correctly rejects the null hypothesis) given the number of clusters, cluster size (number of individuals per cluster), and other input parameters, (b) a function to calculate the required number of clusters given the statistical power and cluster size, and (c) a function to calculate the cluster size given the statistical power and number of clusters. The R package and its usage information are publicly available at https://github.com/melodyaowen/crt2power. The package is also available on the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/web/packages/crt2power/index.html.⁴ Accompanying the R package is a ShinyApp, crt2powerApplication, which provides a user-friendly interface where input values can be entered and the resulting power or sample size is calculated and displayed for all five methods. The ShinyApp is available at https://mowen17.shinyapps.io/crt2powerApplication/.

In this article, we begin in Section 2 by introducing notation for CRTs and hybrid type 2 studies and by describing the study design methods that were introduced in Owen et al.³ and are examined in the numerical evaluation. Then, a thorough description of the software package and ShinyApp is provided, along with usage examples, in Section 3. Next, a theoretical comparison of the power equations is provided in Section 4. Lastly, a numerical evaluation comparing the performance of these methods is conducted in Section 5.

2. Methods

2.1 Notation

We begin by introducing notation for CRTs in a hybrid type 2 setting. In this setting, we have two continuous primary outcomes, so that the number of outcomes is $Q = 2$ with outcome index $q = 1, 2$. We consider two treatment groups; in this manuscript, we assume equal treatment allocation, but note that our software accommodates unequal treatment allocation. We denote $K_{1}$ as the number of clusters in the treatment group and $K_{2}$ as the number of clusters in the control group. Under equal allocation, $K_{1} = K_{2}$, and we refer to this common number of clusters per group simply as K, with the clusters indexed as $k = 1, \dots, 2 K$. The total number of clusters in the study is $K_{1} + K_{2}$, or $2 K$, and the number of individuals in each cluster is m, indexed as $j = 1, \dots, m$. The two continuous outcomes are $Y_{1, k j}$ and $Y_{2, k j}$, which we generically refer to as $Y_{1}$ and $Y_{2}$, omitting the cluster and individual subscripts whenever there is no ambiguity.

When using the p-value adjustment methods, which we will discuss in the next section, the power, cluster size, and number of clusters are calculated separately for each outcome. Thus, we have $K^{(q)}$ as the number of clusters in each treatment group based on the qth outcome, and $m^{(q)}$ as the number of individuals per cluster based on the qth outcome. We assume the continuous outcomes follow a bivariate linear mixed model, given by:

$$\begin{pmatrix} Y_{1, k j} \\ Y_{2, k j} \end{pmatrix} = \begin{pmatrix} γ_{1} \\ γ_{2} \end{pmatrix} + \begin{pmatrix} β_{1}^{*} \\ β_{2}^{*} \end{pmatrix} X_{k} + \begin{pmatrix} b_{1, k} \\ b_{2, k} \end{pmatrix} + \begin{pmatrix} e_{1, k j} \\ e_{2, k j} \end{pmatrix},$$

where $X_{k}$ denotes the treatment group for cluster k ($X_{k} = 1$ for treatment, $X_{k} = 0$ for control), $β_{1}^{*}$ and $β_{2}^{*}$ are the treatment effects on the first and second outcome, respectively, and $γ_{1}$ and $γ_{2}$ are the intercepts. Then, $b_{k} = (b_{1, k}, b_{2, k})^{T}$ is the vector of random intercepts for cluster k for each outcome q, and is assumed to follow $N (0_{2 \times 1}, Σ_{b})$, and $e_{k j} = (e_{1, k j}, e_{2, k j})^{T}$ is the random error for each subject j for each outcome q, and follows $N (0_{2 \times 1}, Σ_{e})$. Note that $Σ_{b}$ and $Σ_{e}$ must be positive definite. The diagonal elements of $Σ_{b}$ and $Σ_{e}$ are $σ_{q, B}^{2}$ and $σ_{q, W}^{2}$, respectively, and the off-diagonal elements are $σ_{12, B}$ and $σ_{12, W}$. The total variances of $Y_{1}$ and $Y_{2}$ are thus $Var (Y_{1}) = σ_{1}^{2} = σ_{1, B}^{2} + σ_{1, W}^{2}$ and $Var (Y_{2}) = σ_{2}^{2} = σ_{2, B}^{2} + σ_{2, W}^{2}$.

Throughout the manuscript, we consider four key correlation coefficients: the endpoint-specific intraclass correlation coefficients (ICCs) for $Y_{1}$ and $Y_{2}$, the inter-subject between-endpoint ICC, and the intra-subject between-endpoint ICC. They are defined, respectively, as $ρ_{0}^{(1)} = ICC (Y_{1}) = Corr (Y_{1, k j}, Y_{1, k j^{'}}) = σ_{1, B}^{2} / (σ_{1, B}^{2} + σ_{1, W}^{2})$ (i.e. the correlation for $Y_{1}$ for two different individuals in the same cluster); $ρ_{0}^{(2)} = ICC (Y_{2}) = Corr (Y_{2, k j}, Y_{2, k j^{'}}) = σ_{2, B}^{2} / (σ_{2, B}^{2} + σ_{2, W}^{2})$ (i.e. the correlation for $Y_{2}$ for two different individuals in the same cluster); $ρ_{1}^{(1, 2)} = Corr (Y_{1, k j}, Y_{2, k j^{'}}) = σ_{12, B} / ((σ_{1, B}^{2} + σ_{1, W}^{2})^{1 / 2} (σ_{2, B}^{2} + σ_{2, W}^{2})^{1 / 2})$ (i.e. the correlation between $Y_{1}$ and $Y_{2}$ for different individuals in the same cluster); and $ρ_{2}^{(1, 2)} = Corr (Y_{1, k j}, Y_{2, k j}) = (σ_{12, B} + σ_{12, W}) / ((σ_{1, B}^{2} + σ_{1, W}^{2})^{1 / 2} (σ_{2, B}^{2} + σ_{2, W}^{2})^{1 / 2})$ (i.e. the correlation between $Y_{1}$ and $Y_{2}$ for the same individual).⁵ These are used to calculate the variance inflation factors (VIFs), which appear in many of the power equations and quantify how clustering inflates the variance of each outcome and their correlations. The VIFs are defined as ${VIF}_{1} = 1 + (m - 1) ρ_{0}^{(1)}$, ${VIF}_{2} = 1 + (m - 1) ρ_{0}^{(2)}$, and ${VIF}_{12} = ρ_{2}^{(1, 2)} + (m - 1) ρ_{1}^{(1, 2)}$. Information regarding each parameter used in these equations and in the software package is summarized in Table 1.
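As a quick illustration of these definitions, the following Python sketch computes the three VIFs and inverts the four correlation definitions to recover $Σ_{b}$ and $Σ_{e}$, so that the positive-definiteness requirement can be checked for a given set of inputs. It is not part of crt2power; the function names and the numeric inputs are hypothetical and chosen only for illustration.

```python
from math import sqrt


def variance_inflation_factors(m, rho01, rho02, rho1, rho2):
    """Return (VIF_1, VIF_2, VIF_12) for a common cluster size m."""
    vif1 = 1 + (m - 1) * rho01
    vif2 = 1 + (m - 1) * rho02
    vif12 = rho2 + (m - 1) * rho1
    return vif1, vif2, vif12


def covariance_components(var_y1, var_y2, rho01, rho02, rho1, rho2):
    """Recover Sigma_b and Sigma_e from total variances and the four ICCs."""
    sd_prod = sqrt(var_y1 * var_y2)
    # Between-cluster part: sigma_{q,B}^2 = rho_0^(q) * sigma_q^2,
    # sigma_{12,B} = rho_1 * sigma_1 * sigma_2.
    sigma_b = [[rho01 * var_y1, rho1 * sd_prod],
               [rho1 * sd_prod, rho02 * var_y2]]
    # Within-cluster part: what is left of each total (co)variance.
    sigma_e = [[(1 - rho01) * var_y1, (rho2 - rho1) * sd_prod],
               [(rho2 - rho1) * sd_prod, (1 - rho02) * var_y2]]
    return sigma_b, sigma_e


def is_positive_definite_2x2(mat):
    """A symmetric 2x2 matrix is positive definite iff both leading minors are > 0."""
    det = mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
    return mat[0][0] > 0 and det > 0


# Hypothetical inputs, for illustration only.
vif1, vif2, vif12 = variance_inflation_factors(
    m=50, rho01=0.05, rho02=0.10, rho1=0.02, rho2=0.30)
sigma_b, sigma_e = covariance_components(1.0, 1.0, 0.05, 0.10, 0.02, 0.30)
print(vif1, vif2, vif12)
print(is_positive_definite_2x2(sigma_b), is_positive_definite_2x2(sigma_e))
```

Checking positive definiteness before any power calculation is a cheap guard: a between-endpoint ICC that is too large relative to the endpoint-specific ICCs describes a covariance structure that cannot exist.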

Table 1.
Description of required input parameters.

Parameter	Statistical notation	Variable name in R package	Description
Statistical Power	$π$	power	Probability of detecting a true effect under the alternative hypothesis
Number of clusters	$K$	K	Number of clinics in each treatment arm
Cluster Size	$m$	m	Number of patients in each clinic
Overall (family-wise) False Positive Rate	$α$	alpha	Probability of one or more Type I error(s)
Effect for $Y_{1}$	$β_{1}^{*}$	beta1	Estimated intervention effect on $Y_{1}$, in percentage point increase
Effect for $Y_{2}$	$β_{2}^{*}$	beta2	Estimated intervention effect on $Y_{2}$, in percentage point increase
Total Variance of $Y_{1}$	$Var (Y_{1}) = σ_{1}^{2}$	varY1	Total variance of the first outcome
Total Variance of $Y_{2}$	$Var (Y_{2}) = σ_{2}^{2}$	varY2	Total variance of the second outcome
Endpoint-specific ICC for $Y_{1}$	$ICC (Y_{1}) = ρ_{0}^{(1)}$	rho01	Correlation for $Y_{1}$ for two different individuals in the same cluster
Endpoint-specific ICC for $Y_{2}$	$ICC (Y_{2}) = ρ_{0}^{(2)}$	rho02	Correlation for $Y_{2}$ for two different individuals in the same cluster
Inter-subject between-endpoint ICC	$Corr (Y_{1, k j}, Y_{2, k j^{'}}) = ρ_{1}^{(1, 2)}$	rho1	Correlation between $Y_{1}$ and $Y_{2}$ for two different individuals in the same cluster
Intra-subject between-endpoint ICC	$Corr (Y_{1, k j}, Y_{2, k j}) = ρ_{2}^{(1, 2)}$	rho2	Correlation between $Y_{1}$ and $Y_{2}$ for the same individual
Treatment allocation ratio	$r$	r	Treatment allocation ratio: $K_{2} = r K_{1}$ where $K_{1}$ is the number of clusters in the experimental group
Statistical distribution	–	dist	Specification of which distribution to base calculation on, either the $χ^{2}$ -distribution or F-distribution^a

^a When the χ²-distribution is selected, all methods use this distribution except the conjunctive IU test, which uses the multivariate normal (MVN) distribution; when the F-distribution is selected, all methods use this distribution except the conjunctive IU test, which uses the t-distribution.

The non-centrality parameter used for power calculations is $λ$ , and the statistical power is denoted as $π$ . Then, under this model, the standard design formulas for a CRT for the qth outcome and equal treatment allocation are

$$m^{(q)} = \frac{2 (Z_{1 - α / 2} + Z_{β})^{2} σ_{q}^{2} (1 - ρ_{0}^{(q)})}{(β_{q}^{*})^{2} K^{(q)} - 2 (Z_{1 - α / 2} + Z_{β})^{2} σ_{q}^{2} ρ_{0}^{(q)}}, \qquad K^{(q)} = \frac{2 (Z_{1 - α / 2} + Z_{β})^{2} σ_{q}^{2} [1 + (m^{(q)} - 1) ρ_{0}^{(q)}]}{m^{(q)} (β_{q}^{*})^{2}},$$

$$λ^{(q)} = \frac{(β_{q}^{*})^{2}}{2 \frac{σ_{q}^{2}}{K^{(q)} m^{(q)}} [1 + (m^{(q)} - 1) ρ_{0}^{(q)}]},$$

as shown in Donner and Klar, where $λ^{(q)}$ is the non-centrality parameter based on the qth outcome. They also extended these formulas to accommodate unequal treatment allocation.⁶ In the term $(Z_{1 - α / 2} + Z_{β})^{2}$, $Z_{1 - α / 2}$ is the $(1 - (α / 2)) \times 100$th lower percentile of the standard normal distribution with error rate $α$, and $Z_{β}$ is the critical value corresponding to $β$, where $β$ here denotes the type II error rate (one minus the power) rather than a treatment effect. So, ignoring multiple testing, for example, at 80% power and a significance level of 5%, we have $(Z_{1 - α / 2} + Z_{β})^{2} = (1.96 + 0.84)^{2} = 7.84$. Next, we describe each of the study design methods that are available in the crt2power R package⁴ and ShinyApp, all of which are included in the numerical evaluation in Section 5.
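The single-outcome design formulas above can be sketched in a few lines of Python. This is an illustrative re-implementation of the printed equations under equal allocation, not the crt2power code; the function names and the example inputs are hypothetical. Note that unrounded normal quantiles give a multiplier of about 7.85 rather than the 7.84 obtained from the rounded values 1.96 and 0.84.

```python
from math import ceil
from statistics import NormalDist


def z_multiplier(alpha, power):
    """(Z_{1-alpha/2} + Z_beta)^2 with unrounded standard normal quantiles."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


def clusters_per_arm(beta, var_y, rho0, m, alpha=0.05, power=0.80):
    """K^(q): clusters per arm needed for one outcome (not rounded)."""
    zz = z_multiplier(alpha, power)
    return 2 * zz * var_y * (1 + (m - 1) * rho0) / (m * beta ** 2)


def cluster_size(beta, var_y, rho0, K, alpha=0.05, power=0.80):
    """m^(q): individuals per cluster needed for one outcome (not rounded)."""
    zz = z_multiplier(alpha, power)
    denom = beta ** 2 * K - 2 * zz * var_y * rho0
    if denom <= 0:
        # With this few clusters, no cluster size reaches the target power.
        raise ValueError("K is too small for the requested power")
    return 2 * zz * var_y * (1 - rho0) / denom


def noncentrality(beta, var_y, rho0, K, m):
    """lambda^(q) for one outcome with K clusters per arm of size m."""
    return beta ** 2 / (2 * var_y / (K * m) * (1 + (m - 1) * rho0))


# Hypothetical inputs, for illustration only.
K_exact = clusters_per_arm(beta=0.2, var_y=1.0, rho0=0.05, m=30)
print(ceil(K_exact))  # round up to a whole number of clusters per arm
```

The three functions are algebraically consistent: plugging the unrounded K back into `cluster_size` returns the original m, and the non-centrality parameter at that design equals the multiplier itself.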

2.2 Study design methods

Here, we briefly describe the study design methods that are available in the crt2power package⁴ and accompanying ShinyApp. Table 2 summarizes each of the methods, including the hypothesis test, non-centrality parameter, equation for statistical power, and test statistic on which the power calculations are based. We denote statistical power as $π$, and the statistical power based on the qth outcome as $π^{(q)}$. For more details, we refer the reader to Owen et al.³

Table 2.
Summary of study design formulas for hybrid type 2 CRTs.

Method	Hypothesis setup	Non-centrality parameter and power	Number of clusters in the treatment group (K)
p-Value Adjustments	$H_{0} : β_{1}^{*} = 0$ and $β_{2}^{*} = 0$; $H_{A} : β_{1}^{*} \neq 0$ or $β_{2}^{*} \neq 0$	$λ^{(q)} = \frac{(β_{q}^{*})^{2}}{2 \frac{σ_{q}^{2}}{K m} [1 + (m - 1) ρ_{0}^{(q)}]}$; $π^{(q)} = 1 - χ^{2} [λ^{(q)}, 1]$; $π = \min (π^{(1)}, π^{(2)})$	$K^{(q)} = \frac{2 (Z_{1 - α / 2} + Z_{β})^{2} σ_{q}^{2} [1 + (m^{(q)} - 1) ρ_{0}^{(q)}]}{m^{(q)} (β_{q}^{*})^{2}}$
Combined Outcomes	$H_{0} : β_{c}^{*} = 0$; $H_{A} : β_{c}^{*} \neq 0$	$λ = \frac{(β_{c}^{*})^{2}}{2 \frac{σ_{c}^{2}}{K m} [1 + (m - 1) ρ_{0}^{(c)}]}$; $π = 1 - χ^{2} [λ, 1]$	$K = \frac{2 (Z_{1 - α / 2} + Z_{β})^{2} σ_{c}^{2} [1 + (m - 1) ρ_{0}^{(c)}]}{m (β_{c}^{*})^{2}}$, where $σ_{c}^{2} = σ_{1}^{2} + σ_{2}^{2} + 2 ρ_{2}^{(1, 2)} σ_{1} σ_{2}$ and $ρ_{0}^{(c)} = \frac{ρ_{0}^{(1)} σ_{1}^{2} + ρ_{0}^{(2)} σ_{2}^{2} + 2 ρ_{1}^{(1, 2)} σ_{1} σ_{2}}{σ_{1}^{2} + σ_{2}^{2} + 2 ρ_{2}^{(1, 2)} σ_{1} σ_{2}}$
Single Weighted 1-DF Combined Test	$H_{0} : β_{1}^{*} + β_{2}^{*} = 0$; $H_{A} : β_{1}^{*} + β_{2}^{*} \neq 0$	$λ = \left[ \frac{\sqrt{\frac{(β_{1}^{*})^{2}}{\frac{2 σ_{1}^{2}}{K m} [1 + (m - 1) ρ_{0}^{(1)}]}} + \sqrt{\frac{(β_{2}^{*})^{2}}{\frac{2 σ_{2}^{2}}{K m} [1 + (m - 1) ρ_{0}^{(2)}]}}}{\sqrt{2 \left( 1 + \frac{ρ_{2}^{(1, 2)} + (m - 1) ρ_{1}^{(1, 2)}}{\sqrt{(1 + (m - 1) ρ_{0}^{(1)}) (1 + (m - 1) ρ_{0}^{(2)})}} \right)}} \right]^{2}$; $π = 1 - χ^{2} [λ, 1]$	$K = \frac{2 (Z_{1 - α / 2} + Z_{β})^{2} \left( 1 + \frac{ρ_{2}^{(1, 2)} + (m - 1) ρ_{1}^{(1, 2)}}{\sqrt{(1 + (m - 1) ρ_{0}^{(1)}) (1 + (m - 1) ρ_{0}^{(2)})}} \right)}{\left[ \sqrt{\frac{(β_{1}^{*})^{2}}{\frac{2 σ_{1}^{2}}{m} [1 + (m - 1) ρ_{0}^{(1)}]}} + \sqrt{\frac{(β_{2}^{*})^{2}}{\frac{2 σ_{2}^{2}}{m} [1 + (m - 1) ρ_{0}^{(2)}]}} \right]^{2}}$
Disjunctive 2-DF Test	$H_{0} : L β^{*} = 0$; $H_{A} : L β^{*} \neq 0$	$λ = \frac{K m [(β_{1}^{*})^{2} σ_{2}^{2} {VIF}_{2} - 2 β_{1}^{*} β_{2}^{*} σ_{1} σ_{2} {VIF}_{12} + (β_{2}^{*})^{2} σ_{1}^{2} {VIF}_{1}]}{2 σ_{1}^{2} σ_{2}^{2} [{VIF}_{1} {VIF}_{2} - {VIF}_{12}^{2}]}$; F-distribution: $π = \int_{F_{1 - α} (S, 2 K - S - Q)}^{\infty} f (x; λ, S, 2 K - S - Q) d x$; $χ^{2}$-distribution: $π = 1 - \int_{0}^{5.99} χ^{2} (x; 2, λ) d x$	$K = \frac{2 (Z_{1 - α / 2} + Z_{β})^{2} σ_{1}^{2} σ_{2}^{2} [{VIF}_{1} {VIF}_{2} - {VIF}_{12}^{2}]}{m [(β_{1}^{*})^{2} σ_{2}^{2} {VIF}_{2} - 2 β_{1}^{*} β_{2}^{*} σ_{1} σ_{2} {VIF}_{12} + (β_{2}^{*})^{2} σ_{1}^{2} {VIF}_{1}]}$ (for $χ^{2}$-distribution only)
Conjunctive IU Test	$H_{0} : β_{1}^{*} = 0$ or $β_{2}^{*} = 0$; $H_{A} : β_{1}^{*} \neq 0$ and $β_{2}^{*} \neq 0$	$[ζ_{1}, ζ_{2}]^{T} = \left[ \frac{β_{1}^{*} \sqrt{2 K}}{\sqrt{\frac{4 σ_{1}^{2} {VIF}_{1}}{m}}}, \frac{β_{2}^{*} \sqrt{2 K}}{\sqrt{\frac{4 σ_{2}^{2} {VIF}_{2}}{m}}} \right]^{T}$; $π = \int_{c_{1}}^{\infty} \int_{c_{2}}^{\infty} f_{W} (w_{1}, w_{2}) d w_{1} d w_{2}$	–

Throughout the following sections, we indicate which tests use a disjunctive hypothesis and which use a conjunctive hypothesis. For a disjunctive test, the null hypothesis is $H_{0} : β_{1}^{*} = 0$ and $β_{2}^{*} = 0$, and the alternative hypothesis is $H_{A} : β_{1}^{*} \neq 0$ or $β_{2}^{*} \neq 0$. To reject the null hypothesis, the intervention must have an effect on at least one outcome but not necessarily both. For a conjunctive test, the null hypothesis is $H_{0} : β_{1}^{*} = 0$ or $β_{2}^{*} = 0$, and the alternative hypothesis is $H_{A} : β_{1}^{*} \neq 0$ and $β_{2}^{*} \neq 0$. That is, to reject the null hypothesis, the intervention must have an effect on both outcomes. This distinction is important when understanding the study design methods that follow, and especially when deciding which method best suits one's study goals.

2.2.1 p-Value adjustment methods

The most popular method for addressing multiple testing is adjusting the p-value. There are three key ways to adjust the p-value, namely the Bonferroni correction,⁷ the Sidak method,⁸ and the D/AP approach.⁹ These p-value adjustment methods are used within a disjunctive hypothesis framework, where we have $H_{0} : β_{1}^{*} = 0$ and $β_{2}^{*} = 0$ versus $H_{A} : β_{1}^{*} \neq 0$ or $β_{2}^{*} \neq 0$.¹⁰ Although the design calculations are carried out for each outcome separately, using adjusted significance levels ensures that the overall decision rule (rejecting the null if at least one outcome is significant) validly corresponds to a disjunctive hypothesis while maintaining control of the family-wise error rate. The study design calculations are marginal: K, m, or power is calculated for each of the $Q = 2$ outcomes independently of the other, and the final result is $K = \max (K^{(1)}, K^{(2)})$, $m = \max (m^{(1)}, m^{(2)})$, and $π = \min (π^{(1)}, π^{(2)})$. The resulting design parameters ($K$, m, $π$) reflect the conservative marginal power framework typically used in multi-endpoint designs, ensuring that each outcome achieves the desired power under the adjusted significance level.¹⁰ This marginal approach provides a guaranteed lower bound on power that aligns with standard practice in multiple testing scenarios.
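The marginal calculation described above can be sketched as follows. This Python example is illustrative and is not the crt2power implementation. The Bonferroni and Sidak levels are the standard ones; the D/AP level uses the commonly cited form $1 - (1 - α)^{1 / Q^{1 - ρ}}$ with $ρ$ set to the intra-subject between-endpoint ICC, which is an assumption here because this section does not state the formula. Power for each outcome uses the exact two-sided tail of a 1-DF non-central χ² written in terms of the normal distribution. All function names and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def adjusted_alpha(alpha, method, n_outcomes=2, rho2=None):
    """Per-outcome significance level under a p-value adjustment."""
    if method == "bonferroni":
        return alpha / n_outcomes
    if method == "sidak":
        return 1 - (1 - alpha) ** (1 / n_outcomes)
    if method == "dap":
        # Assumed D/AP form: the effective number of tests is Q^(1 - rho2).
        if rho2 is None:
            raise ValueError("D/AP needs the intra-subject between-endpoint ICC")
        return 1 - (1 - alpha) ** (1 / n_outcomes ** (1 - rho2))
    raise ValueError(f"unknown method: {method}")


def power_1df(lam, alpha):
    """P(noncentral chi-square with 1 DF and NCP lam exceeds its alpha-level cutoff)."""
    crit = _Z.inv_cdf(1 - alpha / 2)
    root = sqrt(lam)
    return _Z.cdf(root - crit) + _Z.cdf(-root - crit)


def noncentrality(beta, var_y, rho0, K, m):
    """lambda^(q) for one outcome with K clusters per arm of size m."""
    return beta ** 2 / (2 * var_y / (K * m) * (1 + (m - 1) * rho0))


def power_pval_adj(K, m, alpha, beta1, beta2, var_y1, var_y2,
                   rho01, rho02, method, rho2=None):
    """Marginal power: the smaller of the two per-outcome powers."""
    a = adjusted_alpha(alpha, method, n_outcomes=2, rho2=rho2)
    p1 = power_1df(noncentrality(beta1, var_y1, rho01, K, m), a)
    p2 = power_1df(noncentrality(beta2, var_y2, rho02, K, m), a)
    return min(p1, p2)


# Hypothetical design inputs, for illustration only.
design = dict(K=15, m=300, alpha=0.05, beta1=0.1, beta2=0.1,
              var_y1=0.23, var_y2=0.25, rho01=0.025, rho02=0.025)
for method in ("bonferroni", "sidak", "dap"):
    print(method, round(power_pval_adj(method=method, rho2=0.05, **design), 4))
```

Because the adjusted level increases from Bonferroni to Sidak to D/AP (for a non-negative correlation), the resulting marginal power is ordered the same way.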

2.2.2 Combined outcomes

The combined, or composite, outcomes approach combines the two outcome vectors into a single outcome.¹¹ A popular way of combining the outcomes is to sum them, which yields the combined treatment effect $β_{c}^{*}$. To eliminate the possibility that $β_{1}^{*} = - β_{2}^{*}$, it is assumed that both elements of $β^{*}$ are in the same direction, and data can be transformed to ensure that this is the case. In this approach, the hypothesis setup is $H_{0} : β_{c}^{*} = 0$ versus $H_{A} : β_{c}^{*} \neq 0$. Note that rejecting the null hypothesis in this case amounts to concluding that the treatment is efficacious on the combined outcome, which means that the treatment could be effective on just one or on both of the outcomes. Thus, this test considers a disjunctive hypothesis. Power calculations for this approach require specification of the combined treatment effect, $β_{c}^{*}$, the endpoint-specific ICC for the combined outcome, $ρ_{0}^{(c)}$, and the total variance of the combined outcome, $σ_{c}^{2}$. Expressions for these quantities as functions of other, more intuitive or readily available ones are derived in Owen et al.³ and are

$$ρ_{0}^{(c)} = \frac{ρ_{0}^{(1)} σ_{1}^{2} + ρ_{0}^{(2)} σ_{2}^{2} + 2 ρ_{1}^{(1, 2)} σ_{1} σ_{2}}{σ_{1}^{2} + σ_{2}^{2} + 2 ρ_{2}^{(1, 2)} σ_{1} σ_{2}}; \qquad σ_{c}^{2} = σ_{1}^{2} + σ_{2}^{2} + 2 ρ_{2}^{(1, 2)} σ_{1} σ_{2},$$

where $ρ_{0}^{(1)}$ and $ρ_{0}^{(2)}$ are the endpoint-specific ICCs for $Y_{1}$ and $Y_{2}$, respectively, $σ_{1}^{2}$ and $σ_{2}^{2}$ are the total variances of $Y_{1}$ and $Y_{2}$, respectively, $ρ_{1}^{(1, 2)}$ is the inter-subject between-endpoint ICC, and $ρ_{2}^{(1, 2)}$ is the intra-subject between-endpoint ICC.
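The two expressions above reduce the combined outcomes approach to a single-outcome calculation. The Python sketch below illustrates this; it is not the crt2power implementation, the function names and inputs are hypothetical, and it assumes the outcomes are combined by summation so that the combined effect is the sum of the two treatment effects.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def combined_outcome_parameters(beta1, beta2, var_y1, var_y2,
                                rho01, rho02, rho1, rho2):
    """Return (beta_c, sigma_c^2, rho_0^(c)) for the summed outcome Y1 + Y2."""
    sd1, sd2 = sqrt(var_y1), sqrt(var_y2)
    var_c = var_y1 + var_y2 + 2 * rho2 * sd1 * sd2
    rho0_c = (rho01 * var_y1 + rho02 * var_y2 + 2 * rho1 * sd1 * sd2) / var_c
    beta_c = beta1 + beta2  # assumption: outcomes are combined by summing
    return beta_c, var_c, rho0_c


def power_combined(K, m, alpha, beta1, beta2, var_y1, var_y2,
                   rho01, rho02, rho1, rho2):
    """Power of the 1-DF test on the combined outcome (chi-square version)."""
    beta_c, var_c, rho0_c = combined_outcome_parameters(
        beta1, beta2, var_y1, var_y2, rho01, rho02, rho1, rho2)
    lam = beta_c ** 2 / (2 * var_c / (K * m) * (1 + (m - 1) * rho0_c))
    crit = _Z.inv_cdf(1 - alpha / 2)
    root = sqrt(lam)
    # Exact two-sided tail of a 1-DF noncentral chi-square via the normal CDF.
    return _Z.cdf(root - crit) + _Z.cdf(-root - crit)


# Hypothetical inputs, for illustration only.
beta_c, var_c, rho0_c = combined_outcome_parameters(
    beta1=0.1, beta2=0.1, var_y1=1.0, var_y2=1.0,
    rho01=0.1, rho02=0.1, rho1=0.05, rho2=0.5)
print(beta_c, var_c, rho0_c)
```

No multiplicity adjustment appears in this sketch because only one test is carried out on the combined outcome.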

2.2.3 Single weighted 1-DF combined test

In this approach, two separate test statistics are weighted to create a single test statistic. Originally proposed by Pocock et al. (1987) and O’Brien et al. (1984) for the individual randomized controlled trial,¹²,¹³ this method was extended in Owen et al. to accommodate clustering.³ Here, we are testing $H_{0} : β_{1}^{*} + β_{2}^{*} = 0$ versus $H_{A} : β_{1}^{*} + β_{2}^{*} \neq 0$, similar to the combined outcomes approach. Thus, this is also a disjunctive test. For conciseness, this test is also referred to as the “single weighted 1-DF test” or “single 1-DF test.”
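The non-centrality parameter for this test in Table 2 can be read as the squared sum of the two marginal standardized effects divided by the variance of that sum. The Python sketch below makes the structure explicit; it is an illustrative transcription of the Table 2 formula, not the crt2power code, and the function name and inputs are hypothetical.

```python
from math import sqrt


def ncp_single_1df(K, m, beta1, beta2, var_y1, var_y2,
                   rho01, rho02, rho1, rho2):
    """Non-centrality parameter of the single weighted 1-DF combined test."""
    vif1 = 1 + (m - 1) * rho01
    vif2 = 1 + (m - 1) * rho02
    vif12 = rho2 + (m - 1) * rho1
    # Square roots of the marginal non-centrality parameters.
    root1 = abs(beta1) * sqrt(K * m / (2 * var_y1 * vif1))
    root2 = abs(beta2) * sqrt(K * m / (2 * var_y2 * vif2))
    # Ratio VIF_12 / sqrt(VIF_1 * VIF_2) from the denominator in Table 2.
    corr = vif12 / sqrt(vif1 * vif2)
    return (root1 + root2) ** 2 / (2 * (1 + corr))


# Hypothetical inputs, for illustration only.
lam = ncp_single_1df(K=15, m=30, beta1=0.2, beta2=0.2,
                     var_y1=1.0, var_y2=1.0,
                     rho01=0.05, rho02=0.05, rho1=0.02, rho2=0.3)
print(lam)
```

Two limiting cases help with interpretation. For two identical outcomes with no between-endpoint correlation, the parameter is twice the marginal one; when the ratio equals one, the second outcome adds no information and the parameter equals the marginal one.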

2.2.4 Disjunctive 2-DF test

In this 2-DF test, we simultaneously test both outcomes for any departure from the null hypothesis. This test utilizes a linear hypothesis, and is written as $H_{0} : L β^{*} = 0$ versus $H_{A} : L β^{*} \neq 0$. For the hybrid type 2 scenario, $L$ is a $2 \times 2$ contrast matrix whose rows represent linearly independent hypotheses concerning the treatment effect parameter, $β^{*}$.⁵ When $L = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, as would usually be the case, $H_{0} : L β^{*} = 0$ is equivalent to $H_{0} : β_{1}^{*} = 0$ and $β_{2}^{*} = 0$. This test is also disjunctive; to reject the null hypothesis, the treatment needs to be effective on at least one outcome. The F-distribution was proposed for the distribution of the test statistic, but one may also utilize the χ²-distribution in larger sample size settings.
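For the χ² version of this test, power is the upper tail of a non-central χ² distribution with 2 DF beyond the central critical value (5.99 at a 5% level, as in Table 2). The Python sketch below is illustrative and is not the crt2power implementation; function names and inputs are hypothetical. It evaluates the tail with the standard Poisson-mixture representation of the non-central χ², using the closed-form survival function that central χ² distributions have for even degrees of freedom.

```python
from math import exp, log, sqrt


def ncp_disjunctive_2df(K, m, beta1, beta2, var_y1, var_y2,
                        rho01, rho02, rho1, rho2):
    """Non-centrality parameter of the disjunctive 2-DF test (Table 2)."""
    vif1 = 1 + (m - 1) * rho01
    vif2 = 1 + (m - 1) * rho02
    vif12 = rho2 + (m - 1) * rho1
    sd1, sd2 = sqrt(var_y1), sqrt(var_y2)
    numer = K * m * (beta1 ** 2 * var_y2 * vif2
                     - 2 * beta1 * beta2 * sd1 * sd2 * vif12
                     + beta2 ** 2 * var_y1 * vif1)
    denom = 2 * var_y1 * var_y2 * (vif1 * vif2 - vif12 ** 2)
    return numer / denom


def ncchi2_sf_2df(crit, lam):
    """P(X > crit) for X ~ noncentral chi-square with 2 DF and NCP lam."""
    half_c, half_l = crit / 2, lam / 2
    n_terms = int(half_l + 10 * sqrt(half_l) + 50)
    pois = exp(-half_l)    # Poisson(lam / 2) weight for j = 0
    term = exp(-half_c)    # e^{-c/2} (c/2)^i / i! for i = 0
    central_sf = term      # survival function of a central chi-square, 2 DF
    total = pois * central_sf
    for j in range(1, n_terms):
        pois *= half_l / j
        term *= half_c / j
        central_sf += term  # survival function for 2 + 2j DF
        total += pois * central_sf
    return total


def power_disjunctive_2df_chi2(alpha, lam):
    """Power of the chi-square version of the disjunctive 2-DF test."""
    crit = -2 * log(alpha)  # central chi-square quantile for 2 DF
    return ncchi2_sf_2df(crit, lam)


# Hypothetical inputs, for illustration only.
lam = ncp_disjunctive_2df(K=15, m=30, beta1=0.3, beta2=0.1,
                          var_y1=1.0, var_y2=1.0,
                          rho01=0.05, rho02=0.05, rho1=0.02, rho2=0.3)
print(power_disjunctive_2df_chi2(0.05, lam))
```

The sketch covers only the χ² case. The F-distribution version additionally needs the non-central F distribution, which is not available in the Python standard library.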

2.2.5 Conjunctive intersection–union test

The conjunctive test, or intersection–union (IU) test, requires that the treatment be effective on both outcomes in order to reject the null hypothesis. Thus, the hypothesis setup is written as $H_{0} : β_{1}^{*} = 0$ or $β_{2}^{*} = 0$ versus $H_{A} : β_{1}^{*} \neq 0$ and $β_{2}^{*} \neq 0$. It was proposed that the t-distribution be used as the distribution of the test statistic, but one may also utilize the multivariate normal (MVN) distribution in larger sample size settings.⁵

When referring to the various power, number of clusters, and cluster size parameters for each method, we use acronyms in the superscripts of these variables. Method 1: p-value adjustment methods uses the acronym “PADJ,” Method 2: combined outcomes approach uses the acronym “COMB,” Method 3: single weighted 1-DF test uses the acronym “W1DF,” Method 4: disjunctive 2-DF test uses the acronym “DIS2DF,” and Method 5: conjunctive IU test uses the acronym “CONJ.” So, for example, $π^{DIS2DF}$ refers to the statistical power of a study using the disjunctive 2-DF test.

2.2.6 Motivation and interpretation of methods

Different study goals and questions of interest can naturally lead to different hypothesis structures, and the choice of study design method depends on how evidence across outcomes will be interpreted in decision-making. When improvement in either outcome would support moving an intervention forward, a disjunctive formulation is appropriate. p-Value adjustment methods are a popular approach in practice and are often straightforward to conduct and communicate; the combined outcomes approach and single 1-DF approach summarize both outcomes into one signal, which can be useful when outcomes are measured on similar scales or when simplicity (i.e. a single test) is needed for planning. In contrast, when success with both outcomes is necessary to consider an intervention successful, a conjunctive hypothesis is appropriate, as targeted by the intersection–union test. No single method is globally preferable across different contexts. Instead, the appropriate approach depends on the scientific aims, resource constraints, measurement considerations, and the role each outcome plays in subsequent decisions. The results that follow therefore focus on illustrating how different design parameters affect the achievable power for each approach, clarifying the relative performance of different tests, and enabling researchers to select the method that best aligns with their goals in light of the power properties. To our knowledge, these five design methods are the ones currently available for this setting; they were identified through an extensive literature review in Owen et al.³

3. Software description

3.1 Description of the R package and ShinyApp

The crt2power package is an R package that allows users to calculate the statistical power or sample size requirements for a cluster-randomized trial with two continuous co-primary outcomes given a set of input parameters.⁴ More precisely, there are three classes of functions: (a) functions that calculate the statistical power given cluster size, number of clusters, and other necessary input parameters; (b) functions that calculate the required number of clusters given the desired statistical power, cluster size, and other necessary input parameters; and (c) functions that calculate the required cluster size given the desired statistical power, number of clusters, and other necessary input parameters. Each of these three classes contains five functions, which correspond to the five study design methods that are currently available for a clustered hybrid type 2 study design. Each function name specifies which parameter the function calculates, followed by which study design method is being used. In addition, there is a function that allows the user to calculate the specified design parameter (either “power,” “K,” or “m”) for all of the study design methods at once, outputting a table. This software is intended to support researchers during the design phase in arriving at a design that meets study goals, which include, but are not limited to, statistical power. As always, the choice of method should be finalized prior to data collection, rather than selected post hoc based on observed power or results. Table 3 summarizes the list of functions available in the crt2power package.

Table 3.
List of crt2power functions⁴.

Design Method	List of Functions
1. p-Value Adjustment Methods	a. calc\_pwr\_pval\_adj(…); b. calc\_K\_pval\_adj(…); c. calc\_m\_pval\_adj(…)
2. Combined Outcomes Approach	a. calc\_pwr\_comb\_outcome(…); b. calc\_K\_comb\_outcome(…); c. calc\_m\_comb\_outcome(…)
3. Single Weighted 1-DF Combined Test	a. calc\_pwr\_single\_1dftest(…); b. calc\_K\_single\_1dftest(…); c. calc\_m\_single\_1dftest(…)
4. Disjunctive 2-DF Test	a. calc\_pwr\_disj\_2dftest(…); b. calc\_K\_disj\_2dftest(…); c. calc\_m\_disj\_2dftest(…)
5. Conjunctive Intersection–Union Test	a. calc\_pwr\_conj\_test(…); b. calc\_K\_conj\_test(…); c. calc\_m\_conj\_test(…)
All five methods	a. run\_crt2\_design(output = "power", …); b. run\_crt2\_design(output = "K", …); c. run\_crt2\_design(output = "m", …)

3.2 Specification of design parameters

In order to calculate either K, the number of clusters in the experimental group, m, the number of individuals in each cluster, or power ( $π$ ), the probability of detecting a true effect under the alternative hypothesis, the user must specify various study design parameters so that their desired study is adequately powered. Table 1 describes each of the input parameters that are required for these calculations, along with each parameter's statistical notation and variable name in the package. Depending on the study design method that is used, not all of these input parameters are utilized. For example, the p-value adjustment methods do not require the intra- and inter-subject between-endpoint ICCs, with the exception of the D/AP p-value adjustment method utilizing the intra-subject between-endpoint ICC.
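To make the roles of K, m, and power concrete, the following Python sketch works through the simplest case: a single outcome tested with a 1-DF χ² test, using the marginal non-centrality parameter and variance inflation factor that appear in Section 4. It is an illustrative stand-in and not the crt2power implementation; the function names and the example inputs are assumptions chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()

def marginal_ncp(beta, var, K, m, rho0):
    """Non-centrality parameter for one outcome: beta^2 / ((2 / (K m)) * var * VIF)."""
    vif = 1 + (m - 1) * rho0  # variance inflation factor for this outcome
    return beta ** 2 / ((2 / (K * m)) * var * vif)

def power_1df(lam, alpha=0.05):
    """P(non-central chi-square(1, lam) > critical value), via the normal CDF."""
    root_c = _N.inv_cdf(1 - alpha / 2)  # square root of the chi-square(1) critical value
    root_lam = sqrt(lam)
    return _N.cdf(root_lam - root_c) + _N.cdf(-root_lam - root_c)

def clusters_needed(target, beta, var, m, rho0, alpha=0.05, k_max=1000):
    """Smallest K per arm that reaches the target power (direct search)."""
    for K in range(1, k_max + 1):
        if power_1df(marginal_ncp(beta, var, K, m, rho0), alpha) >= target:
            return K
    return None

def cluster_size_needed(target, beta, var, K, rho0, alpha=0.05, m_max=10000):
    """Smallest m that reaches the target power; None if no m <= m_max suffices."""
    for m in range(1, m_max + 1):
        if power_1df(marginal_ncp(beta, var, K, m, rho0), alpha) >= target:
            return m
    return None

# Hypothetical inputs: effect 0.4, total variance 1, endpoint-specific ICC 0.1
pwr = power_1df(marginal_ncp(beta=0.4, var=1.0, K=10, m=50, rho0=0.1))
K_req = clusters_needed(0.80, beta=0.4, var=1.0, m=50, rho0=0.1)
m_req = cluster_size_needed(0.80, beta=0.4, var=1.0, K=12, rho0=0.1)
print(f"power = {pwr:.3f}; K for 80% power = {K_req}; m for 80% power with K = 12: {m_req}")
```

The cluster-size search can return `None` because, for a fixed K, the non-centrality parameter approaches a finite limit as m grows, so adding individuals to each cluster cannot always rescue a design with too few clusters.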

3.3 ShinyApp and usage examples

Figure 1 shows R code using the crt2power package for four different function calls one might make, and Figure 2 shows the homepage and a usage example for the crt2powerApplication ShinyApp. The ShinyApp has four main tabs: an overview tab, a tab for calculating statistical power, a tab for calculating the number of clusters ( $K$ ), and a tab for calculating the number of individuals per cluster ( $m$ ). For each of the calculation tabs, the user may choose to display the results in either a plot or a table. The ShinyApp can be accessed via https://mowen17.shinyapps.io/crt2powerApplication/. Validation checks are conducted in both the ShinyApp and the R package; these include, but are not limited to, ensuring that all inputs are numeric, that the resulting covariance matrix is positive definite, and that variance and sample size values are non-negative. The authors plan to monitor and address any future bugs promptly to ensure that the tools remain stable and informative for the user.

Figure 1.

Usage examples for the crt2power R package.

Figure 2.

Overview page of the crt2powerApplication ShinyApp and usage examples.

4. Theoretical relationships between the methods

It is of interest to determine under which conditions, if any, some of these methods are identical to others. This would simplify the choice of which method to use in the design phase of a hybrid type 2 CRT. In addition, it is of interest to determine whether any method is globally more powerful than any other method, or if not, under which circumstances this would be the case. Thus, we conducted theoretical comparisons of the statistical power whenever possible, and when the mathematics was not tractable, we conducted a numerical analysis. The theoretical analyses, which we conducted first, helped guide the design of the numerical analysis, which investigates how the design characteristics behave as the input parameters vary.

In order to examine power in the theoretical analyses, we begin by comparing the equations for the non-centrality parameter for the design methods—namely, $λ$ , and then assessing the underlying distributions to discern the exact relationship between the methods. We examine the non-centrality parameter because the statistical power is a function of $λ$ ; understanding the relationship between the non-centrality parameters of two methods is necessary for understanding the relationship between their statistical power.

4.1 Relationship between Methods 2, 3, and 4

Among the five study design methods, the methods that are the most similar in form are the ones that utilize a single test statistic that combines the two outcomes in some way, namely the combined outcomes approach, single weighted 1-DF test, and disjunctive 2-DF test. The p-value adjustment methods and the conjunctive test consider two test statistics—one for each outcome, and are thus different in form. For this reason, we began by first comparing the aforementioned methods to understand if any among them are equivalent under certain scenarios, or if one is globally more powerful. To compare the methods, we use the χ²-distribution.

4.1.1 Method 2: combined outcomes vs. Method 3: single weighted 1-DF combined test and Method 4: disjunctive 2-DF test

In Owen et al., it was found that if $ρ_{0}^{(1)} = ρ_{0}^{(2)}$ and $σ_{1}^{2} = σ_{2}^{2}$ , then $λ^{COMB} = λ^{W 1 DF}$ , resulting in $π^{COMB} = π^{W 1 DF}$ .³ We further examined the relationship between $λ^{COMB}$ and $λ^{W 1 DF}$ , and could not identify any other theoretical relationship or constraint on the input parameters that resulted in $λ^{COMB} > λ^{W 1 DF}$ , $λ^{COMB} < λ^{W 1 DF}$ , or $λ^{COMB} = λ^{W 1 DF}$ . Similarly, we also compared the combined outcomes test to the disjunctive 2-DF test. Collapsing the contrast matrix into a single row (e.g. $L = [\begin{array}{cc} 1 & 1 \end{array}]$ ) effectively tests whether a weighted combination of the two treatment effects differs from zero, yielding a 1-DF test that is algebraically equivalent to the combined outcomes approach ( $λ^{COMB} = λ^{DIS 2 DF, D F = 1}$ ). Because the motivation of the 2-DF test is to jointly assess departures across both outcomes, this 1-DF reduction is primarily of theoretical interest and is detailed further in Appendix A.5. When the disjunctive 2-DF test is not reduced to a 1-DF test, no theoretical relationship or constraint on the input parameters was found that resulted in $λ^{COMB} > λ^{DIS 2 DF}$ , $λ^{COMB} < λ^{DIS 2 DF}$ , or $λ^{COMB} = λ^{DIS 2 DF}$ . These findings motivated a numerical evaluation to further investigate whether there are study design scenarios in which one of the methods is more powerful than the others.

4.1.2 Method 3: single weighted 1-DF combined test vs. Method 4: disjunctive 2-DF test

For the single weighted 1-DF combined test and the disjunctive 2-DF test, we compared the equations for the non-centrality parameter, $λ^{W 1 DF}$ and $λ^{DIS 2 DF}$ , respectively. We set $λ^{W 1 DF} = λ^{DIS 2 DF}$ to identify a condition under which this equality holds, and in Appendix A.1 we show that $λ^{W 1 DF} = λ^{DIS 2 DF}$ when $σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} = σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}}$ , where ${VIF}_{1} = 1 + (m - 1) ρ_{0}^{(1)}$ and ${VIF}_{2} = 1 + (m - 1) ρ_{0}^{(2)}$ . We rule out the cases when $σ_{1}$ , $σ_{2}$ , $β_{1}^{*}$ , or $β_{2}^{*}$ equal 0 as they would not be plausible assumptions.

Even when $λ^{W 1 DF} = λ^{DIS 2 DF}$ , it follows that $π^{W 1 DF} \neq π^{DIS 2 DF}$ under the χ²-distribution. This is because the single weighted 1-DF test uses 1-DF and the disjunctive 2-DF test uses 2-DF, and the degrees-of-freedom determine the critical values ( $c^{W 1 DF}$ and $c^{DIS 2 DF}$ ), which are the lower bounds of the integrals used for calculating power. For example, for an overall false-positive rate of $α = 0.05$ , the critical value for the single weighted 1-DF test is calculated using the central χ²-distribution with 1-DF: $χ_{1 - α}^{2} (1) = c^{W 1 DF} = 3.84$ . For the disjunctive 2-DF test, the critical value uses the central χ²-distribution with 2-DF: $χ_{1 - α}^{2} (2) = c^{DIS 2 DF} = 5.99$ . In fact, for all $α \in (0, 1)$ , $χ_{1 - α}^{2} (1) < χ_{1 - α}^{2} (2)$ , and so $c^{W 1 DF} < c^{DIS 2 DF}$ . This ordering of the critical values will help us further understand the relationship of statistical power when $λ^{W 1 DF} = λ^{DIS 2 DF}$ .

Note that the equations for power can also be written in terms of their cumulative distribution functions (CDF) under the alternative. The CDF of the non-central χ²-distribution makes use of the “Marcum Q-function,” denoted $Q_{d / 2} (\sqrt{λ}, \sqrt{x})$ ; for random variables $X \sim χ^{2} (d, λ)$ (i.e. non-central χ²-distribution with d degrees-of-freedom and non-centrality parameter $λ$ ), the CDF is $F (x; d, λ) = 1 - Q_{d / 2} (\sqrt{λ}, \sqrt{x})$ by definition. Using this fact, the statistical power for the single weighted 1-DF test and the disjunctive 2-DF test can be written as

\begin{aligned} π^{W 1 DF} & = \int_{c^{W 1 DF}}^{\infty} χ^{2} (x; 1, λ^{W 1 DF}) d x = Q_{1 / 2} (\sqrt{λ^{W 1 DF}}, \sqrt{c^{W 1 DF}}) \end{aligned}

\begin{aligned} π^{DIS 2 DF} & = \int_{c^{DIS 2 DF}}^{\infty} χ^{2} (x; 2, λ^{DIS 2 DF}) d x = Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}}) . \end{aligned}

The Marcum Q-Function, $Q_{d / 2} (\sqrt{λ}, \sqrt{c})$ , is strictly increasing in $d / 2$ and $\sqrt{λ}$ for all $\sqrt{λ} \geq 0$ and $\sqrt{c}$ , $d / 2 > 0$ . It is strictly decreasing in $\sqrt{c}$ for all $\sqrt{λ}$ , $\sqrt{c} \geq 0$ and $d / 2 > 0$ . In other words, $Q_{d / 2} (\sqrt{λ}, \sqrt{c})$ increases as $d / 2$ increases, and $Q_{d / 2} (\sqrt{λ}, \sqrt{c})$ decreases as $\sqrt{c}$ increases. Because moving from 1-DF to 2-DF raises both $d / 2$ and the critical value, these effects compete, and it is not clear how $Q_{1 / 2} (\sqrt{λ^{W 1 DF}}, \sqrt{c^{W 1 DF}})$ compares to $Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}})$ when the non-centrality parameters are the same. Furthermore, although the function $Q_{1 / 2}$ can be reduced nicely using the complementary error functions (see Appendix A.1), the integrals in $Q_{1}$ cannot be reduced in the same way. This is because $Q_{1}$ includes $I_{0}$ , the modified Bessel function of the first kind, which complicates the integral and prevents it from being written in terms of the complementary error function, and so we cannot directly compare the equations.

To better understand the case when $λ^{W 1 DF} = λ^{DIS 2 DF}$ , R was used to visualize the difference between $π^{W 1 DF} = Q_{1 / 2} (\sqrt{λ^{W 1 DF}}, \sqrt{c^{W 1 DF}})$ and $π^{DIS 2 DF} = Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}})$ . It was found that when the non-centrality parameters are the same, the single weighted 1-DF test approach always yields more power than the disjunctive 2-DF test approach, for every value of $α$ examined. This is shown in Appendix A.1. Then, taking a closer look at this scenario theoretically when $λ^{W 1 DF} = λ^{DIS 2 DF}$ , that is, when $σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} = σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}}$ , we can rewrite the expression as follows:

σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} = σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}} \Rightarrow \frac{β_{1}^{*}}{σ_{1} \sqrt{{VIF}_{1}}} = \frac{β_{2}^{*}}{σ_{2} \sqrt{{VIF}_{2}}} .

In other words, when the cluster-corrected standardized effect sizes, $β_{1}^{*} / (σ_{1} \sqrt{{VIF}_{1}})$ and $β_{2}^{*} / (σ_{2} \sqrt{{VIF}_{2}})$ , are the same, then the non-centrality parameters of these two methods will be the same. So, we can conclude that when the standardized effect sizes that also account for clustering through the variance inflation factor are the same, then the single weighted 1-DF test will be more powerful than the disjunctive 2-DF test. In a real study, especially in a hybrid 2 study where the outcomes are different and on different scales, this equality is unlikely to hold exactly.
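The comparison at equal non-centrality parameters can be reproduced with a short calculation. The Python sketch below is an illustrative stand-in for the R visualization, not the authors' code: it evaluates the 1-DF power through the normal CDF and the 2-DF power through the Poisson-mixture form of the non-central χ² survival function, over the grid of $α$ and $λ$ values quoted in this section. The function names and the grid spacing are assumptions.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_N = NormalDist()

def crit_1df(alpha):
    # Central chi-square(1) critical value: squared two-sided normal quantile
    return _N.inv_cdf(1 - alpha / 2) ** 2

def crit_2df(alpha):
    # Central chi-square(2) survival is exp(-x / 2), so c = -2 ln(alpha)
    return -2.0 * log(alpha)

def power_1df(lam, alpha):
    # P(chi2(1, lam) > c) = P(|Z + sqrt(lam)| > sqrt(c))
    r, b = sqrt(lam), sqrt(crit_1df(alpha))
    return _N.cdf(r - b) + _N.cdf(-r - b)

def power_2df(lam, alpha, terms=200):
    # Poisson(lam / 2) mixture of central chi2(2 + 2j) survival functions
    h = crit_2df(alpha) / 2.0
    weight = exp(-lam / 2.0)  # Poisson weight for j = 0
    term = 1.0                # h**i / i! for i = 0
    partial = 1.0             # sum of h**i / i! for i = 0..j
    total = 0.0
    for j in range(terms):
        total += weight * exp(-h) * partial
        weight *= (lam / 2.0) / (j + 1)
        term *= h / (j + 1)
        partial += term
    return total

alphas = (0.01, 0.025, 0.05, 0.1)
lams = [0.5 * i for i in range(1, 61)]  # 0.5, 1.0, ..., 30.0
min_gap = min(power_1df(l, a) - power_2df(l, a) for a in alphas for l in lams)
print(f"c(1 DF) = {crit_1df(0.05):.2f}, c(2 DF) = {crit_2df(0.05):.2f}")
print(f"smallest 1-DF minus 2-DF power gap on the grid: {min_gap:.5f}")
```

On this grid the gap stays positive, which matches the statement that the 1-DF test is more powerful whenever the two non-centrality parameters coincide. At $λ = 0$ both tests have power equal to $α$, so the grid starts just above zero.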

We have established that when $σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} = σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}}$ , and thus $λ^{W 1 DF} = λ^{DIS 2 DF}$ , it follows that $π^{W 1 DF} > π^{DIS 2 DF}$ across all $α \in \{0.01, 0.025, 0.05, 0.1\}$ and $λ^{W 1 DF} = λ^{DIS 2 DF} \in [0, 30]$ . However, when $σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} \neq σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}}$ , then ${(σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} - σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}})}^{2}$ will always be greater than 0. In this case, $0 < {(σ_{2} β_{1}^{*} \sqrt{{VIF}_{2}} - σ_{1} β_{2}^{*} \sqrt{{VIF}_{1}})}^{2}$ implies $λ^{W 1 DF} < λ^{DIS 2 DF}$ . We again face the issue that the relationship between $Q_{1 / 2} (\sqrt{λ^{W 1 DF}}, \sqrt{c^{W 1 DF}})$ and $Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}})$ is unclear due to the competing effects that the inputs have on the function. The threshold at which $Q_{1 / 2} (\sqrt{λ^{W 1 DF}}, \sqrt{c^{W 1 DF}}) > Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}})$ or vice versa now depends on many variables, since we are under the constraint of $λ^{W 1 DF} \neq λ^{DIS 2 DF}$ . Since the relationship between the non-centrality parameters depends on $σ_{1}$ , $σ_{2}$ , $β_{1}^{*}$ , $β_{2}^{*}$ , $V I F_{1}$ , and $V I F_{2}$ , we must examine this case in the numerical evaluation.
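The ordering $λ^{W 1 DF} \leq λ^{DIS 2 DF}$ can be checked numerically. The Python sketch below codes the two non-centrality parameters as they are written in Section 4.2 and compares their difference with a closed form obtained by subtracting one expression from the other. That closed form is algebra on the stated formulas, not a result quoted from the article. The definition ${VIF}_{12} = ρ_{2}^{(1, 2)} + (m - 1) ρ_{1}^{(1, 2)}$ is an assumption, since this section does not define it, and the function names and input values are illustrative.

```python
from math import sqrt

def lambda_w1df(b1, b2, s1, s2, vif1, vif2, vif12, K, m):
    """Single weighted 1-DF test: [(sqrt(lam1) + sqrt(lam2)) / sqrt(2 (1 + r))]^2."""
    lam1 = b1 ** 2 / ((2 * s1 ** 2 / (K * m)) * vif1)
    lam2 = b2 ** 2 / ((2 * s2 ** 2 / (K * m)) * vif2)
    r = vif12 / sqrt(vif1 * vif2)
    return (sqrt(lam1) + sqrt(lam2)) ** 2 / (2 * (1 + r))

def lambda_dis2df(b1, b2, s1, s2, vif1, vif2, vif12, K, m):
    """Disjunctive 2-DF test non-centrality parameter."""
    num = K * m * (b1 ** 2 * s2 ** 2 * vif2
                   - 2 * b1 * b2 * s1 * s2 * vif12
                   + b2 ** 2 * s1 ** 2 * vif1)
    den = 2 * s1 ** 2 * s2 ** 2 * (vif1 * vif2 - vif12 ** 2)
    return num / den

def gap_closed_form(b1, b2, s1, s2, vif1, vif2, vif12, K, m):
    """Derived difference: (K m / 2) (a - b)^2 / (2 (1 - r)), which is never negative."""
    a = b1 / (s1 * sqrt(vif1))  # cluster-corrected standardized effect, outcome 1
    b = b2 / (s2 * sqrt(vif2))  # cluster-corrected standardized effect, outcome 2
    r = vif12 / sqrt(vif1 * vif2)
    return (K * m / 2) * (a - b) ** 2 / (2 * (1 - r))

# Illustrative inputs: K = 10, m = 50, variances (1, 1), ICCs (0.1, 0.1)
K, m = 10, 50
vif1 = vif2 = 1 + (m - 1) * 0.1
vif12 = 0.5 + (m - 1) * 0.05  # ASSUMED definition of VIF_12 (rho2 = 0.5, rho1 = 0.05)

unequal = (0.2, 0.4, 1.0, 1.0, vif1, vif2, vif12, K, m)
equal = (0.4, 0.4, 1.0, 1.0, vif1, vif2, vif12, K, m)

lw, ld = lambda_w1df(*unequal), lambda_dis2df(*unequal)
print(f"unequal effects: W1DF = {lw:.4f}, DIS2DF = {ld:.4f}, gap = {ld - lw:.4f}")
print(f"closed-form gap: {gap_closed_form(*unequal):.4f}")
print(f"equal effects:   W1DF = {lambda_w1df(*equal):.4f}, DIS2DF = {lambda_dis2df(*equal):.4f}")
```

Because the gap is proportional to the squared difference of the cluster-corrected standardized effect sizes, it vanishes exactly when those effect sizes are equal and is positive otherwise.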

4.2 Examination of Method 1: p-value adjustments

Based on the illustrative example shown in Owen et al., it was hypothesized that the p-value adjustment methods were less powerful than the combined outcomes test, the single weighted 1-DF test, and the disjunctive 2-DF test.³ We compared their equations for the non-centrality parameter and statistical power to formally establish whether this is the case.

For all of the p-value adjustment methods, the statistical power is $π^{PADJ} = \min (π^{(1)}, π^{(2)})$ . Since a smaller value of $λ$ corresponds to a smaller statistical power, this is the same as writing $π^{PADJ} = 1 - χ^{2} [λ^{PADJ} = \min (λ^{(1)}, λ^{(2)}), 1]$ . Recall that here, $π^{(q)}$ and $λ^{(q)}$ refer to the statistical power and non-centrality parameter calculated based on the $q^{\text{th}}$ outcome for the marginal tests. Thus, when comparing the p-value adjustment methods to Methods 2, 3, and 4, we have two cases: (1) $λ^{(1)} > λ^{(2)}$ ; and (2) $λ^{(1)} < λ^{(2)}$ . Under the first case, $λ^{PADJ} = λ^{(2)}$ , and under the second case, $λ^{PADJ} = λ^{(1)}$ . If they are equal, then power can be calculated using parameters from either outcome. The proofs for each case are logically equivalent, so we show the proof for one case. We compare the methods by comparing their equations for the non-centrality parameters. We know that $ρ_{2}^{(1, 2)} < 1$ , and assume that $ρ_{1}^{(1, 2)} < ρ_{0}^{(1)}$ , $ρ_{1}^{(1, 2)} < ρ_{0}^{(2)}$ , implying that ${VIF}_{12} < {VIF}_{1}$ and ${VIF}_{12} < {VIF}_{2}$ . Lastly, we also assume that the treatment effects are non-negative. If this is not the case, they can be transformed in order to meet this assumption.

4.2.1 Method 1: p-value adjustments vs. Method 2: combined outcomes

We hypothesize that the non-centrality parameter for the p-value adjustment is smaller than that of the combined outcomes approach. That is,

\min \left( \frac{{(β_{1}^{*})}^{2}}{\frac{2}{K m} σ_{1}^{2} {VIF}_{1}}, \frac{{(β_{2}^{*})}^{2}}{\frac{2}{K m} σ_{2}^{2} {VIF}_{2}} \right) < \frac{{(β_{1}^{*} + β_{2}^{*})}^{2}}{\frac{2}{K m} [σ_{1}^{2} {VIF}_{1} + σ_{2}^{2} {VIF}_{2} + 2 σ_{1} σ_{2} {VIF}_{12}]} .

Consider the case in which $(β_{1}^{*})^{2} / ((2 / K m) σ_{1}^{2} {VIF}_{1}) < (β_{2}^{*})^{2} / ((2 / K m) σ_{2}^{2} {VIF}_{2})$ , so that $λ^{PADJ} = (β_{1}^{*})^{2} / ((2 / K m) σ_{1}^{2} {VIF}_{1})$ . This supposition reduces to $β_{1}^{*} σ_{2} \sqrt{{VIF}_{2}} < β_{2}^{*} σ_{1} \sqrt{{VIF}_{1}}$ . Because ${VIF}_{12} < {VIF}_{1}$ and ${VIF}_{12} < {VIF}_{2}$ together give ${VIF}_{12} < \sqrt{{VIF}_{1} {VIF}_{2}}$ , it follows that $β_{1}^{*} σ_{2} {VIF}_{12} < β_{2}^{*} σ_{1} {VIF}_{1}$ . This inequality and the supposition together imply that $λ^{PADJ} < λ^{COMB}$ , which is shown in Appendix A.2. Also note that for all p-value adjustment methods, $α^{PADJ} < α^{COMB}$ , and recall that both of these methods use the χ²-distribution with 1-DF. A smaller family-wise false-positive rate, $α$ , corresponds to smaller statistical power, as does a smaller $λ$ value. Thus, since $λ^{PADJ} < λ^{COMB}$ and $α^{PADJ} < α^{COMB}$ , it follows that $π^{PADJ} < π^{COMB}$ , meaning that the p-value adjustment methods will always be less powerful than the combined outcomes approach. For details of the full proof, see Appendix A.2.

4.2.2 Method 1: p-value adjustments vs. Method 3: single weighted 1-DF test

We examine the equations of the non-centrality parameters for the p-value adjustment method and single weighted 1-DF test. We aim to show:

\min \left( \frac{{(β_{1}^{*})}^{2}}{\frac{2}{K m} σ_{1}^{2} V I F_{1}}, \frac{{(β_{2}^{*})}^{2}}{\frac{2}{K m} σ_{2}^{2} V I F_{2}} \right) < {\left[ \frac{\sqrt{\frac{{(β_{1}^{*})}^{2}}{\frac{2 σ_{1}^{2}}{K m} V I F_{1}}} + \sqrt{\frac{{(β_{2}^{*})}^{2}}{\frac{2 σ_{2}^{2}}{K m} V I F_{2}}}}{\sqrt{2 (1 + \frac{V I F_{12}}{\sqrt{V I F_{1} V I F_{2}}})}} \right]}^{2} .

Consider the case in which $\frac{{(β_{1}^{*})}^{2}}{\frac{2}{K m} σ_{1}^{2} V I F_{1}} < \frac{{(β_{2}^{*})}^{2}}{\frac{2}{K m} σ_{2}^{2} V I F_{2}}$ , so that $λ^{PADJ} = \frac{{(β_{1}^{*})}^{2}}{\frac{2}{K m} σ_{1}^{2} V I F_{1}}$ . Then, reducing the inequality, we are left with the expression $\sqrt{2} \sqrt{1 + \frac{V I F_{12}}{\sqrt{V I F_{1} V I F_{2}}}} - 1 < \frac{β_{2}^{*} σ_{1} \sqrt{V I F_{1}}}{β_{1}^{*} σ_{2} \sqrt{V I F_{2}}}$ . From our supposition, we know that $β_{1}^{*} σ_{2} \sqrt{V I F_{2}} < β_{2}^{*} σ_{1} \sqrt{V I F_{1}}$ , and so it follows that $1 < \frac{β_{2}^{*} σ_{1} \sqrt{V I F_{1}}}{β_{1}^{*} σ_{2} \sqrt{V I F_{2}}}$ . So, to show that the inequality holds, we need to show that $\sqrt{2} \sqrt{1 + \frac{V I F_{12}}{\sqrt{V I F_{1} V I F_{2}}}} - 1 < 1$ , which reduces to $V I F_{12}^{2} < V I F_{1} V I F_{2}$ . Since $V I F_{12} < V I F_{1}$ and $V I F_{12} < V I F_{2}$ , it must be the case that $V I F_{12}^{2} < V I F_{1} V I F_{2}$ . Thus, we’ve shown that $λ^{PADJ} < λ^{W 1 DF}$ . Note that for any p-value adjustment method, $α^{PADJ} < α^{W 1 DF}$ . Both the p-value adjustment method and the single weighted 1-DF test use the χ²-distribution with 1-DF. A smaller family-wise false-positive rate, $α$ , corresponds to smaller statistical power, as does a smaller value of $λ$ . Thus, since $λ^{PADJ} < λ^{W 1 DF}$ and $α^{PADJ} < α^{W 1 DF}$ , it follows that $π^{PADJ} < π^{W 1 DF}$ , meaning that the p-value adjustment methods will always be less powerful than the single weighted 1-DF test. For details of the full proof, see Appendix A.3.
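The two non-centrality inequalities of Sections 4.2.1 and 4.2.2 can be spot-checked numerically. The Python sketch below evaluates $λ^{PADJ}$, $λ^{COMB}$, and $λ^{W 1 DF}$ from the formulas above over a grid of inputs taken from Table 5 (with K = 10 and m = 50 fixed for brevity), keeping only scenarios that satisfy the stated assumptions ${VIF}_{12} < {VIF}_{1}$ and ${VIF}_{12} < {VIF}_{2}$. It is a numerical illustration, not a substitute for the proofs in the appendices. The definition ${VIF}_{12} = ρ_{2}^{(1, 2)} + (m - 1) ρ_{1}^{(1, 2)}$ is an assumption not stated in this section, and all names are hypothetical.

```python
from itertools import product
from math import sqrt

def ncps(b1, b2, var1, var2, rho01, rho02, rho1, rho2, K, m):
    """Return (lambda_padj, lambda_comb, lambda_w1df, assumptions_hold)."""
    s1, s2 = sqrt(var1), sqrt(var2)
    vif1 = 1 + (m - 1) * rho01
    vif2 = 1 + (m - 1) * rho02
    vif12 = rho2 + (m - 1) * rho1  # ASSUMED definition of VIF_12
    scale = 2 / (K * m)
    lam1 = b1 ** 2 / (scale * var1 * vif1)
    lam2 = b2 ** 2 / (scale * var2 * vif2)
    lam_comb = (b1 + b2) ** 2 / (scale * (var1 * vif1 + var2 * vif2 + 2 * s1 * s2 * vif12))
    lam_w = (sqrt(lam1) + sqrt(lam2)) ** 2 / (2 * (1 + vif12 / sqrt(vif1 * vif2)))
    holds = vif12 < vif1 and vif12 < vif2
    return min(lam1, lam2), lam_comb, lam_w, holds

betas = [(0.1, 0.4), (0.2, 0.4), (0.3, 0.4), (0.4, 0.4)]
variances = [(0.5, 1.5), (0.5, 1.0), (1.0, 1.0), (1.0, 0.5), (1.5, 0.5)]
iccs = [(0.05, 0.1), (0.07, 0.1), (0.1, 0.1), (0.1, 0.07), (0.1, 0.05)]
rho1s = [0.005, 0.01, 0.02, 0.05, 0.07]
rho2s = [0.1, 0.3, 0.5, 0.7, 0.9]

checked = 0
violations = 0
for (b1, b2), (v1, v2), (r01, r02), r1, r2 in product(betas, variances, iccs, rho1s, rho2s):
    lam_padj, lam_comb, lam_w, holds = ncps(b1, b2, v1, v2, r01, r02, r1, r2, K=10, m=50)
    if not holds:
        continue  # outside the assumptions used in the proofs
    checked += 1
    if not (lam_padj < lam_comb and lam_padj < lam_w):
        violations += 1

print(f"scenarios checked: {checked}, violations: {violations}")
```

Scenarios in which ${VIF}_{12}$ exceeds one of the outcome-specific variance inflation factors are skipped because the proofs above do not cover them, not because the inequalities are known to fail there.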

4.2.3 Method 1: p-value adjustments vs. Method 4: Disjunctive 2-DF Test

Examining the equations of the non-centrality parameters for the p-value adjustment method and disjunctive 2-DF test, we aim to show:

\min \left( \frac{{(β_{1}^{*})}^{2}}{\frac{2}{K m} σ_{1}^{2} V I F_{1}}, \frac{{(β_{2}^{*})}^{2}}{\frac{2}{K m} σ_{2}^{2} V I F_{2}} \right) < \frac{K m [{(β_{1}^{*})}^{2} σ_{2}^{2} V I F_{2} - 2 β_{1}^{*} β_{2}^{*} σ_{1} σ_{2} V I F_{12} + {(β_{2}^{*})}^{2} σ_{1}^{2} V I F_{1}]}{2 σ_{1}^{2} σ_{2}^{2} [V I F_{1} V I F_{2} - V I F_{12}^{2}]} .

We compare both tests using the χ²-distribution, and note that the disjunctive test uses 2-DF instead of 1-DF. Due to the differing degrees-of-freedom, $λ^{PADJ} < λ^{DIS 2 DF}$ and $α^{PADJ} < α^{DIS 2 DF}$ do not necessarily imply that $π^{PADJ} < π^{DIS 2 DF}$ . Similar to our assessment of the single weighted 1-DF test and disjunctive 2-DF test, we can write the power integrals as Marcum Q-Functions, which gives

\begin{aligned} π^{PADJ} & = \int_{c^{PADJ}}^{\infty} χ^{2} (x; 1, λ^{PADJ}) d x = Q_{1 / 2} (\sqrt{λ^{PADJ}}, \sqrt{c^{PADJ}}) \end{aligned}

\begin{aligned} π^{DIS 2 DF} & = \int_{c^{DIS 2 DF}}^{\infty} χ^{2} (x; 2, λ^{DIS 2 DF}) d x = Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}}) . \end{aligned}

As previously noted, the Marcum Q-Function, $Q_{d / 2} (\sqrt{λ}, \sqrt{c})$ , is strictly increasing in $d / 2$ and $\sqrt{λ}$ for all $\sqrt{λ} \geq 0$ and $\sqrt{c}$ , $d / 2 > 0$ . It is strictly decreasing in $\sqrt{c}$ for all $\sqrt{λ}$ , $\sqrt{c} \geq 0$ and $d / 2 > 0$ . In other words, $Q_{d / 2} (\sqrt{λ}, \sqrt{c})$ increases as $d / 2$ increases, and $Q_{d / 2} (\sqrt{λ}, \sqrt{c})$ decreases as $\sqrt{c}$ increases. Due to these competing effects, it is not clear how $Q_{1 / 2} (\sqrt{λ^{PADJ}}, \sqrt{c^{PADJ}})$ compares to $Q_{1} (\sqrt{λ^{DIS 2 DF}}, \sqrt{c^{DIS 2 DF}})$ . However, for a specific overall false-positive rate, we can plot the function values (i.e. statistical power) over many possible values for the non-centrality parameters. It was found that for almost every case, $π^{DIS 2 DF} > π^{PADJ}$ , but there are a small number of cases where $π^{DIS 2 DF} < π^{PADJ}$ , namely when the difference between $λ^{PADJ}$ and $λ^{DIS 2 DF}$ is very small. Thus, although $λ^{PADJ} < λ^{DIS 2 DF}$ , the differing degrees-of-freedom mean that it is not globally true that $π^{DIS 2 DF} > π^{PADJ}$ . However, the reverse ordering is not likely to be observed in practice, and across all p-value adjustment methods, $π^{DIS 2 DF} > π^{PADJ}$ generally. Appendix A.4 explores this in greater detail.
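A small numerical illustration shows how both orderings of $π^{PADJ}$ and $π^{DIS 2 DF}$ can arise. The Python sketch below assumes a Bonferroni adjustment (each marginal 1-DF test at $α / 2$) as one example of a p-value adjustment method; the two pairs of non-centrality parameters are hypothetical values chosen for illustration, not derived from a particular design, and the function names are assumptions.

```python
from math import exp, log, sqrt
from statistics import NormalDist

_N = NormalDist()

def power_1df(lam, alpha):
    # P(chi2(1, lam) > c), with sqrt(c) equal to the two-sided normal quantile
    r, b = sqrt(lam), _N.inv_cdf(1 - alpha / 2)
    return _N.cdf(r - b) + _N.cdf(-r - b)

def power_2df(lam, alpha, terms=200):
    # Poisson(lam / 2) mixture of central chi2(2 + 2j) survival functions;
    # the central chi2(2) critical value is -2 ln(alpha)
    h = -log(alpha)
    weight, term, partial, total = exp(-lam / 2.0), 1.0, 1.0, 0.0
    for j in range(terms):
        total += weight * exp(-h) * partial
        weight *= (lam / 2.0) / (j + 1)
        term *= h / (j + 1)
        partial += term
    return total

alpha = 0.05
alpha_padj = alpha / 2  # Bonferroni adjustment for two outcomes (assumed for illustration)

# Hypothetical (lambda_PADJ, lambda_DIS2DF) pairs with lambda_PADJ < lambda_DIS2DF
typical = (2.0, 2.5)
close = (12.5, 12.6)

typical_padj, typical_dis = power_1df(typical[0], alpha_padj), power_2df(typical[1], alpha)
close_padj, close_dis = power_1df(close[0], alpha_padj), power_2df(close[1], alpha)

print(f"typical pair: PADJ = {typical_padj:.4f}, DIS2DF = {typical_dis:.4f}")
print(f"close pair:   PADJ = {close_padj:.4f}, DIS2DF = {close_dis:.4f}")
```

In the second pair the two non-centrality parameters are nearly equal, and the smaller critical value of the adjusted 1-DF test is enough to give it slightly more power than the 2-DF test.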

4.3 Summary of theoretical comparisons

The reason the p-value adjustment methods are globally less powerful than the combined outcomes approach and the single weighted 1-DF test can be intuitively understood through a property of the arithmetic mean: for any two real numbers, a and b, it is always the case that $\min (a, b) \leq (a + b) / 2 \leq \max (a, b)$ . In a way, $λ^{COMB}$ and $λ^{W 1 DF}$ are “averages” of the non-centrality parameters for the first and second outcomes individually, namely $λ^{(1)}$ and $λ^{(2)}$ . The p-value adjustment methods take the “worst” case scenario, resulting in the smallest power possible among the first and second outcome. Thus, it makes sense that they are less powerful than the combined outcomes approach and the single weighted 1-DF test. It's also important to note that the p-value adjustments don’t take into account the correlation parameters $ρ_{1}^{(1, 2)}$ or $ρ_{2}^{(1, 2)}$ , with the exception of the D/AP method, which uses $ρ_{2}^{(1, 2)}$ to adjust the α-level. So, this approach disregards important information about the data, which possibly contributes to its lower statistical power. The results in the following numerical evaluation help to confirm these findings; there are no scenarios in which the p-value adjustment methods are more powerful than the combined outcomes approach and the single weighted 1-DF test. The same intuition can explain why the p-value adjustment methods are nearly always less powerful than the disjunctive 2-DF test, as this test is also “averaging” statistics between the first and second outcomes. The differing degrees-of-freedom could explain why, in theory, the p-value adjustment methods could be more powerful than the disjunctive 2-DF test, though it's not likely to be observed in practice.

To summarize, we’ve shown that the p-value adjustment methods are globally less powerful than the combined outcomes approach and the single weighted 1-DF test, and that the p-value adjustment methods have a globally smaller non-centrality parameter than the disjunctive 2-DF test, but that it is still theoretically possible for the disjunctive 2-DF test to have a smaller power than the p-value adjustment methods. We’ve also shown that the combined outcomes approach is equivalent to the single weighted 1-DF test when the outcome specific ICCs and variances between the two outcomes are the same. Lastly, we’ve shown that the single weighted 1-DF test has the same non-centrality parameter as the disjunctive test when the cluster-corrected standardized effect sizes of the outcomes are equal. In this case, the single weighted 1-DF test will yield more power than the disjunctive 2-DF test. Table 4 shows a summary of these theoretical comparison findings along with the theoretical notation. To more thoroughly examine the performance of the study design methods, we continue on to the numerical evaluation.

Table 4.
Summary of theoretical results.

Description	Theoretical Notation
p-Value Adjustment Methods are always less powerful than the Combined Outcomes Approach.	$π^{P A D J} < π^{C O M B}$
p-Value Adjustment Methods are always less powerful than the Single Weighted 1-DF Test.	$π^{P A D J} < π^{W 1 D F}$
p-Value Adjustment Methods always have a smaller non-centrality parameter than the Disjunctive 2-DF Test. However, due to the differing degrees of freedom, there are cases where the p-Value Adjustment Methods can result in higher power than the Disjunctive 2-DF Test (though this is not typically observed in practice).	$λ^{P A D J} < λ^{D I S 2 D F}$
Combined Outcomes Approach is theoretically equivalent to the Single Weighted 1-DF Test when the outcome specific ICCs and variances between the two outcomes are the same, resulting in the same statistical power.	If $ρ_{0}^{(1)} = ρ_{0}^{(2)}$ and $σ_{1}^{2} = σ_{2}^{2}$ , then $λ^{C O M B} = λ^{W 1 D F}$ and $π^{C O M B} = π^{W 1 D F}$
Single Weighted 1-DF Test has the same non-centrality parameter as the Disjunctive 2-DF Test when the cluster-corrected standardized effect sizes of the first and second outcomes are equal.	If $\frac{β_{1}^{*}}{σ_{1} \sqrt{V I F_{1}}} = \frac{β_{2}^{*}}{σ_{2} \sqrt{V I F_{2}}}$ , then $λ^{W 1 D F} = λ^{D I S 2 D F}$ and $π^{W 1 D F} > π^{D I S 2 D F}$ for all $α \in \{0.01, 0.025, 0.05, 0.1\}$ and $λ^{W 1 D F}, λ^{D I S 2 D F} \in [0, 30]$

5. Numerical evaluation

5.1 Overview and estimands

In the numerical evaluation, we compare the testing procedures in terms of power under (a) varying cluster sizes and numbers of clusters; (b) varying values of the four correlations, namely the endpoint-specific ICCs, the inter-subject between-endpoint ICC, and the intra-subject between-endpoint ICC; (c) varying values of the treatment effects on each outcome; and (d) varying values for the total variance of the two outcomes. The goal of the evaluation was to identify which of the five methods currently available for hybrid type 2 studies, if any, is globally the most powerful and, if none is (which we hypothesize to be the case), under what design assumptions each method is most powerful. The numerical evaluation was needed because it was not possible to derive fully comprehensive theoretical answers to these questions, particularly for the conjunctive IU test and how it compares to the other methods. The code used to run the numerical evaluation is available on GitHub at https://github.com/melodyaowen/hybrid2numerical. Note that because exact expressions for all quantities of interest are available, there is no need for a simulation study. We discuss this point in greater detail later on.

5.2 Methods

The parameters that are varied in the numerical evaluation, along with their values, are displayed in Table 5. Values of each parameter were chosen based on values commonly observed in real CRT data, and to ensure that a wide range of values were considered for each input parameter. We explored treatment effect scenarios where either $β_{1}^{*} < β_{2}^{*}$ or $β_{1}^{*} = β_{2}^{*}$ . Because $β_{1}^{*} < β_{2}^{*}$ is symmetric to $β_{1}^{*} > β_{2}^{*}$ , meaning that the results for $β_{1}^{*} > β_{2}^{*}$ would mirror those for $β_{1}^{*} < β_{2}^{*}$ if the roles of the two treatment effects were reversed, we did not explore that scenario. We only considered a family-wise false-positive rate of 0.05 and equal treatment allocation.

Table 5.
Numerical evaluation parameters (results in 45,000 unique design scenarios).

Parameter	Statistical notation	Description	Considered values
Number of clusters	K	Number of clusters in treatment group	4, 6, 8, 10, 20, 30
Cluster size	m	Number of individuals in each cluster	50, 70, 100
Effects for $Y_{1}$ and $Y_{2}$	$β^{*} = (β_{1}^{*}, β_{2}^{*})$	Estimated intervention effect vector for the two outcomes ( $Y_{1}$ and $Y_{2}$ )	(0.1, 0.4); (0.2, 0.4); (0.3, 0.4); (0.4, 0.4)
Outcome variances	$σ^{2} = (σ_{1}^{2}, σ_{2}^{2})$	Total variance, $V a r (Y_{1})$ and $V a r (Y_{2})$	(0.5, 1.5); (0.5, 1); (1, 1); (1, 0.5); (1.5, 0.5)
Endpoint-specific ICC for $Y_{1}$ and $Y_{2}$	$ρ_{0} = (ρ_{0}^{(1)}, ρ_{0}^{(2)})$	Correlation for $Y_{1}$ for two different individuals in the same cluster, correlation for $Y_{2}$ for two different individuals in the same cluster	(0.05, 0.1); (0.07, 0.1); (0.1, 0.1); (0.1, 0.07); (0.1, 0.05)
Inter-subject between-endpoint ICC	$ρ_{1}^{(1, 2)}$	Correlation between $Y_{1}$ and $Y_{2}$ for two different individuals in same cluster	0.005, 0.01, 0.02, 0.05, 0.07
Intra-subject between-endpoint ICC	$ρ_{2}^{(1, 2)}$	Correlation between $Y_{1}$ and $Y_{2}$ for the same individual	0.1, 0.3, 0.5, 0.7, 0.9
Overall (family-wise) False Positive Rate	$α$	Probability of one or more Type I error(s)	0.05

The number of clusters in each treatment group ( $K$ ) explored in the numerical study was carefully considered, as this is a particularly important parameter when designing CRTs. Kahan et al. reported that the median total number of clusters in CRTs is 25 ( $K = 12$ to 13 in each arm for a parallel design).¹⁴ Although hybrid 2 studies are increasing in popularity, the typical number of clusters in hybrid 2 studies that are also CRTs is not well known. A well-known hybrid 2 study by Abbott et al. reported 15 total clusters, and another study by Clemson et al. reported 28 total clusters.¹⁵,¹⁶ In these studies, the clusters were clinics and general healthcare practices. In another hybrid 2 study by Galaviz et al., physicians were the clusters and their patients were the individuals; they reported a total of 36 clusters (physicians).¹⁷ The goal of the numerical study is to evaluate the power that these methods yield over a wide range of scenarios commonly found in CRTs, particularly those in public health and implementation science. To that end, we include a wide range of values for $K$ in order to examine scenarios with a small number of clusters ( $K = 4$, 6, and 8, i.e. 8, 12, and 16 total) and scenarios with the typical total number of clusters in CRTs and hybrid 2 studies or higher ( $K = 10$, 20, and 30, i.e. 20, 40, and 60 total).

The numerical evaluation was conducted using R/RStudio. First, a data frame of every potential design scenario using all combinations of the inputs displayed in Table 5 was created. Here, a study design “scenario” refers to a unique set of the 10 input parameters that one could use to calculate statistical power. There were a total of 45,000 such input scenarios. For each scenario, the statistical power was calculated using each of the study design methods described in Section 2. The resulting power calculations were assessed in relation to each input parameter, and trends were summarized through figures and tables.
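As an illustration of how such a scenario grid can be assembled, the following is a minimal Python sketch (not the R code used for the study) that crosses the values in Table 5. The variable and key names are our own illustrative choices; the effect, variance, and endpoint-specific ICC values are crossed as pairs, as they appear in the table.

```python
from itertools import product

# Input values copied from Table 5; all names below are illustrative.
K_values = [4, 6, 8, 10, 20, 30]          # clusters per treatment group
m_values = [50, 70, 100]                  # individuals per cluster
beta_pairs = [(0.1, 0.4), (0.2, 0.4), (0.3, 0.4), (0.4, 0.4)]
var_pairs = [(0.5, 1.5), (0.5, 1), (1, 1), (1, 0.5), (1.5, 0.5)]
rho0_pairs = [(0.05, 0.1), (0.07, 0.1), (0.1, 0.1), (0.1, 0.07), (0.1, 0.05)]
rho1_values = [0.005, 0.01, 0.02, 0.05, 0.07]
rho2_values = [0.1, 0.3, 0.5, 0.7, 0.9]
alpha = 0.05                              # held fixed across scenarios

# One dictionary per design scenario (every combination of the inputs).
scenarios = [
    {"K": K, "m": m, "beta1": b[0], "beta2": b[1], "var1": v[0], "var2": v[1],
     "rho01": r0[0], "rho02": r0[1], "rho1": r1, "rho2": r2, "alpha": alpha}
    for K, m, b, v, r0, r1, r2 in product(
        K_values, m_values, beta_pairs, var_pairs,
        rho0_pairs, rho1_values, rho2_values)
]

print(len(scenarios))  # 6 * 3 * 4 * 5 * 5 * 5 * 5 = 45000
```

Each dictionary holds the 10 varying design inputs plus the fixed $α$; a power function for any of the design methods could then be applied to each entry in turn.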

Four separate numerical analyses were conducted, which we refer to as Comparison I, Comparison II, Comparison III, and Comparison IV. We conducted four separate comparisons because different probability distributions have been proposed to assess power for different tests. In particular, the disjunctive 2-DF test was derived using the F-distribution, but one can also use the χ²-distribution for this method. The p-value adjustment methods, combined outcomes approach, and single weighted 1-DF test can likewise use either the F-distribution or the χ²-distribution, and all of these methods use a 2-sided test. The conjunctive IU test differs from the other tests because it is multivariate in nature: a vector of test statistics is used (one for each outcome). It was originally derived using a multivariate t-distribution, but one can also use an MVN distribution. It is not feasible for this method to use a χ²-distribution or F-distribution, because there is no multivariate F or χ²-distribution. It is valid to compare the conjunctive IU test under a two-tailed MVN-distribution to the remaining methods under the χ²-distribution, and to compare the conjunctive IU test under the t-distribution to the remaining methods under the F-distribution (recall that the square of a standard normal variable is χ²-distributed with 1 DF, and the square of a t-distributed variable is F-distributed with 1 numerator DF). Furthermore, the conjunctive IU test as proposed conducts two 1-sided tests (one for each outcome), but to make this method comparable to the other methods, which are all 2-sided, we consider the conjunctive IU test with two 2-sided tests. On the other hand, recognizing that users may want to understand how the conjunctive IU test compares to the other methods when used as originally proposed (i.e. using two 1-sided tests), we also looked at this case.

This results in four main comparisons. Comparison I: 2-sided method comparison using the F-distribution for Methods 1–4 and the two-sided t-distribution for Method 5; Comparison II: 2-sided method comparison using the χ²-distribution for Methods 1–4 and the two-sided MVN-distribution for Method 5; Comparison III: “as is” method comparison using the F-distribution for Methods 1–4 and the one-sided t-distribution for Method 5; and Comparison IV: “as is” method comparison using the χ²-distribution for Methods 1–4 and the one-sided MVN-distribution for Method 5. Table 6 summarizes these comparisons.

Table 6.

Summary of numerical analysis comparisons.

	Numerical analysis types
	Comparison I	Comparison II	Comparison III	Comparison IV
Design Method	“2-sided” method comparison using the F-distribution and t-distribution	“2-sided” method comparison using the $χ^{2}$-distribution and MVN-distribution	“As is” method comparison using the F-distribution and t-distribution	“As is” method comparison using the $χ^{2}$-distribution and MVN-distribution
1. p-Value Adjustment Methods for Multiple Testing	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test
2. Combined Outcomes Approach	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test
3. Single Weighted 1-DF Combined Test	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test
4. Disjunctive 2-DF Test	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test	$F$-distribution; one 2-sided test	$χ^{2}$-distribution; one 2-sided test
5. Conjunctive Intersection–Union Test	t-distribution; two 2-sided tests	MVN-distribution; two 2-sided tests	t-distribution; two 1-sided tests	MVN-distribution; two 1-sided tests

*MVN stands for “multivariate normal distribution.”
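To make the difference between the 2-sided and “as is” comparisons concrete, the short Python sketch below computes large-sample critical values at $α = 0.05$ using only the standard library. It is an illustration under stated assumptions rather than the calculation used in the study: it uses the normal and 1-DF χ² reference distributions only (the t and F cut-offs depend on degrees of freedom not restated here), and it assumes each component test of the conjunctive IU test is carried out at level $α$.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# Per-outcome cut-offs for a normal-based test at level alpha.
z_two_sided = std_normal.inv_cdf(1 - alpha / 2)
z_one_sided = std_normal.inv_cdf(1 - alpha)

# A chi-square variable with 1 DF is a squared standard normal, so the
# unadjusted 1-DF chi-square cut-off is the squared two-sided normal cut-off.
chi2_1df = z_two_sided ** 2

print(round(z_two_sided, 3), round(z_one_sided, 3), round(chi2_1df, 3))
# 1.96 1.645 3.841
```

The one-sided cut-off is smaller than the two-sided one, which is why a conjunctive test built from two 1-sided tests is expected to have more power than its 2-sided counterpart for effects in the hypothesized direction.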

Next, we discuss the results of the numerical evaluation under Comparison I in depth. Though we do not display the results for Comparisons II–IV, we outline any key findings or differences in results that were found. We provide the corresponding figures and tables for Comparisons II–IV in the Supplementary Material.

5.3 Results

5.3.1 Distribution of statistical power and overall method rankings

To gain an overall understanding of the statistical power yielded by each method, we examined the distribution of power across the 45,000 input scenarios, pooling over all values of the 10 input parameters. Figure 3 displays histograms of statistical power for each method, along with the mean, minimum, and maximum power. The three p-value adjustment methods tended to have lower power than the other methods, and the conjunctive test similarly had lower power in general. The combined outcomes approach, single 1-DF test, and disjunctive 2-DF test tended to have the highest power.

We then ranked the methods by power to better understand how they compare to one another overall. For each of the 45,000 input scenarios, each method was ranked 1 through 7; a ranking of 1 means the method had the highest power, while a ranking of 7 means the method had the lowest power. Figure 4 shows a heatmap of these rankings across scenarios: a darker blue corresponds to a worse ranking (lower power), and a lighter blue corresponds to a better ranking (higher power). The figure includes a summary table of each method's mean ranking. The single weighted 1-DF test had a mean ranking of 1.61, the disjunctive 2-DF test 2.06, the combined outcomes approach 2.13, and the conjunctive test 4.36. The p-value adjustment methods had the worst mean rankings, with the Bonferroni method ranking last on average (6.93). These results give an overview of how the methods compare in terms of statistical power, but they also demonstrate that no method was globally better than all the others. Next, we examine more closely how each input design parameter affects statistical power.

Figure 3.

Distribution of statistical power for each design method among all 45,000 input scenarios for Comparison I (F and t distributions with 2-sided conjunctive test).

Figure 4.

Ranks of all study design methods for each of the 45,000 input scenarios for Comparison I (F and t distributions with 2-sided conjunctive test).
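The per-scenario ranking and the mean ranking reported with Figure 4 can be sketched as follows. This is a minimal Python illustration with made-up power values, not output from the study; the method labels are shorthand of our own, and the handling of ties (tied methods share the better rank) is an assumption, since the tie rule is not stated here.

```python
def rank_methods(power_by_method):
    """Rank methods by power (1 = highest); tied methods share the better rank."""
    ordered = sorted(power_by_method.values(), reverse=True)
    return {name: ordered.index(p) + 1 for name, p in power_by_method.items()}


def mean_ranks(scenario_powers):
    """Average each method's rank over a list of per-scenario power dicts."""
    totals = {}
    for powers in scenario_powers:
        for name, rank in rank_methods(powers).items():
            totals[name] = totals.get(name, 0) + rank
    return {name: total / len(scenario_powers) for name, total in totals.items()}


# Hypothetical power values for two scenarios (NOT results from the study).
toy = [
    {"bonferroni": 0.61, "sidak": 0.62, "dap": 0.63, "combined": 0.80,
     "w1df": 0.80, "dis2df": 0.74, "conj": 0.66},
    {"bonferroni": 0.40, "sidak": 0.41, "dap": 0.42, "combined": 0.55,
     "w1df": 0.58, "dis2df": 0.67, "conj": 0.45},
]

print(mean_ranks(toy))
```

With 45,000 scenarios the same two functions would be applied to the full list of per-scenario power values; a different tie rule would shift the mean rankings slightly wherever two methods have identical power.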

5.3.2 Power in relation to each input parameter

Before evaluating the numerical study results further, we conducted an additional evaluation of how each input parameter individually affects power when allowed to vary over a larger set of values than was feasible in the numerical evaluation, holding all other input parameters constant. This step helped guide which aspects of the numerical evaluation to report. The parameters that did not vary were fixed at values chosen from the numerical evaluation: $K = 8$, $m = 50$, $(β_{1}^{*}, β_{2}^{*}) = (0.4, 0.4)$, $(σ_{1}^{2}, σ_{2}^{2}) = (1, 1)$, $(ρ_{0}^{(1)}, ρ_{0}^{(2)}) = (0.1, 0.1)$, $ρ_{1}^{(1, 2)} = 0.01$, and $ρ_{2}^{(1, 2)} = 0.1$. These values remained the same across the evaluations, with the exception of the input parameter that was allowed to vary. We conducted this additional analysis for each of the four comparisons, but discuss the results for Comparison I only; results for Comparisons II–IV are available in the Supplementary Material. We summarize key findings here, with detailed results and figures given in Appendix B.

As expected, statistical power increased with the number of clusters ( $K$ ). Although a wide range of values for $K$ was examined, increasing the number of clusters did not change the ranking of the tests in terms of power. However, the power difference between the methods is greater when $K$ is smaller. Both this additional evaluation and the wide range of scenarios considered in the numerical study showed that once $K$ reaches approximately 20, the differences between the methods become small, at which point subject matter considerations, rather than power, should dictate the choice of the primary test statistic. The choice of $K$ is important in the design of CRTs, and prior work has addressed the type I error challenges that can arise in CRTs with a small number of clusters. The threshold at which type I error stabilizes depends on the data generating process and the choice of test. For example, Kahan et al. showed that in some scenarios, even single-outcome CRTs may exhibit inflated type I error for as many as 70 total clusters, underscoring that no universal cut-off exists.¹⁴ Our study, however, examines the relative power yielded by each design method. Importantly, the ranking of the tests by power was stable across all values of $K$ we considered; because we evaluated a broad range, we captured both the region where the methods differ most and the region where they converge, ensuring that our conclusions about comparative performance are robust to the number of clusters.

Similarly, the ranking of the methods by power did not change as $m$ increased. As the ratio of the treatment effects ( $β_{2}^{*} / β_{1}^{*}$ ) increased, so did statistical power, and the most powerful method changed. As the ratio of the variances ( $σ_{2}^{2} / σ_{1}^{2}$ ) increased, statistical power decreased for all of the methods, and the most powerful method also changed. These observations motivated a closer look at how the treatment effects and variances affect power, which we detail in the next section. As the inter-subject between-endpoint ICC ( $ρ_{1}^{(1, 2)}$ ) increased, power decreased for the single weighted 1-DF test and the disjunctive 2-DF test, and increased for the conjunctive IU test. As the intra-subject between-endpoint ICC ( $ρ_{2}^{(1, 2)}$ ) increased, power decreased very slightly for the single weighted 1-DF test and disjunctive 2-DF test, remained constant for the Bonferroni and Sidak methods, and increased slightly for the D/AP method, though these effects were minimal. The outcome-specific ICCs, $ρ_{0}^{(1)}$ and $ρ_{0}^{(2)}$, had a much more substantial impact on statistical power, changing which method was the most powerful as $ρ_{0}^{(1)} / ρ_{0}^{(2)}$ increased. We therefore further investigated the effects of $ρ_{0}^{(1)}$ and $ρ_{0}^{(2)}$ on power.
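The one-parameter-at-a-time procedure described in this section can be sketched in Python as follows. Because the power equations for the five design methods are given in Section 2 and not repeated here, the sketch substitutes the standard normal-approximation power for a single continuous endpoint in a two-arm CRT with $K$ clusters per arm; this stand-in is an assumption made purely for illustration and is not one of the methods compared in this article. The function and variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

_N = NormalDist()  # standard normal distribution


def single_endpoint_power(K, m, beta, var, icc, alpha=0.05):
    """Normal-approximation power of a 2-sided test for ONE continuous
    endpoint in a two-arm CRT (K clusters per arm, m individuals per cluster).
    Illustrative stand-in only; not one of the five methods compared here."""
    vif = 1 + (m - 1) * icc                # variance inflation factor
    se = sqrt(2 * var * vif / (K * m))     # SE of the difference in arm means
    z_crit = _N.inv_cdf(1 - alpha / 2)
    shift = abs(beta) / se
    return _N.cdf(shift - z_crit) + _N.cdf(-shift - z_crit)


# Baseline values for the first endpoint, taken from the fixed values above.
baseline = {"K": 8, "m": 50, "beta": 0.4, "var": 1.0, "icc": 0.1}


def vary_one(name, values):
    """Recompute power while one input changes and the rest stay at baseline."""
    return {v: single_endpoint_power(**{**baseline, name: v}) for v in values}


for K, power in vary_one("K", [4, 6, 8, 10, 20, 30]).items():
    print(K, round(power, 3))
```

Replacing `single_endpoint_power` with a function implementing any of the five design methods would yield the corresponding one-at-a-time power curves; under this single-endpoint stand-in, power rises with $K$ and falls as the outcome variance grows.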

5.3.3 When is each method most powerful?

When examining the statistical power of the five study design methods across the 45,000 input scenarios, we narrowed our scope to the design methods that were the most powerful in at least one scenario. Among those methods, we examined the conditions on $β_{1}^{*}$, $β_{2}^{*}$, $σ_{1}^{2}$, $σ_{2}^{2}$, $ρ_{0}^{(1)}$, and $ρ_{0}^{(2)}$ that resulted in a method being the most powerful. Examining the results in this way allows us to give specific recommendations on which methods yield the highest statistical power, based on the design parameters that have the biggest impact on power.

Under Comparison I, the combined outcomes approach, single weighted 1-DF test, and disjunctive 2-DF test all had scenarios in which they yielded the highest power, whereas the p-value adjustment methods and conjunctive IU test were never observed to have the highest power. To discern the conditions under which each of these three methods was the most powerful, we calculated how frequently each method had the highest power under different scenarios, defined in terms of the standardized effect sizes ( $β_{1}^{*} / σ_{1}$ and $β_{2}^{*} / σ_{2}$ ) and ICCs. For example, there are a total of 2700 of the 45,000 scenarios for which $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} \in [0.40, 0.49]$ and $ρ_{0}^{(1)} > ρ_{0}^{(2)}$ , and in this case, the disjunctive 2-DF test was found to yield the highest statistical power for 92% of these cases. These results are displayed in Table 7. Note that because $β_{1}^{*} \leq β_{2}^{*}$ by design of the numerical analysis, there were relatively few cases where $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} < 0$ , and so these cases were grouped together.

Table 7.
Numerical evaluation results – proportion of scenarios where study design methods are most powerful based on standardized effect sizes, summarized for Comparison I (F and t distributions with 2-sided conjunctive test).

		Proportion of scenarios where the method was most powerful
$\frac{β_{2}^{*}}{σ_{2}} - \frac{β_{1}^{*}}{σ_{1}}$	$(ρ_{0}^{(1)}, ρ_{0}^{(2)})$	Combined outcomes	Combined outcomes = Single 1-DF	Single 1-DF	Disjunctive 2-DF	# Total Scenarios
$< 0$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	0%	10%	78%	39%	3600
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	8%	94%	12%	1800
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	21%	21%	67%	16%	3600
$0$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	0%	14%	89%	15%	900
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	100%	0%	0%	450
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	0%	14%	89%	15%	900
$[0.05, 0.19]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	38%	10%	54%	8%	4500
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	45%	23%	31%	7%	2250
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	43%	6%	31%	33%	4500
$[0.20, 0.29]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	38%	8%	41%	26%	3600
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	16%	23%	40%	30%	1800
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	6%	5%	40%	65%	3600
$[0.30, 0.39]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	14%	3%	38%	54%	2700
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	5%	29%	69%	1350
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	0%	0%	22%	87%	2700
$[0.40, 0.49]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	0%	0%	13%	87%	2700
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	0%	8%	92%	1350
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	0%	0%	8%	97%	2700

* Percentages are rounded to the nearest integer.
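The scenario counts in the last column of Table 7 follow directly from the input grid, as the Python sketch below illustrates. It computes the difference in standardized effect sizes for each combination of effects and variances, rounds it to two decimals, and assigns it to a bin. The binning rule is an assumption on our part: any positive rounded difference below 0.20 is placed in the $[0.05, 0.19]$ bin (one combination, $β^{*} = (0.2, 0.4)$ with $σ^{2} = (0.5, 1.5)$, gives a difference of about 0.04), which is the rule that reproduces the tabulated counts.

```python
from collections import Counter
from itertools import product
from math import sqrt

beta_pairs = [(0.1, 0.4), (0.2, 0.4), (0.3, 0.4), (0.4, 0.4)]
var_pairs = [(0.5, 1.5), (0.5, 1), (1, 1), (1, 0.5), (1.5, 0.5)]
rho0_pairs = [(0.05, 0.1), (0.07, 0.1), (0.1, 0.1), (0.1, 0.07), (0.1, 0.05)]

# Every (beta, variance, rho0) combination is shared by
# K (6) x m (3) x rho1 (5) x rho2 (5) scenarios.
scenarios_per_combination = 6 * 3 * 5 * 5


def effect_bin(beta, var):
    """Bin beta2/sigma2 - beta1/sigma1 after rounding to two decimals."""
    d = round(beta[1] / sqrt(var[1]) - beta[0] / sqrt(var[0]), 2)
    if d < 0:
        return "<0"
    if d == 0:
        return "0"
    if d < 0.20:
        return "[0.05, 0.19]"  # assumption: any positive value below 0.20
    if d < 0.30:
        return "[0.20, 0.29]"
    if d < 0.40:
        return "[0.30, 0.39]"
    return "[0.40, 0.49]"


def icc_relation(rho0):
    """Return '<', '=' or '>' comparing rho0 for endpoint 1 with endpoint 2."""
    if rho0[0] < rho0[1]:
        return "<"
    if rho0[0] > rho0[1]:
        return ">"
    return "="


counts = Counter()
for beta, var, rho0 in product(beta_pairs, var_pairs, rho0_pairs):
    counts[(effect_bin(beta, var), icc_relation(rho0))] += scenarios_per_combination

for key in sorted(counts):
    print(key, counts[key])
```

For instance, `counts[("[0.40, 0.49]", ">")]` equals 2700, matching the example discussed in the text.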

Based on these results, there are many cases in which the single weighted 1-DF test had the most power of all the methods more than 80% of the time. In particular, when $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} = 0$ , this test had higher power than all other methods over 80% of the time. When $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} < 0$ , the single weighted 1-DF test is most powerful over 55% of the time, and when $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} < 0$ and $ρ_{0}^{(1)} = ρ_{0}^{(2)}$ , it is most powerful over 86% of the time. From this, we conclude that the single weighted 1-DF test tends to have the highest power when $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} \leq 0$ , especially when the outcome-specific ICCs are the same. In contrast, the disjunctive 2-DF test most frequently had the highest power when the difference between the standardized treatment effects was larger. For example, when $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} \in [0.30, 0.39]$ and $ρ_{0}^{(1)} > ρ_{0}^{(2)}$ , this method had the highest power over 80% of the time. When $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1} \in [0.40, 0.49]$ , an even greater difference between the standardized treatment effects, this method had the highest power up to 92% of the time. The results based on the unstandardized effect sizes were also examined, and this table is available in the Supplementary Material (S.5). For the single weighted 1-DF test, the findings based on the unstandardized effect sizes were similar: this test tends to have the highest power when the effect sizes are the same. The disjunctive 2-DF test does well when the effect sizes are different, in particular when $β_{1}^{*} < β_{2}^{*}$ and $σ_{1}^{2} > σ_{2}^{2}$ . Table 8 summarizes the results for both the standardized and unstandardized effect sizes, showing when these methods are most powerful 50–80% of the time, and >80% of the time.

Table 8.

Summary of trends for highest statistical power from numerical evaluation based on standardized and unstandardized effect sizes for Comparison I (F and t distributions with 2-sided conjunctive test).

	Results based on standardized effect sizes		Results based on unstandardized effect sizes
Design method	Most powerful 50–80% of the time	Most powerful >80% of the time	Most powerful 50–80% of the time	Most powerful >80% of the time
Combined Outcomes Approach	–	$\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} = 0$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$	$β_{1} < β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $β_{1} < β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $β_{1} < β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} < β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$	$β_{1} = β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$
Single Weighted 1-DF Test	$\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} < 0$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} < 0$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.05, 0.19]$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$	$\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} < 0$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} = 0$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} = 0$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} = 0$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$	$β_{1} < β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} < β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$	$β_{1} = β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$
Disjunctive 2-DF Test	$\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.20, 0.29]$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.30, 0.39]$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.30, 0.39]$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$	$\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.30, 0.39]$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.40, 0.49]$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.40, 0.49]$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $\frac{β_{2}}{σ_{2}} - \frac{β_{1}}{σ_{1}} \in [0.40, 0.49]$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$	$β_{1} < β_{2}$, $σ_{1}^{2} = σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$; $β_{1} < β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} < β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} = ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} < σ_{2}^{2}$, $ρ_{0}^{(1)} < ρ_{0}^{(2)}$; $β_{1} = β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$	$β_{1} < β_{2}$, $σ_{1}^{2} > σ_{2}^{2}$, $ρ_{0}^{(1)} > ρ_{0}^{(2)}$

*To save space, the “*” superscripts are dropped from the $β$ terms in this table; multiple conditions within a cell are separated by semicolons.

5.3.4 Conjunctive test vs. Bonferroni p-value adjustment

Though the conjunctive test was never the most powerful among all the methods considered, it is important to understand the situations in which it is more powerful than arguably the most popular method, the Bonferroni p-value adjustment method. This is because the conjunctive test is the only test that examines a conjunctive hypothesis, which many researchers may be interested in using in the context of a hybrid 2 study. To this end, we narrowed our scope to the numerical evaluation results for the conjunctive test and the Bonferroni adjustment method alone. Table 9 shows summary statistics for the power difference between the two methods, namely $π^{CONJ} - π_{Bonferroni}^{PADJ}$ , across the levels of the 10 input parameters. Generally, as $K$, $m$, $ρ_{1}^{(1, 2)}$ and $ρ_{2}^{(1, 2)}$ increase in value, the conjunctive test becomes increasingly more powerful than the Bonferroni method. When $ρ_{0}^{(1)} \neq ρ_{0}^{(2)}$ , the conjunctive test is more powerful than the Bonferroni method to a higher degree than when $ρ_{0}^{(1)} = ρ_{0}^{(2)}$ , and across the other design parameters, over all values of $β_{2}^{*} / σ_{2} - β_{1}^{*} / σ_{1}$ , the conjunctive test performs better than the Bonferroni method on average. In general, $π^{CONJ} > π_{Bonferroni}^{PADJ}$ in 70% to 100% of the scenarios considered, except when $β_{2}^{*} / σ_{2} = β_{1}^{*} / σ_{1}$ and $ρ_{0}^{(1)} = ρ_{0}^{(2)}$ , where $π^{CONJ} < π_{Bonferroni}^{PADJ}$ (see online Supplementary Material, S.1).

Table 9.
Summary statistics for $π^{CONJ} - π_{Bonferroni}^{PADJ}$ stratified by input parameter values for Comparison I (F and t distributions with 2-sided conjunctive test).

Input parameter	Value	Mean difference	Minimum	Maximum	# total scenarios
$K$	$4$	3.3	−3.2	13.6	7500
	$6$	4.7	−7.0	14.2	7500
	$8$	5.9	−8.9	13.4	7500
	$10$	6.7	−9.5	12.8	7500
	$20$	6.7	−4.3	11.9	7500
	$30$	5.4	−0.6	11.6	7500
$m$	$50$	5.6	−8.6	14.2	15,000
	$70$	5.4	−9.5	13.6	15,000
	$100$	5.5	−9.0	13.9	15,000
$\frac{β_{2}^{*}}{σ_{2}} - \frac{β_{1}^{*}}{σ_{1}}$	$< 0$	5.3	−6.4	14.2	9000
	$0$	3.2	−7.8	13.4	2250
	$[0.05, 0.19]$	4.2	−9.5	13.6	11,250
	$[0.20, 0.29]$	6.9	−0.6	14.2	9000
	$[0.30, 0.39]$	6.9	0.4	12.8	6750
	$[0.40, 0.49]$	5.3	1.6	11.9	6750
$ρ_{0}^{(1)}$, $ρ_{0}^{(2)}$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	5.4	−9.5	14.2	18,000
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	4.7	−7.8	12.6	9000
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	5.9	−6.4	14.2	18,000
$ρ_{1}^{(1, 2)}$	$0.005$	4.4	−9.5	13.1	9000
	$0.01$	4.6	−8.8	13.1	9000
	$0.02$	5.0	−7.4	13.2	9000
	$0.05$	6.3	−2.9	13.4	9000
	$0.07$	7.1	0	14.2	9000
$ρ_{2}^{(1, 2)}$	$0.1$	5.2	−9.5	13.4	9000
	$0.3$	5.3	−8.9	13.7	9000
	$0.5$	5.5	−8.4	13.9	9000
	$0.7$	5.6	−7.8	14.1	9000
	$0.9$	5.7	−7.3	14.2	9000
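A stratified summary of the kind shown in Table 9 can be produced with a simple grouping step. The Python sketch below uses a handful of made-up records, not study results; the field names are our own, and it assumes the power differences are expressed in percentage points.

```python
from collections import defaultdict


def summarize_by(records, key, value="diff"):
    """Return {level: (mean, min, max, n)} of `value` for each level of `key`."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec[value])
    return {level: (sum(v) / len(v), min(v), max(v), len(v))
            for level, v in sorted(groups.items())}


# Hypothetical records: conjunctive power minus Bonferroni power, in
# percentage points, for a few invented scenarios (NOT study results).
toy = [
    {"K": 4, "diff": 2.0}, {"K": 4, "diff": -1.0}, {"K": 4, "diff": 5.0},
    {"K": 10, "diff": 8.0}, {"K": 10, "diff": 4.0},
]

for level, stats in summarize_by(toy, "K").items():
    print(level, stats)
```

Calling `summarize_by` once per input parameter on the full set of 45,000 scenario records would give one block of rows per parameter, as in the table.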

5.3.5 Conjunctive test vs. single weighted 1-DF test

Similarly, we explored the degree to which the single weighted 1-DF test was more powerful than the conjunctive test. Although the single weighted 1-DF test had the best average power ranking among all the methods, the loss of power incurred by choosing the conjunctive test may be worthwhile given the strength of its hypothesis setup. Table 10 displays the percentage of scenarios where $π^{W1DF} - π^{CONJ}$ was between 0–5%, 5–10%, 10–20%, or >20%, stratified by the values of the 10 input parameters. Though $π^{W1DF} - π^{CONJ} > 20\%$ for the majority of all scenarios, there were some noteworthy findings. For a very small number of clusters in each treatment group ( $K = 4$ ), $π^{W1DF} - π^{CONJ} < 20\%$ about 42% of the time. Similarly, when $ρ_{1}^{(1, 2)}$ was at its highest value of 0.07, $π^{W1DF} - π^{CONJ} < 20\%$ about 51% of the time. Lastly, when $β_{2}^{*} / σ_{2} = β_{1}^{*} / σ_{1}$ , $π^{W1DF} - π^{CONJ} < 20\%$ about 51% of the time. This suggests that under these scenarios, the conjunctive test could be a viable alternative to the most powerful method, the single weighted 1-DF test. When the conjunctive test uses a 1-sided tail as in Comparisons III and IV, the power difference between these two methods is even smaller; these results are shown in the online Supplementary Material (S.3 and S.4).

Table 10.
Numerical evaluation results – proportion of scenarios for ranges of the power difference between the single weighted 1-DF test and conjunctive IU test, summarized for Comparison I (F and t distributions with 2-sided conjunctive test).

Parameter	Value	$π^{W1DF} - π^{CONJ}$: 0% to 5%	$π^{W1DF} - π^{CONJ}$: 5% to 10%	$π^{W1DF} - π^{CONJ}$: 10% to 20%	$π^{W1DF} - π^{CONJ}$: > 20%	# Total Scenarios
$K$	$4$	1%	7%	34%	58%	7500
	$6$	0%	1%	12%	87%	7500
	$8$	0%	1%	7%	92%	7500
	$10$	0%	1%	9%	90%	7500
	$20$	19%	11%	16%	55%	7500
	$30$	42%	10%	8%	40%	7500
$m$	$50$	9%	6%	15%	70%	15,000
	$70$	11%	4%	14%	71%	15,000
	$100$	12%	5%	14%	70%	15,000
$\frac{β_{2}^{*}}{σ_{2}} - \frac{β_{1}^{*}}{σ_{1}}$	$< 0$	25%	5%	16%	54%	9000
	$0$	26%	9%	16%	49%	2250
	$[0.05, 0.19]$	10%	10%	20%	60%	11,250
	$[0.20, 0.29]$	7%	5%	14%	74%	9000
	$[0.30, 0.39]$	1%	2%	12%	86%	6750
	$[0.40, 0.49]$	0%	0%	4%	96%	6750
$ρ_{0}^{(1)}$, $ρ_{0}^{(2)}$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	13%	5%	17%	65%	18,000
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	7%	5%	15%	73%	9000
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	10%	5%	11%	74%	18,000
$ρ_{1}^{(1, 2)}$	$0.005$	10%	4%	7%	80%	9000
	$0.01$	10%	4%	7%	80%	9000
	$0.02$	10%	4%	9%	78%	9000
	$0.05$	10%	5%	21%	64%	9000
	$0.07$	12%	10%	28%	49%	9000
$ρ_{2}^{(1, 2)}$	$0.1$	10%	5%	13%	72%	9000
	$0.3$	10%	5%	14%	71%	9000
	$0.5$	10%	5%	14%	70%	9000
	$0.7$	11%	5%	15%	69%	9000
	$0.9$	11%	6%	16%	68%	9000
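The banding used in Table 10 amounts to classifying each scenario's power difference and tabulating proportions. A minimal Python sketch with invented differences follows; these are not study results, and the treatment of values falling exactly on a band boundary is an assumption, as it is not specified here.

```python
BANDS = ["0% to 5%", "5% to 10%", "10% to 20%", "> 20%"]


def difference_band(diff):
    """Assign a power difference (percentage points) to one of the four bands.
    Assumption: a value exactly on a boundary goes to the lower band."""
    if diff <= 5:
        return "0% to 5%"
    if diff <= 10:
        return "5% to 10%"
    if diff <= 20:
        return "10% to 20%"
    return "> 20%"


def band_proportions(diffs):
    """Proportion of scenarios falling in each band."""
    counts = {band: 0 for band in BANDS}
    for d in diffs:
        counts[difference_band(d)] += 1
    return {band: counts[band] / len(diffs) for band in BANDS}


# Hypothetical differences (single weighted 1-DF power minus conjunctive
# power, in percentage points) for eight invented scenarios.
toy_diffs = [1.2, 4.8, 7.5, 12.0, 18.3, 25.0, 31.4, 40.2]
print(band_proportions(toy_diffs))
```

Applying `band_proportions` separately to the scenarios at each level of an input parameter gives one row of the table per level.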

5.3.6 Comparisons II, III, and IV

Overall, the results did not change much between the different comparisons. The cases in which the combined outcomes approach, the single weighted 1-DF test, and the disjunctive 2-DF test perform best remained largely the same, and are exactly equivalent the majority of the time. This is consistent for both the unstandardized and standardized results. We conclude that the choice of distribution (χ²-distribution vs. F-distribution, and MVN-distribution vs. t-distribution) does not substantially affect which methods are most powerful. Similarly, whether the conjunctive IU test is 2-sided or 1-sided does not change the fact that it is generally more powerful than the Bonferroni p-value adjustment method. These results are available in the Supplementary Material (S.2–S.4).

5.3.7 Summary of results

Overall, the combined outcomes approach, single weighted 1-DF test, and disjunctive 2-DF test had the highest statistical power across the 45,000 design input scenarios. No method is globally most powerful, but we showed that the p-value adjustment methods are always less powerful than the combined outcomes approach and single weighted 1-DF test. When the outcome-specific ICCs and the variances of the two outcomes are the same, the combined outcomes approach is equivalent to the single weighted 1-DF test. Based on the average power rankings for Comparison I, the methods in order from most to least powerful are: the single weighted 1-DF test, disjunctive 2-DF test, combined outcomes approach, conjunctive IU test, D/AP p-value adjustment, Sidak p-value adjustment, and Bonferroni p-value adjustment.

Treatment effects, outcome variances, and outcome-specific ICCs tend to have the largest impact on statistical power. The choice of reference distribution for hypothesis tests did not substantially change which design methods were more or less powerful, though we note that the choice of reference distribution could potentially affect the type I error rate, particularly in scenarios with a small number of clusters; future work could examine this. Table 8 summarizes the numerical study findings from Comparison I based on both the standardized and unstandardized effect sizes, specifying the cases in which a method has the highest power 50–80% of the time and the cases in which it has the highest power >80% of the time. In general, we concluded that when the treatment effect sizes are the same, or when the standardized treatment effects are the same or close, the single weighted 1-DF test tends to have the highest power. When the treatment effects are different, the disjunctive 2-DF test tends to perform better, and as the difference between the standardized treatment effects increases, the disjunctive 2-DF test is increasingly likely to yield higher power than the remaining methods. We found that the conjunctive test is more powerful than the Bonferroni method in the vast majority of scenarios. We also examined the degree to which the conjunctive test is less powerful than the most powerful method (single weighted 1-DF test). Table 11 summarizes the results of both the mathematical comparisons and the numerical study by showing the relationship between each pairing of study design methods.

Table 11.
Matrix comparison of all statistical design methods.

	PADJ Sidak	PADJ D/AP	COMB	W1DF	DIS2DF	CONJ
PADJ Bonferroni	$π_{Bonf}^{PADJ} < π_{Sidak}^{PADJ}$ globally	$π_{Bonf}^{PADJ} < π_{D / AP}^{PADJ}$ globally	$π_{Bonf}^{PADJ} < π^{COMB}$ globally	$π_{Bonf}^{PADJ} < π^{W 1 DF}$ globally	$π_{Bonf}^{PADJ} < π^{DIS 2 DF}$ generally	$π_{Bonf}^{PADJ} < π^{CONJ}$ generally
PADJ Sidak		$π_{Sidak}^{PADJ} < π_{D / AP}^{PADJ}$ globally	$π_{Sidak}^{PADJ} < π^{COMB}$ globally	$π_{Sidak}^{PADJ} < π^{W 1 DF}$ globally	$π_{Sidak}^{PADJ} < π^{DIS 2 DF}$ generally	Context dependent
PADJ D/AP			$π_{D / AP}^{PADJ} < π^{COMB}$ globally	$π_{D / AP}^{PADJ} < π^{W 1 DF}$ globally	$π_{D / AP}^{PADJ} < π^{DIS 2 DF}$ generally	Context dependent
COMB				If $ρ_{0}^{(1)} = ρ_{0}^{(2)}$ and $σ_{1}^{2} = σ_{2}^{2}$ , then $π^{COMB} = π^{W 1 DF}$ globally	Context dependent	$π^{CONJ} < π^{COMB}$ generally
W1DF					If $\frac{β_{1} }{σ_{1} \sqrt{V I F_{1}}} = \frac{β_{2} }{σ_{2} \sqrt{V I F_{2}}}$ , then $π^{DIS 2 DF} < π^{W 1 DF}$ globally	$π^{CONJ} < π^{W 1 DF}$ generally
DIS2DF						$π^{CONJ} < π^{DIS 2 DF}$ generally

* “Globally”—proven theoretically; holds true for all parameter values. “Generally”—dominant in most numerical scenarios; exceptions exist. “Context-dependent”—dominance switches across the parameter space; no single method is uniformly superior.
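The global ordering among the three p-value adjustment methods in Table 11 can be illustrated through their per-test significance levels. The following is a minimal sketch, not the crt2power implementation: it assumes two outcomes, uses the standard Bonferroni and Sidak formulas, and takes the D/AP level to be $1 - (1 - α)^{1/Q^{1 - r}}$ for $Q$ outcomes with between-outcome correlation $r$. The function names and the example correlation of 0.5 are hypothetical. A larger per-test level lowers the critical value of each test and therefore raises power.

```python
def bonferroni_alpha(alpha, n_outcomes=2):
    """Per-test level under Bonferroni: alpha divided by the number of outcomes."""
    return alpha / n_outcomes


def sidak_alpha(alpha, n_outcomes=2):
    """Per-test level under Sidak: 1 - (1 - alpha)^(1 / Q)."""
    return 1 - (1 - alpha) ** (1 / n_outcomes)


def dap_alpha(alpha, outcome_corr, n_outcomes=2):
    """Per-test level under D/AP (assumed form): 1 - (1 - alpha)^(1 / Q^(1 - r))."""
    return 1 - (1 - alpha) ** (1 / n_outcomes ** (1 - outcome_corr))


if __name__ == "__main__":
    alpha = 0.05
    outcome_corr = 0.5  # hypothetical correlation between the two outcomes
    print("Bonferroni:", bonferroni_alpha(alpha))
    print("Sidak:     ", sidak_alpha(alpha))
    print("D/AP:      ", dap_alpha(alpha, outcome_corr))
```

Under this form, the D/AP level coincides with the Sidak level when the correlation is zero and approaches the unadjusted $α$ as the correlation approaches one, so the strict ordering in the sketch relies on a positive correlation.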

6. Discussion

In this article, we examined the performance of several valid design methods for hybrid type 2 CRTs with continuous co-primary outcomes: the p-value adjustment methods, combined outcomes approach, single weighted 1-DF test, disjunctive 2-DF test, and the conjunctive test. We first compared the power equations theoretically. We proved that the p-value adjustment methods are globally less powerful than the combined outcomes approach and the single weighted 1-DF test. We also showed that the non-centrality parameter for the p-value adjustment methods is always smaller than the non-centrality parameter for the disjunctive 2-DF test, but because the degrees of freedom differ, there are theoretical cases where the p-value adjustment methods can result in higher power than the disjunctive 2-DF test. In addition, when the cluster-corrected standardized effect sizes of the first and second outcomes are equal, the single weighted 1-DF test is more powerful than the disjunctive 2-DF test. Lastly, when the outcome-specific ICCs and variances are the same for the two outcomes, the combined outcomes approach is theoretically equivalent to the single weighted 1-DF test.³
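The relationship between the single weighted 1-DF test and the disjunctive 2-DF test can be explored numerically. The sketch below is illustrative only and makes several assumptions that are not taken from the power equations in this article: chi-square (asymptotic) reference distributions, equal weights in the 1-DF test, standardized non-central means `d1` and `d2` for the two test statistics, and a correlation `r` between them. Under these assumptions the non-centrality parameters are $(d_1 + d_2)^2 / \{2(1 + r)\}$ for the 1-DF test and $(d_1^2 - 2 r d_1 d_2 + d_2^2)/(1 - r^2)$ for the 2-DF test. All function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def ncp_weighted_1df(d1, d2, r):
    """Non-centrality of an equally weighted sum of two correlated z-statistics."""
    return (d1 + d2) ** 2 / (2 * (1 + r))


def ncp_disjunctive_2df(d1, d2, r):
    """Non-centrality of a 2-DF Wald-type test: d' R^{-1} d with correlation r."""
    return (d1 ** 2 - 2 * r * d1 * d2 + d2 ** 2) / (1 - r ** 2)


def power_chi2_1df(ncp, alpha=0.05):
    """Power of a 1-DF chi-square test, written as a two-sided z-test."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(ncp)
    return (1 - _STD_NORMAL.cdf(z_crit - shift)) + _STD_NORMAL.cdf(-z_crit - shift)


def power_chi2_2df(ncp, alpha=0.05, terms=200):
    """Power of a 2-DF chi-square test via a Poisson mixture of central chi-squares.

    The series is truncated at `terms`, which is ample for moderate ncp values.
    """
    half_crit = -math.log(alpha)          # the chi-square(2) critical value is -2*ln(alpha)
    half_ncp = ncp / 2
    poisson_weight = math.exp(-half_ncp)  # P(J = 0) for J ~ Poisson(ncp / 2)
    term = math.exp(-half_crit)
    central_sf = term                     # survival of central chi-square(2) at the critical value
    power = poisson_weight * central_sf
    for j in range(1, terms):
        poisson_weight *= half_ncp / j
        term *= half_crit / j
        central_sf += term                # survival of central chi-square(2 + 2j)
        power += poisson_weight * central_sf
    return power


if __name__ == "__main__":
    for d1, d2 in [(2.0, 2.0), (3.0, 0.0)]:
        r = 0.1  # hypothetical correlation between the two test statistics
        p1 = power_chi2_1df(ncp_weighted_1df(d1, d2, r))
        p2 = power_chi2_2df(ncp_disjunctive_2df(d1, d2, r))
        print(f"d1={d1}, d2={d2}: 1-DF power={p1:.3f}, 2-DF power={p2:.3f}")
```

With equal `d1` and `d2` the two non-centrality parameters coincide, so the test with fewer degrees of freedom has the higher power; with one effect set to zero, the 2-DF non-centrality parameter is larger and the ordering reverses in this sketch.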

In the numerical evaluation, we explored comparisons that could not be obtained theoretically. We conducted four comparisons that differed in the choice of reference distribution and in whether the conjunctive test used a 1-tailed or 2-tailed hypothesis. Results for all comparisons are given in the Supplementary Material, and we discussed in depth the results from Comparison I, which used a 2-tailed conjunctive IU test with the t-distribution and an F-distribution for all other methods. We found that the treatment effects, outcome variances, and outcome-specific ICCs had the largest effect on power and on which methods were more powerful than others. The combined outcomes approach, single weighted 1-DF test, and disjunctive 2-DF test had the most power, while the p-value adjustment methods and conjunctive IU test never yielded the highest power. No method was found to be globally most powerful. In general, the disjunctive 2-DF test did well when the treatment effects were not equal, while the single weighted 1-DF test did well when the treatment effects were equal. We also quantified the extent to which the conjunctive test was more powerful than the popular Bonferroni p-value adjustment method, and compared it with the most powerful method, the single weighted 1-DF test.

We chose to conduct a numerical evaluation because exact expressions are available for the quantities under consideration. The results can be interpreted as those that hold when asymptotic inference is valid. It is beyond the scope of this paper to investigate the relative finite-sample performance of these methods, although such a study could be of further interest. Though our numerical evaluation covered 45,000 scenarios that included a wide range of input values for each parameter, and the additional analysis allowed each individual parameter to vary over an even wider range of values, there are important limitations. We did not examine different values of the overall family-wise false positive rate, and limited our scope to $α = 0.05$. We also did not consider differing treatment allocation ratios, and looked only at equal allocation (though our R package does accommodate unequal treatment allocation ratios). Future work could expand on these results and look at other input scenarios. Furthermore, we note that these results are restricted to CRTs with two continuous co-primary outcomes. Though these methods have not formally been explored for the case of two binary co-primary outcomes, it is possible to use them to approximate study design specifications for binary outcomes. In fact, in Owen et al., a common approximation is used to obtain the variance for the difference between two binomial proportions from independent data. This is then used instead of the variance for the difference between two means and accounts for clustering.¹⁸ Further work is needed to formally extend and examine these methods for binary data. Work is also needed to generalize these methods and their comparisons to stepped wedge hybrid type 2 designs.¹⁹ Despite these limitations, our theoretical comparisons and numerical studies shed light on important relationships between the study design methods for hybrid type 2 studies, allowing strong conclusions to be made.
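The binary-outcome approximation mentioned above can be sketched as follows. This is one plausible reading rather than the exact calculation in Owen et al.: it assumes equal cluster sizes, a common ICC in both arms, and that clustering is handled by multiplying the independent-data variance by the usual design effect $1 + (m - 1)ρ$. The function and argument names are hypothetical.

```python
def var_diff_proportions_crt(p_control, p_treat, n_clusters_per_arm, cluster_size, icc):
    """Approximate variance of the difference between two proportions in a CRT.

    Assumes equal cluster sizes and a common ICC; the independent-data binomial
    variance is inflated by the design effect 1 + (m - 1) * icc.
    """
    n_per_arm = n_clusters_per_arm * cluster_size
    independent_var = (
        p_control * (1 - p_control) + p_treat * (1 - p_treat)
    ) / n_per_arm
    design_effect = 1 + (cluster_size - 1) * icc
    return independent_var * design_effect


if __name__ == "__main__":
    # Hypothetical inputs: 10 clusters per arm, 20 participants per cluster.
    print(var_diff_proportions_crt(0.30, 0.45, 10, 20, icc=0.05))
```

The resulting quantity would take the place of the variance of the difference in means in the power calculations for continuous outcomes; how well this approximation performs for each design method remains to be studied.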

Table 8 shows the scenarios in which each method has the highest power, and serves as a guide for researchers deciding which methods to consider when designing their own hybrid type 2 CRT. Though certain methods yielded higher power, we do not dismiss any of the methods altogether, since any of them could be the most fitting depending on the study and its research goals. We therefore encourage the reader to make use of the crt2power R package that we introduced,⁴ or the user-friendly crt2powerApplication ShinyApp. The R package is currently available on CRAN and allows the user to apply any of the methods to calculate K, m, or power for their own studies; the ShinyApp that uses this package is available at the link provided earlier. This software package and accompanying application allow the user to make an informed decision about which method fits their research goals and needs. Together, the theoretical comparisons, numerical evaluation, and software make it possible to better understand the performance of these methods, contributing to the knowledge base of hybrid type 2 studies, and of CRTs with continuous co-primary endpoints more generally.

Supplemental Material

Supplemental material, sj-docx-1-smm-10.1177_09622802261457277 for A comparison of methods for designing hybrid type 2 cluster-randomized trials with continuous effectiveness and implementation endpoints by Melody A Owen, Fan Li, Ruyi Liu and Donna Spiegelman in Statistical Methods in Medical Research

Footnotes

Acknowledgments

This publication was made possible by CTSA Grant Number TL1 TR001864 from the National Center for Advancing Translational Science (NCATS), a component of the National Institutes of Health (NIH). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of NIH.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Training in Implementation Science Research Methods, National Center for Advancing Translational Sciences (grant numbers T32HL155000, TL1 TR001864).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Turner, Gallis, et al. Review of recent methodological developments in group-randomized trials: part 1-design. Am J Public Health 2017; 107: 907–915.
2. Curran, Bauer, Mittman, et al. Effectiveness-implementation hybrid designs: combining elements of clinical effectiveness and implementation research to enhance public health impact. Med Care 2012; 50: 217–226.
3. Owen, Curran, Smith, et al. Power and sample size calculations for cluster randomized hybrid type 2 effectiveness-implementation studies. Stat Med 2025; 44: e70015.
4. Owen. crt2power: Designing Cluster-Randomized Trials with Two Co-Primary Outcomes, https://cran.r-project.org/web/packages/crt2power/ (2024).
5. Yang, Moerbeek, Taljaard, et al. Power analysis for cluster randomized trials with continuous coprimary endpoints. Biometrics 2022; 79: 1293–1305.
6. Donner, Klar. Design and analysis of cluster randomization trials in health research. London, Ontario, Canada: Wiley, 2000192.
7. Rupert. Simultaneous statistical inference. Springer Science, 2012.
8. Sidak. Rectangular confidence regions for the means of multivariate normal distributions. J Am Stat Assoc 1967; 62: 626–633.
9. Sankoh, Hueque, Dubey. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Stat Med 1997; 16: 2529–2542.
10. Vickerstaff, Omar, Ambler. Methods to adjust for multiple comparisons in the analysis and sample size calculation of randomised controlled trials with multiple primary outcomes. BMC Med Res Methodol 2019; 19: 129.
11. Dmitrienko, Bretz, Westfall, et al. Multiple testing methodology. Taylor and Francis Group, LLC, 2010, pp.35–97.
12. Pocock, Geller, Tsiatis. The analysis of multiple endpoints in clinical trials. Biometrics 1987; 43: 487–498.
13. O’Brien. Procedures for comparing samples with multiple endpoints. Biometrics 1984; 40: 1079–1087.
14. Kahan, Forbes, Ali, et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials 2016; 17: 38.
15. Abbott, Schroder, Enthoven, et al. Effectiveness of implementing a best practice primary healthcare model for low back pain (BetterBack) compared with current routine care in the Swedish context: an internal pilot study informed protocol for an effectiveness-implementation hybrid type 2 trial. BMJ Open 2018; 8: e019906.
16. Clemson, Mackenzie, Roberts, et al. Integrated solutions for sustainable fall prevention in primary care, the iSOLVE project: a type 2 hybrid effectiveness-implementation design. Implement Sci 2017; 12: 12.
17. Galaviz, Estabrooks, Ulloa, et al. Evaluating the effectiveness of physician counseling to promote physical activity in Mexico: an effectiveness-implementation hybrid study. Transl Behav Med 2017; 7: 731–740.
18. Hussey, Hughes. Design and analysis of stepped wedge cluster randomized trials. Contemp Clin Trials 2007; 28: 182–191.
19. Davis-Plourde, Taljaard. Power analyses for stepped wedge designs with multivariate continuous outcomes. Stat Med 2023; 42: 559–578.

		Proportion of scenarios where the method was most powerful
$\frac{β_{2}^{}}{σ_{2}} - \frac{β_{1}^{}}{σ_{1}}$	$(ρ_{0}^{(1)}, ρ_{0}^{(2)})$	Combined outcomes	Combined outcomes = Single 1-DF	Single 1-DF	Disjunctive 2-DF	# Total Scenarios
$< 0$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	0%	10%	78%	39%	3600
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	8%	94%	12%	1800
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	21%	21%	67%	16%	3600
$0$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	0%	14%	89%	15%	900
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	100%	0%	0%	450
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	0%	14%	89%	15%	900
$[0.05, 0.19]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	38%	10%	54%	8%	4500
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	45%	23%	31%	7%	2250
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	43%	6%	31%	33%	4500
$[0.20, 0.29]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	38%	8%	41%	26%	3600
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	16%	23%	40%	30%	1800
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	6%	5%	40%	65%	3600
$[0.30, 0.39]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	14%	3%	38%	54%	2700
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	5%	29%	69%	1350
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	0%	0%	22%	87%	2700
$[0.40, 0.49]$	$ρ_{0}^{(1)} < ρ_{0}^{(2)}$	0%	0%	13%	87%	2700
	$ρ_{0}^{(1)} = ρ_{0}^{(2)}$	0%	0%	8%	92%	1350
	$ρ_{0}^{(1)} > ρ_{0}^{(2)}$	0%	0%	8%	97%	2700