A Deep Learning Framework for Causal Inference in Clinical Trial Design: The Clinical Trials Uncovering Real Efficacy Artificial Intelligence Large Clinicogenomic Foundation Model

Abstract

Background:

Clinical research is limited by the capability to define the most important combinations of clinical features and biomarkers that predict therapeutic benefit. Here, we introduce Clinical trials Uncovering Real Efficacy Artificial Intelligence (CURE AI), a novel deep learning framework designed to predict individual patient benefit from a new therapeutic intervention compared to a standard of care.

Methods:

CURE AI utilizes a large clinicogenomic foundation model to understand the complex relationships between the vast clinical and multiomic features in clinical trial data. To build CURE AI, we trained a proprietary foundation model based on a deep learning architecture and training schema using a large collection of clinical and multiomics datasets. Using CURE AI, we seek to understand the complex interplay between clinicogenomic information from clinical trial arms to predict the magnitude of therapeutic benefit on the individual patient level.

Results:

In this article, we fine-tuned the CURE AI foundation model on lung cancer data from the OAK non-small cell lung cancer clinical trial. We observed that, had CURE AI been used to guide enrollment, the trial could have reached significance for progression-free survival (PFS) with fewer than half of the patients enrolled (p = 0.60 for all included patients vs. p < 0.05 for the CURE AI-selected subgroup). The fine-tuned CURE AI (termed CURE Lung Cancer) demonstrated direct generalizability on a held-out independent clinical trial dataset, the POPLAR trial, by converting an insignificant PFS endpoint to significance while also including the majority of patients (88%; p = 0.21 to p < 0.05).

Conclusion:

In summary, we developed a causally aware clinicogenomic deep learning platform that can learn to predict individualized patient benefit of investigational therapy compared to an existing standard of care. Because we use a foundation model trained on readily measurable patient characteristics, CURE AI can be applied to a variety of scientific and clinical uses, including adaptive clinical trials, toxicity prediction, treatment response prediction, and understanding of drug resistance and response mechanisms.

Introduction

Clinical research is hampered by limitations in identifying the most important combinations of clinical and biological features that can predict therapeutic response. New biomarker discovery is crucial to help define patient populations eligible for therapies. When large-scale molecular profiling of patients (e.g., transcriptomic or proteomic) is carried out, as in OAK, POPLAR, and other recent oncology clinical trials, retrospective data analysis following completion of clinical trials can result in the development of gene signatures that can be predictive of patient response.^1–5 However, traditional methods to identify actionable biomarkers are time-consuming, expensive, mostly prognostic, and often fail to provide actionable insights.

To identify patterns within high-dimensional multiomics data, artificial intelligence algorithms have been developed to assist in understanding basic patterns within the data.^6–10 At the more advanced level, complex neural networks could, in theory, identify nonlinear complex relationships of clinical and biological information that can predict treatment benefit, which cannot be achieved by classical methods but could be made feasible by the assistance of advanced AI models.¹¹ For this reason, we sought to develop a large clinicogenomic foundation model (LCGM) that can leverage the complexity of high-dimensional data (clinical and multiomic) to predict treatment benefit. If the model can learn complex biological relationships in the foundation model, then the LCGM should be able to identify patterns that predict treatment benefit when applied to new datasets.

Our technology utilizes a deep learning architecture and training schema to predict the benefit of a clinical trial intervention, which we define as the benefit of a therapeutic intervention (experimental arm) compared to the standard of care (control arm) for each patient. The name of our LCGM is Clinical trials Uncovering Real Efficacy Artificial Intelligence, or CURE AI^TM. CURE AI predicts which patients benefit more from an investigational therapy than standard of care at the individual patient level based on their baseline clinicogenomic profile, which can be used to inform clinical trial eligibility. Identifying patient populations more likely to benefit from treatment, a process known as predictive enrichment, can contribute to the development of robust and reliable patient enrichment strategies.^6,12 CURE AI enables predictive enrichment through enhanced classification of true responders and true nonresponders to therapy. Accurately defining eligibility criteria is a key part of ensuring that patients are offered therapy that they will benefit from, which is an important consideration for new drug approval.^13–15

CURE AI tackles two major bottlenecks in clinical drug development. The first problem is data scarcity (limited patient numbers and data missingness), combined with the high dimensionality of clinical data. Clinical trial data often present challenges due to limited patient numbers that cannot be easily pooled into larger databases, as well as a vast amount of data points per patient (>1,000,000 measurements in phase 3 trials with omics data).^16,17 The CURE AI foundation model utilizes a self-supervised learning schema to not only identify the most important clinicogenomic variables but also to learn the connections and patterns between these variables on a much larger scale using our LCGM trained on a large corpus of patient data. The second challenge is sorting out the differences in prognostic and predictive markers. Current methods in clinical trials often find correlations between variables but are unable to assess all available clinical and multiomic data to pinpoint patterns within that data that explain why drugs benefit certain patient groups but not others.¹⁸ Inability to accurately identify predictive factors can result in misinterpretation of trial outcomes, with immortal time bias being a well-known example.¹⁹ For instance, a patient’s functional status (prognostic factor) might be mistakenly interpreted to drive their health outcome to a new drug, leaving the true predictive factors undiscovered. The difficulty lies in predicting the benefit that an individual would receive from both the investigational therapy and the standard of care, which allows for a prediction of the magnitude of excess benefit derived from the investigational therapy. We need to be able to predict responses to both trial therapies to design and interpret the most personalized and effective clinical trials.

To achieve this, a perfect clinical trial dataset would contain every patient’s response to both trial therapies (randomizing each patient to both trial arms), which would allow measurement of the real benefit of the investigational therapy over the existing standard of care at the individual patient level. Since it is not possible to administer different therapies to a patient in parallel, we developed CURE AI to predict this difference for each patient based on accurate comparisons of their clinical and genomic characteristics to similar patients in what can be thought of as an advanced implementation of causal inference.^20,21 In this article, we validate the premise of CURE AI to guide patient therapy selection in two completed advanced nonsmall cell lung cancer randomized clinical trials, using one as a validation set and the other as a held-out test set. We show how CURE AI can guide clinical trial interpretation and design by AI-guided prediction of therapeutic benefit.

Methods

CURE AI development

CURE AI is a deep learning technology developed by Numenos that consists of two main parts: (1) a base neural architecture foundation model trained on a large cohort of clinicogenomic datasets, including multiomic and clinical data (e.g., RNA sequencing, age, gender, laboratory values, clinical diagnoses), which are filtered through novel attention-based layers and used for model training with a proprietary self-supervised learning schema with multiple loss functions,²² and (2) a proprietary training process that fine-tunes CURE AI on indication-specific clinicogenomic data for the specific downstream task of predicting treatment benefit, using k-fold cross-validation (CV)²³ and out-of-fold (OOF) predictions. The CURE AI foundation model is trained on a very large cohort of patients with diverse pathologies, including all major cancer types.

The CURE AI LCGM was fine-tuned on the progression-free survival (PFS) outcome using a stratified random selection of four-fifths of the OAK trial patients for whom RNA sequencing was available. The model was then used to predict benefit for the held-out one-fifth of patients (the validation fold). This process was repeated five times in total, so that the features that predict treatment benefit were defined on the entire included OAK trial dataset, to create the fine-tuned CURE Lung Cancer^TM model. A schematic of the CURE AI training model is shown in Figure 1.
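The five-fold procedure can be illustrated with a minimal sketch. It is not the proprietary CURE AI training process: the `fit` and `predict` callables, the function names, and the choice of a binary label as the stratification variable are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_folds(strata, k=5, seed=0):
    """Assign each patient to one of k folds so every stratum is spread evenly."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)
    fold_of = [0] * len(strata)
    position = 0  # running counter keeps fold sizes even across strata
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)
        for idx in members:
            fold_of[idx] = position % k
            position += 1
    return fold_of

def out_of_fold_predictions(rows, strata, fit, predict, k=5, seed=0):
    """Score every patient with a model fitted without that patient's fold."""
    fold_of = stratified_folds(strata, k, seed)
    oof = [None] * len(rows)
    for fold in range(k):
        train_rows = [r for r, f in zip(rows, fold_of) if f != fold]
        model = fit(train_rows)  # fitted on the other four-fifths only
        for i, f in enumerate(fold_of):
            if f == fold:
                oof[i] = predict(model, rows[i])
    return oof, fold_of

# Toy stand-ins (hypothetical): the "model" is just the mean of the training rows.
rows = [float(i) for i in range(20)]
strata = [i % 2 for i in range(20)]  # e.g., a binary outcome label
oof, fold_of = out_of_fold_predictions(
    rows, strata,
    fit=lambda train: sum(train) / len(train),
    predict=lambda model, row: model,
)
```

Each patient's prediction comes from a model that never saw that patient, which is the property that makes OOF predictions usable for validation.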

Fig. 1.

Outline of CURE AI development, fine-tuning, and validation testing. The CURE AI foundation model was trained on vast clinical and multiomic datasets to learn and understand the interplay between clinical and multiomic features. CURE AI was applied to the OAK clinical trial to fine-tune the model for training (fine-tuning CURE AI into CURE Lung Cancer using conditional average treatment effect modeling) and cross-validation (testing how well CURE AI predicts outcomes on OAK). CURE Lung Cancer was then validation tested on the completely held-out POPLAR clinical trial dataset. The CURE Index is created as an output from the models, from which we define CURE AI-informed simulated trial enrollment. CURE AI, Clinical trials Uncovering Real Efficacy Artificial Intelligence.

CURE Score^TM and CURE Index^TM evaluation of individual patient benefit

CURE AI calculates a counterfactual outcome for each patient to predict how each patient would have responded to the trial therapy that they did not receive.²⁴ This individualized patient benefit prediction is called the CURE Score. The estimate of the benefit of investigational therapy over standard of care in a specific trial is based on fine-tuning CURE AI with conditional average treatment effect (CATE) modeling.²⁵ The CURE Score is calculated by assessing the difference between the predicted treatment outcomes. A positive score favors investigational therapy, and a negative score favors the standard of care.
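As a minimal illustration of the sign convention, the score can be written as the difference between two predicted outcomes for the same patient. The sketch assumes a higher-is-better outcome (for example, predicted PFS in months); the function names and the numbers are hypothetical and do not come from the trials.

```python
def benefit_score(pred_experimental, pred_control):
    """Predicted outcome on the experimental arm minus predicted outcome on control."""
    return pred_experimental - pred_control

def favored_arm(score):
    """Interpret the sign of the score."""
    if score > 0:
        return "investigational"
    if score < 0:
        return "standard of care"
    return "neither"

# Hypothetical (experimental, control) predictions for three patients.
predictions = [(6.0, 4.0), (3.0, 5.0), (9.5, 9.5)]
scores = [benefit_score(e, c) for e, c in predictions]
arms = [favored_arm(s) for s in scores]
```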

The ordered ranking of each patient within a clinical trial by their CURE Score is called the CURE Index. The CURE Index is used to define subgroups of patients. We create a Cox proportional hazard model, with the treatment arm being the only covariate for each subgroup. Because of randomization, benefit prediction is equally distributed between the trial arms, allowing a balanced dataset to train the Cox proportional hazard model and measure treatment effect using the hazard ratio for the treatment covariate for that subgroup. The log-rank test is used to generate the p-value for therapeutic benefit. Thus, the treatment effect is based on the hazard ratio, and significance is based on the p-value.
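The subgroup test can be sketched with a plain two-group log-rank test. The hazard ratio below is Peto's one-step approximation derived from the log-rank quantities. It is swapped in here to keep the example dependency-free and is not a Cox proportional hazards fit. All function names are illustrative.

```python
import math

def logrank(times, events, in_experimental):
    """Two-group log-rank test.

    Returns (p_value, observed_minus_expected, variance), where the observed
    and expected event counts refer to the experimental arm.
    """
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, ev in zip(times, events) if ev}):
        n = sum(1 for ti in times if ti >= t)  # at risk, both arms
        n1 = sum(1 for ti, g in zip(times, in_experimental) if ti >= t and g)
        d = sum(1 for ti, ev in zip(times, events) if ti == t and ev)
        d1 = sum(1 for ti, ev, g in zip(times, events, in_experimental)
                 if ti == t and ev and g)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0, o_minus_e, var
    chi2 = o_minus_e ** 2 / var
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2)), o_minus_e, var

def peto_hazard_ratio(o_minus_e, var):
    """One-step approximation to the hazard ratio (experimental vs. control)."""
    return math.exp(o_minus_e / var)

# Toy data: four patients, all with an observed event.
p, o_minus_e, var = logrank([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0])
hr = peto_hazard_ratio(o_minus_e, var)
```

A hazard ratio below 1 favors the experimental arm under this convention. In the toy data the experimental patients progress first, so the ratio is above 1.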

The CURE Curve

We developed the CURE Curve as the aggregated p-value plot of all patients included in the order of the CURE Index. The CURE Curve presents CURE AI results and allows for straightforward identification of the patients who benefit more from either the investigational therapy or the standard of care. We enroll each patient based on their order in the CURE Index, starting from the highest CURE Score. For each additional patient enrolled (who received either atezolizumab or docetaxel), a p-value is calculated to assess whether we can reject the null hypothesis that there are no differences in outcome between the treatment groups. If we observe significance in CURE Index-defined subsets (p < 0.05), then CURE AI successfully identified patient characteristics that predict better response to atezolizumab.

To compare the CURE Curve to a random baseline, we simulated trial enrollment in which patients were ranked randomly rather than by the CURE Index. A p-value is calculated for the random enrollment, which is performed 1000 times. The p-value line for random patient enrollment is presented as its mean value with 1 standard deviation on each side of the mean. At every point where the CURE Curve lies above the upper edge of this band (i.e., the p-value is lower for CURE Indexed patients than for randomly enrolled patients), CURE AI is significantly outperforming random trial enrollment.
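The curve and its random baseline can be sketched as follows. The log-rank test follows the description above. The `min_patients` parameter, the function names, and the use of the population standard deviation are assumptions, as is the reading that "outperforming" means a p-value below the mean minus 1 standard deviation.

```python
import math
import random
import statistics

def logrank_p(times, events, arm):
    """Two-group log-rank p-value (chi-square with 1 degree of freedom)."""
    o_minus_e, var = 0.0, 0.0
    for t in sorted({ti for ti, ev in zip(times, events) if ev}):
        n = sum(1 for ti in times if ti >= t)
        n1 = sum(1 for ti, a in zip(times, arm) if ti >= t and a == 1)
        d = sum(1 for ti, ev in zip(times, events) if ti == t and ev)
        d1 = sum(1 for ti, ev, a in zip(times, events, arm)
                 if ti == t and ev and a == 1)
        o_minus_e += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    return math.erfc(math.sqrt(o_minus_e ** 2 / var / 2))

def enrollment_curve(order, times, events, arm, min_patients=10):
    """p-value after enrolling the first m patients of `order`, for each m."""
    curve = []
    for m in range(min_patients, len(order) + 1):
        idx = order[:m]
        curve.append(logrank_p([times[i] for i in idx],
                               [events[i] for i in idx],
                               [arm[i] for i in idx]))
    return curve

def random_baseline(times, events, arm, n_iter=1000, min_patients=10, seed=0):
    """Mean and standard deviation of the curve under random enrollment order."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_iter):
        order = list(range(len(times)))
        rng.shuffle(order)
        curves.append(enrollment_curve(order, times, events, arm, min_patients))
    means = [statistics.mean(col) for col in zip(*curves)]
    sds = [statistics.pstdev(col) for col in zip(*curves)]
    return means, sds

def outperforms_random(curve, means, sds):
    """True wherever the ranked curve has a p-value below mean minus 1 SD."""
    return [c < m - s for c, m, s in zip(curve, means, sds)]
```

Here `order` would be the patient indices sorted from highest to lowest score. This pure-Python version is slow for hundreds of patients and 1000 iterations, so it is meant to show the logic rather than to be used at scale.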

Because the CURE Score is based on baseline information, CURE Scores are directly comparable for patients on similar clinical trials. We can therefore compare groups of patients across different trials (comparing OAK patients to POPLAR patients), based on each patient’s CURE Score.

The CURE Curve generates “natural subgroups” of patients with similar predicted benefit values. We selected three representative points along the OAK and POPLAR trial CURE Curves that represent two possible inclusion thresholds generated by CURE AI (2 thresholds plus the entire dataset for each trial). For the OAK trial, we selected the top 35% and top 52% of CURE Score-defined patients. The 35% threshold on the OAK trial was selected because it is the first point at which the CURE Curve demonstrated significance with p < 0.05, corresponding to the top 44% of patients on the POPLAR trial. The 52% threshold on the OAK trial was chosen because it corresponds to the most inclusive significant threshold on the POPLAR trial (including 88% of patients on the POPLAR trial).
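Mapping a percentile threshold from one trial to another through a shared score cutoff can be sketched as below. The function names and the toy scores are hypothetical, and rounding the number of included patients up is an assumption rather than a stated rule.

```python
import math

def score_cutoff(scores, top_fraction):
    """Score of the last patient included when enrolling the top fraction."""
    ranked = sorted(scores, reverse=True)
    n_included = max(1, math.ceil(top_fraction * len(ranked)))
    return ranked[n_included - 1]

def fraction_at_or_above(scores, cutoff):
    """Share of patients in another trial whose score meets the same cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

# Hypothetical scores for two trials scored by the same fine-tuned model.
trial_a = [0.9, 0.7, 0.5, 0.3, 0.1, -0.1, -0.3, -0.5, -0.7, -0.9]
trial_b = [0.8, 0.6, 0.4, 0.2, -0.6]

cutoff = score_cutoff(trial_a, 0.5)                 # top half of trial A
share_b = fraction_at_or_above(trial_b, cutoff)     # matching share of trial B
```

In the toy data the same cutoff covers a larger share of the second trial, which mirrors how one score threshold can correspond to different percentiles in OAK and POPLAR.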

CURE Permutation Test^TM

We developed a permutation test as an additional assessment of confidence in CURE AI. The permutation test involves randomizing the outcome of each patient so that the treatment outcome is no longer correctly associated with the same patient. This is done by randomizing PFS independently within each trial arm so that the distribution of outcomes per trial arm remains the same. Permutated data from OAK and POPLAR were independently analyzed by CURE AI as described above. If the permutated data lead to loss of significance (loss of the relationship between clinical variables and outcome), then we have more confidence in our model.
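The within-arm shuffle can be sketched in a few lines. The function name and the choice to keep each PFS time together with its event indicator are assumptions for illustration.

```python
import random

def permute_outcomes_within_arm(times, events, arm, seed=0):
    """Shuffle (time, event) pairs among patients of the same arm.

    Baseline features stay attached to each patient, while outcomes are
    reassigned. The set of outcomes observed in each arm is unchanged.
    """
    rng = random.Random(seed)
    new_times, new_events = list(times), list(events)
    for a in set(arm):
        idx = [i for i, g in enumerate(arm) if g == a]
        shuffled = idx[:]
        rng.shuffle(shuffled)
        for dst, src in zip(idx, shuffled):
            new_times[dst] = times[src]
            new_events[dst] = events[src]
    return new_times, new_events

# Toy data: three patients per arm.
times = [1, 2, 3, 4, 5, 6]
events = [1, 0, 1, 1, 0, 1]
arm = [0, 0, 0, 1, 1, 1]
perm_times, perm_events = permute_outcomes_within_arm(times, events, arm)
```

Because the per-arm outcome distribution is preserved, the overall comparison between arms is unchanged, and only the link between patient features and outcome is broken.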

Kaplan–Meier estimation of PFS

We plot the Kaplan–Meier estimates for the probability of PFS for three representative points along the CURE Curve for both the OAK and POPLAR trials. We utilized PFS as the outcome because the OAK and POPLAR trials both failed to meet this endpoint. Additionally, PFS is most directly related to cancer treatment effect and therefore less prone to bias, whereas overall survival can be affected by factors unrelated to cancer treatment or progression. Kaplan–Meier plots were generated using Python. The number of patients at risk (alive and evaluable; noncensored) at each time point is shown for the nonpermutated plots.
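For reference, the Kaplan–Meier product-limit estimate and the number at risk can be computed as in this minimal sketch. The function name and toy data are illustrative, and the plotting code used for the figures is not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability, n_at_risk)] at each event time.

    `events` is 1 for an observed progression or death and 0 for censoring.
    """
    curve, surv = [], 1.0
    for t in sorted({ti for ti, ev in zip(times, events) if ev}):
        at_risk = sum(1 for ti in times if ti >= t)
        d = sum(1 for ti, ev in zip(times, events) if ti == t and ev)
        surv *= 1 - d / at_risk  # product-limit update
        curve.append((t, surv, at_risk))
    return curve

# Toy data: five patients, two censored (at 3 and 8 months).
km = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

In the toy data the estimate steps down at months 2, 3, and 5, with 5, 4, and 2 patients at risk, respectively.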

Clinical trial datasets

The phase 3 OAK and phase 2 POPLAR advanced nonsmall cell lung cancer randomized prospective clinical trials both tested the investigational therapy atezolizumab compared to docetaxel in previously treated advanced nonsmall cell lung cancer.^3,4 We restricted our analysis to patients who had available RNA sequencing data, which led to the inclusion of 699 patients on the OAK trial and 193 patients on the POPLAR trial. Tabular clinical trial data were integrated into CURE AI as training and testing variables.

Preprocessing of bulk RNA sequencing data

Raw read counts and transcripts per million were utilized for sample normalization. These data were integrated into CURE AI as training and testing variables.

Results

CURE AI predicts a benefit score for each patient, which allows enrolled patients to be ranked from high to low predicted benefit to investigational therapy compared to standard of care (Fig. 1). This order of patients is called the CURE Index. We then enroll each patient by the CURE Index order and calculate a p-value for each additional enrolled patient, which generates the CURE Curve. We developed the CURE Curve as a visual representation of the performance of CURE AI in predicting clinical trial treatment benefit.

The CURE Curves and associated Kaplan–Meier estimates for the OAK trial³ are demonstrated in Figure 2. The OAK trial represents a validation of the performance of the CURE AI foundation model because the benefit prediction was fine-tuned on OAK using OOF benefit prediction derived only from baseline OAK patient profiles (see Fig. 1). Practically, this means that CURE AI is a high-performing foundation model as patients who benefit from the anti-programmed death ligand 1 (PD-L1) immunotherapy atezolizumab more than the microtubule stabilizer and mitotic inhibitor docetaxel could be identified through analysis of a single clinical trial without the need to aggregate multiple datasets for fine-tuning.

Fig. 2.

CURE Curves and Kaplan–Meier Estimates for the OAK trial. In the OAK trial, the CURE Curve shows that patients enrolled up to around the 65th percentile (where the p-value line begins to descend) are more likely to benefit from atezolizumab than docetaxel. When the CURE Curve begins to descend, patients who have more predicted benefit to docetaxel are included, which results in loss of trial significance in the total trial population as the standard of care arm outperforms atezolizumab in the >65th percentile subgroup. Top panel: The top 35% of CURE Index-defined patients were selected. At this threshold, the CURE Curve reaches significance (p < 0.05), with the corresponding Kaplan–Meier estimation shown on the right panel. This CURE Score threshold is equivalent to the top 44% of patients on the POPLAR trial. Middle panel: The top 52% of CURE Index-defined patients were selected. At this threshold, the CURE Curve remains above significance (p < 0.05), with the corresponding Kaplan–Meier estimation shown on the right panel. This CURE Score threshold is identical to the top 88% of patients on the POPLAR trial. Bottom panel: The entire included trial population is shown with the corresponding Kaplan–Meier estimation shown on the right panel (p = 0.595). For each Kaplan–Meier estimate, the number of patients at risk (alive and evaluable) at each time point is shown.

Two OAK trial CURE Curve thresholds were selected for Kaplan–Meier estimation. These two thresholds represent the first point that the CURE Curve reaches significance (top 35% of OAK patients, corresponding to the top 44% of POPLAR patients) and the point on the POPLAR trial that is the most inclusive for patients who experience more benefit from atezolizumab relative to docetaxel (the top 88% of patients on POPLAR, which corresponds to the top 52% of the OAK CURE Index). The corresponding OAK and POPLAR thresholds represent the same CURE Score, meaning that they have the same predicted benefit to atezolizumab relative to docetaxel for a patient at that threshold.

The point where the CURE Curve p-value first reaches significance is upon enrollment of the top 35% of the CURE Index patients (Fig. 2, top panel). The Kaplan–Meier estimates for PFS calculation for the 35% threshold are shown in Figure 2, top right panel (p = 0.045, indicating significant benefit to atezolizumab in this group). Figure 2 (middle panels) demonstrates the CURE Curve and associated Kaplan–Meier estimates for the CURE Curve threshold of 52% (p = 0.047), which was selected as it corresponds to the POPLAR threshold that is the most inclusive for patient benefit to atezolizumab relative to docetaxel (includes 88% of POPLAR trial patients). Both of the CURE Curve thresholds significantly outperform the entire training set (which is shown in Fig. 2, bottom panels), which has a p-value of 0.595. At every point along the CURE Curve up to where the p-value line begins to descend, the enrolled cohort (all patients from the top of the CURE Index down to that point) has more benefit from atezolizumab than from docetaxel. The point where the CURE Index p-value begins to consistently decrease (around 65% on the OAK CURE Curve) represents the threshold where patients benefit more from docetaxel than from atezolizumab, based on this specific model.

To address concerns of overfitting, we compared CURE AI performance to a random recruitment of patients. The random selection line on the CURE Curve is the average of 1000 iterations of simulated randomized trial enrollment, each enrolling the same number of patients as are included up to that point along the CURE Curve. A p-value is calculated for this patient group (the cohort receiving atezolizumab compared to the cohort receiving docetaxel) for each enrollment iteration, and the p-value is averaged. The standard deviation of the random patient enrollment p-value is shown on the CURE Curve as a band around the mean. At every point where the CURE Curve p-value line lies above the upper edge of this 1-standard-deviation band, CURE AI significantly outperforms random trial enrollment.

We hypothesized that if a trial is well-balanced, CURE AI should not generate an imbalance in trial arms. Indeed, we found that the CURE Curve-defined groups remain well-balanced in both patient number and clinical features. Table 1 demonstrates that some of the major variables associated with clinical response are not imbalanced due to CURE patient selection for the OAK trial, including gender, histology, TMB,²⁶ and PD-L1 status. These data demonstrate that CURE AI can effectively predict benefit to therapy on the OAK trial while keeping baseline characteristics balanced and despite significant data missingness (PD-L1 was only defined for about half of patients, for example).
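A balance check for a binary variable can be sketched with a chi-square test on the 2×2 table of category by trial arm. The source does not state which test produced the p-values in the demographic tables, so the test and its continuity (Yates) correction are assumptions, and the function name is illustrative. The counts in the usage line are the gender counts from the top 35% OAK subgroup.

```python
import math

def chi2_yates_2x2(a, b, c, d):
    """p-value for the 2x2 table [[a, b], [c, d]] with Yates correction.

    Rows are categories (e.g., female, male) and columns are trial arms.
    """
    n = a + b + c + d
    numerator = n * max(0.0, abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Gender by arm in the top 35% of OAK: 34 vs. 23 female, 94 vs. 94 male.
p_gender = chi2_yates_2x2(34, 23, 94, 94)
```

Under this assumption the result is approximately 0.26, which is in line with the value reported for gender in that subgroup. A large p-value indicates no detectable imbalance between arms for that variable.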

Table 1.

OAK Trial Patient Demographics by CURE Curve Threshold (Top 35% of Patients, Top Table; Top 52% of Patients, Middle Table; All Included Patients, Bottom Table)

OAK 35
Missing (n)	Variable	Category	35% OAK	Atezolizumab	Docetaxel	p
	Patient number (n)		245 (100)	128 (52.2)	117 (47.8)
0	HIST, n (%)	Nonsquamous	164 (66.9)	88 (68.8)	76 (65.0)	0.621
		Squamous	81 (33.1)	40 (31.2)	41 (35.0)
91	PD-L1, n (%)	[1,50)	79 (51.3)	42 (51.9)	37 (50.7)	1.000
		[50,100]	75 (48.7)	39 (48.1)	36 (49.3)
0	Gender, n (%)	Female	57 (23.3)	34 (26.6)	23 (19.7)	0.260
		Male	188 (76.7)	94 (73.4)	94 (80.3)
113	TMB, n (%)	<16 (low)	84 (63.6)	37 (61.7)	47 (65.3)	0.804
		>16 (high)	48 (36.4)	23 (38.3)	25 (34.7)

OAK 52
Missing (n)	Variable	Category	52% OAK	Atezolizumab	Docetaxel	p
	Patient number (n)		364 (100)	189 (51.9)	175 (48.1)
0	HIST, n (%)	Nonsquamous	253 (69.5)	136 (72.0)	117 (66.9)	0.346
		Squamous	111 (30.5)	53 (28.0)	58 (33.1)
93	PD-L1, n (%)	[0,1)	87 (32.1)	45 (32.1)	42 (32.1)	0.973
		[1,50)	105 (38.7)	55 (39.3)	50 (38.2)
		[50,100]	79 (29.2)	40 (28.6)	39 (29.8)
0	Gender, n (%)	Female	96 (26.4)	49 (25.9)	47 (26.9)	0.934
		Male	268 (73.6)	140 (74.1)	128 (73.1)
143	TMB, n (%)	<16 (low)	149 (67.4)	67 (62.6)	82 (71.9)	0.183
		>16 (high)	72 (32.6)	40 (37.4)	32 (28.1)

OAK 100
Missing (n)	Variable	Category	100% OAK	Atezolizumab	Docetaxel	p
	Patient number (n)		699 (100)	344 (49.2)	355 (50.8)
0	HIST, n (%)	Nonsquamous	509 (72.8)	257 (74.7)	252 (71.0)	0.307
		Squamous	190 (27.2)	87 (25.3)	103 (29.0)
338	PD-L1, n (%)	[0,1)	177 (49.0)	85 (47.2)	92 (50.8)	0.769
		[1,50)	105 (29.1)	55 (30.6)	50 (27.6)
		[50,100]	79 (21.9)	40 (22.2)	39 (21.5)
0	Gender, n (%)	Female	258 (36.9)	125 (36.3)	133 (37.5)	0.818
		Male	441 (63.1)	219 (63.7)	222 (62.5)
274	TMB, n (%)	<16 (low)	323 (76.0)	157 (75.5)	166 (76.5)	0.895
		>16 (high)	102 (24.0)	51 (24.5)	51 (23.5)

Number of patients with an unknown variable is listed in the left column. These data demonstrate that the CURE Lung Cancer patient group demographics are not significantly altered compared to all included patients for the OAK trial.

CURE, Clinical trials Uncovering Real Efficacy; PD-L1, programmed death ligand 1.

We deliberately included only the OAK trial in the fine-tuning process for two reasons: (1) to show how CURE AI performs on a real clinical trial when used for both training and CV and (2) to assess its performance on a smaller and completely held-out trial (POPLAR), which we describe next. CURE AI, now fine-tuned on OAK nonsmall cell lung cancer data, is termed CURE Lung Cancer.

CURE Lung Cancer demonstrated direct generalizability on a held-out independent clinical trial dataset, the POPLAR trial.⁴ The POPLAR trial has a similar patient population and treatment arms (atezolizumab vs. docetaxel) as the OAK trial. We utilized CURE Lung Cancer, optimized to understand the treatment benefit of immunotherapy (atezolizumab) relative to chemotherapy (docetaxel), to calculate the CURE Curve for the POPLAR clinical trial. The CURE Curves and Kaplan–Meier estimates for the POPLAR trial are demonstrated in Figure 3. The top panel demonstrates the top 44% of the CURE Index patients (corresponding to the top 35% of patients on the OAK trial, which represents the same CURE Score). We find that the PFS is significant with a p-value of 0.021 (compared to 0.211 on the entire patient cohort; Fig. 3, bottom panel).

Fig. 3.

CURE Lung Cancer Curves and Kaplan–Meier Estimates for the POPLAR trial. In the POPLAR trial, the CURE Curve shows that patients enrolled up to around the 88th percentile (where the p-value line begins to descend) are more likely to benefit from atezolizumab than docetaxel. When the CURE Curve begins to descend, patients who have more predicted benefit to docetaxel are included, which results in loss of trial significance in the total trial population as the standard of care arm outperforms atezolizumab in the >88th percentile subgroup. Note that the POPLAR trial was analyzed on CURE Lung Cancer (CURE AI that has been fine-tuned on the OAK trial). Top panel: The top 44% of CURE Index-defined patients were selected. At this threshold, the CURE Curve passes significance (p < 0.05), with the corresponding Kaplan–Meier estimation shown on the right panel. Middle panel: The top 88% of CURE Index-defined patients were selected. At this threshold, the CURE Curve remains above significance (p < 0.05), with the corresponding Kaplan–Meier estimation shown on the right panel. Bottom panel: The entire included trial population is shown with the corresponding Kaplan–Meier estimation shown on the right panel (p = 0.211). For each Kaplan–Meier estimate, the number of patients at risk (alive and evaluable) at each time point is shown.

Interestingly, the top 52% of patients on the OAK trial corresponded to the top 88% of patients on the POPLAR trial (chosen because it is the most inclusive significant threshold for POPLAR), based on the same CURE Score cutoffs. This means that the patient group on the POPLAR trial contained more patients who we predict would have more benefit from atezolizumab. We find that on POPLAR, if the trial included the top 88% of patients by CURE Index ranking, the trial would have had a positive PFS endpoint (p = 0.035). This means that CURE AI can be fine-tuned on a single clinical trial to predict outcomes on an independent clinical trial using predefined inclusion criteria that include the majority of patients, as we do not want to unnecessarily restrict patient eligibility.

Since the POPLAR trial was well-balanced, CURE AI should not lead to an imbalance in trial arms. Indeed, we found that the CURE Curve-defined groups remain well-balanced in both patient number and clinical features. Table 2 demonstrates that histology, PD-L1 status, and TMB are not imbalanced due to CURE patient selection for the POPLAR trial; gender differs between arms in the top 44% subgroup (p = 0.040), but this difference is already present in all included patients (p = 0.029). These data demonstrate that CURE Lung Cancer can effectively predict benefit to therapy for the POPLAR trial, which is a completely independent test of the CURE AI training and fine-tuning process.

Table 2.

POPLAR Trial Patient Demographics by CURE Curve Threshold (Top 44% of Patients, Top Table; Top 88% of Patients, Middle Table; All Included Patients, Bottom Table).

POPLAR 44
Missing (n)	Variable	Category	44% POPLAR	Atezolizumab	Docetaxel	p
	Patient number (n)		85 (100)	38 (44.7)	47 (55.3)
0	HIST, n (%)	Nonsquamous	48 (56.5)	21 (55.3)	27 (57.4)	1.000
		Squamous	37 (43.5)	17 (44.7)	20 (42.6)
21	PD-L1, n (%)	[1,50)	44 (68.8)	17 (65.4)	27 (71.1)	0.837
		[50,100]	20 (31.2)	9 (34.6)	11 (28.9)
0	Gender, n (%)	Female	24 (28.2)	6 (15.8)	18 (38.3)	0.040
		Male	61 (71.8)	32 (84.2)	29 (61.7)
0	TMB, n (%)	<16 (low)	20 (23.5)	9 (23.7)	11 (23.4)	1.000
		>16 (high)	65 (76.5)	29 (76.3)	36 (76.6)

POPLAR 88
Missing (n)	Variable	Category	88% POPLAR	Atezolizumab	Docetaxel	p
	Patient number (n)		170 (100)	89 (52.4)	81 (47.6)
1	HIST, n (%)	Nonsquamous	101 (59.8)	54 (61.4)	47 (58.0)	0.775
		Squamous	68 (40.2)	34 (38.6)	34 (42.0)
21	PD-L1, n (%)	[0,1)	85 (57.0)	51 (66.2)	34 (47.2)	0.058
		[1,50)	44 (29.5)	17 (22.1)	27 (37.5)
		[50,100]	20 (13.4)	9 (11.7)	11 (15.3)
0	Gender, n (%)	Female	56 (32.9)	25 (28.1)	31 (38.3)	0.212
		Male	114 (67.1)	64 (71.9)	50 (61.7)
0	TMB, n (%)	<16 (low)	51 (30.0)	25 (28.1)	26 (32.1)	0.688
		>16 (high)	119 (70.0)	64 (71.9)	55 (67.9)

POPLAR 100
Missing (n)	Variable	Category	100% POPLAR	Atezolizumab	Docetaxel	p
	Patient number (n)		193 (100)	96 (49.7)	97 (50.3)
1	HIST, n (%)	Nonsquamous	120 (62.5)	59 (62.1)	61 (62.9)	1.000
		Squamous	72 (37.5)	36 (37.9)	36 (37.1)
37	PD-L1, n (%)	[0,1)	92 (59.0)	55 (67.9)	37 (49.3)	0.056
		[1,50)	44 (28.2)	17 (21.0)	27 (36.0)
		[50,100]	20 (12.8)	9 (11.1)	11 (14.7)
0	Gender, n (%)	Female	72 (37.3)	28 (29.2)	44 (45.4)	0.029
		Male	121 (62.7)	68 (70.8)	53 (54.6)
0	TMB, n (%)	<16 (low)	51 (26.4)	25 (26.0)	26 (26.8)	1.000
		>16 (high)	142 (73.6)	71 (74.0)	71 (73.2)

Number of patients with an unknown variable is listed in the left column. These data demonstrate that the CURE Lung Cancer patient group demographics are not significantly altered compared to all included patients for the POPLAR trial.

HIST, histopathology; TMB, tumor mutational burden.

Since the design of CURE AI intrinsically limits the risk of confounding by validation testing on held-out trial data, and we saw no signs of data overfitting or bias within our models, we have high confidence in the performance of CURE AI. Indeed, our fine-tuned model generalized to the POPLAR trial. However, as an additional test to affirm our confidence in the model, we developed the CURE Permutation test to assess for intrinsic bias within the model fine-tuning schema. The permutation test involves randomizing the outcome of each patient so that treatment outcomes are no longer correctly associated with the same patient’s clinicogenomic information. Within each trial arm, PFS is randomized so that the distribution of outcomes per trial arm remains the same, but the association of PFS for a particular patient is incorrect. This should result in a loss of signal for the model training, as the clinicogenomic relationships that are predictive for patient benefit are no longer accurate, so the model should not be able to learn an association since the correct association no longer exists.

Permutated patient outcome data from OAK and POPLAR were independently assessed by CURE AI, as described in the schema from Figure 1. Figure 4 demonstrates the CURE Permutation test on the OAK trial. We find that there is a loss of significance across the entire dataset, as the CURE Curve no longer outperforms the randomized patient selection. At the top 35% of the CURE Index (Fig. 4, top panel), the p-value is insignificant at p = 0.391, compared to p = 0.045 on the CURE AI model for OAK. At the top 52% of the CURE Index (Fig. 4, bottom panel), the p-value is insignificant at p = 0.703, compared to p = 0.047 on the CURE AI model for OAK. Figure 5 demonstrates the CURE Permutation test on the POPLAR trial. We find that there is a loss of significance across the entire dataset, as the CURE Curve no longer outperforms the randomized patient selection. At the top 44% of the CURE Index (Fig. 5, top panel), the p-value is insignificant at p = 0.629, compared to p = 0.021 on the CURE AI model for POPLAR. At the top 88% of the CURE Index (Fig. 5, bottom panel), the p-value is insignificant at p = 0.371, compared to p = 0.035 on the CURE AI model for POPLAR. The results demonstrate that CURE AI is not prone to generating false associations between data and outcomes.

Fig. 4.

CURE Permutation Test and Kaplan–Meier Estimates for the OAK trial. Using the CURE Permutation test, when PFS is randomized within each arm of the OAK trial, CURE AI does not identify any significant associations between clinicogenomic variables and patient outcome. The top 35% and top 52% of CURE Index-defined patients during permutation testing are shown. The CURE Curve is not significantly different from random selection enrollment (p > 0.05 at all points), with the corresponding Kaplan–Meier estimates shown in the right panel. This loss of significance is a test of confidence in the CURE AI model. PFS, progression-free survival.

Fig. 5.

CURE Permutation Test and Kaplan–Meier Estimates for the POPLAR trial. Using the CURE Permutation test, when PFS is randomized within each arm of the POPLAR trial, CURE AI does not identify any significant associations between clinicogenomic variables and patient outcome. The top 44% and top 88% of CURE Index-defined patients during permutation testing are shown. The CURE Curve is not significantly different from random selection enrollment (p > 0.05 at all points), with the corresponding Kaplan–Meier estimates shown in the right panel. This loss of significance is a test of confidence in the CURE AI model.

Discussion

The CURE AI foundation model is an innovative example of how complex neural network architectures can be applied to clinical trials to increase the likelihood that patients will receive a drug from which they will benefit. We independently validated the performance of CURE AI on two clinical trials and showed that CURE AI can predict the benefit of atezolizumab relative to docetaxel while being blinded to patient outcome. This is the first demonstration that an LCGM can proactively characterize patients by predicted benefit from an investigational therapy.

While further validation is required, the presented evidence makes it attractive to speculate that LCGMs will shift the way clinical trials are approached. Enrollment criteria could be enhanced so that patient selection becomes an individualized prediction of treatment benefit among multiple therapy options, which intrinsically includes measurement of both treatment response and risk (side effects). With improved enrollment criteria, we should observe a Will Rogers effect, where the clinical outcomes of all patients screened for a trial will improve. This is because patients who are eligible for enrollment will be enriched for benefit from the investigational therapy, and patients who are screened as ineligible will avoid receiving an investigational therapy that is predicted to be inferior to the existing therapeutic option. For example, on POPLAR, the top 88% of patients should be enrolled and randomized, while the patients who are predicted to benefit more from standard of care could instead be offered standard of care alone or an alternative clinical trial. By optimizing for the benefit of an intervention compared to the standard of care, CURE AI leads to more inclusive enrollment criteria than alternative AI approaches that focus on treatment response (rather than benefit), such as PBMF²⁷ and COMPASS,²⁸ which can result in restricted eligibility criteria that magnify response differences but exclude >80% of the initial population.

In practice, CURE AI could be used to understand which actionable clinical and genetic factors define benefit in a phase 2 trial, which would help refine eligibility criteria for a phase 3 trial. Insights from phase 2 and phase 3 trials can inform which new therapy combinations to pursue or not pursue, accelerating progress. For example, from analysis of OAK and POPLAR, CURE AI predicts that T cell immunoreceptor with Ig and ITIM domains (TIGIT) is a poor target in combination with PD-1/PD-L1 axis therapies (in most but not all patients). Furthermore, CURE Scores can be applied to real-world data to predict cross-cancer therapy benefit for a treatment that has not yet been tested in that cancer type, enabling informed indication expansion decisions. Another exciting use case is to adapt ongoing clinical trials based on interim analyses, in which patient eligibility criteria could be modified during the trial to optimize for therapeutic benefit.

It is important for clinical trials to be representative of the entire population, as model training depends on the availability of robust clinical trial data. Complex models like CURE AI are not easily implemented in clinical practice because they require a companion diagnostic to inform patient eligibility. However, model distillation can simplify complex models so that they rely on readily available clinical information and do not require omics-based companion diagnostics, which would increase patient access and reduce cost. CURE AI is currently being incorporated into immunotherapy clinical trial protocols to prospectively test CURE AI-informed patient selection.

CURE AI has the potential to improve clinical trial design, analysis, and interpretation. In the future, it is possible that LCGMs will enable “virtual control arms” for clinical trials, as standard control arms could raise concerns about withholding a treatment with high predicted benefit over standard of care, based on existing data. Regulatory bodies will need to collaborate with academic, biotechnology, and patient advocacy leaders to ensure safe, transparent, and straightforward implementation of technologies like CURE AI, as these technologies could rapidly expedite the clinical translation of transformative patient treatments.

In summary, we have developed a causal LCGM that can predict and accurately measure the individualized patient benefit of investigational therapy compared to an existing standard of care. The development of CURE AI enables a shift from simple outcomes (such as treatment response) to individualized continuous benefit measurement. We are optimistic that special implementations of AI, such as CURE AI for clinical trials, will demonstrate consistent successes and lead to a revolution in health care, including a major step toward more individualized medicine. CURE AI can be applied to a variety of scientific and clinical uses, including adaptive clinical trials, toxicity prediction, treatment response prediction, and understanding of drug resistance and response mechanisms.

Footnotes

Authors’ Contributions

A.D.W. and V.F. conceived and developed CURE AI. A.W., N.T.P., and V.F. designed the trial analyses in this study. A.D.W., V.F., and N.T.P. analyzed and interpreted the trial analyses in this study. D.F., Z.T., J.C., and B.J. contributed to the supervision and project strategy. All authors contributed to writing and reviewing the article. All authors have read and approved the final article.

Availability of Data and Materials

All available data generated or analyzed during this study are included in this published article. The code underlying all CURE AI models is a proprietary trade secret. CURE AI models are currently accessible through Numenos to commercial entities, academic collaborators, and nonprofit organizations for clinical trial design and analysis. The models are not currently available for independent public research use.

Author Disclosure Statement

All intellectual property relevant to this work has been developed and is owned exclusively by Numenos. A.W., V.F., and N.P. have equity in Numenos. D.F., Z.T., J.C., and B.J. are employees of Boehringer Ingelheim.

Funding Information

This study was supported by Numenos and Boehringer Ingelheim Pharmaceuticals, Inc.

References

1. Hwang, Kwon, Jeong, et al. Immune gene signatures for predicting durable clinical benefit of anti-PD-1 immunotherapy in patients with non-small cell lung cancer. Sci Rep, 2020; 10(1):643.

2. Motzer, Robbins, Powles, et al. Avelumab plus axitinib versus sunitinib in advanced renal cell carcinoma: Biomarker analysis of the phase 3 JAVELIN Renal 101 trial. Nat Med, 2020; 26(11):1733–1741.

3. Rittmeyer, Barlesi, Waterkamp, et al.; OAK Study Group. Atezolizumab versus docetaxel in patients with previously treated non-small-cell lung cancer (OAK): a phase 3, open-label, multicentre randomised controlled trial. Lancet, 2017; 389(10066):255–265.

4. Fehrenbacher, Spira, Ballinger, et al.; POPLAR Study Group. Atezolizumab versus docetaxel for patients with previously treated non-small-cell lung cancer (POPLAR): a multicentre, open-label, phase 2 randomised controlled trial. Lancet, 2016; 387(10030):1837–1846.

5. Mazieres, Rittmeyer, Gadgeel, et al. Atezolizumab versus docetaxel in pretreated patients with NSCLC: Final results from the randomized phase 2 POPLAR and Phase 3 OAK Clinical Trials. J Thorac Oncol, 2021; 16(1):140–150.

6. Harrer, Shah, Antony, et al. Artificial intelligence for clinical trial design. Trends Pharmacol Sci, 2019; 40(8):577–591.

7. Gupta, Srivastava, Sahu, et al. Artificial intelligence to deep learning: Machine intelligence approach for drug discovery. Mol Divers, 2021; 25(3):1315–1360.

8. Askin, Burkhalter, Calado, et al. Artificial Intelligence Applied to clinical trials: Opportunities and challenges. Health Technol (Berl), 2023; 13(2):203–213.

9. Azuaje. Artificial intelligence for precision oncology: Beyond patient stratification. NPJ Precis Oncol, 2019; 3:6; doi: 10.1038/s41698-019-0078-1

10. Bhalla, Laganà. Artificial intelligence for precision oncology. In: Computational Methods for Precision Oncology, 1st ed. (Bhalla, Laganà, eds.) Springer; 2022, pp. 249–268.

11. Singh, Kumar, Payra, et al. Artificial intelligence and machine learning in pharmacological research: Bridging the gap between data and drug discovery. Cureus, 2023; 15(8):e44359.

12. Temple. Enrichment of clinical study populations. Clin Pharmacol Ther, 2010; 88(6):774–778.

13. U.S. Food and Drug Administration. Enrichment Strategies for Clinical Trials to Support Determination of Effectiveness of Human Drugs and Biological Products: Guidance for Industry. FDA: Silver Spring, MD; 2019. Available from: https://www.fda.gov/media/121320/download

14. Magnuson, Bruinooge, Singh, et al. Modernizing clinical trial eligibility criteria: Recommendations of the ASCO-Friends of Cancer Research Performance Status Work Group. Clin Cancer Res, 2021; 27(9):2424–2429.

15. Beaver, Ison, Pazdur. Reevaluating eligibility criteria—balancing patient protection and participation in oncology trials. N Engl J Med, 2017; 376(16):1504–1505; doi: 10.1056/NEJMp1615879

16. Tufts CSDD. Rising protocol design complexity is driving rapid growth in clinical trial data volume. Tufts CSDD Impact Report, 2021; 23(1):1–6.

17. Mitra, McGough, Chakraborti, et al. Learning from data with structured missingness. Nat Mach Intell, 2023; 5(1):13–23.

18. Simon. Clinical trial designs for evaluating the medical utility of prognostic and predictive biomarkers in oncology. Per Med, 2010; 7(1):33–47; doi: 10.2217/pme.09.49

19. Yadav, Lewis. Immortal time bias in observational studies. JAMA, 2021; 325(7):686–687; doi: 10.1001/jama.2020.9151

20. Imbens, Rubin. Rubin Causal Model. In: Microeconometrics. Springer; 2010, pp. 229–241.

21. Chiba. Causal measures for prognostic and predictive biomarkers. OJS, 2018; 08(02):241–248; doi: 10.4236/ojs.2018.82014

22. Yadav, Shukla. Analysis of k-Fold cross-validation over hold-out validation on colossal datasets for quality classification. In: 2016 IEEE 6th International Conference on Cloud Computing and Big Data Analysis (ICCCBDA). IEEE; 2016, pp. 1–6.

23. Wang, Ma, Zhao, et al. A comprehensive survey of loss functions in machine learning. Ann Data Sci, 2022; 9(2):187–212.

24. Melnychuk, Frauen, Feuerriegel. Causal transformer for estimating counterfactual outcomes. In: Proceedings of the 39th International Conference on Machine Learning. PMLR 162; 2022:15293.

25. Curth, Svensson, Weatherall, et al. Really Doing Great at Estimating CATE? A Critical Look at ML Benchmarking Practices in Treatment Effect Estimation. In: Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS 2021). NeurIPS; 2021.

26. Stenzinger, Allen, Maas, et al. Tumor mutational burden standardization initiatives: Recommendations for consistent tumor mutational burden assessment in clinical samples to guide immunotherapy treatment decisions. Genes Chromosomes Cancer, 2019; 58(8):578–588; doi: 10.1002/gcc.22733

27. Arango-Argoty, Bikiel, Sun, et al. AI-driven predictive biomarker discovery with contrastive learning to improve clinical trial outcomes. Cancer Cell, 2025; 43(5):875–890.e8.

28. Shen, Nguyen, Li, et al. Generalizable AI predicts immunotherapy outcomes across cancers and treatments. medRxiv, 2025; doi: 10.1101/2025.05.01.25326820