Abstract
Background
Non-metastatic, castration-resistant prostate cancer (nmCRPC) is an advanced state of prostate cancer with variable prognosis; early identification of patient risk is crucial, so that clinicians can recommend optimal treatment.
Objective
Compare predictive models in identifying patient risk; evaluate the value of electronic healthcare record (EHR) time-series (TS) information in prediction.
Methods
We evaluated SurvTRACE, Weibull Time to Event Recurrent Neural Network (WTTE-RNN), and traditional Cox proportional hazards (CPH) models’ performance on EHR data from 12,819 nmCRPC patients in the Veterans Health Administration, using area under the receiver operating characteristic curve and Brier score.
Results
WTTE-RNN, which intrinsically uses EHR TS information, outperformed the other models when they were trained without TS information. Feature-engineered TS information improved the performance of CPH and, especially, SurvTRACE; with TS information, SurvTRACE outperformed WTTE-RNN.
Conclusion
Deep learning methods, whether intrinsically able to handle TS data or enhanced with TS information, can outperform traditional survival analysis in predicting risk.
Introduction
Prostate cancer (PC) is the second most common cancer in men, and accounts for a large proportion of all cancer-related deaths worldwide.1,2 PC patients who stop responding to androgen deprivation therapy (ADT)—referred to as castration resistance—but have no evidence of metastatic disease on radiographic imaging are classified as having non-metastatic, castration-resistant PC (nmCRPC). 3 These patients are typically of advanced age, have chronic comorbidities, and generally face high risk of developing metastasis and morbidity. 4 The prognosis for nmCRPC patients can vary, and early intervention and effective treatment are crucial, especially for individuals with a higher risk of metastatic disease and mortality. Early identification of high-risk patients could help clinicians adjust treatment plans and thus prolong patients’ progression-free survival.
The most common clinical guideline for identifying high-risk nmCRPC patients relies on tracking prostate-specific antigen doubling time (PSADT), 5 which does not account for other patient characteristics. In practice, clinicians lack a standard tool for calculating PSADT and often must make a rough estimate, rendering PSADT a less-than-ideal criterion for identifying risk. Electronic healthcare record (EHR) data are a rich source of information on patient characteristics and individual responses to treatment. However, EHR data are also complex, with a high degree of missingness, irregular intervals of measurement, and heterogeneity among patients in the quantity and quality of information collected. While risk classification tools for PC have been developed, they have not fully explored optimal methods of summarizing and standardizing EHR time series. It remains unclear whether models that ingest time-series data directly predict risk for nmCRPC patients more accurately than models that rely on summaries of EHR data.
A comparison of PC risk classification tools 6 found that the MSKCC nomogram, 7 a simple predictive model developed by the Memorial Sloan Kettering Cancer Center, had the best prognostic performance for predicting death in PC patients prior to starting any treatments. Ni et al. 8 developed a machine learning model to stratify nmCRPC patients by risk for metastasis or death based on data from the SPARTAN 9 and ARAMIS 10 clinical trials. However, models based on clinical trial data cannot be assumed to be applicable to the general PC patient population, as healthier individuals are overrepresented in clinical trials. Work exploring the optimal processing of EHR time-series data has largely been conducted in more general healthcare contexts than in PC. In Johnson et al., 11 careful feature engineering of time-series data greatly improved the predictive performance of logistic regression, which outperformed a long short-term memory neural network directly processing time-series data. This paper focused on a general prediction task of in-hospital mortality and validated results using the MIMIC-III dataset. Similarly, Wu et al. 12 fit logistic regression with summarized time-series features that outperformed a long short-term memory model in predicting vasopressor onset in the ICU, also trained with MIMIC-III data. These studies used only a single public critical-care dataset, and the models’ performance on other real-world EHR data is unknown. Another study compared performance in prediction of systemic lupus erythematosus in 925 patients using an ensemble machine learning approach with feature-engineered time-series data versus a long short-term memory model with raw time-series data, concluding that each approach had strengths and limitations. 13
A study focusing on PC showed that a cross-sectional DL survival model performed better using feature-engineered time-series data to predict a composite outcome of adverse events for PC patients, while a DL survival model that automatically processed time-series data performed better at predicting PC mortality, though the differences in performance were not significant for either outcome. 14 This study focused generally on PC patients, resulting in a large cohort of 110,000 patients. This is an easier predictive problem than risk prediction in nmCRPC, where the cohort is smaller and less heterogeneous.
We sought to develop a model using data from the Veterans Health Administration (VHA) to predict patients’ prognoses automatically to identify high-risk patients and thus facilitate closer surveillance and more tailored treatment plans. We also investigated whether time-series information helps improve prediction accuracy; whether being able to process longitudinal data directly confers an advantage in predictive performance; and whether feature engineering can be used to overcome limitations in models that do not traditionally handle temporal data.
The VHA is the largest integrated healthcare system in the US and contains the biggest cohort of nmCRPC patients. However, some patients are still lost to follow-up, or censored. To address censoring, we implemented models designed for survival analysis. The Cox proportional hazards (CPH) model is one of the most popular models for risk prediction and stratification in survival contexts.15,16 However, the CPH model assumes linearity and proportional hazards, and it requires domain-specific knowledge to account for interactions between variables. 17 Deep learning (DL) survival models have been developed as an alternative to traditional statistical methods like CPH and are not subject to the same limitations. 18 These models have been successfully used in risk prediction contexts for PC patients,14,19 making them an attractive choice for our comparison.
DeepSurv, 20 DeepHit, 21 and deep proportional hazards models are all popular architectures for deep survival analysis. DeepSurv and the broader family of deep proportional hazards models make the same proportional hazards assumption as the traditional CPH model but allow for non-linear relationships between event risk and covariates. DeepHit directly learns the survival distribution by discretizing time into distinct intervals and predicting the event probability for each interval. All these models are designed to ingest tabular data, but versions leveraging recurrent neural networks (RNNs) exist for each of them that can directly process longitudinal data. However, for DeepSurv and similar models, the proportional hazards assumption may not hold in practice, and violations can lead to biased estimates. Multiple alternative architectures that do not rely on this assumption exist, including DeepHit, Weibull Time to Event Recurrent Neural Network 22 (WTTE-RNN), and SurvTRACE, 18 a transformer-based neural network for survival contexts. WTTE-RNN is capable of directly processing temporal data, whereas SurvTRACE requires tabular data. Previous work has shown success using WTTE-RNN to predict survival outcomes for nmCRPC patients. 23 SurvTRACE has been found to outperform DeepHit and DeepSurv 18 and has been used successfully in risk stratification and prediction of recurrent cardiovascular events for patients with ischemic heart disease. 24 Given the success of SurvTRACE over DeepHit and DeepSurv, we included it in our comparison. We also included WTTE-RNN, due to its ability to ingest time-series data and its prior success with treatment recommendation for nmCRPC patients, 23 as well as a regularized CPH model to establish a baseline of performance.
Using SurvTRACE, WTTE-RNN, and regularized CPH, we compared predictive performance in risk stratification for nmCRPC patients resulting from different methods for handling time-series data.
Methods
This was a retrospective cohort study designed to develop and compare machine learning models for predicting time to metastasis or all-cause mortality in a nationwide cohort of U.S. Veterans with nmCRPC. Using data from the Department of Veterans Affairs (VA) health system, we trained and evaluated several survival analysis models, including a regularized CPH model and two DL models (SurvTRACE and WTTE-RNN).
Study cohort
Using data from the VA Cancer Registry System and pharmacy dispensation records from the VA Corporate Data Warehouse, we identified a nationwide cohort of 13,557 patients diagnosed with PC from January 1, 2006, through December 31, 2019, who later developed nmCRPC. Three months post-nmCRPC was selected as a landmark date, and patients who developed metastatic disease or died within the first 3 months post-nmCRPC were excluded. We also excluded 173 patients whose PC diagnosis date could not reliably be determined. Figure 1 provides the patient cohort flow diagram. The final cohort consisted of 12,819 nmCRPC patients.
Figure 1. Study cohort.

Data processing
Features
We characterized features as either static or time-varying. Static features included race, time from PC diagnosis to nmCRPC, Charlson comorbidity index (CCI) 6 months prior to nmCRPC date, as well as age, body mass index (BMI), and Gleason score at the time of nmCRPC diagnosis. Time-varying features included prostate-specific antigen (PSA), number of days from PC diagnosis to the record time, treatment, and an nmCRPC status indicator. We only included measurements of time-varying features recorded between PC diagnosis and initiation of a first line of treatment or the landmark date, if no treatments were initiated prior to the landmark date.
Feature engineering was used to condense time-varying features to a single row per patient. Longitudinal PSA values were summarized by the minimum, maximum, median, and slope. Treatments were summarized by treatment type and duration. Further details are provided in the appendix.
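The PSA summaries described above can be illustrated with a minimal sketch. The function name, the input layout, and the use of an ordinary least-squares slope of PSA against days since diagnosis are assumptions made for illustration; the exact slope definition used in the study is given in the appendix and may differ.

```python
from statistics import median


def summarize_psa(days, psa):
    """Collapse one patient's PSA series into min, max, median, and slope.

    days: days since PC diagnosis for each measurement (hypothetical layout)
    psa:  PSA values recorded on those days
    """
    if not psa or len(days) != len(psa):
        raise ValueError("days and psa must be non-empty and the same length")
    n = len(psa)
    mean_d = sum(days) / n
    mean_p = sum(psa) / n
    # Ordinary least-squares slope of PSA on time (an assumed definition);
    # falls back to 0.0 when all measurements share the same day.
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (p - mean_p) for d, p in zip(days, psa))
    slope = sxy / sxx if sxx > 0 else 0.0
    return {
        "psa_min": min(psa),
        "psa_max": max(psa),
        "psa_median": median(psa),
        "psa_slope": slope,
    }


# Toy series: PSA rising by 1 unit every 30 days
features = summarize_psa([0, 30, 60, 90], [1.0, 2.0, 3.0, 4.0])
```

Each patient then contributes a single row of such summaries alongside the static features, which is the form that tabular models require.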
Data split
Data were split into 10 folds, with each fold generated by randomly sampling without replacement and stratified by treatments patients received over time, ensuring the treatment distribution in each fold reflected the original dataset. Models were trained and evaluated using 10-fold cross-validation based on these splits. For each iteration of the cross-validation, 8 folds were used for training, 1 for testing, and 1 for validation. Repeating for 10 iterations resulted in 10 validation sets. Performance was evaluated by averaging metrics over these validation sets with confidence intervals generated using bootstrapping. Further details around model configuration are included in the appendix.
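As a rough sketch of this splitting scheme, the code below assigns patients to 10 folds within each treatment stratum and then rotates which folds serve as training, testing, and validation data. The function names are hypothetical, and the choice of which fold plays the testing role in a given iteration (here, the fold after the validation fold) is an assumption; the source does not specify it.

```python
import random
from collections import defaultdict


def stratified_folds(strata, n_folds=10, seed=0):
    """Assign each patient index to a fold, balancing folds within each stratum."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for idx, label in enumerate(strata):
        by_stratum[label].append(idx)
    fold_of = [None] * len(strata)
    offset = 0
    for label in sorted(by_stratum):
        members = by_stratum[label]
        rng.shuffle(members)  # random sampling without replacement
        # Deal patients round-robin so every fold gets a similar share
        for k, idx in enumerate(members):
            fold_of[idx] = (k + offset) % n_folds
        offset += len(members)
    return fold_of


def rotate_roles(n_folds=10):
    """Yield (train_folds, test_fold, validation_fold) for each iteration."""
    for i in range(n_folds):
        val = i
        test = (i + 1) % n_folds  # assumed pairing of test and validation folds
        train = [f for f in range(n_folds) if f not in (val, test)]
        yield train, test, val


# Toy cohort with two treatment strata
fold_of = stratified_folds(["A"] * 50 + ["B"] * 30)
roles = list(rotate_roles())
```

With this rotation, every fold serves as the validation set exactly once, giving the 10 validation sets over which metrics are averaged.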
Missing data imputation
We imputed missing data for static features using the mean values within each fold of the data split. For time-varying features, the missing values were filled in using the latest available non-missing value prior to the missing value. The missing values at the first visit were imputed with the mean first values within each fold. We adopted this imputation approach to avoid temporal information leakage and to reduce information leakage from the testing set to the training set.
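The rule for time-varying features can be sketched as follows. This is an illustration only: the function names and the use of `None` to mark missing values are assumptions, and the sketch treats a run of missing values at the start of a series by carrying the imputed first value forward, which the source does not explicitly describe.

```python
def mean_first_value(patients):
    """Mean of the observed first measurements across the patients in one fold."""
    firsts = [s[0] for s in patients if s and s[0] is not None]
    if not firsts:
        raise ValueError("no observed first values in this fold")
    return sum(firsts) / len(firsts)


def impute_series(values, first_value_mean):
    """Fill gaps in one patient's time-ordered series.

    A missing first value is replaced by first_value_mean; every later gap
    takes the most recent earlier value, so no future information is used.
    """
    filled = []
    last = None
    for v in values:
        if v is None:
            v = first_value_mean if last is None else last
        filled.append(v)
        last = v
    return filled


# Toy fold of three patients (None marks a missing PSA value)
fold = [[None, 2.0, None], [4.0, None, 6.0], [2.0, 3.0, None]]
fold_mean = mean_first_value(fold)
imputed = [impute_series(s, fold_mean) for s in fold]
```

Because each gap is filled only from earlier visits or from fold-level statistics, a patient's later measurements never influence that patient's earlier imputed values.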
Models
The outcome of interest was time to metastatic PC, or all-cause mortality. Our prediction models generated progression-free survival curves for each individual. The baseline time was taken to be either the initiation of first-line treatment post-nmCRPC diagnosis, or the landmark date, if no treatment was initiated within 3 months of nmCRPC diagnosis. Metastasis was ascertained from clinical documents using natural language processing techniques as described in references 25 and 26. Models incapable of directly processing time-varying data were trained both with and without summaries of time-varying features, as shown in Figure 2.
Figure 2. Data pipelines for compared models. (A) Pipeline for Weibull Time to Event Recurrent Neural Network (WTTE-RNN), which directly ingests time-series data and static data; (B) pipeline for Regularized Cox Proportional Hazards (Regularized CPH) and SurvTRACE, which ingest static data; (C) pipeline for time series (TS) Regularized CPH and TS SurvTRACE, which ingest summaries of time-series data and static data. All models output progression-free survival curves.

Regularized CPH
We regularized the CPH model with an elastic net penalty, 27 which combines ridge 28 and lasso 29 regularization. The lasso component of the penalty performs automatic feature selection by shrinking some coefficients to exactly zero. 30 Regularized CPH is incapable of handling time-varying data; thus, we refer to the regularized CPH model trained using summaries of time-series data as TS regularized CPH.
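To make the objective concrete, the sketch below evaluates a Cox negative partial log-likelihood plus an elastic net penalty. It is an illustration under stated assumptions, not the study's implementation: it assumes no tied event times, uses one common parametrization of the penalty (a strength `lam` and a mixing weight `l1_ratio`), and all function and parameter names are hypothetical.

```python
import math


def elastic_net_penalty(beta, lam, l1_ratio):
    """lam * (l1_ratio * ||beta||_1 + 0.5 * (1 - l1_ratio) * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


def neg_partial_log_likelihood(beta, X, times, events):
    """Negative Cox partial log-likelihood, assuming no tied event times."""
    scores = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i
        risk = sum(math.exp(scores[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        total += scores[i] - math.log(risk)
    return -total


def penalized_objective(beta, X, times, events, lam=0.1, l1_ratio=0.5):
    """Quantity a regularized CPH fit would minimize over beta."""
    return (neg_partial_log_likelihood(beta, X, times, events)
            + elastic_net_penalty(beta, lam, l1_ratio))


# Toy data: three patients, one covariate, all with observed events
X = [[1.0], [0.0], [2.0]]
times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
obj_at_zero = penalized_objective([0.0], X, times, events)
```

Setting `l1_ratio` to 1 gives a pure lasso penalty and setting it to 0 gives a pure ridge penalty; intermediate values mix the two.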
SurvTRACE
SurvTRACE is a transformer-based model that encodes each feature in a low-dimensional embedding and uses self-attention to account for full interactions between features. 18 The main architecture includes a baseline covariate embedding module, a deep-stacked attentive encoder module, and an alignment and subnetwork prediction module. Categorical variables and numerical variables are embedded and concatenated to represent features. Multi-head self-attention is used to enable sufficient interactions between covariate embeddings. For single-event survival analysis, the loss function is defined as the piecewise constant hazard loss proposed by Kvamme et al. 31 Like regularized CPH, SurvTRACE cannot process time-series data. We refer to the SurvTRACE model trained using summaries of time-series data as TS SurvTRACE.
WTTE-RNN
Weibull Time to Event Recurrent Neural Network (WTTE-RNN) is a DL prediction model that incorporates survival analysis. It discretizes time into steps, with time to the next event assumed to follow a Weibull distribution. An RNN is used to learn the scale and shape parameters of this Weibull distribution. Our implementation used a Gated Recurrent Unit (GRU) as the RNN structure. WTTE-RNN can directly process time-varying features due to its RNN architecture.
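The role of the two Weibull parameters can be illustrated with a short sketch. It uses the continuous Weibull distribution for simplicity, whereas the model described above works with discretized time, so this is an assumption-laden illustration rather than the loss used in the study. Here `alpha` (scale) and `beta` (shape) stand in for the values an RNN would output for a patient, and the function names are hypothetical.

```python
import math


def weibull_survival(t, alpha, beta):
    """S(t) = exp(-(t / alpha) ** beta) for scale alpha > 0 and shape beta > 0."""
    return math.exp(-((t / alpha) ** beta))


def censored_log_likelihood(t, event, alpha, beta):
    """Continuous Weibull log-likelihood for one patient (requires t > 0).

    event = 1: the event was observed at t, so use the log density.
    event = 0: the patient was censored at t, so use the log survival.
    """
    cum_hazard = (t / alpha) ** beta
    if event:
        log_hazard = math.log(beta / alpha) + (beta - 1.0) * math.log(t / alpha)
        return log_hazard - cum_hazard
    return -cum_hazard


# Survival probabilities at 0, 12, 24, and 36 months for assumed parameters
curve = [weibull_survival(t, alpha=24.0, beta=1.2) for t in (0, 12, 24, 36)]
```

Training would maximize the sum of these log-likelihood terms over patients, so censored patients still inform the fit through their survival term.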
Model performance metrics
Models were evaluated based on area under the receiver operating characteristic curve (AUROC) and Brier score. AUROC is a discrimination metric that indicates how well models stratify groups based on the outcome. It ranges from 0 to 1, where 0.5 corresponds to random guessing and higher values indicate better discrimination. Brier score is the mean squared difference between the predicted event probability and the observed event status. It is commonly used to assess calibration and can be considered a proximity measure. 32 A perfect model gives a Brier score of 0, whereas a non-informative reference model that predicts a probability of 0.5 for every patient yields a Brier score of 0.25.
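Both metrics can be illustrated on a toy binary outcome at a fixed horizon. This sketch deliberately ignores censoring, which a survival setting must handle (for example, through weighting schemes), so it shows the definitions rather than the evaluation code used in the study; the function names are hypothetical.

```python
def brier_score(pred_probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(pred_probs, outcomes)) / len(outcomes)


def auroc(scores, outcomes):
    """Probability that a random event case outscores a random non-event case.

    Ties between an event and a non-event score count as half a win.
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


outcomes = [1, 0, 1, 0]
reference_brier = brier_score([0.5, 0.5, 0.5, 0.5], outcomes)  # constant 0.5 prediction
perfect_brier = brier_score([1.0, 0.0, 1.0, 0.0], outcomes)
good_auroc = auroc([0.9, 0.1, 0.8, 0.3], outcomes)
chance_auroc = auroc([0.5, 0.5, 0.5, 0.5], outcomes)
```

The constant prediction of 0.5 reproduces the reference Brier score of 0.25 and an AUROC of 0.5, which is why both values serve as natural baselines.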
Results
Data characteristics
Baseline characteristics.
1 IQR: interquartile range.
2 Treatment initiated within 3 months of date of non-metastatic, castration-resistant status.
3 Standard deviation.
4 Prostate-specific antigen doubling time.
Model performance
Model performance.
1 Area under the receiver operating characteristic curve.
2 Confidence interval.
3 Cox Proportional Hazards.
4 Time Series.
5 Weibull Time to Event Recurrent Neural Network.
TS regularized CPH outperformed regularized CPH, and TS SurvTRACE outperformed SurvTRACE. SurvTRACE received a greater benefit from time-series summaries compared with regularized CPH.
Discussion
We developed a risk prediction approach using structured EHR data from a nationwide cohort of Veterans with nmCRPC and explored the contribution of time-series data to the predictive performance of survival models. Our findings indicate the inclusion of time-varying features significantly enhances the predictive performance of survival models compared with using static features alone. While WTTE-RNN can intrinsically use time-series data, TS SurvTRACE achieved superior performance in Brier scores at the 1- and 2-year milestones. This suggests the capacity to process time-series data does not necessarily offer an advantage over creating summaries of time-series data. Specifically, summarized approaches like TS SurvTRACE may offer better model calibration, whereas automated models like WTTE-RNN may provide marginal gains in discrimination as shown by its higher AUROC over longer follow-up periods. Consequently, the optimal choice of architecture depends on whether a clinician prioritizes precise calibration or long-term risk discrimination.
Our work demonstrates the capacity of DL models such as WTTE-RNN or SurvTRACE to provide more individualized guides than PSADT for clinicians when recommending treatment. Thus, the practical impact of this work lies in its potential to refine treatment selection and guide therapy timing in real-world practice, and inform risk stratification in clinical trials. For instance, by integrating our models with EHR, a clinical decision support system (CDSS) could provide clinicians with real-time, personalized risk predictions. This would enable more nuanced decision-making—escalating therapy for high-risk patients while avoiding overtreatment for those at low risk. 8
Limitations and future work
This work has several limitations. One key challenge was the significant missingness in certain features, notably Gleason score, which may have impacted model performance. Although PSADT also had a high degree of missingness, this is common in EHR data and was a motivating factor for our approach. Consequently, we did not include PSADT as a predictive feature. Our primary focus was to leverage the richer information available in the PSA time series directly. PSADT is presented in the baseline characteristics table only to provide a conventional and more interpretable summary of patient PSA kinetics.
While we provide foundational results for developing a CDSS for managing nmCRPC, more work is required prior to implementation. Our models relied exclusively on structured EHR data for prediction; future work will involve incorporating unstructured clinical notes, which is expected to enhance model performance. Furthermore, for these models to be clinically useful, their predictions must be interpretable. The “black box” nature of complex models such as SurvTRACE and WTTE-RNN, whose decision-making processes lack transparency, is a significant barrier to clinician trust and adoption. Explainable machine learning is an active area of research, 33 and multiple explainers, such as SHAP 34 and LIME, 35 can offer insight into model decisions. SHAP values quantify each feature’s marginal contribution to an individual patient’s predicted risk and support personalized, case-level explanations, while permutation feature importance estimates a feature’s global relevance by measuring the degradation in predictive performance when its values are randomly shuffled. Partial dependence plots and individual conditional expectation curves can further characterize how predicted risk varies with individual features such as PSA slope, Gleason score, or treatment type. For WTTE-RNN, time-step attributions derived from integrated gradients or attention weights can highlight which points in a patient’s PSA trajectory drove a prediction; for the transformer-based SurvTRACE, self-attention weights over covariate embeddings offer a native path to explanation. Translating these technical outputs into clinician-facing explanations (for example, by surfacing the top contributors to an individual’s risk estimate within the EHR interface) will be essential to support clinical trust and appropriate use. 36 In a companion study currently under review, we benchmark interpretable or “glass-box” methods against black box models using the same cohort of prostate cancer patients.
Next, successful integration into clinical practice is a complex task. A CDSS must be seamlessly embedded within existing EHR workflows to avoid disrupting care or increasing clinician workload. This involves not only technical interoperability but also careful consideration of human factors to prevent issues like alert fatigue and overreliance on automation. 37
Finally, the lack of external validation limits the generalizability of our findings beyond the VHA and must be addressed. The VHA cohort is composed almost entirely of male U.S. Veterans and differs from community-based populations in demographics, comorbidity burden, treatment access, and ascertainment of outcomes. Consequently, both the absolute performance and the relative ordering of models may shift when they are applied to other healthcare systems, and site-specific differences in laboratory platforms, PSA measurement cadence, diagnostic coding practices, and unstructured documentation of metastasis could all induce measurable data drift. The next steps toward real-world application therefore include rigorous external validation on diverse, multi-institutional datasets (ideally through federated evaluations across academic medical centers, community oncology networks, and large claims-linked EHR consortia), combined with fairness audits across racial, age, and comorbidity strata, followed by prospective clinical studies to determine whether the CDSS improves clinical outcomes. This translational process will require a collaborative effort between data scientists, clinicians, and informatics experts to create a tool that is accurate, interpretable, and meaningfully integrated into the fabric of patient care. 38 Until such validation is completed, the results reported here should be interpreted as evidence of the relative value of time-series representations within a single integrated health system rather than as a deployable risk model.
Despite limitations, this work provides essential groundwork for implementing a tool to help clinicians in their treatment decisions for patients with nmCRPC, offering valuable insight into which model architectures and features are effective for identifying high-risk nmCRPC patients.
Conclusion
DL survival methods are useful in predicting individual nmCRPC patients’ risk, demonstrating superior performance compared to the traditional CPH survival analysis approach. Prediction using DL on EHR data should leverage its time-series nature, either via models that can intrinsically utilize time-series information or through careful engineering of time-series features.
Supplemental material
Supplemental material - Deep survival learning for prognosis prediction in non-metastatic castration-resistant prostate cancer
Supplemental material for Deep survival learning for prognosis prediction in non-metastatic castration-resistant prostate cancer by Chunyang Li, Julia Bohman, Vikas Patil, Richard McShinsky, Christina Yong, Zach Burningham and Ahmad Halwani in Health Informatics Journal.
Footnotes
Acknowledgments
We would like to thank Dr. Siamack Ayandeh for the creation of the analytics study mart environment and the Veterans Health Administration Office of Research and Development for funding cloud credits.
Ethical considerations
The University of Utah Institutional Review Board approved this study under
Consent to participate
The requirement for informed consent was waived by the University of Utah Institutional Review Board.
Author Contributions
Conceptualization, Chunyang Li, Ahmad Halwani, Zach Burningham, Julia Bohman; methodology, Chunyang Li; formal analysis, Chunyang Li; data curation, Vikas Patil, Richard McShinsky; writing—original draft preparation, Chunyang Li, Julia Bohman; writing—review and editing, Chunyang Li, Julia Bohman, Christina Yong; supervision, Ahmad Halwani, Zach Burningham.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Data include PHI and are not available. Detailed information on model tuning and implementation is available upon request.
References
