Abstract
Tumor progression is driven by the accumulation of genetic alterations, including both point mutations and copy number changes. Understanding the temporal sequence of these events is crucial for comprehending the disease but is not directly discernible from cross-sectional genomic data. Cancer progression models, including Mutual Hazard Networks (MHNs), aim to reconstruct the dynamics of tumor progression by learning the causal interactions between genetic events based on their co-occurrence patterns in cross-sectional data. Here, we highlight a commonly overlooked bias in cross-sectional datasets that can distort progression modeling. Tumors become clinically detectable when they cause symptoms or are identified through imaging or tests. Detection factors, such as size, inflammation (fever, fatigue), and elevated biochemical markers, are influenced by genomic alterations. Ignoring these effects leads to “conditioning on a collider” bias, where events making the tumor more observable appear anticorrelated, creating false suppressive effects or masking promoting effects among genetic events. We enhance MHNs by incorporating the effects of genetic progression events on the inclusion of a tumor in a dataset, thus correcting for collider bias. We derive an efficient tensor formula for the likelihood function and apply it to two datasets from the MSK-IMPACT study. In colon adenocarcinoma, we observe a significantly higher rate of clinical detection for TP53-positive tumors, while in lung adenocarcinoma, the same is true for EGFR-positive tumors. Compared to classical MHNs, this approach eliminates several spurious suppressive interactions and uncovers multiple promoting effects.
INTRODUCTION
Cancer progression models (CPMs) aim to describe and reproduce the evolutionary development of a healthy tissue into a malignant tumor, driven by a series of genetic events such as mutations or copy number alterations (Beerenwinkel et al., 2014). Such events initially occur in single cells through largely random mutagenic processes, but their expansion across the cancer cell population is driven by nonrandom fitness effects they exert in the given environment and the genetic background (Mina et al., 2022). In other words, the fixation of a new event can be promoted or suppressed by previously fixated events, which constrains the probable chronological sequences of events and their co-occurrence patterns (Nowell, 1976).
For example, an initial mutation might promote growth until the tumor is starved for oxygen, whereupon subsequent mutations become beneficial and facilitate blood vessel formation. Access to blood vessels in turn sets the stage for further events culminating in metastasis. Conversely, some events can also suppress one another. This can result, for example, from synthetic lethality, where some events aid the tumor cell individually but become fatal when they occur together. Or, events may target genes in the same regulatory pathway; whichever event occurs first disrupts the whole pathway, reducing selective pressure on the other event.
At large, such interactions between events are still poorly characterized, and learning them from data is the goal of CPMs. The challenge lies in the inherent limitations of available data. Datasets with many patients typically provide only bulk genotypes and do not reliably resolve clonal structures. Most datasets are also cross-sectional. They provide a snapshot of many different tumors at a single time point each but do not track tumors over multiple stages of their evolution.
While we do not know the time at which a tumor was observed relative to the start of its progression, it is also not entirely random. During the initial stages, cancers are low in cell numbers, have accumulated few events, and are generally hard to detect. Instead, most human cancer data comes from advanced stages, after the tumor has clinically progressed and accumulated genetic events. Importantly, some of these genetic events contribute to the cancer’s phenotype: crucial driver genes accelerate proliferation and therefore promote aggressiveness and detectability. Hence, our ability to observe and study tumors depends on the events that occurred before we could detect the tumor. This dependence introduces a systematic bias to CPMs, which we resolve in this article.
So far, CPMs have been developed that can be learned from bulk genotypes (Diaz-Uriarte, 2023) under the assumption that tumors were observed at a random time independent of their progression events. The CPMs are trained on the co-occurrence patterns of events and model the probabilities of future events as functions of the events that are already present. These functions define a causal network. CPMs build on the seminal work of Fearon and Vogelstein (Fearon and Vogelstein, 1990) who manually inferred from genetic and clinical data that colorectal cancer tends to progress along a chain of mutations in the genes APC → KRAS → TP53.
Oncogenetic trees (Beerenwinkel et al., 2005; Desper et al., 1999) extend such chains and allow each event to be a necessary precursor to more than one successor event. In Conjunctive Bayesian Networks (CBNs) (Beerenwinkel et al., 2007; Gerstung et al., 2009; Montazeri et al., 2016), events may also require multiple precursors, thus extending trees to directed acyclic graphs (DAGs). CAPRESE (Loohuis et al., 2014) and CAPRI (Ramazzotti et al., 2015) are similar tree and DAG models where precursor events are not strictly necessary for successor events but raise their probabilities. Other DAG models with different functional forms are Disjunctive Bayesian Networks (Nicol et al., 2021), (Semi-)Monotone Bayesian Networks (Farahani and Lagergren, 2013), and Bayesian Mutation Landscapes (Misra et al., 2014). Pathway Linear Progression Models (Raphael and Vandin, 2015) infer groups of mutually exclusive events and arrange the groups in a chain. PathTiMEx (Cristea et al., 2017) generalizes this to CBNs of groups of mutually exclusive events. Network Aberration Models (NAMs) (Hjelm et al., 2006) are cyclic causal networks with promoting effects. HyperTraPS (Greenbury et al., 2020; Johnston and Williams, 2016) and Mutual Hazard Networks (MHNs) (Schill et al., 2019) generalize this to cyclic networks with promoting and suppressive effects. Similar approaches (Alfaro-Murillo and Townsend, 2023; Moen and Johnston, 2022) allow higher-order rather than pairwise interactions between events. TreeMHN (Luo et al., 2023) infers MHNs from intratumor phylogenetic trees derived from single-cell, multiregion, or bulk sequencing data.
Here we address a fundamental oversight in all these CPMs: They do not include the observation of the tumor itself in their causal networks. Instead, we regard observation of the tumor as an additional event indicating that the tumor was biopsied, sequenced, and eventually included in the dataset. It implies that the tumor has become conspicuous due to its size, morphology, or symptoms such as weight loss, fatigue, or pain. Since a dataset contains by definition only tumors at their moment of observation, neglecting any causal effects on this event makes CPMs prone to the notorious “conditioning on a collider” bias (Hernán MA, 2020). This bias is also known as Berkson’s paradox and refers to spurious associations between any variables that affect another conditioned variable. Joseph Berkson originally described it for a hospital in-patient population, which showed a negative association between diabetes and cholecystitis (Berkson, 1946), see Figure 1. However, since diabetes is known to increase the risk for cholecystitis (Cho et al., 2010), one would naturally expect a positive association. The explanation for the spurious negative association is that diabetes and cholecystitis are both separate causes of being in the hospital and therefore to be observed in this study. Learning of one cause explains away the need for the other cause.

Berkson’s original example of a collider bias (Berkson, 1946): The negative association between diabetes and cholecystitis observed in hospital patients could be spuriously explained (left) by suppressive effects between the diseases if the inclusion of a patient in the dataset were independent of both diseases. Alternatively, the same negative association between diabetes and cholecystitis can be correctly explained (right) by taking into account that both diseases have a promoting effect on being in the hospital and thus observed in the dataset. The spurious association masks the actual promoting effect of diabetes on cholecystitis.
Similarly, the inference of CPMs from statistical associations can be grossly distorted when the observation should be a part of the causal network but is not properly accounted for. CBNs and MHNs are models in continuous time that do have a random observation event, but its rate is fixed at 1 and cannot be affected by other events. Timed Hazard Networks (Chen, 2023) extend MHNs by hidden variables for the observation times of all tumors, but these are also not affected by other events. NAMs (Hjelm et al., 2006) have an observation event whose rate depends on the total number of events that have occurred, but not on which particular events have occurred.
In this article, we extend MHNs by causal effects between their progression events and their observation event. Each event occurs at its base rate and has multiplicative effects on the rate of every other event. These effects can be greater than 1 (promoting), less than 1 (suppressive), or equal to 1 (neutral) and define a causal network with cycles. An MHN is a generative model of cancer progression in the form of a continuous-time Markov chain. We provide an analytical formula for its probability distribution over tumor states, explicitly conditioned on their times of observation. This formula uses tensor expressions which allows us to efficiently infer base rates and multiplicative effects between events via maximum likelihood estimation.
We demonstrate our approach on two datasets of colon adenocarcinoma (COAD) and lung adenocarcinoma (LUAD) from the MSK-IMPACT study (Nguyen et al., 2022; Pugh et al., 2022). Compared to classical MHNs (cMHNs), we find results that offer drastically different interpretations. In COAD, we find that TP53 strongly promotes observation, which explains away suppressive interactions and uncovers promoting effects between APC and TP53. For LUAD, the new model identifies EGFR mutations as principal observation drivers, which explains away its suppressive interactions with most other events but retains suppressive effects with KRAS.
We first summarize the definition of cMHNs from Schill et al. (2019). Then we extend MHNs by effects on the observation and derive a formula for their likelihood function. Finally, we show that such models are not uniquely identifiable from cross-sectional data and resolve this by a regularization that favors parsimony.
Classical MHNs with unaffected observation
MHNs (Schill et al., 2019) model cancer progression as a continuous-time Markov chain that describes how a tumor accumulates n possible progression events. Over the course of its progression, a tumor can be in any of
Let

Comparative illustration for n = 3 progression events. Left: A cMHN with parameters Θ and implicit observation event whose rate is fixed at 1. Right: An oMHN with the observation as an explicit fourth event and parameters
Here,
Our aim is to learn for each event i how its rate depends on already present events in
In order to learn Θ from data via maximum likelihood estimation, we have to compute the probability distribution over all possible tumor states at the time of their observation. The observation in a cMHN occurs randomly at a time which is exponentially distributed with a fixed rate of 1. Marginalizing over the unknown observation time
Note that equation (5) is only valid for a fixed observation rate, since it relies on the fact that Q commutes with the identity matrix I. The log-likelihood of Θ for a dataset
Maximizing the log-likelihood of Θ, for example, via gradient ascent or quasi-Newton methods, requires operations that involve the huge matrix Q. To this end, we make use of the following representation of Q as a sum of tensor products:
Using efficient tensor operations (Buis and Dyksen, 1996), Θ can be learned with a time and storage complexity only exponential in the number of events that have occurred for each tumor, rather than exponential in 2n (Schill, 2022).
Here, we extend the cMHN introduced in the previous section to an observation MHN (oMHN). We include the observation event in its causal network as an explicit
Because now the observation rate depends on the state, we can no longer use equation (6) to compute the probabilities of tumor states at their time of observation. Instead, we use the following construction: We set the outgoing effects of the observation on all genetic progression events to 0. Once the observation occurs, it prevents all other events from ever occurring by multiplying their rates with 0, which freezes the data-generating process at the time of observation.1 The probability distribution at observation now equals the stationary distribution at infinity, which can be computed as follows.
Formally, we define the extended Markov chain on the state space
The transient distribution
Because all eigenvalues of T are strictly negative, we can calculate the stationary distribution by
We can now supersede the definition of the time-marginal distribution in equation (6) by the more general expression
This includes the classical MHN as a special case, where
The log-likelihood of a dataset
In order to compute and maximize it efficiently, we use the tensor expressions for Q and U, which yield
Learning the parameters Θ and Ω then has the same complexity as for a cMHN, that is, it is exponential in the number of events that have occurred for each tumor.
Note that the formula for the time-marginal distribution of an oMHN (18) is the same as for a cMHN (6) where the parameters Θ
ij
are replaced by the fractions
In order to decide on a particular causal model, we cannot rely on data alone but have to incorporate background knowledge or preferences in the form of a Bayesian prior or a penalty on the likelihood. Following the principle of parsimony (Occam’s razor), we prefer simple models that postulate the least number of causal mechanisms for explaining the data. This means that MHNs should be sparse, in the sense that many effects

Example of a cMHN (left) and an oMHN (right), which generate the same observational data but differ in their causal interpretation. They imply different experimental predictions: A drug treatment that suppresses event 2 would increase the probabilities of events 1 and 3 according to the left model, but not according to the right model. (Both networks are fully connected, but neutral effects of multiplicative strength 1 are not drawn.).
Moreover, we prefer symmetric models where many effects
To this end, we propose maximizing the log-likelihood regularized by the following penalty, which induces sparsity and soft symmetry:

Three possible regularization penalties, visualized by their unit-level surfaces. The horizontal plane corresponds to two counter-directed effects θij and θji, whereas the vertical axis corresponds to an unrelated effect θkl. The unit-level curve in the horizontal plane is a square for the L1 penalty, a circle for the group L1 penalty, and an ellipse for the symmetrizing group L1 penalty. The major axis of the ellipse is oriented through the quadrants with equal sign.
This means that, when our algorithm does infer asymmetrical effects despite this penalty, the choice is likely driven by the data. On the other hand, effects that are inferred as symmetric may either be genuinely so, or the direction may not be discerned from the data.
In this article, we choose the hyperparameter λ in 5-fold cross-validation according to the One Standard Error Rule (Hastie et al., 2009), which selects the largest value for λ such that its average log-likelihood is within one standard error of the optimum. We use this rule because the optimal λ tends to 0 for larger datasets as it becomes less necessary to prevent overfitting, but we still want to favor simple models to mitigate nonidentifiability.
We provide a new version of MHNs with a corresponding efficient learning algorithm. These models provide a novel perspective on cancer progression by inferring which (genetic) events promote the clinical detection of a tumor. The models are thereby corrected for a collider bias and offer more realistic interpretations of cancer progression, showing fewer spurious interactions and more genuine interactions that had been previously obscured.
We applied our method to cancer genomic data from clinical targeted sequencing assays originally collected by the Memorial Sloan Kettering Cancer Center (Nguyen et al., 2022) and retrieved through AACR GENIE (The AACR Project GENIE Consortium et al., 2017). Specifically, we selected cohorts of primary COAD (n = 2269) and primary LUAD (n = 3662). For each cohort, we selected the 12 most frequently mutated genes as model events while discounting putative nonpathogenic variants and genes that had not been consistently sequenced across assay versions—for more details, see Appendix A1.
Although oMHN models are more realistic than cMHNs, they are in principle equally powerful for explaining the data, so we did not necessarily expect results with higher likelihood. Nevertheless, we validated their model fit by splitting each dataset in half into a training and test set. We trained a cMHN and an oMHN on the training set and evaluated their log-likelihoods on the test set. For COAD, the cMHN achieved a log-likelihood of –5.14, while the oMHN achieved a slightly better –5.10. For reference, the independence model2 achieved –6.02, and the best possible performance was the negative entropy –4.48 of the test set. For LUAD, the cMHN achieved a log-likelihood of –3.96 and the oMHN achieved a slightly better –3.94. The independence model achieved –4.50, and the negative entropy of the test set was –3.74.
In the following, we report the models trained on the full datasets.
Colon Adenocarcinoma
Mutations in APC, KRAS, and TP53 are classically regarded as the principal genetic drivers of conventional COAD progression (Fearon and Vogelstein, 1990; Vogelstein et al., 2013). The three events are abundant in the dataset (42%−72%) but enriched in samples with few events overall, see Appendix A2. TP53 in particular is anticorrelated with most other events, that is, it is often observed with no or very few co-events.
Although cMHN and oMHN both fit the data similarly well, they offer drastically different causal interpretations (Fig. 5): cMHN suggests that APC, KRAS, and TP53 strongly suppress each other’s accumulation as well as other events. oMHN instead proposes that APC, KRAS, and especially TP53 lead to observation. This explains away many of their suppressive interactions and even uncovers a synergy between APC and TP53.

Heatmap visualization of the cMHN (left) and the oMHN (right) for the COAD dataset. In the main heatmap bodies, each cell shows the multiplicative effect Θ ij from the column event j on the row event i. Promoting effects >1 are coded in red, suppressive effects <1 in blue, and neutral effects = 1 are blank. The additional top row indicates base rates Θ ii , and the bottom row indicates effects Ω j from each column event j on the observation event. Values are rounded to the first decimal.
The different causal models also imply different chronological orders of events. We demonstrate this by considering the probabilities of every possible chronological order to reach the state in which APC, KRAS, and TP53 are mutated in Table 1. Contrary to cMHN, the oMHN suggests that APC tends to occur early in the progression and that TP53 tends to occur late, despite its prevalence. This is because TP53 triggers and therefore immediately precedes the observation. In a similar fashion, we analyzed the probabilities of chronological orders for all common genotypes in the data at hand and present in Figure 6 the most probable histories.

Most probable chronological order of events for the COAD dataset according to the cMHN (left) and oMHN (right). Each path from the root of the tree (white circle) to a leaf represents the progression of a tumor in the dataset. The symbols along the path indicate events whose most probable chronological order was computed from the trained models. To avoid clutter, the observation event at every leaf is implied without drawing a symbol and only tumors whose state is shared by at least three patients are drawn. The size of the edges and symbols along a path scale in the total number of patients with that tumor state. COAD, colon adenocarcinoma.
All Possible Chronological Orders to Reach the Genotype that Contains Exactly the Events APC, KRAS, TP53, and was Observed
The probabilities of these orders according to cMHN and oMHN are computed as in Appendix A3 and rounded to the third decimal.
Unlike for cMHN, the interpretations drawn from oMHN are in line with common conceptions about COAD genetic progression. Specifically, the model recapitulates the well-researched adenoma-to-carcinoma sequence, which posits that APC inactivation serves as an initial event generating a benign lesion. Only then can auxiliary drivers like TP53 elicit aggressive and invasive growth, which makes the cancer clinically conspicuous (Bürtin et al., 2020; Vogelstein et al., 2013; Yang et al., 2019).
While the biological causes for this interplay continue to be debated, there is ample observational support. In mouse models of colon cancer, APC loss is sufficient to generate benign adenomas, but these are typically spatially confined and noninvasive (Fischer et al., 2013; Johnson and Fleet, 2012). Similarly, in human clinical data, APC inactivation is largely the only driver mutation that is frequently present even in the most microscopic adenomas. Contrary to other driver alterations, its frequency is not substantially higher in more advanced lesions (Fearon, 2011). These insights suggest that APC inactivation is an initiating event in colon tumorigenesis, which is on its own rarely sufficient to generate a clinically conspicuous cancer.
On the other hand, it is known that “secondary” driver mutations, specifically KRAS activation or TP53 inactivation greatly increase tumor growth, aggressiveness, and invasive abilities in mouse models with an APC background (Hadac et al., 2015; Halberg et al., 2000; Sansom et al., 2006). However, experiments suggest that these secondary driver mutations are typically unable to generate viable colon cancers on their own (Harvey et al., 1993; Johnson and Fleet, 2012; Sansom et al., 2006).
The synergy between APC and TP53 is further supported by a systematic study of conditional selection effects in cancer genomes (Iranzo et al., 2022), which found that TP53 mutations are under particularly strong positive selection in APC-mutated colorectal cancers, and vice versa.
Taken together, we report that when correcting for the observation bias, oMHN recapitulates and quantifies established dynamics of colon cancer progression in which APC inactivation initiates rather inconspicuous lesions before synergistic codrivers cause clinical observation.
While introduction of the observation event produced markedly different results, some interactions remain consistent between cMHN and oMHN. Most notably, both models suggest a “double antagonism” between the KRAS-APC and BRAF-RNF43 pairs: each event in one pair suppresses both members of the other pair. In fact, the two event pairs likely produce similar consequences through alternative means: both event pairs deregulate the RAS and Wnt pathways. These are synergistic milestones in COAD progression (Jeong et al., 2018; Lee et al., 2018). Specifically, KRAS and BRAF mutations are alternative ways of RAS pathway deregulation (Cicenas et al., 2017; Oliveira et al., 2006), and APC and RNF43 are alternative ways of Wnt signaling deregulation (Giannakis et al., 2014; Grant et al., 2021). Additionally, the synergy within the pairs as well as the antagonism between them are clearly reflected in the conditional selection analysis of (Iranzo et al., 2022). Both points, functional similarity and conditional selection effects, support genuine antagonism between these pairs.
Interestingly, the BRAF-RNF43 pair is associated with a distinct mode of COAD progression, the Serrated Neoplasia Pathway. These cancers develop from serrated sessile lesions, with different histopathological and prognostic properties (Leggett and Whitehall, 2010). Unlike in conventional COADs, APC mutations are rare here while BRAF mutations are thought to be initial (Bettington et al., 2015; Bond et al., 2016). Experimental evidence suggests that specifically MLH1-deficient, microsatellite-instable serrated COADs rely on BRAF and RNF43 mutations in their progression (Bleijenberg et al., 2022; Yamamoto et al., 2022).
Taken together, these findings suggest that there are two prototypical ways of genetic progression in COAD. On the one hand, any combination of the synergistic triplet APC-KRAS-TP53 can be sufficient to elicit observation, although APC tends to be the initiating factor and TP53 the observation driver. On the other hand, crucial pathway deregulation can also be achieved by alternatives like BRAF and RNF43. In these cases, there is no main observation driver, and typically more alterations are accumulated before observation.
In the models on LUAD, Figure 7, we also observed a shift from widespread suppressive interactions in cMHN to observation rate increases in oMHN, most notably for EGFR mutations. EGFR mutations appear mutually exclusive with many other events in the input data. cMHN models this with widespread suppressive interactions, while oMHN explains these away by an observation rate increase. For EGFR and TP53, oMHN even suggests synergy instead of the antagonism proposed by cMHN.

Heatmap visualization of the cMHN (left) and the oMHN (right) for the LUAD dataset. In the main heatmap bodies, each cell shows the multiplicative effect Θ ij from the column event j on the row event i. Promoting effects >1 are coded in red, suppressive effects <1 in blue, and neutral effects = 1 are blank. The additional top row indicates base rates Θ ii , and the bottom row indicates effects Ω j from each column event j on the observation event. Values are rounded to the first decimal. LUAD, lung adenocarcinoma.
The chronological orders differ between cMHN and oMHN mainly for EGFR and KRAS. These events tend to occur later according to oMHN because they trigger observation, see Figure 8. According to oMHN, EGFR has a strong effect on observation on its own. Conversely, KRAS-positive LUADs elicit observation in a more concerted manner supported by, for example, ATM, STK11, and KEAP1. This is in line with in-depth genomic analyses that suggest that EGFR-driven lung cancers are largely self-sufficient, that is, they depend less on concurrent driver mutations, with TP53 co-mutations being a notable exception (Nahar et al., 2018). Moreover, there are also interactions that remain consistent between cMHN and oMHN, most prominently the suppressive relationship between EGFR and KRAS. In fact, experimental demonstration of synthetic lethality (Unni et al., 2015) and conditional selection analysis (Iranzo et al., 2022) both support a genuine antagonism. We also note the synergism between mutations in STK11 and KEAP1, which is consistent between both models (albeit reinforced by oMHN) and possibly reflects their dual role in ferroptosis protection (Wohlhieter et al., 2020).

Most probable chronological orders of events for the LUAD dataset according to the cMHN (left) and oMHN (right). Each path from the root of the tree (white circle) to a leaf represents the progression of a tumor in the dataset. The symbols along the path indicate events whose most probable chronological order was computed from the trained models. To avoid clutter, the observation event at every leaf is implied without drawing a symbol, and only tumors whose state is shared by at least three patients are drawn. The size of the edges and symbols along a path scale in the total number of patients with that tumor state.
We evaluated the accuracy of our learning algorithm in simulation experiments. To this end, we used the oMHN inferred from the COAD dataset in Section 3.1 and declared it as the ground truth model. From this model, we simulated 100 synthetic datasets of the same size as the original dataset, that is, 2269 tumors each. For each synthetic dataset, we trained a new oMHN by the same procedure as described before.
The resulting 100 models can be found in the github repository. Figure 9 shows the marginal distributions of each logarithmic entry θij over the 100 simulations. While the estimators show a slight tendency to underestimate the actual strength of the effect, we found that our method accurately recovered whether an effect exists and whether it is promoting or suppressive.

Evaluation of learning accuracy on 100 simulated datasets from a given ground truth model. Each histogram shows the distribution of an entry
Large cancer genomics datasets offer a valuable opportunity for modeling cancer progression, but many of them are observational, drawn from routine clinical practice rather than controlled trials (The AACR Project GENIE Consortium et al., 2017). This makes them prone to pervasive biases (van de Haar et al., 2019), such as the notorious confounder bias, which is due to unaccounted effects from latent variables, and the collider bias, which is due to unaccounted effects on a conditioned outcome. In this article, we have resolved an important instance of the collider bias by learning which genetic events cause the clinical observation of a tumor.
This is an important biological insight on its own, since the observation of a tumor is often tied to its size and aggressiveness. Here, we find that in COAD, observation is caused by mutations in TP53, consistent with its role as late-stage driver promoting the transition from adenomas to more aggressive carcinomas (Vogelstein et al., 2013; Yang et al., 2019). In contrast, for LUAD, observation is mainly promoted by different driver mutations activating the RTK-RAS pathway, specifically affecting EGFR, BRAF, and KRAS (Imperial et al., 2019). In addition, and perhaps more consequentially, learning such observation effects explains away spurious interactions between many other events and uncovers interactions that were previously hidden, for instance, the synergism between EGFR and TP53 mutations in lung cancer. Overall, the effects inferred by oMHN are more in line with the current literature than the effects inferred by MHN. Despite this, further validation studies, especially of the effects on the observation, are necessary. However, these would require access to data generated by prospective cohort studies.
While resolving the collider bias is a crucial step toward more reliable CPMs, other sources of confounding may still remain. Environmental exposures such as smoking, complex mutational processes, or even the tumor’s cell type of origin can significantly impact the evolution of tumors. Future work should therefore combine our approach with the modeling of these factors, either as latent variables if they are not easily measurable or as additional covariates if they can be measured directly. The observation mechanism introduces an additional source of nonidentifiability to the MHN framework. The exact extent of it is currently not fully resolved and warrants additional comprehensive investigation in future studies. Further nonidentifiability could be mitigated by exploiting known time intervals between consecutive observations (Rupp et al., 2021), such as biopsies of primary tumors and metastases. Surprisingly, Gotovos et al. (2021) have shown that classical MHNs become more identifiable by simply including more events. This, however, necessitates efficient learning algorithms such as (Georg, 2022; Gotovos et al., 2021; Klever et al., 2022). oMHN models the evolution of bulk tumor profiles and thus does not account for intratumor heterogeneity. Nevertheless, our approach could also be used to directly extend models of subclonal tumor progression (Luo et al., 2023).
More realistic models will ultimately allow us to not only understand and predict the course of cancer progression but also inform clinicians about potential outcomes of specific treatment choices and thus impact patient care directly.
DATA AND CODE AVAILABILITY
We provide a github repository at https://github.com/cbg-ethz/ObservationMHN to reproduce the results in this article. It also contains the input data that the models were trained on, alongside scripts detailing their preprocessing. The original raw data are part of the GENIE 13.1 data release and obtainable at The AACR Project GENIE Consortium (2023).
Footnotes
ACKNOWLEDGMENTS
The authors thank Tilo Wettig, Simon Pfahler, and Stefan Hansch for their helpful discussions.
AUTHORS’ CONTRIBUTIONS
Conceptualization: R.S., M.K., and K.R. Methodology: R.S., M.K.,Y.L.H., and S.V. Software: S.V. and Y.L.H. Validation: A.L. and Y.L.H. Investigation: A.L. and Y.L.H. Data curation: A.L. Writing—original draft: R.S., M.K., A.L., and Y.L.H. Writing—reviewing and editing: N.B., R.S., and K.R. Visualization: R.S. and Y.L.H. Supervision: N.B., R.S., and L.G.
AUTHOR DISCLOSURE STATEMENT
The authors declare they have no conflicts of interest.
FUNDING INFORMATION
This work was supported by the Swiss National Science Foundation grant
