Abstract
Modeling of cancer hazards at age t deals with a dichotomous population, a small part of which (the fraction at risk) will get cancer, while the other part will not. Therefore, we conditioned the hazard function, h(t), the probability density function (pdf), f(t), and the survival function, S(t), on frailty α in individuals. Assuming α has the Bernoulli distribution, we obtained equations relating the unconditional (population level) hazard function, hU(t), cumulative hazard function, HU(t), and overall cumulative hazard, H0, with the h(t), f(t), and S(t) for individuals from the fraction at risk. Computing procedures for estimating h(t), f(t), and S(t) were developed and used to ft the pancreatic cancer data collected by SEER9 registries from 1975 through 2004 with the Weibull pdf suggested by the Armitage-Doll model. The parameters of the obtained excellent fit suggest that age of pancreatic cancer presentation has a time shift about 17 years and five mutations are needed for pancreatic cells to become malignant.
Introduction
Mathematical modeling of cancer hazards in aging is aimed at determining the relationship between the observed cancer incidences in the population with the carcinogenic processes ongoing in individuals. At the present time, many different carcinogenetic models have been developed in which the hazard of getting cancer in aging is considered as a random process. The main difference between these models is in accounting the variability of the hazard of getting cancer within individuals. Some of the models assume that all individuals initially have equal chances of getting cancer,1,2 while other models assume that within individuals these chances are randomly distributed and introduce a nonnegative random variable (a frailty) that is multiplied with the hazard to cancer. 3
Modeling of cancer hazards in aging requires the use of a theoretical hazard function and, when frailty is assumed, a frailty distribution function. As hazard functions, linear functions, 1 exponential functions, 4 beta-functions,5,6 and some other functions have been used. 7 As frailty, the gamma-distribution, compound Poisson distribution, power-variance distribution, as well as other distributions, have been utilized.8–12 Mathematically, the problem of modeling is stated as a best fitting of the hazard rates with observations using methods of regression analysis. The obtained models are useful when the utilized model parameters have apparent biological meaning and the estimated values of these parameters are consistent with the estimates presented by biological means. Unfortunately, often these models are very complicated and do not agree with the observed data. In addition, some model parameters do not have clear biological meaning, and/or their values are inconsistent with the estimates provided by the current knowledge on the carcinogenetic processes ongoing in individuals. 3
In the present work, for a better accounting for cancer hazards on individual and population levels, we utilized the main concepts of survival analysis (the survival function, S(t), the hazard function, h(t), and the probability density function (pdf), f(t)), and conditioned these functions on frailty α used on an individual level. We assumed that the population under consideration is a dichotomous one: a small part of this population (so called fraction at risk) will eventually get the cancer, 9 while the other part (immune fraction) will not. In such a case, it is appropriate to present the frailty α by the Bernoulli distribution. Based on the main concepts of survival analysis and the mathematical statistics,13,14 we developed novel equations and a computing framework to be used in carcinogenetic modeling. We suggested that these equations and the computing framework can be applied to estimate the parameters of different carcinogenic models with a given f(t) (or h(t), or S(t)) on the individual levels. As an example, we used the Weibull pdf, suggested by the Armitage-Doll multiple mutation model of carcinogenesis, and estimated the model parameters of the pancreatic cancer occurrence in aging using the corresponding data collected by SEER9 registries from 1975 through 2004.
Basic Equations for Frailty Modeling of Cancer Hazards in Aging
To determine the relationship between the estimates of the hazard function performed for the dichotomous population (population level) with the survival, hazard, and probability density functions determined on the individual level, we utilized the concept of frailty. We presented variability to cancer susceptibility between individuals in the dichotomous population using the Bernoulli distribution, which provides a binary presentation (1/0) to the statistical distribution of individuals within the considered population, assuming that “1” means that an individual belongs to the fraction at risk and eventually will get cancer, while “0” means that this individual belongs to the immune fraction and will not get cancer.
Main Concepts for Modeling of Cancer Hazards in Aging for Individuals from the Fraction at Risk
The concepts of survival function, S(t), a hazard function, h(t) and a probability density function (pdf), f(t), developed in “classical” survival analysis can be directly used for the mathematical presentation of events occurring in the fraction at risk. In this case, the event time, t, refers to the age of an individual when a cancer is diagnosed and the survival function, S(t), can be defined as:
where F(t) is a cumulative frequency function (or a cumulative distribution function, cdf), which refers to the probability that in an individual, a considered event will occur up to the time t. This function can be presented as:
By definition, a hazard function, h(t), is determined by formula 13 :
From (1)–(3) it follows:
Note that by specifying the probability density function, f(t), or the survival function, S(t), or the hazard function, h(t), the other two functions can be ascertained. For example, if f(t) is the Weibull pdf, then the corresponding hazard function, h(t), will be an exponential function of tand ln[-ln S(t)] is a linear function with ln t.
Basic Equations for Modeling of Cancer Hazards in Aging for Individuals from the Dichotomous Population
In frailty models, the basic hazard rate, h(t), is multiplied with the frailty positive random variable α. The frailty random variable expresses the extent of frailty in each individual. A large value of αreflects that an individual is highly susceptible to cancer, whereas a low value characterizes an individual that is less susceptible to cancer. The individual hazard rate and survival function conditioned on frailty can be expressed as 13 :
and
When the frailty distribution with the pdf, g(α), is known from the conditional survival function, S(t|α), which is determined on the individual level, one can obtain unconditional survival function, SU(t), and the unconditional hazard function, hU(t), on the population level. In fact, for the population under consideration, we have: 13
Note that if αis a discreet random variable, α= αN, N= 1,2…, with probability distribution g(αN) = gN, then instead of (7) we have:
and by definition: 13
Based on the fact that only a small part of the population (fraction at risk) is exposed to the cancer, while the other part of the population (immune fraction) is not, we assumed that the frailty αcan also take the value of zero (that means that the individual is immune to the cancer) and that the frailty αhas the Bernoulli distribution with the parameter, p, which is a probability that a given individual will eventually get the cancer. The Bernoulli distribution, which in mathematical statistics is usually designated as B(1,p), has the following discreet pdf:
and
In such a case, according to (6) and (8), we have:
Taking into account (1), (2), (9), and (12) we have:
For the majority of cancer types, p < 1 (because a particular type of cancer is a rare disease) and the denominator of the right side of the (13) is very close to 1. Therefore, from formula (13) it follows that hU(t) can be approximated by pf(t):
(Below, without losing generosity, we considered formula (14) as a precise equality).
According to (14), in the age-specific population, the age-specific hazard function is proportional to pdf, f(t), of ages at which individuals from the fraction at risk will get cancer. Taking into account that
On the other hand, by definition, the overall cumulative unconditional hazard rate, H0, is determined as:
where
is the unconditional cumulative hazard rate. Thus, from (15) and (16) it follows that, on the population level, the overall cumulative unconditional hazard, H0, characterizes the fraction at risk in the dichotomous population and is equal to p.
Finally, by dividing both sides of (14) on pand substituting pon H0we can obtain that:
The equation (18) relates the hazard function, hU(t), and the overall hazard, HO, on the population level with the pdf, f(t), on the individual level. The left side of (18) can be estimated from the observed data. We consider (18) as a basic equation for estimating the unknown model parameters of f(t) based on values of hU(t) and HOthat can be estimated from the observed data. As we show below, the problem of estimating the model parameters of f(t) from the values of hU(t) and HOis reduced to the problem of solving the corresponding system of the conditional equations.
Using (1) to (4) and (18), after elementary transformations, one can obtain the following two equations:
and
These equations can also be used for forming the system of the conditional equations from which unknown parameters of model hazard and survival functions can be estimated.
Computing Procedures for Modeling of Cancer Hazards in Aging
Below we propose computing procedures for modeling cancer hazards in aging by using the observed data on the population level (presented in a discrete tabulated form) and a theoretical form of the pdf, f(t), given on the individual level. In a discrete form, to solve the problem of modeling of cancer hazards in aging, the following three procedures need to be consecutively executed.
Procedure 1: Estimation of the Age-Specific Hazard Rates by the Age-Period-Cohort(Apc)Analysis
Let us assume that the observed numbers of cancer cases and the numbers of population in the equal-sized consecutive nage intervals,
and
where tiis the midpoint of the i-th age interval (i= 1,2, …, n). (Here and below symbols “^” designate the corresponding estimates.) In (21) and (22), mi,jand Pi,jare the number of cancer cases and the size of population in the i-th age interval, observed during the j-th time-period, correspondingly.
According to the log-linear age-period-cohort (LLAPC) model, the observed age-specific incidence rates can be presented as:16–18
where vjand uiare the time-period and birth-cohort effects, correspondingly, and hU(ti) are the values of the unknown hazard function at age, ti, to be estimated from the system of the conditional equations (23). In (23), the index lis defined by a linear combination of the age and time-period indexes in the following way:
From system (23) it follows that, when the time-period and birth-cohort effects are negligible (vj ≅ 1 and ul ≅ 1), the best estimates of the hazard function values,
where weights, Wi,j, can be calculated using formula (22) as:
and
When the time-period and birth-cohort effects are significant, the estimates,
Procedure 2: Estimation of the Pdf, Cdf, Hazard, and Survival Functions on the Individual Level
To estimate f(t), we will use the equation (18) that presents a relationship between the values of the hazards of getting cancer in aging and f(t). In a discrete form, the estimates
In (28),
Standard errors of
Assuming that a correlation between errors of the
Estimates of the cumulative distribution function (cdf)
where tk is the midpoint of the interval, Δk, and
Analogously, to estimate the hazard and survival functions on the individual level from the observed data in a discrete form, the systems of conditional equations can be constructed on the basis of the equations (19) and (20). For estimates of the hazard function on the individual level we obtained:
where
It is easy to show that
and
According to (34), using the standard rules of error propagation, we have:
where estimates:
For estimates of the survival function on the individual level we have:
Note that
and
where
So, we have shown that
Procedure 3: Estimation of the Model Pdf(or Cdf)Parameters
Based on the theoretical models of carcinogenesis, one might assume that the f(t) ought to follow to some nonlinear (in a general case) function f of age, t, defined with the s parameters,
with the weights, wi, determined as:
where ti is the midpoint of the i-th age interval and
Note that we have one response variable, that is the pdf of age, f(t), and one predictor variable, the age t. The predictor is assumed to be measured with no error, while the response data can be affected by an observational error. In general, to obtain the estimates
The aforementioned computational procedure can be applied for determining parameters of different pdf (or cdf) with known mathematical forms. However, for modeling of the cancer occurrence in aging, the mathematical form of the pdf (or cdf) should be chosen by considering the appropriate biological concepts, leading to carcinogenesis. The parameters of this function should have distinct biological and/or epidemiological meaning.
As an example, let us assume that according to the multiple mutation carcinogenic model, the cancer occurrence in aging on the individual level can be represent with a two- or three-parameter Weibull pdf:4,21,22
and
where the λ is a “scale” parameter, t denotes age, r is a “shape” parameter, and A is a “shift” parameter (according to designations of (43),
For two-parameter Weibull distribution, when the estimates of the cdf,
with the weights, wi, that are presented as the inverse of the squares of the standard errors of the left side of (47):
According to the standard rules of error propagation, for the standard errors of the
The λ and r parameters (and their standard errors) of the Weibull pdf can be obtained from the system (47) to (50) by methods of linear regression. Note that the estimates of the response variable,
For three-parameter Weibull distribution, the problem is to estimate the parameters λ, r, and A from the observational rates (25) and their standard errors (27). For this purpose, instead of the system of the conditional equations (47) to (50), the following system can be considered:
with weights:
where
and
To estimate the parameters of the three-parameter Weibull distribution, one can use a special technique, analogous to one presented in MATLAB, 24 as follows. For each provisional value of the shift parameter, A, the λ and r parameters of the Weibull pdf, as well as their standard errors, are obtained by methods of linear regression. To evaluate the quality of fitting of the same dataset by different regression lines, the Akaike's information corrected criterion (AIC) can be used. 25 Assuming that the scatter of points around the regression line follows a Gaussian distribution, the AIC can be defined by the following formula:
where (SS) is the weighted sum of the squares of the residuals of the system (51) with the weights (53), l is the number of observed points, K = q + 1 (q is the number of parameters used for curve fitting). When fittings of the same dataset by different regression lines are compared, it is assumed that the curve fitting is better for the line with the smallest AIC. 25 In our case, the value of the shift parameter, A, from the set of provisional values, providing the best fitting with observations, can be considered as Â.
Note that the analogous procedures can be easily developed when instead of the parametric form of the pdf,
Application of the Computing Procedures for Modeling Pancreas Cancer Data
Data Preparation
The proposed computing procedures were utilized for modeling of pancreatic carcinogenesis using SEER9 data collected from 1975 through 2004. Table 1 presents the number of pancreatic cancer cases collected in this database. Below, we consider only data for the ages older than 30 years, assuming that for younger age the data are statistically indistinguishable from zero. The first column of Table 1 presents the middle points of five-year age intervals of the human lifespan beginning from the age of 30 (30–100 years). The following six columns present a number of cases in the five-year time periods.
Number of pancreatic cancer cases (m i'j ) observed in the i-th age intervals (i = 1, …, 14) and the j-th time periods (j = 1, …, 6).
Table 2 presents the size of the population for the corresponding age intervals and time periods.
Size of population (Pi'j) in the i-th age intervals (i = 1, …, 14) and the j-th time periods (j = 1, …, 6).
Modeling of Pancreatic Carcinogenesis in Aging
To perform modeling of pancreatic cancer hazards in aging we consecutively completed three procedures of the proposed computing framework.
Procedure 1
As described in our previous work, 19 for the observed data, presented in Tables 1 and 2, we performed APC analysis using the log-linear age-period-cohort (LLAPC) model. According to this model, an age-specific incidence rate of a cancer can be presented as a product of the time-period and birth-cohort coefficients, as well as an unknown age-specific hazard function (ie, the risk function of getting the cancer at a given age). We found that for the data presented in Tables 1 and 2 the time-period and birth-cohort effects are statistically insignificant (data are not shown).
Procedure 2
By neglecting the time-period and birth-cohort effects, we obtained
Estimates of the
Figure 1 shows the discrete distribution of

Discrete distribution of the estimates of the unconditional hazard function,
Procedure 3
The
Estimates of the
Estimates of the h(t) and S(t), as well as their standard errors in each of the i-th age intervals (i = 1, …, 14).
To perform modeling of pancreatic cancer hazards in aging, we used the multiple mutation carcinogenic model, according to which the cancer occurrence in aging on the individual level can be represented with a two- or three-parameter Weibull pdf, described by formulas (45) and (46), correspondingly. Our numerical experiments suggested that, in the case of pancreatic cancer, the best fitting of the observed data can be achieved by using the three-parameter Weibull pdf. Therefore, we show the results of this modeling below (data for two-parameter Weibull pdf are not shown).
To estimate the λ, r, and A parameters of the three-parameter Weibull pdf, we used the system of conditional equations, presented by formulas (51) to (54). By varying the values of the A from 0 to 30 years with a one-year age interval, we obtained estimates of two other parameters. For each set of parameters obtained in such a way, we evaluated the quality of fitting using the AIC as described previously. Figure 2 shows a variation of the AIC (presented by open circles) with age.

Variation of the AIC with the age (in years) for the three-parameter Weibull pdf modeled function, fitted by the estimates,
As can be seen from Figure 2, the best fitting was achieved for
Figure 3 shows the discrete distribution of

Discrete distribution of f(ti), of pancreatic cancer in the age intervals, ti, obtained from the observed pancreatic cancer incidence rates and modeled by three-parameter Weibull pdf,
Figures 4 and 5 show the estimates (presented by open circles) of the hazard and survival functions on the individual level, obtained from the formulas (19) and (20) by substitution of the hU(t), HU(t) and HO by their discrete estimates. The solid lines show the modeled hazard and survival functions with

Discrete distribution of the estimates of the conditional hazard function for pancreatic cancer (individual level),

Discrete distribution of the estimates of the conditional survival function for pancreatic cancer (individual level),
Notes
The h(ti) distribution was obtained from the observed pancreatic cancer incidence rates and modeled by three-parameter Weibull hazard function,
As can be seen from these figures, the modeled functions provide excellent fittings to the estimated values of the hazard and survival functions on the individual level.
Comparison of Figures 1 and 4 showed that the trends of the hazards in aging on the population level (see Fig. 1) and on the individual level (see Fig. 4) are dramatically different. Such phenomena can be explained by the fact that cancer is a rare disease occurring in the dichotomous population, a very small part of which will eventually get the cancer, while the biggest part of the population will not get cancer.
Conclusion
This work was inspired by the fact that the existing approaches used in carcinogenic modeling are focused on the ongoing processes on the individual level, while the observed data used in this modeling are obtained on the population level. Currently used mathematical equations relating hazard functions on the individual level with hazard functions on the population level are rather arbitrary, do not have clear biological/epidemiological meaning, and do not allow one to obtain an appropriate fitting with the observed data. Particularly, these equations do not answer why the hazard functions on the population level fall at old ages, while the hazard functions on the individual level do not fall.
In this work we analyzed the relationships between cancer hazards on the population and individual levels using mathematical concepts (the hazard function, h(t), the probability density function, f(t), the survival function, S(t), and frailty, α) in survival analysis. We showed that these concepts can be adopted for analyzing the hazards of cancer occurrence in aging, assuming that the considered population is a dichotomous one: a small part of this population (the fraction at risk) will eventually get the cancer, while the other part will not. We assumed that for individuals within the dichotomous population, αhas the Bernoulli distribution, with parameter p. We used the model, hU(t) = pf(t), which is a special boundary case (when p < 1) of the general model with the Bernoulli frailty, SU(t) = 1 – p + pS(t). It should be mentioned that an analogous model was widely used in the cure models of survival analysis, 26 in which parameter pand parameters of survival function, S(t), have to be simultaneously estimated by multivariate regression. However, we showed that in the considered boundary case (p < 1), for arbitrary (parametric or non-parametric) f(t), pis equal to the overall hazard, HO, and for estimating pit is not a necessity to perform simultaneous estimation of the parameters of the individual level survival function, S(t), by multivariate regression. This allowed us to obtain three basic equations relating the unconditional (determined on population level) hazard function, hU(t), cumulative hazard function, HU(t), and overall cumulative hazard, H0, with the corresponding conditional equations (determined on the individual level) functions, h(t), f(t), and S(t), for individuals belonging to the fraction at risk: (1) hU(t)/[H0 – HU(t)] = h(t); (2) hU(t)/H0 = f(t); and (3) [H0 – HU(t)]/H0 = S(t).
One of the main advantages of these basic equations is that they have clear epidemiological meaning. Specifically, the equation hU(t)/H0 = f(t) indicates that the values of the hazard functions on the population level, hU(t), are proportional to the probability density function, f(t), on the individual level, suggesting that the shapes of these functions should be similar. In addition, in this equation, the coefficient of proportionality is the cumulative hazard, H0, that characterizes both the fraction at risk in the dichotomous population and the probability, p, that an individual will eventually get the cancer. At the same time, the other equation, hU(t)/[H0 – HU(t)] = h(t), indicates that the relationship between cancer hazards on population and individual levels depends on the age and may have different shapes. This explains why the hazard functions on the population level fall at old ages, while the hazard functions on the individual level may not fall.
Using the derived basic equations, we developed a computing framework for estimating the carcinogenic parameters from the observed cancer incidences in aging. The framework includes three procedures. The first procedure is aimed at correcting the cancer incidence rates observed on the population level on the time-period and birth-cohort effects and estimating the corresponding cancer hazards in aging (on the population level). For assessing the cancer hazards in aging, this procedure uses the LLAPC model. In the present work, we have used the procedure developed in our previous work that reduces the problem of estimating cancer hazards in aging to the problem with removable interactions. 19 The second procedure is aimed at estimating in a discrete form the pdf, cdf, hazard function, and survival function on the individual level from the cancer hazards in aging estimated by the first procedure. Finally, the third procedure is aimed at determining the model parameters of the pdf from the discrete estimates of the pdf (or cdf, or hazard function, or survival function) on the individual level, performed by the previous procedure. We showed that, in a general case, this problem can be solved by methods of nonlinear regression analysis.
As an example, we estimated the modeled parameters of pancreatic carcinogenesis, using the corresponding data collected by SEER9 registries from 1975 through 2004. We showed that, in the case of pancreatic cancer, the time-period and birth-cohort effects can be neglected. Therefore, we could use the observed incidence rates as the cancer hazards in aging. Then, using the obtained cancer hazards in aging, we estimated the pdf of pancreatic cancer in a discrete form. To obtain values of the carcinogenic parameters, we used the three-parameter Weibull pdf, suggested by the Armitage-Doll multiple mutation model. By using a special technique, we reduced the nonlinear problem of estimating three parameters of the Weibull pdf, to the problem with removable interactions and estimated parameters of this pdf that provide an excellent fitting with the observed data. The estimated values of these parameters suggest that age of pancreatic cancer presentation has a time shift about 17 years, and that, for pancreatic cells, at least five mutations are needed to become malignant. Our finding of the number of mutations required for pancreatic cells to become malignant is consistent with what is known about the required number of mutations leading to cancer occurrence in other organ sites. 27
Overall, in this work we mathematically proved that a simple assumption of a rareness of cancer in a dichotomous population (the Bernoulli frailty effect) is enough to explain why the observed incidence rates (hazard functions) on the population level fall at old ages, when the modeled hazard functions on the individual level are not falling. We derived three basic equations that relate the observed cancer hazards in aging on the population level with the hazard function, the pdf, the cdf, and the survival function on the individual level. We used these equations to develop a novel computing framework for estimating the carcinogenic parameters from the observed cancer incidences in aging. We suggest that the basic equations and computing framework developed in this work can be applied for estimating parameters of carcinogenic models with any given hazard function (or the pdf, or the cdf, or the survival function) on the individual level.
Author Contributions
Conceived and designed the experiments: TM, SS. Analysed the data: TM, SS. Wrote the first draft of the manuscript: TM, SS. Contributed to the writing of the manuscript: TM, SS. Agree with manuscript results and conclusions: TM, SS. Jointly developed the structure and arguments for the paper: TM, SS. Made critical revisions and approved final version: TM, SS. All authors reviewed and approved of the final manuscript.
Funding
This work was supported by the 5 R01 CA140940-03 (NIH, SS the PI) grant.
Competing Interests
Author(s) disclose no potential conflicts of interest.
Disclosures and Ethics
As a requirement of publication author(s) have provided to the publisher signed confirmation of compliance with legal and ethical obligations including but not limited to the following: authorship and contributorship, conflicts of interest, privacy and confidentiality and (where applicable) protection of human and animal research subjects. The authors have read and confirmed their agreement with the ICMJE authorship and conflict of interest criteria. The authors have also confirmed that this article is unique and not under consideration or published in any other publication, and that they have permission from rights holders to reproduce any copyrighted material. Any disclosures are made in this section. The external blind peer reviewers report no conflicts of interest.
Footnotes
Acknowledgements
The authors acknowledge Mr. Michael X. Gleason for help in data preparation.
