Abstract
Representing the conditional mean in Poisson regression directly as a sum of smooth components can provide a realistic model of the data generating process. Here, we present an approach that allows such an additive decomposition of the expected values of counts. The model can be formulated as a penalized composite link model and can, therefore, be estimated by a modified iteratively weighted least-squares algorithm. Further shape constraints on the smooth additive components can be enforced by additional penalties, and the model is extended to two dimensions. We present two applications that motivate the model and demonstrate the versatility of the approach.
Introduction
The generalized additive model can fit a sum of smooth components to observed counts; however, the sum is on the scale of the linear predictor. For count data, the usual model is a Poisson distribution with a log-link, that is, the logarithm of the expected values is expressed as a sum of smooth components. As a consequence, the model operates as a product of components on the scale of the count data. Often an additive decomposition of the expected values themselves is a more realistic description of the data generating process. Here, we introduce such a model that describes the expected values as a sum of components. We stay with the logarithmic link function and still require smoothness of the components, but the sum is taken after the exponential, instead of before. Hence, we call this a Sum of Smooth Exponentials (SSE) model. To further motivate the model, we give two examples where such an additive decomposition is a natural approach.
X-ray crystallography allows the exploration of the molecular and atomic structure of crystals. A physical sample is rotated while it is illuminated by a beam of X-rays. Depending on the angle, the number of diffracted photons varies and they are observed and counted by an optical detector. Figure 1 shows such an X-ray diffraction (XRD) scan of a thin layer of indium tin oxide. It was analyzed in detail by Davies et al. (2008), and the data are available in the R package diffractometry. The overall signal, the photon counts, is formed by a baseline and a number of peaks. The latter characterize the type of material and its physical state (like stress). The objective of such an XRD analysis is to decompose the signal into its components, to remove the baseline and to isolate the peaks, which can be analyzed further for their position, height, symmetry, and so forth.
X-ray diffraction spectrum for indium oxide. The response y is the number of diffracted photons along twice the rotation angle (θ). The complete sequence contains measurements at 7 001 different angles between 15 and 85 degrees.
X-ray diffraction spectrum for indium oxide. The response y is the number of diffracted photons along twice the rotation angle (θ). The complete sequence contains measurements at 7 001 different angles between 15 and 85 degrees.
The second example deals with human mortality, and data are taken from the Human Mortality Database (2014). Death rates vary strongly across the age range, and so do causes of death. As a consequence, death at high ages is conceived as being different from the death of infants. Therefore, the decomposition of the mortality trajectory over age, which is shown in Figure 2, into several components has a long tradition. Heligman and Pollard (1980) suggested an additive three-component model, where the first component characterizes infant and child mortality, sharply decreasing after birth to very low values. The second component characterizes senescent mortality, which starts at low levels at middle-adult ages, around age 40, and increases exponentially. The central part, sometimes called middle-mortality, describes mortality after the onset of puberty and stretching into young adult ages. Although all causes of death contribute, this middle mortality is strongly related to risk-taking behaviour in young men and, at least historically, to mortality related to childbearing in women.
Age-specific death rates for Swiss males in 1980, ages 0 to 110 on log-scale (left) and ages 1 to 50 on original scale (right)
Siler (1983) modelled this central part as a constant hazard, and several modifications of this three-component additive model of human death rates were proposed to improve the fit to real data. However, the purely parametric forms of these models either do not fit well or they involve a rather high number of parameters. For instance, the model proposed by Heligman and Pollard (1980) attempts to describe mortality with eight or nine parameters.
In both examples, the overall mean trajectory, either of the photon counts or of the hazard of death, is seen as a sum of components, and the individual terms of the sum should be isolated in the analysis. In this article, we propose a novel approach to model and estimate such compound regression functions. While the individual components, which we assume to be smooth, are modelled on the log-scale, the final regression curve is an additive composition of these components on their original scale. Estimation of the SSE model can be achieved in a familiar framework as it can be formulated as a composite link model (Thompson and Baker 1981) with a penalty added to guarantee smoothness (Eilers 2007). Estimates are obtained by a modified version of the iteratively weighted least-squares (IWLS) algorithm.
The remainder of the article is organized as follows. In Section 2, we introduce the SSE model and demonstrate how it can be estimated as a penalized composite link model (PCML). We show its performance for the two examples presented in the introduction in Section 3 and extend the SSE model to two dimensions in Section 4. We conclude with a discussion.
The SSE model
We observe a series of counts
Hence, the mean (Equation 2.1) of the responses is
To estimate the parameters
Note that the penalties work on the level of the logarithm of the components,
The use of splines is not mandatory. For one or several of the
The composite link model (CLM) was proposed by (Thompson and Baker 1981) as a generalization of the generalized linear model (GLM; McCullagh and Nelder 1989). Instead of modelling the expectations of observed data directly, it is written as a sum of GLM components. Eilers (2007) extended the CLM with a penalty for the estimation of latent smooth components from grouped or overdispersed data, the penalized composite link model (PCLM). The SSE model is a special case of the PCLM, as we will now show. To keep the description simple, we consider a model with two components, for example, the leftmost peak in the XRD scan with its neighbourhood, say diffraction angles between 20 and 23 degrees (see Figure 1).
Let
Again, in the definition of the matrix
With the definitions in Equations (2.5) to (2.9), we now have a special case of the penalized CLM. Following (Thompson and Baker 1981) and (Eilers 2007), estimation of the model is achieved by repeatedly solving the system
Several practical issues need attention. The model is highly non-linear in the pa\-ra\-me\-ters, so reasonable starting values are important. A convenient way is to start with trial values for the
Additional features of the components, beyond smoothness, can be enforced by further penalties. In the mortality example, which we will discuss in detail in Section 3.2, we can require the senescent component to be strictly increasing with age, while the early adult component should be log-concave to obtain the hump shape. Such additional constraints, on monotonicity or on shape, can be implemented by a second penalty for the respective component (see Bollaerts et al. 2006). Incorporating such additional knowledge about the components via an additional penalty can considerably facilitate, or is even required for, the identification of the single components in complex models.
Several smoothing parameters
SSE model of X-ray diffraction spectrum
Figure 1 shows the XRD scan of indium oxide over 7 001 angles. To demonstrate the SSE approach, we focus on the first
Range identification of spike components: The signal is smoothed by
-spline Poisson regression, deliberately oversmoothing the peaks (top panel). From the first derivative of the fit (bottom panel), the positions (zero-crossings) and relevant neighbourhood (relative maximum to minimum around the zero-crossing) are extracted for defining the domain of the corresponding SSE component.
Range identification of spike components: The signal is smoothed by
-spline Poisson regression, deliberately oversmoothing the peaks (top panel). From the first derivative of the fit (bottom panel), the positions (zero-crossings) and relevant neighbourhood (relative maximum to minimum around the zero-crossing) are extracted for defining the domain of the corresponding SSE component.
The two further components
This procedure selects 122 data points for the surrounding of the first peak (angles 20.71 to 21.92) and 137 data points for the second peak (29.72 to 31.08). In both cases, we put a knot at every fifth data point, resulting in
All three components are assumed to be smooth, and the order of the difference penalty was chosen as
BIC contour plot over the two smoothing parameters for SSE model of the XRD spectrum example. The optimal values are at
and at
.
The grid extended over
The resulting estimates for the three components in the SSE model are shown in Figure 5. With this model, the
Estimates of the SSE model for the XRD spectrum data. Top panel: Observed counts, estimated baseline (dashed line) and sum of all three components (solid line). The vertical axis is clipped to better show the fit of the baseline signal. Central panel: Estimated components for the two spikes, along with the actual counts devoid of the estimated baseline. For each spike, the location of the mode as well as the angles where the component reaches its half-height are marked; the latter allows to characterize the (a)symmetry of the peak. The width of the photon distribution around the peak is described by the standard deviation (σ). Bottom panel: Deviance residuals for fitted SSE model.
Once the signal is decomposed, the individual components can be analyzed further. For the XRD data, the position and shape of the peaks are of main interest, and relevant characteristics can be extracted from the estimated components
As death rates vary over several orders of magnitude, if the full age range is considered, they are often plotted on log-scale, as was done in the left panel of Figure 2. This also allows to recognize two features of human mortality that are typical and are generally found. First, mortality of newborns, that is, for age zero, is sharply higher than the death rates for later infant ages, with a majority of deaths occurring shortly after birth due to malformations, pre-term births, birth-related complications and the like. Therefore, the first year of life is usually dealt with separately, as in classic life table construction (Chiang 1984), and constitutes a discontinuity in the mortality trajectory. Second, the exponential increase of death rates after about age 40 is clearly seen as a linear pattern on the log-scale. This part is commonly referred to as senescence-related mortality.
Nevertheless, the additive decomposition that we have in mind is for the death rates themselves, portrayed in the right panel of Figure 2 for ages between 1 and 50 years. The falling mortality of children is followed by what is commonly called the ‘accident hump’, before the exponentially increasing senescent component takes over. The death rate at age zero is excluded for the reasons given above.
Hence, we intend to separate three components in an SSE model. The first component
The data, taken from the Human Mortality Database, consist of the death counts
All three components are modelled non-parametrically by cubic
The first component
The second aging-related component
SSE model fitted to death rates, Swiss males in 1980. Top left: The three components and the overall fit, on log-scale for age range 1 to 110. Top right: SSE fit and the three components for ages 1 to 50. Bottom left: Deviance residuals for the SSE fit. Bottom right: Component
(‘accident hump’) on original scale for ages 1 to 50.
Finally, the third component
In summary, three smoothing parameters, one for each component, need to be chosen. We perform a grid search over a total of
The first component
The model can be extended so that a sequence of SSE fits can be considered. As an illustration, we extend the mortality example of Section 3.2, which analyzed data in a single year, to the case where data in a series of calendar years
We arrange the response data as a vector
The design matrices for the different components will depend on the assumptions that were made for the component. If a parametric specification is chosen for component
Non-parametric specifications again are modelled by
Alternatively, an additive model can be considered for (the logarithm of) a given component. In this case, the design matrix is
Again, shape constraints can be incorporated in the two-dimensional setting without changing the penalized IWLS algorithm. For example, if we assume a two-dimensional smooth surface for component
We apply the two-dimensional SSE model to Swiss male mortality between 1980 and 2011. The three components—see Section 3.2—are each modelled as smooth surfaces over age and time. The time dimension (31 years) is covered by 13 cubic
Components and overall fit resulting from a two-dimensional SSE model of Swiss male mortality 1980–2011. Results are shown for selected years and are plotted on log-scale. See also Figure 8.
Components and overall fit resulting from a two-dimensional SSE model of Swiss male mortality 1980–2011. Results are shown for selected years and are plotted on log-scale. See also Figure 8.
Component γ3 (accident hump) estimated from a two-dimensional SSE model for Swiss male mortality 1980–2011. The curve for 1980 is very close to the estimate that was obtained from the univariate model and was shown in Figure 6.
While smoothness in age-direction is determined separately for each component (has its own smoothing parameter), smoothness in the time-direction is assumed to be the same for all components so that only a single smoothing parameter is added. Hence, four smoothing parameters need to be chosen, and again we minimize the BIC over a grid of
The estimated components for the optimal
This component shows a general decline of death rates—the peak level in 2011 is about 20 per cent of the value in 1980—but we can also see a change in the shape of this component during the 1990s, implying rising death rates during the fourth age-decade. This suggests further research on the causes-of-death structure and/or comparisons with other countries.
The SSE model offers a flexible but computationally feasible approach to directly model the mean trajectory of count data as a sum of smooth components. We view the methodology as a decomposition device to isolate the component(s) of interest. The two applications provide good examples and illustrate the usefulness of the approach.
With the decomposition idea in mind, the user will commonly have some knowledge of the phenomenon under study and the purpose of the analysis (e.g., removal of the baseline signal and analysis of the peaks). The number of components may be known (as in the mortality example) as well as some additional properties (such as the increase of aging-related mortality). The possibility of combining parametric with smooth components in the sum of predictors is an additional attractive feature of the SSE model in this context. Concerns about identifiability remain, particularly when all components are specified non-parametrically. Implementing known (or desired) properties via additional shape-constraining penalties can be essential. However, the examples provide evidence that this is a viable and successful endeavour. The individual components are readily available once the SSE model is fit, and their specific features can be analyzed immediately. Surely the sum of the exponentials gives a fit to the overall mean trajectory, and this can be used to smooth a highly non-linear regression function; however, this comes as a pleasant by-product. We do not recommend using the SSE approach as an alternative to adaptive smoothing, if there is no intention to identify components that are of interest in their own.
Finding good starting values can be important. In all applications, we provided starting components that were the results of some regression performed on subsets of the data. We refer to the supplementary material where these preliminary steps are detailed for each example. Once such starting values are determined, the computational effort is reasonable, even if the choice of several smoothing parameters is performed by a full grid search.
In both applications, there were plenty of data, and in the mortality example, we could have easily estimated a one-dimensional model for each calendar year separately. If data are sparser, however, the two-dimensional SSE model can be particularly useful for studying the change in some (or all) of the components.
To use SSE for the automatic analysis of XRD scans, reliable location of peaks is essential. Davies et al. (2008) used zero-crossing of the derivative to locate peaks, after initial smoothing with the taut string method. Apparently, their smoother introduces side lobes that result in a complicated pattern in the derivative at the peak centres. Our experience suggests that Poisson smoothing with P-splines (de Rooi et al. 2014) and taking the derivative of the logarithm of the fit might work better.
de Rooi et al. (2014) consider the elimination of the
Davies et al. (2008) model some peaks as a sum of two strongly overlapping peaks, using parametric models. It will be interesting to investigate the limits of SSE by trying to fit sums of overlapping components.
Appendix: Comparison to adaptive smoothers
Our model is a structural smoother; it uses known characteristics of the data to decompose a signal as a sum of components with smooth logarithms. Our focus is on properties of these components, like locations and heights of peaks. However, the resulting fit to the data gives the impression as if a high quality adaptive smoother has been applied. This similarity has prompted the reviewers to ask for a comparison to adaptive smoothers. That is the subject of this appendix.
We have studied two packages,
Adaptive smoothing for the XRD spectrum data using mgcv . For the linear predictor, different number of knots were used as well as different adaptive knots. The computation time reported was measured in seconds.
Figure A1 shows data and fit, using cubic B-splines with two different settings. For the first estimation, we used 100 knots for the linear predictor and 50 knots for variable smoothness. Unwanted ripples are visible in the fit around the leftmost peak. Computation time was almost six minutes. The ripple gets smaller when 200 knots are used for the linear predictor. However, even with only 20 knots for adaptive smoothness, computation time is rather high (almost 11 minutes) and small ripples are still visible on the right tail of the first peak. We also increased the number of knots to 200 and 50 in linear predictor and variable smoothness, respectively. This setting leads to 20 minutes of computation time and, though main peaks are well described, additional undesirable peaks are also captured (not shown here).
We also tested adaptive smoother from
Even if adaptive smoothing were fast and precise, it would not help us much for our applications, where the separation of components is the goal. We would have to estimate and subtract the baseline to get good estimates of the individual peaks. We do not want to position our model as an adaptive smoother either. It can only deal with sums of smooth positive components.
