Penalized complexity priors for degrees of freedom in Bayesian P-splines

Abstract

Bayesian penalized splines (P-splines) assume an intrinsic Gaussian Markov random field prior on the spline coefficients, conditional on a precision hyper-parameter $τ$. Prior elicitation of $τ$ is difficult. To overcome this issue, we aim to build priors on an interpretable property of the model, indicating the complexity of the smooth function to be estimated. Following this idea, we propose penalized complexity (PC) priors for the number of effective degrees of freedom. We present the general ideas behind the construction of these new PC priors, describe their properties and show how to implement them in P-splines for Gaussian data.

Keywords

Bayesian P-splines degrees of freedom penalized complexity priors penalized spline regression

1 Introduction

Penalized spline (P-spline) regression is a well-established and numerically stable approach for smoothing (Eilers and Marx, 1996; Ruppert et al., 2003). Typically, P-spline components are implemented in Bayesian additive regression models (Fahrmeir et al., 2013) to fit non-linear covariate effects or higher dimensional effects such as spatial and spatio-temporal smooth trends. The P-spline approach proposed by Eilers and Marx (1996) uses equally spaced B-splines and constructs a smooth function as the sum of these B-splines scaled by spline coefficients. A regularizing penalty is assumed on these coefficients to control the degree of smoothness of the fitted function. A common approach is to penalize the sum of squared second-order differences between adjacent spline coefficients, but specific penalties can be designed to drive the fit towards desired features (Eilers and Marx, 2010). This is a very useful strategy in the presence of prior information about the shape, or degree of smoothness, of the function to be estimated.

The Bayesian approach to P-splines by Lang and Brezger (2004) assumes an intrinsic Gaussian Markov random field (IGMRF) prior on the spline coefficients. An IGMRF is a multivariate normal distribution with rank deficient precision matrix $Q (τ)$, depending on a precision hyper-parameter $τ$. Similarly to a regularizing penalty, the IGMRF forces the spline coefficients to be shrunk towards an infinitely smooth model, which we will denote as the ‘base model’. The degree of smoothness of the base model depends on the order of the IGMRF; for instance, an IGMRF of order $2$ forces shrinkage towards a linear trend, that is, a polynomial of degree one (Rue and Held, 2005).

The amount of shrinkage towards the corresponding base model depends on the IGMRF precision $τ$. The prior $π (τ)$ can have a substantial impact on the posterior distribution of the spline coefficients and hence, to some extent, on the shape of the fit. A common strategy in Bayesian P-splines is to adopt the conjugate gamma family, that is, $gamma (a, b)$, with shape $a$ and rate $b$ (Fahrmeir and Kneib, 2009; Lang and Brezger, 2004). Lang and Brezger (2004) suggest choosing $a = 1$ and small $b$, for example, $b = 5 \cdot 10^{- 4}$, leading to a diffuse prior for $τ^{- 1}$. Jullion and Lambert (2007) note that the choice of $b$ clearly affects the smoothness of the fitted curve when the sample size or the signal-to-noise ratio is small, and propose a mixture of gamma distributions with different $b$ values. Another popular choice is the $gamma (ε, ε)$, with small $ε$ (e.g., $ε = 0.001$, which is the default option in the software BayesX (Belitz et al., 2000)), as an attempt at vagueness on $τ^{- 1}$. The suitability of the gamma family as a non-informative prior for the scale parameters in hierarchical models has been debated in the literature (Gelman, 2006); overfitting due to gamma priors has been demonstrated in Frühwirth-Schnatter and Wagner (2010, 2011) and Simpson et al. (2014). In particular, in Bayesian P-splines, the main difficulty with using a gamma prior on $τ$ is that $τ$ scales differently according to the amount of noise present in the data and the number (and location) of knots selected by the user.

The present work proposes a new prior for $τ$ which is informative about model complexity and implicitly accounts for different choices about the number (and location) of knots. A suitable measure of complexity of the P-spline model is the number of ‘effective degrees of freedom’, denoted as $d$ in the following text and calculated as the trace of the hat matrix (Hastie and Tibshirani, 1990). The value $d$ relates to the degree of a polynomial equivalent to the smooth function to be estimated. An expert user who has a prior guess about the shape of this function may find it easy to elicit $d$. As an example, for a monotonic cubic trend, one may elicit $d$ in a range between $3$ and $5$ and assign very low prior probability to $d > 5$. The key point is that, in the presence of this prior information, elicitation of a range for $d$ is intuitive and immediate, whereas direct elicitation of a distribution for $τ$ is very difficult.

The challenge is to design a prior distribution on a model property (i.e., $d$ ) rather than on a parameter of the model (i.e., $τ$ ). To achieve this, we follow the penalized complexity (PC) prior approach proposed by Simpson et al. (2014). Within this framework, we derive the PC prior for $d$ and calibrate it by two intuitive parameters: $U$ , an upper bound for $d$ , and $α$ , the prior probability assigned to $d > U$ . In the aforementioned example, the user would only need to set $U = 5$ and $α = 0.01$ , or some other small value. As a further challenging point, note that $d$ depends on the noise variance characterizing the dataset. Thus, implementing the proposed PC prior for degrees of freedom in real datasets, where the noise variance is typically unknown, implies defining a joint prior on two quantities, the IGMRF precision and the noise precision.

The plan of the article is as follows. In Section 2.1, the Bayesian P-spline approach is reviewed with a focus on the challenges to be addressed in defining a prior for $τ$. The principles behind the construction of a PC prior for $τ$ are reviewed in Section 3. In Section 4, the PC prior for $d$ is derived and its properties are described in the case of known noise variance. In Section 5, we show how to implement the PC prior for $d$ when the noise variance is unknown, focusing on an additive P-spline model framework. Results from a simulation study assessing the impact of the proposed prior compared to standard gamma priors and other PC priors proposed in the recent literature are illustrated in Section 6. An application of these new priors in a P-spline model for nitrate concentrations observed in the river Oglio, Lombardy, Italy, is illustrated in Section 7. The article closes with a discussion in Section 8.

2 Background on P-splines

Let $y = (y_{1}, \ldots, y_{n})^{T}$ be observations of a response variable, $x$ be a continuous covariate, $f$ be a smooth function describing the effect of the covariate on the response and $ε$ be independent errors with zero mean and variance $τ_{ε}^{- 1}$. The P-spline model (Eilers and Marx, 1996) is $y = f (x) + ε$, $f (x) = B β$, where $B$ is an $n \times K$ basis matrix containing $K$ B-spline functions built on a set of equally spaced (for simplicity) knots within the covariate domain, while $β$ is a $K \times 1$ vector of unknown spline coefficients. The method requires selecting a generous number of knots, so as to overfit the data, and then adding a penalty on $β$ which smooths adjacent spline coefficients. In the frequentist approach, $β$ is estimated via penalized maximum likelihood, conditional on a tuning parameter regulating the degree of smoothness of $f$. The optimal tuning can be found via cross-validation (Wood, 2006) or estimated via restricted maximum likelihood in a mixed model representation (Ruppert et al., 2003, Chapter 4). P-splines are widely used in generalized additive models (Hastie and Tibshirani, 1990; Wood, 2006) or structured additive regression models (Fahrmeir et al., 2004; 2013). Higher dimensional smooth functions can also be represented as P-splines, using the tensor product of marginal B-spline bases (Currie et al., 2006; Eilers et al., 2006). For a systematic presentation of the different approaches to P-spline regression, see Ruppert et al. (2003); for an excellent review of spline methods and their applications in statistical modelling, see Hastie et al. (2009), Wakefield (2013) and Wood (2006).

2.1 Bayesian P-splines

The Bayesian approach to P-splines (Lang and Brezger, 2004) assumes an IGMRF prior on the spline coefficients,

$$π (β | τ_{β}) = (2 π)^{- \operatorname{rank} (R) / 2} (| τ_{β} R |^{*})^{1 / 2} \exp \left\{- \frac{τ_{β}}{2} β^{T} R β\right\}, \tag{2.1}$$

where the precision $Q (τ_{β})$ is given by $τ_{β} R$. Matrix $R$ is denoted as the structure of the IGMRF, that is, a $K \times K$ sparse matrix with non-zero entries indicating conditional dependencies among the spline coefficients, $τ_{β}$ is a scalar precision hyper-parameter and $| τ_{β} R |^{*}$ is the generalized determinant. Throughout the article, we will assume $R = D_{r}^{T} D_{r}$, where $D_{r}$ is a $(K - r) \times K$ matrix such that $D_{r} β = Δ^{r} β$ (Eilers et al., 2006), with $Δ^{r}$ the $r^{th}$-order difference operator. In this form, $R$ is the structure of an $r^{th}$-order random walk on $β$ (Rue and Held, 2005, Chapter 3), with $rank (R) = K - r$, where $r$ indicates the order of the IGMRF (2.1).
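To make the structure matrix concrete, the following sketch builds $D_{r}$ and $R = D_{r}^{T} D_{r}$ with NumPy and evaluates the log of density (2.1). The function names and the choice $K = 20$, $r = 2$ are illustrative assumptions, not part of the original text; the generalized determinant is computed here from the $K - r$ non-zero eigenvalues of $R$.

```python
import numpy as np

def difference_matrix(K, r):
    """(K - r) x K matrix D_r such that D_r @ beta gives r-th order differences."""
    return np.diff(np.eye(K), n=r, axis=0)

def log_igmrf_density(beta, tau_beta, R, r):
    """Log of density (2.1); the r (numerically) zero eigenvalues of R are dropped."""
    K = len(beta)
    nonzero_eig = np.linalg.eigvalsh(R)[r:]          # eigenvalues in ascending order
    log_gen_det = (K - r) * np.log(tau_beta) + np.sum(np.log(nonzero_eig))
    quad = beta @ R @ beta
    return -0.5 * (K - r) * np.log(2 * np.pi) + 0.5 * log_gen_det - 0.5 * tau_beta * quad

K, r = 20, 2                                         # illustrative sizes
D = difference_matrix(K, r)
R = D.T @ D
rank_R = np.linalg.matrix_rank(R)                    # expected to equal K - r

# A linear trend (polynomial of degree r - 1) is not penalized at all.
line = 1.0 + 0.5 * np.arange(K)
penalty_on_line = line @ R @ line

rng = np.random.default_rng(1)
beta = rng.standard_normal(K)
logdens = log_igmrf_density(beta, tau_beta=2.0, R=R, r=r)
```

The zero penalty on the linear trend is the reason why an IGMRF of order $2$ shrinks towards a straight line rather than towards zero.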

The IGMRF (2.1) describes deviation from a base model, which is a polynomial of degree $r - 1$ . The amount of deviation depends on $τ_{β}$ . A fully Bayesian specification requires priors on $τ_{β}$ and $τ_{ε}$ . Since we usually have enough information in the data to estimate $τ_{ε}$ , the prior $π (τ_{ε})$ has less impact on the fit. The hyper-parameter $τ_{β}$ enters at a lower level in the hierarchy, the data bring little information on it and the prior $π (τ_{β})$ can have a substantial impact on the posterior distribution of $β$ and, as a consequence, on the smoothness of $f$ . Therefore, specification of $π (τ_{β})$ should be as consistent as possible with the prior information actually available about the smoothness of the function to be estimated.

The marginal variance of the IGMRF (2.1), given by the diagonal elements of the generalized inverse matrix $Σ^{*} = τ_{β}^{- 1} R^{- 1}$, depends on $K$. We denote this as the ‘scaling issue’ (Sørbye and Rue, 2014), meaning that the amount of deviation from the base model depends on the number of knots. This is illustrated in Figure 1, where the two panels report the marginal standard deviation of the smooth $f (x) = B β$, for $K \in \{50, 100\}$. On the other hand, results (not shown here) show that the degree of the B-splines has little or no impact on the marginal variance of $B β$, especially when $K$ is large enough, say $K > 50$.
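The dependence of the marginal variance on $K$ can be checked numerically. The sketch below is an illustration only: it works on the coefficients $β$ rather than on $B β$, and it takes the generalized inverse to be the Moore–Penrose inverse computed from the non-zero eigenpairs of $R$, which is an assumption about the form of $Σ^{*}$.

```python
import numpy as np

def marginal_sd(K, r=2, tau_beta=1.0):
    """Marginal standard deviations of an order-r IGMRF on K coefficients."""
    D = np.diff(np.eye(K), n=r, axis=0)
    R = D.T @ D
    w, V = np.linalg.eigh(R)            # ascending eigenvalues; the first r are ~0
    w, V = w[r:], V[:, r:]              # keep the K - r non-zero eigenpairs
    var = (V ** 2 / w).sum(axis=1) / tau_beta   # diagonal of the generalized inverse
    return np.sqrt(var)

sd_50 = marginal_sd(50)
sd_100 = marginal_sd(100)
print(f"max sd, K=50:  {sd_50.max():.2f}")
print(f"max sd, K=100: {sd_100.max():.2f}")
```

With the same $τ_{β} = 1$, the two designs imply visibly different prior variability, which is why a fixed prior on $τ_{β}$ does not carry the same meaning across choices of $K$.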

2.2 Degrees of freedom

The scaling issue can be avoided if we consider building priors on the number of effective degrees of freedom (Hastie and Tibshirani, 1990), $d = tr \{{(B^{T} B + \frac{τ_{β}}{τ_{ε}} R)}^{- 1} B^{T} B\}$. If we think of the smooth $f (x) = B β$ as a polynomial, then $d$ can be thought of as the degree of this polynomial. In the presence of prior information on the degree of an equivalent polynomial, it seems sensible to design a prior for $d$, $π (d)$, instead of a prior for $τ_{β}$.

A fundamental issue when building $π (d)$ is that $d$ depends on both precisions $τ_{β}$ and $τ_{ε}$ . The former regulates the number of effective degrees of freedom, conditionally on the latter. When $τ_{ε}$ is known, the construction of $π (d)$ can be based on the prior $π (τ_{β})$ (see Section 4). When $τ_{ε}$ is unknown, the prior $π (d)$ will be specified in terms of the joint $π (τ_{β} | τ_{ε}) π (τ_{ε})$ , following a fully Bayesian approach (see Section 5).

The degrees of freedom can be reduced to

$$d = \operatorname{tr} \left\{\left(I + \frac{τ_{β}}{τ_{ε}} R (B^{T} B)^{- 1}\right)^{- 1}\right\} = \sum_{k = 1}^{K} \frac{1}{1 + \frac{τ_{β}}{τ_{ε}} v_{k}}, \tag{2.2}$$

where $v_{1}, \ldots, v_{K}$ are the eigenvalues of $R (B^{T} B)^{- 1}$, whose null space has dimension $r$ (the rank deficiency of $R$). When the factor $τ_{β} / τ_{ε}$ goes to infinity, we obtain the minimum number of degrees of freedom, $d = r$. When $τ_{β} / τ_{ε}$ goes to zero, we obtain the maximum number of degrees of freedom, $K$, corresponding to the most flexible model under the assumed IGMRF.
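As a numerical check of equation (2.2), the sketch below computes $d$ both from the trace definition and from the eigenvalues $v_{k}$, and evaluates the two limits. The helper `bspline_basis` is a hypothetical Cox–de Boor implementation on equally spaced knots, and the design ($n = 100$, $K = 20$, cubic B-splines on $(0, 1)$) is an arbitrary illustrative choice.

```python
import numpy as np

def bspline_basis(x, K, degree=3, lo=0.0, hi=1.0):
    """n x K B-spline basis on equally spaced knots (Cox-de Boor recursion)."""
    n_int = K - degree                          # knot intervals inside [lo, hi]
    h = (hi - lo) / n_int
    knots = lo + h * np.arange(-degree, n_int + degree + 1)
    xc = x[:, None]
    B = ((xc >= knots[:-1]) & (xc < knots[1:])).astype(float)   # degree 0
    for p in range(1, degree + 1):
        left = (xc - knots[:-p - 1]) / (knots[p:-1] - knots[:-p - 1])
        right = (knots[p + 1:] - xc) / (knots[p + 1:] - knots[1:-p])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B

def eigenvalues_v(B, R, r):
    """Eigenvalues of R (B'B)^{-1}, via the similar symmetric matrix L^{-1} R L^{-T}."""
    L = np.linalg.cholesky(B.T @ B)
    M = np.linalg.solve(L, np.linalg.solve(L, R).T)
    v = np.linalg.eigvalsh(0.5 * (M + M.T))
    v[:r] = 0.0                                 # null space of dimension r
    return v

def dof_trace(B, R, ratio):
    BtB = B.T @ B
    return np.trace(np.linalg.solve(BtB + ratio * R, BtB))

def dof_eig(v, ratio):
    return np.sum(1.0 / (1.0 + ratio * v))

n, K, r = 100, 20, 2
x = np.linspace(0.0, 1.0, n)
B = bspline_basis(x, K)
D = np.diff(np.eye(K), n=r, axis=0)
R = D.T @ D
v = eigenvalues_v(B, R, r)

d_trace = dof_trace(B, R, ratio=1.0)            # ratio = tau_beta / tau_eps
d_eig = dof_eig(v, ratio=1.0)
d_min = dof_eig(v, ratio=1e8)                   # approaches r
d_max = dof_eig(v, ratio=1e-10)                 # approaches K
```

Working with the eigenvalues is convenient because, once they are computed for a given design, $d$ can be evaluated for any ratio $τ_{β} / τ_{ε}$ at negligible cost.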

The prior $π (d)$ depends on the eigenvalues of $R (B^{T} B)^{- 1}$ , hence on the choice of $B$ . Hereafter, $B$ will be referred to simply as ‘design’, because it is determined by both the assumptions made by the user (location and number of knots, order of the B-splines) and the assumptions purely made by design (location and number of observations along the covariate domain). Since the degrees of freedom depend on $B$ , it follows that $π (d)$ automatically accounts for the design. This will be discussed in detail in Section 4.2.

Figure 1:

The scaling issue. The two panels show the marginal standard deviation of $f (x) = B β$ for varying dimension of the basis $B$, $K \in \{50, 100\}$, where $B$ is a matrix of cubic B-splines (coloured lines) defined over the interval $(0, 1)$ and $β$ is an IGMRF of order 2. The standard deviation (black line) is calculated as the square root of the diagonal entries of $B Σ^{*} B^{T}$, with $τ_{β} = 1$.

3 PC priors for P-splines

The PC prior framework by Simpson et al. (2014) introduces a new concept for building priors in hierarchical additive models, where the latent structure is given by the sum of a number of model components described by a small number of flexibility parameters. Each model component is seen as a flexible extension of a base model. For instance, $τ_{β}$ is a flexibility parameter for the IGMRF component $π (β | τ_{β})$ and a natural base model corresponds to $π (β | τ_{β}^{- 1} = 0)$ . Below, the four principles underpinning the construction of a PC prior for $τ_{β}$ are reviewed, following Simpson et al. (2014).

Parsimony: A simple model should be preferred unless there is enough evidence for a more flexible one. Under this principle, the prior probability mass assigned to models of increasing complexity should decay as their distance from the base model (measured in terms of model complexity) increases. In Bayesian P-splines, the IGMRF prior operates on $β$ (the object of inference), but we can extend the notion of base and flexible model to the ‘spline-modelled’ function; we denote with $f_{0} = B β_{0}$ the base model, which is a polynomial of degree $r - 1$ , and with $f = B β$ the flexible model, which reflects any deviation from such polynomial.

Measure of complexity: The Kullback–Leibler divergence (Kullback and Leibler, 1951) is assumed to evaluate the distance, $δ$ , between the complexities of two different models. We use $KLD (f | | f_{0})$ to denote the increased complexity of the flexible model $f$ with respect to the base model $f_{0}$ . Since $B$ is fixed by design, it is enough to evaluate $KLD (β | | β_{0})$ . Let $τ_{β_{0}}$ and $τ_{β}$ be the precisions of the base and flexible models, respectively; it can be shown that $KLD (β | | β_{0})$ goes to $\frac{τ_{β_{0}} K}{2 τ_{β}}$ , for $τ_{β}$ much lower than $τ_{β_{0}}$ and $τ_{β_{0}} \to \infty$ ; see a proof in Simpson et al. (2014). Finally, for convenience we take the transformation $δ = \sqrt{2 KLD (β | | β_{0})} = \sqrt{τ_{β_{0}} K / τ_{β}}$ .

Constant rate penalization: Flexible models are penalized by a constant decay rate for increasing $δ$ . Following this principle, the PC prior is defined as an exponential distribution on the distance scale, $π_{PC} (δ) = λ exp (- λ δ)$ , with constant rate $λ$ . It follows that the mode of a PC prior is always at the base model. By a change of variable and setting the rate $λ = θ / \sqrt{K τ_{β_{0}}}$ , Simpson et al. (2014) obtain the PC prior for $τ_{β}$ as,

$$\begin{aligned} π_{PC} (τ_{β}) & = λ \exp \left(- λ \sqrt{τ_{β_{0}} K / τ_{β}}\right) \left|\frac{\partial \sqrt{τ_{β_{0}} K / τ_{β}}}{\partial τ_{β}}\right| \\ & = \frac{θ}{2} τ_{β}^{- 3 / 2} \exp \left(- θ / \sqrt{τ_{β}}\right), \end{aligned} \tag{3.1}$$

which is a $Gumbel (1 / 2, θ)$ type 2 distribution, with $θ > 0$.

User-defined scaling: Often, the user has an idea about the size of an interpretable transformation of the original parameter $τ_{β}$, say $h (τ_{β})$ (e.g., degrees of freedom). In this case, the user may elicit an upper bound $U$ for $h (τ_{β})$ and set a prior probability $α$ for the tail event, that is, $α = \Pr (h (τ_{β}) > U)$. Simpson et al. (2014) suggest bounding the marginal standard deviation, $1 / \sqrt{τ_{β}}$. To obtain $θ$ in equation (3.1), it is enough to specify $(U, α)$ and solve $\Pr (1 / \sqrt{τ_{β}} > U) = α$ for $θ$, which yields $θ = - log (α) / U$.
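The PC prior (3.1) and its calibration on the standard deviation scale can be sketched in a few lines. The c.d.f. used below, $F (τ_{β}) = \exp (- θ / \sqrt{τ_{β}})$, follows from integrating (3.1); the function names, the values $U = 1$, $α = 0.01$ and the Monte Carlo check are illustrative assumptions. The sampler relies on the fact that $θ / \sqrt{τ_{β}}$ is exponential with unit rate under (3.1).

```python
import numpy as np

def pc_prior_density(tau_beta, theta):
    """Density (3.1): Gumbel(1/2, theta) type 2."""
    return 0.5 * theta * tau_beta ** (-1.5) * np.exp(-theta / np.sqrt(tau_beta))

def pc_prior_cdf(tau_beta, theta):
    return np.exp(-theta / np.sqrt(tau_beta))

def theta_from_sd_bound(U, alpha):
    """Solve Pr(1 / sqrt(tau_beta) > U) = alpha for theta."""
    return -np.log(alpha) / U

U, alpha = 1.0, 0.01                      # illustrative calibration
theta = theta_from_sd_bound(U, alpha)

# Pr(1/sqrt(tau) > U) = Pr(tau < 1/U^2) = F(1/U^2)
tail_exact = pc_prior_cdf(1.0 / U ** 2, theta)

# Monte Carlo check: theta / sqrt(tau) ~ Exp(1), so tau = (theta / E)^2
rng = np.random.default_rng(0)
tau_draws = (theta / rng.exponential(size=200_000)) ** 2
tail_mc = np.mean(1.0 / np.sqrt(tau_draws) > U)

# The density is the derivative of the c.d.f. (central finite difference)
t0, h = 2.0, 1e-5
fd = (pc_prior_cdf(t0 + h, theta) - pc_prior_cdf(t0 - h, theta)) / (2 * h)
```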

PC priors can be helpful as ‘default’ priors in complex hierarchical models where, typically, ‘it is difficult to elicit information about structural parameters that are further down the model hierarchy’ (Simpson et al., 2014). In addition, the user-defined scaling approach makes it possible to build ‘informative’ priors for the original parameter $τ_{β}$ or for a property of the associated model component, by tuning two intuitive parameters $U$ and $α$. In the next section, we introduce a new scaling approach to derive the PC prior for the degrees of freedom of a P-spline model component $B β$. Other approaches are possible. For instance, Klein and Kneib (2015) recently proposed PC priors for the scale (or range of variation) of $B β$ and showed via simulation that these outperformed the gamma family in cases where the data are weakly informative and/or the size of the effects is close to the base model.

4 PC priors for degrees of freedom

4.1 A new scaling approach

With no loss of generality, we derive the PC prior for degrees of freedom and study its properties under the assumption that the noise precision $τ_{ε}$ is known. Given $τ_{ε}$ , denote as $d (τ_{β}) = \sum_{k = 1}^{K} \frac{1}{1 + \frac{τ_{β}}{τ_{ε}} v_{k}}$ the function mapping the precision $τ_{β}$ into the number of effective degrees of freedom, following equation (2.2). (Hereafter, $d (τ_{β})$ will be referred to as the mapping). Figure 2 shows the mapping $d$ in the log precision scale for $τ_{ε} = 1$ and various designs (left panel) and for a specific design and varying $τ_{ε}$ (right panel).

We introduce a new user-defined scaling operating not directly on $τ_{β}$ but on $d (τ_{β})$ . Let $U$ be an upper bound for $d (τ_{β})$ and $α$ a (small) probability associated to the tail event,

$$α = \Pr (d (τ_{β}) > U) = \Pr (τ_{β} < d^{- 1} (U)) = F (d^{- 1} (U)),$$

where $F$ is the cumulative distribution function (c.d.f.) of the $Gumbel (1 / 2, θ)$ type 2 distribution. The PC prior resulting from this new scaling is a Gumbel type 2 as in equation (3.1) with $θ = - log (α) \sqrt{d^{- 1} (U)}$. In the following text, $π_{PC} (d)$ will denote the ‘induced PC prior for degrees of freedom’, with $U \in (r, K)$ and $α \in (0, 1)$ the parameters specifying the distribution.
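Since the mapping $d (τ_{β})$ has no closed-form inverse, $d^{- 1} (U)$ has to be found numerically. The sketch below uses bisection on the $log (τ_{β})$ scale, exploiting the fact that $d$ is decreasing in $τ_{β}$. To keep it short, it assumes the simplified design $B = I$ (so that the $v_{k}$ are the eigenvalues of $R$); this design, the function names and the bisection bounds are assumptions made for illustration.

```python
import numpy as np

def dof(tau_beta, v, tau_eps=1.0):
    """Mapping d(tau_beta) of equation (2.2) for given eigenvalues v."""
    return np.sum(1.0 / (1.0 + tau_beta / tau_eps * v))

def inverse_dof(U, v, tau_eps=1.0, lo=-40.0, hi=40.0, iters=200):
    """Solve d(tau_beta) = U by bisection on log(tau_beta)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if dof(np.exp(mid), v, tau_eps) > U:
            lo = mid          # still too flexible: a larger precision is needed
        else:
            hi = mid
    return np.exp(0.5 * (lo + hi))

def theta_from_dof_bound(U, alpha, v, tau_eps=1.0):
    """theta = -log(alpha) * sqrt(d^{-1}(U))."""
    return -np.log(alpha) * np.sqrt(inverse_dof(U, v, tau_eps))

K, r = 20, 2
D = np.diff(np.eye(K), n=r, axis=0)
v = np.linalg.eigvalsh(D.T @ D)     # eigenvalues of R (B'B)^{-1} when B = I
v[:r] = 0.0                         # exact null space of dimension r

U, alpha = 5.0, 0.01
tau_U = inverse_dof(U, v)
theta = theta_from_dof_bound(U, alpha, v)

# Pr(d > U) = Pr(tau_beta < d^{-1}(U)) = F(d^{-1}(U))
prob_exceed = np.exp(-theta / np.sqrt(tau_U))

# A noisier dataset (smaller tau_eps) changes the scaling parameter.
theta_noisy = theta_from_dof_bound(U, alpha, v, tau_eps=0.1)
```

In this sketch the scaling parameter depends on the design only through the eigenvalues `v` and on the noise level through `tau_eps`, which mirrors how the induced prior adapts to both.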

4.2 Invariance under design

While the PC prior for $τ_{β}$ in equation (3.1) does not take into account any information regarding the adopted design, the induced PC prior for $d$ does. Indeed, different designs return different mappings $d$ (see the left panel in Figure 2), which implies a desirable property: the PC prior $π_{PC} (d)$ is ‘invariant under design’; the term invariant here applies to the interpretation of the PC prior in terms of degrees of freedom, not to the density. Figure 3 illustrates this property. The density of $π_{PC} (d)$ , with parameters $(U = 5, α = 0.01)$ , is displayed both in the $d$ scale (left panel) and $\log (τ_{β})$ scale (right panel), for different designs. By changing the design, the range of $d$ also changes and the density $π_{PC} (d)$ adapts accordingly (Figure 3, left panel). However, even if $π_{PC} (d)$ materializes differently in different designs, the probability mass assigned to $d > U$ is always $α$ . In the $log (τ_{β})$ scale, the location of the PC prior is shifted between different designs (Figure 3, right panel). In our opinion, this shows well how difficult it is to define priors for degrees of freedom in the original scale: In this case, the user would have to figure out the correct location of $π_{PC} (log (τ_{β}))$ and shift it according to the adopted design.

Figure 2:

Mapping the degrees of freedom. The plot on the left shows the mapping $d$ in the $log (τ_{β})$ scale, conditional on $τ_{ε} = 1$, for four designs (choices of $K$ and $n$). The dotted horizontal line at $d = 2$ indicates the base model (assuming an IGMRF of order 2 on the spline coefficients). The plot on the right shows $d$ as a function of both $log (τ_{ε})$ and $log (τ_{β})$, for the specific design $\{K = 20, n = 50\}$.

The PC prior $π_{PC} (d)$ plotted in the left panel of Figure 3 has been obtained numerically. Let $π_{PC} (τ_{β}^{'})$ be the PC prior equation (3.1) evaluated at some predefined $τ_{β}^{'} > 0$ (a convenient approach is to take $log (τ_{β}^{'})$ on a regular grid inside $(- 20, 20)$ ) and $d^{'} = d (τ_{β}^{'})$ be the associated degrees of freedom computed by equation (2.2). The induced PC prior evaluated at $d^{'}$ is $π_{PC} (d^{- 1} (d^{'})) |\frac{\partial d^{- 1} (d^{'})}{\partial d^{'}}|$ .
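The numerical recipe just described can be sketched as follows. The Jacobian $|\partial d^{- 1} (d^{'}) / \partial d^{'}|$ is obtained as $1 / |d^{'} (τ_{β})|$ from the analytic derivative of equation (2.2). The design $B = I$, the grid size and the interpolation used to locate $d^{- 1} (U)$ are assumptions made to keep the example short.

```python
import numpy as np

K, r, tau_eps = 20, 2, 1.0
U, alpha = 5.0, 0.01

D = np.diff(np.eye(K), n=r, axis=0)
v = np.linalg.eigvalsh(D.T @ D)     # eigenvalues of R (B'B)^{-1} when B = I
v[:r] = 0.0

def dof(tau):
    """Mapping (2.2), vectorized over a grid of tau_beta values."""
    return np.sum(1.0 / (1.0 + np.outer(tau, v) / tau_eps), axis=1)

def dof_derivative(tau):
    """Analytic derivative of the mapping with respect to tau_beta (negative)."""
    a = np.outer(tau, v) / tau_eps
    return -np.sum((v / tau_eps) / (1.0 + a) ** 2, axis=1)

# Regular grid for log(tau_beta) inside (-20, 20)
log_tau = np.linspace(-20.0, 20.0, 4001)
tau = np.exp(log_tau)
d_grid = dof(tau)                   # decreasing from about K to about r

# Approximate d^{-1}(U) by interpolation on the grid, then calibrate theta
tau_U = np.exp(np.interp(U, d_grid[::-1], log_tau[::-1]))
theta = -np.log(alpha) * np.sqrt(tau_U)

# PC prior (3.1) on the tau_beta scale, then change of variable to d
prior_tau = 0.5 * theta * tau ** (-1.5) * np.exp(-theta / np.sqrt(tau))
prior_d = prior_tau / np.abs(dof_derivative(tau))

def trapezoid(y, x):
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.abs(np.diff(x)))

# Prior mass assigned to d > U; it should be close to alpha
above = d_grid > U
mass_above_U = trapezoid(prior_d[above], d_grid[above])
print(f"Pr(d > U) ~ {mass_above_U:.4f}")
```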

Figure 3:

Invariance under design. Both panels show the PC prior density $π_{PC} (d)$ , for $(U = 5, α = 0.01)$ and $τ_{ε} = 1$ , in two different scales: $d$ (left panel) and $log (τ_{β})$ (right panel). In the left panel, the dotted vertical line at $d = 2$ indicates the base model (assuming an IGMRF of order 2 on $β$ ), while the red tick indicates the upper bound ( $U = 5$ ) for degrees of freedom. Even if the PC prior density changes for varying $K$ and $n$ , the probability assigned to $d > 5$ is always $0.01$ . In this particular example, the density change between $n = 50$ and $n = 100$ is evident in the $log (τ_{β})$ scale (right panel), but not in the $d$ scale (left panel), where the solid and dashed lines appear superimposed (however, they are not the same because the eigenvalues of $R (B^{T} B)^{- 1}$ are different).

4.3 Behaviour near the base model

According to Simpson et al. (2014), the prior $π (δ)$ , where $δ$ is the distance from the base model, is said to overfit, or to force overfitting, if $π (0) = 0$ . Theorem 1 in Simpson et al. (2014) states that if $π (τ_{β})$ is an absolutely continuous prior for the IGMRF precision $τ_{β}$ , with $E (τ_{β}) < \infty$ , this prior overfits. The commonly used $gamma (a, b)$ , $a, b > 0$ , with $a / b < \infty$ falls in this class of overfitting priors.

PC priors avoid overfitting by construction: By applying the first three principles outlined in Section 3, the mode of a PC prior is always at the base model. The fourth principle essentially allows the user to specify the penalty for increasing distances from the base model. The following result describes the behaviour near the base model for both the PC priors for degrees of freedom and the distribution of the degrees of freedom induced by a $gamma (a, b)$ on $τ_{β}$, which we denote as $π_{G} (d)$. (In the result below, we consider $θ = - log (α) \sqrt{d^{- 1} (U)}$, where $d$ is the mapping given a positive and finite $τ_{ε}$.)

Result: Let $r$ be the dimension of the null space of $R$ in equation (2.1), then $π_{PC} (d) \to \infty$ as $d \to r$ , for $θ > 0$ , and $π_{G} (d) \to 0$ as $d \to r$ , for $a, b > 0$ .

The proof is given in Appendix 9. The density $π_{PC} (d)$ goes to infinity as $d$ approaches the base model, avoiding overfitting. Instead, the gamma-induced $π_{G} (d)$ does not prevent overfitting, as it repels the base model. In Figure 4, the gamma-induced priors with $a = 1, b = 5 \cdot 10^{- 4}$ (left panel) and $a = 10^{- 3}, b = 10^{- 3}$ (right panel) are displayed under four different designs. These two priors have different interpretations in terms of degrees of freedom: the first favours over-smoothing, while the latter favours overfitting. For both choices of $a$ and $b$, the base model is repelled at a different rate according to design. Indeed, for $a$ and $b$ fixed, the density $π_{G} (d)$ clearly changes with design: in general, gamma priors are not invariant under design as a consequence of the scaling issue discussed in Section 2.1.
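The contrast stated in the Result can be illustrated numerically (this is an illustration, not a substitute for the proof). The sketch evaluates both induced log densities along a sequence of increasing $τ_{β}$, that is, as $d \to r$. The design $B = I$, the value $θ = 1$ and the chosen values of $τ_{β}$ are arbitrary assumptions.

```python
import numpy as np
from math import lgamma

K, r, tau_eps = 20, 2, 1.0
D = np.diff(np.eye(K), n=r, axis=0)
v = np.linalg.eigvalsh(D.T @ D)     # eigenvalues of R (B'B)^{-1} when B = I
v[:r] = 0.0

def dof(tau):
    return np.sum(1.0 / (1.0 + tau * v / tau_eps))

def log_jacobian(tau):
    """log |d tau_beta / d d| = -log |d'(tau_beta)|."""
    slope = np.sum((v / tau_eps) / (1.0 + tau * v / tau_eps) ** 2)
    return -np.log(slope)

def log_pc_induced(tau, theta):
    """Log of the PC prior (3.1) transformed to the d scale."""
    return (np.log(0.5 * theta) - 1.5 * np.log(tau)
            - theta / np.sqrt(tau) + log_jacobian(tau))

def log_gamma_induced(tau, a, b):
    """Log of the gamma(a, b) prior on tau_beta transformed to the d scale."""
    return (a * np.log(b) - lgamma(a) + (a - 1.0) * np.log(tau)
            - b * tau + log_jacobian(tau))

taus = [1e4, 1e6, 1e8]              # increasing precision: d approaches r
d_values = [dof(t) for t in taus]
log_pc = [log_pc_induced(t, theta=1.0) for t in taus]
log_g1 = [log_gamma_induced(t, a=1.0, b=5e-4) for t in taus]
log_g2 = [log_gamma_induced(t, a=1e-3, b=1e-3) for t in taus]

for t, d, lp, lg in zip(taus, d_values, log_pc, log_g1):
    print(f"tau={t:.0e}  d={d:.6f}  log pi_PC(d)={lp:8.2f}  log pi_G(d)={lg:12.2f}")
```

Along this sequence the PC-induced log density keeps increasing, while both gamma-induced log densities fall sharply, in line with the behaviour described above.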

Figure 4:

The distribution of the number of effective degrees of freedom induced by a gamma prior with $a = 1, b = 5 \cdot 10^{- 4}$ (left panel) and $a = 10^{- 3}, b = 10^{- 3}$ (right panel), under four different designs. The base model is at $d = 2$ , assuming an IGMRF prior of order 2 on $β$ .

5 P-splines with a joint prior on $(τ_{β}, τ_{ε})$

So far we have worked under the assumption of known noise precision $τ_{ε}$ . The number of effective degrees of freedom is a function of $τ_{β}$ , which scales differently according to the level of noise present in the data (see the right panel in Figure 2). Knowing $τ_{ε}$ is then crucial to scale the PC prior correctly, in order to guarantee the upper bound for degrees of freedom specified by the user.

The noise precision is typically unknown in applications. One could estimate $τ_{ε}$ from the data and then specify the PC prior for $d$ conditional on this estimate; this strategy has been proposed by Fong et al. (2010) to define gamma-induced priors for degrees of freedom. We, instead, adopt a fully Bayesian model,

$$y | β, τ_{ε} \sim N (B β, τ_{ε}^{- 1} I), \tag{5.1}$$

$$β | τ_{β} \sim N (0, τ_{β}^{- 1} R^{- 1}), \tag{5.2}$$

$$τ_{β} | τ_{ε} \sim Gumbel (1 / 2, θ (τ_{ε})), \tag{5.3}$$

$$τ_{ε} \sim π (τ_{ε}) \propto 1 / τ_{ε}, \tag{5.4}$$

where equations (5.3) and (5.4) specify a joint prior $π (τ_{β}, τ_{ε}) = π_{PC} (τ_{β} | τ_{ε}) π (τ_{ε})$. The scaling parameter in equation (5.3) is given by $θ (τ_{ε}) = - log (α) \sqrt{d^{- 1} (U | τ_{ε})}$, where $d (\cdot | τ_{ε})$ is the mapping conditional on the random noise precision $τ_{ε}$. We use the improper $π (τ_{ε}) \propto 1 / τ_{ε}$ since the data usually contain sufficient information with respect to $τ_{ε}$. The joint prior in equations (5.3) and (5.4) corresponds to the induced PC prior for $d$ conditional on a random $τ_{ε}$, with $U \in (r, K)$ and $α \in (0, 1)$ the parameters specifying the distribution.

We developed a Markov chain Monte Carlo (MCMC) algorithm to fit the model in equations (5.1)–(5.4); see the pseudo-code reported in Appendix A.2, algorithm 1. The algorithm includes a Metropolis–Hastings step to jointly update $(τ_{ε}, τ_{β}, β)$. In our experience, block updating ensures good mixing and fast convergence of the proposed MCMC algorithms; see Rue and Held (2005) for details on block updating in hierarchical models with GMRF components. A brief description of how algorithm 1 works follows. At iteration $j$, both precision parameters (here denoted simply as $τ$) are sampled from the proposal distribution adopted in Knorr-Held and Rue (2002): $τ^{*} = t τ^{(j - 1)}$, where $τ^{*}$ and $τ^{(j - 1)}$ are, respectively, the proposed and current values at iteration $j$, $t$ is random with density $π (t) \propto 1 + 1 / t$, for $t \in [1 / T, T]$, and $T > 1$ is a tuning parameter; in our experience, setting $T$ approximately equal to $1.5$ works well in most applications. The resulting proposal density $q (\cdot)$ has the advantage that the ratio $q (τ^{*} | τ^{(j - 1)}) / q (τ^{(j - 1)} | τ^{*})$ equals one; for more details on this, see Rue and Held (2005), Chapter 4.2. Given $τ_{ε}^{*}$ and $τ_{β}^{*}$, we draw $β^{*}$ from the full conditional $π (β | τ_{ε}^{*}, τ_{β}^{*}, y)$. Within this scheme, the acceptance probability for $(τ_{ε}^{*}, τ_{β}^{*}, β^{*})$ simplifies to (dropping the superscript $(j - 1)$ to ease the notation)

$$\begin{aligned} a & = \min \left(1, \frac{π (τ_{ε}^{*}, τ_{β}^{*}, β^{*} | y)}{π (τ_{ε}, τ_{β}, β | y)} \frac{π (β | τ_{ε}, τ_{β}, y)}{π (β^{*} | τ_{ε}^{*}, τ_{β}^{*}, y)} \frac{q (τ_{ε} | τ_{ε}^{*})}{q (τ_{ε}^{*} | τ_{ε})} \frac{q (τ_{β} | τ_{β}^{*})}{q (τ_{β}^{*} | τ_{β})}\right) \\ & = \min \left(1, \frac{π (τ_{ε}^{*}, τ_{β}^{*} | y)}{π (τ_{ε}, τ_{β} | y)}\right), \end{aligned} \tag{5.5}$$

where

$$π (τ_{ε}^{*}, τ_{β}^{*} | y) \propto \frac{π (y | β^{*}, τ_{ε}^{*}) π (β^{*} | τ_{β}^{*}) π_{PC} (τ_{β}^{*} | τ_{ε}^{*}) π (τ_{ε}^{*})}{π (β^{*} | τ_{ε}^{*}, τ_{β}^{*}, y)} .$$

Computing the acceptance probability in equation (5.5) only requires the marginal for the precision parameters $(τ_{ε}, τ_{β})$ evaluated at the proposed and current values. Note that rescaling $π_{PC} (τ_{β} | τ_{ε})$ according to the proposed $τ_{ε}^{*}$ is needed before accepting/rejecting $(τ_{ε}^{*}, τ_{β}^{*}, β^{*})$; this implies re-evaluating the inverse mapping $d^{- 1} (\cdot | τ_{ε})$, given $τ_{ε}^{*}$, and recomputing $θ (τ_{ε}^{*})$ at each iteration.
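The scaling proposal for the precisions can be sketched as below. Sampling $t$ as a two-component mixture (a uniform part for the constant term and a log-uniform part for the $1 / t$ term) is one possible exact scheme and is an assumption of this sketch, as are the function names; it is not necessarily how algorithm 1 draws $t$. The sketch also checks numerically that the proposal density is symmetric in its two arguments.

```python
import numpy as np

def sample_scale_factor(T, rng, size):
    """Draw t with density proportional to 1 + 1/t on [1/T, T]."""
    w_unif = T - 1.0 / T            # integral of the constant part 1
    w_log = 2.0 * np.log(T)         # integral of the 1/t part
    use_unif = rng.random(size) < w_unif / (w_unif + w_log)
    t_unif = rng.uniform(1.0 / T, T, size)
    t_log = np.exp(rng.uniform(-np.log(T), np.log(T), size))
    return np.where(use_unif, t_unif, t_log)

def proposal_density(tau_new, tau_old, T):
    """Density of tau_new = t * tau_old under the scaling proposal."""
    t = tau_new / tau_old
    norm = T - 1.0 / T + 2.0 * np.log(T)
    inside = (t >= 1.0 / T) & (t <= T)
    return np.where(inside, (1.0 + 1.0 / t) / (norm * tau_old), 0.0)

T = 1.5
rng = np.random.default_rng(42)
t_draws = sample_scale_factor(T, rng, size=200_000)

# One proposal step for a current precision value
tau_current = 3.0
tau_proposed = tau_current * sample_scale_factor(T, rng, size=1)[0]

# Symmetry: q(tau* | tau) equals q(tau | tau*)
q_forward = float(proposal_density(tau_proposed, tau_current, T))
q_backward = float(proposal_density(tau_current, tau_proposed, T))

# Theoretical mean of t, for a sanity check of the sampler
norm = T - 1.0 / T + 2.0 * np.log(T)
mean_theory = (0.5 * (T ** 2 - T ** -2) + (T - 1.0 / T)) / norm
```

Because the proposal ratio equals one, only the ratio of the marginal posteriors of the precisions remains in the acceptance probability (5.5).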

5.1 Additive P-splines

We now focus on an additive P-spline modelling framework, where the linear predictor is the sum of a number of smooth functions. Let $y$ be a Gaussian response and $x_{j}$, $j = 1, \ldots, J$, be a set of $J$ continuous covariates; the model is

$$\begin{aligned} y & = \sum_{j = 1}^{J} f_{j} (x_{j}) + ε, \quad ε \sim N (0, τ_{ε}^{- 1}), \\ f_{j} (x_{j}) & = B_{j} β_{j}, \quad j = 1, \ldots, J, \\ β_{j} | τ_{β_{j}} & \sim N (0, τ_{β_{j}}^{- 1} R^{- 1}), \end{aligned}$$

where $B_{j}$ is the $n \times K_{j}$ B-spline basis matrix and $β_{j}$ the vector of spline coefficients associated with the smooth function $f_{j}$; with no loss of generality, we consider the same number of knots for all $j$, yielding $K = K_{j}$, $j = 1, \ldots, J$. We assume the joint prior $\prod_{j = 1}^{J} π_{PC} (τ_{β_{j}} | τ_{ε}) π (τ_{ε})$, where $π (τ_{ε}) \propto 1 / τ_{ε}$ and $π_{PC} (τ_{β_{j}} | τ_{ε}) = Gumbel (1 / 2, θ_{j} (τ_{ε}))$, $j = 1, \ldots, J$. The scaling parameter is $θ_{j} (τ_{ε}) = - log (α_{j}) \sqrt{d_{j}^{- 1} (U_{j} | τ_{ε})}$, where $(U_{j}, α_{j})$ are the parameters calibrating the induced PC prior for the degrees of freedom of $f_{j}$. The mapping for the degrees of freedom of $f_{j}$ is given by $d_{j} (τ_{β_{j}} | τ_{ε}) = tr \{{({B_{j}}^{T} B_{j} + \frac{τ_{β_{j}}}{τ_{ε}} R)}^{- 1} {B_{j}}^{T} B_{j}\}$.

Identifiability constraints are important in additive P-splines. The IGMRF prior on $β_{j}$ controls deviations of the smooth term $B_{j} β_{j}$ from a polynomial base model; in the following, for simplicity, we consider an IGMRF of order 2 (which forces shrinkage towards a linear base model). All smooths $B_{j} β_{j}$ include the linear base model, thus they all compete to capture the mean of the data. To ensure identifiability, we adopt the following reparametrization,

$$y = μ + \sum_{j = 1}^{J} x_{j} γ_{j} + \sum_{j = 1}^{J} B_{j} β_{j}^{ULC} + ε, \tag{5.6}$$

where $μ$ is the intercept, $γ_{j}$ is the slope coefficient for covariate $x_{j}$ and $β_{j}^{ULC}$ are the spline coefficients $β_{j}$ under the two following linear constraints: $[c^{T} B_{j}] β_{j} = 0$ and $[l^{T} B_{j}] β_{j} = 0$, with ‘constant’ vector $c = 1_{n}$ and ‘line’ vector $l = {[1, 2, \ldots, n]}^{T}$. In this way, the smooth term $B_{j} β_{j}^{ULC}$ captures residual variations from the linear base model $μ + γ_{j} x_{j}$. In other words, the constrained model (5.6) allows each smooth component to be identified, by separating the linear and flexible terms which coexist in each $B_{j} β_{j}$. Model (5.6) can be expressed in compact form as

$$y = X γ + B β^{ULC} + ε, \tag{5.7}$$

where $B = [B_{1} : \ldots : B_{J}]$ is the $n \times (KJ)$ joint basis matrix, $β^{ULC}$ is the joint vector of spline coefficients subject to the linear constraints, $X = [1_{n} : x_{1} : \ldots : x_{J}]$ is the $n \times (J + 1)$ matrix of covariates with an additional column of ones for the intercept term and $γ$ is the vector of fixed effects. We assume $γ \sim N (0, τ_{γ}^{- 1} I_{J + 1})$ with a small precision, for example, $τ_{γ} = 10^{- 4}$, as a prior for the fixed effects. Other covariates can be added to $X$ in model (5.7), if we assume them to have a simple linear effect.

We wrote a block updating MCMC algorithm implementing the joint prior in the model described above; pseudo-code is given in Appendix A.2, algorithm 2. This includes Metropolis–Hastings steps to jointly update the blocks $(τ_{ε}, γ)$ and $(τ_{β_{j}}, β_{j}^{ULC})$, for each $j = 1, \ldots, J$; in our experience, this scheme gives good mixing (and convergence) properties. Algorithm 2 presents two main changes with respect to algorithm 1. First, rescaling the conditional PC prior $π_{PC} (τ_{β_{j}} | τ_{ε})$ is no longer necessary at each iteration; $θ (τ_{ε}^{*})$ must be recomputed only when the candidate $(τ_{ε}^{*}, γ^{*})$ is accepted. Second, the spline coefficients are sampled under linear constraints. To do this, we use the algorithm proposed in Rue and Held (2005), Chapter 2, which first samples the unconstrained coefficients and then ‘corrects’ them for the constraints. To compute the acceptance probability for the candidate $(τ_{β_{j}}^{*}, β_{j}^{* ULC})$ in step (14) of algorithm 2 (Appendix A.2), the full conditional density $π (β^{*} | τ_{β}^{*}, y)$ is evaluated at the constrained $β_{j}^{* ULC}$; for computational details, see Rue and Held (2005), formula 2.31.
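The ‘correction’ step can be illustrated with the standard conditioning-by-kriging formula, $x^{c} = x - Q^{- 1} A^{T} (A Q^{- 1} A^{T})^{- 1} A x$, assuming this is the correction referred to above. The sketch below is plain Python with several assumptions of its own: a toy basis of linear B-splines, a dense solver, $Q = τ_{ε} B^{T} B + τ_{β} R$ as the precision of the unconstrained full conditional, and an arbitrary random vector standing in for a proper unconstrained draw. The function name `correct_for_constraints` is hypothetical.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, Y):
    """Solve A X = Y (Y a matrix) by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + list(Y[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def correct_for_constraints(x, Q, A):
    """Map x to x - Q^{-1} A' (A Q^{-1} A')^{-1} A x, which satisfies A x = 0."""
    V = solve(Q, transpose(A))                      # K x m block Q^{-1} A'
    W = matmul(A, V)                                # m x m matrix A Q^{-1} A'
    Ax = [[sum(a * xi for a, xi in zip(row, x))] for row in A]
    shift = matmul(V, solve(W, Ax))                 # K x 1 correction
    return [xi - s[0] for xi, s in zip(x, shift)]

# Toy setting (assumption): n = 12 points, K = 6 linear B-splines on [0, 1].
n, K = 12, 6
h = 1.0 / (K - 1)
B = [[max(0.0, 1.0 - abs(i / (n - 1) - k * h) / h) for k in range(K)] for i in range(n)]
D = [[(1.0, -2.0, 1.0)[j - i] if 0 <= j - i <= 2 else 0.0 for j in range(K)]
     for i in range(K - 2)]
R = matmul(transpose(D), D)
BtB = matmul(transpose(B), B)
tau_eps, tau_beta = 1.0, 2.0                        # illustrative values
Q = [[tau_eps * BtB[i][j] + tau_beta * R[i][j] for j in range(K)] for i in range(K)]

# Constraint rows c'B and l'B, with c = 1_n and l = (1, ..., n)'.
A = [[sum(B[i][k] for i in range(n)) for k in range(K)],
     [sum((i + 1) * B[i][k] for i in range(n)) for k in range(K)]]

random.seed(1)
beta = [random.gauss(0.0, 1.0) for _ in range(K)]   # stand-in for an unconstrained draw
beta_ulc = correct_for_constraints(beta, Q, A)
residual = [sum(a * b for a, b in zip(row, beta_ulc)) for row in A]
print(residual)
```

Both entries of `residual` should be zero up to rounding error, which is the sense in which the corrected coefficients satisfy the two linear constraints.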

6 Simulation study

The scaling parameter $θ$ of the PC prior in equation (3.1) can be tuned in several ways. Our proposal is to select $θ$ through assumptions on the number of effective degrees of freedom, $d$, of the function $f (x) = B β$. This approach seems intuitive in the Gaussian case, where the quantity $d$ relates immediately to the degree of an equivalent polynomial (about which an expert user might have prior information). Moreover, the literature on smoothing often refers to degrees of freedom as a way to summarize model complexity. In a recent paper, Klein and Kneib (2015) specify $θ$ through assumptions on the ‘scale’, or range of variation, of $f (x) = B β$ and denote this PC prior as ‘scale-dependent prior’ (SD prior). Precisely, $θ$ is derived numerically by requiring that $\Pr (| f (x) | \leq c) \geq 1 - α$ for each $x$ in the covariate domain, where $α \in (0, 1)$ and $c$ indicates an upper bound for the scale of $f$. Both scaling approaches lead to a PC prior which is invariant under design, in the sense that the computation of $θ$ accounts for the adopted B-spline design $B$. However, the two PC priors differ regarding conditioning on the noise variance. Our PC prior, $π_{PC} (d)$, is induced by a joint prior $π (τ_{β}, τ_{ε}) = π_{PC} (τ_{β} | τ_{ε}) π (τ_{ε})$, while the SD prior is defined unconditionally on $τ_{ε}$, as the scale of $f$ does not depend on $τ_{ε}$.

In this section, we present a simulation study which further investigates the relevance of degrees of freedom for designing priors for Gaussian P-splines. The objective of our study is to assess the behaviour of our joint prior in scenarios with different noise levels, and to compare this with two alternative priors which are defined unconditionally on $τ_{ε}$, namely the conjugate gamma prior and the SD prior. Therefore, our simulation study does not aim to generically assess the behaviour of PC priors compared to standard priors. For an extended simulation study evaluating the performance of PC priors (particularly SD priors) compared to several alternative hyper-priors for variance parameters, in both Gaussian and non-Gaussian contexts, the reader is referred to Klein and Kneib (2015). Furthermore, our simulation is restricted to the Gaussian case.

We consider two different models regarding the shape of the true $f (x)$ :

$f_{1} (x) = sin (x); x \in (- 1, 1)$

$f_{2} (x) = cos (x); x \in (0, 2 π) .$

Model $f_{1} (x)$ is close to the base model (almost a linear effect; this is the same model considered in Klein and Kneib, 2015), while $f_{2} (x)$ is a one-cycle sinusoidal curve (highly non-linear effect). In both scenarios, data are simulated as $y \sim N (f (x), τ_{ε}^{- 1} I)$, where covariate $x = \{x_{1}, \ldots, x_{n}\}$ takes values on a regular grid. We assume a standard P-spline model with one covariate, $y \sim N (B β, τ_{ε}^{- 1} I)$, $β \sim N (0, τ_{β}^{- 1} R^{- 1})$, where $R$ is the structure of an IGMRF of order 2, and $B$ contains $K$ cubic B-splines evaluated at $x$.
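The data-generating step can be sketched as follows, in plain Python. This is only an illustration: the grid of cell midpoints (chosen so that points stay strictly inside the open intervals), the seed and the helper names are assumptions, while the two test functions, the sample sizes $n \in \{20, 50\}$ and the noise precisions $τ_{ε} \in \{0.25, 1, 5\}$ are those considered in this study.

```python
import math
import random

def regular_grid(a, b, n):
    """n equally spaced cell midpoints strictly inside (a, b) (one possible regular grid)."""
    step = (b - a) / n
    return [a + (i + 0.5) * step for i in range(n)]

def simulate(f, a, b, n, tau_eps, rng):
    """Draw y_i = f(x_i) + e_i with e_i ~ N(0, 1 / tau_eps)."""
    x = regular_grid(a, b, n)
    sd = 1.0 / math.sqrt(tau_eps)          # precision -> standard deviation
    y = [f(xi) + rng.gauss(0.0, sd) for xi in x]
    return x, y

rng = random.Random(2016)                  # arbitrary seed (assumption)
curves = (("f1", math.sin, -1.0, 1.0), ("f2", math.cos, 0.0, 2.0 * math.pi))
scenarios = {}
for name, f, a, b in curves:
    for n in (20, 50):
        for tau_eps in (0.25, 1.0, 5.0):
            scenarios[(name, n, tau_eps)] = simulate(f, a, b, n, tau_eps, rng)

print(len(scenarios))                      # number of (curve, n, noise) combinations
```

Each entry of `scenarios` holds one simulated dataset; a full study would repeat the draw many times per scenario and would also vary the number of basis functions $K$, which only affects the fitted model and not the simulated data.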

Different scenarios are generated by setting: $n \in \{20, 50\}$ (small and moderate sample sizes), $K \in \{20, 30\}$, $τ_{ε} \in \{0.25, 1, 5\}$ (high, moderate and low noise). We aim to assess the model fit obtained by the following priors:

conjugate gamma prior on $τ_{β}$ , with two specifications widely used in applications: $gamma (10^{- 3}, 10^{- 3})$ and $gamma (1, 5 \cdot 10^{- 4})$ .

our joint prior $π_{PC} (τ_{β} | τ_{ε}) π (τ_{ε})$, inducing a PC prior for degrees of freedom, $π_{PC} (d)$, with parameters $U$ and $α$. We set $α = 0.01$ and various upper bounds $U \in \{2, 3, 5, 7, 10\}$. Note that we specify $U = 2$ only to check consistency of results in the limit case where any deviation from the base model is strongly penalized (in applications, this is not a sensible choice as it forces the fit towards a linear trend; for more details, see the joint prior in action with simulated data in the supplemental material).

SD prior on $τ_{β}$ (Klein and Kneib, 2015), with $α = 0.01$ and three specifications for the scale of $f$, $c \in \{1.5, 2, 3\}$. Note that since both $f_{1}$ and $f_{2}$ vary within $(- 1, 1)$, $c = 1.5$ seems the most sensible choice as an upper bound for the scale of both functions, resulting in a sufficiently flexible prior, while $c \in \{2, 3\}$ leads to an even more flexible prior. However, from Table 1, we see that the degrees of freedom implied by an SD prior with parameter $c$ strongly depend on the noise present in the data (and to some extent on the adopted design): for instance, $c = 1.5$ implies an upper bound for $d$ of around $2.72$ in the high noise case (for the design $n = 20$, $K = 20$), which in fact results in a very restrictive prior.

Table 1
Implied degrees of freedom, $d$ , for the SD prior. The entries in the table refer to the upper bound, $U$ , for $d$ , obtained by assuming an SD prior with parameters $c$ and $α = 0.01$ , in the different simulation scenarios. The computation of $U$ involves the use of the sdPrior R package (Klein, 2015); for more details see the supplemental material.

B-spline design	High noise ( $τ_{ε} = 0.25$ )			Moderate noise ( $τ_{ε} = 1$ )			Low noise ( $τ_{ε} = 5$ )		
	$c = 1.5$	$c = 2$	$c = 3$	$c = 1.5$	$c = 2$	$c = 3$	$c = 1.5$	$c = 2$	$c = 3$
$n = 20$ ; $K = 20$	$2.72$	$2.99$	$3.41$	$3.41$	$3.78$	$4.36$	$4.54$	$5.08$	$5.89$
$n = 50$ ; $K = 20$	$3.11$	$3.44$	$3.95$	$3.95$	$4.40$	$5.10$	$5.31$	$5.95$	$6.90$
$n = 20$ ; $K = 30$	$2.70$	$2.95$	$3.44$	$3.39$	$3.75$	$4.43$	$4.54$	$5.06$	$6.04$
$n = 50$ ; $K = 30$	$3.09$	$3.40$	$4.00$	$3.93$	$4.37$	$5.20$	$5.34$	$5.98$	$7.16$

6.1 Results

In each scenario, goodness of fit was assessed for each of 1,000 simulated datasets by the mean squared error (MSE) of $\hat{f} (x) = B \hat{β}$, where $\hat{β}$ is the posterior mean, computed as $MSE = n^{- 1} \sum_{i = 1}^{n} (\hat{f} (x_{i}) - f (x_{i}))^{2}$. The posterior for $β$ was computed using R-INLA (Rue et al., 2009, www.r-inla.org) for the gamma and SD priors, and using MCMC algorithm 1 for the joint prior. For the sake of comparison between the three classes of priors, we assume $π (τ_{ε}) = gamma (1, 5 \cdot 10^{- 4})$ throughout the simulation study (hence, our joint prior is $π_{PC} (τ_{β} | τ_{ε}) gamma (1, 5 \cdot 10^{- 4})$ ).
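The MSE criterion is straightforward to compute once the fitted and true curves are available on the grid. A minimal sketch in plain Python follows; the fitted curve used here (the straight line $\hat{f} (x) = x$ compared with $f_{1} (x) = sin (x)$) and the grid of cell midpoints are purely hypothetical stand-ins for a posterior mean fit.

```python
import math

def mse(f_hat, f_true):
    """Mean squared error n^{-1} * sum_i (f_hat_i - f_true_i)^2."""
    if len(f_hat) != len(f_true) or not f_hat:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(f_hat, f_true)) / len(f_hat)

# Toy check (assumption): compare the straight line f_hat(x) = x with f_1(x) = sin(x).
n = 20
x = [-1.0 + (i + 0.5) * 2.0 / n for i in range(n)]   # midpoints of n cells in (-1, 1)
f_true = [math.sin(xi) for xi in x]
f_hat = list(x)
log_mse = math.log(mse(f_hat, f_true))               # the scale reported in the figures
print(log_mse)
```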

Figure 5 reports $log (MSE)$ for small sample size $n = 20$ (which is when the hyper-prior is expected to be most influential on the posterior) and $K = 20$ . We do not see much change for increasing $n$ and $K$ , thus results for other scenarios are only reported in the supplemental material. Our main findings are as follows:

The gamma priors are generally outperformed by the two PC priors (both for degrees of freedom and scale), unless the data are very informative about the true model (e.g., in model $f_{2}$ with low noise, see bottom-right panel in Figure 5). As expected, the $gamma (10^{- 3}, 10^{- 3})$ overfits when model $f_{1}$ is the true one (see left panels in Figure 5), while the $gamma (1, 5 \cdot 10^{- 4})$ performs poorly in scenario $f_{2}$ (especially with high noise, see top-right panel in Figure 5).

For sensible choices of the upper bound $U$, the joint prior performs better than, or at least as well as, the SD prior. The main difference is noticed in scenario $f_{2}$ with high noise (top-right panel in Figure 5) and scenario $f_{1}$ with low noise (bottom-left panel in Figure 5). In the former case, the joint prior with $U \in \{5, 7, 10\}$ outperforms most SD prior specifications; in particular, the SD prior with $c \in \{1.5, 2\}$ achieves poor performance because the implied upper bound for $d$ is too small (i.e., below $3$, see Table 1) for this case, where the true effect is highly non-linear and the data are very noisy. The SD prior with $c = 3$ gives performance similar to that of the joint prior because it implies a larger upper bound for the degrees of freedom. Instead, in scenario $f_{1}$ with low noise, SD priors are outperformed by the joint prior with $U \in \{2, 3\}$, because any choice $c \in \{1.5, 2, 3\}$ implies an upper bound for $d$ clearly larger than needed (i.e., above $4$, see Table 1) in this case, where data are very informative and the true effect is close to the base model.

In cases where the choice of the upper bound $U$ is inappropriate to describe the complexity of the true function $f$, the joint prior performs poorly. For instance, when the true curve is close to the base model (i.e., scenario $f_{1}$ ), the joint prior with $U \geq 5$ is outperformed by both the SD priors and the $gamma (1, 5 \cdot 10^{- 4})$ (see left panels in Figure 5); in other words, when data are generated under a nearly linear model, a joint prior with a large upper bound $U$ is a poor choice, as it allows ‘far more’ degrees of freedom than needed. For the same reasons, when the true curve is highly non-linear (i.e., scenario $f_{2}$ ), the joint prior with $U \in \{2, 3\}$ generally achieves poor performance (see the right panels in Figure 5), because it assigns almost all weight to the base model or close to it.

In summary, the simulation study showed that both PC priors (i.e., the SD prior and our joint prior) provide potentially better performance than the standard gamma priors for sensible choices of the scaling parameter $θ$ . None of these PC priors shows uniformly better performance than the other one. However, we think that the particular scaling approach used to tune $θ$ plays an important role. Scaling the PC prior in terms of degrees of freedom automatically accounts for the level of noise present in the data. We were able to show via simulation that our joint prior has the potential to perform well in general, that is, both in high and low noise cases, provided that the elicited $U$ is an appropriate upper bound for the degrees of freedom of the function $f$ underlying the data. Therefore, we conclude that the number of effective degrees of freedom is a relevant quantity for building PC priors for smoothing Gaussian data, more suitable than other transformations of $τ_{β}$ that do not depend on $τ_{ε}$ .

Figure 5:

Simulation results: $log (MSE)$ for $f_{1}$ (left panels) and $f_{2}$ (right panels), in the presence of high noise ( $τ_{ε} = 0.25$, top panels) and low noise ( $τ_{ε} = 5$, bottom panels), sample size $n = 20$, $K = 20$. In the legend on the right, label ‘G’ indicates the gamma prior; ‘PC_d’ indicates our PC prior for degrees of freedom (joint prior), with $α = 0.01$ and $U \in \{2, 3, 5, 7, 10\}$; ‘SD’ denotes the scale-dependent prior with $α = 0.01$ and $c \in \{1.5, 2, 3\}$.

7 Application

We demonstrate the use of the joint prior within an additive P-spline framework for modelling nitrate concentration in the river Oglio, Lombardy, Italy. A total of $n = 576$ observations of NO $_{3}^{-}$ concentration were collected during 2010–2012 by taking one sample in each season (spring, summer, autumn and winter) at 48 gauging stations located along the river catchment. The response variable is $log ({NO}_{3, ij}^{-})$, measured at station $i = 1, \ldots, 48$ and season $j = 1, \ldots, 4$. Covariate ${stream}_{i}$ is the distance from each station $i$ to the Iseo lake (i.e., the river source) measured in km along the stream network; the first station in proximity to the lake is at $stream \approx 0$, while the last station downstream is at $stream \approx 150$. The goal is to understand ‘river enrichment’ in terms of nitrates, by studying the behaviour of $log ({NO}_{3}^{-})$ as the stream distance increases. A substantial amount of information comes from previous studies (see Bartoli et al., 2012; Delconte et al., 2014 and references therein), suggesting that river enrichment in terms of nitrates may vary non-linearly as the stream distance increases. Due to different characteristics in terms of groundwater interactions and irrigation practices, the river catchment can be divided into upstream, middle and downstream reaches. Different processes are expected within the three reaches and between seasons; hence, the enrichment curve may show different shapes in the three river segments and seasons.

In order to investigate possible seasonal effects on river enrichment, we adopt the model: $log ({NO}_{3, ij}^{-}) = μ + γ_{j} + f_{j} ({stream}_{i}) + ε_{ij}$, $i = 1, \ldots, 48$, $j = 1, \ldots, 4$, where $μ$ is the overall intercept, $γ_{j}$ is the season-specific intercept and $f_{j} ({stream}_{i})$ is the season-specific smooth function of the stream distance, modelled with a P-spline with joint prior on $(τ_{β_{j}}, τ_{ε})$ as described in Section 5,

\begin{matrix} f_{j} (stream) & = & B_{j} β_{j}, \quad j = 1, \ldots, 4 \\ β_{j} | τ_{β_{j}} & \sim & N (0, τ_{β_{j}}^{- 1} R^{- 1}) \\ τ_{β_{j}} | τ_{ε} & \sim & Gumbel (1 / 2, θ_{j} (τ_{ε})) \\ τ_{ε} & \sim & π (τ_{ε}) \propto 1 / τ_{ε} . \end{matrix}

(7.1)

We assume an IGMRF of order 2 with precision $τ_{β_{j}}$ on each $β_{j}$. Based on our prior information, we specify an upper bound $U = 8$ for the PC prior in equation (7.1), for all $j = 1, \ldots, 4$, assuming that each $f_{j}$ is much more flexible than linear (i.e., $d > 2$ ) and assigning two additional degrees of freedom to each of the three river segments (so that $U = 2 + 3 \times 2 = 8$ ), to capture possibly different enrichment behaviours. We believe that this prior is flexible enough to describe the possible smooth change between the upstream, middle and downstream behaviours. We set $α = 0.01$, meaning that, a priori, there is a $1\%$ probability that $f_{j}$ has more than eight degrees of freedom, $j = 1, \ldots, 4$.

To construct the matrices $X$ and $B$ of model (5.5), we need to create suitable dummy vectors of length $n$ : ${dummy}_{j}$, $j = 2, \ldots, 4$, taking value 1 when the observation is from season $j$ and 0 elsewhere (these are associated with the season-specific intercepts); $stream \times {dummy}_{j}$, $j = 1, \ldots, 4$, taking the actual stream distance when the observation is from season $j$ and 0 elsewhere (these are associated with the season-specific slopes). The $n \times 8$ fixed effect design matrix is $X = [1, D, S]$, with $D = [{dummy}_{2}, \ldots, {dummy}_{4}]$ and $S = [stream \times {dummy}_{1}, \ldots, stream \times {dummy}_{4}]$. Each basis $B_{j}$ contains $K = 30$ cubic B-splines, evaluated on the season-specific stream distances $stream \times {dummy}_{j}$. The $n \times (4 K)$ B-spline design matrix is $B = [B_{1}, B_{2}, B_{3}, B_{4}]$. To separate the season-specific slopes and intercepts from the season-specific smooth variation captured by the B-splines, suitable linear constraints must be applied to each $f_{j}$, as discussed in Section 5.1. We fit this model via MCMC based on the pseudo-code reported in Appendix A.2, algorithm 2; in particular, we use a two-block updating scheme, which ensured good mixing: the first block is $(τ_{ε}, γ)$ and the second contains all four sets of spline coefficients and associated precisions, $(τ_{β_{1}}, \ldots, τ_{β_{4}}, β_{1}^{ULC}, \ldots, β_{4}^{ULC})$. During MCMC iterations, the PC prior for each precision $τ_{β_{j}}$ needs to be rescaled according to the current $τ_{ε}$, but this can be done at a negligible computational cost. The R code is provided as supplemental material.
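The construction of the fixed effect design matrix $X = [1, D, S]$ can be sketched as follows. This is an illustration in plain Python rather than the R code from the supplemental material; the stream distances are hypothetical (equally spaced between 0 and 150 km), and reading $n = 576$ as 48 stations by 4 seasons by 3 years is an assumption based on the sampling period 2010–2012.

```python
n_station, n_season, n_year = 48, 4, 3

# Hypothetical stream distances (km), equally spaced between 0 and 150.
stream_by_station = [150.0 * i / (n_station - 1) for i in range(n_station)]

# One (stream distance, season) pair per observation; seasons are coded 1..4.
observations = [(stream_by_station[i], j)
                for _ in range(n_year)
                for i in range(n_station)
                for j in range(1, n_season + 1)]

def fixed_effect_row(stream, season):
    """Row of X = [1, D, S] for one observation."""
    dummies = [1.0 if season == j else 0.0 for j in range(2, n_season + 1)]    # D: seasons 2..4
    slopes = [stream if season == j else 0.0 for j in range(1, n_season + 1)]  # S: seasons 1..4
    return [1.0] + dummies + slopes

X = [fixed_effect_row(stream, season) for stream, season in observations]
print(len(X), len(X[0]))   # number of rows and columns of X
```

Season 1 acts as the reference level for the intercepts (no dummy column), while every season gets its own slope column, which gives the $1 + 3 + 4 = 8$ columns stated above.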

The results displayed in Figure 6 reveal different river enrichment curves ( ${\hat{f}}_{j} (stream)$ ) in different seasons: a distinctive pattern is observed in summer, where the fitted curve shows a fast increase upstream ( $stream < 50$ km) and a tendency to decrease downstream ( $stream > 80$ km). This pattern supports the argument given in Delconte et al. (2014) about an upstream reach (from 0 to 25 km) where nitrates are stable, reflecting the chemistry of the lake; a middle reach (from 25 to 80 km) showing an increase of NO $_{3}^{-}$ concentration, probably due to groundwater inputs as replacement of river water abstracted for irrigation; and a downstream reach (from 80 to 150 km) where nitrates should remain constant or even decrease, mainly due to the dilution of river water with NO $_{3}^{-}$ -deprived inputs.

Figure 6

Estimated river enrichment curves ( ${\hat{f}}_{j} (stream)$, black line), for each season, and $95\%$ credible bands (grey). The curve for summer clearly shows a distinctive pattern with respect to the other seasons. Partial residuals (Wood, 2006) are plotted in each panel (dots), indicating larger variability than assumed by the model for the log NO $_{3}^{-}$ concentrations observed in spring and autumn (with possible outliers in the upstream river segment in autumn).

8 Discussion

PC priors are defined to penalize complexity with respect to a given base model, the magnitude of the penalty being elicited by the user using an intuitive scaling approach. The scaling tool allows the user to derive the PC prior on an interpretable scale, different from the scale of the original parameter, provided that the link between the two is known. We took advantage of this nice feature and derived PC priors for the number of effective degrees of freedom $d$ of a P-spline model for Gaussian data.

For non-Gaussian responses, the idea presented in this article follows straightforwardly by adopting the definition of the degrees of freedom of a generalized P-spline model (Hastie and Tibshirani, 1990), $d = tr \{{(B^{T} W B + \frac{τ_{β}}{τ_{ε}} R)}^{- 1} B^{T} W B\}$, where $W$ is a diagonal matrix, with entries depending on the linear predictor of the model (i.e., on $B β$ ) and the adopted link function. MCMC methods similar to those proposed in this article can then be developed, taking care in implementing the mapping $d (τ_{β} | τ_{ε}, W)$, which is conditional on $W$ in the generalized case (note that for most distributions in the exponential family, $τ_{ε}$ is known, e.g., for Poisson we have $τ_{ε} = 1$ ). For Poisson and binomial responses, a convenient approach is to use auxiliary variable methods (Frühwirth-Schnatter and Frühwirth, 2007; Frühwirth-Schnatter et al., 2008) and work with an equivalent ‘augmented’ P-spline model for Gaussian (pseudo) data. The PC prior for $d$ can then be implemented in the same way as described in this article, with algorithms 1 and 2 needing only the inclusion of a Gibbs step to update the augmented parameters.
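For illustration, the weights take a familiar form under canonical links. The expressions below are the usual iteratively reweighted least squares weights, stated here as an assumption about how $W$ would be filled in rather than as part of the method above; $b_{i}^{T}$ (the $i$-th row of $B$) and $m_{i}$ (the number of binomial trials) are notation introduced only for this sketch.

```latex
\begin{aligned}
d(\tau_\beta \mid \tau_\varepsilon, W)
  &= \operatorname{tr}\Bigl\{\bigl(B^{T} W B + \tfrac{\tau_\beta}{\tau_\varepsilon} R\bigr)^{-1} B^{T} W B\Bigr\}, \\
\text{Poisson, log link:}\quad
  W &= \operatorname{diag}(\mu_1, \ldots, \mu_n), \qquad \mu_i = \exp(b_i^{T}\beta), \\
\text{binomial, logit link:}\quad
  W &= \operatorname{diag}\bigl(m_i\, p_i (1 - p_i)\bigr), \qquad
  p_i = \frac{\exp(b_i^{T}\beta)}{1 + \exp(b_i^{T}\beta)} .
\end{aligned}
```

Setting $W = I$ recovers the Gaussian mapping used throughout the article.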

The potential advantages of using PC priors for degrees of freedom are twofold. First, they are ‘easy-to-elicit’ by the user, who has to define two intuitive scaling parameters: $U$ , an upper bound for $d$ , and $α$ , the prior mass assigned to $d > U$ . This scaling tool can be handled flexibly. For instance, elicitation of the median $M$ for the degrees of freedom results from fixing $α = 0.5$ . In this case, the PC prior density is bimodal: one mode is set at the base model (by definition) and another mode is set around $M$ degrees of freedom. This bimodal behaviour is due to the attraction to the base model implicit in PC priors.

As a second advantage, these PC priors avoid overfitting and are invariant under design, which means that the parameters $U$ and $α$ do not need to be rescaled if the design changes. In other words, the PC prior is able to code into the model the prior knowledge on the complexity of the curve, or its degrees of freedom, in a design-adaptive way. The ability to adapt to design and avoid overfitting by construction makes $π_{PC} (d)$ an appealing default choice in additive models where the latent structure includes several smooth functions (built on a basis of B-splines, e.g., P-splines) and other types of structures, such as individual, spatial and spatio-temporal random effects.

Acknowledgments

Massimo Ventrucci is funded by a FIRB 2012 grant (project no. RBFR12URQJ, title: Statistical modeling of environmental phenomena: Pollution, meteorology, health and their interactions) for research projects of national interest provided by the Italian Ministry of Education, Universities and Research. The dataset used in Section 7 was kindly provided by the Consorzio dell'Oglio, from the project ‘Experimental assessment of the environmental flow in the lower Oglio river’. We thank Erica Racchetti and Alex Laini, Department of Life Science, University of Parma, for introducing us to the application in Section 7 and for fruitful discussion on the ecological interpretation of results. Finally, we would like to thank the Associate Editor and two anonymous referees for their helpful comments.

Appendices

References

Bartoli, Racchetti, Delconte, Sacchi, Soana, Laini, Longhi and Viaroli (2012) Nitrogen balance and fate in a heavily impacted watershed (Oglio River, Northern Italy): In quest of the missing sources and sinks. Biogeosciences, 9, 361–373. doi: 10.5194/bg-9-361-2012. URL http://www.biogeosciences.net/9/361/2012/ (last accessed 28 July 2016).

Belitz, Brezger, Kneib, Lang and Umlauf (2000) BayesX software for Bayesian inference in structured additive regression models (Technical report). URL http://www.statistik.lmu.de/~bayesx/manual/reference_manual.pdf (last accessed 29 July 2016).

Currie, Durbán and Eilers (2006) Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society, Series B, 68, 259–280.

Delconte, Sacchi, Racchetti, Bartoli and Mas-Pla (2014) Nitrogen inputs to a river course in a heavily impacted watershed: A combined hydrochemical and isotopic evaluation (Oglio River Basin, N Italy). Science of The Total Environment, 466–467, 924–938. doi: 10.1016/j.scitotenv.2013.07.092. URL https://www.sciencedirect.com/science/article/pii/S0048969713008747 (last accessed 28 July 2016).

Eilers, Currie and Durbán (2006) Fast and compact smoothing on large multidimensional grids. Computational Statistics & Data Analysis, 50, 61–76.

Eilers and Marx (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–121.

Eilers and Marx (2010) Splines, knots, and penalties. Wiley Interdisciplinary Reviews: Computational Statistics, 2, 637–653.

Fahrmeir and Kneib (2009) Propriety of posteriors in structured additive regression models: Theory and empirical evidence. Journal of Statistical Planning and Inference, 139, 843–859.

Fahrmeir, Kneib and Lang (2004) Penalized structured additive regression for space-time data: A Bayesian perspective. Statistica Sinica, 14, 715–745.

Fahrmeir, Kneib, Lang and Marx (2013) Regression: Models, Methods and Applications. Berlin: Springer-Verlag.

Fong, Rue and Wakefield (2010) Bayesian inference for generalized linear mixed models. Biostatistics, 11, 397–412. doi: 10.1093/biostatistics/kxp053. URL http://biostatistics.oxfordjournals.org/content/11/3/397.abstract (last accessed 28 July 2016).

Frühwirth-Schnatter and Frühwirth (2007) Auxiliary mixture sampling with applications to logistic models. Computational Statistics & Data Analysis, 51, 3509–3528. doi: 10.1016/j.csda.2006.10.006. URL https://www.sciencedirect.com/science/article/pii/S0167947306003720 (last accessed 28 July 2016).

Frühwirth-Schnatter, Frühwirth, Held and Rue (2008) Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Statistics and Computing, 19, 479–492. doi: 10.1007/s11222-008-9109-4. URL https://doi.org/10.1007/s11222-008-9109-4 (last accessed 28 July 2016).

Frühwirth-Schnatter and Wagner (2010) Stochastic model specification search for Gaussian and partial non-Gaussian state space models. Journal of Econometrics, 154, 85–100. doi: 10.1016/j.jeconom.2009.07.003. URL https://www.sciencedirect.com/science/article/pii/S0304407609001614 (last accessed 28 July 2016).

Frühwirth-Schnatter and Wagner (2011) Bayesian variable selection for random intercept modeling of Gaussian and non-Gaussian data. In Bernardo, Bayarri, Berger, Dawid, Heckerman, Smith AFM and West, eds. Bayesian Statistics 9, pages 165–200. Oxford: Oxford University Press.

Gelman (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1, 515–534. doi: 10.1214/06-BA117A. URL https://doi.org/10.1214/06-BA117A

Hastie and Tibshirani (1990) Generalized Additive Models. London: Chapman and Hall.

Hastie, Tibshirani and Friedman (2009) The Elements of Statistical Learning (Springer Series in Statistics). New York: Springer-Verlag.

Jullion and Lambert (2007) Robust specification of the roughness penalty prior distribution in spatially adaptive Bayesian P-splines models. Computational Statistics & Data Analysis, 51, 2542–2558. doi: 10.1016/j.csda.2006.09.027. URL https://www.sciencedirect.com/science/article/pii/S0167947306003549 (last accessed 28 July 2016).

Klein (2015) sdPrior: Scale-Dependent Hyperpriors in Structured Additive Distributional Regression. R package version 0.3. URL http://CRAN.R-project.org/package=sdPrior (last accessed 28 July 2016).

Klein and Kneib (2015) Scale-dependent priors for variance parameters in structured additive distributional regression. Bayesian Analysis. doi: 10.1214/15-BA983. URL http://projecteuclid.org/euclid.ba/1448323525

Knorr-Held and Rue (2002) On block updating in Markov random field models for disease mapping. Scandinavian Journal of Statistics, 29, 597–614.

Kullback and Leibler (1951) On information and sufficiency. The Annals of Mathematical Statistics, 22, 79–86.

Lang and Brezger (2004) Bayesian P-splines. Journal of Computational and Graphical Statistics, 13, 183–212.

Rue and Held (2005) Gaussian Markov Random Fields. London: Chapman and Hall/CRC.

Rue, Martino and Chopin (2009) Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71, 319–392.

Ruppert, Wand and Carroll (2003) Semiparametric Regression (Cambridge Series in Statistical and Probabilistic Mathematics). Cambridge, UK: Cambridge University Press.

Simpson, Rue, Martins, Riebler and Sørbye (2014) Penalising model component complexity: A principled, practical approach to constructing priors. Statistical Science (published before print). URL http://arxiv.org/abs/1403.4630v4 (arXiv:1403.4630v4; last accessed 28 July 2016).

Sørbye and Rue (2014) Scaling intrinsic Gaussian Markov random field priors in spatial modelling. Spatial Statistics, 8, 39–51. doi: 10.1016/j.spasta.2013.06.004. URL https://www.sciencedirect.com/science/article/pii/S2211675313000407 (last accessed 28 July 2016).

Wakefield (2013) Bayesian and Frequentist Regression Methods (Springer Series in Statistics). New York: Springer-Verlag.

Wood (2006) Generalized Additive Models: An Introduction with R. London: Chapman and Hall/CRC.

Figure 4:

The distribution of the number of effective degrees of freedom induced by a gamma prior with $a = 1, b = 5 \cdot 10^{- 4}$ (left panel) and $a = 10^{- 3}, b = 10^{- 3}$ (right panel), under four different designs. The base model is at $d = 2$, assuming an IGMRF prior of order 2 on $β$.