Rejoinder

Abstract

We would like to thank all discussants for their thoughtful comments and constructive criticism and their willingness to share the scripts and code for their analyses. We have responded to some of the raised points in the text, grouping them by their general topic.

1 Assumptions and flexibility of models

Several questions have been raised about the assumed covariance structure and flexibility of the modelling approach.

Jeff Morris and Paul Eilers ask about the assumed covariance structure of the functional random effects. For independent functional random effects and a spline-based approach, our assumptions imply (dropping the index j throughout the rejoinder for simplicity)

$$\mathrm{Cov} (B_{a} (t), B_{a'} (t')) = δ_{aa'} Φ_{Y} (t)^{⊤} (λ_{x} I_{K_{Y}} + λ_{Y} P_{Y})^{-1} Φ_{Y} (t') \qquad (1.1)$$

with the Kronecker delta $δ_{aa'}$ and $Φ_{Y} (t)$ being a vector of, for example, B-spline evaluations with difference or derivative-based penalty matrix $P_{Y}$. Without the additional penalty term $λ_{x} I_{K_{Y}}$, this corresponds to common assumptions for smooth curves in additive mixed models (e.g., Ruppert et al., 2003; Wood, 2006) also used in function-on-scalar regression (e.g., Reiss et al., 2010). The curves $B_{a}$ are factor-specific with a common smoothing parameter $λ_{Y}$. The term $λ_{x} I_{K_{Y}}$ without the additional penalty $λ_{Y} P_{Y}$ in turn would imply an independent normal distribution for each spline coefficient over the different levels $a$ of the grouping variable. The two assumptions are additively combined on the level of the precision matrix, as is commonly done for anisotropic smoothing, to allow for different amounts of smoothness or variability in the different ‘directions’.

Figure 1:

Plots of random draws with identical random number generator seed from the distribution of $B_{a} (t)$ given in equation (1.1) for various combinations of $λ_{x}, λ_{Y}$. Note the different scales for $B_{a} (t)$. The example uses 25 cubic B-splines with second-order difference penalty $P_{Y}$.
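The additive combination on the level of the precision matrix in equation (1.1) can be sketched at the coefficient level. The following pure-Python toy is an illustration only and not the code used for Figure 1: it omits the B-spline basis $Φ_{Y} (t)$ and draws coefficient vectors from $N (0, (λ_{x} I + λ_{Y} P_{Y})^{-1})$ with a second-order difference penalty. All function names are hypothetical.

```python
import math
import random

def second_diff_penalty(K):
    """P = D'D for the second-order difference matrix D ((K-2) x K)."""
    P = [[0.0] * K for _ in range(K)]
    for r in range(K - 2):
        idx, w = (r, r + 1, r + 2), (1.0, -2.0, 1.0)
        for i, wi in zip(idx, w):
            for j, wj in zip(idx, w):
                P[i][j] += wi * wj
    return P

def cholesky(A):
    """Lower-triangular L with A = L L' (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def draw_coefficients(K, lam_x, lam_y, rng):
    """One draw b ~ N(0, Q^{-1}) with precision Q = lam_x * I + lam_y * P."""
    P = second_diff_penalty(K)
    Q = [[lam_y * P[i][j] + (lam_x if i == j else 0.0) for j in range(K)]
         for i in range(K)]
    L = cholesky(Q)
    z = [rng.gauss(0.0, 1.0) for _ in range(K)]
    b = [0.0] * K
    for i in reversed(range(K)):  # back substitution: solve L' b = z
        b[i] = (z[i] - sum(L[k][i] * b[k] for k in range(i + 1, K))) / L[i][i]
    return b

def roughness(b):
    """Sum of squared second differences of the coefficient vector."""
    return sum((b[k + 2] - 2 * b[k + 1] + b[k]) ** 2 for k in range(len(b) - 2))

rng = random.Random(1)
for lam_x, lam_y in [(1.0, 0.01), (1.0, 100.0), (100.0, 1.0)]:
    draws = [draw_coefficients(25, lam_x, lam_y, rng) for _ in range(200)]
    mean_rough = sum(roughness(b) for b in draws) / len(draws)
    mean_range = sum(max(b) - min(b) for b in draws) / len(draws)
    print(f"lam_x={lam_x}, lam_y={lam_y}: "
          f"mean roughness={mean_rough:.3f}, mean range={mean_range:.3f}")
```

In this sketch, a larger $λ_{Y}$ relative to $λ_{x}$ yields much smoother coefficient sequences, while scaling both parameters up mainly shrinks the range of the draws.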

Figure 1, for a fixed spline and penalty order, plots simulated functions from the above covariance for different values of $λ_{x}$ and $λ_{Y}$ to illustrate the different kinds of inter- and intra-functional variability that can be captured by this covariance structure. It is clearly visible that the ratio $λ_{Y} / λ_{x}$ controls the smoothness of the functions, while different absolute values influence the range of the $B_{a} (t)$. It would be interesting to compare our assumptions to those of Morris and Carroll (2006) and subsequent papers of the Morris group, where wavelet coefficients are assumed independent but are allowed to have varying variances across scales and locations. In Section 2, we also look at the ability of covariance structure (1.1) for functional random effects to approximate other kinds of covariances and alternatively at functional random effects based on functional principal components, where the covariance is a truncated and estimated version of the covariance implied by the Karhunen–Loève expansion,

$$\mathrm{Cov} (B_{a} (t), B_{a'} (t')) = δ_{aa'} \sum_{l = 1}^{\infty} κ_{l}^{B} ϕ_{l}^{B} (t) ϕ_{l}^{B} (t')$$

with eigenvalues $κ_{l}^{B}$ and eigenfunctions $ϕ_{l}^{B}$.

While the discussion paper describes model terms with a single smoothing parameter in t direction, locally varying smoothness is possible in our framework as well. Essentially, the smoothing parameter is then itself allowed to be a smooth function along t (Wood, 2011, Section 5.1). Such locally adaptive smoothing is feasible for all model terms in the general model (2.1) of the main article using mgcv’s facilities for adaptive smoothing and is directly accessible in pffr and pfr models in the current implementation for the functional intercept and linear effects of scalar or functional covariates. This can be very important in cases where smoothness of functions varies along the interval $T$ (or $S$ ) and spatially adaptive smoothing is required; see, for example, Figure 5 later in this section.

There seems to be a misunderstanding regarding the degrees of freedom for the individual model terms in the gradient boosting approach. It is correct that the degrees of freedom of all base learners in a given model are set to a common (low) number in order to achieve unbiased selection of base learners (cf. Hofner et al., 2012). Nevertheless, the estimated functions adapt to the complexity of the underlying effects over the iterations: base learners for model terms requiring more flexibility will be selected and updated more frequently, thereby adaptively increasing the effective degrees of freedom of the respective model term. This also proved true in our simulations when looking at anisotropic effect surfaces, which seem to concern both Paul Eilers and Jeff Morris: while the smoothing parameters for bivariate terms are chosen such that the base learner has a given number of degrees of freedom, for example, by choosing a single smoothing parameter $λ_{x} = λ_{Y}$, over the course of the iterations the fitted effect adapts well even to highly anisotropic surfaces (e.g., constant or linear in one direction and wavy in the other). We saw no improvement when experimenting with different fixed ratios of $λ_{x}$ to $λ_{Y}$. For the mixed model-based approach implemented in pffr, anisotropy is handled in the usual way by allowing two smoothing parameters for the two penalties over $x$ and $t$, respectively; see equation (3.3) in the main article and Figure 1 for an illustration for functional random effects.
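The adaptive behaviour of component-wise boosting described above can be illustrated with a deliberately simple toy. The sketch below is not the mboost implementation: it assumes two scalar covariates, squared-error loss, and hypothetical base learners given by ridge-shrunken bin means, each with roughly one degree of freedom. It records how often each base learner is selected.

```python
import math
import random

def ridge_bin_fit(bins, resid, K, lam):
    """Weak base learner: ridge-shrunken bin means of the current residuals."""
    sums, counts = [0.0] * K, [0] * K
    for b, r in zip(bins, resid):
        sums[b] += r
        counts[b] += 1
    return [s / (c + lam) for s, c in zip(sums, counts)]

def componentwise_boost(xs, y, K=10, lam=180.0, nu=0.1, n_iter=250):
    """Component-wise L2 boosting; returns per-term coefficients and selection counts."""
    bins = [[min(int(v * K), K - 1) for v in x] for x in xs]
    offset = sum(y) / len(y)
    resid = [v - offset for v in y]  # negative gradient of squared-error loss
    coef = [[0.0] * K for _ in xs]
    selected = [0] * len(xs)
    for _ in range(n_iter):
        best_rss, best_j, best_fit = None, None, None
        for j in range(len(xs)):
            fit = ridge_bin_fit(bins[j], resid, K, lam)
            rss = sum((r - fit[b]) ** 2 for r, b in zip(resid, bins[j]))
            if best_rss is None or rss < best_rss:
                best_rss, best_j, best_fit = rss, j, fit
        # update only the best-fitting base learner, with step length nu
        selected[best_j] += 1
        for k in range(K):
            coef[best_j][k] += nu * best_fit[k]
        resid = [r - nu * best_fit[b] for r, b in zip(resid, bins[best_j])]
    return coef, selected

rng = random.Random(42)
n = 200
x1 = [rng.random() for _ in range(n)]
x2 = [rng.random() for _ in range(n)]
# wiggly effect of x1, nearly flat effect of x2 (both chosen for illustration)
y = [math.sin(4 * math.pi * a) + 0.3 * (b - 0.5) + rng.gauss(0.0, 0.2)
     for a, b in zip(x1, x2)]
coef, selected = componentwise_boost([x1, x2], y)
print("selection counts (wiggly term, nearly flat term):", selected)
```

Although both base learners start with the same low flexibility, the term with the more complex effect accumulates many more updates in this toy setting, so its fitted effect ends up with more effective flexibility.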

Jeff Morris asks about the assumed error term structure, and he and Anna Maria Paganoni and Laura Sangalli (PS) raise the issue of dependent functional data. While we do not specify a smooth residual curve $E_{i} (t)$ in our general model (2.1), this is certainly a model term that will be included in most functional regression models and is covered as a special case in our framework. Not including it by default, however, accommodates situations such as those where the error structure is group-specific and one would like to include smooth residual curves with group-specific covariance structure. Alternatively, dependent random effects can be achieved by specifying a fixed correlation structure, for example, over time and/or space, using $P_{x}^{-1}$ for the (functional) random intercepts. However, estimates of covariance parameters such as the shape or range parameters of a Matérn correlation between different locations have to be obtained by a grid search in our approach (see Scheipl et al., 2015, Web Appendix C.3, for an application example on spatial functional data) and cannot be estimated along with the other model parameters in the current implementation. The relative scale of the assumed covariance structure will still be controlled by $λ_{x}$ though, which is estimated from the data.

As another important aspect, we would like to point out that in the proposed general model formulation, we cover many other loss functions and response distributions as well as GAMLSS in addition to Gaussian, t- or quantile regression, which we see as an advantage compared to most competing approaches. This is also illustrated in our application in the main article with a beta regression with or without modelling of the conditional variance. Additional response distributions can now be added fairly easily even for the mixed model-based approach (see Wood and Fasiolo, 2016), while the boosting-based approach for GAMLSS we build on (Hofner et al., 2014) already implements all of the response distributions described in Stasinopoulos and Rigby (2007, Section 1.3).

María Durban and Carmen Aguilera-Morillo (DM) raise some issues with regard to the estimation and choice of basis and penalty for the functional random effects. As our approach is completely indifferent towards the estimation method used for the FPC basis of the functional random effects, it would also be possible to use it with FPCs estimated using Aguilera and Aguilera-Morillo (2013). However, it is unclear to us how to extend the required presmoothing step to sparse or irregular functional data or non-i.i.d. data with more complicated grouping structure along the lines of Cederbaum et al. (2016). We would also like to point out that the comparison of the proportion of variability explained (PVE) they present (their Table 1) is somewhat misleading, as the denominator in the PVE (i.e., the total variability of the input data) is smaller for the heavily presmoothed residual curves they use than for the raw residuals we use. When we repeat the analysis for DM's Table 1 with data presmoothed with 80 instead of the 20 B-spline basis functions DM used (proportion of raw variability ‘lost to smoothing’: 0.05% and 2.2%, respectively), the differences between the two approaches mentioned by DM practically disappear (cf. Table 1 and Figure 2).

Table 1:

Percentage of variability explained by first eight FPCs estimated via smoothed covariance on raw residuals (I, top) and penalized FPCA (Aguilera and Aguilera-Morillo, 2013) on presmoothed residuals represented with 80 B-splines (II, bottom).

Method	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	Accumulated
I	60.42	10.52	7.85	6.46	4.50	2.51	1.70	1.40	95.36
II	60.13	10.49	7.76	6.40	4.38	2.54	1.66	1.40	94.75

Figure 2:

First two eigenfunctions estimated via smoothed covariance on raw residuals (dashed red) and penalized FPCA (Aguilera and Aguilera-Morillo, 2013) on presmoothed residuals represented with 80 B-splines (solid black).

Figure 3:

Plot of raw residuals ${\hat{ε}}_{it}$ (black, solid) after estimation of mean structure and centring (model without smooth residuals/functional random effects) and estimated pffr functional random effects ${\hat{E}}_{i} (t)$ using the first 20 FPCs (solid green) or 20 cubic B-spline basis functions (dashed blue) as well as DM's method using 23 cubic B-spline basis functions for ${\hat{E}}_{i} (t)$ (dotted red).

We are grateful to DM for pointing out that we chose an insufficiently flexible basis for the functional random effects in the main article (cf. their Figure 2). However, this should not be construed as a defect of the method itself, but rather as a question of (sub-optimal) model specification. Re-estimating the model with functional random effects based on 20 instead of 8 FPCs or alternatively 20 B-spline basis functions per subject, we achieve results that are very similar to those achieved with the 23 B-spline basis functions per subject used by DM. These results are shown in Figure 3.

Regarding the alternative penalty structure that DM give in their equations (3.2) and (3.3), we would like to point out that this single smoothing parameter penalty does not regularize the components of the random effect curves that are unpenalized by their roughness penalty $S$ . In practice, for example, for a second-order difference penalty on the B-spline coefficients, this means that subject-specific intercept and linear slope estimates will be estimated as fixed effects (unregularized). The tensor product penalty we use, on the other hand, estimates one variance component/smoothing parameter that controls the roughness of the functional random effects and an additional one that controls the degree of inter-subject heterogeneity, just like the penalty proposed by DM in their equation (3.4). Our parameterization is slightly different as we impose a penalty over subjects for the entire curves rather than separately for the constant and linear parts. See also Figure 1 for examples of draws from the proposed distribution for various combinations of smoothness and inter-subject heterogeneity.
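The point about unpenalized components can be checked directly on the coefficient level. The sketch below is an illustration with hypothetical helper names: it evaluates a second-order difference roughness penalty and the combined quadratic form $b^{⊤} (λ_{x} I + λ_{Y} P) b$ for a constant, a linear and a wiggly coefficient vector.

```python
def second_diff(b):
    """Second-order differences of a coefficient vector."""
    return [b[k + 2] - 2 * b[k + 1] + b[k] for k in range(len(b) - 2)]

def roughness_penalty(b):
    """b' P b for P = D'D, with D the second-order difference matrix."""
    return sum(d * d for d in second_diff(b))

def combined_penalty(b, lam_x, lam_y):
    """b' (lam_x * I + lam_y * P) b: the roughness penalty plus a ridge term."""
    return lam_x * sum(v * v for v in b) + lam_y * roughness_penalty(b)

K = 8
constant = [1.0] * K
linear = [float(k) for k in range(K)]
wiggly = [(-1.0) ** k for k in range(K)]

for name, b in [("constant", constant), ("linear", linear), ("wiggly", wiggly)]:
    print(name, roughness_penalty(b), combined_penalty(b, lam_x=1.0, lam_y=10.0))
```

Constant and linear coefficient sequences incur zero roughness penalty, so a penalty built from the difference penalty alone leaves them unregularized. Adding the ridge-type term makes the combined penalty strictly positive for every non-zero coefficient vector.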

Figure 4:

Plot of negative log REML criterion as a function of smoothing parameter $λ_{x}$ for the penalized signal regression of water content of fossil fuels with different basis sizes. REML estimates of $λ_{x}$ returned by mgcv are marked with red crosses.

Figure 5:

Adaptive spline fits (cubic B-splines, second-order difference penalty) for penalized signal regression of water content of fossil fuels with different basis sizes. Top row: estimated coefficient functions. Bottom row: observed versus fitted values.

Paul Eilers points out that the default settings of refund’s penalized signal regression (i.e., scalar outcome with a functional covariate) may not always yield stable and reliable results when the basis dimension for the functional effect is changed (his Figure 5). Upon investigating this further, we found that restricted maximum likelihood (REML) optimization gets stuck in local optima when 60 and 100 basis functions are used for this dataset. Figure 4 shows the behaviour of the REML criterion as a function of the smoothing parameter $λ_{x}$ for the three models with $K_{x} = 60, 80, 100$ basis functions, where only the solution found for $K_{x} = 80$ actually represents a global REML optimum. This highlights the importance of the recommendations given in Reiss and Ogden (2009) for this very model class: Plotting criteria over a wide range of $λ$ values where possible and, rather less expensive in terms of computation, cross-checking results obtained under generalized cross-validation (GCV) and REML against each other. Although Reiss and Ogden (2009) found that running into local optima seems to be more common for GCV than for REML, for this particular model and dataset, GCV reliably found the global optimum. Our mgcv-based implementation makes it very easy to switch between the different criteria. The most suitable model for this data is most likely based on locally adaptive smoothing where the smoothing parameter $λ_{x}$ is itself a function allowed to vary over s. The adaptive smooths available in pfr and pffr via mgcv parameterize the log of the smoothing parameter as another spline function and can be estimated in the same framework, for example, using REML. Results for such an adaptive smooth seem to be stable and rather more sensible looking than the results achieved by global smoothing (compare our Figure 5 and Eilers’ Figures 4 and 5).
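The recommendation to inspect a smoothing criterion over a wide range of $λ$ values can be sketched as a simple grid profile. The toy below does not use mgcv or refund and does not compute REML: as a stand-in, it evaluates the GCV score of a hypothetical ridge-shrunken bin-mean smoother on a logarithmic grid and lists all local minima, so that multiple optima would become visible.

```python
import math
import random

def gcv_ridge_bins(bins, y, K, lam):
    """GCV score n * RSS / (n - tr(H))^2 for ridge-shrunken bin means."""
    sums, counts = [0.0] * K, [0] * K
    for b, v in zip(bins, y):
        sums[b] += v
        counts[b] += 1
    fit = [s / (c + lam) for s, c in zip(sums, counts)]
    rss = sum((v - fit[b]) ** 2 for b, v in zip(bins, y))
    edf = sum(c / (c + lam) for c in counts)  # trace of the hat matrix
    n = len(y)
    return n * rss / (n - edf) ** 2

def local_minima(values):
    """Indices whose value is strictly below both neighbours (grid ends included)."""
    idx = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else math.inf
        right = values[i + 1] if i + 1 < len(values) else math.inf
        if v < left and v < right:
            idx.append(i)
    return idx

rng = random.Random(7)
n, K = 300, 20
x = [rng.random() for _ in range(n)]
y = [math.sin(2 * math.pi * v) + rng.gauss(0.0, 0.3) for v in x]
bins = [min(int(v * K), K - 1) for v in x]

lambdas = [10 ** (-3 + 0.1 * i) for i in range(61)]  # 1e-3, ..., 1e3
gcv = [gcv_ridge_bins(bins, y, K, lam) for lam in lambdas]
minima = local_minima(gcv)
print("local minima at lambda =", [round(lambdas[i], 4) for i in minima])
```

If such a profile shows more than one local minimum, the value returned by a numerical optimizer should be compared against the grid minimum before the fit is interpreted.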

2 Inference

Many of the discussants are concerned about neglected sources of estimation uncertainty and methods for properly accounting for the complicated covariance structures typical of functional datasets, as well as testing, model choice and variable selection.

The contribution of Piotr Kokoszka and Mathew Reimherr (KR) raises the important issue of how to properly account for intra-functional autocorrelation and showcases some valuable tools for constructing confidence intervals that take into account the autocorrelation of residual curves. However, we strongly disagree with their broad conclusion that ‘estimates produced by pffr have a tendency to underestimate the variability of parameter estimates’ and welcome the opportunity to discuss this important issue in some more detail. Specifically, we want to stress that the undercoverage and $α$-level violations reported by KR are primarily caused by their use of a severely misspecified model. The data that KR simulate have errors $E_{i} (t)$ that are the sum of two Gaussian processes with (squared) exponential covariance, while the model they fit assumes i.i.d. or white noise $E_{i} (t)$. Diagnostic plots for the (autocorrelation of) estimated residuals like those shown in Figure 2 of the main article would have alerted analysts to such a severe violation of model assumptions in any real application. It is hardly surprising, then, that (a) confidence intervals (CIs) based on models estimated under an independence assumption will undercover for data displaying strong autocorrelation and that (b) the effect of this neglected autocorrelation is stronger the denser the evaluation grid for the functional responses is, as the model then assumes an ever larger number of independent observations with resulting decreasing widths of CIs while the ignored correlations between adjacent function evaluations become ever stronger.
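Points (a) and (b) can be reproduced in a much simpler setting than KR's simulation. The following toy is an illustration under assumed settings and not a re-run of their study: it estimates a constant mean from curves with exponentially correlated errors on $[0, 1]$ and computes the coverage of the naive 95% interval that treats all function evaluations as independent. The correlation range, the number of curves and the function names are hypothetical.

```python
import math
import random

def ou_curve(D, corr_range, rng):
    """Stationary Gaussian path with exponential covariance on D equispaced points in [0, 1]."""
    rho = math.exp(-(1.0 / (D - 1)) / corr_range)  # correlation of adjacent points
    e = [rng.gauss(0.0, 1.0)]
    for _ in range(D - 1):
        e.append(rho * e[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
    return e

def naive_coverage(D, n_curves=20, corr_range=0.2, n_rep=300, seed=1):
    """Coverage of the i.i.d.-based 95% CI for the true mean 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        y = [v for _ in range(n_curves) for v in ou_curve(D, corr_range, rng)]
        m = sum(y) / len(y)
        s2 = sum((v - m) ** 2 for v in y) / (len(y) - 1)
        half_width = 1.96 * math.sqrt(s2 / len(y))  # ignores autocorrelation
        hits += abs(m) <= half_width
    return hits / n_rep

coverage = {D: naive_coverage(D) for D in (5, 20, 50)}
for D, c in coverage.items():
    print(f"grid size D={D}: naive coverage = {c:.2f}")
```

In this toy, the naive interval becomes narrower as the grid gets denser although the information about the mean hardly increases, so its coverage deteriorates with growing $D$.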

To evaluate the severity of the issue KR raise more fairly, we repeat their simulation study with an error process that includes an i.i.d. component in addition to a smooth autocorrelated process of roughly the same variability. We take advantage of pffr’s capabilities to deal with such autocorrelation by including either spline- or FPC-based curve-specific smooth residuals (FPCs are estimated from a pilot estimate under a working independence assumption) or alternatively by assuming an AR(1) residual structure over the functional response's domain, which captures the correlation structure at least to some extent. We observe point-wise coverages and type I errors that achieve almost the nominal level for most settings, at least if FPC-based random effects are used (cf. Table 2). Spline-based random effects indeed do not perfectly capture the autocorrelation structure of the simulated relatively non-smooth process, especially for the squared exponential covariance, but do as well as might be expected from a model that assumes a smooth plus i.i.d. error structure in a setting with quite wiggly residual curves. Furthermore, we would like to point out that inference that, like pffr, directly takes into account the data's dependency structure is typically more efficient than ex post corrections such as those KR use (cf. linear mixed model inference versus generalized estimating equations (GEEs) with working independence assumptions). If further correction of inference or simultaneous inference is desired, KR's method could be combined with pffr specifying smooth plus i.i.d. errors in the model to increase efficiency.

Table 2:

Type I error rates (nominal 5%) for the null model and average point-wise coverage for the model with a non-zero effect (nominal 95%) in the corrected simulation study of KR with $D_{i} \equiv D$ (KR's m) indicating the number of observations per curve. Method ‘simple’ is the misspecified i.i.d. error model KR use; ‘AR(1)’ assumes an AR(1) process for the errors along t. ‘FPC’ and ‘P-spline’ denote models including $E_{i} (t)$ using functional principal component-based and cubic B-spline random effects with first-order difference penalty, respectively. $C_{1 / 2}$ and $C_{\infty}$ denote exponential and squared exponential covariance, respectively. Results are based on 200 replicates per setting.

		Type I error				Coverage			
C	D	simple	AR(1)	FPC	P-spline	simple	AR(1)	FPC	P-spline
$C_{1 / 2}$	5	0.080	0.050	0.070	0.055	0.93	0.96	0.93	0.93
$C_{1 / 2}$	20	0.235	0.055	0.040	0.055	0.88	0.96	0.96	0.96
$C_{1 / 2}$	50	0.655	0.245	0.075	0.085	0.74	0.91	0.95	0.95
$C_{\infty}$	5	0.125	0.045	0.075	0.055	0.92	0.95	0.93	0.94
$C_{\infty}$	20	0.400	0.165	0.040	0.095	0.83	0.95	0.96	0.91
$C_{\infty}$	50	0.815	0.465	0.075	0.285	0.65	0.86	0.94	0.80

In his excellent contribution contrasting his functional mixed model (FMM) and our approach, Morris states that any inference conditional on subject-specific random effects ‘would not account for the subject-to-subject variability [...] and the model would not effectively make inference on the population from which the subjects were drawn’ (Section 2.3). This is a rather more subtle issue. First, Morris's statement should not be interpreted to mean that conditional models ignore inter-subject variability, which they model explicitly, merely that the focus of inference is different. For illustration, let us focus on a simple Gaussian linear mixed model $y = X β + Z b + ε$ with $b \sim N (0, G)$ independent of $ε \sim N (0, R)$, implying the marginal model $y \sim N (X β, V)$ with $V = Z G Z^{⊤} + R$. Working with the marginal model, the estimator for the fixed effects corresponds to the generalized least squares estimator $\hat{β} = (X^{⊤} V^{-1} X)^{-1} X^{⊤} V^{-1} y$ with covariance $(X^{⊤} V^{-1} X)^{-1}$. Alternatively, we can estimate $β$ and $b$ jointly based on Henderson's mixed model equations. These can be derived from maximizing the log-likelihood based on the joint density of $y$ and $b$, also known as the penalized log-likelihood, as it corresponds to the conditional log-likelihood treating $β$ and $b$ as fixed plus a quadratic ‘penalty’ term arising from the normal prior distribution of $b$. We then obtain $(\hat{β}, \hat{b})^{⊤} = (C^{⊤} R^{-1} C + B)^{-1} C^{⊤} R^{-1} y$, where $C = (X \; Z)$ and $B = \mathrm{blockdiag} (0, G^{-1})$, with covariance $\mathrm{Cov} ((\hat{β}, \hat{b} - b)^{⊤}) = (C^{⊤} R^{-1} C + B)^{-1}$ (Ruppert et al., 2003, Sections 4.5–4.7). Using the Woodbury formula, it is easy to see that the two estimates and covariances for $\hat{β}$ coincide, that is, the marginal and hierarchical (conditional plus prior for $b$) model formulations lead to the same estimates and covariances for the fixed effects.
Inference is thus the same and does account for variability in $b$ using either viewpoint. Furthermore, all implementations in mgcv, whether bam, gam or gamm, coincide up to slight differences caused by different estimation algorithms when REML is used for estimation of the smoothing parameters/variance components. In this context, it is also important to emphasize that, just like Bayesian approaches, CIs in the current mgcv implementation account for smoothing parameter uncertainty as well (cf. Wood et al., 2016b, Section 4; see also the discussion in Greven and Scheipl, 2016; Wood et al., 2016a). For the generalized case, the hierarchical model formulation still implies a corresponding marginal model, which likelihood inference can be based on. As the integrals over the random effects cannot be analytically derived, this is usually done using approximations to the marginal likelihood. mgcv maximizes the penalized log-likelihood, which can also be seen as maximizing a penalized quasi-likelihood (PQL) approximate likelihood marginalized over the random effects (Breslow and Clayton, 1993). Thus, in the generalized case, the variability in the random effects is again (approximately) accounted for. Gradient boosting directly works with the penalized log-likelihood if the loss function is taken to be the negative log-likelihood. While we are working on deriving tests and confidence bands for boosted models, currently the non-parametric bootstrap has to be employed to obtain variability quantification.
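For the Gaussian linear mixed model, the equivalence of the marginal and hierarchical estimates of $β$ can be sketched in a few lines. The derivation below assumes that $G$ and $R$ are invertible and uses the notation of the model $y = X β + Z b + ε$ with $V = Z G Z^{⊤} + R$.

```latex
\begin{aligned}
% Henderson's mixed model equations
\begin{pmatrix}
  X^{\top} R^{-1} X & X^{\top} R^{-1} Z \\
  Z^{\top} R^{-1} X & Z^{\top} R^{-1} Z + G^{-1}
\end{pmatrix}
\begin{pmatrix} \hat{\beta} \\ \hat{b} \end{pmatrix}
&=
\begin{pmatrix} X^{\top} R^{-1} y \\ Z^{\top} R^{-1} y \end{pmatrix}, \\
% second block row, solved for the random effects
\hat{b} &= (Z^{\top} R^{-1} Z + G^{-1})^{-1} Z^{\top} R^{-1} (y - X \hat{\beta}), \\
% Woodbury formula applied to V = R + Z G Z^T
V^{-1} = (R + Z G Z^{\top})^{-1}
&= R^{-1} - R^{-1} Z (Z^{\top} R^{-1} Z + G^{-1})^{-1} Z^{\top} R^{-1}, \\
% substituting \hat{b} into the first block row
X^{\top} V^{-1} X \hat{\beta} = X^{\top} V^{-1} y
\quad &\Longrightarrow \quad
\hat{\beta} = (X^{\top} V^{-1} X)^{-1} X^{\top} V^{-1} y .
\end{aligned}
```

By the same Schur complement argument, the upper-left block of $(C^{⊤} R^{-1} C + B)^{-1}$ equals $(X^{⊤} V^{-1} X)^{-1}$, so the covariances of $\hat{β}$ agree as well.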

Jiawei Bai, Andrada Ivanescu and Ciprian Crainiceanu (BaIC) list some proposed tests for functional regression methods and bemoan the lack of a general and validated implementation of tests in functional regression, and KR and Morris raise the important objection that inference for functional effects often calls for simultaneous (or regional) confidence intervals and tests. In this context, it is worth noting that our mixed model-based framework—in addition to point-wise tests based on Wood et al. (2016b)—does offer simultaneous tests of effects based on Wood (2013a,b). Simultaneous confidence bands could be constructed along the lines of Ruppert et al. (2003, Section 6.5), similar to what the Morris group used for simultaneous credible bands in Meyer et al. (2015); alternatively, KR show reliable (albeit conservative) performance of their joint CIs even for their heavily misspecified model. Of course, these frequentist-penalized approaches will never be as flexible and easily customizable as Bayesian inference where the full posterior is available as for Morris et al.’s FMM.

In terms of model choice, the mixed model-based framework offers a generally applicable corrected conditional AIC that also incorporates smoothing parameter uncertainty (Wood et al., 2016b, Section 5) and can be used to select or deselect any terms in the model as well as to compare say smooth vs. linear or constant specifications. Model term selection for generalized additive mixed models based on full-rank penalties has alternatively been proposed in Marra and Wood (2011). For term selection in our boosting-based approach, we use cross-validated ‘early stopping’ of boosting iterations (Bühlmann and Hothorn, 2007) and stability selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013), which is theoretically well founded with strong guarantees on error rates.

3 Computational aspects

Jeff Morris, Paul Eilers, BaIC and DM raise some interesting issues with regard to the actual implementations of the model class we present. We are looking forward to the implementation of the separation of overlapping penalties (SOP) algorithm combined with generalized linear array models (GLAM) representation announced by DM, which has the potential for large reductions in computing time and memory requirements. Unfortunately, they were unable to share working code with us for the preparation of this rejoinder for a more detailed comparison. As we describe in the article, the GLAM arithmetic implemented in our boosting-based approach will only ever be feasible for functional responses sampled on a common grid (with potential missings) and for the subset of model terms in which the marginal basis functions over $x \in X$ do not depend on the value of $t \in T$ in any way. While this includes the response structure and effect types currently available in Morris et al.’s WFMM software for FMM, this rules out the GLAM approach for concurrent and historical functional effects or sparse, irregular functional data, for example. On the upside, the GLAM approach avoids the representation of the responses in a long vector, which Morris correctly identifies as computationally less than ideal.

The sweep-operator trick (Goodnight, 1978; Herrick and Morris, 2006), used in WFMM if data are without missings and if random effect and residual functions are conditionally independent, is the key to WFMM’s much better scalability for data with large n compared to our boosting and mixed model-based implementations (cf. results for large numbers of grid points, random effect levels and observations per random effect level in Scheipl et al., 2015, Web Appendix B.6, Figure 34). However, both our approaches can now utilize binning or discretization for large data (see Wood et al., 2016, for discrete option in mgcv::bam; index-option for mboost base learners) which means computation times increase much more slowly as n increases, depending on the coarseness of the binning. Using this strategy and a parallelized estimation algorithm, Wood et al. (2016) were able to fit a model on $\approx 10^{8}$ data points with $\approx 10^{4}$ coefficients in under one hour on a conventional laptop. Also, note that for functional responses on regular grids, the ‘binned’ data is an exact representation of the original data if a sufficiently fine binning is used.
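The remark that binning is exact for functional responses on regular grids can be illustrated with a small cross-product computation. The sketch below shows the general idea only and not the algorithm implemented in mgcv or mboost: a hypothetical polynomial basis stands in for spline basis evaluations, and the cross-product of the long-format basis matrix is compared with the one obtained from the unique grid values and their counts.

```python
from collections import Counter

def basis(x):
    """Toy basis row (1, x, x^2) standing in for spline basis evaluations."""
    return (1.0, x, x * x)

def crossprod_full(xs):
    """X'X accumulated over every row of the long-format data."""
    p = len(basis(xs[0]))
    out = [[0.0] * p for _ in range(p)]
    for x in xs:
        row = basis(x)
        for i in range(p):
            for j in range(p):
                out[i][j] += row[i] * row[j]
    return out

def crossprod_binned(xs):
    """X'X from the unique covariate values, weighted by their counts."""
    counts = Counter(xs)
    p = len(basis(xs[0]))
    out = [[0.0] * p for _ in range(p)]
    for x, n_x in counts.items():
        row = basis(x)
        for i in range(p):
            for j in range(p):
                out[i][j] += n_x * row[i] * row[j]
    return out, len(counts)

n_curves, D = 500, 40
grid = [d / (D - 1) for d in range(D)]
t_long = grid * n_curves  # index of the response in long format
full = crossprod_full(t_long)
binned, n_unique = crossprod_binned(t_long)
print(f"{len(t_long)} rows reduced to {n_unique} unique values")
```

Here the basis only needs to be evaluated at the 40 unique grid points instead of at all 20,000 rows, and both cross-products agree up to floating-point rounding.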

Table 3:

Tabular comparison of available implementations for regression with functional data for versions publicly available December 2016. Note that the FMM framework defined by Morris and co-authors is much more versatile than the software implementation (WFMM) currently available.

Note: ⁽¹⁾ possible to specify manually (no penalization over $X$); ⁽²⁾ both spline- or FPC-based variants available; ⁽³⁾ experimental feature, validated for scalar y; ⁽⁴⁾ published, but not available in public software; ⁽⁵⁾ correlation structure not estimated, specified via fixed penalty; ⁽⁶⁾ splines, FPCs published, but not available in public software; ⁽⁷⁾ only for functional intercept, linear effects of scalars and functions; ⁽⁸⁾ available for exponential family response via mgcv’s jagam (Wood, 2016); ⁽⁹⁾ n: # curves, D: grid size, p: # coefficients for additive predictor; ⁽¹⁰⁾ componentwise updates only, so no large linear systems have to be solved regardless of total p; ⁽¹¹⁾ Gaussian scale mixtures published but not available in public software; ⁽¹²⁾ only conditional variance of Gaussian data currently implemented, others under development; ⁽¹³⁾ experimental feature, not published; ⁽¹⁴⁾ R interface under development; ⁽¹⁵⁾ full posterior Bayesian inference for subset of models, see (8); ⁽¹⁶⁾ user-defined parallelization described in user manual.

4 Comparison between available implementations

As suggested by BaIC, Table 3 represents our best effort to summarize the current capabilities of the four most mature implementations of regression models for functional data that we are aware of. Of course, many other software solutions offering more specialized routines for certain tasks are available (cf. the implementations mentioned in Section 1.4 of the main article). Note that the section ‘model terms’ in the table does not describe the scope of implemented interaction effects for reasons of space. We are looking forward to the planned comprehensive R implementation of Morris et al.’s FMM framework, which will fill further cells in the table and mean that the full flexibility of their developed framework will be publicly available.

5 Outlook and open questions

Regardless of the efficacy of random-effect-based approaches for dealing with intra-functional autocorrelation and achieving well-calibrated inference, Jeff Morris correctly points out the heavy computational cost of this conditional modelling strategy which does not scale well for datasets with more than a couple of hundred functional observations. At least for Gaussian functional data, an extension of our conditional modelling strategy by including an estimated marginal intra-functional covariance structure, possibly along the lines of the penalized GEE approach of Wang et al. (2013), would be an important addition to the framework presented here. An implementation of this idea is under development in the form of the pffrGLS function in refund. The discussion of computational aspects by several discussants also leads us to conclude that future implementations for this model class should strive for a combination of best-of-breed techniques, that is,

prefer a marginal model approach instead of a conditional one for the autocovariance structure of functional residuals, at least for Gaussian data,

perform (nearly) lossless compression of functional responses as an optional pre-processing step to make time and memory requirements (more) independent of grid lengths,

use GLAM arithmetic for model terms where possible,

use the sweep-operator trick for (iteratively reweighted) least squares (IW)LS-like estimation steps where possible,

discretize or bin covariates in large data sets and use the resulting compact representation of associated basis matrices,

use sparse matrix representations of basis matrices and penalties, where possible.

Note, however, that many of these algorithmic improvements will not be trivial or even feasible for sparse or irregular data.

On the subject of functional covariates, we agree with Paul Eilers that ‘Many quite different coefficient curves can give essentially identical fitted values or predictions of the dependent variable’. This is due to the high number of function grid points that are used as covariates compared to the number of observations and to their high collinearity. In the main article and Scheipl and Greven (2016), we have discussed the resulting identifiability issue and the inherently important role of assumptions like the coefficient function β being smooth or being in the span of the first few eigenfunctions of a functional covariate. However, we do not agree that only prediction performance matters. This is certainly of interest in almost all applications, but there is also often legitimate interest in β itself. Also, if we do out of sample prediction on completely new data, say on fossil fuel samples from a different country, where the functional covariate is somewhat different, predictions might differ depending on β. For example, functions that were approximately orthogonal to the original functional covariates and thus could be added with an arbitrary constant to β without changing an unpenalized fit, might be no longer orthogonal, and the arbitrary constant can now strongly change the model predictions. If the assumptions implied by the penalty are reasonable, then this may constitute an advantage of penalized spline-based approaches over FPC-based approaches, as the latter are completely determined by the given dataset.
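The argument about out-of-sample prediction can be written out as a short worked equation. The symbols $g$, $c$, $γ$ and $x^{*}$ are introduced here for illustration only: $g$ is a function orthogonal to all observed functional covariates $x_{i}$ over the domain $S$, $c$ an arbitrary constant and $x^{*}$ a new covariate curve.

```latex
\begin{aligned}
% g is orthogonal to every observed covariate curve
\int_{S} x_{i}(s)\, g(s)\, ds = 0 \quad (i = 1, \dots, n)
\quad &\Longrightarrow \quad
\int_{S} x_{i}(s)\, \{\beta(s) + c\, g(s)\}\, ds
  = \int_{S} x_{i}(s)\, \beta(s)\, ds \quad \text{for every } c, \\
% a new curve that is not orthogonal to g
\int_{S} x^{*}(s)\, g(s)\, ds = \gamma \neq 0
\quad &\Longrightarrow \quad
\int_{S} x^{*}(s)\, \{\beta(s) + c\, g(s)\}\, ds
  = \int_{S} x^{*}(s)\, \beta(s)\, ds + c\, \gamma .
\end{aligned}
```

The fitted values on the original data therefore carry no information about $c$, while the prediction for $x^{*}$ shifts by $c γ$. Only additional assumptions on $β$, such as those implied by a penalty, can pin this component down.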

We also welcome the discussions by PS, BaIC and others on important future directions in functional data analysis, which complement our own outlook with further important areas of research that the community should tackle in the coming years. Regarding the topics of dimension reduction and spatio-temporal processes/imaging data, we have worked on multivariate functional principal component analysis (FPCA) approaches for dimension reduction that can jointly tackle imaging and functional data (Happ and Greven, 2016) and will strive to integrate this approach into our general modelling framework. With respect to the relative scarcity of Bayesian methods that BaIC point out, we agree that it would be a valuable addition to have a wider range of Bayesian functional regression models available. We have, in fact, started implementing a flexible class of Bayesian models for another area of interest that BaIC identify, namely joint models for longitudinal (sparse functional) and time-to-event data (see Köhler et al., 2016). This is based on the R package bamlss (Umlauf et al., 2016), which provides a toolbox for Bayesian additive models for location, scale and shape, with interfaces to MCMC samplers from JAGS (Plummer, 2016), Stan (Stan Development Team, 2016) and BayesX (Belitz et al., 2013). While we cannot honestly say that this speeds up computations, it does provide a very flexible framework for Bayesian inference.

In conclusion, we believe that regression models for functional data remain an exciting and active area of research, where much has been achieved and much needs to be done in order to develop suitable methods for the complex functional datasets of today.

Acknowledgments

Financial support was provided by the German Research Foundation (DFG) through Emmy Noether grant GR 3793/1-1.

References

Aguilera and Aguilera-Morillo (2013) Penalized PCA approaches for B-spline expansions of smooth functional data. Applied Mathematics and Computation, 219, 7805–19.

Belitz, Brezger, Kneib, Lang and Umlauf (2013) BayesX: Software for Bayesian Inference in Structured Additive Regression Models. Version 2.1. Available at http://www.BayesX.org (accessed 6 January 2017).

Breslow and Clayton (1993) Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88, 9–25.

Bühlmann and Hothorn (2007) Boosting algorithms: Regularization, prediction and model fitting. Statistical Science, 22, 477–505.

Cederbaum, Pouplier, Hoole and Greven (2016) Functional linear mixed models for irregularly or sparsely sampled data. Statistical Modelling, 16, 67–88.

Goodnight (1978) The sweep operator: Its importance in statistical computing (Technical report). SAS technical report R-106, SAS Institute Inc., Cary, NC.

Greven and Scheipl (2016) Discussion (on smoothing parameter and model selection for general smooth models by Wood, Pya and Säfken). Journal of the American Statistical Association.

Happ and Greven (2016) Multivariate functional principal component analysis for data observed on different (dimensional) domains. Journal of the American Statistical Association, 111, 1568–73.

Herrick and Morris (2006) Wavelet-based functional mixed model analysis: Computation considerations. In Proceedings of the Joint Statistical Meetings, ASA Section on Statistical Computing, Seattle, Washington.

Hofner, Hothorn, Kneib and Schmid (2012) A framework for unbiased model selection based on boosting. Journal of Computational and Graphical Statistics, 20, 956–71.

Hofner, Mayr and Schmid (2014) gamboostLSS: An R package for model building and variable selection in the GAMLSS framework. Journal of Statistical Software, 74(1).

Köhler, Umlauf, Beyerlein, Winkler, Ziegler A-G and Greven (2016) Flexible Bayesian additive joint models with an application to type 1 diabetes research. arXiv preprint arXiv:1611.01485.

Marra and Wood (2011) Practical variable selection for generalized additive models. Computational Statistics & Data Analysis, 55, 2372–87.

Meinshausen and Bühlmann (2010) Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 417–73.

Meyer, Coull, Versace, Cinciripini and Morris (2015) Bayesian function-on-function regression for multilevel functional data. Biometrics, 71, 563–74.

Morris and Carroll (2006) Wavelet-based functional mixed models. Journal of the Royal Statistical Society: Series B, 68, 179–99.

Plummer (2016) rjags: Bayesian Graphical Models using MCMC. R package version 4-6. Available at https://CRAN.R-project.org/package=rjags (accessed 6 January 2017).

Reiss and Ogden (2009) Smoothing parameter selection for a class of semiparametric linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71, 505–23.

Reiss, Huang and Mennes (2010) Fast function-on-scalar regression with penalized basis expansions. The International Journal of Biostatistics, 6, 1557–4679.

Ruppert, Wand and Carroll (2003) Semiparametric Regression. Cambridge: Cambridge University Press.

Scheipl and Greven (2016) Identifiability in penalized function-on-function regression models. Electronic Journal of Statistics, 10, 495–526.

Scheipl, Staicu A-M and Greven (2015) Functional additive mixed models. Journal of Computational and Graphical Statistics, 24, 477–501.

Shah and Samworth (2013) Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75, 55–80.

Stan Development Team (2016) The Stan C++ Library, Version 2.12.0. Available at http://mc-stan.org (accessed 6 January 2017).

Stasinopoulos and Rigby (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23, 1–46.

Umlauf, Klein, Zeileis and Koehler (2016) bamlss: Bayesian Additive Models for Location Scale and Shape (and Beyond). R package version 0.1-1. Available at https://r-forge.r-project.org/R/?group_id=865 (accessed 6 January 2017).

Wang HCY, Paik and Choi (2013) A marginal approach to reduced-rank penalized spline smoothing with application to multilevel functional data. Journal of the American Statistical Association, 108, 1216–29.

Wood (2006) Generalized Additive Models: An Introduction with R. Boca Raton: CRC Press.

Wood (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B, 73, 3–36.

Wood (2013a) A simple test for random effects in regression models. Biometrika, 100, 1005–10.

Wood (2013b) On p-values for smooth components of an extended generalized additive model. Biometrika, 100, 221–28.

Wood (2016) Just another Gibbs additive modeler: Interfacing JAGS and mgcv. Journal of Statistical Software, 75, 1–15.

Wood and Fasiolo (2016) A generalized Fellner-Schall method for smoothing parameter estimation with application to Tweedie location, scale and shape models. Biometrics, to appear. Available at http://www.biometrics.tibs.org/fppaperstoappear.htm

Wood, Shaddick and Augustin (2016) Generalized additive models for gigadata: Modelling the UK black smoke network daily data. Journal of the American Statistical Association. Available at http://amstat.tandfonline.com/doi/abs/10.1080/01621459.2016.1195744

Wood, Pya and Säfken (2016a) Rejoinder (for Smoothing parameter and model selection for general smooth models). Journal of the American Statistical Association, 111, 1573–75.

Wood, Pya and Säfken (2016b) Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association, 111(516), 1548–63.

