Abstract
In this study, we develop a penalized additive regression estimation method based on a neural network architecture. An additive neural network model is constructed as a linear combination of univariate neural networks, or equivalently functional components. For the nodes that constitute the model, we use a B-spline activation function, which is useful for capturing local features of data. A penalty function is adopted to induce sparsity simultaneously in the functional components and in the nodes of each component. This yields a sparse representation, which in turn improves the accountability of the model. To implement the proposed estimation method, we devise an efficient iterative algorithm based on a coordinate-wise updating process. An initialization scheme specialized for the B-spline activation function is proposed. This initialization enables the proposed method to achieve better performance than a random initialization scheme. Numerical studies show that the fitted functional components of our estimator adapt to local and sparse structures in a given dataset.
Introduction
In the multivariate regression problem, the generalized additive model is a popular tool in practice due to its flexibility and interpretability. It requires no predetermined form for the relationship between each predictor and the response, relaxing the restrictive assumption of a parametric relationship made in the linear regression literature. Additionally, the estimate of each functional component in the additive model describes how the response variable changes with the corresponding predictor. [5] provides a comprehensive review of the additive model.
The individual functional component in an additive model can be modeled using univariate regression function estimation techniques including smoothing spline, regression spline, local polynomial, and kernel-based methods. Thus, an essential step in estimation is to choose a univariate function estimation method. For an overview of nonparametric function estimation methods, one may refer to [15, 13].
Function estimation methods based on a neural network architecture hold a prominent place due to the high estimation accuracy inherited from their flexible structure. However, the complexity of a neural network obstructs its accountability and transparency, since it is difficult to understand how the network arrives at its estimate. We strike a balance between the accountability of the additive model and the flexibility of the neural network structure by constructing an additive model whose functional components are based on the neural network structure.
In this study, we propose a penalized additive regression estimator based on a neural network architecture. We construct an additive neural network model in which each functional component is modeled with a univariate neural network structure. To encourage sparsity in the model, a hierarchical lasso penalty, proposed by [20] in the linear regression literature, is adopted as the penalty function. The penalty simultaneously induces sparsity in functional components and in nodes by removing unnecessary ones. The resulting estimator is locally adaptive: it allows a different number of nodes for each functional component and controls the node placement. A strategy to implement the proposed estimation method is developed. Numerical studies on simulated data and a real-life example show that the proposed method performs well in terms of estimation accuracy and component selection. Moreover, it recovers inhomogeneous and sparse structures of the example functional components from the given data.
The main contributions of this paper are summarized as follows. First, we incorporate a penalization scheme into an additive neural network to encourage sparsity in the model. The construction of additive neural networks aims to enhance interpretability by restricting the neural network structure to an additive form. Sparsity induced by the penalization scheme enables component and node selection with a single complexity parameter, and thus makes the most of the benefits provided by the additive neural network model. The use of a sparsity-inducing penalty function distinguishes this study from existing additive neural network methods. Second, we establish a sound design for the additive neural network to support good performance in practice. The performance of neural network methods largely depends on various elements, including the type of activation function, the algorithm, and the initialization scheme. We use B-spline activation functions as nodes in the neural network. The chosen activation functions are locally active, so our network captures local features of the data. Moreover, we develop an initialization scheme specialized for the B-spline activation function, and illustrate that it works well compared with random initialization.
The remainder of the paper is organized as follows. Section 2 discusses previous research related to this work. Section 3 describes the B-spline activation function and defines the penalized additive neural network estimator. An implementation algorithm to obtain the proposed estimator and a specialized initialization scheme are detailed in Section 4, followed by numerical studies on simulated and real data in Section 5. The conclusion is given in Section 6. The proof of Lemma 1 is presented in Appendix A.
Related work
Recently, there have been several attempts to improve the transparency of neural networks by restricting the network architecture. A representative approach is the additive neural network method, in which the network is defined as the sum of multiple subnetworks. It pursues a good balance between the flexibility of the neural network and the interpretability of the additive model. [9] develops a generalized additive neural network model using the hyperbolic tangent function as its activation function. [1] introduce a deep neural network additive model based on exp-centered hidden units. To avoid overfitting, they use dropout for hidden nodes and predictors together with weight decay. More recently, [17] propose an additive neural network method with main effects and pairwise interaction effects. They use an importance criterion for component selection and then retrain the network with the selected terms.
One way to enhance interpretability and prediction accuracy for additive neural network methods is to impose sparsity, thereby reducing unnecessary model complexity. [17] obtain enhanced interpretability by removing unnecessary main or interaction terms based on importance quantities. Then, a fine-tuning procedure is carried out with the selected terms. [1] use several regularization methods to reduce model complexity. However, their main interest is to avoid overfitting, not to obtain a sparse model for interpretability.
Penalization techniques can be a useful tool to induce sparsity. Most penalization methods were originally introduced in the linear regression literature, with the aim of shrinking the predictor effects and selecting important predictors. A popular method is the lasso penalty proposed by [12]. They use the
From the additive regression perspective, penalization techniques have been used to control the complexity of each component function and to remove unnecessary predictors, or equivalently functional components. [8] introduce an extension of the lasso for additive models, using smoothing splines and a penalty based on the sum of component norms. A variable selection method for additive models is developed by [2]. Their method consists of two stages: a penalized additive regression spline is fitted, and then unimportant components are removed via the nonnegative garrote method. More recently, [7] propose an additive regression spline estimator based on total variation and nonnegative garrote penalties. These three penalization methods control the complexity of each component and sparsity at the predictor level.
For additive neural network methods, a penalization method can be adopted to reduce unnecessary model complexity. We propose using the hierarchical lasso penalty for the additive neural network regression method. The proposed method simultaneously encourages sparsity at the node and predictor levels with a single complexity parameter. In contrast, the existing additive neural network methods involve a relatively complicated process of tuning two or more complexity parameters to obtain sparsity at the two levels.
Model and estimator
Additive neural network model
Let
where
The additive neural network model is constructed by incorporating univariate neural network models into an additive form based on a linear combination. The additive neural network’s nodes are constructed using a symmetric B-spline activation function defined as
For
where
where
Additive neural network structure.
Examples of 
We describe a B-spline activation function with a form
A benefit of using the B-spline activation function is that it locally activates the linear function
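As an illustration of the structure described in this subsection, the following minimal sketch builds an additive model from univariate networks with a symmetric, locally supported B-spline activation. It assumes the centered cubic cardinal B-spline, which may differ in order and scaling from the activation defined in this paper, and the parameter names `slopes`, `shifts`, and `weights` are hypothetical.

```python
import numpy as np

def bspline_activation(u):
    """Centered cubic cardinal B-spline (assumed form): symmetric about 0, support [-2, 2]."""
    a = np.abs(np.asarray(u, dtype=float))
    inner = 2.0 / 3.0 - a**2 + 0.5 * a**3      # piece used where |u| < 1
    outer = (2.0 - a) ** 3 / 6.0               # piece used where 1 <= |u| < 2
    return np.where(a < 1.0, inner, np.where(a < 2.0, outer, 0.0))

def component(x, slopes, shifts, weights):
    """Univariate network: sum_k weights[k] * B(slopes[k] * x + shifts[k])."""
    x = np.asarray(x, dtype=float)
    z = np.outer(x, slopes) + shifts           # shape (n, K): linear function per node
    return bspline_activation(z) @ weights

def additive_network(X, params, intercept=0.0):
    """Additive model: intercept plus one univariate network per predictor."""
    return intercept + sum(component(X[:, j], *p) for j, p in enumerate(params))
```

Because the activation vanishes outside a bounded interval, each node contributes only where its linear function falls inside the support, which is the local behavior the text refers to.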
Penalized additive neural network estimator
To keep the notation uncluttered, we denote
We add to the loss function a penalization term that induces sparsity separately at the component and node levels. We formulate an optimization problem that minimizes
subject to a constraint
In solving the above optimization problem, tuning two complexity parameters incurs a high computational cost, since a two-dimensional grid of complexity parameter values must be searched. [20] propose a method that reduces the two complexity parameters to one in the linear regression literature. Following their approach, we reformulate the optimization problem so that the two tuning parameters are replaced by a single one. Let
subject to a constraint
Let
The proof is given in Appendix A. This lemma shows that the fitted functional components in the two optimization problems are equal, and thus the fitted values are also the same. Let
Iterative algorithm
We devise an iterative algorithm to obtain
Update
Update
If the difference between the current loss and the updated loss is sufficiently small, go to step 3. Otherwise, go back to step 1.
Update
subject to
If the difference between the current loss and the updated loss is sufficiently small, stop the algorithm. Otherwise, go back to step 1.
Each optimization in steps 1–3 plays a different role. We refer to steps 1–3 as the ‘node pruning step’, the ‘node placement step’, and the ‘component selection step’, respectively. In the node pruning step, the appropriate number of nodes is determined by shrinking
The optimizations based on coordinate-wise updating process are described below.
Node pruning step
In the node pruning step, the problem becomes a lasso problem [12]. Here,
with respect to
where
The amount of shrinkage at node level gets larger when
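Since the node pruning step is a lasso problem solved by coordinate-wise updates, a generic version of that update can be sketched as follows. This is a plain lasso on a matrix `Z` of node outputs with the objective (1/2n)‖y − Zβ‖² + λ Σ|β_k|; the scaling of the objective and any component-dependent weighting of the node-level penalty used in this paper are not reproduced here, and the names `Z`, `lam`, and `lasso_cd` are assumptions for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding operator: sign(v) * max(|v| - t, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_cd(Z, y, lam, n_sweeps=200, tol=1e-10):
    """Coordinate descent for (1/2n)||y - Z b||^2 + lam * sum_k |b_k|."""
    n, K = Z.shape
    beta = np.zeros(K)
    resid = np.array(y, dtype=float)           # residual y - Z @ beta (beta starts at 0)
    col_sq = (Z ** 2).sum(axis=0) / n          # per-column curvature z_k'z_k / n
    for _ in range(n_sweeps):
        max_change = 0.0
        for k in range(K):
            if col_sq[k] == 0.0:
                continue                       # an all-zero column carries no information
            # Correlation of column k with the partial residual that excludes node k.
            rho = Z[:, k] @ resid / n + col_sq[k] * beta[k]
            new = soft_threshold(rho, lam) / col_sq[k]
            if new != beta[k]:
                resid -= Z[:, k] * (new - beta[k])
                max_change = max(max_change, abs(new - beta[k]))
                beta[k] = new
        if max_change < tol:
            break
    return beta
```

Coefficients that are thresholded exactly to zero correspond to nodes that would be pruned from the component.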
Node placement step
We use an approximation method to optimize
with respect to
where
We sequentially update
Component selection step
The problem in this step is a nonnegative garrote problem [3]. After obtaining the updated
with respect to
where
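The component selection step can be pictured with a generic nonnegative garrote solved coordinate-wise. In the sketch below, column j of `F` holds the current fitted values of the j-th functional component and the scaling factors `c` minimize (1/2n)‖y − Fc‖² + λ Σ c_j subject to c_j ≥ 0. The objective scaling and the names `F`, `lam`, and `nonneg_garrote_cd` are assumptions, not the exact formulation of this paper.

```python
import numpy as np

def nonneg_garrote_cd(F, y, lam, n_sweeps=200, tol=1e-10):
    """Coordinate descent for (1/2n)||y - F c||^2 + lam * sum_j c_j with c_j >= 0."""
    n, p = F.shape
    c = np.ones(p)                             # start from the unshrunk components
    resid = np.asarray(y, dtype=float) - F @ c
    col_sq = (F ** 2).sum(axis=0) / n
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                c[j] = 0.0                     # an identically zero component is dropped
                continue
            # Correlation of component j with the partial residual excluding it.
            rho = F[:, j] @ resid / n + col_sq[j] * c[j]
            new = max(0.0, (rho - lam) / col_sq[j])
            if new != c[j]:
                resid -= F[:, j] * (new - c[j])
                max_change = max(max_change, abs(new - c[j]))
                c[j] = new
        if max_change < tol:
            break
    return c
```

A scaling factor of exactly zero removes the corresponding functional component, which is how component selection arises in this kind of formulation.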
The choice of the tuning parameter is important for the performance of estimators in penalization methods. This section presents a tuning parameter selection strategy for the PANNE. Consider a decreasing sequence of
Our numerical studies confirm that the approximately computed
where
To reduce the computational burden of searching over a set of complexity parameters, we propose a heuristic stopping rule under which our algorithm stops before a complete search over a sequence of
Compute If Select
Here,
The gray solid lines represent 25 initial activation functions.
In general, the performance of neural network methods largely depends on the initialization scheme. Popular initializations include Xavier initialization [4] and He initialization [6]. As these works indicate, the choice of an initialization scheme is related to the type of activation function used. Thus, we need to develop an initialization strategy that takes the B-spline activation function into account. The initialization method presented here is similar to the one for the radial basis function mentioned in [16]; however, it is specialized for our network structure.
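For reference, the standard forms of the Xavier and He initializers mentioned above can be sketched as follows. These are the commonly used definitions (uniform limit √(6/(fan_in + fan_out)) for Xavier, standard deviation √(2/fan_in) for He normal, uniform limit √(6/fan_in) for He uniform); the function names are hypothetical, and this is not the specialized B-spline initialization proposed in this paper.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier (Glorot) uniform: U(-l, l) with l = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, 2 / fan_in)."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def he_uniform(fan_in, fan_out, rng):
    """He uniform: U(-l, l) with l = sqrt(6 / fan_in)."""
    limit = np.sqrt(6.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

For a univariate component with 25 nodes, one would take `fan_in = 1` and `fan_out = 25`. These schemes ignore the bounded support of a B-spline activation, so randomly drawn nodes may fall where no data activate them, which is one motivation for a specialized scheme.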
Instead of determining
Set Let Set Set
To justify the proposed initialization, we compare it with random initialization and He initialization (uniform and normal). For random initialization, the initial values are generated from the normal distribution with mean zero and finite variance. The variances are chosen by 1 and 0.1
Comparisons between the proposed initialization and random initializations. 
We apply the warm start when the algorithm proceeds from
Simulations
We investigate the performance of the proposed method in terms of estimation accuracy and component selection consistency. We compare the PANNE with existing additive regression methods that perform component selection, including the PARSE [7], the COSSO [8], and the adaptive COSSO (ACOSSO; [11]). The last two methods use smoothing splines and a penalty functional. The PARSE is an additive regression spline estimator based on total variation and nonnegative garrote penalties, fitted via a two-stage procedure.
We adopt the mean squared error (MSE) and maximum deviation (MXDV) as discrepancy measures between the example function
To evaluate the component selection consistency, we compute the average number of relevant variables selected in the model (AverR) and average number of irrelevant variables selected in the model (AverI) as defined in [14].
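The performance measures above can be computed along the following lines. This sketch assumes that the true and fitted functions are evaluated at a common set of points and that each repetition yields a set of selected predictor indices; the function names are hypothetical, and the precise evaluation points used in this paper are not reproduced.

```python
import numpy as np

def mse(f_true, f_hat):
    """Mean squared error between true and fitted values at common points."""
    return float(np.mean((np.asarray(f_true) - np.asarray(f_hat)) ** 2))

def mxdv(f_true, f_hat):
    """Maximum absolute deviation between true and fitted values."""
    return float(np.max(np.abs(np.asarray(f_true) - np.asarray(f_hat))))

def selection_summary(selected_sets, relevant):
    """AverR and AverI: average counts of relevant and irrelevant selected variables."""
    relevant = set(relevant)
    aver_r = np.mean([len(set(s) & relevant) for s in selected_sets])
    aver_i = np.mean([len(set(s) - relevant) for s in selected_sets])
    return float(aver_r), float(aver_i)
```

With this convention, perfect selection corresponds to AverR equal to the number of relevant variables and AverI equal to zero.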
For the comparison, we consider two simulation examples. A set of response observations is generated by
where, for
The initial number of nodes in each functional component for the PANNE is set to be
We first consider a simulated data example, which has also been considered by [2], with response observations generated from Eq. (4) where
In this example, the second, fifth and sixth functional components have a weaker influence on the response than the others. The example functional components are displayed in Fig. 5. We set
Table 1 summarizes the simulation results for the performance measures. The averages of the MSEs and MXDVs over 100 repetitions are reported with their standard errors in parentheses. The results show that the PANNE performs well on all of the performance measures. The PANNE and COSSO perfectly distinguish signal variables from noise variables, whereas the ACOSSO and PARSE do not exactly distinguish weak signals from noise.
(Example 1) Average of each criterion over 100 repetitions with standard error in parentheses
(Example 1) Functional components.
(Example 1) 5th and 95th quantiles for the fitted functional components of PANNE obtained through 100 replications and their bands.
To examine the results for the PANNE in detail, we compute the pointwise 5th and 95th quantiles of the fitted values at points in each functional component. Figure 6 presents these quantiles and the bands between them for each functional component. It shows that the fitted functional components of the PANNE recover the weak signal components as well as the strong signal components.
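The pointwise bands described here can be obtained as in the following minimal sketch, which assumes that the fitted values of one functional component are stored as an array with one row per replication and one column per evaluation point; the name `pointwise_band` is hypothetical.

```python
import numpy as np

def pointwise_band(fits, lower=0.05, upper=0.95):
    """Pointwise quantile band across replications.

    fits: array of shape (replications, evaluation points).
    Returns the lower and upper quantile curves, each of length `evaluation points`.
    """
    fits = np.asarray(fits, dtype=float)
    lo, hi = np.quantile(fits, [lower, upper], axis=0)
    return lo, hi
```

A narrow band around a nonzero curve indicates that the component is recovered consistently across replications.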
We consider a more complicated example in which response observations are generated from Eq. (4), where
The example functional components, displayed in Fig. 7, have various characteristics. The first and second functional components are sine functions with spatially inhomogeneous smoothness. The third functional component has a sparse structure over [0, 0.5], where the third predictor has no influence on the response. The fourth component is the same as the fourth in Example 1 except for a scale difference. We consider two scenarios with
Tables 2 and 3 summarize the simulation results for
(Example 2) Average of each criterion over 100 repetitions with standard error in parentheses for
(Example 2) Functional components.
(Example 2) Average of each criterion over 100 repetitions with standard error in parentheses for
Fitted functional components for PANNE (black lines) and data points (gray points).
Active nodes (multiplied by the corresponding weights) for the 
The number of nodes (left plot) and BIC values (right plot) along a 
Fitted functional components of water and slag for the PANNE (red lines) and partial residuals (gray dots).
We apply the proposed method to the concrete dataset with 103 observations (
For the PANNE, the initial number of nodes in each functional component is set to 25 and the number of candidate complexity parameter values is set to 100. In Fig. 10, we present the number of nodes (the left plot) and the BIC values (the right plot) along a
The PANNE selects water and slag as signal variables with high influence on the slump flow. This is consistent with the analysis conducted in [2]. To examine the effects of water and slag on the slump flow, the corresponding fitted functional components are shown in Fig. 11, in which the partial residuals are displayed as gray points. Overall, the slump flow tends to increase as the amount of water increases. This result coincides with the findings of [2]. However, the PANNE also captures the local characteristics around [190, 200] present in the data. For the slag variable, the functional component increases at first and then decreases once slag becomes larger than about 70. The residual sum of squares (RSS) is computed as in [2]. The
In this paper, we developed a penalized additive regression estimator based on a neural network structure and a hierarchical lasso penalty. The flexible architecture of the neural network model and the accountability of the additive model were combined. We used a B-spline activation function with local support to capture local trends in the data and to improve the stability of the optimization algorithm. An iterative algorithm based on a coordinate descent updating process was devised. A specialized initialization scheme for the B-spline activation function was described. The simulation results showed that the proposed estimator performs well compared with the existing methods. Adaptation in node placement was shown to be an outstanding advantage of the proposed method. It enables the proposed estimator to capture the sparse structure as well as the inhomogeneous structure of a functional component based on data.
A possible extension of the proposed method is to incorporate pairwise interaction effects between predictors into the additive neural network model. We expect that this extension would improve interpretability as well as prediction accuracy. It would require developing a corresponding penalization to reduce model complexity. We defer this to future research.
Acknowledgments
The research of Ja-Yong Koo was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2018R1D1A1B07049972) and by a Korea University Grant (K2109351). The research of Kwan-Young Bak was supported by the Basic Science Research Program through NRF funded by the Ministry of Education, Science and Technology (RS-2022-00165581).
Appendix A. Proof of Lemma 1
Denote the criterion Eq. (2) by
Let
Choose
It follows from the above inequality and
that
Therefore,
Let
Choose
Then we have
Therefore,
