Penalized additive neural network regression

Abstract

In this study, we develop a penalized additive regression estimation method based on a neural network architecture. An additive neural network model is constructed as a linear combination of univariate neural networks, or equivalently functional components. The nodes that constitute the model use a B-spline activation function, which is useful for capturing local features of data. A penalty function is adopted to induce sparsity simultaneously in the functional components and in the nodes of each component. This yields a sparse representation, which in turn improves the accountability of the model. To implement the proposed estimation method, we devise an efficient iterative algorithm based on a coordinate-wise updating process. We also propose an initialization scheme specialized for the B-spline activation function, which enables the proposed method to achieve better performance than a random initialization scheme. Numerical studies show that the fitted functional components of our estimator adapt to local and sparse structures in a given dataset.

Keywords

Additive regression model; neural network architecture; penalization; sparsity

1. Introduction

In the multivariate regression problem, the generalized additive model is a popular tool in practice due to its flexibility and interpretability. It requires no predetermined form for the relationship between each predictor and the response, relaxing the restrictive parametric assumptions of linear regression. Additionally, the estimate of each functional component in the additive model describes how the response variable changes with the corresponding predictor. [5] provides a comprehensive review of the additive model.

Each functional component in an additive model can be modeled using univariate regression function estimation techniques, including smoothing spline, regression spline, local polynomial, and kernel-based methods. Thus, an essential step in estimation is to choose a univariate function estimation method. For an overview of nonparametric function estimation methods, one may refer to [15, 13].

Function estimation methods based on a neural network architecture hold a prominent place due to the high estimation accuracy inherited from their flexible structure. However, the complexity of a neural network obstructs its accountability and transparency, since it is difficult to understand how the network arrives at its estimate. We strike a balance between the accountability of the additive model and the flexibility of the neural network structure by constructing an additive model whose functional components are based on the neural network structure.

In this study, we propose a penalized additive regression estimator based on a neural network architecture. We construct an additive neural network model in which each functional component is modeled with a univariate neural network structure. To encourage sparsity in the model, a hierarchical lasso penalty, proposed by [20] in the linear regression literature, is adopted as the penalty function. The penalty simultaneously induces sparsity in the functional components and in the nodes by removing unnecessary ones. The resulting estimator is locally adaptive: it allows a different number of nodes for each functional component and controls the node placement. A strategy to implement the proposed estimation method is developed. Numerical studies on simulated data and a real-life example show that the proposed method performs well in terms of estimation accuracy and component selection. Moreover, it recovers inhomogeneous and sparse structures of the example functional components from the given data.

The main contributions of this paper are summarized as follows. First, we incorporate a penalization scheme into an additive neural network to encourage sparsity in the model. The construction of additive neural networks aims to enhance interpretability by restricting the neural network structure to an additive form. Sparsity induced by the penalization scheme enables component and node selection with a single complexity parameter, and thus maximizes the benefits provided by the additive neural network model. The use of a sparsity-inducing penalty function is a distinctive characteristic of this study when compared with existing additive neural network methods. Second, we establish a sound design for the additive neural network to support good performance in practice. The performance of neural network methods largely depends on several elements, including the type of activation function, the optimization algorithm, and the initialization scheme. We use B-spline activation functions as nodes in the neural network. The chosen activation functions are locally active, so our network captures local features of the data. Moreover, we develop an initialization scheme specialized for the B-spline activation function, and illustrate that it works well compared with random initialization.

The remainder of the paper is organized as follows. Section 2 discusses previous research related to this work. Section 3 describes the B-spline activation function and defines the penalized additive neural network estimator. An implementation algorithm to obtain the proposed estimator and a specialized initialization scheme are detailed in Section 4, followed by numerical studies on simulated and real data in Section 5. The conclusion is given in Section 6. The proof of Lemma 1 is presented in Appendix A.

2. Related work

Recently, there have been several attempts to improve the transparency of neural networks by restricting the network architecture. A representative approach is the additive neural network method, in which the network is defined as the sum of multiple subnetworks. It pursues a good balance between the flexibility of neural networks and the interpretability of additive models. [9] develops a generalized additive neural network model using the hyperbolic tangent function as its activation function. [1] introduce a deep neural network additive model based on exp-centered hidden units. To avoid overfitting, they use dropout techniques for hidden nodes and predictors together with weight decay. More recently, [17] propose an additive neural network method with main effects and pairwise interaction effects. They use an importance criterion for component selection and then retrain the network with the selected terms.

One possible way to enhance the interpretability and prediction accuracy of additive neural network methods is to impose sparsity, thereby reducing unnecessary model complexity. [17] obtain enhanced interpretability by removing unnecessary main or interaction terms based on importance quantities. A fine-tuning procedure is then implemented with the selected terms. [1] use several regularization methods to reduce model complexity. However, their main interest is to avoid overfitting, not to obtain a sparse model for interpretability.

Penalization can be a useful tool to induce sparsity. Most penalization methods were originally introduced in the linear regression literature, with the aim of shrinking the predictor effects and selecting important predictors. A popular method is the lasso penalty proposed by [12], which uses the $\ell_{1}$-norm of the coefficients as a penalty function for variable selection. [19] introduce the group lasso penalty for variable selection in a grouped manner. Their method penalizes the $\ell_{2}$-norm of the coefficients associated with a group of variables. Recently, [20] propose a hierarchical lasso penalty to remove unimportant groups as well as unimportant variables within a group. This penalty simultaneously controls sparsity at two levels with a single complexity parameter.

From the additive regression perspective, penalization techniques have been used to control the complexity of each component function and to remove unnecessary predictors, or equivalently functional components. [8] introduce an extension of the lasso for additive models, using smoothing splines and a penalty based on the sum of component norms. A variable selection method for additive models is developed by [2]. Their method consists of two stages: a penalized additive regression spline is fitted, and then unimportant components are removed via the nonnegative garrotte method. More recently, [7] propose an additive regression spline estimator based on total variation and nonnegative garrotte penalties. These three penalization methods control the complexity of each component and the sparsity at the predictor level.

For additive neural network methods, penalization can be adopted to reduce unnecessary model complexity. We propose the use of the hierarchical lasso penalty for the additive neural network regression method. The proposed method simultaneously encourages sparsity at the node and predictor levels with a single complexity parameter. In contrast, the existing additive neural network methods involve a relatively complicated process of tuning two or more complexity parameters to obtain sparsity at the two levels.

3. Model and estimator

3.1 Additive neural network model

Let ${\left\{(x^{i},y^{i})\right\}}_{i=1}^{N}$ be a set of observations from the additive regression model

$\displaystyle y^{i}=f(x^{i})+\varepsilon^{i}=\mu+f_{1}(x^{i}_{1})+\cdots+f_{p}(x^{i}_{p})+\varepsilon^{i},\quad i=1,\ldots,N,$ (1)

where $\mu\in{\mathbb{R}}$ is the intercept term, $y^{i}\in{\mathbb{R}}$, $x^{i}=(x^{i}_{1},\ldots,x^{i}_{p})\in{\mathbb{R}}^{p}$ and $\varepsilon^{i}$ are independent errors with ${\mathbb{E}}{\left[\varepsilon^{i}\right]}=0$. For identifiability, we assume that the functional components $f_{j}$ satisfy ${\mathbb{E}}[f_{j}]=0$ for $j=1,\ldots,p$. The goal is to estimate the unknown function $f$ based on an additive neural network model and a penalty function that induces sparsity in nodes and functional components simultaneously.

The additive neural network model is constructed by incorporating univariate neural network models into an additive form based on a linear combination. The additive neural network’s nodes are constructed using a symmetric B-spline activation function defined as

$\displaystyle\sigma(z)=\left\{\begin{array}{ll}1+z&-1<z\leqslant 0\\ 1-z&0<z\leqslant 1\\ 0&\text{otherwise}\\ \end{array}\right..$

For $x=(x_{1},\ldots,x_{p})\in{\mathbb{R}}^{p}$ , define univariate neural network models based on the B-spline activation function by

$\displaystyle\,{\mathsf{s}}_{j}(x_{j};\beta^{j},\alpha^{j}_{0},\alpha^{j}_{1})=\sum_{m=1}^{M_{j}}\beta^{j}_{m}\sigma(\alpha^{j}_{0m}+\alpha^{j}_{1m}x_{j}),\quad j=1,\ldots,p,$

where $\beta^{j}=(\beta_{m}^{j})$, $\alpha^{j}_{0}=(\alpha^{j}_{0m})$, $\alpha^{j}_{1}=(\alpha^{j}_{1m})$, and $M_{j}$ denotes the number of nodes associated with $\,{\mathsf{s}}_{j}$. The additive neural network model is defined by a linear combination of the univariate neural network functions

$\displaystyle{\mathsf{f}}(x;\gamma,\beta,\alpha_{0},\alpha_{1})=\gamma_{0}+\sum_{j=1}^{p}\gamma_{j}\,{\mathsf{s}}_{j}(x_{j};\beta^{j},\alpha^{j}_{0},\alpha^{j}_{1}),$

where $\gamma=(\gamma_{j})$, $\beta=(\beta^{j})$, $\alpha_{0}=(\alpha_{0}^{j})$ and $\alpha_{1}=(\alpha_{1}^{j})$. We refer to $\gamma_{j}\,{\mathsf{s}}_{j}$ as the $j$th functional component. Figure 1 describes the additive neural network structure.
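The model definition above can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function names, the list-based storage of the per-component parameters, and the toy parameter values are assumptions made here, and the activation is taken to be the hat function $1-|z|$ on $[-1,1]$.

```python
import numpy as np

def sigma(z):
    # Hat-shaped linear B-spline, assumed here to be 1 - |z| on [-1, 1] and 0 elsewhere.
    return np.maximum(0.0, 1.0 - np.abs(z))

def component(xj, beta_j, alpha0_j, alpha1_j):
    """s_j evaluated at a vector xj of shape (N,); parameters have shape (M_j,)."""
    hidden = sigma(alpha0_j[None, :] + np.outer(xj, alpha1_j))  # shape (N, M_j)
    return hidden @ beta_j

def additive_network(X, gamma0, gamma, beta, alpha0, alpha1):
    """f(x) = gamma_0 + sum_j gamma_j * s_j(x_j); beta, alpha0, alpha1 are lists over j."""
    out = np.full(X.shape[0], float(gamma0))
    for j in range(X.shape[1]):
        out += gamma[j] * component(X[:, j], beta[j], alpha0[j], alpha1[j])
    return out

# Toy example with p = 2 predictors, M_1 = 1 node and M_2 = 2 nodes.
X = np.array([[0.5, 0.5]])
beta = [np.array([3.0]), np.array([1.0, 2.0])]
alpha0 = [np.array([-2.0]), np.array([-2.0, 0.0])]
alpha1 = [np.array([4.0]), np.array([4.0, 1.0])]
fitted = additive_network(X, gamma0=1.0, gamma=np.array([2.0, 0.5]),
                          beta=beta, alpha0=alpha0, alpha1=alpha1)
```

Storing the parameters as lists over $j$ lets each component carry its own number of nodes $M_{j}$, which is the feature the node-level penalty later exploits.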

Figure 1.

Additive neural network structure.

Figure 2.

Examples of $\sigma(a+bz)$ for the different values of $a$ and $b$ . The black solid lines and gray dotted lines represent their forms and centers, respectively.

3.2 B-spline activation function

We describe a B-spline activation function of the form $\sigma(a+bz)$ for $z\in{\mathbb{R}}$, where $a$ and $b\neq 0$ are real values. Its shape is determined by the values of $a$ and $b$. Specifically, the center and support size of $\sigma(a+bz)$ are $-a/b$ and ${|2/b|}$, respectively. Figure 2 shows three examples for different values of $a$ and $b$. Each has a symmetric shape and compact support.
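The center and support formulas can be checked with a short sketch. The function names and the example values of $a$ and $b$ are illustrative assumptions, and the activation is taken to be the hat function $1-|z|$ on $[-1,1]$.

```python
import numpy as np

def bspline_activation(z):
    """Linear B-spline (hat) activation: 1 - |z| on [-1, 1], zero elsewhere."""
    z = np.asarray(z, dtype=float)
    return np.maximum(0.0, 1.0 - np.abs(z))

def center_and_support(a, b):
    """Center -a/b and support length |2/b| of z -> sigma(a + b*z), for b != 0."""
    return -a / b, abs(2.0 / b)

a, b = -2.0, 4.0
center, width = center_and_support(a, b)

# Evaluate at the left end of the support, the center, and the right end.
grid = np.array([center - width / 2, center, center + width / 2])
values = bspline_activation(a + b * grid)
```

The activation vanishes at both ends of the support and peaks at the center, so increasing $|b|$ narrows the node and changing $a$ shifts it.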

A benefit of the B-spline activation function is that, because of its compact support, it activates the linear function $a+bz$ only locally, unlike other activation functions such as the sigmoid function and the rectified linear unit. This localization property makes the nodes in the network less entangled with each other, so a change at one node has less effect on the other nodes. It enables the network to identify local trends in the data by using local information effectively.

3.3 Penalized additive neural network estimator

To keep the notation uncluttered, we write $\theta=(\alpha_{0},\alpha_{1},\beta,\gamma)\in{\mathbb{R}}^{K}$, where $K=3\sum_{j=1}^{p}M_{j}+p+1$ is the dimension of the parameter space. We consider the squared-error loss function

$\displaystyle\ell(\theta)=\frac{1}{2}\sum_{i=1}^{N}{\left(y^{i}-{\mathsf{f}}(x^{i};\theta)\right)}^{2}.$

We add to the loss function a penalization term that induces sparsity at both the component level and the node level. We formulate an optimization problem that minimizes

$\displaystyle\ell(\theta)+\lambda_{1}\sum_{j=1}^{p}\gamma_{j}+\lambda_{2}\sum_{j=1}^{p}\sum_{m=1}^{M_{j}}{|\beta^{j}_{m}|}$ (2)

subject to a constraint $\gamma_{j}\geqslant 0$ , $j=1,\ldots,p$ , where $\lambda_{1}$ and $\lambda_{2}$ are positive complexity parameters. That is, we simultaneously select the significant functional components and nodes. The complexity parameters $\lambda_{1}$ and $\lambda_{2}$ control the sparsity levels in the components and nodes, respectively.

In solving the above optimization problem, tuning two complexity parameters incurs a high computational cost, since a two-dimensional grid of complexity parameters must be searched. [20] propose a method that reduces two complexity parameters to one in the linear regression literature. Following their approach, we reformulate the optimization problem so that it involves a single tuning parameter. Let $\lambda_{1}$ and $\lambda_{2}$ be fixed and $\lambda=\lambda_{1}\lambda_{2}$. Lemma 1 states that the above optimization problem is equivalent to minimizing

$\displaystyle\ell^{\lambda}(\theta)=\ell(\theta)+\sum_{j=1}^{p}\gamma_{j}+\lambda\sum_{j=1}^{p}\sum_{m=1}^{M_{j}}{|\beta^{j}_{m}|},$ (3)

subject to the constraint $\gamma_{j}\geqslant 0$, $j=1,\ldots,p$. In Lemma 1, the dependence of the minimizers on the complexity parameters is suppressed for notational convenience.

Lemma 1.

Let $(\bar{\gamma},\bar{\beta},\bar{\alpha}_{0},\bar{\alpha}_{1})$ be a local minimizer of Eq. (2). Then there exists a local minimizer $(\hat{\gamma},\hat{\beta},\hat{\alpha}_{0},\hat{\alpha}_{1})$ of Eq. (3) such that $\bar{\gamma}_{j}\,{\mathsf{s}}_{j}(\cdot;\bar{\beta}^{j},\bar{\alpha}_{0}^{j},\bar{\alpha}_{1}^{j})=\hat{\gamma}_{j}\,{\mathsf{s}}_{j}(\cdot;\hat{\beta}^{j},\hat{\alpha}_{0}^{j},\hat{\alpha}_{1}^{j})$, $j=1,\ldots,p$. Conversely, let $(\hat{\gamma},\hat{\beta},\hat{\alpha}_{0},\hat{\alpha}_{1})$ be a local minimizer of Eq. (3). Then there exists a local minimizer $(\bar{\gamma},\bar{\beta},\bar{\alpha}_{0},\bar{\alpha}_{1})$ of Eq. (2) such that $\bar{\gamma}_{j}\,{\mathsf{s}}_{j}(\cdot;\bar{\beta}^{j},\bar{\alpha}_{0}^{j},\bar{\alpha}_{1}^{j})=\hat{\gamma}_{j}\,{\mathsf{s}}_{j}(\cdot;\hat{\beta}^{j},\hat{\alpha}_{0}^{j},\hat{\alpha}_{1}^{j})$, $j=1,\ldots,p$.
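One way to build intuition for this equivalence is a rescaling argument, illustrated numerically below. The sketch is an assumption-laden illustration and not the proof given in the appendix, which may proceed differently: multiplying $\gamma_{j}$ by $\lambda_{1}>0$ and dividing $\beta^{j}$ by $\lambda_{1}$ leaves every product $\gamma_{j}\beta^{j}_{m}$, and hence every fitted component, unchanged, while mapping the penalty of Eq. (2) onto that of Eq. (3) with $\lambda=\lambda_{1}\lambda_{2}$. The function names and numerical values are hypothetical.

```python
import numpy as np

def penalty_eq2(gamma, beta, lam1, lam2):
    """lam1 * sum_j gamma_j + lam2 * sum_{j,m} |beta^j_m| (penalty in Eq. (2))."""
    return lam1 * np.sum(gamma) + lam2 * sum(np.abs(b).sum() for b in beta)

def penalty_eq3(gamma, beta, lam):
    """sum_j gamma_j + lam * sum_{j,m} |beta^j_m| (penalty in Eq. (3))."""
    return np.sum(gamma) + lam * sum(np.abs(b).sum() for b in beta)

lam1, lam2 = 0.7, 0.3
gamma_bar = np.array([1.5, 0.0, 2.0])
beta_bar = [np.array([0.4, -1.2]), np.array([0.0]), np.array([2.0, 0.5, -0.1])]

# Rescaled parameters: gamma_j * beta^j_m is unchanged, so each fitted component is too.
gamma_hat = lam1 * gamma_bar
beta_hat = [b / lam1 for b in beta_bar]

p2 = penalty_eq2(gamma_bar, beta_bar, lam1, lam2)
p3 = penalty_eq3(gamma_hat, beta_hat, lam1 * lam2)
```

Since the loss depends on the parameters only through the fitted components, the two objective values coincide under this rescaling.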

The proof is given in Appendix A. This lemma shows that the fitted functional components in the two optimization problems are equal, and thus the fitted values are also the same. Let $\Theta=\{\theta=(\alpha_{0},\alpha_{1},\beta,\gamma)\in{\mathbb{R}}^{K}:\gamma_{j}\geqslant 0,\ j=1,\ldots,p\}$. The penalized additive neural network estimator (PANNE) is defined as

$\displaystyle\hat{f}={\mathsf{f}}(\,\cdot\,;\hat{\theta}^{\lambda}),\quad\text{where}\quad\hat{\theta}^{\lambda}=\operatorname*{argmin}_{\theta\in\Theta}\ell^{\lambda}(\theta).$

4. Implementation

4.1 Iterative algorithm

We devise an iterative algorithm to obtain $\hat{\theta}^{\lambda}$ for the PANNE. In what follows, the $\sim$ notation indicates the current values of parameters. For a fixed $\lambda>0$ , the iterative procedure is summarized as follows:

Step 1.
Update $\tilde{\beta}$ by

$\displaystyle\tilde{\beta}\leftarrow\operatorname*{argmin}_{\beta}\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}-\sum_{j=1}^{p}\tilde{\gamma}_{j}\sum_{m=1}^{M_{j}}\beta_{m}^{j}\sigma(\tilde{\alpha}_{0m}^{j}+\tilde{\alpha}_{1m}^{j}x^{i}_{j})\right)^{2}+\lambda\sum_{j=1}^{p}\sum_{m=1}^{M_{j}}{|\beta^{j}_{m}|}.$
Step 2.
Update $(\tilde{\alpha}_{0},\tilde{\alpha}_{1})$ by

$\displaystyle(\tilde{\alpha}_{0},\tilde{\alpha}_{1})\leftarrow\operatorname*{argmin}_{(\alpha_{0},\alpha_{1})}\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}-\sum_{j=1}^{p}\tilde{\gamma}_{j}\sum_{m=1}^{M_{j}}\tilde{\beta}_{m}^{j}\sigma(\alpha_{0m}^{j}+\alpha_{1m}^{j}x^{i}_{j})\right)^{2}.$

If the difference between the current loss and the updated loss is sufficiently small, go to step 3. Otherwise, go back to step 1.
Step 3.
Update $\tilde{\gamma}$ by

$\displaystyle\tilde{\gamma}\leftarrow\operatorname*{argmin}_{\gamma}\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}-\sum_{j=1}^{p}\gamma_{j}\,{\mathsf{s}}_{j}(x^{i}_{j};\tilde{\beta}^{j},\tilde{\alpha}^{j}_{0},\tilde{\alpha}^{j}_{1})\right)^{2}+\sum_{j=1}^{p}\gamma_{j},$

subject to $\gamma_{j}\geqslant 0$, $j=1,\ldots,p$.
Step 4.
If the difference between the current loss and the updated loss is sufficiently small, stop the algorithm. Otherwise, go back to step 1.

Each optimization in steps 1–3 plays a different role. We refer to steps 1–3 as the ‘node pruning step’, the ‘node placement step’, and the ‘component selection step’, respectively. In the node pruning step, the appropriate number of nodes is determined by shrinking $\tilde{\beta}^{j}_{m}$ toward zero depending on the value of $\lambda$. In the node placement step, the placements of the nodes are optimized by updating $\tilde{\alpha}_{0}$ and $\tilde{\alpha}_{1}$. After the functional components have been estimated, we select the important ones in the component selection step. To solve the optimization problem in each step, we apply a coordinate descent algorithm. Let

$\displaystyle y^{i}_{j}=y^{i}-\sum_{k\neq j}\tilde{\gamma}_{k}\,{\mathsf{s}}_{k}(x^{i}_{k};\tilde{\beta}^{k},\tilde{\alpha}^{k}_{0},\tilde{\alpha}^{k}_{1})\quad\text{and}\quad y^{i}_{jm}=y^{i}_{j}-\tilde{\gamma}_{j}\sum_{l\neq m}\tilde{\beta}^{j}_{l}\sigma(\tilde{\alpha}^{j}_{0l}+\tilde{\alpha}^{j}_{1l}x^{i}_{j}).$

The optimizations based on coordinate-wise updating process are described below.

Node pruning step

In the node pruning step, the problem becomes a lasso problem [12]. Here, $\lambda$ controls the estimates at the node level by removing the less influential nodes of each functional component. Consider a minimization problem with the objective function

$\displaystyle\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}_{jm}-\beta^{j}_{m}\tilde{\gamma}_{j}\sigma(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})\right)^{2}+\lambda{|\beta^{j}_{m}|}$

with respect to $\beta^{j}_{m}$. The solution to the given problem is given by

$\displaystyle\text{ST}\left(\frac{\sum_{i=1}^{N}y^{i}_{jm}\sigma(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})}{\tilde{\gamma}_{j}\sum_{i=1}^{N}\sigma^{2}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})},\frac{\lambda}{\tilde{\gamma}_{j}^{2}\sum_{i=1}^{N}\sigma^{2}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})}\right)$

where

$\displaystyle\text{ST}(a,b)=\left\{\begin{array}{ll}a-b,&a>b\\ 0,&{|a|}\leqslant b\\ a+b,&a<-b\\ \end{array}\right..$

The amount of shrinkage at the node level increases with $\lambda$. We sequentially update $\tilde{\beta}_{m}^{j}$ for $j=1,\ldots,p$, $m=1,\ldots,M_{j}$.
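The coordinate-wise update in the node pruning step can be sketched as follows. The function names and the handling of a zero denominator are assumptions made for this illustration; the text does not specify what to do when $\tilde{\gamma}_{j}=0$ or when no observation falls in the support of the node.

```python
import numpy as np

def soft_threshold(a, b):
    """ST(a, b) = sign(a) * max(|a| - b, 0) for b >= 0."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def update_beta(y_jm, h, gamma_j, lam):
    """Coordinate update of beta^j_m.

    y_jm : partial residuals y^i_{jm}, shape (N,)
    h    : activations sigma(alpha0 + alpha1 * x^i_j) of the node, shape (N,)
    """
    denom = np.sum(h ** 2)
    if gamma_j == 0.0 or denom == 0.0:
        return 0.0  # assumed convention: an inactive node or component keeps beta at zero
    a = np.sum(y_jm * h) / (gamma_j * denom)
    b = lam / (gamma_j ** 2 * denom)
    return float(soft_threshold(a, b))

h = np.array([1.0, 0.5, 0.0])
y_jm = np.array([2.0, 1.0, 0.3])
beta_small_lam = update_beta(y_jm, h, gamma_j=1.0, lam=0.25)   # shrunk but non-zero
beta_large_lam = update_beta(y_jm, h, gamma_j=1.0, lam=10.0)   # pruned to zero
beta_scaled = update_beta(y_jm, h, gamma_j=2.0, lam=0.25)
```

A larger $\lambda$ widens the threshold, so the second call prunes the node entirely while the first only shrinks its coefficient.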

Node placement step

We use an approximation method to optimize $\alpha_{0}$ and $\alpha_{1}$. We discuss only the procedure for optimizing $\alpha_{0}$, as the same procedure can be applied to $\alpha_{1}$. In this step, the center and width of the activation functions are determined by updating $\tilde{\alpha}_{0}$ and $\tilde{\alpha}_{1}$. Consider a minimization problem with the objective function

$\displaystyle\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}_{jm}-\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sigma(\alpha^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})\right)^{2}$

with respect to $\alpha^{j}_{0m}$. Based on the first-order Taylor series approximation of ${\mathsf{f}}$ at $\tilde{\alpha}^{j}_{0m}$ with respect to $\alpha^{j}_{0m}$, we approximately reformulate the above problem as minimizing

$\displaystyle\begin{aligned}&\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}_{jm}-\left(\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sigma(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})+\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sigma^{\prime}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})(\alpha^{j}_{0m}-\tilde{\alpha}^{j}_{0m})\right)\right)^{2}\\ &\quad=\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}-{\mathsf{f}}(x^{i};\tilde{\theta})-\left(\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sigma^{\prime}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})(\alpha^{j}_{0m}-\tilde{\alpha}^{j}_{0m})\right)\right)^{2}\\ &\quad=\frac{1}{2}\sum_{i=1}^{N}\left(r^{i}_{m}-\alpha^{j}_{0m}\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sigma^{\prime}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})\right)^{2}\end{aligned}$

where $r^{i}_{m}=y^{i}-{\mathsf{f}}(x^{i};\tilde{\theta})+\tilde{\alpha}^{j}_{0m}\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sigma^{\prime}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})$ and $\sigma^{\prime}$ denotes the derivative of $\sigma$ with respect to $\alpha^{j}_{0m}$. The reformulated problem is quadratic, and hence is a least squares problem with respect to $\alpha^{j}_{0m}$. The solution is given by

$\displaystyle\frac{\sum_{i=1}^{N}\sigma^{\prime}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})r^{i}_{m}}{\tilde{\beta}^{j}_{m}\tilde{\gamma}_{j}\sum_{i=1}^{N}\sigma^{\prime}(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})^{2}}.$

We sequentially update $\tilde{\alpha}_{0m}^{j}$ for $j=1,\ldots,p$, $m=1,\ldots,M_{j}$.
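A single linearized update of $\alpha^{j}_{0m}$ can be sketched as below. The function names, the value returned for $\sigma^{\prime}$ at the kinks of the hat function, and the fallback when the denominator vanishes are assumptions of this sketch, and the activation is taken to be $1-|z|$ on $[-1,1]$.

```python
import numpy as np

def sigma(z):
    return np.maximum(0.0, 1.0 - np.abs(z))

def sigma_prime(z):
    # Slope of the hat function: +1 on (-1, 0), -1 on (0, 1), 0 outside.
    # At the kinks (z = -1, 0, 1) this returns 0, an arbitrary convention.
    z = np.asarray(z, dtype=float)
    return np.where(np.abs(z) < 1.0, -np.sign(z), 0.0)

def update_alpha0(resid, xj, alpha0, alpha1, beta, gamma_j):
    """One linearized least-squares update of alpha^j_{0m}.

    resid : full residuals y^i - f(x^i; theta_tilde), shape (N,)
    """
    d = sigma_prime(alpha0 + alpha1 * xj)
    c = beta * gamma_j
    denom = c * np.sum(d ** 2)
    if denom == 0.0:
        return alpha0  # assumed fallback: no usable slope information, keep the current value
    r = resid + alpha0 * c * d          # r^i_m in the text
    return float(np.sum(d * r) / denom)

# Toy check: data generated by one node with alpha0 = -0.5, started from alpha0 = -0.6.
xj = np.array([0.1, 0.2, 0.3])
beta, gamma_j, alpha1 = 2.0, 1.0, 1.0
y = beta * gamma_j * sigma(-0.5 + alpha1 * xj)
current_fit = beta * gamma_j * sigma(-0.6 + alpha1 * xj)
new_alpha0 = update_alpha0(y - current_fit, xj, -0.6, alpha1, beta, gamma_j)
```

Because the hat function is piecewise linear, the linearization is exact as long as every observation stays on the same linear piece, which is why this toy example recovers the generating value in one step.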

Component selection step

The given problem in this step is a non-negative garrotte problem [3]. After obtaining the updated $\,{\mathsf{s}}_{j}$ plugged with $\tilde{\beta}^{j}$ , $\tilde{\alpha}^{j}_{0}$ , and $\tilde{\alpha}^{j}_{1}$ , the component selection process is conducted in this step. Consider a minimization problem with an objective function

$\displaystyle\frac{1}{2}\sum_{i=1}^{N}\left(y^{i}_{j}-\gamma_{j}\,{\mathsf{s}}_{j}(x^{i}_{j};\tilde{\beta}^{j},\tilde{\alpha}^{j}_{0},\tilde{\alpha}^{j}_{1})\right)^{2}+\gamma_{j},\quad\text{subject to }\gamma_{j}\geqslant 0$

with respect to $\gamma_{j}$. The solution to the above problem is given by

$\displaystyle\left(\frac{\sum_{i=1}^{N}y^{i}_{j}\,{\mathsf{s}}_{j}(x^{i}_{j};\tilde{\beta}^{j},\tilde{\alpha}^{j}_{0},\tilde{\alpha}^{j}_{1})}{\sum_{i=1}^{N}\,{\mathsf{s}}^{2}_{j}(x^{i}_{j};\tilde{\beta}^{j},\tilde{\alpha}^{j}_{0},\tilde{\alpha}^{j}_{1})}-\frac{1}{\sum_{i=1}^{N}\,{\mathsf{s}}^{2}_{j}(x^{i}_{j};\tilde{\beta}^{j},\tilde{\alpha}^{j}_{0},\tilde{\alpha}^{j}_{1})}\right)_{+}$

where $(\cdot)_{+}=\max(\cdot,0)$. Note that the solution is given by a soft-threshold operator. The amount of shrinkage is inversely proportional to $\sum_{i=1}^{N}\,{\mathsf{s}}_{j}^{2}(x^{i}_{j})$, which indicates that $\tilde{\gamma}_{j}$ is shrunk heavily when $\,{\mathsf{s}}_{j}$ is less important. We sequentially update $\tilde{\gamma}_{j}$ for $j=1,\ldots,p$.
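The closed-form update of $\gamma_{j}$ can be sketched as follows. The function name and the convention for a component that is identically zero on the data are assumptions made for this illustration.

```python
import numpy as np

def update_gamma(y_j, s_j):
    """Non-negative garrotte-type update: ((sum_i y^i_j s_j - 1) / sum_i s_j^2)_+ .

    y_j : partial residuals y^i_j, shape (N,)
    s_j : fitted values s_j(x^i_j) of the j-th component, shape (N,)
    """
    denom = np.sum(s_j ** 2)
    if denom == 0.0:
        return 0.0  # assumed convention when the component vanishes on the data
    return float(max((np.sum(y_j * s_j) - 1.0) / denom, 0.0))

s_j = np.array([1.0, 2.0])
gamma_strong = update_gamma(np.array([2.0, 4.0]), s_j)   # component explains the residuals
gamma_weak = update_gamma(np.array([0.1, 0.1]), s_j)     # weak component is removed
```

A component whose fitted values are only weakly correlated with the partial residuals is set exactly to zero, which is how whole predictors drop out of the model.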
4.2 Optimal complexity parameter

The choice of the tuning parameter is important for the performance of estimators in penalization methods. This section presents a tuning parameter selection strategy for the PANNE. Consider a decreasing sequence $\lambda_{1}>\cdots>\lambda_{L}$, equally spaced on the log scale, of candidate values for the complexity parameter $\lambda$ (here the subscripts index the sequence and are unrelated to the parameters in Eq. (2)). The maximum and minimum values of the sequence, $\lambda_{1}$ and $\lambda_{L}$, need to be determined so that a wide range of predictive models is explored. The value of $\lambda_{L}$ is chosen to be small, say $10^{-8}$. The next step is to derive an upper bound for the complexity parameter beyond which all nodes in the model are inactive, and to set $\lambda_{1}$ to this bound. However, determining the upper bound for the PANNE is difficult, since the design matrices with respect to $\beta$ and $\gamma$ are not fixed. Thus, we compute the upper bound approximately. We confine our attention to the case where all parameters in the model are fixed except for $\beta$. The fixed parameters are set to the initial values described in subsection 4.3. Given data $\{(y^{i},x^{i})\}_{i=1}^{N}$, the value of $\lambda_{1}$ is set to be

$\displaystyle\lambda_{1}=\max_{j=1,\ldots,p}\max_{m=1,\ldots,M_{j}}\bigg{|}\sum_{i=1}^{N}y^{i}\sigma(\tilde{\alpha}^{j}_{0m}+\tilde{\alpha}^{j}_{1m}x^{i}_{j})\bigg{|}.$
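The approximate upper bound and the log-scale sequence can be sketched as follows. The function names, the default grid length `L=50`, and the toy data are assumptions of this illustration; only the formula for $\lambda_{1}$ and the value $10^{-8}$ come from the text.

```python
import numpy as np

def sigma(z):
    # Hat-shaped linear B-spline, assumed to be 1 - |z| on [-1, 1].
    return np.maximum(0.0, 1.0 - np.abs(z))

def lambda_max(X, y, alpha0, alpha1):
    """max over (j, m) of |sum_i y^i sigma(alpha0[j][m] + alpha1[j][m] * x^i_j)|."""
    best = 0.0
    for j in range(X.shape[1]):
        H = sigma(alpha0[j][None, :] + np.outer(X[:, j], alpha1[j]))  # shape (N, M_j)
        best = max(best, float(np.max(np.abs(H.T @ y))))
    return best

def lambda_grid(lam_max, lam_min=1e-8, L=50):
    """Decreasing sequence from lam_max to lam_min, equally spaced on the log scale."""
    return np.geomspace(lam_max, lam_min, num=L)

# Toy data: one predictor, two nodes centered at 0.5 and 0.25.
X = np.array([[0.25], [0.5]])
y = np.array([3.0, -5.0])
alpha0 = [np.array([-2.0, -1.0])]
alpha1 = [np.array([4.0, 4.0])]
lam1 = lambda_max(X, y, alpha0, alpha1)
grid = lambda_grid(lam1, L=5)
```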

Our numerical studies confirm that the approximately computed $\lambda_{1}$ is useful in practice. Given a set of complexity parameters, an optimal complexity parameter is selected based on the Bayesian Information Criterion (BIC), given by

$\displaystyle\text{BIC}_{l}=N\log\ell(\hat{\theta}^{\lambda_{l}})+M^{\lambda_{l}}\log N,\quad l=1,\ldots,L,$

where $M^{\lambda_{l}}$ is the number of active nodes for ${\mathsf{f}}(\,\cdot\,;\hat{\theta}^{\lambda_{l}})$ . An optimal complexity parameter is selected as the one with the minimum BIC value.

To reduce the computational burden of searching over the complexity parameters, we propose a heuristic stopping rule under which the algorithm stops before the whole sequence of $\lambda$ has been searched. The stopping rule is summarized as follows. Let $l_{1}=1$.

(1)
Compute $\text{BIC}_{l_{1}}$ and set $\text{BIC}_{\text{opt}}=\min_{l=1,\ldots,l_{1}}\text{BIC}_{l}$, where $\text{opt}$ denotes the index attaining the minimum.
(2)
If $l_{1}-\text{opt}>s$, where $s$ is a prespecified criterion, stop the search and go to (3). Otherwise, let $l_{1}\leftarrow l_{1}+1$ and go back to (1).
(3)
Select $\lambda_{\text{opt}}$ as the optimal complexity parameter.

Here, $s$ is selected depending on the number of complexity parameters $L$. For example, we put $s=L\times 0.1$. The stopping rule rests on the observation that, along the sequence of $\lambda$, the BIC value begins to increase once it has attained its minimum.
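The BIC-based search with the stopping rule can be sketched as below. The callable `fit`, which is assumed to return the loss and the number of active nodes for a given $\lambda$, is a hypothetical placeholder for the iterative algorithm of Section 4.1, and the toy loss values are made up purely to exercise the rule.

```python
import numpy as np

def bic(loss, n_active, N):
    """BIC_l = N * log(loss) + n_active * log(N); requires loss > 0."""
    return N * np.log(loss) + n_active * np.log(N)

def select_lambda(lambdas, fit, N, s):
    """Scan a decreasing lambda sequence and stop early.

    fit(lam) is a hypothetical callable returning (loss, n_active_nodes).
    Returns the index of the selected lambda and the BIC values computed so far.
    """
    bics = []
    opt = 0
    for l1, lam in enumerate(lambdas):
        loss, n_active = fit(lam)
        bics.append(bic(loss, n_active, N))
        opt = int(np.argmin(bics))
        if l1 - opt > s:  # more than s steps have passed since the best BIC
            break
    return opt, bics

# Toy stand-in for the fitting routine: loss falls and active nodes grow as lambda shrinks.
lambdas = np.geomspace(1.0, 1e-8, 10)
losses = [50.0, 20.0, 10.0, 9.5, 9.4, 9.3, 9.2, 9.1, 9.0, 8.9]
actives = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
results = dict(zip(lambdas, zip(losses, actives)))

opt, bics = select_lambda(lambdas, lambda lam: results[lam], N=100, s=1)
```

With $L=10$ and $s=L\times 0.1=1$, the search in this toy example stops after five of the ten candidate values.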

Figure 3.
The gray solid lines represent 25 initial activation functions.

4.3 Initialization

In general, the performance of neural network methods largely depends on the initialization scheme. Popular schemes include Xavier initialization [4] and He initialization [6]. As these works indicate, the choice of an initialization scheme depends on the type of activation function used. Thus, we need an initialization strategy that takes the B-spline activation function into account. The initialization method presented here is similar to the one for radial basis functions mentioned in [16], but it is specialized for our network structure.

Instead of determining $\alpha_{0}$ and $\alpha_{1}$ directly, we focus on the choice of the centers and widths of the B-spline activation functions. For $j=1,\ldots,p$, we first place the centers of the nodes of the $j$th functional component within the range of the $j$th predictor, so that all nodes are active initially. We then choose the width to be neither too small nor too large, so that the B-spline activation functions locally share their supports with one another. As an example, Fig. 3 shows initial nodes generated by our initialization scheme. For $j=1,\ldots,p$, the strategy for generating initial values is summarized as follows.

(1)
Set $\tilde{\gamma}_{j}=1$ and $\tilde{\beta}^{j}_{m}=0$ , $m=1,\ldots,M_{j}$ .
(2)
Let $c^{j}_{m}$, $m=1,\ldots,M_{j}$ be the quantiles of $\{x^{1}_{j},\ldots,x^{N}_{j}\}$.
(3)
Set $2/\tilde{\alpha}^{j}_{1m}=Nd_{j}/M_{j}$ , $m=1,\ldots,M_{j}$ where $d_{j}$ is the maximum distance between $c^{j}_{m}$ .
(4)
Set $\tilde{\alpha}^{j}_{0m}=-\tilde{\alpha}^{j}_{1m}c^{j}_{m}$ , $m=1,\ldots,M_{j}$ .
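The four steps above can be sketched for a single predictor as follows. Two details are not pinned down by the text and are assumptions of this sketch: the centers are taken to be $M_{j}$ equally spaced sample quantiles, and $d_{j}$ is read as the largest gap between adjacent centers. The function name is hypothetical.

```python
import numpy as np

def init_component(xj, M):
    """Initial (gamma_j, beta^j, alpha0^j, alpha1^j, centers) for one predictor.

    Assumptions of this sketch: centers are M equally spaced sample quantiles,
    and d_j is the largest gap between adjacent centers.
    """
    N = xj.shape[0]
    probs = (np.arange(M) + 0.5) / M
    centers = np.quantile(xj, probs)                        # step (2)
    d = np.max(np.diff(centers)) if M > 1 else xj.max() - xj.min()
    if d == 0.0:
        raise ValueError("degenerate centers: predictor has too many tied values")
    width = N * d / M                                       # step (3): 2 / alpha1 = N d_j / M_j
    alpha1 = np.full(M, 2.0 / width)
    alpha0 = -alpha1 * centers                              # step (4): node m centered at c^j_m
    gamma_j, beta = 1.0, np.zeros(M)                        # step (1)
    return gamma_j, beta, alpha0, alpha1, centers

xj = np.linspace(0.0, 1.0, 100)
gamma_j, beta, alpha0, alpha1, centers = init_component(xj, M=25)
```

By construction $-\tilde{\alpha}^{j}_{0m}/\tilde{\alpha}^{j}_{1m}=c^{j}_{m}$, so every initial node is centered inside the range of the predictor.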

To justify the proposed initialization, we compare it with random initialization and He initialization (uniform and normal). For random initialization, the initial values are generated from a normal distribution with mean zero and variance $1$ or $0.1^{2}$. The experiment is conducted on a simulated dataset using the unpenalized method under the same conditions. We compute the values of the loss $\ell$ over 500 iterations for each case. Figure 4 presents the loss values along the iterations for each method. The results show that the proposed initialization scheme yields consistently small loss values.

Figure 4.
Comparisons between the proposed initialization and random initializations.

We apply a warm start when the algorithm proceeds from $\lambda_{l}$ to $\lambda_{l+1}$. If the estimate $\hat{\gamma}_{j}^{\lambda_{l}}$ at $\lambda_{l}$ is non-zero, $\hat{\gamma}_{j}^{\lambda_{l}}$ and the estimates associated with $\,{\mathsf{s}}_{j}$ are used as the starting values at $\lambda_{l+1}$. Otherwise, the starting values at $\lambda_{l+1}$ for the estimates associated with the $j$th functional component are set to the starting values used at $\lambda_{l}$. Similarly, if the estimate $\hat{\beta}^{j\lambda_{l}}_{m_{j}}$ at $\lambda_{l}$ is non-zero, $\hat{\beta}^{j\lambda_{l}}_{m_{j}}$, $\hat{\alpha}^{j\lambda_{l}}_{0m_{j}}$ and $\hat{\alpha}^{j\lambda_{l}}_{1m_{j}}$ are used as the starting values at $\lambda_{l+1}$. Otherwise, the starting values at $\lambda_{l+1}$ for the estimates associated with the $(j,m_{j})$th node are set to the starting values used at $\lambda_{l}$.
5. Numerical study

5.1 Simulations

We investigate the performance of the proposed method in terms of estimation accuracy and component selection consistency. We compare the PANNE with existing additive regression methods with component selection, namely the PARSE [7], COSSO [8] and the adaptive COSSO (ACOSSO; [11]). The last two methods use the smoothing spline and a penalty functional. The PARSE is an additive regression spline estimator based on total variation and nonnegative garrotte penalties, fitted via a two-stage procedure.

We adopt the mean squared error (MSE) and the maximum deviation (MXDV) as discrepancy measures between the example function $f$ and each of the estimators. The MSE and MXDV provide numerical evaluations of the global and local behavior of the estimators, respectively. The quantities are defined as

$\displaystyle\text{MSE}(g)=\frac{1}{N}\sum_{i=1}^{N}(g(x^{i})-f(x^{i}))^{2}\quad\text{and}\quad\text{MXDV}(g)=\max_{i=1,\ldots,N}{|g(x^{i})-f(x^{i})|}.$
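Both measures are computed from the fitted and true function values at the design points, as in the following sketch. The function names and the toy vectors are illustrative assumptions.

```python
import numpy as np

def mse(g_vals, f_vals):
    """Mean squared error between g(x^i) and f(x^i) over the N design points."""
    return float(np.mean((g_vals - f_vals) ** 2))

def mxdv(g_vals, f_vals):
    """Maximum absolute deviation between g(x^i) and f(x^i)."""
    return float(np.max(np.abs(g_vals - f_vals)))

g_vals = np.array([1.0, 2.0, 4.0])
f_vals = np.array([1.0, 1.0, 1.0])
global_error = mse(g_vals, f_vals)
local_error = mxdv(g_vals, f_vals)
```

The MSE averages the error over all points, whereas the MXDV is driven by the single worst point, which is why the two are read as global and local measures.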

To evaluate the component selection consistency, we compute the average number of relevant variables selected in the model (AverR) and average number of irrelevant variables selected in the model (AverI) as defined in [14].

For the comparison, we consider two simulation examples. A set of response observations is generated by

$\displaystyle y^{i}=f_{1}(x_{1}^{i})+\cdots+f_{p}(x_{p}^{i})+\varepsilon^{i},\quad i=1,\ldots,N,$ (4)

where, for $i=1,\ldots,N$, each of the predictors $x_{1}^{i},\ldots,x_{p}^{i}$ is generated from the uniform distribution on [0, 1], and $\varepsilon^{i}$ is generated from a normal distribution with mean zero and finite variance $\sigma^{2}$. Every scenario considered is repeated 100 times to reduce the influence of random sampling.

The initial number of nodes in each functional component of the PANNE is set to $M_{j}=N/4$, $j=1,\ldots,p$.

5.1.1 Example 1

We first consider a simulated data example, which has also been considered by [2], with response observations generated from Eq. (4) where $\sigma^{2}=0.6$ and the functional component functions are given by, for $t\in[0,1]$ ,

• $f_{1}(t)=40t$
• $f_{2}(t)=8(2t-1)^{2}$
• $f_{3}(t)=32\frac{\sin(2\pi t)}{2-\sin(2\pi t)}$
• $f_{4}(t)=\frac{240}{9}[0.1\sin(2\pi t)+0.2\cos(2\pi t)+0.3\sin^{2}(2\pi t)+0.4\cos^{3}(2\pi t)+0.5\sin^{3}(2\pi t)]$
• $f_{5}(t)=-8t^{3}$
• $f_{6}(t)=4\cos(2\pi t)$
• $f_{k}(t)=0$, $k=7,\ldots,p$.

In this example, the second, fifth and sixth functional components have a comparatively weaker influence on the response than the others. The example functional components are displayed in Fig. 5. We set $N=100$ and $p=10$, so that there are 6 signal variables and 4 noise variables.
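As a minimal sketch of the data-generating process for Example 1, the following plain-Python code draws one dataset from Eq. (4) with the six components listed above. The names `f4_shape`, `true_f` and `simulate`, as well as the seed, are illustrative assumptions and not part of the original study.

```python
import math
import random


def f4_shape(t):
    """Trigonometric mixture inside the brackets of f_4."""
    s, c = math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)
    return 0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3


# Signal components f_1, ..., f_6; components 7..p are identically zero.
COMPONENTS = [
    lambda t: 40.0 * t,
    lambda t: 8.0 * (2.0 * t - 1.0) ** 2,
    lambda t: 32.0 * math.sin(2 * math.pi * t) / (2.0 - math.sin(2 * math.pi * t)),
    lambda t: 240.0 / 9.0 * f4_shape(t),
    lambda t: -8.0 * t ** 3,
    lambda t: 4.0 * math.cos(2 * math.pi * t),
]


def true_f(x):
    """Additive regression function; zip ignores the trailing noise predictors."""
    return sum(fj(xj) for fj, xj in zip(COMPONENTS, x))


def simulate(n=100, p=10, sigma2=0.6, seed=0):
    """Draw (X, y) following Eq. (4): uniform predictors, Gaussian noise."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)  # the text specifies the variance, not the sd
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    y = [true_f(x) + rng.gauss(0.0, sd) for x in X]
    return X, y


X, y = simulate()
print(len(X), len(X[0]), len(y))
```

Repeating `simulate` with 100 different seeds would mimic the 100 repetitions used for the reported averages.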

Table 1 summarizes the simulation results for the performance measures. The averages of the MSEs and MXDVs over 100 repetitions are reported with their standard errors in parentheses. The results show that the PANNE performs well in terms of all of the performance measures. The PANNE and COSSO perfectly distinguish signal variables from noise variables, while the ACOSSO and PARSE do not exactly distinguish weak signals from noise.
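For readers who wish to reproduce the "average (standard error)" entries, a minimal sketch follows. It assumes that the standard error is the sample standard deviation divided by the square root of the number of repetitions; the source does not spell out this convention, and the function name `mean_and_se` and the input values are hypothetical.

```python
import math
import statistics


def mean_and_se(values):
    """Average and standard error of the mean over repetitions.

    Assumption: SE = sample standard deviation (n - 1 denominator) / sqrt(n).
    """
    avg = statistics.fmean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return avg, se


# Hypothetical per-repetition MSE values, not taken from the study.
per_rep_mse = [1.0, 2.0, 3.0, 4.0]
avg, se = mean_and_se(per_rep_mse)
print(f"{avg:.3f} ({se:.3f})")
```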

Table 1
(Example 1) Average of each criterion over 100 repetitions with standard error in parentheses

	Method	MSE	MXDV	AverR	AverI
$p=10$	PANNE	0.451 (0.007)	1.876 (0.030)	6.00	0.00
	PARSE	0.594 (0.019)	2.200 (0.041)	6.00	0.62
	ACOSSO	0.461 (0.008)	1.819 (0.026)	6.00	0.19
	COSSO	4.667 (0.036)	5.792 (0.073)	6.00	0.00

Figure 5.
(Example 1) Functional components.

Figure 6.
(Example 1) 5th and 95th quantiles for the fitted functional components of PANNE obtained through 100 replications and their bands.

To examine the results for PANNE in detail, we compute the 5th and 95th quantile values at points in each functional component. Figure 6 presents the quantiles and bands drawn between them for each functional component. Figure 6 shows that the fitted functional components for the PANNE can recover the weak signal components as well as strong signal components.
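
The pointwise bands described above can be sketched as follows: for each evaluation point, the fitted values from all replications are collected and their 5th and 95th quantiles are taken. The linear-interpolation quantile rule and the names `quantile` and `pointwise_band` are assumptions made for illustration; the source does not state which quantile convention was used.

```python
import math


def quantile(values, q):
    """Empirical quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] * (1.0 - frac) + s[hi] * frac


def pointwise_band(fits, lower=0.05, upper=0.95):
    """fits[r][k] is the fitted value of replication r at grid point k."""
    n_grid = len(fits[0])
    columns = [[fit[k] for fit in fits] for k in range(n_grid)]
    return ([quantile(col, lower) for col in columns],
            [quantile(col, upper) for col in columns])


# Hypothetical fitted curves: 101 replications evaluated at two grid points.
fits = [[float(r), 2.0 * r] for r in range(101)]
low, high = pointwise_band(fits)
print(low, high)
```

A narrow band around a nonzero curve indicates that the component is recovered stably across replications.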
5.1.2 Example 2

We consider a more complicated example in which response observations are generated from Eq. (4), where $\sigma^{2}=0.8^{2}$ and the functional components are given by, for $t\in[0,1]$,

• $f_{1}(t)=3\sin(2.5\cos(3t)e^{2.2t})$
• $f_{2}(t)=2\sin(3.5t^{2})$
• $f_{3}(t)=8[t-0.5]_{+}$
• $f_{4}(t)=3[0.1\sin(2\pi t)+0.2\cos(2\pi t)+0.3\sin^{2}(2\pi t)+0.4\cos^{3}(2\pi t)+0.5\sin^{3}(2\pi t)]$
• $f_{k}(t)=0$, $k=5,\ldots,p$.

The example functional components, shown in Fig. 7, have various characteristics. The first and second functional components are sine functions with spatially inhomogeneous smoothness. The third functional component has a sparse structure: it vanishes on [0, 0.5], where the third predictor has no influence on the response. The fourth component is the same as the fourth in Example 1 except for a scale difference. We consider two scenarios with $p=10$ and $p=20$, which contain 6 and 16 noise variables, respectively. For each scenario, we consider $N=200$, 400 and 800.
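To make the sparse structure of the third component concrete, the sketch below codes the four signal components of Example 2 and lists the grid points at which $f_{3}$ is exactly zero. The helper names (`positive_part`, `signal`) and the grid are illustrative assumptions.

```python
import math


def positive_part(u):
    """[u]_+ = max(u, 0)."""
    return max(u, 0.0)


def f1(t):
    return 3.0 * math.sin(2.5 * math.cos(3.0 * t) * math.exp(2.2 * t))


def f2(t):
    return 2.0 * math.sin(3.5 * t ** 2)


def f3(t):
    return 8.0 * positive_part(t - 0.5)


def f4(t):
    s, c = math.sin(2 * math.pi * t), math.cos(2 * math.pi * t)
    return 3.0 * (0.1 * s + 0.2 * c + 0.3 * s ** 2 + 0.4 * c ** 3 + 0.5 * s ** 3)


SIGNAL = (f1, f2, f3, f4)


def signal(x):
    """Noise-free response; predictors beyond the fourth contribute nothing."""
    return sum(fj(xj) for fj, xj in zip(SIGNAL, x))


grid = [i / 20 for i in range(21)]
flat = [t for t in grid if f3(t) == 0.0]
print(min(flat), max(flat))
```

An estimator that adapts to this structure should place no active nodes for the third predictor on the flat region.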

Tables 2 and 3 summarize the simulation results for $p=10$ and $p=20$, respectively. The results show that all of the methods perform well in terms of component selection for $p=10$. For $p=20$, the ACOSSO and COSSO methods do not perfectly distinguish signal variables from noise variables when $N=200$. The PANNE attains the lowest MSE in every setting and the lowest MXDV in all settings except $p=10$, $N=400$, where the PARSE is marginally lower. In order to examine the global and local behavior of the PANNE in detail, we present the fitted functional components with the median MSE and its active nodes (multiplied by the corresponding weights) in Figs. 8 and 9, respectively. The PANNE seems to perform well in estimating the example functional components by recovering the global and local trends simultaneously. It adapts to the inhomogeneous structure of $f_{1}$ without compromising the global fit. Moreover, it detects the sparse structure of $f_{3}$ by allowing the nodes to be properly located. This is attributed to the data-adaptive node selection on each functional component. On the other hand, the other methods, which are based on splines, cannot exactly detect the sparse region.

Table 2
(Example 2) Average of each criterion over 100 repetitions with standard error in parentheses for $p=10$

	Method	MSE	MXDV	AverR	AverI
$N=200$	PANNE	0.229 (0.004)	1.575 (0.030)	4.00	0.00
	PARSE	0.441 (0.016)	2.162 (0.052)	4.00	0.00
	ACOSSO	0.745 (0.021)	2.735 (0.037)	3.99	0.01
	COSSO	1.064 (0.024)	3.165 (0.041)	4.00	0.00
$N=400$	PANNE	0.136 (0.003)	1.312 (0.024)	4.00	0.00
	PARSE	0.149 (0.004)	1.299 (0.023)	4.00	0.00
	ACOSSO	0.196 (0.005)	1.581 (0.029)	4.00	0.00
	COSSO	0.317 (0.010)	2.011 (0.041)	4.00	0.00
$N=800$	PANNE	0.084 (0.002)	1.204 (0.024)	4.00	0.00
	PARSE	0.094 (0.002)	1.230 (0.021)	4.00	0.00
	ACOSSO	0.106 (0.003)	1.292 (0.025)	4.00	0.00
	COSSO	0.128 (0.003)	1.427 (0.028)	4.00	0.00

Figure 7.
(Example 2) Functional components.

Table 3
(Example 2) Average of each criterion over 100 repetitions with standard error in parentheses for $p=20$

	Method	MSE	MXDV	AverR	AverI
$N=200$	PANNE	0.244 (0.012)	1.575 (0.030)	3.99	0.00
	PARSE	0.768 (0.038)	2.886 (0.064)	4.00	0.00
	ACOSSO	1.976 (0.055)	3.832 (0.047)	3.87	0.18
	COSSO	2.429 (0.049)	4.287 (0.051)	3.92	0.18
$N=400$	PANNE	0.136 (0.003)	1.297 (0.026)	4.00	0.00
	PARSE	0.282 (0.007)	2.100 (0.031)	4.00	0.00
	ACOSSO	0.368 (0.007)	2.150 (0.033)	4.00	0.00
	COSSO	1.023 (0.014)	3.185 (0.025)	4.00	0.00
$N=800$	PANNE	0.095 (0.002)	1.214 (0.021)	4.00	0.00
	PARSE	0.118 (0.004)	1.282 (0.022)	4.00	0.00
	ACOSSO	0.129 (0.003)	1.408 (0.024)	4.00	0.00
	COSSO	0.164 (0.002)	1.604 (0.023)	4.00	0.00

Figure 8.
Fitted functional components for PANNE (black lines) and data points (gray points).

Figure 9.
Active nodes (multiplied by the corresponding weights) for the $j$ th predictor.

5.2 Application

Figure 10.
The number of nodes (left plot) and BIC values (right plot) along a $\lambda$ sequence. The black solid line and black dotted line represent the BIC value with complete search along a $\lambda$ sequence and with the stopping rule, respectively. The red point and blue dashed line describe the optimal BIC values with and without the complete search, respectively.

Figure 11.
Fitted functional components of water and slag for the PANNE (red lines) and partial residuals (gray dots).

We apply the proposed method to the concrete dataset with 103 observations ( $n=103$ ) available at http://networkrepository.com [10]. The dataset was collected from a concrete slump flow test, which aims to measure the workability of fresh concrete, and consists of 10 continuous variables including the ingredients used to make high-performance concrete (HPC) and measurements that determine the workability of the resulting HPC; see [18] for details of the dataset. To develop an appropriate model, we follow the approach of [2], where the slump flow is used as the response and the predictors are cement, fine aggregate, coarse aggregate, water, fly ash, slag, and superplasticizer ( $p=8$ ). The predictors are ingredients used to make HPC and the response is a measurement of the workability of HPC. The goal of this analysis is to identify which ingredients are influential in determining the workability of HPC.

For the PANNE, the initial number of nodes in each functional component is set to 25 and the number of complexity parameters is set to 100. In Fig. 10, we present the number of nodes (the left plot) and the BIC values (the right plot) along a $\lambda$ sequence by the gray dotted line. The black solid lines represent the number of nodes and the BIC values obtained using the stopping rule. The red point and blue dashed line on the right plot describe the optimal BIC values with and without the complete search, respectively. The stopping rule and the maximum value $\lambda_{1}$, mentioned in Subsection 4.2, appear to work as intended. The optimal model found with the stopping rule coincides with the one found by the complete search. Moreover, all nodes are inactive at $\lambda_{1}$, so a wide range of models is searched.

The PANNE selects water and slag as signal variables with high influence on the slump flow. This is consistent with the result of the analysis conducted in [2]. To examine the effects of water and slag on the slump flow, the corresponding fitted functional components are shown in Fig. 11, in which the partial residuals are displayed by the gray points. Overall, the slump flow tends to increase as the amount of water increases. This result coincides with the findings of [2]. However, the PANNE additionally captures the local characteristics around [190, 200] based on the data. For the slag variable, the functional component increases at first and then decreases as slag becomes larger than about 70. The residual sum of squares (RSS) is computed as in [2]. The $\text{RSS}/n$ for the PANNE is 125.1901. This value is lower than the one obtained by the method of [2], although the selection results are the same.

6. Conclusion

In this paper, we developed a penalized additive regression estimator based on a neural network structure and a hierarchical lasso penalty. The flexible architecture of the neural network model and the accountability of the additive model were combined. We used a B-spline activation function with local support to capture local trends in the data and to improve the stability of the optimization algorithm. An iterative algorithm based on a coordinate descent updating process was devised. A specialized initialization scheme for the B-spline activation function was described. The simulation results showed that the proposed estimator performs well compared to the existing methods. Adaptation in node placement was shown to be an outstanding advantage of the proposed method. It enables the proposed estimator to capture the sparse structure as well as the inhomogeneous structure of the functional components based on data.

A possible extension of the proposed method is to incorporate pairwise interaction effects between predictors into the additive neural network model. We expect that this extension would improve interpretability as well as prediction accuracy. It requires the development of a corresponding penalization to reduce model complexity. We leave this for future research.

Acknowledgments

The research of Ja-Yong Koo was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2018R1D1A1B07049972) and by a Korea University Grant (K2109351). The research of Kwan-Young Bak was supported by the Basic Science Research Program through NRF funded by the Ministry of Education, Science and Technology (RS-2022-00165581).

Appendix A. Proof of Lemma 1

Denote the criterion Eq. (2) by $\ell_{1}^{\lambda_{1},\lambda_{2}}(\theta)$ . Fix $\lambda_{1}$ and $\lambda_{2}$ , and let $\lambda=\lambda_{1}\lambda_{2}$ . Define $\alpha=(\alpha_{0},\alpha_{1})$ . Throughout this section, we let ${|\cdot|}$ denote the $\ell_{1}$ -norm of a vector.

Let $(\bar{\gamma},\bar{\beta},\bar{\alpha})$ be a local minimizer of Eq. (2). We first show that $(\hat{\gamma}=\lambda_{1}\bar{\gamma},\hat{\beta}=\bar{\beta}/\lambda_{1},\hat{\alpha}=\bar{\alpha})$ is a local minimizer of Eq. (3). Note that since $(\bar{\gamma},\bar{\beta},\bar{\alpha})$ is a local minimizer, there exists $\delta>0$ such that if $\gamma^{\prime}$, $\beta^{\prime}$ and $\alpha^{\prime}$ satisfy ${|\gamma^{\prime}-\bar{\gamma}|}+{|\beta^{\prime}-\bar{\beta}|}+{|\alpha^{\prime}-\bar{\alpha}|}<\delta$, then we have

$\displaystyle\ell_{1}^{\lambda_{1},\lambda_{2}}(\bar{\gamma},\bar{\beta},\bar{\alpha})\leqslant\ell_{1}^{\lambda_{1},\lambda_{2}}(\gamma^{\prime},\beta^{\prime},\alpha^{\prime}).$

Choose $\delta^{\prime}$ such that $\frac{\delta^{\prime}}{\min(\lambda_{1},1/\lambda_{1})}\leqslant\delta$. For any $(\gamma^{\prime\prime},\beta^{\prime\prime},\alpha^{\prime\prime})$ satisfying ${|\gamma^{\prime\prime}-\hat{\gamma}|}+{|\beta^{\prime\prime}-\hat{\beta}|}+{|\alpha^{\prime\prime}-\hat{\alpha}|}<\delta^{\prime}$, we have

$\displaystyle{|\gamma^{\prime\prime}/\lambda_{1}-\bar{\gamma}|}+{|\lambda_{1}\beta^{\prime\prime}-\bar{\beta}|}+{|\alpha^{\prime\prime}-\bar{\alpha}|}$
$\displaystyle\leqslant\frac{\lambda_{1}{|\gamma^{\prime\prime}/\lambda_{1}-\bar{\gamma}|}+1/\lambda_{1}{|\lambda_{1}\beta^{\prime\prime}-\bar{\beta}|}}{\min(\lambda_{1},1/\lambda_{1})}+\frac{{|\alpha^{\prime\prime}-\bar{\alpha}|}}{\min(\lambda_{1},1/\lambda_{1})}$
$\displaystyle=\frac{{|\gamma^{\prime\prime}-\hat{\gamma}|}+{|\beta^{\prime\prime}-\hat{\beta}|}}{\min(\lambda_{1},1/\lambda_{1})}+\frac{{|\alpha^{\prime\prime}-\hat{\alpha}|}}{\min(\lambda_{1},1/\lambda_{1})}<\frac{\delta^{\prime}}{\min(\lambda_{1},1/\lambda_{1})}\leqslant\delta.$

It follows from the above inequality and

$\displaystyle\ell^{\lambda}(\lambda_{1}\gamma,\beta/\lambda_{1},\alpha)=\ell_{1}^{\lambda_{1},\lambda_{2}}(\gamma,\beta,\alpha)$

that

$\displaystyle\ell^{\lambda}(\hat{\gamma},\hat{\beta},\hat{\alpha})=\ell_{1}^{\lambda_{1},\lambda_{2}}(\bar{\gamma},\bar{\beta},\bar{\alpha})\leqslant\ell_{1}^{\lambda_{1},\lambda_{2}}(\gamma^{\prime\prime}/\lambda_{1},\lambda_{1}\beta^{\prime\prime},\alpha^{\prime\prime})=\ell^{\lambda}(\gamma^{\prime\prime},\beta^{\prime\prime},\alpha^{\prime\prime}).$

Therefore, $(\hat{\gamma},\hat{\beta},\hat{\alpha})$ is a local minimizer of Eq. (3).
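
The first inequality in the chain used for this direction rests on an elementary bound that may be worth writing out. The shorthand $m$, $a$, $b$, $c$ below is introduced here for illustration only and is not notation from the paper; the argument assumes $\lambda_{1}>0$.

```latex
% Shorthand (assumed, not from the paper): m = \min(\lambda_1, 1/\lambda_1), with \lambda_1 > 0.
% Since \lambda_1 \cdot (1/\lambda_1) = 1, the smaller of the two factors is at most 1:
m \leqslant 1, \qquad m \leqslant \lambda_{1}, \qquad m \leqslant 1/\lambda_{1}.
% Hence, for any nonnegative numbers a, b, c,
a + b + c \;\leqslant\; \frac{\lambda_{1}\, a + (1/\lambda_{1})\, b}{m} + \frac{c}{m}.
% Taking a = |\gamma''/\lambda_1 - \bar{\gamma}|, b = |\lambda_1 \beta'' - \bar{\beta}|,
% c = |\alpha'' - \bar{\alpha}| and using
\lambda_{1}\, a = |\gamma'' - \hat{\gamma}|, \qquad (1/\lambda_{1})\, b = |\beta'' - \hat{\beta}|
% yields the displayed chain.
```

The converse direction uses the same bound with the roles of $\lambda_{1}$ and $1/\lambda_{1}$ exchanged.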

Let $(\hat{\gamma},\hat{\beta},\hat{\alpha})$ be a local minimizer of Eq. (3). We would like to show that $(\bar{\gamma}=\hat{\gamma}/\lambda_{1},\bar{\beta}=\lambda_{1}\hat{\beta},\bar{\alpha}=\hat{\alpha})$ is a local minimizer of Eq. (2). Note that since $(\hat{\gamma},\hat{\beta},\hat{\alpha})$ is a local minimizer, there exists $\delta>0$ such that if $\gamma^{\prime}$, $\beta^{\prime}$, and $\alpha^{\prime}$ satisfy ${|\gamma^{\prime}-\hat{\gamma}|}+{|\beta^{\prime}-\hat{\beta}|}+{|\alpha^{\prime}-\hat{\alpha}|}<\delta$, then we have

$\displaystyle\ell^{\lambda}(\hat{\gamma},\hat{\beta},\hat{\alpha})\leqslant\ell^{\lambda}(\gamma^{\prime},\beta^{\prime},\alpha^{\prime}).$

Choose $\delta^{\prime}$ such that $\frac{\delta^{\prime}}{\min(\lambda_{1},1/\lambda_{1})}\leqslant\delta$. For any $(\gamma^{\prime\prime},\beta^{\prime\prime},\alpha^{\prime\prime})$ satisfying ${|\gamma^{\prime\prime}-\bar{\gamma}|}+{|\beta^{\prime\prime}-\bar{\beta}|}+{|\alpha^{\prime\prime}-\bar{\alpha}|}<\delta^{\prime}$, we have

$\displaystyle{|\gamma^{\prime\prime}\lambda_{1}-\hat{\gamma}|}+{|\beta^{\prime\prime}/\lambda_{1}-\hat{\beta}|}+{|\alpha^{\prime\prime}-\hat{\alpha}|}$
$\displaystyle\leqslant\frac{1/\lambda_{1}{|\gamma^{\prime\prime}\lambda_{1}-\hat{\gamma}|}+\lambda_{1}{|\beta^{\prime\prime}/\lambda_{1}-\hat{\beta}|}}{\min(\lambda_{1},1/\lambda_{1})}+\frac{{|\alpha^{\prime\prime}-\hat{\alpha}|}}{\min(\lambda_{1},1/\lambda_{1})}$
$\displaystyle=\frac{{|\gamma^{\prime\prime}-\bar{\gamma}|}+{|\beta^{\prime\prime}-\bar{\beta}|}}{\min(\lambda_{1},1/\lambda_{1})}+\frac{{|\alpha^{\prime\prime}-\bar{\alpha}|}}{\min(\lambda_{1},1/\lambda_{1})}<\frac{\delta^{\prime}}{\min(\lambda_{1},1/\lambda_{1})}\leqslant\delta.$

Then we have

$\displaystyle\ell_{1}^{\lambda_{1},\lambda_{2}}(\bar{\gamma},\bar{\beta},\bar{\alpha})=\ell^{\lambda}(\hat{\gamma},\hat{\beta},\hat{\alpha})\leqslant\ell^{\lambda}(\lambda_{1}\gamma^{\prime\prime},\beta^{\prime\prime}/\lambda_{1},\alpha^{\prime\prime})=\ell_{1}^{\lambda_{1},\lambda_{2}}(\gamma^{\prime\prime},\beta^{\prime\prime},\alpha^{\prime\prime}).$

Therefore, $(\bar{\gamma},\bar{\beta},\bar{\alpha})$ is a local minimizer of Eq. (2).

References

1. Agarwal, Frosst, Zhang, Caruana and Hinton G.E., Neural additive models: Interpretable machine learning with neural nets, arXiv preprint arXiv:2004.13912, 2020.

2. Antoniadis, Gijbels and Verhasselt, Variable selection in additive models using p-splines, Technometrics 54(4) (2012), 425–438.

3. Breiman, Better subset regression using the nonnegative garrote, Technometrics 37(4) (1995), 373–384.

4. Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 249–256, JMLR Workshop and Conference Proceedings, 2010.

5. Hastie and Tibshirani, Generalized additive models, Wiley Online Library, 1990.

6. Zhang, Ren and Sun, Delving deep into rectifiers: Surpassing human-level performance on imagenet classification, in: Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

7. Jhong J.-H., Bak K.-Y., Shin J.-K. and Koo J.-Y., Additive regression splines with total variation and non negative garrote penalties, Communications in Statistics-Theory and Methods, pages 1–35, 2021.

8. Lin, Zhang H.H. et al., Component selection and smoothing in multivariate nonparametric regression, The Annals of Statistics 34(5) (2006), 2272–2297.

9. Potts W.J., Generalized additive neural networks, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 194–200, 1999.

10. Rossi R.A. and Ahmed N.K., The network data repository with interactive graph analytics and visualization, in: AAAI, 2015.

11. Storlie C.B., Bondell H.D., Reich B.J. and Zhang H.H., Surface estimation, variable selection, and the nonparametric oracle property, Statistica Sinica 21(2) (2011), 679.

12. Tibshirani, Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B (Methodological) 58(1) (1996), 267–288.

13. Tsybakov A.B., Introduction to nonparametric estimation, Springer Science & Business Media, 2008.

14. Wang and Huang J.Z., Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements, Journal of the American Statistical Association 103(484) (2008), 1556–1569.

15. Wasserman, All of nonparametric statistics, Springer Science & Business Media, 2006.

16. Wang, Zhang and Du K.-L., Using radial basis function networks for function approximation and classification, International Scholarly Research Notices, 2012, 2012.

17. Yang, Zhang and Sudjianto, Gami-net: An explainable neural network based on generalized additive models with structured interactions, Pattern Recognition 120 (2021), 108192.

18. Yeh I.-C., Modeling slump flow of concrete using second-order regressions and artificial neural networks, Cement and Concrete Composites 29(6) (2007), 474–480.

19. Yuan and Lin, Model selection and estimation in regression with grouped variables, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1) (2006), 49–67.

20. Zhou and Zhu, Group variable selection via a hierarchical lasso and its oracle property, Statistics and Its Interface 3(4) (2010), 557–574.