Smoothing spline in multivariable semiparametric regression through fully Bayesian approach

Abstract

A multivariable semiparametric regression model combines parametric and nonparametric regression: the parametric component follows a polynomial pattern, while the nonparametric component has no predetermined pattern and can be fitted with a smoothing spline function. This research aims to develop a multivariable semiparametric regression model for cross-sectional data through a fully Bayesian approach, and to apply it to the Open Unemployment Rate (OUR) in East Java Province, Indonesia. For the OUR data, which are modelled on macroeconomic predictors, the fitted model has a linear parametric component and a nonparametric component given by cubic smoothing spline functions. The parametric component includes the percentage of population with higher education and regional minimum wage. The nonparametric component includes economic growth, population density, and the large-sized and medium-sized industries ratio. In conclusion, smoothing spline modelling with the fully Bayesian approach performs better than with the Bayesian approach.

Keywords

Fully Bayesian, Gibbs sampling, MCMC, multivariable semiparametric regression, smoothing spline, open unemployment rate

1. Introduction

To date, spline function estimators in the multivariable semiparametric regression model have generally taken three forms: spline regression (including truncated spline, cubic spline, and B-spline), penalized spline (P-spline), and smoothing spline. Unlike smoothing spline regression, spline regression and penalized spline require extra care in determining the number of knots and their locations. A smoothing spline is obtained by optimizing a criterion that adds a goodness-of-fit term to a penalty on the roughness of the curve.

A number of studies have addressed spline regression and penalized spline in the multivariable semiparametric regression model using a Bayesian approach, for instance Smith and Kohn (1996), Wong and Kohn (1996), Li (2000), Smith et al. (2000), Kandala et al. (2001), Panagiotelis and Smith (2008), and Ryu et al. (2009). Kandala et al. (2002), Jerak and Wagner (2003), Lang and Brezger (2004), Crainiceanu et al. (2005), Nott (2006), Costa (2008), Marley and Wand (2010), and Shen (2011) applied P-splines using a Bayesian approach. Wang (2011), Du et al. (2012) and Diana et al. (2012) applied the penalized least squares method to obtain an estimator of the additive semiparametric regression model with a linear parametric component and smoothing spline functions for the nonparametric components, and used a bootstrap approach to construct the confidence interval.

Krivobokova et al. (2010) developed a P-spline confidence interval by means of a Bayesian approach in the nonparametric regression model. Yang (2008) applied a bootstrap approach to construct the confidence interval of the spline regression function in the additive nonparametric regression model. Wiesenfarth et al. (2010) extended the approach of Krivobokova et al. (2010) to construct a P-spline confidence interval in the additive nonparametric regression model. Wood and Marra (2011) applied a Bayesian approach to construct a P-spline confidence interval in the additive nonparametric regression model for a non-Gaussian response. Diana et al. (2013) developed the smoothing spline in the multivariable semiparametric regression model through a Bayesian approach using simulated data. They also applied the smoothing spline in the multivariable semiparametric regression model with a Bayesian approach to OUR data (Diana et al., 2014).

It can therefore be inferred that spline regression and P-splines in the additive semiparametric regression model have been studied extensively with both classical and Bayesian approaches. Research on smoothing splines in the additive semiparametric regression model, however, has mostly used the classical approach; studies of smoothing splines in the multivariable semiparametric regression model that apply the fully Bayesian approach are rare. This study therefore focuses on developing the smoothing spline in the multivariable semiparametric regression model with a polynomial parametric component and an additive nonparametric component without interaction. The model is then applied to the OUR data in East Java, Indonesia.

In Section 2, we introduce the model of multivariable semiparametric regression. We propose estimation of multivariable semiparametric regression model using fully Bayesian approach in Section 3. The data for the implementation of the model and the results are reported in Section 4. Validations of model are reported in Section 5. The paper is concluded with summary and discussion of further work in Section 6.

2. Multivariable semiparametric regression model

Suppose we are given paired data $({\bf x}_{j}^{\ast},{\bf z}_{j}^{\ast},y_{j})$, $j=1,2,\ldots,n$, where ${\bf x}_{j}^{\ast}=(x_{1j},x_{2j},\ldots,x_{pj})$ is a set of predictor variables whose relationship with the response variable $y_{j}$ (the $j$-th observation) follows a polynomial pattern, and ${\bf z}_{j}^{\ast}=(z_{1j},z_{2j},\ldots,z_{qj})$ is a set of other predictor variables whose relationship with $y_{j}$ is unknown. The relationship between the response variable $y_{j}$ and the predictor variables ${\bf x}_{j}^{\ast}$ and ${\bf z}_{j}^{\ast}$ is assumed to follow the multivariable semiparametric regression model:

$y_{j}=h({\bf x}_{j}^{\ast},{\bf z}_{j}^{\ast})+\varepsilon_{j},\quad j=1,2,\ldots,n,$ (1)

where

$h({\bf x}_{j}^{\ast},{\bf z}_{j}^{\ast})=\sum\limits_{i=1}^{p}\left(\sum\limits_{h=0}^{r}\gamma_{hi}x_{ij}^{h}\right)+\sum\limits_{k=1}^{q}f_{k}(z_{kj}).$ (2)

Here $\gamma_{hi}$ are the unknown parameters of the parametric component. The random errors $\varepsilon_{j}$ are assumed to be independent with zero mean and variance $\sigma^{2}$. Equation (2) can be written in vector notation as follows:

$h({\bf x}_{j}^{\ast},{\bf z}_{j}^{\ast})=\sum\limits_{i=1}^{p}{\bf x}_{ij}^{\prime}{\bf\gamma}_{i}+\sum\limits_{k=1}^{q}f_{k}(z_{kj}),$ (3)

where ${{\bf\gamma}}_{i}=(\gamma_{0i},\gamma_{1i},\ldots,\gamma_{ri})^{\prime}$ and ${{\bf{x}}_{ij}^{\prime}}=(1,x_{ij},x_{ij}^{2},\ldots,x_{ij}^{r}),i=1,2,\ldots,p$ .

The regression curve $f_{k}$ is set as:

$f_{k}(z_{kj})=\sum\limits_{v=1}^{m}\alpha_{kv}\phi_{kv}(z_{kj})+\sum\limits_{l=1}^{n}\beta_{l}\theta_{k}\psi_{k}(z_{kj},z_{kl}).$ (4)

$f_{k}$ is a smoothing spline function of order $m$ (Wahba, 1990). Suppose that the parameters are

$\displaystyle{\bf\gamma}=(\gamma_{0},\gamma_{11},\gamma_{21},\ldots,\gamma_{r1},\gamma_{12},\gamma_{22},\ldots,\gamma_{r2},\ldots,\gamma_{1p},\gamma_{2p},\ldots,\gamma_{rp})^{\prime},$

$\displaystyle{\bf\alpha}=(\alpha_{11},\alpha_{12},\ldots,\alpha_{1m},\alpha_{21},\alpha_{22},\ldots,\alpha_{2m},\ldots,\alpha_{q1},\alpha_{q2},\ldots,\alpha_{qm})^{\prime},$

$\displaystyle{\bf\beta}=(\beta_{11},\beta_{12},\ldots,\beta_{1n},\beta_{21},\beta_{22},\ldots,\beta_{2n},\ldots,\beta_{q1},\beta_{q2},\ldots,\beta_{qn})^{\prime},$

${\bf\lambda}=(\lambda_{1},\lambda_{2},\ldots,\lambda_{q})^{\prime}$, ${\bf V}_{{\bf\theta}}=\theta_{1}{\bf V}_{1}+\ldots+\theta_{q}{\bf V}_{q}$ where ${\bf V}_{k}=\{\psi_{k}(z_{kj},z_{kl})\}_{j=1,l=1}^{n,n}$, $k=1,2,\ldots,q$ and

$\displaystyle\psi_{k}(z_{kj},z_{kl})=\int\limits_{a_{k}}^{b_{k}}\frac{(z_{kj}-u)_{+}^{m-1}(z_{kl}-u)_{+}^{m-1}}{[(m-1)!]^{2}}du,\quad(z_{kj}-u)_{+}^{m-1}=\begin{cases}(z_{kj}-u)^{m-1},&z_{kj}-u\geqslant 0\\ 0,&z_{kj}-u<0,\end{cases}$

and ${\bf T}=({\bf T}_{1},\ldots,{\bf T}_{q})$ where ${\bf T}_{k}=\{\phi_{kv}(z_{kj})\}_{j=1,v=1}^{n,m_{k}}$, $k=1,2,\ldots,q$ and $\phi_{kv}(z_{kj})=z_{kj}^{v-1}/(v-1)!$, $v=1,2,\ldots,m_{k}$. Eqs (3) and (4) can then be rewritten in matrix notation:

${\bf h}={\bf X}\bm{\gamma}+\sum\limits_{k=1}^{q}{\bf f}_{k}({\bf z}_{k})\text{ with }\sum\limits_{k=1}^{q}{\bf f}_{k}({\bf z}_{k})=\sum\limits_{k=1}^{q}{\bf T}_{k}\bm{\alpha}_{k}+\sum\limits_{k=1}^{q}{\bf g}_{k}.$
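The basis functions $\phi_{kv}$ and the kernel $\psi_{k}$ defined above can be evaluated directly. The following is a minimal sketch, not the authors' implementation: it assumes $m=2$ (the cubic case used later for the OUR data), a lower integration limit $a_{k}=0$ by default, and predictor values that lie inside $[a_{k},b_{k}]$. The function names `phi`, `psi_cubic`, `psi_numeric` and `kernel_matrix` are hypothetical.

```python
from math import factorial


def phi(v, z):
    """Polynomial basis phi_v(z) = z^(v-1) / (v-1)!, for v = 1, ..., m."""
    return z ** (v - 1) / factorial(v - 1)


def psi_cubic(s, t, a=0.0):
    """Closed form of psi(s, t) for m = 2: integral of (s-u)_+ (t-u)_+ du from a.

    Assumes s and t do not exceed the upper limit b, so b never binds.
    """
    upper = min(s, t)  # the integrand vanishes for u > min(s, t)
    if upper <= a:
        return 0.0

    def antiderivative(u):
        return s * t * u - (s + t) * u ** 2 / 2.0 + u ** 3 / 3.0

    return antiderivative(upper) - antiderivative(a)


def psi_numeric(s, t, m=2, a=0.0, b=1.0, steps=20000):
    """Midpoint-rule approximation of the integral defining psi (for m >= 2)."""
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        u = a + (i + 0.5) * h
        total += max(s - u, 0.0) ** (m - 1) * max(t - u, 0.0) ** (m - 1)
    return total * h / factorial(m - 1) ** 2


def kernel_matrix(z, a=0.0):
    """n x n matrix V_k with entries psi(z_j, z_l) for one predictor (m = 2)."""
    return [[psi_cubic(zj, zl, a) for zl in z] for zj in z]


if __name__ == "__main__":
    z = [0.1, 0.4, 0.9]
    for row in kernel_matrix(z):
        print([round(value, 5) for value in row])
```

The closed form is used for the matrix because it is exact; `psi_numeric` is included only as an independent check of the integral for a given $m$.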

If the response $Y$ is a normally distributed random variable, $Y\sim N(\mu_{[y]},\sigma_{[y]}^{2})$, then the semiparametric regression model under the fully Bayesian approach is built on a normal likelihood for the data.

3. Estimation of multivariable semiparametric regression model using fully Bayesian approach

If the mean of the response vector is

$\displaystyle\bm{\mu}_{[y]}=E({\bf y})=\sum\limits_{i=1}^{p}{\bf X}_{i}\bm{\gamma}_{i}+\sum\limits_{k=1}^{q}{\bf f}_{k}({\bf z}_{k})$

and

$\displaystyle\sum\limits_{k=1}^{q}{\bf f}_{k}({\bf z}_{k})=\sum\limits_{k=1}^{q}{\bf T}_{k}\bm{\alpha}_{k}+\sum\limits_{k=1}^{q}{\bf g}_{k}$

then

$\displaystyle{\bf y}\sim N\left({\bf X}\bm{\gamma}+\sum\limits_{k=1}^{q}{\bf T}_{k}\bm{\alpha}_{k}+\sum\limits_{k=1}^{q}{\bf g}_{k},\sigma_{[y]}^{2}{\bf I}_{n}\right)$

where $\tau_{[y]}=1/\sigma_{[y]}^{2}$ is the precision parameter. The likelihood function of $\bf y$ is

$p({\bf y}|\bm{\gamma},\bm{\alpha},{\bf g},\tau_{[y]})=\frac{\tau_{[y]}^{n/2}}{(2\pi)^{n/2}}\exp\left\{-\frac{\tau_{[y]}}{2}\left({\bf y}-{\bf X}\bm{\gamma}-{\bf T}\bm{\alpha}-{\bf g}\right)^{\prime}\left({\bf y}-{\bf X}\bm{\gamma}-{\bf T}\bm{\alpha}-{\bf g}\right)\right\}.$ (5)

In the fully Bayesian approach, prior distributions must be specified for the parameters $\bm{\gamma},\bm{\alpha},{\bf g}$ and $\tau_{[y]}$. Since $h({\bf x}_{j}^{\ast},{\bf z}_{j}^{\ast})$ has an improper Gaussian prior distribution, namely

$h({\bf x}_{j}^{\ast},{\bf z}_{j}^{\ast})=\sum\limits_{i=1}^{p}{\bf x}_{ij}^{\prime}\bm{\gamma}_{i}+\sum\limits_{k=1}^{q}\left(\sum\limits_{v=1}^{m}\alpha_{kv}\phi_{kv}(z_{kj})+\eta^{1/2}\theta_{k}^{1/2}g_{k}(z_{kj})\right),$

the prior distribution of $\bf g$ is therefore multivariate normal, that is

${\bf g}\sim N\left(\bm{\mu}_{[g]},\frac{\sigma_{[y]}^{2}}{\lambda}{\bf V}_{{\bf\theta}}\right),$

where ${\bf V}_{{\bf\theta}}=\sum\nolimits_{k=1}^{q}\theta_{k}{\bf V}_{k}$ and ${\bf g}=\sum\nolimits_{k=1}^{q}{\bf g}_{k}$. This implies that

${\bf g}_{k}\sim N\left(\bm{\mu}_{[g_{k}]},\frac{\sigma_{[y]}^{2}}{\lambda_{k}}{\bf V}_{k}\right)$

where $\lambda_{k}=\lambda/\theta_{k}$ acts as the smoothing parameter. Since ${\bf V}_{k}$ is a singular matrix, a full-rank parameterization based on the singular value decomposition (SVD) is required. Through the SVD, ${\bf V}_{k}={\bf Q}_{k}{\bf D}_{k}{\bf Q}_{k}^{\prime}$ is obtained, where ${\bf Q}_{k}$ is the matrix of eigenvectors corresponding to the nonzero eigenvalues and ${\bf D}_{k}$ is the diagonal matrix of the nonzero eigenvalues. Suppose that ${\bf g}_{k}={\bf Q}_{k}\bm{\beta}_{k}$; then the prior distribution of the parameter $\bm{\beta}_{k}$ is multivariate normal, that is

$\bm{\beta}_{k}\sim N\left(\bm{\mu}_{[\beta_{k}]},\sigma_{[\beta_{k}]}^{2}{\bf I}\right),$ (6)

where $\sigma_{[\beta_{k}]}^{2}{\bf I}=\frac{\sigma_{[y]}^{2}}{\lambda_{k}}{\bf D}_{k}$ and ${\bf g}=\sum\nolimits_{k=1}^{q}{\bf g}_{k}=\sum\nolimits_{k=1}^{q}{\bf Q}_{k}\bm{\beta}_{k}={\bf Q}\bm{\beta}$. Hence, Eq. (5) becomes:

$p({\bf y}|\bm{\gamma},\bm{\alpha},\bm{\beta},\tau_{[y]})\propto\tau_{[y]}^{n/2}\exp\left\{-\frac{\tau_{[y]}}{2}\left({\bf y}-{\bf X}\bm{\gamma}-{\bf T}\bm{\alpha}-{\bf Q}\bm{\beta}\right)^{\prime}\left({\bf y}-{\bf X}\bm{\gamma}-{\bf T}\bm{\alpha}-{\bf Q}\bm{\beta}\right)\right\}.$ (7)
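As a short check on this reparameterization (a worked step that uses only the definitions above, with $\mathrm{Cov}$ denoting a covariance matrix), the prior on $\bm{\beta}_{k}$ in Eq. (6) reproduces the prior covariance of ${\bf g}_{k}$:

$\mathrm{Cov}({\bf g}_{k})={\bf Q}_{k}\,\mathrm{Cov}(\bm{\beta}_{k})\,{\bf Q}_{k}^{\prime}=\frac{\sigma_{[y]}^{2}}{\lambda_{k}}{\bf Q}_{k}{\bf D}_{k}{\bf Q}_{k}^{\prime}=\frac{\sigma_{[y]}^{2}}{\lambda_{k}}{\bf V}_{k}.$

The prior means are linked in the same way, $E({\bf g}_{k})={\bf Q}_{k}E(\bm{\beta}_{k})$, so the two parameterizations describe the same prior on the function values while $\bm{\beta}_{k}$ has a diagonal, full-rank covariance.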

The prior distributions used for each element of the parameter vectors $\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda}$ and for the parameter $\tau_{[y]}$ are as follows, where the second argument of each normal distribution is a precision:

$\displaystyle\gamma_{hi}\sim N(\mu_{[\gamma]hi},\tau_{[\gamma]hi})\text{ where }\tau_{[\gamma]hi}=1/\sigma_{[\gamma]hi}^{2},$ (8)

$\displaystyle\alpha_{kv}\sim N(\mu_{[\alpha]kv},\tau_{[\alpha]kv})\text{ where }\tau_{[\alpha]kv}=1/\sigma_{[\alpha]kv}^{2},$ (9)

$\displaystyle\beta_{kj}\sim N(\mu_{[\beta_{k}]j},\tau_{[\beta_{k}]jj})\text{ where }\tau_{[\beta_{k}]jj}=1/\sigma_{[\beta_{k}]jj}^{2},$ (10)

$\displaystyle\lambda_{k}\sim\textit{Gamma}(a_{[\lambda]k},b_{[\lambda]k}),$ (11)

$\displaystyle\tau_{[y]}\sim\textit{Gamma}(a_{[\tau_{[y]}]},b_{[\tau_{[y]}]}).$ (12)

Combining the likelihood in Eq. (7) with the prior distributions gives the combined (joint) posterior distribution of all the parameters to be estimated, i.e.:

$p(\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda},\tau_{[y]}|{\bf y})\propto p({\bf y}|\bm{\gamma},\bm{\alpha},\bm{\beta},\tau_{[y]})p(\bm{\gamma})p(\bm{\alpha})p(\bm{\beta})p(\bm{\lambda})p(\tau_{[y]}).$ (13)

Using MCMC and Gibbs Sampling methods, the characteristics of each parameter in combined posterior distribution can be investigated without calculating the marginal functions of the parameters (Ntzoufras, 2009). The full conditional posterior distribution is analysed using the following lemmas.

Lemma 1. Given a multivariable semiparametric regression model that follows Eqs (1) and (2), a random variable $Y\sim N(\mu_{[y]},\sigma_{[y]}^{2})$ with the likelihood function in Eq. (7), and the prior distributions of the parameters in Eqs (8)–(12), the combined posterior distribution of the multivariable semiparametric regression model with the fully Bayesian approach can be represented as:

$\displaystyle p(\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda},\tau_{[y]}|{\bf y})\propto\left(\tau_{[y]}^{n/2}\right){\rm A}_{y}\left(\prod\limits_{i=1}^{p}\prod\limits_{h=0}^{r}\tau_{[\gamma]hi}^{1/2}\right){\rm B}_{\gamma}\left(\prod\limits_{k=1}^{q}\prod\limits_{v=1}^{m}\tau_{[\alpha]kv}^{1/2}\right){\rm C}_{\alpha}$
$\displaystyle\quad\times\left(\prod\limits_{k=1}^{q}\prod\limits_{j=1}^{n}\left(\tau_{[y]}\lambda_{k}d_{[k]jj}^{-1}\right)^{1/2}\right){\rm D}_{\beta}\left(\prod\limits_{k=1}^{q}\lambda_{k}^{a_{[\lambda]k}-1}{\rm E}_{\lambda}\right)\left(\tau_{[y]}^{a_{[\tau_{[y]}]}-1}\right){\rm F}_{\tau_{[y]}},$ (14)

where

$\displaystyle{\rm A}_{y}=\exp\left\{-\frac{\tau_{[y]}}{2}\left({\bf y}-{\bf X}\bm{\gamma}-{\bf T}\bm{\alpha}-{\bf Q}\bm{\beta}\right)^{\prime}\left({\bf y}-{\bf X}\bm{\gamma}-{\bf T}\bm{\alpha}-{\bf Q}\bm{\beta}\right)\right\},$

$\displaystyle{\rm B}_{\gamma}=\exp\left\{-\frac{1}{2}\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}\tau_{[\gamma]hi}(\gamma_{hi}-\mu_{[\gamma]hi})^{2}\right\},$

$\displaystyle{\rm C}_{\alpha}=\exp\left\{-\frac{1}{2}\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\tau_{[\alpha]kv}(\alpha_{kv}-\mu_{[\alpha]kv})^{2}\right\},$

$\displaystyle{\rm D}_{\beta}=\exp\left\{-\frac{1}{2}\sum\limits_{k=1}^{q}\tau_{[y]}\lambda_{k}\left(\sum\limits_{j=1}^{n}d_{[k]jj}^{-1}\left(\beta_{kj}-\mu_{[\beta_{k}]j}\right)^{2}\right)\right\},$

$\displaystyle{\rm E}_{\lambda}=\exp\left\{-b_{[\lambda]k}\lambda_{k}\right\}\text{ and }{\rm F}_{\tau_{[y]}}=\exp\left\{-b_{[\tau_{[y]}]}\tau_{[y]}\right\}.$

Proof. Using the prior distributions in Eqs (8)–(12), the combined posterior distribution can be written as:

$\displaystyle p(\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda},\tau_{[y]}|{\bf y})\propto p({\bf y}|\bm{\gamma},\bm{\alpha},\bm{\beta},\tau_{[y]})p(\bm{\gamma})p(\bm{\alpha})p(\bm{\beta})p(\bm{\lambda})p(\tau_{[y]})$
$\displaystyle\quad\propto\prod\limits_{j=1}^{n}f(y_{j}|\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda},\tau_{[y]})\prod\limits_{i=1}^{p}\prod\limits_{h=0}^{r}p(\gamma_{hi})\prod\limits_{k=1}^{q}\prod\limits_{v=1}^{m}p(\alpha_{kv})\prod\limits_{k=1}^{q}\prod\limits_{j=1}^{n}p(\beta_{kj})\prod\limits_{k=1}^{q}p(\lambda_{k})p(\tau_{[y]})$
$\displaystyle\quad\propto\left(\prod\limits_{j=1}^{n}\tau_{[y]}^{1/2}\exp\left\{-\frac{\tau_{[y]}}{2}\left(y_{j}-\sum\limits_{i=1}^{p}{\bf x}_{ij}^{\prime}\bm{\gamma}_{i}-\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\alpha_{kv}\phi_{kv}(z_{kj})-\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{n}\beta_{kj}Q_{k}(z_{kj},z_{kl})\right)^{2}\right\}\right)$
$\displaystyle\quad\times\left(\prod\limits_{i=1}^{p}\prod\limits_{h=0}^{r}\tau_{[\gamma]hi}^{1/2}\exp\left\{-\frac{\tau_{[\gamma]hi}}{2}(\gamma_{hi}-\mu_{[\gamma]hi})^{2}\right\}\right)\left(\prod\limits_{k=1}^{q}\prod\limits_{v=1}^{m}\tau_{[\alpha]kv}^{1/2}\exp\left\{-\frac{\tau_{[\alpha]kv}}{2}(\alpha_{kv}-\mu_{[\alpha]kv})^{2}\right\}\right)$
$\displaystyle\quad\times\left(\prod\limits_{k=1}^{q}\prod\limits_{j=1}^{n}\left(\tau_{[y]}\lambda_{k}d_{[k]jj}^{-1}\right)^{1/2}\exp\left\{-\frac{\tau_{[y]}\lambda_{k}d_{[k]jj}^{-1}}{2}\left(\beta_{kj}-\mu_{[\beta_{k}]j}\right)^{2}\right\}\right)$
$\displaystyle\quad\times\left(\prod\limits_{k=1}^{q}\lambda_{k}^{a_{[\lambda]k}-1}\exp\left\{-b_{[\lambda]k}\lambda_{k}\right\}\right)\left(\tau_{[y]}^{a_{[\tau_{[y]}]}-1}\exp\left\{-b_{[\tau_{[y]}]}\tau_{[y]}\right\}\right).$

Collecting the exponential terms and using $\prod_{j=1}^{n}\tau_{[y]}^{1/2}=\tau_{[y]}^{n/2}$ gives Eq. (14).

The form of full conditional posterior distributions of the multivariable semiparametric regression model is given below.

The full conditional posterior distribution for $\gamma_{hi}$ is

$\gamma_{hi}|{\bf y},\bm{\gamma}_{\backslash hi},\bm{\alpha},\bm{\beta},\bm{\lambda},\tau_{[y]}\sim N\left(B_{\gamma 1}^{-1}B_{\gamma 2},B_{\gamma 1}\right),$ (15)

where

$\displaystyle B_{\gamma 1}=\left(\tau_{[y]}\sum\limits_{j=1}^{n}\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}(x_{ij}^{h})^{2}+\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}\tau_{[\gamma]hi}\right)\text{ and}$

$\displaystyle B_{\gamma 2}=\tau_{[y]}\sum\limits_{j=1}^{n}\left(\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}x_{ij}^{h}y_{j}-\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}x_{ij}^{h}\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\alpha_{kv}\phi_{kv}(z_{kj})-\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}x_{ij}^{h}\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{m}\beta_{kj}q_{[k]lj}\right)+\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}\tau_{[\gamma]hi}\mu_{[\gamma]hi}.$

The full conditional posterior distribution for $\alpha_{kv}$ is

$\alpha_{kv}|{\bf y},\bm{\gamma},\bm{\alpha}_{\backslash kv},\bm{\beta},\bm{\lambda},\tau_{[y]}\sim N\left(C_{\alpha 1}^{-1}C_{\alpha 2},C_{\alpha 1}\right),$ (16)

where

$\displaystyle C_{\alpha 1}=\left(\tau_{[y]}\sum\limits_{j=1}^{n}\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}(\phi_{kv}(z_{kj}))^{2}+\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\tau_{[\alpha]kv}\right)\text{ and}$

$\displaystyle C_{\alpha 2}=\tau_{[y]}\sum\limits_{j=1}^{n}\left(\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\phi_{kv}(z_{kj})y_{j}-\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}\gamma_{hi}x_{ij}^{h}\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\phi_{kv}(z_{kj})-\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\phi_{kv}(z_{kj})\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{m}\beta_{kj}\left(q_{[k]lj}(z_{kj})\right)\right)+\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\tau_{[\alpha]kv}\mu_{[\alpha]kv}.$

The full conditional posterior distribution for $\beta_{kj}$ is

$\beta_{kj}|{\bf y},\bm{\gamma},\bm{\alpha},\bm{\beta}_{\backslash kj},\bm{\lambda},\tau_{[y]}\sim N\left(D_{\beta 1}^{-1}D_{\beta 2},\tau_{[y]}D_{\beta 1}\right),$ (17)

where

$\displaystyle D_{\beta 1}=\sum\limits_{j=1}^{n}\left(\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{n}\left(q_{[k]lj}(z_{kj})\right)^{2}+\sum\limits_{k=1}^{q}\lambda_{k}d_{[k]jj}^{-1}\right)\text{ and}$

$\displaystyle D_{\beta 2}=\sum\limits_{j=1}^{n}\left(\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{n}\left(q_{[k]lj}(z_{kj})\right)y_{j}-\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}\gamma_{hi}x_{ij}^{h}\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{m}\left(q_{[k]lj}(z_{kj})\right)-\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\alpha_{kv}\phi_{kv}(z_{kj})\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{m}\left(q_{[k]lj}(z_{kj})\right)+\sum\limits_{k=1}^{q}\lambda_{k}d_{[k]jj}^{-1}\mu_{[\beta_{k}]j}\right).$

The full conditional posterior distribution for $\lambda_{k}$ is

$\lambda_{k}|{\bf y},\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda}_{\backslash k},\tau_{[y]}\sim\textit{Gamma}\left(a_{[\lambda]k}+\frac{n}{2},\left[b_{[\lambda]k}+\frac{1}{2}\tau_{[y]}\left(\sum\limits_{j=1}^{n}d_{[k]jj}^{-1}(\beta_{kj}-\mu_{[\beta_{k}]j})^{2}\right)\right]\right).$ (18)

The full conditional posterior for $\tau_{[y]}$ is

$\tau_{[y]}|{\bf y},\bm{\gamma},\bm{\alpha},\bm{\beta},\bm{\lambda}\sim\textit{Gamma}\left(a_{[\tau_{[y]}]}+n,\left[b_{[\tau_{[y]}]}+\frac{1}{2}A_{\tau}\right]+\frac{1}{2}\sum\limits_{k=1}^{q}\lambda_{k}\left(\sum\limits_{j=1}^{n}d_{[k]jj}^{-1}(\beta_{kj}-\mu_{[\beta_{k}]j})^{2}\right)\right),$ (19)

where

$A_{\tau}=\sum\limits_{j=1}^{n}\left(y_{j}-\sum\limits_{i=1}^{p}\sum\limits_{h=0}^{r}\gamma_{hi}x_{ij}^{h}-\sum\limits_{k=1}^{q}\sum\limits_{v=1}^{m}\alpha_{kv}\phi_{kv}(z_{kj})-\sum\limits_{k=1}^{q}\sum\limits_{l=1}^{n}\beta_{kj}q_{[k]lj}(z_{kj})\right)^{2}.$

The parameters of the multivariable semiparametric regression model with the fully Bayesian approach were estimated iteratively in WinBUGS 1.4 using MCMC with the Gibbs sampling algorithm. In each iteration the parameters are updated as a Markov chain, with every parameter drawn from its full conditional distribution.

3.1 Algorithm procedure for MCMC with Gibbs sampling

The algorithm generates a sample of each parameter to be estimated over $B$ iterations, using the full conditional posterior distributions. For each iteration $b=1,2,\ldots,B$:

i) Generate $\gamma_{hi}$ using Eq. (15).
ii) Generate $\alpha_{kv}$ using Eq. (16).
iii) Generate $\beta_{kj}$ using Eq. (17).
iv) Generate $\lambda_{k}$ using Eq. (18).
v) Generate $\tau_{[y]}$ using Eq. (19).
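The structure of these steps can be illustrated on a much smaller problem. The sketch below is not the WinBUGS model of this paper: it is a toy Gibbs sampler, written under stated assumptions, for a simple linear regression $y_{j}=\gamma_{0}+\gamma_{1}x_{j}+\varepsilon_{j}$ with normal priors in precision form (as in Eq. (8)) and a gamma prior on $\tau_{[y]}$ (as in Eq. (12)). The spline coefficients and smoothing parameters are omitted, and the function names and default hyperparameter values are hypothetical.

```python
import random


def gibbs_linear(x, y, n_iter=3000, burn_in=1000, seed=1,
                 prior_mean=0.0, prior_prec=1e-4, a=0.01, b=0.01):
    """Toy Gibbs sampler for y_j = g0 + g1 * x_j + e_j, e_j ~ N(0, 1/tau).

    Priors: g0, g1 ~ N(prior_mean, precision prior_prec); tau ~ Gamma(a, rate b).
    Returns the list of (g0, g1, tau) draws kept after burn-in.
    """
    rng = random.Random(seed)
    n = len(y)
    g = [0.0, 0.0]                   # current values of (g0, g1)
    tau = 1.0                        # current value of the error precision
    cols = [[1.0] * n, list(x)]      # design columns: intercept and x
    draws = []
    for it in range(n_iter):
        # Draw each coefficient from its normal full conditional.
        for h in (0, 1):
            other = 1 - h
            # Partial residual: remove the contribution of the other coefficient.
            resid = [y[j] - g[other] * cols[other][j] for j in range(n)]
            prec = tau * sum(c * c for c in cols[h]) + prior_prec
            mean = (tau * sum(c * r for c, r in zip(cols[h], resid))
                    + prior_prec * prior_mean) / prec
            g[h] = rng.gauss(mean, prec ** -0.5)
        # Draw the precision from its gamma full conditional.
        ssr = sum((y[j] - g[0] - g[1] * x[j]) ** 2 for j in range(n))
        rate = b + 0.5 * ssr
        tau = rng.gammavariate(a + 0.5 * n, 1.0 / rate)  # second argument is a scale
        if it >= burn_in:
            draws.append((g[0], g[1], tau))
    return draws


def posterior_means(draws):
    """Posterior means of (g0, g1, tau) from the retained draws."""
    m = len(draws)
    return tuple(sum(d[i] for d in draws) / m for i in range(3))


if __name__ == "__main__":
    data_rng = random.Random(0)
    xs = [data_rng.uniform(-1, 1) for _ in range(200)]
    ys = [1.0 + 2.0 * xi + data_rng.gauss(0, 0.5) for xi in xs]
    print(posterior_means(gibbs_linear(xs, ys)))
```

Each coefficient is updated one at a time given the current values of all the others, which mirrors the element-wise updates in steps i) to v); a full implementation of the model in this paper would add the corresponding draws for $\alpha_{kv}$, $\beta_{kj}$ and $\lambda_{k}$.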

4. The implementation of the multivariable semiparametric regression model with fully Bayesian approach on OUR data

In this study, the multivariable semiparametric regression model with the fully Bayesian approach was applied to the OUR data in East Java Province, Indonesia, in 2011 as the response variable ($y$). The observed data consisted of 38 regencies/cities. The predictor variables are the percentage of population with higher education, economic growth, population density, investment per labour ratio, regional minimum wage, and large-sized and medium-sized industries ratio. In the model applied to the OUR data, the parametric component, which consists of the percentage of population with higher education and regional minimum wage, is a linear function ($r=1$), and the nonparametric component, which consists of economic growth, population density, investment per labour ratio, and large-sized and medium-sized industries ratio, is a cubic smoothing spline function ($m=2$).

The estimation was carried out using MCMC with the Gibbs sampling algorithm, with 60,000 iterations, a thinning interval of 20 and 50,000 burn-in iterations, so that 10,000 samples were used to estimate the characteristics of the parameters. For these 10,000 samples, the estimation results satisfy the MCMC criteria, i.e. the chain is irreducible, aperiodic, and recurrent, as can be seen in Table 1.

Table 1
MCMC diagnostic plots for the estimation of parameter $\gamma$: history, autocorrelation, quantile and density plots for Gama1, Gama2 and Gama3

Based on the MCMC diagnostic plots for the estimation of parameter ${\bf\gamma}$, it can be concluded that the parameter estimation process satisfies the Markov chain properties of being strongly ergodic, that is, irreducible and recurrent. The quantile plot for the estimation of parameter ${\bf\gamma}$ (see Table 1) shows that the ergodic mean of the parameter estimates is stable within the credible interval, which implies that the iterations have converged. The density plot agrees with the prior distribution used for parameter ${\bf\gamma}$, namely the normal distribution.

Table 2

Estimation values of the parametric components for OUR data and their credible intervals as well as smoothing parameter using fully Bayesian approach

Parameter	Estimation value	Standard deviation	MC error	Credible interval
$\gamma_{0}$	1.9230000	0.0235700	0.0008843	1.8760000 $\leqslant\gamma_{0}\leqslant$ 1.9690000
$\gamma_{11}$	$-$ 0.0410000	0.0004077	0.0000105	$-$ 0.0417800 $\leqslant\gamma_{11}\leqslant$ $-$ 0.0401800
$\gamma_{12}$	0.0021270	0.0000203	0.0000006	0.0020880 $\leqslant\gamma_{12}\leqslant$ 0.0021670
$\lambda_{1}$	2.4390000	1.7810000	0.0529200	0.2789000 $\leqslant\lambda_{1}\leqslant$ 7.0810000
$\lambda_{2}$	616.2000000	12.9400000	0.1309000	590.9000000 $\leqslant\lambda_{2}\leqslant$ 641.5000000
$\lambda_{3}$	373.0000000	0.0312200	0.0003000	372.9000000 $\leqslant\lambda_{3}\leqslant$ 373.1000000
$\lambda_{4}$	6.1890000	0.1321000	0.0011930	5.9360000 $\leqslant\lambda_{4}\leqslant$ 6.4520000

The MC error of every estimated parameter is smaller than the posterior standard deviation, which means that the parameter estimates are acceptable. The posterior estimates and credible intervals of the parametric component and the smoothing parameters for the OUR data are given in Table 2. The estimates and credible intervals of the smoothing spline function for the nonparametric component are given in Table 3.
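The comparison between MC error and posterior standard deviation can be made explicit from the values reported in Table 2. The short sketch below computes the ratio for each parameter; the 5% threshold is a commonly quoted rule of thumb for WinBUGS output and is an assumption here, not a criterion stated in this paper. The names `table2` and `mc_error_ratios` are hypothetical.

```python
# (parameter, posterior standard deviation, MC error) as reported in Table 2
table2 = [
    ("gamma_0", 0.0235700, 0.0008843),
    ("gamma_11", 0.0004077, 0.0000105),
    ("gamma_12", 0.0000203, 0.0000006),
    ("lambda_1", 1.7810000, 0.0529200),
    ("lambda_2", 12.9400000, 0.1309000),
    ("lambda_3", 0.0312200, 0.0003000),
    ("lambda_4", 0.1321000, 0.0011930),
]


def mc_error_ratios(rows):
    """Ratio of MC error to posterior standard deviation for each parameter."""
    return {name: mc_error / sd for name, sd, mc_error in rows}


ratios = mc_error_ratios(table2)
for name, ratio in ratios.items():
    # Assumed rule of thumb: MC error below about 5% of the posterior SD.
    status = "ok" if ratio < 0.05 else "check"
    print(f"{name}: MC error / SD = {ratio:.4f} ({status})")
```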

Table 3

Estimation values of nonparametric component function for OUR data and their credible intervals using the fully Bayesian approach

Regency/city	Mean	2.50%	97.50%	Regency/city	Mean	2.50%	97.50%
01. Pacitan	$-$ 0.1826	$-$ 0.2331	$-$ 0.1316	20. Magetan	0.5994	0.5482	0.6503
02. Ponorogo	1.6810	1.6300	1.7320	21. Ngawi	1.2430	1.1920	1.2940
03. Trenggalek	0.2537	0.2030	0.3037	22. Bojonegoro	0.9375	0.8844	0.9916
04. Tulungagung	0.7987	0.7476	0.8497	23. Tuban	0.7385	0.6829	0.7935
05. Blitar	0.6906	0.6395	0.7421	24. Lamongan	1.3470	1.2920	1.4030
06. Kediri	1.3980	1.3420	1.4540	25. Gresik	1.1590	1.0970	1.2220
07. Malang	1.0300	0.9718	1.0890	26. Bangkalan	0.5318	0.4790	0.5846
08. Lumajang	$-$ 0.3151	$-$ 0.3657	$-$ 0.2645	27. Sampang	0.6762	0.6258	0.7262
09. Jember	0.7072	0.6530	0.7615	28. Pamekasan	$-$ 0.4420	$-$ 0.4967	$-$ 0.3880
10. Banyuwangi	0.5582	0.5044	0.6114	29. Sumenep	0.4904	0.4386	0.5425
11. Bondowoso	$-$ 0.1662	$-$ 0.2170	$-$ 0.1158	71. Kediri	2.4900	2.4290	2.5530
12. Situbondo	1.7670	1.7160	1.8180	72. Blitar	2.1750	2.1180	2.2320
13. Probolinggo	$-$ 0.0536	$-$ 0.1055	$-$ 0.0008	73. Malang	2.5120	2.4480	2.5760
14. Pasuruan	1.1110	1.0510	1.1700	74. Probolinggo	2.0410	1.9870	2.0970
15. Sidoarjo	1.9490	1.8850	2.0140	75. Pasuruan	2.2390	2.1810	2.2980
16. Mojokerto	0.8842	0.8236	0.9443	76. Mojokerto	3.7640	3.7050	3.8230
17. Jombang	1.2720	1.2180	1.3270	77. Madiun	3.3760	3.3160	3.4360
18. Nganjuk	2.0110	1.9590	2.0610	78. Surabaya	2.4520	2.3870	2.5180
19. Madiun	0.6783	0.6269	0.7295	79. Batu	1.2550	1.1950	1.3160

Based on the estimation results of the fully Bayesian modelling, the deviance, MSE and RMSE values can be calculated to assess the feasibility of the model; they are given in Table 4.

Table 4

The feasibility of the OUR model by means of the fully Bayesian approach

OUR model	Deviance	MSE	RMSE
Fully Bayesian approach	$-298.2$	$4.6080\times 10^{-5}$	$6.7882\times 10^{-3}$
Bayesian approach (Diana et al., 2013)	$-152.6$	$2.5089\times 10^{-4}$	$1.5839\times 10^{-2}$

Table 4 shows that the OUR model with the fully Bayesian approach has the smallest deviance, MSE and RMSE. This indicates that, for the OUR data, the fully Bayesian approach performs better than the Bayesian approach.
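The MSE and RMSE in Table 4 are linked by $\textit{RMSE}=\sqrt{\textit{MSE}}$. The following sketch defines both measures in their usual form and checks that the reported pairs agree to the printed precision; the function names and the tolerance are illustrative assumptions, and the observed and fitted values themselves are not reproduced here.

```python
from math import isclose, sqrt


def mse(observed, fitted):
    """Mean squared error between observed and fitted values."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed)


def rmse(observed, fitted):
    """Root mean squared error, the square root of the MSE."""
    return sqrt(mse(observed, fitted))


# (MSE, RMSE) pairs as reported in Table 4
reported = {
    "Fully Bayesian approach": (4.6080e-5, 6.7882e-3),
    "Bayesian approach": (2.5089e-4, 1.5839e-2),
}

# Check that each reported RMSE matches the square root of the reported MSE.
consistent = {
    name: isclose(sqrt(m), r, rel_tol=1e-3) for name, (m, r) in reported.items()
}
print(consistent)
```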

Figure 1.

Observed and estimated data plot for OUR in (a) 2012 and (b) 2013 using the model obtained from OUR in 2011.

5. Model validation

The model was validated by means of cross-validation: the multivariable semiparametric regression model obtained from the 2011 OUR data was used to predict the OUR data in 2012 and 2013. The Mean Square Error of Prediction (MSEP) for the OUR data is 0.2909 in 2012 and 0.2583 in 2013, and the Root Mean Square Error of Prediction (RMSEP) is 0.5394 in 2012 and 0.5082 in 2013, which supports the feasibility of the model. The OUR estimates for 2012 and 2013 obtained from the multivariable semiparametric regression model with the fully Bayesian approach are shown in Fig. 1, where the estimated OUR and the observed OUR in 2012 and 2013 are very close.

To evaluate the performance of the model, the Kolmogorov-Smirnov test was used to assess the similarity between the patterns of the predicted and the observed data. The Kolmogorov-Smirnov test on the predicted and observed OUR data in 2012 and 2013 gives a $p$-value of 0.897. This means that the predicted and observed OUR data in 2012 show the same pattern, and so do the predicted and observed OUR data in 2013. In other words, the multivariable semiparametric regression model with the fully Bayesian approach fitted to the 2011 OUR data can be used to predict the OUR data in 2012 and 2013.
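A comparison of this kind can be sketched as a two-sample Kolmogorov-Smirnov test. The paper does not state which variant of the test or which software was used, so the code below is an illustration under that assumption: it computes the statistic $D$ as the largest gap between the two empirical distribution functions and an approximate $p$-value from the asymptotic Kolmogorov distribution with a standard small-sample correction. The function name `ks_two_sample` is hypothetical.

```python
from math import exp, sqrt


def ks_two_sample(sample1, sample2):
    """Return (D, approximate p-value) for the two-sample KS test."""
    a, b = sorted(sample1), sorted(sample2)
    n1, n2 = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk through the pooled values and track the gap between the two ECDFs.
    while i < n1 and j < n2:
        v = min(a[i], b[j])
        while i < n1 and a[i] <= v:
            i += 1
        while j < n2 and b[j] <= v:
            j += 1
        d = max(d, abs(i / n1 - j / n2))
    # Asymptotic p-value with an effective sample size correction.
    ne = n1 * n2 / (n1 + n2)
    lam = (sqrt(ne) + 0.12 + 0.11 / sqrt(ne)) * d
    if lam < 0.2:
        return d, 1.0  # the series below is numerically 1 in this range
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


if __name__ == "__main__":
    predicted = [4.1, 3.8, 5.2, 4.6, 3.9, 4.4]   # made-up illustrative values
    observed = [4.0, 3.9, 5.0, 4.8, 3.7, 4.5]    # made-up illustrative values
    print(ks_two_sample(predicted, observed))
```

A large $p$-value, as reported above, means that the test finds no evidence that the predicted and observed values come from different distributions.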

6. Discussion and conclusion

The estimation of the parameters of the parametric and nonparametric components, together with the smoothing parameters, in the multivariable semiparametric regression model with the fully Bayesian approach has been applied to OUR data in East Java Province, Indonesia. Samples were drawn iteratively from the full conditional posterior distributions using MCMC with Gibbs sampling, and the credible intervals were obtained by means of the Highest Posterior Density (HPD). Applying the multivariable semiparametric regression model with the fully Bayesian approach to the 2011 OUR data in East Java yields the following model:

$\widehat{\textit{OUR}}=1.923-0.041(\textit{TP})+0.002127(\textit{UMR})+\hat{f}(\textit{PE})+\hat{f}(\textit{KP})+\hat{f}(I)+\hat{f}(U)$

where the parametric component, which includes the percentage of population with higher education (TP) and regional minimum wage (UMR), is a linear function with $r=1$. The nonparametric component, which consists of economic growth (PE), population density (KP), investment per labour ratio (I), and large-sized and medium-sized industries ratio (U), is a cubic smoothing spline function with $m=2$.
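The fitted model above can be evaluated as the sum of its linear parametric part and the estimated nonparametric part. The sketch below uses the coefficients reported in the model equation; the units of TP and UMR are not specified in this section, so the input values in the example are placeholders, and the function names are hypothetical. The nonparametric contribution is passed in as a single number, for instance an estimate of the kind listed in Table 3.

```python
def parametric_part(tp, umr, g0=1.923, g_tp=-0.041, g_umr=0.002127):
    """Linear parametric part of the fitted OUR model (coefficients from Table 2)."""
    return g0 + g_tp * tp + g_umr * umr


def fitted_our(tp, umr, nonparametric_sum):
    """Fitted OUR: parametric part plus the summed smoothing spline estimates."""
    return parametric_part(tp, umr) + nonparametric_sum


if __name__ == "__main__":
    # Placeholder inputs: tp and umr are illustrative values in unspecified units.
    print(fitted_our(tp=10.0, umr=100.0, nonparametric_sum=0.5))
```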

Based on the feasibility measures for the OUR data, the deviance, MSE and RMSE of the OUR model are smaller with the fully Bayesian approach than with the Bayesian approach. This indicates that the fully Bayesian approach is preferable. Moreover, when the model fitted to the 2011 OUR data was applied to the OUR data in 2012 and 2013, it remained valid. The results reported in this study can therefore serve as a preliminary step for modelling OUR data, and can be developed further by making use of longitudinal data.

Acknowledgments

We would like to thank the Indonesian Central Bureau of Statistics (BPS) for providing financial support.

Table 1
MCMC diagnostic plot for parameter $\gamma$ estimation