A multivariable semiparametric regression model is a combination of parametric and nonparametric regressions, the parametric component of which refers to polynomial patterns, while its nonparametric component does not have certain pattern. This nonparametric component can be fitted with smoothing spline function. This reearch is aimed to to develop a multivariable semiparametric regression model through fully Bayesian approach for cross-sectional data. The development is meant to be implemented in analysing Open Unemployment Rate (OUR) in East Java Province, Indonesia. The result applying the model in estimating the Open Unemployment Rate (OUR) in East Java province reveals that multivariable semiparametric regression model with parametric component, based on macro economy, corresponds to linear and the nonparametric components corresponds to cubic smoothing spline function using fully Bayesian approach. The parametric component includes the percentage of population with higher education and regional minimum wage. The nonparametric component includes economic growth, population density, large-sized and medium-sized industries ratio. In conclusion, the smoothing spline modeling using fully Bayesian approach shows better performance than using Bayesian approach.
Up to now, spline function estimations in multivariable semiparametric regression model have, in general, three forms, i.e. spline regression (including truncated spline, cubic spline, and B-spline), penalized spline (P-spline) and smoothing spline. The use of spline regression and penalized spline, unlike smoothing spline regression, require extra circumspection in determining the number of knots and their locations. A smoothing spline is constructed due to the addition of goodness of fit and the smoothness of curve (penalty).
There have been a number of studies concerning spline regression and penalized spline in multivariable semiparametric regression model using Bayesian approach, for instance by Smith and Kohn (1996), Wong and Kohn (1996), Li (2000), Smith et al. (2000), Kandala et al. (2001), Panagiotelis and Smith (2008), and Ryu et al. (2009). Kandala et al. (2002), Jerak and Wagner (2003), Lang and Brezger (2004), Crainiceanu et al. (2005), Nott (2006), Costa (2008), Marley and Wand (2010), and Shen (2011) applying P-spline using Bayesian approach. Wang (2011), Du et al. (2012) and Diana et al. (2012) applying penalized least square method to obtain additive semiparametric regression model estimator with linear function parametric and smoothing spline functions respectively for the nonparametric components. And they used bootstrap approach to construct the confidence interval.
Krivobokova et al. (2010) developed P-spline confidence interval by means of Bayesian approach in nonparametric regression model. Yang (2008) applied bootstrap approach to construct confidence interval of spline regression function in additive nonparametric regression model. Wiesenfarth et al. (2010) developed Krivobokova et al. (2010) approach to make P-spline confidence interval in additive nonparametric regression model. Wood and Marra (2011) applied Bayesian approach to construct P-spline confidence interval in additive nonparametric regression model for non-Gaussian response. Diana et al. (2013) developed smoothing spline in multivariable semiparametric regression model trough Bayesian approach using simulation data. She also implemented smoothing spline in multivariable semiparametric regression model using Bayesian approach making use of OUR data (Diana et al., 2014).
Therefore, it can be implied that many researches on spline regression and P-spline in additive semiparametric regression model either with classical or Bayesian approach have been extensively conducted. However, many researches on smoothing spline in additive semiparametric regression model have commonly been done through classical approach. In other words, rarely has a study on smoothing spline in multivariable semiparametric regression model applied the fully Bayesian approach. Thus, this study focuses on developing smoothing spline in multivariable semiparametric regression model along with the polinomial parametric and aditive nonparametric component without an interaction. The model was then further implemented in analysing the data of OUR in East Java, Indonesia.
In Section 2, we introduce the model of multivariable semiparametric regression. We propose estimation of multivariable semiparametric regression model using fully Bayesian approach in Section 3. The data for the implementation of the model and the results are reported in Section 4. Validations of model are reported in Section 5. The paper is concluded with summary and discussion of further work in Section 6.
Multivariable semiparametric regression model
Given pairing data , , with are predictor variables whose patterns links to response variable (-th observation) follow polynomial patterns and is a set of other predictor variables, the connection pattern of which with response variable is unknown. The relationship between response variable and predictor variables and is assumed to follow the multivariable semiparametric regression model:
where
is the unknown parametric component parameter. Random error is assumed to be linearly independent with zero mean and variance Equation (2) can be written in vector notation as follows:
where and .
The regression curve is set as:
is constitutes smoothing spline function with degree (Wahba, 1990). Suppose that the parameter
where and
and where and , Eqs (3) and (4) can be then rewritten in matrix notation:
If is the random variable which is the response data and has normal distribution or then the semiparametric regression model with fully Bayesian approach using the data will always be having normal distribution.
Estimation of multivariable semiparametric regression model using fully Bayesian approach
If
and
then
where is the precision parameter. The likelihood function of is
In fully Bayesian approach, it is important that prior distribution be determined for parameters Since has prior improper Gaussian distribution which is
the prior distribution is therefore normal multivariate distribution, that is
where and Implying that
where behaves as the smoothing parameter. Since is a singular matrix, a full rank parameterization with singular value decomposition (SVD) approach is required. Through SVD, will be obtained, where is a matrix from eigen vectors which corresponds to the non zero eigen values, and is a diagonal matrix from the non zero eigen values. Suppose that then prior distribution for parameter , is Normal Multivariat or
The prior distributions used for each element of parameter vector are as follow:
Combination of likelihood Eq. (7) and prior distribution will produce combined posterior distribution function of all parameters that will be estimated, i.e.:
Using MCMC and Gibbs Sampling methods, the characteristics of each parameter in combined posterior distribution can be investigated without calculating the marginal functions of the parameters (Ntzoufras, 2009). The full conditional posterior distribution is analysed using the following lemmas.
Lemma 1. If given a multivariable semiparametric regression model that follows Eqs (1) and (2), random variable with likelihood function as depicted in Eq. (7), and prior distribution to each of parameters as given in Eqs (8)–(12), then combined posterior distribution of the multivariable semiparametric regression model with fully Bayesian approach can be represented as:
where
Proof Using prior distributio, Eqs (8)–(12), the combined posterior distribution can be given as:
The form of full conditional posterior distributions of the multivariable semiparametric regression model is given below.
The full conditional posterior distribution for is
where
The full conditional posterior distribution for is
where
The full conditional posterior distribution for is
where
The full conditional posterior distribution for is
The full conditional posterior for is
where
The estimation process of parameters of the multivariable semiparametric regression model with fully Bayesian approach was iteratively computationally conducted by means of WinBUGS 1.4 via MCMC with Gibbs Sampling algorithm. The parameters were estimated following Markov Chain process in the iteration. The process went full conditional for all parameters.
Algorthm procedure for MCMC with Gibbs sampling
The algorithm was done by generating sample within the parameter with B iterations that will be estimated using full conditional posterior distribution function. Suppose that iteration is ,
The implementation of the multivariable semiparametric regression model with fully Bayesian approach on OUR data
In this study, the multivariable semiparametric regression model with fully Bayesian approach was applied to OUR data in East Java Province, Indonesia in 2011 as the responce variable (). The observed data consisted of 38 regencies/cities. And the used predictor variables are the percentage of population with higher education, economic growth, population density, investation per labour ratio, regional minimum wage, and large-sized and medium-sized industries ratio. The implementation of Multivariable semiparametric regression model with fully Bayesian approach to OUR data with the parametric component; the percentage of population with higher education and regional minimum wage, was a linear function ( 1), and with the nonparametric component; economic growth, population density, investation per labour ratio, and large-sized and medium-sized industries ratio, was a cubic smooting spline function ( 2).
The estimation process was done using MCMC and Gibbs sampling algoritm with 60,000 times of iterations, 20 thins, 50,000 burn-in iteration, thus, 10,000 sampling data were used for estimating the characteristic of parameter. From those 10,000 samples, the result of estimation has met the MCMC criteria, i.e. irreducible, aperiodic, and recurrent, as can be seen in Table 1.
MCMC diagnostic plot for parameter estimation
Plot
Gama1
Gama2
Gama3
History
Autocorrelation
Quantile
Density
Based on the result of MMC Diagnostic plot for parameter estimation, it can concluded that the parameter estimation process has met Markov Chain characters which are strongly ergodic; irreducible and recurrent. Plot Quantile for parameter estimation (see Table 1) shows that mean ergodic value resulting from the parameter estimation has been stable within credible interval. It can Imply that the iteration is convergent. The density plot has concordance with the prior distribution used for parameter that is normal distribution.
Estimation values of the parametric components for OUR data and their credible intervals as well as smoothing parameter using fully Bayesian approach
Parameter
Estimation value
Standar deviation
MC error
Credible interval
1.9230000
0.0235700
0.0008843
1.8760000 1.9690000
0.0410000
0.0004077
0.0000105
0.0417800 0.0401800
0.0021270
0.0000203
0.0000006
0.0020880 0.0021670
2.4390000
1.7810000
0.0529200
0.2789000 7.0810000
616.2000000
12.9400000
0.1309000
590.9000000 641.5000000
373.0000000
0.0312200
0.0003000
372.9000000 373.1000000
6.1890000
0.1321000
0.0011930
5.9360000 6.4520000
The value of MC error for every estimated parameter is smaller than that of posterior deviation standard, which means that the parameter estimation is acceptable. The estimation results of poserior and credible interval for parametric component and smoothing parameter for OUR data can be respectively seen in Table 2. The estimation and credible interval of smoothing spline function for the nonparametric component can be seen in Table 3.
Estimation values of nonparametric component function for OUR data and their credible intervals using fully Bayesian
Regency/city
Mean
2.50%
97.50%
Regency/city
Mean
2.50%
97.50%
01. Pacitan
0.1826
0.2331
0.1316
20. Magetan
0.5994
0.5482
0.6503
02. Ponorogo
1.6810
1.6300
1.7320
21. Ngawi
1.2430
1.1920
1.2940
03. Trenggalek
0.2537
0.2030
0.3037
22. Bojonegoro
0.9375
0.8844
0.9916
04. Tulungagung
0.7987
0.7476
0.8497
23. Tuban
0.7385
0.6829
0.7935
05. Blitar
0.6906
0.6395
0.7421
24. Lamongan
1.3470
1.2920
1.4030
06. Kediri
1.3980
1.3420
1.4540
25. Gresik
1.1590
1.0970
1.2220
07. Malang
1.0300
0.9718
1.0890
26. Bangkalan
0.5318
0.4790
0.5846
08. Lumajang
0.3151
0.3657
0.2645
27. Sampang
0.6762
0.6258
0.7262
09. Jember
0.7072
0.6530
0.7615
28. Pamekasan
0.4420
0.4967
0.3880
10. Banyuwangi
0.5582
0.5044
0.6114
29. Sumenep
0.4904
0.4386
0.5425
11. Bondowoso
0.1662
0.2170
0.1158
71. Kediri
2.4900
2.4290
2.5530
12. Situbondo
1.7670
1.7160
1.8180
72. Blitar
2.1750
2.1180
2.2320
13. Probolinggo
0.0536
0.1055
0.0008
73. Malang
2.5120
2.4480
2.5760
14. Pasuruan
1.1110
1.0510
1.1700
74. Probolinggo
2.0410
1.9870
2.0970
15. Sidoarjo
1.9490
1.8850
2.0140
75. Pasuruan
2.2390
2.1810
2.2980
16. Mojokerto
0.8842
0.8236
0.9443
76. Mojokerto
3.7640
3.7050
3.8230
17. Jombang
1.2720
1.2180
1.3270
77. Madiun
3.3760
3.3160
3.4360
18. Nganjuk
2.0110
1.9590
2.0610
78. Surabaya
2.4520
2.3870
2.5180
19. Madiun
0.6783
0.6269
0.7295
79. Batu
1.2550
1.1950
1.3160
Based on the estimation result with fully Bayesian modeling, the deviance, MSE and RMSE values can be calculated for feasibility study, which are given in Table 4.
The feasibiliy of OUR model by means of fully Bayesian approach
OUR model
Deviance
MSE
RMSE
Fully Bayesian approach
298.2
4.6080 10
6.7882 10
Bayesian approach (Diana et al., 2013)
152.6
2.5089 10
1.5839 10
Table 4 shows that the smallest value of deviance, MSE and RMSE is in OUR model with fully Bayesian approach. It indicates that the fully Bayesian approach results in OUR model performance better than Bayesian approach does modeling OUR data.
Observed and estimated data plot for OUR in (a) 2012 and (b) 2013 using the model obtained from OUR in 2011.
Model validation
Validation of model was executed through evaluation by means of cross validation technique. The cross validation on the multivariable semiparametric regression model obtained from OUR in 2011 was implemented to OUR data prediction in 2012 and 2013. With reference to the validation, the value of The Mean Square Error of Prediction (MSEP) for the OUR data in 2012 and 2013; 0.2909 and 0.2583 each and the Root Mean Square Error of Prediction (RMSEP) in 2012 and 2013 respectively 0.5394 and 0.5082 each, is obtained and can be as feasibility model. The OUR estimation in 2012 and 2013 by means of multivariable semiparametric regression model with fully Bayesian approach is seen in Fig. 1. From Fig. 1, it is clearly seen that the estimated OUR and the observed OUR in 2012 and 2013 are very closely the same.
In order to evaluate the performance of the model, an exploration was statistically conducted by means of Kolmogorov-Smirnov test to see the similarity between the predicted data and the observed data pattern. The Kolmogorov-Smirnov test on predicted as well as observed OUR data in 2012 and 2013 reveals that the -value of 0.897 was obtained. It means that the predicted data of OUR in 2012 and the observed data in 2012 show the same patterns, and so do the OUR predicted and observed data in 2013. In other words, the model of multivariable semiparametric regression with the fully Bayesian approach is still valid to be used for the OUR data in 2011 can be successfully used to predict the OUR data in 2012 and 2013.
Discussion and conclusion
The estimation process of parametric and nonparametric components parameter as well as the estimation of smoothing parameter in multivariable semiparametric regression model with fully Bayesian approach has been applied to OUR data in East Java Province, Indonesia. The sample was taken iteratively through the full conditional posterior distribution with MCMC and Gibbs sampling method. And the credible interval was obtained by means of Highest Posterior Density (HPD). Through the application of multivariable semiparametric regression model with fully Bayesian on the OUR data in East Java year 2011, the following model has come up
where the parametric component which includes the percentage of population with higher education (TP) and regional minimum wage (UMR), is a linear function with 1. Meanwhile, the nonparametric component which consists of economic growth (PE), population density (KP), investation per labour ratio (I), and large-sized and medium-sized industries ratio (U), is a cubic smooting spline function with 2.
Based on the feasibility curve for the OUR data, the value of deviance, MSE and RMSE for OUR model with fully Bayesian is smaller or lower. This indicates that fully Bayesian is better to apply than the Bayesian approach. Even when the OUR modelling result in 2011 was applied to the OUR data in 2012 and 2013, the model are still valid. Therefore, the result reported in this present study can be used as the preliminary step for modeling OUR data and further more can be optimized by making use of longitudinal data.
Footnotes
Acknowledgments
We would like to acknowledge The Indonesian Central Bureau of Statistics (BPS) for providing the financial support.
References
1.
CostaM. J. (2008). Penalized Spline Models and Applications, Ph.D. Dissertation., School of Sciences Statistics Program, University of Warwick, Coventry, UK. Crainiceanu, C., Ruppert, D., and Wand, M. P. (2005), Bayesian analysis for penalized spline regression using WinBUGS. Journal of Statistical Software, 14, 1-24.
2.
DianaR.BudiantaraI. N., & DarmestoS. (2012). Smoothing Spline Estimators in Semiparametric Multivariable Regression Model. Proceedings of International Conference on Mathematics, Statistics and its Applications (ICMSA), Institut Teknologi Sepuluh Nopember, Bali, ISBN: 978-979-96152-7-5.
3.
DianaR.BudiantaraI. N., & DarmestoS. (2013). Smoothing spline in semiparametric additive regression model with Bayesian approach. Journal of Mathematics and Statistics, 9, 161-168. ISSN: 1549-3644, doi: 10.3844/jmssp.2013.161.168.
4.
DianaR.BudiantaraI. N., & DarmestoS. (2014). Statistical modeling for unemployment rate using smoothing spline in semiparametric multivariable regression model with Bayesian approach. Journal of Model Assisted Statistics and Applications, 9, 159-166. ISSN: 1574-1699 (Print), ISSN: 1875-9068 (Online), doi: 10.3233/MAS.130287.
5.
DuP.ChengG., & LiangH. (2012). Semiparametric regression models with additive nonparametric components and high dimensional parametric components. Computational Statistics and Data Analysis, 56, 2006-2017.
6.
JerakA., & WagnerS. (2003). Modeling Probabilities of Patent Oppositions in a Bayesian Semiparametric Regression Framework, Working Paper, Sonderforschungsbereich 386, University of Munich.
7.
KandalaN. S.LangS., & KlasenS. (2001). Semiparametric Analysis of Childhood Undernutrition in Developing Countries, Technical Report 33, University of Munich.
8.
KandalaN. B.LangS.KlasenS., & FahrmeirL. (2002). Semiparametric Analysis of the Socio-Demographic and Spatial Determinants of Undernutrition in Two African Countries, Working Paper, University of Munich, Ludwigstr. 33, Germany.
9.
KrivobokovaT.KneibT., and ClaeskensG. (2010). Simultaneous confidence bands for penalized spline estimators. Technical Report, University Gottingen.
10.
LangS., & BrezgerA. (2004). Bayesian P-splines. Journal of Computational and Graphical Statistics, 13, 183-212.
11.
LiQ. (2000). Efficient estimation of additive partially linear models. International Economic Review, 41, 1073-1092.
12.
MarleyJ. K., & WandM. P. (2010). Non-standard semiparametric regression via brugs. Journal of Statistical Software, 37.
13.
NottD. (2006). Semiparametric estimation of mean and variance functions for non-Gaussian data. Computational Statistics, 21, 603-620.
14.
NtzoufrasI. (2009). Bayesian Modeling Using WinBUGS. Wiley, New Jersey, USA.
15.
PanagiotelisA., & SmithM. (2008). Bayesian identification, selection and estimation of semiparametric functions in high-dimensional additive models. Journal of Econometrics, 143, 291-316.
16.
RyuD.MallickB. K., & LiE. (2009). Bayesian Nonparametric Regression Analysis of Data with Random Effects Covariates from Longitudinal Measurements. Technical Report, Department of Statistics, Texas A&M University.
17.
ShenJ. (2011). Additive Mixed Modeling of HIV Patient Outcomes Across Multiple Studies, Disertasi Ph.D., Department of Statistics, University of California, Los Angeles.
18.
SmithM., & KohnR. (1996). Nonparametric regression using Bayesian variable selection. Journal of Econometrics, 75, 317-344.
19.
SmithM.KohnR., & MathurS. K. (2000). Bayesian semiparametric regression: An exposition and application to print advertising data. Journal of Business Research, 49, 229-244.
20.
WahbaG. (1990). Spline Model for Observational Data. Society for Industrial and Applied Mathematics, Philadelphia.
21.
WangY. (2011). Smoothing Splines Methods and applications. CRC Press Taylor & Francis Group, California, USA.
22.
WiesenfarthM.KrivobokovaT., & KlasenS. (2010). Simultaneous Confidence Bands for Additive Models with Locally Adaptive Smoothed Components and Heteroscedastic Errors. Technical Report, Georg August University Gottingen.
23.
WongC., & KohnR. (1996). A Bayesian approach to additive semiparametric regression. Journal of Econometrics, 74, 209-235.
24.
WoodS. N., & MarraG. (2011). Coverage Properties of Confidence Intervals for Generalized Additive Model Components. Research Report 313, Department of Statistical Science, University College London.
25.
YangL. (2008). Confidence band for additive regression model. Journal of Data Science, 6, 207-217.