Abstract
Since Basel 2, a financial institution can determine the capital required for credit risk by using internal models, which require the determination of Loss Given Default (LGD). LGD is also an important element in asset pricing. The dynamics of LGD can be evaluated directly or can be extracted from the dynamics of the recovery rate (RR). Observed LGDs/RRs present an asymmetric and bimodal distribution. In this article, several models formalizing these features are employed: simple models and models that account for the unusual nature of LGDs/RRs (skewed regressions, adjusted regressions, inflated regressions). In sum, 25 models and 13 covariates were retained for evaluating the dynamics of LGDs and RRs. The performances of the models were compared using several classical and advanced metrics determined by 10-fold cross-validation. According to the results obtained, the best way to forecast LGDs is either to model and forecast them directly with the beta regression or the zero-adjusted beta regression, or to derive them from the fitted and forecasted RRs of the best-performing RR models, namely the beta regression and the one-adjusted beta regression. As expected, most of the retained explanatory variables influence the dynamics of LGDs/RRs.
Introduction
Since Basel 2, financial institutions can determine the regulatory capital for credit risk by using advanced internal models, which require the determination of the Probability of Default (PD), the Exposure at Default (EAD) and the Loss Given Default (LGD). LGDs are also an important element in asset pricing. Observed LGDs are skewed and bimodal with modes close to boundary values. It is important to model, evaluate and predict LGDs for each type of portfolio/segment by taking their features into account.
Earlier studies formalized the dynamics of LGDs with classical models, such as the OLS model, the fractional response model, and the Tobit model. Later studies used more appropriate models which account for the characteristic distribution of LGDs. Models taking the skewed nature of LGDs into account were proposed, such as the beta regression, the gamma regression, the inverse Gaussian regression, and the log-normal regression. In order to take the bimodal nature of LGDs into account, methods based on mixture models were also retained, such as adjusted regressions, ordered logistic regression, and inflated regressions. Do all these parametric models perform well in forecasting LGDs?
Initial studies based on parametric models considered only one or two models. Some recent works compared the performance of earlier and more recent models [7,19,33]. However, these authors considered a reduced number of parametric models. Furthermore, most of the existing studies investigated the dynamic of LGDs related to corporate loans. By contrast, the retail sector’s LGDs were retained by only a few authors [7,12,26,31].
This paper aims to extend the small set of studies on retail LGDs by investigating the dynamics of LGDs related to retail loans. Compared to the rare existing studies in this field, the retained data are specific to online loans. Furthermore, the performances of several simple as well as complex models are compared in the present study (25 models in total). The retained parametric approaches are: (1) the simple OLS model and OLS approaches based on different adjustment methods (local and global) as well as on various transformation methods (inverse Gaussian, inverse Gaussian with beta function, normal function, logit, logarithmic, arcsin, reciprocal, and power), (2) simple models (fractional response model, Tobit model), (3) models which take into account the skewed nature of the distribution of LGDs (or RRs) (beta regression, generalized beta regression, gamma regression, inverse Gaussian regression), and (4) models formalizing both the skewed and the bimodal nature of LGDs (zero- or one-adjusted regressions, inflated regressions, ordered logistic regression).
The dynamic of LGDs can be determined directly by modeling its dynamic or indirectly from the dynamic of the estimated recovery rate (RR) (RR = 1 − LGD) or from the dynamic of the losses (
A 10-fold cross-validation is used in order to determine and compare the performance of the retained models. Compared to several existing studies, which mainly test the goodness-of-fit and predictive power (validation) of models, I also test the discriminatory power of the retained models and whether LGDs are underestimated (RRs overestimated). For these tests and comparisons, classical metrics as well as more elaborate metrics are used.
A review of models formalizing the dynamics of LGD/RR is presented in the second section of this paper. The data used are presented and analyzed descriptively in the third section. Section 4 presents and discusses the results obtained.
Review of methods/models
OLS based regression
LGD and RR values lie in the interval [0, 1], with 0 and 1 included. Furthermore, it is often shown that their probability distributions are skewed and/or bimodal, with modes close to the boundary values. Due to these features, determining and forecasting LGD/RR with standard statistical models, such as the linear regression model estimated with ordinary least squares (OLS), can be inappropriate. Indeed, the OLS model requires normally distributed series and can produce fitted/predicted values outside the interval [0, 1]. To take both of these features into account, it is possible to use standard OLS methods with transformed data. The most used transformation functions in the existing studies are the inverse Gaussian (IG) transformation, the inverse Gaussian with beta transformation (IGB), the logit transformation, and the probit transformation. Other transformation functions used are the normal transformation, the log transformation, the power transformation, the angular transformation, and the Box-Cox transformation.
Some of these transformations cannot be applied directly to the raw LGD/RR data. Indeed, some of these transformation functions are not defined at points 0 and 1. To overcome this problem, LGDs/RRs should be adjusted. An earlier method consists of slightly adjusting only the observed LGDs/RRs at the boundaries 0 and 1 (local adjustment) [19,27]. Qi and Zhao [27] noticed that a small adjustment produces poor model performance, whereas a larger adjustment factor does not preserve the order of the raw LGD values. Based on this, Li et al. [19] proposed to adjust all observed LGD values globally. This global adjustment method consists of adjusting all observed LGDs/RRs as follows:
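The contrast between the two adjustment schemes can be illustrated with a short Python sketch. It assumes that the local adjustment replaces observations at 0 and 1 by 𝜖 and 1 − 𝜖, and that the global adjustment is an affine shrinkage toward 0.5. Both forms, and all function names, are illustrative assumptions and should not be read as the exact formula of Li et al. [19].

```python
def adjust_local(values, eps):
    """Move only boundary observations: 0 -> eps, 1 -> 1 - eps (assumed form)."""
    adjusted = []
    for y in values:
        if y <= 0.0:
            adjusted.append(eps)
        elif y >= 1.0:
            adjusted.append(1.0 - eps)
        else:
            adjusted.append(y)
    return adjusted


def adjust_global(values, eps):
    """Shrink every observation toward 0.5 (assumed affine form)."""
    return [y * (1.0 - 2.0 * eps) + eps for y in values]


def readjust_global(values, eps):
    """Invert adjust_global to return to the original scale."""
    return [(y - eps) / (1.0 - 2.0 * eps) for y in values]


# Hypothetical LGD observations, including both boundary values.
lgd = [0.0, 0.03, 0.5, 0.97, 1.0]
local = adjust_local(lgd, eps=0.05)
global_ = adjust_global(lgd, eps=0.05)
```

With 𝜖 = 0.05, the locally adjusted zero (now 0.05) overtakes the raw value 0.03, which is the loss of rank ordering noted by Qi and Zhao [27], whereas the global version keeps the original order and can be inverted exactly.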
Another simple method is the fractional response regression (FR). This method, proposed initially by Papke and Wooldridge [24], was used by several authors in determining LGD [5,6,11]. Compared to the classical OLS based method, the FR model is a distribution-free model. This model is expressed as:
y = LGD or RR, x is a vector of explanatory variables, and 𝛽 is the vector of parameters. The function G(.) is often specified as (1) a logistic function, (2) a log-log function, (3) a probit function, or (4) a cauchit function.
As these functions ensure 0 < G (x) < 1, ∀x, the estimated, fitted and predicted LGD values are then bounded between 0 and 1. The vector of coefficients (𝛽) is estimated by maximizing the likelihood function. The predicted values of y = LGD, RR are given in Eq. (2).
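The bounding property of G(.) can be checked with a minimal Python sketch of the link functions mentioned above, together with the Bernoulli quasi-log-likelihood contribution that is commonly maximized for fractional responses. The coefficient values, the function names, and the use of the Bernoulli quasi-likelihood are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def probit(z):
    return NormalDist().cdf(z)


def loglog(z):
    return math.exp(-math.exp(-z))


def cloglog(z):
    return 1.0 - math.exp(-math.exp(z))


def cauchit(z):
    return 0.5 + math.atan(z) / math.pi


LINKS = {"logistic": logistic, "probit": probit, "loglog": loglog,
         "cloglog": cloglog, "cauchit": cauchit}


def predict(x, beta, link=logistic):
    """Predicted LGD (or RR) as G(x'beta); always strictly inside (0, 1)."""
    z = sum(xj * bj for xj, bj in zip(x, beta))
    return link(z)


def quasi_loglik(y, mu):
    """Bernoulli quasi-log-likelihood contribution of one observation."""
    return y * math.log(mu) + (1.0 - y) * math.log(1.0 - mu)


# Hypothetical exposure: intercept plus two covariates, arbitrary coefficients.
x_i = [1.0, 0.4, -1.2]
beta = [0.3, 0.8, 0.5]
fitted = {name: predict(x_i, beta, g) for name, g in LINKS.items()}
```

Whatever the link, the fitted value stays strictly between 0 and 1, so no truncation of predictions is needed, unlike in the OLS case.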
The dynamics of LGDs and RRs can also be modeled with a Tobit regression, which is expressed as:
y = LGD, RR. 𝜙(.) and 𝛷(.) represent the probability density function (PDF) and the cumulative distribution function (CDF) of a standard normal random variable, respectively. As for 𝜃ᵢ, it is defined as:
By using the estimated coefficients, the expected yᵢ = LGDᵢ/RRᵢ for exposure i is determined as follows:
It is well known that the observed distributions of LGDs and RRs are skewed [9,18]. Apart from the skewed nature, the bimodal nature of the LGD distribution is often observed [1,28]. In the empirical literature, more appropriate methods for these unusual distributions have been proposed. These models are presented in what follows.
Modeling skewed nature of LGDs/RRs
Skewed LGDs/RRs can be modelled with the beta regression, generalized beta regression, the gamma regression, the inverse Gaussian regression, and the log-normal regression.
The beta distribution is a very flexible distribution which can take different shapes, such as left-skewed, right-skewed, U, inverted J, and uniform. Because of this, the beta distribution, defined on the interval (0, 1), is the most recommended distribution for formalizing the shape of LGDs and RRs. For instance, this distribution is used in Moody’s KMV Losscalc software package for modeling RR and LGD [13]. The density function of the beta distribution is written as:
As the beta distribution is not defined at points 0 and 1, LGDs/RRs should be adjusted. The fitted/predicted LGDs/RRs are re-adjusted in case of global adjustment. In case of local adjustment, fitted/predicted LGDs/RRs are compared to the locally adjusted observed LGDs.
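The flexibility of the beta density, and the reason why boundary observations must be adjusted, can be seen in a small Python sketch. It assumes the mean-precision parameterization (shape parameters 𝜇𝜙 and (1 − 𝜇)𝜙); this parameterization and the function name are assumptions, and the article's own density may be written in a different but equivalent form.

```python
import math


def beta_pdf(y, mu, phi):
    """Beta density in an assumed mean-precision form: a = mu*phi, b = (1-mu)*phi.

    Only defined for 0 < y < 1: log(y) or log(1 - y) fails at the boundaries,
    which is why observations equal to 0 or 1 have to be adjusted first.
    """
    a = mu * phi
    b = (1.0 - mu) * phi
    log_pdf = (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
               + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))
    return math.exp(log_pdf)


# mu = 0.5, phi = 2 gives a = b = 1, i.e. the uniform shape.
uniform_height = beta_pdf(0.3, mu=0.5, phi=2.0)

# mu = 0.5, phi = 1 gives a = b = 0.5, i.e. a U shape with mass near 0 and 1.
u_edge = beta_pdf(0.05, mu=0.5, phi=1.0)
u_center = beta_pdf(0.5, mu=0.5, phi=1.0)

# Midpoint-rule check that a skewed case integrates to about 1.
n = 2000
area = sum(beta_pdf((i + 0.5) / n, mu=0.3, phi=5.0) for i in range(n)) / n
```

The U-shaped case is the one closest to the bimodal LGD/RR histograms discussed in this article, with density piling up near both boundaries.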
The gamma regression (GA), the inverse Gaussian regression (IG), and the log-normal regression (LogN) mainly model right-skewed distributions. The gamma regression is expressed as:
The Inverse Gaussian regression (IG) is formalized as:
Regarding the Log-Normal (LogN) density function, it is defined as:
The mean (location, 𝜇) and the dispersion (scale, 𝜙) parameters of these distributions can also be dynamic, as in the beta regression.
These distributions are not defined at point 0. LGDs/RRs should then be adjusted partially at point 0. Furthermore, as these distributions are mainly right-skewed and defined on the interval (0, ∞), these regressions are more appropriate for modeling losses than LGDs/RRs. However, they can be used to model LGDs and RRs. In this case, the fitted and predicted LGDs and/or RRs can take values higher than 1.
Although skewed distributions can formalize several shapes, they are not very suitable for datasets containing large numbers of extreme values (0 and/or 1). Continuous-discrete distributions are more appropriate for this kind of data, as suggested by authors such as Hoff [17], Cook et al. [10], and Ospina and Ferrari [21–23]. In these studies, it is recommended to use one distribution for the continuous part and another distribution at point 0 or 1. This mixed model is expressed as:
y = LGD or RR. 𝜋 represents the probability at point c (𝜋 = P(y = c)), where c can be 0 or 1. In case c = 0, the distribution is a zero-adjusted distribution (0Adj); in case c = 1, it is a one-adjusted distribution (1Adj). f(y; 𝜇, 𝜙) is the probability density function of the continuous part. In a zero-adjusted model, f(.) can be a beta, gamma, inverse Gaussian, or log-normal density function, among others. In a one-adjusted model, this function is a beta density function.
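The definitions above can be turned into a minimal Python sketch of an adjusted beta model: a point mass 𝜋 at c and a beta density, scaled by 1 − 𝜋, elsewhere. The mean-precision form of the beta density and all function names are assumptions made for illustration.

```python
import math


def beta_pdf(y, mu, phi):
    """Beta density in an assumed mean-precision form, for 0 < y < 1."""
    a = mu * phi
    b = (1.0 - mu) * phi
    log_pdf = (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
               + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))
    return math.exp(log_pdf)


def adjusted_pdf(y, c, pi, mu, phi):
    """Mixed density: probability pi at the point c, (1 - pi) * f(y) otherwise.

    c = 0 gives the zero-adjusted model, c = 1 the one-adjusted model.
    The other boundary must have been adjusted away beforehand.
    """
    if y == c:
        return pi
    return (1.0 - pi) * beta_pdf(y, mu, phi)


def adjusted_mean(c, pi, mu):
    """Expected value of the mixture: pi * c + (1 - pi) * mu."""
    return pi * c + (1.0 - pi) * mu


# Hypothetical parameter values: 20% of observations sit at the boundary.
zero_adj_lgd = adjusted_mean(c=0, pi=0.2, mu=0.5)
one_adj_lgd = adjusted_mean(c=1, pi=0.2, mu=0.5)
```

When 𝜋 is close to zero, the adjusted model collapses to the plain beta regression, since the point mass contributes almost nothing to the likelihood or to the expected value.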
The mean (𝜇), the dispersion parameter (𝜙) and the probability (𝜋) can be dynamic as:
The gamma, inverse Gaussian, and log-normal density functions are not defined at point 0. Owing to this fact, one-adjusted models based on these density functions are estimated with LGDs/RRs adjusted at point 0 (partial adjustment). Zero-adjusted models based on these functions are estimated with raw LGDs/RRs. The zero-adjusted beta regression is applied to LGD/RR data that are partially locally adjusted at point 1. Similarly, the one-adjusted beta regression is applied to LGD/RR data that are partially locally adjusted at point 0. Globally adjusted LGDs/RRs cannot be used with the zero- or one-adjusted regressions, as globally adjusted values are different from 0 and 1.
The zero-adjusted beta regression is more suitable for datasets with values in the 0-1 interval, and hence for LGDs and RRs, that contain a large number of zeros. As for the other zero-adjusted models (such as the zero-adjusted gamma, zero-adjusted inverse Gaussian, and zero-adjusted log-normal), which are defined on the interval (0, +∞), they are more suitable for modeling losses, from which LGDs are then derived. For instance, Tong et al. [31] modeled losses (= LGD × EAD), containing an extensive number of zeros, with a mixed discrete-continuous model (zero-adjusted gamma regression, 0Adj.GA). However, these latter zero-adjusted models can also be used for modeling LGDs and RRs containing a large number of zeros. In this case, the fitted and predicted LGDs and/or RRs can take values higher than 1.
A general form of the zero-adjusted/one-adjusted regression consists of modeling the dynamic of LGDs (or RRs) at point 0, 1 and within (0-1) differently with mixture-models. This can be modelled with the ordered logistic (OL) regression and the inflated regression (Inf).
The two-step method, also called ordered logistic regression (OL), initially proposed by Gurtler and Hibbeln [14] and Li et al. [19], consists of estimating the dynamics of LGD (and RR) in two steps. The first step consists of specifying the dynamics of LGD (RR) at 0 and 1.
The second step consists of regressing observed LGDs (RRs, resp.), having values within the range (0,1), on the explanatory variables by using the OLS method. Estimated LGDs (RRs, resp.), from this second step (
A general form of the zero-adjusted/one-adjusted regression is the inflated beta regression (InfBE), which was initially proposed by Ospina and Ferrari [21,23] and has been used by authors such as Pereira and Cribari-Neto [25], Qi and Zhao [27], Yashkir and Yashkir [33], and Li et al. [19]. This model is given as:
y = LGD, RR. The mean and the dispersion parameters of the beta distribution are defined as 0 < 𝜇 < 1 and 𝜙 > 0. f(.) is a beta probability density function (PDF). The mean (𝜇), the dispersion parameter (𝜙), and the probabilities P₀ and P₁ can be dynamic, such as: g₁(𝜇) = g₁(xᵀ𝛽₁), g₂(𝜃) = g₂(xᵀ𝛽₂),
As in adjusted regressions, the link functions g₁, g₃, and g₄ can be a logit, a probit, a complementary log-log, or a log-log link function. As for g₂, a log link or a square-root link is recommended. x represents the set of explanatory variables. The coefficients of the model are estimated by maximizing the log-likelihood function. The predicted values
Compared to the simple Beta regression and the zero or one adjusted Beta regression, inflated-Beta regression does not require an adjusted dependent variable.
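As a rough illustration of how an inflated beta model produces a prediction without any adjustment of the dependent variable, the Python sketch below computes the mixture mean from P₀, P₁ and 𝜇 and compares it with a simulated sample. The parameter values, the mean-precision form of the beta part, and the function names are assumptions chosen for illustration.

```python
import random


def inflated_beta_mean(p0, p1, mu):
    """Mixture mean: 0 with probability p0, 1 with probability p1,
    and a beta variable with mean mu otherwise."""
    return p1 + (1.0 - p0 - p1) * mu


def draw_inflated_beta(rng, p0, p1, mu, phi):
    """Draw one observation from the three-part mixture."""
    u = rng.random()
    if u < p0:
        return 0.0
    if u < p0 + p1:
        return 1.0
    # Assumed mean-precision form: shapes mu*phi and (1 - mu)*phi.
    return rng.betavariate(mu * phi, (1.0 - mu) * phi)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
p0, p1, mu, phi = 0.05, 0.15, 0.6, 4.0  # hypothetical values
sample = [draw_inflated_beta(rng, p0, p1, mu, phi) for _ in range(20000)]

expected = inflated_beta_mean(p0, p1, mu)
simulated = sum(sample) / len(sample)
```

The simulated sample contains exact zeros and ones alongside interior values, which is the raw form of the LGD/RR data this model accepts.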
The two-step approach (ordered logistic regression) and the inflated beta regression are quite similar. In the inflated beta model, parameters are estimated in a single step, whereas in the two-step approach they are estimated in two separate steps. However, the two-step method might perform better than the inflated regression due to its flexibility in predicting the observations within the interval (0, 1) [19]. Indeed, in the ordered-logistic regression
Data presentation
This article investigates the dynamics of the retail sector’s LGDs and RRs by using data extracted online from the Lending Club, a platform that provides credits and loans online.
LGDs can be determined by using the price of defaulted securities (implied LGD), the price of non-defaulted securities (market implied LGD), or all cash flows related to the defaulted securities (workout LGD). Workout LGDs require the determination of the values of all cash flows at the default time. The default time is defined as 180 days past due. Precisely, workout LGDs are determined by using the future values of the unpaid installments over the 180 days and the present value, at the default time, of all installments between the default time and the maturity of the loan. In this study, the workout LGDs were determined from loans having a default or charged-off status. In the determination of the present value of cash flows, the interest rate of each loan was retained as the discount rate.

Fig. 1. In-sample observed LGD and RR.
Due to data constraints, the sample period was restricted to four years, from 2012 to 2015. In order to capture the whole four-year economic cycle, a weighted-balanced sample design was applied to account for the heterogeneous distribution of the number of applications and their default rates. A sample of 400 applications was selected randomly every month from January 2012 to December 2015. In total, 19200 applications were considered. All the determined workout LGDs lie within the 0-1 range. Although the retained database is large, a 10-fold cross-validation method was used to evaluate the validation and discrimination performances of the retained models. This method consists of randomly dividing the entire dataset into 10 equal-sized subsets; nine of these are used to estimate/calibrate the model (in-sample) and the remaining one is used to test the model performance (out-of-sample). This procedure is repeated 10 times, with each subset being used once as the test subset. Due to the length of the dataset (from 2012 to 2015) and the necessity to cover an economic cycle, the performance of the retained models was tested only by using out-of-sample subsets. Such an approach, based on k-fold cross-validation with out-of-sample subsets, was employed by a few authors, such as Bellotti and Crook [6], Baston [5], Qi and Zhao [27], Li et al. [19], and Tanoue et al. [30].
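The splitting procedure described above can be sketched in a few lines of Python. The function names and the seed are hypothetical; only the number of observations (19200) and the number of folds (10) come from the text.

```python
import random


def k_fold_indices(n_obs, k=10, seed=0):
    """Shuffle the observation indices and cut them into k folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    fold_size = n_obs // k
    folds = [indices[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    # Spread any remainder over the first folds (none when k divides n_obs).
    for j, extra in enumerate(indices[k * fold_size:]):
        folds[j].append(extra)
    return folds


def train_test_splits(folds):
    """Yield (in-sample, out-of-sample) index lists, one pair per fold."""
    for i, test in enumerate(folds):
        train = [idx for m, fold in enumerate(folds) if m != i for idx in fold]
        yield train, test


folds = k_fold_indices(19200, k=10)
splits = list(train_test_splits(folds))
```

Each of the 10 iterations therefore calibrates a model on 17280 loans and evaluates it on the remaining 1920, and every loan is used exactly once for out-of-sample testing.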
All in-sample datasets contain only a very small percentage of LGDs equal to 0 and a slightly higher percentage of LGDs equal to 1. This implies that the RR datasets contain few observations with a value of one and slightly more observations with a value of zero. The frequency distributions of LGDs and RRs from one fold are displayed in Fig. 1.
Several existing studies took explanatory variables into account in the validation processes. According to Bellotti and Crook [7], five types of factors are important explanatory variables for the retail sector: (1) individual details, (2) account information at default, (3) changes in the personal/obligor situation over time, (4) the macroeconomic situation and (5) decisions of the bank on the level of risk.
The extracted data contained 116 types of information (variables) for each customer. These variables were related to the loan characteristics (maturity, funded amount, interest rate, purpose, …), to the customers’ private life (address, employment, ratings, …), and to the customers’ past loan payments. Among these 116 variables, some were missing or irrelevant: 59 variables were assumed relevant and then retained at the first stage of the present analysis.
These retained variables were filtered again. Variable selection/filtering can be done by using different methods (see [4]). Although LGD is a continuous variable, it can be transformed into a discrete form by labeling LGD values smaller than the overall portfolio’s median LGD as Good and LGD values higher than the overall median as Bad. All the continuous variables were split into five buckets based on expert knowledge. As Cramer’s V, information value, and Gain provide similar results in terms of variable importance, only Cramer’s V was considered in this article. Based on the Bad definition, the 59 variables were analyzed univariately.
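The univariate screening step can be sketched as follows in Python: LGDs are labeled Good or Bad relative to the median, and Cramer's V is computed from the contingency table between a bucketed covariate and that label. The toy data and the function name are hypothetical, and the chi-square-based definition of Cramer's V used here is the textbook one, assumed to match the authors' computation.

```python
import math
from collections import Counter
from statistics import median


def cramers_v(x_labels, y_labels):
    """Cramer's V from the chi-square statistic of the contingency table."""
    n = len(x_labels)
    joint = Counter(zip(x_labels, y_labels))
    rows = Counter(x_labels)
    cols = Counter(y_labels)
    chi2 = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(rows), len(cols)) - 1
    if k == 0:  # a constant variable carries no association
        return 0.0
    return math.sqrt(chi2 / (n * k))


# Hypothetical loans: LGD values and a covariate already split into buckets.
lgd = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
bucket = ["low", "low", "low", "low", "high", "high", "high", "high"]

cut = median(lgd)
label = ["Bad" if v > cut else "Good" for v in lgd]
v_strong = cramers_v(bucket, label)

# A covariate unrelated to the label gives a value of zero.
v_none = cramers_v(["a", "a", "b", "b"], ["Good", "Bad", "Good", "Bad"])
```

Variables would then be ranked by this statistic, with higher values indicating a stronger association with the Good/Bad label.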
In order to overcome the model development and degrees-of-freedom constraints of the categorical variables, such as address state and rating, the Weight of Evidence (WOE) values of their categories were used throughout this study. Furthermore, natural logarithm transformations were used to normalize and decrease the noise in the asymmetric distributions of the income, balance, and limit dollar values.
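A minimal Python sketch of the WOE encoding of a categorical variable is given below. It assumes the convention WOE = ln(share of Good / share of Bad) and adds a small smoothing constant to avoid empty cells; the sign convention, the smoothing, the toy data, and the function name are all assumptions rather than the authors' exact procedure.

```python
import math
from collections import Counter


def weight_of_evidence(categories, labels, smoothing=0.5):
    """WOE per category: ln(share of Good / share of Bad), with smoothing."""
    good = Counter(c for c, l in zip(categories, labels) if l == "Good")
    bad = Counter(c for c, l in zip(categories, labels) if l == "Bad")
    levels = sorted(set(categories))
    total_good = sum(good.values()) + smoothing * len(levels)
    total_bad = sum(bad.values()) + smoothing * len(levels)
    woe = {}
    for c in levels:
        share_good = (good[c] + smoothing) / total_good
        share_bad = (bad[c] + smoothing) / total_bad
        woe[c] = math.log(share_good / share_bad)
    return woe


# Hypothetical address states with Good/Bad labels.
state = ["CA", "CA", "CA", "NY", "NY", "NY", "TX", "TX"]
label = ["Good", "Good", "Bad", "Bad", "Bad", "Good", "Good", "Bad"]

woe = weight_of_evidence(state, label)
encoded = [woe[s] for s in state]  # single numeric column replacing the states
```

The encoding replaces a variable with many categories by one numeric column, which is how it saves degrees of freedom compared with one dummy variable per category.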
Based on Cramer’s V values, as well as our analysis and decisions, 25 explanatory variables among the 59 variables were selected. In order to decrease multi-collinearity, further filtering of the explanatory variables was done based on their Pearson correlations, and 13 variables among the previously selected 25 were retained as final explanatory variables. The selected variables in this multivariate analysis are: (1) the WOE of the assigned loan subgrade (grade), (2) the logarithm of the annual income provided by the borrower (income), (3) the WOE of the state of the borrower (address), (4) the number of 30+ days past-due incidences of delinquency of the borrower over the past 2 years (delinq), (5) the revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credits (revol-util), (6) the logarithm of the total current balance of all accounts (revol-util), (7) the number of trades opened in the past 24 months (acc-open), (8) the number of mortgage accounts (mort-acc), (9) the number of bankcard accounts (num-bc), (10) the number of installment accounts (num-il), (11) the number of revolving trades with balance > 0 (num-rev), (12) the logarithm of the total bankcard high credit/credit limit (bc-limit), and (13) the variable representing the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month).
The various models retained in this study were calibrated by using these 13 explanatory variables. A final variables selection was done during the model estimation stage with stepwise variables selection.
According to the Basel agreement, financial institutions should regularly validate their models, that is, check the goodness-of-fit and the predictive power (validation) as well as the discriminatory power, in order to ensure that they are not mis-estimating LGD/RR and hence the minimum required capital. A reduced predictive performance of an LGD or RR model can lead to an overestimation or underestimation of the required minimum capital and then induce less profitable (over-estimation) or riskier (under-estimation) operations.
Methods for back-testing LGD models are rare [20]. Most of the existing practices in the literature consist of checking the validation power of models by comparing predicted/fitted LGDs and realized LGDs by using error-based metrics (MSE, RMSE, MAE, MAPE) and/or the coefficient of determination R². For continuous variables, such as LGD and RR, the discriminatory power (classification power) of models can be measured with correlation-based statistics such as Pearson’s r, Kendall’s 𝜏, and Spearman’s 𝜌. Only a few authors, such as Tong et al. [31], used these three types of metrics.
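The error-based and correlation-based metrics listed above can be computed with a few lines of plain Python, as sketched below. The function names and the toy data are hypothetical, and the Kendall statistic is the simple tau-a variant, which ignores ties; this is an assumption, since the variant used in the study is not stated.

```python
import math


def mse(obs, pred):
    return sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    return math.sqrt(mse(obs, pred))


def mae(obs, pred):
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """1 - SSE/SST; negative when predictions are worse than the mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result


def spearman(x, y):
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical observed and predicted LGDs.
observed = [0.1, 0.4, 0.35, 0.8]
predicted = [0.2, 0.3, 0.5, 0.7]
```

The error-based metrics reward predictions that are close to the realized values, whereas the rank correlations only reward a correct ordering of exposures, which is why the two groups are reported separately as validation and discrimination.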
Under-estimation of LGDs (or over-estimation of RRs) puts financial institutions in a riskier situation than over-estimation. For this reason, Loterman et al. [20] recommended testing whether LGDs are under-estimated. This feature can be tested with central tendency metrics, which test whether the fitted and predicted LGDs tend to under-estimate the true LGDs. Among existing metrics, the T-test is the most used. It tests whether the mean error equals zero (null hypothesis H₀: x̄_E = 0) against the alternative hypothesis Hₐ: x̄_E > 0. Regarding the RRs, it tests the null hypothesis H₀: x̄_E = 0 against the alternative hypothesis of over-estimation of RRs, Hₐ: x̄_E < 0.
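A minimal Python sketch of this one-sided test follows. It assumes that the error is defined as observed minus predicted (so that a positive mean error signals under-estimation) and uses the standard normal distribution as a large-sample approximation of the Student t distribution; both choices, and the function name, are assumptions.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_error_test(observed, predicted, alternative="greater"):
    """One-sided test of H0: mean error = 0, with error = observed - predicted.

    alternative="greater" targets under-estimation (the LGD case);
    alternative="less" targets over-estimation (the RR case).
    The p-value relies on a normal approximation of the t distribution,
    which is reasonable for folds with many observations.
    """
    errors = [o - p for o, p in zip(observed, predicted)]
    t_stat = mean(errors) / (stdev(errors) / math.sqrt(len(errors)))
    upper_tail = 1.0 - NormalDist().cdf(t_stat)
    p_value = upper_tail if alternative == "greater" else 1.0 - upper_tail
    return t_stat, p_value


# Hypothetical LGDs with two sets of predictions.
observed = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
too_low = [0.26, 0.34, 0.46, 0.54, 0.66, 0.74]   # systematically below
balanced = [0.34, 0.36, 0.54, 0.56, 0.74, 0.76]  # errors cancel out

t_low, p_low = mean_error_test(observed, too_low)
t_bal, p_bal = mean_error_test(observed, balanced)
```

A small p-value for the first set of predictions would lead to rejecting H₀ in favor of under-estimation, while the second set gives no evidence against H₀.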
All these classical and advanced metrics were determined and used to compare the performance of the models retained in this article. As 10-fold cross validation was used in this article, the average value of these metrics over the 10 folds were considered.
Metrics results of classical models - LGDs - In-sample (1)
IG_l and IG_g: OLS based on locally (l) and globally (g) adjusted LGDs transformed with the inverse Gaussian (IG) distribution. FR-Cl: fractional response model based on a cloglog (Cl) link function. IGB: inverse Gaussian beta distribution. Values in parentheses are standard deviations. T-test: H₀: x̄_E = 0 (no under-estimation of LGDs) and Hₐ: x̄_E > 0 (under-estimation of LGDs).
The aim of this study was to model the dynamics of LGDs and RRs with the presented approaches and to select the model that best formalizes the dynamics of LGDs, either directly or from the dynamics of RRs, and that predicts them accurately.
As the purpose was to determine the model presenting the best goodness-of-fit and the highest predictive power (validation) and discrimination, it is better to compare the performance of models by considering the average values of metrics computed from back-transformed fitted/predicted LGDs/RRs and observed LGDs/RRs [7], rather than from transformed fitted/predicted and transformed observed values. Important differences between these two ways of computing the metrics were observed, especially for R². The R² values calculated by using transformed fitted/predicted LGDs/RRs and transformed observed LGDs/RRs are all positive. By contrast, some of the R² values determined from back-transformed fitted/predicted LGDs/RRs and observed LGDs/RRs are negative (see Tables 1–8). In the next part, I only consider models having a positive R² determined by using back-transformed fitted/predicted LGDs/RRs and observed LGDs/RRs.
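The gap between the two ways of computing R² can be reproduced on a toy example. The Python sketch below fits a one-regressor OLS on logit-transformed LGDs and computes R² once on the transformed scale and once after transforming the fitted values back; the data, the choice of the logit transformation, and the function names are assumptions made for illustration.

```python
import math


def logit(y):
    return math.log(y / (1.0 - y))


def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))


def ols_fit(x, z):
    """Intercept and slope of a one-regressor OLS fit."""
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    slope = (sum((a - mx) * (b - mz) for a, b in zip(x, z))
             / sum((a - mx) ** 2 for a in x))
    return mz - slope * mx, slope


def r_squared(obs, pred):
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


# Hypothetical covariate and LGDs already adjusted away from 0 and 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0.001, 0.30, 0.20, 0.999, 0.60, 0.70]

z = [logit(v) for v in y]
intercept, slope = ols_fit(x, z)
z_hat = [intercept + slope * xi for xi in x]

r2_transformed = r_squared(z, z_hat)   # transformed scale
y_hat = [inv_logit(v) for v in z_hat]  # back to the LGD scale
r2_original = r_squared(y, y_hat)      # original scale
```

On the transformed scale, an in-sample OLS fit with an intercept cannot give a negative R², because the fit is at least as good as the mean. No such guarantee exists once the fitted values are transformed back nonlinearly, which is consistent with the negative values reported for some models.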
Metrics results of models taking account unusual features of LGDs - In-sample (2)
0Adj: zero-adjusted regression; 1Adj: one-adjusted regression; InfBE: inflated beta regression; InvGau: inverse Gaussian regression; GA: gamma; 0AdjGA: zero-adjusted gamma; BE: beta; G.BE: generalized beta; OL: ordered logistic regression. Values in parentheses are standard deviations. T-test: H₀: x̄_E = 0 (no under-estimation of LGDs) and Hₐ: x̄_E > 0 (under-estimation of LGDs).
LGD - 10-fold cross-validation - Metrics results - Out-of-sample (1)
Values in parentheses are standard deviations. T-test: H₀: x̄_E = 0 (no under-estimation of LGDs) and Hₐ: x̄_E > 0 (under-estimation of LGDs).
LGD - 10-fold cross-validation - Metrics results - Out-of-sample (2)
Values in parentheses are standard deviations. T-test: H₀: x̄_E = 0 (no under-estimation of LGDs) and Hₐ: x̄_E > 0 (under-estimation of LGDs).
Metrics Results of classical models - RRs - In-sample (1)
Local (l) and global (g) adjustment. IG_l: inverse Gaussian; IGB: inverse Gaussian beta distribution; FR-Cl: fractional response model based on a cloglog (Cl) link function. Values in parentheses are standard deviations. T-test: H₀: x̄_E = 0 (no over-estimation of RRs) and Hₐ: x̄_E < 0 (over-estimation of RRs).
Metrics results of models taking account unusual features of RRs - In-sample (2)
0Adj: zero-adjusted regression; 1Adj: one-adjusted regression; InfBE: inflated beta regression; InvGau: inverse Gaussian regression; GA: gamma; 0AdjGA: zero-adjusted gamma; BE: beta; G.BE: generalized beta; OL: ordered logistic regression. Values in parentheses are standard deviations. T-test: H₀: x̄_E = 0 (no over-estimation of RRs) and Hₐ: x̄_E < 0 (over-estimation of RRs).
RR - 10-fold cross-validation - Metrics results - Out-of-sample - 1
Values in parentheses are standard deviations.
RR - 10-fold cross-validation - Metrics results - Out-of-sample - 2
Values in parentheses are standard deviations.
The estimation of some of the retained approaches required adjusted LGDs or RRs. Two adjustment methods (local and global) were used in this article. According to some authors, such as Qi and Zhao [27], Tong et al. [31], and Li et al. [19], the results of a model are sensitive to the choice of the adjustment value (𝜖). For both the local and the global adjustment, several adjustment values (𝜖) were considered. Precisely, 11 adjustment values were retained: 0.000001, 0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.075, 0.1, 0.25, and 0.4. The optimum adjustment value was selected by using metrics such as MSE and RMSE. As several approaches requiring adjusted LGDs were considered in this article, each with 11 different adjustment values for both adjustment methods, only the results for the optimum adjustment value are presented in Tables 1 and 2 (for LGD) and Tables 5 and 6 (for RR) due to limited space.
The results and performances of the models of the dynamics of LGDs are presented and discussed first. The results related to the dynamics of RRs and the corresponding model performances are displayed and analyzed next. Finally, the impacts of the independent variables are presented and discussed.
The results revealed that most OLS regressions using locally adjusted LGDs have a negative average R², determined from back-transformed fitted/predicted LGDs and observed LGDs, except the OLS model using the log transformation and locally adjusted LGDs (Log_l). This finding holds in-sample and out-of-sample (see Tables 1 and 3). It can be due to the fact that most of the transformation functions used are not linear and that LGDs/RRs were not re-adjusted back in the case of local adjustment. It can also be due to the local adjustment method itself. Indeed, according to Qi and Zhao [27] and Li et al. [19], a small local adjustment factor can imply poor model performance, whereas a larger local adjustment factor cannot preserve the rank ordering of the raw LGD/RR values and then affects the predictive performance.
This finding is also characteristic of some other OLS-based models, such as the OLS model using the inverse Gaussian beta transformation and globally adjusted LGDs (IGB_g), and the OLS regressions based on the arcsin and reciprocal transformation functions.
Positive R² values are quite low (around 0.10 or less). As explained by Bellotti and Crook [7], such low values were observed by several authors and are typical of RR and LGD modeling.
Among the OLS regressions having a positive R², the models using the inverse Gaussian (global adjustment, IG_g), normal (global adjustment, Normal_g), and logit (global adjustment, Logit_g) transformations have almost identical average error-based metric values. This finding holds in-sample as well as out-of-sample (see Tables 1 and 3). These models also have higher discriminatory power in both samples (Pearson correlation, Kendall’s 𝜏, and Spearman correlation). Their validation performance is quite similar to that of the simple OLS regression in both samples. In terms of discrimination, the simple OLS model dominates the other OLS-based models in-sample. According to the T-test, these selected OLS-based regressions do not underestimate LGDs either in-sample or out-of-sample, except the simple OLS regression.
According to the error-based metrics, OLS regressions using power-transformed LGDs (power 0.5, 1/3 or 0/4) have quite similar average error-based metric values, with the lowest MSE and RMSE values obtained at power 0.5. This finding also holds in terms of classification performance, in-sample as well as out-of-sample. This OLS model, using power-transformed LGDs, has validation and discriminatory power quite similar to that of the dominant OLS-based models, in-sample as well as out-of-sample. However, OLS regressions using LGDs transformed at a power lower than 0 tend to underestimate LGDs in-sample and out-of-sample according to the T-test.
The retained fractional response (FR) models present quite similar performances in terms of validation and discrimination whatever the link function (logit, probit, cauchit, or cloglog). The FR model using the cloglog link function has the lowest MSE (see Tables 1 and 3). According to the T-test, fitted and predicted LGDs are not underestimated with the fractional response models, whatever the link function.
In terms of validation and discrimination, the FR model outperforms the Tobit model in-sample as well as out-of-sample (see Tables 1 and 3). The FR model has fitting and predictive performances similar to those of the dominant OLS-based models in-sample (see Table 1). A similar finding is drawn in terms of discriminatory power. Out-of-sample, the FR model performs slightly worse than the dominant OLS-based models in terms of prediction and discrimination (see Table 3). Similar results were obtained by Bellotti and Crook [6,7].
In terms of validation and classification, simple beta regression using locally adjusted LGDs (BE l ) performs better than other retained models based on beta regressions (beta regression based on globally adjusted LGDs (BE g ) and generalized beta regression (G.BE)). The BE l model also presents lower error-based metrics’ values and higher correlation based metrics values than the gamma model (GA l ). Furthermore, this performing model does not under-estimate LGDs (T-test). All these findings hold in-sample as well as out-of-sample (Tables 2 and 4).
Among the adjusted models, the zero-adjusted beta regression (0AdjBE l ) has lower average error-based metrics’ values and higher average correlation-based metrics’ values than the other zero-adjusted models (see Tables 2 and 4). This model also dominates the one-adjusted beta regression (1AdjBE l ). Furthermore, this latter model has a negative average R². This finding is surprising. Indeed, the reverse could be expected, as the considered LGD samples contain only a few zero values and somewhat more ones: zero-adjusted models are more suitable for datasets containing a large number of observations with value zero, and one-adjusted models for datasets with numerous observations of value one [31].
Inflated models and the ordered logistic regression (2-step approach) are generalized versions of the zero/one-adjusted models. Of these two models, the inflated beta regression (InfBE) has a negative average R² in both samples, whereas the ordered logistic regression (OL) has higher fitting/predictive power as well as higher discrimination power in both samples (see Tables 2 and 4). This finding is in line with the results obtained by Li et al. [19]. Furthermore, LGDs fitted and predicted with the OL model are not underestimated (T-test).
In sum, the beta regression based on locally adjusted LGDs (BE l ) and the zero-adjusted beta regression (0AdjBE l ) perform better than the other retained models taking into account the characteristic distribution of LGDs. This finding can be explained by the fact that the LGD datasets used contain only a few LGDs with value 0. Consequently, the zero-adjusted regression should be driven mainly by the continuous beta regression for LGDs between 0 and 1, and the point mass at 0 should be almost insignificant.
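The argument can be made explicit under the usual mixture form of a zero-adjusted beta model. The notation below is an assumption, since the article does not restate the density here: p_0 stands for the probability at point 0 (P 0), and mu and phi for the mean and precision of the continuous beta part.

```latex
f(y) =
\begin{cases}
p_0, & y = 0,\\[2pt]
(1 - p_0)\,\mathrm{Beta}(y;\mu,\phi), & 0 < y < 1,
\end{cases}
\qquad
\mathbb{E}[Y] = (1 - p_0)\,\mu .
```

When the sample contains almost no zeros, the estimated p_0 is close to 0, so that E[Y] is close to mu and the zero-adjusted model behaves almost like the simple beta regression.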
These dominant models (the simple beta regression (BE l ) and the zero-adjusted beta regression (0AdjBE l )) also present higher performance in terms of validation and discrimination than the OLS-based models, the fractional response model and the Tobit model. This finding holds in-sample as well as out-of-sample (see Tables 1–4). Similar results were obtained by Yashkir and Yashkir [33].
Models performance - RR
The results show that the average R² is negative for the OLS regressions using transformed and locally adjusted RRs (IG l , Normal l , Log l , and Logit l ) as well as for the OLS regressions using RRs transformed with the arcsin and reciprocal functions (arcsin and reciprocal l ). The same is observed for some models accounting for skewness and asymmetry, such as the generalized Beta regression (G.BE l ), the zero-adjusted Beta regression (0AdjBE l ), the zero-adjusted Gamma regression (0AdjGA l ), and the inflated Beta (InfBE). These findings hold in both samples (see Tables 5–8).
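A negative R² is possible whenever the predictions fit worse than the sample mean. The sketch below assumes that R² is computed as 1 − SSE/SST, which the article does not state explicitly; the data are hypothetical.

```python
def r_squared(observed, predicted):
    """R² computed as 1 - SSE/SST (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst

# Hypothetical recovery rates and two sets of predictions.
observed = [0.1, 0.5, 0.9]
mean_only = [0.5, 0.5, 0.5]   # predicting the sample mean everywhere
reversed_ = [0.9, 0.5, 0.1]   # predictions moving in the wrong direction

print(r_squared(observed, mean_only))
print(r_squared(observed, reversed_))
```

Predicting the mean gives an R² of zero, and any model whose squared errors exceed the total variation of the observations falls below zero.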
OLS-based models with a positive average R² present quite similar validation and discrimination performances in-sample as well as out-of-sample (see Tables 5 and 7). RRs fitted and predicted with these models are not overestimated (T-test).
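The form of the T-test is not detailed in this section. As a minimal sketch, assuming a paired test on the differences between predicted and observed values, the statistic could be computed as follows; the function name and the data are hypothetical.

```python
import math
import statistics

def paired_t_statistic(observed, predicted):
    """t statistic of the mean difference (predicted - observed).

    Assumes at least two observations and non-constant differences.
    """
    diffs = [p - o for o, p in zip(observed, predicted)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err

# Hypothetical recovery rates with predictions that are too high on average.
rr_obs = [0.2, 0.4, 0.6, 0.8]
rr_fit = [0.3, 0.6, 0.7, 1.0]
print(paired_t_statistic(rr_obs, rr_fit))
```

Under this convention a large positive statistic points to overestimation and a large negative one to underestimation; the statistic would then be compared with a one-sided critical value of the Student distribution with n − 1 degrees of freedom.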
Fractional response models based on the probit, logit and cloglog link functions perform similarly in both samples. Their validation performance is higher than that of the model based on the cauchit link function in both samples. The lowest average error-based metrics’ values are obtained with the probit link function; only the results of this model (FR P , P: probit) are presented in this article. In terms of classification, the model based on the cauchit link function has slightly higher discrimination power in-sample and out-of-sample. Furthermore, RRs fitted and predicted with the FR models based on these two link functions are not overestimated in either sample (T-test).
The FR model presents validation and discrimination performances similar to those of the previously selected OLS-based models in both samples. Similar findings were reported by Bellotti and Crook [7].
In contrast to the LGD results, the simple beta regression using locally adjusted RRs (BE l ) and the simple Gamma regression (GA l ) present higher validation power than the other retained simple skewed regressions in both samples (see Tables 6 and 8). Furthermore, these models do not overestimate RRs (T-test) either in-sample or out-of-sample.
The higher performance of gamma regression for RRs compared to LGDs can be explained by the fact that this regression fits better for left skewed data (as used RRs) than right skewed data (as used LGDs). Regarding the Beta regression (BE l and BE g ), computed metrics’ values for RRs are quite similar to those computed for LGDs. This finding is expected as the Beta regression can suit left skewed as well as right skewed data and as RR = 1 − LGD.
Among the retained adjusted models, only the one-adjusted beta regression (1AdjBE) has a positive R². This model also presents higher validation and discrimination performances in both samples, and it does not overestimate the RRs (T-test). Furthermore, the metrics’ values of this selected model are quite close to those obtained from the estimation of RRs with the beta regression using locally adjusted RRs. This finding can also be explained by the fact that the retained RR data contain only a few observations with value 1. Consequently, the one-adjusted beta regression should be almost completely driven by the beta regression for RRs strictly between 0 and 1, and the regression for RRs = 1 should be almost insignificant.
Similarly to the results related to LGDs, ordered logistic (OL) regression presents higher validation and classification power than inflated beta-regression (InfBE). Furthermore, the average R 2 of the inflated beta regression is negative. Moreover, the T-test’s value indicated that the RRs are not over-estimated in the ordered-logistic model. All these findings hold in-sample and out-of-sample.
In sum, the beta regression using locally adjusted RRs (BE l ), the gamma regression using locally adjusted RRs (GA l ) and the one-adjusted beta regression (1AdjBE) perform quite similarly to one another in terms of validation and discrimination, and better than the other retained models accounting for the unusual features of the RR distribution, in both samples (Tables 6 and 8). These performing models also have higher fitting/predictive ability and higher discriminatory power than all retained classical models.
Is it better to model LGDs directly or to derive them from the dynamics of RRs?
As LGD = 1 − RR, it is possible to evaluate the dynamics of RRs and then derive the fitted/predicted LGDs from the fitted/predicted RRs. The metrics’ values calculated from the estimation of RRs can therefore be compared with those deduced from the estimation of the LGDs’ dynamics. To answer the previous question, I compared the validation and discriminatory powers of the selected performing models for RRs with those of the selected performing models for LGDs. This comparison indicated that the performances of the selected models for LGDs are quite similar to those of the selected models for RRs. The performing models for LGDs are the zero-adjusted beta regression and the simple beta regression using locally adjusted LGDs. For RRs, the selected models are the gamma regression, the beta regression using locally adjusted RRs and the one-adjusted beta regression. All these selected models present higher validation and discrimination power than the other retained models in both samples (see Tables 1–8). Furthermore, these models neither underestimate LGDs nor overestimate RRs.
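The link between the two sets of metrics follows directly from LGD = 1 − RR. A minimal sketch with hypothetical observed and predicted recovery rates shows that the squared errors are unchanged by the conversion, while the sign of the average bias flips, so that overestimating RRs is the same as underestimating LGDs.

```python
def mean_squared_error(observed, predicted):
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def mean_bias(observed, predicted):
    """Average of (predicted - observed); positive means overestimation."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)

# Hypothetical observed and predicted recovery rates.
rr_observed = [0.10, 0.35, 0.80, 1.00]
rr_predicted = [0.20, 0.30, 0.70, 0.90]

# LGDs derived from the RRs through LGD = 1 - RR.
lgd_observed = [1.0 - rr for rr in rr_observed]
lgd_predicted = [1.0 - rr for rr in rr_predicted]

mse_rr = mean_squared_error(rr_observed, rr_predicted)
mse_lgd = mean_squared_error(lgd_observed, lgd_predicted)
bias_rr = mean_bias(rr_observed, rr_predicted)
bias_lgd = mean_bias(lgd_observed, lgd_predicted)
print(mse_rr, mse_lgd, bias_rr, bias_lgd)
```

Differences between the LGD and RR tables therefore come from the models themselves (for example the choice of distribution or of the adjusted boundary), not from the conversion.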
Although the RRs fitted and predicted with the gamma regression here lie within 0-1, a gamma regression can in general produce fitted and predicted values higher than 1. To avoid this possible problem, it might be better to model and forecast RRs with the beta regression using locally adjusted RRs or with the one-adjusted beta regression, or to model and forecast LGDs directly with the beta regression using locally adjusted LGDs or with the zero-adjusted beta regression.
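The risk can be illustrated with the mean functions alone. The sketch assumes a log link for the gamma regression and a logit link for the beta regression; these links are common defaults and are not confirmed by this section, and the linear predictor value is arbitrary.

```python
import math

def mean_with_log_link(linear_predictor):
    """Fitted mean under a log link (assumed here for the gamma regression)."""
    return math.exp(linear_predictor)

def mean_with_logit_link(linear_predictor):
    """Fitted mean under a logit link (assumed here for the beta regression)."""
    return 1.0 / (1.0 + math.exp(-linear_predictor))

eta = 0.2  # arbitrary linear predictor, for illustration only
print(mean_with_log_link(eta))    # exceeds 1
print(mean_with_logit_link(eta))  # stays strictly inside (0, 1)
```

Any positive linear predictor pushes the log-link mean above 1, whereas the logit-link mean remains inside the unit interval by construction.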
Impacts of explanatory variables
The results show that, in almost all classical models (Table 9), the LGDs’ dynamic reacts significantly to the explanatory variables related to the grade (grade), the annual income (income), the address (address), the revolving line utilization rate (revol-util), the number of trades opened in the past 24 months (acc-open), the number of bankcard accounts (num-bc), the number of installment accounts (num-il), the number of revolving trades with balance >0 (num-rev), and the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month).
Impact of explanatory variables on LGDs - 1 (Mean)
Values (.) represent the t-stat. grade: loan subgrade, income: annual income, address: state of the borrower, delinq: the number of delinquency, revol-util: revolving line/credits, revol-util: total current balance, acc-open: number of trades opened, mort-acc: number of mortgage accounts, num-bc: number of bankcard accounts, num-il: number of installment accounts, num-rev: number of revolving trades, bc-limit: total bankcard limit, Max-month: maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available.
Impact of explanatory variables on LGDs - 2 - (Mean)
Values (.) represent the t-stat.
Impact of explanatory variables on LGDs - 3 - (Dispersion)
Values (.) represent the t-stat.
Impact of explanatory variables on LGDs - 4
Values (.) represent the t-stat. P 0: probability function at point 0 (LGD = 0), P 1: probability function at point 1 (LGD = 1).
The negative impact of the annual income is expected. Indeed, as stated by some authors, such as Bellotti and Crook [7], a higher annual income means that the customer is more likely to pay back, so the loss and the LGD associated with the customer should be lower. Regarding the variable related to the address, higher-quality regions are inhabited by people with higher income and wealth than poor regions. Customers living in better regions should be more able to repay their loans than customers living in poor regions. This can explain the negative impact of the WOE associated with the region. The importance of regional differences for LGD variation was also shown by Grippa et al. [12] and Bellotti and Crook [7]. The negative impact of the rating (WOE of grade) can be explained as follows: a better rating means that the customer is more likely to pay back his/her loan, and will not default easily or will default with the lowest possible amount. This result is in line with the finding of Bellotti and Crook [7], who uncovered a positive impact of credit bureau scores (rating) on the RR. As expected, the number of installment accounts (num-il) has a positive impact on the dynamic of LGDs. Indeed, a higher number of installment accounts means that the customer has to pay back more money, so the default amount per loan can be larger.
Compared to the classical methods, in models taking into consideration the unusual nature of LGDs fewer explanatory variables have a significant impact on the mean of LGDs (see Table 10). Similar to the classical models and as expected explanatory variables related to the annual income and to the address have a negative and significant impact on the level of the LGDs in almost all models.
In some retained models, the LGDs’ mean reacts homogeneously to some explanatory variables, such as the number of installment accounts (num-il), the number of revolving trades with balance >0 (rev-bal), and the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month). These independent variables have a positive impact on LGDs. The expected positive impact of the number of installment accounts (num-il) was explained previously.
Compared to the impact on the dynamic of the mean, explanatory variables have a more significant influence on the dynamic of LGDs’ dispersion (Table 11). In almost all models formalizing the unusual distribution of LGDs, the dispersion parameter reacts significantly and in the same direction to the following explanatory variables: rating of customers (grade), WOE of the state of the borrower (address), past delinquency (delinq), number of mortgage accounts (mort-acc), number of trades opened in the past 24 months (acc-open), number of bankcard accounts (num-bc), and number of revolving trades with balance >0 (rev-bal). These variables have a negative impact on LGDs’ dispersion, except the number of mortgage accounts (mort-acc) and the number of bankcard accounts (num-bc).
The other retained variables either have a significant impact in only few models or do not have a homogenous impact (Table 11).
Table 12 indicates that the probability at point 0 (P 0), in the zero-adjusted regressions (0Adj) as well as in the inflated regression (Inf), reacts to only one explanatory variable, with the same magnitude and only at the 10% significance level, whatever the model. The absence of significant reaction to the other variables can be explained by the fact that the retained LGD dataset contains very few observations with value 0. The reaction to the same covariate with the same magnitude can be explained by the fact that the probability at point 0 (P 0) is modeled with the same link function in these models.
Similarly, the probability at point 1 (P 1), in the one-adjusted beta regression (1AdjBE) and in the inflated regression (InfBE), is influenced by the same explanatory variables and with the same magnitude. Compared to the probability P 0, the probability P 1 reacts to almost all variables and at a higher significance level (5% and 1%). These results can be explained by the fact that the considered LGD dataset contains more observations with a value of 1 than of 0. This probability is significantly impacted by all retained variables, except the variables related to the delinquency (delinq), the number of installment accounts (num-il) and the number of revolving trades with positive balance (num-rev). The probability P 1 reacts negatively to the grade (grade), the annual income (income), the address (address), the revolving line utilization rate, or the amount of credit the borrower (rev − util), the number of mortgage accounts (mort-acc), and the number of bankcard accounts (num-bc). As explained previously and as expected, a higher income reduces the LGDs as well as P 1. Similarly, the negative influence of the address goes in the same direction as its impact on the LGDs’ mean. The negative impact of the number of mortgage accounts (mort-acc) is in line with the positive influence of this variable on the dispersion parameter. Indeed, as stated earlier, these results are explained by the fact that the more trustworthy a customer is, the more mortgages and bankcards the customer will get. This fact can also explain the negative impact of the revolving line utilization rate, or the amount of credit the borrower (rev − util).
The probability P 1 increases with the following variables: the total current balance of all accounts (revol-util), the number of trades opened in the past 24 months (acc-open), the total bankcard high credit/credit limit (bc-limit) and the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month). The positive impact of the total current balance of all accounts (revol-util) on P 1 is in line with its negative impact on LGDs’ mean. Similarly, by positively impacting LGDs’ mean, the number of revolving trades with positive balance (rev-bal) also influences P 1 positively. The positive impact of the number of installment accounts (num-il) is also in line with the result obtained for the LGDs’ mean. Indeed, the more money the customer has to pay back, the larger the defaulted amount can be, and hence the higher the LGD as well as the probability of defaulting on the whole borrowed amount, that is, P 1.
Similarly to the dynamics of LGDs, the dynamics of RR reacted mainly to explanatory variables related to the grade (grade), the annual income (income), the address (address), the number of installment accounts (num-il), the number of revolving trades with positive balance (rev-bal), and the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month) (Table 13). Compared to the negative influence on LGDs, RRs react positively and significantly to some of these variables as expected, since RR = 1 − LGD. These variables represent the grade, the annual income, and the address. As for the other variables, they have a positive impact.
Similarly to the dynamic of LGDs’ mean, RRs’ mean reacts to only few independent variables. These variables are the logarithm of the annual income (income), the WOE of the address (address), the logarithm of the total current balance of all accounts (revol-util) and the number of trades with positive balance (num-rev) (see Table 14). As expected and presented earlier, variables having a negative (respectively positive) impact on LGDs’ mean influence positively (respectively negatively) RRs’ mean. As RR = 1 − LGD, these impacts on RRs are as expected.
Impact of explanatory variables on RRs - 1 (Mean)
Values (.) represent the t-stat.
Impact of explanatory variables on RRs - 2 (Mean)
Values (.) represent the t-stat.
Similarly to the dynamics of LGDs, the dispersion (scale)/variance of RRs is influenced by several explanatory variables. Results in Table 15 show that the RRs’ dispersion parameter increases with the following variables: the number of mortgage accounts (mort-acc), the number of bankcard accounts (num-bc), the number of installment accounts (num-il) and the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month). These results indicate that these variables have a negative impact on the RRs’ variance, owing to the inverse relation between the dispersion (scale) parameter and the variance. As expected and in line with the results obtained for the dispersion parameter of LGDs, the number of mortgage accounts (mort-acc) has a positive impact on the dispersion parameter and hence a negative influence on the RRs’ variance. This fact can also explain the positive impact of the revolving line utilization rate, or the amount of credit the borrower (rev − util) on the dispersion parameter and hence its negative impact on the variance. Indeed, the more trustworthy a customer is, the more mortgages/bankcards he/she can get.
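The inverse relation between the dispersion parameter and the variance can be illustrated under the mean-precision parameterization of the beta distribution, in which the variance equals mu(1 − mu)/(1 + phi). This parameterization is an assumption here, as the section does not restate the one used in the estimations, and the values are hypothetical.

```python
def beta_variance(mean, precision):
    """Variance of a beta variable with given mean and precision parameter."""
    return mean * (1.0 - mean) / (1.0 + precision)

# Hypothetical mean recovery rate with increasing precision values.
for precision in (1.0, 9.0, 99.0):
    print(precision, beta_variance(0.5, precision))
```

For a fixed mean, a covariate that raises the precision parameter therefore lowers the variance of the RRs.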
Impact of explanatory variables on RRs - 3 - (Dispersion)
Values (.) represent the t-stat.
Impact of explanatory variables on RRs - 4
Values (.) represent the t-stat. P 0: probability function at point 0 (RR = 0), P 1: probability function at point 1 (RR = 1).
This dispersion parameter is negatively impacted by the rating (grade), the address (address), the delinquency (delinq), the total current balance of all accounts (cu − bal), and the number of trades opened in the past 24 months (acc-open) whatever the retained models formalizing the unusual form of RRs are. These results are as expected and almost in line with the results related to the LGDs’ dynamic.
In line with the results related to P 0 for LGDs, the probability P 1 associated to a complete recovery rate (RR = 1) reacts to only the variable representing the grade (grade) and at a very low significance level (10%). This result is explained by the fact that retained RR dataset contains only few observations with a value of 1.
The impacts of independent variables on the probability of zero recovery rate (P 0) are in line with the results related to the reaction of the probability of total loss given default (P 1). As P 1 for LGDs, P 0 of RRs reacts to almost all independent variables, except to the variable representing the delinquency (delinq). The probability (P 0) of zero recovery is negatively impacted by the annual income (income), the address (address), the revolving line utilization rate, the amount of credit the borrower (rev − util), the number of mortgage accounts (mort-acc), and the number of bankcard accounts (num-bc). These variables also negatively influence the probability of total losses (P 1 of LGD). As RR = 1 − LGD, the lower probability of complete losses (LGD = 1; P 1) is equivalent to a higher recovery rate and then lower probability of zero recovery rate (P 0 for RR).
In line with the increase of P 1 of LGD; the probability (P 0) of zero recovery increases also with the total current balance of all accounts (revol-util), the number of trades opened in the past 24 months (acc-open), the number of installment accounts (num-il), the number of revolving trades with positive balance (rev-bal), the total bankcard high credit/credit limit (bc-limit) and the maximum number of months between the credit issue date and the oldest bank installment and revolving accounts or the recent revolving, bankcard or other accounts available (max-month).
The dynamics of LGDs and RRs were modelled with several models: simple models and models accounting for the unusual nature of LGDs/RRs (skewed regressions, adjusted regressions, inflated regressions). These models were estimated using 10-fold cross-validation and compared on their fitting/predictive power and discrimination power using several metrics. A T-test was also used to test whether LGDs are underestimated and whether RRs are overestimated.
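The cross-validation scheme can be sketched as follows. This is a generic 10-fold split, not the author's code: the function name k_fold_indices, the shuffling seed and the sample size are assumptions. In each round the model would be fitted on the training indices (in-sample metrics) and evaluated on the held-out fold (out-of-sample metrics), and the metrics would then be averaged over the ten rounds.

```python
import random

def k_fold_indices(n_obs, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds groups of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

# Hypothetical sample of 25 observations.
for train_idx, test_idx in k_fold_indices(25):
    print(len(train_idx), len(test_idx))
```

Each observation is held out exactly once, so the out-of-sample averages use every observation without ever evaluating a model on data it was fitted on.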
The simple beta regression using locally adjusted LGDs and the zero-adjusted beta regression present higher performance in terms of validation and discrimination than the other retained models in modeling the dynamic of LGDs. These models perform better than the classical models (OLS-based models, fractional response model, Tobit model). All these findings were observed in-sample as well as out-of-sample. Furthermore, these selected performing models do not underestimate LGDs either in-sample or out-of-sample.
Regarding the RRs, it is better to model and forecast them with a beta regression using locally adjusted RRs or with the one-adjusted beta regression. These models present higher validation and discrimination power and do not overestimate RRs. These findings hold in both samples. As the RR datasets contain only a few observations with value 1, the similar performance of the beta regression and the one-adjusted beta regression is expected.
The comparison of all these performing models revealed that it is equivalent to estimate and forecast LGDs directly with the selected models or to derive them from the RRs fitted and predicted with their performing models.
In all retained classical models, the mean of LGDs and the mean of RRs were significantly influenced by several covariates. Compared to the classical models, the mean of LGDs and the mean of RRs react significantly to fewer variables in models taking into account the unusual nature of the dependent variables (adjusted regressions, inflated regression and ordered logistic regression). In these latter models, explanatory variables have a more significant impact on the dispersion parameters.
As the LGD dataset contains only a few observations with value 0, the probability of LGD = 0 (P 0) does not react significantly to the retained explanatory variables, as expected. By contrast, the probability of LGD = 1 (P 1) reacts significantly to almost all retained independent variables. The signs of these impacts are as expected. Similarly, the probability of RR = 1 does not react significantly to the retained explanatory variables, as the RR dataset contains only a few observations with value 1. By contrast, the probability of RR = 0 is influenced by almost all retained covariates, as is the probability of LGD = 1.
