We define the odd extended log-logistic-G family, and obtain some of its statistical properties. We construct a new extended regression based on the logarithm of the proposed distribution, which can be better than other known regressions to fit real data.
In the last twenty years or so, several methods have been developed to generalize a baseline distribution. Many continuous distributions have been proposed to explain data sets from medicine, finance, reliability, biomedical sciences, and other areas. Applied problems in these areas increasingly call for such generalized distributions.
The main contributions of this article are described below:
The first objective is to propose a new family of distributions called the odd extended log-logistic-G (“OELL-G” for short).
The second objective is to define new distributions in this family, namely the OELL-Weibull (OELLW), OELL-Normal (OELLN) and OELL-Gumbel (OELLGu). In this context, we introduce several structural properties and show empirically the flexibility of these distributions.
A regression model is constructed based on the OELLW distribution, called the log-odd extended log-logistic-Weibull (LOELLW) regression model, in the form of location and scale for censored data.
Different simulation studies are developed to study the behavior of the maximum likelihood estimates (MLEs) for the OELLW distribution and the LOELLW regression model.
Finally, we provide two applications to show empirically that the proposed models are superior to some existing models in the literature.
Let be a continuous cumulative distribution function (cdf) depending on a parameter vector . The cdf of the OELL-G family is defined by
where and are extra shape parameters, and .
By differentiating (1), the probability density function (pdf) of the OELL-G family reduces to
where .
Henceforth, let be a random variable having density Eq. (2). For , this equation becomes the odd log-logistic-G (OLL-G) class (Gleaton & Lynch, 2006); and the baseline G follows when .
The random variable admits the stochastic representation:
where has the Burr Type XII distribution, say . The verification of Eq. (3) follows the same steps of the proof of Proposition 2.1 below.
Setting , we have , where denotes the pdf of . By differentiating with respect to , its mode satisfies:
where and .
The hazard rate function (hrf) of becomes
where is the baseline hrf. The hazard ratio (HR) is given by (for ). A straightforward calculation provides the values of Table 1.
Limit values of
From now on, we omit the arguments from the functions of the distributions.
The quantile function (qf) of follows from Eq. (1)
where is the qf of , and .
The paper is structured as follows. Section 2 gives some special models, and Section 3 reports some properties of the new family. A location-scale regression based on the logarithm of the odd extended log-logistic Weibull (LOELLW) distribution is introduced in Section 4. Two simulation experiments for the maximum likelihood estimates (MLEs) are carried out in Section 5. Two applications to real data reveal the flexibility of the new models in Section 6. Section 7 offers some conclusions.
where , is a shape, and is a scale. Inserting this expression and its derivative in Eq. (2) gives the OELLW density
Here, consider that is a random variable having density (8). The Weibull distribution follows when . Note that , for , for , and for .
By taking as in Eq. (7), it follows from Eq. (4) that the mode of the OELLW distribution is an intersection point of the graphs of the functions and , namely
where , . Note that it is an arduous task to obtain analytically the points of intersection of the non-polynomial functions and . Numerical methods such as Newton's method and the secant method are suitable for finding these points approximately. Graphically, it can be seen that and have at most three points of intersection, from which the bimodality of the OELLW model follows (see Fig. 1).
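The numerical search for intersection points mentioned above can be sketched with the secant method. This is only an illustration: the two curves `u` and `v` below are hypothetical stand-ins (the actual functions depend on the OELLW parameters), and the starting points are chosen by hand, for example after inspecting a plot.

```python
import math


def secant_root(func, x0, x1, tol=1e-10, max_iter=100):
    """Approximate a root of func with the secant method, starting from x0 and x1."""
    f0, f1 = func(x0), func(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("flat secant line; choose other starting points")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0 = x1, f1
        x1, f1 = x2, func(x2)
    raise RuntimeError("secant method did not converge")


# Hypothetical non-polynomial curves; replace with the two functions of interest.
def u(x):
    return math.exp(-x)


def v(x):
    return x


# An intersection of u and v is a root of their difference.
intersection = secant_root(lambda x: u(x) - v(x), 0.0, 1.0)
```

Since up to three intersection points may exist, the search would be repeated from several pairs of starting values, keeping the distinct roots.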
Plots of the OELLW density, including well-known distributions, are displayed in Fig. 1, which reveal asymmetric and bimodal shapes.
Plots of the OELLW density.
Proposition 2.1 (Stochastic representation).
If is distributed according to the OELLW distribution, the representation holds
Proof.
Let be the cdf of . By setting , we obtain (for )
for , where is given by Eq. (7). Note that . Since for , we obtain (for ), i.e., the function is increasing. Then, the term of the right-hand side of Eq. (9) has the form
Then, and are equal in distribution. Since (for ), the proof follows. ∎
Proposition 2.2.
If , then
where is the beta function.
Proof.
Letting , we have
By Proposition 2.1, if and only if . But it is well-known that
which ends the proof. ∎
Proposition 2.3.
If and , then
Proof.
The proof immediately follows by using the stochastic representation in Proposition 2.1, Jensen’s inequality with , and the identity (10) with . ∎
The hrf of the OELLW distribution can be determined from Eqs (5) and (7) as
where . A straightforward calculation gives the results in Table 2.
Limit values of
0
Undef.
0
0
0
Undef.
The OELLW distribution is suitable for modeling reliability data since its hrf can take unimodal, bathtub, and bimodal shapes.
Definition 2.4.
Let be a continuous univariate distribution on .
The distribution has an upper light tail if, for any ,
The distribution has an upper heavy tail if, for any ,
We show below that (for ) the OELLW distribution has a transition from heavy-tailed to light-tailed.
Proposition 2.5.
The shape parameter governs the tail behavior of the OELLW distribution as follows:
If , then the OELLW distribution has an upper heavy-tail.
If , then the OELLW distribution has an upper light-tail.
For , the OELLW distribution does not have a guaranteed transition from heavy-tailed to light-tailed.
Proof.
Let be the cdf of . A simple algebraic manipulation shows that (for and )
This completes the proof. ∎
The OELL-Normal (OELLN) model
The cdf of the normal distribution (for ) is
where , is the mean, is the standard deviation, and is the standard normal cdf.
The OELLN density can be expressed from Eq. (2) as
If is distributed according to Eq. (11), it follows from Eq. (3) the representation
where denotes the standard normal qf.
The normal distribution is not suitable for fitting data that present asymmetry or bimodality. For this reason, many studies have aimed at creating new distributions that can model the asymmetry, kurtosis or bimodality of the data. Further, the OELLN model can be considered a good alternative to some extended normal distributions such as the skew-normal, beta-normal, Kumaraswamy-normal, and normal-gamma distributions, among others.
Some plots of the OELLN density are displayed in Fig. 2.
Plots of the OELLN density.
The OELL-Gumbel (OELLGu) model
The Gumbel cdf is (for ), where , , , and .
The OELLGu density can be determined from Eq. (2) as
If is distributed according to the OELLGu distribution (12), it follows from Eq. (3) that
The Gumbel distribution is one of the most widely used models for engineering problems. Because the OELLGu model is more flexible than the Gumbel, it can be used to model data from the areas of evapotranspiration, agroclimatology, bio-meteorology, operational agrometeorology, weather-disease relations, micro-meteorology and meteorology. It can also be applied to model extreme events.
Plots of the OELLGu density (12) are displayed in Fig. 3.
Plots of the OELLGu density.
Structural properties
We obtain some mathematical properties of the OELL-G family.
Linear representation
By using the power series and (for real), the denominator of Eq. (1) can be expressed as
where , , and (for ).
By using the last power series and a result from Hairer et al. (1993) (for ), we obtain
where the coefficients in the numerator are (for ).
The ratio of two power series (Apostol, 1974, p. 237) leads to
where , and the coefficients ’s (for ) come recursively from
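The recursion for the quotient of two power series can be sketched as follows. The names `a`, `b` and `c` are illustrative and are not the paper's notation: if the numerator has coefficients `a[k]` and the denominator has coefficients `b[k]` with `b[0]` nonzero, the quotient coefficients satisfy `c[0] = a[0] / b[0]` and `c[k] = (a[k] - sum_{r=1}^{k} b[r] * c[k - r]) / b[0]`, which is the standard identity obtained by matching coefficients.

```python
def series_quotient(a, b, n_terms):
    """First n_terms coefficients of (sum a_k u^k) / (sum b_k u^k), with b[0] != 0."""
    if b[0] == 0:
        raise ValueError("b[0] must be nonzero")

    def coef(seq, k):
        # Coefficients beyond the stored ones are treated as zero.
        return seq[k] if k < len(seq) else 0.0

    c = []
    for k in range(n_terms):
        acc = coef(a, k) - sum(coef(b, r) * c[k - r] for r in range(1, k + 1))
        c.append(acc / b[0])
    return c


# Sanity check on a known expansion: 1 / (1 - u) = 1 + u + u^2 + ...
geometric = series_quotient([1.0], [1.0, -1.0], 5)
```

In practice the infinite sums are truncated at a moderate number of terms, so `n_terms` plays the role of that truncation point.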
By differentiating (14), the pdf of can be expressed as
where and is the “exponentiated-G” (exp-G) density function with power parameter .
Equation (15) reveals that the OELL-G density function is a linear combination of exp-G densities. So, some structural properties of the new family can be determined from well-established exp-G properties reported in the papers listed by Tahir and Nadarajah (2015, Table 1). Equation (15) is the main result of this section.
Moments
Let be a random variable with density . The th ordinary moment of follows from Eq. (15) as
where .
In a similar manner, the th incomplete moment of can be expressed as
where both integrals can be calculated for most distributions.
The Bonferroni and Lorenz curves, and mean deviations of can be determined from the first incomplete moment .
The generating function (gf) of follows from Eq. (15) as
where is the gf of and .
Censored maximum likelihood estimation
Regressions in the form of location and scale can be adopted to verify the influence of covariates on the survival times for censored and uncensored data. Let be a lifetime with density (2), where . Suppose independent observations (for ), where the censoring time and are independent. The log-likelihood for has the form
where is the number of failures, and and are the sets of uncensored and censored observations, respectively.
We can employ the BFGS numerical procedure in R software to find the MLE of parameter .
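A minimal sketch of a censored log-likelihood is given below. It assumes the usual right-censoring form, in which failures contribute the log-density and censored observations contribute the log-survival function. The Weibull baseline with parameterization `1 - exp(-(t / scale) ** shape)` is used only as a stand-in, since the OELL-G density is not reproduced here; all function names and the toy data are hypothetical.

```python
import math


def weibull_logpdf(t, shape, scale):
    """Log-density of the Weibull with cdf 1 - exp(-(t / scale) ** shape)."""
    z = t / scale
    return math.log(shape / scale) + (shape - 1.0) * math.log(z) - z ** shape


def weibull_logsf(t, shape, scale):
    """Log-survival function of the same Weibull."""
    return -((t / scale) ** shape)


def censored_loglik(params, times, events, logpdf, logsf):
    """Sum log f(t) over failures (event = 1) and log S(t) over censored cases (event = 0)."""
    total = 0.0
    for t, d in zip(times, events):
        total += logpdf(t, *params) if d == 1 else logsf(t, *params)
    return total


# Toy data: three failures and two observations censored at t = 3.
times = [0.8, 1.5, 2.1, 3.0, 3.0]
events = [1, 1, 1, 0, 0]
value = censored_loglik((1.0, 3.0), times, events, weibull_logpdf, weibull_logsf)
```

The negative of `censored_loglik` is the objective that a quasi-Newton routine such as BFGS would minimize. Swapping in the log-density and log-survival function of another model only requires passing different `logpdf` and `logsf` callables.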
Regression modelling
In many situations, the failure time may depend on a vector xi of explanatory variables. The OELLW model can be extended to include covariates in different ways. The most common regression models have the form of location and scale; see, for example, Prataviera et al. (2019), Cancho et al. (2021), Braga et al. (2022) and Vasconcelos et al. (2022). Following these lines of research, we propose a regression model based on the OELLW distribution.
The random variable has the log-odd extended log-logistic-Weibull (LOELLW) distribution. By setting and , the density of (for ) can be expressed as
where and are shape parameters, is a location, and is a scale. Then,
Consequently, the generating function (gf) of is given by
where the last equality follows from the stochastic representation of Proposition 2.1. So, for each , by Proposition 2.3, an upper bound for the gf of can be found.
Since because of inequality (for ), and by Proposition 2.3, the upper bound holds
Some plots of the density (17) are reported in Fig. 4, which show the flexibility of the LOELLW density.
Plots of the LOELLW density.
The survival function of is
It is clear that (for )
So, the LOELLW distribution has upper light-tail (see Definition 2.4).
The density of becomes
We propose a location-scale regression for the response variable
where the stochastic error has density (18), and the systematic component for the location is
where is the vector of unknown parameters, and is the vector of explanatory variables.
Equations (19) and (20) define the LOELLW regression for censored data. The log odd log-logistic Weibull (LOLLW) regression pioneered by da Cruz et al. (2016) is a special model when . The log-Weibull (LW) regression is also a special model when . Thus, the LOELLW regression with censored data generalizes some known models in the literature.
Next, let be independent observations, where each random response is defined by . We assume noninformative censoring such that the observed lifetimes and censoring times are independent. Let and be the sets of individuals for which is the log-lifetime and log-censoring, respectively.
The log-likelihood function for from regression (19) has the form
where is the number of uncensored observations (failures) and . The MLE of can be found by maximizing this log-likelihood using Ox (MaxBFGS function), the R software, or the NLMixed procedure in SAS.
Two simulation studies
The first study investigates the accuracy of the MLEs of the parameters in the OELLW distribution. One-thousand replications of samples of size are generated from the OELLW distribution from Eq. (6), and the MLEs are determined for each replication using .
Table 3. Simulation results for censoring rate of 20%

| n | Scenario | AEs | Biases | MSEs |
|---|---|---|---|---|
| 20 | 1 | (0.203, 1.825, 1.871, 10.738) | (0.003, 0.075, 0.129, 0.738) | (0.002, 0.174, 0.089, 4.438) |
| 20 | 2 | (0.209, 0.950, 4.087, 11.949) | (0.009, 0.050, 0.087, 1.949) | (0.008, 0.073, 0.134, 14.750) |
| 20 | 3 | (0.1549, 0.8970, 43.9638, 4.6489) | (0.018, 0.016, 1.209, 0.335) | (0.003, 0.036, 37.413, 0.980) |
| 50 | 1 | (0.214, 1.896, 1.956, 9.912) | (0.014, 0.004, 0.044, 0.088) | (0.002, 0.084, 0.029, 1.033) |
| 50 | 2 | (0.214, 0.924, 4.058, 10.667) | (0.014, 0.024, 0.058, 0.667) | (0.002, 0.024, 0.050, 3.547) |
| 50 | 3 | (0.1546, 0.8964, 43.2291, 4.3358) | (0.018, 0.015, 0.474, 0.022) | (0.001, 0.012, 4.550, 0.123) |
| 100 | 1 | (0.216, 1.908, 1.978, 9.783) | (0.016, 0.008, 0.022, 0.217) | (0.001, 0.041, 0.011, 0.588) |
| 100 | 2 | (0.216, 0.910, 4.037, 10.244) | (0.016, 0.010, 0.037, 0.244) | (0.001, 0.010, 0.017, 1.035) |
| 100 | 3 | (0.1531, 0.8866, 43.0278, 4.2846) | (0.016, 0.006, 0.273, 0.029) | (0.001, 0.005, 0.892, 0.046) |
| 350 | 1 | (0.209, 1.920, 2.004, 9.929) | (0.009, 0.020, 0.004, 0.071) | (0.001, 0.011, 0.002, 0.034) |
| 350 | 2 | (0.218, 0.907, 4.036, 10.027) | (0.018, 0.007, 0.036, 0.027) | (0.001, 0.003, 0.004, 0.104) |
| 350 | 3 | (0.1498, 0.8800, 42.8132, 4.2888) | (0.013, 0.001, 0.058, 0.025) | (0.001, 0.001, 0.034, 0.008) |
Table 4. Simulation results for censoring rate of 30%

| n | Scenario | AEs | Biases | MSEs |
|---|---|---|---|---|
| 20 | 1 | (0.206, 1.811, 1.869, 10.771) | (0.006, 0.089, 0.131, 0.771) | (0.012, 0.337, 0.505, 4.218) |
| 20 | 2 | (0.250, 0.944, 4.081, 11.890) | (0.050, 0.044, 0.081, 1.890) | (1.499, 0.072, 0.133, 14.177) |
| 20 | 3 | (0.1552, 0.8959, 43.8617, 4.6309) | (0.018, 0.015, 1.107, 0.317) | (0.004, 0.036, 34.918, 0.948) |
| 50 | 1 | (0.213, 1.882, 1.946, 9.968) | (0.013, 0.018, 0.054, 0.032) | (0.002, 0.097, 0.044, 0.838) |
| 50 | 2 | (0.214, 0.922, 4.053, 10.645) | (0.014, 0.022, 0.053, 0.645) | (0.003, 0.024, 0.048, 3.407) |
| 50 | 3 | (0.1542, 0.8925, 43.1778, 4.3439) | (0.017, 0.012, 0.423, 0.030) | (0.001, 0.012, 5.869, 0.158) |
| 100 | 1 | (0.215, 1.917, 1.978, 9.796) | (0.015, 0.017, 0.022, 0.204) | (0.001, 0.053, 0.019, 0.312) |
| 100 | 2 | (0.216, 0.908, 4.039, 10.279) | (0.016, 0.008, 0.039, 0.279) | (0.001, 0.010, 0.018, 1.057) |
| 100 | 3 | (0.1524, 0.8868, 43.0560, 4.2866) | (0.015, 0.006, 0.301, 0.027) | (0.001, 0.005, 1.262, 0.046) |
| 350 | 1 | (0.210, 1.928, 2.005, 9.935) | (0.010, 0.028, 0.005, 0.065) | (0.001, 0.013, 0.003, 0.029) |
| 350 | 2 | (0.218, 0.907, 4.035, 10.010) | (0.018, 0.007, 0.035, 0.010) | (0.001, 0.003, 0.005, 0.099) |
| 350 | 3 | (0.1498, 0.8797, 42.8442, 4.2837) | (0.013, 0.001, 0.089, 0.303) | (0.001, 0.001, 0.079, 0.012) |
The scenarios for the data-generating processes are: Scenario 1: ; Scenario 2: ; and Scenario 3: , where the scenario 3 is related to the data from the first application.
The survival times under random censoring mechanism are generated as follows: (i) Compute the inverse function of ; (ii) Generate a uniform ; (iii) Apply in Eq. (6) to obtain variates , and generate ( is the censoring rate); (iv) Set , and define a -vector , whose components take one if () and zero otherwise.
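The generation scheme under random censoring can be sketched as below. This is an illustration under stated assumptions, not the exact scheme of the study: the Weibull quantile function stands in for the OELLW quantile function, the censoring times are drawn from a uniform distribution on `(0, censor_upper)`, and `censor_upper` is a hypothetical tuning constant that would be adjusted until the desired censoring rate is reached.

```python
import math
import random


def weibull_quantile(u, shape, scale):
    """Inverse of the Weibull cdf 1 - exp(-(x / scale) ** shape)."""
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def simulate_censored(n, shape, scale, censor_upper, rng):
    """Return observed times and failure indicators under random censoring."""
    times, events = [], []
    for _ in range(n):
        x = weibull_quantile(rng.random(), shape, scale)  # latent lifetime
        c = rng.uniform(0.0, censor_upper)                # assumed censoring law
        times.append(min(x, c))                           # observed time
        events.append(1 if x <= c else 0)                 # 1 = failure, 0 = censored
    return times, events


rng = random.Random(2024)
times, events = simulate_censored(200, shape=1.5, scale=2.0, censor_upper=8.0, rng=rng)
censoring_rate = 1.0 - sum(events) / len(events)
```

Increasing `censor_upper` lowers the expected censoring rate, so a short calibration loop over this constant is one simple way to target a rate such as 20% or 30%.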
The average estimates (AEs), biases and mean squared errors (MSEs) for 20, 50, 100, and 350 are reported in Tables 3 and 4 under each scenario with censoring rates 20% and 30%, respectively. The biases and MSEs tend to zero when increases, although there are small biases. Further, the MSEs also tend to increase when the percentage of censored data increases.
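For reference, the three summary measures can be computed from the replicated estimates as in the sketch below. It assumes the usual definitions (AE as the mean of the estimates, bias as AE minus the true value, MSE as the mean squared deviation from the true value); the function name and the toy numbers are hypothetical.

```python
def monte_carlo_summary(estimates, true_value):
    """Average estimate, bias and mean squared error over the replications."""
    n_rep = len(estimates)
    ae = sum(estimates) / n_rep
    bias = ae - true_value
    mse = sum((e - true_value) ** 2 for e in estimates) / n_rep
    return ae, bias, mse


# Toy example with three replications of an estimator of a parameter equal to 2.
ae, bias, mse = monte_carlo_summary([1.9, 2.1, 2.3], true_value=2.0)
```

With these definitions the MSE equals the variance of the estimates plus the squared bias, which is why a small residual bias can coexist with an MSE that shrinks as the sample size grows.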
Simulation study for the LOELLW regression
A second study with one-thousand replications is done with software to verify the precision of the estimates. The log-lifetimes are obtained from the LOELLW regression (19) for censoring percentages approximately equal to 0%, 20% and 30% for 50, 100, 350, 500.
The generation scheme is as follows: (i) Set ( is the censoring rate), and determine from Eq. (18); (ii) Obtain , where ; and let ; (iii) Define a -vector with one if () and zero otherwise.
The findings in Tables 5 and 6 reveal that the MSEs converge to zero, and the AEs tend to be closer to the true parameters when increases. These results agree with the consistency of the MLEs.
We show empirically the flexibility of the new models by means of two real data sets.
Golden shiner data (censored data)
We consider the survival times of golden shiner fish (Notemigonus crysoleucas) obtained from Fachini et al. (2008). The response variable is the survival time (in years) of the fish, the sample size , and the censoring rate is 14%.
Table 7 reports the MLEs (their standard errors in parentheses) of the fitted OELLW, OLLW and Weibull distributions to these data using the NLMixed procedure in SAS software. Iterative maximization of Eq. (16) started with the baseline Weibull, and .
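Table 7. Findings from the fitted models to golden shiner data (MLEs with standard errors in parentheses, listed in the column order of the original table; a dash marks a parameter fixed at 1)

| Model | MLEs (standard errors) | AIC | CAIC | BIC |
|---|---|---|---|---|
| OELLW | 3.6742 (1.1895); 0.1070 (0.0475); 0.0961 (0.0696); 0.3081 (0.0611) | 466.2 | 466.6 | 476.8 |
| OLLW | 4.8310 (2.0987); 1 (–); 53.8201 (13.5917); 0.1118 (0.0476) | 477.1 | 477.3 | 485.1 |
| Weibull | 1 (–); 1 (–); 5.5654 (1.1483); 0.5146 (0.0438) | 481.6 | 481.7 | 486.9 |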
The Akaike Information Criterion (AIC), Consistent Akaike Information Criterion (CAIC), and Bayesian Information Criterion (BIC) values for the fitted models are given in Table 7, and they show that the OELLW distribution gives the best fit to the current data among the three models. The likelihood ratio (LR) statistics (with their p-values) for some model comparisons are listed in Table 8, and they indicate that the OELLW model yields a better fit to these data than its sub-models.
Table 8. LR statistics for golden shiner data

| Model | Hypotheses | Statistic | p-value |
|---|---|---|---|
| OLLW vs OELLW | is false | 18.30 | 0.0001 |
| Weibull vs OELLW | is false | 24.80 | 0.0001 |
The plots of the estimated OELLW, OLLW and Weibull survival functions, and the empirical survival function are displayed in Fig. 5. These plots support the previous findings.
Estimated and empirical survival functions for golden shiner data.
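For censored data, the empirical survival function is commonly computed with the Kaplan-Meier estimator, which the caption of Fig. 6 names explicitly; assuming the same estimator is used here, a minimal sketch follows. The function name and the toy data are hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (t, S(t)) at each distinct failure time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed  # failures and censored cases both leave the risk set
    return curve


# Toy data: event = 1 is a failure, event = 0 is a censored observation.
km = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Plotting this step function together with a fitted parametric survival curve gives the kind of visual comparison shown in the figure.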
Regression for chemical dependency data
The most consumed illicit substance in Brazil is marijuana: 7.7% of Brazilians aged between 12 and 65 years report having used it at least once in their lifetime. Cocaine (in powdered form, as opposed to crack) is in second place, with 3.1%. Much has been said about chemical dependence, but little accurate information is available. It is defined as the uncontrollable urge to use a substance compulsively, which can cause mental and physical disturbances. Consequently, the person's behavior changes, which can negatively alter their routine and that of the people close to them. In the long run, chemical dependence can favor the development of various sequelae, and in extreme cases can cause death. Isolation from family members, close friends and even society in general is a harmful consequence of the use of illegal drugs. People who are chemically dependent form a distorted picture of reality and also suffer prejudice from others, who become intolerant of them. However, this condition, classified as a disease by many, can be cured. Motivated by this reality, we illustrate the application of the LOELLW regression to a dataset provided by the Associação Mãe Admirável, located in Caratinga (State of Minas Gerais, Brazil). More specifically, Caratinga belongs to the Doce River Valley in the metropolitan ring of the "Steel Valley" mining district, about 310 km east of the state capital, Belo Horizonte. It covers an area of 1,258.479 km², of which 15.9 km² is classified as urban, and its population in 2020 was 92,603 inhabitants. More details about the city of Caratinga can be found on the website: https://caratinga.mg.gov.br.
The data come from a survey conducted among 141 chemically dependent residents in the period from 2000 to 2005. The response variable is the rehab time spent in the Association's community until the end of treatment, with the maximum period being 270 days, during which there is no contact with illegal drugs. Those who complete this regimen are classified as censored observations. Studies of the risk factors associated with relapse of chemical dependence are very important, because their results can increase the probability of starting new clinical interventions in a timely fashion. In addition, risk studies help identify protective factors that reduce the vulnerability to relapse. The variables are:
: time spent in the community until the end of treatment (days);
: marital status (0 = single, 1 = married or separated/divorced);
: schooling level (0 = none or incomplete "fundamental school" (through 8th grade), 1 = complete fundamental schooling, irrespective of high school and/or college);
: age (0 30 years, 1 30 years), for .
We present results on fitting the regression ()
where the response variable has the LOELLW density Eq. (17).
Table 9 gives the findings for some fitted regressions to these data. The fitted LOELLW regression indicates that and are significant at 5%, so there is a significant difference between the levels of schooling and between the levels of age for the time spent in the community until the end of treatment. The adequacy measures for regression comparisons are reported in Table 10, which indicate that the LOELLW and LOLLW regressions outperform the LW model irrespective of the criterion. So, the proposed regression can be adopted to explain the chemical dependency data.
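Table 9. Results for some fitted regressions to chemical dependency data (standard errors in parentheses, p-values in brackets; a dash marks a parameter fixed at 1)

| Model | Quantity | Shape 1 | Shape 2 | Scale | Intercept | Marital status | Schooling | Age |
|---|---|---|---|---|---|---|---|---|
| LOELLW | Estimate | 5.7065 | 0.3092 | 5.3408 | 3.9803 | 0.6178 | 0.7752 | 0.7114 |
| | (SE) | (0.6012) | (0.0305) | (0.4538) | (0.2705) | (0.3496) | (0.3442) | (0.3397) |
| | [p-value] | | | | [<0.0001] | [0.0795] | [0.0259] | [0.0381] |
| LOLLW | Estimate | 16.288 | 1 | 21.1424 | 11.9612 | 0.4084 | 0.8811 | 0.7684 |
| | (SE) | (1.562) | – | (0.4013) | (0.2781) | (0.3816) | (0.3492) | (0.3589) |
| | [p-value] | | | | [<0.0001] | [0.2864] | [0.0128] | [0.0340] |
| LW | Estimate | 1 | 1 | 1.2957 | 4.9617 | 0.1845 | 0.8021 | 0.6680 |
| | (SE) | – | – | (0.1269) | (0.3203) | (0.4432) | (0.3655) | (0.4079) |
| | [p-value] | | | | [<0.0001] | [0.6779] | [0.0295] | [0.1037] |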
Table 10. Adequacy measures for chemical dependency data

| Model | AIC | CAIC | BIC |
|---|---|---|---|
| LOELLW | 462.5083 | 463.8823 | 483.1496 |
| LOLLW | 463.6923 | 464.7832 | 481.3849 |
| LW | 471.7792 | 472.6201 | 486.5218 |
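The link between the AIC and BIC values of the adequacy measures can be checked with a short sketch. It uses the standard definitions AIC = -2 log L + 2k and BIC = -2 log L + k log n, with n = 141 residents as stated in the data description. The parameter counts k = 7, 6 and 5 are an assumption, inferred from the number of freely estimated coefficients reported for each regression.

```python
import math


def aic(loglik, k):
    """Akaike Information Criterion for maximized log-likelihood loglik and k parameters."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian Information Criterion for sample size n."""
    return -2.0 * loglik + k * math.log(n)


def bic_from_aic(aic_value, k, n):
    """Convert a reported AIC into BIC through the shared -2 log-likelihood term."""
    return aic_value - 2.0 * k + k * math.log(n)


n = 141  # number of residents in the survey
# (reported AIC, assumed number of free parameters) for each regression
reported = {"LOELLW": (462.5083, 7), "LOLLW": (463.6923, 6), "LW": (471.7792, 5)}
bic_values = {model: bic_from_aic(a, k, n) for model, (a, k) in reported.items()}
```

Under these assumed parameter counts, the recomputed values agree with the reported BIC column to about two decimal places.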
The LOELLW regression is compared with two special models in Table 11, and the LR values indicate that the LOELLW and LOLLW regressions are very competitive for modeling these data.
Table 11. LR statistics for chemical dependency data

| Model | Hypotheses | LR statistic | p-value |
|---|---|---|---|
| LOLLW vs LOELLW | is false | 3.184 | 0.0743 |
| LW vs LOELLW | is false | 13.2709 | 0.0013 |
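The LR statistics and p-values for the chemical dependency data can be recovered from the reported AIC values, as the sketch below illustrates. It assumes AIC = -2 log L + 2k with k = 7, 6 and 5 free parameters for the LOELLW, LOLLW and LW regressions (an inference from the reported estimates), and chi-square reference distributions with 1 and 2 degrees of freedom. The closed-form tail probabilities used here hold only for those two cases.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or df = 2 is supported in this sketch")


def lr_statistic(aic_null, k_null, aic_full, k_full):
    """LR = (-2 log L_null) - (-2 log L_full), using AIC = -2 log L + 2k."""
    return (aic_null - 2.0 * k_null) - (aic_full - 2.0 * k_full)


# Reported AIC values; the parameter counts are assumptions (see lead-in).
lr_lollw = lr_statistic(463.6923, 6, 462.5083, 7)  # LOLLW vs LOELLW, 1 restriction
lr_lw = lr_statistic(471.7792, 5, 462.5083, 7)     # LW vs LOELLW, 2 restrictions
p_lollw = chi2_sf(lr_lollw, df=1)
p_lw = chi2_sf(lr_lw, df=2)
```

The first comparison does not reject the LOLLW sub-model at the 5% level, while the second rejects the LW sub-model, which matches the reading that the LOELLW and LOLLW regressions are competitive.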
The empirical survival function and the estimated survival functions displayed in Fig. 6 support the previous findings.
Kaplan-Meier curves and estimated LOELLW survival functions stratified by explanatory variable for chemical dependency data: (a) The marital status. (b) Schooling. (c) Age.
From Table 9 and Fig. 6, we conclude the following points:
There is no significant difference at a significance level of 5% between being married or single in terms of time spent in the community until the end of treatment.
Regarding the level of education, there is a significant difference: people with a higher level of education spend more time in the community until the end of treatment.
Regarding age, people aged under 30 years spend more time in the community until the end of treatment.
Conclusions
The development of new classes of distributions has been an active research area with many applications in the last two decades. We provided some mathematical properties of the new odd extended log-logistic-G family. Some simulations supported the consistency of the maximum likelihood estimates. We constructed a regression associated with the logarithm of a special distribution in the new family. Two applications to real data illustrated the potential of the proposed models.
Disclosure statement
There are no conflicts of interest to disclose.
Funding
This work was supported by the CNPq and CAPES, Brazil.
Braga, A. S., Cordeiro, G. M., Ortega, E. M. M., Silva, G. O., & Vasconcelos, J. C. S. (2022). A random-effects regression model based on the odd log-logistic skew normal distribution. Journal of Statistical Theory and Practice, 33, 1-24.
Cancho, V. G., Barriga, G. D. C., Cordeiro, G. M., Ortega, E. M. M., & Suzuki, A. K. (2021). Bayesian survival model induced by frailty for lifetime with long-term survivors. Statistica Neerlandica, 75, 299-323.
da Cruz, J. N., Ortega, E. M. M., & Cordeiro, G. M. (2016). The log-odd log-logistic Weibull regression model: Modelling, estimation, influence diagnostics and residual analysis. Journal of Statistical Computation and Simulation, 86, 1516-1538.
Fachini, J. B., Ortega, E. M. M., & Louzada-Neto, F. (2008). Influence diagnostics for polyhazard models in the presence of covariates. Statistical Methods and Applications, 17, 413-433.
Gleaton, J. U., & Lynch, J. D. (2006). On the distribution of the breaking strain of a bundle of brittle elastic fibers. Advances in Applied Probability, 36, 98-115.
Prataviera, F., Vasconcelos, J. C. S., Cordeiro, G. M., Hashimoto, E. M., & Ortega, E. M. M. (2019). The exponentiated power exponential regression model with different regression structures: Application in nursing data. Journal of Applied Statistics, 46, 1792-1821.
Tahir, M. H., & Nadarajah, S. (2015). Parameter induction in continuous univariate distributions: Well-established G families. Anais Academia Brasileira de Ciências, 87, 539-568.
Vasconcelos, J. C. S., Cordeiro, G. M., & Ortega, E. M. M. (2022). The semiparametric regression model for bimodal data with different penalized smoothers applied to climatology, ethanol and air quality data. Journal of Applied Statistics, 49, 248-267.