Abstract
A new parametric regression model is developed based on the gamma-Maxwell distribution. Monte Carlo simulations show the accuracy of the maximum likelihood estimators. The proposed model explains COVID-19 mortality rates of the 50 largest U.S. cities.
Introduction
In late 2019, COVID-19, the disease caused by the SARS-CoV-2 virus, began to spread and became a pandemic. A wide range of symptoms was reported, including difficulty breathing, fever, loss of taste or smell, and sore throat, among others.1
After the outbreak, numerous statistical analyses were applied to pandemic data in the U.S. and other countries. A new quantile regression was constructed by Gallardo et al. (2022) for coronavirus mortality rates in the U.S. states. The article considered social, demographic and health variables that affected the first-wave mortality rate. Karmakar et al. (2021) explained COVID-19 incidence and mortality rates in the U.S. in terms of county-level social and demographic variables. Moreover, Biazatti et al. (2022) presented the applicability of the Weibull beta-prime distribution to three real COVID-19 data sets. In Brazil, Sousa et al. (2020) identified the risk factors associated with mortality and survival of COVID-19 cases in the Northeast region. They concluded that mortality was similar to that of countries with large numbers of cases, and that elderly people and those with comorbidities were at greater risk of death. In a broader study, Duhon et al. (2021) estimated the initial growth rate of COVID-19 worldwide. They studied the association between the initial growth rate and non-pharmaceutical interventions as well as pre-existing country characteristics. Other related works can be found in Janke et al. (2021) and Stokes et al. (2021).
In this context, our main goal is to construct a new regression model based on the gamma-Maxwell (GM) distribution (Iriarte et al., 2017), which has advantages over other competitive models. We study the coronavirus mortality rates of the 50 largest U.S. cities to show the flexibility of the new regression. Presently, the USA has the highest number of COVID-19 deaths worldwide, more than one million.2 In addition to the country's geographical size and its high rates of obesity and cardiac disease, the lack of a universal health care system hindered the American response to the pandemic. In the regression application, we investigate the influence of explanatory variables on the mortality rates from COVID-19. The paper aims to contribute to the literature with a new regression model and to examine factors associated with COVID-19 mortality in the USA. The behavior of the data set is illustrated in Fig. 1.
Histogram and empirical density of COVID-19 mortality rates of the 50 largest U.S. cities.
Regression models are necessary when there are explanatory variables associated with the response variable. We present some global influence measures to support the choice of the best-fitting regression model for the data set. Likewise, we carry out a residual analysis based on deviance residuals together with simulated envelopes.
The article is structured as follows. Section 2 reviews the GM distribution (Iriarte et al., 2017). Section 3 provides new mathematical properties. Section 4 determines the maximum likelihood estimates (MLEs), and provides a simulation study to verify their accuracy. Section 5 develops the GM regression model, and examines the consistency of the estimators. Section 6 defines deviance residuals for the fitted regression model, and investigates some diagnostic measures. Section 7 demonstrates the advantage of the new regression model compared to other competitive models, and selects the best model for explaining the COVID-19 mortality rates of the 50 largest U.S. cities. The residuals fall randomly inside the simulated envelope, which is expected for a good fit. The application also yields some valuable findings. Finally, Section 8 ends with some remarks and suggests guidelines for future research.
The cumulative distribution function (cdf)
where
The probability density function (pdf) of
A random variable
and
respectively.
For
GM distribution. (a) Pdf. (b) Cdf. (c) Hazard rate. (d) Survival function.
Due to the impossibility of obtaining closed-form mathematical properties of the GM distribution, we find a linear representation for its density function.
The exponentiated-Maxwell random variable
where
is the density of
Proof. The proof is straightforward. The pdf of
Hence, Eq. (5) is very useful to find structural properties of
The
where
and the quantities
Proof. The
The incomplete gamma function ratio admits the power series (Prataviera et al., 2020)
where
Likewise, from Prataviera et al. (2020), the power series holds
where
Applying Eq. 0.314 from Gradshteyn and Ryzhik (2014) leads to
where
and
From Eq. (10),
Combining Eqs (11) and (8) leads to
By solving the integral, Eq. (7) becomes
where
Finally,
∎
The generating function (gf) of
where
and all other quantities are defined below.
Proof. The gf
Inserting Eqs (11) and (8) in the last expression
The above integral can be written, based on Eq. (2.3.15.7) in Prudnikov and Marichev (1986), as
where
Hence,
where
and
Finally the gf of
∎
The log-likelihood function for
The MLEs can be found by solving the equations given in Iriarte et al. (2017), or
We study the accuracy of these estimators by generating
Simulations from schemes 1 and 2
Table 1 reports the average estimates (AEs), biases and mean square errors (MSEs) for each scheme. The AEs converge to the true parameters, and the biases and MSEs tend to zero when
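The quantities reported in such a simulation study (AEs, biases and MSEs) can be sketched as follows. This is only an illustration under a stated assumption: the plain one-parameter Maxwell distribution is used as a stand-in because its MLE has a closed form, whereas the GM study would require numerical maximization of the GM log-likelihood. The function names, sample sizes and number of replicates are hypothetical and are not the schemes used in the paper.

```python
import math
import random


def sample_maxwell(n, sigma, rng):
    """Draw n Maxwell(sigma) variates as the norm of three N(0, sigma^2) draws."""
    return [
        math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(3)))
        for _ in range(n)
    ]


def maxwell_mle(sample):
    """Closed-form MLE of the Maxwell scale: sigma_hat^2 = sum(x^2) / (3n)."""
    return math.sqrt(sum(x * x for x in sample) / (3.0 * len(sample)))


def monte_carlo_summary(n, sigma, replicates, seed=1):
    """Return (average estimate, bias, MSE) over the Monte Carlo replicates."""
    rng = random.Random(seed)
    estimates = [
        maxwell_mle(sample_maxwell(n, sigma, rng)) for _ in range(replicates)
    ]
    ae = sum(estimates) / replicates
    bias = ae - sigma
    mse = sum((e - sigma) ** 2 for e in estimates) / replicates
    return ae, bias, mse


if __name__ == "__main__":
    # Stand-in settings (assumed for illustration only).
    for n in (25, 100, 400):
        ae, bias, mse = monte_carlo_summary(n, sigma=2.0, replicates=500)
        print(f"n={n:4d}  AE={ae:.4f}  bias={bias:+.4f}  MSE={mse:.5f}")
```

Running the loop for increasing sample sizes gives the kind of table described above, with the MSE shrinking as the sample size grows.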
Definition
The systematic component of the GM regression model considers that the parameter
where
Consider a sample
The parameter vector
The accuracy of the MLEs in the GM regression model is evaluated based on the measures (Cordeiro et al., 2017): bias, mean square error (MSE), estimated average length (AL), and coverage probability (CP). We generate
Plots of the measure values for the parameters. (a–c) Biases. (d–f) MSEs. (g–i) ALs.
Plots of the CPs for the parameters.
Figures 3 and 4 show these measure values versus the sample size. The biases, MSEs and ALs decrease to zero when
We present diagnostic measures (Cook & Weisberg, 1982) and residual analysis (Venables & Ripley, 2013) to investigate whether the model provides a good representation of the data and whether the sample contains outliers or influential observations.
We employ some measures based on case deletion in the systematic component Eq. (16) to find influential observations in the regression model
Here, the changes in the parameter estimates coming from exclusion of the
By comparing the difference between
The first global influence measure is the Generalized Cook distance (GCD), namely
where
The second influence measure is the likelihood distance
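A minimal sketch of case-deletion diagnostics is given here for illustration. It assumes the usual forms of the two measures: the generalized Cook distance as the squared change in the estimate weighted by the observed information, and the likelihood distance as twice the drop in the full-sample log-likelihood when the case-deleted estimate is plugged in. The one-parameter Maxwell model is a stand-in chosen because its MLE and observed information are available in closed form; it is not the GM regression, and all function names are hypothetical.

```python
import math


def maxwell_loglik(sample, sigma):
    """Log-likelihood of a Maxwell(sigma) sample."""
    const = 0.5 * math.log(2.0 / math.pi)
    return sum(
        const + 2.0 * math.log(x) - 3.0 * math.log(sigma)
        - x * x / (2.0 * sigma ** 2)
        for x in sample
    )


def maxwell_mle(sample):
    """Closed-form MLE: sigma_hat^2 = sum(x^2) / (3n)."""
    return math.sqrt(sum(x * x for x in sample) / (3.0 * len(sample)))


def case_deletion_measures(sample):
    """Return a list of (GCD_i, LD_i) pairs, one per deleted observation."""
    n = len(sample)
    sigma_hat = maxwell_mle(sample)
    loglik_hat = maxwell_loglik(sample, sigma_hat)
    # Observed information at the MLE for the Maxwell scale: 6n / sigma_hat^2.
    info = 6.0 * n / sigma_hat ** 2
    results = []
    for i in range(n):
        reduced = sample[:i] + sample[i + 1:]
        sigma_i = maxwell_mle(reduced)  # estimate without observation i
        gcd = (sigma_i - sigma_hat) ** 2 * info
        ld = 2.0 * (loglik_hat - maxwell_loglik(sample, sigma_i))
        results.append((gcd, ld))
    return results


if __name__ == "__main__":
    data = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 6.0]  # last value is atypical
    for i, (gcd, ld) in enumerate(case_deletion_measures(data)):
        print(f"i={i}  GCD={gcd:.4f}  LD={ld:.4f}")
```

Observations whose deletion produces large values of either measure are candidates for closer inspection.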
Moreover, the analysis of residuals is an efficient way to check the model adequacy and to detect departures from the assumed response distribution. The deviance residuals for the GM regression are defined by
where
are the martingale residuals,
The normal probability plot serves to assess the normality assumption of the residuals. The empirical distribution of the deviance residuals approaches the standard normal distribution as the sample size increases.
We construct envelopes to interpret the normal probability plot of the deviance residuals by means of simulated confidence bands (Atkinson, 1985). When most of the points lie randomly distributed within these bands, the regression model can be considered well-fitted to the data.
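As an illustration, the sketch below computes martingale and deviance residuals from fitted survival probabilities. It assumes the common uncensored form, with martingale residual 1 + log S(y) and deviance residual sign(r_M) * sqrt(-2 [r_M + log(1 - r_M)]); the survival probabilities would come from the fitted model and are passed in directly here, so the values used in the example are hypothetical.

```python
import math


def martingale_residual(survival_prob):
    """r_M = 1 + log S(y) for an uncensored observation, with 0 < S(y) < 1."""
    return 1.0 + math.log(survival_prob)


def deviance_residual(survival_prob):
    """Deviance residual obtained from the martingale residual."""
    r_m = martingale_residual(survival_prob)
    sign = (r_m > 0) - (r_m < 0)
    # log1p and the max(...) guard protect against rounding when r_m is near 0.
    inner = max(0.0, -2.0 * (r_m + math.log1p(-r_m)))
    return sign * math.sqrt(inner)


if __name__ == "__main__":
    # Hypothetical fitted survival probabilities, for illustration only.
    for s in (0.9, 0.5, 0.1):
        print(f"S={s:.1f}  r_M={martingale_residual(s):+.4f}  "
              f"r_D={deviance_residual(s):+.4f}")
```

The transformation makes the skewed martingale residuals more symmetric around zero, which is why the deviance version is the one compared with normal quantiles.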
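The envelope construction can be sketched along the following lines: fit the model, simulate samples from the fitted model, refit each one, and record the pointwise minimum and maximum of the ordered residuals. To keep the example self-contained, an exponential model with closed-form MLE is used as a stand-in for the GM regression; the number of simulated samples (99) and the function names are assumptions made for illustration.

```python
import math
import random


def sorted_deviance_residuals(sample):
    """Ordered deviance residuals under a fitted exponential model (stand-in).

    Requires strictly positive observations.
    """
    mu_hat = sum(sample) / len(sample)
    residuals = []
    for y in sample:
        r_m = 1.0 - y / mu_hat  # 1 + log S(y), with S(y) = exp(-y / mu_hat)
        sign = (r_m > 0) - (r_m < 0)
        inner = max(0.0, -2.0 * (r_m + math.log1p(-r_m)))
        residuals.append(sign * math.sqrt(inner))
    return sorted(residuals)


def simulated_envelope(sample, n_sim=99, seed=1):
    """Return (observed, lower, upper) ordered residuals and envelope bounds."""
    rng = random.Random(seed)
    n = len(sample)
    mu_hat = sum(sample) / n
    observed = sorted_deviance_residuals(sample)
    sims = []
    for _ in range(n_sim):
        # Simulate from the fitted model, then refit and recompute residuals.
        fake = [rng.expovariate(1.0 / mu_hat) for _ in range(n)]
        sims.append(sorted_deviance_residuals(fake))
    lower = [min(s[k] for s in sims) for k in range(n)]
    upper = [max(s[k] for s in sims) for k in range(n)]
    return observed, lower, upper


if __name__ == "__main__":
    rng = random.Random(7)
    data = [rng.expovariate(0.5) for _ in range(50)]
    observed, lower, upper = simulated_envelope(data)
    inside = sum(lo <= r <= hi for r, lo, hi in zip(observed, lower, upper))
    print(f"{inside} of {len(observed)} ordered residuals lie inside the envelope")
```

In practice the three sequences are plotted against normal quantiles, and points falling outside the band flag a possible lack of fit.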
To show the utility of the new regression model compared to other competitive models, we provide an application to COVID-19 mortality rates in the USA. These rates are calculated for the 50 biggest cities from data of the Centers for Disease Control and Prevention (CDC) (CDC, 2022).
The dependent variable and explanatory variables obtained from County Health Rankings Model (2022a), U.S. Department of Agriculture (2022) and County Health Rankings Model (2022b) are described below:
MR: Mortality rate (dependent variable).
DB: Diabetes prevalence (percentage of adults with diabetes aged 20) (data from 2019).
UN: Unemployment rate (percentage of average annual unemployment) (data from 2021).
LE: Life expectancy (data from 2018–2020).
Summary of statistics
Table 2 provides some descriptive statistics from this dataset.
We compare the GM model with other alternative generators by taking the Maxwell and Weibull baselines: Maxwell, alpha-power-Maxwell (Erdogan et al., 2021), transmuted-Maxwell (Iriarte & Astorga, 2014), odd log-logistic Maxwell (Prataviera et al., 2020), Weibull, beta-Maxwell (Amusan, 2010) and Maxwell-Weibull (Ishaq & Abiodun, 2020).
The alpha-power-Maxwell (APM) and odd log-logistic Maxwell (OLLM) densities are (for
and
respectively, where all parameters are positive.
The transmuted-Maxwell (TM) density is (for
for
Finally, the densities of the beta-Maxwell (BM) and Maxwell-Weibull (MW) are (for
and
respectively, and all parameters are positive.
The MLEs for all models are calculated using the goodness.fit function from the AdequacyModel package (Marinho et al., 2019), available in the R software (R Core Team, 2013), with the BFGS method. The best fitted model is chosen based on well-known measures (some with only initials): AIC, CAIC, BIC, HQIC, Cramér-von Mises (
Findings from the fitted models to COVID-19 mortality rates
The results for all statistics are reported in Table 3 with the MLEs and standard errors (SEs) in parentheses. They reveal that the GM distribution is the best fitted model. Further, plots of the estimated pdfs and cdfs for the three best models in Fig. 5 confirm that the GM distribution is the most suitable model for the current COVID-19 mortality rate data.
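For reference, the likelihood-based criteria among the selection measures can be computed from the maximized log-likelihood as in the sketch below, using the standard definitions of AIC, BIC and HQIC. CAIC is left out because more than one definition circulates and the source does not state which one is used. The log-likelihood values in the example are hypothetical and are not the fitted values of Table 3.

```python
import math


def information_criteria(loglik, n_params, n_obs):
    """Standard AIC, BIC and HQIC from a maximized log-likelihood."""
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * math.log(n_obs)
    hqic = -2.0 * loglik + 2.0 * n_params * math.log(math.log(n_obs))
    return {"AIC": aic, "BIC": bic, "HQIC": hqic}


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods for two candidate models, n = 50.
    candidates = {"model A (2 parameters)": (-100.0, 2),
                  "model B (3 parameters)": (-99.5, 3)}
    for name, (loglik, k) in candidates.items():
        crit = information_criteria(loglik, k, n_obs=50)
        print(name, {key: round(val, 3) for key, val in crit.items()})
```

Smaller values indicate a better trade-off between fit and number of parameters, so an extra parameter is only rewarded when it raises the log-likelihood enough to offset its penalty.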
Fitted GM regression to COVID-19 in the 50 biggest cities in the USA
(a) Histogram and estimated pdfs. (b) Empirical and estimated cdfs.
The LR statistic for testing
Profile log-likelihood functions from the fitted GM regression model to COVID-19 data. Parameters: (a) 
(a) GCD for the fitted GM regression model. (b) LD for the fitted GM regression model.
Next, we consider the systematic component (for
Table 4 lists the results from the fit of the previous GM regression model. For a significance level of 5%, all explanatory variables are significant.
Profile log-likelihood plots versus some parameter values (with all other parameter estimates fixed) are reported in Fig. 6. These plots can provide confidence intervals for the parameters.
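One way a log-likelihood curve yields a confidence interval is to keep every parameter value whose log-likelihood lies within half the chi-square 0.95 quantile (1 degree of freedom) of the maximum. The sketch below illustrates this on a one-parameter Maxwell model used as a stand-in, so the profile coincides with the full log-likelihood; the grid limits, grid size and data are assumptions made for illustration and do not come from the paper.

```python
import math

CHI2_95_1DF = 3.841459  # 0.95 quantile of the chi-square distribution, 1 df


def maxwell_loglik(sample, sigma):
    """Log-likelihood of a Maxwell(sigma) sample."""
    const = 0.5 * math.log(2.0 / math.pi)
    return sum(
        const + 2.0 * math.log(x) - 3.0 * math.log(sigma)
        - x * x / (2.0 * sigma ** 2)
        for x in sample
    )


def likelihood_interval(sample, grid_size=2000):
    """Return (sigma_hat, lower, upper) for an approximate 95% interval."""
    sigma_hat = math.sqrt(sum(x * x for x in sample) / (3.0 * len(sample)))
    cutoff = maxwell_loglik(sample, sigma_hat) - CHI2_95_1DF / 2.0
    # Grid from 0.3 * sigma_hat to 3 * sigma_hat (assumed wide enough here).
    grid = [sigma_hat * (0.3 + 2.7 * k / grid_size) for k in range(grid_size + 1)]
    inside = [s for s in grid if maxwell_loglik(sample, s) >= cutoff]
    return sigma_hat, min(inside), max(inside)


if __name__ == "__main__":
    data = [1.0, 1.2, 0.9, 1.1, 1.3, 0.8, 1.0, 1.4]  # hypothetical sample
    sigma_hat, lower, upper = likelihood_interval(data)
    print(f"sigma_hat={sigma_hat:.4f}  interval=({lower:.4f}, {upper:.4f})")
```

The interval is read off as the range of grid values above the cutoff, which mirrors drawing a horizontal line on the log-likelihood plot; it need not be symmetric around the estimate.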
Normal probability plot of 
The GCD and the LD measures identify possible influential observations; nevertheless, these do not affect the model, as illustrated in Fig. 7. The plot of the deviance residual (
Finally, from the parameter estimates reported in Table 4, the GM regression model becomes
Some conclusions follow from the above equation:
The DB prevalence is extremely significant (
The UN rate gives a
The LE is significant at the 1% level, and its estimate is positive, which indicates that the MR is slightly higher in cities with high life expectancy. Similarly, Notari and Torrieri (2022) showed that, in the European region, life expectancy is a factor that increases COVID-19 mortality. Further, Wang et al. (2020) also found a positive correlation between life expectancy and the initial transmission growth rate of COVID-19.
We proposed a new gamma-Maxwell regression model, and obtained some new mathematical properties of the response distribution. The model parameters were estimated by the maximum likelihood method, and the consistency of the estimators was assessed through Monte Carlo simulations. Diagnostic analysis and deviance residuals were provided. The application to COVID-19 mortality rates in the fifty largest U.S. cities showed that the new model can be more flexible than some competing models.
Footnotes
