Abstract
This paper attempts to fit the best survival model distribution to the Malaysian COVID-19 new infections experience of Wave I/II and Wave III using well-known Survival Data Analysis (SDA) procedures. The purpose of fitting such models is to reduce the complexity and frequency of the COVID-19 new infections data into scale and shape parameters that enable monitoring of weekly trends, short-term forecasting and estimation of the duration until the virality is contained. The analysis showed that a Weibull distribution is the best statistical fit for Malaysia’s COVID-19 new infections data. The estimates of the scale and shape parameters were 0.05901 and 2.48956 for Wave I/II and 0.06463 and 2.5693 for Wave III, respectively. The much higher hazard force in Wave III is attributed to weaker enforcement of the cordon sanitaire measures imposed to contain the spread. Based on the survival function, the short-term forecasts showed that the number of new infections is projected to decline from 23,282 cases in the 28th week to 22,017 cases in the 31st week. Similarly, based on the cumulative hazard function, containing the virality completely is projected to take another 19.6 weeks under the prevailing conditions.
Keywords
Introduction

New COVID-19 infection data in Wave I/II in Malaysia.
This paper attempts to establish a methodological procedure for identifying the best statistical fit for the Malaysian COVID-19 new infection data using the well-known Survival Data Analysis (SDA) procedure. In current practice, frequency distribution data that provide timely daily COVID-19 counts of new infections, deaths and recoveries remain relevant. To complement this practice, the proposed SDA procedure aims at reducing the daily frequency counts of COVID-19 new infections into weekly estimates of the shape and scale parameters of the best-fit survival distribution. Upon establishing the distribution, additional analyses can be carried out, such as differentiating the COVID-19 experiences by wave of infection and producing short-term forecasts of new infections and of when the virus is projected to disappear. Daily frequency counts undoubtedly provide timelier estimates, but daily numbers are too many for SDA estimation procedures, especially life-table construction, which works best when the number of rows is limited to about 30, a limit that weekly numbers satisfy. The fitted distribution is in turn used to produce short-term forecasts of new COVID-19 infections using the survival function of the best-fit statistical model. The modelling exercise is also used to project the time it will take to contain the virality completely under the prevailing conditions. However, the COVID-19 global pandemic is subject to changing conditions, owing to new variants or mutations that are more aggressive and pose a greater threat to life, especially when borders are open and allow the free flow of people and goods. In such circumstances, the short-term projections of new COVID-19 infections and of when the pandemic will be completely mitigated warrant a review of the estimation procedure.
The Malaysian COVID-19 pandemic experiences of Wave I, Wave II and Wave III are used to illustrate the proposed SDA methodological procedure. At this stage Malaysia has been experiencing the third wave of the COVID-19 pandemic [1]. Despite various shades of cordon sanitaire mitigatory strategies similar to those of Wave I/II that the government has put in place beginning 18

New COVID-19 infection data in Wave III in Malaysia (on-going).
Currently, public policy makers, development practitioners, the media, academia and international organizations use frequency counts in their policy formulation, planning and advocacy activities. It is widely known from past experience that epidemic or pandemic data are subject to high fluctuations, skewness and kurtosis. The Malaysian COVID-19 experience regarding new infections, deaths or recoveries after medical treatment is no exception to such erratic phenomena. Compiling data in frequency format and producing summary statistics on measures of location and dispersion is notably the first statistical activity undertaken in any data analysis, and such outputs are easier to compile and understand [2, 3], especially for the public, media, public policy makers and politicians. Nonetheless, data presented as frequency counts may exhibit large inherent variations that prevent meaningful comparisons, especially between pandemic waves, between geographies or over time [3, 4]. Thus, to complement the use of frequency distributions, this paper explores the SDA methodology of converting the COVID-19 new infections frequency counts into the scale and shape parameters of the best-fit survival model distribution. The scale and shape parameters of a statistical distribution are free from unit of measurement and magnitude. The scale parameter depicts the extent of virality over time along the horizontal axis, and the shape parameter determines the rate at which the hazard is increasing. Because they are pure numbers, even people less familiar with statistical subject matter can comprehend the status of virality over time.
Indeed, the fitted distribution can be used to produce weekly estimates that provide additional information for gauging and monitoring the virality of the virus, and to undertake short-term forecasts that will be of interest to mainstream users such as public policy makers, medical professionals and planners, development practitioners in the health sector, academia and the media.
Figs 1 and 2 show the number of COVID-19 new infection cases during Wave I/II and Wave III in Malaysia. The analysis is split into two waves because the prevailing conditions for COVID-19 virality, the cordon sanitaire measures imposed by the government to contain it, and the attitudes and behaviour of the population differed greatly between Wave I/II and Wave III. Pertinently, the total number of new COVID-19 infections over the span of 28 weeks differed greatly: 9,002 in Wave I/II and 262,596 in Wave III at the end of the 28th week. In addition, during the Wave I/II period the number of countries affected by COVID-19 globally was much smaller than during Wave III, in which part of the increase in cases was due to importation of the virus from neighbouring countries and by Malaysians returning from overseas [5]. In the same vein, the Movement Control Order (MCO), or cordon sanitaire, imposed during the Wave I/II period was enforced more strictly by the Government authorities than in Wave III, when much more relaxed conditions were allowed for economic recovery. The relaxed rules and regulations increased the movement of people for work, the free flow of goods, services and workers between localities near and far, the resumption of educational institutions and religious gatherings, and various shades of social mobility. Indeed, the relaxed conditions paved the way for a prolific increase in infection numbers in Wave III. Given the vast differences between the two waves, the analysis gauges the level of COVID-19 new infections by wave.
For the Wave I, the WHO for the first time recorded the Malaysian COVID-19 experiences on 25
Literature review on Survival Data Analytics (SDA) modelling approach
The study explores a modelling approach for developing a weekly monitoring system for the COVID-19 pandemic trend in Malaysia that can be used to analyse and differentiate the new infection experiences and to undertake short-term forecasts of new infections and of when COVID-19 is poised to subside. For studying epidemiological data, three types of models are usually considered: mathematical, statistical and survival data analysis. Each modelling approach has its own distinct features and characteristics as well as merits and demerits. Among these approaches, survival data analysis is deemed to provide added merits and advantages over the other two. For instance, for analysing smallpox, Bernoulli [7] proposed for the first time a mathematical model using a deterministic approach and differential equations [8, 9]. However, mathematical modelling demands a sound understanding, appropriate representation and interpretation of the mathematical results for the physical problem [9]. A further weakness of this approach is that the differential equation procedure does not take account of physical units of measurement or of random variations associated with unknown factors [9]. To overcome these challenges, the cellular automata (CA) mathematical modelling procedure came into practice, as it can treat time and space discretely and model the evolution of complex physical systems by incorporating characteristics of the medical conditions, covariates or spatial variables and lags in the lattice structure that the model entails [10]. Nonetheless, sound interpretation of the outcomes of mathematical modelling remains a challenge for its applicability to COVID-19 data, which by and large lack covariates and are confined to numbers of new infections, deaths and recoveries [8, 11].
Subsequently, typical statistical modelling, spatial modelling, space-time approaches and survival modelling, which can handle random variations, gained a stronger footing in modelling exercises. For instance, in the Autoregressive Moving Average (ARMA) framework the time lag is estimated using an autoregressive (AR) process that treats each observation as a weighted sum of its values at previous time points, while the moving average (MA) component accounts for and corrects the errors in previous predictions through a weighted linear sum of previous errors [12]. However, ARMA models lack statistical efficiency in dealing with seasonal variations (localized trends), cyclical variations (trends over a longer time period) and irregular fluctuations due to unknown factors, and consequently extrapolating future predictions poses modelling difficulties [12, 13]. Similarly, spatial modelling, which is premised upon a homogeneous Poisson process, aims at locating clustering or regularity in recorded events over space and time, estimating and mapping the relative risk of event incidence, or identifying clustering around a particular point [14]. Such methods are widely applied in spatial epidemiology because they can model autocorrelation between measurements taken at different spatial lags [15], can be applied within the Generalized Linear Model (GLM) framework [16], or can use a dynamic model methodology that considers non-linear temporal trends non-parametrically [17] in gauging spatial variations. However, spatiotemporal point process modelling tends to average the temporal aspect over time across individuals.
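The AR and MA components described above can be sketched in a few lines. The following is a minimal illustration only, not part of the study's methodology: it assumes an ARMA(1,1) model with hypothetical coefficients `phi` (AR weight) and `theta` (MA weight), and it assumes the error before the first observation is zero.

```python
def arma11_one_step(series, phi, theta, const=0.0):
    """One-step-ahead ARMA(1,1) predictions for an observed series.

    Each prediction is a weighted previous value (AR part) plus a
    weighted previous prediction error (MA part). All parameter
    values are hypothetical and supplied by the caller.
    """
    predictions = []
    prev_error = 0.0  # assumption: no error before the first observation
    for prev_value, actual in zip(series, series[1:]):
        pred = const + phi * prev_value + theta * prev_error
        predictions.append(pred)
        prev_error = actual - pred  # error fed into the next prediction
    # Forecast for the first unobserved time point
    next_forecast = const + phi * series[-1] + theta * prev_error
    return predictions, next_forecast


preds, nxt = arma11_one_step([1.0, 2.0, 3.0], phi=0.5, theta=0.5)
```

Because each forecast depends only on the most recent value and error, a sketch like this makes it easy to see why such models struggle to extrapolate seasonal or cyclical patterns without further structure.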
Recognizing the challenges and shortfalls of the mathematical and statistical modelling approaches, this paper opted to analyse the COVID-19 new infections data using survival data analysis (SDA). Indeed, SDA is an age-old procedure that can be traced back to early work on mortality in the seventeenth century, when Graunt published the first Weekly Bill of Mortality in London and Halley published the first life table [18]. Since then, the life-table method has been used frequently by actuaries, statisticians and biomedical researchers in governmental and private agencies for determining life expectancy at a given age, survival, mortality or relapse rates in clinical trials, insurance premium rates, et cetera. In non-medical fields, survival analysis was used to assess the reliability of military equipment during World War II, and the methodology was subsequently used to analyse the reliability of industrial products and devices. In the past four decades, survival analysis has become one of the most frequently used methods for analysing survival-time data in disciplines ranging from medicine, epidemiology and environmental health to criminology, marketing and astronomy.
In comparison to mathematical and statistical modelling, survival data analysis offers distinct merits. First, the purpose of applying SDA procedures to new infections data is to study the time taken for a new infection to occur, premised upon the probabilistic notions of survival time and hazard functions [19, 50]. Second, survival time analysis is applicable to situations where the exact survival time may be longer than the duration of the study (or observation time) [18], because the procedures allow right or left censoring when data are truncated. Applications of survival data analysis in public health, engineering and social science span a wide spectrum of uses, such as estimation of survival distributions, testing hypotheses of equality of two or more survival distributions using Gehan’s Generalized Wilcoxon test [20, 21] and the log-rank test [22], and identification of risk or prognostic factors and their relationship to the length of disease-free time, survival or remission [18, 23, 24, 25].
Since the COVID-19 global pandemic began in late 2019, many research endeavours have studied the impact of the COVID-19 virus on human survival using various aspects of survival analysis. To name a few, Salinas et al. [26] and Kyeong [27] investigated the impact of COVID-19 on the Mexican and South Korean populations using Kaplan-Meier curves and the Cox proportional hazards model, respectively. Specifically, they examined variables such as age, sex, comorbidities, pregnancy, immune-suppression, smoking, time elapsed between the onset of symptoms and hospitalization, and death, as well as the time elapsed from admission to a health care unit to death, development of pneumonia, hospitalization, ICU admission, intubation, and the type of health service. The case studies concluded that the fatality rate was high among males, older people and those with chronic diseases. Altonen et al. [28] and Eghbal et al. [29] undertook similar studies focused on the USA and Kurdistan populations, respectively, and concluded that elderly people with chronic diseases such as diabetes need to be under active surveillance and screened frequently.
Atlam et al. [30] deployed machine learning techniques and artificial intelligence based on Cox regression modelling, aimed at helping hospitals to identify patients who have better chances of survival and to predict the most important symptoms (features) affecting survival probability. Interestingly, Yue Zhao and Deepika Dilip [31] used the Cox regression procedure to explore the relationship between COVID-19 deaths as published by Johns Hopkins University and democracy indices as recorded by the Economist Intelligence Unit, and concluded that in a public health crisis a democratic government may face more constraints when taking draconian disease control measures, simply due to its structure and the likelihood of opposition. Researchers such as Martin Spousta [32] and Bui et al. [33] used parametric models, namely the linear exponential and Weibull distributions respectively, to estimate the incubation period of COVID-19.
Succinctly put, the foregoing literature review indicates that SDA methodology has been used in an increasing number of research activities on COVID-19 in many countries, using non-parametric, semi-parametric or parametric approaches. In the same vein, this exercise limits its survival analysis to one specific aspect only, namely the time taken for transmission of a new COVID-19 infection from one person to another, using nonparametric, semiparametric, graphical or parametric approaches or a combination of them. Specifically, an attempt is made to develop a weekly monitoring system for COVID-19 new infections by wave, with reference to the experience in Malaysia.
Merits and demerits of presenting data frequency distribution and basic statistics
Undoubtedly, frequency counts of COVID-19 new infections, deaths or recoveries are easily compiled with the support of various reporting sources nationwide in any country, including Malaysia. Because this is a pandemic, the published daily totals and their cumulative counts concern not only mainstream policy makers, development practitioners and academia but also warrant the attention of less statistically oriented ordinary citizens and the media. Presenting the data in frequency format is the obvious first option in any statistical activity: it provides a convenient quick glance at the entirety of the data, makes the maximum and minimum values easy to spot, and shows whether the values are concentrated in one area or spread out across the entire scale [2, 3, 4, 34]. Industrious users may even monitor whether trends are increasing, decreasing or constant, or detect seasonal and cyclic variations, in an attempt to study the emergence of subsequent waves of the disease and to undertake future projections [3, 35]. Advanced users may also convert the numerous frequency counts into single measures of central tendency and dispersion for a better understanding of the pandemic [2, 3, 4, 33].
Data presented as frequency counts typically comprise numerous observations. When presented in time series format, the data are bound to exhibit high fluctuations, especially when the series is long. In such data sets it may be difficult to cull out the underlying patterns and trends, particularly when the observations are subject to the erratic fluctuations typically encountered in pandemic data [34, 35]. Besides that, any attempt to compile the data into a grouped frequency format raises additional concerns about the number of class intervals, which is fairly arbitrary and depends on the size of the time series [2, 3, 4]. If there are too few class intervals, much of the information in the counts may be lost; if there are too many, the overall picture may be obscured by excessive detail.
Similarly, reducing the numerous daily COVID-19 time series observations into representative and dispersion measures such as the mean, median, mode or standard deviation is likely to encounter several statistical challenges, especially whenever spikes or drastic drops occur in the COVID-19 numbers. Moreover, in the presence of the extreme values that usually occur in pandemic data, the statistical representation of the data becomes questionable: skewness may arise in either direction over time due to lack of symmetry; kurtosis may vary greatly due to clustering of cases; and the mode may become overly sensitive, since it can easily be made to “jump around” by varying the size and number of the class intervals [2, 3, 4, 34]. By comparison, in SDA methodology the mean, variance and coefficient of variation can be determined for the best-fit statistical distribution, in addition to the scale and shape parameters that characterise the nature of the distribution.
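The sensitivity of summary measures to extreme values can be illustrated with a small sketch. The counts below are hypothetical and are not taken from the Malaysian data; the skewness formula used is the simple moment-based (population) version, chosen here only for illustration.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: mean cubed deviation over sd cubed."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (n * sd ** 3)


# Hypothetical daily counts, then the same series with one spike added
baseline = [10, 12, 11, 13, 12, 11, 10]
with_spike = baseline + [200]

for label, data in (("baseline", baseline), ("with spike", with_spike)):
    print(
        label,
        "mean=%.2f" % statistics.fmean(data),
        "median=%.2f" % statistics.median(data),
        "skewness=%.2f" % sample_skewness(data),
    )
```

In this toy series a single spike roughly triples the mean and sharply increases the skewness while the median barely moves, which is the kind of instability the paragraph above describes.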
Succinctly put, frequency counts with erratic fluctuations may not be statistically efficient for comparing COVID-19 experiences between waves of infection. Such frequency data may also be unsuitable for producing meaningful and valid projections or short-term forecasts. Alternatively, SDA procedures offer a methodology for reducing the numerous time series frequency data into the scale and shape parameters of a best-fit survival distribution. The scale and shape parameters are pure numbers, free from order of magnitude and unit of measurement [36], and hence suitable for comparing COVID-19 experiences between waves of infection even though the waves differ in intensity of infections, number of deaths or recoveries, duration of the epidemic, or the covariates and prognostic factors influencing the pandemic. The SDA methodology also offers a statistical procedure for gauging, monitoring, assessing and producing short-term forecasts of COVID-19 new infections using the survival and hazard functions. Historically, SDA methodology was introduced in the clinical environment [37, 38, 39, 40] and was subsequently used in reliability life-testing experiments in engineering and manufacturing plants [37, 41]; as mentioned earlier, today it is applied in many areas, including the study of COVID-19. As such, this exercise attempts to measure the virality of COVID-19 new infections by wave in Malaysia, for the benefit of those engaged in public policy and advocacy who are concerned about the trends, patterns, features and characteristics of new infections as well as future projections.
Research objectives
The main objective of this paper is to establish a methodological procedure for gauging, monitoring, assessing and evaluating the COVID-19 new infections experience in Malaysia using SDA procedures. Specifically, the SDA methodology is applied to produce and monitor weekly estimates of the shape and scale parameters of the best-fit statistical distribution for new COVID-19 infections. The weekly results are produced by wave of new infections and assessed to differentiate the trends, features and characteristics inherent to each wave. Once the weekly estimates are established, the methodology also enables short-term forecasts of either proliferation or mitigation of new infections, and estimation of the duration until the COVID-19 viral chain is expected to disappear completely.
Currently, frequency-based daily records are used to monitor and evaluate the COVID-19 phenomena by time, by geography or by wave of infection. As highlighted in the literature review, frequency counts that are highly subject to extreme values have inherent statistical inefficiencies for making meaningful evaluations, benchmarking or projections. SDA procedures, which can reduce voluminous data of diverse characteristics into the scale and shape parameters of a best-fit statistical distribution, offer better validity for evaluation, benchmarking or projections.
This research focuses on Wave I, Wave II and Wave III, which Malaysia has undergone since the beginning of COVID-19 as a global pandemic. Each wave seemingly has its own distinct features and characteristics in terms of intensity of infections, cordon sanitaire strategies, and the attitude, behaviour and adherence of people to the rules and regulations imposed by the authorities. As highlighted earlier, the number of new infections in Wave III (262,596 cases) was about 29 times the combined number recorded in Wave I/II (9,002 cases). The rate at which the numbers proliferated between the waves is startling and warrants a study differentiating the trends, features, characteristics and impact of COVID-19 between Wave I/II and Wave III. To reiterate, the trend analysis enables short-term projections of new COVID-19 infections and prediction of when the virality will cease if the current conditions persist. The methodology is dynamic and flexible in the sense that projections are subject to review from time to time if there are drastic changes in the virality conditions. Indeed, not only the current data but also more precise projections of the virality will be of great interest to mainstream policy makers, medical and public health planners, development practitioners, the media and academia for their policy, planning, advocacy and communication routines.
Towards this aim, the methodology considers four well-known survival distribution models, namely the Exponential, Linear Exponential, Weibull and Gompertz distributions [37, 38, 39, 40]. Because COVID-19 is a new global phenomenon, the nature of its virality is not yet known precisely. As mentioned earlier, researchers such as Martin Spousta [32] and Bui [33] made prior assumptions of Exponential and Weibull distributions, respectively, in analysing the COVID-19 incubation period. In this exercise, by contrast, an attempt is made to explore a combination of graphical procedures, the non-parametric life-table technique, Gehan-Siddiqui semi-parametric procedures and the parametric Maximum Likelihood Estimation (MLE) procedure in order to arrive at the most appropriate distribution for describing COVID-19 new infections in Malaysia.
Specifically, the non-parametric life-table technique was used to compute the hazard, cumulative hazard and survival function values for the COVID-19 new infections data, which have the survival-time characteristics that SDA procedures are premised upon. The hazard, cumulative hazard and survival functions of the life table are in turn used to establish the best-fit survival distribution through both graphical and regression estimation procedures. The hazard and cumulative hazard plots of the Exponential, Linear Exponential, Weibull and Gompertz distributions are considered in the graphical procedure, and hazard function values are used in the regression estimation procedure. The graphical and semi-parametric procedures are deemed to provide only a preliminary indication of the estimated values of the scale and shape parameters. Refined estimates of the scale and shape parameters are determined using the parametric MLE procedure, which is usually considered to provide more consistent and efficient estimates than semi-parametric estimation procedures [42].
The foregoing study objectives describe the statistical objectives of determining the best-fit survival model and its shape and scale parameters. The study also aims to elucidate the meaning of the parameters: the scale parameter relates to the extent of virality in terms of age or duration, and the shape parameter determines the rate of growth of the hazard characterizing COVID-19 new infections. Interpretatively, these parameters provide surrogate measures for gauging the efficacy of the various shades of cordon sanitaire measures that the government has put in place. Pertinently, when the rate of hazard is high the incidence of infections is high, and vice versa. Thus, effective implementation on the part of the authorities and committed, responsible behaviour on the part of the people are crucial in determining success in containing the COVID-19 chain, especially new infections, which ultimately can reduce the number of COVID-19 deaths. Unfortunately, the success in mitigating new infections in Wave III was not satisfactory in comparison to Wave I/II.
Data source, scope and coverage
For the construction of the life table the study requires data on the number of new COVID-19 infections and deaths by date of reporting. As mentioned earlier, the requisite data are sourced from the WHO website and confirmed against Ministry of Health (MoH) records in Malaysia, which constitute the official statistics. The quality of the data is considered valid and reliable, as Malaysia has a long-established nationwide public health surveillance system and the dedicated COVID-19 hospitals are supported by contemporary information and communication technology [43].
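Since the life table in this study works on weekly rather than daily rows, the daily counts by date of reporting need to be grouped first. A minimal sketch of that grouping is shown below; the function name `weekly_totals` and the choice to keep a partial final week as its own row are assumptions for illustration, not the authors' stated procedure.

```python
def weekly_totals(daily_counts, days_per_week=7):
    """Sum consecutive daily counts into weekly totals.

    A trailing partial week (fewer than `days_per_week` days) is kept
    as its own row; whether to keep or drop it is an analyst's choice.
    """
    return [
        sum(daily_counts[start:start + days_per_week])
        for start in range(0, len(daily_counts), days_per_week)
    ]


# Hypothetical daily new-infection counts covering two full weeks
daily = list(range(1, 15))
weekly = weekly_totals(daily)
```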
During the Wave I, Wave II and Wave III a number of zero cases were intermittently reported, meaning no hazard (
Methodology
Towards obtaining the best-fit survival distribution and estimation of refined values of scale (
Non-parametric estimation procedure of life-table technique
Survivorship function
For life-table construction, essentially the SDA methodology premises upon the definition of survival time, denoted as
where
Computationally, for non-censored observations
Where
Hazard and cumulative hazard function
The other important function that is typically derived from lifetable is the hazard function
Typically, the failure rate refers to the death rate in a mortality table or the end of life span in a reliability experiment. Analogously, in this research it refers to the “rate of COVID-19 new infection”.
For computation purposes as per actuarial practices, the estimated hazard function
Where,
For computational purposes the cumulative hazard function is derived from the following relationship [37, 38, 39, 40];
Thus, at
Using the large sample approximations, variances of the estimated hazard function
Accordingly, for illustration purposes lifetable for Wave I/II is shown in Appendix I. For the Wave I/II the radix number (
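The life-table quantities discussed in this section can be sketched computationally. The example below is a hedged illustration that assumes the standard textbook actuarial forms with no censoring: the radix is taken as the total number of events, the interval hazard is d / (b (n − d/2)) for d events among n at risk in an interval of width b, the survival function is the running product of (1 − d/n), and the cumulative hazard is −ln S. The function name and the weekly counts are hypothetical.

```python
import math


def life_table(events, width=1.0):
    """Build a simple uncensored life table from interval event counts.

    Assumptions (standard actuarial forms, not taken from the paper):
      hazard      = d / (width * (n - d / 2))   mid-interval estimate
      survival    = proportion still event-free at the interval start
      cum_hazard  = -ln(survival)
    """
    at_risk = sum(events)  # radix: everyone who eventually has the event
    survival = 1.0
    rows = []
    for interval, d in enumerate(events, start=1):
        if at_risk == 0:
            break  # nobody left at risk; later rows are undefined
        rows.append({
            "interval": interval,
            "at_risk": at_risk,
            "events": d,
            "survival": survival,
            "hazard": d / (width * (at_risk - d / 2.0)),
            "cum_hazard": -math.log(survival),
        })
        survival *= 1.0 - d / at_risk
        at_risk -= d
    return rows


# Hypothetical weekly new-infection counts
table = life_table([10, 30, 60])
```

An interval with zero reported events simply yields a hazard of zero for that row, and the survival function is carried forward unchanged.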
Hazard and cumulative hazard and survivorship functions framework
The graphical procedure, semi-parametric estimation procedure of regression technique and parametric estimation procedure of using Maximum Likelihood Estimation are based on the parametric assumptions of hazard and cumulative hazard functions of the four well-known survival distributions namely exponential, linear exponential, Weibull and Gompertz [37, 38, 39, 40] and the underpinning formulae of SDA characteristic functions are as per Table 1.
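For readers without access to Table 1, the characteristic functions of the four candidate distributions can be sketched as below. The parameterisations are the common textbook forms and are an assumption here, since the table itself may use a different convention: exponential h(t) = λ; linear exponential h(t) = λ + γt; Weibull h(t) = λγ(λt)^(γ−1); Gompertz h(t) = exp(λ + γt). The names `lam` and `gam` are illustrative.

```python
import math


def hazard(model, t, lam, gam=0.0):
    """Hazard h(t) under assumed textbook parameterisations."""
    if model == "exponential":
        return lam
    if model == "linear_exponential":
        return lam + gam * t
    if model == "weibull":
        return lam * gam * (lam * t) ** (gam - 1.0)
    if model == "gompertz":
        return math.exp(lam + gam * t)
    raise ValueError("unknown model: %s" % model)


def cumulative_hazard(model, t, lam, gam=0.0):
    """Cumulative hazard H(t), the integral of h from 0 to t."""
    if model == "exponential":
        return lam * t
    if model == "linear_exponential":
        return lam * t + gam * t * t / 2.0
    if model == "weibull":
        return (lam * t) ** gam
    if model == "gompertz":
        return math.exp(lam) / gam * (math.exp(gam * t) - 1.0)
    raise ValueError("unknown model: %s" % model)


def survival(model, t, lam, gam=0.0):
    """Survivorship function S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(model, t, lam, gam))
```

Under this assumed convention, the Weibull form with `gam = 1` collapses to the exponential model with constant hazard `lam`, which is why the exponential is treated as a special case.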
Among the models considered in the framework hazard function of exponential or linear exponential distribution characterises either a constant hazard (
Other popular competing hazard models include the log-normal, logistic and Gamma distributions, but these standard distributions have limitations in fitting some real data accurately. For instance, the log-normal distribution is characteristically quite similar to the Weibull distribution, but it is more applicable to skewed distributions with lower mean values and large variance, whereas the Weibull is more flexible [47]. Similarly, the logistic distribution resembles the normal distribution in shape, with the mean, median and mode having the same value, but has heavier tails or kurtosis than exponential-type distributions [48]. Both the Gamma and Weibull distributions are generalizations of the exponential distribution family, which characterises the waiting time in a Poisson process, that is, the time until an event occurs. But the hazard or instantaneous failure rate in Gamma distribution is an increasing function of time (
Graphical procedure
As outlined in Table 1, the graphical plotting [38, 39, 40] and are based on values of
Semi-parametric Gehan-Siddiqui regression technique
In undertaking the regression procedure, Gehan and Siddiqui considered linearity relationships of hazard functions as outlined in Table 1 and three types of weights as per below [38, 40]. W
W W
Accordingly, the weighted least square estimates for
Where the weighted least squares estimate for
For identifying the best befitting regression models among the four competing survival models, computation of log-likelihood values based on survival function estimates considered as provided in the columns of the life-table [35, 36, 37], computed as per the formulae below
Thus, the logarithm of the likelihood is:
The model that gives the largest log-likelihood value is chosen as the best-fit model. The best-fitting model is then carried forward to the parametric Maximum Likelihood Estimation procedure in the next step.
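The regression step can be sketched for the Weibull case. This is a hedged illustration rather than the Gehan-Siddiqui procedure itself: it assumes the parameterisation h(t) = λγ(λt)^(γ−1), so that ln h(t) = [ln γ + γ ln λ] + (γ − 1) ln t is linear in ln t, and it accepts caller-supplied weights because the specific weight definitions are not reproduced here. Intervals with zero hazard must be dropped before taking logarithms.

```python
import math


def weighted_line_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def weibull_from_hazard(times, hazards, weights=None):
    """Preliminary Weibull (scale, shape) from life-table hazards.

    Uses the assumed linearisation ln h = ln(gam) + gam*ln(lam)
    + (gam - 1) * ln t; weights default to 1 (ordinary least squares).
    """
    if weights is None:
        weights = [1.0] * len(times)
    x = [math.log(t) for t in times]
    y = [math.log(h) for h in hazards]
    intercept, slope = weighted_line_fit(x, y, weights)
    shape = slope + 1.0
    scale = math.exp((intercept - math.log(shape)) / shape)
    return scale, shape
```

Estimates obtained this way are best read as starting values, in line with the text's description of the semi-parametric step as preliminary to the MLE refinement.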
Maximum Likelihood Estimation (MLE) procedure
The Weibull Distribution has the density function.
where
Ref. [49] transformed the above equation as follows:
As for such distribution, the MLE of
Study findings
The study findings can be summarised as follows:
MLE weekly estimates – Wave I/II and Wave III of Malaysia
Short term forecasts on COVID-19 new infections in Wave III
Weekly shape parameter values by Wave I/II and Wave III in Malaysia.
Downward trend of shape value of COVID-19 in Wave III in Malaysia.


The weekly trend of the shape parameter of the fitted Weibull distribution for Wave I/II and Wave III is shown in Fig. 1. The trend indicates how the Weibull shape parameter changes direction over time. In the Weibull distribution the shape parameter governs the behaviour of the hazard rate, that is, the instantaneous chance of a person being infected by the COVID-19 virus at a given time, provided that the person was infection-free prior to that point. If the shape value is less than 1, the hazard rate decreases with time; if it is greater than 1, the hazard rate increases with time. When the shape parameter equals 1, the hazard rate is constant and the distribution reduces to the exponential distribution. Thus, it can be seen in Table 3 as well as in Fig. 1 that the shape value for Wave I/II from week 1 to week 5 was less than 1, indicating a weaker hazard force than from week 6 onwards, when the shape value was greater than 1. Similarly, close scrutiny of the shape values for Wave III reveals that the shape value was greater than 1 from week 2 onwards, indicating that the force of infection was much higher in Wave III from the onset than in Wave I/II.
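The three regimes of the shape parameter can be checked numerically. The sketch below assumes the Weibull hazard takes the form h(t) = scale × shape × (scale × t)^(shape − 1); this parametrization and the helper names are assumptions for illustration, and the paper's own density may be parametrized differently.

```python
def weibull_hazard(t, scale, shape):
    """Hazard h(t) = scale * shape * (scale * t) ** (shape - 1).

    Assumed parametrization, with t measured in weeks.
    """
    if t <= 0 or scale <= 0 or shape <= 0:
        raise ValueError("t, scale and shape must be positive")
    return scale * shape * (scale * t) ** (shape - 1)


def hazard_trend(scale, shape, weeks=(1, 5, 10, 20)):
    """Classify the hazard as decreasing, constant or increasing in time."""
    values = [weibull_hazard(t, scale, shape) for t in weeks]
    pairs = list(zip(values, values[1:]))
    if all(abs(b - a) < 1e-12 for a, b in pairs):
        return "constant"
    if all(b > a for a, b in pairs):
        return "increasing"
    if all(b < a for a, b in pairs):
        return "decreasing"
    return "mixed"


# Shape below, equal to and above 1, with an arbitrary scale of 0.06.
for shape in (0.5, 1.0, 2.5):
    print(shape, hazard_trend(scale=0.06, shape=shape))
```

Under this assumed form, the week-28 estimates reported in the paper (scale 0.05901 with shape 2.48956 for Wave I/II, scale 0.06463 with shape 2.5693 for Wave III) both fall in the increasing-hazard regime.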
The other noteworthy feature in Table 3 and Fig. 1 is the trend in the values of the shape parameter. Specifically, in Wave I/II the hazard force increased until week 11 before it began to come down; that is, the shape value increased from 0.10816 in the first week to 3.02715 in the 11th week, as given in Table 3 and shown graphically in Fig. 3. Similarly, in Wave III the hazard force was on the rise until the 12th week before it began to register a downward trend; the shape value increased from 2.050 in the third week to 3.51047 in the twelfth week (see Table 3). The Wave I/II COVID-19 new infections were over at the time of analysis, so no further analysis of that wave was explored. In Wave III, which was ongoing at the point of analysis, the shape value has been gradually declining from the 13th week onwards. Thus, for short-term forecasting purposes, trend fitting was done for the downward trend from week 13 onwards, as reflected in Fig. 4. It can be seen that linear trend provided the best fit of
Visualization of trend for scale parameter values by week: Wave I/II and Wave III Malaysia.
Based on the assumption of survival function
where
The forecasting exercise was undertaken at week 28. To validate the accuracy of the results, the scale and shape values and the survivorship function values were also estimated for the three prior weeks (weeks 25, 26 and 27), and the forecast numbers of new infections were determined accordingly. These numbers were then compared against the actual counts of new infections to determine the percentage of over- or underestimation, as shown in Table 3. The deviation ranged from an underestimation of 3.0% for the 28th week to 27.7% for the 26th week, when a sudden surge in new infections occurred. The average underestimation over the four consecutive weeks (weeks 25 to 28) was 11.8%; excluding week 26 the average was only 6.5%, which is generally an acceptable range in a projection exercise.
Based on the prevailing conditions, the forecasts of new COVID-19 infections for the next two weeks are also shown in Table 3. The forecast for week 29 was 22,889 cases against an actual count of 19,742 cases, an overestimation of 16%. Similarly, for week 30 the forecast was 22,469 cases against an actual count of 18,825 cases, an overestimation of 19%. The implicit challenge in this methodology is that the forecast is sensitive to drastic changes in the prevailing conditions of virality, as seen in weeks 26, 29 and 30. For example, the actual count for week 27 was 25,668 cases, which dropped to 23,282 cases in week 28, a drop of 9.3%, and then to 19,742 cases in week 29, a further drop of 15%. Given this sensitivity, in practice the short-term forecasts warrant review from time to time, especially when significant changes are observed in the number of new infections.
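The over- and underestimation percentages quoted above follow from a simple relative deviation. The sketch below assumes the deviation is expressed as a percentage of the actual count, which is consistent with the 16% and 19% figures; the function name is illustrative only.

```python
def percent_deviation(forecast, actual):
    """Signed deviation of a forecast from the actual count, in percent.

    Positive values mean overestimation, negative values underestimation.
    Assumes the deviation is taken relative to the actual count.
    """
    if actual <= 0:
        raise ValueError("actual count must be positive")
    return 100.0 * (forecast - actual) / actual


# Forecast and actual weekly counts for weeks 29 and 30, as quoted in the text.
weekly = {29: (22889, 19742), 30: (22469, 18825)}

for week, (forecast, actual) in weekly.items():
    deviation = percent_deviation(forecast, actual)
    label = "over" if deviation > 0 else "under"
    print(f"week {week}: {deviation:+.1f}% ({label}estimation)")
```

Rounded to whole percentages, the two deviations reproduce the 16% and 19% overestimation reported for weeks 29 and 30.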
As acknowledged earlier, the Wave I/II COVID-19 infections were over for Malaysia. Nonetheless, the visual trend of the scale parameter values for Wave I/II and Wave III is shown in Fig. 5. Further examination of each wave separately, as depicted in Fig. 5, revealed that the power curve function (
As can be seen in Fig. 5, the estimated scale value gradually decreased over time, indicating prolongation in the disappearance of the virality of COVID-19 new infections. Table 3 shows that the scale values for Wave I/II were very high for the first five weeks; from week 6 onwards the scale value, recorded at 0.7413, declined incrementally until it reached 0.05901 in the 28th week. Similarly, in Wave III the scale value registered a stable measure of 0.81059 at the 4th week and thereafter slowly declined until it reached 0.06463 by the 28th week. In both waves the initial scale values were high, probably because of the small number of new infections, as the scale measure is inversely proportional to the number of cases as per the MLE formula. At the early stages of the COVID-19 pandemic the government of Malaysia was understandably attempting to implement various shades of cordon measures to contain the spread, and with the concerted support of the masses the number of infections in Wave I/II began to come down continuously after reaching its maximum of 1112 cases at the 12th week. The scenario in Wave III is different: new infections have been continually on the rise, and the number reached 23,282 by the 28th week. The key difference is that Wave I/II saw stricter implementation of cordon sanitaire measures, whereas in Wave III the rules and regulations of cordon sanitaire measures were much relaxed in order to revive the ailing economy despite the health menace to the population.
In Wave I/II only essential economic services, especially public utility services, were allowed to operate, alongside working from home and online education. In Wave III, the relaxed conditions included the opening up of all economic sectors, schools, religious institutions and worship centres, as well as social gatherings and greater social mobility, but with stricter standard operating procedures (SOP) on maintaining social distancing, wearing face masks, frequent hand sanitization, gauging body temperature and recording QR codes for technology-based tracing. Indeed, the Government has been facing challenging times in balancing economic growth and the health of the population in such a global pandemic scenario.
Succinctly put, the foregoing survival data analysis procedures on the COVID-19 new infections data have realized a number of statistical benefits. First, the research exercise has established a methodology for analysing COVID-19 new infections using SDA procedures that are founded upon the probabilistic notions of survival time and hazard function. Second, the SDA procedure deployed non-parametric, semi-parametric and parametric as well as graphical procedures to determine the statistical distribution that provides the best fit, rather than assuming a distribution in advance. Accordingly, the analysis showed that the Weibull distribution provided the best fit among the well-known distributions considered in epidemiological studies. Third, the methodology reduced the voluminous daily time series of COVID-19 new infection counts into weekly class intervals not exceeding 30 rows, which is deemed appropriate for efficient application of the life-table technique. Fourth, the life-table in turn enabled the estimation of hazard and survival function values for the new infections data, which through the semi-parametric regression and MLE procedures enabled the estimation of scale and shape parameters of the fitted Weibull distribution for both Wave I/II and Wave III in the Malaysian experience. Fifth, being free from unit of measurement and order of magnitude [36], the scale and shape parameters of the Weibull distribution provided meaningful comparisons of COVID-19 new infection experiences between Wave I/II and Wave III. Specifically, the scale and shape parameters were 0.05901 and 2.48956 for Wave I/II and 0.06463 and 2.5693 for Wave III, respectively. The higher hazard force in Wave III, reflected in its larger shape value, is attributed to weaker control in the implementation of cordon sanitaire measures imposed by the Government to contain the virality, in comparison with Wave I/II.
Sixth, by fitting appropriate trends to the estimated weekly scale and shape parameters, the survival function of the Weibull distribution enabled short-term forecasts of new infections, which showed an incremental decline from 23,282 cases in the 28th week to 22,017 cases in the 31st week, with a further decline expected under the prevailing conditions unless abrupt changes occur in the trend. Seventh, the cumulative hazard function of the Weibull distribution provided a basis for estimating the duration until the virality of new COVID-19 infections is likely to disappear completely; the results showed that it may stretch over another 19.6 weeks as per the estimation at the 28th week. Lastly, the foregoing SDA methodology provides a complementary measure in addition to the frequency counts distribution, and the methodology can serve as an exemplary model for other countries to emulate.
