Abstract
Currently, the Algerian health system is facing the fourth wave of COVID-19, in which the number of recorded cases grows exponentially each day due to the COVID-19 Omicron variant. According to the Algerian National Institute of Public Health (ANIPH), 168 668 confirmed cases and 4 189 deaths had been reported up to 29 July 2021. In this work, we use supervised Machine Learning (ML) based models to forecast the future trend of the disease in Algeria. To that end, we use three forecasting models: Facebook Prophet, LSTM and ARIMA, and provide forecasts for the next 90 days. The dataset contains the confirmed and death cases collected from the daily Epidemiological Situation (ES), published by ANIPH, from 19 April 2020 to 29 July 2021. The forecasting accuracy of the models is assessed and compared using several statistical assessment criteria. The results show that ARIMA outperforms Facebook Prophet and LSTM for confirmed cases, whereas LSTM shows the best performance for death cases. This study shows clearly that the pandemic is still spreading and that protection measures such as contact restriction and lockdown should be strictly applied, especially with the appearance of the COVID-19 Delta and Omicron variants.
Introduction
The first case of COrona VIrus Disease 2019 (COVID-19) in Algeria was reported on 25 February 2020, when a foreign worker from Italy tested positive for the Severe Acute Respiratory Syndrome COrona Virus 2 (SARS-CoV2). The World Health Organization (WHO) officially named the disease caused by this novel coronavirus COVID-19 on 11 February 2020 (WHO, 2020). According to (WHO, 2021), the Delta variant of COVID-19 has now spread to more than 111 countries, and WHO expects it to become very soon the prevailing COVID-19 strain circulating worldwide. The Omicron variant was first reported to WHO from South Africa on 24 November 2021. In Algeria, the Omicron variant is currently one of the main causes of the increased transmission rate, which is also explained by an increase in social contacts and mobility and by the uneven application of social distancing and health measures: 1855 new recorded cases and 15 deaths were reported on 21 January 2022. According to the Algerian Ministry of Health (AMH), on 13 January 2022 the Delta variant represented 67% of the circulating COVID-19 variants against 33% for the Omicron variant. Since the COVID-19 pandemic is currently a daunting global health challenge, it is important to estimate when COVID-19 and its variants will be eradicated and ordinary life will return to normal.
In this paper, we aim to forecast the time series of COVID-19 in Algeria based on data recorded from the Epidemiological Situation (ES) published by the Algerian National Institute of Public Health (ANIPH). Time series forecasting models predict future values from previously observed ones. Accordingly, we select and implement three Machine Learning (ML) based time series forecasting models: ARIMA (Selva, 2019), Facebook Prophet (Taylor & Letham, 2018) and LSTM (Hochreiter & Schmidhuber, 1997). The collected dataset is structured as a sequence of historical measurements (i.e., univariate time series) from 20 April 2020 to 29 July 2021 for both confirmed and death cases. Experiments were carried out using the Python programming language under the Anaconda environment. The forecasting accuracy of the models was assessed using statistical assessment criteria, namely the R2 score, Mean Absolute Error (MAE), Root-Mean-Square Error (RMSE) and Max Error. The results show that the ARIMA model outperforms Prophet and LSTM when forecasting confirmed cases, whereas the LSTM model has the best performance in forecasting death cases. In both cases, Prophet does not yield good predictions, since it performs best on time series characterized by strong seasonal effects.
The outline of the paper is as follows: Section 2 reviews the literature on machine learning-based forecasting methods for the COVID-19 outbreak. Section 3 presents the collected Algerian dataset for both confirmed and death cases. In Section 4, the three selected forecasting models are briefly introduced and explained. The findings of the paper and some comments are discussed in Section 5. Finally, we conclude the paper and propose our future work in Section 6.
Related works
The worldwide spread of COVID-19 prompted scientists and researchers to accelerate research and develop new norms and standards to contain the spread of the coronavirus. In (Ranjan, 2020), the authors predicted the outbreak of COVID-19 in India using the SIR model on daily data. Consistent results were found for confirmed and death cases. The authors also stress that the weapon against COVID-19 is the combination of social distancing and lockdown. Tomar et al. (Tomar & Gupta, 2020) predicted the confirmed cases in India for the next 30 days using the Long Short-Term Memory (LSTM) algorithm, and studied the effect of precautionary measures on the spread of the coronavirus. In (Dehesh et al., 2020), the authors suggest using the ARIMA model to predict the COVID-19 outbreak in countries with a high number of confirmed cases. The dataset covered 22 January 2020 to 1 March 2020. The best combinations of ARIMA parameters (i.e., p, d and q) found were ARIMA (2,1,0) for Mainland China, ARIMA (2,2,2) for Italy, ARIMA (1,0,0) for South Korea, ARIMA (2,3,0) for Iran and ARIMA (3,1,0) for Thailand. Mainland China and Thailand had an almost stable trend. The trend of South Korea was decreasing and was expected to become stable in the near future. Iran and Italy had unstable trends. In (Aditya Satrio et al., 2021), the authors implement two machine learning-based time series forecasting models, ARIMA and Facebook Prophet, to analyze and predict the COVID-19 outbreak in Indonesia, and evaluate the performance of both. The results show that the Facebook Prophet model generally outperforms ARIMA on a dataset recorded from 20 January 2020 to 21 May 2020. Regarding published works on the spread of COVID-19 in Algeria, Djeddou et al. conducted a predictive study using an Extreme Learning Machine (ELM) model. The results showed that the proposed ELM model achieved adequate prediction accuracy with the smallest errors (Djeddou et al., 2020). A prediction study of the effects of confinement on the number of COVID-19 cases in Algeria is also proposed in (Moussaoui & Auger, 2020). The authors use SEIR modelling to forecast the COVID-19 outbreak and demonstrate that the final size of the epidemic depends on two factors: the cumulative number of cases at the intervention date and the fraction of the population in confinement.
Although these works have considerably advanced the analysis and prediction of the COVID-19 outbreak, they did not take into account the new outbreak of the Delta and Omicron variants in Algeria, during which the number of recorded cases and deaths grows exponentially each day. Furthermore, it is of great interest to forecast the new recorded cases and deaths under the current circumstances, in which social distancing and lockdown are not respected, since the Delta variant is dangerous and classified as the most transmissible SARS-CoV2 variant to date, mainly among the unvaccinated.
Dataset
According to (ANIPH, 2021), as of 30 July 2021, nine provinces recorded a rate of more than 500 confirmed cases per 100 000 inhabitants: Blida with 817.65 cases, Algiers with 805.42 cases, Constantine with 617.42 cases, Jijel with 596.29 cases, Tizi Ouzou with 586.52 cases, Batna with 582.03 cases, Bejaïa with 578.59 cases, Tébessa with 530.63 cases and Ouargla with 527.45 cases (Fig. 1). In terms of probable cases, five provinces exceeded the threshold of 800 cases per 100 000 inhabitants: Médéa with 2594.76 cases, Tébessa with 1904.67 cases, Bordj Bou Arreridj with 1145.20 cases, Blida with 1095.24 cases and Chlef with 939.14 cases. Note that Blida and Tébessa recorded high incidences for both confirmed and probable cases (ANIPH, 2021).
The dataset used in this research was collected manually from the daily ES published by the ANIPH from 19 April 2020 to 29 July 2021. The initial dataset consists of a table with 48 rows (the number of Algerian provinces) and 466 columns (dates) for both confirmed and death cases. This structure does not match the adopted ML-based forecasting algorithms, which expect the data frame to be provided as a univariate time series. We therefore transformed the dataset into univariate time series, as illustrated in Table 1. In addition, both Facebook Prophet and ARIMA use the time stamps (i.e., dates) as their index.
Confirmed and death cases time series
Distribution of incidence rates of PCR
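The restructuring of the province-by-date table into a univariate series can be sketched with pandas as follows. This is a minimal sketch under stated assumptions, not the procedure actually used: it assumes that the national daily series is obtained by summing over provinces, and the function name `to_univariate`, the series name `cases` and the toy values are hypothetical.

```python
import pandas as pd


def to_univariate(wide: pd.DataFrame) -> pd.Series:
    """Collapse a provinces-by-dates table into one date-indexed series.

    Assumption: rows are provinces, column labels are date strings, and the
    national daily count is the sum over provinces.
    """
    national = wide.sum(axis=0)                      # one total per date column
    national.index = pd.to_datetime(national.index)  # dates become the index
    return national.sort_index().rename("cases")


# Toy table with made-up numbers: 2 provinces x 3 days.
wide = pd.DataFrame(
    {"2021-01-01": [3, 1], "2021-01-02": [0, 2], "2021-01-03": [4, 4]},
    index=["province_a", "province_b"],
)
series = to_univariate(wide)
```

The result is a single column of values indexed by date, which is the shape expected by the forecasting models discussed later.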
In time series analysis, seasonality and trend are two features of time series data that break many models. Seasonality refers to regular periodic fluctuations (i.e., periodic patterns). The trend, in contrast, represents an increase or decrease in the time series values over time. Both seasonality and trend should be detected, measured and removed from the time series under review (i.e., seasonal adjustment or deseasonalization). The sequence plot of the COVID-19 time series in Algeria exhibits neither seasonality nor trend, for either confirmed or death cases (Fig. 2).
Daily reported COVID-19 confirmed and death cases from 19 April 2020 to 29 July 2021.
According to (Nwogu et al., 2016), seasonality is a pattern that repeats with a fixed period and can be identified visually (e.g., a period of 4 for quarterly data, 12 for monthly data, etc.). Similarly, identifying the trend component is not very difficult, since the trend is monotonic (consistently increasing or decreasing). It is necessary to handle these two characteristics, given that the time series forecasting process relies on seasonal patterns or trend for future forecasting. Also, both ARIMA and Prophet models have properties that handle trends and seasonality.
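Besides visual inspection, a repeating pattern can be checked numerically with the sample autocorrelation at the suspected period. The sketch below is illustrative only: the 7-day period, the synthetic sine series and the function name `autocorrelation` are assumptions, not part of the study.

```python
import numpy as np


def autocorrelation(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return float(np.dot(centred[lag:], centred[:-lag]) / np.dot(centred, centred))


# Synthetic daily series with an exact 7-day cycle (illustration only).
t = np.arange(70)
weekly = np.sin(2 * np.pi * t / 7)

r7 = autocorrelation(weekly, 7)  # large and positive: the pattern repeats every 7 days
r3 = autocorrelation(weekly, 3)  # negative: values 3 days apart are out of phase
```

A series without seasonality would show no such pronounced peak at any fixed lag.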
Machine Learning algorithms for time series forecasting have become popular. In this study, ARIMA, Prophet and LSTM models are used to forecast the number of total confirmed cases and total deaths in Algeria. The selected models are trained using the officially reported data. In what follows, a brief introduction to the three models is provided. More specifications and details are given in the results and discussions section, followed by an evaluation of the models' performance for both confirmed and death cases.
The ARIMA model
Auto Regressive Integrated Moving Average (ARIMA) is the most widely used approach to time series forecasting (Selva, 2019). The ARIMA model has three parameters: p is the lag order, i.e., the number of lag observations included in the model; d is the number of differencing operations required to make the time series stationary; and q is the order of the MA term, i.e., the size of the moving average window. In short, p is the order of the Auto-Regressive (AR) model, q is the order of the Moving Average (MA) model and d is the order of Integration (I). A common ARIMA model is built in the following steps: (1) preprocess the data in order to clean missing values, (2) check the stationarity of the data, (3) plot the Auto Correlation Function (ACF) and the Partial Auto Correlation Function (PACF), and (4) construct the ARIMA model based on the data.
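For reference, the roles of p, d and q can be read off the standard textbook form of the model. The notation below is an assumption introduced here for illustration (it does not appear in the original): $B$ is the backshift operator ($B y_t = y_{t-1}$), $\phi_i$ and $\theta_j$ are the AR and MA coefficients, $c$ is a constant and $\varepsilon_t$ is white noise.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d}\, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

With $d = 1$, for example, the factor $(1 - B)\, y_t = y_t - y_{t-1}$ means that the AR and MA terms are fitted to the day-to-day changes rather than to the raw counts.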
The Prophet model
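Prophet (Taylor & Letham, 2018) is a model for forecasting univariate time series data, created and released by Facebook’s Core Data Science team as open-source software for both R and Python. Prophet is built in Stan (Carpenter et al., 2017), a programming language for statistical inference written in C++. It fits an additive model that decomposes the series into three components plus an error term, $y(t) = g(t) + s(t) + h(t) + \varepsilon_t$, where $g(t)$ is the trend function modelling non-periodic changes, $s(t)$ represents periodic changes (seasonality), $h(t)$ represents the effects of holidays and $\varepsilon_t$ is the error term. The reader is invited to read the article of Taylor et al. (Taylor & Letham, 2018) for more details about these components.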
The LSTM model
Long Short-Term Memory (LSTM) was first proposed in 1997 (Hochreiter & Schmidhuber, 1997). It is a special kind of Recurrent Neural Network (RNN) capable of remembering values from earlier stages for future use (Siami-Namini et al., 2019). LSTM was developed to overcome the problems of the conventional RNN by adding more module interactions. A typical LSTM model has a chain structure consisting of multiple modules called cells (Elsheikh et al., 2021). A typical LSTM cell is configured mainly by three gates: the input gate, the forget gate and the output gate. To control the state of each cell, the input gate adds information to the cell state, the forget gate removes the information that is no longer required by the model, and the output gate selects the information to be shown as output.
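The interplay of the three gates can be made concrete with one forward step of a standard LSTM cell written in NumPy. This is a sketch of the textbook formulation, not the network trained in this study; the function names, the stacking order of the weights and the toy dimensions are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_cell_step(x, h_prev, c_prev, W, U, b):
    """One forward step of a standard LSTM cell.

    Assumed layout: W (4H x D), U (4H x H) and b (4H,) stack the parameters
    of the input gate, forget gate, candidate values and output gate.
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to add
    f = sigmoid(z[H:2 * H])        # forget gate: how much old state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate: what to expose as output
    c = f * c_prev + i * g         # updated cell state
    h = o * np.tanh(c)             # updated hidden state (cell output)
    return h, c


# Toy usage: 1 input feature, 2 hidden units, random weights (illustration only).
rng = np.random.default_rng(0)
D, H = 1, 2
h, c = lstm_cell_step(
    np.array([0.3]), np.zeros(H), np.zeros(H),
    rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H),
)
```

Chaining this step over consecutive observations gives the chain structure described above, with the cell state carrying information from earlier days.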
Results and discussions
In order to adapt the collected Algerian COVID-19 dataset to proper time series analysis, it must be pre-processed. Among other steps, pre-processing involves estimating missing values, removing outliers and accounting for seasonal variation. In our case, there are no missing values, since zeros were entered in place of blanks when no new cases were reported. Also, the selected forecasting models require the data to be in a pandas DataFrame with a specific format. In Prophet, for example, the date column should be renamed to “ds”, and the number of cases column should be renamed to “y”. These are the standard names in all Prophet models.
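The renaming step for Prophet could look like the following pandas sketch. The helper name `to_prophet_frame` and the toy values are hypothetical; only the column names “ds” and “y” come from Prophet's input convention.

```python
import pandas as pd


def to_prophet_frame(series: pd.Series) -> pd.DataFrame:
    """Turn a date-indexed series into the two-column frame Prophet expects."""
    frame = series.rename("y").rename_axis("ds").reset_index()
    frame["ds"] = pd.to_datetime(frame["ds"])  # date stamps in datetime format
    return frame[["ds", "y"]]


# Toy daily counts (made-up numbers); a zero stands for "no new cases reported".
daily = pd.Series(
    [5, 0, 7],
    index=pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
)
prophet_df = to_prophet_frame(daily)
```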
Furthermore, in order to identify the order of the ARIMA model (i.e., the values of p, d and q), the authors of (Geurts et al., 1977) proposed to use the Auto Correlation Function (ACF) and the Partial Auto Correlation Function (PACF) of the sample data as the basic tools. We should first check the stationarity of the time series, which is a required condition for fitting the ARIMA model. To this end, we used the Augmented Dickey Fuller (ADF) test, which requires a p-value less than or equal to 0.05 (i.e., the threshold) for the data to be considered stationary. The ADF test indicates the degree to which the null hypothesis (i.e., the series is non-stationary) can be rejected or not, and thus whether our data is stationary or not. Table 2 shows that the
ADF tests of COVID-19 confirmed and death cases in Algeria
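To illustrate the idea behind the stationarity check, the sketch below implements the basic (non-augmented) Dickey-Fuller regression in NumPy. It is a simplification and not the test reported in Table 2: it omits the lagged difference terms of the ADF test, and instead of a p-value it compares the t-statistic with -2.86, the asymptotic 5% critical value for the constant-only case. The function names are hypothetical; libraries such as statsmodels provide a full implementation (`statsmodels.tsa.stattools.adfuller`).

```python
import numpy as np


def dickey_fuller_tstat(y):
    """t-statistic of gamma in the regression dy_t = alpha + gamma * y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance of beta
    return float(beta[1] / np.sqrt(cov[1, 1]))


def looks_stationary(y, critical_value=-2.86):
    """Reject the unit-root null when the statistic falls below the critical value."""
    return dickey_fuller_tstat(y) < critical_value


# Synthetic illustration: an oscillation around a fixed level versus a trending series.
t = np.arange(200.0)
flat = looks_stationary(np.sin(t))
trending = looks_stationary(t + np.sin(t))
```

A series that fails this kind of check is typically transformed or differenced before an ARIMA model is fitted.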
Due to its ease of use and popularity, we use the log transformation (Feng et al., 2014). As seen in Fig. 3, the results after transformation proved to be quite effective and the
ADF tests of COVID-19 confirmed and death cases after performing transformation.
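A log transformation and its inverse can be sketched as below. Because the series contains zeros (days with no reported cases), this sketch uses log(1 + y) rather than log(y); that choice, the first-order differencing helper and the function names are assumptions made here, since the text only states that a log transformation was applied.

```python
import numpy as np


def log_transform(y):
    """log(1 + y): defined for zero counts, compresses large peaks."""
    return np.log1p(np.asarray(y, dtype=float))


def inverse_log_transform(z):
    """Map transformed values (e.g., forecasts) back to the original scale."""
    return np.expm1(np.asarray(z, dtype=float))


def difference(z):
    """First-order differencing, corresponding to d = 1 in ARIMA."""
    return np.diff(z)


counts = [0, 12, 150, 1927]            # toy values spanning several magnitudes
transformed = log_transform(counts)
restored = inverse_log_transform(transformed)
```

Forecasts produced on the transformed scale need to be mapped back with the inverse before they are compared with the actual counts.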
Furthermore, based on results returned by ACF and PACF for confirmed cases, the estimated ARIMA model is ARIMA (4,1,1). We estimate to use four (4) AR terms (i.e.,
With regard to LSTM, the architecture consists of one input layer in addition to the output layer, and is trained for 20 epochs on the death cases dataset and 100 epochs on the confirmed cases dataset. Prophet, in contrast, requires the dataset to be arranged as a data frame with two columns: ‘ds’, the date stamp (in datetime format), and ‘y’, the measurement to forecast (i.e., numerical values). Since stationarity is not an explicit assumption for Prophet and LSTM, these models are trained on the original datasets (i.e., non-stationary datasets). Each dataset (confirmed and death cases) was split into two subsets, a training set and a test set: 70% of each dataset was used for training and the remaining 30% for evaluating the models. Accordingly, for both confirmed and death datasets, 326 datapoints (from 19 April 2020 to 11 March 2021) were used for training and the remaining 140 datapoints (from 12 March 2021 to 29 July 2021) for testing the selected models. After the training phase, we evaluate the prediction behaviour of the selected ML-based methods on the test sets. Overall, the predictions made with ARIMA, Prophet and LSTM closely resemble the real cases (Figs 4 and 5), except for extreme confirmed and death cases with Prophet, which shows noticeably high peaks.
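The 70/30 split keeps the chronological order, so the test set is always later in time than the training set. A minimal sketch, with a hypothetical helper name and integer placeholders instead of the real counts:

```python
def chronological_split(values, train_fraction=0.7):
    """Split an ordered sequence into an earlier training part and a later test part."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


observations = list(range(466))        # placeholders for 466 daily datapoints
train, test = chronological_split(observations)
```

With 466 datapoints this yields the 326 training and 140 test datapoints mentioned above. No shuffling is applied, since shuffling would let the models see observations from the future of the test period.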
Actual vs predicted COVID-19 confirmed cases.
Actual vs predicted COVID-19 death cases.
With regard to forecasting future trends, the forecast starts on 30 July 2021 and ends on 27 October 2021 (i.e., forecast horizon
Forecasting confirmed cases for the next 90 days.
Forecasting death cases for the next 90 days.
In the experiments, holidays are not considered, since the current version of Prophet does not allow setting the parameter country_holidays for Algeria. It should be noted that the spread of COVID-19 in Algeria was not impacted by holidays, by religious feasts such as Eid-Al-fitr, Eid-Al-Adha and Achoura, by official days such as Independence Day, Women’s and Mother’s days and Labour Day, or more generally by events that involve gatherings, such as weddings and funerals. This is explained by the fact that the Algerian government imposed a strict lockdown on the entire territory during the aforementioned events. Also, remarkably high peaks were reported after the lockdown was broken.
In addition, hyperparameter tuning is a delicate problem, since an optimal combination of hyperparameters must be selected for each learning algorithm. In Prophet, a changepoint is a point in time where the trend of the data changes direction. In the case of COVID-19 data, a changepoint may occur when new cases start declining after a peak once a vaccine is introduced, or when a new COVID-19 variant appears (e.g., Delta, Omicron, etc.). The changepoint range hyperparameter (changepoint_range) defaults to 0.8, which means that Prophet places potential changepoints evenly in the first 80% of the data. With this default value, Prophet is unable to capture trend changes (e.g., a slowdown in new COVID-19 cases) that fall in the remaining 20% of the data. In our study, the first 80% of the data covers cases from the beginning until 26 April 2021, with an average of 200 confirmed cases and 10 death cases. However, the maximum was reached on 28 July 2021, when 1927 confirmed cases and 49 deaths were recorded. Most cases reported in the last two months were due to the new COVID-19 Delta variant (i.e., the third wave of COVID-19). In this particular context, it is more important to understand the overall trend of cases in order to possibly predict when the pandemic will end. We should nevertheless keep the changepoint range at 80% or lower to ensure that the model avoids overfitting and can extrapolate to the last 20% on its own. With regard to LSTM forecasting, RNN-based models are more general: they are designed for sequence data rather than specifically for time series data, unlike ARIMA and Prophet. In sequence data, the order of the data matters but the time stamp is irrelevant, whereas in time series the data are ordered and collected at a fixed time interval between successive data points, as is the case for the COVID-19 data in Algeria.
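The effect of the changepoint range can be illustrated by computing where potential changepoints would fall. The sketch below mimics the even placement described above in plain NumPy; it is an approximation written for this illustration rather than Prophet's own code, the function name is hypothetical, and the value of 25 potential changepoints is assumed from Prophet's documented default for `n_changepoints`.

```python
import numpy as np


def potential_changepoint_indexes(n_obs, changepoint_range=0.8, n_changepoints=25):
    """Positions (0-based) of evenly spaced potential changepoints.

    Only the first `changepoint_range` fraction of the history is eligible,
    and the very first observation is excluded.
    """
    hist_size = int(np.floor(n_obs * changepoint_range))
    grid = np.linspace(0, hist_size - 1, n_changepoints + 1).round().astype(int)
    return grid[1:]


idx = potential_changepoint_indexes(466)
last_eligible = int(idx.max())   # no candidate lies in the final 20% of the series
```

Any change in trend occurring after the last eligible position can only be explained by the model through extrapolation of the last fitted slope.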
In order to evaluate the performance of the models, several metrics are computed on the test set for both confirmed and death cases (Table 3): the R2 score, Mean Absolute Error (MAE), Root-Mean-Square Error (RMSE) and Max error. The RMSE is a measure frequently used for assessing the accuracy of the predictions obtained by a model. In the case of Prophet, RMSE is calculated using the Prophet cross validation function with the following parameters (initial
Evaluation metrics of the ARIMA, Prophet and LSTM for confirmed and death cases
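The four criteria follow their usual definitions, which can be written compactly in NumPy. This is a sketch for illustration: the function name `evaluate` and the toy values are hypothetical, and the numbers are unrelated to Table 3.

```python
import numpy as np


def evaluate(y_true, y_pred):
    """R2 score, MAE, RMSE and Max error between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": np.mean(np.abs(err)),
        "rmse": np.sqrt(np.mean(err ** 2)),
        "max_error": np.max(np.abs(err)),
    }


# Toy example: three perfect predictions and one miss by 2.
scores = evaluate([1, 2, 3, 4], [1, 2, 3, 6])
```

Lower MAE, RMSE and Max error indicate better predictions, while an R2 score closer to 1 indicates a better fit. Because RMSE squares the errors, it penalizes large misses such as peak days more heavily than MAE.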
Judging by the RMSEs, it is clear that the ARIMA model outperforms Prophet and LSTM (smaller RMSE) in predicting confirmed cases, whereas the LSTM model has the best performance in predicting death cases. In both cases, Prophet does not yield good predictions, since it works best with time series that have strong seasonal effects, as stated by its authors (Taylor & Letham, 2018). From a practical point of view, high prediction accuracy was obtained after tuning the three parameters of ARIMA (i.e., p, d and q). Parameter selection for ARIMA and Prophet is time-consuming and each forecast requires refitting the model, whereas a single model fit suffices for LSTM.
Although many studies have dealt with COVID-19 time series forecasting in several countries, few have addressed outbreak prediction in Algeria. Our objective in this study is to compare how well ARIMA, Prophet and LSTM handle the Algerian COVID-19 dataset. The dataset contains both confirmed and death cases collected from 20 April 2020 to 29 July 2021. No trend or seasonality was observed in the dataset, which exhibits random patterns (i.e., no periodic fluctuations). Overall, the predictions made with ARIMA, Prophet and LSTM closely resemble the actual test data. In the forecasting stage, however, the ARIMA model outperformed Prophet and LSTM for confirmed cases, while LSTM performed best for death cases. The results presented in this study show clearly that the pandemic is still spreading, especially with the appearance of the COVID-19 Delta and Omicron variants. Consequently, protection measures such as contact restriction and lockdown should be strictly applied. In future work, we plan to investigate the effect of both vaccination and protection measures on the disease prevalence.
