Abstract
A time series prediction model was developed to forecast the number of confirmed COVID-19 cases in the Philippines from October 2022 to November 2022, based on the number of confirmed cases recorded from January 1, 2020 to September 20, 2022. Previous work by other scholars shows that time series models forecast well, particularly for date-indexed data. The study begins with inference from the original data, and each phase of inference is based on objective criteria, such as stationarity analysis using the ADF test and ACF plots. To compare the performance of time series algorithms, hundreds of algorithms are evaluated one by one on the same data source in order to find the best method. Once the method has been selected, the ADF test and ACF plots are applied again to validate it, resulting in closed-loop research.
Although the dataset in this study was drawn only from publicly available data for the Philippines (Our World in Data, COVID-19), the ARIMA model used to predict data beyond September 20, 2022 exhibited unusually high accuracy. Several algorithms were compared, each evaluated using the same training data. The best R² for the ARIMA model was initially 92.56% or higher, and iterative optimization of the function produced a predictive model with an R² of 97.6%. This reveals the potential trajectory of COVID-19 in the Philippines. Finally, the model with the best performance is chosen as the prediction model. In actual implementations, several subjective and objective factors, such as the government’s epidemic control measures, the worldwide pandemic situation, and whether the data source releases data in a timely way, might limit the prediction’s accuracy. Such prediction results can serve as a foundation for data releases by health agencies.
Introduction
Since its outbreak in December 2019, the novel coronavirus pneumonia (Coronavirus Disease 2019, COVID-19) has spread to countries and regions around the world via a polycentric transmission pathway. The World Health Organization (WHO) held a press conference on January 30, 2020, Geneva time, to announce that the novel coronavirus outbreak had been classified as a Public Health Emergency of International Concern (PHEIC). The number of confirmed COVID-19 cases, deaths, and affected countries continued to rise in the following period, and the epidemiological situation can be described as a pandemic. Despite the promotion of full vaccination, the outbreak continues to spread in a number of countries around the world. As of November 8, 2021, there had been 250,584,988 confirmed cases of COVID-19 and 5,064,202 deaths worldwide. In a single day, 352,844 new cases were confirmed globally, and 4,826 new deaths occurred. More than half of all new confirmed cases came from European countries, the serious epidemic in the United States had not improved, and the epidemics in Japan, Germany, India, and other regions continued to recur and were not effectively controlled. This global public health event is a long-term battle for all countries, and it emphasizes the need for the international community to work together, learn from experience, and improve measures to combat the virus effectively. According to academician Zhong Nanshan, if the current situation persists, the impact of the novel coronavirus on human health will gradually diminish, and the future will resemble that of influenza, with regular annual vaccination and a long-term coexistence between humans and the virus.
The Philippines, a Southeast Asian country, is also affected by COVID-19. According to Our World in Data, the number of confirmed cases increased slowly from March 2020 to July 2020. In just a few months, the number of cases rose from several hundred per day to several thousand per day. Because there is no cure, the main method of controlling the epidemic is to implement strict quarantine and lockdown procedures. Predicting the spread of COVID-19 has attracted the interest of many scientists, researchers, and policymakers because it aids in the development of appropriate plans and protective measures against the pandemic’s spread. Long-term forecasting of COVID-19 epidemics is essential for making critical decisions, particularly in densely populated places such as schools and office buildings. It helps identify preventive measures and restrictions that reduce disease spread and thus the economic impact. Furthermore, forecasting COVID-19 epidemics may aid in the development of vaccine and treatment programs, as well as the identification of hot spots that policymakers should prioritize. Accordingly, a number of methods for predicting the spread of the COVID-19 pandemic have been reported in the last two years. In many countries and regions, mathematical and statistical methods have been widely used to forecast the spread of COVID-19. In addition, artificial intelligence models have recently performed well in simulating the spread of the pandemic, and several studies have reported that they outperform traditional models [1].
Whether we use artificial intelligence or machine learning, we must first understand the concept of prediction: a technique that uses historical data and certain algorithms to estimate valuable but unobserved data. Many business activities require the prediction of variables such as the cost and quantity of goods. Researchers face many challenges in developing prediction models, and the main one is collecting historical data. There are two main sources of historical data. The first is data obtained directly by the collecting party, which we can call first-hand data; it is usually collected through surveys, focus interviews, and similar methods. The second is data published by third parties. Such data may have been used by many people, and its quality varies greatly, so it must be assessed carefully [2]. Good prediction performance depends not only on the model’s exact structure but also on its parameterization [3]. After the COVID-19 outbreak began in China in 2019, many scholars sought to predict the development of the outbreak using machine learning models. For example, in 2020 some scholars used an SIR model to predict the COVID-19 outbreak in South Carolina, U.S. Their SIR model differs from the traditional Bayesian model in that it specifies the transmission rate within each county and, optionally, transmission from neighbouring counties. Asymptomatic cases are also included in the transmission model [4]. To improve the accuracy of long-term forecasting for periods exceeding two weeks, two machine learning-based models have been proposed in the literature. The first is a recurrent neural network with two layers of long short-term memory (LSTM) blocks, while the second is a one-dimensional convolutional neural network that incorporates a selection optimization algorithm [5, 6].
To improve the prediction of COVID-19 transmission, a novel approach has been suggested that combines temporal pattern extraction using LSTM-based recurrent neural networks with prior spatial analysis using convolutional neural networks. This integrated approach considers both spatial and temporal effects, and the reported accuracy of the model is 99% [7]. Moreover, to overcome the limitations of previous research, a PHSM decision support system based on an LSTM autoencoder has been developed and implemented. The system employs multiple output strategies to predict the number of confirmed cases per day over multiple time periods, and it uses an anomaly detection technique to quickly identify the impact of different strategies, giving public health stakeholders more accurate and reliable decision support [8]. Because COVID-19 has been present for three years, data on the disease are quite extensive. Researchers have used artificial neural networks to predict COVID-19 outbreaks, using ARDL models to analyze the validity of the variables that affect COVID-19 data; this approach was then improved by combining machine learning algorithms [9]. Spatial modeling approaches have been employed to investigate the temporal spread of COVID-19 in relation to demographic and built environment factors. Both random and aggregated models have been utilized, and the spatial variability of each partition has been used to reflect the diverse trends in the incidence density of COVID-19 associated with the selected influencing factors [10]. ARIMA models can also predict confirmed cases of COVID-19 [11, 12]. For prediction, the Poisson-Gamma model has been combined with Bayesian analysis [13].
A fuzzy similarity metric has also been proposed that captures heterogeneous fit expectations and volatility in predicted values in addition to the mean error, particularly when cross-validation methods are used. The metric was tested with cross-validation in order to select models that predict future values of the number of COVID-19 cases, which has relatively stable characteristics [14]. Others have improved the short-term predictive accuracy of autoregressive models for the number of COVID-19 deaths by adding time series of online search activity, compared with models based only on confirmed COVID-19 cases or deaths, in multiple countries/regions [15].
Experimental procedure
General information
Research roadmap.
This study employs the autoregressive integrated moving average (ARIMA) model from time series analysis to predict the number of confirmed COVID-19 cases in the Philippines. ARIMA handles non-stationary time series data through differencing: once the differenced series is stationary, it can be modelled with an autoregressive moving average (ARMA) model.
The distinctive aspect of the ARIMA model is that it does not assume a specific pattern in the historical data of the series being predicted. Instead, it employs an iterative method to identify candidate models from a general model class, which are then validated against historical data to ensure that they represent the series accurately. The model consists of three components, namely Autoregressive (AR), Integrated (I), and Moving Average (MA).
In summary, this study proceeds through successive layers of analysis and experiments, starting from the original data, to obtain the desired results. ARIMA is an established approach in time series analysis and is used here to predict the spread of COVID-19 in the Philippines. A time series must be stationary before a model can be established. If the series fails the stationarity test, it must be transformed into a stationary sequence by differencing, and the number of differencing passes required is called the order of integration. The model is written ARIMA (p, d, q), where p is the number of autoregressive terms, q is the number of moving average terms, and d is the order of differencing required to achieve a stationary sequence. The critical point of this model is differencing: prediction with the ARIMA model requires stationary time series data, and a model fitted to non-stationary data will not capture the structure of the series [16].
The ARIMA model is a hybrid of AR and MA models. Compared with the ARMA model, the ARIMA model is more suitable for non-stationary time series. If an ARMA model is used for a non-stationary time series, the work is divided into two steps. First, the non-stationary time series is transformed into a stationary time series through differencing. Then, an ARMA model is fitted to the differenced series.
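The differencing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names `difference` and `undifference` and the sample counts are assumptions made for the example.

```python
def difference(series, d=1):
    """Apply d rounds of first-order differencing: y'_t = y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


def undifference(start_value, diffs):
    """Invert one round of differencing by cumulative summation."""
    level = start_value
    restored = []
    for step in diffs:
        level += step
        restored.append(level)
    return restored


# Made-up cumulative case counts, for illustration only.
cumulative = [100, 130, 170, 220, 280]
first_diff = difference(cumulative, d=1)   # daily increments
second_diff = difference(cumulative, d=2)  # change in the daily increments
print(first_diff, second_diff)
```

Each round of differencing shortens the series by one observation, and forecasts made on the differenced scale have to be summed back (as in `undifference`) to return to the scale of total cases.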
Beyond differencing, the ARIMA model has the following advantages and disadvantages. The advantage is that it is very simple: it requires only the endogenous variable, without the help of exogenous variables. The corresponding disadvantage is that the series must be stationary, either originally or after differencing.
The component models are defined below:
1) AR Model
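To characterize the dependence of the present observation on past observations, the autoregressive model requires the series to be stationary. In standard notation, the pth-order autoregressive process is typically specified as

$$y_t = \mu + \sum_{i=1}^{p} \gamma_i y_{t-i} + u_t$$

In the context of autoregressive modeling, the current value of a time series is denoted as $y_t$, $\mu$ is a constant term, $\gamma_i$ are the autoregressive coefficients, and $u_t$ is the random perturbation term.
Expansion of the equation:

$$y_t = \mu + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + u_t$$

When the random perturbation term follows a white noise process ($u_t = \epsilon_t$), the autoregressive (AR) model is characterized as a pure AR (p) process, denoted as:

$$y_t = \mu + \gamma_1 y_{t-1} + \gamma_2 y_{t-2} + \cdots + \gamma_p y_{t-p} + \epsilon_t$$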
The autoregressive (AR) model relies on historical data to forecast future values, where p is the number of lagged observations included in the model. Nonetheless, this technique has several limitations. Firstly, the AR model depends solely on the series’ own data for predictions. Secondly, the time series data must be stationary. Additionally, the method requires the presence of autocorrelation, and the AR model is not recommended when the autocorrelation coefficient is low.
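As a minimal sketch of the autoregressive idea, the simplest case, AR (1), can be estimated by ordinary least squares of each value on its predecessor. The helper names `fit_ar1` and `simulate_ar1`, the symbols `mu` and `gamma`, and the simulated data are assumptions for illustration; the study itself does not state how its coefficients were estimated.

```python
import random


def fit_ar1(series):
    """Least-squares estimate of (mu, gamma) in y_t = mu + gamma * y_{t-1} + e_t."""
    x = series[:-1]
    y = series[1:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    gamma = sxy / sxx
    mu = mean_y - gamma * mean_x
    return mu, gamma


def simulate_ar1(mu, gamma, n, seed=0):
    """Generate a stationary AR(1) path driven by Gaussian white noise."""
    rng = random.Random(seed)
    path = [mu / (1.0 - gamma)]  # start at the stationary mean
    for _ in range(n - 1):
        path.append(mu + gamma * path[-1] + rng.gauss(0.0, 1.0))
    return path


series = simulate_ar1(mu=2.0, gamma=0.6, n=2000, seed=42)
mu_hat, gamma_hat = fit_ar1(series)
one_step_forecast = mu_hat + gamma_hat * series[-1]
print(mu_hat, gamma_hat, one_step_forecast)
```

When the estimated coefficient is close to zero, the lagged value explains little of the current one, which is the situation in which the text advises against an AR model.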
2) MA model
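The moving average model is the other component of the autoregressive moving average model; it focuses on the accumulation of error terms. When the random perturbation term $u_t$ of the autoregressive model is not white noise, it is usually modelled as a moving average of order q:

$$u_t = \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \cdots + \theta_q \epsilon_{t-q}$$

where $\epsilon_t$ is a white noise process and $\theta_1, \dots, \theta_q$ are the moving average coefficients.
The moving average (MA) model is obtained when the current value of a time series depends solely on a linear combination of historical white noise, i.e., the present observation does not depend directly on past observations. Notably, in the autoregressive (AR) model, the impact of historical white noise on the current forecast value is indirect: it acts through the past values of the series. In standard notation, the equation is specified as:

$$y_t = \mu + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$$

Forecasting with a moving average can effectively eliminate random fluctuations.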
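To make the moving average idea concrete, the following sketch simulates an MA (1) process from white noise and compares its sample autocorrelation with the known theoretical value, theta / (1 + theta^2) at lag 1 and zero beyond. The function names, the value theta = 0.8, and the seed are assumptions chosen for the example.

```python
import random


def simulate_ma1(theta, n, seed=0):
    """Generate y_t = e_t + theta * e_{t-1} from Gaussian white noise e_t."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]
    return [noise[t] + theta * noise[t - 1] for t in range(1, n + 1)]


def lag_autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    numerator = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return numerator / sum(d * d for d in dev)


theta = 0.8
series = simulate_ma1(theta, n=5000, seed=1)
theoretical_lag1 = theta / (1 + theta ** 2)
sample_lag1 = lag_autocorrelation(series, 1)
sample_lag2 = lag_autocorrelation(series, 2)
print(theoretical_lag1, sample_lag1, sample_lag2)
```

The cut-off of the autocorrelation after lag q is the property that makes the ACF plot useful for choosing the moving average order.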
3) ARMA model
Combining AR (p) and MA (q) yields the general autoregressive moving average model ARMA (p, q) [17].
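For reference, the combined model can be written out in standard textbook notation. The symbols here are assumptions of this sketch rather than the authors' own: $y_t$ is the observed value, $\mu$ a constant, $\gamma_i$ the autoregressive coefficients, $\theta_j$ the moving average coefficients, and $\epsilon_t$ white noise.

```latex
y_t = \mu + \sum_{i=1}^{p} \gamma_i \, y_{t-i} + \epsilon_t + \sum_{j=1}^{q} \theta_j \, \epsilon_{t-j}
```

Setting q = 0 recovers the pure AR (p) model, and setting p = 0 recovers the pure MA (q) model.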
4) ARIMA model
The ARIMA model comprises the AR and MA models together with the differencing step I, which ensures that the data are stationary. The order of differencing is denoted by d. Combining these components gives the ARIMA (p, d, q) model, where p and q are the orders of the AR and MA models, respectively. The differenced data are then used to fit the ARMA model. These steps build an autoregressive integrated moving average model that predicts future values of a time series. The details of this model have been discussed in the literature [18].
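The same model can be stated compactly with the backshift operator $B$, defined by $B y_t = y_{t-1}$. This is the standard textbook form, given here as a sketch; the symbols $\mu$, $\gamma_i$, $\theta_j$, and $\epsilon_t$ (constant, AR coefficients, MA coefficients, white noise) are assumed notation.

```latex
\left(1 - \sum_{i=1}^{p} \gamma_i B^{i}\right) (1 - B)^{d} \, y_t
  = \mu + \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \epsilon_t
```

The factor $(1 - B)^{d}$ is the d-fold differencing. With p = 0, d = 1, and q = 0 the expression reduces to $y_t - y_{t-1} = \mu + \epsilon_t$, a random walk with drift.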
The data in this paper span the period from January 1, 2020 to September 20, 2022. The location is the Philippines, the event is a confirmed case of COVID-19, the total number of records is 973, and the data provider is Our World in Data. The file format is .csv. Fig. 2 lists the two columns: Date and Total cases.
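A two-column file of this kind can be read with the Python standard library. The sketch below is an assumption-laden illustration: the column names `Date` and `Total cases` follow the description above, while the ISO date format, the function name `load_total_cases`, and the sample rows are made up for the example.

```python
import csv
import io
from datetime import date


def load_total_cases(handle):
    """Read (date, total_cases) pairs from a CSV with 'Date' and 'Total cases' columns."""
    rows = []
    for row in csv.DictReader(handle):
        # Assumes ISO dates such as 2022-01-01; adjust parsing if the export differs.
        rows.append((date.fromisoformat(row["Date"]), int(row["Total cases"])))
    rows.sort(key=lambda item: item[0])  # keep the series in chronological order
    return rows


# Made-up rows standing in for the real export.
SAMPLE_CSV = "Date,Total cases\n2022-01-02,12\n2022-01-01,10\n2022-01-03,15\n"

records = load_total_cases(io.StringIO(SAMPLE_CSV))
total_cases = [count for _, count in records]
print(records[0], total_cases)
```

Sorting by date before modelling matters because every later step (differencing, autocorrelation, forecasting) assumes consecutive observations.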
The list of data.
Judging whether the series is white noise
Checked with the Ljung-Box test.
The test of Ljung-Box
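The null hypothesis of the Ljung-Box test is that the series is white noise, that is, it has no autocorrelation up to the tested lag; a small p-value rejects this hypothesis.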
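The Ljung-Box statistic can be computed directly from the sample autocorrelations as Q = n (n + 2) Σ ρ_k² / (n − k), summed over lags k = 1..h. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the series is made up, and the p-value helper uses the closed-form chi-square tail that holds only for an even number of lags.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    numerator = sum(dev[t] * dev[t - lag] for t in range(lag, n))
    return numerator / sum(d * d for d in dev)


def ljung_box_q(series, lags):
    """Ljung-Box Q statistic over lags 1..lags."""
    n = len(series)
    total = sum(autocorrelation(series, k) ** 2 / (n - k) for k in range(1, lags + 1))
    return n * (n + 2) * total


def chi2_sf_even(x, dof):
    """P(X > x) for a chi-square variable with an even number of degrees of freedom."""
    if dof <= 0 or dof % 2:
        raise ValueError("this closed form needs a positive even dof")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Made-up trending series standing in for cumulative case counts.
trending = [100 + 3 * t for t in range(60)]
q_stat = ljung_box_q(trending, lags=10)
p_value = chi2_sf_even(q_stat, dof=10)
print(q_stat, p_value)
```

A strongly trending series such as cumulative counts produces a very large Q, so the white-noise hypothesis is rejected and the series is worth modelling.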
Determining whether the series is stationary
Checked with the ADF and KPSS tests.
The test of ADF and KPSS
The ADF test is used to determine whether or not a time series is stationary; it is a hypothesis test whose null hypothesis is that the data are not stationary. In reading the ADF results, we are primarily interested in the test statistic and its p-value: a p-value below the chosen significance level rejects the null hypothesis of non-stationarity.
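The logic of the test can be illustrated with the plain Dickey-Fuller regression, Δy_t = α + β y_{t-1} + e_t, whose t-statistic on β is compared with a tabulated critical value. This sketch is a simplification and not the procedure used in the study: it omits the augmentation lags of the full ADF test, the function names are hypothetical, the data are simulated, and −2.86 is the asymptotic 5% critical value for a regression with a constant.

```python
import math
import random

DF_CRITICAL_5PCT = -2.86  # asymptotic 5% critical value, regression with a constant


def dickey_fuller_t(series):
    """t-statistic of beta in the regression dy_t = alpha + beta * y_{t-1} + e_t."""
    x = series[:-1]
    dy = [later - earlier for earlier, later in zip(series, series[1:])]
    n = len(x)
    mean_x = sum(x) / n
    mean_dy = sum(dy) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (d - mean_dy) for a, d in zip(x, dy))
    beta = sxy / sxx
    alpha = mean_dy - beta * mean_x
    ssr = sum((d - alpha - beta * a) ** 2 for a, d in zip(x, dy))
    standard_error = math.sqrt(ssr / (n - 2) / sxx)
    return beta / standard_error


def is_stationary(series, critical=DF_CRITICAL_5PCT):
    """Reject the unit-root null when the statistic falls below the critical value."""
    return dickey_fuller_t(series) < critical


rng = random.Random(7)
stationary = [0.0]
for _ in range(499):
    stationary.append(0.2 * stationary[-1] + rng.gauss(0.0, 1.0))

random_walk = [0.0]
for _ in range(499):
    random_walk.append(random_walk[-1] + rng.gauss(0.0, 1.0))

print(dickey_fuller_t(stationary), is_stationary(stationary))
print(dickey_fuller_t(random_walk), is_stationary(random_walk))
```

The statistic does not follow the usual t distribution under the null, which is why a dedicated critical value is used instead of the familiar −1.96.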
Stability test statistical analysis table
ACF and PACF.
Fig. 3 shows graphically that the series is not stationary.
Stationarity can also be assessed by examining the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots, which display the autocorrelation and partial autocorrelation coefficients of the series at each lag.
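ACF formula (standard definition at lag k):

$$\rho_k = \frac{\mathrm{Cov}(y_t, y_{t-k})}{\mathrm{Var}(y_t)}$$

PACF formula (standard definition at lag k):

$$\phi_{kk} = \mathrm{Corr}(y_t, y_{t-k} \mid y_{t-1}, \dots, y_{t-k+1})$$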
This allows for a closer examination of the relationship between an observation and its lagged values. In this context, the maximum lag point of the partial autocorrelation coefficient (PACF) graph can be used to estimate the autoregressive order p.
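Both functions can be computed without a statistics library. In the sketch below the sample ACF follows the usual definition and the PACF is obtained from it with the Durbin-Levinson recursion; the function names and the simulated AR (1) series are assumptions for illustration, not the data behind Fig. 3.

```python
import random


def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


def pacf(series, max_lag):
    """Partial autocorrelations for lags 0..max_lag via Durbin-Levinson."""
    rho = acf(series, max_lag)
    result = [1.0]
    previous = []  # previous[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            current = [phi_kk]
        else:
            numerator = rho[k] - sum(previous[j] * rho[k - 1 - j] for j in range(k - 1))
            denominator = 1.0 - sum(previous[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = numerator / denominator
            current = [previous[j] - phi_kk * previous[k - 2 - j] for j in range(k - 1)]
            current.append(phi_kk)
        result.append(phi_kk)
        previous = current
    return result


# Simulated AR(1) series with coefficient 0.6, for illustration only.
rng = random.Random(3)
ar1 = [0.0]
for _ in range(2999):
    ar1.append(0.6 * ar1[-1] + rng.gauss(0.0, 1.0))

acf_values = acf(ar1, 5)
pacf_values = pacf(ar1, 5)
print(acf_values)
print(pacf_values)
```

For an AR (1) series the ACF decays gradually while the PACF is large at lag 1 and near zero afterwards, which is the pattern read off the plots when choosing p.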
Based on the results of these two diagnostic methods, it can be concluded that the original data are not stationary and require preprocessing before proceeding with the model. Table 3 presents the results obtained after applying first and second-order differencing. Following first-order differencing, the series becomes stationary, so first-order differencing is used in the model.
Algorithm performance comparison
Diagnostic.
Fig. 4 mainly shows the trend of the data after first- and second-order differencing.
To reduce interference in the subsequent prediction, the various time series algorithms are evaluated on the same data and ranked by their performance as measured by R².
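The ranking step can be sketched as follows, using the usual definition R² = 1 − SS_res / SS_tot. The candidate names and all numbers below are made up for illustration; they are not the algorithms or results evaluated in the study.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rank_by_r_squared(actual, candidates):
    """Return (name, score) pairs sorted from best to worst R²."""
    scores = {name: r_squared(actual, preds) for name, preds in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical observations and predictions from three placeholder models.
actual = [10.0, 12.0, 15.0, 19.0, 24.0]
candidates = {
    "model_a": [10.5, 12.5, 14.5, 19.5, 23.5],
    "model_b": [12.0, 12.0, 16.0, 16.0, 20.0],
    "mean_only": [16.0, 16.0, 16.0, 16.0, 16.0],
}

ranking = rank_by_r_squared(actual, candidates)
print(ranking)
```

A model that always predicts the mean of the observations scores exactly zero, so R² measures the improvement over that baseline.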
Stability test statistical analysis table for ARIMA
ARIMA’s evaluation data
ARIMA diagnostic.
The formula is not related to the number of
Fig. 5 shows that with first-order differencing the selected model is ARIMA (0, 1, 0).
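An ARIMA (0, 1, 0) model is a random walk, optionally with a drift term, so its point forecast has a simple closed form: the last observed value plus the average historical increment for each step ahead. The sketch below assumes a drift term is included, which the text does not state; the function name and the numbers are illustrative.

```python
def arima_010_forecast(series, steps):
    """Point forecasts of a random walk with drift: y_T + h * mean(diff)."""
    diffs = [later - earlier for earlier, later in zip(series, series[1:])]
    drift = sum(diffs) / len(diffs)
    return [series[-1] + drift * h for h in range(1, steps + 1)]


# Made-up cumulative counts, for illustration only.
history = [10, 12, 14, 16]
forecast = arima_010_forecast(history, steps=3)
print(forecast)
```

Because the forecast is a straight-line extension of the average past growth, it tracks a slowly changing cumulative series well over short horizons and drifts away when the growth rate changes.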
From Table 5, we can see that the
Determining the best algorithm and prediction
Fig. 6 shows that the forecast extends the trend smoothly.
Table 6 shows that the R² of ARIMA is 97.6%.
Table 7 compares the predicted and actual values from September 16, 2022 to September 20, 2022. Because the data for these 5 days were held out as a test set, the comparison reflects the model’s real performance.
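The hold-out comparison can be sketched as below: the last five observations are set aside, a forecast is produced from the remaining data, and the per-day percentage error is reported. The function names, the sample counts, and the use of a simple last-value forecast as a stand-in model are assumptions for illustration only.

```python
def holdout_split(series, test_size=5):
    """Split a series into a training part and the final test_size observations."""
    if not 0 < test_size < len(series):
        raise ValueError("test_size must be between 1 and len(series) - 1")
    return series[:-test_size], series[-test_size:]


def absolute_percentage_errors(actual, predicted):
    """Per-observation absolute error as a percentage of the actual value."""
    return [100.0 * abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


# Made-up cumulative counts; the last five play the role of the test set.
series = [1000, 1010, 1025, 1040, 1060, 1075, 1090, 1110, 1125, 1140]
train, test = holdout_split(series, test_size=5)

# Stand-in forecast: repeat the last training value (not the study's model).
predicted = [train[-1]] * len(test)
errors = absolute_percentage_errors(test, predicted)
print(train, test)
print(errors)
```

Keeping the test days out of the fitting step is what makes the comparison an honest estimate of forecasting error rather than a measure of fit.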
Predicted values are compared with actual values
ARIMA forecast trend chart.
Because the data on total confirmed cases in the Philippines cover January 2020 to September 2022, validation of the predicted results begins on September 21, 2022 and ends 14 days later, on October 4, 2022. The predictions are considered satisfactory if they are proportionally close to the actual values and essentially in line with the current trend of the epidemic.
Actual vs. projected values from September 21, 2022 to October 4, 2022
As Table 8 shows, the data from September 21, 2022 onward lie outside the range of the original data, and the values in the table are obtained entirely through the forecasting algorithm.
As previously stated, the ARIMA model provides the best R² performance in the prior set of experiments. This result is validated once again by comparing the forecasts with the real numbers. Furthermore, first-order differencing makes the original data stationary, as demonstrated by the ADF test and ACF plots in the experiment. This estimate is for the total number of confirmed cases in the Philippines. Daily new cases may be more relevant for health-care needs, but they depend too heavily on surrounding variables, and the data can be unpredictable. Time series methods may therefore not always yield the expected results.
The dataset used in this study was derived exclusively from publicly available data on COVID-19 in the Philippines, obtained from Our World in Data. Among the algorithms evaluated on the same training data, the ARIMA model stood out for its predictive accuracy when forecasting data beyond September 20, 2022. During the evaluation, the ARIMA model consistently achieved an R² of 92.56% or higher, and iterative optimization of the model’s function produced a predictive model with an R² of 97.6%. These findings give insight into the potential trajectory of COVID-19 in the Philippines. The high accuracy of the ARIMA model suggests that it is suitable for forecasting future trends and for supporting decisions in combating the pandemic: with this model, health agencies and policymakers can better understand the spread of the virus and devise strategies to control and mitigate its impact. The success of the ARIMA model depends on the quality and timeliness of the data source, so continued availability of up-to-date and reliable data is crucial for maintaining the accuracy of the predictions. Further research and validation of the model, along with continued efforts to improve data collection and analysis, will contribute to robust prediction models that support proactive measures in managing and controlling future outbreaks.
Conclusion
After thorough evaluation, the prediction model with the highest performance has been selected. However, it is crucial to acknowledge that the accuracy of predictions in practical implementations may be influenced by a range of subjective and objective factors. These include the effectiveness of government measures in combating the epidemic, the global pandemic situation, and the timeliness of data distribution from the data source. While the chosen prediction model provides a solid foundation for data releases by health agencies, it is essential to recognize that the accuracy of predictions is not guaranteed due to these limiting factors. The future outlook for prediction models in this context is optimistic, as ongoing advancements in technology and data collection methodologies are likely to enhance their performance and reliability. Furthermore, increased collaboration between governments, health agencies, and data providers can contribute to more accurate predictions and facilitate better decision-making in epidemic defense. Looking ahead, it is imperative to continuously monitor and evaluate the performance of prediction models, incorporating feedback from real-world implementation. This iterative process will aid in refining the models, addressing their limitations, and ensuring their effectiveness in providing reliable insights for health agencies. By leveraging the potential of predictive analytics and emerging technologies, we can aspire to develop even more robust and accurate prediction models, assisting governments and health organizations in mitigating the impact of future pandemics.
Acknowledgments
The authors thank Our World in Data for providing the data, as well as the reviewers and editors for their efforts.
