Abstract
Background
The COVID-19 pandemic underscores the necessity for proactive measures against emerging diseases, epitomized by WHO's “Disease X.” Among the myriad of indicators tracking COVID-19 progression, the count of hospitalized patients assumes a pivotal role. This metric facilitates timely responses from government agencies, enabling proactive allocation and management of medical resources.
Objective
In this study, we introduce a novel hybrid intelligent approach, the EMD&LSTM-ARIMA model.
Results
Our analysis reveals that all forecast error rates remain below 10%, with Mean Absolute Percentage Error (MAPE) values of 2.30%, 3.33%, 1.63%, and 2.89% for the UK, Canada, Italy, and Japan, respectively.
Conclusion
Our proposed EMD&LSTM-ARIMA model demonstrates robust forecasting performance, particularly for COVID-19 hospitalization data.
Introduction
The World Health Organization (WHO) is convening over 300 scientists to examine evidence from more than 25 virus families and bacteria, and to conceptualize “Disease X”. 1 Introduced by the WHO in 2018, Disease X represents a hypothetical, yet potentially devastating pathogen that could cause significant outbreaks, epidemics, or pandemics. This concept underscores the necessity for global preparedness against unknown health threats. COVID-19, which emerged as a novel coronavirus in December 2019, exemplifies Disease X by rapidly evolving into a global pandemic that has affected over 200 countries.2,3 The pandemic has severely strained healthcare systems worldwide, particularly in the United States, where surging infection rates have overwhelmed hospitals, delayed elective surgeries, and caused persistent shortages of medical personnel. By January 2022, the demand for healthcare workers had more than doubled compared to the early pandemic period, revealing enduring challenges in the healthcare sector. 4 The experience with COVID-19 highlights the urgent need for sustained vigilance and collaboration among governments, health authorities, and academia to manage and mitigate the impacts of future pandemics. The conceptual framework of Disease X remains vital for global health security and preparedness.
Since the onset of the COVID-19 pandemic, researchers worldwide have developed and deployed predictive models to understand the potential severity of future outbreaks, drawing on a diverse array of methods and tools. Mohamadou reviewed 61 scholarly articles, reports, fact sheets, and web resources on COVID-19. The review found that most mathematical models were based on the Susceptible-Exposed-Infected-Removed (SEIR) and Susceptible-Infected-Recovered (SIR) frameworks, while AI-based implementations predominantly used Convolutional Neural Networks (CNN), particularly for analyzing X-ray and CT images. 5 Sun introduced the Dynamic-Susceptible-Exposed-Infective-Quarantined model, which extends the SEIR model with machine learning-driven parameter optimization under epidemiologically rational constraints. 6 Samui devised a compartmental mathematical model to predict and manage the transmission dynamics of COVID-19 in India. 7 Santosh critically examined the limitations of the SEIR and SIR methodologies and, arguing against conventional stochastic and/or discrete models that rely on randomly generated parameters, proposed a shift towards more intricate models that incorporate continuous and unprecedented factors. 8
Given growing computational capability, the rising demand for versatile predictive frameworks, the relatively short history of the disease, and the uncertainties in input data and prediction methods, many studies have turned to machine learning and deep learning to predict COVID-19 outcomes. Shastri presented an approach for forecasting COVID-19 cases up to one month in advance. Rooted in recurrent neural network principles, it integrated several LSTM variants, including Stacked LSTM, Bi-directional LSTM, and Convolutional LSTM. 9 Sinha compared artificial neural network and recurrent neural network-based LSTM models using data from five countries; the results favored the LSTM model over the artificial neural network. 10 Verma designed a suite of neural network models, collectively referred to as the “vanilla LSTM.” This suite comprised LSTM, Encoder-Decoder-LSTM, Bi-directional LSTM, CNN, and a hybrid CNN + LSTM model, built to capture the trends of the COVID-19 outbreak. 11 Sah et al. compared several models on an Indian dataset and found that the stacked LSTM-gated recurrent unit model was superior to the others in forecasting the number of confirmed and active cases. 12 A 48-day short-term prediction of cumulative infections, cumulative deaths, and active cases in India, based on a trained Holt-Winters model, reproduced the observed values reasonably well. 13
However, despite the proliferation of predictive methods, few studies focus specifically on forecasting the number of hospitalized patients. Nguyen used a multivariate time-series framework, the vector error correction model, to compute cross-correlations between hospital census and local infection incidence with a lag of up to 21 days, and applied it to forecast the COVID-19 hospital census. 14 Bekker devised a model based on linear programming and applied it to the Netherlands to predict hospital admissions and bed occupancy for COVID-19 patients. 15 Song used four machine learning prediction models to anticipate hospitalization among older adults who tested positive for COVID-19 and concluded that the random forest model performed best. 16 Another LSTM-based model used the count of hospitalized COVID-19 cases as one parameter to forecast the time series of new daily positive cases; its results underscored the roles of vaccination effectiveness and the transmissibility of viral variants in shaping future forecasts of daily positive cases. 17
Accurately quantifying the hospitalization rate is profoundly significant, as this metric plays a crucial role in enabling prompt and effective responses from relevant governmental entities. It facilitates the prudent allocation and preemptive deployment of critical medical resources. Despite extensive research on predicting COVID-19 outcomes, a notable gap remains in studies specifically targeting the accurate prediction of hospitalization rates. Addressing this gap is essential for enhancing healthcare system resilience and ensuring adequate resource allocation during pandemics. The remainder of this paper is structured as follows: Section 2 introduces the proposed hybrid intelligence method, “EMD&LSTM-ARIMA”, and details its components. Section 3 describes the experimental setup and the data used in the study. Section 4 discusses the results and evaluates the performance of the proposed model. Section 5 provides an advanced forecasting example to demonstrate the efficacy of the EMD&LSTM-ARIMA model. Finally, Section 6 concludes the paper and outlines potential directions for future research.
Proposed model
The flow chart of the EMD&LSTM-ARIMA model
The flow chart of the proposed EMD&LSTM-ARIMA model is shown in Figure 1 and comprises five main steps. The following subsections describe each step in detail.

The process flow chart of the EMD&LSTM-ARIMA model.
Our research uses COVID-19 hospitalization data, specifically the count of hospitalized patients. These data form the basis of a well-structured, standardized training dataset for the predictive model. Data normalization maps data with differing units or magnitudes onto a uniform interval, typically [0, 1], through a predefined formula. It is needed because LSTM is sensitive to the magnitude of its inputs, so a consistent data range reduces discrepancies and improves model performance. The EMD&LSTM-ARIMA model uses the maximum-minimum method, which transforms all data points uniformly into the [0, 1] interval. After the forecasting calculation is complete, the normalization formula is inverted to restore the forecasted results to their original scale, which facilitates interpretation and comparison with other non-normalized datasets. The data normalization formula is as follows:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
where $x$ is an original value, $x_{\min}$ and $x_{\max}$ are the minimum and maximum of the series, and $x'$ is the normalized value.
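A minimal sketch of the maximum-minimum scaling and its inversion is given here, assuming plain Python lists; the function names and the sample counts are hypothetical and chosen only for illustration.

```python
def fit_min_max(values):
    """Return the (min, max) pair used to scale a series into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("a constant series cannot be min-max scaled")
    return lo, hi


def normalize(values, lo, hi):
    """Map each value into [0, 1] with the maximum-minimum method."""
    return [(v - lo) / (hi - lo) for v in values]


def denormalize(scaled, lo, hi):
    """Invert the scaling to restore values to their original scale."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical daily hospitalization counts, for illustration only.
counts = [120.0, 150.0, 180.0, 240.0, 210.0]
lo, hi = fit_min_max(counts)
scaled = normalize(counts, lo, hi)
restored = denormalize(scaled, lo, hi)
```

In practice the minimum and maximum would be taken from the training data and reused unchanged when the forecasts are mapped back to patient counts.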
EMD is a data-adaptive multiresolution technique that decomposes complex signals into distinct, physically meaningful components. Notably, it performs this decomposition without relying on predefined basis functions. 18
EMD is particularly well suited to naturally occurring signals, which are typically non-linear and non-stationary. It extracts signal components progressively from higher to lower frequencies in the time domain. These components, known as Intrinsic Mode Functions (IMFs), capture the inherent fluctuations of the signal at different scales, and the method also adaptively extracts the underlying trend of the sequence. 19
The formulas for EMD are as follows:
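For reference, the standard EMD formulation can be written as below. This is a sketch under the usual sifting notation, which is an assumption here: $x(t)$ is the input series, $e_{\max}(t)$ and $e_{\min}(t)$ are the upper and lower envelopes through the local extrema, $c_i(t)$ is the $i$-th IMF, and $r_n(t)$ is the residual (Res), with $n = 6$ in this study.

```latex
% One sifting step: subtract the mean of the two envelopes
m_1(t) = \frac{e_{\max}(t) + e_{\min}(t)}{2}, \qquad
h_1(t) = x(t) - m_1(t)

% After sifting converges, the series is the sum of the IMFs and the residual
x(t) = \sum_{i=1}^{n} c_i(t) + r_n(t)
```

Sifting is repeated on $h_1(t)$ until it satisfies the IMF conditions; the resulting IMF is subtracted from the series, and the procedure continues on the remainder.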
The assignment of IMFs is the linchpin of the predictive process: the six decomposed IMFs are allocated between the LSTM model and the ARIMA model. The outputs of the two models are then aggregated, and the aggregate is used to evaluate forecasting performance by comparing the forecasted values (FV) against the actual values (AV). If the forecasts fall short of the desired accuracy, the IMF assignment is recalibrated by adjusting the allocation proportion between the LSTM and ARIMA components. This procedure is repeated until the desired forecasting accuracy is achieved.
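The assign, aggregate, and re-adjust loop can be sketched as follows. This is an illustration only, under several assumptions not stated in the source: the per-IMF forecasts from each model are taken as already computed, the assignment is modeled as a split point (the first `n_lstm` IMFs go to the LSTM, the rest to ARIMA), and the stopping threshold `target` is hypothetical.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    total = sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


def aggregate(lstm_fc, arima_fc, n_lstm):
    """Sum per-IMF forecasts: first n_lstm IMFs from LSTM, the rest from ARIMA."""
    n_imfs, horizon = len(lstm_fc), len(lstm_fc[0])
    chosen = [lstm_fc[i] if i < n_lstm else arima_fc[i] for i in range(n_imfs)]
    return [sum(row[t] for row in chosen) for t in range(horizon)]


def search_assignment(actual, lstm_fc, arima_fc, target):
    """Adjust the split until MAPE meets the target; return the best split seen."""
    best = None
    for n_lstm in range(len(lstm_fc) + 1):
        fv = aggregate(lstm_fc, arima_fc, n_lstm)
        score = mape(actual, fv)
        if best is None or score < best[1]:
            best = (n_lstm, score, fv)
        if score <= target:  # desired accuracy reached, stop adjusting
            break
    return best


# Hypothetical toy data: 3 IMFs, 3-day horizon (not from the study).
actual = [100.0, 110.0, 120.0]
lstm_fc = [[10.0, 10.0, 10.0], [20.0, 22.0, 24.0], [60.0, 70.0, 80.0]]
arima_fc = [[30.0, 30.0, 30.0], [20.0, 20.0, 20.0], [70.0, 78.0, 86.0]]
n_lstm, score, fv = search_assignment(actual, lstm_fc, arima_fc, target=5.0)
```

A real run would refit both models after each reassignment rather than reuse fixed forecasts; the fixed lists here only keep the sketch self-contained.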
The forecasting step with the combination of LSTM and ARIMA model
The forecasting phase hinges upon the synergistic integration of the LSTM model 20 and the ARIMA model, 21 augmented by the dynamic adjustment of IMF-Assignment. The process of allocating IMFs to both the LSTM and ARIMA models is carried out independently, culminating in the synthesis of FV through the aggregation of outputs from both models.
The LSTM model is a variant of the recurrent neural network whose gated design mitigates the vanishing gradient problem that affects conventional recurrent neural networks. It is well suited to classification, processing, and prediction tasks on time series data, where the time lags between important events can be long and of varying duration. 20
At its core, the LSTM model is built around a memory cell, the ‘cell state,’ which retains contextual information over time. Three gates (the forget gate, the input gate, and the output gate) each perform a distinct function to control the flow of information. LSTM has feedback connections, and training is carried out by the forward pass and the backpropagation through time algorithm, as shown in the following formulas:
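The forward pass of a standard LSTM cell is commonly written as below. The notation is an assumption made for this sketch: $x_t$ is the input at time $t$, $h_t$ the hidden state, $C_t$ the cell state, $W$ and $b$ the weights and biases of each gate, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)          % forget gate
i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right)          % input gate
\tilde{C}_t = \tanh\left(W_C [h_{t-1}, x_t] + b_C\right)   % candidate cell state
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t            % cell state update
o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right)          % output gate
h_t = o_t \odot \tanh\left(C_t\right)                      % hidden state
```

The additive update of $C_t$ is what lets gradients flow across many time steps during backpropagation through time.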
The ARIMA model is a statistical approach for analyzing time series and forecasting future trends. It predicts future values from historical data by combining lagged observations with lagged forecast errors. As a form of regression analysis, ARIMA models the dependence of the current value on its own past values and past errors. Three components, reflected in the acronym, define the model:
AR: Autoregression. This component models the dependence between an observation and a number of lagged observations.
I: Integrated. This component differences the raw observations, subtracting the observation at the previous time step from the current one, in order to make the time series stationary.
MA: Moving average. This component models the dependence between an observation and the residual errors of a moving average model applied to lagged observations.
Each of these components is specified as a parameter of the model. The standard notation is ARIMA(p, d, q), where p is the number of autoregressive lags, d is the degree of differencing, and q is the order of the moving average; substituting integer values for these parameters identifies the specific ARIMA model used.
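The “Integrated” step can be illustrated with a short sketch of d-th order differencing and its inverse. The helper names and the sample series are hypothetical; a full ARIMA fit would normally be delegated to a statistics library rather than written by hand.

```python
def difference(series, d=1):
    """Difference a series d times; also return the first value dropped each time."""
    heads = []
    current = list(series)
    for _ in range(d):
        heads.append(current[0])
        current = [b - a for a, b in zip(current, current[1:])]
    return current, heads


def integrate(diffed, heads):
    """Undo difference() by cumulatively summing from the stored first values."""
    current = list(diffed)
    for head in reversed(heads):
        restored = [head]
        for step in current:
            restored.append(restored[-1] + step)
        current = restored
    return current


# Hypothetical series with a quadratic trend; two differences make it constant.
series = [100, 104, 110, 118, 128]
diffed, heads = difference(series, d=2)
rebuilt = integrate(diffed, heads)
```

Here d = 2 turns the trending series into the constant sequence [2, 2, 2], and integrating back recovers the original values exactly.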
The final phase assesses the forecasted outcomes with evaluation indices to determine whether they are acceptable. If the forecasts meet the established criteria, the process concludes; if not, the IMF assignment is recalibrated, so the number of adjustments depends on the evaluation results. The EMD&LSTM-ARIMA model uses two statistical evaluation indices: the Mean Absolute Percentage Error (MAPE) and the Error Rate (ER). For both indices, a lower value indicates higher forecasting accuracy. The formulas for the evaluation indices are as follows, in which
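Both indices can be computed as in the sketch below. MAPE follows its standard definition; the ER is assumed here to be the daily absolute percentage error between AV and FV, which may differ from the exact definition used in the study. The sample values are hypothetical.

```python
def error_rate(actual, forecast):
    """Daily absolute percentage error (assumed ER definition), in percent."""
    return [100.0 * abs(a - f) / abs(a) for a, f in zip(actual, forecast)]


def mape(actual, forecast):
    """Mean absolute percentage error: the mean of the daily error rates."""
    rates = error_rate(actual, forecast)
    return sum(rates) / len(rates)


# Hypothetical actual (AV) and forecasted (FV) values, for illustration only.
av = [200.0, 250.0, 400.0]
fv = [190.0, 260.0, 400.0]
daily_er = error_rate(av, fv)
overall_mape = mape(av, fv)
```

Under this assumed definition MAPE is simply the average of the daily ER values, so a bound on every daily ER also bounds the MAPE.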
The detailed information of the selected dataset (Data source: Our World in Data. CDC COVID data tracker).

The results of EMD components in the time domain (6 IMFs and the Res). (A. UK, B. CANADA, C. ITALY, D. JAPAN).
The ensuing sections outline the comprehensive procedural steps involved in implementing the EMD&LSTM-ARIMA model, utilizing the count of hospitalized COVID-19 patients from four distinct countries as the primary input dataset. The selection of data sources was based on their representation of the most severe instances of COVID-19 hospitalizations. A comprehensive breakdown of the chosen datasets, including pertinent details, is provided in Table 1. The initial phase of the process encompassed data pre-processing, which was initiated upon the acquisition and ingestion of the dataset. During this stage, particular focus was directed towards the identification of the prediction columns, a pivotal precursor to subsequent analyses.
The EMD decomposition results for the hospital admissions data from the four countries are illustrated in Figure 2, comprising six IMFs and the residual component (Res). As in prior EMD-based analyses, IMF1 and IMF2 are not readily discernible due to their inherently chaotic characteristics. The Res component, which conveys only a minute fraction of the information relative to the original data and the IMFs, has been excluded from the final results, in line with previous findings. 24
The IMFs were assigned following the procedure defined in Section 2. The assignment was based on a comparison between the trends of FV and AV, together with the evaluation indices obtained after each adjustment. To demonstrate the process, the MAPE results for each adjustment are presented using the dataset of COVID-19 hospitalized patients in Canada. The adjusted IMF assignments and the corresponding evaluation indices are presented in Table 2. The values of the evaluation indices decrease gradually with successive adjustments of the IMF assignment. For training and forecasting, the LSTM model was configured with a single hidden layer, three inputs, and two outputs. The Adam optimizer was selected to optimize the LSTM model, as it is known for its superior predictive precision compared to alternative optimizers. 21
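One plausible reading of “three inputs and two outputs” is a sliding window in which three consecutive values predict the next two. The sketch below builds such training pairs; this interpretation, the function name, and the sample series are assumptions, since the source does not describe the windowing.

```python
def make_windows(series, n_in=3, n_out=2):
    """Slice a series into (input, target) pairs for supervised training."""
    inputs, targets = [], []
    for start in range(len(series) - n_in - n_out + 1):
        inputs.append(series[start:start + n_in])
        targets.append(series[start + n_in:start + n_in + n_out])
    return inputs, targets


# Hypothetical normalized IMF values, for illustration only.
series = [1, 2, 3, 4, 5, 6, 7]
X, y = make_windows(series)
```

Each row of `X` would feed the three LSTM inputs and the matching row of `y` would supply the two target outputs.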
All available data for each country were used to train the LSTM model. The training and forecasting parameters were set according to guidelines drawn from prior experience: the number of neurons (9, 12, 15), epochs (150, 200, 250), and batch size (100). After each adjustment of the IMF assignment, forecasting was repeated in the LSTM model, and the IMF data assigned to the ARIMA model were recalculated as part of the same iteration.
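The parameter combinations listed above form a small grid of nine configurations, which can be enumerated as in this sketch. The `evaluate` function is a hypothetical stand-in for training the LSTM with a given configuration and returning its MAPE; it does not reflect the study's actual results.

```python
from itertools import product

NEURONS = (9, 12, 15)
EPOCHS = (150, 200, 250)
BATCH_SIZE = 100


def build_grid():
    """Enumerate every (neurons, epochs) pair with the fixed batch size."""
    return [
        {"neurons": n, "epochs": e, "batch_size": BATCH_SIZE}
        for n, e in product(NEURONS, EPOCHS)
    ]


def best_config(grid, evaluate):
    """Return the configuration with the lowest score (e.g. MAPE)."""
    return min(grid, key=evaluate)


def evaluate(config):
    """Hypothetical placeholder score; a real run would train and compute MAPE."""
    return abs(config["neurons"] - 12) + abs(config["epochs"] - 200)


grid = build_grid()
chosen = best_config(grid, evaluate)
```

Replacing `evaluate` with a function that trains the model and scores its forecasts would reproduce the kind of comparison summarized per country in Table 3.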
The assignment of 6 IMFs for the LSTM model and the ARIMA model, and the results of evaluation indexes MAPE values obtained in each adjustment.
The values of MAPE obtained from the EMD&LSTM-ARIMA model for 4 countries. (N represents the number of Neurons, and E represents the Epochs).

Comparison results of AV and FV obtained from EMD&LSTM-ARIMA for UK, CANADA, ITALY and JAPAN.
The forecasted outcomes for the four countries outlined in Table 2 were computed during the forecasting process. The MAPE values obtained from the EMD&LSTM-ARIMA model are presented in Table 3. The forecasting accuracy varied significantly with the parameters, and the lowest MAPE values for the UK, Canada, Italy, and Japan were 2.30%, 3.33%, 1.63%, and 2.89%, respectively. The forecasts capture the fluctuating trends observed in the AV data, as shown in Figure 3. Impressively, the correlation coefficient

The distribution of everyday Error Rate obtained from EMD&LSTM-ARIMA summarized for 4 countries. The median (blue) and 95th percentile (red) are shown.
The LSTM parameters (the number of neurons and the epochs) and the IMF-assignment ratio between the LSTM and ARIMA models that yielded the best forecasting results are detailed in Table 4.
Obtained parameters for LSTM and IMFs assignment ratio between LSTM and ARIMA according to the best forecasting results.
Comparison results of daily reported and forecasted number of hospital admissions in the advance forecasting.

Forecasted and actual curves for the number of hospitalized patients of COVID-19 in Florida.
To assess the efficacy of the proposed EMD&LSTM-ARIMA model, a validation was conducted using subsequent data on the number of hospitalized patients in Florida, USA, gathered from June 1 to June 30, 2022, from the CDC COVID data tracker. This dataset served as input for advance forecasting. It is crucial to emphasize that the forecast was made without any knowledge of the data after June 15. The outcomes are presented in Table 5, where AV-1 represents the actual data available during the forecasting calculation and AV-2 denotes the data that had not yet occurred or been acquired at the time of the prediction. The EMD&LSTM-ARIMA model produced forecasts covering both time categories, which demonstrates its ability to predict data not yet available at the time of prediction. The calculated MAPE was 5.99%. The comparison between AV and FV is presented in Figure 5, showcasing an impressive correlation coefficient

The distribution of everyday Error Rate obtained from EMD&LSTM-ARIMA summarized for Florida. The median (blue) and 95th percentile (red) are shown.
In this study, we introduced a hybrid intelligent approach, the EMD&LSTM-ARIMA model, designed to forecast the number of hospitalized COVID-19 patients. Our methodology encompasses data preprocessing, EMD decomposition, IMF assignment, and combined calculations with the LSTM and ARIMA models. The training data comprised COVID-19 hospitalization figures from the UK, Canada, Italy, and Japan, spanning March 2020 to March 2022, with subsequent 15-day data employed for forecasting. The evaluation yielded promising results, with the lowest MAPE values for the UK, Canada, Italy, and Japan being 2.30%, 3.33%, 1.63%, and 2.89%, respectively, indicating high forecasting accuracy. Additionally, the correlation coefficient (
The reclassification of COVID-19 as an endemic disease globally has led to a decrease in surveillance and data collection efforts. Nonetheless, research regarding Disease X persists. WHO Director-General Tedros Ghebreyesus has called for countries to endorse the pandemic treaty, emphasizing the necessity of global cooperation in addressing health crises. 25 WHO Member States have agreed to resume negotiations aimed at finalizing a pandemic agreement during the period from April 29 to May 10. “Our Member States are fully aware of how important the pandemic agreement is for protecting future generations from the suffering we endured through the COVID-19 pandemic,” stated WHO Director-General Dr Tedros Adhanom Ghebreyesus. 26 The concept of “Disease X” underscores the potential for yet-to-be-identified pathogens to trigger global pandemics, with COVID-19 serving as a crucial case study for understanding and managing such diseases.
In conclusion, continued research into COVID-19 remains essential for managing the disease, preventing future outbreaks, and reducing suffering and fatalities, and it underscores the importance of international collaboration and investment in research and development. Our proposed EMD&LSTM-ARIMA model demonstrates robust forecasting performance, particularly for COVID-19 hospitalization data. Such methodologies hold significant promise for predicting similar time series data associated with future outbreaks, potentially facilitating proactive responses and efficient resource allocation.
Footnotes
Ethical approval
This study did not involve human subjects or ethical considerations; therefore, ethical approval was not required.
Author contributions
M.H. collected, calculated, and analyzed the data and took charge of the integration, revision, and finalization of the manuscript. H.Z. provided the initial idea, the program code designs, and the research plan, and guided the overall research progress.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
Software environment
The computing platform is an Intel(R) Core(TM) i7-9700 CPU @ 3.00 GHz processor with 64 GB of memory, running 64-bit Windows 10. The proposed model was developed in Python with the Keras third-party library, using PyCharm as the integrated development environment and open-source libraries such as numpy, pandas, and scikit-learn. The Python version is 3.6, and the PyCharm version is 2017.3.2.
Data and code availability
The datasets and code generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.
Guarantor
The guarantor for this work is Prof. Hongbing Zhu, who can be contacted at chinmumu@gmail.com.
