Abstract
The emergency department of a hospital plays an extremely important role in the healthcare of patients. To maintain a high quality service, clinical professionals need information on how patient flow will evolve in the immediate future. With accurate emergency department forecasts it is possible to better manage available human resources by allocating clinical staff before peak periods, thus preventing service congestion, or releasing clinical staff at less busy times. This paper describes a solution developed for the presentation of hourly, four-hour, eight-hour and daily number of admissions to a hospital’s emergency department.
A 10-year history (2009–2018) of the number of emergency admissions in a Portuguese hospital was used. To create the models several methods were tested, including exponential smoothing, SARIMA, autoregressive and recurrent neural network, XGBoost and ensemble learning. The models that generated the most accurate hourly time predictions were the recurrent neural network with one-layer (sMAPE
Introduction
The Emergency Department (ED) of a hospital plays an extremely important role in the healthcare of ambulatory and hospitalized patients. Overcrowding of emergency services is a phenomenon that, if unaddressed, has a negative impact on the quality of care provided, on clinical outcomes and on users’ satisfaction. The quality of ED services, measured by waiting time and length of stay, is significantly affected by patient arrivals. Increased patient arrivals could compromise the quality of care, thus placing patients at risk. It is therefore of utmost importance to have a fast and reliable service to provide the necessary care to incoming patients. The discrepancy between current and expected admissions at an ED can lead to a high patient/staff ratio that causes poor service, on the one hand, or to a low patient/staff ratio that increases service costs without additional benefits, on the other. With high-accuracy forecasts it is possible to better manage available human resources by allocating clinical staff to the ED before surging peaks – thus preventing service congestion, or releasing clinical staff at less busy times.
The unpredictability of arrivals to the ED of a hospital is the main motivation of this work. It is crucial to the hospital’s management team to have a forecasting mechanism of these arrivals in order to improve service and reduce costs. The literature on the forecast of ED patient flow is broad, including models with different time granularity (hours, days, weeks) [1, 2, 3, 4, 5, 6, 7, 8]. However, the datasets used are far less extensive than the one considered in this work. Specifically, data corresponding to 10 years of admissions to the ED of a Portuguese hospital were analysed via time-series-based statistical and machine learning methods. Besides the development and comparison of statistical and machine learning models, we will also perform ensemble learning with these two kinds of models to predict ED admissions. Moreover, most previous studies only make daily predictions, while in the present study hourly prediction of ED admissions is achieved. This is a more detailed prediction, and thus more challenging to obtain because of its greater variability.
The article is organized as follows. Section 2 provides an overview of the literature on time-series models to forecast ED patients’ arrivals. Section 3 briefly presents the forecast methods used in this work. Section 4 explains the main metrics used to evaluate the models. Section 5 gives a description of the dataset in an exploratory data analysis. In Section 6, a total of six different models, as well as the ensemble of the four best individual models, for hourly ED admissions forecast are developed and tested. Section 7 explains how models for four-hour, eight-hour and daily granularity forecasts are derived. In Section 8, the models that achieve the highest accuracy and lowest execution time are selected to be used on the company dashboard. Comparison to related work in the literature is made in Section 9. In Section 10, conclusions and future work are disclosed.
Literature review
Time-series analysis explores historical data of an event to create a model that can be used to predict future behaviour. Early attempts to study time series in the nineteenth century fell short of expectations, largely because of the prevailing theories of a deterministic world. It was in 1927 that Yule [9] postulated that all time series can be considered as a realization of a stochastic process. Since the introduction of this simple idea, the number of time-series analysis methods has increased considerably. These methods have proven to be a valuable tool to predict events, with demand in a diverse range of areas spanning finance, meteorology, science, and engineering [10].
The first study on the prediction of patients’ arrivals to ED was published in 1988 by Milner [11], where an ARIMA model was used to generate the predictions. Since then, in addition to linear models such as ARMA/ARIMA [1, 3, 4, 7, 8, 12, 13, 14, 15, 16, 17], other approaches have been considered, including linear regression [5, 7, 12], exponential smoothing [1, 3, 12, 13], and, more recently, nonlinear machine learning methods [6, 12, 15, 16]. Despite the early success of linear models, over the last 10 years nonlinear methods have achieved similar and, in some cases, better results. In the previously mentioned studies, the models were mostly developed for daily forecasts, and only the studies [1, 7] explored hourly forecasts, similar to the work presented here. Hourly forecasts are challenging to obtain due to the noise caused by random variation underlying the type of event, which can mask the variation patterns in the data.
Regarding the number of hospitals included, while in some studies one model per hospital was created [4, 14, 15], others included data from various hospitals to create a model for each hospital [5, 12, 13, 16]. Boyle [1] and Aboagye-Sarfo [3] used 29 and 80 clinical facilities, respectively, to construct two models each, representing the type of facility (clinic or hospital) or its location (metropolitan or non-metropolitan). The size of the hospitals in the studies referenced here also varies widely, with hospitals with an average daily intake of ED admissions ranging between less than 100 [7, 12, 14] and more than 300 [8, 15, 17]. This feature has a significant impact on the accuracy of forecasts, because the forecast error tends to decrease with the increase of ED admissions. The quantity and quality of data are decisive factors in obtaining accurate predictions. It is therefore necessary to analyse which variables should be included in the construction of the predictive model. Most studies have highlighted that month, day of week, time of day, and weekend indication are consistently the variables that most contribute to model improvement [1, 7, 12, 19]. Meteorological data such as temperature and humidity have been shown to be relevant in a model that predicts ED influx due to respiratory problems [16], while in other studies where the reason for admission to ED is not distinguished the contribution of the meteorological conditions data did not have a relevant influence [7, 8, 19].
The size of the dataset used to create the model has an evident effect on its accuracy. As a rule of thumb, the more data available to train the model, the better. However, this is not always the case, either because data trends change sharply over time or, as found by Boyle [1], because certain models provide more accurate predictions with more data while others do so with less. The division of the total dataset for model performance evaluation should also be carefully performed. Ideally, a predictive model should be tested on data that comprise at least one complete seasonal cycle [20], i.e. the test dataset should span an entire year. With some exceptions [3, 8, 16], most studies found in the literature do not meet this standard. Failure to use a comprehensive test dataset limits model validation, as the model may provide accurate predictions only at certain times of the year, notably those with least variation [10, 20].
The work described here differs from previous studies in three respects: 10 years of ED admissions are considered, several statistical and machine learning models are developed, and the best of these are combined into an ensemble model to overcome the deficiencies of the individual models.
Forecasting methods
In the initial phase of this project, several time-series analysis methods with potential applicability to ED admissions forecast were identified. As postulated by Wolpert in the “No Free Lunch” theorem [22], no type of model can be said in advance to work best for every type of problem. The only way to know whether a method can provide accurate predictions for a problem is to test the developed models. There are three main categories of forecasting methods: exponential smoothing methods, linear methods, and nonlinear methods [10]. After careful review of the forecasting literature, the most referenced methods, as well as other promising recent ones, were identified, including seasonal naive (used as a reference for comparison with other methods), exponential smoothing, SARIMA, autoregressive and recurrent neural networks, and XGBoost. In addition to these individual models, an ensemble model that combines the predictions of several models in order to improve on the accuracy of the individual predictions was also developed.
Model evaluation
In addition to the training and test dataset, in the creation of machine learning models, another dataset – the validation dataset – must be added. The function of this dataset is to test the accuracy of predictions during model training and optimization of model hyper-parameters [20]. Thus, during model training only the training and validation datasets are used, and for the final evaluation of the model a completely independent test dataset is used to obtain a true estimate of the model error with new data.
To perform model validation, different error metrics can be used to measure the accuracy of predictions. The forecast error Eq. (1) is the difference between the observed value (
Discussions on which metric to use are common in the literature. Still, no consensus on the “best” metric has been achieved, as “no single measure is universally best”. Some metrics are more popular than the others. Several researchers conducted surveys to understand the frequency of use or importance of different metrics used in regression. A variety of metrics were identified in these surveys. The three metrics found most popular in the independent surveys that were performed over a timeline of 25 years were: mean square error (MSE) (or root MSE (RMSE)), mean absolute error (MAE) and mean absolute percentage error (MAPE) [21].
There are two main categories of forecast errors: scale-dependent errors and percentage errors. The difference between these is that the former, as the name implies, are on the same scale as the data, while the latter are given in percentage, which allows them to be used to compare models applied to different datasets.
The most common scale-dependent errors are Mean Absolute Error (MAE) Eq. (2) and Root Mean Squared Error (RMSE) Eq. (3).
The RMSE is a quadratic scoring rule which measures the average magnitude of the error. In the equation for the RMSE the errors are squared before they are averaged, which gives a relatively high weight to large errors. This means the RMSE is most useful when large errors are particularly undesirable, which is the case here. So, to choose between the several models, both measures will be presented but the choice will be guided by RMSE.
The most common percentage errors are the Mean Absolute Percentage Error (MAPE) Eq. (4) and the symmetric Mean Absolute Percentage Error (sMAPE) Eq. (5).
For comparison of intra-study models that use the same data, MAE or RMSE are a good option. However, in order to compare models from different studies, a scale-independent error metric such as MAPE should be considered. When one of the values observed in the time series becomes zero, the MAPE metric is impossible to calculate due to division by zero. In these cases, the sMAPE metric should be used. Despite having a different formula, sMAPE gives very similar results to MAPE.
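The four metrics discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the sMAPE variant shown (absolute error divided by the mean of the absolute observed and predicted values) is one of several in use and may differ from the paper's Eq. (5), and treating a term with a zero denominator as zero is likewise an assumption.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, on the same scale as the data."""
    return sum(abs(y - f) for y, f in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; squaring gives large errors more weight."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any observed value is zero."""
    if any(y == 0 for y in y_true):
        raise ZeroDivisionError("MAPE is undefined when an observed value is zero")
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE (assumed variant): |y - f| / ((|y| + |f|) / 2), in percent."""
    total = 0.0
    for y, f in zip(y_true, y_pred):
        denom = (abs(y) + abs(f)) / 2.0
        # Assumption: a term where both values are zero contributes no error.
        total += 0.0 if denom == 0 else abs(y - f) / denom
    return 100.0 * total / len(y_true)


# Made-up hourly counts; the zero in `observed` makes MAPE unusable here.
observed = [10, 0, 25, 30]
predicted = [12, 1, 20, 30]
print(mae(observed, predicted), rmse(observed, predicted), smape(observed, predicted))
```

With hourly ED data, hours with zero admissions can occur, which is the situation in which `mape` raises an error and `smape` remains computable.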
In the previously referenced studies, the vast majority use MAPE values, which vary depending on the forecast window. For monthly forecasts MAPE ranges from 1.94% to 6.84% [1, 3, 8], for daily forecasts from 4.8% to 8.5% [1, 5, 8, 12, 15, 16, 17], and for hourly forecasts from 47% to 50% [1, 7].
When a patient enters the ED of the hospital, an emergency episode is created at the department office and the patient is admitted to the ED. An emergency episode is understood here as the patient’s entire stay in the hospital’s ED, from admission until administrative discharge. For this study, a total of 1,724,920 admissions records were used at a Portuguese hospital, from January 1, 2009 until December 31, 2018, totalling 10 years of admissions to the ED.
Number of ED admissions from 2009 to 2018.
With an annual granularity, the number of admissions in the ED of the studied hospital increased over the years, except for the years 2015 and 2018 (Fig. 1). The most likely reason for these drops is that the hospital in question has been undergoing an upgrade, with the first phase beginning in 2015 and the second in 2018, which diverted the flow of patients to other hospitals nearby.
Regarding the monthly variation in the number of admissions per day (Fig. 2), August has significantly (
Variation in the number of ED admissions per day in the 12 months of the year.
The high seasonality during the year can also be observed in the variation in the number of weekly ED admissions over the 52 weeks of the year (Fig. 3).
Variation in the number of ED admissions per day in the 52 weeks of the year.
To verify the existence of daily patterns in the admissions to the hospital’s ED, a diagram of extremes and quartiles was created with the number of admissions for each day of the week and holidays (Fig. 4). There is a significantly higher daily inflow (
Variation in number of ED admissions on different days of the week.
Variation in number of ED admissions at different times of day.
Refining the granularity to the hour of day (Fig. 5) reveals a daily pattern with two peaks. From 01:00 until 08:00 the average number of admissions is below 10; it then increases until it reaches its peak between 10:00 and 11:00. The average number of admissions then decreases, before rising again to a new peak between 14:00 and 15:00. Between 16:00 and 22:00 the average number of admissions decreases, varying between 20 and 30 per hour.
To better understand the flow of admissions in the ED, the Seasonal and Trend decomposition using Loess (STL) technique [24] was employed (Fig. 6). This technique extracts the variation in the number of admissions regarding the trend of the data over time, and the variation regarding the seasonality of the data. In particular, in accordance with Figs 3–5 daily (24 hours), weekly (168 hours) and annual (8760 hours) seasonalities were explored.
Seasonality and trend decomposition using STL.
Daily seasonality is the one that accounts for the greatest share of the variance, and hence the most useful for the purpose of this work. Figure 6 also shows the remainder, which is the variation that is not explained by trend or seasonality. This component has a very high amplitude, which means that this technique struggles to explain all variation in light of trend and seasonality.
This section describes the experimental study that was carried out on the dataset previously presented. Brief descriptions of the methods used to tackle the problem are presented, as well as the results obtained with the models developed and the analysis of the forecast errors produced by each model.
The dataset comprises 10 years of admissions to the ED and was divided into three sets: training data (from January 1, 2009 until December 31, 2016), validation data (from January 1, 2017 until December 31, 2017), and test data (from January 1, 2018 until December 31, 2018). The data provided to train the models was the number of ED admissions per hour. For methods that accept additional variables, the following were added: year, month, day of the week, time of day, holiday, the admission values from one year earlier, and the admission values from 1, 2, 3, 4, 50, 51, 52, 53 and 54 weeks earlier. Since values from up to 54 weeks earlier are required, the 2009 data cannot be used to train these models. Hence, the data used to train them starts in the third week of 2010.
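The effect of the weekly lag variables on the usable training period can be illustrated with a short sketch. It assumes the hourly series is a plain Python list indexed from January 1, 2009, 00:00; the names `lag_features` and `first_usable_index` are hypothetical, and only the week-based lags are covered (calendar variables and the one-year lag are omitted).

```python
from datetime import datetime, timedelta

HOURS_PER_WEEK = 7 * 24
# Week lags listed in the text.
LAG_WEEKS = (1, 2, 3, 4, 50, 51, 52, 53, 54)


def first_usable_index(lag_weeks=LAG_WEEKS):
    """Index of the first hour for which every lagged value exists."""
    return max(lag_weeks) * HOURS_PER_WEEK


def lag_features(series, t, lag_weeks=LAG_WEEKS):
    """Values observed at the same hour of the week, w weeks before index t.

    Returns None when the longest lag would fall before the start of the series.
    """
    if t < first_usable_index(lag_weeks):
        return None
    return [series[t - w * HOURS_PER_WEEK] for w in lag_weeks]


# The series is assumed to start on January 1, 2009 at 00:00.
series_start = datetime(2009, 1, 1)
first_row = series_start + timedelta(hours=first_usable_index())
print(first_row)  # first timestamp with a complete set of lagged values
```

Under these assumptions the first complete row falls 54 weeks (378 days) after the start of the series, which is consistent with training data beginning in mid-January 2010.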
It should be noted that the optimization of a predictive model is an iterative process where the predictions made by the models are evaluated in the validation data set. The forecast is evaluated with the test data set only after defining the final model.
Seasonal naive
The naive method is a simple method that is usually used as a reference for comparison with other methods. It makes forecasts using the value of the last observation. When the data is highly seasonal, as is the case in this work, the seasonal naive method is used: instead of the last observed value, it uses the value observed at the same point of the previous seasonal period. Because of the strong daily and weekly seasonality, a naive model that uses the same date of the previous year (365 days before) gives worse results than one that uses the same day of the week closest to that date (364 days before) (Table 1).
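A minimal sketch of the seasonal naive idea follows, assuming the history is a list of hourly counts; `seasonal_naive` is a hypothetical helper, not the implementation used in the study. The date check at the end shows why a 364-day lag (52 whole weeks) preserves the day of the week while a 365-day lag does not.

```python
from datetime import date, timedelta


def seasonal_naive(history, horizon, period):
    """Forecast by repeating the values observed one seasonal period earlier."""
    if len(history) < period:
        raise ValueError("history must cover at least one seasonal period")
    n = len(history)
    # For horizons longer than one period, the last period is repeated.
    return [history[n - period + (h % period)] for h in range(horizon)]


# Seasonal periods, in hours, for the two variants compared in the text.
PERIOD_364 = 364 * 24
PERIOD_365 = 365 * 24

# 364 days is exactly 52 weeks, so the weekday is preserved; 365 days shifts it.
day = date(2018, 3, 14)
same_weekday_364 = (day - timedelta(days=364)).weekday() == day.weekday()
same_weekday_365 = (day - timedelta(days=365)).weekday() == day.weekday()
print(same_weekday_364, same_weekday_365)
```

Calling `seasonal_naive(hourly_counts, 24, PERIOD_364)` would then return the 24 values observed on the same weekday 52 weeks earlier, where `hourly_counts` stands for the hourly admissions series.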
Naive model error metrics
The naive model 364 (Naive 364) produces large amplitude errors (
Although decomposition is mainly used for the study of time series and the exploratory analysis of their variations over time, it can also be used to produce predictions. To make a prediction from a decomposed time series, the seasonal and non-seasonal components are first identified and separated. As seasonal components tend to vary very slowly, it is assumed that they will be the same as in the previous period, and the seasonal naive method is applied to them. The non-seasonal component is forecast with a predictive method, in this case exponential smoothing. Forecasts produced by exponential smoothing methods are weighted averages of past observations, where the weights decrease exponentially as the observations get older [20]. In other words, the more recent the observation, the greater the weight of its contribution to the forecast. There are three types of exponential smoothing models: one that assumes that there is no systematic structure, an extension of this method that deals with trends in the data, and a more advanced type that also supports seasonality [20].
After model creation and optimization, error metrics were calculated for the predictions made in the training and test datasets (Table 2). The predictions were found to be better than those of the naive model, with a significant reduction in MAE and RMSE for the test data of 0.371 and 0.446, respectively. The performance of the exponential smoothing decomposition (ESD) model on the training data was substantially better than on the test data (Table 2), indicating that the model overfitted the training data and was unable to generalize to the test data.
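The exponentially decreasing weights can be made concrete with a sketch of the simplest variant (no trend, no seasonality). The smoothing parameter `alpha` and the choice to initialise the level with the first observation are assumptions made for illustration; the function names are hypothetical.

```python
def simple_exponential_smoothing(series, alpha):
    """One-step-ahead forecast from simple exponential smoothing.

    The level is updated as: level = alpha * y + (1 - alpha) * level.
    Assumption: the level is initialised with the first observation.
    """
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level


def smoothing_weights(alpha, n):
    """Weights given to the n most recent observations, newest first."""
    return [alpha * (1 - alpha) ** k for k in range(n)]


print(simple_exponential_smoothing([10, 20, 30], alpha=0.5))
print(smoothing_weights(0.5, 3))
```

A larger `alpha` makes the forecast react faster to recent observations; a smaller one averages over a longer history.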
ESD model error metrics
The amplitude of errors in the ESD model is much smaller than in the Naive 364 model (Fig. A.2). However, there is still a significant error autocorrelation, indicating that the model still does not effectively capture the variation in the number of admissions over the time series.
The ARIMA method provides another approach to time-series predictions. While the exponential smoothing method is based on the description of trend and seasonality, the ARIMA method aims to describe the autocorrelation of the data [10]. The ARIMA method combines autoregressive and moving average methods with an integration step. In an autoregressive method, the variable of interest is predicted using a linear combination of its own historical values; the term ‘autoregression’ indicates that it is a regression of the variable against itself. Instead of using past values in a regression, a moving average model uses past forecast errors in a regression-like model, so each forecast can be considered a weighted moving average of past forecast errors. As this type of model can only be used with stationary time series, the ARIMA method integrates differencing, a transformation that makes the series stationary. The ARIMA method by itself is not seasonal, but seasonality can be included, giving SARIMA (Seasonal ARIMA) models. The seasonal part of the method consists of components similar to the non-seasonal components of the model, but involving backshifts of the seasonal period [20]. The structure of a SARIMA model can thus be defined as (p, d, q) (P, D, Q)
The SARIMA method is one of the most successfully applied time-series methods, but it also has its disadvantages. One is that, as the seasonal period increases, complexity increases exponentially, and its implementation in R [25] prevents the use of seasonal periods greater than 350. As in this work the time series is created with hourly data, the seasonality is equal to 8760 (24 hours
This model improves on the accuracy of the naive model by 1.300 and 1.798 for the MAE and RMSE, respectively (Table 3). This improvement can also be observed by comparing the errors of the two models.
SARIMA model error metrics
Analysing the autocorrelation of errors (Fig. A.3), it was found that, despite the improvement over the previously analysed models, the errors are still significantly autocorrelated with previous values. This means that the model makes systematic prediction errors, that is, there are variations in the data that it cannot capture.
Artificial neural networks can discern complex nonlinear relationships between an output variable and one or more input variables. A neural network can be thought of as a network of neurons that are organized in layers. Input variables comprise the bottom layer while forecasts (output data) form the top layer of the network. Between these two layers there may also be intermediate layers called ‘hidden layers’ [27].
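The residual check described above can be sketched as follows. The code computes the sample autocorrelation of the forecast errors and flags lags outside the usual approximate 95% bounds of ±1.96/√n; both the bounds and the function names are assumptions made for illustration, not necessarily what was used to produce the appendix figures.

```python
import math


def autocorrelation(errors, lag):
    """Sample autocorrelation of the forecast errors at the given lag."""
    n = len(errors)
    mean = sum(errors) / n
    denom = sum((e - mean) ** 2 for e in errors)  # zero only if errors are constant
    num = sum((errors[t] - mean) * (errors[t - lag] - mean) for t in range(lag, n))
    return num / denom


def significant_lags(errors, max_lag):
    """Lags whose autocorrelation lies outside the approximate 95% bounds."""
    bound = 1.96 / math.sqrt(len(errors))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(errors, k)) > bound]


# Synthetic errors with a 24-hour pattern that a model failed to capture.
errors = [1 if t % 24 < 12 else -1 for t in range(240)]
print(autocorrelation(errors, 24))
print(24 in significant_lags(errors, 30))
```

For errors that behave like white noise, few or no lags would be flagged; a strong spike at lag 24 or 168 would suggest uncaptured daily or weekly structure.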
In time series, the preceding values can be used as input data to a neural network, just as they are used in a linear autoregressive model, thus forming an autoregressive neural network (AR-NN). For the implementation of an autoregressive model the nnetar function of the R forecast package [28] was used. This function creates an AR-NN (p, P, k)
This model shows a significant improvement over the seasonal naive model, similar to that of the SARIMA model (Table 4). Although this model has a higher MAE than the SARIMA model, the opposite is true for the sMAPE percentage error metric. This is because MAE averages the absolute errors, while sMAPE averages the errors relative to the observed values. Thus, although the AR-NN errors are larger on average in absolute terms, they are smaller relative to the observed values. The model also shows considerable overfitting, since the error on the test data is considerably larger than the training error.
AR-NN model error metrics
When analysing the errors of this model, it was found that the errors follow a normal distribution (Fig. A.4C). However, it is possible to observe that there is still a significant autocorrelation of the errors (Fig. A.4B), meaning that this model cannot properly capture the variation in the time series.
A feature of feed-forward neural networks is that they have no memory. Input data is processed independently, without any state being kept between inputs. With such a network, analysing a time series requires feeding the network the complete time series at once. Recurrent neural networks solve this problem by iterating over the elements of the time series while maintaining a state containing information on what has been seen so far [27].
Long Short-Term Memory (LSTM) neural networks are a variant of recurrent neural networks with an extended memory. This type of network gives memory to its nodes, which are able to read, write and delete information. This makes it ideal when it is necessary to store information about what has previously occurred in the time series [27]. This subsection describes the implementation and accuracy of two recurrent neural network models, more specifically of their LSTM variant. The neural networks were created with the Keras API for R [28], using TensorFlow as the backend [28].
The first step in using a neural network involves creating variables that provide information that helps the model make more accurate predictions. A dataset was created which, in addition to the number of admissions per hour, includes the variables stated at the beginning of Section 6. The next step was the normalization of the data, since neural networks tend to perform poorly when the input variables have very different scales. Thus, each attribute was transformed by centering the values on the mean and scaling them by the standard deviation.
Due to the high number of hyper-parameters, the optimization of the neural network is a very laborious process. Starting from the initial parameter values defined in Table 5, the optimal values were identified for the training data size, the batch size (how many time-series points are given to the network at each iteration), the lookback (how many historical points are provided for each point in the time series) and the optimization function. For the training data size, several datasets were tested, removing the oldest year each time until only one year remained. For batch size and lookback, multiples of 24 up to 168 and then multiples of 168 up to 1344 were tested. For the optimization function, the seven functions available in the keras package were tested.
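The centering and scaling step can be sketched as below. Estimating the mean and standard deviation on the training data only and reusing them for validation and test data is a common precaution assumed here, not something the text states; the use of the population standard deviation and the helper names are also assumptions.

```python
import statistics


def fit_scaler(train_column):
    """Estimate the mean and (population) standard deviation of one attribute."""
    return statistics.mean(train_column), statistics.pstdev(train_column)


def transform(column, mean, sd):
    """Centre on the mean and scale by the standard deviation."""
    if sd == 0:
        # Assumption: a constant attribute carries no information and maps to zero.
        return [0.0 for _ in column]
    return [(x - mean) / sd for x in column]


# Made-up hourly counts standing in for one attribute.
train = [2, 4, 4, 4, 5, 5, 7, 9]
mean, sd = fit_scaler(train)
print(transform(train, mean, sd))
# Later data is scaled with the statistics estimated on the training data.
print(transform([6, 3], mean, sd))
```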
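The roles of lookback and batch size can be illustrated with a small sketch that turns a series into (input window, target) pairs and groups them into batches. This is a simplified reading of the two hyper-parameters with hypothetical helper names; the actual data generator used with Keras is not described in the text and may differ.

```python
def make_windows(series, lookback):
    """Pair each value with the `lookback` values immediately before it."""
    samples = []
    for t in range(lookback, len(series)):
        samples.append((series[t - lookback:t], series[t]))
    return samples


def batches(samples, batch_size):
    """Yield consecutive groups of at most `batch_size` samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


# Tiny example; in the setting of the text, lookback and batch size
# would be multiples of 24 or 168 hours.
samples = make_windows([1, 2, 3, 4, 5], lookback=2)
print(samples)
print([len(b) for b in batches(samples, batch_size=2)])
```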
Initial and optimal neuronal network hyper parameters
To optimize the remaining hyper-parameters, listed in Table 6, a Bayesian optimization model was created using the mlrMBO package [29]. This model aims to find the hyper-parameter values that generate the lowest MAE in the predictions for the validation data. After a total of 50 iterations, testing five neural networks at each iteration, the model suggested the optimal values presented in Table 6.
Initial and optimal neuronal network hyper parameters
The predictions of this model, a recurrent neural network with one layer (RNN-1L), show a significant improvement over all models presented thus far (Table 7). The improvement over the naive model of 1.629 for the MAE and 10% for the sMAPE demonstrates the power of this model to capture the trend and seasonality of the data, producing forecasts with high accuracy.
RNN-1L model error metrics
Analysing the errors of the RNN-1L model, it can be observed that they follow a normal distribution (Fig. A.5C) and that their autocorrelation is much lower than in all other models presented so far (Fig. A.5A). However, there is still significant autocorrelation.
A neural network may have more than one layer. Typically, layers are added when more complex problems exist and more nodes are needed to try to capture these trends. Therefore, since using only one layer in the recurrent neural network model is not enough to capture all the variation in the data, it was decided to test models with three node layers (RNN-3L). Analysing the error metrics shows that there is a slight improvement in the generated forecasts (Table 8).
RNN-3L model error metrics
The analysis of the errors made in the model predictions (Fig. A.6) also shows that, despite a slight improvement, there is still a significant autocorrelation of the errors. Overall, this model achieved the best results, generating the forecasts with greater accuracy.
XGBoost – eXtreme Gradient Boosting – is a scalable end-to-end tree boosting [30] system proposed by Chen and Guestrin [31]. ‘Boosting’ consists of an iterative model training process using the same learning algorithm. At each iteration, the i
In addition to the number of hourly admissions, the attributes referred to at the beginning of Section 6 were also used to create this model. Unlike neural networks, XGBoost does not require data normalization, although it tends to perform better with normalized data. Thus, each attribute was transformed by centering the values on their mean and scaling them by the standard deviation. There was a slight improvement when the data were normalized, so this step was adopted before the construction of the model.
As with recurrent neural networks, the optimization of hyperparameters is a very laborious process due to their large number. XGBoost allows not only decision trees but also linear models to be used as the base learner. In the model created in this work decision trees were used, since they gave better results in an initial test. As in the optimization of the recurrent neural networks, a Bayesian model was used to find the value of each parameter that minimizes the MAE error metric. The optimal values suggested by the model after 3000 iterations are presented in Table 9.
Optimal hyper-parameters obtained for XGBoost by Bayesian model
After testing the accuracy of the predictions of this model (Table 10), a significant improvement over the naive model can be observed. Compared to the RNN-3L model, this model has a slightly lower accuracy (
XGBoost model error metrics
When analysing the errors of this model, it can be observed that they follow a normal distribution (Fig. A.7C) and that the autocorrelation of the errors is much lower than most of the models presented here (Fig. A.7A). However, there are still significantly autocorrelated errors, especially errors from the previous two weeks and the nearest weeks of the previous year (Fig. A.7B).
Ensemble learning is based on the assumption that good independently trained models are probably good for different reasons, as each model captures different aspects of the data to make its predictions [23]. Thus, the key to creating accurate ensemble models is normally the use of very distinct models, whose errors are uncorrelated or negatively correlated. There are several advanced methodologies for creating ensemble models; in this work one of the simplest, the weighted average, was used. This type of model uses a weighted average of the predictions made by the individual models.
Ensemble model error metrics
The four best predictive models were used to create this model: SARIMA, RNN-1L, RNN-3L and XGBoost (Table 11). In order to find the ideal weight of each model, an exhaustive search was performed by testing all possibilities rounding to the percentage unit. The configuration of the model with the highest accuracy (lowest MAE) was as follows: 3% SARIMA, 15% XGBoost, 15% RNN-1L, and 67% RNN-3L. This model has greater accuracy compared to the best model (RNN-3L), but the difference is quite small (0.008 for MAE and 0.06% for sMAPE).
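The exhaustive weight search can be sketched as follows. The code enumerates every combination of whole-percent weights that sums to 100 and keeps the one with the lowest MAE; the function names, the toy data and the `step` parameter (which should divide 100, and is only there to keep small demonstrations fast) are assumptions made for illustration. Which dataset the weights were tuned on is not stated in the text.

```python
from itertools import product


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - f) for y, f in zip(y_true, y_pred)) / len(y_true)


def weight_grid(n_models, step=1):
    """Yield weight tuples in whole percent that sum to 100 (step must divide 100)."""
    levels = range(0, 101, step)
    for head in product(levels, repeat=n_models - 1):
        last = 100 - sum(head)
        if last >= 0:
            yield head + (last,)


def best_weights(y_true, predictions, step=1):
    """Return (lowest MAE, weights) over all weighted averages on the grid."""
    best = None
    for weights in weight_grid(len(predictions), step):
        blended = [
            sum(w * p[t] for w, p in zip(weights, predictions)) / 100.0
            for t in range(len(y_true))
        ]
        score = mae(y_true, blended)
        if best is None or score < best[0]:
            best = (score, weights)
    return best


# Toy example: two models with opposite biases cancel at equal weights.
observed = [10, 20, 30]
model_a = [8, 18, 28]
model_b = [12, 22, 32]
print(best_weights(observed, [model_a, model_b], step=10))
```

With four models and a one-percent step the grid contains 176,851 weight combinations, so an exhaustive search of this kind remains feasible.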
Due to the high computational cost of training the three-layer recurrent neural network, an ensemble model was also created without its predictions in order to gauge the resulting accuracy (Table 11). The difference between these ensemble models was also found to be minimal (0.023 for MAE and 0.11% for sMAPE). Because of the low weight of the SARIMA model, its predictions were also excluded; the resulting ensemble, which combines only XGBoost with RNN-1L, produced results identical to those of the previous ensemble model.
The slight difference in the accuracy of the ensemble model (RNN-1L
With a fine granularity (one hour) it is possible to provide information that will help in resource management for each shift. However, a coarser granularity can also be helpful for weekly or monthly management. Therefore, in addition to presenting hourly forecasts, forecasts for 4, 8 and 24 hours were also calculated. To achieve this goal, there were two possibilities: group the data in these time windows and create new models from that data, or use the finer granularity predictions and sum them according to the desired granularity. In this work we chose the latter since, when the data are grouped, the model learns the patterns at a macro level and some lower-level variations are not captured. Another reason for using this approach is that, if the model is unbiased, its errors behave like white noise, so the average error is zero or close to zero and the hourly errors largely cancel when summed.
The four-hour, eight-hour and 24-hour granularity forecasts for the test data set were calculated by summing the hourly predictions of all models tested so far. The MAE, RMSE and sMAPE were calculated to measure the accuracy of the predictions of each model (Fig. 7). As might be expected, MAE and RMSE increase as the granularity becomes coarser. Since these metrics are expressed in the same units as the data, a coarser granularity increases the number of admissions at each point in the time series and, consequently, the forecast error. However, sMAPE, which is a percent-error metric, decreases as the granularity becomes coarser. That is, although the absolute error increases, the relative error decreases. These results are consistent with the hourly errors of accurate models largely cancelling when summed. With coarser granularity it was found, as expected, that the most accurate models are the same as those of the hourly forecast: the RNN-1L and RNN-3L models, the XGBoost model and the ensemble model.
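The aggregation described above can be illustrated with a minimal sketch that sums hourly values into coarser windows and recomputes the three metrics. The data are synthetic, the forecast is unbiased by construction, and the sMAPE form used (absolute error divided by the mean of the absolute actual and predicted values) is an assumed common variant rather than a confirmed detail of the study.

```python
import math
import random


def aggregate(series, window):
    """Sum consecutive non-overlapping blocks of `window` hourly values."""
    usable = len(series) - len(series) % window
    return [sum(series[i:i + window]) for i in range(0, usable, window)]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def smape(actual, predicted):
    """sMAPE in percent; points where both values are zero contribute zero."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        if denom > 0:
            total += abs(a - p) / denom
    return 100 * total / len(actual)


# Synthetic hourly series (illustrative only): a daily profile as the forecast,
# and the same profile plus zero-mean noise as the observed admissions.
random.seed(1)
HOURS = 24 * 200
predicted = [18 + 8 * math.sin(2 * math.pi * (t % 24) / 24) for t in range(HOURS)]
actual = [max(0.0, p + random.gauss(0, 3)) for p in predicted]

metrics = {}
for window in (1, 4, 8, 24):
    a, p = aggregate(actual, window), aggregate(predicted, window)
    metrics[window] = {"mae": mae(a, p), "rmse": rmse(a, p), "smape": smape(a, p)}
    print(window, {k: round(v, 2) for k, v in metrics[window].items()})
```

Under these assumptions the absolute metrics grow with the window while sMAPE shrinks, because the summed admissions grow faster than the summed zero-mean errors.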
Error metrics for forecasts made with one-, four-, eight- and twenty-four-hour granularity. (A) MAE metric, (B) RMSE metric, (C) sMAPE metric.
The models proposed here were evaluated for their performance and accuracy with a view to integration into a dashboard. After an initial analysis of all single models, the exponential smoothing decomposition, SARIMA and autoregressive neural network models were discarded due to their low accuracy in comparison with the remaining models: RNN-1L, RNN-3L and XGBoost. Of these three models, the one that generated the most accurate predictions was the RNN-3L. However, the time required to train it led to its exclusion from the list of candidates for integration into the dashboard. Combining the RNN-1L model with XGBoost improved accuracy over the single models, so this combination was selected to display hourly forecasts on the dashboard (Fig. 8).
Dashboard with hourly forecasts.
Dashboard with daily forecasts.
Regarding the four-hour, eight-hour and daily predictions, it was found that the RNN-1L and XGBoost models are also the most accurate, but the combination of the two models showed no improvement over the best single model, RNN-1L. Thus, the model selected to generate four-hour, eight-hour, and daily forecasts on the dashboard was the RNN-1L model (Fig. 9).
In order to evaluate the models proposed here, their accuracy was compared with that of previous works found in the literature. However, this comparison has a limitation, as it was not possible to use exactly the same error metric. MAPE cannot be calculated from the data used in this study, because this metric does not accept zeros in the time series and there are some points in the time series with no admissions to the ED. Instead, the sMAPE error metric was used for the forecasts of this study, which yields values very close to those of MAPE. Two distinct studies were identified that made hourly forecasts [1, 7]. The two studies used very different data and methods; the best model obtained a MAPE of 47% in one and 50% in the other. Compared with these, the model proposed in this study, which achieved a sMAPE of 23%, represents a drastic improvement (Table 12). Regarding the accuracy of the daily forecast model, the model presented here also has a lower error than the published articles. While in published studies [1, 5, 8, 12, 16, 17] the MAPE of the best models ranged from 4.8% to 8.5%, in this study the best model with daily forecasts obtained a sMAPE of 4.3%. Considering an average of 18 admissions/hour (about 432 per day), an improvement of 0.5% corresponds to about two additional correctly predicted admissions per day. Once again, this study obtained very positive results, demonstrating a clear improvement over the works developed so far. While in these studies the most commonly used methods were ARIMA, linear regression and exponential smoothing, in this work the methods with the most accurate predictions were the machine learning methods.
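For reference, the two percentage metrics being compared can be written as follows, where $y_t$ is the observed number of admissions in period $t$, $\hat{y}_t$ is the forecast and $n$ is the number of periods. The sMAPE form shown is one common variant and is an assumption here, since the exact variant used in the study is not restated in this section.

```latex
\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{\lvert y_t - \hat{y}_t \rvert}{\lvert y_t \rvert},
\qquad
\mathrm{sMAPE} = \frac{100}{n} \sum_{t=1}^{n} \frac{\lvert y_t - \hat{y}_t \rvert}{\left( \lvert y_t \rvert + \lvert \hat{y}_t \rvert \right) / 2}
```

A period with $y_t = 0$ makes the corresponding MAPE term undefined, whereas the sMAPE term remains defined whenever the forecast is non-zero.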
Comparison of forecast performance reached with related work
This work demonstrated the strong seasonality present in the admissions to the ED of the hospital under study. This allowed the creation of high-precision, fine-granularity prediction models, reducing the percentage error by about half compared with similar studies. Different methods of time-series analysis, such as exponential smoothing and SARIMA, were tested. However, it was the RNN-1L and XGBoost methods, more recently applied to time-series analysis, that produced the best models. The developed models can be used in any hospital that keeps an electronic record of patient ED admissions. By integrating the dashboard with the predictive methods into the ALERT
As future work, new models will be created from data with different granularities. We intend to obtain meteorological data and disease records from the National Health Service to study the influence of these factors on ED forecasts. Regarding ensemble techniques, more advanced approaches such as bagging and boosting should also be tested. In addition to the descriptive (current admissions) and predictive (predicted admissions) components, forecasting the number of clinical staff required in the ED will also be considered in the future.
Acknowledgments
The authors would like to acknowledge ALERT Life Sciences Computing, SA for its support and for providing the data.
Appendix A
Error analysis of naive seasonal model 364.
Error analysis of exponential smoothing decomposition model.
Error analysis of SARIMA model.
Error analysis of the autoregressive neural network model.
Error analysis of the recurrent neural network model with one LSTM layer.
Error analysis of the recurrent neural network model with three LSTM layers.
Error analysis of XGBoost model.
