An evolutionary ensemble model based on GA for epidemic transmission prediction

Abstract

This paper proposes an evolutionary ensemble model based on a Genetic Algorithm (GAEEM) to predict the transmission trend of infectious diseases by ensembling existing ensemble models and predicting again from their predictions. The model uses the strong global optimization capability of the GA to tune the ensemble structure. Compared with traditional ensemble learning models, GAEEM has three main advantages: 1) It addresses the information leakage of the traditional Stacking strategy and the overfitting of the Blending strategy. 2) It uses a GA to optimize the combination of base learners and to determine the sub-learner. 3) The feature dimension of the data used in the sub-learner layer is extended with the predictions of the optimal base learner combination, which can reduce the risk of underfitting and increase prediction accuracy. The experimental results show that, on the six-city data set, the R² of the model is higher than that of all the comparison models by 0.18 on average, and its MAE and MSE are lower by 42.98 and 42,689.72 on average. The fitting performance is more stable across data sets and shows good generalization, so the model can predict the epidemic spread trend of each city more accurately.

Keywords

Evolutionary ensemble; genetic algorithm; ensemble strategy; epidemics transmission prediction

1 Introduction

Epidemics have become a global public health issue. As viruses mutate, they become more infectious and spread faster, which produces very complex and nonlinear data. Traditional machine learning models are sensitive to the data, which makes it difficult for them to fit complex infectious disease data well [1].

Ensemble learning enables better prediction performance than traditional machine learning models by combining the prediction results of the base learners [2]. Due to the different ensemble strategies, various ensemble learning algorithms are generated. The main difference between each ensemble strategy is the combination of base learners, the training process and the process of combining their prediction results. However, the existing ensemble approaches have the following problems:

For the data processing method: the traditional Stacking [3] ensemble method uses the whole training set for cross-validation of the base learners, which easily causes information leakage. The Blending [4] ensemble method instead splits the data into a fixed training set and validation set and trains the different base learners on the training set without cross-validation. Because this leaves fewer samples than Stacking, Blending overfits more easily.

For the base learner combination strategy: traditional ensemble learning models such as XGBoost, AdaBoost, CatBoost, Gradient Boosting Decision Tree (GBDT), LightGBM (LGBM), Random Forest (RF), and Deepforest21 (DF21) [5] simply combine base learners without optimizing the combination, and there is no straightforward practice for selecting sub-learners.

For optimization tasks in ensemble learning: evolutionary algorithms can perform various optimization tasks in ensemble learning, such as sample selection, model structure optimization [6], and fusion strategy optimization. However, evolutionary algorithms such as the genetic algorithm (GA) and particle swarm optimization (PSO) [7] each have their own characteristics, and it is difficult to select the appropriate one for a specific problem.

This paper proposes an evolutionary ensemble learning model based on GA to address the above issues. Genetic algorithms have robust global search capability [8], and according to recent studies, they are preferred over other evolutionary algorithms for optimizing complex, realistic tasks [9]. Therefore, GAEEM uses a genetic algorithm to handle epidemic transmission data, which is highly complex and nonlinear. GAEEM is compared with traditional ensemble models based on Blending or Boosting strategies and with DF21, proposed by Zhou in 2021. The comparison shows that the ensemble approach of GAEEM copes well with complex and nonlinear epidemic data: it can ensemble models adaptively and predict epidemic transmission trends more accurately.

In summary, we have made the following main contributions.

A GAEEM model is proposed to predict epidemic transmission trends, which can adaptively adjust the combination of base learners and sub-learners for the automatic model ensemble.

The risk of information leakage in the traditional Stacking strategy and the risk of overfitting in the Blending strategy are reduced.

The feature dimension of the data used in the sub-learner layer is extended with the predictions of the optimal base learner combination, which can reduce the risk of underfitting and increase prediction accuracy.

Popular ensemble models are used as base learners, so GAEEM ensembles existing ensembles and predicts again from their predictions, which makes the model more stable and robust.

The rest of the paper is organized as follows. Section 2 presents the background of evolutionary ensemble learning and the prediction of infectious diseases, and Section 3 introduces the GAEEM model. Section 4 describes the experiments and the analysis of their results, including the data preprocessing process, the experimental parameter settings, and the comparison experiments. Finally, Section 5 concludes the paper.

2 Related work

2.1 Evolutionary ensemble learning

Evolutionary ensemble learning combines the advantages of both ensemble learning and evolutionary algorithms and is widely used in machine learning, data mining and pattern recognition [10]. In this setting, evolutionary algorithms are mainly used for feature engineering, for tuning model parameters, and for searching for optimal model structures.

Sound feature engineering can make the model fit the data better [11], so there is much research on optimal feature engineering. Usman et al. [13] used a combination of information gain (IG), the cuckoo optimisation algorithm (COA), and the non-dominated sorting genetic algorithm (NSGA-III) for feature filtering, and the features selected this way perform well on most of the datasets. Sharma et al. [14] used GA, PSO and ant colony optimisation (ACO) in feature engineering, analysed them jointly to determine the best solution, and showed in their prediction experiments that using PSO for feature engineering can effectively improve the prediction accuracy. Moldovan et al. [15] proposed the Horse Optimization Algorithm (HOA) and applied it to feature selection, where it achieves better accuracy than other evolutionary algorithms.

In terms of optimizing model parameters and structure, Dayalan et al. [16] proposed a new Stackelberg-particle swarm algorithm based on multi-stage excitation to optimize the demand response module of the model and achieved better results. Gu et al. [17] proposed an improved bagging ensemble surrogate-assisted evolutionary algorithm (IBE-CSEA) to address the accumulation of approximation error and computational cost in solving multi-objective optimization problems. IBE-CSEA is more competitive than the popular surrogate-assisted evolutionary algorithms. Ngo et al. [18] proposed an evolutionary bagging approach to ensemble learning that adjusts bag diversity. They found experimentally that their method is superior to the traditional ensemble methods (bagging and random forests). Guo et al. [19] addressed ensemble structure redundancy, considerable computational cost and other multi-model optimization problems and proposed an evolutionary dual ensemble class imbalance learning method. By experimenting on seven datasets based on human localization, their results showed that the method provided the simplest ensemble structure with the best imbalance accuracy and outperformed other traditional ensemble methods in all metrics.

2.2 Ensemble learning applied to epidemics

Recently, artificial intelligence has achieved much success in the field of epidemics. Padinjappurathu Gopalan et al. [20] performed disease prediction based on a deep-learning neural network with a herding genetic algorithm. Their experimental results improved prediction accuracy, privacy and security compared to existing methods. Ngabo et al. [21] used a machine learning architecture for epidemics in intelligent cities. They compared their experimental results with ensemble models and were able to propose corresponding epidemic solutions based on the performance evaluation. Nguyen et al. [22] constructed an evolutionary ensemble computing framework using KNN, SVM, RF, and XGB as base learners and using PSO to optimize the weights of the four learners. Their experimental results show that the method is a stable, robust, and practical framework. et al. optimized the optimization algorithm for ten base learners and a sub-learner SVM to find countermeasures against virus propagation. Yahia et al. [23] combined Long Short Term Memory (LSTM), Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN) into an ensemble and used the complementarity of the three models to forecast the epidemic trends in China and Tunisia, obtaining better results.

3 Methods

Figure 1 depicts the general flow of the evolutionary ensemble model based on the genetic algorithm(GAEEM), which is divided into four steps. The first step is to filter the base learner combinations in the base learner layer by the GA layer. The filtered optimal base learner combinations are passed to the sub-learners layer in the second step. The third step is selecting the sub-learners from BOX to pass to the sub-learners layer. In the fourth step, the ensemble of the model is performed.

Fig. 1

GAEEM Brief Process.

Figure 2 depicts the specific design of GAEEM, which has three layers, the base learner layer, the GA layer, and the sub-learner layer. The layers mainly perform the following tasks.

Fig. 2

GA Evolutionary Ensemble Model (GAEEM). The fitness function is defined in Equation (1).

3.1 Base learner layer

In the base learner layer, the training set is denoted as Train-data , and the validation set is denoted as Validation-data . Six traditional ensemble models are used as base learners: XGBoost, AdaBoost, CatBoost, LGBM, and GBDT, which represent Boosting, and RF, which represents Bagging. They are denoted as m1, m2, . . . , m6 . Their training sets are denoted as m-t1, m-t2, . . . , m-t6 , and their validation sets are denoted as m-v1, m-v2, . . . , m-v6 . Their predictions for the validation set are denoted as v-p1, v-p2, . . . , v-p6 . These predictions are then merged with the validation set labels to form the GA layer data set, denoted as GA-data .

3.2 GA adaptive selection layer

The base learner combination and the sub-learner are adjusted and selected at the GA layer. The base learners are first encoded as 0/1 genes, after which a BOX is set up in the algorithm that holds seven learners: XGBoost, AdaBoost, CatBoost, LGBM, GBDT, RF, and DF21. The 5-fold cross-validation R² of each learner is used as the fitness function to avoid overfitting and to judge the quality of each individual in the population (a base learner combination) accurately and stably. After GA-data is input and the iterations finish, the learner in the BOX that attains the optimal fitness value is selected as the sub-learner. The optimal solution is decoded and fed back to the base learner layer, and the corresponding optimal base learners are selected as the combination denoted mx . . . , mn .

${Model}_{k} : R^{2} = \frac{1}{n} \sum_{j = 1}^{n} (1 - \frac{\sum_{i = 1}^{m} {({\hat{y}}^{(i)} - y^{(i)})}^{2}}{\sum_{i = 1}^{m} {(\bar{y} - y^{(i)})}^{2}})$ (1)

Where Model_k : R² represents the R² calculation for the n-fold cross-validation of the learner named k in the BOX; n denotes the number of cross-validation folds; m denotes the number of samples; y⁽ⁱ⁾ denotes the true value of sample i; ${\hat{y}}^{(i)}$ denotes the predicted value of sample i; $\bar{y}$ denotes the sample mean.
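The fitness in Equation (1) can be illustrated with a minimal sketch in plain Python. The fold layout (contiguous folds), the helper names `r2_score`, `cv_r2` and `linear_fit_predict`, and the one-feature least-squares learner are assumptions made only for illustration; the paper does not specify how folds are drawn, and the real fitness uses the learners in the BOX.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """R^2 of one fold: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((y_bar - t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def cv_r2(X, y, fit_predict, n_folds=5):
    """Mean R^2 over n_folds contiguous folds, as in Equation (1)."""
    m = len(y)
    bounds = [round(j * m / n_folds) for j in range(n_folds + 1)]
    scores = []
    for j in range(n_folds):
        lo, hi = bounds[j], bounds[j + 1]
        # Train on everything outside the held-out fold [lo, hi).
        X_tr, y_tr = X[:lo] + X[hi:], y[:lo] + y[hi:]
        y_hat = fit_predict(X_tr, y_tr, X[lo:hi])
        scores.append(r2_score(y[lo:hi], y_hat))
    return mean(scores)


def linear_fit_predict(X_tr, y_tr, X_te):
    """Hypothetical stand-in learner: one-feature least squares."""
    xs = [row[0] for row in X_tr]
    x_bar, y_bar = mean(xs), mean(y_tr)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (t - y_bar) for x, t in zip(xs, y_tr)) / sxx
    return [y_bar + slope * (row[0] - x_bar) for row in X_te]


# Toy data on an exact line, so every fold is fitted almost perfectly.
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
fitness = cv_r2(X, y, linear_fit_predict)
```

In GAEEM the `fit_predict` role would be played by a learner from the BOX, and `X` would hold the GA-data columns selected by a chromosome.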

Algorithm 1 describes how the GA adjusts and selects the base learner combination and the sub-learner. It has three stages. The first stage initializes the GA: it initializes the population, encodes the base learners, sets the population size N , the maximum number of genetic iterations I , the crossover probability c , and the mutation probability a , and sets the fitness function to the cross-validation R² of the model. It also constructs the BOX and puts the codes of seven models (XGBoost, AdaBoost, CatBoost, LGBM, GBDT, RF, and DF21) into it, denoted { s1,s2,s3,s4,s5,s6,s7 }.

The second stage performs genetic iteration. The BOX is traversed, starting from s1 . While the iteration limit has not been reached, the fitness value of each chromosome in the current population is calculated using the current learner’s 5-fold cross-validation R² as the fitness function. After that, the chromosomes in the population are sampled with random probability, crossover operations are performed with crossover probability c , and mutation operations are performed with mutation probability a . Once the genetic operations have generated a new population, the fitness value of each chromosome in the new population is calculated. Finally, the current learner name si and its optimal solution with the optimal fitness value are recorded.

The third stage determines the optimal base learner combination and the sub-learner. The best fitness value and its solution are picked from the records, and the solution is decoded into the base learner combination mx . . . mn . The learner si in the BOX that produced this best fitness value becomes the sub-learner.

Algorithm 1: GA selection process

Input: GA-data , population size N , current iteration t , maximum iteration I , crossover probability c , mutation probability a , learners { s1,s2,s3,s4,s5,s6,s7 } in BOX, cross-validation metric R²

Output: Combination of base learners: mx . . . mn , Sub-learner: si

1 Initialize the population and encode the base learners
2 For si in BOX do
3     While t ≤ I do
4         Use the 5-fold cross-validation R² of the current learner si , as given in Equation (1), as the fitness function and calculate the fitness value of each chromosome in the current population.
5         Sample the chromosomes in the population with random probability, cross them with crossover probability c , and mutate them with mutation probability a to generate a new population, then calculate the fitness value of each chromosome in the new population.
6     End while
7     Record the name of the current learner si and the corresponding optimal solution and optimal fitness value
8 End for
9 From the records, select the optimal fitness value and its optimal solution
10 Decode the optimal solution into the corresponding base learner combination mx . . . mn , extract the si in the BOX corresponding to the optimal fitness value, and return
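The control flow of Algorithm 1 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: binary tournament selection, one-point crossover and single-gene mutation are assumed operators (the paper only says chromosomes are sampled, crossed and mutated), the population size is assumed even, and `toy_fitness` is a hypothetical stand-in for the 5-fold cross-validation R² of Equation (1).

```python
import random

BASE = ["XGBoost", "AdaBoost", "CatBoost", "LGBM", "GBDT", "RF"]  # m1..m6
BOX = BASE + ["DF21"]                                             # s1..s7


def ga_select(fitness, n_pop=80, n_iter=5, p_cross=0.4, p_mut=0.2, seed=0):
    """Return (best fitness, sub-learner name, selected base learners)."""
    rng = random.Random(seed)
    n_genes = len(BASE)
    records = []
    for s in BOX:  # steps 2-8: one GA run per candidate sub-learner
        def score(chrom, s=s):
            # An empty combination cannot feed a sub-learner.
            return fitness(s, chrom) if any(chrom) else float("-inf")

        pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(n_pop)]
        best = max(pop, key=score)
        for _ in range(n_iter):
            # Assumed selection operator: binary tournament.
            parents = [max(rng.sample(pop, 2), key=score) for _ in range(n_pop)]
            children = []
            for a, b in zip(parents[::2], parents[1::2]):
                a, b = a[:], b[:]
                if rng.random() < p_cross:  # one-point crossover
                    cut = rng.randrange(1, n_genes)
                    a[cut:], b[cut:] = b[cut:], a[cut:]
                children += [a, b]
            for child in children:
                if rng.random() < p_mut:  # flip one random gene
                    child[rng.randrange(n_genes)] ^= 1
            pop = children
            best = max(pop + [best], key=score)
        records.append((score(best), s, best))  # step 7
    best_fit, sub, chrom = max(records, key=lambda r: r[0])  # step 9
    return best_fit, sub, [m for m, bit in zip(BASE, chrom) if bit]  # step 10


def toy_fitness(sub_learner, chrom):
    """Hypothetical fitness; the paper uses cross-validated R^2 instead."""
    target = [1, 0, 1, 1, 0, 0]
    match = sum(c == t for c, t in zip(chrom, target)) / len(target)
    bonus = 0.05 if sub_learner == "LGBM" else 0.0
    return 0.8 * match + bonus


best_fit, sub, combo = ga_select(toy_fitness)
```

The default arguments mirror the GAEEM settings in Table 4 (c = 0.4, a = 0.2, I = 5, N = 80). The returned `combo` corresponds to mx . . . mn and `sub` to the selected si.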

3.3 Sub-learners layer

In the sub-learner layer, the test set is denoted as Test-data . The optimal base learners mx . . . ,mn predict on Validation-data , and these predictions are denoted as v-px . . . ,v-pn . They also predict on Test-data , and these predictions are denoted as T-px . . . ,T-pn .

The new features and their data are then constructed: v-px . . . ,v-pn are added as new features to Validation-data to form the training set of the sub-learner, which is called SL-traindata , and T-px . . . ,T-pn are added as new features to Test-data to form the test set of the sub-learner, which is called SL-testdata .

After the sub-learner selected by the algorithm is trained with SL-traindata , the final prediction is made using SL-testdata . The traditional ensemble method passes only the base learners' predictions to the sub-learner, which makes the data dimension used by the sub-learner too small and easily causes underfitting. Therefore, in the sub-learner layer of GAEEM, the underfitting risk is reduced by expanding the feature dimension: the predictions of the optimal base learner combination are combined with the original features of the validation set and of the test set, and the sub-learner trains and predicts on these extended sets. The standard notations and descriptions in this paper are shown in Table 1.
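The feature extension described above amounts to appending one column per selected base learner to the original features. A minimal sketch, in which the function name `extend_features` and the toy numbers are hypothetical:

```python
def extend_features(features, base_predictions):
    """Append each selected base learner's prediction as a new column."""
    n_rows = len(features)
    for name, preds in base_predictions.items():
        if len(preds) != n_rows:
            raise ValueError(
                f"{name}: expected {n_rows} predictions, got {len(preds)}"
            )
    return [
        list(row) + [preds[i] for preds in base_predictions.values()]
        for i, row in enumerate(features)
    ]


# Toy Validation-data features and predictions v-px . . . v-pn (made up).
validation_X = [[0.3, 12.0], [0.5, 15.0], [0.9, 11.0]]
v_p = {"XGBoost": [101.0, 140.0, 230.0], "LGBM": [98.0, 151.0, 221.0]}
sl_train_X = extend_features(validation_X, v_p)  # features of SL-traindata
```

The same call on Test-data with T-px . . . T-pn would give the features of SL-testdata, so the sub-learner sees the original features plus the base learners' predictions in both sets.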

Table 1
Symbols used and their description

Symbol	Description	Symbol	Description
GAEEM	GA Evolutionary Ensemble Model	m-t1, m-t2, . . . , m-t6	Training set of each base learner
GAE-RF, GAE-XGB, . . . , GAE-DF21	GAEEM with RF, XGB, . . . , DF21 as the sub-learner	m-v1, m-v2, . . . , m-v6	Validation set of each base learner
m1, m2, . . . , m6	Each base learner	v-px . . . ,v-pn	Predicted data of the optimal base learner combination for the validation set
mx . . . , mn	Optimal base learner combination	v-p1, v-p2, . . . , v-p6	Predicted data of each base learner for the validation set
{ s1,s2,s3,s4,s5,s6,s7 }	The BOX, which stores 7 learners in total: XGBoost, AdaBoost, CatBoost, LGBM, GBDT, RF and DF21	T-px . . . ,T-pn	Predicted data of the optimal base learner combination for the test set
Train-data	Training set	Model_k:R²	Function: cross-validation R² of learner k in BOX
Validation-data	Validation set	N	Population size
Test-data	Test set	t	Current number of iterations
GA-data	GA layer dataset	I	Maximum number of iterations
SL-testdata	Sub-learner prediction set	c	Crossover probability
SL-traindata	Sub-learner training set	a	Mutation probability

3.4 Analysis of model complexity and convergence

Complexity analysis: Algorithm 1 is an improvement on the GA. The time complexity of the traditional GA is O ( I × N ), where I denotes the maximum number of iterations and N is the population size. The time complexity of Algorithm 1 is also a function of I and N , but its process differs from the traditional GA process because a BOX is set in the GA adaptive selection layer. The time complexity of Algorithm 1 can be expressed as O ( s1 × I1 × N1 + s2 × I2 × N2 + . . . + sn × In × Nn ), where si denotes the i-th learner in the BOX traversed in the GA adaptive selection layer, Ii denotes its maximum number of iterations, and Ni denotes its population size. Since the genetic algorithm is run once for each learner in the BOX with the same genetic parameter settings, the time complexity of Algorithm 1 is O ( v × I × N ), where v denotes the number of learners in the BOX. Therefore, the time complexity of Algorithm 1 is higher than that of the conventional GA by a factor of v .
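As a rough worked example of the O ( v × I × N ) bound, the sketch below counts fitness evaluations for the GAEEM settings in Table 4 ( I = 5, N = 80) and the seven learners in the BOX. It assumes one fitness evaluation per chromosome per generation and ignores the evaluation of the initial population; the function name is hypothetical.

```python
def fitness_evaluations(v, n_iter, n_pop, n_folds=1):
    """Evaluations for v GA runs; n_folds > 1 counts model fits instead."""
    return v * n_iter * n_pop * n_folds


plain_ga = fitness_evaluations(v=1, n_iter=5, n_pop=80)            # 400
gaeem = fitness_evaluations(v=7, n_iter=5, n_pop=80)               # 2800
model_fits = fitness_evaluations(v=7, n_iter=5, n_pop=80, n_folds=5)  # 14000
```

Because each fitness value is a 5-fold cross-validation R², every evaluation costs five model fits, which is why the last count is five times larger.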

Convergence analysis: According to [24–27], the convergence of genetic algorithms is influenced by parameters such as crossover probability, mutation probability, and population size. Increasing the population size, the crossover probability and the mutation probability increases randomness and search capability and can help avoid premature convergence. However, these measures cannot guarantee the global convergence of the algorithm, and they cannot rule out falling into a local optimum. Nevertheless, the search capability of the algorithm is still crucial and is preferred both in research and in practical industrial applications.

4 Experiment and results

4.1 Data sources and pre-processing

The dataset of this paper comes from the Baidu 2020 International Big Data Competition (https://aistudio.baidu.com). It covers 60 days of epidemic transmission in a total of 561 areas in 6 cities, specifically the number of new infections in each area on each day, the human flow index and migration intensity of each area at each hour of the day, the weather conditions (including temperature, humidity, wind direction, etc.) at each hour of the day, the population migration index between cities, and so on. The details are shown in Table 2.

Table 2
Data set distribution

Epidemic dataset	No. of features	Lines
Footfall data	5	167,231,424
City Region Data	3	55,286
Number of new infections per day in the region Data	4	33,660
Intercity Migration Data	4	7,200
Migration data by region within the city	6	13,690,734
Weather Data	8	8,632

Missing value processing: To stay realistic, the missing values in the six datasets that make up the epidemic dataset are not filled with the mean or the mode. Because the infection data are time-sensitive and continuous, a missing value (e.g., in the weather data) is filled with the value from the previous hour.
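Filling from the previous hour is a forward fill. A minimal sketch, assuming the hourly values are already sorted by time and missing readings are represented as `None` (both assumptions, as is the function name):

```python
def forward_fill(values):
    """Replace each None with the most recent earlier value."""
    filled, last = [], None
    for v in values:
        if v is None:
            v = last  # stays None if no earlier value exists
        filled.append(v)
        last = v
    return filled


hourly_temperature = [21.5, None, None, 23.0, None]
filled = forward_fill(hourly_temperature)
```

A gap at the very start of a series has no previous hour to copy from and is left as `None` in this sketch.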

Feature transformation: to reduce their scale, dates are converted from the original format (such as 21200501) to day indices 1–60; for wind direction (such as Southeast), wind speed (such as 16–24km/h), and weather (such as sunny), special characters are removed and the values are then one-hot encoded; the small number of range values in the wind data (such as 3-4) are converted to their midpoint (3.5).
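The three transformations can be sketched as below. The helper names are hypothetical, and mapping dates to 1–60 by their rank among the sorted unique dates is an assumption, since the paper does not state how the day index is derived.

```python
import re


def range_to_midpoint(text):
    """Turn a range such as '3-4' into the mean of its numbers."""
    nums = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]
    return sum(nums) / len(nums)


def one_hot(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


def date_to_day_index(dates):
    """Map each date code to its 1-based rank among the unique dates."""
    order = {d: i + 1 for i, d in enumerate(sorted(set(dates)))}
    return [order[d] for d in dates]


wind = range_to_midpoint("3-4")
categories, encoded = one_hot(["Southeast", "North", "Southeast"])
days = date_to_day_index([21200501, 21200503, 21200502, 21200501])
```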

Data merging: The human flow dataset, the migration dataset of each region within the city, and the weather dataset are recorded hourly, but the task objective is to predict the daily infections in each region. These three datasets are therefore each aggregated to daily granularity, and finally the six processed datasets are merged by date.
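A minimal sketch of the hourly-to-daily aggregation and the merge by date follows. The aggregation functions (sum for footfall, mean for temperature) and the row layout are assumptions, since the paper does not state them, and the region key is omitted for brevity.

```python
from collections import defaultdict
from statistics import mean


def to_daily(hourly_rows, agg=mean):
    """Aggregate (day, hour, value) rows to one value per day."""
    by_day = defaultdict(list)
    for day, _hour, value in hourly_rows:
        by_day[day].append(value)
    return {day: agg(vals) for day, vals in by_day.items()}


def merge_by_date(**daily_tables):
    """Join daily tables on the days that appear in all of them."""
    days = set.intersection(*(set(t) for t in daily_tables.values()))
    return {
        d: {name: table[d] for name, table in daily_tables.items()}
        for d in sorted(days)
    }


footfall_hourly = [(1, 0, 10), (1, 1, 30), (2, 0, 50)]
temperature_hourly = [(1, 0, 20.0), (1, 1, 22.0), (2, 0, 18.0)]

daily_footfall = to_daily(footfall_hourly, agg=sum)
daily_temperature = to_daily(temperature_hourly)
merged = merge_by_date(footfall=daily_footfall, temperature=daily_temperature)
```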

4.2 Experimental environment and model parameter settings

The data from the first 46 days of the epidemic were used as the training set and the data from the remaining 14 days as the test set. GAEEM was compared experimentally with seven popular ensemble models. The experimental environment is shown in Table 3, and the parameters of each model are shown in Table 4.
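The split is chronological rather than random, so no future day leaks into training. A minimal sketch, assuming each row carries a day index from 1 to 60 under a hypothetical `day` key:

```python
def split_by_day(rows, train_days=46):
    """First train_days days go to training, later days to testing."""
    train = [r for r in rows if r["day"] <= train_days]
    test = [r for r in rows if r["day"] > train_days]
    return train, test


# Placeholder rows, one per day; real rows would hold the merged features.
rows = [{"day": d} for d in range(1, 61)]
train_rows, test_rows = split_by_day(rows)
```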

Table 3
Experimental environment Settings

Type	Versions
OS	Windows 10
Program	Python 3.7
Algorithm	Genetic Algorithm
Model	XGBoost, GBDT, LGBM, AdaBoost, CatBoost, RandomForest, DeepForest21
CPU	i7-11800H
GPU	NVIDIA GeForce RTX 3060
Memory	16GB
Programming tools	PyCharm

Table 4

Parameter settings of each model

Model	Parameters
RF	n-estimators = 500, max-depth = 7, min-samples-leaf = 5, min-samples-split = 9
XGBoost	learning-rate = 0.05, n-estimators = 500, max-depth = 6
AdaBoost	learning-rate = 0.09, n-estimators = 900, loss = square
CatBoost	learning-rate = 0.09, n-estimators = 900, depth = 4
LGBM	learning-rate = 0.06, n-estimators = 900, num-leaves = 6, max-depth = 6
GBDT	learning-rate = 0.01, n-estimators = 800, max-depth = 6
DF21	max-depth = 11, max-layers = 9
GAEEM	c = 0.4, a = 0.2, I = 5, N = 80

4.3 Evaluation indicators

The mean absolute error (MAE), mean squared error (MSE), and R² are used as the performance evaluation indexes of the model. MAE and MSE are both commonly used loss functions of regression models, and MAE has better robustness to outliers compared to MSE, while MSE is easy to calculate. The formulas are shown in Equations (2), (3) and (4).

$MAE = \frac{1}{m} \sum_{i = 1}^{m} | y^{(i)} - {\hat{y}}^{(i)} |$ (2)

$MSE = \frac{1}{m} \sum_{i = 1}^{m} {(y^{(i)} - {\hat{y}}^{(i)})}^{2}$ (3)

$R^{2} = 1 - \frac{\sum_{i = 1}^{m} {({\hat{y}}^{(i)} - y^{(i)})}^{2}}{\sum_{i = 1}^{m} {(\bar{y} - y^{(i)})}^{2}}$ (4)

Where m denotes the number of samples; y⁽ⁱ⁾ denotes the true value of sample i; ${\hat{y}}^{(i)}$ denotes the predicted value of sample i; $\bar{y}$ denotes the sample mean.
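Equations (2) to (4) translate directly into code. The sketch below uses plain Python and made-up values for illustration only; they are not taken from the experiments.

```python
def mae(y_true, y_pred):
    """Equation (2): mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Equation (3): mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Equation (4): 1 - residual sum of squares / total sum of squares."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((y_bar - t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [100.0, 150.0, 200.0, 250.0]
y_pred = [110.0, 140.0, 190.0, 260.0]
scores = (mae(y_true, y_pred), mse(y_true, y_pred), r2(y_true, y_pred))
```

For these toy values every error is 10, so MAE is 10 and MSE is 100; the total sum of squares is 12500, which gives R² = 1 − 400/12500 = 0.968.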

4.4 Feature selection results

As shown in Fig. 3, feature importance analysis was performed on the epidemic dataset using XGBoost, LGBM, GBDT, CatBoost, and AdaBoost, respectively, and the comprehensive importance ranking filtered out 16 essential features.

Fig. 3

Comparison of the importance of model features.

In addition to the important features mined from the data itself, the GA layer can filter the optimal base learner combinations that fit the epidemic transmission data more accurately, so the predictions of the optimal base learner combination are used as new feature data. These new features therefore carry information from accurate predictions of the epidemic transmission data. Finally, the dimension of the sub-learner dataset is extended with these new features, which allows the sub-learner to predict again on top of the earlier predictions. In summary, the important features affecting the spread of epidemics are shown in Fig. 4.

Fig. 4

Important features.

4.5 Prediction results of epidemic spread trend

To verify the effectiveness of the GA-based selection of the base learner combination and the sub-learner and of the extended sub-learner data dimension, the learners in the BOX of this model were used as sub-learners one by one, and comparison experiments on infectious disease transmission prediction were conducted against the traditional ensemble models. The experimental results are shown in Table 5 (GAE-RF, GAE-XGBoost, . . . , GAE-LGBM denote the GAEEM with RF, XGBoost, . . . , LGBM in the BOX as sub-learners, respectively).

Table 5
Model prediction performance for data by city

(a) Comparison of GAEEM experiments with different sub-learners
DATA	Evaluate	Model
		GAE-LGBM	GAE-RF	GAE-GBDT	GAE-XGBoost	GAE-AdaBoost	GAE-CatBoost	GAE-DF21
CityA	R²	0.823 (GAEEM)	0.752	0.792	0.764	0.753	0.811	0.776
	MAE	129.0 (GAEEM)	145.0	135.1	146.8	150.0	134.1	133.6
	MSE	45242 (GAEEM)	63514	53295	60371	63314	48362	57259
CityB	R²	0.894	0.869	0.902	0.926 (GAEEM)	0.843	0.866	0.802
	MAE	81.6	84.1	78.7	69.4 (GAEEM)	97.9	91.3	91.9
	MSE	21637	26889	20117	15126 (GAEEM)	32048	27458	40587
CityC	R²	0.862	0.824	0.860	0.872	0.842	0.900 (GAEEM)	0.829
	MAE	45.0	52.2	45.0	42.1	47.6	37.0 (GAEEM)	49.8
	MSE	4487	5741	4541	4155	5139	3259 (GAEEM)	5569
CityD	R²	0.837	0.824	0.837	0.871 (GAEEM)	0.846	0.834	0.808
	MAE	222.4	233.2	223.9	188.8 (GAEEM)	213.1	227.8	230.9
	MSE	128583	139319	128927	102110 (GAEEM)	121893	131359	151982
CityE	R²	0.773	0.688	0.774	0.800 (GAEEM)	0.712	0.761	0.640
	MAE	83.5	89.6	77.9	75.2 (GAEEM)	91.2	78.7	90.8
	MSE	20656	28471	20620	18211 (GAEEM)	26298	21834	32861
CityF	R²	0.762	0.732	0.769	0.817 (GAEEM)	0.704	0.791	0.734
	MAE	95.3	104.6	102.8	87.4 (GAEEM)	116.1	89.4	105.2
	MSE	30258	34052	29340	23349 (GAEEM)	37734	26565	33818

(b) Experimental comparison of GAEEM and traditional ensemble models
DATA	Evaluate	Model
		LGBM	RF	GBDT	XGBoost	AdaBoost	CatBoost	DF21	GAEEM
CityA	R²	0.752	0.720	0.754	0.750	0.660	0.719	0.618	0.823
	MAE	134.6	158.4	150.7	141.5	185.1	174.1	171.2	129.0
	MSE	63399	71594	62945	63910	86924	71850	97674	45242
CityB	R²	0.855	0.801	0.843	0.921	0.618	0.819	0.497	0.926
	MAE	92.3	102.1	86.9	70.6	150.4	100.3	128.8	69.4
	MSE	29682	40699	32218	16077	78157	36984	102918	15126
CityC	R²	0.837	0.785	0.857	0.858	0.655	0.838	0.484	0.900
	MAE	45.4	56.0	46.2	45.1	74.6	48.1	74.8	37.0
	MSE	5306	6991	4647	4635	11240	5274	16796	3259
CityD	R²	0.792	0.640	0.776	0.851	0.602	0.772	0.425	0.871
	MAE	244.7	314.5	250.7	201.7	362.6	258.4	331.6	188.8
	MSE	164235	284047	176486	118009	313967	179978	453683	102110
CityE	R²	0.629	0.626	0.735	0.787	0.602	0.591	0.023	0.800
	MAE	98.8	101.2	86.8	81.2	122.0	114.1	164.4	75.2
	MSE	33861	34116	24133	19459	36333	37328	89040	18211
CityF	R²	0.730	0.662	0.756	0.804	0.171	0.650	0.324	0.817
	MAE	103.8	119.5	97.7	90.1	251.6	124.5	155.4	87.4
	MSE	34427	43013	30998	24921	105492	44549	86052	23349

Table 5(a) shows the epidemic transmission prediction performance when each learner in the BOX is used as the sub-learner. The R² differences between the sub-learner selected by the GA layer in GAEEM and the worst-performing sub-learner on the six city epidemic datasets are 0.071, 0.083, 0.076, 0.047, 0.112, and 0.113, respectively. This indicates the stability of the GAEEM approach of ensembling existing ensembles when fitting complex data such as epidemic data. Table 5(b) compares the prediction performance of GAEEM and the traditional ensemble models. GAEEM outperforms the traditional ensemble models in R² on the epidemic data of all six cities and has smaller MAE and MSE, which further validates that the GA-based adaptive re-ensembling and re-prediction in GAEEM can fit the data well and predict the transmission of the epidemic.

GAEEM outperforms the traditional ensemble models on all evaluation metrics in each city dataset. DF21 did not fit any city dataset well, with an average R² of about 0.395. GAE-DF21, however, fit each city dataset well, with an average R² of about 0.765, a significant improvement over DF21 on every city dataset. AdaBoost fit the city F data poorly, with an R² of 0.171; when it was used as a sub-learner for GAEEM, its fit improved significantly, and the MSE decreased from 105492 to 37734. For city A, each GAEEM variant fit the epidemic transmission data better than its traditional counterpart: the R² of GBDT reached 0.754, while the R² of GAE-GBDT was 0.792. The R² of GAE-LGBM and GAE-CatBoost exceeded 0.8, while the mean R² of LGBM and CatBoost was 0.736. The MAE and MSE of GAE-LGBM were also the lowest in the experiments.

Figure 5 shows the R² performance of GAEEM and the traditional ensemble models on each city dataset. The fitting performance of the DF21, AdaBoost, CatBoost, and LGBM models is unstable across the city datasets. Compared with the other ensemble models, GAEEM fits the city datasets stably and shows good generalization. On the city E dataset, the fitting performance of DF21, CatBoost, and LGBM drops more than that of the other models, and AdaBoost fits the city F dataset much worse than it fits the other city datasets. This demonstrates that GAEEM improves the fit by extending the data dimension: at the sub-learner layer it constructs new features that contain the predictions of epidemic transmission, so that the model predicts again on the basis of those predictions. As a result, GAEEM fits better than the traditional ensemble models and is more stable across datasets.
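The averages quoted above and the stability claim can be checked directly from the R² rows of Table 5(b). The following Python sketch is illustrative and not part of the original experiments. The variable names are hypothetical, and the spread measure (the range, that is, maximum minus minimum R² across the six cities) is an assumption, since the text does not name a stability metric.

```python
from statistics import mean

# R² values copied from Table 5(b), in city order A, B, C, D, E, F.
r2 = {
    "LGBM":     [0.752, 0.855, 0.837, 0.792, 0.629, 0.730],
    "RF":       [0.720, 0.801, 0.785, 0.640, 0.626, 0.662],
    "GBDT":     [0.754, 0.843, 0.857, 0.776, 0.735, 0.756],
    "XGBoost":  [0.750, 0.921, 0.858, 0.851, 0.787, 0.804],
    "AdaBoost": [0.660, 0.618, 0.655, 0.602, 0.602, 0.171],
    "CatBoost": [0.719, 0.819, 0.838, 0.772, 0.591, 0.650],
    "DF21":     [0.618, 0.497, 0.484, 0.425, 0.023, 0.324],
    "GAEEM":    [0.823, 0.926, 0.900, 0.871, 0.800, 0.817],
}

# Mean R² of each model over the six cities.
mean_r2 = {model: mean(scores) for model, scores in r2.items()}

# Average margin of GAEEM over the seven comparison models.
baseline_mean = mean(v for m, v in mean_r2.items() if m != "GAEEM")
margin = mean_r2["GAEEM"] - baseline_mean

# City A only (first entry of each list): mean R² of LGBM and CatBoost.
city_a_pair = mean([r2["LGBM"][0], r2["CatBoost"][0]])

# Assumed stability measure: range of R² across cities (smaller is steadier).
r2_range = {model: max(scores) - min(scores) for model, scores in r2.items()}

for model in r2:
    print(f"{model:9s} mean R2 = {mean_r2[model]:.3f}  range = {r2_range[model]:.3f}")
print(f"GAEEM margin over the comparison models = {margin:.3f}")
print(f"City A mean R2 of LGBM and CatBoost = {city_a_pair:.4f}")
```

Under these assumptions, the mean R² of DF21 comes out at about 0.395 and the GAEEM margin at about 0.18, and the four models described as unstable (DF21, AdaBoost, CatBoost, and LGBM) each show a wider R² range than GAEEM.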

Fig. 5

Fitting comparison of each model.

Figure 6 compares the actual daily epidemic transmission trends in the six cities with the GAEEM predictions, using the date of transmission as the independent variable and the number of new infections as the dependent variable. Four outbreaks of different sizes in the number of new infections occurred during the 14 consecutive days of epidemic development in each of the six cities. The number of new infections continued to show an upward trend after the outbreak phase. In general, the number of new infections increased explosively in these 14 days, and the epidemic spread rapidly. The six cities had different scales of growth in the number of new infections due to objective reasons such as the size of the urban population and the various efforts to prevent and control the epidemic. During the outbreak phase, city C had the smallest number of new infections, maintaining a growth scale of about 100,000. The number of new infections in cities A, B, E, and F remained in the range of several hundred thousand. The number of new infections in city D reached about one million.

Fig. 6

Comparison of trends in the spread of the epidemic.

From the perspective of epidemic transmission trend prediction, GAEEM predicts outbreak periods and outbreak trends well and can more accurately predict fluctuations in the number of new infections as the epidemic develops. The number of new infections in each city grew explosively four times during the 14 days, and although the growth rate slowed after each surge, the graph shows that GAEEM can accurately predict such fluctuations. From the perspective of data fitting, GAEEM predicts the number of new infections during an epidemic outbreak, although with some error relative to the actual situation. The conservative predictions of GAEEM do not fit the number of new infections exactly, but they do not exaggerate it either, and this conservative approach allows GAEEM to fit the transmission trend accurately. This is particularly evident in city B, where GAEEM accurately fits both the number and the trend of new infections during days 50 to 52 and 54 to 56, the two phases of the outbreak.

In summary, regarding the ensemble strategy, GAEEM uses traditional ensemble models as base learners and uses the GA to adaptively choose the best combination of base learners and sub-learner. This approach allows GAEEM to use any learner in the BOX as a sub-learner and still outperform the corresponding traditional ensemble model on each city dataset. GAEEM can also adaptively select the sub-learner from the learner box based on the optimal fitness value, which further improves model performance. Regarding epidemic transmission prediction, GAEEM extends the data dimension by adding the epidemic transmission predictions of the optimal base learner combination as new features. This enriches the data with predictive information about the direction of epidemic transmission, which helps the model fit complex and changing epidemic data more accurately. The experimental results also validate that GAEEM fits all city epidemic datasets well and predicts the transmission trend of the epidemic more accurately.
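The two ideas summarized above, GA-based selection of a base learner combination and extension of the feature dimension with the selected predictions, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the GAEEM implementation: the function names `extend_features` and `ga_select`, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the toy data, and the stand-in fitness (negative MSE of the averaged selected predictions) are all hypothetical. The sketch also omits the choice of sub-learner, and in practice the base predictions would come from held-out data and the fitness from the validation score of the sub-learner.

```python
import random

def extend_features(X, base_preds, mask):
    """Append the predictions of the selected base learners to each row of X."""
    selected = [preds for preds, keep in zip(base_preds, mask) if keep]
    return [list(row) + [preds[i] for preds in selected] for i, row in enumerate(X)]

def ga_select(n_learners, fitness, pop_size=12, generations=20, p_mut=0.1, seed=0):
    """Return the best bit mask over base learners found by a simple GA.

    Requires n_learners >= 2 and pop_size >= 2.
    """
    rng = random.Random(seed)
    # Start from the all-ones mask plus random masks, so the result is never
    # worse than using every base learner.
    pop = [[1] * n_learners]
    pop += [[rng.randint(0, 1) for _ in range(n_learners)] for _ in range(pop_size - 1)]
    best = max(pop, key=fitness)

    def tournament(population):
        # Pick two distinct individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry the best mask forward
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randrange(1, n_learners)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with probability p_mut per gene.
            child = [bit ^ 1 if rng.random() < p_mut else bit for bit in child]
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best

# Toy validation data, made up purely for illustration.
X_val = [[1.0], [2.0], [3.0], [4.0]]
y_val = [10.0, 20.0, 30.0, 40.0]
base_preds = [
    [11.0, 19.0, 31.0, 39.0],  # hypothetical learner 0
    [30.0, 30.0, 30.0, 30.0],  # hypothetical learner 1
    [9.0, 21.0, 29.0, 41.0],   # hypothetical learner 2
    [0.0, 0.0, 0.0, 0.0],      # hypothetical learner 3
]

def fitness(mask):
    """Stand-in fitness: negative MSE of the averaged selected predictions."""
    chosen = [p for p, keep in zip(base_preds, mask) if keep]
    if not chosen:
        return float("-inf")  # an empty combination is never acceptable
    avg = [sum(col) / len(chosen) for col in zip(*chosen)]
    return -sum((a - y) ** 2 for a, y in zip(avg, y_val)) / len(y_val)

best_mask = ga_select(len(base_preds), fitness)
X_extended = extend_features(X_val, base_preds, best_mask)
print(best_mask, X_extended[0])
```

Because the initial population contains the all-ones mask and the best individual is always carried forward, the returned mask is guaranteed to score at least as well as using every base learner. The extended rows would then be passed to the chosen sub-learner for the second prediction.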

5 Conclusion and prospect

In this paper, we propose an evolutionary ensemble model based on GA (GAEEM) to predict the transmission trend of the epidemic. Experiments on epidemic transmission trend prediction demonstrate that the model, by ensembling again and predicting again, copes well with complex and nonlinear epidemic data and reduces the risks of traditional ensemble strategies. It performs better than traditional ensemble models in predicting the trend of epidemic transmission.

Future experiments will consider the following.

For complex, time-dependent data such as epidemic data, we will consider introducing other techniques so that our model has a strong learning capability and can also operate in real time.

Many novel evolutionary algorithms are being proposed for task optimization, and we will try to integrate them with novel techniques in the future.

Since the content of this work is still in the context of traditional machine learning, the feature engineering work still needs to be integrated. We would like to further integrate this part with the model in the future so that the model can be generalized and extended to high-dimensional data tasks.

The prediction of the epidemic depends heavily on time, weather, region, regional changes in human mobility, and similar factors. To capture an outbreak accurately, the time scale must be reduced to hours or even smaller, focusing on the analysis of data during the outbreak period. However, this dramatically increases the data analysis effort. Future work will therefore take these considerations into account for more accurate outbreak and transmission trend predictions.

CRediT authorship contribution statement

Xiaoning Li: Conceptualization, Data Curation, Methodology, Software, Writing - Original Draft, Visualization. Qiancheng Yu: Formal analysis, Investigation, Writing - Review & Editing, Supervision. Yufan Yang: Supervision. Chen Tang: Supervision. Jinyun Wang: Supervision.

Declarations

Conflict of interest

The authors declare that they have no conflicts of interest regarding this work.

Acknowledgments

This work was supported by the Doctoral Program of Northern MinZu University (Grant No. 2020KYQD48); the 2022 Ningxia Autonomous Region Key Research and Development Plan (Talent Introduction Special) Project (2022YCZX0013); the 2022 University Research Platform “Digital Agriculture Empowering Ningxia Rural Revitalization Innovation Team” of Northern University for Nationalities (2022PT_S10); and the major key project of school-enterprise joint innovation in Yinchuan 2022 (2022XQZD009).

Data availability statements

The datasets generated during or analysed during the current study are available in the [] repository.
