Abstract
Accurate stock index prediction is of great significance to national economic development. However, because stock index data are nonlinear and exhibit long-term dependence, predicting future stock index prices effectively remains a challenge. To address these problems, this paper proposes a stock index time series prediction method based on an ensemble learning model. The method first uses the AdaBoost.R2 algorithm to iteratively train multiple LSTM models, then integrates these LSTM models using the parameters obtained during iterative training, and finally uses the ensemble model to predict stock index time series data. The Shanghai Composite index, CSI 300 index and Shenzhen Composite index serve as experimental data sets, and the BP model, CNN model and LSTM model serve as comparative models. The experimental results show that the proposed ensemble learning model achieves lower RMSE and MAPE than the three comparative models on all three data sets.
Introduction
The stock market is an important part of the national economy. As an important indicator of changes in the stock market, the stock index also reflects the development and change of the national economy. At the same time, changes in the stock index affect the economic interests of investors. A stock index reflects the overall fluctuation trend of multiple stock prices and helps investors make important decisions on individual stock investments. Therefore, the effective prediction of future stock index changes has attracted the attention of researchers at home and abroad. However, stock indexes are influenced by political, military and macroeconomic factors. This makes the stock index highly uncertain and makes its prediction a difficult problem in forecasting research.
In recent years, deep learning technology has been widely applied in various fields. This technology has been successful in image recognition, speech recognition, natural language understanding and other fields [1, 2, 3, 4, 5, 6]. At present, deep learning technology has also achieved preliminary results in the field of financial time series prediction. Stock index forecasting is an important part of financial time series forecasting. Many deep learning models, such as Long Short-Term Memory (LSTM), have been applied to this research. In the time series, early information is a factor that affects late information. The LSTM is suitable for processing and analyzing time series data with long time intervals and delays. In addition, as a combinatorial optimization learning method, ensemble learning can combine multiple simple models together to obtain better performance models. Therefore, this paper combines LSTM and ensemble learning algorithm to study the stock index prediction, making full use of the advantages of LSTM in dealing with the long-term dependence problem, and using an ensemble learning algorithm to get a combination model with higher prediction accuracy.
In this paper, three data sets of different scales, three comparative models and two evaluation indexes are used for experimental analysis. The experimental results show that the ensemble learning model proposed in this paper performs well on data sets of different types and scales.
The main innovations of this paper are as follows:
A new stock index prediction model is proposed. In this model, multiple LSTM models are used as weak learners, which can both capture the nonlinear characteristics of the data and handle the problem of long-term dependence in stock index forecasting. At the same time, the AdaBoost.R2 algorithm is used to assemble the weak learners into a strong learner, which can further improve prediction accuracy and generalization ability. Unlike the AdaBoost algorithm used in other stock index prediction studies, this paper uses the AdaBoost.R2 algorithm. The AdaBoost algorithm is suited to classification problems, while the AdaBoost.R2 algorithm is designed for regression problems, so AdaBoost.R2 is the better fit for numerical prediction of stock indexes.
The rest of the paper is organized as follows: Section 2 mainly introduces the related work of stock index forecasting research. Section 3 introduces the LSTM-Adaboost.R2 ensemble learning model in detail. Section 4 carries out experimental analysis of three different types and different sizes of data sets. Section 5 is the conclusion of the research work in this paper.
Related work
Stock index prediction is an important part of financial time series forecasting research and has received wide attention from financial and academic circles. Scholars initially used time series models to forecast stock indexes. Wang and Guo [7] used a hybrid model based on Autoregressive Integrated Moving Average (ARIMA) for stock price prediction. The simulation results show that the model has good approximation ability and generalization ability, and can fit the stock index opening price well. Awartani and Corradi [8] applied Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models to stock index forecasting and found that asymmetric GARCH models have more advantages in the case of two-step comparison in advance.
With further research, scholars found that machine learning models perform better than some classical time series prediction techniques, and several machine learning methods have been used for stock index prediction. For example, Kao et al. [9] applied an improved Support Vector Regression (SVR) method to stock index prediction and reported a small prediction error. Nti et al. [10] proposed a novel homogeneous ensemble classifier called GASVM, based on a Genetic Algorithm (GA) enhanced Support Vector Machine (SVM), for stock market prediction. The practicability of the proposed method is demonstrated by experiments.
In recent years, deep learning models have begun to show their superiority in some fields. For example, one-dimensional Convolutional Neural Networks (CNN) are widely used in signal processing and other fields [11, 12]. Forecasting models based on various deep learning techniques are increasingly used in financial time series forecasting. For example, Yu et al. [13] predicted stock prices with a BP neural network and reported that the proposed model had higher accuracy. Dingli and Fournier [14] used a CNN to predict stock indexes and suggested that, with further feature engineering and network configuration, a CNN may surpass Logit models and support vector machines. Yümlü et al. [15] used a Multilayer Perceptron (MLP) and a Recurrent Neural Network (RNN) to study stock index forecasting and found that a smooth piecewise neural model has advantages in capturing volatility in index return series. LSTM networks were introduced by Hochreiter and Schmidhuber [16]. Fischer and Krauss [17] used LSTM to predict the S&P 500 index and showed that an LSTM network can extract meaningful information from noisy financial time series data. Yan et al. [18] established a high-precision short-term prediction model of financial market time series based on an LSTM deep neural network and showed that it can predict stock market time series effectively and with high accuracy.
All of the forecasting models above rely on a single machine learning algorithm to predict stock indexes. However, a single algorithm has limited generalization ability and does not transfer easily to a wider range of data. Ensemble algorithms provide an effective way to improve the generalization ability of a model [19, 20]. The main idea is similar to the saying “everyone gathers firewood, and the flame is high”: by integrating multiple learners, the ability of learning and training can be strengthened. Nti et al. [10] made an extensive comparative analysis of ensemble techniques such as boosting, bagging and stacking, constructing different types of ensemble regressors and classifiers from decision trees (DT), support vector machines (SVM) and neural networks (NN). Their results on several stock index datasets show that innovative research in the field of stock market direction prediction should include ensemble techniques in its algorithms.
Among the many ensemble algorithms, AdaBoost is one of the most representative [21, 22, 23]. It is an ensemble learning algorithm proposed by Freund and Schapire in 1995 and officially published in 1997 [24]. Sun et al. [25] proposed a stock market forecasting model that uses the AdaBoost algorithm to integrate LSTM models. Their results show that the AdaBoost-LSTM ensemble is better than several single forecasting models. Chuan et al. [26] explained the success of AdaBoost and applied it to portfolio management; their empirical studies verify its potential in that setting.
However, the AdaBoost algorithm is mainly used for binary classification problems and is not suitable for regression problems. The AdaBoost.R2 algorithm, proposed by Drucker on the basis of AdaBoost, overcomes this difficulty and can be used for the regression problem of stock index prediction [27]. Therefore, this paper adopts the AdaBoost.R2 ensemble learning algorithm and uses the LSTM deep learning model as the weak learner to forecast the Shanghai Composite index (SSEC), Shenzhen Composite index (SZSC) and CSI300 index. The results of comparative experiments show that stock index prediction based on the LSTM-AdaBoost.R2 model is more accurate.
Model and algorithm
This paper proposes a stock index forecasting method based on the LSTM-Adaboost.R2 model, as shown in Fig. 1. Firstly, the stock index data is preprocessed, an LSTM model is constructed as a weak learner, and then the weight of the training set is initialized. After initialization of the training set weight, iterative training is started. In each iteration, the original LSTM model is trained with the weighted training set, the weight
Fig. 1. LSTM-AdaBoost.R2 process.
In order to use the LSTM-Adaboost.R2 model to predict the stock index, the data must be preprocessed first. The data used in this paper is the stock index, expressed as
where
After data preprocessing, weak learners need to be constructed. In the LSTM-AdaBoost.R2 model proposed in this paper, LSTM is used as a weak learner to predict the stock index. LSTM is chosen because its “Three Gates” structure largely avoids the vanishing gradient problem that an RNN faces in practice and also allows it to handle the long-term dependence problem better. The “Three Gates” structure of LSTM includes the forget gate (
Fig. 2. Internal structure of LSTM.
As shown in the structure in Fig. 2, the cell status update process at every timestep t is as follows:
First, three activation values
Then, the candidate cell information
Finally, the output information
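For orientation, the three steps above correspond to the standard LSTM cell update, written out below. This is the commonly used formulation, not a reconstruction of this paper's own equations. The symbol names ($W$, $U$, $b$ for input weights, recurrent weights and biases, $\sigma$ for the logistic sigmoid, $\odot$ for the element-wise product) are assumptions and may differ from the original notation.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell information)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(output information)}
\end{aligned}
```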
In the LSTM-AdaBoost.R2 model proposed in this paper, LSTM is used as a weak learner to predict the stock index. At the same time, in order to further improve the prediction accuracy and generalization ability of LSTM, the ensemble algorithm is selected to train LSTM. Since AdaBoost.R2 algorithm does not limit the types of weak learners, LSTM can be used as weak learners. The core idea of the AdaBoost.R2 algorithm is to train different weak learners for the same training set, and then put these weak learners together to form a strong learner. The training process is shown in Fig. 3.
Fig. 3. Iterative training flow of weak learners.
Before starting iterative training, it is necessary to determine the weak learner model
where
After initialization of training set weight, iterative training is started. The number of iterations is expressed by
The first step of iterative training is to use the training set with weight
The second step is to calculate the average error of
Then calculate the error of each sample. According to AdaBoost.R2 algorithm,
Finally, the average error
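The sample error and average error steps can be sketched as follows. This is a minimal sketch that assumes Drucker's standard AdaBoost.R2 definitions, in which each absolute error is scaled by the largest absolute error on the training set. The function names `sample_losses` and `average_loss` are hypothetical and are not taken from this paper.

```python
import math


def sample_losses(y_true, y_pred, kind="exponential"):
    """Per-sample loss in [0, 1] for one weak learner (assumed standard AdaBoost.R2 form)."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    d = max(abs_err)  # largest absolute error, used as the scaling factor
    if d == 0:
        return [0.0] * len(abs_err)  # perfect fit: every loss is zero
    if kind == "linear":
        return [e / d for e in abs_err]
    if kind == "square":
        return [(e / d) ** 2 for e in abs_err]
    if kind == "exponential":
        return [1.0 - math.exp(-e / d) for e in abs_err]
    raise ValueError("kind must be 'linear', 'square' or 'exponential'")


def average_loss(losses, weights):
    """Weighted average of the per-sample losses under the current sample weights."""
    total = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / total
```

All three loss types map each sample error into the interval [0, 1], so the weighted average loss also lies in [0, 1].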
The last step of iterative training is to update the weight distribution of training set according to the average error. The calculation process is shown in Eqs (14) and (15):
where
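A minimal sketch of the weight update, assuming the standard AdaBoost.R2 rule in which a confidence value beta is derived from the average loss and each sample weight is multiplied by beta raised to (1 - loss) before renormalization. The function name `update_weights` and its arguments are hypothetical. Whether this matches Eqs (14) and (15) term by term is an assumption.

```python
def update_weights(weights, losses, avg_loss):
    """Return (beta, new_weights) after one boosting iteration.

    Assumed standard AdaBoost.R2 rule: beta = L / (1 - L), where L is the
    weighted average loss. A small beta means a confident weak learner.
    """
    if not 0.0 < avg_loss < 1.0:
        raise ValueError("avg_loss must lie strictly between 0 and 1")
    beta = avg_loss / (1.0 - avg_loss)
    # Samples with a small loss are scaled down the most (exponent near 1),
    # so poorly predicted samples gain relative weight in the next iteration.
    raw = [w * beta ** (1.0 - l) for w, l in zip(weights, losses)]
    z = sum(raw)  # normalization factor so the weights sum to 1
    return beta, [r / z for r in raw]
```

In Drucker's formulation the boosting loop stops once the average loss reaches 0.5. That stopping check is left to the caller in this sketch.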
In the LSTM-Adaboost.R2 model proposed in this paper, the weighted median method with
where
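The weighted median combination can be sketched as follows, assuming the standard AdaBoost.R2 choice of ln(1/beta) as the weight of each weak learner. The function name `weighted_median` is hypothetical, and the exact weights used in this paper may differ.

```python
import math


def weighted_median(predictions, betas):
    """Combine the weak learners' predictions for a single input.

    predictions: one predicted value per weak learner.
    betas: the beta value of each weak learner (each strictly between 0 and 1).
    Each learner is weighted by ln(1 / beta), an assumption taken from the
    standard AdaBoost.R2 formulation.
    """
    pairs = sorted(zip(predictions, (math.log(1.0 / b) for b in betas)))
    half = 0.5 * sum(w for _, w in pairs)
    running = 0.0
    for pred, w in pairs:
        running += w
        if running >= half:
            # First prediction at which the cumulative weight reaches half the total
            return pred
    return pairs[-1][0]
```

With equal beta values this reduces to the ordinary median of the predictions, while a learner with a much smaller beta pulls the result toward its own prediction.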
Measurement standard
In this paper, root mean square error (RMSE) and mean absolute percentage error (MAPE) are used to evaluate the performance of LSTM-AdaBoost.R2, BP neural network, CNN and LSTM. Equations (17) and (18) give the calculation methods of these two standards.
where
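A minimal sketch of the two metrics, assuming the usual textbook definitions of RMSE and MAPE. MAPE is expressed as a percentage here, which is an assumption, since the scaling used in Eq. (18) is not recoverable from the text. The function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between real and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (real values must be non-zero)."""
    n = len(y_true)
    return 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
```

RMSE is in the same units as the index itself, while MAPE is scale-free, which makes it easier to compare across the three indexes.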
The datasets selected in this paper are SSEC, CSI300 and SZSC. The SSEC data cover January 2, 2014 to December 31, 2019 (1464 trading days in total). The CSI300 data cover December 31, 2004 to June 1, 2020 (3745 trading days in total). The SZSC data cover April 3, 1991 to June 1, 2020 (7155 trading days in total). In each of the three datasets, the first 80% of trading days form the training set and the remaining 20% form the verification set. Data source: Ruisi database (www.resset.com).
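The 80/20 split described here is chronological, so the series is cut at a single point rather than shuffled. A minimal sketch follows; the function name `chronological_split` is hypothetical, and rounding the cut point down is an assumption.

```python
def chronological_split(series, train_fraction=0.8):
    """Split a time-ordered series into a training part and a held-out part.

    The earliest observations go to the training set and the most recent
    ones to the held-out set, so no future data leaks into training.
    """
    cut = int(len(series) * train_fraction)  # rounds down (assumption)
    return series[:cut], series[cut:]
```

For example, a series of 10 observations yields 8 training points and 2 held-out points.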
Sample error of LSTM-AdaBoost.R2 model
The AdaBoost.R2 algorithm can calculate the sample error in three different ways. This section studies how the choice of sample error affects the prediction performance of the LSTM-AdaBoost.R2 model, in order to find the calculation method that gives the best prediction performance. The three ways of calculating sample error are linear error, square error and exponential error; the specific equations are Eqs (10)–(12) introduced in Section 3.3. The prediction results of the LSTM-AdaBoost.R2 model with the three different sample errors are recorded in Table 1.
In Table 1, RMSE1 and MAPE1 represent the prediction errors of the three sample error calculation methods on the SSEC dataset, RMSE2 and MAPE2 represent the prediction errors of the three sample error calculation methods on the CSI300 dataset, RMSE3 and MAPE3 represent the prediction errors of the three sample error calculation methods on the SZSC dataset. It can be seen from Table 1 that the performance of linear error and exponential error have their own advantages and disadvantages in the SSEC and CSI300. In the SZSC, the performance of exponential error is the best. In the subsequent experiments, the exponential error will be used as the calculation method of sample error.
Table 1. Prediction results of three sample errors
In practical applications, what matters most is the best prediction performance the model can achieve. This section therefore looks for the number of weak learners that gives the LSTM-AdaBoost.R2 model its best prediction performance. Each training iteration produces one weak learner, so searching for the optimal number of weak learners is the same as searching for the most appropriate number of iterations K. In the experiments in this section, K ranges from 1 to 10. The prediction results for different numbers of iterations are shown in Table 2. The data in Table 2 show that the optimal number of iterations is 5 on SSEC, 3 or 4 on CSI300, and 6 on SZSC, which indicates that the optimal number of iterations for the LSTM-AdaBoost.R2 model is between 3 and 6.
Table 2. Prediction results of different iterations
Prediction results of four models
In order to verify the performance of the LSTM-AdaBoost.R2 model in predicting stock indexes, we conducted experiments on the SSEC, CSI300 and SZSC data sets and compared the results with those of the BP neural network, CNN and LSTM models. The prediction time step is 10, that is, the data from the 10 days preceding day t are used to predict the index value on day t. The LSTM-AdaBoost.R2 model uses exponential error to calculate sample error. So that the results of the different models on the different data sets can be compared more intuitively, only part of the experimental results are shown in Figs 4–6.
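The 10-step setup can be illustrated with a sliding window over the index series. This is a minimal sketch; the function name `make_windows` is hypothetical, and it assumes a single value per day, which may be simpler than the inputs actually used in the experiments.

```python
def make_windows(series, steps=10):
    """Build (input window, target) pairs from a time-ordered series.

    Each input holds the `steps` values immediately before day t,
    and the matching target is the value on day t.
    """
    inputs, targets = [], []
    for t in range(steps, len(series)):
        inputs.append(series[t - steps:t])
        targets.append(series[t])
    return inputs, targets
```

A series of N observations yields N - 10 training pairs when `steps` is 10, because the first 10 days have no complete history.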
Fig. 4. Prediction results of four models on SSEC (partial).
Fig. 5. Prediction results of four models on CSI300 (partial).
Fig. 6. Prediction results of four models on SZSC (partial).
In Figs 4–6, the red line represents the real value in the test set, the black line the prediction of the BP neural network, the yellow line the prediction of the CNN, the blue line the prediction of LSTM-AdaBoost.R2, and the green line the prediction of LSTM. The figures show that the blue line is closer to the red line than the other lines, indicating that the LSTM-AdaBoost.R2 model predicts better than the other three models. The error values in Table 3 show that the RMSE and MAPE values of the LSTM-AdaBoost.R2 model are the smallest on all three data sets. This indicates that the LSTM-AdaBoost.R2 model has a smaller prediction error and a higher prediction accuracy than the other three models.
Conclusion
In this paper, we propose a method to predict stock indexes based on the LSTM-AdaBoost.R2 ensemble learning model. In this method, a simple LSTM network is used as the weak learner of the AdaBoost.R2 algorithm for iterative training, and the resulting strong learner is then used to predict the stock index. Compared with the baseline models, the RMSE and MAPE values of the LSTM-AdaBoost.R2 model are lower, indicating that this model predicts the stock index more accurately. In the comparison of different sample error calculation methods, the exponential error gives the smallest error value in most cases. In addition, the influence of the number of weak learners on the prediction performance of the LSTM-AdaBoost.R2 model is discussed. The optimal number of weak learners is 5 on SSEC, 3 or 4 on CSI300, and 6 on SZSC, that is, between 3 and 6. Future work will further optimize the parameters and models, test on data sets with more features, or combine the AdaBoost.R2 algorithm with other classic models.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (No. 61802230).
