Quantitative investment prediction analysis for enterprise asset management using machine learning algorithms

Abstract

Quantitative investment can manage enterprise assets better and obtain higher returns. This paper analyzed quantitative investment prediction using machine learning algorithms. First, the support vector machine (SVM) algorithm was introduced, and stock changes were predicted by the SVM algorithm. Then, the feature factors in stock data were extracted by maximum information coefficient (MIC) as the input of the SVM algorithm. Finally, the performance and backtest results of the SVM algorithm were analyzed. The SVM algorithm performed well, and its F1-score was 0.9884, which was better than those of the C4.5 and random forest algorithms. In terms of backtesting, the portfolio built based on the prediction results of the SVM algorithm obtained a higher annualized return rate when the number of stocks was small; when the number of stocks was 10, the portfolio built based on the SVM algorithm had an annualized return rate of 83.67%, a smaller maximum retracement, and a higher Sharpe ratio than the other algorithms, which balanced risk and return well. The results demonstrate the reliability of the SVM algorithm in predicting quantitative investment, which is beneficial to optimizing enterprise asset management.

Keywords

Machine learning, quantitative investment, asset management, support vector machine, maximum retracement

1. Introduction

In financial markets, individual and enterprise investors alike strive to predict future changes. Because stock market volatility is affected by many factors [1], predicting stocks is challenging [2], and investment returns are full of uncertainty. With the development of technology, financial data can be analyzed and processed by various algorithms to forecast future returns and construct portfolios for quantitative investment [3]. Quantitative investment analyzes a large amount of data to extract useful information, builds a model on that information, and then assesses investment feasibility by validating the model. In enterprise asset management, the quantitative investment process generally consists of analyzing and screening factors related to return and risk, evaluating stocks, and recombining them to improve the return on enterprise investments. Quantitative investment uses mathematical models instead of subjective human judgment, which makes it a more rational and objective way of investing. It avoids errors and losses that may result from subjective decisions and balances the return and risk of enterprises better, which is more conducive to enterprise asset management. How to accurately predict stock prices [4] and then build portfolios is an important issue. Advances in machine learning algorithms have provided new methods for stock market forecasting [5]. Wang et al. [6] predicted the one-day volatility of Chinese and Japanese stock indices with a Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model, analyzed the results using the CSI 300 and Nikkei 225 indices, and found that the method had a favorable forecasting performance. Ji et al. [7] used an improved particle swarm optimization (IPSO) and Long Short Term Memory (LSTM) hybrid model for stock price prediction and found through experiments on the Australian stock market index that the model had good reliability. Ferdaus et al. [8] designed a parsimonious learning machine (PALM) to predict stock indices and found through predicting closing prices of 15 stock markets that the method had good performance. Thapa et al. [9] predicted Nepalese stock market volatility with Geometric Brownian Motion (GBM), carried out experiments on the stock market under the COVID-19 pandemic in 2020, and found that the method had a good degree of flexibility. Compared with current research, this paper uses the advantages of machine learning algorithms in stock price prediction to build portfolios from stocks predicted to rise and applies them to enterprise asset management to achieve more objective and scientific quantitative investment, which not only helps enterprises obtain higher investment returns but also promotes the further development of quantitative investment.

2. Prediction methods based on machine learning algorithms

2.1 Support vector machine

Machine learning [10] summarizes rules from existing data and uses those rules to predict future outcomes, so that machines simulate human learning. The support vector machine (SVM) algorithm, a common binary classification method in machine learning [11], is widely applied in data prediction [12], graphics processing [13], text mining [14], image classification [15], etc. The goal of SVM is to find a hyperplane that separates the classes in a data set; the hyperplane is written as:

$\displaystyle w^{T}x+b=0,$ (1)

where $w$ is the weight (normal) vector of the hyperplane and $b$ is the bias. For data set $T=\{({x_{1},y_{1}}),({x_{2},y_{2}}),\cdots,(x_{N},y_{N})\}$ , the class interval (margin) is maximized by solving:

$\displaystyle\min\frac{1}{2}||w||^{2},$ (2)

$\displaystyle\text{s.t. }y_{i}({w^{T}x_{i}+b})\geqslant 1.$ (3)

The above equations are transformed into the following equation using the Lagrangian function:

$\displaystyle L({w,b,a})=\frac{1}{2}||w||^{2}-\mathop{\sum}\limits_{i=1}^{N}a_{i}[{y_{i}({w^{T}x_{i}+b})-1}],$ (4)

where $a_{i}$ is the Lagrangian multiplier. Then, the original problem is transformed into a problem of solving $a$ , i.e.,

$\displaystyle\mathop{\max}\limits_{a}\mathop{\sum}\limits_{i=1}^{N}a_{i}-\frac{1}{2}\mathop{\sum}\limits_{i=1}^{N}\mathop{\sum}\limits_{j=1}^{N}a_{i}a_{j}y_{i}y_{j}({x_{i}\cdot x_{j}}),$ (5)

$\displaystyle\text{s.t. }\mathop{\sum}\limits_{i=1}^{N}y_{i}a_{i}=0,\quad a_{i}\geqslant 0.$ (6)

Then, the classification function of SVM is obtained:

$\displaystyle f(x)=\textit{sgn}\left[{\mathop{\sum}\limits_{i=1}^{N}y_{i}a_{i}({x_{i}\cdot x})+b}\right].$ (7)

In practical situations, it is generally difficult to find a hyperplane that can separate the data completely and accurately; thus, slack variable $\xi_{i}$ and penalty parameter ${C}$ are introduced into SVM. To maximize the class interval, the formula is written as:

$\displaystyle\mathop{\min}\limits_{w,b,\xi}\frac{1}{2}||w||^{2}+C\mathop{\sum}\limits_{i=1}^{N}\xi_{i},$ (8)

$\displaystyle\text{s.t. }y_{i}({w^{T}x_{i}+b})\geqslant 1-\xi_{i},\quad\xi_{i}\geqslant 0.$ (9)
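As a small worked illustration of the soft-margin objective in Eq. (8), which is minimized in the standard formulation, the sketch below evaluates $\frac{1}{2}||w||^{2}+C\sum\xi_{i}$ for a fixed hyperplane. It assumes that each slack takes its smallest feasible value under Eq. (9), $\xi_{i}=\max(0,\,1-y_{i}(w^{T}x_{i}+b))$. The function name `soft_margin_objective` and the toy data are hypothetical and are not taken from the experiments in this paper.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for a fixed hyperplane (w, b).

    Assumption: each slack is the smallest value allowed by Eq. (9),
    xi_i = max(0, 1 - y_i * (w . x_i + b)).
    """
    margins = [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]
    slacks = [max(0.0, 1.0 - m) for m in margins]
    objective = 0.5 * sum(wj * wj for wj in w) + C * sum(slacks)
    return objective, slacks


# Hypothetical toy data: two points per class; the second and fourth
# points violate the margin, so their slacks are positive.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 0.0]]
y = [1, 1, -1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=10.0)
print(slacks)      # [0.0, 0.5, 0.0, 1.5]
print(objective)   # 20.5
```

A larger $C$ makes margin violations more expensive, so the optimizer accepts a narrower margin in exchange for fewer misclassified training samples.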

In dealing with nonlinear problems, SVM first maps the data into a high-dimensional space, in which the samples become linearly separable. The decision function of the nonlinear SVM is written as:

$\displaystyle f(x)=\textit{sgn}\left[{\mathop{\sum}\limits_{i=1}^{N}y_{i}a_{i}K({x_{i},x})+b}\right],$ (10)

where $K({x_{i},x})$ is the kernel function. Among the many available kernel functions, the radial basis function (RBF) performs well on both large and small samples and requires fewer parameters. Therefore, in this paper, the SVM model is established with the RBF kernel, whose calculation formula is:

$\displaystyle K({x_{i},x_{j}})=\exp\left({-\frac{||x_{i}-x_{j}||^{2}}{2\sigma^{2}}}\right),$ (11)

where $\sigma$ is the Gaussian kernel bandwidth.
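The RBF kernel in Eq. (11) and the decision function in Eq. (10) can be sketched in a few lines of plain Python. This is only an illustration: the support vectors, multipliers $a_{i}$, bias $b$ and bandwidth $\sigma$ below are hypothetical values rather than a trained model, and the convention of returning $+1$ when the score is exactly zero is an assumption.

```python
import math


def rbf_kernel(xi, xj, sigma):
    """Eq. (11): exp(-||xi - xj||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def decide(x, support_vectors, labels, alphas, b, sigma):
    """Eq. (10): sign of sum_i y_i * a_i * K(x_i, x) + b."""
    score = sum(
        a * yl * rbf_kernel(sv, x, sigma)
        for sv, yl, a in zip(support_vectors, labels, alphas)
    ) + b
    return 1 if score >= 0 else -1  # assumed convention: a zero score maps to +1


# Hypothetical model with one support vector per class.
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
labels = [1, -1]
alphas = [1.0, 1.0]

print(decide([0.1, 0.1], support_vectors, labels, alphas, b=0.0, sigma=1.0))  # 1
print(decide([1.9, 1.9], support_vectors, labels, alphas, b=0.0, sigma=1.0))  # -1
```

A smaller $\sigma$ makes the kernel decay faster with distance, so each support vector influences only a narrow neighborhood; a larger $\sigma$ gives a smoother decision boundary.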

2.2 Feature screening

When predicting stock changes with the SVM, the relevant factors need to be extracted from the stock data as the input to the SVM first. In this paper, the feature factors were selected from the quotation side and technical side. The quotation-side factors include the amount of increase and decrease, opening price, closing price, turnover rate, and volume. The technical-side factors are shown in Table 1.

Table 1
Technical-side factors

Name	Meaning
PE	Price-to-earnings ratio
PB	Price-to-book ratio
PS	Price-to-sales ratio
ROA	Return on assets
ROE	Return on equity
MA10	10-day moving average
MACD	Moving average convergence/divergence
RSI	Relative strength index
TVMA20	20-day transaction volume moving average
REVS20	20-day revenues
BIAS20	20-day bias rate
MTM	Momentum index

The above 17 factors were re-screened by maximum information coefficient (MIC) [16]. MIC is an algorithm that measures the correlation between data [17]. Mutual information refers to information content shared between two variables. The formula for feature selection using mutual information can be written as:

$\displaystyle I({x;y})=\mathop{\int}\!\!\!\int\nolimits p({x,y})\log\frac{p({x,y})}{p(x)p(y)}dx\,dy,$ (12)

$\displaystyle J(f)=I({C;f})-\beta\mathop{\sum}\limits_{s\in S}I({s:f}),$ (13)

where $p({x,y})$ refers to the joint probability of ${x}$ and ${y}$ , $p(x)$ and $p(y)$ refer to their marginal probabilities, $J$ refers to the evaluated value of every feature at every iteration, ${I}({{C};{f}})$ refers to the mutual information between class ${C}$ and feature ${f}$ , $I({{s}:{f}})$ refers to the mutual information between feature ${f}$ and a feature ${s}$ in the already selected subset ${S}$ , and ${\beta}$ is an adjustment coefficient (value $=$ 1). It is assumed that there is a data set $D$ of size ${n}$ , that integers ${x}$ and ${y}$ are the numbers of grid cells along the two axes, and that ${G}$ is a grid with ${x}$ columns and ${y}$ rows. The calculation formula of MIC is:

$\displaystyle\textit{MIC}(D)=\mathop{\max}\limits_{xy<B(n)}\frac{I({D,x,y})}{\log_{2}\min({x,y})},$ (14)

$\displaystyle I({D,x,y})=\max_{G}I({D|G}),$ (15)

where $B(n)$ is the growth factor, $B(n)=n^{0.6}$ .
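To make Eqs. (14) and (15) more concrete, the following sketch computes a simplified MIC estimate. It rests on an important assumption: instead of maximizing over all grids $G$ with $x$ columns and $y$ rows as Eq. (15) requires, it evaluates only the equal-width grid for each resolution, so the result is a lower bound on the true MIC rather than the MIC itself. The function names are hypothetical.

```python
import math
from collections import Counter


def grid_mutual_information(xs, ys, nx, ny):
    """Mutual information in bits after binning into an nx-by-ny equal-width grid."""
    def bin_index(v, lo, hi, k):
        if hi == lo:
            return 0  # a constant variable falls into a single bin
        return min(int((v - lo) / (hi - lo) * k), k - 1)

    n = len(xs)
    lox, hix, loy, hiy = min(xs), max(xs), min(ys), max(ys)
    cells = [
        (bin_index(x, lox, hix, nx), bin_index(y, loy, hiy, ny))
        for x, y in zip(xs, ys)
    ]
    joint = Counter(cells)
    px = Counter(i for i, _ in cells)
    py = Counter(j for _, j in cells)
    # Discrete form of Eq. (12): sum of p(x,y) * log2(p(x,y) / (p(x) p(y))).
    return sum(
        (c / n) * math.log2(c * n / (px[i] * py[j]))
        for (i, j), c in joint.items()
    )


def mic_equal_width(xs, ys, exponent=0.6):
    """Simplified Eq. (14): search grid sizes with nx * ny < B(n) = n ** exponent."""
    budget = len(xs) ** exponent
    best = 0.0
    for nx in range(2, int(budget // 2) + 1):
        for ny in range(2, int(budget // nx) + 1):
            if nx * ny >= budget:
                continue
            mi = grid_mutual_information(xs, ys, nx, ny)
            best = max(best, mi / math.log2(min(nx, ny)))
    return best


xs = [float(v) for v in range(100)]
print(mic_equal_width(xs, [2 * v + 1 for v in xs]))  # close to 1 for a linear relation
print(mic_equal_width(xs, [5.0] * 100))              # 0.0 for a constant variable
```

Dividing by $\log_{2}\min(x,y)$ normalizes the score to the range from 0 to 1, which is what makes values comparable across grid resolutions.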

Ultimately, the equation for feature selection based on MIC can be written as:

$\displaystyle J(f)=\frac{1}{n}\mathop{\sum}\limits_{i=1}^{n}\textit{MIC}({C;f_{i}})-\frac{1}{n}\beta\mathop{\sum}\limits_{s\in S}\mathop{\sum}\limits_{i=1}^{n}\textit{MIC}({s_{i}:f_{i}}).$ (16)
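The greedy use of such a criterion can be sketched as follows. For simplicity, the example follows the relevance-minus-redundancy form of Eq. (13) with precomputed scores in place of mutual information, and it does not reproduce the averaging in Eq. (16). The factor names come from Table 1, but all numerical scores are hypothetical, as is the function name `greedy_select`.

```python
def greedy_select(relevance, redundancy, k, beta=1.0):
    """Pick k features, each time adding the one with the largest J value.

    relevance[f]                 : score between the class and feature f
    redundancy[frozenset((f, s))]: score between two features f and s
    J(f) = relevance[f] - beta * sum of redundancy with already selected features.
    """
    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < k:
        def j_value(f):
            penalty = sum(redundancy[frozenset((f, s))] for s in selected)
            return relevance[f] - beta * penalty

        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=j_value)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical scores for three of the factors in Table 1.
relevance = {"MACD": 0.60, "RSI": 0.55, "MA10": 0.50}
redundancy = {
    frozenset(("MACD", "RSI")): 0.50,
    frozenset(("MACD", "MA10")): 0.10,
    frozenset(("RSI", "MA10")): 0.20,
}
print(greedy_select(relevance, redundancy, k=3))  # ['MACD', 'MA10', 'RSI']
```

In this toy case RSI is more relevant than MA10 on its own, but it is selected last because it overlaps strongly with MACD, which is the behavior the redundancy term is meant to produce.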

3. Experiments and results

3.1 Quantitative investment process for enterprise asset management

The experimental data used in this paper were from the CSMAR database. The data of the CSI 300 index on trading days between January 1, 2010 and December 31, 2020 were selected, including quotation-side and technical-side feature data. Missing values were filled in, extreme values were corrected using the quantile method, and min-max normalization was then performed on the data. The model was trained using the data between January 1, 2010 and December 31, 2015 and validated using the data between January 1, 2016 and December 31, 2020. The asset management of enterprise A was simulated. The initial capital was set as 10 million yuan, and the commission rate was 3‰. After feature factor screening by MIC, the future daily change of the stock was predicted by the SVM model. The class label was determined by the closing price: if the predicted closing price of the next trading day was higher than the current value, it indicated a rise, denoted as TRUE, and the stock was bought; otherwise, it was denoted as FALSE, and the stock was sold.
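The preprocessing and labeling steps described above can be sketched as follows. The paper does not state which quantiles were used to correct extreme values, so the clipping quantiles here are assumptions, and the function names and toy numbers are hypothetical.

```python
def winsorize(values, lower_q=0.01, upper_q=0.99):
    """Clip values to the [lower_q, upper_q] quantile range (assumed quantiles)."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]
    hi = ordered[int(upper_q * (n - 1))]
    return [min(max(v, lo), hi) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - lo) / (hi - lo) for v in values]


def next_day_labels(closes):
    """TRUE if the next day's close is higher than today's close, else FALSE."""
    return [closes[t + 1] > closes[t] for t in range(len(closes) - 1)]


print(winsorize([1, 2, 3, 4, 100], lower_q=0.25, upper_q=0.75))  # [2, 2, 3, 4, 4]
print(min_max_scale([2, 4, 6]))                                   # [0.0, 0.5, 1.0]
print(next_day_labels([10, 11, 10.5, 10.5, 12]))                  # [True, False, False, True]
```

In a chronological split such as the one used here, the clipping bounds and the minimum and maximum would normally be estimated on the training period only and then reused on the validation period, so that no information from 2016 to 2020 leaks into training.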

3.2 Model evaluation indexes

The evaluation of the results consisted of two parts. Firstly, the performance of the SVM model was evaluated based on the confusion matrix (Table 2).

Table 2
Confusion matrix

	Actual TRUE	Actual FALSE
Predicted as TRUE	TP	FP
Predicted as FALSE	FN	TN

The specific indicators included:

(1) precision:

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP};$ (17)

(2) recall rate:

$\displaystyle\textit{Recall}=\frac{TP}{TP+FN};$ (18)

(3) accuracy:

$\displaystyle\textit{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN};$ (19)

(4) F1-score:

$\displaystyle\textit{F1-score}=\frac{2\times\textit{Precision}\times\textit{Recall}}{\textit{Precision}+\textit{Recall}}.$ (20)
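The four indicators can be computed directly from the confusion-matrix counts in Table 2, as in the sketch below. Accuracy is taken here in its standard form, the share of correct predictions, $(TP+TN)/(TP+TN+FP+FN)$. The counts in the example are hypothetical, and the function assumes that no denominator is zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts for 100 predictions.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 0.8, recall is about 0.667, accuracy is 0.7 and the F1-score is about 0.727, which lies between precision and recall as expected for a harmonic mean.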

The quantitative investment results were evaluated by backtesting. The specific indicators are shown below.

(1) Annualized return rate: the return rate earned through the portfolio in one year:

$\displaystyle R_{p}=[{({1+P})^{\frac{250}{n}}-1}]\times 100\%,$ (21)

where $P$ refers to the return of the portfolio and $n$ refers to the number of days to execute the portfolio.

(2) $\alpha$ : the excess return rate of the portfolio free from market fluctuations:

$\displaystyle\alpha=R_{p}-[{R_{f}+\beta_{p}({R_{m}-R_{f}})}],$ (22)

where $R_{p}$ is the annualized return rate of the portfolio, $R_{f}$ is the risk-free rate, $R_{m}$ is the benchmark annualized return rate, and $\beta_{p}$ is the portfolio $\beta$ defined in Eq. (23).

(3) $\beta$ : the return of the portfolio in response to market fluctuations:

$\displaystyle\beta_{p}=\frac{\textit{cov}({D_{p},D_{m}})}{\textit{var}({D_{m}})},$ (23)

where $D_{p}$ refers to the daily return of the portfolio, $D_{m}$ refers to the benchmark daily return, cov refers to the covariance, and var refers to the variance.

(4) Maximum retracement: the maximum decline of the portfolio that occurred in the backtest period:

$\displaystyle\max D=\max\frac{{P}_{x}-{P}_{y}}{{P}_{x}},$ (24)

where $P_{x}$ and $P_{y}$ refer to the total value of the portfolio (stock holdings plus cash) on day $x$ and on a later day $y$ , respectively.

(5) Sharpe ratio: excess return per unit of risk:

$\displaystyle\textit{Sharpe Ratio}=\frac{{R}_{p}-{R}_{f}}{{\sigma}_{i}},$ (25)

where $\sigma_{i}$ refers to the standard deviation of the return rate of the portfolio.
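The five backtest indicators in Eqs. (21) to (25) can be sketched as follows. Several choices are assumptions: returns are handled as fractions rather than percentages, covariance and variance use sample ($n-1$) estimates, the maximum retracement compares each value with the highest earlier value, and the standard deviation passed to the Sharpe ratio is assumed to be on the same annual scale as the returns. All function names and numbers are hypothetical.

```python
import statistics

TRADING_DAYS = 250  # trading days per year, as in Eq. (21)


def annualized_return(total_return, n_days):
    """Eq. (21) as a fraction: (1 + P) ** (250 / n) - 1."""
    return (1.0 + total_return) ** (TRADING_DAYS / n_days) - 1.0


def portfolio_beta(portfolio_daily, benchmark_daily):
    """Eq. (23): cov(D_p, D_m) / var(D_m), with sample (n - 1) estimates."""
    mean_p = statistics.mean(portfolio_daily)
    mean_m = statistics.mean(benchmark_daily)
    cov = sum(
        (p - mean_p) * (m - mean_m)
        for p, m in zip(portfolio_daily, benchmark_daily)
    ) / (len(benchmark_daily) - 1)
    return cov / statistics.variance(benchmark_daily)


def portfolio_alpha(r_p, r_f, r_m, beta_p):
    """Eq. (22): R_p - [R_f + beta_p * (R_m - R_f)]."""
    return r_p - (r_f + beta_p * (r_m - r_f))


def max_retracement(values):
    """Eq. (24): largest relative drop from a running peak to a later value."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst


def sharpe_ratio(r_p, r_f, sigma):
    """Eq. (25): excess return per unit of risk."""
    return (r_p - r_f) / sigma


benchmark = [0.01, -0.02, 0.03, 0.00]
portfolio = [2 * r for r in benchmark]
print(annualized_return(0.21, 500))                      # about 0.10
print(portfolio_beta(portfolio, benchmark))              # about 2.0
print(portfolio_alpha(0.30, 0.03, 0.10, 1.0))            # about 0.20
print(max_retracement([100.0, 120.0, 90.0, 110.0, 80.0]))  # about 0.333
print(sharpe_ratio(0.30, 0.03, 0.15))                    # about 1.8
```

In the retracement example, the peak of 120 is followed by a low of 80, so the largest decline is 40/120, even though the series also contains a smaller drop from 120 to 90.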

3.3 Results

The features were screened using MIC. The factor with the largest $J$ value was added to the feature subset in every iteration until all factors were added to the subset. To find the appropriate subset, subsets containing different numbers of factors were used as the input of the SVM model. The accuracy of the SVM model is shown in Fig. 1.

Figure 1.

Results of feature screening by MIC.

It was seen from Fig. 1 that when the factor with the largest $J$ value was used as the input to the SVM model, the accuracy of the SVM model was 58.64%; when the top 2 and 3 factors with the largest $J$ value were used as the input of the SVM model, the accuracy of the SVM increased; when the top 7 factors with the largest $J$ value were used as the input of the SVM model, the accuracy of the SVM model was the highest, 69.87%; with the increase of screening factors, the accuracy of the SVM model decreased instead. Therefore, the top 7 factors with the largest $J$ value were selected as the input of the SVM model to establish the prediction model.

To further assess the reliability of feature selection based on MIC, the SVM model was used as the predictive model, and its accuracy with MIC-based feature selection was compared with its accuracy without feature selection and with features selected by another commonly used method, principal component analysis (PCA) [18]. The accuracy of the different methods is shown in Fig. 2.

It was seen from Fig. 2 that the accuracy of the SVM model without feature selection was only 47.86%, indicating that omitting feature selection led to poor prediction accuracy; when PCA was used, the accuracy of the SVM model was 55.43%, which was 7.57 percentage points higher than that of the model without feature selection; the accuracy of the SVM model that selected features based on MIC was 14.44 percentage points higher than that based on PCA. These results verified the effectiveness of MIC, which contributed to the improvement of stock prediction accuracy.

The performance of the SVM model in predicting stocks was compared with two other machine learning methods: C4.5 [19] and random forest algorithms [20], and the results are shown in Table 3.

Table 3

Prediction performance of the SVM, C4.5, and random forest algorithms

	SVM algorithm	C4.5 algorithm	Random forest algorithm
Precision	0.9883	0.9762	0.9754
Recall rate	0.9886	0.9972	0.9987
Accuracy	0.9796	0.9464	0.9578
F1-score	0.9884	0.9866	0.9869

Figure 2.

The effect of feature selection on the accuracy.

It was seen from Table 3 that the SVM algorithm had the highest precision (0.9883), the random forest algorithm had the highest recall rate (0.9987), the SVM algorithm had the highest accuracy (0.9796), and the F1-score of the SVM algorithm was 0.9884, which was 0.0018 higher than the C4.5 algorithm and 0.0015 higher than the random forest algorithm. These results suggested that the SVM algorithm had a good performance in predicting stocks and could predict the change of stocks accurately to establish proper quantitative portfolios and realize the optimization of enterprise asset management.

The quantitative investment portfolio was built using the SVM model. To determine the number of stocks in the portfolio, the annualized return rate under different situations was compared, and the results are shown in Fig. 3.

Figure 3.

The effect of the number of stocks on the annualized return rate.

It was seen from Fig. 3 that the annualized return rate of the portfolio decreased gradually as the number of stocks in the quantitative investment portfolio increased; when the number of stocks was 10, the annualized return rate was the highest, 83.67%; when the number of stocks was 20, the annualized return rate dropped by 11.31 percentage points; when the number of stocks was 50, the annualized return rate was only 41.28%, which was 42.39 percentage points lower than that when the number of stocks was 10. These results demonstrated that the greater the number of stocks was, the stronger the uncertainty and the higher the risk were.

Finally, the backtest results of the quantitative portfolio were analyzed. The investment portfolio was built using different models. The asset management data of enterprise A when the number of stocks in the portfolio was 10 are shown in Table 4.

Table 4

The performance of the SVM algorithm

	SVM algorithm	C4.5 algorithm	Random forest algorithm
Annualized return rate	83.67%	57.64%	55.56%
$\alpha$	0.61	0.56	0.44
$\beta$	0.86	0.95	0.84
Maximum retracement	44.32%	46.78%	47.12%
Sharpe ratio	1.92	1.37	1.59

It was seen from Table 4 that the SVM algorithm had the highest annualized return rate, 83.67%, which was 26.03 percentage points higher than the C4.5 algorithm and 28.11 percentage points higher than the random forest algorithm, indicating that the portfolio built based on the prediction results of the SVM algorithm could achieve higher returns. Moreover, the SVM algorithm had the largest $\alpha$ value, 0.61. The random forest algorithm had the smallest $\beta$ value, 0.84, but the $\beta$ value of the SVM algorithm was only 0.02 higher than that of the random forest algorithm. The maximum retracement of the SVM algorithm was the smallest (44.32%), which was 2.46 percentage points lower than the C4.5 algorithm and 2.80 percentage points lower than the random forest algorithm. The Sharpe ratio of the SVM algorithm was the highest (1.92), which was 0.55 higher than the C4.5 algorithm and 0.33 higher than the random forest algorithm. The above results demonstrated that the portfolio built based on the prediction results of the SVM algorithm had better risk control and could achieve greater returns with balanced risk, which was more conducive to enterprise asset management.

4. Conclusion

This paper used machine learning algorithms to study quantitative investment in enterprise asset management. The future change of stocks was predicted by the SVM algorithm to build quantitative portfolios, and the algorithm was applied in the asset management of enterprise A. The performance of the algorithm was evaluated and verified through backtesting. The experimental results demonstrated that the SVM algorithm performed well, with high accuracy and precision, and its F1-score was 0.9884, which was higher than those of the C4.5 and random forest algorithms. The backtest results demonstrated that the portfolio built based on the prediction results of the SVM algorithm had a good balance of risk and return; the smaller the number of stocks was, the higher the annualized return rate was; when the number of stocks in the portfolio was 10, the annualized return rate was 83.67%, the maximum retracement was the smallest among the three algorithms (44.32%), and the Sharpe ratio was 1.92, which verified the reliability of the SVM algorithm in quantitative investment. This work can provide scientific guidance for enterprise asset management, and the SVM algorithm can be further promoted and applied in practice.

References

1. The Research on Influencing Factors of Stock Price Fluctuation of Listed Companies in China Based on PCA-Multiple Regression. Open Journal of Social Sciences. 2021; 09(3): 305-315.

2. Sangeetha, Priya, Elias, Mamgain, Wassan, Gulati. Techniques using artificial intelligence to solve stock market forecast, sales estimating and market division issues. Journal of Contemporary Issues in Business and Government. 2021; 27(3): 2021.

3. Chen, Zhang, Shen, Deng, Huang. A quantitative investment model based on random forest and sentiment analysis. Journal of Physics Conference Series. 2020; 1575: 012083.

4. Linardos, Kermanidis, Maragoudakis. Using financial news articles with minimal linguistic resources to forecast stock behaviour. International Journal of Data Mining Modelling and Management. 2015; 7(3): 185.

5. Chang. LSTM-based sentiment analysis for stock price forecast. PeerJ Computer Science. 2021; 7(1): e408.

6. Wang, Nishiyama. Volatility forecast of stock indices by model averaging using high-frequency data. International Review of Economics & Finance. 2015; 40(NOV.): 324-337.

7. Liew, Yang. A novel improved particle swarm optimization with long-short term memory hybrid model for stock indices forecast. IEEE Access. 2021; 9: 23660-23671.

8. Ferdaus, Chakrabortty, Ryan. Multiobjective automated type-2 parsimonious learning machine to forecast time-varying stock indices online. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2021; PP(99): 1-14.

9. Thapa, Aryal. Use of geometric Brownian motion to forecast stock market scenario using post covid-19 NEPSE index. BIBECHANA. 2021; 18(2): 50-60.

10. Challagulla, Bastani, Yen, Paul. Empirical assessment of machine learning based software defect prediction techniques. International Journal of Artificial Intelligence Tools. 2015; 17(02): 389-400.

11. Tang, Tian, Pardalos. Valley-loss regular simplex support vector machine for robust multiclass classification. Knowledge-Based Systems. 2021; 216(3): 106801.

12. Navidi, Seyedmohammadi, Jalali. Predicting soil water content using support vector machines improved by meta-heuristic algorithms and remotely sensed data. Geomechanics and Geoengineering. 2021; (1): 1-15.

13. Anca, Kluczek, Zagajewski, Raczko, Kycko, Al-Sulttani, Tardà, Pineda, Corbera. Comparison of support vector machines and random forests for corine land cover mapping. Remote Sensing. 2021; 13(4): 777.

14. Damanik, Setyohadi. Analysis of public sentiment about COVID-19 in Indonesia on Twitter using multinomial naive Bayes and support vector machine. IOP Conference Series: Earth and Environmental Science. 2021; 704(1): 1-11.

15. Sonmez, Eczacoglu, Gumuş, Aslan, Sabancic, Aikkutlud. Convolutional neural network – Support vector machine based approach for classification of cyanobacteria and chlorophyta microalgae groups. Algal Research. 2022; 61: 102568.

16. Guo, Mao. Detecting associations based on the multi-variable maximum information coefficient. IEEE Access. 2021; 9: 54912-54922.

17. Wang, Dai, Wang, Zhou. SuperMIC: Analyzing large biological datasets in bioinformatics with maximal information coefficient. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2017; PP(4): 783-795.

18. Shastry, Sanjay. A modified genetic algorithm and weighted principal component analysis based feature selection and extraction strategy in agriculture. Knowledge-Based Systems. 2021; 232(1): 107460.

19. Fitrani, Rosid, Taurusta, Fauzia. Classification Using C4.5 Algorithm in Election Participation Prediction. IOP Conference Series: Materials Science and Engineering. 2020; 874: 012016.

20. Castro-Franco, Costa, Peralta, Aparicio. Prediction of soil properties at farm scale using a model-based soil sampling scheme and random forest. Soil Science. 2015; 180(2): 74-85.
