Research on influencing factors of stock returns based on multiple regression and artificial intelligence model

Abstract

Investors face a large number of listed companies and a complicated range of stock varieties. Various systems for evaluating stock performance exist in the market, but there is no uniform standard, so investors often cannot invest effectively. At the same time, stock management companies have their own characteristics and differ in shareholding structure and internal management structure. Against this background, this paper uses multiple regression models and artificial intelligence models to construct a model for analyzing the factors that influence stock returns, statistically describes the sample data and factor data, and tests the applicability of the five-factor model to the performance evaluation of mixed stocks. In addition, this article carries out data simulation analysis based on the actual situation and uses a five-factor analysis model to carry out quantitative research on stock returns. The simulation analysis shows that the model constructed in this paper has a certain effect in the analysis of factors affecting stock returns.

Keywords

Multiple regression, artificial intelligence, stock returns, influencing factors

1 Introduction

Since the birth of the world’s first stock exchange in the Netherlands, the issue of stock returns has always been the most concerned topic in the stock market, and the volatility of stock prices is the undoubted core of this topic. Over the years, there have been countless studies on the relationship between stock returns and volatility in both academic and industry circles. People have been trying to describe this relationship from various angles and using different methods and have drawn many meaningful conclusions. These studies have greatly promoted the deepening of theoretical research and promoted the maturity of the global securities market. However, unfortunately, these studies have not been able to reach a completely consistent conclusion, and the research on the relationship between stock returns and volatility is still in the process of continuous exploration [1].

Return and volatility are important aspects of much economic and financial research. The return reflects price changes in the financial market, and the volatility reflects the severity of those changes. Returns and their fluctuations are related to the selection of securities portfolios and to risk management. In reality, some domestic policies and random events, such as macro-control and market emergencies, have an impact on the stock market. The current research methods for these factors mainly include principal component analysis and linear regression analysis. However, these methods can only process low-dimensional data, and linear regression analysis in particular can only analyze the impact of specific factors on the results.

From the perspective of reality, science and technology continue to develop, and government statistics are increasing, which brings about the difficulty of processing high-dimensional data. The dynamic factor model is the promotion and development of the traditional factor model in terms of time series. It is good at processing data where the number of observation time points is greater than the number of observation variables. If the dimension of the observation variable is high and the influence of the factor is limited, the factor loading matrix of the observation equation is often a sparse matrix. So far, there are three estimation methods for dynamic factor models. One is the state space and maximum likelihood estimation method, which can only deal with low-dimensional dynamic factor models. The second is the method of extracting principal components, and the third is the method of hybrid estimation of principal components and state space, neither of which can get a sparse factor loading matrix. In order to solve the above problems, ERM is used to obtain the sparse parameter estimation of the high-dimensional dynamic factor model, and the sparseness of this method is also in line with the actual situation of the financial market [2].

China’s capital market started late and is not yet mature: the relevant laws and regulations are not sound enough, the information disclosure of listed companies is not transparent enough, the money-trapping phenomenon is serious, and the protection of investors is insufficient. Investors face a large number of listed companies and a complicated range of stock varieties. Moreover, there are various systems for evaluating stock performance in the market, but there is no uniform standard, so investors often do not know where to start. In addition, stock management companies have their own characteristics. They differ in ownership structure and internal management structure, their management philosophies are biased, stock managers frequently leave their jobs, and the heterogeneity of stock managers is obvious, all of which make the stability and continuity of stock investment performance questionable. When the public chooses to invest in stocks, it needs performance evaluation standards for reference. Therefore, based on the current situation and prospects of the accelerated development of China’s securities investment stock industry, this paper studies the performance of open hybrid stocks, which account for a large proportion of the market. Combining theoretical and empirical methods, this article analyzes the performance of hybrid stocks and the stock selection and timing capabilities of hybrid stock managers. Moreover, this article attempts to explore the impact of the non-systematic risks of listed companies on stock performance and to analyze the contribution of stock managers’ investment ability to stock performance. In addition, this article strives to contribute to the improvement of financial theory and to provide a reference for investors in stock selection [3].

2 Related work

The literature [4] pointed out that in the field of stock market risk management, depending on whether stock price volatility is analyzed as an external influence factor or as an internal volatility range, market risk measurement methods can be divided into relative measurement algorithms and absolute measurement algorithms. The literature [5] developed a “risk measurement system” model for measuring market risk based on rigorous and systematic probability and statistics theory, and defined VaR as the estimated maximum change in market value that may occur before the position is cancelled or revalued.

The literature [6] discussed the meaning and application of the VaR method. As the shortcomings of the VaR model in its application gradually emerged, scholars began to think about revising the premises and assumptions of VaR or improving the model. The literature [7] pointed out that compared with the normal distribution, the return rate sequence estimated by the VaR model under the GED distribution can better reflect the characteristics of the fat tail, and there are fewer parameters to be estimated. The literature [8] proposed parametric and non-parametric estimation methods for the weakness of the VaR model that cannot fit the fat tail characteristics of the actual rate of return sequence, and then found through empirical research that the modified distribution can make up for this weakness. The results show that the VaR obtained by non-parametric correction is better.

The literature [9] pointed out that the traditional VaR model does not have enough research on tail risk. Through empirical research, it concluded that the conditional extreme value model is more suitable for predicting smaller samples because it is based on the tail distribution measurement. The literature [10] believed that the normal distribution of VaR limits its application. It proposed a biased student distribution and discussed the value-at-risk measurement under the assumptions of different income distributions. The literature [11] uses the time-varying conditional variance method to calculate the liquidity risk of stock prices from the perspective of making up for the weakness of VaR. The research results show that the time-varying conditional variance method can more accurately describe the characteristics of “peak, fat tail and asymmetric” price volatility.

The literature [12] believed that the VaR method only provides a tail data of the rate of return and does not conduct an in-depth study of the tail loss, which is not conducive to measuring the volatility under extreme conditions. Based on this, they proposed a further improvement plan, that is, the ES method to solve the value at risk of financial assets when the loss is greater than a specific VaR value. The literature [13] also demonstrated through empirical evidence that extreme value theory (EVT) is more accurate in describing the distribution of financial asset returns under extreme volatility. The literature [14] uses the Bayesian MCMC method to combine the prior information of historical data with sample data as the unknown parameters in the posterior information estimation model to estimate the VaR value.

The literature [15] used a mathematical model to deduce that the VaR method does not satisfy the consistency principle, and it has only monotonic and same additive properties. The literature [16] pointed out that the VaR method restricts the minimum risk portfolio. At present, the VaR model is widely used in the fields of investment portfolio, risk measurement, performance measurement, position management, risk estimation and control of financial institutions. The literature [17] put endogenous and exogenous liquidity into the traditional VaR risk measurement framework, and conducts empirical testing based on Shanghai and Shenzhen sample data. The result proves that the VaR model considering liquidity is more conducive to accurately measuring the stock price volatility than the traditional VaR model.

During the growth period of the ARCH model, scholars made innovative expansions of the ARCH cluster model and made two major breakthroughs. The literature [18] extended it to the GARCH model and increased the lag period of the conditional variance of the interference term. Other extended models include VARCH, threshold ARCH, factor ARCH models, etc. Since the creation of the GARCH model, the new results of the ARCH cluster family are almost all formed on the basis of the GARCH model. A series of long-term memory research combined with the ARCH model has made the ARCH model a breakthrough in economics. The ARCH cluster model has been applied to the study of economic time data due to its good statistical characteristics and the advantages of accurately describing the volatility of financial asset returns.

The literature [19] summarized and classified the ARCH models to explore the performance of the long memory ARCH model. Using the ARCH model, it found that stock price fluctuations are clustered and that the fluctuations are stable. In addition, it found the Shanghai A-share market to be weak-form efficient.

With the continuous development and innovation of mathematical theory and financial technology, the ARCH model has been found to have the following four shortcomings. The first is that the model specification of the ARCH model is difficult to test. The literature [20] believes that the ARCH model has obstacles in specification testing, and that the diagnostic analysis method of the linear regression model is not suitable for the ARCH model. The second is that the ARCH model is not suitable for describing stock price fluctuations under extreme conditions. According to the ARCH model, when abnormal returns occur in a certain period, the volatility of the conditional variance in the next period will increase, so the volatility parameters obtained by this model are unstable and prone to error. The third is that the ARCH model assumes that positive and negative shocks have a consistent and symmetrical impact on financial asset price fluctuations, which does not conform to the actual situation of the financial market. The fourth is that the conditional variance equation of the ARCH model depends on the autoregressive order. To improve the accuracy of risk measurement, it is necessary to increase the autoregressive order, which greatly increases the difficulty of parameter estimation and in turn reduces the accuracy of risk measurement. The literature [21] believed that the shortcomings of the ARCH model are as follows: on the one hand, the lag order of the squared residuals is difficult to determine; on the other hand, if the influence of all lagged squares is considered, the lag order is pushed to a larger value and the model becomes more complicated.

3 Consistent optimization solution for sparse multiple logistic regression

This section first considers the case of large-scale samples and proposes a distributed optimization algorithm, called SP-SMLR, for data sets containing a large number of samples. The theoretical basis of the algorithm is global variable consistency optimization, which has been studied by many scholars. The core idea is that when the sample size of the data set is too large for a single machine to process, the data set can be divided into multiple sub-data sets, each of which a single machine can still process. The data set is divided according to the sample dimension, that is, in the manner shown in Fig. 1. The divided data set is expressed as $D = {\{ D_{1}, D_{2}, \dots, D_{N} \}}^{T}$. Among them, $D_{i} = \{ X_{i}, Y_{i} \}$, $X_{i} \in R^{m_{i} \times n}$, $Y_{i} \in R^{k \times m_{i}}$. Moreover, $\sum_{i = 1}^{N} m_{i} = m$, where m_i is the number of samples in the i-th data block [22].

Fig. 1

Data division according to sample dimensions.
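The row-wise split of Fig. 1 can be illustrated with a minimal NumPy sketch. The function name `split_by_samples`, the synthetic data and the use of `np.array_split` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def split_by_samples(X, Y, n_blocks):
    """Split X (m x n) by rows and Y (k x m) by columns into n_blocks pieces."""
    row_blocks = np.array_split(np.arange(X.shape[0]), n_blocks)
    return [(X[idx], Y[:, idx]) for idx in row_blocks]

# Synthetic example (assumed data): m samples, n features, k classes
rng = np.random.default_rng(0)
m, n, k = 10, 4, 3
X = rng.normal(size=(m, n))
labels = rng.integers(0, k, size=m)
Y = np.eye(k)[labels].T          # one-hot labels, shape (k, m)

blocks = split_by_samples(X, Y, n_blocks=3)
sizes = [X_i.shape[0] for X_i, _ in blocks]   # the block sizes m_i
```

Each block keeps all n features, and the block sizes m_i add up to m, matching the notation of this section.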

In order to process multiple data blocks independently, we write the original optimization goal of the SMLR problem in the following form to perform distributed optimization: $\begin{matrix} \underset{W_{i}, Z}{\text{minimize}} \; \sum_{i = 1}^{N} l_{i} (X_{i} W_{i}) + λ {∥ Z ∥}_{1} \\ \text{s.t.} \; W_{i} - Z = 0, \; i = 1, 2, \dots, N \end{matrix}$ (1)

Among them, $l_{i} (X_{i} W_{i}) = - \frac{1}{m} Tr [Y_{i}^{T} p (X_{i} W_{i})]$ (2)

The minimization problem shown in formula (1) is called the global consistency problem. The constraints ensure that the local variables eventually become consistent. For convenience, the global consistency problem of SMLR is called a distributed optimization problem based on a sample partition strategy, that is, a distributed SMLR problem based on sample partition. The variables $W_{i} \in R^{n \times k}$ and $Z \in R^{n \times k}$ are called the local variables and the global variable, respectively. The objective function l_i of the i-th part uses the i-th data block to optimize the model parameter W_i, and different calculation nodes optimize different local variables.

When we use the augmented Lagrangian form to rewrite formula (1), we get the following formula: $\begin{matrix} L_{ρ} (W_{1}, \dots, W_{N}, Z, U) = \sum_{i = 1}^{N} l_{i} (X_{i} W_{i}) + λ {∥ Z ∥}_{1} \\ + \frac{ρ}{2} \sum_{i = 1}^{N} {∥ W_{i} - Z + U_{i} ∥}_{2}^{2} \end{matrix}$ (3)

At this time, the distributed SMLR problem based on sample partitioning can be solved in an iterative manner. The iterative formulas of variables are shown in formula (4) to formula (6).

$W_{i}^{k + 1} : = \underset{W_{i}}{arg min} (l_{i} (X_{i} W_{i}) + \frac{ρ}{2} {∥ W_{i} - Z^{k} + U_{i}^{k} ∥}_{2}^{2})$ (4)

$Z^{k + 1} : = \underset{Z}{arg min} (λ {∥ Z ∥}_{1} + \frac{N ρ}{2} {∥ Z - {\bar{W}}^{k + 1} - {\bar{U}}^{k} ∥}_{2}^{2})$ (5)

$U_{i}^{k + 1} : = U_{i}^{k} + W_{i}^{k + 1} - Z^{k + 1}$ (6)

Among them, ${\bar{W}}^{k + 1}$ and ${\bar{U}}^{k}$ are the averages of $W_{i}^{k + 1}$ and $U_{i}^{k}$ over i = 1, 2, ⋯ , N, respectively.

In formulas (4) to (6), the calculations in the first and third steps can be distributed in different calculation nodes. In the second step, the algorithm will aggregate the local variable W_i obtained by each calculation node and update the global variable Z. The update problem of Z can be regarded as a Lasso problem, which can be solved using any Lasso solving algorithm. The above steps continue to iterate to ensure that local variables and global variables tend to be consistent. Moreover, the above variable update formula constitutes the main algorithm framework of SP-SMLR.
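The iteration in formulas (4) to (6) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions and not the authors' implementation: the loss l_i is taken to be the softmax cross-entropy scaled by 1/m, the local update (4) is solved only approximately with a few gradient steps, and the Z update (5) uses the closed-form soft-thresholding solution of its L1-regularized quadratic problem. The names `sp_smlr`, `softmax`, `soft_threshold` and all default parameter values are hypothetical.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of a score matrix."""
    A = A - A.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def soft_threshold(A, t):
    """Entry-wise minimizer of t*|z| + 0.5*(z - a)^2."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def sp_smlr(blocks, n, k, lam=0.01, rho=1.0, outer=100, inner=20):
    """blocks: list of (X_i, Y_i), X_i of shape (m_i, n), one-hot Y_i of shape (k, m_i)."""
    N = len(blocks)
    m = sum(X_i.shape[0] for X_i, _ in blocks)
    W = [np.zeros((n, k)) for _ in range(N)]
    U = [np.zeros((n, k)) for _ in range(N)]
    Z = np.zeros((n, k))
    for _ in range(outer):
        # Formula (4): local updates, independent across blocks (inexact solve)
        for i, (X_i, Y_i) in enumerate(blocks):
            step = 1.0 / (np.linalg.norm(X_i, 2) ** 2 / m + rho)
            for _ in range(inner):
                grad = X_i.T @ (softmax(X_i @ W[i]) - Y_i.T) / m
                grad = grad + rho * (W[i] - Z + U[i])
                W[i] = W[i] - step * grad
        # Formula (5): aggregate the local variables and soft-threshold
        Z = soft_threshold(sum(W) / N + sum(U) / N, lam / (N * rho))
        # Formula (6): scaled dual updates
        for i in range(N):
            U[i] = U[i] + W[i] - Z
    return Z, W

# Usage on synthetic data (assumed, for illustration only)
rng = np.random.default_rng(0)
m, n, k, N = 120, 5, 3, 3
X = rng.normal(size=(m, n))
labels = np.argmax(X @ rng.normal(size=(n, k)), axis=1)
Y = np.eye(k)[labels].T                       # shape (k, m)
rows = np.array_split(np.arange(m), N)
blocks = [(X[idx], Y[:, idx]) for idx in rows]
Z, W = sp_smlr(blocks, n, k)
accuracy = np.mean(np.argmax(X @ Z, axis=1) == labels)
```

Only the loop over blocks touches the data, so that loop is the part that would be placed on separate calculation nodes; the Z update needs only the averages of W_i and U_i.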

We use the primal residual and the dual residual to determine the convergence of the SP-SMLR algorithm. The primal residual $r^{k + 1}$ and the dual residual $s^{k + 1}$ can be expressed as: $\begin{matrix} r^{k + 1} = (W_{1}^{k + 1} - {\bar{W}}^{k + 1}, \dots, W_{N}^{k + 1} - {\bar{W}}^{k + 1}) \\ s^{k + 1} = ρ ({\bar{W}}^{k + 1} - {\bar{W}}^{k}, \dots, {\bar{W}}^{k + 1} - {\bar{W}}^{k}) \end{matrix}$ (7)
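As a small illustration of formula (7), the norms of the two residuals can be computed as below. The function name `consensus_residuals` is hypothetical, and comparing the norms with tolerances to stop the iteration is a common convention assumed here, since the paper does not state its stopping rule.

```python
import numpy as np

def consensus_residuals(W_new, W_bar_old, rho):
    """W_new: list of the N local iterates W_i^{k+1}; W_bar_old: previous average."""
    N = len(W_new)
    W_bar_new = sum(W_new) / N
    # r stacks W_i - W_bar over all blocks; s repeats rho * (W_bar_new - W_bar_old) N times
    r = np.concatenate([(W_i - W_bar_new).ravel() for W_i in W_new])
    s = rho * np.tile((W_bar_new - W_bar_old).ravel(), N)
    return np.linalg.norm(r), np.linalg.norm(s)

# Two 2x2 local iterates whose average is 2 everywhere
W_new = [np.ones((2, 2)), 3.0 * np.ones((2, 2))]
r_norm, s_norm = consensus_residuals(W_new, np.zeros((2, 2)), rho=0.5)
```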

4 Shared optimization solution for sparse multiple logistic regression

One disadvantage of the original IRLS algorithm is that its complexity is O((nk)³), which makes it unsuitable for data sets with large-scale features or a large number of categories. To address the slow iteration of IRLS on large-scale feature data sets, this section proposes a distributed optimization algorithm, FP-SMLR, for data sets containing large-scale features. Its theoretical basis is variable sharing optimization. The core idea is that when the feature scale of the data set is large, the original high-dimensional features can be divided into multiple sub-data sets according to the feature dimension and solved in a distributed manner. That is, the data is divided in the way shown in Fig. 2.

Fig. 2

Data division according to feature dimensions.

The divided data set is represented as $D = \{ D_{1}, D_{2}, \dots, D_{N} \}$. Among them, $D_{i} = \{ X_{i}, Y \}$, $X_{i} \in R^{m \times n_{i}}$, $Y \in R^{k \times m}$ and $\sum_{i = 1}^{N} n_{i} = n$, where n_i is the number of features in the i-th data block, that is, the sub-data set divided according to features.

In this section, the regular term is split into a sum of multiple terms, so that the original optimization goal of the SMLR problem can be written in the form of formula (8) to perform distributed optimization: $\begin{matrix} \underset{W_{i}, Z_{i}}{\text{minimize}} \; l (\sum_{i = 1}^{N} Z_{i}) + λ \sum_{i = 1}^{N} {∥ W_{i} ∥}_{1} \\ \text{s.t.} \; X_{i} W_{i} - Z_{i} = 0, \; i = 1, 2, \dots, N \end{matrix}$ (8)

Among them, $l (\sum_{i = 1}^{N} Z_{i}) = - \frac{1}{m} Tr [Y^{T} p (\sum_{i = 1}^{N} Z_{i})]$ (9)

The variable $W^{T} = [W_{1}^{T}, W_{2}^{T}, \dots, W_{N}^{T}] \in R^{k \times n}$ is the parameter of sparse multiple logistic regression, where $W_{i} \in R^{n_{i} \times k}$, and $Z_{i} \in R^{m \times k}$ is the auxiliary variable of the i-th block. The objective function of the i-th part uses the data block of the i-th part to estimate part of the parameters.
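The reason the loss in the sharing form acts on a sum of per-block terms can be checked numerically: when X is split by columns and W by the matching rows, the block products X_i W_i add up to XW. The following NumPy sketch uses synthetic data, and the split via `np.array_split` is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, N = 8, 6, 3, 3
X = rng.normal(size=(m, n))
W = rng.normal(size=(n, k))

# Feature-wise split: columns of X and the matching rows of W
col_blocks = np.array_split(np.arange(n), N)
X_blocks = [X[:, idx] for idx in col_blocks]      # each of shape (m, n_i)
W_blocks = [W[idx, :] for idx in col_blocks]      # each of shape (n_i, k)

# Each node only needs its own X_i and W_i to form its share of the scores
partial_scores = [X_i @ W_i for X_i, W_i in zip(X_blocks, W_blocks)]
total_scores = sum(partial_scores)
```

At a feasible point of formula (8) each partial score equals Z_i, so the sum of the Z_i reproduces the full score matrix XW.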

The minimization problem shown in formula (8) is called the sharing problem of SMLR. For convenience, the sharing problem of SMLR is called the distributed optimization problem based on a feature division strategy, that is, the distributed SMLR problem based on feature division. This paper uses the ADMM algorithm to solve formula (8), and its augmented Lagrangian form can be expressed as: $\begin{matrix} L_{ρ} (W_{1}, \dots, W_{N}, Z, U) = l (\sum_{i = 1}^{N} Z_{i}) + λ \sum_{i = 1}^{N} {∥ W_{i} ∥}_{1} \\ + \frac{ρ}{2} \sum_{i = 1}^{N} {∥ X_{i} W_{i} - Z_{i} + U_{i} ∥}_{2}^{2} \end{matrix}$ (10)

At this time, the distributed SMLR problem based on feature division can be solved in an iterative manner. The iterative formulas of the variables are shown in formula (11) to formula (13):

$W_{i}^{k + 1} : = \underset{W_{i}}{arg min} (λ {∥ W_{i} ∥}_{1} + \frac{ρ}{2} {∥ X_{i} W_{i} - Z_{i}^{k} + U_{i}^{k} ∥}_{2}^{2})$ (11)

$Z^{k + 1} : = \underset{Z}{arg min} (l (\sum_{i = 1}^{N} Z_{i}) + \frac{ρ}{2} \sum_{i = 1}^{N} {∥ X_{i} W_{i}^{k + 1} - Z_{i} + U_{i}^{k} ∥}_{2}^{2})$ (12)

$U_{i}^{k + 1} : = U_{i}^{k} + X_{i} W_{i}^{k + 1} - Z_{i}^{k + 1}$ (13)

The update of the variables W_i consists of N parallel Lasso problems, each of which can be solved by any Lasso solving algorithm. However, the update of the variable Z involves solving for N variables. By introducing a new variable $\bar{Z}$, the N variables can be reduced to one. At this point, the problem of minimizing over the variable Z can be rewritten as: $\begin{matrix} \text{minimize} \; l (N \bar{Z}) + \frac{ρ}{2} \sum_{i = 1}^{N} {∥ X_{i} W_{i}^{k + 1} - Z_{i} + U_{i}^{k} ∥}_{2}^{2} \\ \text{s.t.} \; \bar{Z} - \frac{1}{N} \sum_{i = 1}^{N} Z_{i} = 0 \end{matrix}$ (14)

The Lagrangian multiplier method can be used to solve the minimization problem (14) to obtain the analytical solution of the variable Z_i, as shown in formula (15): $Z_{i} = U_{i}^{k} + X_{i} W_{i}^{k + 1} + \bar{Z} - {\bar{U}}^{k} - {\bar{XW}}^{k + 1}$ (15)
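Two properties of formula (15) can be verified numerically with random stand-in matrices (assumed data, for illustration only): the recovered Z_i average to $\bar{Z}$, as the constraint of (14) requires, and the quantity X_i W_i - Z_i + U_i is identical for every block, which is why a single dual variable U is enough once $\bar{Z}$ replaces the Z_i.

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, k = 4, 5, 3
XW = [rng.normal(size=(m, k)) for _ in range(N)]   # stand-ins for X_i W_i^{k+1}
U = [rng.normal(size=(m, k)) for _ in range(N)]    # stand-ins for U_i^k
Z_bar = rng.normal(size=(m, k))                    # stand-in for the averaged variable

XW_bar = sum(XW) / N
U_bar = sum(U) / N

# Formula (15): recover every Z_i from the averaged quantities
Z = [U[i] + XW[i] + Z_bar - U_bar - XW_bar for i in range(N)]

Z_mean = sum(Z) / N                                # should reproduce Z_bar
# X_i W_i - Z_i + U_i, i.e. what each dual variable becomes after its update
new_duals = [XW[i] - Z[i] + U[i] for i in range(N)]
```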

When the variable $\bar{Z}$ in formula (14) is used to replace Z_i, formula (11) to formula (13) become:

$W_{i}^{k + 1} : = \underset{W_{i}}{arg min} (\frac{ρ}{2} {∥ X_{i} W_{i} - X_{i} W_{i}^{k} - {\bar{Z}}^{k} + {\bar{XW}}^{k} + U^{k} ∥}_{2}^{2} + λ {∥ W_{i} ∥}_{1})$ (16)

${\bar{Z}}^{k + 1} : = \underset{\bar{Z}}{arg min} (l (N \bar{Z}) + \frac{N ρ}{2} {∥ \bar{Z} - {\bar{XW}}^{k + 1} - U^{k} ∥}_{2}^{2})$ (17)

$U^{k + 1} : = U^{k} + {\bar{XW}}^{k + 1} - {\bar{Z}}^{k + 1}$ (18)

Among them, ${\bar{XW}}^{k + 1}$ is the average value of $X_{i} W_{i}^{k + 1}$ over i = 1, 2, ⋯ , N. The above variable update formulas constitute the main algorithm framework of FP-SMLR.

After that, we use the primal residual and the dual residual to determine the convergence of the FP-SMLR algorithm. The primal residual $r^{k + 1}$ and the dual residual $s^{k + 1}$ can be expressed as: $\begin{matrix} r^{k + 1} = (X_{1} W_{1}^{k + 1} - Z_{1}^{k + 1}, \dots, X_{N} W_{N}^{k + 1} - Z_{N}^{k + 1}) \\ s^{k + 1} = ρ (W^{k + 1} - W^{k}, \dots, W^{k + 1} - W^{k}) \end{matrix}$ (19)

5 Convergence analysis of solution algorithm

Before proving convergence, we give two theorems about functions f (W) = l (XW) and g (Z) = λ ∥ Z ∥ ₁.

Theorem 1: The functions f (W) and g (Z) are both closed proper convex functions.

Proof: For g (Z) = λ ∥ Z ∥ ₁ with λ > 0, since a norm satisfies the triangle inequality, g (Z) is a closed proper convex function. For f (W), its epigraph can be expressed in the following form: $\text{epi} f = \{ (W, t_{W}) \in R^{n} \times R \mid f (W) ⩽ t_{W} \}$ (20)

Its domain is W ∈ Rⁿ, and on this domain the epigraph epi f of f (W) is a non-empty closed convex set. By the properties of the epigraph, f (W) is a closed proper convex function. Each iterative step of the ADMM algorithm solves for the optimal solution of a sub-problem. The optimal solutions $W^{k + 1}$ and $Z^{k + 1}$ of the sub-problems are both feasible, and the minimization problems for $W^{k + 1}$ and $Z^{k + 1}$ each have a solution (not necessarily unique). At the same time, by the definition of a convex function, when f (W) and g (Z) are closed proper convex functions, f (W) + g (Z) is also a closed proper convex function. The proof is complete.

Theorem 2: The standard Lagrangian function $L_{0} (W, Z, Y) = f (W) + g (Z) + Y^{T} (W - Z)$ (21)

has a saddle point, that is, there is a point (W^*, Z^*, Y^*), not necessarily unique, such that the following formula holds for all W, Z, and Y.

$L_{0} (W^{*}, Z^{*}, Y) ⩽ L_{0} (W^{*}, Z^{*}, Y^{*}) ⩽ L_{0} (W, Z, Y^{*})$ (22)

Proof: The original problem is $\min_{W, Z} \sup_{Y} L_{0} (W, Z, Y)$, denoted by P^l, and the dual problem is $\max_{Y} \inf_{W, Z} L_{0} (W, Z, Y)$, denoted by D^l. For L₀ (W, Z, Y), because f (W) + g (Z) is a closed proper convex function, the constraint function W - Z is affine, and the point (W^*, Z^*, Y^*) satisfies the Karush-Kuhn-Tucker (KKT) conditions, the following conclusions can be obtained from the strong and weak duality and the optimality conditions of the Lagrange multiplier method.

The optimal values of the original problem P^l and the dual problem D^l are equal, that is, val (P^l) = val (D^l). The dual gap between the original problem and the dual problem is zero, which satisfies the strong duality condition. Moreover, P^l and D^l have the same optimal solution. Among them, val (x) represents the optimal value of problem x.

For L₀ (W, Z, Y), any point (W^*, Z^*, Y^*) that satisfies the KKT conditions has $\begin{matrix} \inf_{W, Z} L_{0} (W, Z, Y^{*}) ⩽ L_{0} (W^{*}, Z^{*}, Y^{*}) \\ ⩽ \sup_{Y} L_{0} (W^{*}, Z^{*}, Y) \end{matrix}$ (23)

That is: $val (D^{l}) ⩽ L_{0} (W^{*}, Z^{*}, Y^{*}) ⩽ val (P^{l})$ (24)

When the dual gap between the original problem P^l and the dual problem D^l is 0, val (P^l) = val (D^l). At this point, we can obtain: $\begin{matrix} L_{0} (W^{*}, Z^{*}, Y^{*}) = inf_{W, Z} L_{0} (W, Z, Y^{*}) \\ ⩽ L_{0} (W, Z, Y^{*}), \forall W, Z \in R^{n} \end{matrix}$ (25)

In the same way, we can obtain: $\begin{matrix} L_{0} (W^{*}, Z^{*}, Y^{*}) = sup_{Y} L_{0} (W^{*}, Z^{*}, Y) \\ ⩾ L_{0} (W^{*}, Z^{*}, Y), \forall Y \in R^{n} \end{matrix}$ (26)

In summary, we can obtain:

$L_{0} (W^{*}, Z^{*}, Y) ⩽ L_{0} (W^{*}, Z^{*}, Y^{*}) ⩽ L_{0} (W, Z, Y^{*})$ (27)

That is, L₀ (W, Z, Y) has saddle point (W^*, Z^*, Y^*), which is not necessarily unique.

According to Theorem 1 and Theorem 2, ADMM iteration meets the following conditions:

Residual convergence. When k→ ∞, r^k → 0, that is, the iterates approach feasibility.

Target convergence. When k→ ∞, f (W^k) + g (Z^k) → f (W^*) + g (Z^*), and the iterative objective function approaches the optimal value.

The dual variables converge. When k→ ∞, Y^k → Y^*, and Y^* is a dual optimal point.

Convergence rate is an important concept, which reflects the convergence speed of iterative algorithms. Under the assumption of strong convexity of the function, ADMM can achieve global convergence of O (1/k), where k is the number of iterations. In the absence of such a strong convexity assumption, B. He et al. gave the most general results on the convergence rate of ADMM and proved that the objective function terms are only required to be convex (not necessarily smooth). Since f (W) and g (Z) are both convex in this paper, the ADMM solution of the SMLR algorithm can achieve O (1/k) convergence. In fact, the distributed algorithms SP-SMLR and FP-SMLR essentially decompose complex tasks, and they still follow the iterative steps of serial ADMM. Therefore, the SP-SMLR and FP-SMLR algorithms do not change the convergence of the ADMM algorithm, and they have the same convergence rate as the FSMLR algorithm.

6 Model construction and performance analysis

The training set and the test set are strictly separated to ensure the reliability and accuracy of the results. The processing model proposed in this paper is shown in Fig. 3.

Fig. 3

Feature mining data model.

The explained variable of the model is the series of excess returns of stock portfolios from July 1999 to June 2020. There are two ways to construct the explained variable, one is the 5×5 method, and the other is the 2×4×4 method.

The first step of the two-dimensional 5×5 stock grouping method is to use the annual financial data of listed companies in the A-share market from July 1997 to June 2018 to group stocks year by year to build an investment portfolio. At the end of June in year t, the two indicators Size and B/M are grouped independently. Moreover, according to the circulating market value Size at the end of year t-1, this method groups the stock pools from July in year t to June in year t + 1 to form 5 groups with market capitalization from small to large. After that, according to the book-to-market ratio B/M at the end of year t-1, this method groups the stock pools from July in year t to June in year t + 1 to form 5 groups with the book-to-market ratio from small to large. The second step is to group the stock portfolios according to the two indicators of Size and B/M to construct 25 stock portfolios. The third step is to calculate the monthly excess return weighted by the market value of the stock portfolio.
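The independent 5×5 sort and the value-weighted return of each intersection can be sketched in NumPy as below. The data are synthetic, and the helper names `quintile` and `sort_5x5`, as well as the use of each indicator's own 20/40/60/80% quantiles as breakpoints, are assumptions made for illustration; the paper does not specify its breakpoints beyond forming five groups.

```python
import numpy as np

def quintile(x):
    """Group index 0..4 (low to high) from the 20/40/60/80% breakpoints of x."""
    edges = np.quantile(x, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(edges, x, side="left")

def sort_5x5(size, bm, excess_ret):
    """Independent sorts on Size and B/M, then value-weighted excess return per cell."""
    s, b = quintile(size), quintile(bm)
    port = np.full((5, 5), np.nan)          # NaN marks an empty intersection
    for i in range(5):
        for j in range(5):
            cell = (s == i) & (b == j)
            if cell.any():
                port[i, j] = np.average(excess_ret[cell], weights=size[cell])
    return port, s, b

# Synthetic cross-section for one month (assumed data)
rng = np.random.default_rng(0)
n_stocks = 500
size = rng.lognormal(mean=3.0, sigma=1.0, size=n_stocks)   # circulating market value, year t-1
bm = rng.lognormal(mean=-0.5, sigma=0.5, size=n_stocks)    # book-to-market ratio, year t-1
excess_ret = rng.normal(0.01, 0.08, size=n_stocks)         # monthly excess return

port, s, b = sort_5x5(size, bm, excess_ret)
```

Rows of `port` run from small to large Size and columns from low to high B/M, which is the layout used in the portfolio tables of this section.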

The first step of the three-dimensional 2×4×4 stock grouping method is to use the annual financial data of listed companies in the A-share market from July 1997 to June 2018 to group the stocks year by year to build a portfolio. At the end of June of year t, the three indicators are grouped independently, and the stock pools from July of year t to June of year t + 1 are grouped into two groups from small to large according to the circulating market value Size at the end of year t-1. Moreover, according to the book-to-market value ratio B/M at the end of year t-1, the stock pools from July in year t to June in year t + 1 are grouped to form 4 groups with the book-to-market value ratio from small to large. In addition, according to the operating profit rate OP at the end of year t-1, the stock pools from July in year t to June in year t + 1 are grouped to form 4 groups with operating profit rates from small to large. In the second step, according to the intersection of the stock portfolios grouped by the three indicators of Size, B/M and OP, the stock sample can be divided into 2×4×4, a total of 32 investment portfolios.

The first step of the two-dimensional 2×2 stock grouping method is to use the annual financial data of listed companies in the A-share market from July 1997 to June 2018 to group stocks year by year to construct a portfolio. According to the median of the circulating market value Size at the end of year t-1, the stock pools from July in year t to June in year t + 1 are grouped into two groups of small market value (S1) and large market value (S2). Moreover, according to the median value of the book-to-market ratio B/M at the end of year t-1, the stock pools from July in year t to June in year t + 1 are grouped to form two groups of low book-to-market ratio (B1) and high book-to-market ratio (B2). In the second step, according to the intersection of the stock portfolios obtained by Size and B/M grouping, all stocks can be divided into four stock portfolios: S1B1, S1B2, S2B1, and S2B2. The third step is to use Size and OP, and Size and Inv, as the grouping standards and repeat the grouping steps to obtain the four combinations S1O1, S1O2, S2O1, S2O2 and the four combinations S1I1, S1I2, S2I1, S2I2. Among them, O1 means low profit, O2 means high profit, I1 means low growth rate of total assets, and I2 means high growth rate of total assets. The fourth step is to calculate the market-value-weighted average rate of return of each of the 12 portfolios. The fifth step is to construct the four factors SMB, HML, RMW, and CMA from the differences in the return rates of different stock portfolios.
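The fifth step does not spell out the formulas for the factor returns. The sketch below assumes the standard 2×2 definitions of the Fama-French five-factor model: each factor is the average return of the two portfolios on one side of the sort minus the average of the two on the other side, and SMB averages the three size spreads. The function name `two_by_two_factors` and the input numbers are hypothetical.

```python
import numpy as np

def two_by_two_factors(r):
    """r maps labels such as 'S1B1' to arrays of portfolio returns of equal length."""
    def avg(*keys):
        return sum(r[key] for key in keys) / len(keys)

    hml = avg("S1B2", "S2B2") - avg("S1B1", "S2B1")   # high minus low B/M
    rmw = avg("S1O2", "S2O2") - avg("S1O1", "S2O1")   # high minus low profitability
    cma = avg("S1I1", "S2I1") - avg("S1I2", "S2I2")   # low minus high asset growth
    smb_bm = avg("S1B1", "S1B2") - avg("S2B1", "S2B2")
    smb_op = avg("S1O1", "S1O2") - avg("S2O1", "S2O2")
    smb_inv = avg("S1I1", "S1I2") - avg("S2I1", "S2I2")
    smb = (smb_bm + smb_op + smb_inv) / 3.0            # small minus big
    return {"SMB": smb, "HML": hml, "RMW": rmw, "CMA": cma}

# Hypothetical one-month returns of the 12 portfolios
returns = {
    "S1B1": np.array([0.02]), "S1B2": np.array([0.04]),
    "S2B1": np.array([0.01]), "S2B2": np.array([0.03]),
    "S1O1": np.array([0.02]), "S1O2": np.array([0.04]),
    "S2O1": np.array([0.01]), "S2O2": np.array([0.03]),
    "S1I1": np.array([0.04]), "S1I2": np.array([0.02]),
    "S2I1": np.array([0.03]), "S2I2": np.array([0.01]),
}
factors = two_by_two_factors(returns)
```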

The first step of the four-dimensional 2×2×2×2 stock grouping method is to use the annual financial data of listed companies in the A-share market from July 1997 to June 2018 to group stocks year by year to build a portfolio. According to the median value of the market capitalization Size at the end of year t-1, the stock samples from July in year t to June in year t + 1 are divided into two groups: small market capitalization (S1) and large market capitalization (S2). Moreover, according to the median value of the book-to-market ratio B/M at the end of year t-1, the stock pools from July in year t to June in year t + 1 are grouped into two groups: low book-to-market ratio (B1) and high book-to-market ratio (B2). In addition, according to the median value of OP at the end of year t-1, the stock samples from July in year t to June in year t + 1 are grouped to form two groups of low operating margin (O1) and high operating margin (O2). According to the median value of the investment level Inv at the end of year t-1, the stock pools from July in year t to June in year t + 1 are grouped into two groups: low total asset growth rate (I1) and high total asset growth rate (I2). The second step is to control the four indicators Size, B/M, OP, and Inv jointly by taking the intersection of the four groupings. The third step is to calculate the circulating-market-value-weighted average rate of return of each of the 16 combinations. The fourth step is to construct the four factors SMB, HML, RMW, and CMA from the differences in the return rates of different stock portfolios.

In Tables 1–3, the rows represent the scale indicator, running from small to large from top to bottom, and the columns represent the other indicator, running from low to high from left to right.

Table 1

Scale-book to market value ratio portfolio

	low	2	3	4	high
small	0.034138	0.039188	0.029593	0.03131	0.026563
2	0.027674	0.02626	0.025452	0.024846	0.021008
3	0.019695	0.022624	0.020402	0.02121	0.019998
4	0.019089	0.01717	0.018786	0.016968	0.016968
Big	0.014241	0.012625	0.012726	0.011413	0.011211

Table 2

Scale-operating profit margin portfolio

	low	2	3	4	high
small	0.033734	0.028987	0.029896	0.029088	0.031411
2	0.026563	0.022119	0.023836	0.026765	0.022422
3	0.022422	0.020705	0.021109	0.020604	0.020503
4	0.019291	0.016665	0.017574	0.018988	0.018079
Big	0.010302	0.011716	0.012019	0.014645	0.012827

Table 3

Scale-investment level portfolio

	low	2	3	4	high
small	0.036057	0.031916	0.029492	0.027169	0.025452
2	0.028179	0.025048	0.024442	0.022119	0.021513
3	0.022927	0.017271	0.025149	0.019695	0.020604
4	0.018786	0.016665	0.017271	0.018584	0.018483
Big	0.013635	0.013837	0.012524	0.013029	0.012928

The corresponding statistical graphs are shown in Figs. 4–6.

Fig. 4

Return diagram of scale-book to market value ratio portfolio.

Fig. 5

Return diagram of scale-operating profit margin portfolio.

Fig. 6

Return diagram of scale-investment level portfolio.

It can be seen that the scale effect is obvious: the excess return rate of a stock portfolio is negatively correlated with its market value. Reading each table vertically while keeping the other indicator unchanged, the excess return of the portfolio decreases as the scale changes from small to large.
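As a quick check of the scale effect, the sketch below reads the Table 1 values column by column and tests whether the return falls at every step from small to big. The variable names are illustrative only, and the check covers Table 1 alone.

```python
# Table 1 (scale x book-to-market ratio). Rows run from small to big,
# columns from low to high book-to-market ratio.
table1 = [
    [0.034138, 0.039188, 0.029593, 0.03131, 0.026563],
    [0.027674, 0.02626, 0.025452, 0.024846, 0.021008],
    [0.019695, 0.022624, 0.020402, 0.02121, 0.019998],
    [0.019089, 0.01717, 0.018786, 0.016968, 0.016968],
    [0.014241, 0.012625, 0.012726, 0.011413, 0.011211],
]

# Transpose so that each tuple holds one column, ordered from small to big.
columns = list(zip(*table1))

# True for a column when every step down the scale ranking lowers the return.
decreasing = [all(a > b for a, b in zip(col, col[1:])) for col in columns]

# Return of the smallest group minus return of the biggest group, per column.
small_minus_big = [round(col[0] - col[-1], 6) for col in columns]

print(decreasing)
print(small_minus_big)
```

On these figures every column of Table 1 decreases strictly from small to big, and the small-minus-big spread is positive in all five columns.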

While keeping the scale index unchanged, the excess return rate of the portfolios in the second and fifth rows decreases as the book-to-market ratio changes from low to high, whereas the excess return rate of the portfolios in the third row increases. The other rows do not show obvious regularity. Other literature reports a value effect in the Chinese A-share market, in which the excess return of portfolios of stocks with a high book-to-market ratio is higher than that of portfolios of stocks with a low book-to-market ratio.

When keeping the scale index unchanged, the excess return rate of the portfolios in the fifth row increases as the operating profit margin changes from low to high, and the other rows do not show obvious regularity. The profit effect is therefore not pronounced: only the excess return rate of the large-scale stock portfolios increases with profitability.

When keeping the scale index unchanged, the excess return rate of the portfolios in the first and second rows decreases as the investment level changes from low to high. In general, the excess return rate of stock portfolios with low investment levels is higher than that of stock portfolios with high investment levels.
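The row-wise comparisons above can be summarized by the spread between the "high" and "low" columns of Tables 1–3. The sketch below uses only those two columns of each table, so it compares the endpoints of each row and says nothing about whether the pattern in between is monotonic. The dictionary keys are illustrative labels.

```python
# (low, high) column values from Tables 1-3, one pair per scale row,
# ordered from small (row 1) to big (row 5).
tables = {
    "B/M": [(0.034138, 0.026563), (0.027674, 0.021008), (0.019695, 0.019998),
            (0.019089, 0.016968), (0.014241, 0.011211)],
    "OP": [(0.033734, 0.031411), (0.026563, 0.022422), (0.022422, 0.020503),
           (0.019291, 0.018079), (0.010302, 0.012827)],
    "Inv": [(0.036057, 0.025452), (0.028179, 0.021513), (0.022927, 0.020604),
            (0.018786, 0.018483), (0.013635, 0.012928)],
}

# High-minus-low spread for every scale row of every table.
spreads = {
    name: [round(high - low, 6) for low, high in rows]
    for name, rows in tables.items()
}

# Row numbers (1 = small, 5 = big) in which the high group earns more.
positive_rows = {
    name: [i + 1 for i, s in enumerate(values) if s > 0]
    for name, values in spreads.items()
}

print(spreads)
print(positive_rows)
```

On these figures the high-minus-low spread is positive only in the third row for the book-to-market ratio and only in the fifth row for the operating profit margin, and it is negative in every row for the investment level.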

To further verify the effectiveness of the model, this paper converts the acquired monthly rate of return data into an annualized rate of return and a volatility of the rate of return, which give the return and risk of each of the 20 funds in the sample, as shown in Table 4 and Fig. 7.
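The paper does not state how the monthly data are annualized. One common convention, assumed here purely for illustration, compounds the monthly returns geometrically and scales the monthly sample standard deviation by the square root of 12. The function name `annualize` and the input series are hypothetical.

```python
from math import prod, sqrt
from statistics import stdev

def annualize(monthly_returns):
    """Return (annualized return, annualized volatility) for a monthly series.

    Assumed convention, not taken from the paper:
    - return: geometric compounding, rescaled to a 12-month horizon;
    - volatility: sample standard deviation of monthly returns times sqrt(12).
    At least two monthly observations are required.
    """
    n = len(monthly_returns)
    growth = prod(1.0 + r for r in monthly_returns)
    annual_return = growth ** (12.0 / n) - 1.0
    annual_vol = stdev(monthly_returns) * sqrt(12.0)
    return annual_return, annual_vol

# Hypothetical series: a fund that alternates between +2% and -1% per month.
monthly = [0.02, -0.01] * 12
ret, vol = annualize(monthly)
print(round(ret, 4), round(vol, 4))
```

Other conventions, such as multiplying the mean monthly return by 12, give somewhat different figures, so the values in Table 4 may not be reproducible with this exact formula.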

Table 4

Rate of return and standard deviation of the sample funds

Serial number	Rate of return	Standard deviation	Serial number	Rate of return	Standard deviation
1	0.26	0.23	12	0.15	0.30
2	0.05	0.26	13	0.13	0.28
3	0.18	0.32	14	0.12	0.31
4	0.15	0.40	15	0.30	0.34
5	0.42	0.37	16	0.25	0.36
6	–0.14	0.38	17	0.15	0.19
7	0.39	0.29	18	0.22	0.17
8	0.27	0.33	19	–0.13	0.29
9	0.37	0.25	20	0.15	0.38
10	0.34	0.37	Market benchmark	0.18	0.13
11	0.15	0.41

Fig. 7

Statistical diagram of rate of return and standard deviation of the sample funds.

Table 4 shows that, among the 20 sample funds, 10 earn a return at or above the market benchmark return of 0.18 (fund 3 equals it at the reported precision). Their risk is in every case higher than that of the market benchmark portfolio, which is broadly consistent with a positive relationship between return and risk.
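The comparison with the market benchmark can be reproduced directly from the Table 4 figures. The sketch below only counts and compares the published two-decimal values; the variable names are illustrative.

```python
# Table 4: fund number -> (rate of return, standard deviation).
funds = {
    1: (0.26, 0.23), 2: (0.05, 0.26), 3: (0.18, 0.32), 4: (0.15, 0.40),
    5: (0.42, 0.37), 6: (-0.14, 0.38), 7: (0.39, 0.29), 8: (0.27, 0.33),
    9: (0.37, 0.25), 10: (0.34, 0.37), 11: (0.15, 0.41), 12: (0.15, 0.30),
    13: (0.13, 0.28), 14: (0.12, 0.31), 15: (0.30, 0.34), 16: (0.25, 0.36),
    17: (0.15, 0.19), 18: (0.22, 0.17), 19: (-0.13, 0.29), 20: (0.15, 0.38),
}
bench_return, bench_std = 0.18, 0.13  # "Market benchmark" row of Table 4

# Funds strictly above the benchmark return, and funds exactly equal to it.
above = [k for k, (r, _) in funds.items() if r > bench_return]
equal = [k for k, (r, _) in funds.items() if r == bench_return]

# Whether every sample fund has a larger standard deviation than the benchmark.
all_riskier = all(s > bench_std for _, s in funds.values())

print(above, equal, all_riskier)
```

At the two-decimal precision of Table 4, nine funds exceed the benchmark return and one matches it, and every sample fund has a standard deviation above the benchmark value of 0.13.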

7 Conclusion

Among the five factors, taking market risk brings positive returns to investment funds, and investing in securities of small-scale companies brings higher returns than investing in securities of large-scale companies. Similarly, investing in securities of high-profitability companies achieves higher returns than investing in securities of low-profitability companies, which is consistent with the interpretation of the five-factor model by Fama and French. However, the book-to-market ratio factor and the investment level factor show the opposite effects. Contrary to the theoretical expectation that securities of companies with a high book-to-market ratio and a low investment level achieve higher returns, investing in securities of companies with a low book-to-market ratio and in securities of growth companies in fact achieves higher returns.

Securities investment funds can beat the market portfolio and obtain excess returns. The investment ability of a fund manager has a certain impact on the return of the fund, but it is not the main explanatory factor; the explanatory power of the market risk premium is stronger. However, fund managers can still adjust the fund's investment portfolio to obtain higher returns through stock selection and judgments about market investment style.

The data simulation analysis shows that the model constructed in this paper is effective to a certain extent in analyzing the factors affecting stock returns.

References

1. Gao, Meng and Zhao, Income and social communication: The demographics of stock market participation, The World Economy 42(7) (2019), 2244–2277.

2. Kleven H.J. and Schultz E.A., Estimating taxable income responses using Danish tax reforms, American Economic Journal: Economic Policy 6(4) (2014), 271–301.

3. Hoffmann, The consumption–income ratio, entrepreneurial risk, and the US stock market, Journal of Money Credit and Banking 46(6) (2014), 1259–1292.

4. Carvalho and Di Guilmi, Technological unemployment and income inequality: a stock-flow consistent agent-based approach, Journal of Evolutionary Economics 30(1) (2020), 39–73.

5. Gantino, Effect of Managerial Ownership Structure, Financial Risk and Its Value on Income Smoothing in the Automotive Industry and Food & Beverage Industry Listed in Indonesia Stock Exchange, Research Journal of Finance and Accounting 6(4) (2015), 48–56.

6. Nguyen, Duong H.N. and Singh, Stock market liquidity and firm value: an empirical examination of the Australian market, International Review of Finance 16(4) (2016), 639–646.

7. Khan H.H., Naz, Qureshi, et al., Heuristics and stock buying decision: Evidence from Malaysian and Pakistani stock markets, Borsa Istanbul Review 17(2) (2017), 97–110.

8. Information sharing and stock market participation: Evidence from extended families, Review of Economics and Statistics 96(1) (2014), 151–160.

9. Mohammadi M.Y. and Arman M.H., The survey of accounting variables effect on income smoothing in stock exchange companies, Journal of Fundamental and Applied Sciences 8(2) (2016), 1257–1271.

10. Saez and Zucman, Wealth inequality in the United States since 1913: Evidence from capitalized income tax data, The Quarterly Journal of Economics 131(2) (2016), 519–578.

11. Baker S.R., Debt and the response to household income shocks: Validation and application of linked financial account data, Journal of Political Economy 126(4) (2018), 1504–1557.

12. Alquraan, Alqisie and Al Shorafa, Do behavioral finance factors influence stock investment decisions of individual investors? (Evidences from Saudi Stock Market), Journal of American Science 12(9) (2016), 72–82.

13. Voelzke, Individual labour income, stock prices and whom it may concern, Applied Economics Letters 23(13) (2016), 965–968.

14. Bengtsson and Waldenström, Capital shares and income inequality: Evidence from the long run, The Journal of Economic History 78(3) (2018), 712–743.

15. Apergis, Simo-Kengne and Gupta, The long-run relationship between consumption, house prices, and stock prices in South Africa: evidence from provincial-level data, Journal of Real Estate Literature 22(1) (2014), 83–99.

16. Nazar S.N., Ekowati and Setiyawan, Does Income Smoothing Improve Informativeness Of Stock Prices? Jurnal Ilmiah Econosains 15(2) (2017), 225–239.

17. Adam Cobb, How firms shape income inequality: Stakeholder power, executive decision making, and the structuring of employment relationships, Academy of Management Review 41(2) (2016), 324–348.

18. Caselli and Ciccone, The human capital stock: a generalized approach: comment, American Economic Review 109(3) (2019), 1155–1174.

19. Foerster, Tsagarelis and Wang, Are cash flows better stock return predictors than profits? Financial Analysts Journal 73(1) (2017), 73–99.

20. Bricker, Henriques, Krimmel, et al., Measuring income and wealth at the top using administrative and survey data, Brookings Papers on Economic Activity 2016(1) (2016), 261–331.

21. Rosenthal S.S., Are private markets and filtering a viable source of low-income housing? Estimates from a "repeat income" model, American Economic Review 104(2) (2014), 687–706.

22. Ballings, Van den Poel, Hespeels, et al., Evaluating multiple classifiers for stock price direction prediction, Expert Systems with Applications 42(20) (2015), 7046–7056.