Research on prediction and recommendation of financial stocks based on K-means clustering algorithm optimization

Abstract

The prediction and recommendation of financial stocks are of great value to investors. This study analyzed the application of the K-means clustering algorithm to stock forecasting and recommendation. First, the K-means algorithm was briefly introduced and its advantages and disadvantages were analyzed. Then, the K-means algorithm was optimized with the artificial fish swarm algorithm (AFSA) to obtain KAFSA. The stocks of 100 listed companies were taken as the research subjects and clustered by the KAFSA designed in this study. The results were verified through closing price, price earning ratio, earnings per share and return on net assets. There were obvious differences between the A and B stocks divided by KAFSA, and the values of B stocks on these indicators were significantly higher than those of A stocks. This shows that the 100 stocks were well divided into high-performance and poor-performance stocks through clustering, which provides a useful reference for investors and is worthy of further application.

Keywords

K-means clustering algorithm, data mining, artificial fish swarm algorithm, stock forecast

1. Introduction

Stock forecasting is an important part of the financial stock market [15] and a very challenging task [14]. Financial stock prediction refers to predicting the future trend of stocks by analyzing historical stock data [5]. Successful prediction can help investors make considerable profits [9]. With the development of the stock market, stock forecasting has become increasingly complex [17], and the accumulated stock information makes data processing and analysis very difficult. Data mining technology, which can observe, process and analyze large amounts of complex financial data, offers an effective way to address this problem [19], and stock forecasting based on it has been widely studied. Sadaei and Enayatifar [10] fuzzified historical data into difference fuzzy sets using the fuzzy time series method, established difference fuzzy logic groups, made stock predictions through defuzzification, improved the accuracy of the algorithm with the imperialist competitive algorithm, and verified experimentally that the method performed better in stock prediction. Wang et al. [13] combined support vector regression (SVR), principal component analysis (PCA) and brain storm optimization (BSO) for stock forecasting, took the Shanghai-Shenzhen stock exchange and the Shenzhen stock exchange as examples, and found that the designed forecasting model was an effective tool with simple calculation and high accuracy. Das et al. [6] clustered the data of well-known stock markets using the K-means clustering algorithm, determined the classification rate with the grey wolf optimizer, predicted the stock market with a nonlinear autoregressive exogenous neural network, and verified its prediction performance experimentally. Berradi and Lazaar [18] combined PCA with a recurrent neural network (RNN) to predict 29 days of data from the Casablanca stock exchange. They found that the prediction error of the method was only 0.00596, which indicated good performance. Chou and Nguyen [12] predicted the stock prices of Taiwan construction companies using sliding-window metaheuristic optimization, which offers guidance for the decision-making and trading of investors. Chen et al. [11] analyzed the application of the nonlinear SVR method in stock forecasting and optimized it by grid search (GRID), particle swarm optimization (PSO) and the genetic algorithm (GA). They found that the minimum root mean square error (RMSE) of the GA-SVR model was 15.630, which showed that the model could provide investors with a good technical reference. Kaur et al. [8] predicted the stock market of the Bombay Stock Exchange using the adaptive network fuzzy inference system (ANFIS), trained the system by combining the least squares method with the back-propagation gradient method, and verified the effectiveness of the method in stock market prediction.

The K-means algorithm performs well in data classification. To predict and recommend stocks accurately and improve the benefit of investors, this study selected the K-means algorithm, realized the prediction through stock clustering, and optimized the algorithm with the artificial fish swarm algorithm (AFSA) to address its shortcomings. Taking 100 stocks as an example, the prediction performance of the optimized clustering algorithm was verified to assess its feasibility in stock prediction. The proposed method classified the stocks accurately, which can provide a reference for investors and help them make correct investment choices and gain good benefits. The experimental results showed that the optimization of the clustering algorithm was effective and significantly improved its performance; moreover, the clustering algorithm performed well in stock prediction and could be promoted and applied in practice. On the one hand, the clustering algorithm can help investors obtain better benefits; on the other hand, it helps them follow changes in the stock market and respond in time.

2. Financial stock forecast

Financial stocks have existed for more than 200 years. The increasingly prosperous financial stock market is an important driving force for economic development and an important component of the market economy, and it has attracted the attention of more and more investors. However, the stock market fluctuates greatly, and a little carelessness can cause huge losses. Stock prediction is therefore very important for investors: it can help them follow stock changes, avoid risks effectively, and obtain greater benefits. With the growth of the stock market, forecasting financial stocks has become more difficult. The emergence of data mining technology provides a new approach to stock forecasting. Methods such as neural networks [1], clustering algorithms [4], multiple regression [2] and sentiment analysis [7] have been applied well in stock prediction. Some classification algorithms can identify stocks with high returns, while methods such as sentiment analysis are mostly used for targeted recommendation to users. However, because the labels corresponding to financial data are not known in advance, supervised algorithms such as neural networks and multiple regression are less effective in stock forecasting than unsupervised algorithms such as clustering. Analyzing and processing a large amount of historical stock information with data mining technology, in order to find the law of stock changes and invest in the best stocks, has become the choice of many investors.

3. Optimization of K-means clustering algorithm with artificial fish swarm algorithm

3.1 K-means clustering algorithm

Clustering is a method of data classification in which data are divided into several categories according to some feature, so that data in the same category have the maximum similarity and data in different categories have the minimum similarity [16]. The K-means clustering algorithm is a very widely used clustering algorithm. Given a preset number of clusters, it partitions the objects so that the similarity of objects within the same cluster is as high as possible and the similarity of objects in different clusters is as low as possible. The algorithm is simple and efficient, and it has been extensively applied in fields such as data mining, pattern recognition and image analysis. In stock prediction, it computes quickly and can obtain accurate clustering results, but it is sensitive to initialization and easily falls into local extrema. The process of the K-means algorithm is shown in Fig. 1.

Figure 1.

The flow of K-means clustering algorithm.

The algorithm steps are as follows.

(1) A data set $X=\{x_{m}\}_{m=1}^{N}$ containing $N$ objects is given, and $k$ objects are randomly selected as the initial cluster centers.

(2) The distance from the $m$ -th object ( $x_{m}$ ) to the $i$ -th cluster center ( $c_{i}$ ) is calculated:

$\displaystyle D\left(x_{m},c_{i}\right)=\sqrt{\left({x_{m}-c_{i}}\right)^{2}}.$ (1)

(3) The minimum distance $D_{\min}(x_{m},c_{i})$ from the $m$ -th object ( $x_{m}$ ) to the cluster centers is found, and each object is assigned to the class of its nearest center:

$\displaystyle C_{i}=\left\{x_{m}:D\left(x_{m},c_{i}\right)<D\left(x_{m},c_{j}\right),\ 1\leqslant j\leqslant k,\ j\neq i\right\}.$ (2)

(4) The mean of the objects in each class is calculated, and the cluster center is updated:

$\displaystyle c_{j}=\frac{1}{n_{j}}\left[\sum\limits_{\forall x_{m}\in Z_{j}}x_{m}\right],$ (3)

where $n_{j}$ indicates the number of objects in class $j$ and $Z_{j}$ refers to the subset of all objects belonging to class $j$ .

(5) Steps (2)–(4) are repeated until the objective function converges.
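Steps (1)–(5) can be illustrated with a minimal sketch in Python (standard library only). It is not the implementation used in this study: the function names, the fixed random seed, the stopping rule (assignments no longer change) and the toy points are all assumptions made for illustration.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (Eq. 1)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means following steps (1)-(5); returns (centers, labels)."""
    rng = random.Random(seed)
    # Step (1): pick k distinct objects as the initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Steps (2)-(3): assign every object to its nearest center.
        new_labels = [
            min(range(k), key=lambda i: euclidean(p, centers[i]))
            for p in points
        ]
        # Step (5): stop once the assignment no longer changes (assumed rule).
        if new_labels == labels:
            break
        labels = new_labels
        # Step (4): move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical toy data: two tight, well-separated groups.
toy = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (5.0, 5.0), (5.1, 5.1), (5.2, 5.2)]
centers, labels = kmeans(toy, k=2)
print(labels)   # the first three points share one label, the last three the other
print(centers)
```

On such clearly separated toy data any choice of initial centers leads to the same partition; on real data the result can depend on the initial centers.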

K-means clustering algorithm generally judges the clustering effect using the error square sum function:

$\displaystyle J=\sum\limits_{j=1}^{k}\sum\limits_{i=1}^{n_{j}}\left\|x_{i}^{j}-c_{j}\right\|^{2},$ (4)

where $k$ represents the number of clusters, $n_{j}$ represents the capacity (number of objects) of cluster $j$ , $x_{i}^{j}$ represents the $i$ -th object in cluster $j$ , $c_{j}$ represents the clustering center and $\|x_{i}^{j}-c_{j}\|^{2}$ represents the squared distance from object $x_{i}^{j}$ to cluster center $c_{j}$ .
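As a small worked example of Eq. (4), the following Python sketch computes the error square sum for two candidate partitions of four hypothetical points. The function name `sse` and the data are illustrative assumptions.

```python
def sse(points, centers, labels):
    """Error square sum J of Eq. (4): squared distance of each object to its own center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centers[lab]))
        for p, lab in zip(points, labels)
    )


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]

# Partition 1: left pair vs. right pair; every object is 1 unit from its center.
j_good = sse(points, [(0.0, 1.0), (4.0, 1.0)], [0, 0, 1, 1])

# Partition 2: bottom pair vs. top pair; every object is 2 units from its center.
j_bad = sse(points, [(2.0, 0.0), (2.0, 2.0)], [0, 1, 0, 1])

print(j_good, j_bad)  # 4.0 16.0
```

The partition with the smaller $J$ groups the objects more tightly, which is why $J$ serves as the criterion for judging the clustering effect.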

However, the K-means clustering algorithm also has some shortcomings. Firstly, the number of clusters $k$ has to be chosen manually, usually by experience, and different $k$ values may result in different clustering results. Secondly, different distance calculation methods can result in different clustering results. Thirdly, the objective function easily falls into a local extremum.

3.2 AFSA

To make up for the shortcomings of the K-means clustering algorithm, the artificial fish swarm algorithm was introduced for optimization. AFSA is a bionic algorithm that is not sensitive to initialization and can escape local extrema to obtain the global optimal solution, i.e., it can effectively compensate for the shortcomings of the K-means clustering algorithm. It has strong robustness, high convergence speed and good flexibility. The algorithm steps were as follows.

(1) Initialization of parameters

Parameters such as the total number $N$ of artificial fish, the field of view Visual, the step size Step, the congestion degree factor $\sigma$ $(0<\sigma<1)$ and the number of attempts Try_number were initialized.

(2) Updating of the artificial fish state ( $X_{i}$ as initial state)

The first one was foraging behavior, i.e., the movement of fish towards food. $X_{j}$ represents a random state of the artificial fish within its field of view and Rand() represents a random number between 0 and 1, $X_{j}=X_{i}+\textit{Rand}()\times\textit{Visual}$ ; then,

$\displaystyle X_{\textit{next}}=X_{i}+\frac{X_{j}-X_{i}}{\left\|X_{j}-X_{i}\right\|}\cdot\textit{Step}\cdot\textit{Rand}()$ (5)

Next was cluster behavior, i.e., artificial fish moving closer to the center of the school. $n_{f}$ represents the number of artificial fish in the field of view and $X_{\textit{center}}$ represents the center of these fish. If the food concentration condition $Y_{\textit{center}}>Y_{i}$ was met, then,

$\displaystyle X_{\textit{next}}=X_{i}+\frac{X_{\textit{center}}-X_{i}}{\left\|X_{\textit{center}}-X_{i}\right\|}\cdot\textit{Step}\cdot\textit{Rand}()$ (6)

The third one was rear-end behavior, i.e., fish swimming towards a nearby fish located where it is less crowded and there is more food. The artificial fish with the highest food concentration within the field of view was $X_{\max}$ . If $Y_{\max}>Y_{i}$ and $\frac{Y_{\max}}{n_{f}}<\sigma\times Y_{i}$ were satisfied, then

$\displaystyle X_{\textit{next}}=X_{i}+\frac{X_{\max}-X_{i}}{\left\|X_{\max}-X_{i}\right\|}\cdot\textit{Step}\cdot\textit{Rand}()$ (7)

The last one was random behavior, i.e., the artificial fish moved freely to forage or search for partners to ensure optimal efficiency,

$\displaystyle X_{\textit{next}}=X_{i}+\textit{Rand}()\times\textit{Visual}$ (8)

(3) Evaluation behavior: the fitness function (food concentration) $Y=f(X)$ of each artificial fish was calculated. Step (2) was repeated until the termination condition was met.
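The behaviors in step (2) and the evaluation in step (3) can be sketched in Python as below. This is a simplified illustration under several assumptions, not the procedure used in this study: the random displacement uses a signed factor in $[-1,1]$ so that a fish can move in any direction, the crowding test with $\sigma$ is omitted, each fish takes the better of its cluster and rear-end moves, the search range and Try_number $=5$ are arbitrary, and the food concentration is a toy function with a known maximum.

```python
import math
import random

rng = random.Random(1)


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def move_towards(xi, target, step):
    """Eqs. (5)-(7): move from xi towards target by Step * Rand()."""
    d = dist(xi, target)
    if d == 0.0:
        return list(xi)
    r = rng.random()
    return [x + (t - x) / d * step * r for x, t in zip(xi, target)]


def random_move(xi, visual):
    """Eq. (8), with a signed random factor (assumption) so any direction is possible."""
    return [x + rng.uniform(-1.0, 1.0) * visual for x in xi]


def forage(xi, f, visual, step, try_number):
    """Foraging: sample states in view; head for the first one with more food."""
    for _ in range(try_number):
        xj = random_move(xi, visual)
        if f(xj) > f(xi):
            return move_towards(xi, xj, step)
    return random_move(xi, visual)  # random behavior when nothing better is found


def neighbours(xi, fish, visual):
    return [x for x in fish if x is not xi and dist(x, xi) < visual]


def cluster(xi, fish, f, visual, step, try_number):
    """Cluster behavior: move to the centre of visible fish if Y_center > Y_i."""
    nbrs = neighbours(xi, fish, visual)
    if nbrs:
        center = [sum(col) / len(nbrs) for col in zip(*nbrs)]
        if f(center) > f(xi):
            return move_towards(xi, center, step)
    return forage(xi, f, visual, step, try_number)


def rear_end(xi, fish, f, visual, step, try_number):
    """Rear-end behavior: move towards the best visible fish if Y_max > Y_i."""
    nbrs = neighbours(xi, fish, visual)
    if nbrs:
        xmax = max(nbrs, key=f)
        if f(xmax) > f(xi):
            return move_towards(xi, xmax, step)
    return forage(xi, f, visual, step, try_number)


def afsa(f, dim, n_fish=20, visual=1.5, step=0.5, try_number=5, iters=100):
    """Maximize f and return the best state seen during the search."""
    fish = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_fish)]
    best = list(max(fish, key=f))
    for _ in range(iters):
        for idx in range(n_fish):
            xi = fish[idx]
            candidates = [
                cluster(xi, fish, f, visual, step, try_number),
                rear_end(xi, fish, f, visual, step, try_number),
            ]
            fish[idx] = max(candidates, key=f)
            if f(fish[idx]) > f(best):
                best = list(fish[idx])
    return best


def food(x):
    """Toy food concentration with its maximum, 0, at (1, 1)."""
    return -sum((v - 1.0) ** 2 for v in x)


best = afsa(food, dim=2)
print(best, food(best))
```

Keeping the best state seen so far separate from the moving fish means that a later random move cannot lose a good solution.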

3.3 K-means algorithm for artificial fish swarm optimization

The quality of a set of clustering centers in the K-means algorithm was measured by the fitness function of the artificial fish swarm algorithm:

$\displaystyle\textit{fitness}=\frac{1}{J}=\frac{1}{\sum\limits_{i=1}^{k}\sum\limits_{x_{j}\in C_{i}}\left\|c_{i}-x_{j}\right\|^{2}}$ (9)

where $J$ represents the error square sum function of the clustering algorithm and $C_{i}$ is the set of objects assigned to cluster center $c_{i}$ . The larger the fitness, the smaller the error square sum; when the fitness function reached its maximum, the minimum clustering error square sum was obtained and the optimal clustering effect was achieved.
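A minimal Python sketch of the fitness in Eq. (9) follows. It assumes that each artificial fish encodes one candidate set of $k$ cluster centers and that every object is counted with its nearest center; the function name and the data are hypothetical.

```python
def fitness(centers, points):
    """Eq. (9): reciprocal of the error square sum J for one candidate set of centers."""
    j = sum(
        min(sum((x - c) ** 2 for x, c in zip(p, ctr)) for ctr in centers)
        for p in points
    )
    return 1.0 / j if j > 0 else float("inf")


points = [(0.0, 0.0), (0.0, 2.0), (4.0, 0.0), (4.0, 2.0)]
fish_a = [(0.0, 1.0), (4.0, 1.0)]  # candidate centers encoded by one fish
fish_b = [(2.0, 0.0), (2.0, 2.0)]  # candidate centers encoded by another fish

print(fitness(fish_a, points))  # J = 4  -> 0.25
print(fitness(fish_b, points))  # J = 16 -> 0.0625
```

The fish with the higher fitness corresponds to the smaller error square sum, so maximizing the fitness is equivalent to minimizing $J$.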

The specific flow of KAFSA, i.e., the K-means algorithm combined with the artificial fish swarm algorithm, is shown in Fig. 2.

Figure 2.

Flow chart of KAFSA.

4. Application of KAFSA in stock prediction recommendation

One hundred A-share listed financial companies were selected randomly. Six indicators, namely closing price, price earning ratio, earnings per share, return on net assets, provident fund per share and net assets per share, were obtained from the comprehensive stock data for 2017 in the RESSET database. These six indicators reflect abilities of a company such as profitability, debt-paying ability and growth ability, and embody its financial condition. The KAFSA algorithm was used for classification. The data obtained are shown in Table 1.

Table 1
Data of the 100 listed companies

Code of listed company	Stock name	Closing price (yuan)	Price earning ratio (%)	Earnings per share (yuan/share)	Return on net assets (%)	Per share provident fund (yuan/share)	Net assets per share (yuan/share)
C000001	Ping an bank	13.30	9.9143	1.12	8.7813	3.92	11.54
C000166	Shenwan Hongyuan	5.37	20.9357	0.18	6.5754	0.37	2.72
C000416	Minsheng Holdings	7.12	112.4803	0.05	3.0672	0.13	1.65
C000563	Shaanxi Guotou A	4.25	22.8127	0.13	5.2361	0.97	2.54
C000567	Hyde shares	24	62.8931	0.37	20.8057	0.78	1.79
C000617	CNPC capital	15	12.6129	0.55	7.1006	4.79	7.79
C000627	Tianmao Group	8	22.2754	0.20	6.0072	1.43	3.28
C000686	Northeast Securities	8.77	25.779	0.27	4.0587	2.80	6.75
C000712	Jinlong shares	16.98	52.7329	0.22	5.2347	1	4.17
C000728	Guoyuan Securities	11	27.5842	0.30	4.2623	2.98	7.09
C000750	Guohai Securities	4.90	34.5557	0.13	4.0276	1.63	3.31
C000776	GF Securities	16.68	15.4788	0.84	7.6161	6.19	10.97
C000783	Changjiang Securities	7.87	22.3326	0.27	5.6023	2.16	4.80
C000987	Yuexiu Jinkong	9.85	34.2648	0.22	3.7592	4.20	5.73
C002142	Bank of Ningbo	17.81	10.2199	1.45	13.6630	2.51	9.69
……	……	……	……	……	……	……	……
C601601	China Pacific Insurance	41.42	26.519	1.21	8.0486	11.39	14.98
C601628	China Life Insurance	30.45	26.5429	0.95	8.3101	5.54	11.14
C601688	Huatai Securities	17.26	20.0441	0.66	5.4785	8.98	12
C601788	Everbright Securities	13.43	21.3785	0.49	4.6411	6.71	10.61
C601818	Everbright bank	4.05	6.0358	0.54	9.2937	1.29	5.22
C601878	Zhejiang Merchants Securities	16.62	51.191	0.24	6.1771	1.29	3.94
C601881	Bank of China	10.51	23.6144	0.33	5.2733	4.65	6.33
C601901	Founder Securities	6.89	45.964	0.16	3.5781	2.12	4.52
C601939	China Construction Bank	7.68	8.0385	0.80	11.8928	32.31	6.76
C601988	Bank of China	3.97	6.6667	0.49	9.8233	1.27	4.69
C601997	Guiyang bank	13.36	7.4115	1.39	13.5329	2.78	10.25
C601998	CITIC bank	6.20	7.254	0.71	8.7735	2.52	7.38
C603323	Wujiang bank	8.23	15.2557	0.43	7.4567	2.66	5.73

In the preprocessing, duplicate data were deleted and missing data were filled with the mean value. The data were then standardized by maximum-minimum normalization, $x^{\prime}=\frac{x-\min}{\max-\min}$ , which maps the values of the sample data to $\left[{0,1}\right]$ ; here $x$ refers to the original data, $\max$ and $\min$ refer to the maximum and minimum values among the sample data, and $x^{\prime}$ refers to the data after standardization.
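The preprocessing can be sketched in Python as follows. The first, second, fourth and fifth closing prices are taken from Table 1; the `None` entry is a hypothetical missing value inserted only to show the mean filling, and the function names are assumptions.

```python
def fill_missing_with_mean(column):
    """Replace None entries by the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def min_max(column):
    """Map a column to [0, 1] via x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


closing = [13.30, 5.37, None, 4.25, 24.0]  # None is a made-up gap
scaled = min_max(fill_missing_with_mean(closing))
print(scaled)  # smallest price maps to 0.0, largest to 1.0
```

Normalizing each indicator separately keeps an indicator with a large numeric range, such as the price earning ratio, from dominating the distance in Eq. (1).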

Cluster analysis was carried out in MATLAB. The clustering performance of the traditional K-means algorithm and the balanced iterative reducing and clustering using hierarchies (BIRCH) algorithm was compared with that of the KAFSA proposed in this study. Parameters were set as follows: $n=$ 20, $\textit{Visual}=$ 1.5, $\textit{Step}=$ 0.5, $\sigma=$ 0.2. The number of iterations was 100, and the experiment was repeated 20 times. The comparison of the clustering performance of the three algorithms is shown in Table 2. In Table 2, the silhouette coefficient is an evaluation index of clustering; the closer its value is to 1, the better the clustering performance. For a sample, let $a$ be the average distance between the sample and the other samples in the same cluster and $b$ be the average distance between the sample and the samples in other clusters; the silhouette coefficient of the sample is then $\left({b-a}\right)/\max\left({a,b}\right)$ .
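The silhouette coefficient can be illustrated with the short Python sketch below. It takes $b$ as the average distance to the nearest other cluster, which coincides with the definition in the text when there are only two clusters; scoring singleton clusters as 0 is an assumed convention, and the one-dimensional points are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    def dist(p, q):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        if not same:  # singleton cluster: score 0 (assumed convention)
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        b = min(
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
s_good = silhouette(pts, [0, 0, 1, 1])  # compact, well-separated clusters
s_bad = silhouette(pts, [0, 1, 0, 1])   # clusters that interleave
print(round(s_good, 4), round(s_bad, 4))  # 0.8997 -0.45
```

A value near 1 means each sample is much closer to its own cluster than to the other one, while negative values indicate samples that would fit the other cluster better.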

Table 2

Comparison of the clustering performance

	Traditional K-means algorithm	BIRCH algorithm	KAFSA
Convergence time/s	0.098	0.127	0.326
Standard deviation	0.036	0.021	0
Silhouette coefficient	0.630	0.780	0.990

Table 2 shows that, although the traditional K-means algorithm and the BIRCH algorithm converged faster, their standard deviation and silhouette coefficient were not as good as those of KAFSA; the standard deviation and silhouette coefficient of the clustering result obtained by KAFSA were 0 and 0.99 respectively, indicating that the algorithm had a good clustering effect.

The 100 listed companies were divided into two categories according to the KAFSA, and details are shown in Table 3.

Table 3

Stock clustering results

Category	Stock name
A (poor-performance stock)	The Pacific securities, Minsheng Jinke, Panda Gold Holdings, Minsheng Holdings, Green Court Investment, Xinli Finance, First Capital, Zhongyuan Securities, Southwest Securities, Huaxin Stock, Shaanxi Guotou A, Guohai Securities, Shanxi Securities, Baoshuo Stock, Kazakhstan shares……
B (high-performance stock)	Ping An Insurance, China Merchants Bank, Industrial Bank, West Water stock, Xinhua Insurance, Bank of Shanghai, Bank of Ningbo, Shanghai Pudong Development Bank, Guiyang Bank, China Pacific Insurance, Ping An Bank, Huaxia Bank……

To verify whether the algorithm was effective for stock prediction, the two types of stocks were compared in terms of closing price, price earning ratio, earnings per share and return on net assets. The results are shown in Figs 3 and 4.

Figure 3.

The comparison of closing price and price earning ratio between A and B stocks.

Figure 4.

Comparison of earnings per share and return on net assets between A and B stocks.

Figure 3 shows the comparison of closing price and price earning ratio between the two types of stocks, and Figure 4 shows the comparison of earnings per share and return on net assets. There were obvious differences between the two types of stocks. Stock indicators are the main reference for investors when investing. The comparison charts show that A stocks generally performed poorly on the four indicators: their closing price and price earning ratio were low, their earnings per share were between 0 yuan/share and 0.5 yuan/share, and their return on net assets was basically lower than 5%, so they could be classified as poor-performance stocks. B stocks had significantly higher closing price and price earning ratio than A stocks, as well as high earnings per share and return on net assets, so they were classified as high-performance stocks. These results showed that the algorithm proposed in this study was correct in stock prediction and could accurately recommend high-performance stocks to investors.

5. Discussion and conclusion

The importance of financial stocks has been demonstrated over time. More and more people are joining the stock market and choosing stocks for investment. With the development of the financial industry, the amount of financial data keeps growing. To process and analyze these data effectively, data mining technology has been introduced and applied well. Through data mining technology, financial stocks can be analyzed and predicted on the basis of historical data to uncover the changing rules and trends of the stock market, so that investors can invest scientifically and reasonably and obtain greater benefits [3].

In this study, the K-means clustering algorithm was selected to realize stock prediction. To reduce the sensitivity of the clustering algorithm to initialization and to keep it from falling into local optima, the artificial fish swarm algorithm was introduced to optimize the algorithm. KAFSA was obtained and applied to the prediction and recommendation of financial stocks. The 100 stocks were divided into categories A and B, and the comparison of closing price, price earning ratio, earnings per share and return on net assets showed that the clustering results were accurate. The closing price, earnings per share and return on net assets reflect the operating conditions of listed companies well. The comparison charts showed that the closing price, price earning ratio, earnings per share and return on net assets of A stocks were lower than those of B stocks. This indicated that the listed companies classified as A had mediocre operating conditions and poor profitability and were not recommended to investors. The listed companies classified as B had good profitability and development prospects; they were worth recommending to investors for long-term investment and could bring them better returns.

In summary, the optimized K-means clustering algorithm performs well in financial stock prediction and recommendation. It can accurately classify stocks through historical data analysis, distinguish high-performing stocks from low-performing stocks, and make scientific recommendations to investors, which makes it worthy of widespread application. In future work, experiments will be carried out on larger data samples, and the K-means algorithm will be further optimized to improve its stock prediction ability.

References

1. Gupta, Chaudhary D.K. and Choudhury, Stock prediction using functional link artificial neural network (FLANN), International Conference on Computational Intelligence and Networks. IEEE (2018), 10–16.

2. Izzah, Sari Y.A., Widyastuti and Cinderatama T.A., Mobile app for stock prediction using improved multiple linear regression, International Conference on Sustainable Information Engineering and Technology. IEEE (2018), 150–154.

3. Sharma, Bhuriya and Singh, Survey of stock market prediction using machine learning approach, Electronics, Communication and Aerospace Technology. IEEE (2017), 506–509.

4. Bini B.S. and Mathew, Clustering and regression techniques for stock prediction, Procedia Technology 24 (2016), 1248–1255.

5. Tsai C.F. and Quan Z.Y., Stock prediction by searching for similarities in candlestick charts, ACM Transactions on Management Information Systems 5(2) (2014), 1–21.

6. Das, Safa Sadiq, Mirjalili et al., Hybrid clustering-GWO-NARX neural network technique in predicting stock price, Journal of Physics Conference Series, (2017), 012018.

7. Wang, Zhang T.Q., Rao K.S. and Zhang, Exploring mutual information-based sentimental analysis with kernel-based extreme learning machine for stock prediction, Soft Computing – A Fusion of Foundations, Methodologies and Applications 21(12) (2017), 3193–3205.

8. Kaur, Dhar and Guha R.K., A hybrid approach to forecast stock market index, International Journal of Artificial Intelligence and Soft Computing 5(2) (2015), 165–176.

9. Huang, Zhang, Deng and Chen, Predicting stock trend using Fourier transform and support vector regression, International Conference on Computational Science and Engineering. IEEE, (2015), 213–216.

10. Sadaei H.J., Enayatifar, Lee M.H. and Mahmud, A hybrid model based on differential fuzzy logic relationships and imperialist competitive algorithm for stock market forecasting, Applied Soft Computing 40(C) (2016), 132–149.

11. Chen J.C., Chen H.Z. and Huo Y.J., Application of SVR models in stock index forecast based on different parameter search methods, Open Journal of Statistics 7(2) (2017), 194–202.

12. Chou J.S. and Nguyen T.K., Forward forecast of stock price using sliding-window metaheuristic-optimized machine learning regression, IEEE Transactions on Industrial Informatics, (2018), 1.

13. Wang J.Z., Hou, Wang and Shen, Improved v-support vector regression model based on variable selection and brain storm optimization for stock price forecasting, Applied Soft Computing 49 (2016), 164–178.

14. Dudhwala N.D., Jadhav, Gabda and Kishor, Prediction of stock market using data mining and artificial intelligence, International Journal of Computer Applications 134(12) (2016), 9–11.

15. Kumar and Bala, Intelligent stock data prediction using predictive data mining techniques, International Conference on Inventive Computation Technologies. IEEE, (2017), 1–5.

16. Kapil, Chawla and Ansari M.D., On K-means data clustering algorithm with genetic algorithm, International Conference on Parallel. IEEE, (2017), 202–206.

17. Xia, Liu and Chen, Support vector regression for prediction of stock trend, International Conference on Information Management, Innovation Management and Industrial Engineering. IEEE, (2014), 123–126.

18. Berradi and Lazaar, Integration of principal component analysis and recurrent neural network to forecast the stock price of Casablanca stock exchange, Procedia Computer Science 148 (2019), 55–61.

19. Jiang and Chen, BDI based stock prediction, Online Analysis and Computing Science. IEEE, (2016), 119–122.
