Clustering analysis in the evaluation of securities investment funds

Abstract

Clustering analysis, one of the key components of data mining, has been widely applied. This paper aimed to apply a clustering algorithm to classify and evaluate securities investment funds. It established a fund evaluation index system by examining the indexes that influence the performance of funds. It drew upon mature domestic and foreign fund evaluation theory and used the data mining function of Excel to establish a clustering analysis model. Finally, this paper used 40 equity funds as sample data to conduct an empirical study. The clustering results are beneficial for evaluating funds’ performance and for guiding rational investment decisions.

Keywords

Clustering analysis, evaluation, securities investment funds

1 Introduction

The rapid economic development of our country has prompted an increasing number of people to pay attention to the securities market. Securities investment funds are favored by investors because of their professional management, low risk, and high yield. However, making the right decisions to obtain higher returns in the securities investment fund market remains a problem for investors.

The process of judging the performance of securities investment funds has long interested both investment institutions and researchers. Several researchers have applied traditional methods to evaluate the performance of investment funds. The classical performance measures of Treynor [1], Sharpe [2], and Jensen [3] are central to any kind of performance evaluation, and there have been many attempts to improve upon these measures. A generalized functional form of the CAPM (Capital Asset Pricing Model) was proposed for evaluating the performance of international closed-end country funds; it examined the effect of heterogeneous investment horizons on portfolio choices in the global market [4]. The PROMETHEE II (Preference Ranking Organization Method for Enrichment Evaluations) method was used to develop outranking models for mutual funds, ranking the funds from best to worst according to their performance [5]. These traditional methods focus on single or individual indexes and therefore occasionally fail to reflect the funds’ performance comprehensively.

In recent years, data mining has been widely used in the financial field, and many academic results have been reported [6–8]. Clustering analysis has been applied to sector indices [9], stock price data [10], and trend research [11]. The K-centroids method was used to group stocks [12]. A new stock analysis method based on clustering was presented to recognize some rules governing the essential trends of stock markets [13]. The hybrid GTC (Gene Trajectory Clustering) was applied to learn the structure of the stock market and to infer interesting relationships from closing price data; the authors concluded that hybrid GTC can successfully identify homogeneous and stable stock clusters [14].

However, most studies have focused on the stock market and have rarely evaluated securities investment funds through data mining. The fund market has its own characteristics, although it is closely related to the stock market. This paper focused on applying a clustering algorithm to classify the funds and on analyzing the clustering results to evaluate the funds’ performance more fully. The findings are beneficial for guiding rational investment decisions.

2 Clustering algorithm

2.1 Introduction of clustering algorithm

Clustering is the process of assigning a set of objects into classes of similar objects. The objects within a cluster are similar to one another and dissimilar to the objects in other clusters. The major clustering methods typically fall into four categories [15]:

Partitioning Method

The partitioning method classifies a given data set of n objects into k groups that satisfy two requirements: each object must belong to exactly one group, and each group must contain at least one object.

Hierarchical Method

In the hierarchical method, a hierarchical decomposition is created for the given objects. Depending on how the decomposition is formed, the method is either bottom-up (agglomerative) or top-down (divisive).

Density-Based Method

This method continues growing a given cluster as long as the density (number of data points) in the “neighborhood” exceeds some threshold.

Grid-Based Method

The clustering operations of this method are performed on the grid structure which is formed by quantizing the object space into a finite number of cells.

2.2 Selection of the clustering algorithm

Data mining has become a research hot spot in recent years [16, 17]. Clustering analysis, as a data mining function, has been an important research topic, and several studies in the literature have analyzed clustering algorithms [18, 19]. The cluster assignment of each data point is the key step in clustering analysis. A new cluster labeling method for support vector clustering (SVC) was developed based on some invariant topological properties of a trained kernel radius function [20]. A novel, parameter-less, and efficient clustering algorithm, namely, the Correlation Search Technique (CST), was proposed, which incorporates validation techniques into the clustering process. The experimental results showed that CST greatly outperformed other clustering methods in terms of clustering automation, efficiency, and quality [21]. A novel separability-correlation measure (SCM) was applied to rank the importance of attributes during the classification process [22]. A new clustering algorithm based on a new heuristic called Chameleon Army was presented, implemented, and tested on well-known datasets. Compared with the K-means, PSO, and PSO-kmeans algorithms, the proposed algorithm gave better clusters [23].

The most fundamental method of cluster analysis is partitioning, and the k-means algorithm is the most commonly used and well-known partitioning method. The algorithm’s computational complexity is O(nkt), where n refers to the total number of objects, k refers to the number of clusters, and t refers to the number of iterations. Typically, k << n and t << n, so the method is relatively efficient and scalable in processing large data sets. The k-means clustering algorithm is commonly used in partitioning stock price time series data [24]. A trading rules system based on SOM and k-means was built for financial markets at intraday trading frequencies, and the results showed that k-means clustering applied after training the SOM can classify successfully [25]. Based on previous studies, this research selected the k-means algorithm to build a clustering analysis model for evaluating the funds’ performance.
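As a rough illustration of the O(nkt) bound, take the sample size and cluster count used later in this paper (n = 40 funds, k = 3 clusters) and assume, purely for illustration, that the algorithm stops after t = 10 iterations. Each iteration compares every object with every cluster center, so the number of distance evaluations is approximately

$n \cdot k \cdot t = 40 \times 3 \times 10 = 1200$

Doubling the number of funds while keeping k and t fixed doubles this count, which is the sense in which the cost grows linearly with n.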

2.3 K-means algorithm

The k-means algorithm partitions a set of n objects into k clusters. The resulting clusters should have high intracluster similarity and low intercluster similarity. Cluster similarity is measured with respect to the mean value of all of the objects in a cluster.

The value of k is usually supplied as an input to the clustering process. First, k objects are randomly selected as the initial cluster centers. Second, the distance between each remaining object and each cluster center is calculated, and each object is placed into the most similar (nearest) cluster. Third, the mean value of each cluster is recomputed. The second and third steps are repeated until the stop criterion is met. The flow chart of the k-means algorithm is presented in Fig. 1.
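The steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by the Excel data mining module; the function name `k_means`, the `max_iter` and `seed` parameters, and the stop criterion (assignments no longer change) are assumptions made for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct objects at random as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every object to the nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Stop criterion (an assumption): the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical two-dimensional points forming two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(data, k=2)
print(labels)
print(centers)
```

The cluster numbering depends on the random initial centers, so only the grouping itself is meaningful, not the label values.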

The square-error criterion is typically defined as $E = \sum_{i = 1}^{k} \sum_{p \in c_{i}} |p - m_{i}|^{2}$, where E is the sum of the square error for all objects, p represents a point in space belonging to cluster c_i, and m_i is the mean value of cluster c_i. Similarity is measured by the distance between objects. The Euclidean distance, which is a well-known method for measuring distance, is selected. It is defined as $D(X, Y) = \sqrt{\sum_{i = 1}^{s} (X_{i} - Y_{i})^{2}}$, where X and Y are two objects and i = 1, 2, …, s indexes their s attributes.
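To make the two definitions concrete, the following sketch evaluates the Euclidean distance and the square-error criterion on a tiny data set. The function names `euclidean` and `square_error` and the sample points are hypothetical and chosen only so that the arithmetic can be checked by hand.

```python
import math


def euclidean(x, y):
    """D(X, Y): square root of the summed squared coordinate differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def square_error(clusters):
    """E: squared distance of every point to the mean of its own cluster."""
    total = 0.0
    for members in clusters:
        # m_i: component-wise mean of the cluster's members
        mean = [sum(col) / len(members) for col in zip(*members)]
        total += sum(euclidean(p, mean) ** 2 for p in members)
    return total


# Hypothetical two-dimensional data, not taken from the fund sample.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
print(square_error(clusters))             # 4.0
```

Each point in the example lies exactly one unit from its cluster mean, so the four squared distances sum to 4.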

3 Establishing a fund evaluation index system

The index system (see Fig. 2) must be established prior to performing a clustering analysis. These indexes are used as the inputs of the clustering analysis model, and each index must reflect some characteristic of the research object. The clustering results largely depend on the inputs of the analysis model: the more accurate and comprehensive the index selection, the more reasonable the clustering result.

3.1 Index selection

This research primarily adopted a qualitative analysis method to select the indexes, drawing upon domestic and foreign fund evaluation theory and the rating standard of a professional fund website. It followed the principles of objectivity, practicability, and quantifiability. It ultimately selected five indexes as the inputs of the clustering analysis model, namely, fund accumulated net value (as of December 16, 2014), total rate of return (year to date), standard deviation (three years), Sharpe ratio (three years), and beta coefficient. These indexes reflect the return and risk of the funds, both of which significantly affect the funds’ performance.

3.2 Explanations of the indexes

Fund accumulated net value is the sum of the unit net value and the cumulative dividends paid since the establishment of the fund. It takes the fund’s operating time into account and can reflect the fund’s real performance more accurately.

Total rate of return, which includes income return and capital return, reflects the fund’s historical performance over a certain period.

Standard deviation measures the amount of dispersion from the average return. A low standard deviation indicates that the returns tend to be close to the expected value; the higher the standard deviation, the greater the volatility.

Sharpe ratio characterizes how well the return of an asset compensates the investor for the risk taken. It measures the excess return per unit of total risk. A higher Sharpe ratio indicates a better return for the same risk.

Beta coefficient measures the volatility of the fund return with respect to the market benchmark return. The market portfolio of all investable assets has a beta of exactly one. A fund with a higher beta coefficient is more volatile relative to the benchmark market.
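The three risk-related indexes can be computed from a series of periodic returns, as sketched below. The paper takes these values from a fund website, so the code only illustrates the textbook definitions; the function names, the constant per-period risk-free rate, the absence of annualization, and the return series are all assumptions.

```python
import statistics


def sharpe_ratio(returns, risk_free):
    """Mean excess return divided by the standard deviation of returns."""
    excess = statistics.mean(returns) - risk_free
    return excess / statistics.stdev(returns)


def beta(fund_returns, market_returns):
    """Covariance of fund and market returns over the market's variance."""
    mf = statistics.mean(fund_returns)
    mm = statistics.mean(market_returns)
    cov = sum((f - mf) * (m - mm) for f, m in zip(fund_returns, market_returns))
    var = sum((m - mm) ** 2 for m in market_returns)
    return cov / var  # the (n - 1) factors of covariance and variance cancel


# Hypothetical periodic returns, not taken from the paper's sample.
market = [0.01, -0.02, 0.03, 0.00, 0.02]
fund = [2 * r for r in market]  # a fund that moves twice as much as the market

print(statistics.stdev(fund))               # sample standard deviation
print(sharpe_ratio(fund, risk_free=0.001))
print(beta(fund, market))                   # close to 2
print(beta(market, market))                 # close to 1, as for the benchmark
```

In this example the fund amplifies every market move by a factor of two, so its beta is about two and its standard deviation is twice that of the market.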

4 Empirical research

Simulations can be conducted once the fund evaluation index system has been established.

4.1 Empirical research platform

The Excel data mining module was selected as the simulation platform. Microsoft Office 2013, SQL Server 2012, and the SQL Server 2012 Data Mining Add-ins for Office must be installed before Excel can be used for data mining. Excel must also be connected to Microsoft SQL Server Analysis Services before the data mining functions can be used. This service provides multiple algorithms that can be used in data mining solutions, such as the expectation maximization method and the k-means algorithm. The algorithm can be selected by setting the CLUSTERING_METHOD parameter. This study selected method 3, that is, the k-means algorithm.

4.2 Sample of data

This study used open-ended fund data from the CHINAFUND net (as of December 16, 2014) and selected roughly 10% of the 465 equity funds for investigation. The data of the 465 equity funds were first listed and then arranged in random order using the “RANDBETWEEN (A, B)” function in Excel. The first 40 equity funds in the shuffled list were selected as the sample data of this study (see Table 1).
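The sampling procedure (attach a random value to every fund, sort by it, keep the first 40) can be reproduced outside Excel. The sketch below is an approximation under stated assumptions: the function name `sample_funds`, the `seed` parameter, and the fund identifiers are hypothetical, and a continuous random key is used in place of RANDBETWEEN, whose integer results can repeat.

```python
import random


def sample_funds(fund_ids, sample_size, seed=None):
    """Attach a random key to each fund, sort by key, keep the first rows."""
    rng = random.Random(seed)
    # A float key makes ties practically impossible, unlike an integer range.
    keyed = [(rng.random(), fund_id) for fund_id in fund_ids]
    keyed.sort()
    return [fund_id for _, fund_id in keyed[:sample_size]]


# Hypothetical identifiers standing in for the 465 equity funds.
funds = [f"FUND{i:03d}" for i in range(1, 466)]
sample = sample_funds(funds, 40, seed=2014)
print(len(sample), len(set(sample)))  # 40 40
```

In Python the same result can be obtained more directly with `random.sample(funds, 40)`; the key-and-sort form is shown only because it mirrors the spreadsheet workflow.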

4.3 Building the clustering analysis model

The simulation was performed using the data mining function in Excel. First, the “cluster analysis” module was selected after establishing a data mining structure. Second, the k-means algorithm was selected and the algorithm parameters were set. Finally, the established data mining structure was processed.

4.4 Cluster results analysis

The data mining structure was run, and the results are shown in Fig. 3.

The randomly selected 40 equity funds were divided into three clusters. Clusters 1, 2, and 3 consisted of 14, 13, and 13 equity funds, respectively. As shown in Fig. 3, the first column contains five indexes, and the second column shows all of the sample data’s maximum, minimum, and average values in each index. The rest of the columns successively show the overall distribution of sample data in each index and the distribution of each cluster in each index. The characteristics of funds in each cluster can be further analyzed and summarized by comparing the index data in each cluster. This approach is of significant practical value in guiding the decision making on rational investment. Table 2 presents the mean value of each index in three clusters.
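A per-cluster summary of the kind reported in Table 2 can be computed by averaging each index over the funds assigned to each cluster. The following sketch assumes the index values and cluster labels are already available as Python lists; the function name `cluster_means` and the numbers are hypothetical and are not the figures from Table 2.

```python
from collections import defaultdict


def cluster_means(rows, labels):
    """Average each index (column) over the funds assigned to each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(members) for col in zip(*members)]
        for label, members in sorted(grouped.items())
    }


# Hypothetical values for two indexes per fund; NOT the data behind Table 2.
rows = [(1.0, 0.5), (3.0, 1.5), (2.0, 1.0), (4.0, 2.0)]
labels = [1, 1, 2, 2]
print(cluster_means(rows, labels))  # {1: [2.0, 1.0], 2: [3.0, 1.5]}
```

Comparing these per-cluster means index by index is what supports statements such as "cluster 2 has the highest accumulated net value".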

As shown in Table 2, the funds in cluster 1 have the lowest accumulated net value and a slightly higher total rate of return and standard deviation. This result indicates that the funds’ return this year is slightly better but fluctuates considerably. Their Sharpe ratio is the lowest and their beta is the highest, which shows that their excess return per unit of total risk is relatively low and that their risk relative to the market benchmark is high. Thus, the funds in cluster 1 (see Table 3) are not recommended because of their comparatively poor performance.

The accumulated net value is highest in cluster 2. The year-to-date rate of return is slightly low and the standard deviation is the lowest, which suggests that the fund return has been stable and less volatile since establishment. Moreover, the Sharpe ratio is the highest and the beta is the lowest, indicating that this type of fund fluctuates less with market changes and has performed well over a long period. A closer study shows that this type of fund (see Table 4) aims to pursue long-term and stable accumulated value, and its investment targets prioritize the stocks of good-quality, highly visible large blue-chip companies. Hence, holding such funds in the long term is suggested.

Most indexes of the funds in cluster 3 lie between those of the other two clusters, indicating that the funds perform reasonably well and do not fluctuate excessively with the market. However, the standard deviation is the highest, which indicates that the funds’ return rate fluctuates considerably and shows no clear trend. A careful analysis of the funds in cluster 3 (see Table 5) suggests that their major investment strategy is to pursue undervalued stocks. Thus, holding these funds in the short term is suggested, and the right time to invest must be determined.

4.5 Comparative analysis

Morningstar, Inc. is an authoritative fund rating agency. It provides mutual fund market analysis and fund ratings for investors as a reference. According to its risk-adjusted return index, Morningstar rates mutual funds from 1 to 5 stars in comparison to similar funds. Funds with less than three years of history are not rated. As shown in Fig. 4 [26], the top 10% of funds receive 5 stars, the next 22.5% receive 4 stars, the middle 35% receive 3 stars, the next 22.5% receive 2 stars, and the bottom 10% receive 1 star.
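The star allocation described above amounts to a lookup on a fund's percentile rank within its peer group. The sketch below encodes only the 10 / 22.5 / 35 / 22.5 / 10 split stated in the text; the function name `star_rating`, the convention that percentile 0 is the best fund, and the treatment of boundary values are assumptions, and the code does not reproduce Morningstar's risk-adjusted return calculation.

```python
def star_rating(percentile):
    """Map a percentile rank (0 = best, 100 = worst) to 1-5 stars."""
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must be between 0 and 100")
    # Cumulative cut-offs of the 10 / 22.5 / 35 / 22.5 / 10 split.
    for upper, stars in ((10, 5), (32.5, 4), (67.5, 3), (90, 2), (100, 1)):
        if percentile <= upper:
            return stars


print([star_rating(p) for p in (5, 20, 50, 80, 95)])  # [5, 4, 3, 2, 1]
```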

In Section 4.4, 40 equity funds were assigned to three clusters, and each cluster has its own characteristics. A comparison of the results of cluster analysis with Morningstar’s rating for each fund (see Fig. 5) shows that funds in cluster 2 have relatively higher ratings, funds in cluster 1 have relatively lower ratings, and funds in cluster 3 have medium ratings. However, four funds in cluster 3 are rated with one star, which is relatively low. Overall, the cluster analysis results are broadly consistent with the ratings from Morningstar.

5 Summary and prospect

This study performed a clustering analysis to evaluate securities investment funds. A simulation for the equity funds sample was conducted using the data mining module in Excel. The study selected the k-means algorithm to build a clustering analysis model and assessed the clustering results. Satisfactory results were obtained because each cluster had its own characteristics and the funds in the same cluster demonstrated similar performances. This research has application value and is, to some extent, beneficial for guiding rational investment decisions.

However, because the selection of fund evaluation indexes was not studied in sufficient depth, some errors may emerge in the results. For example, based on the analysis of cluster results, cluster 2 performs excellently from an overall perspective. However, its average year-to-date return rate is the lowest among the three clusters, which has a certain effect on the cluster results. Furthermore, some subjective factors have been disregarded, such as the asset management company and the fund manager’s ability, and the treatment of fund risk is coarse. Hence, the establishment of a fund evaluation index system must be carefully studied. Although the k-means algorithm is widely used as one of the classical algorithms in clustering analysis, the algorithm must be adapted to the specific problem, and this aspect merits further study.

In further research, the author hopes to continue to focus on the evaluation of equity funds, examine the indexes that influence funds’ performance using other research methods, improve the clustering algorithm to a certain degree, obtain more rational cluster results, and provide investors with guidance on proper investment decision making.

Acknowledgments

This paper is funded by Beijing Natural Science Foundation (4132024) and Funding Project for Academic Human Resources Development in Institutions of Higher Learning under the Jurisdiction of Beijing Municipality (PHR2011 06133).

References

1. Treynor, How to Rate Management of Investment Funds, Harvard Business Review, 1965, pp. 63–75.

2. Sharpe W.F., Mutual fund performance, Journal of Business, part II (1966), 119–138.

3. Jensen M.C., The performance of mutual funds in the period 1945–1964, Journal of Finance 23 (1968), 389–416.

4. Lee C.-F., Patro D.K. and Liu, Functional Forms for Performance Evaluation: Evidence from Closed-End Country Funds, Springer, 2010, pp. 1523–1553.

5. Pendaraki and Zopounidis, Evaluation of equity mutual funds’ performance using a multi-criteria methodology, Operational Research 3(1) (2003), 69–90.

6. Wong F.S., Wang P.Z. and Teh H.H., A stock selection strategy using fuzzy neural networks, Computer Science in Economics and Management 4(2) (1991), 77–89.

7. Kohara, Fukuhara and Nakamura, Selective presentation learning for neural network forecasting of stock markets, Neural Computing & Applications 4(3) (1996), 143–148.

8. Huang C.-J., Chen P.-W. and Pan W.-T., Using multi-stage data mining technique to build forecast model for Taiwan stocks, Neural Computing and Applications 21(8) (2012), 2057–2063.

9. Boillat P.-Y., de Skowronski and Tuchschmid, Cluster analysis: Application to sector indices and empirical validation, Financial Markets and Portfolio Management 16(4) (2002), 467–486.

10. Dragut A.B., Stock data clustering and multiscale trend detection, Methodology and Computing in Applied Probability 14(1) (2012), 87–105.

11. […] C.M., Chou S.C. and Liaw H.T., A trend based investment decision approach using clustering and heuristic algorithm, Science China Information Sciences 57(9) (2014), 1–14.

12. Czekala and Kuziak, Clustering of Stocks in Risk Context, Dresden, Proceedings of the 22nd Annual GFKl Conference, 1998, pp. 447–452.

13. Shi and Shi, Clustering Based Stocks Recognition, Third International Conference, FSKD 2006, Xi’an, China, vol. 4223, 2006, pp. 1121–1129.

14. Moldovan and Silaghi G.C., Gene Trajectory Clustering for Learning the Stock Market Sector, 9th International Conference, ICANNGA 2009, Kuopio, Finland, vol. 5495, 2009, pp. 559–568.

15. Han, Kamber and Pei, Data mining concepts and techniques, Beijing: China Machine Press, 2012.

16. Zhou, Li and Zhou, et al., Adaptive processing for distributed skyline queries over uncertain data, IEEE Transactions on Knowledge and Data Engineering, 2015, pp. 1–1.

17. […], Sheng V.S., Tay K.Y., et al., Incremental support vector learning for ordinal regression, IEEE Transactions on Neural Networks & Learning Systems 26 (2015), 1403–1416.

18. Tsaparas, Mannila and Gionis, Clustering aggregation, ACM Transactions on Knowledge Discovery from Data 1(1) (2007), 341–352.

19. Zhao, Zhang and Kong, Image segmentation by generalized hierarchical fuzzy C-means algorithm, Journal of Intelligent & Fuzzy Systems 28(2) (2015), 4024–4028.

20. Lee and Lee, An improved cluster labeling method for support vector clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005), 461–464.

21. Tseng V.S. and Kao C.-P., Efficiently mining gene expression data via a novel parameterless clustering method, IEEE/ACM Transactions on Computational Biology and Bioinformatics 2 (2005), 355–365.

22. […] X.J. and Wang L.P., Data dimensionality reduction with application to simplifying RBF network structure and improving classification performance, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 33(3) (2003), 399–409.

23. Kamel and Boucheta, A New Clustering Algorithm Based on Chameleon Army Strategy, Proceedings of the 10th International Conference on Computing and Information Technology (IC2IT2014), vol. 265, 2014, pp. 23–32.

24. […], Chen, Jin and Chen S.-H., Trading strategies based on K-means clustering and regression models, Computational Intelligence in Economics and Finance 2 (2007), 123–134.

25. Huang, Study on Financial Data Hybrid Clustering Based on Stock Trading Rule, Proceedings of the 2012 International Conference on Cybernetics and Informatics, Kuopio, Finland, vol. 163, 2014, pp. 2351–2357.

26. http://cn.morningstar.com/help/data/fundrisk.html.