Heuristic methods for stock selection and allocation in an index tracking problem

Abstract

Index tracking is one of the most popular passive strategies in portfolio management. However, due to practical constraints, full replication is difficult to achieve. Many mathematical models have failed to generate good results for partially replicated portfolios, but in recent years a data-driven approach has begun to take shape. This paper proposes three heuristic methods for both the selection and the allocation of the most informative stocks in an index tracking problem, namely XGBoost, Random Forest and LASSO with stability selection. Alongside these, recent deep autoencoders have also been tested. On average, all selected algorithms outperformed the benchmarks in terms of tracking error. The empirical study has been conducted on one of the biggest financial indices, in terms of number of components, in each of three countries: Russell 1000 for the USA, FTSE 350 for the UK, and Nikkei 225 for Japan.

Keywords

Machine learning; partial replication; index tracking; stock selection

1 Introduction

Investors have two main strategies for generating returns: active and passive portfolio management. In an active strategy, the portfolio manager tries to pick the best-performing stocks using experience and judgement. A passive investment, on the other hand, is based on the assumption that “you can’t beat the market” and that, in the long run, an active strategy will lead to diminishing returns due to transaction costs and other market frictions. One of the most popular passive strategies is to track a stock index considered a proxy for market behavior.

Index tracking is the process of attempting to replicate the performance of a financial index over time. A straightforward approach is to buy each stock constituting the market index in proportion to its market weight. Although in theory this leads to a perfect (or full) replication, it is rarely used in practice because of several practical obstacles. First, a large amount of capital is needed to buy all the stocks in the given proportions, which puts this approach out of reach for most individual investors. Second, index weights change over time. For example, the composition of the S&P 500 changed 60 times in the year 2000 (Beasley et al., 2003). Moreover, the exact composition of some indices is not known, or investment in some constituents is not possible, as Andriosopoulos and Nomikos (2014) noted for the Spot Energy index. Therefore, a partial replication is preferred.

For a partial replication to perform well, two steps are required: selecting the best k stocks and computing a weight for each of them. Classical approaches are based on the mean-variance portfolio framework proposed by Markowitz (1952). In Roll (1992), the variance is defined as the tracking error relative to a benchmark. Rohweder (1998) extends this approach by including transaction costs in the objective function, whose minimization results in a quadratic program, as in Jansen and Dijk (2002). Other authors, such as Corielli and Marcellino (2006), investigate a factor-based approach to index tracking using the Arbitrage Pricing Theory framework, and in a more recent paper Strub and Baumann (2018) introduce practical constraints in a mixed-integer linear programming formulation in order to obtain superior tracking performance.

Another line of research has emerged from the increasing performance of Machine Learning algorithms. Beasley et al. (2003) showed that an evolutionary heuristic method is preferable to full replication in the index tracking problem. Oh et al. (2005) used a genetic algorithm to optimize the weights of stocks selected through fundamental analysis, and Chiam et al. (2013) used the same procedure to minimize both tracking error and transaction costs. Another method with great potential was proposed by Heaton et al. (2017), who used deep neural networks for the index tracking problem. More recent papers build on this approach, namely Ouyang et al. (2019) and Kim and Kim (2019), which use deep neural networks for both selection and dynamic allocation.

This paper extends the research on heuristic methods by proposing new algorithms for stock selection: tree-based algorithms (Random Forest and XGBoost) and Lasso with stability selection. To the best of the author's knowledge, none of these methods has been used before in an index tracking problem. However, they have been used successfully in other feature selection problems in finance, as in Liu et al. (2015), Nobre and Neves (2019), or Sohrabi and Movaghari (2020). As benchmarks, the autoencoders proposed by Kim and Kim (2019) and a strategy based on the largest stocks have been considered. On the allocation side, three schemes have been tested: the neural network sensitivity approach proposed by Ouyang et al. (2019), a simple OLS method, and a dummy equally weighted scheme for a robust comparison.

The paper is organized as follows. The next two sections briefly discuss the models used for stock selection and for allocation. Section 4 describes the data and methodology, Section 5 presents the empirical performance of the proposed methodology, and Section 6 concludes.

2 Selection

This section briefly discusses the algorithms used to determine the most relevant stocks in an index. As Kim and Kim (2019) noted, a more direct way is to choose the largest components of the index. However, there are cases in which this approach cannot be applied, either because we cannot invest in all components or because the exact structure of the index is not known. Therefore, a more general procedure is required.

The task of finding the most relevant stocks could be seen as a classical feature selection problem. There are three general classes of feature selection algorithms (Miao & Niu, 2016), respectively:

Filter methods which consist in applying a statistical measure to assign a score for each feature like Chi-Squared test, information gain or correlation coefficient score;

Wrapper methods that consider the feature selection as a search problem where different combinations are prepared, evaluated, and compared to one another. An example of wrapper is recursive feature elimination;

Embedded methods which determine which features contribute most to the accuracy of the model while the model is being created.

This paper focuses on the last class of feature selection algorithms. From the embedded methods, Random Forest, XGBoost and Lasso model combined with a wrapper method (stability selection) have been chosen. For comparison, a Deep Autoencoders algorithm was also considered.

2.1 Random forest

The Random Forest model introduced by Breiman (2001) is one of the most efficient algorithms for both classification and regression tasks. The model is based on the bagging principle (Breiman, 1996), an aggregation scheme that generates multiple data sets by bootstrapping from the original input set, makes a prediction for each set using a CART model, and aggregates the predictions into a single result.

The characteristics of this algorithm make it suitable for the selection task. One of the biggest issues in detecting relevant features is distinguishing variables that only seem important because of random fluctuations from variables that are weakly but genuinely relevant. In a Random Forest model, each variable has a chance of being included in tree construction, so even weakly relevant features that are only marginally related to the target will be used.

Following Genuer et al. (2010), the importance of a variable X^j is defined as follows. For each tree t of the forest, consider the associated out-of-bag sample $OOB_{t}$ and denote by $errOOB_{t}$ the error (mean squared error in this case) of the single tree t on this sample. Randomly permuting the values of X^j in $OOB_{t}$ gives a perturbed sample, denoted by ${\widetilde{OOB}}_{t}^{j}$, with a corresponding error $err{\widetilde{OOB}}_{t}^{j}$. The variable importance of X^j is equal to: $VI (X^{j}) = \frac{1}{ntree} \sum_{t} (err{\widetilde{OOB}}_{t}^{j} - errOOB_{t})$ (1) where the sum is over all trees t and ntree denotes the number of trees of the Random Forest.
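
As an illustration of Eq. (1), the following Python sketch computes an out-of-bag permutation importance with NumPy. It is a minimal sketch under stated assumptions, not the implementation used in this paper: to remain dependency-free, a least squares fit on each bootstrap sample stands in for the CART trees, and the function names (`oob_permutation_importance`, `top_k`) and default values are hypothetical.

```python
import numpy as np


def oob_permutation_importance(X, y, n_estimators=50, seed=0):
    """Average increase in out-of-bag MSE after permuting each column of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    importance = np.zeros(p)
    n_used = 0
    for _ in range(n_estimators):
        boot = rng.integers(0, n, size=n)        # bootstrap rows (with replacement)
        oob = np.setdiff1d(np.arange(n), boot)   # rows left out of the bootstrap
        if oob.size == 0:
            continue
        # Assumption: a least squares fit stands in for the CART tree t.
        design = np.column_stack([np.ones(boot.size), X[boot]])
        coef, *_ = np.linalg.lstsq(design, y[boot], rcond=None)

        def mse(features):
            return np.mean((y[oob] - (coef[0] + features @ coef[1:])) ** 2)

        err_oob = mse(X[oob])
        for j in range(p):
            permuted = X[oob].copy()
            permuted[:, j] = rng.permutation(permuted[:, j])
            importance[j] += mse(permuted) - err_oob
        n_used += 1
    return importance / max(n_used, 1)


def top_k(scores, k):
    """Indices of the k largest scores, best first."""
    return np.argsort(scores)[::-1][:k]
```

Here each column of `X` would hold one stock and `y` the index; calling `top_k` on the returned importances gives a candidate set of k stocks.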

2.2 XGBoost

The XGBoost model developed by Chen and Guestrin (2016) is an efficient and scalable implementation of the Gradient Boosting Machine. Its popularity in Machine Learning competitions is due to numerous optimizations, such as (i) the addition of a regularization term that improves generalization, (ii) multithreaded parallel computing, which increases speed by more than 10 times according to Chen and Guestrin (2016), and (iii) efficient handling of missing data.

To train the model, the following objective function must be minimized: $L (φ) = \sum_{i} l ({\hat{y}}_{i}, y_{i}) + \sum_{k} Ω (f_{k})$ (2) where l is a differentiable cost function that measures the difference between the prediction ${\hat{y}}_{i}$ and the target value y_i, and Ω is a function penalizing the complexity of each tree f_k. Intuitively, minimizing the objective favors the model with the best prediction and the lowest complexity.
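
For reference, in Chen and Guestrin (2016) the complexity penalty of a single tree takes the form $Ω (f) = γ T + \frac{1}{2} λ {∥ w ∥}^{2}$, where T is the number of leaves of the tree, w is the vector of leaf scores, γ penalizes each additional leaf and λ is an L2 penalty on the leaf scores. This λ is a different parameter from the Lasso regularization parameter of Section 2.3. Under this reading, the minimum loss reduction and the L2 regularization term reported in Section 4 would correspond to γ and λ, respectively, although this mapping is an interpretation rather than something stated explicitly here.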

Stock selection follows the same approach as in Random Forest: features are ranked by their importance in the model and the best k stocks are selected.

2.3 Lasso

The Lasso (least absolute shrinkage and selection operator) model popularized by Tibshirani (1996) minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. This constraint shrinks the coefficients towards zero, and the features whose coefficients remain non-zero are the most informative. The Lasso estimator $\hat{β}$ is obtained as: $\underset{β}{arg min} \frac{1}{2} {(Y_{u} - X_{u} β)}^{T} (Y_{u} - X_{u} β) + λ \sum_{j = 1}^{p} | β_{j} |$ (3) where λ is a regularization parameter. To enhance the Lasso selection algorithm, the stability selection methodology has been applied.

Stability selection was proposed by Meinshausen and Bühlmann (2010) as a technique designed to improve existing selection methods. They consider a p-dimensional vector β in which s < p components are non-zero. Denote the set of non-zero components by $S = \{k : β_{k} \ne 0\}$. The goal of structure estimation is to recover the set S from noisy observations. For every value of the regularization parameter $λ \in Λ \subseteq \mathbb{R}^{+}$, an estimate ${\hat{S}}^{λ} \subseteq \{1, \dots, p\}$ is obtained. The question is whether there exists a $λ \in Λ$ such that ${\hat{S}}^{λ}$ is identical to S with high probability.

To do that, a subsample of size n/2 is drawn at random without replacement and the Lasso algorithm is applied to it. This procedure is repeated many times, each iteration producing a structure estimate ${\hat{S}}^{λ}$. Consequently, the selection probability of each variable can be computed as the fraction of subsamples in which it is selected. The variables are ranked by this probability and the best k stocks are selected.
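
The subsampling procedure can be sketched in Python as follows. This is a minimal illustration, assuming a single fixed value of λ and a plain coordinate descent solver for the objective in Eq. (3); the function names, the number of subsamples and the number of iterations are hypothetical choices rather than the settings used in this paper.

```python
import numpy as np


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float)          # residual y - X @ beta, starting from beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = X[:, j] @ resid + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta


def selection_probabilities(X, y, lam, n_subsamples=50, seed=0):
    """Fraction of half-size subsamples in which each coefficient is non-zero."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)   # without replacement
        counts += (lasso_cd(X[idx], y[idx], lam) != 0)
    return counts / n_subsamples
```

Sorting the returned probabilities in decreasing order and keeping the first k entries yields the selected stocks.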

2.4 Deep autoencoders

Autoencoders are one of the most widely used dimensionality reduction techniques. They were successfully used in the index tracking problem by Heaton et al. (2016), Ouyang et al. (2019) and Kim and Kim (2019). The goal is to create a deep network architecture that reconstructs the input vector in the output layer as accurately as possible. In other words, for any given input x_i, we try to obtain, through a series of nonlinear transformations, an output $x_{i}^{'}$ such that the difference between x_i and $x_{i}^{'}$ is minimal.

Autoencoders usually have a symmetric architecture. The middle layer consists of one or more neurons. Ouyang et al. (2019) argue that a structure with one neuron in the middle shares certain similarities with the Capital Asset Pricing Model (CAPM) because its value can be interpreted as a market portfolio. For this reason, in this paper the center layer has exactly one node.

Selection of the best stocks will be made using Heaton et al. (2016) methodology. The most informative stocks will be the ones that have the highest similarity in the autoencoder. This will be measured using $d_{i} = {∥ x_{i} - x_{i}^{'} ∥}^{2}$ (4) where x_i is the input stock and $x_{i}^{'}$ is the output. The smaller the value of d_i the more common information is between stock i and market portfolio. Based on that value, all stocks will be ranked and the best k of them will be selected.
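
The ranking step of Eq. (4) can be sketched in Python as follows. To keep the example short and dependency-free, a linear rank-1 reconstruction obtained from a singular value decomposition stands in for the trained deep autoencoder with a single middle neuron; this substitution, the centering of the columns and the function name `rank_by_reconstruction` are assumptions made for illustration only.

```python
import numpy as np


def rank_by_reconstruction(X, k):
    """Return the k columns of X with the smallest reconstruction error d_i.

    X has shape (T, N): T observations for N stocks. The reconstruction uses
    one latent factor (a linear stand-in for a one-neuron bottleneck).
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    X_rec = s[0] * np.outer(U[:, 0], Vt[0])     # rank-1 reconstruction x_i'
    d = ((Xc - X_rec) ** 2).sum(axis=0)         # d_i = ||x_i - x_i'||^2
    return np.argsort(d)[:k], d
```

Stocks that move closely with the single common factor obtain a small d_i and are ranked first.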

3 Allocation

This section discusses some stock allocation schemes that can be used after the stock selection stage. The allocation procedure consists of finding weights for the selected stocks so that the difference between the return of the partially replicated index and the true index return is as small as possible at the end of the testing period.

As stated in the Introduction, many scholars use optimization based on the Markowitz framework to determine the weights (e.g., Roll (1992), Rohweder (1998) or Jansen and Dijk (2002)). Other approaches use an equally weighted scheme (Heaton et al., 2016) or schemes based on the correlation of returns (Chen and Kwon (2012), Kim and Kim (2019)). Among heuristic methods, one of the most common approaches is to determine the weights through an evolutionary algorithm applied to an objective function (Beasley et al. (2003), Oh et al. (2005), Chiam et al. (2013)). However, if the objective function has a formulation similar to the mean squared error, the weights computed through an evolutionary algorithm will tend to the weights computed through the ordinary least squares method, so this approach becomes superfluous.

Despite their success in regression analysis, other Machine Learning algorithms, such as tree-based models or deep learning models, are not suited to this task because they cannot express the output as a linear combination of features, as required in trading. However, Ouyang et al. (2019) propose using a sensitivity analysis to determine the weight of each stock. They argue that a neural network model can efficiently extract a representation of each stock's prices and model the nonlinear interactions between them.

In this paper two different allocation schemes have been considered, respectively a linear approach based on OLS method and a nonlinear approach based on the sensitivity of a deep neural network.

3.1 Ordinary least squares (OLS)

OLS is a statistical method used to determine the unknown parameters of a linear regression. The algorithm chooses the parameters based on the least squares principle: minimizing the sum of squared differences between the observed values of the dependent variable and those predicted by the linear function. The mathematical formulation can be written as: $\hat{W} = \underset{W}{argmin} {(Y - XW)}^{T} (Y - XW) = {(X^{T} X)}^{- 1} X^{T} Y$ (5) where Y is an (n, 1) vector with n index prices and X is an (n, k) matrix with n prices for k stocks. $\hat{W}$ can be interpreted as the weights of the k stocks.
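
Eq. (5) can be evaluated in a few lines of Python. The sketch below is illustrative only: it assumes no intercept and no constraints on the weights (they need not sum to one and may be negative), and the function names are hypothetical. A least squares solver is shown next to the closed-form normal equations because it is usually the numerically safer choice when stock prices are highly correlated.

```python
import numpy as np


def ols_weights(X, y):
    """Least squares weights for y ~ X w, with X of shape (n, k)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def ols_weights_normal_equations(X, y):
    """Closed form of Eq. (5): solve (X^T X) w = X^T y without forming an inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```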

3.2 Deep neural network sensitivity

A deep neural network is a set of interconnected processing nodes whose functioning is inspired by biological neural networks; the underlying artificial neuron model goes back to McCulloch and Pitts (1943).

Any neural network model has three types of layers: an input layer containing the explanatory variables (stock prices in this case), one or more hidden layers, and an output layer (index price). Each layer contains one or more neurons. The functioning of an individual neuron is simple: it sums all the signals sent to it, adds a bias term and applies a non-linear transformation through an activation function. The activation (transfer) function is a monotonically increasing function, most often the logistic function, the hyperbolic tangent or ReLU. The transformed signal is forwarded, scaled by a weight, to the neurons in the next layer, and the process is repeated. This is called the feedforward step. The processing power of the network is determined by the weights on these connections, which are computed using the backpropagation method (see Rumelhart et al. (1986) for details).

For a simple network with one hidden layer, the feedforward step can be written as: $\hat{Y} = f_{2} (W_{2} f_{1} (W_{1} X + B_{1}) + B_{2})$ (6) where X is the input vector, W₁ and W₂ are the weight matrices, B₁ and B₂ are bias vectors, f₁(.) and f₂(.) are the activation functions (usually ReLU, sigmoid or tanh), and $\hat{Y}$ is the output vector. The cost function is defined as ${∥ Y - \hat{Y} ∥}^{2}$ . Through backpropagation, weights and biases are updated. Usually, the algorithm used in the minimization is gradient descent.

The weight matrices W₁ and W₂ reflect the relationships between units in different layers of the neural network, but they do not directly reflect the relationship between the input and the output. To overcome this issue, Ouyang et al. (2019) propose a sensitivity analysis that determines the direct influence of the input on the output. This sensitivity can be interpreted as the weights of the stocks in a portfolio.

$\hat{W} = \frac{dY}{dX} = f_{2}^{'} (W_{2} f_{1} (W_{1} X + B_{1}) + B_{2}) * W_{2} * f_{1}^{'} (W_{1} X + B_{1}) * W_{1}$ (7)

For a general deep network with n layers, the equation can be written as: $\begin{matrix} \hat{W} = \frac{d Y}{d X} = \prod_{i = 1}^{n} A_{i}^{'} * W_{i} \\ A_{i} = f_{i} (W_{i} A_{i - 1} + B_{i}) \\ A_{1} = f_{1} (W_{1} X + B_{1}) \end{matrix}$ (8)
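
For the one-hidden-layer case, the sensitivity can be sketched in Python as follows. The sketch assumes a tanh hidden activation and a linear output layer (so the derivative of f₂ equals 1); the function names are hypothetical. Note that the sensitivity depends on the input point X at which it is evaluated; how the point-wise sensitivities are aggregated or normalized into portfolio weights is not specified here and is left open in the sketch.

```python
import numpy as np


def forward(x, W1, b1, W2, b2):
    """One-hidden-layer network with tanh hidden units and a linear output."""
    return W2 @ np.tanh(W1 @ x + b1) + b2


def sensitivity(x, W1, b1, W2, b2):
    """Derivative of the output with respect to the input, evaluated at x.

    Chain rule: dY/dX = W2 * diag(1 - tanh(W1 x + b1)^2) * W1.
    Shapes: W1 is (H, k), W2 is (1, H); the result is (1, k).
    """
    h = np.tanh(W1 @ x + b1)
    return (W2 * (1.0 - h ** 2)) @ W1
```

A quick check of such a routine is to compare its output with central finite differences of `forward` around the same point.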

4 Data and methodology

In order to highlight the practicality of heuristic approaches, three of the biggest indices in terms of number of components have been considered: Russell 1000 for the USA, FTSE 350 for the UK, and Nikkei 225 for Japan. They represent over 90% of total market capitalization in each country. The data used in this analysis are the daily prices of each index and of its constituent stocks traded between 01.01.2010 and 31.12.2020. To ensure the robustness of the methodology, 6 rolling windows have been considered, each with 5 years of training data and 1 year of out-of-sample data. The source of the datasets is Thomson Reuters Tick History, and the components of the indices are those at the end of each training period. Stocks that were not traded in the training period have been eliminated. Where a stock has non-trading days, the last available price has been used for the missing days. This also covers the cases in which a company has been delisted in the out-of-sample dataset due to a merger, an acquisition or bankruptcy.
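
The rolling-window layout and the treatment of missing prices described above can be sketched in Python as follows. The sketch assumes calendar-year windows that advance by one year at a time, which is consistent with 6 windows between 2010 and 2020; the function names are hypothetical.

```python
import numpy as np


def rolling_windows(first_year=2010, last_year=2020, train_years=5, test_years=1):
    """List of ((train_start, train_end), (test_start, test_end)) year pairs."""
    windows = []
    start = first_year
    while start + train_years + test_years - 1 <= last_year:
        train = (start, start + train_years - 1)
        test = (start + train_years, start + train_years + test_years - 1)
        windows.append((train, test))
        start += test_years
    return windows


def forward_fill(prices):
    """Replace NaN prices with the last available price of the same stock.

    prices has shape (T, N); leading NaNs (no earlier price) are left as NaN.
    """
    filled = np.array(prices, dtype=float)
    for t in range(1, filled.shape[0]):
        missing = np.isnan(filled[t])
        filled[t, missing] = filled[t - 1, missing]
    return filled
```

With the default arguments, the first window trains on 2010 to 2014 and tests on 2015, and the last one trains on 2015 to 2019 and tests on 2020.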

The aim of the empirical analysis is to track an index using fewer constituents. Therefore, k stocks have been selected, with k equal to 10, 25 and 50. Using only those stocks, the cumulative return of the index is replicated with the aim of obtaining the smallest tracking error. The input for the Machine Learning models is the daily prices of each stock, and the output is the daily prices of the corresponding index.

Section 2 briefly discusses the algorithms and the methodology used in selection. In addition to those algorithms, the largest k stocks at the end of each training set have been considered as a benchmark. Random Forest was estimated using 100 decision trees with a maximum depth of 20 levels. The minimum number of samples required to split an internal node is 2, and the minimum number of samples required in a leaf is 1. A node is split only if the split induces a decrease in impurity greater than 0. The quality of a split is measured by the mean squared error. XGBoost uses a gbtree booster with a learning rate of 0.3. The maximum depth of a tree is 20, as in Random Forest, and the minimum loss reduction required to partition a node is 0. Training instances are sampled uniformly. The L2 regularization term has a value of 1 and the objective is to minimize the sum of squared errors. The hyperparameter tuning strategy for these algorithms is random search.

In the case of autoencoders, the same architecture as in Kim and Kim (2019) has been used: a 3-hidden-layer deep autoencoder in which the numbers of neurons in the first and second layers are 1/4 and 1/16 of the number of input stocks. The middle layer has only one neuron, as in Ouyang et al. (2019). The activation function is the hyperbolic tangent $f (x) = \frac{e^{x} - e^{- x}}{e^{x} + e^{- x}}$ .

Section 3 briefly discusses the algorithms and the methodology used in stock allocation. The main disadvantage of most heuristic models is that they are nonlinear and therefore unsuited to computing weights in a portfolio management problem. Kim and Kim (2019) tested 3 allocation schemes, 2 based on the correlation of stock returns with the index and one based on solving a quadratic programming problem. However, they found that those approaches are no better than a simple equally weighted scheme. Therefore, in this paper I have chosen to compare the neural network sensitivity approach proposed by Ouyang et al. (2019) (which has not previously been compared with other approaches) with a standard OLS approach and with an equally weighted scheme. The purpose of this analysis is not to find the best possible calibration of the parameters, but to show that even an arbitrary configuration can produce notable results. Each strategy can be further optimized based on the number of inputs or the rolling window.

As a measure of tracking error, the tracking error volatility of return (TEV), introduced by Roll (1992) and used by Kim and Kim (2019), has been considered: $TEV^{(a)} = \sqrt{\frac{1}{T} \sum_{t = 1}^{T} {(R_{t} - \sum_{s = 1}^{k} w_{s}^{(a)} r_{s, t} - E [R_{t} - \sum_{s = 1}^{k} w_{s}^{(a)} r_{s, t}])}^{2}}$ (9)

Here T is the number of days in the out-of-sample data, R_t is the daily return of the index on day t, k is the total number of selected stocks, $w_{s}^{(a)}$ is the weight of stock s computed under allocation scheme (a) and $r_{s, t}$ is the daily return of stock s on day t. Another measure of performance used in this paper is the correlation between the partially replicated index and the market value of the index.
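
Eq. (9) translates directly into a short Python function. In this sketch the expectation E[.] is estimated by the sample mean of the return difference over the out-of-sample period, which is an assumption; the function name is hypothetical.

```python
import numpy as np


def tracking_error_volatility(index_returns, stock_returns, weights):
    """TEV of Eq. (9).

    index_returns: shape (T,), daily returns of the index.
    stock_returns: shape (T, k), daily returns of the k selected stocks.
    weights:       shape (k,), weights from one allocation scheme.
    """
    index_returns = np.asarray(index_returns, dtype=float)
    replicated = np.asarray(stock_returns, dtype=float) @ np.asarray(weights, dtype=float)
    diff = index_returns - replicated
    # Assumption: E[.] is replaced by the sample mean of the difference.
    return float(np.sqrt(np.mean((diff - diff.mean()) ** 2)))
```

Because the mean difference is subtracted, a replicated portfolio that misses the index return by a constant amount every day has a TEV of zero; the measure captures the variability of the gap rather than its level.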

5 Empirical results

Table 1 shows the tracking performance, expressed as both TEV and correlation, for the three indices: FTSE 350, Russell 1000, and Nikkei 225. It has been computed for five selection strategies (Autoencoders, Lasso with stability selection, Random Forest, XGBoost, and the Largest stocks in terms of market capitalization), each combined with three allocation schemes: Neural Network Sensitivity (NNS), ordinary least squares (OLS) and an equally weighted scheme (Equals) as a benchmark. For robustness, three sets of stocks have been selected for each strategy, with 10, 25 and 50 stocks. The results are averages over the six one-year rolling windows from 2015 to 2020. The individual results for each year can be found in the Appendix. Most of the results are in the same range, the error being, on average, only 0.4%–0.7%.

Table 1
Average tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks over the one-year rolling windows from 2015 to 2020. The best combination for each index and number of stocks is marked with an asterisk (*)

			k = 10		k = 25		k = 50
Index	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	0.98%	58.29%	0.56%	80.11%	0.78%	70.44%
		OLS	1.06%	73.66%	0.62%	87.26%	0.42%	90.33%
		Equals	0.73%	76.63%	0.52%	87.54%	0.43%	89.56%
	Lasso	NNS	0.63%	73.61%	0.51%	84.83%	0.48%	86.00%
		OLS	1.24%	76.81%	0.51%	81.64%	0.44%	89.09%
		Equals	0.82%	76.07%	0.51%	84.04%	0.44%	87.94%
	Largest	NNS	0.72%	83.95%	0.51%	91.32%	0.43%	92.00%
		OLS	0.73%	81.73%	0.52%	90.51%	0.43%	93.00%
		Equals	0.77%	83.04%	0.55%	90.38%	0.48%	91.17%
	Random Forest	NNS	0.73%	71.22%	0.59%	78.37%	0.60%	76.18%
		OLS	0.53%	82.80%	0.51%	85.88%	0.45%	88.95%
		Equals	0.91%	69.62%	0.68%	75.85%	0.54%	81.99%
	XGBoost	NNS	0.63%	78.05%	0.52%	87.44%	0.46%	88.74%
		OLS	0.49%*	84.63%*	0.41%*	90.60%*	0.34%*	93.32%*
		Equals	0.87%	77.12%	0.56%	86.17%	0.43%	89.73%
Russell 1000	Autoencoders	NNS	0.50%	91.33%	0.42%	94.43%	0.36%	96.00%
		OLS	0.47%*	92.94%*	0.36%*	95.84%*	0.26%*	97.64%*
		Equals	0.49%	92.47%	0.37%	96.33%	0.27%	97.54%
	Lasso	NNS	0.66%	84.47%	0.47%	92.10%	0.32%	96.54%
		OLS	0.68%	84.80%	0.51%	90.87%	0.33%	96.40%
		Equals	0.61%	87.22%	0.56%	89.60%	0.33%	96.32%
	Largest	NNS	0.76%	83.39%	0.58%	88.62%	0.42%	93.57%
		OLS	0.87%	75.52%	0.61%	85.53%	0.51%	92.17%
		Equals	0.93%	81.22%	0.70%	87.33%	0.47%	92.15%
	Random Forest	NNS	0.67%	84.51%	0.47%	92.90%	0.40%	94.12%
		OLS	0.63%	87.13%	0.50%	91.97%	0.39%	95.19%
		Equals	0.57%	89.79%	0.44%	94.10%	0.37%	95.74%
	XGBoost	NNS	0.62%	87.88%	0.45%	93.60%	0.35%	96.12%
		OLS	0.60%	89.10%	0.44%	93.81%	0.33%	96.34%
		Equals	0.64%	88.63%	0.48%	92.21%	0.34%	96.37%
Nikkei 225	Autoencoders	NNS	0.74%	69.38%	0.58%	78.91%	0.55%	81.49%
		OLS	0.87%	69.29%	0.54%	85.84%	0.48%	87.30%
		Equals	0.74%	75.47%	0.53%	84.39%	0.45%	89.15%
	Lasso	NNS	0.70%	70.03%	0.38%*	92.93%*	0.32%	94.23%
		OLS	0.70%	79.17%	0.48%	89.82%	0.34%	93.24%
		Equals	0.71%	83.09%	0.47%	91.70%	0.45%	91.37%
	Largest	NNS	0.65%	86.17%	0.49%	92.57%	0.41%	95.05%
		OLS	0.72%	83.42%	0.52%	91.01%	0.41%	94.42%
		Equals	0.70%	85.07%	0.52%	92.48%	0.44%	94.11%
	Random Forest	NNS	0.75%	73.09%	0.58%	81.45%	0.44%	88.06%
		OLS	0.57%	87.42%	0.51%	87.41%	0.37%	91.35%
		Equals	0.68%	86.71%	0.49%	92.38%	0.50%	92.27%
	XGBoost	NNS	0.74%	71.66%	0.46%	88.94%	0.38%	91.66%
		OLS	0.53%*	87.66%*	0.46%	90.39%	0.34%	93.40%
		Equals	0.55%	87.58%	0.43%	92.61%	0.31%*	95.46%*

To highlight the performance of each algorithm, a more in-depth analysis has been conducted. Table 2 presents the average tracking error and correlation with respect to selection strategy for each index. Averaged over the three indices in the analyzed period, all heuristic methods achieved a lower tracking error than the benchmark based on the largest companies in the index, with XGBoost having the smallest tracking error. Although they were used successfully by Kim and Kim (2019), autoencoders performed worse than the other data-driven approaches. However, they seem to produce better results when the number of inputs is larger, as in the case of Russell 1000. Table 3 shows the average tracking error and correlation with respect to allocation scheme for each index. For FTSE 350 and Nikkei 225, as well as in total, an allocation based on ordinary least squares is better than the dummy equally weighted scheme, while for Russell 1000 the equally weighted scheme has a marginally higher correlation at the same TEV. The neural network sensitivity model proposed by Ouyang et al. (2019) does not outperform the benchmark.

Table 2

Average tracking error (TEV) and correlation grouped by selection strategy for each index. The best selection strategy for each index has been highlighted (*)

	FTSE 350		Russell 1000		Nikkei 225		Total
Selection	TEV	Corr	TEV	Corr	TEV	Corr	TEV	Corr
Autoencoders	0.68%	79.31%	0.39%*	94.95%*	0.61%	80.14%	0.56%	84.80%
Lasso	0.62%	82.23%	0.50%	90.93%	0.51%	87.29%	0.54%	86.81%
Largest	0.57%	87.57%	0.65%	86.61%	0.54%	90.52%*	0.59%	88.57%
Random Forest	0.62%	78.98%	0.49%	91.72%	0.54%	86.68%	0.55%	85.79%
XGBoost	0.52%*	88.20%*	0.47%	92.67%	0.47%*	88.76%	0.49%*	89.21%*

Table 3

Average tracking error (TEV) and correlation grouped by allocation scheme for each index. The best allocation scheme for each index has been highlighted (*)

	FTSE 350		Russell 1000		Nikkei 225		Total
Allocation	TEV	Corr	TEV	Corr	TEV	Corr	TEV	Corr
NNS	0.61%	80.04%	0.50%	91.31%	0.55%	83.73%	0.55%	85.03%
OLS	0.58%*	86.01%*	0.50%	91.02%	0.52%*	87.38%*	0.53%*	88.14%*
Equals	0.62%	83.12%	0.50%*	91.80%*	0.53%	86.92%	0.55%	87.95%

Table 4 presents the average tracking error and correlation by the number of stocks selected in the selection step for each index. There is a clear negative relationship between the number of stocks in the partially replicated index and the tracking error: a higher number of stocks leads to smaller errors. This can be explained by the diversification of the portfolio, which reduces overall volatility. Table 5 shows the average errors with respect to the rolling window. Note that the tracking error fluctuates by a factor of about two from period to period (for example, from 0.41% to 0.85% for FTSE 350), which suggests that training on a larger data set may be needed to capture market volatility.

Table 4

Average tracking error (TEV) and correlation grouped by number of stocks (k) selected for each replicated index. The best results for each index have been highlighted (*)

	FTSE 350		Russell 1000		Nikkei 225		Total
No. of stocks	TEV	Corr	TEV	Corr	TEV	Corr	TEV	Corr
k = 10	0.79%	76.48%	0.65%	86.69%	0.69%	79.65%	0.71%	80.94%
k = 25	0.54%	85.46%	0.49%	91.95%	0.50%	88.88%	0.51%	88.77%
k = 50	0.48%*	87.23%*	0.36%*	95.48%*	0.41%*	91.50%*	0.42%*	91.40%*

Table 5

Average tracking error (TEV) and correlation grouped by one year rolling window (starting from 2015) for each index

	FTSE 350		Russell 1000		Nikkei 225		Total
Window	TEV	Corr	TEV	Corr	TEV	Corr	TEV	Corr
1	0.56%	89.66%	0.48%	93.17%	0.44%	91.24%	0.49%	91.36%
2	0.69%	85.38%	0.65%	93.35%	0.48%	88.22%	0.61%	88.98%
3	0.41%	76.31%	0.35%	88.90%	0.45%	71.94%	0.40%	79.05%
4	0.57%	76.97%	0.45%	92.49%	0.58%	88.09%	0.53%	85.85%
5	0.52%	80.19%	0.44%	88.40%	0.49%	86.33%	0.48%	84.97%
6	0.85%	89.83%	0.64%	91.95%	0.77%	94.25%	0.75%	92.01%

Figures 1–3 show the cumulative return curves for each index together with the partially replicated indices generated by the three allocation schemes (NNS, OLS and equal weights), each for one of the best selection strategies in terms of tracking error: Lasso with stability selection for FTSE 350 in 2018, Random Forest for Russell 1000 in 2016 and XGBoost for Nikkei 225 in 2015. The selected periods capture both sideways market movements and episodes of very high volatility, with corrections of over 30% in one week. In all cases, the replicated index closely follows the true value of the index.

Fig. 1

Cumulative result for FTSE 350 with Lasso selection on 50 stocks in 2019.

Fig. 2

Cumulative result for Russell 1000 with Random Forest on 50 stocks in 2016.

Fig. 3

Cumulative result for Nikkei 225 with XGBoost selection on 50 stocks in 2015.
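Cumulative return curves such as those in Figs. 1–3 can be obtained by compounding per-period returns. The sketch below assumes simple (not logarithmic) returns, which is an assumption rather than a detail confirmed by the study; the function name and the toy numbers are illustrative.

```python
import numpy as np

def cumulative_return(returns):
    """Compound simple per-period returns into a cumulative return curve."""
    r = np.asarray(returns, dtype=float)
    return np.cumprod(1.0 + r) - 1.0

# Toy returns for an index and a replicated portfolio (hypothetical values).
index_r = np.array([0.010, -0.020, 0.015])
replica_r = np.array([0.011, -0.018, 0.013])

index_curve = cumulative_return(index_r)
replica_curve = cumulative_return(replica_r)

# Gap between the two curves at each date; small gaps mean close tracking.
gap = replica_curve - index_curve
print(index_curve, replica_curve, gap)
```

Plotting `index_curve` and `replica_curve` on the same axes gives a visual check of how closely the replicated portfolio follows the index.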

6 Conclusions

The index tracking problem has great practical importance in financial economics. This paper extends recent methodologies in stock selection and allocation by proposing three new approaches for selection: XGBoost, Random Forest and Lasso with stability selection. Alongside these, deep autoencoders have also been tested. For the allocation step, ordinary least squares has been compared with a neural network sensitivity approach and an equally weighted strategy. For robustness, three different indices have been considered: Russell 1000, FTSE 350 and Nikkei 225.

Empirical results suggest that the proposed selection strategies outperform the considered benchmarks, namely the largest stocks by market capitalization and the deep autoencoders used successfully in recent studies. Among the proposed methods, XGBoost had the best performance. Among the allocation schemes, ordinary least squares outperforms both the naive equally weighted scheme and the neural network sensitivity approach; however, the differences are not significant. The number of stocks selected in the partially replicated index has a clear negative relationship with the tracking error volatility: the error decreases as the number of stocks increases. The error is not constant over time and can more than double from one period to another.
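To make the comparison between the ordinary least squares and equally weighted allocations concrete, the following sketch fits both on synthetic data. It assumes that OLS allocation means regressing the index returns on the returns of the selected stocks without an intercept and without sign constraints, so that negative weights (short positions) are allowed; this specification, the synthetic data and all names are assumptions for illustration, not the exact setup of the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 250 daily returns for 5 hypothetical selected stocks.
n_days, k = 250, 5
stock_r = rng.normal(0.0, 0.01, size=(n_days, k))

# Hypothetical "true" index weights, including one short position.
true_w = np.array([0.40, 0.30, 0.20, 0.15, -0.05])
index_r = stock_r @ true_w + rng.normal(0.0, 0.001, size=n_days)

# OLS allocation: least-squares fit of index returns on stock returns
# (assumption: no intercept and no constraint on the sign of the weights).
ols_w, *_ = np.linalg.lstsq(stock_r, index_r, rcond=None)

# Equally weighted allocation.
equal_w = np.full(k, 1.0 / k)

def tev(weights):
    """In-sample tracking error volatility for a weight vector."""
    return np.std(stock_r @ weights - index_r, ddof=1)

print("OLS weights:  ", np.round(ols_w, 3))
print("TEV OLS:      ", tev(ols_w))
print("TEV equal-wt.:", tev(equal_w))
```

The comparison here is in-sample only; an out-of-sample evaluation on a separate test window would be needed to mirror the rolling-window design.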

The main advantage of the data-driven approach is that it can be used to recreate any index return from a given pool of financial assets. Moreover, transaction costs are very low because no dynamic allocation is required. One of the biggest limitations of this study is the assumption that the market allows short selling, which is not always the case. In addition, stocks with higher beta could negatively influence the newly generated index. More robust tests should be performed to confirm the performance of this methodology.

Appendix

Table 6

Tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks for the 1st rolling window (Jan 2015–Dec 2015). The best combinations for each index have been highlighted

			k = 10		k = 25		k = 50
	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	0.52%	89.90%	0.42%	89.99%	0.73%	82.42%
		OLS	2.31%	85.87%	0.76%	94.08%	0.33%	96.51%
		Equals	0.57%	86.13%	0.51%	93.52%	0.38%	95.17%
	Lasso	NNS	0.61%	63.67%	0.52%	87.85%	0.40%	92.53%
		OLS	2.09%	72.65%	0.49%	90.11%	0.37%	94.47%
		Equals	1.19%	83.77%	0.44%	90.47%	0.39%	93.53%
	Largest	NNS	0.46%	89.35%	0.41%	95.49%	0.37%	95.38%
		OLS	0.47%	89.81%	0.44%	94.66%	0.39%	97.22%
		Equals	0.52%	88.35%	0.45%	93.83%	0.44%	94.65%
	Random Forest	NNS	0.46%	89.59%	0.54%	85.30%	0.54%	85.04%
		OLS	0.51%	87.32%	0.47%	89.00%	0.43%	93.06%
		Equals	0.53%	86.29%	0.53%	85.75%	0.55%	88.49%
	XGBoost	NNS	0.54%	85.46%	0.42%	92.11%	0.40%	92.76%
		OLS	0.42%	91.70%	0.35%	93.99%	0.26%	97.24%
		Equals	0.54%	85.75%	0.38%	93.27%	0.42%	91.41%
Russell 1000	Autoencoders	NNS	0.37%	96.05%	0.42%	95.68%	0.35%	97.03%
		OLS	0.46%	94.24%	0.38%	96.65%	0.22%	98.67%
		Equals	0.52%	93.42%	0.40%	97.71%	0.27%	98.87%
	Lasso	NNS	0.67%	86.96%	0.42%	94.98%	0.23%	98.59%
		OLS	0.73%	81.45%	0.54%	91.83%	0.24%	98.40%
		Equals	0.57%	90.47%	0.62%	92.33%	0.28%	97.89%
	Largest	NNS	0.79%	87.56%	0.46%	94.27%	0.35%	96.73%
		OLS	0.77%	71.42%	0.53%	92.52%	0.48%	93.71%
		Equals	1.21%	83.39%	0.52%	92.53%	0.42%	95.16%
	Random Forest	NNS	0.91%	73.60%	0.52%	93.02%	0.27%	97.94%
		OLS	0.68%	86.21%	0.50%	94.57%	0.26%	98.25%
		Equals	0.62%	90.92%	0.50%	93.98%	0.30%	97.66%
	XGBoost	NNS	0.51%	92.99%	0.39%	95.96%	0.33%	97.33%
		OLS	0.61%	91.81%	0.39%	96.10%	0.23%	98.58%
		Equals	0.61%	90.91%	0.40%	96.00%	0.25%	98.46%
Nikkei 225	Autoencoders	NNS	0.59%	80.49%	0.63%	78.84%	0.45%	88.86%
		OLS	0.68%	78.19%	0.46%	89.28%	0.39%	91.78%
		Equals	0.80%	73.02%	0.58%	83.71%	0.48%	88.69%
	Lasso	NNS	0.49%	88.87%	0.32%	95.40%	0.29%	95.51%
		OLS	0.45%	90.57%	0.51%	91.53%	0.37%	92.71%
		Equals	0.61%	83.18%	0.41%	92.49%	0.43%	91.26%
	Largest	NNS	0.49%	92.08%	0.44%	94.33%	0.37%	97.65%
		OLS	0.59%	87.79%	0.42%	92.76%	0.36%	97.31%
		Equals	0.53%	88.29%	0.46%	95.52%	0.38%	96.88%
	Random Forest	NNS	0.61%	84.55%	0.44%	89.35%	0.42%	90.67%
		OLS	0.39%	93.02%	0.47%	90.00%	0.41%	92.54%
		Equals	0.41%	94.16%	0.43%	95.92%	0.37%	96.63%
	XGBoost	NNS	0.35%	93.88%	0.27%	96.13%	0.26%	96.38%
		OLS	0.35%	94.26%	0.30%	95.69%	0.29%	96.25%
		Equals	0.33%	95.33%	0.29%	96.49%	0.28%	97.50%

Tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks for the 2nd rolling window (Jan 2016–Dec 2016). The best combinations for each index have been highlighted

			k = 10		k = 25		k = 50
	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	1.56%	37.27%	0.60%	89.12%	1.02%	75.80%
		OLS	0.64%	83.13%	0.61%	95.13%	0.31%	95.57%
		Equals	0.92%	81.26%	0.46%	91.48%	0.38%	94.35%
	Lasso	NNS	0.74%	81.21%	0.47%	90.39%	0.49%	88.54%
		OLS	2.86%	80.40%	0.57%	85.62%	0.44%	90.58%
		Equals	0.83%	80.80%	0.48%	90.16%	0.45%	91.62%
	Largest	NNS	0.90%	93.18%	0.62%	93.75%	0.53%	93.06%
		OLS	0.90%	92.52%	0.62%	93.06%	0.46%	97.05%
		Equals	0.92%	85.57%	0.69%	93.22%	0.57%	94.62%
	Random Forest	NNS	1.02%	58.50%	0.46%	90.41%	0.87%	58.78%
		OLS	0.68%	79.35%	0.53%	88.79%	0.45%	91.68%
		Equals	0.93%	75.47%	0.80%	79.02%	0.54%	87.44%
	XGBoost	NNS	0.94%	54.03%	0.40%	94.18%	0.58%	88.88%
		OLS	0.63%	71.02%	0.34%	95.45%	0.28%	96.81%
		Equals	0.85%	87.34%	0.51%	92.83%	0.37%	93.72%
Russell 1000	Autoencoders	NNS	0.60%	93.59%	0.58%	94.68%	0.55%	94.73%
		OLS	0.60%	94.14%	0.52%	95.52%	0.43%	96.87%
		Equals	0.64%	93.89%	0.60%	95.97%	0.43%	96.92%
	Lasso	NNS	0.71%	90.99%	0.53%	94.97%	0.50%	96.08%
		OLS	0.75%	89.92%	0.54%	95.03%	0.46%	96.50%
		Equals	0.73%	92.90%	0.60%	93.54%	0.49%	96.09%
	Largest	NNS	0.93%	86.28%	0.79%	90.98%	0.52%	95.59%
		OLS	0.96%	75.42%	0.65%	84.10%	0.63%	93.73%
		Equals	1.52%	83.59%	1.15%	92.80%	0.71%	93.43%
	Random Forest	NNS	0.71%	92.82%	0.54%	95.24%	0.45%	96.60%
		OLS	0.68%	94.10%	0.59%	94.56%	0.49%	96.27%
		Equals	0.69%	91.91%	0.55%	94.76%	0.46%	96.46%
	XGBoost	NNS	0.73%	94.37%	0.63%	94.52%	0.50%	96.04%
		OLS	0.76%	93.79%	0.55%	95.24%	0.47%	96.41%
		Equals	0.86%	92.25%	0.82%	92.21%	0.62%	94.83%
Nikkei 225	Autoencoders	NNS	0.65%	69.02%	0.75%	70.43%	0.33%	92.98%
		OLS	0.92%	66.76%	0.55%	86.32%	0.44%	91.06%
		Equals	0.77%	74.71%	0.56%	84.88%	0.47%	89.03%
	Lasso	NNS	0.56%	80.25%	0.36%	94.20%	0.31%	94.06%
		OLS	0.55%	83.88%	0.48%	91.20%	0.32%	92.99%
		Equals	0.60%	86.18%	0.47%	93.44%	0.36%	92.67%
	Largest	NNS	0.64%	89.16%	0.47%	95.26%	0.38%	96.03%
		OLS	0.63%	91.36%	0.49%	94.44%	0.39%	95.67%
		Equals	0.67%	88.84%	0.47%	94.88%	0.37%	94.16%
	Random Forest	NNS	0.94%	51.03%	0.45%	86.33%	0.32%	93.14%
		OLS	0.45%	89.86%	0.37%	91.69%	0.29%	95.35%
		Equals	0.48%	89.68%	0.38%	93.22%	0.35%	91.52%
	XGBoost	NNS	0.73%	76.39%	0.37%	90.62%	0.33%	92.84%
		OLS	0.40%	90.94%	0.29%	94.44%	0.28%	95.52%
		Equals	0.55%	83.04%	0.32%	94.28%	0.28%	95.96%

Tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks for the 3rd rolling window (Jan 2017–Dec 2017). The best combinations for each index have been highlighted

			k = 10		k = 25		k = 50
	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	0.95%	22.26%	0.41%	74.26%	0.60%	60.34%
		OLS	0.36%	71.44%	0.34%	87.56%	0.30%	86.64%
		Equals	0.86%	51.52%	0.47%	84.15%	0.35%	86.68%
	Lasso	NNS	0.38%	74.67%	0.32%	81.13%	0.35%	80.65%
		OLS	0.46%	69.40%	0.38%	74.41%	0.29%	86.19%
		Equals	0.53%	65.00%	0.39%	74.52%	0.30%	83.08%
	Largest	NNS	0.51%	87.13%	0.35%	91.65%	0.24%	91.18%
		OLS	0.53%	73.99%	0.38%	90.04%	0.24%	91.39%
		Equals	0.57%	85.23%	0.41%	87.24%	0.27%	87.37%
	Random Forest	NNS	0.79%	37.14%	0.44%	58.52%	0.29%	81.62%
		OLS	0.39%	73.05%	0.36%	78.77%	0.25%	87.72%
		Equals	0.82%	42.06%	0.42%	58.72%	0.35%	72.03%
	XGBoost	NNS	0.31%	81.03%	0.34%	82.24%	0.27%	86.90%
		OLS	0.29%	84.35%	0.26%	87.75%	0.20%	92.59%
		Equals	0.59%	61.40%	0.37%	79.58%	0.24%	89.40%
Russell 1000	Autoencoders	NNS	0.32%	93.00%	0.19%	96.95%	0.14%	98.34%
		OLS	0.33%	91.56%	0.22%	95.71%	0.20%	96.37%
		Equals	0.38%	88.31%	0.30%	94.20%	0.24%	94.44%
	Lasso	NNS	0.52%	76.78%	0.31%	91.20%	0.21%	95.97%
		OLS	0.41%	84.87%	0.32%	90.15%	0.20%	96.68%
		Equals	0.54%	75.64%	0.35%	90.19%	0.23%	95.25%
	Largest	NNS	0.57%	77.40%	0.43%	83.91%	0.31%	91.32%
		OLS	0.57%	76.77%	0.41%	84.78%	0.29%	92.19%
		Equals	0.69%	71.21%	0.52%	77.74%	0.39%	86.30%
	Random Forest	NNS	0.43%	85.19%	0.34%	91.41%	0.28%	92.81%
		OLS	0.38%	87.51%	0.32%	92.23%	0.18%	96.94%
		Equals	0.49%	81.77%	0.27%	93.68%	0.24%	95.12%
	XGBoost	NNS	0.41%	85.69%	0.27%	93.50%	0.25%	94.75%
		OLS	0.51%	78.75%	0.33%	91.57%	0.26%	94.83%
		Equals	0.55%	76.04%	0.39%	85.60%	0.22%	95.72%
Nikkei 225	Autoencoders	NNS	1.08%	21.59%	0.34%	76.41%	0.57%	58.92%
		OLS	0.78%	50.91%	0.57%	64.27%	0.46%	70.76%
		Equals	0.60%	60.81%	0.44%	72.01%	0.36%	78.48%
	Lasso	NNS	0.65%	41.45%	0.30%	84.41%	0.26%	86.25%
		OLS	0.69%	45.30%	0.48%	72.11%	0.28%	84.77%
		Equals	0.64%	66.27%	0.45%	83.72%	0.34%	82.33%
	Largest	NNS	0.54%	72.41%	0.38%	84.32%	0.34%	88.01%
		OLS	0.65%	65.82%	0.51%	73.63%	0.33%	88.22%
		Equals	0.64%	67.80%	0.45%	82.03%	0.35%	88.14%
	Random Forest	NNS	0.67%	53.00%	0.60%	49.89%	0.39%	72.13%
		OLS	0.42%	73.97%	0.42%	75.28%	0.37%	74.33%
		Equals	0.35%	79.81%	0.32%	83.84%	0.22%	89.61%
	XGBoost	NNS	0.67%	35.39%	0.34%	73.55%	0.25%	86.36%
		OLS	0.39%	68.39%	0.33%	79.56%	0.29%	81.48%
		Equals	0.36%	75.45%	0.28%	84.26%	0.22%	90.13%

Tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks for the 4th rolling window (Jan 2018–Dec 2018). The best combinations for each index have been highlighted

			k = 10		k = 25		k = 50
	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	0.88%	53.23%	0.66%	61.19%	0.63%	76.71%
		OLS	1.01%	54.81%	0.64%	76.79%	0.41%	85.95%
		Equals	0.51%	75.57%	0.48%	78.61%	0.45%	81.49%
	Lasso	NNS	0.65%	65.22%	0.52%	77.23%	0.51%	79.87%
		OLS	0.58%	71.15%	0.48%	79.60%	0.45%	83.80%
		Equals	0.70%	70.28%	0.58%	76.80%	0.47%	82.33%
	Largest	NNS	0.48%	83.02%	0.49%	81.32%	0.42%	88.11%
		OLS	0.52%	78.25%	0.45%	85.29%	0.43%	88.77%
		Equals	0.55%	82.75%	0.53%	84.08%	0.48%	87.04%
	Random Forest	NNS	0.66%	71.03%	0.66%	71.07%	0.58%	67.29%
		OLS	0.62%	74.48%	0.51%	78.98%	0.51%	80.07%
		Equals	1.04%	57.82%	0.78%	67.99%	0.50%	76.12%
	XGBoost	NNS	0.53%	77.83%	0.45%	84.62%	0.44%	84.58%
		OLS	0.51%	79.64%	0.45%	85.14%	0.42%	87.76%
		Equals	0.79%	70.52%	0.60%	77.14%	0.48%	82.47%
Russell 1000	Autoencoders	NNS	0.51%	91.71%	0.48%	92.77%	0.31%	96.50%
		OLS	0.44%	93.86%	0.33%	96.03%	0.27%	97.39%
		Equals	0.43%	93.99%	0.28%	97.28%	0.21%	98.42%
	Lasso	NNS	0.61%	85.82%	0.48%	91.38%	0.29%	97.00%
		OLS	0.67%	84.53%	0.50%	90.95%	0.29%	96.94%
		Equals	0.51%	89.97%	0.53%	89.49%	0.27%	97.28%
	Largest	NNS	0.58%	88.14%	0.55%	89.25%	0.39%	94.45%
		OLS	0.70%	83.71%	0.49%	84.64%	0.49%	91.61%
		Equals	0.52%	90.28%	0.73%	91.01%	0.43%	93.13%
	Random Forest	NNS	0.67%	84.94%	0.39%	94.79%	0.35%	95.75%
		OLS	0.61%	87.63%	0.36%	95.20%	0.35%	95.66%
		Equals	0.51%	92.79%	0.37%	95.08%	0.35%	95.97%
	XGBoost	NNS	0.67%	83.88%	0.41%	93.76%	0.33%	96.05%
		OLS	0.55%	90.45%	0.44%	93.24%	0.34%	95.85%
		Equals	0.52%	91.18%	0.36%	95.32%	0.30%	96.85%
Nikkei 225	Autoencoders	NNS	0.63%	84.39%	0.62%	75.26%	0.47%	89.87%
		OLS	0.88%	68.62%	0.50%	91.29%	0.54%	91.08%
		Equals	0.73%	82.20%	0.49%	88.63%	0.40%	92.68%
	Lasso	NNS	0.82%	44.61%	0.44%	92.98%	0.29%	96.70%
		OLS	0.80%	84.64%	0.47%	93.71%	0.35%	95.60%
		Equals	0.80%	87.73%	0.55%	92.85%	0.51%	94.71%
	Largest	NNS	0.75%	80.49%	0.53%	95.50%	0.39%	96.70%
		OLS	0.81%	84.02%	0.58%	95.14%	0.41%	95.92%
		Equals	0.76%	89.38%	0.56%	95.72%	0.48%	95.47%
	Random Forest	NNS	0.77%	83.12%	0.74%	80.43%	0.47%	91.34%
		OLS	0.61%	86.44%	0.55%	84.45%	0.35%	95.75%
		Equals	0.56%	91.01%	0.66%	90.43%	0.37%	94.97%
	XGBoost	NNS	1.17%	66.24%	0.59%	85.62%	0.35%	94.39%
		OLS	0.53%	89.94%	0.66%	83.38%	0.41%	92.93%
		Equals	0.66%	89.42%	0.54%	92.70%	0.35%	95.49%

Tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks for the 5th rolling window (Jan 2019–Dec 2019). The best combinations for each index have been highlighted

			k = 10		k = 25		k = 50
	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	0.86%	66.04%	0.52%	75.04%	0.97%	52.39%
		OLS	0.74%	64.57%	0.72%	76.66%	0.42%	85.55%
		Equals	0.56%	76.32%	0.49%	83.61%	0.37%	85.45%
	Lasso	NNS	0.56%	67.80%	0.49%	80.86%	0.45%	81.45%
		OLS	0.51%	78.99%	0.45%	80.87%	0.43%	85.22%
		Equals	0.61%	68.89%	0.45%	80.12%	0.41%	83.44%
	Largest	NNS	0.53%	78.17%	0.40%	90.59%	0.35%	88.89%
		OLS	0.60%	74.72%	0.43%	88.91%	0.41%	90.14%
		Equals	0.60%	82.95%	0.41%	90.06%	0.40%	88.75%
	Random Forest	NNS	0.55%	82.39%	0.53%	75.56%	0.52%	73.70%
		OLS	0.36%	88.66%	0.36%	88.91%	0.37%	87.89%
		Equals	1.02%	65.90%	0.64%	73.07%	0.46%	76.55%
	XGBoost	NNS	0.48%	82.87%	0.51%	79.82%	0.42%	84.49%
		OLS	0.37%	87.46%	0.37%	86.86%	0.32%	89.90%
		Equals	0.84%	72.54%	0.60%	79.88%	0.45%	85.52%
Russell 1000	Autoencoders	NNS	0.54%	81.33%	0.44%	89.30%	0.30%	94.43%
		OLS	0.37%	91.60%	0.31%	93.92%	0.17%	98.17%
		Equals	0.42%	89.79%	0.27%	95.13%	0.18%	97.81%
	Lasso	NNS	0.63%	78.36%	0.41%	88.97%	0.28%	94.65%
		OLS	0.64%	81.35%	0.43%	88.18%	0.30%	94.24%
		Equals	0.53%	83.71%	0.49%	83.39%	0.29%	94.29%
	Largest	NNS	0.69%	79.19%	0.50%	85.07%	0.42%	88.60%
		OLS	0.90%	74.74%	0.62%	80.03%	0.45%	87.43%
		Equals	0.63%	77.05%	0.53%	81.65%	0.39%	89.79%
	Random Forest	NNS	0.52%	81.86%	0.38%	90.84%	0.40%	89.51%
		OLS	0.55%	82.79%	0.47%	87.19%	0.42%	91.48%
		Equals	0.46%	89.43%	0.42%	91.32%	0.45%	92.01%
	XGBoost	NNS	0.52%	84.13%	0.38%	90.86%	0.26%	95.53%
		OLS	0.49%	87.38%	0.31%	93.78%	0.25%	96.23%
		Equals	0.61%	87.98%	0.43%	88.56%	0.27%	95.17%
Nikkei 225	Autoencoders	NNS	0.71%	66.37%	0.55%	75.97%	0.61%	65.01%
		OLS	0.92%	62.30%	0.45%	88.70%	0.36%	83.12%
		Equals	0.75%	68.87%	0.48%	81.27%	0.47%	88.94%
	Lasso	NNS	0.86%	71.60%	0.37%	92.98%	0.26%	95.26%
		OLS	0.77%	79.21%	0.41%	92.90%	0.31%	94.99%
		Equals	0.73%	81.49%	0.48%	90.22%	0.40%	91.69%
	Largest	NNS	0.54%	86.46%	0.44%	90.61%	0.34%	95.12%
		OLS	0.69%	74.63%	0.42%	92.14%	0.36%	95.07%
		Equals	0.64%	80.85%	0.48%	89.22%	0.38%	94.57%
	Random Forest	NNS	0.74%	75.04%	0.48%	86.98%	0.48%	84.02%
		OLS	0.49%	90.13%	0.49%	86.67%	0.36%	92.17%
		Equals	0.48%	92.58%	0.42%	93.70%	0.36%	92.55%
	XGBoost	NNS	0.52%	80.76%	0.33%	93.00%	0.31%	92.93%
		OLS	0.46%	89.08%	0.32%	94.13%	0.26%	96.24%
		Equals	0.59%	87.79%	0.46%	91.56%	0.30%	95.82%

Tracking error expressed in both tracking error volatility (TEV) and correlation for each index, selection strategy, allocation scheme and number of stocks for the 6th rolling window (Jan 2020–Dec 2020). The best combinations for each index have been highlighted

			k = 10		k = 25		k = 50
	Selection	Allocation	TEV	Correlation	TEV	Correlation	TEV	Correlation
FTSE 350	Autoencoders	NNS	1.14%	81.01%	0.75%	91.07%	0.70%	75.01%
		OLS	1.30%	82.13%	0.65%	93.33%	0.73%	91.73%
		Equals	0.97%	88.98%	0.69%	93.87%	0.64%	94.24%
	Lasso	NNS	0.84%	89.07%	0.74%	91.54%	0.67%	92.94%
		OLS	0.92%	88.28%	0.73%	79.24%	0.65%	94.29%
		Equals	1.07%	87.65%	0.70%	92.17%	0.64%	93.67%
	Largest	NNS	1.43%	72.86%	0.78%	95.11%	0.66%	95.37%
		OLS	1.34%	81.08%	0.78%	91.10%	0.65%	93.43%
		Equals	1.45%	73.38%	0.80%	93.84%	0.71%	94.60%
	Random Forest	NNS	0.91%	88.67%	0.90%	89.35%	0.79%	90.64%
		OLS	0.66%	93.92%	0.82%	90.80%	0.66%	93.30%
		Equals	1.09%	90.17%	0.93%	90.55%	0.83%	91.30%
	XGBoost	NNS	0.98%	87.08%	0.98%	91.65%	0.65%	94.82%
		OLS	0.70%	93.60%	0.68%	94.44%	0.56%	95.60%
		Equals	1.60%	85.18%	0.89%	94.35%	0.65%	95.87%
Russell 1000	Autoencoders	NNS	0.65%	92.33%	0.38%	97.22%	0.52%	94.98%
		OLS	0.64%	92.24%	0.38%	97.21%	0.29%	98.39%
		Equals	0.52%	95.43%	0.35%	97.69%	0.26%	98.77%
	Lasso	NNS	0.82%	87.95%	0.68%	91.12%	0.40%	96.96%
		OLS	0.87%	86.69%	0.74%	89.08%	0.48%	95.66%
		Equals	0.80%	90.61%	0.76%	88.68%	0.39%	97.15%
	Largest	NNS	1.00%	81.81%	0.78%	88.26%	0.56%	94.75%
		OLS	1.31%	71.04%	0.93%	87.15%	0.73%	94.37%
		Equals	1.02%	81.77%	0.78%	88.25%	0.51%	95.07%
	Random Forest	NNS	0.76%	88.64%	0.64%	92.07%	0.66%	92.12%
		OLS	0.88%	84.57%	0.78%	88.05%	0.63%	92.56%
		Equals	0.65%	91.91%	0.50%	95.81%	0.41%	97.25%
	XGBoost	NNS	0.88%	86.23%	0.60%	93.03%	0.40%	97.01%
		OLS	0.66%	92.44%	0.61%	92.90%	0.45%	96.14%
		Equals	0.67%	93.41%	0.49%	95.57%	0.39%	97.22%
Nikkei 225	Autoencoders	NNS	0.80%	94.45%	0.57%	96.55%	0.90%	93.28%
		OLS	1.02%	88.98%	0.69%	95.15%	0.71%	96.02%
		Equals	0.79%	93.23%	0.62%	95.85%	0.52%	97.09%
	Lasso	NNS	0.82%	93.40%	0.49%	97.58%	0.53%	97.62%
		OLS	0.93%	91.45%	0.51%	97.49%	0.43%	98.35%
		Equals	0.87%	93.71%	0.49%	97.51%	0.64%	95.54%
	Largest	NNS	0.95%	96.43%	0.69%	97.83%	0.65%	96.77%
		OLS	0.93%	96.92%	0.71%	97.92%	0.61%	94.35%
		Equals	0.97%	95.24%	0.71%	97.52%	0.68%	95.46%
	Random Forest	NNS	0.80%	91.82%	0.75%	95.73%	0.58%	97.05%
		OLS	1.04%	91.08%	0.75%	96.38%	0.45%	97.96%
		Equals	1.83%	72.99%	0.74%	97.15%	1.31%	88.33%
	XGBoost	NNS	1.03%	77.31%	0.87%	94.71%	0.77%	87.03%
		OLS	1.05%	90.36%	0.86%	95.13%	0.50%	97.99%
		Equals	0.82%	94.44%	0.71%	96.38%	0.45%	97.87%
