Abstract
This paper proposes predicting the bitcoin return direction with logistic regression, discriminant analysis and machine learning classification techniques. It extends the prediction of the bitcoin return direction by using exogenous macroeconomic and financial variables that have been investigated as drivers of bitcoin return. We also use Google Trends as a proxy for investor interest in bitcoin. We consider those variables as predictors of the bitcoin return direction. We conduct in-sample and out-of-sample empirical analyses and achieve a misclassification error of around 4% in-sample and around 41% out-of-sample. Tree-based ensemble learning outperforms the other methods in both the in-sample and out-of-sample analyses.
Introduction
Since the birth of the first cryptocurrency, Bitcoin (Nakamoto, 2008), cryptocurrencies have kept growing in number and in use. They attract investors, government authorities and academic researchers. This is particularly true of bitcoin, the most popular cryptocurrency, which dominates cryptocurrency markets. Modelling it has become another challenge for quantitative analysts.
Analysis of bitcoin price, return and volatility dynamics continues and seeks to incorporate new information content or more bitcoin-specific features. In modelling bitcoin price volatility, researchers try to capture various stylized facts; see Dyhrberg (2016), Katsiampa (2017), Conrad et al. (2018) and Bouri et al. (2019), among others. Bitcoin price and return modelling has been studied by Polasik et al. (2015), Ciaian et al. (2016) and Jang and Lee (2018). These studies look for models, from linear analysis to machine learning methodologies, that can incorporate more accurate information.
Investors may also be interested in changes in the price direction of cryptocurrencies. Researchers have used machine learning models and algorithms to address this issue. Madan et al. (2015) predict the sign of the future change in the bitcoin price using machine learning algorithms. Similar work has been conducted for Ethereum by Chen et al. (2015). We continue research in this direction.
In this paper, we are interested in predicting the bitcoin return direction by considering, as new information, variables identified in the literature as drivers of bitcoin return. Several studies on bitcoin return modelling have found significant effects of the following macroeconomic and financial variables: oil prices, American financial market indices and exchange rates; see Kristoufek (2015) and Ciaian et al. (2016). Apart from macroeconomic and financial variables, Google Trends data from web searches for the word "bitcoin" and bitcoin-specific explanatory variables such as bitcoin trading volume have also been found significant; see Kristoufek (2013) and Balcilar et al. (2017). We retain those variables as predictors of the bitcoin return direction. We use classification techniques such as logistic regression, discriminant analysis and machine learning.
The paper is organized as follows. Section 2 presents the methodology, the next section reports the empirical analysis and the last section offers concluding remarks.
Logistic, discriminant analysis and machine learning classifications
Investors may be interested in the return direction of a financial product at a given time, which can be up or down. This is a binary classification problem.
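As an illustration of how the binary target can be built, the following minimal sketch derives up/down labels from a price series. It assumes log returns and treats a zero return as "down"; both choices, the function names and the prices are hypothetical and are not taken from our data.

```python
import math

def log_returns(prices):
    """Log returns r_t = ln(P_t / P_{t-1}) for a list of positive prices."""
    return [math.log(curr / prev) for prev, curr in zip(prices, prices[1:])]

def direction_labels(returns):
    """Map each return to 1 (up) if strictly positive, else 0 (down).

    Treating a zero return as 'down' is an assumption of this sketch.
    """
    return [1 if r > 0 else 0 for r in returns]

# Hypothetical prices, for illustration only.
prices = [100.0, 105.0, 103.0, 103.0, 110.0]
returns = log_returns(prices)
labels = direction_labels(returns)
print(labels)
```

Each label is aligned with the second price of the pair it was computed from, so a series of n prices yields n - 1 labels.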
Logistic regression for classification
The previous conditional probability
where
Given a training set
The corresponding system of first-order conditions does not have a solution. We therefore use a regularized estimation procedure, known as the elastic net algorithm, with the following penalization:
where
We assign an observation to the up class if
We will next summarize the discriminant analysis principle.
In our presentation of discriminant analysis, we consider
where
The Bayes classifier rule assigns an observation to the class with the lowest error rate. In other words
Given a training set
For advanced studies on discriminant analysis, we refer readers to Friedman (1988), Baudat and Anouar (2000) and Wu et al. (2016).
We next present the first machine learning classification method that we will consider for modeling bitcoin return direction.
Artificial neural networks are a popular machine learning approach for classification tasks. They can be used for estimating the conditional probability
where
Some preferred classes of sigmoid activation functions are the logistic
Given a training set
We use resilient backpropagation learning, which is faster than the basic backpropagation algorithm (Riedmiller and Braun, 1993).
We next provide an overview of the second machine learning classification method.
Gradient tree boosting has been successfully used in tree-based ensemble learning for classification to improve weak classifiers. This method predicts the output by summing several binary trees (Friedman, 2001). We consider a recent successful extension of this ensemble learning method, known as extreme gradient boosting, developed by Chen and Guestrin (2016). Binary classification with Xgboost can be viewed as follows.
Given a training set
where
Among the features of the Xgboost learning algorithm that help prevent overfitting are shrinkage and the approximate greedy algorithm. Shrinkage controls the learning rate by scaling the contribution of each tree. A lower value for shrinkage implies a larger value for
We now proceed to model the bitcoin return direction with these classification methods.
In this section, we describe our data with a basic statistical analysis and conduct an empirical investigation to help investors find the best prediction of the bitcoin return direction.
Data and preliminary analysis
From the literature, Oil prices,
Looking at the evolution of the bitcoin price in Fig. 1, we note that from 2014 to early 2017 the bitcoin price followed an increasing exponential trend. After this period and until the beginning of 2018, there was remarkable variation with a decreasing trend. After that, bitcoin started to regain value. Overall, the bitcoin price varies strongly over our period of study.
Further information on our data, in the form of basic statistics for these variables, can be found in Table 1.
Summary statistics
Evolution of bitcoin price in USD.
All statistical measures in Table 1 show high variability in all our variables. Bitcoin in particular appears to vary very strongly. We continue with a correlation analysis, since we are interested in possible relations between the bitcoin price and the other variables.
We compute the correlation coefficients between bitcoin and the explanatory variables.
Correlation between bitcoin and explanatory variables
Note: The last two rows represent the 95% CI lower and upper limits.
All explanatory variables are positively correlated with the bitcoin price. Although Google Trends has the lowest correlation with the bitcoin price, the variability of its correlation is oriented more towards positive values, which is in line with the use of Google Trends in the literature as a proxy for investor interest in bitcoin. All other variables register higher correlations with bitcoin, in particular bitcoin volume and
Next, we will model the bitcoin return direction with logistic, discriminant analysis and machine learning classification techniques.
We split our data into two parts: the first, from 2014-10-01 to 2019-03-30, is used for in-sample analysis, and the rest, from 2019-04-01 to 2019-09-30, for out-of-sample analysis.
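The chronological split can be sketched as follows. This is a minimal illustration using the two date windows stated above; the tuple layout, the function name and the sample observations are hypothetical.

```python
from datetime import date, timedelta

# Date windows as stated in the text (both ends inclusive).
IN_SAMPLE = (date(2014, 10, 1), date(2019, 3, 30))
OUT_OF_SAMPLE = (date(2019, 4, 1), date(2019, 9, 30))

def split_by_date(observations):
    """Split (date, label) tuples into in-sample and out-of-sample lists.

    No shuffling is done, so the out-of-sample part lies strictly
    after the in-sample part in time.
    """
    in_sample = [o for o in observations if IN_SAMPLE[0] <= o[0] <= IN_SAMPLE[1]]
    out_of_sample = [o for o in observations if OUT_OF_SAMPLE[0] <= o[0] <= OUT_OF_SAMPLE[1]]
    return in_sample, out_of_sample

# Hypothetical daily observations around the boundary, for illustration only.
start = date(2019, 3, 28)
observations = [(start + timedelta(days=i), i % 2) for i in range(6)]
in_sample, out_of_sample = split_by_date(observations)
print(len(in_sample), len(out_of_sample))
```

With the windows as stated, an observation dated 2019-03-31 would fall in neither part.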
We build the logistic, discriminant analysis and machine learning classification models using the first sample. We incorporate lagged values of the explanatory variables in the models and try to select the optimal lag
As stated in the methodology section, to estimate the parameter vector of the logistic classification we use the elastic net regularized estimation procedure. We retain the following three models from logistic classification techniques: the Ridge regression with
For artificial neural networks, we use the resilient backpropagation learning algorithm of Fritsch and Gunther (2008), considering a maximum of 10 hidden neurons with both tanh and logistic activation functions. We obtain an ANN with 7 hidden neurons and tanh as the activation function.
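To make the selected architecture concrete, the sketch below runs a forward pass through a single hidden layer of 7 tanh neurons. It is an illustration only: the number of inputs, the random weights and the logistic output unit are assumptions of this sketch, not the fitted network, and no training is performed.

```python
import math
import random

def logistic(z):
    """Logistic sigmoid, mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer with tanh units, then a logistic output unit."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return logistic(sum(v * h for v, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical sizes and random weights, for illustration only.
rng = random.Random(0)
n_inputs, n_hidden = 4, 7
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_bias = rng.uniform(-1, 1)

p_up = forward([0.1, -0.3, 0.5, 0.0], hidden_weights, hidden_biases,
               output_weights, output_bias)
print(p_up)
```

The output can be read as an estimated probability of the up class for the given input vector.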
For extreme gradient boosting, we use the xgboost learning algorithm in Chen and Guestrin (2016) and get the total number of trees
We start our analysis by computing the confusion matrices. Results are reported in Table 3 for logistic regression, in Table 4 for discriminant analysis and in Table 5 for machine learning. In all confusion matrices in this paper, the true classes are in rows and the predicted classes in columns, where U represents up and D represents down.
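The layout convention just described (true classes in rows, predicted classes in columns) can be sketched as follows. The function name and the example labels are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes=("U", "D")):
    """Count matrix with true classes in rows and predicted classes in columns."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical labels, for illustration only.
true_labels = ["U", "U", "D", "D", "U", "D"]
predicted_labels = ["U", "D", "D", "U", "U", "D"]
matrix = confusion_matrix(true_labels, predicted_labels)
print(matrix)  # [[2, 1], [1, 2]]
```

The diagonal entries count correctly classified observations; the off-diagonal entries count the two kinds of error.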
In sample elastic net confusion matrices
In sample LDA and QDA confusion matrices
In sample ANN and Xgboost confusion matrices
The confusion matrices for the three logistic methods (Ridge, LASSO and best combination) and linear discriminant analysis show that the numbers of false negatives, false positives and correctly classified observations vary by about one from one model to another. The differences come from QDA, ANN and Xgboost, which have higher rates of correct classification for both classes (up and down) than the former models.
For a more advanced and compact assessment of the quality of these classification techniques, we use the misclassification error and the AUC. We report in Table 6 the misclassification error and the AUC associated with Ridge, LASSO and the best combination for logistic classification.
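Both measures can be computed from labels and predicted scores, as in the minimal sketch below. The AUC is computed here as the share of (up, down) pairs in which the up observation receives the higher score, with ties counted as one half; the 0.5 cut-off used to turn scores into classes, the function names and the data are assumptions of this sketch.

```python
def misclassification_error(true_labels, predicted_labels):
    """Share of observations whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)

def auc(true_labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(true_labels, scores) if t == 1]
    neg = [s for t, s in zip(true_labels, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = up, 0 = down) and scores, for illustration only.
true_labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.6]
predicted_labels = [1 if s > 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

error = misclassification_error(true_labels, predicted_labels)
area = auc(true_labels, scores)
print(round(error, 4), round(area, 4))
```

An AUC of 0.5 corresponds to random ranking, which is why values close to it are read as a sign of weak classifiers.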
AUC and Classification error for elastic net
We remark that the values in Table 6 are very close, as in the confusion matrices. On both measures, misclassification error and AUC, all three models perform similarly in modelling the bitcoin return direction. Ridge regression performs slightly better than the other two models on the AUC criterion. According to some of the literature, these AUC values indicate classification failure; according to others, they are a sign of weak classifiers.
For discriminant analysis, we consider the LDA and QDA classification techniques for modeling bitcoin return direction. We report in Table 7 the associated misclassification error.
Classification error for LDA and QDA
The LDA has a misclassification error of 44%, like the previous three models, in line with a remark in James et al. (2013) about the close connection between LDA and logistic regression. By contrast, the misclassification error of around 32% for QDA is much smaller than that of LDA. As stated in James et al. (2013), QDA is a more general method with non-linear behaviour and can perform better than LDA.
For the two machine learning methods, ANN and Xgboost, the corresponding misclassification errors are reported in Table 8.
Classification error for ANN and Xgboost
We obtain misclassification errors of around 12% for ANN and 3.5% for Xgboost. These errors are far lower than those of the three logistic models and the two discriminant analyses. In addition, of the two machine learning techniques, Xgboost has the smaller error, which highlights the frequently stated quality of ensemble learning.
In the in-sample analysis of the bitcoin return direction with logistic, discriminant analysis and machine learning classification techniques, we achieve the lowest misclassification error with extreme gradient boosting.
We continue our analysis with the held-out data for out-of-sample empirical evaluation.
We recall that we have held out the last six months of our sample for out-of-sample analysis. The models used in this out-of-sample evaluation are those from the in-sample analysis. In other words, we do not re-estimate the models, and the out-of-sample data are never used during model building. As in the in-sample analysis, we start by computing the confusion matrices, which are given in Table 9 for logistic regression, in Table 10 for discriminant analysis and in Table 11 for machine learning.
Out of sample elastic net confusion matrices
Out of sample LDA and QDA confusion matrices
Out of sample ANN and Xgboost confusion matrices
The entries of the confusion matrices for the three logistic methods (Ridge, LASSO and best combination) and linear discriminant analysis show that the numbers of false negatives, false positives and correctly classified observations vary by about one or two within and between models. ANN and Xgboost do not share any notable common structure with the other methods.
We next report the percentage correctly classified in Table 12.
Correctly classified for out-of-sample (in %)
All our models deliver correct classification rates of at least 50%. Machine learning techniques still deliver better results, in the sense of higher correct classification rates. The ensemble learning method Xgboost, with 59.16% correctly classified, outperforms all other classifiers. Thus Xgboost, the best model in the in-sample analysis, keeps its strength in the out-of-sample analysis.
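For reference, the correct classification rate and the misclassification error are complements, so the best out-of-sample rate of 59.16% corresponds to the out-of-sample error of around 41% quoted in the abstract and in the conclusion:

```latex
\text{misclassification error} = 100\% - \text{percent correctly classified}
                               = 100\% - 59.16\% = 40.84\% \approx 41\%.
```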
We have proposed an empirical analysis of the bitcoin return direction using logistic, discriminant analysis and machine learning classification techniques, which extends the modelling of the bitcoin return direction with macroeconomic and financial variables. Those variables have been chosen from among the drivers of bitcoin return dynamics.
Focusing on in-sample and out-of-sample evaluation of the proposed models with our exogenous variables, we achieve a misclassification error of around 4% in-sample and 41% out-of-sample.
In the literature, logistic regression has been considered as a benchmark for modelling the sign of bitcoin return. When such modelling is extended with Google Trends and macroeconomic and financial variables, machine learning can largely outperform logistic regression in the in-sample analysis.
The best in-sample performance comes from the ensemble learning method Xgboost. The out-of-sample empirical analysis does not contradict this in-sample performance of Xgboost. This classifier improves on the results of weak classifiers. Apart from the success of this ensemble learning method in various fields and challenges, such as the machine learning competition site Kaggle or the Knowledge Discovery and Data Mining Cup (KDDCup), it is computationally much faster than existing popular solutions (Chen and Guestrin, 2016). Our results extend the strength of Xgboost to cryptocurrency prediction.
Much has been investigated on bitcoin price and volatility dynamics; in contrast, bitcoin return direction modelling seems less developed. Selecting endogenous and exogenous variables as drivers of the bitcoin return direction would be fundamental. Such variable selection is a possible avenue for future work on bitcoin return direction modelling.
Acknowledgments
The author thanks the two anonymous referees and the Co-Editor-in-Chief Dr. Stan Lipovetsky.
