Some statistical and CI models to predict chaotic high-frequency financial data

Abstract

To forecast time series data, two methodological frameworks are considered: statistical modelling and computational intelligence modelling. The statistical approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models estimated by the Maximum Likelihood (ML) method. As a competitor to the statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train the NN, the Back-Propagation (BP) algorithm and heuristics such as the genetic and micro-genetic algorithms (GA and MGA) are applied to a large data set. A comparative analysis of the selected learning methods is performed and evaluated. From the experiments we find that the optimal population size is likely to be 20, which gives the lowest training time of all NNs trained by the evolutionary algorithms, while the prediction accuracy is lower but still acceptable to managers.

Keywords

ARIMA models, neural networks, learning algorithms, time series forecasting

1 Introduction

Forecasting financial time series is a complex problem that has benefited from recent advances in machine learning research. In economics, and in particular in the field of financial markets, forecasting is an essential instrument for day-to-day operation in the economic environment. It is generally known that high-frequency financial data behave unpredictably: they are stochastic, non-linear and chaotic. Deterministic nonlinear dynamics and chaos theory are widely applied to the analysis of stochastic time series in contemporary macroeconomics and finance. A broad pioneering volume on the complexity of the economy was edited by Anderson [1].

Time series models are based on the analysis of a chronological sequence of observations on a particular variable. The main purpose of time series analysis is to understand the underlying mechanism that generates the data and, in turn, to fit the observed data and apply the models for forecasting.

Typically, in conventional time series, we assume that the generating mechanism is probabilistic and that the observed series is a realization of a stochastic process. This process is assumed to be stationary and is described by a class of models called autoregressive moving average (ARMA) models.

Over the past ten years, computer scientists have developed new soft computing techniques based on the latest information technologies, such as classic and soft (fuzzy logic) neural networks and granular computing, which evolved in the process of understanding the remarkable learning and adaptive features of neuronal mechanisms inherent in certain biological species [2–9].

An artificial neural network (ANN) can be understood as a system which produces output based on inputs the user has defined. It is important to say that the user has no knowledge of the internal workings of the ANN. Examples are presented to the network, and the network then tries to get as close as possible to the given output by adapting its parameters (weights). An ANN model has a large number of internal variables which must be set well in order to optimize the outputs. This approach is based almost exclusively on finding a non-linear function between the inputs and the output of the system. Neural networks have shown their efficiency in various identification, prediction, and diagnostic cases [10–12].

In ANN applications, in addition to the classical gradient learning method, methods based on the principle of Darwinian evolution are increasingly used. The concept of the GA was first presented by John Holland [13]. Evolutionary computing is a set of metaheuristics inspired by biological evolution and based on natural selection and genetic inheritance. It is mainly applied for optimization purposes in the modelling of non-linear dynamic systems [14] and in real-life applications where data are subject to uncertainty due to their random nature, measurement errors or other reasons [15, 16]. A GA works as an iterative procedure with a population of individuals, each representing a solution of the given optimization problem. The quality measure of an individual is its fitness, which corresponds to the value of the objective function. Individuals with better fitness have a better chance of surviving and reproducing. In reproduction, new individuals (offspring) inherit some features from their parents and form a new generation. By repeating this process, the average fitness of the population or the best fitness improves from generation to generation. Unlike the classic GA, where the population is composed of a large number of individuals, the MGA uses a population with a small number of individuals. Of course, a small population quickly loses its diversity and can easily get stuck at a local optimum. When convergence occurs, the population is restarted: a new population is created by keeping the best individual from the current population (elitism), while the other individuals are randomly generated. Elitism guarantees that the best individual in the next generation will not be worse than the best individual of the previous generation.

The goal of this paper is to illustrate that two distinct approaches, i.e. statistical models and computational networks, may be used for modelling high-frequency financial time series. The main aim of the study is to compare the training speed and the predictive accuracy of statistical learning with those of neural networks on a large data set.

In Section 2 a special dynamic process with random values as its observations is characterized. Conventional time series modelling is introduced in Section 3. In Section 4 the development and estimation of the statistical models are discussed. Section 5 introduces the NN architectures and the learning procedures used for them. Section 6 describes the program implementation of the NN and of both the GA and MGA learning algorithms. Section 7 assesses the prediction results of all learning approaches and verifies their applicability. Concluding remarks are given in Section 8.

2 Conventional time series models

To build a time series model, a sample of observations is usually collected from the available data. A time series consists of a set of observations {y₁, y₂, . . . , y_t, . . .} of some phenomenon, taken at equally spaced time intervals. As mentioned above, we assume that the observed series {y₁, y₂, . . . , y_t, . . .} is a realization of a stochastic process {Y₁, Y₂, . . . , Y_t, . . .}. This process is assumed to be stationary and is described by a class of linear models called autoregressive moving average (ARMA) models. Box and Jenkins (B-J) give a thorough treatment of these models [17]. We also assume that y_t is a real number for each t ∈ T, where T = {1, 2, . . . , n} is an index set. The subscript t can be referred to as time, so y_t is the observed value of the time series at time t. The total number of observations in a time series (here n) is called the length of the time series or the length of the data. In the following, we refer to a realization of a stochastic process by the notation y_t for a value at time t, and {y_t} for the full set of values corresponding to the index set T. We also restrict our attention to discrete stochastic processes, for which the index set is a discrete set, in which case we generally use the notation y_t, although this notation may also apply to continuous processes.

Once an appropriate model has been fitted, it can be used to generate forecasts for future time periods. Most forecasting methods commonly used in time series analysis generate forecasts of future observations that are optimal in the minimum mean square error sense (i.e. the best linear predictor).

Next, let Ŷ_n(τ) denote the forecast made at time n for τ steps ahead. We define it as $\hat{Y}_{n} (τ) = E (Y_{n + τ} | ψ_{n})$ (1) i.e. as the conditional expectation of Y_n+τ given ψ_n = {Y₁, Y₂, . . . , Y_n}, where E is the expectation operator and ψ_n represents the information set available at time n. Here we assume that we have data extending to the infinite past. Equation (1) can be used recursively to obtain the forecast values of Y_n+τ for τ = 1, 2, . . . once we know the right-hand side of (1).

In practice we have only a finite number of observations; nevertheless, Equation (1), the best linear predictor in the infinite-sample limit, enables us to develop a way of calculating the approximate best linear predictor when n is large.

3 Conventional time series modeling

Conventional time series modeling can be grouped into two types: time series methods and causal methods. As mentioned above, univariate time series models are based on the analysis of a chronological sequence of observations on a particular variable. Causal models assume that the variable to be modeled can be explained by the behavior of another variable, or a set of variables.

In practice there are many time series in which successive observations are dependent, i.e. there exists an observational relation $R = \{(y_{t}, y_{t - 1}), (y_{t - 1}, y_{t - 2}), . . .\} \subseteq Y_{t} \times Y_{t - 1}$ (2) where Y_t, Y_t-1 denote the variables and y_t, y_t-1, . . . denote the observed values of Y_t and Y_t-1 respectively. If the inclusion is strict, R ⊂ Y_t × Y_t-1, it is reasonable to say that the variables Y_t and Y_t-1 interact. To account for this interaction, the usual practice is to find an analytical expression that describes it.

The most often used model is, however, an explicit function $f : Y_{t - 1} \to Y_{t}$ (3) belonging to a pre-specified class of mappings. Very often the linear function (Markov process) $y_{t} = f (y_{t - 1}, ϕ_{1}, ɛ_{t}) = ϕ_{1} y_{t - 1} + ɛ_{t}$ (4) is used, where ɛ_t is a random error or noise component that is drawn from a stable probability distribution with zero mean and constant variance. Equation (4) is usually called an autoregressive process of the order p = 1 abbreviated AR(1) because the current observation y_t is “regressed” on previous realization y_t-1 of the same time series. Roughly speaking, to determine the model (4) means to find the coefficient (parameter) ϕ₁ such that function (3) satisfies some optimality criterion in fitting the observed data R.
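As a minimal sketch of what determining the model (4) can mean in practice, the following Python fragment simulates an AR(1) series and recovers ϕ₁ by ordinary least squares. The least-squares criterion, the Gaussian noise, the start value y₀ = 0 and the function names are assumptions made only for this illustration; the paper itself estimates its models by maximum likelihood in STATA.

```python
import random

def simulate_ar1(phi, n, sigma=1.0, seed=0):
    """Generate n values of y_t = phi * y_{t-1} + eps_t, starting from y_0 = 0 (assumed)."""
    rng = random.Random(seed)
    y, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, sigma)
        y.append(prev)
    return y

def estimate_phi(y):
    """Least-squares estimate of phi: sum(y_t * y_{t-1}) / sum(y_{t-1}^2)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den

series = simulate_ar1(phi=0.6, n=5000)
print(estimate_phi(series))  # should be close to the true value 0.6
```

With a few thousand observations the estimate is typically within a few hundredths of the true coefficient.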

The AR(1) process (4) is a special case of a stochastic process known as the mixed autoregressive-moving average model of order (p, q), abbreviated ARMA(p, q): $y_{t} = ϕ_{1} y_{t - 1} + ϕ_{2} y_{t - 2} + . . . + ϕ_{p} y_{t - p} - θ_{1} ɛ_{t - 1} - θ_{2} ɛ_{t - 2} - . . . - θ_{q} ɛ_{t - q} + ɛ_{t}$ (5) where {ϕ₁, ϕ₂, . . . , ϕ_p} and {θ₁, θ₂, . . . , θ_q} are the parameters of the autoregressive and moving average parts respectively, and ɛ_t is normally distributed white noise, i.e. $ɛ_{t} \sim N (0, σ_{ɛ}^{2})$. It is important that each invertible ARMA(p, q) process can be expressed as an AR(∞) process and hence approximated by an AR(p) model of sufficiently high order, of the form $y_{t} = ϕ_{1} y_{t - 1} + ϕ_{2} y_{t - 2} + . . . + ϕ_{p} y_{t - p} + ɛ_{t}$ (6)

All the above time series can be derived from a linear combination of independent white noise random variables {ɛ_t, ɛ_t-1, ɛ_t-2, . . .}, i.e. $y_{t} = μ + ψ_{0} ɛ_{t} + ψ_{1} ɛ_{t - 1} + . . . = μ + \sum_{j = 0}^{\infty} ψ_{j} ɛ_{t - j}$ (7) where ψ_j (j = 0, 1, 2, . . .) are usually called weights and μ is a constant that determines the level of the process. Equation (7) is usually called a linear filter. In view of the linear filter model, a time series model can be defined as a function that transforms a white noise process into a time series.

A general model capable of representing a wide class of non-stationary time series is the autoregressive integrated moving average process of order (p, d, q), ARIMA(p, d, q), where d is the order of differencing of the time series. Thus, the model represents the dth difference of the original series as a process containing p autoregressive and q moving average parameters. For example, with the sign convention of Equation (5), the ARIMA(1,1,1) model has the form: $y_{t} = (1 + ϕ_{1}) y_{t - 1} - ϕ_{1} y_{t - 2} + ɛ_{t} - θ_{1} ɛ_{t - 1}$ (8)
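The recursive use of Equation (1) with an AR(p) model of the form (6) can be sketched as follows, assuming the coefficients are already known. Future noise terms are replaced by their zero expectation, and each forecast is fed back as if it were an observation. The function name and the argument layout are hypothetical.

```python
def forecast_ar(history, phi, steps):
    """Recursive forecasts for y_t = phi[0]*y_{t-1} + ... + phi[p-1]*y_{t-p} + eps_t.

    history holds the observed values in time order (at least p of them);
    phi[0] multiplies the most recent value.
    """
    window = list(history[-len(phi):])
    forecasts = []
    for _ in range(steps):
        # expected value of the next observation; the future noise has mean zero
        nxt = sum(c * v for c, v in zip(phi, reversed(window)))
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # slide the window forward by one step
    return forecasts

print(forecast_ar([2.0], [0.5], 3))  # AR(1) forecasts decay geometrically: [1.0, 0.5, 0.25]
```

For a stationary AR(1) process the forecasts shrink towards the process mean (zero here) as the horizon τ grows.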
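To make the ARIMA(1,1,1) example concrete, the sketch below checks numerically that modelling the first difference w_t = y_t − y_t-1 as an ARMA(1,1) process, with the sign convention of Equation (5), is the same as the single recursion y_t = (1 + ϕ₁) y_t-1 − ϕ₁ y_t-2 + ɛ_t − θ₁ ɛ_t-1. The start-up values (a constant level y0 and w₀ = 0) and the function names are assumptions of this illustration.

```python
import random

def arima111_direct(phi, theta, eps, y0=0.0):
    """y_t = (1 + phi) y_{t-1} - phi y_{t-2} + eps_t - theta eps_{t-1}."""
    y = [y0, y0]  # two equal start-up values, so the initial difference is zero
    for t in range(1, len(eps)):
        y.append((1 + phi) * y[-1] - phi * y[-2] + eps[t] - theta * eps[t - 1])
    return y[2:]

def arima111_by_differencing(phi, theta, eps, y0=0.0):
    """ARMA(1,1) for the first difference w_t, cumulated back to the level y_t."""
    w_prev, level, y = 0.0, y0, []
    for t in range(1, len(eps)):
        w = phi * w_prev + eps[t] - theta * eps[t - 1]
        level += w
        y.append(level)
        w_prev = w
    return y

rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
a = arima111_direct(0.5, 0.3, noise, y0=10.0)
b = arima111_by_differencing(0.5, 0.3, noise, y0=10.0)
print(max(abs(u - v) for u, v in zip(a, b)))  # difference at rounding-error level
```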

4 An application of ARIMA models

To illustrate the Box-Jenkins approach, consider the time readings of the currency exchange rate between Czech Koruna (CZK) and Euro (EUR) (abbreviated as currency CZK/EUR) collected for the first week of December 2018. The data used for research discussed in this paper were published by GAIN Capital company and are freely available at http://ratedata.gaincapital.com/.

A preview of the time series data is shown on the left-hand side of Fig. 1. The data set starts on 2 December 2018 at 17:02:14 and ends on 7 December 2018 at 16:59:55. It contains 60586 values. After removal of duplicates and interpolation of the missing values, the time series has 7197 observations. The STATA software was used for this. From examining the left-hand side of Fig. 1, we note that the variability of the series decreases as its general level decreases (the currency CZK/EUR time series has a declining trend). This suggests that the logarithms of the currency data should be analyzed, rather than the raw series.

Fig. 1

The interpolated currency CZK/EUR series (on the left) and the first difference of the currency CZK/EUR series (on the right).

For successful use of the Box-Jenkins method for creating an ARIMA model, the data should be stationary. The time series on the left-hand side of Fig. 1 shows homogeneous non-stationary behavior in the mean: in any local segment of time the observations look like those in any other segment, apart from their average. However, its first difference y_t - y_t-1, shown on the right-hand side of Fig. 1, is stationary in the mean and slope.

To build a forecast model, the time series data were split into a training and a validation data set. The training data consist of the first 90% of the original series; the validation data set covers the period from the first observation after the end of the training sample to the most recent observation. The primary tool used in the identification process is the autocorrelation function (ACF), denoted ρ_k. The theoretical ACF $ρ_{k} = \frac{cov (y_{t + k}, y_{t})}{\sqrt{var (y_{t + k}) var (y_{t})}}$ (9) is unknown and must be estimated by the sample ACF, in which the covariance and the variances are replaced by their sample counterparts.
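A sample version of Equation (9) can be sketched as follows. The estimator shown divides every autocovariance by the series length n, which is one common convention; whether STATA uses exactly this normalisation is not stated in the paper, so it is an assumption here, as is the function name.

```python
def sample_acf(y, max_lag):
    """Sample autocorrelations r_0, ..., r_max_lag of the series y."""
    n = len(y)
    mean = sum(y) / n
    # lag-0 autocovariance (the sample variance with divisor n)
    c0 = sum((v - mean) ** 2 for v in y) / n
    acf = []
    for k in range(max_lag + 1):
        ck = sum((y[t] - mean) * (y[t + k] - mean) for t in range(n - k)) / n
        acf.append(ck / c0)
    return acf

print(sample_acf([1.0, 2.0, 3.0, 4.0, 5.0], 1))  # [1.0, 0.4]
```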

The estimation of the PACF is based on solving the Yule-Walker equations. For details see [18]. The sample autocorrelation and partial autocorrelation functions for the series are shown in Fig. 2.
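One standard way of solving the Yule-Walker equations lag by lag is the Durbin-Levinson recursion, sketched below. It takes autocorrelations ρ₀ = 1, ρ₁, ρ₂, . . . and returns the partial autocorrelations. This is an illustration of the principle only; the routine actually used by STATA is not described in the paper, and the function name is hypothetical.

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations phi_11, ..., phi_KK from autocorrelations rho[0..K]."""
    pacf = []
    prev = []  # prev[j] holds phi_{k-1, j+1}, the AR coefficients of order k-1
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            cur = [phi_kk]
        else:
            num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
            # update the lower-order coefficients, then append the new last one
            cur = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            cur.append(phi_kk)
        pacf.append(phi_kk)
        prev = cur
    return pacf

# For an AR(1) process with rho_k = 0.5**k the PACF cuts off after lag 1.
print(pacf_from_acf([0.5 ** k for k in range(4)], 3))
```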

Fig. 2

Sample autocorrelation function (on the left) and partial autocorrelation function for the first difference of currency CZK/EUR (on the right).

The standard errors of the sample ACF and PACF are useful in identifying values that differ significantly from zero. To assist in interpreting these functions, two-standard-error limits are plotted on the graphs as horizontal lines.

We find that both the sample autocorrelation function and the sample partial autocorrelation function tail off after lag 10. Therefore, we tentatively identify our time series as an ARIMA(10,1,10) process.

The quantification of the model was performed by means of the STATA software using ML estimating procedure. On the basis of the calculated test statistics in Table 1, we have no evidence to reject the ARIMA(10,1,10) model.

Table 1

STATA-estimated parameters for the currency CZK/EUR data: model ARIMA(10,1,10) and statistical test characteristics to assess the suitability of the ARIMA (10,1,10) model

Sample: 02dec2018 17:03:00 – 07dec2018 16:59:00
Number of obs. = 7197
Wald chi2 = 148975.77
Log likelihood = 34099.69
Prob > chi2 = 0.0000

	Coef.	Std. Err.	z	P > |z|
Constant	–.0000122	1.19e-06	–10.28	0.000
AR
L1.	–.5591765	.249423	–2.24	0.025
L2.	–.2349692	.1623016	–1.45	0.148
L3.	.648814	.2061530	6.11	0.000
L4.	.530892	.2372566	2.24	0.025
L5.	.8331457	.1947631	4.28	0.000
L6.	–.0083992	.2047483	–0.04	0.967
L7.	–.0294048	.1720582	–0.17	0.864
L8.	–.2710065	.1204172	–2.25	0.024
L9.	.2560415	.1139643	2.25	0.025
L10.	–.2016631	.1313104	–1.54	0.125
MA
L1.	.4866756	.277928	1.75	0.000
L2.	.16528	.1472552	1.12	0.262
L3.	–.7151753	.2348639	–3.14	0.003
L4.	–.6077084	.2240039	–2.71	0.007
L5.	–.8693586	.2114018	–4.11	0.801
L6.	.049309	.195784	0.25	0.801
L7.	.0513134	.1531564	0.34	0.738
L8.	.3561364	.1168793	3.05	0.002
L9.	–.1624173	.1280688	–1.27	0.205
L10.	.2459452	.1363641	1.80	0.071

5 Neural network approach

A neural network can be understood as a system which produces output based on inputs the user has defined. It is important to say that the user has no knowledge of the internal workings of the NN system: neural networks work on the black-box principle. According to some publications, such as [19], NNs are the prediction models with the greatest potential for predicting time series, including high-frequency financial time series data.

Examples are presented to the network, and the network then tries to get as close as possible to the given output by adapting its parameters (weights). A neural network model has a large number of internal variables which must be set well in order to optimize the outputs.

In this section we first show function estimation for a time series modelled by a classic network trained by BP, and then by networks trained by GA and MGA.

5.1 Classic NN trained by BP algorithm

Roughly speaking, artificial neural networks are also mathematical models which can learn with arbitrary precision to imitate any behaviour that can be described with continuous function [20]. The structure of a neural network is defined by its architecture (processing units and their interconnections, activation functions, methods of learning and so on).

In this section we study the feed-forward networks in the context of supervised learning. We restrict ourselves further to three-layer feed-forward network, see Fig. 3.

Fig. 3

An example of three layer feed-forward network notation for units and weights with architecture k – s – 1 (see text for detail).

The input-output mapping of the three-layer feed-forward network shown in Fig. 3 can be mathematically described as $\hat{y} = ψ^{[2]} [ψ^{[1]} [W^{[1]} x_{t}] \times v]$ (10) or in matrix form $\hat{y} = ψ^{[2]} [ψ^{[1]} [(\begin{matrix} w_{11} & \dots & w_{1 k} \\ ⋮ & ⋱ & ⋮ \\ w_{s 1} & \dots & w_{sk} \end{matrix}) \times [\begin{matrix} x_{1} \\ ⋮ \\ x_{k} \end{matrix}]] \times [\begin{matrix} v_{1} \\ ⋮ \\ v_{s} \end{matrix}]]$ (11) where ψ^[1] is the activation function of the hidden layer, usually taken to be a continuous sigmoid, ψ^[2] is the linear activation function of the output node, and v = [v₁, . . . , v_s]^T is the vector of weights from the hidden nodes to the output node. The next expression $U^{[1]} = [u_{1}, u_{2}, . . ., u_{s}]^{T} = [(\begin{matrix} w_{11} & \dots & w_{1 k} \\ ⋮ & ⋱ & ⋮ \\ w_{s 1} & \dots & w_{sk} \end{matrix}) \times [\begin{matrix} x_{1} \\ ⋮ \\ x_{k} \end{matrix}]]$ (12) is known as the potentials of the hidden nodes. In Equations (11) and (12) the weights w_rj of the hidden layer are arranged in matrix form, which makes them easier to work with in code. Thus w₁₁ is the weight of the synapse from the first neuron of the previous layer to the first neuron of the current layer, and w₂₁ is the weight of the synapse from the first neuron of the previous layer to the second neuron of the current layer.
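The mapping of Equations (10) to (12) can be sketched in a few lines of Python. The sketch assumes a sigmoid hidden layer and a linear (identity) output node, as the network is configured later in this section, and it omits thresholds. Plain lists stand in for matrices, and the function names are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(W, v, x):
    """Forward pass of a k-s-1 network.

    W is an s-by-k list of rows (hidden weights), v has s output weights,
    x has k inputs. Returns the potentials u, hidden outputs o and y_hat.
    """
    u = [sum(w_jr * x_r for w_jr, x_r in zip(row, x)) for row in W]  # Eq. (12)
    o = [sigmoid(u_j) for u_j in u]                                  # hidden activations
    y_hat = sum(v_j * o_j for v_j, o_j in zip(v, o))                 # linear output node
    return u, o, y_hat

u, o, y_hat = forward([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0], [3.0, 4.0])
print(y_hat)  # zero weights give potentials 0, hidden outputs 0.5, output 1.0
```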

In general, the network in Fig. 3 learns in such a way that the error $y - \hat{y}$ at the output node, where y is the desired output pattern (teacher) and $\hat{y}$ is the actual output pattern, is propagated backwards and the weights are adapted according to the following procedure:

Compute the delta for the output node $Δ^{[2]} = (y - \hat{y}) ψ^{[2] \prime} (u)$ (13) where u is the potential of the output node and the prime denotes the derivative of the activation function (equal to 1 for a linear output node),

update the connections as $v_{j}^{new} = v_{j}^{old} + o_{j} Δ^{[2]}, \quad j = 1, 2, . . ., s$ (14) where $o_{j} = ψ^{[1]} (u_{j})$ is the output of the j-th hidden node.

Compute the deltas for the hidden layer nodes $Δ_{j}^{[1]} = Δ^{[2]} ψ^{[1] \prime} (u_{j}) v_{j}^{old}, \quad j = 1, 2, . . ., s$ (15)

update the connections w_rj as $w_{rj}^{new} = w_{rj}^{old} + Δ_{j}^{[1]} x_{r}, \quad j = 1, 2, . . ., s; \; r = 1, 2, . . ., k$ (16)

Typically, the updating process is divided into epochs. Each epoch involves updating all the weights for all the examples. The inputs and outputs are denoted x and $\hat{y}$ respectively. In Fig. 3 we have omitted any thresholds; they can be treated as connections with weights w_0j to an input terminal that permanently has the value -1 [21].
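The update procedure of Equations (13) to (16) can be sketched for a single training example as follows. The sketch assumes a sigmoid hidden layer, whose derivative is o_j (1 − o_j), a linear output node, whose derivative is 1, and no thresholds. The learning rate eta is an addition of this illustration; Equations (14) and (16) as written correspond to eta = 1.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(W, v, x):
    """Hidden outputs o and network output y_hat (sigmoid hidden layer, linear output)."""
    o = [sigmoid(sum(w * x_r for w, x_r in zip(row, x))) for row in W]
    return o, sum(v_j * o_j for v_j, o_j in zip(v, o))

def bp_step(W, v, x, y, eta=0.1):
    """One back-propagation update for one example; eta is an assumed learning rate."""
    o, y_hat = forward(W, v, x)
    delta2 = y - y_hat  # Eq. (13) with a linear output node
    # Eq. (15): hidden deltas use the old output weights and the sigmoid derivative
    delta1 = [delta2 * o_j * (1.0 - o_j) * v_j for o_j, v_j in zip(o, v)]
    v_new = [v_j + eta * o_j * delta2 for v_j, o_j in zip(v, o)]   # Eq. (14)
    W_new = [[w + eta * d_j * x_r for w, x_r in zip(row, x)]      # Eq. (16)
             for row, d_j in zip(W, delta1)]
    return W_new, v_new

W, v = [[0.1, -0.2], [0.3, 0.4]], [0.5, -0.5]
x, y = [1.0, 0.5], 0.8
for _ in range(500):
    W, v = bp_step(W, v, x, y)
print(abs(y - forward(W, v, x)[1]))  # the error on this example shrinks towards zero
```

An epoch in the sense used above would apply `bp_step` once to every example in the training set.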

The ARIMA(10,1,10) model for predicting the currency CZK/EUR time series is based on 10 autoregressive and 10 moving average values, as shown in Section 4. Therefore, the ANN should have 20 neurons in the input layer. Most implementations of neural networks add a bias neuron, which has no inputs and is wired to each of the other neurons. It is crucial that the values of (input + bias) are rather small, because a sigmoid function is used as the activation function of the hidden layer. The last (output) layer has the identity function as its activation function and only one output neuron. Based on empirical experience, the optimal size of the hidden layer is 25 neurons: a larger hidden layer did not provide better results, while a smaller one provided worse results.

The resulting network has three layers with 20 – 25 – 1 nodes. This network can be used to approximate the above-mentioned ARIMA(10,1,10) model, predicting the value one time unit (1 minute in this case) ahead from the last ten values and the last ten residuals of the moving average part of the ARIMA model.

5.2 Classic NN trained by genetic algorithm

The weights v, w can be adapted by genetic algorithms (GA) as well [22]. Genetic algorithms (see Fig. 4) are implemented as a computer simulation in which a population of abstract representations (called chromosomes) of candidate solutions (called individuals) to an optimization problem evolves toward better solutions.

Fig. 4

Flow chart of common GA method.

The evolution usually starts from a population of randomly generated individuals and proceeds in generations. In each generation the fitness of every individual in the population is evaluated, and multiple individuals are stochastically selected from the current population (based on their fitness) and modified (recombined and mutated) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of iterations has been produced or a satisfactory fitness level has been reached for the population.

In the first two blocks of GA we define the initial population of neural network weights, optimization criteria, and fitness functions.

We train neural networks using a genetic and micro-genetic algorithm with different population sizes to compare the times needed for training. The fitness function was set as the summary measure of model's forecast accuracy defined as the Mean Square Error (MSE) ${MSE}_{E} = \frac{1}{N_{E}} \sum_{E} (y - \hat{y})^{2}$ (17) where y is the desired output pattern, $\hat{y}$ is the estimated pattern, N_E denotes the size of the validation (testing) data set.

Genetic algorithms traditionally work with genes that are either 0 or 1. The initial population of weights v, w was generated randomly from the interval (a, b) ≡ (-0.7, 0.7), and each weight was transformed into an integer l by the formula $l = \frac{z - a}{b - a} (2^{k} - 1)$ (18) where z is the value of a weight (v, w or bias) randomly chosen from the interval (a, b) and k is the length of the binary string; l, rounded to an integer, is then written as a k-bit binary string. There are many methods for selecting the best chromosomes [23, 24]. In this paper the linear rank selection method was used.

The rank selection block prepares the population for the crossover operations. The ranking algorithm was implemented for the choice of crossover pairs. With rank selection, the probability of an individual being chosen is not directly proportional to its fitness, as it is in traditional implementations. The ranking algorithm therefore avoids the problem that arises when one individual is much fitter than the rest of the population, so that the chance of other individuals being chosen is minimal, which can quickly lead to a loss of diversity in the population. The ranking algorithm gives a greater chance to individuals with a lower survival rate, because their genetic information may also be beneficial.

Our implementation selects individuals for crossover using this rank-selection technique. After the individuals are sorted on the basis of their fitness, a rank is assigned to each of them: the best individual gets rank n and the worst individual gets rank 1. From a technical point of view, a linked list is constructed to which the worst individual is added once, the second worst twice, and so on. Therefore, the best individual in a population of 300 individuals is 300 times more likely to be chosen than the worst. In contrast to a technique based solely on the value of the loss function, one individual that is much better than the others does not prevent the other individuals from being chosen for crossover. Since the ranks sum to n(n + 1)/2, the selection probability of individual i is $p (i) = \frac{2 \, rank (i)}{n (n + 1)}$ (19)
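The binary encoding of a weight can be sketched as below. The sketch assumes the usual linear mapping of the interval (a, b) onto the integers 0, . . . , 2^k − 1, i.e. l = (z − a)/(b − a) · (2^k − 1) rounded to the nearest integer, together with its inverse. The string length k = 16 and the function names are illustrative choices, not values taken from the paper.

```python
def encode(z, a=-0.7, b=0.7, k=16):
    """Map a weight z in [a, b] to a k-bit binary string."""
    l = round((z - a) / (b - a) * (2 ** k - 1))
    return format(l, f"0{k}b")

def decode(bits, a=-0.7, b=0.7):
    """Inverse mapping: k-bit binary string back to a weight in [a, b]."""
    k = len(bits)
    return a + int(bits, 2) / (2 ** k - 1) * (b - a)

chromosome = encode(0.123)
print(chromosome, decode(chromosome))  # the decoded value differs from 0.123 only by quantisation
```

The quantisation error is at most half a step, (b − a) / (2 (2^k − 1)), so longer strings give a finer resolution of the weights.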
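The list-based construction described above can be sketched as follows. An ordinary Python list plays the role of the linked list, and individuals are ranked by their MSE, so the largest error is the worst. With the worst individual entered once and the best n times, the pool has n(n + 1)/2 entries, and a uniform draw selects individual i with probability rank(i) divided by n(n + 1)/2. The function names are hypothetical.

```python
import random

def rank_selection_pool(population, errors):
    """Pool in which the worst individual appears once and the best n times."""
    # indices sorted from worst (largest MSE) to best (smallest MSE)
    order = sorted(range(len(population)), key=lambda i: errors[i], reverse=True)
    pool = []
    for rank, idx in enumerate(order, start=1):
        pool.extend([population[idx]] * rank)
    return pool

def select_pair(pool, rng):
    """Draw two parents uniformly from the pool (the same one may be drawn twice)."""
    return rng.choice(pool), rng.choice(pool)

pool = rank_selection_pool(["a", "b", "c"], [0.3, 0.1, 0.2])
print(pool)  # ['a', 'c', 'c', 'b', 'b', 'b']
print(select_pair(pool, random.Random(0)))
```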

After two chromosomes have been selected from the current population, the individuals are modified (recombined and mutated). In this work single-point crossover was applied: a point that divides the chromosome into two parts was selected at random, and the parts after this point were exchanged between the two chromosomes. After crossover is performed, mutation takes place. Its purpose is to prevent all solutions in the population from falling into a local optimum of the problem being solved. Mutation randomly changes the new offspring; for binary encoding, a few randomly chosen bits are switched from 1 to 0 or from 0 to 1. The crossover and mutation operators are illustrated in Fig. 5. More information about crossover and mutation operators can be found in [25].
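A minimal sketch of the two operators on binary strings is given below. It assumes chromosomes of equal length (at least two bits) and a per-bit mutation probability; whether the 1 % in Table 2 applies per bit or per individual is not stated in the paper, so the per-bit reading is an assumption, as are the function names.

```python
import random

def single_point_crossover(p1, p2, rng):
    """Cut both parents at one random point and exchange the tails."""
    point = rng.randint(1, len(p1) - 1)  # at least one gene on each side of the cut
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability rate."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in chromosome
    )

rng = random.Random(1)
child1, child2 = single_point_crossover("00000000", "11111111", rng)
print(child1, child2)
print(mutate(child1, 0.01, rng))
```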

Fig. 5

Crossover (a) and Mutation (b) operators.

5.3 Classic NN trained by micro-genetic algorithm

The micro-genetic algorithm is a modification of the GA [26, 27]. It is based on a small number of individuals in the population. Figure 6 shows the main flowchart of the MGA.

Fig. 6

Flowchart of the micro-genetic algorithm.

The common phases of the MGA flow are initialization, elitism, selection, crossover, mutation, new population, and termination. Mutations are generally not used here; a mutation is applied only if there is a loss of population diversity. Instead, convergence control and restart steps are added.

The new population emerges from newly created offspring that in the next generation can change their properties and thus increase the diversity of the population.

It is necessary to find out whether convergence has occurred. The detection of convergence is based on comparing the fitness of the best individual with that of the others in the population, according to the equation $α = \frac{{MSE}_{max} - {MSE}_{min}}{{MSE}_{min}}$ (20) where α is the convergence parameter, MSE_max is the highest fitness value in the given generation and MSE_min is the lowest fitness value in the given generation [28]. When convergence occurs (α ⩽ 0.01), the population is restarted: a new population is created by keeping the best individual from the current population, while the other individuals are randomly generated. Otherwise, the algorithm continues with the Elitism block.
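The convergence test of Equation (20) and the restart step can be sketched as follows. Individuals are assumed to be lists of floating-point genes in [0, 1], all MSE values are assumed to be strictly positive so that the ratio is defined, and the function names are hypothetical.

```python
import random

def has_converged(mse_values, threshold=0.01):
    """Eq. (20): relative spread between the worst and the best MSE in a generation."""
    mse_min, mse_max = min(mse_values), max(mse_values)
    alpha = (mse_max - mse_min) / mse_min  # assumes mse_min > 0
    return alpha <= threshold

def restart(population, mse_values, n_params, rng):
    """Keep the best individual (elitism) and regenerate all the others at random."""
    best = population[mse_values.index(min(mse_values))]
    fresh = [[rng.random() for _ in range(n_params)]
             for _ in range(len(population) - 1)]
    return [best] + fresh

population = [[0.1], [0.2], [0.3]]
errors = [1.009, 1.000, 1.005]
if has_converged(errors):
    population = restart(population, errors, n_params=1, rng=random.Random(0))
print(population[0])  # the elite individual [0.2] survives the restart
```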

6 Neural networks implementation

Genetic algorithms traditionally work with genes that are either 0 or 1. For this application that is inadequate, because the algorithm needs to find the weights and biases of a neural network, which can be arbitrary floating-point numbers. Therefore, each parameter is transformed into the interval [0, 1]. An individual for the genetic algorithm then consists of floating-point numbers from the interval [0, 1], and the number of these values equals the number of all weights and biases of the network.

As for the program implementation, a custom implementation of the neural networks and of both the GA and MGA learning algorithms was used. As we have experience with implementing machine learning algorithms in Lisp, the Clojure 1.6 language was chosen for this implementation.
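How such an individual might be unpacked into network parameters is sketched below in Python (the authors' implementation is in Clojure and is not reproduced here). The sketch counts one weight per connection and one bias per non-input neuron, and it rescales genes from [0, 1] to an assumed weight range (-0.7, 0.7); the range, the ordering of the genes and the function names are all assumptions of this illustration.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases of a fully connected feed-forward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def genes_to_layers(genes, layer_sizes, low=-0.7, high=0.7):
    """Split a flat gene vector in [0, 1] into (weight matrix, bias vector) per layer."""
    scaled = [low + g * (high - low) for g in genes]
    layers, pos = [], 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # one row of n_in weights for each neuron of the present layer
        W = [scaled[pos + r * n_in: pos + (r + 1) * n_in] for r in range(n_out)]
        pos += n_in * n_out
        B = scaled[pos: pos + n_out]
        pos += n_out
        layers.append((W, B))
    return layers

print(count_parameters([20, 25, 1]))  # 551 genes for 20 inputs, 25 hidden neurons, 1 output
layers = genes_to_layers([0.5] * 9, [2, 2, 1])
print(len(layers))  # 2 layers
```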

Clojure is a modern functional programming language dialect of Lisp, dynamically compiled to the Java bytecode. Testing was performed on a PC with AMD Ryzen 2 2600 processor, 6 physical and 12 logical cores with 16GB memory.

Neural networks are represented in our program as a collection of matrices of weights and vectors of biases. Clojure is not object oriented, therefore there is no benefit in defining classes or interfaces (these features exist in Clojure mainly for interoperability with Java). The representation differs slightly from the general one shown in Section 5. Each layer of the network is represented by a matrix of weights and a vector of biases, in the expanded matrix form $(\begin{matrix} w_{11} & \dots & w_{1 k} \\ ⋮ & ⋱ & ⋮ \\ w_{s 1} & \dots & w_{sk} \end{matrix}) \times [\begin{matrix} a_{1} \\ ⋮ \\ a_{k} \end{matrix}] + [\begin{matrix} b_{1} \\ ⋮ \\ b_{s} \end{matrix}] = [\begin{matrix} u_{1} \\ ⋮ \\ u_{s} \end{matrix}]$ (21) or, compactly, $W^{L} \times A^{L - 1} + B^{L} = U^{L}$ (22) and $A^{L} = ψ (U^{L})$ (23) where W^L is the matrix of weights of the present layer (L and L – 1 are layer indices), A^L-1 is the column vector of the activations of the previous layer, B^L is the column vector of the biases of the present layer, U^L is the column vector of the weighted inputs (sometimes referred to as potentials) of the present layer, and ψ is the activation function.

As Equations (22) and (23) show, the program allows networks to have more than one hidden layer. This allows us to use the same software for future testing of deep neural networks [29–31].

Equation (21) shows the weighted input of each layer in expanded matrix form. It can be viewed as the vector of the potentials of the neurons in the layer. Equation (23) is a simplified notation for applying the activation function to each element of the vector U^L, which creates the vector A^L.
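The layer-by-layer computation of Equations (22) and (23) can be sketched in Python as follows (the authors' implementation is in Clojure and is not reproduced here). Each layer is a pair of a weight matrix and a bias vector, and one activation function is supplied per layer; plain lists stand in for matrices, and the function names are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def identity(u):
    return u

def layer_forward(W, B, A_prev, psi):
    """U = W x A_prev + B (Eq. 22), then A = psi(U) element by element (Eq. 23)."""
    U = [sum(w * a for w, a in zip(row, A_prev)) + b for row, b in zip(W, B)]
    return [psi(u) for u in U]

def network_forward(layers, x, activations):
    """Propagate the input x through any number of (W, B) layers."""
    A = x
    for (W, B), psi in zip(layers, activations):
        A = layer_forward(W, B, A, psi)
    return A

layers = [([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer: 2 inputs, 2 neurons
          ([[1.0, 1.0]], [0.5])]                   # output layer: 1 neuron
print(network_forward(layers, [1.0, 2.0], [sigmoid, identity]))  # [1.5]
```

Adding further (W, B) pairs to `layers` gives a network with more hidden layers without changing the code.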

7 Results and empirical comparison

This section presents the results of the experiments conducted to study the performance of the prediction methods using BP, GA and MGA learning. All numerical experiments were conducted using the same variables and data sets as the statistical model above. Therefore, the input layer of the network consisted of 20 neurons plus one bias neuron. The output layer has one neuron with a linear activation function. Based on empirical experience, the optimal size of the hidden layer is 25 neurons. The resulting architecture of the network is 21 – 25 – 1 for all NN methodological frameworks.

We trained the neural network using the genetic and micro-genetic algorithms with different population sizes. Our target was to train a network whose MSE on the validation data set is below 1.0×10^-7; reaching this value is also the stopping condition for training. Table 2 shows the training parameters used in the GA and MGA algorithms.

Table 2
The parameters used in the GA and MGA learning algorithms

Parameter name	Value
Number of crossbreds	Size of population – (elites+random term)
Number of randomly generated individuals	1–4 depending on population
Number of elites	1
Probability of mutation	1 %

Each algorithm with a particular population size was run 12 times, and the lowest and highest times were removed to eliminate outliers in the measurements.

As shown in Table 3 and Fig. 7, when going from the genetic to the micro-genetic algorithm (from a population of 300 to 30 in this case), the number of generations needed goes up. However, since the population is much smaller, calculating the fitness of a generation is faster as well. In our case, the genetic algorithm with a population of 600 had a population too large for this problem, which slowed down the convergence of the algorithm. For the last MGA, with a population size of 6, the population is too small and convergence is slow as well. As shown in Fig. 7, the population size that gives the minimal training time is 20.
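The outlier handling described above amounts to a trimmed mean. A minimal sketch, assuming that exactly one lowest and one highest measurement are dropped before averaging (the function name and the sample timings are hypothetical):

```python
def trimmed_mean(times):
    """Mean of the measurements after dropping one lowest and one highest value."""
    if len(times) < 3:
        raise ValueError("need at least three measurements")
    kept = sorted(times)[1:-1]
    return sum(kept) / len(kept)

print(trimmed_mean([5.0, 1.0, 100.0, 3.0, 4.0]))  # 4.0, the outlier 100.0 is ignored
```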

Table 3

A summary of the prediction accuracy and computation time in relation to the population size and the number of generations

Population size	Mean number of generations	Mean time [ms]	MSE
Statistical (B-J approach)
		overnight	1.44×10^-08
Neural network (BP learning)
		2235.17	3.44×10^-08
Neural network (GA learning)
600	19.91	98047.73	9.47×10^-08
300	12.09	27290.18	3.52×10^-08
Neural network (MGA learning)
30	114.27	25168.64	7.65×10^-08
20	135.36	20871.55	5.98×10^-08
10	300.64	24677.18	7.04×10^-08
6	797.72	51542.55	7.73×10^-08

Fig. 7

Time needed to train GA and MGA algorithm related to the size of populations (GA population size 300 and 600; MGA 6, 10, 20, 30).
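The trade-off between the number of generations and the cost of each generation can be made explicit with a short calculation on the values of Table 3. The sketch below only divides the mean time by the mean number of generations; the per-generation figure is therefore a rough derived quantity, not a value reported by the experiments.

```python
# (population size, mean number of generations, mean time in ms), from Table 3
results = [
    (600, 19.91, 98047.73),
    (300, 12.09, 27290.18),
    (30, 114.27, 25168.64),
    (20, 135.36, 20871.55),
    (10, 300.64, 24677.18),
    (6, 797.72, 51542.55),
]

# Approximate cost of a single generation for each population size.
per_generation = {pop: time_ms / gens for pop, gens, time_ms in results}
for pop, cost in per_generation.items():
    print(f"population {pop:>3}: about {cost:8.1f} ms per generation")

# Population size with the shortest mean training time.
fastest = min(results, key=lambda row: row[2])
print("shortest mean training time: population", fastest[0])
```

A generation becomes cheaper as the population shrinks, but more generations are needed; the product of the two is smallest at a population size of 20.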

Table 3 also shows the accuracy of the statistical ARIMA model and of the NN trained with BP, GA and MGA, expressed in terms of MSE. All forecast models considered are very good, with MSE below 1.0×10^-7 in every case. The most accurate method is the statistical ARIMA(10,1,10) model, followed by the NN trained with BP, the NN trained with GA with population size 300, and the NN trained with MGA, with MSE values of approximately 7×10^-08.
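The ranking can be reproduced directly from the MSE column of Table 3. The following sketch sorts the methods by MSE and expresses each error as a multiple of the ARIMA error; the dictionary keys are shorthand labels chosen here for readability.

```python
# MSE values copied from Table 3
mse = {
    "ARIMA (B-J approach)": 1.44e-08,
    "NN, BP": 3.44e-08,
    "NN, GA (600)": 9.47e-08,
    "NN, GA (300)": 3.52e-08,
    "NN, MGA (30)": 7.65e-08,
    "NN, MGA (20)": 5.98e-08,
    "NN, MGA (10)": 7.04e-08,
    "NN, MGA (6)": 7.73e-08,
}

baseline = mse["ARIMA (B-J approach)"]
ranked = sorted(mse.items(), key=lambda item: item[1])   # most accurate first

for name, value in ranked:
    print(f"{name:<22} {value:.2e}  ({value / baseline:.2f} x ARIMA)")

# Every model meets the training target of 1.0e-7.
all_below_target = all(value < 1.0e-7 for value in mse.values())
```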

The use of ARIMA models is a powerful approach to the solution of many forecasting problems, but these models are not without limitations. In statistical models based on the B-J methodology, there is no conventional way to modify or update the estimates of the model parameters as each new observation becomes available. Another drawback of ARIMA models, in contrast to NN, is that their parameter estimation is very slow and can hardly be parallelized.

8 Conclusion

In this paper, we have studied statistical and neural network techniques for predicting a high-frequency time series of the exchange rate between the Czech koruna and the euro (CZK/EUR), and we have compared the training speed and forecasting accuracy of statistical learning with those of neural networks. It was shown that GA and MGA can be used to train a feed-forward neural network to approximate an ARIMA model for predicting high-frequency time series data. The presented results show that MGA learning achieves a satisfactory learning speed together with satisfactory predictive accuracy.

Statistical learning and forecasting based on the ARIMA model provides a forecasting accuracy of MSE = 1.44×10^-08, but the training time is very long. Among the predictions based on neural networks, the most accurate method is the NN trained with BP (MSE = 3.44×10^-08), followed by the NN trained with GA with population size 300 (MSE = 3.52×10^-08) and the NN trained with MGA with population size 10 (MSE approximately equal to 7×10^-08).

In our experiments with the GA and MGA methods, we also investigated how the training time depends on the population size. We found that the optimal population size is 20, with a training time of approximately 21 s.

In general, all investigated forecast models based on advanced statistical and NN methods perform very well. The use of ARIMA models is a powerful approach to the solution of many forecasting problems. It can provide extremely accurate forecasts of time series, and it offers a formal, structured approach to model building and analysis. However, these models are not without limitations [18]. In statistical models based on the B-J methodology, there is no conventional way to modify or update the estimates of the model parameters as each new observation becomes available. Another drawback of ARIMA models, in contrast to NN, is that their parameter estimation is very slow and can hardly be parallelized.

In the future, our main research objective is to apply the developed metaheuristics to various data sets. Selected metaheuristics will be tested with different parameter combinations in order to find the combination that yields an approximate feasible solution in an acceptable computation time. Some changes may be made to the mutation operator, and the selection procedure may be improved. The proposed BP algorithm may also be hybridized with another metaheuristic method.

Acknowledgments

This work was supported within Operational Program Education for Competitiveness – Project No. CZ.1.07./2.3.00./20.0296 and as a part of project SP2019/81 of the Student grant competition EkF, VŠB-TU Ostrava.

References

1. Anderson P.W., Arrow and Pines, The Economy as an Evolving Complex System, Santa Fe Institute Studies in the Science of Complexity, 5, Addison-Wesley Publishing Company, 1998.

2. Gupta M.M. and Rao D.H., On the Principles of Fuzzy Neural Networks, Fuzzy Sets and Systems 61 (1994), 1–18.

3. Hines and Usynin, Current Computational Trends in Equipment Prognostics, International Journal of Computational Intelligence Systems 1(1) (2008), 94–102.

4. Hong, et al., Model selection system identification: a review, 39(10) (2008), 925–946.

5. Rahimi and Recht, Unsupervised Regression with Applications to Nonlinear System Identification, Advances in Neural Information Processing Systems 19 (2007), 111–1120, MIT Press.

6. Marcek, Forecasting of financial data: a novel fuzzy logic neural network based on error-correction concept and statistics, Complex & Intelligent Systems 4(2) (2018), 95–104.

7. Marcek and Rojcek, The Category Proliferation Problem in ART Neural Networks, Acta Polytechnica Hungarica 4(5) (2017), 49–63.

8. Sakai, Nonlinear Dynamics and Chaos in Agricultural Systems, Elsevier, 2001.

9. CH. and Li, Computational Intelligence Assisted Design in Industrial Revolution 4.0, CRC Press, Taylor & Francis Group, 2018.

10. Suykens J.A.K., J.P.L. and Vandewalle B.L., Artificial Neural Networks for Modeling and Control of Non-Linear Systems, Springer-Verlag, 1995.

11. Wang, et al., Neural network-based robust adaptive control of nonlinear systems with unmodeled dynamics, Mathematics and Computers in Simulation 79(5) (2009), 1745–1753.

12. Tan, Time-varying time-delay estimation for nonlinear systems using Neural Networks, Int J Appl Math Comput Sci 14(1) (2004), 63–68.

13. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, University of Michigan Press, 1975.

14. Tinos and Yang, A Self-Organizing Random Immigrants Genetic Algorithm for Dynamic Optimization Problems, Genetic Programming and Evolvable Machines 8(3) (2007), 255–286.

15. Borcinova, Solving the capacitated vehicle routing problem using a parallel micro genetic algorithm, In: 2018 IEEE Workshop on Complexity in Engineering (COMPENG) (2018), 1–6.

16. Borcinova, Robust models in distribution problems, Dissertation thesis, University of Žilina, Faculty of Management Science and Informatics, Department of mathematical methods, 2019.

17. Box G.P. and Jenkins G.M., Time Series Analysis, Forecasting and Control, Revised Edition, Holden-Day, San Francisco, CA, 1976.

18. Montgomery D.C., Lynwood A.J. and Gardiner J.S., Forecasting and Time Series Analysis, New York: McGraw-Hill, Inc., 1990, pp. 260–261.

19. Gooijer J.G. and Hyndman R.J., 25 Years of Time Series Forecasting, International Journal of Forecasting 22 (2006), 443–473.

20. Hornik, Approximation capabilities of multilayered feedforward networks, Neural Networks 4 (1991), 251–257.

21. Hertz, Krogh and Palmer R.G., Introduction to the Theory of Neural Computation, Addison-Wesley, 1991.

22. Zamba, A Genetic Algorithm Approach for Solving Cutting Stock Problem in Lumber Cutting Industry, Proc. 18th Int. Conf. on Soft Computing, eds. R. Matousek, Czech Republic, University of Pardubice, Pardubice, 2012, 70–75.

23. Jebari and Madiafi, Selection Methods for Genetic Algorithms, International Journal of Emerging Sciences 3(4) (2013), 333–344.

24. Saini, Review of Selection Methods in Genetic Algorithms, International Journal of Engineering and Computer Science 6(2) (2017), 22261–22263.

25. Kalyanmoy Deb, Multi-Objective Optimization using Evolutionary Algorithms, Wiley India Edition, 2005.

26. Goldberg D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Publishing Co., Reading, Massachusetts, 1989.

27. Krishnakumar, Microgenetic algorithm for stationary and non-stationary function optimization, Proc. SPIE 1196 (1990), 113–119.

28. Alajmi and Wright, Selecting the most efficient genetic algorithm sets in solving unconstrained building optimization problem, Built Environment 3(1) (2014), 18–26.

29. Nielsen M.A., Neural Networks and Deep Learning, Determination Press, 2015.

30. Zocca, Spacagna, Slater and Roelants, Python Deep Learning, Packt Publishing Ltd, 2017.

31. Aggarwal C.C., Neural Networks and Deep Learning, Springer International Publishing AG, 2018.