Abstract
To forecast time series data, two methodological frameworks of statistical and computational intelligence modelling are considered. The statistical methodological approach is based on the theory of invertible ARIMA (Auto-Regressive Integrated Moving Average) models with Maximum Likelihood (ML) estimating method. As a competitive tool to statistical forecasting models, we use the popular classic neural network (NN) of perceptron type. To train NN, the Back-Propagation (BP) algorithm and heuristics like genetic and micro-genetic algorithm (GA and MGA) are implemented on the large data set. A comparative analysis of selected learning methods is performed and evaluated. From performed experiments we find that the optimal population size will likely be 20 with the lowest training time from all NN trained by the evolutionary algorithms, while the prediction accuracy level is lesser, but still acceptable by managers.
Introduction
Forecasting of financial time series data is a complex problem, which has benefited from recent advancements and research in machine learning. In economics and in particular in the field of financial markets, forecasting is very important because forecasting is an essential instrument to operate day by day in the economic environment. It is generally known that financial high frequency data behave unforeseeable. They are stochastic, non-linear and chaotic. Application of deterministic nonlinear dynamics and chaos theory to the analysis of stochastic time series are widely used in contemporary macroeconomics and finance. A broad pioneering volume on the complexity of the economy is edited by Anderson [1].
Time series models are based on the analysis of chronological sequence of observations on particular variable. The main purpose of time series analysis is to understand the underlying mechanism that generates data, and, in turn, to estimate observed data and apply the models for forecasting.
Typically, in conventional time series, we assume that the generating mechanism is probabilistic and that the observed series is a realization of a stochastic process. This process is assumed to be stationary and is described by a class of models called autoregressive moving average (ARMA) models.
Over the past ten years academics of computer science have developed new soft techniques based on latest information technologies such as classic and soft (fuzzy logic) neural networks and granular computing, which evolved in the process of understanding incredible learning and adaptive features of neuronal mechanisms inherent in certain biological species [2–9].
Artificial neural networks (ANN) can be understood as a system which produces output based on inputs the user has defined. It is important to say that user has no knowledge about internal working of the system of ANN. Examples are brought forward the network and then the network tries to get as close as possible to the given output by adapting its parameters (weights). ANN model has a large number of internal variables which are supposed to set up well in order to optimize the outputs. This approach is based almost exclusively on finding non-linear function between the inputs and the output of the system. Neural networks have shown their efficiency in various identification, prediction, and diagnostic cases [10–12].
In ANN applications, in addition to the classical gradient learning method, the methods based on the principle of Darvinian evolution are increasingly used. The concept of GA was first presented by John Holland [13]. Evolutionary computing is a set of metaheuristics inspired from biological evolution and based on natural selection and genetic inheritance. It is mainly applied for optimization purposes in modeling of non-linear dynamic systems [14] or in many real-life applications data are subject to uncertainty due to their random nature, measurement errors or other reasons [15, 16]. GA works as an iterative procedure with a population of individuals each representing solution of the given optimization problem. The quality measure of an individual is the value of fitness corresponding to the value of the purpose function. Individuals with better fitness have a better chance of surviving and reproduce. In reproduction new individuals – offsprings inherit some features from their parents and form a new generation. By repeating this process, the average of fitness population or the best fitness improves from generation to generation. Unlike classic GA, where the population is composed of a large number of individuals, MGA uses a populations with a small number of individuals. Of course, a small population quickly reduces its diversity and can easily get stuck at a local optimum. When convergence occurs, the new population is restarted. A new population is created by adding the best individual from the current population (elitism), and the other individuals are randomly generated. Elitism guarantees the best individual in the next generation will not be worse from the previous generation.
The goal of this paper is to illustrate that the two distinct approaches, i.e. statistical models and computational networks may be used for financial high-frequency time series modelling. The main work of this study is to test the training speed and the predictive accuracy level of the statistical learning with neural networks one on the large data set.
In Section 2 a special dynamic process with random values as its observations is characterized. The characterization of conventional time series modelling is introduced in Section 3. In Section 4 the development and calculation of statistical models will be discussed. Section 5 introduces the characterization of NN architectures and proposes learning procedures used for NN and learning methods. Section 6 describes the program implementation of NN and both GA and MGA learning algorithms. Section 7 provides the assessment of prediction results from all learning approaches and verifies they applicability. Concluding remark are given in Section 8.
Conventional time series models
To build a time series model in a research a sample observations from the available data is usually collected. A time series consists of an observation set {y1, y2, . . . , y t . . .} of some phenomenon, taken at equally spaced time intervals. As we mentioned above, we assume that the observed series {y1, y2, . . . , y t . . .}is a realization of a stochastic process {Y1, Y2, . . . , Y t . . .}. This process is assumed to be stationary and is described by a class of linear models called autoregressive moving average (ARMA) models. Box and Jenkins (B-J) give a thorough treatment of these models [17]. We also assume that y t is a real for each t ∈ T, where {T = 1, 2, . . . , n} is an index set. The subscript t can now be referred to as time, so y t is the observed value of the time series at time t. The total number of observations in a time series (here n) is called the length of the time series or length of the data. In the following, we will typically refer to realization of stochastic processes by the notation y t for a value at t, and {y t } for a full set of values corresponding to the index set T = {1, 2, . . . , n}. We will also restrict our attention to discrete stochastic process, for which the index set is a discrete set, in which case we generally use the notation y t , which may apply also to continuous processes.
Once an appropriate model fits, it can be used to generate forecasts for future time periods. Most forecasting methods, commonly used in time series analysis, generate forecasts of future observations that are optimal in a minimum mean, square error sense (i.e. the best linear predictor).
Next, let Y
n
(τ) denote the forecast τ step ahead; we define as
In practice we have a finite number of observations, Equation (1); nevertheless, on the best linear predictor in the infinite sample limit enables to develop a way of calculating the approximate best linear predictor when n is large.
Conventional time series modeling can be grouped into two types. Time series methods and causal methods. As mentioned above, univariate time series models are based on the analysis of chronological sequence of observations on a particular variable. Causal models assume that the variable to be modeled can be explained by the behavior of another variable, or a set of variables.
In practice there are many time series in which successive observations are depended, i.e. there exists an observational relation
The most often used model is, however, an explicit function
The AR(1) process (4) is a special case of a stochastic process which is known as the mixed autoregressive-moving average model of the order (p, q) which is abbreviated ARMA(p, q):
All the above time series can be derived from linear combination of independent white noise random variables {ɛ
t
, ɛt-1, ɛt-2 . . .}, i.e.
A general model capable of representing a wide class of non-stationary time series is autoregressive integrated moving average process of order (p, d, q), ARIMA(p, d, q), where d is an operator for differencing a time series. Thus, the model represents the dth difference of the original series as a process containing p autoregressive and q moving average parameters. For example, the ARIMA(1,1,1) has the form:
To illustrate the Box-Jenkins approach, consider the time readings of the currency exchange rate between Czech Koruna (CZK) and Euro (EUR) (abbreviated as currency CZK/EUR) collected for the first week of December 2018. The data used for research discussed in this paper were published by GAIN Capital company and are freely available at http://ratedata.gaincapital.com/.
The preview of used time series data are shown on the left hand-site of Fig. 1. The data set starts at the 2nd December 2018 17 : 02 : 14 and ends at the 17th December 2018 16 : 59 : 55. It contains 60586 values. After removal of duplicates and interpolating the missing values, the time series counts 7197 observations. The STATA software was used for this 1 . From examining the left-hand site of Fig. 1, we note the variability of the series decreases as its general level decreases (the currency CZK/EUR time series has a declining trend). This suggests that the logarithms of the currency data should be analyzed, rather than the raw series.

The currency CZK/EUR (on the left) interpolated and (right) first difference of currency CZK/EUR EUR/CZK (on the right).
For successful usage of the Box-Jenkins method for creating an ARIMA model, the data should be stationary. The time series on the left-hand site of Fig. 1 has homogeneously non stationary behavior in the mean. In any local segment of time the observation look like those in any other segment, apart of their average. However, its first difference that is y t - yt-1 shown on the right-site of Fig. 1 is stationary in the mean and slope.
To build a forecast model the time series data was split into training and validation data set. The training data consist of 90% of the original series and the validation data set as the time period from the first observation after the end of the sample period to the most recent observation. The primary tool used in identification process is Auto Correlation Function (ACF) denoted as ρ
k
. Actually, the theoretical ACF is unknown and must be estimated by the sample ACF, i.e.
The estimation of the PACF is based on solving the Yule-Walker equations. For details see [18]. The sample autocorrelation and partial autocorrelation functions for the series are shown in Fig. 2.

Sample autocorrelation function (on the left) and partial autocorrelation function for the first difference of currency CZK/EUR (on the right).
The standard errors of the sample ACF and PACF are useful in identifying mean zero values. For assistance in interpreting these functions, two-standard-errors limits are plotted on the graphs as horizontal lines.
We find that the sample autocorrelation function tails off after lag 10 and also partial autocorrelation function tails off after lag 10. Therefore, we would tentatively identify out time series as the ARIMA(10,1,10) process.
The quantification of the model was performed by means of the STATA software using ML estimating procedure. On the basis of the calculated test statistics in Table 1, we have no evidence to reject the ARIMA(10,1,10) model.
STATA-estimated parameters for the currency CZK/EUR data: model ARIMA(10,1,10) and statistical test characteristics to assess the suitability of the ARIMA (10,1,10) model
Neural networks can be understood as a system which produces output based on inputs the user has defined. It is important to say that user has no knowledge about internal working of the NN system. Neural networks work on the Black Box principle. According to some publications such as [19], NN are the prediction models which have the biggest potential in predicting time series and high-frequency financial time series data.
In NN examples are brought forward the network and then the network tries to get as close as possible to the given output by adapting its parameters (weights). Neural network model has a large number of internal variables which are supposed to set up well in order to optimize the outputs.
In this section we firstly show an approach of function estimation for time series modelled by means of classic network trained by BP, and then trained by GA and MGA.
Classic NN trained by BP algorithm
Roughly speaking, artificial neural networks are also mathematical models which can learn with arbitrary precision to imitate any behaviour that can be described with continuous function [20]. The structure of a neural network is defined by its architecture (processing units and their interconnections, activation functions, methods of learning and so on).
In this section we study the feed-forward networks in the context of supervised learning. We restrict ourselves further to three-layer feed-forward network, see Fig. 3.

An example of three layer feed-forward network notation for units and weights with architecture k – s – 1 (see text for detail).
The input-output mapping of three-layer feed-forward network shown in Fig. 3 can be mathematically described as
In general, the network in Fig. 3 learnt so that the errors identified as Compute the error for the output node
update the connections as
Compute the deltas for the hidden layer nodes
update the connections w
rj
as
In Equations (11) and (12) the weighs w
rj
for hidden layers are represented in the matrix form. Typically, the updating process is divided into epochs. Each epoch involves updating all the weights for all the examples. The inputs and outputs are also called as
ARIMA(10,1,10) model for predicting currency CZK/EUR time series data is based on 10 auto-regressive and 10 moving average values, as shown in Section 3. Therefore, ANN should have 20 neurons in the input layer. Most implementations of neural networks use a neuron with no input wired to each other neurons as a bias. It is crucial that the values of (input+bias) are rather small because a sigmoid function is used as an activation function of the hidden layer. The last, output layer has the identity function as the activation function and has only one output neuron. Based on empirical experience the optimal size of the hidden layer is 25 neurons. Larger size did not provide better results while smaller provided worse.
The resulting shape of the network was three layers having 20–25 – 1 nodes. This network can be used for approximating the above mentioned ARIMA(10,1,10) model, predicting a value for 1-time unit (1 minute in this case) in the future of the last ten values and ten residuals of the moving average part of the ARIMA model.
The weights v, w can be adapted by genetic algorithms (GA) as well [22]. Genetic algorithms (see Fig. 4) are implemented as a computer simulation in which a population of abstract representations (called chromosomes) of candidate solutions (called individuals) to an optimization problem evolves toward better solutions.

Flow chart of common GA method.
The evolution usually starts from a population of randomly generated individuals and happens in generations. In each generation the fitness of every individual in the population is evaluated, multiple individuals are stochastically selected from the current population (based on the fitness), and modify it (recombined and mutated) to form a new population. The new population is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of iterations has been produced or a satisfactory fitness level has been reached for the population.
In the first two blocks of GA we define the initial population of neural network weights, optimization criteria, and fitness functions.
We train neural networks using a genetic and micro-genetic algorithm with different population sizes to compare the times needed for training. The fitness function was set as the summary measure of model's forecast accuracy defined as the Mean Square Error (MSE)
Genetic algorithms traditionally work with genes either 0 or 1. The initial population of weights v, w was generated randomly from the interval (a, b) ≡ (-0.7, 0.7) and transformed into the integer digit denoted as l by the following formula
The rank selection block prepares the population for the crossover operations. The ranking algorithm was implemented for the choice of crossover pairs. As mentioned earlier, the probability of an individual being chosen is not directly proportional to its fitness as in traditional implementations. Ranking algorithm therefore avoids a problem that if one individual is much more fit than the rest of the population, and the chance of other individuals being chosen is minimal. This can quickly lead to a loss of diversity in the population. The ranking algorithm gives more chance to individuals who have lower survival rate because their genetic information may also be beneficial.
Our implementation selects individuals for crossover using this rank-selection technique. From a technical point of view, a linked list is constructed and the worst individual is added once, second worst twice etc. Therefore the best individual in a population consisting of 300 individuals is 300 times more likely to be chosen than the worst. In contrast to a technique based solely on a loss function value, if one individual is much better than others, it does not prevent other individuals from being chosen for crossover. After sorting the individuals on the basis of their fitness, the rank is assigned to them. The best individual gets rank n and the worst individual gets rank 1. The selection probability of an individual is given as follows
After selection of two chromosomes from the current population, individuals are modify (recombined and mutated). In this work the single-point crossover has been applied. In the chromosome a point was randomly selected which divide chromosome into two parts. Then those two parts of chromosomes were exchanged. After a crossover is performed, mutation take place. This is to prevent falling all solutions in population into a local optimum of solved problem. Mutation changes randomly the new offspring. For binary encoding we can switch a few randomly chosen bits from 1 to 0 or from 0 to 1. The crossover and mutation operators are graphically illustrated in Fig. 5. More information about crossover and mutation operators can be find in [25].

Crossover (a) and Mutation (b) operators.
Micro-genetic algorithm is a modification of GA. [26, 27]. It is based on a small number of individuals in the population. Figure 6 shows the main flowchart of the MGA algorithm.

Flowchart of the micro-genetic algorithm.
The common phases of MGA flow are initialization, elitism, selection, crossover, mutation, new population, and termination. Mutations are generally not used here. A mutation is only used if there is a loss of population diversity. The convergence control and restart have come.
The new population emerges from newly created offspring that in the next generation can change their properties and thus increase the diversity of the population.
It is necessary to find out if the convergence has occurred or not. The detection of convergence is based on a comparison of the population diversity between the best fitness individual and others according to the following equation
Genetic algorithms traditionally work with genes either 0 or 1. For this application, it is inadequate, because this algorithm need to find weights and biases of a neural network, which can be an arbitrary floating point decimal numbers. Therefore, each parameter is transformed into interval [0, 1]. An individual for the genetic algorithm then consists of floating point numbers from [0, 1] interval and count of parameters in intervals equals to count of all weights and biases of the network.
As for program implementation, a custom implementation of both GA and MGA learning algorithms and neural networks was used. As we have experience with implementing machine learning algorithms in Lisp language, the Clojure 1.6 language was chosen for this implementation 2 .
Clojure is a modern functional programming language dialect of Lisp, dynamically compiled to the Java bytecode. Testing was performed on a PC with AMD Ryzen 2 2600 processor, 6 physical and 12 logical cores with 16GB memory.
Neural networks are represented as a collection of matrices of weights and biases in our program. Clojure is not object oriented, therefore there is no benefit in defining classes or interfaces (these features exist in Clojure mainly for interoperability with Java). A slightly different approach is used as by the general representation shown in Section 5. Each layer of the network is represented by a matrix of weights and vector of biases in the following expanded matrix forms
We see from Equations (22) and (23), the program assumes that networks can have more than one hidden layer. This allows us to use the same software for future testing of deep neural networks [29–31].
Equation (22) shows the weighted input of each layer in the expanded matrix form. This can be viewed as a vector of the potentials of the neurons in the layer. A L is a simplified notation for applying an activation function to each element of the U L vector, creating the A L vector.
This section presents the experiment results conducted to study the performance of the prediction methods using BP, GA and MGA learning. All numeral experiments were conducted using the variables and data sets as the statistical model above. Therefore, the input layer for the network consisted of 20 plus one neurons in the input layer. The output layer has one neuron with linear activation function. Based on empirical experience the optimal size of the hidden layer is 25 neurons. The resulting architecture of the network is 21 – 25 – 1 for all NN methodological frameworks.
We trained neural network using a genetic and micro-genetic algorithm with different population sizes. Our target was to train a network with MSE function of the validation data set being under 1.0×10-7 value. It is also the condition for training stopping. Table 2 shows the training parameters used in GA and MGA algorithms.
The parameters used in the GA and MGA learning algorithms
The parameters used in the GA and MGA learning algorithms
Each algorithm with a particular population size was performed 12 times, removing the lowest and highest time. This eliminates the measured outliers.
As shown in the Table 3 and Fig. 7 when going from genetic to micro-genetic algorithms (from 300 population to 30 in this case), the needed number of generations went up. Since the population is much smaller, calculating the fitness of the generation is faster as well. In our case, the genetic algorithm with 600 population had population size too large for this problem, slowing the convergence of the algorithm down. The last one MGA with population size of 6, the population size is too small and the convergence is slow as well. As shown in Fig. 7, an optimal population which gives minimal learning speed is the population of size 20.
A summary of the predicted accuracy and needed time of calculations related to the size of the generations and the number of iterations

Time needed to train GA and MGA algorithm related to the size of populations (GA population size 300 and 600; MGA 6, 10, 20, 30).
Table 3 shows also the accuracy results of the statistical ARIMA, GA and MGA methods expressed in term of MSE. All proposed forecast models based on advanced statistical and NN methods indicate that all forecast models are very good. The most accurate method is statistical ARIMA (10,1,10) followed by NN trained with BP, NN trained with GA and population size = 300, NN trained with MGA method with MSE values approximately 7×10-08.
The use of ARIMA models is a powerful approach to the solution of many forecasting problems. But, they are not without several limitations. In statistical models based on the B-J methodology, there is not conventional way to modify or update the estimates of the model parameters as each new observation becomes available. In contrast to NN, another drawback of ARIMA models is, that there is the learning speed very slow. The estimate of the parameters can be hardly parallelized.
In this paper, we have studied statistical and neural network techniques to predict high frequency time series for the currency between the Czech koruna and EUR (CZK/EUR) and test the training speed and forecasting accuracy of statistical learning with neural networks. It was shown that the GA and MGA can be used to train a feed-forward neural network to approximate an ARIMA model for predicting high frequency time series data. The presented results emphasize that a satisfactory learning speed can be achieved with MGA learning and also with satisfactory predictive accuracy.
Statistical learning and forecasting based on ARIMA model provides forecasting accuracy expressed by MSE = 1.44×10-08, while the calculated training time is very long. Regarding predictions by means of artificial intelligence based on neural networks, the most accurate method is NN trained with BP with MSE = 3.44×10-08 followed NN trained by GA method with MSE = 3.52×10-08 and population size = 300, NN trained by MGA method with MSE values approximately equal to 7×10-08 and population size equal to 10.
In our experiments conducted with GA and MGA methods, we also investigate the training time related to the size of populations. We found that the optimal population size is equal to 21 with training time equal to 21 s.
Generally, proposed forecast models based on advanced statistical and NN methods indicate that all investigated forecast models are very good. The use of ARIMA models is a powerful approach to the solution of many forecasting problems. It can provide extremely accurate forecasts of time series, and offers a formal, structured approach to model building and analysis. However, these models are not without several limitations [18]. In statistical models based on the B-J methodology, there is not conventional way to modify or update the estimates of the model parameters as each new observation becomes available. In contrast to NN, another drawback of ARIMA models is, that there is the learning speed very slow. The estimate of the parameters can be hardly parallelized.
In the future our main research objective is to apply developed metaheuristics on various datasets. Selected metaheuristics will be tested with different parameter combinations, and the combination of parameters which can yield approximate feasible solution in an acceptable computation time. Some changes may be done in the mutation, and selection procedure may be improved. The proposed BP algorithm may be also hibritted with another metaheuristic method.
