Neural networks trained with high-dimensional functions for approximating data in high-dimensional space

Abstract

Neural networks can approximate data because they possess many compact non-linear layers. In high-dimensional space, the curse of dimensionality makes the data distribution sparse, so the data provide insufficient information, and approximating data with neural networks becomes even harder. To address this issue, we use the Lipschitz condition to derive two deviations: the deviation of neural networks trained using high-dimensional functions, and the deviation of high-dimensional functions approximating data. The purpose is to improve the ability of neural networks to approximate data in high-dimensional space. Experimental results show that neural networks trained using high-dimensional functions outperform those trained using data in approximating data in high-dimensional space. We find that neural networks trained using high-dimensional functions are more suitable for high-dimensional space than those trained using data, so there is no need to retain sufficient data for neural network training. Our findings also suggest that, in high-dimensional space, tuning the hidden layers of neural networks hardly has a substantial positive effect on the precision of data approximation.

Keywords

Data sparsity, high-dimensional function, high-dimensional space, neural networks

1 Introduction

Neural networks are able to approximate both data and the internal linear representation of nonlinear systems [1–3], since they possess many nonlinear layers that represent highly nonlinear complex mappings [4, 5]. Despite this excellent approximation capability, neural networks have great difficulty approximating data in high-dimensional space, because the “curse of dimensionality” is a major trap for models and methods, causing inefficiency and low precision [6].

Data residing in high-dimensional space become sparser as dimensionality increases. Sparse data easily cause model instability and over-fitting [7]. Clearly, eliminating the effects of data sparsity during data analysis is a hard task. To handle the data sparsity caused by the curse of dimensionality, Taylor expansion has been used to deal with data in high-dimensional space [8]. However, Taylor expansion can hardly guarantee sufficient precision of the solution when approximating data.

Using good neural networks to deal with data has more advantages than using other models [9]. However, in extremely sparse or high-dimensional space, when a neural network approximates nonlinear data whose measured outputs are easily perturbed by stochastic disturbances [10], a larger approximation error is generated. For instance, in [11], to improve approximation precision, a neural network combined with Kachmazh’s algorithm (a hybrid network) is used for medical data approximation instead of a single neural network. Although introducing Kachmazh’s algorithm increases the complexity of network training, the approximation precision is guaranteed. Similarly, [12] shows that, to reduce approximation error, a neural network merged with convex functions is used to approximate the diet data of a patient with type-2 diabetes. Beyond that, to approximate high-dimensional data, methods that first project the data into a lower dimension and then build the neural network approximation over this lower-dimensional projection are usually selected [13], rather than approximating the high-dimensional data directly with neural networks. Overall, these studies show that it is a hard challenge to train a neural network that is suitable for high-dimensional space and approximates high-dimensional data with high quality.

High-dimensional functions are a useful tool in complex engineering and multi-dimensional data analysis, for example in analyzing multi-sensor data [14], calculating the risk associated with complex investment portfolios [15], and approximating high-dimensional data (e.g., data with more than 10 dimensions, and possibly with 100 dimensions or even more) [16, 17]. In high-dimensional space, defining high-dimensional functions to approximate data means that the approximation error is also large when the amount of data is large [18], so high-dimensional functions suffer serious negative effects when they approximate data directly. Although it is not easy to approximate data in high-dimensional space directly using high-dimensional functions, we can approximate such data through neural networks trained using high-dimensional functions. This is because neural networks can theoretically approximate any finite-dimensional continuous function uniformly on compact sets arbitrarily well [19, 20], without requiring researchers to make any hypothesis beforehand [21]. Furthermore, the universal approximation capability of neural networks states that any continuous function defined on a closed set can be uniformly approximated to an arbitrary degree of accuracy using neural networks [22].

Related studies on the capacity of neural networks to approximate functions are presented in [23–27]. Their approaches use polynomials to approximate functions and then approximate these polynomials by neural networks. Smooth functions are approximated reasonably well by polynomials, and neural networks are known to approximate monomials with high quality [23–27]. In addition, when function approximation theory is incorporated into neural networks, the training process and the precision of network adaptation are well controlled by the order of the polynomials [28]. For example, in [29], the capability of neural networks to approximate any high-order function is verified, showing that the approximation error is lower. Zahra [30] overcomes both the slow convergence of first-order learning algorithms and the inverse Hessian calculation of second-order learning algorithms by using multi-layer neural networks for function approximation. To obtain both the best approximation performance and the most stable internal state of neural networks, the Gaussian function is adopted during the adaptive adjustment of neural networks [31]. As a result, it is valuable to study how neural networks approximate high-dimensional functions.

In this work, our primary goal is to approximate data in high-dimensional space by neural networks. In particular, we aim to prove that, in high-dimensional space, neural networks trained using high-dimensional functions outperform those trained using data in terms of approximation ability. More importantly, we aim to demonstrate that, as long as the neural networks trained by high-dimensional functions are sufficiently good, there is no need to retain sufficient data for neural network training. Consequently, to ensure that the neural networks trained using high-dimensional functions are suitable for high-dimensional space, we derive two deviations: the deviation of the neural networks trained using high-dimensional functions, and the deviation of high-dimensional functions approximating data. These two deviations further provide a reference for neural network training and data approximation in high-dimensional space.

We summarize the main contributions of this work as follows:

To train neural networks that are suitable for approximating data in high-dimensional space, two deviations are derived: the deviation of the neural networks trained using high-dimensional functions, and the deviation of high-dimensional functions approximating data.

Neural networks trained using high-dimensional functions are more suitable for high-dimensional space than those trained using data, so there is no need to retain sufficient data for neural network training.

In high-dimensional space, tuning the hidden layers of neural networks hardly has a substantial positive effect on the precision of data approximation.

2 Theory

In Section 2.1, several formal definitions are first given in order to describe our approach in detail. Then, we review the Lipschitz condition, which is used to derive the two deviations. In Section 2.2, the proposed method is described. First, the two deviations are derived. Then, we explain the role and selection of candidate functions, with the purpose of providing some suggestions for exploring high-dimensional functions.

2.1 Preliminary

Some formal definitions used in this work are given first. The symbols appearing in the theory and their meanings are listed in Table 1.

Table 1
Symbol table

Symbols	Description
x_i (i = 1,2,...)	data in high-dimensional space
Ω	high-dimensional space
d	dimensionality of high-dimensional space
g^d(x)	high-dimensional dataset
f_t	a target function
f_n^#	a neural network
T_t	a candidate function
ξ	the deviation of approximating x_i using T_t
k (k > 0)	a constant
B	a constant
n	the number of neurons
Δ	a constant
Ξ (Ξ > 0)	Lipschitz constant

Definition 1. Given data x_i in Ω, the dataset g^d(x) = {x_i, i = 1, 2, ...} is composed of data x_i from Ω, and g^d(x) ∈ R^d.

Definition 2. f_t is an abstract high-dimensional function, namely a target function, which is used to approximate g^d(x); the corresponding deviation is denoted as |f_t - g^d(x)|. Moreover, f_t is also used for training the neural network f_n^#; the corresponding deviation is denoted as |f_t - f_n^#|.

Definition 3. T_t is a candidate function. Because f_t is an abstract high-dimensional function, f_t is replaced by a constructed T_t. Replacing a target function by a constructed candidate function is likely to produce a deviation when approximating data in high-dimensional space. Consequently, the term ξ is used to denote the deviation of approximating x_i using T_t.

Before deriving the two deviations, the Lipschitz condition and related lemmas are given.

Lipschitz condition. Let f(x) be defined on the domain D. If there is a constant k > 0 such that |f(x1) - f(x2)| ≤ k|x1 - x2| holds for any x1 ∈ D and x2 ∈ D, then f(x) satisfies the Lipschitz condition on D, where k is the Lipschitz constant. If |f(x1) - f(x2)| ≤ k|x1 - x2|^m holds, f(x) satisfies the m-order Lipschitz condition on D.
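As an illustration of the Lipschitz condition, the following minimal sketch estimates a Lipschitz constant empirically from sample pairs. The function `linear`, the sample size and the sampling range are assumptions made only for this example; the estimate is a lower bound on the true constant k, not a proof that the condition holds.

```python
import math
import random


def empirical_lipschitz(f, points):
    """Largest observed ratio |f(a) - f(b)| / ||a - b|| over all sample pairs.

    This is only a lower estimate of the true Lipschitz constant k.
    """
    best = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dist = math.dist(points[i], points[j])  # Euclidean distance
            if dist > 0.0:
                best = max(best, abs(f(points[i]) - f(points[j])) / dist)
    return best


def linear(x):
    # Illustrative function f(x) = 3*x1 - 4*x2; its Lipschitz constant is
    # the Euclidean norm of (3, -4), i.e. 5.
    return 3.0 * x[0] - 4.0 * x[1]


random.seed(0)
sample = [(random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)) for _ in range(50)]
k_hat = empirical_lipschitz(linear, sample)
print(k_hat)  # stays at or below 5 (up to rounding)
```

For a function with an unknown constant, such an estimate only indicates a value that k must at least reach on the sampled region.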

Lemma 1 [18]. The error of approximating data in high-dimensional space using neural networks is $b^{d} / \sqrt{n}$, where n is the number of neurons, b (b > 1) is a constant depending on the class of the activation functions, and d is the dimensionality of the high-dimensional space.
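To make the dependence of Lemma 1 on d and n concrete, the short sketch below evaluates the term b^d / sqrt(n) and the number of neurons needed to push it under a tolerance. The value b = 1.1 and the tolerance are purely illustrative assumptions, since b depends on the activation class and is not specified numerically here.

```python
import math


def lemma1_error(b, d, n):
    """Error term b**d / sqrt(n) from Lemma 1."""
    return b ** d / math.sqrt(n)


def neurons_needed(b, d, eps):
    """Smallest n with b**d / sqrt(n) <= eps, i.e. n >= (b**d / eps)**2."""
    return math.ceil((b ** d / eps) ** 2)


# Illustrative values only: b = 1.1 is an assumed constant.
for d in (10, 20, 50):
    print(d, lemma1_error(1.1, d, 100), neurons_needed(1.1, d, 0.1))
```

Because b > 1, the term grows exponentially with d for a fixed n, which is one way to read the curse of dimensionality in this bound.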

2.2 Method

2.2.1 Two deviations

Let us assume that f_t is Lipschitz continuous with respect to the Euclidean norm, which means that, for any x1 and x2, $| f_{t} (x_{1}) - f_{t} (x_{2}) | ⩽ k * \| x_{1} - x_{2} \|$ (1) with k > 0. Let us further assume that f_t is smooth, implying that it has well-defined and bounded derivatives everywhere.

Theorem 1 (The first deviation). For $f_{n}^{\#}$ trained using f_t, the deviation satisfies $| f_{t} (x_{i}) - f_{n}^{\#} (x_{i}) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x_{i} \in Ω} | \frac{d f_{t} (x)}{d x} |$ (2) for any x_i ∈ g^d(x). Here, b is a constant depending on the class of the activation functions of the hidden neurons, and b > 1. For a detailed description of b and d, please refer to [32, 33]. n is the number of neurons. The term $\mathrm{s.t.}_{x_{i} \in Ω} | \frac{d f_{t} (x_{i})}{d x_{i}} |$ is the maximal value of the derivative over the data region in Ω. The term Δ is a constant and is discussed later.

Proof. By the triangle inequality,

$\begin{matrix} | f_{t} (x_{i}) - f_{n}^{\#} (x_{i}) | \\ = | f_{t} (x_{i}) - f_{n}^{\#} (x_{i}) + g^{d} (x_{i}) - g^{d} (x_{i}) | \\ ⩽ | f_{t} (x_{i}) - g^{d} (x_{i}) | + | f_{n}^{\#} (x_{i}) - g^{d} (x_{i}) | \end{matrix}$ (3)

Using the result of Lemma 1, we have

$\begin{matrix} | f_{t} (x_{i}) - f_{n}^{\#} (x_{i}) | \\ ⩽ | f_{t} (x_{i}) - g^{d} (x_{i}) | + b^{d} / \sqrt{n} \end{matrix}$ (4)

Transforming the right part of Equation (4), with f_t replaced by the candidate function T_t, gives

$\begin{matrix} | f_{t} (x_{i}) - g^{d} (x_{i}) | + b^{d} / \sqrt{n} \\ = | T_{t} (x_{i}) - g^{d} (x_{i} \pm ξ) | + b^{d} / \sqrt{n} \\ ⩽ \mathrm{s.t.}_{| ξ | ⩽ Δ} | T_{t} (x_{i}) - g^{d} (x_{i} \pm ξ) | + b^{d} / \sqrt{n} \end{matrix}$ (5)

If ξ is equal to zero, replacing f_t by T_t introduces no deviation, meaning that T_t is a good replacement. Thereafter, we bound the term $\mathrm{s.t.}_{| ξ | ⩽ Δ}$ using the Lipschitz condition again:

$\begin{matrix} \mathrm{s.t.}_{| ξ | ⩽ Δ} | T_{t} (x_{i}) - g^{d} (x_{i} \pm ξ) | \\ ⩽ \mathrm{s.t.}_{| ξ | ⩽ Δ} Ξ * | x_{i} - (x_{i} + ξ) | \\ = | ξ | * \mathrm{s.t.}_{| ξ | ⩽ Δ} Ξ \end{matrix}$ (6)

where Ξ is the Lipschitz constant and Ξ > 0. The constant Ξ is obtained by taking the derivative of the candidate function T_t(x_i), i.e., $Ξ = | \frac{d T_{t} (x_{i} + ξ)}{d (x_{i} + ξ)} |$. Evaluating the right part of Equation (6) gives

$\begin{matrix} | ξ | * \mathrm{s.t.}_{| ξ | ⩽ Δ} Ξ = | ξ | * \mathrm{s.t.}_{| ξ | ⩽ Δ} | \frac{d T_{t} (x_{i} + ξ)}{d (x_{i} + ξ)} | \\ ⩽ Δ * \mathrm{s.t.}_{| ξ | ⩽ Δ} | \frac{d T_{t} (x_{i} + ξ)}{d (x_{i} + ξ)} | \\ ⩽ Δ * \mathrm{s.t.}_{ξ = 0} | \frac{d T_{t} (x_{i})}{d x_{i}} | \\ ⩽ Δ * \mathrm{s.t.}_{x_{i} \in Ω} | \frac{d T_{t} (x_{i})}{d x_{i}} | \end{matrix}$ $\Rightarrow | ξ | * \mathrm{s.t.}_{| ξ | ⩽ Δ} Ξ ⩽ Δ * \mathrm{s.t.}_{x_{i} \in Ω} | \frac{d T_{t} (x_{i})}{d x_{i}} |$ (7)

Substituting the result of Equation (7) into Equation (5), we have

$\begin{matrix} | f_{t} (x_{i}) - g^{d} (x_{i}) | + b^{d} / \sqrt{n} \\ ⩽ Δ * \mathrm{s.t.}_{x_{i} \in Ω} | \frac{d T_{t} (x_{i})}{d x_{i}} | + b^{d} / \sqrt{n} \end{matrix}$ (8)

Finally, substituting Equation (8) into Equation (4), we have

$\begin{matrix} | f_{t} (x_{i}) - f_{n}^{\#} (x_{i}) | \\ ⩽ Δ * \mathrm{s.t.}_{x_{i} \in Ω} | \frac{d T_{t} (x_{i})}{d x_{i}} | + b^{d} / \sqrt{n} \end{matrix}$ (9)

Hence, the proof of Theorem 1 is completed.

By Theorem 1, $| f_{t} (x_{i}) - g^{d} (x_{i}) | ⩽ | f_{t} (x_{i}) - f_{n}^{\#} (x_{i}) |$ holds. Hence, the second deviation, for a target function approximating data in high-dimensional space, can be obtained:

$\begin{matrix} | f_{t} (x_{i}) - g^{d} (x_{i}) | \\ ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x_{i} \in Ω} | \frac{d f_{t} (x)}{d x} | \end{matrix}$ (10)

Equation (10) indicates that the deviation of f_t approximating data in high-dimensional space should not be strictly greater than that of f_n^# trained using f_t. Overall, Equations (2) and (10) are the two deviations.
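The right-hand side shared by Equations (2) and (10) can be evaluated numerically once b, d, n, Δ and the derivative magnitudes on the data are available. The following is a minimal sketch under that assumption; the function names and all numeric inputs are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def deviation_bound(b, d, n, delta, derivative_magnitudes):
    """b**d / sqrt(n) + delta * (maximal derivative magnitude over the data)."""
    approximation_term = b ** d / math.sqrt(n)
    slope_term = delta * max(derivative_magnitudes)
    return approximation_term + slope_term


def satisfies_deviation(observed_error, bound):
    """True when an observed error is within the bound."""
    return observed_error <= bound


# Illustrative numbers only: 2**2 / sqrt(16) = 1.0 and 0.5 * 1.0 = 0.5.
bound = deviation_bound(b=2.0, d=2, n=16, delta=0.5,
                        derivative_magnitudes=[0.2, 1.0, 0.4])
print(bound, satisfies_deviation(1.2, bound))
```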

Next, we discuss the term Δ in Theorem 1. Let Δ depend on the distances between pairs of points within Ω, i.e., let Δ stay within the range of distances between points in Ω. The advantage of doing this is that Δ can trade off the data region covered by the term s.t._{x_i∈Ω} in Ω based on the positions of pairs of points in Ω. There are many ways to measure the distance between pairs of points in Ω, e.g., the Hausdorff metric, the Riemannian metric, etc.
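As a minimal sketch of this choice, the range of pairwise distances can be computed directly from the data, and Δ can then be picked inside that range. The Euclidean metric is used here only as an assumption for simplicity; the text above leaves the metric open.

```python
import itertools
import math


def pairwise_distance_range(points):
    """Smallest and largest Euclidean distance between distinct sample pairs."""
    distances = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return min(distances), max(distances)


# Three illustrative points on a line through the origin.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
d_min, d_max = pairwise_distance_range(points)
print(d_min, d_max)  # 5.0 10.0
```

Any Δ with d_min ≤ Δ ≤ d_max then respects the constraint described above; note that the quadratic number of pairs makes this exact computation costly for large data volumes.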

2.2.2 Candidate function T_t

The Gaussian function in [18] is adopted as a candidate function. With m denoting the data dimensionality, we have $T_{t} = m * e^{- (\sum_{i = 1}^{m} x_{i}^{2}) / m}$ (11)
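Equation (11) and its partial derivatives, which enter the derivative term of the two deviations, can be written down directly. The sketch below is one straightforward way to do so; the function names are hypothetical, and the gradient follows from differentiating Equation (11), giving ∂T_t/∂x_j = -2 x_j e^{-(Σ x_i²)/m}.

```python
import math


def candidate(x):
    """T_t(x) = m * exp(-(sum of x_i**2) / m), with m = len(x)."""
    m = len(x)
    return m * math.exp(-sum(v * v for v in x) / m)


def candidate_grad(x):
    """Partial derivatives of T_t: dT_t/dx_j = -2 * x_j * exp(-(sum x_i**2) / m)."""
    m = len(x)
    scale = math.exp(-sum(v * v for v in x) / m)
    return [-2.0 * v * scale for v in x]


print(candidate([0.0, 0.0, 0.0]))   # 3.0, the maximum of T_t for m = 3
print(candidate_grad([0.5, -1.0, 2.0]))
```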

Note that there are many ways to choose candidate functions, e.g., higher-order functions. In this research, we do not compare different ways of selecting candidate functions, because that is not the focus of this work. Nevertheless, the value of candidate functions is that they provide a valid way of verifying our approach.

3 The proposed neural network

In this section, a neural network is developed to verify our approach. We do not focus on its architecture, since it is only used to verify the proposed method. Moreover, we hope that the proposed method fits general neural networks and is not limited to a specific architecture. In Section 3.1, the architecture and hyper-parameters of the proposed neural network are given. In Section 3.2, the training and testing of the proposed neural network are described.

3.1 Architecture and hyper-parameters

The designed neural network with two hidden layers, named NN-Tt, is shown in Fig. 1. As Fig. 1 shows, the second deviation, i.e., $| f_{t} (x) - g^{d} (x) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x \in Ω} | \frac{d f_{t} (x)}{d x} |$, is used while approximating the input data. After the second deviation is reached, the first deviation, i.e., $| f_{t} (x) - f_{n}^{\#} (x) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x \in Ω} | \frac{d f_{t} (x)}{d x} |$, is used for training NN-Tt. The training and testing of NN-Tt are described in detail in Section 3.2.

Fig. 1

Architectures of the proposed neural network.

The ReLU activation function, i.e., $\hat{g} (x) = max (0, x)$, can compensate for the vanishing gradient caused by the Sigmoid and tanh activation functions. With ReLU, the gradient hardly vanishes for positive inputs. However, $\hat{g} (x)$ has zero gradient (i.e., the gradient vanishes) where the input values are less than zero. Therefore, an improved ReLU, namely the Leaky ReLU, g(x) = max(0, x) + leak * min(0, x), is used as the activation function, where leak is a very small constant. The advantage of the Leaky ReLU is that the information in g(x) is not completely lost but partly retained when the input x is less than zero. That is, the Leaky ReLU still gives a very small gradient where the input values are less than zero. As a result, the Leaky ReLU greatly reduces the probability of vanishing gradients in the proposed neural network even when the input values are less than zero.
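The Leaky ReLU described above and its gradient can be sketched in a few lines. The default `leak = 0.01` is an assumed value for illustration; the text only states that leak is a very small constant.

```python
def leaky_relu(x, leak=0.01):
    """g(x) = max(0, x) + leak * min(0, x)."""
    return max(0.0, x) + leak * min(0.0, x)


def leaky_relu_grad(x, leak=0.01):
    """Slope of g: 1 for positive inputs, leak otherwise (leak is used at x = 0)."""
    return 1.0 if x > 0.0 else leak


for value in (-2.0, 0.0, 3.0):
    print(value, leaky_relu(value), leaky_relu_grad(value))
```

Unlike the plain ReLU, the gradient for negative inputs equals leak instead of zero, so a small error signal can still propagate through inactive units.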

3.2 Training and testing

All datasets (see Section 4.4 for details) are divided into two parts, i.e., a training dataset and a test dataset. For each dataset, 80% of the data are randomly chosen as the training dataset, and the remaining 20% are used as the test dataset.
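A random 80/20 split of this kind can be sketched as follows. The function name and the fixed seed are assumptions added for reproducibility of the example; the original experiments may have used a different procedure.

```python
import random


def split_80_20(samples, seed=0):
    """Shuffle a copy of the samples and cut it into 80% training / 20% test."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = (len(shuffled) * 4) // 5  # integer arithmetic avoids rounding issues
    return shuffled[:cut], shuffled[cut:]


train, test = split_80_20(range(1000))
print(len(train), len(test))  # 800 200
```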

Training. The high-dimensional function f_t(x) approximates the input data g^d(x) until the approximation error between f_t(x) and g^d(x) satisfies the second deviation $| f_{t} (x) - g^{d} (x) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x \in Ω} | \frac{d f_{t} (x)}{d x} |$; at this point, f_t(x) is obtained. Thereafter, NN-Tt is trained using the obtained f_t(x), and its training error is calculated during training. Training stops once the training error of NN-Tt satisfies the first deviation $| f_{t} (x) - f_{n}^{\#} (x) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x \in Ω} | \frac{d f_{t} (x)}{d x} |$. Finally, the trained NN-Tt is saved. The training algorithm of NN-Tt is listed in Table 2.

Table 2
Training algorithm

Initialize hyper-parameters: learning rate, number of neurons, etc.;
Input: training data g^d(x);
While the second deviation is not satisfied:
    f_t(x) approximates g^d(x);
    Calculate the approximation error |f_t(x) - g^d(x)|;
    If |f_t(x) - g^d(x)| satisfies the second deviation $| f_{t} (x) - g^{d} (x) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x \in Ω} | \frac{d f_{t} (x)}{d x} |$:
        Stop approximating g^d(x) with f_t(x);
        Output f_t(x);
End;
Input f_t(x) into the input layer of NN-Tt;
While the first deviation is not satisfied:
    Train NN-Tt;
    Calculate the training error $| f_{t} (x) - f_{n}^{\#} (x) |$;
    If $| f_{t} (x) - f_{n}^{\#} (x) |$ satisfies the first deviation $| f_{t} (x) - f_{n}^{\#} (x) | ⩽ b^{d} / \sqrt{n} + Δ * \mathrm{s.t.}_{x \in Ω} | \frac{d f_{t} (x)}{d x} |$:
        Stop training;
        Save the trained NN-Tt;
End;
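The control flow of Table 2 can be sketched on a toy problem. Everything concrete below is an assumption made for illustration: the one-dimensional data, the single-parameter models standing in for f_t and for NN-Tt, the learning rate, and the numeric placeholders `SECOND_BOUND` and `FIRST_BOUND` that replace the right-hand sides of the two deviations. It is not the authors' implementation.

```python
import math


def train_until(step, error, bound, max_iters=5000):
    """Call step() repeatedly and stop as soon as error() <= bound."""
    for iteration in range(1, max_iters + 1):
        step()
        if error() <= bound:
            return iteration, True
    return max_iters, False


# Toy one-dimensional data standing in for g^d(x) (hypothetical).
xs = [i / 10.0 for i in range(-20, 21)]
basis = [math.exp(-x * x) for x in xs]
data = [2.0 * v for v in basis]

# a: amplitude of the stand-in target function, w: weight of the stand-in network.
params = {"a": 0.0, "w": 0.0}
LEARNING_RATE = 0.5
SECOND_BOUND = 0.01  # placeholder for the second deviation
FIRST_BOUND = 0.01   # placeholder for the first deviation


def f_t(i):
    return params["a"] * basis[i]


def f_n(i):
    return params["w"] * basis[i]


def fit_target_step():
    # Gradient step on half the mean squared error between f_t and the data.
    grad = sum((f_t(i) - data[i]) * basis[i] for i in range(len(xs))) / len(xs)
    params["a"] -= LEARNING_RATE * grad


def target_error():
    return max(abs(f_t(i) - data[i]) for i in range(len(xs)))


def fit_network_step():
    # Gradient step on half the mean squared error between f_n and f_t.
    grad = sum((f_n(i) - f_t(i)) * basis[i] for i in range(len(xs))) / len(xs)
    params["w"] -= LEARNING_RATE * grad


def network_error():
    return max(abs(f_t(i) - f_n(i)) for i in range(len(xs)))


# Stage 1: f_t approximates the data until the second deviation is met.
iters_1, reached_1 = train_until(fit_target_step, target_error, SECOND_BOUND)
# Stage 2: the network is trained on f_t until the first deviation is met.
iters_2, reached_2 = train_until(fit_network_step, network_error, FIRST_BOUND)
print(reached_1, reached_2)
```

The `max_iters` cap is a safeguard that Table 2 does not mention; without it, a bound that cannot be reached would make either loop run forever.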

Testing. After NN-Tt training is completed, its ability to approximate data in high-dimensional space is tested using the test datasets.

4 Experimental settings

In Section 4.1, assessment metrics are given to evaluate the precision with which neural networks approximate high-dimensional data. In addition, we analyze the statistical significance of the difference between the proposed neural network and the competitor using the t-test. In Section 4.2, the competitor and its parameters are presented. In Section 4.3, an assessment metric is given to quantify data sparsity. In Section 4.4, 8 synthetic datasets are generated and 4 real-world datasets are selected from the three perspectives of data sparsity, high dimensionality and data volume.

4.1 Assessment metrics

The receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) are used to assess the accuracy of approximating high-dimensional data using neural networks. In addition, the mean square error (mse) and standard deviation (sd) are applied to analyze the approximation error. They are calculated as follows: $\begin{matrix} mse = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - p_{i})^{2} \\ sd = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} (p_{i} - \frac{1}{N} \sum_{j = 1}^{N} p_{j})^{2}} \end{matrix}$ (12) where y_i is the test value, p_i is the predicted value, and N is the data volume. To test the statistical significance of the difference between the average performances, we used the t-test.
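Equation (12) translates directly into code. The sketch below follows the formula as written, i.e., sd is the population standard deviation (division by N) of the predicted values; the function names and the sample numbers are illustrative assumptions.

```python
import math


def mse(y, p):
    """Mean square error between test values y and predicted values p."""
    return sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y)


def sd(p):
    """Standard deviation of the predicted values, dividing by N as in Eq. (12)."""
    mean_p = sum(p) / len(p)
    return math.sqrt(sum((pi - mean_p) ** 2 for pi in p) / len(p))


print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(sd([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 2.0
```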

4.2 Comparison methods and parameters

The proposed NN-Tt is compared with a feed-forward neural network (NN) [34] with five hidden layers. To ensure a fair comparison and a convincing conclusion, the same activation function (see Section 3.1) is used for NN-Tt and NN. For both NN-Tt and NN, the number of neurons in each hidden layer is 50. Given the data dimensionality and data volume, this number of neurons is considered sufficiently large, but not too large. Unless otherwise stated, NN-Tt and NN ran on a GPU using the same default hyper-parameters, e.g., regularization parameter, learning rate, etc.

4.3 Data sparsity assessment

By calculating the relationship between the 1-norm and the 2-norm, Patrik [35] quantifies data sparsity. In this work, data sparsity is calculated using the formula given by Patrik: $sp (x) = \frac{\sqrt{n} - (\sum | x_{i} |) / \sqrt{\sum x_{i}^{2}}}{\sqrt{n} - 1}$ (13) where x is the dataset and n is the number of elements in x. The term sp(x) is the quantified data sparsity, with sp(x) ∈ [0,1]. A larger sp(x) indicates sparser data: sp(x) = 1 when x has a single non-zero element, and sp(x) = 0 when all elements have the same magnitude.
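Equation (13) can be checked on two extreme cases. In the sketch below, n is taken to be the number of elements of x, which is how the measure is normally defined; the function name is hypothetical, and x must contain more than one element and at least one non-zero value.

```python
import math


def sparsity(x):
    """sp(x) = (sqrt(n) - L1 / L2) / (sqrt(n) - 1) for a non-zero vector x."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


print(sparsity([0.0, 0.0, 0.0, 5.0]))  # 1.0: a single non-zero element
print(sparsity([1.0, 1.0, 1.0, 1.0]))  # 0.0: all elements of equal magnitude
```

Under this formula, values close to 1, such as those reported for the real-world datasets in Table 3, correspond to vectors dominated by a few non-zero entries.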

4.4 Datasets description

Several high-dimensional synthetic and real-world datasets are considered. For the synthetic datasets, we consider three aspects: data volume, data dimensionality and data sparsity; the latter two are used to discuss the capability of NN and NN-Tt to approximate high-dimensional sparse data. On this basis, 8 synthetic datasets are generated (for the generation method, refer to [36]). With data volume and dimensionality fixed, data sparsity gradually increases from 0.6 to 0.97 (data sparsity is calculated by Equation (13)).

For the real-world datasets, 4 sparse high-dimensional datasets (http://archive.ics.uci.edu/ml/) are selected. Details of the 8 synthetic datasets and the 4 real-world datasets are listed in Table 3.

Table 3
Description of the synthetic and real-world datasets

Synthetic datasets	Data dimensionality	Data volume	Data sparsity
s1	1000	1000	0.60
s2	1000	1000	0.66
s3	1000	1000	0.71
s4	1000	1000	0.76
s5	1000	1000	0.81
s6	1000	1000	0.91
s7	1000	1000	0.94
s8	1000	1000	0.97
Real-world datasets	Data dimensionality	Data volume	Data sparsity
CNAE	857	1080	0.9355
FPS	4813	3600	0.9368
Zemberek	5692	3600	0.9372
TTC	7570	3600	0.9402

5 Results and discussion

Experimental results, including mse, sd and precision, are presented in this section. We also analyzed the statistical significance of the results on the test datasets by means of the t-test (p-value < 0.05 for mse).

All experimental results show that the approximation performance of NN-Tt is significantly better than that of NN in all considered cases. The t-tests indicate that the difference between NN-Tt and NN in the ability to approximate data in high-dimensional space is statistically significant. Hence, these results together confirm our expectation that, in high-dimensional space, neural networks trained using high-dimensional functions suffer fewer negative effects of data sparsity when approximating data than neural networks trained using data. The following subsections detail the experimental results.

5.1 Experiments on synthetic datasets

The results in Table 4 show that the mse and sd obtained by NN-Tt are all lower than those obtained by NN. The t-test results in Table 4 (p < 0.05 on all datasets) indicate that the difference in the accuracy of approximating high-dimensional data between the proposed NN-Tt and the competitor NN is statistically significant. To illustrate this intuitively, we visualize the results of NN-Tt and NN on the 8 synthetic datasets in Fig. 2. The accuracy of data approximation using NN-Tt and NN is displayed in Fig. 3.

Table 4
Results of mse and sd on synthetic datasets. Entries are given as mse, {sd}, [p-value]. The experiments were carried out independently 500 times. The t-test is applied to mse; * marks significance at p < 0.05.

Dataset	NN mse, {sd}	NN-Tt mse, {sd}	[p-values]
s1	3.01e-5, {3.39e-5}	1.05e-5, {1.04e-5}	[3.764E-4]*
s2	2.86e-5, {1.77e-5}	1.31e-5, {1.37e-5}	[3.407E-3]*
s3	8.21e-5, {8.31e-5}	2.21e-5, {3.22e-5}	[2.619E-4]*
s4	4.23e-5, {5.09e-4}	2.59e-5, {1.41e-5}	[4.557E-4]*
s5	1.03e-4, {2.54e-4}	2.04e-5, {1.15e-5}	[2.297E-4]*
s6	1.98e-4, {1.72e-4}	1.74e-5, {2.10e-5}	[9.751E-5]*
s7	1.84e-4, {2.81e-4}	3.33e-5, {1.08e-5}	[1.212E-4]*
s8	1.87e-4, {4.71e-4}	3.32e-5, {2.72e-5}	[1.650E-3]*

Fig. 2

Data approximated by NN-Tt and NN.

Fig. 3

Results of approximation precision.

From Figs. 2 and 3, several observations can be made: (i) In line with the significant differences reported in Table 4, NN-Tt is statistically better than NN in approximation precision in Fig. 3. From this perspective, NN-Tt is regarded as the winner. (ii) NN-Tt shows a clear advantage over NN in approximating high-dimensional data, as shown in Fig. 2. In contrast, NN exhibits distortion in approximating high-dimensional data when data sparsity exceeds 0.8, e.g., on datasets s5, s6, s7 and s8. (iii) Neural networks trained using data have difficulty obtaining good results in approximating data in high-dimensional space, while superior results are easily obtained by neural networks trained using high-dimensional functions.

Figure 4 displays the running time of NN-Tt and NN on the 8 synthetic datasets. With dimensionality and data volume fixed, the running time of both neural networks drops as data sparsity increases. Compared with NN, NN-Tt has no advantage in running time, because NN-Tt needs to calculate the two deviations during each iteration. This calculation depends on the data dimensionality m1 and the data volume m2, so the computational complexity is O(t) = c1m1 + c2m2, i.e., O(n³) >> O(t) > O(n²), where c1 and c2 are constants. NN does not have to perform this calculation. Overall, NN-Tt spends considerable time calculating the deviations, which increases its running time.

Fig. 4

Results on running-time.

5.2 Experiments on real-world datasets

The results on real-world datasets in Table 5 show that the approximation accuracy of NN-Tt is clearly higher than that of NN. For instance, for TTC, whose data dimensionality is 7570 and whose data sparsity reaches 0.9402, the precision of NN-Tt is 24.4 percentage points higher than that of NN (0.772 versus 0.528). Beyond that, the t-test results in Table 5 demonstrate that the difference in approximation precision between the proposed NN-Tt and the competitor NN is statistically significant. This implies that, in high-dimensional space, superior results are easier to obtain using NN-Tt than using NN.

Table 5
Approximation precision on real-world datasets. The experiments were carried out independently 500 times. The t-test is applied to precision (p < 0.05). Entries are given as {precision value}, [p-value].

Datasets	NN-Tt	NN	p-values
CNAE	{0.875±0.011}	{0.706±0.023}	[4.115E-5]*
FPS	{0.822±0.08}	{0.647±0.31}	[6.298E-6]*
Zemberek	{0.806±0.011}	{0.611±0.037}	[8.562E-6]*
TTC	{0.772±0.022}	{0.528±0.026}	[1.539E-5]*
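The 24.4 figure quoted for TTC corresponds to the absolute difference between the mean precisions in Table 5. The short sketch below recomputes this gap for all four datasets from the table values; only the means are used, the ± terms are ignored, and the variable names are hypothetical.

```python
# (data dimensionality, NN-Tt mean precision, NN mean precision) from Table 5
precision = {
    "CNAE": (857, 0.875, 0.706),
    "FPS": (4813, 0.822, 0.647),
    "Zemberek": (5692, 0.806, 0.611),
    "TTC": (7570, 0.772, 0.528),
}

# Absolute gap between the two mean precisions, rounded to three decimals.
gaps = {name: round(nn_tt - nn, 3) for name, (_, nn_tt, nn) in precision.items()}

for name, (dim, _, _) in precision.items():
    print(name, dim, gaps[name])
```

On these four datasets the gap grows with the data dimensionality, although four points are too few to draw a general conclusion.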

To explore the capability of neural networks to approximate data in high-dimensional space, we increase the number of hidden layers of NN (from 5 to 12) without changing the other parameters. The number of hidden layers of NN-Tt is kept constant. We ran the experiments independently 500 times; the comparison with NN-Tt is shown in Fig. 5.

Fig. 5

Approximated precision on the four real-world datasets.

Several observations can be drawn from Fig. 5: (i) Even if the number of hidden layers of NN is increased, the approximation precision of NN-Tt is still better than that of NN. (ii) In high-dimensional space, tuning the hidden layers hardly has a substantial positive effect on the approximation ability of neural networks trained using data. (iii) When the number of hidden layers of NN exceeds 10, its approximation capability hardly increases. This implies that the approximation ability of neural networks trained using data can hardly be improved substantially once the number of hidden layers reaches a certain value.

By analyzing the experimental results, we demonstrate the advantages of the proposed approach compared to the state-of-the-art methods, i.e., (i) High-dimensional functions can train higher-quality neural networks that are suitable for high-dimensional space, so there is no need to retain sufficient data for neural network training. (ii) In scenarios where data are scarce, training with high-dimensional functions can fully replace training with data. (iii) In high-dimensional space, neural networks trained using data have more difficulty obtaining good results in approximating data, while superior results are more easily obtained by neural networks trained using high-dimensional functions.

5.3 Discussion

Equation (2) is used to constrain the deviation of neural network training, so that the training deviation is kept within a tolerable range and the resulting neural networks have high accuracy and are suitable for high-dimensional space. Equation (10) limits the deviation of approximating data in high-dimensional space using neural networks. With Equations (2) and (10), neural networks show better dimensional adaptability and resistance to data sparsity in high-dimensional space. Although the capability of neural networks to approximate data is improved under the constraints of Equations (2) and (10), there are also some disadvantages; for example, when a large amount of data is approximated, the computational time complexity increases. However, the time complexity O(t) is acceptable, i.e., O(n³) >> O(t) > O(n²).

Neural network training requires sufficient data, which may not be available in practical applications. Consequently, the value of our approach is to provide a reference for neural network training. We have shown the rationality and validity of the proposed idea both theoretically and experimentally. However, we do not demonstrate how to select high-dimensional functions, because that is not the focus of our work. We do suggest that the selection of high-dimensional functions consider smooth functions, high-order functions, etc.
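Since the deviations are derived under the Lipschitz condition, one possible sanity check when comparing candidate functions is to estimate how steep each one is on sampled point pairs. The sketch below is not the authors' selection procedure. The helper `empirical_lipschitz`, the two candidate functions, and the unit-cube domain are all hypothetical choices for illustration, and the estimate is only a lower bound on the true Lipschitz constant.

```python
import math
import random


def empirical_lipschitz(f, dim, n_pairs=2000, seed=0):
    """Lower-bound estimate of the Lipschitz constant of f on [0, 1]^dim.

    Returns the largest ratio |f(x) - f(y)| / ||x - y|| over random pairs.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(n_pairs):
        x = [rng.random() for _ in range(dim)]
        y = [rng.random() for _ in range(dim)]
        dist = math.dist(x, y)
        if dist == 0.0:
            continue  # identical points give no information
        best = max(best, abs(f(x) - f(y)) / dist)
    return best


def smooth_candidate(x):
    # Hypothetical smooth function: mean of sines.
    # Its gradient norm is at most 1 / sqrt(len(x)).
    return sum(math.sin(v) for v in x) / len(x)


def rough_candidate(x):
    # Hypothetical non-smooth function: a step across the cube's mid-plane.
    return 1.0 if sum(x) > len(x) / 2 else 0.0


if __name__ == "__main__":
    dim = 20
    print("smooth:", empirical_lipschitz(smooth_candidate, dim))
    print("rough: ", empirical_lipschitz(rough_candidate, dim))
```

A smaller estimate suggests a candidate that varies more gently, which is in line with the suggestion to prefer smooth functions. Because random pairs are rarely close together, the estimate can badly understate the constant of a discontinuous function, so it is best read as a relative comparison.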

6 Conclusion

In this paper, to address the issue of approximating data in high-dimensional space using neural networks, we derived two deviations according to the Lipschitz condition, i.e., the deviation of neural networks trained using high-dimensional functions and the deviation of high-dimensional functions approximating data in high-dimensional space. These two deviations provide a reference for improving the precision of neural network training in high-dimensional space. In the future, we will explore more approaches to improve the precision of approximating data in high-dimensional space.

Acknowledgment

This work was supported by the Science and Technology Research Program of Chongqing Municipal Education Commission of China under Grants KJQN201903003 and KJQN202003001, by the Chongqing Municipal Education Commission of China under Grant 192072, and by the Higher Education of Chongqing Municipal Education Commission of China under Grant CQGJ20ZX021.

References

1. Lusch Bethany, Kutz J. Nathan and Brunton Steven L., Deep learning for universal linear embeddings of nonlinear dynamics [J], Nature Communications 9 (2018), 1–10.

2. Le Cun, Bengio and Hinton, Deep learning [J], Nature 521 (2015), 436–444.

3. Chen Tao, Xu Ruifeng, He Yulan, et al., Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN [J], Expert Systems with Applications 72 (2017), 221–230.

4. Hosseini-Asl Ehsan, Zurada Jacek M. and Nasraoui Olfa, Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints [J], IEEE Transactions on Neural Networks and Learning Systems 27(12) (2016), 2486–2498.

5. Huang Andong, Zhong Zheng, Guo Yongxin, et al., An artificial neural network-based electrothermal model for GaN HEMTs with dynamic trapping effects consideration [J], IEEE Transactions on Microwave Theory and Techniques 64(8) (2016), 2519–2528.

6. Emad Oliaee Seyed Mohammad, Shoorehdeli Mahdi Aliyari and Teshnehlab Mohammad, Faults Detecting of High-Dimension Gas Turbine by Stacking DNN and LLM [C], 2018 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS), IEEE, 2018.

7. Mocanu Decebal Constantin, Mocanu Elena, Stone Peter, et al., Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science [J], Nature Communications 9 (2018), 1–18.

8. Huang Andong, Zhong Zheng and Guo Yong-Xin, A dimension-reduced artificial neural network for the compact modeling of semiconductor devices [C], 2018 IEEE MTT-S International Wireless Symposium (IWS), IEEE, 2018.

9. Bai Zhou and Kremer Stefan C., Sequence learning: analysis and solutions for sparse data in high dimensional spaces [J], IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (2012), 298–305.

10. Quang Minh, Nguyen Huu-Thiet and Cheah Chien Chern, Data-driven Learning for Approximation of Nonlinear Functions with Stochastic Disturbances [C], 2020 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), IEEE, 2020.

11. Astafyev Andrey N., Gerashchenko Sergey I., Markuleva Marina V., et al., Neural Network System for Medical Data Approximation [C], 2020 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus), IEEE, 2020.

12. Calafiore Giuseppe C., Gaubert Stephane and Possieri Corrado, A Universal Approximation Result for Difference of Log-Sum-Exp Neural Networks [J], IEEE Transactions on Neural Networks and Learning Systems 31(12) (2020), 5603–5612.

13. Andras Peter, Random Projection Neural Network Approximation [C], 2018 International Joint Conference on Neural Networks (IJCNN), IEEE, 2018.

14. Wang Shulin and Chen Fang, Spectral Clustering of High-dimensional Data via Nonnegative Matrix Factorization [C], 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015.

15. Lever Jake, Krzywinski Martin and Altman Naomi, Principal component analysis [J], Nature Methods 14(7) (2017), 641–642.

16. Deng Yue, Bao Feng and Dai Qionghai, Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning [J], Nature Methods 16 (2019), 311–314.

17. Zhu, Liu Jeremiah Z. and Cauley Stephen F., Image reconstruction by domain transform manifold learning [J], Nature 555 (2018), 487–492.

18. Andras, Function approximation using combined unsupervised and supervised learning [J], IEEE Transactions on Neural Networks and Learning Systems 25(3) (2014), 495–505.

19. Voevoda Alexander A. and Romannikov Dmitry O., Synthesis of a Neural Network for N-Dimension Surfaces Approximation [C], 2018 XIV International Scientific-Technical Conference on Actual Problems of Electronics Instrument Engineering (APEIE), IEEE, 2018.

20. Guliyev N.J. and Ismailov V.E., On the approximation by single hidden layer feedforward neural networks with fixed weights [J], Neural Networks 98 (2018), 296–304.

21. Emin Orhan and Ma Wei Ji, Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback [J], Nature Communications 8(1) (2017), 1–14.

22. Zainuddin Zarita and Fard Saeed Panahian, Approximation of multivariate 2π-periodic functions by multiple 2π-periodic approximate identity neural networks based on the universal approximation theorems [C], 2015 11th International Conference on Natural Computation (ICNC), IEEE, 2015.

23. Petersen and Voigtlaender, Optimal approximation of piecewise smooth functions using deep ReLU neural networks [J], Neural Networks 108 (2018), 296–330.

24. Schwab and Zech, Deep learning in high dimension: Neural network expression rates for generalized polynomial chaos expansions in UQ [J], Anal. Appl. 17(01) (2019), 19–55.

25. Voigtlaender and Petersen, Approximation in Lp(μ) with deep ReLU neural networks [J], arXiv (2019), 1904.04789.

26. Lu, Shen, Yang and Zhang, Deep network approximation for smooth functions [J], arXiv (2020), 2001.03040.

27. Yarotsky, Error bounds for approximations with deep ReLU networks [J], Neural Networks 94 (2017), 103–114.

28. Chan Ka-Hou, Im Sio-Kei and Ke Wei, Self-Adaptive Layer: An Application of Function Approximation Theory to Enhance Convergence Efficiency in Neural Networks [C], 2020 International Conference on Information Networking (ICOIN), IEEE, 2020.

29. Jia Yaohui, Chen Feng, Wu Peng, et al., A Study of Online Function Approximation System Based on BP Neural Network [C], 2019 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), IEEE, 2019.

30. Ghorrati Zahra, A New Adaptive Learning algorithm to train Feed-Forward Multi-layer Neural Networks, Applied on Function Approximation Problem [C], 2020 Fourth IEEE International Conference on Robotic Computing (IRC), IEEE, 2020.

31. Chu Yundi, Fei Juntao and Hou Shixi, Adaptive global sliding-mode control for dynamic systems using double hidden layer recurrent neural network structure [J], IEEE Transactions on Neural Networks and Learning Systems 31(4) (2020), 1297–1309.

32. Niyogi and Girosi, Generalization bounds for function approximation from scattered noisy data [J], Advances in Computational Mathematics 10(1) (1999), 51–80.

33. Barron A.R., Universal approximation bounds for superpositions of a sigmoidal function [J], IEEE Transactions on Information Theory 39(3) (1993), 930–945.

34. Huang Yuxuan, Capretz Luiz Fernando and Ho Danny, Neural Network Models for Stock Selection Based on Fundamental Analysis [C], 2019 IEEE Canadian Conference of Electrical and Computer Engineering (CCECE), IEEE, 2019.

35. Hoyer Patrik O., Non-negative matrix factorization with sparseness constraints [J], Journal of Machine Learning Research 5 (2004), 1457–1469.

36. Campos G.O., Zimek, Sander, Campello R.J.G.B., et al., On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study [J], Data Mining & Knowledge Discovery 30 (2016), 891–927.