Why neural networks in the first place: a theoretical explanation

Abstract

Neural networks – specifically, deep neural networks – are, at present, the most effective machine learning techniques. There are reasonable explanations of why deep neural networks work better than traditional “shallow” ones, but the question remains: why neural networks in the first place? Why not networks consisting of non-linear functions from some other family of functions? In this paper, we provide a possible theoretical answer to this question: we show that, of all families with the smallest possible number of parameters, families corresponding to neurons are indeed optimal for all optimality criteria that satisfy some reasonable requirements, namely, for all optimality criteria which are final and invariant with respect to coordinate changes, changes of measuring units, and similar linear transformations.

Keywords

Neural networks; invariance; function approximation; theoretical explanation

1 Formulation of the problem

A natural question. At present, the most successful machine learning tools are deep neural networks – a specific case of neural networks; see, e.g., [2]. This empirical success leads to a natural question: why are deep neural networks so successful? There are some theoretical explanations of why deep neural networks are more successful than traditional neural networks; see, e.g., [2–4]. There are also some explanations of why neural networks are usually more effective than some other techniques, e.g., than support vector machines [1].

However, a general question remains: why are neural networks in general so effective in the first place? This question is not only about computer applications: artificial neural networks started by simulating biological neurons, which largely perform similar data processing. Biological neurons are a product of billions of years of improving evolution, so the fact that this type of data processing is used in biological neurons is a good indication that such data processing is effective – but why?

Let us formulate this question in more precise terms. A neural network is composed of neurons, each of which transforms the inputs x₁, …, x_n into an output value $y = s (w_{1} \cdot x_{1} + \dots + w_{n} \cdot x_{n} + w_{0})$ (1) for some coefficients w_i. In other words:

First, we form a generic linear combination of the inputs: $X \overset{def}{=} w_{1} \cdot x_{1} + \dots + w_{n} \cdot x_{n} + w_{0} .$ (2)

Then, we apply a non-linear function s(x) of one variable – known as the activation function – to this linear combination X.

In these terms, the above question is: why is the family (1) of non-linear functions more effective than other possible families of non-linear functions?
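The two steps behind formula (1) can be sketched in a few lines of Python. This is only an illustration: the function name `neuron` and the default choice of `math.tanh` as the activation function s are assumptions made here, not something fixed by the text.

```python
import math

def neuron(x, w, w0, s=math.tanh):
    """Formula (1): y = s(w_1*x_1 + ... + w_n*x_n + w_0).

    The name `neuron` and the default activation tanh are illustrative choices.
    """
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    # Step 1, formula (2): generic linear combination X of the inputs.
    X = sum(wi * xi for wi, xi in zip(w, x)) + w0
    # Step 2: apply the activation function s to X.
    return s(X)

# Here X = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, so the output is tanh(0).
y = neuron([1.0, 2.0], [0.5, -0.25], 0.0)
```

Any other non-linear function of one variable can be passed as `s`; the structure (linear combination first, one-variable non-linearity second) is what defines the family (1).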

Let us simplify this question. In order to answer this question, let us perform the following two simplifications.

First, let us notice that in the generic linear expression (2), the last term w₀ is different from all the other terms. To make this formula more uniform, let us follow the usual arrangement and introduce an auxiliary variable x₀ = 1. Then, the formula (1) takes the form $s (w_{0} \cdot x_{0} + \dots + w_{n} \cdot x_{n}) .$ (3)

Second, let us take into account that in many cases, the output signal y represents the value of some physical quantity. This happens, e.g., at the last layer of the neural network, when we generate the computation’s result – and in a prediction problem, this result is about the future value of the quantity of interest (e.g., the next moment’s distance between a mobile robot and a nearby wall). The numerical value of a quantity depends on the choice of a measuring unit. If we select a new measuring unit which is C times smaller than the original one, then all numerical values will be multiplied by C: e.g., if we replace meters by centimeters, all values are multiplied by 100. In the new units, the output of the neuron takes the form $C \cdot s (w_{0} \cdot x_{0} + \dots + w_{n} \cdot x_{n}) .$ (4) From this viewpoint, instead of considering a family of all the functions (3) corresponding to different values w_i, it makes sense to consider a more general family (4) corresponding to all possible values of C and w_i.
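The two simplifications can be illustrated with a short Python sketch. The helper names `with_auxiliary_input` and `neuron_family`, the sample weights, and the use of `math.tanh` for s are assumptions introduced for this example only.

```python
import math

def with_auxiliary_input(x):
    """Prepend the auxiliary variable x_0 = 1 to the inputs x_1, ..., x_n."""
    return [1.0] + list(x)

def neuron_family(x, w, C=1.0, s=math.tanh):
    """Formula (4): C * s(w_0*x_0 + ... + w_n*x_n), where x[0] is expected to be 1."""
    return C * s(sum(wi * xi for wi, xi in zip(w, x)))

x = with_auxiliary_input([2.0, -1.0])    # (x_0, x_1, x_2) = (1, 2, -1)
w = [0.3, 0.5, 0.1]                      # (w_0, w_1, w_2), illustrative values
y_m = neuron_family(x, w)                # output in the original unit (C = 1)
y_cm = neuron_family(x, w, C=100.0)      # same output in a unit 100 times smaller
```

With C = 1 this reduces to formula (3); a change of the output unit only changes the factor C, which is why C is treated as one more parameter of the family.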

What we do in this paper. In this paper, we explain why the family (4) is better than other possible families of nonlinear functions that have the same (or smaller) number of parameters. This provides a possible theoretical explanation of why neural networks – in particular, deep neural networks – are so effective.

2 Analysis of the problem

Natural robustness requirement on transformation functions. Inputs to data processing come from measurements, and measurements are never absolutely accurate. There is, in general, a non-zero difference between the measurement result ${\tilde{x}}_{i}$ and the actual (unknown) value x_i of the measured quantity. This difference is known as the measurement error. This difference affects the result of data processing. We want to make sure that the corresponding effect is not amplified too much: we want to make sure that the difference in the results is at most proportional to the measurement errors, i.e., that the corresponding transformation y = f(x₀, …, x_n) satisfies the following inequality: $| f ({\tilde{x}}_{0}, \dots, {\tilde{x}}_{n}) - f (x_{0}, \dots, x_{n}) | \leq L \cdot (| {\tilde{x}}_{0} - x_{0} | + \dots + | {\tilde{x}}_{n} - x_{n} |)$ (5) for some coefficient L. Such functions are known as Lipschitz continuous.

It is known that Lipschitz continuous functions are almost everywhere differentiable, and many of their properties are similar to properties of smooth (everywhere differentiable) functions.
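As a numerical illustration of inequality (5), the following Python sketch checks it on random inputs for a neuron with activation tanh. The weights, the random ranges, and the helper name `lipschitz_holds` are assumptions made for this example. For tanh, whose derivative never exceeds 1 in absolute value, the coefficient L can be taken as the largest |w_i|.

```python
import math
import random

def neuron(x, w, s=math.tanh):
    """s(w_0*x_0 + ... + w_n*x_n), as in formula (3)."""
    return s(sum(wi * xi for wi, xi in zip(w, x)))

def lipschitz_holds(f, x, x_tilde, L, tol=1e-12):
    """Check inequality (5) for one pair (x, x_tilde), up to rounding tolerance tol."""
    lhs = abs(f(x_tilde) - f(x))
    rhs = L * sum(abs(a - b) for a, b in zip(x_tilde, x))
    return lhs <= rhs + tol

random.seed(0)
w = [0.7, -1.5, 0.2]                     # illustrative weights
L = max(abs(wi) for wi in w)             # valid for tanh, since |tanh'| <= 1

checks = []
for _ in range(1000):
    x = [random.uniform(-3.0, 3.0) for _ in w]
    x_tilde = [xi + random.uniform(-0.1, 0.1) for xi in x]   # simulated measurement
    checks.append(lipschitz_holds(lambda v: neuron(v, w), x, x_tilde, L))
all_ok = all(checks)
```

A random check like this can only fail to find a violation; it does not prove Lipschitz continuity. By contrast, a function such as the square of an input violates (5) for any fixed L once the inputs are large enough.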

What do we mean by a family of functions. We are interested in functions of n + 1 variables x₀, …, x_n. The expression (4) describes a family of such functions that depends, in addition to the multiplicative factor C, on n + 1 parameters w₀, …, w_n, for a total of n + 2 parameters. Since we are interested in families with the same (or smaller) number of parameters, we need to consider families that also depend on no more than n + 2 parameters.

We also want to make sure that a family is uniquely determined by its functions, so if we simply change the parameters without changing the class of functions, we will end up with the same family.

Definition 1. Let n and p be positive integers.

We say that two nonlinear Lipschitz continuous mappings f(x₀, …, x_n, c₀, …, c_p) and $g (x_{0}, \dots, x_{n}, c_{0}^{'}, \dots, c_{p}^{'})$ are equivalent if the following two conditions are satisfied:

for each C and for each tuple c = (c₀, …, c_p), there exists a value C′ and a tuple $c^{'} = (c_{0}^{'}, \dots, c_{p}^{'})$ for which, for all x_i, we have: $C \cdot f (x_{0}, \dots, x_{n}, c_{0}, \dots, c_{p}) = C^{'} \cdot g (x_{0}, \dots, x_{n}, c_{0}^{'}, \dots, c_{p}^{'});$

for each C′ and for each tuple $c^{'} = (c_{0}^{'}, \dots, c_{p}^{'})$ , there exists a value C and a tuple c = (c₀, …, c_p) for which, for all x_i, we have: $C^{'} \cdot g (x_{0}, \dots, x_{n}, c_{0}^{'}, \dots, c_{p}^{'}) = C \cdot f (x_{0}, \dots, x_{n}, c_{0}, \dots, c_{p}) .$

By a family, we mean an equivalence class of functions – in terms of the above equivalence.

We say that a function t(x₀, …, x_n) belongs to the family $F$ – as defined by its element $f (x_{0}, \dots, x_{n}, c_{0}, \dots, c_{p})$ – if there exist values C, c₀, …, c_p for which, for all x_i, we have $t (x_{0}, \dots, x_{n}) = C \cdot f (x_{0}, \dots, x_{n}, c_{0}, \dots, c_{p}) .$
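As a small worked illustration of Definition 1, consider the case p = n and two parametrizations that differ only by a factor of 2 in the parameters. The specific mappings f and g below are chosen here for illustration and do not come from the original text; s is assumed to be a nonlinear Lipschitz continuous function of one variable.

```latex
f(x_0,\dots,x_n,c_0,\dots,c_n) = s(c_0 \cdot x_0 + \dots + c_n \cdot x_n), \qquad
g(x_0,\dots,x_n,c_0',\dots,c_n') = s(2 c_0' \cdot x_0 + \dots + 2 c_n' \cdot x_n).

% First condition: given C and c, take C' = C and c_i' = c_i / 2. Then
C \cdot f(x_0,\dots,x_n,c_0,\dots,c_n)
  = C \cdot s\Big(2 \cdot \tfrac{c_0}{2} \cdot x_0 + \dots + 2 \cdot \tfrac{c_n}{2} \cdot x_n\Big)
  = C' \cdot g(x_0,\dots,x_n,c_0',\dots,c_n').

% Second condition: given C' and c', take C = C' and c_i = 2 c_i'. Then
C' \cdot g(x_0,\dots,x_n,c_0',\dots,c_n')
  = C \cdot s(c_0 \cdot x_0 + \dots + c_n \cdot x_n)
  = C \cdot f(x_0,\dots,x_n,c_0,\dots,c_n).
```

Both conditions hold, so f and g are equivalent: they are two descriptions of one and the same family, which is exactly the situation the definition is meant to cover.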

What do we mean by “better”? We want to analyze why families corresponding to neural data processing perform better than other possible nonlinear families. Usually, “better” means that:

we have some numerical criterion – e.g., mean square approximation error after a certain fixed computation time, and

“better” means a smaller value of this numerical criterion.

However, we can have more complex cases: e.g., if we have several families with the same mean square approximation error, we can select, among them, the one with the smallest probability of approximation errors exceeding some given threshold. If this still leaves us with several possible families, we can minimize something else, etc.

So, to describe what is better in the most general way, let us go beyond simple numerical criteria and simply require that we have two relations on the set of all families:

a relation $F < G$ meaning that a family $F$ is better than the family $G$ ; and

a relation $F \sim G$ meaning that a family $F$ has the same quality as the family $G$ – with respect to the given criterion.

It is reasonable to require that these two relations satisfy transitivity: if $F$ is better than $G$ , and $G$ is better than $H$ , then $F$ should be better than $H$ . Thus, we arrive at the following definition (see, e.g., [5]):

Definition 2. By an optimality criterion, we mean a pair of relations (< , ∼) on the set of all possible families for which the following properties are satisfied for all $F$ , $G$ , and $H$ :

if $F < G$ and $G < H$ , then $F < H$ ;

if $F < G$ and $G \sim H$ , then $F < H$ ;

if $F \sim G$ and $G < H$ , then $F < H$ ;

if $F \sim G$ and $G \sim H$ , then $F \sim H$ ;

if $F \sim G$ , then $G \sim F$ ;

if $F < G$ , then we cannot have $F \sim G$ .

In mathematical terms, this pair is known as a pre-order. The difference from an order is that we can have $F \sim G$ for $F \neq G$ .
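The multi-stage comparison described above (first compare the mean square error; among ties, compare the probability of exceeding a threshold) can be sketched as a lexicographic comparison of tuples. The following Python sketch checks the six properties of Definition 2 on a small finite sample. The numerical descriptors are hypothetical values made up for this illustration, and representing a family by a tuple of numbers is an assumption of the sketch.

```python
from itertools import product

def better(f, g):
    """F < G: F has the lexicographically smaller tuple of error measures."""
    return f < g

def same_quality(f, g):
    """F ~ G: both families have identical error measures."""
    return f == g

def is_optimality_criterion(items):
    """Check the six properties of Definition 2 on a finite collection."""
    for f, g, h in product(items, repeat=3):
        if better(f, g) and better(g, h) and not better(f, h):
            return False
        if better(f, g) and same_quality(g, h) and not better(f, h):
            return False
        if same_quality(f, g) and better(g, h) and not better(f, h):
            return False
        if same_quality(f, g) and same_quality(g, h) and not same_quality(f, h):
            return False
    for f, g in product(items, repeat=2):
        if same_quality(f, g) and not same_quality(g, f):
            return False
        if better(f, g) and same_quality(f, g):
            return False
    return True

# Hypothetical quality descriptors of four families:
# (mean square error, probability that the error exceeds a threshold).
descriptors = [(0.10, 0.02), (0.10, 0.05), (0.20, 0.01), (0.10, 0.02)]
ok = is_optimality_criterion(descriptors)
```

The first and the last descriptors coincide, which models two different families of the same quality: this is the case F ∼ G with F ≠ G that distinguishes a pre-order from an order.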

We have mentioned that if there are several families which are optimal with respect to a given criterion, this means that we can optimize something else – i.e., in effect, that the original criterion was not final. Thus, we arrive at the following definition.

Definition 3.

We say that a family $F_{opt}$ is optimal with respect to the optimality criterion (< , ∼) if for every family $F$ , we have $F_{opt} < F$ or $F_{opt} \sim F$ .

We say that the optimality criterion (< , ∼) is final if there is exactly one family which is optimal with respect to this criterion.
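Definition 3 can be illustrated on a finite set of families. In the Python sketch below, the family names `F1`, `F2`, `F3` and their scores are hypothetical. The first criterion compares only the mean square error and leaves two optimal families, so it is not final. The second criterion breaks the tie by a second quantity and becomes final.

```python
def optimal_families(names, better, same_quality):
    """Definition 3: all F_opt such that F_opt < F or F_opt ~ F for every family F."""
    return [f for f in names
            if all(better(f, g) or same_quality(f, g) for g in names)]

def is_final(names, better, same_quality):
    """A criterion is final if exactly one family is optimal."""
    return len(optimal_families(names, better, same_quality)) == 1

# Hypothetical scores: (mean square error, probability of exceeding a threshold).
scores = {"F1": (0.10, 0.05), "F2": (0.10, 0.02), "F3": (0.20, 0.01)}
names = list(scores)

# Criterion 1: compare the mean square error only.
def mse_better(f, g):
    return scores[f][0] < scores[g][0]

def mse_same(f, g):
    return scores[f][0] == scores[g][0]

# Criterion 2: break ties by the second quantity (lexicographic comparison).
def lex_better(f, g):
    return scores[f] < scores[g]

def lex_same(f, g):
    return scores[f] == scores[g]

optimal_by_mse = optimal_families(names, mse_better, mse_same)
optimal_by_lex = optimal_families(names, lex_better, lex_same)
```

This mirrors the argument in the text: a criterion with several optimal families leaves room to optimize something else, and the refinement process stops only when a single optimal family remains.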

Invariance. In many practical situations, it makes sense to consider not only the original values x_i, but also their linear combinations $x_{i}^{'} = \sum_{j = 0}^{n} a_{ij} \cdot x_{j},$ (6) where (a_ij) is an invertible matrix. For example, if x_i are coordinates, we can use a different coordinate system. We can also use different units for different inputs, which also – as we mentioned earlier – amounts to linear transformations x_i → C_i · x_i.

Such a transformation does not change the problem, so it makes sense to require that the result of comparing two families should not change if we simply apply such a transformation. For example, it would be strange if one program worked better if all the data are in meters, but another one is better if all inputs are in inches. Thus, we arrive at the following definition.

Definition 4.

By an affine transformation A, we mean an invertible transformation of type (6).

For each family $F$ described by a function $f (x_{0}, \dots, x_{n}, c_{0}, \dots, c_{p})$ and for each affine transformation A, by the result $A (F)$ of applying this transformation to the family, we mean a family generated by the function $Tf (x_{0}, \dots, x_{n}, c_{0}, \dots, c_{p}) \overset{def}{=} f (\sum_{j = 0}^{n} a_{0 j} \cdot x_{j}, \dots, \sum_{j = 0}^{n} a_{nj} \cdot x_{j}, c_{0}, \dots, c_{p}) .$

We say that the optimality criterion (< , ∼) is affine-invariant if for every affine transformation A and for every two families $F$ and $G$ , the following two conditions hold:

if $F < G$ , then $A (F) < A (G)$ ;

if $F \sim G$ , then $A (F) \sim A (G)$ .
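To see what applying a transformation of type (6) does to the neural family (4), the following Python sketch compares a neuron evaluated at transformed inputs with a neuron that has correspondingly transformed weights. The matrix, weights, and inputs are arbitrary illustrative values, and the helper names are assumptions of this sketch. The underlying identity is that the scalar product of w with Ax equals the scalar product of Aᵀw with x.

```python
import math

def mat_vec(a, x):
    """Multiply the matrix a (list of rows) by the vector x, as in formula (6)."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]

def neuron(x, w, C=1.0, s=math.tanh):
    """Formula (4): C * s(w_0*x_0 + ... + w_n*x_n)."""
    return C * s(sum(wi * xi for wi, xi in zip(w, x)))

def transformed_neuron(x, w, a, C=1.0, s=math.tanh):
    """Definition 4 applied to a neuron: evaluate it at the transformed inputs A x."""
    return neuron(mat_vec(a, x), w, C, s)

a = [[2.0, 1.0, 0.0],
     [0.0, 1.0, -1.0],
     [1.0, 0.0, 3.0]]          # invertible: its determinant is 5
w = [0.5, -0.2, 0.1]
x = [1.0, 0.4, -0.7]

w_new = mat_vec(transpose(a), w)               # weights A^T w
lhs = transformed_neuron(x, w, a, C=2.0)       # neuron applied to A x
rhs = neuron(x, w_new, C=2.0)                  # plain neuron with weights A^T w
```

The two values agree up to rounding, so the transformed function is again of the form (4): the neural family is mapped into itself by every such transformation.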

Now, we are ready to formulate our main result.

Comment. A natural question is: what can be an example of an affine-invariant final optimality criterion? For example, does “mean square approximation error after a certain fixed computation time” satisfy the definitions?

If the corresponding class of problems – on which we measure the mean square approximation error – is affine-invariant, then definitely the above criterion is affine-invariant. We do not know whether this criterion will be final – this depends on which class of problems we consider. However, even if this particular criterion is not final, then, as we have mentioned earlier, we can use the corresponding non-uniqueness to optimize something else – and it is reasonable to require that this “something else” is also affine-invariant. This way, even if the original criterion was not final, we will eventually arrive at a final affine-invariant criterion.

3 Main result

Proposition.

The smallest p for which there exists an affine-invariant final optimality criterion on the set of all families is p = n.

For p = n, for every affine-invariant final optimality criterion on the set of all families, the optimal family is of type (4) for some function s(x).

In other words, neurons are indeed optimal non-linear transformation functions – optimal with respect to any optimality criterion that satisfies the reasonable properties of being final and affine-invariant.

Proof. Let (< , ∼) be a final affine-invariant optimality criterion, and let $F_{opt}$ denote the family which is optimal with respect to this criterion. Let us prove that this family has the neural form (4).

1°. Let us first prove that the family $F_{opt}$ is itself affine-invariant, i.e., that for each affine transformation A, we have $A (F_{opt}) = F_{opt}$ .

Indeed, the fact that the family $F_{opt}$ is optimal means that for every family $F$ , we have either $F_{opt} < F$ or $F_{opt} \sim F$ . In particular, for every family $F$ , one of these two conditions is satisfied for the family $A^{- 1} (F)$ , where A^-1 denotes the inverse affine transformation. In other words, we have either $F_{opt} < A^{- 1} (F)$ or $F_{opt} \sim A^{- 1} (F)$ .

Due to affine-invariance, taking into account that $A (A^{- 1} (F)) = F$ , we conclude that $A (F_{opt}) < F$ or $A (F_{opt}) \sim F$ . This is true for each family $F$ . By definition of optimality, this means that the family $A (F_{opt})$ is optimal. But we know that $F_{opt}$ is optimal, and we assumed that our optimality criterion is final – which means that there is only one optimal family. Thus, we indeed have $A (F_{opt}) = F_{opt}$ .
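The argument of Part 1° can be condensed into a single chain of implications. In the sketch below, the symbol ⪯ is a shorthand introduced here (it is not used in the original text): F ⪯ G means that F < G or F ∼ G.

```latex
F_{opt} \preceq A^{-1}(F) \ \text{for every } F
\;\Longrightarrow\;
A(F_{opt}) \preceq A(A^{-1}(F)) = F \ \text{for every } F
\;\Longrightarrow\;
A(F_{opt}) \ \text{is optimal}
\;\Longrightarrow\;
A(F_{opt}) = F_{opt}.
```

The first implication uses affine-invariance, the second uses the definition of optimality, and the third uses the fact that the criterion is final.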

2°. The property 1° means that for each function t(x₀, …, x_n) from the optimal family and for each affine transformation, the transformed function also belongs to the same optimal family.

The functions forming the family $F_{opt}$ are Lipschitz continuous and thus almost everywhere differentiable. Let us pick one such function t(x₀, …, x_n). Since this function is nonlinear and almost everywhere differentiable, there exist points (X₀, …, X_n) where this function’s value is non-zero and its gradient is defined and is also non-zero. Let us select one such point.

We can always perform an affine transformation of coordinates so that in the new coordinates the gradient vector will be parallel to the 0-th axis – e.g., we can rotate the axes so that one of the axes becomes parallel to the gradient vector. In the new coordinates z₀, …, z_n, for the correspondingly transformed function T(z₀, …, z_n), we have $\nabla T = (\frac{\partial T}{\partial z_{0}}, \frac{\partial T}{\partial z_{1}}, \dots, \frac{\partial T}{\partial z_{n}}) = (1, 0, \dots, 0)$ at the selected point – which in the new coordinates, has the form (Z₀, …, Z_n).
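The change of coordinates used in this step can be made concrete. The Python sketch below builds a matrix M for which the function T(z) = t(Mz) has gradient (1, 0, …, 0) at the selected point. As a simplification, it uses a Householder reflection followed by a rescaling instead of a rotation; both are invertible linear maps, so either fits the argument. The sample function t, the helper names, and the finite-difference check are assumptions made for this illustration.

```python
import math

def mat_vec(m, x):
    """Multiply the square matrix m (list of rows) by the vector x."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def align_gradient_with_axis0(g):
    """Return a symmetric invertible matrix M with M g = (1, 0, ..., 0).

    M = H / |g|, where H is a Householder reflection sending g to |g| * e_0.
    """
    n = len(g)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        raise ValueError("the gradient must be non-zero")
    v = list(g)
    v[0] -= norm                       # v = g - |g| * e_0
    vv = sum(vi * vi for vi in v)
    m = [[(1.0 if i == j else 0.0) / norm for j in range(n)] for i in range(n)]
    if vv > 0.0:                       # otherwise g already points along e_0
        for i in range(n):
            for j in range(n):
                m[i][j] -= 2.0 * v[i] * v[j] / (vv * norm)
    return m

def numerical_gradient(func, point, h=1e-6):
    """Central-difference estimate of the gradient of func at point."""
    grad = []
    for i in range(len(point)):
        up = list(point)
        up[i] += h
        down = list(point)
        down[i] -= h
        grad.append((func(up) - func(down)) / (2.0 * h))
    return grad

# Hypothetical example: t has value 1 and gradient (1, 2, 2) at X = (0, 0, 0).
def t(x):
    return 1.0 + math.tanh(x[0] + 2.0 * x[1] + 2.0 * x[2])

M = align_gradient_with_axis0([1.0, 2.0, 2.0])

def T(z):
    """Transformed function T(z) = t(M z); its gradient is M^T grad t = M grad t."""
    return t(mat_vec(M, z))

# The selected point X = (0, 0, 0) corresponds to Z = (0, 0, 0), since M Z = X.
grad_T = numerical_gradient(T, [0.0, 0.0, 0.0])   # close to (1, 0, 0)
```

Because M is symmetric, the chain rule gives the gradient of T as M applied to the gradient of t, which is why a single matrix with Mg = (1, 0, …, 0) suffices.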

By multiplying this function $T \in F_{opt}$ by an appropriate constant C, we can get, for each possible value T₀ ≠ 0, a new function from the family $F_{opt}$ for which C · T(Z₀, …, Z_n) = T₀ and for which the gradient is still parallel to the 0-th axis. Similarly, for any given non-zero vector v = (v₀, …, v_n), by rotating the coordinates z_i (and, if needed, by re-scaling all of them), we can get a new function from the family $F_{opt}$ for which the gradient at the point (Z₀, …, Z_n) is equal to v. Thus, for each tuple (T₀, v₀, …, v_n), we have a function from the family $F_{opt}$ for which:

the value at the point (Z₀, …, Z_n) is equal to T₀ and

the gradient at this point is equal to (v₀, …, v_n).

Thus, if we assign, to each tuple (T₀, v₀, …, v_n), one of the corresponding functions from the family $F_{opt}$ , we will get an (n + 2)-parametric family of functions. Hence, the total number p + 2 of parameters (one parameter C and p + 1 parameters c₀, …, c_p) cannot be smaller than n + 2, so p ≥ n. This proves the first statement of our proposition. To be more precise, we also need to prove that such a criterion exists, but this is easy: e.g., a criterion according to which the neural family (4) is better than every other family, while all other families are equivalent to each other, is clearly final and affine-invariant.

Let us prove the second statement. For this, let us consider the case when p = n. In this case, the whole family $F_{opt}$ depends only on n + 2 parameters. Thus, if we had, for each tuple, a whole at-least-1-parametric family of functions corresponding to this tuple, we would have a family determined by more than n + 2 parameters – which would contradict our assumption that p = n.

In particular, this means that the functions $T (z_{0}, α \cdot z_{1}, z_{2}, \dots, z_{n})$ corresponding to different values α – functions which also belong to the family $F_{opt}$ and for which the tuple (T₀, v₀, …, v_n) is the same – cannot form a 1-parametric family. This means that all these functions corresponding to different values α should be identical, i.e., that $T (z_{0}, α \cdot z_{1}, z_{2}, \dots, z_{n}) = T (z_{0}, α^{'} \cdot z_{1}, z_{2}, \dots, z_{n})$ for all α and α′. This means that the function T cannot depend on z₁ at all. Similarly, we can prove that the function does not depend on any other variable, i.e., that it depends only on z₀: T(z₀, …, z_n) = s(z₀) for some function s(x) of one variable.
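The role of the re-scaling z₁ → α · z₁ can be illustrated with two sample functions of two variables. In the Python sketch below, both functions are hypothetical examples chosen for this illustration, with the selected point placed at Z = (0, 0): there, both have value 1 and gradient (1, 0). The first depends only on z₀, while the second also depends on z₁.

```python
import math

def rescale_z1(T, alpha):
    """Return the function z -> T(z_0, alpha * z_1)."""
    return lambda z: T([z[0], alpha * z[1]])

def T_neural(z):
    """Depends on z_0 only: value 1 and gradient (1, 0) at Z = (0, 0)."""
    return math.exp(z[0])

def T_other(z):
    """Also depends on z_1, with the same value and gradient at Z = (0, 0)."""
    return math.exp(z[0]) + z[1] ** 2

alphas = [0.5, 1.0, 2.0]
probe = [0.3, 1.0]                      # a point away from Z

# Values of the re-scaled functions at the probe point, one per alpha.
neural_values = {rescale_z1(T_neural, a)(probe) for a in alphas}
other_values = {rescale_z1(T_other, a)(probe) for a in alphas}

# All re-scaled versions share the value at the selected point Z = (0, 0).
values_at_Z = {rescale_z1(T_other, a)([0.0, 0.0]) for a in alphas}
```

For the function that depends only on z₀, all re-scaled versions coincide. For the other one, different values of α give different functions with the same value and gradient at Z, i.e., an extra 1-parametric family, which is exactly what the parameter count for p = n rules out.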

By applying linear transformations to z_i and multiplying the expression by C, we get exactly the family (4).

The proposition is proven.

Acknowledgments

This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology.

It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478.

The authors are thankful to the anonymous reviewers for valuable suggestions.

References

[1] Bokati, Kosheleva, Kreinovich, Sosa: Why Deep Learning Is More Efficient than Support Vector Machines, and How It Is Related to Sparsity Techniques in Signal Processing, Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence ISMSI’2020, Thimpu, Bhutan, April 18–19, 2020.

[2] Goodfellow, Bengio, Courville: Deep Learning, MIT Press, Cambridge, Massachusetts, 2016.

[3] Kreinovich, Kosheleva: Deep Learning (Partly) Demystified, Proceedings of the 2020 4th International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence ISMSI’2020, Thimpu, Bhutan, April 18–19, 2020.

[4] Kreinovich, Kosheleva: Optimization under Uncertainty Explains Empirical Success of Deep Learning Heuristics, In: P. Pardalos, V. Rasskazova, and M.N. Vrahatis (eds.), Black Box Optimization, Machine Learning and No-Free Lunch Theorems, Springer, Cham, Switzerland (2021), 195–220.

[5] Nguyen H.T., Kreinovich: Applications of Continuous Mathematics to Computer Science, Kluwer, Dordrecht (1997).