Abstract
Neural networks – specifically, deep neural networks – are, at present, the most effective machine learning techniques. There are reasonable explanations of why deep neural networks work better than traditional “shallow” ones, but the question remains: why neural networks in the first place? why not networks consisting of non-linear functions from some other family of functions? In this paper, we provide a possible theoretical answer to this question: namely, we show that of all families with the smallest possible number of parameters, families corresponding to neurons are indeed optimal – for all optimality criteria that satisfy some reasonable requirements: namely, for all optimality criteria which are final and invariant with respect to coordinate changes, changes of measuring units, and similar linear transformations.
Formulation of the problem
However, a general question remains: why neural networks in general are so effective in the first place? This question is not only about computer applications: artificial neural networks started by simulating biological neurons – that are largely performing similar data processing. Biological neurons are a product of billions of years of improving evolution, so the fact that this type of data processing is used in biological neurons is a good indication that such data processing is effective – but why?
First, we form a generic linear combination of the inputs:
Then, we apply a non-linear function s(x) of one variable – known as the activation function – to this linear combination X.
In these terms, the above question is: why is the family (1) of non-linear functions more effective than other possible families of non-linear functions?
First, let us notice that in the generic linear expression (2), the last term w0 is different from all the other terms. To make this formula more uniform, let us follow the usual arrangement and introduce an auxiliary variable x0 = 1. Then, the formula (1) takes the form
Second, let us take into account that in many cases, the output signal y represents the value of some physical quantity. This happens, e.g., at the last layer of the neural network, when we generate the computations result – and in a prediction problem, this result is about the future value of the quantity of interest (e.g., the next moment’s distance between a mobile robot and a nearby wall). The numerical value of a quantity depends on the choice of a measuring unit. If we select a new measuring unit which is C times smaller than the original one, then all numerical values will multiply by C: e.g., if we replace meters by centimeters, all values are multiplied by 100. In the new units, the output of the neuron takes the form
It is known that Lipschitz continuous functions are almost everywhere differentiable, and many of their properties are similar to properties of smooth (everywhere differentiable) functions.
We also want to make sure that a family is uniquely determined by its functions, so if we simply change the parameters without changing the class of functions, we will end up with the same family.
We say that two nonlinear Lipschitz continuous mappings f(x0, …, x
n
, c0, …, c
p
) and
for each C and for each tuple c = (c0, …, c
p
), there exists a value C′ and a tuple for each C′ and for each tuple By a family, we mean an equivalence class of functions – in terms of the above equivalence. We say that a function t(x1, …, x
n
) belongs to the family
we have some numerical criterion – e.g., mean square approximation error after a certain fixed computation time, and “better” means a smaller value of this numerical criterion.
However, we can have more complex cases: e.g., if we have several families with the same mean square approximation error, we can select, among them, the one with the smallest probability of approximation errors exceeding some given threshold. If this still leaves us with several possible families, we can minimize something else, etc.
So, to describe what is better in the most general way, let us go beyond simple numerical criteria and simply require that we have two relations on the set of all families: a relation a relation
It is reasonable to require that these two relations satisfy transitivity: if
if if if if if if
In mathematical terms, this pair is known as pre-order. The difference from order is that we can have
We have mentioned that if there are several families which are optimal with respect to a given criterion, this means that we can optimize something else – i.e., in effect, that the original criterion was not final. Thus, we arrive at the following definition.
We say that a family We say that the optimality criterion (< , ∼) is final if there is exactly one family which is optimal with respect to this criterion.
Such a transformation does not change the problem, so it makes sense to require that the result of comparing two families should not change if we simply apply such a transformation. For example, it would be strange if one program worked better if all the data are in meters, but another one is better if all inputs are in inches. Thus, we arrive at the following definition.
By an affine transformation A, we mean a reversible transformation of type (6). For each family We say that the optimality criterion (< , ∼) is affine-invariant if for every affine transformation A and for every two families if if
Now, we are ready to formulate our main result.
Comment. A natural question is: what can be an example of affine-invariant final optimality criterion? For example, does “mean square approximation error after a certain fixed computation time" satisfy the definitions?
If the corresponding class of problems – on which we measure the mean square approximation error – is affine-invariant, then definitely the above criterion is affine-invariant. We do not know whether this criterion will be final – this depends on which class of problems we consider. However, even if this particular criterion is not final, then, as we have mentioned earlier, we can use the corresponding non-uniqueness to optimize something else – and it is reasonable to require that this “something else” is also affine-invariant. This way, even if the original criterion was not final, we will eventually arrive at a final affine-invariant criterion.
The smallest p for which there exists an affine-invariant final optimality criterion on the set of all families is p = n.
For p = n, for every affine-invariant final optimality criterion on the set of all families, the optimal family is of type (4) for some functions s(x).
In other words, neurons are indeed optimal non-linear transformation functions – optimal with respect to any optimality criterions that satisfies reasonable properties of being final and affine-invariant.
1°. Let us first prove that the family
Indeed, the fact that the family
Due to affine-invariance, taking into account that
2°. The property 1° means that for each function t(x1, …, x n ) from the optimal family and for each affine transformation, the transformed function also belongs to the same optimal family.
The functions f forming the family
We can always perform an affine transformation of coordinates so that in the new coordinates the gradient vector will be parallel to the 0-th axis – e.g., we can rotate the axes so that one of the axes becomes parallel to the gradient vector. In the new coordinates z0, …, z
n
, for the correspondingly transformed function T(z0, …, z
n
), we have
By multiplying this function the value at the point (Z0, …, Z
n
) is equal to T0 and the gradient at this point is equal to (v0, …, v
n
).
Thus, if we assign, to each tuple (T0, v0, …, v
n
), one of the corresponding functions from the family
Let us prove the second statement. For this, let us consider the case when p = n. In this case, the whole family
In particular, this means that the functions
By applying linear transformations to z i and multiplying the expression by C, we get exactly the family (4).
The proposition is proven.
Footnotes
Acknowledgments
This work was supported in part by the National Science Foundation grants 1623190 (A Model of Change for Preparing a New Generation for Professional Practice in Computer Science), and HRD-1834620 and HRD-2034030 (CAHSI Includes), and by the AT&T Fellowship in Information Technology.
It was also supported by the program of the development of the Scientific-Educational Mathematical Center of Volga Federal District No. 075-02-2020-1478.
The authors are thankful to the anonymous reviewers for valuable suggestions.
