Topographic representation adds robustness to supervised learning

Abstract

In recent years, machine learning, and deep models in particular, have improved significantly in performance and have thus become applicable to many problems that were prohibitively difficult to learn a decade ago. One of the strengths of deep models is that they adaptively capture well-structured representations of the input in their internal layers, which helps them generate desirable outputs. However, while many studies are dedicated to improving the performance of neural networks, less effort has been devoted to understanding how internal representations form in hierarchical neural networks and what they imply for performance. Here, we study a network model that incorporates topographical self-organizing maps into a supervised network and show how gradient learning results in a form of self-organizing learning rule. Topographical self-organization as an internal representation is interesting because, although it motivated many early learning models and is relevant to biological learning systems, it has rarely been included in supervised learning architectures. Our objectives in this paper are to explain the dynamics of the proposed model, to compare its internal representation visually against those of some deep models and, importantly, to show that our model is robust in the sense that it performs well across a variety of applications, which is believed to be a hallmark of biological learning systems.

Keywords

Self-organization, topographic representation, supervised learning, deep neural networks, visualization

1 Introduction

Machine learning has made significant improvements that allow the development of many new applications, specifically with the help of deep neural network models [1–4]. Deep learning was made possible by increasingly fast processor technology such as GPUs as well as algorithmic advancements such as those explained in [5–8]. Even a decade ago, training deep neural networks was difficult if not impossible because of their complexity and the volume of data involved. Such learning tasks can now be executed on reasonable time scales, which allows deep learning to be applied to many real-world problems.

Learning good internal representations is a key aspect of supervised learning in hierarchical neural networks. Indeed, it is interesting to recall that the first breakthrough in deep learning came from an application of unsupervised pre-training with gradient-based fine-tuning [9]. Restricted Boltzmann Machines (RBMs) [10] and Autoencoders [11, 12] were utilized for constructing the hidden layers of early models such as Deep Belief Networks (DBNs), Deep Boltzmann Machines (DBMs) and Stacked Autoencoders [13, 14]. While much deep learning research focuses on learning efficiency and execution performance, less effort is dedicated to understanding the formation of internal representations in hierarchical neural networks. Most of the existing studies on internal representations [15–17] focus on the pre-training of deep models but rarely give a perspective on the structure of the internal representations and their relation to the final outputs. Moreover, while topographic self-organizing maps have been integral parts of biologically motivated learning theories since the 1970s [18–20], the role of such topographic self-organizing mechanisms is less understood in modern deep learning theories. Topographic self-organization is interesting as it is often observed in biological neural networks [21, 22] and thus may give new insights into learning and self-organization in artificial neural networks.

In this paper we propose a network that incorporates aspects of self-organization into a supervised network model for classification. More specifically, we modify the previously proposed Restricted Radial Basis Function Network (rRBF) [23–26] with a softmax output layer that is trained on a cross-entropy cost function. This is more consistent with a probabilistic interpretation of the class membership output function than the previous implementation. The modification allows a clearer derivation of the emergence of the self-organizing learning aspects in this network. We call this modified network the Softmax Restricted Radial Basis Function Network (S-rRBF).

A major thrust of this paper is to show that it is possible to build a learning model in which self-organization and supervised learning are only different aspects of a single learning mechanism. Furthermore, we visualize the uniqueness of the internal representations of the S-rRBF compared to those of some of the more conventional deep models. We show that our network achieves performance comparable to that of the more conventional deep network architectures while having the added feature of robustness, in the sense that it consistently compares favorably with the best performers in the studied examples, whereas the best performer changes from one application to another. While the results are consistent with Wolpert’s ‘No free lunch theorem’ [27], they also highlight that robustness to variations in application is an important property of flexible learners and that evaluation should not rest entirely on accuracy measures. Such application robustness is thought to be of importance in understanding biological learning systems.

We highlight our ideas here with well understood benchmark examples of moderate complexity. However, the proposed architecture can be easily scaled to deeper layers and hence applied to deeper learning problems. The main contribution in this paper is showing algebraically the emergence of the self-organizing structures from supervised gradient learning. We believe that this research opens new insights into the relation between unsupervised and supervised learning.

2 Softmax Restricted Radial Basis Function Networks (S-rRBF)

The Softmax Restricted Radial Basis Function Network (S-rRBF) is a hierarchical neural network that has one or more hidden layers in which the neurons are aligned on a two-dimensional grid. For simplicity we restrict our discussion to networks with one hidden layer, as shown in Fig 1. The S-rRBF is based on the Restricted Radial Basis Function Network (rRBF) introduced in [23, 26]. Unlike the original rRBF, which has a sigmoidal output layer and a quadratic cost function, the S-rRBF adopts a softmax output layer with a cross-entropy loss function. These modifications yield a clearer understanding of the relation between the internal self-organization and the supervised learning process. So far, most studies have considered self-organization and supervised learning to be two unrelated learning mechanisms. Here, we argue that with the proposed S-rRBF it is possible to build a learning model in which topographic self-organization is an integrated part of supervised learning, thus giving a new perspective on the learning process of artificial neural networks.

Fig.1

Outline of S-rRBF.

The dynamics of the S-rRBF are as follows. Suppose the S-rRBF is trained on a data set {(Xⁱ, Yⁱ)} (i = 1, 2, …, m), in which $X^{i} \in ℝ^{d}$ and Yⁱ ∈ {1, 2, …, C}. Here, m is the number of samples, d is the dimension of the input, and C is the number of classes and thus the number of output neurons.

Given input, Xⁱ, at time t, the j-th hidden neuron generates output, $h_{j}^{i}$ , as

$h_{j}^{i} = σ ({win}^{i}, j, t) e^{- ∥ X^{i} - W_{j} (t) ∥^{2}}$ (1)

${win}^{i} = \arg \min_{j} ∥ X^{i} - W_{j} (t) ∥^{2}$

Here, W_j (t) is the reference vector associated with the j-th hidden neuron at time t.

The function σ () in Equation 1 is a neighborhood function defined as

$σ ({win}^{i}, j, t) = e^{- \frac{dist ({win}^{i}, j, t)}{S (t)}}$ (2)

$S (t) = S_{start} \left(\frac{S_{end}}{S_{start}}\right)^{\frac{t}{t_{end}}}, \quad 0 \leq t \leq t_{end}, \quad S_{start} > S_{end},$

where dist (win, j, t) is the Euclidean distance between the winning neuron and the j-th neuron on the two-dimensional grid of the hidden layer. The variable t is the current epoch, and t_end is the target epoch when the learning process is terminated. The activation function of a hidden neuron in S-rRBF is similar to that of the Radial Basis Function Network (RBF) [28], except that in S-rRBF it is topologically restricted by the neighborhood function σ (win, j, t).
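The hidden-layer computation of Equations 1 and 2 can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the row-major layout of neurons on the grid, and the default values of `s_start` and `s_end` are assumptions made here, not settings reported in this paper.

```python
import numpy as np

def grid_coordinates(rows: int, cols: int) -> np.ndarray:
    """Return a (rows*cols, 2) array of grid positions, one per hidden neuron (row-major)."""
    r, c = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    return np.stack([r.ravel(), c.ravel()], axis=1).astype(float)

def neighborhood_width(t: int, t_end: int, s_start: float, s_end: float) -> float:
    """S(t) of Equation 2: decays geometrically from s_start (t = 0) to s_end (t = t_end)."""
    return s_start * (s_end / s_start) ** (t / t_end)

def hidden_output(x, W, grid, t, t_end, s_start=3.0, s_end=0.5):
    """Equation 1 for one input x of shape (d,); W has shape (n_hid, d).

    s_start and s_end are hypothetical defaults chosen for illustration.
    """
    sq_dist = np.sum((x - W) ** 2, axis=1)                 # ||x - W_j||^2 for every j
    win = int(np.argmin(sq_dist))                          # index of the winning neuron
    grid_dist = np.linalg.norm(grid - grid[win], axis=1)   # Euclidean distance on the 2-D grid
    sigma = np.exp(-grid_dist / neighborhood_width(t, t_end, s_start, s_end))
    return sigma * np.exp(-sq_dist), win

# Usage: a 4 x 4 hidden layer and a 3-dimensional input that coincides with W[5].
rng = np.random.default_rng(0)
grid = grid_coordinates(4, 4)
W = rng.normal(size=(16, 3))
h, win = hidden_output(W[5], W, grid, t=0, t_end=100)
```

Because the input equals a reference vector here, that neuron wins and both factors of Equation 1 equal one for it; every other neuron is attenuated by its grid distance to the winner and by its distance to the input.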

The outputs of the hidden neurons are then propagated to the output layer, where the k-th output, O_k, is computed as follows. $O_{k} = e^{V_{k}^{T} h^{i}}$ (3) Here, V_k is the weight vector leading from the hidden layer into the k-th output neuron, while hⁱ is the output vector of the hidden layer, given Xⁱ as input. The superscript T stands for the transpose.

The conditional probability that the S-rRBF classifies the input into the class k is given by $P (Y^{i} = k | W, V, X^{i}) = \frac{e^{V_{k}^{T} h^{i}}}{\sum_{l} e^{V_{l}^{T} h^{i}}}$ (4) $\begin{matrix} V & = & [V_{1} V_{2} \dots V_{n_{out}}] \\ W & = & [W_{1} W_{2} \dots W_{n_{hid}}] \\ h^{i} & = & [h_{1}^{i} h_{2}^{i} \dots h_{n_{hid}}^{i}]^{T} \end{matrix}$ Here, n_out and n_hid are the number of output neurons and the number of hidden neurons, respectively.
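As an illustration of Equation 4, the class probabilities can be computed from the hidden output as below. The sketch assumes that V is stored as an (n_hid, n_out) matrix whose k-th column is V_k; the function name and the max-shift for numerical stability are additions made here and are not part of the derivation.

```python
import numpy as np

def class_probabilities(h: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Equation 4: softmax over V_k^T h. h has shape (n_hid,), V has shape (n_hid, n_out)."""
    logits = V.T @ h
    logits = logits - logits.max()   # shifting all logits leaves the ratio in Equation 4 unchanged
    expo = np.exp(logits)
    return expo / expo.sum()

# Usage: with all-zero output weights every class is equally likely.
p = class_probabilities(np.array([1.0, 0.5, 0.0]), np.zeros((3, 4)))
```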

The S-rRBF is then trained to minimize the cross entropy in Equation 5. $J (W, V) = - \sum_{i} P (Y^{i}) \log P (Y^{i} | W, V, X^{i})$ (5)

Considering that Yⁱ ∈ {1, …, C}, Equation 5 can be rewritten as $J (W, V) = - \sum_{i} \sum_{k} Π (Y^{i} = k) \log \frac{e^{V_{k}^{T} h^{i}}}{\sum_{l} e^{V_{l}^{T} h^{i}}}$ (6) In Equation 6, Π (Yⁱ = k) = 1 when Yⁱ = k is true, and Π (Yⁱ = k) = 0 otherwise.
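Because the indicator Π selects exactly one term per sample, Equation 6 reduces to summing the negative log-probability of the true class. A minimal sketch follows, assuming 0-based integer labels (the text uses 1, …, C) and a hypothetical function name.

```python
import numpy as np

def cross_entropy(P: np.ndarray, y: np.ndarray) -> float:
    """Equation 6. P: (m, C) predicted class probabilities; y: (m,) labels in 0..C-1."""
    m = P.shape[0]
    # The indicator keeps only the probability assigned to each sample's true class.
    return float(-np.sum(np.log(P[np.arange(m), y])))

# Usage with two samples and two classes.
P = np.array([[0.5, 0.5], [0.25, 0.75]])
y = np.array([0, 1])
loss = cross_entropy(P, y)
```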

To minimize the cross-entropy, the gradient of the loss function with respect to vector V_j is calculated as follows.

$\frac{\partial J}{\partial V_{j}} = - \sum_{i} \left\{ Π (Y^{i} = j) h^{i} - \left( \sum_{k} Π (Y^{i} = k) \right) \frac{e^{V_{j}^{T} h^{i}}}{\sum_{l} e^{V_{l}^{T} h^{i}}} h^{i} \right\}$ (7) Because (∑_kΠ (Yⁱ = k)) = 1, Equation 7 can be expressed as follows. $\frac{\partial J}{\partial V_{j}} = - \sum_{i} (Π (Y^{i} = j) - P (Y^{i} = j | W, V, X^{i})) h^{i}$ (8) Hence, the modification of the weight vector leading to the j-th output neuron is as follows.

$V_{j} (t + 1) = V_{j} (t) + η \sum_{i} (Π (Y^{i} = j) - P (Y^{i} = j | W, V, X^{i})) h^{i}$ (9) Equation 9 shows that the values of the connection weights leading to an output neuron are increased if that neuron is associated with the true label of the input and are decreased otherwise. Consequently, these modifications increase the probability that the S-rRBF predicts the correct class.
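The update of Equation 9 can be written for all output neurons at once. The sketch below assumes batch matrices H (hidden outputs, one row per sample) and P (class probabilities, one row per sample), 0-based labels, and a hypothetical learning rate; none of these names come from the paper.

```python
import numpy as np

def update_output_weights(V, H, P, y, eta=0.1):
    """One batch step of Equation 9.

    V: (n_hid, C) with column j equal to V_j, H: (m, n_hid), P: (m, C), y: (m,) labels in 0..C-1.
    """
    m, C = P.shape
    T = np.zeros((m, C))
    T[np.arange(m), y] = 1.0            # indicator Pi(Y^i = j)
    return V + eta * H.T @ (T - P)      # column j receives sum_i (Pi - P) h^i

# Usage: one sample of class 0, currently predicted with probability 0.5.
V_new = update_output_weights(
    np.zeros((2, 2)), np.array([[1.0, 0.0]]), np.array([[0.5, 0.5]]), np.array([0]), eta=1.0
)
```

In this toy case the weight from the active hidden neuron to the true class grows and the weight to the other class shrinks, while the weights from the inactive hidden neuron are left unchanged.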

Also, $\frac{\partial J}{\partial W_{n}} = \frac{\partial J}{\partial h^{i}} \frac{\partial h^{i}}{\partial W_{n}}$ (10) In calculating Equation 10, considering that the weight vector W_n is relevant only to the output of the n-th hidden neuron, h_n, the equation can be rewritten as

$\begin{matrix} \frac{\partial J}{\partial W_{n}} & = & \frac{\partial J}{\partial h_{n}^{i}} \frac{\partial h_{n}^{i}}{\partial W_{n}} \\ & = & - \sum_{i} \sum_{k} \left\{ \frac{\partial}{\partial h_{n}^{i}} \left( Π (Y^{i} = k) \left( \log e^{V_{k}^{T} h^{i}} - \log \sum_{l} e^{V_{l}^{T} h^{i}} \right) \right) \frac{\partial h_{n}^{i}}{\partial W_{n}} \right\} \\ & = & - \sum_{i} \sum_{k} Π (Y^{i} = k) \left\{ \left( v_{kn} - \sum_{l} v_{ln} P (Y^{i} = l | W, V, X^{i}) \right) \frac{\partial h_{n}^{i}}{\partial W_{n}} \right\} \end{matrix}$ (11) In Equation 11, v_kn is the weight connecting the n-th hidden neuron with the k-th output neuron. Hence, ∑_l v_ln P (Yⁱ = l|W, V, Xⁱ) is the weighted average of the connection weights from the n-th hidden neuron to the output layer, with the conditional probabilities of the respective classes as the weighting coefficients.

Defining ${\tilde{v}}_{n}^{i} = \sum_{l} v_{ln} P (Y^{i} = l | W, V, X^{i})$, Equation 11 can be expressed as

$\begin{matrix} \frac{\partial J}{\partial W_{n}} = - \sum_{i} \sum_{k} Π (Y^{i} = k) (v_{kn} - {\tilde{v}}_{n}^{i}) \frac{\partial h_{n}^{i}}{\partial W_{n}} \\ = - 2 \sum_{i} \sum_{k} Π (Y^{i} = k) (v_{kn} - {\tilde{v}}_{n}^{i}) h_{n}^{i} (X^{i} - W_{n}) \end{matrix}$ (12)
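The second line of Equation 12 relies on the derivative of the hidden output with respect to its own reference vector. As a worked step, and assuming the winner index is treated as constant when differentiating (it is piecewise constant in W, an assumption made explicit here), Equation 1 gives:

```latex
\begin{aligned}
\frac{\partial h_{n}^{i}}{\partial W_{n}}
  &= \sigma(win^{i}, n, t)\, e^{-\lVert X^{i} - W_{n} \rVert^{2}}\,
     \frac{\partial}{\partial W_{n}}\left(-\lVert X^{i} - W_{n} \rVert^{2}\right) \\
  &= \sigma(win^{i}, n, t)\, e^{-\lVert X^{i} - W_{n} \rVert^{2}} \cdot 2\,(X^{i} - W_{n}) \\
  &= 2\, h_{n}^{i}\,(X^{i} - W_{n})
\end{aligned}
```

Substituting this expression into the first line of Equation 12 yields the factor 2 and the term h_n (Xⁱ − W_n) in its second line.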

When the true class of the given input Xⁱ is K, so that Π (Yⁱ = K) = 1 and Π (Yⁱ = k) = 0 for all other classes, Equation 12 becomes

$\frac{\partial J}{\partial W_{n}} = - 2 \sum_{i} (v_{Kn} - {\tilde{v}}_{n}^{i}) h_{n}^{i} (X^{i} - W_{n})$ (13) Hence the modification of the n-th reference vector is given by

$\begin{matrix} W_{n} (t + 1) & = & W_{n} (t) + η \sum_{i} (v_{Kn} - {\tilde{v}}_{n}^{i}) h_{n}^{i} (X^{i} - W_{n} (t)) \\ & = & W_{n} (t) + η \sum_{i} (v_{Kn} - {\tilde{v}}_{n}^{i}) σ ({win}^{i}, n) e^{- ∥ X^{i} - W_{n} ∥^{2}} (X^{i} - W_{n} (t)) \end{matrix}$ (14) Here the constant factor 2 of Equation 13 is absorbed into the learning rate η. Equation 14 shows that a topographic self-organizing process, similar to that of Kohonen’s Self-Organizing Maps (SOM) [19, 20] shown in Equation 15, occurs in the internal layer during the supervised training process of the S-rRBF. The self-organization occurs as a mathematical implication of the cost function minimization. It shows that it is possible to link topographic self-organization with supervised learning, which so far have often been treated as two different learning mechanisms, in a single model.

$\begin{matrix} W_{n} (t + 1) \\ = W_{n} (t) + η \sum_{i} σ ({win}^{i}, n) (X^{i} (t) - W_{n} (t)) \end{matrix}$ (15)

It is important to mention that although the internal self-organization of the S-rRBF is similar to that of the SOM, the two differ in a significant way. In the SOM, as shown in Equation 15, the reference vector is always modified towards the input vector, while the direction of self-organization in the S-rRBF is regulated by the sign of $(v_{Kn} (t) - {\tilde{v}}_{n}^{i} (t))$. The sign of this regulating term is decided by the value of the weight connecting the n-th hidden neuron to the output neuron of the true class of the input, relative to the expected value of the weights leading from that hidden neuron. If the weight leading to the output neuron associated with the true class is larger than the expected value of the weights leading from the n-th hidden neuron, a "positive" self-organization as in the SOM occurs. In contrast, when the value of the weight is below the expected value, a "negative" self-organization that moves the reference vector away from the input occurs. In this sense, the self-organization process in the SOM is label-independent, while in the S-rRBF it is label-oriented. While the internal self-organization in this study is not fully unsupervised, as it depends on the labels of the input, it does not require exact information about the output error. Only the relative value of the connection weight from a particular hidden neuron to the output neuron is required. Hence, it is not supervised in the strict sense either. We consider the semi-supervised self-organization here to be a good starting point for further studies that connect unsupervised learning with a supervised learning scheme. Furthermore, the term $e^{- ∥ X (t) - W_{n} (t) ∥^{2}}$ in Equation 14 triggers a dropout effect [5, 8], resulting in a sparse network in which hidden neurons associated with reference vectors that differ greatly from the input X are inhibited.
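The contrast between Equations 14 and 15 can be made concrete with a small sketch. The function names, the batch layout of the arrays, 0-based labels, and the learning rate are assumptions for illustration only; the arithmetic follows the two update rules as written.

```python
import numpy as np

def srrbf_reference_step(W, V, X, y, P, H, eta=0.1):
    """One batch step of Equation 14.

    W: (n_hid, d), V: (n_hid, C), X: (m, d), y: (m,) labels in 0..C-1,
    P: (m, C) class probabilities, H: (m, n_hid) hidden outputs h_n^i.
    """
    v_true = V[:, y].T                        # (m, n_hid): v_Kn for each sample's true class K
    v_mean = P @ V.T                          # (m, n_hid): probability-weighted mean v-tilde_n^i
    gain = (v_true - v_mean) * H              # signed, label-dependent step per sample and neuron
    diff = X[:, None, :] - W[None, :, :]      # (m, n_hid, d): X^i - W_n
    return W + eta * np.einsum("mn,mnd->nd", gain, diff)

def som_reference_step(W, X, Sigma, eta=0.1):
    """One batch step of Equation 15; Sigma: (m, n_hid) neighborhood values sigma(win^i, n)."""
    diff = X[:, None, :] - W[None, :, :]
    return W + eta * np.einsum("mn,mnd->nd", Sigma, diff)

# Usage: one hidden neuron at 0, one input at 1, two classes with weights +1 and -1.
W = np.array([[0.0]]); X = np.array([[1.0]]); H = np.array([[1.0]])
V = np.array([[1.0, -1.0]]); P = np.array([[0.5, 0.5]])
toward = srrbf_reference_step(W, V, X, np.array([0]), P, H)   # true-class weight above the mean
away = srrbf_reference_step(W, V, X, np.array([1]), P, H)     # true-class weight below the mean
som = som_reference_step(W, X, np.array([[1.0]]))             # label plays no role
```

With the same input, the S-rRBF step moves the reference vector towards or away from the input depending on the label, whereas the SOM step always moves it towards the input.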

3 Experiments

We chose a variety of problems with various degrees of complexity, data dimensions and data sizes to test the generality and scalability of the proposed S-rRBF. We compared our proposed architecture against three common deep learning models, namely a Deep Belief Network (DBN) [29, 30], where the internal representations were obtained through unsupervised layer-wise training utilizing Restricted Boltzmann Machines, a Stacked Autoencoder (SAE) [31, 32], where the internal representations were layer-wise autoencoders, and a ReLU MLP [33–35] with a softmax output layer and a cross-entropy loss function. The average classification error rates over 15-fold cross-validation tests, with the corresponding variances in brackets, are shown in Table 1. In these experiments, the number of hidden neurons and the structures of the deep neural networks were determined empirically, and the results of the best settings were recorded for comparison. In Table 1 the performance of the best algorithm is highlighted in bold. The results indicate that although the S-rRBF does not always outperform the three deep networks, it generally compares favorably with the best-performing deep model.
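The evaluation protocol described above can be sketched as follows. This is not the authors' experimental code: the function names, the random shuffling, the seed, and the placeholder majority-class model are assumptions, and the spread is summarized here by the sample standard deviation, following the caption of Table 1.

```python
import numpy as np

def kfold_error_rates(X, y, fit_predict, k=15, seed=0):
    """Return the test error rate (%) of each of k cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(100.0 * np.mean(y_pred != y[test_idx]))
    return np.array(errors)

def majority_class(X_train, y_train, X_test):
    """Placeholder model (hypothetical): always predicts the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    return np.full(len(X_test), labels[np.argmax(counts)])

# Usage on synthetic labels; any model with the same fit_predict signature can be plugged in.
X = np.zeros((60, 2))
y = np.array([0] * 45 + [1] * 15)
errs = kfold_error_rates(X, y, majority_class)
print(f"{errs.mean():.1f} ({errs.std(ddof=1):.1f})")
```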

Table 1
Performance in terms of error rate (%), with the standard deviation in brackets, of the different methods on a series of standard machine learning benchmark problems.

Dataset	S-rRBF	DBN	SAEs	ReLU MLP
Abalone	27.8 (1.9)	27.8 (2.2)	26.4 (2.4)	26.6 (3.6)
Activity Log	6.4 (1.4)	11.4 (1.8)	1.3 (0.3)	21.9 (8.4)
Balance	9.6 (2.6)	11.4 (5.3)	1.1 (2.5)	2.9 (2.7)
Bank Marketing	10.5 (1.2)	11.5 (1.4)	13.3 (1.6)	10.8 (1.5)
Breast Cancer	2.5 (1.8)	2.6 (2.3)	3.7 (2.6)	3.2 (2.3)
Cardiotocography	9.6 (2.4)	13.3 (3.2)	9.6 (2.1)	11.6 (4.2)
Heart	15.2 (5.4)	16.3 (7.0)	22.2 (8.4)	15.6 (10.1)
Iris	1.3 (2.8)	2.7 (4.7)	3.3 (3.5)	4.7 (6.4)
Spambase	7.6 (1.5)	8.9 (1.4)	6.0 (1.3)	11.8 (8.4)
Waveform	13.4 (1.6)	14.2 (1.0)	15.0 (1.9)	13.8 (2.2)
Wine Quality	21.7 (2.3)	25.4 (1.7)	22.5 (2.2)	22.1 (1.7)
MNIST Fashion	11.3 (0.68)	13.9 (0.6)	10.9 (0.7)	41.6 (10.6)

For demonstrating the uniqueness of the S-rRBF internal representations, the internal layer was visualized and compared against the internal layers of deep models for some benchmark problems.

The first example is the internal representation for the Iris problem, a well-known 3-class problem in which one class is linearly separable from the other two, which are not linearly separable from each other. To understand the original structure of this data set, two dimension-reduction methods were applied, t-distributed Stochastic Neighbor Embedding (t-SNE) [36, 37] and SOM. The results are shown in Fig. 2a and b. The former shows the t-SNE representation, in which the stochastic proximity structure of the data in their original high-dimensional space is preserved. The latter shows the SOM representation, in which the topological structure of the data is preserved. In these 2-D maps, each class is represented with a different marker and color, while a × on the SOM shows the overlapping representation of data points belonging to conflicting classes. The size of a marker reflects the number of data points it represents. It should be noted that for SOM and t-SNE, the 2-D representations were constructed based only on the feature similarities of the data; the labels played no role in the dimensional reduction process and were used only for visual clarity. These two low-dimensional maps nicely describe the well-known separability characteristics of the Iris data set, where the samples of one class are easily separable from the two other classes, while some samples from those two classes overlap. The internal layer of the S-rRBF is given in Fig. 2c. It shows that one of the classes is aligned relatively far from the other two classes, which are aligned close to each other. For comparison, the last hidden layers of the comparative deep models are also shown. As the deep models may contain more than two neurons in their last hidden layers, their internal representations are treated as high-dimensional vectors and hence require a dimensional reduction method for visualization. Here, we chose t-SNE for reducing their dimensions. Figure 2d shows the representation in the last hidden layer of the DBN, Fig. 2e shows the representation of the last hidden layer in the SAEs, while Fig. 2f shows the representation of the last hidden layer in the ReLU MLP. The S-rRBF and the comparative models show similar internal representations in that the classes for this problem are well separable except for a few points belonging to two of the classes. This is a good indication that this is an easy classification problem, which is confirmed by the classification performances given in Table 1.
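A grid-based map of the kind described above can be tabulated by assigning each sample to its winning neuron and counting labels per grid cell: the total count in a cell gives the marker size, and a cell with counts from more than one class corresponds to a × marker. The sketch below is one plausible way to do this; the function name, the row-major ordering of reference vectors on the grid, and 0-based labels are assumptions, not details given in the text.

```python
import numpy as np

def winner_class_counts(X, y, W, rows, cols, n_classes):
    """Count, per grid cell, how many samples of each class have that cell as winner.

    X: (m, d) inputs, y: (m,) labels in 0..n_classes-1, W: (rows*cols, d) reference vectors.
    """
    sq_dist = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # (m, n_hid)
    winners = sq_dist.argmin(axis=1)
    counts = np.zeros((rows, cols, n_classes), dtype=int)
    np.add.at(counts, (winners // cols, winners % cols, y), 1)     # accumulate repeated cells
    return counts

# Usage: a 1 x 2 grid with two well-separated reference vectors.
W = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[0.1, 0.0], [9.0, 9.0], [10.0, 10.0]])
y = np.array([0, 1, 1])
counts = winner_class_counts(X, y, W, rows=1, cols=2, n_classes=2)
conflict = (counts > 0).sum(axis=2) > 1    # cells shared by more than one class
```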

Fig.2

Iris (dim:4, class:3).

The second example is the Bank Marketing data, a 48-D, 2-class problem. The low-dimensional representation of this problem using t-SNE is shown in Fig. 3a, and the corresponding SOM is shown in Fig. 3b. The figures indicate that there are many overlapping data points belonging to contrasting classes, which makes this problem relatively difficult to classify. Figure 3c indicates that the S-rRBF generates a clear topographical internal representation illustrating how the classifier separates the two classes. The internal representations in the DBN, SAEs, and ReLU MLP are shown in Fig. 3d, e and f, respectively. All the internal representations nicely visualize the imbalance between the two classes. However, it can be observed that for this data set, the internal representation of the S-rRBF shows better separability than those of the deep models. This is consistent with its better performance shown in Table 1.

Fig.3

Bank Marketing (dim:48, class:2).

The third example is the Heart data, a 13-D, 2-class problem. The t-SNE and SOM representations in Fig. 4a and b show that there are many overlapping samples belonging to the two conflicting classes, indicating that this is a relatively difficult classification problem. The representations of the DBN, SAEs, and ReLU MLP, in Fig. 4d, e and f, indicate that there are some sub-spaces of overlapping classes that are likely to be misclassified. The representation of the S-rRBF, in Fig. 4c, shows better class separability than the comparative methods. However, overlapping areas are also apparent in the S-rRBF. This is consistent with the high error rate shown in Table 1.

Fig.4

Heart (dim:13, class:2).

In the three examples above the S-rRBF outperforms the comparative models. However, for understanding the characteristics of the proposed method, it is also important to demonstrate some negative results [27, 38] where the S-rRBF was outperformed by other methods.

The fourth example is the Activity Log data, a 561-D, 6-class problem. The t-SNE and SOM representations in Fig. 5a and b indicate many overlapping classes. The S-rRBF formed an interesting representation, shown in Fig. 5c, consisting mainly of three distinctive clusters. The first contains samples from classes that are closely aligned and is thus likely to produce misclassification. The second contains samples from two classes, while the third contains only a single class. From Fig. 5d and f, it can be observed that the DBN and ReLU MLP generate overlapping representations that are likely to cause misclassification, while from Fig. 5e it is obvious that the SAEs generate nicely separable cluster representations, which is consistent with their outstanding performance as shown in Table 1.

Fig.5

Activity Log (dim:561, class:6).

The final example is the recently proposed "Fashion MNIST", an apparel-related image classification problem [39] where the S-rRBF was outperformed by the SAEs. Some of the image samples from this data set are shown in Fig. 6, in which the numbers and the markers indicate the classes that they represent. This data set has the same dimensionality, number of classes and data size as the traditional MNIST handwritten digits data set [40]. The t-SNE and SOM representations of this problem are shown in Fig. 7a and b, respectively. It is interesting to observe that class 6 is distinctively separated from the other nine classes. Aside from class 6, there are some classes that form distinctive clusters, but there are also many overlapping classes, indicating that some of the classes are likely to be easily classified while others are not. The representation of the S-rRBF, shown in Fig. 7c, differs significantly from those of the three comparative methods, shown in Fig. 7d, e and f, respectively. The internal representation of the S-rRBF clearly shows that some classes are distinctively separable while others are difficult to classify. For this problem, the SAEs formed the best cluster separability in their internal layer, as is apparent from Fig. 7e, which is also consistent with their best classification performance as shown in Table 1.

Fig.6

Fashion MNIST Samples.

Fig.7

Fashion MNIST (dim:784, class:10).

From the above examples it is clear that the S-rRBF generates unique internal representations compared to the deep models. The visual appearance of the internal representation correlates well with the performance of the S-rRBF. This fact supports our argument that topographic representation can be useful for the internal representation in hierarchical neural networks. It should also be noted that the S-rRBF directly forms a low-dimensional representation in its internal layer, as opposed to the other methods, which form high-dimensional representations that are often hard to interpret and require additional dimensional reduction methods such as t-SNE for visualization. Furthermore, it is important to notice that while the S-rRBF is not necessarily the best classifier in each benchmark problem, its performance is still close to the best. Overall, the S-rRBF is a stable classifier as it performs relatively well on all benchmarks. This is an empirical indication that topographic internal representation adds robustness to a classifier.
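One way a reader could quantify "close to the best" is to compute, for each method, its gap to the best error rate on every data set in Table 1. The summary statistics below (number of data sets on which a method is best, mean gap, worst gap) are choices made for this sketch and are not measures used in the paper; only the error rates are taken from Table 1.

```python
import numpy as np

methods = ["S-rRBF", "DBN", "SAEs", "ReLU MLP"]
# Error rates (%) copied from Table 1, one row per data set, columns in the order above.
errors = np.array([
    [27.8, 27.8, 26.4, 26.6],   # Abalone
    [6.4, 11.4, 1.3, 21.9],     # Activity Log
    [9.6, 11.4, 1.1, 2.9],      # Balance
    [10.5, 11.5, 13.3, 10.8],   # Bank Marketing
    [2.5, 2.6, 3.7, 3.2],       # Breast Cancer
    [9.6, 13.3, 9.6, 11.6],     # Cardiotocography
    [15.2, 16.3, 22.2, 15.6],   # Heart
    [1.3, 2.7, 3.3, 4.7],       # Iris
    [7.6, 8.9, 6.0, 11.8],      # Spambase
    [13.4, 14.2, 15.0, 13.8],   # Waveform
    [21.7, 25.4, 22.5, 22.1],   # Wine Quality
    [11.3, 13.9, 10.9, 41.6],   # MNIST Fashion
])

best = errors.min(axis=1, keepdims=True)   # best error rate per data set
gap = errors - best                        # distance of every method to that best value
wins = (errors == best).sum(axis=0)        # a tie counts for every tied method

for name, w, g_mean, g_max in zip(methods, wins, gap.mean(axis=0), gap.max(axis=0)):
    print(f"{name:9s} best on {int(w):2d} data sets, mean gap {g_mean:.2f}, worst gap {g_max:.1f}")
```

Such a summary ignores the spread across folds, so it should be read alongside the bracketed values in Table 1 rather than as a replacement for them.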

4 Conclusions

In this research we show that it is possible to build a hierarchical neural network that self-organizes a class-relevant topographic internal representation. More specifically, we show that topographic self-organization can emerge as an implication of supervised learning. Thus, the two learning processes of topographic self-organization and supervised learning, which are often considered to be unrelated, can be viewed as two different aspects of a single learning mechanism in which they are distinguished only by the layer they occupy. It should be noted that while self-organization is normally an unsupervised process, here it is not fully unsupervised in that it is directed by the output error. Importantly, the self-organizing process arises from the derivation of a loss function, which is one of the novelties of this study.

The experiments show that the classification performance of the proposed model is comparable to that of standard supervised networks. While the proposed model does not always outperform existing conventional models, we found that its performance was comparable to that of the best performer for most of the diverse benchmark applications. Here we argue that looking only for the best accuracy in single applications is not necessarily a good way to evaluate methods for machine learning and applications of artificial intelligence. Specific machine learning methods often perform well on the data sets for which they have been designed. However, it is well acknowledged that sufficient performance across a variety of tasks is useful in many applications, such as robotics, where systems have to function well in changing environments and on a variety of tasks. Also, it is likely that robust and versatile systems are important for understanding human abilities better. After all, it has been shown that engineered systems can outperform humans in specific tasks, such as a calculator multiplying two large numbers. However, the versatility of human skills is still unmatched by artificial intelligence systems, and topographic organization has been observed in the human brain.

References

1. LeCun, Bengio, Hinton, Deep learning, Nature 521 (2015), 436–444.
2. Bengio, Learning Deep Architectures for AI, now Publishers, Hanover, MA, 2009.
3. Krizhevsky, Sutskever, Hinton, Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J.C. Burges, Bottou and Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, Curran Associates, Inc., 2012, pp. 1097–1105.
4. Goodfellow, Bengio, Courville, Deep Learning, The MIT Press, Cambridge, MA, 2016.
5. Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014), 1929–1958.
6. Goodfellow, Warde-Farley, Mirza, Courville, Bengio, Maxout networks. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of The 30th International Conference on Machine Learning, volume 28 of JMLR Workshop and Conference Proceedings, JMLR.org, 2013, pp. 1319–1327.
7. Hochreiter, The vanishing gradient problem during learning recurrent neural nets and problem solutions, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6(2) (1998), 107–116.
8. Glorot, Bordes, Bengio, Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudik, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, Florida, USA, volume 15 of JMLR Workshop and Conference Proceedings, JMLR.org, 2011, pp. 315–323.
9. Hinton, Salakhutdinov, Reducing the dimensionality of data with neural networks, Science 313(5786) (2006), 504–507.
10. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14(8) (2002), 1711–1800.
11. Bourlard, Kamp, Auto-association by multilayer perceptrons and singular value decomposition, Biological Cybernetics 59 (1988), 291–294.
12. Hinton, Zemel R.S., Autoencoders, minimum description length, and Helmholtz free energy. In J.D. Cowan, Tesauro and Alspector, editors, Advances in Neural Information Processing Systems 6 (NIPS 1993), 1994.
13. Hinton, Osindero, Teh Y.-W., Fast learning algorithm for deep belief nets, Neural Computation 18 (2006), 1527–1554.
14. Salakhutdinov, Learning deep generative models, Annual Review of Statistics and Its Application 2 (2015), 361–385.
15. Erhan, Bengio, Courville, Manzagol P.-A., Vincent, Bengio, Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research 11 (2010), 625–660.
16. Bengio, Deep learning of representations for unsupervised and transfer learning. In I. Guyon, Dror, Lemaire, Taylor and Silver, editors, Proceedings of ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proceedings of Machine Learning Research, Bellevue, Washington, USA, PMLR, 2012, pp. 17–36.
17. Bengio, Courville, Vincent, Representation learning: A review and new perspectives, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8) (2013), 1798–1828.
18. Willshaw D.J., Von Der Malsburg, How patterned neural connections can be set up by self-organization, Proc Royal Society of London 194(1117) (1976), 431–445.
19. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics 43 (1982), 59–69.
20. Kohonen, Essentials of the self-organizing map, Neural Networks 37 (2013), 52–65.
21. Hubel, Wiesel, Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex, The Journal of Physiology 160(1) (1962), 106–154.
22. Romani G.L., Williamson S.J., Kaufman, Tonotopic organization of the human auditory cortex, Science 216(4552) (1982), 1339–1340.
23. Hartono, Hollensen, Trappenberg, Learning-regulated context relevant topographical map, IEEE Trans On Neural Networks and Learning Systems 26(10) (2015), 2323–2335.
24. Trappenberg, Hollensen, Hartono, Classifier with hierarchical topographical maps as internal representation, In Proceedings of the 2nd International Conference on Learning Representations (ICLR) 2015, 2015.
25. Hartono, Hollensen, Trappenberg, Visualizing hierarchical representation in a multilayered restricted RBF network, In Proc International Conference on Artificial Neural Networks (ICANN 2014), LNCS 8681, 2014, pp. 339–346.
26. Hartono, Classification and dimensional reduction using restricted radial basis function networks, Neural Computing and Applications 30(3) (2018), 905–915.
27. Wolpert D.H., Macready W.G., No free lunch theorems for optimization, IEEE Trans on Evolutionary Computation 1(1) (1997), 67–81.
28. Poggio, Girosi, Networks for approximation and learning, Proceedings of IEEE 87 (1990), 1484–1487.
29. Le Roux, Bengio Y., Representational power of restricted Boltzmann machines and deep belief networks, Neural Computation 20(6) (2008), 1631–1649.
30. Bengio, Lamblin, Popovici, Larochelle, Greedy layer-wise training of deep networks. In B. Schölkopf, J.C. Platt and Hoffman, editors, Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 153–160.
31. Baldi, Autoencoders, unsupervised learning, and deep architectures. In Isabelle Guyon, Gideon Dror, Vincent Lemaire, Graham Taylor and Daniel Silver, editors, Proceedings of ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proceedings of Machine Learning Research, PMLR, 2012, pp. 37–49.
32. Vincent, Larochelle, Lajoie, Bengio, Manzagol P.-A., Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, Journal of Machine Learning Research 11 (2010), 3371–3408.
33. Maas A.L., Hannun A.Y., Ng A.Y., Rectifier nonlinearities improve neural network acoustic models, In Proc International Conference on Machine Learning (ICML) 2013, volume 30, 2013.
34. Montúfar, Pascanu, Cho, Bengio, On the number of linear regions of deep neural networks, In Proceedings of the 27th International Conference on Neural Information Processing Systems, Volume 2, NIPS’14, Cambridge, MA, USA, MIT Press, 2014, pp. 2924–2932.
35. Arora, Mianjy, Mukherjee, Understanding deep neural networks with rectified linear units, Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2018.
36. van der Maaten L.P.J., Visualizing high-dimensional data using t-SNE, Journal of Machine Learning Research 9 (2008), 2579–2605.
37. van der Maaten L.P.J., Postma E.O. and van den Herik H.J., Dimensionality reduction: A comparative review, Technical Report TiCC-TR -005, Tilburg University, 2009.
38. Sculley, Snoek, Rahimi, Wiltschko, Winner’s curse? On pace, progress, and empirical rigor, In Proceedings of the 2nd International Conference on Learning Representations (ICLR) 2018, 2018.
39. Xiao, Rasul, Vollgraf, Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms, 2017. https://arxiv.org/abs/1708.07747.
40. LeCun, Bottou, Bengio, Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11) (1998), 2278–2324.