Abstract
In recent years, machine learning, especially deep models, has improved significantly in performance and has thus become applicable to many problems that until a decade ago were prohibitively difficult to learn. One of the strengths of deep models is that they adaptively capture well-structured internal representations of the input that help them generate desirable outputs. However, while many studies are dedicated to improving the performance of neural networks, less effort is focused on understanding the formation of internal representations in hierarchical neural networks and their implications for performance. Here, we study a network model that incorporates topographical self-organizing maps into a supervised network and show how gradient learning results in a form of self-organizing learning rule. Topographical self-organization as an internal representation is interesting because, while such principles motivated many early learning models and are relevant to biological learning systems, they have rarely been included in supervised learning architectures. In this paper our objectives are to explain the dynamics of the proposed model, to visually compare its internal representation against those of some deep models and, importantly, to show that our model is robust in the sense that it can be applied to a variety of areas, which is believed to be a hallmark of biological learning systems.
Introduction
Machine learning has made significant improvements that allow the development of many new applications, specifically with the help of deep neural network models [1–4]. Deep learning was made possible by increasingly fast processor technology such as GPUs as well as by algorithmic advances, as explained in [5–8]. Even a decade ago, training deep neural networks was difficult if not impossible because of their complexity or the volume of data involved. Such learning tasks can now be executed on reasonable time scales, which allows deep learning to be applied to many real-world problems.
Learning good internal representations is a key aspect of supervised learning in hierarchical neural networks. Indeed, it is interesting to recall that the first breakthrough in deep learning came from an application of unsupervised pre-training with gradient-based fine-tuning [9]. Restricted Boltzmann Machines (RBMs) [10] and Autoencoders [11, 12] were utilized for constructing the hidden layers of early models such as Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM) and Stacked Autoencoders [13, 14]. While much deep learning research focuses on learning efficiency and execution performance, less effort is dedicated to understanding the formation of internal representations in hierarchical neural networks. Most of the existing studies on internal representations [15–17] focus on pre-training of deep models but rarely give a perspective on the structure of the internal representations and their relation to the final outputs. Moreover, while topographic self-organizing maps have been integral parts of biologically motivated learning theories since the 1970s [18–20], the role of such topographic self-organizing mechanisms is less understood in modern deep learning theories. Topographic self-organization is interesting because it is often observed in biological neural networks [21, 22] and may thus give new insights into learning and self-organization in artificial neural networks.
In this paper we propose a network that combines aspects of self-organization with a supervised network model for classification. More specifically, we modify the previously proposed Restricted Radial Basis Function Networks (rRBF) [23–26] with a Softmax output layer that is trained on a cross-entropy cost function. This is more consistent with a probabilistic interpretation of the class membership output function than the previous implementation. The modification allows a clearer derivation of the emergence of the self-organizing learning aspects in this network. We call this modified network the Softmax Restricted Radial Basis Function Network (S-rRBF).
A major thrust of this paper is to show that it is possible to build a learning model in which self-organization and supervised learning are only different aspects of a single learning mechanism. Furthermore, we visualize the uniqueness of the internal representations of the S-rRBF compared to some of the more regular deep models. We show that our network achieves performance comparable to that of the more regular deep network architectures while having the added feature of robustness, in the sense that it consistently compares favorably with the best performers in the studied examples even though the best performer changes between applications. While the results are consistent with Wolpert’s ‘No free lunch theorem’ [27], they also highlight that robustness across applications is an important property of flexible learners, which should not be judged on accuracy alone. Such application robustness is thought to be of importance in understanding biological learning systems.
We highlight our ideas here with well understood benchmark examples of moderate complexity. However, the proposed architecture can be easily scaled to deeper layers and hence applied to deeper learning problems. The main contribution in this paper is showing algebraically the emergence of the self-organizing structures from supervised gradient learning. We believe that this research opens new insights into the relation between unsupervised and supervised learning.
Softmax Restricted Radial Basis Function Networks (S-rRBF)
The Softmax Restricted Radial Basis Function Network (S-rRBF) is a hierarchical neural network that has one or more hidden layers in which the neurons are aligned in a two-dimensional grid. For simplicity we restrict our discussion to networks with one hidden layer, as shown in Fig 1. The S-rRBF is based on the Restricted Radial Basis Function Networks (rRBF) introduced in [23, 26]. Unlike the original rRBF, which has a sigmoidal output layer and a quadratic cost function, the S-rRBF adopts a softmax output layer with a cross-entropy loss function. These modifications yield a clearer understanding of the relation between the internal self-organization and the supervised learning process. So far, most studies have considered self-organization and supervised learning to be two unrelated learning mechanisms. Here, we argue that with the proposed S-rRBF it is possible to build a learning model in which topographic self-organization is an integrated part of supervised learning, thus giving a new perspective on the learning process of artificial neural networks.

Fig 1. Outline of S-rRBF.
The dynamics of the S-rRBF are as follows. Suppose the S-rRBF is trained against a data set {(
Given input,
Here,
The function σ () in Equation is a neighborhood function defined as
where dist(win, j, t) is the Euclidean distance between the winning neuron and the j-th neuron on the two-dimensional grid of the hidden layer. The variable t is the current epoch, and t_end is the target epoch at which the learning process is terminated. The activation function of a hidden neuron in the S-rRBF is similar to that of the Radial Basis Function Network (RBF) [28], except that in the S-rRBF it is topologically restricted by the neighborhood function σ(win, j, t).
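The hidden-layer computation described above can be illustrated with a minimal sketch. The exact forms of the neighborhood function and of the activation are not reproduced in this text, so the sketch makes its own assumptions: a Gaussian neighborhood over the squared grid distance, a width that shrinks linearly from `s_start` to `s_end` as t approaches t_end, and an unscaled Gaussian radial basis. All function and parameter names are hypothetical.

```python
import math

def grid_positions(rows, cols):
    """Return the 2-D grid coordinate of each hidden neuron, row by row."""
    return [(r, c) for r in range(rows) for c in range(cols)]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def neighborhood(win_pos, j_pos, t, t_end, s_start=2.0, s_end=0.5):
    """Assumed neighborhood: Gaussian on the grid, width shrinking with t."""
    width = s_start + (s_end - s_start) * (t / t_end)
    return math.exp(-squared_distance(win_pos, j_pos) / (2.0 * width ** 2))

def hidden_outputs(x, weights, positions, t, t_end):
    """Return (winner index, hidden activations) for one input vector x."""
    dists = [squared_distance(x, w) for w in weights]
    # The winner is the neuron whose reference vector is closest to x.
    win = min(range(len(weights)), key=dists.__getitem__)
    # RBF-like response, restricted by grid proximity to the winner.
    h = [neighborhood(positions[win], positions[j], t, t_end) * math.exp(-dists[j])
         for j in range(len(weights))]
    return win, h
```

With these assumptions, a neuron responds strongly only if its reference vector is close to the input and it also sits near the winner on the grid, which is the topological restriction the paragraph refers to.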
The outputs of the hidden neurons are then propagated to the output layer, where the k-th output, O_k, is computed as follows.
The conditional probability that the S-rRBF classifies the input into the class k is given by
The S-rRBF is then trained to minimize the cross entropy in Equation.
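As a reminder of the standard definitions involved, the following sketch computes softmax class probabilities from a list of outputs O_k, the cross-entropy for one labeled sample, and the gradient of that loss with respect to each output. These are the textbook forms, not necessarily the exact notation of the equations referred to here, and the code uses 0-based class indices whereas the text numbers classes from 1 to C. The function names are hypothetical.

```python
import math

def softmax(outputs):
    """Map raw outputs O_k to class probabilities that sum to one."""
    m = max(outputs)  # subtract the maximum for numerical stability
    exps = [math.exp(o - m) for o in outputs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(outputs, true_class):
    """Negative log-probability assigned to the true class (0-based)."""
    return -math.log(softmax(outputs)[true_class])

def output_error(outputs, true_class):
    """Gradient of the cross-entropy with respect to each output O_k."""
    probs = softmax(outputs)
    return [p - (1.0 if k == true_class else 0.0) for k, p in enumerate(probs)]
```

The gradient is negative for the true class and positive for every other class, so its sign differs between outputs; this is the standard property of the softmax and cross-entropy pairing.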
Considering that Y_i ∈ {1, …, C}, Equation can be rewritten as
To minimize the cross-entropy, the gradient of the loss function with respect to vector
Also,
Defining,
When the true class of the given input
It is important to mention that although the internal self-organization of S-rRBF is similar to that of SOM, they differ in a significant way, in that in SOM, as shown in Equation, the reference vector is always modified towards the input vector while the direction of self-organization in S-rRBF is regulated by the sign of
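The contrast drawn above can be illustrated with a small sketch. The first function is the classic SOM update, which always moves a reference vector toward the input. The second scales the same step by a signed scalar, so the vector may move toward or away from the input. The argument `signal` is a hypothetical stand-in for the sign-carrying term that the text names; its definition is not reproduced here, and this sketch should not be read as the paper's actual update rule.

```python
def som_update(w, x, neighborhood, lr=0.1):
    """Classic SOM step: the reference vector always moves toward x."""
    return [wi + lr * neighborhood * (xi - wi) for wi, xi in zip(w, x)]

def regulated_update(w, x, neighborhood, signal, lr=0.1):
    """Signed step: a positive signal attracts w to x, a negative one repels it.

    `signal` is an assumed placeholder for an error-derived term.
    """
    return [wi + lr * signal * neighborhood * (xi - wi) for wi, xi in zip(w, x)]

def distance(a, b):
    """Euclidean distance, used to check the direction of a step."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
```

For example, starting from `w = [0.0, 0.0]` with input `x = [1.0, 1.0]`, `som_update` reduces the distance to `x`, while `regulated_update` with a negative `signal` increases it.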
We chose a variety of problems with various degrees of complexity, data dimensions and data sizes to test the generality and scalability of the proposed S-rRBF. We compared our proposed architecture against three common deep learning models, namely a Deep Belief Network (DBN) [29, 30], where the internal representations were obtained through unsupervised layer-wise training utilizing Restricted Boltzmann Machines, a Stacked Autoencoder (SAE) [31, 32], where the internal representations were layer-wise autoencoders, and a ReLU MLP [33–35] with a Softmax output layer and a cross-entropy loss function. The average classification error rates over a 15-fold cross-validation test, with the corresponding variances in brackets, are shown in Table 1. In these experiments, the number of hidden neurons as well as the structures of the deep neural networks were chosen empirically, and the results of the best settings were recorded for comparison. In Table 1 the performance of the best algorithm is highlighted in bold. The results indicate that although the S-rRBF does not always outperform the three deep networks, it generally compares favorably with the best-performing deep model.
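The evaluation protocol mentioned above, k-fold cross-validation with a mean and a spread reported per method, can be sketched as follows. This is a generic illustration rather than the authors' experimental code: the shuffling seed, the way folds are formed and the use of the population standard deviation are all assumptions, and the function names are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split sample indices into k (train, test) pairs with disjoint test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # every k-th index per fold
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

def mean_and_std(values):
    """Mean and population standard deviation of per-fold error rates."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance ** 0.5
```

Each sample appears in exactly one test fold, so every model is scored once on every sample; the per-fold error rates would then be summarized with `mean_and_std`.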
Table 1. Performance in terms of Error Rate (%) (Standard Deviation) of the different methods on a series of standard machine learning benchmark problems.
For demonstrating the uniqueness of the S-rRBF internal representations, the internal layer was visualized and compared against the internal layers of deep models for some benchmark problems.
The first example is the internal representation for the Iris problem, a well-known 3-class problem in which one of the classes is linearly separable from the other two, which are not linearly separable from each other. To understand the original structure of this data set, two dimension-reduction methods were applied: t-distributed Stochastic Neighbor Embedding (t-SNE) [36, 37] and SOM. The results are shown in Fig. 2a and b. The former shows the t-SNE representation, in which the stochastic proximity structure of the data in their original high-dimensional space is preserved. The latter shows the SOM representation, in which the topological structure of the data is preserved. In these 2-D maps, each class is represented with a different marker and color, while a × on the SOM shows the overlapping representation of data points belonging to conflicting classes. The size of a marker reflects the number of data points it represents. It should be noted that for SOM and t-SNE, the 2-D representations were constructed based only on the feature similarities of the data; the labels played no part in the dimension reduction and were used only for visualization clarity. These two low-dimensional maps nicely describe the well-known separability characteristics of the Iris data set, where the samples of one class are easily separable from the two other classes, while some samples from those two classes overlap. The internal layer of the S-rRBF is given in Fig. 2c. It shows that one of the classes is aligned relatively far from the other two classes, which are aligned close to each other. For comparison, the last hidden layers of the comparative deep models are also shown. As the deep models may contain more than two neurons in their last hidden layers, their internal representations are treated as high-dimensional vectors and hence require a dimension-reduction method for visualization. Here, we chose t-SNE for reducing their dimensions. 
Figure 2d shows the representation in the last hidden layer of DBN, Fig. 2e shows the representation of the last hidden layer in SAEs, while Fig. 2f shows the representation of the last hidden layer in ReLU MLP. The S-rRBF and the comparative models show similar internal representations in that the classes for this problem are well separable except for a few points belonging to two of the classes. This is a good indication that this is an easy classification problem, which is confirmed by the classification performances given in Table 1.
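The marker convention described above for the SOM maps (one marker per grid cell, sized by the number of data points mapped to it, with × marking cells that hold conflicting classes) can be sketched as follows. The sketch assumes an already trained map given as a list of reference vectors; the function names and the use of the string "x" for the conflict marker are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

def map_to_grid(points, labels, weights):
    """Assign each point to its best-matching neuron and count labels per cell."""
    cells = defaultdict(Counter)
    for x, y in zip(points, labels):
        win = min(range(len(weights)),
                  key=lambda j: sum((xi - wi) ** 2 for xi, wi in zip(x, weights[j])))
        cells[win][y] += 1
    return cells

def describe_cells(cells):
    """Return {cell: (marker, size)}; the marker is 'x' when classes conflict."""
    summary = {}
    for j, counts in cells.items():
        size = sum(counts.values())  # marker size = number of mapped points
        marker = "x" if len(counts) > 1 else next(iter(counts))
        summary[j] = (marker, size)
    return summary
```

Labels are used only after the mapping has been computed, which mirrors the remark that class labels serve visualization clarity and play no part in the dimension reduction itself.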

Fig 2. Iris (dim:4, class:3).
The second example is the Bank Marketing Data, a 48-D, 2-class problem. The low-dimensional representation of this problem using t-SNE is shown in Fig. 3a, and the corresponding SOM is shown in Fig. 3b. The figures indicate that there are many overlapping data points belonging to contrasting classes, which makes this problem relatively difficult to classify. Figure 3c indicates that the S-rRBF generates a clear topographical internal representation illustrating how the classifier separates the two classes. The internal representations in DBN, SAEs, and ReLU MLP are shown in Fig. 3d, e and f, respectively. All the internal representations clearly visualize the imbalance between the two classes. However, it can be observed that for this data set, the internal representation of the S-rRBF shows better separability than that of the deep models. This is consistent with its better performance shown in Table 1.

Fig 3. Bank Credit (dim:48, class:2).
The third example is the Heart Data, a 13-D, 2-class problem. The t-SNE and SOM representations in Fig. 4a and b show that there are many overlapping samples belonging to the two conflicting classes, indicating that this is a relatively difficult classification problem. The representations of DBN, SAEs, and ReLU MLP, in Fig. 4d, e and f, indicate that there are some sub-spaces of overlapping classes that are likely to be misclassified. The representation of the S-rRBF, in Fig. 4c, shows better class separability than the comparative methods. However, the overlapping areas are also apparent in the S-rRBF. This is consistent with the high error rate shown in Table 1.

Fig 4. Heart (dim:13, class:2).
In the three examples above the S-rRBF outperforms the comparative models. However, for understanding the characteristics of the proposed method, it is also important to demonstrate some negative results [27, 38] where the S-rRBF was outperformed by other methods.
The fourth example is the Activity Log Data, a 561-D, 6-class problem. The t-SNE and SOM representations in Fig. 5a and b indicate many overlapping classes. The S-rRBF formed an interesting representation, shown in Fig. 5c, consisting mainly of three distinctive clusters. The first contains samples from classes that are closely aligned and is thus likely to produce misclassification, the second contains samples from two classes, and the third contains only a single class. From Fig. 5d and f, it can be observed that the DBN and ReLU MLP generate overlapping representations that are likely to cause misclassification, while Fig. 5e shows that the SAEs generate nicely separable clusters, which is consistent with their outstanding performance as shown in Table 1.

Fig 5. Activity Log (dim:561, class:6).
The final example is the recently proposed "Fashion MNIST", an apparel-related image classification problem [39] in which the S-rRBF was outperformed by SAEs. Some of the image samples from this data set are shown in Fig. 6, in which the numbers and the markers indicate the classes that they represent. This data set has the same dimensionality, number of classes and data size as the traditional MNIST handwritten digits data set [40]. The t-SNE and SOM representations of this problem are shown in Fig. 7a and b, respectively. It is interesting to observe that class 6 is distinctively separated from the other nine classes. Aside from class 6, some classes form distinctive clusters, but there are also many overlapping classes, indicating that some of the classes are likely to be easily classified while others are not. The representation of the S-rRBF, shown in Fig. 7c, differs significantly from those of the three comparative methods, shown in Fig. 7d, e and f, respectively. The internal representation of the S-rRBF clearly shows that some classes are distinctively separable while others are difficult to classify. For this problem, the SAEs formed the best cluster separability in their internal layer, as is apparent from Fig. 7e, which is also consistent with their best classification performance as shown in Table 1.

Fig 6. Fashion MNIST Samples.

Fig 7. Fashion MNIST (dim:784, class:10).
From the above examples it is clear that the S-rRBF generates unique internal representations compared to the deep models. The visual appearance of the internal representation correlates well with the performance of the S-rRBF. This supports our argument that topographic representation can be useful for the internal representation in hierarchical neural networks. It should also be noted that the S-rRBF directly forms a low-dimensional representation in its internal layer, as opposed to the other methods, which form high-dimensional representations that are often hard to interpret and require additional dimension-reduction methods such as t-SNE for visualization. Furthermore, it is important to notice that while the S-rRBF is not necessarily the best classifier on each benchmark problem, its performance is still close to the best. Overall, the S-rRBF is a stable classifier, as it performs relatively well on all benchmarks. This is an empirical indication that topographic internal representation adds robustness to a classifier.
In this research we show that it is possible to build a hierarchical neural network that self-organizes a class-relevant topographic internal representation. More specifically, we show that topographic self-organization can emerge as a consequence of supervised learning. Thus, the two learning processes of topographic self-organization and supervised learning, which are often considered to be unrelated, can be viewed as two different aspects of a single learning mechanism, distinguished only by the layer they occupy. It should be noted that while self-organization is normally an unsupervised process, here it is not fully unsupervised in that it is directed by the output error. Importantly, the self-organizing process arises from the derivation of a loss function, which is one of the novelties of this study.
The experiments show that the classification performance of the proposed model is comparable to that of standard supervised networks. While the proposed model does not always outperform existing conventional models, we found that its performance was comparable to that of the best performer for most of the diverse benchmark applications. Here we argue that looking only for the best accuracy in single applications is not necessarily a good way to evaluate machine learning and artificial intelligence methods. Specific machine learning methods often perform well on the data sets for which they have been designed. However, it is well acknowledged that sufficient performance across a variety of tasks is useful in many applications, such as robotics, where systems have to function well in changing environments and on a variety of tasks. Also, robust and versatile systems are potentially important for better understanding human abilities. After all, it has been shown that engineered systems can outperform humans on specific tasks, such as a calculator multiplying two large numbers. However, the versatility of human skills is still unmatched by artificial intelligence systems, and topographic organization has been observed in the human brain.
