Abstract
The original self-organizing map (SOM) was proposed in the context of processing numeric data. In previous studies, an extended SOM incorporating data structure distance hierarchies has been proposed to facilitate handling of categorical values. The model could take into account the semantics embedded in categorical values via distance hierarchies. In addition to manual construction by domain experts, an approach to learning distance hierarchies from datasets has been devised. However, the proposed approach in the previous study was based on supervised learning which demands presence of a class attribute in the dataset. In real-world applications, class attribute may not be available. Thus, the supervised approach can be inapplicable. In this article, we present several methods of unsupervised learning of distance hierarchies so that neither are class attribute nor domain experts required in measuring similarity degree between categorical values. We then integrate the learned distance hierarchies with the extended SOM to facilitate the application to datasets without a class attribute. We conduct experiments to verify feasibility and compare performance of the proposed unsupervised-learning methods.
Introduction
Nowadays, information systems are commonly adopted in corporations and price for data storage is much less now than early days. Huge amount of data is therefore easily collected in corporation’s databases. Analyzing big data to unveil hidden patterns in the data which may be valuable for decision making is a hot topic in recent years [1]. Nevertheless, real-world data are complicated, usually consisting of different types of attributes in one dataset including categorical and numeric attributes. Analyzing mixed-type data is not straightforward.
Most of data analysis algorithms handle only one type of the values, either categorical or numeric. When mixed-type of data are encountered, data preprocess transforming one type of the data to the other is usually performed before analysis of the data. For a numeric attribute, discretization of continuous values is often performed. For a categorical attribute, a commonly adopted technique is 1-of-
Visualization is a convenient tool for data analysis, especially in the early stage, data exploration. Self-organizing map (SOM) [3] has been used in data visualization in many applications. SOM, a genre of unsupervised neural networks, projects high-dimensional data to a low-dimensional space, typically, two-dimensional. Consequently, the high-dimensional data become visible and can be analyzed on the two-dimensional map. Furthermore, SOM as a dimensionality reduction technique possesses a nice characteristic, namely, preservation of topological order in the data; Data close to one another in the data space are also near to one another on the projection map. Due to this property, SOM has been widely applied to visualized clustering and classification [4, 5, 6, 7, 8, 9, 10, 11] in addition to many other applications [12, 13].
However, the original training algorithm [3] was proposed in the context of numeric data. The 1-of-
To address the problem resulting from 1-of-
In the extended SOM [2], distance hierarchies for categorical attributes were constructed by domain experts or taken from existing repositories. However, domain experts are not always available or distance hierarchies may not be readily made in some fields. In the recent work MixSOM [14] and GMixSOM [15], an algorithm which constructs distance hierarchies from the data with respect to a class attribute was developed. That is, the distance hierarchies are learned from the training data in a supervised manner. The idea behind the learning algorithm is to analyze how two categorical values are associated with class labels of the class attribute. If two categorical values co-occur with the class labels in a similar way, the values are considered similar. According to the idea, pairwise distances of categorical values in a categorical attribute can be calculated and then distance hierarchy for that attribute can be constructed by using agglomerative hierarchical clustering. Nevertheless, the requirement of existence of a class attribute considerably limits the application of MixSOM and the like since in practice there are large amounts of datasets which do not have a class attribute.
The aim of this study is to tackle the constraint of requiring a class attribute in the dataset or domain experts for providing distance hierarchies when using the extended SOMs. We therefore present several unsupervised approaches which do not require a class attribute to learning distance hierarchies from the dataset and then apply the results to the extended SOM. Projection results with using distance hierarchies learned by the proposed approaches are compared on synthetic and real-world datasets.
The contribution of this article is three-fold.
We propose four unsupervised distance learning algorithms for measuring dissimilarity between the values in a categorical attribute. In the previous studies, the dissimilarity information was provided by domain experts [2] or learned in a supervised manner [14, 15], namely, requiring the existence of a class attribute in the dataset. This article extends the preliminary paper [16] with three more unsupervised learning algorithms and extensive experiments on comparison of different algorithms. We conduct extensive experiments to verify feasibility of the proposed approaches and experimentally compare the projection results in terms of internal and external indices. This article demonstrates in case of lacking a class attribute in the dataset, how one can incorporate unsupervised learning schemes for dissimilarity between categorical values in order to utilize self-organizing maps for mixed-type datasets and conduct visualized data analysis.
The structure of the paper is organized as follows. Section 2 discusses related work including the traditional approach to handling categorical attributes, dissimilarity measures for categorical data. Section 3 provides some background including distance hierarchy, and the extended SOM. Section 4 presents the proposed unsupervised methods for learning dissimilarity between categorical values from the datasets. In Section 5, we experimentally compare the proposed methods by evaluating projection results of the extended SOM. Conclusions are given in Section 6.
1-of-
coding and distance hierarchy
In data analysis, a typical technique to handle categorical values is 1-of-
(a) A toy mixed-type dataset, and (b) the transformed dataset by using 1-of-
In addition to increased dimensionality, another major disadvantage of 1-of-
To address this deficiency, distance hierarchy was proposed [2]. A distance hierarchy is composed of nodes, links, and weights (or link lengths), as shown in Fig. 2. Distance hierarchy can express dissimilarity degree between values. The distance between two values is measured by the path length between the two points representing the two values in the hierarchy. In Fig. 2a, values Coke and Pepsi are represented by the two points at the leaves in the hierarchy. Their distance or path length is two assuming each link has a unit weight. In fact, numeric values can be handled via a distance hierarchy as well. For instance, values 0.2 and
Distance hierarchies for (a) categorical values and (b) numeric values. The weight of each link is assumed to be one and omitted for clarity.
Formally, a point
For instance, assuming
The distance between two points
where
In addition to 1-of-
In [24], the authors presented a similarity measure DISC and compared DISC with the 14 measures studied in [17] on 24 real-world datasets, out of which 12 were used for classification and 12 for regression. Their results indicated DISC outperformed all competing algorithms on almost all datasets. In view of its good performance, we adapted DISC with slight revision to measure dissimilarity between categorical values, referred to as MDISC, in this study and then used the obtained dissimilarity information to construct distance hierarchies needed for the extended SOM.
In a recent paper [25], the authors proposed DILCA and compared DILCA on application to categorical clustering with three clustering algorithms, Delta [26], Rock [27], and Limbo [28], and three measures LIN, OF, and GOODALL3 for categorical data on 16 datasets including 12 from real-world and 4 synthetic. DILCA outperformed the competing approaches on most of the datasets.
Based on the idea of DILCA which measures dissimilarity between two categorical values with respect to other categorical attributes, we propose a measure referred to as ULD1 in this study. However, unlike DILCA which worked for datasets with all categorical attributes, ULD1 can work for datasets having both categorical and numeric attributes. In addition, ULD2 is proposed with some revision to ULD1.
Distance hierarchy and extended SOM
Supervised learning of distance hierarchy
In some domains, there are existing concept hierarchies, such as the International Classification for Diseases (ICD)1
Available at
Available at
where
By using Eq. (2), pairwise distances of distinct values in a categorical attribute can be computed and a distance matrix containing pairwise distances between any two categorical values can be obtained. We then apply a bottom-up agglomerative hierarchical clustering algorithm [30] to the matrix and yield the output, a dendrogram as shown in Fig. 3. The tree-structured dendrogram can be used as the distance hierarchy for the categorical attribute. According to our experience, compared with single and complete linkages, average linkage used to measure the distance between two clusters in the agglomerative hierarchical clustering [30] produces the best result.
Dendrogram generated by an agglomerative hierarchical clustering algorithm with a pairwise distance matrix can be used as a distance hierarchy.
The original training algorithm of SOM was applicable only for numeric data. Rather than using 1-of-
GMixSOM [15] is a growing self-organizing map for mixed-type data, which can project high-dimensional data mixed with categorical and numeric attributes onto a low-dimensional space for visual inspection. Unlike the original SOM, GMixSOM handle categorical attributes directly and preserve semantic similarity between categorical values via distance hierarchies.
In GMixSOM, each attribute is associated with a distance hierarchy. During training, the distance between an input datum
where
In the step of identifying the BMU with respect to an input
We use the example in Fig. 2a to illustrate the process. Assume
As can be seen in the previous illustration, the position of the lowest common tree node of the two leaves in the hierarchy is required for determining whether or not
Common Point Matrix (CPM) corresponding to the distance hierarchy in Fig. 2a
The process to construct the CPM for a categorical attribute is summarized in Algorithm 1. Note the approaches UDL1, UDL2, and MDISC for computing pairwise dissimilarity degree between categorical values will be introduced in the next section. The training algorithm of the extended model GMixSOM [15] is depicted in Algorithm 2.
The training of the extended SOM mainly consist of two phases. In the initialization phase, the map contains five neurons in a cross formation as shown in Fig. 4a with random weights. The growing threshold
For each training instance, the BMU is identified and then its weight is updated. So do the weights of the BMU’s neighbors. Specifically, the BMU with respect to an input
If the accumulated error of the BMU exceeds the growing threshold GT and the BMU is at the border of the current map, a new neuron is inserted or otherwise the error is rippled outward to its neighbors. To ripple out the error, the error of the BMU becomes a half and the immediate neighbors of the BMU equally share the amount of the other half.
(a) The initial map of GMixSOM, (b) for the input 
In insertion of a new neuron, the position where the prototypes of its neighboring neurons are most similar to the input is chosen. For the example shown in Fig. 4b, assume
Formally, the location of new neuron
where
The training algorithm for the extended SOM requires a distance hierarchy associated with each categorical attribute. We propose four unsupervised methods to construct the distance hierarchy if class attribute is not available.
The first one is trivial which is a two-level distance hierarchy. The other three learn pairwise dissimilarity between categorical values from the dataset and then construct distance hierarchy by using agglomerative clustering algorithm with the pairwise dissimilarity matrix. The idea of learning pairwise dissimilarity between categorical values is based on the following.
Assume A and B are two distinct values of a categorical attribute. A and B are deemed to be similar or have a small distance if the way of A co-occurring with the values in the other feature attributes is very similar to that of B co-occurring with those values.
In fact, the idea behind the supervised approach [14] is analogous to this one. The difference is that co-occurrence is computed with respect to the class attribute only. In contrast, the unsupervised approaches compute co-occurrence with regard to the other feature attributes, including categorical and numeric attributes since the class attribute is not available. Based on the idea, we present three unsupervised methods in addition to the two-level distance hierarchy.
Two-level distance hierarchy
The first approach is to use a two-level distance hierarchy (TLDH) for each categorical attribute. A two-level distance hierarchy consists of a root and leaf nodes with each link weight set to 0.5, as shown in Fig. 5. All categorical values of the attribute are associated with the leaf nodes of the hierarchy. As a result, two distinct values have a dissimilarity of 1 and the same values have a dissimilarity of 0.
Two-level distance hierarchy for a categorical attribute.
The second approach is inspired by DILCA or DIstance Learning for Categorical Attributes [25], an algorithm for computing distance between any pair of values in a specified categorical attribute with respect to other attributes referred to as context attributes. However, DILCA took only categorical attributes as context attributes and did not consider numeric attributes. We devise a new formula which takes into account not only categorical attributes but also numeric attributes in this study.
The new unsupervised distance learning, denoted by UDL1, to measure the distance between two categorical values of a target attribute
where
Co-occurrence of categorical values
In Eq. (7), the conditional probability is calculated by the ratio of a target value
Empirical results shown in the experimental section indicate that UDL2 outperformed UDL1 in the application to the extended SOM. We use a simple example to demonstrate the difference between Eqs (7) and (8). Assume that we want to measure the distance between target values
The fourth method is adapted from DISC [24] which measures similarity between categorical values. The difference of this method from UDL1 and UDL2 is when the context attribute is categorical, dissimilarity between
A toy mixed-type dataset
We have introduced five unsupervised approaches, 1-of-
By 1-of-
For the unsupervised algorithms UDL1, UDL2, and MDISC,
For UDL2, it is different from UDL1 only on how
For MDISC, the dissimilarity between
By UDL1,
By UDL1,
Note that UDL2 measures similarity between two values in attribute
Dissimilarity between a, b, and c
Dissimilarity between a, b, and c
Table 4 summarizes the pairwise distances among
We present the framework to analyze by using the extended SOM mixed-type datasets which do not have a class attribute as follows.
By using UDL1, UDL2, or MDISC, we can compute a pairwise distance matrix for each categorical attribute of the mixed-type dataset. Then, by using agglomerative hierarchical clustering, we can construct a dendrogram as the distance hierarchy for each categorical attribute. The next is to convert each distance hierarchy to a common point matrix (CPM). This process is summarized in Algorithm 1, Section 3.2.
The CPM for each categorical attribute computed in Algorithm 1 is needed for the SOM extended for mixed-type datasets. The detailed process is presented in Algorithm 2, Section 3.2.
It is worth noting that the introduced approaches are sensitive to noise and missing values. Data preprocessing shall be conducted to ensure the quality of the data prior to applying the approaches. Noise data shall be corrected, and missing values must be handled.
Experimental results
We use one internal and two external measures to compare projection results by GMixSOM with 1-of-
Evaluation
To evaluate the performance of the proposed methods, we use an internal measure root mean squared error (RMSE), also referred to as quantization error [32], and external measures entropy and accuracy in this study. Other performance measures related to SOM include topological error [33], trustworthiness and continuity [34]. The internal measure does not consider external information such as class attribute but only feature attributes involved in the training. RMSE measures quantization errors, that is, the average distance between input instances and the prototype of its best match unit. RMSE is calculated by
where
The external measure entropy makes use of external information, namely, the class attribute which does not involve in training the SOM. Entropy measures consistence of class labels of the instances projected in neurons and is defined by the weighted average of entropies of individual neurons.
where
We further used the learned projections on the map as a classifier and measured classification accuracy as follows. After training, each neuron is assigned a class label by the majority class of training data projected in the neuron. A test instance is classified by projecting onto the trained map and assigning the class label of its best matching neuron. Classification accuracy is measured by
where
Synthetic dataset SynMix1
Dataset SynMix1 consists of one numeric attribute Amount and two categorical attributes Dept (or Department) and Drink as shown in Table 5. The dataset has 9 groups each of which has 60 or 30 data instances. The values of the numeric attribute were randomly generated according to Gaussian distribution with designated means and standard deviations. Categorical values are randomly generated in the specified groups according to uniform distribution, such as {MIS, MBA, FM} in groups {1, 2, 3}. Each instance has one of the class labels A, B, and C. Ninety percent of the instances in groups 1, 2, and 3 have class label A and the other ten percent are randomly assigned the other two labels.
Synthetic dataset SynMix1 with one numeric and two categorical attributes
Synthetic dataset SynMix1 with one numeric and two categorical attributes
Distance hierarchies shown in Fig. 6 for categorical attributes Dept and Drink were constructed by using the hierarchical clustering algorithm taking the distance matrices produced by UDL1, UDL2, and MDISC as input. When running the unsupervised methods, the class attribute was excluded and only the other two features were considered context attributes. The result is as expected and reflects the relationship between categorical values with regard to the other feature attributes in the dataset. For example, the drinks of the same type, juices, coffee or carbonated drink, were grouped together in the hierarchy.
The projection results by the extended model GMixSOM with different schemes handling categorical attributes were shown in Fig. 7. The grey level of the square grids in the background indicates the distance between the prototypes of two neurons. The darker represents the more distant. As shown in the right part of Fig. 7e, the dark color indicates large distance separates the region of neurons labelled with 7, 8, and 9, and the region of neurons labelled with 4, 5, and 6.
The size of the neurons is in proportion to the number of the instances projected to the neuron. The color of a neuron represents the group of the instances projected in the neuron. A single color indicates all the instances have the same group number and otherwise different groups. The number associated with a neuron represents the group number of the instances projected the neuron. The group number is correspondent to that shown in Table 5. A group number with a plus symbol indicates the instances in the neuron have different group numbers and the number represents the majority group.
Distance hierarchies constructed by using (a) UDL1, (b) UDL2, and (c) MDISC for attributes Department and Drink of dataset SynMix1.
For the example of Fig. 7a, the left-most neuron has three colors and is labeled by 4+, indicating that the data points projected in the neuron are from three groups among which group 4 is the largest. The size of the neuron is larger than that of the right-most neuron labeled by 1+, indicating fewer data are projected in the right-most than the left-most neuron.
As can be seen in Table 5, the instances in groups 1, 2, and 3 are more similar to one another than to the instances in other groups and thus shall be projected to nearby regions. So are the instances in groups 4, 5, and 6, and the instances in groups 7, 8, and 9. The projections in Fig. 7c to e which use the learned distance hierarchies for representing the dissimilarity relationship between categorical values reflect this expectation. By contrast, the results in Fig. 7a and b by using 1-of-
The projection result of synthetic dataset SynMix1 by GMixSOM with (a) 1-of-
Table 6 depicts another synthetic dataset SynMix2 which consists of two categorical (i.e., Dept and Drink) and two numeric attributes (i.e., Amt1 and Amt2) of which the instances were generated in a similar manner with those in SynMix1.
Synthetic dataset SynMix2 with two categorical and two numeric attributes
Synthetic dataset SynMix2 with two categorical and two numeric attributes
Distance hierarchies constructed by using (a) UDL1, (b) UDL2, and (c) MDISC for attributes Department and Drink of dataset SynMix2.
The projection result of synthetic dataset SynMix2 by GMixSOM with (a) 1-of-
Distance hierarchies constructed by using (a) UDL1 (b) UDL2 and (c) MDISC for attributes Education, Marital-Status and Relationship of dataset pAdult.
Figure 8 shows distance hierarchies generated from the dataset by different algorithms. Figure 9 shows the projections by GMixSOM on SynMix2 with different methods for categorical attributes. From Table 6, it is evident that the instances in groups 1 and 2 are more similar to each other than to the instances in other groups. So are those in groups 3 and 4, groups 5 and 6, and groups 7 and 8. The projection results shown in Fig. 9c to e reflect such structure of the data. Specifically, those in Fig. 9c to e with distance hierarchies generated from the dataset by UDL1, UDL2, and MDISC clearly demonstrate four clusters. Furthermore, the background grey level indicates that the distances between clusters (having dark background in between) are large and the distances within each cluster (having light background) are small. Figure 9a with 1-of-
In summary, it shall be noted that the extended model GMixSOM with distance hierarchies learned from the datasets is able to reflect on the map the relationship embedded in between categorical values. This can be easily verified by the projection results of the two synthetic datasets in Figs 7 and 9. We use dataset SynMix1 to illustrate. From Table 5 and the learned distance hierarchies in Fig. 6, we can observe the data instances in each of the three major groups, i.e., (1, 2, 3), (4, 5, 6), and (7, 8, 9), are close neighbors in the input space. For example, the instances in the major group (1, 2, 3) are all of Management departments and of Carbonated drinks. In the projection maps shown in Fig. 7c to e, the instances from each of the three groups gathered together in the same region, preserving the neighborhood relations of the input data. Moreover, indicated by the background grey level, the instances in each of the three major groups have smaller intra-distances (having light grey in the region) while the instances in different groups have large inter-distances (having dark grey in between major groups). The problem with Fig. 7a and b is again that 1-of-
Statistics of the real-world datasets
#Categorical and majority (%) indicate the number of categorical attributes and the percentage of the majority class, respectively.
Performance by different number of MDISC iterations
Eleven different real-world datasets used in the experiments were taken from the UCI repository [35]. In addition, another version of datasets Adult and Australian Credit Approval (ACA) denoted by pAdult and pACA in which only the attributes significantly correlated to its class attribute were retained [2, 14] are also included.
No. of best performances by different no. of iterations of MDISC on the 13 datasets
No. of best performances by different no. of iterations of MDISC on the 13 datasets
The projection result of dataset pAdult by GMixSOM with (a) 1-of-
The summary of the experimental datasets is shown in Table 7, including the numbers of categorical attributes, numeric attributes, data points, and class labels, the percentage of the majority class, and the entropy of each dataset. For the parameters of the extended SOM, the learning rate decreased linearly,
Figure 10 shows the distance hierarchies learned from the dataset with UDL1, UDL2, and MDISC schemes. Some interesting patterns can be observed from the results. For instance, if we group the values of attribute Relationship to two clusters according to the hierarchy in the right-most of Fig. 10b and c, the clusters are {Wife, Husband} and {Unmarried, Not-in-family, Own-child, Other-relative}, respectively. Similarly, for the attribute Marital-Status in the middle of Fig. 10b and c, we got {Married-civ-spouse, Married-AF-spouse} and {Widowed, Divorced, Separated, Never-married, Married-spouse-absent}.
MDISC performs an iterative process and progressively update dissimilarity degree between pairwise values in individual categorical attributes. It is interesting to investigate how the number of iterations impacts on the dissimilarity degree. Tables 8 and 9 show the performance by using distance hierarchies constructed according to dissimilarity between categorical values measured with different numbers of MDISC iterations. The result indicates one iteration of the process achieved good result. This outcome is consistent with the claim by the authors of the paper [24]. In the experiments, hereafter, for MDISC, dissimilarity between categorical values is calculated with one iteration.
RMSE of different approaches on real-world datasets
Entropy measured in the original datasets and after projection on the maps
Classification accuracy by the various schemes
Figure 11 shows projection results of dataset pAdult. The boundary indicated by the grey levels in Fig. 11a generated by GMixSOM with 1-of-
Figure 11d shows a boundary in the middle. Most of the instances with class label salary
Table 10 depicts detail performance on individual datasets with regard to each approach to handling mixed-type data. As can be seen, MDISC achieved the lowest RMSE while TLDH has the largest.
Table 11 shows in average UDL2 achieved the best performance in reducing entropy in neurons. That is, the instances projected in individual neurons are more of the same class compared to those by the other methods. The 1-of-
Table 12 demonstrates those schemes which used distance hierarchy to represent the relationship between categorical values, namely, TLDH, UDL1, UDL2, and MDISC, outperformed the 1-of-
It is interesting to note that UDL2 outperformed UDL1 in all three measures. In other words, Eq. (8) is better than Eq. (7). The only difference between the two equations is how the conditional probability is measured in the second term of the formulas.
The two datasets pAdult and pACA formed by retaining only the attributes highly correlated to the class attribute achieved better performance than the original datasets in all three measures in all approaches except for the 1-of-
Distance hierarchy which enables coding of dissimilarity degree between categorical values is required by the extended model GMixSOM for handling mixed-type data. In this regard, in the previous studies, a supervised approach requiring existence of a class attribute in the dataset was proposed for learning the hierarchies from the dataset. However, it is not the case that every real-world dataset contains a class attribute. To facilitate the learning of hierarchies in case of no class attribute existing in the dataset, four approaches of unsupervised learning of dissimilarity degree between categorical values are presented and compared on 11 real-world datasets in this study.
The empirical results are summarized as follows. For the internal index quantization error, MDISC scheme achieved the smallest error while TLDH ranked the worst. For external indices entropy and classification accuracy, 1-of-
The advantage of UDL1, UDL2, and MDSIC compared with 1-of-
There are many ideas for measuring similarity of heterogeneous or mixed data in the literature such as those mentioned in [36, 37]. In the future, it would be interesting to integrate other ideas into our model and see whether the performance can be further improved.
In addition to applying the proposed similarity measures of mixed-type data to the extended SOM, those measures can as well be used for clustering and classification problems with mixed-type data. For instance, for clustering we can construct the pairwise distance matrix of the mixed-type dataset by using the proposed measures and then a typical hierarchical clustering algorithm can take the matrix as input and proceed the remaining clustering steps.
Footnotes
Acknowledgments
This work is partially supported by National Science Council, Taiwan under grant NSC 102-2410-H-224-019-MY2 and MOST 105-2410-H-224-007, and by the “Intelligent Recognition Industry Service Center” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.
