Unsupervised distance learning for extended self-organizing map and visualization of mixed-type data

Abstract

The original self-organizing map (SOM) was proposed in the context of processing numeric data. In previous studies, an extended SOM incorporating a data structure, the distance hierarchy, was proposed to facilitate handling of categorical values. The model can take into account the semantics embedded in categorical values via distance hierarchies. In addition to manual construction by domain experts, an approach to learning distance hierarchies from datasets has been devised. However, the approach in the previous study was based on supervised learning, which demands the presence of a class attribute in the dataset. In real-world applications, a class attribute may not be available, in which case the supervised approach is inapplicable. In this article, we present several methods of unsupervised learning of distance hierarchies so that neither a class attribute nor domain experts are required for measuring the degree of similarity between categorical values. We then integrate the learned distance hierarchies with the extended SOM to facilitate its application to datasets without a class attribute. We conduct experiments to verify the feasibility and compare the performance of the proposed unsupervised learning methods.

Keywords

Self-organizing map, mixed-type data, visualization, categorical attribute, unsupervised learning

1. Introduction

Information systems are now commonly adopted in corporations, and data storage is far cheaper than in the early days. Huge amounts of data are therefore easily collected in corporate databases. Analyzing big data to unveil hidden patterns that may be valuable for decision making has been a hot topic in recent years [1]. Nevertheless, real-world data are complicated, and a single dataset usually consists of different types of attributes, including categorical and numeric ones. Analyzing mixed-type data is not straightforward.

Most data analysis algorithms handle only one type of value, either categorical or numeric. When mixed-type data are encountered, a preprocessing step that transforms one type of data into the other is usually performed before analysis. For a numeric attribute, discretization of continuous values is often performed. For a categorical attribute, a commonly adopted technique is 1-of- $k$ coding, which converts an attribute with domain size $k$ to a set of $k$ binary attributes. To transform a record including a categorical value, the binary attribute corresponding to the categorical value is set to one and the other binary attributes to zero. In addition to increased data dimensionality, the 1-of- $k$ coding scheme has another major disadvantage: semantics embedded in categorical values are lost after transformation. Consequently, the topological order of the data in the original set could be altered [2].

Visualization is a convenient tool for data analysis, especially in the early stage of data exploration. The self-organizing map (SOM) [3] has been used for data visualization in many applications. SOM, a type of unsupervised neural network, projects high-dimensional data to a low-dimensional space, typically two-dimensional. Consequently, the high-dimensional data become visible and can be analyzed on the two-dimensional map. Furthermore, SOM as a dimensionality reduction technique possesses a nice characteristic, namely, preservation of topological order in the data: data close to one another in the data space are also near one another on the projection map. Due to this property, SOM has been widely applied to visualized clustering and classification [4, 5, 6, 7, 8, 9, 10, 11] in addition to many other applications [12, 13].

However, the original training algorithm [3] was proposed in the context of numeric data. The 1-of- $k$ coding was usually applied to preprocess the data when categorical attributes were involved in the dataset. Nevertheless, because semantics are lost after 1-of- $k$ transformation, the topological order shown on the projection map is distorted in the sense that the topology does not match the user’s intuition with respect to the categorical values.

To address the problem resulting from 1-of- $k$ transformation, we proposed an extended SOM model [2] which not only allows mixed-type data to be processed directly but also preserves the semantics embedded in categorical values. Specifically, we proposed to incorporate a data structure, the distance hierarchy, into the model. The data structure is exploited to represent the degree of dissimilarity between values in the extended model. Two categorical values which are semantically similar to each other have a smaller distance in the hierarchy and vice versa. A distance hierarchy is a tree structure consisting of nodes connected by links. Each link is associated with a weight representing the dissimilarity between two nodes. The distance between any two values is measured by the total weight of the path between the values. By incorporating distance hierarchies into SOM, the proposed SOM model is able to process categorical values and reflect their semantic similarity on the map.

In the extended SOM [2], distance hierarchies for categorical attributes were constructed by domain experts or taken from existing repositories. However, domain experts are not always available, and distance hierarchies may not be readily available in some fields. In the recent work on MixSOM [14] and GMixSOM [15], an algorithm which constructs distance hierarchies from the data with respect to a class attribute was developed. That is, the distance hierarchies are learned from the training data in a supervised manner. The idea behind the learning algorithm is to analyze how two categorical values are associated with class labels of the class attribute. If two categorical values co-occur with the class labels in a similar way, the values are considered similar. Following this idea, pairwise distances of categorical values in a categorical attribute can be calculated, and the distance hierarchy for that attribute can then be constructed by using agglomerative hierarchical clustering. Nevertheless, the requirement of a class attribute considerably limits the application of MixSOM and the like, since in practice many datasets do not have a class attribute.

The aim of this study is to remove the constraint that the extended SOMs require either a class attribute in the dataset or domain experts to provide distance hierarchies. We therefore present several unsupervised approaches, which do not require a class attribute, to learning distance hierarchies from the dataset and then apply the results to the extended SOM. Projection results obtained using distance hierarchies learned by the proposed approaches are compared on synthetic and real-world datasets.

The contribution of this article is three-fold.

1. We propose four unsupervised distance learning algorithms for measuring dissimilarity between the values in a categorical attribute. In the previous studies, the dissimilarity information was provided by domain experts [2] or learned in a supervised manner [14, 15], namely, requiring the existence of a class attribute in the dataset. This article extends the preliminary paper [16] with three more unsupervised learning algorithms and extensive experiments comparing the different algorithms.
2. We conduct extensive experiments to verify the feasibility of the proposed approaches and experimentally compare the projection results in terms of internal and external indices.
3. This article demonstrates how, when the dataset lacks a class attribute, one can incorporate unsupervised learning schemes for dissimilarity between categorical values in order to utilize self-organizing maps for mixed-type datasets and conduct visualized data analysis.

The paper is organized as follows. Section 2 discusses related work, including the traditional approach to handling categorical attributes and dissimilarity measures for categorical data. Section 3 provides background on the distance hierarchy and the extended SOM. Section 4 presents the proposed unsupervised methods for learning dissimilarity between categorical values from the datasets. In Section 5, we experimentally compare the proposed methods by evaluating projection results of the extended SOM. Conclusions are given in Section 6.
2. Related work

2.1 1-of- $k$ coding and distance hierarchy

In data analysis, a typical technique to handle categorical values is 1-of- $k$ coding [14] which transforms a categorical attribute with domain size $k$ into a set of $k$ binary attributes. Consequently, a categorical value is transformed to a vector of $k$ binary values. When transforming a record with a categorical value, the binary attribute which corresponds to the categorical value is set to one and the others zero. Figure 1 illustrates an example with a toy dataset. Assume the domain of the categorical attribute Drink is {Coke, Pepsi, BlackTea}. The attribute Drink is then transformed to three binary attributes accordingly. The value Coke becomes $<$ 1 0 0 $>$ after the transformation. The distance between categorical values Coke and Pepsi can then be measured in the transformed dataset by their corresponding binary vectors, i.e., $<$ 1 0 0 $>$ and $<$ 0 1 0 $>$ .

Figure 1.

(a) A toy mixed-type dataset, and (b) the transformed dataset by using 1-of- $k$ coding for categorical values.

In addition to increased dimensionality, another major disadvantage of 1-of- $k$ coding is that semantics embedded in the categorical values are lost and consequently the topological order may be altered in the new dataset after transformation. As can be seen, the semantics that Pepsi is more similar to Coke than to BlackTea is not preserved after the transformation if traditional distance measures such as Euclidean or Manhattan distance are used. Over the three binary attributes, any two of the three records differ from each other in exactly two attributes, so any two records yield the same distance. Nevertheless, in the original dataset, the first two records, which are both carbonated drinks, are considered more similar to each other than to the third one.
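The loss of semantics can be checked with a few lines of Python. The sketch below is only illustrative: it encodes the three values of the Drink attribute with 1-of- $k$ coding and measures Euclidean distance on the codes alone; the helper names `one_of_k` and `euclidean` are our own and not part of any cited method.

```python
from itertools import combinations
from math import sqrt

def one_of_k(value, domain):
    """Encode a categorical value as a binary vector over the attribute's domain."""
    return [1 if value == v else 0 for v in domain]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

domain = ["Coke", "Pepsi", "BlackTea"]
codes = {v: one_of_k(v, domain) for v in domain}

# Every pair of distinct values ends up equally far apart,
# so "Pepsi is closer to Coke than to BlackTea" cannot be expressed.
pair_distances = {
    (a, b): euclidean(codes[a], codes[b]) for a, b in combinations(domain, 2)
}
for pair, d in pair_distances.items():
    print(pair, round(d, 4))
```

All three pairs are separated by the same distance, which is the flattening effect described above.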

To address this deficiency, distance hierarchy was proposed [2]. A distance hierarchy is composed of nodes, links, and weights (or link lengths), as shown in Fig. 2. Distance hierarchy can express dissimilarity degree between values. The distance between two values is measured by the path length between the two points representing the two values in the hierarchy. In Fig. 2a, values Coke and Pepsi are represented by the two points at the leaves in the hierarchy. Their distance or path length is two assuming each link has a unit weight. In fact, numeric values can be handled via a distance hierarchy as well. For instance, values 0.2 and $-$ 0.5 are represented by two points as shown in Fig. 2b and their difference is 0.7. For an ordinal attribute, ordinal values can be converted to ordered numeric values and then handled like those of a numeric attribute.

Figure 2.

Distance hierarchies for (a) categorical values and (b) numeric values. The weight of each link is assumed to be one and omitted for clarity.

Formally, a point $P$ in a distance hierarchy is represented by two components: anchor and offset, i.e., $P=(A,d_{P})$ , which together determine a unique position in the hierarchy. Anchor is a label associated with a leaf. In other words, the anchor determines on which path $P$ is located. Offset is a positive real value specifying how far it is from the root to $P$ along the path. For instance, $x_{i}$ in Fig. 2a is represented by (Coke, 2.0).

$P$ is an ancestor of $Q$ if $P$ is on the same path with $Q$ and is in between $Q$ and the root. Two points $P$ and $Q$ are equivalent, denoted as $P\equiv Q$ , if they are at the same position. The lowest common ancestor $\textit{LCA}(P,Q)$ is the lowest common tree node of $P$ and $Q$ . The lowest common point $\textit{LCP}(P,Q)$ is defined by

$\displaystyle\textit{LCP}(P,Q)=\left\{\begin{array}{ll}P\textit{ or }Q&\text{if }P\equiv Q\\ P&\text{if }P\text{ is an ancestor of }Q\\ Q&\text{if }Q\text{ is an ancestor of }P\\ \textit{LCA}(P,Q)&\text{otherwise.}\end{array}\right.$

For instance, assuming $m_{i}$ is in the middle of the link in Fig. 2a, $m_{i}$ can be represented by (Coke, 0.5) or (Pepsi, 0.5) since the two values representing the same location in the hierarchy are equivalent. $m_{i}$ is an ancestor of $x_{i}$ . $\textit{LCA}(m_{i},x_{i})$ is the root. $\textit{LCP}(m_{i},x_{i})$ is $m_{i}$ .

The distance between two points $P$ and $Q$ is measured by the path length between $P$ and $Q$ . Formally, the distance is defined by Eq. (1).

$\displaystyle|P-Q|=d_{P}+d_{Q}-2d_{\textit{LCP}({P},Q)}$ (1)

where $d_{P}$ denotes the offset of $P$ and LCP represents the lowest common point of $P$ and $Q$ . In Fig. 2a, the LCP of points at Coke and Pepsi is the point located at their parent node while the LCP of $x_{i}$ and $m_{i}$ is $m_{i}$ . According to Eq. (1), the distance between $x_{i}$ and $m_{i}$ is 1.5 (or 2.0 $+$ 0.5 $-$ 2*0.5).
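Eq. (1) can be sketched in Python as follows. The hierarchy below is a hypothetical stand-in for Fig. 2a: the internal node names `Carbonated`, `Tea` and `Any` are assumptions made for illustration, and every link is given unit weight. A point is written as an (anchor, offset) pair, and the offset of the lowest common point is taken as the smallest of the two offsets and the offset of the lowest common ancestor of the two anchors, which covers all four cases of the LCP definition.

```python
# Parent links of a hypothetical hierarchy in the spirit of Fig. 2a.
PARENT = {
    "Coke": "Carbonated", "Pepsi": "Carbonated",
    "GreenTea": "Tea", "BlackTea": "Tea",
    "Carbonated": "Any", "Tea": "Any", "Any": None,
}
LINK_WEIGHT = 1.0  # unit weight on every link, as assumed in Fig. 2

def path_to_root(node):
    """Nodes from `node` up to the root, starting with `node` itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def lca_offset(leaf_a, leaf_b):
    """Offset (distance from the root) of the lowest common ancestor of two leaves."""
    ancestors_b = set(path_to_root(leaf_b))
    lca = next(n for n in path_to_root(leaf_a) if n in ancestors_b)
    return LINK_WEIGHT * (len(path_to_root(lca)) - 1)

def distance(p, q):
    """Eq. (1): |P - Q| = d_P + d_Q - 2 * d_LCP(P, Q)."""
    (anchor_p, d_p), (anchor_q, d_q) = p, q
    d_lcp = min(d_p, d_q, lca_offset(anchor_p, anchor_q))
    return d_p + d_q - 2 * d_lcp

x_i = ("Coke", 2.0)
m_i = ("Pepsi", 0.5)
print(distance(x_i, m_i))                          # the worked example above
print(distance(("Coke", 2.0), ("Pepsi", 2.0)))     # two sibling leaves
```

With these inputs the first call reproduces the value 1.5 derived in the text, and the two sibling leaves are separated by a path of length two.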

2.2 Similarity measures for categorical data

Beyond 1-of- $k$ coding, determining the degree of similarity between categorical values in an attribute of a dataset has been much studied, since there is no explicit notion of ordering among categorical values. In [17], the authors conducted an exhaustive comparison of 14 measures for categorical data on 18 datasets in the context of outlier detection. The authors classified the measures into three categories. First, there are measures such as Goodall [18] and Gambaryan [19] which give the minimum value 0 to two mismatched values and give different weights to matches. Second, there are measures such as Eskin [20], Occurrence Frequency (OF), and Inverse Occurrence Frequency (IOF) [21] which give matches the maximum value 1 and give different weights to mismatches. Third, there are measures which give different weights to both matches and mismatches, such as Lin [22] and Smirnov [23]. Their experimental results suggested that there is no single best-performing similarity measure. One needs to understand how a measure handles different characteristics of a categorical dataset.

In [24], the authors presented a similarity measure DISC and compared DISC with the 14 measures studied in [17] on 24 real-world datasets, out of which 12 were used for classification and 12 for regression. Their results indicated DISC outperformed all competing algorithms on almost all datasets. In view of its good performance, we adapted DISC with slight revision to measure dissimilarity between categorical values, referred to as MDISC, in this study and then used the obtained dissimilarity information to construct distance hierarchies needed for the extended SOM.

In a recent paper [25], the authors proposed DILCA and compared DILCA on application to categorical clustering with three clustering algorithms, Delta [26], Rock [27], and Limbo [28], and three measures LIN, OF, and GOODALL3 for categorical data on 16 datasets including 12 from real-world and 4 synthetic. DILCA outperformed the competing approaches on most of the datasets.

Based on the idea of DILCA, which measures dissimilarity between two categorical values with respect to other categorical attributes, we propose a measure referred to as UDL1 in this study. However, unlike DILCA, which works only for datasets with all categorical attributes, UDL1 can work for datasets having both categorical and numeric attributes. In addition, UDL2 is proposed as a revision of UDL1.

3. Distance hierarchy and extended SOM

3.1 Supervised learning of distance hierarchy

In some domains, there are existing concept hierarchies, such as the International Classification of Diseases (ICD; available at http://www.cdc.gov/nchs/icd.htm) in medicine and the ACM Computing Classification System (CCS; available at http://www.acm.org/about/class/) in computer science, which can be used as distance hierarchies by assigning a weight to each link. If no existing hierarchy is ready for use, distance hierarchies can be manually constructed by experts according to domain knowledge. For the case in which neither domain experts nor existing hierarchies are available, we proposed a supervised approach to learning distance hierarchies from the dataset in the previous study [14]. To learn from the dataset, the approach utilized co-occurrence information between feature values of the categorical attribute and class labels of the class attribute [29]. Specifically, the dissimilarity between $y_{i}$ and $y_{j}$ of a categorical attribute $Y$ is defined as follows:

$\displaystyle d(y_{i},y_{j})=\frac{1}{|C|}\sum_{c\in C}|\textit{conf}(y_{i}\Rightarrow c)-\textit{conf}(y_{j}\Rightarrow c)|$ (2)

where $c$ is a class label of class attribute $C$ , $|C|$ is the number of distinct labels in $C$ , and $\textit{conf}(y_{i}\Rightarrow c)$ denotes the confidence, i.e., the ratio of the co-occurrence of $y_{i}$ and $c$ to the occurrence of $y_{i}$ in the dataset. The distance is in the range [0, 1]. If $y_{i}$ and $y_{j}$ have about the same ratio of co-occurrence with each $c$ , the dissimilarity between $y_{i}$ and $y_{j}$ is small; otherwise it is large.
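A minimal sketch of Eq. (2) is given below. The records are invented toy data (a Drink value paired with a hypothetical class label `young` or `old`), and the function name `confidence_distance` is our own; the code only illustrates how the confidences enter the formula.

```python
from collections import Counter

def confidence_distance(records, y_i, y_j):
    """Eq. (2): mean absolute difference of conf(y => c) over all class labels c.

    `records` is a list of (categorical_value, class_label) pairs.
    """
    labels = sorted({c for _, c in records})
    pair_counts = Counter(records)                  # counts of (value, label)
    value_counts = Counter(y for y, _ in records)   # counts of each value

    def conf(y, c):
        return pair_counts[(y, c)] / value_counts[y]

    return sum(abs(conf(y_i, c) - conf(y_j, c)) for c in labels) / len(labels)

# Invented toy data: (Drink, class label).
records = (
    [("Coke", "young")] * 8 + [("Coke", "old")] * 2
    + [("Pepsi", "young")] * 7 + [("Pepsi", "old")] * 3
    + [("BlackTea", "young")] * 2 + [("BlackTea", "old")] * 8
)

d_coke_pepsi = confidence_distance(records, "Coke", "Pepsi")
d_coke_tea = confidence_distance(records, "Coke", "BlackTea")
print(round(d_coke_pepsi, 3), round(d_coke_tea, 3))
```

On this toy data Coke and Pepsi are distributed over the class labels in nearly the same way, so their dissimilarity is much smaller than that between Coke and BlackTea.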

By using Eq. (2), pairwise distances of distinct values in a categorical attribute can be computed and a distance matrix containing pairwise distances between any two categorical values can be obtained. We then apply a bottom-up agglomerative hierarchical clustering algorithm [30] to the matrix and yield the output, a dendrogram as shown in Fig. 3. The tree-structured dendrogram can be used as the distance hierarchy for the categorical attribute. According to our experience, compared with single and complete linkages, average linkage used to measure the distance between two clusters in the agglomerative hierarchical clustering [30] produces the best result.
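The clustering step can be illustrated with a small pure-Python version of average-linkage agglomerative clustering. This is a sketch under our own assumptions rather than the implementation used in the cited work: the pairwise distances are invented, and the merge history (which clusters were joined and at what height) stands in for the dendrogram.

```python
from itertools import combinations

def average_linkage(dist):
    """Agglomerative clustering with average linkage.

    `dist` maps frozenset({a, b}) to the distance between values a and b.
    Returns the merge history as (cluster_1, cluster_2, height) tuples.
    """
    values = sorted({v for pair in dist for v in pair})
    clusters = [frozenset([v]) for v in values]
    merges = []

    def linkage(c1, c2):
        # Average of all pairwise distances between members of the two clusters.
        total = sum(dist[frozenset([a, b])] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

# Invented pairwise distances between three categorical values.
dist = {
    frozenset(["Coke", "Pepsi"]): 0.1,
    frozenset(["Coke", "BlackTea"]): 0.6,
    frozenset(["Pepsi", "BlackTea"]): 0.5,
}
merges = average_linkage(dist)
for c1, c2, height in merges:
    print(sorted(c1), sorted(c2), round(height, 3))
```

The merge heights play the role of node positions in the resulting hierarchy: values that merge early (low height) end up close to each other in the tree.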

Figure 3.

Dendrogram generated by an agglomerative hierarchical clustering algorithm with a pairwise distance matrix can be used as a distance hierarchy.

3.2 Extended SOM for mixed-type data

The original training algorithm of SOM was applicable only for numeric data. Rather than using 1-of- $k$ coding, we tackled this issue by integrating distance hierarchies to the model in our previous studies.

GMixSOM [15] is a growing self-organizing map for mixed-type data, which can project high-dimensional data mixed with categorical and numeric attributes onto a low-dimensional space for visual inspection. Unlike the original SOM, GMixSOM handles categorical attributes directly and preserves semantic similarity between categorical values via distance hierarchies.

In GMixSOM, each attribute is associated with a distance hierarchy. During training, the distance between an input datum $x$ and the prototype $m$ of a neuron is measured as follows: (1) map each attribute value $x_{i}$ of the datum and each element $m_{i}$ of the prototype to its associated distance hierarchy $dh_{i}$ , (2) calculate the distance between the mapping points in $dh_{i}$ , and (3) aggregate the pairwise distances between $x$ and $m$ . The process can be expressed by Eq. (3).

$\displaystyle d(x,m)=\left(\sum_{i}(dh_{i}(x_{i})-dh_{i}(m_{i}))^{2}\right)^{1/2}$ (3)

where $x$ and $m$ represent the input and the prototype of a neuron, respectively, and $dh_{i}(\cdot)$ maps the value of attribute or component $i$ to a point in its associated distance hierarchy. Figure 2a illustrates how the $i^{\text{th}}$ attribute values of $x$ and $m$ are represented by two points in the hierarchy.
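As a rough sketch of Eq. (3), the code below aggregates one categorical and one numeric component. The lookup table of lowest-common-ancestor offsets assumes the unit-weight hierarchy of Fig. 2a restricted to three leaves, and the function names are our own; the numeric component simply uses the absolute difference, as in Fig. 2b.

```python
from math import sqrt

LEAF_DEPTH = 2.0  # offset of every leaf, assuming unit link weights
# Offset of the lowest common ancestor for each pair of distinct leaves (assumed).
LCA_OFFSET = {
    frozenset(["Coke", "Pepsi"]): 1.0,
    frozenset(["Coke", "BlackTea"]): 0.0,
    frozenset(["Pepsi", "BlackTea"]): 0.0,
}

def categorical_distance(p, q):
    """Path length between two (anchor, offset) points, following Eq. (1)."""
    (anchor_p, d_p), (anchor_q, d_q) = p, q
    if anchor_p == anchor_q:
        shared = LEAF_DEPTH
    else:
        shared = LCA_OFFSET[frozenset([anchor_p, anchor_q])]
    return d_p + d_q - 2 * min(d_p, d_q, shared)

def numeric_distance(a, b):
    """Distance between two numeric values."""
    return abs(a - b)

def mixed_distance(x, m, attribute_distances):
    """Eq. (3): Euclidean aggregation of the per-attribute hierarchy distances."""
    return sqrt(sum(
        dist(x_i, m_i) ** 2
        for dist, x_i, m_i in zip(attribute_distances, x, m)
    ))

x = (("Coke", 2.0), 0.2)     # input: one categorical and one numeric component
m = (("Pepsi", 0.5), -0.5)   # prototype
d = mixed_distance(x, m, [categorical_distance, numeric_distance])
print(round(d, 4))
```

Here the categorical component contributes 1.5 and the numeric component 0.7, which are then combined under the square root.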

In the step of identifying the best matching unit (BMU) with respect to an input $x$ , the neuron whose prototype has the minimal distance to $x$ , as measured by Eq. (3), is selected as the BMU. The adaptation of a prototype is also performed with regard to the distance hierarchies. In essence, the point representing a component of the prototype moves toward the point representing the corresponding attribute value of the input. Consequently, the prototype becomes closer or more similar to the input.

We use the example in Fig. 2a to illustrate the process. Assume $m_{i}=$ (Pepsi, 0.5), $x_{i}=$ (Coke, 2.0), and the amount of adaptation is 0.6 which came from the difference between $m_{i}$ and $x_{i}$ multiplied by the neighborhood function and the learning rate at the time. To update the prototype component means $m_{i}$ shall move toward the input component $x_{i}$ by 0.6. For the movement, $m_{i}$ passes the lowest common point of the anchors of $m_{i}$ and $x_{i}$ , which is the parent node of nodes Coke and Pepsi. Therefore, after the update, the new $m_{i}$ becomes (Coke, 1.1). Note that the anchor of $m_{i}$ is changed from Pepsi to Coke and the length of a link is assumed to be 1.

As can be seen in the previous illustration, the position of the lowest common tree node of the two leaves in the hierarchy is required for determining whether or not $m_{i}$ passes the common tree node after update [14]. To facilitate the determination, a matrix referred to as common point matrix or CPM recording the distance between the root and the parent-node position is constructed. Table 1 shows the CPM corresponding to the distance hierarchy in Fig. 2a. The entries in the diagonal are set to 2 which is the distance from a leaf to the root. From Table 1, we know the lowest common tree node of Coke and Pepsi is at the location away from the root by one. After the update of the prototype component, the offset of $m_{i}$ changes from 0.5 to 1.1, passing the lowest common tree node. Therefore, the anchor of $m_{i}$ shall be Coke after the update.

Table 1

Common Point Matrix (CPM) corresponding to the distance hierarchy in Fig. 2a

	Coke	Pepsi	GreenTea	BlackTea
Coke	2	1	0	0
Pepsi	1	2	0	0
GreenTea	0	0	2	1
BlackTea	0	0	1	2
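The update of a prototype component with the help of the CPM can be sketched as below. The matrix values are those of Table 1; the function `move_toward` and its handling of the path (climb toward the lowest common point, then descend toward the input) are our own reading of the description, not code from the cited work.

```python
# Common point matrix from Table 1: offset of the lowest common tree node
# for each pair of leaves (the diagonal holds the leaf depth).
LEAVES = ["Coke", "Pepsi", "GreenTea", "BlackTea"]
ROWS = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
CPM = {leaf: dict(zip(LEAVES, row)) for leaf, row in zip(LEAVES, ROWS)}

def move_toward(m, x, amount, cpm=CPM):
    """Move prototype point m toward input point x by `amount` along the tree path."""
    (anchor_m, d_m), (anchor_x, d_x) = m, x
    common = cpm[anchor_m][anchor_x]       # offset of the lowest common tree node
    turning = min(d_m, d_x, common)        # offset of the lowest common point
    up, down = d_m - turning, d_x - turning
    amount = min(amount, up + down)        # never move past the input
    if amount <= up:                       # still climbing on m's own path
        return (anchor_m, d_m - amount)
    new_offset = turning + (amount - up)   # descending on x's path
    # The anchor changes only once the point has passed the common tree node.
    new_anchor = anchor_x if new_offset > common else anchor_m
    return (new_anchor, new_offset)

anchor, offset = move_toward(("Pepsi", 0.5), ("Coke", 2.0), 0.6)
print(anchor, round(offset, 2))
```

For the example in the text, the point starts at offset 0.5, passes the common node at offset 1, and ends on the Coke branch at offset 1.1. With a smaller adaptation such as 0.3 the point would stay above the common node and keep its Pepsi anchor.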

The process to construct the CPM for a categorical attribute is summarized in Algorithm 1. Note the approaches UDL1, UDL2, and MDISC for computing pairwise dissimilarity degree between categorical values will be introduced in the next section. The training algorithm of the extended model GMixSOM [15] is depicted in Algorithm 2.

Algorithm 1: Construct a common point matrix
Input: A mixed-type dataset $X$ and a specified categorical attribute $X_{c}$
Output: Common point matrix $\textit{CPM}_{c}$ for $X_{c}$
Begin
1. Compute the pairwise distance matrix $D$ for domain values of $X_{c}$ by one of the approaches, i.e., UDL1, UDL2 and MDISC.
2. Construct dendrogram $h_{c}$ as the distance hierarchy from $D$ by using agglomerative hierarchical clustering algorithm with average linkage.
3. Convert $h_{c}$ to a common point matrix $\textit{CPM}_{c}$ for $X_{c}$
End

The training of the extended SOM mainly consists of two phases. In the initialization phase, the map contains five neurons in a cross formation with random weights, as shown in Fig. 4a. The growing threshold $GT$ is determined according to the spreading factor $SF$ and the data dimensionality $n$ : $GT=-n\times\ln(SF)$ [31].
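The growing threshold is a one-line computation. The sketch below assumes that the spreading factor lies strictly between 0 and 1, which is the usual convention for growing SOMs but is not stated explicitly here; the function name is our own.

```python
from math import log

def growing_threshold(n_dims, spreading_factor):
    """GT = -n * ln(SF) for an n-dimensional dataset."""
    # Assumption: SF is taken from the open interval (0, 1).
    if not 0.0 < spreading_factor < 1.0:
        raise ValueError("spreading factor is assumed to lie in (0, 1)")
    return -n_dims * log(spreading_factor)

for sf in (0.1, 0.5, 0.9):
    print(sf, round(growing_threshold(10, sf), 3))
```

A smaller spreading factor gives a larger threshold, so neurons must accumulate more error before the map grows.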

For each training instance, the BMU is identified and its weight is updated, as are the weights of the BMU’s neighbors. Specifically, the BMU with respect to an input $x$ is identified according to Eq. (4) and a neuron is updated at time $t$ according to Eq. (5) where $w_{i}$ represents the weight of neuron $i$ , $\alpha_{e}$ the learning rate which decreases along with the epoch number $e$ , and $h_{\textit{BMU},i}$ the Gaussian neighborhood function [15].

$\displaystyle\textit{BMU}=\arg\min_{i}\|x-w_{i}\|$ (4)

$\displaystyle w^{t+1}_{i}=w^{t}_{i}+\alpha_{e}h_{\textit{BMU},i}(x-w^{t}_{i})$ (5)

If the accumulated error of the BMU exceeds the growing threshold GT and the BMU is at the border of the current map, a new neuron is inserted or otherwise the error is rippled outward to its neighbors. To ripple out the error, the error of the BMU becomes a half and the immediate neighbors of the BMU equally share the amount of the other half.
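The grow-or-ripple decision can be sketched as follows. The map is represented as a dictionary from grid positions to accumulated errors, and a neuron is treated as being on the border when at least one of its four adjacent grid positions is free; both choices are assumptions made for this illustration, and the actual insertion of a neuron is left to the caller.

```python
def immediate_neighbors(position, neurons):
    """Existing neurons directly above, below, left and right of `position`."""
    row, col = position
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [p for p in candidates if p in neurons]

def handle_bmu_error(errors, bmu, growing_threshold):
    """Decide what happens after the BMU's error has been accumulated.

    Returns "none", "grow" (the caller should insert a neuron) or "ripple".
    """
    if errors[bmu] <= growing_threshold:
        return "none"
    neighbors = immediate_neighbors(bmu, errors)
    if len(neighbors) < 4:             # a free adjacent position: BMU is on the border
        return "grow"
    half = errors[bmu] / 2.0
    errors[bmu] = half                 # the BMU keeps one half of its error
    for n in neighbors:                # the neighbors share the other half equally
        errors[n] += half / len(neighbors)
    return "ripple"

# Initial cross of five neurons, keyed by grid position; only the center is interior.
errors = {(0, 1): 0.0, (1, 0): 0.0, (1, 1): 8.0, (1, 2): 0.0, (2, 1): 0.0}
action = handle_bmu_error(errors, bmu=(1, 1), growing_threshold=5.0)
print(action, errors)
```

In this toy run the center neuron keeps an error of 4 and each of its four neighbors receives 1, so the total error on the map is unchanged by the rippling step.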

Algorithm 2: Training of a GMixSOM
Input: An $n$ -dimensional dataset $DS$ , a set of $n$ common point matrices $\{\textit{CPM}\}$ , the number of training epochs $E$ , and spreading factor $SF$ .
Output: A trained GMixSOM
// Initialization phase
Initialize five neurons in a cross with random weights;
Determine growing threshold GT according to SF;
// Growing phase
For each training epoch
    Reset the error of each neuron;
    Initialize the learning rate and the neighborhood size;
    For each input $x$ in $DS$
        Identify the best matching unit (BMU) of $x$ ;
        Update the prototypes of the BMU and its neighbors;
        Calculate the accumulated error of the BMU;
        If the error of the BMU is too large, insert a neuron when the BMU is at the border, or otherwise ripple the error outward to its neighbors;
    Repeat till all inputs have been presented
Repeat till the epoch number equals $E$

Figure 4.

(a) The initial map of GMixSOM, (b) for the input $x$ , $z_{2}$ will be the position for inserting the new neuron, and (c) for the input $x$ , $z_{3}$ will be the position for the new neuron.

When a new neuron is inserted, the position where the prototypes of its neighboring neurons are most similar to the input is chosen. For the example shown in Fig. 4b, assume $x$ is the current input and $v_{1}$ , $v_{2}$ and $v_{3}$ are the first, the second and the third BMU of $x$ . Positions $z_{1}$ , $z_{2}$ and $z_{3}$ are available around $v_{1}$ for inserting a new neuron. Since $z_{2}$ is closer to the second and the third BMU of $x$ , $z_{2}$ will be the chosen location for the new neuron. In the case of Fig. 4c, $v_{1}$ , $v_{2}$ and $v_{3}$ are the first, the second and the third BMU of input $x$ . $z_{1}$ , $z_{2}$ and $z_{3}$ are available around $v_{1}$ for inserting a new neuron and all have the same distance to $v_{1}$ . $z_{1}$ and $z_{3}$ have the same distance to the second BMU $v_{2}$ and $z_{3}$ is closer to the third BMU $v_{3}$ . Therefore, $z_{3}$ will be the chosen location.

Formally, the location of new neuron $z_{\textit{new}}(t)$ is determined by

$\displaystyle z_{\textit{new}}(t)=\left\{\begin{array}{ll}\arg\min_{z_{i}(t)}\{\Delta_{v_{2},z_{i}(t)}\mid z_{i}(t)\in N_{v_{1}}\},&\text{if }\Delta_{v_{2},z_{i}(t)}\neq\Delta_{v_{2},z_{j}(t)},\ i\neq j\\ \arg\min_{z_{i}(t)}\{\Delta_{v_{3},z_{i}(t)}\mid z_{i}(t)\in N_{v_{1}}\},&\text{if }\Delta_{v_{2},z_{i}(t)}=\Delta_{v_{2},z_{j}(t)},\ i\neq j\end{array}\right.$ (6)

where $v_{1}$ , $v_{2}$ and $v_{3}$ are the first, the second and the third BMU of input $x$ , $z_{i}(t)$ is an eligible location for the new neuron, $\Delta_{v_{2},z_{i}(t)}$ and $\Delta_{v_{3},z_{i}(t)}$ are the distances from $z_{i}(t)$ to $v_{2}$ and $v_{3}$ in the map space, respectively, and $N_{v_{1}}$ is the set of neighbors of $v_{1}$ on the map. Essentially, Eq. (6) indicates that if more than one location has the same distance to the second BMU, the location which is closest to the third BMU is chosen as the location for inserting the new neuron, and otherwise the location which is closest to the second BMU is chosen.
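The selection rule can be sketched in a few lines. Two assumptions are made here that the text does not spell out: distances in the map space are taken to be Euclidean distances between grid coordinates, and the tie condition is read as applying to the candidate locations that are jointly closest to the second BMU. The grid coordinates in the usage example are invented.

```python
from math import dist

def choose_new_position(candidates, v2, v3, tol=1e-9):
    """Pick the free location around the BMU for the new neuron.

    Prefer the candidate closest to the second BMU v2; if several candidates
    are equally close to v2, prefer among them the one closest to the third BMU v3.
    """
    best_to_v2 = min(dist(z, v2) for z in candidates)
    tied = [z for z in candidates if dist(z, v2) - best_to_v2 <= tol]
    if len(tied) == 1:
        return tied[0]
    return min(tied, key=lambda z: dist(z, v3))

# Invented layout: the BMU v1 sits at (1, 1) with three free adjacent positions.
candidates = [(0, 1), (1, 0), (2, 1)]
print(choose_new_position(candidates, v2=(1, 2), v3=(2, 2)))  # tie on v2, v3 decides
print(choose_new_position(candidates, v2=(0, 2), v3=(2, 2)))  # v2 alone decides
```

In the first call two candidates are equally close to the second BMU, so the third BMU breaks the tie, mirroring the situation of Fig. 4c.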

4. Unsupervised learning of dissimilarity

The training algorithm for the extended SOM requires a distance hierarchy associated with each categorical attribute. We propose four unsupervised methods to construct the distance hierarchy when a class attribute is not available.

The first one is trivial: a two-level distance hierarchy. The other three learn pairwise dissimilarity between categorical values from the dataset and then construct the distance hierarchy by applying an agglomerative clustering algorithm to the pairwise dissimilarity matrix. The idea of learning pairwise dissimilarity between categorical values is based on the following.

Assume A and B are two distinct values of a categorical attribute. A and B are deemed to be similar or have a small distance if the way of A co-occurring with the values in the other feature attributes is very similar to that of B co-occurring with those values.

In fact, the idea behind the supervised approach [14] is analogous to this one. The difference is that co-occurrence is computed with respect to the class attribute only. In contrast, the unsupervised approaches compute co-occurrence with regard to the other feature attributes, including categorical and numeric attributes since the class attribute is not available. Based on the idea, we present three unsupervised methods in addition to the two-level distance hierarchy.

4.1 Two-level distance hierarchy

The first approach is to use a two-level distance hierarchy (TLDH) for each categorical attribute. A two-level distance hierarchy consists of a root and leaf nodes with each link weight set to 0.5, as shown in Fig. 5. All categorical values of the attribute are associated with the leaf nodes of the hierarchy. As a result, two distinct values have a dissimilarity of 1 and the same values have a dissimilarity of 0.

Figure 5.

Two-level distance hierarchy for a categorical attribute.

4.2 UDL1

The second approach is inspired by DILCA or DIstance Learning for Categorical Attributes [25], an algorithm for computing distance between any pair of values in a specified categorical attribute with respect to other attributes referred to as context attributes. However, DILCA took only categorical attributes as context attributes and did not consider numeric attributes. We devise a new formula which takes into account not only categorical attributes but also numeric attributes in this study.

The new unsupervised distance learning method, denoted by UDL1, measures the distance between two categorical values of a target attribute $Y$ by using the information of context attributes $X$ . A context attribute can be categorical or numeric. For a categorical context attribute $X_{c}\in X$ , the distance between $y_{i}$ and $y_{j}$ of $Y$ with respect to $X_{c}$ is dependent on the difference between conditional probabilities of $y_{i}$ and $y_{j}$ given a value $x_{k}$ in $X_{c}$ . The larger the difference, the larger the distance. For a numeric context attribute $X_{n}$ , the distance is dependent on the difference between the averages of the values in $X_{n}$ that co-occur with $y_{i}$ and $y_{j}$ , respectively. Likewise, the larger the difference, the larger the distance. With consideration of all context attributes, the distance between $y_{i}$ and $y_{j}$ of a categorical attribute $Y$ is defined as follows.

$\displaystyle d(y_{i},y_{j})=\frac{1}{|X|}\left(\sum_{X_{n}\in X}\frac{\left|\textit{Avg}(X_{n,i})-\textit{Avg}(X_{n,j})\right|}{\textit{Max}(X_{n})-\textit{Min}(X_{n})}+\sum_{X_{c}\in X}\sqrt{\frac{\sum_{x_{k}\in X_{c}}(p(y_{i}|x_{k})-p(y_{j}|x_{k}))^{2}}{|X_{c}|}}\right)$ (7)

where $|X|$ denotes the number of context attributes. $\textit{Avg}(X_{n,i})$ is the average of the numeric values co-occurring with $y_{i}$ in numeric attribute $X_{n}$ , i.e., $\textit{Avg}(X_{n,i})=(\sum_{t.Y=y_{i}}{t.X_{n}})/m$ where $m$ is the number of instances $t$ which have attribute value $t.Y=y_{i}$ . The difference between the maximal and the minimal value in $X_{n}$ is used to normalize the value to the range [0, 1]. $x_{k}$ is a value in categorical attribute $X_{c}$ , and $|X_{c}|$ denotes the number of distinct values in $X_{c}$ .
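A minimal sketch of the UDL1 distance with one numeric and one categorical context attribute is given below. The dataset, the attribute names `Sex` and `Score`, and the function name `udl1` are invented for illustration, and treating a constant numeric attribute as contributing zero is our own assumption to avoid division by zero.

```python
from math import sqrt

def udl1(rows, target, y_i, y_j, numeric_attrs, categorical_attrs):
    """Distance between target values y_i and y_j over all context attributes."""
    rows_i = [r for r in rows if r[target] == y_i]
    rows_j = [r for r in rows if r[target] == y_j]
    total = 0.0

    for attr in numeric_attrs:
        values = [r[attr] for r in rows]
        value_range = max(values) - min(values)
        avg_i = sum(r[attr] for r in rows_i) / len(rows_i)
        avg_j = sum(r[attr] for r in rows_j) / len(rows_j)
        # Assumption: a constant attribute (zero range) contributes nothing.
        total += abs(avg_i - avg_j) / value_range if value_range else 0.0

    for attr in categorical_attrs:
        domain = sorted({r[attr] for r in rows})
        squared = 0.0
        for x_k in domain:
            n_k = sum(1 for r in rows if r[attr] == x_k)
            p_i = sum(1 for r in rows_i if r[attr] == x_k) / n_k  # p(y_i | x_k)
            p_j = sum(1 for r in rows_j if r[attr] == x_k) / n_k  # p(y_j | x_k)
            squared += (p_i - p_j) ** 2
        total += sqrt(squared / len(domain))

    return total / (len(numeric_attrs) + len(categorical_attrs))

# Invented toy dataset with target attribute Y and two context attributes.
rows = (
    [{"Y": "a", "Sex": "M", "Score": 10.0}] * 6
    + [{"Y": "a", "Sex": "F", "Score": 20.0}] * 3
    + [{"Y": "b", "Sex": "M", "Score": 30.0}] * 2
    + [{"Y": "b", "Sex": "F", "Score": 30.0}] * 1
)
d_ab = udl1(rows, "Y", "a", "b", numeric_attrs=["Score"], categorical_attrs=["Sex"])
print(round(d_ab, 4))
```

In this toy case the numeric attribute contributes 5/6 and the categorical attribute 1/2, and the result is their average over the two context attributes.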

Table 2

Co-occurrence of categorical values $a$ and $b$ with $M$ and $F$

	$M$	$F$
$a$	6	3
$b$	2	1

4.3 UDL2

In Eq. (7), the conditional probability is calculated by the ratio of a target value $y_{i}$ co-occurring with $x_{k}$ with respect to a feature value $x_{k}$ , i.e., $p(y_{i}|x_{k})$ . With regard to Eq. (2), we can take another perspective to measure the conditional probability which is the ratio of the target value co-occurring with the feature value with respect to a target value $y_{i}$ , i.e., $p(x_{k}|y_{i})$ . According to this perspective, we devise another formula slightly different from Eq. (7).

$\displaystyle d(y_{i},y_{j})=\frac{1}{|X|}\left(\sum_{X_{n}\in X}\frac{\left|\textit{Avg}(X_{n,i})-\textit{Avg}(X_{n,j})\right|}{\textit{Max}(X_{n})-\textit{Min}(X_{n})}+\sum_{X_{c}\in X}\sqrt{\frac{\sum_{x_{k}\in X_{c}}(p(x_{k}|y_{i})-p(x_{k}|y_{j}))^{2}}{|X_{c}|}}\right)$ (8)

Empirical results shown in the experimental section indicate that UDL2 outperformed UDL1 in the application to the extended SOM. We use a simple example to demonstrate the difference between Eqs (7) and (8). Assume that we want to measure the distance between target values $a$ and $b$ with respect to a context attribute that has only two distinct feature values, $M$ and $F$. The frequencies with which the target values $a$ and $b$ co-occur with the feature values $M$ and $F$ are shown in Table 2. Both $a$ and $b$ co-occur with $M$ twice as often as with $F$, so $p(M|a)=p(M|b)=2/3$ and $p(F|a)=p(F|b)=1/3$, and the distance by UDL2 (using Eq. (8)) is 0. In contrast, $p(a|x_{k})=0.75$ and $p(b|x_{k})=0.25$ for both feature values, so the distance by UDL1 (using Eq. (7)) is 0.5.
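The Table 2 comparison can be checked with a short script. The following is a minimal sketch, assuming a single categorical context attribute (so $|X|=1$ and there is no numeric term); the function name `categorical_term` and the nested-dictionary layout of the counts are illustrative choices, not part of the original method description.

```python
from math import sqrt

# Co-occurrence counts from Table 2: counts[target value][feature value]
counts = {"a": {"M": 6, "F": 3}, "b": {"M": 2, "F": 1}}

def categorical_term(counts, yi, yj, condition_on_feature):
    """Root-mean-square difference of conditional probabilities over feature values."""
    features = sorted(counts[yi])
    total = 0.0
    for x in features:
        if condition_on_feature:
            # Eq. (7): p(y | x) = n(y, x) / n(x)
            n_x = sum(row[x] for row in counts.values())
            p_i, p_j = counts[yi][x] / n_x, counts[yj][x] / n_x
        else:
            # Eq. (8): p(x | y) = n(y, x) / n(y)
            p_i = counts[yi][x] / sum(counts[yi].values())
            p_j = counts[yj][x] / sum(counts[yj].values())
        total += (p_i - p_j) ** 2
    return sqrt(total / len(features))

udl1 = categorical_term(counts, "a", "b", condition_on_feature=True)
udl2 = categorical_term(counts, "a", "b", condition_on_feature=False)
print(round(udl1, 3), round(udl2, 3))  # 0.5 0.0
```

The only difference between the two branches is the denominator of the conditional probability, which is exactly the difference between Eqs (7) and (8).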

4.4 MDISC

The fourth method is adapted from DISC [24], which measures similarity between categorical values. This method differs from UDL1 and UDL2 in that, when the context attribute is categorical, the dissimilarity between $y_{i}$ and $y_{j}$ depends on their cosine similarity with respect to the categorical attribute rather than on the difference of co-occurrences. Based on this idea, we modify the formula for measuring dissimilarity between categorical values as follows and refer to the result as MDISC.

$\displaystyle d(y_{i},y_{j})=\frac{1}{|X|}\left(\sum_{X_{n}\in X}\frac{\left|\textit{Avg}(X_{n,i})-\textit{Avg}(X_{n,j})\right|}{\textit{Max}(X_{n})-\textit{Min}(X_{n})}+\sum_{X_{c}\in X}(1-CS(y_{i}:y_{j},X_{c}))\right)$ (9)

where

$\displaystyle CS(y_{i}:y_{j},X_{c})=\frac{\sum_{x_{l},x_{k}\in X_{c}}n(y_{i},x_{l})*n(y_{j},x_{k})*\textit{Sim}(x_{l},x_{k})}{\textit{NormalVector}1*\textit{NormalVector}2}$

$\displaystyle\textit{NormalVector}1={\left(\sum_{x_{l},x_{k}\in X_{c}}n(y_{i},x_{l})*n(y_{i},x_{k})*\textit{Sim}(x_{l},x_{k})\right)}^{1/2}$

$\displaystyle\textit{NormalVector}2={\left(\sum_{x_{l},x_{k}\in X_{c}}n(y_{j},x_{l})*n(y_{j},x_{k})*\textit{Sim}(x_{l},x_{k})\right)}^{1/2}$

$d(y_{i},y_{j})$ is calculated as the average of the dissimilarity with respect to each context attribute. For a numeric context attribute $X_{n}$, as in Eqs (7) and (8), the distance depends on the difference between the averages of the values in $X_{n}$ that co-occur with $y_{i}$ and with $y_{j}$, respectively. For a categorical context attribute $X_{c}$, $CS(y_{i}:y_{j},X_{c})$ denotes the cosine similarity between $y_{i}$ and $y_{j}$ with respect to $X_{c}$. $n(y_{i},x_{l})$ returns the number of co-occurrences of $y_{i}$ with the value $x_{l}$ of $X_{c}$, i.e., $n(y_{i},x_{l})=|\{t|t.Y=y_{i},t.X_{c}=x_{l},t\in D\}|$. Note that $d(y_{i},y_{j})$ is computed iteratively. Initially, $\textit{Sim}(x_{l},x_{k})$ is set to 1 if $x_{l}=x_{k}$ and to 0 otherwise. After one iteration, $\textit{Sim}(x_{l},x_{k})$ can be greater than 0 even when $x_{l}\neq x_{k}$.

Table 3

A toy mixed-type dataset

A1	A2	A3	Class
a	50	C	2
a	20	B	1
a	80	A	1
b	60	D	2
b	65	B	2
b	90	A	1
c	30	D	2
c	40	C	2
c	50	B	2

4.5 An example

We have introduced five unsupervised approaches, namely 1-of-$k$ coding, TLDH, UDL1, UDL2, and MDISC, for measuring the dissimilarity between categorical values of an attribute. We use an example to illustrate the calculation of the dissimilarity. Table 3 presents a toy dataset consisting of three feature attributes and one class attribute. Assume that $A_{1}$ is the target attribute, i.e., we want to measure the dissimilarity between the categorical values in its domain {a, b, c}.

By 1-of- $k$ , $A_{1}$ will be transformed to a list of three binary attributes, namely, $<$ a, b, c $>$ . The $A_{1}$ value of the records with $A_{1}=a$ is therefore transformed to $<$ 1, 0, 0 $>$ while that with $b$ is accordingly transformed to $<$ 0, 1, 0 $>$ . The dissimilarity can then be measured by Euclidean distance. Thus, $d(a,b)$ is $\sqrt{2}$ . By TLDH, two distinct values yield dissimilarity of one, namely, $d(a,b)=$ 1 while two identical values yield zero.

For the unsupervised algorithms UDL1, UDL2, and MDISC, $A_{2}$ and $A_{3}$ are the context attributes. For UDL1, according to Eq. (7), the distance between a and b is $d(a,b)=0.5(D_{2}/(90-20)+\sqrt{D_{3}/4})$, where $D_{2}=|\textit{Avg}(A_{2,a})-\textit{Avg}(A_{2,b})|=|50-71.66|=21.66$ and $D_{3}={(p(a|A)-p(b|A))}^{2}+{(p(a|B)-p(b|B))}^{2}+{(p(a|C)-p(b|C))}^{2}+{(p(a|D)-p(b|D))}^{2}=0+0+\frac{1}{4}+\frac{1}{4}=\frac{1}{2}$. Note that the averages in $A_{2}$ with respect to values $a$ and $b$ in $A_{1}$ are $\textit{Avg}(A_{2,a})=50$ and $\textit{Avg}(A_{2,b})=71.66$, respectively. Therefore, the distance is $d(a,b)=0.5(21.66/70+\sqrt{0.5/4})=0.331$.

UDL2 differs from UDL1 only in how $D_{3}$ is calculated. Therefore, we recompute the value of $D_{3}$: $D_{3}={(p(A|a)-p(A|b))}^{2}+{(p(B|a)-p(B|b))}^{2}+{(p(C|a)-p(C|b))}^{2}+{(p(D|a)-p(D|b))}^{2}=0+0+\frac{1}{9}+\frac{1}{9}=\frac{2}{9}=0.22$. The distance by UDL2 is $d(a,b)=0.5(21.66/70+\sqrt{0.22/4})=0.272$.

For MDISC, the dissimilarity between $a$ and $b$ is $d(a,b)=0.5(D_{2}/(90-20)+(1-CS(a:b,A_{3})))$, where $CS(a:b,A_{3})=\frac{1+1}{\sqrt{3}\sqrt{3}}=0.667$. In the example, $n(a,A)=1$ since only a single instance with ${A}_{1}=a$ co-occurs with $A_{3}=A$, and $\textit{Sim}(P,Q)=1$ if $P=Q$ and 0 otherwise in the first iteration. Finally, for the first iteration, $d(a,b)=0.5(21.66/70+(1-0.667))=0.321$.

By UDL1, $d(a,c)=0.5(D_{2}/(90-20)+\sqrt{D_{3}/4})$, where $D_{2}=|\textit{Avg}(A_{2,a})-\textit{Avg}(A_{2,c})|=|50-40|=10$ and $D_{3}={(p(a|A)-p(c|A))}^{2}+{(p(a|B)-p(c|B))}^{2}+{(p(a|C)-p(c|C))}^{2}+{(p(a|D)-p(c|D))}^{2}=\frac{1}{4}+0+0+\frac{1}{4}=\frac{1}{2}$. Note that the averages in $A_{2}$ with respect to values $a$ and $c$ in $A_{1}$ are $\textit{Avg}(A_{2,a})=50$ and $\textit{Avg}(A_{2,c})=40$, respectively. Therefore, the distance is $d(a,c)=0.5(10/70+\sqrt{0.5/4})=0.248$.

By UDL1, $d(b,c)=0.5(D_{2}/(90-20)+\sqrt{D_{3}/4})$, where $D_{2}=|\textit{Avg}(A_{2,b})-\textit{Avg}(A_{2,c})|=|71.66-40|=31.66$ and $D_{3}={(p(b|A)-p(c|A))}^{2}+{(p(b|B)-p(c|B))}^{2}+{(p(b|C)-p(c|C))}^{2}+{(p(b|D)-p(c|D))}^{2}=\frac{1}{4}+0+\frac{1}{4}+0=\frac{1}{2}$. Note that the averages in $A_{2}$ with respect to values $b$ and $c$ in $A_{1}$ are $\textit{Avg}(A_{2,b})=71.66$ and $\textit{Avg}(A_{2,c})=40$, respectively. Therefore, the distance is $d(b,c)=0.5(31.66/70+\sqrt{0.5/4})=0.402$.

Note that UDL2 measures the similarity between two values of attribute $Y$ from the probability conditioned on the values of the target attribute, i.e., $p(x|y)$, rather than on the values of the context attribute as in UDL1, i.e., $p(y|x)$. In other words, the more similar the distributions of two categorical values over the values of the context attributes, the more similar the two values are. If $a$ and $b$ have about the same probability distribution with respect to the values of the context attributes, $a$ and $b$ are similar.

Table 4
Dissimilarity between a, b, and c

	1-of- $k$	TLDH	UDL1	UDL2	MDISC
$d(a,b)$	$\sqrt{2}$	1	0.331	0.272	0.321
$d(a,c)$	$\sqrt{2}$	1	0.248	0.189	0.238
$d(b,c)$	$\sqrt{2}$	1	0.402	0.344	0.392

Table 4 summarizes the pairwise distances among $a$, $b$, and $c$ by the different methods. Note that the dissimilarity by MDISC here is the result of one iteration. By the unsupervised methods UDL1, UDL2, and MDISC, $d(b,c)$ is the largest distance and $d(a,c)$ is the smallest. From Table 3, we can observe that any two of the value sets of $A_{3}$ associated with $a$, $b$, and $c$ have two values in common. Thus, attribute $A_{3}$ does not contribute much to distinguishing $a$, $b$, and $c$. For $A_{2}$, the value set associated with $a$ is in general closer to that associated with $c$, and the sets associated with $b$ and with $c$ are the most dissimilar pair. In fact, the averages of the value sets associated with $a$, $b$, and $c$ are 50, 71.66, and 40, respectively.
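The UDL1, UDL2, and MDISC columns of Table 4 can be recomputed from the toy dataset of Table 3. The sketch below is illustrative only: the helper names (`numeric_term`, `udl1_term`, `udl2_term`, `mdisc_term`, `distance`) are hypothetical, the class column is ignored, and MDISC is evaluated for the first iteration only, where $\textit{Sim}(x_{l},x_{k})$ is 1 for identical values and 0 otherwise.

```python
from math import sqrt

# Toy dataset of Table 3 as (A1, A2, A3); the class column is not used.
rows = [("a", 50, "C"), ("a", 20, "B"), ("a", 80, "A"),
        ("b", 60, "D"), ("b", 65, "B"), ("b", 90, "A"),
        ("c", 30, "D"), ("c", 40, "C"), ("c", 50, "B")]
A3 = sorted({r[2] for r in rows})

def n(y, x):
    """Number of rows in which A1 = y co-occurs with A3 = x."""
    return sum(1 for r in rows if r[0] == y and r[2] == x)

def numeric_term(yi, yj):
    """Normalized difference of the A2 averages associated with yi and yj."""
    def avg(y):
        vals = [r[1] for r in rows if r[0] == y]
        return sum(vals) / len(vals)
    a2 = [r[1] for r in rows]
    return abs(avg(yi) - avg(yj)) / (max(a2) - min(a2))

def udl1_term(yi, yj):
    # Eq. (7): conditional probabilities p(y | x)
    s = 0.0
    for x in A3:
        n_x = sum(1 for r in rows if r[2] == x)
        s += (n(yi, x) / n_x - n(yj, x) / n_x) ** 2
    return sqrt(s / len(A3))

def udl2_term(yi, yj):
    # Eq. (8): conditional probabilities p(x | y)
    n_i = sum(n(yi, x) for x in A3)
    n_j = sum(n(yj, x) for x in A3)
    s = sum((n(yi, x) / n_i - n(yj, x) / n_j) ** 2 for x in A3)
    return sqrt(s / len(A3))

def mdisc_term(yi, yj):
    # Eq. (9), first iteration: with Sim equal to the identity,
    # the double sums reduce to ordinary dot products.
    dot = sum(n(yi, x) * n(yj, x) for x in A3)
    norm_i = sqrt(sum(n(yi, x) ** 2 for x in A3))
    norm_j = sqrt(sum(n(yj, x) ** 2 for x in A3))
    return 1.0 - dot / (norm_i * norm_j)

def distance(term, yi, yj, n_context=2):
    """Average of the numeric (A2) and categorical (A3) contributions."""
    return (numeric_term(yi, yj) + term(yi, yj)) / n_context

for yi, yj in [("a", "b"), ("a", "c"), ("b", "c")]:
    print(yi, yj, [round(distance(t, yi, yj), 3)
                   for t in (udl1_term, udl2_term, mdisc_term)])
```

The printed values agree with Table 4 to within 0.001. The residual differences in the last digit arise because the worked example carries truncated intermediate values such as 71.66 and 0.22, whereas the script keeps full precision.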

4.6 Analyze datasets without class attribute

We present the framework for using the extended SOM to analyze mixed-type datasets that do not have a class attribute as follows.

By using UDL1, UDL2, or MDISC, we can compute a pairwise distance matrix for each categorical attribute of the mixed-type dataset. Then, by using agglomerative hierarchical clustering, we can construct a dendrogram as the distance hierarchy for each categorical attribute. The next step is to convert each distance hierarchy to a common point matrix (CPM). This process is summarized in Algorithm 1, Section 3.2.

The CPM for each categorical attribute computed in Algorithm 1 is needed for the SOM extended for mixed-type datasets. The detailed process is presented in Algorithm 2, Section 3.2.
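The step from a pairwise distance matrix to a dendrogram can be sketched as follows. This is a minimal illustration, not the procedure of Algorithm 1: the linkage criterion is not stated here, so average linkage is an assumption, the function name `agglomerate` is hypothetical, and the conversion to a CPM is not shown. The input uses the UDL2 distances of Table 4.

```python
from itertools import combinations

def agglomerate(labels, dist):
    """Average-linkage agglomerative clustering on a pairwise distance dict.

    `dist` maps frozenset({u, v}) to the distance between values u and v.
    Returns the merges as (cluster_1, cluster_2, height) tuples, bottom-up.
    """
    clusters = [(label,) for label in labels]
    merges = []

    def linkage(pair):
        # Average of all pairwise distances between the two clusters
        c1, c2 = pair
        total = sum(dist[frozenset((u, v))] for u in c1 for v in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=linkage)
        merges.append((c1, c2, linkage((c1, c2))))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# UDL2 distances from Table 4
dist = {frozenset(("a", "b")): 0.272,
        frozenset(("a", "c")): 0.189,
        frozenset(("b", "c")): 0.344}
merges = agglomerate(["a", "b", "c"], dist)
for c1, c2, height in merges:
    print(c1, c2, round(height, 3))
```

For the toy data, $a$ and $c$ are merged first at height 0.189, and $b$ joins them at the average of 0.272 and 0.344, i.e., 0.308. The merge heights are the link weights that a distance hierarchy would carry.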

It is worth noting that the introduced approaches are sensitive to noise and missing values. Data preprocessing should therefore be conducted to ensure the quality of the data before the approaches are applied: noisy data should be corrected, and missing values must be handled.

5. Experimental results

We use one internal and two external measures to compare the projection results of GMixSOM with the 1-of-$k$ coding scheme and with distance hierarchies constructed by the unsupervised approaches, on synthetic and real-world mixed-type datasets.

5.1 Evaluation

To evaluate the performance of the proposed methods, we use an internal measure, root mean squared error (RMSE), also referred to as quantization error [32], and two external measures, entropy and accuracy. Other performance measures related to SOM include topological error [33], trustworthiness and continuity [34]. The internal measure does not consider external information such as the class attribute but only the feature attributes involved in the training. RMSE measures quantization error, that is, the root mean squared distance between the input instances and the prototypes of their best matching units. RMSE is calculated by

$\displaystyle\textit{RMSE}={\left(\frac{\sum^{|X|}_{i=1}{\textit{ndist}(x_{i},{\textit{BMU}}_{i})}^{2}}{|X|}\right)}^{1/2}$

where $|X|$ is the number of input data and $x_{i}$ is an instance of $X$ . ${\textit{BMU}}_{i}$ is the prototype of the best matching unit of $x_{i}$ . $\textit{ndist}(\cdot,\cdot)$ is the normalized distance between the two arguments. The distance is normalized by the number of attributes to allow performance comparison among different approaches. Note that 1-of- $k$ coding increases the number of attributes.
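As a rough illustration of the RMSE formula, the sketch below handles numeric vectors only. The normalization in `ndist` (Euclidean distance divided by the number of attributes) is an assumption about how the per-attribute normalization is carried out, and in GMixSOM the categorical components would be measured through distance hierarchies rather than by plain subtraction. The data and prototypes are hypothetical.

```python
from math import sqrt

def ndist(x, prototype):
    """Euclidean distance divided by the number of attributes (assumed normalization)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, prototype))) / len(x)

def rmse(instances, prototypes):
    """Root mean squared normalized distance of each instance to its BMU prototype."""
    squared = []
    for x in instances:
        bmu_dist = min(ndist(x, w) for w in prototypes)  # best matching unit
        squared.append(bmu_dist ** 2)
    return sqrt(sum(squared) / len(squared))

# Hypothetical numeric data scaled to [0, 1] and two hypothetical prototypes
instances = [(0.0, 0.0), (0.0, 0.2), (1.0, 1.0)]
prototypes = [(0.0, 0.1), (1.0, 1.0)]
print(round(rmse(instances, prototypes), 4))
```

Dividing by the number of attributes keeps the measure comparable when a coding scheme such as 1-of-$k$ changes the dimensionality of the input.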

The external measure entropy makes use of external information, namely, the class attribute, which is not involved in training the SOM. Entropy measures the consistency of the class labels of the instances projected in the neurons and is defined as the weighted average of the entropies of the individual neurons.

$\displaystyle\textit{Entropy}=\sum^{N}_{n=1}{\frac{|X_{n}|}{|X|}\left(-\sum_{i\in C}p_{n}(x_{i})\log p_{n}(x_{i})\right)}$

where $N$ is the number of neurons, $|X_{n}|$ is the number of instances projected in neuron $n$, $p_{n}(x_{i})$ is the ratio of the instances with class label $i$ to the instances projected in neuron $n$, and $C$ is the set of class labels of the class attribute. Lower entropy represents a better projection: it indicates that most of the data instances projected in an individual neuron have the same class label.
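The entropy measure can be sketched in a few lines. The base of the logarithm is not stated in the formula; base 2 is assumed here because it is consistent with the dataset entropies reported in Table 7 (for example, 1.585 is approximately $\log_{2}3$ for three nearly balanced classes). The function name `map_entropy` and the example projection are hypothetical.

```python
from collections import Counter
from math import log2

def map_entropy(neuron_labels):
    """Weighted average of the class-label entropies of the individual neurons.

    `neuron_labels` maps a neuron id to the class labels of the instances
    projected to it. The base-2 logarithm is an assumption.
    """
    total = sum(len(labels) for labels in neuron_labels.values())
    result = 0.0
    for labels in neuron_labels.values():
        if not labels:
            continue  # empty neurons contribute nothing
        h = -sum((c / len(labels)) * log2(c / len(labels))
                 for c in Counter(labels).values())
        result += len(labels) / total * h
    return result

# Hypothetical projection: two pure neurons and one evenly mixed neuron
projection = {0: ["A", "A", "A", "A"], 1: ["A", "B"], 2: ["B", "B"]}
print(map_entropy(projection))  # 0.25
```

Only the mixed neuron contributes: its entropy is 1 bit and it holds 2 of the 8 instances, giving 0.25.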

We further used the learned projections on the map as a classifier and measured classification accuracy as follows. After training, each neuron is assigned a class label by the majority class of training data projected in the neuron. A test instance is classified by projecting onto the trained map and assigning the class label of its best matching neuron. Classification accuracy is measured by

$\displaystyle\textit{Accuracy}=\frac{1}{|X|}\sum^{N}_{n=1}{\textit{max}_{i\in C}|x_{n,i}|}$

where $|x_{n,i}|$ is the number of instances with class label $i$ in neuron $n$ .
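The classification procedure described above can be sketched as follows, assuming the best matching unit of every instance has already been determined. The function names and the toy BMU indices are hypothetical, and counting a test instance as an error when its neuron received no training data is an assumption, since that case is not discussed.

```python
from collections import Counter

def label_neurons(train_bmus, train_labels):
    """Assign each neuron the majority class of the training data projected to it."""
    per_neuron = {}
    for neuron, label in zip(train_bmus, train_labels):
        per_neuron.setdefault(neuron, Counter())[label] += 1
    return {neuron: votes.most_common(1)[0][0]
            for neuron, votes in per_neuron.items()}

def accuracy(test_bmus, test_labels, neuron_label):
    """Fraction of test instances whose BMU carries their true class label."""
    hits = sum(1 for neuron, label in zip(test_bmus, test_labels)
               if neuron_label.get(neuron) == label)
    return hits / len(test_labels)

# Hypothetical BMU indices and class labels
neuron_label = label_neurons([0, 0, 0, 1, 1], ["A", "A", "B", "B", "B"])
print(neuron_label)                                                # {0: 'A', 1: 'B'}
print(accuracy([0, 1, 1, 0], ["A", "B", "A", "A"], neuron_label))  # 0.75
```

When the same data are used for labelling and for evaluation, this procedure reduces to the sum of per-neuron majority counts divided by $|X|$, as in the formula.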

5.2 Synthetic dataset SynMix1

Dataset SynMix1 consists of one numeric attribute Amount and two categorical attributes Dept (or Department) and Drink as shown in Table 5. The dataset has 9 groups, each of which has 60 or 30 data instances. The values of the numeric attribute were randomly generated from Gaussian distributions with designated means and standard deviations. Categorical values were randomly generated within the specified groups according to a uniform distribution, such as {MIS, MBA, FM} in groups {1, 2, 3}. Each instance has one of the class labels A, B, and C. Ninety percent of the instances in groups 1, 2, and 3 have class label A and the other ten percent are randomly assigned the other two labels.

Table 5
Synthetic dataset SynMix1 with one numeric and two categorical attributes

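Group	Count	Amount ( $\mu,\sigma$ )	Dept	Drink	Class
1	60	$N$ (500, 25)	{MIS, MBA, FM}	{Coke, Pepsi}	A (90%)
2	30	$N$ (400, 20)	{MIS, MBA, FM}	{Coke, Pepsi}	A (90%)
3	30	$N$ (300, 15)	{MIS, MBA, FM}	{Coke, Pepsi}	A (90%)
4	60	$N$ (500, 25)	{EE, CE, ME}	{Latte, Mocha}	B (80%)
5	30	$N$ (400, 20)	{EE, CE, ME}	{Latte, Mocha}	B (80%)
6	30	$N$ (300, 15)	{EE, CE, ME}	{Latte, Mocha}	B (80%)
7	60	$N$ (500, 25)	{SD, VC, AD}	{AppleJ, OrangeJ}	C (70%)
8	30	$N$ (400, 20)	{SD, VC, AD}	{AppleJ, OrangeJ}	C (70%)
9	30	$N$ (300, 15)	{SD, VC, AD}	{AppleJ, OrangeJ}	C (70%)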

Distance hierarchies shown in Fig. 6 for categorical attributes Dept and Drink were constructed by using the hierarchical clustering algorithm taking the distance matrices produced by UDL1, UDL2, and MDISC as input. When running the unsupervised methods, the class attribute was excluded and only the other two features were considered context attributes. The result is as expected and reflects the relationship between categorical values with regard to the other feature attributes in the dataset. For example, the drinks of the same type, juices, coffee or carbonated drink, were grouped together in the hierarchy.

The projection results by the extended model GMixSOM with different schemes for handling categorical attributes are shown in Fig. 7. The grey level of the square grids in the background indicates the distance between the prototypes of two neurons: the darker the grid, the larger the distance. As shown in the right part of Fig. 7e, the dark color indicates that a large distance separates the region of neurons labelled with 7, 8, and 9 from the region of neurons labelled with 4, 5, and 6.

The size of a neuron is proportional to the number of instances projected to it. The color of a neuron represents the group of the instances projected in the neuron. A single color indicates that all the instances have the same group number; otherwise, they come from different groups. The number associated with a neuron represents the group number of the instances projected to the neuron and corresponds to that shown in Table 5. A group number with a plus symbol indicates that the instances in the neuron have different group numbers and that the number represents the majority group.

Figure 6.

Distance hierarchies constructed by using (a) UDL1, (b) UDL2, and (c) MDISC for attributes Department and Drink of dataset SynMix1.

For the example of Fig. 7a, the left-most neuron has three colors and is labeled by 4+, indicating that the data points projected in the neuron are from three groups among which group 4 is the largest. The size of the neuron is larger than that of the right-most neuron labeled by 1+, indicating fewer data are projected in the right-most than the left-most neuron.

As can be seen in Table 5, the instances in groups 1, 2, and 3 are more similar to one another than to the instances in other groups and thus should be projected to nearby regions. So are the instances in groups 4, 5, and 6, and the instances in groups 7, 8, and 9. The projections in Fig. 7c to e, which use the learned distance hierarchies to represent the dissimilarity relationship between categorical values, reflect this expectation. By contrast, the results in Fig. 7a and b, obtained with 1-of-$k$ coding and the two-level distance hierarchy, do not yield the expected clustering. For instance, in Fig. 7a two neurons labeled with 4+ and 7+ are located in the lower-right region and separated from the others labeled with 4+ and 7+ in the upper-left region. In Fig. 7b, two neurons labeled with 7 and 7+ are located in the right-most region and separated from the other three in the left-most region.

Figure 7.

The projection result of synthetic dataset SynMix1 by GMixSOM with (a) 1-of- $k$ coding, (b) TL-DH, (c) UDL1-DH, (d) UDL2-DH, and (e) MDISC-DH.

5.3 Synthetic dataset SynMix2

Table 6 depicts another synthetic dataset, SynMix2, which consists of two categorical attributes (Dept and Drink) and two numeric attributes (Amt1 and Amt2). Its instances were generated in a manner similar to those in SynMix1.

Table 6
Synthetic dataset SynMix2 with two categorical and two numeric attributes

Group	Count	Dept	Drink	Amt1 ( $\mu,\sigma$ )	Amt2 ( $\mu,\sigma$ )	Class
1	200	BM	Latte	$N$ (70, 10)	$N$ (20, 5)	M (90%)
2	200	IM	Mocha	$N$ (70, 10)	$N$ (20, 5)	M (90%)
3	200	CE	Cappu	$N$ (30, 10)	$N$ (80, 5)	E (85%)
4	200	EE	Black	$N$ (30, 10)	$N$ (80, 5)	E (85%)
5	200	SD	Coke	$N$ (70, 10)	$N$ (20, 5)	D (80%)
6	200	VD	Pepsi	$N$ (70, 10)	$N$ (20, 5)	D (80%)
7	200	AH	Sprint	$N$ (30, 10)	$N$ (80, 5)	H (75%)
8	200	BH	7Up	$N$ (30, 10)	$N$ (80, 5)	H (75%)

Figure 8.

Distance hierarchies constructed by using (a) UDL1, (b) UDL2, and (c) MDISC for attributes Department and Drink of dataset SynMix2.

Figure 9.

The projection result of synthetic dataset SynMix2 by GMixSOM with (a) 1-of- $k$ coding, (b) TL-DH, (c) UDL1-DH, (d) UDL2-DH, and (e) MDISC-DH.

Figure 10.

Distance hierarchies constructed by using (a) UDL1 (b) UDL2 and (c) MDISC for attributes Education, Marital-Status and Relationship of dataset pAdult.

Figure 8 shows the distance hierarchies generated from the dataset by the different algorithms. Figure 9 shows the projections by GMixSOM on SynMix2 with the different methods for categorical attributes. From Table 6, it is evident that the instances in groups 1 and 2 are more similar to each other than to the instances in other groups. So are those in groups 3 and 4, groups 5 and 6, and groups 7 and 8. The projection results in Fig. 9c to e, obtained with distance hierarchies generated from the dataset by UDL1, UDL2, and MDISC, reflect this structure and clearly show four clusters. Furthermore, the background grey level indicates that the distances between clusters (dark background in between) are large and the distances within each cluster (light background) are small. Figure 9a with 1-of-$k$ coding is the worst: groups 3 and 4 are even separated by the others.

In summary, the extended model GMixSOM with distance hierarchies learned from the datasets is able to reflect on the map the relationships embedded between categorical values. This can be verified from the projection results of the two synthetic datasets in Figs 7 and 9. We use dataset SynMix1 to illustrate. From Table 5 and the learned distance hierarchies in Fig. 6, we can observe that the data instances in each of the three major groups, i.e., (1, 2, 3), (4, 5, 6), and (7, 8, 9), are close neighbors in the input space. For example, the instances in the major group (1, 2, 3) are all from Management departments and all associated with carbonated drinks. In the projection maps shown in Fig. 7c to e, the instances from each of the three major groups gather in the same region, preserving the neighborhood relations of the input data. Moreover, as indicated by the background grey level, the instances within each major group have small intra-group distances (light grey in the region) while the instances in different major groups have large inter-group distances (dark grey in between). The problem with Fig. 7a and b is again that the 1-of-$k$ and TLDH schemes do not consider different degrees of similarity between categorical values. Recall that in these two schemes any two distinct values contribute the same amount of dissimilarity to the distance between two data points, namely, $\sqrt{2}$ for 1-of-$k$ and 1 for TLDH.

Table 7

Statistics of the real-world datasets

Dataset	#Categorical	#Numeric	#Data	#Class	Majority (%)	Entropy
Adult	8	6	45,222	2	75	0.808
pAdult	3	4	48,842	2	76	0.794
ACA	3	11	690	2	56	0.991
pACA	3	8	690	2	56	0.991
Audiology	66	0	220	24	26	3.406
CMC	4	5	1,473	3	43	1.539
Dermatology	33	1	366	6	30	2.433
German Credit	13	7	1,000	2	70	0.881
Hayes-Roth	4	0	132	3	39	1.546
Lymphography	15	3	148	4	55	1.228
Mushroom	18	4	5,644	2	62	0.959
Post-Operative	8	0	90	3	71	0.980
TAE	2	3	151	3	34	1.585

#Categorical and majority (%) indicate the number of categorical attributes and the percentage of the majority class, respectively.

Table 8

Performance by different number of MDISC iterations

		MDISC#1	MDISC#2	MDISC#3	MDISC#4
Adult	RMSE	0.100	0.118	0.118	0.118
	Entropy	0.529	0.549	0.544	0.55
pAdult	RMSE	0.063	0.071	0.063	0.071
	Entropy	0.511	0.521	0.518	0.516
ACA	RMSE	0.170	0.167	0.167	0.167
	Entropy	0.52	0.49	0.485	0.485
pACA	RMSE	0.134	0.145	0.148	0.141
	Entropy	0.493	0.467	0.46	0.469
Audiology	RMSE	0.195	0.187	0.190	0.219
	Entropy	1.752	1.671	1.912	1.799
CMC	RMSE	0.200	0.192	0.192	0.192
	Entropy	1.397	1.397	1.408	1.408
Dermatology	RMSE	0.270	0.288	0.295	0.295
	Entropy	0.272	0.228	0.479	0.456
GermanCredit	RMSE	0.274	0.295	0.298	0.298
	Entropy	0.825	0.809	0.826	0.826
Hayes-Roth	RMSE	0.351	0.290	0.295	0.362
	Entropy	1.047	1.432	1.418	1.125
Lymphography	RMSE	0.300	0.313	0.318	0.318
	Entropy	0.772	0.706	0.638	0.638
Mushroom	RMSE	0.095	0.110	0.118	0.118
	Entropy	0.026	0.045	0.031	0.022
PostOperative	RMSE	0.305	0.344	0.292	0.452
	Entropy	0.895	0.858	0.937	0.824
TAE	RMSE	0.293	0.281	0.292	0.277
	Entropy	1.281	1.245	1.244	1.282

5.4 Real-world datasets

Eleven real-world datasets used in the experiments were taken from the UCI repository [35]. In addition, reduced versions of the Adult and Australian Credit Approval (ACA) datasets, denoted by pAdult and pACA, in which only the attributes significantly correlated with the class attribute were retained [2, 14], are also included.

Table 9
No. of best performances by different no. of iterations of MDISC on the 13 datasets

	MDISC#1	MDISC#2	MDISC#3	MDISC#4
RMSE	7	4	2	1
Entropy	4	3	4	2

Figure 11.

The projection result of dataset pAdult by GMixSOM with (a) 1-of- $k$ coding, (b) TL-DH, (c) UDL1-DH, (d) UDL2-DH, and (e) MDISC-DH.

The summary of the experimental datasets is shown in Table 7, including the numbers of categorical attributes, numeric attributes, data points, and class labels, the percentage of the majority class, and the entropy of each dataset. For the parameters of the extended SOM, the learning rate decreased linearly, $\alpha(t)=\alpha(0)\times(1.0-t/T)$, where $t$ and $T$ denote the current iteration and the total number of training iterations, respectively, and the initial learning rate $\alpha(0)$ was set to 0.95. A Gaussian function was used as the neighborhood function.
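The two training parameters can be written out as a small sketch. The linear schedule follows the formula for $\alpha(t)$ directly. The Gaussian neighborhood is given in its common textbook form, and its width `sigma`, how that width changes during training, and the total number of iterations are all assumptions, since those settings are not reported here.

```python
from math import exp

def learning_rate(t, T, alpha0=0.95):
    """Linearly decreasing learning rate alpha(t) = alpha(0) * (1 - t / T)."""
    return alpha0 * (1.0 - t / T)

def gaussian_neighborhood(grid_dist, sigma):
    """Gaussian weight for a neuron at grid distance `grid_dist` from the BMU."""
    return exp(-(grid_dist ** 2) / (2.0 * sigma ** 2))

T = 1000  # hypothetical total number of training iterations
print(learning_rate(0, T), learning_rate(500, T), learning_rate(1000, T))
print(gaussian_neighborhood(0.0, sigma=2.0), gaussian_neighborhood(2.0, sigma=2.0))
```

The learning rate starts at 0.95 and reaches 0 at the last iteration, while the neighborhood weight is 1 at the best matching unit and decays with grid distance.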

Figure 10 shows the distance hierarchies learned from the dataset with the UDL1, UDL2, and MDISC schemes. Some interesting patterns can be observed from the results. For instance, if we group the values of attribute Relationship into two clusters according to the hierarchy in the right-most part of Fig. 10b and c, the clusters are {Wife, Husband} and {Unmarried, Not-in-family, Own-child, Other-relative}. Similarly, for the attribute Marital-Status in the middle of Fig. 10b and c, we obtain {Married-civ-spouse, Married-AF-spouse} and {Widowed, Divorced, Separated, Never-married, Married-spouse-absent}.

MDISC performs an iterative process and progressively updates the dissimilarity degree between pairwise values in individual categorical attributes. It is interesting to investigate how the number of iterations affects the results. Tables 8 and 9 show the performance obtained with distance hierarchies constructed from the dissimilarity measured with different numbers of MDISC iterations. The result indicates that one iteration of the process already achieved good results: it yielded the best RMSE on 7 of the 13 datasets and the best entropy on 4, the same as three iterations (Table 9). This outcome is consistent with the claim by the authors of [24]. Hereafter, in the experiments, the dissimilarity between categorical values for MDISC is calculated with one iteration.

Table 10

RMSE of different approaches on real-world datasets

Datasets	1-of- $k$	TLDH	UDL1	UDL2	MDISC
Adult	0.130	0.200	0.145	0.122	0.100
pAdult	0.089	0.100	0.077	0.071	0.063
ACA	0.230	0.261	0.187	0.164	0.170
pACA	0.207	0.224	0.164	0.145	0.134
Audiology	0.219	0.226	0.219	0.202	0.195
CMC	0.245	0.268	0.210	0.210	0.200
Dermatology	0.295	0.369	0.322	0.283	0.270
GermanCredit	0.315	0.367	0.327	0.290	0.274
Hayes-Roth	0.382	0.444	0.378	0.383	0.351
Lymphography	0.371	0.400	0.351	0.315	0.300
Mushroom	0.200	0.255	0.155	0.118	0.095
PostOperative	0.399	0.452	0.440	0.330	0.305
TAE	0.202	0.365	0.270	0.281	0.293
Average	0.268	0.319	0.270	0.243	0.230

Table 11

Entropy measured in the original datasets and after projection on the maps

Datasets	1-of- $k$	TLDH	UDL1	UDL2	MDISC	Original
Adult	0.554	0.552	0.561	0.534	0.529	0.808
pAdult	0.560	0.525	0.529	0.514	0.511	0.794
ACA	0.554	0.551	0.489	0.495	0.520	0.991
pACA	0.568	0.470	0.475	0.465	0.493	0.991
Audiology	2.200	1.879	2.015	1.614	1.752	3.406
CMC	1.416	1.395	1.405	1.410	1.397	1.539
Dermatology	0.482	0.346	0.119	0.178	0.272	2.433
GermanCredit	0.782	0.791	0.798	0.796	0.825	0.881
Hayes-Roth	1.157	0.944	1.191	1.148	1.047	1.546
Lymphography	0.678	0.642	0.673	0.680	0.772	1.228
Mushroom	0.056	0.022	0.027	0.026	0.026	0.959
PostOperative	0.899	0.824	0.789	0.815	0.895	0.980
TAE	1.387	1.320	1.345	1.342	1.281	1.585
Average	0.869	0.789	0.801	0.771	0.794	1.395

Table 12

Classification accuracy by the various schemes

	1-of- $k$	TLDH	UDL1	UDL2	MDISC
Adult	0.817	0.813	0.812	0.824	0.823
pAdult	0.819	0.823	0.826	0.829	0.819
ACA	0.780	0.841	0.855	0.855	0.855
pACA	0.852	0.867	0.864	0.861	0.855
Audiology	0.445	0.541	0.500	0.545	0.577
CMC	0.483	0.488	0.487	0.496	0.492
Dermatology	0.894	0.947	0.885	0.927	0.916
GermanCredit	0.717	0.704	0.709	0.714	0.721
Hayes-Roth	0.674	0.591	0.636	0.583	0.583
Lymphography	0.797	0.824	0.791	0.797	0.757
Mushroom	0.976	0.994	0.991	0.991	0.991
PostOperative	0.722	0.722	0.744	0.733	0.711
TAE	0.510	0.543	0.510	0.517	0.536
Average	0.730	0.746	0.739	0.744	0.741

Figure 11 shows the projection results for dataset pAdult. The boundary indicated by the grey levels in Fig. 11a, generated by GMixSOM with 1-of-$k$ coding, is more obscure than those produced by the other schemes. The boundary in Fig. 11e from MDISC is the clearest, and most of the regions have low grey levels, indicating a high degree of similarity within individual regions.

Figure 11d shows a boundary in the middle. Most of the instances with class label salary $>$ 50 K, indicated by the green color, are located in the upper half of the map. The map in Fig. 11e can be roughly divided into four regions. The instances with salary $>$ 50 K are mainly located in the upper-left region.

Table 10 details the performance on individual datasets for each approach to handling mixed-type data. As can be seen, MDISC achieved the lowest average RMSE while TLDH had the largest.

Table 11 shows that, on average, UDL2 achieved the best performance in reducing entropy in neurons. That is, the instances projected in individual neurons are more often of the same class than with the other methods. The 1-of-$k$ coding performed worst.

Table 12 demonstrates that the schemes which use distance hierarchies to represent the relationship between categorical values, namely, TLDH, UDL1, UDL2, and MDISC, outperformed the 1-of-$k$ coding in average classification accuracy. Among the five, TLDH performed best and UDL2 came second.

It is interesting to note that, on average, UDL2 outperformed UDL1 in all three measures. In other words, Eq. (8) worked better than Eq. (7) in these experiments. The only difference between the two equations is how the conditional probability is measured in the second term of the formulas.

The two datasets pAdult and pACA, formed by retaining only the attributes highly correlated with the class attribute, achieved better performance than the original datasets in the three measures for nearly all approaches. The main exception is the 1-of-$k$ scheme, whose entropy increases on the partial datasets. The attributes irrelevant to the class attribute were removed from the partial datasets, so the remaining attributes are the ones relevant to the class attribute. In other words, with respect to the remaining attributes, similar instances are more likely to have the same class label. Therefore, the instances projected in the same neuron are more likely to have the same class label. The projection with distance hierarchy coding, in which the distance between categorical values is measured by considering context attributes, does help to gather the instances with the same class label together.

6. Conclusions

Distance hierarchies, which enable the coding of the dissimilarity degree between categorical values, are required by the extended model GMixSOM for handling mixed-type data. In previous studies, a supervised approach requiring a class attribute in the dataset was proposed for learning the hierarchies from the dataset. However, not every real-world dataset contains a class attribute. To facilitate the learning of hierarchies when no class attribute exists in the dataset, four approaches of unsupervised learning of dissimilarity degree between categorical values are presented and compared on 11 real-world datasets in this study.

The empirical results are summarized as follows. For the internal index quantization error, MDISC scheme achieved the smallest error while TLDH ranked the worst. For external indices entropy and classification accuracy, 1-of- $k$ coding performed worst in both, UDL2 outperformed the others in entropy, and TLDH was the best in accuracy.

The advantage of UDL1, UDL2, and MDISC compared with 1-of-$k$ coding and TLDH is the capability of uncovering hidden relationships between the values of a categorical attribute, since the dissimilarity between categorical values is learned from the datasets with respect to the other feature attributes. Among those three approaches, UDL1 is inferior to the other two in all three measures. UDL2 achieved the best results in entropy and classification accuracy while MDISC performed best in quantization error.

There are many ideas for measuring similarity of heterogeneous or mixed data in the literature such as those mentioned in [36, 37]. In the future, it would be interesting to integrate other ideas into our model and see whether the performance can be further improved.

In addition to being applied to the extended SOM, the proposed similarity measures of mixed-type data can also be used for clustering and classification problems with mixed-type data. For instance, for clustering we can construct the pairwise distance matrix of the mixed-type dataset by using the proposed measures, and a typical hierarchical clustering algorithm can then take the matrix as input and proceed with the remaining clustering steps.

Acknowledgments

This work is partially supported by National Science Council, Taiwan under grant NSC 102-2410-H-224-019-MY2 and MOST 105-2410-H-224-007, and by the “Intelligent Recognition Industry Service Center” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.

References

1. Zhu, G.Q. and Ding, Data mining with big data, IEEE Transactions on Knowledge and Data Engineering 26(1) (2014), 97–107.
2. Hsu C.C., Generalizing self-organizing map for categorical data, IEEE Transactions on Neural Networks 17(2) (2006), 294–304. doi: 10.1109/TNN.2005.863415.
3. Kohonen, The self-organizing map, Proceedings of the IEEE 78 (1990), 1464–1480.
4. Hanafizadeh and Mirzazadeh, Visualizing market segmentation using self-organizing maps and Fuzzy Delphi method – ADSL market of a telecommunication company, Expert Systems with Applications 38(1) (2011). doi: 10.1016/j.eswa.2010.06.045.
5. Himberg, A SOM based cluster visualization and its application for false coloring, in: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, Washington, DC, USA, 2000.
6. Hsu C.C., Lin S.H. and Tai W.S., Apply extended self-organizing map to cluster and classify mixed-type, Neurocomputing 74 (2011), 3832–3842.
7. Tasdemir, Milenov and Tapsall, Topology-based hierarchical clustering of self-organizing maps, IEEE Transactions on Neural Networks 22(3) (2011), 474–485. doi: 10.1109/TNN.2011.2107527.
8. Thakare V.S. and Patil N.N., Classification of texture using gray level co-occurrence matrix and self-organizing map, in: 2014 International Conference on Electronic Systems, Signal Processing and Computing Technologies (ICESC), Nagpur, 2014.
9. Vesanto and Alhoniemi, Clustering of the self-organizing map, IEEE Transactions on Neural Networks 11(3) (2000), 586–600.
10. and Chow T.W.S., Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density, Pattern Recognition 37(2) (2004), 175–188.
11. Yan, Nie, Wang and Wang, Classification of Aurora kinase inhibitors by self-organizing map (SOM) and support vector machine (SVM), European Journal of Medicinal Chemistry 61 (2013), 73–83. doi: 10.1016/j.ejmech.2012.06.037.
12. Kohonen, Essentials of the self-organizing map, Neural Networks 37 (2013), 52–65. doi: 10.1016/j.neunet.2012.09.018.
13. Kohonen, Oja, Simula, Visa and Kangas, Engineering applications of the self-organizing map, Proceedings of the IEEE 84(10) (1996), 1358–1384.
14. Hsu C.C. and Lin S.H., Visualized analysis of mixed numeric and categorical data via extended self-organizing map, IEEE Transactions on Neural Networks and Learning Systems 23(1) (2012), 72–86. doi: 10.1109/Tnnls.2011.2178323.
15. Tai W.S. and Hsu C.C., Growing self-organizing map with cross insert for mixed-type data clustering, Applied Soft Computing 12 (2012), 2856–2866.
16. Hsu C.C. and Kung C.H., Incorporating unsupervised learning with self-organizing map for visualizing mixed data, in: Proceedings of the Ninth International Conference on Natural Computation (ICNC), 2013.
17. Boriah, Chandola and Kumar, Similarity measures for categorical data: A comparative evaluation, in: Proceedings of SDM ’08, Atlanta, Georgia, USA, 2008.
18. Goodall D.W., A new similarity index based on probability, Biometrics 22(4) (1966), 882–907.
19. Gambaryan, A mathematical model of taxonomy, Izvest. Akad. Nauk Armen. SSR 17(12) (1964), 47–53.
20. Eskin, Arnold, Prerau, Portnoy and Stolfo, A geometric framework for unsupervised anomaly detection, in: Applications of Data Mining in Computer Security, 2002.
21. Jones K.S., A statistical interpretation of term specificity and its application in retrieval, in: Document Retrieval Systems, volume 3 of Taylor Graham Series in Foundations of Information Science, Taylor Graham Publishing, London, UK, 1988, 132–142.
22. Lin, An information-theoretic definition of similarity, in: The 15th International Conference on Machine Learning, San Francisco, CA, USA, 1998.
23. Smirnov E.S., On exact methods in systematics, Systematic Zoology 17(1) (1968), 1–13.
24. Desai, Singh and Pudi, DISC-data-intensive similarity measure for categorical data, in: The 15th Pacific-Asia Conference, PAKDD, Shenzhen, China, 2011.
25. Ienco, Pensa R.G. and Meo, From context to distance: Learning dissimilarity for categorical data clustering, ACM Transactions on Knowledge Discovery from Data 6(1) (2012), 25.
26. Ahmad and Dey, A method to compute distance between two categorical values of same attribute in unsupervised learning for categorical data set, Patt. Recog. Lett. 28(1) (2007), 110–118.
27. Guha, Rastogi and Shim, ROCK: A robust clustering algorithm for categorical attributes, in: The 15th International Conference on Data Engineering (ICDE), 1999.
28. Andritsos, Tsaparas, Miller R.J. and Sevcik K.C., Scalable clustering of categorical data, in: The International Conference on Extending Database Technology (EDBT), 2004.
29. Das, Mannila and Ronkainen, Similarity of attributes by external probes, Knowledge Discovery and Data Mining (1998), 23–29.
30. Jain A.K., Murty M.N. and Flynn P.J., Data clustering: A review, ACM Computing Surveys 31(3) (1999), 60.
31. Alahakoon, Halgamuge S.K. and Srinivasan, Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Transactions on Neural Networks 11 (2000), 601–614.
32. Kohonen, Self-Organizing Map, 2nd ed., Springer-Verlag, Berlin, 1995, p. 113.
33. Kiviluoto, Topology preservation in self-organizing maps, in: Proceedings of the International Conference on Neural Networks (ICNN), 1996, pp. 294–299.
34. Venna and Kaski, Neighborhood preservation in nonlinear projection methods: An experimental study, in: Dorffner, Bischof and Hornik, eds, Proceedings of ICANN 2001, International Conference on Artificial Neural Networks, Berlin, Springer, 2001, pp. 485–491.
35. Blake C.L. and Merz C.J., UCI Repository of Machine Learning Datasets, Dept. Inform. Comput. Sci., University of California, Irvine. Available: http://www.ics.uci.edu/~mlearn/ML-Repository.html, 1998.
36. T.B., Nguyen N.B. and Morita, Study of a mixed similarity measure for classification and clustering, PAKDD (1999), 375–379.
37. S.Q. and Ho T.B., Measuring the similarity for heterogeneous data: An ordered probability-based approach, Discovery Science (2004), 129–141.

¹ Available at http://www.cdc.gov/nchs/icd.htm.
