Visualized mixed-type data analysis via dimensionality reduction

Abstract

Visualization is a useful technique in data analysis, especially in the initial stage of data exploration. Since high-dimensional data cannot be visualized directly, dimensionality reduction techniques are usually used to reduce the data to a lower dimension, say two, for visualization. In previous studies, dimensionality reduction was investigated in the context of numeric datasets. Nevertheless, most real-world datasets are of mixed type, containing both numeric and categorical attributes. In this case, a traditional approach can neither handle the data directly nor output appropriate results. To address this problem, we propose a procedure for visualized analysis of mixed-type data via dimensionality reduction. Dissimilarity between categorical values is learned from the dataset and further used to measure the distance between mixed-type data points. In addition, we propose an approach to identifying significant features and visualizing patterns from the projection map chosen according to quality measures. Experiments on real-world datasets were conducted to demonstrate the feasibility of the proposed method.

Keywords

Dimensionality reduction, mixed-type data, data visualization, data analysis

1. Introduction

Real-world datasets are usually high-dimensional. However, high-dimensional data suffer from the curse of dimensionality and are difficult to analyze because they cannot be visualized directly and incur high computational complexity. Dimensionality reduction techniques, which reduce the dataset to an acceptable dimensionality, have therefore become an important tool in data analysis. Specifically, dimensionality reduction facilitates visualization [1, 2], classification [3, 4, 5, 6], clustering [7, 8, 9, 10], compression of high-dimensional data [11, 12] and other applications [13, 14, 15, 16].

In the literature, a large number of dimensionality reduction techniques have been proposed, which fall mainly into linear and nonlinear techniques [17]. Popular linear techniques include Principal Component Analysis, Locality Preserving Projections [18] and Factor Analysis. Linear techniques assume that the data lie on or near a linear subspace of the high-dimensional space, whereas nonlinear techniques do not rely on this linearity assumption and can therefore handle complex nonlinear data. Nonlinear techniques include Multidimensional Scaling, Isomap, Maximum Variance Unfolding, Kernel PCA, Diffusion Maps, Multilayer Autoencoders, Locally Linear Embedding, Laplacian Eigenmaps, Hessian LLE, Local Tangent Space Alignment, Locally Linear Coordination, and Manifold Charting.

Moreover, real-world data usually contain mixed data types, including numeric, ordinal, and categorical attributes. For instance, Table 1 shows a portion of the attributes and records of the Adult dataset from the UCI Machine Learning Repository [19]. As can be seen, the dataset includes categorical attributes such as Education, Marital-Status, and Relationship and numeric attributes such as Work-Hours, Capital-Gain, and Capital-Loss. Note that the Education attribute can be considered an ordinal attribute as well.

Analyzing mixed-type data is not straightforward. Most algorithms handle only one type of values, either categorical or numeric. Artificial Neural Networks, Genetic Algorithms, $K$-means, and Support Vector Machines take only numeric data, whereas Association Rule Mining, Decision Trees, and Random Forests handle categorical data. When mixed data are encountered, a preprocessing step that transforms one type of data into the other is performed before the algorithms are applied. Numeric values are usually transformed by discretization techniques, and categorical values by 1-of-$k$ coding.

Like the above-mentioned algorithms, current dimensionality reduction techniques were studied in the context of a single type of values, namely numeric data, and cannot be applied directly to categorical or mixed-type datasets. Moreover, a preprocessing step that transforms either type of data into the other usually results in information loss.

Table 1
A portion of the mixed-type real-world dataset Adult

ID	Education	Relationship	Marital-Status	Work-Hours	Capital-Gain	Capital-Loss	Salary
1	HS-grad	Unmarried	Never-married	35	400	0	$\leqslant$ 50 K
2	Masters	Husband	Married-civ-spouse	40	5000	1000	$>$ 50 K
3	Doctorate	Wife	Married-AF-spouse	45	2000	0	$>$ 50 K
4	Assoc-voc	Own-child	Divorced	30	3000	500	$\leqslant$ 50 K
5	Masters	Wife	Separated	40	2000	0	$\leqslant$ 50 K
6	HS-grad	Own-child	Never-married	38	1000	200	$\leqslant$ 50 K
7	Assoc-voc	Own-child	Divorced	35	1500	0	$\leqslant$ 50 K
8	HS-grad	Husband	Married-civ-spouse	40	2000	100	$>$ 50 K
9	4${}^{\rm th}$-grad	Wife	Married-civ-spouse	40	1000	500	$\leqslant$ 50 K
10	Assoc-voc	Own-child	Married-AF-spouse	40	2000	0	$\leqslant$ 50 K

In this paper, we address this problem by presenting a technique for measuring dissimilarity between mixed-type data points. Furthermore, we propose a procedure based on dimensionality reduction techniques for visualized analysis of mixed-type data involving numeric and categorical attributes. The procedure allows users to visualize and analyze mixed-type datasets interactively: users can select clusters on the projection map and then obtain statistical information or patterns from the selected clusters. This visualized analysis helps users understand how the mixed-type data are distributed and observe which features differ significantly between one cluster and the others.

The rest of this paper is organized as follows. Section 2 reviews the literature on dimensionality reduction and distance learning. Section 3 describes the proposed methods in detail. Section 4 examines the performance of the methods with experiments. Section 5 concludes the paper.

2. Background

We briefly review work relevant to our study, namely dimensionality reduction and the handling of mixed-type data.

2.1 Dimensionality reduction

In machine learning, there are two main methods for dimensionality reduction: feature selection and feature extraction. Feature selection aims to find a small subset of $d$ of the $D$ dimensions that gives the most information [20, 21, 22, 23, 24, 25, 26]. Some feature selection methods handle mixed-type data [27, 28, 29, 30, 31, 32].

In this study, we take the perspective of feature extraction. In this regard, dimensionality reduction aims to reduce data dimensionality by transforming a dataset $X$ with dimensionality $D$ into a new dataset $Y$ with dimensionality $d$, where $d<D$, while retaining certain properties of the data, such as geometrical relations, as much as possible. Formally, assume a dataset $X=\{x_{1},x_{2},\ldots,x_{N}\}$, $x_{i}\in R^{D}$, where $N$ is the number of data points and $D$ is the number of features. Dimensionality reduction is a mapping from the high-dimensional space to a low-dimensional space, $F:x\to y$, where $y\in R^{d}$ and $d<D$.

Many feature extraction techniques have been proposed. Linear techniques include Principal Component Analysis [33], Factor Analysis, and Locality Preserving Projections [18]; nonlinear techniques include Multidimensional Scaling [34], Sammon Mapping [35], Autoencoders [36], Curvilinear Component Analysis [37], Locally Linear Embedding [38], Isomap [39], Manifold Charting [40], Maximum Variance Unfolding [16], Local Tangent Space Alignment [41], Diffusion Maps [42], Locally Linear Coordination [39], and t-SNE [43].

A popular technique usually spawns many variants, as follow-up research relaxes constraints or improves performance. For instance, Kernel PCA (KPCA) [44] is a reformulation of traditional linear PCA in a high-dimensional space that is constructed using a kernel function. Hessian LLE [45] is a variant of Locally Linear Embedding that minimizes the ‘curviness’ of the high-dimensional manifold when embedding it into a low-dimensional space, under the constraint that the low-dimensional data representation is locally isometric.

In addition to the above, the unsupervised artificial neural network Self-Organizing Map (SOM) [46] can also be used as a dimensionality reduction tool. It maps high-dimensional data to a set of neurons organized in a low-dimensional space, typically of two or three dimensions. SOM can be applied to dimensionality reduction because its training algorithm preserves the topological order of the raw data. That is, similar data instances are projected to the same or nearby locations, while dissimilar instances are projected to distant locations on the low-dimensional map. The coordinate of the neuron to which a high-dimensional data instance is projected is taken as the new value of the instance after dimensionality reduction. As a result, SOMs have been applied to visualized data analysis, including clustering [47] and classification [48].

2.2 Data preprocessing for mixed-type datasets

Most algorithms for data analysis handle only one type of values, numeric or categorical. When mixed-type datasets are encountered, a preprocessing step that transforms one type of values into the other is performed before the algorithms are applied. Numeric attributes are discretized before algorithms for categorical data are applied. For example, the decision tree algorithm C5.0 splits a continuous attribute into a discrete one, which makes it possible to partition the node, if the attribute is selected, into a limited number of branches during training. On the other hand, for algorithms that process only numeric values, a typical method of transforming a categorical attribute is 1-of-$k$ coding, which converts a categorical attribute into a list of binary attributes. Each value in the domain of the categorical attribute is associated with one binary attribute. To transform a categorical value in a record, the binary attribute corresponding to that value is set to one and the others in the list are set to zero. A toy example is illustrated in Fig. 1, assuming the domain of attribute OS has only four distinct values as shown in the table.

Figure 1.

Categorical attribute OS is transformed to a set of binary attributes by using 1-of- $k$ .

The 1-of- $k$ coding scheme has several disadvantages. First, semantics embedded in categorical values is lost after transformation. For instance, Windows7 is intuitively more similar to Windows8 than to Android4.2 or iOS8.1.2. However, in the transformed table (the right in Fig. 1), any two of the four rows yield the same difference if only the four attributes from attribute Windows7 to iOS8.1.2 are considered. As a result, topology of the data is altered not only in the original space but also in the space of reduced dimensionality. Second, dimensionality of the dataset increases dramatically. In Fig. 1, the dimension increases from three to six after the transformation. There is only one categorical attribute, which has a small domain size in this example. It is likely that there are several categorical attributes each of which has a large size of attribute domain in the dataset. Consequently, computational complexity increases. Third, storage is also increased due to increased dimensionality.
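The loss of semantics can be illustrated with a minimal Python sketch. It uses only the four OS values named in the text; the helper names `one_hot` and `euclidean` are hypothetical and not part of the original study. After 1-of-$k$ coding, every pair of distinct values ends up at the same Euclidean distance, so Windows7 is no closer to Windows8 than to iOS8.1.2.

```python
import math

def one_hot(value, domain):
    """Return the 1-of-k code of `value` over the ordered list `domain`."""
    return [1 if value == v else 0 for v in domain]

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# The four OS values named in the text
domain = ["Windows7", "Windows8", "Android4.2", "iOS8.1.2"]
codes = {v: one_hot(v, domain) for v in domain}

# Distance between every pair of distinct values after 1-of-k coding
pair_distances = {
    (a, b): euclidean(codes[a], codes[b])
    for i, a in enumerate(domain)
    for b in domain[i + 1:]
}

for pair, dist in pair_distances.items():
    print(pair, round(dist, 4))
```

All six pairs print the same value, which is the equidistance problem described above.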

In addition to 1-of-$k$ coding, another popular method for measuring dissimilarity between mixed-type data points is to use the simple matching coefficient (SMC) for the categorical attributes. SMC yields a dissimilarity of one for two distinct categorical values and zero otherwise. The dissimilarity between two mixed-type points $x_{i}$ and $x_{j}$ is defined as follows.

$\displaystyle D(x_{i},x_{j})=\left(\sum\limits_{p\in F_{c}}\left(\delta\left(x_{i,p}\neq x_{j,p}\right)\right)^{2}+\sum\limits_{q\in F_{n}}\left(\frac{x_{i,q}-x_{j,q}}{\max\left\{x_{\ast,q}\right\}}\right)^{2}\right)^{1/2}$ (1)

where $\delta(\cdot)$ is an indicator function which returns one if its parameter is true and otherwise zero. $F_{c}$ and $F_{n}$ denote the set of categorical and the set of numeric attributes, respectively.

In the example in Fig. 1, the dissimilarity between t1 and t3 is $\sqrt{2}$ with the 1-of-$k$ coding scheme, whereas it is 1 with SMC. As this example shows, two distinct categorical values contribute two to the sum of squared differences under the 1-of-$k$ scheme but only one under SMC. Consequently, the dissimilarity between two mixed-type points with 1-of-$k$ is usually larger than that with SMC.
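As a rough illustration of Eq. (1), the following Python sketch computes the SMC-based dissimilarity between mixed-type records. The records, the attribute layout, and the function name `smc_dissimilarity` are hypothetical and are not taken from Fig. 1; the numeric maxima are assumed to be positive.

```python
import math

def smc_dissimilarity(xi, xj, cat_idx, num_idx, num_max):
    """Eq. (1): SMC on categorical attributes, max-normalised
    differences on numeric attributes."""
    # delta(.) is 0 or 1, so its square equals itself
    cat_part = sum(1.0 for p in cat_idx if xi[p] != xj[p])
    num_part = sum(((xi[q] - xj[q]) / num_max[q]) ** 2 for q in num_idx)
    return math.sqrt(cat_part + num_part)

# Hypothetical records: (OS, hours, amount)
t_a = ("Windows7", 40.0, 2000.0)
t_b = ("Android4.2", 40.0, 2000.0)   # differs from t_a only in OS
t_c = ("Windows7", 20.0, 2000.0)     # differs from t_a only in hours
num_max = {1: 40.0, 2: 2000.0}       # per-attribute maxima (assumed)

d_ab = smc_dissimilarity(t_a, t_b, [0], [1, 2], num_max)
d_ac = smc_dissimilarity(t_a, t_c, [0], [1, 2], num_max)
print(d_ab, d_ac)
```

Under SMC, a mismatch in one categorical attribute contributes exactly one to the sum under the square root, regardless of which two values are involved.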

3. Visualized analysis of mixed-type data

3.1 A procedure for mixed-type data analysis

For visualized analysis of mixed-type data, we propose the procedure shown in Fig. 2, which consists of three major parts: dissimilarity learning for mixed-type data points, dimensionality reduction, and analysis of cluster characteristics.

Figure 2.

A procedure for visualization and analysis of mixed-type data.

In the first step, dissimilarities between categorical values are learned from the data. The result is then used to measure dissimilarities between mixed-type data points, from which the pairwise dissimilarity matrix of the dataset is constructed. In the second step, a dimensionality reduction method which takes the dissimilarity matrix as input is applied and yields a low-dimensional projection map. In the third step, the user performs data clustering on the map and analyzes the characteristics of individual clusters according to quantitative metrics.

In the literature, many metrics have been proposed to evaluate clustering results [49, 50, 51], to name a few, Sum of Squared Error (SSE), the Davies-Bouldin index (DBI), and the Silhouette Coefficient. Users can adopt proper measures to check the quality of the clusters.

3.2 Dissimilarity learning and dissimilarity calculation

For mixed-type data, the distance or dissimilarity between two data points cannot be measured directly without special treatment of categorical values. As mentioned earlier, 1-of-$k$ coding has several disadvantages, most importantly the loss of the semantics embedded in the categorical values, which alters the topology of the data.

Inspired by the idea in [52], we propose to learn the dissimilarity between values of a categorical attribute from the dataset. In particular, the dissimilarity is measured according to the degree of co-occurrence between the feature values and the labels of the class attribute. The method is referred to hereafter as COFC, i.e., Co-Occurrence between Feature values and the Class labels. Formally, the dissimilarity between two categorical values of an attribute is defined as follows.

$\displaystyle d_{C}\left(a,b\right)=\frac{1}{\left|C\right|}\sum\limits_{c\in C}\left|\frac{f\left(a\cup c\right)}{f\left(a\right)}-\frac{f\left(b\cup c\right)}{f\left(b\right)}\right|$ (2)

where $a$ and $b$ are two distinct values in the domain of the categorical attribute, $c$ is a label of the class attribute $C$, $f(a)$ is the number of occurrences of $a$ in the categorical attribute, $f(a\cup c)$ is the number of co-occurrences of $a$ and $c$ in the dataset, and $|C|$ denotes the number of distinct class labels. The dissimilarity thus depends on the relationship with the class labels. If the extent to which $b$ co-occurs with the class labels is similar to that of $a$, then $a$ and $b$ are deemed similar, i.e., they have a small dissimilarity.
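The computation in Eq. (2) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `cofc_dissimilarity` and the toy column of Dept-like values with class labels are hypothetical, chosen so that BM and IM co-occur with the classes in the same proportions while CE does not.

```python
from collections import Counter

def cofc_dissimilarity(values, labels):
    """Pairwise dissimilarity between the values of one categorical
    attribute, following Eq. (2)."""
    f_value = Counter(values)               # f(a)
    f_joint = Counter(zip(values, labels))  # f(a U c)
    classes = sorted(set(labels))
    domain = sorted(f_value)
    d = {}
    for a in domain:
        for b in domain:
            total = sum(
                abs(f_joint[(a, c)] / f_value[a] - f_joint[(b, c)] / f_value[b])
                for c in classes
            )
            d[(a, b)] = total / len(classes)
    return d

# Hypothetical toy column: BM and IM mostly co-occur with class M, CE with E
values = ["BM"] * 4 + ["IM"] * 4 + ["CE"] * 4
labels = ["M", "M", "M", "E"] * 2 + ["E", "E", "E", "M"]

d = cofc_dissimilarity(values, labels)
print(d[("BM", "IM")], d[("BM", "CE")])
```

In this toy column BM and IM have identical class profiles, so their dissimilarity is zero, whereas BM and CE have opposite profiles and are pushed apart.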

The dissimilarity between every pair of distinct values in each categorical attribute is measured by Eq. (2). Then, the dissimilarity between two mixed-type data points is obtained by aggregating the attribute-wise dissimilarities between the two points over both categorical and numeric attributes. In particular, the dissimilarity between two mixed-type points $x_{i}$ and $x_{j}$ is defined by Eq. (3),

$\displaystyle D\left(x_{i},x_{j}\right)=\left(\sum\limits_{p\in F_{c}}\left(d\left(x_{i,p},x_{j,p}\right)\right)^{2}+\sum\limits_{q\in F_{n}}\left(\frac{x_{i,q}-x_{j,q}}{\max\left\{x_{\ast,q}\right\}}\right)^{2}\right)^{1/2}$ (3)

where $d\left(x_{i,p},x_{j,p}\right)$ returns the dissimilarity, computed by Eq. (2), between the values $x_{i,p}$ and $x_{j,p}$ of categorical attribute $p$. $F_{c}$ and $F_{n}$ denote the set of categorical and the set of numeric attributes, respectively. The difference between two numeric values is normalized by the maximum of the attribute to avoid scale problems between different attributes.

According to Eq. (3), we can construct the proximity matrix for the mixed-type dataset. Specifically, assume $X=\left[x_{1},x_{2},\ldots,x_{m}\right]$ is a mixed-type dataset. The proximity matrix $M$ for $X$ consists of the pairwise dissimilarities of the data in $X$, i.e., $M_{ij}=D(x_{i},x_{j})$, where $1\leqslant i,j\leqslant m$.
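A minimal Python sketch of Eq. (3) and of the proximity matrix $M$ is given below. It assumes the categorical dissimilarities are already available as a lookup table; the table values, the records, and all function names are hypothetical placeholders rather than results from the paper, and each numeric attribute is assumed to have a positive maximum.

```python
import math

def cat_lookup(table, a, b):
    """Dissimilarity of two categorical values; identical values give 0."""
    return 0.0 if a == b else table[frozenset((a, b))]

def mixed_dissimilarity(xi, xj, cat_tables, num_idx, num_max):
    """Eq. (3) for two records, given one dissimilarity table per
    categorical attribute index."""
    cat_part = sum(cat_lookup(t, xi[p], xj[p]) ** 2
                   for p, t in cat_tables.items())
    num_part = sum(((xi[q] - xj[q]) / num_max[q]) ** 2 for q in num_idx)
    return math.sqrt(cat_part + num_part)

def proximity_matrix(X, cat_tables, num_idx):
    """M[i][j] = D(x_i, x_j) for all pairs of records in X."""
    num_max = {q: max(x[q] for x in X) for q in num_idx}
    m = len(X)
    return [[mixed_dissimilarity(X[i], X[j], cat_tables, num_idx, num_max)
             for j in range(m)] for i in range(m)]

# Hypothetical learned dissimilarities for one categorical attribute
dept_table = {
    frozenset(("BM", "IM")): 0.1,
    frozenset(("BM", "CE")): 0.5,
    frozenset(("IM", "CE")): 0.5,
}
X = [("BM", 70.0), ("IM", 70.0), ("CE", 35.0)]  # (Dept, Amt1), made up
M = proximity_matrix(X, {0: dept_table}, [1])
for row in M:
    print([round(v, 4) for v in row])
```

The resulting matrix is symmetric with a zero diagonal, which is the form expected by dimensionality reduction methods that take pairwise dissimilarities as input.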

3.3 Dimensionality reduction

To reduce the dimension of data, many dimensionality reduction techniques have been proposed. In the previous section, we presented an approach to constructing the dissimilarity matrix of a mixed-type dataset. Therefore, any reduction technique that takes such a matrix as input can be applied. In this study, we chose Sammon Mapping [35] since the algorithm is well known and performs well on most datasets.

Sammon Mapping is a nonlinear method which attempts to preserve the dissimilarity between data when mapping the data points from a high-dimensional space to a low-dimensional space. Namely, the method builds an approximate configuration in the $d$ -dimensional space to represent the data configuration in the $D$ -dimensional space, where $d\ll D$ .

Formally, suppose that $X=\left[{x_{1},x_{2},\ldots,x_{m}}\right],x_{i}\in\Re^{D},1\leqslant i\leqslant m$ is a numeric dataset in high dimensional space and $Y=\left[{y_{1},y_{2},\ldots,y_{m}}\right],y_{i}\in\Re^{d},1\leqslant i\leqslant m$ is the low-dimensional data converted from $X$ , where $d\ll D$ . The objective function is defined as

$\displaystyle E=\frac{1}{\sum\nolimits_{i<j}d^{\ast}\left(x_{i},x_{j}\right)}\sum\limits_{i<j}^{m}\frac{\left(d^{\ast}\left(x_{i},x_{j}\right)-d\left(y_{i},y_{j}\right)\right)^{2}}{d^{\ast}\left(x_{i},x_{j}\right)},$

where $d^{\ast}\left(x_{i},x_{j}\right)$ represents the distance between $x_{i}$ and $x_{j}$, and $d\left(y_{i},y_{j}\right)$ represents the distance between $y_{i}$ and $y_{j}$. The objective function $E$ is minimized by adjusting the points $y_{i}$, i.e., by searching for the optimal $d$-dimensional configuration.

For a mixed-type dataset $X$, an attribute of $x_{i}$ can be categorical or numeric, i.e., $x_{i,k}\in R$ or $x_{i,k}\in C$, where $C$ denotes the domain of the categorical attribute. The configuration of the low-dimensional space remains the same as that of the original Sammon Mapping algorithm. Therefore, the dimensionality reduction of mixed-type data is formally defined as follows.

Suppose that $X=\left[x_{1},x_{2},\ldots,x_{m}\right]$, with $x_{i,k}\in\Re$ or $x_{i,k}\in C$, $1\leqslant i\leqslant m$, $1\leqslant k\leqslant D$, is a mixed-type dataset in the high-dimensional space and $Y=\left[y_{1},y_{2},\ldots,y_{m}\right]$, $y_{i}\in\Re^{d}$, is the low-dimensional data converted from $X$. To adapt the method to the dissimilarity matrix, we modify the objective function slightly as follows.

$\displaystyle E=\frac{1}{\sum\nolimits_{i<j}D\left(x_{i},x_{j}\right)}\sum\limits_{i<j}^{m}\frac{\left(D\left(x_{i},x_{j}\right)-d\left(y_{i},y_{j}\right)\right)^{2}}{D\left(x_{i},x_{j}\right)},$

where $D\left({x_{i},x_{j}}\right)$ is the dissimilarity between mixed-type data points $x_{i}$ and $x_{j}$ , which can be computed by Eq. (3) and $d\left({y_{i},y_{j}}\right)$ denotes the distance between $y_{i}$ and $y_{j}$ .
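To make the modified objective concrete, here is a small Python sketch that evaluates $E$ for a given dissimilarity matrix and minimizes it. Note the assumptions: the sketch uses plain gradient descent with a fixed step size rather than the update rule of the original Sammon Mapping algorithm, it requires all off-diagonal dissimilarities to be strictly positive, and the names `sammon_stress` and `sammon_map` as well as the parameters `iters`, `lr`, and `seed` are hypothetical.

```python
import math
import random

def sammon_stress(D, Y):
    """Objective E for a dissimilarity matrix D and a configuration Y."""
    m = len(D)
    c = sum(D[i][j] for i in range(m) for j in range(i + 1, m))
    err = sum((D[i][j] - math.dist(Y[i], Y[j])) ** 2 / D[i][j]
              for i in range(m) for j in range(i + 1, m))
    return err / c

def sammon_map(D, dim=2, iters=500, lr=0.1, seed=0):
    """Minimise E by plain gradient descent (an assumption of this
    sketch); D[i][j] must be > 0 for i != j."""
    rng = random.Random(seed)
    m = len(D)
    c = sum(D[i][j] for i in range(m) for j in range(i + 1, m))
    Y = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(m)]
    for _ in range(iters):
        grads = [[0.0] * dim for _ in range(m)]
        for i in range(m):
            for j in range(m):
                if i == j:
                    continue
                d = max(math.dist(Y[i], Y[j]), 1e-12)  # guard against d = 0
                # dE/dy_i = -(2/c) * sum_j (D_ij - d_ij)/(D_ij d_ij) * (y_i - y_j)
                coef = -2.0 / c * (D[i][j] - d) / (D[i][j] * d)
                for a in range(dim):
                    grads[i][a] += coef * (Y[i][a] - Y[j][a])
        for i in range(m):
            for a in range(dim):
                Y[i][a] -= lr * grads[i][a]
    return Y

# Toy input: pairwise distances between the corners of a unit square
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
D = [[math.dist(a, b) for b in square] for a in square]

Y_init = sammon_map(D, iters=0)    # random starting configuration
Y_fit = sammon_map(D, iters=500)   # configuration after descent
print(sammon_stress(D, Y_init), sammon_stress(D, Y_fit))
```

Because only the matrix `D` is required, the same routine could be fed a matrix built from mixed-type dissimilarities; the step size would likely need tuning for larger datasets.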

3.4 Evaluating the projection maps

To evaluate the quality of dimensionality reduction and compare the results among different techniques, quantitative measures are used. In this study, we use classification accuracy, trustworthiness, and continuity.

3.4.1 Accuracy by k-NN

Suppose that data points with the same class label tend to gather in clusters in the data space. After dimensionality reduction, data points with the same class label should also gather in clusters in the map space if the dimensionality reduction algorithm has a good performance. In the map space, classification accuracy represents the extent of the relationship preserved after dimensionality reduction between data points, indicating the performance of dimensionality reduction.

The $k$-Nearest Neighbors algorithm ($k$-NN) is commonly used to classify data points. For an instance with an unknown class label, the $k$-NN algorithm assigns the majority class label among the $k$ closest neighbors to the instance. For a high-dimensional dataset, dimensionality reduction is often used to reduce the computational cost of the $k$-NN classifier [53].

We use $K$-fold cross-validation to compute the accuracy with $k$-NN. Specifically, the dataset is partitioned into $K$ samples (folds). One sample is used as test data, and the other $K-1$ samples are used as training data. The cross-validation process is repeated $K$ times, with each of the $K$ samples used as the test data once. The accuracy is the average of the $K$ results. In our experiments, we set $K$ to 10.
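The evaluation can be sketched in Python as follows. The sketch assumes that folds are formed by interleaving record indices (the text does not state how the partition is drawn) and that neighbors are found by Euclidean distance in the map space; the function names and the toy points are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_pts, train_lbls, query, k):
    """Majority label among the k training points closest to `query`."""
    order = sorted(range(len(train_pts)),
                   key=lambda t: math.dist(train_pts[t], query))
    votes = Counter(train_lbls[t] for t in order[:k])
    return votes.most_common(1)[0][0]

def knn_cv_accuracy(points, labels, K=10, k=3):
    """Average k-NN accuracy over K folds (fold of record i is i mod K)."""
    m = len(points)
    fold_acc = []
    for f in range(K):
        test_idx = [i for i in range(m) if i % K == f]
        train_idx = [i for i in range(m) if i % K != f]
        if not test_idx or not train_idx:
            continue
        train_pts = [points[i] for i in train_idx]
        train_lbls = [labels[i] for i in train_idx]
        hits = sum(knn_predict(train_pts, train_lbls, points[i], k) == labels[i]
                   for i in test_idx)
        fold_acc.append(hits / len(test_idx))
    return sum(fold_acc) / len(fold_acc)

# Toy projected points: two well-separated groups with different labels
points = ([(0.1 * i, 0.0) for i in range(10)]
          + [(5.0 + 0.1 * i, 5.0) for i in range(10)])
labels = ["A"] * 10 + ["B"] * 10
acc = knn_cv_accuracy(points, labels, K=10, k=3)
print(acc)
```

When same-class points stay together in the map, as in this toy example, every test point is classified correctly; overlapping classes would lower the score.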

3.4.2 Trustworthiness and continuity

Trustworthiness $\left(T\right)$ and continuity $\left(C\right)$ are commonly used measures in dimensionality reduction [2, 17]. The metrics involve $k$ -nearest neighbors of each data point to measure the relationship between the two configurations of the data in the high-dimensional and the low-dimensional space, respectively.

Specifically, if the $k$ nearest neighbors of a data point $y_{i}$ in the low-dimensional space are also close in the high-dimensional space, i.e., the original space, we consider that the projection is trustworthy. The concept is quantified by the following equation.

$\displaystyle T\left(k\right)=1-\frac{2}{mk\left(2m-3k-1\right)}\sum\limits_{i=1}^{m}\sum\limits_{j\in U_{i}^{\left(k\right)}}\left(r\left(i,j\right)-k\right),$

where $U_{i}^{\left(k\right)}$ denotes the data points which are in the $k$ nearest neighbors of $i$ in the low-dimensional space but not in the high-dimensional space; $r\left({i,j}\right)$ returns the rank of the neighbor $j$ of data point $i$ according to the distance between $i$ and $j$ in the high-dimensional space.

On the other hand, we can also measure the continuity of the projection with respect to the dataset. Continuity measures how well the neighbors of a data point in the high-dimensional space are preserved in the low-dimensional space after projection. Formally, continuity is measured by

$\displaystyle C\left(k\right)=1-\frac{2}{mk\left(2m-3k-1\right)}\sum\limits_{i=1}^{m}\sum\limits_{j\in V_{i}^{\left(k\right)}}\left(\hat{r}\left(i,j\right)-k\right),$

where $V_{i}^{\left(k\right)}$ denotes the data points which are in the $k$ nearest neighbors of $i$ in the high-dimensional space but not in the low-dimensional space; $\hat{r}(i,j)$ returns the rank of the neighbor $j$ of $i$ according to the distance between $i$ and $j$ in the low-dimensional space.

The values of the two indices lie between zero and one. Larger values indicate better projection quality.
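Both indices can be computed from two pairwise distance matrices, one for the original space and one for the map. The Python sketch below follows the two formulas directly; the helper names and the one-dimensional toy data are hypothetical, ties in distance are broken by index order, and the normalization term is assumed to be used with $k<m/2$.

```python
import math

def pairwise(points):
    """Euclidean distance matrix of a list of coordinate tuples."""
    return [[math.dist(a, b) for b in points] for a in points]

def rank_matrix(dist):
    """ranks[i][j] = rank of j among the neighbours of i (1 = nearest)."""
    m = len(dist)
    ranks = [[0] * m for _ in range(m)]
    for i in range(m):
        order = sorted((j for j in range(m) if j != i),
                       key=lambda j: dist[i][j])
        for r, j in enumerate(order, start=1):
            ranks[i][j] = r
    return ranks

def trust_cont(D_high, D_low, k):
    """Return (T(k), C(k)) for high- and low-dimensional distances."""
    m = len(D_high)
    r_high, r_low = rank_matrix(D_high), rank_matrix(D_low)
    norm = 2.0 / (m * k * (2 * m - 3 * k - 1))
    t_pen = c_pen = 0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            in_low, in_high = r_low[i][j] <= k, r_high[i][j] <= k
            if in_low and not in_high:    # j belongs to U_i^(k)
                t_pen += r_high[i][j] - k
            if in_high and not in_low:    # j belongs to V_i^(k)
                c_pen += r_low[i][j] - k
    return 1 - norm * t_pen, 1 - norm * c_pen

# Toy data: six points on a line, and a "projection" that moves two of them
high = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,), (15.0,)]
low = [(0.0,), (15.0,), (3.0,), (6.0,), (10.0,), (1.0,)]
D_high = pairwise(high)

T_same, C_same = trust_cont(D_high, D_high, 1)   # perfect projection
T, C = trust_cont(D_high, pairwise(low), 1)      # distorted projection
print(T_same, C_same, T, C)
```

A projection that preserves all neighborhoods scores one on both indices, while the distorted configuration is penalized on both.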

3.5 Data visualization and analysis

After evaluating the projection maps, we choose the one which has the highest $T$ and $C$ values since the values indicate projection quality, namely, the extent of topology preservation. Once the projection map is chosen, analysis of the projected data is performed as described in the following sections.

3.5.1 Data correspondence

After dimensionality reduction, the map contains projected data of reduced dimensionality, say, two if the target space is a plane. To explore the data, the user can interactively cluster the data according to their distribution on the map, as shown in Fig. 3.

Figure 3.

The user can interactively cluster the data on the map for further exploration.

The first step of data analysis is to establish the correspondence between data instances in the original and the reduced space, so that the original attributes of a reduced data point in the projected space can be retrieved. One way to do this is to retrieve the identifier of each point in the low-dimensional space; the same identifier is associated with the point in the high-dimensional space. As a result, when we choose a cluster to analyze from the projection map, the corresponding data points in the original dataset can be identified. Finally, we retrieve the categorical and the numeric portions of the selected data from the dataset separately and proceed with the analysis.

3.5.2 Analyze the selected cluster

For the portion of categorical attributes, we compute the distribution of the values in each of the attributes and also identify the mode of the distribution and its percentage. For the portion of numeric attributes, we compute the average for each of the attributes. Thus, from the results we gain insights into the characteristics of individual clusters and learn which attributes have significantly different values with respect to different class labels. The analysis is repeated for the other clusters until all of the clusters in the projection map have been analyzed.
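The per-cluster statistics described above can be sketched in Python as follows. For illustration only, the sketch treats records 4, 6, 7, and 10 of Table 1 as if they had been selected as one cluster, which is an assumption and not a result from the paper; the function name `summarize_cluster` and the layout of the summary are likewise hypothetical.

```python
from collections import Counter
from statistics import mean

def summarize_cluster(records, cat_idx, num_idx):
    """Value distribution, mode and mode percentage for categorical
    attributes; average for numeric attributes."""
    n = len(records)
    summary = {"categorical": {}, "numeric": {}}
    for p in cat_idx:
        counts = Counter(r[p] for r in records)
        mode, freq = counts.most_common(1)[0]
        summary["categorical"][p] = {
            "distribution": {v: c / n for v, c in counts.items()},
            "mode": mode,
            "mode_pct": 100.0 * freq / n,
        }
    for q in num_idx:
        summary["numeric"][q] = mean(r[q] for r in records)
    return summary

# (Education, Relationship, Work-Hours) of records 4, 6, 7, 10 in Table 1,
# treated here as one selected cluster purely for illustration
cluster = [
    ("Assoc-voc", "Own-child", 30),
    ("HS-grad", "Own-child", 38),
    ("Assoc-voc", "Own-child", 35),
    ("Assoc-voc", "Own-child", 40),
]
summary = summarize_cluster(cluster, cat_idx=[0, 1], num_idx=[2])
print(summary["categorical"][0]["mode"],
      summary["categorical"][0]["mode_pct"],
      summary["numeric"][2])
```

Repeating the call for every selected cluster gives the per-cluster profiles that are compared in the analysis.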

The distributions of categorical values can be plotted as bar charts so that we can compare them visually, which facilitates pattern identification. In summary, by analyzing individual clusters, we understand how the values are distributed and how they might be related to the class labels.

3.5.3 Quantitative comparison between clusters

In addition to visual inspection, to quantitatively compare the distribution of categorical values between clusters, we propose a measure which is adapted from the Kullback-Leibler divergence (KLD) [54].

KLD measures the difference between two probability distributions $P$ and $Q$ and is defined as

$\displaystyle D_{KL}\left(P||Q\right)=\sum\limits_{i}P\left(i\right)\log\frac{P\left(i\right)}{Q\left(i\right)}.$

To avoid a zero denominator, a smoothed KLD is used. We define $P\left(i\right)$ and $Q\left(i\right)$ as

$\displaystyle \frac{\textit{freq}\left(i\right)+1}{\sum\nolimits_{j}\left(\textit{freq}\left(j\right)+1\right)}$

where freq($i$) is the number of times categorical value $i$ appears in the cluster. Moreover, KLD is not symmetric, i.e., $D_{KL}(P||Q)\neq D_{KL}(Q||P)$. We therefore use a symmetrized divergence, which is defined as

$\displaystyle\textit{SD}_{KL}\left(P,Q\right)=0.5\ast\left(D_{KL}\left(P||Q\right)+D_{KL}\left(Q||P\right)\right).$

Algorithm 1: Visualized analysis of mixed-type data via dimensionality reduction

Input: mixed-type dataset $D$, the number of cross-validation folds $K$, the number of nearest neighbors $k$
Output: projected data, accuracy by $k$-NN, $T$ and $C$ values, the distributions of categorical attributes, the average and standard deviation of numeric attributes, the distribution of class attribute, and the KLD matrices

01. $x \leftarrow$ dataset
02. $n \leftarrow$ the number of the distance learning algorithms
03. for $i \leftarrow$ 1 to $n$
04.     dm $\leftarrow$ DistanceLearning[$i$]($x$) // one of the methods 1-of-$k$, SMC, COFC
05.     $y$[$i$] $\leftarrow$ SammonMapping(dm) // Section 3.3
06.     accuracy $\leftarrow$ AccuracyByKNN(dm, $y$[$i$], class, $K$, $k$) // Refer to Section 3.4
07.     $T$[$i$] $\leftarrow$ Trustworthiness(dm, $y$[$i$], $k$) // Section 3.4
08.     $C$[$i$] $\leftarrow$ Continuity(dm, $y$[$i$], $k$) // Section 3.4
09. end for
10. $y^{*} \leftarrow$ SelectTheBestProjection($y$, $T$, $C$)
11. $y^{*} \leftarrow$ GiveOrdinalNumber($y^{*}$) // Use the CAT function in MATLAB
12. map $\leftarrow$ Scatter($y^{*}$) // Project the data points on a map
13. $m \leftarrow$ the number of the clusters on the projection map
14. for $j \leftarrow$ 1 to $m$
15.     $c \leftarrow$ Brushing(map, $j$) // Select a cluster $c$ interactively
16.     dc $\leftarrow$ RetrieveRawCategoricalData($c$)
17.     dn $\leftarrow$ RetrieveRawNumericalData($c$)
18.     $p \leftarrow$ the number of the categorical attributes
19.     $q \leftarrow$ the number of the numeric attributes
20.     for $r \leftarrow$ 1 to $p-1$ do // Categorical attributes
21.         $c$[$r$] $\leftarrow$ PlotBarCharts(dc[$r$]) // Use the BAR function in MATLAB
22.     end for
23.     for $s \leftarrow$ 1 to $q$ do // Numeric attributes
24.         ComputeMeanStdDeviation(dn[$s$])
25.     end for
26.     ComputeClassDistribution(dc[$p$])
27.     SymmetrizedKLD($c$) // Section 3.5
28. end for

In this study, $P$ and $Q$ represent the probability distributions of the values of a categorical attribute in two distinct clusters.

The result indicates the degree of the difference between two clusters on a categorical attribute. A large value indicates the difference between the two clusters is apparent.
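A compact Python sketch of the smoothed, symmetrized divergence is shown below. Two assumptions are made that the text does not specify: the natural logarithm is used, and the value domain is taken as the union of the values observed in the two clusters being compared. The function names and the toy clusters are hypothetical.

```python
import math
from collections import Counter

def smoothed_dist(values, domain):
    """Add-one smoothed distribution of `values` over `domain`."""
    counts = Counter(values)
    total = sum(counts[v] + 1 for v in domain)
    return {v: (counts[v] + 1) / total for v in domain}

def kld(P, Q):
    """Kullback-Leibler divergence D_KL(P || Q), natural logarithm."""
    return sum(P[i] * math.log(P[i] / Q[i]) for i in P)

def symmetrized_kld(values_a, values_b):
    """SD_KL between the value distributions of one categorical
    attribute in two clusters."""
    domain = sorted(set(values_a) | set(values_b))
    P = smoothed_dist(values_a, domain)
    Q = smoothed_dist(values_b, domain)
    return 0.5 * (kld(P, Q) + kld(Q, P))

# Hypothetical values of one categorical attribute in two clusters
cluster_1 = ["BM"] * 9 + ["IM"]
cluster_2 = ["BM"] + ["IM"] * 9

sd_diff = symmetrized_kld(cluster_1, cluster_2)
sd_same = symmetrized_kld(cluster_1, cluster_1)
print(sd_diff, sd_same)
```

Identical distributions give a divergence of zero, and the value grows as the two clusters favor different categorical values.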

3.6 Algorithm

We summarize the process in Algorithm 1, where $K$ is the number of folds for cross-validation and $k$ is the number of nearest neighbors for $k$-NN. Some steps, e.g., lines 11, 12, 15, and 21, rely on MATLAB functions; for lines 16, 17, 21, 24, and 26, we wrote our own MATLAB scripts or functions to process the data.

In line 10, according to the $T$ and $C$ values computed in lines 07 and 08, the analyst can choose the projection which best reflects the topological order of the data for analysis. In line 21, the system computes the distribution of the categorical values in individual clusters and plots the results in bar charts. In line 24, the function computes the average and the standard deviation of numeric attributes. In line 26, it computes the class distribution in a cluster. In line 27, the function uses the distributions computed in line 21 to calculate the symmetrized KLD values and then outputs the KLD matrices.

4. Experiments

To verify feasibility of the proposed procedure, we implement a prototype system and conduct experiments with two synthetic and two real-world datasets. The prototype was developed with MATLAB R2013a on Windows 7.

Table 2 shows the statistics of the four experimental datasets used in this study, two synthetic, Syn1 and Syn2, and two real datasets, Credit Approval and Adult, taken from the UCI machine learning repository [55].

Table 2
Statistics of the experimental datasets

Dataset	Type	Data points	Categorical attributes	Numeric attributes	# of classes	% of major class
Syn1	Synthetic	1,600	2	2	4	28
Syn2	Synthetic	360	2	1	3	39
Credit Approval	Real	690	3	11	2	56
Adult	Real	1,600	3	4	2	75

There are no missing values in the datasets used in this study. However, real-world datasets may contain missing or noisy values. Noise degrades the performance of the process and should be corrected or removed. Missing values are not allowed in the proposed procedure and should be imputed or removed. Several data preprocessing techniques exist for handling noise and missing values; readers can refer to data mining textbooks [56, 51].

Figure 4.

Projection maps of Drink8G2C2N by Sammon Mapping with 1-of- $k$ , SMC, and COFC.

4.1 Synthetic datasets

4.1.1 Dataset Syn1

Dataset Syn1, shown in Table 3, has 1,600 instances of five attributes: two categorical attributes, Drink and Dept, two numeric attributes, Amt1 and Amt2, and one class attribute. There are four major groups, each of which has 400 points. Most members of a group have the same class label. For example, 90% of the first group have class label M, and the remaining 10% are randomly assigned one of the other three class labels {E, D, H}. Categorical values are randomly assigned within each group. For instance, in attribute Dept, BM and IM are randomly assigned to the first group of 400 points with class label M. Similarly, the values of {Latte, Mocha, Cappu, Black} are randomly assigned to the group of 400 points with class M and the group of 400 with class E. For numeric attributes Amt1 and Amt2, the values in each group are randomly generated according to a Gaussian distribution with a designated mean and standard deviation.

Figure 4 shows the projection results, colored according to the four class labels. The results with the 1-of-$k$ and SMC schemes look similar. The only subtle difference is that the projection with 1-of-$k$ is more spread out; note the scales in Fig. 4a and b. The reason, as mentioned in Section 2.2, is that two distinct values in one categorical attribute result in a difference of two with 1-of-$k$ but only one with SMC. Consequently, the dissimilarity measured with 1-of-$k$ is slightly larger than that with SMC, leading to a more spread-out map.
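The scale difference between the two schemes can be checked with a minimal sketch. The function names are illustrative and do not come from the paper, and counting differing binary positions is an assumption about how the "difference of two" is measured.

```python
def one_of_k(value, domain):
    """Encode a categorical value as a 1-of-k binary vector over `domain`."""
    return [1 if value == v else 0 for v in domain]

def one_of_k_mismatch(a, b, domain):
    """Number of binary positions that differ between the two encodings."""
    return sum(x != y for x, y in zip(one_of_k(a, domain), one_of_k(b, domain)))

def smc_mismatch(a, b):
    """Simple matching: 0 if the two values are equal, 1 otherwise."""
    return 0 if a == b else 1

# Values of attribute Dept in Syn1.
dept = ["BM", "IM", "CE", "EE", "SD", "VD", "AH", "BH"]
print(one_of_k_mismatch("BM", "IM", dept))  # 2
print(smc_mismatch("BM", "IM"))             # 1
```

If Euclidean distance were applied to the binary vectors instead, the 1-of-$k$ gap would be $\sqrt{2}$ rather than 2, which is still larger than the SMC gap of 1.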

In Fig. 4a and b, each of the four major colors contains eight clusters, one for each combination of the categorical values in Drink and Dept; for the group with class label M, for example, these are the combinations of {Latte, Mocha, Cappu, Black} and {BM, IM}.

By contrast, Fig. 4c, the result with COFC, clearly presents four clusters, which correspond to the four classes in Table 3. The data are projected into four apparent clusters in Fig. 4c because the dissimilarity between categorical values computed with COFC is learned from the dataset: the method considers how each value co-occurs with the class label. Consequently, some pairs of values have small dissimilarity and others have large dissimilarity, so that apparent clusters are present in the original data space and therefore also in the projection space.

The pairwise dissimilarity between categorical values of attribute Dept is shown in Table 4. The dissimilarity between BM and IM (0.0118) is much smaller than that between BM and the other values (at least 0.4044) since BM and IM mainly correspond to class label M. The same observation holds for the other categorical values in Table 4 and in Table 5. Recall that the dissimilarity between categorical values measured with the COFC scheme in Eq. (1) depends on the class labels.
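The idea that values co-occurring with the same class label end up close can be illustrated with a small sketch. This is not Eq. (1): the total variation distance between class profiles is an assumed stand-in, and the toy data and function names are hypothetical.

```python
from collections import Counter, defaultdict

def class_profiles(values, labels):
    """Estimate P(class | categorical value) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    classes = sorted(set(labels))
    return {v: [cnt[c] / sum(cnt.values()) for c in classes]
            for v, cnt in counts.items()}

def profile_dissimilarity(p, q):
    """Total variation distance between two class profiles (illustrative choice)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Toy data: BM and IM mostly co-occur with class M, CE mostly with class E.
dept = ["BM"] * 10 + ["IM"] * 10 + ["CE"] * 10
label = ["M"] * 9 + ["E"] + ["M"] * 8 + ["E"] * 2 + ["E"] * 9 + ["M"]

prof = class_profiles(dept, label)
d_bm_im = profile_dissimilarity(prof["BM"], prof["IM"])
d_bm_ce = profile_dissimilarity(prof["BM"], prof["CE"])
print(d_bm_im < d_bm_ce)  # True
```

As in Table 4, the pair that shares a dominant class label receives a much smaller dissimilarity than the pair that does not.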

Table 3
Synthetic dataset Syn1

Drink	Dept	Amt1 ( $\mu$ , $\sigma$ )	Amt2 ( $\mu$ , $\sigma$ )	Class	Count
{Latte, Mocha, Cappu, Black}	{BM, IM}	$N$ (70,10)	$N$ (20,5)	M (90%)	400
{Latte, Mocha, Cappu, Black}	{CE, EE}	$N$ (30,10)	$N$ (80,5)	E (85%)	400
{Coke, Pepsi, Sprint, 7Up}	{SD, VD}	$N$ (70,10)	$N$ (20,5)	D (80%)	400
{Coke, Pepsi, Sprint, 7Up}	{AH, BH}	$N$ (30,10)	$N$ (80,5)	H (75%)	400

Table 4

Distance matrix of attribute Dept

	BM	IM	CE	EE	SD	VD	AH	BH
BM	0	0.0118	0.4153	0.4266	0.4061	0.4202	0.4212	0.4044
IM	0.0118	0	0.4210	0.4247	0.4021	0.4162	0.4172	0.4004
CE	0.4153	0.4210	0	0.0157	0.3906	0.3969	0.3941	0.3789
EE	0.4266	0.4247	0.0157	0	0.3983	0.3995	0.3958	0.3866
SD	0.4061	0.4021	0.3906	0.3983	0	0.0153	0.3629	0.3399
VD	0.4202	0.4162	0.3969	0.3995	0.0153	0	0.3629	0.3540
AH	0.4212	0.4172	0.3941	0.3958	0.3629	0.3629	0	0.0339
BH	0.4044	0.4004	0.3789	0.3866	0.3399	0.3540	0.0339	0

Table 5

Distance matrix of attribute Drink

	Latte	Mocha	Cappu	Black	Coke	Pepsi	Sprint	7Up
Latte	0	0.0402	0.0371	0.0454	0.3734	0.3938	0.4262	0.4081
Mocha	0.0402	0	0.0274	0.0160	0.3724	0.3928	0.4252	0.4071
Cappu	0.0371	0.0274	0	0.0198	0.3449	0.3653	0.3978	0.3797
Black	0.0454	0.0160	0.0198	0	0.3564	0.3768	0.4092	0.3912
Coke	0.3734	0.3724	0.3449	0.3564	0	0.0325	0.0528	0.0347
Pepsi	0.3938	0.3928	0.3653	0.3768	0.0325	0	0.0324	0.0144
Sprint	0.4262	0.4252	0.3978	0.4092	0.0528	0.0324	0	0.0181
7Up	0.4081	0.4071	0.3797	0.3912	0.0347	0.0144	0.0181	0

Figure 5 shows the VAT diagrams of the original dataset, which reveal the cluster structures in the data. Dark squares indicate compact clusters: four large, clearly separated clusters can be identified in Fig. 5c and 32 small compact clusters in Fig. 5a. These cluster structures are reflected in the projection maps of Fig. 4. The gaps between some clusters in Fig. 4b are less apparent than those in Fig. 4a, which is also reflected in the VAT diagram of Fig. 5b.
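For readers unfamiliar with VAT, the reordering behind such a diagram can be sketched as follows. This is a minimal version of the standard ordering step, assuming a square dissimilarity matrix given as a list of lists; the helper names are illustrative.

```python
def vat_order(d):
    """Return the VAT ordering of the points for a square dissimilarity matrix d."""
    n = len(d)
    # Start from one endpoint of the largest dissimilarity in the matrix.
    start = max(range(n), key=lambda i: max(d[i]))
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        # Append the unselected point closest to any already selected point.
        nxt = min(remaining, key=lambda j: min(d[i][j] for i in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def reorder(d, order):
    """Permute rows and columns so that similar points become adjacent."""
    return [[d[i][j] for j in order] for i in order]

# Two well-separated groups on a line, listed in interleaved order.
pts = [0.0, 10.0, 0.5, 10.5, 1.0]
d = [[abs(a - b) for b in pts] for a in pts]
order = vat_order(d)
print(order)  # [0, 2, 4, 1, 3]
```

Displaying `reorder(d, order)` as a grayscale image, with small dissimilarities drawn dark, yields one dark square on the diagonal per compact cluster.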

Table 6 presents the trustworthiness and continuity of the projections. The values drop significantly for 1-of-$k$ and SMC as $k$ grows beyond 300, falling to between 0.82 and 0.85 at $k=600$. By contrast, those for COFC remain high (above 0.97) even when $k$ equals 600. The result indicates that COFC preserves the topological order of the data after dimensionality reduction better than 1-of-$k$ and SMC.
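The two quality measures can be computed from the distance matrices of the data space and the map space. The sketch below follows the usual rank-based definitions and assumes $k < n/2$; it is an illustration, not the code used for Table 6.

```python
def ranks(dist_row, i):
    """Rank of every other point by distance from point i (nearest has rank 1)."""
    order = sorted((j for j in range(len(dist_row)) if j != i),
                   key=lambda j: dist_row[j])
    return {j: r for r, j in enumerate(order, start=1)}

def trustworthiness(d_high, d_low, k):
    """Penalise points that enter a k-neighbourhood only in the projection.

    d_high, d_low: square distance matrices (lists of lists) in the data
    space and the map space. Assumes k < n / 2.
    """
    n = len(d_high)
    penalty = 0
    for i in range(n):
        r_high = ranks(d_high[i], i)
        r_low = ranks(d_low[i], i)
        for j, r in r_low.items():
            if r <= k and r_high[j] > k:
                penalty += r_high[j] - k
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

def continuity(d_high, d_low, k):
    """Same measure with the roles of the two spaces swapped."""
    return trustworthiness(d_low, d_high, k)

pts = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
d_high = [[abs(a - b) for b in pts] for a in pts]
scrambled = [0.0, 5.0, 1.0, 4.0, 2.0, 3.0]
d_low = [[abs(a - b) for b in scrambled] for a in scrambled]
print(trustworthiness(d_high, d_high, 2))        # 1.0
print(trustworthiness(d_high, d_low, 2) < 1.0)   # True
```

A perfect projection scores 1 on both measures; the scrambled layout is penalised because distant points become neighbours on the map.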

Table 6

Trustworthiness and continuity of the projections of dataset Syn1 by Sammon Mapping

	$k=$ 1		$k=$ 3		$k=$ 7		$k=$ 11
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.9996	0.9997	0.9994	0.9996	0.9992	0.9995	0.9992	0.9995
SMC	0.9986	0.9993	0.9983	0.9990	0.9980	0.9988	0.9980	0.9987
COFC	0.9925	0.9976	0.9927	0.9969	0.9930	0.9963	0.9934	0.9960
	$k=$ 25		$k=$ 50		$k=$ 100		$k=$ 200
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.9993	0.9996	0.9920	0.9942	0.9620	0.9643	0.9163	0.9729
SMC	0.9977	0.9990	0.9862	0.9938	0.9592	0.9676	0.9292	0.9783
COFC	0.9943	0.9960	0.9951	0.9962	0.9957	0.9965	0.9970	0.9974
	$k=$ 300		$k=$ 400		$k=$ 500		$k=$ 600
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.8751	0.9219	0.8706	0.8573	0.8471	0.8336	0.8276	0.8238
SMC	0.8915	0.9189	0.8693	0.8489	0.8539	0.8330	0.8445	0.8290
COFC	0.9988	0.9989	0.9994	0.9996	0.9776	0.9809	0.9798	0.9824

Figure 5.

The VAT diagrams of Syn1 dataset with 1-of- $k$ , SMC, and COFC for handling categorical attributes.

4.1.2 Dataset Syn2

Dataset Syn2, shown in Table 7, consists of two categorical attributes, Drink and Dept, one numeric attribute, Amount, and a class attribute, Class. Each group has 60 or 30 data points. For the categorical attribute Drink, Coke and Pepsi are uniformly distributed in Groups 1 to 3. For attribute Dept, MIS, MBA, and FM are likewise randomly assigned to Groups 1 to 3. The numeric values are drawn from Gaussian distributions, as in dataset Syn1, and the class labels are also assigned as in Syn1.

Table 7
Synthetic dataset Syn2

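Group	Drink	Dept	Amount ( $\mu$ , $\sigma$ )	Class	Count
1	{Coke, Pepsi}	{MIS, MBA, FM}	$N$ (500, 25)	A (90%)	60
2	{Coke, Pepsi}	{MIS, MBA, FM}	$N$ (400, 20)	A (90%)	30
3	{Coke, Pepsi}	{MIS, MBA, FM}	$N$ (300, 15)	A (90%)	30
4	{Latte, Mocha}	{EE, CE, ME}	$N$ (500, 25)	B (80%)	60
5	{Latte, Mocha}	{EE, CE, ME}	$N$ (400, 20)	B (80%)	30
6	{Latte, Mocha}	{EE, CE, ME}	$N$ (300, 15)	B (80%)	30
7	{Apple-Juice, Orange-Juice}	{SD, VD, AD}	$N$ (500, 25)	C (70%)	60
8	{Apple-Juice, Orange-Juice}	{SD, VD, AD}	$N$ (400, 20)	C (70%)	30
9	{Apple-Juice, Orange-Juice}	{SD, VD, AD}	$N$ (300, 15)	C (70%)	30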

Figure 6.

Projection maps of Drink9mix v4 by Sammon Mapping with 1-of- $k$ , SMC, and COFC.

Figure 6 shows the projection maps of Syn2 by Sammon Mapping with the 1-of-$k$, SMC, and COFC schemes for dealing with categorical values, respectively. Note that only the three attributes Dept, Drink, and Amount took part in the dimensionality reduction; Group and Class were excluded.
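As a reminder of what Sammon Mapping optimizes, the sketch below evaluates Sammon's stress for a given pair of distance matrices. It only scores an existing projection; the iterative minimization is omitted, and the function name is illustrative.

```python
def sammon_stress(d_high, d_low):
    """Sammon's stress between data-space and map-space distance matrices."""
    n = len(d_high)
    total = 0.0
    err = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if d_high[i][j] > 0:  # skip coincident points to avoid division by zero
                total += d_high[i][j]
                err += (d_high[i][j] - d_low[i][j]) ** 2 / d_high[i][j]
    return err / total

d_high = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
d_low = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(sammon_stress(d_high, d_low))  # 0.125
```

Because each squared error is divided by the original distance, errors on small distances weigh more, so larger categorical dissimilarities in the data space translate into a more spread-out map.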

We label data points with group numbers, which are displayed in nine colors for identification. In all three projection maps, data points that are similar in the data space are projected into the same region. Take the region in the dashed circle of Fig. 6a as an example. The projected data correspond to the data points of Groups 7 to 9 in Table 7. That region contains six subgroups, each of which corresponds to one of the combinations of {SD, VD, AD} and {Apple-Juice, Orange-Juice} along with the numeric values. Group 7 has 60 instances, while Groups 8 and 9 have 30 each. As can be seen in Fig. 6a, the number of blue points, corresponding to Group 7, is greater than the numbers of points in the other two colors, corresponding to Groups 8 and 9, in the region. The same can be observed in Fig. 6b and c.

Table 8

Pairwise dissimilarity between categorical values of attribute Dept with COFC

	MIS	MBA	FM	EE	CE	ME	SD	VC	AD
MIS	0	0.0722	0.0450	0.5770	0.5696	0.5548	0.5461	0.5187	0.5792
MBA	0.0722	0	0.0272	0.5079	0.5153	0.4825	0.4739	0.4464	0.5069
FM	0.0450	0.0272	0	0.5320	0.5358	0.5098	0.5011	0.4736	0.5341
EE	0.5770	0.5079	0.5320	0	0.0148	0.0667	0.4905	0.4444	0.4287
CE	0.5696	0.5153	0.5358	0.0148	0	0.0741	0.4979	0.4519	0.4435
ME	0.5548	0.4825	0.5098	0.0667	0.0741	0	0.4238	0.3778	0.3842
SD	0.5461	0.4739	0.5011	0.4905	0.4979	0.4238	0	0.0735	0.0640
VC	0.5187	0.4464	0.4736	0.4444	0.4519	0.3778	0.0735	0	0.0605
AD	0.5792	0.5069	0.5341	0.4287	0.4435	0.3842	0.0640	0.0605	0

Table 9

Pairwise dissimilarity between categorical values of attribute Drink

	Coke	Pepsi	Latte	Mocha	Apple	Orange
Coke	0	0.0112	0.5282	0.5267	0.4781	0.5350
Pepsi	0.0112	0	0.5170	0.5155	0.4670	0.5238
Latte	0.5282	0.5170	0	0.0090	0.4538	0.4283
Mocha	0.5267	0.5155	0.0090	0	0.4448	0.4207
Apple	0.4781	0.4670	0.4538	0.4448	0	0.0568
Orange	0.5350	0.5238	0.4283	0.4207	0.0568	0

Figure 7.

The VAT diagrams of Drink9mix v4 Dataset with 1-of- $k$ , SMC, and COFC for handling categorical attributes.

The map of 1-of-$k$ is similar to that of SMC since the two schemes treat the dissimilarity between categorical values in a similar way. However, the projection with 1-of-$k$ is more spread out than that with SMC; note the scales in Fig. 6a and b. The reason is again that a pair of different categorical values contributes more dissimilarity under 1-of-$k$ than under SMC.

Figure 6c, the result with COFC, clearly presents three major clusters, which correspond to the three classes in Table 7. In each of the three clusters, we can identify three subgroups, each associated with one color. Each subgroup corresponds to a group in Table 7. For instance, the bottom subgroup of the left-most cluster is Group 3 in Table 7. The data are projected into three apparent clusters in Fig. 6c because, again, the dissimilarity between categorical values computed with COFC is learned from the dataset. The dissimilarity between categorical values varies, as can be seen in Tables 8 and 9.

Figure 7 shows the VAT diagrams presenting the dissimilarity between mixed-type data points. The eighteen dark squares on the diagonal of Fig. 7a correspond to the eighteen clusters in Fig. 6a. Figure 7c indicates that the clusters are not as compact as those in Fig. 7a. In fact, Fig. 6c shows three strip-shaped clusters.

Table 10 shows the trustworthiness and continuity values of the dimensionality reduction. When $k$ is greater than 100, the values drop significantly for 1-of-$k$ and SMC, while COFC still achieves a fairly good outcome. The result indicates that COFC yields a better projection than 1-of-$k$ and SMC with regard to preserving the neighborhood relations of the data.

Table 10

Trustworthiness and continuity of the projections of dataset Syn2 by Sammon Mapping

	$k=$ 1		$k=$ 3		$k=$ 7		$k=$ 11
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.9997	0.9997	0.9994	0.9995	0.9988	0.9990	0.9979	0.9987
SMC	0.9998	0.9999	0.9995	0.9998	0.9968	0.9978	0.9941	0.9965
COFC	0.9965	0.9983	0.9964	0.9976	0.9966	0.9972	0.9977	0.9979
	$k=$ 25		$k=$ 50		$k=$ 100		$k=$ 200
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.9774	0.9715	0.9115	0.9490	0.8157	0.8706	0.5542	0.5592
SMC	0.9659	0.9715	0.9126	0.9535	0.8286	0.8675	0.5799	0.5752
COFC	0.9995	0.9995	0.9994	0.9998	0.9681	0.9782	0.8468	0.8877

Table 11

Trustworthiness and continuity of the projections of Adult1600 by Sammon Mapping

	$k=$ 25		$k=$ 50		$k=$ 100		$k=$ 200
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.9403	0.9424	0.9189	0.9190	0.9034	0.9043	0.8943	0.8913
SMC	0.9354	0.9415	0.9211	0.9223	0.9066	0.9108	0.8999	0.9015
COFC	0.9500	0.9727	0.9480	0.9666	0.9537	0.9686	0.9660	0.9766

Table 12

Classification accuracy of $k$-NN on dataset Adult in the data and the map space

Space	Scheme	$k=$ 1	$k=$ 3	$k=$ 7	$k=$ 11	$k=$ 25	$k=$ 51	$k=$ 101	$k=$ 201
Data space	1-of- $k$	0.8006	0.8231	0.8338	0.8300	0.8125	0.8156	0.8175	0.8088
Data space	SMC	0.7994	0.8225	0.8344	0.8306	0.8125	0.8156	0.8163	0.8131
Data space	COFC	0.8031	0.8288	0.8413	0.8469	0.8419	0.8300	0.8263	0.8288
Map space	1-of- $k$	0.7638	0.7844	0.7956	0.8006	0.7844	0.7950	0.7719	0.7488
Map space	SMC	0.7581	0.7856	0.7975	0.7944	0.7831	0.8050	0.7800	0.7556
Map space	COFC	0.8000	0.8144	0.8144	0.8238	0.8313	0.8238	0.8181	0.8131

Figure 8.

Projection maps of Adult 1600 by Sammon Mapping with (a) 1-of- $k$ , (b) SMC, and (c) COFC.

4.2 Real datasets

4.2.1 Dataset adult

The original Adult dataset has 48,842 data points with 15 attributes: eight categorical attributes, six numeric attributes, and one class attribute. About 76% of the data have the class label “ $<=$ 50 K” and the rest “ $>$ 50 K”. Following the study in [52], seven attributes, three categorical and four numeric, are used in the following experiment. In addition, owing to time complexity, we randomly drew 1,600 instances for the experiment.

Figure 9.

Distribution of the categorical attributes in clusters of Adult 1600 projected by Sammon Mapping with COFC. The values in X-axis of Education from the left to the right, respectively, are {Preschool, 1 ${}^{\rm st}$ –4 ${}^{\rm th}$ , 5 ${}^{\rm th}$ –6 ${}^{\rm th}$ , 7 ${}^{\rm th}$ –8 ${}^{\rm th}$ , 9 ${}^{\rm th}$ , 10 ${}^{\rm th}$ , 11 ${}^{\rm th}$ , 12 ${}^{\rm th}$ , HS-grad, Some-college, Assoc-voc, Assoc-acdm, Bachelors, Masters, Prof-School, Doctorate}.

Figure 8 shows the projection maps of Adult1600 by Sammon Mapping with the three schemes for categorical attributes. The results with 1-of-$k$ and SMC are quite similar, while COFC yields two large clusters.

Table 11 lists trustworthiness and continuity for the projection and indicates that COFC preserved the neighborhood relation better than 1-of- $k$ and SMC. The values of 1-of- $k$ and SMC are quite close.

Table 12 shows the classification accuracy of $k$-NN in the data space and the map space. For the map space, classification was performed after the data points were projected to the low-dimensional space. COFC is superior to 1-of-$k$ and SMC in both spaces. In the data space, COFC yielded the highest accuracy, 0.8469, at $k=$ 11. In the map space, COFC achieved the highest accuracy, 0.8313, at $k=$ 25.
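Since $k$-NN needs only pairwise distances, the same routine can be run on the data-space dissimilarity matrix and on the distances between projected points. The sketch below assumes leave-one-out evaluation, which is an assumption for illustration; the evaluation protocol is not stated here.

```python
from collections import Counter

def knn_loo_accuracy(dist, labels, k):
    """Leave-one-out k-NN accuracy from a precomputed distance matrix."""
    n = len(labels)
    correct = 0
    for i in range(n):
        # The k nearest points to i, excluding i itself.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist[i][j])[:k]
        votes = Counter(labels[j] for j in neighbours)
        predicted = votes.most_common(1)[0][0]
        correct += predicted == labels[i]
    return correct / n

# Hypothetical one-dimensional toy data with two well-separated classes.
pts = [0, 1, 2, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
dist = [[abs(p - q) for q in pts] for p in pts]
print(knn_loo_accuracy(dist, labels, 3))  # 1.0
```

Odd values of $k$, as used in Table 12 for the larger neighborhoods, avoid tied votes in a two-class problem.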

Table 13

Mode and its percentage of categorical values and average of numeric values in individual clusters by Sammon Mapping with COFC

Cluster	Education (%)	Marital_status (%)	Relationship (%)	Age	Hours_per_week	Capital_gain	Capital_loss	$>$ 50 K (%)	Count
C1	Prof-school (59)	Married-civ-spouse (100)	Husband (86)	48	46	9163	261	41 (89)	46
C2	Bachelors (67)	Married-civ-spouse (100)	Husband (81)	44	44	5556	309	124 (73)	171
C4	Prof-school (62)	Never-married (53)	Not-in-family (65)	42	42	5887	252	20 (62)	32
C3	HS-grad (35)	Married-civ-spouse (99)	Husband (82)	45	43	1795	247	139 (32)	431
C5	Bachelors (68)	Never-married (55)	Not-in-family (70)	40	42	2093	162	42 (23)	180
C6	HS-grad (37)	Never-married (43)	Not-in-family (44)	40	39	600	119	40 (5)	740
All	HS-grad (27)	Married-civ-spouse (41)	Husband (33)	42	41	1965	185	406 (25)	1600

Table 14

KLD values for attribute Education of clusters of Adult1600 by Sammon with COFC

	C1	C2	C4	C3	C5	C6
C1	0	3.269	0.013	2.752	3.284	2.905
C2		0	3.028	4.048	0.002	4.131
C4			0	2.452	3.042	2.595
C3				0	4.031	0.024
C5					0	4.121
C6						0

Table 15

KLD values for attribute Marital_status of clusters of Adult1600 by Sammon with COFC

	C1	C2	C4	C3	C5	C6
C1	0	0.052	2.726	0.099	3.223	2.780
C2		0	3.538	0.008	4.070	3.587
C4			0	4.018	0.039	0.066
C3				0	4.575	4.086
C5					0	0.035
C6						0

Table 16

KLD values for attribute Relationship of clusters of Adult1600 by Sammon with COFC

	C1	C2	C4	C3	C5	C6
C1	0	0.042	2.721	0.079	3.553	4.065
C2		0	3.412	0.006	4.321	4.887
C4			0	3.891	0.148	0.285
C3				0	4.830	5.411
C5					0	0.110
C6						0

We interactively clustered the data on the map in Fig. 8c into six groups (refer to Fig. 9) and analyzed the data in the individual clusters. The results are presented in the following tables and figures. Table 13 shows the mode and its percentage for each categorical attribute and the average of each numeric attribute in each cluster. The clusters are sorted by the percentage of class label ‘ $>$ 50 K’. Interesting patterns can be identified. The averages of attributes Age and Hours_per_week roughly decrease along with the decreasing percentage of class ‘ $>$ 50 K’. The Capital_gain of the clusters with a large percentage of ‘ $>$ 50 K’ is considerably larger than that of the clusters with a small percentage. Moreover, instances with Education of Prof-school, Marital_status of Married-civ-spouse, and Relationship of Husband are likely to have a class label of ‘ $>$ 50 K’. These patterns match the general perception regarding income.
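The per-cluster summary of Table 13 amounts to a mode with its share for each categorical attribute and a mean for each numeric attribute. A minimal sketch follows; the records are hypothetical and not drawn from Adult1600, and the function name is illustrative.

```python
from collections import Counter
from statistics import mean

def cluster_profile(rows, categorical, numeric):
    """Mode (with percentage) per categorical attribute and mean per numeric attribute."""
    profile = {}
    for attr in categorical:
        value, count = Counter(r[attr] for r in rows).most_common(1)[0]
        profile[attr] = (value, round(100 * count / len(rows)))
    for attr in numeric:
        profile[attr] = mean(r[attr] for r in rows)
    return profile

# Hypothetical cluster of four records.
cluster = [
    {"Education": "Bachelors", "Age": 40},
    {"Education": "Bachelors", "Age": 44},
    {"Education": "Masters", "Age": 50},
    {"Education": "Bachelors", "Age": 46},
]
profile = cluster_profile(cluster, ["Education"], ["Age"])
print(profile["Education"])  # ('Bachelors', 75)
```

Sorting the resulting profiles by the share of the class of interest gives the row order used in Table 13.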

As regards Fig. 9, some distributions of categorical values differ considerably between clusters. For instance, in attribute Education, C1 has totally different values, Doctorate and Prof-School, from C2, which has Bachelors and Masters, and C1 also differs considerably from C3 and C6. Significant differences can also be identified across clusters in attributes Marital_status and Relationship. According to the results in Tables 11–13 and Fig. 9, we consider that Sammon Mapping with COFC is effective in handling the mixed-type dataset Adult1600 and that interesting patterns can be identified.

Figure 10.

Projection map of dataset Australian by Sammon Mapping with (a) 1-of- $k$ , (b) SMC, and (c) COFC.

For quantitative comparison of the distributions of categorical values in individual clusters, Tables 14–16 show the pairwise symmetric KL-divergence (KLD) values for each categorical attribute. Corresponding to Fig. 9, the value for C1 and C2 in Table 14 for attribute Education is relatively large (3.269), while those in Table 15 for Marital_status (0.052) and Table 16 for Relationship (0.042) are quite small. C2 and C6 have the largest difference in Education (4.131), C3 and C5 in Marital_status (4.575), and C3 and C6 in Relationship (5.411).
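A symmetric KL-divergence between the categorical distributions of two clusters can be sketched as below. The additive smoothing constant `eps` and the choice to sum the two directions are assumptions made for illustration; they may differ from the revised KLD used for Tables 14–16.

```python
import math
from collections import Counter

def distribution(values, domain, eps=1e-6):
    """Relative frequencies over `domain`, smoothed so no probability is zero."""
    counts = Counter(values)
    total = len(values) + eps * len(domain)
    return [(counts[v] + eps) / total for v in domain]

def symmetric_kld(p, q):
    """KL(p || q) + KL(q || p) for two discrete distributions."""
    return sum((a - b) * math.log(a / b) for a, b in zip(p, q))

# Hypothetical clusters described by one categorical attribute.
domain = ["Married", "Never-married", "Divorced"]
c_a = distribution(["Married"] * 9 + ["Divorced"], domain)
c_b = distribution(["Married"] * 8 + ["Divorced"] * 2, domain)
c_c = distribution(["Never-married"] * 9 + ["Married"], domain)
print(symmetric_kld(c_a, c_b) < symmetric_kld(c_a, c_c))  # True
```

Clusters with nearly the same mix of values score close to zero, while clusters dominated by different values score high, which is the pattern read off the tables above.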

4.2.2 Dataset Credit Approval

Figure 10 shows the projection results for dataset Credit Approval. Unlike the result with COFC, the results with 1-of-$k$ and SMC fail to yield well-separated clusters. The trustworthiness and continuity values of the dimensionality reduction are listed in Table 17. The result again demonstrates that COFC preserves neighborhood relations better than 1-of-$k$ and SMC.

Table 17
Trustworthiness and continuity of the projections of dataset Credit Approval by Sammon Mapping

	$k=$ 1		$k=$ 3		$k=$ 7		$k=$ 11
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.9384	0.9584	0.9208	0.9389	0.8986	0.9161	0.8866	0.9060
SMC	0.9552	0.9703	0.9405	0.9506	0.9211	0.9310	0.9106	0.9200
COFC	0.9565	0.9890	0.9416	0.9869	0.9376	0.9841	0.9326	0.9802
	$k=$ 25		$k=$ 50		$k=$ 100		$k=$ 200
Measures	T	C	T	C	T	C	T	C
1-of- $k$	0.8630	0.8937	0.8478	0.8783	0.8346	0.8674	0.8403	0.8618
SMC	0.8787	0.9078	0.8539	0.8951	0.8563	0.8888	0.8739	0.8869
COFC	0.9176	0.9655	0.8948	0.9459	0.8837	0.9239	0.9007	0.9250

Table 18 presents the classification accuracy of $k$-NN in the data space and the map space. In the data space, the best accuracy, 0.8681, was achieved at $k=$ 25 with COFC and at $k=$ 101 with SMC. In the map space, COFC outperformed 1-of-$k$ and SMC and had the best accuracy, 0.8638, at $k=$ 7.

Table 18

Classification accuracy of $k$-NN on dataset Credit Approval in the data and the map space

Space	Scheme	$k=$ 1	$k=$ 3	$k=$ 7	$k=$ 11	$k=$ 25	$k=$ 51	$k=$ 101	$k=$ 201
Data space	1-of- $k$	0.8014	0.8217	0.8522	0.8493	0.8536	0.8586	0.8609	0.8594
Data space	SMC	0.8145	0.8507	0.8652	0.8638	0.8638	0.8638	0.8681	0.8594
Data space	COFC	0.8014	0.8391	0.8406	0.8565	0.8681	0.8536	0.8623	0.8609
Map space	1-of- $k$	0.7725	0.8029	0.8319	0.8377	0.8246	0.8101	0.8087	0.8116
Map space	SMC	0.7797	0.8362	0.8420	0.8493	0.8377	0.8232	0.8362	0.8333
Map space	COFC	0.7971	0.8478	0.8638	0.8594	0.8522	0.8565	0.8594	0.8565

Table 19

Averages of numerical values and class distribution in individual clusters of Credit Approval dataset projected by Sammon Mapping with COFC

	A1	A2	A3	A7	A8	A9	A10	A11	A12	A13	A14	Count	Class: ‘1’
C12	0.00	33	7	2.68	1.00	1.00	6.29	0.00	2.00	93	2921	45	96%
C3	1.00	35	7	3.92	1.00	1.00	7.06	0.00	2.00	151	1768	72	92%
C8	1.00	35	5	4.50	1.00	1.00	6.68	1.00	2.00	212	1801	78	90%
C13	0.00	33	6	2.56	1.00	1.00	5.76	0.94	1.97	150	640	33	85%
C7	1.00	33	5	3.22	1.00	0.00	0.00	1.00	1.86	222	873	51	57%
C10	0.00	27	4	0.99	0.66	0.34	0.93	0.00	1.93	176	1040	29	50%
C2	0.97	33	5	1.85	0.61	0.39	0.86	0.01	1.82	180	381	71	35%
C11	0.03	30	5	2.11	0.57	0.43	1.26	1.00	2.00	197	685	35	29%
C4	0.02	29	4	0.64	0.00	0.00	0.00	0.02	1.98	189	2132	52	10%
C5	1.00	29	3	1.44	0.00	0.00	0.00	1.00	1.79	225	292	77	9%
C1	1.00	29	3	0.72	0.00	0.00	0.00	0.00	1.96	169	187	105	8%
C9	0.00	32	3	1.46	0.00	0.00	0.00	1.00	1.82	181	89	28	4%
C6	1.00	27	4	2.66	0.00	1.00	1.57	1.00	2.00	312	88	14	0%
Total	0.54	31	4.8	2.21	0.53	0.47	2.34	0.54	1.93	189	992	690	43%

Figure 11.

Distribution of categorical values in clusters of Credit Approval projected by Sammon mapping with COFC.

We visually identified and interactively labeled the clusters on the projection map of Fig. 10c. Thirteen clusters were identified; refer to Fig. 11 for some of them. From Table 19, it is interesting to see that in most clusters the majority of data points share the same class label. C12 has 96% of its data with class ‘1’ while C6 has none. Analysis was performed on the individual clusters, and the detailed outcomes are presented in the following.

Table 20

Mode and percentage of categorical values and class distribution in individual clusters of Credit Approval dataset projected by Sammon Mapping with COFC

	A4	A5	A6	Count	Class: ‘1’
C12	b (93)	k (44)	d (49)	45	96%
C3	b (88)	h (24)	d (60)	72	92%
C8	b (77)	h (21)	d (58)	78	90%
C13	b (85)	k (30)	d (55)	33	85%
C7	b (76)	i (22)	d (55)	51	57%
C10	b (83)	a (21)	d (45)	29	50%
C2	b (77)	h (27)	d (61)	71	35%
C11	b (86)	k (20)	d (60)	35	29%
C4	b (65)	a,h (21)	d (52)	52	10%
C5	b (68)	c (21)	d (62)	77	9%
C1	b (65)	h (27)	d (70)	105	8%
C9	b (64)	d,h,i (14)	d (64)	28	4%
C6	b (86)	h (43)	d (64)	14	0%
Total	b (76)	h (21)	d (59)	690	43%

Table 21

KLD values between clusters for attribute A4 of Credit Approval by Sammon with COFC

	C12	C3	C8	C13	C7	C10	C2	C11	C4	C5	C1	C9	C6
C12	0.000	0.023	0.038	0.072	0.035	0.045	0.015	0.001	0.093	0.020	0.034	0.021	0.000
C3		0.000	0.086	0.105	0.007	0.016	0.004	0.023	0.028	0.000	0.108	0.081	0.020
C8			0.000	0.009	0.131	0.064	0.054	0.027	0.216	0.075	0.019	0.013	0.039
C13				0.000	0.164	0.061	0.069	0.053	0.240	0.093	0.055	0.045	0.071
C7					0.000	0.045	0.020	0.041	0.015	0.010	0.138	0.110	0.032
C10						0.000	0.009	0.034	0.065	0.013	0.118	0.089	0.040
C2							0.000	0.011	0.052	0.002	0.079	0.056	0.012
C11								0.000	0.098	0.018	0.032	0.018	0.001
C4									0.000	0.018	0.032	0.018	0.001
C5										0.000	0.099	0.073	0.017
C1											0.000	0.002	0.037
C9												0.000	0.023
C6													0.000

Table 22

KLD values between clusters for attribute A5 of Credit Approval by Sammon with COFC

	C12	C3	C8	C13	C7	C10	C2	C11	C4	C5	C1	C9	C6
C12	0.000	0.312	0.497	0.248	0.036	0.188	0.361	0.157	0.408	0.255	0.268	0.361	0.197
C3		0.000	0.122	0.156	0.249	0.178	0.133	0.239	0.218	0.107	0.201	0.314	0.209
C8			0.000	0.239	0.384	0.332	0.204	0.478	0.455	0.249	0.234	0.404	0.233
C13				0.000	0.189	0.198	0.207	0.184	0.398	0.183	0.129	0.247	0.220
C7					0.000	0.155	0.297	0.132	0.354	0.189	0.163	0.302	0.156
C10						0.000	0.319	0.167	0.437	0.247	0.105	0.167	0.138
C2							0.000	0.326	0.333	0.205	0.330	0.595	0.207
C11								0.000	0.366	0.271	0.224	0.292	0.274
C4									0.000	0.069	0.557	0.832	0.536
C5										0.000	0.286	0.520	0.312
C1											0.000	0.131	0.115
C9												0.000	0.312
C6													0.000

Table 19 shows the average of each numeric attribute in each cluster. The clusters are sorted in decreasing order of the percentage of class label ‘1’. The dataset has 43% of its instances with class 1. Most of the clusters have a class distribution significantly different from that of the dataset. In addition, some interesting patterns can be observed. In A2, A3, A7, and A10, the averages of the clusters with large percentages of class 1 are mostly larger than those of the clusters with smaller percentages. For A8 and A9, the clusters with a large percentage of class 1 have a value of one, while those with a small percentage usually have a value of zero. The only exception is C6 in A9, which has a value of one.

Table 23

KLD values between clusters for attribute A6 of Credit Approval by Sammon with COFC

	C12	C3	C8	C13	C7	C10	C2	C11	C4	C5	C1	C9	C6
C12	0.000	0.102	0.261	0.158	0.005	0.109	0.275	0.042	0.038	0.089	0.147	0.060	0.155
C3		0.000	0.054	0.016	0.097	0.099	0.109	0.156	0.112	0.029	0.074	0.074	0.045
C8			0.000	0.026	0.261	0.163	0.074	0.374	0.236	0.121	0.139	0.228	0.070
C13				0.000	0.153	0.119	0.074	0.254	0.164	0.068	0.087	0.137	0.048
C7					0.000	0.142	0.271	0.041	0.036	0.088	0.139	0.057	0.136
C10						0.000	0.206	0.154	0.133	0.062	0.212	0.110	0.202
C2							0.000	0.414	0.188	0.163	0.298	0.341	0.166
C11								0.000	0.071	0.094	0.228	0.045	0.236
C4									0.000	0.086	0.242	0.126	0.162
C5										0.000	0.139	0.053	0.095
C1											0.000	0.090	0.048
C9												0.000	0.127
C6													0.000

Table 20 presents the mode and its percentage for each categorical attribute in each cluster. In A4 and A6, the modes of the clusters are the same as that of the dataset. Interestingly, several clusters have a mode different from that of the dataset in A5. The dataset has the mode ‘h’ of 15% while C12 has ‘k’ of 44%.

Figure 11 shows the distribution of the categorical values in several clusters. Owing to space limitations, we show only the most significant results: the top three and the bottom three clusters, which have the largest and the smallest percentages of class label 1 in Table 19.

To quantitatively compare the distributions of categorical values in clusters, the symmetric KLD values were calculated; the results are shown in Tables 21–23. A large KLD value indicates that the distributions in two clusters are quite different. Taking A5 as an example and comparing C6 with the other five clusters in Fig. 11, C6 and C9 have the largest KLD value, 0.312, and C6 and C1 have the smallest, 0.115.

5. Conclusion

In this study, we proposed a procedure that integrates a scheme for learning the dissimilarity between categorical values from the data into dimensionality reduction in order to visualize and analyze mixed-type datasets. After projection onto a low-dimensional space, users can interactively cluster the data and then analyze the clusters. Two indices, trustworthiness and continuity, can be used to assess the quality of the projection. Characteristics of individual clusters are extracted by taking the modes of the categorical attributes and the means of the numeric attributes. In addition to visual inspection, a revised Kullback-Leibler divergence was used to quantitatively compare the distributions of categorical values between clusters. The experiments on two synthetic and two real datasets with Sammon Mapping demonstrate that the proposed procedure is feasible. The analysis results give users insight into the structures of real-world mixed-type datasets, especially the relationship between feature attributes and the class labels.

To make the proposed procedure practical and suitable for large datasets, some issues need to be tackled in the future. The time complexities of learning the dissimilarity between categorical values and of calculating the dissimilarity matrix between data points are both $O(n^{2})$. Improvements to these two steps are required for the procedure to scale in practice. Moreover, the determination of a proper $k$ value for the $k$-NN classification algorithm deserves further study.

Acknowledgments

This research was supported by the Ministry of Science and Technology, Taiwan, under grants MOST 104-2410-H-224-011 and MOST 105-2410-H-224-007.

References

1. Geng, Zhan D.C. and Zhou Z.H., Supervised nonlinear dimensionality reduction for visualization and classification, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 35 (2005), 1098–1107.
2. Venna and Kaski, Visualizing gene interaction graphs with local multidimensional scaling, Paper presented at the European Symposium on Artificial Neural Networks, Bruges, Belgium, 2006.
3. Chen H.-T., Chang H.-W. and Liu T.-L., Local Discriminant Embedding and Its Variants, Paper presented at the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA, USA, 2005.
4. Hsu C.-C. and Huang W.-H., Integrated dimensionality reduction technique for mixed-type data involving categorical values, Applied Soft Computing 43 (2016), 199–209.
5. Liu, Feng and Qiao, Scatter Balance: An Angle-Based Supervised Dimensionality Reduction, IEEE Transactions on Neural Networks and Learning Systems 26(2) (2015), 277–289.
6. Yan, Zhang H.-J., Yang and Lin, Graph Embedding and Extensions: A General Framework for Dimensionality Reduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 29(1) (2007), 40–51.
7. Belkin M. and Niyogi, Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, Paper presented at the Neural Information Processing Systems, Vancouver, British Columbia, Canada, 2001.
8. Kaski, Dimensionality reduction by random mapping: fast similarity computation for clustering, Paper presented at the IEEE World Congress on Computational Intelligence, Anchorage, AK, 1998.
9. Lafon S.P. and Lee A.B., Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization, IEEE Transactions on Pattern Analysis and Machine Intelligence 28(9) (2006), 1393–1403.

10. Niu, J.G. and Jordan M.I., Dimensionality Reduction for Spectral Clustering, Journal of Machine Learning Research 15 (2011), 552–560.
11. Feng M.Y., Song and Wei, ICA-Based Dimensionality Reduction and Compression of Hyperspectral Images, Journal of Electronics and Information Technology 29(12) (2007), 2871–2875.
12. and Fowler J.E., Hyperspectral image compression using JPEG2000 and principal component analysis, IEEE Geoscience and Remote Sensing Letters 4(2) (2007), 201–205.
13. Mignotte, A bicriteria optimization approach based dimensionality reduction model for the color display of hyperspectral images, IEEE Transactions on Geoscience and Remote Sensing 50(2) (2012), 501–513.
14. Salakhutdinov and Hinton G.E., Learning a Nonlinear Embedding by Preserving Class Neighbourhood Structure, Paper presented at the AISTATS, 2007.
15. Teh Y.W. and Roweis, Automatic Alignment of Local Representations, Paper presented at the Neural Information Processing Systems, Vancouver, British Columbia, Canada, 2002.
16. Weinberger K.Q., Sha and Saul L.K., Learning a Kernel Matrix for Nonlinear Dimensionality Reduction, Paper presented at the International Conference on Machine Learning, Banff, Alberta, Canada, 2004.
17. Maaten L.V.D., Postma and Herik J.V.D., Dimensionality Reduction: A Comparative Review (TiCC-TR 2009-005). Retrieved from https://lvdmaaten.github.io/publications/papers/TR_Dimensionality_Reduction_Review_2009.pdf, 2009.
18. Yan, Niyogi and Zhang H.-J., Face Recognition Using Laplacianfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 27(3) (2005), 328–340.
19. Frank and Asuncion, UCI machine learning repository, (12 Sep 2010).
20. Dash and Liu, Feature selection for classification, Intell Data Anal 1 (1997), 131–156.

21.

Dash

and Liu

, Consistency-based search in feature selection, Artif Intell 151 (2003), 155–176.

22.

Gan

J.Q.

Hasan

B.A.S.

and Tsui

C.S.L.

, A filter-dominating hybrid sequential forward floating search method for feature subset selection in high-dimensional space, Int J Mach Learn Cybern 5 (2014), 413–423.

23.

S.X.

Wang

X.Z.

Zhang

G.Q.

and Zhou

, Effective algorithms of the Moore – Penrose inverse matrices for extreme learning machine, Intell Data Anal 19 (2015), 743–760.

24.

Mitra

Murthy

C.A.

and Pal

S.K.

, Unsupervised feature selection using feature similarity, IEEE Trans Pattern Anal Mach Intell 24 (2002), 301–312.

25.

Peng

H.C.

Long

F.H.

and Ding

, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans Pattern Anal Mach Intell 27 (2005), 1226–1238.

26.

Xie

Z.X.

and Xu

, Sparse group LASSO based uncertain feature selection, Int J Mach Learn Cybern 5 (2014), 201–210.

27.

Tang

W.Y.

and Mao

K.Z.

, Feature selection algorithm for mixed data with both nominal and continuous features, Pattern Recognit Lett 28 (2007), 563–571.

28.

Q.H.

D.R.

Liu

J.F.

and Wu

C.X.

, Neighborhood rough set based heterogeneous feature subset selection, Inf Sci 178 (2008), 3577–3594.

29. Chen D.G. and Yang Y.Y., Attribute reduction for heterogeneous data based on combination of classical and fuzzy rough set models, IEEE Trans Fuzzy Syst 22 (2014), 1325–1334.

30. Zhang, Mei, Chen and Li, Feature selection in mixed data: A method using a novel fuzzy rough set-based information entropy, Pattern Recognition 56 (August 2016), 1–15.

31. Tuv, Borisov and Torkkola, Best Subset Feature Selection for Massive Mixed-Type Problems, IDEAL 2006, Lecture Notes in Computer Science (LNCS) 4224 (2006), 1048–1056.

32. Hedjazi, Aguilar-Martin, Le Lann M.-V. and Kempowsky-Hamon T., Membership-margin based feature selection for mixed type and high-dimensional data: Theory and applications, Information Sciences 322(20) (Nov. 2015), 174–196.

33. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 23 (1933), 417–441.

34. Torgerson W.S., Multidimensional scaling: I. Theory and method, Psychometrika 17(4) (1952), 401–419.

35. Sammon J.W., A Nonlinear Mapping for Data Structure Analysis, IEEE Transactions on Computers C-18 (1969), 401–409.

36. DeMers and Cottrell, Non-linear dimensionality reduction, Paper presented at the Advances in Neural Information Processing Systems, San Mateo, CA, USA, 1993.

37. Demartines and Hérault, Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets, IEEE Transactions on Neural Networks 8(1) (1997), 148–154.

38. Roweis S.T. and Saul L.K., Nonlinear Dimensionality Reduction by Locally Linear Embedding, Science 290(5500) (2000), 2323–2326.

39. Tenenbaum J.B., Silva V.D. and Langford J.C., A Global Geometric Framework for Nonlinear Dimensionality Reduction, Science 290 (2000), 2319–2323.

40. Brand, Charting a manifold, Paper presented at the Advances in Neural Information Processing Systems, Cambridge, MA, USA, 2002.

41. Zhang and Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, SIAM J Sci Comput 26(1) (2004), 313–338.

42. Law M.H.C. and Jain A.K., Incremental nonlinear dimensionality reduction by manifold learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006), 377–391.

43. Maaten L.V.D. and Hinton, Visualizing Data using t-SNE, Journal of Machine Learning Research 9 (2008).

44. Shawe-Taylor and Cristianini, Kernel Methods for Pattern Analysis, Cambridge, UK: Cambridge University Press, 2004.

45. Donoho D.L. and Grimes, Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 102(21) (2005), 7426–7431.

46. Kohonen, The self-organizing map, Proceedings of the IEEE 78(9) (1990), 1464–1480.

47. Kohonen, Essentials of the self-organizing map, Neural Networks 37 (2013), 52–65.

48. Hsu C.-C., Lin S.-H. and Tai W.-S., Apply extended self-organizing map to cluster and classify mixed-type data, Neurocomputing 74 (2011), 3832–3842.

49. Halkidi, Batistakis and Vazirgiannis, Cluster Validity Methods-Part I, ACM SIGMOD Record 31(2) (2002), 40–45.

50. Halkidi, Batistakis and Vazirgiannis, Cluster Validity Methods-Part II, ACM SIGMOD Record 31(3) (2002), 19–27.

51. Tan P.-N., Steinbach and Kumar, Introduction to Data Mining, Addison Wesley, 2006.

52. Hsu C.-C., Generalizing Self-Organizing Map for Categorical Data, IEEE Transactions on Neural Networks 17 (2006), 294–304.

53. Deegalla and Boström, Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods, Paper presented at the Intelligent Data Engineering and Automated Learning, Birmingham, UK, 2007.

54. Kullback and Leibler R.A., On information and sufficiency, Annals of Mathematical Statistics 22(1) (1951), 79–86.

55. Lichman, UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science, 2013.

56. Han and Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.

