Abstract
Most real-world datasets are of mixed type, containing both numeric and categorical attributes. Unlike numbers, categorical values support only limited operations, and the degree of similarity between distinct values cannot be measured directly. To analyze mixed-type data properly, dedicated methods for handling categorical values are needed. The limitation of most existing methods is the lack of an appropriate numeric representation of categorical values; consequently, some analysis algorithms cannot be applied. In this paper, we address this deficiency by transforming categorical values to numeric representations so as to facilitate various analyses of mixed-type data. In particular, the proposed transformation preserves the semantics of categorical values with respect to the other values in the dataset, resulting in better performance on data analysis tasks including classification and clustering. The proposed method is verified and compared with other methods on multiple real-world datasets.
Introduction
Owing to the popularity of information systems and the decreasing cost of storage devices, many organizations have collected huge amounts of data. Patterns or information valuable to an organization's decision making may be hidden in those data. Nowadays, most of the collected data is of mixed type, namely, containing numeric and non-numeric (or categorical) data. As shown in Table 1, in addition to the class attribute, the feature attributes Hours_per_week (Hrs), Capital_gain, and Capital_loss are numeric, while the attributes Education, Marital_status, and Relationship are categorical. The ability to analyze mixed-type data and uncover patterns is essential for exploiting the hidden information for the benefit of the organization.
Portion of the Adult mixed-type dataset
Analyzing mixed-type data is not a trivial task. Most data analysis algorithms accept only one type of data, either numeric or categorical, so one type is usually transformed to the other before an algorithm is applied. For instance, artificial neural networks take only numeric inputs, while association-rule mining takes categorical (or discrete) values.
Many studies on measuring dissimilarity or similarity between multi-dimensional mixed-type data points have been conducted in the past. Most of the methods first measure the pairwise dissimilarities of individual feature attributes, both numeric and categorical. Then, the individual pairwise values are aggregated in some way to measure the dissimilarity between two mixed-type data points.
The drawback of these approaches is that they yield only pairwise dissimilarities between categorical values, not a numeric representation of each individual value. Consequently, some analyses or applications are not applicable. For instance, the dimensionality reduction techniques principal component analysis (PCA) [1] and linear discriminant analysis (LDA) [2] apply only to datasets in which each attribute is represented by numeric values rather than by pairwise dissimilarities between categorical values.
The objective of this study is to tackle this problem in analyzing mixed-type datasets. In particular, we propose an approach to measuring similarity between the values of a categorical attribute in the dataset. Unlike existing methods, the approach not only measures the pairwise similarity between categorical values but also yields a distributed, continuous representation of them. The proposed approach is evaluated against existing methods on classification and clustering, which are commonly performed tasks in data analysis.
The contributions of the study are summarized as follows.
An approach to handling mixed-type data is proposed which not only transforms categorical values to their respective numeric representations but also preserves the semantic relationships among categorical values. The approach makes more data analysis algorithms applicable to the transformed datasets.
Extensive experiments are conducted to verify the feasibility and superiority of the proposed approach against other schemes for handling mixed-type datasets.
The issue of hyperparameter setting for the proposed model is investigated, and proper settings are identified.
An analysis of a real-world mixed-type dataset is presented to demonstrate the application and feasibility of the proposed approach.
This article is organized as follows. Section 2 reviews related work, including traditional schemes for measuring similarity between categorical values and between two mixed-type data instances. Section 3 presents the proposed approach to handling categorical values while preserving their semantics. Section 4 presents the experiments and their results. Conclusions are given in Section 5.
Dissimilarity between categorical values
A simple method is to compare two categorical values and yield a dissimilarity of zero if the two are the same and one otherwise. Another popular scheme called 1-of-
The drawback of the two methods above is that they do not consider the semantics embedded in the categorical values. Specifically, any two distinct values always yield the same dissimilarity, which is counterintuitive. For the example in Fig. 1, ideally, T1 and T2, which both contain a carbonated drink, are more similar to each other than to T3 or T4, which both contain a coffee drink; the same holds for T3 and T4. In the first method, the dissimilarity is one while in the 1-of-
To address the drawbacks of the aforementioned methods, many measures have been proposed to calculate the degree of similarity between categorical values in a dataset [3, 4]. Note that a similarity value
Ahmad and Dey [11, 12] proposed a dissimilarity measure between two distinct values of a categorical attribute based on how two values co-occur with the values in other attributes. Desai et al. [13] proposed a similarity measure DISC and compared DISC with 14 similarity measures used by Boriah et al. [3] on 24 real-world datasets. Twelve of the datasets were used for classification task while the others were for regression task. DISC performed better than all the compared measures on almost all datasets. Ienco et al. [14] proposed DILCA and compared DILCA with three measures LIN, OF, and GOODALL3 on application to three clustering algorithms, DELTA [12], LIMBO [15], and ROCK [16]. Sixteen datasets including 4 synthetic and 12 real-world datasets were used in the experiments. DILCA was superior to the compared measures on most of the datasets.
Many researchers have studied the analysis of mixed-type data in the context of data clustering, in which similarity between categorical values is an essential ingredient of the clustering algorithms [17]. Wangchamhan et al. [18] proposed to use Gower distance, based on Gower similarity coefficient [19], and the mechanism for handling categorical attributes. Ding et al. [20, 21] proposed an entropy-based similarity measure for both categorical and numeric attributes so as to have a uniform criterion. Their measure avoids feature transformation and parameter adjustment between categorical and numeric values. Sangam and Om [22] proposed a framework for clustering time-evolving mixed data, in which dissimilarity with respect to a categorical attribute is measured based on the observed frequencies of categorical values. In [23], the authors proposed assigning a mixed-type data point to a cluster based on the object-cluster similarity for mixed data. For the categorical attributes, the comparison with the cluster is based on the count of the categorical value appearing in the cluster: the higher the count, the more similar the data point is to the cluster. The same idea also appeared in the work by Huang [24, 25] proposing the
Mixed-type data analysis has also been studied in the context of extending the self-organizing map (SOM) [27, 28, 29] and of dimensionality reduction [30, 31]. In [27, 28], the authors proposed a data structure called distance hierarchy for representing dissimilarity between categorical values. Each categorical value is mapped to a node in the hierarchy, and the dissimilarity is then measured by the path length between the two nodes representing the categorical values. As shown in Fig. 2, the dissimilarity between Coke and Pepsi is two, and that between Coke and Latte is four, assuming unit length between two nodes. The data structure is then used to generalize the Self-Organizing Map algorithms to handle mixed datasets involving categorical attributes.
Distance hierarchy for categorical values.
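As an illustration of the path-length idea, the following minimal sketch stores a hierarchy as child-to-parent links and counts the edges between two values. The tree layout, the node names `Any`, `Carbonated`, `Coffee`, and `Mocha`, and the function names are assumptions made for this example; only the Coke, Pepsi, and Latte distances are taken from the text.

```python
# Hypothetical hierarchy: each node maps to its parent; the root maps to None.
PARENT = {
    "Coke": "Carbonated",
    "Pepsi": "Carbonated",
    "Latte": "Coffee",
    "Mocha": "Coffee",
    "Carbonated": "Any",
    "Coffee": "Any",
    "Any": None,
}


def path_to_root(node, parent=PARENT):
    """Return the list of nodes from `node` up to the root."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def hierarchy_distance(a, b, parent=PARENT, edge_length=1):
    """Path length between a and b through their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return (steps_from_a[n] + steps_from_b) * edge_length
    raise ValueError("the two values do not share a root")


print(hierarchy_distance("Coke", "Pepsi"))  # 2
print(hierarchy_distance("Coke", "Latte"))  # 4
```

With unit edge lengths this reproduces the two distances quoted above; weighted edges would need a per-edge length instead of the single `edge_length` used here.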
The distance hierarchy can be constructed manually according to knowledge of domain experts [27]. When domain knowledge is not obtainable, distance hierarchy can be constructed according to pairwise dissimilarity between the values learned in a supervised or unsupervised manner with a hierarchical agglomerative clustering [28, 29].
A supervised learning method for pairwise dissimilarity was proposed in [28]. The idea behind the method is that the similarity between two values is dependent on how each of the values co-occurs with the labels in the class attribute [32]. The more two categorical values of an attribute co-occur with the same class value, the more similar they are.
Several unsupervised methods were proposed in [29]. The idea of those methods is mainly that similarity between two values A and B of a categorical attribute is dependent on how each of the values co-occurs with the values of the other feature attributes, referred to as context values. Specifically, for categorical context values, if A and B co-occur with the same context values often, A and B are deemed more similar. As for numeric context values, if the average of the numeric values which A is associated with is close to that which B is associated with, A and B are deemed more similar. The similarities of A and B with respect to each of the feature attributes are measured and then averaged to a single similarity value. According to the idea mentioned above, two implementations UDL1 and UDL2 were proposed and compared [29]. The difference is how the degree of co-occurrence or the conditional probability between the target and a context value is calculated.
MDISC [29] was adapted from DISC [13] which measures similarity between two categorical values via cosine. Based on the idea of DISC, MDISC was devised for measuring dissimilarity between two categorical values of an attribute with respect to the other feature attributes, including numeric and categorical ones. For a numeric context attribute, the dissimilarity between A and B is dependent on the difference between the averages of the values which co-occur with A and with B, respectively. For a categorical context attribute, cosine similarity between A and B with respect to the context attribute is calculated and then transformed to a dissimilarity.
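The two ingredients described above, cosine similarity over co-occurrence with a categorical context attribute and the gap between averages for a numeric context attribute, can be sketched as follows. This is only an illustration of the general idea under simplifying assumptions (raw co-occurrence counts, `1 - cosine` as the dissimilarity, toy records); it is not the exact DISC or MDISC formulation, and all names and data are hypothetical.

```python
import math
from collections import Counter


def cooccurrence_vector(records, target_attr, value, context_attr, context_domain):
    """Count how often `value` co-occurs with each value of the context attribute."""
    counts = Counter(r[context_attr] for r in records if r[target_attr] == value)
    return [counts[c] for c in context_domain]


def cosine_dissimilarity(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def mean_gap(records, target_attr, a, b, numeric_attr):
    """Absolute difference between the context averages associated with a and b."""
    def mean_for(value):
        xs = [r[numeric_attr] for r in records if r[target_attr] == value]
        return sum(xs) / len(xs)
    return abs(mean_for(a) - mean_for(b))


# Toy records (made up for illustration).
records = [
    {"drink": "Coke", "snack": "chips", "price": 1.0},
    {"drink": "Coke", "snack": "chips", "price": 1.2},
    {"drink": "Pepsi", "snack": "chips", "price": 1.1},
    {"drink": "Pepsi", "snack": "chips", "price": 1.1},
    {"drink": "Latte", "snack": "cake", "price": 3.0},
    {"drink": "Latte", "snack": "cake", "price": 3.4},
]
snacks = ["chips", "cake"]
coke = cooccurrence_vector(records, "drink", "Coke", "snack", snacks)
pepsi = cooccurrence_vector(records, "drink", "Pepsi", "snack", snacks)
latte = cooccurrence_vector(records, "drink", "Latte", "snack", snacks)
```

In this toy data Coke and Pepsi share the same snack context and similar prices, so both quantities come out smaller for the Coke and Pepsi pair than for the Coke and Latte pair.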
Except for the 1-of-
The advantage of having numeric encoding of categorical values is that various quantities such as the average, variance, and covariance of categorical values, which are often required in many data analysis algorithms, can be easily estimated via their numeric representations. The estimation is infeasible or difficult when we have only pairwise dissimilarities between categorical values.
Method
Idea
The idea for tackling the problem is to adopt techniques from word embedding, which represent a word by a multi-dimensional vector of continuous values, as shown in Fig. 3. What is special about this representation is that the semantic similarity between words is reflected in the word vector space [33]. In other words, the vector representing Coke will be more similar to the one representing Pepsi than to the one representing Latte. Consequently, the distance calculated from the vectors of Coke and Pepsi will be less than that of Coke and Latte. Therefore, we consider that word embedding techniques can be adapted to address the problem of mixed-type data analysis.
Word embedding allows a word to be represented by a multi-dimensional vector.
Most word embedding techniques are based on the idea that “a word is characterized by the company it keeps” popularized by Firth [34]. Mikolov et al. [33] proposed the Word2Vec techniques, comprising two word embedding models, the Continuous Bag of Words (CBOW) and the skip-gram, for computing vector representations of words from a large corpus, which can contain from several million up to several billion words. By training a neural network, the CBOW model predicts the target word from its context words, which are the words around the target in the corpus, while the skip-gram model tries to predict the context words of a given word, as shown in Fig. 4. In Fig. 4a for the CBOW model, the word vector of the target word
a. The CBOW model predicts the target word 
Although several pretrained embeddings are available through open-source APIs that provide access to precalculated vectors of many words, they may not be appropriate for analyzing the mixed-type dataset at hand, for two reasons. First, the word vectors are precomputed from a large corpus whose domain is not specific to that of the mixed-type dataset to be analyzed. For example, Gensim word2vec [35], with an interactive web app trained on GoogleNews, includes implementations of the skip-gram and CBOW models using either hierarchical softmax or negative sampling [33, 36]. Wikipedia2Vec [37] learns embeddings of words and entities by iterating over entire Wikipedia pages based on the skip-gram model; pretrained embeddings for 12 languages, including English, Chinese, French, German, and Japanese, are provided in binary and text formats. Second, in many cases, categorical values recorded in mixed-type datasets cannot be found in the pretrained models because of abbreviations, compound words, or customized labels made up by the recording organization. Examples in Table 1 include “7th-8th” and “Married-AF-spouse”. We therefore argue that the word vector of a categorical value should be calculated from the data in the mixed-type dataset itself.
Inspired by the idea in the paper [33], we can measure the semantic similarity of categorical values in a mixed-type dataset analogously. To adapt the idea of word embedding algorithms in this study, we consider the mixed-type dataset as a corpus, one datapoint in the dataset as a document and then apply word-embedding algorithms for continuous vector representation of categorical values in the mixed-type dataset.
Take the Adult dataset from the UCI machine learning repository [38] as an example. The dataset has 14 feature attributes including 6 continuous attributes,
{age, fnlwgt, education-num, capital-gain, capital-loss, hours-per-week}
and 8 categorical attributes,
{workclass, education, marital-status, occupation, relationship, race, sex, native-country}
and one class attribute indicating whether a person makes over 50K a year.
The Adult dataset consists of 48,842 records and is considered as a corpus, in which each document has the same size, namely, containing 14 words.
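The corpus view can be made concrete with a short sketch that turns each record into a fixed-length "document" of tokens. Prefixing each value with its attribute name (`attribute=value`) is an assumption of this example, used so that equal strings from different attributes stay distinct; the two rows are illustrative and the function name is hypothetical.

```python
def records_to_corpus(records, attributes):
    """Turn each record into a 'document': one attribute=value token per attribute."""
    return [[f"{a}={r[a]}" for a in attributes] for r in records]


attributes = ["Education", "Marital_status", "Relationship"]
# Illustrative rows in the style of the Adult dataset.
records = [
    {"Education": "7th-8th", "Marital_status": "Married-AF-spouse", "Relationship": "Wife"},
    {"Education": "Bachelors", "Marital_status": "Never-married", "Relationship": "Not-in-family"},
]

corpus = records_to_corpus(records, attributes)
for document in corpus:
    print(document)
```

A list of token lists in this shape is the input format accepted by common Word2Vec implementations such as the one in Gensim, although the procedure described in this paper additionally restricts which attributes serve as context.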
Figure 5 shows the proposed procedure which transforms categorical values of a mixed-type dataset to their numeric representations and preserves the semantics of categorical values with respect to their context attributes. Note that the procedure does not involve domain expert or use of other domain data. The transformation of categorical values to their numeric representations by Word2Vec is mainly based on the other attributes in the same dataset.
A procedure for handling mixed-type datasets in which categorical attribute values are transformed to numeric values via Word2Vec model.
To take numeric values into consideration as context words of the target word, we discretize continuous attributes, since a given numeric value may recur only rarely in a dataset of small or medium size. Without discretization, the co-occurrence of a specific numeric value with the target word might be very low. Moreover, considering close numeric values such as 80 and 81 semantically similar is reasonable and even beneficial for most applications. For instance, each of ages 80
Several techniques can be applied for discretization, such as equal-width or equal-frequency binning, or clustering-based approaches [39]. The equal-width and equal-frequency techniques have the disadvantage that a group of frequent, close values may be split into different bins; for instance, two highly frequent values, say ages 84 and 85, may be placed in different bins. In this study, we apply a clustering-based approach, kernel density estimation (KDE) [40], to minimize such impact as much as possible.
Figure 6 shows a discretization example by KDE and the result for attribute fnlwgt in the Adult dataset. Specifically, the attribute was first normalized to the range between 0 and 1. Next, a histogram was constructed with a bin size of 0.01. After Gaussian smoothing, it is apparent that there are two local minima, which are at fnlwgt
Discretization sample of numeric attributes of Table 1
Discretization of attribute fnlwgt by using kernel density estimation.
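The steps described for Fig. 6 (normalize, build a histogram with bin size 0.01, smooth with a Gaussian, cut at local minima) can be sketched in plain Python as follows. The smoothing width `sigma_bins`, the function name, and the toy data are assumptions; the text does not state the kernel bandwidth, so this is an illustration rather than the authors' implementation.

```python
import math


def kde_cut_points(values, bin_width=0.01, sigma_bins=3.0):
    """Return cut points (in original units) at local minima of a smoothed histogram."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    n_bins = int(round(1.0 / bin_width))

    # Histogram of the values after normalization to [0, 1].
    hist = [0] * n_bins
    for v in values:
        x = (v - lo) / span
        hist[min(int(x * n_bins), n_bins - 1)] += 1

    # Gaussian smoothing with a kernel truncated at three standard deviations.
    radius = int(3 * sigma_bins)
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma_bins) ** 2) for k in offsets]
    total = sum(kernel)
    kernel = [w / total for w in kernel]
    smooth = []
    for i in range(n_bins):
        s = 0.0
        for k, w in zip(offsets, kernel):
            if 0 <= i + k < n_bins:
                s += w * hist[i + k]
        smooth.append(s)

    # A bin is a cut point if the density drops into it and does not drop further.
    cuts = []
    for i in range(1, n_bins - 1):
        if smooth[i] < smooth[i - 1] and smooth[i] <= smooth[i + 1]:
            cuts.append(lo + (i + 0.5) * bin_width * span)
    return cuts


# Toy attribute with two well-separated groups of values (hypothetical data).
ages = [20 + i % 11 for i in range(110)] + [70 + i % 11 for i in range(110)]
cuts = kde_cut_points(ages)
```

On real attributes a small bandwidth can yield spurious shallow minima, so a minimum-depth filter or a larger `sigma_bins` may be needed before using the cut points as bin boundaries.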
In the applications of natural language processing, the words around the target word are considered context words and used to train neural-network models. The number of context words considered is determined by a user-specified parameter referred to as the window size
In a mixed-type dataset, the attributes are considered order-less, i.e., there is no specific order among the attributes. The values of the first attribute in the dataset may be related to the values of the last, distant attribute. To identify relevant attributes for context values, correlation analysis can be performed. In this study, we apply Cramer’s V [41] as a measure of association between the target and each of the other attributes. Cramer’s V can be used with variables having two or more levels (that is, distinct values) and is defined as follows.
In Eq. (1),
The pairwise values of the target attribute along with each of the context attributes form the training set of the model. The set of context attributes of attribute
where
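The association measure and the threshold-based choice of context attributes can be sketched as follows. The code uses the standard textbook form of Cramer's V, V = sqrt(chi2 / (n * min(r - 1, c - 1))), without bias correction; whether the authors applied any correction is not stated, and the function names, the default threshold, and the toy columns are assumptions of this example.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramer's V between two categorical columns of equal length."""
    n = len(xs)
    row, col, joint = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    chi2 = 0.0
    for a, n_a in row.items():
        for b, n_b in col.items():
            expected = n_a * n_b / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return math.sqrt(chi2 / (n * k))


def select_context_attributes(columns, target, threshold=0.05):
    """Keep the attributes whose association with the target exceeds the threshold."""
    return [
        name
        for name, values in columns.items()
        if name != target and cramers_v(columns[target], values) > threshold
    ]


# Toy columns (hypothetical): `p` mirrors the target, `q` is independent of it.
columns = {
    "target": ["a", "a", "b", "b"],
    "p": ["u", "u", "v", "v"],
    "q": ["u", "v", "u", "v"],
}
print(select_context_attributes(columns, "target"))  # ['p']
```

Numeric attributes would first be discretized so that they can enter the contingency table like any other categorical column.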
Figure 7 depicts the CBOW training model. The vocabulary size
Word embedding with the CBOW model.
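For readers unfamiliar with CBOW, a single forward pass of the plain (full-softmax) model can be sketched as below: the hidden layer is the average of the input vectors of the context words, and the output is a probability for every vocabulary word. The vocabulary, the random weights, and the function name are assumptions for illustration; training (and the hierarchical softmax or negative sampling speed-ups) is omitted.

```python
import math
import random


def cbow_forward(context_ids, w_in, w_out):
    """One CBOW forward pass: average the context vectors, then softmax over the vocabulary."""
    n_hidden = len(w_in[0])
    hidden = [
        sum(w_in[i][k] for i in context_ids) / len(context_ids)
        for k in range(n_hidden)
    ]
    scores = [
        sum(hidden[k] * w_out[k][j] for k in range(n_hidden))
        for j in range(len(w_out[0]))
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return hidden, [e / z for e in exps]


# Hypothetical vocabulary of attribute=value tokens.
vocab = [
    "Education=Bachelors",
    "Relationship=Husband",
    "Marital_status=Married-civ-spouse",
    "Marital_status=Never-married",
]
rng = random.Random(0)
V, N = len(vocab), 3  # vocabulary size and number of hidden neurons
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]   # V x N
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]  # N x V

# Context: Education and Relationship values; the model scores every token as the target.
hidden, probs = cbow_forward([0, 1], w_in, w_out)
```

After training, row `i` of `w_in` serves as the word vector of token `i`, which is why the vector length equals the number of hidden neurons `N`.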
The length of a word vector is determined by the number of neurons in the hidden layer. For a large corpus, the length of word vectors is often several hundred. For example, one of the GloVe pre-trained models of word vectors has a vector dimension of 300; the model was trained with 840 billion tokens and a 2.2-million-word vocabulary. For a mixed-type dataset, the domain size of categorical attributes is often relatively small. For instance, the Adult dataset consists of 48,842 records, and each record contains 14 features. In other words, there are 683,788 values in total in the dataset, which is small compared with the several billion tokens in the corpus used by GloVe. Therefore, a large vector length seems unnecessary in our application. How the length affects the results of applications such as classification and data clustering is worth investigating.
On the other hand, replacing a categorical value in the mixed-type data with a multi-dimensional word vector increases not only the dimension of the dataset but also the weight of the original categorical attribute. To give a categorical attribute the same weight as each numeric attribute, we should transform a categorical value to a single numeric value rather than a multi-dimensional vector. The simplest way to do so is to set the dimension of the word vector to one. An alternative is to train multi-dimensional vectors and then reduce the dimension to one with a dimensionality reduction technique such as PCA. The idea is that the latter approach might preserve more semantic information than the former. We will compare the two approaches experimentally.
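The second option, reducing multi-dimensional word vectors to a single value per category with PCA, can be sketched as follows. The first principal component is found here by power iteration on the covariance matrix to keep the example dependency-free; the word vectors, the iteration count, and the function name are hypothetical.

```python
import math


def project_to_one_dimension(vectors, iters=200):
    """Map each named vector to its coordinate on the first principal component."""
    names = list(vectors)
    data = [vectors[name] for name in names]
    n, d = len(data), len(data[0])

    # Center the data and build the sample covariance matrix.
    mean = [sum(v[j] for v in data) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in data]
    cov = [
        [sum(x[a] * x[b] for x in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]

    # Power iteration; the uneven start vector makes an orthogonal start unlikely.
    w = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iters):
        w_new = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in w_new))
        if norm == 0:
            break
        w = [c / norm for c in w_new]

    return {
        name: sum(x[j] * w[j] for j in range(d))
        for name, x in zip(names, centered)
    }


# Hypothetical two-dimensional word vectors for three categorical values.
word_vectors = {"Coke": [1.0, 2.0], "Pepsi": [1.1, 2.2], "Latte": [-1.0, -2.0]}
one_dim = project_to_one_dimension(word_vectors)
```

Values that are close in the original vector space (here Coke and Pepsi) stay close on the one-dimensional axis, which is the property the reduction is meant to preserve. The sign of the axis is arbitrary.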
Evaluation
For quantitative evaluation, accuracy is used in the classification task. Adjusted mutual information (AMI) and the adjusted Rand index (ARI) are used to measure the quality of clustering. Accuracy is defined as follows.
where
Adjusted mutual information (AMI) and the adjusted Rand index (ARI) [42] are measured as follows.
where
where
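As a reference point for these measures, the sketch below computes accuracy and the adjusted Rand index from their standard definitions (the fraction of correct predictions, and the pair-counting index corrected for chance). It does not reproduce the paper's own notation, AMI is left out because it additionally needs the expected mutual information, and the function names are choices of this example.

```python
from collections import Counter
from math import comb


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index computed from pair counts in the contingency table."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)
    maximum = (pairs_true + pairs_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions put everything in one cluster
    return (pairs_joint - expected) / (maximum - expected)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))                 # 0.75
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))      # 1.0
```

The second call shows that ARI ignores the cluster names: a clustering that matches the true classes up to relabeling scores 1. In practice, library routines such as scikit-learn's `adjusted_rand_score` and `adjusted_mutual_info_score` can be used instead.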
Experimental results
The proposed approach is compared with other schemes for handling mixed-type datasets on the tasks of classification and clustering on eleven datasets. The results are presented in Section 4.2. We discuss the hyperparameter settings of the proposed model in the subsequent section. An example of mixed-type data analysis is given in Section 4.4 to demonstrate the application and feasibility of the proposed approach.
Datasets
Eleven mixed-type datasets retrieved from the UCI machine learning repository [38] were used for the experiments. The datasets include Australian Credit Approval (ACA and pACA), Adult, pAdult, Bank Marketing (Bank and bankAdditional), Contraceptive Method Choice (CMC), Credit Approval (CA), German Credit Data (GCD), Mushroom, and Teaching Assistant Evaluation (TAE). pACA and pAdult, alternative versions of ACA and Adult, retain only the attributes correlated with their class attribute [27, 28]. Data points with missing values were dropped. Table 3 shows the statistics of the datasets.
Statistics of experimental datasets
Classification
We apply
We set
Table 4 shows the classification performance on the datasets; the Word2Vec scheme outperformed the others. The coding method 1-of-
Comparison of classification accuracy
We use
Tables 5 and 6 show the performance measured by adjusted mutual information (AMI) and adjusted rand index (ARI), respectively. It can be observed that the word embedding scheme outperformed the others on most of the datasets. Similar to the classification results, the coding method 1-of-
Hyperparameters
In the following, we analyze how the hyperparameter settings of the Word2Vec models affect classification and clustering. The issues investigated include the length of word vectors, a comparison of the CBOW and skip-gram models, a comparison of negative sampling and hierarchical softmax, and the relevance of context words.
Clustering performance with measure AMI
Clustering performance with measure ARI
A categorical attribute in the mixed-type dataset should be replaced by a single numeric attribute rather than a multi-dimensional vector. To do so, we compared two approaches. The first approach is to set the number of neurons in the hidden layer to ten, obtain word vectors of length ten, and then apply the dimension reduction technique PCA to the word vectors to obtain one-dimensional values. The second is to set the number of neurons in the hidden layer to one and obtain one-dimensional word vectors directly.
Table 7 presents the result. The last row shows the count that one approach outperformed the other among the experimental datasets. The result indicates that the approach of training a multi-dimensional vector first and then applying dimension reduction yielded better outcomes on most of the datasets.
Performance comparison between different sizes of word vectors. The last row indicates the count of the parameter setting outperforming the other
Figure 8 and Table 8 demonstrate the projection to one-dimensional space of the three categorical attributes of dataset pAdult. It is clear that the projection of the first approach (i.e., 10
Attribute values corresponding to the projection results in Fig. 8 from the left to the right
A Word2Vec model can be trained with hierarchical softmax or negative sampling. We compared the performance of the two settings. Table 9 shows that for the classification task, neither negative sampling nor hierarchical softmax was clearly superior, while for the clustering task negative sampling outperformed hierarchical softmax.
Performance comparison between negative sampling (NS) and hierarchical softmax (HS). The last row indicates the count of the parameter setting outperforming the other
One-dimensional projection results of attributes (a) Education, (b) Marital_status, and (c) Relationship of dataset pAdult after word embedding.
There are several word embedding models such as Word2Vec, fastText [46], GloVe [47], etc. Do different models matter? Word2Vec estimates continuous representations of words by training neural networks to predict a word from its context words (i.e., the CBOW model) or to predict the context words of a given word (i.e., the skip-gram model) [33]. In this study, we compared the two models of Word2Vec, namely, CBOW and skip-gram.
Table 10 shows that for the classification task, CBOW performed better than the skip-gram model, while for the clustering task the difference is not significant.
Performance comparison between CBOW and skip-gram (SG) models
In natural language processing applications, the CBOW and skip-gram models use a window size to determine the number of context words around the target word. In mixed-type datasets, the attributes are considered order-less, so adjacency of attributes does not imply a closer semantic relation to the target attribute. In this study, we used Cramer’s V to measure the degree of association between attributes. If the degree of association between a feature attribute and the target attribute exceeds a pre-determined threshold, the attribute is considered a context attribute.
Table 11 shows the experimental results with the threshold set to 0 and 0.05, respectively. In most cases, using a threshold to select context attributes improved performance in both the classification and clustering tasks.
Visualized analysis of mixed-type data
We present an example of visualized analysis of mixed-type data using dataset pAdult. In Fig. 9, categorical attributes of the dataset were handled with the Word2Vec scheme. Then, the dimension of the data was reduced to 2 by the supervised dimensionality reduction technique linear discriminant analysis (LDA) and plotted on a two-dimensional plane. In Fig. 10, categorical attributes were handled with the conventional method 1-of-
Performance comparison between the threshold of attribute relevance
We used linking and brushing technique [48] and developed a function to allow an analyst to perform interactive clustering and analysis on the projection results. As shown in Figs 9 and 10, the projection results were clustered to 5 and 2 clusters, respectively, and the data points in each cluster were colored according to whether its class label is “
Projection and clustering of dataset pAdult with Word2Vec by LDA.
Referring to Figs 9 and 10, it is apparent that handling categorical attributes with Word2Vec yielded better projection result than that with 1-of-
The average of the numeric attributes and the mode of the categorical attributes in each cluster with respect to the projection shown in Fig. 9
The average of the numeric attributes and the mode of the categorical attributes in each cluster with respect to the projection shown in Fig. 10
Projection and clustering of dataset pAdult with 1-of-
Although C1 and C3 all have class label “
Unlike Fig. 9, not many clusters can be identified in Fig. 10. As shown in Table 13, the largest cluster C2 has a ratio of 0.23 with class label “
To facilitate data analysis on mixed-type datasets involving categorical attributes, categorical values in the datasets must be handled properly. The traditional method 1-of-
Acknowledgments
This work is partially supported by the Ministry of Science and Technology, Taiwan, under grant MOST 109-2410-H-224-019, and by the “Intelligent Recognition Industry Service Center” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education (MOE) in Taiwan.
