Abstract
Distance Measuring between two mixed data objects is the basis of many learning algorithms. The complex relevance between heterogeneous – various types/scales – attributes has a significant influence on the measured results. In this paper, we propose an End-to-End Distance Measuring method for mixed data based on deep relevance learning, called E
Introduction
Measuring distance between two objects is a key embedded step of several learning algorithms, including behavior analysis [6], document analysis [25], and image analysis [29]. As the basis of these algorithms, distance metric directly influences the performance of the whole learning task. Real-world data objects are usually represented by a vector with mixed – numerical and categorical – attributes in diverse domains, including finance [7, 1] and intrusion detection [34, 13]. Therefore, it becomes a key research issue to present more appropriate distance metrics for mixed data.
Distance measures of categorical and numerical data are usually distinct. As distance measuring of numerical data is simple and accessible, many widely used distance measures are proposed, such as the commonly used Euclidean and Manhattan distances. In contrast, the distance of categorical data is not straightforward, since the different values of a categorical attribute may not be inherently ordered or comparable. Although there may be no inherent order in categorical data, other factors like matching statistics and frequency distribution exist and thus indicate distance, such as Hamming distance, the Inverse Occurrence Frequency (IOF) and Occurrence Frequency (OF). These distance measures only capture the characteristics within an attribute but ignore the relationships between attributes. The attributes cannot be analyzed in an univariate manner since they are usually related with each other. Simply combining the distance of categorical data and numerical data without considering the relevance leads to poor performance [16].
A motivating toy example
Taking a fragment of a clothes size as an example (Table 1), six people are characterized by two categorical attributes (“Gender”, “Age”) and two numerical attributes (“Height” and “Weight”). They are divided into two classes (“Size”:
A fragment of an Asian clothes size table
A fragment of an Asian clothes size table
Differences between existing data transformation based distance measuring method and the proposed end-to-end distance measuring method.
The above example shows that it is problematic to measure the distance for mixed data without considering the relevance between attributes. However, heterogeneity in mixed attributes makes the relevance learning a difficult task. Heterogeneity refers to that data scales/types differ greatly among mixed attributes (e.g., “Gender” and “Height”), which indicates that mixed data cannot be directly manipulated with traditional methods like pure numerical or categorical data. Moreover, the distance measuring for categorical and numerical data are totally different. It is a significant research problem on how to integrate distinct mechanisms of distance computation while handling heterogeneity and relevance simultaneously in an unsupervised manner.
To learn the relevance in mixed data, existing methods usually transform the mixed data to uniform (categorical/numerical) data (shown in Fig. 1a). This includes either converting categorical values into numerical values or discretizing continuous attributes into categorical attributes. Many methods attempt to assign numerical values to categorical attributes, such as one-hot [3], if-idf [3] and CDE [22]. The basic operation of them is to transform the categorical attributes into a set of binary attributes.
In this way, the attribute spaces are totally changed and the meanings of attributes are lost, which leads to a confusion in understanding the new attributes.
An alternative idea is to discretize numerical attributes. When discretizing “Height”, 1.68 m and 1.73 m may be put in the same bin or two adjacent bins, depending on the granularity of intervals used. It will inevitably end up in different results. Thus how to set proper discretization intervals is the key step. In unsupervised learning, a variety of discretization methods have been developed for numerical data [24], such as EqualWidth [36], PKID [37] and MVD [4]. Existing methods [19, 21, 20] directly apply discretizers to mixed data without considering relevance between attributes, which leads to an information loss and poor performance. Therefore, the calculation of distance mixed data is not as intuitive as it appears and the relevance within and between attributes has to be appropriately considered.
In this paper, we propose an End-to-End Distance Measuring method for mixed data based on deep relevance learning, called E
Distances between attribute values are to some degree semantically meaningful, which indicates that relevance analysis only based on co-occurrence frequency is too limited. We first map the numeric values into several intervals through a set of cut points for numeric attributes. Then we calculate the external relevance (i.e. the relevance between attributes) influenced distance based on the divergence between co-occurrence distributions. The co-occurrence distribution is represented as a weighted point cloud of attribute values. The distance between two co-occurrence distribution A and B, which we call the Co-occurrence Mover’s Distance (CMD), is the minimum cumulative distance that attribute values from A need to travel to match the point cloud of B. An iterative computing process based on CMD is conducted to capture the deep relevance. Simultaneously, we rectify the cut points by using a Frobenius-norm deviation measure as its objective function. Finally, the distance for numerical attribute values is refined based on the original values and fallen bin centers. The whole learning process is based on the given data without any domain knowledge.
To summarize, we make the following major contributions:
We propose a novel distance measuring method E We prove that our algorithm converges and the proposed distance satisfies the axioms of a metric, which allows that it can be used in multiple application scenarios theoretically and practically.
Substantial experiments on a number of real-world datasets across various domains show that E
The remainder of the paper is organized as follows. In Section 2, we discuss the related work. The problem is specified in Section 3. Section 4 introduces the details of E
Some methods simply combine distance measures computed separately on numerical and categorical attributes, such as k-prototypes [17, 18], K-means-mixed [2], CAVE [15], and INTEGRATE [5]. k-prototypes [17, 18] is an extension of k-means combining the Euclidean Distance on numerical attributes and matching distance on categorical attributes. K-means-mixed [2] is also based on the k-means paradigm and combines distance measures computed separately on numeric attributes and categorical attributes. Unlike k-prototypes, kmeans-mixed does not assume a binary or a discrete measure between two distinct categorical attribute values but computes the distance as a function of their overall distribution and co-occurrence with other categorical attributes. CAVE [15] uses variance to measure the similarity of the numeric part of the data and computes the similarity of the categorical part based on entropy weighted distances in the hierarchies. INTEGRATE [5] applies ideas from information theory to implement the k-means paradigm. It models both numerical and categorical attributes with their probability distributions and minimizes a cost function based on the Minimum Description Length principle for clustering. This kind of independent computing mechanism fails to capture the relevance between attributes and leads to poor performance.
For relevance learning on mixed data, a few methods make a data transformation and then learn attribute relevance on the transformed data. Embedding methods first attempt to transform the categorical attributes into a set of numerical attributes, i.e., one-hot [3] or if-idf [3]. Then the numerical vectors can be directly manipulated per algebraic operations to capture the relevance between attributes. For example, CoupledMC [32] first convert the categorical value to (0, 1) vector by one-hot encoding and learn the relevance based on Pearson Correlation analysis. In this way, the attributes are totally changed and the meaning of attribute is lost, which leads to a confusion in understanding the new attributes. Other methods are proposed based on the discretization process [19, 21, 20], which is another classic method for transforming mixed data [14]. For example, SpectralCAT [20] discretizes numerical features before spectral clustering using an adaptive Gaussian kernel. Fuzzy centroid is adopted in [19] to represent the prototype of a cluster and a new measure is proposed to evaluate the distance between data objects. The distance measure is calculated from a probabilistic perspective in [21]. These methods only focus on the frequency and the semantic distance between attribute values is ignored. Most of them consider the relevance between categorical and numerical features based on the co-occurrence frequency between discretized numerical and categorical attributes. The data transformation process and relevance learning are handled independently.
There are also some clustering methods specially designed for mixed data without distance measuring, including AL [26], EGMCM [30]. AL learns affinities between data points and exploit the learned affinities for clustering. EGMCM models a wide range of dependencies beyond Gaussian mixture copulas captured by meta-Gaussian distributions. These methods have totally different goals with our method.
Preliminaries and problem statement
A set of mixed objects is organized as an information table
these OIFs describe attribute values occurrence frequency. For instance, in Table 1, the Occurrence Information Function for value
where
Accordingly, our goal can be described as: to evaluate the distance between two objects
The E
The proposed E
The internal relevance learning captures the intrinsic interactions between the values of each attribute, which enables a proper estimation of the value distance in data. As the attributes are relevant with each other, external relevance is thus learned by evaluating the co-occurrence distribution divergence. Dynamic numerical data mapping builds a bridge between discrete and continuous attributes with relevance considered. Such learning process aggregates the complex relevance and offers an effective distance measuring.
Learning internal relevance
The internal relevance learning process focuses on the influence within the attribute itself. For each numerical attribute
where
In the following relevance learning process, values for numerical attribute are assigned as the bin that the original numerical value falls into. That is to say,
In 1998, Lin [27] presented a powerful distance theorem, which points that the distance between
According to [12, 33], the discrepancy in attribute value occurrence times reflects the value distance in term of frequency distribution. It reveals that smaller distance is assigned to the attribute value pair which approximately owns equal frequencies. The higher these frequencies are, the closer the two values are. Inspired by these, we propose Internal Relevance Influenced Distance to satisfy the above principles.
The above equation is an instance of Eq. (4). Here we define the commonality as “the value of attribute
Then we construct the internal relevance influenced value distance matrix
where
Since the internal relevance learning does not involve the relevance between attributes when calculating the value distance for each attribute, the external relevance learning aims to capture the interactions with other attributes. We regard this interaction between attributes as external relevance influenced distance in terms of the distribution comparisons. For this, we discuss how the relevance influences the value distance by proposing Co-occurrence Mover’s Distance (CMD).
When calculating the influence of attribute
Value travel cost. The goal is to incorporate the semantic distance between individual values pairs (e.g., Male and Female) into the co-occurrence distribution distance. Specially, the value cost matrix
Co-occurrence distribution distance. The “travel cost” between two attribute values is a natural building block to create a distance between two co-occurrence distribution. First, we allow each value in
Transportation problem. Formally, the minimum cumulative cost of moving
The above optimization is a variant of the earth mover’s distance metric (EMD) [31], a well studied optimal transportation problem for which specialized solvers have been developed [25, 23]. According the properties of EMD, it can be proved that CMD is a true metric (Section Theoretical Analysis).
We use normalized mutual information [11]
where
Thus we can update our external distance matrix:
CMD has several intriguing properties: (i) it is highly interpretable as the distance between two co-occurrence distribution takes the difference of attribute values into account; (ii) it gets optimized as the value distance updates, this iterative computing process can mine high-order relevance among multiple attributes; (iii) it is hyper-parameter free and straight-forward to understand and use; (iv) it naturally incorporates the properties in EMD and leads to a true distance metric.
In the mapping process, the gaps for bin centers are calculated by provided distance. As the external relevance learning process updates the attribute value distance matrix
The two constraint inequations are the conditions that valid intervals boundaries need to be satisfied. It is a quadratic function and can be easily solved. Accordingly, we obtain the new interval cut points with the new bin boundaries as follows:
here
Algorithm 1 presents the main procedures of the whole learning process. The first step is to generate the initialized bin boundaries with EqualWidth. For each attribute
Iterative relevance learning algorithm[1] Information Table
After this whole learning process, we get the iterative relevance influenced distance for categorical attributes values and numerical bin values. Considering that the original numerical values are different with the bin values, we refine the distance numerical values based on original values and the bin value they fall into for each numerical attribute
here
Finally, we can define iterative relevance influenced distance between objects as follows:
Specially, this calculating process is based on the original values for all the numerical values. Considering that the relevance between attributes has been taken account into the learning process, we can assume that external relevance influenced distances are independent with each other. Therefore, we set
Considering that the convergence of the algorithm can guarantee that the proposed method works and a true distance metric would be much more meaningful in real world applications, we give both the proofs in the following.
Convergence proof
Based on Eq. (8), the variance for external relevance influenced distance between
As
True metric proof
Let
where
Let
The sum of each of the three vectors equals
Rubner et al. [31] proved that the triangle inequality holds for EMD when the ground distance is a metric and the histograms are normalized.
To prove the above theorem, we should prove that
It can be easily derived that the internal relevance distance satisfies conditions (1)–(3), we only need to prove that it satisfies condition (4), which can be proved with mathematical induction. For each attribute
According to the constructing process of internal relevance influence distance, we can obtain that According to Lemma 1, we have:
Based on Eq. (13), we can easily obtain the theorem.
Experiments settings
Datasets
Eight datasets obtained from the UCI repository are used to evaluate the performance. The detailed information of these data sets is summarized in Table 2.
Parameters settings
Two parameters need to be specified for the experiments: the mapping bin number
The detailed information of data sets
The detailed information of data sets
To investigate the effectiveness of the unsupervised distance metric for the mixed data proposed in this paper, four different kinds of experiments have been conducted as follows.
To evaluate its performance in clustering, we feed E To investigate its effectiveness in cluster structure analysis, we specify the internal data structure of different distance measures with given labels. To test the ability in handling heterogeneity, we compare clustering performance of E To examine the efficiency of E
Clustering performance of algorithms SpectralCAT, CoupledMC, EGMCM and E
Obtaining better performance by combining with clustering methods
We use three state-of-the-art algorithms on mixed data, i.e., SpectralCAT [9], CoupledMC [32] and EGMCM [30], as comparisons. SpectralCAT is chosen as the representative method of simply combining discretizing and co-occurrence analysis. CoupleMC is a state-of-the-art embedding method which is based on one-hot conversation. And EGMCM is a state-of-the-art clustering method for mixed data.
Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI) [10] are used for evaluation, whose higher values indicate better clustering performance. Table 3 reports the results. The two highest measure scores of each experimental setting are highlighted in boldface. It is interesting to note that almost every method performs better than at least one of other methods on one or two datasets. This reflects the difficulty in effectively capturing the intrinsic data characteristics in mixed data and the significant challenge in designing appropriate and generalized distance measures for mixed data. This table shows that E
Clustering performance of algorithms DRL
EqualWidth, DRL
EqualFrequency, DRL
PKID (D
EW, D
EF, D
P for short, respectively) and E
DM; UCI datasets details in Table 2
Clustering performance of algorithms DRL
Data structure index comparisons on 8 data sets.
It is known that a cluster partition on a data set is to ensure that the distances between objects in the same cluster are low while the distances between objects in different clusters are high. Relative Distance (RD), Davies-Bouldin Index (DBI), Dunn Index (DI) are three of the most popular evaluation measures [32, 38]. Since internal criteria seek clusters with low intra-cluster distance and high inter-cluster distance, distance metrics that produce clusters with high RD or DI and low DBI are more desirable. During the experiments, we find that there are several noise objects in the datasets, which refers to that two objects with almost identical attributes values have different labels. This leads to a poor judgment of DI. Here we only report the results of RD and DBI.
Our method is evaluated against three methods: original objects as the baseline, one-hot encoding and coupledMC. The cluster structures produced on 8 data sets are shown in Fig. 3. With the exception of only a few items, E
Convergence test on datasets zoo and abalone.
To test the ability in handing the heterogeneity, we combine iterative relevance learning with three outlying data discretization methods, i.e., EqualWidth, EqualFrequency and PKID, as competitors. We apply the distance metric obtained by E
Achieving an acceptable convergence speed
The maximal distance matrix deviation among all the attributes varies with representations from each iteration. Here, we report the convergence speed of the maximal distance matrix deviation on datasets zoo and abalone. Similar results can be obtained on other datasets. As Fig. 4 shows, the clustering performance converges within 20 iterations.
Conclusion and future work
We have proposed E
A limitation of CMD is that it is computationally challenging just like EMD. Thus several optimization methods are proposed on EMD. We are planning to immigrate these method into E
Footnotes
Acknowledgments
This work was supported by the National Key Research and Development Program of China (2016 YFB1000101), the National Natural Science Foundation of China (Grant No. 61379052), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. 14JJ1026), the Science Foundation of Ministry of Education of China (Grant No. 2018A02002).
