Abstract
In recent years, the study of distance functions has been speedily developing, this motivated us to propose and improve former distance measure techniques. In traditional distance functions research, much has been done by many researchers in determining the similarity attributes of dataset; but few has attempted to combine two or more distance functions to enhance the accuracy, effectiveness, and efficiency in evaluating the performance of either the external or internal validity measures in K-Means clustering algorithms. Therefore, the paper proposes an improved approach to distance functions using K-Means clustering. We experimented with standard datasets from the UCI machine learning source and it was observed that the proposed approach performed better when compared to the traditional distance functions as shown by all the external validity measures results.
Introduction
Data clustering is a unique and dynamic area which has newly become an extremely active focus in data mining research [1]. It serves as the greater focal point for research in data mining, statistics, machine learning, spatial database technology, information retrieval, web search, biology, marketing, and other application areas. Clustering is an unsupervised classification method due to the absence of labeled information. Therefore, it is a form of learning by observation, rather than learning by examples [1]. Clustering partitions a set of objects that are related to one another within a group and unrelated to the objects in other groups [2, 1, 3]. A proper cluster is achieved when elements of a group are highly related to each other and are different from elements of other groups. This gives a procedure to describe a relation of data objects
Clustering is categorized into two methods: hierarchical and partitional [4]. The hierarchical clustering technique can be further categorized into either agglomerative or divisive, depending on what way the hierarchical breakdown is designed. Clusters in the agglomerative method are formed, bottom-up, while clusters in the divisive approach are formed, top-down. Examples of hierarchical methods include the single linkage, complete linkage, average linkage, median, and Ward [5]. Meanwhile, partitioning clustering methods accepts one-level separating on the data sets by assuming one point belongs to only one cluster. Therefore, each object is required to fit exactly into one cluster. The condition may be relaxed, as for example done in fuzzy the separating method [1], whereby, each point may be fractional and separated into different clusters. Examples of partitioning methods are
The rest of this paper is structured as follows. Section 2 presents the
K-Means clustering algorithms
The
The K-means clustering algorithm consist of three steps, which are iterated until convergence [9]. The iteration will stop when the clusters produced are stable, which means there are no more movement of objects crossing any group:
Step 1: Define the center point. Step 2: Define the distance of each object to the centers. Step 3: Cluster object based on minimum distance.
The algorithm works as follows. Firstly, the
Then, for the remaining objects, it assigns an object to the cluster to which it is best related, established by the Euclidean distance between the object and the cluster mean. The
In [1], it is stated that K-means algorithm at the beginning determines the centers of a cluster as the mean value of the points within clusters. The algorithm can be processed with two separate steps as [10] pointed out: The first step, selects certain numbers of clusters (assume k clusters) randomly of the data set with each cluster mean are fixed a priori. An arbitrary distance function is used to assign the remaining objects to the clusters to which they are most similar. Then, the second step, iteratively enhances continually the within-cluster dissimilarity, for each cluster, and calculates new mean using the objects assigned to the cluster in the preceding iteration. Therefore, all the objects are reassigned using the updated mean as the new group centers. The iteration continues until there is no change made in the cluster centroids.
Distance functions
In [11], the researcher presents distance in terms of its dissimilarity and similarity as:
Let
Conversely, a similarity (or proximity) function
One of the simplest measure to calculate the dissimilarity between two objects is called the Hamming distance [12]. It can work on dataset of any variety be it numerical, categorical or mixed data types. The distance can be computed with a few machine instructions per comparison.
Showing dataset
Showing dataset
The main objective of this paper is to propose a set of hybrid distance functions to enhance the performance of
Datasets
In studying the effect of the traditional and the proposed hybrid distance functions on the quality groupings of attributes using
Note that from Table 1, the attributes vary significantly in terms of their attributes type. The Iris dataset consists of real numbers, with the observations ranging from 0.1 to 7.9. Meanwhile, both Hayes-Roth and Tae datasets consist of integer numbers, where the Hayes-Roth dataset has a range of observations from 1 to 4 (human subjects classification, recognition and clustering), and the Tae dataset has observations ranging from 1 to 66 (evaluation of teaching performance).
Data preprocessing
Data preprocessing is very important, most especially for distance-based methods, before any data exploration algorithms can be applied [15, 1]. Data transformation such as normalization may increase the accuracy and efficiency of mining procedures that involve distance measurements [15, 1]. Data transformation is necessary to scale attributes to smaller intervals-like 0.0 to 1.0. This effort is to ensure that all attributes have an equal weight [1].
Data or feature normalization is carried out to approximate and equalize the range of features so the distances will have the same effect in the calculation of similarity [16]. The most important reason for normalization is that, the variables with great inconsistency will normally control the metric [17], because the direct presentation of geometric measures to attributes with greater intervals will indirectly allocate larger influences to the metrics to application of attributes with smaller intervals [18]. Hence, there is a need for data normalization to enforce the attributes to have a common value range.
There are many techniques for data normalization, among them are min-max,
According to [1], let
Min-max normalization reserves relations among the unique data values. It will meet an “out-of-bounds” error if a future input case for normalization falls outside of the unique data interval for
In
where,
Generally, let
In comparing the performance of the proposed distance functions, the
Manhattan distance
The Manhattan distance is computed as:
The Euclidean distance is computed as:
The Minkowski distance is computed as:
where
The Chebyshev distance can be computed as:
The Mahalanobis distance is computed as:
where
In [13], it is stated that in recent-years, a number of new distance functions were proposed-in an effort to enhance the performance of clustering algorithms. For examples, [27], proposed new distance measure for mixed data clustering using supervised and unsupervised information [28], also, proposed a mathematical formulation to combine the distances between the single components of the data attribute directions in a single distance measure. This paper proposes a distance functions called Improved Distance I (ID I).
Improved distance I
The major aim of this hybrid is to obtain a purposeful metric computation with definite goal to gain a suitable distance (or similarity) function. Generally, the assignment is to describe a function similarity
The Improved distance I presented in this paper combines the effectiveness of Manhattan and Chebyshev distance functions in clustering analysis. The ideas were adopted from [27, 28], and is defined as:
It is being merged by adding the two functions and dividing by the average. However, the Manhattan distance calculates the absolute differences between coordinates of pair of objects and subsequently, calculates Chebyshev distance known as the maximum value computed as the absolute magnitude of the differences between coordinates of a pair of objects; add the two functions computed and divide by two.
Thus, the hybrid method accomplishes the axioms characteristics of the combined methods to form a distance function as being stated in Section 2.1 and as given in [11].
Let
The
To evaluate and analyze the performance of
Purity
The purity is calculated; when the cluster is allocated to the most related class in the group. In addition, accuracy of the search allocation is stated by totaling all correct numbers of the allocated points, divided by
where;
The rate of the purity interval value is from 0 to 1.
The Rand index is used to identify and compare whether pairs of objects are identically assigned to two partitions. Therefore, the index will confirm the assigned pair of objects if identically considered in relation to facts in the same cluster;
MM: If the two objects have the same label and are in the same cluster,
MN: If they have the same labels and are in different clusters,
NM: It corresponds to a pair of objects having different labels, but in the same cluster, and
NN: It corresponds to a pair of objects having the different labels, but also being in two different clusters.
The Rand index Rand [32, 33, 27] is computed based on the measures using above definitions as given in Eq. (12):
Assuming that
The Jaccard index measures the similarity between objects and is computed as given in [34]:
where
The Fowlkes-Mallows index calculates the likeness between the clusters and give feedback by clustering algorithm. It is computed as defined in [35]:
where
Combination of precision and recall concepts from facts recovery, we computes precision and recall of cluster for each class:
The F-Measure derived from a more general relationship called
where
Showing the results of distance functions and external validity measures
NaN
The F-Measure with an interval values of [0, 1], indicates greater values for higher clustering feature.
The F-Measure is defined as the harmonic mean of precision and recall as in [35]:
This particular measure is insignificant whenever one of the two indices is insignificant. The value of F increases proportionally to the increase of precision and recall, a greater value of F-Measure indicates that the cluster performs better on the positive class.
The F-Measure with an interval values of [0, 1], indicates larger values for higher clustering features.
Defined as the percentage of all positive objects contained in a cluster as given by [35]:
where TP is the number of points of positive class contained in cluster
Is the percentage of all negative objects contained in a cluster and is defined as in [35]:
where TN is the number of points of negative class
Defined as the percentage of cluster that contains positive objects as given in [34, 25]:
where TP is the number of points of positive class
The Geometric-Mean is the product of the clustering accuracies for both classes based on Eq. (17), which is the accuracy of the positive clusters and Eq. (18), which is the accuracy on the negative clusters. The mean is defined as in [36]:
The geometric-mean is obtained by multiplying sensitivity and specificity, and taking the square root.
From the results in Table 2, and based on the datasets investigated, the highest performance of external validity measures resulted from the Rand index across all three datasets (Iris, Hayes-Roth, and Tae), while the Jaccard index recorded the lowest performance for the same datasets. Those affected with NaN (not a number) under external validity measures were Fowlkes-Mallow Index, F-Measure (
In generally, the results from the proposed distance functions showed better performance as compared to the traditional distance functions. In addition, dataset consisting of real numbers such as in the Iris dataset produced higher quality clusters in terms of external validity measurement index and was not affected with NaN as compared to the Hayes-Roth and Tae-datasets, each consisting of integer-based features.
Conclusions
This paper presented an improved distance function in
marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buy records; financial: forecasting stock market, currency exchange rate, bank bankruptcies, understanding and managing financial risk, trading futures, credit rating; biology: classification of plants and animals given their features; libraries: book ordering and stock management; insurance: identifying groups of motor insurance policy holders with a high average claim cost and identifying frauds; city-planning: identifying groups of -houses according to their house type, value and geographical location; earthquake studies: clustering observed earthquake epicenters to identify dangerous zones; as well as the world wide web (WWW): document classification; or clustering web log data to discover groups of similar access patterns.
In future work, our main objective is to design and propose an algorithm to implement the new approaches in some distance functions to enhance
