Hybrid distance functions for K-Means clustering algorithms

Abstract

In recent years, the study of distance functions has developed rapidly, which motivated us to revisit and improve existing distance measures. Much research on traditional distance functions has focused on determining the similarity between the attributes of a dataset, but few studies have attempted to combine two or more distance functions to improve the accuracy, effectiveness, and efficiency of K-Means clustering as assessed by external or internal validity measures. This paper therefore proposes an improved, hybrid approach to distance functions for K-Means clustering. We experimented with standard datasets from the UCI Machine Learning Repository and observed that the proposed approach performed better than the traditional distance functions, as shown by the results of the external validity measures.

Keywords

Hybrid measures external clustering K-Means algorithms

1. Introduction

Data clustering is a dynamic area that has recently become a highly active focus of data mining research [1]. It is a focal point for research in data mining, statistics, machine learning, spatial database technology, information retrieval, web search, biology, marketing, and other application areas. Clustering is an unsupervised classification method because no labeled information is available. It is therefore a form of learning by observation rather than learning by examples [1]. Clustering partitions a set of objects into groups such that the objects within a group are related to one another and unrelated to the objects in other groups [1, 2, 3]. A proper clustering is achieved when the elements of a group are highly related to each other and differ from the elements of other groups. This can be described as a mapping $f:D\to C$ from data $D=\left\{{d_{1},d_{2},\ldots,d_{n}}\right\}$ to clusters $C=\left\{{c_{1},c_{2},\ldots,c_{k}}\right\}$ based on the similarity between the objects $d_{i}$.

Clustering methods fall into two categories: hierarchical and partitional [4]. Hierarchical clustering can be further divided into agglomerative and divisive methods, depending on how the hierarchical decomposition is formed. Clusters in the agglomerative method are formed bottom-up, while clusters in the divisive approach are formed top-down. Examples of hierarchical methods include single linkage, complete linkage, average linkage, median, and Ward [5]. Partitional clustering methods, by contrast, produce a one-level partition of the data set in which each object belongs to exactly one cluster. This condition may be relaxed, as is done for example in fuzzy partitioning methods [1], where each point may belong fractionally to several clusters. Examples of partitioning methods are $k$-means, $k$-modes, adaptive $k$-means, $k$-medians, $k$-medoids, and fuzzy $C$-means.

The rest of this paper is structured as follows. Section 2 presents the $K$ -Means clustering that will be the central algorithm in this study. Section 3 presents the materials and methods, which covers the datasets, preprocessing, validity measures, and most importantly the new distance functions. Section 4 presents the experimental analysis and discussion. Finally, Section 5 concludes the paper and presents the possibility for future research work.

2. K-Means clustering algorithms

$K$-Means is a widely applied clustering method, first introduced in [6] and further promoted in [7, 8]. These researchers pointed out that $K$-Means is a simple and commonly used algorithm for generating clusters by iterating principle functions, defined either globally or locally. $K$-Means is also straightforward to program and computationally economical, so it is feasible to process large samples on a digital computer.

The $K$-Means clustering algorithm consists of three steps, which are iterated until convergence [9]. The iteration stops when the clusters produced are stable, that is, when no object moves from one group to another:

Step 1: Define the center point.

Step 2: Define the distance of each object to the centers.

Step 3: Cluster the objects based on minimum distance.

The algorithm works as follows. First, $K$-Means defines the center of a group as the mean value of the points within the cluster. To initialize the centers, the algorithm randomly selects $k$ of the objects in $D$, each of which represents an initial cluster mean (or center).

Then, each of the remaining objects is assigned to the cluster to which it is most closely related, as determined by the Euclidean distance between the object and the cluster mean. The $K$-Means algorithm optimizes the within-cluster dissimilarity. It computes a new mean for each cluster from the objects allocated to the cluster in the preceding iteration, and all objects are then reassigned using the updated means as the current cluster centroids. The iterations continue until the assignment no longer changes, that is, until the clusters formed in the current round are the same as those formed in the previous round.

In [1], it is stated that the $K$-Means algorithm first defines the center of a cluster as the mean value of the points within it. As pointed out in [10], the algorithm proceeds in two steps. The first step randomly selects a certain number of objects (say $k$) from the data set, each serving as an initial cluster mean fixed a priori, and uses an arbitrary distance function to assign the remaining objects to the clusters to which they are most similar. The second step iteratively reduces the within-cluster dissimilarity: for each cluster, it calculates a new mean from the objects assigned to the cluster in the preceding iteration, and all objects are then reassigned using the updated means as the new cluster centers. The iteration continues until the cluster centroids no longer change.
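The iteration described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the function names, the optional `init_centers` argument, the `max_iter` cap, and the choice to keep the old center when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def k_means(data, k, distance=euclidean, init_centers=None, max_iter=100, seed=0):
    """Return (centers, labels) once the assignment stops changing."""
    rng = random.Random(seed)
    # Step 1: define the centers (k randomly chosen objects unless given).
    centers = list(init_centers) if init_centers is not None else rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 3: compute distances and assign each object
        # to the nearest center.
        new_labels = [
            min(range(k), key=lambda j: distance(p, centers[j])) for p in data
        ]
        if new_labels == labels:  # no object changed cluster: stable
            break
        labels = new_labels
        # Recompute each center as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old center
                centers[j] = mean_point(members)
    return centers, labels


data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 9.0), (9.0, 10.0)]
centers, labels = k_means(data, 2, init_centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the `distance` argument is a plain function, any of the distance functions discussed in this paper can be passed in its place. The result depends on the initial centers, so a different random initialization may converge to a different partition.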

2.1 Distance functions

In [11], distance (dissimilarity) and similarity functions are defined as follows:

Let $X$ be a set. A function $\delta:X\times X\to\Re$ is a distance (or dissimilarity) on $X$ if, $\forall x,y\in X$ , it satisfies the following three conditions:

$\delta\left({x,y}\right)\geqslant 0$ (non-negativity);

$\delta\left({x,y}\right)=\delta\left({y,x}\right)$ (symmetry);

$\delta\left({x,x}\right)=0$ ;

Conversely, a similarity (or proximity) function $\sigma:X\times X\to\Re$ on $X$ must satisfy the following three conditions, $\forall x,y\in X$ :

$\sigma\left({x,y}\right)\geqslant 0$ (non-negativity);

$\sigma\left({x,y}\right)=\sigma\left({y,x}\right)$ (symmetry);

$\sigma\left({x,y}\right)\leqslant\sigma\left({x,x}\right)$ and $\sigma\left({x,y}\right)=\sigma\left({x,x}\right)\Leftrightarrow x=y$ .

One of the simplest measures of the dissimilarity between two objects is the Hamming distance [12]. It works on datasets of any type, whether numerical, categorical, or mixed. The distance can be computed with a few machine instructions per comparison.
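As a brief illustration, the Hamming distance simply counts the positions at which two objects differ. The function name and the length check below are choices made for this sketch.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


# Works for mixed numerical and categorical attributes alike.
print(hamming_distance((1, "red", 3.5), (1, "blue", 3.5)))  # 1
print(hamming_distance("karolin", "kathrin"))               # 3
```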

Table 1
Datasets used in the experiments

	Iris	Hayes-Roth	Tae
No. of instances	450	480	453
No. of attributes	4 (sepal length, sepal width, petal length, petal width)	4 (hobby, age, educational level, marital status)	5 (native, instructor, course, semester, size)
No. of classes	3 (setosa, versicolour, virginica)	3 (membership in Club 1, membership in Club 2, or membership in neither Club)	3 (low (1), medium (2), and high (3))

3. Materials and methods

The main objective of this paper is to propose a hybrid distance function that enhances the performance of the $K$-Means clustering algorithm, as assessed by various external validity measures.

3.1 Datasets

To study the effect of the traditional and the proposed hybrid distance functions on the quality of the groupings produced by $K$-Means, three benchmark datasets from the UCI Machine Learning Repository [14] were chosen. They are listed in Table 1.

Note from Table 1 that the datasets differ significantly in attribute type. The Iris dataset consists of real numbers, with observations ranging from 0.1 to 7.9. The Hayes-Roth and Tae datasets both consist of integers: the Hayes-Roth dataset has observations ranging from 1 to 4 (human subjects classification, recognition and clustering), and the Tae dataset has observations ranging from 1 to 66 (evaluation of teaching performance).

3.2 Data preprocessing

Data preprocessing is very important, especially for distance-based methods, and must be carried out before any data exploration algorithm is applied [15, 1]. Data transformation such as normalization may increase the accuracy and efficiency of mining procedures that involve distance measurements [15, 1]. It scales the attributes to a smaller interval, such as 0.0 to 1.0, to ensure that all attributes have equal weight [1].

Data or feature normalization is carried out to approximately equalize the ranges of the features so that they have the same effect in the calculation of similarity [16]. The most important reason for normalization is that variables with large variability will otherwise dominate the metric [17], because applying geometric measures directly to attributes with larger ranges implicitly gives those attributes more influence on the metric than attributes with smaller ranges [18]. Hence, data normalization is needed to force the attributes onto a common value range.

There are many techniques for data normalization, among them are min-max, $z$ -score, and decimal scale [15, 19, 1, 9]. However, there is no definite rule for normalizing the dataset. Thus, the choice of a specific normalization rule is mostly left to the choice of the researcher [19].

According to [1], let $A$ be a numeric attribute with $n$ observed values $v_{1},v_{2},\ldots,v_{n}$ . Min-max normalization performs a linear transformation on the original data. Let $\min A$ and $\max A$ be the minimum and maximum values of attribute $A$ . Min-max normalization maps a value $v$ of $A$ to ${v}^{\prime}$ in the range [0, 1] by calculating Eq. (1):

$\displaystyle{v}^{\prime}=\frac{v-\min A}{\left({\max A-\min A}\right)}$ (1)

Min-max normalization preserves the relationships among the original data values. It will encounter an “out-of-bounds” error if a future input case for normalization falls outside the original data range of $A$ .
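A small sketch of Eq. (1) makes the out-of-bounds behaviour concrete. The helper names `min_max_fit` and `min_max_scale` are hypothetical, and rejecting a constant attribute is an assumption of the example.

```python
def min_max_fit(values):
    """Return (min A, max A) for the observed values of an attribute."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("attribute is constant; Eq. (1) is undefined")
    return lo, hi


def min_max_scale(v, lo, hi):
    """Eq. (1): map v linearly so that lo -> 0 and hi -> 1."""
    return (v - lo) / (hi - lo)


observed = [2.0, 4.0, 6.0, 10.0]
lo, hi = min_max_fit(observed)
print([min_max_scale(v, lo, hi) for v in observed])  # [0.0, 0.25, 0.5, 1.0]

# A later value outside the original range falls outside [0, 1].
print(min_max_scale(12.0, lo, hi))  # 1.25
```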

In $Z$-score normalization, the values of an attribute $A$ are normalized using the mean and standard deviation of $A$ . A value $v$ of $A$ is normalized to ${v}^{\prime}$ by calculating Eq. (2):

$\displaystyle{v}^{\prime}=\frac{v-\bar{{A}}}{\sigma_{A}}$ (2)

where $\bar{{A}}$ and $\sigma_{A}$ are the mean and standard deviation, respectively, of attribute $A$ . This method of normalization is useful when the actual minimum and maximum of attribute $A$ are unknown, or when there are outliers that dominate the min-max normalization [19].
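Eq. (2) can be sketched with the standard library as follows. The text does not say whether the population or the sample standard deviation is intended; the example assumes the population form (`statistics.pstdev`).

```python
import statistics


def z_score_normalize(values):
    """Eq. (2): subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        raise ValueError("attribute is constant; Eq. (2) is undefined")
    return [(v - mean) / std for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2
print(z_score_normalize(values))
```

After normalization the values have mean 0 and unit standard deviation, regardless of the original range.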

3.3 Matrix of measurements

In general, the data for cluster analysis are represented by an $n\times d$ numerical matrix $X$ , which holds the values of the $d$ variables for each of the $n$ objects under study [20, 21, 9]:

$\displaystyle X=\begin{pmatrix}x_{11}&x_{12}&\cdots&x_{1d}\\ \vdots&\vdots&\ddots&\vdots\\ x_{n1}&x_{n2}&\cdots&x_{nd}\end{pmatrix}$ (3)

3.4 Distance functions

To compare the performance of the proposed distance function, the $K$-Means clustering algorithm is also executed with several traditional distance functions: the Manhattan distance [22, 10], the Euclidean distance [22, 23, 10], the Minkowski distance [24, 25], the Chebyshev distance [22], and the Mahalanobis distance [26].

3.4.1 Manhattan distance

The Manhattan distance is computed as:

$\displaystyle d_{\textit{man}}\left({x_{i},x_{j}}\right)=\sum\limits_{k=1}^{d}{\left|{x_{ik}-x_{jk}}\right|}$ (4)

3.4.2 Euclidean distance

The Euclidean distance is computed as:

$\displaystyle d_{\textit{euc}}\left({x_{i},x_{j}}\right)=\sqrt{\sum\limits_{k=1}^{d}{\left|{x_{ik}-x_{jk}}\right|^{2}}}$ (5)

3.4.3 Minkowski distance

The Minkowski distance is computed as:

$\displaystyle d_{\textit{min}}\left({x_{i},x_{j}}\right)=\sqrt[r]{\sum\limits_{k=1}^{d}{\left|{x_{ik}-x_{jk}}\right|^{r}}}$ (6)

where $r$ , called the order of the Minkowski distance, is taken here as the sample size of each dataset.

3.4.4 Chebyshev distance

The Chebyshev distance can be computed as:

$\displaystyle d_{\textit{che}}\left({x_{i},x_{j}}\right)=\max\limits_{k}{\left|{x_{ik}-x_{jk}}\right|}$ (7)
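The coordinate-wise distances of Eqs. (4) to (7) can be written directly from their definitions. The sketch below uses plain Python sequences and illustrative function names; the order `r` is passed explicitly to the Minkowski function.

```python
import math


def manhattan(x, y):
    """Eq. (4): sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Eq. (5): square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def minkowski(x, y, r):
    """Eq. (6): r-th root of the sum of r-th powers of the differences."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)


def chebyshev(x, y):
    """Eq. (7): largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)  # differences: 3, 4, 0
print(manhattan(x, y))   # 7.0
print(euclidean(x, y))   # 5.0
print(chebyshev(x, y))   # 4.0
```

With `r = 1` the Minkowski distance coincides with the Manhattan distance, and with `r = 2` it coincides with the Euclidean distance.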

3.4.5 Mahalanobis distance

The Mahalanobis distance is computed as:

$\displaystyle d_{\textit{mah}}({x_{i},x_{j}})=\sqrt{({x_{i}-x_{j}})S^{-1}({x_{i}-x_{j}})^{T}}$ (8)

where $S$ is the within-group covariance matrix.

3.5 Proposed hybrid distance functions

In [13], it is stated that in recent years a number of new distance functions have been proposed in an effort to enhance the performance of clustering algorithms. For example, [27] proposed a new distance measure for mixed data clustering using supervised and unsupervised information, and [28] proposed a mathematical formulation that combines the distances between the single components of the data attribute directions into a single distance measure. This paper proposes a distance function called Improved Distance I (ID I).

3.5.1 Improved distance I

The aim of this hybrid is to obtain a purposeful metric computation, with the definite goal of gaining a suitable distance (or similarity) function. In general, the task is to define a function $\textit{similarity}\left({X,Y}\right)$ , where $X$ and $Y$ are two objects of a given class and the value of the function indicates the degree of similarity between the two. Formally, a distance function is a function with non-negative real values defined on the Cartesian product $X\times X$ of a set $X$ . It is called a metric on $X$ if, for each $x,y,z\in X$ , it satisfies the conditions stated in Section 2.1 and given in [11].

The Improved Distance I presented in this paper combines the Manhattan and Chebyshev distance functions for cluster analysis. The idea was adopted from [27, 28], and the function is defined as:

$\displaystyle d_{\textit{imdI}}({x_{i},x_{j}})=\frac{\sum\limits_{k=1}^{d}{|{x_{ik}-x_{jk}}|}+\max\limits_{k}{|{x_{ik}-x_{jk}}|}}{2}$ (9)

The two functions are merged by adding them and dividing by two, that is, by taking their average. The Manhattan distance sums the absolute differences between the coordinates of a pair of objects, while the Chebyshev distance is the maximum absolute difference between the coordinates of the pair; the two computed values are added and the sum is divided by two.

Thus, the hybrid inherits the axiomatic properties of the combined functions and forms a distance function as stated in Section 2.1 and given in [11].
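Read as the average of the Manhattan and Chebyshev distances, Improved Distance I can be sketched as below. The function name is illustrative, and the short checks at the end only spot-check non-negativity, symmetry, and the zero self-distance on sample points; they are not a proof of the axioms.

```python
def improved_distance_i(x, y):
    """Average of the Manhattan and Chebyshev distances between x and y."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    manhattan = sum(diffs)   # sum of absolute coordinate differences
    chebyshev = max(diffs)   # largest absolute coordinate difference
    return (manhattan + chebyshev) / 2.0


x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)  # differences: 3, 4, 0
print(improved_distance_i(x, y))  # (7 + 4) / 2 = 5.5

# Spot checks of the conditions of Section 2.1 on these two points.
assert improved_distance_i(x, y) >= 0
assert improved_distance_i(x, y) == improved_distance_i(y, x)
assert improved_distance_i(x, x) == 0
```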

3.6 K-Means clustering

Let $D$ be a set of data with $n$ objects $x_{i}$ , $i=1,2,\ldots,n$ , which are to be separated into $K$ clusters, and let $C_{j}$ , $j=1,2,\ldots,K$ , denote the cluster centers. The objective of the algorithm is to minimize an objective function based on an arbitrary distance function, here the squared error between a point $x_{i}$ and the center of the group $j$ to which it is assigned, so that each object contributes only to the term of its own cluster. The cost function is therefore defined as:

$\displaystyle J=\sum\limits_{j=1}^{K}{\sum\limits_{i=1}^{n}{\left\|{x_{i}-C_{j}}\right\|^{2}}}$ (10)

$K$-Means is a partitioning rule that assigns each point to a center: the distance between each object and its cluster centroid is computed and squared, and the squared distances are summed. Minimizing the cost function makes the resulting clusters as compact and as separate as possible, as given in [1, 10].
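The cost of a given partition can be computed as sketched below, assuming that each object contributes only the squared Euclidean distance to the center of the cluster it is assigned to. The function name and the argument layout (one label per object, one center per cluster) are choices made for the example.

```python
def within_cluster_sse(data, labels, centers):
    """Sum of squared distances from each object to its own cluster center."""
    total = 0.0
    for point, label in zip(data, labels):
        center = centers[label]
        total += sum((a - b) ** 2 for a, b in zip(point, center))
    return total


data = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
centers = [(1.0, 0.0), (11.0, 0.0)]
print(within_cluster_sse(data, labels, centers))  # 4.0
```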

3.7 External performance measuring evaluation techniques in clustering

To evaluate and analyze the performance of the $K$-Means algorithm, which depends entirely on the distance function used, the following external performance measures are used.

3.7.1 Purity

To compute purity, each cluster is assigned to the class that is most frequent in it. The accuracy of this assignment is then measured by counting the number of correctly assigned points and dividing by $N$ , as in [29]. It is generally computed as in [30, 31], as given in Eq. (11):

$\displaystyle\textit{Purity}\left({\Omega,\Gamma}\right)=\frac{1}{N}\sum\limits_{j}{\max\limits_{i}\left|{\omega_{i}\cap C_{j}}\right|},$ (11)

where $\Omega=\left({\omega_{1},\omega_{2},\ldots,\omega_{k}}\right)$ is the set of classes of the objects, $\Gamma=\left({C_{1},C_{2},\ldots,C_{K}}\right)$ is the set of clusters created by the clustering rule, $\left|{\omega_{i}\cap C_{j}}\right|=n_{j}^{i}$ is the number of points in cluster $j$ that belong to class $i$ , and $N$ is the sample size. The sum runs over the clusters and the maximum over the classes.

Purity takes values in the interval from 0 to 1.
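Purity can be computed by counting, for each cluster, the size of its most frequent class. The sketch below assumes the class labels and cluster labels are given as two parallel lists; the function name is illustrative.

```python
from collections import Counter


def purity(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    per_cluster = {}
    for cls, clu in zip(true_labels, cluster_labels):
        per_cluster.setdefault(clu, Counter())[cls] += 1
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    return correct / len(true_labels)


classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
# Cluster 0 holds 2 points of class a; cluster 1 holds 1 of a and 3 of b.
print(purity(classes, clusters))  # (2 + 3) / 6
```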

3.7.2 Rand index

The Rand index compares whether pairs of objects are assigned identically in two partitions, here the class labels and the clusters. Each pair of objects falls into one of four categories:

MM: the two objects have the same label and are in the same cluster,

MN: the two objects have the same label but are in different clusters,

NM: the two objects have different labels but are in the same cluster, and

NN: the two objects have different labels and are in different clusters.

The Rand index [32, 33, 27] is computed from the counts of these four categories, as given in Eq. (12):

$\displaystyle R=\frac{|\textit{MM}|+|\textit{NN}|}{|\textit{MM}|+|\textit{MN}|+|\textit{NM}|+|\textit{NN}|}$ (12)

Let $A_{11}$ , $A_{10}$ , $A_{01}$ , and $A_{00}$ denote the numbers of MM, MN, NM, and NN pairs, respectively. Then $A_{11}+A_{10}+A_{01}+A_{00}=P$ is the total number of pairs in the dataset, that is, $P=Q(Q-1)/2$ , where $Q$ is the number of points in the dataset.
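The four pair counts and the Rand index of Eq. (12) can be obtained by enumerating all pairs of objects. This is a straightforward quadratic-time sketch with illustrative function names, assuming class labels and cluster labels are given as parallel lists.

```python
from itertools import combinations


def pair_counts(true_labels, cluster_labels):
    """Return the counts (MM, MN, NM, NN) over all pairs of objects."""
    mm = mn = nm = nn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            mm += 1
        elif same_class:
            mn += 1
        elif same_cluster:
            nm += 1
        else:
            nn += 1
    return mm, mn, nm, nn


def rand_index(true_labels, cluster_labels):
    """Eq. (12): fraction of pairs on which labels and clusters agree."""
    mm, mn, nm, nn = pair_counts(true_labels, cluster_labels)
    return (mm + nn) / (mm + mn + nm + nn)


classes = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]
print(pair_counts(classes, clusters))  # (4, 2, 3, 6)
print(rand_index(classes, clusters))   # 10 / 15
```

The four counts add up to 15, which matches $Q(Q-1)/2$ for $Q=6$ points.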

3.7.3 Jaccard index

The Jaccard index measures the similarity between objects and is computed as given in [34]:

$\displaystyle\textit{Jaccard index}=\frac{A_{11}}{A_{11}+A_{10}+A_{01}}$ (13)

where $A_{11}$ is the number of pairs of points having the same class and the same group, $A_{10}$ is the number of pairs of points having the same class but different groups, and $A_{01}$ is the number of pairs of points having different classes but the same group.

3.7.4 Fowlkes-Mallows index

The Fowlkes-Mallows index measures the similarity between the clusters returned by the clustering algorithm and the reference classes. It is computed as defined in [35]:

$\displaystyle\textit{FM}=\sqrt{\left({\frac{A_{11}}{A_{11}+A_{10}}}\right)\cdot\left({\frac{A_{11}}{A_{11}+A_{01}}}\right)},$ (14)

where $A_{11}$ is the number of pairs of points having the same class and the same group, $A_{10}$ is the number of pairs of points having the same class but different groups, and $A_{01}$ is the number of pairs of points having different classes but the same group.
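Given the pair counts, Eqs. (13) and (14) reduce to a few arithmetic operations. The sketch below takes the counts as plain numbers; the function names are illustrative, and the example counts (4 same-class same-group pairs, 2 same-class different-group pairs, 3 different-class same-group pairs) are made up for the demonstration.

```python
import math


def jaccard_index(a11, a10, a01):
    """Eq. (13): A11 / (A11 + A10 + A01)."""
    return a11 / (a11 + a10 + a01)


def fowlkes_mallows(a11, a10, a01):
    """Eq. (14): geometric mean of A11/(A11+A10) and A11/(A11+A01)."""
    return math.sqrt((a11 / (a11 + a10)) * (a11 / (a11 + a01)))


print(jaccard_index(4, 2, 3))    # 4 / 9
print(fowlkes_mallows(4, 2, 3))  # sqrt((4/6) * (4/7))
```

Neither index uses the count of pairs that differ in both class and group, which is why both can be much lower than the Rand index on the same partition.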

3.7.5 F-Measure: Harmonic mean of precision and recall

Combining the concepts of precision and recall from information retrieval, we compute the precision and recall of a cluster for each class:

$P=\frac{\textit{TP}}{\textit{TP}+\textit{FP}},\quad R=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}$

The F-Measure, derived from a more general relationship called the $\beta$-varied F-Measure, is defined as [35]:

$\displaystyle F_{\beta}=\frac{\beta^{2}+1}{\dfrac{\beta^{2}}{R}+\dfrac{1}{P}}=\frac{\left({\beta^{2}+1}\right)PR}{\beta^{2}P+R}$ (15)

where $\beta$ is a coefficient that adjusts the relative importance of precision versus recall; increasing $\beta$ reduces the importance of precision relative to recall.

Table 2
Results of the distance functions on the external validity measures

Data set 1: Iris
Dist. Fun.	Purity	Rand	Jaccard	Fow.-Mallows	F-M	F-M (F-score)	Sens.	Spec.	Prec.	G-Means
Manhattan	0.88667	0.92444	0.80385	0.88966	0.88586	0.88472	0.88667	0.94333	0.90262	0.91053
Euclidean	0.88667	0.92444	0.80454	0.88876	0.88609	0.88528	0.88667	0.94333	0.89786	0.85701
Minkowski	0.88667	0.92444	0.80454	0.88876	0.88609	0.88528	0.88667	0.94333	0.89786	0.91250
Chebyshev	0.88667	0.92444	0.80511	0.88804	0.88628	0.88574	0.88667	0.94333	0.89403	0.77394
Mahalanobis	0.70667	0.80444	0.59259	0.70667	0.70667	0.70667	0.70667	0.85333	0.70667	0.91621
Imp.Dist.I	0.88877	0.92554	0.80650	0.89667	0.88667	0.88876	0.89267	0.94533	0.91016	0.91149

Data set 2: Hayes-Roth
Dist. Fun.	Purity	Rand	Jaccard	Fow.-Mallows	F-M	F-M (F-score)	Sens.	Spec.	Prec.	G-Means
Manhattan	0.50625	0.67080	0.25975	NaN	NaN	NaN	0.41899	0.72470	NaN	0.39911
Euclidean	0.42500	0.61667	0.20466	NaN	NaN	NaN	0.35200	0.67968	NaN	0.32981
Minkowski	0.41875	0.61250	0.19158	NaN	NaN	NaN	0.34744	0.67617	NaN	0.31060
Chebyshev	0.48125	0.65147	0.32061	0.48534	0.49655	0.48307	0.49812	0.73225	0.47715	0.60364
Mahalanobis	0.46875	0.64583	0.22589	NaN	NaN	NaN	0.38654	0.70267	NaN	0.35477
Imp.Dist.I	0.49987	0.67700	0.32971	0.48827	0.49676	0.49335	0.49962	0.72954	0.48351	0.60785

Data set 3: Tae
Dist. Fun.	Purity	Rand	Jaccard	Fow.-Mallows	F-M	F-M (F-score)	Sens.	Spec.	Prec.	G-Means
Manhattan	0.37748	0.58499	0.19232	NaN	NaN	NaN	0.37912	0.68934	NaN	0.35724
Euclidean	0.37086	0.58057	0.19147	NaN	NaN	NaN	0.37769	0.68822	NaN	0.36279
Minkowski	0.40397	0.60265	0.21515	NaN	NaN	NaN	0.41088	0.70446	NaN	0.39042
Chebyshev	0.35762	0.57174	0.17775	NaN	NaN	NaN	0.35989	0.67994	NaN	0.33436
Mahalanobis	0.43709	0.62472	0.27881	0.43493	0.43804	0.43286	0.43882	0.71907	0.43521	0.55619
Imp.Dist.I	0.44280	0.62865	0.279346	0.44311	0.43816	0.43522	0.43789	0.71896	0.44332	0.56424

NaN $=$ Not a number.

The F-Measure takes values in the interval [0, 1], with greater values indicating higher clustering quality.
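The effect of $\beta$ in Eq. (15) is easiest to see numerically. In the sketch below the function name is illustrative, and returning NaN when the denominator is zero is an assumption of the example rather than something the text specifies.

```python
def f_beta(precision, recall, beta=1.0):
    """Eq. (15): (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return float("nan")  # assumption: undefined when P and R are both zero
    return (b2 + 1.0) * precision * recall / denom


p, r = 0.5, 1.0
print(f_beta(p, r, beta=0.5))  # closer to the precision of 0.5
print(f_beta(p, r, beta=1.0))  # harmonic mean of precision and recall
print(f_beta(p, r, beta=2.0))  # closer to the recall of 1.0
```

With precision 0.5 and recall 1.0, the value rises as `beta` grows, since a larger `beta` shifts the weight toward recall.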

3.7.6 F-Measure (F-Score)

The F-Measure is defined as the harmonic mean of precision and recall as in [35]:

$\displaystyle F=\frac{2\times\textit{Recall}\times\textit{Precision}}{\textit{Recall}+\textit{Precision}}$ (16)

This measure is small whenever either of the two indices is small. The value of $F$ increases as precision and recall increase; a greater value of the F-Measure indicates that the clustering performs better on the positive class.

The F-Measure takes values in the interval [0, 1], with larger values indicating higher clustering quality.

3.7.7 Sensitivity (TPR = True Positive Rate)

Sensitivity is defined as the percentage of all positive objects that are contained in a cluster, as given by [35]:

$\displaystyle\textit{Sensitivity}\left({\textit{TPR}}\right)=\frac{\textit{TP}}{\textit{TP}+\textit{FN}}\times 100\%$ (17)

where TP is the number of points of the positive class contained in cluster $i$ , and TP + FN is the number of all positive objects.

3.7.8 Specificity (TNR = True Negative Rate)

Specificity is the percentage of all negative objects that are correctly left out of a cluster, and is defined as in [35]:

$\displaystyle\textit{Specificity}=\frac{\textit{TN}}{\textit{TN}+\textit{FP}}\times 100\%$ (18)

where TN is the number of points of negative class $j$ not contained in cluster $i$ , FP is the number of points of the negative class contained in cluster $i$ , and TN + FP is the number of all negative objects.

3.7.9 Precision

Precision is defined as the percentage of the points in a cluster that are positive objects, as given in [34, 25]:

$\displaystyle\textit{Precision}=\frac{\textit{TP}}{\textit{TP}+\textit{FP}}\times 100\%$ (19)

where TP is the number of points of positive class $j$ contained in cluster $i$ , and TP + FP is the number of points contained in cluster $i$ .

3.7.10 Geometric-means

The Geometric-Mean combines the clustering accuracies on both classes: Eq. (17), the accuracy on the positive class, and Eq. (18), the accuracy on the negative class. It is defined as in [36]:

$\displaystyle\textit{Geometric-Mean}=\sqrt{\textit{Sensitivity}\times\textit{Specificity}}$ (20)

The geometric-mean is obtained by multiplying sensitivity and specificity, and taking the square root.
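Sensitivity, specificity, precision, and the geometric mean of Eqs. (17) to (20) all follow from the four counts TP, FN, TN, and FP. The sketch below returns fractions in [0, 1] rather than percentages (multiply by 100 for the percentage form); the function names and the example counts are made up for the demonstration.

```python
import math


def sensitivity(tp, fn):
    """Eq. (17) as a fraction: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Eq. (18) as a fraction: TN / (TN + FP)."""
    return tn / (tn + fp)


def precision(tp, fp):
    """Eq. (19) as a fraction: TP / (TP + FP)."""
    return tp / (tp + fp)


def geometric_mean(tp, fn, tn, fp):
    """Eq. (20): square root of sensitivity times specificity."""
    return math.sqrt(sensitivity(tp, fn) * specificity(tn, fp))


tp, fn, tn, fp = 40, 10, 45, 5  # hypothetical counts
print(sensitivity(tp, fn))             # 0.8
print(specificity(tn, fp))             # 0.9
print(precision(tp, fp))               # 40 / 45
print(geometric_mean(tp, fn, tn, fp))  # sqrt(0.8 * 0.9)
```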

4. Results and discussion

From the results in Table 2, and for the datasets investigated, the Rand index yielded the highest values among the external validity measures across all three datasets (Iris, Hayes-Roth, and Tae), while the Jaccard index recorded the lowest values for the same datasets. The external validity measures affected by NaN (not a number) were the Fowlkes-Mallows index, the F-Measure ( $\beta$ varied), and the F-Measure (F-score). The proposed distance function was not affected by NaN on any of the three datasets. Among the traditional distance functions, the Chebyshev distance was not affected by NaN on the Hayes-Roth dataset, and the Mahalanobis distance was not affected by NaN on the Tae dataset. No distance function produced NaN on the Iris dataset, likely because of its real-valued attributes.

In general, the proposed distance function showed better performance than the traditional distance functions. In addition, the dataset consisting of real numbers (Iris) produced higher-quality clusters in terms of the external validity indices and was not affected by NaN, in contrast to the Hayes-Roth and Tae datasets, each of which consists of integer-valued features.

5. Conclusions

This paper presented an improved distance function for the $K$-Means clustering algorithm. A series of comparative clustering experiments was carried out to compare the performance of several traditional distance functions against the proposed hybrid distance function. On the three datasets studied, the results showed that the improved approach achieved better performance than the traditional distance functions. The enhanced $K$-Means algorithm is hoped to improve cluster analysis in many fields, such as:

marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past purchase records; finance: forecasting stock markets, currency exchange rates, and bank bankruptcies, understanding and managing financial risk, trading futures, and credit rating; biology: classification of plants and animals given their features; libraries: book ordering and stock management; insurance: identifying groups of motor insurance policy holders with a high average claim cost and identifying fraud; city planning: identifying groups of houses according to their house type, value, and geographical location; earthquake studies: clustering observed earthquake epicenters to identify dangerous zones; and the World Wide Web (WWW): document classification, or clustering web log data to discover groups of similar access patterns.

In future work, our main objective is to design an algorithm that implements the new approach with further distance functions to enhance the $K$-Means clustering algorithm, and to evaluate its performance on internal validity measures.

References

1. Han, Kamber, Pei. Data mining: Concepts and techniques. Elsevier; 2011 Jun 9.

2. Huang, Rong. Automated variable weighting in k-means type clustering. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2005 May; 27(5): 657-68.

3. Krishnasamy, Kulkarni, Paramesran. A hybrid approach for data clustering based on modified cohort intelligence and K-means. Expert Systems with Applications. 2014 Oct 1; 41(13): 6009-16. doi.org/10.1016/j.eswa.201403.021.

4. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010 Jun 1; 31(8): 651-66. doi: 10.1016/j.patrec.2009.09.011.

5. Oyelade, Oladipupo, Obagbuwa. Application of k Means Clustering algorithm for prediction of Students Academic Performance. arXiv preprint arXiv: 1002.2425. 2010 Feb 11.

6. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci. 1956 Jan; 1: 801-4.

7. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1967 Jun 21; (Vol. 1, No. 14, pp. 281-297).

8. Patel, Mehta. Impact of outlier removal and normalization approach in modified k-means clustering algorithm. IJCSI International Journal of Computer Science Issues. 2011; 8(5).

9. Mohamad, Usman. Standardization and its effects on k-means clustering algorithm. Res. J. Appl. Sci. Eng. Technol. 2013; 6(17): 3299-303.

10. Loohach, Garg. Effect of distance functions on simple k-means clustering algorithm. International Journal of Computer Applications. 2012 Jan 1; 49(6).

11. Giancarlo, Bosco, Pinello. Distance functions, clustering algorithms and microarray data analysis. In Learning and Intelligent Optimization. 2010 Jan 18 (pp. 125-138). Springer Berlin Heidelberg. doi: 10.1007/978-3-642-13800.3-10.

12. Vijay, Mahajan, Kandwal. Hamming distance based clustering algorithm. International Journal of Information Retrieval Research (IJIRR). 2012 Jan 1; 2(1): 11-20. doi: 10.4018/ijirr.2012010102.

13. Md Saad, Ahmad, Abu, Jusoh. Hamming distance method with subjective and objective weights for personnel selection. The Scientific World Journal. 2014 Mar 17; 2014. doi.org/10.1155/2014/865495.

14. Bache, Lichman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2013.

15. Al Shalabi, Shaaban, Kasasbeh. Data mining: A preprocessing engine. Journal of Computer Science. 2006 Sep; 2(9): 735-9.

16. Aksoy, Haralick. Feature normalization and likelihood-based similarity measures for image retrieval. Pattern Recognition Letters. 2001 Apr 30; 22(5): 563-82.

17. Zhan, Sakurai. Importance of data standardization in privacy-preserving K-Means clustering. In Database Systems for Advanced Applications 2009 Apr 20 (pp. 276-286). Springer Berlin Heidelberg.

18. Suarez-Alvarez, Pham, Prostov. Statistical approach to normalization of feature vectors and clustering of mixed datasets. In Proc. R. Soc. A 2012 Apr 18 (p. rspa20110704). The Royal Society.

19. Visalakshi, Thangavel. Impact of normalization in distributed k-means clustering. International Journal of Soft Computing. 2009; 4(4): 168-72.

20. Everitt, Landau, Leese. Cluster Analysis (Edward Arnold, London); 1993.

21. Vesanto. Importance of individual variables in the k-means algorithm. In Advances in Knowledge Discovery and Data Mining 2001 Apr 16 (pp. 513-518). Springer Berlin Heidelberg.

22. Ben Ali, Massmoudi. K-means clustering based on gower similarity coefficient: A comparative study. In Modeling, Simulation and Applied Optimization (ICMSAO), 2013 5th International Conference on 2013 Apr 28 (pp. 1-5). IEEE.

23. A clustering method based on K-means algorithm. Physics Procedia. 2012 Dec 31; 25: 1104-9. doi: 10.1016/j.phpro.2012.03.206.

24. Kumar, Prasad. K Means clustering algorithm for partitioning data sets evaluated from horizontal aggregations. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN. 2013; 2278-0661.

25. Rokach, Maimon. Data mining with decision trees: Theory and applications. World Scientific. 2014 Sep 3.

26. Melnykov. On K-means algorithm with the use of Mahalanobis distances. Statistics & Probability Letters. 2014 Jan 31; 84: 88-95.

27. Noorbehbahani, Mousavi, Mirzaei. An incremental mixed data clustering method using a new distance measure. Soft Computing. 2015 Mar 1; 19(3): 731-43.

28. Visalakshi, Suguna. K-means clustering using Max-min distance measure. In Fuzzy Information Processing Society, 2009; NAFIPS 2009. Annual Meeting of the North American 2009 Jun 14 (pp. 1-6). IEEE.

29. Mogotsi. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to information retrieval. Information Retrieval. 2010 Apr 1; 13(2): 192-5.

30. Deepa, Revathy, Student. Validation of document clustering based on purity and entropy measures. International Journal of Advanced Research in Computer and Communication Engineering. 2012 May; 1(3): 147-52.

31. Hernández-Torruco, Canul-Reich, Frausto-Solís, Méndez-Castillo. Feature selection for better identification of subtypes of Guillain-Barré syndrome. Computational and Mathematical Methods in Medicine. 2014 Sep 15; 2014. doi.org/10.1155/2014/432109.

32. Rand. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association. 1971 Dec 1; 66(336): 846-50.

33. Halkidi, Batistakis, Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information Systems. 2001 Dec 1; 17(2): 107-45.

34. Kou, Peng, Wang. Evaluation of clustering algorithms for financial risk analysis using MCDM methods. Information Sciences. 2014 Aug 10; 275: 1-2. doi.org/10.1016/j.ins.2014.02.137.

35. Velardi, Navigli, Faralli, Ruiz-Martinez. A New Method for Evaluating Automatically Learned Terminological Taxonomies. In LREC. 2012; (pp. 1498-1504).

36. Tomar, Agarwal. Hybrid feature selection based weighted least squares twin support vector machine approach for diagnosing breast cancer, hepatitis, and diabetes. Advances in Artificial Neural Systems. 2015 Jan 1; 2015: 1. doi.org/10.1155/2015/265637.