Abstract
In today’s world different data sets are available on which regression or classification algorithms of machine learning are applied. One of the classification algorithms is k-nearest neighbor (kNN) which computes distance amongst various rows in a dataset. The performance of kNN is evaluated based on
Introduction
All the machine learning models are used to forecast the results based on previous data available. Based on some attribute values already known, the machine learning model is built and this model use these values to create a predicted output. Classification is a method which is used to constructs a model using data samples which called training data set. Classification is a supervised learning method which maps the attribute values into one or more response classes based on knowledge from an expert. In literature, classification methods such as Naïve Bayes, k-Nearest Neighbor, Logistic Regression, Decision Trees and Support Vector Machines have been used to solve many applications in the real world domain [1, 2].
K Nearest Neighbor (kNN) is a classifier that depends on the distance or similarity metric between test and training samples [3]. This method is simple, easy to implement and handles the noise satisfactorily. Thus, it has been used by many researchers to classify new inputs based on training data [4].
The relevant factors such as
Pulungan and Zaris [5] compared the performance of kNN classifier using various distance metrics namely Canberra Distance, Euclidean Distance and Braycurtis Distance. The differentiation between each has been based on accuracy view point. It was observed from the results that the Braycurtis distance method had better outcome than Canberra Distance and Euclidean Distance technique, based on different costs of
In literature, a plethora of research has been centered on kNN performance improvement using available distance metrics. But very less work has been carried out to modify existing distance formulas to make kNN technique more effective. In this paper, the use of the Modified Euclidean-Canberra Blend Distance formula (MECBD) is proposed, to quantify distance in the kNN classifier to get improvement, shown in terms of class prediction parameters. The remainder of the paper is organized as: Methodology used which includes discussion about kNN algorithm, original and proposed distance metrics, data sets and proposed method is presented in Section 2. The Section 3 presents the results of the application of the proposed metric. Conclusions and future work is presented in the last section.
Methodology used
kNN algorithm
Since this research is based on datasets having classes, therefore classification technique has been required to be implemented here. As kNN is based on feature correlation and classification is carried out using several kNN classifiers, which consider some input data, feed it into the trained model and then the model predicts the output class [13]. This study uses kNN algorithm for its proposed distance formula. K-Nearest Neighbor classifier was proposed by Cover and Hart in 1968. kNN is called lazy learner classifier as it holds the class information from the training dataset till the testing data is provided [9, 18]. The K-nearest neighbor (kNN) classifier is a procedure for segregating data values of learning data that is in its neighbourhood [21, 23]. kNN classifies the given dataset into classes using a similarity measure [24].
It is analogous to the clustering methodology, where clustering is performed using distance of the new data to its neighbours. To identify the distance amidst two points, particularly the point on the training dataset (
where
kNN technique works on similarity distance measure for which taking the correct value is essential. This procedure is called parameter tuning which is significant for good accuracy.
How kNN algorithm works?
The main aim of kNN algorithm is to identify the nearest neighbor’s to a testing point, so that classification of testing point is done and it is categorized. To find the nearest neighbor based on the new data rows available as input to be predicted, it is needed to calculate the distance matrix. There are various types of similarity distance measures like Euclidean Distance, Manhattan Distance, Minkowski Distance, Canberra Distance and others but commonly used matrix is Euclidean distance. kNN algorithm calculates the Euclidean distance of unknown data value from all the point values (based on value of
Implementation of distance measure
Distance measure is widely used for degree of similarity and dissimilarity between two data rows. Some distance measures include: Euclidean distance, Chebyshev, angular separation, Canberra distance, Hamming distance, Sorensen distance and so on. This study of kNN algorithm, classification process utilizes the Canberra distance technique and also a modified version of it.
Canberra distance
The Canberra distance has been proposed by G.N Lance and W.T William in 1966 and 1967. Canberra distance is designed to get the distance from couple of points where the data is in the format of rows and columns using Eq. (2) [14, 15]. It is summation of subtraction of data points attribute values with the division of sum of their abstract attribute values.
Distance between two points on a straight line is termed as Euclidean distance. This distance mechanism uses principles of Pythagoras formula. This computation is primarily used in numerous machine learning processes [10]. The Euclidean distance procedure is stated in Eq. (3) as [16] summation of the square of subtraction of attribute values of data points under consideration.
Based on the Euclidean and Canberra distance formulas, a modification of Canberra Distance formula is proposed in Eq. (4) which involves squaring the subtraction of vector absolute values followed by division of absolute values their sum, then finding the square root of the final result. Here,
Data sets are basically collection of data. Classification techniques work on the data sets whose dependent attribute is divided into classes [25]. The datasets involved in the proposed study are of classification technique type. In this study three data sets are used, two related to Spine and one related to life expectancy (Haberman’s dataset) from UCI machine learning repository with details shown in Table 1[11, 12]. One data set is Spine which has 6 attributes (pelvic incidence, pelvic tilt, lumbar lordosis angle, sacral slope, pelvic radius and grade of spondylolisthesis) and 2 classes (AB-abnormal spine, NO-normal spine) and other data set related to spine is having same 6 attributes but 3 classes (DH-disc hernia, NO-normal spine, SL-spondylolisthesis). The Haberman’s data is having 3 attributes (Age of patient at time of operation, Patient’s year of operation, Number of positive axillary nodes detected and Survival status) and 2 classes (Sur-survived, Non-Sur-not survived). All the datasets are having more than 300 rows of data.
Summary of datasets used in study
Summary of datasets used in study
The proposed technique is based on the modification of distance metric used in kNN algorithm. As it is observed that kNN algorithm works on the distance allying the actual training row of data and test row [30]. Therefore, the value of distance plays a pivotal role in class prediction using kNN. In this paper, kNN is implemented using Canberra distance metric (Algorithm 1) and then using proposed MECBD metric (Algorithm 2). Then the performance of Algorithm 1 and Algorithm 2 is compared in terms of accuracy, precision, recall, F1-score and confusion matrix. Following Fig. 1 shows the flow chart which summarizes the methodology followed in this research. The flowchart in Fig. 1 depicts the receiving of data set by application and then splitting the data set into test and training sub sets further more kNN is implemented using both MECBD and Canberra distance metrics. Based on implementation accuracy, precision, recall and f1-score, are calculated and confusion matrix is displayed.
Flowchart of the methodology used.
Classification refers to the distribution of data based on classes in a data set [29]. In Fig. 2, for class prediction kNN algorithm is used for predicting the classes based on the training dataset with Canberra distance formula based on Eq. (2) as distance metric. The complete data set is given as input to the model, which is further split into training dataset D (80%) and test data set (20%). A row of data T from the test data set is picked. And for each attribute
Algorithm 1; kNN implementation using Canberra distance metric.
In Fig. 3, kNN algorithm is used for predicting the classes based on the training dataset with MCEBD formula based on Eq. (4) as distance metric. The complete data set is given as input to the model, which is further split into training dataset D (80%) and test data set (20%). A row of data T from the test data set is picked. And for each attribute
This process of Algorithm 1 and Algorithm 2 repeats itself till all the rows of testing data set are processed and their classes are predicted.
Algorithm 2; kNN implementation using MECBD metric.
To evaluate the performance of an algorithm Confusion matrix is generally used. Confusion matrix consists of actual and predicted data values in matrix representation [28]. In acquiring the accuracy value of kNN algorithm, the operational steps are as follows:
Identification of test data, training data and values of Establishing the categorization with the kNN classifier to compute the kNN model. Conduct training and framework testing processes using distinct distance calculation mechanisms. Deriving a confusion matrix and classification report. Computing the accuracy.
Machine learning methods have been used to collate classifier conduct involving supervised algorithms [31]. Different parameters were used to show improvement of kNN algorithm for class prediction when modified distance formula has been used. They were accuracy, precision, recall and F1-score [19]. Their formulas are
where,
TrueP: True Positive TrueN: True Negative FalseP: False Positive FalseN: False Negative
To predict the classes for different datasets used, kNN algorithm technique is considered. Two different distance metrics were used, one is original Canberra distance formula and other is proposed MECBD metric. When applied on different datasets it has been observed that performance parameters values improved for proposed MECBD matrix. Performance evaluation parameters include accuracy, precision, recall and F1-score. Accuracy determines whether the model predicted the class correctly and up to which level. Precision is measure of quality which tells about the correct results returned by the algorithm [27]. Whereas, recall is the measure of quantity which tells that algorithm returned most correct results [20]. Recall tells about the in numeral of true positive predictions made out of all positive predictions that could have been guessed, recall also gives a clue of overlooked positive predictions. F1 score is computed based on mean of precision and recall [26]. The performance parameters results are summarized in Tables 2–4.
Comparison of performance between original Canberra distance metric and MECBD metric for Spine (2 Classes) dataset for
Comparison of performance between original Canberra distance metric and MECBD metric for Spine (3 Classes) dataset for for
Comparison of performance between original Canberra distance metric and MECBD metric for Haberman’s dataset dataset for
From the above tables, which are based on original Canberra metric and proposed MECBD metric comparison, it is inferred that for
Original Canberra distance confusion matrix (a) and MECBD confusion matrix (b) for Spine (2 Classes) dataset comparison for 
Original Canberra distance confusion matrix (a) and MECBD confusion matrix (b) for Spine (3 Classes) dataset comparison for 
Original Canberra Distance confusion matrix (a) and MECBD confusion matrix (b) for Haberman’s (2 Classes) dataset comparison for 
All the confusion matrix in Figs 4–6 shows increase in true predicted values when MECBD has been applied in kNN algorithm for data sets under consideration for
For example, for Spine (2 Classes) dataset the accurate class prediction count, for AB class increased from 34 to 38 and for NO increased from 16 to 18 for testing part of dataset.
Prediction accuracy comparison based on distance formulas for 
The Fig. 7 shows comparison of kNN prediction accuracy values for original and modified distance formulas for
Precision comparison based on distance formulas for 
The Fig. 8 shows precision comparison of kNN algorithm for original and modified distance formulas. It is observed from the graph that precision for AB class increases marginally (from 0.87 to 0.96) for Spine (2 classes) dataset but for NO class precision value increased considerably (from 0.70 to 0.86). Similarly, for Spine (3 classes) dataset the precision value increased for DH (from 0.50 to 0.62) and NO (from 0.70 to 0.80) classes but remained unchanged for SL (i.e. 1) class. Also, for Haberman’s dataset the Sur class had a marginal increase (from 0.77 to 0.78) in precision but Non-Sur class had a significant increase (from 0.46 to 0.71) in precision values.
Recall comparison based on distance formulas for 
The Fig. 9 shows recall comparison of kNN algorithm for original and MECBD formulas. It is observed from the graph that recall value increased for AB class (from 0.83 to 0.93) for Spine (2 classes) dataset and for NO class (from 0.76 to 0.86). Similarly, for Spine (3 classes) dataset the recall value increased for DH (from 0.55 to 0.73) and NO (from 0.70 to 0.84) classes but fell for SL (from 0.97 to 0.94) class. Also, for Haberman’s dataset the Sur class had a marginal increase (from 0.84 to 0.95) in recall but Non-Sur class had a drop (from 0.35 to 0.29) in recall values.
F1-Score comparison based on distance formulas for 
The Fig. 10 shows F1-score comparison of kNN algorithm for original and MECBD formulas. It is observed from the graph that F1-score value increased for AB class (from 0.85 to 0.93) for Spine (2 classes) dataset and for NO class (from 0.73 to 0.86). Similarly, for Spine (3 classes) dataset the F1-score value increased for DH (from 0.52 to 0.67) and NO (from 0.70 to 0.80) classes but fell marginally for SL class (from 0.98 to 0.97). Also, for Haberman’s dataset the Sur class had a marginal increase (from 0.80 to 0.86) in F1-score values but Non-Sur class had a less increase (from 0.40 to 0.42).
ROC curve and auc values for Spine (2 classes) dataset showing comparison between normal Canberra distance and MECBD metric implementation for 
In the ROC curve shown in the Fig. 11, comparison curves plotted the predicted values of kNN model using Normal Canberra distance (blue line) and MECBD metric (orange line). By the values of area under curve (auc) for both the ROC curves it has been observed that auc value (0.958) for MECBD metric curve is more than auc value (0.873) for Normal Canberra distance curve. Since, higher value of auc tell about better performance of the model in distinguishing between the classes hence, MECBD metric has improved the prediction of the kNN algorithm for Spine (2 Classes) dataset.
ROC curve and auc values for Spine (3 classes) dataset for Normal Canberra distance implementation for 
ROC curve and auc values for Spine (3 classes) dataset for MECBD metric implementation for 
ROC curve and auc values for Haberman’s dataset showing comparison between Normal Canberra distance and MECBD implementation for 
In Fig. 12, ROC curves for 3 classes for Spine (3 Classes) data set is shown for normal Canberra metric implementation in kNN classifier. Different values of auc is also shown for DH class (0.857), SL class (0.983) and NO class (0.910). A classifier is considered to be perfect if the true positive value is near 1.0. In the auc, more the area under the curve value better is the classifier.
In Fig. 13, ROC curves for 3 classes for Spine (3 Classes) data set is shown for MECBD metric implementation in kNN classifier. Different values of auc are also shown for DH class (0.940), SL class (0.983) and NO class (0.957). On comparing the auc values of Figs 12 and 13 it is inferred that MECBD implementation results is increased values of auc hence indicating improved prediction performance of kNN classifier.
In the ROC curve shown in the Fig. 14, comparison curves plotted the predicted values of kNN model using Normal Canberra distance (blue line) and MECBD metric (orange line). By the values of area under curve (auc) for both the ROC curves it has been observed that auc value (0.742) for MECBD metric curve is more than auc value (0.651) for Normal Canberra distance curve. It is inferred through higher value of auc that MECBD metric has improved the prediction of the kNN algorithm for Haberman’s survival dataset.
In this paper, a method to improve the kNN algorithm performance for accurate class prediction by use of modified Canberra distance metric named MECBD metric has been proposed. The data sets considered in the paper are based on Spine diseases (Spine (2 classes) and Spine (3 classes) datasets) and patient survival (Haberman’s dataset). The main work includes modification of Canberra distance metric by blending it with Euclidean distance metric and then applying this modified formula on kNN algorithm, which lead to improvement in accuracy, precision, recall and F1-score upon comparison with normal Canberra distance metric and MECBD metric implementation on kNN algorithm. It is inferred that class prediction accuracy for kNN algorithm based on MECBD metric technique for Spine (2 classes) dataset increased from 0.804 to 0.903, Spine (3 classes) dataset increased from 0.806 to 0.854 and Haberman’s dataset increased from 0.70 to 0.77. Further, ROC curves and auc were also calculated and compared for k=5 for all the data sets which showed increase in prediction performance of kNN algorithm. The future work will comprise of implementation of feature selection concepts and introduce weighted attributes to improve class prediction accuracy further.
