Abstract
Even though finding out distance is the central core of k-Nearest Neighbor classification techniques, similarity measures are often favored against distance in various realistic scenarios and situation. Most of the similarity measures, which are used to classify an instance, are based on geometric model. Their effectiveness decreases with the increases in the number of dimensions. This paper establishes an efficient technique called ARSkNN for finding out class of any given instance using a measure based on an unique similarity, that does no longer compute distance, for k-NN classification. Our empirical results show that ARSkNN classification technique is better than the previous established k-NN classifiers. The performance of algorithm was verified and validated on various datasets from different domains.
Introduction
In the domain of data mining, classification techniques are realm of much eagerness for well-known researchers, attributable to its ample range of practical applications, viz; geostatistics, computer vision, speech recognition and many more. The popular and established k-Nearest Neighbors also called as k-NN classification approach is among the utmost straightforward and oldest among classification approaches. In this approach, the class of the questioned instance is established in two steps; in the initial step, k nearest neighbors of questioned instance is establish and in next or last step, the class that has the majority in those nearest neighbors is said to be the class of questioned instance [12]. The accurateness of k-NN classification is drastically influenced by the metric utilize for determination of similarities or dissimilarities (distances) between the instances with the known classes and the questioned instance [34]. The vast majority of the works as reported for the metric learning phase, i.e., the first phase for k-NN classification approach, are dependent on distance metrics [3, 34]. Nevertheless, in numerous real world applications like document classification, similarities (such as cosine similarity) are preferred over distances (such as Euclidean distance).
Additionally, numerous researches have revealed that utilizing measures based on similarity are often chosen against the distance measures even in the non-textual datasets (like Iris, Wine and Balance) [22]. Preferring measure based on distance/similarity relies on the representation or the measurement type of instances which needs classification. Many similarity measures which are used in metric learning phase of k-NN classification are directly or indirectly calculated with the help of distance measures. The recently proposed similarity measure based on mass (Massim), is a substitute to similarity measure based on distance and has revealed better outcome in information retrieving tasks [31]. The cardinality feature of the least adjacent area which wraps those two instances is the main concept to calculate massim. Our algorithm ARSkNN makes use of Massim as a metric learner in k-NN classification approach.
This paper makes the following contributions; It:
establishes an efficient kNN classification technique, ARSkNN (the name comes from the first character of first name of its proposers Ashish, Roheet and Sumit), which utilizes similarity measure based on mass rather than distance. establishes the empirical and experimental foundation of ARSkNN. empirically evaluate ARSkNN and compared it with the traditional kNN. The results shows its superiority in the context of avg. runtime and avg. accuracy percentage.
Literature review
The section reviews k-NN classifiers and similarity measures.Massim and operational concept of Massim is also explained in this section. An approach of calculating similarity measure based on mass (Massim) is recently developed and hence available literature is also very limited.
k - Nearest neighbors classification
The cornerstones of k - nearest neighbors classification [12] are Nearest Neighbor classification and the k-NN law first suggested in 1951 by Fix et al. It has been also recognized by so many names; some of them are instance based classifier, case based classifier, memory based classifier, lazy classifier. There are 3 crucial pillars of k-NN classifier: class labeled instances set; the value of k and dissimilarity or similarity metric. It is lazy learning classification technique because it does not create any generalization from the training instances of the dataset. Therefore, it desires all instances of t raining data in memory for each testing instance. Due to this, there is a minimal training phase, but costly testing phase in context of space & time complexityboth [1].
To enhance k-NN classifiers, various measures have been recommended. Some of them are as; Mahalanobis distance [34], dynamic (adaptive) distance [33] and local distance metric [20]. Furthermore, the customary k-NN classifier has seen a variety of augmentations and transformations till now, for example, distance lower bound based nearest neighbor [9], Nearest Feature Line (NFL) classification technique [19], Nearest Feature Space (NFS) classification technique and Nearest Feature Plane (NFP) classification technique [10], Nearest Neighbor Line classification technique (NNLC) [37], Center-based Nearest Neighbor classification technique (CBNNC) [16], Meaningful Nearest Neighbor (MNN) [21], MaxNearestDist based nearest neighbor [23], Optimal Weighted Nearest Neighbor classifiers [24] and k-regular nearest neighbor graph [6]. But, all the above classifiers are based on distance (dissimilarity) or similarity metrics. Also genetic algorithms and genetic programming are used to propose robust nearest neighbor classifier [15]. Yu-Jie Xiong et al. [35] has suggested that distance based classifiers are very convenient and relevant in the domain of writer recognition.
Till now, there is no k-NN classifier which takes into account similarity metric based on mass (Massim). This research paper comes up with an efficient classifier for the same and it is the novelty of our approach in ARSkNN.
Concept of similarity measures
Similarity, is a perception which can be utilize not only in data mining domain but also numerous further subjects like ecology and psychology. To determine similarity among two objects, we can use similarity measure which is a function of real values. When objects are more similar to each other, then the return value of the function is higher.
Afterwards a similarity measure to classify biological species was formulated by Jaccard in 1901, various dissimilarity (distance) and similarity measures have been formulated in various scientific disciplines in the past hundred years. The literature demonstrates that Zezula et al. [36] and Cha [7] have done exhaustive surveys of similarity measures. Seung-Seok Choi et al. have also done a review of 76 binarized distance and similarity metrics alongside their correlations, via hierarchical clustering techniques [11].
A similarity measure is treated as a metric on the off chance, that it results in a higher value as the similarity between objects increases. A metric similarity S assures the following [27]:
Non-Negativity: S (X, Y) > =0 i.e., the distance among any two elements of the metric space can not be less than 0. Reflexivity: S (X, Y) =0 if and only if X = Y i.e., if two elements are equivalent to each other then the distance among them is 0. Symmetry: S (X, Y) = S (Y, X) i.e., the distance among any two elements will be same in either way. Triangle Inequality: S (X, Z) < = S (X, Y) + S (Y, Z) i.e., let three elements x, y, z of metric space then the sum of the distance among (x, y) and (y, z) is greater than (x, z).
These four properties are also known as distance axioms. Nonetheless, a few researchers confronted the triangle inequality axiom by proposing non-metric similarity measures [26]. The similarity measures based on distance is a binary function dist(x, y) [Table 1], i.e., all of them must have two variables, regardless of whether non-metric or metric. In large datasets, the fundamental downside with similarity measures based on distance is that they are expensive with regard to computation due to their high time complexities, which refers to the comparison matrix drawn for the pairwise data selected. There are some established techniques by which one can transform the similarity measure to distance measures and vice versa [8].
Distance-based Similarity versus Mass-based Similarity
Distance-based Similarity versus Mass-based Similarity
In 2010, Kai Ming Ting et al. suggested the idea of mass and how to estimate mass [29]. It was proposed and established for replacement of density estimation. Massim is not quite the same as all similarity measures. It is built on the perception of mass [30], that is the cardinality of region, instead of distance.Mainly, it does not quantify distance measure, which is the heart and soul of all the distance based similarities where as it is established on data distribution in the local area. Massim assures that 2 similar objects should be in the same local region. In view of mass, Massim is a binary function to quantify resemblance among 2 instances however, Mass is a unary function.
The comparison among distance based and mass based similarity measures have been shown in Table 1.
Massim does not fulfill all the four distance axioms dissimilar to distance-based similarity measures, as mentioned in Table 2.
Axioms followed by Distance-based Similarity Measures and Mass-based Similarity Measures
Axioms followed by Distance-based Similarity Measures and Mass-based Similarity Measures
Massim is a revolutionary concept which is a similarity measure based on mass.
The two adaptations for Massim, first being for single dimensional and the next being for multidimensional representation of objects (instances) [31].
For single dimensional adaptation, in order to create a model E to distinct feature space into least convex local section R (x, y|E), which covers x, y ∈ I R, as product of twofold split that separates real line characterized by D = {x1 < x2 < x3 … < x
n
} into two non-intersecting local area r1 (where x < = b
i
) and r2 (where x > = b
i
); and one region r3, that spreads over the intact real line should also be there. Now Massim(x, y) is expressed as:
Another smoother form of single dimensional Massim i.e. Level-h Massim, is estimated iteratively by using the following equation:
Level-h Massim takes into account the manner of data distribution which is further efficient and it is utilized for multi-modal data distribution [31].
The similarity of 2 instances (x, y) (for multi-dimensional Massim) is assessed by standard avg. of mass for every smallest local area that covers instances (x, y). The mass inequality produced by half-space splits [29], is promise by the same random local areas.
Also for multi-dimensional datasets, every single half-space split is carried out over arbitrarily choose features. A h-level split is represented using a tree structure in which every single subtree contains h half-space splits. Hence there would be 2 h distinct local areas in this tree [31].
Further to calculate similarity for multi-dimensional mass similarity, the following equation is referred:
In our previous research work [18], ARSkNN has been already is proposed as a novel and robust k-nearest neighbor classifier which utilizes Massim, a similarity measure based on mass, against the techniques rely upon distance-based similarity measures. Further the algorithm is explained in the following paragraph:
In this algorithm, y is the query instance for which Mass h (x, y) is calculated for every instance x in the given dataset D. In order to calculate, instances (x, y) are analyzed with t similarity trees (sTrees) from the constructed similarity forest (sForest) which is generated from dataset D, that has {(x1, c1), (x2, c2), …, (x n , c n )} without consideration of c i [31]. Further to find the k nearest neighbors (instances) in D with respect to queried instance, one has to use Algorithm 1 i.e., ARSkNN.
The nearest neighbors of y are established in this way: ∀x i , x j ∈ D, if Mass h (x i , y) < Mass h (x j , y) then similarity (x i , y) > similarity (x j , y). It implies that the instance having lower mass concerning query instance is more identical to the query instance. After this discover the k nearest neighbor of the y and later conclude class of y in the voted way.
Complexity discussion
The avg. case time complexity of traditional k-nearest neighbor, using Lp Norm metric (Minkowski metric) is
In first phase (modeling), the time complexity can scale down from
Comparison with traditional kNN
The traditional kNN always experience challenges in high dimension datasets due to the fact that all instances are approximately equidistant to the queried instance [13]. Linear discriminant analysis (LDA) or principal component analysis (PCA) or canonical correlation analysis (CCA) techniques are needed in the pre-processing step to save kNN from Curse of Dimensionality. These techniques along with kNN are computationally expensive due to their own run time complexity. However, ARSkNN do not need these techniques as pre-processing step, because Massim arbitrarily selects an attribute to generate an internal node of sTree and each sTree itself is generated by utilizing a small subset of the entiredataset.
The traditional kNN needs all the training instances in memory to classify a queried instance. This is the reason why it is also known as lazy classifier [1]. It is also a drawback by which kNN gives problems when applied upon datasets with very high dimension. While, ARSkNN does not need to keep all training instances in memory due to an intermittent modeling phase in which it makes sForest with sTrees. It is therefore, easy to apply ARSkNN upon high dimension datasets.
Implementation
The implementation of traditional kNN defined as IBK in Waikato Environment for Knowledge Analysis (Weka) 3.7.10 [17] platform has been used in this experiment. In order to implement nearest neighbors search algorithm using Euclidean distance, Weka named it as LinearNNSearch.
The implementation of ARSkNN has been performed using JDK 1.8.0 with IDE Netbeans 8.0.2. The jar file with the name ARSkNN.jar was merged during runtime in Weka 3.7.10 platform and it was tested on various datasets from different domains as described below.
Datasets used
The datasets were chosen from various research domains varying in sizes for executing and testing the algorithm. The characteristics, attribute and instance information of all datasets utilized in the evaluation are outlined in Table 3.
Datasets Attributes
Datasets Attributes
The datasets used along with their brief descriptions are as mentioned below:
Pendigits dataset consists of 250 samples from 44 writers. These writers wrote 250 digits inside boxes of 500×500 pixel resolution on the tablet. These 250 samples get converted to 10992 instances. Each instance has 16 numeric attributes and all instances are classified into 10 classes as numeric digits 0–9. Approximately 1090 instances belong to each class. kNN with k = 5 and multilayer perceptron were used as a classifier in the first research approach applied to this dataset [2]. Recently published paper in 2013, the accuracy percentage on this dataset using DEMass-Bayes classifier is projected at 98.87% [28].
Letter recognition dataset
Letter recognition dataset is about character image features. This dataset is based on 20 different types of fonts for 26 English alphabet characters. These 20 fonts are twisted to create 20000 unique instances with 16 numeric attributes. Each alphabet has approximately 770 instances. In the first publication with this dataset, an adaptive classifier (IF-THEN classifier) was used for classification [14]. The accuracy percentage using DEMass-Bayes classifier on this dataset is 92.27% [28].
Magic04 dataset
Magic04 dataset is created underneath of a Monte Carlo program to recreate enrollment of high energy gamma particles in a very ground-based atmospheric Cherenkov gamma telescope by employing the imaging method. The dataset consists of 19920 instances out of which 12332 are gammas and 6688 are hadrons. Each instance is characterized by 10 numeric attributes. Total 14 different classifiers were applied in the first research done on this dataset and kNN was one of them [5]. DEMass-Bayes classifier produced 84.17 as accuracy percentage on this dataset [28].
Statlog dataset
Statlog dataset is created from Landsat Satellite data to identify the land use in 7 different classes. The dataset is a very small sub-area of a sight captured by Landsat Satellite, comprising of 82×100 pixels. Each instance of dataset resembles to a 3×3 square region of pixels accommodated within the 82×100 pixels area. Original dataset has 6435 instances with 36 numeric attributes. There is no any known first research done on this dataset. But, many researchers had used this dataset in their research.
Covertype dataset
Covertype dataset is use to predict the cover type of four forest areas in the Roosevelt National Forest. In this dataset, 12 independent variables consists of total 54 numeric attributes for 581012 instances are generated from digital spatial data processed in a GIS. In the first research done on this dataset, Artificial Neural Network classifier is used for classification task [4]. DEMass-Bayes classifier produced 92.05 as accuracy percentage on this dataset [28].
Pendigits, Letter Recognition, Magic04 and Statlog datasets are taken as provided in UCI repository. In Statlog_Middle dataset, only the four spectral values are taken by attribute number 17, 18, 19 and 20 for the central pixel of the real dataset. Covertype_Small dataset is prepared by randomly chosen 10000 instances from the original Covertype dataset.
Empirical evaluations
This section demonstrates the experimental results achieved by evaluating performance of the ARSkNN vis-à-vis the traditional kNN. Evaluations are carried out to:
compare ARSkNN with traditional kNN; and analyse the behaviours of ARSkNN when key parameters are changed.
The entire classification experiment has been set up on the system equipped with Intel Core i7 processor, 2.4 GHz clock cycle and 8 Gega Bytes processing memory. Further the 10 fold cross validation on Weka 3.7.10 has been executed for both the classifiers during the experiment.
Table 4 represents the outcomes for the avg. classification accuracy percentage parameter over 10-fold cross validation for IBK and ARSkNN with 10, 50, and 100 sTrees for distinct values of k which are 1, 3, 5 and 10. Table 4 concludes that ARSkNN contributes better avg. classification accuracy in Magic04, Statlog_Middle, Statlog and Covertype_Small datasets over IBK.
Average classification accuracy (%) Over 10-Fold cross-validation for IBK and ARSKNN with 10 sTrees, ARSkNN with 50 sTrees and ARSkNN with 100 sTrees for distinct values of k (1, 3, 5, and 10)
Average classification accuracy (%) Over 10-Fold cross-validation for IBK and ARSKNN with 10 sTrees, ARSkNN with 50 sTrees and ARSkNN with 100 sTrees for distinct values of k (1, 3, 5, and 10)
Note: Boldface values represent the best accuracy achieved in the dataset.
The results on accuracy percentage as shown in Table 4 are also graphically represented in Figs. 1–4. In these figures, a comparison of IBK, ARSkNN with 10 trees, ARSkNN with 50 trees and ARSkNN with 100 trees over 10-fold cross validation for all six datasets has been illustrated for avg. classification accuracy.

Avg. Accuracy (%) of classifiers for k = 1.

Avg. Accuracy (%) of classifiers for k = 3.

Avg. Accuracy (%) of classifiers for k = 5.

Avg. Accuracy (%) of classifiers for k = 10.
The count of nearest neighbor (k) is taken as 1 in Fig. 1. It is clearly evident from Fig. 1, ARSkNN with 100 trees produced significantly better accuracies in comparison with IBK for Magic04, Statlog_Middle, Statlog and Covertype_Small datasets. In Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse in comparison with IBK.
In Fig. 2, nearest neighbors (k) are chosen as 3. As shown in Fig. 2, the better classification accuracies for Magic04 and Statlog_Middle datasets are produced in ARSkNN with 50 tress and ARSkNN with 100 trees produced significantly better accuracies for Statlog and Covertype_Small datasets. In Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse in comparisonwith IBK.
Figure 3 shows result for 5 nearest neighbors (k). As can be deciphered from Fig. 3, the better classification accuracies for Magic04 and Statlog_Middle datasets are produced in ARSkNN with 50 tress and ARSkNN with 100 trees produced significantly better accuracies for Statlog and Covertype_Small datasets. In Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse in comparison with IBK.
Figure 4 illustrates results for the number of nearest neighbors (k) as 10. As shown in Fig. 4, the better classification accuracies for Statlog_Middle datasets are produced in ARSkNN with 50 tress and ARSkNN with 100 trees produced significantly better accuracies for Magic04 and Covertype_Small datasets. In Pendigits, Letter Recognition and Statlog datasets, the accuracy of ARSkNN is not significantly worse in comparison with IBK.
Table 5 shows the obtained results based on the avg. runtime parameter in seconds over 10-fold cross validation for IBK and ARSkNN with 10, 50, and 100 sTrees for distinct values of k which are 1, 3, 5 and 10. Table 5 concludes that ARSkNN contributes better avg. runtime in all the datasets used for evaluation over IBK.
Avg. runtime (in seconds) over 10-fold cross-validation for IBK, ARSkNN with 10 sTrees, ARSkNN with 50 sTrees and ARSkNN with 100 sTrees for distinct values of k (1,3,5,10)
Note: Boldface values represent the worst runtime achieved in the dataset.
The results on avg. runtime (in seconds) as shown in Table 5 are also graphically represented in Figs. 5–8. In these figures, a comparison of IBK, ARSkNN with 10 trees, ARSkNN with 50 trees and ARSkNN with 100 trees over 10-fold cross validation for all six datasets has been illustrated for avg. runtime (in seconds).

Avg. Runtime (in seconds) of classifiers for k = 1.

Avg. Runtime (in seconds) of classifiers for k = 3.

Avg. Runtime (in seconds) of classifiers for k = 5.

Avg. Runtime (in seconds) of classifiers for k = 10.
Figure 5 shows the result for nearest neighbor (k) as 1. Figure 5 reflects that ARSkNN is taking a significantly less avg. runtime against IBK even with 100 trees upon Pendigits, Letter Recognition, Magic04 and Statlog_Middle datasets. In Statlog and Covertype_Small datasets, the avg. runtime of ARSkNN with 10 trees and with 50 trees are lesser than IBK but the avg. runtime of ARSkNN with 100 trees is more than IBK.
In Fig. 6, nearest neighbors (k) are taken as 3. Figure 6 reflects that ARSkNN is taking a significantly less avg. runtime against IBK even with 100 trees upon all datasets named Pendigits, Letter Recognition, Magic04, Covertype_Small, Statlog and Statlog_Middle.
The value of nearest neighbor (k) is taken as 5 in Fig. 2. As can be deciphered from Fig. 2 that ARSkNN is taking a significantly less avg. runtime against IBK even with 100 trees upon all datasets named Pendigits, Letter Recognition, Magic04, Covertype_Small, Statlog and Statlog_Middle.
The result with nearest neighbor (k) as 10 for all six datasets has been shown in Fig. 8. The ARSkNN significance over IBK in term of less runtime achieved on the Pendigits, Letter Recognition, Magic04, Covertype_Small, Statlog and Statlog_Middle datasets is reflected in the Fig. 8.
This research established the fact that ARSkNN is significantly better as compared to IBK (using Euclidean distance) in accuracy percentage for 4 datasets namely, Magic04, Statlog_Middle, Statlog and Covertype_Small as shown in Table 4. One should not expect ARSkNN to outperform IBK in all datasets because the classifiers accuracy percentage depends upon number of factors, such as whether the attributes of dataset’s domain counterpart similarity measure.
However, ARSkNN was found significantly better than IBK (with Euclidean distance) in avg. runtime for all the datasets. This is because of the fact that rather than computing distance among each training instance and the testing instance in the testing phase, ARSkNN calculate the similarity measure before the testing phase during the modeling of sForest and the testing instance is compared with the similarities in the modeled sForest.
This algorithm will give more accuracy percentage, if more number of samples are taken to build sTrees, but one should not take more than the half of total number of training instances. Generally, it is observed that for ARSkNN, k = 3; number of sTrees=100; and number of samples to build sTree = 5000 gives a better accuracy in most of the datasets. However, it has been observed that the avg. accuracy percentage of the IBK depends upon the learning metric (similarities or dissimilarities), the value of k and the characteristic of datasets.
The similarities and differences between ARSkNN and traditional kNN classifier are discussed below. Both classifiers utilizing the voting mechanism for deciding the classes of testing instances, after finding the k-nearest neighbors and also for both classifiers, there is need to decide the proper value of k. The fundamental difference between both of them is that the traditional kNN classifier uses (dis) similarity measures which have their base as distance calculation, whereas ARSkNN uses a similarity measure known as Massim which has mass estimation as its base.
There are some applications of tree structures (e.g. k-d Tree) for indexing in order to increase the nearest neighbor search. The reader should not confuse with the purpose of the sTrees, that they are used to hold similarity measures and not for indexing, because sTrees are balanced binary trees so the indexing techniques can be used to decrease the time complexity of sTrees.
Conclusion
This paper well establishes the empirical and experimental foundation of ARSkNN, as an efficient approach of kNN classifiers. ARSkNN represents a paradigm shift from distance based kNN to Massim based kNN. ARSkNN overcame the key drawback of kNN classifiers, i.e., to keep all training instances in memory for the sake of classifying a single testing instance. Our results verified and established the fact that ARSkNN has better accuracy percentage and better avg. runtime than kNN.
There are possibilities of extending the proposed algorithm.
one can utilize other distance metrics as of Euclidean distance for comparison of ARSkNN with kNN. ARSkNN can also be applied in various datasets from other domains like document classification, geostatistics etc. the efficiency of ARSkNN will be checked upon parameters like sensitivity (recall), F-measure, balanced classification ratio (BCR).
