ARSkNN: An efficient k-nearest neighbor classification technique using mass based similarity measure

Abstract

Even though finding out distance is the central core of k-Nearest Neighbor classification techniques, similarity measures are often favored against distance in various realistic scenarios and situation. Most of the similarity measures, which are used to classify an instance, are based on geometric model. Their effectiveness decreases with the increases in the number of dimensions. This paper establishes an efficient technique called ARSkNN for finding out class of any given instance using a measure based on an unique similarity, that does no longer compute distance, for k-NN classification. Our empirical results show that ARSkNN classification technique is better than the previous established k-NN classifiers. The performance of algorithm was verified and validated on various datasets from different domains.

Keywords

Data mining classification nearest neighbor similarity measure

1 Introduction

In the domain of data mining, classification techniques are realm of much eagerness for well-known researchers, attributable to its ample range of practical applications, viz; geostatistics, computer vision, speech recognition and many more. The popular and established k-Nearest Neighbors also called as k-NN classification approach is among the utmost straightforward and oldest among classification approaches. In this approach, the class of the questioned instance is established in two steps; in the initial step, k nearest neighbors of questioned instance is establish and in next or last step, the class that has the majority in those nearest neighbors is said to be the class of questioned instance [12]. The accurateness of k-NN classification is drastically influenced by the metric utilize for determination of similarities or dissimilarities (distances) between the instances with the known classes and the questioned instance [34]. The vast majority of the works as reported for the metric learning phase, i.e., the first phase for k-NN classification approach, are dependent on distance metrics [3 , 34]. Nevertheless, in numerous real world applications like document classification, similarities (such as cosine similarity) are preferred over distances (such as Euclidean distance).

Additionally, numerous researches have revealed that utilizing measures based on similarity are often chosen against the distance measures even in the non-textual datasets (like Iris, Wine and Balance) [22]. Preferring measure based on distance/similarity relies on the representation or the measurement type of instances which needs classification. Many similarity measures which are used in metric learning phase of k-NN classification are directly or indirectly calculated with the help of distance measures. The recently proposed similarity measure based on mass (Massim), is a substitute to similarity measure based on distance and has revealed better outcome in information retrieving tasks [31]. The cardinality feature of the least adjacent area which wraps those two instances is the main concept to calculate massim. Our algorithm ARSkNN makes use of Massim as a metric learner in k-NN classification approach.

This paper makes the following contributions; It:

establishes an efficient kNN classification technique, ARSkNN (the name comes from the first character of first name of its proposers Ashish, Roheet and Sumit), which utilizes similarity measure based on mass rather than distance.

establishes the empirical and experimental foundation of ARSkNN.

empirically evaluate ARSkNN and compared it with the traditional kNN. The results shows its superiority in the context of avg. runtime and avg. accuracy percentage.

2 Literature review

The section reviews k-NN classifiers and similarity measures.Massim and operational concept of Massim is also explained in this section. An approach of calculating similarity measure based on mass (Massim) is recently developed and hence available literature is also very limited.

2.1 k - Nearest neighbors classification

The cornerstones of k - nearest neighbors classification [12] are Nearest Neighbor classification and the k-NN law first suggested in 1951 by Fix et al. It has been also recognized by so many names; some of them are instance based classifier, case based classifier, memory based classifier, lazy classifier. There are 3 crucial pillars of k-NN classifier: class labeled instances set; the value of k and dissimilarity or similarity metric. It is lazy learning classification technique because it does not create any generalization from the training instances of the dataset. Therefore, it desires all instances of t raining data in memory for each testing instance. Due to this, there is a minimal training phase, but costly testing phase in context of space & time complexityboth [1].

To enhance k-NN classifiers, various measures have been recommended. Some of them are as; Mahalanobis distance [34], dynamic (adaptive) distance [33] and local distance metric [20]. Furthermore, the customary k-NN classifier has seen a variety of augmentations and transformations till now, for example, distance lower bound based nearest neighbor [9], Nearest Feature Line (NFL) classification technique [19], Nearest Feature Space (NFS) classification technique and Nearest Feature Plane (NFP) classification technique [10], Nearest Neighbor Line classification technique (NNLC) [37], Center-based Nearest Neighbor classification technique (CBNNC) [16], Meaningful Nearest Neighbor (MNN) [21], MaxNearestDist based nearest neighbor [23], Optimal Weighted Nearest Neighbor classifiers [24] and k-regular nearest neighbor graph [6]. But, all the above classifiers are based on distance (dissimilarity) or similarity metrics. Also genetic algorithms and genetic programming are used to propose robust nearest neighbor classifier [15]. Yu-Jie Xiong et al. [35] has suggested that distance based classifiers are very convenient and relevant in the domain of writer recognition.

Till now, there is no k-NN classifier which takes into account similarity metric based on mass (Massim). This research paper comes up with an efficient classifier for the same and it is the novelty of our approach in ARSkNN.

2.2 Concept of similarity measures

Similarity, is a perception which can be utilize not only in data mining domain but also numerous further subjects like ecology and psychology. To determine similarity among two objects, we can use similarity measure which is a function of real values. When objects are more similar to each other, then the return value of the function is higher.

Afterwards a similarity measure to classify biological species was formulated by Jaccard in 1901, various dissimilarity (distance) and similarity measures have been formulated in various scientific disciplines in the past hundred years. The literature demonstrates that Zezula et al. [36] and Cha [7] have done exhaustive surveys of similarity measures. Seung-Seok Choi et al. have also done a review of 76 binarized distance and similarity metrics alongside their correlations, via hierarchical clustering techniques [11].

A similarity measure is treated as a metric on the off chance, that it results in a higher value as the similarity between objects increases. A metric similarity S assures the following [27]:

Non-Negativity: S (X, Y) > =0 i.e., the distance among any two elements of the metric space can not be less than 0.

Reflexivity: S (X, Y) =0 if and only if X = Y i.e., if two elements are equivalent to each other then the distance among them is 0.

Symmetry: S (X, Y) = S (Y, X) i.e., the distance among any two elements will be same in either way.

Triangle Inequality: S (X, Z) < = S (X, Y) + S (Y, Z) i.e., let three elements x, y, z of metric space then the sum of the distance among (x, y) and (y, z) is greater than (x, z).

These four properties are also known as distance axioms. Nonetheless, a few researchers confronted the triangle inequality axiom by proposing non-metric similarity measures [26]. The similarity measures based on distance is a binary function dist(x, y) [Table 1], i.e., all of them must have two variables, regardless of whether non-metric or metric. In large datasets, the fundamental downside with similarity measures based on distance is that they are expensive with regard to computation due to their high time complexities, which refers to the comparison matrix drawn for the pairwise data selected. There are some established techniques by which one can transform the similarity measure to distance measures and vice versa [8].

Table 1
Distance-based Similarity versus Mass-based Similarity

Distance-based similarity Mass-based similarity

Computation dist(x, y) is calculated (exclusively) on the basis of coordinate x and y in the feature space. Mass(x, y) is calculated (exclusively) on the basis of data distribution in local region of feature space.

Definition the shortest path length from x to y is determined by dist(x, y). The cardinality of smallest local area covering both x and y determines Mass base function M(x, y).

Inequality similarity(x, y) > similarity(x, z) ≡ dist(x, y) <dist(x, z) similarity(x, y) > similarity(x, z) ≡ Mass(x, y) < Mass(x, z)

Metric All four distance axioms generally intact. All major distance axioms intact except one.

	Distance-based similarity	Mass-based similarity
Computation	dist(x, y) is calculated (exclusively) on the basis of coordinate x and y in the feature space.	Mass(x, y) is calculated (exclusively) on the basis of data distribution in local region of feature space.
Definition	the shortest path length from x to y is determined by dist(x, y).	The cardinality of smallest local area covering both x and y determines Mass base function M(x, y).
Inequality	similarity(x, y) > similarity(x, z) ≡ dist(x, y) <dist(x, z)	similarity(x, y) > similarity(x, z) ≡ Mass(x, y) < Mass(x, z)
Metric	All four distance axioms generally intact.	All major distance axioms intact except one.

2.3 Massim - similarity measure based on mass

In 2010, Kai Ming Ting et al. suggested the idea of mass and how to estimate mass [29]. It was proposed and established for replacement of density estimation. Massim is not quite the same as all similarity measures. It is built on the perception of mass [30], that is the cardinality of region, instead of distance.Mainly, it does not quantify distance measure, which is the heart and soul of all the distance based similarities where as it is established on data distribution in the local area. Massim assures that 2 similar objects should be in the same local region. In view of mass, Massim is a binary function to quantify resemblance among 2 instances however, Mass is a unary function.

The comparison among distance based and mass based similarity measures have been shown in Table 1.

Massim does not fulfill all the four distance axioms dissimilar to distance-based similarity measures, as mentioned in Table 2.

Table 2
Axioms followed by Distance-based Similarity Measures and Mass-based Similarity Measures

Distance-based similarity measure Mass-based similarity measure

Axiom 1 dist. (x, y) > =0 Mass. (x, y) > =1 (non-negativity)

Axiom 2 dist. (x, y) =0 ⇔ x = y i. ∀x, yMass. (x, x) < = Mass. (x, y) ii. ∃x ≠ yMass. (x, x) ≠ Mass. (y, y) (reflexivity)

Axiom 3 dist. (x, y) = dist. (y, x) Mass. (x, y) = Mass. (y, x) (symmetry)

Axiom 4 dist. (x, z) < = dist. (x, y) + dist. (y, z) Mass. (x, z) < Mass. (x, y) + Mass. (y, z) (triangle in-equality)

	Distance-based similarity measure	Mass-based similarity measure
Axiom 1	dist. (x, y) > =0	Mass. (x, y) > =1	(non-negativity)
Axiom 2	dist. (x, y) =0 ⇔ x = y	i. ∀x, yMass. (x, x) < = Mass. (x, y) ii. ∃x ≠ yMass. (x, x) ≠ Mass. (y, y)	(reflexivity)
Axiom 3	dist. (x, y) = dist. (y, x)	Mass. (x, y) = Mass. (y, x)	(symmetry)
Axiom 4	dist. (x, z) < = dist. (x, y) + dist. (y, z)	Mass. (x, z) < Mass. (x, y) + Mass. (y, z)	(triangle in-equality)

Massim is a revolutionary concept which is a similarity measure based on mass.

2.4 Operational concept for massim

The two adaptations for Massim, first being for single dimensional and the next being for multidimensional representation of objects (instances) [31].

For single dimensional adaptation, in order to create a model E to distinct feature space into least convex local section R (x, y|E), which covers x, y ∈ I R, as product of twofold split that separates real line characterized by D = {x₁ < x₂ < x₃ … < x_n} into two non-intersecting local area r₁ (where x < = b_i) and r₂ (where x > = b_i); and one region r₃, that spreads over the intact real line should also be there. Now Massim(x, y) is expressed as: $Massim (x, y) = (\sum_{i = 1}^{n - 1} M_{i} (x, y)^{e} p (b_{i}))^{\frac{1}{e}}$ (1) where, e is mean either arithmetic or harmonic, p (b_i) is the probability to choose twofold split i.e., (b_i) and $M_{i} (x, y) = {\begin{matrix} i & if R (x, y | E) = r_{1} \\ n - 1 & if R (x, y | E) = r_{2} \\ n & if R (x, y | E) = r_{3} \end{matrix}$ (2)

Another smoother form of single dimensional Massim i.e. Level-h Massim, is estimated iteratively by using the following equation: ${Massim}^{h} (x, y) = (\sum_{i = 1}^{n - 1} {Massim}_{i}^{h - 1} (x, y)^{e} p (b_{i}))^{\frac{1}{e}}$ (3) with the condition,

$\begin{matrix} {Massim}^{h} (x, y) \\ = {\begin{matrix} \sum_{i = 1}^{n - 1} m_{i} (x) p (b_{i}) & h = 1 \\ \sum_{i = 1}^{n - 1} {massim}_{i}^{h - 1} (x) p (b_{i}) & h > 1 \end{matrix} \end{matrix}$ (4) where, $m_{i} (x) = {\begin{matrix} i & ifx < = x_{i} \\ n - 1 & ifx > = x_{i} \end{matrix}$ (5)

Level-h Massim takes into account the manner of data distribution which is further efficient and it is utilized for multi-modal data distribution [31].

The similarity of 2 instances (x, y) (for multi-dimensional Massim) is assessed by standard avg. of mass for every smallest local area that covers instances (x, y). The mass inequality produced by half-space splits [29], is promise by the same random local areas.

Also for multi-dimensional datasets, every single half-space split is carried out over arbitrarily choose features. A h-level split is represented using a tree structure in which every single subtree contains h half-space splits. Hence there would be 2^h distinct local areas in this tree [31].

Further to calculate similarity for multi-dimensional mass similarity, the following equation is referred: ${Massim}^{h} (x, y) \approx (\frac{1}{t} \sum_{i = 1}^{t} M_{h, i} (x, y)^{e}) \frac{1}{e}$ (6) where, $M_{h, i} (x, y)$ is the mass in R (x, y|E_i (h)), i.e. the smallest area of numerous multi-dimensional models E_i (h) enclosing x and y both; and t > 1 is the quantity of arbitrarily selected areas which are referred for outline the mass similarity of x, y.

3 ARSkNN: the proposed k-NN classifier

In our previous research work [18], ARSkNN has been already is proposed as a novel and robust k-nearest neighbor classifier which utilizes Massim, a similarity measure based on mass, against the techniques rely upon distance-based similarity measures. Further the algorithm is explained in the following paragraph:

In this algorithm, y is the query instance for which Mass^h (x, y) is calculated for every instance x in the given dataset D. In order to calculate, instances (x, y) are analyzed with t similarity trees (sTrees) from the constructed similarity forest (sForest) which is generated from dataset D, that has {(x₁, c₁), (x₂, c₂), …, (x_n, c_n)} without consideration of c_i [31]. Further to find the k nearest neighbors (instances) in D with respect to queried instance, one has to use Algorithm 1 i.e., ARSkNN.

The nearest neighbors of y are established in this way: ∀x_i, x_j ∈ D, if Mass^h (x_i, y) < Mass^h (x_j, y) then similarity (x_i, y) > similarity (x_j, y). It implies that the instance having lower mass concerning query instance is more identical to the query instance. After this discover the k nearest neighbor of the y and later conclude class of y in the voted way.

3.1 Complexity discussion

The avg. case time complexity of traditional k-nearest neighbor, using Lp Norm metric (Minkowski metric) is $O (knlogn)$ [32], whereas ARSkNN is having $O (ntlog d)$ complexity in the modeling phase which is deliberated in the equation (6) defined in section 2.4. Further for class assignment phase, the time complexity is $O (kn)$ here, n represent all instances of dataset D; d represents number of arbitrarily chosen instances of dataset D; t represents number of arbitrary areas utilized for calculating massim and k is the count of nearestneighbors.

In first phase (modeling), the time complexity can scale down from $O (ntlogd)$ to $O ((n + d) t)$ by adopting any suitable indexing technique, as all sTrees of every sForest are balanced binary tree [18].

3.2 Comparison with traditional kNN

The traditional kNN always experience challenges in high dimension datasets due to the fact that all instances are approximately equidistant to the queried instance [13]. Linear discriminant analysis (LDA) or principal component analysis (PCA) or canonical correlation analysis (CCA) techniques are needed in the pre-processing step to save kNN from Curse of Dimensionality. These techniques along with kNN are computationally expensive due to their own run time complexity. However, ARSkNN do not need these techniques as pre-processing step, because Massim arbitrarily selects an attribute to generate an internal node of sTree and each sTree itself is generated by utilizing a small subset of the entiredataset.

The traditional kNN needs all the training instances in memory to classify a queried instance. This is the reason why it is also known as lazy classifier [1]. It is also a drawback by which kNN gives problems when applied upon datasets with very high dimension. While, ARSkNN does not need to keep all training instances in memory due to an intermittent modeling phase in which it makes sForest with sTrees. It is therefore, easy to apply ARSkNN upon high dimension datasets.

4 Implementation

The implementation of traditional kNN defined as IBK in Waikato Environment for Knowledge Analysis (Weka) 3.7.10 [17] platform has been used in this experiment. In order to implement nearest neighbors search algorithm using Euclidean distance, Weka named it as LinearNNSearch.

The implementation of ARSkNN has been performed using JDK 1.8.0 with IDE Netbeans 8.0.2. The jar file with the name ARSkNN.jar was merged during runtime in Weka 3.7.10 platform and it was tested on various datasets from different domains as described below.

4.1 Datasets used

The datasets were chosen from various research domains varying in sizes for executing and testing the algorithm. The characteristics, attribute and instance information of all datasets utilized in the evaluation are outlined in Table 3.

Table 3
Datasets Attributes

Dataset Name [Ref] # Instances # Attributes # Classes Domain

PenDigits [2] 10992 16 10 Digit Recognition

Letter Recognition [14] 20000 16 26 Character Recognition

Magic04 [5] 19020 10 2 Gamma Particles

Statlog_Middle 6435 4 7 Landsat Satellite

Statlog 6435 36 7 Landsat Satellite

Covertype_Small [4] 10000 54 7 Forest Covertype

Dataset Name [Ref]	# Instances	# Attributes	# Classes	Domain
PenDigits [2]	10992	16	10	Digit Recognition
Letter Recognition [14]	20000	16	26	Character Recognition
Magic04 [5]	19020	10	2	Gamma Particles
Statlog_Middle	6435	4	7	Landsat Satellite
Statlog	6435	36	7	Landsat Satellite
Covertype_Small [4]	10000	54	7	Forest Covertype

The datasets used along with their brief descriptions are as mentioned below:

4.1.1 Pendigits dataset

Pendigits dataset consists of 250 samples from 44 writers. These writers wrote 250 digits inside boxes of 500×500 pixel resolution on the tablet. These 250 samples get converted to 10992 instances. Each instance has 16 numeric attributes and all instances are classified into 10 classes as numeric digits 0–9. Approximately 1090 instances belong to each class. kNN with k = 5 and multilayer perceptron were used as a classifier in the first research approach applied to this dataset [2]. Recently published paper in 2013, the accuracy percentage on this dataset using DEMass-Bayes classifier is projected at 98.87% [28].

4.1.2 Letter recognition dataset

Letter recognition dataset is about character image features. This dataset is based on 20 different types of fonts for 26 English alphabet characters. These 20 fonts are twisted to create 20000 unique instances with 16 numeric attributes. Each alphabet has approximately 770 instances. In the first publication with this dataset, an adaptive classifier (IF-THEN classifier) was used for classification [14]. The accuracy percentage using DEMass-Bayes classifier on this dataset is 92.27% [28].

4.1.3 Magic04 dataset

Magic04 dataset is created underneath of a Monte Carlo program to recreate enrollment of high energy gamma particles in a very ground-based atmospheric Cherenkov gamma telescope by employing the imaging method. The dataset consists of 19920 instances out of which 12332 are gammas and 6688 are hadrons. Each instance is characterized by 10 numeric attributes. Total 14 different classifiers were applied in the first research done on this dataset and kNN was one of them [5]. DEMass-Bayes classifier produced 84.17 as accuracy percentage on this dataset [28].

4.1.4 Statlog dataset

Statlog dataset is created from Landsat Satellite data to identify the land use in 7 different classes. The dataset is a very small sub-area of a sight captured by Landsat Satellite, comprising of 82×100 pixels. Each instance of dataset resembles to a 3×3 square region of pixels accommodated within the 82×100 pixels area. Original dataset has 6435 instances with 36 numeric attributes. There is no any known first research done on this dataset. But, many researchers had used this dataset in their research.

4.1.5 Covertype dataset

Covertype dataset is use to predict the cover type of four forest areas in the Roosevelt National Forest. In this dataset, 12 independent variables consists of total 54 numeric attributes for 581012 instances are generated from digital spatial data processed in a GIS. In the first research done on this dataset, Artificial Neural Network classifier is used for classification task [4]. DEMass-Bayes classifier produced 92.05 as accuracy percentage on this dataset [28].

Pendigits, Letter Recognition, Magic04 and Statlog datasets are taken as provided in UCI repository. In Statlog_Middle dataset, only the four spectral values are taken by attribute number 17, 18, 19 and 20 for the central pixel of the real dataset. Covertype_Small dataset is prepared by randomly chosen 10000 instances from the original Covertype dataset.

5 Empirical evaluations

This section demonstrates the experimental results achieved by evaluating performance of the ARSkNN vis-à-vis the traditional kNN. Evaluations are carried out to:

compare ARSkNN with traditional kNN; and

analyse the behaviours of ARSkNN when key parameters are changed.

The entire classification experiment has been set up on the system equipped with Intel Core i7 processor, 2.4 GHz clock cycle and 8 Gega Bytes processing memory. Further the 10 fold cross validation on Weka 3.7.10 has been executed for both the classifiers during the experiment.

Table 4 represents the outcomes for the avg. classification accuracy percentage parameter over 10-fold cross validation for IBK and ARSkNN with 10, 50, and 100 sTrees for distinct values of k which are 1, 3, 5 and 10. Table 4 concludes that ARSkNN contributes better avg. classification accuracy in Magic04, Statlog_Middle, Statlog and Covertype_Small datasets over IBK.

Table 4
Average classification accuracy (%) Over 10-Fold cross-validation for IBK and ARSKNN with 10 sTrees, ARSkNN with 50 sTrees and ARSkNN with 100 sTrees for distinct values of k (1, 3, 5, and 10)

Data Sets For k = 1 For k = 3

IBK ARS-10 Tree ARS-50 Tree ARS-100 Tree IBK ARS-10 Tree ARS-50 Tree ARS-100 Tree

PenDigits 99.36 98.33 99.19 99.28 99.35 98.46 99.19 99.24

Letter Recognition 95.96 86 94.46 95.63 95.63 86.87 93.41 94.58

Magic04 80.93 83.18 85.29 85.52 83.2 84.02 84.9 84.84

Satlog_Middle 82.25 82 83.49 83.57 84.07 84.35 84.75 84.61

Satlog 90.2 88.51 91.11 91.37 90.78 89.19 90.45 90.87

Covertype_Small 80.9 79.12 81.64 81.92 79.09 78.95 79.89 80.64

(a) (b)

Data Sets For k = 5 For k = 10

IBK ARS-10 Tree ARS-50 Tree ARS-100 Tree IBK ARS-10 Tree ARS-50 Tree ARS-100 Tree

PenDigits 99.26 98.31 98.97 98.94 99.02 97.98 98.54 98.57

Letter Recognition 95.5 85.66 92.48 93.37 94.79 82.98 90.12 91.18

Magic04 83.64 83.84 84.55 84.52 83.79 83.47 83.82 83.86

Satlog_Middle 84.66 85.19 85.48 85.45 85.62 85.71 86.23 86.02

Satlog 90.62 88.92 89.85 90.99 90.48 88.22 89.23 89.33

Covertype_Small 78.01 77.53 78.1 78.62 75.55 75.79 75.74 75.93

(c) (d)

Data Sets	For k = 1	For k = 3
PenDigits	99.36	98.33	99.19	99.28	99.35	98.46	99.19	99.24
Letter Recognition	95.96	86	94.46	95.63	95.63	86.87	93.41	94.58
Magic04	80.93	83.18	85.29	85.52	83.2	84.02	84.9	84.84
Satlog_Middle	82.25	82	83.49	83.57	84.07	84.35	84.75	84.61
Satlog	90.2	88.51	91.11	91.37	90.78	89.19	90.45	90.87
Covertype_Small	80.9	79.12	81.64	81.92	79.09	78.95	79.89	80.64
(a)	(b)
Data Sets	For k = 5	For k = 10
	IBK	ARS-10 Tree	ARS-50 Tree	ARS-100 Tree	IBK	ARS-10 Tree	ARS-50 Tree	ARS-100 Tree
PenDigits	99.26	98.31	98.97	98.94	99.02	97.98	98.54	98.57
Letter Recognition	95.5	85.66	92.48	93.37	94.79	82.98	90.12	91.18
Magic04	83.64	83.84	84.55	84.52	83.79	83.47	83.82	83.86
Satlog_Middle	84.66	85.19	85.48	85.45	85.62	85.71	86.23	86.02
Satlog	90.62	88.92	89.85	90.99	90.48	88.22	89.23	89.33
Covertype_Small	78.01	77.53	78.1	78.62	75.55	75.79	75.74	75.93
(c)	(d)

Note: Boldface values represent the best accuracy achieved in the dataset.

The results on accuracy percentage as shown in Table 4 are also graphically represented in Figs. 1–4. In these figures, a comparison of IBK, ARSkNN with 10 trees, ARSkNN with 50 trees and ARSkNN with 100 trees over 10-fold cross validation for all six datasets has been illustrated for avg. classification accuracy.

Fig.1

Avg. Accuracy (%) of classifiers for k = 1.

Fig.2

Avg. Accuracy (%) of classifiers for k = 3.

Fig.3

Avg. Accuracy (%) of classifiers for k = 5.

Fig.4

Avg. Accuracy (%) of classifiers for k = 10.

The count of nearest neighbor (k) is taken as 1 in Fig. 1. It is clearly evident from Fig. 1, ARSkNN with 100 trees produced significantly better accuracies in comparison with IBK for Magic04, Statlog_Middle, Statlog and Covertype_Small datasets. In Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse in comparison with IBK.

In Fig. 2, nearest neighbors (k) are chosen as 3. As shown in Fig. 2, the better classification accuracies for Magic04 and Statlog_Middle datasets are produced in ARSkNN with 50 tress and ARSkNN with 100 trees produced significantly better accuracies for Statlog and Covertype_Small datasets. In Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse in comparisonwith IBK.

Figure 3 shows result for 5 nearest neighbors (k). As can be deciphered from Fig. 3, the better classification accuracies for Magic04 and Statlog_Middle datasets are produced in ARSkNN with 50 tress and ARSkNN with 100 trees produced significantly better accuracies for Statlog and Covertype_Small datasets. In Pendigits and Letter Recognition datasets, the accuracy of ARSkNN is not significantly worse in comparison with IBK.

Figure 4 illustrates results for the number of nearest neighbors (k) as 10. As shown in Fig. 4, the better classification accuracies for Statlog_Middle datasets are produced in ARSkNN with 50 tress and ARSkNN with 100 trees produced significantly better accuracies for Magic04 and Covertype_Small datasets. In Pendigits, Letter Recognition and Statlog datasets, the accuracy of ARSkNN is not significantly worse in comparison with IBK.

Table 5 shows the obtained results based on the avg. runtime parameter in seconds over 10-fold cross validation for IBK and ARSkNN with 10, 50, and 100 sTrees for distinct values of k which are 1, 3, 5 and 10. Table 5 concludes that ARSkNN contributes better avg. runtime in all the datasets used for evaluation over IBK.

Table 5

Avg. runtime (in seconds) over 10-fold cross-validation for IBK, ARSkNN with 10 sTrees, ARSkNN with 50 sTrees and ARSkNN with 100 sTrees for distinct values of k (1,3,5,10)

Data Sets	For k = 1				For k = 3
	IBK	ARS-10 Tree	ARS-50 Tree	ARS-100 Tree	IBK	ARS-10 Tree	ARS-50 Tree	ARS-100 Tree
PenDigits	0.55	0.02	0.11	0.25	0.61	0.02	0.09	0.21
Letter Recognition	3.05	0.04	0.24	0.52	3.31	0.04	0.24	0.54
Magic04	1.87	0.04	0.17	0.44	2.04	0.04	0.21	0.48
Statlog_Middle	0.10	0.01	0.03	0.07	0.11	0.01	0.03	0.07
Statlog	0.29	0.02	0.12	0.45	0.36	0.02	0.13	0.34
Covertype_Small	0.49	0.04	0.24	0.61	0.73	0.04	0.23	0.53
(a)				(b)
Data Sets	For k = 5				For k = 10
	IBK	ARS-10 Tree	ARS-50 Tree	ARS-100 Tree	IBK	ARS-10 Tree	ARS-50 Tree	ARS-100 Tree
PenDigits	0.64	0.02	0.10	0.20	0.70	0.02	0.09	0.18
Letter Recognition	3.22	0.04	0.24	0.54	3.59	0.04	0.23	0.54
Magic04	2.24	0.03	0.19	0.46	2.54	0.03	0.26	0.49
Statlog_Middle	0.11	0.01	0.04	0.07	0.12	0.00	0.03	0.06
Statlog	0.35	0.02	0.11	0.32	0.40	0.02	0.10	0.27
Covertype_Small	0.92	0.05	0.23	0.46	0.90	0.05	0.22	0.49
(c)				(d)

Note: Boldface values represent the worst runtime achieved in the dataset.

The results on avg. runtime (in seconds) as shown in Table 5 are also graphically represented in Figs. 5–8. In these figures, a comparison of IBK, ARSkNN with 10 trees, ARSkNN with 50 trees and ARSkNN with 100 trees over 10-fold cross validation for all six datasets has been illustrated for avg. runtime (in seconds).

Fig.5

Avg. Runtime (in seconds) of classifiers for k = 1.

Fig.6

Avg. Runtime (in seconds) of classifiers for k = 3.

Fig.7

Avg. Runtime (in seconds) of classifiers for k = 5.

Fig.8

Avg. Runtime (in seconds) of classifiers for k = 10.

Figure 5 shows the result for nearest neighbor (k) as 1. Figure 5 reflects that ARSkNN is taking a significantly less avg. runtime against IBK even with 100 trees upon Pendigits, Letter Recognition, Magic04 and Statlog_Middle datasets. In Statlog and Covertype_Small datasets, the avg. runtime of ARSkNN with 10 trees and with 50 trees are lesser than IBK but the avg. runtime of ARSkNN with 100 trees is more than IBK.

In Fig. 6, nearest neighbors (k) are taken as 3. Figure 6 reflects that ARSkNN is taking a significantly less avg. runtime against IBK even with 100 trees upon all datasets named Pendigits, Letter Recognition, Magic04, Covertype_Small, Statlog and Statlog_Middle.

The value of nearest neighbor (k) is taken as 5 in Fig. 2. As can be deciphered from Fig. 2 that ARSkNN is taking a significantly less avg. runtime against IBK even with 100 trees upon all datasets named Pendigits, Letter Recognition, Magic04, Covertype_Small, Statlog and Statlog_Middle.

The result with nearest neighbor (k) as 10 for all six datasets has been shown in Fig. 8. The ARSkNN significance over IBK in term of less runtime achieved on the Pendigits, Letter Recognition, Magic04, Covertype_Small, Statlog and Statlog_Middle datasets is reflected in the Fig. 8.

6 Results and discussion

This research established the fact that ARSkNN is significantly better as compared to IBK (using Euclidean distance) in accuracy percentage for 4 datasets namely, Magic04, Statlog_Middle, Statlog and Covertype_Small as shown in Table 4. One should not expect ARSkNN to outperform IBK in all datasets because the classifiers accuracy percentage depends upon number of factors, such as whether the attributes of dataset’s domain counterpart similarity measure.

However, ARSkNN was found significantly better than IBK (with Euclidean distance) in avg. runtime for all the datasets. This is because of the fact that rather than computing distance among each training instance and the testing instance in the testing phase, ARSkNN calculate the similarity measure before the testing phase during the modeling of sForest and the testing instance is compared with the similarities in the modeled sForest.

This algorithm will give more accuracy percentage, if more number of samples are taken to build sTrees, but one should not take more than the half of total number of training instances. Generally, it is observed that for ARSkNN, k = 3; number of sTrees=100; and number of samples to build sTree = 5000 gives a better accuracy in most of the datasets. However, it has been observed that the avg. accuracy percentage of the IBK depends upon the learning metric (similarities or dissimilarities), the value of k and the characteristic of datasets.

The similarities and differences between ARSkNN and traditional kNN classifier are discussed below. Both classifiers utilizing the voting mechanism for deciding the classes of testing instances, after finding the k-nearest neighbors and also for both classifiers, there is need to decide the proper value of k. The fundamental difference between both of them is that the traditional kNN classifier uses (dis) similarity measures which have their base as distance calculation, whereas ARSkNN uses a similarity measure known as Massim which has mass estimation as its base.

There are some applications of tree structures (e.g. k-d Tree) for indexing in order to increase the nearest neighbor search. The reader should not confuse with the purpose of the sTrees, that they are used to hold similarity measures and not for indexing, because sTrees are balanced binary trees so the indexing techniques can be used to decrease the time complexity of sTrees.

7 Conclusion

This paper well establishes the empirical and experimental foundation of ARSkNN, as an efficient approach of kNN classifiers. ARSkNN represents a paradigm shift from distance based kNN to Massim based kNN. ARSkNN overcame the key drawback of kNN classifiers, i.e., to keep all training instances in memory for the sake of classifying a single testing instance. Our results verified and established the fact that ARSkNN has better accuracy percentage and better avg. runtime than kNN.

There are possibilities of extending the proposed algorithm.

one can utilize other distance metrics as of Euclidean distance for comparison of ARSkNN with kNN.

ARSkNN can also be applied in various datasets from other domains like document classification, geostatistics etc.

the efficiency of ARSkNN will be checked upon parameters like sensitivity (recall), F-measure, balanced classification ratio (BCR).

References

Aha

D.W.

Lazy learning, chapter Lazy Learning, pp. 7–10. Kluwer Academic Publishers, Norwell. MA, USA, 1997.

Alimoglu

and Alpaydin

, Methods of combining multiple classifiers based on different representations for penbased handwritten digit recognition, In. Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (TAINN 96.Citeseer1996.

Baoli

, Qin

and Shiwen

, An adaptive k-nearest neighbor text categorization strategy, ACM Transactions on Asian Language Information Processing (TALIP)3(4) (2004), 215–226.

Blackard

J.A.

and Dean

D.J.

, Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables, Computers and Electronics in Agriculture24(3) (1999), 131–151.

Bock

R.K.

, Chilingarian

, Gaug

, Hakl

, Hengstebeck

Th.

, Jiřina

, Klaschka

, Kotrč

, Scký

, Towers

, ., Methods for multidimensional event classification: A case study using images from a cherenkov gamma-ray telescope. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment516(2) (2004), 511–528.

Broelemann

, Jiang

, Mukherjee

and Chowdhury

A.S.

, Variants of k-regular nearest neighbor graph and their construction, Information Processing Letters (2017).

Cha

S.-H.

, Comprehensive survey on distance/similarity measures between probability density functions, City1(2) (2007), 1.

Chen

, Ma

and Zhang

, On the similarity metric and the distance metric, Theoretical Computer Science410(24) (2009), 2365–2376.

Chen

Y.-S.

, Hung

Y.-P.

, Yen

T.-F.

and Fuh

C.-S.

, Fast and versatile algorithm for nearest neighbor search based on a lower bound tree, Pattern Recognition40(2) (2007), 360–375.

10.

Chien

J.-T.

and Wu

C.-C.

, Discriminant wavelet faces and nearest feature classifiers for face recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence24(12) (2002), 1644–1649.

11.

Choi

S.-S.

, Cha

S.-H.

and Tappert

C.C.

, A survey of binary similarity and distance measures, Journal of Systemics, Cybernetics and Informatics8(1) (2010), 43–48.

12.

Cover

and Hart

, Nearest neighbor pattern classification, IEEE Transactions on Information Theory13(1) (1967), 21–27.

13.

Francois

, Wertz

and Verleysen

, The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering19(7) (2007), 873–886.

14.

Frey

P.W.

and Slate

D.J.

, Letter recognition using holland-style adaptive classifiers, Machine Learning6(2) (1991), 161–182.

15.

Gagné

and Parizeau

, Coevolution of nearest neighbor classifiers, International Journal of Pattern Recognition and Artificial Intelligence21(05) (2007), 921–946.

16.

Gao

Q.-B.

and Wang

Z.-Z.

, Center-based nearest neighbor classifier, Pattern Recognition40(1) (2007), 346–349.

17.

Hall

, Frank

, Holmes

, Pfahringer

, Reutemann

and Witten

I.H.

, The weka data mining software: An update, ACM SIGKDD Explorations Newsletter11(1) (2009), 10–18.

18.

Kumar

, Bhatnagar

and Srivastava

, Arsknn-a k-nn classifier using mass based similarity measure, Procedia Computer Science46 (2015), 457–462.

19.

S.Z.

and Lu

, Face recognition using the nearest feature line method, IEEE Transactions on Neural Networks10(2) (1999), 439–443.

20.

Noh

Y.-K.

, Zhang

B.-T.

and Lee

D.D.

, Generative local metric learning for nearest neighbor classification. In Advances in Neural Information Processing Systems (2010), pp. 1822–1830.

21.

Omercevic

, Drbohlav

and Leonardis

, High-dimensional feature matching: Employing the concept of meaningful nearest neighbors, In 2007 IEEE 11th International Conference on Computer Vision (2007). pp. 1–8. IEEE.

22.

Qamar

A.M.

, Gaussier

, Chevallet

J.-P.

and Lim

J.H.

, Similarity learning for nearest neighbor classification. In 2008 Eighth IEEE International Conference on Data Mining2008, pp. 983–988IEEE.

23.

Samet

, K-nearest neighbor finding using maxnearest-dist, IEEE Transactions on Pattern Analysis and Machine Intelligence30(2) (2008), 243–252.

24.

Samworth

R.J.

., Optimal weighted nearest neighbour classifiers, The Annals of Statistics40(5) (2012), 2733–2763.

25.

Shalev-Shwartz

, Singer

and Ng

A.Y.

, Online and batch learning of pseudo-metrics. In Proceedings of the twenty-first international conference on Machine learning (2004) pp. 94. ACM.

26.

Skopal

and Bustos

, On nonmetric similarity search problems in complex domains, ACM Computing Surveys (CSUR)43(4) (2011), 34.

27.

Theodoridis

, Koutroumbas

Pattern recognition, academic press. New York, 1999.

28.

Ting

K.M.

, Washio

, Wells

J.R.

, Liu

F.T.

and Aryal

, Demass: A new density estimator for big data, Knowledge and Information Systems35(3) (2013), 493–524.

29.

Ting

K.M.

, Zhou

G.-T.

, Liu

F.T.

, Tan

J.S.C.

Mass estimation and its applications. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 2010, pp. 989–998. ACM.

30.

Ting

K.M.

, Zhou

G.-T.

, Liu

F.T.

and Tan

S.C.

, Mass estimation, Machine Learning90(1) (2013), 127–160.

31.

Ting

K.M.

, Fernando

T.L.

, Webb

G.I.

Mass-based similarity measure: An effective alternative to distance-based similarity measures. Technical report, Technical report 2013/276, Calyton School of IT, Monash University, Australia, 2013.

32.

Vaidya

P.M.

, Ano (n logn) algorithm for the all-nearestneighbors problem. Discrete & Computational Geometry4(2) (1989), 101–115.

33.

Wang

, Neskovic

and Cooper

L.N.

, Improving nearest neighbor rule with a simple adaptive distance measure. Pattern Recognition Letters28(2) (2007), 207–213.

34.

Weinberger

K.Q.

and Saul

L.K.

, Distance metric learning for large margin nearest neighbor classification, Journal of Machine Learning Research10(Feb) (2009), 207–244.

35.

Xiong

Y.-J.

, Lu

and Wang

P.S.P.

, Off-line text-independent writer recognition: A survey, International Journal of Pattern Recognition and Artificial Intelligence31(05) (2017), 1756008.

36.

Zezula

, Amato

, Dohnal

, Batko

Similarity search: The metric space approach, volume 32. Springer Science & Business Media, 2006.

37.

Zheng

, Zhao

and Zou

, Locally nearest neighbor classifiers for pattern classification, Pattern Recognition37(6) (2004), 1307–1309.