Abstract
Motivated by the idea of cross-validation, a novel instance selection algorithm is proposed in this paper. The novelties of the proposed algorithm are that (1) it cross selects the important instances from the original data set with a committee, (2) it can deal with the problem of selecting instance from large data sets. We experimentally compared our algorithm with five state-of-the-art approaches which are CNN, ENN, RNN, MCS, and ICF on 3 artificial data sets and 6 UCI data sets, including 4 large data sets, ranking from 130K to 4898K in size. The experimental results show that the proposed algorithm is very efficient and effective, especially on large data sets.
Introduction
Instance selection also named sample selection is to select a small representative subset from original data set by removing the redundancy instances. In the framework of classification, the purpose of instance selection is to reduce computational complexity of classification algorithm without degenerating its classification accuracy. Since Hart’s seminal work (i.e. CNN) [1], many instance selection algorithms have been proposed by different researchers. CNN attempts to find a minimal consistent subset (MCS) of the training set. A consistent subset S of a training set T correctly classifies every instance in T with the same accuracy as T itself [2]. CNN algorithm can ensure that all instances in T are classified correctly by S. However, it does not guarantee that S is a MCS. In addition, CNN is especially sensitive to noise, because noisy instances will usually be misclassified by their neighbors, and thus will be retained [3]. The reduced nearest neighbor (RNN) rule proposed by Gates [4] starts with S = T and removes each instance from S if such a removal does not cause any other instances in T to be misclassified by the instances remaining in S. RNN is computationally more expensive than CNN. The selective nearest neighbor rule (SNN) proposed by Ritter [5] improves CNN and RNN by ensuring that a MCS can be found. SNN is much more complex and the computational time is significantly greater than CNN and RNN. Based on the relative significance of the instances in the training set, Dasarathy proposed an algorithm which can identify MCS [6]. The editing nearest neighbor (ENN) rule proposed by Wilson [3] employs the so called editing rule to remove noisy instances in the training set. The rule is that all instances which are incorrectly classified by their nearest neighbors are assumed to be noisy instances. Based on the concepts of coverage and reach ability, the iterative case filtering (ICF) algorithm was introduced in [2]. The reachable set depends on the distance of an instance from its nearest enemy, and the coverage set of every instance is the list of its associates.
Recently, some new instance selection algorithms were developed by different authors. Nikolaidis et al. proposed a class boundary preserving algorithm [7], which discards center instances while it retains a suitable number of border patterns. Based on the concept of Voronoi cells and enemies, Angiulli proposed a fast nearest neighbor condensation (FCNN) algorithm [8] for large data sets classification. The author claimed that FCNN is order independent, its worst-case time complexity is quadratic, and it is likely to select points very close to the decision boundary. Based on the idea of so-called chain which is a sequence of nearest neighbors from alternating classes, Fayed et al. presented a template reduction algorithm [9], the authors make the point that patterns further down the chain are close to the classification boundary. Li presented a critical pattern selection algorithm by considering local geometrical and statistical information [10]. This algorithm selects both border and edge patterns from the data set.
Most of the instance selection algorithms are tailored for nearest neighbor classifier, so the instances selected with these algorithms are often only suitable for nearest neighbor classifiers. In addition, the computational complexities of these algorithms are also very high, for large data sets some algorithms are impracticable. In order to deal with this problem, motivated by the idea of cross validation, we propose a novel instance selection algorithm, which cross selects the important instances from the original data set with a committee. For large data sets, considering the learning efficiency, we use the single-hidden-layer feed-forward neural networks (SLFNs) trained with extreme leaning machine (ELM) [11] as classifiers in the proposed algorithm. ELM has very fast learning speed and very good generalization ability, and it has been successfully applied in function approximation [12, 13], pattern recognition [14, 15], big data classification [16–18], two comprehensive survey on ELM can be found in [19, 20].
The paper is organized as follows. Some related notions and theoretical background are given in Section 2. The proposed methods are presented in Section 3. Experimental results and analysis are presented in Section 4. Section 5 concludes the paper.
Preliminaries
ELM is an efficient and practical learning algorithm used for training the single-hidden-layer feed-forward neural networks (SLFNs) [11]. In ELM the input weights and the hidden layer biases can be chosen randomly, the output weights can be analytically determined with Moore-Penrose generalized inverse [21] of the hidden layeroutput matrix. Unlike other gradient-descent based learning algorithms (such as Back Propagation algorithm [22–24]), ELM does not require iterative techniques to adjust input weights and hidden layer biases during training process. ELM can overcome many drawbacks of the traditional gradient-based learning algorithms such as local minimal, low learning speed by randomly selecting input weights and hidden layer bias [19, 20].
Given a training data set, D = {(x
i
, y
i
) |x
i
∈ R
d
, y
i
∈ R
k
, i = 1, 2, ⋯ , n}, where x
i
is a d × 1 input vector and y
i
is a k × 1 target vector, a SLFN with m hidden nodes is formulated as
The ELM Algorithm [11] is presented as follows.
For the trained SLFN, the outputs are transformed into the interval (0, 1) with softmax function [22], the transformed results can be viewed as posterior probability p (w
k
|x
i
), where w
k
is the kth class, and x
i
is the ith input vector. The p (w
k
|x
i
) is calculated as:
In this section, we first present the idea of the proposed algorithm, and then present the proposed algorithm.
The idea of the proposed algorithm
The idea of the proposed algorithm is illustrated as Fig. 1. We firstly partition the data set into n disjoint subsets, for each subset S i (i = 1, 2, ⋯ , n), we use a committee B to select the important instances from S i . The committee B consists of n - 1 SLFNs which are trained on other n - 1 subsets. Fig. 2 shows an example of the algorithm for 1 round, the data set used in this example includes 50 instances.
The K-L divergence [26] is employed to measure the significance of the samples, the K-L divergence also called relative entropy is a measure of distance between two distributions [26]. Let x be a discrete random variable, p (x) and q (x) are its two probability mass functions, the definition of K-L divergence is given as follows [26].
Let SLFN1, SLFN2, ⋯, SLFN
n
are the members of the committee B. The informative instances are selected with the following criteria.
The proposed algorithm named ELM-KL is described as follows.
In the ELM-KL algorithm, V a (S) denotes the validation accuracy of a classifier which is trained on data set S. We also developed another instance selection algorithm named ELM-KL-ALL, which is similar to ELM-KL, the only difference is that ELM-KL-ALL does not introduce validation set, the trained ELM classifier are validated on training set itself.
Experiments and the analysis of the experimental results
Experiments settings and the experimental results
The effectiveness of our proposed method is verified through numerical experiments in the environment of Matlab 7.0 on a Pentium 4 PC. In our experiments we totally select 3 artificial data sets and 6 UCI data sets [27]. The 3 artificial data sets are mainly used to verify the feasibility of the proposed algorithm. The 6 UCI data sets are used to verify the effectiveness and efficiency of the proposed algorithm, we select data sets letter and shuttle because that they contain many categories, the number of classes of the two data sets are 27 and 7 respectively, we select data sets MiniBooNE, skin, artificial_2 and cod_rna because that they are large data sets. The basic information of the 6 UCI data sets is listed in Table 1.
The first artificial data set is two-dimensional concentric data with two classes which are uniform concentric circular distributions [28]. The points of the class ω1 are uniformly distributed into a circle of radius 0.3 centered on (0.5, 0.5). The points of the class ω2 are uniformly distributed into a ring centered on (0.5, 0.5) with internal and external radii equal to 0.3 and 0.5, respectively.
The second artificial data set is two-dimensional cloud data with two equal priori probable classes [28]. The class ω1 is the sum of three different Gaussian distributions:
Where, x = (x1, x2), and
The class ω2 is a single Gaussian distribution:
The third artificial data set is a three-dimensional Gaussian data denoted Gaussian with four classes ω i (i = 1, 2, 3, 4), the distribution of ω i is p (x|ω i ) ∼ N (μ i , Σ i )
Where,
The basic information of the 3 artificial data sets is listed in Table 2. In our experiments, the values of the attributes of all the data sets are normalized into [–1, 1]. Two thirds of the instances are used as training set, and one third of the instances are used as testing set.
The performances of the proposed algorithms ELM-KL and ELM-KL-ALL are compared with original ELM (ORI in short) and ELM-NN on 4 aspects: the number of selected instances, the optimal number of hidden nodes of SLFNs, the testing accuracy, and the training time. ELM-NN is another instance selection algorithm proposed in our previous work [29]. The experimental results on 3 artificial data sets are listed in Table 3, the experimental results on 6 UCI data sets are listed in Table 4.
From the experimental results listed in Tables 3 and 4, we can find that although the testing accuracies of the proposed algorithm trained on the selected subset are lower than the ones trained on the whole data set on all ten data sets, the difference of the experimental results are slight, and the requirements of memory space and the training time are much less than the ones needed on the whole data set. In the framework of the competence preservation, the proposed algorithm is very effective and efficient. The numbers of optimal hidden nodes in Tables 3 and 4 are determined by using the method proposed in our previous work [30]. The curves described the relationship between the testing accuracy and the number of hidden nodes on the subsets selected from the 3 artificial data sets, and the 6 UCI data sets with ELM-EN, ELM-KL, ELM-KL-ALL and original ELM (ORI in short) are shown in Fig. 3 and Fig. 4 respectively.
We also compare our algorithm with five state-of-the-art approaches CNN, ENN, RNN, MCS, and ICF. The comparisons of performances on 3 artificial data sets are listed in Tables 5 to 7. The comparisons of performance on the 6 UCI data sets are listed in Tables 8 to 13. where “-” means that the results cannot be obtained. Compared with CNN, ENN, and RNN, our algorithm removes much more instances while obtaining the similar accuracies. Although ICF and MCS have lower selected ratio than our algorithm, they run much slowly than ours. Our algorithm has a compromise between time and selected ratio. What’s more, four classical algorithms ENN, RNN, ICF, MCS do not work out results on the two largest data set selected in our experiments, i.e. artificial_2 and cod_rna. Two classical algorithms MCS and ICF do not work out results on data sets Shuttle and MiniBooNE. RNN does not work out results on artificial data set gaussian. Our proposed algorithms can work well on large data sets.
In this section, we present a theoretical analysis of the computational time complexity of the proposed algorithm ELM-KL to specify the possible reasons of the results of the experiment. The ELM-KL (algorithm 2) consists of 13 steps, it is obviously, the computational complexity of the step 1, 2 and 3 are O (n), O (n) and O (1) respectively. The step 4 of algorithm 2 is actually to train p SLFN with ELM algorithm, it is well known that the main computational cost of ELM comes from the calculation of the Moore-Penrose generalized inverse of hidden layer output matrix H. Huang et al. [11] pointed out that, when the n training samples are distinct, the hidden-layer output matrix H is column full rank with probability one, so the ELM can be solved as a full-rank least-square problem [31]. For such a problem, some methods, such as orthogonal project, Householder triangularization, and Gram-Schmidt orthogonalization, may be used to solve it, with computational time complexity O (m2n) [21]. Hence, the computational time complexity of ELM algorithm is O (m2n), where m is the number of attributes. When m ⪡ n, the computational time complexity of ELM can be thought to approach O (n) [31]. Accordingly, the computational complexity of the step 4 is O (pn). It is easy to find that the computational time complexity of step 5, 6, 7 and 8 are O (n), O (kn), O (p) and O (lp), where k is the number of classes. Obviously, the computational time complexity of step 9, 10, 11, 12 and 13 are O (lp), O (1), O (1), O (1) and O (1) respectively. Hence, the computational time complexity of the proposed algorithm is . Generally, the number of classes k and the number of selected instances in a loop l are far less than the number of instances of training set n, so the computational time complexity of the proposed algorithm is O (pn).
The computational time complexity of CNN, ENN, RNN, MCS, and ICF are O (n2), O (qn2), O (n3), O (n3) and O (qn2) [32], where q is the nearest numbers. For convenient comparison, the computational time complexities of the 6 algorithms are summarized in Table 14.
Based on the above analysis, it can be seen from Table 14 that the computational time complexity of ELM-KL is the minimum among the 6 algorithms. The consequence of theoretical analysis justify the experimental results, for example, it can be seen from Tables 12 and 13 that four classical algorithms ENN, RNN, ICF, MCS do not work out results on the two largest data set selected in our experiments, i.e. artificial_2 and cod_rna. The possible reasons is that the computational time complexity of the four classical algorithms are to high to work out results on large data sets.
Conclusions
Motivated by the idea of cross-validation, in this paper, we proposed two instance selection algorithms which are practicable in large data sets, while some classical algorithms (e.g. ENN, RNN, ICF, MCS) are impracticable. Further more, the proposed algorithms have two advantages: fast leaning speed and low selected ratio. The low selected ratio is achieved by discarding the new selected instances which cannot increase the testing accuracy in each loop. The fast leaning speed is due to the choice of extreme learning machine. The experimental results have verified that the proposed algorithms are much more feasible and effective than five state-of-the-art approaches CNN, ENN, RNN, MCS, and ICF. There are two problems related to our work are worth further investigating, the one problem is how to deploy the proposed algorithms to cloud computing environment, such as, Hadoop MapReduce environment? The other problem is how to scale the proposed algorithms to imbalanced data sets? especially imbalanced big data.
Acknowledgments
This research is supported by the national natural science foundation of China (61170040, 71371063), by the natural science foundation of Hebei Province (F2013201110, F2013201220), by the Key Scientific Research Foundation of Education Department of Hebei Province (ZD20131028) and by the Opening Fund of Zhejiang Provincial Top Key Discipline of Computer Science and Technology at Zhejiang Normal University, China.
