Abstract
Data reduction can increase generalization abilities of the learning model and shorten learning time. It can be particularly helpful in analyzing big data sets. This paper focuses on the machine learning from examples with data reduction. In the paper data reduction is carried out by selection of relevant instances, called prototypes. The discussed approach bases on the assumption that the selection of prototypes is carried-out by a team of agents and that the prototype instances are selected from clusters of instances under the constraint that from each cluster a single prototype is obtained. For cluster initialization the kernel-based fuzzy clustering algorithm is used. Main feature of the proposed approach is integrating data reduction with the stacking technique. Stacked generalization assures diversification among prototypes, and hence, base classifiers. To validate the proposed approach we have carried-out computational experiment. We have also evaluated experimentally the influence of the clustering method and the number of stacking folds used, on the classification accuracy.
Introduction
In the contemporary world quantity of data produced daily by various information systems can be, by some estimates, measured in zetabytes. Since traditional techniques of the analytical processing are not fit to effectively deal with such massive datasets, their owners more and more often require applying data mining or machine learning techniques enabling discovery and extraction of yet undiscovered knowledge and useful patterns. Such a need has motivated, in the recent years, an exceptional global research effort devoted to finding new techniques, methods and approaches bringing machine learning and data mining closer to the successful real-life applications.
Thus, mining information and knowledge from huge and big data sources has been recognized as an important research area. Main tool for such mining remain machine learning algorithms. Research work in the field of machine learning have resulted in development of numerous approaches and algorithms for classification problems [42]. One of the recent focuses of such research include methods of selecting relevant information to be used within the learning process. It is obvious that removing some instances from the training set reduces time and memory complexity of the learning process [40]. Data reduction performed without losing extractable information is also considered as an approach to increasing effectiveness of the learning process when the available data sets are large [22]. Data reduction is perhaps the most critical component in retrieving information from big data (i.e., petascale-sized data) in many data-mining processes [45].
This paper focuses on the machine learning from examples with data reduction. Data reduction is carried out by selection of relevant instances, called prototypes. The discussed approach bases on the assumption that the selection of prototypes is supported by a team of agents and that the prototype instances are selected from clusters of instances under the constraint that from each cluster only a single prototype is obtained.
For increasing accuracy of the machine learning, it is proposed to apply one of the multiple modeling techniques. These techniques also enjoy a lot of attention from the machine learning community. Generally, the aim of the multiple model approaches is the integration of the multiple learning models into one ensemble classification system. Main strategy for ensemble systems construction is to combine outputs of the different classification models, known as base classifiers, such that their combination outperforms each of the single classifiers. To achieve this goal it is required that individual classifiers are accurate in some sense and diverse [32]. Techniques for assuring diversification among base classifiers include using different learning algorithms or using different subsets of the original dataset. One of such techniques is stacked generalization or stacking [7]. In stacking diversity of the ensemble is achieved by generating base learners using different training parameters, including different training sets.
The paper extends earlier research results proposed in [13], where the similarity-based algorithm was used for cluster initialization. Now we suggest an alternative method for cluster initialization based on the kernel-based C-means clustering algorithm, and propose and evaluate through computational experiment the new version of the approach dedicated for learning from examples with instance selection. Main feature of the approach, is using the stacking technique to assure diversification among base classifiers – members of the ensemble. Base classifiers are trained using prototypes selected from clusters produced by the kernel-based fuzzy clustering algorithm. It is claimed that the proposed algorithm is competitive comparing with other approaches to learning from examples with data reduction proposed earlier by the authors (see [8–10]) and, especially, with the algorithm presented in [13], as well as to other machine learning systems devoted to classification. To validate the approach, an extensive computational experiment has been carried out using several benchmark datasets from UCI repository [3].
The paper is organized as follows. In Section 2 the problem of learning from example with data reduction is described. This section also contains short review of current instance reduction techniques. Section 3 includes brief description of the stacked generalization. Section 4 provides information on the kernel-based fuzzy c-means clustering algorithm. Details of the proposed approach are described in Section 5. Section 6 discusses computational experiment plan and results. The experiment aimed at validating the proposed approach. Finally, the last section contains conclusions and suggestions for future research.
Learning from examples with data reduction
The main goal of the data reduction techniques is to decrease the quantity of information required to learn a high quality classifiers. There are three approaches to this: by selection of instances, by selection of attributes/instances or by simultaneous reduction in both dimensions [4].
The paper focuses on data reduction problem through instance selection. Instance selection is considered especially useful as a mean to increasing effectiveness of the machine learning process when the available datasets are large, since overcoming storage and complexity constraints might become computationally very expensive [8, 38].
Finding a small set of representative instances, called patterns, prototypes or reference vectors, in case of a large datasets can result in producing the classification model superior to one constructed from the whole original and massive dataset. Besides, data reduction may help to avoid working with the whole original dataset all the time [48]. It is also obvious that removing some instances from the training set reduces time and memory complexity of the learning process [40].
Instance selection is also considered as an important step towards increasing effectiveness of the learning process when the available datasets are distributed and when the access to data is limited and costly. Selecting the relevant data from distributed locations and then moving only the local patterns can eliminate or reduce restrictions on a communication bandwidth. It also saves time, reduces costs of data shipping, speeds up the distributed learning process, and enables users to deal with larger datasets using minimal resources available at hand [28, 45]. Thus, instance selection process often becomes a critical element of the data mining in its quest to retrieve meaningful information from datasets. Instance selection has become a crucial technique for big data analysis [31].
Considering application of the data reduction technique to reinforce the learning process, one aims at finding out or constructing a set of prototypes representing a subset of the original training set. However, the set should assure the performance of a classifier, built based on prototypes included in the reduced set, better or at least not worse than a classifier built using the original dataset [26].
More formal, the instance reduction problem can be defined as a task of removing a number of instances from original data set D and thus producing the reduced training set S to optimize some criterion or criteria.
With respect to the problem of learning from data, when data reduction process is carried out, the task of the learner L is to output the hypothesis h ∈ H optimizing performance criterion F using dataset S which is a subset of the set D, such that |S| < |D| (ideally S = S opt ), and where H is the hypothesis space, i.e. a set of all possible hypotheses, that the learner can draw. Hence, the learner can be also viewed as a logical predicate that may be true or false with respect to the considered data, or a function that maps instances into, for example, category (data class) space.
In the last years, several approaches to instance selection have been proposed (see, for example, [17, 23, 40]). Usually, instance selection algorithms are based on distance calculation between instances in the training set [5]. Methods based on other approaches, known as the instance-based methods, remove an instance if it has the same output class as its k nearest neighbours, assuming that all neighbourhood instances will be, later on, correctly classified [38]. Both approaches have several weaknesses. They often use distance functions that are inappropriate or inadequate for linear and nominal attributes [5, 38]. Besides, there is a need to store all the available training examples in the model. To eliminate the above, several approaches have been proposed, including, for example the condensed nearest neighbour (CNN) algorithm [19],the instance-based learning algorithm 2 (IB2) [1], theinstance-based learning algorithm 3 (IB3) [1], theselective nearest neighbour (SNN) algorithm [29], the edited nearest neighbour (ENN) algorithm [40], the family of decremental reduction optimization procedures (DROP1-DROP5) [40], and the instance weighting approach [46].
Another kind of algorithm attempting to select the reference instances evaluate the instances of each classes separately and keeps only the densest instances in a given (arbitrary) neighborhood [7].
The other group of methods (including for example the family of four instance reduction algorithms denoted respectively IRA1-IRA4 [7], and the Allk-NN method [36]) try to eliminate unwanted training examples using some removal criteria that need to be fulfilled. The same principle has been applied in [44]. The authors of [44] conclude that if many instances of the same class are found in the analyzed area, and when this area does not include instances from the other classes, then an unknown instance can be correctly classified when only selected reference vectors from this area are used.
The above reasoning also leads to approaches, where instances situated close to the center of the cluster of similar instances should be selected as the reference vectors (see, for example [2, 41]). Such an approach requires using some clustering algorithms like, for example, k-means or fuzzy k-means algorithm [16]. These algorithms generate cluster centroids, from which the reduced dataset is produced. In such case, a good reduced dataset can be obtained if the centroids are “good” representatives of clusters in the data.
In [24], the authors propose three methods for instance selection that are based on the notion of local sets, i.e. the Local Set-based Smoother (LSSm), the Local Set-based Centroids Selector method (LSCo) and the Local Set Border Selector (LSBo). The LSSm removes instances, which do not guarantee the correct classification more instances than have been correctly classify so far. The LSCo, at first step, applies LSSm for removing noises and then run the clustering process for identifying clusters in the dataset. Finally, the algorithm keeps in S only the centroids of the obtained clusters. The LSBo also applies LSSm for removing noises and is based on clustering of instances after the noises removing process. Finally, each instances from D are verified regarding to its representation in S to ensure its correct classification. A feature of the LSBo is its ability to assure a good balance between reduction and accuracy.
Another approach to instance selection based on clustering has been proposed in [8]. The idea is to select instances from clusters and each instance has a chance to be selected as the reference instances. Another approach is to consider the so-called candidate instances situated close to the centre of clusters and then to select the prototypes using the classification accuracy as a criterion [10]. In the above approaches clusters are produced using the clustering algorithm based on the similarity coefficient, originally proposed in [12]. In both cases the agent-based population learning algorithm (PLA) has been applied to support and improve instance selection. In this paper, a new version of the algorithm is proposed, where the PLA is also used for instance selection from clusters, however for initializing clusters the kernel-based fuzzy C-means is applied. Details of the clustering algorithm are discussed inSection 4.
Stacked generalization
Stacking, introduced by Wolpert in [43], has been proposed as a technique for generating an ensemble of heterogeneous classifiers. The approach was proposed as a tool for achieving diversity of base classifiers in the ensemble with a view to increase accuracy of the machine classification.
In general, stacking produces an ensemble of classifiers in which: base classifiers are trained using different
training parameters, outputs of the
base classifiers are combined using some meta-classifier.
In the standard stacking algorithm, at first q different instance subsets of equal size are generated randomly. The subsets are generated assuring similar relative proportion of the different classes as in the original dataset. Omitting one of the subsets in each iteration the so-called level-0 classifiers are generated from the remaining subsets. The process is repeated q times following the pattern of the q-fold cross-validation procedure. At each iteration the omitted subset of instances is used to generate the so called level-1 set of instances. Thus, level-0 models produce vector of predictions that form the input to the level-1 model, which in turn, predicts class label for new instances with unknown class labels. In the approach, the meta classifier in the form of relative weight for each level-0 classifier is created by assigning weights to classifiers proportionally to their performance. The above schema for learning meta classifier, proposed by Wolpert [43], has a form of the so-called leave-one-out cross-validation.
The stacking approach to combining classifiers works as follows: Suppose that there are
q different learners L1, …,
L
q
and q different training
sets – D1, …, D
q
,
where
D = D1 ∪ D2 … ∪ D
q
.
It is also assumed, that each learner induced from training sets
D1, …, D
q
respectively, outputs hypotheses h1, …,
h
q
, where q is the number
of base classifiers and
∀hi:i=1,…,q ∈ H.
The goal of stacking is to learn a good combined classifier h such that the
final classification will be computed from
h1 (x) , …,
h
q
(x) as shown in the
Equation (1):
There are several other possibilities to implement different strategies for combining base classifiers. In this paper, the output of the meta classifier is obtained through majority voting scheme.
There are also several related studies where several different stacking variants have been proposed and investigated. A broad overview of the stacking can be found in [43], where different variants of stacking and several example application domains have been discussed. The review identifies also a number of strengths and weaknesses of the technique emphasizing that stacking is an interesting alternative for generating effective classification ensembles.
This paper is inspired by the approach based on stacking technique proposed by Skalak in [34]. In [34] it has been proposed to select a few prototypes per class to obtain the level-0 classifiers, and next use their outputs at the level-1 for generating the meta-classifier.
In the approach proposed in our paper we also suggest the procedure of selecting prototypes through data reduction. Thus, main features of the approach include prototype selection from the induced clusters and applying stacking technique to assure diversification of the level-0 models based on the earlier selected prototypes.
Kernel-based fuzzy C-means (KFCM) was introduced to overcome noise and outliers sensitivity in FCM [25] by transforming input data X into a higher dimensional kernel space via a non-linear mapping Φ which increases the possibility of linear scalability of the instances in the kernel space and allows for fuzzy C-means clustering in the feature space [25].
The kernel method uses the fact that product in the kernel space can be expressed by a
Mercer kernel function K as follows [20, 47]:
This allows to replace the computation of distance in the kernel space by a Mercer kernel function, which is known as a “kernel trick” [25]. One of most commonly used kernel function type is the Gaussian function.
To describe the proposed approach in a formal manner, the following notation is introduced. Let x denote a training instance, N - the number of instances in the original training set D and, n - the number of attributes with the total length of each instance (i.e. training example) equal to n + 1. When the training set includes data of the classification problem, the element numbered n + 1 contains a class label that can take any value from the finite set of class labels C = {c l : l = 1, … k}. Also, let X = {x ij : j = 1, …, n + 1 ; i = 1, …, N} denote the matrix of n + 1 columns and N rows containing values of all instances from the training dataset.
From mathematical point of view, let clusters be characterized by prototypes
c
i
and fuzzy partition matrix
U = (u
ij
) of size
t × N, where
u
ij
is the degree of membership of
x
j
in cluster i and
t denotes the number of clusters. Then the standard constraints for fuzzy
clustering are assumed, which are expressed as:
For KFCM-K clustering it is assumed that prototypes v i are located in the kernel space and centroids further need to be approximated by an inverse mapping c i = Φ-1 (v i ) to the feature space [20].
The KFCM-K algorithm minimizes the following objective function [18, 25]:
Centroids c
i
are approximated in the feature
space minimizing:
Advantage of using kernel functions is the possibility to determine the number of clusters based on significant eigenvalues of a matrix determined by the kernel function applied to feature vectors.
The algorithm of kernel based fuzzy C-means clustering is shown as Algorithm 1.
Let
be a quadratic matrix of size N
l
x
N
l
. Calculate the eigenvalues of matrix (). t
l
:= number of
eigenvalues exceeding δ. Initialize U
l
to random fuzzy
partition satisfying the conditions (4) and with respect to
t
l
.
Update
U
l
according to (7),
(8).
Update centroids according to
(10).
Map input vectors from D according to U l (l = 1, …, k) into t clusters, where .
Let C1, …, C t denote the obtained clusters such that , ∀i≠j:i,j=1,…,tC i ∩ C j = ∅.
Next, from such subsets C1, …, C t the prototypes are selected and the reduced training set S is produced.
This paper deals with the problem of learning from examples using dataset obtained through data reduction. In general, it was proved that the data reduction is computationally difficult combinatorial optimization problem [14, 30]. Thus approximate algorithms including metaheuristics seem to be a practical approach to solving it.
In the paper data reduction problem is solved by the agent-based population learning algorithm, where the specialized team of agents (A-Team) selects prototypes executing various improvement procedures and cooperating with a view to solve the data reduction problem.
Applying the population-based approach with optimization procedures implemented as agents within the asynchronous team of agents (A-Team) for solving data reduction has been proposed in [8, 12]. Agents working in the A-Team achieve an implicit cooperation by sharing the population of solutions, also called individuals, to the problem to be solved. An A-Team can be also defined as a set of agents and a set of memories, forming a network in which every agent remains in a closed loop [35]. All agents can work asynchronously and in parallel. Agents cooperate to construct, find and improve solutions which are read from the shared common memory.
The functionality of the agent-based population learning approach can be defined as the organized and incremental search for the best solution. The pseudo-code of the agent-based population learning approach is shown as Algorithm 2.
Generate the initial population of solutions (individuals) and store them in the common memory
Implement different improvement procedures creating optimization agents
Activate optimization agents, which execute improvement procedures
Read randomly selected
individual from the common memory Execute improvement algorithm Store
individual back in the common memory
Take the best solution from the population as the final result
In the discussed case the aim is to find the most effective classification model based on data reduction, which guarantees maximization of the classification quality criterion or criteria. Thus, A-Team consists of agents which execute several different implemented procedures and cooperate through exchanging feasible solutions of the learning from examples problem. The main feature of the proposed agent-based population learning algorithm (PLA) is its ability to select prototypes forming the compact representation of the original dataset and which are next used to induce classifier.
As it has been already pointed-out, prototypes are selected from clusters of instances. These clusters are constructed at the initial phase of computations from the set of training instances through applying the KFCM algorithm (see Algorithm 1). At the next stage, prototype instances are selected from clusters through the population-based search carried out by the optimizingagents.
After the clusters have become available, potential solutions, forming the initial population, are generated through randomly selecting exactly one single instance from each of the considered clusters. Thus, a potential solution is represented by the set of prototypes, i.e. by the compact representations of the original dataset. A feasible solution to data reduction problem is encoded as a string consisting of numbers of selected reference vectors.
Further on, selection procedure of the representation of instances through population-based search is carried out by the team of optimizing agents. Each agent is an implementation of the local search procedure and operates on individuals. The selection is carried out for each cluster and removal of the remaining instances constitute basic step of the data reduction process. In other words, each optimizing agent tries to improve quality of the received solutions by applying the implemented improvement procedure. An agent, after being supplied with an individual to be improved, explores its neighborhood aiming at finding a new, better solution among its neighbours.
Each solution from the population is evaluated and the value of its fitness is calculated. The evaluation is carried out by estimating classification accuracy of the classifier, which is constructed taking into account instances as indicated by the solution.
To solve the data reduction problem several types of optimizing agents carrying out different improvement procedures have been implemented. Among improvement procedures aiming at improving the current prototype selection there are two local searches, one simple random search and the tabu search.
In each of the above cases solutions that are forwarded to the optimizing agents for the improvement are transferred from the population in accordance with the user defined working strategy specifying rules for choosing and replacing individuals in the common memory.
As it has been already pointed-out, the feature of the proposed approach is using the stacking concept to increase accuracy of the machine learning and especially to assure diversification among members of the ensemble. Thus, from the implementation point of view, the above described process of searching for the best results using the agent-based population learning algorithm is carried out in parallel for q independent data reduction problems. In each stream of such searching, a different dataset is processed and the independent set of the prototypes is selected.
In details, in the presented approach the training data set is split into q training subsets. Next the q - 1 subsets are used as a input data for the agent-based population learning algorithm and are used for forming the initial population of the agent-based algorithm. The process is next repeated q times and in each iteration the representative set of the prototypes is selected and used to induce decision tree as a base model. Each base model is evaluated using the subset omitted at the current iteration. Finally, the ensemble of heterogeneous base classifiers is obtained. The ensemble is used to predict class label of the newly coming examples.
The pseudo-code of the above described process is shown below as Algorithm 3.
Map randomly examples from D into q disjoint subsets D1 … , D q .
Let
D′ = D - D
i
denote q - 1 subsets of D. Map instances from D′ into clusters using the
clustering algorithm – Algorithm 1. Run
the population learning algorithm - Algorithm 2, for prototype selection and generate
a base classifier δ
i
using the omitted
subset D
i
as the test
set.
Let δ1, …, δ q denote the obtained base classifiers forming together the meta classifier for predicting new instances.
Computational experiment
The main aim of the computational experiment was to answer the following questions: Does the proposed
selection procedure and the proposed stacking technique improve classification
accuracy as compared with other approaches? Does number of stacking folds set for the agent-based
data-reduction with the KFCM clustering influence performance of the proposed
classifier?
In particular the reported experiment aimed at answering the question whether the proposed algorithm, called ABDRkfStE (Agent-Based Data-Reduction based on the KFCM with Stacking Ensemble Learning) performs better than other algorithms including ABDRStE (Agent-Based Data-Reduction with Stacking Ensemble Learning) - introduced in [13], ABIS (Agent-Based Instance Selection) - introduced in [8], ABDRE (Agent-Based Data-Reduction with Ensemble) with RM-RR (Random Move and Replace Randomly strategy) and ABDRE with RM-RW (Random Move and Replaces First Worst strategy) - proposed in [9]. The ABDRkfStE has been also compared with other ensemble classifiers, i.e. AdaBoost, Bagging and Random Subspace Method, for which the results have been presented in [9].
Evaluation of the proposed approach and performance comparisons are based on solving several classification benchmark datasets obtained from the UCI Machine Learning Repository [3]. Basic characteristics of these datasets are shown in Table 1.
The computational experiment was repeated several times for 8 different values of stacking folds, set, respectively, from 3 to 10. Each benchmark problem has been solved 50 times, and the experiment plan involved 10 repetitions of the 10-cross-validation scheme. Finally, for each value of the parameter defined stacking folds the reported values of the quality measure have been averaged over all runs.
The C4.5 algorithm has been applied to induce all of the base models for all ensemble classifiers [27].
Population size for each investigated A-Team, following earlier experiment results presented in [9] and [13], was set to 40. The process of searching for the best solution of each A-Team has been stopped either after 100 iterations or after there has been no improvement of the current best solution for one minute of computation. Values of these parameters have been set arbitrarily.
In case of Bagging and Random Subspace Method the size of bags has been set to 50% of the original training set. In case of ABDRE with RM-RR and ABDRE with RM-RW the number of base models has been set to 40.
The experiment results obtained using the ABDRkfStE approach for different values of stacking folds are shown in Fig. 1. Results of the experiment suggest that the stacking generalization, in general, can improve performance quality of the proposed classifiers. Given the fact that, increasing number of the stacking folds means also increasing number of members of the ensemble, which are combined to obtain one decision, the results show that the number of folds can influence classification accuracy. However, It can be also observed, that for a small data sets stacking does not significantly, if at all, improve the classification accuracy (see for example results for Heart and Sonar). An improvement is most likely to be observed for a bigger datasets. This finding is also confirmed through the correlation analysis between the accuracy and number of folds. In case of Heart and Sonar the value of the correlation coefficient amounted to 0.0316 and 0.1156, respectively. For example, for WBC the value of the correlation coefficient equals 0.8142.
The best results for ABDRkfStE (from a range of results presented in Fig. 1) are shown in Table 2. Table 2 also shows results obtained by other ensemble classifiers, some example non-ensemble classifiers, like for example, for C 4.5, as well as the results of other known algorithm for data reduction, i.e. DROP 4. It should be also noted that in case of C 4.5 the presented results have been obtained without data reduction and without any other pre-processing actions. From Table 2 it is easy to observe that in majority of cases data reduction increases accuracy of the classification as compared with cases when classifier is induced using a non-reduced dataset. The proposed agent-based data reduction approach outperforms also the DROP technique and other instance selection methods, i.e. CNN, SNN and IB2.
It can be also observed from Table 2, that ABDRkfStE performance, in term of classification accuracy, is quite good and comparable to the performance of other ensembles classifiers. Especially, the ABDRkfStE proves competitive as compared with the ABDRE algorithm with the RM-RR or RM-RW strategy, both algorithms proposed in the earlier paper of authors [9], AdaBoost, Bagging and Random Subspace Method. It should be also noted that the stacked generalization can be promising mechanism influencing the quality of the classification system and providing a way for diversification of the learning models based on the reduced data.
It is also clear that ABDRkfStE does not outperform ABDRStE. Thus, it is worth noting that ABDRStE is based on different and apparently more effective clustering method than ABDRkfStE. However, one can draw the conclusion that kernel-based fuzzy clustering can support learning from examples with data reduction and stacking as an alternative approach for clustering as an alternative method to clustering in comparison to clustering based on similarity coefficient referenced in Section 2.
Conclusions
The paper contributes through proposing a novel integrated approach to machine learning from examples with data reduction and stacking. Main result of the reported research is proposing and validating an extended version of the agent-based data reduction algorithm incorporating the stacked generalization mechanism for improving the performance of the resulting classification model. Computational experiment results confirm, that the stacking method assures generation of diversified heterogeneous members of the ensemble, which are subsequently used for forming the meta-classifier.
From the reported experiment it can be also concluded, following a similar finding, (see [13]), that integration of the agent-based population learning approach to data reduction with the stacking is a useful solution allowing to obtain high quality ensemble models for machine learning. However, the reported results shows that clustering method can be crucial and the decision on choosing clustering method can influence learning results.
The experiment, due to its limited scope, allows only for a preliminary validation of the approach. Further research should aim at investigating influence of several independent factors on the ABDRkfStE performance, among them methods for constructing meta-classifier. It is also planned to investigate a wide range of stacking and boosting techniques as tools for diversification of prototypes obtained in the data reduction process.
