Abstract
Under certain situations, researchers were forced to work with small sample-sized (SSS) data. With very limited sample size, SSS data have the tendency to undertrain a machine learning algorithm and rendered it ineffective. Some extreme cases in SSS problems will have to deal with large feature-to-instance ratio, where the high number of features compared to small number of instances will overfit the classification algorithm. This paper intends to solve small sample-sized classification problems through hybrid of random subspace method and random linear oracle ensemble by utilizing binary feature subspace splitting and oracle selection scheme. Experimental results on artificial data indicate the proposed algorithm can outperform single decision tree and linear discriminant classifiers in small sample-sized data, but its performance is identical to k-nearest neighbor classifier due to both shared similar selection approach. Results from real-world medical data indicate the proposed method has better classification performance than its corresponding single base classifier especially in the case of decision tree.
Introduction
Due to the rapid development of modern technologies such as computers, networks, sensors and digital storage systems, data collection has become much cheaper and effortless. Thus, researchers are overwhelmed with massive size of data, which in fact, leads to the popularity of a new research field called the big data analysis [25]. However, not every data sets are made publicly accessible, information such as medical status and industrial process are confidential, resulting very limited sample data available for public researchers. So, the importance of small sample-sized (SSS) data classification should not be ignored [4, 21]. The term “small” is rather subjective. Let the number of instances in a data set be N, and the number of features is M. According to Vapnik, a data set was considered small when the ratio N/M is less than 20 [21].
Ensemble methods have been the saviors for SSS data classification, especially the Bootstrap Aggregating (Bagging) and Random Subspace Methods (RSM) approaches. Bagging utilized random sampling with replacement approach, allowing each classifier to train on randomly generated training sets, providing a much more diverse result. Similar approach had been employed by RSM, except that the data generation is done in the feature space. RSM randomly selects a subset of feature for the classifier to train on, resulting in several classifiers which each expert in different feature space. Thus, RSM have been considered for problems with limited number of samples thanks to its ability to reduce data dimension while maintaining the number of training objects [17, 20].
Even though overall information can be preserved by aggregating many individual classifiers expert in different feature subspaces, there will still be loss of information from the base classifier perspective. Every base classifier in an ensemble will contribute to the final decision made through a method called classifier combination, the process usually involves a majority voting scheme whereby the result from each base classifier will be considered. Base classifiers that suffer from information loss have greater possibilities to undertake wrong judgement, which eventually will provide an impact to the final decision.
This research aims to design an ensemble method that capable in solving SSS classification problems through feature data manipulation approach inspired by hybridization of RSM with Random Linear Oracle (RLO) ensemble, namely Random Subspace Oracle (RSO).
This paper is structured as follow: Section 2 reviews the materials and concepts of learning algorithm. Section 3 gives the working principle of proposed method. Section 4 explains the experimental setup for this work. Section 5 presents the results obtained from experiment and briefly discuss it. This paper ends with Section 6 drawing conclusion to this work.
Materials and concepts
This section reviews the principle of basic classifiers, and structure of RSM and RLO ensemble methods.
Decision tree (DT)
DT algorithm is a flowchart-like tree structure, where each internal node in the tree was represented by a test on a feature, each branch denotes an outcome, and each leaf node contains a class label [22].
The construction of DT model revolves around three main aspects: feature evaluation criterion, leaf creating criterion, and pruning. It is common to describe tree classifier using a graph terminology, starting with a root node which contains all the training samples. If the node is pure enough to satisfy the stopping rules, a leaf holding the dominant class label will be added to the tree. Else, each feature in the data will be examined, and the most discriminative feature will be used to make a split (usually a binary split for continuous-valued problem) according to the best threshold. The process will be repeated on the Left and Right child nodes using data that satisfied respective conditions from the previous test.
With greedy growing approach, DT algorithm is sure to select feature with the most information gain to split on each node, without considering the tests conducted by the other nodes [8]. This approach will lead to a situation where a single feature will be tested several times on different nodes, each with a different threshold value.
Prediction of new data will start from root node of the tree, travel down the path according to the respective feature being tested on each node, and eventually reaching a leaf node. Due to the binary threshold split, the decision boundaries of DT algorithm take the form of lines (or planes/hyperplanes in higher dimensions).
Linear discriminant classifier (LDC)
Discriminant classifier is named after the discriminant function used by the classifier to determine the classification boundaries [3, 16]. Assuming N data samples from one experiment,
Where sample will be assigned with the tag of largest g
i
(
In pattern recognition, each category of feature is referred as a dimension, so the classification boundaries of LDC in two-dimensioned problem will consists of several straight lines depending on the number of classes, several number of planes in three-dimensioned problem, and hyperplanes for higher number of dimensions.
Although with the name of “lazy” algorithm, KNN is said as the simplest and most attractive pattern classification approach [14]. The term “lazy” refers to that learner did not attempt to formulate a discriminative function from the training data, but it “memorizes” the data instead [1]. Doing so will requires a massive memory storage, but the training process ends almost instantly. In [7], KNN classifier outperform any other algorithms that were being compared.
During classification phase, KNN calculates distances between the new data point and all training data. Several training data which are nearest to the new data point will be selected, whereby each data point will provide a vote to determine the final label to be assigned to the new data point. The number of training data selected was controlled by the value K (usually odd numbered), so a KNN algorithm with one nearest neighbor will be called 1-NN, 3-NN for three nearest neighbors, and so on.
Random subspace method (RSM)
RSM is an ensemble method introduced in [24], the author proposed to construct multiple tree classifiers by training on different randomly chosen feature subspace. Due to how easily decision tree algorithm can overfits the training data especially when the sample size is small, RSM was originally proposed to solve this overfitting issue by limiting the number of features provided to the learner. Thus, a fully-grown tree in the ensemble does not requires pruning because the tree overfits only the data subspace provided to it, instead of the whole data space.
Assuming a training set with N number of instances, and M number of features. RSM first decides the number of subspace features, D to be selected for generation of new training subset, where D < M (usually D = M/2). For each subset generated, a base classifier will be applied to train on that subset, thus for an ensemble with size of L, L subsets of feature will be generated, and L number of classifiers will be trained.
The classification stage of RSM was done by the individual classifiers in ensemble, by having each base classifier provides a prediction to the new sample, and a fusion scheme will be used to combine all predictions into one final decision.
Due to the working principles of RSM, it is highly suitable for the use in SSS problem. RSM reduces the dimensionality of the data while maintaining the same number of samples, relatively increases the amount of training samples [19, 20].
Random linear oracle (RLO)
Random Linear Oracle (RLO), introduced by Kuncheva and Rodríguez in year 2007, is a unique ensemble method that combines both classifier fusion and selection approaches [18]. In RLO ensemble method, each classifier in the ensemble is replaced with two mini-ensembles plus a random oracle function that choose between them.
Training of RLO ensemble involved a randomly generated oracle of hyperplane over the whole feature space, separating it into two different subspaces. Two base classifiers will be trained on each subspace respectively, yielding two different classifiers.
Figure 1 illustrate how the RLO method be applied to a 2-class problem. Moreover, several RLO ensembles will be trained on the same feature space, each with a randomly generated oracle, and two classifiers that trained on separated feature subspaces.

RLO ensemble method applied to 2-class problem [10].
When a new sample data comes to test, the location of that data will first be determined by oracle in each ensemble, and corresponding classifier expert in that region will be selected to classify the object. Finally, a fusion scheme will be applied to combine all predictions into one final decision.
This research proposes a hybrid of RSM and RLO ensemble methods and named as random subspace oracle (RSO) due to its feature subspace splitting and oracle selection scheme.
RSM is good in dealing with SSS data sets but in some extreme cases where the number of features can easily exceed the number of instances, massive number of redundant features will also present in the data set. The high number of redundant features will lower the chance of a significant feature being selected during subset generation, rendering the training of classifier ineffective. As well in [2], the authors pointed out that due to effect from random selection, classifiers trained by subsets with low class separability will affect the final decision of whole ensemble because it suffered from loss of information.
To cover up the drawback in random feature selection, the oracle approach in RLO ensemble is implemented in the feature level.
The building of RSO model starts with a split of training set in the feature level, using random selection approach from RSM to create one training subset, another subset will be created with the remaining non-selected features, yielding two subsets with distinctive features but same number of instances. Thus, a total number of M2 / - 2 - 1 different combinations can be formed through this approach.
Each classifier in the ensemble will be replaced with two mini-ensembles according to RLO approach, and each mini-ensemble will be trained on one of the previously created training subsets. This approach allowed every classifier in the ensemble to approach a single problem from two different perspectives.
Since both mini-ensembles expert in different dimensions, the hyperplane oracle selection of classifier based on the data point’s location is not viable. This research will investigate the simplest yet elegant form of dynamic selection through Euclidian’s distance. Benchmark for each mini-ensemble was recorded by the mean point of respective training subset. So, distances are measured from the new data point to benchmarking point of corresponding mini-ensemble. Mini-ensemble with the shortest distance from its benchmarking point to the new data point will be used to carry out classification process.
The fusion approach of RSO ensemble is made by simple majority voting, the final label of new data point is assigned to the class with highest votes. Figure 2 provides the pseudo code of RSO ensemble method in its training and operation phases.

Pseudo code for RSO ensemble training and operation phases.
This section introduces the methodology of this research, discussing the generation of artificial data sets, data manipulation layout, analysis tool, and experiment procedure.
Data generation
This experiment starts with generation of artificial data using Waikato Environment for Knowledge Analysis (WEKA) software [11]. Data sets are generated with RandomRBF generator in WEKA according to properties in Table 1.
Naming and properties of artificial data sets
Naming and properties of artificial data sets
Using artificial data in this experiment is to test the default ability of each algorithm as generated data are noise-free, and the behavior of data can be controlled through generator.
Data sets will be generated with number of features ranging from 200, 500, 1000, 2000, and 5000. Furthermore, for each feature number, ten data sets will be produced, each with different number of instances ranging from 100 to 1000 with an interval of 100, yielding a total of 50 data sets. Class number is fixed at 2.
For clarification, naming of data sets with 200 features will be prefixed with alphabet ‘V’, 500 features will be prefixed with alphabet ‘W’, ‘X’ for data sets with 1000 features, ‘Y’ for data sets with 2000 features, and data sets with 5000 features will be prefixed with ‘Z’. Each prefix will be followed by a value representing its number of instances divided by 100. The class number will be after the name of data set with an ‘_’ separator. For example, data set with name ‘X3_2’ will has 300 instances, 1000 features, and 2 classes. The instance-to-feature ratio will also be given in Table 1.
Furthermore, to properly investigate the performance of algorithms, 4 private medical data sets from [15] were used. Table 2 gives the properties of 4 data sets.
Properties of medical data sets
The “contractions” data set recorded intestinal contractions of patient through wireless capsule video endoscopy. Data set “rds” contains clinical records of 85 premature new-born children with two types of respiratory distress syndrome (RDS), also known as Hyaline Membrane Disease (HMD) and non-HMD. The “laryngeal1” data set extracted 16 features in time, spectral, cepstral domains from 213 normal or pathological voices used for computer decision support system. Finally, “weaning” data set consists of 151 patients suffering from acute respiratory insufficiency on a long-term mechanical ventilation measured once before weaning and once at the start of weaning, so each patient will be an instance in both classes, yielding a total of 302 instances. All four data sets have an instance-to-feature ratio less than 20.
Data manipulation process is here to avoid the circumstances of overfitting or peeking, this research utilized the well-known cross-validation paradigm. K-fold cross-validation divides the data set into K regions using stratified sampling method, with this, it can minimize the bias in data due to random separation of data space [13]. This experiment separates the fold by K = 10, with leave-one-out approach, one region will be reserved as testing set, while the remaining 9 regions were used for training. This process will be repeated for 10 iterations, each with a different fold as the testing set, thus 10 results will be obtained for each algorithm. These recorded results will be used to evaluate the performances of all algorithms throughout one data set. The overall performance of an algorithm can be obtained by averaging all 10 results.
Wilcoxon signed-rank (WSR) test
Wilcoxon signed-ranks (WSR) test introduced in [9] provides a non-parametric pairwise comparison based on ranking of treatments will be used as the analysis tool to draw conclusion on algorithms’ performance. WSR was termed as the non-parametric alternative of paired t-test approach. For each data set, this method assigns a rank to the differences in performances of two algorithms in test, considers only the absolute values. Average ranks will be assigned in case of ties.
In [12], the author recommended that non-parametric tests are preferred over parametric tests in typical machine learning studies, because most results from machine learning experiment are distribution-free in nature and could not fulfil the safe usage of parametric tests.
Assuming classification results of two algorithms over N number of data sets were organized as N × 2 matrix. Given d
i
be the difference between classification accuracies from two algorithms on i-th among all N number of data sets. Let R+ be the sum of ranks when second algorithm performed better than the first algorithm, and vice versa for R-. The sums are calculated by Equations (2 and 3).
As seen from the equations, sum of ranks for tied cases are distributed evenly in both sums. Next, Equation (4) can be used to obtain the z-score of test.
Where T is the smaller sum of ranks among R+ and R-, T = min(R+, R-). If the test successfully rejected null hypothesis, indicating there is a significant different between two algorithms, the signed sums of ranks will be used to determine which algorithm performed better.
The experiment compares RSO ensemble with three distinct single classifiers by having RSO ensemble use the respective single classifier as its base. The size of ensemble was set to 25, as it was stated in [23] that this number of ensemble size provided the largest profit in accuracy.
Data sets were arranged according to number of features, prefixed with distinct alphabets (“V” until “Z”) for easy identification. Experiments were done in batches of five, based on data sets’ prefix.
The performance analysis will be done with the overall classification accuracies using pairwise WSR test between the single classifier and RSO ensemble using that classifier as base by stating the null and alternative hypotheses as:
The test will be carried out with a standard α-value of 0.05. If the resulting p-value from the test was less than 0.05, indicating there is a significant difference in performance in those two algorithms, signed rank should be checked to determine which algorithm has better performance.
This section discusses the results obtained from experiment, and separated them into two components: artificial data, and real-world data.
Artificial data
Table 3 gives the definition of acronyms used in this experiment.
Definition of acronyms
Definition of acronyms
Results of prefix “V” data sets are recorded in Table 4, bolded algorithms’ name indicates better algorithm among the pair when there is a significant different from test results. RSO outperformed single classifier only in two cases, TREE and LDC in term of mean classification accuracy, a stalemate in the case of KNN.
Overall results of prefix “V” data sets
However, based on [6], “no free lunch” (NFL) theorem stating that no search algorithm can outperforms any other algorithms when their performances are averaged across all potential problems. This theorem also holds true in supervised machine learning. So, mean accuracy does not provide a good indication of algorithm performance, WSR test decision should be considered instead.
With p-value as small as 0.0020 and level of significant α= 0.05, RSO ensemble with TREE as base classifier significantly outperformed single TREE classifier. Followed by LDC/RSO_LDC pair with p-value = 0.0313, succeeded in rejecting the null hypothesis, proving that RSO_LDC performed better than single LDC in the case of 200 features data sets.
On the other hand, KNN/RSO_KNN pairs failed to reject null hypothesis, showing that RSO ensembles have similar performance as corresponding single classifiers. The p-value ranged from 0 to 1. KNN/RSO_KNN pairs attained the maximum p-values of 1, indicates that there is exactly no difference between the performance of RSO and its KNN single classifier. Thus, both algorithms will always provide the same classification results.
Prefix “V” data sets contained data with 200 features and distinct number of instances. Since RSO ensemble works by splitting the feature space into two subsets, the number of features available should have an impact to the performance of algorithm. Observing Table 5 where the overall results of prefix “W” data sets (500 features) are presented, the mean accuracies from all algorithms received slight increment except RSO_LDC.
Overall results of prefix “W” data sets
From Table 5, TREE/RSO_TREE pair with p-value of 0.0039 can reject null hypothesis in favor of alternative hypothesis stating that both algorithms are significantly different. The same case happened for KNN/RSO_KNN pair, where their p-value = 1 showing that the performance of RSO ensemble is identical to single KNN classifier. A change in decision happened for LDC/RSO_LDC pair where its p-value = 0.0938 and failed to reject null hypothesis, which majorly caused by the slight decrement in mean accuracy of RSO_LDC.
The overall results of data sets with prefix “X”, “Y”, and “Z” (1000, 2000, and 5000 features) are presented in Tables 6– 8, respectively. All three sets of data arrived at same decision as in Table 5, where only TREE/RSO_TREE pair can reject null hypothesis in favor of alternative hypothesis at 95% confidence level. Further proving the ability of RSO ensemble in improving the performance of single TREE classifier in SSS cases.
Overall results of prefix “X” data sets
Overall results of prefix “Y” data sets
Overall results of prefix “Z” data sets
Finally, to get a generalized performance analysis, all results from 50 data sets were combined, and WSR test was carried out on the combined result. From Table 9, two pairs of algorithms, TREE/RSO_TREE and LDC/RSO_LDC, able to reject null hypothesis in favor of alternative hypothesis with p-value less than the designated α-value. Especially in the case of TREE/RSO_TREE pair, WSR test concluded that RSO_TREE is significantly better than single TREE classifier with possibility of error occurrence as small as 1.88×10–8.
Overall results of all data sets
Next, with p-value of 0.0074, LDC/RSO_LDC pair showed significant difference in performance. The performance of RSO ensemble with LDC as base classifier is relatively unstable than using TREE as base classifier. RSO_LDC outperformed single LDC in prefix “V” data sets (200 features) as shown in Table 4. However, as the number of features increases, the difference between two algorithms getting smaller, and eventually single LDC attained a slightly higher mean accuracy than RSO_LDC as in Table 8. In the case of KNN/RSO_KNN pair, RSO ensemble never did once managed to perform better than single KNN classifier. Thus, using RSO ensemble with KNN as base classifier to solve SSS problems is not recommended.
Artificial data without noises only allow investigation of algorithm operation in its ideal form, but a reliable classification algorithm should include noise tolerance when solving complex real-world problems. Table 10 recorded the classification accuracies, as well as the average accuracy of each algorithm across all four medical data sets. Bolded values indicate the cases when RSO ensemble outperformed its corresponding base classifier.
Results of four real-world medical data sets
Results of four real-world medical data sets
Ranking of performance was done by observing each data set, algorithm with the highest accuracy will be ranked at 1, algorithm with lowest accuracy ranked at 6, average of rank will be assigned in the case of tie. Table 11 presents the summarized ranking of all algorithms.
Ranking of each algorithm across all real-world medical data sets
Observed from the average rank, all RSO ensembles have a better ranking than their corresponding base classifier, TREE/RSO_TREE pair has the largest rank increment (from rank 4.75 to 3.25) with 1.74% difference in average accuracy. Also, RSO_LDC has the best performance among all six algorithms with rank of 2.25, slightly better than single LDC ranked at 2.5. However, single LDC (85.66%) has a little advantage than RSO_LDC (85.36%) in term of average classification accuracy. On the other hand, single KNN received minor improvement from RSO method when dealing with real-world data sets, increasing its average accuracy from 75.91% to 78.15%.
This research proposed a hybrid of RSM and RLO ensemble methods named as Random Subspace Oracle (RSO) to solve small sample-sized classification problems. The method involved binary feature splitting through random sub-spacing, and an oracle selection approach to choose between mini-ensembles that expert from different dimensions, finally fusing all decision from base classifiers by majority voting scheme.
From Section 5, RSO ensemble with TREE base classifier showed significant result when compared to single TREE classifier with p-value as small as 1.88×10–8 from pairwise WSR test, and improvement of rank from 4.75 to 3.25 when dealing with real-world data sets. Followed by RSO ensemble with LDC base classifier that succeeded in rejecting null hypothesis when dealing with prefix “V” (200 features) data sets and combined result. Although RSO_LDC achieved higher ranking in real-world data sets when compared to single LDC, the average accuracy of RSO_LDC incurred by some slight drawback.
RSO ensemble failed in WSR test when using KNN as its base classifier. The incapability of RSO ensemble possibly caused by the inappropriate oracle selection mechanism. Euclidian’s distance was used as the oracle in this case with mean point of each training subset as the benchmark. Mini-ensemble with the shortest Euclidian’s distance from its benchmarking point to the new data point will be used for classification. This approach is same as the principle used in single KNN classifier where the distance to the nearest neighbor will decides the outcome. Resulting in similar performance across all data sets.
The success of RSO ensemble can be attributed to two main aspects: the random selection of feature, and the splitting of feature space. Random selection of feature reduces the dimensionality of data while maintaining the same number of samples, relatively increases the total amount of training samples. On top of that, oracle splitting was used to maintain the loss of data integrity due to random feature selection, provides two options for each classifier in the ensemble to solve a problem from two different viewpoints.
However, due to the splitting of feature space that requires each classifier to be formed from two mini-ensembles, the training time for each classifier in the ensemble is two-fold. In that case, the overall training and classification time is the same as RLO ensemble.
For future development, strategy such as altering the number of criteria required to make decision can be made. So, with the original Euclidian’s distance measurement, aspects such as correlation and probability can be added as oracle, each with its own evaluation condition, and a fusing approach will be used to combine all decision made. Thus, the mini-ensemble selection can be more dynamic through more inspection from different perspectives.
Footnotes
Acknowledgments
To everyone that contributed and supported in this research.
