Balanced training/test set sampling for proper evaluation of classification models

Abstract

In machine learning, classification involves identifying the categories or classes to which a new observation belongs based on a training set. The performance of a classification model is generally measured by its classification accuracy on a test set. The first step in developing a classification model is to divide an acquired dataset into training and test sets through random sampling. In general, random sampling does not guarantee that test accuracy reflects the performance of a developed classification model. If random sampling produces biased training/test sets, the evaluation of the classification model may also be biased. In this study, we show the problems of random sampling and propose balanced sampling as an alternative. We also propose a measure for evaluating sampling methods. We perform empirical experiments using benchmark datasets to verify that our sampling algorithm produces proper training and test sets. The results confirm that our method produces better training and test sets than random sampling and several non-random sampling methods.

Keywords

Classification, training and test sets, accuracy, random sampling, balanced sampling

1. Introduction

Classification is a supervised learning technique that has wide practical applications in such areas as disease diagnosis, document classification, fraud detection, and customer churn detection. Classification is the problem of allocating new observations to specific categories based on a training set of data containing instances whose category membership is already known. Logistic regression, support vector machine (SVM), decision trees, artificial neural network, k-nearest neighbor (KNN), and Bayesian classifier are all well-known algorithms. Figure 1 summarizes the general building process of a classification model. In the first step, we divide a prepared whole dataset into training and test sets. A portion of the training data is separated into a validation set. A validation set is generally used for parameter tuning to select the best training model. After obtaining the final learned model, we evaluate the performance of the model using the test set. Classification accuracy using the test set (test accuracy) is a typical measure of the performance of a learned model.

Table 1
Basic classification accuracy statistics from 1000 random sampling trials (SD: standard deviation)

Min	Mean $-$ SD	Median	Mean	Mean $+$ SD	Max	SD
0.848	0.889	0.904	0.901	0.914	0.975	0.017

Figure 1.

Development of a classification model.

One problem is that classification accuracy depends on the training and test sets. The manner in which these sets are divided affects the quality of the classification model and its measured performance (test accuracy). A poor model may produce high test accuracy if the test dataset contains well-classified instances. By contrast, a good model may produce low test accuracy if the test set contains instances that are easily confused. To illustrate these scenarios, we generate an example dataset with two classes and 499 instances, and then select two random test sets from this generated dataset. Figure 2a shows the distribution of all data instances in the dataset, and Fig. 2b and c show the distributions of the instances of test sets A and B. In Fig. 2, the colors of the dots express their classes. We can observe that test set A has clearly separated instances, whereas test set B has some overlapped instances. After building two learning models for the cases shown in Fig. 2b and c using SVM, we performed a classification test. Test sets A and B yielded classification accuracies of 0.975 and 0.848, respectively. However, determining which result best reflects the performance of the developed classification model is difficult. This result reveals that dividing training and test sets is crucial in developing a classification model.

Figure 2.

Distribution of instances in an example dataset.

Table 1 gives random sampling statistics for the example dataset. We generated training and test sets 100 times by random sampling and calculated the classification accuracy using an SVM classifier. The mean accuracy of 0.901 may be regarded as a proper estimate of the accuracy of the classification model generated using the example dataset. However, the table shows that the classification accuracy ranged from 0.848 to 0.975, which is quite wide, so a single random split can be far from this mean. Therefore, we require an improved sampling method.
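The kind of experiment summarized in Table 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic one-dimensional Gaussian classes rather than the example dataset of Fig. 2, and a simple nearest-centroid rule stands in for the SVM classifier. The function names (`make_data`, `nearest_centroid_accuracy`, `random_split_accuracies`) are hypothetical.

```python
import random
import statistics

def make_data(n, rng):
    """Two overlapping 1-D Gaussian classes (synthetic, for illustration only)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(0.0 if label == 0 else 2.0, 1.0)
        data.append((x, label))
    return data

def nearest_centroid_accuracy(train, test):
    """Fit class means on train, then classify test points by the closer mean."""
    means = {}
    for c in (0, 1):
        xs = [x for x, y in train if y == c]
        means[c] = sum(xs) / len(xs)
    hits = sum(1 for x, y in test
               if min(means, key=lambda c: abs(x - means[c])) == y)
    return hits / len(test)

def random_split_accuracies(data, t, n_trials, rng):
    """Repeat a random train/test split n_trials times and collect test accuracies."""
    accs = []
    n_train = int(t * len(data))
    for _ in range(n_trials):
        shuffled = data[:]
        rng.shuffle(shuffled)
        accs.append(nearest_centroid_accuracy(shuffled[:n_train], shuffled[n_train:]))
    return accs

rng = random.Random(42)
data = make_data(400, rng)
accs = random_split_accuracies(data, t=0.75, n_trials=100, rng=rng)
# Same kind of summary as Table 1: min, mean, max, and SD of test accuracy.
print(min(accs), statistics.mean(accs), max(accs), statistics.stdev(accs))
```

The spread between the minimum and maximum printed here plays the same role as the Min and Max columns of Table 1: it indicates how much a single random split can deviate from the mean.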

Desirable training and test sets should be representative of the whole dataset. In this study, we propose a balanced sampling approach based on the degree of overlap of instances to produce representative training and test sets. In our previous work [1, 2], we showed that classification accuracy strongly depends on overlapped areas of class instances in a target dataset. Therefore, when we divide training and test sets, overlapped areas should be reflected in them. In Section 3, we describe the degree of overlap in greater detail and propose an approach to divide training and test sets. We also present a measure to evaluate the quality of the generated sets. The proposed method is then compared with previous methods using our evaluation measure. Several previous sampling methods are described in Section 2. K-fold cross validation has been proposed to overcome the overfitting problem in classification. It builds k training models, and the mean of their test accuracies is used as an evaluation measure for parameter tuning of a model or for comparison of different models. The purpose of k-fold cross validation thus differs from that of sampling methods. Therefore, we do not compare the proposed method with k-fold cross validation.

2. Related works

Random sampling [3, 4] is widely used to divide training and test sets from a whole dataset. However, random sampling does not indicate how and why certain instances are chosen for such sets [5, 6, 7], and sometimes it produces very different evaluations for classification models built on the same dataset. Stratified sampling is a kind of random sampling. It randomly chooses the same ratio of instances from each class of a dataset. The goal of stratified sampling is to ensure that the training and test sets contain the same ratio of instances from every class. Oversampling and undersampling deal with class imbalance problems. In some cases, class A has many instances whereas class B has few. In such cases, we can choose a higher ratio of instances from class B than from class A to balance the class distribution in the training and test sets. These are also kinds of random sampling.

Other methods to generate training and test sets have been proposed. For example, various clustering-based approaches were introduced in [8, 9, 10, 11, 12]. In the first step of these types of methods, $k$ clusters of the instances of a whole dataset are generated, and training and test instances are then randomly selected from each cluster. In these methods, $k$ is the number of classes of the obtained dataset (see Algorithm 1).

Algorithm 1: Clustering-based sampling

/* Input: DS, $k$, $t$ // DS: target dataset, $k$: number of clusters, $t$: ratio of training set
   Output: trainSet, testSet */
trainSet $\leftarrow\emptyset$
testSet $\leftarrow\emptyset$
Perform clustering with $k$
FOR each cluster in DS DO
    NG $\leftarrow$ number of instances in current cluster
    TR $\leftarrow$ NG * $t$ instances of current cluster that are randomly chosen
    trainSet $\leftarrow$ trainSet $\cup$ TR
END FOR
testSet $\leftarrow$ DS – trainSet
RETURN trainSet, testSet

Hudson et al. [11] presented the most descriptive compound method (MDC), and Martin and Critchlow [12] proposed D-optimal designs to divide training and test sets. The goal of MDC is to select a subset of chemical compounds that most effectively represents the compounds in the original population. The information value of a compound is evaluated as the sum of reciprocal values of ranks of distances to other compounds [13]. Algorithm 2 describes the MDC method.

Algorithm 2: MDC sampling

/* Input: DS, $t$ // DS: target dataset, $t$: ratio of training set
   Output: trainSet, testSet */
trainSet $\leftarrow\emptyset$
testSet $\leftarrow\emptyset$
nSample $\leftarrow t$ * (number of rows of DS) // size of training set
FOR $i=$ 1 TO nSample DO
    $X\leftarrow$ DS
    dist_matrix $\leftarrow$ distance matrix of all instances in $X$
    rank_matrix $\leftarrow$ rank matrix of all instances in $X$
    initialize rank_matrix to 0
    max_index $\leftarrow-1$
    FOR $j=$ 1 TO number of instances in $X$ DO
        rank_matrix[,j] $\leftarrow$ rank of dist_matrix[,j]
    END FOR
    ivdist_matrix $\leftarrow$ 1/rank_matrix // information distance matrix
    ivSum $\leftarrow$ row sums of ivdist_matrix // information value sum
    max_index $\leftarrow$ index of maximum value of ivSum
    trainSet $\leftarrow$ trainSet $\cup X$[max_index,]
END FOR
testSet $\leftarrow$ DS – trainSet
RETURN trainSet, testSet
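The information value used by MDC (the sum of reciprocal distance ranks) can be illustrated with a short Python sketch. It rests on several assumptions that the pseudo-code does not state: distances are Euclidean, an instance is not ranked against itself, instances already chosen are removed from the candidate pool before the next iteration, and ties in distance are broken by index order. The function name `mdc_select` is hypothetical.

```python
import math

def mdc_select(points, n_select):
    """Greedy MDC-style selection; returns indices of the chosen training points."""
    remaining = list(range(len(points)))
    chosen = []
    while remaining and len(chosen) < n_select:
        info = {i: 0.0 for i in remaining}
        for j in remaining:
            # Rank the other candidates by distance to j (rank 1 = closest).
            others = sorted((i for i in remaining if i != j),
                            key=lambda i: math.dist(points[i], points[j]))
            for rank, i in enumerate(others, start=1):
                info[i] += 1.0 / rank  # reciprocal rank adds to i's information value
        best = max(remaining, key=lambda i: info[i])
        chosen.append(best)
        remaining.remove(best)  # assumption: chosen points leave the candidate pool
    return chosen

points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0)]
selected = mdc_select(points, 2)
print(selected)  # [1, 2]
```

In this toy example the two points in the densest part of the data are picked first, while the isolated point at x = 10 receives the lowest information value.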

The purpose of the D-optimal design is to select an optimal set of instances for model building. The D-optimal design selects instances such that potential errors in descriptors minimally affect an assumed (usually linear) model. Possible model errors are not considered. Therefore, the selected instances will always be on the outer surface of the space occupied by the candidates. A close distance of any two points does not decrease the D-optimal quality of the design [14]. Algorithm 3 describes the D-optimal method.

Algorithm 3: D-optimal sampling

/* Input: DS, $t$, nRepeat // DS: target dataset, $t$: ratio of training set, nRepeat: number of repeats
   Output: trainSet, testSet */
trainSet $\leftarrow\emptyset$
testSet $\leftarrow\emptyset$
max.det $\leftarrow-1$
nSample $\leftarrow t$ * (number of rows of DS) // size of training set
FOR $i=$ 1 TO nRepeat DO
    tempMatrix $\leftarrow$ random sample of DS of size nSample
    det $\leftarrow$ determinant of (tempMatrix)${}^{T}\times$ (tempMatrix)
    IF det $>$ max.det THEN
        max.det $\leftarrow$ det
        trainSet $\leftarrow$ tempMatrix
    END IF
END FOR
testSet $\leftarrow$ DS – trainSet
RETURN trainSet, testSet
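A minimal Python sketch of the random-search form of Algorithm 3 is given below, assuming purely numeric features. The determinant of (tempMatrix)${}^{T}\times$(tempMatrix) is computed with plain Gaussian elimination so that the example needs no external library. The names `gram_determinant` and `d_optimal_split` are hypothetical.

```python
import random

def gram_determinant(rows):
    """Determinant of X^T X for a list of equal-length feature rows X."""
    p = len(rows[0])
    g = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    det = 1.0
    for col in range(p):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, p), key=lambda r: abs(g[r][col]))
        if abs(g[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            g[col], g[pivot] = g[pivot], g[col]
            det = -det
        det *= g[col][col]
        for r in range(col + 1, p):
            factor = g[r][col] / g[col][col]
            for c in range(col, p):
                g[r][c] -= factor * g[col][c]
    return det

def d_optimal_split(data, t, n_repeat, seed=0):
    """Keep the random training subset with the largest det(X^T X)."""
    rng = random.Random(seed)
    n_sample = int(t * len(data))
    best_det, best_idx = -1.0, []
    for _ in range(n_repeat):
        idx = rng.sample(range(len(data)), n_sample)
        det = gram_determinant([data[i] for i in idx])
        if det > best_det:
            best_det, best_idx = det, idx
    train = sorted(best_idx)
    chosen = set(train)
    test = [i for i in range(len(data)) if i not in chosen]
    return train, test, best_det

rng = random.Random(1)
data = [[rng.random(), rng.random()] for _ in range(8)]
train_idx, test_idx, det = d_optimal_split(data, t=0.75, n_repeat=20)
print(train_idx, test_idx, det)
```

Because a larger determinant favours instances that span the feature space widely, the retained training set tends to contain points near the outer surface of the data, as described above.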

3. Materials and methods

The degree of overlap is a core concept of the proposed balanced sampling. We next introduce the concept of degree of overlap and propose a balanced sampling algorithm based on that concept. To compare the proposed and previous methods, an evaluation measure that we call a mean accuracy indicator (MAI) is introduced.

Figure 3.

Two example datasets having three classes.

Figure 4.

Overlapping area between two classes in a dataset.

3.1 Concept of degree of overlap

The classification accuracy of a dataset depends on the characteristics of that dataset. Dataset A (Fig. 3a) has three classes, and the instances of each class are clearly separated compared with the instances in Dataset B (Fig. 3b). In other words, the class-overlapping area of Dataset A is smaller than that of Dataset B. We know intuitively that Dataset A produces higher classification accuracy than does Dataset B. Lee et al. [14] proposed an R-value measure to evaluate the quality of datasets. It is based on the ratio of overlapping areas among classes (categories) in a dataset. Figure 4 presents the concept of an overlapping area. If a data instance is located in an overlapping area, correctly classifying it is difficult. The R-value measures the size of the overlapping area. Based on the R-value, we define the degree of overlap of each data instance. First, we take the $k$-nearest neighbor instances of a target instance. The degree of overlap of the instance is defined as the number of neighbor instances that belong to a different class from the target instance. In Fig. 5a and b, the degrees of overlap of the center instances are 3 and 1, respectively. The degree of overlap ranges from 0 to $k$. Figure 6 shows the distribution of degrees of overlap for the dataset shown in Fig. 2.

Figure 5.

Five-nearest neighbors for center instance (blue rectangle).

Figure 6.

Distribution of degrees of overlap for each instance in Fig. 2a ( $k=$ 5).

3.2 Balanced sampling based on degrees of overlap

Before we provide a formal description of balanced training and test set sampling, we must define the following notations.

DS: given dataset

$N$ : number of total instances of DS

$P_{i}$ : $i$ -th instance of DS

$C_{i}$ : Class label of $P_{i}$

$k$ : number of nearest neighbors for $P_{i}$

$t$ : ratio of training instances to whole instances in a given dataset, where 0 $<$ $t$ $<$ 1 and the ratio of test instances is $1-t$

Definition 1. KNearest($P_{i}$), the $k$-nearest neighbors of $P_{i}$, is the set of the $k$ instances of DS that are nearest to $P_{i}$.

Definition 2. The degree of overlap (DO) for $P_{i}$ is defined by:

$\displaystyle\textit{DO}(P_{i})=\sum_{j=1}^{k}(1\ \text{if}\ C_{i}\neq C_{j},\ \text{otherwise}\ 0),\ \text{where}\ C_{j}\ \text{is the class label of}\ P_{j}\in\textit{KNearest}(P_{i})$
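Definitions 1 and 2 can be sketched in Python as follows. The distance metric is not fixed by the definitions; Euclidean distance is assumed here, and ties between equally distant neighbors are broken by index order. The function name `degree_of_overlap` is hypothetical.

```python
import math

def degree_of_overlap(points, labels, k):
    """For each instance, count how many of its k nearest neighbours have a different class label."""
    result = []
    for i, p in enumerate(points):
        # KNearest(P_i): the k other instances closest to P_i (Euclidean distance assumed).
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: math.dist(p, points[j]))[:k]
        result.append(sum(1 for j in neighbours if labels[j] != labels[i]))
    return result

points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = ["a", "a", "a", "b", "b", "b"]
do_values = degree_of_overlap(points, labels, k=2)
print(do_values)  # [0, 0, 1, 1, 0, 0]
```

Only the two instances at the class boundary have a nonzero DO; every value lies between 0 and $k$.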

The procedure for balanced training and test sampling is as follows:

Step 1.
Calculate the DO of every instance $P_{i}$ of a given dataset, and then group the instances by DO into ($k+1$) groups.
Step 2.
From each group, randomly choose $t$ * $N_{s}$ instances, where $N_{s}$ is the number of instances in group $s$. Add them to the training set and add the remaining instances to the test set.

Algorithm 4 contains pseudo-code for balanced training and test sampling.

Algorithm 4: Balanced training/test set sampling

/* Input: DS, $k$, $t$
   Output: trainSet, testSet */
PDO $\leftarrow$ NULL // Array to store DO value
trainSet $\leftarrow\emptyset$
testSet $\leftarrow\emptyset$
FOR $i=$ 1 TO $N$ DO
    PDO[$i$] $=$ DO($P_{i}$) // calculate the DO for each instance
END FOR
FOR $i=$ 0 TO $k$ DO
    GROUP $\leftarrow$ {$P_{j}|P_{j}\in$ DS and PDO[$j$] $=$ $i$}
    NG $\leftarrow$ number of instances in GROUP
    TR $\leftarrow$ NG * $t$ instances of GROUP that are randomly chosen
    trainSet $\leftarrow$ trainSet $\cup$ TR
END FOR
testSet $\leftarrow$ DS - trainSet
RETURN trainSet, testSet
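The grouping and sampling steps of Algorithm 4 can be sketched in Python as below. This is not the R implementation referred to in the Conclusion; it assumes the DO value of each instance has already been computed, and it rounds NG * $t$ to the nearest integer with Python's built-in `round`, which is an assumption because the pseudo-code does not specify rounding. The function name `balanced_split` is hypothetical.

```python
import random
from collections import defaultdict

def balanced_split(do_values, t, seed=0):
    """Split instance indices so that each DO group contributes a fraction t to training."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for idx, do in enumerate(do_values):
        groups[do].append(idx)
    train = []
    for do in sorted(groups):
        members = groups[do]
        n_train = round(t * len(members))  # assumption: round NG * t to an integer
        train.extend(rng.sample(members, n_train))
    chosen = set(train)
    test = [i for i in range(len(do_values)) if i not in chosen]
    return sorted(train), test

# Toy DO values: 40 instances with DO 0, 20 with DO 1, 8 with DO 2, 4 with DO 3.
do_values = [0] * 40 + [1] * 20 + [2] * 8 + [3] * 4
train_idx, test_idx = balanced_split(do_values, t=0.75)
print(len(train_idx), len(test_idx))  # 54 18
```

Each DO group keeps the same 75:25 proportion, so the training and test sets contain easy and hard instances in the same ratio as the whole dataset.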

3.3 Evaluation measure for sampling methods

To develop or choose a sampling method for generating training and test sets, we need a measure for evaluating the sets it produces. We cannot directly know what constitutes desirable training and test sets, so we must define them. We may expect that desirable training and test sets should be representative of the whole dataset. In other words, the distribution of instances in the training and test sets should be similar to that of the entire dataset. One-time random sampling may not meet this requirement. If we repeat random sampling infinitely, the average of the generated test accuracies may converge to a single value. We take this value as the proper evaluation value for a given dataset and call it the absolute evaluation value (AEV). From a practical point of view, infinite random sampling is impossible. Therefore, in this study, we performed 100 iterations of random sampling to obtain the AEV. If the training and test sets from a sampling method produce an accuracy similar to the AEV, then the method is proper for developing and evaluating a classification model. Using the AEV, we propose the MAI introduced above as an evaluation measure for a sampling method. Suppose we have a dataset $D$. We generate training and test sets using sampling method $S$ and obtain a test accuracy using classification model $C$. The AEV can then be calculated using the following equation:

$\displaystyle\textit{AEV}=\frac{1}{n}\sum_{i=1}^{n}\textit{test\_acc}_{i}$ (1)

where $\textit{test\_acc}_{i}$ is the test accuracy generated by the $i$-th random sampling and $n$ is the number of random sampling iterations. In our experiment, $n$ was 100. Test accuracy is measured using model $C$. MAI is defined by the following equation:

$\displaystyle\textit{MAI}=\frac{\textit{ACC}-\textit{AEV}}{\textit{SD}}$ (2)

where ACC refers to the test accuracy derived from model $C$ for a test set generated by sampling method $S$, and SD is the standard deviation of the test accuracies ($\textit{test\_acc}_{i}$) used to compute the AEV. If MAI $=$ 0, then we can assume that sampling method $S$ generates perfectly desirable training and test sets. If MAI $>$ 0, sampling method $S$ generates biased training and test sets, and classification model $C$ will be overestimated. If MAI $<$ 0, classification model $C$ will be underestimated. An MAI of exactly 0 is difficult to achieve in practice, so a sampling method is better the closer its MAI is to 0. Figure 7 shows how to calculate MAI for a given sampling method. In this study, we compare the proposed and previous methods using the MAI measure.
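Equations (1) and (2) translate directly into a few lines of Python. The sketch below assumes the sample standard deviation (the n - 1 form, as returned by R's `sd`), since the text does not say which form is used; the accuracy values are made up for illustration and the function names are hypothetical.

```python
import statistics

def aev_and_sd(random_accuracies):
    """Equation (1): AEV is the mean test accuracy over repeated random sampling."""
    aev = statistics.mean(random_accuracies)
    sd = statistics.stdev(random_accuracies)  # assumption: sample SD (n - 1)
    return aev, sd

def mai(acc, random_accuracies):
    """Equation (2): MAI = (ACC - AEV) / SD."""
    aev, sd = aev_and_sd(random_accuracies)
    return (acc - aev) / sd

# Illustrative accuracies from five random splits (not taken from the experiments).
random_accuracies = [0.88, 0.90, 0.92, 0.90, 0.90]
print(mai(0.93, random_accuracies))  # positive: the split overestimates the model
print(mai(0.87, random_accuracies))  # negative: the split underestimates the model
```

A split whose test accuracy equals the AEV yields an MAI of 0, and the sign of the MAI indicates whether the split overestimates or underestimates the model.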

Table 2

Summaries of the benchmark datasets

No.	Dataset	# of features	# of classes	# of instances
1	Hill_valley	100	2	1212
2	Pima-indians-diabetes	8	2	268
3	Satimage	36	7	4435
4	Waveform	21	3	5038
5	Wdbc	30	2	569
6	Winequality-white	11	6	14066
7	Ecoil	7	8	336
8	Seed	7	3	420
9	GlassId	9	7	595
10	BreastTissue	9	6	106

4. Results

The performance of a sampling method can be measured using an MAI based on a specific dataset and classifier. In our experiments, we tested various datasets and classifiers to compare the proposed and previous sampling methods.

4.1 Benchmark datasets, compared sampling methods, and classifiers

To compare the proposed and previous sampling methods, we used benchmark datasets that have various numbers of features (attributes), classes, and instances. The datasets were collected from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and are listed in Table 2.

We compared the proposed method with previous methods, namely random sampling, D-optimal, MDC, and clustering-based sampling. For clustering-based sampling, we used the k-means clustering algorithm (where k is the number of classes). To measure classification accuracy, we tested KNN, SVM, random forest (RF), and C5.0 (C50) classifiers, using their implementations in R packages (https://www.r-project.org). Table 3 lists the classifiers and the parameters applied in our test. We divided the entire dataset into training and test sets at a 75:25 ratio in order to build and evaluate our classification models.

Table 3
Summary of classifiers and applied parameters

Classifier	R package	Parameter values
KNN	class	$k=$ 6, $l=$ 0, prob $=$ FALSE, use.all $=$ TRUE
SVM	e1071	Default
RF	randomForest	Default
C50	C50	trials $=$ 1

Figure 7.

Procedure for calculating MAI.

4.2 Comparison of proposed and previous methods by MAI

Table 4 lists basic statistics related to our experiment on benchmark datasets. From the 100 iterations of random sampling, we obtained the max, min, mean, and SD of test accuracy for each of the four classifiers on a given dataset. We also obtained classification accuracies for the proposed and four previous sampling methods.

Table 4
Estimation of accuracy using random sampling and split method accuracy (RS: Random sampling, DO: D-optimal, CB: Clustering-based)

Dataset	Classifier	Mean	SD	Test accuracy
		(AEV)		Proposed	RS	DO	MDC	CB
Pima	KNN	0.716	0.027	0.703	0.750	0.745	0.677	0.729
	C50	0.736	0.029	0.724	0.745	0.708	0.719	0.760
	SVM	0.760	0.025	0.745	0.781	0.750	0.755	0.776
	RF	0.763	0.026	0.776	0.776	0.755	0.740	0.750
Satimage	KNN	0.901	0.008	0.905	0.904	0.904	0.900	0.909
	C50	0.857	0.010	0.857	0.847	0.858	0.857	0.845
	SVM	0.892	0.008	0.900	0.890	0.897	0.882	0.895
	RF	0.911	0.008	0.902	0.913	0.916	0.907	0.921
Waveform	KNN	0.822	0.009	0.819	0.824	0.813	0.800	0.814
	C50	0.767	0.012	0.774	0.769	0.770	0.778	0.758
	SVM	0.864	0.008	0.866	0.871	0.867	0.865	0.859
	RF	0.854	0.009	0.851	0.867	0.852	0.850	0.848
Wdbc	KNN	0.930	0.018	0.930	0.937	0.909	0.944	0.944
	C50	0.936	0.021	0.958	0.979	0.909	0.958	0.930
	SVM	0.974	0.012	0.993	0.986	0.965	0.958	0.958
	RF	0.960	0.016	0.993	0.944	0.937	0.972	0.930
Winequality	KNN	0.470	0.012	0.491	0.475	0.454	0.450	0.504
	C50	0.572	0.014	0.568	0.575	0.571	0.571	0.603
	SVM	0.571	0.011	0.588	0.551	0.561	0.565	0.590
	RF	0.684	0.011	0.704	0.665	0.677	0.707	0.721
Hill_Valley	KNN	0.553	0.025	0.587	0.578	0.545	0.508	0.578
	C50	0.505	0.003	0.505	0.505	0.469	0.469	0.508
	SVM	0.515	0.017	0.518	0.535	0.429	0.548	0.498
	RF	0.597	0.026	0.594	0.591	0.620	0.601	0.581
Ecoil	KNN	0.851	0.030	0.880	0.880	0.833	0.929	0.857
	C50	0.809	0.036	0.783	0.795	0.762	0.774	0.810
	SVM	0.806	0.059	0.759	0.783	0.679	0.893	0.762
	RF	0.864	0.029	0.855	0.892	0.845	0.869	0.881
Seed	KNN	0.890	0.037	0.863	0.922	0.925	0.887	0.923
	C50	0.909	0.041	0.843	0.784	0.906	0.925	0.942
	SVM	0.930	0.032	0.882	0.902	0.981	0.943	0.923
	RF	0.928	0.035	0.882	0.882	0.943	0.887	0.942
GlassId	KNN	0.662	0.053	0.635	0.673	0.630	0.537	0.722
	C50	0.690	0.058	0.673	0.788	0.556	0.556	0.704
	SVM	0.701	0.048	0.692	0.769	0.500	0.556	0.648
	RF	0.786	0.047	0.769	0.827	0.685	0.648	0.796
Breast	KNN	0.536	0.085	0.542	0.542	0.593	0.556	0.615
	C50	0.665	0.070	0.625	0.625	0.704	0.741	0.654
	SVM	0.589	0.063	0.583	0.542	0.704	0.556	0.654
	RF	0.705	0.071	0.667	0.667	0.778	0.704	0.654

Table 5 summarizes the absolute MAI values of the proposed and four previous sampling methods for 10 datasets and four classifiers. Based on the definition of MAI, the smaller the absolute value of MAI, the better the sampling result. The average absolute value of MAI for the proposed method was 0.712, whereas the previous methods (RS, DO, MDC, and CB) showed 0.840, 1.353, 1.341, and 0.919, respectively. As previously mentioned, the ideal value of MAI is 0; that of the proposed method was closer to 0 than were those of the previous methods. Thus, the proposed method produced better training and test sets than did the previous methods. The SD of the absolute value of MAI for the proposed method was 0.555, whereas the previous methods had SDs of 0.622, 1.354, 2.062, and 0.666, respectively. This means that the proposed method produced stable results with less variability than did the other methods.

Table 5

Absolute values of MAI for the proposed and four previous sampling methods (RS: Random sampling, DO: D-optimal, CB: Clustering-based)

		MAI
Dataset	Classifier	Proposed	RS	DO	MDC	CB
Pima	KNN	0.495	1.272	1.076	1.477	0.487
	C50	0.408	0.303	0.942	0.586	0.836
	SVM	0.613	0.846	0.405	0.196	0.638
	RF	0.504	0.504	0.310	0.920	0.513
Satimage	KNN	0.516	0.401	0.319	0.140	1.007
	C50	0.015	0.995	0.054	0.037	1.227
	SVM	1.000	0.264	0.691	1.257	0.462
	RF	1.073	0.330	0.710	0.456	1.293
Waveform	KNN	0.266	0.249	0.935	2.306	0.850
	C50	0.650	0.166	0.320	0.942	0.786
	SVM	0.300	0.875	0.408	0.121	0.549
	RF	0.348	1.509	0.241	0.427	0.705
Wdbc	KNN	0.028	0.359	1.151	0.767	0.745
	C50	1.062	2.089	1.303	1.077	0.307
	SVM	1.538	0.970	0.712	1.275	1.299
	RF	2.067	1.003	1.414	0.764	1.881
Winequality	KNN	1.774	0.430	1.273	1.608	2.848
	C50	0.341	0.179	0.094	0.094	2.133
	SVM	1.518	1.816	0.903	0.608	1.653
	RF	1.683	1.683	0.611	1.961	3.227
Hill_Valley	KNN	1.356	0.967	0.330	1.756	0.967
	C50	0.089	0.089	13.088	13.088	1.287
	SVM	0.182	1.130	4.939	1.889	0.956
	RF	0.127	0.254	0.887	0.127	0.634
Ecoil	KNN	0.969	0.969	0.573	2.607	0.222
	C50	0.723	0.392	1.307	0.980	0.002
	SVM	0.796	0.389	2.155	1.464	0.748
	RF	0.284	0.980	0.641	0.192	0.609
Seed	KNN	0.750	0.861	0.942	0.091	0.902
	C50	1.608	3.051	0.074	0.389	0.825
	SVM	1.503	0.883	1.619	0.426	0.216
	RF	1.281	1.281	0.439	1.156	0.409
GlassId	KNN	0.513	0.214	0.607	2.357	1.143
	C50	0.296	1.701	2.330	2.330	0.234
	SVM	0.180	1.411	4.158	3.009	1.093
	RF	0.353	0.870	2.135	2.921	0.221
Breast	KNN	0.065	0.065	0.666	0.229	0.935
	C50	0.569	0.569	0.549	1.074	0.159
	SVM	0.097	0.756	1.809	0.536	1.019
	RF	0.541	0.541	1.030	0.017	0.722
Average of abs(MAI)		0.712	0.840	1.353	1.341	0.919
SD of abs(MAI)		0.555	0.622	1.354	2.062	0.666

Figure 8 is a box plot that shows the results of Table 5. In the box plot, each point refers to a value of $\frac{\textit{ACC}-\textit{AEV}}{\textit{SD}}$ from the 10 datasets and four classifiers. The points on the box plot from previous methods are dispersed widely, whereas the points from the proposed method are gathered near 0. The height of the box for the proposed method is smaller than that for the other methods. This is another factor that reveals the superiority of the proposed method.

Figure 8.

Box-plotted distributions of MAI values.

5. Discussion

The goal of dividing training and test sets is to minimize the bias of test accuracy. We want the performance of our developed model to reflect exactly the quality of a given dataset. Some training and test sets overestimate classification models, and some underestimate them. Our experiments confirmed that the proposed balanced sampling method produces better training and test sets for developing classification models than do previous methods. As previously mentioned, classification accuracy strongly depends on overlapped areas between class areas. The proposed method is based on overlapped areas and thus reduces bias in classification accuracy. In the case of D-optimal sampling, data in overlapped areas are excluded from the training set, and this leads to bias in test accuracy. In the MDC method, the probability of choosing data in dense areas is higher than that in sparse areas, which also produces bias in test accuracy.

To examine how the size of a dataset affects data sampling, we split the datasets into two groups. Datasets with 500 or more instances belong to Group A, and all others belong to Group B. Table 6 lists the average absolute values of MAI for the two groups of datasets. As we can see, the difference in MAI between the proposed and previous methods in Group B is larger than in Group A. This reveals that the size of the dataset strongly affects the quality of sampling results. The smaller the dataset, the greater the MAI value, which means that the bias in test accuracy increases. In Table 6, the “Diff” column refers to the difference in MAI between the proposed method and the average of the other methods. Table 6 also shows that the proposed method is less affected by the dataset size than are the previous methods.

Table 6
Average absolute values of MAI for two groups of datasets (RS: Random sampling, DO: D-optimal, CB: Clustering-based)

Group		Proposed	RS	DO	MDC	CB	Diff
		MAI
A	# of instances $\geqslant$ 500	0.027	0.876	1.741	1.685	0.027	1.056
B	# of instances $<$ 500	0.054	2.622	5.263	5.220	0.054	3.236

If a dataset has clearly separated class areas, any sampling method can produce good training and test sets, and the difference in MAI between the proposed and previous methods may be small. For example, the well-known iris dataset (http://archive.ics.uci.edu/ml/) has clearly separated classes, and MAI values of the proposed and previous methods are similar (see Table 7).

Table 7

Average MAI of four classifiers for the iris dataset (RS: Random sampling, DO: D-optimal, CB: Clustering-based)

Proposed	RS	DO	MDC	CB	Diff
0.377	0.474	0.634	1.059	0.718	0.344

Stratified sampling and oversampling/undersampling are basically kinds of random sampling. They focus on class balance in the training and test sets. The proposed method can be easily combined with them. For example, we can apply the proposed sampling method to the instances of each class separately and then merge the results, as in stratified sampling.
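One possible way to realize this combination is to treat every (class, DO) pair as its own stratum and sample a fraction $t$ from each. The Python sketch below is only an illustration of that idea under the assumption that DO values are precomputed and that per-stratum sizes are rounded with Python's `round`; the function name `class_balanced_split` is hypothetical.

```python
import random
from collections import defaultdict

def class_balanced_split(labels, do_values, t, seed=0):
    """Sample a fraction t from every (class label, DO) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, key in enumerate(zip(labels, do_values)):
        strata[key].append(idx)
    train = []
    for key in sorted(strata):
        members = strata[key]
        train.extend(rng.sample(members, round(t * len(members))))
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test

labels = ["a"] * 12 + ["b"] * 8
do_values = [0] * 8 + [1] * 4 + [0] * 4 + [1] * 4
train_idx, test_idx = class_balanced_split(labels, do_values, t=0.75)
print(len(train_idx), len(test_idx))  # 15 5
```

With this grouping, both the class ratio and the ratio of overlapped instances within each class are preserved in the training and test sets.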

6. Conclusion

How an obtained dataset is divided into training and test sets considerably affects a classification model. In our study, we proposed a balanced sampling method that showed better performance than did previous methods. Particularly, it was highly efficient in processing large datasets. The proposed MAI measure was helpful in evaluating sampling methods and can be used to develop new ones. Because outlier data instances in our study influenced the classification model, a sampling method should consider outliers, which is a topic for future research. It would also be interesting to study how the combination of sampling methods and classification algorithms affects classification accuracy. Sampling methods that take imbalanced classes into account are another research topic. The proposed balanced sampling method and MAI calculation algorithm were implemented using R, and the source code is posted at: https://bitldku.github.io/home/sw/balancedSampling.html.

Acknowledgments

This work was supported by the ICT and RND program of MIST/IITP [2018-0-00242, Development of AI ophthalmologic diagnosis and smart treatment platform based on big data].

References

1. Walczak, Massart D.L., Heuerding, Erni, Last I.R. and Prebble K.A., Artificial neural networks in classification of NIR spectral data: design of the training set, Chemometrics and Intelligent Laboratory Systems 33(1) (1996), 35–46.

2. Yasri and Hartsough, Toward an optimal procedure for variable selection and QSAR model building, Journal of Chemical Information and Computer Sciences 41(5) (2001), 1218–1227.

3. Golbraikh and Tropsha, Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection, Molecular Diversity 5(4) (2000), 231–243.

4. Huuskonen, QSAR modeling with the electrotopological state: TIBO derivatives, Journal of Chemical Information and Computer Sciences 41(2) (2001), 425–429.

5. Pötter and Matter, Random or rational design? Evaluation of diverse compound subsets from chemical structure databases, Journal of Medicinal Chemistry 41(4) (1998), 478–488.

6. Loukas Y.L., Adaptive neuro-fuzzy inference system: an instant and architecture-free predictor for improved QSAR studies, Journal of Medicinal Chemistry 44(17) (2001), 2772–2783.

7. Bernard, Pintore, Berthon J.Y. and Chrétien J.R., A molecular modeling and 3D QSAR study of a large series of indole inhibitors of human non-pancreatic secretory phospholipase A2, European Journal of Medicinal Chemistry 36(1) (2001), 1–19.

8. Burden F.R., Ford M.G., Whitley D.C. and Winkler D.A., Use of automatic relevance determination in QSAR studies using Bayesian neural networks, Journal of Chemical Information and Computer Sciences 40(6) (2000), 1423–1430.

9. Burden F.R. and Winkler D.A., Robust QSAR models using Bayesian regularized neural networks, Journal of Medicinal Chemistry 42(16) (1999), 3183–3187.

10. Tetko I.V., Kovalishyn V.V. and Livingstone D.J., Volume learning algorithm artificial neural networks for 3D QSAR studies, Journal of Medicinal Chemistry 44(15) (2001), 2411–2420.

11. Hudson B.D., Hyde R.M., Rahr, Wood and Osman, Parameter Based Methods for Compound Selection from Chemical Databases, Quantitative Structure-Activity Relationships 15 (1996), 285–289. doi: 10.1002/qsar.19960150402.

12. Martin E.J. and Critchlow R.E., Beyond mere diversity: tailoring combinatorial libraries for drug discovery, Journal of Combinatorial Chemistry 1(1) (1999), 32–45.

13. A new dataset evaluation method based on category overlap, Computers in Biology and Medicine 41(2) (2011), 115–122.

14. Lee, Batnyam and Oh, Efficient feature selection method based on R-value, Computers in Biology and Medicine 43(2) (2013), 91–99.
