Abstract
Fuzziness based divide and conquer (D&C) is a recently proposed strategy for promoting the classifiers (i.e., fuzzy classifiers) performance, where the amount of fuzziness quantity associated with each data point (i.e., both labeled and unlabeled) is considered as an important avenue to the empire for instance selection problem. This technique is regarded as a semi-supervised learning (SSL) technique, where different categories of instances are obtained by using fuzziness measure, and then the instances having less amount of fuzziness are incorporated into training set for improving the generalization ability of a classifier. This study proposes some effective methods and presents a novel algorithm for categorizing the instances into three groups that can effectively integrate with D&C strategy. It is observed by the experimental validation that considering the splitting criteria for instances categorization can lead the classifier to perform better on withheld set. Results on different classification data sets prove the effectiveness of proposed algorithm.
Keywords
Brief introduction of divide-and-conquer strategy
Fuzziness based divide and conquer (D&C) strategy is proposed by Wang et al. [28] for improving the generalization capability of a classifier ℱ (i.e., the classifiers whose output is a membership or fuzzy vector v). In [28], fuzziness is considered as an important criterion for dividing the instances x i into three groups, and then a group having low fuzziness FG low is incorporated into training set L. ℱ is retrained on the new training set and obtained results are simulated by the experimental verification. Their strategy is regarded as an approach to semi-supervised learning (SSL), where the unlabeled instances having certain magnitude of fuzziness participate in the learning process and provide better predictive ability on withheld set [5].
In their study, they conducted two different experiments to observe the relationship among the fuzziness. F of a classifier and the classification accuracy. Specifically, in first experiment, the relationship between correctly classified instances x
ci
and their fuzziness f
ci
is observed. For a set of labeled instances ℱ = ℱ(L)(i.e., Training of classifier). Computing a membership matrix of L by using ℱ: U
L
= μ
ij
×L. Computing a membership matrix of T: U
T
= (μ
ij
) C×T. Fuzziness of every instance f
i
in F (U
L
) and F (U
T
) is computed. Sorting is performed both in F (U
L
) and F (U
T
) based on Three groups i.e,. FG
Low
, FG
mid
and FG
High
are extracted based on (sorted) fuzziness values of the instances belonging to L and T. Obtained the accuracies on FG
Low
, FG
mid
and FG
High
that are extracted by using L and T.
Authors in [28] designed this experiment to observe the correct rate of classification for three groups, where they experimentally showed that the instances belong to FG
low
have higher classification rate than the FG
mid
and FG
high
. To further analysis and support this fact, they conducted a second experiment where some useful information i.e., how fuzziness intacts with the classified and misclassified instances x
mi
? is extracted. In their second experiment, above mentioned steps (i to iv) were common, but instead of the dividing the instances into three groups, they performed the following steps on L and T. Separate the correctly classified instances x
ci
along with their fuzziness values f
ci
both in L and T. Computing the average fuzziness of all correctly classified instances Computing the average fuzziness of misclassified instances
It is also found by the experimental verification in their studies that the instances having higher fuzziness values (i.e., that are grouped in FG high ) have greater chance of misclassification, because those instances are near to the classification boundary, while the instances having low fuzziness are far from the boundary. In both cases they repeated their experiments multiple times to check the effectiveness of randomization.
The steps (i.e., i to vi) in above mentioned experiment were the basic assumptions toward the fuzziness based D&C strategy, but the second experiment is conducted to figure out and prove the assumption that fuzziness has an essential impact on the classifier’s generalization ability (i.e., specially for the misclassified samples). It is also proved in their study that the risk of misclassification becomes higher as the fuzziness of instances increases in L.
Therefore, for the most classification problem, it is difficult to obtain the correct rate of classification for those instances that are having high fuzziness as compared to the low fuzziness instances. In other words, it is difficult the handle the boundary points compared to the points that are far from the boundary (i.e., inner boundary).
The assumption behind D&C strategy is to separately handling the instances that are included in FG low , FG mid and FG high categories and to incorporate the group with highest accuracy into L, which is an effective way to promote the classifier performance. The flow chart of D & C strategy proposed by Wang et al. [28] is depicted in Fig. 1

Flow chart of Experiment-1 reflecting fuzziness based categorization.
One can see the Fig. 1, where the FG low is incorporated into L and retraining is performed with new training set to achieve the better classification accuracy on withheld dataset T. From the literatures [4, 30], one can also find the studies related to the generalization capability (i.e., prediction) of a classifier that rely on the fuzziness. An important question arises: how to effectively divide the training or unlabeled instances into FG low , FG mid and FG high that can guarantee the distribution of useful instances for promoting the D&C strategy?. We present some state of the art mechanisms and provide a novel algorithm to divide the instances into three different categories based on their fuzziness magnitude. We also compare the impact of these categorization on the generalization ability by using a classifier called neural network with random weight (NNR w ), and observe their effectiveness with D&C strategy.
The rest of the paper is organized as follows. Section 2 discusses some IS methods that rely on the uncertainties. Section 3 presents the basic concept of fuzziness and also illustrates how fuzziness can be computed for a single layer feed forwarded network (SLFN) called neural network with random weight (NNR w ), whose output can be obtained like a fuzzy or membership vector. Section 4 presents some states of the art approaches for splitting the instances into three groups, and propose an algorithm for instance categorization. Experimental verification and results are presented in Section 5. Finally, Section 6 concludes this paper.
We consider instance selection as an essential process in the data reduction phase of knowledge discovery and data mining (KDD), whose aim is to reduce the amount of training instances from original data. In this process noisy instances may also be removed. Therefore, this process reduces the size of training set either by retaining the predictive ability or by improving the learning capability. According to [23], several instances participate in learning or training process, but some instances are irrelevant or not useful for classifying, hence, to achieve an acceptable learning performance, it is possible to ignore these non-useful instances, however, this process is considered as instance selection. In general, IS methods, that focus on improving the predictive performance of the classifier (apply after instance selection) are called edition techniques. Methods that contribute to the reduction of storage requirements are known as condensation algorithms. Some IS methods achieve both goals (i.e., generalization capability and storage reduction) simultaneously, they are called hybrid methods. In [6], authors described that the expectation to achieve the accuracy either equal or better than the original training set is not usually achieved, and still a certain loss of accuracy is inevitable. For the better selection of instances, many other methods have been proposed in the recent literatures. Authors in [13] proposed a novel instance selection mechanism called LAMIS which employs the hyperplane with a large symmetric margin. In their approach, the core of instance selection process is based on keeping the hyperplane that separates the two-class data to provide the large margin separation. LAMIS selects the most informative instances, satisfying both objectives i.e., high accuracy and reduction rates. In [33], an instance reduction method is proposed to speed up the instance selection process for the various instance selection-based multiple-instance learning algorithms. Their method is based on pairwise similarity between instances in a training bag, where the performance can be enhanced by improving the similarity between the instances that are necessary for learning. A detail review of instance selection methods is presented in [23].
Many instance selection methods that rely on uncertainties have been proposed in the scientific literatures, e.g., author in [3] proposed the first instance selection mechanism that queries an instance into the uncertainty area. The method proposed by [3] learns a concept by reducing the volume of uncertainty area using the required instance. This method evades to query those instances that exist in learned area, but it queries only those instances which reside in uncertainty area. This process increases the learning rate because it avoids to acquire the ineffective instances. Another method called Query by Committee (QBC) is proposed by Seung et al. [27] in 1992, which acquires the instances or examples according to the principle of maximal disagreement of the committee. This method is based on the observation that an instance with maximum disagreement is harder to classify. Therefore, this type of disagreement is considered as a type of uncertainty. Authors in [18] and [19] also proposed the uncertainty based instance selection strategy. Their strategies build a single classifier that could predict and provide the class label to an instance, and also a measurement of certainty. The uncertainty is associated with posterior probability using Bayes rule. It only selects the instance which is considered to be misclassified, because the class of the instance is unknown before asking to the domain experts. Wang et al. [29] introduced a new instance selection mechanism called maximum ambiguity based sample selection in fuzzy decision tree induction, where the instances are selected based on the principle of maximal classification ambiguity to select the instances with maximal evaluated ambiguity in fuzzy decision tree induction. It only selects the instance with maximal evaluation ambiguity when it has similarity with fuzzy decision tree.
Neural network with random weight (NNR w ) and fuzziness
Neural network with random weight (NNRw)
Schmidt et al. [26] are the first one, who earlier studied and investigated the significance of randomization on the generalization performance of SLFN. Authors in [26] experimentally demonstrated that SLFN can gain a better predictive performance by selecting the random weights that connect the input layer and hidden layer nodes, and by analytically computing the weights of output layer nodes. This was the first study regarding the non-iterative training of neural network (NN) using randomization concept. The researchers also concluded that, in SLFN, the weights of the output layer nodes are significantly most important than the weights found in the hidden layer nodes. Authors in [26] did not propose the name of their SLFN, therefore, in order to recognize their work, we use the neural network with random weights (NNR w ). An overview of the structure of NNR w is shown in Fig. 2.

A schematic overview of NNR w model.
The idea of randomization of hidden layer in NN has been proposed several times. Pao et al. [24] proposed a very similar model called random vector functional-link network (RVFLN) and its generalization performance was investigated in [15]. Authors in [31] used this approach for initializing the weights of NN before training it with back-propagation (BP). From the literature [2, 25], one can study that several ideas have been proposed for randomization that incorporate random hidden-layer weights and biases, and the direct connection between the input layer and the output layer. In NNR w , the parameters of hidden layer nodes (i.e, input weights and hidden layer biases) can be chosen randomly and the output weights between hidden and output layer can be analytically determined with Moore-Penrose generalized inverse. Similarly, authors in [7, 37], introduced NN with a randomly initialized hidden layer and trained using pseudo-inverse. NNR w provides better training speed, because unlike BP (i.e., gradient-descent based algorithm), NNR w does not require the iterative tunning process for parameters at hidden layer nodes, that overcomes the drawback of local minimal as in conventional gradient based algorithms. Conventional neural networks have great approximation capability but the behavior of those networks during the training process heavily depends on the training set. The classification boundaries generated by the NNs are often unpredictable in the presence of less amount of data. Many extensions related to NNs have been proposed in scientific literatures such as discrete-time stochastic neural network [17], polygonal fuzzy NN used to handle the polygonal fuzzy data [20] and weightnetworks [16].
NNRw algorithm. The key idea of NNR w is the random initialization of the hidden layer weights and the subsequent training consists of computing the least-squares solution to the linear system defined by the outputs of hidden layer and targets.
Consider a set of l distinct instances (x
i
, y
i
) with
In Equation (1),
Therefore, for a given dataset
The above Equation (2) can be compactly written as
A typical sigmoid function is presented in Fig. 3, which has the curve in two direction and resemblances to the English letter “S”. This function transforms an input value to an output ranging from 0 to 1. It is worth noting that this function only returns the positive values. If one needs the NN to return the negative values then this function will be unsuitable.

Sigmoid activation function.
The above mentioned (3) becomes a system of linear equations, which in most cases can be transferred to a regular system of linear equations.
Suppose that
In Equation (6), H† denotes the pseudo-inverse. The (NNR w ) algorithm is summarized in Algorithm 1.
1:
2: Hidden node output function g (
3: Number of hidden nodes
4: Weight Matrix
5: Randomly select the input parameters
6: Compute the hidden layer output matrix
7: By using Eq. (6), calculate the output weight
The proposed solution to the equation Minimum training error can be achieved due to provision of the least-squares solutions. It is considered as the solution with the smallest norm among the least-squares solutions The smallest norm solution among the least-squares solutions is unique for a
The strength of the NNR w is the fact that there is no need to iteratively tune of the randomly initialized weights as compared to BP. Due to this advantage (i.e., non-iterative) this process makes its learning speed extremely fast.
The theory of fuzzy set is introduces by Zadeh [34] and it relates to classes of objects with un-sharp or unclear boundaries where the membership degree is a significant matter of interest. Conventional crisp sets contain values that only satisfy precise or compact characteristics required for membership.
Suppose a set S consists of values from 1 to 5 is a representation of a crisp set then it can be expressed as
The above membership function will produce a value 1, if and only if S ∈ n. Otherwise, it will produce a value 0. A graph of membership function for crisp set S is depicted in Fig. 4 One can visualize the above Fig. 4, that shows a clear distinction between the elements either these belong to the set or not. It is worth noted that there is a problem in defining the numbers between 6 to 10, which are also crisp. A similar situation comes to our daily life. For example, in deciding, the person is either tall or short. Regarding the conventional phenomenon or logic, there must be need to define a height threshold that differentiates a tall person to the smaller one. For example, A person is taller, if its height is greater than the threshold (i.e., 6 feet) otherwise that person is shorter. However, this mechanism obviously can not reflect the actual judgment based on human thinking. So, it can be better described as soft switching mechanism rather than threshold. This may lead us to often add some modifiers to the word taller or shorter (i.e., very, more, not, not very, somehow etc.) in order to express the degree of height rather than absolute value i.e., either True or False.

Membership function of crisp set S.
A fuzzy set, enables us to make decision according to human thinking. Therefore, in a fuzzy set, “height of a person”, degree related the height is defined that provides us a continous transition rather than a sharp transition (i.e., from true to false).
The term fuzziness refers to the unclear boundary between two linguistic variables and firstly proposed by Zadeh [34] with the proposed concept of fuzzy set. According to Dubois and Prade [11], A fuzzy set is a set that contains the elements with varying the membership quantity or values. The elements of a fuzzy set are mapped to a universe of membership values by using a function (i.e, membership function), and it maps the elements of objects of a fuzzy set into the real values in the interval of [0, 1] [1]. Fuzzy set is different from classical crisp set where the elements in a set have full membership which means that membership value must be equal to 1.
Zadeh in [35] also generalized the probability measure of an event to a fuzzy event and suggested using entropy in information theory to interpret the uncertainty associated with the fuzzy event [30]. Authors in [9], considered fuzziness to be a type of uncertainty and defined fuzzy entropy which is based on Shannon’s entropy function, and also introduced the set of properties for which a fuzziness should satisfy. These properties of fuzziness have widely accepted and become a criteria for defining the fuzziness [1]. Those properties also depict that a fuzziness degree attains itsmaximum when the membership degree of every element is equal and minimum when every elements either belongs to fuzzy set or absolutely not.
We consider fuzziness as a type of cognitive uncertainty, coming from the transition of uncertainty from one linguistic variable to another. where a linguistic variable is a fuzzy set and defined in a certain universe of discourse. Let S = {μ1, μ2, ⋯ , μ
n
} be a fuzzy set then the fuzziness of S can be defined as
In Equation (8), μ i represents the membership function and K is a constant and equal to (1/n).
The fuzziness of fuzzy set defined by Equation (8) attains its maximum when the membership degree of every element is μ i = 0.5 for every i = 1, 2, ⋯ , n and minimum when every element belongs to the fuzzy set or absolutely not for every μ i = 0 or μ i = 1, i (1 ≤ i ≤ n).
Now we associate fuzziness with the output of a classifier. It is well found that many classifiers have the output in the form of fuzzy vector, where each component of vector corresponds to the membership degree of the testing instances belonging to a class. These types of classifiers include artificial neural network (ANN) [12, 22], support vector machines (SVMs) [8, 32], fuzzy decision tree [29] and etc. It is important to mention here that some classifiers, such as NN, a simple transformation can transfer the initial output to a form a fuzzy vector if the components of the initial output are not in the interval of [0, 1].
Therefore, for a given set of training instances
In Equation (9), (μ
ij
) represents the jth instances of x belonging to ith class. The elements in the membership matrix have to obey the following properties
Therefore, once the training process (of a classifier) completes, U upon n can be obtained. For the jth instance of x, the output of a trained classifier is represented as a fuzzy set i.e.,
Based on the Equation (8), the fuzziness of a training classifier on x
j
can be depicted as Equation (10)
Once the training process is completed, we can easily obtain the fuzziness of learned classifier. Let the membership vector U of a classifier on n training samples with C classes be U = (μ
ij
) C×n, the fuzziness associated with classifier is given in Equation (11)
Above Equation (11) illustrates the fuzziness of a trained classifier that has output analogous to a fuzzy vector. (11) also plays an essential role for investigating the classifier’s generalization based on fuzziness. This equation actually represents the average fuzziness of classifier’s output on all training instances or we can say that it is the training fuzziness of the classifier.
The most appropriate representation of classifier fuzziness must be the average fuzziness of entire instance space that includes both the training and testing instances. However the fuzziness of testing samples is generally unknown and for any supervised learning, there is a well acknowledged assumption, that is, the training samples have a distribution identical to the distribution of samples in the entire space. It indicates the reason-ability that we use Equation (11) for the classifier’s fuzziness.
How to effectively categorize the instances into low, mid and high fuzziness group is a major concern of this research studies. The formation of groups (or categories) in fact depends on the situation of a specific problem, where different approaches can be utilized about how and what threshold has to set for dividing the instance into 3 categories.
We illustrate two state of the art techniques and present an algorithm that can be used to categorize the instances into 3 categories based on the fuzziness quantity. We illustrate them briefly.
Percentage split (P-split)
In this technique, a vector V = {v1, v2, ⋯ , v
n
}, that represents the sorted fuzziness values of respective instances X = {x1, x2,. . . . x
n
} in ascending order. where n represents the total number of samples and v
i
is the value (i.e., fuzziness) corresponds to the i
th
instance. This is a very simple mechanism, which actually focus the proportion of percentage of those instances that are having lower fuzziness values and higher fuzziness values. For example, if there are 5000 instances in a data set and after obtaining a V, we assign a lp value and hp value for the FG
low
and FG
high
. if V is a sorted list and we assign 30% instance for low category groups and 20% for high category groups than it calculates the total instances in each category as (12).
Hence the FG
low
will contain the instances {x1, x2, ⋯ , x(n
low
)}. Similarly, the total amount of elements in FG
high
can be computed as (13).
Based on the total number of elements in high groups we obtain the instances that belong to FG high as FG high = {x(n-n high +1),. . , x n }. All remaining elements will be included in FG mid category.
This is also a simple mechanism, that distributes the entire set into 3-parts. For example, if there are n instances. It will assign the equal distribution of instance to every group by using (14).
Hence by using (14), the instances for each group can be extracted as (16)
The instances must be sorted based on their fuzziness quantity in ascending order before applying the N-split criteria.
In this technique we present an algorithm that will auto extract the instances for FG
low
, FG
mid
and FG
high
categories based on their fuzziness values. We will use the measure i.e., median to illustrate this algorithm called M-split. Median is the middle value in the set of elements. To compute the median value, the vector V must be arranged in the ascending order first, there are two ways to compute the median depending on the amount of elements in a set. If the total number of elements n in a set V are even then the median M
v
can be computed as (16).
If the total number of elements n in a set V are odd then the median M
v
can be obtained by using (17).
The illustration of M-split is depicted in Algorithm 2, the algorithm first finds the median of all values [1, 2, 3, ⋯ , n] in a set and place the obtained value into q1. Now we have 2 ranges of values (i.e., 2 sets); one is from first element to q1th element and the other is q1th + 1 to nth element. Again we find the median values of both sets and keep this value in q1 and q2 respectively. At this stage we create 3 groups i.e., FG
low
, FG
mid
andFG
high
and place the elements in these groups as mentioned in (19). We can also use mean instead of median, but in this study, we are only categorizing the instance based on median.
1: Set of sorted instances
2: FG low , FG mid , FG high {i.e., instances in low, mid and high categories}
3: initialize q1 = q2 = 0
4:M1 = M v (V)
5:
6: q1 = M v (V < M1)
7: (q1 = 0)
8: low M = q1
9:
10: q2 = M v (V > q1)
11: high M = q2
12:
13:
14: q2 = M v (V > M1)
15: V1 = V < m and V ≥ q1)
16: low M > m and V ≥ q2)
17: high M = M v (V1)
18:
19:
20:
21: low M = M1;
22:
23: q2 = M v (V > M1)
24: high M = q2
25:
26:
27: high M = M v (V > q2)
28:
29:
30:
31:
32: FG low = x i
33:
34: FG mid = x i
35:
36: FG high = x i
37:
38:
The splitting techniques are applied on 10 benchmark data sets, that are taken from UCI machine learning repository [21] to experimentally acquire the statistical relation between instances obtained by proposed fuzziness based categorization mechanism and the their correct rate of classification. The data sets which are selected (during the experiments) belong to a wide variety of classification problems, where the number of instances, their classes and types of features differ. The detail of these data sets is summarized in Table 1.
List of data sets acquired for experimentation
List of data sets acquired for experimentation
For the purpose of experimental design, we use 10-fold cross validation technique to evaluate the performance of all splitting approaches. We performed necessary normalizing between [0, 1]. For the NNR w , the initial interval for the random parameters is set to [0, 1], and the number hidden nodes are selected as 60 corresponding to each dataset. We applied all three mechanisms for splitting the instances into 3 groups during the experimental process. Table 2 lists the results obtained by using 10-times 10-fold cross validation scheme, where average is taken to demonstrate the effectiveness of sampling method.
Experimental Results using different splitting methods for instance categorization
For further analysis of our experiments, we take a typical case of one data set (i.e., automobile), We vary the number of hidden nodes from 10 to 100, and check the impact of categorization by using 3 splitting approaches. One can see in Fig. 5, where the proposed splitting approach gains more accuracy than the others. For the NNR
w
, we also analyze the impact of different initialization intervals on the overall performance of our proposed splitting approaches. The initialization interval is set to [0, λ], 1 ≤ λ] ≤10 during the simulation. The input weights

Comparison of splitting methods (Testing accuracy).

Impact of different initialization interval (Testing accuracy).
One can analyze the Figs. 5 and 6, that all splitting methods are improving the accuracy rate, where FG low is extracted after applying each splitting method and incorporating into original training set L. In-fact, the division of instances into 3 groups depends on the nature of a problem, and the amount of instances that the data sets hold. Suppose in a case, we have limited fuzziness values then M-split criteria may fail to produce the desirable results, hence, we further need to judge the impact of fuzziness values. The other two types of splitting can be effectively utilized in such type of situation. Same is the case with penbased and yeast data sets, where the M-split criteria achieves slightly low or equivalent results than N-split and P-split dividing approaches. M-split criteria is useful in the case, when most of the fuzziness values are similar, for example, in the case of 100 instances, if 50 instances are outputting the same fuzziness values i.e., 0, then the M-split criteria will place all similar values in one category, and other values will be further divide into two groups. The main step in D&C strategy was to incorporate that group with L, that are having high accuracy. All splitting methods are following this phenomenon, we have listed the accuracy obtained by FG low , FG mid and FG high in Table 2, where one can see the accuracies obtained by 3 groups by using different splitting approaches. By using any splitting method, FG low is getting higher accuracy than FG mid and FG high , and for many data sets M-split criteria is reflecting higher accuracy rate on withheld set than the original testing accuracy. Table 2 and both Figs. 5 and 6 prove the effectiveness of M-Split criteria.
For the so called problem of big-data, selection of M-split criteria may be much useful than other criteria, because N-split or P-split may extract much misclassified instances for FG low . During this study, we have also observed some interesting findings by adding two groups collectively along with L, we will present this part in our next study.
Fuzziness based divide and conquer strategy is an effective and useful strategy for promoting the classifier’s performance. However, the critical problem is to categorize the instances according to their fuzzy values into 3 groups. We have investigated different state of the art methods for categorization, and proposed an efficient mechanism that can effectively extract the instances for low, mid and high fuzziness categories. We used NNR w to obtain the membership vector corresponding to each data point by using simple transformation. Other classifiers i.e., Fuzzy k-nearest neighbor, support vector machine (SVM) or BP can also be utilized to compute the fuzziness for data points. It is observed by the experimental simulation that proposed splitting algorithm provides better solution for placement the data points into respective categories, and also promotes the efficacy of D&C strategy. This technique of categorization can be much helpful for the so called problem of big-data. Experimental results have shown its effectiveness.
Footnotes
Acknowledgments
This work is supported by the National Natural Science Foundation of China 71371063, the Basic Research Project of Knowledge Innovation Program in Shenzhen (JCYJ20150324140036825).
