A novel privacy-preserving probability transductive classifier from group probabilities based on a regression model

Abstract

Group probability classifier learning is an emerging and promising learning technique, especially in privacy-preserving data mining. It trains a classifier from a group probability dataset, in which the class label of each sample is unknown but the probability of each class within given groups of the data is available. Existing work relies mainly on the inverse calibration (IC) strategy to estimate labels for the data in the group probability dataset and then uses classical classification algorithms, such as the support vector machine (SVM), to train the desired classifier. A critical challenge for the existing IC-based methods is that an ideal IC function for label estimation is difficult to design, and the methods are sensitive to the IC function adopted. To overcome this shortcoming, a novel probability transductive classifier that does not involve IC in the learning procedure is proposed, in which the probability values are used directly as the outputs of the training data. Because the outputs of the training data are then continuous real values, existing classical regression models can readily be adopted to model the group probability classification problem. For a future test sample, the output of the trained model gives the probability that the sample belongs to the positive class. With a given threshold, the final class label of the test sample is obtained for the classification task. Experimental results on synthetic datasets and real UCI datasets show that the proposed method is more effective than the existing methods.

Keywords

Privacy preserving, regression model, probability transductive, group probability classification

1 Introduction

With increasing attention to big data and the rapid development of data mining, the preservation of data privacy is becoming an important concern with significant impact on society. For example, the labels of data collected for purposes such as political elections, fraud detection or spam filtering should not be disclosed [1–4]. Group probability data are a kind of data that deserves special attention with respect to privacy preservation. Data collected in political elections are a typical example [2]. In an election, the result of each individual vote is not disclosed publicly, but the distribution of the votes in individual regions is made available. The income, social class and other socio-economic status of the voters in the regions can then provide group probability data that are closely related to the final voting results and directly influence the regional distribution of votes. A similar situation exists in spam identification [3], where malicious emails can be classified based on group probability data. To detect spam, it is important to establish a collection of spam emails, which is costly to achieve. One approach to data collection is to look into the emails in the inboxes of the recipients. Spam datasets can be acquired by specifying emails as spam bait. In addition to labeling suspicious emails as spam, it is also possible to obtain spam datasets relatively inexpensively by estimating the proportions of spam and non-spam in an inbox, which typically contains a mix of both. However, collecting spam data by examining the emails of users intrudes on their privacy. The research presented in this paper is thus motivated by the need to preserve privacy in group probability datasets while extracting useful information by data mining.

In group probability datasets, the class labels of the data are unknown whereas the probabilities of each class in the given groups of the whole dataset are known. Training an effective classifier on group probability datasets for future prediction is a challenging machine learning task. In particular, the characteristics of group probability datasets preclude the direct use of classical learning methods. Thus, new machine learning methods applicable to group probability datasets are in demand, and research along this direction is underway. For example, the Platt-model-based inverse calibration (IC) technique [5–8] has been introduced to develop a support vector machine (SVM) for handling group probability datasets [9–11], leading to the IC-SVM algorithm [2]. In this algorithm, the IC function is used to obtain labels for the data from the class probability. With the labeled data, classical learning methods are then used to train the SVM classifier. Experimental results show that IC-SVM demonstrates good classification performance and is suitable for group probability data. However, there are still many unresolved issues. A critical challenge for the existing IC-based methods is that an ideal IC function for label estimation is difficult to design, and the methods are sensitive to the IC function adopted.

In order to overcome this difficulty, a novel probability transductive classifier that does not require IC in the learning procedure is proposed in this paper. First, a new framework for probability transductive classifier design is developed, in which the probability values are used directly as the outputs of the training data. In the binary group probability classification problem, the probability of a sample belonging to the positive class is used directly as the output value of the training sample. With these outputs, classical regression modeling methods, e.g. support vector regression (SVR) [12–14], fuzzy systems [15–18] and neural networks [19–22], can readily be adopted to train the group probability classifier. For a future test sample, the output of the trained group probability classification model provides the probability that the sample belongs to the positive class. Then, with a given threshold, the explicit class label is obtained for the classification task. The work conducted to develop the proposed method is summarized as follows.

First, a framework for group probability classifier design without requiring IC is proposed. The method reduces the sensitivity of conventional group probability classifier training methods to the IC functions.

With the proposed framework, classical regression modeling methods are used in the model training stage to train the group probability classifiers by directly using the group probability datasets as the regression datasets.

In the model testing stage, a class label calibration method based on a probability threshold is proposed for the binary classification problem.

The rest of this paper is organized as follows. In Section 2, the related work on group probability classification is briefly reviewed. In Section 3, a novel probability transductive classifier estimation framework is proposed and a regression-model-based training algorithm is presented. Section 4 presents and discusses the experimental results on both synthetic and real-world datasets. Finally, the conclusions of the study are given in Section 5.

2 Related work

This section gives a brief introduction of the work related to group probability classification, with focus on the IC technique and the IC-SVM classification method.

2.1 Group probability: problem formulation

Consider a binary classification dataset X = {x_i, i = 1, …, N}, where x_i ∈ R^d is a data point, N is the number of data points, and the labels of all the data points are unknown. Suppose the data are divided into K groups, denoted by S_k = {x_i,k, i = 1, …, N_k}, k = 1, …, K, and that the group probability p_k of each group is known, where p_k is the probability of positive class data in group S_k and N_k is the number of data points in that group. Because only group-level probabilities are available, such a dataset usually has good privacy-preserving properties. However, it is non-trivial to apply conventional classification methods to datasets of this kind and train effective classifier models.
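To make the formulation concrete, the following minimal Python sketch shows how a group probability dataset could be derived from a labeled dataset before the labels are withheld. The random round-robin grouping and the function name `make_group_probabilities` are assumptions made for illustration; the formulation above does not prescribe how groups are formed.

```python
import random

def make_group_probabilities(labels, num_groups, seed=0):
    """Partition sample indices into groups and compute each group's p_k.

    Assumes labels are +1 / -1 and num_groups <= len(labels), so that no
    group is empty. The random grouping is an illustrative assumption.
    """
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin: group sizes differ by at most one.
    groups = [indices[g::num_groups] for g in range(num_groups)]
    # p_k is the fraction of positive-class samples in group S_k.
    probs = [sum(1 for i in members if labels[i] == 1) / len(members)
             for members in groups]
    return groups, probs

# Hypothetical toy labels: 6 positive and 4 negative samples.
labels = [1, 1, 1, 1, 1, 1, -1, -1, -1, -1]
groups, probs = make_group_probabilities(labels, num_groups=2)
```

Only `groups` (as sets of unlabeled points) and `probs` would be released to the learner; the individual entries of `labels` stay private.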

2.2 IC-SVM

To deal with the difficulty of using conventional classification models directly on group probability datasets, an SVM classification method based on the IC technique has been proposed [2]. In this method, the group probability data are first labeled using IC with the classical Platt model. An SVM is then trained on the labeled data.

A sigmoid function is employed in the Platt-model-based IC technique to obtain the calibration function (probability distribution function) Θ(f(X)) as follows: $\Theta(f(X)) = \frac{1}{1 + \exp(-A f(X) + B)}.$ (1)

In this function, the parameters A and B are obtained by gradient descent, minimizing the cross-entropy. Theoretical study has shown that the final probability distribution function satisfies Θ(f(X)) ≈ P(Y = 1|X). To simplify the calculation, setting A = 1 and B = 0 reduces Equation (1) to $p(y = 1 \mid X) = \Theta(y) = \frac{1}{1 + \exp(-y)}.$ (2)

Inverting Equation (2) gives $y = \Theta^{-1}(p) = -\log(p^{-1} - 1),$ (3) where p is the positive class probability. With Equation (3), different strategies can be used to obtain labeled data for supervised learning. Based on the IC technique and the conventional SVM, the IC-SVM method is proposed for the group probability classification problem. The principle of IC-SVM is shown schematically in Fig. 1.
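As a small numerical illustration of Equations (2) and (3), the sketch below implements the simplified calibration function and its inverse in Python. The function names and the clipping constant `eps` are assumptions of this sketch; clipping is added only because the inverse is undefined at p = 0 and p = 1.

```python
import math

def calibrate(y):
    """Equation (2): map a real-valued score y to a probability."""
    return 1.0 / (1.0 + math.exp(-y))

def inverse_calibrate(p, eps=1e-6):
    """Equation (3): map a positive-class probability p back to a score.

    p is clipped to [eps, 1 - eps] (an assumption of this sketch) so that
    group probabilities of exactly 0 or 1 still give a finite score.
    """
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(1.0 / p - 1.0)

# Round trip: a group probability of 0.3 maps to a score and back again.
score = inverse_calibrate(0.3)
recovered = calibrate(score)
```

A probability above 0.5 yields a positive score and a probability below 0.5 a negative one, which is how the IC step turns group probabilities into targets for a conventional learner.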

3 Probability transductive group probability classification

Although IC-based group probability classification methods, such as IC-SVM, are more effective than the conventional methods, many issues remain to be solved. A critical challenge is that it is difficult to identify an appropriate IC function that suits different complicated group probability classification tasks. In this section, a novel group probability classification framework, called probability transductive group probability classification (PT-GPC), is proposed to remove the need for the IC procedure. Since it uses a probability transductive strategy, it is simpler and more effective than the IC-based methods. The advantages are discussed in the paper and demonstrated with experimental results.

3.1 Overall framework

The overall framework of the proposed PT-GPC method is shown in Fig. 2. It involves three stages:

In the first stage, the training dataset of probability outputs is constructed, where the probability of each group in a group probability dataset is directly used to construct the dataset containing the corresponding input-output data pairs with probability outputs.

In the second stage, by taking the constructed training datasets as regression datasets with probability outputs, the group probability classifier is trained using a classical regression modeling method, such as a fuzzy system or a neural network.

In the third stage, the probability output is used to determine the class label of a test sample.

3.2 Implementation details

(a) Construction of probability datasets

In IC-based methods, different labeled datasets can be constructed for classifier training by combining IC with a construction strategy. With reference to the strategies in [2], consider an input dataset X = {x_i, i = 1, …, N}, x_i ∈ R^d, containing K groups S_k = {x_i,k, i = 1, …, N_k}, x_i,k ∈ R^d, where p_k is the probability that the data in group S_k belong to the positive class. The probability output regression datasets can then be obtained using the two strategies below.

1) Concrete strategy: for a given binary group probability dataset X = {x_i, i = 1, …, N}, x_i ∈ R^d, containing K groups denoted by S_k and p_k, the constructed regression dataset with probability outputs can be described as D_k = [S_k, Y_k], where the input and output datasets are given by S_k = {x_i,k, i = 1, …, N_k}, x_i,k ∈ R^d, and Y_k = {y_i,k, i = 1, …, N_k}, y_i,k = p_k ∈ [0, 1], respectively. The strategy is shown schematically in Fig. 3.

In the concrete strategy, every sample x_i,k has the probability output p_k. All samples in the same group, say the kth group, therefore share the same probability output.

2) Abstraction strategy: for a given binary group probability dataset X = {x_i, i = 1, …, N}, x_i ∈ R^d, containing K groups denoted by S_k and p_k, the constructed regression dataset with probability outputs can be described as $D_{k} = [\bar{S}_{k}, Y_{k}]$, with $\bar{S}_{k} = \bar{x}_{k}$, $\bar{x}_{k} = \frac{1}{|S_{k}|} \sum_{i = 1}^{|S_{k}|} x_{i,k}$ and Y_k = p_k, p_k ∈ [0, 1]. The strategy is illustrated in Fig. 4.

In the abstraction strategy, a new sample $\bar{x}_{k}$ is constructed to represent all the samples in the same group, and every new sample $\bar{x}_{k}$ has the corresponding probability output p_k.
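The two construction strategies can be sketched in a few lines of Python. This is an illustrative sketch rather than the implementation used in the experiments; the function names and the toy data are hypothetical, and each group is assumed to be a non-empty list of equal-length feature vectors.

```python
def concrete_dataset(groups, probs):
    """Concrete strategy: every sample x_{i,k} keeps its own input vector
    and takes its group probability p_k as the regression target."""
    inputs, targets = [], []
    for samples, p_k in zip(groups, probs):
        for x in samples:
            inputs.append(list(x))
            targets.append(p_k)
    return inputs, targets

def abstraction_dataset(groups, probs):
    """Abstraction strategy: each group S_k is replaced by its mean vector,
    paired with the group probability p_k."""
    inputs, targets = [], []
    for samples, p_k in zip(groups, probs):
        dim = len(samples[0])
        mean = [sum(x[j] for x in samples) / len(samples) for j in range(dim)]
        inputs.append(mean)
        targets.append(p_k)
    return inputs, targets

# Hypothetical toy data: two groups of 2-D points with known p_k.
groups = [[(0.0, 0.0), (2.0, 4.0)],
          [(1.0, 1.0), (3.0, 1.0), (5.0, 1.0)]]
probs = [0.5, 1.0 / 3.0]
X_c, y_c = concrete_dataset(groups, probs)     # one row per sample
X_a, y_a = abstraction_dataset(groups, probs)  # one row per group
```

The concrete strategy yields N training pairs, whereas the abstraction strategy yields only K, which is why the latter discards more information when the number of groups is small.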

(b) Learning with regression technique

With the constructed regression datasets described above, different regression modeling techniques, such as support vector regression (SVR) [12 –14], fuzzy systems [15 –18] and neural networks [19 –22], can be adopted directly to train a regression model with the probability output.

Since the model trained with the regression technique only provides a real output denoting the probability of the positive class, it is necessary to label the test data by using the probability output. The following strategy can be used for this purpose, $\mathrm{label}(x) = \begin{cases} +1, & \text{if } f(y = 1 \mid x) \geq \Theta \\ -1, & \text{if } f(y = 1 \mid x) < \Theta \end{cases}$ (4) where Θ is a predefined threshold, with Θ = 0.5 in general.
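The labeling rule of Equation (4) amounts to a single comparison, as the short Python sketch below shows. The function name `to_label` is hypothetical. Note that an unconstrained regression model may return values slightly outside [0, 1]; the threshold rule still applies to such outputs.

```python
def to_label(prob, threshold=0.5):
    """Equation (4): +1 if the model output reaches the threshold, else -1."""
    return 1 if prob >= threshold else -1

# Model outputs for three hypothetical test samples.
labels = [to_label(p) for p in (0.1, 0.5, 0.93)]
```

Raising the threshold above 0.5 makes the classifier more conservative about assigning the positive class.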

3.3 The PT-GPC algorithm

Based on the PT-GPC framework and the implementation details discussed above, the algorithm of PT-GPC is described below.

PT-GPC algorithm:

Construct the corresponding probability output regression datasets with the group probability dataset by using the concrete strategy or the abstraction strategy.

Train a group probability classification model on the constructed regression dataset using a classical regression modeling method.

Transform the probability output of the test data to the discrete labels.
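The three steps above can be sketched end to end in Python. To keep the example self-contained, a k-nearest-neighbour regressor is used as a stand-in for the regression model; this is an assumption of the sketch and not one of the models (SVR, fuzzy system, neural network) considered in this paper. The function names and toy data are likewise hypothetical.

```python
import math

def knn_regress(train_x, train_y, query, k=3):
    """Stand-in regressor: mean target of the k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def pt_gpc_predict(groups, probs, test_x, k=3, threshold=0.5):
    # Step 1: concrete strategy, each sample inherits its group probability.
    train_x = [x for samples in groups for x in samples]
    train_y = [p for samples, p in zip(groups, probs) for _ in samples]
    # Step 2: for a k-NN regressor, "training" is just storing the data;
    # the regression output estimates the positive-class probability.
    scores = [knn_regress(train_x, train_y, x, k) for x in test_x]
    # Step 3: threshold the probability output to obtain discrete labels.
    return [1 if s >= threshold else -1 for s in scores]

# Hypothetical toy groups of 2-D points with their group probabilities.
groups = [
    [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3)],  # all negative, p = 0.0
    [(1.0, 1.0), (0.9, 1.2), (1.1, 0.8)],  # all positive, p = 1.0
    [(0.1, 0.1), (1.0, 0.9)],              # mixed group,  p = 0.5
]
probs = [0.0, 1.0, 0.5]
predictions = pt_gpc_predict(groups, probs, [(0.05, 0.05), (1.05, 1.0)])
```

No inverse calibration function appears anywhere in the pipeline: the group probabilities serve directly as regression targets, and any regressor with a fit-and-predict interface could replace `knn_regress`.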

Remark: The proposed method resembles semi-supervised learning. In the sense that both techniques make effective use of additional information in the learning procedure, the group probability based method can be regarded as a semi-supervised learning method.

4 Experiment

4.1 Settings

1) Datasets: In order to validate the effectiveness of the proposed PT-GPC method, the classical two-moon synthetic dataset [12] and real datasets from the UCI repository are used for experimental analysis. In all the experiments, two-thirds of the samples are taken as the training set, and the remaining one-third are used as the testing set. Details of the datasets used in the experiments are given in Table 1.

2) Methods for performance comparison: The proposed method is compared with a classical IC-based method. Specifically, the two strategies in Section 3.2(a) are adopted to construct the labeled datasets for the IC-based method and the probability output regression datasets for the proposed method. Meanwhile, in order to evaluate the effectiveness of the proposed method comprehensively, models of various classical types, including kernel methods, fuzzy systems and neural networks, are trained on the probability output regression datasets. Three classical regression models, namely SVR [12–14] (kernel method), L2-TSK [15–18, 24] (fuzzy system) and the RBF neural network [19–22] (neural network), are adopted for the group probability classifier based on the constructed datasets. In the experiments, the performance of the algorithms is evaluated and compared in terms of classification accuracy. The same experimental procedure is repeated 10 times at each setting to compute the mean and standard deviation of the classification results. For each algorithm, the best classification accuracy obtained with the optimal parameter setting is recorded for comparison. The bold entries in Tables 3–7 denote the best results obtained by the adopted methods on the given dataset.

3) Parameter setting: In the experiments, 5-fold cross-validation [13] is adopted to determine the optimal values of the model parameters, with the values chosen from the ranges given in Table 2 for the different models.

4) Experimental environment: The experiments are performed on a computer with an Intel Core i5-3317U CPU (1.7 GHz) and 16 GB RAM, using MATLAB as the programming environment.
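The repeated hold-out protocol described in the settings above can be sketched as follows. The sketch is in Python for illustration only (the experiments themselves were run in MATLAB), and several details are assumptions: the splits are taken to be random, the sample standard deviation is used, and `evaluate` is a hypothetical callback that trains on the first argument and returns the accuracy on the second.

```python
import random
import statistics

def repeated_holdout(samples, evaluate, repeats=10, train_fraction=2 / 3, seed=0):
    """Run `evaluate(train, test)` on several random splits and return the
    mean and sample standard deviation of the returned accuracies."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        # Two-thirds of the samples for training, the rest for testing.
        cut = round(len(shuffled) * train_fraction)
        accuracies.append(evaluate(shuffled[:cut], shuffled[cut:]))
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```

In use, `evaluate` would wrap the dataset construction, model training and thresholding steps of PT-GPC, with the model parameters chosen by cross-validation on the training part only.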

4.2 Performance comparison on classification accuracy

(a) Synthetic dataset

The two-moon synthetic dataset is shown in Fig. 5. The dataset consists of 600 two-dimensional data points, with equal proportions in the positive and negative classes. In order to fully examine the performance of the proposed method and the IC-based method, the dataset is partitioned into different numbers of groups, with the number of groups set to 5, 10, 20, 30, 50, 100 and 150. For the groups obtained in each partitioning, the group probabilities are provided to form the group probability datasets.

From the experimental results obtained on the two-moon group probability datasets in Table 3, the following conclusions can be drawn.

1) When the concrete strategy is adopted and the number of groups is small, the proposed PT-GPC method performs better than the IC-based method, which indicates that the PT-GPC method has stronger adaptability. In particular, when there are only a few groups, i.e., the available information is very limited, the advantage of the PT-GPC method is more pronounced. When a larger number of groups is available, the real probability distribution of the data becomes more evident and the information is more sufficient, so the performance of the PT-GPC method and the IC-based method tends to be similar. The results of the two methods even become identical when there are 150 groups.

2) When the abstraction strategy is adopted and there are only a few groups, the information loss is more severe due to the higher degree of data abstraction of the strategy. Consequently, the experiments show that the classification performance of both methods is unsatisfactory. However, as the number of groups increases, the performance of the proposed PT-GPC method improves more markedly than that of the IC-based method. When the number of groups is large enough, the two methods have the same performance as the information available becomes sufficient.

In summary, since the proposed method is not influenced by the IC procedure, it shows more promising performance than the classical IC-based method.

(b) Real world applications

In real world applications, data privacy protection problems, for example concerning credit card information in banks, political election data and health data, are receiving more and more attention. In this section, experiments performed on four real world datasets from the UCI database, namely (i) the Australian credit card dataset, (ii) the breast cancer dataset, (iii) the heart dataset and (iv) the 1984 United States election dataset, are presented to demonstrate the performance of the proposed PT-GPC method in the fields of finance, medicine and political elections.

(i) Financial dataset

In the financial field, especially in credit card services, privacy protection is of utmost importance to clients. Group probability can be leveraged to protect these data. In this study, the Australian credit card dataset of the UCI database is used for performance evaluation. The dataset contains 690 data samples with 14 attributes. The results of the experiments performed on the credit card dataset are shown in Table 4.

(ii) Medical dataset

Protection of medical data privacy is important to patients. For example, in the field of medicine, cancer data and heart data are generally not disclosed directly in any specific form. The abstraction strategy of group probabilities is thus a good choice for effective protection of data privacy when some typical data have to be published. The proposed PT-GPC method is therefore tested on such datasets to evaluate its potential for medical applications. Here, the Breast Cancer dataset (699 data samples, 10 attributes) and the Heart dataset (270 data samples, 13 attributes) of the UCI database are adopted. Details of the experimental results are shown in Tables 5–6.

(iii) Election dataset

Privacy protection is also a matter of concern in political elections, as discussed at the beginning of the paper. Group probability samples of the data can provide clues for predicting the results of the next election. Therefore, it is of practical significance to study the performance of the proposed PT-GPC method on such a dataset. To this end, the 1984 United States election dataset of the UCI database is used to evaluate the performance of the algorithms. The dataset was collected from the real voting data of the 1984 U.S. presidential election. The experimental results are shown in Table 7.

(iv) Discussions

It can be seen from the experiments performed on the four real datasets that the results are similar to those obtained from the synthetic datasets. The findings further indicate the strong generalization capability of the proposed probability transductive classification method, particularly when the number of groups is small. The classification performance of the proposed method is comparable to that of the IC-based methods when the number of groups is large. The number of groups is clearly a critical parameter determining the extent of data privacy protection: the smaller the number of groups, the less information is disclosed and the stronger the privacy protection. This observation further supports that the proposed method is more appropriate than the conventional methods for handling group probability scenarios.

5 Conclusions

A novel probability transductive group probability classification method is proposed in this study in order to overcome the shortcoming of the classical IC-based methods, i.e. the performance is sensitive to the IC function. The proposed PT-GPC method can directly construct the probability output regression dataset for model training without undergoing the IC procedure. The effectiveness of the PT-GPC method has been validated by experiments performed on both synthetic and real datasets.

Although the proposed method has demonstrated promising performance, there is room for further investigation and improvement. For example, future work can aim to achieve effective group probability classifier learning even when the number of groups is small. To this end, a transfer learning strategy will be explored in which historical knowledge is leveraged, which is expected to compensate for the deficiency caused by the availability of only a small number of groups.

Acknowledgments

This work was supported in part by the General Research Fund of the Hong Kong Research Grants Council (PolyU5134/12E), the National Natural Science Foundation of China (61170122, 61272210), the Natural Science Foundation of Zhejiang Province (LY13F020011), the Ministry of Education Program for New Century Excellent Talents (NCET-120882), the Fundamental Research Funds for the Central Universities (JUSRP51321B), the Outstanding Youth Fund of Jiangsu Province (BK20140001), and the Natural Science Foundation of Jiangsu Province (BK20130155).
