Collaborative ensemble learning under differential privacy

Abstract

Ensemble learning plays an important role in big data analysis. A major limitation is that multiple parties cannot share the knowledge extracted from their ensemble learning models with a privacy guarantee, so there is a strong demand for privacy-preserving collaborative ensemble learning. This paper proposes a privacy-preserving collaborative ensemble learning framework under differential privacy. In the framework, multiple parties independently build their local ensemble models with personalized privacy budgets, and collaboratively share their knowledge, with the help of a central agent, to obtain a stronger classifier in a privacy-preserving way. Under this framework, this paper presents differentially private versions of two widely used ensemble learning algorithms: collaborative random forests under differential privacy (CRFsDP) and collaborative adaptive boosting under differential privacy (CAdaBoostDP). Theoretical analysis and extensive experimental results show that the proposed framework achieves a good balance between privacy and utility in an efficient way.

Keywords

Ensemble learning; Differential privacy; Random forests; Adaptive boosting

1. Introduction

In the age of big data [36], it is valuable to investigate methodologies for big data analysis, so that the custodians of such data can make decisions based on useful knowledge extracted from it. For example, facial images captured in surveillance videos are used to trace escaped criminals [23]. Amazon collects the text and images of clothing purchased by millions of customers to mine fashion styles. Based on historical loan records, banks assign credit scores to applicants and decide whether loans should be issued [5].

Ensemble learning [26] is a machine learning paradigm in which multiple models jointly solve a particular problem. In contrast to ordinary approaches, which try to learn one hypothesis from the training data set, ensemble learning constructs a set of hypotheses and combines them. This strategy is primarily used to improve the performance of the model or to reduce the error of a single hypothesis. Ensemble learning is well suited to classification. It has already been applied to diverse applications, such as predicting potential side effects of drugs [38], detecting visual concepts in web images [29], and exploring appropriate rules for stock trading decisions [20].

However, ensemble learning may disclose individual privacy if data is shared among organizations or submitted to a public commercial cloud [33]. For example, electronic health records [14] generally contain patients’ data such as demographics, diagnostics, medications and nursing problems, most of which are extremely sensitive. If such data were exposed to an adversary, the losses would be incalculable. Considering this risk, most organizations have no choice but to build their decision support systems with the limited data available to them. Thus, there is a tension between preserving individual privacy and having sufficient data for modeling and decision making.

There is existing work on privacy-preserving ensemble learning [1,18] that gives specific solutions, but these proposals do not fully address the key problem of how to share knowledge securely and efficiently among multiple parties. The work in [1] discussed differentially private random forests and compared three variants of the algorithm: majority voting, threshold averaging and probabilistic averaging. Even when privacy is guaranteed, the majority voting rule is an excellent candidate because it is less sensitive to the choice of parameters. The recent work on adaptive boosting [18] developed a distributed protocol to mine healthcare data, but it would disclose information about the model. Therefore, a practical privacy-preserving ensemble learning solution is still lacking.

In this paper, by introducing differential privacy, we propose a privacy-preserving collaborative ensemble learning framework. Under this framework, we present differentially private versions of two widely used ensemble learning algorithms: collaborative random forests under differential privacy (CRFsDP) and collaborative adaptive boosting under differential privacy (CAdaBoostDP). The goal of our work is to build an effective collaborative decision support system that can share knowledge among multiple parties without revealing personal sensitive information. In our framework, each party trains its own classifier by ensemble learning on its local data, and the useful knowledge extracted from the classifier is then transferred to a central agent without disclosing the sensitive data. Our approach integrates these classifiers at the honest-but-curious central agent under privacy constraints.

Compared with existing work, our solution has the following salient advantages. First, with a central agent, our solution dramatically reduces the computation and communication cost of the overall system. Our framework avoids transferring data directly among multiple parties. Each party uploads its local model to the central agent and receives the integrated model from it, and there is no interaction between parties. Second, we regard the central agent as honest-but-curious instead of trusted. Therefore, our framework is practical in terms of its security assumption. Third, our approach provides flexible configuration of the privacy budget for each party. When faced with sensitive data, different users have different privacy expectations, but most of the existing literature sets a universal standard without catering for users’ personalized needs. The consequence is insufficient protection for one subset of people and excessive privacy control for another [34].

Our main contributions are summarized as follows.

We propose an ensemble learning framework under differential privacy that is applicable to any concrete algorithm following the ensemble learning concept. The framework focuses on the scenario in which multiple parties in ensemble learning have to share knowledge with each other but do not want to disclose their own private sensitive data.

Our framework is efficient in terms of complexity and practical in terms of its security assumption. By utilizing a central agent, our framework only requires the exchange of models instead of data sets. This not only significantly reduces the cost of computation and communication, but also limits privacy leakage while preserving utility. We also treat the central agent as honest-but-curious, and design a simple yet effective approach to integrate the local models.

We consider personalized differential privacy. Our proposed framework allows the privacy budget to be specified at party level, so each party in our framework can control its privacy budget by itself. It provides significantly better performance than the existing solutions.

We conduct theoretical analysis and extensive experiments on a real credit card clients data set to evaluate the effectiveness of our proposed framework, and the results illustrate that a satisfactory tradeoff between privacy and utility can be achieved.

The rest of this paper is organized as follows: In Section 2, we provide related work on privacy-preserving data mining, especially for ensemble learning. In Section 3, we give necessary preliminaries on ensemble learning and differential privacy. In Section 4, we formulate the problem and system, as well as the security model considered in this paper. In Section 5, we propose the ensemble learning framework under differential privacy in detail. Section 6 and Section 7 give theoretical analysis and experimental results, respectively. Section 8 concludes the paper.

Table 1
Notations

Notation	Description	Notation	Description
D	Data set	v	Number of attribute variables used by tree
N	Number of records in data set	m	Number of parties
C	Class variable	$ϵ_{c}$	Privacy budget of CART
$\hat{C}$	Predicted class variable	$ϵ_{p}$	Privacy budget for each party
K	Number of classes	α	Accuracy of local ensemble classifier
A	Attribute variables	λ	Proportion of amount of local data set
T	Number of iterations	w	Weight function in integration
h	Maximum height of CART

2. Related work

In this section, we first give a very brief overview of privacy study in data mining, and then discuss related work on privacy-preserving ensemble learning.

2.1. Privacy-preserving data mining

Privacy-preserving data mining has developed over time from K-anonymity [30] and L-diversity [21] to T-closeness [17]. K-anonymity ensures that individuals cannot be uniquely re-identified in a data set and thus guards against linking attacks. However, an adversary with background knowledge can infer sensitive information about individuals even without re-identifying them. L-diversity requires that each equivalence class has at least l well-represented values for each sensitive attribute, but it does not consider the semantic meaning of sensitive values. T-closeness [17] requires that the distribution of a sensitive attribute in any equivalence class is close to its distribution in the overall table, but it fails to prevent identity disclosure. Differential privacy [6,7] guarantees that the distribution of a noisy query result changes very little with the addition or deletion of any record. It has been applied to support vector machines [28], location-based services [9], correlated data publication [4], high dimensional data publication [37], etc.

2.2. Privacy-preserving ensemble learning

Random forests with differential privacy were first studied in [15]. The authors presented a differentially private decision tree ensemble algorithm based on random decision trees, but they did not evaluate the quality of the differentially private algorithm. In [32], an encryption-based method was developed to construct random forests securely, and a distributed strategy was proposed for knowledge discovery. In [31], the randomized response idea was used to mix sets of data instead of generalization, but it was only appropriate for binary attributes. In [27], the authors discussed the quality function of decision trees and stated that information gain, the max operator and the Gini index had almost the same effect on accuracy regardless of their sensitivity to noise. The recent work [1] provided strong theoretical guarantees for both non-differentially private and differentially private random forests. The result was that majority voting and threshold averaging had better accuracy than probabilistic averaging.

Compared with random forests, adaptive boosting with privacy constraints has rarely been studied. Gambs et al. [12] first proposed privacy-preserving boosting algorithms for data mining: BiBOOST and MultBOOST. The algorithms allow two or more parties to construct a boosting classifier without sharing their own data sets. However, the main problem is that each party’s optimal classifier is potentially exposed at the merging step through anonymous broadcast. In [8], the authors used boosting for arbitrary low-sensitivity queries. The latest work in [18] presented a distributed protocol based on the adaptive boosting strategy, but its integration process incurs a high communication cost and carries a potential risk of leaking model information.

3. Preliminaries

This section gives preliminaries closely related to our work, including classification and regression trees, ensemble learning and differential privacy. We first introduce the notations in Table 1.

3.1. Classification and regression trees

Classification and regression trees [19] are methods for constructing prediction models from data. The model is obtained by recursively partitioning the data and fitting a simple prediction model within each partition, and it can be visualized as a decision tree. CART [3] is one form of classification and regression trees; it splits a node by exhaustively searching over the inputs and finds the best branch attribute variable based on the Gini index [24] minimization principle. CART first grows a complete large tree and then prunes it to a smaller one that minimizes an estimate of the misclassification error, using cross-validation to compute the error rate. We formally define the generation and pruning of CART as follows: $\begin{array}{ll} (1) & g = \mathrm{CARTGen} (D, A, C, h) \\ (2) & g = \mathrm{CARTPrun} (g) \end{array}$ where the meanings of the notations can be found in Table 1. For simplicity and without loss of generality, we assume in the subsequent statements that the CART is binary, i.e. the class variable C has only two possible values.
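As a rough illustration of the Gini index minimization principle, the following sketch searches a single numeric attribute for the threshold with the lowest size-weighted Gini index. The helper names `gini` and `best_split` and the toy data are assumptions made for illustration; they are not taken from [3] and the sketch omits pruning and the recursion over nodes.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Exhaustively search thresholds on one numeric attribute.

    Returns (threshold, weighted_gini) for the split `value <= threshold`
    that minimizes the size-weighted Gini index of the two partitions.
    """
    n = len(values)
    best = (None, float("inf"))
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(values))[:-1]:
        left = [c for v, c in zip(values, labels) if v <= threshold]
        right = [c for v, c in zip(values, labels) if v > threshold]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best

# Toy data (hypothetical): the attribute separates the two classes at 2.
threshold, score = best_split([1, 2, 3, 4], [0, 0, 1, 1])
```

In a full CART the same search would be repeated over every candidate attribute at every node.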

3.2. Ensemble learning

Ensemble learning refers to learning a combination of base hypotheses. Its goal is to strengthen the capability of the base model.

Definition 1 (Ensemble learning [25]).

Suppose we have obtained T CARTs, each denoted as $g_{t}$ ( $t = 1, \ldots, T$ ). Let G denote the aggregation of the classifiers $g_{t}$ , and let $\hat{C} | X$ indicate that the output is $\hat{C}$ given the input X. We can then define $\begin{matrix} (3) & G (\hat{C} | X) = \sum_{t = 1}^{T} w_{t} \cdot g_{t} (\hat{C} | X) \end{matrix}$ where $w_{t}$ is the probabilistic weight of the tth classifier.
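The weighted combination in Eq. (3) can be sketched as follows. The representation of each base classifier as a callable `g(c, x)` returning a score for class `c`, and the three threshold classifiers themselves, are assumptions chosen only to make the example self-contained.

```python
def ensemble_scores(classifiers, weights, x, classes):
    """Weighted combination G(c | x) = sum_t w_t * g_t(c | x) for every class c."""
    return {
        c: sum(w * g(c, x) for g, w in zip(classifiers, weights))
        for c in classes
    }

def ensemble_predict(classifiers, weights, x, classes):
    """Return the class with the largest combined score."""
    scores = ensemble_scores(classifiers, weights, x, classes)
    return max(scores, key=scores.get)

# Three hypothetical base classifiers that vote with 0/1 indicators.
g1 = lambda c, x: 1.0 if c == (x > 0) else 0.0
g2 = lambda c, x: 1.0 if c == (x > 1) else 0.0
g3 = lambda c, x: 1.0 if c == (x > 2) else 0.0

weights = [0.5, 0.3, 0.2]  # probabilistic weights w_t summing to 1
scores = ensemble_scores([g1, g2, g3], weights, 1.5, [False, True])
prediction = ensemble_predict([g1, g2, g3], weights, 1.5, [False, True])
```

With equal weights this reduces to majority voting, which is the random forests case discussed next.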

3.2.1. Random forests

The random forests [2] algorithm is an ensemble learning method. It consists of a collection of independently constructed decision trees, and each tree casts an equally weighted vote on the input to predict the class variable. Random forests include training and prediction phases. In training, the algorithm repeatedly takes a bootstrap sample from the training data set and builds a decision tree from that sample. After a number of trees have been generated, the prediction is made by majority vote.

The random forests algorithm outperforms many models in computational speed, due to the random partition employed in the tree generating process. The introduced randomness also makes the random forests model less prone to overfitting. In addition, random forests support parallel operation easily.

3.2.2. Adaptive boosting

Adaptive boosting (AdaBoost) [11] is an ensemble learning algorithm that iteratively generates a strong classifier from a pool of weak classifiers. In each iteration, a classifier is learned, and the weight distribution over the training data set is updated based on the correctness of its predictions, so that the classifier in the subsequent iteration focuses on the records misclassified in the previous iteration. The final classifier is a weighted combination of those classifiers.

AdaBoost provides a way of combining multiple base classifiers so that the combined performance is significantly better than that of any single base classifier, and it does not require feature selection before training. It is also resistant to overfitting, but its time complexity is somewhat higher than that of the base classifier.

3.3. Differential privacy

Differential privacy is a novel technique providing privacy-preserving noisy query answers over statistical databases [16]. It guarantees that the distribution of noisy query answers changes little with the addition or deletion of any record. The goal of differential privacy is to answer queries over sensitive data sets without compromising the privacy of individuals, whether their records are in data sets or not.

Definition 2 (ϵ-Differential privacy [7]).

A randomized function $M$ gives ϵ-differential privacy if, for any two data sets $D_{1}$ and $D_{2}$ that differ on at most one record and any output set $O \subseteq Range (M)$ , $\begin{matrix} (4) & \Pr [M (D_{1}) \in O] ⩽ \exp (ϵ) \times \Pr [M (D_{2}) \in O] \end{matrix}$

The probability reflects the risk of privacy disclosure and is controlled by the randomized function $M$ . The privacy budget ϵ determines the degree of privacy protection: the smaller ϵ is, the stronger the privacy guarantee.

In order to achieve differential privacy, two standard techniques, the Laplace mechanism [7] and the exponential mechanism [22], have been proposed in the literature. A fundamental concept in both mechanisms is the global sensitivity of a function that maps a database to real values.

Definition 3 (Global sensitivity [7]).

For any function $f : D \to R^{d}$ , the global sensitivity of f is defined as $\begin{matrix} (5) & S (f) = \max_{D_{1}, D_{2}} {‖ f (D_{1}) - f (D_{2}) ‖}_{1} \end{matrix}$ where $D_{1}$ and $D_{2}$ differ on at most one record, R is the real space, and d is the query dimension of the function f.

The Laplace mechanism [7] takes as inputs a data set, a function f and a privacy budget ϵ, and returns the true output of f plus noise drawn from a Laplace distribution. Formally, for a given function $f : D \to R^{d}$ over an arbitrary data set, the following mechanism $M$ meets ϵ-differential privacy: $\begin{matrix} (6) & M (D) = f (D) + \mathrm{Laplace} (\frac{S (f)}{ϵ}) \end{matrix}$
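A minimal sketch of Eq. (6) for a one-dimensional counting query is given below. It uses only the Python standard library and draws Laplace noise as the difference of two independent exponential variables; the function names, the toy records and the choice ϵ = 0.5 are illustrative assumptions.

```python
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) as the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return f(D) + Laplace(S(f) / epsilon), as in Eq. (6)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)  # fixed seed only to make the sketch reproducible

records = [1, 0, 1, 1, 0, 1]   # hypothetical binary attribute
true_count = sum(records)      # counting query: adding or removing a record changes it by at most 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller ϵ yields a larger noise scale S(f)/ϵ and hence a noisier answer, which is the privacy and utility tradeoff discussed throughout this paper.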

The exponential mechanism [22] is useful for analyses whose outputs are not real-valued, for which adding numeric noise is not meaningful. It selects an output r from the output set $O$ according to its score under a given score function q, in a differentially private manner. The score function q measures the quality of r, so it should be insensitive to the change of any single record in the data set. Formally, given a score function q for a data set D, the exponential mechanism $\begin{matrix} (7) & M (D) = \{ r \mid \Pr [r] \propto \exp (\frac{ϵ q (D, r)}{2 S (q)}) \} \end{matrix}$ meets ϵ-differential privacy.
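The selection rule in Eq. (7) can be sketched as follows, here for choosing one candidate attribute according to a quality score. The candidate names, the scores and the helper `selection_probabilities` are hypothetical and serve only to make the sampling distribution visible.

```python
import math
import random

def selection_probabilities(candidates, score, sensitivity, epsilon):
    """Probabilities proportional to exp(epsilon * q(D, r) / (2 * S(q)))."""
    scores = [score(r) for r in candidates]
    top = max(scores)  # subtracting the maximum avoids overflow without changing the ratios
    weights = [math.exp(epsilon * (s - top) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def exponential_mechanism(candidates, score, sensitivity, epsilon, rng):
    """Sample one candidate according to the probabilities above."""
    probs = selection_probabilities(candidates, score, sensitivity, epsilon)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical quality scores q(D, r) for three candidate attributes.
quality = {"age": 3.0, "income": 5.0, "education": 1.0}
candidates = list(quality)

probs = selection_probabilities(candidates, quality.get, sensitivity=1.0, epsilon=1.0)
chosen = exponential_mechanism(candidates, quality.get, 1.0, 1.0, random.Random(0))
```

Candidates with higher scores are exponentially more likely to be selected, but every candidate keeps a non-zero probability, which is what hides the influence of any single record.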

4. Problem statement and system model

In this section, we state the problem considered in this paper and its challenges, as well as the involved system model and security model.

4.1. Problem statement

We consider the problem of privacy-preserving collaborative ensemble learning in this paper. Specifically, multiple parties each hold their own data set, or each is assigned part of the data from a data pool. They want to collaborate with others and use ensemble learning to obtain better predictive performance, but they do not want to publish their own data sets directly because of privacy concerns, so each party runs ensemble learning locally on its own data set. The question is: how can they obtain better predictive performance through ensemble learning in a collaborative and distributed way while protecting the privacy of each party?

This problem is prevalent in practice because knowledge sharing between multiple parties is often an essential requirement due to limited storage space or computing power. For example, several hospitals may each hold electronic health records of the same patient corresponding to diverse symptoms. The patient would be able to obtain better treatment through comprehensive consultation if these medical records were shared. Similar scenarios frequently occur in other industries. In the age of big data, a traditional single machine is far from enough to cope with trillions of records or more, so a distributed architecture is necessary to mine the data. Efficiently integrating the local results is a must in this case, but privacy has to be taken into account.

4.2. Challenges

There are two main challenges in solving the problem of privacy-preserving ensemble learning. On the one hand, the computational complexity or communication cost may be unaffordable when privacy is considered in ensemble learning. For example, one solution is for each party to send its encrypted data or model to the others, but this introduces a large computation workload at each party or heavy communication overhead between parties. On the other hand, it is hard to meet personalized privacy requirements. Because different parties may have different privacy concerns, it is a challenge to provide a flexible privacy budget for each party in ensemble learning while still obtaining a satisfactory integrated model.

4.3. System model

The system model of our solution is illustrated in Fig. 1. Each party is responsible for building a local ensemble model and transferring it to a central agent. The central agent integrates the local models to obtain the integrated model and distributes it to all parties. Moreover, any party can upload its latest model again after refining the local model with a new data set.

Fig. 1. The system model.

4.4. Security model and assumptions

In our security model, we consider the worst case in which the adversary knows all the data except for the record to be inferred, and all the records in the database are independent of each other. We further make three assumptions on the system model. First, the central agent is honest-but-curious, meaning that it follows the proposed protocol exactly, yet attempts to learn private information from the data it receives. Second, the central agent has enough computing power to complete any complex operation. Third, the parties’ data share the same schema and follow an identical distribution, while each party may have its own privacy expectation.

5. Collaborative ensemble learning under differential privacy

In this section, we first give the framework of our proposed collaborative ensemble learning under differential privacy. Under this framework, we then propose differentially private versions of two widely used ensemble learning algorithms: random forests and adaptive boosting. Finally, we present our integration and distribution mechanisms at the central agent for collaborative ensemble learning. The instantiations of the framework on random forests and adaptive boosting are called collaborative random forests under differential privacy (CRFsDP) and collaborative adaptive boosting under differential privacy (CAdaBoostDP), respectively.

5.1. Framework

We now introduce our framework of collaborative ensemble learning under differential privacy. As shown in Fig. 2, the framework includes four main components:

Fig. 2. Framework of collaborative ensemble learning under differential privacy.

Producing local data set. Each party holds or collects its own local data set, because in real applications data with the same schema usually comes from multiple sources, or because a portion of the data from a data pool is assigned to each party for distributed processing. The local data set contains sensitive information, so each party does not want to share it with the others directly.

Building local ensemble model. In the framework of collaborative ensemble learning, each party first builds its local ensemble model independently with its local data. In the subsequent statements, we focus on two widely used ensemble learning algorithms, random forests and adaptive boosting, whose base classifier is CART; the framework can easily be extended to support other ensemble learning algorithms.

Introducing differential privacy. In order to protect the privacy of local ensemble model of each party, we introduce differential privacy during learning process. We first introduce differential privacy into the base classifier CART and then propose a differentially private CART algorithm. Based on it, we propose differentially private versions of two algorithms: random forests and adaptive boosting.

Integration and distribution. After all parties have built their differentially private local ensemble models, they send the models to a central agent, where the models are integrated. According to the generalization ability of the models, we design a weight for each local model based on its amount of data and its accuracy. Our strategy tries to disclose as little information as possible in the integration phase. Finally, the integrated model is distributed to all parties, and each party gains the synthesized knowledge without gaining access to the others’ data.

Algorithm 1. CARTGenDP

Algorithm 2. CARTPrunDP

5.2. Differentially private base classifier CART

CART is a simple and efficient base classifier for ensemble classifiers such as random forests and adaptive boosting. The structure of a CART contains branch attribute variables and the class variable. To publish a CART, we must prevent these critical contents from leaking. Therefore, we design a differentially private CART to solve this problem.

The basic idea of our proposed differentially private CART is to add noise to each node of the decision tree. The amount of noise is decided by a given privacy budget. We still follow the two steps of CART, namely generation and pruning. In the process of tree generation, more Laplace noise is added to the nodes that are closer to the leaf nodes. Algorithm 1 presents the details of generating CART under differential privacy (CARTGenDP), and Algorithm 2 presents the pruning process (CARTPrunDP).

Specifically, suppose a party allocates a privacy budget $ϵ_{c}$ to build its differentially private CART. Starting from the root node, whose available privacy budget is set to $ϵ_{c}$ , we decide whether the current node is a branch node or a leaf node. If it is a leaf node, all the available privacy budget is allocated to it. If it is a branch node, $1 / 2$ of the available privacy budget is allocated to it, and the other $1 / 2$ is reserved for its child node(s): if the current node has two child nodes, $1 / 4$ of the available privacy budget is allocated to each child node; if it has only one child node, the reserved $1 / 2$ is allocated directly to that child node. In the pruning process, the privacy budget that belongs to a pruned subtree is added to its parent node.
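The budget allocation rule described above can be sketched on an already-built tree structure as follows. The `Node` class and function names are hypothetical; the sketch only tracks how the budget is distributed and does not perform the noisy split selection or pruning of Algorithms 1 and 2.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    budget: float = 0.0  # privacy budget assigned to this node

def allocate_budget(node, available):
    """Assign `available` budget to the subtree rooted at `node`.

    A leaf takes everything; a branch keeps half and splits the
    other half equally among its one or two children.
    """
    if not node.children:
        node.budget = available
        return
    node.budget = available / 2
    share = (available / 2) / len(node.children)
    for child in node.children:
        allocate_budget(child, share)

def total_budget(node):
    """Sum of the budgets assigned in the subtree rooted at `node`."""
    return node.budget + sum(total_budget(child) for child in node.children)

# Hypothetical tree: the root has a leaf and a branch with two leaves.
leaf_a, leaf_b, leaf_c = Node(), Node(), Node()
branch = Node(children=[leaf_b, leaf_c])
root = Node(children=[leaf_a, branch])
allocate_budget(root, 1.0)  # epsilon_c = 1.0
```

In this example the root receives 0.5, `leaf_a` 0.25, `branch` 0.125 and its two leaves 0.0625 each, so the assigned budgets sum to $ϵ_{c}$.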

5.3. Random forests under differential privacy

In this section, we propose a method of introducing differential privacy into random forests. The basic idea is intuitive: we use the differentially private CART proposed in the previous section as the base classifier in random forests.

Differentially private random forests include two phases:

Create differentially private random forests from the training data set. First, given the number of iterations T and the privacy budget $ϵ_{p}$ , allocate a privacy budget of $ϵ_{p} / T$ to each iteration. Then, in each iteration, call the CARTGenDP and CARTPrunDP functions to generate a differentially private base classifier CART. Finally, obtain the differentially private random forests. In prediction, output the majority vote over $\{g_{t}\}_{t = 1}^{T}$ . This process is denoted as RFsDP and the details are presented in Algorithm 3.

Calculate the accuracy of the differentially private random forests on the test data set. Use the created differentially private random forests to classify the test data set and obtain the predicted class labels. Compare the predicted class labels with the real ones to obtain the accuracy α of the local differentially private random forests.
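The two phases above can be sketched at the level of budget splitting, voting and accuracy. The per-tree predictions are given directly as toy data, since the trees themselves would come from CARTGenDP and CARTPrunDP; all names and values here are illustrative assumptions rather than the content of Algorithm 3.

```python
from collections import Counter

def per_tree_budget(epsilon_p, num_trees):
    """Each of the T trees receives epsilon_p / T."""
    return epsilon_p / num_trees

def majority_vote(tree_predictions):
    """tree_predictions[t][i] is the label that tree t assigns to record i.

    Ties are broken in favor of the label seen first, so an odd number
    of trees is assumed for binary labels.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*tree_predictions)]

def accuracy(predicted, actual):
    """Fraction of records whose predicted label matches the real one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions of T = 3 trees on four test records.
tree_predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
forest_prediction = majority_vote(tree_predictions)
alpha = accuracy(forest_prediction, [1, 0, 0, 1])
epsilon_c = per_tree_budget(1.0, 4)
```

The accuracy α computed here is the value that each party later reports for the integration step.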

Algorithm 3. RFsDP

5.4. Adaptive boosting under differential privacy

In this section, we propose a method of introducing differential privacy into another widely used ensemble learning algorithm: adaptive boosting (AdaBoost). The basic idea is again to use the proposed differentially private CART as the base classifier to build a boosting classifier.

Differentially private adaptive boosting includes two phases:

Create a differentially private AdaBoost classifier from the training data set. First, given the number of iterations T and the privacy budget $ϵ_{p}$ , allocate a privacy budget of $ϵ_{p} / T$ to each iteration, and initialize the weight distribution over the records in the training data set as the uniform distribution. Then, iteratively build differentially private CARTs with the weight distribution. In each iteration, after the CART is built by calling the CARTGenDP and CARTPrunDP functions, compute the error rate on misclassified records and the weight $η_{t}$ of the CART in the current iteration, and update the weight distribution for the next iteration based on the error rate. Finally, append each CART with its weight to the ensemble classifier. In the prediction phase, output the weighted vote $\mathrm{sign} (\sum_{t = 1}^{T} η_{t} \cdot g_{t})$ . This process is denoted as AdaBoostDP and the details are presented in Algorithm 4.

Calculate the accuracy of the differentially private AdaBoost classifier on the test data set. Use the created differentially private AdaBoost classifier to classify the test data set and obtain the predicted class labels. Compare the predicted class labels with the real ones to obtain the accuracy α of the local differentially private AdaBoost.
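The per-iteration bookkeeping can be sketched with the standard AdaBoost update for labels in {-1, +1}. This is an assumption about the form of the update: the text above does not state the exact formulas, and the predictions of each differentially private CART are supplied here as toy data.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One round of the standard AdaBoost update (assumed form).

    Returns (eta, new_weights), where eta is the weight of the current
    classifier and new_weights is the normalized record distribution.
    """
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    error = min(max(error, 1e-12), 1 - 1e-12)  # guard against log(0)
    eta = 0.5 * math.log((1 - error) / error)
    raw = [w * math.exp(-eta * y * p) for w, p, y in zip(weights, predictions, labels)]
    total = sum(raw)
    return eta, [w / total for w in raw]

def boosted_predict(etas, tree_predictions):
    """sign(sum_t eta_t * g_t(x)) for each record; zero is mapped to +1."""
    return [
        1 if sum(e * p for e, p in zip(etas, votes)) >= 0 else -1
        for votes in zip(*tree_predictions)
    ]

# Four records with uniform initial weights; the last one is misclassified.
labels = [1, -1, 1, -1]
predictions = [1, -1, 1, 1]
eta, new_weights = adaboost_round([0.25] * 4, predictions, labels)

combined = boosted_predict([1.0, 0.4, 0.4], [[1, -1], [-1, -1], [-1, 1]])
```

After the update the misclassified record carries half of the total weight, so the next CART concentrates on it.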

Algorithm 4. AdaBoostDP

5.5. Integration and distribution

Once all parties have built their local ensemble classifiers (i.e. RFsDP or AdaBoostDP in this paper), they collaboratively and privately construct an integrated classifier over the global data set with the help of the central agent. In our proposed framework, each party sends its differentially private ensemble model, instead of its local data set, to the central agent for privacy reasons. Upon receiving the local ensemble classifiers from all parties, the central agent integrates them to obtain the final classifier.

The weight of each local ensemble classifier in the integration is determined by its accuracy and the proportion of its training data set to the global data set. Accuracy alone is not a reliable weight: for example, when a party has very little data, the accuracy of the trained classifier may appear high simply because of overfitting. Therefore, we design the weighting function $w_{p}$ for a local classifier as $\begin{matrix} (8) & w_{p} (λ, α) = α \exp (λ) \end{matrix}$ where λ is the proportion of the training data set of party p, and α is the corresponding accuracy. Algorithm 5 describes the details of the integration process.

Algorithm 5. Integration
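The weighting function in Eq. (8) can be sketched as follows. How the central agent combines the weighted local outputs is an assumption here (a weighted sign vote over labels in {-1, +1}), and the record counts, accuracies and predictions are hypothetical.

```python
import math

def party_weight(proportion, accuracy):
    """w_p(lambda, alpha) = alpha * exp(lambda), as in Eq. (8)."""
    return accuracy * math.exp(proportion)

def integrate(local_predictions, record_counts, accuracies):
    """Combine local classifier outputs at the central agent.

    local_predictions[p][i] is the label in {-1, +1} that party p's
    classifier assigns to record i. The weighted sign vote is an
    assumed combination rule, not taken from Algorithm 5.
    """
    total = sum(record_counts)
    weights = [party_weight(n / total, a) for n, a in zip(record_counts, accuracies)]
    combined = [
        1 if sum(w * p for w, p in zip(weights, votes)) >= 0 else -1
        for votes in zip(*local_predictions)
    ]
    return weights, combined

# Three hypothetical parties: the smallest one reports the highest accuracy.
weights, combined = integrate(
    local_predictions=[[1, 1], [1, -1], [-1, -1]],
    record_counts=[100, 300, 600],
    accuracies=[0.95, 0.80, 0.82],
)
```

In this toy setting the party with only 10% of the data receives the smallest weight despite its high reported accuracy, which is the behavior the weighting function is designed to produce.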

After the integration, the central agent distributes the final classifier to all parties so that they obtain better prediction performance. It is easy to distribute the integrated classifier to all parties via the network. Each party can also continually update its local ensemble classifier based on new data sets and periodically upload it to the central agent to improve the performance of the integrated classifier. Furthermore, if a new party enrolls in the system, it can contribute its local ensemble classifier to the central agent, in which case only the integrated classifier needs to be updated and distributed again, or it can simply obtain the integrated classifier from the central agent directly.

5.6. Further discussion

We now give a further discussion on how to make the integration computation more secure. So far, our proposed solutions ensure the privacy of each party’s local ensemble model and data set, but each party has to send the accuracy of its local ensemble classifier $α_{p}$ and the number of records in its local data set $N_{p}$ to the central agent in plaintext. If the central agent is honest-but-curious, it can learn these sensitive parameters (another party can also learn them by wiretapping the communication channel); if it is malicious, it can even manipulate these parameters to control the final integrated classifier. To suppress curious or even malicious behavior of the central agent, we can use secure multi-party computation [13] to further protect $α_{p}$ and $N_{p}$ , so that no party can obtain these sensitive parameters of other parties, and the central agent cannot obtain or manipulate them in the process of integration.

6. Theoretical analysis

In this section, we give theoretical analysis on our proposed solution, including privacy, utility, computational complexity and communication cost.

6.1. Privacy analysis

We analyze the privacy protection of our proposed scheme in three aspects: base classifier CART, ensemble classifier, and integrated final classifier.

First, we explain the privacy guaranteed by CART. For our proposed differentially private base classifier, a privacy budget of $ϵ_{c}$ is allocated to each CART. In the generation process CARTGenDP, starting from the root node, if a node is a branch node, half of its available privacy budget is used to perturb the branch variable value and the other half is divided equally among its child node(s). In the pruning process CARTPrunDP, the privacy budget of a pruned subtree is added back to the parent node of that subtree. Therefore, the privacy budget $ϵ_{c}$ is fully used. For example, Fig. 3 shows a constructed differentially private CART and the privacy budget allocated to each node.

Fig. 3.

A differentially private CART and the privacy budget allocated to each node. Branch nodes are represented by circles, and leaf nodes by squares.
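The budget allocation rule of CARTGenDP described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the dictionary-based tree, the function name `allocate_budget`, and the assumption that a leaf node spends the whole budget it receives are all hypothetical, and the pruning step CARTPrunDP is omitted.

```python
def allocate_budget(node, budget, spent=None):
    """Collect the privacy budget spent at each node (pre-order).

    `node` is a hypothetical dict with an optional "children" list.
    Assumption: a leaf spends all the budget it receives.
    """
    if spent is None:
        spent = []
    children = node.get("children", [])
    if not children:
        spent.append(budget)          # leaf: use everything that is left
        return spent
    spent.append(budget / 2)          # branch: half perturbs the branch variable
    share = (budget / 2) / len(children)
    for child in children:            # other half is split equally among children
        allocate_budget(child, share, spent)
    return spent


# Hypothetical tree: the root has one branch child (with two leaves) and one leaf child.
tree = {"children": [{"children": [{}, {}]}, {}]}
epsilon_c = 1.0
spent = allocate_budget(tree, epsilon_c)
print(spent, sum(spent))
```

Under these assumptions the per-node budgets sum to $ϵ_{c}$ , which matches the claim that the budget is fully used.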

Next, we analyze the privacy of each ensemble classifier. For a party with personalized privacy requirement $ϵ_{p}$ , the privacy budget consumed by its local ensemble classifier is $T \cdot ϵ_{c} = T \cdot \frac{ϵ_{p}}{T} = ϵ_{p}$ , which guarantees exactly the level of privacy required by that party. Because each party builds its ensemble classifier independently under the differential privacy constraint, privacy budgets do not accumulate across parties. Therefore, our solution satisfies $ϵ_{p}$ -differential privacy for each party.

Last, we analyze the privacy constraint of integration at the central agent. Let D denote the union of the data sets $D_{p}$ held by each party p, so $\{D_{p}\}_{1}^{m}$ represents a partition of D. Given two data sets D and $D^{\prime}$ that differ in at most one record, for a partition $\{D^{i}\}_{1}^{m}$ of D, the probability of party p selecting $D^{i}$ is $\begin{array}{l} \Pr [D_{p} \leftarrow D^{i} \mid D] \\ = \frac{f (D^{i}) + Laplace (\frac{S (f)}{ϵ_{p}})}{\sum_{i} (f (D^{i}) + Laplace (\frac{S (f)}{ϵ_{p}}))} \\ \leqslant \frac{f (D^{\prime i}) + S (f) + Laplace (\frac{S (f)}{ϵ_{p}})}{\sum_{i} (f (D^{\prime i}) + S (f) + Laplace (\frac{S (f)}{ϵ_{p}}))} \\ \leqslant \exp (ϵ_{p}) \frac{f (D^{\prime i}) + Laplace (\frac{S (f)}{ϵ_{p}})}{\sum_{i} (f (D^{\prime i}) + Laplace (\frac{S (f)}{ϵ_{p}}))} \\ = \exp (ϵ_{p}) \Pr [D_{p}^{\prime} \leftarrow D^{\prime i} \mid D^{\prime}] \end{array}$ which shows that differential privacy is independent of the partition and that there is no extra privacy leakage in the integration process.

6.2. Utility analysis

We analyze utility by comparing the category aggregates of the original and perturbed data. We use the widely used root mean square error (RMSE), defined as $\begin{matrix} (9) & RMSE = \sqrt{\frac{1}{K} \sum_{j = 1}^{K} {(C^{j} - {\hat{C}}^{j})}^{2}} \end{matrix}$ on the original and perturbed category aggregates, where $C^{j}$ and ${\hat{C}}^{j}$ are the jth elements of class C and predicted class $\hat{C}$ . We compute ${RMSE}_{p}$ of party p as in Equation (10): $\begin{array}{ll} & {RMSE}_{p} = \sqrt{\frac{1}{K} \sum_{j = 1}^{K} {(C^{j} - {\hat{C}}^{j})}^{2}} \\ & = \sqrt{\frac{1}{K} \sum_{j = 1}^{K} {(f (D^{j}) - (f (D^{j}) + Laplace (\frac{S (f)}{ϵ_{p}})))}^{2}} \\ & = \sqrt{\frac{1}{K} \sum_{j = 1}^{K} {(Laplace (\frac{S (f)}{ϵ_{p}}))}^{2}} \\ (10) & = Laplace (\frac{S (f)}{ϵ_{p}}) \end{array}$ where $D^{j}$ is the set of records whose class value is $C^{j}$ . Therefore, the RMSE of the whole system is bounded by $Laplace (\frac{S (f)}{\min_{p} ϵ_{p}})$ .
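As a rough numerical illustration of Equation (9), the following Python sketch perturbs category counts with Laplace noise of scale $S(f)/ϵ_{p}$ and measures the resulting RMSE. The counts, the sensitivity of 1 (a counting query) and the value of $ϵ_{p}$ are hypothetical, and the helper names are ours. Since a Laplace variable with scale b has variance $2b^{2}$ , the empirical RMSE over many categories is expected to be close to $\sqrt{2} \cdot S(f)/ϵ_{p}$ , so it shrinks as the privacy budget grows.

```python
import math
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def rmse(original, perturbed):
    # Root mean square error between two equally long sequences, as in Eq. (9).
    k = len(original)
    return math.sqrt(sum((c - c_hat) ** 2 for c, c_hat in zip(original, perturbed)) / k)


rng = random.Random(0)
sensitivity = 1.0                 # assumed S(f) for a counting query
epsilon_p = 0.5                   # hypothetical privacy budget of party p
counts = [1200.0, 800.0]          # hypothetical category aggregates f(D^j)
scale = sensitivity / epsilon_p
perturbed = [c + laplace_noise(scale, rng) for c in counts]
print(rmse(counts, perturbed))
```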

6.3. Time and communication costs

In this section, we analyze the time and communication costs of our proposed differentially private ensemble learning algorithms. Table 2 summarizes the results.

Table 2
Time and communication costs

Scheme	Time Cost	Communication Cost
CRFsDP	$\max \{t_{p}\}_{1}^{m} + t_{ag}$	$2 \sum_{p = 1}^{m} n_{G_{p}}$
CAdaBoostDP	$\max \{T t_{p}\}_{1}^{m} + t_{ag}$	$2 \sum_{p = 1}^{m} n_{G_{p}}$

The time cost of our proposed schemes CRFsDP and CAdaBoostDP mainly consists of two parts: building the local ensemble models and integrating the models. Suppose $t_{p}$ is the time cost of building a CART for party p, and $t_{ag}$ is the time cost of the central agent. Because all parties can build their ensemble models in parallel, the time cost of building the ensemble models for the whole system is determined by the party with the heaviest task. The time cost of the proposed random forests under differential privacy is $\max \{t_{p}\}_{1}^{m} + t_{ag}$ , because the trees of a random forest can also be built in parallel. The time cost of the proposed AdaBoost under differential privacy is $\max \{T t_{p}\}_{1}^{m} + t_{ag}$ , since each party must iterate T rounds in sequence to build its AdaBoostDP classifier.

The communication cost of CRFsDP and CAdaBoostDP is the sum of the payloads that each party sends to and receives from the central agent. Suppose $n_{G_{p}}$ denotes the transmission payload of the local ensemble model of party p; then the communication cost of the whole system is $2 \sum_{p = 1}^{m} n_{G_{p}}$ .
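The cost expressions in Table 2 can be evaluated directly. The short Python sketch below does so for made-up per-party tree-building times, agent time and payload sizes; the function names and all numbers are hypothetical and serve only to show how the two schemes differ.

```python
def time_cost(tree_times, agent_time, rounds=1):
    """Slowest party plus the central agent.

    rounds=1 corresponds to CRFsDP (trees built in parallel);
    rounds=T corresponds to CAdaBoostDP (T sequential boosting rounds).
    """
    return max(rounds * t for t in tree_times) + agent_time


def communication_cost(payloads):
    # Each party's model payload is counted once for sending and once for receiving.
    return 2 * sum(payloads)


tree_times = [1.0, 2.5, 2.0]   # hypothetical t_p for three parties
agent_time = 0.5               # hypothetical t_ag
payloads = [10, 20, 30]        # hypothetical n_{G_p}

print(time_cost(tree_times, agent_time))               # CRFsDP
print(time_cost(tree_times, agent_time, rounds=100))   # CAdaBoostDP with T = 100
print(communication_cost(payloads))
```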

7. Experiments

7.1. Configuration

We implement the proposed CRFsDP and CAdaBoostDP in Python on an Intel i7-4510U PC with 8 GB RAM. We set $m = 5$ and $T = 100$ , randomly divide the data set into five parts, and evaluate the performance of the final integrated classifier on the test data set. We run each group of experiments five times and take the average as the final result.

7.2. Data set

The experimental data set comes from the UCI machine learning repository [35]. The original data come from a well-known bank, and the subjects are credit card holders of that bank. Table 3 briefly describes the 23 attributes in the data set; the class variable is binary and indicates whether a credit card holder is credible. We use 80% of the data set for training and the rest for testing.

Table 3
Attributes description for credit card data set

Attribute	Description
X1	Amount of the credit
X2	Gender
X3	Education
X4	Marital status
X5	Age
X6–X11	History of past payment
X12–X17	Amount of bill
X18–X23	Amount of previous payment
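The data preparation described in Sections 7.1 and 7.2 (an 80/20 train/test split and a random division among $m = 5$ parties) could look roughly like the Python sketch below. It is a sketch under assumptions: we assume that only the training portion is divided among the parties and that the parts are of nearly equal size, and the function name, seed and stand-in records are hypothetical.

```python
import random


def split_and_partition(records, m=5, train_fraction=0.8, seed=0):
    """Shuffle the records, hold out a test set, and deal the rest to m parties."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    # Assumption: each party receives a disjoint, (nearly) equal share of the training data.
    parts = [train[i::m] for i in range(m)]
    return parts, test


records = list(range(1000))            # stand-in for the rows of the data set
parts, test = split_and_partition(records)
print([len(part) for part in parts], len(test))
```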

7.3. Metrics

$F1$ [25] is a common index measuring both precision and recall, and it is widely used to evaluate the performance of classifiers. $\begin{matrix} (11) & F1 = \frac{2 \times precision \times recall}{precision + recall} \in [0, 1] \end{matrix}$ where $recall = \frac{TP}{TP + FN}$ , and $precision = \frac{TP}{TP + FP}$ . In our experiments, the positive individuals are credible clients and the negative ones are non-credible clients. Thus, $TP$ refers to the number of credible clients correctly predicted by the classifier, $FN$ corresponds to the number of credible clients wrongly predicted as non-credible clients, and $FP$ is the number of non-credible clients misclassified as credible ones. We use the $F1$ score to measure the performance of the final integrated classifier on the test data set. A high $F1$ score indicates that both precision and recall are reasonably high.
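For concreteness, Equation (11) can be computed from confusion-matrix counts as in the following Python sketch. The function name and the example counts are hypothetical, and returning 0 when a denominator is zero is our own convention rather than something specified in the paper.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives (Eq. 11)."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # assumed convention: 0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0      # assumed convention: 0 if undefined
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 80 credible clients found, 20 missed, 20 false alarms.
print(f1_score(tp=80, fp=20, fn=20))
```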

In our experiments, we also evaluate the performance of a classifier by the area under the ROC curve (AUC). A receiver operating characteristic (ROC) curve is useful for organizing classifiers and visualizing their performance, and the AUC reduces ROC performance to a single scalar value representing expected performance [10]. The value of AUC is between 0 and 1, and a value closer to 1 indicates better generalization capability of a classifier.
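One standard way to compute AUC without drawing the curve is the pairwise-ranking formulation: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The Python sketch below uses this formulation; it is not necessarily how the experiments computed AUC, and the function name and scores are hypothetical.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for two credible (1) and two non-credible (0) clients.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

This quadratic-time version is meant for clarity; a sort-based computation would be preferable for large test sets.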

7.4. Results on different maximum CART height

Fig. 4.

Results on different maximum CART heights under the same privacy budget for each party.

We first explore the effect of the maximum CART height on the performance of the integrated classifier. To simplify the presentation of the experimental results, we set the same privacy budget for each party. In our experiment, we set the maximum height of CART to $h = 3, 4, 5, 6, 7$ , and the privacy budget allocated to each CART by each party to $ϵ_{c} = 0.10, 0.25, 0.50, 0.75, 1.00$ . We also use no-dp (without differential privacy) as the baseline to evaluate the performance loss caused by guaranteeing the parties’ privacy at different levels. The results of CRFsDP and CAdaBoostDP are shown in Fig. 4.

An obvious trend in Fig. 4 is that, given a fixed privacy budget ϵ, the values of $F1$ and AUC increase as h increases. This is because a taller CART builds more accurate classification rules for the data set. However, h must be chosen carefully: the time cost of the classifier grows with the height of CART, and the classifier’s performance does not improve noticeably once h exceeds a threshold (6 for the experimental data set). Therefore, setting $h = 6$ is a reasonable configuration. Furthermore, comparing Fig. 4(a) and 4(b), the $F1$ score of CRFsDP is larger than that of CAdaBoostDP under the same conditions. A probable reason is that, in CAdaBoostDP, previously added noise negatively affects the weight distribution of the current iteration via the error rate. The AUC values in Fig. 4(c) and 4(d) support this point. Moreover, compared with the baseline, neither $F1$ nor AUC decreases dramatically after the differential privacy mechanism is introduced.

7.5. Results on personalized privacy budget

Table 4
Different levels of privacy budgets

No.	Privacy Budgets	Median	No.	Privacy Budgets	Median
G1	0.10, 0.01, 0.02, 0.01, 0.12	0.02	G6	0.03, 0.69, 0.72, 0.73, 0.86	0.72
G2	0.05, 0.10, 0.15, 0.20, 0.25	0.15	G7	0.05, 0.12, 0.78, 0.98, 0.98	0.78
G3	0.12, 0.23, 0.24, 0.39, 1.00	0.24	G8	0.70, 0.75, 0.80, 0.90, 1.00	0.80
G4	0.50, 0.50, 0.50, 0.50, 0.50	0.50	G9	0.90, 0.85, 0.81, 0.85, 0.97	0.85
G5	0.10, 0.25, 0.50, 0.75, 1.00	0.50	G10	1.00, 1.00, 1.00, 1.00, 1.00	1.00

Next, we analyze the effect of personalized privacy budgets on the performance of the classifier given a fixed CART height. Following the analysis in the previous section, we set $h = 6$ . Table 4 lists ten groups (G1–G10) of personalized privacy budgets for the parties. In Fig. 5, we plot the performance of the classifiers in terms of $F1$ and AUC against the median of the personalized privacy budgets in each group.
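The medians reported in Table 4 can be reproduced with Python's standard library, as in the short sketch below. The dictionary layout and variable names are our own; the budget values are copied from Table 4.

```python
from statistics import median

# Privacy budgets of the five parties in each group, copied from Table 4.
groups = {
    "G1": [0.10, 0.01, 0.02, 0.01, 0.12],
    "G2": [0.05, 0.10, 0.15, 0.20, 0.25],
    "G3": [0.12, 0.23, 0.24, 0.39, 1.00],
    "G4": [0.50, 0.50, 0.50, 0.50, 0.50],
    "G5": [0.10, 0.25, 0.50, 0.75, 1.00],
    "G6": [0.03, 0.69, 0.72, 0.73, 0.86],
    "G7": [0.05, 0.12, 0.78, 0.98, 0.98],
    "G8": [0.70, 0.75, 0.80, 0.90, 1.00],
    "G9": [0.90, 0.85, 0.81, 0.85, 0.97],
    "G10": [1.00, 1.00, 1.00, 1.00, 1.00],
}

# The median of each group is the privacy level used on the x-axis of Fig. 5.
medians = {name: median(budgets) for name, budgets in groups.items()}
for name, value in medians.items():
    print(name, value)
```

Note that G4 and G5 share the same median although only G5 uses different budgets for different parties.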

Fig. 5.

Results on different levels of personalized privacy budgets. The median of privacy budgets for each group is used to measure privacy level.

Fig. 5 shows that the values of $F1$ and AUC of both CRFsDP and CAdaBoostDP increase considerably as the median privacy budget varies from 0.02 to 1.00. Also, CRFsDP performs better than CAdaBoostDP under the same privacy budget setting, which is consistent with our observation in the previous section. Fig. 5 also suggests that setting personalized privacy budgets can improve the performance of the integrated classifier. For example, G4 and G5 have the same median, but G5 uses personalized privacy budgets, while all parties have the same privacy budget in G4. Fig. 5 shows that the integrated classifier of G5 performs better in terms of both $F1$ and AUC. Another point is that the value of ϵ must be chosen carefully, since the performance of the integrated classifier degrades dramatically under a very small ϵ, while the privacy of the parties cannot be well protected if ϵ is too large. Therefore, we need to find a balance that protects each individual’s privacy properly without overly reducing the utility of the data. From the experimental results, $ϵ_{c} = 0.80$ is a reasonable balance for this data set.

7.6. Results on different number of parties

Fig. 6.

Results on execution time of the system under different numbers of parties.

Finally, we discuss the relationship between the number of parties taking part in the system and the execution time of the system. We set the number of parties to $m = 3, 4, 5, 6, 7$ , the CART height to $h = 6$ , and the privacy budget to $ϵ_{c} = 0.80$ . Fig. 6 shows the results. As the number of parties participating in the system increases, the time cost of the system does go up, but the growth is not linear. This indicates that our scheme can support multiple parties jointly building the integrated classifier. In addition, Fig. 6 shows that CAdaBoostDP spends more time building the integrated classifier, which is consistent with our theoretical analysis.

8. Conclusion

In this paper, we investigate the problem of privacy-preserving ensemble learning. We propose a framework of privacy-preserving collaborative ensemble learning based on differential privacy that provides personalized privacy protection over distributed data sets. In the framework, each party builds a local ensemble model and decides how much information to share with others by configuring a personalized privacy budget. All parties send their differentially private models to the central agent to obtain a stronger final integrated classifier. We also implement the framework on two widely used ensemble learning algorithms: random forests and adaptive boosting. Theoretical analysis shows that the proposed scheme satisfies $ϵ_{p}$ -differential privacy for each party p. Experimental results on a real-life data set also demonstrate that our scheme achieves a satisfactory tradeoff between privacy and utility.

Acknowledgements

The work in this paper was supported by the National Natural Science Foundation of China (No. 61672118), the Fundamental Research Funds for the Central Universities (No. 106112016CDJZR185513), and the Graduate Scientific Research and Innovation Foundation of Chongqing, China (Nos. CYS15029 and CYB16046).

References

1. Bojarski, Choromanska, Choromanski and LeCun, Differentially- and non-differentially-private random decision trees, 2014, arXiv preprint arXiv:1410.6973.

2. Breiman, Random forests, Machine Learning 45(1) (2001), 5–32. doi:10.1023/A:1010933404324.

3. Breiman, Friedman, R.A. Olshen and C.J. Stone, Classification and Regression Trees, CRC Press, 1984.

4. Chen, B.C.M. Fung, P.S. Yu and B.C. Desai, Correlated network data publication via differential privacy, The VLDB Journal 23(4) (2014), 653–676. doi:10.1007/s00778-013-0344-8.

5. Choi, Kim and Suh, Classification model for detecting and managing credit loan fraud based on individual-level utility concept, ACM SIGMIS Database 44(3) (2013), 49–67. doi:10.1145/2516955.2516959.

6. Dwork, Differential privacy, in: Proceedings of ACM International Colloquium on Automata, Languages and Programming (ICALP), 2006, pp. 1–12.

7. Dwork, McSherry, Nissim and Smith, Calibrating noise to sensitivity in private data analysis, in: Proceedings of Conference on Theory of Cryptography (TCC), 2006, pp. 265–284.

8. Dwork, G.N. Rothblum and Vadhan, Boosting and differential privacy, in: Proceedings of IEEE Symposium on Foundations of Computer Science (FOCS), Vol. 26, 2010, pp. 51–60.

9. Elsalamouny and Gambs, Differential privacy models for location-based services, Transactions on Data Privacy 9(1) (2016), 15–48.

10. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27(8) (2006), 861–874. doi:10.1016/j.patrec.2005.10.010.

11. Freund and R.E. Schapire, A decision-theoretic generalization of online learning and an application to boosting, in: Proceedings of European Conference on Computational Learning Theory (EuroCOLT), 1995, pp. 23–37. doi:10.1007/3-540-59119-2_166.

12. Gambs, Kégl and Aïmeur, Privacy-preserving boosting, Data Mining and Knowledge Discovery 14(1) (2007), 131–170. doi:10.1007/s10618-006-0051-9.

13. Goldreich, Foundations of Cryptography: Volume 2, Basic Applications, Cambridge University Press, 2009.

14. Häyrinen, Saranto and Nykänen, Definition, structure, content, use and impacts of electronic health records: A review of the research literature, International Journal of Medical Informatics 77(5) (2008), 291–304. doi:10.1016/j.ijmedinf.2007.09.001.

15. Jagannathan, Pillaipakkamnatt and R.N. Wright, A practical differentially private random decision tree classifier, in: Proceedings of IEEE International Conference on Data Mining Workshops (ICDMW), 2009, pp. 114–121.

16. Kifer and Machanavajjhala, No free lunch in data privacy, in: Proceedings of ACM International Conference on Management of Data (SIGMOD), 2011, pp. 1513–1522.

17. Li, Li and Venkatasubramanian, T-closeness: Privacy beyond K-anonymity and L-diversity, in: Proceedings of IEEE International Conference on Data Engineering (ICDE), 2007, pp. 106–115.

18. Li, Bai and C.K. Reddy, A distributed ensemble approach for mining healthcare data under privacy constraints, Information Sciences 330 (2016), 245–259. doi:10.1016/j.ins.2015.10.011.

19. W.-Y. Loh, Classification and regression trees, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1(1) (2011), 14–23.

20. Mabu, Obayashi and Kuremoto, Ensemble learning of rule-based evolutionary algorithm using multi-layer perceptron for supporting decisions in stock trading problems, Applied Soft Computing 36(C) (2015), 357–367. doi:10.1016/j.asoc.2015.07.020.

21. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, L-diversity: Privacy beyond K-anonymity, in: Proceedings of IEEE International Conference on Data Engineering (ICDE), 2006, p. 24.

22. McSherry and Talwar, Mechanism design via differential privacy, in: Proceedings of IEEE Symposium on Foundations of Computer Science (FOCS), 2007, pp. 94–103.

23. Melle and J.-L. Dugelay, Scrambling faces for privacy protection using background self-similarities, in: Proceedings of IEEE International Conference on Image Processing (ICIP), 2014, pp. 6046–6050.

24. Mingers, An empirical comparison of selection measures for decision-tree induction, Machine Learning 3(4) (1989), 319–342.

25. K.P. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

26. N.C. Oza and Russell, Online ensemble learning, in: Proceedings of AAAI Conference on Artificial Intelligence, 2000.

27. Patil and Singh, Differential private random forest, in: Proceedings of IEEE International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2014, pp. 2623–2630.

28. B.I.P. Rubinstein, P.L. Bartlett, Huang and Taft, Learning in a large function space: Privacy-preserving mechanisms for SVM learning, Privacy and Confidentiality 4(1) (2012), 65–100.

29. Sun, Sudo and Taniguchi, Visual concept detection of web images based on group sparse ensemble learning, Multimedia Tools and Applications 75(3) (2016), 1409–1425. doi:10.1007/s11042-014-2179-8.

30. Sweeney, K-anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(05) (2002), 557–570. doi:10.1142/S0218488502001648.

31. Szűcs, Random response forest for privacy-preserving classification, Journal of Computational Engineering 2013(309) (2013), 1–6. doi:10.1155/2013/397096.

32. Vaidya, Shafiq, Fan, Mehmood and Lorenzi, A random decision tree framework for privacy-preserving data mining, IEEE Transactions on Dependable and Secure Computing 11(5) (2014), 399–411. doi:10.1109/TDSC.2013.43.

33. Wang, Chen and Zhang, Outsourcing high-dimensional healthcare data to cloud with personalized privacy preservation, Computer Networks 88(C) (2015), 136–148. doi:10.1016/j.comnet.2015.06.014.

34. Xiao and Tao, Personalized privacy preservation, in: Proceedings of ACM International Conference on Management of Data (SIGMOD), 2006, pp. 229–240.

35. I.-C. Yeh, Default of credit card clients data set, http://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients.

36. Yu, Big privacy: Challenges and opportunities of privacy study in the age of big data, IEEE Access 4 (2016), 2751–2763. doi:10.1109/ACCESS.2016.2577036.

37. Zhang, Cormode, C.M. Procopiuc, Srivastava and Xiao, Privbayes: Private data release via Bayesian networks, in: Proceedings of ACM International Conference on Management of Data (SIGMOD), 2014, pp. 1423–1434.

38.