Anonymizing transactional datasets

Abstract

In this paper, we study the privacy breach caused by unsafe correlations in transactional data where individuals have multiple tuples in a dataset. We provide two safety constraints to guarantee safe correlation of the data: (1) the safe grouping constraint to ensure that quasi-identifier and sensitive partitions are bounded by l-diversity and (2) the schema decomposition constraint to eliminate non-arbitrary correlations between non-sensitive and sensitive values, protecting privacy while at the same time increasing the utility of the data for aggregate analysis. In our technique, values are grouped together in unique partitions that enforce l-diversity at the level of individuals. We also propose an association preserving technique to increase the ability to learn from and analyze the anonymized data. To evaluate our approach, we conduct a set of experiments to determine the privacy breach and investigate the anonymization cost of safe grouping and preserving associations.

Keywords

Data privacy; data anonymization; transactional data

1. Introduction

Data outsourcing is on the rise, and the emergence of cloud computing provides additional benefits to outsourcing. Privacy regulations pose a challenge to outsourcing, as the very flexibility provided makes it difficult to prevent trans-border data flows, to ensure protection and separation of clients, and to meet other constraints that may be required to outsource data. An alternative is encrypting the data [5]; while this protects privacy, it also prevents beneficial use of the data such as value-added services by the cloud provider (e.g., address normalization), or aggregate analysis of the data (and use/sale of the analysis) that can reduce the cost of outsourcing. Generalization-based data anonymization [9,12,18,19] provides a way to protect privacy while allowing aggregate analysis, but does not make sense in an outsourcing environment where the client wants to be able to retrieve the original data values.

An alternative is to use bucketization, as in the anatomy [24], fragmentation [4], or slicing [11] models. Such a database system has been developed [15,16]. The key idea is that identifying and sensitive information are stored in separate tables, with the join key encrypted. To support analysis at the server, data items are grouped into buckets; the mapping between buckets (but not between items in the bucket) is exposed to the server. An example is given in Table 1 where attribute DrugName is sensitive: Table 1(b) is an anatomized version of table prescription with attributes separated into ${Prescription}_{QIT}$ and ${Prescription}_{SNT}$ .

Table 1
Prescription anonymized

The bucket size and grouping of tuples into buckets ensures that privacy constraints (such as k-anonymity [18,19] or l-diversity [12]) are satisfied.

Complications arise when extending this approach to transactional datasets. Even with generalization-based approaches, it has been shown that transactions introduce new challenges. Approaches such as $(X, Y)$ -privacy [23] and $k^{m}$ -anonymity [21] include restrictions on the correlation of quasi-identifying values but still endure some limitations when applied to bucketization approaches.

We give examples of this based on Table 1(b). The anonymized table satisfies the $(X, Y)$ -privacy and $(2, 2)$ -diversity privacy constraints [13]; given the 2-diverse table, an adversary should at best be able to link a patient to a drug with probability $1 / 2$ .

Inter-group dependencies occur when an adversary knows certain facts about drug use, e.g., Retinoic acid is a maintenance drug taken over a long period of time. As P1 is the only individual who appears in all groups where Retinoic acid appears, it is likely that P1 is taking this drug. Note that this fact can either be background knowledge, or learned from the data.

Intra-group dependencies occur where the number of transactions for a single individual within a group results in an inherent violation of l-diversity (this would most obviously occur if all transactions in a group were for the same individual). By considering this separately for transactional data, rather than simply looking at all tuples for an individual as a single “data instance”, we gain some flexibility.

Another privacy breach might occur when an adversary can use the (anatomized) data to link sensitive information to a non-sensitive attribute (one that is neither (quasi-)identifying nor sensitive). Such correlation between attributes can then be used to link the sensitive values to the identifying information. For instance, let us assume a (publicly known) correlation between Manufacturer and DrugName; e.g., only Raphe Healthcare makes Retinoic acid. This enables the adversary to determine a drug taken by the individual. More subtle is the case where the dependency occurs in the other direction, e.g., a manufacturer might produce only one drug (as shown with Envie de Neuf). To cope with such cases, anatomy must produce two sensitive tables as in [6] (one for each sensitive attribute), which in turn inhibits the ability to learn such correlations from the data.

In this paper, we present a method to counter such privacy violations while preserving data utility. Our contributions can be summarized as follows:

An in-depth study of privacy violation due to correlation of individuals’ related tuples in bucketization techniques.

A safe grouping technique to eliminate privacy violation due to the transactional nature of the data. Our safe grouping technique ensures that quasi-identifier and sensitive partitions respect the l-diversity privacy constraint.

A schema decomposition technique to eliminate non-arbitrary correlations between non-sensitive and sensitive values. We consider that safely decomposing the table can be used to expose the correlation of non-identifying information to sensitive values in a way that protects privacy while allowing analysis of data.

The approach is based on knowing (or learning) the correlations, and forming buckets with a common antecedent to the correlation. This protects against inter-group dependencies. Identifiers are then suppressed where necessary (in an outsourcing model, this corresponds to encrypting just the portion of the tuple in the identifier table) to ensure the privacy constraint is met (including protection against intra-group correlation).

In the next section, we present our adversary model. Section 3 gives further background on prior work and its limitations in dealing with this problem. In Section 4, we define the basic notations and key concepts used in the rest of the paper. A definition of correlation-based privacy violation in transactional datasets is given in Section 5. In Section 6, we present our safety constraints, safe grouping and safe decomposition, which form the basis of our anonymization method. Section 7 gives our safe grouping and preserving associations algorithms. A set of experiments to evaluate both the practical efficiency and the loss of data utility (suppression/encryption) is given in Section 8. We conclude with a discussion of next steps to move this work toward practical use.

2. Adversary model

In our adversary model, we assume that the adversary has knowledge of the transactional nature of the dataset. We also assume that he/she has outside information on correlations between sensitive data items that leads to a high probability that certain sets of items would belong to the same individual. This is illustrated in the Introduction (example 1) where the fact that the drug Retinoic acid is known to be taken for a long period of time makes it possible to link it to patient P1.

We do not care about the source of such background information; it may be public knowledge, or it may be learned from the anatomized data itself. (We view learning such knowledge from the data as beneficial aggregate analysis of the data.)

We explicitly distinguish between identifying information (which in bucketization can be both explicit identifiers such as name and quasi-identifiers such as address that can be linked to identity using public information), non-identifying, non-sensitive information (such as the manufacturer of a medication), and sensitive information (such as the name of the medication being taken). We assume that the adversary does not have prior knowledge of the link between non-sensitive, non-identifying attribute values and specific individuals, beyond what can be inferred from general knowledge about frequency of attributes; if an adversary could have such knowledge, these become identifying information. This same assumption is made with generalization methods such as l-diversity or k-anonymity, where such knowledge weakens the privacy provided. Our method deals with potential information disclosure from correlation between identifying or non-sensitive attributes and sensitive attributes.

3. Related work

In [23], the authors consider that any transaction known by the adversary could reveal additional information that might be used to uncover a sensitive linking between a quasi-identifier and a sensitive value. They define $(X, Y)$ -privacy to ensure on one hand that each value of X is linked to at least k different values of Y, and on the other hand, no value of Y can be inferred from a value of X with confidence higher than a designated threshold. In a similar approach proposed in [21], the authors extend k-anonymity with $k^{m}$ -anonymity requiring that each combination of at most m items appears in at least k transactions, where m is the maximum number of items per transaction that could be known by the adversary. (Also related is the problem of trail re-identification [14].) As demonstrated in the example in Table 1(b), these techniques are limited when it comes to bucketization, as more subtle inter- and intra-group correlations may lead to a breach of l-diversity. Even in [22], where the authors apply $k^{m}$ -anonymity while preserving the original terms by disassociating tuples into record and term chunks in the manner of bucketization techniques, a privacy breach can still occur due to the lack of diversity. In particular, when ensuring $k^{m}$ -anonymity without using generalization, the threat of the homogeneity attack can be maximized when the dataset is reconstructed.
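The $k^{m}$ -anonymity requirement quoted above can be checked by brute force on small inputs. The following is a minimal sketch, not the algorithm of [21]: the function name `is_km_anonymous` and the toy transactions are assumptions made for illustration, and the enumeration of itemsets is exponential in m.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(transactions, k, m):
    """Return True if every itemset of size <= m that occurs in the data
    is contained in at least k transactions (brute-force check)."""
    support = Counter()
    for items in transactions:
        unique_items = sorted(set(items))
        for size in range(1, m + 1):
            for combo in combinations(unique_items, size):
                support[combo] += 1
    return all(count >= k for count in support.values())


# Hypothetical toy transactions (not taken from the paper).
transactions = [
    {"a", "b"},
    {"a", "b"},
    {"a", "c"},
    {"a", "c"},
]

print(is_km_anonymous(transactions, k=2, m=2))  # True
print(is_km_anonymous(transactions, k=3, m=2))  # False: item "b" occurs only twice
```

Note that passing this check says nothing about the diversity of sensitive values inside a bucket, which is the limitation discussed above.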

In [11] the authors propose a slicing technique to provide effective protection against membership disclosure. It is similar to how we approach the association between non-sensitive and sensitive attributes where separating the data cannot be done in a trivial manner without considering the correlations. However, the slicing technique as proposed (i.e., grouping quasi-identifier and sensitive attributes based on their correlation threshold) remains vulnerable to identity disclosure. This is due to the adversary’s knowledge of the transactional nature of the dataset, where he/she may still be able to associate an individual identifier with correlated sensitive values.

The authors in [6] discuss privacy violations in the anatomy privacy model [24] due to functional dependencies (FDs). In their approach, they propose to create QI-groups on the basis of an FD tree while grouping tuples based on the sensitive attribute to form l-diverse groups. Unfortunately, dealing with FDs is not sufficient, as less strict dependencies can still pose a threat.

In [3], the authors consider correlation as foreground knowledge that can be mined from anonymized data. They use the possible worlds model to compute the probability of associating an individual to a sensitive value based on a global distribution. In [8], a naïve Bayesian model is used to compute association probability. They used exchangeability [1] and De Finetti’s theorem [17] to model and compute patterns from the anonymized data. In [10], the authors deal with background knowledge that can be mined from the data. In their paper they focus mainly on what we consider negative correlations, which limits the ability to handle positive and exposed correlations. These papers address correlation in its general form, where the authors show how an adversary can violate the l-diversity privacy constraint through an estimation of such correlations in the anonymized data. As a separate matter, we consider that correlations due to transactions, where multiple tuples are related to the same individual, ensure that particular sensitive values can be linked to a particular individual when correlated in the same group (i.e., bucket). We go beyond this, addressing any correlation (either learned from the data or otherwise known) that explicitly violates the targeted privacy goal.

4. Formalization

Given a table T with a set of attributes $\{A_{1}, \dots, A_{b}\}$ , $t [A_{i}]$ refers to the value of attribute $A_{i}$ for the tuple t. Let U be the set of individuals of a specific population; for every $u \in U$ we use $T_{u}$ to denote the set of tuples in T related to the individual u. Attributes of a table T that we deal with in this paper are divided as follows:

Identifier ( $A^{id}$ ) represents an attribute that can be used (possibly with external information available to the adversary) to identify the individual associated with a tuple in a table. We distinguish two types of identifiers; sensitive and non-sensitive. For instance, the attribute Social Security Number is a sensitive identifier; as such it must be suppressed (encrypted). Non-sensitive identifiers are viewed as public information, and include both direct identifiers such as Patient ID in Table 1, and quasi-identifiers that in combination may identify an individual (such as ⟨Gender, Birthdate, Zipcode⟩, which uniquely identifies many individuals).

Sensitive attribute ( $A^{s}$ ) contains sensitive information that must not be linkable to an individual, but does not inherently identify an individual. In our example (Table 1(a)), the attribute DrugName is considered sensitive and should not be linked to an individual.

Non-sensitive attribute ( $A^{ns}$ ) represents an attribute that is neither inherently sensitive nor (quasi)identifying, such as the attribute Manufacturer in Table 1(a). (Note that we do allow non-sensitive attributes to be correlated with sensitive or identifying attributes; this paper shows how to deal with such cases.)

Definition 1 (Equivalence class/QI-group [18]).

A quasi-identifier group (QI-group) is defined as a subset of tuples of $T = ⋃_{j = 1}^{m} {QI}_{j}$ such that, for any $1 ⩽ j_{1} \neq j_{2} ⩽ m$ , ${QI}_{j_{1}} \cap {QI}_{j_{2}} = \emptyset$ .

Note that for our purposes, this can include direct identifiers as well as quasi-identifiers; we stick with the QI-group terminology for compatibility with the broader anonymization literature.

Definition 2 (l-diversity [13]).

A table T is said to be l-diverse if each of the QI-groups ${QI}_{j}$ ( $1 ⩽ j ⩽ m$ ) is l-diverse; i.e., ${QI}_{j}$ satisfies the condition $c_{j} (v_{s}) / | {QI}_{j} | ⩽ 1 / l$ where

m is the total number of QI-groups in T,

$v_{s}$ is the most frequent value of $A^{s}$ ,

$c_{j} (v_{s})$ is the number of tuples of $v_{s}$ in ${QI}_{j}$ ,

$| {QI}_{j} |$ is the size (number of tuples) of ${QI}_{j}$ .
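The condition of Definition 2 translates directly into a frequency check per QI-group. The sketch below is illustrative only: the function names and the dictionary layout (group id mapped to a list of sensitive values) are assumptions, and an empty group is treated as not diverse by convention.

```python
from collections import Counter
from fractions import Fraction


def is_l_diverse(sensitive_values, l):
    """Check c_j(v_s) / |QI_j| <= 1/l for the sensitive values of one QI-group."""
    if not sensitive_values:
        return False  # assumption: an empty group is treated as not diverse
    most_frequent_count = Counter(sensitive_values).most_common(1)[0][1]
    return Fraction(most_frequent_count, len(sensitive_values)) <= Fraction(1, l)


def table_is_l_diverse(groups, l):
    """groups maps a group id to the list of sensitive values in that QI-group."""
    return all(is_l_diverse(values, l) for values in groups.values())


# Hypothetical group: three tuples of each of two drugs.
groups = {1: ["Retinoic acid"] * 3 + ["Azelaic acid"] * 3}

print(table_is_l_diverse(groups, l=2))  # True: 3/6 <= 1/2
print(table_is_l_diverse(groups, l=3))  # False: 3/6 > 1/3
```

Exact fractions are used instead of floating point so that boundary cases such as 3/6 versus 1/2 compare reliably.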

Definition 3 (Anatomy).

Given a table T, we say that T is anatomized if it is separated into a quasi-identifier table ( $T_{QIT}$ ) and a sensitive table ( $T_{SNT}$ ) as follows:

$T_{QIT}$ has a schema $(A_{1}, \dots, A_{d}, GID)$ where $A_{i}$ ( $1 ⩽ i ⩽ d$ ) is either a non-sensitive identifying or quasi-identifying attribute and GID is the group id of the QI-group.

$T_{SNT}$ has a schema $(GID, A_{d + 1}^{s})$ where $A_{d + 1}^{s}$ is the sensitive attribute in T.

To express correlation in transactional data we use the following notation $c d^{id} : A_{1}^{id}, \dots, A_{n}^{id} ⇢ A^{s}$ where $A_{i}^{id}$ is a non-sensitive identifying attribute and $A^{s}$ is a sensitive attribute, and $c d^{id}$ is a correlation dependency between attributes $A_{1}^{id}, \dots, A_{n}^{id}$ on one hand, and $A^{s}$ on the other. Table 2 shows the set of notations used in the paper.

Table 2
Notations

Notation	Description
T	A table containing individuals’ related tuples
$t_{i}$	A tuple of T
u	An individual described in T
$T_{u}$	A set of tuples related to individual u
A	An attribute of T
$A^{id}$	An identifying attribute of T
$A^{ns}$	A non-sensitive attribute of T
$A^{s}$	A sensitive attribute of T
${QI}_{j}$	A quasi-identifier group
$T^{*}$	Anonymized version of table T

Next, we present a formal description of the privacy violation that can be caused due to such correlations.

5. Correlation-based privacy violation

In this section, we formally define the privacy violation that might occur due to correlation.

5.1. Correlation between identifying and sensitive values

Inter-group correlation occurs when transactions for a single individual are placed in multiple QI-groups (as with P1, P3 and P4 in Table 1(a)). The problem arises when the values in different groups are related (as would happen with association rules); this leads to an implication that the values belong to the same individual. Formally:

Definition 4 (Inter QI-group correlation).

Given a correlation dependency of the form $c d^{id} : A^{id} ⇢ A^{s}$ over $T^{*}$ , we say that a privacy violation might exist if there are correlated values in a subset ${QI}_{j}$ $(1 ⩽ j ⩽ m)$ of $T^{*}$ such that $v_{id} \in π_{A^{id}} {QI}_{1} \cap \dots \cap π_{A^{id}} {QI}_{m}$ and $| π_{A^{s}} {QI}_{1} \cap \dots \cap π_{A^{s}} {QI}_{m} | < l$ where $v_{id} \in A^{id}$ is an individual identifying value, l is the privacy constant and an adversary knows of that correlation.

The example shown in Table 1 explains how an adversary with prior knowledge of the correlation, in this case that Retinoic acid must be taken multiple times, is able to associate the drug with the patient Roan (P1) due to their correlation in several QI-groups.
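One way to read Definition 4 operationally is to look, for every individual who appears in more than one QI-group, at the sensitive values shared by all of those groups. The sketch below follows that reading; it additionally assumes that at least one value must be shared for a correlation to exist, which the definition does not state explicitly. The function name and the toy rows are hypothetical and do not reproduce Table 1.

```python
def inter_group_risks(qit_rows, snt_rows, l):
    """qit_rows: (individual, gid) pairs; snt_rows: (gid, sensitive value) pairs.
    Returns, per individual found in several QI-groups, the sensitive values
    common to all of those groups when fewer than l values are shared."""
    groups_of = {}
    for individual, gid in qit_rows:
        groups_of.setdefault(individual, set()).add(gid)

    values_in = {}
    for gid, value in snt_rows:
        values_in.setdefault(gid, set()).add(value)

    risks = {}
    for individual, gids in groups_of.items():
        if len(gids) < 2:
            continue  # tuples confined to one group: no inter-group correlation
        shared = set.intersection(*(values_in.get(g, set()) for g in gids))
        # Assumption: an empty intersection means there is nothing to correlate.
        if 0 < len(shared) < l:
            risks[individual] = shared
    return risks


# Hypothetical anatomized data with two QI-groups.
qit = [("P1", 1), ("P2", 1), ("P1", 2), ("P3", 2)]
snt = [(1, "Retinoic acid"), (1, "Cytarabine"),
       (2, "Retinoic acid"), (2, "Adapalene")]

print(inter_group_risks(qit, snt, l=2))  # {'P1': {'Retinoic acid'}}
```

Each group taken alone is 2-diverse here, yet P1 is the only individual present in every group that contains Retinoic acid.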

An intra-group violation can arise if several correlated values are contained in the same QI-group. Here the problem is that this gives a count of tuples that likely belong to the same individual, which may limit it to a particular individual in the group. Table 3 is an example of intra QI-group correlation in which the number of tuples associated with patient Roan (P1) in $T_{QIT}$ exceeds the number of tuples associated with the most frequent sensitive value in $T_{SNT}$ . Formally, the intra QI-group correlation is defined as follows:

Table 3
Intra QI-group correlation

Patient ID	GID
Roan (P1)	1
Roan (P1)	1
Roan (P1)	1
Carl (P3)	1
Carl (P3)	1
Roan (P1)	1

GID	DrugName
1	Retinoic acid
1	Retinoic acid
1	Retinoic acid
1	Azelaic acid
1	Azelaic acid
1	Azelaic acid

Lemma 1 (Intra QI-group correlation).

Given a QI-group ${QI}_{j}$ ( $1 ⩽ j ⩽ m$ ) in $T^{*}$ that is l-diverse, we say that a privacy violation might occur if an individual’s related tuples are correlated in ${QI}_{j}$ such that $f ({QI}_{j}, u) + c_{j} (v_{s}) > | {QI}_{j} |$ where

$v_{s}$ is the most frequent $A^{s}$ value in ${QI}_{j}$ ,

$c_{j} (v_{s})$ is the number of tuples $t \in {QI}_{j}$ with $t [A^{s}] = v_{s}$ ,

u is the individual who has the most frequent tuples in ${QI}_{j}$ ,

$f ({QI}_{j}, u)$ is a function that returns the number of u’s related tuples in ${QI}_{j}$ ,

$| {QI}_{j} |$ is the size of ${QI}_{j}$ (number of tuples contained in ${QI}_{j}$ ) .

Proof.
Let r be the number of remaining sensitive values in ${QI}_{j}$ , $r = | {QI}_{j} | - c_{j} (v_{s})$ . If $f ({QI}_{j}, u) + c_{j} (v_{s}) > | {QI}_{j} |$ , this means that $f ({QI}_{j}, u) > | {QI}_{j} | - c_{j} (v_{s})$ and therefore $f ({QI}_{j}, u) > r$ . That is, there are e tuples related to individual u, with $f ({QI}_{j}, u) = e$ , to be associated with r sensitive values of ${QI}_{j}$ where $e > r$ . According to the pigeon-hole principle, at least one tuple t of $T_{u}$ will be associated with the sensitive value $v_{s}$ , which leads to a privacy violation. □
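The condition of Lemma 1 can be evaluated with two frequency counts. The following minimal sketch applies it to the single QI-group of Table 3; the function name is an assumption, and the two input lists are the identifier column and the sensitive column of that group (their row order is irrelevant).

```python
from collections import Counter


def intra_group_violation(individuals, sensitive_values):
    """Return True if f(QI_j, u) + c_j(v_s) > |QI_j| for one QI-group, where u
    is the individual with the most tuples and v_s the most frequent value."""
    if len(individuals) != len(sensitive_values):
        raise ValueError("both columns must describe the same QI-group")
    f_u = max(Counter(individuals).values())
    c_vs = max(Counter(sensitive_values).values())
    return f_u + c_vs > len(individuals)


# The QI-group of Table 3.
individuals = ["Roan (P1)"] * 3 + ["Carl (P3)"] * 2 + ["Roan (P1)"]
sensitive = ["Retinoic acid"] * 3 + ["Azelaic acid"] * 3

print(intra_group_violation(individuals, sensitive))  # True, since 4 + 3 > 6
```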

It would be nice if this lemma was “if and only if”, giving criteria where a privacy violation would NOT occur. Unfortunately, this requires making assumptions about the background knowledge available to an adversary (e.g., if an adversary knows that one individual is taking a certain medication, they may be able to narrow the possibilities for other individuals). This is an assumption made by all k-anonymity based approaches, but it becomes harder to state when dealing with transactional data.

Returning to Table 3, an adversary is able to associate both drugs (Retinoic acid and Azelaic acid) with patient Roan (P1) due to the correlation of their related tuples in the same QI-group.

5.2. Correlation between non-sensitive and sensitive values

Eliminating the threat related to the correlation of non-sensitive and sensitive attribute values can be done using naïve anatomization where multiple sensitive attributes are separated to preserve individuals’ privacy. This is to some extent inevitable under anatomy even if these attributes are not sensitive by nature (e.g., the attribute Manufacturer). This leads to a loss of utility as explained by the following two points:

Loss of correlation: Given the correlation between a non-sensitive and sensitive attributes $A^{ns}$ and $A^{s}$ , and a probability density function $G : {DS}_{A^{ns}, A^{s}} \to [0, 1]$ where $DS$ is a 2D space defined by attributes $A^{ns}$ and $A^{s}$ . Separating $A^{ns}$ and $A^{s}$ leads to a loss of correlation that could be estimated¹ to be $G (x) / | {QI}_{j} |$ where x is a 2D random variable in $DS$ , and ${QI}_{j}$ is a given QI-group with ( $1 ⩽ j ⩽ m$ ). In other terms, keeping non-sensitive attributes, related by correlation to the sensitive attribute, in the $T_{QIT}$ increases the uncertainty in terms of associations between their correlated values which eventually decreases the utility. Table 1(b) shows that without any background assumptions, a given manufacturer in ${QI}_{1}$ is producing Retinoic acid with a probability equal to $1 / 3$ .

¹ It refers to the association of a value in $T_{QIT}$ to a sensitive value in $T_{SNT}$ which, in case of anatomy, is estimated to be equal to $1 / l$ or $1 / (l + 1)$ depending on the QI-group size.

Loss of association: Consider a correlation between attributes $A^{ns}$ and $A^{s}$ , and assume that the association between an individual u and $A^{ns}$ is considered non-sensitive. In a naive anatomization, this association is lost and the quality of information for aggregate analysis is affected. For instance, the association between patient and manufacturer is in itself neither sensitive nor confidential. Preserving the association between patient and manufacturer in the cases where there is no privacy violation is therefore important, as it reduces information loss.

6. Privacy preserving from unsafe correlations

As we have shown in the previous section, bucketization techniques do not cope well with correlation. This is due to the transactional nature of the data, where an individual might be represented by several tuples that could lead to identifying his/her sensitive values, and/or to the ability to link sensitive to non-sensitive attribute values. In order to guarantee safety, this section presents our two safety constraints: safe grouping and safe decomposition.

Safety constraint (Safe grouping).

Given a correlation dependency in the form of $c d^{id} : A^{id} ⇢ A^{s}$ , safe grouping is satisfied iff

$\forall u \in U$ , the subset $T_{u}$ of T is contained in one and only one quasi-identifier group ${QI}_{j}$ ( $1 ⩽ j ⩽ m$ ) such that ${QI}_{j}$ respects l-diversity and contains at least k subsets $T_{u_{1}}, \dots, T_{u_{k}}$ where $u_{1}, \dots, u_{k}$ are k distinct individuals of the population, and

$Pr (u_{i_{1}} | {QI}_{j}) = Pr (u_{i_{2}} | {QI}_{j}) ⩽ 1 / l$ where $u_{i_{1}}, u_{i_{2}}, i_{1} \neq i_{2}$ are two distinct individuals in ${QI}_{j}$ with ( $1 ⩽ i ⩽ k$ ) and $Pr (u_{i} | {QI}_{j})$ is the probability of $u_{i}$ in ${QI}_{j}$ .

Safe grouping ensures that an individual’s tuples are grouped in one and only one QI-group that is at the same time l-diverse and respects a minimum diversity for identity attribute values, and that all subsets $T_{u}$ in ${QI}_{j}$ have an equal number of tuples.
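The conditions of the safe grouping constraint can be verified mechanically on an anatomized table. The sketch below is a checker, not the grouping algorithm of the paper; the function name is hypothetical, and it assumes that suppressed identifiers (shown as ∗ in the tables and represented here by `None`) count toward the group size but are not attributed to any individual. The rows are those of Table 4.

```python
from collections import Counter, defaultdict
from fractions import Fraction


def satisfies_safe_grouping(qit_rows, snt_rows, l, k):
    """qit_rows: (individual or None, gid); snt_rows: (gid, sensitive value)."""
    tuples_of = defaultdict(Counter)  # gid -> number of tuples per individual
    size = Counter()                  # gid -> |QI_j|
    home = {}                         # individual -> the gid holding T_u
    for individual, gid in qit_rows:
        size[gid] += 1
        if individual is None:        # suppressed identifier
            continue
        if home.setdefault(individual, gid) != gid:
            return False              # T_u is spread over several QI-groups
        tuples_of[gid][individual] += 1

    sensitive = defaultdict(Counter)
    for gid, value in snt_rows:
        sensitive[gid][value] += 1

    bound = Fraction(1, l)
    for gid, n in size.items():
        counts = tuples_of[gid]
        if len(counts) < max(k, 1):
            return False              # fewer than k distinct individuals
        if len(set(counts.values())) != 1:
            return False              # individuals are not equally represented
        if Fraction(next(iter(counts.values())), n) > bound:
            return False              # Pr(u | QI_j) exceeds 1/l
        if Fraction(max(sensitive[gid].values(), default=n), n) > bound:
            return False              # the group is not l-diverse
    return True


qit = ([("Carl (P3)", 1)] * 4 + [("Roan (P1)", 1)] * 4 + [(None, 1)]
       + [("Alice (P6)", 2), ("Bob (P5)", 2)]
       + [("Elyse (P2)", 3), ("Lisa (P4)", 3), (None, 3), (None, 3)])
snt = ([(1, "Azelaic acid")] * 3 + [(1, "Retinoic acid")] * 3
       + [(1, "Mild exfoliation")] * 2 + [(1, "Cytarabine")]
       + [(2, "Adapalene"), (2, "Epsom. magnesii")]
       + [(3, "Mild exfoliation"), (3, "Azelaic acid"),
          (3, "Cytarabine"), (3, "Retinoic acid")])

print(satisfies_safe_grouping(qit, snt, l=2, k=2))  # True

# Revealing the suppressed tuple of group 1 as Roan's breaks the equal-count rule.
revealed = [("Roan (P1)", g) if (p is None and g == 1) else (p, g) for p, g in qit]
print(satisfies_safe_grouping(revealed, snt, l=2, k=2))  # False
```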

Table 4
Prescription respecting our safety constraints

Patient ID	Country	GID
Carl (P3)	France	1
Carl (P3)	France	1
Carl (P3)	France	1
Carl (P3)	France	1
Roan (P1)	United States	1
Roan (P1)	United States	1
Roan (P1)	United States	1
Roan (P1)	United States	1
∗	United States	1
Alice (P6)	United States	2
Bob (P5)	Columbia	2
Elyse (P2)	United States	3
Lisa (P4)	Columbia	3
∗	Columbia	3
∗	Columbia	3

GID	DrugName	Manufacturer
1	Azelaic acid	Raphe Healthcare
1	Cytarabine	Jai Radhe
1	Azelaic acid	Raphe Healthcare
1	Mild exfoliation	Envie De Neuf
1	Mild exfoliation	Envie De Neuf
1	Retinoic acid	Raphe Healthcare
1	Retinoic acid	Raphe Healthcare
1	Azelaic acid	Raphe Healthcare
1	Retinoic acid	Raphe Healthcare
2	Adapalene	Jai Radhe
2	Epsom. magnesii	PQ Corp.
3	Mild exfoliation	Envie De Neuf
3	Azelaic acid	Gep-Tek
3	Cytarabine	Jai Radhe
3	Retinoic acid	Raphe Healthcare

Table 4 describes a quasi-identifier group ( ${QI}_{1}$ ) that respects safe grouping where, on one hand, we assume that there are no other QI-groups containing $P 1$ and $P 3$ and, on the other hand, one tuple from $T_{P 1}$ is anonymized to guarantee that $Pr (P 1 | {QI}_{1}) = Pr (P 3 | {QI}_{1}) ⩽ 1 / 2$ . Note that we have suppressed some data in order to meet the constraint; this is in keeping with privacy models that use partial suppression by replacing an individual’s values with a * to preserve privacy as in [12,19,20] or encryption as in the model in [15] where some data is left encrypted, and only “safe” data is revealed.

Lemma 2.

Let ${QI}_{j}$ for ( $1 ⩽ j ⩽ m$ ) be a QI-group that includes k individuals, if ${QI}_{j}$ satisfies safe grouping then k is at least equal to l.

Proof.

Consider an individual u in ${QI}_{j}$ . According to safe grouping, $Pr (u | {QI}_{j}) ⩽ 1 / l$ . Now, $Pr (u | {QI}_{j})$ is equal to $f ({QI}_{j}, u) / | {QI}_{j} |$ where $f ({QI}_{j}, u) = | {QI}_{j} | / k$ represents the number of tuples related to individual u in ${QI}_{j}$ (under safe grouping, all k individuals have the same number of tuples). Hence, $Pr (u | {QI}_{j}) = 1 / k ⩽ 1 / l$ and $k ⩾ l$ . □

Corollary 1 (Correctness).

Given an anonymized table $T^{*}$ that respects safe grouping, and a correlation dependency of the form $c d^{id} : A^{id} ⇢ A^{s}$ , an adversary cannot correctly associate an individual u with a sensitive value $v_{s}$ with a probability $Pr (A^{s} = v_{s}, u | T^{*})$ greater than $1 / l$ .

Proof.
Safe grouping guarantees that the tuples $T_{u}$ related to individual u are contained in one and only one QI-group ( ${QI}_{j}$ ), which means that the possible association of u to $v_{s}$ is limited to the set of correlated values that are contained in ${QI}_{j}$ . Hence, $Pr (A^{s} = v_{s}, u | T^{*})$ can be written as $Pr (A^{s} = v_{s}, u | {QI}_{j})$ . On the other hand, $Pr (A^{s} = v_{s}, u | {QI}_{j}) = \frac{Pr (A^{s} = v_{s}, u)}{\sum_{i = 1}^{k} Pr (A^{s} = v_{s}, u_{i})}$ where k is the number of individuals in ${QI}_{j}$ and $Pr (A^{s} = v_{s}, u_{i})$ is the probability of associating individual $u_{i}$ to a sensitive value $v_{s}$ . Recall that safe grouping guarantees that for a given individual $u_{i}$ , $Pr (A^{s} = v_{s}, u_{i})$ is at most equal to $1 / l$ . Summarizing, $Pr (A^{s} = v_{s}, u | {QI}_{j})$ is at most equal to $1 / k$ where $k ⩾ l$ according to Lemma 2. □

We can estimate,² for example, $Pr (A^{s} = Retinoic acid, A^{id} = P 1 | T^{*})$ to be $4 / 5$ , where it is possible to associate Roan (P1) with Retinoic acid in 4 of 5 QI-groups as shown in Table 1(b). However, as can be seen from Table 4, safe grouping guarantees that $Pr (A^{s} = Retinoic acid, A^{id} = P 1 | T^{*})$ remains limited to the possible association of values in ${QI}_{1}$ and is thus bounded by l-diversity.

² $Pr (A^{s} = Retinoic acid, A^{id} = P 1 | T^{*})$ as calculated remains an estimation; a much deeper treatment of how to calculate the exact probability of values correlated across QI-groups can be found in [3] and [8].

In the following, we present a schema decomposition safety constraint that recognizes and deals with correlation between non-sensitive and sensitive attributes.

Safety constraint (Safe decomposition).

Given a table T, let $A^{ns}$ be the non-sensitive attribute related by correlation to $A^{s}$ , the sensitive attribute of T, safe decomposition produces two tables $T_{QIT}$ , and $T_{SNT}$ such that

$T_{QIT} (A_{1}, \dots, A_{d - 1}, GID)$ is a quasi-identifier table that holds the set of quasi-identifier and non-sensitive attributes $A_{1}, \dots, A_{d - 1}$ , and

$T_{SNT} (A^{s}, A^{ns}, GID)$ is a sensitive table in which both, non-sensitive and sensitive attributes related by a correlation are grouped.

The safe decomposition constraint ensures that non-sensitive attributes related by correlation to sensitive ones are removed from $T_{QIT}$ and placed into one sensitive table $T_{SNT}$ . This way the correlation gives no additional knowledge to the adversary as the linkage from the correlation is already explicit in the data.
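As an illustration of the schema produced by safe decomposition, the sketch below splits rows that already carry a group id into $T_{QIT}$ and $T_{SNT}$ , moving the correlated non-sensitive attribute next to the sensitive one. The function name, the dictionary-based row format and the shuffling step are assumptions of this sketch; the pairing of patients with drugs in the sample rows is hypothetical, since the anonymized tables do not reveal it.

```python
import random


def safe_decompose(rows, id_attrs, sensitive_attr, ns_attr, seed=0):
    """Split rows (dicts with a 'GID' key) into T_QIT and T_SNT.
    T_QIT keeps the identifying attributes; T_SNT keeps the sensitive attribute
    together with the non-sensitive attribute correlated with it."""
    qit = [{**{a: row[a] for a in id_attrs}, "GID": row["GID"]} for row in rows]
    snt = [
        {sensitive_attr: row[sensitive_attr], ns_attr: row[ns_attr], "GID": row["GID"]}
        for row in rows
    ]
    # Shuffle, then stable-sort by group id, so that the row order inside a
    # group does not leak which T_SNT row belongs to which T_QIT row.
    rng = random.Random(seed)
    rng.shuffle(snt)
    snt.sort(key=lambda row: row["GID"])
    return qit, snt


# Hypothetical original rows for one QI-group.
rows = [
    {"Patient ID": "Alice (P6)", "Country": "United States",
     "DrugName": "Adapalene", "Manufacturer": "Jai Radhe", "GID": 2},
    {"Patient ID": "Bob (P5)", "Country": "Columbia",
     "DrugName": "Epsom. magnesii", "Manufacturer": "PQ Corp.", "GID": 2},
]

qit, snt = safe_decompose(rows, ["Patient ID", "Country"], "DrugName", "Manufacturer")
print(qit)
print(snt)
```

Each (DrugName, Manufacturer) pair stays intact in $T_{SNT}$ , so the correlation between the two attributes remains available for analysis.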

Corollary 2.

Given the non-sensitive attribute $A^{ns}$ in T, an adversary with limited³ knowledge of the association between $A^{ns}$ and an individual u cannot use the grouping of $A^{ns}$ with $A^{s}$ in $T_{SNT}$ to breach the l-diversity privacy constraint.

³ As defined in our adversary model.

Proof.
Consider a table T formed by m QI-groups under a grouping by correlation where the sensitive table holds the sensitive attribute $A^{s}$ along with the non-sensitive attribute $A^{ns}$ that is related to $A^{s}$ by correlation, i.e., $T_{SNT} (A^{s}, A^{ns}, GID)$ . A violation of l-diversity might occur if, in a given QI-group, $Pr (A^{s} = v_{s}, A^{ns} = v_{ns} | t_{u}) > 1 / l$ , where $v_{s}$ is a value of the sensitive attribute $A^{s}$ , $v_{ns}$ is a value of the non-sensitive attribute $A^{ns}$ , and u is an individual of T. However, an adversary has limited knowledge of the association between the individual u and the non-sensitive value $v_{ns}$ , and therefore $Pr (A^{s} = v_{s}, A^{ns} = v_{ns} | t_{u})$ can be written as $Pr (A^{s} = v_{s} | t_{u}) \times Pr (A^{ns} = v_{ns} | t_{u})$ . Now, $Pr (A^{s} = v_{s} | t_{u})$ is at most equal to $1 / l$ (QI-groups are l-diverse) and $Pr (A^{ns} = v_{ns} | t_{u}) ⩽ 1$ , which makes $Pr (A^{s} = v_{s}, A^{ns} = v_{ns} | t_{u}) ⩽ 1 / l$ . □

Summarizing, if a non-sensitive attribute is correlated with a sensitive attribute, we can place it in $T_{SNT}$ , which only makes the correlation between non-sensitive and sensitive attribute values explicit.

We note that a benefit of the safe decomposition constraint is that it exposes the correlations, improving the utility of the data for aggregate analysis. This is shown using the following preserving associations definition.

Definition 5 (Preserving associations).

Let $T^{*}$ be a table that is safely decomposed with the non-sensitive attribute grouped by correlation into $T_{SNT} (A^{s}, A^{ns}, GID)$ . Preserving associations creates QI-groups that are l-diverse such that tuples sharing the same $A^{ns}$ value are assigned to the same QI-group.

Preserving associations deals with the case where a non-sensitive attribute is correlated with both identifying and sensitive attributes. Again, this is done through making the correlation explicit in the data – both improving utility, and ensuring that the knowledge of the correlation does not give the adversary additional capabilities. The way we do this is by forming the anatomy groups so that the non-sensitive $A^{ns}$ values are the same within a group. Even though the adversary knows of the correlation between the identifying and $A^{ns}$ attributes, this gives him/her no ability to link data beyond that already provided by the grouping, which is already l-diverse.

Table 5 shows an example of how the association between the attributes patient and manufacturer can be preserved.

Table 5
Preserving association between patient and manufacturer

Patient ID	GID
Carl (P3)	1
Roan (P1)	1
∗	1
Lisa (P4)	1
∗	1
∗	1

GID	DrugName	Manufacturer
1	Azelaic acid	Raphe Healthcare
1	Azelaic acid	Raphe Healthcare
1	Azelaic acid	Raphe Healthcare
1	Retinoic acid	Raphe Healthcare
1	Retinoic acid	Raphe Healthcare
1	Retinoic acid	Raphe Healthcare

Corollary 3.

Let $QI$ be a quasi-identifying group of $T^{*}$ , and let $\{v_{s_{1}}, \dots, v_{s_{i}}\}$ be the set of sensitive values in $QI$ associated with a common $v_{ns}$ value such that, for a given tuple $t_{u}$ in QIT, $Pr (A^{ns} = v_{ns} | t_{u})$ is greater than $1 / l$ . Such an association cannot be used to breach privacy if $\{v_{s_{1}}, \dots, v_{s_{i}}\}$ are $c (v_{ns})$ -diverse, where $c (v_{ns})$ is the number of occurrences of $v_{ns}$ in $QI$ .

Proof.

Let us assume that $Pr (A^{ns} = v_{ns} | t_{u})$ is equal to $\frac{c (v_{ns})}{| QI |}$ . Given that $T^{*}$ is safely decomposed, we can say that $Pr (A^{ns} = v_{ns} | t_{u})$ is equal to $Pr (A^{s}, A^{ns} = v_{ns} | t_{u})$ . However, $Pr (A^{s}, A^{ns} = v_{ns} | t_{u})$ is equal to $Pr (A^{s} = v_{s_{1}}, A^{ns} = v_{ns_{1}} | t_{u}) + \dots + Pr (A^{s} = v_{s_{i}}, A^{ns} = v_{ns_{c (v_{ns})}} | t_{u})$ , where each term such as $Pr (A^{s} = v_{s_{1}}, A^{ns} = v_{ns} | t_{u})$ is at most $1 / | QI |$ ( $T^{*}$ is originally an anatomized table having $| QI | = l + 1$ at the most). □

We note that using the safety constraints described above, we do not intend to replace anatomy. In fact, we divide the table as described in the original anatomy model by separating it into two subtables ( $T_{QIT}$ , $T_{SNT}$ ) while providing a safe grouping of tuples on the basis of the attributes related by a correlation dependency and moving the non-sensitive attribute from $T_{QIT}$ to $T_{SNT}$ .

7. Privacy enforcement

In this section, we provide mechanisms to enforce the safe correlation safety constraint when anonymizing data. We start with the safe grouping algorithm, which guarantees the safe grouping safety constraint. We then present the associations preserving algorithm used to create QI-groups under the safe decomposition constraint.

7.1. Safe grouping algorithm

Algorithm 1 guarantees the safe grouping of a table T based on an identity attribute correlation dependency $c d^{id} : A^{id} ⇢ A^{s}$ ( $A^{id} \in T_{QIT}$ and $A^{s} \in T_{SNT}$ ).

Algorithm 1.

SafeGrouping

The main idea behind the algorithm is to create k buckets based on the attribute ( $A^{id}$ ) defined on the left hand side of a correlation dependency in a reasonable time.

The safe grouping algorithm takes a table T, a correlation dependency $A^{id} ⇢ A^{s}$ , a non-sensitive attribute $A^{ns}$ , a constant l to ensure diversity, and a constant k representing the number of individuals (individuals’ related tuples) to be stored in a QI-group. It ensures a safe grouping on the basis of the attribute $A^{id}$ . In Step 2, the algorithm hashes the tuples in T based on their $A^{id}$ values and sorts the resulting buckets. For any individual, all their values will end up in the same bucket. In the group creation process from Steps 4–17, the algorithm creates a QI-group with k individuals. If the QI-group respects l-diversity the algorithm adds it to the list of QI-groups and enforces the safety constraint in Step 8 by anonymizing tuples in $T_{QIT}$ including values that are frequently correlated in the QI-group. In other terms, it makes sure that individuals’ related tuples in the QI-group are of equal number.

If l-diversity for the QI-group in question is not met, the algorithm enforces it by anonymizing tuples related to the most frequent sensitive value in the QI-group. After the l-diversity enforcement process, the algorithm verifies whether the group contains k buckets and, if not, anonymizes it (anonymizing could mean generalizing, suppressing, or encrypting the values, depending on the target model).

From Steps 19–26 the algorithm anatomizes the tables based on the QI-groups created. It stores random non-sensitive and sensitive attribute values in the $T_{SNT}$ table.
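The steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the listing of Algorithm 1: rows are plain dictionaries, l-diversity is checked with a simple frequency test, anonymization is modelled as suppression (dropping a tuple, or replacing its identifier with a hypothetical `MASK` marker), and the names `safe_grouping`, `is_l_diverse` and `seed` are introduced here for illustration only.

```python
import random
from collections import Counter, defaultdict

MASK = "*"  # hypothetical marker for a suppressed identifier


def is_l_diverse(values, l):
    """Frequency-based check (assumption): no value holds more than a 1/l share."""
    return bool(values) and max(Counter(values).values()) * l <= len(values)


def safe_grouping(rows, id_attr, sens_attr, k, l, seed=0):
    """Return (qit, snt, n_anonymized) for a list of dict rows."""
    rng = random.Random(seed)

    # Hash tuples on the identity attribute: one bucket per individual,
    # so all tuples of an individual end up in the same QI-group.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[id_attr]].append(row)
    ordered = sorted(buckets.values(), key=len, reverse=True)

    qit, snt, n_anonymized = [], [], 0
    for gid, start in enumerate(range(0, len(ordered), k), start=1):
        # A candidate QI-group holds the tuples of (up to) k individuals.
        group = [dict(r) for bucket in ordered[start:start + k] for r in bucket]

        # Enforce l-diversity by suppressing tuples that carry the most
        # frequent sensitive value until the frequency test passes.
        while group and not is_l_diverse([r[sens_attr] for r in group], l):
            top = Counter(r[sens_attr] for r in group).most_common(1)[0][0]
            group.remove(next(r for r in group if r[sens_attr] == top))
            n_anonymized += 1

        # Safety constraint: each individual keeps the same number of visible
        # tuples; surplus identifiers are masked. A group left with fewer
        # than k individuals has all of its identifiers masked.
        per_id = Counter(r[id_attr] for r in group)
        quota = min(per_id.values()) if len(per_id) >= k else 0
        seen = Counter()
        for r in group:
            ident = r[id_attr]
            seen[ident] += 1
            if seen[ident] > quota:
                r[id_attr] = MASK
                n_anonymized += 1

        # Anatomize: identifying part goes to QIT, sensitive values go to SNT
        # in shuffled order so that row positions carry no linkage.
        sens = [r[sens_attr] for r in group]
        rng.shuffle(sens)
        for r in group:
            qit.append({**{a: v for a, v in r.items() if a != sens_attr}, "GID": gid})
        snt.extend({"GID": gid, sens_attr: s} for s in sens)
    return qit, snt, n_anonymized
```

For instance, with two tuples for one individual and one tuple for another in the same group, the second tuple of the first individual has its identifier masked, and the returned counter is 1.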

Algorithm 2.

Preserving associations

While safe grouping provides safety, its ability to preserve data utility is limited by the number of distinct values of the $A^{id}$ attribute.

7.2. Preserving associations algorithm

We now present our preserving associations algorithm (Algorithm 2) that groups attributes to preserve the associations between individuals and non-sensitive attribute values while still guaranteeing the requirements needed to ensure a safe correlation.

The preserving associations algorithm’s main priority is to create QI-groups with common values of the non-sensitive attribute. The algorithm takes a table T, a non-sensitive attribute $A^{ns}$ , a sensitive attribute $A^{s}$ and a constant l. It creates l-diverse partitions with associations preserved between the tuples in $T_{QIT}$ and the non-sensitive attribute values in $T_{SNT}$ . In Step 2, the algorithm hashes the tuples based on non-sensitive attribute values. For every bucket that is l-diverse, a QI-group is created, anonymized by applying the safety constraint (in Step 7) and moved to $Anon T$ .

In Step 10, the algorithm ensures that the remaining buckets are anonymized. We do not specify the anonymization technique to be used. However, both safe grouping and anatomy can be adopted, as desired to meet privacy constraints.

In Steps 11–18, the algorithm stores QI-groups in the decomposed tables $T_{QIT}$ and $T_{SNT}$ according to the grouped by correlation attributes.

The example in Table 5 shows how to preserve associations. The main objective is to create QI-groups with common Manufacturer values, as is the case for ${QI}_{1}$ . Since all the manufacturer values in a QI-group are the same, the correlation does not allow any further linking between identifiers and sensitive values than already provided by the group ID, which already guarantees l-diversity.
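The bucketing idea behind the preserving associations algorithm can be illustrated with a short Python sketch. It is an illustration under assumptions rather than the listing of Algorithm 2: l-diversity is tested by frequency, the safety constraint of Step 7 is omitted, and buckets that are not l-diverse are simply returned as `leftover` so that another technique (safe grouping or anatomy) can process them. The function name and the dictionary row layout are hypothetical.

```python
from collections import Counter, defaultdict


def preserve_associations(rows, ns_attr, sens_attr, l):
    """Return (qit, snt, leftover); one QI-group per l-diverse A^ns bucket."""
    # Hash tuples on the non-sensitive attribute value.
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[ns_attr]].append(row)

    qit, snt, leftover = [], [], []
    gid = 0
    for ns_value, bucket in buckets.items():
        counts = Counter(r[sens_attr] for r in bucket)
        # Frequency-based l-diversity test (assumption).
        if max(counts.values()) * l <= len(bucket):
            gid += 1
            for r in bucket:
                kept = {a: v for a, v in r.items() if a not in (ns_attr, sens_attr)}
                qit.append({**kept, "GID": gid})
            # Sensitive values are emitted in sorted order so that row
            # positions in SNT do not mirror row positions in QIT.
            for s in sorted(r[sens_attr] for r in bucket):
                snt.append({"GID": gid, sens_attr: s, ns_attr: ns_value})
        else:
            # Not l-diverse: hand over to another anonymization technique.
            leftover.extend(bucket)
    return qit, snt, leftover
```

Applied to six prescriptions from a single manufacturer, three for each of two drugs, with l = 2, the sketch yields one QI-group whose sensitive table repeats the same manufacturer value, as in Table 5.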

8. Experiments

We now present a set of experiments to evaluate the efficiency of our approach, both in terms of computation and more importantly, loss of data utility. We implemented our algorithms in Java based on the Anonymization Toolbox [7], and conducted experiments with an Intel XEON 2.4 GHz PC with 2 GB RAM.

8.1. Evaluation dataset

In keeping with much work on anonymization, we use the Adult Dataset from the UCI Machine Learning Repository [2]. To simulate real identifiers, we made use of a U.S. state voter registration list containing the attributes Birthyear, Gender, Firstname and Lastname. We combined the adult dataset with the voter’s list such that every individual in the voters list is associated with multiple tuples from the adult dataset, simulating a longitudinal dataset from multiple census years. We have constructed this dataset to have a correlation dependency of the following form $Firstname, Lastname ⇢ Occupation$ ; where Occupation is a sensitive attribute, $Firstname$ , $Lastname$ are identifying attributes and remaining attributes are presumed to be quasi-identifiers.

We say that an individual is likely to stay in the same occupation across multiple censuses. Note that this is not an exact longitudinal dataset; the number of tuples n varies between individuals (simulating a dataset where some individuals move into or out of the census area). The generated dataset is of size 48,836 tuples with 21,201 distinct individuals.

In the next section, we present and discuss results from running our safe grouping algorithm.

8.2. Evaluation results

We elaborated a set of measurements to evaluate the efficiency of safe grouping. These measurements can be summarized as follows:

evaluating privacy breach in a naive anatomization. We note that the same test could be performed on the slicing technique [11] as the authors in their approach do not deal with identity disclosure,

determining anonymization cost represented by the loss metric to capture the fraction of tuples that must be (partially or totally) generalized, suppressed, or encrypted in order to satisfy the safe grouping,

comparing the computational cost of our safe grouping algorithm to anatomy [24], and

evaluating the utility of preserving associations compared to a basic anatomization.

8.2.1. Evaluating privacy

After naïve anatomization over the generated dataset, we have identified 5 explicit violations due to intra QI-group correlations where values of $A^{id}$ are correlated in a QI-group. On the other hand, in order to determine the number of violations due to inter QI-group correlation, we calculate first the possible associations of $A^{id}$ and $A^{s}$ values across a naïve anatomized table. This is summarized in the following equation for values $v_{id}$ and $v_{s}$ respectively.

$G (v_{id}, v_{s}) = \frac{\sum_{j = 1}^{m} f_{j} (v_{id}, v_{s})}{\sum_{j = 1}^{m} g_{j} (v_{id})},$

where

$f_{j} (v_{id}, v_{s}) = \begin{cases} 1 & \text{if } v_{id} \text{ and } v_{s} \text{ are associated in } {QI}_{j}, \\ 0 & \text{otherwise} \end{cases}$

and

$g_{j} (v_{id}) = \begin{cases} 1 & \text{if } v_{id} \text{ exists in } {QI}_{j}, \\ 0 & \text{otherwise.} \end{cases}$

At this point, a violation occurs for significant⁴ $A^{id}$ values if:

$G (v_{id}, v_{s}) > 1 / l$ . This represents a frequent association between $v_{id}$ and $v_{s}$ where $v_{id}$ is more likely to be associated with $v_{s}$ in the QI-groups to which it belongs and,

$| π_{A^{s}} {QI}_{1} \cap \dots \cap π_{A^{s}} {QI}_{m} | < l$ where ${QI}_{1}, \dots, {QI}_{m}$ are the QI-groups to which $v_{id}$ belongs.

⁴ Significance is measured in this case based on the support of $A^{id}$ values and their correlation across QI-groups. For instance, $v_{id}$ is considered significant if it exists in at least α QI-groups where α is a predefined constant greater than 2.

After applying the above test to the anatomized dataset, we identified 167 and 360 inter QI-group correlation violations for $l = 2$ and $l = 3$ , respectively. We note that a much deeper study on violations due to data correlation can be found in [3,8,10].
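The test above can be made concrete with a short Python sketch. It is only an illustration under assumptions: each QI-group is represented as a pair of sets (the $A^{id}$ values and the $A^{s}$ values it contains), and $v_{id}$ and $v_{s}$ are taken to be "associated" in a group when both occur in it. The function names and the parameter `alpha` (the significance threshold α, assumed to be at least 1) are introduced here for illustration.

```python
def association_ratio(groups, v_id, v_s):
    """G(v_id, v_s): share of the QI-groups containing v_id that also contain v_s."""
    member = [sens for ids, sens in groups if v_id in ids]
    if not member:
        return 0.0
    return sum(v_s in sens for sens in member) / len(member)


def inter_group_violations(groups, l, alpha):
    """Return the set of (v_id, v_s) pairs flagged by the two conditions."""
    violations = set()
    all_ids = set().union(*(ids for ids, _ in groups))
    for v_id in all_ids:
        # Sensitive-value sets of the QI-groups to which v_id belongs.
        member = [sens for ids, sens in groups if v_id in ids]
        if len(member) < alpha:
            continue  # v_id is not significant
        # Second condition: fewer than l sensitive values shared by all groups.
        if len(set.intersection(*member)) >= l:
            continue
        for v_s in set().union(*member):
            hits = sum(v_s in sens for sens in member)
            # First condition, G > 1/l, written with integers to avoid rounding.
            if hits * l > len(member):
                violations.add((v_id, v_s))
    return violations
```

With three groups that all contain the identifier `u1` and share only the sensitive value `a`, the pair (`u1`, `a`) is flagged for l = 2 and α = 3, since G equals 1 and the intersection holds a single value.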

8.2.2. Evaluating anonymization cost

We evaluate our proposed anonymization algorithms to determine the loss metric ( $LM$ ) representing the number of tuples in T and $T_{QIT}$ that need to be partially or fully suppressed in order to achieve the safety constraint.

Definition 6 (Loss metric ( $LM$ )).

Let $ρ (T^{*})$ be a function that returns the number of tuples fully or partially (values in tuples) suppressed in the anonymization $T^{*}$ of T. The loss metric ( $LM$ ) for table $T^{*}$ is

$LM (T^{*}) = \frac{ρ (T^{*})}{| T |} . \qquad (1)$

Table 4 shows an anonymized version of the Prescription table where grouping is safe and the loss metric is $LM (Prescription) = 2 / 13$ .
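As a small worked instance of Eq. (1), the following Python sketch counts suppressed tuples and divides by the size of the original table. The representation is an assumption made for illustration: rows are dictionaries, a partially suppressed tuple contains the hypothetical marker `MASK` in at least one field, and a fully suppressed tuple is one that is absent from the anonymized table.

```python
from fractions import Fraction

MASK = "*"  # hypothetical marker for a suppressed value


def loss_metric(anonymized_rows, original_size):
    """LM(T*) = rho(T*) / |T| as an exact fraction."""
    # Tuples missing from T* are treated as fully suppressed (assumption).
    fully = original_size - len(anonymized_rows)
    # Tuples that carry at least one masked value are partially suppressed.
    partially = sum(1 for row in anonymized_rows if MASK in row.values())
    return Fraction(fully + partially, original_size)
```

For a 13-tuple table in which two tuples have a masked value, the function returns 2/13, the same ratio as the value reported for Table 4.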

We investigate the anonymization cost for a correlation dependency $c d^{id} : Firstname, Lastname ⇢ Occupation$ using the safe grouping algorithm. We anonymize the dataset with various k and l. We use $l = 2, 3, 4, 5, 6$ and 7 for which the dataset satisfies the eligibility condition (see [13]); values for the number of individuals per group are tested for $k = 7, 8$ and 9. At each execution, we compute the $LM$ . The results are shown in Fig. 1.

Fig. 1.

Safe grouping evaluation in transactional datasets (a)–(c). (a) % of tuples anonymized to ensure the safety constraint and l-diversity for $k = 7$ . (b) % of tuples anonymized to ensure the safety constraint and l-diversity for $k = 8$ . (c) % of tuples anonymized to ensure the safety constraint and l-diversity for $k = 9$ . (d) Computational cost evaluation.

From Fig. 1, we can see that $LM$ increases with l. For ( $k = 9$ , $l = 7$ ) the computed loss metric is high: the number of tuples to anonymize in order to preserve l-diversity reaches 35% in the worst case. Nonetheless, for small values of l an acceptable value of $LM$ is computed. Anonymizing datasets using safe grouping can be seen as a trade-off between cost and privacy where, for small values of l, $LM$ stays below 10%, leading to a relatively small anonymization cost. Another aspect to consider is how to define k w.r.t. l to guarantee a minimum $LM$ . Note that for transactional data, it is possible for k (the number of individuals, not transactions, in a group) to be smaller than l; however, this makes satisfying the privacy criteria difficult, leading to a substantial amount of anonymized data. The experiments show that high data utility can be preserved as long as k is somewhat greater than l.

8.2.3. Evaluating computation cost

We now give the processing time to perform safe grouping compared to anatomy. Figure 1(d) shows the computation time of both safe grouping and anatomy over a non-transactional dataset with different k values. Theoretically, the worst case of safe grouping could be much higher; in practice, however, for small values of l, safe grouping performs better than anatomy. Furthermore, as k increases, the safe grouping computation time decreases due to the reduced I/O access needed to test QI-groups’ l-diversity.

8.2.4. Utility of associations preserving algorithm

We now evaluate the utility of preserving associations given the correlation dependencies. We investigate the amount of associations preserved between individuals and the attribute Hours-per-week in the dataset. To do so, we compute a quality metric $Q_{pa}$ that estimates the associations preserved in the dataset based on the number of groups where hours-per-week is the same for all entries in the group. The association metric $Q_{pa}$ is formally defined as follows.

Definition 7 (Association metric $Q_{pa}$ ).

Let $n_{pa}$ be the number of tuples in the dataset with associations preserved. The association metric $Q_{pa}$ is defined as $Q_{pa} = \frac{n_{pa} + (n - n_{pa}) / l}{n}$ where n is the number of tuples in the dataset and l is the privacy constant.

We compute $Q_{pa}$ with different l values and compare the result to anatomy’s association metric, which is bounded by $Q_{an} = 1 / l$ . The computed results in Fig. 2 show the utility of preserving associations: for relatively small l values, $Q_{pa}$ reaches relatively high levels of preserved associations, and in the worst case ( $l = 7$ ) it converges to anatomy.
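The metric of Definition 7 can be computed as in the sketch below. The input format is an assumption made for illustration: each QI-group is given as the list of its $A^{ns}$ values (for example, Hours-per-week), and a tuple counts towards $n_{pa}$ when every entry of its group has the same value. The dataset is assumed to be non-empty.

```python
from fractions import Fraction


def association_metric(groups, l):
    """Q_pa = (n_pa + (n - n_pa) / l) / n, returned as an exact fraction."""
    n = sum(len(g) for g in groups)
    # Tuples in groups whose non-sensitive values are all identical.
    n_pa = sum(len(g) for g in groups if len(set(g)) == 1)
    return (Fraction(n_pa) + Fraction(n - n_pa, l)) / n
```

When no group preserves the association ( $n_{pa} = 0$ ), the expression reduces to $1 / l$ , the anatomy bound $Q_{an}$ ; when every group does, it equals 1.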

9. Conclusion

In this paper, we proposed safe grouping and safe decomposition constraints to cope with defects of bucketization techniques in handling correlated values in transactional datasets. Our safe grouping algorithm creates partitions with an individual’s related tuples stored in one and only one group, eliminating these privacy violations. We also provided a preserving associations technique to make the associations between individuals and non-sensitive attribute values explicit, thus increasing the ability to analyze the anonymized data.

Fig. 2.

Quality of preserving associations.

We showed, using a set of experiments, that there is a trade-off to be made between privacy and utility. This trade-off is quantified based on the number of tuples to be anonymized using the safe grouping algorithm. We investigated the computation time of safe grouping and showed that despite the exponential growth of safe grouping, for a small range of values of l, safe grouping outperforms anatomy while providing stronger privacy guarantees. Finally, we showed that our associations preserving technique clearly gives additional learning abilities compared to the traditional anatomy.

Acknowledgments

This publication was made possible by NPRP grant 09-256-1-046 from the Qatar National Research Fund. The statements made herein are solely the responsibility of the authors.

References

1. D.J. Aldous, Exchangeability and related topics, in: École D’été de Probabilités de Saint-Flour, XIII – 1983, Lecture Notes in Math., Vol. 1117, Springer, Berlin, 1985, pp. 1–198.

2. Asuncion and D.J. Newman, UCI machine learning repository, 2007.

3. Chi-Wing Wong, Wai-Chee Fu, Wang, Xu and P.S. Yu, Can the utility of anonymized data be used for privacy breaches?, in: CoRR, 2009, 0905.1755.

4. Ciriani, De Capitani Di Vimercati, Foresti, Jajodia, Paraboschi and Samarati, Combining fragmentation and encryption to protect privacy in data storage, ACM Trans. Inf. Syst. Secur. 13 (2010), 22:1–22:33.

5. Hacıgümüş, B.R. Iyer and Mehrotra, Executing SQL over encrypted data in the database-service-provider model, in: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, Madison, WI, June 4–6, 2002, 2002, pp. 216–227.

6. Jiang, Gao, Wang and Yang, Multiple sensitive association protection in the outsourced database, in: Database Systems for Advanced Applications (DASFAA), Tsukuba, Japan, April 1–4, 2010, 2010, pp. 123–137.

7. Kantarcioglu, Inan and Kuzu, Anonymization toolbox, 2010.

8. Kifer, Attacks on privacy and De Finetti’s theorem, in: SIGMOD Conference, 2009, pp. 127–138.

9. Li, Li and Venkatasubramanian, t-closeness: Privacy beyond k-anonymity and l-diversity, in: ICDE, 2007, pp. 106–115.

10. Li and Li, Injector: Mining background knowledge for data anonymization, in: ICDE, 2008, pp. 446–455.

11. Li, Zhang and Molloy, Slicing: A new approach for privacy preserving data publishing, IEEE Trans. Knowl. Data Eng. 24(3) (2012), 561–574.

12. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, in: Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE 2006), Atlanta, GA, April 2006, 2006.

13. Machanavajjhala, Gehrke, Kifer and Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, ACM Transactions on Knowledge Discovery from Data (TKDD) 1(1) (2007), 1010–1027.

14. Malin, Trail re-identification and unlinkability in distributed databases, PhD thesis, Carnegie Mellon University, 2006.

15. A.E. Nergiz and Clifton, Query processing in private data outsourcing using anonymization, in: The 25th IFIP WG 11.3 Conference on Data and Applications Security and Privacy (DBSEC-11), Richmond, VA, July 11–13, 2011, 2011, pp. 11–13.

16. A.E. Nergiz, Clifton and Malluhi, Updating outsourced anatomized private databases, in: 16th International Conference on Extending Database Technology (EDBT), Genoa, Italy, March 18–22, 2013, 2013.

17. Ressel, De Finetti-type theorems: an analytical approach, Ann. Probab. 13(3) (1985), 898–922.

18. Samarati, Protecting respondents’ identities in microdata release, IEEE Trans. Knowl. Data Eng. 13(6) (2001), 1010–1027.

19. Sweeney, k-anonymity: a model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5) (2002), 557–570.

20. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5) (2002), 571–588.

21. Terrovitis, Mamoulis and Kalnis, Privacy-preserving anonymization of set-valued data, Proc. VLDB Endow. 1(1) (2008), 115–125.

22. Terrovitis, Mamoulis, Liagouris and Skiadopoulos, Privacy preservation by disassociation, Proc. VLDB Endow. 5(10) (2012), 944–955.

23.