Privacy preserving data release for tagging recommender systems

Abstract

Tagging recommender systems allow Internet users to annotate resources with personalized tags. The connection among users, resources and these annotations, often called a folksonomy, permits users the freedom to explore tags, and to obtain recommendations. Releasing these tagging datasets accelerates both commercial and research work on recommender systems. However, tagging recommender systems have been confronted with serious privacy concerns because adversaries may re-identify a user and her/his sensitive information from the tagging dataset using only a little background information. Recently, several privacy-preserving techniques have been proposed to address the problem, but most of them lack a strict privacy notion and can hardly resist the wide range of possible attacks. This paper proposes a private releasing algorithm that perturbs users’ profiles under a strict privacy notion, differential privacy, with the goal of protecting a user’s identity in a tagging dataset. The algorithm includes three privacy-preserving operations: Private Tag Clustering is used to shrink the randomized domain, and Private Tag Selection is then applied to find the most suitable replacement tags for the original tags. To hide the numbers of tags, the third operation, Weight Perturbation, finally adds Laplace noise to the weights of tags.

We present extensive experimental results on two real-world datasets, Del.icio.us and Bibsonomy. While the personalization algorithm is successful in both cases, our results further suggest that the private releasing algorithm can successfully retain the utility of the datasets while preserving users’ identity.

Keywords

Privacy preserving; differential privacy; recommender system; tagging

1. Introduction

The widespread success of social network web sites, such as Del.icio.us and Bibsonomy, introduces a new concept called the tagging recommender system. These web sites enable Internet users to annotate resources using customized tags, which in turn facilitates the recommendation of resources. Over the last few years, these social network web sites have generated a large collection of data, but the privacy issue in utilizing the data for tagging recommendation has generally been overlooked [19]. Releasing or sharing these kinds of tagging datasets in their raw form may cause a serious privacy violation. An adversary with background information may re-identify a particular user in these tagging datasets and obtain the user’s historical tagging records [11]. Moreover, in comparison with traditional recommender systems, tagging recommender systems involve more semantic information that directly discloses the preferences of a user. Hence, the privacy violation involved is more serious than in traditional systems [19]. Consequently, how to preserve privacy in tagging recommender systems is an emerging issue that needs to be solved.

Over the last decade, a variety of privacy preserving approaches have been proposed for traditional recommender systems [20]. For example, cryptography is applied to the rating data for multi-party computation [5,28]. Perturbation adds noise to the users’ ratings before rating prediction [21,22], and obfuscation replaces a certain percentage of ratings by random values [1]. However, these approaches can hardly be applied in tagging recommender systems due to the semantic property of tags. Specifically, cryptography completely erases the semantic meaning of tags, while perturbation and obfuscation can only be applied to numerical values rather than words. These deficiencies render those approaches impractical in tagging recommendation. To overcome these deficiencies, the tag suppression method has recently been proposed to protect a user’s privacy in a tagging recommender system by modeling a user profile and eliminating selected sensitive tags [19]. However, this method only releases an incomplete dataset, which significantly affects the recommendation utility. Moreover, most existing approaches suffer from one common weakness: the privacy notions are weak and hard to prove theoretically, thus impairing the credibility of the final results. Accordingly, a more rigid privacy notion with a more sophisticated privacy preserving approach is needed in tagging recommender systems.

As privacy research advances, especially with the recent development of differential privacy, it is now possible to overcome the previous weakness. As a strong and provable privacy definition that quantifies the risk of individuals [7,8], differential privacy provides a strict privacy guarantee for tagging recommendation. It applies a randomized mechanism suitable for both numeric and non-numeric values and has been proven effective in recommender systems. This paper introduces differential privacy into tagging recommender systems, with the aim to prevent re-identification of users and to prevent the association of sensitive tags (e.g., healthcare tags) with a particular user.

However, although these characteristics make differential privacy a promising method for tagging recommendation, there remain some research barriers:

The naive differential privacy mechanism only focuses on releasing statistical information that can barely retain the structure of the tagging dataset. For example, the naive mechanism lists all the tags, counts their occurrences and adds noise to the statistical output, but ignores the relationship among tags, resources and users. This simple statistical information may not be adequate for recommender systems to perform reasonably.

Differential privacy utilizes the randomized mechanism to preserve privacy, usually inducing a large noise due to the sparsity of the tagging dataset. For a tagging system with millions of tags, the randomized mechanism will result in a large magnitude of noise.

Both barriers indicate that the naive differential privacy mechanism cannot be directly used in a tagging recommender system, and a novel differentially private mechanism is needed. To overcome the first barrier, rather than releasing simple statistical information, we generate a synthetic dataset that retains the relationship among tags, resources and users. The second barrier can be addressed by shrinking the randomized domain, because the noise decreases when the randomized range is limited. Based on these observations, we propose a tailored differential privacy mechanism that retains acceptable recommendation performance within the constraints of differential privacy.

The contributions of this paper can be summarized as follows:

The main contribution is to design a practical Private Tagging Releasing algorithm with a rigid privacy guarantee. To the best of our knowledge, this is the first work that adopts differential privacy for the tagging recommender system. It maintains an acceptable utility of the tagging dataset by preserving the relationship among users, resources and tags.

A private clustering operation is proposed to shrink the randomized domain. The effectiveness of the proposed operation is verified by extensive experiments on a real-world dataset.

Considering the advantage of the differential privacy composition properties, a theoretical analysis confirms an improved trade-off between privacy and utility.

The rest of this paper is organized as follows. We present the preliminaries in Section 2, and propose the Private Tagging Releasing algorithm in Section 3, together with the theoretical utility and privacy analysis of the algorithm. Section 4 presents results from the experiments, followed by the conclusion in Section 5.

2. Preliminaries

2.1. Notations

Let G be a dataset to be protected. Two datasets G and $G^{'}$ are neighboring datasets if they have the same cardinality but differ in one record. Differential privacy provides a randomization mechanism $M$ and a rigid privacy guarantee to mask the difference between the neighboring datasets [9]. Specifically, suppose a query f maps the dataset G to a real number; the differential privacy mechanism $M$ then masks the difference between the results of f on G and $G^{'}$ .

In a tagging recommender system, G is a tagging dataset consisting of users, resources and tags. Let $U = \{u_{1}, u_{2}, \dots\}$ be a set of users, $R = \{r_{1}, r_{2}, \dots\}$ be a set of resources, and $T = \{t_{1}, t_{2}, \dots\}$ be a set of all tags. The relationships among users, resources and tags are defined as a folksonomy $F = \langle U, R, T, AS \rangle$ , where the ternary relationship $AS \subseteq U \times R \times T$ is referred to as the tag assignment set. In addition, for a particular user $u_{a} \in U$ and a resource $r_{b} \in R$ , $T (u_{a}, r_{b})$ includes all tags flagged by the user on that resource. We also use $T (u_{a})$ to represent all the tags utilized by user $u_{a}$ . A post of the folksonomy can then be defined as a tuple $p = (u, r, T (u, r))$ , which consists of user u, resource r and all tags $T (u, r)$ .
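The notation above can be made concrete with a small sketch. The tag assignment set below is hypothetical toy data, and the helper names `tags_of`, `tags_of_user` and `posts` are illustrative choices rather than anything defined in this paper.

```python
from collections import defaultdict

# Hypothetical toy tag assignment set AS, a subset of U x R x T.
AS = {
    ("u1", "r1", "python"),
    ("u1", "r1", "tutorial"),
    ("u1", "r2", "privacy"),
    ("u2", "r1", "python"),
}

def tags_of(user, resource, assignments):
    """T(u, r): all tags the user attached to the resource."""
    return {t for (u, r, t) in assignments if u == user and r == resource}

def tags_of_user(user, assignments):
    """T(u): all tags the user attached to any resource."""
    return {t for (u, _, t) in assignments if u == user}

def posts(assignments):
    """All posts p = (u, r, T(u, r)) of the folksonomy."""
    grouped = defaultdict(set)
    for u, r, t in assignments:
        grouped[(u, r)].add(t)
    return [(u, r, frozenset(ts)) for (u, r), ts in sorted(grouped.items())]
```

Storing AS as a set of triples keeps the ternary relationship explicit; the derived views T(u, r), T(u) and the posts are all computed from it.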

2.2. Foundational concepts

2.2.1. Differential privacy

Differential privacy captures the intuition that releasing an aggregated report should not reveal too much information about any individual, even if his/her record is included in the dataset [7]. A formal definition of differential privacy is as follows [9]:

Definition 1 (ϵ-differential privacy).

A randomized mechanism $M$ gives ϵ-differential privacy if for any pair of neighboring datasets G and $G^{'}$ , and for every set of outcomes Ω, the randomized mechanism $M$ satisfies: $\begin{matrix} (1) & \Pr [M (G) \in Ω] ⩽ exp (ϵ) \cdot \Pr [M (G^{'}) \in Ω], \end{matrix}$ where ϵ is the privacy budget: the smaller the budget, the higher the privacy level.

The mechanism $M$ is associated with the sensitivity [10], which measures the maximal change in the output of the query f when one user’s record is removed from the dataset G [9]. Sensitivity captures the maximal change over all pairs of neighboring datasets and is determined only by the query f [9]:

Definition 2 (Sensitivity).

For $f : G \to \mathbb{R}$ , the sensitivity of f is defined as $\begin{matrix} (2) & Δ f = max_{G, G^{'}} {∥ f (G) - f (G^{'}) ∥}_{1}, \end{matrix}$ where G and $G^{'}$ are neighboring datasets.

To satisfy this definition, two mechanisms are utilized: the Laplace mechanism and the Exponential mechanism. Among them, the Laplace mechanism is suitable for numeric output and relies on the strategy of adding controlled noise to the outcome of a query. It is formally defined as [10]:

Definition 3 (Laplace mechanism).

Given a function $f : G \to \mathbb{R}$ over a dataset G, the mechanism $\begin{matrix} (3) & M (G) = f (G) + Lap (\frac{Δ f}{ϵ}) \end{matrix}$ provides ϵ-differential privacy.
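As an illustration of Definition 3, the following minimal sketch adds Laplace noise with scale Δf/ϵ to a numeric query answer. It assumes NumPy is available; the function name `laplace_mechanism` and the example values (a counting query with sensitivity 1 and ϵ = 0.5) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return true_value + Lap(sensitivity / epsilon), following Definition 3."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = np.random.default_rng()
    scale = sensitivity / epsilon  # larger budget means less noise
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Hypothetical example: a counting query (sensitivity 1) with epsilon = 0.5,
# so the noise has scale 1 / 0.5 = 2.
rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5, rng=rng)
```

A smaller ϵ gives a larger noise scale, which matches the statement that a smaller budget yields a higher privacy level.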

On the other hand, the Exponential mechanism focuses on non-numeric queries [15]. It is paired with an application-dependent score function $q (G, ψ)$ , which represents how good an output scheme ψ is for dataset G. The Exponential mechanism is formally defined as [15]:

Definition 4 (Exponential mechanism).

An Exponential mechanism $M$ satisfies ϵ-differential privacy if: $M (G) = \{\text{return } ψ \text{ with probability} \propto exp (\frac{ϵ q (G, ψ)}{2 Δ q})\},$ where $Δ q$ denotes the sensitivity of q.
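A minimal sketch of Definition 4 is given here for a finite candidate set. It assumes NumPy; the function name `exponential_mechanism` and its arguments are illustrative, and the scores are supplied by the caller as precomputed values of q(G, ψ).

```python
import numpy as np

def exponential_mechanism(candidates, scores, sensitivity, epsilon, rng=None):
    """Sample a candidate with probability proportional to exp(eps * q / (2 * dq))."""
    if rng is None:
        rng = np.random.default_rng()
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()  # shift for numerical stability; ratios are unchanged
    probs = np.exp(logits)
    probs /= probs.sum()
    index = rng.choice(len(candidates), p=probs)
    return candidates[index], probs
```

Candidates with higher scores are exponentially more likely to be returned, yet every candidate keeps a non-zero probability, which is what prevents an observer from being certain about the best-scoring output.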

2.2.2. Tagging recommender systems

The tagging recommender system recommends a set of tags $\hat{T} (u, r)$ for a given user $u \in U$ and a given resource $r \in R$ . The system first generates a rank on a set of tags according to some quality or relevance criteria, then the top N tags, $\hat{T} (u, r)$ are selected as recommended tags.

A considerable amount of literature has explored various techniques for tagging recommendation, which offers users the possibility to annotate resources with personalized tags and to ease the process of finding good tags for a resource [24]. Sigurbjornsson et al. provided a typical recommender strategy [25]. Given a resource with user-defined tags, an ordered list of candidate tags is derived for each user-defined tag based on tag co-occurrence. After tag aggregating and ranking in the candidate list, the system provides top-N ranked tags. Another well-known study is FolkRank [13], which adapts the well-known PageRank into a tagging recommender system. There are other methods for tagging recommender systems, such as clustering based methods [24], tensor decompositions [26], and topic-models [14].

2.3. Related work

Privacy violations and attacks in recommender systems have been well studied [6,27]. The first study concerned with this issue was undertaken by Ramakrishnan et al. [23]. They claimed that users who rated items across disjoint domains could face a privacy risk through statistical database queries. This hypothesis was confirmed by Narayanan and Shmatikov [16,17], who re-identified part of the users in the Netflix Prize dataset by associating it with the Internet Movie Database (IMDb) dataset. Calandrino et al. [4] presented a more serious privacy violation. By observing temporal changes in the public outputs of a recommender system, they inferred a particular user’s historical ratings and behavior with background information. At present, the privacy issue in recommender systems is gaining the attention of both academia and industry.

Several traditional privacy preserving methods have been employed in recommender systems, including cryptography [5,28], perturbation [21] and obfuscation [1,18]. Canny [5] proposed a multi-party computation method that allows users to publicly aggregate data without disclosing their true data, and each user applies local computation to obtain personalized recommendations. Zhan et al. [28] solved a similar problem by applying homomorphic encryption and scalar product approaches. The cryptographic method retains high performance because it does not disturb the original record. However, extra computational cost and complicated security protocols make it harder to apply widely, and it appears to be used for recommender systems rather than normal users. Perturbation and obfuscation are similar. Specifically, perturbation systematically changes a user’s rating by adding noise before submission. For example, Polat and Du [21,22] deployed a centralized server to store the perturbed ratings, and uniform noise was added to each rating before making a recommendation. Obfuscation replaces a certain percentage of a user’s ratings by random values. Berkovsky et al. [1] decentralized rating profiles among multiple repositories and replaced some ratings with the mean. Both perturbation and obfuscation offer a high level of privacy because all or part of a user’s ratings are not true. However, the magnitude of noise and the percentage of replaced ratings are subjective and difficult to control. Therefore, how to obtain a trade-off between privacy and utility is still a challenge for both techniques.

Although privacy preserving in traditional recommender systems has been well studied, less attention has been given to the issue of privacy in tagging recommender systems. Compared to traditional recommender systems, the privacy problem in tagging systems is more complicated due to its unique structure and semantic content. The most relevant paper was from Parra-Arnau et al. [19], who made the first contribution towards the development of a privacy preserving tagging system by proposing the tag suppression approach. They first modeled the user’s profile using a tagging histogram and eliminated sensitive tags from this profile. To retain utility, they applied a clustering approach to structure all the tags and suppress those less represented. Finally, they analyzed the effectiveness of their approach by discussing the semantic loss of users. However, tag suppression has several limitations. It only releases an incomplete dataset, with parts of the sensitive tags deleted. The choice of sensitive tags is subjective, without any quantitative measure. Furthermore, if the dataset is publicly shared, users can be identified because the remaining tags still have the potential to reveal a user’s identity. The privacy issue in tagging recommender systems is still largely unexplored, and we attempt to fill this void in this paper.

Hence, in this paper, we propose a Private Tagging Releasing algorithm, with the aim of preserving comprehensive privacy for individuals and maximizing the utility of the released dataset. Specifically, we attempt to address the following issues:

How to retain the unique structure of a tagging dataset? As mentioned earlier, the tagging dataset has a unique structure, and the relationship among users, resources and tags should be preserved within the private mechanism; otherwise, the utility will significantly decrease. Unfortunately, a naive differentially private mechanism will lose this structure. An effective way to solve this problem is to provide a synthetic dataset instead of simple statistical information.

How to decrease the large magnitude of noise when using differential privacy? In the tagging dataset, the key to decreasing the noise is to shrink the scale of the randomization mechanism. Previous work focuses on methods that reduce the number of tags, but this results in high utility loss.

In this paper, we retain the structure by using an improved differential privacy mechanism and shrink the randomized domain to limit the size of the noise. Both issues will be investigated in the following sections.

3. Private Tagging Releasing

In a tagging recommender system, a user $u_{a}$ ’s profile $P_{a}$ is usually modeled by his/her tagging records, including tag names $T (u_{a}) = \{t_{1}, \dots, t_{| T (u_{a}) |}\}$ and weights $W (u_{a}) = \{w_{1}, \dots, w_{| T (u_{a}) |}\}$ [19]. The profile $P_{a}$ can be illustrated by a histogram in which each bin corresponds to a tag name and the frequency is the fraction of the user’s tagging history in which that tag is used.
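A profile of this kind can be sketched in a few lines. The function name `user_profile` and the example tag list are hypothetical; the sketch only assumes that a user's tagging history is available as a flat list of tag names.

```python
from collections import Counter

def user_profile(tag_records):
    """Map each tag to its fraction among the user's tag assignments."""
    counts = Counter(tag_records)
    total = sum(counts.values())
    if total == 0:
        return {}  # a user without tagging records has an empty profile
    return {tag: n / total for tag, n in counts.items()}

# Hypothetical tagging history of one user.
profile = user_profile(["python", "python", "privacy", "tutorial"])
```

The keys of the returned dictionary play the role of T(u) and the values play the role of W(u); the weights sum to 1 by construction.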

However, even if the tagging dataset replaces the user’s name with a random number before release, the profile $P_{a}$ may still disclose the user’s private information. An attacker may easily re-identify a particular user in a tagging dataset if he/she has some background information. For example, if an adversary knows that a particular user marked a resource with the tag ‘HIV’, then he/she can re-identify the user by simply searching the tagging dataset. More background information increases the probability of re-identifying a user, and because background information is difficult to model, traditional privacy approaches can hardly provide sufficient protection to users [11].

Suppose there are three users $u_{1}$ , $u_{2}$ and $u_{3}$ , and each one has the tags shown in Table 1. If an adversary has the background information that Alice has three tags $t_{1}$ , $t_{2}$ and $t_{3}$ , it is very easy for the adversary to re-identify Alice in this released table. In addition, once the adversary has identified Alice, he/she will find out that Alice also has $t_{4}$ (which might be an ‘HIV’ tag). In this way, the adversary can re-identify Alice and obtain more information as well.

Table 1
Users with Tags

User	$t_{1}$	$t_{2}$	$t_{3}$	$t_{4}$
$u_{1}$	0	1	2	0
$u_{2}$	1	0	0	1
$u_{3}$	1	1	1	1
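The re-identification described for Table 1 can be reproduced with a short sketch. The dictionary below encodes the counts from Table 1; the function name `matching_users` and the variable names are illustrative assumptions.

```python
# Tag counts copied from Table 1 (user names already replaced by pseudonyms).
table = {
    "u1": {"t1": 0, "t2": 1, "t3": 2, "t4": 0},
    "u2": {"t1": 1, "t2": 0, "t3": 0, "t4": 1},
    "u3": {"t1": 1, "t2": 1, "t3": 1, "t4": 1},
}

def matching_users(table, known_tags):
    """Return the pseudonyms whose rows contain every tag known to the adversary."""
    return [u for u, row in table.items() if all(row[t] > 0 for t in known_tags)]

# Background information: the target has used t1, t2 and t3.
known = ["t1", "t2", "t3"]
candidates = matching_users(table, known)

# If exactly one row matches, the remaining tags of that row are disclosed.
leaked = [t for t, c in table[candidates[0]].items() if c > 0 and t not in known]
```

Only one pseudonym matches the background information, so the adversary both re-identifies the target and learns the additional tag in that row.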

Differential privacy considers the worst case of background information, which makes it a promising choice for preserving privacy when releasing the profiles of users. Unlike traditional methods, the differential privacy mechanism assumes $P_{a}$ includes all tags in the dataset. Specifically, the names of tags are represented by $T (u_{a}) = \{t_{1}, \dots, t_{| T |}\}$ and weights are denoted $W (u_{a}) = \{w_{1}, \dots, w_{| T |}\}$ , where $w_{i} = 0$ indicates that $t_{i}$ is unused by user $u_{a}$ . Then differential privacy utilizes the randomized mechanism to add noise to the weight of each tag and releases a noisy user profile ${\hat{P}}_{a}$ . In this case, $W (u_{a})$ is a sparse vector because a user tends to use only a small number of tags compared to $| T |$ . When we use the randomized mechanism, many weights in $W (u_{a})$ will change from zero to a positive value, thus introducing a large magnitude of noise into the user’s profile.

One way to reduce the noise is to shrink the randomized domain, that is, to reduce the number of bins in the histogram. To achieve this objective, tags can be partitioned and merged within groups. A user profile is then represented by a group-based histogram, in which the frequency of each group is computed according to the fraction of tags it contains. Calibrated noise is then added to each group, and the noise size diminishes significantly because there are far fewer groups than tags. It should be noted that the groups need to be chosen carefully, because the grouping itself may disclose information about the original dataset.
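The group-based histogram can be sketched as a simple aggregation. The function name `group_profile`, the tag names and the cluster labels below are hypothetical; the sketch assumes a tag-to-cluster mapping has already been obtained.

```python
def group_profile(tag_weights, tag_to_cluster):
    """Aggregate per-tag weights into per-cluster weights."""
    grouped = {}
    for tag, weight in tag_weights.items():
        cluster = tag_to_cluster[tag]
        grouped[cluster] = grouped.get(cluster, 0.0) + weight
    return grouped

# Hypothetical profile with three tags merged into two groups.
weights = {"python": 0.5, "java": 0.25, "hiv": 0.25}
clusters = {"python": 0, "java": 0, "hiv": 1}
grouped = group_profile(weights, clusters)
```

The histogram shrinks from one bin per tag to one bin per group, so noise has to be added to fewer bins.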

In this section, we propose a Private Tagging Releasing (PriTag) algorithm to address the privacy issues in tagging recommender systems. Specifically, we first present an overview of the algorithm, then provide details for each operation. A theoretical analysis is further provided to show why this algorithm achieves ϵ-differential privacy while retaining the utility for recommendation purposes.

3.1. Private Tagging Releasing algorithm

The PriTag algorithm aims to publish all users’ profiles by masking the exact tags and weights under the notion of differential privacy. Three private operations are introduced to ensure that each individual in the released dataset cannot be re-identified by an adversary.

Private Tag Clustering: This creates tag clusters but masks the exact number of tags and the centers of each cluster. From the clustering output, the adversary cannot infer which group a tag belongs to.

Private Tag Selection: This masks a user’s profile so that an adversary cannot infer the tags flagged by this user.

Tag Weight Perturbation: This masks the true weight of the tags in a user’s profile to prevent an adversary from inferring how many times a user tagged certain resources.

On the basis of these private operations, the proposed PriTag algorithm outputs a new tagging dataset for tagging recommendations. As shown in Algorithm 1, step 1 divides the privacy budget into three parts. One part is used in private clustering and the other two are applied in each user’s profile to hide the tag’s information and weights, respectively. Step 2 clusters all the tags into k groups. Then in step 3, each user’s tag is replaced by privately selecting tags within related clusters. After perturbing the weight of each tag using the Laplace mechanism in step 4, the sanitized dataset $\hat{G}$ is finally released in step 5 for recommendation purposes.

Algorithm 1

Private Tagging Releasing (PriTag) algorithm

The three private operations guarantee differential privacy and retain acceptable recommendation performance simultaneously. Details of the Private Tag Clustering operation are presented in Section 3.1.1, and the Private Tag Selection and Weight Perturbation operations are described in Section 3.1.2.
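The budget division in step 1 can be sketched as follows. The shares used here are an assumption inferred from the equations in the following subsections: ϵ/2 for clustering (Eq. (8)), ϵ/4 for weight perturbation (Eq. (12)), and ϵ/4 for tag selection (consistent with the factor 8 in Eq. (11)). The function names are illustrative.

```python
def split_budget(epsilon):
    """Divide the total budget into three parts that sum to epsilon.

    The shares are an assumption inferred from Eqs. (8), (11) and (12).
    """
    return {
        "clustering": epsilon / 2.0,
        "selection": epsilon / 4.0,
        "weights": epsilon / 4.0,
    }

def per_iteration_budget(clustering_budget, iterations):
    """Budget spent by each of the p clustering iterations."""
    return clustering_budget / iterations

shares = split_budget(1.0)
eps_iter = per_iteration_budget(shares["clustering"], iterations=5)
```

Because the three operations are applied one after another, their budgets add up to the total ϵ, and each clustering iteration with p = 5 receives ϵ/(2p).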

3.1.1. Private Tag Clustering

In this subsection, we describe the Private Tag Clustering operation, which privately groups tags into several clusters within the differential privacy budget. Private Tag Clustering categorizes unstructured tags into groups to shrink the randomization scope for privacy purposes. According to the notion of differential privacy, this operation ensures that deleting a tag will not significantly affect the clustering results. This means an adversary cannot infer which group a tag is located in from the clustering output. This objective can be achieved by adding Laplace noise to the distance measurement during the clustering process.

We apply K-means as the baseline clustering algorithm and introduce the Laplace mechanism into it. Blum’s initial work on SuLQ claims that a private clustering method should mask the cluster center and the number of records in each cluster [2]. Following this work, we roughly conceptualize the Private Tag Clustering in three steps:

Modeling each tag by annotated resources.

Defining a quantitative measure of distance between tags.

Adding noise in each iteration to privately cluster all tags into groups.

We first model each tag using a numeric vector that contains the counts of the tag marked on each resource. For example, tag $t_{i}$ is represented by $z_{i} = \{z_{i 1}, \dots, z_{i | R |}\}$ , where $z_{i a}$ is the number of times the tag is marked on resource $r_{a}$ . If $t_{i}$ is never used to annotate $r_{a}$ , $z_{i a}$ is zero. The length of each tag vector is $| R |$ , the number of resources. The tag vector model will be used to measure the distance between tags in the next step.

On the basis of the tag vector model, the second step aims to evaluate the tag’s distance, which is measured by Cosine distance: $\begin{matrix} (4) & d (t_{i}, t_{j}) = 1 - \frac{z_{i} \cdot z_{j}}{∥ z_{i} ∥ \cdot ∥ z_{j} ∥} . \end{matrix}$
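The tag vector model and Eq. (4) can be sketched as follows, assuming NumPy. The function names, the toy assignments, and the convention of returning distance 1 for an all-zero vector are assumptions made for illustration.

```python
import numpy as np

def tag_vectors(assignments, tags, resources):
    """Row i holds z_i: how many times tag t_i was attached to each resource."""
    t_idx = {t: i for i, t in enumerate(tags)}
    r_idx = {r: a for a, r in enumerate(resources)}
    Z = np.zeros((len(tags), len(resources)))
    for _user, resource, tag in assignments:
        Z[t_idx[tag], r_idx[resource]] += 1
    return Z

def cosine_distance(z_i, z_j):
    """Eq. (4): 1 minus the cosine similarity of two tag vectors."""
    denom = np.linalg.norm(z_i) * np.linalg.norm(z_j)
    if denom == 0:
        return 1.0  # assumption: an unused tag is maximally distant
    return 1.0 - float(np.dot(z_i, z_j) / denom)

# Hypothetical (user, resource, tag) assignments.
assignments = [
    ("u1", "r1", "python"), ("u2", "r1", "python"),
    ("u1", "r2", "privacy"), ("u2", "r1", "code"),
]
Z = tag_vectors(assignments, ["python", "privacy", "code"], ["r1", "r2"])
```

Since all counts are non-negative, the distance lies between 0 (tags used on the same resources in the same proportions) and 1 (tags that never share a resource).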

The third step introduces differential privacy into K-means, which is a popular iterative algorithm for grouping data observations into k clusters. Let $c_{l}$ denote the center of a cluster $C_{l}$ . Equation (5) shows the objective function D, which measures the total distance between tags and the cluster centers. $\begin{matrix} (5) & D = \sum_{i = 1}^{| T |} \sum_{l = 1}^{k} γ_{i l} d (t_{i}, c_{l}), \end{matrix}$ where $γ_{i l}$ is an indicator defined as follows: $\begin{matrix} (6) & γ_{i l} = \begin{cases} 1 & t_{i} \in C_{l} \\ 0 & t_{i} \notin C_{l} \end{cases} \end{matrix}$

When combined with differential privacy, Laplace noise is added to the objective function D. According to Definition 2, Laplace noise is calibrated by the sensitivity of the objective function D and the privacy budget. The sensitivity of D is the maximal distance between a tag and a cluster center: $\begin{matrix} (7) & {Sen}_{D} = max_{t_{i} \in T, c_{l} \in c} d (t_{i}, c_{l}) . \end{matrix}$ When we use Cosine distance, the maximal ${Sen}_{D}$ is 1. For the privacy budget, we split $ϵ / 2$ into p parts, where p is the number of iterations, so each iteration receives a budget of $ϵ / (2 p)$ . The private objective function $D^{'}$ is therefore defined as follows: $\begin{matrix} (8) & D^{'} = \sum_{i = 1}^{| T |} \sum_{l = 1}^{k} γ_{i l} d (t_{i}, c_{l}) + Laplace (\frac{2 p}{ϵ}) . \end{matrix}$ After p iterations, the Private Tag Clustering outputs $\hat{C} = \{C_{1}, \dots, C_{k}\}$ . Details of this operation are shown in Algorithm 2.
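The noisy objective of Eq. (8) can be evaluated as in the sketch below, assuming NumPy. This only illustrates one evaluation of D′ for a given assignment; it is not the full clustering operation, and the function names and argument layout are assumptions.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two non-negative vectors (Eq. (4))."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 if denom == 0 else 1.0 - float(np.dot(a, b) / denom)

def private_objective(Z, centers, labels, epsilon, p, rng=None):
    """Eq. (8): total tag-to-center distance plus Laplace(2p / epsilon) noise.

    Z[i] is the vector of tag t_i and labels[i] is the index of its cluster,
    which plays the role of the indicator in Eq. (6).
    """
    if rng is None:
        rng = np.random.default_rng()
    D = sum(cosine_distance(Z[i], centers[labels[i]]) for i in range(len(Z)))
    # Sensitivity 1 and per-iteration budget epsilon / (2p) give scale 2p / epsilon.
    return D + rng.laplace(0.0, 2.0 * p / epsilon)
```

With more iterations p, each iteration receives a smaller share of the budget, so the noise added to each evaluation of the objective grows.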

Algorithm 2

Private Tag Clustering operation

3.1.2. Private Tag Selection and Weight Perturbation

After shrinking the randomized scope via clustering, suitable tags should be selected to replace the original tags of an active user. The challenge in Private Tag Selection is that, for a tag in an active user’s profile, uniform tag selection within a cluster is unacceptable because it significantly harms utility. On the other hand, it is also dangerous to replace the original tag with the most similar one, because the adversary can easily figure out the most similar tag by simple statistical analysis. Consequently, the Private Tag Selection needs to: 1) retain the utility of tags, and 2) mask the similarity between tags.

To achieve these goals, Private Tag Selection adopts the Exponential mechanism to privately select tags from a list of candidates. To be precise, for a particular tag $t_{i}$ , the operation first locates the cluster $C_{i}$ it belongs to. All tags in $C_{i}$ are included in a candidate list I, and every tag is assigned a probability according to Definition 4, which involves the score function and its sensitivity. The selection of tags is performed based on the assigned probabilities.

We use the similarity between tags as the score function. The score function q for a target tag $t_{i}$ is defined as follows: $\begin{matrix} (9) & q_{i} (I, t_{j}) = s (i, j), \end{matrix}$ where $s (i, j)$ is the similarity between $t_{i}$ and $t_{j}$ , I is tag $t_{i}$ ’s candidate list for replacement, and $t_{j}$ is the selected tag. Each tag $t \in I$ is assigned a score according to Eq. (9).

The sensitivity of the score function q, $Sen (i, j)$ , is measured by the maximal change in the similarity of two tags when removing a resource marked by both $t_{i}$ and $t_{j}$ . Let $s^{'} (i, j)$ denote $s (i, j)$ after deleting a resource; $Sen (i, j)$ then captures the maximal difference between $s (i, j)$ and $s^{'} (i, j)$ : $\begin{matrix} (10) & Sen (i, j) = max_{i, j \in I} {∥ s (i, j) - s^{'} (i, j) ∥}_{1} . \end{matrix}$

On the basis of the score function and sensitivity, the probability assigned to each tag $t_{j}$ is computed by Eq. (11), and the pseudocode of Private Tag Selection is presented in Algorithm 3: $\begin{matrix} (11) & \frac{exp (\frac{ϵ \cdot s (i, j)}{8 \cdot Sen (i, j)})}{\sum_{j \in C_{l}} exp (\frac{ϵ \cdot s (i, j)}{8 \cdot Sen (i, j)})} . \end{matrix}$
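The selection probabilities of Eq. (11) can be sketched as follows, assuming NumPy. The sketch uses a single sensitivity value for all candidates, which is a simplifying assumption; the function names and the way similarities are passed in are also illustrative.

```python
import numpy as np

def selection_probabilities(similarities, sensitivity, epsilon):
    """Eq. (11): probability of each candidate t_j in the cluster of t_i."""
    s = np.asarray(similarities, dtype=float)
    logits = epsilon * s / (8.0 * sensitivity)
    logits -= logits.max()  # numerical stability; the ratios are unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

def private_tag_selection(cluster_tags, similarities, sensitivity, epsilon, rng=None):
    """Draw one replacement tag for t_i from its cluster using Eq. (11)."""
    if rng is None:
        rng = np.random.default_rng()
    probs = selection_probabilities(similarities, sensitivity, epsilon)
    return cluster_tags[rng.choice(len(cluster_tags), p=probs)]
```

More similar candidates are more likely to be chosen, which retains utility, while the randomness prevents an adversary from assuming that the released tag is always the most similar one.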

Algorithm 3

Private Tag Selection

The weight of a tag is defined as the fraction of the tag in a user’s profile, and the objective of Tag Weight Perturbation is to mask the true weight of each tag for individuals.

Laplace noise is added to each weight to preserve privacy. According to Definition 2, the sensitivity of weights is measured by the tag’s fraction, which is denoted as $\frac{1}{| T (u) |}$ . The noise is calibrated by a privacy budget of $ϵ / 4$ and the sensitivity. $\begin{matrix} (12) & W_{noise} (u) = W (u) + Lap {(\frac{4}{| T (u) | \cdot ϵ})}^{| T (u) |} . \end{matrix}$ The weights are normalized after adding noise.
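Eq. (12) and the subsequent normalization can be sketched as follows, assuming NumPy. The text does not specify how negative noisy weights are handled, so clipping them to zero before normalizing (and falling back to uniform weights if everything is clipped) is an assumption of this sketch, as is the function name.

```python
import numpy as np

def perturb_weights(weights, epsilon, rng=None):
    """Add Lap(4 / (|T(u)| * epsilon)) noise to each weight, then normalize."""
    if rng is None:
        rng = np.random.default_rng()
    w = np.asarray(weights, dtype=float)
    n = len(w)
    scale = 4.0 / (n * epsilon)  # sensitivity 1/|T(u)| and budget epsilon/4
    noisy = w + rng.laplace(0.0, scale, size=n)
    noisy = np.clip(noisy, 0.0, None)  # assumption: negative weights are set to 0
    total = noisy.sum()
    if total == 0:
        return np.full(n, 1.0 / n)  # assumption: uniform fallback
    return noisy / total
```

Users with many tags receive less noise per weight, because the sensitivity 1/|T(u)| shrinks as the profile grows.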

3.2. Utility analysis

In this section, we provide a theoretical analysis for the proposed PriTag algorithm from the utility perspective.

The dataset $\hat{G}$ is released for recommendation purposes, and the accuracy of the tagging recommendation highly depends on the user profiles. The closeness between the user’s original profile and the privately selected new profile in PriTag is the key factor that determines the utility level [19]. Given a target user $u_{a}$ , we set the original profile $P_{a}$ as a baseline. By comparing the replaced tags in the user’s profile ${\hat{P}}_{a}$ with the corresponding tags in the baseline $P_{a}$ , we can evaluate the utility level of the proposed algorithm. The distance between $P_{a}$ and ${\hat{P}}_{a}$ is referred to as semantic loss, which was adopted from [19].

Because a tagging system contains three elements in one tuple $\langle user, tag, resource \rangle$ , the distance between tags can be defined by either users or resources. We apply both distances and define the related semantic losses ( ${SL}_{U}$ and ${SL}_{R}$ ) accordingly. To keep the paper concise, we take ${SL}_{U}$ as an example; ${SL}_{R}$ can be analyzed in the same way. Equation (13) gives the definition of ${SL}_{U}$ . $\begin{matrix} (13) & {SL}_{U} = \frac{1}{| U |} \sum_{u \in U} (\frac{\sum_{t \in P_{a}, \hat{t} \in {\hat{P}}_{a}} d (t (u), \hat{t} (u))}{| T (u) |}) . \end{matrix}$
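The semantic loss of Eq. (13) can be computed as in the sketch below. It assumes that each user's original and released tag lists are aligned position by position and that a distance function between tags is supplied by the caller; the function name and the toy 0/1 distance are illustrative.

```python
def semantic_loss(original, released, distance):
    """Eq. (13): average over users of the mean distance between each
    original tag and the tag that replaced it.

    original and released map each user to position-aligned lists of tags.
    """
    total = 0.0
    for user, tags in original.items():
        replaced = released[user]
        per_user = sum(distance(t, t_hat) for t, t_hat in zip(tags, replaced))
        total += per_user / len(tags)
    return total / len(original)

# Hypothetical example with a 0/1 distance (0 if the tag is unchanged).
original = {"u1": ["a", "b"], "u2": ["c"]}
released = {"u1": ["a", "x"], "u2": ["c"]}
loss = semantic_loss(original, released, lambda t, s: 0.0 if t == s else 1.0)
```

In this toy case half of the first user's tags changed and none of the second user's, so the average loss over the two users is 0.25.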

After this, we apply a widely used utility definition in differential privacy by Blum et al. [3]:

Definition 5 ((α, δ)-usefulness).

A dataset mechanism $M$ is (α, δ)-useful for a set of queries F if, with probability $1 - δ$ , for every query $f \in F$ and every dataset G, for $\hat{G} = M (G)$ , we have $\begin{matrix} (14) & max_{f \in F} | f (\hat{G}) - f (G) | ⩽ α . \end{matrix}$

This definition can be applied to evaluate $M$ in terms of semantic loss. To analyze the utility of PriTag using Definition 5, we define the semantic loss as the query f and prove that it is less than a certain value with a high probability.

Theorem 3.1.
For any user $u_{a} \in U$ and for all $δ > 0$ , with probability at least $1 - δ$ , the semantic loss of the user is less than α, that is, PriTag is $(α, δ)$ -useful, when $\begin{matrix} (15) & α > max_{u_{a} \in U} \frac{\sum_{t_{i} \in T (u_{a})} (1 - \frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}})}{| T (u_{a}) | δ}, \end{matrix}$ where $Sen = max_{t_{i}, t_{j} \in T (u_{a})} Sen (i, j)$ and $Q_{i}$ is the normalization factor that depends on the cluster that $t_{a i}$ belongs to.
Proof.
Consider a user $u_{a}$ who has a set of tags $T (u_{a}) = \{t_{a 1}, \dots, t_{a | T (u_{a}) |}\}$ . For each tag $t_{a i}$ , the probability of remaining unchanged in the private selection is $\frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}}$ , where $Q_{i}$ is the normalization factor depending on the cluster that $t_{a i}$ belongs to.

According to Markov’s inequality, for user $u_{a}$ , we obtain $\begin{array}{l} \Pr ({SL}_{u_{a}} > α_{a}) ⩽ \frac{E ({SL}_{u_{a}})}{α_{a}} \\ \Rightarrow \Pr ({SL}_{u_{a}} ⩽ α_{a}) ⩾ 1 - \frac{E ({SL}_{u_{a}})}{α_{a}} . \end{array}$ Because $\begin{array}{l} E ({SL}_{u_{a}}) \\ = \sum_{t_{i} \in T (u_{a})} \frac{d (t_{a i}, {\hat{t}}_{a i})}{| T (u_{a}) |} (1 - \frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}}) , \end{array}$ we have $\begin{array}{l} \Pr ({SL}_{u_{a}} ⩽ α_{a}) \\ ⩾ 1 - \frac{\sum_{t_{i} \in T (u_{a})} d (t_{a i}, {\hat{t}}_{a i}) (1 - \frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}})}{| T (u_{a}) | α_{a}} . \end{array}$

Let $\begin{array}{l} 1 - \frac{\sum_{t_{i} \in T (u_{a})} d (t_{a i}, {\hat{t}}_{a i}) (1 - \frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}})}{| T (u_{a}) | α_{a}} \\ ⩾ 1 - δ . \end{array}$ This holds when $\begin{matrix} (16) & α_{a} > \frac{\sum_{t_{i} \in T (u_{a})} d (t_{a i}, {\hat{t}}_{a i}) (1 - \frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}})}{| T (u_{a}) | δ} . \end{matrix}$

When we use the Cosine distance, which has a maximum value of 1, the maximal $d (t_{a i}, {\hat{t}}_{a i})$ between tags is 1, and Eq. (16) can be simplified to $\begin{matrix} α_{a} > \frac{\sum_{t_{i} \in T (u_{a})} (1 - \frac{exp (\frac{ϵ}{8 | T (u_{a}) | Sen})}{Q_{i}})}{| T (u_{a}) | δ} . \end{matrix}$ For all users, α is determined by the maximal value, $α = \max_{u_{a} \in U} α_{a}$ . □

The proof shows the semantic loss for each user mainly depends on the privacy budget and the normalization factor $Q_{i}$ . According to Eq. (11), the normalization factor $Q_{i}$ for a particular tag $t_{i}$ is defined as $\begin{matrix} (17) & Q_{i} = \sum_{j \in C_{l}} exp (\frac{ϵ \cdot s (i, j)}{8 | T (u_{a}) | \cdot Sen}) . \end{matrix}$
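
As a rough numerical illustration of Eqs. (15)–(17), the sketch below computes $Q_{i}$ for each tag of one user, the probability that the tag is left unchanged, and the resulting lower bound on $α_{a}$ . The function names, the example similarity scores and the assumption that a tag has similarity 1 with itself are illustrative choices and are not taken from the PriTag implementation.

```python
import math


def normalization_factor(similarities, epsilon, n_tags, sen):
    """Q_i of Eq. (17): sum over the candidate tags j in the cluster of t_i."""
    return sum(math.exp(epsilon * s / (8 * n_tags * sen)) for s in similarities)


def alpha_lower_bound(cluster_similarities, epsilon, delta, sen=1.0):
    """Right-hand side of the simplified Eq. (16) for a single user.

    cluster_similarities[i] lists s(i, j) for every tag j in the cluster of
    tag t_i, including t_i itself (assumed here to have similarity 1).
    """
    n_tags = len(cluster_similarities)
    total = 0.0
    for sims in cluster_similarities:
        q_i = normalization_factor(sims, epsilon, n_tags, sen)
        # Probability that t_i is selected as its own replacement.
        p_unchanged = math.exp(epsilon / (8 * n_tags * sen)) / q_i
        total += 1.0 - p_unchanged
    return total / (n_tags * delta)


# Hypothetical user with two tags whose clusters hold 3 and 2 candidates.
bound = alpha_lower_bound([[1.0, 0.6, 0.2], [1.0, 0.4]], epsilon=1.0, delta=0.5)
print(f"alpha_a must exceed {bound:.3f}")
```

Under these assumptions, larger clusters increase $Q_{i}$ , lower the probability of keeping the original tag and therefore raise the bound.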

Therefore, the size of $Q_{i}$ depends on the cohesion inside cluster $C_{l}$ : a compact cluster results in a small $Q_{i}$ and less semantic loss. Further analysis shows that the cohesion is determined by the privacy budget ϵ in the private tag clustering operation. It can be concluded that the privacy budget has a significant impact on the utility level of PriTag. Both the cohesion of clusters and the average semantic loss of users are evaluated experimentally in Sections 4.2 and 4.3.

3.3. Privacy analysis

This section presents a theoretical privacy analysis for PriTag. As mentioned before, PriTag contains three private operations: Private Tag Clustering, Private Tag Selection and Tag Weight Perturbation. The privacy budget ϵ is consequently divided into three pieces, as illustrated by Table 2.

Table 2
Privacy Budget Allocation in PriTag Algorithm

Operations	Privacy Budget
Private Tag Clustering	$ϵ / 2$
Private Tag Selection	$ϵ / 4$
Tag Weight Perturbation	$ϵ / 4$

To analyze the privacy guarantee, two composition properties of the privacy budget are normally used in differential privacy: sequential composition and parallel composition [15]. Sequential composition accumulates the privacy budget ϵ of each step when a series of private analyses is performed sequentially on the same dataset. Parallel composition corresponds to the case in which each private step is applied to a disjoint subset of the dataset; the ultimate privacy guarantee then depends only on the step with the maximal ϵ. Both properties are stated in Lemmas 1 and 2.

Lemma 1.

Sequential Composition [ 15 ]: Let $M = \{ M_{1}, \dots, M_{m} \}$ . If each $M_{i}$ provides an ϵ-differential privacy guarantee, the sequence of $M$ provides $m ϵ$ -differential privacy.

Lemma 2.

Parallel Composition [ 15 ]: Let $M = \{ M_{1}, \dots, M_{m} \}$ . If each $M_{i}$ provides an $ϵ_{i}$ -differential privacy guarantee on a disjoint subset of the entire dataset, the parallel application of $M$ provides $\max \{ ϵ_{1}, \dots, ϵ_{m} \}$ -differential privacy.

Based on the above lemmas and the privacy budget allocation in Table 2, we measure the privacy level of our algorithm as follows:

The Private Tag Clustering consists of p iterations, which are sequentially performed on the whole dataset. The privacy budget of each iteration is $\frac{ϵ}{2 p}$ . According to Lemma 1, this operation preserves $ϵ / 2$ -differential privacy.

The Private Tag Selection applies the Exponential mechanism successively. For one user u, each tag in the profile is replaced by a privately selected tag until all tags are replaced. From the definition of the Exponential mechanism, each selection preserves $\frac{ϵ}{4 | T (u) |}$ -differential privacy. According to Lemma 1, the $| T (u) |$ selections for each user together guarantee $\frac{ϵ}{4}$ -differential privacy. Furthermore, as each user’s profile is independent, replacing a user’s tags has no effect on the profiles of others. According to Lemma 2, the Private Tag Selection preserves $\frac{ϵ}{4}$ -differential privacy as a whole.

The Tag Weight Perturbation applies the Laplace mechanism to the weights of tags. The noise is drawn from $Lap {(\frac{4}{| T (u) | \cdot ϵ})}^{| T (u) |}$ , which preserves $\frac{ϵ}{4}$ -differential privacy for each user. Because each user’s profile can be regarded as a disjoint subset of the entire dataset, Lemma 2 implies that the Tag Weight Perturbation also guarantees $\frac{ϵ}{4}$ -differential privacy as a whole.

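The three operations are applied sequentially to the same dataset, so by Lemma 1 their budgets add up to $ϵ / 2 + ϵ / 4 + ϵ / 4 = ϵ$ . Consequently, the proposed PriTag algorithm preserves ϵ-differential privacy.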
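
The budget accounting above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the allocation of Table 2; the helper names `sequential`, `parallel` and `pritag_budget`, as well as the example profile sizes, are hypothetical and serve only to mirror Lemmas 1 and 2.

```python
def sequential(budgets):
    """Lemma 1: budgets of steps run on the same data add up."""
    return sum(budgets)


def parallel(budgets):
    """Lemma 2: steps run on disjoint subsets cost the largest budget."""
    return max(budgets)


def pritag_budget(epsilon, iterations, profile_sizes):
    """Total budget consumed by the three operations of Table 2."""
    # Private Tag Clustering: p iterations, each with budget epsilon / (2p).
    clustering = sequential([epsilon / (2 * iterations)] * iterations)
    # Private Tag Selection: |T(u)| selections per user, users are disjoint.
    selection = parallel(
        [sequential([epsilon / (4 * n)] * n) for n in profile_sizes]
    )
    # Tag Weight Perturbation: epsilon / 4 per user, users are disjoint.
    perturbation = parallel([epsilon / 4 for _ in profile_sizes])
    return sequential([clustering, selection, perturbation])


# Three hypothetical users with 3, 5 and 8 tags, 10 clustering iterations.
print(pritag_budget(epsilon=1.0, iterations=10, profile_sizes=[3, 5, 8]))
```

Up to floating-point rounding, the printed total equals the overall budget ϵ regardless of the number of users or the sizes of their profiles.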

4. Experiment and analysis

In this section, we evaluate the performance of the proposed PriTag algorithm by answering the following questions:

How does the private clustering operation affect cluster quality? As mentioned in Section 3.2, the quality of clustering result has a significant impact on the performance of recommendations. We answer this question by comparing the proposed private clustering operation with the traditional one in terms of the internal index, a cluster validation metric [12].

How does the PriTag algorithm affect the semantic loss of the datasets? Private operations revise a user’s profile to preserve privacy. We quantify the profile change by using the semantic loss. Moreover, to illustrate its effectiveness, we compare the PriTag algorithm with tag suppression [19] in terms of semantic loss.

How does the PriTag algorithm perform in a real tagging recommender system? Private operations decrease the utility of recommendations. In this part, we investigate the performance of PriTag in a real tagging recommender system.

How does the privacy budget affect the performance of the algorithm? In the context of privacy preserving, the privacy budget is a key parameter that controls the privacy level of the algorithms. To show the impact of the privacy budget, we examine the trade-off between the utility and the privacy of PriTag by varying ϵ over a wide range.

4.1. Datasets

To obtain a thorough comparison, we conduct the experiment on four datasets: Del.icio.us, Bibsonomy, MovieLens and Last.fm, all of which are structured in the form of triples (user, resource, tag).

The Del.icio.us dataset is retrieved from the Del.icio.us web site by the Distributed Artificial Intelligence Laboratory (DAI-Labor).¹ There are around 132 million resources and 950,000 users involved in the dataset, and we extracted a subset containing 3,000 users, 34,212 bookmarks and 12,183 tags.

¹ http://www.dai-labor.de/

The Bibsonomy dataset is provided by Discovery Challenge 2009 ECML/PKDD2009.² The dataset consists of 3,000 individual users, 421,928 resources and 93,756 tags. We filtered the dataset by removing spammers and automatically added tags like “imported”, “public”, etc. The subset we used contained 607 users, 14,026 resources and 4,842 tags.

² http://www.kde.cs.uni-kassel.de/ws/dc09/

The MovieLens and Last.fm datasets were obtained from HetRec 2011,³ which were generated by the Information Retrieval Group at Universidad Autonoma de Madrid. The information for all datasets is shown in Table 3.

³ http://www.dai-labor.de/

The Del.icio.us and Bibsonomy datasets focus on sharing resources and tags, so each user tends to collect more resources and a wider variety of tags. The Last.fm and MovieLens datasets are derived from traditional recommender systems; compared with dedicated tagging systems, they have a smaller variety of tags and fewer tags on each resource. To demonstrate the effectiveness of the proposed algorithm, we select datasets from both tagging systems and traditional recommender systems.

Table 3

Characteristics of the datasets

Dataset	Posts	$| U |$	$| R |$	$| T |$
Del.icio.us	130,160	3,000	34,212	12,183
Bibsonomy	163,510	607	14,026	4,842
Last.fm	186,479	1,892	12,523	9,749
MovieLens	47,957	2,113	5,908	9,079

Fig. 1.

Comparison between Private Clustering and Non-private Clustering.

4.2. Private clustering comparison

The performance of the private clustering operation is assessed by the Silhouette Coefficient metric, which is a popular internal index that measures the quality of clusters by combining both cohesion and separation [12]. For each tag $t_{i}$ , let $a (t_{i})$ be the average dissimilarity with all other tags in the same cluster l, and $b (t_{i})$ be the lowest average dissimilarity with the tags in any other cluster. The Silhouette Coefficient of tag $t_{i}$ can be defined as: $\begin{matrix} (18) & SC (t_{i}) = \frac{b (t_{i}) - a (t_{i})}{\max \{ a (t_{i}), b (t_{i}) \}} . \end{matrix}$

The overall quality of a clustering operation can be obtained by computing the average Silhouette Coefficient of all tags. It varies between $- 1$ and 1. Normally, the result is acceptable when the Silhouette Coefficient is greater than zero. In general, a higher ${SC}_{avg}$ indicates a better clustering result. $\begin{matrix} (19) & {SC}_{avg} = \frac{1}{| T |} \sum_{t_{i} \in T} \frac{b (t_{i}) - a (t_{i})}{\max \{ a (t_{i}), b (t_{i}) \}} . \end{matrix}$
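
To make Eqs. (18) and (19) concrete, the following minimal sketch computes the average Silhouette Coefficient from a precomputed pairwise dissimilarity matrix. It assumes at least two clusters and assigns a coefficient of 0 to tags in singleton clusters, which is a common convention rather than something specified in this paper; the toy matrix is invented for illustration.

```python
def silhouette_average(dist, labels):
    """Average Silhouette Coefficient (Eq. 19) over all tags.

    dist[i][j] is the dissimilarity between tags i and j,
    labels[i] is the cluster assigned to tag i.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: coefficient taken as 0 (assumption)
        # a(t_i): average dissimilarity to the other tags in the same cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b(t_i): lowest average dissimilarity to the tags of another cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy example: tags 0-1 and tags 2-3 form two tight, well separated groups.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
print(silhouette_average(dist, [0, 0, 1, 1]))  # close to 1: good clustering
print(silhouette_average(dist, [0, 1, 0, 1]))  # negative: poor clustering
```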

We compare the private clustering with the non-private clustering in terms of the Silhouette Coefficient. The number of clusters k ranges from $10$ to $100$ with a step of $10$ , and we use several typical privacy budgets, $ϵ = 0.5, 0.8$ and $1.0$ .

As shown in Fig. 1, ${SC}_{avg}$ is positive for all clustering results, which indicates that both non-private and private clustering lead to high-quality clusters. For non-private clustering, the average Silhouette Coefficient rises as k increases, illustrating that a larger number of clusters obtains a better performance. Note, however, that a larger value of k implies a smaller randomization range for Private Tag Selection. When the randomization range is too small, for example when a cluster contains only 2 tags, privacy cannot be preserved properly. The impact of k will be further analyzed in Section 4.3.

On the other hand, for private clustering with the privacy budget ϵ fixed to 1, the average Silhouette Coefficients are very close to the non-private ones. This indicates that the quality of clusters can be properly retained when the noise size is limited. For example, in Fig. 1(a), when $ϵ = 1$ and $k = 50$ , the average Silhouette Coefficient of non-private clustering is $0.0287$ , while private clustering achieves $0.0245$ , a decrease of about $14 %$ . However, when preserving a higher level of privacy, e.g. $ϵ = 0.5$ , the average Silhouette Coefficients degrade significantly. For example, in Fig. 1(b), the Silhouette Coefficient decreases by around $80 %$ when $k = 50$ , which is unacceptable in practice. This decrease derives from the limited privacy budget, which introduces a large volume of noise to the distance measurement in each iteration. We therefore choose a larger privacy budget, e.g. $ϵ = 1$ , to retain the high quality of clusters.

Similar trends can also be observed on the other datasets. Figures 1(c) and 1(d) also show an increase in the average Silhouette Coefficient when k becomes larger. For private clustering, when $ϵ = 1$ , the other datasets also retain an acceptable cluster quality. For example, in Fig. 1(d), when $k = 20$ the average Silhouette Coefficient decreases from $0.0172$ in non-private clustering to $0.0145$ in private clustering. Therefore, we choose $ϵ = 1$ as the fixed privacy budget because the quality of clustering is acceptable.

Note that although a higher Silhouette Coefficient can be obtained with a larger number of clusters k, the noise in the perturbation operation also increases. We therefore need to determine a suitable k rather than simply choosing the largest number of clusters. The impact of the parameter k will be further analyzed in Section 4.3.

4.3. Semantic loss analysis

To maintain consistency with previous research, we follow the strategy of [19] and use the semantic loss as the measurement metric, which refers to the percentage of eliminated tags for each user or resource. A smaller semantic loss means better performance. Eq. (13) in Section 3.2 defines the average semantic loss of users. In this section, we also test the semantic loss of each resource, which is formulated as follows: $\begin{matrix} (20) & {SemanticLoss}_{u, r} = \frac{\sum_{t \in T, t \in \hat{T}} d (t (u, r), \hat{t} (u, r))}{| T (u, r) |}, \end{matrix}$ where $d$ is the Cosine distance in our experiment. Based on this, we can measure the average semantic loss of users and resources: $\begin{matrix} (21) & {SL}_{U} = \frac{1}{| U |} \sum_{u \in U} (\frac{\sum_{t \in T, t \in \hat{T}} d (t (u), \hat{t} (u))}{| T (u) |}) , \end{matrix}$ $\begin{matrix} (22) & {SL}_{R} = \frac{1}{| R |} \sum_{r \in R} (\frac{\sum_{t \in T, t \in \hat{T}} d (t (r), \hat{t} (r))}{| T (r) |}) . \end{matrix}$
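
A minimal sketch of how Eqs. (21) and (22) could be evaluated is given below. It assumes that every tag is represented by a numeric vector (for example, its occurrence counts over users or over resources) and that the released profile lists the replacement of each original tag at the same position; both the representation and the function names are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 1.0  # convention assumed for empty tag vectors
    return 1.0 - dot / (norm_x * norm_y)


def profile_loss(original, released, vectors):
    """Mean distance between each original tag and its released replacement."""
    pairs = list(zip(original, released))
    return sum(cosine_distance(vectors[t], vectors[r]) for t, r in pairs) / len(pairs)


def average_semantic_loss(profiles, released_profiles, vectors):
    """SL_U (Eq. 21) when keyed by user, SL_R (Eq. 22) when keyed by resource."""
    return sum(
        profile_loss(profiles[key], released_profiles[key], vectors)
        for key in profiles
    ) / len(profiles)


# Hypothetical tag vectors and profiles for two users.
vectors = {"python": [1, 0], "cooking": [0, 1], "coding": [1, 1]}
profiles = {"u1": ["python", "cooking"], "u2": ["coding"]}
released = {"u1": ["python", "python"], "u2": ["coding"]}
print(average_semantic_loss(profiles, released, vectors))  # approximately 0.25
```

In this toy case only one of the three tags is replaced by an unrelated tag, so the first user loses half of the semantic content and the second user loses none.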

For a thorough investigation, we conducted experiments to compare PriTag with tag suppression [19] on two datasets. The average semantic loss exhibits a linear relationship with the tag suppression rate σ in tag suppression [19]. We choose a representative parameter value, $σ = 0.6$ . For the PriTag algorithm, we selected $ϵ = 0.3$ and $1.0$ to represent different privacy levels, and k varies from $10$ to $100$ with a step of $10$ .

Fig. 2.

The Semantic Loss of Users and Resources on Del.icio.us and Bibsonomy.

Figures 2(a) and 2(b) show the average semantic loss of users and resources on the Del.icio.us dataset, respectively. It can be observed that PriTag outperforms tag suppression in both cases. For tag suppression, when the elimination parameter $σ = 0.6$ , the semantic loss for both users and resources is fixed to $0.4$ . For PriTag, when $k = 70$ and $ϵ = 1$ , the user semantic loss is $0.1$ , which is $70 %$ lower than tag suppression. Even at a higher privacy level ( $ϵ = 0.3$ ), the average semantic loss of users is still lower than tag suppression by $60 %$ . This indicates the effectiveness of PriTag in terms of preserving semantic information for recommendations.

Similar trends can also be observed in Figs 2(c) and 2(d). Both figures show that PriTag obtains a stable semantic loss at a low level. All these results indicate that PriTag retains more utility than tag suppression. This is because PriTag retains the relationship between tags and resources. Consequently, the semantic information is well preserved and the tags still make resources meaningful. The only exception occurs in Fig. 2(d), in which the semantic loss of resources slightly exceeds that of tag suppression when $k < 30$ and $ϵ = 0.3$ . It can be avoided by increasing k, the number of clusters. Normally, we choose $k > 50$ . The results on both datasets demonstrate that the proposed PriTag is superior to tag suppression in terms of semantic preservation.

A further analysis is conducted to investigate the impact of the number of clusters k on the semantic loss. Specifically, the user semantic loss decreases as k increases at the very beginning. When k reaches a larger value, the semantic loss no longer decreases significantly. For example, in Fig. 2(c), the user semantic loss achieves its lowest value when $k = 70$ and $ϵ = 1$ . For the resource semantic loss, a similar trend can be observed. As shown in Fig. 2(d), when $ϵ = 1$ , the semantic loss keeps decreasing as k increases. It achieves its lowest value when $k = 70$ and then rises slightly for $k > 70$ . When $ϵ = 0.3$ , the semantic loss achieves its lowest value when $k = 60$ . Both observations indicate that although we can obtain higher quality clusters when k is large, the semantic loss may not continue to diminish. When k reaches a threshold, the semantic loss becomes stable or increases slightly. This is because the semantic loss is affected by two private operations, Private Tag Selection and Tag Weight Perturbation. Private Tag Selection tends to retain more semantic information for a larger value of k, as the randomization scope diminishes when k increases. However, the perturbation acts in the opposite way, adding a larger magnitude of noise as k rises. Consequently, the choice of k depends on the trade-off between both operations. In this case, we set $k = 70$ to obtain a reasonable result.

4.4. Performance of tagging recommendation

In this section, we investigate the effectiveness of PriTag in the context of tagging recommendations. Specifically, we apply a state-of-the-art tagging recommender system, FolkRank [13], to measure the degradation of tag recommendations in terms of accuracy that is caused by privacy preserving.

The key idea of FolkRank is that a resource which is flagged with important tags by important users becomes important itself. The importance is measured by the weight $\vec{ω}$ , which is computed iteratively as follows: $\begin{matrix} (23) & \vec{ω} \leftarrow λ A \vec{ω} + (1 - λ) \vec{ρ} \end{matrix}$ where A is the adjacency matrix of the folksonomy, $\vec{ρ}$ is the preference vector and $λ \in [0, 1]$ is the damping factor measuring the influence of $\vec{ρ}$ .

Fig. 3.

FolkRank Private Recall Result.

Fig. 4.

FolkRank Private Precision Result.

In the current investigation, we set $λ = 0.7$ , $\vec{ρ} = 1$ , and the preference weights are set to $1 + | U |$ and $1 + | R |$ , respectively. The computation stops after $10$ iterations or when the distance between two consecutive weight vectors is less than $10^{- 6}$ . For PriTag, we chose the number of clusters $k = 70$ , and tested the performance when $ϵ = 1$ and $ϵ = 0.5$ .
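
The update rule of Eq. (23) with the stopping criteria above can be sketched as follows. This is only an illustration of the weight propagation step, not a full FolkRank implementation: it assumes a small dense adjacency matrix whose columns are already normalized, uses the L1 norm as the distance between consecutive weight vectors, and the name `propagate_weights` is hypothetical.

```python
def propagate_weights(adjacency, preference, damping=0.7, max_iter=10, tol=1e-6):
    """Iterate w <- damping * A w + (1 - damping) * preference (Eq. 23)."""
    n = len(preference)
    weights = list(preference)
    for _ in range(max_iter):
        # Matrix-vector product A w on plain nested lists.
        spread = [
            sum(adjacency[i][j] * weights[j] for j in range(n)) for i in range(n)
        ]
        updated = [
            damping * spread[i] + (1 - damping) * preference[i] for i in range(n)
        ]
        # L1 distance between consecutive weight vectors (assumed metric).
        change = sum(abs(u - w) for u, w in zip(updated, weights))
        weights = updated
        if change < tol:
            break
    return weights


# Toy graph with three nodes; all preference mass is placed on node 0.
adjacency = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
print(propagate_weights(adjacency, [1.0, 0.0, 0.0]))
```

In the toy run, the preferred node ends up with the largest weight while the total weight stays equal to the total preference mass, because every column of the assumed matrix sums to one.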

For the measurement protocol, we apply the Leave-One-Out strategy, which is a popular configuration when evaluating tag recommendations. To begin with, we randomly select one resource of each user, and predict its tags using all the remaining tags in the dataset. Two measurements are applied to quantify the performance, precision and recall. For a test user that receives a list of N recommended tags (top-N list), precision and recall are defined as follows: $\begin{matrix} (24) & precision (T (u, r), \hat{T} (u, r)) = \frac{| T (u, r) \cap \hat{T} (u, r) |}{| \hat{T} (u, r) |} , \end{matrix}$ $\begin{matrix} (25) & recall (T (u, r), \hat{T} (u, r)) = \frac{| T (u, r) \cap \hat{T} (u, r) |}{| T (u, r) |} . \end{matrix}$ A larger precision or recall value means better performance.
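
Eqs. (24) and (25) amount to a set intersection per test post. The snippet below is a small sketch for a single (user, resource) pair; the function name and the example tags are made up for illustration, and returning 0 for an empty set is an assumed convention.

```python
def precision_recall(true_tags, recommended_tags):
    """Precision (Eq. 24) and recall (Eq. 25) for one (user, resource) post."""
    true_set = set(true_tags)
    rec_set = set(recommended_tags)
    hits = len(true_set & rec_set)
    precision = hits / len(rec_set) if rec_set else 0.0
    recall = hits / len(true_set) if true_set else 0.0
    return precision, recall


# The user actually assigned three tags; the recommender proposed four.
p, r = precision_recall(
    ["python", "web", "tutorial"],
    ["python", "code", "web", "news"],
)
print(p, r)  # two hits: precision 2/4, recall 2/3
```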

Figure 3 shows the recall performance on the four datasets, Del.icio.us, Bibsonomy, MovieLens and Last.fm. It presents the recommendation performance achieved by PriTag when the privacy budget ϵ varies from $0.1$ to 1 with a step of $0.1$ . It is clear that the recall performance of tag recommendations is significantly affected by the required privacy budget. For example, as plotted in Fig. 3(a) for the Del.icio.us dataset, when $N = 10$ , PriTag achieves a recall of $0.07$ and $0.22$ when ϵ equals $0.1$ and 1, respectively. Similar trends are reported when N is fixed to other values on all the other datasets. The reason is that privacy and utility are two opposing properties of the datasets, and PriTag serves as an effective method to obtain a good trade-off between them. For example, when $ϵ = 0.8$ , PriTag obtains a good recall performance on all datasets, as it retains as much of the utility of the dataset as possible while achieving the fixed privacy level. This also confirms the results of the experiment on semantic loss in Section 4.3. Moreover, Fig. 4 shows the precision performance on the four datasets. Similarly, PriTag achieves better performance when ϵ takes larger values, as precision is also determined by the proportion of retrieved tags in the test set.

These results on a real tagging recommendation system confirm the practical effectiveness of the proposed PriTag algorithm on tag recommendations, as we analyzed in Section 3.2.

Fig. 5.

Clustering Private Result.

Fig. 6.

Impact of ϵ on User Semantic Loss.

4.5. Impact of privacy budget

In the context of differential privacy, the privacy budget ϵ is the key parameter that determines the privacy level. According to [8], a lower ϵ represents a higher privacy level, and a budget of $ϵ ⩽ 1$ is suitable for privacy preserving purposes. To achieve a comprehensive examination of PriTag, we evaluate the semantic loss under diverse privacy levels by varying ϵ from $0.1$ to 1 with a step of $0.1$ on the four datasets.

Figure 5 reveals the impact of privacy budget ϵ on the clustering result when k is fixed. It was observed that as ϵ increases, the private clustering operation achieves a better performance. For example, in Fig. 5(a), when $ϵ = 1$ , the Silhouette Coefficient is around $0.0334$ when $k = 100$ , which is very close to the non-private clustering result. Moreover, it was observed the Silhouette Coefficient increased much faster when ϵ increased from $0.1$ to $0.4$ , compared to when ϵ increased from $0.4$ to 1. This indicates a large utility cost is needed when preserving a higher privacy level ( $ϵ = 0.1$ ). On the other hand, the Silhouette Coefficient was stable when $ϵ ⩾ 0.5$ . This confirms the private clustering is capable of retaining the quality of clusters, while satisfying the privacy preserving requirement.

Figure 6 presents the impact of ϵ on user semantic loss on the Del.icio.us and Bibsonomy datasets when k is equal to $100$ , $80$ , and $20$ . As detailed in Fig. 6(a), semantic loss diminishes as ϵ increases in the Del.icio.us dataset. A similar trend was also observed in Bibsonomy. Figure 6(b) illustrates that for any k, semantic loss achieves its lowest value when $ϵ = 1$ . This means a lower privacy level retains more semantic meaning for users or resources. When $ϵ = 1$ , the semantic loss is acceptable. Moreover, semantic loss decreases much faster when ϵ increases from $0.1$ to $0.4$ than when ϵ increases from $0.4$ to 1. This indicates a larger utility cost is needed when preserving a higher privacy level ( $ϵ = 0.1$ ). On the other hand, after the initial rapid decrease, semantic loss remains stable when $ϵ ⩾ 0.5$ . These results indicate that PriTag is capable of retaining the utility while satisfying privacy preserving purposes.

5. Conclusions

Privacy preserving is one of the most important aspects of recommender systems as it protects the sensitive information of users. A tagging recommender system is confronted with more serious privacy violations due to its semantic property [19]. Although differential privacy is considered a promising solution that provides a rigid and provable privacy guarantee for recommender systems, the solution still suffers from two important deficiencies: 1) it fails to retain the relationship among users, resources and tags; 2) it introduces a large volume of noise, which significantly affects the performance.

This paper proposes an effective privacy tagging releasing algorithm for recommendation purposes and makes the following contributions:

We propose a private tagging releasing algorithm to protect users from being re-identified in a tagging dataset by adversaries.

A private clustering operation is designed to reduce the magnitude of noise by shrinking the randomization domain. In addition, two successive private operations, private selection and perturbation, are applied to achieve the privacy purpose.

A better trade-off between privacy and utility is obtained by taking the advantage of the differentially private composition properties. We provide theoretical analysis and experimental results to demonstrate the effectiveness.

These contributions provide a practical way to apply a rigid privacy notion to a tagging recommender system without high utility costs. The experimental results also show the robustness and effectiveness of the proposed PriTag algorithm.

Most notably, and to the best of our knowledge, this is the first study to investigate differential privacy in tagging recommender systems. It has been shown that our algorithm can retain a better utility than the previous method, tag suppression [19]. However, the current evaluation concentrates on only one recommender algorithm, FolkRank; other recommendation techniques, such as topic models and tensor decomposition methods, still require further investigation.

References

[1] Berkovsky, Eytani, Kuflik and Ricci, Enhancing privacy and preserving accuracy of a distributed collaborative filtering, in: Proc. of the 2007 ACM Conference on Recommender Systems, RecSys ’07, ACM, New York, NY, USA, 2007, pp. 9–16.

[2] Blum, Dwork, McSherry and Nissim, Practical privacy: The SuLQ framework, in: Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2005, pp. 128–138.

[3] Blum, Ligett and Roth, A learning theory approach to non-interactive database privacy, in: Proc. of the 40th Annual ACM Symposium on Theory of Computing, STOC ’08, ACM, New York, NY, USA, 2008, pp. 609–618.

[4] J.A. Calandrino, Kilzer, Narayanan, E.W. Felten and Shmatikov, “You might also like:” Privacy risks of collaborative filtering, in: Proc. of the 2011 IEEE Symposium on Security and Privacy, SP ’11, IEEE Computer Society, Washington, DC, USA, 2011, pp. 231–246.

[5] Canny, Collaborative filtering with privacy via factor analysis, in: Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’02, ACM, New York, NY, USA, 2002, pp. 238–245.

[6] Carter and A.A. Ghorbani, Towards a formalization of value-centric trust in agent societies, Web Intelligence and Agent Systems 2(3) (August 2004), 167–183.

[7] Dwork, Differential privacy, in: ICALP ’06: Proc. of the 33rd International Conference on Automata, Languages and Programming, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 1–12.

[8] Dwork, Differential privacy: A survey of results, in: TAMC ’08: Proc. of the 5th International Conference on Theory and Applications of Models of Computation, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 1–19.

[9] Dwork, A firm foundation for private data analysis, Communications of the ACM 54(1) (2011), 86–95.

[10] Dwork, McSherry, Nissim and Smith, Calibrating noise to sensitivity in private data analysis, in: TCC ’06: Proc. of the Third Conference on Theory of Cryptography, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 265–284.

[11] B.C.M. Fung, Wang, Chen and P.S. Yu, Privacy-preserving data publishing: A survey of recent developments, ACM Computing Surveys 42(4) (2010), 1–53.

[12] Halkidi, Batistakis and Vazirgiannis, On clustering validation techniques, Journal of Intelligent Information Systems 17(2–3) (December 2001), 107–145.

[13] Jäschke, Marinho, Hotho, Schmidt-Thieme and Stumme, Tag recommendations in folksonomies, in: Proc. of the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD 2007, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 506–514.

[14] Krestel, Fankhauser and Nejdl, Latent Dirichlet allocation for tag recommendation, in: Proc. of the Third ACM Conference on Recommender Systems, RecSys ’09, ACM, New York, NY, USA, 2009, pp. 61–68.

[15] McSherry and Talwar, Mechanism design via differential privacy, in: Proc. of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS ’07, IEEE Computer Society, Washington, DC, USA, 2007, pp. 94–103.

[16] Narayanan and Shmatikov, How to break anonymity of the Netflix Prize dataset, CoRR (2006), arXiv:cs/0610105.

[17] Narayanan and Shmatikov, Robust de-anonymization of large sparse datasets, in: Proc. of the 2008 IEEE Symposium on Security and Privacy, SP ’08, IEEE Computer Society, Washington, DC, USA, 2008, pp. 111–125.

[18] Parameswaran and D.M. Blough, Privacy preserving collaborative filtering using data obfuscation, in: Granular Computing, 2007. GRC 2007. IEEE International Conference on Granular Computing, Nov. 2007, p. 380.

[19] Parra-Arnau, Perego, Ferrari, Forne and Rebollo-Monedero, Privacy-preserving enhanced collaborative tagging, IEEE Transactions on Knowledge and Data Engineering 99(1) (2014), 180–193.

[20] Parra-Arnau, Rebollo-Monedero and Forne, Measuring the privacy of user profiles in personalized information systems, Future Generation Computer Systems 33 (2014), 53–63.

[21] Polat and Du, Privacy-preserving collaborative filtering using randomized perturbation techniques, in: Third IEEE International Conference on Data Mining, 2003. ICDM 2003, Nov. 2003, pp. 625–628.

[22] Polat and Du, Achieving private recommendations using randomized response techniques, in: Proc. of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD ’06, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 637–646.

[23] Ramakrishnan, B.J. Keller, B.J. Mirza, A.Y. Grama and Karypis, Privacy risks in recommender systems, IEEE Internet Computing 5(6) (November 2001), 54–62.

[24] Shepitsen, Gemmell, Mobasher and Burke, Personalized recommendation in social tagging systems using hierarchical clustering, in: Proc. of the 2008 ACM Conference on Recommender Systems, RecSys ’08, ACM, New York, NY, USA, 2008, pp. 259–266.

[25] Sigurbjörnsson and van Zwol, Flickr tag recommendation based on collective knowledge, in: Proc. of the 17th International Conference on World Wide Web, WWW ’08, ACM, New York, NY, USA, 2008, pp. 327–336.

[26] Symeonidis, Nanopoulos and Manolopoulos, Tag recommendations based on tensor dimensionality reduction, in: Proc. of the 2008 ACM Conference on Recommender Systems, RecSys ’08, ACM, New York, NY, USA, 2008, pp. 43–50.

[27] Tian, Hu, Li, Liu and Zhang, Defending against distributed denial-of-service attacks with an auction-based method, Web Intelligence and Agent Systems 4(3) (July 2006), 341–351.

[28] Zhan, C.-L. Hsieh, I.-C. Wang, Tsan-sheng Hsu, C.-J. Liau and D.-W. Wang, Privacy-preserving collaborative recommender systems, IEEE Transactions on Systems, Man and Cybernetics. Part C, Applications and Reviews 40(4) (July 2010), 472–476.