Differential privacy and SPARQL

Abstract

Differential privacy is a framework that provides formal tools to develop algorithms to access databases and answer statistical queries with quantifiable accuracy and privacy guarantees. The notions of differential privacy are defined independently of the data model and the query language at steak. Most differential privacy results have been obtained on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. So far, the data model used by the framework research has typically been the relational model and the query language SQL. However, effective realizations of differential privacy for SQL queries that required joins had been limited. This has imposed severe restrictions on applying differential privacy in RDF knowledge graphs and SPARQL queries. By the simple nature of RDF data, most useful queries accessing RDF graphs will require intensive use of joins. Recently, new differential privacy techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new results carry over to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees differential privacy, if the RDF graph is accompanied with semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large graph databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of differential privacy for SPARQL and RDF.

Keywords

Differential privacy SPARQL

1. Introduction

As many social norms, privacy, or the right to privacy, is an evolving term that is invoked in many contexts as eloquently described by Louis Menand in [25]: “Privacy is associated with liberty, but it is also associated with privilege (private roads and private sales), with confidentiality (private conversations), with nonconformity and dissent, with shame and embarrassment, with the deviant and the taboo (...), and with subterfuge and concealment”.

In order to get some formal underpinning of privacy in the context of electronic data collection and publishing, Li et al [22] have looked at privacy breaches, studied their general characteristics, and concluded, that electronic privacy breaches always ended with giving an attacker the ability to identify (using public data) whether an individual is member of a set or class that had been intended to be anonymous (e.g., the class of individuals with high cholesterol). Hence, they define preservation of privacy as avoiding privacy breaches in the sense of not disclosing set memberships of individuals.

For the public good, such as the advance of public health, or the fair distribution of government resources, such data is frequently made public. There are also situations in which governmental and commercial organizations collect and analyze data to improve or provide new services. Especially in such cases, society expects a certain level of privacy on the way these organizations use the data. Publishing data with perfect privacy means that no assumption can be made about the prior knowledge an attacker may have about the supposedly anonymous set. Under this assumption, there would be little utility in published data if perfect privacy is expected [11,22]. Therefore, the research community has looked at weaker definitions of “acceptable” privacy. Useful concepts like k-anonymity [33], l-diversity [23] and t-closeness [21] were developed but they were shown to have weak privacy guarantees [10].

In spite of its limitations [12], but because of its formal properties, a privacy notion that has gained a lot of acceptance is differential privacy. We will present precise definitions later in the paper, but informally, differential privacy tries to hide the identity of individuals that are members of a particular class, while still providing quantifiable utility guarantees to the data published about the class. The basic principle is simple. Given a universe $D$ of all possible datasets and a query $f : D \to R$ that can be applied to a dataset D in $D$ and results in a value of an abstract domain $R$ , f is said to be differentially private if it yields indistinguishable results when applied to similar datasets. Differential privacy uses randomized algorithms to answer queries, typically by adding noise to the true query results. This noise is calibrated according to query sensitivity – how much the query result varies between similar datasets –, turning the task of query sensitivity computation essential for the endeavor of differential privacy. In practice, calculating the exact sensitivity of a query is not trivial and approximations are used instead [3].

Even though the notion of differential privacy is in principle independent of the data model and query language at steak, so far most practical, automated implementations over well-established languages have been in the context of relational databases, over SQL, and have been restricted to aggregation queries or grouping. Aggregations are queries such as counting, finding maximum or average values over a certain data subset; grouping is the creation of histograms based on aggregations.

Furthermore, to allow for reasonable approximations of sensitivity, the support of these implementations for queries with joins has been rather limited [24]. It was only in 2018, when Johnson et al. [19] introduced a new approach to approximate sensitivity that can be applied to a wider class of SQL joins, with reasonable results.

In the past decades, graph data models have enjoyed a growing adoption in comparison to the more traditional relational model. One such notable example is the RDF data standard, queried over by the SPARQL language, which have become extremely popular, in particular, for their role in the development of the Semantic Web. By the simple nature of RDF, it can be stored using binary relations [6] and most interesting queries will require operations equivalent to joins. This raises the question whether Johnson et al.’s approach [19] can also be applied to RDF and SPARQL.

In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees differential privacy. This result has been made possible by introducing the notion of a differential privacy schema that allows redefining Johnson et al.’s sensitivity approximation of SQL queries in the appropriate terms for answering SPARQL queries. A differential privacy schema groups sets of RDF tuples into sub-graphs that can be then used as single units for privacy protection. Examples show that this type of schema naturally arises from the semantics of the data stored in the tuples, and it should not be difficult for a database administrator to define.

We demonstrate the applicability of our approach by implementing a differential privacy query engine that uses the approximation to answer counting and grouping SPARQL queries, and evaluate the implementation running simulations using the Wikidata knowledge base [34].

The rest of the paper is organized as follows: in Section 2 we introduce the readers to the fundamental concepts of differential privacy. In Section 3, we present the core concepts of SPARQL used within the paper, including the notion of differential privacy schema. In Section 5 we prove the correctness of our proposed approximation to sensitivity and in Section 6 we evaluate the effectiveness of our proposed approximation in an implementation that we apply to both synthetic and real world datasets and queries. We present related work in Section 7, and we conclude the paper in Section 8.

2. Preliminaries about differential privacy

We now describe the framework of differential privacy, the problem that arises when applying differential privacy to SQL queries with general joins and how it has been addressed by the scientific community.

2.1. Definition

Intuitively, a randomized algorithm [26] is differentially private if it behaves similarly on similar input datasets. To formalize this intuition, the framework of differential privacy relies on a notion of distance between datasets. We model datasets as a multiset of tuples and we say that two datasets are k-far apart if one can be obtained from the other by changing the value of k tuples. Formally, this corresponds to (a mild generalization of) the notion of distance used for defining bounded differential privacy [20], which quantifies (only) over pairs of datasets of the same size. In the remainder, we let $D$ be the set of all possible datasets, and use $d (D, D^{'}) = k$ to denote that $D, D^{'} \in D$ are k-far apart. In particular, two datasets $D, D^{'} \in D$ that are 1-far apart are called neighbors, written $D \sim D^{'}$ .

Definition 1.
Let $ϵ, δ ⩾ 0$ . A randomized algorithm $A$ is $(ϵ, δ)$ -differentially private if for every pair of neighbor datasets $D, D^{'} \in D$ and every set $S \subseteq range (A)$ , $\begin{matrix} \Pr [A (D) \in S] ⩽ e^{ϵ} \Pr [A (D^{'}) \in S] + δ . \end{matrix}$

This inequality establishes a quantitative closeness condition between $\Pr [A (D) \in S]$ and $\Pr [A (D^{'}) \in S]$ , the probabilities that on inputs D and $D^{'}$ , the outcome of $A$ lies within S. The smaller the ϵ and δ, the closer these two probabilities are, and therefore, the less likely that an adversary can tell D and $D^{'}$ apart. In other words, parameters ϵ and δ quantify the privacy guarantees of the randomized algorithm.

Multi-table datasets Our notions of dataset and distance between datasets can be extended to collections of datasets as follows: A dataset formed by multiple sets $D_{1}, \dots, D_{n}$ will contain (tagged) data points belonging to $D_{1}, \dots, D_{n}$ and the distance between two such datasets D, $D^{'}$ will reduce to $d (D, D^{'}) = \sum_{i = 1}^{n} d (π_{i} (D), π_{i} (D^{'}))$ , $π_{i} (D)$ representing the subset of data points from D belonging to $D_{i}$ . In the context of relational databases, the $D_{i}$ ’s correspond to relational tables.
2.2. Realization via global sensibility

Establishing differential privacy for numeric queries of limited sensitivity is relatively simple. The Laplacian mechanism [9] says that we can obtain a differentially private version of query $f : D \to R$ by simply perturbing its output: On input D, we return $f (D)$ plus some noise sampled from a Laplacian distribution. The noise must be calibrated according to the global sensitivity ${GS}_{f}$ of f, which measures its maximum variation upon neighbor datasets; formally, ${GS}_{f} = {max}_{D, D^{'} | D \sim D^{'}} | f (D) - f (D^{'}) |$ .

Theorem 1.
Given a numeric query $f : D \to R$ of global sensitivity ${GS}_{f}$ , the randomized algorithm $\begin{matrix} A (D) = f (D) + Lap (\frac{{GS}_{f}}{ϵ}) \end{matrix}$ is an $(ϵ, 0)$ -differentially private version of f.

Here, $Lap (λ)$ represents a sample from the Laplacian distribution with parameter λ, a symmetric distribution with probability density function $pdf (x) = \frac{1}{2 λ} e^{- | x | / λ}$ , mean 0 and variance $2 λ^{2}$ . Parameter λ measures how concentrated the mass of the distribution is around its mean 0: The smaller the λ, the less noise we add to the true query result and therefore, the more faithful the mechanism becomes. In the realm of differential privacy, this “faithfulness” property is referred to as the mechanism utility [10]. An important point here is that utility and privacy are always conflicting requirements: adding more noise results in more private and – at the same time – less useful mechanisms.

In practice, when implementing the Laplacian mechanism we approximate the global sensibility of queries by exploiting their structures: Numeric queries are typically constructed by first transforming the original dataset using some standard transformers and by returning as final result some aggregation on the obtained dataset. For example, we join two tables, filter the result (dataset transformations) and return the count (aggregation) of the obtained table. The global sensitivity of such a query can be estimated from the so-called stability properties of the involved transformers. Intuitively, a stable transformer can increase the distance between nearby datasets at most by a multiplicative factor. Formally, we call a dataset transformer $T : D \to D$ α-globally-stable if $d (T (D), T (D^{'})) ⩽ α d (D, D^{'})$ for every $D, D^{'} \in D$ . Transformers with bounded global stability yield bounded global sensitivities: ${GS}_{f \circ T} ⩽ α {GS}_{f}$ whenever T is α-globally-stable.

Conversely, the use of transformers with unbounded stability might result in queries of unbounded sensitivity. A prominent example of a transformer exhibiting this problem is join. Assume we join two tables, say $t_{1}$ and $t_{2}$ , by matching a pair of their attributes. A modification in a mere tuple from $t_{1}$ may result in the addition and/or deletion of an unpredictable number of tuples in the result of the join, leaving elementary queries such as counting the number of tuples in a join already out of the scope of the Laplacian mechanism. The applicability of differential privacy approaches based on query global sensitivity is thus rather limited.
2.3. Realization via local sensibility

To handle queries that involve transformers of unbounded stability, such as joins, we require the use of more advanced techniques. The Laplacian mechanism calibrates noise according to the query, overlooking the fact that queries are done on concrete datasets, hence the employed noise could be potentially customized for each dataset. Nissim et al. show how to exploit this idea of instance-based noise [27]. Their approach relies on the notion of local sensitivity.

Definition 2.
The local sensitivity ${LS}_{f} (D)$ of a numeric query $f : D \to R$ on dataset $D \in D$ is defined as $\begin{matrix} {LS}_{f} (D) = max_{D^{'} | d (D, D^{'}) = 1} | f (D) - f (D^{'}) | . \end{matrix}$ The local sensitivity ${LS}_{f}^{(k)} (D)$ at distance $k \in N_{0}$ of D is defined as $\begin{matrix} {LS}_{f}^{(k)} (D) = max_{D^{'} | d (D, D^{'}) = k} {LS}_{f} (D^{'}) . \end{matrix}$

Observe that ${LS}_{f}^{(0)} (D)$ coincides with ${LS}_{f} (D)$ . Similar to global stability, a dataset transformer $T : D \to D$ , is α-local-stable for a dataset D if $d (T (D), T (D^{'})) ⩽ α d (D, D^{'})$ for every $D^{'} \in D$ . And as with global sensitivity, ${LS}_{f \circ T} (D) ⩽ α {LS}_{f} (D)$ whenever T is α-local-stable for D.

For answering a query f on dataset D, we cannot simply use noise calibrated according to ${LS}_{f} (D)$ because the noise level itself may reveal information about D [19]. Instead, we should use an approximation of ${LS}_{f}$ that is insensitive to small variations of its input dataset. This is captured by the notion of smooth upper bound.
Definition 3.
A function $U_{f} : D \to R_{⩾ 0}$ is called a β-smooth upper bound of the local sensitivity ${LS}_{f} : D \to R_{⩾ 0}$ of query $f : D \to R$ if it satisfies the following requirements:
$U_{f} (D) ⩾ {LS}_{f} (D)$ for all dataset D, and

$U_{f} (D) ⩽ e^{β} U_{f} (D^{'})$ for all neighbor datasets D and $D^{'}$ .

We can readily achieve differential privacy by adding noise calibrated according to a smooth upper bound of the query local sensitivity [28, Corollary 2.4].
Theorem 2.
Let $f : D \to R$ be a numeric query and let $U_{f} : D \to R_{⩾ 0}$ be a β-smooth upper bound of its local sensitivity ${LS}_{f}$ . Moreover, let $δ \in (0, 1)$ and let $β ⩽ \frac{ϵ}{2 ln (2 / δ)}$ . Then, the randomized algorithm $\begin{matrix} A (D) = f (D) + Lap (\frac{2 U_{f} (D)}{ϵ}) \end{matrix}$ is an $(ϵ, δ)$ -differentially private version of f.

The benefits of this mechanism are twofold. On the one hand, it allows handling queries that fail to have a bounded global sensitivity, but do have a bounded local sensitivity. These include e.g. the query we considered earlier, consisting of the count of the join between two tables. On the other hand, it does not require computing the local sensitivity of the queries itself, but only a smooth upper bound thereof. This is key for its practical adoption since calculating the local sensitivity of queries is computationally prohibitive: As observed by Johnson et al. [19], “it requires running the query on every possible neighbor of the original dataset”.

To apply the mechanism from Theorem 2, we must provide a smooth upper bound for the local sensitivity of queries. We can construct the smooth upper bound using approximations for the local sensitivity at fixed distances.
Lemma 1.
Let $f : D \to R$ be a numeric query and assume that $U_{f}^{(k)}$ is a pointwise upper bound of the local sensitivity ${LS}_{f}^{(k)}$ of f at distance k, that is, $\begin{matrix} U_{f}^{(k)} (D) ⩾ {LS}_{f}^{(k)} (D) for all D \in D . \end{matrix}$ Then, $\begin{matrix} U_{f} (D) = max_{0 ⩽ k ⩽ size (D)} e^{- β k} U_{f}^{(k)} (D) \end{matrix}$ is a β-smooth upper bound of the local sensitivity ${LS}_{f} (D)$ of f on D, where $size (D)$ denotes the number of rows (in all the tables) in D.

The goal of Section 5 is to apply the differential privacy mechanism from Theorem 2 to SPARQL counting queries. To do so, we will use Lemma 1 to derive smooth upper bounds of the local sensitivity of queries. In turn, this requires constructing upper bounds for the local sensitivity of queries at fixed distances, for which we will leverage local stability properties of SPARQL dataset transformers.
3. Toward differential privacy over RDF graphs

In this section we examine the semantic information that is necessary considering over RDF graphs, in order to answer counting queries in a differentially private manner. This comprises a data schema and upper bounds on the predicate multiplicities.

3.1. Privacy schema

3.1.1. Motivation

As mentioned earlier, the goal of differential privacy is to protect the (possibly sensible) contribution of each individual within a dataset when publicly releasing aggregate information about the dataset – in our case, the result of counting queries. In the relational model, individuals are typically identified with rows of the database which significantly simplifies all the technical development. For instance, if the database at stake consists of a single table, we consider two instances of the database neighboring, i.e. differing in the contribution of a single individual, if they differ in a single row. On the other hand, if the database consists of multiple tables, we consider two database instances neighboring if they differ in a row of some of the tables (see paragraph in Section 2). The underlying assumption behind this is that each table groups attributes of individuals in a particular entity type, e.g. people, political parties or companies, or part thereof, whose identities must be protected.

Fig. 1.

RDF graph $G$ containing information about three types of entities: people, companies and cites.

To be able to apply differential privacy to a dataset in the form of an RDF graph, we must thus begin by identifying the different types of entities present in the graph, and the set of individuals in each type. Consider, for instance, the RDF graph $G$ in Fig. 1, which will be the running example of our presentation. This graph contains information about three types of entities: people, companies and cites. In particular, it contains information about two people (depicted in ), two companies (depicted in ) and two cities (depicted in ). Said otherwise, there are two individuals of each entity type, adding up to six individuals in all. When querying the graph, we will be interested in protecting the contribution of all these individuals, and when applying differential privacy techniques to this end, we will then consider as a neighbor any other graph that differs in the contribution of either of them.

We refer to the semantic information necessary to identify the individuals in an RDF graph $G$ as a differential privacy schema. More formally, its goal is to partition $G$ as a set ${g_{1}, \dots, g_{n}}$ of sub-graphs, where each $g_{i}$ represents the contribution of an individual, and $G = ⨄_{i} g_{i}$ is the disjoint union of all these sub-graphs. For example, the graph in Fig. 1 is decomposed as the disjoint union of the pair of sub-graphs in blue, the pair of sub-graphs red and the pair of sub-graphs green. Observe that in the relational model, this corresponds to nothing more than understanding a (set of) table(s) as the disjoint union of its (their) rows. Here, our interest is to protect the contribution of the individuals represented by each $g_{i}$ . In Fig. 1, this means, for example, the data related to Alice and to the Walt Disney company.

3.1.2. Formal definition

We briefly review some basic RDF terminology following standard notation used in the literature [14,16], where more details can be found. The RDF language assumes the existence of an infinite set U (of URI references), an infinite set B (of blank nodes), and an infinite set L (of RDF literals). An RDF triple is a term of the form $(v_{1}, v_{2}, v_{3}) \in (U \cup B) \times U \times (U \cup B \cup L)$ . An RDF dataset is a finite set of RDF triples. RDF triples are interpreted as labeled arcs or edges in a directed graph from a vertex $v_{1}$ , called the triple subject, to a vertex $v_{3}$ , called the triple object, and label $v_{2}$ , called the triple predicate. Figure 1 shows an example of an RDF graph $G$ . We denote by $voc (G)$ the finite subset of elements from $(U \cup B \cup L)$ that appear in $G$ . More importantly, because of the nature of the aggregation queries under consideration, we will restrict ourselves to graphs without blank nodes, i.e. graphs where $B = \emptyset$ .1

¹
Calculation of query sensitivity will depend on the domain of blank nodes which can change under different contexts and implementations. This is a topic of future research.

We also require a restricted version of the concept of triple patterns, which, similarly to RDF triples, are terms of the form

(v_{1}, v_{2}, v_{3}) \in (U \cup B \cup V) \times U \times (U \cup B \cup L \cup V)

, with V an infinite set of variables, and basic graph patterns (BGPs) which are finite sets of triple patterns.2

The more general definition of triple patterns allows also for variables in the predicate component of triple patterns.

To hint how we can formally define entities within an RDF graph, observe first the six colored sub-graphs identified in Fig. 1: two green, two blue and two red. All have a star shape [1], consisting of a “center” with outgoing and/or incoming edges, i.e. predicates. Sub-graphs representing individuals of the same entity type are built from the same set of predicates. For example, both (blue) sub-graphs, representing people, are built from predicates $phone$ , $livesIn$ and $member$ . We can characterize entity types through a set of triple patterns (i.e. a BGP) that share a common or “join” vertex. Formally, a join vertex in a BGP is a variable that appears either as a subject or as an object multiple times in the BGP. The type of BGP’s that we need have further restrictions, which are captured by the notion of star BGP below:

Definition 4 (Star BGP).

A BGP is called a star if

both the subject and the object of all its triple patterns are variables,

all triple patterns have different predicates, and

it consists of either

a single triple pattern with no join vertex, i.e. a triple pattern whose subject and object are distinct variables, or

multiple triple patterns with a single join vertex, which appears once and only once in every triple pattern

Example 3.1 (Star BGP).

The three stars employed to identify the different entities in our running example (Fig. 1) are (modulo variable renaming):

Note that in each $S_{i}$ , the vertex denoted by $? c_{i}$ plays the role of the star center, joining all the triple patterns in $S_{i}$ . Furthermore, there are no common predicates across , and . Formally, we define the center of a star as the join vertex if the star contains multiple patterns, or the variable appearing in the subject of the triple pattern in case it consists of a single triple pattern.3

³
The notion of star BGP that we use here is similar to that of star query from [1], except that in a star query the center of the star must always appear as subject.

We say that a set of stars is pairwise predicate-disjoint (or simply pairwise disjoint when no ambiguity arises), if no pair of stars in the collection share a common predicate. They thus define what we call a differential privacy schema:

Definition 5 (Dp-schema).

A differential privacy schema (dp-schema, for short) $P$ is a finite pairwise predicate-disjoint set of stars.

The set is a dp-schema. as well as its sub-set . However, this second schema falls short of describing the whole graph $G$ , as it leaves out the information related to cities. To formally capture this completeness condition, we require the notion of induced sub-graph, whose formal definition is based on (solution) mappings. In SPARQL, a (solution) mapping is a partial mapping from variables to URI references or blank nodes, $μ : V \to U \cup B$ . For a triple pattern $tp$ , $μ (tp)$ denotes the triple obtained after replacing the variables in $tp$ according to μ. If μ is defined for all variables in $tp$ , $μ (tp)$ is an RDF triple. Given a graph $G$ , the solution mappings of a triple pattern $tp$ over $G$ , denoted by ${[[tp]]}_{G}$ , is defined by the set ${μ | μ (tp) \in G}$ . Now, we are ready to define the concept of induced sub-graph:

Definition 6 (Induced sub-graphs).

Let $G$ be an RDF graph and $? x$ the center of a star BGP $S$ . Given some $y \in U$ , let $μ_{y}$ denote the mapping ${? x \to y}$ . The subgraph of $G$ induced by $S$ and y, denoted by $ind {(S)}_{G}^{y}$ , is defined by the set of RDF triples ${μ (μ_{y} (tp)) | μ \in {[[μ_{y} (tp)]]}_{G} \land tp \in S}$ , and the subgraphs of $G$ induced by $S$ , denoted by $S (| G |)$ , is the set of RDF sub-graphs ${ind {(S)}_{G}^{y} | y \in voc (G)}$ .

Example 3.2 (Induced sub-graph).

The star induces two sub-graphs and over $G$ , i.e.. These correspond to the blue sub-graphs in Fig. 1, which are formally defined as:

Likewise, , where

Intuitively, the set of induced sub-graphs $S$ (| $G$ |) can be recovered by evaluating $S$ over $G$ , but assuming that the triple patterns in $S$ are optional. This assumption is already exposed by the example, where the triple pattern $(? s_{2}, member, ? c_{1})$ belongs to , but it is not materialized in . From the privacy point of view, the values of the star center are the unique identifiers of the entities contributing the data in each sub-graph, and their values must be kept confidential.

The notion of induced sub-graphs naturally extends from single stars, to dp-schemas, i.e. sets of stars. Concretely, we let $P (| G |) = ⋃_{S \in P} S (| G |)$ be the set of sub-graphs of $G$ induced by dp-schema $P$ . In our example, .

Now we have all the prerequisites to define the core concept of this section:

Definition 7 (Dp-schema compliance).

We say that an RDF graph $G$ complies with a dp-schema $P$ iff $G$ coincides with the graph induced by $P$ in $G$ , i.e. if $G = ⋃_{g \in P (| G |)} g$ .

Example 3.3 (Dp-schema compliance).

Our running example $G$ complies with dp-schema . In contrast, it does not comply with dp-schema .

In summary, if a graph $G$ complies with a dp-schema $P$ , the schema partitions the graph into a finite set $P (| G |)$ of sub-graphs, which intuitively model the different individuals in the graph. The fact that sub-graphs are disjoint follows from the definition of dp-schema, which requires stars in the schema to be pairwise predicate disjoint, and that all the occurrences of variables in a star are different except for the center of the star. The fact that the set of sub-graphs cover the entire graph follows from the definition of compliance.

3.1.3. Discussion

We now address a few key points about dp-schemas.

No shared attribute between entities At first sight, the requirement that stars in a dp-schema be predicate pairwise disjoint might seem a limitation, as it requires that each attribute belong to a single entity type. For instance, it might seem natural to consider that predicate $employs$ should be part of both the employer and the employee, and it should be thus present in both (which identifies companies) and (which identifies people). However, for the sake of protection it makes no difference to which star it belongs, since our application of differential privacy will protect the contribution of all individuals, regardless of its type.

Existence of dp-schemas RDF graphs always admit compliant dp-schemas. In particular, every graph complies with a trivial dp-schema comprising the union of all singleton BGP’s of the form ${(? x, p, ? y)}$ , where p ranges over the set of predicates appearing in $G$ . Intuitively, this schema indicates that each RDF triple is the contribution of a different individual. Even though this is a valid dp-schema, it will yield very weak privacy guarantees. In general, database administrators should aim to provide dp-schemes with a maximal number of triple patterns per star since, as we will see later, it will allow better approximations of queries’ local sensitivity, and thus, better privacy guarantees.

Compliance verification Checking whether an RDF graph complies with a given schema $P$ is algorithmically straightforward as it amounts to verifying that all predicates in the graph appear also in (some star of) the schema (which, using a hashing algorithm, can be done in $O (n)$ steps, n being the number of predicates in the graph).

Dp-schema provision For practical purposes, we assume that the database administrator of the RDF graph at stake is responsible for designing the dp-schema the graph shall comply with, and for ensuring the compliance as the graph evolves. In this latter regard, observe that removing an RDF triple from the graph always preserves the dp-schema compliance, and adding a triple also preserves compliance provided the predicate in the triple already appears in the dp-schema. We believe this is a natural assumption, as in the relational data model this would correspond to changing the schema of the database by adding a new attribute to a relation if the new predicate is incorporated into an existing star of the dp-schema or creating a new table if the predicate is added to the dp-schema as a new star.

3.2. Predicate multiplicity

Automatic approaches to answer dataset queries in a differentially private manner are typically obtained by adding noise to the query results, calibrated according to their sensitivity. Thus, a prerequisite to apply differential privacy to counting queries over RDF graphs is that they have bounded sensitivity. Unfortunately, this does not occur in the general case.

To see this, consider graph $G$ of our running example and query “How many phone numbers are currently in use?”. If we evaluate the query over $G$ , the answer is 3. Now assume we consider a neighboring graph $G^{'}$ , where Bob’s contribution (i.e. sub-graph ) is replaced by somebody else’s contribution. The query answer over this neighboring graph can certainly be any integer $n ⩾ 1$ , since a priori we do not know how many phone numbers this new individual might have. Therefore, the sensitivity of the query becomes unbounded.

This problem arises because of the presence of predicates that are not one-to-one. To recover bounded sensitivities, we have to restrict ourselves to predicates that have bounded multiplicity. For instance, if the administrator of graph $G$ requires that individuals declare at most 5 phone numbers, then the above query will have a local sensitivity of at most 5 (recall that if $n_{1}, n_{2} ⩽ α$ , then $| n_{1} - n_{2} | ⩽ α$ ). This approach was already taken by other authors [2,3] to bound the sensitivity of counting queries (in the presence of joins), and is the price one has to pay to apply differential privacy over RDF graphs.

On the formal level, we associate such bounds to triple patterns rather than to predicates. This is because in the presence of compliant dp-schemas, predicates are identified with triple patterns (every predicate within a dp-schema occurs in a single triple pattern, in a single star).

Definition 8 (Triple-pattern multiplicity).

Let $G$ be an RDF graph that complies with a dp-schema $P$ . A multiplicity bound κ associates to each triple pattern $tp$ in a star $S$ of $P$ an integer $κ (tp)$ that upper-bounds the number of solution mappings $μ \in {[[tp]]}_{G}$ with the same image for the center of $S$ .

Many predicates (or equivalently, triple patterns) would have a natural multiplicity bound of 1. For instance, a city has a unique $area$ and a unique number of $dailyRobberies$ . Likewise, we can assume that a person $livesIn$ a single city (at least for formal purposes). On the other hand, if a predicate does not admit a natural bound for its multiplicity, think e.g. of predicate $friend$ , we can either a) choose an upper bound that covers most of the cases in practice or b) establish an upper bound in accordance with the size of the graph that the administrator is willing to support.

Henceforth, in the remainder we assume the system administrator provides a dp-scheme $P$ and that every graph $G$ is compliant with the dp-scheme $P$ . Furthermore, we assume that the administrator establishes an upper bound $κ (tp)$ for each triple pattern $tp$ in $P$ . Hence, the graph space that we consider for the purposes of differential privacy will be that of graphs that comply with both $P$ and κ.

For bounding the local sensitivity of queries in Section 5, it will suffices a coarser notion of multiplicity, at the level of stars rather than triple patterns. The required generalization is straightforward:

Definition 9 (Star multiplicity).

Let $G$ be an RDF graph that complies with a dp-schema $P$ and has a multiplicity bound κ. We call the multiplicity of a star $S \in P$ , by notation convenience written $κ (S)$ , to the product of the multiplicity bound of the triple patterns in $S$ , i.e. $κ (S) = \prod_{tp \in S} κ (tp)$ .

Example 3.4 (Star multiplicity).

Assume that a graph administrator adopts dp-schema and requires that each individual $liveIn$ (at most) a single city, declare at most five phone numbers and be member of at most 3 secret societies. Then the star will have multiplicity .

4. Queries

In this section we describe the subset of queries over RDF graphs for which we provide differential privacy, and show how dp-schemes enable a decomposition result for the evaluation of such queries.

4.1. Supported queries

We develop differential privacy for counting queries over the SPARQL fragment of basic graph patterns with filter expressions, also known as constrained basic graph pattern (CBGP) [1]. In this fragment, a query is denoted by a pair $\begin{matrix} \bar{B} = ⟨ B, F ⟩, \end{matrix}$ where B is a finite set of triple patterns, i.e. a BGP, and $F = {f_{1}, \dots, f_{n}}$ is a finite (possibly empty) set of filter expressions. $\bar{B}$ represents the SPARQL graph pattern $\begin{matrix} P = ((\dots (B FILTER f_{1}) \dots) FILTER f_{n}), \end{matrix}$ and its meaning over an RDF graph $G$ , denote by ${[[\bar{B}]]}_{G}$ , is the multiset of solution mappings ${[[P]]}_{G}$ as defined by the standard semantics of SPARQL queries [16].

For simplicity, we consider only CBGPs that are semantically valid. We also assume that in a graph $G$ that complies with a dp-schema $P$ , all predicates appearing in triple patterns of a CBGP also appear in $P$ .

Example 4.1.
Take RDF graph $G$ , which complies with dp-schema . Now assume we want to know how many people have a coworker in a company with headquarters in a city with over 20 daily robberies? The query can be cast in terms of the CBGP $\bar{B} = ⟨ B, F ⟩$ , where $\begin{array}{c} \begin{aligned} B = & {(? x, employs, ? p_{1}), (? x, employs, ? p_{2}), (? p_{2}, livesIn, ? c), (? c, dailyRobberies, ? n)} \end{aligned} \\ F = {? n ⩾ 20} \end{array}$

Triple patterns in an RDF graph compliant with a dp-schema $P$ naturally inherit the notion of center from the star they “belong to”. Specifically, for a star $S \in P$ and a triple pattern $tp = (s, p, o)$ such that $(X, p, Y)$ belongs to $S$ , for some variables X, Y, we let $center (tp, P) = s$ if X is the center of $S$ ; otherwise $center (tp, P) = o$ . For instance, in the above example we have $\begin{array}{c} center ((? x, employs, ? p_{1}), P) = ? x \\ center ((? x, employs, ? p_{2}), P) = ? x, \end{array}$ since star which contains the triple pattern $\begin{matrix} (? c_{2}, employs, ? o_{3}) \end{matrix}$ has center $? c_{2}$ . Note that center is a well-defined function because the predicate of any triple pattern can only appear in single star of the dp-schema. In the remainder, we write $center (tp)$ for $center (tp, P)$ when $P$ is understood from the context (e.g. when referring to the dp-schema established by the administrator of the RDF graph at stake).

Finally, a user query is a CBGP $\bar{B}$ wrapped by either of the two following aggregation operations:
${COUNT}^{? x} (\bar{B})$ , whose semantics ${[[COUN T^{? x} (\bar{B})]]}_{G}$ is defined as the cardinality of the multiset ${[[\bar{B}]]}_{G}$ .

${COUNT}^{DISTINCT ? x} (\bar{B})$ , whose semantics ${[[COUN T^{DISTINCT ? x} (\bar{B})]]}_{G}$ is defined as the cardinality of the set ${μ (? x) | μ \in {[[\bar{B}]]}_{G}}$ .

${COUNT}_{? x_{1} \dots ? x_{n}}^{? x} (\bar{B})$ , where $? x, ? x_{1}, \dots, ? x_{n}$ are variables appearing in $\bar{B}$ , and whose semantics ${[[{COUNT}_{? x_{1}, \dots, ? x_{n}}^{? x} (\bar{B})]]}_{G}$ is defined by grouping the solution mappings from ${[[\bar{B}]]}_{G}$ according to (the values they assign to) variables $? x_{1} \dots ? x_{n}$ and returning the number of mappings within each of the resulting groups. Loosely speaking, this corresponds to a histogram over tuples grouped by keys created by the different combinations of the values assigned to variables $? x_{1} \dots ? x_{n}$ by the solution mappings from ${[[\bar{B}]]}_{G}$ .
Example 4.2.
The query from the previous example can be expressed as ${COUNT}^{DISTINCT ? p_{1}} (\bar{B})$ using the previous CBGP.

4.2. Evaluation decomposition

Continuing with the previous example, assume we want to evaluate B (from Example 4.1) over $G$ (from Fig. 1), that is, to compute ${[[B]]}_{G}$ (observe that if we are interested in obtaining ${[[⟨ B, F ⟩]]}_{G}$ instead, we simply add a $FILTER$ operation on top of the evaluation of ${[[B]]}_{G}$ ). We can do this in a compositional fashion, leveraging the partition that dp-schema induces on $G$ . Concretely, we can split B as $\begin{aligned} (1) & \begin{aligned} B_{1} = {(? x, employs, ? p_{1})} \\ B_{2} = {(? x, employs, ? p_{2})} \\ B_{3} = {(? p_{2}, livesIn, ? c)} \\ B_{4} = {(? c, dailyRobberies, ? n)} \end{aligned} \end{aligned}$

Hence, for any query $\bar{B} = ⟨ B, F ⟩$ , we can formally define a split $B_{\div P}$ of B from a dp-schema $P$ , as follows: $B_{\div P} = {B_{1}, \dots, B_{n}}$ iff the following two conditions hold

every $B_{i} \in B_{\div P}$ i) is a maximal subset of B for which there exists a star $S \in P$ such that $pred (B_{i}) \subseteq p r e d (S)$ , and ii) has no predicate repetitions;

for any two triple patterns $tp, {tp}^{'} \in B_{i}$ , $center (tp) = center ({tp}^{'})$ .

Because the stars in

P

are predicate disjoint and the

B_{i}

’s in

B_{\div P}

are maximal, the split

B_{\div P}

is unique. In the context of Condition 1 above, we call

S

the covering star of

B_{i}

. Moreover, we call a BGP B elementary if

| B_{\div P} | = 1

, and, by construction, all members of a split will be elementary.

Example 4.3.
For the CBGP from Example 4.1, B is split into four elementary BGPs by dp-schema , i.e. $B_{\div P} = {B_{1}, B_{2}, B_{3}, B_{4}}$ as defined in Equation (1) above. Note that $B_{1}$ and $B_{2}$ have both the same covering star , but they are consireded different elementary BGPs because they share predicate $employs$ . The covering stars of $B_{3}$ and $B_{4}$ are and , respectively.

If B were extended, e.g. , with triple pattern $\begin{matrix} tp = (Skull and Bones, member, ? p_{1}), \end{matrix}$ the splitting $B_{\div P}$ would contain a fifth, member $B_{5} = {tp}$ . In this case, $B_{3}$ and $B_{5}$ share the same covering star, , but remain different members of $B_{\div P}$ because their triple patterns have different centers ( $? p_{2}$ is the center of triple patterns in $B_{3}$ and $? p_{1}$ the center of the triple pattern in $B_{5}$ ). Alternatively, if B were extended, e.g., with triple pattern $\begin{matrix} {tp}^{'} = (? x, headquarter, Burbank), \end{matrix}$ the number of elementary BGPs does not change but $B_{1}$ and $B_{2}$ should be both augmented with ${tp}^{'}$ due to the maximality condition of each $B_{i}$ . (In this case, $B_{1}$ and $B_{2}$ remain being covered by star .)

Note also that the elementary BGP where a triple pattern belongs to is determined by the center of the triple pattern, all triple patterns in the same split must share the same variable or RDF term as their center. The interest in $B_{\div P} = {B_{1}, \dots, B_{n}}$ resides in that it lets us isolate the fragment of the graph necessary to answer each $B_{i}$ . Assume we denote by $G_{S_{i}}$ the subgraph induced by star $S_{i}$ , i.e. $G_{S_{i}} = ⋃_{g \in S_{i} (| G |)} g$ .
Lemma 2.
If $S_{i}$ is the covering star of $B_{i}$ then $\begin{matrix} {[[B_{i}]]}_{G} = {[[B_{i}]]}_{G_{S_{i}}} \end{matrix}$

Then, following the terminology defined in [29] for joins between multisets of solution mappings, we can extend the lemma to B as follows:
Lemma 3.
$\begin{array}{c} {[[B]]}_{G} = {[[B_{1}]]}_{G_{S_{1}}} ⋈ ({[[B_{2}]]}_{G_{S_{2}}} ⋈ \dots ⋈ ({[[B_{n - 1}]]}_{G_{S_{n - 1}}} ⋈ {[[B_{n}]]}_{G_{S_{n}}})) \end{array}$

We have already observed that $\bar{B}$ is such that ${[[B]]}_{G}$ can be evaluated using only equi-joins. Therefore, there must exist an ordering of the elements in $B_{\div P}$ such that ${[[B_{i}]]}_{G_{S_{i}}} ⋈ {[[B_{i + 1}]]}_{G_{S_{i + 1}}}$ can also be done with equi-joins. In other words, an ordering where $| var (B_{i}) \cap var (B_{i + 1}) | = 1$ for all $1 ⩽ i < n$ . We call this order a normal ordering of $B_{\div P}$ , and without loss of generality denote by $? x_{i}$ the variable in the equi-join ${[[B_{i}]]}_{G_{S_{i}}} ⋈ {[[B_{i + 1}]]}_{G_{S_{i + 1}}}$ . For convenience, in the remainder we assume that the indexing used for $B_{\div P}$ follows a normal order. Note that this is already the case for the splitting from Example 4.3 ( $B_{1}$ and $B_{2}$ share variable $p_{2}$ , and $B_{2}$ and $B_{3}$ share variable $c$ ).

We are now in a position to establish differential privacy for SPARQL count and histogram queries.
5. Towards differential privacy for SPARQL

In this section we develop all the prerequisites to extend Lemma 1 from the relational model to the graph model, in terms of the $SPARQL$ queries over RDF graphs described in the previous section.

5.1. Preliminary notions

We begin defining the notion of size and distance between RDF graphs. These are straightforward adaptations of the relational case, where the induced sub-graphs play the role of table rows. Concretely, the size of a graph refers to the number of individuals present in it. Formally, given an RDF graph $G$ that complies with a dp-schema $P = {B_{i}}_{i \in I}$ , we define the size of $G$ w.r.t. $P$ as $size {(G)}_{P} = \sum_{i \in I} | B_{i} (| G |) |$ .

Moreover, we say that two graphs are k far apart if one can be obtained from the other by replacing k of its induced sub-graphs. Formally, given a pair of RDF graphs $G_{1}$ , $G_{2}$ that comply with a dp-schema $P = {B_{i}}_{i \in I}$ and such that $size {(G_{1})}_{P} = size {(G_{2})}_{P}$ , their distance is defined as the size of their difference, i.e. $d {(G_{1}, G_{2})}_{P} = size {(G_{1} ∖ G_{2})}_{P}$ . Note that in the general case where the sizes of $G_{1}$ and $G_{2}$ need not coincide, their distance is defined as the size of the their symmetrical difference $(G_{1} ∖ G_{2}) \cup (G_{2} ∖ G_{1})$ , but when the sizes coincide, this reduces to the size of either their differences, making distance commutative as expected.

Finally, this notion of distance between RDF graphs readily induces a notion of local sensitivity (at distance k) ${LS}_{Q} (G)$ ( ${LS}_{Q}^{(k)} (G)$ ) of $SPARQL$ query Q over RDF graph $G$ , as given by Definition 2.

In order not to clutter the presentation, we usually omit the underlying dp-schema when referring to the size of an RDF graph, the distance between a pair of RDF graphs, and the local sensitivity of a $SPARQL$ query if the dp-schema is understood from the context.

5.2. Elastic sensitivity

Our next step is, given a user query Q and an RDF graph $G$ that complies with a dp-schema $P$ , to compute an upper bound of the local sensitivity ${LS}_{Q}^{(k)} (G)$ (denoted by $U_{Q}^{(k)} (G)$ in Lemma 1).

To this end, observe that the naive approach of evaluating the query on every neighbor (at distance k) of $G$ is not a feasible solution, since the number of neighbors can be extremely large. To address this problem in the relational setting, Johnson et al. [19] has introduced the notion of elastic sensitivity, which leverages (maximum) frequency values of the join keys (that can be precomputed or statistically estimated) to provide more efficient upper bounds for the local sensitivity of queries with joins.

In the remainder of the section we adapt Johnson et al.’s approach to the case of RDF graphs and $SPARQL$ . Intuitively, our notion of elastic sensitivity of a $SPARQL$ query Q at distance k of a concrete graph $G$ (that complies with the dp-schema $P$ ) regards the evaluation of Q as the composition of successive transformations applied to $G$ , and is defined in terms of the stability properties of such transformations.

In our case, these transformations are given by the CBGP of user queries, more concretely, by their BGP part. We thus introduce the auxiliary notion of BGP elastic stability. A key property of this notion is that it allows bounding the local sensitivity of counting queries: Given a BGP B, its elastic stability at distance k with respect to a graph $G$ that complies with a dp-schema $P$ , bounds the local stability for any graph $G^{'}$ that also complies with $P$ and is at distance k of $G$ . Hence, it bounds the local sensitivity at distance k of ${COUNT}^{? x} (\bar{B})$ (for any $\bar{B} = ⟨ B, F ⟩$ ) over $G^{'}$ .

The formal definition of elastic stability relies on the frequency of most popular values. More precisely, if $? x$ is a variable occurring in an elementary BGP B, we use $mpv (? x, B, G)$ to denote the frequency of the most popular value to which $? x$ is mapped to, when evaluating B over $G$ . We can use $SPARQL$ itself to determine $mpv (? x, B, G)$ through the query $\begin{aligned} SELECT (COUNT (? x) a s ? c) WHERE B \\ GROUP_BY ? x ORDER_BY ? c DESC LIMIT 1 \end{aligned}$ Loosely speaking, this corresponds to first evaluating ${COUNT}^{? x} (B)$ , and then selecting the value with the largest count. This is an upper bound for the frequency of the most popular value yielded by $? x$ within a CBGP query $\bar{B} = ⟨ B, F ⟩$ , regardless of the filter F, since it can only reduce the size of the result. Alternatively, we could also use the query $\begin{aligned} SELECT (COUNT (? x) a s ? c) WHERE tp \\ GROUP_BY ? x ORDER_BY ? c DESC LIMIT 1, \end{aligned}$ where $tp$ is obtained from a triple pattern in B that 1) has $? x$ as one of its non-predicate (i.e. subject or object) components; 2) $? x$ is participating in a join of the full query $\bar{B}$ ; and 3) it has the other non-predicate component replaced by a fresh variable.4

⁴
Observe that there might be multiple such triple patterns in B, all of them yielding valid upper bounds for $mpv (? x, B, G)$ . If we are interested in obtaining tighter privacy guarantees, we should choose the most precise bound.

The counting on the latter query will be greater than (or equal to) the one on the former query and, therefore, a valid (possibly looser, though) upper bound for $mpv (? x, B, G)$ . Nevertheless, the benefit is that since the second query is merely a variable renaming of a ${tp}^{'} \in S$ , the values can be pre-computed for all ${tp}^{'} \in S$ within a dp-schema, and simply retrieved during (differentially private) query evaluation. This is possible because the dp-schema of an RDF graph is defined with a set of predicates that includes all the predicates appearing the graph. In contrast, if the former query is used to determine mpv, the sensitivity is likely to be more accurate (i.e. tighter approximations) resulting in better privacy guarantees (smaller ϵ’s), but pre-computations cannot be done, possibly impacting system performance.

To compute the elastic stability of BGP B at distance k of graph $G$ , written $S_{B}^{(k)} (G)$ , we start by applying Lemma 3 to decompose B as a sequence of elementary BGPs. Once we fix a normal ordering, we have $B_{\div P} = B_{1} ⋈ (B_{2} ⋈ (\dots ⋈ B_{n}) \dots)$ . This decomposition allows estimating the frequency of most popular values of graphs k far apart from $G$ . Formally, the frequency of the most popular values for variable $? x$ in a BGP B for graphs at distance k of $G$ , written ${mpv}^{(k)} (? x, B, G)$ , is defined by induction on the length $| B_{\div P} |$ as follows:

Base case: If $| B_{\div P} | = 1$ , we let $\begin{matrix} {mpv}^{(k)} (? x, B, G) = mpv (? x, B, G) + k \times κ (S), \end{matrix}$ where S is the covering star of $B_{1}$ (recall that, in this base case, $B_{\div P} = {B_{1}}$ ).

Inductive case: If $| B_{\div P} | > 1$ , we let $\begin{aligned} {mpv}^{(k)} (? x, B, G) = & {mpv}^{(k)} (? x_{1}, B_{\div P} ∖ {B_{1}}, G) \times {mpv}^{(k)} (? x, B_{1}, G), \end{aligned}$ where $? x_{1}$ is the common variable shared by $B_{1}$ and $(B_{2} ⋈ (\dots ⋈ B_{n}) \dots)$ , used for their equi-join.

The intuition behind the base case is easy to grasp. For $k = 1$ , we take a subgraph from $P (| G |)$ , induced by a star BFP S in the schema, and replace it with a different one. The maximal difference between the new and the old value on the count of the most popular mapping value of $? x$ is $κ (S)$ . This is an upper bound of all the mappings that can be produced by the instance due to the triple-pattern multiplicity, if the instance removed didn’t have an instance of the value and a new instance of the same value is added. Hence, k changes will at most increment the count by $k \times κ (S)$ . For the inductive case, we need to worry about the most frequent mapping value of $? x_{1}$ in $B_{\div P} ∖ {B_{1}}$ since for every mapping of $? x$ obtained from $B_{1}$ , if this mapping maps $? x_{1}$ in $B_{1}$ to the same value of the most frequent value of $x_{1}$ in $B_{\div P} ∖ {B_{1}}$ , the value of $? x$ will be repeated as many times in the combined mapping of B. Hence, the most frequent value of $? x$ in ${[[B_{1}]]}_{S_{1}}$ can be duplicated, in the worst case, as many as ${mpv}^{(k)} (? x_{1}, B_{\div P} ∖ {B_{1}}, G)$ times in the multiset of solution mappings ${[[B]]}_{G}$ , giving us a safe upper bound for the count. Importantly, observe that because the multiplication operation is commutative, the frequency is not affected by the selected normal ordering.

We are now ready to define the elastic stability of a BGP B at distance k of graph $G$ , denoted by $S_{B}^{(k)} (G)$ . The definition also proceeds by induction on the cardinality of a fixed normal ordering of $B_{\div P} = B_{1} ⋈ (B_{2} ⋈ (\dots ⋈ B_{n}) \dots)$ :

Base case: If $| B_{\div P} | = 1$ , we let $\begin{matrix} S_{B}^{(k)} (G) = κ (S), \end{matrix}$ where S is the covering star of $B_{1}$ .

Inductive case: If $| B_{\div P} | > 1$ , let $B^{'} = B_{\div P} ∖ {B_{1}}$ . We have two cases. If the covering star S of $B_{1}$ is not the covering star of any other $B_{j} \in B^{'}$ , we let $\begin{aligned} (2) & S_{B}^{(k)} (G) = & max {{mpv}^{(k)} (? x_{1}, B_{1}, G) \times S_{B^{'}}^{(k)} (G), {mpv}^{(k)} (? x_{1}, B^{'}, G) \times S_{B_{1}}^{(k)} (G)} \end{aligned}$

If the covering star S of $B_{1}$ is also the covering star of another $B_{j} \in B^{'}$ , we let $\begin{aligned} S_{B}^{(k)} (G) = & {mpv}^{(k)} (? x_{1}, B_{1}, G) \times S_{B^{'}}^{(k)} (G) + {mpv}^{(k)} (? x_{1}, B^{'}, G) \times S_{B_{1}}^{(k)} (G) + S_{B_{1}}^{(k)} (G) \times S_{B^{'}}^{(k)} (G) \end{aligned}$ Loosely speaking, this definition captures the amount of changes that the transformations (i.e. the joins) within the query, add to the final result, when modifying a single element in the induced schema.

As so defined, the local stability bounds the local sensitivity of counting queries:

Lemma 4.

For any CBGP query $\bar{B} = ⟨ B, F ⟩$ , any $k \in N$ and any graph $G$ compliant with dp-schema $P$ : $\begin{matrix} S_{B}^{(k)} (G) ⩾ {LS}_{{COUNT}^{? x} (\bar{B})}^{(k)} (G) . \end{matrix}$

The main intuition behind the proof is that changes made to a graph to get a new graph at distance 1, are limited to a sub-graph $G_{i}$ , that must be covered by a single star pattern $S$ in $P$ . Then the maximum number of RDF tuples that can change to get the graph at distance 1 is limited by the multiplicity of the predicates in S. Therefore, the change in the number of mappings obtained from the new graph of an elementary BGP covered by S is bounded by $κ (S)$ . If these RDF triples contribute in the result mappings of a join vertex, the number of new mappings can increase by as much as the frequency of the most popular result mapping of the joining triple pattern. For example, if $B = {(? v_{0}, p, ? u), (? u, p^{'}, ? v_{1})}$ , and the triple $(s, p, o)$ is part of $G_{1}$ and o happens to be the most popular result mapping for $(? u, p^{'}, ? v_{1})$ , then there will be at most ${mpv}^{(1)} (? u, (? u, p^{'}, ? v_{1}), G)$ new mappings in the result.

Proof.

The proof of this lemma follows the same strategy as the proof in [19, Lemma 2], and is by induction on the length of $B_{\div P}$ .

Case $| B_{\div P} | = 1$ . Let S be the covering star of B. Hence, its elastic stability is $κ (S)$ , a parameter given by the DBA. Thus, we have $\begin{matrix} κ (S) = S_{B}^{(k)} (G) ⩾ {LS}_{{COUNT}^{? x} (\bar{B})}^{(k)} (G) \end{matrix}$ since the local sensitivity at distance k is calculated as the max of the sensitivities of all graphs at distance 1 of all graphs at distance k or less of $G$ , meaning the modification of a single star, which may change by at most $κ (S)$ tuples and the filter in $\bar{B}$ doesn’t affect its local stability.

Case $| B_{\div P} | = n + 1$ : we have a covering star $S_{1}$ for partition $B_{1}$ and a set with n covering stars for partitions $B^{'} = {B_{2}, \dots B_{n + 1}}$ . We want to bound the number of RDF triples that can change in graphs $G^{'}$ at distance k of $G$ to get a graph at distance 1, based on the star multiplicities. First, let’s assume $S_{1}$ is not the covering star of any other $B_{i}$ in $B^{'}$ . Hence, changes can happen in either $G_{S_{1}}^{'}$ or in a graph $G_{S^{'}}^{'}$ induced by a star $S^{'}$ different from $S_{1}$ that covers some other $B_{i} \in B^{'}$ , but not in both graphs since the new graph must be at distance 1 from $G^{'}$ . Thus, either $S_{B_{1}}^{(k)} (G) = 0$ or $S_{B^{'}}^{(k)} (G) = 0$ :

When $S_{B_{1}}^{(k)} (G) = 0$ , the changes in $G_{S^{'}}^{'}$ , by induction hypothesis using Eq. (2), produce at most ${mpv}^{(k)} (? x_{1}, B_{1}, G) \times S_{B^{'}}^{(k)} (G)$ changed mappings since one change in $G_{S^{'}}^{'}$ might affect at most ${mpv}^{(k)} (? x_{1}, B_{1}, G)$ triplets in the same join when applied to a graph at distance 1 of $G^{'}$ .

In the symmetric case, when $S_{B^{'}}^{(k)} (G) = 0$ , $G_{S_{1}}^{'}$ may contain $S_{B_{1}}^{(k)} (G) = κ (S)$ changed triplets, producing at most ${mpv}^{(k)} (? x_{1}, B^{'}, G) \times S_{B_{1}}^{(k)} (G)$ changed mappings in the joined $SPARQL$ pattern.

We chose the largest of the two values when calculating

S_{B}^{(k)} (G)

. On the other hand, if

S_{1}

also covers another

B_{i} \in B^{'}

, a change in

G_{S_{1}}^{'}

can also imply changes in

G_{S^{'}}^{'}

. This can cause, in the worst case,

{mpv}^{(k)} (? x_{1}, B_{1}, G) \times S_{B^{'}}^{(k)} (G)

changed mappings for the change happening in

G_{S^{'}}^{'}

, plus

{mpv}^{(k)} (? x_{1}, B^{'}, G) \times S_{B_{1}}^{(k)} (G)

changed mappings caused by the change in

G_{S_{1}}

, which may contain

S_{B_{1}}^{(k)} (G) = κ (S)

changed triplets. We also need to consider that the change may cause new joins between then new triplets in both

G_{S_{1}}^{'}

and

G_{S^{'}}^{'}

, for a total of

S_{B_{1}}^{(k)} (G) \times S_{B^{'}}^{(k)} (G)

changed mappings in the joined

SPARQL

pattern. The sum of these three values is what the definition of

S_{B}^{(k)} (G)

uses.

□

Now we have all the prerequisite to define the elastic sensitivity of user queries (at fixed distances of a given graph): $\begin{array}{c} {ES}_{{COUNT}^{? x} (\bar{B})}^{(k)} (G) = S_{B}^{(k)} (G) \\ {ES}_{{COUNT}^{DISTINC T ? x} (\bar{B})}^{(k)} (G) = S_{B}^{(k)} (G) \\ {ES}_{{COUNT}_{? \bar{x}}^{? x_{0}} (\bar{B})}^{(k)} (G) = 2 S_{B}^{(k)} (G) \end{array}$

And as in Johnson et al. [19], the above lemma readily leads us to the desired bound for (the three kind of) user queries:

Lemma 5.

For any user query Q, and any graph $G$ compliant with schema $P$ : $\begin{matrix} {ES}_{Q}^{(k)} (G) ⩾ {LS}_{Q}^{(k)} (G) . \end{matrix}$

Proof.

By case analysis on the type of user query Q:

For plain counting queries ( $Q = {COUNT}^{? x} (\bar{B})$ ): the result follows directly from Lemma 4 since the result of the counting query is given by the application of the sensitivity calculation for the $C B G P s$ in the query.

For plain unique counting queries ( $Q = {COUNT}^{DISTINCT ? x} (\bar{B})$ ): it can be noted that $DISTINCT$ reduces the elastic stability of the elementary BGP, $B^{'}$ , containing $? x$ from $κ (S^{'})$ to 1, where $S^{'}$ is the star covering $B^{'}$ .

For counting queries after grouping ( $Q = {COUNT}_{? \bar{x}}^{? x_{0}} (\bar{B})$ ): During the grouping, each changed triple can affect two result mappings in the query since one modified triple may generate a mapping that will fall into a new group, and at the same time the old mapping is dropped from another group.

□

Lemma 5 readily establishes our main result, which allows applying differential privacy to $SPARQL$ queries over RDF graphs:

Theorem 3.

Assume that our universe of (valid) RDF graphs is composed by the graphs that comply with dp-schema $P$ and multiplicity bound κ and let $G$ be any of those graphs. Let Q be a user query and let $\begin{matrix} U_{Q} (G) = max_{0 ⩽ k ⩽ size {(G)}_{P}} e^{- β k} {ES}_{Q}^{(k)} (G), \end{matrix}$ where $ϵ > 0$ , $0 < δ < 1$ , $β ⩽ \frac{ϵ}{2 ln (2 / δ)}$ and the elastic sensitivity ${ES}_{Q}^{(k)} (G)$ of Q is computed, as previously described, from multiplicity bound κ and the frequencies of most popular values mapped to the variables in Q as specified by function ${mpv}^{(k)}$ . Then, the randomized algorithm $\begin{matrix} A (G) = {[[Q]]}_{G} + Lap (\frac{2 U_{Q} (G)}{ϵ}) \end{matrix}$ is an $(ϵ, δ)$ -differentially private version of Q.

The theorem follows immediately from Theorem 2 and Lemmas 1 and 5.

6. Evaluation

Having characterized formally how an algorithm can be implemented to enforce differential privacy on SPARQL queries based on privacy schemes, in this section, we present an empirical evaluation of how the algorithm would behave in real scenarios.

Setup We conducted our evaluation on a 2018 Macbook Pro with 16 GB of RAM memory having installed a Fuseki instance on a 2 AMD Opteron server with an SSD drive and 64 GB of RAM memory. We used Java 1.17 to implement our proof of concept. We also used the SecureRandom Java class to generate the random numbers to calculate the Laplacian probability distribution since that class implements a well-tested random number generator,5

⁵
https://docs.oracle.com/javase/8/docs/api/java/security/SecureRandom.html

an essential component for ensuring the correctness of our privacy guarantees algorithm. The code and all the queries used for this evaluation are available in GitHub.6

⁶

Repository https://github.com/cbuil/DPSparql.

Data In the evaluation we used real world data and queries from Wikidata [34]. Wikidata is a collaboratively edited knowledge base hosted by the Wikimedia Foundation. It is a common source of data for Wikimedia projects such as Wikipedia, and it has been made available to the general public under a public domain license. Wikidata stores 86,671,701 items (RDF resources), and 1,084,935,969 statements (triples7

⁷

https://tools.wmflabs.org/wikidata-todo/stats.php

). We selected a subset of the Wikidata Truthy from 2021-06-23, which has all but direct properties (i.e. http://www.wikidata.org/prop/direct/P*) removed [18,35]. The data is available to download from Google Drive.8

⁸

https://drive.google.com/u/0/uc?id=1oDkrHT68_v7wfzTxjaRg40F7itb7tVEZ

We also provide the scripts to generate this Wikidata version in our Github repository.9

⁹

https://github.com/MillenniumDB/benchmark/blob/master/src/database_generation/filter_direct_properties.py

We use the following prefixes from Wikidata along this section:

6.1. Privacy schema

Our dp-schema is defined based on three pairwise disjoint stars, $P_{1}$ , $P_{2}$ , and $P_{3}$ , representing data from Humans, Organizations, and Professions, built around Wikidata’s P31 property (the instance of property). We extracted URIs from all the instances of the classes Human, Organization and Profession using queries like:

This gathers the instances of the class Human. Star instances were formed by selecting a subset of properties of the three classes using star queries centered in the URIs (each instance representing either a Human, an Organization or a Profession) to define three sub-graphs covered by three stars, $P_{1}$ , $P_{2}$ , and $P_{3}$ . In other words, we used the mappings of ?center from the initial queries as star centers and for each center we retrieved a few of their properties and used the resulting RDF graph as the basis for our queries. Differential privacy is applied to protect the privacy of the centers. We present the schema with all the properties we used for each star in the following example.

Evaluation privacy schema The three stars employed to identify the different entities in our evaluation schema are (modulo variable renaming):

These three stars shouldn’t be used directly as a privacy schema because $S_{1}$ and $S_{3}$ share a property, P106. Nevertheless, the sets of instances of the property in the sub-graphs induced by $S_{1}$ and $S_{3}$ are disjoint because instances of $? c_{1}$ (human UIRs) and $? c_{2}$ (organization URIs) are disjoint. Hence, for the purpose of our evaluation, we consider this partition of P106 as two different properties that we refer to as P106 Profession in $S_{1}$ and P106 Occupation in $S_{3}$ , keep $S_{3}$ as it is, and rename them $P_{1}$ , $P_{2}$ , and $P_{3}$ respectively.10

¹⁰
Note that this observation suggests a more subtle definition of privacy scheme would be useful.

The Wikidata properties that allow joining data from two different stars are P108 Employer from Humans to Organizations, P106 Profession from Humans to Professions, P106 Occupation from Organizations to Professions and P112 Founded by from Organizations to Humans. Table 1 shows statistics about the instances of the stars in our data, including star size (on top of the table).

Table 1

Table showing key statistics about the data in our privacy schema (the largest schema is by far the Humans star)

Humans: 9,181,487
P569	P570	P106	P108	P2002
5,109,648	2,511,719	6,446,811	1,085,617	159,194
P21	P40
7,223,891	707,747

Professions: 7,786
P1963	P101	P425
97	95	3,018

Organizations: 72,879
P106	P178	P112
50	16	2,789

6.2. Queries

We selected 26 queries from the query logs in [18,35] containing the predicates used in our dp-schema. Since the amount of COUNT queries in the query logs is small [4] for each of these queries we added the COUNT keyword and we removed the triples from the query that were not accessing our schema. In addition, these queries were modified to get a diverse set of query results and types.

We consider star queries which are queries covered by a single star from the dp-schema. These queries can only add filters and remove triple patterns from the star. Therefore, a star query is centered around a single join vertex $? x_{0}$ , corresponding to the center of the star (Fig. 2). We also have linear queries describing a path that must include a join variable that appears in a place that is not the center of any star from the dp-schema (Fig. 3), and snowflake queries (Fig. 4), a concatenation through a join variable of a star query with other queries of different shapes, as defined in [1]. The queries are listed in Appendix B and they can also be found in the companion Github repository for this article. Table 3, also in Appendix A, summarizes their characteristics. We show $Q_{3}$ in Fig. 4, which queries the Humans star, accessing sex, birth and death dates for each human as well as their professions. We use the ?professions variable for connecting to the Professions star. From that star we retrieve each profession’s field of work (property P425), obtaining a snowflake-shaped query.

Fig. 2.

Star-shaped query $Q_{16}$ accessing the Humans privacy star.

Fig. 3.

Linear query $Q_{15}$ accessing the Humans privacy star.

Fig. 4.

Snowflake query $Q_{3}$ accessing the Humans and Organization privacy schemas.

Table 2

Results of the execution of Wikidata queries using our differential privacy method. Those queries with sensitivity “1.0” are star queries since the sensitivity of a COUNT query over a single star schema is 1 (a COUNT query over a table), and their elastic stability is “x” as described in Section 5.2 if there are joins between star BGPs, the sensitivity increases based on the stability polynomial, calculated according to Theorem 3

Query ID	Actual result	Average private	Average private	Sensitivity	Stability $S^{(k)} (Q_{i}, ? x)$
		result using	result using
		Epsilon = 0.1	Epsilon = 1.0
Q1	2,275,177	2,275,176	2,275,176	1.0	1
Q2	1,717,945	1,717,940	1,717,945	1.0	1
Q3	1,274,788	1,189,636	1,270,649	290,415	(x+290,863) * 1
Q4	17,440	17,464	17,319	245	(x+18) * 1
Q5	86	110.8	203.1	245	(x+18) * 1
Q6	1,170,315	3,860,469	1,137,687	400,363	(x+400,981) * 1
Q7	3,018	3019	3,017	1.0	1
Q8	50	57	50	1.0	1
Q9	31	1577	37	241.3	(x+8) * 1
Q10	14,477	14,472	14,477	1.0	1
Q11	2,789	2,792	2,789	1.0	1
Q12	221	1,144	89	1,162.2	(x+1,164) * 1
Q13	6,446,811	6,446,812	6,446,810	1.0	1
Q14	3,615	3,613	3,614	1.0	1
Q15	0	71	8	250.8	(x+33) * 1
Q16	21,683	21,682	21,682	1.0	1
Q17	2,789	2,788	2,788	1.0	1
Q18	25	465	34	248.5	(x+27) * 1
Q19	865	864	865	1.0	1
Q20	7	7,087,181	44,254	1,656,501	(x+6,189) * (x+8) * 1
Q21	3,213	1,739,549	31,357	1,651,883	(x+6) * (x+6,189) * 1
Q22	2,092	377,638,493	1,600,610	408,594,207	(x+7) * (x+1,694,747) * 1
Q23	23,450	23,449	23,449	1.0	1
Q24	2,626	1,071,418	231,278	628,018	(x+628,987) * 1
Q25	29,352	1,593,488	42,317	628,018	(x+628,987) * 1
Q26	29,352	29,350	29,351	1.0	1

6.3. Results

We report the results of our evaluation in Table 2, showing the actual count output by the queries and the average result to these queries with added noise (calculated applying the method described in Section 5). We followed the query schema introduced in Section 5.2, that uses the BGP part of each query to calculate the initial most popular values (mpv). We report the results using two values for ϵ, 0.1 and 1.0. We also report the median error percentage for $ϵ = 1.0$ in Fig. 5 between the real counts vs. the counts with added noise segregated by the type of query. To calculate the error we used the following Equation: $\begin{matrix} median ({\frac{(ActualCount - {NoiseResult}_{i}) * 100}{ActualCount}}_{i = 1 \dots 100}) \end{matrix}$ We use $δ = n^{- ϵ ln n}$ where n is the size of the dataset (i.e. the sum of all the different RDF resources in the schema that the queries access) and $β = ϵ / (2 \times log (2 / δ))$ for the β parameter.

(Blue) squares in Fig. 5 represent star queries (typically accessing a single star within the dp-schema), have a very low error (and thus a high utility) compared to the other two types of queries that involved at least one “join” operation across stars.

However, the smaller the result from the COUNT, the larger the error introduced. (Red) triangles represent path queries and thus queries with join operations. Only query $Q_{9}$ is close to the 10% error threshold to be considered a query with enough utility. That query is only two triple patterns that retrieve all data from the Professions and Organizations stars. (Grey) circles represent Snowflake queries, which are queries that join two or more stars in the schema, and also access several properties from each star pattern. Only those queries returning a result greater than 1,000,000 (queries $Q_{3, 6}$ ) have a high utility result.

Fig. 5.

This plot shows that star-shaped queries (blue squares) have the greatest utility, since they are likely not to have joins between stars in the schema. It also shows that queries accessing large amounts of data have high utility. Notice that $Q 15$ does not appear in the plot since its result is 0.

The sensitivity and stability columns in Table 2 clearly show that the larger the degree in the stability polynomial the greater the query sensitivity, and thus, the higher the error in the result. Note that in most cases the derived queries produce smaller results than the simpler queries, thus the errors are larger. The only queries where this is not the case are queries $Q_{1, 2}$ . However, the errors in these queries are so small that much more data should be collected to really establish if they are statistically different. A class of queries with large errors are queries with small outputs and at least one join. See, for example, queries $Q_{9, 18, 20}$ . In general, the more joins in the query and the smaller the result size, the worse results. Joins directly affect the stability polynomial, more joins imply larger degree. The effect is exacerbated if the value representing the most popular mapping is large. Large amounts of noise are introduced to results of the three queries associated with polynomials of degree 2, $Q_{20, 21, 22}$ , which access data from 3 different stars. Even though the results of the original queries are not small (>2,000), the error introduced is very high. Compare that to the single join query $Q_{9}$ that produces a small result and high errors, but it is much smaller than the errors of $Q_{21}$ and $Q_{22}$ . Results from queries accessing a single star from the dp-schema are good as expected, since without joins they have low query sensitivity and, hence, small errors.

7. Related work

The study of how to guarantee the privacy of individuals contributing personal data to datasets is a long studied problem. In this work we have focused on how to guarantee this privacy in RDF data graphs accessed through SPARQL queries using differential privacy. The related work can be roughly classified into those that provide some privacy guarantees to accesses to data stored in (social) graphs and those that guarantee privacy over the results returned by SPARQL queries. We briefly look over these works in this section.

7.1. Privacy over SPARQL

There have been several approaches to address privacy concerns related queries to RDF data. A good survey can be found in [32]. There is a basic anonymization protection that a SPARQL engine must provide to queries that directly return individuals, as opposed to aggregated data. Similar to the case of relational databases where attribute values are anonymized using nulls, the work presented in [13] uses blank nodes to hide sensitive data. Delanoux et al. [7] introduce a more general framework with formal soundness guarantees for privacy policies that describe information that should be hidden as well as utility policies that describe information that should be available. The framework checks whether policies are compatible with each other, and based on a set of basic update queries that use blank nodes and deletions of triples, automatically derives from the policies candidate sets of anonymization operations that guarantee to transform any input dataset into a dataset satisfying the required policies. However, their soundness guarantees do not imply any formal privacy guarantees. Two early methods developed for privacy protection when answering queries about classes in a dataset are k-anonymity and l-diversity. In particular, k-anonymity is used in [17,31] to answer queries in RDF datasets. Unfortunately, it is well-known these methods, in contrast to differential privacy, do not provide formal guarantees for privacy.

The only work known to us that directly applies differential privacy to SPARQL queries is [32]. But surprisingly, differential privacy is realized through local sensitivity alone without the use of a smoothing function necessary for correctness [12]. A privacy-preserving query language for RDF streams is introduced in [8]. Limiting queries to that language servers can continuously release privacy-preserving histograms (or distributions) from online streams. Han et al. [15] provide differentially-private variants of the algorithms TransE and RESCAL, aimed at constructing knowledge graph embeddings in the form of real-valued vectors. While the authors show that these encodings allow performing some analyses with a reasonable privacy-utility tradeoff, inlcuding clustering and link prediction, it is an open question whether this generalizes to further analyses or counting queries as addressed in the current article.

7.2. Privacy in social graphs

A central task to the development of any practical differentially private analysis tool is finding appropriate approximations and alternatives to global sensitivity: it should be easy to calculate, and, at the same time, close enough to the real sensitivity to allow the computation of statistically useful results. A well-known approach is to rely on the concept of restricted sensitivity [3]. Restricted sensitivity is tailored to provide privacy guarantees assuming datasets come from a specific subgroup of the universe of all possible datasets, and it was introduced in the context of social-graph data analysis. There are two natural notions from which one can define adjacency of graphs: differences on edges and differences on vertices. The distance between two graphs, $G_{1}$ and $G_{2}$ , can be then given by the smallest number of changes (either on edges or vertices) needed to transform $G_{1}$ and $G_{2}$ into the same graph, giving rise to two definitions of restricted sensitivity. Blocki et al. [3] provide efficient algorithms to calculate approximations of these sensitivities for a class of social graph queries that involve only one type of join: aggregations over properties of a node and its neighbors (the specific subgroup of interest). Proserpio, Goldberg and McSherry extended the edge-based definition of restricted sensitivity to include weighted datasets [30]. Briefly, the intent was to increase the utility of the answer by considering weights associated with edges during the calculation of noise. Our notion of dp-schema sensitivity can be seen as a vertex-based sensitivity if each induced subgraph $g_{i} \in P (| G |)$ is interpreted as a single node with its attributes. Our proposed elastic sensitivity can be then interpreted as a generalization to handle multiple joins. The polynomial to calculate the selectivity of a query with a single join is always of degree 1. Furthermore, social graphs are essentially defined using a single relationship type. Using our terminology, this implies that the dp-schema would consist of a single star BGP. Hence, Blocki et al. [3] argue that, in practice, the value of x in the degree-1 polynomial (i.e. the frequency of the most popular value) can be bounded by a constant (which would be provided by the RDF data administrator). This is closely akin to our predicate multiplicity. Elastic sensitivity, on the other hand, uses a variable selectivity (the values of x are obtained directly from the dataset), and generalizes to multiple joins. Other works such as [5] proposed differential privacy methods for subgraph counting queries with unrestricted joins (through node differential privacy), however, answering this type of queries is computationally difficult (NP-hard). The result is then more of theoretical interest and limited for general application.

8. Conclusions

In this paper we have introduced a framework to develop differential privacy tools for RDF data repositories. We have used the framework to develop an $(ϵ, δ)$ -Differential Privacy SPARQL query engine for COUNT queries. A crucial component of our framework is the concept of differential privacy schema or dp-schema. Without it, we would have not been able to develop a differential privacy preserving algorithm to publish data of acceptable quality. The concept is independent of the sensitivity approximation used and we hope that others can build on the concept to get better query answering algorithms. In our algorithm, we adapt the concept of elastic sensitivity of SQL queries from [19] to SPARQL.

We have implemented our algorithm and tested it using the Wikidata RDF database, queries from its log files and other example queries found at the Wikidata endpoint. The simulations show the approach to be effective for queries over large repositories, such as Wikidata, and in many cases for queries within the 10 of thousands answers to aggregate. However, even though elastic sensitivity has been designed to bind the stability of joins, the sensitivity of a query with joins can still be very high. As in the case of SQL queries in relational databases, in order to keep the noise in SPARQL queries under a single percentage digit, query results should have over 1M tuples and $ϵ = 1$ . The evaluation shows though that we can safely evaluate star queries.

There are many pending issues to address. We can still apply several optimizations to our framework. For example, public graphs can be treated as public tables. If they participate in joins, we can directly use their most popular result mappings during calculation of the query sensitivity. From the more practical point of view, more operations need to be implemented. We can consider the approaches described in [19] for SQL to add aggregation functions like sum and averages to our framework. There are also issues to consider about the impact that such algorithms will have on SPARQL query engines. From the more formal side, it is still important to keep searching for better approximations of local and global sensitivities as well as alternative definitions that are less onerous than differential privacy. One possibility is to find a way to apply restricted sensitivity to more types of queries by adding more semantic information to a dp-schema. It might also be possible to find a more accurate approximation for the elastic sensitivity of $D I S T I N C T$ queries to make them independent or partially independent of predicate multiplicity. We should point out that most queries that will require privacy protection will be $D I S T I N C T$ (e.g. how many human exists with properties A and B that work in company C?).

Footnotes

Acknowledgements

We thank the reviewers for their thorough work revising this paper, which improved the overall quality of the paper. Carlos Buil-Aranda was supported by Fondecyt Iniciacion 11170714 and by ANID – Millennium Science Initiative Program – Code ICN17_002. Jorge Lobo was partially supported by the Spanish Ministry of Economy and Competitiveness under Grant Numbers: TIN-2016-81032-P, MDM-2015-052, and the U.S. Army Research Office under Agreement Number W911NF1910432. Federico Olmedo was also supported by ANID – Millennium Science Initiative Program – Code ICN17_002.

Query characteristics

In this Section we present the characteristics of the queries we used in Section 6. The table presents the query ID (which refer to queries Section B in this Appendix), the stars within the Privacy Schema we defined in 6, the query shape, the amount of tripe patterns in the queries as well as the number of variables, the join variables when applicable, including the amount of mappings for each join variable, and data about the Privacy Schema stars in the query.

Queries

In this Section we present the queries we used in Section 6. There are 11 base queries with several variations, totaling 26 SPARQL COUNT queries.

Query $Q_{1}$ :

Query $Q_{2}$ :

Query $Q_{3}$ :

Query $Q_{4}$ :

Query $Q_{5}$ :

Query $Q_{6}$ :

Query $Q_{7}$ :

Query $Q_{8}$ :

Query $Q_{9}$ :

Query $Q_{10}$ :

Query $Q_{11}$ :

Query $Q_{12}$ :

Query $Q_{13}$ :

Query $Q_{14}$ :

Query $Q_{15}$ :

Query $Q_{16}$ :

Query $Q_{17}$ :

Query $Q_{18}$ :

Query $Q_{19}$ :

Query $Q_{20}$ :

Query $Q_{21}$ :

Query $Q_{22}$ :

Query $Q_{23}$ :

Query $Q_{24}$ :

Query $Q_{25}$ :

Query $Q_{26}$ :

References

Aluç,

Hartig,

M.T.

Özsu and

Daudjee, Diversified stress testing of RDF data management systems, in: International Semantic Web Conference, Springer, 2014, pp. 197–212.

Arapinis,

Figueira and

Gaboardi, Sensitivity of counting queries, in: 43rd International Colloquium on Automata, Languages, and Programming, ICALP 2016, July 11–15, 2016, Rome, Italy,

Chatzigiannakis,

Mitzenmacher,

Rabani and

Sangiorgi, eds, LIPIcs, Vol. 55, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2016, pp. 120–112013.

Blocki,

Blum,

Datta and

Sheffet, Differentially private data analysis of social networks via restricted sensitivity, in: Proceedings of the 4th Conference on Innovations in Theoretical Computer Science, ITCS’13, ACM, New York, NY, USA, 2013, pp. 87–96. ISBN 978-1-4503-1859-4. doi:10.1145/2422436.2422449.

Bonifati,

Martens and

Timm, Navigating the maze of Wikidata query logs, in: The World Wide Web Conference, 2019, pp. 127–138. doi:10.1145/3308558.3313472.

Chen and

Zhou, Recursive mechanism: Towards node differential privacy and unrestricted joins, in: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, 2013, pp. 653–664. doi:10.1145/2463676.2465304.

Cyganiak,

Wood and

Lanthaler, RDF 1.1 Concepts and Abstract Syntax, 2014.

Delanaux,

Bonifati,

M.-C.

Rousset and

Thion, Query-based linked data anonymization, in: International Semantic Web Conference, Springer, 2018, pp. 530–546.

Dell’Aglio and

Bernstein, Differentially private stream processing for the semantic web, in: Proceedings of the Web Conference 2020, WWW’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1977–1987. ISBN 9781450370233. doi:10.1145/3366423.3380265.

Dwork, Differential privacy, in: 33rd International Colloquium on Automata, Languages and Programming, Part II (ICALP 2006), Lecture Notes in Computer Science, Vol. 4052, Springer Verlag, 2006, pp. 1–12, https://www.microsoft.com/en-us/research/publication/differential-privacy/ . ISBN 3-540-35907-9.

10.

Dwork, Differential privacy: A survey of results, in: International Conference on Theory and Applications of Models of Computation, Springer, 2008, pp. 1–19.

11.

Dwork,

Kenthapadi,

McSherry,

Mironov and

Naor, Our data, ourselves: Privacy via distributed noise generation, in: Annual International Conference on the Theory and Applications of Cryptographic Techniques, Springer, 2006, pp. 486–503.

12.

Dwork,

Rothet al., The algorithmic foundations of differential privacy, Foundations and Trends^® in Theoretical Computer Science9(3–4) (2014), 211–407.

13.

B.C.

Grau and

E.V.

Kostylev, Logical foundations of privacy-preserving publishing of linked data, in: Thirtieth AAAI Conference on Artificial Intelligence, 2016.

14.

Gutierrez,

Hurtado and

A.O.

Mendelzon, Foundations of semantic web databases, in: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004, pp. 95–106. doi:10.1145/1055558.1055573.

15.

Han,

Dell’Aglio,

Grubenmann,

Cheng and

Bernstein, A framework for differentially-private knowledge graph embeddings, J. Web Semant.72 (2022), 100696. doi:10.1016/j.websem.2021.100696.

16.

Harris and

Seaborne, 2012, SPARQL 1.1 Query language, W3C recommendation, http://www.w3.org/TR/2010/WD-sparql11-query-20101014/.

17.

Heitmann,

Hermsen and

Decker, k – RDF-neighbourhood anonymity: Combining structural and attribute-based anonymisation for linked data, in: PrivOn ISWC, 2017.

18.

Hogan,

Riveros,

Rojas and

Soto, A worst-case optimal join algorithm for SPARQL, in: The Semantic Web – ISWC 2019 – 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part I,

Ghidini,

Hartig,

Maleshkova,

Svátek,

I.F.

Cruz,

Hogan,

Song,

Lefrançois and

Gandon, eds, Lecture Notes in Computer Science, Vol. 11778, Springer, Auckland, New Zealand, 2019, pp. 258–275. doi:10.1007/978-3-030-30793-6_15.

19.

Johnson,

J.P.

Near and

Song, Towards practical differential privacy for SQL queries, Proceedings of the VLDB Endowment11(5) (2018), 526–539. doi:10.1145/3187009.3177733.

20.

Kifer and

Machanavajjhala, No free lunch in data privacy, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD’11, ACM, New York, NY, USA, 2011, pp. 193–204. ISBN 978-1-4503-0661-4. doi:10.1145/1989323.1989345.

21.

Li,

Li and

Venkatasubramanian, t-Closeness: Privacy beyond k-anonymity and l-diversity, in: Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on, IEEE, 2007, pp. 106–115.

22.

Li,

Qardaji,

Su,

Wu and

Yang, Membership privacy: A unifying framework for privacy definitions, in: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, ACM, 2013, pp. 889–900.

23.

Machanavajjhala,

Gehrke,

Kifer and

Venkitasubramaniam, l-Diversity: Privacy beyond k-anonymity, in: 22nd International Conference on Data Engineering (ICDE’06), IEEE, 2006, pp. 24–24. doi:10.1109/ICDE.2006.1.

24.

F.D.

McSherry, Privacy integrated queries: An extensible platform for privacy-preserving data analysis, in: Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, ACM, 2009, pp. 19–30. doi:10.1145/1559845.1559850.

25.

Menand, Why do we care so much about privacy?, The New Yorker XCIV(17) (2018), 24–29.

26.

Motwani and

Raghavan, Randomized algorithms, ACM Computing Surveys (CSUR)28(1) (1996), 33–37. doi:10.1145/234313.234327.

27.

Nissim,

Raskhodnikova and

Smith, Smooth sensitivity and sampling in private data analysis, in: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing, STOC’07, ACM, New York, NY, USA, 2007, pp. 75–84. ISBN 978-1-59593-631-8. doi:10.1145/1250790.1250803.

28.

Nissim,

Raskhodnikova and

Smith, Smooth sensitivity and sampling in private data analysis, 2011, Draft full version v1.0, http://www.cse.psu.edu/~ads22/pubs/NRS07/NRS07-full-draft-v1.pdf.

29.

Pérez,

Arenas and

Gutierrez, Semantics and complexity of SPARQL, TODS34(3) (2009), 16.

30.

Proserpio,

Goldberg and

McSherry, Calibrating data to sensitivity in private data analysis: A platform for differentially-private analysis of weighted datasets, Proceedings of the VLDB Endowment7(8) (2014), 637–648. doi:10.14778/2732296.2732300.

31.

Radulovic,

García Castro and

Gómez-Pérez, Towards the Anonymisation of RDF Data, 2015.

32.

R.R.C.

Silva,

B.C.

Leal,

F.T.

Brito,

V.M.

Vidal and

J.C.

Machado, A differentially private approach for querying RDF data of social networks, in: Proceedings of the 21st International Database Engineering & Applications Symposium, ACM, 2017, pp. 74–81. doi:10.1145/3105831.3105838.

33.

Sweeney, k-Anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems10(05) (2002), 557–570. doi:10.1142/S0218488502001648.

34.

Vrandečić and

Krötzsch, Wikidata: A free collaborative knowledgebase, Communications of the ACM57(10) (2014), 78–85. doi:10.1145/2629489.

35.

Vrgoc,

Rojas,

Angles,

Arenas,

Arroyuelo,

C.B.

Aranda,

Hogan,

Navarro,

Riveros and

Romero, MillenniumDB: A persistent, open-source, graph database, CoRR (2021), arXiv:2111.01540.

Differential privacy and SPARQL

Abstract

Keywords

1. Introduction

2. Preliminaries about differential privacy

2.1. Definition

3.1. Privacy schema

3.1.1. Motivation

1 Calculation of query sensitivity will depend on the domain of blank nodes which can change under different contexts and implementations. This is a topic of future research.

Example 3.1 (Star BGP).

3 The notion of star BGP that we use here is similar to that of star query from [1], except that in a star query the center of the star must always appear as subject.

Definition 6 (Induced sub-graphs).

Example 3.2 (Induced sub-graph).

Definition 7 (Dp-schema compliance).

Example 3.3 (Dp-schema compliance).

3.1.3. Discussion

3.2. Predicate multiplicity

Definition 8 (Triple-pattern multiplicity).

Definition 9 (Star multiplicity).

Example 3.4 (Star multiplicity).

4. Queries

4.1. Supported queries

5.1. Preliminary notions

5.2. Elastic sensitivity

4 Observe that there might be multiple such triple patterns in B, all of them yielding valid upper bounds for mpv ( ? x , B , G ) . If we are interested in obtaining tighter privacy guarantees, we should choose the most precise bound.

5 https://docs.oracle.com/javase/8/docs/api/java/security/SecureRandom.html

10 Note that this observation suggests a more subtle definition of privacy scheme would be useful.

7.1. Privacy over SPARQL

7.2. Privacy in social graphs

8. Conclusions

Footnotes

Acknowledgements

Query characteristics

Queries

References

¹
Calculation of query sensitivity will depend on the domain of blank nodes which can change under different contexts and implementations. This is a topic of future research.

³
The notion of star BGP that we use here is similar to that of star query from [1], except that in a star query the center of the star must always appear as subject.

⁴
Observe that there might be multiple such triple patterns in B, all of them yielding valid upper bounds for $mpv (? x, B, G)$ . If we are interested in obtaining tighter privacy guarantees, we should choose the most precise bound.

⁵
https://docs.oracle.com/javase/8/docs/api/java/security/SecureRandom.html

¹⁰
Note that this observation suggests a more subtle definition of privacy scheme would be useful.