Abstract
Differential privacy is a framework that provides formal tools to develop algorithms to access databases and answer statistical queries with quantifiable accuracy and privacy guarantees. The notions of differential privacy are defined independently of the data model and the query language at steak. Most differential privacy results have been obtained on aggregation queries such as counting or finding maximum or average values, and on grouping queries over aggregations such as the creation of histograms. So far, the data model used by the framework research has typically been the relational model and the query language SQL. However, effective realizations of differential privacy for SQL queries that required joins had been limited. This has imposed severe restrictions on applying differential privacy in RDF knowledge graphs and SPARQL queries. By the simple nature of RDF data, most useful queries accessing RDF graphs will require intensive use of joins. Recently, new differential privacy techniques have been developed that can be applied to many types of joins in SQL with reasonable results. This opened the question of whether these new results carry over to RDF and SPARQL. In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees differential privacy, if the RDF graph is accompanied with semantic information about its structure. We have implemented our algorithm and conducted several experiments, showing the feasibility of our approach for large graph databases. Our aim has been to present an approach that can be used as a stepping stone towards extensions and other realizations of differential privacy for SPARQL and RDF.
Keywords
Introduction
As many social norms, privacy, or the right to privacy, is an evolving term that is invoked in many contexts as eloquently described by Louis Menand in [25]: “Privacy is associated with liberty, but it is also associated with privilege (private roads and private sales), with confidentiality (private conversations), with nonconformity and dissent, with shame and embarrassment, with the deviant and the taboo (...), and with subterfuge and concealment”.
In order to get some formal underpinning of privacy in the context of electronic data collection and publishing, Li et al [22] have looked at privacy breaches, studied their general characteristics, and concluded, that electronic privacy breaches always ended with giving an attacker the ability to identify (using public data) whether an individual is member of a set or class that had been intended to be anonymous (e.g., the class of individuals with high cholesterol). Hence, they define preservation of privacy as avoiding privacy breaches in the sense of not disclosing set memberships of individuals.
For the public good, such as the advance of public health, or the fair distribution of government resources, such data is frequently made public. There are also situations in which governmental and commercial organizations collect and analyze data to improve or provide new services. Especially in such cases, society expects a certain level of privacy on the way these organizations use the data. Publishing data with perfect privacy means that no assumption can be made about the prior knowledge an attacker may have about the supposedly anonymous set. Under this assumption, there would be little utility in published data if perfect privacy is expected [11,22]. Therefore, the research community has looked at weaker definitions of “acceptable” privacy. Useful concepts like k-anonymity [33], l-diversity [23] and t-closeness [21] were developed but they were shown to have weak privacy guarantees [10].
In spite of its limitations [12], but because of its formal properties, a privacy notion that has gained a lot of acceptance is differential privacy. We will present precise definitions later in the paper, but informally, differential privacy tries to hide the identity of individuals that are members of a particular class, while still providing quantifiable utility guarantees to the data published about the class. The basic principle is simple. Given a universe
Even though the notion of differential privacy is in principle independent of the data model and query language at steak, so far most practical, automated implementations over well-established languages have been in the context of relational databases, over SQL, and have been restricted to aggregation queries or grouping. Aggregations are queries such as counting, finding maximum or average values over a certain data subset; grouping is the creation of histograms based on aggregations.
Furthermore, to allow for reasonable approximations of sensitivity, the support of these implementations for queries with joins has been rather limited [24]. It was only in 2018, when Johnson et al. [19] introduced a new approach to approximate sensitivity that can be applied to a wider class of SQL joins, with reasonable results.
In the past decades, graph data models have enjoyed a growing adoption in comparison to the more traditional relational model. One such notable example is the RDF data standard, queried over by the SPARQL language, which have become extremely popular, in particular, for their role in the development of the Semantic Web. By the simple nature of RDF, it can be stored using binary relations [6] and most interesting queries will require operations equivalent to joins. This raises the question whether Johnson et al.’s approach [19] can also be applied to RDF and SPARQL.
In this paper we provide a positive answer to this question by presenting an algorithm that can answer counting queries over a large class of SPARQL queries that guarantees differential privacy. This result has been made possible by introducing the notion of a differential privacy schema that allows redefining Johnson et al.’s sensitivity approximation of SQL queries in the appropriate terms for answering SPARQL queries. A differential privacy schema groups sets of RDF tuples into sub-graphs that can be then used as single units for privacy protection. Examples show that this type of schema naturally arises from the semantics of the data stored in the tuples, and it should not be difficult for a database administrator to define.
We demonstrate the applicability of our approach by implementing a differential privacy query engine that uses the approximation to answer counting and grouping SPARQL queries, and evaluate the implementation running simulations using the Wikidata knowledge base [34].
The rest of the paper is organized as follows: in Section 2 we introduce the readers to the fundamental concepts of differential privacy. In Section 3, we present the core concepts of SPARQL used within the paper, including the notion of differential privacy schema. In Section 5 we prove the correctness of our proposed approximation to sensitivity and in Section 6 we evaluate the effectiveness of our proposed approximation in an implementation that we apply to both synthetic and real world datasets and queries. We present related work in Section 7, and we conclude the paper in Section 8.
Preliminaries about differential privacy
We now describe the framework of differential privacy, the problem that arises when applying differential privacy to SQL queries with general joins and how it has been addressed by the scientific community.
Definition
Intuitively, a randomized algorithm [26] is differentially private if it behaves similarly on similar input datasets. To formalize this intuition, the framework of differential privacy relies on a notion of distance between datasets. We model datasets as a multiset of tuples and we say that two datasets are k-far apart if one can be obtained from the other by changing the value of k tuples. Formally, this corresponds to (a mild generalization of) the notion of distance used for defining bounded differential privacy [20], which quantifies (only) over pairs of datasets of the same size. In the remainder, we let
Let
This inequality establishes a quantitative closeness condition between
Multi-table datasets Our notions of dataset and distance between datasets can be extended to collections of datasets as follows: A dataset formed by multiple sets
Establishing differential privacy for numeric queries of limited sensitivity is relatively simple. The Laplacian mechanism [9] says that we can obtain a differentially private version of query
Given a numeric query
Here,
In practice, when implementing the Laplacian mechanism we approximate the global sensibility of queries by exploiting their structures: Numeric queries are typically constructed by first transforming the original dataset using some standard transformers and by returning as final result some aggregation on the obtained dataset. For example, we join two tables, filter the result (dataset transformations) and return the count (aggregation) of the obtained table. The global sensitivity of such a query can be estimated from the so-called stability properties of the involved transformers. Intuitively, a stable transformer can increase the distance between nearby datasets at most by a multiplicative factor. Formally, we call a dataset transformer
Conversely, the use of transformers with unbounded stability might result in queries of unbounded sensitivity. A prominent example of a transformer exhibiting this problem is join. Assume we join two tables, say
To handle queries that involve transformers of unbounded stability, such as joins, we require the use of more advanced techniques. The Laplacian mechanism calibrates noise according to the query, overlooking the fact that queries are done on concrete datasets, hence the employed noise could be potentially customized for each dataset. Nissim et al. show how to exploit this idea of instance-based noise [27]. Their approach relies on the notion of local sensitivity.
The local sensitivity
Observe that
For answering a query f on dataset D, we cannot simply use noise calibrated according to
A function
We can readily achieve differential privacy by adding noise calibrated according to a smooth upper bound of the query local sensitivity [28, Corollary 2.4].
Let
The benefits of this mechanism are twofold. On the one hand, it allows handling queries that fail to have a bounded global sensitivity, but do have a bounded local sensitivity. These include e.g. the query we considered earlier, consisting of the count of the join between two tables. On the other hand, it does not require computing the local sensitivity of the queries itself, but only a smooth upper bound thereof. This is key for its practical adoption since calculating the local sensitivity of queries is computationally prohibitive: As observed by Johnson et al. [19], “it requires running the query on every possible neighbor of the original dataset”.
To apply the mechanism from Theorem 2, we must provide a smooth upper bound for the local sensitivity of queries. We can construct the smooth upper bound using approximations for the local sensitivity at fixed distances.
Let
The goal of Section 5 is to apply the differential privacy mechanism from Theorem 2 to SPARQL counting queries. To do so, we will use Lemma 1 to derive smooth upper bounds of the local sensitivity of queries. In turn, this requires constructing upper bounds for the local sensitivity of queries at fixed distances, for which we will leverage local stability properties of SPARQL dataset transformers.
In this section we examine the semantic information that is necessary considering over RDF graphs, in order to answer counting queries in a differentially private manner. This comprises a data schema and upper bounds on the predicate multiplicities.
Privacy schema
Motivation
As mentioned earlier, the goal of differential privacy is to protect the (possibly sensible) contribution of each individual within a dataset when publicly releasing aggregate information about the dataset – in our case, the result of counting queries. In the relational model, individuals are typically identified with rows of the database which significantly simplifies all the technical development. For instance, if the database at stake consists of a single table, we consider two instances of the database neighboring, i.e. differing in the contribution of a single individual, if they differ in a single row. On the other hand, if the database consists of multiple tables, we consider two database instances neighboring if they differ in a row of some of the tables (see paragraph in Section 2). The underlying assumption behind this is that each table groups attributes of individuals in a particular entity type, e.g. people, political parties or companies, or part thereof, whose identities must be protected.

RDF graph
To be able to apply differential privacy to a dataset in the form of an RDF graph, we must thus begin by identifying the different types of entities present in the graph, and the set of individuals in each type. Consider, for instance, the RDF graph
), two companies (depicted in
) and two cities (depicted in
). Said otherwise, there are two individuals of each entity type, adding up to six individuals in all. When querying the graph, we will be interested in protecting the contribution of all these individuals, and when applying differential privacy techniques to this end, we will then consider as a neighbor any other graph that differs in the contribution of either of them.
We refer to the semantic information necessary to identify the individuals in an RDF graph
We briefly review some basic RDF terminology following standard notation used in the literature [14,16], where more details can be found. The RDF language assumes the existence of an infinite set U (of URI references), an infinite set B (of blank nodes), and an infinite set L (of RDF literals). An RDF triple is a term of the form
Calculation of query sensitivity will depend on the domain of blank nodes which can change under different contexts and implementations. This is a topic of future research.
The more general definition of triple patterns allows also for variables in the predicate component of triple patterns.
To hint how we can formally define entities within an RDF graph, observe first the six colored sub-graphs identified in Fig. 1: two green, two blue and two red. All have a star shape [1], consisting of a “center” with outgoing and/or incoming edges, i.e. predicates. Sub-graphs representing individuals of the same entity type are built from the same set of predicates. For example, both (blue) sub-graphs, representing people, are built from predicates
A BGP is called a star if
both the subject and the object of all its triple patterns are variables, all triple patterns have different predicates, and it consists of either a single triple pattern with no join vertex, i.e. a triple pattern whose subject and object are distinct variables, or multiple triple patterns with a single join vertex, which appears once and only once in every triple pattern
(Star BGP).
The three stars employed to identify the different entities in our running example (Fig. 1) are (modulo variable renaming): 
Note that in each
,
and
. Formally, we define the center of a star as the join vertex if the star contains multiple patterns, or the variable appearing in the subject of the triple pattern in case it consists of a single triple pattern.3
The notion of star BGP that we use here is similar to that of star query from [1], except that in a star query the center of the star must always appear as subject.
A differential privacy schema (dp-schema, for short)
The set
is a dp-schema. as well as its sub-set
. However, this second schema falls short of describing the whole graph
(Induced sub-graphs).
Let
(Induced sub-graph).
The star
induces two sub-graphs
and
over
. These correspond to the blue sub-graphs in Fig. 1, which are formally defined as: 
Likewise,
, where 
Intuitively, the set of induced sub-graphs
, but it is not materialized in
. From the privacy point of view, the values of the star center are the unique identifiers of the entities contributing the data in each sub-graph, and their values must be kept confidential.
The notion of induced sub-graphs naturally extends from single stars, to dp-schemas, i.e. sets of stars. Concretely, we let
.
Now we have all the prerequisites to define the core concept of this section:
(Dp-schema compliance).
We say that an RDF graph
(Dp-schema compliance).
Our running example
. In contrast, it does not comply with dp-schema
.
In summary, if a graph
Discussion
We now address a few key points about dp-schemas.
No shared attribute between entities At first sight, the requirement that stars in a dp-schema be predicate pairwise disjoint might seem a limitation, as it requires that each attribute belong to a single entity type. For instance, it might seem natural to consider that predicate
(which identifies companies) and
(which identifies people). However, for the sake of protection it makes no difference to which star it belongs, since our application of differential privacy will protect the contribution of all individuals, regardless of its type.
Existence of dp-schemas RDF graphs always admit compliant dp-schemas. In particular, every graph complies with a trivial dp-schema comprising the union of all singleton BGP’s of the form
Compliance verification Checking whether an RDF graph complies with a given schema
Dp-schema provision For practical purposes, we assume that the database administrator of the RDF graph at stake is responsible for designing the dp-schema the graph shall comply with, and for ensuring the compliance as the graph evolves. In this latter regard, observe that removing an RDF triple from the graph always preserves the dp-schema compliance, and adding a triple also preserves compliance provided the predicate in the triple already appears in the dp-schema. We believe this is a natural assumption, as in the relational data model this would correspond to changing the schema of the database by adding a new attribute to a relation if the new predicate is incorporated into an existing star of the dp-schema or creating a new table if the predicate is added to the dp-schema as a new star.
Predicate multiplicity
Automatic approaches to answer dataset queries in a differentially private manner are typically obtained by adding noise to the query results, calibrated according to their sensitivity. Thus, a prerequisite to apply differential privacy to counting queries over RDF graphs is that they have bounded sensitivity. Unfortunately, this does not occur in the general case.
To see this, consider graph
) is replaced by somebody else’s contribution. The query answer over this neighboring graph can certainly be any integer
This problem arises because of the presence of predicates that are not one-to-one. To recover bounded sensitivities, we have to restrict ourselves to predicates that have bounded multiplicity. For instance, if the administrator of graph
On the formal level, we associate such bounds to triple patterns rather than to predicates. This is because in the presence of compliant dp-schemas, predicates are identified with triple patterns (every predicate within a dp-schema occurs in a single triple pattern, in a single star).
(Triple-pattern multiplicity).
Let
Many predicates (or equivalently, triple patterns) would have a natural multiplicity bound of 1. For instance, a city has a unique
Henceforth, in the remainder we assume the system administrator provides a dp-scheme
For bounding the local sensitivity of queries in Section 5, it will suffices a coarser notion of multiplicity, at the level of stars rather than triple patterns. The required generalization is straightforward: Let Assume that a graph administrator adopts dp-schema (Star multiplicity).
(Star multiplicity).
and requires that each individual
will have multiplicity
.
Queries
In this section we describe the subset of queries over RDF graphs for which we provide differential privacy, and show how dp-schemes enable a decomposition result for the evaluation of such queries.
Supported queries
We develop differential privacy for counting queries over the SPARQL fragment of basic graph patterns with filter expressions, also known as constrained basic graph pattern (CBGP) [1]. In this fragment, a query is denoted by a pair
For simplicity, we consider only CBGPs that are semantically valid. We also assume that in a graph
Take RDF graph
. Now assume we want to know how many people have a coworker in a company with headquarters in a city with over 20 daily robberies? The query can be cast in terms of the CBGP
Triple patterns in an RDF graph compliant with a dp-schema
which contains the triple pattern
Finally, a user query is a CBGP The query from the previous example can be expressed as
Continuing with the previous example, assume we want to evaluate B (from Example 4.1) over
induces on
Hence, for any query every for any two triple patterns
Because the stars in
For the CBGP from Example 4.1, B is split into four elementary BGPs by dp-schema
, i.e.
, but they are consireded different elementary BGPs because they share predicate
and
, respectively.
If B were extended, e.g. , with triple pattern
, but remain different members of
.)
Note also that the elementary BGP where a triple pattern belongs to is determined by the center of the triple pattern, all triple patterns in the same split must share the same variable or RDF term as their center. The interest in
If
Then, following the terminology defined in [29] for joins between multisets of solution mappings, we can extend the lemma to B as follows:
We have already observed that
We are now in a position to establish differential privacy for SPARQL count and histogram queries.
In this section we develop all the prerequisites to extend Lemma 1 from the relational model to the graph model, in terms of the
Preliminary notions
We begin defining the notion of size and distance between RDF graphs. These are straightforward adaptations of the relational case, where the induced sub-graphs play the role of table rows. Concretely, the size of a graph refers to the number of individuals present in it. Formally, given an RDF graph
Moreover, we say that two graphs are k far apart if one can be obtained from the other by replacing k of its induced sub-graphs. Formally, given a pair of RDF graphs
Finally, this notion of distance between RDF graphs readily induces a notion of local sensitivity (at distance k)
In order not to clutter the presentation, we usually omit the underlying dp-schema when referring to the size of an RDF graph, the distance between a pair of RDF graphs, and the local sensitivity of a
Elastic sensitivity
Our next step is, given a user query Q and an RDF graph
To this end, observe that the naive approach of evaluating the query on every neighbor (at distance k) of
In the remainder of the section we adapt Johnson et al.’s approach to the case of RDF graphs and
In our case, these transformations are given by the CBGP of user queries, more concretely, by their BGP part. We thus introduce the auxiliary notion of BGP elastic stability. A key property of this notion is that it allows bounding the local sensitivity of counting queries: Given a BGP B, its elastic stability at distance k with respect to a graph
The formal definition of elastic stability relies on the frequency of most popular values. More precisely, if
Observe that there might be multiple such triple patterns in B, all of them yielding valid upper bounds for
The counting on the latter query will be greater than (or equal to) the one on the former query and, therefore, a valid (possibly looser, though) upper bound for
To compute the elastic stability of BGP B at distance k of graph
The intuition behind the base case is easy to grasp. For
We are now ready to define the elastic stability of a BGP B at distance k of graph
If the covering star S of
As so defined, the local stability bounds the local sensitivity of counting queries:
For any CBGP query
The main intuition behind the proof is that changes made to a graph to get a new graph at distance 1, are limited to a sub-graph
The proof of this lemma follows the same strategy as the proof in [19, Lemma 2], and is by induction on the length of When In the symmetric case, when
□
We chose the largest of the two values when calculating
Now we have all the prerequisite to define the elastic sensitivity of user queries (at fixed distances of a given graph):
And as in Johnson et al. [19], the above lemma readily leads us to the desired bound for (the three kind of) user queries:
For any user query Q, and any graph
By case analysis on the type of user query Q:
For plain counting queries ( For plain unique counting queries ( For counting queries after grouping (
□
Lemma 5 readily establishes our main result, which allows applying differential privacy to
Assume that our universe of (valid) RDF graphs is composed by the graphs that comply with dp-schema
The theorem follows immediately from Theorem 2 and Lemmas 1 and 5.
Having characterized formally how an algorithm can be implemented to enforce differential privacy on SPARQL queries based on privacy schemes, in this section, we present an empirical evaluation of how the algorithm would behave in real scenarios.
Setup We conducted our evaluation on a 2018 Macbook Pro with 16 GB of RAM memory having installed a Fuseki instance on a 2 AMD Opteron server with an SSD drive and 64 GB of RAM memory. We used Java 1.17 to implement our proof of concept. We also used the
Repository
Data In the evaluation we used real world data and queries from Wikidata [34]. Wikidata is a collaboratively edited knowledge base hosted by the Wikimedia Foundation. It is a common source of data for Wikimedia projects such as Wikipedia, and it has been made available to the general public under a public domain license. Wikidata stores 86,671,701 items (RDF resources), and 1,084,935,969 statements (triples7
Our dp-schema is defined based on three pairwise disjoint stars,
This gathers the instances of the class
Evaluation privacy schema The three stars employed to identify the different entities in our evaluation schema are (modulo variable renaming):
These three stars shouldn’t be used directly as a privacy schema because
Note that this observation suggests a more subtle definition of privacy scheme would be useful.
The Wikidata properties that allow joining data from two different stars are
Table showing key statistics about the data in our privacy schema (the largest schema is by far the Humans star)
We selected 26 queries from the query logs in [18,35] containing the predicates used in our dp-schema. Since the amount of COUNT queries in the query logs is small [4] for each of these queries we added the
We consider star queries which are queries covered by a single star from the dp-schema. These queries can only add filters and remove triple patterns from the star. Therefore, a star query is centered around a single join vertex

Star-shaped query

Linear query

Snowflake query
Results of the execution of Wikidata queries using our differential privacy method. Those queries with sensitivity “1.0” are star queries since the sensitivity of a COUNT query over a single star schema is 1 (a COUNT query over a table), and their elastic stability is “x” as described in Section 5.2 if there are joins between star BGPs, the sensitivity increases based on the stability polynomial, calculated according to Theorem 3
We report the results of our evaluation in Table 2, showing the actual count output by the queries and the average result to these queries with added noise (calculated applying the method described in Section 5). We followed the query schema introduced in Section 5.2, that uses the BGP part of each query to calculate the initial most popular values (mpv). We report the results using two values for ϵ, 0.1 and 1.0. We also report the median error percentage for
(Blue) squares in Fig. 5 represent star queries (typically accessing a single star within the dp-schema), have a very low error (and thus a high utility) compared to the other two types of queries that involved at least one “join” operation across stars.
However, the smaller the result from the

This plot shows that star-shaped queries (blue squares) have the greatest utility, since they are likely not to have joins between stars in the schema. It also shows that queries accessing large amounts of data have high utility. Notice that
The sensitivity and stability columns in Table 2 clearly show that the larger the degree in the stability polynomial the greater the query sensitivity, and thus, the higher the error in the result. Note that in most cases the derived queries produce smaller results than the simpler queries, thus the errors are larger. The only queries where this is not the case are queries
The study of how to guarantee the privacy of individuals contributing personal data to datasets is a long studied problem. In this work we have focused on how to guarantee this privacy in RDF data graphs accessed through SPARQL queries using differential privacy. The related work can be roughly classified into those that provide some privacy guarantees to accesses to data stored in (social) graphs and those that guarantee privacy over the results returned by SPARQL queries. We briefly look over these works in this section.
Privacy over SPARQL
There have been several approaches to address privacy concerns related queries to RDF data. A good survey can be found in [32]. There is a basic anonymization protection that a SPARQL engine must provide to queries that directly return individuals, as opposed to aggregated data. Similar to the case of relational databases where attribute values are anonymized using nulls, the work presented in [13] uses blank nodes to hide sensitive data. Delanoux et al. [7] introduce a more general framework with formal soundness guarantees for privacy policies that describe information that should be hidden as well as utility policies that describe information that should be available. The framework checks whether policies are compatible with each other, and based on a set of basic update queries that use blank nodes and deletions of triples, automatically derives from the policies candidate sets of anonymization operations that guarantee to transform any input dataset into a dataset satisfying the required policies. However, their soundness guarantees do not imply any formal privacy guarantees. Two early methods developed for privacy protection when answering queries about classes in a dataset are k-anonymity and l-diversity. In particular, k-anonymity is used in [17,31] to answer queries in RDF datasets. Unfortunately, it is well-known these methods, in contrast to differential privacy, do not provide formal guarantees for privacy.
The only work known to us that directly applies differential privacy to SPARQL queries is [32]. But surprisingly, differential privacy is realized through local sensitivity alone without the use of a smoothing function necessary for correctness [12]. A privacy-preserving query language for RDF streams is introduced in [8]. Limiting queries to that language servers can continuously release privacy-preserving histograms (or distributions) from online streams. Han et al. [15] provide differentially-private variants of the algorithms TransE and RESCAL, aimed at constructing knowledge graph embeddings in the form of real-valued vectors. While the authors show that these encodings allow performing some analyses with a reasonable privacy-utility tradeoff, inlcuding clustering and link prediction, it is an open question whether this generalizes to further analyses or counting queries as addressed in the current article.
Privacy in social graphs
A central task to the development of any practical differentially private analysis tool is finding appropriate approximations and alternatives to global sensitivity: it should be easy to calculate, and, at the same time, close enough to the real sensitivity to allow the computation of statistically useful results. A well-known approach is to rely on the concept of restricted sensitivity [3]. Restricted sensitivity is tailored to provide privacy guarantees assuming datasets come from a specific subgroup of the universe of all possible datasets, and it was introduced in the context of social-graph data analysis. There are two natural notions from which one can define adjacency of graphs: differences on edges and differences on vertices. The distance between two graphs,
Conclusions
In this paper we have introduced a framework to develop differential privacy tools for RDF data repositories. We have used the framework to develop an
We have implemented our algorithm and tested it using the Wikidata RDF database, queries from its log files and other example queries found at the Wikidata endpoint. The simulations show the approach to be effective for queries over large repositories, such as Wikidata, and in many cases for queries within the 10 of thousands answers to aggregate. However, even though elastic sensitivity has been designed to bind the stability of joins, the sensitivity of a query with joins can still be very high. As in the case of SQL queries in relational databases, in order to keep the noise in SPARQL queries under a single percentage digit, query results should have over 1M tuples and
There are many pending issues to address. We can still apply several optimizations to our framework. For example, public graphs can be treated as public tables. If they participate in joins, we can directly use their most popular result mappings during calculation of the query sensitivity. From the more practical point of view, more operations need to be implemented. We can consider the approaches described in [19] for SQL to add aggregation functions like sum and averages to our framework. There are also issues to consider about the impact that such algorithms will have on SPARQL query engines. From the more formal side, it is still important to keep searching for better approximations of local and global sensitivities as well as alternative definitions that are less onerous than differential privacy. One possibility is to find a way to apply restricted sensitivity to more types of queries by adding more semantic information to a dp-schema. It might also be possible to find a more accurate approximation for the elastic sensitivity of
Footnotes
Acknowledgements
We thank the reviewers for their thorough work revising this paper, which improved the overall quality of the paper. Carlos Buil-Aranda was supported by Fondecyt Iniciacion 11170714 and by ANID – Millennium Science Initiative Program – Code ICN17_002. Jorge Lobo was partially supported by the Spanish Ministry of Economy and Competitiveness under Grant Numbers: TIN-2016-81032-P, MDM-2015-052, and the U.S. Army Research Office under Agreement Number W911NF1910432. Federico Olmedo was also supported by ANID – Millennium Science Initiative Program – Code ICN17_002.
Query characteristics
In this Section we present the characteristics of the queries we used in Section 6. The table presents the query ID (which refer to queries Section B in this Appendix), the stars within the Privacy Schema we defined in 6, the query shape, the amount of tripe patterns in the queries as well as the number of variables, the join variables when applicable, including the amount of mappings for each join variable, and data about the Privacy Schema stars in the query.
Queries
In this Section we present the queries we used in Section 6. There are 11 base queries with several variations, totaling 26 SPARQL COUNT queries.
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
Query
