Abstract
We define a procedure for canonicalising SPARQL 1.1 queries. Specifically, given two input queries that return the same solutions modulo variable names over any RDF graph (which we call congruent queries), the canonicalisation procedure aims to rewrite both input queries to a syntactically canonical query that likewise returns the same results modulo variable renaming. The use-cases for such canonicalisation include caching, optimisation, redundancy elimination, question answering, and more besides. To begin, we formally define the semantics of the SPARQL 1.1 language, including features often overlooked in the literature. We then propose a canonicalisation procedure based on mapping a SPARQL query to an RDF graph, applying algebraic rewritings, removing redundancy, and then using canonical labelling techniques to produce a canonical form. Unfortunately a full canonicalisation procedure for SPARQL 1.1 queries would be undecidable. We rather propose a procedure that we prove to be sound and complete for a decidable fragment of monotone queries under both set and bag semantics, and that is sound but incomplete in the case of the full SPARQL 1.1 query language. Although the worst case of the procedure is super-exponential, our experiments show that it is efficient for real-world queries, and that such difficult cases are rare.
Introduction
The Semantic Web provides a variety of standards and techniques for enhancing the machine-readability of Web content in order to increase the levels of automation possible for day-to-day tasks. RDF [54] is the standard framework for the graph-based representation of data on the Semantic Web. In turn, SPARQL [24] is the standard querying language for RDF, composed of basic graph patterns extended with expressive features that include path expressions, relational algebra, aggregation, federation, among others.
The adoption of RDF as a data model and SPARQL as a query language has grown significantly in recent years [4,26]. Prominent datasets such as DBpedia [35] and Wikidata [61] contain in the order of hundreds of millions or even billions of RDF triples, and their associated SPARQL endpoints receive millions of queries per day [37,52]. Hundreds of other SPARQL endpoints are likewise available on the Web [4]. However, a survey carried out by Buil-Aranda et al. [4] found that a large number of SPARQL endpoints experience performance issues such as latency and unavailability. The same study identified the complexity of SPARQL queries as one of the main causes of these problems, which is perhaps an expected result given the expressivity of the SPARQL query language where, for example, the decision problem consisting of determining if a solution is given by a query over a graph is
One way to address performance issues is through caching of sub-queries [41,62]. The caching of queries is done by evaluating a query, then storing its result set, which can then be used to answer future instances of the same query without using any additional resources. The caching of sub-queries identifies common query patterns whose results can be returned for queries that contain said query patterns. However, this is complicated by the fact that a given query can be expressed in different, semantically equivalent ways. As a result, if we are unable to verify if a given query is equivalent to one that has already been cached, we are not using the cached results optimally: we may miss relevant results.
Ideally, for the purposes of caching, we could use a procedure to canonicalise SPARQL queries. To formalise this idea better, we call two queries equivalent if (and only if) they return the same solutions over any RDF dataset. Note however that this notion of equivalence requires the variables of the solutions of both queries to coincide. In practice, variable names will often differ across queries, where we would still like to be able to cache and retrieve the results for queries whose results are the same modulo variable names. Hence we call two queries congruent if they return the same solutions, modulo variable names, over any RDF dataset; in other words, two queries are congruent if (and only if) there exists a one-to-one mapping from the variables in one query to the variables of the other query that makes the former equivalent to the latter.
In this paper, we propose a procedure by which congruent SPARQL queries can be “canonicalised”. We call such a procedure sound if the output query is congruent to the input query, and complete if the same output query is given for any two congruent input queries.
Consider the following two SPARQL queries asking for names of aunts: 
Both queries are equivalent: they always return the same results for any RDF dataset. Now rather consider a third SPARQL query:
Note that the pattern
Canonicalisation aims to rewrite all three (original) queries to a unique, congruent, output query.
The potential use-cases we foresee for a canonicalisation procedure include the following:
As aforementioned, a canonicalisation procedure can improve caching for SPARQL endpoints. By capturing knowledge about query congruence, canonicalisation can increase the cache hit rate. Similar techniques could also be used to identify and cache frequently appearing (congruent) sub-queries [41].
In a conceptually similar use case to caching, our canonical procedure can be used to describe views [9]. In particular, the canonicalisation procedure can be used to create a key that uniquely identifies each of the views available.
SPARQL endpoint logs can be analysed in order to understand the importance of different SPARQL features [7,52], to build suitable benchmarks [52], to understand how users build queries incrementally [7,64], etc. Our canonicalisation procedure could be used to pre-process and group congruent queries in logs.
Canonicalisation may help with query optimisation by reducing the variants to be considered for query planning, detecting duplicate sub-queries that can be evaluated once, removing redundant patterns (as may occur under query rewriting strategies for reasoning [33]), etc.
Canonicalisation can reduce superficial variance in queries used to train machine learning models. For example, recent question answering systems learn translations from natural language questions to queries [10], where canonicalisation can be used to homogenise the syntax of queries used for training.
Other possible but more speculative use-cases involve signing or hashing SPARQL queries, discovering near-duplicate or parameterised queries (by considering constants as variables), etc. Furthermore, with some adaptations, the methods proposed here could be generalised to other query languages, such as to canonicalise SQL queries, Cypher queries [3], etc.
A key challenge for canonicalising SPARQL queries is the prohibitively high computational complexity that it entails. More specifically, the query equivalence problem takes two queries and returns true if and only if they return the same solutions for any dataset, or false otherwise. In the case of SPARQL, this problem is intractable (
With these limitations in mind, we propose a canonicalisation procedure that is always sound, but only complete for a monotone fragment of SPARQL under set or bag semantics. This monotone fragment permits unions and joins over basic graph patterns, some examples of which were illustrated in Example 1.1. We further provide sound, but incomplete, canonicalisation of the full SPARQL 1.1 query language, whereby the canonicalised query will be congruent to the input query, but not all pairs of congruent input queries will result in the same output query. In the case of incomplete canonicalisation, we are still able to find novel congruences, in particular through canonical labelling of variables, which further allows for ordering operands in a consistent manner. Reviewing the aforementioned use-cases, we believe that this “best-effort” form of canonicalisation is still useful, as in the case of caching, where missing an equivalence will require re-executing the query (which would have to be done in any case), or in the case of learning over queries, where incomplete canonicalisation can still increase the homogeneity of the training examples used.
As a high-level summary, our procedure combines four main techniques for canonicalisation.
The first technique is to convert SPARQL queries to an algebraic graph, which abstracts away syntactic variances, such as the ordering of operands for operators that are commutative, and the grouping of operands for operators that are associative.
The second technique is to apply algebraic rewritings on the graph to achieve normal forms over combinations of operators. For example, we rewrite monotone queries – that allow any combination of join, union, basic graphs patterns, etc. – into unions of basic graph patterns; this would rewrite the first and third queries shown in Example 1.1 into a form similar to the second query.
The third technique is to apply redundancy elimination within the algebraic graph, which typically involves the removal of elements of the query that do not affect the results; this technique would remove the redundant
The fourth and final technique is to apply a canonical labelling of the algebraic graph, which will provide consistent labels to variables, and which in turn allows for the (unordered) algebraic graph to be serialised back into the (ordered) concrete syntax of SPARQL in a canonical way.
We remark that the techniques do not necessarily follow the presented order; in particular, the second and third techniques can be interleaved in order to provide further canonicalisation of queries.
This paper extends upon our previous work [50] where we initially outlined a sound and complete procedure for canonicalising monotone SPARQL queries. The novel contributions of this extended paper include:
We close a gap involving translation of monotone queries under bag semantics that cannot return duplicates into set semantics.
We provide a detailed semantics for SPARQL 1.1 queries; formalising and understanding this is a key prerequisite for canonicalisation.
We extend our algebraic graph representation in order to be able to represent SPARQL 1.1 queries, offering partial canonicalisation support.
We implement algebraic rewriting rules for specific SPARQL 1.1 operators, such as those relating to filters; we further propose techniques to canonicalise property path expressions.
We provide more detailed experiments, which now include results over a Wikidata query log, a comparison with existing systems from the literature that perform pairwise equivalence checks, and more detailed stress testing.
We also provide extended proofs of results that were previously unpublished [51], as well as providing extended discussion and examples throughout.
The outline of the paper is then as follows. Section 2 provides preliminaries for RDF, while Section 3 provides a detailed semantics for SPARQL. Section 4 provides a problem statement, formalising the notion of canonicalisation. Section 5 discusses related works in the areas of systems that support query containment, equivalence, and congruence. Sections 6 and 7 discuss our SPARQL canonicalisation framework for monotone queries, and SPARQL 1.1, respectively. Section 8 presents evaluation results. Section 9 concludes.
We begin by introducing the core concepts of the RDF data model over which the SPARQL query language will later be defined. The following is a relatively standard treatment of RDF, as can be found in various papers from the literature [22,28]. We implicitly refer to RDF 1.1 unless otherwise stated.
Terms and triples
RDF assumes three pairwise disjoint sets of terms: IRIs (
In this paper we concatenate set names to denote their union; e.g.,
In this paper we use Turtle/SPARQL-like syntax, where
An RDF graph G is a set of RDF triples. It is called a graph because each triple
Simple entailment and equivalence
Blank nodes in RDF have special meaning; in particular, they are considered to be existential variables. The notion of simple entailment [22,25] captures the existential semantics of blank nodes (among other fundamental aspects of RDF). This same notion also plays a role in how the SPARQL query language is defined.
Formally, let
Deciding simple entailment
We remark that the RDF standard defines further entailment regimes that cover the semantics of datatypes and the special RDF and RDFS vocabularies [25]; we will not consider such entailment regimes here.
Isomorphism
Given that blank nodes are defined as existential variables [25], two RDF graphs differing only in blank node labels are thus considered isomorphic [19,28].
Formally, if a blank node mapping of the form
Deciding the isomorphism
Leanness and core
Existential blank nodes may give rise to redundant triples. In particular, an RDF graph G is called lean if and only if there does not exist a proper subgraph
The core of an RDF graph G is then an RDF graph
Deciding whether or not an RDF G is lean is known to be
Merge
Blank nodes are considered to be scoped to a local RDF graph. Hence when combining RDF graphs, applying a merge (rather than union) avoids blank nodes with the same name in two (or more) graphs clashing. Given two RDF graphs G and
SPARQL 1.1 semantics
We now define SPARQL 1.1 in detail [24]. We will begin by defining a SPARQL dataset over which queries are evaluated. We then introduce an abstract syntax for SPARQL queries. Thereafter we discuss the evaluation of queries under different semantics.
These definitions extend similar preliminaries found in the literature. However, our definitions of the semantics of SPARQL 1.1 extend beyond the core of the language and rather aim to be exhaustive, where a clear treatment of the full language is a prerequisite for formalising the canonicalisation of queries using the language. Table 1 provides a summary of prior works that have defined the semantics of SPARQL features. We exclude works that came after one of the works shown and use a subset of the features of that work (even if they may contribute novel results about those features). Some SPARQL 1.0 features, such as
Studies that define the semantics of features in SPARQL (1.1), including Mon otone (basic graph patterns, joins, UNION , un-nested SELECT DISTINCT ), Filt ers, Opt ionals, Neg ation (OPTIONAL & !BOUND , MINUS , FILTER (NOT) EXISTS ), N amed Gra phs (GRAPH , FROM (NAMED) ), Path s, Fed eration (SERVICE ), Ass ign ment (BIND , VALUES ), Agg regation (GROUP BY and aggregate functions), Sub Q ueries (nested SELECT ), Sol ution M odifiers (LIMIT , OFFSET , ORDER BY ), Query Form s (CONSTRUCT , ASK , DESCRIBE ), Exp ressions and Functions (e.g., + , BOUND , COUNT , IF ), Bag Semantics; we denote by “*” partial definitions or discussion
Studies that define the semantics of features in SPARQL (1.1), including
SPARQL property path syntax
Before we introduce an abstract syntax for SPARQL queries, we provide some preliminaries:
A triple pattern
A basic graph pattern B is a set of triple patterns. We denote by
A path pattern
A navigational graph pattern N is a set of paths patterns and triple patterns (with variable predicates). We denote by
A term in
An aggregation function ψ is a function that takes a bag of tuples from
If R is a built-in expression, and Δ is a boolean value indicating ascending or descending order, then
We then define the abstract syntax of a SPARQL query as shown in Table 3. Note that we abbreviate
For brevity, we consider the following SPARQL 1.1 operators to be represented as functions:
boolean operators: equality and inequality operators: numeric operators: unary
for example, replacing
We combine
A query such as
Aggregates without grouping can be expressed with
Some aggregation functions in SPARQL take additional arguments, including a
SPARQL allows
In the concrete syntax,
In our proposed abstract syntax and the concrete syntax, the ordering of variables in the
In the concrete syntax,
We use
We do not consider
Aside from the latter point, these exceptions are syntactic conveniences that help simplify later definitions.
Abstract SPARQL syntax
SPARQL allows for indexing and querying more than one RDF graph, which is enabled through the notion of a SPARQL dataset. Formally, a SPARQL dataset
Services
While the SPARQL standard defines a wide range of features that compliant services must implement, a number of decisions are left to a particular service. First and foremost, a service chooses what dataset to index. Along these lines, we define a SPARQL service as a tuple
D is a dataset;
⩽ is a total ordering of RDF terms and ⊥;
We will denote by
Query evaluation
The semantics of a SPARQL query Q can be defined in terms of its evaluation over a SPARQL dataset D, denoted
Solution mappings
A solution mapping μ is a partial mapping from variables in
We say that two solution mappings
Given two compatible solution mappings
Given a solution mapping μ and a triple pattern t, we denote by
Blank nodes in SPARQL queries can likewise play a similar role to variables though they cannot form part of the solution mappings. Given a blank node mapping α, we denote by
Finally, we denote by
Set vs. bag vs. sequence semantics
SPARQL queries can be evaluated under different semantics, which may return a set of solution mappings M, a bag of solution mappings
Given a solution mapping μ and a bag of solution mappings
Given a sequence
Next we provide some convenient notation to convert between sets, bags and sequences. Given a sequence
We continue by defining the semantics of SPARQL queries under set semantics. Later we cover bag semantics, and subsequently discuss aggregation features. Finally we present sequence semantics.
Query patterns: Set semantics
Query patterns evaluated under set semantics return sets of solution mappings without order or duplicates. We first define a set algebra of operators and then define the set evaluation of SPARQL graph patterns.
Set algebra
The SPARQL query language can be defined in terms of a relational-style algebra consisting of unary and binary operators [18]. Here we describe the operators of this algebra as they act on sets of solution mappings. Unary operators transform from one set of solution mappings (possibly with additional arguments) to another set of solution mappings. Binary operators transform two sets of solution mappings to another set of solution mappings. In Table 4, we define the operators of this set algebra. This algebra is not minimal: some operators (per, e.g., the definition of left-outer join) can be expressed using the other operators.
Set algebra, where M,
, and
are sets of solution mappings; V is a set of variables and v is a variable; and R is a built-in expression
Set algebra, where M,
Given an RDF graph G, we define the set of terms appearing as a subject or object in G as follows:
Path expressions where G is an RDF graph, p,
are IRIs, and e,
,
are path expressions
Path expressions where G is an RDF graph, p,
Given a navigational graph pattern N, we denote by
The
We remark that
The set evaluation of a SPARQL graph pattern transforms a SPARQL dataset D into a set of solution mappings. The base evaluation is given in terms of
Set evaluation of graph patterns where D is a dataset; B is a basic graph pattern; N is a navigational graph pattern; Q,
and
are graph patterns; V is a set of variables; R is a built-in expression; v is a variable; M is a set of solution mappings; and x is an IRI
Set evaluation of graph patterns where D is a dataset; B is a basic graph pattern; N is a navigational graph pattern; Q,
Query patterns evaluated under bag semantics return bags of solution mappings. Like under set semantics, we first define a bag algebra of operators and then define the bag evaluation of SPARQL graph patterns.
Bag algebra
The bag algebra is analogous to the set algebra, but further operates over the multiplicity of solution mappings. We define this algebra in Table 7.
Bag algebra where
,
, and
are bags of solution mappings; μ,
,
and
are solution mappings; V is a set of variables and v is a variable; and R is a built-in expression; for legibility, we use iverson bracket notation where
if and only if ϕ holds; otherwise, if ϕ is false or undefined, then
Bag algebra where
The bag evaluation of a graph pattern is based on the bag evaluation of basic graph patterns and navigational graph patterns, as defined in Table 8, where the multiplicity of each individual solution is based on how many blank node mappings satisfy the solution. With the exceptions of
Bag evaluation of graph patterns where D is a dataset; B is a basic graph pattern; N is a navigational graph pattern;
,
are graph patterns
Bag evaluation of graph patterns where D is a dataset; B is a basic graph pattern; N is a navigational graph pattern;
Bag evaluation of navigational patterns where D is a dataset, N is a navigational pattern, and x is a fresh blank node
Let
Aggregation algebra
We define an aggregation algebra in Table 10 under bag semantics with four operators that support the generation of a set of solution groups (aka., group by), the selection of solution groups (aka., having), the binding of new variables in the key of the solution group, as well as the flattening of a set of solution groups to a set of solution mappings by projecting their keys. Note that analogously to the notation for built-in expressions, given an aggregation expression A, we denote by
Aggregation algebra under bag semantics, where
and
are bags of solution mappings; V is a set of variables and v is a variable; A is an aggregation expression; and
is a set of solution groups; we recall that M is the set of all solution mappings
Aggregation algebra under bag semantics, where
We can use the previously defined aggregation algebra to define the semantics of group-by patterns in terms of their evaluation, per Table 11.
Evaluation of group-by patterns where D is a dataset, Q is a graph pattern or group-by pattern, V is a set of variables,
are variables, and
are aggregation expressions
Evaluation of group-by patterns where D is a dataset, Q is a graph pattern or group-by pattern, V is a set of variables,
Sequence patterns return sequences of solutions as their output, which allow duplicates and also maintain an ordering. These sequence patterns in general refer to solution modifiers that allow for ordering solutions, slicing the set of solutions, and removing duplicate solutions. We will again first define an algebra before defining the evaluation of sequence patterns.
Sequence algebra
Sequences deal with some ordering over solutions. We assume a total order ⩽ over
otherwise let j denote the least value
In Table 12, we present an algebra composed of three operators for ordering sequences of solutions based on order comparators, and removing duplicates.
Sequence algebra, where
and
are sequences of solutions, and Ω is a non-empty sequence of order comparators
Sequence algebra, where
Using the sequence algebra, we can then define the evaluation of sequence patterns as shown in Table 13.
Evaluation of sequence patterns where Ω is a non-empty sequence of order comparators, and k is a (non-zero) natural number
Evaluation of sequence patterns where Ω is a non-empty sequence of order comparators, and k is a (non-zero) natural number
We now characterise different types of variables that may appear in a graph pattern in terms of being always bound, never bound, or sometimes bound in the solutions to the graph pattern. This characterisation will become important for rewriting algebraic expressions [53]. Specifically, letting Q denote a graph pattern, recall that we denote by we denote by we denote by
Put more simply,
It may be tempting to think that
Consider the query (pattern) Q:
Now:
Unfortunately, given a graph pattern Q, deciding if
The observant reader may have noticed that in Table 6 and Table 8, in the definitions of
We will start with a case that is not semantically ambiguous. Take a query:
To evaluate it on a dataset D, we take each solution
We next take an example of a query that is syntactically valid, but semantically ill-defined.
Given a solution μ from the left, if we follow the standard literally and replace “every occurrence of a variable v in [the right pattern] by
which is syntactically invalid.
A number of similar issues arise from ambiguities surrounding substitution, and while work is underway to clarify this issue, at the time of writing, various competing proposals are being discussed [32,42,55]. We thus postpone rewriting rules for negation until a standard semantics for substitution is agreed upon.
A query accepts a set, bag or sequence of solution modifiers, depending on the semantics selected (and features supported). In the case of a
Evaluation of queries where D is a dataset, Q is a sequence pattern or graph pattern, V is a set of variables, B is a basic graph pattern, and X is a set of IRIs and/or variables
Evaluation of queries where D is a dataset, Q is a sequence pattern or graph pattern, V is a set of variables, B is a basic graph pattern, and X is a set of IRIs and/or variables
Queries are evaluated on a SPARQL dataset D, where dataset modifiers allow for changing the dataset considered for query evaluation. First, let X and
Evaluation of dataset modifiers where X and
are sets of IRIs
Evaluation of dataset modifiers where X and
A number of features can lead to non-determinism in the evaluation of graph patterns as previously defined. When such features are used, there may be more than one possible valid result for the graph pattern on a dataset. These features are as follows:
Built-in expressions and aggregation expressions may rely on non-deterministic functions, such as
The use of sequence patterns without an explicit
In non-deterministic cases, we can say that
Relationships between the semantics
The SPARQL standard is defined in terms of sequence semantics, i.e., it is assumed that the solutions returned have an order. However, unless the query explicitly uses the sequence algebra (and in particular
Query containment and equivalence
Query containment states that the results of one graph pattern are contained in the other. To begin, take two deterministic graph patterns
If
Query equivalence is a relation between graph patterns that states that the results of one graph pattern are equal to the other. Specifically, given two graph patterns
For example, if
In Fig. 1 we provide examples of query containment and equivalence. The leftmost query finds the maternal grandparents of

Examples of query containment and equivalence.

Example of query congruence.
Regarding equivalence of non-deterministic graph patterns, we highlight that any change to the possible space of results leads to a non-equivalent graph pattern. For example, for a graph pattern Q, it holds that
While the previous discussion refers to graph patterns (which may include use of (sub)
Many use-cases for canonicalisation prefer not to distinguish queries that are equivalent up to variable names. We call a one-to-one variable mapping
We provide an example of non-equivalent but congruent queries in Fig. 2. If we rewrite the variable
Like equivalence and isomorphism, congruence is reflexive (
Based on a query of the form
basic graph patterns (BGPs): Q is a BGP and
unions of basic graph patterns (UBGPs): Q is a graph pattern using BGPs and
conjunctive queries (CQs): Q is a BGP.
unions of conjunctive queries (UCQs): Q is a graph pattern using BGPs and
monotone queries (MQs): Q is a graph pattern using BGPs,
non-monotone queries (NMQs): Q is a graph pattern using BGPs,
navigational graph patterns (NGPs): Q is an NGP and
unions of navigational graph patterns (UNGPs): Q is a graph pattern using NGPs and
conjunctive path queries (CPQs): Q is an NGP.
unions of conjunctive path queries (UCPQs): Q is a graph pattern using NGPs and
monotone path queries (MPQs): Q is a graph pattern using NGPs,
non-monotone path queries (NMPQs): Q is a graph pattern using NGPs,
These query classes are evaluated on an RDF graph (the default graph) rather than an RDF dataset, though results extend naturally to the RDF dataset case. Likewise, since we do not consider the sequence algebra, we have the (meaningful) choice of set or bag semantics under which to consider the tasks; furthermore, since the aggregation algebra is not considered, set and distinct-bag semantics coincide.
Unlike UCQs, which are strictly unions of joins (expressed as basic graph patterns), MQs further permit joins over unions. As such, UCQs are analogous to a disjunctive normal form. Though any monotone query (under set semantics) can be rewritten to an equivalent UCQ, certain queries can be much more concisely expressed as MQs versus UCQs, or put another way, there exist MQs that are exponentially longer when rewritten as UCQs. For example, the first three queries of Fig. 1 are MQs, but only the third is a UCQ; if we use a similar pattern as the third query to go search back n generations, then we would require
CPQs and UCPQs are closely related to the query fragments of conjunctions of 2-way regular paths queries (C2RPQs) and unions of conjunctions of 2-way regular paths queries (UC2RPQs), but additionally allow negated property sets and variables in the predicate position [34]. NMQs are semantically related to the fragment with BGPs, projection,
We here consider four decision problems:
Given a solution μ, a query Q and a graph G, is
Given two queries
Given two queries
Given two queries
In Table 16, we summarise known complexity results for these four tasks considering both bag and set semantics along with a reference for the result. The results refer to combined complexity, where the size of the queries and data (in the case of
We assume that built-in and aggregation expressions can be evaluated using at most polynomial space, as is the case for SPARQL.
Blank nodes act like projection, where their complexity then follows that of *CQs and *CPQs.
Complexity of SPARQL tasks on core fragments (considering combined complexity for
An asterisk implies that the result is not explicitly stated, but trivially follows from a result or technique used. These cases include analogous results for relational settings, upper-or-lower bounds from tasks with obvious reductions to or from the stated problem, etc. We may omit references in case a result directly follows from other results in the table. A less obvious case is that of
This is also known as bag–set semantics, where the data form a set of tuples, but the query is evaluated under bag semantics [12].
Some of the more notable results include:
The decidability of
Although UCQ and MQ classes are semantically equivalent (each UCQ has an equivalent MQ and vice versa), under set semantics the problems of
While
These results – in particular those of
With these definitions in hand, we now state the problem we wish to address: given a query Q, we wish to compute a canonical form of the query
With this canonicalisation procedure, we can decide the congruence
Indeed, even in the case of MQs, deciding
Related works
In this section, we discuss implementations of systems relating to containment, equivalence and canonicalisation of SPARQL queries.
A number of systems have been proposed to decide the containment of SPARQL queries. Among these, Letelier et al. [36] propose a normal form for quasi-well-designed pattern trees – a fragment of SPARQL allowing restricted use of
Some systems have proposed isomorphism-based indexing of sub-queries. In the context of a caching system, Papailiou et al. [41] apply a canonical labelling algorithm (specifically Bliss [31]) on BGPs in order to later find isomorphic BGPs with answers available; their approach further includes methods for generalising BGPs such that it is more likely that they will be reused later. More recently, Stadler et al. [57] propose a system called JSAG for solving the containment of SPARQL queries. The system computes normal forms for queries, before representing them as a graph and applying subgraph isomorphism algorithms to detect containments. Such approaches do not discuss completeness, and would appear to miss containments for CQs under set semantics (and distinct-bag semantics), which require checking for homomorphisms rather than (sub-graph) isomorphisms.
We remark that in the context of relational database systems, there are likewise few implementations of query containment, equivalence, etc., as also observed by Chu et al. [15,16], who propose two systems for deciding the equivalence of SQL queries. Their first system, called Cosette [16], translates SQL into logical formulae, where a constraint solver is used to try to find counterexamples for equivalence; if not found, a proof assistant is used to prove equivalence. Chu et al. [15] later proposed the UDP system, which expresses SQL queries – as well as primary and foreign key constraints – in terms of unbounded semiring expressions, thereafter using a proof assistant to test the equivalence of those expressions; this approach is sound and complete for testing the equivalence of UCQs under both set and bag semantics. Zhou et al. [65] recently propose the EQUITAS system, which converts SQL queries into FOL-based formulae, reducing the equivalence problem to a satisfiability-modulo-theories (SMT) problem, which allows for capturing complex selection criteria (inequalities, boolean expressions, cases, etc.). Aside from targeting SQL, a key difference with our approach is that such systems apply pairwise checks.
In summary, while problems relating to containment and equivalence have been well-studied in the theoretical literature, relatively few practical implementations have emerged, perhaps because of the high computational costs, and indeed the undecidability results for the full SPARQL/SQL language. Of those that have emerged, they either offer sound and complete checks in a pairwise manner for query fragments, such as UCQs (e.g., [15]), or they offer sound but incomplete canonicalisation focused on isomorphic equivalence (e.g., [41]). To the best of our knowledge, the approach that we propose here, which we call QCan, is the only one that allows for canonicalising queries with respect to congruence, and that is sound and complete for monotone queries under both set and bag semantics. Our experiments will show that despite high theoretical computational complexity, QCan can be deployed in practice to detect congruent equivalence classes in large-scale, real-world query logs or streams, which are dominated by relatively small and simple queries.
Canonicalisation of monotone queries
In this section, we will first describe the different steps of our proposed canonicalisation process for monotone queries (MQs), i.e., queries with basic graph patterns, joins, unions, outer projection and distinct (see Section 3.17). In fact, we consider a slightly larger set of queries that we call extended monotone queries (EMQs), which are monotone queries that additionally support property paths using the (non-recursive) features “
As mentioned in the introduction, the canonicalisation process consists of: algebraic rewriting of parts of the query into normal forms, the representation of the query as a graph, the minimisation of the monotonic parts of the query by leaning and containment checks, the canonical labelling of the graph, and finally the mapping back to query syntax. We describe these steps in turn and then conclude the section by proving that canonicalisation is sound and complete for EMQs.
UCQ normalisation
In this section we describe the rules used to rewrite EMQs into a normal form based on unions of conjunctive queries (UCQs). We first describe the steps we apply for rewriting property paths into monotone features (where possible), thus converting EMQs into MQs. We then describe the rewriting of MQs into UCQ normal form. We subsequently describe some postprocessing of variables to ensure that those with the same name are correlated and that variables that are always unbound are removed. Finally we extend the normal form to take into account set vs. bag semantics.
Property path elimination
Per Table 9, property paths that can be rewritten to joins and unions are considered to be equivalent to their rewritten form under both bag and set semantics. We make these equivalences explicit by rewriting such property paths to joins and unions; i.e.:
Consider the following query based on Example 1.1 looking for names of aunts.
This query will be rewritten to:
And then recursively to:
In this case we succeed in removing all property paths; however, property paths with
Pérez et al. [44] establish that, under set semantics, joins and unions in SPARQL are commutative and associative, and that joins distribute over unions. We summarise these results in Table 17. Noting that under set semantics, the multiplicity of joins and unions is given by the multiplication and addition of natural numbers, respectively; that both multiplication and addition are commutative and associative; and that multiplication distributes over addition; the same results also apply under bag semantics.
Equivalences given by Pérez et al. [44] for set semantics
Equivalences given by Pérez et al. [44] for set semantics
Another (folklore) result of interest is that BGPs can be rewritten to equivalent joins of their triple patterns. However, care must be taken when considering blank nodes in BGPs; otherwise the same blank node in two different triple patterns might be matched to two different terms, breaking the equivalence. Along these lines, let
These known results give rise to a UCQ normal form for MQs [44]. More specifically, given a pattern
Given the aforementioned equivalences, the arguments of
The UCQ normal form for MQs is then of the form
We show a case where the multiplicity of union operands changes the multiplicity of results under bag semantics. Consider the following MQ Q:
Assume a dataset D with a default graph
If we rewrite this query to a UCQ, in the first step, pushing the first join inside the union, we generate:
We may now describe the multiplicity of μ as
The multiplicity of this query is described as
The multiplicity is then
As was previously mentioned, the UCQ normal form may be exponentially larger than the original MQ; for example, a relatively concise EMQ of the form
Let us take the output query of Example 6.1 and apply the UCQ normal form.
Blank nodes are rewritten to variables, and then join is distributed over union, giving the following query:
If we were to consider the names of aunts or uncles (
We recall that a graph pattern Q is considered unsatisfiable if and only if there does not exist a dataset D such that
Let Q denote a BGP. Q is unsatisfiable if and only if it contains a literal subject.
Please see Appendix A.1.1 for the proof.
Moving to UCQs, it is not difficult to see that a union is satisfiable if and only if one of its operands is satisfiable, or, equivalently, that it is unsatisfiable if and only if all of its operands are unsatisfiable.
Let
Please see Appendix A.1.2 for the proof.
To deal with CQs of the form Take the CQ: Next take the UCQ:
We replace this with the canonical query: 
We will rewrite this to
by removing the unsatisfiable operand.
The same variable may sometimes occur in multiple query scopes such that replacing an occurrence of the variable in one scope with a fresh variable does not change the query results. We say that such variable occurrences are not correlated. There is one case where this issue may arise in UCQs. We call a variable v a union variable if it occurs in a union
If considering the direct results of a query pattern, then the naming of variables matters as they are bound in the solutions.
Let
Please see Appendix A.1.3 for the proof.
These non-correlated variables give rise to non-trivial equivalences based on the “false” correspondences between variables with the same name that have no effect on each other. We address such cases by differentiating union variables that appear in multiple operands of the union but not outside the union.
We take the output of Example 6.3:
The variable
The resulting query is equivalent to the original query, but avoids “false correspondences” of variables.
Next we apply a simple rule to remove variables that are always unbound in projections. Left unattended, such variables could otherwise constitute a trivial counterexample for the completeness of canonicalisation. We recall from Section 3.9 the notation
Let Q be a graph pattern, let
Please see Appendix A.1.4 for the proof.
We deal with such cases by removing variables that are always unbound from the projection.11
In concrete SPARQL syntax, a
Take a query:
We can remove the variable
The presence or absence of
Let Q denote a satisfiable BGP. It holds that:
Please see Appendix A.1.5 for the proof.
The second case involves unions.
Let
Please see Appendix A.1.6 for the proof.
The same equivalences trivially hold for
The choice to add rather than remove
Take a query such as:
Since the query is a BGP with all variables projected and no blank nodes, no duplicates can be produced, and thus we can add a
An example of a case involving union is as follows:
First we note that the individual basic graph patterns forming the operands of the
Given an EMQ Q, we denote by
Graph representation
Given an EMQ as input, the previous steps either terminate with a canonical unsatisfiable query, or provide us with a satisfiable query in UCQ normal form, with blank nodes replaced by fresh variables, non-correlated variables differentiated, variables that are always unbound removed from the projection (while ensuring that the projection is non-empty), and the
Consider the following UCQs: the ordering of triple patterns within BGPs; the ordering of BGPs within the UCQ; the naming of variables; a redundant triple pattern in each BGP of the first query (those containing the redundant third BGP in the second query.
These queries are congruent, but differ in:
We are left to canonicalise such variations.
Definitions for representational graphs
Our overall approach to address such variations is to encode queries as RDF graphs that we call representational graphs (r-graphs). This representation will allow for identifying and removing redundancies, and for canonically labelling variables such that elements of the query can be ordered deterministically.
We first establish some notation. Let if if if x is a natural number then if x is a boolean value then otherwise
We assume that natural numbers and boolean values produce datatype literals in canonical form (for example, we assume that
Table 18 then provides formal definitions for transforming a UCQ Q in abstract syntax into its r-graph
are replaced with
, and the graph extended with
We present an example of a UCQ and its r-graph in Fig. 3. For clarity (in particular, to avoid non-planarity), we embed the types of nodes into the nodes themselves; e.g., the lowermost node expands to
. Given an input query
Mapping UCQs to r-graphs
Part of the benefit of this graph representation is that it abstracts away the ordering of the operands of query operators where such order does not affect the semantics of the operator. This representation further allows us to leverage existing tools to eliminate redundancy and later canonically label variables.
The minimisation step removes two types of redundancies: redundant triple patterns in BGPs, and redundant BGPs in unions. It is important to note that such redundancies only apply in the case of set semantics [49]; under bag semantics, these “redundancies” affect the multiplicity of results, and thus cannot be removed without changing the query’s semantics.
BGP minimisation
The first type of redundancy we consider stems from redundant triple patterns. Consider a BGP Q (without blank nodes, for simplicity). We denote by
One may note a correspondence to RDF entailment (see Section 2.5), which is also based on homomorphisms, where the core of an RDF graph represents a redundancy-free (lean) version of the graph. We can exploit this correspondence to remove redundancies in BGPs by computing their core. However, care must be taken to ensure that we do not remove variables from the BGP that are projected; we achieve this by temporarily replacing them with IRIs so that they cannot be eliminated during the core computation.

UCQ (above) and its r-graph (below).

BGP (above) and its r-graph (below) with the sub-graph removed during the core computation shown dashed (and in blue).
Consider the following query, Q:
Though perhaps not immediately obvious, this query is congruent with the three queries of Example 1.1. After applying UCQ normal forms and creating the base r-graph for Q, we end up with an r-graph analogous to the following query with a union of three BGPs:
We then replace the blank node for the projected variable
If we consider applying this core computation over all three conjunctive queries, we would end up with an r-graph corresponding to the following query:
We see that the projected variable is preserved in all BGPs. However, we can still see (inter-BGP) redundancy with respect to the first and third BGPs (the first is contained in the third), which we address now.
After removing redundancy from the individual BGPs, we may still be left with a union containing redundant BGPs as highlighted by the output of Example 6.10, where the first BGP is contained in the third BGP: when we take the union, the names of
One may observe that this relates to the aforementioned rule for unsatisfiability in the case of UCQs; however, while the unsatisfiability rule applies in the case of both bag and set semantics, this rule only applies in the case of set semantics.
all
all
To implement condition (1), let us first assume that all BGPs contain all projected variables. Note that in the previous step we have removed all redundancy from the CQs and hence it is sufficient to check for isomorphism between them; we can thus take the current r-graph
Per this process, the first BGP in the output of Example 6.10 is removed as it is contained in the third BGP, with the projected variable corresponding in both. We now take another example.
Consider the following UCQ, where each BGP contains the lone projected variable:
If we consider the first two BGPs, they do not contribute the same results to
The fourth BGP maps to (i.e., contains) the first BGP, and thus the first BGP will be removed. This containment check is implemented by creating the following ASK query from the r-graph for the fourth BGP:
and applying it to the r-graph of the first BGP:
This returns true and hence the first BGP is removed. Likewise the fourth BGP maps to the fifth BGP and also the sixth BGP and hence the fifth and sixth BGPs will also be removed. This leaves us with an r-graph representing the following UCQ query:
This UCQ query is redundancy-free.
Now we drop the assumption that all CQs contain the same projected variables in V, meaning that we can generate unbounds. To resolve such cases, we can partition the BGP operands
Take the following UCQ, where the BGPs now contain different projected variables:
Let
The first two BGPs can return multiple solutions, where none can have an unbound; the third BGP will return the same solutions for
The result of this process will be an r-graph for a redundancy-free UCQ. On this r-graph, we apply some minor post-processing: (i) we replace the temporary IRIs for projected variables with their original blank nodes to allow for canonical labelling in a subsequent phase; and (2) we remove unary
Given a UCQ Q being evaluated under set semantics (with distinct), we denote by
Given a UCQ Q being evaluated under bag semantics (without distinct), we define that
Canonical labelling
The second-last step of the canonicalisation process consists of applying a canonical labelling to the blank nodes of the RDF graph output from the previous process [28]. Specifically, given an RDF graph G, we apply a canonical labelling function
Inverse mapping
The final step of the canonicalisation process is to map from the canonically labelled r-graph to query syntax. More specifically, we define an inverse r-mapping, denoted
Here we assume the use of
Inverting Table 19, we can define ξ as
To arrive at a canonical concrete syntax, we order the operands of commutative operators using a syntactic ordering on the canonicalised elements, and then serialise these operands in their lexicographical order. This then concludes the canonicalisation of EMQs.
Given an EMQ as input, we prove soundness – i.e., that the output query is congruent to the input query – and completeness – i.e., that the output for two input queries is the same if and only if the input queries are congruent – for the proposed canonicalisation scheme.
Soundness
We begin the proof of soundness by showing that the UCQ normalisation preserves congruence.
For an EMQ Q, it holds that:
Please see Appendix A.1.7 for the proof.
Next we prove that the canonical labelling of blank nodes in the r-graph does not affect the properties of the inverse r-mapping.
Given a UCQ Q, it holds that:
Please see Appendix A.1.8 for the proof.
Finally we prove that the minimisation of UCQs through their r-graphs preserves congruence.
Given a UCQ Q, it holds that:
Please see Appendix A.1.9 for the proof.
The following theorem then establishes soundness; i.e., that the proposed canonicalisation procedure preserves congruence of EMQs.
For an EMQ Q, it holds that:
Please see Appendix A.1.10 for the proof.
We now establish completeness: that for any two EMQs, they are congruent if and only if their canonicalised queries are equal. We will prove this by proving lemmas for various cases.
We begin by stating the following remark, which will help us to abbreviate some proofs.
The following hold:
if if if if if
Thus, if any premise 1–5 is satisfied, it holds that
In order to prove the result for various cases, our goal is thus to prove isomorphism of the input queries, the queries in UCQ normal form, the r-graphs of the queries, or the minimised r-graphs.
Our first lemma deals with unsatisfiable UCQs, which is a corner-case specific to SPARQL.
Let
Please see Appendix A.1.11 for the proof.
In practice, if a UCQ Q is unsatisfiable, then the canonicalisation process can stop after
We will start with satisfiable CQs evaluated under set semantics (with distinct).
Let
Please see Appendix A.1.12 for the proof.
We move to CQs evaluated under bag semantics (without distinct; the result also considers cases where the CQ cannot return duplicates).
Let
Please see Appendix A.1.13 for the proof.
We now move to UCQs evaluated under set semantics (with distinct).
Let
Please see Appendix A.1.14 for the proof.
We next consider UCQs under bag semantics (without distinct; again, this also holds in the case that the UCQs cannot return duplicates).
Let
Please see Appendix A.1.15 for the proof.
Finally we consider what happens when one (U)CQ has distinct, and the other does not but is congruent to the first query.
Let Q denote a satisfiable UCQ without distinct. Let
Please see Appendix A.1.16 for the proof.
Having stated all of the core results, we are left to make the final claim of completeness.
Given two EMQs
Please see Appendix A.1.17 for the proof.
Finally we can leverage soundness and completeness for the following stronger claim.
Given two EMQs
Please see Appendix A.1.18 for the proof.
With respect to the complexity of the problem of computing the canonical form of (E)MQs in SPARQL, a solution to this problem can be trivially used to decide the equivalence of MQs, which is
With respect to the complexity of the algorithm
Other cases are not difficult to manage, but require considering the length of a property path, the number of projected variables not appearing the query, etc., in the input, which we consider to be inessential to the complexity, and to our discussion here.
Letting
With respect to
With respect to
With
With respect to
Finally, given a graph with j triples, then
Putting it all together, the complexity of canonicalising an MQ Q with n triple patterns using the procedure
Overall, this complexity assumes worst cases that we expect to be rare in practice, and our analysis assumes brute-force methods for finding homomorphisms, computing cores, labelling blank nodes, etc., whereas we use more optimised methods. For example, the exponentially-sized UCQ r-graphs form a tree-like structure connecting each BGP, where it would be possible to canonically label this structure in a more efficient manner than suggested by this worst-case analysis. Thus, though the method has a high computational cost, this does not necessarily imply that it will be impractical for real-world queries. Still, we can conclude that the difficult cases for canonicalisation are represented by input queries with joins of unions, and that minimisation and canonical labelling will likely have high overhead. We will discuss this further in the context of experiments presented in Section 8.
While the previous section describes a sound and complete procedure for canonicalising EMQs, many SPARQL 1.1 queries in practice use features that fall outside of this fragment. Unfortunately we know from Table 16 that deciding equivalence for the full SPARQL 1.1 language is undecidable, and thus that an algorithm for sound and complete canonicalisation (that is guaranteed to halt) does not exist. Since completeness is not a strong requirement for certain use-cases (e.g., for caching, it would imply a “cache miss” that would otherwise happen without canonicalisation), we rather aim for a sound canonicalisation procedure that supports all features of SPARQL 1.1. Such a procedure supports all queries found in practice, preserving congruence, but may produce different canonicalised output for congruent queries.
Algebraic rewritings
We now describe the additional rewritings we apply in the case of SPARQL 1.1 queries that are not EMQs, in particular for filters, for distinguishing local variables, and for property paths (RPQs). We further describe how canonicalisation of monotone sub-queries is applied based on the previous techniques.
Filter normalisation
Schmidt et al. [53] propose a number of rules for filters, which form the basis for optimising queries by applying filters as early as possible to ensure that the number of intermediate results are reduced. We implement the rules shown in Table 20. It is a folklore result that such rewritings further hold in the case of bag semantics, where they are used by a wide range of SPARQL engines for optimisation purposes: intuitively, filters set to zero the multiplicity of solutions that do not pass in the case of both bag or set semantics, and preserve the multiplicity of other solutions.
Equivalences given by Schmidt et al. [53] for filters under set semantics
Equivalences given by Schmidt et al. [53] for filters under set semantics
With respect to the latter two rules, we remark that this holds only if the variables of the filter expression R are contained within the safe variables of
Syntactic approximation of safe variables where B is a basic graph pattern; N is a navigational graph pattern;
In our case, rather than decomposing filters with disjunction or conjunction, we join them together, creating larger filter expressions that can be normalised.
Consider the following query:
We will rewrite this as follows:
Note that the
Like in the case of union variables, we identify another case where the correspondences between variables in different scopes is coincidental; i.e., where variables with the same name do not correlate. Specifically, we call a variable v local to a graph pattern Q on V (see Table 3) if Consider the following query looking for the names of aunts of people without a father or without a mother.
In this case, we distinguish the union variables. However, the variables
The resulting query is equivalent to the original query, but avoids “false correspondences” of variables.
We continue to apply many of the rules of UCQ normalisation described in Section 6.1. Most of these rules were stated in a general way, and thus apply when other (non-EMQ) features are used in the subqueries of the union and join operators (or outside such operators). There are, however, certain caveats to consider in the more general case:
In the case of variable normalisation, deciding the set of possible variables becomes undecidable for the full language. It suffices for soundness (but not completeness) to use an overapproximation; given a graph pattern Q on V, we can thus simply take V as the set of possible variables. In the case of unsatisfiability normalisation, there are now many other possible causes of unsatisfiable graph patterns; unfortunately, deciding if a pattern is unsatisfiable or not for the full language is undecidable [60]. We currently only remove BGPs with literal subjects, as before for EMQs. In the case of set vs. bag normalisation, deciding whether or not a query can return duplicate solutions is undecidable (noting that
Well-designed pattern normalisation
As mentioned in Section 3, SPARQL allows the querying of optional data, that is, values are returned if they are matched by a graph pattern, or are unbound otherwise. Well-designed patterns denote a class of graph patterns where for each sub-pattern of the form
Equivalences given by Pérez et al. [44] for set semantics and well-designed patterns
Equivalences given by Pérez et al. [44] for set semantics and well-designed patterns
Let us consider the query
This query is not well-designed due to the variable
On the other hand, the following query
Thus we can rewrite it to:
per the second rule of Table 22.
Note that if we were to try to apply the same rule on the non-well-designed pattern in
This query is not equivalent to
We note that for queries with well-designed patterns, these rules are not sufficient for the purposes of completeness; we will show an example of incompleteness later in Section 7.5.4. Per the results of Pichler and Skritek [45], equivalence considering projection and well-designed patterns is already undecidable.
Given a SPARQL 1.1 query Q, we denote by
Graph representation
We extend the graph representation defined for EMQs in Section 7.2 so as to cover all SPARQL 1.1 query features. This extension is provided in Table 23 (we again include the EMQ features for reference).
Definitions for representational graphs
of graph patterns Q, where “
” abbreviates rdf:type , B is a basic graph pattern, N is a navigational graph pattern,
are graph patterns, c is an RDF term, e is a non-simple property path (not an IRI), k is a non-zero natural number, v is a variable, w is a variable or IRI, x is an IRI, y is a variable or property path, μ is a solution mapping, Δ is a boolean value, V is a set of variables, X is a set of variables and/or IRIs, Y and
are sets of IRIs, R is a built-in expression, A is an aggregate expression,
is a bag of solution mappings, Λ is a set of aggregate expression–variable pairs, and Ω is a non-empty sequence of order comparators
Definitions for representational graphs
The reader may have noted that we omitted three details from Table 23: how to represent built-in and aggregate expressions, and property paths. We now discuss these representations in turn.
We recall that a term in
As previously remarked, we consider operators (e.g.,
With respect to aggregation expressions, we recall that if ψ is a function that takes a bag of tuples from
Though not necessary for SPARQL,

NFA, DFA and minimal DFA produced for the RPQ
Because property paths without inverses and negated property sets are regular expressions, and any regular expression can be transformed into a finite automaton, and a finite automaton has a graph-like structure, we opt to represent property paths based on finite automata. This approach allows us to apply known normal forms for finite automata, providing, in turn, a normal form for RPQs, i.e., property paths without inverses or negated property sets; it further provides partial canonicalisation in the case of full property paths. Another benefit of this approach is that the automaton can be converted straightforwardly into RDF and used for the graph representation.
Given an RPQ, we apply the following process:
we construct a non-deterministic finite automaton (NFA) based on Thompson’s construction for transforming regular expressions to NFAs [59];
we convert this NFA into a deterministic finite automaton (DFA) by a standard subset expansion (the resulting DFA may be exponential on the number of states in the NFA);
we perform a minimisation of the DFA using Hopcroft’s algorithm [29], which produces a canonical DFA such that all regular expressions that express the same language will produce the same DFA (modulo isomorphism on states).
We now provide an example to illustrate the process.
Consider an RPQ
The minimal DFA produced by Hopcroft’s algorithm for the RPQ e can then be encoded as an r-graph
In order to generalise this process to full property paths, we must also support inverses and negated property sets. For inverses, we initially attempt to eliminate as many inverses as possible; for example,
We will discuss the inverse mapping from the minimal DFA back to a path expression in Section 7.4.
Minimisation is applied only to BGPs and UBGPs that are contained within the larger query in the r-graph, considering any variable appearing outside of the BGP or union as a projected variable. We apply canonicalisation on the entire r-graph as before, computing a deterministic labelling for blank nodes.
Inverse mapping
The inverse mapping is defined analogously such that Consider the following DFA: The first step consists in transforming this DFA into an equivalent NFA by introducing a new initial state To eliminate a state We now eliminate Next we eliminate We now eliminate Finally, we substitute the IRI





Given a SPARQL 1.1 query Q, the canonicalisation procedure is then defined by
Soundness and completeness for EMQs
First we look at the case of EMQs, and ask if the extended canonicalisation procedure is sound and complete for this fragment.
Given two EMQs
Please see Appendix A.2.1 for the proof.
Next we show that the process is sound for queries in the full SPARQL 1.1 query language.
For a SPARQL 1.1 query Q, it holds that:
Please see Appendix A.2.2 for the proof.
In terms of the complexity of (partially) canonicalising queries with these additional features, if we assume that the size of the input query Q is in
We may rather consider the “token length” of the query, which is the number of terminals – RDF terms, variables, keywords, etc. – appearing in the syntax of the query. In this case, we must additionally consider the costs of computing a normal form for RPQs. Given an RPQ of length l, which includes the number of non-parenthetical symbols (including IRIs,
Incompleteness for SPARQL 1.1
We provide some examples of incompleteness to illustrate the limitations of the canonicalisation process for the full SPARQL 1.1 language.
We start with filters, which, when combined with a particular graph pattern, may always be true, may never be true, may contain redundant elements, etc.; however, detecting such cases can be highly complex.
Consider the following example:
The
Note that reasoning about filters is oftentimes far from trivial. Consider the following example:
This query is unsatisfiable because predicates must be IRIs, and IRIs must always contain a colon (to separate the scheme from the hierarchical path) [20].
Next we consider issues with property paths.
Consider the following example:
Clearly this is equivalent to:
But also to:
Currently we rewrite concatenation, inverse and disjunction in paths (not appearing within a recursive expression) to UCQ features. This means that we currently capture equivalence between the first and second query, but not the first and third.
Other examples are due to inverses, or negated property sets; consider for example:
This is equivalent to:
However, we do not consider the semantic relation between the expressions
Incompleteness can also occur due to negation, which is complicated by the ambiguities surrounding
Incompleteness can also occur while normalising well-designed query patterns with
Consider the query
Since
If we rewrite each well-designed pattern by pushing the
and, analogously, for
However, in the general case it does not hold that
We could list an arbitrary number of ways in which arbitrary features can give rise to unsatisfiability or redundancy, or where queries using seemingly different features end up being equivalent. We could likewise provide an arbitrary number of rewritings and methods to deal with particular cases. However, any such method for canonicalising SPARQL 1.1 queries will be incomplete. Furthermore, many such “corner cases” would be rare in practice, where dealing with them might have limited impact. We then see two interesting directions for future work to address these limitations:
Use query logs or other empirical methods to determine more common cases that this query canonicalisation framework may miss and implement targeted methods to deal with such cases. Extend the query fragment for which sound and complete canonicalisation is possible; an interesting goal, for example, would be to explore EMQs with full property paths (such queries are similar to C2RPQs [34], for which containment and related problems are decidable).
In these experiments, we wish to address two principal questions, as follows:
How is the performance of the canonicalisation procedure in terms of runtime? Which aspects of the procedure take the most time? What kinds of queries are most expensive?
How many more additional congruent queries can the procedure find in real-world logs versus baseline methods? Which aspects of the procedure are most important for increasing the number of congruent queries detected?
With respect to the first question, we might expect poor performance given that the worst-case of the algorithm is super-exponential. However, this is a worst-case analysis with respect to the size of the query, where queries in practice are often relatively small and simple. Hence our hypothesis is that although there exist queries for which canonicalisation is not computationally feasible in theory, it should run efficiently (in fractions of a second) for the majority of real-world queries in practice (as found in public query logs).
With respect to the second question, most of our expected use-cases benefit from being able to find a wider range of congruent, real-world queries, as found in public query logs; for example, in the case of caching, finding more congruent queries will translate into higher cache hit rates. Thus it is of interest to see how many additional congruent queries our canonicalisation procedure can find from public query logs when compared with baseline (syntactic) methods, and in particular, which parts of the procedure have the most impact in terms of finding more congruent queries. In general, we hypothesise that canonical labelling will be an important component as variable naming would likely be a common variation in real-world queries; on the other hand, we hypothesise that minimisation will be less impactful in terms of finding more congruent queries as we expect few real-world queries to contain the types of redundancy dealt with in Section 6.3.
We thus design a number of experiments over real-world and synthetic queries in order to address these questions and evaluate our expectations.
Implementation: QCan
We have implemented the canonicalisation procedure in a system that we call QCan written in Java. The system uses Jena ARQ [38] for query parsing and processing. Algebraic rewritings are implemented on top of the query algebra that Jena ARQ provides. Constructing the r-graph is likewise implemented on top of Jena. Minimisation is conducted by using the blabel system [28] to compute the core of RDF graphs representing BGPs, thereafter implementing containment checks across BGPs using
Real-world query logs
In order to test our algorithm in a real-world setting, we used two query log datasets: the LSQ dataset [52],19
DBpedia [35] (
LinkedGeoData [56] (
The RKB Endpoint (
Semantic Web Dog Food (
Wikidata [61] (
Distribution of queries using SPARQL features in the query logs; we mark features new to SPARQL 1.1 with an asterisk
All of these datasets have public SPARQL endpoints that receive queries from users over the Web. These queries have been published as the Wikidata and LSQ logs for research purposes. Table 24 contains the distribution of features in each of the query sets in our real-world setting. In total we consider 2.8 million queries for our experiments. Despite the fact that BGPs are equivalent to joins of triple patterns, we only count as “

Runtimes for each step of the canonicalisation algorithm.
We now present the runtimes for canonicalising the queries of the aforementioned logs. All queries in all six logs were successfully canonicalised. The results in Fig. 6 indicate that the fastest queries to canonicalise take around 0.00025 seconds, median times vary from 0.00084 seconds (

Runtimes for each step of the canonicalisation algorithm (considering queries with some SPARQL 1.1 feature).
Figure 7 shows the runtimes for the canonicalisation of the queries in the aforementioned logs, limited to those that contain features introduced in SPARQL 1.1. Notably the ranges of runtimes of

R-graph sizes (number of nodes) for each set of queries; the value on the right indicates the maximum size.
Figure 8 shows that the

Runtimes for queries grouped by selected features.
The results in Fig. 9 show the runtimes for all query sets grouped by the features they use (at least once, possibly in conjunction with other features). We include SPARQL 1.0 features along with property paths for those logs using such features. These results indicate that queries containing
We now look at the number of duplicates found, where we compare the following methods:
The raw query string is used, without any kind of normalisation.
The raw query string is parsed using Jena ARQ into its query algebra, and then serialised back into concrete SPARQL syntax.
The raw query string is parsed using Jena ARQ into its query algebra, the r-graph is constructed and canonically labelled, and then serialised back into concrete SPARQL syntax.
The raw query string is parsed using Jena ARQ into its query algebra, the query is rewritten per the algebraic rules, the r-graph is constructed and canonically labelled, and then serialised back into concrete SPARQL syntax.
The raw query string is parsed using Jena ARQ into its query algebra, the query is rewritten per the algebraic rules, the r-graph is constructed, minimised, canonically labelled, and then serialised back into concrete SPARQL syntax.
Total number of duplicates found by each method
Total number of duplicates found by each method
Most duplicates of a single query found by each method
Tables 25 and 26 denote the total number of duplicate (congruent) queries found, and the most duplicates found for a single query. In general, there is a high number of exact duplicate query strings, possibly from the same query being sent many times to refresh the results. Thereafter, the number of duplicates found either remains the same or increases in each successive algorithm. In particular, excluding duplicate query strings (
In this section, we compare the runtime performance of our algorithm with existing systems that can perform pairwise SPARQL containment checks. We provide an overview of these tools in Table 27. In our experiments, we consider SA [36] and JSAG [58] as they support cyclic queries. The queries used are part of the test suite designed by Chekol et al. [13] as part of their experiments. These correspond to two sets of queries: one of CQs without projection, and one of UCQs with projection. As discussed in Section 5, the SA and JSAG systems are not analogous to ours. We focus on query equivalence and congruence, and not on containment; conversely, SA and JSAG support containment. On the other hand, we compute a canonical form of a query, whereas SA and JSAG focus on pairwise checks (though JSAG offers indexing approaches based on constants in the query and isomorphic DAGs). Our approach is sound and complete for UCQs under both bag and set semantics; conversely, SA only considers set semantics, while JSAG focuses on detecting sub-graph isomorphisms between algebraic expressions under bag semantics. In the case of CQs without projection, checking containment is tractable (see Table 16), and quite trivial, requiring checking that one BGP is set-contained in the other.
UCQ features supported by SPARQL Algebra (SA), Alternating Free two-way μ-calculus (AFMU), Tree Solver (TS) and Jena-SPARQL-API Graph-isomorphism (JSAG); note that ABGP denotes Acyclic Basic Graph Patterns
UCQ features supported by SPARQL Algebra (SA), Alternating Free two-way μ-calculus (AFMU), Tree Solver (TS) and Jena-SPARQL-API Graph-isomorphism (JSAG); note that ABGP denotes Acyclic Basic Graph Patterns

Runtimes for JSAG, SA and QCan.
Figure 10 shows the runtimes for our comparison of both containment checkers and our method (QCan). Note that there are no values for SA with UCQs because the UCQ set uses projection and SA does not support queries with projection. The results indicate that most queries for SA and JSAG take between one and ten milliseconds, whereas most queries under our method take between ten and one hundred milliseconds. In terms of the slowest queries, our method is faster than JSAG but slower than SA.
In general, the conclusion is that our method is slower for testing equivalence than existing containment checkers, but this is perhaps not surprising as our approach addresses the more difficult problem of first computing a canonical form for both input queries, and further considers congruence rather than the more trivial case of equivalence where variable names are fixed. Furthermore, once these canonical forms are computed, equivalence classes of congruent queries can be found in constant time using simple data structures (e.g., (ideal) hashing of the canonical query strings). If we estimate that our system is 10 times slower to canonicalise two queries than these systems can perform a pairwise check (per Fig. 10), QCan will begin to outperform these systems for partitioning a set of 11 or more queries by equivalence (or congruence); in the case of 11 queries, these systems must perform
With the exception of some outliers, most queries that we have evaluated thus far have been canonicalised in fractions of a second. On the other hand, we know that our algorithms are super-exponential in the worst case. Such cases may occur when we have monotone queries that are in conjunctive normal form (i.e., consisting of joins of unions), in which case our UCQ normal form can be exponential in size, upon which we perform further potentially exponential processes. In order to stress-test our method and explore such worst cases, we run the following experiment.
For this experiment, we generate queries of the form

Times for UCQ stress tests.
Figure 11 shows the times for each step of the canonicalisation procedure on the synthetic UCQs. On the x-axis we increase k (the base of the exponent), while across the different series we increase m (the exponent). The y-axis is shown in log scale. We see that for the UCQ rewriting, graph construction and minimisation steps, the higher series (representing an increasing exponent) diverge further and further as the x-axis increases (representing an increasing base). On the other hand, the differences in times for canonical labelling are less pronounced since the minimisation process reduces the r-graphs significantly due to the regular construction of the queries. The slowest queries tested (
These results illustrate the poor worst-case behaviour of our canonicalisation procedure, particularly when considering queries with joins of many unions. However, as shown by the results presented in Section 8.2, virtually no queries in our real-world setting caused this worst-case behaviour.
In this paper, we have presented a formal semantics of SPARQL 1.1 and a canonicalisation procedure that produces a deterministic query string modulo congruence (equivalence modulo variable naming). This procedure involves the application of algebraic rewritings of the query, the representation of the query as a graph, the minimisation of the query in order to remove redundancy, and finally the canonical labelling of the graph in order to produce canonical variable names based on the structure of the graph. We have proven this procedure to be sound and complete with respect to “extended monotone queries” under bag and set semantics, i.e., queries that can be rewritten to the features involving BGPs, joins, unions, projection, and distinct. We have further extended this procedure to provide sound and incomplete canonicalisation of the queries under the full SPARQL 1.1 language.
Despite the super-exponential worst-case complexity of our procedure, the experimental results indicate that our method is efficient for most queries, running in a fraction of a second – in the median case – over millions of queries from real-world logs; the slowest query to canonicalise took just over a minute. Such results are achieved because most real-world queries tend to be relatively small and simple. In this analysis, we determined that the canonical labelling is the step that takes the longest time on average. We further found that the
We have further confirmed that our procedure allows for detecting additional congruent queries over real-world logs versus baseline methods. We first observed that the vast majority of congruent queries are actually syntactically identical in terms of raw query strings, likely due to clients reissuing precisely the same query in order to intermittently refresh results. Of our canonical procedures, canonical labelling is the most important for finding additional congruent queries. On the other hand, minimisation and algebraic rewritings – though necessary to ensure completeness for monotone queries – lead to finding only a very small fraction of additional congruent queries. This would tend to suggest that in a practical caching system, where completeness can be traded for efficiency, it may be sufficient to apply canonical labelling without algebraic rewritings and minimisation. However, minimisation may be useful in cases where completeness is an important criterion. Also, in certain setting, queries with redundancy may be automatically generated; an interesting use-case along these lines is that of ontology-based data access (OBDA) [63], where rewritings may produce queries (typically UCQs) with redundancies that are inefficient to evaluate over large graphs.
With respect to future directions, a natural continuation of this work is to explore larger fragments of SPARQL 1.1 for which sound and complete canonicalisation can be performed. In particular, we have already begun to investigate such procedures for a fragment akin to UC2RPQs. At first glance, this should be analogous to the minimisation and canonicalisation of basic graph patterns and unions, where property paths are represented as automata and we can check for containment of automata to find redundant path patterns. However, we have already found that this requires entailment of paths with inverses, which is not as straightforward as checking for containment [34].
SPARQL 1.2 is on the horizon. We believe that the formal semantics of SPARQL 1.1 defined herein may serve as a useful reference for the standardisation effort. If the semantics of
Finally, we are currently exploring use-cases for our canonicalisation procedure in the context of two ongoing projects: one relating to caching, and one relating to query answering. Regarding caching, we have seen that most congruences in real-world query logs are exact (syntactic) duplicates; however, rather than considering congruencies between full queries, a more promising approach for caching is to mine common sub-queries, where canonicalisation can be used for such purposes. In the context of question answering, we can also use canonicalisation in order to normalise training data for sequence-to-sequence models that translate natural language to SPARQL queries. An important future line of research is then to explore and evaluate the benefits of SPARQL canonicalisation in the context of these and further use-cases.
Online material We provide code and other material online at:
Footnotes
Acknowledgements
This work was supported by Fondecyt Grant No. 1181896 and by ANID – Millennium Science Initiative Program – Code ICN17_002. We would also like to thank the reviewers of the conference and journal versions of this paper for useful comments that helped to improve this work.
