Abstract
The SPARQL standard provides operators to retrieve exact matches on data, such as graph patterns, filters and grouping. This work proposes and evaluates two new algebraic operators for SPARQL 1.1 that return similarity-based results instead of exact results. First, a similarity join operator is presented, which brings together similar mappings from two sets of solution mappings. Second, a clustering solution modifier is introduced, which instead of grouping solution mappings according to exact values, brings them together by using similarity criteria. For both cases, a variety of algorithms are proposed and analysed, and use-case queries that showcase the relevance and usefulness of the novel operators are presented. For similarity joins, experimental results are provided by comparing different physical operators over a set of real world queries, as well as comparing our implementation to the closest work found in the literature, DBSimJoin, a PostgreSQL extension that supports similarity joins. For clustering, synthetic queries are designed in order to measure the performance of the different algorithms implemented.
Keywords
Introduction
Query languages such as SPARQL offer users the ability to match data using precise criteria. The results of evaluating the query contain exact matches that satisfy the specified criteria. This is useful, for example, to query for places with a population of more than one million (filters), for the names of famous musicians that fought in World War I (joins), or for the average life expectancy over all countries of each continent (grouping and aggregation). However, users may also require results that are based on imprecise criteria, such as similarity; for instance, obtaining countries with a population similar to Italy (similarity search), pairs of similar planetary systems from two different constellations (similarity join), or groups of similar chemical elements based on certain properties (clustering). The potential applications for efficient similarity-based queries in SPARQL are numerous, including: entity comparison and linking [35,40], multimedia retrieval [11,23], similarity graph management [8,12], pattern recognition [4], query relaxation [16], as well as domain-specific use-cases, such as protein similarity queries [2].
A similarity join
Similarity is usually measured through distance functions defined over a metric or vector space, where two objects are as similar as they are close in the given space [49]. Distances are usually defined over real, d-dimensional vector spaces (e.g.,
It must be noted that some similarity joins could be expressed in SPARQL 1.1; for example, a range-based similarity join that uses the Manhattan distance might be expressed using subqueries, numeric operations and filters. However, general similarity queries over metric spaces cannot be expressed using SPARQL, and those that can be expressed will typically be evaluated by computing all pair-wise distances and applying filters to each one, since SPARQL engines are not able to produce efficient planning and optimisations for such queries. Although several techniques for the efficient evaluation of similarity joins in metric spaces are available [6,17,33,48], the appropriate techniques depend on the similarity criteria to be used, where in order to efficiently compile vanilla SPARQL queries that express a similarity join into optimised physical operators that include these techniques, the engine would need to prove the equivalence of the query to a particular similarity join, which is not clear to be decidable.
Some works have proposed extending SPARQL with similarity features. However, these works have either focused on (1) unidimensional similarity measures that consider similarity with respect to one attribute at a time [20,44], or (2) domain-specific fixed-dimensional similarity measures, such as geographic distance [1,50]. Other approaches rather pre-compute and index similarity scores as part of the RDF graphs [8,16,30] or support metric distance measures external to a query engine [29,40].
In terms of clustering, Ławrynowicz [21] proposed an extension for SPARQL 1.0 that provides the solution modifier
In this paper, two operators that enrich SPARQL with similarity-based queries are presented: similarity joins and clustering. Our work is, to our knowledge, the first to consider multidimensional similarity queries in the context of SPARQL, where the closest proposal to this one is DBSimJoin [43]: a PostgreSQL extension, which – though it lacks features arguably important for the RDF/SPARQL setting, namely nearest neighbours semantics – is considered as a baseline for experiments. Our work is also pioneer on the implementation and evaluation of clustering features on top of SPARQL with read-only capabilities (other works use SPARQL Update queries for some form of clustering [37]).

Examples of queries using similarity joins and clustering in the astronomy domain.
Motivating example To motivate combining similarity joins and clustering with SPARQL, we present some examples from the astronomy domain. Consider the RDF graph of Fig. 1, which describes, in Turtle syntax: the star
We also directly standardise the units of measure. In practice, Wikidata allows for using a broad selection of different units, but the RDF export additionally provides reification properties that map these values to normalised values with the base S.I. unit of measure.
The planet orbiting
The most similar star in the constellation Orion for each star in the constellation Aquarius, based on mass, radius, temperature and number of planets.
Using clustering, the user can find, for example:
Clusters of stars based on mass, radius, temperature and number of planets.
Combining both similarity joins and clustering, the user can find:
The top 5 most similar planets to the Earth based on their mass, radius and temperature whose stars are in the same cluster as the Sun based on mass, radius and temperature, along with each planet’s constellation.
Figure 1b–1e shows how these queries are expressed in the concrete syntax we will propose and describe in detail later in the paper. Of note is that we can use SPARQL’s other features to apply similarity joins and clustering on computed attributes of an entity; for example, the second and third queries reference the number of planets, and although this numeric attribute is not explicitly given in the data, we can compute it in SPARQL using a count aggregation. We can likewise cluster planets based on properties of their stars (e.g., their temperature), based on attributes calculated from explicit values (e.g. the density calculated from mass and radius), etc. Furthermore, we can use SPARQL’s features to retrieve additional details of results returned by similarity joins and clustering, such as the constellations of planets returned by the fourth query, retrieved via a property path.
Later we will present further examples of queries that are enabled by these features in the context of other domains, including chemistry, politics, demography, and multimedia.
Contributions This paper is an extension of our previous work, where we proposed, defined, implemented and evaluated novel types of similarity joins in SPARQL [9]. In this extended version, we discuss new use-case queries over different RDF data, and we expand the evaluation of the similarity join implementations by building a benchmark based on the query logs of Wikidata. Furthermore, we extend the proposal with a novel clustering solution modifier; we describe its syntax, semantics and implementation, and we evaluate its performance. Extended discussion and examples are also provided.
Paper outline In Section 2 we present definitions regarding RDF, the SPARQL algebra and syntax, and similarity-based operations, and in Section 3, related work. The similarity join operator, its properties, and implementation are introduced in Section 4. Section 5 is dedicated to the clustering solution modifier: its definition, a proposed syntax, the implemented clustering algorithms, and use cases. Evaluation of the similarity join operator and the clustering solution modifier is presented and discussed in Section 6. Finally, Section 7 presents our concluding remarks.
In this section, the relevant definitions and notation that is used throughout this paper regarding RDF, SPARQL and Similarity Operations are presented.
RDF and SPARQL
RDF is the standard graph-based data model for the Semantic Web. RDF terms can be either IRIs (
Using that
SPARQL is the W3C standard query language for RDF [15]. The semantics of SPARQL operators are introduced by Pérez et al. [34], in terms of their evaluation over an RDF graph. The result of the evaluation of a SPARQL query is a set of solution mappings. Let
where f is a filter condition that, given a solution mapping, returns true, false, unbound or an error.
We denote the evaluation of a query Q over an RDF graph G as
In this section, we introduce the concept of similarity search over a universe of objects, as well as different similarity search operations.
(Similarity Search).
Let
Mathematically, similarity is modelled as a score. Let
(Metric Distance).
A distance function
Examples of metric distances are the Euclidean distance over real vectors, the edit distance over strings, and the Jaccard distance over sets.
Metric distances induce a topology over the set they are defined upon:
(Metric Space).
A metric space is a tuple
There are two main types of similarity search queries in metric spaces: range and nearest neighbours queries.
(Range Similarity Search).
Let
(Nearest Neighbours Similarity Search).
Let
If instead of a single query object q we consider a set of query objects, we call this operation a similarity join. A similarity join is an operation defined over two comparable sets. Two sets are deemed comparable if a (metric) distance function can be defined over their Cartesian product.
(Similarity Join).
Let
Note that similarity search – also known as query-by-example – is a special case of similarity join where
Related work
This section is divided in four parts: the first presents techniques to evaluate similarity joins; the second introduces works about the similarity operations available in query languages or database engines; the third is dedicated to clustering techniques; and the fourth to the inclusion of clustering in query languages.
Similarity joins
The brute force method for computing a similarity join between
A common way to optimise similarity joins is to index the data using tree structures that divide the space in different ways (offline), then pruning distant pairs of objects from comparison (online). Several tree structures have been proposed as indexes for optimising similarity queries [14,31,41]. Among such approaches, we can find vantage-point Trees (vp-Trees) [48], which make recursive ball cuts of space centred on selected points, attempting to evenly distribute objects inside and outside the ball. vp-Trees have an average-case query-by-example search time of
Other space partitioning algorithms are not used for indexing but rather for evaluating similarity joins online. The Quickjoin (QJ) algorithm [17] was designed to improve upon grid-based partition algorithms [3,5]; it divides the space into ball cuts using random data objects as pivots, splitting the data into the vectors inside and outside the ball, proceeding recursively until the groups are small enough to perform a nested loop. It keeps window partitions in the boundaries of the ball in case there are pairs needed for the result with vectors assigned to different partitions. QJ requires
Another alternative is to apply approximations to evaluate the similarity joins, trading the precision of the results for more efficient computation. FLANN [26] is a library that provides several approximate k-nn algorithms based on randomised k-d-forests, k-means trees, locality-sensitive hashing, etc.; it automatically selects the best algorithm to index based, for example, on a target precision, which can be traded-off to improve execution time.
Similarity in databases
Though similarity joins do not form part of standard query languages, such as SQL or SPARQL, a number of systems have integrated variations of such joins within databases. In the context of SQL, DBSimJoin [43] implements a range-based similarity join operator for PostgreSQL. This implementation claims to handle any metric space, thus supporting various metric distances; it is based on the aforementioned index-free Quickjoin algorithm.
In the more specific context of the Semantic Web, a number of works have proposed online computation of similarity joins in the context of domain-specific measures. Zhai et. al [50] use OWL to describe the spatial information of a map of a Chinese city, enabling geospatial SPARQL queries that include the computation of distances between places. The Parliament SPARQL engine [1] implements an OGC standard called GeoSPARQL, which aside from various geometric operators, also includes geospatial distance. Works on link discovery may also consider specific forms of similarity measures [40], often string similarity measures over labels and descriptions [44].
Other approaches pre-materialise distance values that can then be incorporated into SPARQL queries. IMGpedia [8] pre-computes a k-nn self-similarity join offline over images and stores the results as part of the graph. Similarity measures have also been investigated for the purposes of SPARQL query relaxation, whereby, in cases where a precise query returns no or few results, relaxation finds queries returning similar results to those sought [16,30].
Galvin et al. [10] propose a multiway similarity join operator for RDF; however, the notion of similarity considered is based on semantic similarity that tries to match different terms referring to the same real-world entity. Closer to this work lies iSPARQL [20], which extends SPARQL with
Clustering
Clustering is a process through which a set of objects can be automatically partitioned into groups – called clusters – such that the distance between objects of a cluster is minimised, and the distance between objects in different clusters is maximised. Clustering algorithms can be roughly divided into distance-based, density-based, grid-based, distribution-based, and can also be hierarchical. A prime example of distance-based clustering algorithms is k-means [24], which first randomly selects k centroids and assigns each object to the cluster defined by the closest centroid; it then recomputes new centroids as the mean objects of each current cluster (which may not be a part of the set of objects), and reassigns the objects to the clusters defined by these new centroids until the centroids converge. The k-medoids algorithm is similar to k-means [19], but instead of centroids, it defines medoids: the element of a cluster such that its distance to every other element in the cluster is minimal. As for density-based algorithms, DBSCAN [7] groups together objects in dense regions (every object has several neighbours), separated by low density, sparse regions. DBSCAN also marks isolated objects as outliers, which are not included in any cluster. Grid-based methods, such as STING [46] and GriDBSCAN [25] divide the space into cells, aiming to quickly find dense areas to produce the clusters, and prune sparse areas as noise. However, these methods have been proven to not scale well with increasing dimensions [53]. Distribution-based methods assume that objects in the same cluster have a similar probability distribution: DBCLASD [47] assumes that the points inside a cluster are uniformly distributed. Finally, hierarchy-based approaches generate clusters within clusters forming a hierarchy that can be consumed at various levels. Hierarchical clustering methods include BIRCH [52], CURE [13], and Chameleon [18].
Clustering in databases
In databases, Ordonez and Omiecinski [32] provide an efficient disk-based k-means implementation for RDBMS using SQL
Qi et al. [37] provide a k-means implementation for RDF data by using SPARQL Update queries. Ławrynowicz [21], proposed an extension to SPARQL 1.0, that introduces a
Similarity joins in SPARQL
In this section, the similarity join operator for SPARQL is presented. Along with the algebraic definition of the operator and its properties, we will present options for its implementation and optimisation, and further introduce use-case queries to illustrate its application. We first define the following general criteria that a novel similarity join operator for SPARQL 1.1 should satisfy:
the new operator should be freely combinable with other available operators.
the similarity join should allow the user to specify the required dimensions and metrics to be used in the query, and define new metrics as needed.
the new operator should make as few assumptions as possible about the input or the data, regarding completeness, comparability, etc.
the new operator should be easy for users to adopt, offering high expressiveness while avoiding unnecessary verbosity.
With respect to
Syntax
To define the syntax for a similarity join, it is necessary to consider the syntax of other binary operators such as
The syntax for the similarity join operator extends the SPARQL 1.1 EBNF grammar [15] by adding one production rule for

The extended production rules from SPARQL 1.1 to support similarity joins.
The keyword
Depending on the metric, it could be possible, in principle, to express such queries as plain SPARQL 1.1 queries, by taking advantage of features such as variable binding, numeric expressions, sub-selects, etc. However, there are two key advantages of the dedicated syntax: (1) similarity join queries in vanilla syntax would be complex to express, particularly in the case of k-nn queries or metrics without the corresponding numeric operators in SPARQL; (2) optimising queries written in the vanilla syntax (beyond nested-loop performance) would be practically infeasible, requiring an engine that can prove equivalence between the distance metrics and semantics for which similarity join algorithms are optimised and the plethora of ways in which they can be expressed in vanilla syntax. This is why we propose to make similarity joins for multidimensional distances a first class feature, with dedicated syntax and physical operators offering sub-quadratic performance in average/typical cases.
We now introduce some definitions to extend the SPARQL semantics defined in Section 2 to cover the similarity joins whose syntax is defined in Section 4.1. These definitions adapt the standard definitions of similarity joins as presented in Section 2.2 to make them compatible with the relational algebra underlying SPARQL, i.e., to perform an equi-join on shared variables, and a similarity join on other variables selected by the user.
We start with the definition of a similarity join expression, which indicates the criteria under which the similarity of two SPARQL solution mappings can be computed.
(Similarity Join Expression).
A similarity join expression is a 4-tuple
In the following, we present several definitions that build up to the formal specification of our similarity join operators. We begin with a definition that, given two compatible solutions from either side of the similarity join
Given two compatible solution mappings
Note that per this definition,
Next we define some useful notation for the result of a natural equi-join (applying a join with equality on shared variables, i.e., based on compatible solutions) that forms the basis of our similarity join operator.
We denote by
Given a set of solutions, we now define notation for the union and intersection of the domains of the solutions (the sets of variables for which a solution is defined).
Given a set of solution mappings X we define the union and intersection domains as:
In order words,
The next definition allows us to define a solution mapping that maps specific variables to specific SPARQL terms (similar to
Letting
With all these definitions in hand, we introduce the similarity join operators in SPARQL as follows:
Given two sets of solution mappings X and Y and a similarity join expression
(k-nn Similarity Joins in SPARQL).
Given two sets of solution mappings X and Y and a similarity join expression
Per Definition 4.6, more than k results can be returned for
An error is returned when
Bag semantics are defined for similarity joins in the natural way, where the multiplicity of
Next, given the semantics of the similarity join operators, we present several algebraic properties for the similarity join and its relationship with other SPARQL operators. These properties and equivalences are useful for query planning and optimisation. For instance, having a commutative and associative operator can allow the Query Planner to choose the best possible order in which to evaluate a given sequence of these operators; similarly, having equivalences between operators under certain circumstances can allow for query rewriting and further optimisation. Particularly for the case of similarity joins, some rewriting rules can help to reduce the sets of intermediate solutions considerably before applying the similarity join, and thus reduce query execution time. The proofs for each of the claims along with some illustrative examples can be found in the Appendix:
Let
A key finding here is that properties that hold for (equi) joins in SPARQL do not necessarily hold for similarity joins (commutativity, associativity, distributivity), and these equivalences show how and why this is the case. This means that common optimisations applied in SPARQL engines (such as join reordering) cannot be applied as is for similarity joins, and thus there are interesting open challenges on how to optimise these queries.
Implementation
The implementation of the similarity join operators extends ARQ – the SPARQL engine of Apache Jena3
The Parsing stage receives the query string written by a user, and outputs the algebraic representation of the similarity query. Parsing is implemented by extending the Parser of Jena using JavaCC,4
In terms of physical operators to evaluate the similarity joins, initially we considered applying offline indexing of literal values in the RDF graph in order to allow lookups of similar values. However, we discarded this option for the following two reasons:
Indexing approaches typically consider a fixed metric space, with a fixed set of attributes specified for the distance measure (e.g., longitude and latitude). Conversely, in our scenario, the user can choose the attributes according to which similarity will be computed for a given query. In the setting of an RDF graph such as DBpedia or Wikidata, the user can choose from thousands of attributes. Indexing all possible combinations would lead to an exponential blow-up and would be prohibitively costly in terms of both time and space.
Aside from selecting explicit attributes in the data – and per the
For this reason, we rather consider online physical operators that receive two sets of solution mappings and compute the similarity join. Some of these physical operators include indexing techniques, but these indexes are built online – for a specific query – over intermediate solutions rather than the RDF graph itself.
The similarity join physical operators apply a normalisation function to each of the dimensions of the space, in order to avoid differences in scale of these dimensions to affect disproportionately the similarity of two solution mappings. For example, a similarity join query computing distances w.r.t. the dimensions human development index, HDI, (a number in
Algebra Optimisation then applies static rewriting rules over the query, further turning logical operators (e.g.,
involves a brute-force pairwise comparison, incurring quadratic cost for similarity joins. It computes exact results for both range and k-nn queries and is intended as a performance baseline.
is an index-based approach based on applying ball-cuts in the metric space. It computes exact results for both range and k-nn queries. Though the worst-case for similarity joins is still quadratic, depending on the distance distribution, dimensionality, etc., vp-Trees can approach linear performance.
is an ensemble of approximate methods for k-nn queries only, where the best approach is chosen based on the query, data, target precision, etc.
Algebra Execution begins to evaluate the physical operators chosen in the previous step, with low-level triple/quad patterns and path expression operators feeding higher-level operators. Finally, Result Iteration streams the final results from the evaluation of the top-level physical operator. All physical similarity-join operators follow the same lazy evaluation strategy used for the existing join operators in Jena.
Besides the implementation of sub-quadratic average-case similarity join physical operators, there are other avenues for query optimisation, namely: query rewriting and result caching. As for rewriting rules, there is little opportunity for applying such rules because of a) similarity joins are a relatively expensive operation to compute, thus the best option is to reduce the size of the outputs of the sub-operands as early as possible and therefore, making typical rewriting rules impractical (such as delaying filters and optionals); and b) we see in Section 4.2 that the k-nn similarity join operator is not commutative nor associative, further preventing rewriting such as join reordering.
Back in Section 1, we presented some motivating example queries that can be written using our proposed extension to SPARQL over a small artificial dataset. Now, in this section, we present three use-case queries demonstrating different capabilities of the extension, which are based on real-world data from Wikidata [45] and IMGpedia [8]. All prefixes used in the queries of this paper are listed in Table 1.
IRI prefixes used in the paper
IRI prefixes used in the paper

Query for non-metal chemical elements similar to metallic elements in terms of atomic properties.
Similar chemicals Chemical elements contain several numeric/ordinal properties, such as boiling and melting point, mass, electronegativity and so on. In Fig. 3 we request for metals similar to non-metals in terms of those properties, along with sample results of the similar metal/non-metal query. Figure 3 also presents a selection of results of the query, where we see that Selenium is most similar to Aluminium, and Hydrogen to Sodium.

Query requesting countries that have a number of Literature, Chemistry, Physics and Medicine Nobel Prizes similar to those of Canada.
Similar laureates: Figure 4, presents a more complex similarity query over the Nobel Prize Linked Dataset5

Query for the 10 images of towers most similar to each image of the Entel Tower in terms of the HOG visual descriptor.
Similar images: Figure 5 presents a similarity query over IMGpedia [8]: a multimedia Linked Dataset. The query retrieves images of the Entel Tower in Chile, and computes a 10-nn similarity join for images of towers based on a precomputed Histogram of Oriented Gradients (HOG) descriptor, which extracts the distribution of edge directions of an image. Note that this query uses a different distance funtion IRI than the previous two, since variables
In this section, we analyse, implement and evaluate a
As it has been argued before in this paper, users sometimes require non-exact matches over the data and instead wish for similar matches. This is also the case when using the solution modifier
As presented, the
Syntax
In order to support clustering in SPARQL, we must first define its syntax and semantics in a way that a user can express the clustering variables, and the algorithms and their required parameters. Ławrynowicz [21] proposes the following extension for SPARQL 1.0 [36], where the cluster identifier is bound

Syntactic extension to the SPARQL 1.0 grammar for clustering [21].
The first thing that can be noticed from the definition is that, since the extension was conceived for SPARQL 1.0, it lacks the definition of the
Since clustering is the similarity-based version of grouping (which is based on exact matching), one alternative is to make the modifiers
Since the
Ławrynowicz [21] makes use of the SPARQL Query Results Clustering Ontology (SQRCO) to define the clustering algorithm and its parameters as a set of triples in the query. However, at the time of writing, the ontology is not found on the Web and is not registered in prefix.cc. We instead take a syntax-based approach, using special keywords for the clustering configuration, similarly to what is done with the similarity join operator.
Our proposed extension of the SPARQL 1.1 syntax that includes

Syntactic extension to the SPARQL 1.1 grammar for clustering.
An example of a SPARQL query that uses this extension can be appreciated in Fig. 8, where the countries available in Wikidata are clustered based on their life expectancy and economic growth. The query specifies that the clustering shall use k-means returning 4 clusters. Notice that we also obtain the 2-letter ISO code for each country (property

Example SPARQL query with the
Given the SPARQL semantics presented in Section 2, as well as the discussed syntax for Given a multiset of solution mappings X, the clustering algorithm
Analogous to the similarity join operators, we implement the
Each algorithm requires different parameters, but the implementation offers default values for them, if the user does not wish to specify them. The parameters of each algorithm and the chosen default values are:
k-means [24]: k, the number of clusters to be returned, default 3; m, the number of iterations to terminate the algorithm, default 10.
k-medoids [19]: k, the number of clusters to be returned, default 3.
DBSCAN [7]: ε, the distance to search for neighbours of an object, default 0; p, the minimum number of neighbours that an object needs to be considered part of a cluster, default 1.
DBSCAN is different from k-means and k-medoids in that it does not require the number of clusters to be specified, but also may mark some objects as outliers, not being in any cluster. To comply with the semantics of
Other clustering algorithms were not chosen for various reasons: both grid-based and distribution-based methods do not scale well when dimension increases [53]. Hierarchical clustering methods are left for future versions, because given the syntax and semantics of the clustering solution modifier, it would be hard to add multiple variables bound to different levels of clustering. An initial option to support these methods would be to force the user to choose a clustering level to retrieve, but we argue that in most cases, the user might not know which level of clustering is most significant for their purposes.
Use-cases
Expanding on the motivating example queries of Section 1 (see Fig. 1d–1e), we now showcase three queries that use the clustering solution modifier here proposed in different domains. First, in Fig. 9 we display the result of the query presented in Fig. 8, that clusters the countries in Wikidata with respect to their life expectancy and economic growth rate. Some results were omitted to avoid clutter in the figure. Each point has a different colour and marker depending on their assigned cluster (whose identifier is bound to variable

Clusters of countries by growth and life expectancy.
Then, the Iris Dataset in RDF6
Retrieved from

Extended SPARQL query to obtain three clusters of iris.

Clusters obtained with query 10.
Next, we present a Hertzsprung–Russell (HR) diagram, that is used in astronomy to classify stars into main sequence, giants, white dwarfs, etc. HR diagrams have the normalised B–V Colour (a measure of the temperature or “brightness” of the star) on the X-axis, and
The CSV data is available at

Extract of the RDF data for stars.

SPARQL query for clustering star data.

Visual representation of the star clusters obtained with query from Fig. 13. Outliers not presented.
Finally, we showcase how the

Example SPARQL query with the
Experimental results for both the similarity join operator and the clustering solution modifier are presented in this section. Unless stated otherwise, experiments were run on a machine with Ubuntu 20, an 8-core Intel® CoreTM i7-6700 CPU @3.4 GHz, 16 GB of RAM and a 931 GB HDD. We provide the experiment code, queries, etc., online.8
Results of query of Fig. 15
We compare the performance of different physical operators for similarity joins in the context of increasingly complex SPARQL queries, as well as a comparison with the baseline system DBSimJoin. We conduct experiments with respect to two benchmarks: the first is a benchmark we propose for Wikidata, while the second is an existing benchmark used by DBSimJoin based on visual descriptors of images.
Wikidata: k-nn self-similarity queries
Extending the performance experimentation presented in our previous work [9], we now present similar experiments using real SPARQL queries. We use a truthy version of the Wikidata RDF dump.9
We use the dump from February 2020.
We executed self-similarity joins with

Execution time for self-similarity join queries, for different values of k.
The system proposed in this work is compared with the closest found in literature, DBSimJoin [43], a PostgreSQL extension that supports range-based similarity joins in metric spaces implementing the Quickjoin (QJ) algorithm. As DBSimJoin only supports range queries, it is to be compared with the vp-Tree implementation in Jena (Jena-vp). The DBSimJoin system was originally tested with synthetic datasets and with the Corel Colour Moments (CCM) dataset, which consists of 68,040 records of images, each described with 9 dimensions representing the mean, standard deviation and skewness for the colours (red, green and blue) of the pixels of the image. For CCM, the DBSimJoin paper only reports the effect of the number of QJ pivots on the execution time when
To find suitable distances, we first compute a 1-nn self-similarity join, where we take the maximum 1-nn distance in the result: 5.9; this distance ensures that each object has at least one neighbour in a range-based query with
DBSimJoin is implemented and distributed an extension for PostgreSQL 8. When installed in the current hardware, it crashed with any similarity join query, because of the use of outdated I/O directives, making it impossible to re-run the experiments. Thus, we present the results previously reported [9] where we compare them with Jena-vp using the same dataset, the same queries, and the same hardware reported in [9]. Here, we provide extended discussion over these results, which are presented in Table 3. The execution time of DBSimJoin grows rapidly with r because of the quick growth of the size and number of window partitions in the QJ algorithm. DBSimJoin crashes with
Considering range-based similarity joins, these results indicate that Jena-vp outperforms DBSimJoin in terms of efficiency (our implementation is faster) and scalability (our implementation can handle larger amounts of results).
Execution times in seconds for range similarity joins over the CCM dataset
Execution times in seconds for range similarity joins over the CCM dataset

Execution time of range-based self-similarity joins, including polynomial trend line for DBSimJoin times beyond
To gain insight on the performance of the clustering solution modifier, we measure the overhead produced by the clustering algorithms. To achieve this, we prepared 1,000 synthetic SPARQL queries that draw values from the numeric properties of Wikidata. The queries are composed of only one Basic Graph Pattern (BGP), and we apply clustering over its solutions. To produce these queries, we begin with some data exploration. Specifically, from a Wikidata dump, we compute the ordinal/numeric characteristic sets [28] by: (1) filtering triples with properties that do not take numeric datatype values, or that take non-ordinal numeric datatype values (e.g., Universal Identifiers); (2) extracting the characteristic sets of the graph [28] along with their cardinalities, where, specifically, for each subject in the resulting graph, we extract the set of properties for which it is defined, and for each such set, we compute for how many subjects it is defined. Finally, we generate
Then, we execute these queries with different parameters, in order to compare the execution time of the three implemented algorithms under various realistic configurations. These parameters are:
For k-means:
For DBSCAN:
For k-medoids:
We use as a baseline the execution time of the query that

Execution time of the
In Fig. 18, we present the results for the queries using the k-means algorithm. We use same-coloured lines for runs with the same value of k, and different dashing for different values of m. The baseline is shown with the light green line. We observe that the execution time increases as the value of k and n increases, and that the value of m does not significantly influence the execution time. This is expected, since k-means has an upper bound time complexity of
The results for the queries using DBSCAN are harder to interpret, as the execution time also depends on the distance distribution of the space defined by the solution mappings obtained from evaluating each BGP. This is because DBSCAN’s cost depends on the cost of performing ε-range searches, and on the number of found neighbours. This can be noticed in Fig. 19, which presents a plot where queries with the same value for ε are shown in the same colour, and queries with the same value for p are shown with the same line dashing. The baseline is shown with the light green line. In Fig. 19, we observe unpredictable changes in the magnitude of the execution time, even for queries with a similar n. This can also be explained by the fact that we use fixed values of ε for this evaluation instead of using distance values depending on the metric space derived from the query evaluation, as it is recommended [39]. Notice that normalisation of the underlying metric space would not guarantee more consistent results, as the cost of DBSCAN depends on the distance distribution of the space. Nevertheless, queries with a larger ε and a smaller p tend to take more time. For these results, we marked queries taking more than 3 hours as timeouts, and thus not reported in the graph. This leads to most queries with

Execution time of the

Execution time of the
k-medoids is the costliest clustering algorithm of the three, as it minimises intra-cluster distance while maximising inter-cluster distance. Since the implementation requires to compute all pairwise distances first,
Given these experimental results, we can argue that the preferred way to cluster query results would be to use k-means, which shall have a similar execution time as the underlying query. Moreover, the k-medoids clustering algorithm should be avoided given its intensive memory and time usage. For the case of DBSCAN, it is necessary to be familiar with the vector space that is being generated, and carefully choose the parameters, for which we refer to the recommendations of Schubert et al. [39], of using
In this paper we have presented two extensions to SPARQL that incorporate two new similarity-based operators: similarity join and clustering. We define each operator’s syntax and semantics as part of the SPARQL grammar and the SPARQL algebra, and show how these new operators can be combined with the rest. In the case of the similarity join, we prove algebraic properties and rewriting equivalences. For the case of the clustering solution modifier, we discuss how it interacts with the other solution modifiers, such as
These operators have been implemented on top of Apache Jena, and the implementation is publicly available. We provide experimental results to evaluate the performance of the implementation, where we show that our similarity join outperforms the closest system to ours, DBSimJoin. We also offer comparisons among different methods and configurations to evaluate both the similarity join and the clustering. We determine that the better way to compute the similarity join is the vp-tree when there is a large number of solution mappings, and the nested loop when there are fewer solution mappings. For these experiments, we use 2,726 real SPARQL queries obtained from the logs of Wikidata. We also show that, when clustering, users may prefer to use the k-means algorithm, as it is faster and consumes less memory than k-medoids. We conclude by observing that DBSCAN is rather unpredictable in terms of execution time, and that it can produce useful and fast results if the parameters are carefully chosen.
As a future direction for this work, we plan to implement the physical operators using algorithms for secondary memory. This would allow for more efficient query processing and improved scalability, which has been an issue in particular for k-medoids clustering. Furthermore, we could design tailored algorithms that can take advantage of the database setting and the features of SPARQL, to further optimise the query processing.
Footnotes
Acknowledgements
This work was partly funded by ANID – Millennium Science Initiative Program – Code ICN17_002, by the Swedish Research Council (Vetenskapsrådet, project reg. no. 2019-05655), and the CENIIT program at Linköping University (project no. 17.05). Hogan was supported by Fondecyt Grant 1221926
Proofs of algebraic properties
This section presents proofs for the algebraic properties of the similarity join operator introduced in Section 4.2.
