SPARQL queries on RDF with fuzzy constraints and preferences

Abstract

With the widespread acceptance of RDF (Resource Description Framework) as the de-facto standard recommended by W3C (World Wide Web Consortium) for the representation and exchange of information on the Web, a huge amount of RDF data is being proliferated and becoming available. Efficient querying of RDF data is therefore of increasing importance. Preference queries and fuzzy queries are required for personalized queries, and have been extensively investigated in the context of databases. Focusing on SPARQL (Simple Protocol and RDF Query Language), we propose a fuzzy extension of SPARQL with preferences in this paper. First, we extend SPARQL with fuzzy sets by allowing the occurrence of fuzzy terms, fuzzy operators and fuzzy relations. Second, we introduce multi-dimensional preferences into the fuzzy SPARQL. Based on fuzzy set theory, we also propose a set of translation rules to convert the fuzzy SPARQL with preferences into crisp SPARQL. Finally we propose an effective scoring function to calculate the order of query results, which integrates the preferences and membership degrees. The proposed approach is evaluated with experiments as well.

Keywords

Fuzzy query SPARQL fuzzy relation preferences

1 Introduction

With the explosion of data on the Web, the Semantic Web has been proposed to support uniform access to Web data sources. Being a data model representing objects and their relationships in network resources, RDF is a labeled graph data format, in which the RDF abstract syntax is a set of triples called the RDF graph. Nowadays more and more Web resources are represented by RDF and the queries of massive RDF data is becoming more and more important. Several RDF query languages such as RDQL [21], RQL [7], SeRQL [5], N3QL [3], Triple [23] and SPARQL [19] have been developed. Among these RDF query languages, the W3C SPARQL standard is a vendor-independent query language for the RDF triple data model, and can supply four kinds of query results: SELECT, CONSTRUCT, ASK and DESCRIBE [19]. These solution modifiers in SPARQL play an important role in improving the query precision in different ways.

It should be noted that that the classical SPARQL processes information only in a crisp way and SPARQL FILTERs restrict solutions to those for which the FILTER expression evaluates to TRUE. As a result, existing SPARQL implementations do not allow users to form queries with preferences or vagueness, which would be desirable in the real-world applications. In the community of databases, fuzzy data queries have been widely investigated and applied [14]. In the context of RDF data, there are some efforts in RDF queries with preferences [13, 22], but the work of fuzzy RDF queries is scattered. In [6], SPARQL is extended by allowing fuzzy query conditions and f-SPARQL is proposed. With f-SPARQL, users are required to provide crisp thresholds. It would be common situation where it is not easy for users to choose appropriate thresholds. In addition, f-SPARQL in [6] supports the fuzzy query conditions only containing fuzzy terms and fuzzy operators, and does not support complex fuzzy query conditions. Focusing on approximate answers, Hogan et al. in [12] investigate query relaxation for fuzzy queries over crisp RDF data. They present a proof-of-concept framework for enabling relaxation of structured entity-lookup queries, evaluating different distance measures for performing relaxations. Complex fuzzy query conditions are not considered in [12]. A query language for fuzzy RDFS graphs named Fuzzy SPARQL is proposed in [24]. Unlike [6] and [12], Fuzzy SPARQL in [24] is issued to fuzzy RDFS graphic data instead of crisp RDF data in [6, 12]. Also Fuzzy SPARQL applies simple fuzzy query conditions only containing fuzzy terms and fuzzy operators, and does not support complex fuzzy query conditions. While the works in [6 , 24] do not support user’s preferences in the fuzzy RDF queries in an explicit way, preference SPARQL query over a set of data Web services is investigated in [1, 2] and preferences are modeled thanks to fuzzy sets. The SPARQL query with fuzzy preferences (called as SPARQLp) is discussed in [1, 2]. Unlike [6] and [24], the query conditions of SPARQLp are crisp and fuzziness only appears in the preferences. In addition, the fuzzy preferences of SPARQLp only contain fuzzy terms and do not support complex fuzzy preferences as well.

This paper presents a flexible extension to crisp SPARQL in two ways. We first extend SPARQL with fuzzy sets by allowing the occurrence of fuzzy terms, fuzzy operators and fuzzy relations. The extended SPARQL is called SPARQLf. Second, we further introduce multi-dimensional preferences into the SPARQLf. The SPARQLf with preferences is called SPARQLf-p, which provides users with a flexible way to represent their query intentions. Since the queries described with SPARQLf-p cannot be supported by the existing SPARQL implementations, we then propose a set of translation rules to convert SPARQLf-p into crisp SPARQL. Finally we define an effective scoring function for calculating the order of query results, which integrates the preferences and membership degrees.

This paper differs from [6, 24] in three major aspects. First, the orderly linguistic values are applied instead of the crisp thresholds of [0, 1]. Second, the fuzzy relations between two variables are also accommodated into the fuzzy constrains in addition to fuzzy terms and fuzzy operators. Finally, the dimensional preferences for users are explicitly introduced and applied in ranking the returned results. This paper differs from [1, 2] in two major aspects. First, the fuzzy constraints are directly applied to represent users’ fuzzy requirements instead of crisp query conditions. Second, complex fuzzy preferences are discussed. It is not difficult to differentiate between two fuzzy SPARQL languages, which are the fuzzy SPARQLf in this paper and the f-SPARQL in [6]. We use the term fuzzy SPARQL in the rest of this paper to generally refer to a kind of flexible SPARQL with fuzzy querying conditions.

The remainder of this paper is organized as follows. Section 2 briefly reviews the necessary background knowledge of fuzzy sets, RDF and SPARQL as well as the related work. Section 3 presents the syntax of fuzzy SPARQL with preferences. Section 4 shows how to translate the fuzzy SPARQL queries with preferences into crisp ones. Section 5 describes the top-k ranking algorithm. Section 6 provides an experimental evaluation of the fuzzy SPARQL with preferences. Section 7 concludes this paper.

2 Preliminaries

2.1 RDF and SPARQL

RDF is a directed, labeled graph data format for representing Web resources [15]. The RDF abstract syntax is a set of triples, called the RDF graph. A RDF triple is defined as (s ; p ; o) ∈ (I ∪ B) × I × (I ∪ B ∪ L). Here I, B and L are infinite sets of IRIs, blank nodes and RDF literals, respectively. In a triple, s is called the subject, p the predicate (or property), and o the object.

For RDF data sets, SPARQL [19] provides tools to obtain information from URIs, blank nodes, text of no format and RDF sub-graph, and construct a new RDF graph according to the information from the query graph. A SPARQL query can be defined as a quadruple Q = (V, P, DS, SM), where V declares a result form, P represents a graph pattern, DS is a data set and SM a set of solution modifiers [6].

A SPARQL query has the basic format of SELECT-FROM-WHERE. Among these components, the SELECT component is applied to represent the set of variables being shown in the result, the FROM component is employed to represent the dataset constituted by a set of IRIs of RDF documents, and the WHERE component is used to represent the graph pattern constituted by a set of RDF triples. In addition to these major components above, there is an additional solution modifier SPARQL FILTERs. SPARQL FILTERs restrict solutions to those for which the FILTER expression evaluates to TRUE. One can refer to [19] for the syntax of SPARQL.

2.2 Fuzzy sets and orderly linguistic values

In the classical sets theory, two-valued logic {0, 1} is applied. As a result, an element either belongs to a set or does not belong to a set for certain. But in the fuzzy sets theory [25, 29], an element belongs to a set with a membership degree of [0, 1] instead of {0, 1} was introduced, which is described by a membership function.

Definition 1. Let U be a universe of discourse and A be a fuzzy set in U. A membership function μ_A : U → [0, 1] is defined for A, in which μ_A (x) for each x ∈ U denotes the membership degree of x in the fuzzy set A.

Let A be a fuzzy set in the universe of discourse U with the membership function μ_A : U → [0, 1]. Then we have the following notions.

Support: the support of A is a set of the elements that have non-zero degrees of membership in A, denoted by supp (A) = {x|x ∈ U and μ_A (x) >0}.

Core: the core of A is a set of the elements that completely belong to A, denoted by core (A) = {x|x ∈ U and μ_A (x) =1}.

α-Cut: the strong (weak) α-cut of A is a set of the elements which degrees of membership in A are greater than (greater than or equal to) α, where 0 ⩽ α < 1 (0 < α ⩽ 1), denoted by A_α+ = {x|x ∈ U and μ_A (x) > α} and A_α+ = {x|x ∈ U and μ_A (x) ⩾ α}.

With α-Cut, a fuzzy set (say F) is converted to a classical one (say F_α or F_α+).

Let A be a fuzzy set in the universe of discourse U with the membership function μ_A : U → [0, 1]. We say A is convex if and only if for all μ₁, μ₂ in U, μ_A (λu₁ + (1 - λ) u₂) ⩾ min (μ_A (u₁) , μ_A (u₂)), where λ ∈ [0, 1]. In addition, we say A is a normal fuzzy set if ∃ u ∈ U, μ_A (u) =1. A fuzzy number is a convex and a normal fuzzy set. Let A and B be two fuzzy numbers in U and let A_α and B_α be the α-Cuts of A and B, respectively. Then we have (A ∪ B) _α = A_α ∪ B_α and (A ∩ B) _α = A_α ∩ B_α.

A fuzzy set is applied to represent a fuzzy notation, which generally corresponds to a linguistic value [26 –28]. With some linguistic values, we may get orderly linguistic values.

Definition 2. (orderly linguistic values). Let S be a finite and totally ordered label (term) set and formally defined as S = {s_i|i = 0, …, τ}, where τ is a positive integer which means the cardinality of S, s_i is a linguistic value and s_i < s_j if 0 ⩽ i < j ⩽ τ.

In the orderly linguistic values, the middle linguistic value (i.e., s_τ/2) means “approximately 0.5”, and it is assumed that the other values are placed symmetrically around it. In particular, s₀ and s_τ denote the lower and upper limits of linguistic values. Actually orderly linguistic values generally represent adverbs in the natural language.

Let s_i and s_j be two linguistic values. Then we have the following operations on linguistic values.

neg (s_i) = s_j, j = τ - i

max (s_i, s_j) = s_i, if s_i ⩾ s_j

min(s_i, s_j) = s_i, if s_i ⩽ s_j

Here neg () represents a negation operator, and max () and min () represents the maximazation operator and the minimazation operator, respectively.

The cardinality of S (i.e., τ) are odd such as 7 and 9. The limit of cardinality is 11. Generally it must be small enough so as not to impose useless precision on decision making, but it must also be rich enough in order to allow a discrimination of the performances of each object in a limited number of grades. According to Miller decision theory, generally 5 up to 7 linguistic values on an ordered scale are used [16]. In this paper, we adopt 7 linguistic values to make up the orderly linguistic values. Then we have S = {s₀ = none, s₁ = very low, s₂ = low, s₃ = medium, s₄ = high, s₅ = very high, s₆ = perfect}.

Viewed from the processing of orderly linguistic values, it is required that each linguistic value (say s_i) is associated with a fuzzy set defined with a membership function. In principle, membership functions can be of different shape, but in practice, trapezoidal and triangular membership functions are most frequently used. Trapezoid membership function is generally described with (a_i, b_i, c_i, d_i), where a_i and d_i represent minimal and maximal value in membership interval of an element of an order i, b_i and c_i form an interval to show the membership degree equals to 1.

The notations and semantics of the orderly linguistic values S = {s₀ = none, s₁ = very low, s₂ = low, s₃ = medium, s₄ = high, s₅ = very high, s₆ = perfect} are presented in Table 1. The trapezoid membership functions of the orderly linguistic values in S are given in Fig. 1.

2.3 Related work

In recent years with expansion of data Web applications, problems of RDF queries come to forefront. In [6], SPARQL is extended with an easy-to-grasp yet powerful way for querying the RDF data sets. In the extended SPARQL, the occurrence of fuzzy terms and fuzzy operators is allowed. Also, in order to express users’ weight preferences on fuzzy constraints, user-defined weights in the form of ∑ω_i = 1 are provided in [6]. Finally it is shown in [6] how to translate fuzzy terms and fuzzy operators into crisp ones and how to efficiently calculate the top-k answers of fuzzy queries with regard to membership degrees as well as user-defined weights. In [12], for fuzzy queries over RDF data, a proof-of-concept framework is presented for enabling relaxation of structured entity-lookup queries, evaluating different distance measures for performing relaxations. To query the fuzzy DL-Lite ontologies, in [18], an extended SPARQL is proposed, which allows threshold and user-defined weights in the query.

In addition to the efforts of fuzzy constraints mentioned above, there are some efforts in the expression constructed with users’ preferences in the context of the Semantic Web. In [22], a method to querying the Semantic Web with multi-dimensional preferences is proposed. In their method, the preferences are divided into two types, which are the Boolean preferences and the scoring preferences. By using the Preferring clause which defines some independent preferences separated by the AND construct, users can specify the preferences. In addition, the atomic preferences can be nested using the CASCADE construct. It should be noted that the ranking method in the Preferring modifier may meet the situation that the queries cannot be handled in [22]. Assume that, for example, we just need one more result in the current result set of a four-dimension preference. Also we assume that there is a best solution in each dimension, but there are not any domination relations between any two of them. For this case, it is difficult for the ranking to be finished (see [22] for more details about ranking). In [13], the preference problem of fuzzy and multiple conditions are considered in practical query, and the relevancy computational formula is provided, which can realize preference query through sequencing.

This paper differs from the efforts above in three aspects: first, the orderly linguistic values are applied instead of the crisp thresholds in [0, 1]; second, the fuzzy relations between two variables are also introduced into the fuzzy constrains in addition to fuzzy terms and fuzzy operators; finally, the dimensional preferences for users are explicitly introduced and applied in ranking the returned results.

3 Fuzzy SPARQL with preferences

In this section, we present the fuzzy SPARQL with preferences, which is an extension to the classical SPARQL in two ways. First, we present the fuzzy SPARQL (denoted SPARQLf), which is a flexible extension to the classical SPARQL language. In the FILTER component of SPARQLf, the original expression is extended by allowing fuzzy terms and fuzzy operators. Furthermore fuzzy relations are introduced into SPARQLf. Second, a structure used to express users’ multi-dimensional preference is introduced in to SPARQLf. The SPARQLf with preferences is denoted SPARQLf -p.

3.1 Fuzzy terms and fuzzy operators

The fuzzy extension of SPARQL mainly takes place in the FILTER constraint component. In [6], the basic form of FILTER constraint allowing fuzzy terms is as follows.

FILTER (?X θ FT [WITH α])

Among them, FT denotes a fuzzy term (e.g., “tall” or “young”), which corresponds to a fuzzy set using a trapezoidal membership function, and θ is an operator of >, < , = , ⩾ , ⩽ , ! = , between and not between . If the operator θ is between or notbetween, the syntax of FILTER constraint is a little different, i.e.,

FILTER (?X between/notbetween FT₁ and FT₂) [WITH α],

where FT₁ and FT₂ are two fuzzy numbers, respectively.

The optional parameter [WITH α] indicates the condition that must be satisfied as the minimum membership degree in [0, 1]. It is required in [6] that users need to choose an appropriate value of α to express their requirements. If not specified, 1 is used as default. However, it is not easy for users to understand the semantics of a crisp α and to make an appropriate choice. For this reason, we apply orderly linguistic values in FILTER constraint, which has the form as follows.

FILTER (?X θ FT [WITH s_i]) or

FILTER (?X between/notbetween FT₁ and FT₂) [WITH s_i]

The fuzzy terms applied in SPARQLf can be classified into three types, which are simple (atomic) fuzzy term, modified (composite) fuzzy term and compound fuzzy term. A simple fuzzy term such as “young” or “tall” is defined in terms of a fuzzy number with a membership function. A modified fuzzy term, e.g. “very young” or “more or less tall”, is described by a simple fuzzy term and a fuzzy modifier together. A compound fuzzy term is represented by simple fuzzy terms or modified fuzzy terms connected by or (union), and (intersection), or not (complement) connectors, such as “young or very young”.

Now we consider fuzzy operators within the FILTER constraint, which is formalized as follows.

FILTER (?X $\tilde{θ}$ Y) [WITH s_i]

Here $\tilde{θ}$ denotes a fuzzy operator, and Y denotes a string, an integer or any other data types allowed in RDF. Then the fuzzy operator $\tilde{θ}$ and the crisp value Y constitute a fuzzy number represented by $\tilde{θ}$ Y. We, in this paper, mainly discuss three types of fuzzy relations, which are “close to (around)”, “at least”, and “at most”.

3.2 Fuzzy relation

It is worth noting that the fuzzy terms and fuzzy operators discussed above are applied in single variables of the fuzzy constraints. In “Age = old” and “Age is around 50”, for example, fuzzy term “old” and fuzzy operator “around 50” are related to a single variable “Age”. In practice, there exist many fuzzy semantics constraints within two variables. Let us look at an example. Suppose that we have two variables, say “supplier” and “customer”, and there is a fuzzy constraint relation between them, say “customer likes supplier”. In order to handle this kind of fuzzy constraints, we further introduce fuzzy relations into SPARQLf.

Definition 3. (fuzzy relation). Let U and V be two universes of discourse, and R be a fuzzy relation on U × V. A membership function θμ_R : U × V → [0, 1] is defined for R, in which μ_R (u, v) for each u ∈ U and v ∈ V denotes the membership degree of (u, v) in the fuzzy relation R.

Basically we can identify two major kinds of fuzzy relations, which are simple fuzzy relation and compound fuzzy relation. A simple fuzzy relation between elements from two universes of discourse and its membership function are generally defined directly by the domain experts. Compound fuzzy relations can be obtained with simple fuzzy relations.

Definition 4. (compound fuzzy relation). Let Q be a fuzzy relation on U × V and R be a fuzzy relation on V × W. Then a compound fuzzy relation on Q and R, noted as Q © R, is defined as Q © R (u, w) = ∨ (Q (u, v) ∧ R (v, w)) (u ∈ U ∧ v ∈ V ∧ w ∈ W).

Example 1. Let R be a fuzzy relation, which represents that variable x is much greater than variable y. Here x and y are two variables taking real values. The membership function of this simple fuzzy relation may be defined, for example, as follows.

$R (x, y) = {\begin{matrix} 0, x ⩽ y \\ \overset{- 1}{[1 + 100 \times \overset{- 2}{(x - y)}]}, x > y \end{matrix}$

Suppose that there is a compound fuzzy relation R © R, meaning that x is far greater than y. Then we have R © R (x, y) = ∨ _z (R (x, z) ∧ R (z, y)) and would get

$\begin{matrix} \lor_{z} (R (x, z) \land R (z, y)) = \lor_{z} R (x, z) = R (z_{0}, y) \\ whenR (x, z) ⩽ R (z, y); \end{matrix}$

$\begin{matrix} \lor_{z} (R (x, z) \land R (z, y)) = \lor_{z} R (z, y) = R (x, z_{0}) \\ whenR (z, y) ⩽ R (x, z) . \end{matrix}$ In other words, we can get z₀ that makes R (x, z) equal to R (z, y). That is, [1 +100 × (x - z) ^-2] ^-1 = [1 + 100 × (z - y) ^-2] ^-1. Then we further get the following membership function for R © R. $R (x, y) = {\begin{matrix} 0, x ⩽ y \\ \overset{- 1}{[1 + (100 \times \overset{- 2}{(x + y) / 2)}]}, x > y \end{matrix}$

In order to accommodate the fuzzy relations into SPARQL, we apply FILTER (Fuzzy Relation (?X, ?Y) [WITH s_i]) to express the fuzzy constraints between two variables in the FILTER expression.

Example 2. Suppose that a user would like to find the name, height and weight of some male individuals who are of middle height and physical health. This query can be formalized in SPARQLf as follows.

#FQ#

SELECT ?name, ?height,? weight WHERE {

?x foaf:name ?name.

?x ex:hasHeight ?height.

?x ex:hasWeight ?weight.

FILTER (?height=medium tall WITH s₃ &&

Health (?height, ?weight) WITH s₅)

}

In the query above, #FQ# is placed before the SELECT clause, indicating that the SPARQL query is a fuzzy one. In addition, Health (?height, ?weight) WITH s₅ is used to indicate a fuzzy relation between height and weight, and s₅ is used to indicate the threshold of constraints in the query condition.

3.3 Preferences

It is very common to use preferences in human natural language. Thus to support users’ demands well, it is very necessary to accommodate preferences in queries. In this section, we apply the orderly linguistic values for dimensional preferences.

In order to construct users’ preferences in SPARQL, we introduce two keywords AND and SUPERIOR.

The keyword AND is used to express independent preferences and SUPERIOR is used to signify prioritized preferences. We define AND and SUPERIOR as binary relation connection identifiers. Let P_i, P_j and P_k be three atomic preferences. For AND, we have the followings.

Law 1 (commutative law): P_i AND P_j = P_j AND P_i;

Law 2 (associative law): P_i AND (P_j AND P_k) = (P_i AND P_j) AND P_k;

Law 3 (idempotent law): P_i AND P_i = P_i;

For SUPERIOR, we have:

P_i SUPERIOR (P_j SUPERIOR P_k) = (P_i SUPERIOR P_j) SUPERIOR P_k.

Note that SUPERIOR does not satisfy the commutative law because of the strict partial order relationship between the two atomic preferences. So P_i SUPERIOR P_j is not equal to P_j SUPERIOR P_i. When AND as well as SUPERIOR appears together, we have the followings.

Law 4: P_i SUPERIOR (P_j AND P_k) = (P_i SUPERIOR P_j) AND (P_i SUPERIOR P_k)

Law 5: P_i AND (P_j SUPERIOR P_k) ≠ (P_i AND P_j) SUPERIOR (P_i AND P_k)

Now we investigate how to quantitatively convert a dimensional preference to a weight vector in a semantic perspective. For this purpose, we first define two relational operators: ⊕ for AND and ⊗ for SUPERIOR. Also we use S_Δi to indicate the orderly linguistic value of P_i. We have the following rules to convert an original preference to a weight vector.

Rule 1: P_i AND P_j ≅ S_Δi ⊕ S_Δj;

Rule 2: P_i SUPERIOR P_j ≅ S_Δi ⊗ S_Δj;

In order to obtain the relatively independent weight of atomic preference after SUPERIOR, we define a notion of conditional probability.

Definition 5. (conditional probability). Let A and B be two atomic preferences, and S_ΔA and S_ΔB be their orderly linguistic values, respectively. Suppose that A and B has a relationship of A SUPERIOR B. Then B has a conditional probability with respect to A, denoted CP (B|A).

With CP (B|A), we have the independent weight of B, denoted CP (B|A) * S_vartriangleB. Here CP (B|A) can be calculated by counting N (AB) and N (A). The former represents the number of results satisfying both A and B, and the later represents the number of resultsmeeting A.

The relational operators ⊕ and × and the conditional probabilities satisfy the following rules.

Rule 3: S_vartrianglei ⊕ S_vartrianglei = S_vartrianglei;

Rule 4: S_vartrianglei ⊗ S_vartrianglej = S_vartrianglei ⊕ CP (P_j|P_i) * S_vartrianglej;

Rule 5: CP (P_k|P_i) * S_vartrianglek ⊗ CP (P_k|P_iP_j) * S_vartrianglek = indent CP (P_k|P_iP_j) * S_vartrianglek;

Rule 6: CP (P_k|P_i) * S_vartrianglek ⊗ S_vartrianglej = CP (P_k|P_i) * indent S_vartrianglek ⊗ CP (P_j|P_iP_k) * S_vartrianglej;

Rule 4 means that the conditional probability must be considered while calculating the weight of the right operand with respect to ⊗. Rule 5 describes a preference which must satisfy the most restrictedconditions.

Using the rules presented above, we can obtain a preference with the form of r₁ * S_vartriangle1 ⊕ r₂ * S_vartriangle2 ⊕ … ⊕ r_n * S_vartrianglen. Here n indicates the number of atomic preferences, and r_i indicates a weight coefficient, either a conditional probability or a product of probabilities. Then we use the vector of (r₁ * S_vartriangle1, r₂ * S_vartriangle2, … , r_n * S_vartrianglen) to represent this preference.

3.4 Syntax of SPARQLf-p

The SPARQLf with preferences is denoted as SPARQLf-p. In this section, we summarize the syntax of SPARQLf-p. The SPARQLf-p consists of three parts, which are “Query”, “Constraint” and “PREFER”.

Part 1: Query

In SPARQLf-p, element QueryType is introduced into SELECT query [6] to indicate the query type, either the fuzzy type or the top-k fuzzy type. We use keywords #FQ# to declare that the query is fuzzy, and use #top-k FQ# to declare that the query is top-k fuzzy.

Part 2: Constraint

In addition to the three original elements (i.e., BracketedExpression, BuiltInCall and FunctionCall) in the classical SPARQL, the “Constraint” element is extended with a “FlexibleExpression” element, which contains fuzzy terms elements and fuzzy relation element. Element FuzzyTermExpression is a representation of fuzzy terms that is consisted of symbols =, ≠, >, ≥, <and ≤. The FuzzyOperatorExpression is an expression with simple fuzzy operator or modified fuzzy one. The structure of FuzzyRelationExpression is SimpleFuzzyTerm (Var, Var) [WITH s_i], which describes the fuzzy relation between twovariables.

Part 3: PREFER

The PREFER is an additional part to construct preference expression, which is added to the SolutionModifier part. At this point, the keyword AND constructs the independent preferences and SUPERIOR constructs the prioritized preferences. And its syntax form is SuperiorPreference (‘AND’ SuperiorPreference)*, where the SuperiorPreference is AtomicPreference (‘Superior’ AtomicPreference)*, and the AtomicPreference element is an atomic preference with weight of orderly linguistic value.

Table 2 presents the syntax of SPARQLf-p.

Example 3. Now we use an example to illustrate the SPARQL query containing three parts discussed above. Suppose that a user would “find all persons who are less than 40 years old and with height close to 175 cm and physical health, with preferences of age constraint in a medium degree and height constraint in a high degree with priority to weight constraint in a somewhat high degree”. With the SPARQLf-p, the query can be formalized as follows.

#FQ#

SELECT ?X WHERE {?X ex:has Age ?age.

FILTER (?age <40) WITH s₃.

?X ex:hasHeight ?height.

FILTER (?height close to 175 WITH s₅).

?X ex:hasWeight ?weight.

FILTER (HEALTH (?height,?weight) WITH s₃)

}

PREFER

(?age <40) WITH s₃

AND

(?height = medium) WITH s₅

SUPERIOR (?weight) WITH s₄.

4 Fuzzy query translations

By extending the classical SPARQL, the fuzzy SPARQL can present users’ fuzzy query requirements with preferences well. But the fuzzy SPARQL cannot be identified and supported by the classical SPARQL query engines. In order to implement the fuzzy SPARQL queries, we basically have two ways. The first method is to design and develop a completely new query engines which fully support the fuzzy SPARQL directly. So far, to our best knowledge, there are not any such fuzzy SPARQL query systems. The second method is to use the classical SPARQL query engines instead of developing a new fuzzy SPARQL query engine. The second method has been widely applied in the fuzzy database queries because it has some obvious advantages such as low costs and short times in system implementations [14].

In this paper, we follow the second method mentioned above to implement the fuzzy SPARQL queries. For this method, a core issue is how the fuzzy SPARQL can be identified and supported by the classical SPARQL query engines because the classical SPARQL query engines can only identify and support classical SPARQL. So we need to translate the original fuzzy SPARQL into the classical SPARQL. Clearly, the translated classical SPARQL must be semantically closed to the original fuzzy SPARQL. In the following we investigate how to translate the fuzzy SPARQL into the classical SPARQL in a way of semanticpreservation.

The difference between the fuzzy SPARQL and the classical SPARQL is that the fuzzy SPARQL applies the fuzzy constraints, which contain fuzzy numbers. The fuzzy numbers in the fuzzy SPARQL are convex and normal and so it is possible to translate fuzzy SPARQL to classical SPARQL.

4.1 Translation of fuzzy terms and fuzzy operators

First we concentrate on the translation of simple fuzzy terms and modified fuzzy term. For FILTER (?X θ FT [WITH s_i]), s_i is converted into a fuzzy number with membership function according to Table 1. Then FT with s_i an interval, written as FT (s_i) = [x, y], and fuzzy queries can be translated into crisp ones according to the rules given in Table 3.

Let FT₁ and FT₂ be two fuzzy terms and s_i be an orderly linguistic value. Also let FT₁ (s_i) = [x₁, y₁] and FT₁ (s_i) = [x₂, y₂]. For FILTER (?X between FT₁ and FT₁ WITH s_i) in the constraints, it is converted into ?X ≥ FT₁ WITH s_i && ?X ≤ FT₂ WITH s_i and then into ?X ≥ y₁ && ?X ≤ x₂ according to Table 4. Table 4 gives the translation rules for the compound fuzzy terms.

For the translations of “FT₁ or FT₂” and “not FT”, ones can refer to [6].

Second we focus on the translation of fuzzy operators. Here we mainly discuss three kinds of fuzzy operators, which are “closeto”, “atleast” and “atmost”.

Case 1: close to

For FILTER (?X $\tilde{θ}$ Y)[WITH s_i], the operator $\tilde{θ}$ represents “closeto”. This fuzzy operator along with the crisp value Y actually forms a fuzzy term, which corresponds to a membership function. The membership function of “closeto” is defined as μ_closetoY (Y) (u) = (1 + ((u - Y)/b) ^-2) ^-1. According to the membership interval [λ₁, λ₂] of s_i, we choose λ₁ as a threshold and then get α-cut sets [x, y]. With the interval [x, y], we can translate fuzzy queries into crisp ones.

Example 4. Assume that we have a query “find the jobs with salary very close to 10000”. The query is formatted with f-SPARQL as follows.

#FQ#

SELECT ?X WHERE {

?X rdf:type WorkPos.

?X ex:monthSalary ?salary.

FILTER (?salary close to 10000 WITH s₅).

}

Here we choose 1000 as the value of b in the above membership function. That is μ_closetoY (Y) (u) = (1 + ((u - Y)/1000) ^-2) ^-1. According to the mapping rules between orderly linguistic values and trapezoidal member degree in Table 1, we get λ = 0.71 based on s₅. Then the interval of λ-cut set can be computed and we have Y_λ = [936, 10639]. Furthermore, ?salary close to 10000 WITH S₉ is converted into ?salary ≥ 9361 and ?salary ≤ 10639.

Case 2: at most and at least

We can directly make translations from “at most Y” to “≤ Y” and from “at least Y” to “≥ Y”, respectively.

4.2 Translation of fuzzy relationships

Let R (x, y) be a fuzzy relationship between variables x and y. To translate the fuzzy relationship into a crisp one, we identify the following three cases for R (x, y).

Case 1: there are not any constraints on either x nor y.

For the given s_i, an interval [λ₁, λ₂] of the membership degree can be obtained. Then the reciprocal fuzzy relationship is converted into λ₁ ≤ R (?X, ?Y) ≤ λ₂.Such a expression containing arithmetical operations on variables is allowed in the traditional FILTER constraints and is written as R (?X, ?Y) ≥ λ₁ && R (?X, ?Y) ≤ λ₂.

Case 2: there is a fuzzy constraint either x or y.

First we make a preliminary translation and get λ₁ ⩽ R (x, y) ⩽ λ₂. Suppose that the fuzzy constraint is only on variable x, where x has a crisp range. Then using the inequalities on x and y with the range of x, we can get the expressions of y represented by x. The expressions about y can be translated as y ∈ [f₁ (x) , f₂ (x)], y ∈ [min(U) , f₁ (x)] or y ∈ [f₂ (x) , max(U)]. Among them, min (U) and max (U) mean the minimum value and maximum values of y in its domain, respectively. Finally we can combine these two expressions together and get yθf (x) to represent the constraints on x and y.

Example 5. Assume that the constraint in a query contains “u is very close to 5 and v is fairly bigger than u”. The filter representation is “u close to 5 WITH s₅ and R (v, u) WITH s₄”. With the membership function of “u is much greater than v” (in Section 3.2) and the function of “u close to Y” (in Section 4.1), in which 4 is selected as the value of b, we can get the translation results as follows.

the translation result of the single variable u is u∈ [4.08, 5.92];

the translation result of the fuzzy relation R (v, u) is 1 + 100 * (v - u) ^-2 ⩽ 1/0.76 && v > u;

the complete translation of FILTER (u close to 5 WITH s₅ and R (v, u) WITH s₄) is FILTER (u ≥ 4.08 && u ≤ 5.92 && 1 + 100 * (v - u) ^-2 ⩽ 1/0.76 && v > u).

Case 3: there are fuzzy constraints on both variables in the fuzzy relationship.

Two variables in this case need to be translated, respectively. The translation result of x is x θ f (y) and the translation result of y is y θ g (x). The final constraints are translated to x θ f (y) && y θ g (x).

For the case that there is not a direct simple fuzzy relationship between two variables given in the query, additional fuzzy compositional operations are needed before translating. The basic idea is to first convert the indirect fuzzy relationship to a simple fuzzy relationship and then the translation results can be obtained according to the different cases discussed above.

5 Ranking

In order to rank the returned query results, the work in [6] proposes a scoring function with a simple method of weighted average value. But it is possible that all weights may be equal each other. A ranking method is proposed in [13] for Semantic Web search based on fuzzy logic. Their scoring strategy combines the membership degrees with user preferences and they define a non-increasing order of “degree of relevance”. We propose our strategy, a combination of user preferences and levels of satisfaction based on membership degrees.

We use Q (X, f_ω) as the formal expression of user query, in which X means the bounded variables in query conditions and f_ω indicates the weighting function. This weighting function has some desirable properties (see [13] for more details). In order to satisfy the equation ∑ω_i = 1, the weighting (r₁ * S_vartriangle1, r₂ * S_vartriangle2, …, r_m * S_vartrianglem) has to be reduced to (ω₁, ω₂, …, ω_m) by the same scaling. Without loss of generality, for a weighting ω = (ω₁, ω₂, …, ω_m), we assume ω_1⩾ ω_2⩾ … ⩾ ω_m. Then we have $\begin{matrix} f_{ω} (A_{1} (x_{1}), A_{2} (x_{2}), \dots, A_{m} (x_{m})) \\ = (ω_{1} - ω_{2}) \cdot f (A_{1} (x_{1})) + 2 (ω_{2} - ω_{3}) \cdot f (A_{1} (x_{1}), \\ A_{2} (x_{2})) + 3 (ω_{3} - ω_{4}) \cdot f (A_{1} (x_{1}), A_{2} (x_{2}), A_{3} (x_{3})) \\ + \dots + m ω \cdot f (A_{1} (x_{1}), A_{2} (x_{2}), \dots, A_{m} (x_{m})) \\ = m ω \cdot f (A_{1} (x_{1}), A_{2} (x_{2}), \dots, A_{m} (x_{m})) \\ + \sum_{i = 1}^{m - 1} i \cdot (ω_{1} - ω_{i + 1}) \cdot f (A_{1} (x_{1}), A_{2} (x_{2}), \dots, A_{i} (x_{i})) \end{matrix}$

In the above, f (A₁ (x₁) , A₂ (x₂) , …, A_m (x_m)) is an aggregation function of A_i (x_i), indicating the degree of membership. The aggregation function may be min, product, averaged sum, or even one defined according to real applications. Here we apply min and then the aggregation function is defined as follows. $\begin{array}{l} f (A_{1} (x_{1}), A_{2} (x_{2}), \dots, A_{m} (x_{m})) = \min (A_{1} (x_{1}), \\ A_{2} (x_{2}), \dots, A_{m} (x_{m})) \end{array}$

The formula above is just based on the scoring strategy on the results. It is necessary to improve efficiency of queries due to large data sets. It is proved that the TA algorithm is correct for each monotone aggregation function and requires only bounded buffers [9]. So we adopt the TA algorithm for our ranking. The modified TA algorithm for our ranking is shown in Algorithm 1. In Algorithm 1, List_jDegre is a membership degree list of result set on the property j, and τ is a threshold being updated in the operation.

Algorithm 1 The modified TA algorithm for top-k ranking
Input: n results, membership degree score lists of m bounded
variables, aggregation function and weighting vector and the
expected number k
Output: k results with descending order
1. Buffer ← buffer zone for k results
2. for each i = 1 to i = ndo
3. for each j = 1 to j = mdo
4. Do access to a result r from List_jDegre
5. Get membership score of r in List_jDegree
6. Do random access to the other lists to find the scores in
every list
7. Compute overall score according to scoring function and
get Score(r)
8. if Score (r) > τ (a threshold) then
9. Put (r_id, score) recording Score (r) into Buffer in
descending order
10. count++
11. end if
12. end for
13. if count≥kthen
14. break;
15. end if
16. update τ
17. end for
18. return Buffer

6 Experiments

To evaluate our proposed approach, we use Eclipse as development environment to implement the prototype system. The prototype runs on a Windows XP Professional System with Intel Pentium 1.8 G CPU and with 1 G RAM. Also we use TDB (a database to store RDF dataset) and Jena ARQ (a query engine to execute crisp SPARQL queries) for the evaluation. We generate the RDF dataset from the Lehigh University Benchmark LUBM [10], which contains about 6,000k triples.

We first introduce four SPARQLf queries shown in Table 5.

In the four SPARQLf queries above, we further use in the experiments the following preference.

PREFER

(?X = busy) WITH s₄

AND

(Ratio (UndergraduateStudent, ?X) is moderate)

WITH s₃

SUPERIOR (Ratio (GraduateStudent, ?X) is

almost 3.5) WITH s₃.

Combining this preference, the four SPARQLf queries Q₁’, Q₂’, Q₃’ and Q₄’ become four SPARQLf-p queries Q₁, Q₂, Q₃ and Q₄.

We also apply three membership functions in our experiments, which are defined as follows.

Busy (n) = 2/(1 + exp (–0.4n)) − 1,

Ratio_moderate (x, y) = [1+((x - 11y)/10)²]^–1

Ratio_almost3.5 (x, y) = [1 + ((x - 3.5y)/10)²]^–1

Here n indicates the number of courses taken by a student or taught by a lecturer, and x represents the number of undergraduates and gradates, and y represents the number of the faculty.

We use Recall and Precision to assess our approach. Recall and Precision have been widely adopted for querying evaluation and they can be defined as follows. $Recall = \frac{# relevant triples in query results}{# relevant triples in dataset}$ $Precision = \frac{top - k triples \cap relevant triples}{k}$

The developed prototype takes the four SPARQLf-p queries Q₁, Q₂, Q₃ and Q₄ as inputs, respectively, and calls Jena ARQ to search the corresponding query results from the RDF dataset, which is generated from the LUBM and stored in TDB. For the four SPARQLf-p queries Q₁, Q₂, Q₃ and Q₄, their experimental results of recall and precision are summarized in Table 6. It can be seen that, for the given sample queries, the average recall of our approach is 0.7375 and the average precision of our approach is 0.845. So our approach is good at dealing with the fuzzy SPARQL queries with preferences.

In addition, we compare our approach using the SPARQLf-p (named OLVP for short) with the approach using f-SPARQL proposed in [6] (named NVP for short) on the response time of queries. The experimental results are shown in Fig. 2.

It is clearly shown in Fig. 2 that OLVP takes a little longer time than NVP. The reason why OLVP takes a little longer time than NVP is that compared with NVP, OLVP can provide a more powerful querying means, and this needs more complex processing and hereby more time. First, in addition to the fuzzy terms and the fuzzy operators used by NVP, OLVP uses the fuzzy relations as well. Generally, converting and then evaluating a more complex fuzzy SPARQL query in OLVP obviously takes more time than a simple fuzzy SPARQL query in NVP. Second, OLVP explicitly supports user preferences which are considered in ranking query results. Processing multi-dimensional preferences and ranking query results with preferences in OLVP do need additional time compared to the simple query result ranking in NVP.

7 Summaries

In order to provide users with an efficient way to express their intentions in querying RDF data, in this paper we present a fuzzy extension to SPARQL with preferences. We extend SPARQL first by allowing the occurrence of fuzzy terms, fuzzy operators and fuzzy relations and then by allowing the occurrence of preferences. The fuzzy SPARQL with preferences is called the SPARQLf-p. With fuzzy set theory, we also define a set of translation rules, converting SPARQLf-p into crisp ones. With the translation rules, we can take advantage of existing implementations of SPARQL to process SPARQLf-p queries instead of developing new implementations of SPARQL. Finally we propose a scoring function for calculating the order of query results, which contains membership degrees andpreferences.

In this paper, we use LUBM dataset to verify our approach. As pointed by Cudre-Mauroux et al. [8], there exist several benchmarks for RDF data management research, and, in addition to LUBM, Berlin Benchmark [4], DBpedia Benchmark [17] and SP²Bench [20] are also widely used benchmarks. We will use other datasets such as DBpedia and SP²Bench datasets to evaluate the performances of our approach. Furthermore our future work will investigate possible optimization strategies for querying RDF and evaluating their performances with these benchmarks.

References

Amdouni

, Barhamgi

, Benslimane

, Hadjali

, Benouaret

and Faiz

, Answering Fuzzy Preference Queries over Data Web Services, Proceedings of the 2012 International Conference on Web Engineering, (2012), pp. 456–460.

Benouaret

, Benslimane

, Hadjali

, Barhamgi

, Maamar

and Sheng

Q.Z.

, Web service compositions with fuzzy preferences: A graded dominance relationship-based approach, ACM Trans Internet Technology 13 (2014), 12.

T. Berners-Lee, N3QL–RDF Data Query Language, http://www.w3.org/DesignIssues/N3QL.html, 2004.

Bizer

and Schultz

, The Berlin SPARQL Benchmark, International Journal on Semantic Web and Information Systems 5 (2009), 1–24.

Broekstra

and Kampman

, SeRQL: An RDF Query and Transformation Language, Proceedings of the 2004 International Symposium on Wearable Computers, 2004.

Cheng

J.W.

, Ma

Z.M.

and Yan

, f-SPARQL: A Flexible Extension of SPARQL, Proceedings of the 21st International Conference on Database and Expert Systems Applications, 2010, 487–494.

Christophides

, et al., Declarative Languages for Querying Portal Catalogs, Proceedings of DELOS Workshop: Information Seeking Searching and Querying in Digital Libraries, Zurich, Switzerland, 2000, pp. 115–120.

Cudre-Mauroux

, NoSQL Databases for RDF: An Empirical Evaluation, Proceedings of 12th International Semantic Web Conference, 2013, pp. 310–325.

Fagin

, Lotem

and Naor

, Optimal aggregation algorithms for middleware, Journal of Computer and System Sciences 66 (2003), 614–656.

10.

Guo

, Pan

and Hein

, LUBM: A benchmark for OWL knowledge base systems, Journal of Web Semantics 3 (2005), 158–182.

11.

Herrera-Viedma

, Modeling the retrieval process for an information retrieval system using an ordinal fuzzy linguistic approach, Journal of the American Society for Information Science and Technology 52 (2001), 460–475.

12.

Hogan

, Mellotte

, Powell

and Stampouli

, Towards Fuzzy Query-Relaxation for RDF, Proceedings of the 9th Extended Semantic Web Conference, 2012, pp. 687–702.

13.

Jin

and Ning

X.M.

, Combining weights with fuzziness for intelligent semantic web search, Knowledge-Based Systems 21 (2008), 655–665.

14.

Z.M.

and Yan

, Generalization of strategies for fuzzy query translation in classical relational databases, Information & Software Technology 49 (2007), 172–180.

15.

F. Manola and E. Millar, RDF Primer, http://www.w3.org/TR/2004/REC-rdf-primer-20040210/, 2004.

16.

Miller

C.F.

, Decision Problems for Groups: Survey and Reflections, Algorithms and Classification in Combinatorial Group Theory (eds. Baumslag

and Miller

C.F.

), Springer-Verlag, 1992, pp. 1–59.

17.

Morsey

, Lehmann

, Auer

and Ngomo

A.C.N.

, DBpedia SPARQL Benchmark-Performance Assessment with Real Queries on Real Data, Proceedings of 10th International Semantic Web Conference, 2011, pp. 454–469.

18.

Pan

J.Z.

, Stamou

G.B.

, Stoilos

, Taylor

and Thomas

, Scalable Querying Services Over Fuzzy Ontologies, Proceedings of the 17th International World Wide Web Conference, 2008, pp. 575–584.

19.

E. Prud’hommeaux and A. Seaborne, SPARQL Query Language for RDF, W3C Recommendation, http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/, 2008.

20.

Schmidt

, Hornung

, Kuchlin

, Lausen

and Pinkel

, An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario, Proceedings of the 7th International Semantic Web Conference, 2008, pp. 82–97.

21.

A. Seaborne, RDQL-A Query Language for RDF, http://www.w3.org/Submisssion/2004/SUBM-RDQL-20040109/, 2004.

22.

Siberski

, Pan

J.Z.

and Thaden

, Querying the Semantic Web with Preferences, Proceedings of the Fifth International Semantic Web Conference, 2006, pp. 612–624.

23.

Sintek

and Decker

, TRIPLE-An RDF Query, Inference, and Transformation Language, Proceedings of the Fourteenth International Conference on Applications of Prolog, 2001, pp. 47–56.

24.

Straccia

, Fuzzy SPARQL: A Query Language for Fuzzy RDFS Graphs, Foundations of Fuzzy Logic and Semantic Web Languages, Chapman and Hall/CRC, 2014, pp. 305–309.

25.

Zadeh

L.A.

, Fuzzy sets, Information and Control 8 (1965), 338–353.

26.

Zadeh

L.A.

, The concept of a linguistic variable and its application to approximate reasoning-I, Information Sciences 8 (1975), 199–249.

27.

Zadeh

L.A.

, The concept of a linguistic variable and its application to approximate reasoning-II, Information Sciences 8 (1975), 301–357.

28.

Zadeh

L.A.

, The concept of a linguistic variable and its application to approximate reasoning-III, Information Sciences 9 (1975), 43–80.

29.

Zadeh

L.A.

, Fuzzy sets as a basis for a theory of possibility, Fuzzy Sets and Systems 1 (1978), 3–28.