Linguistic summaries of graph datasets using ontologies: An application to Semantic Web

Abstract

This paper presents an approach to generating linguistic summaries of graph datasets, with a particular focus on the use of ontologies. This well-known data mining technique is based on fuzzy set theory, which is used to model natural language words (e.g. ‘many’, ‘tall’), and as a result it generates natural-language-like sentences describing the data. Although intensely developed, before our work this method had been applied only to relational databases, while more and more data is available in the graph model. A special case of such graph datasets is the Semantic Web, in which ontologies provide meaning, thereby enabling advanced machine learning. In this paper we analyze the problem of generating linguistic summaries for graph data with associated ontologies, a case to which the method cannot be applied directly. The key elements of ontologies are concept hierarchies, which are the core of our work. Firstly, due to heterogeneity and the lack of a schema, we propose to use an ontological concept (including all sub-concepts in its hierarchy) as the subject of summaries, and to extract its attributes (neighboring vertexes). Then we show that by ascending these ontological concept hierarchies (that is, by attribute-based induction) we obtain additional, generalized summaries. We show this process for both summarizers and qualifiers, and propose an extension to their respective imprecision measures, T₂ and T₉. We perform two experiments on DBPedia: one for the summary subject ‘Artist’, and a second for ‘Musical Album’. For the latter, we show the optimized process of obtaining the truth values using a bottom-up approach.

Keywords

Linguistic summaries, fuzzy logic, ontology, Semantic Web

1 Introduction

This paper focuses on generating linguistic summaries of graph datasets with associated ontologies. Linguistic summarization, introduced by Yager [1–4], is a data-mining method aimed at finding high-level, complex relationships in the data and, based on fuzzy sets, presenting this information in natural language sentences. We focus on applying this method in a non-typical case: graph datasets. Our research can be viewed as a form of graph data summarization, which in most cases is about extracting frequent patterns (subgraphs) from a larger graph [5, 6]. This paper also concerns the field of Semantic Web mining, with subfields such as content, structure and usage mining [7, 8]. A different research area is presented in [9], where elements of fuzzy set theory are used to express fuzzy queries for graph databases (specifically, by extending the Cypher query language for Neo4j). In [10] new concepts, named ‘Structure Summaries’ and ‘DataStructure Summaries’, are introduced, which capture relations between two types of vertices. Despite the new nomenclature, these constructs are logically equivalent to the first and second form of summary (see (1) and (2)), and therefore to our approach, which is based on data extraction from a graph into a pseudo-relational model.

This paper is an extended version of [11]. Compared to our conference paper, we have additionally considered qualifiers and studied their generalization analogously to summarizers. We have also broadened our approach by considering a more abstract notion of concept instead of only an ontological class. Therefore, it is now possible to include other hierarchies, like ‘categories’, ‘genres’, etc. This makes our approach more comprehensive: it can be applied to a larger number of real datasets and yields additional summaries. In Section 4 we present an analysis of how to efficiently generalize summarizers and qualifiers, based only on the counts of sub-concepts. For this we have simplified the equations for truth values for crisp summarizers and qualifiers, and analyzed how they change while ascending the hierarchy. We have verified the proposed optimizations on a new use case: summarization of musical albums, where the hierarchy of musical genres is used to generalize both summarizers and qualifiers.

Using hierarchies for data summarization is a classic approach known as attribute-oriented induction [12]. For relational databases, this method is based on creating super-tuples from values lower in the hierarchies (e.g., the DBLEARN system [13]). Fuzzy hierarchies have also been considered in several works: in [14] the authors take a user-guided top-down approach, in which at each level the user decides whether more specification is required; in [15] fuzzy hierarchies are built using fuzzy clustering of attributes; a mature linguistic summarization system called SaintEtiQ is presented in [16], where user-defined knowledge (including hierarchies) is used for conceptual clustering. The hierarchical nature of ontologies has been used for summarization in [17], where 4 new criteria (based on fuzzy sets) are used to determine whether ascending the hierarchy should continue or stop. We should emphasize that all of these approaches consider relational databases, whereas our approach considers graph datasets.

Although linguistic summaries are perfectly suited for very large datasets, they have not been considered for the Semantic Web, which is in fact a huge, distributed graph dataset. The first new problem that arises in this case is how to define the subject of such summaries. Since the data is a graph, taking all vertexes would not be consistent, because they may be arbitrarily connected, have different attributes, etc. Hence we first propose to use an ontological class as the subject of the summary. We also show that concept hierarchies have to be taken into account in order to properly select vertexes for summarization. We propose to rebuild and extend the notions of summarizer and qualifier by including concept taxonomies in the problem analysis. For summary subjects we show that all sub-concepts (narrower concepts) of a given concept have to be analyzed, while for the summarizer and qualifier all super-concepts (that is, more general concepts) have to be analyzed. This generalizing approach is known as attribute-oriented induction, and it has been applied to summarizers (e.g. in [17]), but not to qualifiers, which is done in this paper. We show a process for efficient computation of truth values of generalized summaries (separately for qualifiers and summarizers), based on simplified equations. Secondly, also based on concept taxonomies, we propose a new notion of ‘concept imprecision’ that depends on the position of a concept in its hierarchy. Based on this we propose extensions to the quality measures T₂ and T₉, the degrees of summarizer and qualifier imprecision.

This paper is organized as follows. In Section 2 we briefly recall the main concepts of linguistic summaries. In Section 3 we discuss how algorithms for linguistic summaries may be adapted for the Semantic Web with the use of ontologies. The exact algorithm for generating summaries for the Semantic Web is presented in Section 3.5. The proposed algorithm adapts to the dataset, in that the set of summarizers is created dynamically. Section 3.4 shows how the T₁ and T₂ quality measures may be extended with a class taxonomy. We introduce the notion of a summary on different levels of generality (Degree of Summarizer Imprecision), depending on the class used in the summary. Section 4 discusses the optimization of generalizing summarizers and qualifiers. Afterwards, in Section 5 we show the results of two experiments performed on DBPedia: one for class Artist, and a second for class Musical Album, for which we also show a step-by-step example of generalization of summarizers and qualifiers. Finally, in Section 6 we draw conclusions and outline possibilities for future work.

2 Linguistic summaries of relational databases

This section presents the main aspects of linguistic summaries as designed for relational databases. Further information can be found, for example, in [1–4].

There are two main forms of linguistic summaries: the first form is presented in Equation (1) and illustrated in Example 1; the second form is presented in Equation (2) and illustrated in Example 2. $Q\ P\ \text{are/have}\ S\ [T]$ (1) $Q\ P\ \text{being/having}\ W\ \text{are/have}\ S\ [T]$ (2) where Q is the linguistic quantifier; P is the subject of the summary (the set of objects represented by the database tuples d_i); S is a property of interest, the so-called summarizer, represented by a fuzzy or a crisp set (a discrete set in particular); W is the qualifier, a property that restricts the objects taken into account; and T is the degree of truth of the summary.

Example 1. About half (Q) employees (P) have low salary (S) [0.89] (T)

Example 2. A lot of (Q) employees (P) that are old (W) have high salary (S) [0.9]

The degree of truth T is used to measure the overall quality of a summary, that is, how well it describes the data. Due to the linguistic, fuzzy-set based nature of the summaries, this value lies in the interval [0, 1]. It is based strictly on Zadeh's calculus of linguistically quantified statements, and is computed as: $T_{1}(Q\ P\ \text{are/have}\ S_{j}) = \mu_{Q}\left(\frac{r}{m}\right)$ (3) where $r = \sum_{i = 1}^{m} \mu_{S_{j}}(d_{i})$ (4) and m is the number of objects d_i.

When the summarizer S_j is modeled by a fuzzy set, $\mu_{S_{j}}$ is the membership function of d_i in this set. When S_j is a discrete value, e.g. “is a woman”, the membership value is given by Equation (5). $\mu_{S_{j}}(d_{i}) = \begin{cases} 1 & \text{if } d_{i} \in S_{j} \\ 0 & \text{otherwise} \end{cases}$ (5)
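As an illustration of Equations (3)-(5), the following minimal Python sketch computes T₁ for a first-form summary with a crisp summarizer. The trapezoidal shape of the quantifier ‘about half’, its parameters, the salary threshold and the sample records are all assumptions made for this example; they do not come from the experiments in this paper.

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership function; returns a degree in [0, 1]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)
    return (d - x) / (d - c)


def truth_t1(records, mu_s, mu_q):
    """T1 = mu_Q(r / m), with r the summed summarizer memberships (Eq. 3 and 4)."""
    m = len(records)
    r = sum(mu_s(d) for d in records)
    return mu_q(r / m)


def low_salary(d):
    """Crisp summarizer as in Eq. 5 (the threshold 2000 is hypothetical)."""
    return 1.0 if d["salary"] < 2000 else 0.0


def about_half(p):
    """Hypothetical fuzzy quantifier 'about half' over proportions."""
    return trapezoid(p, 0.3, 0.45, 0.55, 0.7)


# Hypothetical records: 5 of the 8 employees have a salary below 2000.
employees = [{"salary": s} for s in (1200, 1500, 1800, 4000, 5200, 900, 6100, 1300)]

t1 = truth_t1(employees, low_salary, about_half)
print(f"About half of employees have low salary [{t1:.2f}]")
```

With a fuzzy summarizer only `low_salary` would change: it would return a degree in [0, 1] instead of 0 or 1.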

Two other quality measures considered in this paper concern the imprecision of a summarizer (T₂, see Equation (6)) and of a qualifier (T₉, see Equation (7)). The more general the statement is, the higher the value of the imprecision, which in turn reduces the overall truth value of summaries that are not very informative (in a typical fuzzy-set case, due to a wide membership function). $T_{2} = 1 - {\left(\prod_{j = 1}^{n} in(S_{j})\right)}^{1 / n}$ (6) where in is the imprecision of a fuzzy set [1]. $T_{9} = 1 - {\left(\prod_{j = 1}^{n} in(W_{j})\right)}^{1 / n}$ (7)

Note that for a summarizer or qualifier that contains only one element, (6) and (7) can be simplified to Equation (8). $T_{2, 9} = 1 - in(S)$ (8)
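The structure of Equations (6)-(8) can be sketched in a few lines of Python. The function name and the numeric imprecision values below are hypothetical; the sketch only assumes that the imprecision in(·) of each element is already known and lies in [0, 1].

```python
import math


def imprecision_measure(in_values):
    """Return 1 - (product of in_values)^(1/n), the common form of Eq. 6 and Eq. 7.

    `in_values` holds in(S_j) for a summarizer (T2) or in(W_j) for a qualifier (T9).
    """
    n = len(in_values)
    return 1.0 - math.prod(in_values) ** (1.0 / n)


# Single-element case: reduces to 1 - in(S), as in Eq. 8.
t2_single = imprecision_measure([0.25])

# Compound summarizer with two elements (hypothetical imprecision values).
t2_compound = imprecision_measure([0.25, 1.0])
```

Because a geometric mean is used, a single element with imprecision 0 drives the product to 0 and the measure to 1, regardless of the remaining elements.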

3 Linguistic summaries of graph databases using ontologies

The first part of this section introduces a set of concepts and definitions from the field of ontologies. The remainder of this section presents the authors' original contributions: subject selection (using ontological sub-concepts), extensions of the summarizer and qualifier (using ontological super-concepts) and extensions of the quality measures T₁, T₂ and T₉ (using a complete concept taxonomy).

The key point of our approach is the fact that in graph datasets (as in the Semantic Web) concepts form a hierarchy, which means that a value implicitly has more than one meaning, and these meanings often have to be inferred. Such concept hierarchies can be created by any types of edges and/or vertexes, but in this paper we focus on hierarchies of classes and categories. The reader is asked to keep in mind that these are only examples, and that our approach is in fact more generic.

An ontology is often defined as an ‘explicit specification of a conceptualization’. It is expressed using OWL (Web Ontology Language) and RDFS (Resource Description Framework Schema), which define classes (linked to subjects by the predicate rdf:type), properties of these classes, and taxonomies using the predicate rdfs:subClassOf. SKOS vocabularies (Simple Knowledge Organization System) define sets of classes and properties (built upon RDFS) that enable defining classification schemes and taxonomies of categories (linked to subjects by the predicate DublinCoreTerms:subject), which form hierarchies through the predicate skos:broader. In most cases both are used in a dataset: OWL for formal and SKOS for semi-formal definitions of concept taxonomies [18]. In this paper we denote a concept by c.

3.1 Ontology-based concept hierarchies

Consider a fragment of the DBPedia dataset, the hierarchy of the category ‘American_championships’, shown in Fig. 1. Say we want to summarize data about ‘American championships’. In order to retrieve all relevant data from the dataset, it is also necessary to retrieve subjects that are below the chosen category. However, this data is not explicitly given in the dataset; it has to be inferred. Please note that by definition skos:broader is not a transitive property (skos:broader has a transitive counterpart, skos:broaderTransitive). However, in DBPedia there is not a single skos:broaderTransitive property, which indicates that the weaker form of the predicate is also used in cases where it should in fact be transitive, as in the example in Fig. 1. Therefore, for the purposes of this paper we treat this property as transitive, as in most cases very interesting results can be obtained in this way. Such an approach can also be found in the UMBEL datasets (inferred concepts). Concept hierarchies are also created by rdfs:subClassOf predicates between classes, a predicate which is transitive by definition.

Definition 1. Sub-concepts of concept c are concepts, which are directly below concept c in a given taxonomy. We denote this set of sub-concepts by Sub_c.

Example 3. In Fig. 1, concept Continental championships has the following sub-concepts: ${Sub}_{Continental\ Championships} = \{European, Asian, African, Oceanian, American\}$.

Note that Sub_American ⊄ Sub_Continental, because we consider only direct sub-concepts.

Definition 2. Sub-concepts of the n-th level of concept c are concepts which are separated from concept c by not more than n specialization relations. We denote this set by ${Sub}_{c}^{n}$ .

Note that ${Sub}_{c} = {Sub}_{c}^{1}$ , ${Sub}_{c}^{n - 1} \subseteq {Sub}_{c}^{n}$ .

Definition 3. The complete set of sub-concepts of concept c (all levels below concept c) is denoted by ${Sub}_{c}^{\infty}$ .

Analogously to the notion of sub-concept we define a super-concept, and set of super-concepts of n-th level - see Definitions 1, 2, 3.

Definition 4. Super-concepts of the n-th level of concept c are concepts which are separated from concept c by not more than n generalization relations. We denote this set of concepts by ${Sup}_{c}^{n}$ .
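The sets from Definitions 1-4 can be computed with a simple traversal of the taxonomy. The Python sketch below is only an illustration: the taxonomy, the function names and the child-to-parent dictionary representation are assumptions, and it further assumes an acyclic taxonomy in which every concept has at most one direct super-concept.

```python
from collections import defaultdict

# Hypothetical taxonomy given as child -> direct parent (illustrative names only).
PARENT = {
    "poet": "writer",
    "playwright": "writer",
    "writer": "artist",
    "painter": "artist",
}


def children_index(parent):
    """Invert the child -> parent map into parent -> set of direct children."""
    idx = defaultdict(set)
    for child, par in parent.items():
        idx[par].add(child)
    return idx


def sub_concepts(c, parent, n=None):
    """Sub_c^n: concepts at most n specialization steps below c (n=None gives Sub_c^inf)."""
    idx = children_index(parent)
    found, frontier, level = set(), {c}, 0
    while frontier and (n is None or level < n):
        frontier = {ch for node in frontier for ch in idx.get(node, ())} - found
        found |= frontier
        level += 1
    return found


def super_concepts(c, parent, n=None):
    """Sup_c^n: concepts at most n generalization steps above c (n=None gives Sup_c^inf)."""
    found, level = set(), 0
    while c in parent and (n is None or level < n):
        c = parent[c]
        found.add(c)
        level += 1
    return found


print(sub_concepts("artist", PARENT, 1))   # direct sub-concepts only
print(sub_concepts("artist", PARENT))      # all levels below 'artist'
print(super_concepts("poet", PARENT))      # all levels above 'poet'
```

For taxonomies in which a concept may have several direct super-concepts, the parent map would need to hold sets of parents and `super_concepts` would become a breadth-first traversal like `sub_concepts`.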

3.2 Building summary subject using concept taxonomy - including sub-concepts

As the subject of a summary, we propose to use an ontological class or SKOS category, because both are used to classify groups of graph vertexes. We point out that the hierarchy of concepts has to be considered. In the general case, a graph vertex does not specify all classes and categories that it belongs to, but only the most specific one, that is, the lowest one in the concept taxonomy. Hence, when a linguistic summary of concept c is created, not only the vertexes of concept c but also all vertexes of the concepts in ${Sub}_{c}^{\infty}$ have to be selected (see Def. 3), because each member of any concept in ${Sub}_{c}^{\infty}$ also belongs to concept c.

3.3 Building summarizer and qualifier using concept taxonomy - including super-concepts

In the proposed method, the set of summarizers S is created while selecting objects for summarization, that is, from each triple whose subject has $rdf:type \in {Sub}_{c}^{\infty}$ or $dct:subject \in {Sub}_{c}^{\infty}$ (see Def. 3). For each attribute A_i (that is, each predicate label) its discrete value set is created from the values of all retrieved triples. We denote this set of values by $X_{A_{i}}$. The attribute A_i may be a concept that can be generalized, either by rdfs:subClassOf or by skos:broaderTransitive predicates. In such cases, an attribute value a_j also belongs to the super-concepts of concept $c_{A_{i}}$, hence to the set of concepts ${Sup}_{c_{A_{i}}}^{n}$ (see Def. 4). In this case the set of summarizers is augmented by this set of super-concepts: additional implicit values are inferred from the concept taxonomy.

Example 4. Consider a simple concept taxonomy ‘Occupations’ as shown in Fig. 2. Say we are creating linguistic summaries of class ‘Person’, and one of the attributes/predicates is A₁ = ‘occupation’ (A₁ belongs to the ‘Occupations’ taxonomy). Assume that in the considered dataset the set of values is X_Occupation = {writer, poet, painter}. In this case, in a regular approach, the summarizer S_Occupation based on this attribute would have only three possible values: S_Occupation = {writer, poet, painter}. However, by taking the set of super-concepts of each concept in X_Occupation and adding it to the summarizer set, we obtain the following summarizer set: S_Occupation = {writer, poet, painter, artist}. Summarizing on the more general attribute value artist may lead to extracting new knowledge from the data.
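The augmentation in Example 4 can be sketched in Python as follows. Since Fig. 2 is not reproduced here, the exact parent relations in `OCCUPATION_PARENT` are an assumption (poet below writer, writer and painter below artist), as are the function and variable names.

```python
# Assumed fragment of the 'Occupations' taxonomy (child -> direct parent).
OCCUPATION_PARENT = {"poet": "writer", "writer": "artist", "painter": "artist"}


def augment_summarizers(values, parent):
    """Return the value set X_A extended by all super-concepts of its members."""
    result = set(values)
    for v in values:
        # Walk up the hierarchy and collect every inferred, more general value.
        while v in parent:
            v = parent[v]
            result.add(v)
    return result


x_occupation = {"writer", "poet", "painter"}
s_occupation = augment_summarizers(x_occupation, OCCUPATION_PARENT)
print(sorted(s_occupation))
```

The explicit values are kept, so the more specific summaries remain available next to the generalized ones.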

3.4 Ontological extensions to quality measures

T₁ extended by concept taxonomy

Recall the summary truth value T₁, given in (3), and the notion of the membership function for a discrete summarizer (5). Now consider a hierarchical concept as a summarizer or a qualifier. In this case, the notion of ‘being a member of a concept’ may be extended to the form given in Equation (9). $\mu_{c}^{ont.}(d_{i}) = \begin{cases} 1 & \text{if } d_{i} \in \{c\} \cup {Sub}_{c}^{\infty} \\ 0 & \text{otherwise} \end{cases}$ (9)

Example 5. Consider a summarizer (or a qualifier) ‘is writer’ and an object which belongs to the class poet. Using the regular approach, that is, Equation (5), the obtained membership value is 0, while using the extended approach, Equation (9) evaluates to 1.
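The difference between Equations (5) and (9) in Example 5 can be made concrete with a short Python sketch. The taxonomy fragment and the function names are assumptions for illustration, and each object is assumed to carry only its most specific class.

```python
# Assumed taxonomy fragment (child -> direct parent).
PARENT = {"poet": "writer", "writer": "artist", "painter": "artist"}


def mu_crisp(instance_class, concept):
    """Eq. 5: membership is 1 only if the object's own class equals the concept."""
    return 1 if instance_class == concept else 0


def mu_ont(instance_class, concept, parent):
    """Eq. 9: membership is 1 if the object's class is the concept or lies below it."""
    c = instance_class
    while True:
        if c == concept:
            return 1
        if c not in parent:
            return 0
        c = parent[c]


print(mu_crisp("poet", "writer"))        # regular approach
print(mu_ont("poet", "writer", PARENT))  # extended approach
```

Checking whether the concept is among the ancestors of the object's class is equivalent to checking whether the class is in the complete set of sub-concepts of the concept, but it avoids materializing that set.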

T₂ extended by concept taxonomy

Using (9) to evaluate the membership value of an attribute in a summarizer leads to more general summaries when concepts higher in the hierarchy are used. For example, imagine a summarizer related to geographical locations, like cities, countries and continents. Assume also that each instance of a class (e.g. Person) has a property ‘place of birth’. When formula (9) is used, we may create new summaries by extending the notion of a city to a broader term, like a country or even a continent. Then we may form a new summary that would otherwise not be possible (since each attribute specifies only a city), like ‘about half of the people that are tall are born in Europe’. However, such summaries are less precise; an extreme case would be ‘most people are born on Earth’. This summary is definitely true, but it is very imprecise.

Hence, we propose a measure analogous to the degree of imprecision for fuzzy sets, which we call the degree of concept imprecision (see Definition 5). The proposed formula captures the intuition that the imprecision of a concept depends on its level in the taxonomy. The updated form of T₂ is presented in (11).

Definition 5. By degree of concept imprecision we mean the level of generality (or: information content) of a given concept, evaluated using (10) and applied in (11). This measure is inspired by (but modified from) the so-called ‘subsumption’, introduced in [19] for ranking complex associations on the Semantic Web. ${in}^{ont.}(c) = \frac{{levels}_{below}(c)}{{levels}_{above}(c) + {levels}_{below}(c)}$ (10) where: levels_above(c) is the number of levels above concept c in a taxonomy, and levels_below(c) is the number of levels below concept c in a taxonomy.

Example 6. Evaluated degrees of selected concept imprecisions for hierarchy presented in Fig. 1 (for the purposes of this example, assume that it is a complete taxonomy) are shown below:

${in}^{ont .} (Continental) = \frac{3}{0 + 3} = 1$

${in}^{ont .} (American) = \frac{2}{2 + 1} = 0.66$

${in}^{ont .} (Central) = \frac{1}{2 + 1} = 0.33$

${in}^{ont .} (2002 Central) = \frac{0}{3} = 0$

$T_{2} = 1 - {\left(\prod_{j = 1}^{n} {in}^{ont.}(S_{j})\right)}^{1 / n}$ (11) where: n is the number of elements in the summarizer, and S_j is the j-th element of the compound summarizer.
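The values in Example 6 can be reproduced with the following Python sketch of Equation (10). The four-level chain below is an assumed simplification of the taxonomy in Fig. 1 (only one branch is kept), and the function names are hypothetical. The sketch takes levels_below(c) as the height of the subtree under c and levels_above(c) as the depth of c.

```python
# Assumed single branch of the Fig. 1 taxonomy (child -> direct parent).
PARENT = {
    "American": "Continental",
    "Central": "American",
    "2002 Central": "Central",
}


def levels_above(c, parent):
    """Number of generalization steps from c up to the root."""
    n = 0
    while c in parent:
        c = parent[c]
        n += 1
    return n


def levels_below(c, parent):
    """Number of levels in the deepest branch below c."""
    kids = [k for k, p in parent.items() if p == c]
    if not kids:
        return 0
    return 1 + max(levels_below(k, parent) for k in kids)


def concept_imprecision(c, parent):
    """Eq. 10; assumes the taxonomy has at least two levels."""
    above, below = levels_above(c, parent), levels_below(c, parent)
    return below / (above + below)


for concept in ("Continental", "American", "Central", "2002 Central"):
    value = concept_imprecision(concept, PARENT)
    # For a single-element summarizer, T2 = 1 - in(c), as in Eq. 8.
    print(f"{concept}: in = {value:.2f}, T2 = {1 - value:.2f}")
```

The root of the taxonomy gets imprecision 1 (T₂ = 0) and a leaf gets imprecision 0 (T₂ = 1), which matches the intuition that the most general concept is the least informative.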

T₉ extended by concept taxonomy

As for the summarizer, the same generalization process can be used for the qualifier, for which the imprecision is expressed by the quality measure T₉. Hence, we also propose to use (10) here, and T₉ becomes: $T_{9} = 1 - {\left(\prod_{j = 1}^{n} {in}^{ont.}(W_{j})\right)}^{1 / n}$ (12)

3.5 Generating linguistic summaries for graph databases - complete process with example

The set of quantifiers Q is known beforehand, as well as the summary subject - an ontological class or SKOS category.

To keep the method universal, we do not assume that the attributes, denoted by A, or their sets of values, denoted by X_A, are known in advance. Attributes and their values will be used as summarizers. The exact steps to be followed are listed below.

Define the concept (e.g. ontological class/SKOS category) c that will be the subject of the summary

Generate the full set of sub-concepts for c: ${Sub}_{c}^{\infty}$ (see Def. 3)

Query the database for objects of the concepts in ${Sub}_{c}^{\infty}$ and their attributes (that is, vertexes that are directly connected to them).

Based on the queried data we create a set of attributes and their value sets: $A = \langle A_{1}, X_{A_{1}} \rangle, \langle A_{2}, X_{A_{2}} \rangle, \ldots, \langle A_{i}, X_{A_{i}} \rangle$

For each attribute A_i we compute the full set of super-concepts of its values, ${Sup}_{A_{i}}^{\infty}$ (see Definition 4). Each of the super-concepts may be used to form a more general summary.

We create a set of linguistic summaries using the found attributes as summarizers - $X_{A_{i}} \cup {Sup}_{A_{i}}^{\infty}$

For each qualifier we calculate truth values T₁ - T₅

Example. Assume that the subject of a linguistic summary is the ontological class writer, and that for the summary we use the GeoNames dataset, which contains information about the administrative classification of the world. For the subject taxonomy we use the DBPedia ontology.

c = writer

${Sub}_{writer}^{\infty} = \{\text{‘music composer’}, Playwright, poet, screenwriter, Songwriter\}$

query the database for all objects of class writer and all subclasses - ${Sub}_{writer}^{\infty}$

A = 〈‘born’, {Paris, New York, Katowice, Amsterdam}〉, 〈‘height’, {176cm, 186cm, 190cm, 166cm}〉.

‘born’ ∈ T. ${Sup}_{Paris}^{\infty} \cup {Sup}_{New York}^{\infty} \cup {Sup}_{Katowice}^{\infty} \cup {Sup}_{Amsterdam}^{\infty} = \{France, USA, Poland, Netherlands, Europe\}$

set of summarizers S = {Paris, New York, Katowice, Amsterdam, France, USA, Netherlands, Poland, Europe}

calculation of T₁, T₂ and T_final - an average of T₁ and T₂
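The steps of the process above can be sketched end to end on a tiny in-memory graph. Everything in the Python sketch below is hypothetical: the triples, the two taxonomies and the function names are illustrative stand-ins for data that would normally be retrieved from a SPARQL endpoint, and only subject selection, attribute extraction and summarizer generalization are shown (the truth values then follow from the resulting proportions).

```python
# Hypothetical triples (subject, predicate, object); not real DBPedia or GeoNames data.
TRIPLES = [
    ("a1", "rdf:type", "poet"), ("a1", "born", "Paris"),
    ("a2", "rdf:type", "writer"), ("a2", "born", "Katowice"),
    ("a3", "rdf:type", "screenwriter"), ("a3", "born", "Paris"),
    ("a4", "rdf:type", "painter"), ("a4", "born", "Amsterdam"),
]
# Assumed taxonomies (child -> direct parent) for classes and for places.
CLASS_PARENT = {"poet": "writer", "screenwriter": "writer"}
PLACE_PARENT = {"Paris": "France", "Katowice": "Poland",
                "France": "Europe", "Poland": "Europe"}


def ancestors(c, parent):
    """All super-concepts of c, from the closest to the most general."""
    out = []
    while c in parent:
        c = parent[c]
        out.append(c)
    return out


def select_subjects(triples, concept, class_parent):
    """Subjects typed as the concept itself or as any of its sub-concepts."""
    return {s for s, p, o in triples
            if p == "rdf:type"
            and (o == concept or concept in ancestors(o, class_parent))}


def summarizer_counts(triples, subjects, attribute, value_parent):
    """Count subjects per explicit attribute value and per inferred super-concept."""
    counts = {}
    for s, p, o in triples:
        if s in subjects and p == attribute:
            for value in [o] + ancestors(o, value_parent):
                counts[value] = counts.get(value, 0) + 1
    return counts


subjects = select_subjects(TRIPLES, "writer", CLASS_PARENT)
counts = summarizer_counts(TRIPLES, subjects, "born", PLACE_PARENT)
# Proportions r/m that would be passed to the quantifier membership functions.
proportions = {value: n / len(subjects) for value, n in counts.items()}
print(sorted(subjects), counts, proportions)
```

Note that the painter `a4` is not selected, because ‘painter’ is not a sub-concept of ‘writer’ in the assumed class taxonomy, while ‘Europe’ appears as a summarizer value even though no triple mentions it explicitly.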

4 Optimization of generalizing summarizers and qualifiers

In this section we address the efficiency of calculating the truth value T₁ (see Equation (3)) and T₂ (precision) of the generalized summarizers and qualifiers introduced in Section 3.4. Since the central point of our proposed extensions are concept hierarchies, we limit our discussion to discrete summarizers only. We show that this limitation enables the computation of T₁ for generalized summaries directly from more specific ones. Our approach is based on the classic method of attribute-oriented induction [12], adapted and analyzed here for the case of linguistic summaries.

For discrete summarizers the formula for the truth value T₁ (see Equation (3)) may be simplified: $\mu_{Q}\left(\frac{\sum_{i = 1}^{m} \mu_{S_{j}}(d_{i})}{m}\right) = \mu_{Q}\left(\frac{count(S_{j})}{m}\right)$ (13) where: $count(S_{j}) = \left| \{ d_{i} : d_{i}\ {isa}^{ont.}\ S_{j},\ i = 1, \ldots, m \} \right|$ (14)

Bottom-up calculation of generalized summarizers

Consider a concept tree for an attribute A. Using the approach presented in Section 3.4 we can formulate several summarizers, one for each class in the structure. However, we want to show that we can generate these generalized summarizers ‘for free’, that is, we can infer them directly from the most specific ones (lowest in the concept tree). Given that transitive (inferred) classes (and other inferred hierarchical concepts) are not present in the data (which is most often the case for broader SKOS categories), these attribute values are NOT given directly (explicitly) in the data.

Equation (15) presents the formula for calculating the count of a generalized summary. Based on the counts of subclasses it is possible to compute the count of a superclass, and according to (13) this count is enough to compute T₁ for all summaries (i.e. all quantifiers). A graphical representation of this process is presented in Fig. 3, where it is clearly visible that count^ont.(S₁) = count(S₁) + count(S₂) + count(S₃), where ‘^ont.’ means that the class taxonomy is considered. The addition of count(S₁) (that is, the count without consideration of the class taxonomy) is also necessary, because there may be subjects that are in this more general class but not in any of the subclasses. Such a situation may occur due to missing data, or due to the fact that a given class is only partially specialized (e.g. in the DBPedia ontology, class ‘TeamSport’ currently has only a single subclass, ‘Soccer’). For clarification consider Example 7. $count(S^{ont.}) = \sum_{i = 1}^{| {Sub}_{S}^{1} |} count({Sub}_{S, i}^{1}) + count(S)$ (15) where: count(S^ont.) is the count of subjects that are S including sub-concepts, count(S) is the count of subjects that are S without sub-concepts, ${Sub}_{S}^{1}$ is the set of direct sub-concepts of S, ${Sub}_{S, i}^{1}$ is its i-th element, and $| {Sub}_{S}^{1} |$ is the number of direct sub-concepts (see Def. 1).

Example 7. Say that the summarizer concept tree is shown in Fig. 3, and the summarized attribute is ‘profession’. Say that S₀ - artist, S₁ - musician, S₂ - singer, S₃ - instrumentalist, S₄ - visual artist, S₅ - painter and S₆ - photographer. Knowing the number of singers (e.g. count(S₂) = 10), instrumentalists (e.g. count(S₃) = 30) and general, unspecified musicians (e.g. count(S₁) = 20), we can directly compute the total number of musicians (count^ont.(S₁) = count(S₁) + count(S₂) + count(S₃) = 60), and therefore directly compute T₁.
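The bottom-up aggregation of Equation (15) can be sketched as a short recursive function in Python. The counts for musician, singer and instrumentalist are those of Example 7; the counts for the remaining concepts, the dictionary layout and the function name are hypothetical. For hierarchies deeper than one level the sketch assumes that the count of each direct sub-concept is itself taken with its own sub-concepts included.

```python
# Concept tree from Example 7 (parent -> direct sub-concepts).
CHILDREN = {
    "artist": ["musician", "visual artist"],
    "musician": ["singer", "instrumentalist"],
    "visual artist": ["painter", "photographer"],
}
# Direct counts, i.e. subjects labeled with exactly this concept.
DIRECT = {
    "musician": 20, "singer": 10, "instrumentalist": 30,          # from Example 7
    "artist": 5, "visual artist": 0, "painter": 7, "photographer": 3,  # hypothetical
}


def count_ont(concept, children, direct):
    """count(S^ont.): the concept's own count plus the counts of all concepts below it."""
    own = direct.get(concept, 0)
    below = sum(count_ont(ch, children, direct) for ch in children.get(concept, []))
    return own + below


musicians = count_ont("musician", CHILDREN, DIRECT)
artists = count_ont("artist", CHILDREN, DIRECT)
print(musicians, artists)
```

Only the direct counts have to be retrieved from the dataset; every generalized count is then obtained by additions over the tree, so no further pass over the subjects is needed.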

Bottom-up calculation of generalized qualifiers

Discrete qualifiers (enumerations) have the effect of limiting (filtering) the subjects taken into account when calculating T₁ (the denominator, m, in (3)). Note that in the case of generalized summarizers the number of subjects is constant. Hence, in order to compute values for generalized qualifiers, the number of subjects for each qualifier also has to be taken into account, as expressed in formula (16).

Consider a hierarchical discrete qualifier W with a concept hierarchy as presented in Fig. 4, and a discrete summarizer S (whose hierarchical nature is not relevant at this point). Since W is a discrete qualifier, $\mu_{W} \in \{0, 1\}$, only a crisp subset of the initial dataset matches this qualifier. In the context of Fig. 4, there are m_W2 subjects that are W₂, and of these ‘filtered’ subjects count(S|W₂) are S (and by (13) we can quickly compute T₁ for qualifier W₂ and summarizer S). The same analysis applies to W₃ and to all other subclasses of W₁. Having this information, it is possible to directly compute the total number of subjects that meet the qualifier condition W₁. Since, apart from the hierarchy, the qualifiers W are disjoint, it is the sum of the counts of all subclasses of W₁ and, as for the summarizers, of the count for concept W₁ without considering the concept hierarchy, hence $m_{W 1}^{ont.} = m_{W 1} + m_{W 2} + m_{W 3}$. The same applies to the total count of subjects that meet the summarizer S. Therefore, we are able to directly compute $\frac{{count}^{ont.}(S | W_{1})}{m_{W 1}^{ont.}}$ , and as a result T₁.

Example 8. Say that the dataset concerns living organisms, and the qualifier (W) is ‘species’ (e.g. the GeoSpecies dataset). The summarizer can be anything, for example length of life (note that in this case the summarizer does not have to be discrete). Given Fig. 4, assume that W₂ is ‘mammal’ (m_W2 = 15, count(S|W₂) = 9), W₃ is ‘fish’ (m_W3 = 22, count(S|W₃) = 13), and W₁ is ‘animal’ (the count of ‘unclassified’ animals is m_W1 = 5, count(S|W₁) = 2). Therefore, considering the hierarchical nature of W, there are $m_{W 1}^{ont.} = 5 + 15 + 22 = 42$ animals, and count^ont.(S|W₁) = 2 + 9 + 13 = 24. Again, using (13), T₁ can be directly inferred.

$\frac{{count}^{ont.}(S | W)}{m_{W}^{ont.}} = \frac{\sum_{i = 1}^{| {Sub}_{W}^{1} |} count(S | W_{i}) + count(S | W)}{\sum_{i = 1}^{| {Sub}_{W}^{1} |} m_{W_{i}} + m_{W}}$ (16) where: count(S|W) is the number of subjects that meet qualifier W and summarizer S, m_W is the number of subjects meeting qualifier W, ${Sub}_{W}^{1}$ is the set of direct subclasses of W, W_i is its i-th element, and $| {Sub}_{W}^{1} |$ is the number of direct subclasses.
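Equation (16) can be checked numerically with the per-class counts given in Example 8. In the Python sketch below the dictionary layout and the function name are assumptions; the same recursive aggregation is applied separately to the numerator counts and to the denominator counts.

```python
# Qualifier hierarchy from Example 8 (parent -> direct subclasses).
CHILDREN = {"animal": ["mammal", "fish"]}
# m_W: subjects labeled with exactly this qualifier value.
M = {"animal": 5, "mammal": 15, "fish": 22}
# count(S|W): of those subjects, how many also meet the summarizer S.
S_GIVEN_W = {"animal": 2, "mammal": 9, "fish": 13}


def aggregate(concept, children, values):
    """Sum the value of a concept and of all concepts below it in the hierarchy."""
    own = values.get(concept, 0)
    below = sum(aggregate(ch, children, values) for ch in children.get(concept, []))
    return own + below


m_ont = aggregate("animal", CHILDREN, M)               # denominator of Eq. 16
count_ont = aggregate("animal", CHILDREN, S_GIVEN_W)   # numerator of Eq. 16
proportion = count_ont / m_ont                         # argument of mu_Q in Eq. 13
print(m_ont, count_ont, round(proportion, 3))
```

In contrast to generalized summarizers, both the numerator and the denominator change when the qualifier is generalized, which is why two aggregations are needed.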

5 Application example - generating linguistic summaries of DBPedia

DBPedia (see [21]) is an extraction of info boxes from Wikipedia articles into a Semantic Web format, the Resource Description Framework, RDF (see [22]). In short, RDF is a data format/model composed of triples (subject, predicate and object) that allow easy data integration and extension. Currently DBPedia contains over 4 million objects in the main dataset, and it can be easily connected to other data sources using the available owl:sameAs links. DBPedia has created its own multi-domain ontology, which is used for this experiment, but it also uses several others, like subject categories (dcterms), OpenCyc, WordNet, Freebase, UMBEL and YAGO2. We have implemented our system in Java using jFuzzyLogic [23] and Apache Jena for querying and processing DBPedia (using its SPARQL endpoint). We have used typical triangular definitions of qualifiers.

We have created summaries for a subclass of class Person, class Artist (96300 instances); see Table 1. Due to the nature of the data, there is an unusually large number of summarizers (in comparison to a typical relational database case): 56. Due to that complexity, including compound summarizers is not directly feasible and we have excluded them from the experiment. As can be seen from the table, some interesting patterns may be found, for example that about half of the artists are musicians. For musical genres, like jazz, we have computed the concept taxonomy based on the predicate dbo:musicSubgenre.

As a second example, intended to show our generalization method clearly, we focus on the music taxonomy, where hierarchies are created using the predicate dbo:musicSubgenre in the DBPedia dataset. Part of this taxonomy, with direct and inferred instance counts, is shown in Fig. 5. Table 2 presents our results and shows that our approach yields new summaries that can be computed efficiently.

6 Conclusions and future work

In this paper we have presented a novel approach to linguistic summarization of graph datasets with the use of ontologies. Firstly, we have proposed a solution to the initial problem of what should be summarized in a graph, by proposing to use an ontological class or SKOS category as the subject, and we have shown that sub-concepts also have to be included. The attributes for summarization are then the neighboring graph vertexes. Note that this approach can be extended by navigating deeper in the graph to obtain more summaries. We have also shown the application of attribute-based induction to summarizers and, what is new, to qualifiers. Based on the simplification of the equation for computing the degree of truth (T₁), we have adapted an optimization of the computation for summarizers and qualifiers. Creating such summaries (e.g. summarizing on the continental level, while only the information about a country is directly available) leads to finding new dependencies in the data. We have shown that these more general summarizers and qualifiers can be inferred directly from more specific ones, that is, with minimal computational cost.

We have also extended the T₁, T₂ and T₉ quality measures by including the concept taxonomy in the analysis. With the quality measure T₂ we are able to determine the informativeness (precision) of a summary. We have demonstrated our approach by generating linguistic summaries for a small subset of DBPedia, for two cases: summaries of class Artist and of class Album. For the latter, we have shown a step-by-step computation of generalized summaries.

Further work may focus on leveraging the Linked Data nature of the Semantic Web, that is, on incorporating other ontologies and information sources to create new summaries. For instance, DBPedia is interconnected with DBTune (music database), Eurostat (statistical information), LinkedMDB (movies database), LinkedGeoData (geographical database), GeoSpecies (various information about species) and many others. Also, since in our approach we define the summary subject, it would be useful to develop an algorithm for automatic subject selection, possibly also for graph datasets without ontologies (which would require a new method of defining the subject). Further work may also focus on generating linguistic summaries of the topology of the graph, where a particular topological feature (e.g. a clique) may also be defined in a fuzzy way, that is, with a degree in the interval [0, 1].

References

1. Yager R.R., A new approach to the summarization of data, Inf Sci 28(1) (1982), 69–86.

2. Yager R.R., Linguistic summaries as a tool for database discovery. In FQAS, 1994, pp. 17–22.

3. Yager R.R., Ford K.M. and Canas A.J., An approach to the linguistic summarization of data. In Bernadette Bouchon-Meunier, Ronald R. Yager and Lotfi A. Zadeh, editors, IPMU, volume 521 of Lecture Notes in Computer Science, Springer, 1990, pp. 456–468.

4. Yager R.R., On linguistic summaries of data. In Knowledge Discovery in Databases, 1991, pp. 347–366.

5. Kuramochi and Karypis, Frequent subgraph discovery. In Proceedings of the 2001 IEEE International Conference on Data Mining, ICDM ’01, Washington, DC, USA, IEEE Computer Society, 2001, pp. 313–320.

6. Yan and Han, gSpan: Graph-based substructure pattern mining. In Proceedings of the 2002 IEEE International Conference on Data Mining, ICDM ’02, Washington, DC, USA, IEEE Computer Society, 2002, p. 721.

7. Srivastava, Cooley, Deshpande and Tan P.-N., Web usage mining: Discovery and applications of usage patterns from web data, SIGKDD Explor Newsl 1(2), 12–23.

8. Kosala and Blockeel, Web mining research: A survey, SIGKDD Explor Newsl 2(1) (2000), 1–15.

9. Castelltort and Laurent, Fuzzy queries over NoSQL graph databases: Perspectives for extending the Cypher language. In Information Processing and Management of Uncertainty in Knowledge-Based Systems: 15th International Conference, IPMU 2014, Montpellier, France, July 15-19, 2014, Proceedings, Part III, Springer International Publishing, Cham, 2014, pp. 384–395.

10. Castelltort and Laurent, Extracting fuzzy summaries from NoSQL graph databases. In Flexible Query Answering Systems 2015 - Proceedings of the 11th International Conference FQAS 2015, Cracow, Poland, 2015, pp. 189–200.

11. Strobin and Niewiadomski, Linguistic summaries of graph datasets using ontologies: An application to Semantic Web. In Computational Collective Intelligence: 7th International Conference, ICCCI 2015, Madrid, Spain, September 21-23, 2015, Proceedings, Part I, Springer International Publishing, Cham, 2015, pp. 380–389.

12. Han and Yongjian, Attribute-oriented induction in data mining. In Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, Menlo Park, CA, USA, 1996, pp. 399–421.

13. Han, Fu, Huang, Cai and Cercone, DBLearn: A system prototype for knowledge discovery in relational databases, SIGMOD Record (ACM Special Interest Group on Management of Data) 23(2) (1994), 516.

14. Lee D.H. and Kim M.H., Database summarization using fuzzy ISA hierarchies, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 27(1) (1997), 68–78.

15. Sassi, Grissa Touzi and Ounelli, A fuzzy linguistic approach to database summarization. In IEEE International Conference on Fuzzy Systems, FUZZ-IEEE 2008 (IEEE World Congress on Computational Intelligence), 2008, pp. 771–778.

16. Saint-Paul, Raschia and Mouaddib, Database summarization: The SaintEtiQ system. In 2007 IEEE 23rd International Conference on Data Engineering, 2007, pp. 1475–1476.

17. Yager R.R. and Petry F.E., A multicriteria approach to data summarization using concept ontologies, IEEE Transactions on Fuzzy Systems 14(6) (2006), 767–780.

18. W3C Reference - Using OWL and SKOS, https://www.w3.org/2006/07/SWD/SKOS/skos-and-owl/master.html

19. Ramakrishnan, Aleman-Meza, Halaschek-Wiener, Sheth and Arpinar I.B., Ranking complex relationships on the semantic web, IEEE Internet Computing (2005), 37–44.

20. Kacprzyk, Wilbik and Zadrozny, Linguistic summaries of time series via a quantifier based aggregation using the Sugeno integral. In 2006 IEEE International Conference on Fuzzy Systems, 2006, pp. 713–719.

21. Lehmann, et al., DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web Journal (2014).

22. Candan K.S., Liu and Suvarna, Resource description framework: Metadata and its applications, SIGKDD Explor Newsl 3(1) (2001), 6–19.

23. Cingolani and Alcala-Fdez, jFuzzyLogic: A robust and flexible fuzzy-logic inference system language implementation. In 2012 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), 2012, pp. 1–8. DOI:10.1109/FUZZ-IEEE.2012.6251215