Abstract
An approach to performing linguistic summaries of graph datasets, with particular focus on usage of ontologies is presented in this paper. This well-known mining technique is based on fuzzy set theory, which is used to model natural language words (e.g. ‘many’, ‘tall’), and in result - generates natural-like sentences describing the data. Although intensely developed, before our work this method has been applied only to relational databases, while more and more data is available in graph model. A special case of such graph datasets is the Semantic Web, in which ontologies provide meaning, therefore enabling advanced machine learning. In our paper we analyze the problem of generating linguistic summaries for a graph data case (for which the method cannot be directly applied), with associated ontologies. The key element of ontologies are concept hierarchies, which are the core of our work. Firstly, due to heterogeneity and lack of schema we propose to use an ontological concept (including all sub-concepts in hierarchy) as a subject for summaries, and extract their attributes (neighboring vertexes). Then we show that by ascending these ontological concept hierarchies (so by attribute-based induction) we obtain additional, generalized summaries. We show this process for both summarizers and qualifiers, and propose an extension to their respective imprecision measures - T2 and T9. We perform two experiments on DBPedia - one for summary subject ‘Artist’, and second for ‘Musical Album’. For the latter, we show the optimized process of obtaining the truth values using bottom-up approach.
Introduction
This paper focuses on performing linguistic summaries on graph datasets with associated ontologies. Linguistic summaries, defined by Yager [1–4], is a data-mining method aimed at finding high-level, complex relationships in the data and, based on fuzzy sets, presenting this information in natural language sentences. We focus on application of this method in a non-typical case - on graph datasets. Our research can be viewed as a form of graph data summarization, which in most cases is a about extracting frequent patterns (subgraphs) from a larger graph [5, 6]. Also, this paper concerns the field of Semantic Web mining, with subfields like content, structure and usage mining [7, 8]. A different research area is presented in [9], where elements of fuzzy set theory are used to express fuzzy queries for graph databases (precisely - extending Cypher query language for Noe4j). In [10] new concepts, named ‘Structure Summaries’ and ‘DataStructure Summaries’ are introduced, which are capturing relations between two types of vertices. Despite new nomenclature these constructs are logically equivalent to 1st and 2nd form of summary (see (1) and (2)), and therefore to our approach, which is based on data extraction from graph to pseudo-relational model.
This paper is an extended version of [11]. Compared to our conference paper we have additionally considered qualifiers and studied their generalization analogously to summarizers. Also we have broadened our approach by considering a more abstract notion of concept instead of only ontological class. Therefore, it is now possible to include other hierarchies, like ‘categories’, ‘genres’, etc. This way our approach is more comprehensive, and can be applied to larger number of real datasets, and provide even more additional summaries. In Section 4 we present an analysis of how to efficiently generalize summarizers and qualifiers, based only on the count of sub-concepts. For this we have simplified the equations for truth values for crisp summarizers and qualifiers, and analyzed their change during ascending the hierarchy. We have verified proposed optimizations on a new use-case - summarization on musical albums, where musical genres hierarchy is used to generalize both summarizers and qualifiers.
Using hierarchies for data summarization is a classic approach known as attribute-oriented induction [12]. For relational databases, this method is based on creating super-tuples based on values lower in hierarchies (i.e., the DBLEARN system [13]). Fuzzy hierarchies have also been considered in several works: in [14] authors take a user-guided top-down approach, and at each level user decides if more specification is required; in [15] fuzzy hierarchies are built using fuzzy clustering of attributes; a mature lingusitic summarization system called SaintEtiQ is presented in [16], where user-defined knowledge (including hierarchies) is used for conceptual clustering. Hierarchical nature of ontologies has been used for summarization in [17], where 4 new criteria(based on fuzzy sets) are used to determine if ascending the hierarchy should continue or stop. Here we should emphasize that in all of these approaches relational databases are considered, whereas our approach considers graph datasets.
Although linguistic summaries are a perfectly suited for very large datasets, they have not been considered for Semantic Web, which in fact is a huge, distributed graph dataset. First new problem that arises in this case is - how to define the subject of such summaries? Since the data is a graph, taking all vertexes would not be consistent, since they may be arbitrarily connected, have different attributes, etc. Hence firstly we propose to use an ontological class as a subject of the summary. We also show that concept hierarchies have to be taken into account in order to properly select vertexes for summarization. We propose to rebuild and extend the notion of summarizer and qualifier, by including concept taxonomies into problem analysis. For summary subjects we show that all sub-concepts (narrower concepts) of a given concept have to be analyzed, while for the summarizer and qualifier - all super concepts (that is - more general). This generalizing approach is known as attribute-oriented induction, and has been applied to summarizers (e.g. in [17]), but not for qualifiers, which is done in this paper. We show a process of efficient computation of truth values of generalized summaries (separately for qualifiers and summarizers), based on simplified equations. Secondly, also based on concept taxonomies, we propose to define a new notion of ‘concept imprecision’ based on its hierarchy. Based on this we propose extensions to quality measures T2 and T9 - degree of summarizer and qualifier imprecision.
This paper is organized as follows. In Section 2 we only remind the reader the main concepts of linguistic summaries. In Section 3 we discuss how algorithms for linguistic summaries may be adopted for Semantic Web with the use of ontologies. The exact algorithm of generating summaries for Semantic Web is presented in Section 3.5. Proposed algorithm adapts to the dataset, in which the set of summarizers is created dynamically. Section 3.4 shows how T1 and T2 quality measures may be extended with class taxonomy. We introduce the notion of summary on different level of generality (Degree of Summarizer Imprecision), depending on the class used in summary. Afterwards, in Section 5 we show the results of two experiments performed in DBPedia - one for class Artist, and second - for class Musical Artist, for which we also show the step-by-step example of generalization of summarizers and qualifiers. In the end, in Section 6 we draw the conclusions and show the possibilities of future work.
Linguistic summaries of relational databases
This chapter presents the main aspects of linguistic summaries, designed for relational databases. Further information can be found for example in [1–4].
There are two main forms of linguistic summaries: 1st form in presented in Equation 1, shown in Example 1; 2nd form: Equation (2), see Example 2.
Degree of truth T is used to measure the overall quality of a summary, that is - how well it describes the data. Due to the linguistic, fuzzy-set based nature of the summaries, this value lies in the interval [0, 1]. The algorithm is strictly based on Zadeh calculus of linguistically quantified statements, and is computed as:
When the summarizer S
j
is modeled by a fuzzy-set, μ
S
j
is a membership function of d
i
to this set. When S
j
is a discrete value, i.e. “is a woman”, the membership value is shown by Equation 5.
Two other quality measures considered in this paper concern the imprecision of a summarizer (T2, see Equation 5) and a qualifier (T9, see Equation 7). The more general the statement is, the higher the value of the imprecision, which in result reduces the overall truth values of summaries that are not very informative (in a typical fuzzy-set case - due to wide membership function).
Note that for a summarizer or qualifier that contain only one element (7) and (6) can be simplified to Equation 8.
The first part of this section introduces a set of concepts and definitions from the field of ontologies. In the remainder of this section author’s original contributions are presented - subject selection (using ontological sub-concepts), extensions of summarizer and qualifier (using ontological super-concepts) and extensions of quality measures T1, T2 and T9 (using a complete concept taxonomy).
The key point of our approach is the fact that in graph datasets (like in Semantic Web) concepts form a hierarchy, which means that a value implicitly has more than one meaning, which often has to be inferred. Such concept hierarchies can be created by any types of edges and/or vertexes, but in this paper we will focus on hierarchies of classes and categories. Reader is asked to keep in mind that these are only exemplifications, and our approach is in fact more generic.
An ontology is often defined as ‘explicit specification of a conceptualization’, defined using OWL (Web Ontology Language) and RDFS (Resource Description Framework Schema), which defines classes (linked to subjects by predicate rdf:type), properties of these classes, and taxonomies using predicate rdfs:subClassOf. SKOS vocabularies (Simple Knowledge Organization System) define sets of classes and properties (build upon RDFS) that enable defining classification schemes and taxonomies of categories (linked to subjects by predicate DublinCoreTerms: subject), which form hierarchies by predicate skos: broader. In most cases, both used in a dataset, OWL for formal and SKOS for semi-formal definitions of concept taxonomies [18]. In this paper we will denote a concept as c.
Ontology-based concept hierarchies
Consider the fragment of DBPedia dataset, the hierarchy of category ‘American_championships’, shown in Fig. 1. Say we want to summarize data about ‘American championships’. In order to retrieve all relevant data from the dataset, it is also necessary to retrieve subjects that are below the chosen category. However, this data is not explicitly given in the dataset - it has to be inferred. Please note that by definition skos:broader is not a transitive property (skos: broader has a transitive counterpart - skos: broaderTransitive). However, in DBPedia there is not a single skos:broaderTransitive property, which proves that this weaker form of predicate is used also in cases where it should be in fact transitive - like in the example in Fig. 1. Therefore, for purposes of this paper we will treat this property as transitive, as in most cases very interesting results can be obtained in this way. Such approach can also be found in datasets of UMBEL (inferred concepts). Concept hierarchies are also created by rdf: subClassOf predicates between classes, which, by definition, is transitive.
Note that Sub American ⊆Sub continental , because we consider only direct sub-concepts.
Note that , .
Analogously to the notion of sub-concept we define a super-concept, and set of super-concepts of n-th level - see Definitions 1, 2, 3.
We denote this set of concepts by .
Building summary subject using concept taxonomy - including sub-concepts
As a subject of a summary, we propose to use an ontological class or SKOS category, because both are used to classify groups of graph vertexes. We point out that consideration of the hierarchy of concepts is necessary. In a general case, a graph vertex does not specify all classes and categories that is belongs to, but only the most specific one, that is - lowest in concept taxonomy. Hence, in order to properly select objects for summarization, also all sub-concepts of given concept have to be selected (see Def. 3.1). Hence, when a linguistic summary of concept c is created, vertexes not only of concept c, but also all vertices have to be selected (see Def. 3.1), because each member of any of the concept also belongs to concept c.
Building summarizer and qualifier using concept taxonomy - including super-concepts
In the proposed method, set of summarizers S is created during selecting objects for summarization, so each triple that has predicate or (see Def. 3.1). Now for each attribute A i (that is predicate label) its discrete value set is created based on the values of all retrieved triples. We denote this set of values as X A i . The attribute A i may be a concept that can be generalized, either by rdf:subClassOf or skos:broaderTransitive predicates. In such cases, attribute value a j also belongs to super concepts of concept c A i , hence a set of concepts (see Def. 3.1). In this case the set of summarizers is augmented by this set of super concepts - additional implicit values are inferred from concept taxonomy.
Ontological extensions to quality measures
T1 extended by concept taxonomy
Recall the summary truth value T1, given in (1), and the notion of membership function for a discrete summarizer (5). Now consider an hierarchical concept as a summarizer or a qualifier. In this case, the notion of ‘being member of a concept’ may be extended to have the form as in Equation 9.
T2 extended by concept taxonomy
Using (9) for evaluating the membership value of an attribute to a summarizer leads to generating more general summaries, when using concepts higher in hierarchy. For example, let’s imagine a summarizer related to geographical locations, like cities, countries and continents. Assume also that each instance of a class (e.g. Person) has a property ‘place of birth’. When formula 9 is used, we may create new summaries, extending by the notion of a city to a broader term, like a country or even continent. Then we may form a new summary, otherwise not possible (since each attribute specifies only a city), like ‘average number of people that are tall are born in Europe’. However, such summaries are less precise - extreme case would be ‘most people are born on Earth’. This summary is definitely true, however it is very imprecise.
Hence, we propose an analogous measure to a degree of imprecision for fuzzy sets - we call this notion degree of concept imprecision - see Definition 5. Proposed formula describes the intuition that imprecision of a concept depends on its level in taxonomy. The updated form of T2 ia presented in (11)
T9 extended by concept taxonomy
As for the of the summarizer, the same generalization process can be used for the qualifier, for which the imprecision is expressed by the truth value T9. Hence, we als propose to use (10), hence T9 becomes:
The set of quantifiers Q is known beforehand, as well as the summary subject - an ontological class or SKOS category.
For universality of the method, we do not know the attributes, denoted by A, nor their set of values, denoted by X
A
. Attributes and their values will be used as summarizers. Exact steps to be followed are listed below. Define the concept (e.g ontological class/SKOS category) c that will be the subject of the summary Generate full set of sub-concepts for c: (see Def. 2) Query the database for objects of concept and their attributes (so vertexes that are directly connected to them). Based on the queried data we create a set of attributes and their value sets: A = 〈A1, XA1〉, 〈A2, XA2〉,..., 〈A
i
, X
A
i
〉 For each attribute A
i
we compute the full set of super-concepts of this attribute value (see Definition 4). Each of the super-concepts may be used to form a more general summary. We create a set of linguistic summaries using found attributes as summarizers - For each qualifier we calculate truth valuesT1 - T5
c = writer
query the database for all objects of class writer and all subclasses - A = 〈squoborn′, {Paris, New York, Katowice, Amsterdam} 〉, 〈squoheight′, {176cmv, 186cm, 190cm, 166cm} 〉. squoborn′ ∈ T. set of summarizers S = {Paris, New York, Katowice, Amsterdam, France, USA, Netherlands, Poland, Europe} calculation T1, T2 and T
final
- an average ofT1, T2
Optimization of generalizing summarizers and qualifiers
In this section we address the efficiency of calculation of truth value T1 (see Equation 3) and T2 (precision) of generalized summaries and quantifiers introduced in Section 3.4. Since the central point of out proposed extensions are concept hierarchies, we limit our discussion to only discrete summarizers. We show that such limitation enables computation of T1 for generalized summaries
For discrete summarizers the formula for truth value T1 (see Equation 3) may be simplified:
Bottom-up calculation of generalized summarizers Consider a concept tree for an attribute A. Using approach presented in Section 3.4 we can formulate several summarizers - one for each class in the structure. However, we want to show that we can generate these generalized summarizers ‘for free’, that is - we can directly infer them from the most specific ones (lowest in the concept tree). Given that transitive (inferred) classes (and other inferred hierarchical concepts) are not present in the data (which most often is the case for broader SKOS categories), these attribute values are NOT given
Equation (15) presents formula for calculation of the count of a generalized summary. Based on the counts of subclasses it is possible to compute the count of a superclass, and according to (13), this count is enough to compute T1 for all summaries (i.e. all quantifiers). A graphical representation of this process is presented in Fig. 3, where it is clearly visible that countont. (S1) = count (S1) + count (S2) + count (S3), where ‘ont .’ means that class taxonomy is considered. Addition of count (S1) (that is
Bottom-up calculation of generalized qualifiers
Discrete qualifiers (enumerations) have an effect of limiting (filtering) subjects taken unto account when calculating T1 (denominator, m, in (3)). Note that in case of generalized summarizers the number of subjects is constant. Hence, in order to compute values for generalized qualifiers number of subjects for each qualifier also has to be taken into account, as expressed in formula (16).
Consider a hierarchical discrete qualifier W with a concept hierarchy as presented in Fig. 4, and a discrete summarizer S (which hierarchical nature is not relevant at this point). Since W is a discrete summarizer, μ
w
→ {0, 1}, only a crisp subset of initial dataset matches this qualifier. In context of Fig. 4, there are mW2 subjects that are W2, and from these ‘filtered’ subjects count (S|W2) are S (and by (13) we can quickly compute T1 for qualifier W2 and summarizer S). The same analysis applies for W3, and all other subclasses of W1. Having this information, it is possible to directly compute the total number of subjects that meet the qualifier condition W1 (which is the sum of the count of all subclasses, since apart of hierarchies, qualifiers W are disjoint). It is the sum of counts of all subclasses of W1, and again, as for the summarizers, concept W1
DBPedia (see [21]) is an extraction of info boxes from Wikipedia articles into Semantic Web format - Resource Description Framework, RDF (see [22]). In short, RDF is a data format/model composed of triples (subject, predicate and object) that allow easy data integration and extension. Currently DBPedia contains over 4 milion objects in the main dataset, while it can be easily connected to other data sources using owl : sameAs links available. DBPedia created its own multi-domain ontology which will be used for this experiment, but is also using several others - like subject categories (dcterms), Open Cyc, Wordnet, Freebase, UMBEL and YAGO2. We have implemented our system in Java using jFuzzyLogic [23] and Apache Jena for querying and processing DBPedia (using its SPARQL endpoint). We have used typical triangular definitions of qualifiers.
We have created summaries for a subclass of class Person - class Artists (96300 instances) - see Table 1. Due to the nature of the data, there is an unusually large number of summarizers (in comparison to a typical relational database case) - 56. Due to that complexity, including compound summarizers is not directly feasible and we have excluded them from the experiment. As can be seen from the table, some interesting patters may be found - for example that about half artists are musicians. For musical genres, like jazz, we have computed concept taxonomy based on predicate dbo:musicSubgenre.
As a second example to show clearly our generalization method, we will focus on the music taxonomy, where hierarchies are created using predicate dbo:musicSubgenre in DBPedia dataset. Part of this taxonomy, with instance count, direct and
Conclusions and future work
In this paper we have presented a novel approach to linguistic summarization of graph datasets with the use of ontologies. Firstly, we have proposed the solution to an initial problem - what should be summarized for a graph? - by proposing to use an ontological class or SKOS category as subject, and showed that sub-concept have to also be included. Then attributes for summarization are neighboring graph vertexes. Note that this approach can be extended by navigating deeper in the graph to obtain more summaries. Also, we have shown the application of attribute-based induction for summarizers and, what is new, for qualifiers. Based on the simplification of the equation for computation of the degree of truth (T1), we have adapted an optimization process of computation for summarizers and qualifiers. Creating such summaries (e.g. summarizing on continental level, while only the information about a country in directly available) leads to finding new dependencies in data. We have shown that obtaining these more general summaries and qualifiers can be inferred directly from more specific ones - that is with minimal computational cost.
We have also extended the T1, T2 and T9 quality measures, also by including concept taxonomy into analysis. By quality measure T2 we are able to determine the informativeness (precision) of a summary. We have proven our approach by generating linguistic summaries for a small subset of DBPedia, for two cases - summaries of class Artist and Album. For the latter, we’ve shown step-by-step computation of generalized summaries.
Further work may be focused on leveraging Linked Data nature - incorporating other ontologies and databases. Since Semantic Web is based on Linked Data, we may also use other ontologies and information sources to create new summaries. For instance, DBPedia is interconnected to DBTune (music database), Eurostat (statistical information), LinkedMDB (movies database), LinkedGeoData (geographical database), GeoSpecies (variousinformation about species) and many others. Also, since in our approach we define the summary subject, it would be useful to develop an algorithm for automatic subject selection, possibly also for graph datasets without ontologies (which would require a new method of defining the subject). Further work may also focus on generating the linguistic summaries of the topology of the graph, where a particular topological feature (e.q. clique) may also be defined in a fuzzy way, that is - in interval [0, 1].
