Abstract
AI systems typically make decisions and find patterns in data based on the computation of aggregate and specifically sum functions, expressed as queries, on data’s attributes. This computation can become costly or even inefficient when these queries concern the whole or big parts of the data and especially when we are dealing with big data. New types of intelligent analytics require also the explanation of why something happened.
In this paper we present a randomised algorithm that constructs a small summary of the data, called Aggregate Lineage, which can approximate well and explain all sums with large values in time that depends only on its size. The size of Aggregate Lineage is practically independent on the size of the original data. Our algorithm does not assume any knowledge on the set of sum queries to be approximated.
Keywords
Introduction
Big data poses new challenges not only in storage but in intelligent data analytics as well. Many organisations have the infrastructure to maintain big structured data and need to find methods to efficiently discover patterns and relationships to derive intelligence [3,17]. Thus, it would be desirable to be able to construct out of big data a right representative part that can explain aggregate queries, e.g., why the salaries or the sales of a department are high.
AI systems typically make decisions based on the value of a function computed on data’s attributes. Several approaches have in common the computation of aggregates over the whole or large subsets of the data that helps explain patterns and trends of the data. E.g., recommendation systems rank and retrieve items that are more interesting for a specific user by aggregating existing recommendations [25]. For another example, collaborative filtering computes a function which uses aggregates and a sum over the existing ratings from all users for each product in order to predict the preference of a new user [4,19]. User preferences are often described as queries [18], e.g., queries that give constraints on item features that need to be satisfied.
Another reason for which data analytics seek to explain data is for data debugging purposes. Data debugging, which is the process that allows users to find incorrect data [23,24], is a research direction that is growing fast. Data are collected by various techniques which, moreover, are unknown to and uncontrolled by the user, thus are often erroneous. Finding which part of the data contains errors is essential for companies and affects a large part of their business.
All these applications call for techniques to explain our data. Aggregation is a significant component in all of them. In this paper we offer a technique that constructs a summary of the data with properties that allow it to be used efficiently to explain much of the data behaviour in aggregate for sums. We refer to this summary as Aggregate Lineage, since in most applications it represents the source of an aggregate query.1
Lineage used to be referred to as “explain” in database papers of the late 1980s.
Lineage (a.k.a. provenance) keeps track of where data comes from. Lineage has been investigated for data debugging purposes [16]. Storing the complete lineage of data can be prohibitively expensive and storage-saving techniques to eliminate or simplify similar patterns in it are studied in [5]. For select-project-join SQL queries, lineage stores the set of all tuples that were used to compute a tuple in the answer of the query [2]. This is natural for select-project-join SQL queries where original attribute values are “copied” in attribute values of the answer. However, in an aggregate query the value of the answer is the result of applying an aggregate function over many numerical attribute values. When we want to understand why we get an aggregate answer it may no longer be important or feasible to have lineage to point to all contributing original tuples and their values. We would rather want to compute few values that can be used to tell us as much as possible about the origin of the result of an aggregate query. However is this at all possible and if it is what are the limitations?
In this paper we initiate an investigation of such questions and, interestingly, we show that useful and practical solutions exist. In particular, we offer a technique that uses randomisation to compute Aggregate Lineage which is a small representative sample (it is more sophisticated than just a simple random sample) of the data. This sample has the property to allow for good approximations of a sum query on ad hoc subsets of data – we call them test queries. Test queries are applied to the Aggregate Lineage – not the whole original data. The test queries which we consider are sum queries with same aggregated attribute conditioned with any grouping attributes depending on which subsets of the data we want to test. We give performance guarantees about the quality of the results of the test queries that show the approximation to be good for test queries with large values (i.e., close to the total sum over the whole set of data). Our performance guarantees hold, with high probability, for any set of queries, even if the number of queries is exponentially large in the size of the lineage. The only restriction is that the queries should be oblivious to the actual Aggregate Lineage. This restriction is standard in all previous work on random representative subsets for the evaluation of aggregate queries and is naturally satisfied in virtually all practical applications. The following example offers a scenario about how Aggregate Lineage can be used in data debugging and demonstrates how some test queries can be defined.
Suppose that the accounting department of a big company maintains a database with a relation Salaries with hundreds of attributes and millions of tuples. Each tuple in the relation may contain an identifier of an employee stored in attribute EmplID, his Department stored in attribute Department, his annual salary stored in attribute Sal and many more attribute values. Other relations are extracted from this relation, e.g., a relation which contains aggregated data such as the total sum of salaries of all employees. A user is trying to use the second relation for decision making but he finds that the total sum of salaries is unacceptably high. He does not have easy access to the original relation or he does not want to waste time to pose time-consuming queries on the original big relation. The error could be caused by several reasons (duplication of data in a certain time period, incorrect code that computes salaries in a new department). Thus e.g., if we could find the total sum of salaries for employees in the toy department during 2009, and see that this is unreasonably high, still close to the first total sum of all employees’ salaries, then we will be able to detect such errors and narrow them down to small (and controllable) pieces of data.
In order to do that, we need the capability of posing sum queries restricted to certain parts of the data by using combinations of attributes. This will help the user understand which piece of data is incorrect. We do not know in advance, however, which piece of data the user would want to inquire and thus Aggregate Lineage should allow the user to be able to get good approximated answers to whatever queries he wants to try. There are billions of such possible queries and hence billions of subsets of data which we want to compute a good approximation of the summation of salaries. We want Aggregate Lineage to offer this possibility.
We propose to keep as Aggregate Lineage a small relation under the same schema of the original relation. In order to select which tuples to include, we use valued-based sampling with repetition, i.e., weighted random sampling where the probability of selecting each tuple is proportional to its value on the summed attribute. The intuition why this method works is the following. Larger values contribute more to the sum than smaller ones, thus we expect that tuples with larger values should be selected more often than tuples with smaller values. Hence, we could end up with a tuple selected many times in the sample even if it appears only once in the original data. On the other hand, if there are many tuples with values of moderate size, many of them will be selected in the Aggregate Lineage, so that their total contribution to the approximation of the sum remains significant.
In our approach Aggregate Lineage is a small relation with same schema as the original relation and with the property to offer good approximations to test queries posed on it.
To present performance guarantees, we build on Althöfer’s Sparsification lemma [1]. In [1], Althöfer shows that the result of weighted random sampling over a probability vector is a sparse approximation of the original vector with high probability. This technique has found numerous applications e.g., in the efficient approximation of Nash equilibria for (bi)matrix games [21], in the very fast computation of approximate solutions in Linear Programming [22], and in selfish network design [12].
In this paper, we show for the first time that the techniques of [1] are also useful in the context of sum database queries with lineage. Our results show that the Aggregate Lineage that we extract has the following properties (which we describe in technical terms and prove rigorously in Section 4):
Its size is practically independent of the size of the original data. It can be used to approximate well all “large” sums (i.e., with values close to the total sum), of the aggregated attribute in time that depends only on its size, and thus is almost independent of the size of the original data.
Computing Aggregate Lineage
In this section, we present randomised Algorithm Comp-Lineage which computes Aggregate Lineage in one pass over the data and in time linear in the size n of the original database relation. In Section 4 we show that the output of Comp-Lineage is useful to approximate arbitrary ad-hoc sum test queries in time independent of n. We note that our algorithm is agnostic of the specific sum queries that will be approximated by using its output.
Main symbols used in the paper
Main symbols used in the paper

Algorithm Comp-Lineage
We can use the techniques of [9] for weighted random sampling and efficiently implement our algorithm to run in linear time in the size of the input either in a parallel/distributed environment or over data streams.
Table 1 summarizes the main symbols used throughout the paper.
We illustrate Algorithm Comp-Lineage by applying it to Example 1 with
and presenting the data and the Aggregate Lineage in Table 2. Actually Table 2 only shows the value of the aggregated attribute (
in our example), the rest of the tuple is not shown.
Properties of Aggregate Lineage
for
Notes: The first two columns describe the data. The next three columns describe the Aggregate Lineage relation. The last column shows how we use this lineage to compute sub-sums.
Properties of Aggregate Lineage
Notes: The first two columns describe the data. The next three columns describe the Aggregate Lineage relation. The last column shows how we use this lineage to compute sub-sums.
The first two columns of Table 2 present the data in relation Salaries. In order to be able to present many tuples we have chosen a relation with a few values for attribute
The third column in Table 2 shows how many tuples from Salaries with a specific value in
In order to represent the Aggregate Lineage relation
The fourth column stores the extra attribute frequency
The blocks give us an intuition of the characteristics of the Aggregate Lineage. The first block corresponds to the largest value of
As we will explain in more detail later, this shows partly why the lineage is useful for discovering almost accurately sub-sums that are large compared to the total sum, whereas when a sub-sum is small in comparison, then the lineage cannot be used to compute it accurately.
The second block did not contribute that heavily but still quite a lot, around half of tuples with
Finally the last column in the figure shows how much each tuple from the Aggregate Lineage contributes to the approximation of sub-sums that are computed by the test queries. The same tuple is added in the Aggregate Lineage several times as recorded in the new attribute
Note that Aggregate Lineage does not assume any knowledge of the query set: i.e., we run the random selection of Algorithm Comp-Lineage only once and compute
In technical terms, the queries are posed by an oblivious adversary, i.e., an adversary that knows how exactly Aggregate Lineage works but does not have access to its random choices. The restriction to oblivious adversaries is standard and unavoidable, since if one knows the actual value of
In this section we prove the theoretical guarantees of Aggregate Lineage. Let R be a relation with a nonnegative numerical attribute A. We consider SUM queries that ask for the sum of attribute’s A values over arbitrary subsets of the tuples in relation R. We use tuple identifiers in order to succinctly represent subsets of tuples. Thus, any SUM query defines a set of tuple identifiers for tuples that satisfy its predicates, hence the following formal definitions:
(Exact SUM
).
Let R be a database relation. We attach a tuple identifier on each tuple of R. We denote by
Let Q be a SUM query over
The result of a SUM query,
(Approximated SUM
).
Let Q be a SUM query over
We denote by
The approximated result of SUM query Q, denoted by
The following theorem provides the performance guarantees for any arbitrary set of m SUM queries computed over the Aggregate Lineage relation in order to serve as an approximation of the corresponding SUM queries over the original data.
Let R be a relation with n tuples having nonnegative values
The proof is an adaptation of the proof of Althöfer’s Sparsification lemma [1]. For simplicity, we assume, without loss of generality, that the set
Let us fix an arbitrary SUM query
Applying the Chernoff–Hoeffding bound, we obtain that for the particular choice of b, with probability at least
We use the following form of the Chernoff–Hoeffding bound (see [15]): Let
Applying the union bound, we obtain that
Suppose, in our running example, we want to be able to answer with good approximation
Observations on the practical consequences of Theorem 1. Examining closely equation The value of b depends on m as the logarithm, hence if we go from m to The value of b does not depend much on p (again only as in the logarithm) but it depends mainly on ε which controls the approximation ratio (the approximation ratio itself is
Here is what a user can do for data debugging when using the Aggregate Lineage we propose.
He computes sub-sums by filtering some attributes and possibly specific values for these attributes. E.g., what was the sum of salaries of employees in the toy department in Spring 2010 and only for those employees who were hired after 2005. The user devises several such test queries as he sees appropriate and while he computes them and checks that sub-data is OK or suspicious, he devised different test queries to suit the situation. E.g., if he observes an unusually large value, close to the total sum, in the query about employees in the toy department and hired before 2005, then the rest of the queries he devises stay within this department and within the range until 2005, and tries to narrow down further the wrong part of data. E.g., now he narrows down to each month or/and to employees that are hired between 2005 and 2007, etc. On the other hand, if he finds the answer satisfactory, then he announces this part of the data correct, therefore stays outside this sub-data and tries to find some other part of the data that are faulty. The user uses and poses his test queries over the stored small Aggregate Lineage instead of inefficiently use the original big relation. We continue our running Example 2 where we computed Aggregate Lineage In order to use Aggregate Lineage to understand our data we compute We now use the Aggregate Lineage Another straw man approach would be to select as lineage the 8,852 tuples with larger salary values. This method will select all 100 tuples with salaries
In the following example we show how using Aggregate Lineage to approximate test queries applies to our running example.
We have focused in our exposition only on a single aggregated attribute (e.g.,
Algorithm Comp-Lineage performs a weighted random sampling which selects with replacement b out of n tuples of R where the weight
However the technique in [9], does not seem to be efficiently parallelizable, at least not in a direct way. Thus the problem of how to efficiently implement our technique in distributed computational environments such as MapReduce remains open. Issues about how to implement sampling in MapReduce are discussed in [14]. Another open problem is how to apply this technique to evolving data [13]. In data streams, we assume that the sample is to be computed over the entire data. When data continuously evolve with time, the sample may also change considerably with time. The nature of the sample may vary with both the moment at which it is computed and with the time horizon over which the user is interested in. We have not investigated here how to provide this flexibility.
Comparison with synopses for data
There has been extensive research on approximation techniques for aggregate queries on large databases and data streams. Previous work considers a variety of techniques including random sampling, histograms, multivalued histograms, wavelets and sketches (see e.g., [6] and the references therein for details and applications of those methods). Most of the previous work on histograms, wavelets, and sketches focuses on approximating aggregate queries on a given attribute A for specific subsets of the data that are known when the synopsis is computed (e.g., the synopsis concerns the entire data stream or a particular subset of the database). Thus, such techniques typically lose the correlation between the approximated A values and the original values of other attributes. For the more general case of multiple queries that can be posed over arbitrary sets of attributes and subsets of the data not specified when the synopsis is computed, those techniques typically lead to an exponential (in the number of other attributes involved) increase in the size of the synopsis (see e.g., [7,8]).
In contrast, our approach is far more general and does not focus on approximating queries over specific attributes or subsets of the data. Our algorithm computes a small sample without assuming any knowledge on the set of queries and keeps the association between the sampled A values and all other attributes. Then, we can use the Aggregate Lineage to approximate large-valued sum queries over arbitrary subsets of the data that can be expressed over any set of attributes. The Aggregate Lineage can approximately answer a number of queries exponential in its size. Of course, the queries should be oblivious to the actual Aggregate Lineage (technically, they should be computed by an oblivious adversary), but this technical condition applies to all previously known randomised synopses constructions (see e.g., [6]).
Conclusions
We have presented a method that computes lineage for aggregate queries by applying weighted sampling. The aggregate lineage can be used to compute arbitrary test aggregate queries on subsets of the original data. However the test queries can be computed with good approximation only if the result of each test query is large enough with respect to the total sum over all the data. The aggregate lineage we compute cannot be used to compute test queries if their result is comparatively small. We give performance guarantees.
Parallel implementation on frameworks such as MapReduce is not studied here. The naive approach of parallelizing [9] would have either to transmit a large amount of data to the several compute nodes, or to have a makespan linear in n.
The idea of getting a single (possibly weighted) random sample from a large data set and using it for repeated estimations of a given quantity has appeared before in the context of machine learning and statistical estimation. Boosting techniques [26] such as bootstrapping [10,11] are used. In [20] BLB is used for the efficient estimation of bootstrap-based quantities in a distributed computational environment.
Footnotes
Acknowledgements
This work was supported by the project Handling Uncertainty in Data Intensive Applications, co-financed by the European Union (European Social Fund – ESF) and Greek national funds, through the Operational Program “Education and Lifelong Learning”, under the program THALES.
