Efficient lineage for SUM aggregate queries

Abstract

AI systems typically make decisions and find patterns in data based on the computation of aggregate and specifically sum functions, expressed as queries, on data’s attributes. This computation can become costly or even inefficient when these queries concern the whole or big parts of the data and especially when we are dealing with big data. New types of intelligent analytics require also the explanation of why something happened.

In this paper we present a randomised algorithm that constructs a small summary of the data, called Aggregate Lineage, which can approximate well and explain all sums with large values in time that depends only on its size. The size of Aggregate Lineage is practically independent of the size of the original data. Our algorithm does not assume any knowledge of the set of sum queries to be approximated.

Keywords

Artificial intelligence; databases; aggregate queries; database lineage; query approximation; randomised algorithms

1. Introduction

Big data poses new challenges not only in storage but in intelligent data analytics as well. Many organisations have the infrastructure to maintain big structured data and need methods to efficiently discover patterns and relationships from which to derive intelligence [3,17]. Thus, it would be desirable to be able to construct out of big data a suitable representative part that can explain aggregate queries, e.g., why the salaries or the sales of a department are high.

AI systems typically make decisions based on the value of a function computed on data’s attributes. Several approaches have in common the computation of aggregates over the whole or large subsets of the data that helps explain patterns and trends of the data. E.g., recommendation systems rank and retrieve items that are more interesting for a specific user by aggregating existing recommendations [25]. For another example, collaborative filtering computes a function which uses aggregates and a sum over the existing ratings from all users for each product in order to predict the preference of a new user [4,19]. User preferences are often described as queries [18], e.g., queries that give constraints on item features that need to be satisfied.

Another reason for which data analytics seek to explain data is for data debugging purposes. Data debugging, which is the process that allows users to find incorrect data [23,24], is a research direction that is growing fast. Data are collected by various techniques which, moreover, are unknown to and uncontrolled by the user, thus are often erroneous. Finding which part of the data contains errors is essential for companies and affects a large part of their business.

All these applications call for techniques to explain our data. Aggregation is a significant component in all of them. In this paper we offer a technique that constructs a summary of the data with properties that allow it to be used efficiently to explain much of the data behaviour in aggregate for sums. We refer to this summary as Aggregate Lineage, since in most applications it represents the source of an aggregate query.¹

¹ Lineage used to be referred to as “explain” in database papers of the late 1980s.

Lineage (a.k.a. provenance) keeps track of where data comes from. Lineage has been investigated for data debugging purposes [16]. Storing the complete lineage of data can be prohibitively expensive and storage-saving techniques to eliminate or simplify similar patterns in it are studied in [5]. For select-project-join SQL queries, lineage stores the set of all tuples that were used to compute a tuple in the answer of the query [2]. This is natural for select-project-join SQL queries where original attribute values are “copied” in attribute values of the answer. However, in an aggregate query the value of the answer is the result of applying an aggregate function over many numerical attribute values. When we want to understand why we get an aggregate answer it may no longer be important or feasible to have lineage point to all contributing original tuples and their values. We would rather compute a few values that tell us as much as possible about the origin of the result of an aggregate query. However, is this at all possible and, if it is, what are the limitations?

In this paper we initiate an investigation of such questions and, interestingly, we show that useful and practical solutions exist. In particular, we offer a technique that uses randomisation to compute Aggregate Lineage, which is a small representative sample of the data (it is more sophisticated than a simple random sample). This sample allows for good approximations of a sum query on ad hoc subsets of the data; we call such queries test queries. Test queries are applied to the Aggregate Lineage, not to the whole original data. The test queries which we consider are sum queries with the same aggregated attribute, conditioned on any grouping attributes depending on which subsets of the data we want to test. We give performance guarantees about the quality of the results of the test queries that show the approximation to be good for test queries with large values (i.e., close to the total sum over the whole set of data). Our performance guarantees hold, with high probability, for any set of queries, even if the number of queries is exponentially large in the size of the lineage. The only restriction is that the queries should be oblivious to the actual Aggregate Lineage. This restriction is standard in all previous work on random representative subsets for the evaluation of aggregate queries and is naturally satisfied in virtually all practical applications. The following example offers a scenario about how Aggregate Lineage can be used in data debugging and demonstrates how some test queries can be defined.

Example 1.

Suppose that the accounting department of a big company maintains a database with a relation Salaries with hundreds of attributes and millions of tuples. Each tuple in the relation may contain an identifier of an employee stored in attribute EmplID, his Department stored in attribute Department, his annual salary stored in attribute Sal and many more attribute values. Other relations are extracted from this relation, e.g., a relation which contains aggregated data such as the total sum of salaries of all employees. A user is trying to use the second relation for decision making but he finds that the total sum of salaries is unacceptably high. He does not have easy access to the original relation or he does not want to waste time to pose time-consuming queries on the original big relation. The error could be caused by several reasons (duplication of data in a certain time period, incorrect code that computes salaries in a new department). Thus e.g., if we could find the total sum of salaries for employees in the toy department during 2009, and see that this is unreasonably high, still close to the first total sum of all employees’ salaries, then we will be able to detect such errors and narrow them down to small (and controllable) pieces of data.

In order to do that, we need the capability of posing sum queries restricted to certain parts of the data by using combinations of attributes. This will help the user understand which piece of data is incorrect. We do not know in advance, however, which piece of data the user would want to inquire about, and thus Aggregate Lineage should allow the user to get good approximate answers to whatever queries he wants to try. There are billions of such possible queries and hence billions of subsets of data for which we want to compute a good approximation of the summation of salaries. We want Aggregate Lineage to offer this possibility.

We propose to keep as Aggregate Lineage a small relation under the same schema as the original relation. In order to select which tuples to include, we use value-based sampling with repetition, i.e., weighted random sampling where the probability of selecting each tuple is proportional to its value on the summed attribute. The intuition why this method works is the following. Larger values contribute more to the sum than smaller ones, thus we expect that tuples with larger values should be selected more often than tuples with smaller values. Hence, we could end up with a tuple selected many times in the sample even if it appears only once in the original data. On the other hand, if there are many tuples with values of moderate size, many of them will be selected in the Aggregate Lineage, so that their total contribution to the approximation of the sum remains significant.

1.1. Our contribution

In our approach Aggregate Lineage is a small relation with the same schema as the original relation and with the property that it offers good approximations to test queries posed on it.

To present performance guarantees, we build on Althöfer’s Sparsification lemma [1]. In [1], Althöfer shows that the result of weighted random sampling over a probability vector is a sparse approximation of the original vector with high probability. This technique has found numerous applications e.g., in the efficient approximation of Nash equilibria for (bi)matrix games [21], in the very fast computation of approximate solutions in Linear Programming [22], and in selfish network design [12].

In this paper, we show for the first time that the techniques of [1] are also useful in the context of sum database queries with lineage. Our results show that the Aggregate Lineage that we extract has the following properties (which we describe in technical terms and prove rigorously in Section 4):

Its size is practically independent of the size of the original data.

It can be used to approximate well all “large” sums (i.e., with values close to the total sum), of the aggregated attribute in time that depends only on its size, and thus is almost independent of the size of the original data.

2. Computing Aggregate Lineage

In this section, we present randomised Algorithm Comp-Lineage which computes Aggregate Lineage in one pass over the data and in time linear in the size n of the original database relation. In Section 4 we show that the output of Comp-Lineage is useful to approximate arbitrary ad-hoc sum test queries in time independent of n. We note that our algorithm is agnostic of the specific sum queries that will be approximated by using its output.

Suppose that we are given a database with a relation R with n tuples and a positive integer b, which is the number of tuples we have decided to include in the Aggregate Lineage (in Section 4 we explain how to choose b to obtain good performance and approximation guarantees). Suppose that A is a numerical attribute of R which takes nonnegative values. Let S be the sum of the values of attribute A over all n tuples. The algorithm is essentially a biased sampling with repetition that selects b tuples from R. Each tuple t is selected with probability

$p_{t} = t [A] / S$

where $t [A]$ is the value of attribute A in t. The algorithm initially collects a bag (a.k.a. multiset, which may contain the same element more than once) of tuples, since each tuple may be selected multiple times. The bag is then turned into a set of tuples by adding an extra attribute $Fr$ (for Frequency), which records the number of times each tuple was selected. We denote by $L_{R . A}$ the Aggregate Lineage of relation R with sum attribute A. Algorithm Comp-Lineage is presented as Algorithm 1.

Table 1
Main symbols used in the paper

$t [A]$	Value of attribute A in tuple t
S	Sum of all values over attribute A
$p_{t}$	Probability that Algorithm Comp-Lineage selects tuple t
$Fr$	Additional attribute recording the frequency of a tuple in the lineage
$L_{R . A}$	The lineage relation computed by Algorithm Comp-Lineage w.r.t. attribute A
$Q (R . A)$ (Section 4)	A sub-sum query computed over original relation R w.r.t. attribute A
$Q^{'} (L_{R . A})$ (Section 4)	Sub-sum query Q computed over the aggregate lineage relation
$I_{R}^{Q}$ (Section 4)	Set of identifiers of the tuples in relation R that satisfy the predicates in query Q

Algorithm 1.

Algorithm Comp-Lineage

We can use the techniques of [9] for weighted random sampling and efficiently implement our algorithm to run in linear time in the size of the input either in a parallel/distributed environment or over data streams.
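As an illustration of the sampling step described in this section, the following is a minimal in-memory sketch in Python. It is not the one-pass implementation of [9]: it assumes the relation fits in memory as a list of dictionaries, and the function name `comp_lineage` and the `seed` parameter are hypothetical names introduced here.

```python
import random
from collections import Counter

def comp_lineage(tuples, attr, b, seed=None):
    """Sample b tuples with replacement, each with probability t[attr] / S.

    `tuples` is a list of dicts (an assumed in-memory stand-in for relation R).
    Returns (lineage, S), where lineage is a list of dicts carrying the
    original fields plus an extra 'Fr' field holding the selection count.
    """
    rng = random.Random(seed)
    weights = [t[attr] for t in tuples]
    S = sum(weights)
    if S <= 0:
        raise ValueError("the summed attribute must have a positive total")
    # Draw b indices; index i is drawn with probability weights[i] / S.
    picks = rng.choices(range(len(tuples)), weights=weights, k=b)
    # Collapse the bag of picks into a set of tuples with frequencies.
    freq = Counter(picks)
    lineage = [dict(tuples[i], Fr=f) for i, f in sorted(freq.items())]
    return lineage, S
```

The frequencies in the returned lineage always add up to b, and tuples whose value on the summed attribute is zero are never selected.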

Table 1 summarizes the main symbols used throughout the paper.

3. Running example

Example 2.
We illustrate Algorithm Comp-Lineage by applying it to Example 1 with $b = 8{,}852$ and presenting the data and the Aggregate Lineage in Table 2. Table 2 shows only the value of the aggregated attribute ($Sal$ in our example); the rest of each tuple is not shown.

Table 2
Properties of Aggregate Lineage $L_{Salaries . Sal}$ for $b = 8{,}852$

$Sal$ : O.V.	# of tuples in Salaries	Total # of tuples in Aggregate Lineage	$Fr$	# of tuples with $Fr$	$Sal$ : Values $Fr \cdot S / b$ in Aggregate Lineage
$10^{9}$	100	100	3	5	$3 \cdot S / b = 4.41 \times 10^{8}$
			4	10	$4 \cdot S / b = 5.87 \times 10^{8}$
			5	19	$5 \cdot S / b = 7.34 \times 10^{8}$
			6	14	$6 \cdot S / b = 8.81 \times 10^{8}$
			7	13	$7 \cdot S / b = 1.03 \times 10^{9}$
			8	15	$8 \cdot S / b = 1.17 \times 10^{9}$
			9	8	$9 \cdot S / b = 1.32 \times 10^{9}$
			10	12	$10 \cdot S / b = 1.47 \times 10^{9}$
			11	4	$11 \cdot S / b = 1.62 \times 10^{9}$
$10^{8}$	1,000	497	1	347	$S / b = 1.47 \times 10^{8}$
			2	123	$2 \cdot S / b = 2.94 \times 10^{8}$
			3	20	$3 \cdot S / b = 4.41 \times 10^{8}$
			4	7	$4 \cdot S / b = 5.87 \times 10^{8}$
$10^{7}$	10,000	681	1	681	$S / b = 1.47 \times 10^{8}$
$10^{6}$	1,000,000	6,809	1	6,809	$S / b = 1.47 \times 10^{8}$
$10$	1,000	0	0	0	0

Notes: The first two columns describe the data. The next three columns describe the Aggregate Lineage relation. The last column shows how we use this lineage to compute sub-sums.

The first two columns of Table 2 present the data in relation Salaries. In order to be able to present many tuples we have chosen a relation with a few values for attribute $Sal$ , actually five (i.e., $10^{9}$ , $10^{8}$ , $10^{7}$ , $10^{6}$ and 10) and their Original Values (O.V.) are shown in the first column. The second column shows how many tuples in Salaries have these values in $Sal$ . Thus, it says, e.g., that there are 100 tuples with value in $Sal$ equal to $10^{9}$ , 1,000 tuples with value in $Sal$ equal to $10^{8}$ and so on.

The third column in Table 2 shows how many tuples from Salaries with a specific value in $Sal$ are selected by Algorithm Comp-Lineage to be included in the Aggregate Lineage relation. Thus, e.g., all 100 tuples with $Sal = 10^{9}$ were chosen, only 681 tuples with $Sal = 10^{7}$ were chosen and no tuple with $Sal = 10$ was chosen.

In order to represent the Aggregate Lineage relation $L_{Salaries . Sal}$ in the most demonstrative way, we have chosen to partition its tuples in blocks (each block further divided in multiple rows in columns 4, 5 and 6), each block corresponding to one value of $Sal$ in $Salaries$ . Thus the first block has 9 rows, the second block has 4 rows and the last three blocks have one row each. This breaking into blocks gives a visualisation of the characteristics of the algorithm.

The fourth column stores the extra attribute frequency $Fr$ which tells how many times a certain tuple was selected by the algorithm and the fifth column stores the number of tuples that were selected so many times. Thus, e.g., the first row says that 5 tuples were selected 3 times each. The ninth row says that 4 tuples from $Salaries$ were selected 11 times each.

The blocks give us an intuition of the characteristics of the Aggregate Lineage. The first block corresponds to the largest value of $Sal$ and tuples with this value (i.e., $Sal = 10^{9}$ ) contributed quite heavily to the lineage – all 100 tuples with $Sal = 10^{9}$ were selected multiple times. In more detail, there are 100 tuples with value $Sal = 10^{9}$ . Of those tuples, 5 were added in the bag 3 times each, 10 tuples were added in the bag 4 times each, and so on. Thus, by considering these 100 tuples, the Algorithm Comp-Lineage added in the bag $3 \cdot 5 + 4 \cdot 10 + 5 \cdot 19 + 6 \cdot 14 + 7 \cdot 13 + 8 \cdot 15 + 9 \cdot 8 + 10 \cdot 12 + 11 \cdot 4 = 681$ tuples in total. That is to say, each of those 100 tuples contributed on average $6.81$ to the bag. When we get a set out of the bag by using frequencies (to avoid repeating a tuple multiple times), then we see that the average frequency per tuple is $6.81$ . So, from this first block, the $681$ tuples in the bag of Algorithm Comp-Lineage are transformed to a set of $100$ tuples in Aggregate Lineage with average frequency $6.81$ . We can compare it with the average frequency in the second block which is 0.681 (this is $1 \cdot 347 + 2 \cdot 123 + 3 \cdot 20 + 4 \cdot 7 = 681$ divided by 1,000 tuples) and see that, in the data of our example, each tuple of the first block contributes more heavily to the lineage.

As we will explain in more detail later, this shows partly why the lineage is useful for discovering almost accurately sub-sums that are large compared to the total sum, whereas when a sub-sum is small in comparison, then the lineage cannot be used to compute it accurately.

The second block did not contribute as heavily, but still considerably: around half of the tuples with $Sal = 10^{8}$ were selected at least once and quite a few more than once; in total this block contributed 681 tuples to the bag. The third block contributed moderately. The fourth block is interesting because the value of $Sal$ is very small, only $10^{6}$, yet it contributed quite a lot because there are many tuples in $Salaries$ with $Sal = 10^{6}$; it contributed almost 85 percent of the tuples in the Aggregate Lineage.

Finally, the last column of Table 2 shows how much each tuple from the Aggregate Lineage contributes to the approximation of sub-sums that are computed by the test queries. The same tuple may be added to the Aggregate Lineage several times, as recorded in the new attribute $Fr$, and thus, in order to calculate the contribution of a certain tuple, we multiply its frequency in $Fr$ by $S / b$. By doing so, some tuples (e.g., the ones in the fourth block) in our example of Table 2 will contribute much more than their actual value in $Sal$. But this compensates for the tuples with values close to it (the same value in our example) that are not selected to be included in the Aggregate Lineage. In the next section we give the technical details on how Aggregate Lineage can be used in order to approximate sub-sums.
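The arithmetic behind Table 2 can be re-derived with a few lines of Python. This is only a sketch for checking the figures quoted in the text; the variable names are chosen here for illustration, and the frequency histograms are copied from the table.

```python
# (Sal value, number of tuples in Salaries), as listed in Table 2.
data = [(10**9, 100), (10**8, 1_000), (10**7, 10_000),
        (10**6, 1_000_000), (10, 1_000)]
b = 8_852

S = sum(value * count for value, count in data)  # total sum of Sal
unit = S / b                                     # weight of a single selection

# Frequency histogram of each block: {Fr: number of tuples with that Fr}.
blocks = {
    10**9: {3: 5, 4: 10, 5: 19, 6: 14, 7: 13, 8: 15, 9: 8, 10: 12, 11: 4},
    10**8: {1: 347, 2: 123, 3: 20, 4: 7},
    10**7: {1: 681},
    10**6: {1: 6_809},
}

# Selections each block contributed to the bag, and distinct tuples kept.
bag_sizes = {v: sum(fr * cnt for fr, cnt in h.items()) for v, h in blocks.items()}
set_sizes = {v: sum(h.values()) for v, h in blocks.items()}

# Average frequency relative to the number of original tuples per block.
avg_freq = {v: bag_sizes[v] / dict(data)[v] for v in blocks}
```

Here `bag_sizes` sums to b, `unit` is roughly $1.47 \times 10^{8}$, and `avg_freq` gives 6.81 for the first block and 0.681 for the second, as discussed above.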

Note that Aggregate Lineage does not assume any knowledge of the query set: i.e., we run the random selection of Algorithm Comp-Lineage only once and compute $L_{R . A}$ without assuming anything about the queries. Then, this same relation $L_{R . A}$ can be used to make us understand any sub-sum test query, without requiring that the test queries are given beforehand or requiring that the test queries are chosen in any specific fashion (e.g., they do not have to be chosen uniformly at random), as long as the query choice is oblivious to the actual sample computed by Aggregate Lineage.²

² In technical terms, the queries are posed by an oblivious adversary, i.e., an adversary that knows exactly how Aggregate Lineage works but does not have access to its random choices. The restriction to oblivious adversaries is standard and unavoidable, since if one knows the actual value of $L_{R . A}$, he can construct a query that includes only tuples not belonging to $L_{R . A}$, for which no meaningful approximation guarantee would be possible.

We first present the theoretical approximation guarantees and then demonstrate how these guarantees play out for debugging on our running example.

4. Approximation guarantees of test queries on Aggregate Lineage

In this section we prove the theoretical guarantees of Aggregate Lineage. Let R be a relation with a nonnegative numerical attribute A. We consider SUM queries that ask for the sum of the values of attribute A over arbitrary subsets of the tuples in relation R. We use tuple identifiers in order to succinctly represent subsets of tuples. Thus, any SUM query defines a set of tuple identifiers for tuples that satisfy its predicates, hence the following formal definitions:

Definition 1 (Exact SUM $Q (R . A)$ ).

Let R be a database relation. We attach a tuple identifier on each tuple of R. We denote by $I_{R}$ the set of all identifiers in relation R. Given an attribute A in the schema of R, we denote by $a_{i}$ the value of attribute $R . A$ in the tuple with identifier i in R.

Let Q be a SUM query over $R . A$ . We denote by $I_{R}^{Q}$ the set of tuple identifiers from $I_{R}$ for tuples of R that satisfy Q’s predicates.

The result of a SUM query, $Q (R . A)$ , is the summation of the values of $R . A$ over the set of tuples with identifiers that appear in $I_{R}^{Q}$ , i.e., $Q (R . A) = \sum_{i \in I_{R}^{Q}} a_{i}$ .

Definition 2 (Approximated SUM $Q^{'} (L_{R . A})$ ).

Let Q be a SUM query over $R . A$ and let $L_{R . A}$ be an Aggregate Lineage. We attach a tuple identifier on each tuple of $L_{R . A}$ . We denote by $I_{L}$ the set of all identifiers in $L_{R . A}$ . We denote by $I_{L}^{Q}$ the set of tuple identifiers from $I_{L}$ for tuples of $L_{R . A}$ that satisfy Q’s predicates (since the set of attributes of R is a subset of the set of attributes of $L_{R . A}$ , we have that the predicates of a SUM query Q, expressed on attributes of R, define $I_{L}^{Q}$ ).

We denote by $f_{i}$ the value of attribute $L_{R . A}.Fr$ in the tuple with identifier i in $L_{R . A}$.

The approximated result of SUM query Q, denoted by $Q^{'} (L_{R . A})$, is the summation of the values of $L_{R . A}.Fr$ over the set of tuples with identifiers that appear in $I_{L}^{Q}$ multiplied by $S / b$, i.e., $Q^{'} (L_{R . A}) = \sum_{i \in I_{L}^{Q}} f_{i} \cdot S / b$.
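Definitions 1 and 2 translate directly into code. The sketch below assumes that relations are lists of dictionaries, that a query's predicates are given as a Python function, and that the lineage carries an `Fr` field; the names `exact_sum` and `approx_sum` are hypothetical.

```python
def exact_sum(relation, attr, predicate):
    """Q(R.A): sum of attr over the tuples of R that satisfy the predicate."""
    return sum(t[attr] for t in relation if predicate(t))

def approx_sum(lineage, S, b, predicate):
    """Q'(L_{R.A}): sum of Fr over the qualifying lineage tuples, times S / b."""
    return sum(t["Fr"] for t in lineage if predicate(t)) * S / b

# Tiny hand-built illustration (values chosen arbitrarily).
relation = [
    {"Department": "toy", "Sal": 60},
    {"Department": "toy", "Sal": 15},
    {"Department": "food", "Sal": 25},
]
lineage = [
    {"Department": "toy", "Sal": 60, "Fr": 3},
    {"Department": "food", "Sal": 25, "Fr": 1},
]
in_toy = lambda t: t["Department"] == "toy"
exact = exact_sum(relation, "Sal", in_toy)              # 75
approx = approx_sum(lineage, S=100, b=4, predicate=in_toy)  # 3 * 100 / 4 = 75.0
```

Note that `approx_sum` never reads the summed attribute itself: only the frequencies, S and b are needed.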

The following theorem provides the performance guarantees for any arbitrary set of m SUM queries computed over the Aggregate Lineage relation in order to serve as an approximation of the corresponding SUM queries over the original data.

Theorem 1.
Let R be a relation with n tuples having nonnegative values $a_{1}, \dots, a_{n}$ on attribute A, and let $S = \sum_{i = 1}^{n} a_{i}$ . Then, for any collection of m SUM queries $Q_{1} (R . A), \dots, Q_{m} (R . A)$ (not known to the algorithm), any $p \in (0, 1)$ , and any $ε > 0$ , the Algorithm Comp-Lineage with input all tuples of R and $b = ⌈ ln (2 m / p) / (2 ε^{2}) ⌉$ derives an Aggregate Lineage $L_{R . A}$ such that $| Q_{j} (R . A) - Q_{j}^{'} (L_{R . A}) | ⩽ ε S$ , for all $j \in [m]$ , with probability at least $1 - p$ .
Proof.
The proof is an adaptation of the proof of Althöfer’s Sparsification lemma [1]. For simplicity, we assume, without loss of generality, that the set $I_{R}$ of all tuple identifiers of R in Definition 1 is $I_{R} = \{1, \dots, n\}$. We define b independent identically distributed random variables $X_{1}, \dots, X_{b}$, which take each value $i \in [n]$ with probability $a_{i} / S$. Namely, each random variable $X_{k}$ corresponds to the outcome of the kth trial of Comp-Lineage. For each tuple i, its frequency in the sample is $f_{i} = | \{k \in [b] : X_{k} = i\} |$.

Let us fix an arbitrary SUM query $Q_{j} (R . A)$. For each $k \in [b]$, we let $Y_{j}^{k}$ be a random variable that is equal to 1 if $X_{k} \in I_{R}^{Q_{j}}$, and 0 otherwise. Since the random variable $Y_{j}^{k}$ is equal to 1 with probability $Q_{j} (R . A) / S$, $E [Y_{j}^{k}] = Q_{j} (R . A) / S$. We observe that the random variables $\{Y_{j}^{k}\}_{k \in [b]}$ are independent, because the random variables $\{X_{k}\}_{k \in [b]}$ are independent. Furthermore, we let $Y_{j}$ be a random variable defined as $Y_{j} = \frac{1}{b} \sum_{k = 1}^{b} Y_{j}^{k}$. By definition, $Y_{j} = \sum_{i \in I_{R}^{Q_{j}}} f_{i} / b = \sum_{i \in I_{L}^{Q_{j}}} f_{i} / b$, and thus we have that $Y_{j} = Q_{j}^{'} (L_{R . A}) / S$, i.e., $Y_{j}$ is equal to the approximated result of the SUM query divided by S. Also, by linearity of expectation, $E [Y_{j}] = Q_{j} (R . A) / S$.

Applying the Chernoff–Hoeffding bound, we obtain that for the particular choice of b, with probability at least $1 - p / m$, the actual value of $Y_{j}$ differs from its expectation $Q_{j} (R . A) / S$ by at most ε, which implies that $Q_{j}^{'} (L_{R . A})$ differs from $Q_{j} (R . A)$ by at most $ε S$. Formally, by the Chernoff–Hoeffding bound,³

$\begin{array}{rcl} \Pr [| Q_{j}^{'} (L_{R . A}) - Q_{j} (R . A) | > ε S] & = & \Pr [| Y_{j} - Q_{j} (R . A) / S | > ε] \\ & ⩽ & 2 e^{- 2 ε^{2} b} ⩽ p / m, \end{array}$

where the last inequality follows from the choice of b.

³ We use the following form of the Chernoff–Hoeffding bound (see [15]): Let $Y^{1}, \dots, Y^{b}$ be random variables independently distributed in $[0, 1]$, and let $Y = \frac{1}{b} \sum_{k = 1}^{b} Y^{k}$. Then, for all $ε > 0$, $\Pr [| Y - E [Y] | > ε] ⩽ 2 e^{- 2 ε^{2} b}$, where $e = 2.71 \dots$ is the base of the natural logarithm.

Applying the union bound, we obtain that $\Pr [\exists j \in [m] : | Q_{j}^{'} (L_{R . A}) - Q_{j} (R . A) | > ε S] ⩽ p$, which concludes the proof of the theorem. □
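The lineage size prescribed by Theorem 1 is straightforward to evaluate. The helper below is a small sketch; the function name `lineage_size` is introduced here for illustration.

```python
import math

def lineage_size(m, p, eps):
    """b = ceil(ln(2m / p) / (2 * eps^2)), as in the statement of Theorem 1."""
    if m < 1 or not (0 < p < 1) or eps <= 0:
        raise ValueError("need m >= 1, 0 < p < 1 and eps > 0")
    return math.ceil(math.log(2 * m / p) / (2 * eps ** 2))

# m = 10^6 queries, failure probability p = 10^-6, additive error 0.04 * S.
b = lineage_size(10**6, 1e-6, 0.04)
```

With these parameters the formula yields $b = 8{,}852$, which is the lineage size used in the running example.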
Example 3.
Suppose that, in our running example, we want to be able to answer $m = 10^{6}$ queries with good approximation. What guarantees does the theorem provide? The original data have $n \approx 10^{6}$ tuples. Suppose we select the number of tuples in the Aggregate Lineage to be $b \approx 9{,}000$. Then the theorem says that, by setting $ε = 0.04$, we can compute any of $10^{6}$ arbitrary queries within $0.04 S$ of its real value with probability $1 - 10^{- 6}$. Thus, if the exact value of query $Q_{1}$ is equal to $Q_{1} (Salaries . Sal) = 0.4 S = S_{1}$ (recall that S is the sum over all tuples of relation R), then the approximation error will be at most $0.04 S = 0.04 S_{1} / 0.4 = 0.1 S_{1}$. If for another query $Q_{2}$ we have $Q_{2} (Salaries . Sal) = 0.8 S = S_{2}$, then the approximation error will be at most $0.05 S_{2}$, so with high probability we get an answer that is within a relative error of $0.05$ of the actual answer.

Observations on the practical consequences of Theorem 1. Examining closely the equation $b = ⌈ ln (2 m / p) / (2 ε^{2}) ⌉$, which gives an upper bound on the number of tuples in the Aggregate Lineage for m queries and with p and ε guarantees as in the statement of the theorem, we make the following observations:

The value of b depends on m only logarithmically, hence if we go from m to $m^{2}$ queries, we only need to multiply b by 2 in order to keep the same performance guarantees. Thus it is reasonable to state that, in many practical cases, the number m of queries that can be approximated well can be as large as a polynomial in the size of the data, even with a coefficient in the order of a few hundreds.

The value of b does not depend much on p (again only logarithmically) but it depends mainly on ε, which controls the approximation ratio (the approximation ratio itself is $ε / ρ$ if the query to be computed has a sum $S^{'} = ρ S$).
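Both observations can be checked with a short calculation. The notation $b(m)$ for the bound of Theorem 1 viewed as a function of m (with p and ε fixed) is introduced here only for illustration.

```latex
\begin{align*}
\ln\frac{2m^{2}}{p} &= 2\ln m + \ln\frac{2}{p}
   \le 2\ln m + 2\ln\frac{2}{p} = 2\ln\frac{2m}{p}
   && \text{since } \ln(2/p) > 0 \text{ for } p \in (0,1), \\
b(m^{2}) &= \left\lceil \frac{\ln(2m^{2}/p)}{2\varepsilon^{2}} \right\rceil
   \le 2\left\lceil \frac{\ln(2m/p)}{2\varepsilon^{2}} \right\rceil = 2\,b(m), \\
\frac{|Q(R.A) - Q'(L_{R.A})|}{Q(R.A)} &\le \frac{\varepsilon S}{\rho S} = \frac{\varepsilon}{\rho}
   && \text{when } Q(R.A) = \rho S .
\end{align*}
```

The last line holds with probability at least $1 - p$ simultaneously for all m queries, by Theorem 1.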

5. A debugging scenario

Here is what a user can do for data debugging when using the Aggregate Lineage we propose.

He computes sub-sums by filtering on some attributes and possibly on specific values for these attributes. E.g., what was the sum of salaries of employees in the toy department in Spring 2010, and only for those employees who were hired after 2005? The user devises several such test queries as he sees appropriate and, as he computes them and checks whether the sub-data is OK or suspicious, he devises different test queries to suit the situation. E.g., if he observes an unusually large value, close to the total sum, in the query about employees in the toy department and hired before 2005, then the rest of the queries he devises stay within this department and within the range until 2005, and he tries to narrow down further the wrong part of the data. E.g., now he narrows down to each month and/or to employees that were hired between 2005 and 2007, etc. On the other hand, if he finds the answer satisfactory, then he declares this part of the data correct, stays outside this sub-data and tries to find some other part of the data that is faulty. The user poses his test queries over the stored small Aggregate Lineage instead of inefficiently using the original big relation.

In the following example we show how using Aggregate Lineage to approximate test queries applies to our running example.

Example 4.
We continue our running Example 2 where we computed Aggregate Lineage $L_{Salaries . Sal}$. Suppose that we have a SUM test query $Q_{1}$ asking for the sum of the salaries of a subset of the employees of the company, defined by a subset of $EmplID$ values. Let this subset consist of $50$ employees with salary $10^{9}$, 5,000 employees with salary $10^{7}$ (so half of them) and all $10^{6}$ employees with salary $10^{6}$. We compute the query over $Salaries$ and obtain the exact answer $1.1 \times 10^{12}$.

In order to use Aggregate Lineage to understand our data we compute $I_{L}^{Q_{1}}$. The Aggregate Lineage has at most 8,852 tuples. The identifiers of $I_{L}^{Q_{1}}$ define the sub-lineage of query $Q_{1}$ over $L_{Salaries . Sal}$. The sub-lineage of $Q_{1}$ points to $50$ of the tuples of $L_{Salaries . Sal}$ with original salaries $10^{9}$ and to all 6,809 tuples with original $Sal$ values $10^{6}$ (cf. Table 2). It will also point to some tuples of $L_{Salaries . Sal}$ with $Sal$ values $10^{7}$: on average, query $Q_{1}$ covers half of the $681$ tuples with this value that were selected in the Aggregate Lineage, but in extreme cases it may include all or none of them. For this reason, it is good practice to run the randomised algorithm more than once and compute a few distinct summaries in order to obtain better results. For instance, we may compute three summaries, use some benchmark sub-queries to decide a distance between summaries, discard the summary that is most distant from the others and keep one of the remaining two arbitrarily. Note that it is easy to compute the benchmark queries in one pass through the original data in parallel with computing the lineage.

We now use the Aggregate Lineage $L_{Salaries . Sal}$ shown in Table 2 to approximate the value of the sum answer to $Q_{1}$. In one extreme case, query $Q_{1}$ will include the $50$ tuples with salaries $10^{9}$ that have the larger frequencies in $L_{Salaries . Sal}$ and all 681 selected tuples with salaries $10^{7}$. The approximation $Q_{1}^{'} (L_{Salaries . Sal})$ in this case is $(4 \cdot 11 + 12 \cdot 10 + \dots + 681 + 6{,}809) \cdot S / b = 7{,}935 \cdot S / b = 1.17 \cdot 10^{12}$. In the other extreme case, $Q_{1}$ includes the tuples with the smaller frequencies and none of the tuples with salaries $10^{7}$ that were selected in the Aggregate Lineage, yielding the approximation $6{,}995 \cdot S / b = 1.03 \cdot 10^{12}$. We see that $Q_{1}$ is well approximated. Of course the approximation bounds are not the same for every SUM query; we presented the guarantees in Section 4.

Another straw man approach would be to select as lineage the 8,852 tuples with the largest salary values. This method selects all 100 tuples with salaries $10^{9}$, all 1,000 tuples with salaries $10^{8}$ and the remaining 7,752 tuples from those with salaries $10^{7}$. With this approach, query $Q_{1}$ is approximated on average by the value $50 \cdot 10^{9} + 3{,}876 \cdot 10^{7} \approx 8.9 \times 10^{10}$, because it loses all information about the $10^{6}$ original tuples with salaries $10^{6}$ that contribute to the sum. In another approach, a simple random sample of 8,852 tuples will draw almost all of them from the $10^{6}$ tuples with salaries $10^{6}$. Query $Q_{1}$ will then be approximated by the value $8{,}852 \cdot 10^{6} \approx 8.9 \times 10^{9}$. Note, on the other hand, that if all original tuples had the same salary then our method would coincide with simple random sampling.
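The top-values straw man can be checked with a short sketch. It assumes the same data set as the example (in particular 10,000 tuples with salary $10^{7}$, which is an assumption here) and that $Q_{1}$ covers half of the selected $10^{9}$ and $10^{7}$ tuples; the helper name `top_k_counts` is hypothetical.

```python
def top_k_counts(classes, k):
    """How many tuples of each salary value a 'largest k values' selection takes."""
    taken = {}
    for value, count in sorted(classes, reverse=True):  # largest salaries first
        take = min(count, k)
        taken[value] = take
        k -= take
    return taken


# Assumed data set as (salary value, number of tuples).
classes = [(10**9, 100), (10**8, 1_000), (10**7, 10_000), (10**6, 10**6)]
taken = top_k_counts(classes, 8_852)

# Q1 covers half of the selected 10^9 and 10^7 tuples and none of the 10^8 ones.
straw_man = (taken[10**9] // 2) * 10**9 + (taken[10**7] // 2) * 10**7
print(taken, straw_man)
```

No tuple with salary $10^{6}$ is selected, which is why this estimate falls more than an order of magnitude below the Aggregate Lineage estimates.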

6. Discussion

For simplicity, our exposition has focused on a single aggregated attribute (e.g., $Sal$ in our example). Our ideas extend easily to more aggregated attributes, as long as we are willing to keep a distinct aggregate lineage for each attribute. E.g., suppose we also had a $Rev$ (for Revenue) attribute for each employee. In such a case we keep two lineage relations, one for $Sal$ and one for $Rev$. The algorithm to compute them can be thought of as a parallel implementation of two copies of the algorithm Comp-Lineage, and still needs only one pass through the original data. The only differences are that (a) we need the two total sums $S_{Sal}$ and $S_{Rev}$, and (b) for each tuple $t$ we have two probabilities, $p_{t}^{Sal}$ and $p_{t}^{Rev}$, the first used for the lineage of attribute $Sal$ and the second for the lineage of attribute $Rev$.
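As a small illustration of the two probabilities per tuple, the sketch below assumes that each probability is the tuple's attribute value divided by the corresponding total sum, i.e. $p_{t}^{Sal} = Sal_{t} / S_{Sal}$ and $p_{t}^{Rev} = Rev_{t} / S_{Rev}$. The employee rows and the function name are hypothetical, and for clarity the sketch scans an in-memory list twice rather than making the single pass described above.

```python
from fractions import Fraction


def per_attribute_probabilities(rows, attributes):
    """Selection probability of every row, separately for each attribute."""
    # Total sums S_Sal, S_Rev, ... (one per aggregated attribute).
    totals = {a: sum(row[a] for row in rows) for a in attributes}
    # Exact fractions keep the illustration free of rounding noise.
    return [{a: Fraction(row[a], totals[a]) for a in attributes} for row in rows]


# Hypothetical employees with integer Sal and Rev values.
employees = [{"Sal": 10, "Rev": 300}, {"Sal": 30, "Rev": 100}]
probs = per_attribute_probabilities(employees, ["Sal", "Rev"])
print(probs[0]["Sal"], probs[0]["Rev"])  # prints "1/4 3/4"
```

The same tuple can be a likely sample for one attribute and an unlikely one for the other, which is why a separate lineage relation is kept per attribute.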

Algorithm Comp-Lineage performs a weighted random sampling that selects, with replacement, b out of the n tuples of R, where the weight $w_{i}$ of the tuple with identifier i is equal to the value of attribute A in this tuple. Using b copies of the weighted random sampling with a reservoir algorithm presented in [9], each copy selecting a single element, we can implement Comp-Lineage in one pass over R, in $O (b n)$ time and $O (b)$ space. This implementation can also be applied to data streams and to settings where the values of n and S are not known in advance.
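A minimal sketch of this one-pass scheme is given below, assuming the input is a stream of (identifier, weight) pairs. Each of the b reservoirs keeps the tuple with the largest random key, in the style of the key-based method of [9]; the sketch uses the logarithm of the key $u^{1/w_{i}}$ for numerical stability, which is an implementation choice of this illustration and not taken from the text. The function name `comp_lineage` and the handling of non-positive weights are assumptions.

```python
import math
import random


def comp_lineage(rows, b, seed=None):
    """One-pass weighted sampling with replacement.

    rows: iterable of (identifier, weight) pairs, weight = value of attribute A.
    Returns a list of b identifiers; repeats encode frequencies.
    """
    rng = random.Random(seed)
    best_key = [-math.inf] * b  # largest key seen so far by each reservoir
    chosen = [None] * b         # identifier currently held by each reservoir
    for ident, weight in rows:
        if weight <= 0:         # assumption: such tuples are never sampled
            continue
        for j in range(b):      # b independent size-one reservoirs: O(b n) time
            u = 1.0 - rng.random()      # uniform in (0, 1]
            key = math.log(u) / weight  # log of u ** (1 / weight)
            if key > best_key[j]:
                best_key[j], chosen[j] = key, ident
    return chosen


# Toy stream: identifier "y" carries 999/1000 of the total weight.
sample = comp_lineage([("x", 1), ("y", 999)], b=2000, seed=7)
print(sample.count("y") / len(sample))
```

Only the b current keys and identifiers are stored, so the space is $O (b)$, and neither n nor S needs to be known before the pass starts.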

However, the technique in [9] does not seem to be efficiently parallelizable, at least not in a direct way. Thus the problem of how to implement our technique efficiently in distributed computational environments such as MapReduce remains open. Issues in implementing sampling in MapReduce are discussed in [14]. Another open problem is how to apply this technique to evolving data [13]. In data streams, we assume that the sample is to be computed over the entire data. When data evolve continuously with time, the sample may also change considerably with time. The nature of the sample may vary both with the moment at which it is computed and with the time horizon in which the user is interested. We have not investigated here how to provide this flexibility.

7. Comparison with synopses for data

There has been extensive research on approximation techniques for aggregate queries on large databases and data streams. Previous work considers a variety of techniques including random sampling, histograms, multivalued histograms, wavelets and sketches (see e.g., [6] and the references therein for details and applications of those methods). Most of the previous work on histograms, wavelets, and sketches focuses on approximating aggregate queries on a given attribute A for specific subsets of the data that are known when the synopsis is computed (e.g., the synopsis concerns the entire data stream or a particular subset of the database). Thus, such techniques typically lose the correlation between the approximated A values and the original values of other attributes. For the more general case of multiple queries that can be posed over arbitrary sets of attributes and subsets of the data not specified when the synopsis is computed, those techniques typically lead to an exponential (in the number of other attributes involved) increase in the size of the synopsis (see e.g., [7,8]).

In contrast, our approach is far more general and does not focus on approximating queries over specific attributes or subsets of the data. Our algorithm computes a small sample without assuming any knowledge of the set of queries and keeps the association between the sampled A values and all other attributes. We can then use the Aggregate Lineage to approximate large-valued sum queries over arbitrary subsets of the data that can be expressed over any set of attributes. The Aggregate Lineage can approximately answer a number of queries exponential in its size. Of course, the queries should be oblivious to the actual Aggregate Lineage (technically, they should be computed by an oblivious adversary), but this technical condition applies to all previously known randomised synopsis constructions (see e.g., [6]).

8. Conclusions

We have presented a method that computes lineage for aggregate queries by applying weighted sampling. The aggregate lineage can be used to compute arbitrary test aggregate queries on subsets of the original data. However, a test query can be computed with good approximation only if its result is large enough with respect to the total sum over all the data; the aggregate lineage cannot be used to compute test queries whose result is comparatively small. We give performance guarantees.

Parallel implementation on frameworks such as MapReduce is not studied here. The naive approach of parallelizing [9] would either have to transmit a large amount of data to the several compute nodes, or have a makespan linear in n.

The idea of drawing a single (possibly weighted) random sample from a large data set and using it for repeated estimations of a given quantity has appeared before in the context of machine learning and statistical estimation, e.g., in boosting [26] and bootstrapping [10,11]. In [20], the Bag of Little Bootstraps (BLB) is used for the efficient estimation of bootstrap-based quantities in a distributed computational environment.

Footnotes

Acknowledgements

This work was supported by the project Handling Uncertainty in Data Intensive Applications, co-financed by the European Union (European Social Fund – ESF) and Greek national funds, through the Operational Program “Education and Lifelong Learning”, under the program THALES.

References

[1] Althöfer, On sparse approximations to randomized strategies and convex combinations, Linear Algebra and Applications 99 (1994), 339–355.

[2] Benjelloun, A.D. Sarma, A.Y. Halevy, Theobald and Widom, Databases with uncertainty and lineage, VLDB J. 17(2) (2008), 243–264.

[3] Big data analytics – Advanced analytics in oracle database, An Oracle White Paper, March 2013.

[4] J.S. Breese, Heckerman and C.M. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: UAI, 1998, pp. 43–52.

[5] Chapman, H.V. Jagadish and Ramanan, Efficient provenance storage, in: SIGMOD Conference, 2008, pp. 993–1006.

[6] Cormode, M.N. Garofalakis, P.J. Haas and Jermaine, Synopses for massive data: Samples, histograms, wavelets, sketches, Foundations and Trends in Databases 4(1–3) (2012), 1–294.

[7] Dobra, Garofalakis, Gehrke and Rastogi, Processing complex aggregate queries over data streams, in: SIGMOD Conference, ACM, 2002, pp. 61–72.

[8] Dobra, Garofalakis, Gehrke and Rastogi, Sketch-based multi-query processing over data streams, in: Proc. 9th International Conference on Extending Database Technology (EDBT 2004), Lecture Notes in Computer Science, Vol. 2992, Springer, 2004, pp. 551–568.

[9] Efraimidis and P.G. Spirakis, Weighted random sampling with a reservoir, Inf. Process. Lett. 97(5) (2006), 181–185.

[10] Efron, Bootstrap methods: Another look at the jackknife, The Annals of Statistics 7(1) (1979), 1–26.

[11] Efron and R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, New York, NY, 1993.

[12] Fotakis, Kaporis and Spirakis, Efficient methods for selfish network design, Theoretical Computer Science 448 (2012), 9–20.

[13] Ganti and Ramakrishnan, Mining and Monitoring Evolving Data, Springer, 2002.

[14] Grover and M.J. Carey, Extending map-reduce for efficient predicate-based sampling, in: ICDE, 2012, pp. 486–497.

[15] Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58(301) (1963), 13–30.

[16] Ikeda, Cho, Fang, Salihoglu, Torikai and Widom, Provenance-based debugging and drill-down in data-oriented workflows, in: ICDE, 2012, pp. 1249–1252.

[17] Information management and big data, An Oracle White Paper, February 2013.

[18] Jannach, Fast computation of query relaxations for knowledge-based recommenders, AI Commun. 22(4) (2009), 235–248.

[19] Kagie, van der Loos and M.C. van Wezel, Including item characteristics in the probabilistic latent semantic analysis model for collaborative filtering, AI Commun. 22(4) (2009), 249–265.

[20] Kleiner, Talwalkar, Sarkar and M.I. Jordan, The big data bootstrap, in: ICML, 2012.

[21] Lipton, Markakis and Mehta, Playing large games using simple strategies, in: Proc. 4th ACM Conference on Electronic Commerce (EC’03), 2003, pp. 36–41.

[22] Lipton and Young, Simple strategies for large zero-sum games with applications to complexity theory, in: Proc. 26th ACM Symposium on Theory of Computing (STOC’94), 1994, pp. 734–740.

[23] Meliou, Gatterbauer, Nath and Suciu, Tracing data errors with view-conditioned causality, in: SIGMOD Conference, 2011, pp. 505–516.

[24] Muslu, Brun and Meliou, Data debugging with continuous testing, in: ESEC/SIGSOFT FSE, 2013, pp. 631–634.

[25] Ricci, Rokach, Shapira and P.B. Kantor (eds), Recommender Systems Handbook, Springer, 2011.

[26] Schapire, The boosting approach to machine learning: An overview, in: Nonlinear Estimation and Classification, 2003.


1 Lineage used to be referred to as “explain” in database papers of the late 1980s.
