Abstract
Benchmarking is among the most widely adopted practices in business today. However, to the best of our knowledge, conducting multidimensional benchmarking in data warehouses has not been explored from a technical efficiency perspective. In this paper, we formulate benchmark queries in the context of data warehousing and business intelligence, and develop algorithms to answer benchmark queries efficiently. Our methods employ a few interesting ideas and the state-of-the-art data cube computation techniques to reduce the number of aggregate cells that need to be computed and indexed. An empirical study using the TPC-H and the Weather data sets demonstrates the efficiency and the scalability of our methods.
Introduction
Organizations conduct benchmarking for continuous improvement. As a first step, they want to identify if there are areas where they are not performing as well as others. For example, an analyst is interested in finding the performance of the senior sales representatives in Asia by measuring the average sales amount per representative. The analyst may then be interested in finding factors that make up the context in which the performance of the senior sales representatives is measured. These factors may include the product lines, the purchasing customer industry, and the transaction time in a month. The analyst wants to find other sales groups that are performing significantly better than the group of senior sales representatives in Asia in some contexts. For example, some answers interesting to the analyst may be, “compared to the group in question, the sales representatives in North America outperforms the group most for selling laptops to financial business customers during the first 2 quarters of the year”. This kind of questions are also known as benchmarking analysis in business analytics, since they try to find a benchmark for a query group.
The recognition of benchmarking as a useful management tool was formalized in early 1980s when Xerox employed benchmarking as part of its “Leadership through Quality”, a program to find ways to reduce manufacturing costs. In 1982, Xerox determined that the average manufacturing cost of copies in Japanese companies was 40–50% of that of Xerox’s, and they were able to undercut Xerox’s prices effortlessly. As part of the “Leadership through Quality”, Xerox established the benchmarking program, which played a major role in pulling Xerox out of trouble in the years to come. Xerox since then has become one of the best examples of the successful implementation of benchmarking [20].
Benchmark queries can be very sophisticated. For example, one may add various constraints to refine search space. Instead of comparing a query group with any group, one may only be interested in those groups that are super-groups, sub-groups or sibling groups of the query group. For instance, the super-groups of senior sales representatives in Asia are the sales representatives in Asia, the senior sales representatives in the world, and all sales representatives in the world. Likewise, the sibling groups of senior sales representatives in Asia are the groups of senior sales representatives in other regions, such as North America, South America, Europe, and the group of junior sales representatives in Asia.
Benchmarking in business is related to egocentric analysis. In egocentric analysis, given a query group, it tries to identify aspects in which the query group is better than its peers. For example, given a group of senior sales representatives in Asia as the query group, the egocentric analysis tries to identify the factors by which this group performs the best compared to the other groups. An answer may look like “compared to the senior sales representatives in other regions, the query group has the best performance in selling desktop computers to education customers”. If a query group cannot find a benchmark in a subspace, the subspace is the answer to the egocentric analysis for the query group.
Data warehouses are the essential information infrastructure in modern enterprises. However, to the best of our knowledge, benchmarking effectively and efficiently in data warehouses remains largely unexplored from a technical perspective in data management and analytics. Benchmark queries cannot be answered online using the existing data cube and data warehouse techniques. Even when we compute a whole data cube using all the attributes, in the context of benchmarking, we need to define a query group and then need to search the cube for the answers to the query. It is well recognized that the size of a data cube is exponential with respect to the number of tuples and the dimensionality of the base table.
In this paper, we tackle the problem of efficiently answering benchmark queries. We make a few contributions. First, we formulate benchmark queries technically. To the best of our knowledge, this technical problem has not been studied systematically in literature for computational efficiency. Second, we explore algorithmic approaches to benchmark queries. We develop two approaches. First approach is the Sorted Inverted Index Cube (SIIC); we sort aggregate cells in a cube and explore the idea of inverted index for fast access to the search scope of a query. The second approach is the Dominant Answer Materialization (DAM) approach; we exploit a property of aggregate cells to refine the definition of benchmark queries to remove redundancy in answers, thus achieving a more efficiency. Finally, we conduct an extensive experimental study using both synthetic and real data sets to determine the efficiency of our proposed methods.
The rest of the paper is organized as follows. In Section 2, we define benchmark queries and review related work. In Section 3, we develop a Sorted Inverted Index Cube (SIIC) method. In Section 4, we propose a Dominant Answer Materialization (DAM) approach. We report an empirical evaluation in Section 5. Section 6 concludes the paper.
Problem definition and related work
In this section, we first outline some preliminaries in data cube and multidimensional analysis. We then define the benchmark queries formally.
Preliminaries
We largely follow the notations in the conventional data cube and data warehouse literature [8]. Consider a relational table
Let
An aggregate cell in the cuboid on
We can define a partial order
A data cube is the set of cuboids on all possible subspaces, that is, all subsets of dimensions, including the empty set. Equivalently, a data cube is also the set of all aggregate cells. We denote a data cube by
For two aggregate cells
Consider aggregate cells
Benchmark queries
We consider a relational base table
The unit-id attributes are used to group tuples in
The dimension attributes are used to conduct multidimensional comparative analysis between two units. The measure attribute is used to calculate aggregates and derive quantity analysis. For the sake of simplicity, we have only one measure attribute in our discussion. However, our methods can be extended to the scenarios where multiple measure attributes are used to derive sophisticated aggregates. We also assume that the measure attribute takes non-negative values. This assumption often is true in many business intelligence applications. For example, measures such as count, sales volume, and amount are often used in practice. Even when a measure can take negative values, we can always normalize the attribute such that the normalized measure attribute has non-negative values.
For each non-empty unit that consists of at least one tuple in the base table, using the dimension attributes DIM and the measure attributes
Suppose we use attributes
We use attributes
As noted earlier, unit-id attributes and dimension attributes may not be exclusive. That is, an attribute may be both a unit-id attribute and a dimension attribute. Technically, we can always create a copy of an attribute that is used as both a unit-id attribute and a dimension attribute. As such, one copy is used as a unit-id attribute and the other copy is used as a dimension attribute. Therefore, without loss of generality, for the rest of the paper, we assume that the unit-id attributes and the dimension attributes are exclusive.
To compare two aggregate cells
For a unit
We define a benchmark querie benchmark query
a base table a query unit the search scope, namely ancestors, descendants and siblings; and a parameter
Let
there are at most
The requirement
Given a benchmark query
Base table of employees of a company
Suppose the query unit is
Aggregate functions can be categorized into two types: monotonic aggregates and non-monotonic aggregates. An aggregate function
For a monotonic aggregate function, answering a benchmark query is straightforward, since the apex cell
Benchmark queries are related to iceberg queries, gradient analysis, and discovery-driven OLAP. We briefly review the existing studies and how benchmark queries differ from them.
In online analytics processing (OLAP) in data warehouses, iceberg queries [7] compute the aggregate cells whose aggregate values are over a user defined threshold. For example, in a data cube of sales data, an iceberg query may return all the aggregate cells whose total sales amount is over 100 thousand dollars.
Iceberg queries have been extensively studied. A focus is on efficient algorithms for answering iceberg queries. For example, Beyer and Ramakrishan [2] proposed algorithm BUC which computes iceberg cubes with monotonic aggregate functions. Han et al. [9] developed a method for computing iceberg queries with non-monotonic aggregate functions. Ng et al. [16] investigated iceberg queries on distributed systems. Chen et al. [5] explored iceberg cube computation in shared-nothing clusters. Lo et al. [15] extended iceberg queries to sequence data. Recently, He et al. [10] used patterns as “dimensions” in iceberg queries on sequences. Chen et al. [4] extended iceberg queries to graphs.
Although both iceberg queries and benchmark queries are concerned with aggregates, they are fundamentally different. Iceberg queries do not use any query unit and do not compare aggregate cells. Further, while Iceberg queries are often used as the first step for materializing a scope for further analysis, benchmark queries are tools for more focused analysis on a target unit. Benchmark queries cannot be answered by a straight application of iceberg query methods.
In gradient analysis [6, 13], given a probe aggregate cell
There are some similarities between gradient analysis and benchmark queries. First, both gradient analysis and benchmark queries use a query aggregate cell, and find interesting aggregate cells when compared to the query cell. Second, both gradient analysis and benchmark queries use the aggregate ratios as the significance measure. Third, both gradient analysis and benchmark queries can search ancestors and descendants of the query cell. However, the two types of queries have some fundamental differences. Gradient analysis does not separate the unit attributes and the dimension attributes. Gradient analysis in this sense can be regarded as a special case of benchmark queries where the set of dimension attributes DIM is empty. Also, the business objectives for the two are also very different. Benchmark queries facilitate more detailed multidimensional analysis for comparing the query unit with the other units in the intended search scope.
Sarawagi et al. [18] developed the notion of discovery-driven exploration of OLAP cube. The main idea is to identify anomalies within a data cube and provide proper indicators in the corresponding aggregate cells. Both discovery-driven exploration of OLAP cube and benchmark queries want to find significant exceptions. However, discovery-driven exploration does not focus on one query cell. It instead considers all descendant cells for each aggregate cell. Further, the objectives in the two problems are very different. Discovery-driven exploration aims to provide navigation guidance for users to browse interesting regions of a cube, while benchmark queries compares a query unit with the other units. Thus, the measures used are very different.
Common forms of benchmarking methods lend from economic efficiency analysis which involve parametric and non-parametric techniques. The primary objective of both is to measure the technical efficiency which is defined as the ability of a producer to produce maximum output from a given set of inputs. Technical efficiency thus is translated as the success indicator of performance measure by which producers are evaluated. Given the importance of technical efficiency analysis, several models of frontiers have been developed. The frontier models are based on the premise that efficient producers operate on the production frontier using the most efficient technology available, while inefficient producers operate below the production frontier and the level of inefficiency is measured by the deviation from the frontier [12]. The major advantage of this method is that it allows the test of hypothesis concerning the goodness of fit of the model. The stochastic aspect of the model allows it to handle measurement problems appropriately and other stochastic influences that would otherwise show up as causes of inefficiency [11]. However, the major drawback is that it requires specification of technology, which may be restrictive in most cases [12].
Data Envelopment Analysis (DEA) is a non-parametric linear programming technique widely used in the operations research and management science literature [1]. DEA estimates the cost level that an efficient organization should be able to achieve in a particular market. The model seeks to determine an envelopment surface, also referred to as the efficiency frontier. Rather than estimating the impact of different cost drivers, DEA establishes an efficiency frontier (taking account of all relevant variables) based on the “envelope” of observations. Each organization is then assigned an efficiency score based on its proximity to the estimated efficiency frontier. With DEA, the efficient frontier (determined by the efficient organizations in the sample) is the benchmark against that the relative performance of organizations is measured.
As pointed out earlier, conducting benchmarking effectively and efficiently using scalable computational technology, particularly on readily available data warehouse infrastructure, has not been explored in literature. Bridging the gap between business needs and the technology motivates this study.
A Sorted Inverted Index Cube (SIIC) method
With the advanced data warehousing techniques, we can materialize a multidiemsional data cube. We assume a data cube materialization method
For each possible unit
To facilitate answering benchmark queries, we can materialize
A naïve method would be to, given unit
Inverted indices for fast search
We use two simple ideas to facilitate fast search.
The first idea is to sort all aggregate cells in the cube
Let
The second idea is to facilitate the visit of aggregate cells in the search scope using inverted indices. For every unit-id attribute, we maintain an inverted index for each value of the domain to record the list of aggregate cells containing the value. Suppose
We can retrieve all aggregate cells of cube
We also show that retrieving all unit aggregate cells in the search scope, that is, ancestors, descendants and siblings, can also be conducted efficiently using the inverted index in a way similar to merge-sort. Let
To find all descendant units, we search the inverted indices
First, we build an inverted index for each value in the domain of every unit-id attribute. Two examples of inverted indices are shown in Fig. 1.
Example of SIIC, using inverted indices for values “young” and “M”.
Suppose the query unit is
To find all aggregate cells of the ancestors, descendants and siblings of
We will continue and complete the query answering process in Example 5.
Since we scan the aggregate cells in the aggregate value descending order, we maintain the top-
Proof. We only need to recall that, if
Using the result above, for any aggregate cell
Let
Using the above pruning rules, we can prune the aspects of
Example of SIIC with pruning.
Suppose the query unit is
The SIIC method uses some simple yet efficient techniques to accelerate answering benchmark queries. However, there is one severe drawback; that is, in the worst case, we still have to go through the list of all aggregate cells of the whole data cube
Search scope of ancestors
We consider the search scope of ancestors first and discuss how to address the search scope of descendants later.
Consider a query unit
Proof. We prove by contradiction. Assume that
Since
Theorem 1 suggests a useful hint for answering benchmark queries. Multiple query units may share a common aggregate unit as an answer for benchmark queries. To answer benchmark queries efficiently, we can precompute those aggregate units and the associated aspects that may be answers to benchmark queries. With that in mind, the problem now is, for an aggregate unit
there exists an ancestor there exists a descendant
Proof. If there exists an ancestor
According to the first statement in Lemma 2, in order to answer benchmark queries whose search scope is ancestors, we do not need to store the whole data cube
For an aggregate unit
Once all dominant answers are materialized, we can organize those dominant answers using inverted indices as described in the SIIC method. The query answering method remains the same. The second statement in Lemma 2 guarantees the correctness. The major saving in the DAM method is that we do not need to store or search any non-dominant answers.
The last remaining is how to compute the dominant answers. A brute-force method would be to compute a full data cube and then select the dominant answers from all the aggregate cells. Since we are concerned with groups of aggregate cells with different measure values, we can adopt the quotient cube method [14].
The quotient cube method [14], instead of computing all the aggregate cells in a cube, it groups the aggregate cells according to the tuples in the base table that contribute most to the aggregate of the cells. For an aggregate cell
The quotient group technique thus is suitable for answering benchmark queries. We only need to materialize the upper bounds of the quotient groups that are dominant answers.
The set of ancestors of the query unit is
According to the base table,
As an example of quotient cube based on the base table, we can verify that cov (mid-age, M, Vancouver, manager, *)
With the aid of the quotient cube algorithm, we can materialize all dominant answers from the quotient groups in Table 1; that is,
We now consider benchmark queries with the search scope of descendants. Given a query unit
We also have a result similar to Lemma 2.
there exists a descendant there exists an ancestor
We thus have a method similar to that for the search scope of ancestors.
We present an empirical study in this section. The algorithms were implemented with Python 2.7 running with PyPy1
We evaluated our algorithms on both synthetic data and real data.
TPC-H benchmark (synthetic data). TPC-H (as of the experiemtns, we used TPC-H v2.17.1) is a widely used benchmark that consists of a suite of business oriented ad-hoc queries and concurrent data modifications. TPC-H has Weather data set (real data). The weather data set2
TPC-H: the number of computed and indexed cells when the dimensionality of DIM is fixed to 5
Weather: the number of computed and indexed cells when the dimensionality of DIM fixed to 5
TPC-H: reduction ratio of DAM to SIIC/SIICP when the dimensionality of DIM is fixed.
In our experiments, 100 queries were randomly generated for each data set. Each experiment was conducted 10 times, reporting the average values. Using the avg() as the aggregation function, we compare the following methods in the experiments.
SIIC: the Sorted Inverted Index Cube method without pruning; TPC-H: the number of computed and indexed cells when the dimensionality of UID is fixed to 10 on the TPC-H data set
Weather: the number of computed and indexed cells when the dimensionality of UID is fixed to 5 on the Weather data set
Weather: reduction ratio of DAM over SIIC/SIICP when the dimensionality of DIM is fixed.
TPC-H: reduction ratio of DAM to SIIC/SIICP when the dimensionality of UID is fixed.
Weather: reduction ratio of DAM over SIIC/SIICP when the dimensionality of UID is fixed.
TPC-H: runtime and memory usage with DIM fixed.
Weather: runtime and memory usage with DIM fixed.
TPC-H: runtime and memory usage with UID fixed.
Weather: runtime and memory usage with UID fixed.
TPC-H: scalability.
SIICP: the Sorted Inverted Index Cube method with pruning;
DAM: the Dominant Answer Materialization method.
We used BUC [2] to materialize the data cubes for SIIC and SIICP, and the Quotient Cube algorithm [14] to compute the quotient groups for DAM.
We conducted two experiments to evaluate the effectiveness of reducing the numbers of aggregate cells computed and indexed in our algorithms while the indices were constructed.
For the first set of experiments, we fixed the dimensionality of DIM, and reported the numbers of cells computed and indexed with respect to the increase of dimensionality of UID. We sorted the attributes in the descending order cardinalities. For the TPC-H data set, we generated 9 testing data sets with 2 to 10 dimensions of UID and the dimensionality of DIM was fixed to 5. For the Weather data set, we generated 4 testing data sets with 2 to 5 dimensions if UID and the dimensionality of DIM was fixed to 5.
The results are shown in Tables 2 and 3. For the materialization step, SIICP and SIIC have the same mechanism; thus, they have the same results for the number of computed and indexed cells. The DAM algorithm shows its advantage over SIICP and SIIC for both the synthetic and the real data sets. Figure 3 shows the reduction ratio of the computed and indexed cells with the TPC-H data set. The reduction ratio is ratio of the number of cells computed by DAM to the number of cells computed by SIICP/SIIC. The reduction ratio in most cases is about 10% meaning that DAM only computes and indices about 10% of the cells that SIICP and SIIC do. Further, the ratio becomes smaller when the dimensionality of UID increases. This indicates that when UID has more dimensions, DAM can save more in materialization and indexing. Figure 4 shows the results with the Weather data set. The trends are similar to those with the synthetic data set.
The savings of DAM are due to the fact DAM only stores and searches the dominant answers in the quotient groups. This mechanism also reduces the runtime and the memory usage in query answering, which will be shown later.
Runtime and memory usage
For the second set of experiments, we fixed the dimensionality of UID, and reported the numbers of computed and indexed cells with respect to the dimensionality of DIM. For the TPC-H data set, we generated 4 testing data sets with 2 to 5 dimensions of DIM and the dimensionality of UID was fixed to 10. For the Weather data set, we generated 4 testing data sets with 2 to 5 dimensions of DIM and the dimensionality of UID was fixed to 5.
The results are shown in Tables 4 and 5. Similar to the first set of experiments, SIICP and SIIC have the same mechanism in the materialization step; as such, the two methods have the same numbers of computed and indexed cells. The DAM algorithm significantly outperforms SIICP and SIIC on both with the synthetic and the real data sets. Figure 5 shows the reduction ratio of the computed and indexed cells with the TPC-H data set. The reduction ratio in most cases is under 10%. Similar to the earlier observation, the ratio decreases when the dimensionality of DIM increases. This indicates that when DIM has more dimensions, DAM can save more in materialization and indexing. Figure 6 shows the results with the Weather data set and it demonstrates similar trends.
We fixed the dimensionality of DIM, and reported both the runtime and the memory usage in index construction and query answering with respect to the dimensionality of UID. The testing data sets were the same as those for the first set of experiments in Section 5.2. The memory usage reported is the peak memory usage in the query answering process. When we tested query answering, 100 random queries were conducted; thus, the average query answering runtime and the standard deviation are reported.
The results with the TPC-H data set and the Weather data set are shown in Figs 7 and 8, respectively.
For the runtime, DAM saves time in both the indexing and the query answering steps. As shown in Figs 7a and 8a, in the index construction step, DAM takes less than half of the runtime of SIIC. Further, the runtime of DAM increases much more slowly than that of SIIC and SIICP when the dimensionality of UID increases. In the index construction step, SIIC and SIICP have the same mechanism; as such, the two methods present the same results for the indexing time. As shown in Figs 7b and 8b, SIICP is faster than SIIC in the query answering, but is still much more slowly than DAM.
For the memory usage, as shown in Figs 7c and 8c, DAM consumes a small amount of memory, while SIIC and SIICP consume much larger amounts of memory. Further, the memory usage of DAM increases more slowly than that of SIIC and SIICP when the dimensionality of UID increases. The above results indicate that when UID has more dimensions, DAM can save more time and memory in indexing.
The savings of DAM come from the fact that DAM only computes and stores the dominant answers in the quotient groups. Once a query is given, DAM only searches the dominant answers, which leads to efficiency in both time and memory usage. Both SIIC and SIICP need to materialize the data cube using BUC, and then build the inverted indices. SIICP is faster than SIIC because SIICP applies the pruning techniques in query answering.
Next, we fixed the dimensionality of UID, and reported both the runtime and the memory usage with respect to the dimensionality of DIM. The testing data sets were the same as those for the second set of experiments in Section 5.2.
The results are shown in Figs 9 and 10. The DAM algorithm clearly outperforms SIIC and SIICP on both the real and the synthetic data sets. The runtime and memory usage of DAM increase slower than those of SIIC and SIICP when the dimensionality of DIM increases.
Scalability
To assess the scalability of our algorithms, we generated and used 4 TPC-H data sets with different sizes: 25%, 50%, 75%, 100% of 1 GB using the corresponding TPC-H data set size parameters: 0.25, 0.5, 0.75, 1, respectively. The dimensionality of UID was fixed to 10, and the dimensionality of DIM was fixed to 5. The runtime and the memory usage are reported with respect to the different sizes of the data sets.
The results are shown in Fig. 11. The DAM algorithm is much more scalable than the SIIC and SIICP methods in materialization/indexing as well as query answering. For the memory usage, all the three methods are scalable. DAM consistently uses much less memory than SIIC and SIICP.
Conclusions
Benchmarking is conducted extensively in various industries for performance improvement. However, performing multidimensional benchmarking efficiently with large data sets remains a technical problem. In this paper, we formulated benchmark queries in the context of data warehousing and business intelligence. Benchmark queries cannot be answered in a straightforward manner using existing OLAP methods. To this end, we developed a few algorithms to answer benchmark queries efficiently.
Our methods employ progressive data cube computation techniques to reduce the number of aggregate cells that need to be computed and indexed. An empirical study using the TPC-H and the Weather data sets demonstrates the efficiency and scalability of our methods. In particular, the DAM method is fast and scalable with large data sets.
To the best of our knowledge, there are no tools that allow multidimensional benchmarking in data warehouses in business today. We plan to develop tools using the techniques proposed in this paper as our contribution to business at large. We also plan to explore new types of analytic queries and tasks built upon benchmark queries.
Footnotes
Acknowledgments
The research was supported in part by an NSERC Discovery Grant and the NSERC CRC Program. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
