Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization

Abstract

Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because these layouts support projection and selection operations and thus allow reading less data from the disk when required. However, distributed processing frameworks (e.g., Hadoop, Spark) do not exploit this well when analytical queries are posed. These frameworks divide the data into multiple partitions and process each partition in a separate task, so the number of tasks depends on the total file size and not on the actual size of the data to be read. This typically leads to launching more tasks than needed, which, in turn, increases the query execution time and induces significant waste of computing resources. To allow a more efficient use of resources and reduce the query execution time, we propose a method that decides the number of tasks based on the data being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated reading size in a multi-objective optimization method to decide the number of tasks and computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimations are highly correlated (0.96) with the real executions. Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide a 2.1× speedup compared with default solutions.

Introduction

The size of data is growing exponentially.¹ Since huge volumes of data are difficult to store in a model-first, load-later fashion, organizations end up storing all the raw data on a distributed file system (e.g., HDFS*) or cloud storage (e.g., Amazon S3^†). In addition, they have their own data pipelines to process the raw data and store them in very wide tables^2,3 by using hybrid layouts,^4,5 which have built-in support for projection and selection operations and thus help in reading data more efficiently from the disk.^6,7

There are several available hybrid layout implementations, such as Optimized Record Columnar (ORC),^‡ Parquet,^§ and CarbonData.** All of them follow the same physical structure, as shown in Figure 1. Data are stored in multiple horizontal partitions, known as stripes in ORC, row groups (RGs) in Parquet, and blocklets in CarbonData; each horizontal partition stores its data column-wise, which is beneficial for projection. Statistics about the data are stored in each partition, and they may help in filtering partitions. Further, hybrid layouts support dictionary encoding for compressing repetitive values of individual columns. The dictionary can also be used to filter partitions.

FIG. 1.

Structure of hybrid layouts.

Further, high-level languages (e.g., Apache Pig,^†† Hive,^‡‡ etc.) are used to write analytical queries for getting business insights from the processed data. These analytical queries are executed on distributed processing frameworks (such as Hadoop^§§ or Spark***), which process data in parallel on multiple machines to speed up the analysis. As mentioned earlier, hybrid layouts allow reading less data from the disk. However, distributed frameworks do not fully exploit this when deciding the number of tasks^††† for processing the data. They always decide the number of tasks based on the total table size and not on the portion of the table being read. This leads to the over-provisioning of tasks, where many tasks remain idle, without any data to process, but still incur extra overhead (e.g., initialization time, garbage collection). Further, the idle tasks also waste the computational resources that are assigned to them. This waste is not considered even in the area of cloud computing,^8–11 where computational resources are decided based on the total data size. This leads to wastage of resources and money.

Motivational example

Let us assume that we have the processed data stored in a hybrid layout, which contains four RGs. Let us further assume that we are executing an analytical query with a filter operation that is satisfied by only two RGs. The distributed processing frameworks process the data in parallel by dividing them into multiple partitions (for simplicity, we assume that one partition is equal to an RG). By default, a distributed framework would create four tasks. However, two of them would remain idle (cf. Fig. 2a), and yet would read extra metadata from the disk and would require extra initialization time. This would increase the makespan (i.e., the execution time). Further, in terms of computational resources, four executors^‡‡‡ would be required to execute all these tasks in parallel. However, in an ideal scenario, based on the amount of data read (cf. Fig. 2b), only two tasks with two executors would be enough. The latter would help in saving computational resources and reducing the makespan.

FIG. 2.

Parallelism for hybrid layouts in distributed processing frameworks. (a) Tasks based on the total size. (b) Tasks based on the reading size.

As argued earlier, we need to decide the number of tasks based on the actual data read from the disk. To do that, we first need to estimate the read size, which can be done by utilizing our cost model presented in Munir et al.¹² The cost model estimates the scan, projection, and selection sizes for hybrid layouts.

In this article, we extend it further to estimate the makespan of a query based on the estimated reading size. Thus, we design a framework that takes a user query and data statistics as inputs to estimate the reading size, and then, through a multi-objective optimization method,¹³ decides the number of tasks and executors.

After configuring the number of tasks and executors, the query would be automatically submitted to a distributed processing framework. We implemented our approach for Parquet and Spark to show its applicability in real scenarios. The main contributions of this work can be summarized as follows:

– We extend the cost model for hybrid layouts presented in Munir et al.¹² to estimate the makespan of a query.

– We propose a framework based on a multi-objective optimization method¹³ that, using our extended cost model, configures the number of tasks and executors for a given query.

– We prototype our approach on Parquet and Spark to show its benefits.

– We report the results of our extensive evaluation with the TPC-H^§§§ benchmark.

The remainder of this article is organized as follows: In the Related Work section, we discuss the related work. In the Cost Model for Hybrid Layouts and Our Approach sections, we present the cost model and the architecture of our approach. In the Multi-Objective Optimization section, we discuss a multi-objective method to find the number of tasks and executors. In the Experimental Results section, we present our experimental results and finally, in the Conclusions section, we conclude the article.

Related Work

Estimating number of tasks

There are research works^14,15 for Hadoop that estimate the number of mapper and reducer tasks. In Nghiem and Figueira,¹⁴ the elbow curve technique is used to find the trade-off between the number of tasks and execution time. This helps to find the right number of tasks where execution time is minimized. Similarly, Verma et al.¹⁵ utilize a multi-objective approach for estimating the number of tasks by considering a deadline constraint. Neither approach considers the amount of data read while estimating the number of tasks. These works only estimate the tasks based on the available number of machines and some objectives (such as a deadline). As previously argued, the amount of data read is an important factor in deciding the number of tasks.

Resource provisioning in cloud

There have been extensive research works^8–11 by the cloud community on resource provisioning. There is also a survey¹⁶ on energy-efficient techniques for big data analytics, which are divided into five categories. One of them (i.e., energy-aware resource allocation) focuses on deciding the number of machines to execute a given query with the aim of saving energy. These works from both cloud computing and energy-efficient big data analytics focus on deciding the number of machines to process an application. They aim at saving energy and computational resources, which indirectly leads to cost savings. However, they make these decisions without considering the reading size. Our approach could help them to decide resource provisioning at a more granular level and, overall, to achieve their goals more efficiently.

Tuning configurable parameters

There are research works^17–19 to tune the configurable parameters of distributed processing frameworks. Ref. 19 proposes a trial-and-error approach that finds the optimal values for the configuration parameters of Spark. Similarly, Gounaris and Torres¹⁸ propose a methodology to profile the impact of different parameter pairs on benchmarking applications, by applying a graph algorithm to create complex candidate configurations. These configurations are checked in parallel and then the best performing one is chosen. In Davidson and Or,²⁰ the shuffle performance in Spark is improved by controlling the total number of shuffle files. This approach consolidates multiple shuffle files into one based on the available cores, which improves the execution time of the shuffle phase. Ref. 17 profiles the bottlenecks (i.e., JVM, GC, serialization, etc.) of TPC-H queries, and parameters are manually configured to avoid the bottlenecks. This significantly increases the query performance.

Baldacci and Golfarelli²¹ have proposed a cost model for Spark, which helps to estimate the cost of different query plans and decide the best one. Nevertheless, they assume that the number of tasks and executors are fixed. This work is complementary to ours and would optimize the overall query plan, once data are read from the disk and available for the first task. As discussed earlier, these existing works do not explicitly consider the degree of parallelism. Their main aim is to fine tune a cluster of distributed processing frameworks or find an optimal query plan. Our approach can further help them to improve the query execution time, by configuring the degree of parallelism and computational resources.

Cost Model for Hybrid Layouts

In Munir et al.,¹² we did not consider configuring the number of tasks and machines, but we focused on choosing different storage layouts based on their reading and writing cost. Thus, we extend the cost model to consider new factors (e.g., $Used_{Executors}$, $P_{Size}$, etc.) and estimations to help in deciding the number of tasks and machines for a given query. In this section, we present the extended cost model for estimating the number of tasks and executors. It should be noted that the number of tasks depends on the partition size (also known as input split).

Parameters of the cost model

Our cost model for hybrid layouts relies on a wide range of statistical information that is summarized in Table 1, containing system constants, data statistics, workload statistics, and hybrid layout variables. We assume that the constants that depend on the configuration of the environment (e.g., $BW_{Disk}$) are provided. Further, we discuss the collection of statistics (i.e., dataset and workload) in the Our Approach section.

Table 1.

Parameters of the cost model

Variable	Description
System constants
$Used_{Executors}$	Number of executors for processing
$Chunk_{Size}$	Block size in HDFS
$P_{Size}$	Size of partition to control the number of tasks
$BW_{Disk}$	Disk bandwidth
$BW_{Net}$	Network bandwidth
Data statistics
$|T|$	Number of rows in a table
$ColValue_{Size}$ ^a	Average size of a column value
$\#Cols$	Total columns of T
$Sorted_{Col}$	True for sorted and False for unsorted column
Workload statistics
$Ref_{Cols}$	Number of columns used in an operation
$SF$	Selectivity factor of an operation
Hybrid layout variables
$RG_{Size}$	RG size
$Marker_{Size}$	Size of sync marker
$Meta_{Cols_{Size}}$	Size of min-max statistics of columns for an RG
$Body_{Size}$	Size of the body
$Header_{Size}$	Size of the header
$Footer_{Size}$	Size of the footer
$Used_{RGs}$	Number of RGs in the file
$|RG|$	Number of rows of an RG

^a Extra 4 bytes are considered for variable length columns.

RG, row group.

$Used_{RGs} = \frac{ColValue_{Size} \cdot |T| \cdot \#Cols}{RG_{Size} - (Marker_{Size} \cdot \#Cols)}$ (1)

$|RG| = \frac{|T|}{Used_{RGs}}$ (2)

$Body_{Size} = ((ColValue_{Size} \cdot |RG| + Marker_{Size}) \cdot \#Cols) \cdot Used_{RGs}$ (3)

$Meta_{Size} = (Meta_{Cols_{Size}} \cdot \#Cols) \cdot Used_{RGs}$ (4)

$Total_{Size} = Header_{Size} + Body_{Size} + Meta_{Size}$ (5)
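
Equations (1) to (5) can be evaluated directly once the statistics of Table 1 are known. The following is a minimal sketch rather than the prototype's code: the function and field names are illustrative, the table dimensions in the usage lines are hypothetical, and 128 MB is assumed to mean 128 × 1024² bytes.

```python
from dataclasses import dataclass


@dataclass
class LayoutEstimate:
    used_rgs: float     # Eq. (1): number of RGs in the file
    rows_per_rg: float  # Eq. (2): |RG|
    body_size: float    # Eq. (3): bytes of column data plus sync markers
    meta_size: float    # Eq. (4): bytes of min-max statistics
    total_size: float   # Eq. (5): header + body + metadata


def estimate_layout(num_rows, num_cols, col_value_size, rg_size,
                    marker_size, meta_cols_size, header_size):
    """Evaluate Eqs. (1)-(5); all sizes are in bytes."""
    used_rgs = (col_value_size * num_rows * num_cols) / (
        rg_size - marker_size * num_cols)
    rows_per_rg = num_rows / used_rgs
    body_size = ((col_value_size * rows_per_rg + marker_size)
                 * num_cols) * used_rgs
    meta_size = (meta_cols_size * num_cols) * used_rgs
    total_size = header_size + body_size + meta_size
    return LayoutEstimate(used_rgs, rows_per_rg, body_size,
                          meta_size, total_size)


# Hypothetical table: 10 million rows, 100 columns of 8 bytes each.
# Marker, metadata, header, and RG sizes follow Table 2.
est = estimate_layout(num_rows=10_000_000, num_cols=100, col_value_size=8,
                      rg_size=128 * 1024 ** 2, marker_size=16,
                      meta_cols_size=156, header_size=4)
print(est)
```

Note that, as written, Equation (1) yields a fractional number of RGs; the sketch keeps it fractional instead of rounding.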

Physical format of hybrid layouts

As shown in Figure 1, hybrid layouts divide the data into multiple RGs [estimated by using Eq. (1)], and each RG contains a subset of rows [estimated by using Eq. (2)]. In each RG, hybrid layouts store data column-wise and its size can be estimated by using Equation (3). Moreover, hybrid layouts also store metadata (e.g., min-max statistics) for each RG inside either the header or footer section, which can be estimated by using Equation (4). The sizes of the actual data and metadata are further used in Equation (5) to estimate the total size of the file.

$Used_{Tasks} = \left\lceil \frac{Body_{Size}}{P_{Size}} \right\rceil$ (6)

$Used_{Waves} = \left\lceil \frac{Used_{Tasks}}{Used_{Executors}} \right\rceil$ (7)

$LastWave_{Executors} = ((Used_{Tasks} - 1) \bmod Used_{Executors}) + 1$ (8)

$\#RGs_{Partition} = \left\lceil \frac{P_{Size}}{RG_{Size}} \right\rceil$ (9)

Estimating number of tasks

Modern distributed processing frameworks decide the number of tasks based on the total file size (which is the size of the actual data without metadata) and the partition size [estimated by using Eq. (6)]. Moreover, the degree of parallelism depends on the number of executors. All tasks cannot be executed at once, if the number of executors is less than the total number of tasks. Thus, we need multiple rounds/waves to finish the query [estimated by using Eq. (7)]. Further, we can calculate the number of executors active in the last wave by using Equation (8). In addition, each partition contains one or more RGs, which can be estimated by using Equation (9).
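
As an illustration of Equations (6) to (9), the short sketch below derives the number of tasks, waves, executors active in the last wave, and RGs per partition. The function name and the input values are hypothetical; sizes are assumed to be in bytes.

```python
import math


def plan_tasks(body_size, p_size, rg_size, used_executors):
    """Evaluate Eqs. (6)-(9) for a given partition size and executor count."""
    used_tasks = math.ceil(body_size / p_size)                     # Eq. (6)
    used_waves = math.ceil(used_tasks / used_executors)            # Eq. (7)
    last_wave_executors = (used_tasks - 1) % used_executors + 1    # Eq. (8)
    rgs_per_partition = math.ceil(p_size / rg_size)                # Eq. (9)
    return used_tasks, used_waves, last_wave_executors, rgs_per_partition


# Hypothetical input: a 32 GB body, 128 MB partitions and RGs, 6 executors.
MB = 1024 ** 2
print(plan_tasks(32 * 1024 * MB, 128 * MB, 128 * MB, 6))
```

With these inputs, 256 tasks run in 43 waves, and only 4 of the 6 executors are busy in the last wave.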

Estimating makespan

In this article, we focus on read-only analytical queries. We estimate the amount of data read by their first operation and, based on that, we try to find the best partition size to control the number of tasks. Given the simplicity of a file system (far from that of a DBMS), only three operations need to be considered: scan, projection, and selection. These three operations can be generalized to selection sorted and selection unsorted, because scan and projection operations are just the extreme cases of selection unsorted with a selectivity factor (SF) of 1 (i.e., they read all RGs).

$RefCols_{Size} = (ColValue_{Size} \cdot |RG| + Marker_{Size}) \cdot Ref_{Cols}$ (10)

$Read_{RGs} = \begin{cases} SF \cdot Used_{RGs} + 1 & \text{selection sorted} \\ (1 - (1 - SF)^{|RG|}) \cdot Used_{RGs} & \text{selection unsorted} \end{cases}$ (11)
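
Equations (10) and (11) translate into two small functions. This is a sketch with illustrative names and made-up inputs; it follows the formulas literally, so the sorted case is not clamped to the total number of RGs.

```python
def ref_cols_size(col_value_size, rows_per_rg, marker_size, ref_cols):
    """Eq. (10): bytes read per RG for the referred columns."""
    return (col_value_size * rows_per_rg + marker_size) * ref_cols


def read_rgs(sf, used_rgs, rows_per_rg, sorted_col):
    """Eq. (11): expected number of RGs that have to be read."""
    if sorted_col:
        # Matching rows are contiguous; one extra RG covers misalignment.
        return sf * used_rgs + 1
    # Unsorted: an RG is read if at least one of its rows matches.
    return (1 - (1 - sf) ** rows_per_rg) * used_rgs


# Hypothetical file with 256 RGs of 100,000 rows each.
print(read_rgs(0.2, 256, 100_000, sorted_col=True))    # roughly a fifth of the RGs
print(read_rgs(1e-6, 256, 100_000, sorted_col=False))  # roughly a tenth of the RGs
print(read_rgs(1.0, 256, 100_000, sorted_col=False))   # scan/projection: all RGs
```

The unsorted case shows why even a very small SF can touch a noticeable share of the RGs when each RG holds many rows.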

Data read estimation

As mentioned earlier, hybrid layouts help to read only the referred columns and their size can be estimated by using Equation (10). In addition, they use the available metadata (e.g., min-max statistics) to filter some RGs. If selection is applied on sorted data, the average number of read RGs can be calculated directly based on the SF, as shown in the sorted case of Equation (11) (we add one to handle the effect of position variation inside the RGs, because hybrid layouts read the whole RG even if there is only one matching row⁵). However, for selection of unsorted data, the expected number of read RGs can be estimated by using the unsorted case of Equation (11) (borrowed from bitmap indexes²²).

$Full_{Partitions} = \begin{cases} Used_{Tasks} - 1 & \text{selection unsorted} \\ \left\lceil \frac{Read_{RGs}}{\#RGs_{Partition}} \right\rceil & \text{selection sorted} \end{cases}$ (12)

$Partial_{Partitions} = \begin{cases} 0 & \text{selection unsorted} \\ 2 & \text{selection sorted} \end{cases}$ (13)

$Last_{Partition} = \begin{cases} 1 & \text{selection unsorted} \\ 0 & \text{selection sorted} \end{cases}$ (14)

$Empty_{Partitions} = Used_{Tasks} - Full_{Partitions} - Partial_{Partitions} - Last_{Partition}$ (15)

Types of partitions

Distributed processing frameworks process data by dividing them into multiple partitions, where each partition is processed in a separate task. For selection unsorted, every task processes a full partition except the last task, whose partition might not be completely full, as shown in Figure 3a. Equations (12) and (14) indicate the number of full and last partitions. Thus, for unsorted data, any partition has the same probability of containing data. However, selection sorted guarantees that we read full partitions, except for, potentially, the first (from where selection starts) and last one (where selection ends), because requested data will not start just at the beginning and finish just at the end of a partition. To reflect this, we always have two partial partitions [Eq. (13)] and the number of full partitions depends on the number of RGs to be read [Eq. (12)]. Importantly, note that all other partitions will, nevertheless, read their metadata to determine that no data match the predicate [Eq. (15)]. Figure 3b exemplifies these partitions. It should be noted that the number of partitions and the number of RGs inside each partition are important factors for deciding the correct number of tasks and have direct impact on makespan estimation.
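
The partition counts of Equations (12) to (15) can be sketched as follows. The function name and the example inputs are illustrative; the sketch applies the formulas as stated and does not guard against a negative number of empty partitions when almost all RGs are read.

```python
import math


def classify_partitions(used_tasks, read_rgs, rgs_per_partition, sorted_col):
    """Return (full, partial, last, empty) partition counts, Eqs. (12)-(15)."""
    if sorted_col:
        full = math.ceil(read_rgs / rgs_per_partition)  # Eq. (12), sorted
        partial = 2                                     # Eq. (13): first and last hit
        last = 0                                        # Eq. (14)
    else:
        full = used_tasks - 1                           # Eq. (12), unsorted
        partial = 0                                     # Eq. (13)
        last = 1                                        # Eq. (14): trailing partition
    empty = used_tasks - full - partial - last          # Eq. (15)
    return full, partial, last, empty


# Hypothetical query over 10 partitions of one RG each.
print(classify_partitions(10, read_rgs=3, rgs_per_partition=1, sorted_col=True))
print(classify_partitions(10, read_rgs=3, rgs_per_partition=1, sorted_col=False))
```

In the sorted case, half of the ten tasks only read metadata, which is the over-provisioning the approach tries to avoid.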

FIG. 3.

Type of partitions in selection sorted and unsorted. (a) Selection unsorted. (b) Selection sorted.

Cost estimation

The total cost of a task depends on four factors: initialization cost, I/O cost, CPU cost, and networking cost. The initialization cost is constant and can be determined according to the execution environment. The I/O cost depends on the amount of data read within a task and the disk bandwidth. We do not consider CPU cost due to its negligible impact compared with I/O cost (existing works^4,5 already proved that this is enough to capture the execution trend). Finally, we do not need any shuffling,⁴ because we focus only on the first operation loading data and therefore, the networking cost for shuffling is considered to be zero.

However, there might be some cases in which the partition size goes beyond the chunk size, which may require some chunks to be transferred over the network. There are two solutions to handle this scenario. The first one is to define a maximum limit on the partition size and always keep it less than the chunk size. The second is to use an existing approach,²³ which transfers data in advance to avoid idle cycles on the processing machines. The approach to be used should be chosen based on the business requirements. Our approach would work fine for cloud storage (e.g., Amazon S3), as long as it accesses the whole file together as an object (not in partitions). Thus, distributed processing frameworks can create a partition without worrying about going beyond the chunk size and data locality.

$Cost_{Metadata} = \frac{Meta_{Size}}{BW_{Disk}} + \frac{Meta_{Size}}{BW_{Net}} \cdot (Used_{Executors} - 1)$ (16)

There is still a networking cost for metadata [Eq. (16)], because current solutions require the metadata to be transferred sequentially to all other executors before they start processing the data. Typically, the metadata are read and transferred by the master or driver executor.

$Cost_{Full_{Partition}} = Cost_{Init} + \frac{Meta_{Size} + RefCols_{Size} \cdot \#RGs_{Partition} \cdot (1 - (1 - SF)^{|RG|})}{BW_{Disk}}$ (17)

$Odd_{Data} = \frac{RefCols_{Size} \cdot (Full_{Partitions} \cdot \#RGs_{Partition} - Read_{RGs})}{Partial_{Partitions}}$ (18)

$Cost_{Partial_{Partition}} = Cost_{Init} + \frac{Meta_{Size} + Odd_{Data}}{BW_{Disk}}$ (19)

$Residual_{Data} = RefCols_{Size} \cdot (Used_{RGs} - \#RGs_{Partition} \cdot Full_{Partitions}) \cdot (1 - (1 - SF)^{|RG|})$ (20)

$Cost_{Last_{Partition}} = Cost_{Init} + \frac{Meta_{Size} + Residual_{Data}}{BW_{Disk}}$ (21)

$Cost_{Empty_{Partition}} = Cost_{Init} + \frac{Meta_{Size}}{BW_{Disk}}$ (22)

Each partition has an initialization cost, which is a constant, and an I/O cost (which depends on metadata and the amount of data read inside the partition). As shown in Figure 3, full partitions read all matched RGs inside a partition, and their cost can be estimated by using Equation (17). Equation (18) estimates the data read from partial partitions, and Equation (19) determines their cost. Equation (20) estimates the data left in the last partition, and Equation (21) gives its cost. The other partitions just read metadata, and their cost is given in Equation (22).

$Cost_{AllTasks} = Full_{Partitions} \cdot Cost_{Full_{Partition}} + Empty_{Partitions} \cdot Cost_{Empty_{Partition}} + Partial_{Partitions} \cdot Cost_{Partial_{Partition}}$ (23)

$AvgCost_{Task} = \frac{Cost_{AllTasks}}{Used_{Tasks} - Last_{Partition}}$ (24)

These costs of all partitions help to estimate the total cost of all tasks using Equation (23), which is used in Equation (24) to estimate the average cost of a task. It should be noted that the cost of the last partition is only applied for selection unsorted and it is considered separately when estimating the total makespan. Thus, we do not consider its cost here.
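
The cost terms of Equations (16) to (24) can be put together as in the sketch below. All names and input values are illustrative. One point is an assumption: Equation (20) is read here as $RefCols_{Size} \cdot (Used_{RGs} - \#RGs_{Partition} \cdot Full_{Partitions}) \cdot (1 - (1 - SF)^{|RG|})$, that is, the RGs not covered by full partitions, weighted by the same match probability used in Equation (17).

```python
def metadata_cost(meta_size, bw_disk, bw_net, used_executors):
    """Eq. (16): read the metadata once, then send it to the other executors."""
    return meta_size / bw_disk + (meta_size / bw_net) * (used_executors - 1)


def partition_costs(cost_init, meta_size, ref_cols_size, rgs_per_partition,
                    used_rgs, read_rgs, full, partial, sf, rows_per_rg, bw_disk):
    """Eqs. (17)-(22): cost in seconds of each partition type."""
    hit = 1 - (1 - sf) ** rows_per_rg  # chance that an RG holds a matching row
    costs = {
        "full": cost_init + (meta_size
                             + ref_cols_size * rgs_per_partition * hit) / bw_disk,  # Eq. (17)
        "empty": cost_init + meta_size / bw_disk,                                   # Eq. (22)
    }
    if partial:  # selection sorted
        odd = ref_cols_size * (full * rgs_per_partition - read_rgs) / partial       # Eq. (18)
        costs["partial"] = cost_init + (meta_size + odd) / bw_disk                  # Eq. (19)
    else:        # selection unsorted has no partial partitions
        costs["partial"] = 0.0
    # Eq. (20), with the grouping ASSUMED in the lead-in above.
    residual = ref_cols_size * (used_rgs - rgs_per_partition * full) * hit
    costs["last"] = cost_init + (meta_size + residual) / bw_disk                    # Eq. (21)
    return costs


def average_task_cost(costs, full, partial, empty, last, used_tasks):
    """Eqs. (23)-(24): total and average task cost (last partition excluded)."""
    cost_all = (full * costs["full"] + empty * costs["empty"]
                + partial * costs["partial"])
    return cost_all, cost_all / (used_tasks - last)


# Toy numbers chosen for easy checking: a scan (SF = 1) over 7 RGs,
# 4 partitions of 2 RGs, 10 bytes per RG, 1 byte/s disk, no metadata.
costs = partition_costs(cost_init=1.0, meta_size=0.0, ref_cols_size=10.0,
                        rgs_per_partition=2, used_rgs=7, read_rgs=7,
                        full=3, partial=0, sf=1.0, rows_per_rg=1, bw_disk=1.0)
print(costs)
print(average_task_cost(costs, full=3, partial=0, empty=0, last=1, used_tasks=4))
```

For selection sorted, the `"last"` entry is still computed but is not used, because $Last_{Partition}$ is zero in that case.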

Estimating makespan

As discussed earlier, each task processes a different amount of data and thus, some tasks can finish earlier than others. Likewise, each executor can finish its assigned tasks at a different time. Thus, we should estimate makespan based on the executor that is processing the largest stack of tasks (e.g., in Fig. 4, Executor 0 and Executor 1 are the ones with the largest stack). This can be done by estimating the standard deviation among tasks and using it further for estimating the overall makespan of an operation.

FIG. 4.

Execution of tasks.

$Used'_{RGs} = \#RGs_{Partition} \cdot Full_{Partitions}$ (25)

For standard deviation, first we need to estimate the number of RGs inside full partitions, using Equation (25). It is further used in Equation (26) to estimate the actual read RGs based on the SF.

Finally, we use the hypergeometric distribution²⁴ for selection unsorted to estimate the standard deviation of a full partition in Equation (27), based on the read RGs. The hypergeometric distribution estimates the standard deviation of choosing a subset of items without replacement from the total available items. This is similar to our case, where we are also trying to select RGs (i.e., $Read'_{RGs}$) from the total RGs (i.e., $Used'_{RGs}$). Similarly, we also estimate the standard deviation in Equation (27) for selection sorted.

$MakeSpan = \begin{cases} Cost_{Full_{Partition}} \cdot (Used_{Waves} - 1) + Cost_{Last_{Partition}} + Cost_{Metadata} & \text{when } LastWave_{Executors} = 1 \\ (Used_{Waves} \cdot AvgCost_{Task}) + Cost_{Metadata} + Stdev \cdot \sqrt{Used_{Waves} \cdot 2 \cdot \log_{e}(LastWave_{Executors})} & \text{when } LastWave_{Executors} > 1 \end{cases}$ (28)

Finally, we estimate makespan for an operation by using Equation (28). There are two scenarios based on the number of executors active in the last wave. In the first scenario, there is only one executor in the largest stack. In this case, the last task is processing $Last_{Partition}$. Then, we do not need to take any standard deviation, because there is one single largest stack. Thus, we just add the average duration of all tasks in that stack.

In the second scenario, the makespan depends on metadata transfer, the average cost of a task, the number of executors running in the last wave, and their standard deviation. Thus, we need to estimate the expected maximum²⁵ of those by using the standard deviation as presented in Equation (28), which accounts for the standard deviation of the addition of tasks (i.e., $\sqrt{Used_{Waves}}$), as well as the maximum among executors in the last wave [i.e., $\sqrt{2 \cdot \log_{e}(LastWave_{Executors})}$].
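
Equation (28) can be sketched as a single function. The standard deviation is taken as an input here, since its estimation [Eq. (27)] is a separate step; the function name and the example values are illustrative.

```python
import math


def makespan(used_waves, last_wave_executors, cost_full_partition,
             cost_last_partition, avg_cost_task, cost_metadata, stdev):
    """Eq. (28): estimated makespan of the first (pushdown) operation."""
    if last_wave_executors == 1:
        # A single executor holds the largest stack and ends with the last partition.
        return (cost_full_partition * (used_waves - 1)
                + cost_last_partition + cost_metadata)
    # Several executors tie for the largest stack: take the expected maximum.
    spread = stdev * math.sqrt(used_waves * 2 * math.log(last_wave_executors))
    return used_waves * avg_cost_task + cost_metadata + spread


# Hypothetical costs in seconds.
print(makespan(3, 1, 21.0, 11.0, 0.0, 50.0, 0.0))  # first scenario
print(makespan(2, 4, 0.0, 0.0, 10.0, 5.0, 1.0))    # second scenario
```

With a standard deviation of zero, the second scenario reduces to the number of waves times the average task cost plus the metadata cost.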

Our Approach

In this section, we discuss our approach in detail. Figure 5 shows its architecture, which does not require any change in a distributed processing framework (i.e., it is fully transparent for users). The main function blocks of our architecture are the following ones:

FIG. 5.

Architecture of our approach.

Query parser

The query parser takes a query as input and uses an existing parser (i.e., SparkSQL parser****) to validate its syntax. After validation, it generates the physical plan of the query as an XML and forwards it to the next module. The physical plan represents a tree that starts from input sources to the final output. It also highlights the operations, which can be pushed down to the storage layer.

Query profiling

The query profiling module takes the physical plan as an input and extracts pushdown operations from the plan. Hybrid layouts can only push down two operations: projection and selection. It is easy to extract the referred columns from the physical plan. However, for selection, it is not possible to extract the SF from the physical plan. To extract the SF, the query log needs to be parsed to analyze the previous executions of the same query. Finally, this module passes the pushdown operations along with the required statistical information of operations to the cost model.

Data profiling

The data profiling module takes a sample of data and computes the statistical information listed in Table 1. We rely on an existing approach, namely the single-column profiling technique from Abedjan et al.²⁶

Cost model

The cost model is used to estimate the reading size for a given query. Typically, a query can have many operations that are linked together as a Directed Acyclic Graph. The operations are ordered based on their possibility of pushdown to the storage layer. Hence, the first operation is always a pushdown operation, which reads directly from the disk and impacts parallelism. The subsequent operations take processed data from the first operation, which modern processing frameworks (e.g., Spark) always keep in memory.

The cost model uses a pushdown operation, workload, data statistics, and cluster configuration as inputs, which are used to estimate the makespan for a given partition size and the number of executors as presented in the Cost Model for Hybrid Layouts section. Our goal is to find the best partition size and the number of executors, which can be done by using a multi-objective optimization method described in the next section.

Multi-Objective Optimization

In this article, we focus on optimizing two objectives: the makespan of the query and the resource usage (i.e., number of executors) required to run it. We would like to minimize both together. However, they are mutually contradicting, that is, if we want to reduce makespan, we require more computational resources, and if we want to save computational resources, we have to compromise on makespan. Thus, we need to find a trade-off between them that satisfies user requirements and constraints.

The first objective function [i.e., $MakeSpan(Operation_{Type}, P_{Size}, Used_{Executors})$] is based on the makespan estimation according to Equation (28) (as defined in the Cost Model for Hybrid Layouts section) for a given operation type, partition size, and the number of executors. Similarly, the second objective function [i.e., $Resource_{Usage}(P_{Size}) = Cost_{AllTasks}$], as defined in Equation (23), estimates the resource usage, which increases with the number of tasks.

$P_{Size} \geq RG_{Size} \text{ and } P_{Size} \leq \frac{Total_{Size}}{Used_{Executors}}$ (29)

$P_{Size} \leq Executor_{Memory_{Size}}$ (30)

$Used_{Executors} \leq Max_{Executors}$ (31)

To avoid unfavorable or even impossible configurations, we need to add three constraints. First, Equation (29) guarantees that the partition size is always greater than or equal to the RG size and, at the same time, that we have enough partitions to utilize all assigned executors. Second, Equation (30) keeps the partition size within the memory size of an executor. Finally, Equation (31) enforces the maximum number of executors.
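
The constraints in Equations (29) to (31) amount to a simple feasibility check on a candidate configuration. The sketch below uses illustrative names and hypothetical cluster values.

```python
def is_feasible(p_size, used_executors, rg_size, total_size,
                executor_memory_size, max_executors):
    """Check a candidate (P_Size, Used_Executors) against Eqs. (29)-(31)."""
    return (rg_size <= p_size <= total_size / used_executors   # Eq. (29)
            and p_size <= executor_memory_size                 # Eq. (30)
            and used_executors <= max_executors)               # Eq. (31)


# Hypothetical setting: 32 GB file, 128 MB RGs, 16 GB executor memory,
# at most 6 executors.
MB, GB = 1024 ** 2, 1024 ** 3
print(is_feasible(128 * MB, 4, 128 * MB, 32 * GB, 16 * GB, 6))  # feasible
print(is_feasible(64 * MB, 4, 128 * MB, 32 * GB, 16 * GB, 6))   # smaller than an RG
print(is_feasible(16 * GB, 4, 128 * MB, 32 * GB, 16 * GB, 6))   # too few partitions
```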

Typically, there is no single optimum in a multi-objective optimization problem, but a Pareto front that contains many potentially optimal solutions depending on user prioritization of one objective or another (as shown in Fig. 6a). Thus, the user has to choose one configuration from the Pareto front to, in the end, execute the query at hand. Our framework^†††† facilitates the user choice by reducing the many possible configurations to very few (belonging or close to the Pareto front), thus helping her to select one according to her preferences. As shown in Figure 6b, the position in the solution space does not determine the position in the configuration space, which hinders the user's choice. In this case, our framework leaves only 2 (out of 35 possible solutions) that satisfy both objectives according to our estimations. When the user selects one of those two, the framework submits the query seamlessly to a processing engine by configuring the partition size and number of executors accordingly.
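
How the selected configuration is passed to the engine is not detailed here. One plausible way on Spark, shown purely as an assumption, is to set the standard properties `spark.sql.files.maxPartitionBytes` (upper bound on the bytes packed into one file-based partition) and `spark.executor.instances` when submitting the job. The helper name and the application file `query_job.py` are hypothetical.

```python
def spark_submit_args(p_size_bytes, used_executors, app="query_job.py"):
    """Build a spark-submit command line for a chosen configuration.

    ASSUMPTION: the partition size maps to spark.sql.files.maxPartitionBytes
    and the executor count to spark.executor.instances; the source does not
    state which properties the prototype sets.
    """
    return [
        "spark-submit",
        "--conf", f"spark.sql.files.maxPartitionBytes={int(p_size_bytes)}",
        "--conf", f"spark.executor.instances={int(used_executors)}",
        app,  # hypothetical application that runs the user query
    ]


# Example: 256 MB partitions on 4 executors.
print(" ".join(spark_submit_args(256 * 1024 ** 2, 4)))
```

The list form can be handed to `subprocess.run` on a machine where `spark-submit` is available.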

FIG. 6.

Pareto front for a selection (circle size represents resource usage, the bigger the more resources; and color represents makespan, red for high and green for low). (a) Makespan versus resource usage for sorted selection (SF: 0.2). (b) Pareto front for sorted selection (SF: 0.2). SF, selectivity factor.

In this article, we do not focus on proposing a new multi-objective method; rather, we focus on finding the best possible configuration (i.e., number of tasks and executors) for a given query. Thus, we use an existing multi-objective optimization approach, namely NSGA-II,¹³ which implements a genetic algorithm. It simply takes the objective functions along with the constraints as input, and it produces the Pareto front as an output.
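
NSGA-II itself does not fit in a short example. When the configuration space is small enough to enumerate (such as the 35 candidates mentioned earlier), the Pareto front can also be found with a brute-force dominance check. The sketch below is that simpler stand-in, not NSGA-II, and its candidate values are made up for illustration.

```python
def pareto_front(candidates):
    """Return the non-dominated candidates.

    Each candidate is a tuple (config, makespan, resource_usage);
    both objectives are minimized.
    """
    front = []
    for cfg, m, r in candidates:
        dominated = any(
            m2 <= m and r2 <= r and (m2 < m or r2 < r)
            for _, m2, r2 in candidates
        )
        if not dominated:
            front.append((cfg, m, r))
    return front


# Hypothetical (partition size, executors) configurations with estimated
# makespan in seconds and resource usage in task-seconds.
candidates = [
    (("128MB", 6), 40.0, 300.0),
    (("256MB", 4), 45.0, 220.0),
    (("512MB", 2), 70.0, 200.0),
    (("128MB", 2), 90.0, 310.0),  # worse on both objectives
]
for config, m, r in pareto_front(candidates):
    print(config, m, r)
```

The first three configurations trade makespan against resource usage and remain on the front; the last one is dominated and is dropped.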

Experimental Results

In this section, we discuss the setup and dataset used in our experiments. We also provide the results that validate the accuracy of the cost model and show the benefits of our approach.

Setup

We perform experiments on a five-machine cluster. Each machine has a Xeon E5-2630L v2 @2.40GHz CPU, 128 GB of main memory, and a 1 TB SATA-3 hard disk, and it runs Hadoop 2.6.2 and Spark 2.1.10 on Ubuntu 14.04 (64 bit). In the cluster, we dedicated one machine to the HDFS name node and Spark master node together, and the remaining machines to data nodes for Hadoop and executors for Spark. We prototyped our approach for Apache Parquet 1.8.2. Table 2 shows the values of all environmental variables in our testbed. We also set the replication factor equal to the number of machines to have replicas on every machine, thus avoiding chunk transfers when the partition size is greater than the chunk size.

Table 2.

Values according to our environment

Variable	Value
$Used_{Executors}$	2, 3, 4, 5, and 6
$Chunk_{Size}$	128 MB
$BW_{Disk}$	$1.3 \times 10^{8}$ bytes/s
$BW_{Net}$	$1.25 \times 10^{8}$ bytes/s
$Cost_{Init}$	1 second
$RG_{Size}$	128 MB
$Marker_{Size}$	16 bytes
$Meta_{Cols_{Size}}$	156 bytes
$Header_{Size}$	4 bytes

We also instantiated our cost model, presented in the Cost Model for Hybrid Layouts section, for scan, projection, and selection (both sorted and unsorted). A scan is just an unsorted selection with SF 1 that refers to all the columns of the table. Similarly, a projection is an unsorted selection with SF 1 restricted to the referred columns. For a selection, we only need to provide the SF, and the model works for both the sorted and the unsorted case.
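The reduction of scan and projection to special cases of selection can be sketched as follows. This is only an illustration of the mapping described above, not the cost model itself; the `SelectionSpec` type and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class SelectionSpec:
    """Hypothetical input description for a selection-based cost estimate."""
    columns: FrozenSet[str]     # columns the operation refers to
    selectivity_factor: float   # SF, fraction of rows selected, in [0, 1]
    sorted_data: bool           # whether the data is sorted on the selection column


def selection(columns: Iterable[str], sf: float, sorted_data: bool) -> SelectionSpec:
    """General case: any set of referred columns, any SF, sorted or unsorted."""
    if not 0.0 <= sf <= 1.0:
        raise ValueError("selectivity factor must be between 0 and 1")
    return SelectionSpec(frozenset(columns), sf, sorted_data)


def projection(referred_columns: Iterable[str]) -> SelectionSpec:
    """Projection: an unsorted selection with SF 1 over the referred columns only."""
    return selection(referred_columns, 1.0, sorted_data=False)


def scan(all_columns: Iterable[str]) -> SelectionSpec:
    """Scan: an unsorted selection with SF 1 over every column of the table."""
    return selection(all_columns, 1.0, sorted_data=False)


table_columns = ["c1", "c2", "c3", "c4"]   # made-up column names
print(scan(table_columns))
print(projection(["c1", "c3"]))
print(selection(["c2"], 0.2, sorted_data=True))
```

Expressing all three operations through one description means a single estimation routine can serve every case, which is the point of the instantiation above.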

Results

As mentioned in Refs.^2,3, very wide tables are common in modern analytical systems because of their processing advantages compared with normalizing data into narrower tables. Nevertheless, to the best of our knowledge, there is no public benchmark available that consists of wide tables. Therefore, in this section, we first validate the accuracy of our cost model for makespan with a synthetic dataset consisting of a very wide table. Further, we present results that show the benefits of our approach in choosing the best configuration for queries over the TPC-H denormalized schema.

Cost model validation

We generated a synthetic dataset consisting of a very wide table with 1186 columns of different data types and 32 GB in size. We executed a scan, a projection with 10 referred columns, and a selection with 0.2 SF to compare the real executions with our estimations. Figure 7 shows that comparison (note that we normalized both the real and the estimated results as $\frac{x - min}{max - min}$ to facilitate visual comparison).

FIG. 7.

Validation of our estimation for makespan. (a) Scan operation with four executors, correlation: 0.98. (b) Scan operation with five executors, correlation: 0.92. (c) Scan operation with six executors, correlation: 0.89. (d) Projection (10 columns) with four executors, correlation: 0.99. (e) Projection (10 columns) with five executors, correlation: 0.99. (f) Projection (10 columns) with six executors, correlation: 0.99. (g) Selection sorted (SF: 0.2) with four executors, correlation: 0.97. (h) Selection sorted (SF: 0.2) with five executors, correlation: 0.89. (i) Selection sorted (SF: 0.2) with six executors, correlation: 0.92. (j) Selection unsorted (SF: 1.0E-06) with four executors, correlation: 0.99. (k) Selection unsorted (SF: 1.0E-06) with five executors, correlation: 0.99. (l) Selection unsorted (SF: 1.0E-06) with six executors, correlation: 0.97.

Figure 7a–c shows the results for a scan operation with different numbers of executors. Similarly, Figure 7d–f shows the results for a projection operation with different numbers of executors. Figure 7g–i shows the accuracy of our estimations in comparison with the real executions for a selection operation against sorted data, and Figure 7j–l shows the results for a selection operation against unsorted data. Observe that our estimations capture the trends of the real executions in almost all cases, and most predictions follow them closely. In Figure 7c, h, and i, the divergences from the real trends are due to the different units used in our estimation. However, the trends are predicted correctly and suffice to find the optimal partition size. The only exception is Figure 7b, where we estimated a lower cost for large partitions. Nevertheless, even in this case, our choice is still better than the default partition size.

We also confirm the accuracy of our estimations by using statistical correlation, which measures how strongly the cost model estimates are related to the real executions. As Figure 7 shows, our estimations are highly correlated with the real executions (overall Pearson correlation coefficient of 0.96).
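The two steps used in this validation, min-max normalization and the Pearson correlation coefficient, can be sketched in plain Python as follows. The makespan values are made up for illustration and are not taken from our experiments; the function names are likewise only illustrative.

```python
import math
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1] as (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; map them to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up makespans (seconds) for five partition sizes.
estimated = [120.0, 90.0, 70.0, 65.0, 80.0]
real = [130.0, 95.0, 72.0, 70.0, 78.0]

r_raw = pearson(estimated, real)
r_normalized = pearson(min_max_normalize(estimated), min_max_normalize(real))
print(r_raw, r_normalized)
```

Because min-max normalization is a positive linear rescaling, it leaves the Pearson coefficient unchanged; the normalization only affects how the curves look when plotted together.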

Performance evaluation

In TPC-H, the widest table has only 16 columns and in TPC-DS,^‡‡‡‡ only 26. Hence, we follow Sun et al.²⁷ to generate a wide table by completely denormalizing all other tables in TPC-H against lineitem. The FROM clauses in all queries are consequently changed to refer to the denormalized table. For a scale factor of 16 GB, this results in a denormalized table of 124 GB being generated for the evaluation. We have chosen six representative queries based on different projected attributes and SFs, as shown in Table 3. The table shows the intervals of SF and of the number of referred columns for each group of queries. The other queries follow access patterns similar to those selected.

Table 3.

Representative queries of TPC-H

Query	SF	Ref cols	Similar queries
Q1	[0.98, 0.98]	[7, 7]	—
Q3	[0.0026, 0.0056]	[5, 7]	Q8, Q12, Q17
Q10	[0.011, 0.031]	[4, 11]	Q4, Q5, Q6, Q7, Q11, Q14
Q16	[0.04, 0.08]	[2, 8]	Q2, Q13, Q15, Q18
Q20	[0.000025, 0.0007]	[5, 11]	Q19, Q21
Q22	[0.11, 0.2]	[2, 7]	Q9

As presented earlier, a multi-objective optimization problem has no single optimal solution but many best solutions, referred to as the Pareto front. The Pareto front of our estimation is denoted as $Pareto_{Estimated}$, and the Pareto front of the real execution is denoted as $Pareto_{Real}$. It could happen that $Pareto_{Estimated}$ misses some real Pareto solutions. These are referred to as $Pareto_{Missed}$. Further, we have the default set of solutions, obtained when the default partition size (i.e., 128 MB) is used, denoted as $Default_{Real}$. Finally, each solution has two metrics based on our objectives, namely, makespan and resource usage.

We compute the Euclidean distance between $Pareto_{Estimated}$ and $Pareto_{Real}$ (both the makespan and the resource usage components are normalized to mitigate differences in scale, resulting in a maximum distance of $\sqrt{2}$). To penalize the missed solutions, we also compute the distance between $Pareto_{Missed}$ and $Pareto_{Estimated}$. All these distances together are reported as Our Solution. Further, we compute the distance between $Default_{Real}$ and $Pareto_{Real}$, which is reported as Default Configurations. More precisely, the Euclidean distance is computed between each solution of one set and all the solutions of the other set, and each time the minimum distance is taken.
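The distance computation can be sketched as follows. This is a minimal illustration that assumes the points are already normalized to [0, 1] on both objectives, and that the minimum is taken from each point of the first set to the second set; the direction, the function name, and the sample points are assumptions made for the example.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (makespan, resource usage), both normalized to [0, 1]


def nearest_distances(sources: List[Point], targets: List[Point]) -> List[float]:
    """For each source point, the Euclidean distance to its closest target point."""
    return [
        min(math.hypot(s[0] - t[0], s[1] - t[1]) for t in targets)
        for s in sources
    ]


# Made-up normalized solutions.
pareto_real = [(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)]
pareto_estimated = [(0.0, 1.0), (0.5, 0.5)]
pareto_missed = [(1.0, 0.0)]    # real Pareto solution absent from the estimate
default_real = [(0.9, 0.9)]     # solution obtained with the default partition size

# Distances reported for the recommended configurations, including the
# penalty for real Pareto solutions that the estimation missed.
our_solution = (nearest_distances(pareto_estimated, pareto_real)
                + nearest_distances(pareto_missed, pareto_estimated))

# Distances reported for the default configurations.
default_configurations = nearest_distances(default_real, pareto_real)

print(our_solution)
print(default_configurations)
```

With both components in [0, 1], the largest possible distance is the diagonal of the unit square, which is where the bound of $\sqrt{2}$ comes from.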

In Figure 8, we show the boxplots of the distances corresponding to Our Solution compared with those corresponding to Default Configurations. We show the results for the representative TPC-H queries (chosen based on their referred columns and SFs). Observe that the boxplots for Our Solution are smaller and closer to zero distance, which means that the proposed solutions (i.e., $Pareto_{Estimated}$) are much closer to the real Pareto front (i.e., $Pareto_{Real}$) than the default configurations (i.e., $Default_{Real}$). In summary, on average we are as close as 5.6% to the real Pareto front, whereas the default configurations are much further from it (58.2% on average).

FIG. 8.

Comparison between our configurations and default ones for TPC-H. (a) Q1. (b) Q3. (c) Q10. (d) Q16. (e) Q20. (f) Q22.

We also compare the query execution time (i.e., makespan) of our approach with that of the default configuration (e.g., default partition size of 128 MB). As mentioned earlier, we have multiple solutions for a query, and we took the minimum makespan among these solutions for comparison. Similarly, we have multiple default configurations, and we took the average of their makespans. Figure 9a compares the makespans of the TPC-H queries and highlights the advantage of our approach over the default solutions. Likewise, Figure 9b presents the speedup gain, which is between 1.8× and 2.5×. On average, our approach provides a 2.1× speedup over the default configuration.
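The speedup computation described above can be sketched as follows: the best (minimum) makespan among the recommended configurations is compared against the average makespan of the default configurations. The function name and the makespan values are made up for illustration.

```python
from statistics import mean
from typing import Sequence


def speedup(recommended_makespans: Sequence[float],
            default_makespans: Sequence[float]) -> float:
    """Average default makespan divided by the best recommended makespan."""
    best_recommended = min(recommended_makespans)
    average_default = mean(default_makespans)
    return average_default / best_recommended


# Made-up makespans in seconds for one query.
recommended = [50.0, 60.0]            # one value per recommended configuration
default = [100.0, 110.0, 120.0]       # one value per default configuration

print(speedup(recommended, default))
```

Note that this comparison is asymmetric by construction (a minimum against an average), so it reflects the case where the user picks the fastest of the recommended configurations.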

FIG. 9.

Speedup gain for TPC-H queries. (a) Improvement in makespan. (b) Speedup gain over default configuration.

Conclusions

Hybrid layouts are widely used to store processed data in highly distributed Big Data systems to perform ad hoc analysis. These Big Data systems process data on a computer cluster by creating multiple tasks. Typically, they create tasks based on the total size of the table rather than on the reading size of the query. Moreover, always using the default configuration has a heavy impact on performance. Thus, we proposed a cost-based framework that uses a multi-objective approach to decide the number of tasks and executors for a given query based on the reading size. We prototyped our approach for Apache Parquet, evaluated it on TPC-H queries, and showed the improvement it provides.

Footnotes

Author Disclosure Statement

No competing financial interests exist.

Funding Information

This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate “Information Technologies for Business Intelligence–Doctoral College” (IT4BI-DC).

Abbreviations Used

References

1. Shvachko KV. HDFS scalability: The limits to growth. Login, 2010; 35:6–16.

2. Bian, Yan, Tao, et al. Wide table layout optimization based on column ordering and duplication. In: Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL: ACM, 2017. pp. 299–314.

3. […], Patel. WideTable: An accelerator for analytical data processing. PVLDB, 2014; 7:907–918.

4. Bian, Tao, Jin, et al. Rainbow: Adaptive layout optimization for wide tables. In: 2018 IEEE 34th International Conference on Data Engineering. Paris, 2018. pp. 1657–1660.

5. Munir, Abelló, Romero, et al. ATUN-HL: Auto tuning of hybrid layouts using workload and data characteristics. In: ADBIS'2018. Cham, Springer, 2018. pp. 200–215.

6. […], Lee, Huai, et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: Proceedings of the 27th International Conference on Data Engineering, Hannover, Germany: IEEE, 2011. pp. 1199–1208.

7. Ivanov, Pergolesi. The impact of columnar file formats on SQL-on-Hadoop engine performance: A study on ORC and Parquet. Concurr Comput Pract Expert. 2020; 32:e5523.

8. Herodotou, Dong, Babu. No one (cluster) size fits all: Automatic cluster sizing for data-intensive analytics. In: Proceedings of the 2nd ACM Symposium on Cloud Computing, Cascais, Portugal: ACM, 2011. p. 18.

9. Islam, Karunasekera, Buyya. dSpark: Deadline-based resource allocation for big data applications in apache spark. In: 2017 IEEE 13th International Conference on E-Science. Auckland, New Zealand, 2017. pp. 89–98.

10. Sidhanta, Golab, Mukhopadhyay. Optex: A deadline-aware cost optimization model for Spark. In: 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), Cartagena, Colombia: IEEE, 2016. pp. 193–202.

11. Thai, Varghese, Barker. Budget constrained execution of multiple bag-of-tasks applications on the cloud. CoRR. 2015:abs/1507.05467.

12. Munir, Abelló, Romero, et al. A cost-based storage format selector for materialization in big data frameworks. CoRR. 2018:abs/1806.03901.

13. Deb, Agrawal, Pratap, Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evolut Comput. 2002; 6:182–197.

14. Nghiem, Figueira. Towards efficient resource provisioning in MapReduce. JPDC, 2016; 95:29–41.

15. Verma, Cherkasova, Campbell. Resource provisioning framework for MapReduce jobs with performance goals. In: Middleware 2011 – ACM/IFIP/USENIX 12th International Middleware Conference, Lisbon, Portugal: Springer, 2011. pp. 165–186.

16. […], Lin, Hsu, He. Energy-efficient Hadoop for big data analytics and computing: A systematic review and research insights. Future Gener Comput Syst. 2018; 86:1351–1367.

17. Chiba, Onodera. Workload characterization and optimization of TPC-H queries on Apache Spark. In: 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Uppsala, IEEE, 2016. pp. 112–121.

18. Gounaris, Torres. A methodology for Spark parameter tuning. Big Data Res. 2018; 11:22–32.

19. Petridis, Gounaris, Torres. Spark parameter tuning via trial-and-error. In: Advances in Big Data – Proceedings of the 2nd INNS Conference on Big Data, Thessaloniki, Greece: Springer, 2016. pp. 226–237.

20. Davidson, Or. Optimizing shuffle performance in Spark. Technical Report, UC Berkeley, 2013.

21. Baldacci, Golfarelli. A cost model for Spark SQL. TKDE. 2019; 31:819–832.

22. Cardenas AF. Analysis and performance of inverted data base structures. Commun ACM, 1975; 18:253–263.

23. Jovanovic, Romero, Calders, Abelló. H-WorD: Supporting job scheduling in Hadoop with workload-driven data redistribution. In: Conference on Advances in Databases and Information Systems–20th East European Conference (ADBIS). Praga, 2016. pp. 306–320.

24. Skala. Hypergeometric tail inequalities: Ending the insanity. CoRR. 2013:abs/1311.5939.

25. Dasarathy. A simple probability trick for bounding the expected maximum of n random variables. Technical Report, Arizona State University, 2011.

26. Abedjan, Golab, Naumann. Data profiling: A tutorial. In: Proceedings of the 2017 ACM International Conference on Management of Data, Chicago, IL: ACM, 2017. pp. 1747–1751.

27. Sun, Franklin, Krishnan, Xin. Fine-grained partitioning for aggressive data skipping. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT: ACM, 2014. pp. 1115–1126.