Handling data skew in joins based on cluster cost partitioning for MapReduce

Abstract

Data skew in parallel joins causes poor load balancing, which can lead to widely varying execution times among the reducers in MapReduce. The performance of the join operation degrades severely when the datasets to be joined are heavily skewed. Previous work mainly addresses either input or output load imbalance among reducers, but not both, which limits its effectiveness for load balancing. In this paper, we present a new data skew handling method based on Cluster Cost Partitioning (CCP) for optimizing parallel joins in MapReduce. A new cost model that considers the properties of both input and output is defined to estimate the cost of the parallel join. CCP uses clusters, instead of join keys from the input relations, to create a join matrix. Using the cost model, CCP identifies and splits heavy cells in the cluster join matrix, and then assigns the resulting non-heavy cells to reducers to balance the join load. For different applications, the input and output weight values in the cost model can be dynamically adjusted to describe the join costs more precisely. The experimental results demonstrate that CCP achieves more accurate load balancing among reducers.

Keywords

Data skew, load balance, join algorithm, cluster cost partitioning, MapReduce

1. Introduction

With the rapid growth of information and data, there is an urgent need for large-scale data analysis and processing. MapReduce, a software framework developed by Google, is by far the most successful realization of data-intensive cloud computing platforms owing to its simplicity, fault tolerance, and scalability [1].

The join operation is one of the most widely used operations in relational database systems, but it is also highly time-consuming [2]. Unfortunately, it is not directly supported by the MapReduce framework, for two reasons: (1) the framework was originally designed to process a single dataset, whereas a join typically requires two or more datasets, and (2) MapReduce’s key-equality-based data grouping makes it difficult to support complex join conditions [3].

One of the major obstacles hindering effective parallel join processing on MapReduce is data skew. Data skew refers to the imbalance in the amount of data assigned to each task, or the imbalance in the amount of work required to process such data [4]. The job completion time in MapReduce depends on the slowest running task in the job. If one task takes significantly longer to finish than others (the so-called straggler), it could delay the progress of the entire job. Stragglers can occur due to various reasons, among which data skew is a serious one.

Figure 1. Standard repartition join based on hash partitioning.

Data skew has been studied previously in the parallel database literature, but only for join [5, 6, 7], group [8], and aggregate [9] operations. Handling the effects of data skew on join operations in MapReduce is a challenging problem, and a simple extension of the traditional solutions is insufficient. Many recent studies on join operations have been reported in the literature. They fall roughly into two categories. The first designs novel join algorithms on top of Hadoop [10, 11, 12, 13]. The second changes the internals of Hadoop or builds a new layer on top of Hadoop to optimize traditional join algorithms [14, 15, 16, 17].

Of these two approaches, changing the internals of Hadoop or building a new layer on top of it is much harder for users. When balancing the workload for parallel joins in MapReduce, the distribution of the input data received from mappers and that of the output data produced by reducers are both important for performance. Previous work that designs novel join algorithms balances either the input share (for input-size dominated joins) or the output share (for output-size dominated joins), and is therefore ineffective for load balancing [13]. For example, the skew handling join (Sand-join) [18], which employs range partitioning instead of hash partitioning, considers only the input load distribution. M-Bucket-I and M-Bucket-O require more detailed input statistics (a multiple-bucket histogram) and minimize the maximum reducer input and the maximum reducer output, respectively. These two algorithms are designed for input-size dominated joins and output-size dominated joins, respectively [10].

In this paper, we present a new skew handling method based on cluster cost partitioning. The method balances both the input data and the output data among reducers when processing skewed data, to overcome the limitations of traditional methods. No modifications of the original MapReduce framework are necessary. The main contributions of our work include the following.

We propose a novel load-balancing algorithm, called Cluster Cost Partitioning (CCP), for parallel joins in MapReduce. The algorithm addresses both input and output imbalance based on cluster cost partitioning, and it achieves better load balancing among reducers.

The CCP algorithm extends the cluster splitting method, originally designed for a single dataset, to join algorithms that involve two datasets, in order to handle data skew in MapReduce. We adopt clusters to create the join matrix and to build the cluster cost model.

We implemented the CCP algorithm on Hadoop and conducted comprehensive experiments. The implementation does not require any modification to the Hadoop source code. All the functions are achieved by specifying the appropriate Map and Reduce functions on top of Hadoop.

The remainder of the paper is organized as follows. Section 2 briefly introduces MapReduce and the recent skew handling methods in MapReduce. The design and implementation of the CCP approach are detailed in Section 3. Section 4 shows representative experimental results. Section 5 reviews related works, and Section 6 concludes the paper.

2. Preliminaries

2.1 Overview of MapReduce

The MapReduce programming model processes large-scale data using two functions, Map and Reduce, which are the two primitives the framework provides for distributed data processing. The signatures of these primitives for key $k$ and value $v$ are given in Eq. (1).

$\displaystyle\textit{map}(k_{1},v_{1})\to\textit{list}(k_{2},v_{2})$ $\displaystyle\textit{reduce}(k_{2},\textit{list}(v_{2}))\to\textit{list}(k_{3},v_{3})$ (1)

The Map function transforms a key-value pair into a list of intermediate key-value pairs which are distributed among the reduce functions for further aggregation. In simple terms, data is distributed among the nodes for processing during the Map phase and the result is aggregated in the Reduce phase [19].

2.2 Data skew in equi-joins

Repartition join is the most commonly used join strategy in the MapReduce framework. A two-way equi-join example is illustrated in Fig. 1.

Example 1. Tables $S$ and $T$ each have 18 tuples, and there are $r=3$ reducers in the cluster. The data schema contains three fields ($pk$, $jk$, $\textit{others}$), where the attribute $jk$ is the join attribute. In the map phase, each map task works on a split of either $S$ or $T$. To identify which table an input record comes from, each map task tags the record with its originating table and outputs the extracted join key and the tagged record as a ($s.jk$, $s$) pair. The outputs are then partitioned, sorted and merged by the framework. All the records for each join key are grouped together and eventually fed to the ($s.jk$ mod $r$)th reducer. For each join key, the reduce function first separates and buffers the input records into two sets according to the table tag, and then performs a cross-product between the records in these sets.

In the standard repartition join, skew in the distribution of the join attribute’s values can overshadow the strengths of the parallel processing infrastructure. Figure 1 presents the first example, with default hash-based partitioning. $R_{1}$ receives 14 input tuples {$s_{1}$, $s_{2}$, $\ldots$, $s_{13}$, $s_{18}$} from $S$ and 6 input tuples {$t_{1}$, $t_{14}$, $t_{15}$, $t_{16}$, $t_{17}$, $t_{18}$} from $T$. $R_{1}$ thus has 20 input tuples and produces 18 output tuples, a heavy workload compared with the other reducers. For instance, $R_{3}$ receives one input tuple {$s_{17}$} from $S$ and 4 input tuples {$t_{10}$, $t_{11}$, $t_{12}$, $t_{13}$} from $T$, so it has only 5 input tuples and produces 4 output tuples. Consequently, $R_{1}$ takes significantly longer to process its data than the others, which slows down the entire computation.
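The per-reducer loads quoted in Example 1 can be reproduced with a small in-memory simulation of the repartition join. This is only a sketch: the function names are hypothetical, the tuple numbering ($s_{i}$, $t_{i}$) follows the order of the join-key lists, and the partition index is simply $jk$ mod $r$, so the partition with index 0 is assumed to correspond to the reducer called $R_{3}$ in the text.

```python
from collections import defaultdict

# Join keys of the running example; tuple i of S is s_i, tuple i of T is t_i.
S_JK = [1] * 13 + [2] * 3 + [3, 4]
T_JK = [1] + [2] * 8 + [3] * 4 + [4] * 5
R = 3  # number of reducers


def map_phase(keys, tag):
    """Emit one (join_key, (table_tag, tuple_index)) pair per input tuple."""
    return [(jk, (tag, i + 1)) for i, jk in enumerate(keys)]


def shuffle(pairs, r):
    """Hash-partition the pairs by join_key mod r and group them by key."""
    partitions = defaultdict(lambda: defaultdict(list))
    for jk, tagged in pairs:
        partitions[jk % r][jk].append(tagged)
    return partitions


def reduce_phase(groups):
    """Per key, separate records by table tag and build the cross-product."""
    output = []
    for records in groups.values():
        s_side = [rec for rec in records if rec[0] == "S"]
        t_side = [rec for rec in records if rec[0] == "T"]
        output.extend((s, t) for s in s_side for t in t_side)
    return output


def reducer_loads(s_keys, t_keys, r):
    """Return {partition: (input_tuples, output_tuples)}."""
    parts = shuffle(map_phase(s_keys, "S") + map_phase(t_keys, "T"), r)
    return {
        p: (sum(len(v) for v in groups.values()), len(reduce_phase(groups)))
        for p, groups in parts.items()
    }


loads = reducer_loads(S_JK, T_JK, R)
# Partition 1 (keys 1 and 4): 20 inputs, 18 outputs.
# Partition 0 (key 3): 5 inputs, 4 outputs.
```

Under these assumptions the partition holding keys 1 and 4 carries four times the input of the lightest one, which is the imbalance the section describes.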

Figure 2. Range-based partitioning.

2.3 Current skew handling methods

In this section, we discuss three important skew handling methods with intuitive examples. To allow comparison with our proposed algorithm, all examples use the same data. Our original idea for the cluster cost partitioning method was inspired by these examples.

2.3.1 Range-based partitioning method

In range-based partitioning, the domain of join keys is divided into a number of blocks, called ranges. The number of ranges is equal to the number of partitions [18]. Figure 2 presents the second example, which uses the range-based partitioning method.

Example 2. In this example, we have $S.jk=$ {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4}. Since there are $r=3$ reducers, we select $r-1$ key entries to construct the split vector. In our example, as $18/3=6$, the 6th and 12th elements {1, 1} are selected. From this split vector, each reducer is assigned a lower bound and an upper bound for range partitioning (except the first and the last reducers, which have no lower bound and no upper bound, respectively). All the tuples whose join key falls into a particular range are sent to the reducer associated with that range. This helps balance the input workloads to a certain extent.
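The selection of the split vector in Example 2 can be sketched as follows. The helper name `split_vector` is hypothetical, and the sketch only picks the boundary keys; how tuples equal to a duplicated boundary key are routed is not specified in this section and is left out.

```python
def split_vector(sorted_keys, r):
    """Pick r - 1 boundary keys at equal-depth positions of the sorted keys."""
    step = len(sorted_keys) // r  # 18 // 3 = 6 in Example 2
    # The i-th boundary is the (i * step)-th element, counting from 1.
    return [sorted_keys[i * step - 1] for i in range(1, r)]


S_JK = sorted([1] * 13 + [2] * 3 + [3, 4])
vector = split_vector(S_JK, 3)  # 6th and 12th elements: [1, 1]
```

Both boundaries fall on the key 1, which illustrates why a single very frequent key limits how well range boundaries alone can spread the load.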

2.3.2 Randomized method

Okcan and Riedewald recently modeled a join between two data sets $S$ and $T$ with a join matrix $M$, and employed this representation to create and reason about different join implementations in MapReduce [10]. Using this model, they derived a simple randomized algorithm, called 1-Bucket-Theta, for implementing theta-joins. It handles data skew effectively.

Example 3. Figure 3a shows the join matrix for an equi-join with the join condition $S.jk=T.jk$. The join matrix is an 18 by 18 matrix that represents the cross-product space between the two relations $S$ and $T$. This equi-join example has a sparse join matrix: only a small portion of the cross-product space in Fig. 3a produces output tuples. The matrix can be covered by $r=3$ reducers. To balance the reducers’ workloads, each reducer should be assigned the same number of cells in the join matrix. In this method, a random row is selected for a given input tuple, and an output tuple is then created for each region that intersects this row. In the 1-Bucket partitioning of Fig. 3b, all the regions ($R_{1}$, $R_{2}$, and $R_{3}$) have the same size (108 cells), which guarantees that each reducer is responsible for its exact fair share, $108/(18*18)=1/3$, of the cross-product result. The 1-Bucket-Theta algorithm in Fig. 3b assigns all $18^{2}$ cells to the 3 reducers, regardless of the join condition. Thus, the regions cover the entire join matrix. The scheme achieves almost perfect load balancing. However, it incurs prohibitively high input costs for low-selectivity joins.

Figure 3. Skew handling methods based on join matrix.

To identify large regions in the join matrix that do not contain any output tuples, the M-Bucket algorithm, which needs detailed input statistics in the form of a multiple-bucket histogram, was proposed. It constructs an approximate equi-depth histogram with $p$ buckets over the join keys of the input relations $S$ and $T$, and creates a grid of size $p\times p$ over the join matrix. In Fig. 3c, $p=18/2=9$, and each grid cell contains $h=(18/p)^{2}=4$ matrix cells. A grid cell that may produce an output tuple is a candidate cell (marked with diagonal hatching). M-Bucket assigns only candidate grid cells to reducers and disregards large contiguous portions of the join matrix that produce no output.

2.3.3 Multi-dimensional range partitioning method

A skew handling method called multi-dimensional range partitioning (MDRP) is stochastic like the range-based algorithm and the M-Bucket algorithm.

Example 4. There are $r=3$ reducers, so $r-1=2$ key entries, {1, 1} and {2, 3}, are selected on the join key $jk$ over relations $S$ and $T$ respectively, resulting in 3 sub-ranges for each relation. We can create a partitioning matrix $M$ based on MDRP as shown in Fig. 3d. Non-candidate (shaded) cells mark the area that does not satisfy the equi-join condition. In the matrix $M$, the cells have different workloads. In Fig. 3d, the optimal workload is $\lfloor{(13+24+4+5)/3}\rfloor=15$. Since $M(1,3)=24$, the cell (T1, S3) is a heavy cell ($24>15$). By splitting the heavy cell, we can assign it to two or more different reducers to balance the workloads across reducers.

Figure 4. Processing pipeline at a reducer.

Among the skew handling join algorithms discussed above, the range-based algorithm mainly focuses on balancing the input workloads of reducers and does not consider the join output distribution. The 1-Bucket method practically guarantees a balanced cross-product output across reducers through its randomized approach, and M-Bucket can reliably balance input-related joins with an equi-depth histogram because it knows exactly how many input tuples from each relation belong to each bucket. The multi-dimensional range partitioning method balances output very well by means of the range-based matrix. However, it balances the input using range-level cells instead of key-level cells, and the number of join keys may vary between ranges in the join matrix, so it may incur input imbalance. In short, these algorithms fall short of considering both input and output imbalance.

3. Cluster cost partitioning (CCP)

The cluster cost partitioning (CCP) method is designed to overcome the limitation of existing join algorithms, which optimize for either input or output but not both. We consider the properties of both input and output data based on our cost model.

3.1 Cost model

Due to the nature of MapReduce, it is easy to balance the load among mapper nodes. However, with a standard join algorithm some reducers may receive a much larger amount of data than others. We first analyze the completion time of the Reduce phase by considering a single reducer. MapReduce shuffles the mapper output to the reducers according to the partition function, and each reducer uses this shuffled data as its input. The reducer sorts the input by key, reads the corresponding value list for each key, computes the join for this list, and then writes its locally created join tuples to the distributed file system (DFS) [10]. The processing pipeline at a reducer is shown in Fig. 4.

The objective of load balancing is to minimize the maximum reducer load. As seen in Fig. 4, this load depends on the size of both the input data received from mappers and the output data produced by the join algorithm in the reducer. As a simple example, suppose reducers $R_{1}$ and $R_{2}$ both receive 20 tuples from two relations $R$ and $S$. In reducer $R_{1}$, 19 tuples from relation $R$ join with only 1 tuple from relation $S$, so the join produces $19*1=19$ output records. Reducer $R_{2}$ contains 10 tuples from relation $R$ and 10 tuples from relation $S$, so its join produces $10*10=100$ output records. The time to process these 20 input tuples is the same in $R_{1}$ and $R_{2}$, but the execution time of the join operation differs. As another simple example, suppose reducer $R_{1}$ receives 5 tuples from relation $R$ and 10 tuples from relation $S$, while reducer $R_{2}$ joins 2 tuples from relation $R$ with 25 tuples from relation $S$. The size of the join result is 50 in both cases ($5*10=2*25=50$), but $R_{1}$ receives $5+10=15$ input tuples whereas $R_{2}$ receives $2+25=27$. Here the time to execute the join is the same in $R_{1}$ and $R_{2}$, but the time to process the received tuples differs. To balance the size of both the input and the output data of the join in reducers, we represent the workload of a reducer $r$ as a weighted function of its input and output costs:

$\displaystyle w(r)=c_{in}(r)+c_{\textit{out}}(r)=w_{in}*\textit{input}(r)+w_{\textit{out}}*\textit{output}(r),$

where $c_{in}(r)$ and $c_{\textit{out}}(r)$ are the input cost and the output cost of the reducer, $w_{in}$ and $w_{\textit{out}}$ indicate the average time cost of processing a single input and output tuple respectively, input( $r$ ) represents the size of the input data, output( $r$ ) corresponds to the output size.
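The two motivating examples can be checked directly against the cost model. The following is a minimal sketch with hypothetical function names; it assumes a single join key per reducer and unit weights $w_{in}=w_{\textit{out}}=1$, which the text does not prescribe.

```python
def join_sizes(n_left, n_right):
    """Input and output size when n_left tuples join n_right tuples on one key."""
    return n_left + n_right, n_left * n_right


def reducer_workload(input_size, output_size, w_in=1, w_out=1):
    """w(r) = w_in * input(r) + w_out * output(r)."""
    return w_in * input_size + w_out * output_size


# First example: equal input (20 tuples), different output.
first = [join_sizes(19, 1), join_sizes(10, 10)]   # [(20, 19), (20, 100)]
# Second example: equal output (50 tuples), different input.
second = [join_sizes(5, 10), join_sizes(2, 25)]   # [(15, 50), (27, 50)]

workloads = [reducer_workload(i, o) for i, o in first + second]
# [39, 120, 65, 77]
```

With only the input term or only the output term, each pair of reducers would look identical in one of the two examples; the combined weight distinguishes them in both.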

3.2 Cluster cost partitioning

The map function transforms input data into (key, value) pairs. A cluster is the subset of all (key, value) pairs, or tuples, sharing the same key. The clusters are distributed to different reducers by applying a hash function. The proposed CCP algorithm creates a join matrix based on the clusters. Each dimension of the matrix represents an input relation, and a cell represents the cross-product of two clusters, one from each relation. From the frequency value of each cluster, we can estimate the detailed cost of a cell in terms of both input and output tuples. For the purpose of load balancing, the cluster cost partitioning method identifies the heavy cells and splits them into non-heavy cells. Finally, all the non-heavy cells are assigned to reducers using a greedy heuristic to achieve load balancing.

3.2.1 Creation of a cluster join matrix

We first consider the join of two relations $S$ and $T$ on a join key $jk$, using the same data sets as in Section 2. Relation $S$ has $S.jk=$ {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 4} and relation $T$ has $T.jk=$ {1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4}. Intuitively, $S$ and $T$ both have 4 clusters: {$S_{1},S_{2},S_{3},S_{4}$} and {$T_{1},T_{2},T_{3},T_{4}$}. The clusters of $S$ and $T$ have the following properties:

$\displaystyle S_{1}\cup S_{2}\cup S_{3}\cup S_{4}=S;\quad S_{i}\cap S_{j}=\varnothing;\quad 1\leqslant i,j\leqslant 4,\ i\neq j$ $\displaystyle T_{1}\cup T_{2}\cup T_{3}\cup T_{4}=T;\quad T_{i}\cap T_{j}=\varnothing;\quad 1\leqslant i,j\leqslant 4,\ i\neq j.$ (2)

Using these clusters, we can create a cluster join matrix $M$ as shown in Fig. 5. In the matrix $M$, the superscript of the join key is the frequency value of the cluster. The $i$th row is the cluster $T_{i}$ and the $j$th column represents the cluster $S_{j}$. Non-candidate cells (shaded in the figure) mark the area of the matrix that does not satisfy the equi-join condition. Each candidate cell contains two elements. The value above the slash in a candidate cell is the size of the input data of the two clusters from relations $S$ and $T$, and the value below the slash is the size of the output data. For instance, the size of the input data of the cell ($T_{2}$, $S_{2}$) is 11 because the cluster $T_{2}$ has 8 tuples and the frequency value of $S_{2}$ is 3. These 8 tuples from $T$ and 3 tuples from $S$, which all have $jk=2$, produce 24 join output tuples.
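The candidate cells of the cluster join matrix can be derived from the cluster frequencies alone. The sketch below uses a hypothetical helper, `cluster_join_matrix`, and stores only the candidate (diagonal) cells, keyed by join key, as (input size, output size) pairs.

```python
from collections import Counter

S_JK = [1] * 13 + [2] * 3 + [3, 4]
T_JK = [1] + [2] * 8 + [3] * 4 + [4] * 5


def cluster_join_matrix(s_keys, t_keys):
    """Map each shared join key to (input_size, output_size) of its cell."""
    f_s, f_t = Counter(s_keys), Counter(t_keys)
    shared = sorted(f_s.keys() & f_t.keys())  # unmatched clusters are dropped
    return {k: (f_t[k] + f_s[k], f_t[k] * f_s[k]) for k in shared}


matrix = cluster_join_matrix(S_JK, T_JK)
# {1: (14, 13), 2: (11, 24), 3: (5, 4), 4: (6, 5)}
```

The entry for key 2 matches the values 11 and 24 quoted for the cell ($T_{2}$, $S_{2}$).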

Figure 5. The cluster join matrix.

The cluster join matrix can be used to distribute the load to the reducers. Our goal is to find a mapping from the cluster join matrix cells to reducers that is optimal with respect to both input and output. When the Map function receives an input tuple from relation $S$ or $T$, we find the cluster that the tuple belongs to and determine the reducer according to the matrix-to-reducer mapping. Our cluster cost partitioning algorithm is responsible for balancing workloads across reducers.

Notice that the cluster join matrix $M$ for equi-joins must be a square matrix, because clusters in relation $S$ that do not match any cluster in relation $T$ are discarded (and likewise for clusters in relation $T$). Note also that an $n\times n$ matrix $M$ contains only $n$ candidate cells: since a cluster is the subset of all tuples sharing the same key, each cluster in relation $S$ corresponds to exactly one cluster in relation $T$ in the cluster join matrix. The join matrix $M$ is a diagonal matrix if the clusters in relations $S$ and $T$ are sorted by key value before it is created.

3.2.2 Identifying and splitting heavy cells

Since some keys appear more frequently in the intermediate results than others, the clusters may vary considerably in size. Thus, even if each reducer receives the same number of clusters, the overall number of tuples per reducer may still be different. In order to balance reducers’ workloads, we first define a heavy cell as follows:

Definition 1. Let $M_{in}(i,j)$ denote the number of input tuples of the cell ($T_{i}$, $S_{j}$) and $M_{\textit{out}}(i,j)$ the number of output tuples. A heavy cell in an $n\times n$ cluster join matrix $M$ is a cell ($T_{i}$, $S_{j}$) that satisfies:

$\displaystyle\frac{w_{in}\ast M_{in}(i,j)+w_{\textit{out}}\ast M_{\textit{out}}(i,j)}{\sum\nolimits_{t=1}^{n}\sum\nolimits_{s=1}^{n}\left(w_{in}\ast M_{in}(t,s)+w_{\textit{out}}\ast M_{\textit{out}}(t,s)\right)}\geqslant\frac{1}{r},$ (3)

where $r$ is the number of reducers and $1/r$ is the optimal workload ratio for each reducer. In the best case, a reducer receives only this one heavy cell under the matrix-to-reducer mapping. Even then, because the ratio of the heavy cell’s workload to the total workload is greater than or equal to $1/r$, heavy cells in the join matrix lead to load imbalance.
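Definition 1 translates into a short test over the candidate cells. The sketch below is illustrative only: the function name and cell labels are hypothetical, the (input, output) sizes are those of the data sets in Section 3.2.1, and unit weights are assumed. The comparison is written as `cost * r >= total` to avoid a floating-point division.

```python
# (input_size, output_size) of each candidate cell in the running example.
CELLS = {
    "T1,S1": (14, 13),
    "T2,S2": (11, 24),
    "T3,S3": (5, 4),
    "T4,S4": (6, 5),
}


def heavy_cells(cells, r, w_in=1, w_out=1):
    """Return the labels of cells whose cost share is at least 1/r (Eq. 3)."""
    cost = {c: w_in * i + w_out * o for c, (i, o) in cells.items()}
    total = sum(cost.values())
    return [c for c, v in cost.items() if v * r >= total]


heavy = heavy_cells(CELLS, r=3)  # ['T2,S2']
```

The cell ($T_{1}$, $S_{1}$) has cost 27 out of a total of 82, just below the $1/3$ threshold, so only ($T_{2}$, $S_{2}$) with cost 35 is reported as heavy.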

Figure 6. Three partitioning schemes.

For example, assume $w_{in}=1$ and $w_{\textit{out}}=1$ for the equi-join in Fig. 5. The total workload is then 82, and the optimal workload of each reducer is $\lfloor{82/3}\rfloor=27$. The workload of the ($T_{2}$, $S_{2}$) cell is $11+24=35$, which is greater than the optimal value, so ($T_{2}$, $S_{2}$) is a heavy cell. Our goal in distributing candidate cells is to balance the workload across reducers, so each heavy cell in the cluster join matrix should be split into

$\displaystyle d=\left\lceil\frac{w_{in}\ast M_{in}(i,j)+w_{\textit{out}}\ast M_{\textit{out}}(i,j)}{w}\right\rceil$ (4)

sub-cells, where $w$ denotes the optimal workload of a reducer. Among these $d$ sub-cells, the workload of each of the first $d-1$ sub-cells should be equal to or very close to $w$.

The value of $\frac{w_{in}\ast M_{in}(i,j)+w_{\textit{out}}\ast M_{\textit{out}}(i,j)}{w}$ may not be an integer, whereas the number of sub-cells must be a positive integer. We therefore use the rounded-up value $d=\left\lceil\frac{w_{in}\ast M_{in}(i,j)+w_{\textit{out}}\ast M_{\textit{out}}(i,j)}{w}\right\rceil$ as the number of sub-cells. When the quotient is not an integer, $d$ equals its integral part plus 1. The workload of each of the first $d-1$ sub-cells is $w$, and the workload of the last sub-cell should be very close to:

$\displaystyle w_{\textit{rem}}=\left(w_{in}\ast M_{in}(i,j)+w_{\textit{out}}\ast M_{\textit{out}}(i,j)\right)\bmod\left((d-1)\ast w\right).$ (5)
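Equations (4) and (5) can be evaluated for the heavy cell of the running example. In this sketch the function name is hypothetical, and the guard for $d=1$ (a cell that needs no split) is an added assumption to avoid a modulo by zero; it is not part of the original formulas.

```python
import math


def split_plan(cell_cost, w):
    """Number of sub-cells d (Eq. 4) and estimated remainder w_rem (Eq. 5)."""
    d = math.ceil(cell_cost / w)
    if d <= 1:
        # Assumption: a cell that fits into one reducer is left unsplit.
        return 1, cell_cost
    return d, cell_cost % ((d - 1) * w)


d, w_rem = split_plan(35, 27)  # heavy cell (T2, S2) with w = 27: (2, 8)
```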

Theorem 1. Suppose the heavy cell ($T_{i}$, $S_{j}$) is split into $d$ sub-cells. If the frequency value of the cluster $T_{i}$ is greater than that of $S_{j}$, i.e. $f(T_{i})>f(S_{j})$, then the actual workload of each of the first $d-1$ sub-cells is:

$\displaystyle w_{\textit{act}}=w_{in}\left(\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}f(S_{j})}\right\rceil+f(S_{j})\right)+w_{\textit{out}}\left(\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}f(S_{j})}\right\rceil\ast f(S_{j})\right).$ (6)

Proof. Since $f(T_{i})>f(S_{j})$ and the cell ($T_{i}$, $S_{j}$) is a heavy cell, the cluster $T_{i}$ is split into $d$ smaller fragments and the cluster $S_{j}$ is replicated to all $d$ sub-cells. Let $x$ be the number of tuples in each of the first $d-1$ fragments of the cluster $T_{i}$. The optimal workload of each of these $d-1$ sub-cells is $w$. According to the cost model, we have:

$\displaystyle w_{in}\left(x+f(S_{j})\right)+w_{\textit{out}}\ast x\ast f(S_{j})=w$ $\displaystyle x\left(w_{in}+w_{\textit{out}}f(S_{j})\right)+w_{in}\ast f(S_{j})=w$ $\displaystyle x=\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}\ast f(S_{j})}\right\rceil.$ (7)

Since $x$ must be a positive integer, the quotient is rounded up. Substituting $x$ back into the cost model gives the actual workload:

$\displaystyle w_{\textit{act}}=w_{in}\left(\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}f(S_{j})}\right\rceil+f(S_{j})\right)+w_{\textit{out}}\left(\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}f(S_{j})}\right\rceil\ast f(S_{j})\right).$ (8)

Then the actual workload of the last sub-cell is:

$\displaystyle w_{\textit{last}}=w_{in}\left(f(T_{i})-(d-1)\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}f(S_{j})}\right\rceil+f(S_{j})\right)+w_{\textit{out}}\left(f(T_{i})-(d-1)\left\lceil\frac{w-w_{in}\ast f(S_{j})}{w_{in}+w_{\textit{out}}f(S_{j})}\right\rceil\right)f(S_{j}).$ (9)

If the frequency value of the cluster $S_{j}$ is greater than that of $T_{i}$, i.e. $f(S_{j})>f(T_{i})$, the cluster $S_{j}$ is split instead, and we only need to exchange $f(S_{j})$ and $f(T_{i})$ in the formulas above. Notice that $w_{\textit{act}}$ and $w_{\textit{last}}$ are equal to or very close to $w$ and $w_{\textit{rem}}$ respectively.
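The splitting rule of Theorem 1 can be sketched as a small function that fragments the larger cluster and replicates the smaller one. The function name and argument names are hypothetical, and the sketch assumes the rounded-up fragment size leaves a positive number of tuples for the last fragment, which holds for the running example but is not checked in general.

```python
import math


def split_heavy_cell(f_big, f_small, w, w_in=1, w_out=1):
    """Split the larger cluster (f_big tuples) and replicate the smaller one.

    Returns (fragment_sizes, sub_cell_workloads).
    """
    total = w_in * (f_big + f_small) + w_out * f_big * f_small
    d = math.ceil(total / w)                                        # Eq. (4)
    x = math.ceil((w - w_in * f_small) / (w_in + w_out * f_small))  # Eq. (7)
    sizes = [x] * (d - 1) + [f_big - (d - 1) * x]
    # Each sub-cell joins one fragment with a full copy of the small cluster.
    loads = [w_in * (n + f_small) + w_out * n * f_small for n in sizes]
    return sizes, loads


# Heavy cell (T2, S2): f(T2) = 8, f(S2) = 3, optimal workload w = 27.
sizes, loads = split_heavy_cell(8, 3, 27)  # ([6, 2], [27, 11])
```

The first load is $w_{\textit{act}}$ from Eq. (8) and the second is $w_{\textit{last}}$ from Eq. (9). Because $S_{2}$ is replicated, the two loads sum to 38 rather than the original 35.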

3.2.3 Assigning cluster join matrix cells to reducers

After splitting all the heavy cells in the cluster join matrix, we have a set of non-heavy cells $C=$ {$c_{1},c_{2},\ldots,c_{|C|}$}. Given $r$ reducers, we want to assign all the non-heavy cells in $C$ to the $r$ reducers so that the load is balanced across reducers. The optimal load balance is achieved by solving the corresponding bin packing problem. Unfortunately, bin packing is non-deterministic polynomial hard (NP-hard) [24], so we use a greedy heuristic to determine the candidate cell bundles.

Let $w(c_{l})$ denote the workload of a cell $c_{l}\in C$. We sort the cells in $C$ by their workload $w(c_{l})$ in descending order. Each cell $c_{l}\in C$ is then assigned to the reducer that currently has the smallest total workload, where the total workload of a reducer is the sum of the costs of all cells assigned to it. We repeat this step until all cells in $C$ have been assigned. Once all the cells have been assigned, if the workload of a reducer is much greater than the optimal value, a larger cell in this reducer can be selected and split again. The resulting sub-cells are then assigned to other reducers using the greedy heuristic to achieve better load balancing. In summary, our target is to find the assignment that minimizes the maximum workload $w_{\max}$ across reducers.

Example 5. Figure 6 shows three partitioning schemes (input-size dominated, output-size dominated and input-output dominated) based on the cluster cost partitioning method for the same data sets as in Section 2. A heavy cell ($T_{i}$, $S_{j}$) may be split into several sub-cells, so we use ($T_{i}$, $S_{j}$, $w(c_{l})$) to denote a cell. In Fig. 6c, for simplicity, we use $w_{in}=1$ and $w_{\textit{out}}=1$. The set of non-heavy candidate cells is then $C=$ {($T_{1}$, $S_{1}$, 27), ($T_{2}$, $S_{2}$, 27), ($T_{2}$, $S_{2}$, 11), ($T_{3}$, $S_{3}$, 9), ($T_{4}$, $S_{4}$, 11)}. In the original matrix, the ($T_{2}$, $S_{2}$) cell is a heavy cell and the optimal value is $w=27$, so the cell is split into $d=\lceil{35/27}\rceil=2$ non-heavy cells. Since $f(T_{2})=8$ and $f(S_{2})=3$, we have $f(T_{2})>f(S_{2})$. We therefore split the cluster $T_{2}$ into two fragments and replicate $S_{2}$ to the two sub-cells. According to Eq. (7), the number of tuples of the cluster $T_{2}$ in the first fragment is:

$\displaystyle\left\lceil\frac{w-w_{in}\ast 3}{w_{in}+w_{\textit{out}}\ast 3}\right\rceil=\left\lceil\frac{27-3}{1+3}\right\rceil=6.$

The workload of the first sub-cell is $6+3+6*3=27$ and that of the second sub-cell is $(8-6)+3+(8-6)*3=11$. These 5 non-heavy cells are then sorted according to their workload $w(c_{l})$. We have $r=3$ reducers {$R_{1}$, $R_{2}$, $R_{3}$}. First, we select ($T_{1}$, $S_{1}$, 27) and assign it to the reducer $R_{1}$ (shaded in light blue). Next, ($T_{2}$, $S_{2}$, 27) is sent to the reducer $R_{2}$ (shaded in light yellow). Finally, the remaining 3 cells are assigned to the reducer $R_{3}$ (shaded in light green). The maximum workload in the cluster is $w(R_{3})=31$. Reducers $R_{1}$ and $R_{2}$ have reached the optimal value 27. Hence, although $w(R_{3})$ is slightly greater than the optimal value, the cells in the reducer $R_{3}$ do not need to be split.
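The greedy assignment in Example 5 can be sketched with a min-heap keyed by the current reducer load. The function name and the cell labels are hypothetical; cells are processed in descending order of workload, and ties between equally loaded reducers are broken by the lowest reducer index, which is an assumption of this sketch.

```python
import heapq


def greedy_assign(cells, r):
    """Assign (label, workload) cells to r reducers, largest cell first.

    Returns a list of (total_load, reducer_index, labels), ordered by index.
    """
    heap = [(0, k, []) for k in range(r)]
    heapq.heapify(heap)
    for label, w in sorted(cells, key=lambda c: c[1], reverse=True):
        load, k, assigned = heapq.heappop(heap)  # least-loaded reducer
        heapq.heappush(heap, (load + w, k, assigned + [label]))
    return sorted(heap, key=lambda entry: entry[1])


# Non-heavy cells of Fig. 6c after splitting (T2, S2) into two sub-cells.
CELLS = [
    ("T1,S1", 27),
    ("T2,S2#1", 27),
    ("T2,S2#2", 11),
    ("T3,S3", 9),
    ("T4,S4", 11),
]

assignment = greedy_assign(CELLS, 3)
reducer_totals = [load for load, _, _ in assignment]  # [27, 27, 31]
```

The third reducer receives the three small cells and ends with a load of 31, as in the example.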

In the input-size dominated and output-size dominated examples in Fig. 6a and b, the heavy cell is also split into two sub-cells. In Fig. 6a, after the sub-cell ($T_{1}$, $S_{1}$, 3) is assigned to the reducer $R_{3}$ by the greedy heuristic, $R_{3}$ turns from a non-heavy reducer into a heavy reducer, while $R_{2}$ is still a non-heavy reducer. We then further split the cell ($T_{4}$, $S_{4}$, 6) into two sub-cells, ($T_{4}$, $S_{4}$, 5) and ($T_{4}$, $S_{4}$, 2), and assign the sub-cell ($T_{4}$, $S_{4}$, 2) to the reducer $R_{2}$. The maximum workload in the cluster changes from $w(R_{3})=14$ to $w(R_{2})=13$. In Fig. 6a, the ($T_{1}$, $S_{1}$, 14) cell is split by column, while in Fig. 6b the heavy cell ($T_{2}$, $S_{2}$, 24) is split by row. The direction depends on the frequency values of the clusters in the cell.

In contrast to previous work in this field, the proposed CCP algorithm balances the load so as to minimize the work per reducer, considering both input and output workload. In addition, the weight values $w_{in}$ and $w_{\textit{out}}$ can be dynamically adjusted for different scenarios. For example, if the number of distinct values (the number of clusters) in the join datasets is large while the frequency values of the clusters are small, then $w_{in}$ should be larger than $w_{\textit{out}}$: reducers have to process a large number of input tuples, but the join computation may produce only a small amount of output. The weight values of input and output are also related to the join condition. The operators in a join query are not limited to equi-joins. The join condition can be defined as a binary function $\theta$ that belongs to {$<$, $\leqslant$, $=$, $\geqslant$, $>$}. Compared with the equi-join, the theta-join is more expressive for describing relations in data analytic queries. A complex join may include combinations of equality, band and theta join conditions. For a complex join condition, $w_{\textit{out}}$ should be set slightly larger than for an equi-join.

The techniques presented above (creating the cluster join matrix, identifying and splitting heavy cells, and assigning cluster join matrix cells to reducers) aim at distributing clusters to reducers such that the workload on all reducers is well balanced. We use $\lambda=\max(W_{k})/\textit{avg}(W_{k})$ to show the effectiveness of our CCP load balancing algorithm, where $W_{k}$ indicates the total workload of the $k$th reducer.

Theorem 2. In the worst case, the workload imbalance $\lambda$ is less than or equal to 2.
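The cell values quoted for the three schemes can be related to the weights of the cost model. In the sketch below, the weight pairs $(1,0)$, $(0,1)$ and $(1,1)$ are assumptions chosen because they reproduce the cell costs quoted in the text, such as ($T_{1}$, $S_{1}$, 14) and ($T_{2}$, $S_{2}$, 24); the names are hypothetical.

```python
# (input_size, output_size) of each candidate cell in the running example.
CELLS = {
    "T1,S1": (14, 13),
    "T2,S2": (11, 24),
    "T3,S3": (5, 4),
    "T4,S4": (6, 5),
}

# Assumed (w_in, w_out) per scheme; not stated explicitly in the text.
SCHEMES = {"input": (1, 0), "output": (0, 1), "input-output": (1, 1)}


def summarize(cells, schemes, r):
    """Per scheme, return (cell costs, optimal per-reducer workload)."""
    summary = {}
    for name, (w_in, w_out) in schemes.items():
        costs = {c: w_in * i + w_out * o for c, (i, o) in cells.items()}
        summary[name] = (costs, sum(costs.values()) // r)
    return summary


summary = summarize(CELLS, SCHEMES, r=3)
# input:        costs 14, 11, 5, 6  -> optimal workload 12
# output:       costs 13, 24, 4, 5  -> optimal workload 15
# input-output: costs 27, 35, 9, 11 -> optimal workload 27
```

Under these assumed weights the cell that exceeds the optimal workload changes with the scheme: ($T_{1}$, $S_{1}$) for the input-size dominated case, and ($T_{2}$, $S_{2}$) for the other two.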

Proof. Since all the heavy cells in the matrix $M$ have been split, we have a set of non-heavy cells $C=$ { $c_{1},c_{2},\ldots,c_{|C|}$ }. We first sort these $|C|$ cells in non-increasing order of their workloads. Let $w$ be the average workload of the reducers. Then we have:

$\displaystyle w(c_{1})\geqslant w(c_{2})\geqslant\ldots\geqslant w(c_{|C|}),\quad w(c_{i})\leqslant w,\ \forall i,\ 1\leqslant i\leqslant|C|.$ (10)

Suppose that, after all the non-heavy cells are assigned to the $r$ reducers, the reducer $R_{k}$ has the maximum workload of all reducers. Let $w_{k}$ denote the workload of $R_{k}$ before it receives its last non-heavy cell, and let $w_{m}$ denote the workload of any other reducer $R_{m}$ at that moment. Then $w_{k}$ cannot exceed $w$ . This is because the greedy heuristic always chooses the reducer with the smallest workload, so $w_{k}\leqslant w_{m};1\leqslant\forall m\leqslant r,m\neq k$ . If $w_{k}>w$ , we have:

$\displaystyle\Rightarrow w_{m}>w,\forall 1\leqslant m\leqslant r,m\neq k$ $\displaystyle\Rightarrow w_{1}+w_{2}+\ldots+w_{r}>rw.$ (11)

However, the total workload of these $r$ reducers cannot exceed $r w$ . This contradicts the conclusion $w_{1}+w_{2}+\ldots+w_{r}>rw$ , so the hypothesis does not hold and $w_{k}\leqslant w$ . In the worst case, assume the workload of the last non-heavy cell assigned to the reducer $R_{k}$ is $w-\varepsilon$ , where $\varepsilon$ is a small positive real number close to 0. After this cell is assigned to the reducer $R_{k}$ , the workload of $R_{k}$ is $w_{k}+w-\varepsilon$ . Since $R_{k}$ has the maximum workload among all reducers, the total workload of the reducer $R_{k}$ is:

$\displaystyle\max(W_{k})=w_{k}+w-\varepsilon\leqslant w+w-\varepsilon=2w-\varepsilon.$ (12)

Then in the worst case, the workload imbalance $\lambda$ is:

$\displaystyle\lambda=\frac{\max(W_{k})}{\textit{avg}(W_{k})}=\frac{2w-\varepsilon}{w}$ (13)

Since $\varepsilon$ is a small positive real number close to 0, $\lambda\leqslant 2$ . The upper bound of $\lambda$ shows the effectiveness of our load balancing algorithm in terms of data skew in joins for MapReduce.
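The greedy assignment analyzed in Theorem 2 can be illustrated with a short sketch. The Python code below is only an illustration under stated assumptions, not the authors' implementation: the function names `greedy_assign` and `imbalance` and the sample cell workloads are hypothetical, and the cells are assumed to be non-heavy already (each workload is at most the average reducer workload).

```python
import heapq

def greedy_assign(cell_workloads, num_reducers):
    """Assign cells, largest first, to the currently least-loaded reducer."""
    # Min-heap of (current workload, reducer index).
    heap = [(0, k) for k in range(num_reducers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_reducers)]
    for w in sorted(cell_workloads, reverse=True):
        load, k = heapq.heappop(heap)       # reducer with the smallest workload
        assignment[k].append(w)
        heapq.heappush(heap, (load + w, k))
    return assignment

def imbalance(assignment):
    """lambda = max(W_k) / avg(W_k) over all reducers."""
    loads = [sum(cells) for cells in assignment]
    return max(loads) / (sum(loads) / len(loads))

# Hypothetical non-heavy cells: every workload is <= the average (36 / 3 = 12).
cells = [7, 6, 5, 5, 4, 3, 3, 2, 1]
assignment = greedy_assign(cells, 3)
lam = imbalance(assignment)
print(assignment, lam)
```

For any input that satisfies the non-heavy assumption, the computed `lam` stays within the bound stated in Theorem 2.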

3.2.4 Executing equi-join

We use the fragment-replicate technique for the actual equi-join processing. For each heavy cell in the matrix, the cluster with the larger frequency value is fragmented, and the cluster with the smaller frequency value is replicated to multiple sub-cells. As seen in Fig. 6c, the heavy cell ( $T_{2}$ , $S_{2}$ , 35) is split into two sub-cells: ( $T_{2}$ , $S_{2}$ , 27) and ( $T_{2}$ , $S_{2}$ , 11). The workload of the first sub-cell is 27. We can compute the frequency value of the cluster $T_{2}$ in this fragment according to Eq. (3.2.2).

$\displaystyle f_{\textit{first}}(T_{2})=\left\lceil\frac{w-w_{in}\ast 3}{w_{in}+w_{\textit{out}}\ast 3}\right\rceil=\left\lceil\frac{27-3}{1+3}\right\rceil=6$

The frequency value of the cluster $T_{2}$ in the original cell ( $T_{2}$ , $S_{2}$ , 35) is 8. Hence, the ratio of the frequency value of the cluster $T_{2}$ in the first sub-cell to the total frequency value of the cluster $T_{2}$ in the original cell is $6/8=3/4$ . For each incoming tuple with $T.jk=$ 2 from relation $T$ , the map function sends it to the reducer $R_{2}$ with probability $3/4$ and to the reducer $R_{3}$ with probability $1-3/4=1/4$ . In contrast, input tuples from relation $S$ with $S.jk=$ 2 need to be replicated to both reducers.
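The arithmetic of this example can be checked with a few lines of Python. This is a sketch under assumptions: the cell workload is taken to be $w_{in}\cdot(f(T)+f(S))+w_{\textit{out}}\cdot f(T)\cdot f(S)$ , which is consistent with the workloads 35, 27 and 11 above when $f(T_{2})=8$ , $f(S_{2})=3$ and $w_{in}=w_{\textit{out}}=1$ ; the function names are hypothetical.

```python
import math

def cell_workload(f_t, f_s, w_in=1.0, w_out=1.0):
    """Assumed cost of a cell: weighted input size plus weighted output size."""
    return w_in * (f_t + f_s) + w_out * f_t * f_s

def fragment_frequency(sub_workload, f_replicated, w_in=1.0, w_out=1.0):
    """Frequency of the fragmented cluster that fits into a sub-cell workload."""
    return math.ceil((sub_workload - w_in * f_replicated)
                     / (w_in + w_out * f_replicated))

f_t2, f_s2 = 8, 3                         # f(S_2) = 3 is read off the formula above
f_first = fragment_frequency(27, f_s2)    # tuples of T_2 in the first sub-cell
p_first = f_first / f_t2                  # probability of routing a T_2 tuple to R_2
print(f_first, p_first, 1 - p_first)
```

With these values the first fragment holds 6 of the 8 tuples of $T_{2}$ , which gives the routing probabilities $3/4$ and $1/4$ used by the map function.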

Compared to the range-based method and the MDRP method, the CCP algorithm provides fine-grained fragment-replicate control. In the range-based method and the MDRP method, a sub-range of a relation is chosen to be fragmented or replicated, while our approach determines whether a cluster should be fragmented or replicated based on the frequency value of the cluster. In one heavy cell, $T_{i}$ may be fragmented, while in another heavy cell, $T_{j}$ may be replicated. The 1-Bucket algorithm constructs the join matrix using sampled data sets, so the dimension of the join matrix is equal to the size of the sampled data. Our method significantly reduces the dimension of the join matrix by using clusters to create it.

3.3 Implementing over MapReduce

In this section, the implementation details of CCP over MapReduce are described. We first introduce the sampling stage, which samples $n$ records from each relation in a MapReduce job. Sampling is the process of choosing a representative sample from a target population and collecting data from that sample to recognize the statistical properties of the underlying data. It has been established as an effective tool for reducing the size of input data and avoiding huge costs in any subsequent processing.

In our implementation, for an incoming tuple from $S$ or $T$ , the map function outputs the tuple with probability $n/|S|$ or $n/|T|$ , respectively; otherwise, the tuple is discarded. The sampled tuples are sorted by join key and grouped in a single reduce task to calculate the frequency value of each cluster. Obviously, only clusters which belong to the intersection ( $S\cap T$ ) of $S$ and $T$ will produce join output in an equi-join. To further reduce the dimension of the cluster join matrix, we discard clusters that do not belong to $S\cap T$ in each data set.
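The sampling stage described above can be sketched in plain Python. This is an in-memory illustration rather than a MapReduce job, and the function names and the toy key lists are assumptions made for the example.

```python
import random
from collections import Counter

def sample_join_keys(join_keys, n, rng):
    """Keep each key independently with probability n / |relation|."""
    p = min(1.0, n / len(join_keys))
    return [jk for jk in join_keys if rng.random() < p]

def cluster_frequencies(sample_s, sample_t):
    """Count keys per cluster and keep only clusters present in both samples."""
    freq_s, freq_t = Counter(sample_s), Counter(sample_t)
    common = freq_s.keys() & freq_t.keys()
    return ({k: freq_s[k] for k in common},
            {k: freq_t[k] for k in common})

rng = random.Random(42)
# Toy relations: key 1 is skewed, the remaining keys are random.
s_keys = [1] * 500 + [rng.randint(2, 20) for _ in range(500)]
t_keys = [1] * 50 + [rng.randint(10, 30) for _ in range(950)]
freq_s, freq_t = cluster_frequencies(sample_join_keys(s_keys, 200, rng),
                                     sample_join_keys(t_keys, 200, rng))
print(len(freq_s), "clusters remain in the join matrix")
```

Because each tuple is kept independently, the sample contains approximately, not exactly, $n$ tuples per relation.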

Using clusters’ statistics information, we can create a cluster join matrix and split the heavy cells in the matrix. Before the MapReduce job starts, the partitioning matrix is copied to all the mappers. This is achieved by a facility called DistributedCache, which is provided by the MapReduce framework to cache the files required by the applications.

The pseudocode of the map function is shown in Algorithm 1. For each incoming tuple from relation $T$ , we get the candidate cell in the matrix that satisfies the join condition (Lines 1 $\sim$ 2). Note that the matrix-to-reducer mapping has been obtained before this MapReduce job starts. We can get the list of reducers for the candidate cell with the listReducer() function (Line 3). Then the tuple is determined to be fragmented or replicated based on the frequency values of the clusters in this candidate cell (Line 4). If the input tuple $x\in T$ is to be fragmented, we calculate how many tuples belonging to the cluster $T_{\textit{candidatecell}}$ should be sent to the reducer $R$ [ $i$ ]. Then the map function sends this tuple to the reducer $R$ [ $i$ ] with probability $f$ ( $R$ [ $i$ ])/ $f$ ( $T_{\textit{candidatecell}}$ ), where $f$ ( $T_{\textit{candidatecell}}$ ) indicates the frequency value of the cluster $T_{\textit{candidatecell}}$ before it is fragmented (Lines 5 $\sim$ 9). Otherwise, the input tuple is replicated to all reducers returned by the listReducer() function (Lines 10 $\sim$ 14).

Algorithm 1. Map.
Input: An input tuple $x\in T\cup S$ , a cluster partitioning matrix $M$
1:	if $x\in T$ then
2:	  candidatecell $=M$ .getCell( $x . j k$ , $T$ )
3:	  $R=$ candidatecell.listReducer()
4:	  if $M$ .fragment( $x . j k$ , $T$ ) $==$ true then
5:	    for $i=$ 1 to $\|R\|$ do
6:	      $f$ ( $R$ [ $i$ ]) $=$ compute( $w$ ( $R$ [ $i$ ]), $f$ ( $S_{\textit{candidatecell}}$ ))
7:	      $p=f$ ( $R$ [ $i$ ])/ $f$ ( $T_{\textit{candidatecell}}$ )
8:	      output( $p$ , $R$ [ $i$ ], ( $x$ , $T$ ))
9:	    end for
10:	  else
11:	    for $i=$ 1 to $\|R\|$ do
12:	      output( $R$ [ $i$ ], ( $x$ , $T$ ))
13:	    end for
14:	  end if
15:	end if
16:	/* $x\in S$ is processed similarly. */
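The routing decision of the map function can be sketched as follows. This is a simplified Python illustration, not the Hadoop implementation: the function `route_tuple` and its parameters are hypothetical, the fragment frequencies $f$ ( $R$ [ $i$ ]) are assumed to be precomputed, and the probabilistic output of Line 8 is interpreted here as a single weighted random choice among the reducers of the candidate cell.

```python
import random

def route_tuple(fragmented, reducers, fragment_freqs, rng):
    """Return the reducers that should receive one input tuple.

    fragmented     -- True if the tuple's cluster is fragmented in its cell
    reducers       -- reducers of the candidate cell (listReducer())
    fragment_freqs -- f(R[i]) for each reducer; used only when fragmented
    """
    if fragmented:
        # One weighted draw: reducer i is chosen with probability
        # f(R[i]) / f(T_candidatecell).
        return rng.choices(reducers, weights=fragment_freqs, k=1)
    # Replicated cluster: every reducer of the cell gets a copy.
    return list(reducers)

rng = random.Random(0)
# Example from Section 3.2.4: T_2 is fragmented 6:2 over R2 and R3,
# while S_2 is replicated to both reducers.
t_targets = [route_tuple(True, ["R2", "R3"], [6, 2], rng)[0]
             for _ in range(10000)]
s_targets = route_tuple(False, ["R2", "R3"], None, rng)
print(t_targets.count("R2") / len(t_targets), s_targets)
```

Over many tuples, the fraction routed to `R2` approaches $3/4$ , matching the probabilities derived in Section 3.2.4.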

At the end of the Map function, each reducer receives a list of tuples from relations $S$ and $T$ .

The pseudocode of the reduce function is shown in Algorithm 2. The reduce function separates and buffers the input records into two initially empty sets according to the table tag (Lines 1–8). It then computes the cross-product between the records in these two sets (Lines 9–13).

Algorithm 2. Reduce.
Input: (reducerId, [( $t_{1},T$ ), ( $t_{2},T$ ), …, ( $s_{1},S$ ), ( $s_{2},S$ ), …])
1:	Ttuples $=\emptyset$ , Stuples $=\emptyset$
2:	for each ( $x_{i}$ , relation) in input list do
3:	  if relation $=$ “T” then
4:	    Ttuples $=$ Ttuples $\cup$ { $x_{i}$ }
5:	  else
6:	    Stuples $=$ Stuples $\cup$ { $x_{i}$ }
7:	  end if
8:	end for
9:	for each $t_{i}$ in Ttuples do
10:	  for each $s_{j}$ in Stuples do
11:	    output $t_{i}\bowtie s_{j}$
12:	  end for
13:	end for
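Algorithm 2 translates almost directly into Python. The sketch below mirrors the pseudocode as written and makes two assumptions of its own: lists are used instead of sets so that duplicate tuples are preserved, and the joined pair is represented as a plain tuple. The function name `reduce_join` is hypothetical.

```python
def reduce_join(tagged_tuples):
    """Buffer records by relation tag, then emit their cross-product."""
    t_tuples, s_tuples = [], []
    for record, relation in tagged_tuples:
        if relation == "T":
            t_tuples.append(record)
        else:
            s_tuples.append(record)
    # Cross-product of the two buffers, as in Lines 9-13 of Algorithm 2.
    for t in t_tuples:
        for s in s_tuples:
            yield (t, s)

received = [("t1", "T"), ("s1", "S"), ("t2", "T"), ("s2", "S")]
joined = list(reduce_join(received))
print(joined)
```

Like the pseudocode, this sketch assumes that all buffered records belong to one matching pair of clusters; a reducer that handles several clusters would apply it once per join key.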

4. Experiments

4.1 Experimental environment and datasets

Our experiment platform is a cluster of 4 nodes running Hadoop 2.6.0. One machine serves as the master node, and the remaining 3 nodes act as the worker nodes. Each node is equipped with a 3.30 GHz quad-core Intel i5-4590 Central Processing Unit (CPU), 8 GB of Random-Access Memory (RAM), and a 1 TB hard disk. To show the detailed number of tuples in each reducer as the weight values of input and output vary when evaluating the cost model, the number of reducers in the cluster is set to 10, and the Hadoop distributed file system block size is set to 128 MB.

We generate two independent datasets as the input relations of the join algorithm. For the sake of performance comparison, we construct datasets of cardinality 1,000,000 with varying degrees of skew. Each dataset contains three attribute fields ( $p k$ , $j k$ , description). The join key $j k$ is an integer between 1 and 1000. In all datasets, the skewed join key has the attribute value “1”. To represent the datasets simply and directly, we follow the convention used in [18]: $d1$ represents an input dataset that contains only a single 1 as the join key, with all the other join key attributes randomly assigned values from 2 to 1000. Similarly, $d1K$ represents an input dataset with 1000 1s, where the remaining 999,000 join keys are randomly assigned values from 2 to 1000.
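A generator for datasets following this naming convention might look like the sketch below. It is an assumption-based illustration: the function name `make_dataset`, the content of the description field, the shuffling of rows, and the use of a uniform distribution for the non-skewed keys are not specified in the text.

```python
import random

def make_dataset(num_skewed, cardinality=1_000_000, max_key=1000, seed=0):
    """Build (pk, jk, description) rows with `num_skewed` rows of join key 1."""
    rng = random.Random(seed)
    keys = [1] * num_skewed
    # Remaining join keys are drawn uniformly from 2..max_key (assumption).
    keys += [rng.randint(2, max_key) for _ in range(cardinality - num_skewed)]
    rng.shuffle(keys)
    return [(pk, jk, f"row-{pk}") for pk, jk in enumerate(keys)]

# A scaled-down stand-in for d1K: 10 skewed rows out of 1,000.
small = make_dataset(num_skewed=10, cardinality=1_000, max_key=100)
print(sum(1 for _, jk, _ in small if jk == 1), len(small))
```

Calling `make_dataset(1000)` with the default cardinality would correspond to $d1K$ under these assumptions.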

4.2 Experimental results and analysis

4.2.1 Performance evaluation

We first conduct experiments to compare the performance of our proposed CCP algorithm with four other partitioning approaches (HASH, RANGE, RANDOM, and MDRP). We implemented the Hash partitioning method ourselves, and the Range partitioning method is implemented based on [18] and [10]. The MDRP algorithm is implemented according to [3]. For the virtual processor Range partitioning method, we set the virtual processor factor to 2, i.e., the total number of partitions is 20. To reduce the amount of replication, all the partitions of a table are mutually disjoint except for the highly skewed partitions. The Random algorithm is an implementation of the M-Bucket-O algorithm proposed by Okcan and Riedewald [10]. Experiments in [10] show that the algorithm achieves better load balancing when the number of buckets is greater than or equal to 100, so the number of buckets in the equi-depth histogram is set to 100. The MDRP requires a square matrix based on deterministic sub-ranges of data. The dimension of the matrix is equal to the number of reducers in the cluster. In our implementation, we have overridden the default hash partitioning method in MapReduce. Each experiment is conducted three times and the mean of those values is presented.

Figure 7. Execution times on skew datasets.

Figure 7 shows the performance comparison of the algorithms while varying the degree of skew in the input data. The results of $d1\bowtie d1K$ in Fig. 7 clearly show that Hash performs better than the other algorithms in little- or no-skew situations, since Hash skips the phases of sampling, creating the partitioning matrix, and assigning cells to reducers. In the Map phase, it only needs to compute a hash value to determine a destination reducer, while the other algorithms have to search all candidate cells in the join matrix. Range, MDRP and CCP show similar elapsed times. The Random method needs to compute an approximate equi-depth histogram which contains $k$ buckets on both input data sets. Each region in this $k$ -dimensional join matrix must be a rectangle, which results in a large amount of useless data being replicated at multiple reducers.

As the skew increases, the performance of the Hash algorithm starts degrading. This is because the keys are distributed among the reducer nodes according to the hash code of the join key. The reducer receiving the skewed keys is overloaded compared to the other reducers, hence the overloaded reducer takes more time to compute the join [18]. We notice that the execution time of Range increases sharply from $d1K\bowtie d500K$ to $d10K\bowtie d100K$ . This is because the Range algorithm only considers the input balance among reducers. Even if all the reducers have the same amount of input data, the output may vary considerably across reducers. We can see in Fig. 7 that the MDRP algorithm and the CCP algorithm proposed in this paper show consistent performance regardless of the degree of skew. The reason is that the MDRP creates the partitioning matrix based on the sub-ranges of the input datasets and mainly considers the output balance when assigning candidate cells to reducers. However, every sub-range has almost the same number of input tuples. For the purpose of load balancing, the heavy cells have to be chopped into several non-heavy cells. Obviously, the number of input tuples in each sub-cell is smaller than in the other cells of the matrix, while the output in all reducers is balanced. In this experiment, each reducer produces a large number of output tuples, so the input balance of the reducers is not very important. Compared with the MDRP algorithm, our CCP algorithm achieves a slightly better input load balancing result on reducers. Therefore, the MDRP and CCP methods give similar results. The number of join output tuples in each reducer is much larger than the number of received input tuples, as can be seen in the experiments in Subsection 4.2.2. The Random algorithm is also robust. However, its overhead of useless data duplication leads to a longer execution time.

Since the total workload is fixed, the optimal workload of each reducer decreases as the number of reducers increases. Our proposed algorithm divides all the heavy cells into non-heavy sub-cells. These sub-cells are assigned to reducers based on the greedy heuristics. Theorem 2 in Subsection 3.2.3 shows that in the worst case, the workload imbalance is less than or equal to 2. Therefore, the elapsed time decreases as the number of reducers increases. Moreover, the CCP algorithm distributes the tuples evenly to the reducers. Hence its performance scales with the number of reducers.

4.2.2 Evaluating the cost model

The MDRP mainly considers the output balance when assigning candidate cells to reducers. It assumes every cell in the join matrix has the same number of input tuples. In contrast, the CCP algorithm is a hybrid method that considers both input and output load. As discussed in Section 3.1, we define the weight function for load balancing among the reducers as: $w(r)=c_{in}(r)+c_{\textit{out}}(r)=w_{in}*\textit{input}(r)+w_{\textit{out}}*\textit{output}(r)$ . For our experiments in Section 4.2.1, the values of $w_{in}$ and $w_{\textit{out}}$ are both set to 1. In this section, we adjust the weight values of $w_{in}$ and $w_{\textit{out}}$ to observe changes in the size of the input and output data among reducers.

Figure 8. Load balancing in $d10K\bowtie d300K$ .

In the $d10K\bowtie d300K$ case, the join produces the largest number of outputs, with the skewed join key value “1” appearing over 3 billion times in the join output. So we first choose $d10K$ and $d300K$ as our test datasets. Since the number of output tuples is thousands of times larger than the number of input tuples, in our experiments $w_{\textit{out}}$ is set to 0.002 and 0.0002 while $w_{in}$ is set to 1; otherwise, changes in the amount of input data are not obvious. Secondly, a no-skew case $d1\bowtie d1K$ is selected; the results show that our CCP algorithm does not lead to load imbalance although it repartitions the data.
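The effect of these weight settings can be made concrete with a small calculation for the skewed cluster alone. The sketch below is illustrative: the function name `weighted_costs` is hypothetical, and it considers only the tuples with join key “1” (10,000 from $d10K$ and 300,000 from $d300K$ ), ignoring replication and all other keys.

```python
def weighted_costs(input_tuples, output_tuples, w_in=1.0, w_out=1.0):
    """Return the input and output terms of w = w_in*input + w_out*output."""
    return w_in * input_tuples, w_out * output_tuples

skew_input = 10_000 + 300_000     # tuples with join key 1 in both relations
skew_output = 10_000 * 300_000    # join results produced by join key 1

ratios = {}
for w_out in (1.0, 0.002, 0.0002):
    c_in, c_out = weighted_costs(skew_input, skew_output, w_out=w_out)
    ratios[w_out] = c_out / c_in  # how strongly the output term dominates
    print(w_out, c_in, c_out, round(ratios[w_out], 2))
```

With $w_{\textit{out}}=1$ the output term is several thousand times larger than the input term, while with $w_{\textit{out}}=0.0002$ the two terms are of the same order of magnitude, so the input size starts to influence how cells are split and assigned.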

Figure 8 shows the size of the input and output data in each reducer for different weight values. In Fig. 8a, the heavy cell in the join matrix is split into 8 non-heavy candidate cells and these non-heavy cells are assigned to 8 reducers. Reducer 1 and reducer 2 account for a large proportion of the non-heavy cells whose join key values range from 2 to 1000. As we adjust the output weight value $w_{\textit{out}}$ from 1 to 0.002 in Fig. 8b, 6.3 reducers are required to compute the data with the join key value “1” and the remaining 4 reducers are used for processing the other join keys. Notice that 4% of the data with the join key value “1” in the dataset $d300K$ is assigned to the reducer R4, and all the data with the join key value “1” in the dataset $d10K$ is replicated to R4 $\sim$ R10. When we set $w_{\textit{out}}=$ 0.0002 in Fig. 8c, a better load balancing for input is achieved compared with Fig. 8a and b. Since $w_{in}=$ 1 and $w_{\textit{out}}=$ 0.0002, the input cost dominates the total cost of the reducers.

Figure 9 compares the load balancing results with Hash in the $d1\bowtie d1K$ case. The result in Fig. 9b looks exactly the same as the result in Fig. 9a. For each reducer, there is little difference between the size of the input and the size of the output. As is evident from Fig. 9, in the case of little or no skew, the CCP algorithm does not destroy the balance among reducers even if the weight values of input and output are changed.

4.2.3 Extra cost

Before the MapReduce job starts, we sample approximately $n$ records from both datasets and create the cluster join matrix using the sampled data. Then the partitioning matrix is copied to all the mappers by Distributed Cache. This process will add extra cost.

Figure 9. Load balancing in $d1\bowtie d1K$ .

Figure 10. Processing times for the extra cost.

Figure 10a shows the details of the extra cost as the sample size varies. The creation time and the distribution time are very small compared with the sampling time. As the number of sampled records increases, the processing time of the sampling also increases. However, the increase is small. This is easy to understand: when the input dataset is fixed, each mapper has to output more records as the sample size increases.

In Fig. 10a, the sampling cost dominates the extra cost, and it increases with the amount of sampled data. To further understand the sampling cost, we change the size of the input dataset to observe the trend of the sampling time. In Fig. 10b, the sampling time increases with the size of the dataset when the sample size is fixed. However, as in Fig. 10a, the increase is relatively small. Note that in our experiments the MapReduce program was developed with MyEclipse on Windows 7, and the sampling time includes the time to initialize Java Virtual Machine (JVM) metrics. From the experimental results, we can conclude that the extra cost is affordable because the gains from handling the skewed input data are significantly bigger than the sampling cost.

5. Related work

Effective handling of skew is an important problem in any parallel system because improper skew handling can counter all the benefits of parallel processing [20]. There has been extensive research on handling data skew in parallel databases. While MapReduce shares many challenges and solutions, the fixed execution phases (map, shuffle, reduce) and user-defined functions differentiate the practices for MapReduce applications from the skew-resistant relational algorithms in parallel databases [21].

Kwon et al. [22] presented SkewReduce, a system that statically optimizes the data partitioning according to user-defined cost functions. The approach effectively addresses potential data skew problems, but it relies on domain knowledge from users and is limited to specific types of applications. In 2012, they proposed another system called SkewTune [23], which tackles the data skew problem from a different angle. It does not aim to partition the intermediate data evenly at the beginning. Instead, it adjusts the data partition dynamically: after detecting a straggler task, it repartitions the unprocessed data of the task and assigns them to new tasks on other nodes. SkewTune fully utilizes the nodes in the cluster and preserves the ordering of the input data so that the original output can be reconstructed by concatenation. But it does not detect or split large keys and hence cannot make a better partition decision. Chen et al. [4] developed LIBRA (Lightweight Implementation of Balanced Range Assignment), a lightweight strategy to address the data skew problem. LIBRA and SkewTune are complementary to each other. When the load changes dynamically or when a reduce failure occurs, it is better to mitigate skew lazily using SkewTune. On the other hand, when the load is relatively stable, LIBRA can better balance the copy and the sort phases in reduce tasks, and its large cluster split optimization can improve the performance further when application semantics permit. One feature of LIBRA is its support of large cluster split. This feature is similar to our cluster cost partitioning algorithm, but LIBRA only considers a single input data set, which is not applicable to join operations. Ibrahim et al. [1] designed the LEEN (Locality-aware and Fairness-aware key partitioning) algorithm to determine the corresponding partition of map output based on the frequency of key-value pairs.
All the above methods belong to the strategy of changing the internals of Hadoop or building a new layer on top of Hadoop. In contrast, the CCP algorithm is designed on top of Hadoop and does not require any modifications to the MapReduce environment. The work in [24] proposed two load balancing approaches, fine partitioning and dynamic fragmentation. Fine partitioning produces a fixed number of data partitions, while dynamic fragmentation dynamically splits large partitions into smaller portions and replicates data if necessary. The rationale for splitting large clusters in the fine partitioning and dynamic fragmentation approaches is similar to that of our CCP algorithm. However, these two methods focus on processing a single dataset, and their cost model of the cluster is different from our cost model. Gao et al. [25] proposed a two-stage strategy and a partition tuning method that disperses key-value pairs in virtual partitions and recombines each partition in case of data skew. The proposed Partition Tuning-based Skew Handling (PTSH) algorithm is more suitable for association rule mining on healthcare data. It is similar to LEEN and to the fine partitioning algorithm in [24]. The difference between our CCP and these approaches is that the CCP algorithm can handle data skew in joins.

There has been much work on devising efficient join algorithms using the MapReduce framework. Blanas et al. [11] surveyed several well-known join strategies in MapReduce. Among these join algorithms, repartition join is the most commonly used join strategy in the MapReduce framework. In this join strategy, the join datasets are dynamically partitioned on the join key and the corresponding pairs of partitions are joined. When one of the reference tables is much smaller than the other table, broadcast join is a better choice. However, strategies based on a hash function for partitioning the data do not handle skew in the input data effectively. Atta et al. [18] introduced the “Skew Handling join” that employs range partitioning instead of hash partitioning for load distribution. This helps to balance the input workloads. A limitation of their algorithm is that it does not consider the output workloads of reducers. Almost at the same time, Okcan and Riedewald [10] proposed the randomized algorithm called 1-Bucket-Theta for arbitrary joins in a single MapReduce job. They also derived the M-Bucket class of algorithms that can improve the runtime of theta-joins compared to 1-Bucket-Theta by exploiting input statistics to exclude large regions of the join matrix. The general idea of our cluster cost partitioning method comes from this work. The difference between CCP and the randomized algorithm is that we use clusters to create the join matrix. Obviously, the number of clusters is smaller than the number of tuples in the two join relations, so using clusters to create the join matrix reduces the size of each dimension of the matrix. Zhang et al. [26] extended the randomized method to multi-way theta-join queries. Their algorithm can process a multi-way theta-join in a single MapReduce job; they proposed a Hilbert-curve-based space partitioning method that minimizes the data copying volume over the network and balances the workload among reduce tasks.
Processing multi-way joins in a single MapReduce job requires replicating the map output records multiple times, so the data transmission from the Map phase to the Reduce phase becomes a bottleneck in the join execution. Myung et al. [3] presented the MDRP method, which uses sub-ranges of the two relations to create the partitioning matrix. It is stochastic like the range-based algorithm and the M-Bucket algorithm. This work mainly focuses on optimizing for either input or output. Our CCP algorithm can achieve a slightly better load balancing result than the MDRP. Since the MDRP creates the join matrix based on the sub-ranges of the join relations, every sub-range has almost the same number of input tuples. After a heavy cell is split into several non-heavy cells, the number of input tuples in each sub-cell is smaller than in the other cells of the matrix, while the output in all reducers is balanced. The CCP algorithm considers both the input and the output balance when assigning non-heavy cells to reducers. Vitorovic et al. [13] were the first to employ rectangle tiling algorithms for join load-balancing; their method considers the properties of both input and output data through sampling of the original join matrix. It introduced a coarsening stage to further reduce the regionalization input and built an equi-weight histogram to capture workload skew and partition the work. Our cost model is similar to the cost model in [13]. The difference between the two algorithms is that we use the cluster instead of the join key to create the join matrix, so the size of each sub-cell that is assigned to reducers can be computed based on the frequency value of the cluster in the join matrix. Hassan and Bamha [27] introduced a groupBy-join algorithm called MRFAG-Join (MapReduce Frequency Adaptive GroupBy-join) based on distributed histograms to get detailed information about data distribution. The histogram used in MRFAG-join and the frequency values of clusters in our CCP play the same role.
However, the MRFAG-join proceeds in three MapReduce jobs: it requires two additional MapReduce jobs to compute the distributed histogram and the partial aggregation of relevant data. Zhao et al. [28] presented the KNN-DP (K-Nearest Neighbors Data Partitioning) algorithm to handle data skewness in KNN joins. The partition strategy used in the KNN-DP algorithm is like the range-based partitioning method. The difference between KNN-DP and CCP is that the k-nearest-neighbor join combines the KNN query and the join operation, which makes it a very expensive operation.

6. Conclusions

In this paper, we address the problem of load imbalance of reducers in parallel joins for MapReduce. After providing a survey of current skew handling methods, we propose a novel skew mitigating algorithm based on cluster cost partitioning, which considers both input and output load imbalance among reducers. Using our cost model, all the heavy cells in the cluster join matrix are split into non-heavy cells and assigned to reducers. Skewed clusters in the join matrix are fragmented, and the size of each fragment can be computed precisely since the frequency values of each cluster in the join matrix have been obtained through a sampling of the original datasets. The CCP algorithm is capable of handling skew in different applications by adjusting the weight values of input and output in the cost model. The experimental results show that the CCP algorithm achieves better execution times and load balancing results.

For future work, we will further extend this algorithm to multi-join queries on large-scale systems. As MapReduce lacks a schema, a declarative query language, and indexes, we will also explore indexing methods to speed up join queries.

Acknowledgments

This work is partly supported by the National Science Foundation of China under Grant Nos. 61640209 and 91746116, the Science and Technology Project of Sichuan under Grant No. SCMZ2006 012, and the Science and Technology Project of Guizhou under Grant No. [2014]2004, [2014]2001, [2016]7433, [2018]5702 and [2015]13.

Authors’ Bios

	Yang Wang received his Bachelor's degree in computer science from Xuchang University in 2011 and his MSc degree from Guizhou University in 2014. He joined the Chengdu Institute of Computer Application, Chinese Academy of Sciences, as a PhD student in 2014. His research interests include big data analytics and distributed and parallel computing.
	Yong Zhong received his MSc degree from Chengdu University of Technology in 1994 and his PhD degree from the University of Chinese Academy of Sciences in 2002. He is a research fellow at the Chengdu Institute of Computer Application, Chinese Academy of Sciences. His research interests include big data analytics and data mining.
	Qingshan Ma received his Bachelor's degree in computer science from Shanxi University in 2014. He joined the Chengdu Institute of Computer Application, Chinese Academy of Sciences, as an MSc student in 2015. His research interests include big data analytics and join algorithms based on the MapReduce model.
	Guanci Yang received his MSc degree from Guizhou University in 2009 and his PhD degree from the University of Chinese Academy of Sciences in 2012. He is a professor at Guizhou University. His research interests include computational intelligence and social robots, multi-objective optimization of complex systems, and big data analytics. He is a member of the Institute of Electrical and Electronics Engineers and the China Computer Federation.

References

Ibrahim

Jin

B.S.

Antoniu

and Wu

, Handling partitioning skew in MapReduce using LEEN, Peer-to-Peer Networking and Applications 6(4) (2013), 409–424.

Al Hajj Hassan

Bamha

and Loulergue

, Handling data-skew effects in Join operations using MapReduce, 14th International Conference on Computational Science (ICCS 2014), Cairns, Australia, (2014), 145–158.

Myung

Shim

Yeon

and Lee

S.G.

, Handling data skew in join algorithms using MapReduce, Expert Systems with Applications 51 (2016), 286–299.

Chen

Yao

J.Y.

and Xiao

, LIBRA: Lightweight data skew mitigation in MapReduce, IEEE Transactions on Parallel and Distributed Systems 26(9) (2015), 2520–2533.

Walton

C.B.

Dale

A.G.

and Jenevein

R.M.

, A taxonomy and performance model of data skew effects in parallel joins, Proceedings of the 17th International Conference on Very Large Data Bases, San Francisco, USA, (1991), 537–548.

Poosala

and Ioannidis

Y.E.

, Estimation of query-result distribution and its application in parallel join load balancing, Proceedings of the 22th International Conference on Very Large Data Bases, San Francisco, USA, (1996), 448–459.

and Kostamaa

, Efficient outer join data skew handling in parallel DBMS, VLDB Endowment 2(2) (2009), 1390–1396.

Acharya

Gibbons

P.B.

and Poosala

, Congressional samples for approximate answering of group-by queries, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, USA, (2000), 487–498.

Shatdal

and Naughton

J.F.

, Adaptive parallel aggregation algorithms, Acm Sigmod Record 24(2) (1995), 104–114.

10.

Okcan

and Riedewald

, Processing theta-joins using MapReduce, Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, (2011), 949–960.

11.

Blanas

Patel

J.M.

Ercegovac

Rao

Shekita

E.J.

and Tian

Y.Y.

, A comparison of join algorithms for log processing in MapReduce, Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, USA, (2010), 975–986.

12. F.N. Afrati and J.D. Ullman, Optimizing joins in a Map-Reduce environment, Proceedings of the 13th International Conference on Extending Database Technology, Lausanne, Switzerland, (2010), 99–110.

13. Vitorovic, Elseidy and Koch, Load balancing and skew resilience for parallel joins, IEEE 32nd International Conference on Data Engineering, Helsinki, Finland, (2016), 313–324.

14. H.C. Yang, Dasdan, R.L. Hsiao and D.S. Parker, Map-Reduce-Merge: Simplified relational data processing on large clusters, Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, Beijing, China, (2007), 1029–1040.

15. D.W. Jiang, A.K.H. Tung and Chen, MAP-JOIN-REDUCE: Towards scalable and efficient data analysis on large clusters, IEEE Transactions on Knowledge & Data Engineering 23(9) (2011), 1299–1311.

16. Lin, Agrawal, Chen, B.C. Ooi and Wu, Llama: Leveraging columnar storage for scalable join processing in the MapReduce framework, Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, (2011), 961–972.

17. Kaldewey, E.J. Shekita and Tata, Clydesdale: Structured data processing on MapReduce, 15th International Conference on Extending Database Technology, EDBT ’12, Berlin, Germany, (2012), 15–25.

18. Atta, S.D. Viglas and Niazi, SAND join – A skew handling join algorithm for Google’s MapReduce framework, IEEE 14th International Multitopic Conference, Karachi, Pakistan, (2011), 170–175.

19. Atta, Implementation and analysis of join algorithms to handle skew for the Hadoop Map/Reduce framework, University of Edinburgh, 2010.

20. D.J. DeWitt and Gray, Parallel database systems: The future of high performance database systems, Communications of the ACM 35(6) (1992), 85–98.

21. Kwon, Balazinska, Howe and Rolia, A study of skew in MapReduce applications, Open Cirrus Summit, Moscow, Russia, (2011).

22. Kwon, Balazinska, Howe and Rolia, Skew-resistant parallel processing of feature-extracting scientific user-defined functions, Proceedings of the 1st ACM Symposium on Cloud Computing, Indianapolis, USA, (2010), 75–86.

23. Kwon, Balazinska, Howe and Rolia, SkewTune: Mitigating skew in MapReduce applications, Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, Scottsdale, USA, (2012), 25–36.

24. Gufler, Augsten, Reiser and Kemper, Handling data skew in MapReduce, International Conference on Cloud Computing and Services Science, Noordwijkerhout, Netherlands, (2012), 574–583.

25. Gao, Zhou, Shi and Zhang, Handling data skew in MapReduce cluster by using partition tuning, Journal of Healthcare Engineering 2017(5) (2017), 1–12.

26. X.F. Zhang, Chen and Wang, Efficient multi-way theta-join processing using MapReduce, Proceedings of the VLDB Endowment 5(11) (2012), 1184–1195.

27. M.A.H. Hassan and Bamha, Towards scalability and data skew handling in GroupBy-Joins using MapReduce model, Procedia Computer Science 51(1) (2015), 70–79.

28. Zhao, Zhang and Qin, KNN-DP: Handling data skewness in kNN joins using MapReduce, IEEE Transactions on Parallel and Distributed Systems, (2017). doi: 10.1109/TPDS.2017.2767596.