Algorithms for the semi-online MapReduce scheduling problem on two identical machines

Abstract

We investigate the semi-online problem of MapReduce scheduling on two parallel machines, with the aim of minimizing the makespan. Jobs are released over-list, and each job consists of a map task and a reduce task. A job's map task can be preempted and processed simultaneously on different machines, whereas its reduce task is non-preemptive and cannot start until the map task is complete. We consider two versions. In the first, the processing time of the largest reduce task is known beforehand, and we design an optimal 4/3-competitive semi-online algorithm. In the second, both the processing time of the largest reduce task and the total sum of the processing times are known in advance, and we present a 4/3-competitive semi-online algorithm. We conclude that this algorithm is the best possible when the largest reduce task meets certain conditions.

Keywords

MapReduce system; semi-online scheduling; competitive ratio; makespan

1 Introduction

With the advent of the cloud age, big data has attracted increasing attention and has been integrated into business, energy information, economics, physics, and other fields. Given the complex and growing structure of big data, it is particularly important to find a simple and feasible processing solution, and MapReduce [1] is such a programming model, used for parallel operations on large-scale datasets. Data processing in MapReduce is split into two main phases: the map phase and the reduce phase. Every job submitted to the MapReduce system includes these two phases in its computation. First, the JobTracker breaks the job down into multiple tasks and assigns them to the TaskTrackers. Then the input data is divided into fixed-size fragments, each of which is processed by a map task. In the map phase, the raw data is converted into key-value pairs. Finally, in the reduce phase, values with the same key are processed and new key-value pairs are output as the final result. The overall process is shown in Figure 1. All tasks are processed on a set of parallel machines, so the problem can be modeled as a parallel machine scheduling problem.

Fig. 1

MapReduce execution flowchart.

Formally, we study the following problem in the MapReduce system. A list of jobs J ₁, J ₂, ⋯, J _n arrives over-list and is to be scheduled onto two identical parallel machines σ ₁ and σ ₂. We assume that each job J _j consists of a map task M _j and a reduce task R _j, and that the length of M _j is not greater than that of R _j. The reduce task of J _j can be processed only after its map task has been completed. For each job, the map task can be preempted and processed on different machines simultaneously, whereas the reduce task is non-preemptive. We aim to minimize the makespan. We consider the following two cases. In the first, we know the processing time of the largest reduce task beforehand. In the second, we know in advance both the processing time of the largest reduce task and the total sum of the processing times. Using the classical three-field notation, we denote these problems by P2|MR, max|C _max and P2|MR, sum & max|C _max, respectively. For simplicity, we write M1 for the problem P2|MR, max|C _max and M2 for the problem P2|MR, sum & max|C _max.

For semi-online scheduling problems, the competitive ratio is used to measure the quality of a semi-online algorithm. An algorithm A is called ρ-competitive if $\rho = \sup_{J} \left\{ \frac{C^{A} (J)}{C^{*} (J)} \right\},$ where the supremum is taken over all job sets J, C ^A (J) denotes the objective function value produced by algorithm A, and C ^* (J) denotes the optimal objective value.

A great deal of research has been devoted to different MapReduce scheduling models in recent years. In addition to empirical studies [2 –5], a large number of theoretical works have emerged [6 –10]. Zhu et al. [8] considered offline MapReduce scheduling problems on m identical machines, with the objectives of minimizing the makespan and minimizing the total completion time. For makespan minimization, they gave an optimal algorithm for the case where reduce tasks can be preempted, and devised a (3/2 - 1/2m)-approximation algorithm for the case where reduce tasks cannot be preempted. For total completion time minimization, they devised a heuristic and an approximation algorithm for the cases of preemptive and non-preemptive reduce tasks, respectively. Jiang et al. [11] extended the work of Zhu et al. and provided approximation algorithms for it.

For online scheduling problems in the MapReduce system, Moseley et al. [6] used a two-stage flexible flow shop problem to model the MapReduce system, with the goal of minimizing the total flow time. They designed an online (1 + ε)-speed O (1/ε ²)-competitive algorithm, where 0 < ε ≤ 1. Zheng et al. [7] introduced a new measure of the performance of online algorithms, called the efficiency ratio, and devised an algorithm with a small efficiency ratio for the case where reduce tasks can be preempted. Le et al. [9] discussed the online MapReduce load balancing problem under imbalanced data input. They proposed an online algorithm with a competitive ratio of 2 and a sample-based enhancement algorithm, which probabilistically achieves a competitive ratio of 3/2 with bounded errors. Luo et al. [10] investigated the online MapReduce scheduling problem where the reduce tasks of a job are unknown before the completion of its map tasks. For both the preemptive and non-preemptive cases, they designed optimal online algorithms with the same competitive ratio of (2 -1/m). Chen et al. [12] studied the online MapReduce scheduling problem where jobs are released over time, with the aim of minimizing the makespan. For non-preemptive reduce tasks, they showed a lower bound and designed a (2 -1/m)-competitive algorithm. For preemptive reduce tasks, they devised a 1-competitive online algorithm on two parallel machines. Jiang et al. [13] addressed the online problem of MapReduce scheduling on two parallel machines and devised an optimal algorithm for the case where reduce tasks can be preempted. Huang et al. [14] investigated a special case of the online over-list MapReduce processing problem on two identical machines where each job consists of only one map task and one reduce task.

Almost all of the above work on MapReduce scheduling focuses on either online or offline problems. In practice, however, these two extremes rarely occur. In most cases, we know only partial information about the problem, such as the total sum of the processing times, the largest processing time, or the ordering of the processing times. Knowing partial information about the jobs helps to arrange the execution order of tasks and to allocate resources more reasonably in the MapReduce framework, which can reduce waiting time and improve the overall efficiency of job execution. To the best of our knowledge, there is limited literature on the semi-online MapReduce scheduling problem. For classical semi-online scheduling in which the largest processing time is known beforehand, He and Zhang [15] presented a 4/3-competitive optimal algorithm. For the case where the total sum of the processing times is known beforehand, Kellerer et al. [16] also presented a 4/3-competitive optimal semi-online algorithm. Tan and He [18] considered the problem in which both sum and max are known beforehand, where sum denotes the total sum of the processing times and max denotes the processing time of the largest job. They designed a 6/5-competitive optimal algorithm. Dósa et al. [19] addressed the semi-online scheduling problem on two uniform machines where the optimal offline makespan is given beforehand, with the aim of minimizing the makespan. They designed a compound algorithm and achieved tight bounds. Chen et al. [20] discussed several semi-online scheduling problems on two identical machines with the goal of maximizing total early work. When the total processing time of all jobs is given beforehand, they devised a 6/5-competitive optimal online algorithm. They proved that the tight bound is still 6/5 when the optimal criterion value is known in advance. When the maximal job processing time and the total processing time of all jobs are both known beforehand, they showed that the tight bound is reduced to 10/9. Dwibedy et al. [21] surveyed semi-online scheduling algorithms in various parallel machine models. Zheng et al. [22] studied a semi-online scheduling problem with lookahead, where an online algorithm knows the information of the first k jobs in advance. The objective is to minimize the makespan. They investigated three situations with different initial-lookahead information and designed an online algorithm for each.

In this study, we investigate two different semi-online MapReduce scheduling problems on two parallel machines. For the case where we know the processing time of the largest reduce task beforehand, we provide a 4/3-competitive optimal semi-online algorithm. For the case where we know in advance both the total sum of the processing times and the processing time of the largest reduce task, we propose a 4/3-competitive semi-online algorithm. We conclude that this algorithm is the best possible when the largest reduce task meets certain conditions.

The remainder of the paper is organized as follows. In Sect. 2 we propose an optimal semi-online algorithm for the problem M1. In Sect. 3 we provide a semi-online algorithm for the problem M2. Finally, in Sect. 4 we conclude the paper.

2 Notations and an optimal algorithm for M1

In the remainder of the paper, we use the following notation.

p (M _j) or p (R _j): the length of the task M _j or R _j, j ∈ N ⁺.

$P_{j} = \sum_{i = 1}^{j} (p (M_{i}) + p (R_{i}))$ : the total sum of the processing times of the first j jobs, j ∈ N ⁺.

$l_{j}^{i}$ : the workload of machine σ _i after scheduling job J _j, i = 1, 2, j ∈ N ⁺.

C ^A: the makespan produced by algorithm A.

C ^*: the optimal makespan.

Next, we consider the semi-online scheduling problem P2|MR, max|C _max, in which a list of jobs J ₁, J ₂, ⋯ , J _n is released one by one and nothing is known about the jobs in advance except the length of the largest reduce task. For each job J _j, we assume that the length of its map task M _j is not greater than that of its reduce task R _j, i.e., p (M _j) ≤ p (R _j). Let R _max denote the reduce task with the largest processing time. A job whose reduce task is R _max is called a large job, and M _max denotes the map task of the large job. It is easy to see that the optimal makespan satisfies $C^{*} \geq \max \left\{ \frac{P_{n}}{2}, \frac{p (M_{max})}{2} + p (R_{max}) \right\} .$
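The two terms of this bound can be justified separately. The following derivation is a sketch that only spells out the reasoning behind the inequality stated above, using the notation already introduced.

```latex
% Load bound: the two machines together must process P_n units of work,
% so at least one of them finishes no earlier than P_n / 2.
C^{*} \;\geq\; \frac{P_{n}}{2}.

% Large-job bound: even if M_max is split evenly over both machines
% starting at time 0, it cannot be completed before p(M_max)/2, and
% R_max must then run without preemption on a single machine.
C^{*} \;\geq\; \frac{p(M_{max})}{2} + p(R_{max}).

% Both bounds hold simultaneously, hence
C^{*} \;\geq\; \max\left\{ \frac{P_{n}}{2},\; \frac{p(M_{max})}{2} + p(R_{max}) \right\}.
```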

Now, we show the lower bound of problem M1 below.

Theorem 2.1. For problem M1, any semi-online algorithm A has a competitive ratio of at least 4/3.

Proof. Suppose that p (R _max) =2. Job J ₁ = {M ₁, R ₁} and job J ₂ = {M ₂, R ₂} arrive, where p (M ₁) = p (M ₂) =0 and p (R ₁) = p (R ₂) =1. If algorithm A schedules J ₁ and J ₂ on the same machine, the last job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =2 and p (R ₃) =2. We have the makespan C ^A ≥ 4, and the optimal makespan C ^* = 3. Thus, C ^A/C ^* ≥ 4/3. If algorithm A schedules these two jobs on different machines, the last job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =0 and p (R ₃) =2. We obtain C ^A ≥ 3 and C ^* = 2. Therefore, C ^A/C ^* ≥ 3/2.

Hence, C ^A/C ^* ≥ 4/3. The Theorem holds. ■

We will provide the best possible algorithm as follows.

For the classic semi-online scheduling problem with known largest processing time, He and Zhang [15] proposed an optimal algorithm for two identical machines with a competitive ratio of 4/3. Our algorithm is based on their algorithm.

Algorithm A₁

Step 1. Schedule the current job J _j onto machine σ ₁ until one of the following two conditions occurs:

C1. The currently arrived job is a large one.

C2. If the currently arrived job were scheduled onto σ ₁, the completion time of σ ₁ would be greater than 2p (R _max).

Step 2. If C1 or C2 occurs in Step 1, schedule the currently arrived job (say J _j) as follows:

Step 2.1. If C1 in Step 1 occurs,

Step 2.1.1. If $p (M_{j}) \leq P_{j - 1}$ , then schedule the job J _j on machine σ ₂. (See Figure 2a).

Step 2.1.2. If $p (M_{j}) > P_{j - 1}$ , then partition M _j into two parts $M_{j}^{1}$ and $M_{j}^{2}$ such that $p (M_{j}^{1}) = P_{j - 1}$ . Assign $M_{j}^{1}$ to machine σ ₂, $M_{j}^{2}$ to the two machines evenly at time $l_{j - 1}^{1}$ , and R _j to machine σ ₁. (See Figure 2b).

Step 2.2. If C2 in Step 1 occurs,

Step 2.2.1. If $l_{j - 1}^{1} + p (R_{j}) \leq 2 p (R_{max})$ , then partition M _j into two parts $M_{j}^{1}$ and $M_{j}^{2}$ such that $l_{j - 1}^{1} + p (M_{j}^{1}) + p (R_{j}) = 2 p (R_{max})$ . Schedule $M_{j}^{1}$ and R _j on machine σ ₁, and $M_{j}^{2}$ on machine σ ₂. (See Figure 2c).

Step 2.2.2. If $l_{j - 1}^{1} + p (R_{j}) > 2 p (R_{max})$ , then schedule the job J _j on machine σ ₂. (See Figure 2d).

Step 3. If $l_{j}^{1} \leq l_{j}^{2}$ , renumber the two machines such that $l_{j}^{1} \geq l_{j}^{2}$ . Assign all subsequent jobs as follows:

For i = j to n - 1, do

Step 3.1. If $l_{i}^{1} - l_{i}^{2} \geq p (M_{i + 1})$ , schedule the job $J_{i + 1}$ on machine σ ₂. If $l_{i + 1}^{1} \leq l_{i + 1}^{2}$ , renumber the machines such that $l_{i + 1}^{1} \geq l_{i + 1}^{2}$ . (See Figure 3a).

Step 3.2. If $l_{i}^{1} - l_{i}^{2} < p (M_{i + 1})$ , then partition $M_{i + 1}$ into two parts $M_{i + 1}^{1}$ and $M_{i + 1}^{2}$ such that $l_{i}^{2} + p (M_{i + 1}^{1}) = l_{i}^{1}$ . Schedule $M_{i + 1}^{1}$ on machine σ ₂, $M_{i + 1}^{2}$ on the two machines evenly at time $l_{i}^{1}$ , and $R_{i + 1}$ on machine σ ₁. (See Figure 3b).

Fig. 2

Schedule produced by Step 2 of Algorithm A ₁.

Fig. 3

Schedule produced by Step 3 of Algorithm A ₁.

Remark 1. From algorithm A ₁, we always have $l_{j}^{1} \geq l_{j}^{2}$ after scheduling job J _j.
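The steps of Algorithm A ₁ can be sketched as a load-level simulation. The following Python code is a minimal illustration under stated assumptions, not the authors' implementation: it tracks only the completion times of the two machines rather than a full timetable, the names `algorithm_a1` and `greedy_step` are hypothetical, the first arriving job whose reduce length equals p (R _max) is treated as the large job, and C1 is assumed to take precedence when C1 and C2 hold at the same time.

```python
from fractions import Fraction


def greedy_step(l1, l2, m, r):
    """One iteration of Step 3; expects l1 >= l2 and returns loads with l1 >= l2."""
    if l1 - l2 >= m:
        # Step 3.1: the whole job goes on the less loaded machine.
        l2 = l2 + m + r
    else:
        # Step 3.2: the map task levels both machines, then R goes on machine 1.
        level = (l1 + l2 + m) / 2
        l1, l2 = level + r, level
    return max(l1, l2), min(l1, l2)  # renumber the machines if needed


def algorithm_a1(jobs, r_max):
    """Makespan of Algorithm A1 for jobs given as (p(M_j), p(R_j)) pairs."""
    jobs = [(Fraction(m), Fraction(r)) for m, r in jobs]
    r_max = Fraction(r_max)
    l1 = l2 = Fraction(0)
    k = 0
    # Step 1: fill machine 1 until C1 (large job) or C2 (load would exceed 2 p(R_max)).
    while k < len(jobs):
        m, r = jobs[k]
        if r == r_max or l1 + m + r > 2 * r_max:
            break
        l1 += m + r
        k += 1
    if k == len(jobs):
        return max(l1, l2)
    m, r = jobs[k]
    if r == r_max:                    # Step 2.1 (C1; assumed to win over C2)
        if m <= l1:                   # Step 2.1.1: whole job on machine 2
            l2 = m + r
        else:                         # Step 2.1.2: split the map task
            level = (l1 + m) / 2
            l1, l2 = level + r, level
    elif l1 + r <= 2 * r_max:         # Step 2.2.1: fill machine 1 up to 2 p(R_max)
        l1, l2 = 2 * r_max, l1 + m + r - 2 * r_max
    else:                             # Step 2.2.2: whole job on machine 2
        l2 = m + r
    l1, l2 = max(l1, l2), min(l1, l2)
    # Step 3: all remaining jobs.
    for m, r in jobs[k + 1:]:
        l1, l2 = greedy_step(l1, l2, m, r)
    return l1


# First instance from the proof of Theorem 2.1 (optimum 3, so the ratio is 4/3).
print(algorithm_a1([(0, 1), (0, 1), (2, 2)], r_max=2))  # 4
```

Exact rational arithmetic is used so that the even split of a map task does not introduce rounding error when the result is compared with the bounds in the analysis.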

Theorem 2.2. For problem M1, algorithm A ₁ is 4/3-competitive, and it is optimal.

Proof. Suppose that J _n is the last job. Let $l_{n - 1}^{1}$ and $l_{n - 1}^{2}$ denote the completion times of the two machines immediately before algorithm A ₁ schedules job J _n. We consider the following two cases.

Case 1 $l_{n - 1}^{2} = 0$ . In this case, we can get J _n is a large job, i.e., J _n = {M _n, R _max}. We consider two subcases as follows.

Case 1.1 $l_{n - 1}^{1} \leq p (R_{max})$ . According to algorithm A ₁, we distinguish the following two subcases, Case 1.1.1 and Case 1.1.2.

Case 1.1.1 J _n is scheduled according to Step 2.1.1. We have $C^{A_{1}} = p (M_{n}) + p (R_{max})$ . We can easily obtain that $C^{*} = \frac{p (M_{n})}{2} + p (R_{max})$ . Therefore, $\frac{C^{A_{1}}}{C^{*}} = \frac{p (M_{n}) + p (R_{max})}{\frac{p (M_{n})}{2} + p (R_{max})} \leq \frac{4}{3},$ due to p (M _n) ≤ p (R _max).

Case 1.1.2 J _n is scheduled by Step 2.1.2. We have $C^{A_{1}} = (P_{n - 1} + p (M_{n})) / 2 + p (R_{max})$ and $C^{*} = \frac{p (M_{n})}{2} + p (R_{max})$ . Therefore, $\begin{aligned} \frac{C^{A_{1}}}{C^{*}} & = \frac{(P_{n - 1} + p (M_{n})) / 2 + p (R_{max})}{p (M_{n}) / 2 + p (R_{max})} \\ & = 1 + \frac{P_{n - 1} / 2}{p (M_{n}) / 2 + p (R_{max})} \leq \frac{4}{3}, \end{aligned}$ where the inequality is due to $P_{n - 1} < p (M_{n}) \leq p (R_{max})$ , the first part of which is the condition of Step 2.1.2.

Case 1.2 $l_{n - 1}^{1} > p (R_{max})$ . From algorithm A ₁, we have $l_{n - 1}^{1} = P_{n - 1} \leq 2 p (R_{max})$ and $l_{n}^{2} = p (M_{n}) + p (R_{max}) \leq 2 p (R_{max})$ . Thus, $C^{A_{1}} = \max \{l_{n - 1}^{1}, l_{n}^{2}\}$ and $C^{*} \geq \max \{P_{n} / 2, p (M_{n}) / 2 + p (R_{max})\}$ .

If $C^{A_{1}} = l_{n - 1}^{1}$ , then $\begin{aligned} \frac{C^{A_{1}}}{C^{*}} & \leq \frac{2 P_{n - 1}}{P_{n}} \leq \frac{2 P_{n} - 2 p (M_{n}) - 2 p (R_{max})}{P_{n}} \\ & \leq 2 - \frac{2 p (M_{n}) + 2 p (R_{max})}{l_{n - 1}^{1} + p (M_{n}) + p (R_{max})} \\ & \leq 2 - \frac{2 p (M_{n}) + 2 p (R_{max})}{p (M_{n}) + 3 p (R_{max})} \leq \frac{4}{3} . \end{aligned}$

If $C^{A_{1}} = l_{n}^{2}$ , we have $\frac{C^{A_{1}}}{C^{*}} \leq \frac{p (M_{n}) + p (R_{max})}{p (M_{n}) / 2 + p (R_{max})} \leq \frac{4}{3} .$

Case 2 $0 < l_{n - 1}^{2} \leq l_{n - 1}^{1}$ . In this case, we consider the following two subcases.

Case 2.1 $l_{n - 1}^{2} < p (R_{max})$ and no large job arrives before job J _n. Thus, J _n is a large job, i.e., J _n = {M _n, R _n} = {M _n, R _max}. Let J _s denote the first job that is scheduled onto machine σ ₂. According to algorithm A ₁, the workload of σ ₂ is never greater than that of machine σ ₁, and $l_{n - 1}^{1} + l_{n - 1}^{2} > 2 p (R_{max})$ . After J _s arrives, machine σ ₁ no longer accepts jobs, so $l_{n - 1}^{1} < 2 p (R_{max})$ . By algorithm A ₁, we consider two subcases below.

Case 2.1.1 J _n is scheduled according to Step 3.1. We have $C^{A_{1}} = \max \{l_{n - 1}^{1}, l_{n - 1}^{2} + p (M_{n}) + p (R_{max})\}$ .

If $C^{A_{1}} = l_{n - 1}^{1}$ , we can get $l_{n - 1}^{2} + p (M_{n}) + p (R_{n}) \leq l_{n - 1}^{1} \leq 2 p (R_{max})$ . Thus, $C^{*} \geq (l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{n})) / 2 \geq 3 p (R_{max}) / 2 .$ Therefore, $C^{A_{1}} / C^{*} \leq 4 / 3$ .

If $C^{A_{1}} = l_{n - 1}^{2} + p (M_{n}) + p (R_{max}) \leq P_{n} / 2 + p (R_{max}) / 2$ , we have $P_{n} = l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{max}) \geq 3 p (R_{max})$ . So, $\frac{C^{A_{1}}}{C^{*}} \leq \frac{P_{n} / 2 + p (R_{max}) / 2}{P_{n} / 2} \leq 1 + \frac{p (R_{max})}{P_{n}} \leq \frac{4}{3} .$

Case 2.1.2 J _n is scheduled according to Step 3.2. We have $C^{A_{1}} = P_{n} / 2 + p (R_{max}) / 2$ and $P_{n} = l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{max}) \geq 3 p (R_{max})$ . Therefore, we get $\frac{C^{A_{1}}}{C^{*}} \leq \frac{P_{n} / 2 + p (R_{max}) / 2}{P_{n} / 2} \leq 1 + \frac{p (R_{max})}{P_{n}} \leq \frac{4}{3} .$

Case 2.2 This case covers two situations: either $l_{n - 1}^{2} < p (R_{max})$ and at least one large job arrives before job J _n, or $l_{n - 1}^{2} \geq p (R_{max})$ . In both situations, we consider the following two subcases for the assignment of job J _n.

Case 2.2.1 J _n is assigned according to Step 3.1. We have $l_{n - 1}^{2} + p (M_{n}) + p (R_{max}) \leq 2 l_{n - 1}^{1}$ , i.e., $3 (l_{n - 1}^{2} + p (M_{n}) + p (R_{max})) \leq 2 (l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{max}))$ , from which it follows that $\frac{2 (l_{n - 1}^{2} + p (M_{n}) + p (R_{max}))}{l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{max})} \leq \frac{4}{3} .$ On the other hand, because $l_{n - 1}^{1} - l_{n - 1}^{2} \leq p (R_{max})$ , we get $\begin{aligned} 3 l_{n - 1}^{1} & \leq 2 l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{max}) \\ & \leq 2 l_{n - 1}^{1} + 2 l_{n - 1}^{2} + 2 p (M_{n}) + 2 p (R_{n}), \end{aligned}$ i.e., $2 l_{n - 1}^{1} / (l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{n})) \leq 4 / 3 .$

Therefore, we get $\frac{C^{A_{1}}}{C^{*}} \leq \frac{2 \max \{l_{n - 1}^{1}, l_{n - 1}^{2} + p (M_{n}) + p (R_{max})\}}{l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{max})} \leq \frac{4}{3} .$

Case 2.2.2 J _n is scheduled by Step 3.2. We have $C^{A_{1}} = P_{n} / 2 + p (R_{n}) / 2$ and $P_{n} = l_{n - 1}^{1} + l_{n - 1}^{2} + p (M_{n}) + p (R_{n}) \geq 3 p (R_{n})$ . Therefore, we get $\frac{C^{A_{1}}}{C^{*}} \leq \frac{P_{n} / 2 + p (R_{n}) / 2}{P_{n} / 2} \leq 1 + \frac{p (R_{n})}{P_{n}} \leq \frac{4}{3} .$ ■

3 An approximate algorithm for M2

In this section, we discuss the semi-online problem P2|MR, sum & max|C _max. A series of jobs J ₁, J ₂, ⋯, J _n arrives over-list and is to be scheduled onto two machines. We assume that nothing is known about the jobs in advance except the total sum of the processing times and the processing time of the largest reduce task. For each job J _j, we assume that the length of M _j is not greater than that of R _j, i.e., p (M _j) ≤ p (R _j). We use T to denote the total sum of the processing times, i.e., $T = \sum_{j = 1}^{n} (p (M_{j}) + p (R_{j}))$ . Denote by R _max the reduce task with the largest processing time. A job whose reduce task is R _max is called a large job. We use J _max to denote a large job, that is, J _max = {M _max, R _max}.

Now, we verify the lower bound of M2, and the conclusion is derived according to the following five lemmas.

Theorem 3.1. For problem M2, any semi-online algorithm A has a competitive ratio of at least $\begin{cases} \frac{6}{5}, & \text{if } \frac{1}{2} T \leq p (R_{max}); \\ \frac{6}{5}, & \text{if } \frac{2}{5} T \leq p (R_{max}) < \frac{1}{2} T \text{ and } p (M_{max}) + p (R_{max}) \leq \frac{3}{5} T; \\ \frac{14}{11}, & \text{if } \frac{2}{5} T \leq p (R_{max}) < \frac{1}{2} T \text{ and } \frac{3}{5} T < p (M_{max}) + p (R_{max}); \\ \frac{6}{5}, & \text{if } 0 < p (R_{max}) < \frac{2}{5} T . \end{cases}$

Lemma 3.1. For problem M2, if p (R _max) ≥ T/2, then any semi-online algorithm A has a competitive ratio of at least 6/5.

Proof. Suppose that we know T = 12 and p (R _max) =6 beforehand. Job J ₁ = {M ₁, R ₁} and job J ₂ = {M ₂, R ₂} arrive, where p (M ₁) = p (M ₂) =0, p (R ₁) =1 and p (R ₂) =2. If algorithm A schedules J ₁ and J ₂ onto the same machine, the last job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =3 and p (R ₃) =6. We have the makespan C ^A ≥ 9, and the optimal makespan C ^* = 15/2. Thus, C ^A/C ^* ≥ 6/5. If algorithm A schedules these two jobs on different machines, the last two jobs J ₃ = {M ₃, R ₃} and J ₄ = {M ₄, R ₄} arrive, where p (M ₃) =0, p (R ₃) =1, p (M ₄) =2 and p (R ₄) =6. We obtain C ^A ≥ 9 and C ^* = 7. Therefore, C ^A/C ^* ≥ 9/7.

Hence, C ^A/C ^* ≥ 6/5. The Lemma holds.■
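The numbers in this proof can be sanity-checked mechanically. The sketch below is an illustration only: the function name `opt_lower_bound` is hypothetical, and it evaluates the bound max {P _n/2, p (M _max)/2 + p (R _max)} from Sect. 2, which coincides with the optimal values 15/2 and 7 stated in the proof. If several jobs attain p (R _max), the code assumes the one with the longest map task is used.

```python
from fractions import Fraction


def opt_lower_bound(jobs):
    """max{P_n / 2, p(M_max) / 2 + p(R_max)} for jobs given as (p(M), p(R)) pairs."""
    total = sum(Fraction(m) + Fraction(r) for m, r in jobs)
    r_max = max(r for _, r in jobs)
    # Tie-breaking assumption: among jobs attaining r_max, take the longest map task.
    m_max = max(m for m, r in jobs if r == r_max)
    return max(total / 2, Fraction(m_max) / 2 + r_max)


T, R_MAX = 12, 6
same_machine = [(0, 1), (0, 2), (3, 6)]            # J1, J2 placed together
diff_machines = [(0, 1), (0, 2), (0, 1), (2, 6)]   # J1, J2 placed apart

# Both continuations are consistent with the advance information T and p(R_max).
for jobs in (same_machine, diff_machines):
    assert sum(m + r for m, r in jobs) == T
    assert max(r for _, r in jobs) == R_MAX

print(opt_lower_bound(same_machine))                  # 15/2
print(opt_lower_bound(diff_machines))                 # 7
print(Fraction(9) / opt_lower_bound(same_machine))    # 6/5
print(Fraction(9) / opt_lower_bound(diff_machines))   # 9/7
```

The check confirms that the adversary never contradicts the known values of T and p (R _max), which is what makes the two continuations legitimate semi-online instances.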

Lemma 3.2. For problem M2, if 2T/5 ≤ p (R _max) < T/2 and p (M _max) + p (R _max) ≤ 3T/5, then any semi-online algorithm A has a competitive ratio of at least 6/5.

Proof. Suppose that we know T = 10 and p (R _max) =4 beforehand. Job J ₁ = {M ₁, R ₁} and job J ₂ = {M ₂, R ₂} arrive, where p (M ₁) = p (M ₂) =0, p (R ₁) =1 and p (R ₂) =2. If algorithm A schedules J ₁ and J ₂ onto the same machine, the last two jobs J ₃ = {M ₃, R ₃} and J ₄ = {M ₄, R ₄} arrive, where p (M ₃) =2, p (R ₃) =4, p (M ₄) =0 and p (R ₄) =1. We have C ^A/C ^* ≥ 6/5. If algorithm A schedules these two jobs on different machines, the last two jobs J ₃ = {M ₃, R ₃} and J ₄ = {M ₄, R ₄} arrive, where p (M ₃) =1, p (R ₃) =4, p (M ₄) =0 and p (R ₄) =2. We have C ^A/C ^* ≥ 6/5. ■

Lemma 3.3. For problem M2, if 0 < p (R _max) ≤ T/5, any semi-online algorithm A has a competitive ratio of at least 6/5.

Proof. Suppose that we know T = 10 and p (R _max) =2 beforehand. Job J ₁ = {M ₁, R ₁} and job J ₂ = {M ₂, R ₂} arrive, where p (M ₁) = p (M ₂) =0, p (R ₁) =2 and p (R ₂) =2. If algorithm A schedules J ₁ and J ₂ onto the same machine, the last two jobs J ₃ = {M ₃, R ₃} and J ₄ = {M ₄, R ₄} arrive, where p (M ₃) =2, p (R ₃) =2, p (M ₄) =0 and p (R ₄) =2. We have C ^A/C ^* ≥ 6/5. If algorithm A schedules J ₁ and J ₂ on different machines (without loss of generality, we suppose that J _i is scheduled onto σ _i, i = 1, 2), then the job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =0 and p (R ₃) =2. If J ₃ is assigned to σ ₁ along with J ₁, let the last job be J ₄ = {M ₄, R ₄}, where p (M ₄) =2 and p (R ₄) =2. We have C ^A/C ^* ≥ 6/5.

■

Lemma 3.4. For problem M2, if 2T/5 ≤ p (R _max) < T/2 and 3T/5 < p (M _max) + p (R _max), then any semi-online algorithm A has a competitive ratio of at least 14/11.

Proof. Suppose that we know T = 10 and p (R _max) =4 beforehand. Job J ₁ = {M ₁, R ₁} and job J ₂ = {M ₂, R ₂} arrive, where p (M ₁) = p (M ₂) =0, p (R ₁) =1 and p (R ₂) =2. If algorithm A schedules J ₁ and J ₂ onto the same machine, the last job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =3 and p (R ₃) =4. We have the makespan C ^A ≥ 7, and the optimal makespan C ^* = 11/2. Thus, C ^A/C ^* ≥ 14/11. If algorithm A schedules these two jobs on different machines, the last job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =3 and p (R ₃) =4. We again obtain C ^A ≥ 7 and C ^* = 11/2. Therefore, C ^A/C ^* ≥ 14/11.

■

Lemma 3.5. For problem M2, if T/5 < p (R _max) < 2T/5, then any semi-online algorithm A has a competitive ratio of at least 6/5.

Proof. Suppose that we know T = 10 and p (R _max) =3 beforehand. Job J ₁ = {M ₁, R ₁} and job J ₂ = {M ₂, R ₂} arrive, where p (M ₁) = p (M ₂) =0, p (R ₁) =1 and p (R ₂) =3. If algorithm A schedules J ₁ and J ₂ onto the same machine, the last job J ₃ = {M ₃, R ₃} arrives, where p (M ₃) =3 and p (R ₃) =3. We have C ^A/C ^* ≥ 6/5. If algorithm A schedules these two jobs on different machines, the last two jobs J ₃ = {M ₃, R ₃} and J ₄ = {M ₄, R ₄} arrive, where p (M ₃) =2, p (R ₃) =3, p (M ₄) =0 and p (R ₄) =1. We have C ^A/C ^* ≥ 6/5.

■

We now propose an algorithm A ₂ for problem M2. The design of algorithm A ₂ depends on the ratio between T and p (R _max). The desired ratio is easy to obtain if this ratio is very large or very small, i.e., T/p (R _max) >5 or T/p (R _max) <2. We therefore devise a separate procedure to handle the remaining cases carefully. According to the value of T/p (R _max), algorithm A ₂ chooses a suitable procedure.

Algorithm A₂

Step 1. If $\frac{2}{5} T \leq p (R_{max}) \leq T$ , assign the arriving jobs to machine σ ₁ until the arriving job J _j = {M _j, R _j} is the large job.

Step 1.1. If $p (M_{j}) \leq l_{j - 1}^{1}$ , schedule M _j and R _j on machine σ ₂. Then assign all remaining jobs to σ ₁. Stop. (See Figure 4a).

Step 1.2. If $l_{j - 1}^{1} < p (M_{j})$ , partition M _j into two parts $M_{j}^{1}$ and $M_{j}^{2}$ such that $p (M_{j}^{1}) = l_{j - 1}^{1}$ . Schedule $M_{j}^{1}$ on machine σ ₂, $M_{j}^{2}$ on the two machines evenly at time $l_{j - 1}^{1}$ , and R _j on machine σ ₁. Then assign all remaining jobs to σ ₂. Stop. (See Figure 4b).

Step 2. If $0 < p (R_{max}) \leq \frac{1}{5} T$ , assume the arriving job is J _j.

Step 2.1. If $l_{j - 1}^{1} = l_{j - 1}^{2}$ , assign the M _j to the two machines evenly, and the R _j on σ ₁. (See Figure 5a).

Step 2.2. If $p (M_{j}) \leq l_{j - 1}^{1} - l_{j - 1}^{2}$ , schedule the M _j and the R _j on σ ₂. If $l_{j}^{1} \leq l_{j}^{2}$ , renumber the index of two machines such that $l_{j}^{1} \geq l_{j}^{2}$ . (See Figure 5b).

Step 2.3. If $p (M_{j}) > l_{j - 1}^{1} - l_{j - 1}^{2}$ , partition M _j into two parts $M_{j}^{1}$ and $M_{j}^{2}$ such that $p (M_{j}^{1}) = l_{j - 1}^{1} - l_{j - 1}^{2}$ . Schedule $M_{j}^{1}$ on machine σ ₂, $M_{j}^{2}$ on the two machines evenly at time $l_{j - 1}^{1}$ , and R _j on machine σ ₁. (See Figure 5c).

Step 2.4. If job J _j is not the last one, then go to Step 2.

Step 3. If $\frac{1}{5} T < p (R_{max}) < \frac{2}{5} T$ , run algorithm A ₁ for all jobs.


Fig. 4

Schedule produced by Step 1 of Algorithm A ₂.

Fig. 5

Schedule produced by Step 2 of Algorithm A ₂.
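Step 1 of Algorithm A ₂ can be sketched at the level of machine loads as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `a2_step1` is hypothetical, only the completion times of σ ₁ and σ ₂ are tracked, and the first job whose reduce length equals p (R _max) is treated as the large job.

```python
from fractions import Fraction


def a2_step1(jobs, total, r_max):
    """Loads (l1, l2) after Step 1 of Algorithm A2; jobs are (p(M_j), p(R_j)) pairs."""
    total, r_max = Fraction(total), Fraction(r_max)
    if not (Fraction(2, 5) * total <= r_max <= total):
        raise ValueError("Step 1 applies only when 2T/5 <= p(R_max) <= T")
    l1 = l2 = Fraction(0)
    large_seen = False
    to_sigma1 = True  # machine that receives every job other than the large one
    for m, r in jobs:
        m, r = Fraction(m), Fraction(r)
        if not large_seen and r == r_max:
            large_seen = True
            if m <= l1:
                # Step 1.1: the whole large job goes on machine 2,
                # and all remaining jobs stay on machine 1.
                l2 = m + r
            else:
                # Step 1.2: the map task first fills machine 2 up to l1, the rest
                # is split evenly, R goes on machine 1; remaining jobs go to machine 2.
                level = (l1 + m) / 2
                l1, l2 = level + r, level
                to_sigma1 = False
        elif to_sigma1:
            l1 += m + r
        else:
            l2 += m + r
    return l1, l2


# Instance from the proof of Lemma 3.1 with T = 12 and p(R_max) = 6.
l1, l2 = a2_step1([(0, 1), (0, 2), (3, 6)], total=12, r_max=6)
print(max(l1, l2))  # 9
```

On this instance the makespan is 9 against an optimum of 15/2, so the ratio is exactly 6/5, which matches the lower bound of Lemma 3.1 for this range of p (R _max).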

Under Step 2 of algorithm A ₂, the difference between the completion times of the two machines never exceeds the processing time of the largest reduce task. The following theorem gives the competitive ratio of algorithm A ₂.

Before we provide our main result, we introduce a lemma that is useful to prove Theorem 3.2.

Suppose that s1 and s2 are known information, such as the total processing time of all jobs, the largest processing time of all jobs and so on.

Lemma 3.6 ([18]). (1) If the lower bounds of P2|s1|C _max, P2|s2|C _max and P2|s1 & s2|C _max are l ₁, l ₂ and l ₁₂, respectively, then l ₁₂ ≤ min {l ₁, l ₂};

(2) if A is an l _A-competitive semi-online algorithm for P2|s1|C _max (or P2|s2|C _max), then the competitive ratio of A is at most l _A for solving P2|s1 & s2|C _max.

Theorem 3.2. Algorithm A ₂ is 4/3-competitive for the semi-online problem M2.

Proof. We will prove that $C^{A_{2}} / C^{*} \leq 4 / 3$ holds for any instance. We consider the following four cases.

Case 1. T/2 ≤ p (R _max) < T. We can conclude that C ^* ≥ p (M _max)/2 + p (R _max). J _j is scheduled as follows.

Case 1.1 J _j is scheduled according to Step 1.1. We have $C^{A_{2}} = p (M_{max}) + p (R_{max})$ . Therefore,

$\begin{aligned} \frac{C^{A_{2}}}{C^{*}} & \leq \frac{p (M_{max}) + p (R_{max})}{p (M_{max}) / 2 + p (R_{max})} \\ & \leq \frac{(T - p (R_{max})) / 2 + p (R_{max})}{(T - p (R_{max})) / 4 + p (R_{max})} \\ & = \frac{2 T + 2 p (R_{max})}{T + 3 p (R_{max})} \leq \frac{6}{5} . \end{aligned}$
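The second and the last inequality in Case 1.1 each rely on a short argument that is not written out. The following is a sketch of those steps, assuming only the condition of Step 1.1 and the fact that the total processing time is T.

```latex
% Step 1.1 applies when p(M_max) <= l^1_{j-1}. Since the jobs before J_j
% and the large job together have total length at most T,
%   l^1_{j-1} + p(M_max) + p(R_max) <= T,
% and therefore 2 p(M_max) <= T - p(R_max):
p(M_{max}) \;\leq\; \frac{T - p(R_{max})}{2}.

% For fixed r > 0, the map x -> (x + r) / (x/2 + r) is increasing in x >= 0
% (its derivative has numerator r/2 > 0), so replacing p(M_max) by its
% upper bound gives the second inequality.

% The last inequality is equivalent to the assumption of Case 1:
\frac{2T + 2p(R_{max})}{T + 3p(R_{max})} \leq \frac{6}{5}
\;\iff\; 10T + 10p(R_{max}) \leq 6T + 18p(R_{max})
\;\iff\; p(R_{max}) \geq \frac{T}{2}.
```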

Case 1.2 J _j is scheduled according to Step 1.2. We consider two subcases below.

Case 1.2.1 p (M _max) ≥ (T - p (R _max))/2. We obtain $C^{A_{2}} \leq (T - p (R_{max})) / 2 + p (R_{max}) = (T + p (R_{max})) / 2$ . Therefore, we have

$\begin{aligned} \frac{C^{A_{2}}}{C^{*}} & \leq \frac{(T + p (R_{max})) / 2}{p (M_{max}) / 2 + p (R_{max})} \\ & \leq \frac{T + p (R_{max})}{p (M_{max}) + 2 p (R_{max})} \\ & \leq \frac{2 T + 2 p (R_{max})}{T + 3 p (R_{max})} \leq \frac{6}{5} . \end{aligned}$

Case 1.2.2 p (M _max) < (T - p (R _max))/2. We obtain $C^{A_{2}} \leq p (M_{max}) + p (R_{max})$ . Therefore, we have

$\frac{C^{A_{2}}}{C^{*}} \leq \frac{p (M_{max}) + p (R_{max})}{p (M_{max}) / 2 + p (R_{max})} \leq \frac{6}{5} .$

Case 2. 2T/5 ≤ p (R _max) < T/2. We consider the following two subcases.

Case 2.1 p (M _max) + p (R _max) ≤3T/5. Whether J _j is scheduled by Step 1.1 or Step 1.2, we can conclude that $C^{A_{2}} \leq 3 T / 5$ and C ^* ≥ T/2. Therefore, we have

$\frac{C^{A_{2}}}{C^{*}} \leq \frac{3 T / 5}{T / 2} \leq \frac{6}{5} .$

Case 2.2 p (M _max) + p (R _max) >3T/5. We get C ^* ≥ max {T/2, p (M _max)/2 + p (R _max)}. J _j is scheduled as follows.

Case 2.2.1 J _j is scheduled according to Step 1.1. We obtain $C^{A_{2}} = p (M_{max}) + p (R_{max})$ . Therefore, we have

Case 2.2.2 J _j is scheduled according to Step 1.2. We consider the following two subcases.

Case 2.2.2a p (M _max) ≥ (T - p (R _max))/2. We obtain $C^{A_{2}} \leq (T - p (R_{max})) / 2 + p (R_{max}) = (T + p (R_{max})) / 2$ . Therefore, we have

Case 2.2.2b p (M _max) < (T - p (R _max))/2. We obtain $C^{A_{2}} \leq p (M_{max}) + p (R_{max})$ . Therefore, we have

$\frac{C^{A_{2}}}{C^{*}} \leq \frac{p (M_{max}) + p (R_{max})}{p (M_{max}) / 2 + p (R_{max})} \leq \frac{14}{11} .$

Case 3. 0 < p (R _max) ≤ T/5. All jobs are scheduled according to Step 2. Since the difference between the completion times of the two machines does not exceed p (R _max), we get

$\begin{aligned} C^{A_{2}} & \leq (T - p (R_{max})) / 2 + p (R_{max}) \\ & \leq 3 T / 5 \leq 6 C^{*} / 5, \end{aligned}$ where the last inequality holds because C ^* ≥ T/2.
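The first two inequalities in Case 3 can be spelled out as follows. This is a sketch that writes $l^{1} \geq l^{2}$ for the final completion times of the two machines (notation introduced here for illustration) and assumes, as in Step 2, that neither machine is ever left idle, so that the two completion times sum to T.

```latex
% No idle time under Step 2, so all work is accounted for:
l^{1} + l^{2} = T.
% Difference bound for Step 2:
l^{1} - l^{2} \leq p(R_{max}).

% Adding the two relations and using p(R_max) <= T/5:
C^{A_{2}} = l^{1}
  = \frac{(l^{1} + l^{2}) + (l^{1} - l^{2})}{2}
  \leq \frac{T + p(R_{max})}{2}
  \leq \frac{T + T/5}{2}
  = \frac{3T}{5}.
```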

Case 4. T/5 < p (R _max) <2T/5. All jobs are scheduled by Step 3. According to Lemma 3.6, we get $\frac{C^{A_{2}}}{C^{*}} \leq \frac{4}{3} .$

This shows that the result holds in all cases. ■

Combining Theorem 3.1 with the proof of Theorem 3.2, we obtain the following corollary.

Corollary 3.1. For problem M2, the algorithm A ₂ is the best possible when one of the following conditions holds: $\left\{ \begin{array}{l} \frac{1}{2} T \leq p (R_{max}); \\ \frac{2}{5} T \leq p (R_{max}) < \frac{1}{2} T \text{ and } p (M_{max}) + p (R_{max}) \leq \frac{3}{5} T; \\ \frac{2}{5} T \leq p (R_{max}) < \frac{1}{2} T \text{ and } \frac{3}{5} T < p (M_{max}) + p (R_{max}); \\ 0 < p (R_{max}) \leq \frac{1}{5} T . \end{array} \right.$

4 Conclusions

In this study, we address the semi-online problem of MapReduce scheduling on two parallel machines, with the aim of minimizing the makespan. We discuss two versions under the assumption that the processing time of each job's map task is not greater than that of its reduce task. In the first, the processing time of the largest reduce task is known beforehand, and we give a 4/3-competitive optimal algorithm. In the second, both the total sum of the processing times and the processing time of the largest reduce task are known in advance, and we establish a lower bound and design a semi-online algorithm. When the largest reduce task meets certain conditions, this algorithm is the best possible. Future research could consider the general case of m ≥ 2 parallel machines and semi-online algorithms with other basic types of partial information.

Acknowledgments

The authors would like to thank the editors and anonymous reviewers for their constructive comments, which have significantly improved the quality of the paper.

This work was supported by the National Natural Science Foundation of China (No. 12271295), the National Science Foundation of China (No. 12001313) and the Natural Science Foundation of Shandong Province of China (No. ZR2020QA023).

Conflict of interest

The authors declare no conflict of interest.

References

1. Dean and Ghemawat, MapReduce: Simplified data processing on large clusters, J. Comb. Optim. 51 (2004), 107–113.

2. Isard, Prabhakaran, Currey, Wieder, Talwar, Goldberg, Quincy: Fair scheduling for distributed computing clusters, Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ACM (2009), 261–276.

3. Zaharia, Borthakur, Sarma, Elmeleegy, Shenker, Stoica, Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling, Proceedings of the 5th European Conference on Computer Systems, Association for Computing Machinery (2010), 265–278.

4. Kolb, Thor, Rahm, Load balancing for MapReduce based entity resolution, IEEE 28th International Conference on Data Engineering (ICDE) (2012), 618–629.

5. Bechini, Marcelloni and Segatori, A MapReduce solution for associative classification of big data, Inform. Sciences 332 (2016), 33–55.

6. Moseley, Dasgupta, Kumar, Sarlos, On scheduling in Map-Reduce and flowshops, Proceedings of the Twenty-Third Annual ACM Symposium on Parallelism in Algorithms and Architectures (2011), 289–298.

7. Zheng, Shroff, Sinha, A new analytical technique for designing provably efficient MapReduce schedulers, Proceedings of INFOCOM’13 (2013), 1600–1608.

8. Zhu, Jiang, Wu, Ding, Teredesai, Li, Lee, Minimizing makespan and total completion time in MapReduce-like systems, Proceedings of INFOCOM’14 (2014), 2166–2174.

9. , Liu, Ergun, Wang, Online load balancing for MapReduce with skewed data input, Proceedings of INFOCOM’14 (2014), 2004–2012.

10. Luo, Zhu, Wu, Xu, Du, Online makespan minimization in MapReduce-like systems with complex reduce tasks, Optim. Lett. (2017), 271–277.

11. Jiang, Zhu, Wu and Li, Makespan minimization for MapReduce systems with different servers, Future Gener. Comp. Sy. 67 (2017), 13–21.

12. Chen, Xu, Zhu and Sun, Online MapReduce scheduling problem of minimizing the makespan, J. Comb. Optim. 33 (2017), 590–608.

13. Jiang, Zhou and Zhou, An optimal preemptive algorithm for online MapReduce scheduling on two parallel machines, Asia Pac. J. Oper. Res. 35 (2018), 185003.

14. Huang, Zheng, Xu and Liu, Online MapReduce processing on two identical parallel machines, J. Comb. Optim. 35 (2018), 216–223.

15. and Zhang, Semi on-line scheduling on two identical machines, Computing 62 (1999), 179–187.

16. Kellerer, Kotov, Speranza and Tuza, Semi on-line algorithms for the partition problem, Oper. Res. Lett. 21 (1997), 235–242.

17. Seiden, Sgall and Woeginger, Semi-online scheduling with decreasing job sizes, Oper. Res. Lett. 27 (2000), 215–221.

18. Tan and He, Semi-on-line problems on two identical machines with combined partial information, Oper. Res. Lett. 30 (2002), 408–414.

19. Dósa, Fügenschuh, Tan, Tuza and Węsek, Tight lower bounds for semi-online scheduling on two uniform machines with known optimum, Cent. Eur. J. Oper. Res. 27 (2019), 1107–1130.

20. Chen, Kovalev, Liu, Sterna, Chalamon and Błażewicz, Semi-online scheduling on two identical machines with a common due date to maximize total early work, Discrete Appl. Math. 290 (2021), 71–78.

21. Dwibedy and Mohanty, Semi-online scheduling: A survey, Comput. Oper. Res. 139 (2022), 105646.

22. Zheng, Chen, Liu, Xu, Semi-online scheduling on two identical parallel machines with initial lookahead information, Asia-Pac. J. Oper. Res. (2022), 2350003.