Abstract
MapReduce is a widespread programming model used for overcoming processing limits of current hardware resources. In the MapReduce paradigm, large amount of data distributed into multiple parts and each part processed on different processing units and finally the results of all units combined to obtain the ultimate result. Scheduling is an important aspect that affects overall processing quality, hence finding suitable algorithm based on properties of jobs is essential to acquire maximum performance. The basic FIFO algorithm for job scheduling does not meet maximum efficiency. Therefore, many of new schedulers proposed so far to improve quality metrics of the MapReduce system. The scheduling methods can be designed for job scheduling, map or reduce task scheduling or both. In this paper, we select high quality studies that concern the MapReduce scheduling problem and then classify them based on their main concerning quality measure and subsequently review selected studies. Finally, we discuss research trends and provide a roadmap of the scheduling problem for researchers.
Introduction
In the new age of developing computers, the size of produced data is growing rapidly. This data needs to be processed for various applications. Cloud computing model is invented for better usage of resources with lower price. According to this model, users can exploit resources and when they do not use them, the resources will be available for other users in the network. Cloud computing has many features such as payment on the go and low operation costs [1]. According to NIST definition [2] the cloud computing is a model that make use of a shared resource easily and has characteristics like: automatic allocation of required resources, availability in heterogeneous platforms, supporting multiple resource pooling for costumers, quick access to resources in unlimitedly manner and controlling of resources and services that used.
Another definition for cloud computing provided in [3] introduces cloud as the whole of the application and hardware and facilities that can be provisioned as a service for user needs. Despite grid computing, the cloud may consist of many infrastructures that can be considered as one consolidated service provider in a way that users cannot distinguish its compartments. The cloud computing model in some aspects has new concepts that make it special, such as unlimited resources available for users to meet their demands, letting users pay only for used resources and giving companies the ability for starting by small amount of hardware infrastructures since resources can be used by multiple users on different time spans.
Moreover, the cloud computing solves many problems tightened by personal computers like software installation, update and maintenance mainly through the web [4]. In addition to software, the cloud provides portability and lower hardware cost for the applications running in the cloud. In such an explosive data growing occurring in the world, we encounter with three of the most important aspects of data known as 3V: namely, velocity, volume and variety. The whole aspects of the huge data imply big data [5]. Since a large amount of data generated continually the data should be stored, accessed and processed with special methods. Such data cannot be processed by traditional methods and models, therefore Hadoop [6] invented to overcome existing limits for handling big data. It implements Google MapReduce [7] framework proposed for distributing data in parallel processing. MapReduce is used for processing data in distributed manner in a way that is tolerable for probable faults and can be used for terabyte scale data analysis that is intractable in the traditional systems [8].
The MapReduce model now is widely used and optimization of MapReduce will result in faster execution of its jobs with improved throughput. The optimization of scheduling of jobs is one of the criteria that helps to speed up MapReduce. Job scheduling can be important when for example there are many of jobs and selecting a job that is near to processing unit will result in faster job completion [9]. The scheduling can be done in finer granularity such as the map and reduce task levels.
The simplest scheduling algorithm is FIFO (first in first out) that used by Hadoop implementation of MapReduce by default while it is not efficient in all of conditions and may slow down the processing.
We first describe some backgrounds about the MapReduce scheduling problem in Section 2 and then describe the research method of systematic review in Section 3. Afterwards, we will review selected studies in Section 4 and discuss about the research results and the we define road map for MapReduce scheduling methods in Section 5. Finally, in Section 6 we conclude the review.
Background
MapReduce is a parallel programming model used for big data, dividing the job to smaller units to enable distribution of them on multiple machines. The MapReduce is implemented by Hadoop for using in the cloud systems. The other models than the MapReduce can be combined with it in one shared cluster using YARN resource manager. In the following subsections we will describe the related basic concepts in details.
MapReduce
The MapReduce is a programming method in which enables easy parallelization of problems to be executed on distributed units to make large processing jobs tractable. Parallelization details in this model is hidden and its usage is easy. The MapReduce works with two basic functions named map and reduce. The user should define map and reduce functions based on problem specifications. The map and reduce functions take input data and return output both in the format of <key, value> tuples. The map function gets the entire of the data and generates some <key, value> tuples whereas the reduce function finds the <key, value> tuples with same keys and combine their values to generate the final result. The reduce action can be done in three subsections, namely; shuffling, sorting and reducing. The shuffling is the act of gathering results of map tasks with same key as input for the reducing section. In the sorting section, the keys in intermediary level is sorted. The reducing is the act of running reduce function [7, 8]. The basic structure of MapReduce model is shown in Fig. 1.
MapReduce basic structure.
The simplicity of the MapReduce model gives the opportunity of developing parallel applications to less experienced programmers. Another key advantage of it is that does not require to load all of the processing data at once. Moreover, it is fault-tolerant during the jobs execution. The progress of map and reduce functions is monitored continually and if any failure occurred, the tasks will be scheduled again [8].
Hadoop is an open-source implementation of MapReduce for commodity hardware. It has a positive paradigm since it moves code through data that is faster than moving data through the code. The Hadoop exploits its own file system named HDFS (Hadoop Distributed File System), that is a fault-tolerant file system. The HDFS consists of two structural components named NameNode and DataNode with master/slave design. The NameNode is the master that controls file client access. The HDFS client provides interface for file system by reading and writing data through a DataNode directly. The NameNode manages metadata and file trees and is aware of place of each DataNode. Moreover, there is a DataNode component for each worker machine that reports NameNode status regularly. In turn the NameNode sends required commands to the DataNode to do actions on stored data blocks. In the application layer of system, JobTracker splits jobs and assigns them to existing machines and then the TaskTracker in each machine continues the task execution. Therefore, the JobTracker is the Hadoop master that manages the task running progress by reports continually sent by TaskTrackers. There is one TaskTracker for each slave node which is responsible for running map and reduce functions [6, 10, 11, 12, 13]. The Fig. 2 shows basic structure of the HDFS.
Hadoop MapReduce and HDFS basic structure.
The Hadoop provides fault-tolerance using data replication. It stores data in sequential blocks with same size on multiple machines. The NameNode is informed about data replication by DataNode which sends heart-beat and block-beat signals periodically. Since in the large systems in which use HDFS there are multiple racks, data replication in HDFS is done in multiple racks to provide more reliability and more network traffic utilization. The default policy for data replication is done in three positions; two different nodes in local rack and another node in non-local rack. Additionally, the Hadoop tries to exploit data locality by using the local rack data replication when it is available to reduce required network traffic for task completion. Since the failure of NameNode enforces the system to halt. In this situation, another NameNode should be started with replicated data. The slow running or failed tasks detected by the Hadoop system and another backup task that is same as the original task will be started. Afterwards, each of the original task or backup task that completed first, another one should be terminated [10, 12].
YARN [14] or yet another resource negotiator is a new model of computing in the Hadoop with new design that made the model of programming, disassociated from resource management. This make possible using of various programming models in one cluster by sharing cluster resources among them. The YARN is more scalable, more efficient than traditional Hadoop [14].
Job scheduling problem
The scheduling of jobs by the master node for execution in slave nodes can be done in different ways that affect the final performance. The MapReduce default program considers all of the jobs as one process and runs simultaneously various jobs and executes them one by one based on simple queueing. Thereby, using a job-aware or task-aware scheduling method will improve total system throughput and will satisfy other required conditions in the system. The scheduling is mostly done by considering input of the job and its properties like locality of data, SLA constraints, availability of resources and many other criteria [15].
Research method
In this systematic review, we have followed the three phase guidelines suggested by Kitchenham in [16]. The first phase is planning the review i.e. defining review procedures and making a review protocol. Afterwards, in the second phase the review should be conducted i.e. paper search results and data should be presented in the review and finally in the third phase reporting done with disseminating review results. In this section we will describe the first phase of systematic review (i.e. planning).
List of selected sources
List of selected sources
The MapReduce scheduling problem now has numerous new methods published in the papers yearly. The need for evaluating current progress of the methods toward better outcome in various criteria and lack of systematic review for this subject motivated us to review various available MapReduce scheduling methods. The MapReduce scheduling methods concern various quality measures and some of them concern more than one quality measure, therefore our first research question is about quality measures:
RQ1: What are the main quality measures concerned by the scheduling methods?
Afterwards we categorize selected methods based on each method characterizations and introduce evaluation methods used by selected papers:
RQ2: What are main studies concern each quality measure? RQ3: What are the most concerned quality measures by the studies? RQ4: What are the main benchmark datasets used by the studies?
Search criteria
After defining research questions, in this subsection, details of search for the resources enlightened as the following:
Search strings
The MapReduce scheduling papers mostly contain standard words of “MapReduce” and “scheduling”. Hence, we built search string based on those two words (i.e. MapReduce AND scheduling). We applied search string on abstract and keywords in addition to paper title. There is no need for synonym consideration for search string because of unique name of selected case.
Search sources
We selected the most popular digital libraries that available for paper search and have capability for advanced search. The name of selected sources and their web address are presented in Table 1.
Selection strategy
Subsequently, we constructed inclusion and exclusion criteria to filter inappropriate and low quality papers. The inclusion criteria were built based on research subject and questions introduced in the Section 3.1. The inclusion criteria applied on paper contents and papers that hold all of the criteria, have reviewed and others have discarded.
The inclusion criteria items are the following:
The study should address at least one of the MapReduce related optimization problems. The study should be published in a journal or conference paper that is available online. The solution should be concerning pure MapReduce paradigm and not focusing other problems like Hadoop applications.
Number of initial and selected resources
Quality measures of the MapReduce schedulers
Moreover, a number of exclusion criteria are considered for removing low quality papers. Any paper that have one of the exclusion criteria, removed from final review list.
The exclusion criteria items are the following:
Repeated studies that are not available in English. Items that are not published in the form of an original research article, such as review articles, thesis, posters, tutorials, abstracts, etc. The scheduling algorithms that are general and are not specific to the MapReduce.
We have committed above inclusion and exclusion criteria according to the following stages to select qualified studies:
Querying the selected digital libraries and getting the results Removing duplicate papers, invalid items and irrelevant studies Applying inclusion and exclusion criteria Adding suggested studies
The Table 2 shows the number of selected studies in each stage.
The fetching of research items has done on June 25, 2018. The found study items collected in a spreadsheet contain details for each of items, such as title, publication name and year, and other properties needed for classification and knowledge extraction.
Scheduling methods
Each of the MapReduce scheduling methods concerns one or more quality measure improvement for the job level or task level scheduling. The quality measures that we have considered for MapReduce scheduling improvements with their short description is listed in Table 3. We have considered the main focused quality measure as classification label, nonetheless we have considered all of the improvement qualities in statistics.
Data Locality
Location of data which processed by map and reduce functions, is an important parameter for execution speed and has great impact on system performance. If the data that is being to processed, locate in local memory, the network overload will be reduced and this leads to higher throughput. In many applications of the MapReduce, it may be used in heterogeneous environments that MapReduce clusters or virtual machines have different hardware configurations.
Delay Scheduler [9] aiming to improve data locality by trying to find tasks for a node from its head-of-line job and if it could not start any local task, it finds another job for local task assigning. This may cause other jobs never can be executed, thus in this method after long waiting, non-local jobs are accepted for task assignment. Indeed, the act of delaying non-local jobs cause that after some time, the slot with local job data be freed and this improves data-locality. This method is shown that can additionally reduce response times in multi-user systems. In [17], a solution for data locality problem has proposed in which assigns tasks by considering waiting time and network transmission time in a way that for each TaskTracker request for task, if no task with local data exists, it compares transmission time of task data and task waiting time to decide which task should be allocated to TaskTracker for execution. Another scheduling method aiming improvement of data locality, is Purlieus [18] that exploits local data storage for intensive works. Indeed, the Purlieus has three categories of jobs in scheduling procedure: map function heavy data, reduce function heavy data and both map and reduce functions with heavy data. In this method, heavy data of jobs with each of the former categories, stored on local memory and other data stored anywhere else. This technique reduces traffic usage of network and moreover decreases job execution time enormously.
In [19] a hybrid scheduler named HybS has proposed that aims to reduce jobs waiting time with preserving data locality and user SLA. The HybS utilizes dynamic priority assignment that can be useful for variety of jobs with different task length, job size and waiting time. The HybS schedules local nodes greedily using knapsack algorithm. It first selects jobs that have highest priority and have data locality and if an item with highest priority doesn’t have data locality, the scheduler considers next job with highest priority. LaSA [20] is a locality-aware scheduler that models interference weight of data and uses the calculated weights to manage data-locality. This algorithm introduces rare resource as a data node with some rare data in it. The rare resource obtains higher weight of interference and when nodes assigned to tasks, the scheduler tries to select data nodes with lower data interference to avoid data missing. MRA
In [24] Wang et al, address the data locality problem in a network perspective by balancing load and locality of data. This algorithm is built using MaxWeight and Join the Shortest Queue (JSQ) policies. It uses non-pre-emptive execution of tasks in addition to random service times. The proofs have done and have shown by authors that the proposed algorithm is delay and throughput optimal in heavy traffic way. In [25], the authors consider three levels of data locality and remark that JSQ-MaxWeight is not throughput optimal with this presumption. They introduced an extension for JSQ-MaxWeight that makes it throughput optimal with the new presumption. Afterwards, the authors propose a new algorithm that uses weighted workload routing and priority service. They have provided heavy-traffic optimality for the proposed algorithm for all traffic scenarios. GB-PANDAS [26] is a throughput optimal scheduling algorithm that considers three levels of data locality. In contrast to methods like JSQ-MaxWeight that use non-realistic assumptions as geometric distribution for service time, the GB-PANDAS exploits new queuing structure and making no assumptions about it.
The MapReduce network topology and transmission cost are two important criteria to consider for improving data locality since distribution of data in a network results different transmission time. In [27]heuristic scheduling method is introduced for data locality improvement by considering network bandwidth as a resource. The Balance-Reduce (BAR) algorithm reduces job completion time by initial allocating of tasks based on network status and cluster load for improving data locality. The BAR detects network state and when the network connection is poor, it improves data locality and otherwise, when load of the cluster is high, it dismisses data locality for faster starting of the tasks. In [28] a scheduling algorithm has proposed to reduce data transmission and speed up the system. They calculated a probability value by considering data transmission in a way that higher probability means the item require lower data transmission and is better choice for the free slot of cluster in which the probability is calculated. The result of experiments shows more efficient result rather than fair scheduler. In [29], prefetching and caching of required input data of node have used to improve data locality and reduce network transmission. High Performance Scheduling Optimizer (HSPO) provides a data prefetching service in task level by predicting best nodes for each map task by monitoring running behavior of the tasks. In the industrial applications of MapReduce, the servers maybe have different locations that network transmissions need to be mitigated. In [30] a traffic aware scheduling method for geo-distributed data centers has proposed aiming to minimize network traffic. It considers both task placement and data movement to construct an optimization model and to obtain solutions using linearization and relaxation methods. Moreover, the authors used a chance constrained optimization method to ensure that execution of the job will not exceed the predicted job completion time with high probability.
DPMQS [31] is a multi-queue task scheduler that aims to reduce jobs response time by providing data locality. It tries to schedule tasks near to TaskTracker coupled with required input data of tasks. Furthermore, the DPMQS increases the priority of jobs whose map tasks are near to finishing time to minimize reduce task waiting time for map task completion.
Optimization of mapping task for getting better data locality will improve overall system performance. In [32], a next-k-node (NKS) method for improving data locality of map tasks in homogenous environments has introduced. The NKS calculates probability for each map task based on where its data is located. Indeed, low probability value is assigned to the map tasks that their input data are located in the next node to request data. Finally, the method holds low probability nodes reserved for nodes that have their local data, to improve overall data locality. In [33] a network aware scheduling algorithm has proposed that improves FIFO and FAIR schedulers by placing map tasks on the nodes containing required data. Another effort for this type of optimization has done in [34]. It uses a strategy that has two parts: first part is about locality of data and the second part is about selection of replica. In the data locality part, it uses multiple queues to assign map task only to the TaskTracker that has input data and therefore no data transfer is required. On the second part, the authors defined a load value to address the load of system problem with assumption of the same nodes in each cluster of the Hadoop.
LARTS [35] is a scheduler that has concentrated on reduce task scheduling data locality and tries to become aware of network location and size of each partition. The method aims to schedule reduce tasks near to where their maximum amount of data is located. Besides the LARTS attempts to balance delay and skew of scheduling as well as resource utilization and degree of parallelism. In [36] the CoGRS scheduling method has proposed that concentrates on both partitioning skew and data locality in the MapReduce heterogeneous environments. The CoGRS tries to schedule each reduce task on center of gravity node of the task to reduce required network traffic and improve total system performance.
Performance
The overall performance of the MapReduce running system is an important factor within the other quality measures. The simple definition for system performance is the number of jobs done by the system in a time span. In the execution procedure of the MapReduce speculative execution happens when the processing of data in one machine takes more time than normal. Therefore, another machine should start processing of the same data simultaneously to compensate for abnormal execution. The speculative execution on remote data occurs enormously which causes low efficiency. Maestro [37] is a scheduling algorithm that has proposed to address useless speculative execution of map tasks on data replications. This algorithm does the optimization in two major stages. In the first stage it counts the number of map tasks of each node and assigns map tasks to free slots in addition to considering data replication. Then, in the second stage, it decides according to probability value calculated based on replication of map task input data.
The workload jobs may differ in amount of usage of CPU-bound and I/O-bound operations. MR-predict [38] method predicts category of the jobs before execution. It considers three categories; CPU-bound, I/O-bound and uncertain CPU-bound workloads. According to these job categories, three queues constructed in the scheduler, the first is for CPU-bound jobs, the second is for I/O bound jobs and the third is a waiting queue for evaluating new jobs in the test run. Afterwards, each job type assigned to the best matching resource for each type. This queuing method improves both of throughput and total job completion time. ATAS [39] is a scheduling method tries to improve throughput of the MapReduce in heterogeneous environments by increasing successful execution of backup tasks. It assigns tasks to nodes by considering processing ability of the node by labelling each node as SlowNode or QuickNode. Moreover, the backup tasks are assigned to QuickNodes to accelerate their execution.
Longest Approximate Time-to-End (LATE) scheduler [40] is designed for systems which their clusters have different hardware configurations or comprise clusters that use virtualization. The variety of configurations of MapReduce nodes (heterogeneous environments) requires new type of algorithms since many of the algorithms presume that all nodes have equal speed. Another presumption that can be false in heterogeneous environments is task loading time that can be non-zero. The LATE algorithm uses speculation [19] for tasks finishing time in heterogeneous environments. It uses three criteria for its algorithm: giving more priority to speculated tasks, preventing tasks that are not in running state from failing and assigning not running tasks to fast nodes. The LATE furthermore considers a threshold value to classify a task as slow or fast by comparing the progress of the task. Due to the weakness of the LATE [40] in computing correct remaining time and speed of running tasks, a Self-adaptive MapReduce scheduler (SAMR) proposed in [41] to address these problems. The SAMR uses a heuristic approach to find slow running tasks. It categorizes tasks based on map and reduce tasks speed using task log information, namely, tasks with slow map and tasks with slow reduce. The SAMR runs backup task on another node simultaneously as well as the original task to obtain the best finishing time and to not run as slow as first time. The SAMR method has one drawback that is the fact that the map and reduce tasks may have different weights. ESAMR [42] is an enhanced version of the SAMR method that uses k-means clustering method for job execution history classification of each node. It uses the weight of clusters for job running time estimation.
Failure in the job execution process results increasing job completion time. Chronos [43] is a fault-aware scheduler that decreases completion time of jobs by resolving failed task uncertain waiting time. It considers pre-emption for resource allocation in addition to data locality perseverance for failed tasks. The selection criteria for pre-emption of running tasks are task progress, location of data of failed task and objectives of the scheduling. In [44] another failure-aware scheduler has introduced that exploits machine learning methods based on MDP model for dynamic scheduling decisions to be used instead of classic heartbeat approach for detecting failures. The method has deployed using the Hadoop and named ATLAS+. The early detection of failures using system logs in TaskTracker aids the ATLAS+ scheduler to successfully decrease number of failures and provide lower execution time in addition to lower memory and processing resource usage.
The partitioning skew is an issue in MapReduce programming model caused by uneven distribution of map tasks among reduce tasks that increases job completion time. DREAMS [45] tries to reduce partitioning skew on runtime instead of repartitioning reduce task workloads. It provides dynamic resource allocation for reduce tasks by online prediction of partition size. Moreover, the DREAMS introduces a correlation between reduce task duration and task size and its allocated resource, to model performance of the reduce task. Additionally, the uneven distribution of data causing the reduce tasks waiting for the slowest reduce task because of different size of data has assigned to reduce tasks. MBR [46] improves efficiency of the MapReduce by concerning data that unevenly distributed among the nodes. The MBR is based on processing and self-adaptation scheduling algorithms.
In the Fair Scheduler when the number of reduce tasks exceeds its critical value, different delay distribution may happen. Coupling Scheduler [47] has proposed to start each reduce task gradually upon its related map tasks execution progress. This enables the Coupling Scheduler to obtain delay distribution order gain in comparison to the Fair Scheduler. SARS algorithm [48] has proposed for alleviating reduce task start time by dynamically considering job details for reduce tasks of starting job. It checks the map task result and the time required for task completion of the jobs for decision making. The method finally reduces total job completion time up to 29% in response time in the comparison with the classic FIFO approach.
In [49] a method has proposed that assigns tasks of job to DataNodes of cluster in a way that does not make any effect on other cluster tasks. In this algorithm, each TaskTracker with specific configuration, paired with the tasks that are compatible with it. Indeed, the type of tasks such as CPU-bound, memory-bound etc. matched to the TaskTracker hardware configuration. This method exploits machine learning methods (Bayes classification) to evaluate type of tasks and their compatibility. CASH [50] is a context-aware job scheduling method considers job characteristics as an important factor in scheduling decision making. It uses the fact that a large number of jobs in the MapReduce workloads are running periodically with the same context.The CASH categorizes periodic jobs based on CPU or storage intensive type and further categorize resources as well to CPU or storage type and tries to map each type of job to its respective resource type. Tarazu [51] is a MapReduce scheduler that exploits heuristics that comprise online measurements to combine cluster and application characterizations. The Tarazu concerns two major issues in heterogonous workloads: competing for bandwidth by map and shuffle tasks due to poor load balancing and reduce tasks imbalance caused by the workload heterogeneity. This method contains various components to address the issues such as load balancing module for both of map and reduce tasks and a communication-aware scheduler for map tasks.
In [52] a method has proposed that utilizes reordering of executing the MapReduce jobs that have no dependencies for improving total completion time of job. The method introduces the abstraction that execution time of map and reduce stages build required time of the MapReduce job completion. Afterwards, the authors use classic Johnson algorithm for optimal job scheduling. In cases that the workload contains periodical jobs, the method exploits execution history information to estimate required time for job execution of new datasets. Since in some cases Johnson algorithm results suboptimal solutions, the authors designed BalancedPools that is a heuristic to improve low quality scheduling results.
In [53] a scheduler has introduced in which exploits genetic algorithm as an optimizing method for minimizing makespan time. The method makes decision for scheduling based on the whole of job queue tasks evaluation. In [54] online and offline 3-approximation scheduling algorithm exploiting linear programming has proposed attempting to reduce job completion time by finding proper order of job execution. In [55] a scheduling approximation algorithm has proposed in which concentrates on both pre-emptive and non-pre-emptive reduce tasks for makespan and total job completion time minimization. The method presumes that map tasks are parallelizable, whereas considers reduce tasks as non-parallelizable for solving the problem.
Online schedulers decide for each job when they arrive. This enforce schedulers to consider no assumption about next jobs [56]. In [57], preemptive and non-preemptive versions of online scheduling algorithms is proposed with assumption that reduce function cannot be parallelized because of complexity and therefore the jobs are scheduled by release time. Another non-preemptive algorithm has introduced that is named MF-LPT (Map First Longest Processing Time) and is based on LPT algorithm [58]. The MF-LPT first schedules map functions to finish first and then assigns reduce tasks from longest to shortest length when no map task remained. In [59] an online and offline scheduling algorithms in both preemptive and non-preemptive versions has proposed with assumptions of parallelizable map and non-parallelizable reduce tasks which designed for heterogeneous environments. This algorithm in non-preemptive version tries to use LRF (Large Reduce First) algorithm that they designed for offline version and introduces OLRF (Online Large Reduce First) for online version. The LRF tries to find large reduce tasks and their related map tasks to schedule them first and then does the same for other remained map and reduce tasks consequently. LsPS [60] is an online scheduling algorithm, attempting to improve system response time by finding user job size pattern and using collected information for proper resource sharing of the users. It exploits a self-tuning policy in which schedules jobs in two levels: the user across other users and individual user. In the first scenario, job size of each user considered for sharing resources and in the second one, job size distribution used for job scheduling.
LA [61] is a load-aware scheduling method for heterogeneous environments concerning cluster performance by addressing the problems that initiated by dynamic loading. The LA has two main components: the data collector that periodically obtain TaskTracker information and task assigner that schedules tasks based on collected information by data collector component. FiGMR [62] is a fine-grained and dynamic scheduler which exploits log of task execution node to detect slow map and reduce in addition to evaluating performance of each node. It prepares backup tasks for slower items to run on faster nodes. Moreover, the FiGMR cares about data locality and acclaimed to reduce total execution time and consequently improving the performance.
MASHERS [63], is a jointly scheduling algorithm that is proposed that aims to improve all three phases of the MapReduce (i.e. map, shuffle and reduce). In this study a heuristic constant factor approximation method has proposed for the problem that exploits linear programming with considering map and reduce tasks dependencies. The methods shown to be well in practice for performance improvement. In [64] MaxSRTP and SplitSRTP methods are proposed in complementary manner to schedule overlapping map and shuffle phases of the MapReduce in worst 2-competitive. This may happen when some tasks of a job start shuffle phase while others be still in map phase. Furthermore, this paper proves that offline case of scheduling problem of minimizing average response time is an NP-hard problem. In [65] a precedence-aware scheduling algorithm aiming to minimize weighted job execution time with considering multiple round of map and reduce functions execution has proposed. The method has improved the algorithm proposed in [66] which is in form of flexible flow shop [67]. The designed algorithm is
In [68] a scheduling method is proposed aiming to improve average job completion time. The method schedules jobs based on information gathered from jobs execution, like estimated job arriving rate and average time of job execution. The method classifies jobs by finding appropriate resources satisfying their demands using collected information. Furthermore, it is adaptable to system changes and makes required changes on class for better performance. In [69] two classes of algorithms has proposed for reducing execution time of workloads. The first algorithm does optimizations in the job-level and the second one concentrates on map and reduce tasks optimization. The heuristic algorithms improve both of the total completion time and makespan time of workloads by job reordering and slot configuration approaches. Ant [70] is a self-adaptive task tuning scheduler that aims to improve performance of heterogonous clusters. The Ant tries to find the best configurations for each task when runs on different nodes. It groups nodes based on their hardware specifications and considers each group as a homogenous cluster and exploits self-tuning algorithm for them. The Ant at first selects random configuration for the tasks and gradually improves the configuration based on their first running performance. The Ant exploits genetic algorithm for adaptive task configuration to avoid local optimal configurations.
In [71] another scheduling method has proposed that concentrates on short jobs execution time and resource utilization improvements in MapReduce using wasted resources. This method records statistics of running tasks and furthermore tries to predict execution time of other not-running tasks. Subsequently, it exploits gathered data to reuse released resources of running tasks for unscheduled tasks. Tyrex [72] is a size-based scheduling method concentrates on improving short jobs completion time in the MapReduce. The Tyrex partitions the resources of system and then dynamically imposes limitations for each job based on its size and move jobs across different partitions. The authors constructed a near optimal statistical model for dynamic moving of jobs across the partitions.
Fairness
In this sort of scheduling methods, priority taken into account to accomplish negotiated service level to user. Fair Scheduler [73] has proposed to provide MapReduce scheduling solution for multiple users. The Fair Scheduler provisions job isolation and minimum of user share by its two levels mechanism. It allocates task slots to pools and the pools to jobs subsequently. Dynamic Proportional Share Scheduler (DP) [74] is a scheduling model which lets users to assign priorities of jobs and define required resources. Moreover, it can schedule jobs based on user budget and service levels. If no such of data is provided by the user, the scheduler acts like the Fair Scheduler. The DP preempts resources on required situations to prevent starvation of bigger jobs. This scheduler exploits multiple queues to handle the situation wherein more service levels are available for the providers. The DP provides scalability, dynamic and adaptive service levels and mitigates system overhead whereas it suffers from low control over long-time running tasks.
Quincy [75] is a fair scheduling algorithm that mapped the scheduling problem into min-cost flow problem in a way that fairness, data locality and starvation freedom considered as edge weight and afterwards the optimal online solution is computed based on global cost model. The limitation of this method is that the fairness concept only considered as number of allocated computers to a job without considering sharing of the other resources. In [76] a size-based scheduling method named HFSP has proposed that estimates job execution time online and provides fairness with low response time and additionally it is fault-tolerant for wrong estimations. The HFSP exploits the concept that the progress of the job can be modelled by aging function preventing major job starvation. Another algorithm is proposed in [77], named regulated dynamic prioritization that provides multiple service levels according to priorities defined by user. It introduces an efficiency evaluation criterion for better testing of algorithm and being more general and case-independent. The method uses three levels of priority in the algorithm, namely, priority of bottlenecks in the workflow, priority of each stage and the whole workflow priority. They have tried to isolate clusters of the MapReduce in proper way to improve performance along with preserving user defined priorities.
In [78] a scheduling algorithm is proposed that evaluates jobs and exploits pre-emptive job resource allocation. It estimates task running time and assigns the resources to jobs with higher priority based on priorities defined by the user. A hybrid scheduling method has introduced in [79] that provides fairness and better performance. It selects the best algorithms based on job scale. It considers three schedulers to select; namely, FIFO, Fair Sharing and COSHH algorithms that introduced by the authors. The COSHH exploits linear programming to classify jobs based on their required resources and attempts to find proper resources for the jobs, that has proposed in the same study. In [80] a scheduling algorithm has introduced attempting to improve performance as well as fairness. It exploits history of completed tasks in combination with Delay Scheduler [9] in a way that it evaluates performance of each slot based on history of completed jobs.
DynamicMR [81] exploits multiple methods to improve slot resource allocation with concerning fairness and performance. The DynamicMR consists of the following components; DHSA for improving slot utilization, SEPB for balancing performance between single and batch jobs and PreSchdeule for improving data locality as well as preserving fairness. FRESH [82] aims to improves fairness and makespan time by configuring slots and assigning tasks to each slot properly. It does the slot configuration statically before starting task and dynamically at runtime. The authors furthermore provide a new definition for overall fairness to obtain more accurate results.
Deadline awareness
Sometimes the jobs must be finished before their deadline time. In this type of jobs, finishing the job after desired time is useless, therefore MapReduce schedulers should handle jobs that have deadline with special methods. In [83], the problem of long time running tasks has discussed with deadline consideration using two approaches: in the first approach scheduling done by considering running time of map and reduce functions, data distribution and input data locale. In the second approach, the scheduler tries to estimate the time required for each job completion assuming that the environment is homogeneous, map and reduce functions done in order, the key value is uniform and the fact that data is managed by HDFS. The second model needs that each job has a deadline defined by user. Moreover, in this method the input deadline time is considered as maximum running time. The method tries to meet deadlines by assuming other jobs in the cluster as not-running and to execute as many as possible jobs due to deadlines.
ARIA [84] is a method introduced for meeting soft deadline of jobs according to service level objectives. It exploits three related components. The scheduler builds job profile for each new job and gathers its critical information during map and reduce tasks. Afterwards the scheduler determines job ordering based on the earliest deadline first policy and dynamically estimates the required resources for each job based on collected job profiles to provide job requirements for making possible the execution of job within its deadline. In [85], a heterogeneous environment algorithm is proposed that adapts to dynamic resources along with meeting the user defined job deadlines. Additionally, the algorithm is data locality aware when manages multiple jobs in the workload. It calculates average task execution time for all jobs continuously and exploits the collected information for scheduling jobs due to their defined completion time.
PDCS [86] is a pre-emptive deadline-aware scheduler that has proposed to minimize total job completion time. It provides slot utilization and prevents job starvation. The authors provide a heuristic based on current status of the system, number of free slots and input size to achieve the best results on computational intensive workloads. BGMRS [87] is a deadline-aware scheduling method in which converts deadline constrained scheduling problem into minimum weighted bipartite graph problem and exploits ILP modelling to solve the problem optimally. Moreover, it proposes a heuristic algorithm that contains node grouping method to reduce computation time. The BGMRS aims to improve data locality in conjunction to meeting job deadlines. The method considers the whole job deadline in a way that if each of the map or reduce tasks of a job take more time to complete, it changes deadline of remaining map or reduce tasks to prevent violation of job deadline.
MRCP-RM [88] is a scheduling algorithm which models the scheduling problem as an optimization problem. It exploits constraint programming approach for solving the optimization problem. It considers multiple parameters for the workload as SLA; namely, execution time, earliest start time and deadline of each job. Subsequently. when a job submitted with these characterizations, the MRCP-RM tries to process jobs stream according to their requirements efficiently. HCP-RM [88] is another constraint programming based management algorithm that is designed for Hadoop that besides features of the MRCP-RM, provides better data locality for the system.
In [89] a couple of schedulers for execution of both real-time and non-real-time jobs has proposed. The first scheduler is a deadline-aware scheduler that satisfies job’s deadline as well as maximizing number of jobs running concurrently using task forward scheduling technique and AUMD resource allocation model that introduced in the same study. The second is a two levels scheduler in which mixes the deadline scheduler with currently existing non-real-time schedulers while avoiding job starvation. WOHA [90] is a deadline constraint aware scheduler that assigns different priorities to each of workflows. It allows each client node to produce scheduling planes locally to use by master node for resource allocation. Finally, the master node builds proper algorithm using gathered information from client nodes.
Paused Rate Monotonic (PRM) [91] is a scheduling algorithm that provisions concurrent service sharing in jobs with deadlines. This method, exploits two ways for handling the problem: The first is that for concurrent sharing, it tries to detect details of tasks and its related procedure. The second, it supports jobs with distinct priority. The PRM handles jobs by pausing between map and reduce functions to manipulate intermediate generated data to meet constraints. Another data locality aware and deadline constraint based scheduling algorithm proposed in [92] which tries to compromise between job blocking in run-time and to provide maximum of possible data-locality to improve performance and energy efficiency.
Resource utilization
Resource-aware schedulers decide for each MapReduce task and assign them to best of available TaskTrackers to gain best efficiency. RAS [11]is a resource aware scheduling algorithm introducing a new concept named “job slot” which means a slot that is bound to specific job with map or reduce task type. The algorithm tries to adapt demands with available amount of resources and decides for scheduling based on previous job execution logs. Moreover, The RAS obtains completion time of jobs from the user as input and tries to meet the time goal in the soft manner. TuMM [93] is a scheduler that relies on self-adjusting slot configuration for resource utilization improvement. Static slot configuration is claimed to be an important factor that causes low resource utilization. Therefore, the authors proposed a method exploits slot ratio for assigning to map and reduce tasks for minimizing makespan time. The TuMM has two main components: the first is workload monitor that estimates current map and reduce tasks execution time based on extracted information of recently finished tasks and the second is slot assigner that exploits the collected information by workload monitor component for slot assignment.
MROrchestrator [94] is a dynamic fine grained resource manager which is designed due to poor performance of static resource allocation because of variable demands for different resources in the system. It builds online resource demand estimator based on runtime resource usage logs of the tasks. This results improvements in both resource utilization and job completion time. In [95] the packing server notion is introduced for proposing scheduling algorithm to improve resource utilization. The algorithm converts the problem of MapReduce precedence constrained workflows scheduling to independent scheduling of the tasks in system runtime.
PRISM [96] is another resource aware scheduler in “phase” level. It introduces “phase” as a sub-section of task that has uniform resource consumption during its execution. The different phases of a task can have different resource requirements. This granularity level lets defining required amount of resources in the phase-level in addition to the task-level. The PRISM has three main parts: master node phase scheduler, manager of local node and progress monitor of the jobs. JAS [97] (Job Allocation Scheduler) is another proposed scheduler based on job workloads to improve resource utilization. The JAS classifies jobs to CPU and I/O bound items to put in queues. Afterwards, TaskTrackers are evaluated for assigning CPU or I/O bound and capability of each slot revealed for job assignment. The data locality aware version (JASL) and its dynamic version (DJASL) proposed in the same study for overcoming locality problem of data, raised because of the jobs classification.
In [98] a resource stealing approach used for resource utilization improvement. This method assigns reserved resources of idle slots to new tasks resulting full utilization of wasted resources without altering job scheduling. Additionally, in the same study a benefit-aware speculative execution is proposed that considers estimated benefits of running speculative tasks to decide whether give extra priority to it or not. In [99] a couple of task schedulers has introduced that exploits simultaneous scheduling of map and reduce tasks to improve data locality and prevent task starvation problem caused by separate scheduling of map and reduce tasks. Furthermore, considering interaction between the schedulers, provides more data locality. The method starts reduce tasks according to the respective map tasks progress that makes improvement on single job running.
Energy saving
Energy consumption reduction in the MapReduce running systems is another item of overall system optimization. An energy-aware scheduler is proposed in [100] to minimize energy consumed by data-centers in conjunction with meeting deadlines and SLA. In this heuristic algorithm a metric introduced for assigning energy consumption value for each machine slot, named energy consumption rate (ecr). The ratio named ecr is different in two types of algorithms that is proposed; in the first one, minimum value of map and reduce energy consumption rate is used and in the second type average value of energy consumption rate is used. The lower value of ecr indicates higher chance of the slot for task assignment and vice versa.
In [101] a scheduler has introduced that exploits two runtime configurations for improving energy efficiency, namely, degree of MapReduce workers’ parallelism and dynamic voltage and frequency scaling (DVFS). The number of concurrent processing is calculated based on number of allocated worker nodes and the number of concurrent tasks of each node. This method exploits maximum processing frequency when map or reduce tasks are running and minimum frequency for others (like I/O operations). Another energy consumption aware and simultaneously data locality optimal method proposed in [102], exploiting integer programming. In this work genetic algorithm is used to obtain multi-objective optimization in conjunction with MOEA/D [103] algorithm. The authors found a decent coding method to able to use genetic algorithm that makes the searching faster and reduces its overload.
In [104] a scheduling algorithm has proposed for heterogeneous environments with energy saving goal in conjunction with user SLA. This method considers power consumption of CPU and storage by analysing distribution of intermediate key data to select best resource for task execution. The power used by storage and CPU reduced by utilizing slack time of map and reduce tasks. EMRSA [105] is an energy aware algorithm that finds appropriate slot for map and reduce tasks with considering energy efficiency and deadline SLA. It first models the scheduling problem with integer programming and then exploits a greedy approach to find solution. In [106] a method named EAMR-RM has proposed that meets user defined deadline as SLA in conjunction with energy saving. The EAMR-RM exploits constraint programming approach for resource management. It uses slack time of tasks to execute some tasks in reduced CPU working frequency. In [107] a semi-dynamic online method for decreasing energy consumption is proposed that is based on adaptive task partitioning. The proposed algorithm furthermore satisfies user defined deadline.
Cost minimization
Cost is an important criterion in the cloud for the end user for various resources that used, such as data storage, network etc. One of the constraints which is considered in SLA agreement is user budget and this implies importance of cost minimization. Zeng et al. proposed MASA [108] which is a greedy heuristic algorithm with deadline constraint and cost minimizing approach in the context of SLA. They calculate minimum and maximum of map tasks required for each job in a way that cost and deadline constraints not violated and then define bounds for reduce tasks based on calculated number of map tasks. Afterwards, they calculate each of possible map and reduce tasks numbers greedily along with classifying virtual machines based on their computation power. Finally, the time and cost of allocating each task to the virtual machines calculated and the allocation is done for the items that do not violate any constraint and simultaneously have minimum cost.
In [109], two cost and time constraints aware algorithms proposed in the task level. The authors used dynamic programming and greedy method to schedule tasks for optimizing completion time and meeting budget constraints. Afterwards they introduced two greedy algorithms to improve time complexity of scheduler whilst preserving cost limitations; namely GGB and GR. Both of GGB and GR algorithms claimed to be near optimal in which decrease scheduling required time substantially. Cura [110] is a cost aware scheduler designed to provide better solution specially for interactive job type that is most of jobs type in today workloads. It selects best cluster configuration in the VM for each job without the need of user resource selection. Moreover, Cura utilizes workload slack time to optimize the resource provisioning by delaying jobs. The ChEsS [111] is a scheduling method trying to meet budget constraint as well as improvement of workload makespan. It is a Pareto-based search algorithm that assigns each job to a cluster in near to optimal way. Furthermore, ChEsS considers data locality, processing power of each cluster and intra-job scheduling method used of the cluster, for best assignment of jobs to clusters. In [112] a multi-objective scheduling algorithm is proposed that exploits earliest completion time approach which addresses cost and finishing time that results shown its better throughput over FIFO and Fair schedulers. The scheduling problem with budget constraint is considered to minimize the sum of all tasks execution times with distinct budget for a job.
The percentage of concerned quality measures in the studies.
In this section, we discuss about the outcome of the review and categorizing of the literature studies and then review available evaluation methods and benchmark datasets.
The quality measures concerned by the reviewed methods
The quality measures concerned by the reviewed methods
The problem of optimal MapReduce scheduling is an important part of the overall big data system improvement. The MapReduce used for big data processing and its outcome evaluated by some quality measures defined in the literature, namely, data locality, fairness, performance, resource utilization, deadline awareness, energy saving and cost minimization. The proposed scheduling algorithms in various studies try to improve one or more items of these quality measures. Nevertheless, improving one of these quality measures affect others. Therefore, some of the scheduling algorithms provide further improvements on other quality measures additionally. The summary of quality measures that each of reviewed methods are concerned, is shown in the Table 4. The gathered information from the selected studies demonstrates that most of improvements done in the MapReduce scheduling methods are performance improvements. Indeed, data locality and resource utilization quality measures could affect performance implicitly. Moreover, the data locality is in the second rank of improvements. The percentage of each quality measure improvement in the selected studies is shown in the Fig. 3.
The number of the MapReduce scheduling methods is growing gradually and further, by development of cloud computing usage in the world, the need for optimal scheduler is growing. The number of papers from 2009 till 2017 years is shown in the Fig. 4 based on quality measure concerns.
The most popular benchmarks used by the studies
The most popular benchmarks used by the studies
The number of papers about the MapReduce scheduling published yearly.
The scheduling algorithms could act in two major levels of granularity in MapReduce environment, namely, task level and job level. The task level means scheduling at the level of map and reduce tasks and the job level means scheduling and decision making on the whole of a job entity without considering map or reduce task concept independently. Moreover, some of the studies such as [50, 71, 92] combine task level with job level scheduling to obtain better results. We have found out that the most of the studies have done in the task level since it has more potential for optimization. The percentage of each level used by the studies is shown in the Fig. 5.
In order to find out the efficiency of each method designed for MapReduce optimization, it should be evaluated either on a real or in a simulated system that processes data using the MapReduce approach. The best evaluation method is to implement the MapReduce scheduler and to test it in a real system. However, this approach requires vast amount of hardware infrastructure. Consequently, in some of the studies such as [52], the authors exploit MapReduce simulators such as: MRSim [113] and HSim [114].
The percentage of scheduling level of reviewed studies.
The pitfall in MapReduce scheduler evaluation is the different behavior of the scheduler in various workloads. This makes benchmarking a misleading approach for testing methods. Indeed, a scheduler may be work efficiently with only some of workload data. This causing that comparing the scheduling methods become difficult. The studies exploit diverse set of benchmarks as input for testing the algorithms. The available benchmarks include numerous datasets e.g. Word-count, Tera-sort etc. The most used benchmark suits that are available on the web openly, are introduced in the Table 5.
The quality measures required by the new big data applications are growing, therefore the new schedulers to be designed for satisfying the new requirements. The reviewed studies are mostly concentrated on one metric and only a few of them concern numerous metrics.
The authors evaluated their method only for their focused metrics without considering the other quality measures. Another open issue is the fact that improving one quality measure in the MapReduce can affect other metrics negatively. The proposed methods are evaluated by the authors only in the concentrated metrics and this leads to improper compare with other scheduling algorithms.
A future work can be designing schedulers with new required quality measures and improving currently focused quality measures without any negative effect on other quality measures of the system.
Conclusion
The MapReduce paradigm aims to tackle with big data despite current hardware limitations. The scheduling in the MapReduce is an important factor to be optimized.
In this study we have selected 93 papers from 813 papers that was available in the literature. Afterwards, we have categorized selected studies that proposed the MapReduce scheduling algorithms based on their main quality measure that they aim to improve it. The constructed categories are data locality, fairness, performance, resource utilization, deadline awareness, energy saving and cost minimization. The most contribution was on data locality and performance of the scheduler. The selected studies in each of the categories described concisely and statistical details of reviewed studies gathered and represented in the systematic manner. Moreover, we have discussed about pitfalls and trends of the research in the MapReduce scheduling field for future works.
