Abstract
Parallel processing is crucial for accelerating computation in many high-performance applications and modern technologies including computational modeling, optimization and simulation, Web and DNS servers, peer-to-peer systems, grid computing and cloud computing. Due to the heterogeneity nature of various processing nodes and the differences of workloads of various tasks, some processors can be idle while others are overloaded. In this paper, we present a simple, yet efficient, solution inspired by the intelligence of ant colonies to adequately mitigate the load imbalance and communication overhead problems in multiprocessor environments. The proposed approach is based on defining and maintaining data structures to dynamically track the load of each processor. We implemented the proposed algorithm and evaluated its performance under different scenarios against the baseline round-robin algorithm. The results showed that the proposed algorithm has more effective properties than the round-robin algorithm.
Keywords
Introduction
In early stages of computing, computers were executing instructions sequentially causing a huge delay for computationally- and data-intensive programs. In order to overcome this problem, many models have been proposed in the past for parallel computing and multiprogramming systems, symmetric multiprocessing and massively parallel processing [1]. The concept of parallel processing is based on processing program instructions by dividing them over multiple processors aiming to execute the program in less time with high throughput and better utilization of resources. Parallel processing has become an essential paradigm for many real-world applications such computer graphics and visualization, image and video processing, computational modeling, optimization and simulation, weather forecasting, environmental modeling, material and geophysics sciences, etc. However, several challenges are associated with such systems [2, 3]: Algorithmic overhead: presented when programmers are required to spend more efforts in parallelizing sequential programs by considering and adding extra parts in their code such as the related parallel prefix. Communication overhead: presented when the time spent in communication between nodes proportionally increases. Critical paths problem: presented when dependencies exist between operations spread over multiple processors. Bottlenecks: presented when one processor is holding a required resource while the others are idle waiting for this processor to release this resource. Memory performance: presented when the time taken to access data from memory is much higher than the time required for execution. Load imbalance: presented when the load is inefficiently distributed over multiple nodes. This issue becomes extremely pivotal in system performance as nodes heterogeneity increases.
The problem considered here is the load imbalance in multiprocessor environments, where the load is distributed unequally between the processing units. In such scenarios, the result is that one or more processors are having high load while others are having much less. This inefficient load distribution between processors leads to inefficient utilization of the available processing resources. Moreover, it increases the overall time required for the execution of tasks. In this paper, we briefly review the state-of-the-art techniques and present a novel approximate approach based on the collective behavior of ants to overcome the load imbalance problem while eliminating the communication overhead.
The rest of the paper is organized as follows. Section 2 describes the problem under the study. Section 3 reviews related work. Section 4 describes the proposed methodology. Section 5 evaluates and compares the effectiveness of the proposed method under various scenarios. Section 6 concludes the paper and highlights future work.
Problem formulation
The problem considered in this study is to efficiently distribute N tasks over P processors. Each task has a workload on each processor which may differ from the other tasks. Tasks arrive over time and are allocated to processors by the load balancer. To achieve the highest efficiency in load balancing between the processors, all processors should have almost the same load. The highest possible efficiency is when the load difference between processors equals zero, which means all processors have the same load as presented by the below formulas.
We use the notation I
i
(P
j
) =1 to mean the i-th task is allocated to j-th processor; otherwise I
i
(P
j
) =0. We also use L
i
(P
j
) to refer to the workload of the i-th task on the j-th processor. For a specific distribution of tasks, the total load on the j-th processor is given by:
For the ideal distribution, each processor must have an equal share of the total workload which is given by:
The imbalance factor α (P
j
) refers to the deviation of the load on the j-th processor from the perfect average load μ and is given by:
It is required to distribute arrived tasks over existing processors efficiently in order to minimize the maximum imbalance factor between processors while reducing the communication overhead.
Load balancing refers to spreading the computation load evenly across the processors while minimizing the required communication among them. The proper load balancing technique for regular or irregular algorithms relies on certain properties concerned with the behavior of data and tasks, e.g. [4]: (a) Task-oriented or data-oriented algorithmic structure, (b) Regular or irregular data input, and (c) Static or dynamic data nature. Another very important issue is the communication pattern resulting due to the distribution of data and tasks. The communication behavior depends on the characteristics of the algorithm and the implementation strategies. The locality of data dependencies, the locality of data patterns and the locality of communication patterns are all very important factors as well. Moreover, issues related to load balancing should be considered such data redistribution or tasks migration between processors [5].
A recent literature survey is presented in [6]. The authors classified load balancing techniques into seven categories: Hadoop MapReduce, natural phenomena-based, agent-based, general load balancing, application-oriented, network-aware, and workflow-specific.
Several algorithms or models based on swarm systems have been proposed and applied in real life [7]: (a) Ant Colony Optimization (ACO), (b) Particle Swarm Optimization (PSO), (c) Glowworm Swarm Optimization, and (d) Artificial Bee Colony (ABC). ACO was inspired from the behavior of biological ants [8]. Mainly these ants deposit a chemical substance, known as ‘pheromone’, on the ground while searching for food, in order to mark the best paths that should be followed by other members of the colony. Pheromone also evaporates as time passes. The net result is that more pheromone is accumulated on shorter paths. ACO uses a similar mechanism in order to solve optimization problems such as the traveling salesman problem [9]. An ant which is currently in the i-th city has to choose the next city to visit via a stochastic mechanism similar to the behavior of natural ants.
Willebeek-LeMair and Reeves [10] presented five strategies for dynamic load balancing on highly parallel computers. These methods are built on the tradeoff between the accuracy of balancing decisions and the communication and processing overheads. In [11], Sinclair proposed another dynamic algorithm which provides quick and accurate assignment for a number of interacting tasks over multiple processing nodes based on Global Scheduling Table (GST). This algorithm combines the information contained in the program graph and the network graph of the system with the current state of the system for the assignment of tasks. Ludwig and Moallem [12] compared multiple swarm intelligence approaches for load balancing in grid computing.
Recently the concept of cloud computing has emerged based on the idea of distributed and grid computing. Many research studies have contributed to realizing this powerful technology as a scalable service over the Internet. One of the core tasks in cloud computing is load balancing to avoid having some nodes overloaded while others are idle. In [3], the authors presented a systematic comprehensive review of existing challenges, techniques, and implementations of load balancing in cloud environments. In [14], a task scheduling policy is proposed for cloud computing based on ant colony optimization. Cloudsim is used for setting up the testing environment and the number of virtual machines in each datacenter is set to 50. In [15], another method is introduced based on ant colony optimization to overcome the load imbalance in cloud computing. The introduced algorithm is based on a distributed pheromone level update mechanism for load balancing. In [16], the advantages and disadvantages of multiple load balancing schemes for different cloud environment are discussed. The paper concludes that static load balancing schemes are easier in terms of simulation and monitoring of the environment. However, static methods cannot model the heterogeneous nature of cloud computing. On the other hand, dynamic load balancing schemes are difficult to simulate, but they are considered as the best solutions for the heterogeneous environment of cloud computing.
In [17], the authors discussed load balancing problems on grid systems and proposed a solution based on ant colony optimization. The experimental results show that while implementing the proposed algorithm to the grid environment, increasing the number of jobs and their length has an insignificant impact on the system response time. Another related work has proposed a grid balancing algorithm using ant colony optimization [18]. This algorithm takes into consideration the available resources, the critical path and the processing speed of the resources.
Although there have been a number of related research studies, the majority of the existing solutions are either increasing the time spent for communication between nodes or do not achieve load balance between nodes in high efficiency. In this paper, the proposed method is focusing on mitigating the load imbalance while eliminating the time spent in communication between the processing nodes for the purpose of getting the current load status on each processor.
Proposed methodology
There are multiple processors (P) and multiple tasks (N), which may be different in terms of nature and behavior. The proposed method is based on the behavior of ants. For each task, there is one ant. Each processor is represented by an ant colony, thus the total number of ant colonies equals the total number of processors. We use pheromone to represent the estimated time required for execution of a task. The proposed method assumes that ants are directed towards the highest pheromone level, which refers to the lowest required execution time. The proposed load balancing is based on defining two data structures (e.g. tables): Ants Table: This table represents the tasks and includes for each task its assigned ant colony and estimated execution time (aka task’s pheromone level). An example is shown in Fig. 1(a). Ants Colony Table: This table represents the processors and includes the scheduled tasks for each processor as well as the total pheromone of assigned tasks. For each ant colony, the assigned ants (tasks) are stored as a linked list of Tasks IDs. An example is shown in Fig. 1(b).

Illustrative example of Ants and Ant Colonies Tables.
Moreover, the proposed method defines two procedures: (a) Add task: triggered when a task arrives (pseudocode is shown in Algorithm 1), and (b) Remove task: triggered when a task is completed (pseudocode is shown in Algorithm 2).
1: Estimate the pheromone level for this task (i.e. required estimation time on each processor)
2: Select the ant colony with the minimum pheromone
3: Add an ant to the Ants Table with its assigned Ant Colony and Pheromone level
4: Update the Ant Colonies Table by adding the new Ant to the list
5: Update the Pheromone level of this specific colony by adding the estimated pheromone of that Ant to its total pheromone level
1: Access the Ants Table and get the associated Pheromone level and the associated Ant Colony
2: Access the Ant Colonies Table and remove the task
3: Update the Pheromone level of this specific colony by deducting the Pheromone Level of the departed ant
4: Perform shift left for the tasks assigned to the associated Ant Colony
5: Remove the Ant’s record from the Ants Table
In this section, we evaluate the performance of the proposed algorithm taking into consideration two critical factors: (a) number of tasks, and (b) size of each task.
For the evaluation purpose, we use the following criteria: (a) maximum imbalance α
max
, (b) average imbalance α
avg
, (c) maximum percentage imbalance α
max
(%), and (d) average percentage imbalance α
avg
(%):
The initial assumption was having multiple same size tasks or programs that need to be run in parallel, e.g. SPMD. To study the behavior and the accuracy of the proposed algorithm, the algorithm was tested on several scenarios. Initially, the size of tasks was maintained but the number of tasks at each time was changed. Then, the number of tasks was maintained but the size of tasks was changed.
Effect of number of tasks
Here, the size of tasks was maintained fixed at 10, and three scenarios were tested when the number of tasks was 10, 20, and 30. The results of maximum and average percentage imbalance are shown in Table 1. It is apparent from these results that imbalance could be predicted if the size of tasks is fixed by the following rule:
Maximum and average percentage imbalance of the proposed algorithm for 10, 20 and 30 unified-size tasks
Maximum and average percentage imbalance of the proposed algorithm for 10, 20 and 30 unified-size tasks
α (P j ) ←0
Obviously, both the number of tasks, N, and number of processors, P, directly affect the accuracy of the proposed algorithm.
Here, three sizes of tasks were considered (10, 20 and 30) but the number of tasks was maintained in each test to be 100. The algorithm was run multiple times and at each time the number of processors was changed. From the results of the three test cases, it is obvious that the imbalance factor increases as the sizes of the tasks increase, as shown in Fig. 2. However, the percentage imbalance remains the same if the number of processors and the number of tasks remain the same, as shown in Fig. 3.

Imbalance factor when the number of tasks is kept fixed and sizes of tasks are changed.

Percentage imbalance when the number of tasks is kept fixed and sizes of tasks are changed.
Additionally, we considered the case of having multiple tasks of random sizes that need to run in parallel, e.g. MPMD. To study the behavior and the accuracy of the algorithm, the algorithm was run for three scenarios; each having 100 tasks with randomly generated sizes. First, the performance was evaluated for unordered tasks. Then, the effect of changing the sequence of arrival was examined. Therefore, the tasks were sorted according to their sizes from the smallest to the largest (ascending order) and from the largest task to the smallest (descending order). The same tasks that were used in the previous part were used again here, but the order of task assignment to processors was sorted. The results were illustrated in the Figs. 4 to 6. Here, it is obvious that having tasks ascendingly sorted (with random arrival) does not affect the accuracy of the algorithm that much. However, the significant change was observed when the tasks were descendingly sorted. This fits well with the theory of ‘The Big Rocks of Life’ by Stephen Covey [19]. When the assignment of tasks starts with the largest task, the degree of imbalance was reduced in all three test cases.

Performance for the first test case (unordered tasks).

Performance for the second test case (ascendingly-ordered tasks).

Performance for the third test case (descendingly-ordered tasks).
The proposed algorithm could be considered as hybrid due to its both static and dynamic approaches. The dynamic approach is observed in the dynamic process of updating the tables (Ant Colonies table and Ants table). The main purpose of maintaining these tables updated is to eliminate the requirement for each node to communicate with the other nodes to get the total pheromone level on the other nodes. The only time required to be spent here by each node is the time to access the Pheromone Table and the time required to update the tables. Therefore, it is possible to say that the time spent in communication between nodes to assign a new task is reduced. The static approach of the algorithm is observed as there is no reassignment of tasks once assigned to the destination node. The advantage here is that the time spent in re-balancing the load is omitted. However, the task size is an estimated number and tasks with big sizes may finish execution before smaller sizes tasks due multiple reasons (e.g. file does not exist). Therefore, due to the hybrid nature of the algorithm, it is obvious that the algorithm may help to reduce the overall time spent in communication between nodes. Comparing the results generated by the algorithm with the findings in [12], the average load balancing while using ant colony is 65± 0.7 % and it comes in the fourth place in terms of accuracy after PSO, State Broadcast Algorithm and Random Space Shared. However, from the results provided here, it is obvious that the accuracy level of the proposed algorithm is much more accurate than the results presented in the paper due to the consideration of multiple directly affecting variables. These variables are: (a) Programming model (SPMD or MPMD), (b) Task size, (c) Sequence of the arrival of tasks, and (d) Number of processing nodes. Considering these factors leads to realizing that it is impossible to stick to a static average imbalance or average load balancing level, as the imbalance value and the way to calculate the imbalance vary if one or more of the variables above was changed. Moreover, the results which were presented previously and generated based on the proposed algorithm are more accurate than the results presented in the paper which are based on the algorithms used by the researchers. On the other hand, there are also multiple variables ignored which may affect the accuracy such as Computer Architecture, Technology, etc. The main reason to ignore these factors is to maintain the generalization of the algorithm rather than the dependency on one technology.
In addition to that, the results of the proposed algorithm were compared with the results of the baseline Round Robin approach. For this comparison, both approaches were tested under similar conditions and using the same data. For the first set of comparisons, the same data sets used previously without ordering were used. The results are shown in Table 2. From these results, it is clear that the the proposed algorithm is more accurate compared to the Round Robin approach.
Comparing the Proposed algorithm with Round Robin in terms of percentage maximum and average errors – for unordered tasks
Comparing the Proposed algorithm with Round Robin in terms of percentage maximum and average errors – for unordered tasks
In the second set of comparisons, the same ascendingly-sorted data sets used previously were used again. The results in Table 3 show that both algorithms have very close performance. Finally, Table 4 shows the results for the descendingly-sorted data sets. From these results, it is obvious that the proposed algorithm is more accurate than the baseline Round Robin approach.
Comparing the Proposed algorithm with Round Robin in terms of percentage maximum and average imbalances – ascendingly-sorted tasks
Comparing the Proposed algorithm with Round Robin in terms of percentage maximum and average imbalances – descendingly-sorted tasks
A simple approach based on the behavior of ant colonies was suggested for solving the load imbalance problem in parallel and distributed processing while reducing the communication time between nodes. Using global tables, newly arrived tasks are assigned to the processing node with the least load. The effectiveness of the presented approach is evaluated under different workload scenarios. The results of its implementation show that the suggested algorithm has improved performance compared to the baseline round-robin load balancing algorithm. Moreover, the maximum imbalance could be predicted for the programming model used. As future work, we recommend taking into consideration the task priority for the revision of tables during the execution of tasks, which can further improve the efficiency through dynamic load redistribution. Therefore, we recommend more research on this issue. We also recommend exploration and comparison with other swarm intelligence algorithms.
Footnotes
Acknowledgments
The authors would like to thank Ahlia University (Bahrain) and King Fahd University of Petroleum and Minerals (Saudi Arabia) for support during this work.
