Abstract
Metaheuristic algorithms are a family of algorithms that help solve NP-hard problems by providing near-optimal solutions in a reasonable amount of time. Galactic Swarm Optimization (GSO) is the state-of-the-art metaheuristic algorithm that takes inspiration from the motion of stars and galaxies under the influence of gravity. In this paper, a new scalable algorithm is proposed to help overcome the inherent sequential nature of GSO and helps the modified version of the GSO algorithm to utilize the full computing capacity of the hardware efficiently. The modified algorithm includes new features to tackle the problem of training an Artificial Neural Network. The proposed algorithm is compared with Stochastic Gradient Descent based on performance and accuracy. The algorithm’s performance was evaluated based on per-CPU utilization on multiple platforms. Experimental results have shown that PGSO outperforms GSO and other competitors like PSO in a variety of challenging settings.
Keywords
Introduction
"Look deep into nature and then you will understand everything better" - Albert Einstein.
Historically man has looked up to nature for seeking clues and inspiration for inventions. From taking inspiration from birds for constructing airplanes to observing swarm movements to inspire the creation of the family of meta-heuristic algorithms. For example - Genetic Algorithm is inspired by Darwin’s theory of evolution [8]. The Particle Swarm Algorithm is designed based on the flocking behavior of birds [7]. Each algorithm is suited for a certain type of problem. The choice boils down to what the application requires. Each algorithm tinkers the balance between exploration and exploitation phase. An algorithm which correctly balances the exploration and exploitation phase with multiple iterations of each cycle,in general, becomes a strong performer. These characteristics are shown by the GSO algorithm.
Our Universe has a certain balance and symmetry in the way objects interact with each other, everything revolves around something. The electrons around the nucleus, the planets around the sun, the stars around a galaxy center, the galaxies act like dots in the universe and revolve around what we today call superclusters [13]. GSO algorithm takes inspiration from this hierarchical motion. The point masses in the universe can be taken as solutions in a search space. These solutions are improved by complex communications among themselves. The solutions are then improved further by combining them. The universe is our search space and each point mass is an explorer. Notice that, each explorer has the responsibility of revolving around locally and also being part of a bigger revolving entity that gets to visit other parts of the universe. In heuristic terms, thereby providing the correct balance in terms of exploration and exploitation phases.
In the GSO algorithm [13], the stars revolve around galaxy centers. The galaxy revolves around the supercluster center. Each galaxy is represented by their global best solution which is also the galaxy center.
GSO can have any swarm algorithm mimicking the notion of a search for a solution starting from some random location in our bounds. In this paper, the focus is on using PSO as the algorithm playing the role of a galaxy with stars as the particles. This choice is made because of its faster converging property which will be useful when describing how to train a Neural Network with PGSO in Section 4. The GSO algorithm works in 2 stages. Stage 1 has multiple PSOs running independently. Having multiple independent swarms explore search space helps independent exploration. This guarantees robustness in solution and deals with multimodalities. Stage 2 involves using individual best solutions as initialization points to start another PSO. This step enforces the exploitation phase which guarantees faster convergence since the initialization points are themselves sub-optimal local minima. Modifications were made to this architecture to make it scalable. The performance was evaluated based on the amount of work done by 2 implementations (sequential and parallel) in the same amount of time rather than time comparisons of what portion of the code is parallel and what is not. The architecture modifications are highlighted in Section 3.
The rest of the paper is organized as follows. Literature Review is presented in Section 2, the algorithm’s architecture is presented in Section 3, the methodology that was followed to design the architecture is presented in Section 4. The Artificial Neural Network training procedure is presented in Section 5. Detailed performance analysis in terms of latency, average CPU utilization and evaluation against benchmark suite is presented in Section 6. The discussion and conclusion is presented in Section 7.
Related work
Parallel metaheuristic algorithms have found their use in various fields of science, business and industry. Some applications being - Vehicle Routing [11], Scheduling, flood management [2], designing strong car chassis [15], training artificial neural networks [10] and more. Our focus in this paper, is on exploring how the PGSO algorithm performs on optimizing an Artificial Neural Network. Let us first look at the classical approaches that are used to parallelize a population-based metaheuritic algorithm - Population parallelization Computational parallelization
Population parallelization involves letting each explorer of your swarm to work independently and simultaneously along with other members, whereas, the computational parallelization involves splitting the computational intensive region to be computed in parallel. It may also be the case that each swarm is considered as a single unit and that multiple swarms run simultaneously and explore the same search space. The authors have adopted the latter approach.
Popular algorithms like Ant Colony Optimization have been parallelized using SIMD(Single Instruction Multiple Data) format by [5]. In this implementation, each ant’s walk over the graph is treated as an independent task, where each ant’s pheromone trail is recorded in independent matrices τ j . The only drawback of this implementation is that all the independent matrices need to be merged at the end of a cycle which is a highly computationally expensive O(n2) operation. For the problem of parallelizing the Artificial Bee Colony algorithm, after employing multiple parallelizing strategies [16] found that giving a coarser level (swarm level) task to the CPU helps improve the accuracy and keep CPU cores busy for a longer time. [14] have parallelized the Genetic Algorithm [8] to train an Artificial Neural Network. The Genetic Algorithm intrinsically allows flexible distribution of tasks. The master directs sub-population of candidate solutions to slave nodes using a message passing interface. They got significant speedups from the parallel and distributed implementation.
PSO - particle swarm optimization
The author in [6] proposed the Particle Swarm Optimization algorithm inspired by the flocking behaviour of birds. By far, it is one of the most popular algorithms to have successful applications and performance in various fields, as given by [1].
The velocity and position update equations 1,2 for the sequential algorithm are given below. Here v j is the velocity of particle j, w represents the weight/importance to be given to the inertia of the particle, c1 (cognitive factor), c2 (social factor) weight represent the importance to be given to the best position found by the particle and the best position found among all particles respectively, r1, r2 are randomly initialized numbers, p j , g j , x j represent the best-recorded position by the particle, best-recorded position globally and particle’s current position respectively.
In [3] a novel algorithm is proposed called Parallel Particle Swarm Optimization(PPSO). The algorithm was applied to infinite impulse response image filters for designing image enhancement filters along with arbitrary frequency results to the multidimensional optimization problem. The experiment mainly consisted of two phases: The first phase was used for global exploration to explore multiple local minima, in which the PSO algorithm was divided into sub-populations and later on, best solutions from each sub-population are being interchanged between the cores. In the second phase, a local search is done to refine the solution computed by PSO, using the Nelder-Mead method. The author concluded that PPSO is giving better results as compared to other global optimization algorithms in terms of the mean square error between CPU usage and the ideal and designed filter frequency response. They used Computational Parallelization in their approach.
GSO - galactic swarm optimization
Muthiah-Nakarajan and Noel in [13] proposed Galactic Swarm Optimization(GSO) which is a new algorithm, inspired by the motion of stars, galaxies and supercluster of galaxies. This paper used a 2 phase approach similar to 2.1. In this paper, two phases were being discussed namely - exploration and exploration. In the explorations phase, sub-population explores the search space independently whereas, in the exploitation phase, the best solutions of sub-populations clustered together are called as supercluster, which moves towards best solutions. They proposed a methodology in which both sub-populations, as well as superwarm, were being updated and their statistical results showed GSO algorithms converges faster to more reasonable and accurate solutions compared to state-of-the-art PSO algorithms on a wide variety of high dimensional and multi-modal benchmark optimization problems.
The GSO algorithm is given in Algorithm 1.
[4] proposed a variant of GSO named Fuzzy Galactic swarm optimization (FGSO) for dynamic adjustment of parameters for the optimization of the benchmark. The term c1,c2 refers to social and cognitive factors in level 1, c3 and c4 refer to the social and cognitive factors in level 2. The fuzzy systems were of the Mamdani type in which the input was defined as the number of iterations and with two outputs. the first was with c3 in increase and c4 in decrease and the second variant was with c3 in decrease and c4 in the increase. To measure the performance of the algorithm with 17 benchmark functions with the different number of dimensions, they suggested different fuzzy systems for the dynamic adaptation of the c3 and c4 parameters. The experiments after comparing FGSO and GSO showed that their proposal achieved an improvement against the original algorithm in some cases where the original algorithm fails to reach zero. FGSO looked very intuitive in its approach to this problem. In essence, at a given time we have more explorers traversing and exploring the search space simultaneously. So when we move to higher dimensions we do not face exploration related problems as the FGSO and GSO faced. The validation for this effect has been provided in the results section.
Architecture
Coming up with a parallel version for a sequential algorithm is non-trivial. It involves the need to think parallel in a bottom-up manner to maximize gains. If the need arises even modifying the algorithm’s logic could be the right recourse given that there is no significant trade-off with accuracy. The model was evaluated based on how much work it can do in the same amount of time as the serial algorithm. In essence, it must be able to explore the search space more exhaustively in the same amount of time. The Gustafson-Barsis’s Law is based on evaluating performance based on work is given by 3
We have applied
Applications
Training a neural network with PGSO
An Artificial Neural Network (ANN) was trained with PGSO for tackling the Churn Prediction problem and the NBA player prediction problem. The former problem required the ANN to detect/classify whether a customer is going to cancel a subscription. The data collection procedure involved collecting the data on how frequently the customers use the subscribed product. The latter problem involves predicting based on a player’s statistics whether the player would last for the next 5 years. A 3-layer Neural Network with 6 nodes in the hidden layer was used for this purpose. The binary cross-entropy loss was used as the target function to optimize as given in equation 4 where
The PGSO algorithm was modified to adapt to the problem of training a neural network. Our problem was to estimate the best set of weights given a neural network architecture. The modified Neural Network architecture is given in Fig. 1. Imagine a situation where a baby is shown a million images of cats and dogs over and over again to make him understand how to recognize cats and dogs. This is what modern day algorithms for training a Neural Networks. In particular, Stochastic Gradient Descent has following drawbacks - Wasteful of data - Too much recycling Cannot approximate Non-differentiable functions
PGSO is a promising alternative towards training a neural network because it overcomes the above difficulties. In practice, PGSO could get similar amount of gains as the Adam Optimizer got within same on lesser amount of epochs for the datasets we have tested both of them on. Table 1 shows some of the datasets we tested the PGSO on.

Training a Neural Network with PGSO.
F1 scores for different data-sets
The performance of the proposed PGSO is evaluated based on the following metrics - Performance on single thread (Latency) Average CPU utilization Evaluation against benchmark suite
Performance on a single thread
The algorithm was prototyped using Numba python library which allowed us to gain C like performance in python. Even though our code should have staggered against a single-thread implementation of the original GSO our implementation dominates the single-thread implementation of GSO and PSO even on a single thread. On the Rosenbrock function the PSO algorithm took (13.3 +/-309ms), the GSO algorithm took (13.8 +/-128ms) while the PGSO took (12.9 +/-70.1ms) using identical parameters for evaluation averaged over 50 runs.
Another observation taken using the Intel VTune Amplifier tool was that it took almost twice the time (25.828sec + -40ms) on a single thread without the use of Numba.
Average CPU utilization
When it comes to scaling, the emphasis was given on how efficiently was the algorithm able to utilize all the CPU resources over time and not let any core remain idle. This intuitively translated to that exhaustive exploration of the search space of possible solutions by employing more number of workers at the same time. The Intel VTune Amplifier and other libraries were used to measure average CPU utilization over time. Figure 2 illustrates that the algorithm is in the ideal range of average CPU utilization and most of the time the algorithm was able to utilize 6 to 7 out of 8 logical CPU cores.

Average CPU utilization (Intel VTune Readings).

Convergence Characteristics of Non Continuous Rastrigin.

Convergence Characteristics of Ackley.

Convergence Characteristics of Rastrigin.

Convergence Characteristics of Rosenbrock.

Convergence Characteristics of Rotated Rastrigin.

Convergence Characteristics of Rotated Griewank.

Convergence Characteristics of Rotated Ackley.

Convergence Characteristics of Shifted Rotated Rastrigin.
PGSO was evaluated against PSO, GSO over multiple benchmark functions which extend the DeJong Test Suite [12]. The benchmark functions are considered as a gold standard in the meta-heuristics community. The benchmark functions serve to mimic real-world problem’s optimization functions. These algorithm have a varying degree of complexity. with the sphere function being the simplest and the shifted rotated versions being the hardest. The functions chosen are multimodal, which means we can have multiple good solutions which may or may not be the best solution. It also implies that there are numerous opportunities for an algorithm to get stuck at local minimas. Besides, a common trend is recorded among metaheuristic algorithms that they tend to perform exponentially bad when the dimensions of search space for each of these functions is increased considerably. These properties make the benchmark suite an ideal framework for comparing our algorithm with others.
Tables 4–6 shows how easily PGSO outclasses PSO and GSO over multiple benchmark functions. Notice how low the error values are in the case of PGSO and how big they are in the case of GSO and PSO. Moreover, when we try finding solutions for the benchmark functions in higher dimensions the performance gap becomes exponentially significant between PGSO and others. All instructions to reproduce the results, benchmarks along with the application of PGSO with neural network is properly documented along with code implementation.
Unimodal benchmark functions
Unimodal benchmark functions
Multimodal benchmark functions
Results on 10-dimensions
Results on 30-dimensions
Results on 50-dimensions
Figures 3–10 depict the convergence characteristics of PSO, GSO, and PGSO over multiple benchmark functions. PGSO returns loss values below machine precision for most of the functions and does so in around 200 iterations. Its competitors get stuck at some minima and are not able to find a better position after a certain number of iterations. Also, to encourage fairness to record these results the authors had not executed the supercluster step which combines all outputs from the various explorers. Including that step would result in faster convergence in a shorter number of iterations.
In this paper a new parallel galactic swarm optimization algorithm is proposed and applied to the training of a neural network. Parallel implementation of the algorithm exhibited superior performance against state-of-the-art algorithms on all the benchmark optimization problems in terms of convergence and accuracy of the best solution. The proposed algorithm was also used to train a neural network to solve the problem of Churn Prediction and the problem of predicting whether an NBA player would last for the next 5 years. PGSO algorithm can be applied to Design, Finance, Combinatorial Optimization, Controllers, Clustering and Classification, Communication Networks, Faults, Images and Videos as mentioned by [9] on application domains of meta-heuristic algorithms. Future work will explore implementing the PGSO algorithm on a GPU to utilize the vertical scaling potential of the proposed algorithm’s architecture.
