Abstract
Unrelated parallel machine scheduling problem (UPMSP) with additional resources and UPMSP with learning effect have attracted some attention; however, UPMSP with additional resources and learning effect is seldom studied and meta-heuristics for UPMSP hardly possess reinforcement learning as new optimization mechanism. In this study, a shuffled frog-leaping algorithm with Q-learning (QSFLA) is presented to solve UPMSP with one additional resource and learning effect. A new solution presentation is presented. Two populations are obtained by division. A Q-learning algorithm is constructed to dynamically decide search operator and search times. It has 12 states depicted by population quality evaluation, four actions defined as search operators, a new reward function and a new action selection. Extensive experiments are conducted. Computational results demonstrate that QSFLA has promising advantages for the considered UPMSP.
Keywords
Introduction
Scheduling problems and algorithms have been widely applied in manufacturing industries for efficient production. Parallel machine scheduling problem (PMSP) is a typical scheduling one which is among the most studied area in scheduling literature and has been extensively investigated [1–24]. In most of the related works on PMSP, machine is the only considered resource; however, additional resources including automated guided vehicles, machine operators, tools, pallets, dies and industrial robots are often required in many real-world parallel machine manufacturing process. PMSP with additional resources has become a significant area of scheduling research [3–24].
Edis et al. [3] presented notation and classification on PMSP with additional resources according to types of additional resources, which may be renewable or non-renewable, discrete or continuous and exist in processing or during setup etc. Unrelated parallel machine scheduling problem with additional resources (UPMSPR) is an extended version of unrelated parallel machine scheduling problem (UPMSP) and has been extensively considered in the past decades. With respect to UPMSPR with one additional resource, Ventura and Kim [6] dealt with two UPMSPR, one of which possesses one single type additional resource and is shown to be equivalent to the asymmetric assignment problem. Zheng and Wang [7] reported a two-stage adaptive fruit fly optimization algorithm (FOA) for the problem with renewable resource. Fanjul-Peyro et al. [8] constructed two integer linear programming model and proposed three matheuristics. Fleszar and Hindi [9] presented an efficient mixed-integer linear programming (MILP) model to calculated a lower bound and a constraint programming model to schedule jobs on machines. Zheng and Wang [10] reported a collaborative multi-objective FOA for the problem with carbon emission minimization. Villa et al. [11] developed several heuristics based on resource constraint and assignment rule. Vallada et al. [12] applied an enriched scatter search and an enriched iterated greedy and adopted a best known heuristic and a repair mechanism in two algorithms.
Regarding UPMSP with multiple additional resources, Chen [13] proposed a heuristic with a capability relative to a runtime and solution quality for UPMSP with some types of dies as second resources. Bitar and Dauzère-Pérès [14] dealt with UPMSP with auxiliary resources in a photolithography workshop of a semiconductor plant, analyzed properties of the problem and provided a memetic algorithm. Afzalirad and Shafipour [15] developed an integer mathematical programming model and two genetic algorithms (GA) for UPMSPR with eligibility restrictions. Al-Harkan and Qamhan [16] solved UPMSP with non-zero arbitrary release dates, additional resources and sequence-dependent setup times (SDST) by a two-stage hybrid meta-heuristic based on variable neighborhood search and simulated annealing. Fanjul-Peyro [17] focused on UPMSP with processing resources, setup resources and shared resources and presented a MILP model and a three phase algorithm. Yepes-Borrero et al. [18] presented three heuristics and GRASP algorithm for UPMSP with setup times and additional limited resources in setup. Pinar and Seyda [19] considered multi-resource constrained UPMSP with sequence-dependent setup times (SDST), precedence relation, machine eligibility and release dates and proposed a constraint programming model and two branching strategies for the model.
Learning effect means that processing time is subject to change due to learning of operators. UPMSP with learning effect has been investigated in the past decade. Kuo et al. [20] discussed the polynomial solvability of UPMSP with setup times and learning effect. Wang and Wang [21] handled UPMSP with learning effect and deteriorating jobs and proved that the problem can be solved in polynomial time when the number of machines is given. Lu et al. [22] analyzed unrelated parallel-machine resource allocation scheduling problem with learning effect and deteriorating jobs and proved that the problem is polynomial time solvable if the number of machines is fixed. Soleimani et al. [23] considered UPMSP with start-time-related deterioration, position-based learning and SDST and presented GA, cat swarm optimization and interactive artificial bee colony to minimize mean weighted tardiness and power consumption. Zhang et al. [24] proposed a combinatorial evolutionary algorithm for UPMSP with setup times, limited worker resources and learning effect.
As stated above, UPMSPR is frequently investigated and UPMSP with learning effect also attracts some attention; however, the existing works seldom combine additional resources and learning effect together. In fact, additional resources and learning effect often exist simultaneously in parallel machine production process and optimization results of UPMSPR with learning effect can effectively reflect real-life situation, so it is necessary to considered UPMSPR with learning effect. On the other hand, as the main approaches to production scheduling problem, meta-heuristics are not applied to solve UPMSPR fully, the optimization ability and advantages of some meta-heuristics including shuffled frog-leaping algorithm (SFLA) are hardly tested to find more competitive methods; moreover, new optimization mechanisms including reinforcement learning (RL) are seldom added into meta-heuristics for the above UPMSP.
RL algorithms including Q-learning algorithm have been applied to service meta-heuristics for production scheduling problem in recent years. Cao et al. [25] presented a cuckoo search with RL and surrogate modelling for semiconductor final testing scheduling problem with multiresource constraints. Chen et al. [26] proposed a self-learning genetic algorithm with Q-learning algorithm for flexible job shop scheduling. Cao et al. [27] developed a knowledge-based cuckoo search with knowledge base based on RL algorithm named SARSA for flexible job shop scheduling with sequencing flexibility. In these works, RL is used to adjust adaptively parameters settings and results proved that the performance of meta-heuristic can be improved effectively. The integration of RL and meta-heuristic also can dynamically select search operator [28]; however, search operator of meta-heuristic is seldom decided dynamically by RL, so it is necessary to discuss fully the integration of RL and meta-heuristic.
SFLA is a meta-heuristic by observing, imitating and modelling the behavior of a group of frogs when searching for the location that has maximum amount of available food [29]. Owing to fast convergence speed and effective algorithm structure containing local search and global information exchanges, it has been widely applied to solve various scheduling problems in single factory [30–38], and multiple factories [39–42]; however, SFLA is hardly applied to solve UPMSPR and UPMSPR with learning effect. SFLA has some good features and extensive applications to scheduling; moreover, RL can be integrated easily with SFLA because of its algorithm structure, so SFLA may be effective path to solve UPMSPR with learning effect and its search efficiency can be improved effectively by RL algorithm.
In this study, UPMSPR with learning effect is considered and a novel shuffled frog-leaping algorithm with Q-learning (QSFLA) is proposed to minimize makespan. In QSFLA, a new coding and decoding are proposed and the whole population is divided into two sub-populations, then Q-learning algorithm is constructed to dynamically select search operator and search times. It consists of 12 states depicted by population quality evaluation, four actions defined as search operators, a new reward function and a novel action selection. A number of experiments are conducted and computational results show that QSFLA can solve effectively UPMSPR with learning effect.
The remainder of the paper is organized as follows. The problem is described in Section 2. QSFLA for the considered problem is reported in Section 3. Numerical experiments on QSFLA are reported in Section 4 and the conclusions are summarized in the final section and some topics of the future research are provided.
Problem description
UPMSPR with learning effect is described as follows. There are n jobs J1, J2, ⋯ , J n and m unrelated parallel machines M1, M2, ⋯ , M m . Each job can be processed on any one of m machines. p kjv indicates processing time of job J j processed on position v on machine M k by a given schedule, p kjv = p kj (W + (1 - W) v δ) [43], where p kj indicates initial processing time, δ ≤ 0 is learning factor, W is incompressibility factor, v denotes the position of J j on M k in a given schedule.
An additional renewable resource is considered. When job J j is processed on M k , it needs re kj units of the additional resource. At most R max units of the additional resource can be used at any time.
There have following constraints on jobs and machines.
All jobs and machines are available at time zero.
Each job can be processed on only one machine at a time.
Operations cannot be interrupted.
Preemption is not allowed.
The problem is composed of scheduling sub-problem and machine assignment sub-problem. The goal of the problem is to minimize makespan.
An illustrative example is provided. W = 0, δ = -0.1, R max = 10. A schedule of the example is shown in Fig. 1.

A schedule of the example.
The integration of RL and meta-heuristic for combinatorial optimization has been extensively studied [28]. In this study, a typical RL algorithm named Q-learning algorithm is integrated with SFLA to dynamically select search operator and search times. Q-learning algorithm is constructed by 12 states based on population evaluation, four actions used as search operators, a new reward function and a action selection method. The detailed steps of QSFLA are described in the following sections.
Solution representation and population division
A new two-string representation is adopted. For UPMSPR with n jobs, m machines, R max units of additional resource and learning effect, its solution is composed of a machine assignment string [M h 1 , M h 2 , ⋯ , M h n ] and a scheduling string [π1, π2, ⋯ , π n ], where M h i is the assigned machine for job J i and π i ∈ { 1, 2, ⋯ , n }.
The decoding procedure is shown below. For scheduling string, start with π1, for each π i , first check each idle period of the processing machine M k of job J π i decided by the first string, if J π i can be inserted into some idle periods when processing time and resource constraint are met, then choose an idle period with the smallest beginning time and insert J π i into the chosen period according to processing time and resource constraint; otherwise, J π i is processed after the current last processed job of M k .
All constraints are handled in decoding procedure as done in papers [1–24] and each solution is feasible, solutions can be improved continuously in the search process of QSFLA.
The beginning time B π i k is decided as follows: B π i k = max{ η k , θ π i k }, η k is the best available time of M k and θ π i k is the earliest time of J π i on M k that resource constraint is met. When a job J π i is inserted into an idle period of a machine, the position of other jobs after the inserted job J π i will increase and the corresponding processing time will diminish, in this case, beginning times of these jobs are kept invariant and completion time will become smaller.
For the example in Section 2, a solution is composed of a machine assignment string [M2, M1, M2, M1, M2, M2, M1, M1] and a scheduling string [1, 6, 4, 7, 5, 8, 2, 3]. In the decoding process, when job J7 is handled and processed on M1, there is an idle period [0,88] and J7 cannot be processed in the idle period because of resource constraint, so J7 is processed after job J4. For job J8, it can be inserted into idle period [0,81] on M1, after J8 is inserted, completion times of J4, J7 diminish, so an idle period between J4 and J7. If there is no insertion on M1, no idle period between J4, J7 exists. Fig. 1 shows the obtained schedule by decoding the above two strings.
An initial population P with N solutions is randomly produced. N/2 randomly chosen solutions are added into sub-population P1 and the remained N/2 solutions are included into sub-population P2. P = P1 ∪ P2.
In sub-population P1, there are S/2 solutions with smaller makespan than other solutions of P1, suppose
The remained s/2 memeplexes are constructed by using P2 in the same way of P1. Finally, s memeplexes are obtained.
Global search and neighborhood search
In each memeplex
Five neighborhood structures are used.
Multiple neighborhood search on x
b
is described as follows.
Q-learning algorithm
Q-learning algorithm is the most commonly used model-free RL algorithm [45]. It has successfully applied to solve scheduling problem [25–27].
In Q-learning algorithm, there are state st t , action a t , reward r t and action selection strategy. In this study, environmental state is depicted as population evaluation result and action is described as a combination of global search and neighborhood search.
12 states are defined based on evaluation on population P. Two performance indices are used. trial* describes the improvement of elite solution x*. trial* = 0 indicates that x* is improved on the current generation t. trial* > 0 implies that x* cannot be improved in trial* generations. Diversity index on generation t is defined by
Table 1 describes 12 states defined by trial* and ΔDI t , which is equal to DI t - DIt-1. where μ i is integer, μ1 < μ2.
State set
Four actions are adopted, which are search operators of P1, P2. Each search operator consists on a combination of global search and neighborhood search on P1 and another combination of global search and neighborhood search on P2.
When action is defined, sub-populations P1, P2 are first evaluated, then they are sorted in ascending order of
Action a (1) is described as follows. For each
The differences between a (2) and a (1) lie in that no global search is done for each
Action a (4) is shown below. Global search on
A new reward function rt+1 is defined by
In Equ. (6), w1, w2 are weight, w1 + w2 = 1,
Action selection is done based on Q values. Initially, all Q values are zero, which means that the agent does not have any learning experience to use. ɛ-greedy is often used and expressed as follows. If a random number rand < ɛ, then randomly select an action a; otherwise, select the action a which maximizes Q values, that is,
In this study, a novel action selection method is proposed. For the current state st t , one of four actions a (1) , a (2) , a (3) , a (4) is chosen as a t by roulette selection based on probability Pr (st t , a (l)) , l = 1, 2, 3, 4.
In this way, each action can be selected in an appropriate probability for each state st t and it is unnecessary to decide parameter ɛ.
Unlike the previous works [25–27], Q-learning algorithm is used to provide search operator and search times for memeplex. 12 states are used to depict evolution quality of population P and four actions are applied to decide search operators for P1, P2 and search iterations for each operator; action selection is done without using ɛ and a reward function is defined.
In QSFLA, a new solution representation and decoding process are first introduced; then population P is divided into sub-populations P1, P2; Q-learning algorithm is used to dynamically select search operator for all memeplexes on each generation; moreover, memeplex
The detailed steps of QSFLA are shown below.
(1) Randomly produce an initial population P with N solutions, set initial Q-table, t = 1.
(2) Divide P into P1, P2 and obtain s/2 memeplexes for each P i , i = 1, 2.
(3) Evaluate sub-populations P1, P2 and sort them according to
(4) Select an action according to Equ. (7) and execute memeplex search in each memeplex by using the chosen action.
(5) Execute population shuffling and update Q-table, t = t + 1.
(6) If the stopping condition is not met, go to step (2); otherwise, terminate the search.
Q-table is updated once on a generation, so t indicates generation of population evolution and times of Q-learning. Time complexity of QSFLA is N × (n + snum × anum) × max _ t, where snum, aunm, max _ t are total number of states, total number of actions and maximum generation.
Computational experiments
Extensive experiments are conducted to test the performance of QSFLA for UPMSPR with learning effect. All experiments are implemented by using Microsoft Visual C++ 2019 and run on 8.0G RAM 2.30GHz CPU PC.
Test instances and comparative algorithms
300 instances are used and their basic data are directly from Fanjul-Peyro et al. [8]. On learning effect, W = 0 and δ = -0.1. R max = 5m. Each instance is denoted as n × m × No. There are five ways to produce processing time and two ways to generate additional resource. There are 10 combinations of processing time and additional resource. No is defined as a combination of the a-th way of processing time and the b-th way of additional resource. b = 1 for No ≤ 5 and b = 2 for No > 5.
Zheng and Wang [7] proposed a two-stage adaptive fruit fly optimization algorithm (TAFOA) and Villa et al. [11] developed a multi-pass heuristic (MPH) for UPMSPR. TAFOA and MPH have promising advantages on solving UPMSPR and can be directly used to solve UPMSPR with learning effect, so they are chosen as comparative algorithms. Salehi Mir and Rezaeian [46] presented a hybrid particle swarm optimization and genetic algorithm (HPSOGA). HPSOGA can be directly applied to handle our UPMSPR after the decoding procedure of QSFLA is adopted, so we use it as comparative algorithm. TAFOA and HPSOGA have the same complexity N × n × max _ t.
We construct SFLA by deleting Q-learning algorithm from QSFLA and no population division to show the effect of Q-learning. In the search process of each memeplex, for optimization object x b , global search is first performed, if the new solution cannot substitute for x b as described in QSFLA, then multiple neighborhood search of x b is executed.
Parameter settings
QSFLA has main parameters: N, s, R, β, γ, w1 and stopping condition. We first tested stopping condition and found that QSFLA can converge well when 0.3n seconds CPU times reaches. We also found that each comparative algorithm also converges fully when 0.3n seconds CPU times is used, so stopping condition of comparative algorithms is set to 0.3n seconds CPU times.
Taguchi method [47] is used to decide the settings for other parameters. We select instance 150 × 20 × 7. The levels of each parameter are shown in Table 2. The orthogonal array L27 (36) is tested. QSFLA with each combination runs 10 times independently for the chosen instance.
Parameters and their levels
Parameters and their levels
The results of MIN and S/N ratio are shown in Fig.2, in which S/N ratio is defined as -10 × log 10 (MIN2) and MIN is the best solution found in 10 runs. It can be found from Fig. 2 that QSFLA with following combination N = 64, s = 8, R = 50, β = R/3, γ = 0.9 and w1 = 0.7 can obtain better results than QSFLA with other combinations, so the above combination is adopted.

The mean MIN and the mean S/N ratio of MIN
SFLA has following parameters: N = 64, s = 8 and R = 50.
Parameter settings of three comparative algorithms are directly selected from references [7, 46] except that the stopping condition. We also test these settings for each comparative algorithm. The computational results show that these settings of each comparative algorithm are still effective and comparative algorithms with these settings can produce better results than MPH, HPSOGA and TAFOA with other settings.
QSFLA is compared with SFLA, MPH, HPSOGA and TAFOA. Each algorithm randomly runs 10 times for each instance. The corresponding results of all algorithms are shown in Tables 3–11, where AVG indicates average value of 10 elite solutions obtained in 10 runs and SD is standard deviation among 10 elite solutions in 10 runs. Fig. 3 describes box plot of all algorithms and Fig. 4 shows convergence curves of all algorithms.
Computational results of five algorithms on metric MIN
Computational results of five algorithms on metric MIN
Computational results of five algorithms on metric MIN
Computational results of five algorithms on metric MIN
Computational results of five algorithms on metric AVG
Computational results of five algorithms on metric AVG
Computational results of five algorithms on metric AVG
Computational results of five algorithms on metric SD
Computational results of five algorithms on metric SD
Computational results of five algorithms on metric SD
As stated in Tables 3–11, QSFLA has better SD than SFLA on most of instances and obtains better MIN and AVG than SFLA on nearly all instances; moreover, MIN and AVG of SFLA is notably worse than those of QSFLA on all instances. There are significant performance differences between QSFLA and SFLA on convergence and average performance. This conclusion also can be obtained from Fig. 3 and Fig. 4. The difference between QSFLA and SFLA is the usage of Q-learning. The notable performance differences between QSFLA and SFLA show that Q-learning algorithm really has positive impact on the performance of QSFLA, so it is effective and efficient to introduce Q-learning algorithm into QSFLA.
It can be found from Tables 3–5 that QSFLA performs better than its comparative algorithms on metric MIN. QSFLA produces smaller MIN than three comparative algorithms on 266 of 300 instances; moreover, MIN of QSFLA is less than that of TAFOA by at least 30 on 142 instances, that of MPH by at least 30 on 178 instances and that of HPSOGA by at least 30 on 245 instances. QSFLA has better convergence than MPH, HSPOGA and TAFOA. This conclusion also can be drawn from Figs. 3 and 4.
As shown in Tables 6–8, QSFLA obtains smaller AVG than MPH, TAFOA and HPSOGA on 273 instances; moreover, AVG of QSFLA is better than that of its all comparative algorithms by at least 10 on 226 instances. QSFLA possesses better average performance than MPH, TAFOA and HPSOGA. Figure 3 also depicts the average performance difference.
It can be seen from Tables 9–11 that QSFLA produces better SD than its comparative algorithms on 88 of 120 instances with n ≤ 20, obtains smaller SD than TAFOA on 108 instances with n > 20, than HPSOGA on 101 instances with n > 20 and than MPH on 107 instances with n > 20. Fig. 3 also shows that QSFLA can provide better results on metric SD than comparative algorithms.
The above analyses validate the good performances of QSFLA on UPMSPR with learning effect. The good performance mainly results from the dynamical selection of search operators and the differentiated search of sub-populations P1, P2. These two strategies lead to good exploration ability and low possibility of falling local optima, thus, it can be concluded that QSFLA is a very competitive method for UPMSPR with learning effect.
UPMSP with additional resources and other real-life constraints like learning effect is seldom considered. In this study, UPMSPR with learning effect is handled and a new algorithm named QSFLA is proposed by combining Q-learning algorithm and SFLA together. A new solution presentation and decoding process are presented. The whole population is divided into two sub-populations. A Q-learning algorithm is applied to implement the dynamical selection of search operator and search times. It possesses 12 states, four actions, a new action selection and a reward function. Extensive experiments are conducted. Computational results demonstrate that QSFLA has promising advantages for the considered UPMSPR.
UPMSPR with other processing constraints is still our future research topic because additional resources, learning effect or deteriorating jobs etc often exist in an unrelated parallel machines manufacturing process. We will try to solve the above problem using meta-heuristics like imperialist competitive algorithm etc. The previous works proved that RL is an effective path to enhance search ability of meta-heuristics. We will focus on the integration of RL and meta-heuristics for various production scheduling problems including distributed scheduling and energy-efficient scheduling.
