AN ASYNCHRONOUS REINFORCEMENT LEARNING HYPER-HEURISTIC ALGORITHM FOR FLOW SHOP PROBLEM
Wen Shi, Xueyan Song, Cuiling Yu, Jizhou Sun
School of Computer Science and Technology, Tianjin University, Tianjin, China
[email protected], [email protected], [email protected], [email protected]

ABSTRACT
This paper investigates a hyper-heuristic algorithm for the permutation flow shop problem (FSP) that finds a job sequence minimizing the makespan. In contrast to existing approaches, our proposed hyper-heuristic algorithm is based on a multi-agent architecture with two levels: low level heuristic agents (LLHAs) perform local search in the solution domain, and a hyper-heuristic agent (HHA) manages the low level heuristic agents with reinforcement learning. The LLHAs improve the current solution by local search and send it to the HHA. Depending on the recent performance of the LLHAs, the HHA decides whether or not to accept the received solution as the initial solution for the next local search. Simulation studies demonstrate that the hyper-heuristic with asynchronous parallel reinforcement learning yields better solutions than other algorithms.

KEY WORDS
Agent-based and Multiagent Systems, Machine Learning, Permutation Flow Shop Scheduling, Hyper-heuristic

1. Introduction
With a strong engineering background, the FSP is a strongly NP-complete combinatorial optimization problem[1]. The objective of the FSP is to schedule n jobs on m machines so as to minimize the makespan, which is the time from the first job starting on the first machine to the last job finishing on the last machine. The earliest optimal solution for the FSP dates back to 1954[2]. There are two kinds of algorithms for the FSP: constructive methods and improvement methods. Constructive algorithms start with an empty solution and gradually build a complete solution. An early constructive algorithm, called CDS, was proposed by Campbell et al.[3]. The CDS method clusters all the machines into two virtual machines and repeatedly solves the generated two-machine problems. Another famous constructive heuristic is the NEH algorithm proposed by Nawaz et al.[4]. The NEH algorithm sorts all the jobs in non-increasing order of their total processing time on all the machines, selects the first two jobs and forms a partial sequence. The remaining jobs are then inserted into the current sequence one by one, each at the position that minimises the makespan.
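The NEH construction just described can be sketched in a few lines of Python (an illustrative sketch, not the paper's implementation; the function names are ours, and the inner `makespan` helper implements the standard flow shop completion-time recurrence):

```python
def neh(p):
    """Sketch of the NEH constructive heuristic.

    p[k] is the list of processing times of job k on machines 0..m-1.
    Returns a job order (a permutation of the indices of p).
    """
    def makespan(order):
        # Standard flow shop completion-time recurrence for a (partial) sequence.
        m = len(p[0])
        prev = [0] * m
        for k in order:
            acc, row = 0, []
            for j in range(m):
                acc = max(acc, prev[j]) + p[k][j]
                row.append(acc)
            prev = row
        return prev[-1]

    # Jobs in non-increasing order of total processing time.
    jobs = sorted(range(len(p)), key=lambda k: -sum(p[k]))
    seq = [jobs[0]]
    for k in jobs[1:]:
        # Try every insertion position; keep the one minimising the makespan.
        seq = min((seq[:i] + [k] + seq[i:] for i in range(len(seq) + 1)),
                  key=makespan)
    return seq
```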

Suliman[5] developed an improvement heuristic with a job pair exchange mechanism under a directionality constraint. In addition to these domain-specific algorithms, popular artificial intelligence techniques such as genetic algorithms[6], artificial neural networks[7] and simulated annealing[8, 9] have been used to solve the FSP. However, according to the "no free lunch" theorem, no heuristic search algorithm is suitable for all optimization problems, or even for all instances of one problem. Building a general improvement heuristic for the FSP that chooses a suitable heuristic at each decision point during the search is one of the motivations for the hyper-heuristic in this paper. A survey of hyper-heuristics is given by Burke et al.[10]. The idea behind hyper-heuristics dates back to the early 1960s[11]. In the early 2000s, the term hyper-heuristic was introduced to describe the idea of "heuristics to choose heuristics" in the context of combinatorial optimization[12]. Early hyper-heuristics managed constructive heuristics to gradually build a complete solution. Later, improvement hyper-heuristics were proposed, which select and apply low level heuristics to improve the current solution. Further studies in hyper-heuristics try to automatically generate new heuristics suited to a given domain from a set of potential heuristic components[13]. The main process in hyper-heuristics includes heuristic selection and move acceptance. Reference [10] describes different kinds of heuristic selection mechanisms, such as Simple Random (SR), Random Permutation (RP) and Greedy (GR)[12, 14]. The heuristic selection process can also make use of reinforcement learning[15, 16] to improve decision making. The reinforcement learning mechanism is shown in Figure 1. In this kind of strategy, weights represent the performance of the heuristics and can be treated as a guide for the search process.
Nareyek[17] makes use of reinforcement learning as a low level heuristic selection method and sets a weight for each low level heuristic. Burke et al.[18] and Özcan et al.[19] combined reinforcement learning with tabu search and great deluge, respectively. However, in these heuristic selection methods, heuristics are applied one by one rather than in parallel. Obviously, efficiency decreases when only one heuristic performs a local search at any time. To distribute the hyper-heuristic so that several low level heuristics execute in parallel, Ouelhadj and Petrovic[20] proposed an agent-based cooperative hyper-heuristic framework. The agent-based heuristics can not only cooperate but also compensate for each other's weaknesses. Results in that study demonstrated that the parallel cooperative hyper-heuristics outperformed the sequential ones. In reference [20], synchronous and asynchronous cooperative mechanisms are combined with existing move acceptance strategies such as Greedy Improving Only[21], Greedy All Moves[21], Tabu Search[22] and Simulated Annealing[21], but without learning. In this paper, a hyper-heuristic based on a multi-agent framework for the FSP is proposed. We embed reinforcement learning into a hyper-heuristic multi-agent framework, and we propose a deterministic move acceptance with the non-stationary reinforcement learning described in reference [17]. The HHA manages the LLHAs with reinforcement learning by generating online weights for each LLHA based on its performance, and decides whether to accept a solution as the initial solution for the next iteration according to the weights of the LLHAs. The HHA cooperates with the LLHAs asynchronously. The LLHAs contain five existing improvement heuristics: Suliman[5], swap[6], inversion[6], insertion[6] and permutation[10]. Simulation studies demonstrate that the asynchronous parallel cooperative hyper-heuristic with reinforcement learning outperforms the ones with deterministic or non-deterministic move acceptance, as well as other flow shop algorithms.

Figure 1. The reinforcement learning mechanism

2. Problem description
The goal of the FSP is to find an optimal schedule for n jobs ($i = 1, \ldots, n$) on m machines ($j = 1, \ldots, m$). A job consists of m operations, and the j-th operation of each job must be processed on machine j. A job can start on machine j only when it has completed on machine j-1 and machine j is free. One machine can process only one job at a time, and one job can be processed on only one machine at a time. Each operation has a known processing time $p_{i,j}$. The completion time $C_{i,j}$ of job i on machine j can be calculated using the following recurrence:

$C_{1,1} = p_{1,1}$   (1)

$C_{i,1} = C_{i-1,1} + p_{i,1}, \quad i = 2, \ldots, n$   (2)

$C_{1,j} = C_{1,j-1} + p_{1,j}, \quad j = 2, \ldots, m$   (3)

$C_{i,j} = \max\{C_{i-1,j}, C_{i,j-1}\} + p_{i,j}, \quad i = 2, \ldots, n, \; j = 2, \ldots, m$   (4)

The value of the objective function $C_{\max}$ is the completion time of the last operation on the last machine. The objective is the minimisation of the makespan $C_{\max}$.

3. Hyper-heuristic for FSP
3.1 Hyper-heuristic Agent
The HHA proposed in this paper uses heuristic selection with learning and deterministic move acceptance. As an agent, the HHA cooperates with the LLHAs asynchronously. First, the HHA generates an initial solution and broadcasts it to all LLHAs. Each LLHA receives the solution, improves it by local search and sends it back to the HHA. The HHA maintains a list of weights for the LLHAs, which depend on their past performance. When it receives a solution from an LLHA, the HHA adjusts that LLHA's weight based on the received solution and decides the initial solution for the next iteration. The pseudocode of the HHA process is:

Initialize the initial solution, local best solution and global best solution;
iteration = 0;
For i = 1 to n
    Set weight of LLHA_i as w_i;
    Send the initial solution to LLHA_i;
End for
While iteration < maximum number of iterations Do
    While receive the current solution from LLHA_k Do
        If the current solution is better than the local best solution then
            Update the local best solution as the current solution;
            If the current solution is better than the global best solution then
                Update the global best solution as the current solution;
            End if
            w_k = award(w_k);
        Else
            w_k = punish(w_k);
            If w_k > w_j for all j in {1, ..., n}, j != k then
                Update the local best solution as the current solution;
            End if
        End if
        Set the initial solution as the local best solution;
        iteration++;
        Send the initial solution to LLHA_k;
    End do
End do
Terminate;
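The HHA logic above can be approximated by a simplified, sequential sketch in Python (illustrative only: the real HHA and LLHAs are asynchronous agents, so the random pick of k below merely stands in for whichever LLHA happens to reply next; `award`/`punish` follow the w+1/w-1 scheme the paper adopts from reference [17]):

```python
import random

def hha_loop(initial, llhas, evaluate, max_iter=100):
    """Sequential stand-in for the asynchronous HHA pseudocode above.

    llhas: list of local-search callables (stand-ins for the LLHAs);
    evaluate: objective function, smaller is better.
    """
    w = [0] * len(llhas)                      # one weight per LLHA
    local_best = global_best = initial
    for _ in range(max_iter):
        k = random.randrange(len(llhas))      # "receive from LLHA_k"
        current = llhas[k](local_best)
        if evaluate(current) < evaluate(local_best):
            local_best = current
            if evaluate(current) < evaluate(global_best):
                global_best = current
            w[k] += 1                         # award
        else:
            w[k] -= 1                         # punish
            # Deterministic acceptance: accept a worse solution anyway
            # if LLHA_k currently has the strictly largest weight.
            if all(w[k] > w[j] for j in range(len(w)) if j != k):
                local_best = current
    return global_best
```

For example, with a single toy "LLHA" that decrements an integer and the absolute value as the objective, the loop settles on 0.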

3.2 Low Level Heuristic Agents
The LLHAs in the hyper-heuristic framework are in charge of local search in the solution domain. The LLHAs cooperate with the HHA in parallel, as described in Figure 2.

Figure 2. HHA and LLHAs

In this paper, two kinds of heuristics are introduced. The first kind is mutational heuristics, which make a random perturbation and are not expected to produce a better solution when applied. The mutational heuristics in our hyper-heuristic for the FSP include:
· Insertion heuristic: select a job randomly and move it to another position in the sequence of scheduled jobs;
· Inversion heuristic: select a sub-sequence of jobs randomly and reverse the positions of these jobs;
· Swap heuristic: select two jobs randomly in the sequence of scheduled jobs and exchange their positions;
· Permutation heuristic: select a sub-sequence of jobs and apply a random permutation to it.
The second kind of heuristic is the hill climber, whose aim is to improve the candidate solution. The Suliman heuristic[5], which improves the solution with a job pair exchange under a directionality constraint, is introduced as the hill climber in our hyper-heuristic framework. The pseudocode of the LLHA process is:

Do
    If receive the initial solution from HHA then
        local_iteration = 0;
        While local_iteration < maximum number of local iterations Do
            Find the current solution by local search strategy;
            local_iteration++;
        End do
    End if
    Send the current solution back to HHA;
End do
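The four mutational heuristics are simple perturbations of the job sequence; they can be sketched in Python as follows (illustrative code, not the paper's JADE implementation; each function returns a new sequence and leaves its input untouched):

```python
import random

def insertion(seq):
    """Insertion: move a randomly chosen job to another position."""
    s = list(seq)
    job = s.pop(random.randrange(len(s)))
    s.insert(random.randrange(len(s) + 1), job)
    return s

def inversion(seq):
    """Inversion: reverse a randomly chosen sub-sequence."""
    s = list(seq)
    i, j = sorted(random.sample(range(len(s) + 1), 2))
    s[i:j] = reversed(s[i:j])
    return s

def swap(seq):
    """Swap: exchange the positions of two randomly chosen jobs."""
    s = list(seq)
    i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

def permutation(seq):
    """Permutation: randomly shuffle a randomly chosen sub-sequence."""
    s = list(seq)
    i, j = sorted(random.sample(range(len(s) + 1), 2))
    sub = s[i:j]
    random.shuffle(sub)
    s[i:j] = sub
    return s
```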

4. Performance Evaluation
4.1 Simulation Environment
We implemented the hyper-heuristic based on multi-agent architecture using the Java Agent Development Framework (JADE)[23], an open-source framework for peer-to-peer agent-based applications, on the following hardware/software platform: Intel Pentium Dual E2160 (1.80 GHz), 2 GB RAM, Microsoft Windows XP. We consider 8 different sizes of the flow shop scheduling instances given by Taillard[24]: 20*5, 20*10, 20*20, 50*5, 50*10, 50*20, 100*5 and 100*10, and select one instance of each size. Similar to reference [20], we calculate the average percentage makespan increase over the lower bound (LB) of $C_{\max}$ to evaluate the different approaches for the FSP; thus, smaller values in Figure 4 indicate better results. The weight updating mechanisms we use are $w_a \leftarrow w_a + 1$ and $w_a \leftarrow w_a - 1$, as described in reference [17]. The termination condition is a maximum of 20000 iterations. To compare the performance of different algorithms fairly, we use a sufficiently large number of iterations: as Figure 3 shows for one FSP case, the local best solutions of all algorithms tend to become stable after about 5000 iterations.

Figure 3. Performances of algorithms for different numbers of iterations

4.2 Experimental results
To evaluate the performance of our proposed algorithm, we first implement several move acceptance strategies combined with parallel heuristic selection for comparison:
· Deterministic move acceptance strategies: tabu search (TS) and simulated annealing (SA).
· Non-deterministic move acceptance strategies: greedy improving only (IO) and greedy all moves (AM).
Figure 4 shows the results of the different algorithms. The hyper-heuristic with asynchronous parallel reinforcement learning (abbreviated as ARL in Figure 4) outperforms the ones with deterministic or non-deterministic move acceptance strategies. For small-scale cases such as 20*5, 50*5 and 100*5, the differences between the results are not obvious, but for large-scale cases our proposed algorithm clearly yields better solutions.

Figure 4. Average percentage makespan increase over LB
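The paper does not spell out the evaluation metric as a formula; assuming the standard definition, the percentage makespan increase over the lower bound for a single instance is:

```python
def pct_increase_over_lb(c_max, lb):
    """Percentage by which a makespan exceeds the lower bound LB
    (assumed standard definition; smaller is better)."""
    return 100.0 * (c_max - lb) / lb
```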

Table 1. Comparison with the Suliman heuristic algorithm

              20*5     20*10    20*20    50*5     50*10    50*20    100*5    100*10
RD+ARL        5.891    12.155   23.48    0.516    9.529    16.006   1.839    5.939
RD+Suliman    10.233   27.686   35.965   5.38     17.492   30.546   9.674    9.481
NEH+ARL       5.62     11.609   23.061   0.442    9.46     15.934   1.416    1.632
NEH+Suliman   5.814    12.431   25.118   0.627    9.77     16.006   1.416    1.667

As an improvement algorithm for the FSP, our proposed algorithm also outperforms an existing improvement algorithm, the Suliman heuristic (SC), as described in Table 1. The column headings are the 8 FSP cases, and the results are again the average percentage makespan increase over LB. For this test, we combine each of the two improvement algorithms with two constructive algorithms, random (RD) and NEH. The results show that the hyper-heuristic with asynchronous reinforcement learning performs better than SC whether it collaborates with RD or NEH. Moreover, our method is effective enough that the results of RD+ARL are close to those of NEH+ARL, even though RD is clearly worse than NEH.

5. Conclusion
Hyper-heuristics solve a search problem by combining different kinds of heuristic methods towards a collective goal. In order to further improve the search efficiency of hyper-heuristics with a sequential reinforcement learning mechanism, in this paper we propose parallel reinforcement learning within a hyper-heuristic multi-agent framework and apply it to the FSP. Simulation results show that the hyper-heuristic with the asynchronous parallel reinforcement learning mechanism yields better solutions than those with deterministic or non-deterministic move acceptance strategies, and that it is also an effective improvement algorithm for the FSP. In our future work, multi-agent reinforcement learning will be applied to the hyper-heuristic, and a broader comparative study of multi-agent hyper-heuristics will be carried out.

Acknowledgements
This study is supported by grants from the National Natural Science Foundation of China and the Civil Aviation University of China (61039001), and the Tianjin Municipal Science and Technology Commission (11ZCKFGX04200).

References
[1] M.R. Garey, & D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness (WH Freeman and Company, New York, 1979).
[2] S.M. Johnson, Optimal two- and three-stage production schedules with setup times included, Naval Research Logistics Quarterly, 1 (1), 1954, 61-68.
[3] H.G. Campbell, R.A. Dudek, & M.L. Smith, A heuristic algorithm for the n job, m machine sequencing problem, Management Science, 16 (10), 1970, B-630-B-637.
[4] M. Nawaz, E.E. Enscore, & I. Ham, A heuristic algorithm for the m-machine, n-job flow-shop sequencing problem, Omega, 11 (1), 1983, 91-95.
[5] S.M.A. Suliman, A two-phase heuristic approach to the permutation flow-shop scheduling problem, International Journal of Production Economics, 64 (1-3), 2000, 143-152.
[6] T. Murata, H. Ishibuchi, & H. Tanaka, Genetic algorithms for flowshop scheduling problems, Computers & Industrial Engineering, 30 (4), 1996, 1061-1071.
[7] T.R. Ramanan, R. Sridharan, K.S. Shashikant, & A.N. Haq, An artificial neural network based heuristic for flow shop scheduling problems, Journal of Intelligent Manufacturing, 22 (2), 2011, 279-288.

[8] L. Wang, & D.Z. Zheng, A modified evolutionary programming for flow shop scheduling, The International Journal of Advanced Manufacturing Technology, 22 (7), 2003, 522-527.
[9] A.C. Nearchou, Flow-shop sequencing using hybrid simulated annealing, Journal of Intelligent Manufacturing, 15 (3), 2004, 317-328.
[10] E.K. Burke, M. Hyde, G. Kendall, G. Ochoa, E. Ozcan, & R. Qu, A survey of hyper-heuristics, Computer Science Technical Report No. NOTTCS-TR-SUB-0906241418-2747, School of Computer Science and Information Technology, University of Nottingham, 2009.
[11] H. Fisher, & G.L. Thompson, Probabilistic learning combinations of local job-shop scheduling rules, Industrial Scheduling, 1963, 225-251.
[12] P. Cowling, G. Kendall, & E. Soubeiga, A hyper-heuristic approach to scheduling a sales summit, Practice and Theory of Automated Timetabling III, 2001, 176-190.
[13] E.K. Burke, M.R. Hyde, G. Kendall, G. Ochoa, E. Ozcan, & J.R. Woodward, Exploring hyper-heuristic methodologies with genetic programming, Computational Intelligence, 1, 2009, 177-201.
[14] P. Cowling, G. Kendall, & E. Soubeiga, Hyper-heuristics: A tool for rapid prototyping in scheduling and optimisation, Applications of Evolutionary Computing, 2002, 269-287.
[15] L.P. Kaelbling, M.L. Littman, & A.W. Moore, Reinforcement learning: A survey, arXiv preprint cs/9605103, 1996.
[16] R.S. Sutton, & A.G. Barto, Reinforcement Learning: An Introduction (MIT Press, Cambridge, MA, 1998).
[17] A. Nareyek, An empirical analysis of weight-adaptation strategies for neighborhoods of heuristics, Proceedings of the Fourth Metaheuristics International Conference, 211-215.
[18] E. Burke, G. Kendall, J. Newall, E. Hart, P. Ross, & S. Schulenburg, Hyper-heuristics: An emerging direction in modern search technology, Handbook of Metaheuristics, 2003, 457-474.
[19] E. Özcan, M. Misir, G. Ochoa, & E.K. Burke, A reinforcement learning-great-deluge hyper-heuristic for examination timetabling, International Journal of Applied Metaheuristic Computing (IJAMC), 1 (1), 2010, 39-59.
[20] D. Ouelhadj, & S. Petrovic, A cooperative hyper-heuristic search framework, Journal of Heuristics, 16 (6), 2010, 835-857.
[21] E. Soubeiga, Development and application of hyperheuristics to personnel scheduling, PhD thesis, University of Nottingham, 2003.
[22] P. Cowling, & K. Chakhlevitch, Hyperheuristics for managing a large collection of low level heuristics to schedule personnel, 2003 Congress on Evolutionary Computation, Piscataway, NJ, USA, 1214-1221.
[23] F. Bellifemine, G. Caire, A. Poggi, & G. Rimassa, JADE: A software framework for developing multi-agent applications. Lessons learned, Information and Software Technology, 50 (1-2), 2008, 10-21.

[24] É. Taillard, Summary of best known lower and upper bounds of Taillard's instances, June 2012, http://mistic.heig-vd.ch/taillard/problemes.dir/ordonnancement.dir/ordonnancement.html.