SHORT PAPER International Journal of Recent Trends in Engineering, Vol 2, No. 3, November 2009

Solving Missile Defense and Interceptor Allocation Problem Using Reinforcement Learning and Optimization Techniques

S. THAMARAI SELVI
Madras Institute of Technology, Department of Information Technology, Chennai, India
Email: [email protected]

R. MALMATHANRAJ AND S. MATHAN RAJ
National Institute of Technology, Department of ECE, Tiruchirapalli, India
Email: [email protected], [email protected]

Abstract- This paper proposes and implements biologically inspired architectures, namely a Genetic Algorithm (GA), a reinforcement learning scheme, and Particle Swarm Optimization (PSO), to solve the weapon allocation problem in a multi-layer defense scenario. The proposed schemes were implemented in MATLAB and increase the percentage of assets saved. In the experimental analysis the training time is drastically reduced. PSO converges rapidly and saves more assets, with faster convergence of learning.

Keywords- Battle Management/Command Control and Communication Problem; Genetic Algorithm; Interceptor allocation problem; Particle Swarm Optimization; Probability of Survival.

I. INTRODUCTION
In modern warfare, missiles with warheads can inflict severe damage on far-flung target locations in a matter of minutes. The defense against such attacks is to launch anti-ballistic missiles. The war scenario involves launching multiple missiles with different ranges to hit potential targets. The defending side normally practises the Theatre Missile Defense (TMD) concept. The solution to the Interceptor allocation problem is constrained by state space and timing complexities; it is a combinatorial optimization problem. A new solution method for the TMD problem is proposed in this paper using an LVQ-RBF multi-agent hybrid architecture and a reinforcement learning (RL) method. Q-learning is an RL [1] method. RL provides a flexible approach to designing intelligent agents in situations where planning and supervised learning are impractical. Kaelbling et al. [7] and Sutton et al. [9] provide surveys of the field of RL. They characterize two classes of methods: methods that search the space of value functions and those that search the space of policies. The former class is exemplified by temporal difference (TD) methods and the latter by evolutionary algorithms (EA). In this paper, we analyse neural Q-learning (a TD method) and biologically inspired algorithms (the EA approach) for solving the interceptor allocation problem in a multi-layer defense scenario. Neural Q-learning is implemented in the LVQ-RBF algorithm and the result is compared with the GA and PSO of the EA approach. This paper uses the mathematical model of the Battle Management Command Control and Communication (BM/C3) problem for exact modeling of the environment. The paper is organised as follows: Section II explains the Interceptor allocation problem solved by the LVQ-RBF algorithm, Section III the Genetic Algorithm, Section IV PSO, Section V results and discussion, and Section VI the conclusion.

II. INTERCEPTOR ALLOCATION PROBLEM BY LVQ-RBF ALGORITHM
The neural architecture is trained to map the state-action pair values of the TMD problem into actions. The TMD environment is assumed to be dynamic, with a new dataset presented for every learning trial. The work on TMD Interceptor allocation by Bertsekas et al. [6] utilizes a neuro-dynamic programming technique. In this paper we propose a new architecture to maximize the number of assets used in the simulation; thus the categorization of priority is also increased. For Markov decision problems the state space increases exponentially as learning progresses. This is resolved by including state space exploration. The input space for the Interceptor Allocation problem is explored, and selected states that are numerically close are grouped together. This partitioning results in separate regions of the state space. One state space vector, i = (ar1, ar2, ar3, ..., arn) and a = (am1, am2, am3, ..., amn), is used as the representative state equation for one region, where ar1 denotes the representative asset value for the first priority region, am1 denotes the representative number of attacking missiles for the first priority region, and so on. The current state has two components (i, a). The input state space available for the problem results in a combinatorial explosion of states: the state space available to model this problem is as high as s^6600 owing to the curse of dimensionality. The initial partitioning of the state space is done by a Learning Vector Quantization (LVQ) neural network. In LVQ-based partitioning the target vector is presented by exploiting domain knowledge of the available (i, a) combinations. The LVQ network uses a supervised learning scheme and offers accurate classification. The complete architecture involves LVQ-based initial partitioning and function approximation by multiple RBF agents, as shown in Fig. 1.
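To make the partitioning step concrete, the following is a minimal LVQ1-style sketch for such (asset, missile) state vectors. It is only an illustration of the technique, not the authors' implementation: the number of regions, the learning-rate schedule and the synthetic training data are assumptions, while the idea of routing a state to a representative region by domain-labelled (i, a) combinations follows the description above.

```python
import numpy as np

# Illustrative sketch only: a minimal LVQ1-style partitioner for the
# (assets, missiles) state vectors described above. Region count,
# learning-rate schedule and training data are assumptions.

class LVQPartitioner:
    def __init__(self, n_regions, n_features, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        # One prototype (codebook vector) per region of the state space.
        self.prototypes = rng.uniform(0.0, 1.0, size=(n_regions, n_features))
        self.labels = np.arange(n_regions)        # one region id per prototype
        self.lr = lr

    def nearest(self, x):
        # Index of the prototype closest to state x (Euclidean distance).
        return int(np.argmin(np.linalg.norm(self.prototypes - x, axis=1)))

    def train(self, states, region_targets, epochs=50):
        # LVQ1 rule: pull the winning prototype towards x when the target
        # region matches, push it away otherwise.
        for epoch in range(epochs):
            lr = self.lr * (1.0 - epoch / epochs)   # decaying learning rate
            for x, target in zip(states, region_targets):
                w = self.nearest(x)
                sign = 1.0 if self.labels[w] == target else -1.0
                self.prototypes[w] += sign * lr * (x - self.prototypes[w])

    def region_of(self, x):
        # Representative region used to route the state to an RBF agent.
        return int(self.labels[self.nearest(x)])

# Example: states are [a1..a6, m1..m6] scaled to [0, 1]; the target regions
# would come from domain knowledge of the (i, a) combinations, as described
# above (random labels are used here purely as placeholders).
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    states = rng.uniform(0.0, 1.0, size=(200, 12))
    targets = rng.integers(0, 6, size=200)          # 6 priority regions (assumed)
    lvq = LVQPartitioner(n_regions=6, n_features=12)
    lvq.train(states, targets)
    print("state routed to region", lvq.region_of(states[0]))
```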


The Q values required for action selection are obtained by a weighted combination of the Q values from the individual neural networks. In the weighted averaging approach,

    a(x) = ( Σ_k w_k · a_k(x) ) / ( Σ_k w_k ),  k = 1, 2, ..., n,      (1)

where n is the number of agents used for solving the sequential decision problem, x is an input, k denotes an agent (k ∈ [1, n]), a_k(x) is the output of agent k, w_k is the weight of agent k, and a(x) is the combined output. Here the Q value function is approximated by RBF neural networks. The weighting of the agents' outputs reduces the error, with the weights w_k subject to the constraints

    w_k ≥ 0,  Σ_k w_k = 1.      (2)

To minimize the gating weight error, we use the gradient descent algorithm on

    error = Σ_x error(x) = Σ_x [ y(x) - Σ_k w_k · a_k(x) ]^2.      (3)

The error-based approach ensures that the combined outcome is better than that of an individual agent (on average), and that the combination weights are optimal in the sense of minimizing the weighted errors. The agent outputs and the weights are combined using the RBF network learning algorithm and the gradient descent algorithm. The whole input range is divided into a set of homogeneous subsets using LVQ and online partitioning. The error measure obtained is termed the overall Bellman residual and is used for learning the gate and region weights.
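As an illustration of Eqs. (1)-(3), the sketch below combines the outputs of several hypothetical agents with gating weights and updates the weights by a projected gradient-descent step on the squared error. The agent outputs, targets, learning rate and the projection used to keep the constraints of Eq. (2) are placeholders chosen for the example, not values or choices taken from the paper.

```python
import numpy as np

# Illustrative sketch (not the authors' code): gated combination of several
# agent outputs as in Eq. (1), with the gating weights updated by gradient
# descent on the squared error of Eq. (3) and projected back onto the
# constraints of Eq. (2).

def combined_output(agent_outputs, w):
    # Eq. (1): weighted average of the agents' outputs for one input.
    return np.dot(w, agent_outputs) / np.sum(w)

def gating_update(agent_outputs, targets, w, lr=0.05):
    """One gradient-descent pass over Eq. (3).

    agent_outputs: shape (n_samples, n_agents), a_k(x) for each input x
    targets:       shape (n_samples,), desired outputs y(x)
    w:             gating weights, shape (n_agents,)
    """
    residual = targets - agent_outputs @ w          # y(x) - sum_k w_k a_k(x)
    grad = -2.0 * agent_outputs.T @ residual        # d(error)/dw_k
    w = w - lr * grad / len(targets)
    # Keep the constraints of Eq. (2): w_k >= 0 and sum_k w_k = 1.
    # A small floor keeps the normalisation well defined.
    w = np.clip(w, 1e-6, None)
    return w / np.sum(w)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_agents, n_samples = 4, 32
    a = rng.normal(size=(n_samples, n_agents))      # a_k(x) for each input x
    y = a @ np.array([0.4, 0.3, 0.2, 0.1])          # synthetic targets
    w = np.full(n_agents, 1.0 / n_agents)           # start from uniform weights
    for _ in range(200):
        w = gating_update(a, y, w)
    print("learned gating weights:", np.round(w, 3))
    print("combined output for first input:", round(combined_output(a[0], w), 3))
```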

A. RBF based Function Approximation
In complex problems the multi-agent concept can provide better results. A large state space also makes it harder for function approximators to learn the Q function. RBF networks allow the partitioning to be learned along with the Q function. The basis functions have their highest activation at their centers and gradually taper off to zero activation; this is known as soft partitioning of the input region. The temporal error is minimized using the global error criterion. The Q value used for action selection is obtained by the weighted combination of the multiple RBF agents. In Q-learning, updating can be done online, without using probability estimates, from actual state transitions. Updating is also incremental, since only the information about the current state transition is used:

    Q(x, a) := Q(x, a) + α [ g(x_{t+1}) + γ · max_{b ∈ A(y)} Q(y, b) - Q(x, a) ],      (4)

where y = x_{t+1} is the next state and the action a is selected by the Past Success Directed Exploration scheme. In this paper the learning starts from scratch, i.e. the Q values are initialized to zero. Previous works on fuzzy Q-learning used either a Boltzmann distribution or pseudo-stochastic exploration techniques. We modify the past-success-directed exploration technique [6] for RBF Q-learning. This technique biases exploration by the amount and rate of success of the learner: the learner exploits more either if it acquires reward at an increasing rate or if it stops receiving reward due to a change in the environment. The average discounted reward reflects both the amount and frequency of the received immediate rewards, and is defined by

    µ_r(t) = ( Σ_{k=1..t} v^(t-k+1) · r_k ) / ( Σ_{k=1..t} v^(t-k+1) ),      (5)

where v ∈ [0, 1] is the discount rate and r_t is the reward received at time t. The discount factor determines how strongly past rewards are weighted. Past-success-directed exploration is combined with the ξ-greedy algorithm, with ξ set as follows:

    ξ_t = 0.8 · exp(-α · µ_r(t))^3 + 0.1.      (6)
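The following sketch puts Eqs. (4)-(6) together for a single RBF agent: Gaussian basis functions approximate Q, the update of Eq. (4) is applied online, and actions are chosen ξ-greedily with ξ driven by the running discounted average reward of Eq. (5). The environment interface, the discount factors γ and v, the basis width and the action set are assumptions made for the example; only α = 0.4 and the ten RBF centers are taken from Section V. In the full architecture, several such agents would be gated by the LVQ partition and combined by the weights of Eq. (1).

```python
import numpy as np

# Illustrative sketch only: one RBF Q-learning agent with the
# past-success-directed xi-greedy exploration of Eqs. (4)-(6).
# Environment, centers, widths, gamma, v and the action set are assumptions.

class RBFQAgent:
    def __init__(self, centers, sigma, n_actions, alpha=0.4, gamma=0.9, v=0.95):
        self.centers = centers                         # (n_centers, state_dim)
        self.sigma = sigma                             # RBF width (assumed)
        self.W = np.zeros((n_actions, len(centers)))   # linear Q weights per action
        self.alpha, self.gamma, self.v = alpha, gamma, v
        self.num = 0.0                                 # running numerator of Eq. (5)
        self.den = 0.0                                 # running denominator of Eq. (5)
        self.rng = np.random.default_rng(0)

    def _phi(self, x):
        # Gaussian basis activations: highest at the center, tapering to zero.
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def q_values(self, x):
        return self.W @ self._phi(x)                   # Q(x, a) for every action a

    def select_action(self, x):
        # Eq. (6): exploration rate driven by the average discounted reward.
        mu_r = self.num / self.den if self.den > 0 else 0.0
        xi = 0.8 * np.exp(-self.alpha * mu_r) ** 3 + 0.1
        if self.rng.random() < xi:
            return int(self.rng.integers(len(self.W)))   # explore
        return int(np.argmax(self.q_values(x)))          # exploit

    def update(self, x, a, reward, x_next):
        # Eq. (5), computed incrementally, so older rewards are discounted by v.
        self.num = self.v * (self.num + reward)
        self.den = self.v * (self.den + 1.0)
        # Eq. (4): one online Q-learning step with RBF function approximation.
        phi = self._phi(x)
        target = reward + self.gamma * np.max(self.q_values(x_next))
        td_error = target - float(self.W[a] @ phi)
        self.W[a] += self.alpha * td_error * phi

# Usage sketch with a made-up 12-dimensional state (6 asset + 6 missile counts)
# and a placeholder environment standing in for the TMD simulation:
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    centers = rng.uniform(0.0, 1.0, size=(10, 12))     # ten RBF centers, as in Sec. V
    agent = RBFQAgent(centers, sigma=0.5, n_actions=6)
    x = rng.uniform(0.0, 1.0, 12)
    for _ in range(100):
        a = agent.select_action(x)
        x_next = rng.uniform(0.0, 1.0, 12)             # placeholder transition
        reward = rng.random()                          # placeholder reward
        agent.update(x, a, reward, x_next)
        x = x_next
    print("Q values for current state:", np.round(agent.q_values(x), 3))
```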

Figure 1. LVQ-RBF multi-agent hybrid architecture.

III. GENETIC ALGORITHM
The Genetic Algorithm (GA) is a stochastic search algorithm which uses evolution and natural selection as its heuristic. It is an iterative algorithm that retains a pool of feasibly strong solutions at each genetic pass; a minimal sketch for the allocation problem is given below, after Section IV.

IV. PARTICLE SWARM OPTIMIZATION
Particle Swarm Optimization (PSO) is a stochastic optimization technique inspired by the social behavior of bird flocking and fish schooling. The system is initialized with a population of random solutions and searches for the optimum by updating generations. The design goal is to train the neural network to achieve up to 75 percent survivability. The objective function is the probability-of-survival calculation. In this paper, kdsa and gsa are taken at random.
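As referenced in Section III, here is a rough GA sketch for the allocation problem. The chromosome encoding (interceptors per priority region), the stand-in survival-probability fitness, and the selection, crossover, repair and mutation operators are all assumptions made for illustration; they are not the paper's design.

```python
import numpy as np

# Illustrative GA sketch (not the authors' implementation): each chromosome
# is a vector of interceptors allotted to six priority regions, and the
# fitness is a stand-in survival-probability function.

rng = np.random.default_rng(0)
N_REGIONS, TOTAL_INTERCEPTORS = 6, 60
POP, GENS, P_MUT = 40, 100, 0.1

def random_allocation():
    # Random split of the interceptor stock across the priority regions.
    cuts = np.sort(rng.integers(0, TOTAL_INTERCEPTORS + 1, N_REGIONS - 1))
    return np.diff(np.concatenate(([0], cuts, [TOTAL_INTERCEPTORS])))

def repair(alloc):
    # Keep the total interceptor count fixed after crossover.
    alloc = np.maximum(alloc, 0)
    diff = TOTAL_INTERCEPTORS - int(alloc.sum())
    while diff != 0:
        idx = rng.integers(0, N_REGIONS)
        step = 1 if diff > 0 else -1
        if step < 0 and alloc[idx] == 0:
            continue
        alloc[idx] += step
        diff -= step
    return alloc

def fitness(alloc, missiles, pk=0.7):
    # Stand-in objective: each interceptor kills a missile with probability pk;
    # the score rises as fewer missiles leak through to each region.
    leakers = missiles * (1.0 - pk) ** (alloc / np.maximum(missiles, 1))
    return float(np.mean(np.exp(-leakers)))

missiles = rng.integers(1, 15, N_REGIONS)              # hypothetical attack wave
population = [random_allocation() for _ in range(POP)]

for _ in range(GENS):
    scores = np.array([fitness(ind, missiles) for ind in population])
    def pick():
        # Tournament selection: keep the better of two random individuals.
        i, j = rng.integers(0, POP, 2)
        return population[i] if scores[i] >= scores[j] else population[j]
    children = []
    for _ in range(POP):
        p1, p2 = pick(), pick()
        cut = rng.integers(1, N_REGIONS)               # one-point crossover
        child = repair(np.concatenate((p1[:cut], p2[cut:])))
        if rng.random() < P_MUT:                       # mutation: move one interceptor
            src, dst = rng.integers(0, N_REGIONS, 2)
            if child[src] > 0:
                child[src] -= 1
                child[dst] += 1
        children.append(child)
    population = children

best = max(population, key=lambda ind: fitness(ind, missiles))
print("best allocation per region:", best, "fitness:", round(fitness(best, missiles), 3))
```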
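A corresponding PSO sketch is given below. Particles carry a continuous allocation over the six priority regions, the velocity update uses textbook inertia and acceleration coefficients, and the objective is the same stand-in survival-probability function as in the GA sketch; none of these choices are taken from the paper beyond the idea of maximizing the probability of survival. Because every particle is pulled towards both its own best and the swarm best, this kind of update tends to converge in few evaluations on a smooth objective, which is consistent with the convergence behaviour reported in Section V.

```python
import numpy as np

# Illustrative PSO sketch (again not the authors' code): the swarm maximizes
# a stand-in survival-probability objective over interceptor allocations.
# Inertia and acceleration coefficients are common textbook defaults.

rng = np.random.default_rng(2)
N_REGIONS, TOTAL_INTERCEPTORS = 6, 60
N_PARTICLES, ITERS = 30, 200
W, C1, C2 = 0.7, 1.5, 1.5                       # inertia, cognitive, social terms

def survival_probability(alloc, missiles, pk=0.7):
    # Same placeholder objective as in the GA sketch above.
    alloc = np.maximum(alloc, 0.0)
    alloc = alloc / max(alloc.sum(), 1e-9) * TOTAL_INTERCEPTORS  # enforce the stock
    leakers = missiles * (1.0 - pk) ** (alloc / np.maximum(missiles, 1))
    return float(np.mean(np.exp(-leakers)))

missiles = rng.integers(1, 15, N_REGIONS)       # hypothetical attack wave

pos = rng.uniform(0.0, TOTAL_INTERCEPTORS, (N_PARTICLES, N_REGIONS))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([survival_probability(p, missiles) for p in pos])
gbest = pbest[np.argmax(pbest_val)].copy()
gbest_val = pbest_val.max()

for _ in range(ITERS):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # Standard velocity update: inertia + pull towards personal and global bests.
    vel = W * vel + C1 * r1 * (pbest - pos) + C2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, TOTAL_INTERCEPTORS)
    vals = np.array([survival_probability(p, missiles) for p in pos])
    improved = vals > pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    if vals.max() > gbest_val:
        gbest, gbest_val = pos[np.argmax(vals)].copy(), vals.max()

print("best survival probability found:", round(gbest_val, 3))
print("interceptors per priority region:",
      np.round(gbest / gbest.sum() * TOTAL_INTERCEPTORS, 1))
```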



V. RESULTS AND DISCUSSION
A number of experiments are carried out with the design goal of maximizing the reward values, the assets saved and the action-space exploration. The simulation involves six priority regions, each with at least 500 assets. The input state equation used is i = [a1 a2 a3 a4 a5 a6, m1 m2 m3 m4 m5 m6], where a1-a6 denote the number of assets present in the i-th priority region and m1-m6 the number of incoming missiles towards the i-th priority region. Experiments are done to learn the complex Q value function with agents able to withstand 276 attack waves. The past-success-directed exploration scheme is used to maximize the survivability of the assets. The LVQ-RBF hybrid neural network decision module is trained to allot interceptors to defend the varying priority regions. The RBF network used has ten centers. Euclidean distance is used as the error criterion in the RL scheme, and the value of alpha used is 0.4. The region weights learned in the gradient descent scheme and a sample output are shown in Figure 2. The reward values plotted in Figure 3 show the accumulation of rewards by the LVQ-RBF network over 500 iterations. The survival probability values obtained for PSO are plotted in Figure 4. The optimal defense plan obtained for different populations and different runs is shown in Table 1 for GA and in Table 2 for PSO. The plot of the probability of survival is shown in Figure 5. Figures 6 and 7 are graphs of the objective function for the benchmark functions Foxhole and Alpine, respectively.

VI. CONCLUSION
This paper proposes an efficient solution to the Interceptor allocation problem that includes multiple-priority attacking and defending weapons and a BM/C3 problem analysis. The decision module is capable of sequential allocation of the defense resources over a sequence of time steps. The proposed system facilitates quicker convergence. The learning is performed by the PSO technique. The graphs of results show the efficiency of the decision module in terms of the number of trials. PSO offers better survival probability than GA and is more efficient in terms of convergence.

Fig 2. Plot showing the learning of the region weights.

Fig 3. Reward accumulation with the discrete reward scheme.

Table 1. Optimal defense plan obtained for different populations and different runs using the Genetic Algorithm.

Fig 4. Survival probability values obtained for PSO.

Table 2. Optimal defense plan using PSO.

Fig 5. Probability of survival using PSO.

Fig 6. Objective function for the Foxhole function using PSO.

Fig 7. Objective function for the Alpine function using PSO.

REFERENCES
[1] Bisht, "Hybrid Genetic Algorithm for Optimal Weapon Allocation", Defence Science Journal, Vol. 54, No. 3, July 2004.
[2] Bellman, R., Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961.
[3] Bertsekas, D., and Tsitsiklis, J., Neuro-Dynamic Programming, Athena Scientific, 1996.
[4] Bishop, C. M., Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[5] Jaakkola, T., Jordan, M. I., and Singh, S. P., "On the convergence of stochastic iterative dynamic programming algorithms", Neural Computation, 6(6):1185-1201, 1994.
[6] Wyatt, J., "Exploration and Inference in Learning from Reinforcement", Ph.D. Dissertation, University of Edinburgh, 1997.
[7] Kaelbling, L. P., Littman, M. L., and Moore, A. W., "Reinforcement Learning: A Survey", Journal of Artificial Intelligence Research, Vol. 4, pp. 237-285, 1996.
[8] Singh, S. P., Jaakkola, T., and Jordan, M. I., "Reinforcement learning with soft state aggregation", Advances in Neural Information Processing Systems, 1992.
[9] Sutton, R. S., and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, 1998.
