This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2007 proceedings.
An Adaptive Reinforcement Learning-based Approach to Reduce Blocking Probability in Bufferless OBS Networks

Abdeltouab Belbekkouche, Abdelhakim Hafid
Network Research Laboratory, University of Montreal, Pavillon André-Aisenstadt H3C 3J7, Canada
{belbekka, ahafid}@iro.umontreal.ca

Abstract—Optical burst switching (OBS) is an optical switching paradigm that offers a good tradeoff between traditional Optical Circuit Switching (OCS) and Optical Packet Switching (OPS), since it combines the relatively easy implementation of the former with the efficient bandwidth utilization of the latter. Hence, OBS is a promising technology for the next generation optical Internet. A buffer-less OBS network can be implemented using ordinary optical communication equipment without the need for either wavelength converters or optical memories. However, OBS networks suffer from a relatively high blocking probability, a primary metric of interest, because of contention. In this paper we propose a new contention resolution scheme for buffer-less OBS networks that uses deflection routing and Reinforcement Learning agents to dynamically assign an appropriate Offset Time (OT) to each burst, in order to reduce the losses caused, for example, by Insufficient Offset Time (IOT) when only deflection is used. Simulation results demonstrate that our approach effectively reduces blocking probability while maintaining a reasonable end-to-end delay for each burst. Hence, it establishes an appropriate tradeoff between loss rate and delay.

Keywords-Deflection Routing; Optical Burst Switching; Unsupervised learning; Wavelength Division Multiplexing
I. INTRODUCTION
Optical burst switching (OBS) [1, 2] is an optical switching paradigm and a very promising technology for next generation optical Internet networks, which must meet the increasing bandwidth demands of Internet users. OBS is a good tradeoff between traditional Optical Circuit Switching (OCS), which is relatively easy to implement but suffers from poor bandwidth utilization and coarse granularity, and Optical Packet Switching (OPS), which offers good bandwidth utilization and fine granularity but is difficult to implement because of the immaturity of current optical technologies such as optical buffers and optical logic [3]. Buffer-less OBS, in contrast, can be implemented using current optical communication equipment. In OBS networks, traffic is groomed in bursts of variable lengths. At the beginning of the grooming process, a control packet is sent from the source router to the destination router in order to reserve the required resources along the light path. This control packet is subject to Optical-Electric-Optical (OEO) conversions and receives appropriate processing at each intermediate node (OBS switch). After a delay called Offset Time (OT), the burst is sent through the same light path without any buffering requirement.

From a signaling point of view, there exist various OBS protocols. For instance, Just-In-Time (JIT) uses open-ended reservation with two types of control packets per burst: setup packets and release packets. The bandwidth needed by a burst is reserved at each intermediate node (belonging to the path towards the destination) immediately after receiving and processing the setup packet, and released immediately after receiving and processing the corresponding release packet. Just-Enough-Time (JET), in turn, uses close-ended reservation: the required bandwidth is reserved from the moment the burst arrives at each intermediate node (belonging to the path towards the destination) and only for the burst duration indicated in the corresponding control packet. In both JIT and JET, a burst is transmitted after its corresponding control packet (after an Offset Time) without waiting for an acknowledgment [1, 4].

A major issue in OBS networks is wavelength contention, which is the main cause of burst losses and may considerably degrade the performance of such networks. A contention arises when two or more bursts intend to take the same output fiber, on the same wavelength, at the same time. Various studies have already addressed wavelength contention in OBS networks. There are essentially four methods: (a) buffering with the use of Fiber Delay Lines (FDLs); (b) wavelength conversion when the burst encounters a contention situation; (c) burst segmentation, where the contending part of a burst involved in a contention is dropped or forwarded on an alternate path while the other part is forwarded on the primary path [5, 6]; and (d) deflection routing, where only one of the bursts involved in a contention is routed to its primary fiber, whereas each of the other bursts is switched to an alternate outgoing fiber with available wavelength(s). We note that combining these techniques often leads to better solutions [7]. In this paper, we are interested in the performance of
1-4244-0353-7/07/$25.00 ©2007 IEEE 2377
the deflection routing method in buffer-less OBS networks based on Just-Enough-Time (JET) signaling. We concentrate on the Offset Time (OT) assignment issue; instead of using a static value for OT, we propose a scheme that dynamically computes the value of OT, adapting to changes in the network state (such as burst blocking probability and end-to-end delay) with little computing time and little information on network state and design. The proposed scheme makes use of Artificial Intelligence (AI) learning techniques, in particular Reinforcement Learning.

The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 provides a short overview of deflection routing in OBS networks. Section 4 describes the proposed scheme, including the Reinforcement Learning model and the corresponding algorithm. Section 5 evaluates the performance of the proposed scheme using simulations. Finally, Section 6 concludes the paper.

II. RELATED WORK
Whereas wavelength contention has been identified as an important issue to address in order to increase the performance of OBS networks, most authors still focus on very small networks and traffic instances [8]. In addition, few authors have considered deflection routing exclusively with a focus on the offset assignment issue, even though its significant impact is reported in [9]. Indeed, Offset Time setting has mostly been investigated in the Quality of Service (QoS) context, with the addition of a fixed Extra Offset Time (EOT) to the basic Offset Time (which is a fixed value too), according to the traffic class [10-15]. Moreover, the authors in [15, 16] proposed to randomize the offset generation process by using a statistical shaping model, similar to a leaky bucket regulator. The authors in [16] proposed a scheme with a dynamic offset windows mechanism where the closest node to a link shared by several sources defines, for each source, an interval of time in which its bursts can be scheduled. With scheduling time intervals (offset windows) assigned dynamically, different sources avoid scheduling bursts in the same time frame. More recently, the authors in [9] proposed an adaptive Offset Time determination based on the bandwidth utilization of the links and nodes traversed by the bursts and on measurements of burst losses due to Insufficient Offset Time (IOT). However, this scheme requires new protocols to collect the considerable amount of information (e.g., paths, load) needed to compute the value of OT and thus introduces a considerable overhead.

Reinforcement Learning (RL) [17] has already been used to address networking issues in [18-22]. For example, the authors in [20] apply Q-learning [23] to the packet scheduling problem in routers.
The authors in [18] propose an adaptive provisioning scheme for DiffServ networks based on Reinforcement Learning; the scheme avoids allocating a fixed amount of bandwidth to each DiffServ traffic class at admission time by enabling the RL agent, at the ingress router, to dynamically change the allocation with the objective of maximizing the profit of the access provider. The authors report an improvement of both the bandwidth utilization and the provider's benefit.

III. DEFLECTION ROUTING PROTOCOL
Deflection routing in buffer-less OBS networks can be seen as the use of idle links in the network as Fiber Delay Lines (FDLs). In deflection protocols, every intermediate optical switch uses deflection routing only for contention resolution: if the default output link (wavelength) is reserved by another burst, the optical switch deflects the burst in question to any other idle link (with the right wavelength available). We prevent infinite loops caused by repeated deflections by limiting the number of deflections that a control packet (and thus its corresponding data burst) may incur. For that purpose, we add an NDef field to each control packet in order to record the number of times it has been deflected, and an NDmax value that fixes the maximum number of authorized deflections. In addition to the SETUP control packet, we define a new kind of control packet, called NACK, which is sent if a burst is dropped due to Insufficient Offset Time (IOT) or if no alternative link is available to deflect the data burst.

Figure 1. Deflection routing protocol in the optical switch (flowchart: an incoming SETUP control packet is forwarded on the default output link if it is available; otherwise, if NDef has not reached NDmax and an alternate link is available, the burst is deflected to the selected output link; otherwise a NACK packet is sent back and the burst is dropped)
Fig. 1 shows the routing procedure in an optical burst switch. It performs the following process on receiving a SETUP control packet:
• Check the destination node address, and then look up the routing table to decide the output link based on shortest path routing;
• Reserve bandwidth for the corresponding data burst on the assigned wavelength if it is not already reserved for another burst, and then send the control packet to the next node;
• Otherwise, verify that the maximum number of deflections is not reached and deflect the SETUP control packet (and the corresponding data burst thereafter) to any available link on which the assigned wavelength has not been reserved for another burst;
• If the assigned wavelength is not available on any output link, or if the maximum number of deflections is reached, the SETUP control packet is dropped (and the corresponding data burst thereafter) and a NACK packet is immediately sent back to the source node to notify it of the burst drop.
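The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch under simplifying assumptions, not the authors' simulator code: the names `SetupPacket`, `Action`, and `process_setup` are hypothetical, wavelength availability is modelled as a plain boolean per link rather than as time-dependent reservations, and "maximum number of deflections reached" is interpreted as NDef ≥ NDmax.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Action(Enum):
    FORWARD = "forward"   # default (shortest-path) link reserved
    DEFLECT = "deflect"   # an alternate link reserved
    DROP = "drop"         # burst dropped, NACK sent back to the source


@dataclass
class SetupPacket:
    destination: str
    wavelength: int       # index of the wavelength assigned to the burst
    ndef: int = 0         # NDef field: deflections incurred so far


def process_setup(
    packet: SetupPacket,
    routing_table: Dict[str, str],   # destination -> default output link
    free: Dict[str, List[bool]],     # link -> per-wavelength availability
    nd_max: int = 2,                 # NDmax
) -> Tuple[Action, Optional[str]]:
    """Decide what a switch does with an incoming SETUP control packet."""
    default_link = routing_table[packet.destination]

    # Step 1-2: try the shortest-path link on the assigned wavelength.
    if free[default_link][packet.wavelength]:
        free[default_link][packet.wavelength] = False
        return Action.FORWARD, default_link

    # Step 3: deflection is only allowed while NDef < NDmax (assumption).
    if packet.ndef >= nd_max:
        return Action.DROP, None

    # Deflect to the first other link whose assigned wavelength is free.
    for link, wavelengths in free.items():
        if link != default_link and wavelengths[packet.wavelength]:
            wavelengths[packet.wavelength] = False
            packet.ndef += 1
            return Action.DEFLECT, link

    # Step 4: no alternate link available.
    return Action.DROP, None
```

A `DROP` result corresponds to sending a NACK packet back to the source edge router; a fuller model would also track reservation start and end times per wavelength, as JET requires.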
IV. ADAPTIVE COMPUTATION OF OT VALUES
A. Offset Time determination

The Offset Time (OT) is the time that separates the control packet from its burst. It is an important parameter in OBS networks, especially in buffer-less OBS networks, since an appropriate value can significantly decrease the bursts' blocking probability due to Insufficient Offset Time (IOT) when deflection routing is the only scheme used to resolve contentions. The OT calculation is based on the control packet Processing Time (PT) at each optical switch of the path followed by the control packet from the source router to the destination router. Let $N$ be the size of this path; the OT is given by the following formula:

$$OT = N \times PT \qquad (1)$$

Hence, if $N_{SD}$ is the shortest-path size, then we have the following:

$$OT \geq N_{SD} \times PT \qquad (2)$$

However, if the control packet is deflected because of a contention situation, it may take another path which may be longer than the default shortest path. In this case, the data burst may overtake its corresponding control packet and will be dropped due to Insufficient Offset Time (IOT), which increases the Blocking Probability of the whole network. To overcome this problem, we propose to add an explicit Extended Offset Time (EOT) to the basic OT. The resulting OT is calculated as follows:

$$OT = (N_{SD} \times PT) + EOT \qquad (3)$$

OT can be considered as a Time To Live of the burst. Hence, the EOT should not be too small, but it should not be too large either, in order to avoid increasing the bursts' end-to-end delay. An adaptive scheme that computes the value of EOT (and thus OT) dynamically, taking into account the state of the OBS network, will definitely be effective in improving OBS performance. In the following section, we propose an adaptive scheme in the framework of RL.

B. The Reinforcement Learning Model

In the framework of reinforcement learning, a system accepts input from the environment (e.g., blocking rate) and responds by selecting appropriate actions (e.g., computing a new value of EOT); the environment then evaluates the decisions by sending a rewarding or penalizing reinforcement feedback. Based on the value of the received reinforcement, the system updates its parameters so that good decisions become more likely to be made in the future, while bad decisions become less likely to occur [24]. Three key components are needed: a representation of the system state, a reward function which reflects the immediate value of the system actions, and a learning algorithm which refines/adjusts the actions (e.g., values of EOT) based on the feedback provided by the reward function.

In this paper, we suppose that an RL system, called RL agent, is placed at each ingress edge router of the OBS network. It uses a Gaussian unit whose output $y_t$ determines the Extended Offset Time (EOT) at each time step $t$ according to the parameters $\mu_t$ and $\sigma_t$, which are the mean and the standard deviation of the Gaussian unit, respectively. They are updated at the beginning of each new step $t + 1$ using the following formulas:

$$\mu_{t+1} = \mu_t + \alpha_\mu \, (r_t - \hat{r}_t) \, \frac{y_t - \mu_t}{\sigma_t^2} \qquad (4)$$

$$\sigma_{t+1} = \sigma_t + \alpha_\sigma \, (r_t - \hat{r}_t) \, \frac{(y_t - \mu_t)^2 - \sigma_t^2}{\sigma_t^3} \qquad (5)$$

where $r_t$ is the immediate reward obtained by the previous output $y_t$; $\alpha_\mu$ and $\alpha_\sigma$ are respectively the adaptation rates of the Gaussian unit parameters $\mu$ and $\sigma$. The authors in [25] report that a reasonable behavior of this Gaussian unit can be obtained by taking $\alpha_\mu = \alpha_\sigma = \alpha \cdot \sigma^2$, with $\alpha$ being a sufficiently small positive constant. The cumulative reward $\hat{r}_t$ is obtained by using the following exponential averaging formula:

$$\hat{r}_t = \gamma \, r_{t-1} + (1 - \gamma) \, \hat{r}_{t-1} \qquad (6)$$

where $0 < \gamma < 1$ allows the weighting of the immediate reward with respect to the cumulative reward.

The reward function $r$ represents in our model a tradeoff between the Blocking Probability (due to IOT) and the end-to-end delay of bursts during a time step $t$. In order to calculate the blocking probability during a time step $t$, noted $\tau_t$, each time a burst is dropped due to IOT, a NACK packet is sent to the burst's source edge router, which immediately updates the number of bursts dropped since the beginning of the current time step. Hence, the reward function at the beginning of each time step $t$ is calculated as follows:

$$r_t = -(\omega \, \tau_t + (1 - \omega) \, EOT_t) \qquad (7)$$

where $0 < \omega \leq 1$ allows the weighting of the blocking probability and the delay induced by EOT. The negative sign in this formula means that an increase of either the blocking probability or the end-to-end delay returns a penalty to the Reinforcement Learning agent at the source edge router.

The idea behind (4) is to create a disturbance between the current Gaussian unit output value $y_t$ and its mean $\mu_t$; if the immediate reward value is better than the cumulative reward value, the mean of the Gaussian unit moves in the direction of the current output; otherwise, it moves in the opposite direction. Equation (5) widens the exploration space if an improvement is found far from the mean or if an output close to the mean yields a reward below the cumulative reward. On the other hand, (5) narrows the exploration space if an output close to the mean is better or if an exploration far from it has produced an immediate reward value less than the cumulative reward value. This mechanism allows finding a tradeoff between exploitation and exploration, which is a traditional issue in Reinforcement Learning models [17, 25].

C. The learning algorithm

The RL agent at the ingress router of the OBS network executes, at the beginning of each time step $t$, the learning algorithm shown in Table 1 in order to determine the EOT and, consequently, the OT. An important issue in this type of Reinforcement Learning algorithm is that the convergence may depend strongly on the initial values of the parameters $\mu_0$, $\sigma_0$ (that should be chosen
according to the information on the network's state at time $t = 0$). The other parameters $\gamma$, $\alpha_\mu$, $\alpha_\sigma$ and $\omega$ also have an impact on the performance of the proposed algorithm and, consequently, on the OBS network performance. Another important factor is the duration of the time step $t$, which directly influences the reaction latency of our algorithm to changes in the network's state. Hence, it influences the stability of the whole network: a very small time step can cause fluctuations in the network traffic state (knowing that OT can be used to differentiate traffic classes [14]), whereas a very large time step makes the algorithm react late to network state changes (like a sudden increase in the bursts' blocking probability).

TABLE 1. EXTENDED OFFSET TIME DETERMINATION ALGORITHM

Input: $\mu_t$, $\sigma_t$, $\gamma$, $\alpha_\mu$, $\alpha_\sigma$, $\omega$ and $\tau_t$.
Output: The Extended Offset Time $EOT_t$ at each time step $t$.

Initialization of parameters $\mu_0$, $\sigma_0$, $\tau_0$, $\gamma$, $\alpha_\mu$, $\alpha_\sigma$, $\omega$ and $EOT_0 \leftarrow |N(\mu_0, \sigma_0)|$;
While ($t \geq 0$) {
    $r_t \leftarrow -(\omega \, \tau_t + (1 - \omega) \, EOT_t)$;
    $\hat{r}_t \leftarrow \gamma \, r_{t-1} + (1 - \gamma) \, \hat{r}_{t-1}$; /* if ($t = 0$) $\hat{r}_t \leftarrow 0$ */
    $\mu_{t+1} \leftarrow \mu_t + \alpha_\mu \, (r_t - \hat{r}_t) \, (y_t - \mu_t) / \sigma_t^2$;
    $\sigma_{t+1} \leftarrow \sigma_t + \alpha_\sigma \, (r_t - \hat{r}_t) \, ((y_t - \mu_t)^2 - \sigma_t^2) / \sigma_t^3$;
    $EOT_{t+1} \leftarrow |N(\mu_{t+1}, \sigma_{t+1})|$;
    t++;
    Wait a time step duration;
}

V. SIMULATION RESULTS AND ANALYSIS

We studied the performance of the proposed RL-based approach using the NCTUns network simulator and emulator [26]. Considering the NSFNET topology with 14 nodes shown in Fig. 2, we assume that each single fiber link is bidirectional and has the same number of wavelengths, each operating at 1 Gbps. Each node in the network can route, generate and receive traffic.
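Before turning to the results, it may help to see the learning loop of Table 1 in executable form. The following Python sketch implements equations (3) to (7) under stated assumptions: the class name `EotAgent`, the helper `offset_time`, and the `sigma_min` floor (which keeps the standard deviation positive) are illustrative additions that do not appear in the paper, and $y_t$ is taken to be the raw Gaussian sample with $EOT_t = |y_t|$. The default parameter values are the ones used in the simulations reported in this section.

```python
import random
from typing import Optional


def offset_time(n_sd: int, pt: float, eot: float) -> float:
    """Equation (3): OT = (N_SD x PT) + EOT."""
    return n_sd * pt + eot


class EotAgent:
    """Gaussian-unit RL agent placed at an ingress edge router (sketch)."""

    def __init__(self, mu=0.2, sigma=0.1, gamma=0.15, omega=0.9,
                 alpha_mu=5e-5, alpha_sigma=5e-5, sigma_min=1e-6,
                 rng: Optional[random.Random] = None):
        self.mu, self.sigma = mu, sigma
        self.gamma, self.omega = gamma, omega
        self.alpha_mu, self.alpha_sigma = alpha_mu, alpha_sigma
        self.sigma_min = sigma_min          # assumption, not in the paper
        self.rng = rng or random.Random()
        self.r_prev = None                  # r_{t-1}; None before step 0
        self.r_hat = 0.0                    # cumulative reward
        self.y = self.rng.gauss(self.mu, self.sigma)
        self.eot = abs(self.y)              # EOT_0 = |N(mu_0, sigma_0)|

    def step(self, tau: float) -> float:
        """Run one iteration of Table 1 given the blocking rate tau_t."""
        # Equation (7): penalize both blocking and added delay.
        r = -(self.omega * tau + (1.0 - self.omega) * self.eot)
        # Equation (6), with r_hat = 0 at the first step.
        if self.r_prev is None:
            r_hat = 0.0
        else:
            r_hat = self.gamma * self.r_prev + (1.0 - self.gamma) * self.r_hat
        delta = r - r_hat
        dev = self.y - self.mu
        # Equations (4) and (5), both evaluated with the old mu and sigma.
        new_mu = self.mu + self.alpha_mu * delta * dev / self.sigma ** 2
        new_sigma = (self.sigma + self.alpha_sigma * delta
                     * (dev ** 2 - self.sigma ** 2) / self.sigma ** 3)
        self.mu = new_mu
        self.sigma = max(new_sigma, self.sigma_min)
        self.r_prev, self.r_hat = r, r_hat
        # Draw the EOT for the next time step.
        self.y = self.rng.gauss(self.mu, self.sigma)
        self.eot = abs(self.y)
        return self.eot


if __name__ == "__main__":
    agent = EotAgent(rng=random.Random(42))
    # Hypothetical per-step blocking rates, for illustration only.
    for tau in (0.05, 0.05, 0.20, 0.20, 0.02):
        eot = agent.step(tau)
        print(f"EOT={eot:.4f}  OT={offset_time(3, 0.01, eot):.4f}")
```

In a simulator, `step` would be called once per time step with the blocking rate derived from the NACK packets received since the previous call.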
Figure 2. NSFNET topology with 14 nodes

The goal of the simulations is to study the performance of the proposed Reinforcement Learning-based approach and compare it to classical deflection routing schemes. Specifically, we compare deflection routing with and without the proposed Offset Time assignment mechanism. For more information about deflection routing performance in OBS networks, we refer the reader to [3] and [12]. We consider the following performance metrics: the blocking probability, which is the main metric in buffer-less OBS networks, and the end-to-end delay. Initially, we set the parameters as follows: $\mu_0 = 0.2$, $\sigma_0 = 0.1$, $\gamma = 0.15$, $\omega = 0.9$, $\alpha_\mu = \alpha_\sigma = 0.00005$ and time step $t$ = 0.5 seconds. We set NDmax to 2; this means that a burst may not be deflected more than two times. We consider CBR traffic where each node in the network sends to all the other nodes. We define traffic load as the ratio of the total throughput of the input source nodes to the total input fiber capacity of these nodes (i.e., $\sum$ input nodes throughput / $\sum$ input nodes capacities). We vary traffic load from 0.1 to 1. In order to present the burst blocking probability results more clearly, we split them into two figures (Fig. 3 and Fig. 4).

Fig. 3 shows the variation of the burst blocking probability with the traffic load between 0.1 and 0.4. The results clearly show that our approach effectively reduces blocking probability for lightly and moderately loaded networks. Indeed, when load is 0.1, the relative improvement (i.e., [(blocking probability of deflection routing)-(blocking probability of the proposed approach)]/(blocking probability of deflection routing)) of the proposed approach over the classical deflection routing scheme is 97 %. When load is 0.4, this improvement is 28 %.

Figure 3. Burst Blocking Probability comparison between our approach and classical deflection routing when load is less than 0.5 (plot: Blocking Probability versus Load; curves: Only deflection, Our approach)

Fig. 4 shows the variation of the burst blocking probability when load varies between 0.5 and 1. In this case, the improvement of our approach is between 11 % when load is 0.5 and less than 1 % when load is 1. Our RL-based approach provides a far better improvement in a lightly loaded network than in a heavily loaded one. This can be explained as follows: when the offered load increases, the number of successful deflections tends to decrease, which in turn decreases burst losses due to Insufficient Offset Time, making the improvement brought by our approach less significant. Overall, the RL-based approach outperforms deflection routing.

Figure 4. Burst Blocking Probability comparison between our approach and classical deflection routing when load is greater than 0.5 (plot: Blocking Probability versus Load; curves: Only deflection, Our approach)

Fig. 5 shows the variation of end-to-end delay with the traffic load. As expected, our approach incurs a larger delay as a response to the increase of burst losses when traffic load increases. In the worst case, the relative end-to-end delay increase (i.e., [(delay with RL-based approach)-(delay with deflection routing)]/(delay with RL-based approach)) of our approach compared to deflection routing is 52 %; this increase is still acceptable and comparable to that reported in [9] and [27].

Figure 5. Burst End-to-end delay comparison between our approach and classical deflection routing (plot: Mean E2E delay in milliseconds versus Load; curves: Only deflection, Our approach)
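The derived quantities used in this section (traffic load, relative improvement in blocking probability, and relative end-to-end delay increase) can be written down as small helper functions. The sketch below is illustrative only: the function names are hypothetical and no simulation data from the paper is reproduced. Note that, following the definitions given in the text, the relative improvement is normalized by the deflection-only value, whereas the relative delay increase is normalized by the value of the RL-based approach.

```python
from typing import Sequence


def traffic_load(throughputs: Sequence[float],
                 capacities: Sequence[float]) -> float:
    """Sum of input-node throughputs over sum of input-fiber capacities."""
    return sum(throughputs) / sum(capacities)


def relative_improvement(bp_deflection: float, bp_proposed: float) -> float:
    """(BP of deflection routing - BP of proposed) / BP of deflection routing."""
    return (bp_deflection - bp_proposed) / bp_deflection


def relative_delay_increase(delay_rl: float, delay_deflection: float) -> float:
    """(delay with RL - delay with deflection) / delay with RL."""
    return (delay_rl - delay_deflection) / delay_rl


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the formulas.
    print(traffic_load([0.5, 0.5], [1.0, 1.0]))
    print(relative_improvement(0.10, 0.08))
    print(relative_delay_increase(2.0, 1.0))
```

Because the two relative metrics use different denominators, their percentages are not directly comparable with each other.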
Fig. 6 shows the reaction (in terms of extended offset time) of our model over time while varying the traffic load. The traffic load between time=0 and time=1 is equal to 0.1. At time=1, we suddenly increase the traffic load to 1 for 2 seconds. At time=3, we suddenly decrease the traffic load to 0.1. The results in Fig. 6 show that our RL model reacts rapidly to traffic load variations and converges to a stable state within a few time steps.

Figure 6. The reaction of our RL-model to variation in traffic load (plot: Extended Offset Time in milliseconds versus Time in seconds)
VI. CONCLUSION
In this paper, we proposed a new Reinforcement Learning-based approach to reduce blocking probability in buffer-less OBS networks. This approach uses deflection routing to resolve wavelength contention and an adaptive Offset Time assignment mechanism to decrease losses due to Insufficient Offset Time. In addition to its simplicity, our approach is effective in reducing blocking probability at the expense of an acceptable overhead in burst end-to-end delay. Further work will consider the impact of combining other contention resolution schemes, like segmentation and wavelength conversion, with our approach in order to improve its performance.

REFERENCES

[1] C. Qiao and M. Yoo. Optical burst switching (OBS) - a new paradigm for an optical Internet. Journal of High Speed Networks, 8(1):69-84, 1999.
[2] C. Qiao and M. Yoo. Choices, features and issues in optical burst switching. Optical Network Magazine, 1(2):36-44, 2000.
[3] Y. Chen, H. Wu, D. Xu and C. Qiao. Performance Analysis of Optical Burst Switched Node with Deflection Routing. In Proceedings of IEEE ICC, Vol. 2, pages 1355-1359, 2003.
[4] J. Y. Wei and R. I. McFarland. Just-In-Time signaling for WDM optical burst switching networks. Journal of Lightwave Technology, 18(12):2019-2037, 2000.
[5] V. Vokkarane, J. P. Jue, and S. Sitaraman. Burst segmentation: An approach for reducing packet loss in optical burst switched networks. In Proceedings of IEEE ICC, Vol. 5, pages 2673-2677, 2002.
[6] A. Detti, V. Eramo, and M. Listanti. Optical burst switching with burst drop (OBS/BD): an easy OBS improvement. In Proceedings of IEEE ICC, Vol. 5, pages 2687-2691, 2002.
[7] A. Zalesky, et al. Evaluation of limited wavelength conversion and deflection routing as methods to reduce blocking probability in optical burst switched networks. In Proceedings of IEEE ICC, Vol. 3, pages 1543-1547, 2004.
[8] T. Coutelen, H. Elbiaze, B. Jaumard, A. Metnani. Measurement-based alternative routing strategies in optical burst-switched networks. In Proceedings of the 7th International Conference on Transparent Optical Networks, Vol. 1, pages 224-227, 2005.
[9] T. Coutelen, H. Elbiaze, B. Jaumard. An efficient adaptive offset mechanism to reduce burst losses in OBS networks. In IEEE GLOBECOM '05, Vol. 4, 5 pp., 2005.
[10] N. Barakat, E. H. Sargent. Quantifying the effect of extended offsets in optical burst switching networks. In Canadian Conference on Electrical and Computer Engineering, Vol. 4, pages 2331-2335, 2004.
[11] D. Q. Liu, M. T. Liu. Optical burst switching reservation process modeling and analysis. In The 8th International Conference on Communication Systems, Vol. 2, pages 928-932, 2002.
[12] C. Hsu, T. Liu, and N. Huang. Performance analysis of deflection routing in optical burst-switched networks. In Proceedings of INFOCOM, Vol. 1, pages 66-73, 2002.
[13] M. Yoo and C. Qiao. Supporting multiple classes of services in IP over WDM networks. In Proceedings of GLOBECOM, Vol. 1b, pages 1023-1027, 1999.
[14] M. Yoo, C. Qiao, and S. Dixit. Optical burst switching for service differentiation in the next-generation optical Internet. IEEE Communications Magazine, 39:98-104, February 2001.
[15] S. Verma, H. Chaskar, and R. Ravikanth. Optical burst switching: A viable solution for Terabit IP backbone. IEEE Network, pages 48-53, Nov/Dec 2000.
[16] A. Agusti, C. Cervello-Pastor. A new contentionless dynamic routing protocol for OBS using wavelength occupation knowledge. In Proceedings of the 12th IEEE Mediterranean Electrotechnical Conference, Vol. 2, pages 519-522, 2004.
[17] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. 1998.
[18] T. C. K. Hui and C. K. Tham. Adaptive Provisioning of Differentiated Services Networks based on Reinforcement Learning. IEEE Transactions on Systems, Man and Cybernetics, 33(4):492-501, 2003.
[19] J. A. Boyan and M. L. Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. In Advances in Neural Information Processing Systems, Vol. 6, pages 671-678, 1994.
[20] H. Ferra, et al. Applying Reinforcement Learning to Packet Scheduling in Routers. In Proceedings of the Fifteenth Innovative Applications of Artificial Intelligence Conference, pages 79-84, 2003.
[21] S. Kumar and R. Miikkulainen. Confidence-based Q-routing: an on-queue adaptive routing algorithm. In Proceedings of Neural Networks in Engineering, 1998.
[22] S. Kumar and R. Miikkulainen. Dual reinforcement Q-routing: an on-queue adaptive routing algorithm. In Proceedings of Neural Networks in Engineering, 1997.
[23] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
[24] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement Learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[25] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3):229-256, 1992.
[26] S. Y. Wang, et al. The design and implementation of the NCTUns 1.0 network simulator. Computer Networks, 42(2):175-197, 2003.
[27] S. Kim, N. Kim, and M. Kang. Contention resolution for optical burst switching networks using alternative routing. In Proceedings of IEEE ICC, Vol. 5, pages 2678-2681, 2002.