Proceedings of the
4th International Workshop on Dependable Embedded Systems
October 9, 2007, Beijing, China
http://wdes07.di.fc.ul.pt
in conjunction with the
26th Symposium on Reliable Distributed Systems
Foreword

After the success of previous editions, the International Workshop on Dependable Embedded Systems (WDES 2007) is now going into its fourth edition, to be held on October 9, 2007 in Beijing, China. In this workshop we wish to bring together researchers and practitioners to share research results, practical experiences and advances in (or impediments to) the application of embedded systems for dependable systems. We encourage participation by professionals with diverse backgrounds who can contribute to advancing the technology, reflect the latest trends, and foster discussion of their implications. The aim of the workshop is to provide a forum with interesting discussions and debates. Authors will prepare the final versions of their papers after the event to reflect the discussions at the workshop.
Scope and Topics

Today, nearly every processor is deployed as an integral part of a daily-life artifact. Embedded computing systems can be found performing more or less critical functions, in application domains ranging from mass-consumer entertainment gadgets to mission-critical vehicular, industrial automation or health applications. Furthermore, the availability of wireless and low-power technologies creates opportunities to make some of these applications mobile and distributed, cooperating with other embedded systems and forming what may be called systems of embedded systems. We solicited position papers, research contributions and experience reports addressing issues related to the design, analysis, validation and implementation of dependable distributed embedded systems and systems of embedded systems. Topics of particular interest include:
• Self-configuring distributed embedded systems
• Dependable communication in open wireless networks
• Architectures for dependable distributed applications
• Security of safety-critical nodes with connectivity to open networks
• Achieving dependability through adaptation and QoS assurance
• Formal verification of embedded systems
• Low-power embedded systems
• Dependable embedded applications
• Case studies of dependable embedded systems
Co-Chairs
• António Casimiro (University of Lisboa, Portugal)
• Xavier Défago (JAIST, Japan)
Program Committee
• Emmanuelle Anceaume (IRISA, France)
• Leandro Buss Becker (UFSC, Brazil)
• Felicita Di Giandomenico (CNR-ISTI, Italy)
• Shlomi Dolev (Ben Gurion Univ., Israel)
• Joaquim Ferreira (EST-IPCB, Portugal)
• Yasushi Hibino (JAIST, Japan)
• Gábor Huszerl (BME, Hungary)
• Nobuyasu Kanekawa (Hitachi Ltd., Japan)
• Johan Karlsson (Chalmers Univ., Sweden)
• Raimund Kirner (TU Vienna, Austria)
• Phil Koopman (CMU, USA)
• Xiaodong Lu (TITech., Japan)
• Tatsuo Nakajima (Waseda Univ., Japan)
Table of Contents

Keynote: Reliable Broadcast Communication in Mobile Ad-hoc Networks
  R. Baldoni ............................................................... 1
Talk 1: Dependability-Performance Trade-off on Multiple Clustered Core Processors
  T. Funaki, T. Soto ....................................................... 3
Talk 2: Flexible Bus Media Redundancy
  V. F. Silva, J. Ferreira, J. A. Fonseca .................................. 9
Talk 3: Taking Advantage of Within-Die Delay-Variation to Reduce Cache Leakage Power Using Additional Cache-Ways
  M. Goudarzi, T. Matsumura, T. Ishihara .................................. 15
Talk 4: Optimizing Byzantine Consensus for Fault-Tolerant Embedded Systems with Ad-Hoc and Infrastructure Networks
  H. P. Reiser, A. Casimiro ............................................... 21
Talk 5: Fault-tolerant collision prevention for cooperative autonomous mobile robots with asynchronous communications
  R. Yared ................................................................ 27
Reliable Broadcast Communication in Mobile Ad-hoc Networks Roberto Baldoni
Abstract

A fundamental issue of distributed computing consists in finding concepts and mechanisms that are general and powerful enough to allow reducing (or even eliminating) the underlying uncertainty. This uncertainty is created by asynchrony, failures, unstable behaviors, non-monotonicity, system dynamism, mobility, low computing capability, scalability requirements, etc. Mastering one form or another of uncertainty is pervasive in all distributed computing problems. This talk focuses on how the speed of nodes creates uncertainty in distributed systems with mobile nodes, in terms of problem cost and problem solvability. In particular, the talk will address the specific problem of geocasting and show how node speed impacts both the solvability and the cost of geocasting. For the one-dimensional case of the mobile ad-hoc network, we provide an algorithm for geocasting and prove its correctness given exact bounds on the speed of movement. This analysis formally verifies the intuition that the faster nodes move, the more costly it is to solve geocasting. Interestingly, the set of steps we followed for analyzing geocasting (i.e., the model, the way the solvability problem was tackled, and how the trade-off bounds on the cost of solvability were established) can serve as a general canvas for analyzing the uncertainty that node speed introduces into other distributed computing problems running on top of a mobile setting.
Dependability-Performance Trade-off on Multiple Clustered Core Processors
T. Funaki, T. Soto

[Pages 3-8: the body of this paper, including its figures and tables, is not recoverable from the extracted text.]
Flexible Bus Media Redundancy Valter Filipe Silva ESTGA-University of Aveiro
[email protected]
Joaquim Ferreira EST - Polytechnic Institute of Castelo Branco
[email protected]
José Alberto Fonseca DETI - University of Aveiro
[email protected]
Abstract

This paper proposes a flexible approach to bus media redundancy in Controller Area Network (CAN) fieldbuses, either to improve the bandwidth by transmitting different traffic in different channels or to promote redundancy by transmitting the same message in more than one channel. Specifically, the proposed solution is discussed in the context of the Flexible Time-Triggered protocol over CAN (FTT-CAN) and inherits the online scheduling flexibility of FTT-CAN, enabling on-the-fly modifications of the traffic conveyed in the replicated buses. Flexible bus media redundancy is useful to fulfill application requirements in terms of additional bandwidth or to react to bus failures by leading the system to a degraded operational mode, without compromising safety. The arguments for and against flexible bus media redundancy in the context of FTT-CAN are also discussed in detail.

1 FTT-CAN With Multiple Buses Basis

FTT-CAN (Flexible Time-Triggered communication protocol on CAN) [1] has been developed with the main purpose of combining a high level of operational flexibility with timeliness guarantees. It uses the dual-phase elementary cycle concept to isolate time-triggered and event-triggered communication. The time-triggered traffic is scheduled online in a particular node called a master, facilitating online admission control of requests; it is thus managed in a flexible way, under guaranteed timeliness. The protocol relies on a relaxed master-slave medium access control in which the same master message triggers the transmission of messages in several slaves simultaneously (master/multi-slave). Eventual collisions between slave messages are handled by the native distributed arbitration of CAN.
FTT-CAN slots the bus time into consecutive Elementary Cycles (ECs) with fixed duration. All nodes are synchronized at the start of each EC by the reception of a particular message known as the EC Trigger Message (TM), which is sent by the master node. Within each EC the protocol defines two consecutive windows, asynchronous (law in Figure 1 stands for length of the asynchronous window) and synchronous (lsw in Figure 1 stands for length of the synchronous window), that correspond to two separate phases (see Figure 1). The first is used to convey event-triggered traffic (AM in Figure 1 stands for Asynchronous Messages) and the second is used to convey time-triggered traffic (SM in Figure 1 stands for Synchronous Messages). Between these two windows there is a guard time to guarantee temporal isolation (α in Figure 1). The synchronous window of the nth EC has a duration that is set according to the traffic scheduled for it. The schedule for each EC is conveyed by the respective EC trigger message (see Figure 2). Since this window is placed at the end of the EC, its starting instant is variable and it is also encoded in the respective EC trigger message.

Figure 1. The Elementary Cycle

The communication requirements are held in a database located in the master node [1], the System Requirements Database (SRDB). This database holds several components, one of which is the Synchronous Requirements Table (SRT), which contains the description of the periodic message streams. Based on the SRT, an online scheduler builds the synchronous schedule for each EC. These schedules are then inserted in the data area of the appropriate trigger message (see Figure 2) and broadcast with it. Due to the online nature of the scheduling function, changes performed in the SRT at run time will be reflected in the bus traffic within a bounded delay, resulting in a flexible behavior.
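The EC layout just described (Trigger Message, asynchronous window, guard time, and a synchronous window placed at the end of the EC whose start instant is encoded in the TM) can be sketched with a short back-of-the-envelope computation. All function names and the microsecond figures below are illustrative, not taken from the paper:

```python
# Back-of-the-envelope sketch of the FTT-CAN Elementary Cycle layout.
# All names and timing figures are invented for the example.

def ec_layout(lec_us, tm_us, guard_us, sync_msgs_us):
    """Split one Elementary Cycle into TM, asynchronous window (law),
    guard time (alpha) and synchronous window (lsw).

    The synchronous window sits at the end of the EC, so its length --
    and hence its variable starting instant -- follows from the traffic
    actually scheduled for this EC and encoded in the TM.
    """
    lsw = sum(sync_msgs_us)                 # duration of the scheduled sync traffic
    sync_start = lec_us - lsw               # variable start instant, encoded in the TM
    law = sync_start - tm_us - guard_us     # what is left for event-triggered traffic
    assert law >= 0, "EC overloaded: schedule does not fit"
    return {"law": law, "lsw": lsw, "sync_start": sync_start}

# Example: a 10 ms EC carrying three synchronous messages.
layout = ec_layout(lec_us=10_000, tm_us=500, guard_us=100,
                   sync_msgs_us=[1_000, 1_500, 800])
print(layout)   # the synchronous window occupies the last 3.3 ms of the EC
```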
One recent improvement to FTT-CAN is the use of the master to control more than one bus in the system [18][19]. Using more than one CAN bus improves both the fault tolerance of the system and the available bandwidth, since messages can be transmitted on different buses. This solution provides additional bandwidth and overcomes the single point of failure of a non-replicated CAN bus [18]. In this way, multiple buses can be used either to improve the bandwidth by transmitting different traffic in different channels or to promote redundancy by transmitting the same message in more than one channel. This architecture (see Figure 3) inherits the dispatching flexibility of FTT-CAN, enabling online changes to the traffic conveyed in the channels. This is useful to fulfill application requirements in terms of additional bandwidth or to react to bus failures by leading the system to a degraded operational state, without compromising safety.

Figure 2. master/multi-slave access control and EC schedule coding scheme

Figure 3. FTT-CAN using multiple buses

Notice that slaves can be connected to just one CAN bus or to a set of buses, depending on the tasks that a specific slave has to perform, on the dependability level and on the bandwidth requirements. Similarly to the single-bus case, all the buses must convey a synchronized Trigger Message, with the same Elementary Cycle in all of them. That is, the Trigger Messages are issued on all the buses at the same time, dividing the bus time in all buses in the same way. Figure 4 presents an example with two buses, in which synchronous message 1 is replicated in both buses, improving its dependability.

Figure 4. Bus timing with two buses

In contrast, synchronous messages 2 and 3 do not require redundancy and, thus, are not transmitted in both buses. As can be seen in Figure 3, the proposed system also includes replicated masters, adopting a leader-follower behavior. The system has only a single active master at each time; all the others are backup masters. In case of an error in the active master, one backup master becomes active, since the previous active master is stopped (fail silent). The master nodes are located at the ends of the buses, and the number of backup masters at one end of the buses equals the number of backup masters at the opposite end. This facilitates bus error detection, since a Trigger Message omission can be easily detected by the master located at the opposite end of the bus. In this way, if a Trigger Message is omitted, the backup master located at the opposite end of the bus will inform the active master of the error. The active master, if not crashed, can then re-schedule the traffic to the non-faulty buses.

2 Pros and Cons of Flexible Bus Media Redundancy

This section presents the arguments for and against flexible bus media redundancy in the context of FTT-CAN. A multiple-bus FTT-CAN architecture inherits most of the good properties of FTT-CAN, and adds some others, namely:
• Increased bandwidth
• Increased resilience to bus failures
• Increased flexibility
• Scalability of replicated buses
• Master replication is still feasible

Despite these advantages, there are also some drawbacks and limitations:
• Increased complexity and price of the master node
• Increased complexity of the slave nodes, in some cases
• Inflexibility in terms of spatial location of the master nodes
• The overall architecture complexity is higher
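The leader-follower error handling described above — a backup master at the opposite bus end reporting a Trigger Message omission, and the active master moving the affected traffic to the surviving buses — can be sketched as follows. The data structures and function names are illustrative, not the paper's implementation:

```python
# Minimal sketch of FTT-CAN multi-bus error handling (illustrative names):
# a backup master flags buses whose TM it did not observe, and the active
# master re-schedules each message stream off the faulty buses.

def detect_faulty_buses(tm_seen_per_bus):
    """Backup master's view: a bus whose TM was not observed is suspect."""
    return {bus for bus, seen in tm_seen_per_bus.items() if not seen}

def reschedule(assignment, faulty, healthy):
    """Active master: reassign each message stream away from faulty buses."""
    new_assignment = {}
    for msg, buses in assignment.items():
        kept = [b for b in buses if b not in faulty]
        # a stream that lost all of its buses falls back to a healthy bus
        new_assignment[msg] = kept if kept else [healthy[0]]
    return new_assignment

buses = ["bus0", "bus1"]
assignment = {"sm1": ["bus0", "bus1"],   # replicated (critical) stream
              "sm2": ["bus0"],           # non-replicated streams
              "sm3": ["bus1"]}
faulty = detect_faulty_buses({"bus0": False, "bus1": True})
healthy = [b for b in buses if b not in faulty]
assignment = reschedule(assignment, faulty, healthy)
print(assignment)   # all traffic now conveyed on bus1
```

Note the degraded mode this produces: the replicated stream sm1 loses its redundancy but keeps being delivered, which is exactly the "degraded operational state without compromising safety" argued for above.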
These aspects are discussed in detail in the following sections.

2.1 Increased bandwidth

The widespread use of CAN networks in applications with increasingly higher bandwidth requirements calls for innovative solutions able to provide some extra bandwidth. For example, in the automotive domain, infotainment data is currently not conveyed on the CAN network due to bandwidth restrictions; MOST [4] or FlexRay [3] are used instead. Some CAN-based networks use more than one CAN bus but do not improve the overall available bandwidth; examples of such networks are TTCAN [12] networks and also the ones based on the Columbus' egg idea [16].
Since FTT-CAN is based on the CAN protocol, the bandwidth available on a CAN bus cannot be exceeded. The throughput of a CAN bus is affected by the protocol overhead and depends on the bit rate and on the number of stuff bits. The CAN overhead varies from 42% to 88% [14]. The minimum overhead is obtained with a frame of 8 data bytes, and the highest value (88%) is obtained with a frame of 1 data byte and the maximum number of stuff bits. This means that for a bit rate of 1 Mbps, the maximum throughput available for data transmission is about 580 kbps. Moreover, FTT-CAN uses the Trigger Message for control purposes, further increasing the global overhead. This overhead depends on the bit rate, the Elementary Cycle length (LEC in Figure 4) and the number of data bytes of the Trigger Message (proportional to the number of synchronous messages in the system). The number of stuff bits of the Trigger Message also plays an important role in its overhead. The overhead of the Trigger Message varies from 2.7% to 28.4% [14].
Using more than one CAN bus it is possible to partially improve this scenario, since the additional buses can be used to transmit both replicated and non-replicated data. The improvement of the available bandwidth is proportional to the number of replicated buses. In this way, the total available bandwidth in the system could double with two buses, if different message streams are transmitted on each bus.

2.2 Increased resilience to bus failures

CAN already provides some fault tolerance mechanisms. Examples of such mechanisms, at the physical level, are the use of a differential voltage and network operation with just one wire. Some mechanisms at the data link layer are implemented to detect and signal errors. These mechanisms are: Cyclic Redundancy Check (CRC), to account for message corruption; Frame Check, to detect message format violations; Acknowledge errors, to allow a node to detect that it is isolated from the network; Transmission monitoring, to allow distinguishing global errors from errors in the transmitter only; and bit stuffing, to prevent synchronization loss due to several consecutive bits of the same polarity.
In recent years, several solutions adopting star topologies instead of traditional bus topologies have been proposed. Star topologies use one CAN link per node and can isolate (in the case of an active star) any faulty segment of the system. However, the use of star topologies goes against one of the initial design requirements of fieldbuses: reducing the wiring harness [20]. Using an FTT-CAN architecture with more than one bus means that, at the application level, data can be transmitted on more than one bus, improving the resilience to bus failures.

2.3 Increased Flexibility

The two advantages presented before can be combined, i.e., transmitting the same data on different buses can be combined with the transmission of different data on different buses. This results in an important flexibility improvement, since the master node can schedule a specific message transmission to a specific bus or to several buses, depending on the criticality of the message. Notice that the master node holds global and centralized knowledge of the system state, thus it can easily change a message stream from one bus to another. In case of an error on one bus, the master can schedule the messages assigned to the faulty bus to other buses. This is done online without any interference in the service provided. This means that the master has one more degree of flexibility to schedule the data in each Elementary Cycle. Moreover, there is also an improvement in the flexibility of the topology of the architecture, as can be seen in Figure 3, where the slaves can be connected to one bus or to a set of buses. This improves the flexibility of the system and also enables the use of legacy slave nodes.

2.4 Scalability of replicated buses

In FlexRay the same message can be transmitted on all available channels, or on just one channel; i.e., FlexRay also uses multiple buses, in this case only two, both to improve the bandwidth and for fault tolerance. On the other hand, TTP/C allows replicated channels, but it is only possible to use two channels (channels: bus or star) [9]. This is also true for TTCAN, where the same restriction is imposed. Both in TTP/C and in TTCAN, the additional buses cannot be used to improve the bandwidth of the system. In the FTT-CAN architecture with multiple buses, the number of buses is limited only by the number of CAN controllers available at the master nodes. Nowadays it is possible to find microcontrollers with 6 CAN controllers [13]. This gives the proposed architecture important scalability.
The flexibility of adding more buses and using the additional bandwidth in an efficient way, together with this scalability, makes FTT-CAN with multiple buses a unique solution among fieldbuses.
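The 42%–88% overhead range quoted in Section 2.1 can be reproduced from the frame layout of a CAN 2.0A standard-format data frame: 47 framing bits plus 8·n data bits, with at most ⌊(34 + 8·n − 1)/4⌋ stuff bits over the stuffable SOF-to-CRC region (the usual worst-case stuffing bound; these bit counts come from the CAN specification, not from [14] directly):

```python
# Reproducing the CAN protocol overhead range from frame bit counts.
# Standard-format data frame: 47 framing bits + 8*n data bits; at most
# floor((34 + 8*n - 1)/4) stuff bits in the stuffable SOF..CRC region.

def can_overhead(n_data_bytes, worst_case_stuffing):
    data_bits = 8 * n_data_bytes
    frame_bits = 47 + data_bits                       # fixed framing fields
    if worst_case_stuffing:
        frame_bits += (34 + data_bits - 1) // 4       # worst-case stuff bits
    return 1 - data_bits / frame_bits                 # fraction of non-payload bits

best = can_overhead(8, worst_case_stuffing=False)     # 8 data bytes, no stuffing
worst = can_overhead(1, worst_case_stuffing=True)     # 1 data byte, max stuffing
print(f"{best:.0%}  {worst:.0%}")                     # ~42%  ~88%
```

The best case (8 data bytes, no stuff bits) leaves 58% of the bits for payload, which at 1 Mbps gives the roughly 580 kbps data throughput cited in Section 2.1.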
2.5 Master replication is still feasible

The master node is the central point of the FTT-CAN protocol, and its main tasks are the scheduling of the synchronous messages and setting all the bus timing. The master node needs some modifications to deal with multiple buses; these changes are described in detail in [20]. The master node is a single point of failure, so a replication protocol for an FTT-CAN architecture with just one bus has already been presented [6]. The replication of the master node is still possible in the case of multiple buses. In case of a master node failure, the active master is replaced by a backup master without any interruption of the system operation. This replacement is done online, and slave nodes do not notice any change in the global system. In [17] the replication mechanisms for the multiple-bus FTT-CAN architecture are explained in detail. This mechanism can also be used to detect permanent errors on the buses.

2.6 Increased complexity and price of the master node

The master node was changed to accommodate multiple buses [20]. This modification adds some (low) extra complexity to the master node architecture. Specifically, two new modules were added to the master: the bus error detection module and the multi-bus handler. It was also necessary to add extra fields to the table where the synchronous message properties are stored (the SRT - Synchronous Requirements Table) to include the properties related to the allocation of messages to buses. At the implementation level, the increase in the required RAM memory is 23%, since all the message properties are now 13 bytes long. At first glance this value seems high, but it only depends on the number of messages stored in the SRT. For a typical application [10] with 16 synchronous messages, the increase will be only 48 bytes. This value is negligible compared with the available RAM of most microcontrollers. The increase in code size is less than 5% when compared with the single-bus master.
The master node must have more than one CAN controller, either built-in or external. In the first case the microcontroller will be more complex and thus its price will increase. In the second case, one needs to add external CAN controllers, which will also increase the price of the master node. Moreover, each bus must have a bus driver, such as [15] or [11], which will also increase the cost of the node. Notice, however, that more powerful microcontrollers with more features (possibly additional CAN controllers) are expected to appear on the market, so the pricing impact will tend to be lower.
Concerning power consumption, it will increase with the complexity: nodes will have a higher computational load [20] and also more hardware components. This issue is not negligible for battery-powered FTT-CAN applications, e.g. [21].

2.7 Increased complexity of the slave nodes

The slave nodes used in the FTT-CAN architecture with multiple buses can be the same as those used in FTT-CAN with just one CAN bus. These legacy nodes can only be connected to a single bus. In contrast, if the slaves are connected to more than one bus, the necessary software and hardware adjustments must be implemented. This software only needs to be able to receive and transmit messages through the additional buses, so slave nodes only need the software drivers for the additional buses and some adjustments to the FTT code.

2.8 Inflexibility in terms of spatial location of the master nodes

In FTT-CAN with just one CAN bus, the master node and its replicas can be located in any part of the bus. However, in FTT-CAN with more than one CAN bus, a master node and its replica must be located at both ends of the buses to provide effective bus error detection [20]. This results in less flexibility in terms of the spatial location of the master nodes.

2.9 The overall architecture complexity is higher

When a system becomes more complex, the probability of an error increases due to the increasing number of components (hardware and software) [5]. In FTT-CAN with multiple buses, the number of hardware components is higher and the complexity of the software, measured in lines of code, is also higher. The higher probability of errors is a price to pay for more flexibility in the system. For this reason, FTT-CAN with multiple buses incorporates mechanisms to detect permanent errors on the buses. In the literature there are examples of dependability assessment based on modeling [5][8][7]. This dependability analysis can be done using modeling tools such as Möbius [2]. We plan to use this tool to assess whether the dependability of FTT-CAN with multiple buses is kept at an acceptable level.

3 Conclusions

This paper presented a multiple-bus FTT-CAN architecture and discussed its pros and cons when compared with the single-bus FTT-CAN architecture. Flexibility, which is one of the cornerstones of the Flexible Time-Triggered paradigm, is not compromised by the multiple-bus FTT-CAN architecture. In fact, it is improved, since additional buses can be used to improve the bandwidth or to transmit the same data on different buses. Other advantages are that the number of buses depends only on hardware resources and that master node replication is still feasible. However, some drawbacks of using multiple buses arise. The complexity of the nodes (master and slaves) increases, but in a limited way. The architecture also imposes that master nodes be located at both ends of the buses. Moreover, a dependability analysis using a modeling tool needs to be performed in order to evaluate the new architecture.
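The RAM figures in Section 2.6 follow directly from the entry sizes: with multi-bus entries of 13 bytes and a 23% RAM increase, the single-bus entry size works out to 10 bytes, i.e. 3 new bytes per message and 48 extra bytes for a 16-message SRT. The sketch below reproduces this arithmetic; the 10-byte baseline is inferred, since the paper does not give the original field layout:

```python
# Reproducing the SRT memory arithmetic of Section 2.6.
# NEW_ENTRY_BYTES is stated in the paper; OLD_ENTRY_BYTES is inferred
# from the quoted 23% increase (3 of the 13 bytes per entry are new).
OLD_ENTRY_BYTES = 10        # assumed single-bus per-message entry size
NEW_ENTRY_BYTES = 13        # multi-bus entry, incl. bus-allocation fields

def srt_ram_increase(n_messages):
    """Extra RAM (bytes) needed by the multi-bus SRT for n synchronous messages."""
    return n_messages * (NEW_ENTRY_BYTES - OLD_ENTRY_BYTES)

extra = srt_ram_increase(16)                     # typical application [10]
share_new = (NEW_ENTRY_BYTES - OLD_ENTRY_BYTES) / NEW_ENTRY_BYTES
print(extra, f"{share_new:.0%}")                 # 48 extra bytes; 23% of each entry is new
```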
Acknowledgments

This work was supported by Fundação para a Ciência e Tecnologia under grant PRODEP 2001 - Formação Avançada de Docentes do Ensino Superior Nº 200.019 and by ARTIST2, NoE on Embedded Systems Design (EC-IST - IST-004527).

References

[1] L. Almeida, P. Pedreiras, and J. A. Fonseca. The FTT-CAN Protocol: Why and How. IEEE Transactions on Industrial Electronics, 49(6):1189–1201, December 2002.
[2] G. Clark, T. Courtney, D. Daly, D. Deavours, S. Derisavi, J. M. Doyle, W. H. Sanders, and P. Webster. The Möbius modeling tool. In PNPM '01: Proceedings of the 9th International Workshop on Petri Nets and Performance Models (PNPM'01), pages 241–250, Washington, DC, USA, 2001. IEEE Computer Society.
[3] FlexRay Consortium. FlexRay Communications System - Protocol Specification, v2.0. Technical report, FlexRay Consortium, 2004.
[4] MOST Cooperation. Media Oriented System Transport, Multimedia and Control Networking Technology, November 2002.
[5] F. Di Giandomenico, S. Porcarelli, D. Viva, A. Bondavalli, and P. Lollini. Model-based Evaluation for Dependability Assessment of CAUTION++ Instances. In Venue '04 (informal proceedings), Athens, Greece, May 27-28 2004.
[6] J. Ferreira, P. Pedreiras, L. Almeida, and J. Fonseca. The FTT-CAN protocol: improving flexibility in safety-critical systems. IEEE Micro (special issue on Critical Embedded Automotive Networks), 22(4):46–55, 2002.
[7] Q. Gan and B. Helvik. Dependability modelling and analysis of networks as taking routing and traffic into account. In Proceedings of the 2nd Conference on Next Generation Internet Design and Engineering, April 2006.
[8] R. Ghostine, J.-M. Thiriet, and J.-F. Aubry. Dependability evaluation of networked control systems under transmission faults. In 6th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Safeprocess 2006, Beijing, China, 2006.
[9] H. Kopetz and G. Bauer. The Time-Triggered Architecture. In Proceedings of the IEEE, volume 91, January 2003.
[10] R. Marau, L. Almeida, J. Fonseca, J. Ferreira, and V. Silva. Assessment of FTT-CAN master replication mechanisms for safety-critical applications. In Proceedings of the SAE 2006 World Congress & Exhibition, 2006. Paper Number: 06AE-278.
[11] Microchip. MCP2551 Data Sheet, 2003. DS21667D version.
[12] B. Müller, T. Führer, F. Hartwich, R. Hugel, and H. Weiler. Fault tolerant TTCAN networks. In Proceedings of the 8th International CAN Conference. CAN in Automation GmbH, October 2002.
[13] NEC Electronics Corporation. µPD70F3430 Data Sheet, November 2005.
[14] T. Nolte, H. Hansson, C. Norström, and S. Punnekkat. Using bit-stuffing distributions in CAN analysis. In I. Bate, editor, Proceedings of the IEEE/IEE Real-Time Embedded Systems Workshop, in conjunction with the 22nd IEEE Real-Time Systems Symposium (RTSS'01), London, UK, December 2001. Department of Computer Science, University of York.
[15] Philips Semiconductors. PCA82C250 Data Sheet, January 2000.
[16] J. Rufino, P. Veríssimo, and G. Arroz. A Columbus' egg idea for CAN media redundancy. In Digest of Papers, The 29th International Symposium on Fault-Tolerant Computing Systems, pages 286–293, Madison, Wisconsin, USA, June 1999. IEEE.
[17] V. Silva, J. Ferreira, and J. Fonseca. Master Replication and Bus Error Detection in FTT-CAN with Multiple Buses. In Proceedings of the 12th IEEE Conference on Emerging Technologies and Factory Automation (ETFA 2007), Patras, Greece, 2007.
[18] V. Silva and J. Fonseca. Using FTT-CAN to Combine Redundancy with Increased Bandwidth. In Proceedings of the 2006 IEEE International Workshop on Factory Communication Systems, pages 54–62, June 2006.
[19] V. Silva, J. Fonseca, and J. Ferreira. Using FTT-CAN to the Flexible Control of Bus Redundancy and Bandwidth Usage. In Proceedings of the 11th International CAN Conference (iCC 2006), pages 5.9–5.15, Sweden, September 2006.
[20] V. Silva, J. Fonseca, and J. Ferreira. Adapting the FTT-CAN Master for Multiple-bus Operation. In Proceedings of the 5th IEEE International Conference on Industrial Informatics, Vienna, Austria, July 2007.
[21] V. Silva, R. Marau, L. Almeida, J. Ferreira, M. Calha, P. Pedreiras, and J. Fonseca. Implementing a distributed sensing and actuation system: The CAMBADA robots case study. In Proceedings of the 10th IEEE Conference on Emerging Technologies and Factory Automation (ETFA 2005), volume 2, pages 781–788, September 2005.
Taking Advantage of Within-Die Delay-Variation to Reduce Cache Leakage Power Using Additional Cache-Ways

Maziar Goudarzi†, Tadayuki Matsumura‡, Tohru Ishihara†
†System LSI Research Center, ‡Graduate School of Information Science and Electrical Engineering,
Kyushu University, Fukuoka, Japan
{goudarzi, ishihara}@slrc.kyushu-u.ac.jp
Abstract

Leakage power, especially in cache memories, is dominating the total power consumption of processor-based embedded systems. By choosing a higher threshold voltage, SRAM leakage can be exponentially reduced in return for lower speed. Since SRAM cells in the same cache have different delays in nanometer technologies due to within-die process variation, not all of the cells violate the cache delay. However, since timing-violating cells are randomly distributed over the cache, row/column redundancies are inefficient. We propose to add extra cache-way(s) to replace slow cache-lines separately in each cache-set. In a commercial 90nm process, our technique can reduce leakage power by up to 54%, which, depending on the share of leakage in total cache power, translates to 25.36% and 53.37% reductions of total energy in the L1 and L2 caches respectively, by adding two spare ways to a 4-way set-associative cache with no performance penalty.
1 Introduction
The share of leakage in the total power consumption of cache memories increases with every new technology node. In addition, leakage increases exponentially with temperature. The naïve solution to exponentially reduce leakage power is to use a higher threshold voltage (Vth) and/or gate-oxide thickness (Tox), but this slows down the SRAM cells. Another effect, which is increasingly pronounced in sub-90nm processes, is random within-die variation in SRAM delay; these variations generally follow a Gaussian distribution [1]. In the absence of such variations, all SRAM cells have the same delay, which also defines the cache delay. But in the presence of random within-die delay variation, the cache delay has to be farther from the mean delay of the SRAM cells so as to obtain a reasonable timing-yield for the cache-containing chip. In such a case, at higher Vth and/or Tox, only a subset of the cells (which are randomly distributed over the cache) violate the target cache delay; we use extra cache-ways to compensate for them. Several previous works address improving cache-memory timing-yield in the presence of process variation by proposing process-tolerant cache architectures [1], [2] and code-placement compiler techniques [3], but they actually reduce the useful capacity of the cache by marking and avoiding too-slow cache lines. Although [3] provides a solution to mitigate the performance impact, it demands a different binary executable per chip. Several other works address cache power reduction by reducing dynamic power [4], [5] or static power [6], [7], but these do not consider process-variation effects and the corresponding yield loss.
In this paper, we propose an optimization technique for cache design that is applied at design and manufacturing time of the cache-containing chip and reduces total power, by significantly reducing the leakage component, at the cost of extra chip area for additional cache ways. We propose (i) to keep VDD untouched (so as not to impact dynamic power), (ii) to use a higher Vth and gate-oxide thickness Tox (so as to exponentially reduce subthreshold leakage as well as gate leakage), and finally (iii) to add a few extra cache ways to compensate for the delay-violating cache-lines (so as to keep the original performance). We choose the number of extra cache ways and the values for Vth and Tox such that leakage power is minimized while the cache capacity, speed, and timing-yield are all kept unchanged. A major issue is the random spreading of delay-violating SRAM cells over the cache, which makes row/column redundancy inefficient. Our technique addresses this problem since we replace slow cache lines separately in each cache-set. The rest of this paper is organized as follows. Section 2 reviews related works and presents our approach. In Section 3, the optimization problem is formulated and an algorithm is provided for it. Section 4 provides the experimental results, and finally Section 5 summarizes and concludes the paper.
2 Related Works and our Approach

2.1 Related Works
Turning off unused parts of the cache [9][6] or putting them in a low-energy "drowsy" mode using two different supply voltages [7] reduces leakage, but these techniques require a separate supply (VDD) for each cache-line, or ground lines to cut off or reduce the supply voltage, which have proven expensive. Our technique does not need any change in the cache-line design; it only replicates cache-ways and chooses manufacturing-time Vth and Tox options available in most fabrication processes today. Reverse body biasing [10] or increasing Vth at manufacturing time can effectively reduce leakage, but this increases cell delay and results in lower performance and/or reduced timing-yield. Forward body biasing during active mode [11] and dynamic Vth control [12] can improve delay, but they naturally increase leakage in the active mode. This delay impact can be compensated by increasing VDD in line with the increase in Vth, but this quadratically increases dynamic power [8]. We keep the original VDD although we increase Vth; instead, we add extra cache ways to compensate for the cache ways that violate the target delay due to the increased Vth. Using spares to repair manufacturing defects [13] is not new, but to the best of our knowledge it has not been used to reduce leakage in the past.

A conventional technique to reduce leakage power in memories without performance/timing-yield penalty is supply-voltage scaling along with Vth scaling. In this approach, the VDD of the cache is increased after the raise in Vth for leakage reduction, so that SRAM cell speed and the cache timing-yield are restored to their original values. The disadvantage, however, is the quadratic increase in dynamic power. To mitigate this disadvantage, various techniques exist that reduce the dynamic power of the cache [5]. A well-known technique applicable to instruction caches [14][15] substantially reduces dynamic power in set-associative instruction caches: since instructions are mostly executed sequentially, and several instructions usually reside in the same cache-line, tag comparisons can be eliminated and only one cache way needs to be activated, except when the last executed instruction either was a branch or resided at the end of a cache-line. We call this technique Inter-Line Way Memorization (ILWM) and use it in our experiments in Section 4.

2.2 Motivational Example

Figure 1-a shows a sample 2-way set-associative cache with 4 sets. After uniformly increasing the Vth and/or Tox of the cache SRAM cells to save leakage, the delay of all cells increases; however, since their individual delays were not the same to begin with (due to within-die delay variation), only some of them now violate the original cache delay (the corresponding cache lines are painted red in Figure 1-b). To compensate for such delay-violating cache-lines, an additional cache way is added so that the same 2 delay-meeting cache-lines are still available in all sets.

Figure 1. a) Original cache, b) after applying our technique

2.3 Analytical Example

We define the following notations:
μd: the original mean delay of SRAM cells.
σd: the original standard deviation of the delay of SRAM cells.
D: the target delay of the cache.
N: the number of additional cache ways.
Y: the original timing-yield of the cache.
Ycell: the timing-yield of a single SRAM cell.
Yset,N: the timing-yield of a cache-set with N additional ways; the cache-set is still fault-free if at most N ways violate the timing.

Assume that a 4-way set-associative cache with 128 bits per cache-line and 256 cache-sets is to be used in an embedded system. Due to process variation, different SRAM cells of the cache will have different delays even in the same chip; this distribution is believed to follow a Gaussian distribution [1]. The probability, Ycell, that a single SRAM cell meets the target delay, D, is thus given by equation (1) below:

Ycell = Pr[x ≤ D] = ∫_{−∞}^{D} f(x) dx,   f(x) = (1 / (σd √(2π))) · e^{−(x − μd)² / (2σd²)}   (1)

where f(x) is the Probability Density Function (PDF) of the Gaussian distribution, and μd and σd are respectively the mean and standard deviation of the delay distribution. The probability of all cells in a cache-line meeting the target delay is Yline = (Ycell)^128, and for a complete cache-set the corresponding probability is Yset,0 = (Yline)^4 since this is a 4-way cache. Now, if we add one extra way to the cache, each set works at least as well as before if at least 4 of the available 5 ways meet the target delay. Thus:

Yset,1 = (Yline)^5 + (Yline)^4 × (1 − Yline) × 5   (2)

Similarly, if there are two extra cache-ways, the probability Yset,2 that the cache is still as good as before is even higher and is given by the following formula:

Yset,2 = (Yline)^6 + (Yline)^5 × (1 − Yline) × 6 + (Yline)^4 × (1 − Yline)² × 15   (3)

The timing-yield of the entire cache is then (Yset,N)^256, where N represents the number of extra cache ways. Figure 2 depicts Yset,0, Yset,1, and Yset,2 for 0.999 ≤ Ycell ≤ 1 (for presentational purposes, we depict Yset values instead of the yield of the entire cache in each case). For a given target timing-yield, the caches with extra ways demand a lower Ycell (e.g. in Figure 2, for a 97% target Yset, the original cache requires Ycell to be 99.996%, while with one extra way this reduces to 99.958%, and with two extra ways it further reduces to 99.9%); a lower Ycell corresponds to a higher delay (Figure 3) and lower power (Figure 4); this is detailed below.

Figure 2. Timing-yield diagrams of the ordinary, 1-extra, and 2-extra caches, reflecting the available space for the increase in the mean delay of cells.

Ycell corresponds to the area below the PDF of the delay of an SRAM cell up to the upper bound D (gray area in Figure 3). Thus, a reduced Ycell requirement means the PDF can be shifted to the right by increasing μd while keeping the same D; this is illustrated in Figure 3. For example in Figure 2, for the same timing-yield, the ordinary cache needs D = μd + 3.95σd, while with one additional cache way D = μd + 3.34σd and with two additional ways D = μd + 3.09σd; this means μd can be increased by 0.61σd and 0.86σd respectively if D is to remain constant.

Figure 3. A lower Ycell translates to a higher mean delay of SRAM cells, resulting in lower leakage.

Raising μd can be realized by increasing the Vth and Tox of the transistors comprising the SRAM cells so that leakage (both subthreshold and gate leakage) is effectively reduced. This is shown in Figure 4, which gives SPICE simulation results for a 16KB cache in a commercial 90nm process; the delay and leakage values are obtained for varying values of Vth and Tox. The SPICE models are manufacturer-supplied and correspond to a middle-performance 90nm process technology; the leakage values correspond to a single SRAM cell.

In summary, we propose that by adding extra cache ways, the leakage power of the cache memory can be effectively reduced, by changing the cells' Vth and Tox, without compromising performance, timing-yield, or cache capacity, while also tolerating process variation. We choose the number of extra ways and the Vth and Tox values such that the timing-yield and target delay remain invariant. Adding extra cache ways, however, increases the dynamic power per access, and hence the total power does not necessarily decrease. Obviously, the higher the share of leakage in the total cache power, the more effective this technique is. We provide experimental results in Section 4 that reflect its effectiveness in real-life applications. It is important to note that our technique is enabled by the within-die variation (a higher σd gives more savings). Thus, this is a variation-based leakage reduction technique; it saves leakage by choosing a higher Vth and/or Tox, but the within-die delay variation is essential to restore the cache capacity, speed, and timing-yield afterwards through the additional cache ways.
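The yield computation of equations (1)–(3) can be sketched in a few lines of Python (a sketch; the generalization of equations (2)–(3) to N extra ways is the binomial tail, and the delay figures in the example at the bottom are hypothetical, not taken from the paper):

```python
import math
from math import comb

def y_cell(D, mu_d, sigma_d):
    """Eq. (1): probability that one SRAM cell meets target delay D,
    i.e. the Gaussian CDF of the cell-delay distribution at D."""
    return 0.5 * (1.0 + math.erf((D - mu_d) / (sigma_d * math.sqrt(2.0))))

def y_set(y_line, w, n_extra):
    """Eqs. (2)-(3) generalized: a set with w + n_extra ways is still
    fault-free if at most n_extra ways violate timing (binomial tail)."""
    total = w + n_extra
    return sum(comb(total, k) * (1 - y_line) ** k * y_line ** (total - k)
               for k in range(n_extra + 1))

def cache_yield(D, mu_d, sigma_d, w=4, bits_per_line=128, sets=256, n_extra=0):
    yc = y_cell(D, mu_d, sigma_d)
    y_line = yc ** bits_per_line   # all bits of a line must meet D
    return y_set(y_line, w, n_extra) ** sets

# Hypothetical example, using the sigma offsets read from Figure 2:
mu, sigma = 520.0, 5.0  # assumed cell-delay mean/std-dev in picoseconds
print(cache_yield(mu + 3.95 * sigma, mu, sigma, n_extra=0))
print(cache_yield(mu + 3.34 * sigma, mu, sigma, n_extra=1))
print(cache_yield(mu + 3.09 * sigma, mu, sigma, n_extra=2))
```

For n_extra = 1 and 2 the binomial sum expands exactly to equations (2) and (3) above.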
2.4 Our Approach
Figure 5 shows the outline of our proposed approach. We propose a design- and manufacturing-time optimization that determines the number of extra cache-ways and the manufacturing options of Tox and Vth for the transistors of the cache SRAM cells. The original cache organization (i.e. total size, line-size, and the number of ways) along with the process-technology characteristics (i.e., mean and standard deviation of SRAM cell delay caused by within-die variation, as well as leakage-delay curves of the cells at various Vth and Tox values) are the inputs of the optimization program. The cache organization is modified according to the optimization results, and the manufacturing options of Tox and Vth are handed over to the manufacturer for chip fabrication. The produced chips are then tested offline to detect and mark cache lines containing slow SRAM cells. If the number of such slow cache lines exceeds the number of extra cache ways in any cache-set, the chip is considered faulty and contributes to loss of yield. It would be best to cut off the power line of such slow cache lines to prevent them from leaking, but since this may not be practical, we rely on marking them at boot-time, with their locations read from an agreed-upon non-volatile storage. Marking of slow cache lines can be done, as in [1] and [3], by clearing their valid bit and setting their lock bit. Finally, at runtime the cache works as usual without using the slow lines.
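The boot-time marking step could be sketched as follows (a sketch with data structures of our own; the paper only specifies reading the slow-line locations from non-volatile storage, clearing the valid bit, and setting the lock bit):

```python
from collections import defaultdict

def mark_slow_lines(slow_lines, n_extra):
    """slow_lines: (set_index, way_index) pairs found by offline delay
    testing and read from non-volatile storage at boot.
    Returns per-line mark overrides (valid bit cleared, lock bit set),
    or None if some set has more slow lines than spare ways, in which
    case the chip counts as yield loss."""
    per_set = defaultdict(list)
    for s, w in slow_lines:
        per_set[s].append(w)
    marks = {}
    for s, ways in per_set.items():
        if len(ways) > n_extra:
            return None  # too many slow lines in one set: chip is faulty
        for w in ways:
            # clearing 'valid' keeps the line from ever hitting; setting
            # 'lock' keeps the replacement policy from allocating into it
            marks[(s, w)] = {"valid": 0, "lock": 1}
    return marks

# Example: two spare ways tolerate up to two slow lines per set.
marks = mark_slow_lines([(3, 0), (3, 5), (17, 2)], n_extra=2)
```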
3 Problem Definition and Algorithm
The following notations are added to those in Section 2.3:
w: the original number of ways in the cache.
b: the number of bits per cache line (including tag bits).
Figure 4. Leakage power vs. access-delay of a single SRAM cell when raising Vth and Tox (from left to right) in a 90nm process technology.
Figure 5. Big picture of our proposed approach.
s: the number of cache-sets.
PL: the leakage power of the cache.
Vth: the optimal value for the Vth of the cache transistors.
Tox: the optimal gate-oxide thickness of the cache transistors.

The optimization problem can be formally defined as follows: "For a given process technology (i.e. μd, σd), cache organization (i.e. w, b, and s), and timing-yield (Y), minimize the leakage power of the cache (i.e. PL) by setting Vth, Tox, and N such that the target delay, D, is kept unchanged."

Algorithm. The following algorithm takes the cache organization and process technology as input and provides the best choices of Vth, Tox, and N if our technique can be useful. Otherwise, it returns an empty set indicating failure.

Algorithm 1: OptimizeCacheDesign()
Inputs: (σd, μd: process technology characteristics), (w, b, s: original cache configuration), (Y: target timing-yield of the cache)
Output: set of (N, Vth, Tox) triples.
1 set answers_set = empty_set
2 compute D based on Y, σd, μd.
3 compute PL (leakage power) of the original cache.
4 for N = 1 to w/2 do
4.1 compute the highest μd in the presence of N extra cache ways such that D and Y are kept intact.
4.2 choose the best Vth and Tox corresponding to this new μd.
4.3 compute PL' (leakage power) of the new cache.
4.4 if PL'

else esti ← <⊥, justification>;
(6) RBcast( PRE(esti) );
(7) wait until ((n − f) justified PRE(x) messages have been delivered);
(8) if (#PRE(x1 ≠ ⊥) ≥ f + 1 and #PRE(x2) = 0 for x2 ∉ {x1, ⊥}) then bi ← 1 else bi ← 0;
(9) ci ← BinaryConsensus(bi);
(10) if ci = 0 then return ⊥;
(11) wait until ((f + 1) PRE(<v ≠ ⊥, justification>) messages have been received);
(12) return v;

Function Deliver(INIT(v)) at Coordinator ps
(13) if ((n − f) INIT(v) messages or f + 1 identical INIT(v) messages have been delivered):
(14) if (∃v : #INIT(v) ≥ f + 1) then lv := <v, justification> else lv := <⊥, justification>
(15) RBcast( COORD(lv) );

Figure 2: Infrastructure-assisted consensus algorithm

received from all of its members, and the messages from the justification set prove the validity of the value chosen by the coordinator (i.e., for a value v ≠ ⊥, there is a set of f + 1 identical messages, and for v = ⊥, there is a set of n − f messages without a subset of f + 1 identical messages). If the interaction with ps fails, a node calculates its own justified value. Next, every node broadcasts a PRE message with the value and justification obtained in the step before. In line (7) the justification of each PRE message can be verified as described for the COORD value. Note that a PRE message may not be justified when it is delivered (as a required INIT message might not yet have been received), but it may later become justified through the reception of the missing INIT message. The justification ensures that if all correct nodes propose the same value v1, then no other value v2 ≠ ⊥ may get a justification (as this would require the justification from f + 1 nodes), nor may v2 = ⊥ get a justification (as any n − f nodes contain at least n − 2f correct nodes, i.e., with n ≥ 3f + 1, at least f + 1 correct nodes).
Line (8) ensures that if a node selects b = 1 because of PRE messages for a value x1, no other node selects b = 1 for a different value x2. If one node selects x1, there are at least f + 1 PRE messages for this value, so all other nodes receive at least one of these messages. If all nodes propose the same value, there will be only PRE messages for that value, causing all correct nodes to propose b = 1. We assume the use of a randomized binary consensus such as that of Bracha [5], which guarantees that if all correct nodes start a round with identical values, they decide in that same round. The two fast-termination properties follow directly from this property and from the observation that if either all correct nodes have identical initial values or all of them successfully interact with the infrastructure, they all propose the same value (b = 1) to binary consensus.

The selection of the TIMEOUT value (in line 2) has a direct impact on the efficiency of the algorithm. A TIMEOUT value that is too short will cause the infrastructure interaction to fail; in this case, the algorithm works only with the weaker guarantees of distributed consensus without infrastructure. A TIMEOUT value that is too large will delay consensus execution for a long period of time in case the infrastructure is unavailable or corrupt.
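The line-(8) selection rule can be expressed as a predicate over the delivered PRE values (a sketch; the function name is ours, and ⊥ is modeled as None):

```python
from collections import Counter

def choose_b(pre_values, f):
    """Line (8) as a predicate (sketch): propose b = 1 to binary
    consensus iff exactly one non-bottom value x1 appears among the
    delivered PRE messages and it occurs at least f+1 times; a second
    non-bottom value, or too few occurrences, forces b = 0."""
    counts = Counter(v for v in pre_values if v is not None)
    if len(counts) == 1 and next(iter(counts.values())) >= f + 1:
        return 1
    return 0
```

For f = 1, ['a', 'a', None] yields b = 1, while ['a', 'a', 'b'] yields b = 0 because a second non-bottom value is present.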
4.4 Final remarks
Comparing the three approaches, the integration of infrastructure interaction at the multi-valued consensus level seems the most promising. If the infrastructure is available in this step, all correct nodes will propose the same value to binary consensus, ensuring fast termination. This variant could be combined with infrastructure-assisted binary consensus. The combination ensures fast termination in case the infrastructure is unavailable at the start of the consensus but becomes available later, during the binary consensus phase. In addition, infrastructure interaction at the reliable multicast level would help to reduce the total number of messages, but only if the infrastructure does not show malicious behaviour.
5 Conclusion
In this paper, we have discussed approaches for efficiently solving the consensus problem in distributed embedded systems in realistic environments in which an ad-hoc network and an infrastructure network are simultaneously available. The general idea is that if the infrastructure is not available, the participants execute a randomized consensus algorithm with probabilistic termination guarantees using the ad-hoc network. If the infrastructure is available, the participants take advantage of it and achieve consensus with better termination guarantees. This paper has presented on-going work. The presented ideas still lack an experimental validation, which should provide real data about the relative behaviour of the proposed approaches and of previously published consensus algorithms. An extended version of this document will include a formal correctness proof. An issue to be investigated in the future is the use of distributed access points instead of a central server to support consensus progress. Furthermore, the memory consumption of the protocol, for example for communication buffers, is an important issue in embedded systems. This aspect should be accurately examined, and the worst-case memory consumption of the protocols should be minimized. We strongly believe that tailored consensus solutions will help to construct dependable distributed embedded systems.
References
[1] M. K. Aguilera and S. Toueg. Failure detection and randomization: A hybrid approach to solve consensus. SIAM J. Comput., 28(3):890–903, 1999.
[2] D. Angluin, M. J. Fischer, and H. Jiang. Stabilizing consensus in mobile networks. In DCOSS, pages 37–50, 2006.
[3] N. Badache, M. Hurfin, and R. J. de Araújo Macêdo. Solving the consensus problem in a mobile environment. In IPCCC, pages 29–35. IEEE, 1999.
[4] M. Ben-Or. Another advantage of free choice (extended abstract): Completely asynchronous agreement protocols. In PODC '83, pages 27–30. ACM Press, 1983.
[5] G. Bracha. An asynchronous [(n − 1)/3]-resilient consensus protocol. In PODC '84: Proceedings of the Third Annual ACM Symposium on Principles of Distributed Computing, pages 154–162. ACM Press, 1984.
[6] C. Cachin, K. Kursawe, and V. Shoup. Random oracles in Constantinople: Practical asynchronous Byzantine agreement using cryptography. Journal of Cryptology, 18(3):219–246, July 2005.
[7] M. Correia, N. F. Neves, and P. Veríssimo. From consensus to atomic broadcast: Time-free Byzantine-resistant protocols without signatures. Computer Journal, 41(1):82–96, Jan 2006.
[8] V. Drabkin, R. Friedman, and M. Segal. Efficient Byzantine broadcast in wireless ad-hoc networks. In DSN '05: Proc. of the 2005 Int. Conf. on Dependable Systems and Networks, pages 160–169, 2005.
[9] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. J. ACM, 32(2):374–382, 1985.
[10] A. Mostefaoui, M. Raynal, and F. Tronel. The best of both worlds: A hybrid approach to solve consensus. In DSN '00: Proc. of the 2000 Int. Conf. on Dependable Systems and Networks, pages 513–522, 2000.
[11] H. Seba, N. Badache, and A. Bouabdallah. Solving the consensus problem in a dynamic group: An approach suitable for a mobile environment. In ISCC '02: Proc. of the 7th Int. Symp. on Computers and Communications, page 327. IEEE Computer Society, 2002.
[12] S. C. Wang, W. P. Yang, and C. F. Cheng. Byzantine agreement on mobile ad-hoc network. In 2004 IEEE Int. Conf. on Networking, Sensing and Control, 1:52–57, Mar. 2004.
[13] W. Wu, J. Cao, J. Yang, and M. Raynal. A hierarchical consensus protocol for mobile ad hoc networks. In PDP '06: Proc. of the 14th Euromicro Int. Conf. on Parallel, Distributed, and Network-Based Processing, pages 64–72. IEEE Computer Society, 2006.
Crash-resilient cooperative mobile robots with asynchronous communications∗

Rami Yared
JAIST, School of Information Science
Japan Advanced Institute of Science and Technology
Email: {r-yared}@jaist.ac.jp

Abstract
The paper discusses different approaches that ensure the liveness of the collision prevention system of cooperative mobile robots with asynchronous communications, in the presence of the crash of some robots. We compare different techniques and discuss the trade-off between the strength of the properties of the failure detector and the performance of the collision prevention system in terms of the number of request cancelations. Each robot in the system knows the composition of the group and can communicate with all robots of the group.

1 Introduction

Much research in distributed systems studies problems in which hosts are mobile and their physical location cannot be abstracted away. While most efforts are still aimed at mobile ad hoc networks and sensor networks, there is also a gradual realization that cooperative robotics raises many interesting new challenges with respect to distributed systems, particularly in relation to mobility. Indeed, unlike traditional distributed systems, and even more so than ad hoc or sensor networks, mobility becomes an essential part of the problems to address. Many interesting applications are envisioned that rely on groups of cooperating mobile robots. Tasks may be inherently too complex (or impossible) for a single robot to accomplish, or performance benefits can be gained from using multiple robots [2]. As a simple illustration, consider a distributed system composed of cooperative autonomous mobile robots cultivating a garden. Cultivating a garden requires that mobile robots move in all directions in the garden, sharing the same geographical space. A robot has no prior knowledge of either the paths of other robots or their speeds.

Context and problem We consider that the robots have the ability to communicate wirelessly and that they can query their own position according to a common referential, as given by a positioning system. However, the robots do not have the ability to detect each other's position in the environment, and they are not synchronized. In addition, communication delays are unpredictable, and actual robot motion speed is unknown. A robot relies on its local motion-planning facility to compute a path between its current location and its goal. This path avoids collisions with fixed known obstacles.¹

Problem The robots move in different directions sharing the physical space; thus collisions between mobile robots can occur. It is therefore important to address the problem of preventing collisions between mobile robots, and thus ensure safe motion, such that no two robots ever collide regardless of the robots' tasks. The safety of the system must be guaranteed independently of the timeliness properties of the system, and even in the event of unexpected timing errors in the environment. However, the performance of the system may degrade as the result of badly unstable network characteristics or erratic robot speed. Thus, it is essential to provide a safe motion platform on which mobile robots can rely for their motion. This platform guarantees that no collision between robots can occur, regardless of the timeliness guarantees of the underlying environment.

Contribution We focus on the collision prevention problem in the presence of the crash of some robots. We propose fault-tolerant approaches that enable our collision prevention protocol to handle the crash of some robots and to ensure the liveness of the system of robots in the presence of crashes.

∗ Work supported by MEXT Grant-in-Aid for Young Scientists (A) (Nr. 18680007).
¹ The robots are the only moving entities in the considered applications.

Structure of the paper The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes the system model and terminology. In Section 4, we propose several crash-resilient approaches that enable a collision prevention system to handle the crash of some robots and maintain the liveness of the robotic system. Section 5 concludes the paper.
2 Related work

Martins et al. [7] demonstrated the avoidance of collisions between three cars, elaborated in the CORTEX project. They rely on the coexistence of two networks, as defined in the Timely Computing Base of Veríssimo and Casimiro [13]. One network, the payload network, is asynchronous and carries the information payload of the application. The second network, the control network or wormhole, enforces strict real-time guarantees and is used sparingly by the protocol. Their approach differs from ours in several aspects; the major difference is that our protocol tolerates the crash of robots in the system, while the approach in [7] is not fault-tolerant. Nett et al. [10] presented a protocol for cooperative mobile systems in real-time applications. They considered a traffic control application in which a group of mobile robots shares a specified predetermined space. Communication is done through WiFi (802.11) with a base station. All robots can communicate directly with each other, and the system assumes the existence of a known upper bound on communication delays. Needless to say, the protocol relies on the strict enforcement of timing assumptions, and it does not tolerate the crash of robots. We have recently developed a simpler version [15], a time-free collision prevention platform for a group of asynchronous cooperative mobile robots. The protocol in [15] guarantees that no collision occurs between robots, independently of the timeliness properties of the system, and even in the presence of timing failures in the environment; however, it does not consider the crash of robots. In [14], we have presented a time-free collision prevention protocol that relies on ad hoc communication and supports dynamic groups of mobile robots. In a dynamic group of mobile robots, the total composition of the system, of which robots have only partial knowledge, can change dynamically; the protocol presented in [14] does not consider the crash of robots either.

Clark et al. [4] presented a collision avoidance scheme based on a motion planning framework combining centralized with decentralized motion planning techniques. When robots come within communication range of each other, they dynamically establish a network. Their protocol ensures that, at any time, robots in each network share a common world model by accessing the sensing information of all other robots in the same network. Robots avoid collisions by re-planning their paths. Their approach relies on proper timing of communications and on robot speed. Jager et al. [5] presented a decentralized collision avoidance mechanism based on motion coordination between robots. When the distance between two robots goes below a certain threshold, they exchange information about their respective planned paths and determine whether there is a risk of collision. If a collision is possible, they monitor each other's movements and may change their speed to avoid the collision. The approach is highly dependent on the proper timing of communication and, to some extent, on the proper control of robot speed. Similarly, Azarm et al. [1] presented an online distributed motion planning approach. When a conflict is detected between two robots, they exchange their information and determine their respective priorities. The robot with the highest priority keeps its original path while other robots must re-plan their motion.

The problem of robot collision avoidance has also been handled using sensor-based motion planning methods; the following overview of motion planning strategies is inspired by [8]. Minguez et al. [8] compute collision-free motion for a robot operating in dynamic and unknown scenarios. Motion planning algorithms compute a collision-free path between a robot's location and its destination, integrating sensing directly into motion planning by sensing periodically at a high rate. Some of these approaches (e.g., [9]) apply mathematical equations to the sensory information, and the solutions are transformed into motion commands. Another group of methods (e.g., [12]) computes a set of suitable motion commands and selects one based on a navigation strategy. Finally, other methods (e.g., [8]) compute a high-level description (e.g., entities near obstacles, areas of free space) from the sensory information, and then apply several techniques simplifying the difficulty of navigation to obtain a motion command in complex scenarios. Sensor-based approaches depend on real-time guarantees for processing the sensory information. Furthermore, the information provided by proximity sensors is unreliable and much more limited in range than most wireless network interfaces.
3 System model and terminology

3.1 System model
We consider a system of n mobile robots S = {r1, . . . , rn}, moving in a two-dimensional plane. Each robot has a unique identifier. The total composition of the system is known to each robot. Robots have access to a global positioning machine that, when queried by a robot ri, returns ri's position with a bounded error εgps. The robots communicate using wireless communication such that a robot ri can communicate with all robots of the system. Communication relies on retransmission mechanisms such that communication channels are reliable. The system is asynchronous in the sense that there is no bound on communication delays (neither between the robots nor between a robot and the positioning machine), on processing speed, or on the robots' speed of movement. We consider that a robot can fail by crashing and that a crash is permanent. A correct robot is defined as a robot that never crashes; a faulty robot is a robot that might crash. Each robot ri is provided with a failure detector. We assume that a majority of the robots are correct: the number of faulty robots is f (f < n/2), where n is the total number of robots.
3.2 Failure detectors
There exist several classes of failure detectors, depending on how unreliable the information provided by the failure detector can be. Classes are defined by two properties, called completeness and accuracy. We distinguish four classes of failure detectors: P (perfect), ♦P (eventually perfect), S (strong), and ♦S (eventually strong). The four classes share the same completeness property, but differ in their accuracy property [3].
• STRONG COMPLETENESS: Eventually every faulty process is permanently suspected by all correct processes.
• STRONG ACCURACY: No process is suspected before it crashes. [class P]
• EVENTUAL STRONG ACCURACY: There is a time after which correct processes are not suspected by any correct process. [class ♦P]
• WEAK ACCURACY: Some process is never suspected. [class S]
• EVENTUAL WEAK ACCURACY: There is a time after which some correct process is never suspected by any correct process. [class ♦S]
4 Crash-resilient collision prevention approaches
Crash-free collision prevention
We have presented a crash-free collision prevention protocol in [15]. The idea of the crash-free protocol, explained intuitively, is as follows. It is essentially a mutual exclusion on geographical zones: the protocol is a distributed path reservation system in which a robot must reserve a zone before it moves. Once a robot has reserved a zone, it can move safely inside the zone. All robots run the same protocol. When a robot wants to move along a given chunk of a path, it must reserve the zone that surrounds this chunk; when this zone is reserved, the robot moves along the chunk. Once the robot reaches the end of the chunk, it releases the zone, except for the area that the robot itself occupies. When moving along a path, the robot repeats this procedure for each chunk of the path. The protocol is based on the state machine approach of Lamport [6]. Briefly, each robot maintains a copy of the reservation queue, and a protocol ensures that all requests and releases are delivered in the same sequence. With all replicas starting in the same state, they evolve consistently with no need for further synchronization.
The wait-for graph is a directed acyclic graph that represents the wait-for relations between robots: a node represents a robot, and a directed edge represents a wait-for relation between the corresponding robots.
System liveness in the presence of robot crashes
In the crash-free model, the liveness property of the system is guaranteed, since a robot eventually releases the zone it has reserved. In the presence of crashes of some robots, however, the protocol cannot guarantee the liveness of the system. If a robot rj has crashed while holding a reservation on Zj, then a robot ri that waits for rj starves, because its requested zone Zi intersects Zj, which would remain under the reservation of rj forever. Since the system is asynchronous, it is impossible for robot ri to distinguish whether rj is merely very slow or has actually crashed; therefore, the robot ri that waits for the crashed robot rj is blocked. If, after some period of time, a robot rk requests a zone Zk that intersects Zi or Zj, then rk is also blocked, and so on. This snowball effect may eventually involve all the robots in the system, and thus the whole system will be blocked.
We provide techniques that guarantee the liveness of the system in the presence of crashes of some robots. We present three algorithms based on different classes of failure detectors: the first relies on a perfect failure detector (class P), the second on an eventually strong accurate failure detector (class ♦P), and the third on an eventually weak accurate failure detector (class ♦S).
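Before turning to the individual algorithms, the zone-reservation idea recapped above can be sketched as a replicated reservation queue in which zone intersection defines the wait-for relation. This is a minimal sketch under simplifications of our own (axis-aligned rectangular zones, an in-memory queue standing in for total order broadcast); none of the names below come from the paper:

```python
class ZoneReservation:
    """Sketch of the crash-free path-reservation idea: mutual exclusion
    on geographical zones.  Zones are axis-aligned rectangles
    (xmin, ymin, xmax, ymax).  All names are illustrative."""

    def __init__(self):
        # Totally ordered (robot, zone) requests; every replica holds a copy
        # and applies requests/releases in the same sequence.
        self.queue = []

    @staticmethod
    def intersects(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        return ax1 < bx2 and bx1 < ax2 and ay1 < by2 and by1 < ay2

    def request(self, robot, zone):
        self.queue.append((robot, zone))

    def release(self, robot):
        self.queue = [(r, z) for (r, z) in self.queue if r != robot]

    def granted(self, robot):
        # A robot waits for every earlier requester whose zone intersects
        # its own; it is granted its zone once it waits for no one.
        for i, (r, z) in enumerate(self.queue):
            if r == robot:
                return all(not self.intersects(z, zq)
                           for (_, zq) in self.queue[:i])
        return False
```

With two overlapping requests, the later requester waits until the earlier one releases its zone, which is exactly the wait-for relation used by the failure-detector-based approaches below.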
4.1 Approach with a perfect failure detector (class P)
A perfect failure detector of class P has the strong accuracy property: no process is suspected before it crashes. The intuition of this approach is as follows. Consider a robot ri that waits for a robot rj. If the failure detector FDi of ri suspects that rj has crashed, then ri handles rj as a fixed obstacle and considers Zj released. Hence ri removes the node representing rj, together with its related edges, from the wait-for graph. When ri no longer waits for any robot, ri is granted Zi.
4.2 Approach with an eventually strong accurate failure detector (class ♦P)
A failure detector of class ♦P has the property that eventually correct processes are not suspected by any correct process. The intuition of this approach is as follows. A request of a robot ri is preempted if ri is considered crashed by a majority of robots, and only if ri has not yet been granted Zi. When the request (ri, Zi) is preempted, ri issues a new request for Zi later, if it has not actually crashed. If robot ri is considered crashed by a majority of robots in the system after ri has been granted Zi, then Zi is considered a blocked zone and remains granted to ri. If ri has not actually crashed, then ri eventually releases Zi; if ri has actually crashed, then Zi remains a blocked zone, granted to ri forever. In this preemptive protocol, the suspicion of a robot and the granting of a zone occur using total order broadcast, to ensure that all robots in the system decide consistently that a robot ri is suspected by a majority of the robots, and whether or not ri has been granted Zi after being suspected by that majority. By eventual strong accuracy, correct robots are eventually not suspected by any correct robot, so a correct robot ri is eventually granted Zi, because eventually the request (ri, Zi) is not preempted. Therefore, the liveness of the system is ensured.
4.3 Approach with an eventually weak accurate failure detector (class ♦S)
A failure detector of class ♦S has the property that eventually some correct process is never suspected by any correct process. The intuition of the non-preemptive protocol using a ♦S failure detector is the following. If a robot rj waits directly for a robot ri (because Zj intersects Zi), and rj suspects that ri has crashed, then rj cancels its request (rj, Zj) and requests an alternative zone that does not intersect Zi. In this approach, a robot's request is not preempted when the robot is suspected to have crashed.
4.4 Other approach
We discuss an approach based on leases and synchronized clocks to tolerate the crash of robots and to guarantee the liveness of the system. A lease is an abstraction of a contract whose holder is given a tenure for a limited period of time; the time period of a lease is called the term of the lease. The intuition of this approach is the following. Each robot is provided with a local physical clock, so that a robot ri can observe time using its clock. The values of ri's clock are related to real time and can be read and written by ri. The clocks of the robots have a bounded drift, so clock synchronization is required to achieve and maintain a known bounded drift between the clocks. A lease is given to a robot rj such that the term of the lease equals the time required by rj to move along its requested chunk of path and to release Zj. Robot rj reads its local clock to detect the expiry of its lease, taking into consideration the maximal drift between the clocks. If rj does not complete moving along its requested chunk of path, it stops within a time delay Δ and behaves as if it were a fixed obstacle (this requires bounds on processing and on the movement speed of robots, so that they can stop within a delay Δ). rj then releases Zj and issues a new request to complete moving along its previous chunk of path. When the lease of rj expires according to the local clock of a robot ri, ri waits for a time delay Δ plus the maximal drift between the clocks; after that, ri considers the zone Zj released. This approach is simple; however, it relies on clock synchronization. Deterministic software clock synchronization algorithms require a known bound on communication delays, and existing hardware techniques for clock synchronization require a dedicated network connecting the physical clocks, separate from the network of the application (Shin and Ramanathan [11]).
4.5 Comparison and discussion
The described algorithms guarantee the liveness of the system. There is a trade-off between the strength of the failure detector and the performance of the system in terms of the number of request cancellations. A request cancellation implies that a robot requests an alternative zone, and this may cause inconvenience for a robotic system, particularly when the density of robots is high. A solution using a failure detector of class ♦P generates fewer request cancellations than a solution relying on a ♦S failure detector, since the classes differ in their accuracy property (eventually all correct processes are not suspected versus eventually some correct process is not suspected). A solution relying on a perfect failure detector (class P) is ideal, but a perfect failure detector relies on strong assumptions, requiring the knowledge of a fixed upper bound on communication delays. Likewise, a solution relying on leases with synchronized clocks requires the knowledge of a fixed upper bound on communication delays.
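The lease rule of the clock-based approach reduces to a single timing inequality on the waiting robot's local clock: the zone is treated as released only after the lease term, padded by the stopping delay Δ and the maximal inter-clock drift. A minimal sketch, with parameter names of our own choosing:

```python
def zone_released(now, lease_start, term, delta, max_drift):
    """Sketch of the lease-expiry rule of the clock-based approach.

    A waiting robot ri considers rj's zone released once rj's lease term
    has elapsed on ri's own clock, padded by the stopping delay `delta`
    and the maximal drift between clocks.  Parameter names are ours."""
    return now >= lease_start + term + delta + max_drift
```

The padding is what makes the rule safe: even with a worst-case drifted clock and a robot that stops as late as permitted, rj has already halted and released Zj by the time ri reclaims it.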
5 Conclusion and future directions
We have presented different algorithms and techniques to ensure the liveness of a collision prevention system for cooperative mobile robots with asynchronous communication, in the presence of crashes. We provided several approaches, each of which uses a specific class of failure detector, and discussed the trade-off between the strength of the assumptions and properties of the failure detector and the performance of the collision prevention protocol in terms of the number of request cancellations. We also presented an approach based on leases with synchronized clocks, which is equivalent to a solution based on a perfect failure detector; in both cases, the knowledge of a fixed upper bound on communication delays is required. A quantitative study of the trade-off between the different approaches, using simulations in different scenarios and models, is in progress; in particular, we compare a ♦P-based approach with one based on a ♦S failure detector. In the future, we intend to further investigate and optimize the performance of the system according to different parameters.
References
[1] K. Azarm and G. Schmidt. Conflict-free motion of multiple mobile robots based on decentralized motion planning and negotiation. In Proc. IEEE Int'l Conf. Robotics and Automation (ICRA'97), 1997.
[2] Y. Cao, A. Fukunaga, and A. Kahng. Cooperative mobile robotics: Antecedents and directions. Autonomous Robots, 4(1):7–27, 1997.
[3] T. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, 1996.
[4] C. Clark, S. Rock, and J.-C. Latombe. Motion planning for multiple mobile robots using dynamic networks. In Proc. IEEE Int'l Conf. Robotics and Automation (ICRA'03), Taipei, Taiwan, Sept. 2003.
[5] M. Jager and B. Nebel. Decentralized collision avoidance, deadlock detection, and deadlock resolution for multiple mobile robots. In Proc. IEEE/RSJ Int'l Conf. Intelligent Robots and Systems (IROS'01), 2001.
[6] L. Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.
[7] P. Martins, P. Sousa, A. Casimiro, and P. Veríssimo. A new programming model for dependable adaptive real-time applications. IEEE Distributed Systems Online, 6(5), May 2005.
[8] J. Minguez and L. Montano. Nearness diagram (ND) navigation: Collision avoidance in troublesome scenarios. IEEE Trans. on Robotics and Automation, 20(1):45–59, 2004.
[9] L. Montano and J. Asensio. Real-time robot navigation in unstructured environments using a 3D laser rangefinder. In IEEE/RSJ Conf. on Intelligent Robots and Systems, pages 526–532, 1997.
[10] E. Nett and S. Schemmer. Reliable real-time communication in cooperative mobile applications. IEEE Trans. Computers, 52(2):166–180, 2003.
[11] K. Shin and P. Ramanathan. Transmission delays in hardware clock synchronization. IEEE Trans. Computers, 37(11):1465–1467, Nov. 1988.
[12] R. Simmons. The curvature-velocity method for local obstacle avoidance. In IEEE/RSJ Conf. on Intelligent Robots and Systems, pages 3375–3382, 1996.
[13] P. Veríssimo. Uncertainty and predictability: Can they be reconciled? In Future Directions in Distributed Computing, pages 108–113, 2003.
[14] R. Yared, J. Cartigny, X. Défago, and M. Wiesmann. Locality-preserving distributed path reservation protocol for asynchronous cooperative mobile robots. In 8th IEEE Intl. Symp. on Autonomous Decentralized Systems (ISADS'07), 2007.
[15] R. Yared, X. Défago, and M. Wiesmann. Collision prevention using group communication for asynchronous cooperative mobile robots. In 21st IEEE Intl. Conf. on Advanced Information Networking and Applications (AINA'07), 2007.