Linear Time Byzantine Self-Stabilizing Clock Synchronization

Ariel Daliot1, Danny Dolev1*, and Hanna Parnas2

1 School of Engineering and Computer Science, The Hebrew University of Jerusalem, Israel. {adaliot,dolev}@cs.huji.ac.il
2 Department of Neurobiology and the Otto Loewi Minerva Center for Cellular and Molecular Neurobiology, Institute of Life Science, The Hebrew University of Jerusalem, Israel. [email protected]

Abstract. Awareness of the need for robustness in distributed systems increases as distributed systems become an integral part of day-to-day systems. Tolerating Byzantine faults and possessing self-stabilizing features are sensible and important requirements of distributed systems in general, and of a fundamental task such as clock synchronization in particular. There are efficient solutions for Byzantine non-stabilizing clock synchronization as well as for non-Byzantine self-stabilizing clock synchronization. In contrast, current Byzantine self-stabilizing clock synchronization algorithms have exponential convergence time and are thus impractical. We present a linear time Byzantine self-stabilizing clock synchronization algorithm, which thus makes this task feasible. Our deterministic clock synchronization algorithm is based on the observation that all clock synchronization algorithms require events for re-synchronizing the clock values. These events usually need to happen synchronously at the different nodes. In these solutions this is fulfilled or aided by having the clocks initially close to each other and thus the actual clock values can be used for synchronizing the events. This implies that the clock values cannot differ arbitrarily, which necessarily renders these solutions to be non-stabilizing. Our scheme suggests using a tight pulse synchronization that is uncorrelated to the actual clock values. The synchronized pulses are used as the events for re-synchronizing the clock values.

1 Introduction

Overcoming failures that are not predictable in advance is most suitably addressed by tolerating Byzantine faults. It is the preferred fault model in order to seal off unexpected behavior within limitations on the number of concurrent faults. Most distributed tasks require the number of Byzantine faults, f, to satisfy 3f < n, where n is the network size. See [?] for impossibility results on several consensus-related problems such as clock synchronization. Additionally, it makes sense to require such systems to resume operation after serious unpredictable events without the need for outside intervention or a restart of the system from scratch. For example, systems may occasionally experience short periods in which more than a third of the nodes are faulty, or messages sent by all nodes may be lost for some time. Such transient violations of the basic fault assumptions may leave the system in an arbitrary state from which the protocol is required to resume realizing its task. Typically, Byzantine algorithms do not ensure convergence in such cases. Byzantine algorithms focus merely on preventing Byzantine faults from notably shifting the system state away from the goal, and they sometimes make strong assumptions on the initial state. A self-stabilizing algorithm overcomes this limitation by converging within finite time to a correct state from any initial state. Thus, even if the system loses its consistency due to a transient violation of the basic fault assumptions (e.g. more than a third of the nodes being faulty, network disconnected, etc.), once the system is back within the assumption boundaries the protocol will successfully realize the task, irrespective of the resumed state of the system. For a short survey of self-stabilization see [?] and for an extensive study see [?].

* This research was supported in part by Intel COMM Grant - Internet Network/Transport Layer & QoS Environment (IXA)

The current paper addresses the problem of synchronizing clocks in a distributed system. There are several efficient algorithms for self-stabilizing clock synchronization withstanding crash faults (see [?,?,?] or other variants of the problem [?,?]). There are many efficient classic Byzantine clock synchronization algorithms (for a performance evaluation of clock synchronization algorithms see [?]); however, they typically make strong assumptions on the initial state of the nodes, such as assuming that all clocks are initially synchronized ([?,?,?]), and thus are not self-stabilizing. On the other hand, self-stabilizing clock synchronization algorithms can start from arbitrary values, which can come at a cost in convergence time or in the severity of the faults contained.
There are surprisingly few self-stabilizing solutions facing Byzantine faults ([?]), and these additionally have impractical convergence times. Note that self-stabilizing clock synchronization has an inherent difficulty in estimating real-time without an external time reference, because non-faulty nodes may initialize with arbitrary clock values. Thus self-stabilizing clock synchronization aims at reaching a stable state from which clocks proceed synchronously at the rate of real-time, and not necessarily at estimating real-time (assuming that nodes have access to physical timers that proceed close to real-time rate). Many applications utilizing the synchronization of clocks do not really require the exact real-time notion (see [?]). In such applications, agreeing on a common clock reading is sufficient as long as the clocks progress within a linear envelope of any real-time interval.

We present a protocol with the following property: should the system be initialized with clocks that hold values close to real-time, then the clocks stay synchronized while attaining similar real-time accuracy, precision and time complexity as non-stabilizing clock synchronization protocols. Should the system initialize with arbitrary clock values or recover from any transient faults, then the clocks synchronize very fast and proceed at real-time rate with high precision.

The protocol we present significantly improves upon existing Byzantine self-stabilizing clock synchronization algorithms by reducing the time complexity from expected exponential ([?]) to deterministic O(f). The comparably low complexity is achieved by focusing on a deterministic Byzantine self-stabilizing algorithm for pulse synchronization. The synchronized pulses progress at a pace that allows the execution of a Byzantine Strong Consensus protocol on the clock values in between pulses, thus obtaining a common clock reading.
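The linear-envelope requirement mentioned above can be illustrated with a small check. This is only a sketch under assumptions: the text does not give the exact form of the envelope at this point, so the inequality used here (clock advance within (1 ± rho) times the elapsed real time, plus an additive slack) and the names `within_linear_envelope`, `rho` and `slack` are hypothetical choices made for illustration.

```python
def within_linear_envelope(samples, rho, slack):
    """Check an assumed linear-envelope condition on a clock trace.

    samples: list of (real_time, clock_reading) pairs, sorted by real_time.
    rho:     assumed bound on the clock's rate deviation from real-time.
    slack:   assumed additive allowance (e.g. for adjustments).

    For every pair of samples, the clock advance dc over a real-time
    interval dt must satisfy
        (1 - rho) * dt - slack <= dc <= (1 + rho) * dt + slack.
    """
    for i in range(len(samples)):
        t1, c1 = samples[i]
        for j in range(i + 1, len(samples)):
            t2, c2 = samples[j]
            dt = t2 - t1
            dc = c2 - c1
            lower = (1 - rho) * dt - slack
            upper = (1 + rho) * dt + slack
            if not (lower <= dc <= upper):
                return False
    return True


# A clock that starts far from real-time but advances at roughly real-time rate.
good_trace = [(0, 100.0), (10, 110.05), (20, 119.9)]
# A clock that runs at twice the real-time rate.
bad_trace = [(0, 0.0), (10, 20.0)]

ok_good = within_linear_envelope(good_trace, rho=0.01, slack=0.5)
ok_bad = within_linear_envelope(bad_trace, rho=0.01, slack=0.5)
```

Note that the first trace passes even though its absolute readings are offset from real-time by about 100 units: the condition constrains only the rate of progress, which matches the goal of proceeding at real-time rate without necessarily estimating real-time.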


A special challenge in self-stabilizing clock synchronization is the clock wrap-around. In non-stabilizing algorithms, having a large enough integer eliminates the problem for any practical concern. In self-stabilizing schemes, a transient failure can cause nodes to initialize with arbitrarily large clocks, surfacing the issue of the clock bounds. The clock synchronization scheme described above handles this wrap-around.

Having access to an outside source of real-time is useful, though it introduces a single point of failure. Our approach is useful in such a case to overcome periods in which the outside source fails, in order to maintain a consistent system state.
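To make the wrap-around issue concrete, the following sketch compares two bounded clocks using modular arithmetic. It is an illustration under assumptions, not the paper's procedure: the clock bound `M`, the closeness threshold `gamma`, and the function names are hypothetical.

```python
def wrapped_diff(a, b, M):
    """Signed difference a - b for clocks that wrap around at M.

    The result lies in the range [-M/2, M/2), so two readings on either
    side of the wrap point are treated as close rather than far apart.
    """
    d = (a - b) % M          # Python's % yields a value in [0, M) for M > 0
    if 2 * d >= M:           # more than half way round: go the other way
        d -= M
    return d


def clocks_close(a, b, M, gamma):
    """True if the wrapped distance between readings a and b is at most gamma."""
    return abs(wrapped_diff(a, b, M)) <= gamma


M = 1000                     # hypothetical clock bound
d_forward = wrapped_diff(5, 995, M)    # 5 is 10 ticks ahead of 995
d_backward = wrapped_diff(995, 5, M)   # 995 is 10 ticks behind 5
near = clocks_close(5, 995, M, gamma=20)
far = clocks_close(5, 500, M, gamma=20)
```

A naive comparison `abs(a - b)` would report the readings 5 and 995 as 990 ticks apart, whereas the wrapped distance is 10. This is why a node that initializes with a clock value near the bound needs explicit handling.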

2 Model and Problem Definition

The environment is a network of processors (nodes) that communicate by exchanging messages. Individual nodes have no access to a central clock and there is no global pulse system. The hardware clocks (referred to as the physical timers) of correct nodes have a bounded drift rate, ρ, from real-time. The communication network does not guarantee any order on messages. The network and/or all the nodes can behave arbitrarily, though eventually the network performs within the defined assumption boundaries in which at most f out of the n nodes may behave arbitrarily.

Definition 1. The network assumption boundaries are:
1. Message passing allowing for an authenticated identity of the senders.
2. At most f of the nodes are faulty.
3. Any message sent by any non-faulty node will eventually reach every non-faulty node within δ time units.

Definition 2. A node is correct at times that it complies with the following conditions:
1. Obeys a global constant 0 < ρglob