
Improving Robustness of the Synchronization Quality of IEEE1588 Nodes

P. Ferrari, A. Flammini, S. Rinaldi

A. Bondavalli, F. Brancati

Dept. of Information Engineering, University of Brescia Via Branze, 38 - 25123 - Brescia, Italy [email protected]

University of Florence Viale Morgagni 65, I-50134, Firenze, Italy [email protected]

Abstract-Nowadays the IEEE1588 synchronization protocol has been adopted in a growing number of fields, from automation to telecommunication systems. Usually, the quality of the services provided by these applications requires accurate and reliable time synchronization. Several solutions have already been proposed to achieve these goals. However, the statistical instruments provided in the standard (PTP variance) appear to be inadequate to identify synchronization problems when unpredictable events affect the quality of the synchronization. This paper provides alternative instruments for the analysis and improvement of the synchronization quality of an IEEE 1588 system, resorting to the Reliable & Self Aware (R&SA) clock. The R&SA clock provides statistical information which is shown to correctly identify the different sources of problems that can affect the synchronization. The paper then proposes to take advantage of the statistical data collected to improve the reliability and accuracy of PTP nodes.

Keywords-component: synchronization; fault-tolerant system; IEEE1588; quality of synchronization

I. INTRODUCTION

The wide diffusion of distributed systems in different fields, such as industrial automation and telecommunication systems, requires an accurate synchronization of the nodes in order to assure an adequate quality of service (QoS). The synchronization of the time reference among the nodes of a distributed system is a well-known problem in the scientific and research communities. Several approaches have been proposed [1],[2],[3], each of them designed to satisfy specific requirements. When high-performance synchronization is required, the IEEE1588 standard, also known as Precision Time Protocol (PTP) [4], can be adopted. This protocol is dedicated to the synchronization of distributed nodes in a LAN with an accuracy well below 1 µs, specifically for Ethernet-based instrumentation [5] and industrial automation systems. Sometimes these applications require a reliable system in order to prevent damage to the plant or service unavailability. The synchronization system itself has to provide reliable time information even during a fault of the system. The PTP protocol has a significant weak point because it organizes the clocks in a master-slave hierarchy: a single failure of the master causes the loss of synchronization of the whole network. This problem has already been deeply analyzed and solutions have been proposed [6],[11].

Safety-critical applications require a reliable source of time. For this class of applications, the Reliable and Self Aware Clock (R&SA Clock) has been proposed [7],[8]. The statistical information collected by the R&SA clock to estimate the synchronization uncertainty is useful as feedback about the quality of the synchronization. The nodes equipped with an R&SA clock are continuously updated about the current synchronization performance. A node can infer information about anomalous conditions by analyzing these data and can then take the proper corrective actions.

The paper is structured as follows. First, Section II gives a brief introduction to the R&SA clock. Then, in Section III, the points of failure of an IEEE1588 system are classified. These failures can affect the synchronization quality of IEEE1588 nodes; however, the experiments of Section V confirm the identification capabilities of the R&SA clock. Section VI then discusses the possibility of using this information to increase synchronization performance. Finally, the results are summarized in the conclusions and an outlook on future research is provided.

978-1-4244-5977-3/10/$26.00 ©2010 IEEE

Figure 1. High-level representation of R&SA Clock.

II. THE RELIABLE AND SELF AWARE CLOCK

The behavior of a local clock c is characterized by the quantities offset, accuracy and drift. The offset $\Theta_c(t)$ is the actual distance of the local clock of a node n from global time at time t [3]. This distance may vary over time. Accuracy $A_c$ is an upper bound of the offset [12]. Drift $\rho_c(t)$ describes the rate of deviation of a local clock c at time t from global time [12]. Synchronization uncertainty $U_c(t)$ is defined in [7] as an adaptive and conservative evaluation of the offset $\Theta_c(t)$ at any time t such that

$$A_c \geq U_c(t) \geq |\Theta_c(t)| \geq 0.$$
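The chain of bounds above can be made concrete with a small numerical check. The following is only a minimal sketch under stated assumptions: the offset is taken as local time minus global time, and the names `UncertaintyInterval`, `bounds_hold` and `interval_from_uncertainty` are hypothetical and do not come from the R&SA Clock implementation. If the uncertainty really bounds the offset, the interval built around the local clock reading must contain global time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertaintyInterval:
    """Interval around a local clock reading that should contain global time."""
    likely: float  # local clock reading c(t), in seconds
    lower: float   # c(t) - U_c(t)
    upper: float   # c(t) + U_c(t)

    def contains(self, global_time: float) -> bool:
        return self.lower <= global_time <= self.upper


def bounds_hold(accuracy: float, uncertainty: float, offset: float) -> bool:
    """Check the chain A_c >= U_c(t) >= |Theta_c(t)| >= 0."""
    return accuracy >= uncertainty >= abs(offset) >= 0.0


def interval_from_uncertainty(clock_reading: float,
                              uncertainty: float) -> UncertaintyInterval:
    """Build [c(t) - U, c(t) + U]; the uncertainty must be non-negative."""
    if uncertainty < 0:
        raise ValueError("uncertainty must be non-negative")
    return UncertaintyInterval(clock_reading,
                               clock_reading - uncertainty,
                               clock_reading + uncertainty)
```

For example, with a reading of 100 s, an uncertainty of 10 µs and a true offset of 4 µs, the bounds hold and the interval contains the global time 100 s minus 4 µs; an uncertainty of 3 µs would violate the chain.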

R&SA Clock is a software clock that provides users (e.g., system processes) with both the time value and the synchronization uncertainty associated with that time value. When a user asks R&SA Clock for the current time (by invoking the function getTime), R&SA Clock provides an enriched time value [likelyTime, minTime, maxTime, FLAG]. LikelyTime is the time value computed by reading the local clock (i.e., c(t)). MinTime and maxTime are based on the synchronization uncertainty provided by the internal mechanisms of R&SA Clock. The core of R&SA Clock is therefore the uncertainty evaluation algorithm (UEA), which equips R&SA Clock with the ability to compute the uncertainty. Figure 1 sketches the high-level structure of R&SA Clock and its main components [7].

III. FAILURE POINTS IN AN IEEE 1588 SYNCHRONIZATION SYSTEM

As mentioned above, the R&SA clock is a software clock able to use the information provided by any synchronization algorithm (usually the estimated offset and estimated drift) to give an application information about the uncertainty of its current view of time. This solution has already been proposed using NTP and GPS-based synchronization [8]. The R&SA clock can be applied to any other synchronization protocol that estimates the time offset of the local clock from a reference source, such as the PTP protocol [4]. The PTP protocol is dedicated to the internal synchronization of devices over a LAN. The nodes of the network are organized in a master-slave hierarchy; the master of the network, i.e. the clock with the best performance, provides the time reference to the other nodes, the slaves. The slaves can obtain an accurate estimate of the time offset and of the drift from the master through an exchange of messages over the network. A clock servo is then used to ensure that the local clock tracks the reference time. Generally speaking, this protocol is adopted in industrial control and telecommunication systems; in these environments it is usually important to provide reliable time information to applications. A brief analysis of the failure conditions in a PTP system is provided and experimentally tested in the following paragraphs.

A. Fault-model of an IEEE 1588 synchronization system

The synchronization protocol described in the IEEE 1588 standard, since its early version [9], is dedicated to the precise synchronization of simple devices with limited resources over a LAN. In order to simplify and reduce the message exchange, the protocol provides for a master-slave topology, in which the node with the best clock behavior provides the time reference to the others. Clearly, such a topology is weak in case of a master fault: during the election of a new master, the clocks of the slave nodes run freely and the network is unsynchronized. This is a well-known problem, and some solutions have been proposed [11],[6]. Such solutions ensure fault-tolerant synchronization but require the adoption of more complex algorithms (dedicated convergence functions) or topologies.

The recent release of the standard, the so-called PTPv2, introduces several enhancements in order to make the synchronization more fault-tolerant. In particular, the improved standard supports redundant networks in order to take into account faults of the communication links. The Best Master Clock (BMC) algorithm defines the classical master-slave hierarchy among nodes, determining which links are in the active state and which in the passive state. Only the active links are involved in the synchronization protocol, though the propagation delay of the passive links is also measured. In the case of a link fault, the synchronization messages can pass through a redundant link (which becomes the new active link). An additional recommended improvement regards the so-called Alternate Master: the slaves can collect statistical information not only on the current master, but also on the alternate master. In the case of a fault of the current master, the transient phase due to the switch-over of the master is shortened because information about the new master is already available.

Before starting the analysis of the quality of the synchronization, it is important to define the fault scenarios, in order to be able to identify the different failure sources that affect the synchronization performance. As mentioned above, the main fault point of such a system is the master; a master fault can be classified by its duration as permanent or temporary. In the former case, the fault causes the definitive elimination of the master from the synchronization network.
The election of a new master is needed; during this time the nodes of the network run freely. The election of a new master can take approximately several hundreds of seconds (PTP_SYNC_RECEIPT_TIMEOUT + $T_{election}$ + $T_{servo\_recovery}$). A permanent fault of the communication link or a fault of the node itself (power down, permanent damage, etc.) are examples of this class. In the latter case, the master can recover from the fault after a recovery time, $T_{rec}$. If $T_{rec}$ > PTP_SYNC_RECEIPT_TIMEOUT (10 × 2^SYNC_INTERVAL), the slaves elect a new master; after $T_{rec}$ the old master starts working properly again, causing the re-execution of the BMC algorithm and its re-election to the master state. During the temporary fault the slaves start to synchronize themselves to the time of the new master. In this way the slaves can be affected by a time offset bigger than the offset due only to the drift of their clocks during $T_{rec}$, because of the behavior of the clock servo during the election of a new master. A restart of the operating system of the master node or a temporary saturation of the communication link are examples of such a situation.

Another class of problems that can affect the master are the so-called Byzantine failures. In this case the master fails in an arbitrary way, not just stopping but providing the slaves with an inconsistent or unstable time reference, due to hardware problems (e.g. drift of the local oscillator due to strong local thermal variation) or corruption of local state and program data. In this case, though the slaves are continuously synchronized to the master, the source of time is unstable; therefore their synchronization accuracy is decreased. A stronger filtering action on the estimated offset $\Theta_c(t)$ can only partially solve the problem, and only if the variation of the time reference is bounded; otherwise a master with a better behavior has to be selected.

The analysis has been dedicated to master faults because a slave fault is less serious: its failures do not propagate to the other nodes. Notwithstanding this, it is important to correctly identify any problem and to put the slave in a warning state, in order to avoid the election of a damaged slave to the master state. Therefore an analysis of the local behaviour of the node is useful to prevent possible propagation of the failure to the rest of the network.

IV. THE EXPERIMENTAL SET-UP

In this paper a typical PTP system has been implemented in the laboratory in order to test the failure conditions in a safe and repeatable environment. The experimental set-up, shown in Fig. 2, is composed of a set of four PTP nodes connected through a 10/100/1000 Ethernet switch (Netgear GS108). In order to improve the synchronization accuracy, a PTP boundary clock (e.g. the Hirschmann MICE MS30) or a transparent clock would have to be used instead of a traditional switch. However, in this context the attention is focused on the failure conditions more than on the synchronization accuracy. A boundary clock manages possible failures of the PTP system in the same way as the other PTP nodes; therefore a traditional Ethernet switch is sufficient for the purpose of this paper. The PTP nodes have been implemented using PCs equipped with the daemon PTPdv2002 [10], an open source implementation of the IEEE 1588 standard v2002 [4], though the approach is implementation independent. In this way, the code of the PTP daemon can be instrumented if required by the test. PC1 can be synchronized to an external reference source of time, such as a GPS receiver or a primary NTP time server. For this reason, PC1 is the primary master of this system. In case of a failure, PC2 is elected as the new master because it can receive the reference time from a primary NTP server. The other nodes remain in the slave state. The four nodes belong to the same PTP domain (_DEFAULT); the synchronization interval is 2 s (the default SYNC_INTERVAL value). Each node logs the local synchronization data (the timestamp of the synchronization, $T_{sync}$, the offset of the local clock from the master, $T_{off}$, the one-way delay, $T_{owd}$, and the correction of the local clock) in a PTP log file. These data are then used by the R&SA clock in order to obtain an estimate of the synchronization uncertainty. This elaboration can be done at run-time on each PC or off-line, during post-processing. These two modes are equivalent for the sake of the analysis reported in this paper.
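The offset from the master and the one-way delay logged by each node come from the PTP timestamp exchange. The following is a minimal sketch, assuming the usual Sync / Delay_Req exchange with four timestamps and a symmetric network path; the function and variable names are illustrative and are not taken from the PTPd source.

```python
from typing import Tuple


def offset_and_delay(t1: float, t2: float,
                     t3: float, t4: float) -> Tuple[float, float]:
    """Estimate (offset_from_master, one_way_delay) from four timestamps.

    t1: master clock when Sync is sent
    t2: slave clock when Sync is received
    t3: slave clock when Delay_Req is sent
    t4: master clock when Delay_Req is received
    Assumes the path delay is the same in both directions.
    """
    master_to_slave = t2 - t1   # delay + offset
    slave_to_master = t4 - t3   # delay - offset
    one_way_delay = (master_to_slave + slave_to_master) / 2.0
    offset_from_master = (master_to_slave - slave_to_master) / 2.0
    return offset_from_master, one_way_delay
```

Any asymmetry between the two directions shows up directly as an error in the estimated offset, which is one reason a plain switch limits the achievable accuracy in this set-up.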

Figure 2. Experimental set-up adopted to evaluate the failure condition of the PTP master (PC1).

In the first experiment a PTP system that works properly has been analyzed for the sake of completeness. In the following experiments two fault conditions have been tested: a temporary master fault and a temporary deterioration of the frequency of the master's oscillator. A permanent master fault has not been tested because the behavior of the PTP nodes after this event is similar to the initial election of the first experiment.
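Whether a temporary master fault triggers a new election depends on how the recovery time compares with the sync receipt timeout discussed in Section III.A. The sketch below is only illustrative. It assumes that the timeout is 10 sync intervals and that SYNC_INTERVAL is the base-2 logarithm of the sync period in seconds, so that the 2 s period used here corresponds to SYNC_INTERVAL = 1; the function names are hypothetical.

```python
# Assumption: the timeout is expressed as 10 missed sync intervals.
PTP_SYNC_RECEIPT_TIMEOUT_INTERVALS = 10


def sync_receipt_timeout_s(log2_sync_interval: int) -> float:
    """Timeout in seconds: 10 x 2^SYNC_INTERVAL."""
    return PTP_SYNC_RECEIPT_TIMEOUT_INTERVALS * 2.0 ** log2_sync_interval


def slaves_elect_new_master(recovery_time_s: float,
                            log2_sync_interval: int = 1) -> bool:
    """True if a temporary master fault outlasts the sync receipt timeout."""
    return recovery_time_s > sync_receipt_timeout_s(log2_sync_interval)
```

Under these assumptions the timeout is 20 s for the 2 s sync period, so an outage of a few seconds leaves the hierarchy unchanged while an outage of a minute forces the slaves to elect a new master.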

The capability of the R&SA clock implemented on the slaves to reveal any potential failure condition has been highlighted. The IEEE 1588 standard provides a statistical measure of the stability of the clock, the PTP variance $\sigma_{PTP}^2$, related to the Allan deviation $\sigma_y(\tau)$ by the relationship

$$\sigma_{PTP}^2 = \frac{1}{3}\,\tau^2\,\sigma_y^2(\tau)$$

where $\tau$ is a multiple of the syncInterval. The Allan deviation is an extremely useful statistical instrument used to characterize the stability of the frequency of the clock and to identify the type of the underlying noise process [13]. However, it is not sensitive to a constant offset in time or in frequency, nor to occasional amplitude or frequency glitches that can be evidence of a synchronization failure or abnormal functioning; for this reason it is not the best instrument for monitoring the synchronization process. The synchronization uncertainty estimated by the R&SA clock is directly based on statistics about the measures of the time offset and the oscillator drift and contains the offset itself [8]. In Fig. 3 the time offset (continuous lines), the synchronization uncertainty (SU, dotted lines) and the PTP deviation (dashed line; estimated at run-time on 100 samples and appropriately scaled) are compared. In this experiment the master clock is affected by random changes of the oscillator frequency. As clearly highlighted by the results in the figure, the PTP deviation is not sensitive to these changes, which severely affect the synchronization of the system. On the contrary, the synchronization uncertainty is able to signal the problem.
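The insensitivity of the Allan deviation to constant offsets follows from its use of second differences of the time error. The sketch below uses the standard non-overlapping estimator at the basic sampling interval and assumes the relation between PTP variance and Allan variance reads $\sigma_{PTP}^2 = \tau^2 \sigma_y^2(\tau) / 3$; the function names are hypothetical, and this is not the estimator used by any particular PTP implementation.

```python
import math
from typing import Sequence


def allan_deviation(time_error: Sequence[float], tau0: float) -> float:
    """Allan deviation at tau = tau0 from time-error samples x_i (seconds).

    sigma_y^2 = sum((x[i+2] - 2*x[i+1] + x[i])^2) / (2 * (N - 2) * tau0^2)
    """
    n = len(time_error)
    if n < 3:
        raise ValueError("at least three samples are required")
    total = 0.0
    for i in range(n - 2):
        second_diff = time_error[i + 2] - 2.0 * time_error[i + 1] + time_error[i]
        total += second_diff ** 2
    return math.sqrt(total / (2.0 * (n - 2) * tau0 ** 2))


def ptp_variance(time_error: Sequence[float], tau0: float) -> float:
    """PTP variance from the assumed relation tau^2 * sigma_y^2 / 3."""
    return tau0 ** 2 * allan_deviation(time_error, tau0) ** 2 / 3.0
```

Adding a constant to every sample (a constant time offset) or a linear ramp (a constant frequency offset) leaves the second differences, and therefore the result, unchanged. A slave that sits at a fixed but large offset from its master would thus look perfectly stable by this measure.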

Figure 3. PTP deviation vs the R&SA synchronization uncertainty. (Horizontal axis: seconds.)

V. ANALYSIS OF EXPERIMENTAL RESULTS

As mentioned above, this section investigates the capability of the synchronization uncertainty $U_c(t)$ to identify potential failure conditions. In the first experiment, the synchronization network is composed of three PTP slaves synchronized to the PTP master. The sync interval is 2 s, the default value; the synchronization data are collected by each slave in the log file for one hour. The information about the offset from the master is then post-processed by the R&SA algorithm in order to obtain the trend of the synchronization uncertainty. Fig. 4 shows the results of this experiment. Dots represent the time offset from the master of a single slave, while the related minTime and maxTime are shown by the two dashed lines.

During this first experiment no failure condition was tested. After the transient phase, which ends after 1500 s, the synchronization offset is below 10 µs, a typical result that can be obtained using a software-only implementation, such as the PTPd daemon, with standard PCs and switches that are not PTP compliant. Note that the offset in the first part of the graph is out of scale (i.e., it is larger than 100 µs before 400 s). As expected, the R&SA clock provides a reliable synchronization uncertainty interval, i.e. the offset is always contained in the predicted interval. Moreover, it can be observed that in case of good synchronization the synchronization uncertainty remains stable.
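The claim that the offset is always contained in the predicted interval can be checked directly on the logged data. The following is a minimal sketch, assuming the log has already been parsed into per-sample tuples of measured offset, lower bound and upper bound, all relative to the master; the tuple layout and the function name `coverage` are hypothetical.

```python
from typing import Iterable, Tuple


def coverage(samples: Iterable[Tuple[float, float, float]]) -> float:
    """Fraction of samples whose offset lies inside [lower, upper]."""
    total = 0
    inside = 0
    for offset, lower, upper in samples:
        total += 1
        if lower <= offset <= upper:
            inside += 1
    if total == 0:
        raise ValueError("no samples")
    return inside / total
```

A coverage of 1.0 over the whole run corresponds to the behaviour described above; any lower value would point to samples where the uncertainty interval failed to contain the offset.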

[Figure 4 plot; legend: PC2, PC3, PC4, SU4]