NTI: A Network Time Interface M-Module for High-Accuracy Clock Synchronization

Martin Horauer
Technische Universität Wien, Department of Computer Technology
Gußhausstraße 27, A-1040 Vienna, Austria
Email: [email protected]

Ulrich Schmid, Klaus Schossmaier
Technische Universität Wien, Department of Automation
Treitlstraße 1, A-1040 Vienna, Austria
Email: [email protected]

April 30, 1997

Abstract

This paper provides a description of our Network Time Interface M-Module (NTI), which supports very high-accuracy external clock synchronization in hardware. The NTI is built around our custom Universal Time Coordinated Synchronization Unit VLSI chip (UTCSU-ASIC), which contains the core of the hardware support required for interval-based clock synchronization: a state- and rate-adjustable clock device, locally maintained accuracy intervals, interfaces to GPS receivers, and various timestamping features. Designed for maximum network and CPU independence, our NTI provides a turn-key solution for adding high-resolution synchronized clocks to distributed real-time systems built upon hardware with M-Module interfaces.

Keywords: external clock synchronization, hardware support, fault-tolerant distributed real-time systems, interval-based paradigm, M-Modules, GPS, VLSI.

1 Introduction

Designing distributed fault-tolerant real-time applications is usually considerably simplified when synchronized clocks are available. Temporally ordered events are beneficial for a wide variety of tasks, ranging from relating sensor data gathered at different nodes up to fully-fledged distributed algorithms; see [Lis93] for some examples. Providing mutually synchronized local clocks is known as the internal clock synchronization problem, and numerous solutions have been worked out (at least in scientific research) under the term fault-tolerant clock synchronization; see [RSB90], [SWL90] for an overview and [YM93] for a bibliography. If the system time provided by synchronized clocks must also have a well-defined relation to Universal Time Coordinated (UTC), the only official and legal standard time, then the fault-tolerant external clock synchronization problem needs to be addressed. Unlike internal synchronization, it did not receive much attention until recently, when highly accurate and cheap receivers for the Global Positioning System (GPS) became widespread, see [Dan97]. A representative overview of the current research on external clock synchronization may be found in [Sch97a].

This research is part of our project SynUTC, which has been supported by the Austrian Science Foundation (FWF) under grant no. P10244-ÖMA and is now continued under the START programme Y41-MAT.


It is well known that the worst-case synchronization tightness achievable by any clock synchronization approach depends heavily on the uncertainty (i.e., the variability) of the end-to-end delay arising in the exchange of time information. Hence, it is primarily determined by the properties of the communications subsystem, which can be classified according to the bridged physical distance as follows:

(I) If the interconnected nodes are only a few tens of meters apart, a dedicated and usually fully connected clocking "network" exhibiting small and constant propagation delays is sometimes affordable. This setting allows the construction of phase-locked-loop clocks that employ clock voting for fault-tolerance reasons, which can provide a synchronization tightness down to the ns-range, see [RSB90] for an overview.

(II) Nodes within a few hundred meters of each other are usually interconnected by a packet-oriented communications subsystem, where sending data packets is the only means of exchanging (time) information. Typical distributed real-time systems employ (redundant) LANs like fieldbusses based on shared broadcast channels, which provide almost deterministic propagation delays but incur a considerable medium access uncertainty. Almost all work on clock synchronization addresses this type of system, although fully connected point-to-point networks are usually preferred. Purely software-based solutions typically achieve a synchronization tightness in the ms-range, which can be brought down to the μs-range with moderate hardware support, see Section 5.

(III) World-wide distributed systems connected via long-haul networks constitute an entirely different class of systems. In fact, they have to cope with end-to-end transmission delays that are potentially unbounded and highly variable due to the inevitable queuing delays at intermediate gateway nodes (e.g., in case of congestion and/or failures). The most prominent external clock synchronization scheme for such settings is undoubtedly the Network Time Protocol (NTP), designed for disseminating UTC among workstations throughout the Internet, see [Mil91] for details. Although deterministic guarantees cannot be given here, there are reports like [Tro94] that state maximum UTC deviations in the 10 ms-range under "reasonable" conditions.

The Network Time Interface (NTI) described in this paper has been designed to support fault-tolerant external clock synchronization in distributed systems of type (II) above. Implemented as an M-Module, our NTI allows state-of-the-art real-time systems technology to be extended with synchronized clocks providing a synchronization tightness in the 1 μs-range. Apart from advanced algorithmic issues, this undertaking primarily requires hardware support for exact timestamping of clock synchronization data packets and a sophisticated rate- and state-adjustable clock. The appropriate features are presented in the remaining sections of our paper, which are organized as follows: To introduce certain non-standard features like accuracy intervals, Section 2 provides a very brief overview of our interval-based clock synchronization approach. The following major Section 3 is devoted to the NTI's hardware architecture. Starting with a requirements analysis in Subsection 3.1, an overview of the resulting features of the NTI M-Module is provided in Subsection 3.2. The most salient features of our UTCSU VLSI chip are briefly reviewed in Subsection 3.3, and a short description of the basic hardware/software interface is provided in Subsection 3.4.
In Section 4, our hardware overview is complemented by a brief outline of how we incorporated the NTI into a state-of-the-art real-time kernel. The paper concludes with a reasonably detailed discussion of related work in Section 5 and a short summary of our accomplishments in Section 6. Note that our approach can also be adapted to more general topologies commonly known as WANs-of-LANs, provided that all gateway nodes are also equipped with the NTI, cf. [Sch95] and [SS95].


2 Interval-based Clock Synchronization

Interval-based clock synchronization, introduced in [Sch94] and further developed in a number of papers ([SS97], [Scho97], [Sch97b], [Sch97c], etc.), relies on the interval-based paradigm originally introduced in [Mar84] and [Lam87]. Real-time $t$ (usually UTC) is not just represented by a single time-dependent clock value $C(t)$ here, but rather by an accuracy interval $A(t)$ that must satisfy $t \in A(t)$. As illustrated in Figure 1, accuracy intervals are usually provided by combining an ordinary clock $C(t)$ with a time-dependent interval of accuracies $\alpha(t) = [-\alpha^-(t), +\alpha^+(t)]$ taken relative to the clock's value, leading to $A(t) = [C(t) - \alpha^-(t), C(t) + \alpha^+(t)]$.

Figure 1: Accuracy interval (clock time $T$ plotted against real time $t$; at real time $t'$ the clock reads $T' = C(t')$, and the interval $[C(t') - \alpha^-(t'), C(t') + \alpha^+(t')]$ contains $t'$)
For interval-based clock synchronization, we hence assume that each node $p$ in our system is equipped with an interval clock that continuously displays its local accuracy interval $A_p(t)$. An interval-based (external) clock synchronization algorithm is in charge of maintaining $A_p(t)$ so that the following is guaranteed:

(P) Precision requirement: there is some precision $\pi_{max} \ge 0$ such that $|C_p(t) - C_q(t)| \le \pi_{max}$ for all nodes $p$, $q$ that are non-faulty up to real-time $t$.

(A) Accuracy requirement: the interval of accuracies $\alpha_p(t)$ is such that $-\alpha^+_p(t) \le C_p(t) - t \le \alpha^-_p(t)$ for all nodes $p$ that are non-faulty up to real-time $t$.

All interval-based clock synchronization algorithms developed so far employ the basic structure of the generic algorithm introduced and analyzed in [SS97], which works as follows: at each node $p$, the following steps are periodically executed (leading to a round-based execution well known from traditional clock synchronization algorithms):

1. When $C_p(t) = kP$, with $k \ge 1$ denoting the current round, a clock synchronization packet (CSP) containing $A_p(t)$ is broadcast to each node within the system.

2. Upon reception of a CSP, the received accuracy interval is preprocessed to make it compatible with the accuracy intervals of the other nodes received during the same round.

3. When $C_p(t) = kP + \Delta$ for some suitable $\Delta > 0$, an interval-valued convergence function is applied to the set of preprocessed intervals to compute and subsequently enforce an improved accuracy interval.

Two basic operations are required in step 2, where the exchanged intervals are made compatible with each other while preserving the inclusion of real-time: First, delay compensation is applied to the interval received in a CSP to account for the adverse effect of transmitting an accuracy interval over a network. To account for the maximum transmission delay uncertainty, the received interval must be enlarged appropriately. Second, drift compensation is used to shift the resulting interval to some common point in real-time by means of the local clock $C(t)$. Since clocks might have a non-zero drift, a sufficient enlargement ("deterioration") of the interval is required here as well. Note that drift compensation must also be performed continuously by the local interval clock during each round, i.e., after step 3.

In step 3, a suitable convergence function is applied to the set of preprocessed accuracy intervals. It is in charge of providing a new (smaller) accuracy interval for the local interval clock that guarantees (P) and (A), despite some possibly faulty input intervals. The convergence function in fact determines the performance and fault-tolerance degree of the above interval-based clock synchronization algorithm. For example, fault-tolerant external clock synchronization can be achieved by employing the interval-based clock validation introduced in [Sch94]: a highly accurate but possibly faulty accuracy interval provided by an external time source like a GPS receiver is only considered if it is consistent with the less accurate but reliable validation interval, which is computed by applying a convergence function suitable for internal interval-based clock synchronization (like the ones of [Sch97b], [Sch97c]) to the accuracy intervals of ordinary nodes.

Generally speaking, the major advantage of interval-based clock synchronization is its ability to provide each node with a local on-line bound on its own clock's deviation from real-time, as utilized e.g. in OSF DCE, see [OSF92]. Moreover, since accuracy intervals are maintained dynamically, they are quite small on average, which compares favorably to the "static" worst-case accuracy bounds known for traditional clock synchronization algorithms. The price to be paid for this additional information, however, is the need for explicit bounds on certain system parameters like transmission delay uncertainty and maximum clock drift. These bounds can either be compiled statically into the algorithm from a priori information or, preferably, measured (even controlled) dynamically. In fact, our ambitious goal of a precision/accuracy in the 1 μs-range makes it inevitable to employ an accurate round-trip-based transmission delay measurement and, most importantly, to utilize bounds on the maximum clock drift provided by a suitable rate synchronization algorithm. Responsible for synchronizing the speeds $dC(t)/dt$ of all non-faulty clocks, the interval-based rate synchronization algorithm introduced and analyzed in [Scho97] effectively reduces the maximum drift without necessitating highly accurate and stable oscillators at each node.
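To make the two compensation steps and the validation test concrete, here is a minimal C sketch; all types, field names, and parameters (acc_interval, delay_nom, eps, rho) are our own illustrative choices rather than the actual SynUTC interfaces, with all times in nanoseconds:

```c
#include <stdint.h>

/* Illustrative representation of an accuracy interval
 * A = [C - alpha^-, C + alpha^+], all quantities in nanoseconds. */
typedef struct {
    int64_t ref;      /* clock value C associated with the interval */
    int64_t a_minus;  /* alpha^- >= 0 */
    int64_t a_plus;   /* alpha^+ >= 0 */
} acc_interval;

/* Delay compensation: shift by the nominal transmission delay and
 * enlarge by the worst-case delay uncertainty eps, so that real-time
 * stays inside the interval after the transmission. */
static void delay_compensate(acc_interval *iv, int64_t delay_nom, int64_t eps)
{
    iv->ref     += delay_nom;
    iv->a_minus += eps;
    iv->a_plus  += eps;
}

/* Drift compensation: drag the interval forward to local clock time
 * 'now' (now >= iv->ref) and deteriorate it by the maximum clock
 * drift rho, given in ns per second of elapsed clock time. */
static void drift_compensate(acc_interval *iv, int64_t now, int64_t rho)
{
    int64_t det = ((now - iv->ref) * rho) / 1000000000LL;
    iv->ref      = now;
    iv->a_minus += det;
    iv->a_plus  += det;
}

/* Interval-based clock validation: accept the accurate but possibly
 * faulty GPS interval only if it intersects the reliable validation
 * interval (both must contain real-time if non-faulty). */
static int consistent(const acc_interval *g, const acc_interval *v)
{
    return g->ref + g->a_plus >= v->ref - v->a_minus &&
           v->ref + v->a_plus >= g->ref - g->a_minus;
}
```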

3 NTI Architecture

As already pointed out in the introduction, our approach targets distributed systems consisting of computing nodes interconnected by an ordinary packet-oriented data network. According to Figure 2, which outlines the basic hardware components required for clock synchronization, each node must be equipped with a hardware clock (our UTCSU-ASIC, see Section 3.3), a general-purpose CPU (the node's central processor or, preferably, a dedicated microprocessor or microcontroller) responsible for executing the software part of the clock synchronization algorithm, and a Communications Coprocessor (COMCO), which provides access to the network by reading/writing data packets from/to (shared) memory independently of CPU operation (e.g. via DMA). For external synchronization purposes, some nodes must also be equipped with external time sources like GPS satellite receivers.

Figure 2: Basic clock synchronization hardware architecture (each node comprises a CPU, memory, UTCSU, and COMCO attached to the network medium; some nodes additionally have a GPS external time source)

3.1 NTI Requirements

Providing hardware support for highly accurate/precise clock synchronization is primarily driven by the requirement of exact timestamping of CSPs at both the sending and the receiving side. In fact, the work of [LL84] revealed that even $n$ ideal clocks cannot be synchronized with a worst-case precision less than $\varepsilon (1 - 1/n)$ in the presence of a transmission/reception time uncertainty $\varepsilon$, which is defined as the variability of the difference between the real times of CSP timestamping at the peer nodes. Unfortunately, there are several steps involved in packet transmission/reception that could contribute to $\varepsilon$, cf. [KO87]:

1. Sender-CPU assembles the CSP
2. Sender-CPU signals sender-COMCO to take over for transmission
3. Sender-COMCO tries to acquire the network medium
4. Sender-COMCO reads CSP data from memory and pushes the resulting bit stream onto the medium
5. Receiver-COMCO pulls the bit stream from the medium and writes CSP data into memory
6. Receiver-COMCO notifies receiver-CPU of packet reception via interrupt
7. Receiver-CPU processes CSP

Purely software-based clock synchronization approaches perform CSP timestamping upon transmission resp. reception in steps 1 resp. 7, which means that $\varepsilon$ incorporates the medium access uncertainty (3 → 4), any variable network delay (4 → 5), and the reception interrupt latency (6 → 7). The first can be quite large for any network utilizing a shared medium, and the last is seriously impaired by code segments executed with interrupts disabled. Fortunately, in our LAN-based setting, we can safely neglect the contribution from 4 → 5, since there are no (load- and hop-dependent) queueing delays from intermediate gateway nodes; this statement would remain true if all intermediate nodes in a multi-hop network were also equipped with the packet timestamping mechanism described below. Therefore, the resulting transmission/reception uncertainty emerges primarily from 1 → 4 resp. 5 → 7 at the sending resp. receiving node itself.

In an effort to reduce $\varepsilon$, the clock synchronization hardware should hence be placed as close as possible to the network facilities. Ideally, a CSP should be timestamped at the sender resp. receiver exactly when, say, its first byte is pushed on resp. pulled from the medium. However, this needs support from the interior of the COMCO, which is usually not available. In order to support existing network controller technology, a less tight method of coupling has to be considered.
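To put this lower bound into perspective, here is a purely illustrative calculation (the numbers are not measurements of our system): for $n = 16$ nodes and $\varepsilon = 1\,\mu\mathrm{s}$,

\[
  \varepsilon \left(1 - \frac{1}{n}\right)
  = 1\,\mu\mathrm{s} \cdot \left(1 - \frac{1}{16}\right)
  \approx 0.94\,\mu\mathrm{s} ,
\]

so even perfectly stable clocks could not be synchronized more tightly than roughly $0.94\,\mu\mathrm{s}$ in the worst case; reducing $\varepsilon$ itself is therefore the only way to reach a tightness in the 1 μs-range.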


For this purpose, our NTI uses a refinement of the widely applicable DMA-based coupling method proposed in [KO87]. The key idea is to insert a timestamp on-the-fly into the memory holding a CSP in a way that minimizes the transmission/reception uncertainty. More specifically, a modified address decoding logic for the memory is utilized, which

1. generates trigger signals that sample a timestamp into two dedicated UTCSU registers when a certain byte within the transmit resp. receive buffer for a CSP is read resp. written, and

2. transparently maps the sampled transmit timestamp into some portion of the transmit buffer.

Note that this special functionality is only present when a transmit/receive buffer is accessed by the COMCO, whereas CPU accesses are just plain memory accesses. To illustrate the entire process of CSP timestamping, we briefly discuss one possible scenario depicted in Figure 3; alternative scenarios may be found in [SS95].

Figure 3: Packet timestamping (when the COMCO reads the trigger address in the transmit buffer, the sender's UTCSU samples a transmit timestamp, which is transparently mapped into the outgoing packet between the address fields and the user data; on the receiving side, the COMCO's write into the receive buffer fires RECEIVE, and the receiver's UTCSU samples a receive timestamp that is subsequently stored next to the transmit timestamp)
Whenever the COMCO fetches data from the transmit buffer holding CSP data for transmission, it has to read across the particular address that causes the decoding logic to generate the trigger signal TRANSMIT. Upon occurrence of this signal, the UTCSU puts a transmit timestamp into a dedicated sample register, which is transparently mapped into a certain succeeding portion of the transmit buffer and hence automatically inserted into the outgoing packet. Note that the trigger address and the mapping address may be different. By the same token, when the COMCO at the receiving side writes a certain portion of the receive buffer in memory, the trigger signal RECEIVE is generated by the decoding logic, which causes the UTCSU to sample the receive timestamp into a dedicated register. Subsequently, the timestamp can be saved in an unused portion of the receive buffer, either by the CPU upon reception notification or even immediately by some additional transparent mapping, see [SS95] for more details.

The proposed approach works for any COMCO that accesses CSP data immediately in memory via DMA. Suitable chipsets are available for a wide variety of networks, ranging from fieldbusses like Profibus over Ethernet up to advanced high-speed FDDI or ATM networks. Definitely inappropriate, however, are COMCOs that provide on-chip storage for entire packets, as is the case for most CAN controllers, unless clock synchronization is explicitly supported by exporting the required trigger signals, cf. [KKMS95]. In the latter case, the UTCSU-ASIC could be even more tightly coupled with the COMCO, leading to a further reduction of $\varepsilon$.

Nevertheless, it is worth mentioning that assessing the transmission/reception uncertainty of a particular COMCO usually requires some experimental evaluation, cf. Section 4.

In fact, since CSP timestamping occurs in steps 4 resp. 5 of the data transmission/reception sequence introduced earlier, the only activities that still contribute to $\varepsilon$ are the "ongoing" data transmission and the bus arbitration necessary for COMCO memory write cycles upon CSP reception. The former uncertainty, i.e., the time between fetching a byte from the transmit buffer and trying to deposit it in the receive buffer, obviously depends on the internal architecture (FIFOs etc.) of the COMCO. Whereas adjusting the trigger position of the transmit/receive timestamp may help in reducing or circumventing certain impairments, it is nevertheless not easy to find and justify a suitable choice without actual measurements.
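The following behavioral C model summarizes the two special cases that the decoding logic adds to an otherwise plain memory access. The concrete offsets are those later configured for the 82596CA (cf. Section 3.4); everything else (names, types, the function itself) is illustrative only, the real logic lives in the NTI's CPLD:

```c
#include <stdbool.h>
#include <stdint.h>

#define TX_TRIGGER_OFF  0x14u   /* COMCO read here fires TRANSMIT       */
#define TX_TS_OFF       0x18u   /* mapped UTCSU transmit timestamp      */
#define TX_ACC_OFF      0x20u   /* mapped UTCSU transmit accuracy       */
#define RX_TRIGGER_OFF  0x1Cu   /* COMCO write here fires RECEIVE       */

typedef enum { ACC_CPU, ACC_COMCO } master_t;

/* One memory cycle within a CSP header, as seen by the decoding logic.
 * Returns true if the read is redirected to a UTCSU sample register
 * (transparent mapping), so the timestamp lands in the outgoing packet. */
static bool decode_cycle(master_t who, bool is_read, uint32_t off,
                         void (*fire_transmit)(void),
                         void (*fire_receive)(void))
{
    if (who == ACC_CPU)
        return false;                 /* CPU cycles are plain memory    */

    if (is_read && off == TX_TRIGGER_OFF)
        fire_transmit();              /* sample transmit time/accuracy  */
    if (!is_read && off == RX_TRIGGER_OFF)
        fire_receive();               /* sample receive time/accuracy   */

    /* Transparent mapping applies to the transmit stamp locations only. */
    return is_read && (off == TX_TS_OFF || off == TX_ACC_OFF);
}
```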

3.2 NTI Design

M-Modules [MM96] are an open, simple, and robust mezzanine bus interface primarily designed for VME carrier boards, which are commonly used in Europe. MA-Modules are enhanced M-Modules providing a 32 bit data bus instead of the 16 bit one of the original M-Modules. The address space consists of 256 bytes of I/O-space accessible via the standard M-Module interface and up to 16 MB of memory-space addressed by multiplexing the MA-Module data bus. The asynchronous bus interface only requires the module to generate an acknowledge signal for termination of a bus cycle, thus minimizing the control logic on board the M-Module. Further signals in the M-Module interface comprise a single vectorized interrupt line and two additional DMA control lines. The unit construction design of the 146 x 53 mm (single-height) M-Modules provides a peripheral I/O D-Sub connector on the front panel, two plug connectors to the carrier board for peripheral I/O and the MA-interface, and an intermodule port connector for interconnecting several M-Modules.

We found MA-Modules well suited for crafting the evaluation prototype of our clock synchronization hardware. More specifically, our basic assumption about CPU and COMCO independence (recall Section 3.1) strongly suggested not to bother ourselves with developing a fully-fledged node hardware, but rather to extend existing CPU boards with adequate support. Mezzanine busses are certainly the easiest way to accomplish this, and given the M-Modules' simplicity, robustness, size, and low cost, we eventually decided to implement the NTI as an MA-Module. In this and the following subsections, we will provide a short overview of the NTI design; consult [HLS96] for all the details. Figure 4 shows the major components.

Figure 4: NTI block diagram (UTCSU, memory, CPLD, S-PROM, and TCXO; connectors: 25 pin D-Sub on the front panel, 60 pin intermodule port, 60 pin MA-interface, and a 24 pin plug connector)

The UTCSU-ASIC contains most of the dedicated hardware support for interval-based clock synchronization; see Section 3.3 for its features. It is driven by an on-board temperature-compensated (TCXO) or ovenized (OCXO) quartz oscillator; alternatively, an external frequency source like the 10 MHz output of a high-end GPS receiver can be used. All UTCSU application time/accuracy-stamp inputs and duty timer outputs, as well as an interface to an external GPS receiver, are available via the M-Module's front-panel 25 pin D-Sub connector. In addition, all receive and transmit time/accuracy-stamp signals are available to the carrier board via the 24 pin plug connector. Note that high-speed opto-couplers or transceivers are interposed for all inputs in order to provide a decoupled and reliable interface. The M-Module's intermodule port is used to export the multiplexed NTPA-bus required for future extension modules, and to connect modularized GPS receivers directly to the UTCSU.

The memory serves as control and data interface between the CPU and the COMCO, providing the special functionality for COMCO accesses as outlined in Section 3.1. It consists of two 64K x 16 bit SRAM chips and supports byte, word, and longword read/write accesses. In Section 3.4, the memory map will be explained in some detail. All required decoding and glue logic of the NTI is incorporated in a single, in-circuit programmable complex programmable logic device (CPLD), which adapts the UTCSU and the memory to the MA-Module interface. Additional responsibilities of the CPLD are forwarding interrupt requests from the UTCSU to the carrier board, generating the acknowledge signal terminating a bus cycle, and giving access to the serial PROM. This read-only memory stores identification and revision information according to the M-Module specification, cf. [MM96].

3.3 UTCSU Features

In this subsection, we provide a succinct overview of the wealth of functionality of our custom Universal Time Coordinated Synchronization Unit (UTCSU); further information is available in [SSHL97], [SL96], [SS95] and especially in [Loy96]. Manufactured as an ASIC in 0.7 μm CMOS technology, the UTCSU accommodates about 65,000 gates on a 100 mm² die packed into a 180-pin PGA case. Due to its flexible bus interface, featuring dynamic bus sizing and little/big endian byte ordering, the UTCSU can be used in conjunction with virtually any 8, 16, or 32 bit CPU. Figure 5 gives an overview of the major functional blocks inside the UTCSU.

The centerpiece of our chip is a local time unit (LTU) utilizing a 56 bit NTP-time format, which maintains a fixed-point representation of the current time with a 32 bit integer part and a 24 bit fractional part, cf. [Mil91]. Clock time can be read atomically as a 32 bit timestamp with resolution $2^{-24}$ s ≈ 60 ns that wraps around every 256 s, and a 32 bit macrostamp containing the remaining 24 most significant bits of seconds along with an 8 bit checksum protecting the entire time information. The local clock of the UTCSU can be paced with any oscillator frequency $f_{osc}$ in the range of 1…20 MHz, is fine-grained rate adjustable in steps of about 10 ns/s, and supports state adjustment via continuous amortization as well as (optional) leap second corrections in hardware. These outstanding features are primarily a consequence of our novel adder-based clock design, which uses a large (91 bit), high-speed adder instead of a simple counter for summing up the elapsed time between succeeding oscillator ticks. Of course, a proper augend value (in multiples of $2^{-51}$ s ≈ 0.44 fs) must be provided by the clock rate synchronization algorithm running atop in order to enforce the desired rate adjustment.

It might appear that 60 ns granularity, 20 MHz maximum oscillator frequency, and 10 ns/s minimal rate adjustment are too much of a good thing for a 1 μs worst-case precision/accuracy. However, our extensive formal analysis of (interval-based) clock synchronization revealed that clock granularity and, in particular, rate adjustment uncertainty considerably impair the achievable worst-case precision and accuracy, see Section 5.
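The adder-based clock principle can be illustrated by a small simulation sketch; the 91 bit adder is modeled here with a (GCC/Clang) __int128 accumulator, and all names and the helper API are ours:

```c
#include <stdint.h>

/* Simulation of an adder-based clock: instead of counting ticks, a wide
 * accumulator adds an augend (the elapsed time per oscillator tick, in
 * multiples of 2^-51 s) on every tick.  unsigned __int128 is a GCC/Clang
 * extension standing in for the 91 bit on-chip adder. */
typedef struct {
    unsigned __int128 acc;     /* current time in units of 2^-51 s      */
    uint64_t          augend;  /* per-tick increment, same units        */
} adder_clock;

/* Nominal augend for oscillator frequency f_osc is 2^51 / f_osc units;
 * the rate synchronization algorithm slightly raises or lowers it
 * (rate_corr ~ 1e-8 corresponds to a ~10 ns/s rate adjustment). */
static void set_rate(adder_clock *c, double f_osc, double rate_corr)
{
    c->augend = (uint64_t)(((double)(1ULL << 51) / f_osc) * (1.0 + rate_corr));
}

static inline void tick(adder_clock *c)   /* one oscillator edge */
{
    c->acc += c->augend;
}

/* 32 bit timestamp as described in the text: 8 bits of seconds plus a
 * 24 bit fraction, i.e. resolution 2^-24 s (~60 ns), wrapping every
 * 256 s.  Shifting by 51-24 = 27 rescales to 2^-24 s units. */
static uint32_t timestamp(const adder_clock *c)
{
    return (uint32_t)(c->acc >> 27);
}
```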

Figure 5: Interior of the UTCSU (functional blocks connected via the internal I-Bus behind the BIU's data/address/control interface; external signals include the oscillator input fosc, TRANSMIT[1..6] and RECEIVE[1..6], 1PPS[1..3] and STATUS[1..3], APP[1..9] and APPL, the interrupt outputs INTT, INTN, and INTA, the debugging signals HWSNAP and SYNCRUN, and the exported NTPA-Bus)

The functional units are: ACU (Accuracy Unit), APU (Application Unit), BIU (Bus Interface Unit), BTU (Built-In Test Unit), GPU (GPS Unit), ITU (Interrupt Unit), LTU (Local Time Unit), NTU (Network Time Interface Unit), SNU (Snapshot Unit), and SSU (Synchronization Subnet Unit).
To support interval-based clock synchronization as outlined in Section 2, our UTCSU contains two more adder-based "clocks" in the ACU that are also driven by the oscillator frequency $f_{osc}$. They are responsible for holding and automatically deteriorating the 16 bit accuracies $\alpha^-$ and $\alpha^+$ to account for the maximum oscillator drift. Both can be (re)initialized atomically in conjunction with the clock register in the LTU. In addition, some extra logic suppresses a wrap-around of $\alpha^-$ and $\alpha^+$ and zero-masks potentially negative accuracies during continuous amortization.

A number of external events, supplied to the UTCSU via polarity-programmable input lines, can be time/accuracy-stamped with local time and accuracy (i.e., $\alpha^-$ and $\alpha^+$), which are atomically sampled into dedicated registers upon the appropriate input transition. Optionally, an interrupt can be raised on such an event as well. Due to the asynchronous nature of these inputs, internal synchronizer stages are utilized that introduce a timing uncertainty of at most $1/f_{osc}$. Note that a one- or two-stage synchronizer is employed, depending on the status of the UTCSU's reliable-pin; the recovery time for metastability phenomena is therefore $1/f_{osc}$ or $2/f_{osc}$, resulting in a reasonably small probability of failure, see [Loy96] for more information.

Three different functional blocks in the UTCSU utilize time/accuracy-stamping: First, trigger signals generated by the decoding logic at CSP transmission/reception (as explained in Section 3.1) sample the current local time/accuracy into dedicated UTCSU registers in the SSU. Six independent SSUs are provided to facilitate fault-tolerant (redundant) communications architectures or gateway nodes. Second, three independent GPUs are provided for timestamping the one pulse per second (1pps) signal (indicating the exact beginning of a UTC second) from up to three GPS receivers. Note that this simple interfacing is sufficient for coupling GPS receivers, since additional and less time-critical information is usually provided via a serial interface and handled off-chip. Finally, nine independent application time/accuracy-stamping inputs are provided by the APU. Note that additional application-related features can be realized off-chip by tapping the 48 bit wide multiplexed NTPA-Bus, which exports the entire local time and accuracy information at full speed.
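As a behavioral illustration of the ACU's self-deteriorating accuracy registers, consider the following C sketch; the accumulator width, names, and API are our own assumptions, only the 16 bit visible accuracy, the wrap-around suppression, and the zero-masking come from the text:

```c
#include <stdint.h>

/* Model of one ACU accuracy register: a 16 bit accuracy value, kept in
 * a wider accumulator, that deteriorates on every oscillator tick by
 * the maximum-drift increment det_step. */
typedef struct {
    int32_t acc;       /* accumulator; acc / 2^16 is the visible value */
    int32_t det_step;  /* per-tick deterioration, det_step >= 0        */
} acu_reg;

static void acu_tick(acu_reg *r)
{
    /* Saturate instead of overflowing: models the extra logic that
     * suppresses a wrap-around of alpha^- and alpha^+. */
    if (r->acc > INT32_MAX - r->det_step)
        r->acc = INT32_MAX;
    else
        r->acc += r->det_step;
}

static uint16_t acu_read(const acu_reg *r)
{
    int32_t alpha = r->acc / 65536;   /* upper 16 bits = accuracy */
    /* Zero-mask a transiently negative accuracy during continuous
     * amortization, mirroring the UTCSU's extra logic. */
    return alpha < 0 ? 0 : (uint16_t)alpha;
}
```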

The above timestamping features are complemented by several 48 bit programmable duty timers accommodated in several functional blocks of the UTCSU. Whenever an armed duty timer goes off (because local time has reached the programmed one), an interrupt is raised. Duty timers are used to execute the protocol for CSP exchange governed by the clock synchronization algorithm, to control continuous amortization, to insert/delete leap seconds, and to generate application-related events. Accordingly, there are many different interrupt sources inside the UTCSU, which can be controlled on an individual basis. They are all (statically) mapped onto three dedicated UTCSU interrupt outputs: INTN (network-related), INTT (timer-related), and INTA (application-related). Since M-Modules provide only a single interrupt line for signaling a vectorized interrupt, the NTI is responsible for further mapping INTN, INTT, and INTA onto a single interrupt line and generating the appropriate interrupt vector.

Last but not least, the UTCSU is equipped with features for test (BTU) and debugging purposes (SNU). These include the calculation of checksums, blocksums, and signatures for local time, snapshots of certain registers to facilitate an experimental evaluation of precision/accuracy, and (re)start operations. Note that such provisions are mandatory for building fault-tolerant applications based on self-checking and/or redundant units.

3.4 NTI Hardware/Software Interface

All accesses to UTCSU registers and NTI memory are performed by addressing the M-Module's memory-space. As explained in Section 3.1, reads/writes by the COMCO require additional logic to provide the timestamping functionality. To distinguish between CPU and COMCO accesses, the CPLD maps two address regions to the same physical memory, as illustrated in Figure 6.


Figure 6: Memory map of the NTI (two 256 KB address regions, selected by high address lines, map to the same physical 2 x 64K x 16 SRAM: one for CPU accesses and one for COMCO accesses; the COMCO region is divided into System Structures, Data Buffers, Receive Headers, and Transmit Headers, while the CPU region is accompanied by a 512 byte UTCSU register segment)

On top of the memory map is the NTI memory's 256 KB address region for CPU accesses, followed by a 512 byte segment containing the UTCSU registers; both are decoded without special functionality. The 256 KB region for COMCO accesses to NTI memory starts at address 0 and is divided into four different sections: the System Structures section holds the command interface and system data structures usually required by the COMCO, and the Data Buffers are available for ordinary packet data. Special functionality applies only to accesses in the Receive Headers resp. Transmit Headers sections, which hold packet-specific control and routing information (e.g. source and destination addresses) for received resp. transmitted CSPs. The CPLD is currently programmed to support Intel's 82596CA Ethernet coprocessor using 64 byte receive and transmit headers. Figure 7 outlines the offsets within each header that need to be supervised here.

Figure 7: Receive and Transmit Header (in the 64 byte receive header, a write to offset 0x1C triggers RECEIVE; in the 64 byte transmit header, a read at offset 0x14 triggers TRANSMIT, with the sampled timestamp and accuracy mapped to offsets 0x18 and 0x20)

When the 82596CA writes offset 0x1C within a receive header upon reception of a CSP, the timestamp trigger signal RECEIVE for the UTCSU is generated. In addition, the base address of the accessed receive header is stored into a dedicated NTI register to facilitate further interrupt processing, as explained below. Similarly, when the 82596CA reads offset 0x14 within a transmit header upon transmission of a CSP, the timestamp trigger signal TRANSMIT is issued to the UTCSU. Since the UTCSU registers holding the sampled time/accuracy-stamp are mapped to the memory addresses with offsets 0x18 and 0x20 in the transmit header, they are transparently inserted into the outgoing data packet; a struct-level sketch of both headers follows below.

In addition to the UTCSU registers and the shared memory, there are also a few registers provided by the NTI itself, primarily for the purpose of interrupt handling. They are accessible via the M-Module's I/O-space according to the memory map in Figure 8. The Receive Header Base register at address 0x00 is required for correctly assigning receive timestamps to data packets: after the UTCSU has sampled a receive time/accuracy-stamp, it must be moved to an unused portion of the appropriate CSP before the next CSP drops in. This could be done in an ISR activated by a RECEIVE transition interrupt, which, however, cannot reliably determine the address of the receive header associated with the sampled timestamp. Therefore, the NTI latches this address into the Receive Header Base register upon the occurrence of the RECEIVE signal.

There are of course alternatives, which, however, do not work in general. For example, one might try to move the timestamp in the packet reception ISR, where the base address of the receive buffer is of course known. Unfortunately, this might be too late to avoid a timestamp loss in case of back-to-back CSPs. Also inappropriate are schemes that try to exploit a sequential order of received packets, since there might be CSPs that trigger a timestamp but are eventually discarded, e.g., due to an incorrect CRC.
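Pictured as C structs, the two supervised 64 byte headers could look as follows; only the offsets (0x14, 0x18, 0x1C, 0x20) and the 64 byte size come from the text, while field names, field widths, and padding are our own illustrative assumptions:

```c
#include <stdint.h>
#include <assert.h>

/* Transmit header as seen by the 82596CA.  Reading the word at offset
 * 0x14 fires TRANSMIT; the UTCSU sample registers are transparently
 * mapped to offsets 0x18 (timestamp; 8 bytes assumed, possibly
 * timestamp plus macrostamp) and 0x20 (accuracy). */
struct tx_header {
    uint8_t  ctrl[0x14];     /* 82596CA command/descriptor data    */
    uint32_t trigger;        /* 0x14: read here samples the stamp  */
    uint32_t timestamp[2];   /* 0x18: mapped transmit timestamp    */
    uint32_t accuracy;       /* 0x20: mapped transmit accuracy     */
    uint8_t  rest[0x40 - 0x24];
};

/* Receive header: the 82596CA's write to offset 0x1C fires RECEIVE;
 * the sampled stamp is copied into the buffer later by the ISR. */
struct rx_header {
    uint8_t  ctrl[0x1C];
    uint32_t trigger;        /* 0x1C: write here samples the stamp */
    uint8_t  rest[0x40 - 0x20];
};

static_assert(sizeof(struct tx_header) == 64, "header must be 64 bytes");
static_assert(sizeof(struct rx_header) == 64, "header must be 64 bytes");
```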



Figure 8: Memory map of the NTI I/O-space (within the 256 byte I/O-space: Receive Header Base at 0x00, Vector (Base) at 0x02, Dis/Enable Interrupt Logic at 0x04, and the S-PROM access byte at 0xFE)

Apart from the access byte to the M-Module's serial PROM at address 0xFE, there are two additional NTI registers controlling interrupt generation: The Vector (Base) register at address 0x02 can be used to program the interrupt vector generated upon a UTCSU interrupt. Note that the final vector also encodes the state of the three UTCSU interrupt pins INTT, INTN, and INTA. Accessing the Dis/Enable Interrupt Logic register at address 0x04 eventually (re)enables NTI interrupts; it is usually written immediately prior to leaving the interrupt service routine.
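Putting the Receive Header Base register and the interrupt registers together, a receive ISR could look roughly as follows; the module base address, all identifiers, and the choice of a free header slot are invented for illustration:

```c
#include <stdint.h>

#define NTI_IO          0xA0000000u   /* hypothetical module I/O base   */
#define NTI_RX_HDR_BASE (*(volatile uint32_t *)(NTI_IO + 0x00))
#define NTI_INT_ENABLE  (*(volatile uint16_t *)(NTI_IO + 0x04))
#define RX_STAMP_SLOT   8             /* assumed free longwords in header */

extern volatile uint32_t utcsu_rx_timestamp;  /* sampled upon RECEIVE  */
extern volatile uint32_t utcsu_rx_accuracy;

/* ISR for the RECEIVE transition interrupt: rescue the sampled stamp
 * into an unused portion of the CSP it belongs to, before a back-to-back
 * CSP can overwrite the UTCSU sample registers. */
void nti_receive_isr(void)
{
    volatile uint32_t *hdr =
        (volatile uint32_t *)(uintptr_t)NTI_RX_HDR_BASE;  /* latched base */

    hdr[RX_STAMP_SLOT]     = utcsu_rx_timestamp;
    hdr[RX_STAMP_SLOT + 1] = utcsu_rx_accuracy;

    NTI_INT_ENABLE = 1;   /* re-enable NTI interrupts on the way out */
}
```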

4 NTI Integration

This section briefly surveys how we are incorporating the NTI and our clock synchronization software into the state-of-the-art industrial multiprocessing/multitasking real-time kernel pSOS+m (Integrated Systems, Inc.). Figure 9 outlines the complete software structure of a node, including the underlying hardware. The entire NTI software is embedded in a pSOS+m add-on that effectively hides clock synchronization from the application tasks. The heart of this add-on is the COMCO driver [Ri97], which actually multiplexes three different interfaces to the COMCO:

1. Kernel Interface (KI): pSOS+m supports multiprocessing by providing remote objects (tasks, queues, semaphores, etc.) that are internally managed via RPCs. To keep the kernel reasonably independent of the particular communications network, a user-supplied KI is required that maps a simple message-passing interface to the particular COMCO.

2. Network Interface (NI): In addition to using kernel services, application tasks can communicate with remote sites via standard TCP/IP sockets if the additional software component pNA+ is present. Like the pSOS+m kernel, pNA+ is kept hardware-independent by means of a user-supplied NI, which is similar to the KI but plugs into a different message-passing interface.

3. Clock Interface (CI): The third component that requires network services is our clock synchronization algorithm. Again, a simple message-passing interface (the CI) is sufficient here. Recall that CSPs sent/received via the CI are timestamped as described in Section 3.1. A sketch of the resulting demultiplexing appears below.

Note that we are of course aware of the fact that customary kernels are quite inadequate for building well-founded distributed real-time systems. We think, however, that one should try to gradually introduce sound concepts into existing systems, rather than persuade industry to discard familiar (ready-to-use but insufficient) technology in favor of a novel (immature but sufficient) one.
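The demultiplexing performed by the COMCO driver can be sketched as follows; the packet-type tags and function names are invented, and the actual driver of [Ri97] certainly differs in detail:

```c
/* The driver fans inbound packets out to its three clients: the pSOS+m
 * kernel interface (KI), the pNA+ network interface (NI), and the clock
 * synchronization interface (CI). */
enum pkt_type { PKT_KI, PKT_NI, PKT_CI };

struct packet {
    enum pkt_type type;
    /* ... payload, and for CSPs the hardware time/accuracy-stamps ... */
};

extern void ki_deliver(struct packet *p);  /* to the pSOS+m RPC layer   */
extern void ni_deliver(struct packet *p);  /* to the pNA+ TCP/IP stack  */
extern void ci_deliver(struct packet *p);  /* to clock synchronization  */

void comco_rx_demux(struct packet *p)
{
    switch (p->type) {
    case PKT_KI: ki_deliver(p); break;
    case PKT_NI: ni_deliver(p); break;
    case PKT_CI: ci_deliver(p); break;  /* CSPs, timestamped in hardware */
    }
}
```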


Figure 9: Software structure of a node (application tasks atop pSOS+m and pNA+, which connect via the KI and NI to the COMCO driver; the clock synchronization component connects via the CI; the driver runs on the CPU and serves the COMCO and the NTI)

Viewed at the level of application tasks, a usual pSOS+m/pNA+ environment is at hand that additionally provides synchronized clocks, solely by means of the application support of the UTCSU-ASIC. In fact, apart from the computing and networking load created, clock synchronization is performed in a way totally transparent to the application. The current version of the COMCO driver has been developed for Motorola's MVME-162 CPU (M68040 CPU + Intel 82596CA Ethernet coprocessor) in conjunction with a passive VME carrier board hosting the NTI MA-Module. Note that some preliminary experiments with a two-node system revealed a transmission/reception time uncertainty $\varepsilon$ well below 1 μs here. A more thorough experimental evaluation, however, will be conducted on a 16 node prototype distributed system consisting of four MVME-162 with four NTIs each, which is currently under development.

As a final remark, we note that a transition to a different hardware only requires redevelopment of the network controller's part of the COMCO driver (written in C) and perhaps some reprogramming of the CPLD on board the NTI. In fact, we are considering a new version of the COMCO driver that supports AcQ's i6040 multiprocessor VME-CPU, which utilizes an M68040 in conjunction with an M68EN360 QUICC network coprocessor and has two MA-slots on board. Note that the i6040 is particularly suitable for our application, since it allows the CPU-32 on-chip the M68EN360 to operate concurrently with the M68040 (avoiding companion mode). Due to the fact that the MA-slots can be accessed from the M68EN360 without disturbing the execution of the M68040, we can hence execute the clock synchronization software completely on the M68EN360.

5 Relation to Existing Work

Despite the large body of related research work on clock synchronization, there are only a few papers that deal with hardware support. In fact, most published papers target a precision in the ms-range and hence restrict their attention to purely software-based approaches. A notable exception is the synchronization scheme of [VRC97], which "sprays" external time obtained via GPS into broadcast-type LANs with a precision/accuracy in the 10 μs-range. However, their software-based a posteriori agreement technique rests on the (quite optimistic) assumption that at least one broadcast among $f + 1$ attempted ones is fault-free.

In [KO87], it was shown that considerably better results can be achieved with any clock synchronization algorithm if some dedicated hardware support is present. In fact, the pioneering Clock Synchronization Unit (CSU) briefly discussed therein allows the construction of synchronized clocks with a precision in the 10 μs-range for broadcast networks. Even better results are claimed for the CSU's successor described in [KKMS95], which is tailored to fieldbusses for automotive applications. The stated worst-case precision of a few μs, however, is only legitimate because granularity effects have been completely ignored in that paper (although a clock with granularity G = 1 μs is utilized!). Actually, our analysis of the (fault-tolerant midpoint-like) orthogonal accuracy convergence function OA in [Sch97b] reveals that clock granularity $G$ and discrete rate adjustment uncertainty $u$ impair the achievable worst-case precision by $4G + 10u$. According to our detailed discussion in [SS97, Sec. 3.1], $u = G_{osc} = 1/f_{osc}$ for the adder-based clock of the UTCSU. This means that, when OA is used, $G = u < 70$ ns ($f_{osc} > 14$ MHz) is required for a worst-case precision below 1 μs; the arithmetic is spelled out after the comparison list below.

Another hardware-assisted clock synchronization scheme for (not necessarily fully connected) point-to-point networks is briefly outlined in [RKS90]. It is also inspired by [KO87] and targets a precision in the 100 μs-range. A full assessment, however, is impossible due to lacking details.

Our NTI also leans on the general hardware architecture proposed in [KO87]. In particular, we utilize the same (DMA-based) method of packet timestamping, although extended by several "uncertainty-saving" engineering improvements. Most importantly, instead of triggering a receive timestamp upon the packet reception interrupt (which occurs only after a whole packet has been received), we generate this trigger when a specific address within the receive buffer is written upon packet reception. In addition, the NTI provides two independently configurable addresses for timestamp triggering and transparent mapping upon packet transmission. This freedom can help in reducing the effects of COMCO architectural peculiarities upon the transmission/reception uncertainty, see Section 3.1.

The apparent similarities between the CSU chip (see [O87] for details) and our UTCSU-ASIC, however, are implied merely by the general requirements put on any hardware support for clock synchronization. In fact, our UTCSU differs from the CSU in many important ways:

- The UTCSU supports a different, i.e. interval-based, clock synchronization paradigm, which requires maintenance of local accuracy intervals, see Section 2.

- Our approach aims at a worst-case precision/accuracy in the 1 μs-range, which demands a considerably improved granularity as well as excellent rate and state adjustment capabilities.

- We employ fundamentally different implementations of the vital functional units on-chip the UTCSU. The strikingly elegant and simple adder-based clock design, for example, surpasses any existing approach we are aware of. This is also true for the unwieldy clock device of [KKMS95], which may be viewed as a concatenation of an adder-based clock and a counter, cf. [SS95].

- The UTCSU provides features like hardware support for continuous amortization and leap second insertion/deletion, which are not found in alternative approaches.

- Last but not least, the tremendous advances in VLSI technology allowed us to overcome the CSU's apparent design limitations. Our UTCSU provides dozens of wide internal registers (64 bit NTP time format + 32 bit accuracy) and many additional units like application timestamping features and interfaces to GPS receivers, which would have been impossible to accommodate in 1987.
Note that there is a notational mistake in the statement of $u_p$ for adder-based clocks in [SS97, p. 185], which should read $u_p = [-G_{osc} + O(G_{osc}^2),\; G_{osc} + O(G_{osc}^2)]$ (valid if $STEP < 2G_{osc}$, i.e. when the nominal clock speed is at most doubled).
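Spelling out the precision budget quoted before the comparison list (pure arithmetic on the stated bound $4G + 10u$, no new assumptions): with $G = u$ the impairment term becomes

\[
4G + 10u = 14G \le 1\,\mu\mathrm{s}
\quad\Longrightarrow\quad
G \le 71.4\,\mathrm{ns}
\quad\Longrightarrow\quad
f_{osc} = 1/G_{osc} \ge 14\,\mathrm{MHz},
\]

which is where the figures $G = u < 70$ ns and $f_{osc} > 14$ MHz come from.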


Relating our NTI to purely hardware-based approaches, in particular phase-locked-loop (PLL) clocks [RSB90] and "stand-alone" GPS receivers, is considerably more difficult. This is due to the fact that the latters' superior accuracy/precision (down to the ns-range!) is primarily a consequence of different system assumptions. For example, apart from the principal difficulty of incorporating propagation delays, one cannot use PLL clocks atop standard data networks. This capability, however, is one of the primary advantages of our approach, cf. Section 1. Similarly, besides the questionable undertaking of always trusting the output of a GPS receiver (a 2-month continuous experimental evaluation of the output of six different GPS receivers that we conducted revealed a wide variety of failures, see [HS97] for details), providing each node of a distributed system with a dedicated GPS receiver is certainly infeasible due to the required "forest" of antennas. Our NTI in conjunction with interval-based clock validation, however, provides a way to escape from both problems by simultaneously increasing the fault-tolerance degree and decreasing the number of GPS receivers required in the system.

6 Conclusions

In this paper, we presented an overview of our Network Time Interface M-Module, dedicated to supporting interval-based external clock synchronization in hardware. Implemented as an M-Module around our novel UTCSU-ASIC, the NTI provides a cheap way of extending state-of-the-art real-time systems technology with fault-tolerant synchronized clocks providing a worst-case precision/accuracy in the 1 μs-range. Apart from the principal advantages of interval-based clock synchronization, our approach thus allows an improvement of at least one order of magnitude over existing approaches. Although we are currently aware of only one particular (but quite interesting and promising) application that requires this high precision/accuracy, we are nevertheless convinced that others will emerge when the enabling technology eventually exists.

References

[Dan97] P.H. Dana. Global Positioning System (GPS) Time Dissemination for Real-Time Applications, J. Real-Time Systems 12(1), January 1997.

[HS97] D. Höchtl, U. Schmid. Long-Term Evaluation of GPS Timing Receiver Faults, manuscript to be submitted, 1997.

[HLS96] M. Horauer, D. Loy, U. Schmid. NTI Functional and Architectural Specification, Technical Report 183/1-69, Department of Automation, Technische Universität Wien, December 1996.

[KKMS95] H. Kopetz, A. Krüger, D. Millinger, A. Schedl. A Synchronization Strategy for a Time-Triggered Multicluster Real-Time System, Proc. Reliable Distributed Systems (RDS'95), Bad Neuenahr, Germany, September 1995.

[KO87] H. Kopetz, W. Ochsenreiter. Clock Synchronization in Distributed Real-Time Systems, IEEE Transactions on Computers, C-36(8), p. 933-939, August 1987.

[Lam87] L. Lamport. Synchronizing Time Servers, Technical Report 18, Digital System Research Center, p. 1-33, 1987.

[Lis93] B. Liskov. Practical uses of synchronized clocks in distributed systems, Distributed Computing 6, p. 211-219, 1993.

[LL84] J. Lundelius, N. Lynch. An Upper and Lower Bound for Clock Synchronization, Information and Control, vol. 62, p. 190-204, 1984.

[Loy96] D. Loy. GPS-Linked High Accuracy NTP Time Processor for Distributed Fault-Tolerant Real-Time Systems, PhD thesis, Faculty of Electrotechnics, Technische Universität Wien, April 1996.

[Mar84] K. Marzullo. Maintaining the time in a distributed system: An example of a loosely-coupled distributed service, PhD thesis, Department of Electrical Engineering, Stanford University, February 1984.

[MM96] Manufacturers and Users of M-Modules (MUMM) e.V. M-Module Specification, 1996.

[Mil91] D.L. Mills. Internet Time Synchronization: The Network Time Protocol, IEEE Transactions on Communications, 39(10), p. 1482-1493, October 1991.

[O87] W. Ochsenreiter. Fehlertolerante Uhrensynchronisation in verteilten Realzeitsystemen, Dissertation, Faculty of Technical and Natural Sciences, Technische Universität Wien, 1987.

[OSF92] OSF. Introduction to OSF DCE, Prentice Hall, Englewood Cliffs, NJ, 1992.

[RKS90] P. Ramanathan, D.D. Kandlur, K.G. Shin. Hardware-Assisted Software Clock Synchronization for Homogeneous Distributed Systems, IEEE Transactions on Computers, 39(4), p. 514-524, April 1990.

[RSB90] P. Ramanathan, K.G. Shin, R.W. Butler. Fault-Tolerant Clock Synchronization in Distributed Systems, IEEE Computer, 23(10), p. 33-42, October 1990.

[Ri97] G. Richter. Driver for Real-Time Communications Coprocessor, Diploma thesis, Department of Automation, Technische Universität Wien, to appear 1997.

[Sch94] U. Schmid. Synchronized UTC for Distributed Real-Time Systems, Proc. IFAC Workshop on Real-Time Programming (WRTP'94), Lake Reichenau, Germany, 1994, p. 101-107.

[Sch95] U. Schmid. Synchronized Universal Time Coordinated for Distributed Real-Time Systems, Control Engineering Practice 3(6), p. 877-884, 1995. (Reprint of [Sch94].)

[Sch97a] U. Schmid (ed.). Special Issue on The Challenge of Global Time in Large-Scale Distributed Real-Time Systems, J. Real-Time Systems 12(1-3), 1997.

[Sch97b] U. Schmid. Orthogonal Accuracy Clock Synchronization, submitted, 1997.

[Sch97c] U. Schmid. Interval-based Clock Synchronization with Optimal Precision, manuscript to be submitted, 1997.

[Scho97] K. Schossmaier. An Interval-Based Framework for Clock Rate Synchronization, to appear in Proc. 16th ACM Symposium on Principles of Distributed Computing (PODC'97), 1997.

[SL96] K. Schossmaier, D. Loy. An ASIC Supporting External Clock Synchronization for Distributed Real-Time Systems, Proc. 8th Euromicro Workshop on Real-Time Systems, June 1996.

[SS95] K. Schossmaier, U. Schmid. UTCSU Functional Specification, Technical Report 183/1-56, Department of Automation, Technische Universität Wien, July 1995.

[SS97] U. Schmid, K. Schossmaier. Interval-based Clock Synchronization, J. Real-Time Systems 12(2), March 1997.

[SSHL97] K. Schossmaier, U. Schmid, M. Horauer, D. Loy. Specification and Implementation of the Universal Time Coordinated Synchronization Unit (UTCSU), J. Real-Time Systems 12(3), May 1997.

[SWL90] B. Simons, J. Lundelius-Welch, N. Lynch. An Overview of Clock Synchronization, in B. Simons, A. Spector (eds.): Fault-Tolerant Distributed Computing, Springer Lecture Notes in Computer Science 448, p. 84-96, 1990.

[Tro94] G.D. Troxel. Time Surveying: Clock Synchronization over Packet Networks, PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1994.

[VRC97] P. Verissimo, L. Rodrigues, A. Casimiro. CesiumSpray: a Precise and Accurate Global Clock Service for Large-scale Systems, J. Real-Time Systems 12(3), May 1997.

[YM93] Z. Yang, T.A. Marsland. Annotated Bibliography on Global States and Times in Distributed Systems, ACM SIGOPS Operating Systems Review, p. 55-72, June 1993.
