HIGHLY EFFICIENT MULTI-POINT CLOCK DISTRIBUTION NETWORKS
Rui L. Aguiar, Dinis M. Santos
Dpt. Electrónica e Telecomunicações, Universidade de Aveiro / Instituto de Telecomunicações, Campus de Santiago, 3810-193 Aveiro, Portugal
Tel: +351.34.370200, Fax: +351.34.381128. Email: [email protected], [email protected]
ABSTRACT: This paper presents techniques for top-level high-speed clock distribution in large VLSI circuits. The techniques described resort to feedback mechanisms on the clock path, using controlled delay lines. These techniques allow the synchronization of a large number of top-level domains without extra interconnection lines, with clear performance improvements over other proposals.
I. INTRODUCTION Clock distribution has become an increasingly large problem as CMOS technology evolves, due both to the effective increase in circuit size and to the increasing importance of interconnect in VLSI circuits [1]. Increasing clock rate requirements place further constraints on the acceptable jitter and skew values for the clock. Note that both skew differences and jitter affect the relative clock phase between domains, with similar implications from the designer’s point of view. Traditional clock distribution techniques [2] are based on the minimization of global mismatches across the circuit, implementing distribution structures (clock trees) that are as geometrically (and electrically) matched as possible. Modern developments extend these principles to the clock distribution elements themselves, analyzing issues such as optimization of the number of repeaters, sizing of buffers and matching of loads in the different branches of the clock network. Nevertheless, these techniques are being challenged at the high-performance end by methods applying feedback in the clock distribution network [3-5]. These feedback techniques present several disadvantages in comparison with classic clock trees, with added trade-offs in terms of physical layout (increased number of interconnections [3-4]) and scalability (small number of controllable branches [5]), besides the centralized control unit required by the feedback concepts used. This paper presents novel evolutions of these feedback techniques for clock distribution, with improved performance in terms of clock phase precision and with lower interconnect requirements, although still with the obvious requirement for a control block in the feedback path. The paper is organized as follows. Section II presents the basic features of clock distribution with feedback mechanisms. Section III shows how to improve this method with proper placement of the control block, for nearby synchronization domains, while Section IV discusses extensions to larger synchronization areas. Section V presents the major conclusions of this work.
II. FEEDBACK IN THE CLOCK PATH The high-performance feedback-based clock distribution topologies recently proposed [3-5] resort to DLLs (delay-locked loops). For simplicity, we will discuss these DLLs as composed of an (analogue) voltage controlled delay line (VCDL) and a phase detector (PD) in the control unit. The basic idea is to distribute the clock to specific synchronization points, each being the input to a different synchronization domain (SD) (such as an ALU, a cache controller, a register unit, etc…). The propagation delay inside these domains is smaller than the targeted clock phase precision, and is ignored in a first approximation. In consequence, this clock distribution mechanism defines clock skew (and jitter) values across the whole circuit. Fig. 1 illustrates one such principle of clock distribution, where two different domains (SD A and SD B) are driven by the signals derived from two DLLs. Matched interconnection lines (possibly with repeaters), with delays τ and τ’, establish the connections to each domain. Due to its feedback nature, each DLL will present a delay such that the clock (ClkA and ClkB) at the “controlled point” will be in phase with the reference clock. Thus the DLL midpoints Φ1 and Φ'1 are in phase, if the total delay in both DLLs has the same parity [3]. This provides a mechanism for generic multi-point clock distribution, synchronizing multiple SDs with a precise reference instant, using a DLL for each domain and placing all control blocks near the clock reference point. The advantage of this approach over traditional clock trees [2] is apparent when global parameter variation in CMOS technologies is considered. In a typical 0.8 µm process the propagation delay of a lightly loaded inverter can vary between 0.9 ns and 1.5 ns, and typical clock interconnections have several such inverters. Even if parameter-invariant design is used [2], skew variations of 10% of the total interconnection delay can be expected.
Traditional clock trees would present phase errors dependent on the maximum parameter mismatch between separated and unmatched interconnections. In high-speed systems, the clock interconnections are reasonably long, subject to large parameter variations, and therefore uncontrolled distribution techniques may lead to clocks with large phase imprecision. Although exact performance analysis would resort to yield considerations and three-sigma variation analysis, simpler bounds on clock uncertainty between all domains can be used for practical design purposes. Thus, with the approach depicted in Fig. 1, SDs A and B will present a maximum clock phase error θ subject to: i) the quality of the PD, e.g. its maximum phase error ε; ii) the local (un)matching of the interconnection lines δ; and iii) the matching ∆ of the clock drivers RA and RB attacking the registers in each SD (all these values are measured as time differences). These effects are cumulative, in terms of clock imprecision. Thus:
θ = ∆ + δ/2 + 2*ε (1)
Fig. 1 – Synchronization strategy with feedback in the clock distribution. (Figure not reproduced. Labels: reference Clock; Control 1 and Control 2 with control voltage Vc; VCDL1 and VCDL2 built from delay elements DE; midpoints Φ1 and Φ’1; matched lines τ; drivers RA and RB; controlled clocks ClkA and ClkB; domains SD A and SD B.)
Fig. 2 – Synchronization strategy: basic topology. (Figure not reproduced. Labels: reference Clock; synchronization block SB containing one DLL per domain; domains SD1 to SD5; registers A1, B1, C1 and A5.)
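The locking behaviour described in this section (each DLL settles on a delay that puts its controlled point in phase with the reference clock) can be illustrated with a small behavioural sketch. This is an idealized first-order loop, not the authors' circuit: the function names, the loop gain, the 0 to 2 period tuning range and all delay values are assumptions chosen only for illustration.

```python
def wrap_phase(offset, period):
    """Wrap a time offset into [-period/2, period/2)."""
    return (offset + period / 2) % period - period / 2


def lock_dll(tau, period, d_init, gain=0.25, steps=200):
    """Idealized first-order DLL (hypothetical model, not the authors' circuit).

    tau    : fixed, unknown interconnect delay up to the controlled point
    d_init : starting delay of the controlled delay line (VCDL)
    Returns the settled VCDL delay and the residual phase error.
    """
    d_max = 2 * period  # assumed VCDL tuning range: 0 .. 2 periods
    d = d_init
    for _ in range(steps):
        # Phase detector: offset of the controlled clock from the reference.
        err = wrap_phase(d + tau, period)
        # Control loop: shrink the delay when the clock is late, grow it when early.
        d = min(max(d - gain * err, 0.0), d_max)
    return d, wrap_phase(d + tau, period)


# Two domains with different (assumed) interconnect delays, in ns.
PERIOD = 10.0
d_a, err_a = lock_dll(tau=3.2, period=PERIOD, d_init=5.0)
d_b, err_b = lock_dll(tau=4.7, period=PERIOD, d_init=5.0)
print(f"SD A: VCDL delay {d_a:.3f} ns, residual error {err_a:.2e} ns")
print(f"SD B: VCDL delay {d_b:.3f} ns, residual error {err_b:.2e} ns")
```

In this ideal model both loops absorb their own interconnect delay, so ClkA and ClkB end up aligned with the reference even though the two paths differ. A real PD leaves a residual error of up to ε per loop, which is where the 2*ε term of eq. (1) comes from.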
III. BASIC TOPOLOGY The first modification of the concepts described in the previous section also resorts to DLLs, but placed in the control block. The controlled clock for each synchronization domain is the clock that attacks the flip-flops, regardless of the possible implementation of individual clock distribution networks inside each domain. Thus the controlled point directly connected to the PDs is the actual clock at the registers, and not a distribution-imposed point. In terms of Fig. 1, the clock signals ClkA and ClkB become the actual clock signals inside the SDs. As a result, this simple improvement removes both the δ and ∆ factors from the clock phase error between domains. The phase error resulting from such a distribution mechanism is simply:
θ = 2*ε (2)
This basic topology is illustrated in Fig. 2, and requires all domains to be topologically near a common central control unit. Fig. 2 presents five different SDs, with independent internal clock trees (exemplified in SD1 and SD5). A small (Fig. 2 is not to scale) synchronization block (SB) “covers” all five domains. Note that the propagation delay of any interconnect in this SB is negligible when compared with the desired clock phase accuracy. The SB contains one DLL for each domain. All DLLs take the common clock reference signal on one input, and the other input (the “controlled point”) is connected to the clock line of a register in the controlled domain. Fig. 2 exemplifies this situation: registers A1 and A5 are the registers nearest to the SB in their domains, and thus each is connected to its respective DLL in the SB. Therefore the same clock signal that is driving these DLLs will be imposed on these registers (due to the timing adjustments made by each DLL). Note that these are the clock signals actually driving registers in two quite distinct domains, regardless both of the clock distribution mechanism used by each domain and of the propagation delay differences between their specific clock distribution networks. The phase errors given by eq. (2) can be very small, depending naturally on the PD characteristics. For good PDs, this value can be under 100ps. In these cases, the effective maximum phase error ξ inside each domain should not be neglected, and has to be considered in these equations. The phase errors in each domain will depend on the (logical and parametrical) matching of their own distribution mechanisms, and may be comparable with the phase error (2ε) introduced by the SB between some registers in different domains. In the example in Fig. 2, the phase error between registers B1 and A5 (in different domains!) depends both on the PD characteristics and on the characteristics of the domain SD1 (how well designed its clock distribution is). The fact that A5 is in a different synchronization domain accounts only for the PD-dependent phase error.
In summary, when a floorplanning solution is feasible that creates a small area for an SB near a group of domains, then this approach assures that the registers in these domains will be driven in phase, with a very small phase error, depending on the phase detector characteristics of the DLLs [6-7]. This is assured without any extra interconnect lines, and independently of the size (and total delay) of the clock distribution network in each domain. This method can present phase errors so small that the matching qualities of the internal clock distribution network of each domain will be relevant for the overall variation of clock phase across the whole circuit. If each synchronization domain assures a phase error lower than ξ between its registers, then the whole system will assure, between any registers in any domain, a maximum phase error of:
θ = 2*ξ + 2*ε (3)
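As a worked illustration of how eqs. (1) to (3) compare, take the following purely hypothetical values (none of them come from measurements in this paper): ε = 40 ps, δ = 100 ps, ∆ = 60 ps and ξ = 20 ps.

```latex
\begin{align*}
\theta_{(1)} &= \Delta + \delta/2 + 2\varepsilon = 60 + 50 + 80 = 190~\text{ps} \\
\theta_{(2)} &= 2\varepsilon = 80~\text{ps} \\
\theta_{(3)} &= 2\xi + 2\varepsilon = 40 + 80 = 120~\text{ps}
\end{align*}
```

Under these assumed numbers the basic topology removes the 110 ps contributed by ∆ and δ/2, and the intra-domain error ξ becomes a visible share of the remaining budget, which is the point made in the summary above.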
IV. LARGE DOMAINS In the previous section, the clock distribution problem is divided into multiple smaller domain-level distribution problems (possibly handled by different design teams [6]) when the domains are “adjacent”. This neighborhood condition is not always met, especially in very large circuits. Nevertheless, the principles above can be extended to situations where the domains are not adjacent. There are basically two extension strategies, depending on the relative evaluation of the mismatches in the clock network inside each clock domain versus the interconnect facilities in the circuit.
a) global interconnect channels available The simplest strategy is to divide the floorplan into groups of nearby SDs, and place a synchronization block in each neighborhood group, according to the principles described in the previous section. These SBs can subsequently be synchronized using the general methods summarized in section II and described in [3, 6]. Fig. 3 exemplifies this situation with two SBs, interconnected by a wide line. This “line” (actually two interconnect lines) provides a matched feedback path between both blocks, and SB1 imposes the correct clock phase at the reference point in SB2. Note that in this case, the total delay in this “control” DLL has to be an even multiple of the clock period. This strategy is primarily useful for first-level clock distribution, where some dozen large blocks are being synchronized.
Fig. 3 – Synchronization strategy: multiple synchronization blocks. (Figure not reproduced. Labels: reference Clock; synchronization blocks SB1 and SB2; domains SD1 to SD10.)
Maximum clock imprecision across the whole circuit is given by:
θ = 2*ξ + 2*ε + ∆ + δ/2 (4)
where ξ is the maximum phase error inside an SD, ε is the PD-induced phase error, δ is the local (un)matching of the interconnection lines between both SBs and ∆ the matching of the buffers driving each SB. It seems clear from the comparison between (1) and (4) that this method does not present phase advantages when compared with the basic structure depicted in Fig. 1. However, it requires far fewer interconnections than that initial system. Although the phase error expressions have been presented in a conceptual and concise mode, it is clear that their precise formulation would require all the bounds in the expressions to be replaced by the maximum values of each set for all domains. Thus “ξ” would refer to the maximum value from all the phase errors inside all SDs, “∆" would refer to the maximum value from all the differences between the matching of all clock drivers, and so on. Taking this aspect into consideration, although the factors in expression (4) are similar to those of eq. (1), these terms should be expected to present lower values in this case, as fewer interconnections will be required. b) good matching in the internal clock domain distribution network This strategy relies on the assumption of negligible phase delay between all registers inside a clock domain (where negligible means much smaller than the desired clock phase accuracy). Thus the main clock delays to be compensated are the different delays of each clock distribution network. This is easily compensated by a DLL at the input of each block. Fig. 4 depicts this situation. Each domain, as it becomes further apart from the clock reference point, derives its clock from a nearby register in another domain, and compensates its own distribution delay by a DLL at the clock input point.
In the figure, the clock for domain SD6 (controlled at register A6) is set equal to the clock driving register B1 (by the DLL at the input of SD6), which has the same phase as register A1 (by assumption), which has the same phase as the reference clock (by the DLL in the SB). The same process is applied to domain SD7, with the clock derived from domain SD6.
Fig. 4 – Synchronization strategy: adjacent synchronization. (Figure not reproduced. Labels: reference Clock; synchronization block SB1; domains SD1 to SD7; registers A1, B1, C1, A6, B6 and A7.)
Total clock imprecision can be expressed as the sum of the PD-induced phase errors across the N hierarchical levels. Including the non-zero phase error inside each domain, total clock phase error can be bounded by:
θ = (N+1)*(ξ + ε) (5)
This strategy is suited to small clock domains, possibly with many other adjacent domains, all deriving their own clock reference from a couple of reference domains. These domains can be treated as equiphase domains, with negligible phase difference between their registers. Note that any phase error inside an SD can be propagated across all interconnected domains. In terms of design-rule orientations, this cumulative mismatch has to be treated as a potential event in the clock distribution, and thus this method should not be used when ξ has a non-negligible value. This approach provides a different view on the concept described in section III. An immediate evolution of that strategy, along the lines described in the above paragraphs, is the complete withdrawal of the synchronization block. If all domains are implemented with a DLL at the input of their clock line in order to compensate distribution delays, then the choice of the reference signal is merely a design decision, based on issues such as jitter, skew or clock propagation delay (in Fig. 2 this would correspond to the placement of the DLLs for domains SD1 to SD5 inside each domain; the SB would then be a simple line interconnecting these nearby DLLs). Expression (5) reduces to eq. (3) in this case, which corresponds to N = 1.
Naturally these two strategies for large domains present problems, and cannot be applied without some care. Implementation problems are due mainly to the jitter introduced in the clock signal (see e.g. [7]), which will depend on the DLL and on the global clock path up to the clock reference point. Nevertheless, these approaches can be designed targeting very low clock imprecision values.
V. CONCLUSIONS
The paper discusses novel clock distribution mechanisms based on feedback in the clock distribution network. Several strategies are presented, and their relative advantages are quantified. These strategies do not pose extra interconnect overhead or sizing limitations to designs, and can be merged depending on design considerations. Several simulations have been run to illustrate these mechanisms with typical DLLs [3,7]. Some care was required on the convergence stage, as this presents a period when the clock signals can present very large phase differences. With that situation covered, the results achieved show that these mechanisms present better results than traditional clock trees.
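The trade-off between the two large-domain strategies of Section IV can be made concrete by comparing the bounds of eqs. (4) and (5). The sketch below is only an illustration: the function names and the numerical values (in ps) are assumptions, not results from the simulations mentioned above.

```python
def multi_sb_bound(xi, eps, big_delta, delta):
    """Eq. (4): bound when several SBs are linked by matched interconnect."""
    return 2 * xi + 2 * eps + big_delta + delta / 2


def chained_bound(xi, eps, levels):
    """Eq. (5): bound when domains are chained over `levels` hierarchical levels."""
    return (levels + 1) * (xi + eps)


def max_chained_levels(xi, eps, big_delta, delta):
    """Largest N for which eq. (5) does not exceed eq. (4)."""
    if xi + eps <= 0:
        raise ValueError("xi + eps must be positive for the comparison to be bounded")
    budget = multi_sb_bound(xi, eps, big_delta, delta)
    levels = 0
    # Keep adding levels while the chained bound still fits the multi-SB budget.
    while chained_bound(xi, eps, levels + 1) <= budget:
        levels += 1
    return levels


# Hypothetical values in ps (assumptions for illustration only).
XI, EPS, BIG_DELTA, DELTA = 20, 40, 60, 100
print("eq. (4) bound:", multi_sb_bound(XI, EPS, BIG_DELTA, DELTA), "ps")
print("eq. (5) bound, N = 1:", chained_bound(XI, EPS, 1), "ps")
print("deepest chain within the eq. (4) budget:",
      max_chained_levels(XI, EPS, BIG_DELTA, DELTA), "levels")
```

With these assumed numbers, chaining stays within the multi-SB bound only for shallow hierarchies, which is consistent with the recommendation to avoid the chained strategy when ξ is not negligible.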
REFERENCES
[1] Yuan Taur, Douglas Buchanan, et al., "CMOS scaling into the Nanometer Regime", Proceedings of the IEEE, v.85, n.4, pp.486-503, Apr 1997.
[2] Eby G. Friedman, "Introduction - Clock Distribution Networks in VLSI Circuits and Systems", Editor paper in the IEEE reprint Clock Distribution Networks in VLSI Circuits and Systems, IEEE Press, 1995, pp.1-36.
[3] Rui L. Aguiar, Dinis Santos, "Wide-Area Clock Distribution Using Controlled Delay Lines", 5th IEEE International Conference on Electronics, Circuits and Systems, ICECS’98, pp.63-66, Lisbon, Sep 98.
[4] H. Sutoh, K. Yamakoshi, "A Clock Distribution Technique with an Automatic Skew Compensation Circuit", IEICE Trans. on Electronics, E81-C, n.2, Feb. 98.
[5] G. Geannopoulos, X. Dai, "An Adaptive Digital Deskewing Circuit for Clock Distribution Networks", IEEE International Solid-State Circuits Conference, ISSCC’98, pp.400-401, 1998.
[6] Rui L. Aguiar, D. Santos, "Clock Distribution Strategy for IP-Based Development", in VLSI: Systems in a Chip, L.M Silveira, S. Devadas and R. Reis (editors), Chapman & Hill, ISBN-0-79237731-1, pp.181-191, 1999.
[7] Rui L. Aguiar, Dinis Santos, "Modelling Chargepump Digital Delay Locked Loops", 6th IEEE International Conference on Electronics, Circuits and Systems, ICECS’99, Cyprus, Sep 99.