An Error-Correcting Unordered Code and ... - DATE Conference

An Error-Correcting Unordered Code and Hardware Support for Robust Asynchronous Global Communication∗

Melinda Y. Agyekum, Steven M. Nowick
Department of Computer Science, Columbia University, New York, NY, 10027
Email: {melinda, nowick}@cs.columbia.edu

Abstract. A new delay-insensitive data encoding scheme for global asynchronous communication is introduced. The goal of this work is to combine the timing-robustness of delay-insensitive (i.e., unordered) codes with the fault-tolerance of error-correcting codes. The proposed error-correcting unordered (ECU) code, called Zero-Sum, can safely accommodate arbitrary skew in arrival times of individual bits in a packet, while simultaneously providing 1-bit correction and 2-bit detection. A systematic code is targeted, where data can be directly extracted from the codewords. A basic method for generating the code is presented, as well as detailed designs for the supporting hardware blocks. An outline of the system micro-architecture and its operating protocol is also given. When compared to the best previous systematic ECU code, the new code provides a 5.74 to 18.18% reduction in transition power for most field sizes, with better or comparable coding efficiency. Pre-layout technology-mapped implementations of the supporting hardware (encoder, completion detector, error-corrector) were synthesized with the UC Berkeley ABC tool using a 90nm industrial standard cell library. Results indicate that they have moderate area and delay overheads, while the best non-systematic ECU codes have 3.82 to 10.44x greater area for larger field sizes.

1. Introduction

As digital systems grow in complexity, the challenges of design reuse, scalability, power and reliability continue to grow at a rapid pace [21]. These parameters are expected to become major unsolved bottlenecks in less than a decade. A major focus of recent strategies for organizing such systems is the use of networks-on-chip (NoCs), which support the orthogonalized development of computation blocks (e.g., cores) and the communication fabric [16]. One promising direction of research has been to explore the use of asynchronous global communication, to provide flexibility in system integration, as well as dynamic power which adapts on demand to the current traffic [3, 20, 21]. Such systems can be entirely asynchronous, or use a hybrid combination of synchronous computation blocks interconnected by asynchronous channels, thus forming a globally-asynchronous locally-synchronous (GALS) system [10, 27].

The contribution of this paper is a new systematic error-correcting unordered (ECU) code, called Zero-Sum, which supports reliable asynchronous global communication. This code simultaneously targets two types of robustness: timing-robustness, involving resilience to the arrival times and orders of individual bits in a packet on a channel; and fault-robustness, providing detection and correction capabilities for hard and soft errors. Key additional goals for the Zero-Sum code are low transition power and high coding efficiency. An additional feature of the code is that it is "systematic", where data appears in unaltered form in each codeword, and can be directly extracted by the receiver without any decoding hardware. When compared to the best previous systematic ECU code, the new code provides significant reduction in transition power for most field sizes, with better or comparable coding efficiency. In addition, designs for three key hardware support blocks are presented, including an encoder unit (for the sender) and a completion detector and error correcting unit (for the receiver). An outline of the overall micro-architecture and system-level asynchronous communication protocol is also provided.

∗ This work was supported by NSF Award No. CCF-0811504 and by a 2008-2009 Intel Foundation PhD Fellowship.

978-3-9810801-6-2/DATE10 © 2010 EDAA

2. Background and Related Work

Background on asynchronous communication and error correction is now briefly reviewed.

2.1 Point-to-Point Asynchronous Communication

The method outlined in this paper assumes point-to-point communication [3, 20, 17, 29] between a sender and a receiver. Three key features of asynchronous point-to-point communication are described below.

(a) Asynchronous Communication Channels. An asynchronous communication channel [29] is the means by which information is transmitted. Fig. 1(a) gives an example of point-to-point communication. Abstractly, the sender provides a request output signal (REQ) to the receiver; the receiver in turn provides an acknowledgment input signal (ACK) to the sender. If the sender passes actual data to the receiver (rather than providing simple control synchronization), the REQ is typically replaced by the encoded data, as shown in the figure. The ACK indicates data has been received by the receiver and new data can be sent [29].

Figure 1. Point-to-point asynchronous communication: (a) block diagram of model (sender, receiver, data wires = req, ack); (b) four-phase RZ protocol (evaluate operation, reset operation) on REQ and ACK.

(b) Four-Phase Communication Protocol. Given an asynchronous communication channel, a protocol is needed to transfer information from sender to receiver. The most widely-used protocol is four-phase or return-to-zero (RZ) [3, 11, 29]. As illustrated in Fig. 1(b), the protocol has two operations: (1) evaluate and (2) reset. During the evaluate operation, the sender first indicates the start of an event by issuing a rising REQ+ to the receiver. Once the data has been received, the receiver asserts an ACK+. At this point, the reset operation begins. The sender de-asserts its request (REQ-) and, in turn, the receiver de-asserts its acknowledgment (ACK-), which is the final event of the reset operation and of the four-phase transaction.
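The event ordering of the four-phase protocol can be illustrated with a minimal trace checker. This is only a sketch: the event names and the function `is_valid_four_phase` are hypothetical labels chosen here, and timing, data wires and the time-out mechanism are not modeled.

```python
# Event order of one return-to-zero transaction, as described in the text:
# evaluate (REQ+, ACK+) followed by reset (REQ-, ACK-).
FOUR_PHASE_ORDER = ["REQ+", "ACK+", "REQ-", "ACK-"]


def is_valid_four_phase(trace):
    """Return True if `trace` is a sequence of complete four-phase transactions."""
    if len(trace) % len(FOUR_PHASE_ORDER) != 0:
        return False  # a transaction was left unfinished
    return all(
        event == FOUR_PHASE_ORDER[i % len(FOUR_PHASE_ORDER)]
        for i, event in enumerate(trace)
    )


if __name__ == "__main__":
    print(is_valid_four_phase(["REQ+", "ACK+", "REQ-", "ACK-"]))
    print(is_valid_four_phase(["REQ+", "REQ-", "ACK+", "ACK-"]))
```

The second trace is rejected because the sender resets REQ before the receiver has acknowledged the data.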

An alternative to the return-to-zero protocol is a two-phase or non-return-to-zero (NRZ) protocol [12, 17], which avoids the reset phase; however, codes designed for an NRZ protocol (e.g. LEDR [12]) typically have huge overheads and require complex circuit structures. Therefore, a four-phase asynchronous communication protocol is assumed throughout this paper.

(c) Delay-Insensitive Codes. When asynchronous communication is used, as shown in Fig. 1(a), data must be suitably encoded so that the receiver can identify when a packet has been received. Delay-insensitive (DI) codes [1, 23, 29] (i.e. unordered codes [6, 7, 9]) are insensitive to propagation delays on individual bits in a codeword.¹ In an asynchronous system, these codes have an inherent timing-robustness, where data on individual wires can arrive in any order and at any time. Their key property is that no valid codeword is covered by another (i.e., no valid codeword is ever a proper subset of another valid codeword). Therefore, when used for asynchronous transmission, the codewords themselves identify their own validity. The formal definition of an unordered code is as follows:

Definition 1 (Unordered Code [8]) A codeword X = x1 x2 ...xn covers another codeword Y = y1 y2 ...yn if and only if, for each bit position i, if yi = 1 then xi = 1. In this case, Y is covered by X, or Y ≤ X. Codewords X and Y are unordered if X ≰ Y and Y ≰ X. A code, C, is unordered iff each pair of codewords in C is unordered.

Example. Given codewords X = 001, Y = 100, Z = 011, the unordered (delay-insensitive) pairs are {X, Y} and {Y, Z}. X and Z are not unordered, since X is covered by Z.

There are two classes of delay-insensitive codes: systematic and non-systematic. A systematic code [15, 29] contains two types of fields: (1) a data (or information) field which contains the original data bits, and (2) a check field. For asynchronous communication, the check field provides extra bits to guarantee that the entire code is delay-insensitive. Common types of systematic codes are Berger [4] and Knuth [29] codes. Two potential benefits of systematic codes are: (i) ease of data extraction, where no hardware decoders are necessary since the original data appears directly in the codeword; and (ii) generally more compact codes, since the check field is typically logarithmic in the size of the data field, often resulting in improved coding density (# of wires per bit) and transition-power (# of wire-flips per bit per transaction).

In contrast, in non-systematic codes [3, 17, 29], there are no separate data and check fields. Instead, data is encoded in a unified field, which ensures delay-insensitivity. Common examples include dual-rail (i.e., 1-of-2), 1-of-4 and the general class of m-of-n codes [3, 29]. For the simplest cases (e.g. dual-rail), data extraction is trivial, through a simple mapping. However, for general m-of-n codes, complex decoding hardware is typically required, unlike the case of systematic codes. A benefit of the simpler non-systematic codes (e.g., dual-rail, 1-of-4) is that they have more compact completion detectors, but this advantage no longer holds for arbitrary m-of-n codes (see Section 5.2).

¹ Although the terms DI and unordered are used interchangeably, DI refers to dynamic behavior during transmission in an asynchronous system, while "unordered" implies a static property of the codes (i.e. they provide error detection but not correction) in their use in synchronous systems.
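The covering relation of Definition 1 can be checked mechanically. The following is a minimal sketch, assuming codewords are given as equal-length strings of '0' and '1'; the function names are chosen here for illustration and do not come from the paper.

```python
from itertools import combinations


def covers(x, y):
    """True if codeword x covers y: wherever y has a 1, x also has a 1."""
    if len(x) != len(y):
        raise ValueError("codewords must have equal length")
    return all(xb == "1" for xb, yb in zip(x, y) if yb == "1")


def unordered(x, y):
    """True if neither codeword covers the other."""
    return not covers(x, y) and not covers(y, x)


def is_unordered_code(code):
    """True if every pair of distinct codewords in `code` is unordered."""
    return all(unordered(x, y) for x, y in combinations(code, 2))


if __name__ == "__main__":
    X, Y, Z = "001", "100", "011"
    print(unordered(X, Y), unordered(Y, Z), unordered(X, Z))
    print(is_unordered_code(["01", "10"]))  # dual-rail encoding of one bit
```

For the example codewords, the pairs {X, Y} and {Y, Z} are reported as unordered, while X and Z are not, because Z covers X.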

2.2 Hamming Error Correction Codes

One widely-used type of error-correcting code is the Hamming code [14]. As with systematic codes, Hamming codes have two fields: data (information) and check (parity) bits. Each parity bit provides error coverage for a unique subset of information bits, called a parity group: it is set to the appropriate value to make its parity group even [14].

2.3 Related Work

The topic of providing reliable point-to-point communication has been studied in both the synchronous and asynchronous domains. Within the synchronous domain, there has been a substantial body of work that explores general techniques for fault-tolerant global communication. Particular areas of interest include low-power bus encoding [25, 13] and retransmission [5] strategies, as well as crosstalk avoidance and mitigation [13, 22, 30]. However, these methods do not target error-correcting unordered codes, which are the focus of this paper.

Within the asynchronous domain, there have been several loosely-related approaches, in the areas of fault tolerance and global bus encoding. Several techniques have been proposed to design self-checking hardware to attach to robust asynchronous components [2, 19, 23], where function blocks are assumed to be already designed using existing unordered codes (including dual-rail, m-of-n and Berger). This work does not consider ECU codes. Other approaches have been proposed for time-dependent codes (i.e. non-DI) [20] for global communication, as well as specialized work on the design of completion detectors for existing unordered (i.e. non-error correcting) codes [1].

Closest to our work, there have been a limited number of proposed approaches for ECU codes, combining both error correction and unordered properties for global asynchronous communication [11, 6]. Cheng et al. [11] proposed a basic systematic ECU code for asynchronous systems which supports 1-bit error correction, but only 1-bit error detection. The approach adds a parity bit, then creates a replicated but inverted copy of the entire dataword, which is then appended. The resulting codes have poor coding efficiency and high transition power. Blaum et al. [6] present three distinct ECU codes, with the systematic Code 1 being the most general, which provides 1-bit correction and 2-bit detection. Two extra fields are appended to a dataword, one for error correction and one to ensure an unordered code. In contrast, the proposed ECU code combines both features into a unified appended field.

3. A New ECU Code

Introduction. A new class of unordered ECU codes, called Zero-Sum, is now presented. These codes are designed to simultaneously combine properties of delay-insensitivity with error correction. These codes loosely adopt concepts from Hamming [14] and Berger [4]. In particular, as in Berger codes, the pattern of 0 bits in the dataword is used to create the DI field. However, while the Berger method counts the number of 0's, the Zero-Sum approach adds the 0 bit indices. Similarly, Zero-Sum adopts the bit index numbering scheme used in Hamming codes, but the former adds extra bits. Details of the approach are now presented.

Overview of Code. Figure 2 shows examples of the Zero-Sum ECU code for 2-, 3-, and 4-bit information fields. A Zero-Sum code has two fields: (1) dataword and (2) check bits. Each bit position is assigned an index. The check bit indices are powers of two (for non-negative exponents), and the dataword indices are the remaining positive integers (i.e. those that are not powers of two), assigned consecutively in increasing order. For example, given a 3-bit dataword field, the indices of the check bits are 1, 2, 4, 8. The 3-bit dataword indices are the consecutively assigned integers 3, 5, and 6. Note that not all possible data bit indices are assigned. For example, in the case of the 3-bit dataword field, an index value of 7 is left unassigned.

a. 2-bit ECU code

  dataword (indices 5 3)    check bits (indices 8 4 2 1)
  00                        1000
  01                        0101
  10                        0011
  11                        0000

b. 3-bit ECU code

  dataword (indices 6 5 3)  check bits (indices 8 4 2 1)
  000                       1110
  001                       1011
  010                       1001
  100                       1000
  011                       0110
  101                       0101
  110                       0011
  111                       0000

c. 4-bit ECU code

  dataword (indices 7 6 5 3)  check bits (indices 16 8 4 2 1)
  0000                        10101
  0001                        10010
  0010                        10000
  0100                        01111
  1000                        01110
  0011                        01101
  0101                        01100
  1001                        01011
  0110                        01010
  1010                        01001
  1100                        01000
  0111                        00111
  1011                        00110
  1101                        00101
  1110                        00011
  1111                        00000

Figure 2. Examples of Zero-Sum ECU code

Figure 3. Block-level system micro-architecture. Sender: the dataword feeds an encoder (Check Bit Generator), which produces the check bits. The dataword and check bits cross the asynchronous communication channel. Receiver: a completion detector (CD), which produces ACK, and an Error Correction Unit, which outputs the corrected dataword and corrected check bits (to next stage).

Check Field Generation. In a Hamming code, the check field is a concatenation of individual parity bits, one for each parity group. In contrast, in a Zero-Sum ECU code, the check field for a given dataword is derived using a mathematical operation: it is the binary representation of the arithmetic sum of the dataword indices whose bits are 0, hence the name of the codes, Zero-Sum.

Formal Calculation of Code Length. The check field must also be large enough to support the binary representation of the sum of all of the data field indices, i.e. to handle the extreme case where all data bits are 0. Therefore, the total number of check bits allocated is $\lfloor \log_2 ( \sum \text{dataword indices} ) \rfloor + 1$.

Example. For the 4-bit dataword 1010 in Figure 2(c), the sum of those dataword indices whose bits are set to 0 (i.e., indices 6 and 3) is 9. Therefore, the corresponding check field value is the binary representation of 9, which is 01001. Similarly, the 4-bit dataword 1111 is assigned check field 00000, and dataword 0000 is assigned check field 10101.

Detecting and Correcting 1-Bit Errors. While a classic Hamming code provides a syndrome which is a vector of individual check bits (i.e. one for each parity group) [14], the new Zero-Sum ECU code provides a unified syndrome which is a single non-negative integer: the absolute value of the difference between the appended check field and a newly-calculated check field. The basic error detection and correction strategy is a modification of the Hamming approach. In both Zero-Sum and Hamming codes, a non-zero syndrome indicates an error. However, in Zero-Sum, the syndrome is computed differently: the receiver creates a regenerated check field C' from its data field, and compares C' to the actual received check field C. The resulting Zero-Sum syndrome is the absolute difference |C' − C|. If the difference is zero, there is no error. The syndrome is also used to correct a 1-bit error: its value is the index of the corrupted bit, as in Hamming codes. However, unlike Hamming, the Zero-Sum appended check field also ensures delay-insensitivity, as will be proven below.

Example. Following the previous example, suppose there is an error in transmitting the 4-bit dataword 1010, due to a flip in the data bit with index 7 (i.e. erroneous dataword 0010), transmitted with the original error-free check field (i.e. 01001). The newly-calculated check field, based on the corrupted dataword, is 16 (i.e. 7+6+3). Therefore, the syndrome is 16-9 = 7, which is non-zero and not a power of two. This syndrome therefore precisely identifies the index (i.e. 7) of the corrupted data bit. In contrast, if a single check bit has an error, the syndrome will be a power of two and identify the corresponding index of the corrupted check bit.

Having shown how the Zero-Sum code is constructed, it is first proven in Theorem 1 that the resulting code is an unordered code. Next, the error-correcting property is proven in Theorem 2.

Theorem 1 (Zero-Sum Code Delay-Insensitivity) Every Zero-Sum code is unordered.
Proof: By Definition 1, a Zero-Sum code, C, is unordered if each pair of its codewords is unordered. Consider two distinct datawords, X and Y. If X and Y are themselves unordered, their codewords are unordered regardless of the check fields. Otherwise, suppose X covers Y; then, by Definition 1, X has more 1's than Y, and Y has more 0's than X. As outlined earlier, the check field for any data field is generated by a summation of the indices of the data field bits which are 0. Therefore, the check field for Y is guaranteed to be greater than that of X, since the sum of corresponding data bit indices for Y will be larger. It can trivially be proven that the binary representation of a larger number is never covered by the binary representation of a smaller number; therefore, the check field of Y will not be covered by the check field of X. As a result, the codeword for Y (including the appended check field) will not be covered by the codeword for X, and vice versa, and the two codewords will be unordered. □

Theorem 2 (Zero-Sum Code Error Correction/Detection) Every Zero-Sum code provides 1-bit correction and 2-bit detection.

Proof: Given that the check bits of a codeword represent the sum of the 0 data bit indices, a simple check can be performed to ensure a dataword and its corresponding check bits are of the correct value (i.e., add the 0 data bit indices and make sure that value is represented by the check bits). If there is no discrepancy between the sum of the data bit indices and the value represented by the check field, an error has not occurred; however, if the two values differ, an error has occurred. The value of the syndrome is the difference between the check field and the newly-computed check value which, by construction, must be the index of the corrupted bit. Furthermore, any 2-bit error can be detected (but not necessarily corrected), because no two individual bit errors can ever result in a zero syndrome (details are omitted due to space limitations). □

Figure 4. 4-bit hardware components: encoder and CD. Panels: (a) Basic Encoder; (b) Completion Detector (CD); (c) Unified Encoder.

Figure 5. 4-bit error corrector unit design (Syndrome Generator and Corrector).
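The check field generation, code length calculation and single-error correction described in Section 3 can be sketched in software as follows. This is a behavioural illustration only, not the hardware design of the paper: datawords and check fields are assumed to be strings of '0' and '1' with the leftmost bit carrying the largest index (as in Figure 2), and all function names are chosen here for illustration.

```python
def data_indices(k):
    """First k positive non-powers-of-two, largest first (leftmost bit)."""
    idx, n = [], 3
    while len(idx) < k:
        if n & (n - 1):  # non-zero means n is not a power of two
            idx.append(n)
        n += 1
    return idx[::-1]


def check_width(k):
    """floor(log2(sum of dataword indices)) + 1 check bits."""
    return sum(data_indices(k)).bit_length()


def zero_sum(data):
    """Arithmetic sum of the indices of the dataword bits that are 0."""
    return sum(i for i, b in zip(data_indices(len(data)), data) if b == "0")


def encode(data):
    """Check field for a dataword, as a binary string."""
    return format(zero_sum(data), "0{}b".format(check_width(len(data))))


def flip(bits, pos):
    """Invert the bit at string position pos."""
    return bits[:pos] + ("1" if bits[pos] == "0" else "0") + bits[pos + 1:]


def correct(data, check):
    """Correct at most one flipped bit; returns (data, check)."""
    syndrome = abs(zero_sum(data) - int(check, 2))  # |C' - C|
    if syndrome == 0:
        return data, check
    idx = data_indices(len(data))
    if syndrome in idx:  # syndrome names a data-bit index
        return flip(data, idx.index(syndrome)), check
    if syndrome & (syndrome - 1) == 0 and syndrome.bit_length() <= len(check):
        # Power of two: the check bit of weight 2^j sits j places from the right.
        return data, flip(check, len(check) - syndrome.bit_length())
    raise ValueError("uncorrectable error, syndrome = {}".format(syndrome))


if __name__ == "__main__":
    print(encode("1010"))            # dataword 1010 from the example
    print(correct("0010", "01001"))  # data bit with index 7 was flipped
```

Running the example reproduces the worked case in the text: dataword 1010 receives check field 01001, and the corrupted pair (0010, 01001) yields syndrome 7 and is restored to (1010, 01001). A syndrome that matches no assigned index is reported as uncorrectable rather than guessed at.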


4. Hardware Support

The target system-level micro-architecture is shown in Figure 3. The sender node generates the check field, which is appended to the dataword to form the ECU code.

(a) Encoder: Check Bit Generator. A basic unoptimized encoder design for a 4-bit dataword is shown in Figure 4(a). It consists of a bank of selectors (i.e., multiplexors) followed by adders. There is one selector for each data field index value (3, 5, 6, 7), and each is configured either to pass the hardcoded index (if the corresponding data bit is 0) or the value 0 (if the corresponding data bit is 1). The result of the adders is the desired check field: the sum of the index values of the data bits that are set to 0. Figure 4(c) shows an alternative unified approach where the entire encoder is designed as a single combinational logic block.

(b) Completion Detector. The CD is shown in Figure 4(b). This component is used by the receiver to detect when a valid codeword has arrived. Each C-element detects exactly one of the 16 distinct codewords, and hence only one C-element is enabled per transaction.² A single multi-input OR gate combines the results to produce the final ACK.

² A C-element is a standard storage element, whose output is 0 (1) when all inputs are 0 (1), and which otherwise holds its value.

Dataword  ECU Type        Method     Coding Efficiency      Transition Power
Size                                 Results  Zero-Sum      Results  Zero-Sum
                                              Improve                Improve
2         Systematic      Zero-Sum   0.33     -             4.50     -
          Systematic      Blaum      0.33     0.00%         5.50     18.18%
          Systematic      Cheng      0.33     0.00%         6.00     25.00%
          Non-Systematic  Dual-Rail  0.29     14.29%        7.00     55.56%
          Non-Systematic  1-of-4     0.29     14.29%        6.50     44.44%
          Non-Systematic  m-of-n     0.29     14.29%        6.50     44.44%
3         Systematic      Zero-Sum   0.43     -             6.75     -
          Systematic      Blaum      0.38     12.50%        8.25     18.18%
          Systematic      Cheng      0.38     12.50%        8.00     15.63%
          Non-Systematic  Dual-Rail  0.30     30.00%        10.00    48.15%
          Non-Systematic  1-of-4     0.30     30.00%        8.00     18.52%
          Non-Systematic  m-of-n     0.33     22.22%        8.00     18.52%
4         Systematic      Zero-Sum   0.44     -             8.38     -
          Systematic      Blaum      0.44     0.00%         9.00     6.89%
          Systematic      Cheng      0.40     10.00%        10.00    16.20%
          Non-Systematic  Dual-Rail  0.33     25.00%        12.00    43.20%
          Non-Systematic  1-of-4     0.33     25.00%        9.25     10.38%
          Non-Systematic  m-of-n     0.40     10.00%        9.80     16.95%
5         Systematic      Zero-Sum   0.50     -             9.93     -
          Systematic      Blaum      0.45     9.09%         11.06    10.22%
          Systematic      Cheng      0.42     16.67%        12.00    17.25%
          Non-Systematic  Dual-Rail  0.36     28.57%        14.00    40.99%
          Non-Systematic  1-of-4     0.36     28.57%        10.00    0.70%
          Non-Systematic  m-of-n     0.45     9.09%         9.88     -0.50%
6         Systematic      Zero-Sum   0.50     -             11.34    -
          Systematic      Blaum      0.50     0.00%         12.03    5.74%
          Systematic      Cheng      0.43     14.29%        14.00    19.00%
          Non-Systematic  Dual-Rail  0.35     29.41%        17.00    49.91%
          Non-Systematic  1-of-4     0.35     29.41%        9.88     -12.92%
          Non-Systematic  m-of-n     0.50     0.00%         11.82    4.23%
7         Systematic      Zero-Sum   0.54     -             12.63    -
          Systematic      Blaum      0.54     0.00%         13.00    2.85%
          Systematic      Cheng      0.44     18.75%        16.00    21.06%
          Non-Systematic  Dual-Rail  0.37     31.58%        19.00    50.44%
          Non-Systematic  1-of-4     0.37     31.58%        13.50    6.89%
          Non-Systematic  m-of-n     0.50     7.14%         11.96    -5.30%
8         Systematic      Zero-Sum   0.57     -             14.00    -
          Systematic      Blaum      0.53     6.67%         15.88    11.84%
          Systematic      Cheng      0.44     22.22%        18.00    22.22%
          Non-Systematic  Dual-Rail  0.38     33.33%        21.00    50.00%
          Non-Systematic  1-of-4     0.38     33.33%        13.00    -7.14%
          Non-Systematic  m-of-n     0.53     6.67%         12.07    -13.79%
9         Systematic      Zero-Sum   0.56     -             15.33    -
          Systematic      Blaum      0.56     0.00%         16.76    8.53%
          Systematic      Cheng      0.45     20.00%        20.00    23.35%
          Non-Systematic  Dual-Rail  0.39     30.43%        23.00    50.03%
          Non-Systematic  1-of-4     0.39     30.43%        15.00    -2.15%
          Non-Systematic  m-of-n     0.53     5.88%         14.91    -2.74%
10        Systematic      Zero-Sum   0.59     -             16.60    -
          Systematic      Blaum      0.59     0.00%         17.61    5.74%
          Systematic      Cheng      0.45     22.73%        22.00    24.55%
          Non-Systematic  Dual-Rail  0.40     32.00%        25.00    50.60%
          Non-Systematic  1-of-4     0.40     32.00%        14.50    -12.65%
          Non-Systematic  m-of-n     0.56     5.56%         15.09    -9.10%

Table 1. ECU code comparison

Each gate can be decomposed into smaller fanin gates without affecting the hazard-freedom of the design. To reduce the CD overhead, an optimization was performed, in which C-element inputs were eliminated wherever possible. More specifically, a literal can be removed as an input to a C-element if that particular literal is set to 0 in the corresponding codeword. An example illustrating the optimization is shown in Fig. 4(b). For the 4-bit dataword 0000, the C-element for the corresponding codeword 0000 10101 has only 3 inputs, c16, c4 and c1, which correspond to the bits set to 1.

(c) Error Corrector. The structure of a 4-bit error corrector unit, as shown in Figure 5, is divided into two parts: a syndrome generator and a corrector. The syndrome generator produces the syndrome by performing the operations of comparison and subtraction. First, given the received dataword, an encoder generates a new check field. Next, the syndrome is generated by finding the mathematical difference between the received and newly generated check fields. A magnitude comparator is used to perform the absolute value function, thus ensuring that the resulting syndrome is non-negative. Using the comparator outputs, the top-most multiplexor selects the larger of the two values (the minuend), while the lower multiplexor selects the smaller of the two (the subtrahend).

The second part of the error corrector unit performs the correction operation in the classic manner used for Hamming codes [14]. Essentially, a C-element and a 2-input XOR gate are allocated for each bit of a codeword. The input to each C-element is the syndrome; for each bit, the C-element is enabled by the non-zero syndrome value that uniquely identifies an error in that particular bit. The corresponding XOR gate corrects the faulty bit by performing bit-inversion. Given an error, it can be proven that exactly one C-element and XOR gate will be enabled. A bank of latches is included to ensure a glitch-free transaction [28].

(d) Sketch of System-Level Protocol. The system-level protocol is designed to handle both error-free and erroneous transmissions (refer to Figure 3). The protocol assumed is a standard asynchronous four-phase (return-to-zero) protocol [3, 11]. There is only one spacer (i.e., reset) state, which is the all-0's input. When error-free data is transmitted, the CD is triggered normally in a four-phase protocol between the sender and receiver. In certain instances of erroneous transmission, a time-out mechanism [11] is used to ensure forward progress. During time-out, a flag is asserted when a valid codeword (or reset) is not detected by the CD within a desired amount of time. The CD uses this information to alert the error corrector via the ACK signal. The error corrector then observes the given data and restores the data to its correct value. Note that the time-out mechanism is not on the critical path during error-free transmission, and only plays an active role during an erroneous transmission.
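The optimized completion detector of Section 4(b) can be modeled behaviourally as shown below. This is a simplified sketch under stated assumptions: it covers only the evaluate phase (wires rising from the all-0 spacer), ignores the storage behaviour of C-elements and the time-out mechanism, and uses hypothetical names (`completion`, `ack_time`) and wire numbering (string position in the 9-bit codeword).

```python
from itertools import product

DATA_IDX = [7, 6, 5, 3]  # 4-bit dataword indices from Figure 2(c)


def codeword(data):
    """Dataword bits followed by the 5-bit Zero-Sum check field."""
    s = sum(i for i, b in zip(DATA_IDX, data) if b == "0")
    return data + format(s, "05b")


CODE = [codeword("".join(bits)) for bits in product("01", repeat=4)]

# One modeled C-element per codeword; after the optimization its inputs
# are only the wires that are 1 in that codeword.
C_ELEMENT_INPUTS = [
    frozenset(i for i, b in enumerate(cw) if b == "1") for cw in CODE
]


def completion(high_wires):
    """OR of all C-elements: true once some codeword's 1-wires are all high."""
    return any(inputs <= high_wires for inputs in C_ELEMENT_INPUTS)


def ack_time(arrival_order):
    """Number of wires that have risen when ACK first fires (None if never)."""
    high = set()
    for n, wire in enumerate(arrival_order, start=1):
        high.add(wire)
        if completion(high):
            return n
    return None


if __name__ == "__main__":
    ones = [i for i, b in enumerate(codeword("1010")) if b == "1"]
    print(ack_time(ones), ack_time(ones[::-1]))
```

Because the code is unordered (Theorem 1), no codeword's set of 1-wires is contained in another's, so in this model ACK fires only when the last 1-bit of the transmitted codeword arrives, whatever the arrival order.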

5. Analytical Evaluation

Comparative evaluations are now presented for the new Zero-Sum ECU codes (Section 5.1) and the supporting hardware implementation (Section 5.2).

5.1 Code Evaluation

Table 1 compares the new Zero-Sum ECU codes to other systematic (Blaum, Cheng) and non-systematic (dual-rail, 1-of-4, optimal m-of-n) ECU codes. Two metrics are evaluated: coding efficiency (i.e. # bits/wire) and transition power (i.e. # wire transitions/transaction). Information field sizes from 2 to 10 bits are used. Only moderate-sized datawords were considered, to ensure a manageable completion detector (which can grow exponentially with field size). Larger dataword sizes are easily handled by partitioning them into moderate-size subfields and encoding each separately. The unordered property is still preserved under partitioning; however, details are omitted due to space.

In comparison with other systematic ECU codes, the Zero-Sum code is consistently better or equal for both cost metrics. When compared with Cheng's approach, the Zero-Sum code has a 15.63 to 24.55% reduction in transition power for all field sizes, and 16.67 to 22.73% better coding efficiency for larger field sizes (5 and greater). When compared to Blaum's approach, the best previous systematic ECU code, the new code has a 5.74 to 18.18% reduction in transition power for all field sizes (except length 7), and better coding efficiency for a third of the field sizes (saving one bit). Especially promising are the Zero-Sum codes for field size 5 (9.09% better coding efficiency, 10.22% power reduction) and field size 8 (6.67% better coding efficiency, 11.84% power reduction).

The Zero-Sum codes were also compared to non-systematic ECU codes for dual-rail, 1-of-4, and the most coding-efficient m-of-n code. Each non-systematic ECU code was formed by taking a non-systematic code and appending a Hamming check field. In comparison with these non-systematic ECU codes, the Zero-Sum code is consistently better than or equal for coding efficiency, and often better for transition power (with exceptions). Compared with the dual-rail ECU code, the new code has at least a 25.00% better coding efficiency (for 9 of 10 field sizes) and at least a 40.99% reduction in transition power. Compared with the 1-of-4 ECU code, the new code has at least a 25.00% better coding efficiency (for 9 of 10 field sizes),³ and power reductions ranging from significantly better (44.44%) to only slightly worse (-2.15%) for 7 of the 10 field sizes, but with some moderate degradation for the remaining cases (-7.14 to -12.92%). Finally, compared to the best m-of-n ECU code for each field size, the new code has 5.56 to 22.22% better coding efficiency, and power reductions ranging from significantly better (44.44%) to only slightly worse (-2.74%) for 7 of the 10 field sizes, but with some moderate degradation for the remaining cases (-5.30 to -13.79%). However, in the next subsection, it is shown that these latter m-of-n codes have significant hardware overheads.

Zero-Sum: Optimal Field Length. It is interesting to assess which field sizes for a Zero-Sum code have the best coding efficiency. The trend shows a strong and consistent improvement in coding efficiency as field sizes increase, ranging from 0.33 (2-bit) to 0.59 (10-bit). The Zero-Sum codes have their best absolute coding efficiency for field sizes of 7 to 10, where coding efficiency ranges between 0.54 and 0.59. Interestingly, for this range, the Zero-Sum coding efficiency is even better than that of dual-rail and 1-of-4 non-ECU codes (0.50), which do not provide error-correction capability.
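The two metrics can be recomputed for the Zero-Sum code with a short script. This is a sketch under an assumption that is not stated explicitly in the text: transition power is taken to be the average number of wire transitions per four-phase transaction over all datawords, with every 1 bit counted twice (once rising, once resetting). With that assumption the script reproduces the Zero-Sum entries of Table 1 for 2-, 3- and 4-bit datawords (4.50, 6.75 and 8.38 after rounding). Function names are illustrative only.

```python
def data_indices(k):
    """First k positive integers that are not powers of two."""
    idx, n = [], 3
    while len(idx) < k:
        if n & (n - 1):  # non-zero means n is not a power of two
            idx.append(n)
        n += 1
    return idx


def coding_efficiency(k):
    """Data bits per wire: k / (k + number of check bits)."""
    check_bits = sum(data_indices(k)).bit_length()
    return k / (k + check_bits)


def transition_power(k):
    """Assumed metric: 2 x average number of 1 bits per codeword."""
    idx = data_indices(k)
    total_ones = 0
    for word in range(2 ** k):
        bits = format(word, "0{}b".format(k))
        check = sum(i for i, b in zip(idx, bits) if b == "0")
        total_ones += bits.count("1") + bin(check).count("1")
    return 2 * total_ones / 2 ** k


if __name__ == "__main__":
    for k in range(2, 11):
        print(k, round(coding_efficiency(k), 2), transition_power(k))
```

The coding efficiency column depends only on the code length formula of Section 3, so it does not rely on the transition-counting assumption.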

5.2 Hardware Evaluation

To fully assess the Zero-Sum codes, pre-layout technology-mapped implementations of the supporting hardware components, namely the encoder (Figure 4a and Figure 4c), CD (Figure 4b), and error corrector unit (Figure 5), were synthesized using the logic synthesis tool ABC [26]. Each component is specified in PLA file format [24], then ABC's delay script [18] is applied to perform logic optimization and technology mapping. An industrial 90nm standard cell library was used.⁴ For hardware units which use adder/subtractor blocks (Figure 4a and Figure 5), a preprocessing step is used where a VHDL specification is mapped to a gate-level structural BLIF version [24], which is then optimized and mapped using the above ABC flow. Table 2 shows results for the implementation of supporting hardware of the new Zero-Sum ECU code, and Table 3 shows results for the most coding-efficient m-of-n (non-systematic) ECU code. Area is reported in µm² and delay

³ Note that dual-rail and 1-of-4 codes inherently have identical coding efficiency, while 1-of-4 tends to provide reduced transition power.

⁴ For the CD only, the standard cell library was limited to AND and OR gates only to ensure a logic-hazard-free design [28].

| Dataword size | Encoder i/o | Encoder (Basic) area* | Encoder (Basic) delay† | Encoder (Unified) area* | Encoder (Unified) delay† | CD i/o | CD area* | CD delay† | Syndrome generator i/o | Syndrome generator area* | Syndrome generator delay† | Total (Basic) area* | Total (Basic) delay† | Total (Unified) area* | Total (Unified) delay† |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 2/4 | 19.34 | 0.03 | 17.64 | 0.04 | 6/1 | 28.23 | 0.25 | 6/4 | 189.00 | 0.16 | 236.57 | 0.44 | 234.87 | 0.45 |
| 3 | 3/4 | 52.21 | 0.05 | 52.21 | 0.05 | 7/1 | 72.70 | 0.38 | 7/4 | 238.49 | 0.19 | 363.40 | 0.62 | 363.40 | 0.62 |
| 4 | 4/5 | 67.72 | 0.06 | 64.22 | 0.07 | 9/1 | 113.64 | 0.50 | 9/4 | 291.36 | 0.20 | 472.72 | 0.76 | 469.22 | 0.77 |
| 5 | 5/5 | 128.40 | 0.11 | 136.19 | 0.08 | 10/1 | 192.00 | 0.62 | 10/5 | 277.27 | 0.19 | 597.67 | 0.92 | 605.46 | 0.89 |
| 6 | 6/6 | 221.56 | 0.14 | 256.07 | 0.10 | 12/1 | 366.30 | 0.58 | 12/6 | 506.60 | 0.24 | 1094.46 | 0.96 | 1128.97 | 0.92 |
| 7 | 7/6 | 256.12 | 0.18 | 560.82 | 0.11 | 13/1 | 602.74 | 0.66 | 13/6 | 612.44 | 0.28 | 1471.30 | 1.12 | 1776.00 | 1.05 |
| 8 | 8/6 | 331.87 | 0.14 | 886.74 | 0.15 | 14/1 | 938.74 | 0.73 | 14/6 | 620.16 | 0.29 | 1890.77 | 1.16 | 2445.64 | 1.17 |
| 9 | 9/7 | 356.25 | 0.18 | 1870.11 | 0.14 | 16/1 | 1887.91 | 0.79 | 16/7 | 688.59 | 0.29 | 2932.75 | 1.26 | 4446.61 | 1.22 |
| 10 | 10/7 | 408.49 | 0.20 | 3279.68 | 0.15 | 17/1 | 3076.62 | 0.89 | 17/7 | 958.78 | 0.34 | 4443.89 | 1.43 | 7315.08 | 1.38 |

* = area reported in µm²; † = delay reported in ns. Total design = encoder, CD, and syndrome generator.

Table 2. Zero-Sum ECU components: tech mapped results

| Dataword size | Optimal m-of-n | Encoder i/o | Encoder area* | Encoder delay† | CD i/o | CD area* | CD delay† | Decoder i/o | Decoder area* | Decoder delay† | Total area* | Total delay† |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 1-of-4 | 2/7 | 23.28 | 0.03 | 9/1 | 47.28 | 0.38 | 7/2 | 44.46 | 0.05 | 115.02 | 0.46 |
| 3 | 2-of-5 | 3/9 | 55.73 | 0.05 | 12/1 | 87.50 | 0.39 | 9/3 | 127.01 | 0.08 | 270.24 | 0.52 |
| 4 | 2-of-7 | 4/11 | 126.30 | 0.06 | 15/1 | 172.21 | 0.66 | 11/4 | 350.67 | 0.10 | 649.18 | 0.82 |
| 5 | 3-of-7 | 5/11 | 280.00 | 0.08 | 16/1 | 320.37 | 0.64 | 11/5 | 653.32 | 0.12 | 1253.69 | 0.84 |
| 6 | 4-of-8 | 6/12 | 777.32 | 0.10 | 18/1 | 663.38 | 0.80 | 12/6 | 1348.17 | 0.14 | 2788.87 | 1.04 |
| 7 | 4-of-10 | 7/14 | 1483.58 | 0.12 | 21/1 | 1102.48 | 0.96 | 14/7 | 3039.86 | 0.16 | 5625.92 | 1.24 |
| 8 | 4-of-11 | 8/15 | 3112.40 | 0.13 | 23/1 | 2184.30 | 0.93 | 15/8 | 6110.45 | 0.18 | 11407.15 | 1.24 |
| 9 | 5-of-12 | 9/17 | 6138.80 | 0.15 | 26/1 | 4201.74 | 1.03 | 17/9 | 12099.40 | 0.20 | 22439.94 | 1.38 |
| 10 | 5-of-13 | 10/18 | 11411.35 | 0.17 | 28/1 | 10603.85 | 1.17 | 18/10 | 24389.73 | 0.21 | 46404.93 | 1.55 |

* = area reported in µm²; † = delay reported in ns.

Table 3. m-of-n ECU components: tech mapped results

The hardware overheads of the Zero-Sum code appear moderate in terms of both delay and area. As expected, both area and delay increase with larger field sizes. Note that the CDs grow exponentially in area with added bits, while the other components grow roughly linearly. In comparison, the most coding-efficient m-of-n non-systematic ECU code has significantly greater area overhead, especially for medium to large field sizes: its total area (Table 3) is 3.82x to 10.44x that of the Basic Zero-Sum design (Table 2) for dataword field sizes of 7 to 10 bits.

6. Conclusions and Future Work

A novel error-correcting unordered (ECU) code, called Zero-Sum, was introduced. The code simultaneously provides timing robustness (by allowing skew in the arrival of individual bits in a packet) and fault tolerance (1-bit correction/2-bit detection), thus supporting the design of resilient asynchronous global communication. A preliminary evaluation showed significant benefits in coding efficiency and transition power over existing systematic ECU approaches. Designs for the supporting hardware have been implemented and were shown to have moderate area and delay overhead, especially when compared to the most coding-efficient m-of-n ECU codes.

References

[1] V. Akella, N.H. Vaidya, and G.R. Redinbo. Asynchronous comparison-based decoders for delay-insensitive codes. IEEE Trans. on Comp., 47(7):802–811, 1998.
[2] D.A. Anderson and G. Metze. Design of totally self-checking check circuits for m-out-of-n codes. IEEE Trans. on Comp., 22(3):263–269, 1973.
[3] W.J. Bainbridge, W.B. Toms, D.A. Edwards, and S.B. Furber. Delay-insensitive, point-to-point interconnect using M-of-N codes. In IEEE Async Symp., pages 132–140, May 2003.
[4] J.M. Berger. A note on error detecting codes for asymmetric channels. Information and Control, 4(1):68–73, 1961.
[5] D. Bertozzi, L. Benini, and G. De Micheli. Low power error resilient encoding for on-chip data buses. In DATE, pages 102–109, March 2002.
[6] M. Blaum and J. Bruck. Unordered error-correcting codes and their applications. In FTCS, pages 486–493, July 1992.
[7] B. Bose. On unordered codes. IEEE Trans. on Comp., 40(2):125–131, 1991.
[8] B. Bose and T.R.N. Rao. Theory of unidirectional error correcting/detecting codes. IEEE Trans. on Comp., 31(6):521–530, 1982.
[9] J. Bruck and M. Blaum. Delay-insensitive pipelined communication on parallel buses. IEEE Trans. on Comp., 44(5):660–668, 1995.
[10] D.M. Chapiro. Globally-Asynchronous Locally-Synchronous Systems. PhD thesis, Stanford University, October 1984.
[11] F.-C. Cheng and S.-L. Ho. Efficient systematic error-correcting codes for semi-delay-insensitive data transmission. In ICCD, pages 24–29, November 2001.
[12] M.E. Dean, T.E. Williams, and D.L. Dill. Efficient self-timing with level-encoded 2-phase dual-rail (LEDR). In Proc. of UC Santa Cruz Conf. on Advanced Research in VLSI, pages 55–70, 1991.
[13] H.S. Deogun, R.R. Rao, D. Sylvester, and D. Blaauw. Leakage- and crosstalk-aware bus encoding for total power reduction. In DAC, pages 779–782, June 2004.
[14] R.W. Hamming. Error detecting and error correcting codes. Bell System Tech. Journal, 29(1):147–150, 1950.
[15] N.K. Jha. Separable codes for detecting unidirectional errors. IEEE TCAD, 8(5):571–574, 1989.
[16] K. Keutzer, A.R. Newton, J.M. Rabaey, and A. Sangiovanni-Vincentelli. System-level design: orthogonalization of concerns and platform-based design. IEEE TCAD, 17(2):1523–1543, 2000.
[17] P.B. McGee, M.Y. Agyekum, M.A. Mohamed, and S.M. Nowick. A level-encoded transition signaling protocol for high-throughput asynchronous global communication. In IEEE Async Symp., pages 116–127, April 2008.
[18] A. Mishchenko, R.K. Brayton, and S. Jang. Global delay optimization using structural choices. In IWLS, pages 1–6, June 2008.
[19] T. Nanya and Y. Tohma. Design of self-checking asynchronous sequential circuits. In FTCS, pages 278–280, October 1980.
[20] S. Ogg, B. Al-Hashimi, and A. Yakovlev. Asynchronous transient resilient links for NoC. In CODES, pages 209–214, October 2008.
[21] J.D. Owens, W.J. Dally, R. Ho, D.N. Jayasimha, S.W. Keckler, and L.-S. Peh. Research challenges for on-chip interconnection networks. IEEE Micro, 27(5):96–108, 2007.
[22] K.N. Patel and I.L. Markov. Error-correction and crosstalk avoidance in DSM busses. IEEE Trans. on VLSI Systems, 12(10):1076–1080, 2004.
[23] S.J. Piestrak and T. Nanya. Towards totally self-checking delay-insensitive systems. In FTCS, pages 228–237, June 1995.
[24] E.M. Sentovich, K.J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P.R. Stephan, R.K. Brayton, and A.L. Sangiovanni-Vincentelli. SIS: A system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, UC Berkeley, 1992.
[25] M.R. Stan and W.P. Burleson. Bus-invert coding for low power I/O. IEEE Trans. on VLSI Systems, 3(1):49–58, 1995.
[26] Berkeley Logic Synthesis and Verification Group. ABC: A System for Sequential Synthesis and Verification. http://www-cad.eccs.berkeley.edu/~alanmi/abc, 2005.
[27] P. Teehan, M. Greenstreet, and G. Lemieux. A survey and taxonomy of GALS design styles. IEEE Design & Test of Comps., 24(5):418–429, 2007.
[28] S.H. Unger. Asynchronous Sequential Switching Circuits. Krieger Publishing Co., Inc., USA, 1983.
[29] T. Verhoeff. Delay-insensitive codes - an overview. Distributed Computing, 3(1):1–8, 1988.
[30] B. Victor and K. Keutzer. Bus encoding to prevent crosstalk delay. In ICCAD, pages 57–63, November 2001.
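As a cross-check on the area comparison in the experimental results, the ratios can be recomputed from the total-area columns of Tables 2 and 3. The sketch below assumes the comparison is made against the Zero-Sum total that uses the Basic encoder; the dictionary and function names are illustrative only.

```python
# Total design area (in µm^2) copied from Table 2 (Zero-Sum, Basic encoder)
# and Table 3 (most coding-efficient m-of-n ECU code), keyed by dataword size.
ZERO_SUM_BASIC_TOTAL_AREA = {7: 1471.30, 8: 1890.77, 9: 2932.75, 10: 4443.89}
M_OF_N_TOTAL_AREA = {7: 5625.92, 8: 11407.15, 9: 22439.94, 10: 46404.93}


def area_ratios(baseline, other):
    """Return the other/baseline area ratio for each dataword size in both tables."""
    return {bits: other[bits] / baseline[bits] for bits in sorted(baseline) if bits in other}


if __name__ == "__main__":
    ratios = area_ratios(ZERO_SUM_BASIC_TOTAL_AREA, M_OF_N_TOTAL_AREA)
    for bits, ratio in ratios.items():
        print(f"{bits}-bit dataword: m-of-n total area is {ratio:.2f}x Zero-Sum (Basic)")
```

Under this assumption the ratio is about 3.82x at 7 bits and about 10.44x at 10 bits, which matches the range quoted in the text; the intermediate sizes fall between these two values.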
