Low power sub-threshold asynchronous quasi-delay-insensitive 32-bit

0 downloads 0 Views 890KB Size Report
They design an async 32-bit arithmetic and logic unit (ALU) based on the proposed ASVHB realisation approach (at 65 nm CMOS process). Their ASVHB ALU ...
IET Circuits, Devices & Systems Research Article

Low power sub-threshold asynchronous quasi-delay-insensitive 32-bit arithmetic and logic unit based on autonomous signal-validity half-buffer

ISSN 1751-858X Received on 23rd April 2014 Accepted on 16th November 2014 doi: 10.1049/iet-cds.2014.0103 www.ietdl.org

Weng-Geng Ho ✉, Kwen-Siong Chong, Bah-Hwee Gwee, Joseph Sylvester Chang School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore ✉ E-mail: [email protected]

Abstract: The authors propose an asynchronous-logic (async) quasi-delay-insensitive (QDI) autonomous signal-validity half-buffer (ASVHB) realisation approach for low power sub-threshold operation (VDD = 0.2 V). There are three key attributes in the proposed ASVHB realisation approach. First, the ASVHB realisation approach embodies integrated autonomous validity signals, which are unique and are used exclusively to simplify the circuit implementation for QDI protocol. Second, the ASVHB realisation approach applies the fine-grained gate-level method, which propagates data through a single-cell datapath pipeline to maximise the throughput rate. Third, the ASVHB realisation approach adopts the static-logic implementation, which maintain stable output states (by connecting them directly to the power rails), to feature high robustness for sub-threshold operation. They compare their ASVHB realisation approach against the competitive reported weak-conditioned half-buffer (WCHB) and pre-charged half-buffer (PCHB) realisation approaches. The WCHB and PCHB library cells, on average, require ∼2.1 × and ∼1.9 × more transistors than the ASVHB library cells. With respect to a 3-stage pipeline realisation, the WCHB and PCHB pipelines, on average, require 1.8 × and 1.5× more transitions per-cycle than the ASVHB pipeline. They design an async 32-bit arithmetic and logic unit (ALU) based on the proposed ASVHB realisation approach (at 65 nm CMOS process). Their ASVHB ALU occupies 0.092 mm2, and in many merits, outperforms the WCHB and PCHB counterparts. The WCHB and PCHB counterparts require ∼1.7 × and ∼1.4× more transistors, respectively, than their design. At the sub-threshold voltage of VDD = 0.2 V, the WCHB and PCHB counterparts dissipate ∼1.7× and ∼2.6× more energy, respectively, and are, respectively, ∼0.95× and ∼0.73× slower throughput.

1

Introduction

Multi-processor (or multi-core computing) platforms have been widely accepted as efficient enablers for highly parallel digital signal processing [1, 2], and can be realised to feature low power dissipation which can be attributed to the advancement of the network-on-chip technology [3]. The potential applications adopting multi-processor platforms include audio and video communication [4], acoustic systems [5, 6], wireless sensor networks [7, 8], Internet of Things [9, 10] and so on. In fact, each processor (in such platform) can be operated independently, for example, by completely shutting off the processor (in idle state) or by slowing down the processor whose corresponding computation is not critical. To achieve such independent control, dynamic voltage (frequency) scaling [11] can be employed with the overall objective of trading off energy dissipation and operating speed. For example, the supply voltage can be scaled down from the nominal voltage (i.e. for high speed high power operations) to the near-threshold voltage (i.e. for mid-speed low power operations) and to the sub-threshold voltage (i.e. for ultra-low speed ultra-low power operations). Some studies even show that the minimum energy operation of most digital circuits is occurred in the sub-threshold voltage region [12]. Despite the desirable attribute of minimum energy dissipation, sub-threshold operation (and the associated dynamic voltage scaling) for digital circuits poses some design challenges. In the sub-threshold operation, the on-current (Ion) of the transistor, consequently the associated circuit delay, is exponentially dependent on process, voltage and temperature (PVT) variations [13]. This circuit delay variation translates directly into timing

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

uncertainties, hence may create data synchronisation issues if the prevalent clock-based synchronous-logic (sync) methodology is adopted. Consequently, excessive amount of safety timing delay margin is often considered in the clock infrastructure to accommodate the worst-case critical path delay (including the clock skew, setup-time and hold-time for registers etc.), hence unnecessarily further slowing down the entire circuit and resulting in higher energy dissipation [14]. Alternatively, asynchronous-logic (async) methodology [15] could be fully or in part adopted to alleviate such data synchronisation issues. The basic premise is that async circuits are essentially self-timed circuits by using async handshake protocols [16] to synchronise digital operations, and thus are more robust for data synchronisation. For reliable data synchronisation, the quasi-delay-insensitive (QDI) delay model [17, 18] is often adopted in view of its practicability to acknowledge the arrival/ completion of signals. The QDI async operational modality has been well established in the electronic community [19]. There are several reported QDI realisation approaches to-date, and they can be categorised into the coarse-grained block-level and fine-grained gate-level methods [20]. The QDI realisation approaches of the coarse-grained block-level method include the delay-insensitive min-term synthesis (DIMS) [21], null convention logic (NCL) [22] and pre-charged static-logic (PCSL) [7]. However, these realisation approaches are not speed-efficient for data propagation because of their undesirable long delay in the multi-cascading cell datapath. In contrast, the QDI realisation approaches of the fine-grained gate-level method [20] potentially maximise the efficiency of the data propagation through the single-cell datapath, and thus more speed-efficient. The recent

309

reported fine-grained gate-level QDI realisation approaches [23] are weak-conditioned half-buffer (WCHB) [24] and pre-charged half-buffer (PCHB) [18]. Despite its speed-efficiency advantage, the fine-grained pipeline potentially suffers from high transistor-count overheads (every cell embeds an acknowledgement circuitry) and potentially high power dissipation (every cell has high number of switchings) [20] when compared to the coarse-grained pipeline. Furthermore, the implementation of the fine-grained pipeline is more challenging than that of the coarse-grained because of lack of support of commercial electronic design automation (EDA) tools; commercial EDA tools are largely register transfer level (RTL)-based and are directly applicable to the coarse-grained pipelines. The QDI realisation approaches can be implemented in several logic families, namely static-logic, dynamic-logic, pass-logic and so on. For robust sub-threshold operation, static-logic is adopted (over other logic families) because of its high robustness towards the PVT variations [13]. Although the WCHB realisation approach was reported in static-logic (and hence suitable for sub-threshold operation [12]), it embodies large numbers of the high-overhead C-Muller [24] and independent latches in order to abide by the QDI requirements. On the other hand, the PCHB realisation approach was reported in dynamic-logic, and might not be robust in the sub-threshold operation. For robustness, digital library cells [15] embodying the PCHB realisation approach may be modified to be static-logic, but their transistor-count undesirably increases and their energy efficiency decreases. Consequently, both the reported WCHB and PCHB realisation approaches suffer from high hardware-overhead problems, when particularly intended for robust sub-threshold operation. Put simply, the reported fine-grained gate-level QDI realisation approaches remain unsatisfactory for low overhead low power/energy attributes. In this paper, we propose a fine-grained gate-level QDI realisation approach, coined as autonomous signal-validity half-buffer (ASVHB) for a 32-bit arithmetic and logic unit (ALU) [25] with emphasis on low energy operation at the sub-threshold voltage region. The ALU is targeted for a processor core with a multi-processor platform and is designed inherently applicable for dynamic voltage scaling and is ideal for sub-threshold operation. Our proposed ASVHB realisation approach features several novel attributes. First, digital library cells embodying our proposed ASVHB realisation approach are designed to have an autonomous validity signal for each datum, and such autonomous validity signal distinctively makes the overall cell architecture simple without compromising the QDI assumptions (see later). Second, the proposed ASVHB pipeline belongs to fine-grained gate-level method, hence potentially maximising the throughput rate. Third, our proposed ASVHB realisation approach belongs to static-logic (and remains QDI), hence is robust towards PVT variations and inherently appropriate for dynamic voltage scaling. The proposed ASVHB realisation approach is compared with the competitive reported WCHB and PCHB realisation approaches. From the circuit perspective, the reported WCHB and PCHB library cells, on average, require ∼ 2.1 × and ∼ 1.9 × more transistors than ASVHB library cells. From the pipeline perspective, the reported WCHB and PCHB pipelines, on average, require 1.8 × and 1.5 × more transitions per-cycle than ASVHB pipeline. We implement the async 32-bit ALU embodying the proposed ASVHB realisation approach based on 65 nm CMOS process and compare it with the reported WCHB and PCHB counterparts. Among the three implemented designs, our ASVHB ALU occupies 0.092 mm2 and features the lowest transistor-count; the WCHB and PCHB counterparts require ∼ 1.7 × and ∼ 1.4 × more transistors, respectively, than our design. At the sub-threshold voltage (0.2 V), our design dissipates the lowest energy per-operation; the WCHB and PCHB counterparts dissipate ∼ 1.7 × and ∼ 2.6 × more energy, respectively. At 0.2 V, our design also features the fastest throughput; the WCHB and PCHB counterparts are, respectively, ∼ 0.95 × and ∼ 0.73 × slower. This paper is organised as follows. Section 2 describes our proposed ASVHB realisation approach. Section 3 explains the 32-bit ASVHB ALU design. The simulation results are presented in Section 4. Finally, conclusions are drawn in Section 5.

310

2 2.1

Proposed ASVHB Cell structure and operation mechanism

Fig. 1 depicts a generic block diagram of a digital cell embodying the proposed ASVHB realisation approach. The digital cell has a n-input data (n ≥ 1) collectively denoted as Data-In and an output data denoted as Q. Data-In and Q are represented in the well-established dual-rail encoding [15] that requires two wires to encode a 1-bit datum. For example, for the datum Q, there are a true wire (Q.T) and a false wire (Q.F). Both the true wire (Q.T) and the false wire (Q.F) are initially ‘0’ (at standby) and during the operation, one of the wires will be asserted to ‘1’ to indicate either a valid ‘0’ or a valid ‘1’ data signal. Both wires cannot be ‘1’s simultaneously. The async handshaking signals are single-ended signals Lack (i.e. an acknowledge signal to the preceding pipeline) and Rack (i.e. an acknowledge signal from the succeeding pipeline). The unique feature of our proposed ASVHB cell is that every datum is accompanied by its autonomous (individual) validity signal. For example, the validity signals for Data-In of all the n inputs are LvalA, LvalB and so on. The validity signal for Q is RvalQ. For ease of describing our proposed ASVHB realisation approach, we depict a 2-input AND/NAND cell as an example in Fig. 2. The dual-rail input data signals A and B are A.T/A.F and B.T/B.F, respectively, and their corresponding validity signals are LvalA and LvalB, respectively. The dual-rail output data signal Q is Q.T/Q.F, and its corresponding validity signal is RvalQ. These validity signals help to validate the input completeness [18], hence fulfilling the QDI requirement. The Acknowledge signals Lack and Rack indicate the acknowledgment to the left channel and from the right channel, respectively. It may be cursory to note that an ASVHB cell has additional autonomous validity signals when compared to the reported dual-rail QDI cells [7, 18, 21–24]. Unlike our ASVHB cell where the validity signals are integrated as parts of the individual cell, the reported QDI cells require separate input/output completion detection (OCD) circuits to decode their respective input/output signals to validate the input completeness. In our ASVHB realisation approach, however, we would like to remark that it is in fact the autonomous validity signal (for each datum) making the circuit implementation more efficient, hence resulting in the highly desirable low power dissipation and low transistor-count attributes (when compared to the reported dual-rail QDI circuits). We now in turn discuss the novel circuit implementation of a QDI cell embodying our proposed ASVHB realisation approach. As shown in the circuit schematic in Fig. 2, the ASVHB 2-input AND/NAND cell consists of an OCD, an inverter and a functional block. The OCD is an NOR gate which generates Lack to collectively acknowledge the input data A and B, indicating that the input data A and B have been computed. The inverter generates RvalQ from Lack to indicate the validity (and nullity) of the output Q. The functional block consists of a ‘true’ rail sub-block and a ‘false’ rail sub-block. The schematic structures of the sub-blocks are conceptually the same. For each schematic structure, it consists

Fig. 1 Block diagram of proposed n-input ASVHB cell

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

Fig. 2 Proposed ASVHB 2-input AND/NAND cell

of a ‘pre-charge’ section, an ‘evaluate’ section, a hold ‘0’ section, a hold ‘1’ section and a cross-coupled inverter. When LvalA, LvalB and Rack are all ‘0’, the ‘pre-charge’ section (in both sub-blocks) pre-charges the intermediate signals S.T and S.F. This condition implies that the ASVHB cell can now be reset because the input data A and B are currently in the empty state, and the (connecting) succeeding pipeline has received the output Q. We further remark that because of the autonomous validity signal for each input data, we need to construct only n + 1 PMOS transistors in series for the ‘pre-charge’ section. Such simple transistor configuration (for reset operation) is much more efficient when compared to the reported QDI cells which require either a separate (and yet complex) input completion detection (ICD) circuit [18] or a high fan-out complex PMOS transistor structure [24]. The ‘evaluate’ section performs evaluate operation from Data-In; the Data-In itself also inherently validates the input availability during the evaluate operation. The hold ‘0’ section maintains the empty output state before the ‘true’ rail sub-block (or the ‘false’ rail sub-block) is ready for the evaluate operation. The hold ‘1’ section maintains the valid output state before the ‘true’ rail sub-block (or the ‘false’ rail sub-block) is ready for the pre-charge operation. Both hold ‘0’ and hold ‘1’ sections merely serve as feedback circuits implemented in static-logic. These feedback circuits are highly efficient in part by incorporating the handshake signals into each sub-block; other reported approaches (e.g. DIMS and PCHB) would require separate C-Muller circuits to interface handshake signals. For the dynamic-logic implementation, the hold ‘0’ and hold ‘1’ sections can be simplified for low transistor-count. Nonetheless, the dynamic-logic implementation requires careful transistor sizing and is inappropriate for sub-threshold operation. The cross-coupled inverter latches the output, making the overall ASVHB cell as a fine-grained gate-level QDI pipeline cell embodying an integrated latch [20]. The overall ASVHB circuit realisation approach abides by QDI attributes where the input-completeness [18] and acknowledgement (between pipelines) are preserved [20]. For the input-completeness, the transistor configurations of the dual-rail input data signals (A.T/A.F and B.T/B.F) in the ‘evaluate’ section validate the data availability and perform evaluate operation. The autonomous validity signals (LvalA and LvalB) in the ‘pre-charge’ section validate the data nullity and perform the pre-charge operation. For the acknowledgement, both ‘evaluate’ and ‘pre-charge’ sections embody the output acknowledge signal (Rack) sending from the succeeding pipeline to repeatedly run the alternate 4-phase evaluate and pre-charge operation cycles. The operation of the ASVHB AND/NAND cell is as follows. Initially, the input data signals (A.T/A.F and B.T/B.F), the output data signals (Q.T/Q.F) and the autonomous validity signals (LvalA, LvalB and RvalQ) are at logic ‘0’, the acknowledge signals (Lack and Rack) and the intermediate data (S.T/S.F) are at logic ‘1’.

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

Consider a case when the input data signals arrive such that A.T = ‘1’ (A.F = ‘0’ and LvalA = ‘1’) and B.T = ‘1’ (B.F = ‘0’ and LvalB = ‘1’), S.T will be switched to ‘0’ (via the ‘evaluate’ section) and Q.T will be switched to logic ‘1’ (via the cross-coupled inverter). Once the output is valid (Q.T = ‘1’), the OCD negates Lack to logic ‘0’ to acknowledge receipt of all the input data (i.e. input-completeness). The OCD further triggers RvalQ to logic ‘1’ (via the inverter) to indicate a valid output (i.e. output-completeness). Once the succeeding pipeline has accepted the output, Rack will be negated to logic ‘0’. When the preceding pipeline has pre-charged the inputs and negated LvalA and LvalB to logic ‘0’, S.T will be switched to logic ‘1’ (via the ‘pre-charge’ section), hence discharging Q.T to logic ‘0’ (via the cross-coupled inverter). The OCD then triggers Lack to logic ‘1’ to acknowledge the completion of pre-charge operation. The OCD further resets RvalQ to logic ‘0’ (via the inverter) to indicate an empty output. Finally, the ASVHB cell is ready for a new operation, and other input cases can be analysed similarly. Other cells can be similarly constructed based on our proposed ASVHB realisation approach. For example, Fig. 3 depicts the ‘evaluate’ section of four other digital rudimentary cells. On the other hand, the ‘pre-charge’ section is similar for all cells, depending on the number of autonomous validity signals. Although not shown, it is worthwhile to note that the hold ‘0’ and hold ‘1’ sections are purely in series–parallel configuration to complement the ‘evaluate’ and ‘pre-charge’ sections, respectively, for all ASVHB cells. To summarise, our proposed ASVHB realisation approach has several significant attributes as follows. First, as a result of the integrated autonomous validity signals, the overall circuit implementation is simple. Second, our ASVHB cell abides by QDI requirements, hence eliminating any timing assumptions (save isochronic forks) for data signal propagation. Third, our proposed ASVHB realisation approach belongs to a static-logic implementation, which is robust towards the PVT variations [13], hence making a good candidate for sub-threshold operation. Lastly, as our ASVHB cell forms a fine-grained gate-level pipeline, hence potentially featuring a high throughput rate (when compared to those coarse-grained block-level pipelines [20]).

2.2 Comparison with the reported QDI realisation approaches For comparison purposes, we first summarise in Table 1 the general features of our proposed ASVHB realisation approach and other reported QDI realisation approaches, including WCHB [24], PCHB [18], DIMS [21], NCL [22] and PCSL [7]. These realisation approaches can generally be divided into the fine-grained gate-level (single-cell) and the coarse-grained

311

Fig. 3 Proposed ‘evaluate’ sections for ASVHB cells a 2-input OR/NOR b 2-input XOR/XNOR c 2-input MUX/IMUX d 3-input AO/AOI

Table 1 General features of various QDI realisation approaches Realisation approaches datapath protocol delay insensitive timing assumptions data encoding input-completeness logic implementation robustness towards PVT variations combinational logic external latches separate ICD separate OCD

Proposed ASVHB

WCHB [24]

DIMS [21]

fine-grained gate-level pipeline

NCL [22]

PCSL [7]

coarse-grained block-level pipeline QDI isochronic forks dual-rail encoding input-complete static-logic excellent

not required not required required

single-cell required not required required

block-level (multi-cascading cells) pipeline methods. In this paper, we will focus on the fine-grained gate-level pipeline in view of its speed-efficiency over the coarse-grained block-level pipeline; our subsequent comparisons will be limited to our proposed ASVHB and the reported WCHB and PCHB. Furthermore, in Table 1, all the QDI realisation approaches are implemented using static-logic to enable low-voltage (sub-threshold) operation [12]. We remark that the implementation in various logic families is orthogonal to the handshake protocol and is largely dependent on the specific operating conditions. For nominal (i.e. supra-threshold) voltage conditions, these realisation approaches (including our ASVHB) can be implemented in dynamic-logic for lower transistor-count (but are inappropriate for sub-threshold operations). To evaluate our proposed ASVHB realisation approach qualitatively and quantitatively, we make the following comparison. First, from the individual circuit-level perspective, we tabulate in Table 2 a comparison of the transistor-count for six basic library cells embodying our proposed ASVHB realisation approach and that embodying the reported WCHB and PCHB realisation approaches. The results are normalised with respect to the readings of the ASVHB cells whose number of transistors are shown in the parentheses. From Table 2, it is clear that our proposed ASVHB cells have the lowest number of transistor-counts. On average, the WCHB and PCHB cells require ∼2.1× and ∼1.9× more transistor-counts, respectively. Second, from the pipeline perspective, as depicted in Figs. 4a–c, we compare the pipeline structure of the proposed ASVHB

312

PCHB [18]

not required required required

required required not required

multi-cascading cells required required not required

required required not required

realisation approach against those of the reported WCHB and PCHB realisation approaches. The pipelines are based on three pipeline stages where the first pipeline stage is labelled with a subscript i and the succeeding pipeline stages are labelled as i + 1 and i + 2 accordingly. For ease of comparison, the various blocks (e.g. functional blocks, latches (if required), ICD or OCD, inverters and C-Muller gates) are explicitly shown in their pipeline stages. From Figs. 4a–c, it is easy to appreciate that our ASVHB pipeline is distinctly different from the reported WCHB and PCHB pipelines. In between two pipeline stages, the reported WCHB and PCHB pipelines have (2n + 1) wires per-input-channel (two wires for each bit of dual-rail data signal and one wire for the acknowledge signal) where n = wordlength. In contrast, the proposed ASVHB pipeline has (3n + 1) wires per-input-channel (additional wires for

Table 2 Transistor-count of ASVHB, WCHB and PCHB cells; normalised to the readings of ASVHB cells Library cells 1-input buffer 2-input AND/NAND 2-input OR/NOR 2-input XOR/XNOR 2-input MUX/IMUX 3-input AO/AOI average

Proposed ASVHB

WCHB [24]

PCHB [18]

1× (30) 1× (44) 1× (44) 1× (46) 1× (50) 1× (58) 1×

1.33× 2.09× 2.09× 2.00× 2.56× 2.62× 2.12×

1.87× 1.77× 1.77× 1.83× 2.36× 1.59× 1.87×

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

Fig. 4 Pipeline structures a Proposed ASVHB b Reported WCHB c Reported PCHB

the autonomous validity signals). The autonomous validity signal is used to validate the data nullity and pre-charge the connecting ASVHB cells. Consider first the differences between our ASVHB pipeline and the reported WCHB pipeline. Unlike the WCHB pipeline, our ASVHB pipeline does not require independent latches (e.g. Li, Li+1, Li+2 as depicted in Fig. 4b) to store the outputs; the latch function (i.e. cross-coupled inverter) has been integrated into the functional block in the ASVHB cell. Further, the autonomous validity signals in our ASVHB cells help in-part pre-charging the connecting ASVHB cells, abiding by the input-completeness requirement and yet making the circuit implementation simple. Whereas, for the WCHB pipeline, the functional blocks (that embodying high-overhead C-Muller gates [24]) are used to validate the input-completeness requirement. Hence, it is not unexpected that the WCHB cells, on average, require ∼2.1× more transistors (see Table 2) than the ASVHB counterparts. Consider now the differences between our ASVHB pipeline and the reported PCHB pipeline. The PCHB pipeline requires a separate ICD (e.g. ICDi, ICDi+1, ICDi+2 as depicted in Fig. 4c) to validate the input-completeness. Whereas, in our ASVHB pipeline, we leverage on the OCD of the preceding pipeline to generate autonomous validity signals (through an inverter) to the current pipeline. The autonomous validity signals effectively perform the ICD function. Consequently, our ASVHB realisation approach is more efficient than the reported PCHB realisation approach. The PCHB cells, on average, require ∼1.9× more transistors (see Table 2) than the ASVHB counterparts. Third, from the operation perspective (coupled from the pipeline structures), as depicted in Figs. 5a–c, we compare the marked graph behaviours of the proposed ASVHB realisation approach

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

against those of the reported WCHB and PCHB realisation approaches. The marked graph behaviours are based on the event operations from the ith to (i + 2)th pipeline stages. Their corresponding event operations are represented as nodes. For example, the nodes Fie and Fip (e.g. see Figs. 5a–c) are, respectively, denoted as evaluate and pre-charge operations for the ith-stage functional block; the nodes Lei and Lpi (e.g. see Fig. 5b) are interpreted accordingly for the ith-stage latch. For other nodes, the subscripts ‘+’ and ‘−’, respectively, denote an assertion and a negation of the signals applicable to ICD, OCD, inverter (Inv) and C-Muller gate (C). The dark bold lines represent a series of event operations for completing one local cycle [20], the longest cycle in the marked graph behaviours. From the local cycles shown in Figs. 5a–c, we determine the number of transitions per-cycle of the ASVHB (TASVHB), WCHB (TWCHB) and PCHB (TPCHB) pipelines. Equations (1)–(3) generalise their number of transitions per-cycle, respectively. For the PCHB pipeline depicted in Fig. 5c, it is interestingly to note that there are two possible local cycle paths (Path-A and Path-B). We represent those two possible local cycle paths as TPCHB-A and TPCHB-B according to (4) and (5), respectively. For brevity in (1)– (5), t(x) denotes the number of transitions for any node x e e ) + t(Fi+2 ) TASVHB = t(Fie ) + t(Fi+1 p + + t(OCD− i+2 ) + t(Fi+1 ) + t(OCDi+1 )

(1)

e e ) + t(Lei+1 ) + t(Fi+2 ) + t(Lei+2 ) TWCHB = t(Lei ) + t(Fi+1 p + + t(OCD− i+2 ) + t(Li+1 ) + t(OCDi+1 )

(2)

313

TPCHB = max{TPCHB−A , TPCHB−B } TPCHB−A =

t(Fie )

+

e t(Fi+1 )

+

+ t(OCD+ i+2 ) t(OCD− ) + t(C+ i+1 i+1 )

(3)

e t(Fi+2 )

p + t(C− i+2 ) + t(Fi+1 ) +

pipelines Wordlength

(4)

e TPCHB−B = t(Fie ) + t(Fi+1 ) + t(ICD+ i+2 ) p − + + t(C− i+2 ) + t(Fi+1 ) + t(OCDi+1 ) + t(Ci+1 )

Table 3 Number of transitions per-cycle of ASVHB, WCHB and PCHB

(5)

Table 3 summarises the different possible number of transitions per-cycle for pipelines embodying 1-bit, 2-bit and 3-bit functional blocks within the ASVHB, WCHB and PCHB realisation approaches. The pipelines embodying 1-bit functional blocks are the de-facto example used [20] for cycle-time analysis. The pipelines embodying multiple-bit functional blocks provide additional information on the relationship between the cycle-time and the wordlength of the pipelines. For our comparison in Table 3, we assume two transitions for each of the evaluate and pre-charge operations (F e and F p) in the functional blocks for the

1-bit 2-bit 3-bit normalised average

a Proposed ASVHB b Reported WCHB c Reported PCHB

314

Proposed ASVHB

WCHB [24]

PCHB [18]

10 10 10 1×

14 18 22 1.8×

14 15 15 1.5×

ASVHB and PCHB cells. Whereas for the WCHB functional blocks, the number of transitions for each F e and F p is 2 × n. We further assume two transitions for each the latch evaluate and pre-charge operations (L e and L p), one transition for OCD assertion/negation, and two transitions for C-Muller gate (C) assertion/negation. For the ASVHB pipeline, the number of transitions per-cycle remains constant at ten transitions for various wordlength. The low and constant number of transitions (for any wordlength of the ASVHB pipeline) implies potential high speed. We, nonetheless, remark that the actual speed of the ASVHB cell will vary, in part because of the fan-in of the functional blocks. For the WCHB pipeline, as the number of transitions for F e and F p increases when the number of inputs increases, the overall number of transitions will increase from 14 (in the 1-bit pipeline) to 22 (in the 3-bit pipeline). The increase of the number of transitions in F e and F p is mainly because of a need of multiple cascading cells to propagate signals to accommodate all the inputs. For the PCHB pipeline, the longest cycle for a 1-bit pipeline e and OCD+ involves Path-A (see Fig. 5c). Path-A involves Fi+2 i+2 − subsequently (three transitions in total) before the Ci+2 node in the (i + 2)th-stage. Hence, the number of transitions per-cycle for the 1-bit pipeline is 14 transitions; see (4). For the 2-bit pipeline and 3-bit pipeline, the longest cycle now involves Path-B, which includes ICD+ i+2 (four transitions for 2-input and 3-input cells) before the C− i+2 node in the (i + 2)th-stage. As a result, the total number of transitions per-cycle for both the 2-bit pipeline and 3-bit pipeline is 15 transitions; see (5). By comparison in Table 3, we find that the proposed ASVHB pipeline features the lowest number of transitions per-cycle. The reported WCHB and PCHB pipelines, on average, undesirably require 1.8× and 1.5× more transitions per-cycle, respectively. This is because the proposed ASVHB pipelines involve only t(F) and t (OCD) in completing one local cycle. The reported WCHB and PCHB pipelines on the other hand, respectively, require the additional t(L) (for output latching) and the additional t(C) (for validating ICD and OCD), thereby longer delay. For completeness, the throughput analysis of the pipeline by using the number of transitions per-cycle remains contentious. This is because the transistor stacking will in part affect the overall speed. For example, the longer PMOS and NMOS transistors in series of the transistor stacking may cause longer delay per-transition, and vice-versa. To a large extent, the lower number of transitions also implies lesser number of switchings, potentially reducing the dynamic power dissipation.

3

Fig. 5 Marked graph behaviours

Number of transitions per-cycle

Proposed 32-bit ASVHB ALU

We demonstrate an async 32-bit ALU using our novel ASVHB realisation approach. The proposed async ALU computes either 32-bit arithmetic operations (add, subtract, accumulate) or logic operations (AND, OR, XOR). Fig. 6 depicts the overall architecture of the proposed async 32-bit ALU, comprising two main parts, an arithmetic unit and a logic unit, together with an input de-multiplexer and an output multiplexer. The input signals are 32-bit dual-rail signals A and B, and the output signal is 32-bit dual-rail signal S. The various operations are determined by the single-rail control signals Add_On, Sub_On, Acc_On, AND_On, OR_On, XOR_On. For simplicity, the handshake signals within

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

Fig. 6 Proposed async 32-bit ALU architecture

various circuits are not shown. The input signals A and B are the data input to the input de-multiplexer for either arithmetic or logic operation. After the operation, the output multiplexer selects the output from either the arithmetic unit or the logic unit. Finally, the output signal S is computed. In the arithmetic unit, a 32-bit Kogge-Stone (KS) adder forms the core. The KS adder operand a is driven by the input signal A through the input de-multiplexer. The adder operand b is driven by either the input signal B through the input de-multiplexer (for add and subtract operations), or the feedback-output of the KS adder (for accumulate operation). The adder operand c is driven by the 3:1 multiplexer which selects the control signals Add_On, Sub_On or Acc_On for add, subtract or accumulate operations, respectively, in the arithmetic unit. The output of the KS adder (operand s) is connected to a 1:2 result de-multiplexer which determines the output is either sent back to the adder operand b (for accumulate operation) or directly sent to the output multiplexer. In contrast, a 32-bit logic operation block is the core component in the logic unit. The configuration is much simpler since both operands of the AND, OR and XOR operations are connected to the input signals A and B through the input de-multiplexer. The output of the logic operation block is connected to a 3:1 multiplexer, and that eventually passes over to the output multiplexer.

4 4.1

Simulation results Design implementation and simulations

The async ALU embodying the proposed ASVHB realisation approach is implemented based on the STMicroelectronics (STM) 65 nm CMOS process. Minimum sizing is used for all transistors. We adopt a full-custom approach to implement the ASVHB ALU. The layouts of all library cells, including the ASVHB cells and standard (single-rail) library cells, are first hand-crafted using the Cadence Virtuoso Layout tool. The extracted library-exchange-format files of the library cells are generated using the Cadence Abstract Generator tool. From the Verilog files, the ASVHB ALU layout is placed and routed using the Cadence

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

First Encounter tool. Eventually, the ASVHB ALU layout is simulated and verified using the Synopsys Nanosim tool. Fig. 7 depicts the layout view of our 32-bit async ASVHB ALU. Our proposed ASVHB ALU core features 304 µm × 304 µm, equivalent to 0.092 mm2 area in total. We analyse the energy dissipation per-operation and throughput of our ASVHB ALU. The energy dissipation per-operation represents the power-delay product for the given throughput. The throughput represents the fastest operating speed for the data propagation in each pipeline. In view of the robust sub-threshold operations for our intended low power applications, we consider temperature variations at high temperature (100°C), room temperature (27°C) and low temperature (−40°C) within the sub-threshold voltage region, ranging from VDD = 0.15 to 0.4 V. Fig. 8a depicts the normalised energy dissipation per-operation of our proposed ASVHB ALU; the results are normalised to the readings at 0.2 V, 27°C. From the graph, we remark the following. First, the higher operating temperature causes the higher energy dissipation. This is because of the increase in the leakage power dissipation when the temperature increases, causing the overall power dissipation to be increased. Second, the minimum energy voltage point increases as the temperature increases. For example, the minimum energy voltage point increases from VDD = 0.2 V at −40°C, to VDD = 0.25 V at 27°C, to VDD = 0.3 V at 100°C. Since the leakage power dissipation increases as the temperature increases (while dynamic power dissipation largely remains the same), the domination effect of the leakage power dissipation over the dynamic power dissipation becomes increasingly larger, moving the minimum energy voltage point further towards higher operating voltage. Fig. 8b depicts the normalised throughput of our proposed ASVHB ALU; the results are normalised to the readings at 0.2 V, 27°C. From the graph, we remark the following. First, as expected, the throughput reduces as the VDD reduces. Second, the throughput increases when the temperature increases because of the sub-threshold operation effects [12]. This is in contrast to within the nominal voltage region, in which the throughput reduces when the temperature increases because of the slower electron mobility at high temperature [12].

315

threshold voltage (LVT; VT ≃ |0.26 V|) for sub-threshold operations, ranging from VDD = 0.15 to 0.4 V. Fig. 9a depicts the normalised energy dissipation per-operation of our proposed ASVHB ALU; the results are normalised to the readings at 0.2 V, SVT. From the graph, we remark the following. First, as expected, the lower threshold voltage results in higher energy dissipation. This is because of the increase in the leakage power dissipation when the threshold voltage decreases, causing the overall increased power dissipation. Second, when VDD < 0.3 V the design using SVT dissipates lower energy than that of HVT. This is because of the sub-threshold effects which cause the poor delay in design using HVT although it dissipates lower leakage. Third, the minimum energy voltage point remains the same at VDD = 0.25 V as the threshold voltage changes. Fig. 9b depicts the normalised throughput of our proposed ASVHB ALU; the results are normalised to the readings at 0.2 V, SVT. From the graph, we remark the following. First, as expected, the throughput reduces as the VDD reduces. Second, the throughput increases when the threshold voltage decreases.

4.2

Fig. 7 Layout view of the proposed ASVHB ALU

To gain insights into the effect of threshold voltage variability, we consider the three process options, each with different threshold voltages: (i) high threshold voltage (HVT; VT ≃ |0.41 V|), (ii) standard threshold voltage (SVT; VT ≃ |0.32 V|) and (iii) low

Comparison

To better illustrate the energy and throughput advantages of our proposed ASVHB ALU within the sub-threshold voltage region (range from VDD = 0.15 to 0.4 V at 27°C, SVT), we benchmark our design against the async ALUs embodying the reported WCHB and PCHB realisation approaches using the same fabrication process. The comparison is based on the pre-layout simulations. Fig. 10a depicts the energy dissipation per-operation of our proposed ASVHB ALU, the reported WCHB and PCHB ALUs;

Fig. 8 Proposed ASVHB ALU at various temperature a Energy dissipation b Throughput; normalised to the readings at 0.2 V, 27°C

Fig. 9 Proposed ASVHB ALU at various threshold voltages a Energy dissipation b Throughput; normalised to the readings at 0.2 V, SVT

316

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

Fig. 10 Async ALUs within sub-threshold voltage region a Energy dissipation b Throughput; normalised to the readings of the proposed ASVHB ALU at 0.2 V

Table 4 Results of Async ALUs on 65 nm CMOS process; normalised to the readings of proposed ASVHB ALU Async ALUs

Transistorcount

Energy dissipation 0.2 V

proposed ASVHB WCHB [24] PCHB [18]

1× (68 734) 1.65× 1.40×

1V

Throughput

0.2 V

Area

1V

1× 1× 1× 1× 0.092 (958 fJ) (13.5 pJ) (7.6 MHz) (0.81 GHz) mm2 1.68× 1.30× 0.95× 1.06× — 2.63× 1.68× 0.73× 1.43× —

the results are normalised to the readings of the proposed ASVHB ALU at 0.2 V. From the graph, we remark the following. First, among all designs, our design dissipates the lowest energy within the sub-threshold region. This is because of the novel features of the proposed ASVHB cells, resulting in the lower transistor-count and lesser transistor switchings, hence in turn reducing the overall energy dissipation. Second, although all the designs have the similar minimum energy voltage point at VDD = 0.25 V; our design is nonetheless more energy-efficient than the reported designs. Third, when further scaling the voltage downwards below VDD = 0.25 V, the energy dissipation increases again (as the leakage energy now dominates) for all the designs. In this 0.15 V ≤ VDD ≤ 0.25 V region, the energy dissipation for our design only increases marginally, whereas for the reported WCHB and PCHB ALUs increase significantly. The relatively flat increase of the energy dissipation for our design indicates that our design is highly appropriate for ultra-low voltage low power applications. Fig. 10b depicts the normalised throughput of our proposed ASVHB ALU, the reported WCHB and PCHB ALUs; the results are normalised to the readings of the proposed ASVHB ALU at 0.2 V. From the graph, we remark that all the designs have the

similar speed at 0.4 V, and our design is marginally more speed-efficient when scaling downwards the voltage. We further remark that our intention in this work is for low energy dissipation. Hence we adopt minimum transistor sizing for all designs. For high speed operation, larger transistor sizing can be employed and thus higher energy dissipation is expected. For completeness, we further compare three async ALUs in terms of transistor-count, energy dissipation per-operation and throughput at both the sub-threshold (0.2 V) and at nominal (1 V) voltages, respectively. Table 4 tabulates the normalised results of the async ALUs. The results with respect to the proposed ASVHB ALU are indicated in the parentheses. In terms of the transistor-count, our design features the lowest. Particularly, the WCHB and PCHB counterparts require 1.65× and 1.40× more transistors, respectively, than our design. The normalised result is largely in-line with the analysis of the transistor-count for the library cells of various QDI realisation approaches (Section 2.2). However, the normalised transistor-count ratio of the ALUs is noted to be smaller than that of the library cells. This is because the ALUs of various QDI realisation approaches embody not only their own dual-rail library cells but also the standard single-rail cells for handshake signals. In terms of the energy dissipation at both the sub-threshold and nominal voltages, our proposed design is shown to be dissipated in the lowest energy. The WCHB and PCHB counterparts, respectively, dissipate 1.68× and 2.63× higher energy than our design at 0.2 V, and, respectively, dissipate 1.3× and 1.68× higher energy at 1 V. The energy reduction is smaller at 1 V as the leakage energy reduction is expected to be smaller. In terms of the throughput at 0.2 V, our design performs the fastest (i.e. 7.6 MHz). The WCHB and PCHB counterparts are, respectively, 0.95× and 0.73× slower than our design. However when operating at 1 V, the PCHB counterpart is the fastest, followed by the WCHB counterpart and our design is the slowest. We attribute the reason to the minimum sizing used (especially for

Table 5 Comparison of various ALUs ALUs

Vangal et al. [26] Shimazaki et al. [27] Mathew et al. [28] Wijeratne et al. [29] Truong et al. [1] WCHB PCHB proposed ASVHB

CMOS, nm

VDD, V

Wordlength/pipeline structure/ logic family

PVT immunity/dynamic voltage scaling/min Op. region

Energy, pJ

Through-put, GHz

Area, mm2

130 180 90 65 65 65 65 65

1.43 1.8 1.3 1.3 1.3 1 1 1

32-bit; NPa; domino-logic 64-bit; NPa; domino-logic 32-bit; NPa; semi-dynamic-logic 32-bit; NPa; domino-logic 16-bit; NPa; domino-logic 32-bit; Pb; static-logic 32-bit; Pb; static-logic 32-bit; Pb; static-logic

weak; non-DVS; nominal moderate; DVS; near-thc weak; non-DVS; nominal weak; non-DVS; nominal moderate; DVS; near-thc excellent; DVS; sub-thd excellent; DVS; sub-thd excellent; DVS; sub-thd

74.0 670.0 34.0 1200.0 40.4 17.6 22.7 13.5

5.00 1.16 7.00 9.00 1.20 0.86 1.16 0.81

0.135 0.152 0.073 — 0.004 — — 0.092

a

Non-pipeline. Pipeline. Near-threshold. d Sub-threshold. b c

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

317

PMOS transistors) for slow speed in our design. As discussed earlier, the operating speed could be enhanced by up-sizing the PMOS transistors but at the cost of higher power. For completeness, Table 5 tabulates the comparison of various ALUs. As the process technology, design architecture, wordlength, implementation of those are different; the comparison of the ALUs is somewhat contentious. Nonetheless, our proposed ASVHB ALU (and the WCHB and PCHB ALUs) feature high robustness (excellent PVT immunity) because of the nature of QDI operation, and they are suitable candidates for the sub-threshold operation and for the dynamic voltage scaling. Furthermore, our design shows the impressive energy efficiency advantage through dissipating the lowest energy per-operation.

5

Conclusions

We have proposed a novel async QDI ASVHB realisation approach for the sub-threshold low power operation. The comparison is benchmarked against the competitive reported WCHB and PCHB realisation approaches. Our ASVHB realisation approach is shown to be the best from the circuit and pipeline perspective. The WCHB and PCHB library cells, on average, require ∼2.1× and ∼1.9× more transistors than the ASVHB library cells, whereas the WCHB and PCHB pipelines, on average, require 1.8× and 1.5× more transitions per-cycle than the ASVHB pipeline. We have further implemented our ASVHB realisation approaches in a 32-bit ALU and compared it to the WCHB and PCHB counterparts. Our design features 0.092 mm2 chip area; the WCHB and PCHB counterparts feature ∼1.7× and ∼1.4× more transistor-count, respectively. At VDD = 0.2 V, our design is the best in overall performance. In terms of the energy dissipation, the WCHB and PCHB counterparts dissipate ∼1.7× and ∼2.6× higher energy, respectively. In terms of the throughput, the WCHB and PCHB counterparts are ∼0.95× and ∼0.73× slower, respectively. Finally, when compared to the various reported ALUs, our proposed design has demonstrated the competitive advantage in terms of energy-efficiency and robustness towards PVT variations.

6

References

1 Truong, D.N., Cheng, W.H., Mohsenin, T., et al.: ‘A 167-processor computational platform in 65 nm CMOS’, IEEE J. Solid-State Circuits (JSSC), 2009, 44, (4), pp. 1130–1144 2 Yu, Z., Meeuwsen, M.J., Apperson, R.W., et al.: ‘AsAP: an asynchronous array of simple processors’, IEEE J. Solid-State Circuits (JSSC), 2008, 43, (3), pp. 695–705 3 Amde, M., Felicijan, T., Efthymiou, A., et al.: ‘Asynchronous on-chip networks’, IEE Proc. Comput. Digital Tech., 2005, 152, (2), pp. 273–283 4 Rossi, D., Campi, F., Spolzino, S., et al.: ‘A heterogeneous digital signal processor for dynamically reconfigurable computing’, IEEE J. Solid-State Circuits (JSSC), 2010, 45, (8), pp. 1615–1626 5 Chong, K.-S., Chang, K.-L., Gwee, B.-H., Chang, J.: ‘Synchronous-logic and globally-asynchronous-locally-synchronous (GALS) acoustic digital signal processors’, IEEE J. Solid-State Circuits (JSSC), 2012, 47, (3), pp. 769–780 6 Raychowdhury, A., Tokunaga, C., Beltman, W., et al.: ‘A 2.3 nJ/frame voice activity detector-based audio front-end for context-aware system-on-chip applications in 32-nm CMOS’, IEEE J. Solid-State Circuits (JSSC), 2013, 48, (8), pp. 1963–1969

318

7 Lin, T., Chong, K.-S., Chang, J.S., Gwee, B.-H.: ‘An ultra-low power asynchronous-logic in-situ self-adaptive VDD system for wireless sensor networks’, IEEE J. Solid-State Circuits (JSSC), 2013, 48, (2), pp. 573–586 8 Chen, G., Hanson, S., Blaauw, D., Sylvester, D.: ‘Circuit design advances for wireless sensing applications’. IEEE Proc., 2010, vol. 98, no. 11, pp. 1808–1827 9 Chang, K.-L., Gwee, B.-H., Chang, J.S., Chong, K.-S.: ‘Synchronous-logic and asynchronous-logic 8051microcontroller cores for realizing the internet of things: a comparative study on dynamic voltage scaling and variation effects’, IEEE J. Emerg. Sel. Top. Circuits Syst. (JETCAS), 2013, 3, (1), pp. 23–34 10 Lazarescu, M.T.: ‘Design of a WSN platform for long-term environmental monitoring for IoT applications’, IEEE J. Emerg. Sel. Top. Circuits Syst. (JETCAS), 2013, 3, (1), pp. 45–54 11 Li, Y.W., Patounakis, G., Shepard, K.L., Nowick, S.M.: ‘High-throughput asynchronous datapath with software-controlled voltage scaling’, IEEE J. Solid-State Circuits (JSSC), 2004, 39, (4), pp. 704–708 12 Wang, A., Calhoun, B.H., Chandakasan, A.P.: ‘Sub-threshold designs for ultra-low-power systems’ (Springer, 2006) 13 Zakaria, H., Fesquet, L.: ‘Designing a process variability robust energy-efficient control for complex SOCs’, IEEE J. Emerg. Sel. Top. Circuits Syst. (JETCAS), 2011, 1, (2), pp. 160–172 14 Sparsø, J., Furber, S.: ‘Principle of asynchronous circuit design: a system perspective’ (Kluwer Academic, Norwell, MA, 2001) 15 Chong, K.-S., Gwee, B.-H., Chang, J.: ‘Design of several asynchronous-logic macrocells for a low voltage micropower cell library’, IET Circuits Devices Syst., 2007, 1, (2), pp. 161–169 16 Jung, E.-G., Choi, B.-S., Won, Y.-G., Lee, D.-I.: ‘Handshake protocol using return-to-zero data encoding for high performance asynchronous bus’, IEE Proc. Comput. Digital Technol., 2003, 150, (4), pp. 245–251 17 Martin, A.J.: ‘The limitations to delay insensitivity in asynchronous circuits’. Proc. Sixth MIT Conf. on Advanced Research in VLSI, 1990, pp. 263–278 18 Martin, A.J., Nystrom, M.: ‘Asynchronous techniques for system on-chip designs’. IEEE Proc., June 2006, vol. 94, no. 6, pp. 1089–1120 19 Mohammadi, S., Furber, S., Garside, J.: ‘Designing robust asynchronous circuit components’, IEE Proc. Circuits Devices Syst., 2003, 150, (3), pp. 161–166 20 Singh, M., Nowick, S.M.: ‘The design of high-performance dynamic asynchronous pipelines: lookahead style’, IEEE Trans. VLSI Syst. (TVLSI), 2007, 15, (11), pp. 1256–1269 21 Sparso, J., Staunstrup, J., Dantzer-Sorensen, M.: ‘Design of delay insensitive circuits using multi-ring structures’. Proc. European Design Automation Conf., 1992, pp. 15–20 22 Jorgenson, R.D., Sorensen, L., Leet, D., et al.: ‘Ultralow-power operation in subthreshold regimes applying clockless logic’. Proc. IEEE, February 2010, vol. 98, no. 2, pp. 299–314 23 Lines, A.M.: ‘Pipelined asynchronous circuits’. Master Thesis, Dept. of Computer Science, California Institute of Technology (Caltech), 1995 24 Thonnart, Y., Beigne, E., Vivet, P.: ‘A pseudo-synchronous implementation flow for WCHB QDI asynchronous circuits’. Int. Symp. on Async. Circuit and Systems (ASYNC), May 2012, pp. 73–80 25 Ho, W.-G., Chong, K.-S., Gwee, B.-H., Chang, J.S.: ‘Low power sub-threshold asynchronous QDI static-logic transistor-level implementation (SLTI) 32-bit ALU’. IEEE Int. Symp. on Cirt. and Syst. (ISCAS), Beijing, China, June 2013, pp. 353–356 26 Vangal, S., Anders, M.A., Borkar, N., et al.: ‘5-GHz 32-bit integer execution core in 130 nm dual-VT CMOS’, IEEE J. Solid-State Circuits (JSSC), 2002, 37, (11), pp. 1421–1432 27 Shimazaki, Y., Zlatanovici, R., Nikolic, B.: ‘A shared-well dual-supply-voltage 64-bit ALU’, IEEE J. Solid-State Circuits (JSSC), 2004, 39, (3), pp. 494–500 28 Mathew, S.K., Anders, M.A., Bloechel, B., et al.: ‘A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS’, IEEE J. Solid-State Circuits (JSSC), 2005, 40, (1), pp. 44–51 29 Wijeratne, S.B., Siddaiah, N., Mathew, S.K., et al.: ‘A 9-GHz 65-nm Intel Pentium 4 processor integer execution unit’, IEEE J. Solid-State Circuits (JSSC), 2007, 42, (1), pp. 26–37 30 Uhrig, S., Jahr, R., Ungerer, T.: ‘Advanced architecture optimisation and performance analysis of a reconfigurable grid ALU processor’, IET Comput. Digital Tech., 2012, 6, (5), pp. 334–341 31 Chatterjee, B., Sachdev, M.: ‘Design of a 1.7 GHz low-power delay-fault-testable 32-b ALU in 180 nm CMOS technology’, IEEE Trans. VLSI Syst. (TVLSI), 2005, 13, (11), pp. 1296–1304

IET Circuits Devices Syst., 2015, Vol. 9, Iss. 4, pp. 309–318 & The Institution of Engineering and Technology 2015

Suggest Documents