J Sign Process Syst DOI 10.1007/s11265-008-0249-8
Forward and Reverse Converters and Moduli Set Selection in Signed-Digit Residue Number Systems Andreas Persson · Lars Bengtsson
Received: 8 March 2007 / Revised: 21 May 2008 / Accepted: 12 June 2008 © 2008 Springer Science + Business Media, LLC. Manufactured in The United States
Abstract This paper presents an investigation into using a combination of two alternative digital number representations, the residue number system (RNS) and the signed-digit (SD) number representation, in digital arithmetic circuits. The combined number system is called RNS/SD for short. Since the performance of RNS/SD arithmetic circuits depends on the choice of the moduli set (a set of pairwise prime numbers), the purpose of this work is to compare RNS/SD number systems based on different sets. Five specific moduli sets of different lengths are selected. Moduli-set-specific forward and reverse RNS/SD converters are introduced for each of these sets. A generic conversion technique for moduli sets consisting of any number of elements is also presented. Finite impulse response (FIR) filters are used as reference designs in order to evaluate the performance of RNS/SD processing. The designs are evaluated with respect to delay and circuit area in a commercial 0.13 μm CMOS process. For the case of FIR filters, it is shown that generic moduli sets with five or six moduli result in designs with the best area×delay products.
A. Persson, Centre for Research on Embedded Systems (CERES), Halmstad University, Sweden. e-mail: [email protected]
L. Bengtsson (B), Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden. e-mail: [email protected]
Keywords Residue number system · Signed-digit · Moduli-selection · Converters · FIR filters
1 Introduction This paper presents an investigation into a combination of two number representations: the residue number system (RNS) and the signed-digit (SD) number system. In RNS, an integer is decomposed into a set of residues with shorter binary representations, which can be processed in parallel. Carry propagation within RNS arithmetic circuits can be eliminated by using the SD number system to represent the residues. The SD number system provides a redundant number representation that facilitates carry-free addition. The use of SD numbers also implies efficient modulo arithmetic, which helps to simplify crucial RNS operations. The basis of a residue number system is a set of pairwise prime integers, called the moduli set. The performance of RNS processing depends on the choice of this set and on the implementation of forward and reverse RNS conversion. Many moduli sets and conversion techniques have been suggested for RNS systems with residues in 2’s complement form. This work investigates moduli sets and converters for use with signed-digit residue number systems (RNS/SD). The aim is to investigate how the choice of the moduli set affects the performance of RNS/SD arithmetic operations. A number of moduli sets were selected for evaluation, and forward and reverse RNS/SD converters were implemented for each of these sets. In order to compare the performance of RNS/SD processing, RNS/SD finite impulse response (FIR) filters were implemented using Synopsys Design Compiler and a
0.13 μm CMOS cell library from UMC. The synthesized designs are compared with respect to delay (speed) and circuit area. The paper is organized as follows. Section 2 gives an introduction to the signed-digit residue number system and outlines the principles of forward and reverse RNS/SD conversion. The moduli sets selected for evaluation in this work are presented in Section 3 together with guidelines for selecting efficient moduli sets. Section 4 presents a technique for RNS/SD encoding and Section 5 presents moduli-set-specific decoding techniques for each of the sets introduced in Section 3. A reverse conversion technique for RNS/SD number systems using general moduli sets is presented in Section 6. The technique detailed in Section 6 is applicable to all coprime sets with one element of the form $2^n$. The RNS/SD finite impulse response (FIR) filters which have been used as reference designs are described in Section 7. ASIC synthesis results are presented in Section 8, where performance evaluations and comparisons are made with respect to delay and circuit area. Section 9 gives the conclusions.
2 Background

2.1 Residue Number System

The residue number system (RNS) is an integer system capable of supporting high speed concurrent arithmetic. In RNS, an integer is decomposed into a set of smaller integers (i.e. with shorter binary representations), which can be processed independently and in parallel. The basis of an RNS is a set of pairwise prime integers $S = \{m_1, m_2, \ldots, m_L\}$, where $\gcd(m_i, m_j) = 1$ for $i \neq j$. The set S is called the moduli set and the dynamic range of the number system is $[0, M)$, where M is the product of all moduli $m_i$ in S. Any integer X within the dynamic range has a unique RNS representation given by an ordered set of residues

$$X \rightarrow \{x_1, x_2, \ldots, x_L\}, \quad x_i = \langle X \rangle_{m_i},$$

where $\langle X \rangle_{m_i}$ denotes X mod $m_i$. The most important characteristic of the RNS representation is that it is a non-weighted number system, which facilitates parallel computing. If integers A and B have RNS representations $\{a_1, a_2, \ldots, a_L\}$ and $\{b_1, b_2, \ldots, b_L\}$ respectively, then the RNS representation of $C = A \circ B$ is

$$C \rightarrow \{c_1, c_2, \ldots, c_L\}, \quad c_i = \langle a_i \circ b_i \rangle_{m_i},$$

where $\circ$ denotes addition, subtraction, multiplication or any combination of the three. The computation of $c_i$ depends upon $a_i$, $b_i$ and $m_i$ only. Hence, each $c_i$ can be computed using a separate arithmetic unit, often called a channel. The reconstruction of X from $\{x_1, x_2, \ldots, x_L\}$ is based on the Chinese Remainder Theorem (CRT)

$$X = \left\langle \sum_{i=1}^{L} \hat{M}_i \left\langle \hat{M}_i^{-1} \right\rangle_{m_i} x_i \right\rangle_M \tag{1}$$

where $M = \prod_{i=1}^{L} m_i$, $\hat{M}_i = M / m_i$ and $\langle \hat{M}_i^{-1} \rangle_{m_i}$ is the multiplicative inverse of $\hat{M}_i$ modulo $m_i$, such that $\langle \hat{M}_i \hat{M}_i^{-1} \rangle_{m_i} \equiv 1$.

2.2 Signed-Digit Number System

The radix-2 signed-digit (SD) number system has the digit-set $\{\bar{1}, 0, 1\}$, where $\bar{1}$ denotes −1. An N-digit SD number $Y = [y_{N-1} \ldots y_0]_{SD}$, $y_i \in \{\bar{1}, 0, 1\}$, has the value

$$Y = \sum_{i=0}^{N-1} 2^i y_i \tag{2}$$
which is the same as for an unsigned binary number except that $y_i$ can be −1. This yields a redundant number representation. For example, 6 can be represented as $[0110]_{SD}$, $[1\bar{1}10]_{SD}$ or $[10\bar{1}0]_{SD}$. Zero, however, has a unique representation. To represent an SD digit y, two bits, $y^-$ and $y^+$, are required. That is, $y = [y^- y^+]$. Using this digit encoding, the value of an N-digit SD number $Y = [y_{N-1} \ldots y_0]_{SD}$ is given by Eq. 3.

$$Y = \sum_{i=0}^{N-1} 2^i y_i^+ - \sum_{i=0}^{N-1} 2^i y_i^- \tag{3}$$
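Eqs. 2 and 3 can be checked numerically with a short sketch. The function names are hypothetical, digits are listed most significant first, and the digit 0 is assumed to be encoded as the bit pair (0, 0); the text does not state which pair is used for zero.

```python
def sd_value(digits):
    """Eq. 2: value of an SD number, digits in {-1, 0, 1}, most significant first."""
    value = 0
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("SD digits must be -1, 0 or 1")
        value = 2 * value + d
    return value


def sd_to_bit_pairs(digits):
    """Encode each digit y as (y_minus, y_plus); 0 -> (0, 0) is an assumption."""
    return [(1 if d == -1 else 0, 1 if d == 1 else 0) for d in digits]


def bit_pairs_value(pairs):
    """Eq. 3: weighted sum of the plus bits minus weighted sum of the minus bits."""
    n = len(pairs)
    plus = sum(p << (n - 1 - i) for i, (_, p) in enumerate(pairs))
    minus = sum(m << (n - 1 - i) for i, (m, _) in enumerate(pairs))
    return plus - minus


# The three representations of 6 quoted in the text.
for rep in ([0, 1, 1, 0], [1, -1, 1, 0], [1, 0, -1, 0]):
    print(rep, sd_value(rep), bit_pairs_value(sd_to_bit_pairs(rep)))
```

All three digit vectors evaluate to the same integer under both formulas, which is the redundancy the text refers to.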
Note that, unlike the 2’s complement representation, it is possible to represent any integer and its negation with an equal number of digits. The negation of an integer is a very simple operation in the SD number system. The negation of y = [y− y+ ] is (−y) = [y+ y− ], as can be seen by negating Eq. 3. No logic gates are required for this operation. By exploiting the redundancy of the Signed Digit number representation, carry propagation is limited to one bit position when adding SD numbers. Addition of two numbers X and Y is performed according to the set of rules presented in Table 1, in which ci denotes the (non-propagating) carry and ui the interim sum. These rules avoid any carry propagation when the final sum si is computed according to Eq. 4. Consequently, addition
is performed in constant time, regardless of operand widths.

$$s_i = u_i + c_i, \quad s_i \in \{\bar{1}, 0, 1\} \tag{4}$$

Table 1 Rules for adding SD numbers.

  x_i, y_i        x_{i−1}, y_{i−1}           c_{i+1}    u_i
  0 0             −                          0          0
  0 1             neither is $\bar{1}$       1          $\bar{1}$
  0 1             at least one is $\bar{1}$  0          1
  0 $\bar{1}$     neither is $\bar{1}$       0          $\bar{1}$
  0 $\bar{1}$     at least one is $\bar{1}$  $\bar{1}$  1
  1 1             −                          1          0
  $\bar{1}$ $\bar{1}$   −                    $\bar{1}$  0
  1 $\bar{1}$     −                          0          0

2.3 Combining RNS and SD

The use of the SD number system has been suggested as a way to eliminate carry propagation within RNS arithmetic circuits. The carry-free properties of the SD number system provide constant time addition operations. The parallel processing capabilities of the residue number system result in faster and more area-efficient multiplication operations. The use of signed-digit representation also implies efficient modulo arithmetic, which helps to simplify crucial RNS operations. An important consideration when designing RNS systems is the choice of the moduli set. Sets with elements of the forms $2^n$, $2^n - 1$ and $2^n + 1$ are of special interest. Such low-cost moduli facilitate the use of simplified arithmetic units. The properties of the SD number system help to further simplify modulo arithmetic for low-cost moduli. Addition modulo $2^n - 1$ and $2^n + 1$ is performed using SD adders with end-around-carry logic [1]. Due to the limited carry propagation in SD adders, there is no delay penalty for the end-around-carry operations. Furthermore, unlike in the 2’s complement number system, the result of modulo $2^n + 1$ addition is represented using n SD digits, since representations for sums greater than $2^n - 1$ are taken from the negative range. Modulo multiplication by powers of two with respect to the low-cost moduli relies on simple shift operations, according to the rules in Eq. 5, where $x = [x_{n-1} \ldots x_0]_{SD}$ is an n-digit SD number and $\bar{x}_j$ denotes the negation of digit $x_j$. These operations are accomplished by wiring connections appropriately.

$$\langle 2^a x \rangle_{2^n - 1} = [x_{n-a-1} \ldots x_0 \; x_{n-1} \ldots x_{n-a}]_{SD}$$
$$\langle 2^a x \rangle_{2^n} = [x_{n-a-1} \ldots x_0 \; 0 \ldots 0]_{SD}$$
$$\langle 2^a x \rangle_{2^n + 1} = [x_{n-a-1} \ldots x_0 \; \bar{x}_{n-1} \ldots \bar{x}_{n-a}]_{SD} \tag{5}$$

2.4 Previous Work

The RNS and SD number systems are well known and have been thoroughly studied in the literature,
for example in [2–6]. The possibility of combining RNS and SD arithmetic has also been studied, to a lesser extent, most notably by Wei and Shimizu [7–9]. In [1], Lindström et al. present efficient forward and reverse converters for the combined RNS/SD number system using the popular moduli set $\{2^n - 1, 2^n, 2^n + 1\}$. In [10], Lindahl and Bengtsson present direct form finite impulse response (FIR) filter implementations using RNS/SD. Their work shows that the use of the RNS/SD number representation can reduce circuit area and power dissipation while the clock period is retained.
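The addition rules of Table 1 and Eq. 4 in Section 2.2 can be exercised with a small software model. This is only a sketch: the function names are hypothetical, digit lists are stored least significant first (so index i has weight 2^i), and the neighbours of position 0 are assumed to be zero.

```python
from itertools import product


def sd_value(digits):
    """Value of an SD number; digits[i] has weight 2**i."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x, y):
    """Add two equal-length SD numbers with the rules of Table 1 and Eq. 4."""
    n = len(x)
    carry = [0] * (n + 1)    # carry[i + 1] is produced at position i
    interim = [0] * (n + 1)  # interim[n] stays 0
    for i in range(n):
        # Table 1 inspects the digits one position to the right.
        prev_has_neg = i > 0 and (x[i - 1] == -1 or y[i - 1] == -1)
        t = x[i] + y[i]
        if t == 2:
            carry[i + 1], interim[i] = 1, 0
        elif t == -2:
            carry[i + 1], interim[i] = -1, 0
        elif t == 1:
            carry[i + 1], interim[i] = (0, 1) if prev_has_neg else (1, -1)
        elif t == -1:
            carry[i + 1], interim[i] = (-1, 1) if prev_has_neg else (0, -1)
        # t == 0 leaves both the carry and the interim sum at 0.
    # Eq. 4: each sum digit depends only on u_i and the incoming c_i.
    return [interim[i] + carry[i] for i in range(n + 1)]


# Exhaustive check over all pairs of 3-digit SD numbers.
for x, y in product(product((-1, 0, 1), repeat=3), repeat=2):
    s = sd_add(list(x), list(y))
    assert all(d in (-1, 0, 1) for d in s)
    assert sd_value(s) == sd_value(x) + sd_value(y)
```

Each sum digit is computed from positions i, i−1 and i−2 of the operands only, which is why the hardware delay does not grow with the word length.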
3 Moduli Selection in RNS/SD

One of the most important considerations when designing RNS systems is the choice of the moduli set. The choice of moduli affects the complexity of forward and reverse converters as well as RNS arithmetic circuits. In [11], Abdallah and Skavantzos state that the moduli set, $S = \{m_1, \ldots, m_L\}$, should be chosen such that the moduli $m_i$ satisfy the following criteria:

1. They should be pairwise prime. That is, $\gcd(m_i, m_j) = 1$ for all $m_i \neq m_j$.
2. Each modulus $m_i$ should be as small as possible so that operations modulo $m_i$ require minimum computational time.
3. The moduli $m_i$ should imply simple binary to RNS and RNS to binary conversions as well as simple RNS arithmetic.
4. The moduli product should be large enough to implement the desired dynamic range.
5. The moduli should provide a well balanced decomposition of the dynamic range. This means that the difference in word length between the moduli should be as small as possible.

Sets with all elements being of the forms $2^n$, $2^n - 1$ and $2^n + 1$ satisfy the requirement of simple conversions and efficient modulo arithmetic. Since the SD number system is used to represent residues, addition and subtraction are performed in constant time, regardless of operand widths. Consequently, criteria 2 and 5 are less important for adder-based RNS/SD applications. However, for multiplication-intensive applications, moduli sets with small and balanced moduli result in faster and more area-efficient implementations.

3.1 Parameterized Moduli Sets

Several types of moduli sets have been considered by RNS researchers. A large number of different parameterized moduli sets have been suggested in the literature. The parameterized sets consist of a small number of low-cost moduli of a fixed form, where each modulus is expressed as a function of a parameter, say n. The dynamic range of such sets can easily be scaled by adjusting n. The use of parameterized moduli implies efficient RNS conversions, since it is possible to take advantage of moduli-set-specific properties, such as attractive closed-form expressions for the moduli product and for the multiplicative inverses required for reverse conversion. However, as the number of moduli is increased, such attractive properties are rare and come at the cost of balance in residue word lengths. Five parameterized moduli sets, $S_1, \ldots, S_5$, have been selected for evaluation in this work.

$S_1 = \{2^n - 1, 2^n + 1\}$
$S_2 = \{2^n - 1, 2^n, 2^n + 1\}$
$S_3 = \{2^n - 1, 2^n, 2^n + 1, 2^{2n} + 1\}$
$S_4 = \{2^{n-1}, 2^{n-1} - 1, 2^n - 1, 2^n + 1\}$
$S_5 = \{2^n, 2^n - 1, 2^{n-1} - 1, 2^{n-1} + 1\}$

Moduli-set-specific forward and reverse converters have been implemented for RNS/SD number systems based on each of the five sets. The forward conversion technique is outlined in Section 4. Reverse converters for the parameterized moduli sets are presented in Section 5.

3.2 General Moduli Sets

If large dynamic ranges are required, a general moduli set consisting of a larger number of moduli might result in better performance. If low-cost moduli are used, RNS/SD forward conversions for such sets are as efficient as for the parameterized moduli sets. Reverse conversion, on the other hand, is a significantly more difficult task.
A generic RNS/SD decoder must handle the adverse properties of the Chinese Remainder Theorem, that is, modulo M operations for a large-valued M and multiplications by constant factors which do not necessarily have attractive forms. A decoder for general moduli sets has been designed and the implementation is outlined in Section 6.
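As a quick sanity check of criterion 1, the sketch below builds the five parameterized sets for a given n and tests them for pairwise coprimality. The helper names are hypothetical and the choice of n = 4 and n = 5 is arbitrary.

```python
from itertools import combinations
from math import gcd, prod


def parameterized_sets(n):
    """The five parameterized moduli sets of Section 3.1 for a given n."""
    return {
        "S1": [2**n - 1, 2**n + 1],
        "S2": [2**n - 1, 2**n, 2**n + 1],
        "S3": [2**n - 1, 2**n, 2**n + 1, 2**(2 * n) + 1],
        "S4": [2**(n - 1), 2**(n - 1) - 1, 2**n - 1, 2**n + 1],
        "S5": [2**n, 2**n - 1, 2**(n - 1) - 1, 2**(n - 1) + 1],
    }


def pairwise_coprime(moduli):
    """Criterion 1: every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


for n in (4, 5):
    for name, moduli in parameterized_sets(n).items():
        M = prod(moduli)
        print(n, name, moduli, pairwise_coprime(moduli), (M - 1).bit_length())
```

For these two values of n, S4 passes the test only for n = 4 and S5 only for n = 5, which is why the two sets are used as complements of each other.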
4 RNS/SD Encoders

For each moduli set presented in Section 3, an RNS/SD encoder has been developed. The encoder for a moduli set $S = \{m_1, m_2, \ldots, m_L\}$ converts an integer in binary form into L SD residues. Modulo reduction for low-cost moduli is straightforward when using the SD number system. To construct a residue $x_i$ from an integer X, X is partitioned into vectors of the same length as the corresponding modulus $m_i$. The last vector is padded with constant zeros if necessary.

$$x_i = \langle X \rangle_{m_i} = \left\langle k_{i,0} + 2^{n_i} k_{i,1} + 2^{2 n_i} k_{i,2} + \cdots + 2^{l n_i} k_{i,l} \right\rangle_{m_i} \tag{6}$$

where the vectors $k_{i,j}$ are built from the bits $x_{W-1} \ldots x_0$ of X,

$$k_{i,0} = [x_{n_i - 1} \, x_{n_i - 2} \ldots x_0],$$
$$k_{i,1} = [x_{2 n_i - 1} \, x_{2 n_i - 2} \ldots x_{n_i}],$$
$$\vdots$$
$$k_{i,l} = [0 \ldots 0 \, x_{W-1} \ldots x_{l n_i}].$$

Here $n_i$ is the word length of modulus $m_i$ and W is the word length of X. Since $m_i$ is either $2^{n_i}$, $2^{n_i} - 1$ or $2^{n_i} + 1$, the rules of Eq. 5 apply for multiplication by powers of two and Eq. 6 simplifies to

$$x_i = \begin{cases} \langle k_{i,0} + k_{i,1} + k_{i,2} + k_{i,3} + \ldots \rangle_{2^{n_i} - 1} & \text{if } m_i = 2^{n_i} - 1, \\ k_{i,0} & \text{if } m_i = 2^{n_i}, \\ \langle k_{i,0} - k_{i,1} + k_{i,2} - k_{i,3} + \ldots \rangle_{2^{n_i} + 1} & \text{if } m_i = 2^{n_i} + 1. \end{cases}$$

The RNS/SD encoders consist of a number of multi-operand SD modulo adders, one for each $m_i \neq 2^{n_i}$. Since the input X is in binary form, it is possible to reduce the complexity of the encoder by using simplified SD adder cells on the first levels of each adder tree. Figure 1 shows the encoder for the RNS/SD number system with moduli set {128, 129, 127, 65, 17}.
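The chunk-and-add reduction of Eq. 6 can be mirrored in software as follows. This is a behavioural sketch only: the function names are hypothetical, the input is assumed to be a non-negative integer, and the final `% m` stands in for the multi-operand SD modulo adder tree of the hardware encoder.

```python
def low_cost_form(m):
    """Return (n, kind) with m = 2**n + kind and kind in {0, -1, +1}."""
    for kind in (0, -1, 1):
        p = m - kind
        if p > 1 and p & (p - 1) == 0:  # p is a power of two
            return p.bit_length() - 1, kind
    raise ValueError(f"{m} is not of the form 2**n, 2**n - 1 or 2**n + 1")


def residue(x, m):
    """Residue of a non-negative x modulo a low-cost modulus, as in Eq. 6."""
    n, kind = low_cost_form(m)
    mask = (1 << n) - 1
    if kind == 0:
        return x & mask          # keep k_{i,0} only
    total, sign = 0, 1
    while x:
        total += sign * (x & mask)  # next n-bit vector k_{i,j}
        x >>= n
        if kind == 1:
            sign = -sign         # alternate signs for 2**n + 1
    return total % m


def forward_convert(x, moduli):
    """Binary-to-RNS conversion, one residue per channel."""
    return [residue(x, m) for m in moduli]


moduli = [128, 129, 127, 65, 17]  # the set of Figure 1
print(forward_convert(123456, moduli))
```

The result can be compared against Python's built-in `%` operator for any input, as the check in the accompanying test does.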
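Figure 1 RNS/SD Encoder for moduli set {128, 129, 127, 65, 17}.

5 RNS/SD Decoders for Parameterized Moduli Sets

5.1 RNS/SD Decoder for Moduli Set S1

The proposed architecture for decoding SD residues with respect to the $S_1 = \{m_1, m_2\} = \{2^n - 1, 2^n + 1\}$ moduli set is based on the Chinese Remainder Theorem, as presented in Section 2.1. For an RNS with two moduli, the CRT procedure in Eq. 1 is reduced to

$$X = \left\langle \hat{M}_1 \left\langle \hat{M}_1^{-1} \right\rangle_{m_1} x_1 + \hat{M}_2 \left\langle \hat{M}_2^{-1} \right\rangle_{m_2} x_2 \right\rangle_M . \tag{7}$$

For the particular set $S_1$, we have

$$M = 2^{2n} - 1, \quad \hat{M}_1 = \frac{M}{m_1} = 2^n + 1, \quad \hat{M}_2 = \frac{M}{m_2} = 2^n - 1.$$

It is easy to see that the two multiplicative inverses needed for computation of Eq. 7 are both powers of two.

Claim: $\langle \hat{M}_1^{-1} \rangle_{m_1} = 2^{n-1}$

Proof:
$$\left\langle \hat{M}_1^{-1} \hat{M}_1 \right\rangle_{m_1} = \left\langle 2^{n-1} \times (2^n + 1) \right\rangle_{2^n - 1} = \left\langle 2^{n-1} \times \langle 2^n + 1 \rangle_{2^n - 1} \right\rangle_{2^n - 1} = \left\langle 2^{n-1} \times 2 \right\rangle_{2^n - 1} = \langle 2^n \rangle_{2^n - 1} = 1$$

Claim: $\langle \hat{M}_2^{-1} \rangle_{m_2} = 2^{n-1}$

Proof:
$$\left\langle \hat{M}_2^{-1} \hat{M}_2 \right\rangle_{m_2} = \left\langle 2^{n-1} \times (2^n - 1) \right\rangle_{2^n + 1} = \left\langle 2^{n-1} \times \langle 2^n - 1 \rangle_{2^n + 1} \right\rangle_{2^n + 1} = \left\langle 2^{n-1} \times (-2) \right\rangle_{2^n + 1} = \langle -2^n \rangle_{2^n + 1} = 1$$

Inserting the derived expressions for $\hat{M}_1$, $\hat{M}_2$, $\langle \hat{M}_1^{-1} \rangle_{m_1}$, $\langle \hat{M}_2^{-1} \rangle_{m_2}$ and M into Eq. 7 yields

$$X = \left\langle (2^n + 1) 2^{n-1} x_1 + (2^n - 1) 2^{n-1} x_2 \right\rangle_{2^{2n} - 1} = \left\langle (2^{2n-1} + 2^{n-1}) x_1 + (2^{2n-1} - 2^{n-1}) x_2 \right\rangle_{2^{2n} - 1} = \langle A x_1 + B x_2 \rangle_{2^{2n} - 1} \tag{8}$$

Using the rules for multiplication by powers of two from Eq. 5, together with the fact that $x_1$ and $x_2$ both have digit-length n in the SD number system, X of Eq. 8 can be computed as the sum of two 2n-digit SD vectors, formed by concatenation, rotation and negation. With $x_{i,j}$ denoting digit j of residue $x_i$,

$$A x_1 = 2^{2n-1} x_1 + 2^{n-1} x_1 = [x_{1,0} \; x_{1,n-1} \ldots x_{1,0} \; x_{1,n-1} \ldots x_{1,1}]_{SD}$$
$$B x_2 = 2^{2n-1} x_2 - 2^{n-1} x_2 = [x_{2,0} \; \bar{x}_{2,n-1} \ldots \bar{x}_{2,0} \; x_{2,n-1} \ldots x_{2,1}]_{SD}$$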
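The closed form of Eq. 8 can be verified numerically with a few lines. The sketch below works on ordinary integers rather than SD vectors, uses the hypothetical name `decode_s1`, and checks the two claimed inverses with Python's built-in modular inverse `pow(a, -1, m)` (available from Python 3.8).

```python
def decode_s1(x1, x2, n):
    """Eq. 8: reverse conversion for the moduli set {2**n - 1, 2**n + 1}."""
    a = (1 << (2 * n - 1)) + (1 << (n - 1))   # A = 2^(2n-1) + 2^(n-1)
    b = (1 << (2 * n - 1)) - (1 << (n - 1))   # B = 2^(2n-1) - 2^(n-1)
    return (a * x1 + b * x2) % ((1 << (2 * n)) - 1)


n = 4
m1, m2 = 2**n - 1, 2**n + 1

# Both multiplicative inverses are the power of two 2^(n-1).
assert pow(m2, -1, m1) == 2**(n - 1)   # inverse of M1_hat = 2^n + 1 modulo m1
assert pow(m1, -1, m2) == 2**(n - 1)   # inverse of M2_hat = 2^n - 1 modulo m2

# Every X in the dynamic range [0, M) is recovered from its residues.
for x in range(m1 * m2):
    assert decode_s1(x % m1, x % m2, n) == x
```

The integer model always returns a value in [0, M); the range correction discussed in the text is only needed because the SD modulo adder also uses the negative range.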
No logic gates are required to form $Ax_1$ and $Bx_2$. One modulo $2^{2n} - 1$ SD adder is sufficient to generate X. The result will be in the range (−M, M), because SD modulo adders use the negative range as well as the positive. If the output is required to be in the range [0, M), the correct result is obtained by adding $M = [1_{2n-1} \ldots 1_0]$ to X when X is negative. Adding constant ones to an SD integer is a simple operation, as shown in [1]. Carry-look-ahead (CLA) adders are used to obtain the binary representation of X, according to Eq. 3. In order to minimize the extra delay introduced by this range correction, both X and X + M are decoded to binary form, using two CLAs operating in parallel. The correct value is selected by examining the carry-out bit of the adder for X. The hardware architecture of the decoder is depicted in Fig. 2.

5.2 RNS/SD Decoder for Moduli Set S2

The set $S_2 = \{2^n - 1, 2^n, 2^n + 1\}$ is probably the most widely used moduli set for RNS. It is also the moduli set that has been most intensively studied in the literature. The decoding of binary-residue number systems based on set $S_2$ is studied, for example, in [12–14]. An efficient decoder for RNS systems with SD residues is presented in [1]. This is also the decoder that has been used in this work, with some minor modifications. The conversion technique is outlined again in this section since the new decoder for moduli set $S_3$, presented in Section 5.3, is based on a similar approach. The decoder is based upon a modified formulation of the Chinese Remainder Theorem, the New CRT-I,
as presented in [15]. According to the New CRT-I, the binary representation X of a residue number $\{x_1, x_2, \ldots, x_L\}$ can be computed as

$$X = x_1 + m_1 X', \quad X' = \left\langle k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + \cdots + k_{L-1} m_2 m_3 \cdots m_{L-1} (x_L - x_{L-1}) \right\rangle_{m_2 m_3 \cdots m_L} \tag{9}$$

where $\{m_1, m_2, \ldots, m_L\}$ is the moduli set and $k_1, k_2, \ldots, k_{L-1}$ are multiplicative inverses, given by

$$\langle k_1 m_1 \rangle_{m_2 m_3 \cdots m_L} \equiv 1, \quad \langle k_2 m_1 m_2 \rangle_{m_3 m_4 \cdots m_L} \equiv 1, \quad \ldots, \quad \langle k_{L-1} m_1 m_2 \cdots m_{L-1} \rangle_{m_L} \equiv 1.$$

Figure 2 RNS/SD decoder for moduli set S1.

If the elements of $S_2$ are rearranged, such that $m_1 = 2^n$, $m_2 = 2^n + 1$ and $m_3 = 2^n - 1$, then the two multiplicative inverses $k_1$ and $k_2$ are both powers of two.

Claim: $k_1 = 2^n$

Proof:
$$\langle k_1 m_1 \rangle_{m_2 m_3} = \langle 2^n \times 2^n \rangle_{2^{2n} - 1} = \langle 2^{2n} \rangle_{2^{2n} - 1} = 1$$

Claim: $k_2 = 2^{n-1}$

Proof:
$$\langle k_2 m_1 m_2 \rangle_{m_3} = \left\langle 2^{n-1} \times 2^n \times (2^n + 1) \right\rangle_{2^n - 1} = \left\langle 2^{n-1} \times \langle 2^n \rangle_{2^n - 1} \times \langle 2^n + 1 \rangle_{2^n - 1} \right\rangle_{2^n - 1} = \left\langle 2^{n-1} \times 1 \times 2 \right\rangle_{2^n - 1} = \langle 2^n \rangle_{2^n - 1} = 1$$

Using the expressions for $m_1$, $m_2$ and $m_3$, together with the derived expressions for $k_1$ and $k_2$, Eq. 9 simplifies to

$$X = x_1 + 2^n X', \quad X' = \left\langle 2^n (x_2 - x_1) + 2^{n-1} (2^n + 1)(x_3 - x_2) \right\rangle_{2^{2n} - 1}. \tag{10}$$

By expanding the terms of Eq. 10 and grouping the coefficients of $x_1$, $x_2$ and $x_3$, the expression for $X'$ can be rewritten as the sum of three terms

$$X' = \langle A x_1 + B x_2 + C x_3 \rangle_{2^{2n} - 1}, \tag{11}$$

where

$$A = -2^n, \quad B = -2^{2n-1} + 2^{n-1}, \quad C = 2^{2n-1} + 2^{n-1}.$$

No logic gates are required to form SD representations of $Ax_1$, $Bx_2$ and $Cx_3$. Using the rules for multiplication by powers of two from Eq. 5, and with $x_{i,j}$ denoting digit j of residue $x_i$, we have

$$A x_1 = [\bar{x}_{1,n-1} \ldots \bar{x}_{1,0} \; 0_{n-1} \ldots 0_0]_{SD},$$
$$B x_2 = [\bar{x}_{2,0} \; x_{2,n-1} \ldots x_{2,0} \; \bar{x}_{2,n-1} \ldots \bar{x}_{2,1}]_{SD},$$
$$C x_3 = [x_{3,0} \; x_{3,n-1} \ldots x_{3,0} \; x_{3,n-1} \ldots x_{3,1}]_{SD}.$$

The result X is the concatenation of $X'$ and $x_1$. Two SD modulo adders are required to generate $X'$ according to Eq. 11. To make sure that the result is in the positive range, $M = [1_{2n-1} \ldots 1_0 \; 0_{n-1} \ldots 0_0]$ is added to X when X is negative. Note that this has no effect on the lower part of X. Figure 3 shows the hardware architecture of the RNS/SD reverse converter for moduli set $S_2$.

Figure 3 RNS/SD decoder for moduli set S2.

5.3 RNS/SD Decoder for Moduli Set S3

The four-moduli set $S_3 = \{2^n - 1, 2^n, 2^n + 1, 2^{2n} + 1\}$ is an extension of the popular $S_2$ moduli set, and has been suggested as a way to increase the dynamic range of the RNS. The resulting RNS decoder is as efficient as the $S_2$ decoder, while the dynamic range is increased from 3n − 1 bits to 5n − 1 bits. An adder-based binary-residue decoder for the $S_3$ set is presented in [16]. By applying a new moduli reordering scheme and by exploiting the properties of the SD number system, the number of terms which need to be added has been reduced from six for the decoder from [16] to four for the decoder proposed in this section. For an RNS with four moduli, the New CRT-I procedure from Eq. 9 is reduced to

$$X = x_1 + m_1 X', \quad X' = \left\langle k_1 (x_2 - x_1) + k_2 m_2 (x_3 - x_2) + k_3 m_2 m_3 (x_4 - x_3) \right\rangle_{m_2 m_3 m_4}. \tag{12}$$

The elements of $S_3$ are rearranged, such that $m_1 = 2^n$, $m_2 = 2^{2n} + 1$, $m_3 = 2^n + 1$ and $m_4 = 2^n - 1$. Using this ordering, $k_1$, $k_2$ and $k_3$ in Eq. 12 are powers of two.

Claim: $k_1 = 2^{3n}$

Proof:
$$\langle k_1 m_1 \rangle_{m_2 m_3 m_4} = \langle 2^{3n} \times 2^n \rangle_{2^{4n} - 1} = \langle 2^{4n} \rangle_{2^{4n} - 1} = 1$$

Claim: $k_2 = 2^{n-1}$

Proof:
$$\langle k_2 m_1 m_2 \rangle_{m_3 m_4} = \left\langle 2^{n-1} \times 2^n \times (2^{2n} + 1) \right\rangle_{(2^n - 1)(2^n + 1)} = \left\langle 2^{n-1} \times 2^n \times \langle 2^{2n} + 1 \rangle_{2^{2n} - 1} \right\rangle_{2^{2n} - 1} = \left\langle 2^{n-1} \times 2^n \times 2 \right\rangle_{2^{2n} - 1} = \langle 2^{2n} \rangle_{2^{2n} - 1} = 1$$

Claim: $k_3 = 2^{n-2}$

Proof:
$$\langle k_3 m_1 m_2 m_3 \rangle_{m_4} = \left\langle 2^{n-2} \times 2^n \times (2^{2n} + 1) \times (2^n + 1) \right\rangle_{2^n - 1} = \left\langle 2^{n-2} \times \langle 2^n \rangle_{2^n - 1} \times \langle 2^{2n} + 1 \rangle_{2^n - 1} \times \langle 2^n + 1 \rangle_{2^n - 1} \right\rangle_{2^n - 1} = \left\langle 2^{n-2} \times 1 \times 2 \times 2 \right\rangle_{2^n - 1} = \langle 2^n \rangle_{2^n - 1} = 1$$

Inserting the expressions for $m_1 \ldots m_4$ and $k_1 \ldots k_3$ into Eq. 12 yields

$$X = x_1 + 2^n X', \quad X' = \left\langle 2^{3n} (x_2 - x_1) + 2^{n-1} (2^{2n} + 1)(x_3 - x_2) + 2^{n-2} (2^{2n} + 1)(2^n + 1)(x_4 - x_3) \right\rangle_{2^{4n} - 1}. \tag{13}$$

By expanding all terms in the expression for $X'$ of Eq. 13 and grouping the coefficients of each residue $x_1, \ldots, x_4$, we find that $X'$ can be rewritten as

$$X' = \langle A x_1 + B x_2 + C x_3 + D x_4 \rangle_{2^{4n} - 1}, \tag{14}$$

where

$$A = -2^{3n}, \quad B = 2^{3n-1} - 2^{n-1}, \quad C = -2^{4n-2} + 2^{3n-2} - 2^{2n-2} + 2^{n-2}, \quad D = 2^{4n-2} + 2^{3n-2} + 2^{2n-2} + 2^{n-2}.$$

Studying A, B, C and D, we find that the distance between two consecutive non-zero digits in the SD representation of each term is equal to the word length of the corresponding residue (n bits for A, C, D and 2n bits for B). Consequently, no logic gates are required to form $Ax_1$, $Bx_2$, $Cx_3$ and $Dx_4$. Again, we use the rules
for multiplication by powers of two from Eq. 5 to form terms using concatenation, rotation and negation.

$$A x_1 = [\bar{x}_{1,n-1} \ldots \bar{x}_{1,0} \; 0_{3n-1} \ldots 0_0]_{SD}$$
$$B x_2 = [x_{2,n} \ldots x_{2,0} \; \bar{x}_{2,2n-1} \ldots \bar{x}_{2,0} \; x_{2,2n-1} \ldots x_{2,n+1}]_{SD}$$
$$C x_3 = [\bar{x}_{3,1} \bar{x}_{3,0} \; x_{3,n-1} \ldots x_{3,0} \; \bar{x}_{3,n-1} \ldots \bar{x}_{3,0} \; x_{3,n-1} \ldots x_{3,0} \; \bar{x}_{3,n-1} \ldots \bar{x}_{3,2}]_{SD}$$
$$D x_4 = [x_{4,1} x_{4,0} \; x_{4,n-1} \ldots x_{4,0} \; x_{4,n-1} \ldots x_{4,0} \; x_{4,n-1} \ldots x_{4,0} \; x_{4,n-1} \ldots x_{4,2}]_{SD}$$

Three modulo $2^{4n} - 1$ SD adders are required to generate $X'$. The result X is the concatenation of $X'$ and $x_1$. As for the decoder for moduli set $S_2$, range correction is carried out by adding $M = [1_{4n-1} \ldots 1_0 \; 0_{n-1} \ldots 0_0]$ to X when X is negative. Figure 4 depicts the hardware architecture of the RNS/SD decoder.

Figure 4 RNS/SD decoder for moduli set S3.

5.4 RNS/SD Decoder for Moduli Sets S4 and S5

The set $S_4 = \{2^{n-1}, 2^{n-1} - 1, 2^n - 1, 2^n + 1\}$ is a balanced moduli set, well suited for large dynamic ranges. However, the elements of $S_4$ are pairwise prime for even values of n only. This might be a disadvantage when tailoring the set for a given dynamic range. To overcome this problem, the set $S_5 = \{2^n, 2^n - 1, 2^{n-1} - 1, 2^{n-1} + 1\}$ will be used as a complement to $S_4$ for odd values of n. $S_5$ has a form similar to that of $S_4$; only the exponents differ. The elements of $S_5$ are pairwise prime for odd values of n only. The two sets can be expressed in a common form as

$$\{m_1, m_2, m_3, m_4\} = \{2^a, 2^a - 1, 2^b - 1, 2^b + 1\}$$

where a = n − 1, b = n for $S_4$ and a = n, b = n − 1 for $S_5$. The proposed decoders for RNS/SD number systems using these moduli are SD implementations of a two-level approach to RNS decoding, detailed in [17]. On the first level, the moduli set is decomposed into two subsets, $\{m_1, m_2\}$ and $\{m_3, m_4\}$. The corresponding residue subsets ($\{x_1, x_2\}$ for $\{m_1, m_2\}$ and $\{x_3, x_4\}$ for $\{m_3, m_4\}$) are decoded using two reverse converters operating in parallel. The second level is a decoder for an RNS with moduli set $\{m_1 m_2, m_3 m_4\}$, where the residues $\langle X_1 \rangle_{m_1 m_2}$ and $\langle X_2 \rangle_{m_3 m_4}$ are the results from the first conversion step. The first-level converter for moduli subset $\{m_1, m_2\} = \{2^a, 2^a - 1\}$ is a variant of the New CRT-I decoders presented in Sections 5.2 and 5.3. The required multiplicative inverse has the value 1. The proof of this is trivial, since $\langle 2^a \rangle_{2^a - 1} = 1$. Inserting the expressions for $m_1$ and $m_2$ into Eq. 9 yields

$$X_1 = x_1 + 2^a X_1', \quad X_1' = \langle x_2 - x_1 \rangle_{2^a - 1}. \tag{15}$$

The other decoder on the first level is the CRT decoder from Section 5.1, where

$$X_2 = \left\langle (2^{2b-1} + 2^{b-1}) x_3 + (2^{2b-1} - 2^{b-1}) x_4 \right\rangle_{2^{2b} - 1}. \tag{16}$$

One modulo $2^a - 1$ SD adder is needed to compute $X_1'$ in Eq. 15. $X_1$ is the concatenation of $X_1'$ and $x_1$. The computation of $X_2$ according to Eq. 16 requires one modulo $2^{2b} - 1$ SD adder. On the second level, two residues are decoded with respect to the moduli set $\{2^a (2^a - 1), 2^{2b} - 1\}$. Equation 9 is reduced to

$$X = X_1 + 2^a (2^a - 1) X', \quad X' = \langle k (X_2 - X_1) \rangle_{2^{2b} - 1}. \tag{17}$$

Equation 17 differs from the applications of the New CRT-I seen so far. The multiplicative inverse k does not have a closed-form expression. A modulo $2^{2b} - 1$ adder/scaler is required to compute $X' = \langle k (X_2 - X_1) \rangle_{2^{2b} - 1}$ for a precalculated value of k. The final result X is computed as the regular (not modulo) sum of two terms, $[X' \, X_1]$ and $-2^a X'$, where $[X' \, X_1]$ is the concatenation of $X'$ and $X_1$. Range correction is carried out by adding $M = m_1 m_2 m_3 m_4$ to negative values of X using simplified SD adder cells before X is converted to binary form. The complete decoder is depicted in Fig. 5.
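The New CRT-I formulation of Eq. 9 and the power-of-two inverses claimed in Sections 5.2 and 5.3 can be checked with an integer model. The sketch below is a generic software rendering of Eq. 9 with hypothetical function names, not the SD hardware decoder; the value n = 3 is chosen arbitrarily so that the exhaustive check stays small.

```python
from math import prod


def new_crt1_inverses(moduli):
    """k_j is the inverse of m1*...*mj modulo m_(j+1)*...*mL (Eq. 9)."""
    return [pow(prod(moduli[:j]), -1, prod(moduli[j:]))
            for j in range(1, len(moduli))]


def new_crt1_decode(residues, moduli):
    """Eq. 9: X = x1 + m1 * <sum_j k_j * m2...mj * (x_(j+1) - x_j)>."""
    acc = 0
    for j, k in enumerate(new_crt1_inverses(moduli), start=1):
        weight = prod(moduli[1:j])  # m2 * ... * mj, empty product is 1
        acc += k * weight * (residues[j] - residues[j - 1])
    return residues[0] + moduli[0] * (acc % prod(moduli[1:]))


n = 3
s2 = [2**n, 2**n + 1, 2**n - 1]                    # reordered S2
s3 = [2**n, 2**(2 * n) + 1, 2**n + 1, 2**n - 1]    # reordered S3

print(new_crt1_inverses(s2))  # expected powers of two: 2^n, 2^(n-1)
print(new_crt1_inverses(s3))  # expected powers of two: 2^(3n), 2^(n-1), 2^(n-2)

for moduli in (s2, s3):
    for x in range(prod(moduli)):
        assert new_crt1_decode([x % m for m in moduli], moduli) == x
```

With the moduli orderings used in the text, the computed inverses agree with the closed forms derived in the proofs, which is what makes the shift-only construction of the operand vectors possible.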
6 RNS/SD Decoders for General Moduli Sets

A reverse conversion technique for RNS/SD number systems using general moduli sets is presented. The only constraint given for the moduli set is that one of the elements, say $m_1$, should be a power of two. The conversion technique presented here is inspired by the work of Wang et al. [18]. Although the general sets studied in this work consist of low-cost moduli exclusively, the technique is applicable to all coprime moduli sets with one element of the form $2^n$.
Figure 6 RNS/SD decoder for general moduli sets.
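In [18], Wang et al. propose a new formulation of the Chinese Remainder Theorem. For an RNS with moduli set $\{m_1, \ldots, m_L\}$ and residues $\{x_1, \ldots, x_L\}$, the value of X is

$$X = x_1 + m_1 \langle X' \rangle_{m_2 m_3 \cdots m_L}, \quad X' = \sum_{i=1}^{L} k_i x_i, \tag{18}$$

where

$$k_1 = \frac{\hat{M}_1 \left\langle \hat{M}_1^{-1} \right\rangle_{m_1} - 1}{m_1}, \quad k_i = \frac{\hat{M}_i \left\langle \hat{M}_i^{-1} \right\rangle_{m_i}}{m_1} \quad \text{for } i = 2, 3, \ldots, L.$$

$\hat{M}_i$ and $\langle \hat{M}_i^{-1} \rangle_{m_i}$ are from the original formulation of the CRT in Eq. 1, that is

$$M = \prod_{i=1}^{L} m_i, \quad \hat{M}_i = \frac{M}{m_i}, \quad \left\langle \hat{M}_i \hat{M}_i^{-1} \right\rangle_{m_i} \equiv 1.$$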
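The constants of Eq. 18 are straightforward to precompute in software. The following sketch uses hypothetical function names and plain integers; it assumes that the first modulus in the list plays the role of $m_1$, and it uses the five-moduli set {16, 17, 9, 7, 5} as a test case.

```python
from math import prod


def ma_constants(moduli):
    """Constants k_1..k_L of Eq. 18; moduli[0] is taken as m1."""
    M = prod(moduli)
    m1 = moduli[0]
    ks = []
    for i, m in enumerate(moduli):
        M_hat = M // m
        term = M_hat * pow(M_hat, -1, m)   # M_hat_i * <M_hat_i^-1>_{m_i}
        if i == 0:
            term -= 1                      # k_1 carries the extra "- 1"
        ks.append(term // m1)              # the division by m1 is exact
    return ks


def decode_general(residues, moduli):
    """Eq. 18: X = x1 + m1 * <sum_i k_i x_i> modulo m2*...*mL."""
    ks = ma_constants(moduli)
    inner = sum(k * x for k, x in zip(ks, residues)) % prod(moduli[1:])
    return residues[0] + moduli[0] * inner


moduli = [16, 17, 9, 7, 5]
print(ma_constants(moduli))
for x in range(0, prod(moduli), 37):
    assert decode_general([x % m for m in moduli], moduli) == x
```

Because $m_1$ is a power of two and $x_1 < m_1$, the final `x1 + m1 * inner` corresponds to a concatenation of bit fields rather than an addition.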
Figure 5 RNS/SD decoder for moduli sets S4 and S5.

MSD(k):
  if (k = 0): return {}
  else:
    find e, such that 2^e ≤ k < 2^(e+1)
    if (3k > 2^(e+2)): return {2^(e+1), −MSD(2^(e+1) − k)}
    else: return {2^e, MSD(k − 2^e)}
Figure 7 Algorithm for finding a minimal signed-digit representation of an integer k.
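A runnable rendering of the recoding algorithm of Fig. 7 is sketched below. The branch condition is written so that the result reproduces the MSD(383) example quoted in the text; the function name and the list-of-signed-powers return format are choices made for this illustration.

```python
def msd(k):
    """Minimal signed-digit recoding of k >= 0 as a list of signed powers of two."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if k == 0:
        return []
    e = k.bit_length() - 1                 # 2**e <= k < 2**(e + 1)
    if 3 * k > (1 << (e + 2)):
        # k is closer to 2**(e+1): round up and recode the negated remainder.
        upper = 1 << (e + 1)
        return [upper] + [-t for t in msd(upper - k)]
    # Otherwise round down and recode the positive remainder.
    return [1 << e] + msd(k - (1 << e))


print(msd(383))  # [512, -128, -1]

# Each term is one shifted (and possibly negated) copy of the operand,
# so 383 * x is formed from three partial products.
x = 5
assert sum(t * x for t in msd(383)) == 383 * x

# Number of partial products for the constants of the example set {16, 17, 9, 7, 5}.
constants = [1004, 4725, 2380, 1530, 1071]
print(sum(len(msd(k)) for k in constants))
```

The terms come out in order of decreasing magnitude, which makes it easy to check that no two non-zero digits are adjacent.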
Figure 8 Partial product array.
The proposed converter has two parts, a multiplication-accumulation (MA) array and a modulo reduction unit. The MA array is used to generate $X'$. The factors $k_1, k_2, \ldots, k_L$ are constants and are calculated a priori. The modulo operation of Eq. 18 and the final range correction are carried out by the modulo reduction unit. As described in earlier sections, X of Eq. 18 can be formed using concatenation if the modulus $m_1$ is chosen to be a power of two. The hardware architecture of the general converter is depicted in Fig. 6. Implementations of the MA array and the modulo reduction unit are detailed in Sections 6.1 and 6.2.

6.1 The SD Multiplication-Accumulation Array

The task of the MA array is to compute $\sum_{i=1}^{L} k_i x_i$, where $x_1, x_2, \ldots, x_L$ are variables and $k_1, k_2, \ldots, k_L$ are integer constants. Wang et al. present an implementation of an MA array for variables in binary form. The MA architecture outlined in [18] uses the Modified Booth recoding algorithm to form partial products which are added using a Wallace tree adder. The partial product generation of the MA array proposed here relies on a minimal signed-digit recoding scheme. The algorithm MSD(k) is used to find SD representations for the constant factors $k_1, k_2, \ldots, k_L$. In [19], it is proven that the algorithm given in Fig. 7 results in representations of minimal Hamming weight, that is, with a minimum number of non-zero digits. The resulting SD representation has no two adjacent non-zero digits. Thus, for an integer k, the number of non-zero digits is at most log2 k/2 + 1. For example, MSD(383) returns {512, −128, −1}, which corresponds to an SD representation of $[10\bar{1}000000\bar{1}]_{SD}$. Each non-zero digit in the minimal SD representations results in a partial product. The partial products are formed using shift and negation operations. For example, 383x is computed as (x ≪ 9) − (x ≪ 7) − x. The operation of the MA array is best explained using an example. Consider the five-moduli set S = {16, 17, 9, 7, 5} with a dynamic range of 16 bits. The constants $k_1, \ldots, k_5$ and the corresponding minimal SD representations are precalculated.

$k_1 = 1{,}004 = [100000\bar{1}0\bar{1}00]_{SD}$,
$k_2 = 4{,}725 = [10010100\bar{1}0101]_{SD}$,
$k_3 = 2{,}380 = [100101010\bar{1}00]_{SD}$,
$k_4 = 1{,}530 = [10\bar{1}00000\bar{1}010]_{SD}$,
$k_5 = 1{,}071 = [100010\bar{1}000\bar{1}]_{SD}$.

The SD representations of $k_1, \ldots, k_5$ contain a total of 22 non-zero digits. Consequently, 22 partial products need to be added. Figure 8 shows the resulting partial product array, where each row represents a partial product. The array contains a large number of constant zero operands, depicted by -s in Fig. 8. The zero operands will not affect the result and can be eliminated by compression of the partial product array. As many zero operands as possible are removed, while the weights of non-constant operands are preserved. Figure 9 shows the compressed partial product array for the given example.

2^15: x2(3)
2^14: x2(2)
2^13: x1(3), x2(1), x3(2), x4(2)
2^12: x1(2), x2(3), x2(0), x3(1), x4(1)
2^11: x1(1), x2(2), x3(0), −x4(2), x4(0), x5(1)
2^10: x1(0), x2(3), x2(1), x3(2), −x4(1), x5(0)
2^9: x2(2), x2(0), x3(1), −x4(0)
2^8: x2(1), x3(2), x3(0)
2^7: −x1(3), −x2(3), x2(0), x3(1), x5(1)
2^6: −x1(2), −x2(2), x3(2), x3(0), x5(0)
2^5: −x1(3), −x1(1), x2(3), −x2(1), x3(1), −x4(2), −x5(1)
2^4: −x1(2), −x1(0), x2(2), −x2(0), −x3(2), x3(0), −x4(1), −x5(0)
2^3: −x1(1), x2(3), x2(1), −x3(1), x4(2), −x4(0)
2^2: −x1(0), x2(2), x2(0), −x3(0), x4(1)
2^1: x2(1), x4(0), −x5(1)
2^0: x2(0), −x5(0)

Figure 9 Compressed partial product array.

As seen in Fig. 9, the computation of $X' = 1{,}004 x_1 + 4{,}725 x_2 + 2{,}380 x_3 + 1{,}530 x_4 + 1{,}071 x_5$ is achieved by adding eight terms, each at most 16 bits wide. Because of the carry-free properties of SD adders, there is no need to employ a complicated adder tree structure (Wallace, Dadda etc.). A binary tree of SD adders is used. Since some of the compressed partial products contain constant zeros, simplified SD adder cells are used where possible. For the example case of S = {16, 17, 9, 7, 5}, the adder tree has three levels. Thus, the total delay of the multiplication-accumulation unit is approximately three times the delay of an SD full adder cell.

6.2 The SD Modulo Reduction Unit

Figure 10 Modulo reduction unit.

The modulo reduction unit computes $\langle X' \rangle_M$, where $X'$ is the result from the MA step and M is the moduli product with $m_1$ excluded, that is $M = m_2 m_3 \ldots m_L$. Since no modulo reduction is performed in the MA stage, the word length of $X'$ is greater than the word length of M. Let n be the digit-length of $X'$ and let $a = \lceil \log_2 M \rceil$. Two SD vectors are created from $X'$:

$$X' = 2^{a-1} X_{high} + X_{low},$$
$$X_{high} = [X'_{n-1} \ldots X'_{a-1}]_{SD},$$
$$X_{low} = [X'_{a-2} \ldots X'_0]_{SD}.$$

Since $X_{low}$ has digit-length a − 1, we know for sure that $-M < X_{low} < M$. A ROM look-up table is used to generate $X_{LUT} = \langle 2^{a-1} X_{high} \rangle_M$. It is not practical to use redundant signed-digit numbers for ROM addressing. Instead, $X_{high}$ is decomposed into its binary components $X_{high}^+$ and $X_{high}^-$. $X_{high}^+$ and $X_{high}^-$ are unsigned binary numbers. Two ROM look-up tables are used to find $X_{LUT}$.

$$X_{LUT}^+ = \left\langle 2^{a-1} X_{high}^+ \right\rangle_M,$$
$$X_{LUT}^- = \left\langle 2^{a-1} X_{high}^- \right\rangle_M,$$
$$X_{LUT} = X_{LUT}^+ - X_{LUT}^-.$$

Figure 11 Transposed form FIR filter.
A. Persson, L. Bengtsson
The two lock-up tables are identical and a single twoport ROM memory, addressed by n − a + 1 bits, can be + − used for the look-up operations. XLUT and XLUT are unsigned binary numbers in the range [0, M). The SD subtractor cell for unsigned binary operands consists of just two logic gates and the gate depth is one. The result, XLUT , is an a-digit SD number in the range (−M, M). The result of the modulo operation is the sum of XLUT and Xlow . These two numbers are both in the range (−M, M). Thus, their sum is in the range (−2M, 2M). Four potential results are computed:
Figure 12 Mod mi FIR filter tap.
R0 = XLUT + Xlow , R1 = XLUT + Xlow − M,
with forward and reverse RNS/SD converters. Implementation results are presented in Section 8.
R2 = XLUT + Xlow + M, R2 = XLUT + Xlow + 2M. 8 VLSI Implementation Results The constant terms −M, M and 2M are added to Xlow using simplified SD adders in parallel to the look-up operation. One of the potential results is in the desired range of [0, M). R0 , . . . , R3 are converted to binary form using four carry-look-ahead adders operating in parallel. The correct result is selected by examining the carry out bits of the CLA adders. The hardware architecture of the modulo reduction unit is depicted in Fig. 10.
7 RNS/SD FIR Filters In order to evaluate the performance of RNS/SD processing using the presented moduli sets, RNS/SD finite impulse response (FIR) filters have been implemented as reference designs. The filter designs implement programmable N-tap FIR filters.
y(n) =
N
The presented designs have been coded in structurallevel VHDL and mapped to standard-cells using Synopsys Design Compiler and a UMC 0.13 m CMOS cell library with eight metal layers and a core voltage of 1.2 Volts. The VHDL designs were compiled for typical operating and wire load conditions and synthesised for four different equivalent (binary) word lengths (16, 24, 32 and 40 bits). For the parameterized moduli sets, the parameter n was chosen such that the resulting moduli product was as small as possible, but at least equal to desired dynamic range. General moduli sets of length five and six have also been evaluated. The general sets were chosen according to the criteria for effective moduli sets given in Section 3. The moduli sets used for VLSI implementation are presented in Table 2. Note that no six-moduli set has been selected for the 16 bit dynamic range. It is not possible to form a set of six coprime low-cost moduli with a moduli product as small as 216 .
ak x(n − k)
k=1
realized in transposed form as shown in Fig. 11. The filter coefficients a1 , . . . , a N are calculated a priori. For an RNS with moduli set S = {m1 , m2 , . . . , m L }, the FIR filter is decomposed into L subfilters operating in parallel, each subfilter using modulo mi arithmetic. Each filter tap consists of a modulo adder, a modulo multiplier and a register. Figure 12 shows an SD filter tap and Fig. 13 depicts an RNS/SD FIR filter, complete
Figure 13 RNS/SD FIR filter.
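The reverse conversion example of Section 6 can be traced in software. The following is a minimal sketch, not the hardware design: it uses ordinary Python integers instead of SD vectors, the function names are hypothetical, `msd` is the standard non-adjacent-form recoding (assumed here to agree with the MSD(k) algorithm of Fig. 7, since it reproduces the MSD(383) example), and the final assembly step $Y = x_1 + m_1 \langle X \rangle_M$ is an assumption that is consistent with the constants $k_1, \ldots, k_5$ listed for S = {16, 17, 9, 7, 5}.

```python
from math import prod

MODULI = [16, 17, 9, 7, 5]             # example set S; m1 = 16
K = [1004, 4725, 2380, 1530, 1071]     # constants k1..k5 from the text
M = prod(MODULI[1:])                   # moduli product with m1 excluded
A = (M - 1).bit_length()               # a = ceil(log2 M)


def msd(k):
    """Return k as signed powers of two with no two adjacent non-zero digits."""
    terms, weight = [], 1
    while k:
        if k & 1:
            digit = 2 - (k & 3)        # +1 if k % 4 == 1, -1 if k % 4 == 3
            terms.append(digit * weight)
            k -= digit
        k >>= 1
        weight <<= 1
    return terms[::-1]


def multiply_accumulate(residues):
    """X = sum(k_i * x_i), built only from shifted and negated copies of x_i."""
    total = 0
    for k, x in zip(K, residues):
        for term in msd(k):
            shifted = x << (abs(term).bit_length() - 1)
            total += shifted if term > 0 else -shifted
    return total


def lut(address):
    """Model of one ROM entry: <2^(a-1) * address>_M."""
    return (address << (A - 1)) % M


def modulo_reduce(x):
    """<X>_M using the high/low split and the four candidates R0..R3."""
    x_high = x >> (A - 1)                       # floor shift keeps x_low >= 0
    x_low = x - (x_high << (A - 1))
    high_pos, high_neg = max(x_high, 0), max(-x_high, 0)
    x_lut = lut(high_pos) - lut(high_neg)       # lies in (-M, M)
    s = x_lut + x_low                           # lies in (-2M, 2M)
    candidates = [s, s - M, s + M, s + 2 * M]   # R0, R1, R2, R3
    return next(r for r in candidates if 0 <= r < M)


def reverse_convert(residues):
    """Assumed final step: Y = x1 + m1 * <X>_M (not taken from the text)."""
    x = multiply_accumulate(residues)
    return residues[0] + MODULI[0] * modulo_reduce(x)
```

For instance, `reverse_convert([y % m for m in MODULI])` returns `y` for any `y` in the range [0, 16 · 5,355). The integer split into a positive and a negative high part only imitates the decomposition of $X_{high}$ into $X_{high}^{+}$ and $X_{high}^{-}$; a real SD vector can carry both components at once.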
Table 2 Moduli sets used for VLSI implementation.

| Moduli set | 16 bits: Number (n) | 16 bits: Values | 24 bits: Number (n) | 24 bits: Values |
|---|---|---|---|---|
| S1 | 8 | {255, 257} | 12 | {4095, 4097} |
| S2 | 6 | {63, 64, 65} | 8 | {255, 256, 257} |
| S3 | 4 | {15, 16, 17, 257} | 5 | {31, 32, 33, 1025} |
| S4/S5 | 5 | {32, 31, 15, 17} | 7 | {128, 127, 63, 65} |
| Five moduli | − | {16, 17, 9, 7, 5} | − | {64, 65, 31, 17, 7} |
| Six moduli | − | − | − | {32, 33, 31, 17, 7, 5} |

| Moduli set | 32 bits: Number (n) | 32 bits: Values | 40 bits: Number (n) | 40 bits: Values |
|---|---|---|---|---|
| S1 | 16 | {65535, 65537} | 20 | {$2^{20} - 1$, $2^{20} + 1$} |
| S2 | 11 | {2047, 2048, 2049} | 14 | {16383, 16384, 16385} |
| S3 | 7 | {127, 128, 129, 16385} | 8 | {255, 256, 257, 65537} |
| S4/S5 | 9 | {512, 511, 255, 257} | 11 | {2048, 2047, 1023, 1025} |
| Five moduli | − | {128, 129, 127, 65, 17} | − | {512, 511, 257, 129, 65} |
| Six moduli | − | {128, 127, 65, 31, 17, 7} | − | {256, 257, 129, 127, 31, 17} |
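The entries of Table 2 can be inspected with a few lines of code. The sketch below is illustrative only: the dictionary layout and the helper names are hypothetical, and only the 16-bit column and the general sets are transcribed. It checks that each set is pairwise coprime and reports how many bits of dynamic range the moduli product provides.

```python
from itertools import combinations
from math import gcd, log2, prod

# Moduli sets transcribed from Table 2 (16-bit column and all general sets).
SETS = {
    "S1, 16 bits": [255, 257],
    "S2, 16 bits": [63, 64, 65],
    "S3, 16 bits": [15, 16, 17, 257],
    "S4/S5, 16 bits": [32, 31, 15, 17],
    "Five moduli, 16 bits": [16, 17, 9, 7, 5],
    "Five moduli, 24 bits": [64, 65, 31, 17, 7],
    "Six moduli, 24 bits": [32, 33, 31, 17, 7, 5],
    "Five moduli, 32 bits": [128, 129, 127, 65, 17],
    "Six moduli, 32 bits": [128, 127, 65, 31, 17, 7],
    "Five moduli, 40 bits": [512, 511, 257, 129, 65],
    "Six moduli, 40 bits": [256, 257, 129, 127, 31, 17],
}


def pairwise_coprime(moduli):
    """True if every pair of moduli has greatest common divisor 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))


def dynamic_range_bits(moduli):
    """Base-2 logarithm of the moduli product."""
    return log2(prod(moduli))


for name, moduli in SETS.items():
    print(f"{name}: coprime={pairwise_coprime(moduli)}, "
          f"range={dynamic_range_bits(moduli):.2f} bits")
```

The same two helpers apply unchanged to the parameterized sets for 24, 32 and 40 bits, for example `[2**20 - 1, 2**20 + 1]`.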
Table 3 Performance evaluation for RNS/SD encoders.

| Moduli set | Delay [ns]: 16 bits | 24 bits | 32 bits | 40 bits | Area [mm²]: 16 bits | 24 bits | 32 bits | 40 bits |
|---|---|---|---|---|---|---|---|---|
| S1 | 0.24 | 0.24 | 0.24 | 0.24 | 0.00045 | 0.00068 | 0.00091 | 0.0011 |
| S2 | 0.81 | 0.79 | 0.81 | 0.81 | 0.0012 | 0.0013 | 0.0022 | 0.0028 |
| S3 | 1.34 | 1.34 | 1.34 | 1.36 | 0.0022 | 0.0028 | 0.0039 | 0.0046 |
| S4/S5 | 1.35 | 1.35 | 1.35 | 1.35 | 0.0022 | 0.0032 | 0.0042 | 0.0052 |
| Five moduli | 1.91 | 1.75 | 1.69 | 1.69 | 0.0029 | 0.0033 | 0.0051 | 0.0061 |
| Six moduli | − | 2.23 | 2.41 | 1.96 | − | 0.0051 | 0.0074 | 0.0079 |
Table 4 Performance evaluation of RNS/SD decoders.

| Moduli set | Delay [ns]: 16 bits | 24 bits | 32 bits | 40 bits | Area [mm²]: 16 bits | 24 bits | 32 bits | 40 bits |
|---|---|---|---|---|---|---|---|---|
| S1 | 3.63 | 5.01 | 5.65 | 6.58 | 0.0029 | 0.0044 | 0.0059 | 0.0074 |
| S2 | 3.99 | 4.43 | 5.46 | 6.15 | 0.0035 | 0.0047 | 0.0064 | 0.0082 |
| S3 | 4.50 | 5.17 | 6.08 | 6.52 | 0.0047 | 0.0059 | 0.0083 | 0.0094 |
| S4/S5 | 7.57 | 8.67 | 10.83 | 10.97 | 0.0084 | 0.0143 | 0.0215 | 0.0307 |
| Five moduli | 7.04 | 8.92 | 9.78 | 11.24 | 0.0142 | 0.0278 | 0.0382 | 0.0660 |
| Six moduli | − | 8.28 | 10.25 | 11.05 | − | 0.0286 | 0.0501 | 0.0587 |
Table 5 Performance evaluation of 8-tap RNS/SD FIR filters.

| Moduli set | Delay [ns]: 16 bits | 24 bits | 32 bits | 40 bits | Area [mm²]: 16 bits | 24 bits | 32 bits | 40 bits | Area × Delay: 16 bits | 24 bits | 32 bits | 40 bits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1 | 4.17 | 4.81 | 5.11 | 5.77 | 0.0927 | 0.1978 | 0.3391 | 0.5166 | 0.387 | 0.951 | 1.733 | 2.981 |
| S2 | 3.86 | 4.11 | 4.82 | 4.92 | 0.0724 | 0.1201 | 0.2248 | 0.3378 | 0.280 | 0.494 | 1.084 | 1.662 |
| S3 | 4.16 | 4.66 | 4.81 | 4.82 | 0.0840 | 0.1297 | 0.2352 | 0.2941 | 0.349 | 0.604 | 1.131 | 1.418 |
| S4/S5 | 3.70 | 4.09 | 4.56 | 4.92 | 0.0736 | 0.1371 | 0.2174 | 0.3207 | 0.272 | 0.561 | 0.991 | 1.578 |
| Five moduli | 2.74 | 3.89 | 4.08 | 4.65 | 0.0626 | 0.1200 | 0.1842 | 0.2820 | 0.172 | 0.467 | 0.7515 | 1.311 |
| Six moduli | − | 3.65 | 4.08 | 4.16 | − | 0.1176 | 0.1903 | 0.2534 | − | 0.4291 | 0.7764 | 1.054 |
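The area×delay column of Table 5 is the product of the delay and area columns, so the ranking of the moduli sets can be recomputed directly from the table. The sketch below is illustrative; the data layout and the function name `best_set` are hypothetical, and `None` marks the missing six-moduli entry for 16 bits.

```python
WIDTHS = [16, 24, 32, 40]

# (delay [ns], area [mm^2]) per word length, transcribed from Table 5.
FIR = {
    "S1": [(4.17, 0.0927), (4.81, 0.1978), (5.11, 0.3391), (5.77, 0.5166)],
    "S2": [(3.86, 0.0724), (4.11, 0.1201), (4.82, 0.2248), (4.92, 0.3378)],
    "S3": [(4.16, 0.0840), (4.66, 0.1297), (4.81, 0.2352), (4.82, 0.2941)],
    "S4/S5": [(3.70, 0.0736), (4.09, 0.1371), (4.56, 0.2174), (4.92, 0.3207)],
    "Five moduli": [(2.74, 0.0626), (3.89, 0.1200), (4.08, 0.1842), (4.65, 0.2820)],
    "Six moduli": [None, (3.65, 0.1176), (4.08, 0.1903), (4.16, 0.2534)],
}


def area_delay(entry):
    """Area x delay product, or None when the table has no entry."""
    if entry is None:
        return None
    delay, area = entry
    return delay * area


def best_set(width):
    """Name of the moduli set with the smallest area x delay at this width."""
    column = WIDTHS.index(width)
    products = {name: area_delay(rows[column]) for name, rows in FIR.items()}
    valid = {name: p for name, p in products.items() if p is not None}
    return min(valid, key=valid.get)


for width in WIDTHS:
    print(width, "bits:", best_set(width))
```

With the values of Table 5, the minimum alternates between the five-moduli and six-moduli sets across the four word lengths, which matches the conclusion drawn in Section 8.3.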
8.1 RNS/SD Encoders

Table 3 shows VLSI implementation results for RNS/SD encoders using the moduli sets from Table 2. For the parameterized moduli sets, the circuit delay is not affected by the value of the parameter n. The RNS/SD encoders for general moduli sets, on the other hand, consist of different adder-tree structures for different dynamic ranges. Thus, their circuit delay is not constant. The circuit area grows linearly with increasing dynamic range for all encoders.

8.2 RNS/SD Decoders

Table 4 shows VLSI implementation results for the proposed RNS/SD decoders. As seen in Table 4, the decoder for moduli set S1 has the smallest area and, for the 16-bit dynamic range, also the shortest circuit delay. For the larger dynamic ranges, the decoder for moduli set S2 has the shortest delay. The decoders for sets S4 and S5 have considerably longer delay and larger area, due to the constant multipliers in the second stage of the converters.

8.3 RNS/SD FIR Filters

Implementation results for 8-tap FIR filters are presented in Table 5. When implementing the FIR filters, pipeline stages were added in the RNS/SD converters to maintain a clock cycle that is determined by the critical path of the filter taps. The RNS/SD forward conversions introduce an additional latency of one clock cycle. The reverse conversions introduce an additional latency of two clock cycles for filters using moduli sets S1, ..., S3 and three clock cycles for filters with moduli sets S4 and S5. The reverse conversions for general moduli sets introduce a latency of three clock cycles. For the case of 8-tap FIR filters we see that generic moduli sets with five or six moduli result in designs with the best area×delay products. For even longer filters, the impact of the forward and reverse converters on the total circuit area will decrease. This will further favor the longer moduli sets.
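The filter structure of Section 7, on which these results are based, can be modelled in software. The following is a minimal sketch under stated assumptions: residues are plain non-negative integers rather than SD vectors, inputs and coefficients are unsigned, the function names are hypothetical, and the textbook Chinese remainder theorem stands in for the reverse converters proposed in the paper. It runs one transposed-form subfilter per modulus and compares the reconstructed output with a direct evaluation of $y(n) = \sum_{k=1}^{N} a_k x(n-k)$.

```python
from math import prod

MODULI = [16, 17, 9, 7, 5]   # five-moduli set used for the 16-bit designs


def fir_direct(coeffs, samples):
    """Reference: y(n) = sum over k = 1..N of a_k * x(n - k)."""
    return [
        sum(a * samples[n - k]
            for k, a in enumerate(coeffs, start=1) if n - k >= 0)
        for n in range(len(samples))
    ]


def fir_transposed_mod(coeffs, samples, m):
    """Transposed-form subfilter: each tap is a modulo multiplier,
    a modulo adder and a register, all working modulo m."""
    regs = [0] * len(coeffs)
    out = []
    for x in samples:
        out.append(regs[0])                      # y(n) is the first register
        x %= m                                   # forward conversion of x(n)
        regs = [
            ((regs[j + 1] if j + 1 < len(regs) else 0) + (a % m) * x) % m
            for j, a in enumerate(coeffs)
        ]
    return out


def crt(residues, moduli):
    """Textbook CRT reconstruction (stand-in for the reverse converter)."""
    total = prod(moduli)
    y = 0
    for r, m in zip(residues, moduli):
        part = total // m
        y += r * part * pow(part, -1, m)         # modular inverse, Python 3.8+
    return y % total


def rns_fir(coeffs, samples):
    """Run the subfilters independently, then reconstruct each output."""
    channels = [fir_transposed_mod(coeffs, samples, m) for m in MODULI]
    return [crt(rs, MODULI) for rs in zip(*channels)]
```

The two filters agree as long as every output stays below the moduli product (85,680 for this set); for example, `rns_fir([3, 1, 4, 1, 5, 9, 2, 6], [7, 2, 9, 4, 0, 5, 8, 1, 3, 6])` equals `fir_direct` applied to the same arguments.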
9 Conclusions

This work has presented new forward and reverse converters for signed-digit residue number systems. Since the performance of RNS/SD arithmetic circuits depends on the choice of the moduli set (a set of pairwise prime numbers), the purpose of this work has been to compare RNS/SD number systems based on different sets. Four moduli-set-specific conversion techniques are proposed. A conversion technique for general moduli sets consisting of any number of coprime moduli has also been presented. Finite impulse response (FIR) filters have been used in order to evaluate the performance of RNS/SD processing using the proposed moduli sets. All designs have been implemented in a commercially available 0.13 μm CMOS process. The designs have been compared with respect to delay, area and area×delay products. The implementation results show that the complexity of RNS/SD converters grows as the number of moduli is increased. However, if the designs are large enough, the increased complexity of the converters is outweighed by area savings in the RNS/SD processing units. For the case of FIR filters it is shown that generic moduli sets with five or six moduli result in designs with the best area×delay products.
References

1. Lindström, A., Nordseth, M., Bengtsson, L., & Omondi, A. (2004). Arithmetic circuits combining residue and signed-digit representations. In Lecture notes in computer science (LNCS) (Vol. 2823, pp. 246–257). Springer.
2. Szabó, N. S., & Tanaka, R. I. (1967). Residue arithmetic and its applications to computer technology. McGraw-Hill (December).
3. Soderstrand, M., & Jenkins, W. (1986). Residue number system arithmetic: Modern applications in digital signal processing. IEEE Press.
4. Wang, W., Swamy, M., & Ahmad, M. (2003). RNS application in digital image processing. In Proceedings of the 3rd IEEE international workshop on system-on-chip for real-time applications (pp. 77–80) (July).
5. Avizienis, A. (1961). Signed-digit number representation for fast parallel arithmetic. IRE Transactions on Electronic Computers, EC-10, 389–400.
6. Parhami, B. (1988). Carry-free addition of recoded binary signed-digit numbers. IEEE Transactions on Computers, 37(11), 1470–1476 (November).
7. Wei, S., & Shimizu, K. (2000). A novel residue arithmetic hardware algorithm using a signed-digit number representation. IEICE Transactions on Information and Systems, E83-D(12), 2056–2064 (December).
8. Wei, S., & Shimizu, K. (2001). Fast residue arithmetic multipliers based on a signed-digit number system. In Proceedings of the 8th IEEE international conference on electronics, circuits and systems (Vol. 1, pp. 263–266) (September).
9. Wei, S., & Shimizu, K. (2002). Residue signed-digit arithmetic circuit with a complement of modulus and the application to RSA encryption processor. In Proceedings of the 9th IEEE international conference on electronics, circuits and systems (Vol. 2, pp. 591–594) (September).
10. Lindahl, A., & Bengtsson, L. (2005). A low-power FIR filter using combined residue and radix-2 signed-digit representation. In Proceedings of the 8th EUROMICRO conference on digital system design (DSD'05) (pp. 42–47). Porto, Portugal: IEEE Computer Society Press (August–September).
11. Abdallah, M., & Skavantzos, A. (1995). A systematic approach for selecting practical moduli sets for residue number systems. In Proceedings of the 27th IEEE southeastern symposium on system theory (pp. 445–449) (March).
12. Vinnakota, B., & Rao, V. B. (1994). Fast conversion techniques for binary-residue number systems. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, CAS-41(12), 927–929 (December).
13. Wang, W., Swamy, M., Ahmad, M., & Wang, Y. (1999). The applications of the new Chinese remainder theorems for three moduli sets. In Proceedings of the 1999 IEEE Canadian conference on electrical and computer engineering (Vol. 1, pp. 571–576) (May).
14. Wang, Y., Song, X., Aboulhamid, M., & Shen, H. (2002). Adder based residue to binary number converters for ($2^n - 1$, $2^n$, $2^n + 1$). IEEE Transactions on Signal Processing, 50(7), 1772–1779 (July).
15. Wang, Y. (2000). Residue-to-binary converters based on new Chinese remainder theorems. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47(3), 197–205 (March).
16. Cao, B., Chang, C., & Srikanthan, T. (2003). An efficient reverse converter for the 4-moduli set {$2^n - 1$, $2^n$, $2^n + 1$, $2^{2n} + 1$} based on the new Chinese remainder theorem. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 50(10), 1296–1303 (October).
17. Skavantzos, A., & Stouraitis, T. (1999). Grouped-moduli residue number systems for fast signal processing. In Proceedings of the 1999 IEEE international symposium on circuits and systems (ISCAS'99) (Vol. 3, pp. 478–483) (May).
18. Wang, W., Swamy, M., & Ahmad, M. (2000). An area-time efficient residue-to-binary converter. In Proceedings of the 43rd IEEE midwest symposium on circuits and systems (pp. 904–907) (August).
19. Shallit, J. (2005). A primer on balanced binary representations. Retrieved October 2005, from http://www.cs.uwaterloo.ca/~shallit/Papers/bbr.pdf.
Andreas Persson obtained the M.Sc. degree from Chalmers University of Technology, Gothenburg, Sweden in 2006. He is now pursuing the Ph.D. degree at the Centre for Research on Embedded Systems (CERES) at Halmstad University, Sweden.
Lars Bengtsson obtained the M.Sc. and Ph.D. degrees from Chalmers University of Technology, Gothenburg, Sweden in 1983 and 1997 respectively. After working in industry for some years as a HW and SW engineer he was recruited for a position as senior lecturer and later promoted to associate professor at Halmstad University, Sweden. He subsequently moved to Chalmers where he was appointed associate professor in year 2000. His research interest lies in the area of embedded and networked processors, active RFID, and digital VLSI circuits.