A Multi-Core Sphere Decoder VLSI Architecture for MIMO Communications

Chia-Hsiang Yang, Student Member, IEEE, and Dejan Marković, Member, IEEE
University of California, Los Angeles, USA

Abstract—The sphere decoding algorithm finds applications in multi-input multi-output (MIMO) decoding because it achieves near maximum-likelihood (ML) detection performance with significantly reduced computational complexity. Previous work has focused on implementations based on K-best or depth-first search, which limit either the BER performance or the search speed. This paper presents a scalable multi-core sphere decoder architecture that combines the advantages of the K-best and depth-first search methods. The proposed architecture demonstrates a 3-5 dB improvement in BER performance for 16×16 systems using 16 processing elements (PEs) compared to an architecture with one PE. The improved search speed of the multi-core architecture also enables a 10× energy-efficiency improvement over the single-core architecture for the same data rate.

I. INTRODUCTION

MIMO technology increases the spectral efficiency of high-speed wireless connectivity. Data streams are transmitted and received through multiple transmit and receive antennas, creating independently faded replicas of the same signal source as well as independent data streams from multiple sources. By constructively combining these signals at the receiver end, more reliable data transmission and higher data rates can be achieved.

For MIMO decoding, ML detection provides the optimum BER performance, but its exponential computational complexity makes it unattractive for practical implementation. The sphere decoding algorithm is able to achieve ML detection performance with reduced computational complexity by examining only the possible solutions within a search radius. Theoretically, the sphere decoding algorithm has cubic computational complexity with respect to the number of antennas, which is much lower than the exhaustive search with exponential complexity [1].

There are two major search algorithms for hardware implementation: depth-first [2][3] and K-best [4]-[8]. The advantages of depth-first search are that ML performance can be achieved given a sufficient number of processing cycles and that radius shrinking can be used for tree pruning. On the other hand, the advantages of K-best are its regular datapath interconnect and its higher processing speed due to a larger number of processing units. As the antenna array size and throughput requirement grow, the depth-first scheme examines fewer possible solutions because of its weaker processing capability, while the K-best scheme has only a fixed search range. A practical goal is to speed up data processing while retaining the flexibility to examine more solutions within the search radius when more processing cycles are allowed.

In this paper, we introduce a multi-core sphere decoder architecture that resolves the limitations of existing architectures by combining the advantages of the depth-first and K-best search methods.

Multiple PEs are used to speed up the search and to improve the error probability. Unlike the K-best architectures, each PE supports both forward-trace and backward-trace. This feature makes it possible to search more paths, as in the depth-first architectures, when more processing cycles are allowed. Since multiple PEs search simultaneously, paths with a smaller metric (Euclidean distance) are more likely to be found earlier. The search radius can therefore be shrunk faster, reducing the number of nodes that need to be examined. To support signal processing across varying numbers of PEs, choices of interconnect network for the multi-core architecture are analyzed.

This paper is organized as follows. Section II reviews the existing sphere decoder architectures and their limitations. The search-tree partition, data-interleaved processing, and building blocks of the multi-core sphere decoder architecture are described in Section III. As a proof of concept, simulation results for several configurations are presented in Section IV. Section V concludes the paper.

II. EXISTING SPHERE DECODER ARCHITECTURES

The sphere decoding algorithm reduces the exponential problem of finding the ML solution to a polynomial tree-search problem. ML decoding is complex because, in the worst case, it requires solving an integer least-squares problem. This is much harder than the standard least-squares problem, whose solution can be found by the pseudo-inverse [1]. ML decoding can be greatly simplified by solving an equivalent problem: finding the shortest path in a tree topology within a limited radius. In addition to MIMO decoding, similar concepts apply to other problems such as multi-user detection for CDMA systems [5].

A. Sphere Decoding Algorithm

Consider a multi-antenna system with M transmit antennas and N receive antennas. The received vector y can be represented by

    y = Hs + n                                        (1)

where y is an N×1 vector of received symbols, H denotes the N×M channel matrix, and s and n (M×1 and N×1, respectively) represent the transmitted vector and the additive white Gaussian noise (AWGN).


Fig. 1. The matrix structure and the concept of sphere decoding. Unlikely nodes and branches are indicated with gray shading.

The transmitted vector with the minimum squared Euclidean distance in (2) is selected as the ML estimate:

    ŝ_ML = arg min_{s∈Λ} ‖y − Hs‖²                    (2)

The channel matrix H can be decomposed further using QR factorization. The ML estimate can thus be written as

    ŝ_ML = arg min_{s∈Λ} ‖ŷ − Rs‖²                    (3)

where ŷ = Q^H·y, Q is a unitary matrix, R is an upper triangular matrix, and Λ is the set of possible constellation points. Figure 1 shows the matrix structure and the concept of the sphere decoding algorithm, where b_i = ŷ_i − Σ_{j=i+1..M} R_ij·s_j. Using the upper triangular structure of R, symbol decoding begins at the last row and proceeds in several steps; the decoded symbols are used in the successive decoding steps until all symbols are decoded.

This decoding algorithm can be mapped to finding the shortest path (the one with minimum Euclidean distance) in a tree topology. Each node denotes one constellation point, and each path represents a different sequence of data symbols. A path of full length (from root to leaf) corresponds to a solution. The whole search space of this tree is equivalent to an exhaustive search in the trellis diagram of the original problem; the total number of combinations of transmitted symbols is |Λ|^M, where |Λ| is the constellation size.

By properly choosing a search radius and a search method, the ML solution can be approached by visiting only the nodes within a hyper-sphere (bounded by the search radius) rather than performing an exhaustive search. The complexity reduction is feasible because the Euclidean distance is a cumulative sum of squared terms: if the partial Euclidean distance of a node is larger than the search radius, the corresponding branches lie outside the search radius as well. Once a possible solution with a smaller Euclidean distance (compared to the search radius) is found, the search radius is shrunk to this value. Radius shrinking further reduces the number of solutions that need to be considered. These tree-pruning techniques let sphere decoding achieve ML performance with polynomial complexity (highlighted nodes in Fig. 1) rather than exponential complexity (all nodes in Fig. 1).
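The decoding procedure above maps naturally to a bounded depth-first traversal of the tree defined by the upper triangular system in (3). The following is a minimal Python sketch for intuition only, assuming a real-valued model and a small real constellation; the function and variable names are ours, not the paper's.

```python
import numpy as np

def sphere_decode(y, H, symbols, radius=np.inf):
    """Depth-first sphere decoding sketch for y = Hs + n (real-valued model).

    y       : received vector, shape (N,)
    H       : channel matrix, shape (N, M)
    symbols : real constellation points, e.g. [-3, -1, 1, 3]
    radius  : initial squared search radius
    Returns the best symbol vector found inside the (shrinking) radius.
    """
    N, M = H.shape
    Q, R = np.linalg.qr(H)            # H = QR with R upper triangular, as in Eq. (3)
    y_hat = Q.T @ y                   # y_hat = Q^H y (transpose suffices for real data)

    best, best_metric = None, radius
    stack = [(M - 1, [], 0.0)]        # (level i, decided symbols s_{i+1..M-1}, partial metric)
    while stack:
        i, tail, ped = stack.pop()
        if ped >= best_metric:
            continue                  # the radius may have shrunk since this node was pushed
        # b_i = y_hat_i - sum_{j>i} R_ij s_j : interference of already-decided symbols removed
        b_i = y_hat[i] - sum(R[i, i + 1 + k] * tail[k] for k in range(len(tail)))
        for s_i in symbols:
            new_ped = ped + (b_i - R[i, i] * s_i) ** 2
            if new_ped >= best_metric:
                continue              # node outside the sphere: prune this branch
            if i == 0:                # leaf reached: a full candidate vector
                best, best_metric = [s_i] + tail, new_ped   # shrink the radius
            else:
                stack.append((i - 1, [s_i] + tail, new_ped))
    return (np.array(best), best_metric) if best is not None else (None, best_metric)
```

For a complex-valued MIMO system, the usual real-valued decomposition doubles the problem dimensions; the sketch keeps the real model purely for brevity.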

Fig. 2. Depth-first algorithm and the basic architecture.

Tree-search algorithms developed in computer science, such as depth-first and breadth-first search [9], can be applied to the sphere decoding problem. These algorithms arise from the traditional uni-processor architecture, in which only one node is examined per processing cycle during the search. To enhance the search speed, the K-best algorithm, which uses parallel processing to approximate breadth-first search, has been proposed. To expose the limitations of the existing architectures, the algorithms and architectures of depth-first and K-best search are analyzed below.

B. Depth-First Algorithm

The depth-first algorithm [2][3] starts the search from the root of the tree, Fig. 2(a), and explores as far as possible along each branch. It then back-traces until either a leaf node is found or the node falls outside the search radius. The numbers in the figure indicate processing cycles. In hardware, depth-first search is realized with a folding-like architecture, Fig. 2(b), because only one node is visited at a time during the tree search. A candidate list also enumerates the next search node when the path moves beyond the search range or reaches a leaf node. Ideally, the search continues until all possible nodes within the search radius have been examined. The ML solution is guaranteed to be found if it lies within the initial search radius and the whole search space is examined.

C. K-Best Algorithm

The K-best algorithm [4]-[8] approximates a breadth-first search by keeping only the K best branches (those with the smallest partial Euclidean distance) at each level. Figure 3 shows the algorithm and the basic architecture of K-best. After M cycles, the path with the smallest Euclidean distance is chosen as the best estimate. To process K data paths at the same time, a parallel architecture is applied. Since no trace-back is needed, it can be realized as a multi-stage architecture without feedback loops, Fig. 3(b). In general, more than one candidate node is examined for each branch, so sorting circuits are inserted between stages to keep the K best branches. Strictly speaking, K-best does not take advantage of the radius shrinking of the sphere decoding algorithm. Once the ML solution is lost in the decoding process, it can never be recovered in the following stages.
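For comparison with the depth-first sketch above, here is a hedged Python sketch of the K-best flow over the same ŷ = Rs model. The per-stage sorting network of the hardware is replaced by a plain software sort, and the names are ours.

```python
import numpy as np

def k_best_decode(y_hat, R, symbols, K=8):
    """K-best (breadth-first) detection sketch for y_hat = R s, R upper triangular.

    Only the K partial paths with the smallest partial Euclidean distance
    survive each level; there is no trace-back and no search radius.
    """
    M = R.shape[0]
    survivors = [([], 0.0)]                       # (partial path s_{i..M-1}, partial metric)
    for i in range(M - 1, -1, -1):                # decode from the last row upward
        children = []
        for tail, ped in survivors:
            b_i = y_hat[i] - sum(R[i, i + 1 + k] * tail[k] for k in range(len(tail)))
            for s_i in symbols:                   # expand every constellation point
                children.append(([s_i] + tail, ped + (b_i - R[i, i] * s_i) ** 2))
        # a sorting network keeps the K best branches in hardware; a plain sort suffices here
        survivors = sorted(children, key=lambda c: c[1])[:K]
    return np.array(survivors[0][0]), survivors[0][1]
```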


Fig. 3. K-best algorithm and the basic architecture.

Fig. 4. Multi-core search algorithm and the basic architecture.

TABLE I
COMPARISON OF DEPTH-FIRST AND K-BEST ALGORITHM

              Throughput   Latency   Radius Shrinking   BER Performance
Depth-first   variable     long      Yes                ML
K-best        constant     short     No                 near-ML

D. Comparison of Depth-First and K-Best

Table I summarizes the comparison of the two architectures in terms of circuit metrics and algorithmic performance. The main drawback of the depth-first architecture is its longer latency, because only one node is examined per processing cycle. ML performance may not be guaranteed in practice due to the finite number of processing cycles and the finite buffer size. This problem becomes more severe for larger antenna arrays, since the search space grows and the throughput requirement rises, i.e., the processing intervals become shorter. In contrast, K-best has no ability to improve the BER performance even if more processing cycles were provided, because trace-back is not allowed and no search radius is set to confine the search range.

III. MULTI-CORE ARCHITECTURE

CMOS scaling has enabled the integration of more devices and higher-frequency operation. This benefit can be exploited to increase both the number of processors and the number of processing cycles. In light of these advantages, we propose a multi-core architecture to resolve the limitations faced by the existing architectures. The proposed architecture uses multiple PEs to speed up the search and retains the flexibility to examine more solutions when more processing cycles are allowed. It is therefore suitable for next-generation MIMO systems with large antenna arrays.

The basic idea of the multi-core approach is to search multiple nodes within the search radius simultaneously, speeding up the search for admissible solutions. Unlike the parallel architecture reported in [10], each PE has the ability to search the whole search space rather than only the branches with a common ancestor node. Figure 4 shows the conceptual view of the multi-core algorithm and the block diagram of the architecture. With the ability to forward-trace and backward-trace, each PE is assigned a new search branch when its node falls outside the search radius (or when a leaf node is reached). Once a solution with a smaller Euclidean distance is found, the new search radius is updated for all PEs through the interconnect network.

Fig. 5. Tree-partition scheme. Each PE is assigned a sub-tree sequentially according to the SE ordering.

For a smaller search space, this architecture is more energy-efficient than the one-PE architecture: given the same number of search nodes, it can reduce the clock frequency and distribute the computation over all PEs, consuming less power by lowering the supply voltage. To coordinate all functional blocks, the tree-partition scheme, the architecture of the PE, the interconnect network, and the data interleaving are discussed next.

A. Tree Partition Scheme

We propose a tree-partition scheme to distribute search paths over multiple PEs. For each parent node there are |Λ| child nodes, which correspond to |Λ| sub-trees. These sub-trees are ranked by the distance between b_i/R_ii and the points of Λ for antenna i, as shown in Fig. 5. The Schnorr-Euchner (SE) enumeration traverses the constellation candidates in ascending order of this distance [11] to speed up the search. L PEs start from the L closest points for the first decoded symbol (from antenna M). Each PE is responsible for searching the associated sub-tree within the search radius. Once the assigned sub-tree has been examined, a new sub-tree from the remaining available sub-trees for antenna M is assigned to this PE according to the SE enumeration. If all possible sub-trees for antenna i have been examined, an unexamined sub-tree of the most likely solutions (ranked by smaller partial Euclidean distances) for antenna i−1 is assigned to this PE based on the SE ordering. This tree-partition search ensures that the whole tree can be examined and that no path is examined twice by any PE.
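As an illustration of this bookkeeping, the sketch below (Python, with names of our choosing) emulates the tree partition under two simplifying assumptions: sub-trees are handed out only at the root level (antenna M), and the PEs are emulated one after another rather than cycle by cycle. All sub-tree searches share a single radius, mimicking the broadcast over the interconnect network.

```python
import numpy as np

def se_order(center, symbols):
    """Schnorr-Euchner enumeration: candidates sorted by distance to the
    unconstrained estimate center = b_i / R_ii, closest first."""
    return sorted(symbols, key=lambda s: abs(s - center))

def tree_partition_search(y_hat, R, symbols):
    """Sketch of the tree-partition search: sub-trees below the first decoded
    symbol (antenna M) are ranked by SE order and handed out from a shared
    queue; all sub-tree searches share (and shrink) one radius."""
    M = R.shape[0]
    queue = se_order(y_hat[M - 1] / R[M - 1, M - 1], symbols)   # unexamined sub-trees
    shared = {"radius": np.inf, "best": None}                   # broadcast over the interconnect

    def search_subtree(root):                                    # what one PE would do
        ped0 = (y_hat[M - 1] - R[M - 1, M - 1] * root) ** 2
        stack = [(M - 2, [root], ped0)]
        while stack:
            i, tail, ped = stack.pop()
            if ped >= shared["radius"]:
                continue
            b_i = y_hat[i] - sum(R[i, i + 1 + k] * tail[k] for k in range(len(tail)))
            for s_i in se_order(b_i / R[i, i], symbols):
                new_ped = ped + (b_i - R[i, i] * s_i) ** 2
                if new_ped >= shared["radius"]:
                    break              # SE order: every later sibling is at least as far
                if i == 0:
                    shared["radius"], shared["best"] = new_ped, [s_i] + tail
                else:
                    stack.append((i - 1, [s_i] + tail, new_ped))

    while queue:                       # PEs run concurrently in hardware; emulated in turn here
        search_subtree(queue.pop(0))
    return np.array(shared["best"]), shared["radius"]
```

Because the children are visited in SE order, the inner loop may break (rather than merely skip) as soon as one sibling falls outside the radius; this is exactly the sibling pruning discussed below.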


Fig. 6. Relationship between the partial Euclidean distances of consecutive levels.

Tree pruning allows the associated child branches to be discarded if a node lies outside the search radius. By taking advantage of SE enumeration, tree pruning can also be applied to the sibling nodes: since the candidates have already been sorted, the remaining siblings can be discarded as soon as one node falls outside the search radius. Figure 6 illustrates the relationship between the tree structure and the corresponding partial Euclidean distances. Nodes with the same letter belong to the same level of the tree. The numbers after the letters denote the SE enumeration sequence; the associated branches form the corresponding sub-trees described in Fig. 5. Since only paths with a partial Euclidean distance smaller than the search radius are admissible solutions, nodes outside the search radius can be discarded. The partial Euclidean distance of the path A1-B1-C1-D1 is higher than the search radius, so node D1 can be discarded. Node D2 can then be discarded directly, without checking the partial Euclidean distance of A1-B1-C1-D2, because the distance increment contributed by D2 must be larger than that contributed by D1. This tree-pruning technique further speeds up the tree-partition search. The SE enumeration can be simplified on the 2-D constellation plane without a sorting operation by using geometric relationships [12] or lookup tables [13]. A 40-45% improvement in search speed is observed for 4×4 16-QAM systems when this sibling-node pruning scheme is applied.

B. Processing Element (PE)

The proposed PE consists of two parts: a Metric Calculation Unit (MCU) and a Metric Enumeration Unit (MEU). A scalable PE architecture that supports varying antenna array sizes and modulation schemes was proposed in [12] and is shown in Fig. 7. The MCU calculates the partial Euclidean distance. The search state, representing the decoded symbols, is recorded in a bi-directional shift-register chain to adjust the search depth during forward-trace and backward-trace. A new search path can also be loaded into these shift registers when the whole assigned sub-tree has been examined. The MEU enumerates the possible constellation points according to the SE ordering and provides the next candidate when the search path lies outside the search radius or when a leaf node is reached. m pipeline registers are inserted in the loop, so the critical path can be shortened, according to the timing constraint, by choosing a larger m. The architecture operates on complex numbers, since joint decoding is more attractive from both BER-performance and throughput considerations.

Fig. 7. Circuit diagram of one PE.

C. Interconnect Network

The interconnect network coordinates the PEs and supports the signal processing that spans all of them. The required signal processing includes 1) search-radius updating/shrinking and 2) conditional new-search-path assignment. Once a smaller Euclidean distance is found, its value is used as the new search radius (radius shrinking) and broadcast to all PEs. When a PE finishes the search of its assigned sub-tree, one unexamined sub-tree is assigned to it for the upcoming processing cycles. The required operations include minimum-value search and sorting. The new sub-tree can be assigned according to the rank of the partial Euclidean distance, which provides flexibility for advanced search algorithms. We can also set the search constraint to keep the best K paths for quick termination.

Sorting is a relatively complex operation. In hardware, two architectures are widely used: serial sorting [4] and parallel sorting [14]. Both are shown in Fig. 8, where the number of inputs is 8 for the parallel architecture. The serial comparison nature of the former results in a longer latency. The parallel sorter is widely used in packet-switched networks, which exploit parallelism to speed up sorting at the cost of increased area. In this application, the choice between these two architectures depends on the latency requirement of the PE signal processing. Minimum-value search has the same two kinds of architectures: the serial implementation uses only the first stage of Fig. 8(a) to store the minimum value (with the descending sorting operator replaced by an ascending one), while the parallel implementation has a comparator-tree structure similar to an adder tree. Sorting is used as the illustration in what follows, but the same concepts also apply to minimum search.

A 1-D mesh interconnect is applied to the serial architecture. The Euclidean distances of the PEs are fed into registers serially after the metric calculation; radius checking is conducted serially, followed by a conditional new-path assignment. A tree-topology interconnect network is applied to the parallel architecture, where the Euclidean distances of the PEs are fed into the parallel sorting circuits simultaneously. Table II summarizes the latency of the serial and parallel architectures. For a given number of inputs, the parallel sorting architecture has a shorter latency at the cost of more hardware resources.
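As a purely behavioural illustration of the two minimum-search styles (not a description of the actual circuits), the short Python sketch below contrasts the serial update, which needs L−1 comparisons issued one per cycle, with a comparator-tree reduction of depth log2(L). The example values and names are ours.

```python
from math import ceil, log2

def serial_min(values):
    """Serial min-search: one comparison per cycle, so L - 1 cycles for L inputs."""
    best = values[0]
    for v in values[1:]:
        best = v if v < best else best
    return best

def tree_min(values):
    """Comparator-tree min-search (structured like an adder tree): ceil(log2 L) stages."""
    level = list(values)
    while len(level) > 1:
        # pair neighbouring values; an odd leftover is passed straight to the next stage
        paired = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        level = paired + ([level[-1]] if len(level) % 2 else [])
    return level[0]

radii = [4.1, 2.7, 3.3, 1.9, 5.0, 2.2, 6.4, 3.8]   # example metrics reported by 8 PEs
assert serial_min(radii) == tree_min(radii) == 1.9
print("serial:", len(radii) - 1, "compare cycles;", "tree:", ceil(log2(len(radii))), "stages")
```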


Fig. 8. Sorting circuits: (a) serial and (b) parallel.

TABLE II
LATENCY COMPARISON OF SERIAL AND PARALLEL SORTING/MIN-SEARCH CIRCUITS

                                  Sorting     Min-search
Serial latency (# cycles)         2L−2        L−1
Parallel (# comparator stages)    l(l+1)/2    l
(L inputs, l = log2 L)

We can combine these two architectures and adjust the pipeline depth of the parallel architecture to reach a compromise for the required latency. The latency is determined by the number of pipeline stages required to finish the metric calculation and enumeration, and multiple architectural solutions may exist. For example, consider the case where the real-time processing requirement is 8 clock cycles and one clock cycle accommodates two sorting operators. One option is a fully parallel architecture, which for 8 inputs requires a latency of just 3 cycles. Another option is a hybrid architecture in which the 8 inputs are divided into two subgroups of 4 inputs: it takes 6 cycles to sort L = 4 inputs with the serial architecture, as calculated from Table II, and two additional cycles are needed to merge the two sorted sequences using parallel sorting. Both options therefore meet the 8-cycle requirement, but the parallel architecture requires a larger area to finish the computation earlier.
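The numbers in the preceding example can be checked directly against the Table II expressions. The small Python sketch below does so under the stated assumption that one clock cycle accommodates two sorting operators; the helper names are ours.

```python
from math import log2

def serial_sort_cycles(L):            # Table II: serial sorting latency in cycles
    return 2 * L - 2

def parallel_sort_stages(L):          # Table II: comparator stages, l(l+1)/2 with l = log2 L
    l = int(log2(L))
    return l * (l + 1) // 2

OPS_PER_CYCLE = 2                     # assumption from the example: two sorting operators per cycle
BUDGET = 8                            # real-time requirement: 8 clock cycles

# Option 1: fully parallel 8-input sorter -> 6 comparator stages -> 3 cycles
parallel_cycles = parallel_sort_stages(8) / OPS_PER_CYCLE

# Option 2: hybrid -- two serial 4-input sorters (6 cycles, run in parallel), then 2 merge cycles
hybrid_cycles = serial_sort_cycles(4) + 2

print(parallel_cycles, hybrid_cycles)                          # 3.0 and 8
print(parallel_cycles <= BUDGET and hybrid_cycles <= BUDGET)   # True: both options meet the budget
```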

D. Data-Interleaved Processing

To maximize hardware utilization, the sorting should finish within one cycle so that its results can be used in the next cycle. As the number of PEs increases, it is no longer possible to finish these operations within one cycle, which creates a processing-speed mismatch between the PEs and the sorting circuit, and pipelining cannot be inserted directly into the feedback loop. Therefore, the data-interleaving technique [15] is adopted to eliminate the throughput bottleneck. As shown in Fig. 9, independent data x_i are interleaved into each PE. The metric calculation and enumeration are executed in the PEs, while the radius update and new-path assignment are conducted through the interconnect network. In each cycle, the metrics of only one data stream are computed, while the other data streams undergo radius update and new-path assignment across the PEs. The input data streams can be buffered in advance for this data interleaving; for OFDM-based systems, the data of different sub-carriers can be mapped to these streams directly.

Fig. 9. Data-interleaved processing for P data streams.

E. Low-Power Operation

The operation of the multi-core architecture can be adjusted flexibly toward better error performance or lower power consumption, according to the size of the search space and the throughput requirement. For a large search space, resulting from larger antenna arrays and constellation sizes or from a higher throughput requirement, the objective is to examine as many nodes as possible, so all PEs operate at a high clock frequency to speed up the search. Conversely, for a small search space the multi-core architecture can reduce power consumption by slowing down the clock and distributing the computation over several PEs. This can be viewed as a system-level extension of the concept of parallelism, a well-known technique in digital CMOS [16]. The power consumption of the two-PE configuration can be 64% lower than that of the one-PE configuration by taking advantage of supply-voltage scaling [16], and more PEs allow even more aggressive voltage scaling so that the power consumption can be minimized. Table III lists the performance for different numbers of PEs. The 16-PE architecture provides a 10× improvement in energy efficiency (energy per operation ∝ Csw·VDD²) over the single-core architecture [17].

TABLE III
PERFORMANCE FOR DIFFERENT NUMBERS OF PES

# of PEs                         1     2     4     8     16
# of visited nodes per cycle     1     2     4     8     16
Latency (cycles)                 16    16    16    16    16
Energy efficiency (GOPS/mW)      0.5   1.1   2.1   3.6   5.0
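The arithmetic behind the parallelism argument can be sketched as follows. The specific values (a 2.15× increase in switched capacitance and a 5 V to 2.9 V supply reduction) are the illustrative numbers from the low-power CMOS analysis in [16], used here only as assumptions.

```python
def dynamic_power(c_sw, vdd, f_clk):
    """Dynamic CMOS power P = C_sw * VDD^2 * f_clk; energy per operation ~ C_sw * VDD^2."""
    return c_sw * vdd ** 2 * f_clk

# Reference: one PE at full clock rate (normalised capacitance and frequency).
p_one_pe = dynamic_power(c_sw=1.0, vdd=5.0, f_clk=1.0)

# Two PEs at half the clock keep the same aggregate node-visit rate, and the relaxed timing
# allows a lower supply.  The ~2.15x switched capacitance (parallelism overhead) and the
# 5 V -> 2.9 V scaling are the illustrative numbers from [16], used here as assumptions.
p_two_pe = dynamic_power(c_sw=2.15, vdd=2.9, f_clk=0.5)

print(f"two-PE / one-PE power = {p_two_pe / p_one_pe:.2f}")   # ~0.36, i.e. roughly 64% lower
```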


IV. SIMULATION RESULTS

To demonstrate the functionality of the proposed multi-core architecture, simulations for several configuration modes were conducted. In this preliminary experiment, the bandwidth was set to 16 MHz, and the number of processing cycles was equal to the number of transmit antennas to achieve the highest throughput. The clock frequency fclk was 64 MHz and 256 MHz for the 4×4 and 16×16 systems, resulting in throughputs of 6·fclk and 4·fclk bps for 64-QAM and 16-QAM modulation, respectively.

Fig. 10. Simulation results for several design configurations.

Figure 10(a) shows the BER performance of the 4×4 antenna-array system with 16-QAM modulation. More than 7 dB of improvement is observed in the high-SNR regime by deploying 16 PEs. Compared to the ML performance, a 3-5 dB difference is observed using 16 PEs when Eb/N0 exceeds 15 dB. Figure 10(b) shows the BER performance of the 16×16 system for different constellation sizes and numbers of PEs. Around 3 dB of improvement is achieved with 16 PEs over the one-PE architecture, and in the high-SNR regime a 5 dB improvement is observed for 16-QAM modulation. The BER performance can be improved further if more processing cycles are allowed.

V. CONCLUSION

A scalable multi-core VLSI architecture for the sphere decoding algorithm has been presented. We proposed a tree-partition search scheme based on SE enumeration to distribute search paths across multiple PEs. To coordinate the PEs, the choice of interconnect network and the required data interleaving were discussed. The architecture with 16 PEs improves the BER performance over the single-PE architecture by 3-5 dB. It can also operate in a low-power mode, by slowing down the operating frequency and lowering the supply voltage, for a 10× improvement in energy efficiency. The tradeoff between the BER performance and the number of processing cycles is currently being explored in order to maximize the computational capability of the multi-core architecture. It would also be interesting to investigate the generation of soft information with this multi-core architecture: soft information can be used by the outer receiver to achieve near-capacity performance through iterative decoding [11], and the multi-core architecture has the potential to provide better soft information because it is able to examine more possible solutions.

ACKNOWLEDGMENT

The authors acknowledge the support of the Focus Center for Circuit & System Solutions (C2S2), one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.

REFERENCES

[1] B. Hassibi and H. Vikalo, "On the Sphere-Decoding Algorithm I. Expected Complexity," IEEE Trans. Signal Processing, pp. 2806-2818, Aug. 2005.
[2] A. Burg et al., "VLSI implementation of MIMO detection using the sphere decoding algorithm," IEEE J. Solid-State Circuits, vol. 40, pp. 1566-1577, July 2005.
[3] D. Garrett et al., "Silicon Complexity for Maximum Likelihood MIMO Detection Using Spherical Decoding," IEEE J. Solid-State Circuits, vol. 39, pp. 1544-1552, Sep. 2004.
[4] K.-W. Wong, C.-Y. Tsui, R. S.-K. Cheng, and W.-H. Mow, "A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels," in Proc. IEEE Int. Symposium on Circuits and Systems (ISCAS'02), vol. 3, pp. 273-276, May 2002.
[5] G. Knagge, G. Woodward, S. Weller, and B. Ninness, "An Optimised Parallel Tree Search for Multiuser Detection with VLSI Implementation Strategy," in Proc. Global Telecommunications Conf. (GLOBECOM'04), pp. 2440-2444, Dec. 2004.
[6] G. Knagge et al., "A VLSI 8×8 MIMO Decoder Engine," in IEEE Workshop on Signal Processing Systems (SIPS'06), pp. 387-392, Oct. 2006.
[7] L. G. Barbero and J. S. Thompson, "Rapid Prototyping of a Fixed-Throughput Sphere Decoder for MIMO Systems," in Proc. Int. Conf. Communications (ICC'06), vol. 7, pp. 3082-3087, June 2006.
[8] Z. Guo and P. Nilsson, "Algorithm and implementation of the K-best sphere decoding for MIMO detection," IEEE J. Selected Areas in Communications, vol. 24, pp. 491-503, Mar. 2006.
[9] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, 1998.
[10] B. Widdup, G. Woodward, and G. Knagge, "A Highly-Parallel VLSI Architecture for a List Sphere Detector," in Proc. Int. Conf. Communications (ICC'04), vol. 5, pp. 2720-2725, June 2004.
[11] B. M. Hochwald and S. ten Brink, "Achieving near-capacity on a multiple-antenna channel," IEEE Trans. Communications, vol. 51, pp. 389-399, Mar. 2003.
[12] C.-H. Yang and D. Marković, "A Flexible VLSI Architecture for Extracting Diversity and Spatial Multiplexing Gains in MIMO Channels," to appear in Proc. IEEE Int. Conf. Communications (ICC'08).
[13] A. Wiesel, X. Mestre, A. Pages, and J. R. Fonollosa, "Efficient Implementation of Sphere Demodulation," in IEEE Workshop on Signal Processing Advances in Wireless Communications, pp. 36-40, June 2003.
[14] N. K. Sharma, "Modular Design of a Large Sorting Network," in Third Int. Symposium on Parallel Architectures, Algorithms, and Networks, pp. 362-382, Dec. 1997.
[15] D. Marković, B. Nikolić, and R. W. Brodersen, "Power and Area Efficient VLSI Architecture for Communication Signal Processing," in Proc. Int. Conf. Communications (ICC'06), June 2006.
[16] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.
[17] R. Nanda, C.-H. Yang, and D. Marković, "DSP Architecture Optimization in MATLAB/Simulink Environment," to appear in Proc. Int. Symposium on VLSI Circuits (VLSI'08).

978-1-4244-2324-8/08/$25.00 © 2008 IEEE. This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE "GLOBECOM" 2008 proceedings.