Journal of Circuits, Systems, and Computers, Vol. 20, No. 8 (2011) 1505-1527
© World Scientific Publishing Company
DOI: 10.1142/S0218126611008055

SOC DESIGN FOR WIRELESS COMMUNICATIONS*

ZORAN STAMENKOVIĆ
IHP, Frankfurt (Oder), 15236, Germany
[email protected]

Received 2 January 2011
Accepted 21 June 2011

The paper emphasizes methods, architectures, and components for system-on-chip design. It describes the basic knowledge and skills for designing high-performance low-power embedded devices, whose complexity increases exponentially, as does the effort of designing them. These complexities can be mastered by relying upon an appropriate design methodology that concentrates on reuse, executable specifications, and early error detection. The paper bundles these topics in order to provide a good understanding of all the problems involved. It shows how to go from description and verification to implementation and testing, presenting three systems-on-chip for three different wireless applications based on configurable processors and custom hardware accelerators.

Keywords: Wireless; system-on-chip; design; processor; hardware accelerator; power management.

*This paper was recommended by Regional Editor Krishna Shenai.

1. Introduction

A good system-on-chip (SOC) design flow (Refs. 1-13) takes a design from the architectural level or RTL level to the chip layout. It should provide the designer with a working starting point for each stage of the design process. We describe such a methodology, which relies on a library of configurable IP cores and custom hardware accelerators and satisfies the unique needs of wireless applications (Fig. 1).

Fig. 1. Overview of a typical SOC design flow (Ref. 16).

System specification and system modeling are critical tasks, since the hardware/software partitioning will be based on the assumptions made at this (system) level. The overall design will be described at many different levels of abstraction, and each one must always rely on a higher level or golden model for verification purposes. It is very common that, when the product is in the last phases of development, the golden model is no longer referenced and becomes obsolete. This is a risk in the overall design flow, since the golden model should always be updated to reflect the latest changes at the low levels of abstraction. We should never forget that SOC is all about reuse, and complete systems of today could become IP blocks in the future.

There are many different languages that can be used to model the system and create an executable specification based on the written requirements for a particular SOC (Ref. 14). For the proof of concept, electronic system level (ESL) languages such as the specification and description language (SDL), Matlab, Simulink, SPW, and C/C++ could be used. From this very high-level description, the design could be refined or translated into a fixed-point representation. For this purpose, traditional C/C++ implementations have been used for many years, but a general purpose system-level design language such as SystemC is currently being used and is strongly supported by the EDA industry. In recent years, a new entry to the SOC arena at this level of abstraction is SystemVerilog, whose strongest capabilities are in the design of testbenches and assertions.

Once the high-level abstraction model has been developed and tested, the design has to be measured in terms of performance and trade-offs. Performing most of the functions in software allows the product to be updated and additional features to be added later. The architects need to identify which IP blocks are necessary for a particular application and find out if these are already available and economically viable. The last and most critical task is to decide which components need to be designed from scratch, so that the product is differentiated. The main ingredients of innovation will be new IP blocks together with software. At this stage, the system performance is budgeted and a particular general purpose processor is selected to compute as many system tasks as possible in software, leaving the high-throughput algorithms to be implemented as hardware accelerators. A traditional SOC is based on a processor with the interconnect fabric in the form of a system bus. The hardware accelerators can be designed in house or obtained from different sources (Ref. 15). From this stage, the hardware and software design and integration teams can start working concurrently.

The next job is to describe the design in computer-readable form using one of the hardware description languages (HDLs): VHDL or Verilog. Sometimes this can be done using languages like SystemC or SystemVerilog. The aim is to produce code that clearly exhibits the functionality prescribed in the specification (and architectural model), whilst meeting the constraints placed upon it by the target ASIC or FPGA technology. The skill and experience of the designer are the most important factors in this process. The design flow starts with the creation of a top HDL file describing the system and its components as black boxes.
Most of the components are configurable functional building modules which, after the parameters are chosen, are automatically described by generation of the netlist and physical layout. Configuration of functional modules is possible using an in-house prepared Tcl interface. In addition, area and average power dissipation estimates are generated to support exploration of design alternatives.

A testbench is written in order to prove correctness of the RTL model against the specification. The aim of a good testbench (or testbenches) is to:

• Prove that the device functions as described in the specification, reporting any discrepancies to the transcript so that they can be fixed;
• Fully exercise every line of RTL code;
• Allow resimulation of the post-layout netlist.

In many cases, it is only possible to perform a subset of the total test suite at gate level, since a massive increase in computing power is required. Testbenches are sometimes used to produce cycle-based test vectors for final electrical testing during manufacture. Configuration of testbenches is possible using an in-house prepared Tcl interface.
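As a minimal illustration of the first testbench aim, the Python sketch below applies the same stimuli to a reference (golden) model and to a stand-in for the device under test, and reports every discrepancy to the transcript. The parity function and the names golden_parity, dut_parity, and run_testbench are assumptions made for illustration only; a real testbench in this flow would be written in an HDL and is not shown in the paper.

```python
from typing import Callable, Iterable, List, Tuple


def golden_parity(word: int) -> int:
    """Golden model (hypothetical): 1 if the 8-bit word has an odd number of ones."""
    return bin(word & 0xFF).count("1") % 2


def dut_parity(word: int) -> int:
    """Stand-in for the device under test: XOR-reduction of the eight bits."""
    parity = 0
    for bit in range(8):
        parity ^= (word >> bit) & 1
    return parity


def run_testbench(
    golden: Callable[[int], int],
    dut: Callable[[int], int],
    stimuli: Iterable[int],
) -> List[Tuple[int, int, int]]:
    """Compare DUT and golden model on every stimulus and log each mismatch."""
    mismatches = []
    for stimulus in stimuli:
        expected = golden(stimulus)
        actual = dut(stimulus)
        if expected != actual:
            # Report the discrepancy to the transcript so that it can be fixed.
            print(f"MISMATCH stimulus={stimulus:#04x} expected={expected} actual={actual}")
            mismatches.append((stimulus, expected, actual))
    return mismatches
```

Running run_testbench(golden_parity, dut_parity, range(256)) exercises all 256 input words; an empty result means that no discrepancy was found for this exhaustive stimulus set.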


Figure 1 shows all the main steps of the design flow, including both the logical and physical stages. Logic synthesis tools are used to synthesize a netlist representation of the design (Refs. 17, 18). RTL and netlist (without and with timing information) simulations are performed by HDL simulation tools (Refs. 19, 20). Layout tools can do the floor planning, placement, clock tree generation, routing (including the delay optimization), and verification of the layout (Refs. 17, 21).

2. Hardware Components

In this section, we describe the main hardware components of a SOC for wireless applications: configurable processors, embedded memories, hardware accelerators, power management units (PMUs), buses, controllers, debug support units (DSUs), and peripherals. These components can be soft or hard IP cores depending on the chosen design methodology and design flow.

2.1. Configurable processors

A configurable general purpose processor (Refs. 22-25) can be configured according to the requirements of various applications and tasks. Configurations include one or more coprocessors, one or more cache memories, scratchpad memory, on-chip trace memory, on-chip buses, etc. Usually, the most attention is paid to configuring caches. The configuration process involves deciding on the size, associativity, and organization of the instruction and data caches. Selecting appropriate configurations and choosing the power-optimal one for a specific application is a difficult task (Refs. 26-31).

2.2. Embedded memories

The customer usually wants a particular number of words (depth) and bits (width) for each memory (RAM or ROM) ordered. Each of the final building blocks (physical layout) will be implemented as a stand-alone, densely packed, pitch-matched array. Using complex layout generators and adopting state-of-the-art logic and circuit design techniques, today's embedded memories can reach extreme density and performance. Designers can choose the memory aspect ratio according to the requirements of the SOC-level layout.
A memory generator (Refs. 32, 33) is a tool which can create memory blocks in a range of sizes as needed. Each memory generator is a set of various parameterized generators: the Layout generator generates an array of custom, pitch-matched leaf cells; the Schematic generator and netlister extracts a netlist used for both layout-versus-schematic and functional verification; the Function and Timing Model generators create models for gate-level simulation, dynamic/static timing analysis, and synthesis; the Symbol generator generates schematic symbols; and special purpose generators such as the Critical Path generator are used for both circuit design and timing characterization.
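The choice of aspect ratio for a given depth and width can be illustrated with a short Python sketch. It assumes that the aspect ratio is adjusted by column multiplexing, a common technique that the paper does not name explicitly: with a multiplexing factor m, the physical array has depth/m rows and width × m columns. The names ArrayShape, array_shapes, and squarest, as well as the set of multiplexing factors, are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class ArrayShape(NamedTuple):
    mux: int   # column multiplexing factor (assumed mechanism)
    rows: int  # physical rows of the cell array
    cols: int  # physical columns of the cell array


def array_shapes(depth: int, width: int,
                 mux_options: Sequence[int] = (1, 2, 4, 8, 16)) -> List[ArrayShape]:
    """List the physical array shapes that store depth words of width bits."""
    shapes = []
    for mux in mux_options:
        if depth % mux == 0:  # rows must be a whole number
            shapes.append(ArrayShape(mux, depth // mux, width * mux))
    return shapes


def squarest(shapes: Sequence[ArrayShape]) -> ArrayShape:
    """Pick the shape whose row and column counts are closest to each other."""
    return min(shapes, key=lambda s: abs(s.rows - s.cols))
```

For a 512 × 8 bit memory, array_shapes(512, 8) ranges from 512 rows by 8 columns to 32 rows by 128 columns, and squarest selects the 64 × 64 arrangement. In practice the cell dimensions and the periphery also affect the final block outline, which this sketch ignores.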


2.3. Hardware accelerators

A hardware accelerator is the key component for implementing wireless communication algorithms. It is integrated on a silicon chip together with a processor core, memories, and peripherals. Most often, hardware accelerators in a wireless SOC perform channel estimation and equalization, error detection and correction, encryption and decryption, interference suppression, multichannel and multibeam reception, and other timing-critical functions (Ref. 34).

2.4. Standard buses

System bus architecture is among the top challenges in SOC design due to rapidly increasing operating frequency and growing chip size. In general, the performance of an SOC design heavily depends upon the efficiency of its bus structure. The balance of computation and communication in any application or task is known as a fundamental determinant of delivered performance. Standard on-chip bus structures have been developed to reach this balance. Currently, there are a few publicly available bus architectures from leading manufacturers, such as CoreConnect from IBM (Ref. 35), AMBA from ARM (Ref. 36), and others. These bus architectures are usually tied to the processor architectures they work with and require minimal extra interface logic.

2.5. Memory controllers

The external memory bus is usually controlled by a programmable memory controller. The controller acts as a slave on the system bus. The function of the memory controller is programmed through memory configuration registers. The memory bus provides a direct interface to PROM, memory-mapped I/O devices, and asynchronous static RAM (SRAM). Chip-select decoding is done for several PROM banks, I/O banks, and SRAM banks. Therefore, there are numerous chip-select signals in the memory controller.

2.6. Debug support units

The DSU takes control during the processor debug mode. The DSU is attached to the system bus as a slave. Through a specific address space, any bus master can access the processor registers.
The DSU control registers can be accessed at any time, while the processor registers and caches can only be accessed when the processor has entered debug mode. In debug mode, the processor pipeline is held and the processor is controlled by the DSU.

2.7. Peripherals

The interrupt controller is used to prioritize and propagate interrupt requests from internal or external devices to the integer unit. Usually, there are 15 interrupts in total, divided into two priority levels.
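A two-level prioritization of 15 interrupt lines can be sketched in Python as follows. The register layout (one bit per interrupt in pending, mask, and level words, with bit 0 unused) and the tie-breaking rule that the higher interrupt number wins within a level are assumptions for illustration; the paper does not specify them, and the name highest_pending is hypothetical.

```python
NUM_IRQS = 15  # interrupt lines are numbered 1..15; bit 0 is unused


def highest_pending(pending: int, mask: int, level: int) -> int:
    """Return the interrupt number to forward to the integer unit, or 0 if none.

    Bit i of each argument refers to interrupt i. A set bit in `mask`
    enables the interrupt; a set bit in `level` assigns it to the
    high-priority level. Assumption: within one level, the higher
    interrupt number has priority.
    """
    active = pending & mask
    for wanted_level in (1, 0):              # high-priority level first
        for irq in range(NUM_IRQS, 0, -1):   # then highest number first
            is_active = (active >> irq) & 1
            irq_level = (level >> irq) & 1
            if is_active and irq_level == wanted_level:
                return irq
    return 0
```

With interrupts 3 and 12 pending and all lines enabled, interrupt 3 is forwarded when only it is assigned to the high-priority level, while interrupt 12 is forwarded when both are on the low-priority level.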


The timer is a counter with several operating modes and capture/compare registers. It also supports multiple capture/compares, pulse width modulation, configurable outputs, asynchronous input and output latching, interval timing, and interrupts. Interrupts may be generated from the counter on overflow conditions and from each of the capture/compare registers. The timer unit usually implements a watchdog and a shared prescaler.

The universal asynchronous receiver and transmitter (UART) is used for communication with serial input/output devices. Typically, the UART is connected between a central processor and a serial device. To the processor, the UART appears as a parallel port, which can be written to or read from. To the serial device, the UART presents two data wires, one for input and one for output, which serially communicate data. The rate of data communication depends on the peripheral device. Some devices operate at a single clock speed and generate an internal clock. Other devices operate at multiple clock rates and get their clock input from the UART.

The parallel I/O module allows the core to interface to the user connectors. For each of the dedicated pins, the direction can be programmed. The default direction after reset is "input". In a typical multitasking environment, each process only owns some specific bits in an I/O register.

3. Design and Implementation of MAC Layer

This section presents in detail the design and implementation of three SOCs. They support the MAC layers of three wireless communication standards: IEEE 802.11 (WLAN), IEEE 802.15.3 (high-rate WPAN), and IEEE 802.15.4 (low-rate WPAN). The implementation emphasis differs for each of the three communication standards because of different application needs. Namely, the IEEE 802.11 WLAN (Ref. 37) needs high-performance hardware and is not much constrained by power consumption.
It imposes the use of a high-performance 32-bit general purpose processor and a correspondingly powerful hardware accelerator. In contrast, the low-rate IEEE 802.15.4 WPAN (Ref. 38) demands low-power hardware components and techniques. Consequently, it means the use of 16-bit and eight-bit processors, low-power cores, and a power gating mechanism. The high-rate IEEE 802.15.3 WPAN (Ref. 39) stays in the middle as a trade-off between high-performance and low-power demands.

3.1. IEEE 802.11 MAC layer SOC

We first present a SOC customized for IEEE 802.11 MAC layer operations (Ref. 40). The implemented architecture exploits dedicated hardware for timing-critical tasks. The IEEE 802.11 MAC layer provides reliable data delivery for the upper layers over the wireless medium with data rates up to 54 Mbit/s. It specifies how a computer on the wireless network gains access to receive and transmit data and, once communication is established, how it is maintained. The basic medium access control technique used in the IEEE Standard 802.11 is Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA).

The first step of processor customization was to establish a detailed simulation model of the IEEE 802.11 protocol using SDL. This model was used to verify the functional correctness of our design. It also served as a basis for performance investigations to identify those parts of the protocol that require either software optimization (hand-optimized C code) or implementation in hardware in order to meet the real-time requirements. The top-level structure of the SDL simulation model is shown in Fig. 2.

Fig. 2. SDL model of the IEEE 802.11 WLAN (Ref. 40).

3.1.1. Hardware/software co-design

We have generated the C model of the protocol and compiled it for the target MIPS processor. During simulations on the MIPS32 4KEp, it turned out that excessive invocations of the SDL run-time system cause heavy overheads when executing the software. Therefore, we have replaced some of the automatically generated C code with hand-optimized code. This has sped up the execution but, due to the tight timing requirements of the protocol, the processor executing the software would still have to be clocked at nearly 1 GHz in order to comply with the standard. This would, of course, lead to high power consumption, which is not feasible for operation in mobile devices. Therefore, we have decided to implement some of the timing-critical MAC functionalities in dedicated hardware.

Based on profiling information from the C implementation and analysis of the real-time requirements specified in the standard, we have conducted hardware/software partitioning of our model. Among the timing-critical algorithms implemented in hardware are CRC calculation, RC4 encryption and decryption, as well as a large part of the frame reception process including address filtering and generation of the acknowledgment. All other functionalities that are not timing critical are executed in software.

The system architecture (Fig. 3) is based on the MIPS32 4KEp (Ref. 22), which performs most of the MAC functionality. The core communicates with the rest of the system through the X-bus. Two targets are connected to the X-bus: General Purpose Input/Output (GPIO) and Peripheral Bus Controller (PBC). The interface to the physical layer is provided through an EPP port implemented via GPIO. EPP represents a standard eight-bit parallel port specified by the IEEE 1284 standard. The timing-critical MAC functionalities are mapped into a hardware accelerator. This unit acts as a data pipe between the processor and the IEEE 802.11a physical layer connected via the EPP port. An interface to other external components has been provided through an I2C bus and two serial ports. The I2C bus controller is implemented as a part of GPIO. PCMCIA is connected to the processor via GPIO too. The serial ports are controlled by two UARTs connected via PBC. PBC also provides an interface to external SRAM, flash, and other memory-like peripherals. The MAC protocol software is compiled into MIPS-specific machine code and stored in the flash, requiring 300 Kbyte of memory.

Fig. 3. Architecture of the IEEE 802.11 MAC layer SOC.

3.1.2. Hardware accelerator

The hardware accelerator reduces the burden of the protocol processor by executing timing-critical functions of the IEEE 802.11 MAC layer. Functionality of the hardware accelerator can be divided into four major tasks: frame transmission, frame reception, channel-state monitoring, and timing. Figure 4 shows the general block diagram of the hardware accelerator structure.

Fig. 4. Structure of the MAC hardware accelerator (Ref. 40).

The Bus Interface is the interface to the processor. A standard memory bus is used to interact with the hardware accelerator. Additionally, one interrupt line is provided. The interface to the baseband processor is the EPP interface. The requests from the processor bus are interpreted by the Accelerator Control component. It is responsible for applying correct data to the data bus in the case of read access from the processor, controlling the interrupt handling, decrypting frames, and interpreting the requests from the processor.

Frame transmission is handled by the Tx component. It generates the required control signals for the PHY Interface to indicate the start of a new frame, and processes the frame data received from the processor. The PHY Interface component implements the EPP protocol, which is used to communicate with the baseband processor. Furthermore, it forwards control signals, like the state of the wireless medium and the presence of received data, to other components. The Channel State component keeps track of the state of the wireless medium and implements the IEEE 802.11 logical carrier-sense mechanism. Based on this state information, it controls the backoff procedure that is performed before sending a new frame.

Frame reception is controlled by the Rx component. Frame data is fetched from the PHY Interface when there is an indication of available data. The data is analyzed to determine the frame type and to decide if the station is an intended receiver of that frame. Additionally, the frame data is passed to the CRC algorithm and stored in a buffer to be retrieved later by the processor. After error-free reception, an acknowledgment frame might have to be sent back to the frame source, and a notification to the processor is generated.

Timers are used to control the operation of certain components. All station attributes and timing characteristics that are relevant for the hardware accelerator are stored in the MIB (management information base) component. A few mathematical algorithms are implemented as separate components. These are the CRC component for the CRC-32 algorithm, the RC4 component for the RC4 encryption algorithm, and the Calc Duration component, which is used to calculate the duration of a frame based on its length and the transmission rate. The Calc Duration component is instantiated only once, so access to it must be controlled by an arbiter, whereas CRC and RC4 are instantiated separately for each component (Tx, Rx, Accelerator Control) that needs the respective algorithm.

Finally, the Group Table component manages a list of up to eight multicast addresses. It is a fixed-size table of MAC addresses. The addresses can be configured by software running on the processor. The Rx component in the accelerator checks if the address is a multicast address and, if so, sends a request to the Group Table component. Each of the eight addresses is checked sequentially. The table is implemented using registers. The Rx Frame Reception component controls a buffer for one received frame. It receives frame data from the Rx component and delivers this data to the processor on later request. The frame is stored in RAM, so two addresses, the write pointer and the read pointer, must be updated.

3.1.3. System implementation

The complete hardware has been modeled using a mixed VHDL/Verilog design style. Functional testing has been performed in two ways. The first was using the MIPS evaluation board containing the MIPS processor core and an FPGA. The second was using a VHDL testbench written to simulate the system behavior.
The board has mainly been used to test the functionality of the hardware accelerator and MAC protocol, while the behavioral simulation model has been used to test the functionality of the implemented peripheral controllers and the functional correctness of the whole system. The second approach is more time consuming, but it allows tracing of all the internal signals and efficient debugging. The VHDL simulation testbench provides SRAM and PROM behavioral models as well as emulation of the behavior of the system's peripherals. To test the system, a test program has been written in MIPS assembler, compiled into a bit stream, and loaded into the boot PROM. The program performs processor initialization and tests all of the basic data transactions on the system ports. The structure of the simulation environment is shown in Fig. 5.

The design has been fabricated in IHP's 0.25 μm CMOS technology (Ref. 41). The CPU, hardware accelerator, and memory blocks take up most of the 40 mm² chip area. Due to intensive calculations, the CPU and the hardware accelerator consume most of the energy. The total power consumption at the operating frequency of 80 MHz is 1.2 W. The hardware accelerator uses about 3 Kbyte of memory (5 × 512 byte single-port RAM and 2 × 256 byte dual-port RAM). Figure 6 shows the SOC photo.
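To illustrate the kind of calculation performed by the Calc Duration component of Sec. 3.1.2, the Python sketch below computes a frame duration from the frame length and the transmission rate. It assumes the OFDM timing of the IEEE 802.11a physical layer (16 μs preamble, 4 μs SIGNAL field, 4 μs symbols, 16 SERVICE bits, and 6 tail bits); these constants come from the 802.11a specification rather than from the paper, and the function name frame_duration_us is hypothetical. The actual hardware implementation may differ.

```python
# Data bits per OFDM symbol for each IEEE 802.11a rate in Mbit/s.
DATA_BITS_PER_SYMBOL = {6: 24, 9: 36, 12: 48, 18: 72,
                        24: 96, 36: 144, 48: 192, 54: 216}

T_PREAMBLE_US = 16  # PLCP preamble duration
T_SIGNAL_US = 4     # SIGNAL field duration
T_SYMBOL_US = 4     # duration of one OFDM data symbol
SERVICE_BITS = 16
TAIL_BITS = 6


def frame_duration_us(length_bytes: int, rate_mbps: int) -> int:
    """Duration in microseconds of a frame with the given length and rate."""
    bits_per_symbol = DATA_BITS_PER_SYMBOL[rate_mbps]
    total_bits = SERVICE_BITS + 8 * length_bytes + TAIL_BITS
    # Round up to a whole number of OFDM symbols (integer ceiling division).
    n_symbols = (total_bits + bits_per_symbol - 1) // bits_per_symbol
    return T_PREAMBLE_US + T_SIGNAL_US + T_SYMBOL_US * n_symbols
```

Under these assumptions, a 14-byte acknowledgment frame sent at 24 Mbit/s occupies two data symbols and lasts 28 μs, while a 1536-byte frame at 54 Mbit/s needs 57 symbols and lasts 248 μs.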

Fig. 5. Simulation testbench.

Fig. 6. Layout of the IEEE 802.11 MAC layer SOC.

3.2. IEEE 802.15.3 MAC layer SOC

This design implements a high-performance low-power SOC based on the LEON2 processor (Ref. 25) and executes the medium access and error control (MAC) protocol of the IEEE 802.15.3 standard (Ref. 39). The MAC protocol functionality is clearly separated into the data path and the control path (Fig. 7). The data flow processing includes:

• cyclic redundancy check (CRC) sum calculation,
• encryption and decryption of the frame payload,
• interfacing with the physical layer, and
• frame buffering.
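The CRC sum calculation in the data path can be illustrated by a bit-serial Python sketch that processes a frame byte by byte, much as a hardware shift register would. The paper names CRC-32 for the IEEE 802.11 accelerator; that this design uses the same polynomial is an assumption here, and the function names are hypothetical. The sketch uses the common reflected CRC-32 form, which is also what Python's zlib.crc32 computes.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # reflected form of the CRC-32 polynomial


def crc32_update(crc: int, byte: int) -> int:
    """Shift one byte into the CRC register, least significant bit first."""
    crc ^= byte
    for _ in range(8):
        if crc & 1:
            crc = (crc >> 1) ^ CRC32_POLY_REFLECTED
        else:
            crc >>= 1
    return crc


def crc32_frame(data: bytes) -> int:
    """CRC-32 of a whole frame: all-ones preset and final inversion."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = crc32_update(crc, byte)
    return crc ^ 0xFFFFFFFF


# Cross-check against the library implementation on the usual test string.
assert crc32_frame(b"123456789") == zlib.crc32(b"123456789")
```

Because crc32_update needs only the previous register value and the next byte, the check sum can be accumulated while the frame is still being received or transmitted, without buffering the whole frame first.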

3.2.1. Hardware/software co-design

To identify bottlenecks in the pure software implementation and to estimate the clock frequency required to meet all timing constraints, we have profiled the software. For that purpose, the software has been simulated using the LEON2 instruction set simulator. Different design alternatives, in which dedicated hardware accelerators replace software, have been analyzed. The most timing-critical protocol functions have been iteratively removed from the software model and put into a hardware component connected to the LEON2 processor via the on-chip bus. This has allowed us to study the new timing behavior of the MAC protocol and to optimize the hardware/software partitioning until all the timing constraints were met.

Fig. 7. SDL MAC protocol model for the IEEE 802.15.3 (Ref. 42).

As a result of the hardware/software partitioning, we have identified the frame reception and transmission procedure, superframe timing control, immediate acknowledgment handling, and parts of the transmission queue as candidates for hardware implementation. In other words, all the low-level, timing-critical, and processing-intensive tasks of the channel access mechanism have been mapped into the hardware partition. This corresponds well to the lowest service layer in our SDL model (transport engine). The remaining protocol functionality is handled by the LEON2 processor. Interrupts are used to signal protocol-related events from the hardware to the software. Conversely, the software interacts with the hardware by writing to and reading from a number of control registers.

The LEON2 processor is highly configurable, allowing the user to customize it for a certain application (selecting different cache sizes, multiplier performance, etc.) and target technology. It is available as an open core in the form of a VHDL model describing the SPARC V8 processor core, system bus, and peripheral components. New modules can easily be added using the on-chip system bus. A graphical configuration tool based on UNIX kernel scripts is used to configure the system. The configuration environment is modified to include IHP's 0.25 μm CMOS SGB25V technology library (Ref. 41).

The architecture of the configured system is presented in Fig. 8. The system is based on the LEON2 core connected through the AMBA bus to system peripherals. The core integrates both instruction and data cache memories (I-Cache and D-Cache) and the corresponding cache controllers. It also includes an interface to the AMBA advanced high-performance bus (AHB) and its controller. A memory controller is attached to AHB. It provides an interface to both internal and external flash memories (PROMs) and static RAMs.
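The iterative partitioning procedure described earlier in this section can be sketched in Python as a simple greedy loop: the most expensive software task is moved to hardware until the remaining software load fits a cycle budget. This is a strong simplification of the profiling and simulation actually used; the cycle counts, the budget, the task names beyond those listed in the text, and the function name partition are all hypothetical.

```python
from typing import Dict, List, Tuple


def partition(sw_cycles: Dict[str, int],
              cycle_budget: int) -> Tuple[List[str], List[str]]:
    """Greedy hardware/software partitioning sketch.

    sw_cycles maps each protocol function to its software cost in
    processor cycles within the critical time window. Functions are
    moved to hardware, most expensive first, until the remaining
    software load fits into cycle_budget.
    """
    software = dict(sw_cycles)
    hardware: List[str] = []
    while software and sum(software.values()) > cycle_budget:
        worst = max(software, key=software.get)  # most timing-critical task
        hardware.append(worst)
        del software[worst]
    return hardware, sorted(software)


# Illustrative numbers only; they are not measurements from the paper.
EXAMPLE_TASKS = {
    "frame_rx_tx": 4000,
    "superframe_timing": 3000,
    "imm_ack": 1500,
    "tx_queue": 1000,
    "association": 300,
    "management": 200,
}
```

With these illustrative numbers and a budget of 600 cycles, partition(EXAMPLE_TASKS, 600) moves the four most expensive functions to hardware and leaves "association" and "management" in software. A real flow would re-simulate the timing behavior after each move rather than rely on a single cycle count.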
The slower AMBA advanced peripheral bus (APB) is attached to AHB via a bridge. Two UARTs, a timer, an I/O port, and an interrupt controller are connected to APB.

Fig. 8. Architecture of the IEEE 802.15.3 MAC layer SOC.

3.2.2. Hardware accelerator

We have designed a hardware accelerator called BASUMA (Body Area System for Ubiquitous Multimedia Applications) as a key component of our MAC layer processor. Because the hardware accelerator is connected to the system bus (AHB) via an AHB master interface, it can directly access the system memory, for instance to store and retrieve frame data without involving the LEON2 processor. Figure 9 shows the main accelerator components. The tasks performed by each of the main components reflect the protocol functions that have been selected for implementation in hardware.

Fig. 9. Structure of the MAC hardware accelerator (Ref. 42).

In the receive direction, the accelerator retrieves frame data from the physical layer byte by byte, performs filtering and the CRC check, and stores the data at a given memory location by means of direct memory access (DMA) (Rx controller, CRC, and DMA). In the transmit direction, it retrieves frame data from a memory location, calculates and appends the checksum, and pushes the data to the physical layer (Tx controller, CRC, and DMA). The hardware accelerator also signals a successful reception or transmission of a frame to the processor by an interrupt (Interrupts). It analyzes received and transmitted beacon frames and extracts information on channel time allocations (Beacon parser). It also manages a frame queue and selects an appropriate frame for transmission (Transmission queue). At the start of a time slot or following a frame transmission, it requests a new frame from the queue and, if the frame must be acknowledged by the receiver, waits for the acknowledgment frame (Scheduler and Timers). The accelerator performs the backoff procedure in the contention access period (Scheduler and Timers) and sends an acknowledgment at the right time upon reception of a frame that needs to be acknowledged (Scheduler, Timers, and Tx Controller).

The Calc Duration component, which is not shown in Fig. 9 for simplicity, calculates the actual duration of a frame transmission based on its payload length and data rate. This component is used to determine whether a frame transmission fits into an available time slot and when a transmission initiated by the protocol accelerator will be completed by the physical layer.

3.2.3. System implementation

For system implementation and verification, we have used the original simulation and synthesis scripts with the necessary modifications. First, modifications have been made to incorporate custom SRAM (for the caches) and PROM (for the flash) Verilog simulation models into the original VHDL processor model. The design with directly instantiated memory blocks and pads has been synthesized and the layout has been prepared.
A generic testbench is provided for generation of a few testbench con¯gurations. In addition, we have written two C programs: one, which tests the hardware accelerator by sending and receiving data packets through the EPP interface and another, which tests the °ash interface by writing/reading data to/from the °ash memory. These testbenches have been used for the design veri¯cation. The SOC layout is shown in Fig. 10. The chip occupies 32 mm 2 of the silicon area and consumes the power of 0.35 W at the operating frequency of 25 MHz. 3.3. IEEE 802.15.4 MAC layer SOC Wireless Sensor Networks (WSN) are networks of a limited number of nodes, typically less than ten, deployed in an ad hoc fashion to cooperate for sensing one or more physical phenomena.4345 They o®er a high capability of processing and communicating data in medical and surveillance applications. In the TANDEM project,46 we propose a °exible sensor network node architecture, which allows an easy adaptation to the application type and scenario. In principle, the sensor node should

1520

Z. Stamenkovic

Fig. 10. Layout of the IEEE 802.15.3 MAC layer SOC.

consist of a processor, wireless communication unit, one or two sensor interfaces, °ash memory, random access memory and possibly PROM. Eventually, some hardware accelerators are needed for functions that cannot be e±ciently executed in software by the processor. Such functions are normally time critical or resource intensive. We assume that parts of the medium access control, network, and transport protocols are implemented in hardware. All these components should be interconnected by a bus system. The actual node size is de¯ned by the processor type and memory size, and depends on the application type and scenario. The proposed architecture allows also an easy recon¯guration or adaptation of the sensor node according to the transmission cycle or channel access mechanism. The SOC includes, beside a processor core, a clock divider, a PMU, program and data memories, no °ash memory, additional peripherals (timer and input/output port), interrupt chain, and glue logic. The system is described by synthesizable VHDL. The architecture of the con¯gured system is presented in Fig. 11. 3.3.1. Hardware components The IPMS430 processor is an IP core designed by IPMS, Fraunhofer in the form of a VHDL model.47 It is a clone of the Texas Instruments MSP430 microcontroller's central processing unit.48 It is a 16 bit von-Neumann reduced instruction set computer (RISC) architecture with a 16 bit arithmetic logic unit (ALU). It contains 16 registers, where 12 are general purpose registers while the other special purpose registers are the program counter (R0 ¼ PC), the stack pointer (R1 ¼ SP), and the status register (R2 ¼ SR). R2 is not only the SR, but also a constant generator. R3 serves exclusively as a constant generator. The IPMS430 core executes almost all the instructions equally to the MSP430. There are only slight di®erences in execution of

Fig. 11. System architecture of the IEEE 802.15.4 MAC layer (blocks shown: I2C debug, program memory, timer, IPMS430, MUX, GPIO, data memory).

some very synthetic instructions. A known bug of the MSP430 does not appear in the IPMS430. Memory and peripheral components are accessed via a 16 bit main address bus (MAB) and a 16 bit main data bus (MDB). Memory and peripherals are organized byte-wise, and therefore every byte has an address. Both byte (eight bit) and word (16 bit) accesses are possible.

The clock divider slows down the input clock signal by a factor of two. The divided clock drives the processor core and peripherals, while the inverted input clock drives the memories. This is necessary since almost all IPMS430 instructions are executed in a single clock cycle.

Power consumption is managed by a dedicated PMU that contains all the logic required to control the power gating mechanism [49,50]. It is a programmable unit seen by the processor as a peripheral. Power gating is a static operation and is generally controlled by the application. Whenever the functional mode of a sensor node (transmitting, receiving, data collection, etc.) changes, the application decides which functional blocks are not needed and activates the PMU to switch them off. It is also possible to implement decision logic within the PMU and make the switching independent of the application. The PMU consists of a simple register file that stores the control values selecting the peripherals to be switched off and the values that set the duration of the peripheral sleep mode. It also contains a finite-state machine responsible for the correct execution of the switch-on and switch-off procedures. Upon time-out, timers generate the corresponding interrupts that start the switch-on procedure. Sensors collect the data, which are preprocessed and then stored in a dedicated memory. Other functional blocks are disconnected from the power supply and wait for a signal from the PMU to wake up and operate.

The IPMS430 uses a byte-organized memory. Bytes are located at even or odd addresses; words, however, are located only at even addresses.
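The byte and word addressing rule can be illustrated with a minimal sketch. The class `ByteMemory` and its method names are hypothetical and are not part of the IPMS430 model. The sketch assumes the MSP430 convention that the low byte of a word sits at the even address, and it simply rejects word accesses at odd addresses; how the real core reacts to a misaligned word access is not stated in the text.

```python
class ByteMemory:
    """Illustrative byte-organized memory with byte and word access."""

    def __init__(self, size):
        self.data = bytearray(size)

    def read_byte(self, addr):
        # Bytes may live at even or odd addresses.
        return self.data[addr]

    def write_byte(self, addr, value):
        self.data[addr] = value & 0xFF

    @staticmethod
    def _check_word_addr(addr):
        # Modeling choice (assumption): refuse word accesses at odd addresses.
        if addr & 1:
            raise ValueError(f"word access at odd address 0x{addr:04X}")

    def read_word(self, addr):
        self._check_word_addr(addr)
        # Assumption: low byte at the even address, high byte at the next odd one.
        return self.data[addr] | (self.data[addr + 1] << 8)

    def write_word(self, addr, value):
        self._check_word_addr(addr)
        self.data[addr] = value & 0xFF
        self.data[addr + 1] = (value >> 8) & 0xFF
```

For example, writing the word 0xBEEF to address 0x0002 in this model places 0xEF at 0x0002 and 0xBE at 0x0003, while a word read at 0x0003 raises an error.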
Therefore, when using word instructions, only even addresses may be used. The address signal is stable whenever the chip select signals are active. A read access to the peripheral is


started with the falling edge of the chip select signals. The data processing during a write access to the peripheral should be started with the rising edge of the write enable signal. To access words in one step, the IPMS430 needs two byte-organized memories, one containing all bytes with even addresses and one containing all bytes with odd addresses. This holds for both the program and data memories.

All of the physically separate memory areas (program and data memories, peripherals, and special function registers) are mapped into a common address space. The addressable memory is 64 Kbytes. We have implemented a system memory configuration consisting of a 4 Kbyte program memory and a 2 Kbyte data memory. The program memory serves primarily for hosting the system firmware.

The processor system includes hardware debug support to aid software debugging on the target hardware. The support is provided through an I2C interface. It can put the processor in debug mode, allowing read/write access to all processor registers and to an arbitrary address in the memory space, and thus allowing a program to be written into the system program memory.

Peripherals may signal an interrupt request. Interrupts are handled in an interrupt chain. The nearer a peripheral is to the start of the chain, the higher its priority. Every peripheral has an interrupt input, which is connected to the interrupt source of the next lower priority, and an output, which is connected to the interrupt source of the next higher priority. Incoming interrupts are passed through by the peripherals. Interrupts can be maskable (these have lower priority) or nonmaskable.

This design features two standard peripherals: the timer and the parallel I/O port. Peripherals have to be connected to the data bus and the address bus. A multiplexer is needed to feed the input data bus of the processor core. The system's timer mimics the functionality of the MSP430 microcontroller timer.
The timer is a 16-bit counter with four operating modes and three capture/compare registers. It also supports multiple capture/compares, pulse width modulation, configurable outputs, asynchronous input and output latching, interval timing, and interrupts. Interrupts may be generated from the counter on overflow conditions and from each of the capture/compare registers. A dedicated interrupt vector register provides fast decoding of timer interrupts.

The parallel I/O port mimics the behavior of the MSP430 I/O port as well. The difference lies in the number of ports: our design features four digital I/O ports, while the MSP430 microcontroller features up to six. Each port has eight I/O pins. Every I/O pin is individually configurable for input or output direction, and every I/O line can be individually read or written. The first two ports include interrupt capability. The interrupt capability of these ports can be individually enabled and configured to provide an interrupt on a rising edge, a falling edge, or both edges of an input signal. The first two I/O lines source different single interrupt vectors.

3.3.2. System implementation

For system implementation and verification, we have used the IPMS430 processor core described in VHDL, the corresponding verification environment (VHDL


testbenches) [47], in-house developed C tests, and simulation, synthesis, and layout generation scripts. First, the system top module was defined. Then, the necessary modifications were made to incorporate the Verilog simulation models of the custom SRAM (for the program and data memories) into the original VHDL processor model. In the next steps, we developed and verified VHDL models of the PMU, timer, parallel I/O port, and multiplexer. The VHDL system model was then synthesized and verified. Finally, the SOC layout was generated and verified.

After the functionality of the synthesized netlist had been verified, we created a floor plan. In the floor planning phase, the memory blocks were placed as hard macros. The design layout was generated using a standard sequence of backend process steps: power planning, placement, clock tree generation, routing, and verification. The SOC is produced in IHP's 0.25 µm CMOS technology [41]. The chip layout is shown in Fig. 12. The chip occupies 9 mm² of silicon area and consumes 1.2 mW at an operating frequency of 20 MHz.

The peculiarity of this chip is the power gating mechanism introduced to reduce the power loss due to leakage currents [49,50]. The implementation of power gating at the block level requires the design of suitable power gates and isolation circuitry. The power gates are designed to switch off the power of a gated block. The isolation circuitry is used to isolate the signals of a gated block from the active circuitry. The simplest way to design a power gate is to use a PMOS or NMOS transistor for cutting the power (Fig. 13). The power of a block is switched off either by an NMOS transistor cutting the ground connection or by a PMOS transistor cutting the supply connection. The two most important parameters of a transistor (in this

Fig. 12. Layout of the IEEE 802.15.4 MAC layer SOC.


Fig. 13. PMOS transistor as a power gate.

case used as a power gate) in the on-state are the maximum output voltage and the maximum output current that it can handle. The maximum current is calculated after the peak power of each gated block has been estimated. We use the Synopsys PrimePower tool to perform the peak power estimation. The voltage drop across a gate should be as small as possible, since it directly affects the performance of a gated block when it is active. On the other hand, a lower voltage drop means a larger transistor area. For the same output parameters, NMOS transistors are smaller and switch faster but have slightly higher leakage currents than PMOS transistors.

When a functional block is disconnected from the power supply, its ports must be decoupled from the active functional blocks. The port isolation prevents the creation of sneak leakage paths back to the block and the disturbance of the active logic. The isolation circuitry is simply constructed from "AND" and "NAND" gates that set the ports to logic "0" or "1", respectively. However, the user may decide to design custom isolation logic.

Implementation of the power gating mechanism starts with a precise estimation of the power consumption at the system level. The first step is to partition the system into blocks suitable for gating. Each block has to be designed separately and its peak power has to be estimated. It is useful to design a library of NMOS and PMOS transistors of different strengths so that they may be used according to the design requirements. In the layout phase, power islands must be created for each gated block and gates have to be inserted. The user may decide to put several smaller gates around each block in order to achieve a better power distribution during the active mode (Fig. 14). With more gates, it may make sense to switch them not all at once but in groups. This way, the noise on the power lines may be reduced. Power gating is controlled via the PMU described above.
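Returning briefly to the sizing of the gates, the trade-off between voltage drop and transistor area can be made concrete with a rough worked example. The symbols $P_{\text{peak}}$, $V_{DD}$, $\Delta V_{\max}$ and $R_{\text{on}}$, as well as all numeric values, are assumptions chosen only for illustration and are not taken from the design.

```latex
% Maximum gate current from the estimated peak power of the gated block,
% and the resulting bound on the on-resistance for an allowed voltage drop.
\[
I_{\max} = \frac{P_{\text{peak}}}{V_{DD}}, \qquad
R_{\text{on}} \le \frac{\Delta V_{\max}}{I_{\max}}
\]
% Illustrative values: P_peak = 5 mW, V_DD = 2.5 V, Delta V_max = 25 mV.
\[
I_{\max} = \frac{5\ \text{mW}}{2.5\ \text{V}} = 2\ \text{mA}, \qquad
R_{\text{on}} \le \frac{25\ \text{mV}}{2\ \text{mA}} = 12.5\ \Omega
\]
```

To a first approximation, the on-resistance of a MOS transistor is inversely proportional to its width, so halving the allowed voltage drop roughly doubles the required gate width.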
A block is isolated before the power is switched off. After the power is switched on, the clock is activated and the isolation is removed. Finally, the reset is released. The reset is active from the very beginning of the power-on procedure. The control procedure becomes more complex when the state of a certain block must be saved.
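The ordering of the switch-off and switch-on steps can be sketched as a small behavioural model. This is only an illustration under stated assumptions, not the PMU implementation: the class `PowerGateController` is hypothetical, the names `isol`, `pw_off`, and `rst` are borrowed from Fig. 14 with assumed active-high meaning, `clk_en` is a hypothetical clock-enable signal, and stopping the clock before the power is cut is an assumption, since the text only states that the clock is activated after power-on.

```python
class PowerGateController:
    """Illustrative ordering of power-gating control steps for one block."""

    def __init__(self):
        # The block starts powered, clocked, and out of reset.
        self.isol = 0    # 1 = block ports clamped by the isolation cells
        self.pw_off = 0  # 1 = power gate open, block unpowered
        self.clk_en = 1  # hypothetical clock enable for the block
        self.rst = 0     # 1 = block held in reset
        self.trace = []  # names of the steps performed, in order

    def _step(self, name, **signals):
        for key, value in signals.items():
            setattr(self, key, value)
        self.trace.append(name)

    def switch_off(self):
        # The block is isolated before its power is switched off.
        self._step("isolate", isol=1)
        self._step("stop_clock", clk_en=0)  # assumption, not stated in the text
        self._step("power_off", pw_off=1)

    def switch_on(self):
        # Reset is active from the very beginning of the power-on procedure.
        self._step("power_on", pw_off=0, rst=1)
        self._step("start_clock", clk_en=1)
        self._step("remove_isolation", isol=0)
        self._step("release_reset", rst=0)
```

Calling `switch_off()` followed by `switch_on()` returns the model to its initial signal values, and `trace` records the order in which the steps were taken.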

Fig. 14. Power-gating a hardware block (labels shown: isol, pw_off, clk sel, rst, block, power grid).

The state can be saved in a dedicated low power memory (an embedded flash or retention registers).

4. Conclusion

The design flow and main hardware components of modern SOCs for wireless communications are described using three typical wireless SOC designs. Most of the components are configurable functional building modules, which can be generated automatically after the configuration parameters have been chosen. However, it is necessary to design and implement custom hardware accelerators and power saving techniques that are specific to wireless communication applications.

References

1. W. Wolf, Modern VLSI Design: System-on-chip Design, 3rd edn. (Prentice Hall, New Jersey, 2002).
2. J. Henkel, Closing the SoC design gap, Computer 36 (2003) 119-121.
3. C. Rowen, Engineering the Complex SoC: Fast, Flexible Design with Configurable Processors (Prentice Hall, New Jersey, 2004).
4. S. Sarkar, S. G. Chandar and S. Shinde, Effective IP reuse for high quality SoC design, Proc. IEEE Int. SOC Conf. (Washington, 2005), pp. 215-224.
5. A. Hekmatpour, K. Goodnow and S. Hemen, Standards-compliant IP-based ASIC and SoC design, Proc. IEEE Int. SOC Conf. (Washington, 2005), pp. 322-323.
6. M.-A. Dziri, W. Cesario, F. R. Wagner and A. A. Jerraya, Unified component integration flow for multi-processor SoC design and validation, Proc. Design, Automation and Test in Europe Conf. (Paris, 2004), pp. 1132-1137.
7. M. Bocchi, C. Brunelli, C. De Bartolomeis, L. Magagni and F. Campi, A system level IP integration methodology for fast SoC design, Proc. Int. Symp. System-on-Chip (Tampere, 2003), pp. 127-130.
8. S. Nugent, D. S. Wills and J. D. Meindl, A hierarchical block-based modeling methodology for SoC in GENESYS, Proc. 15th IEEE Int. ASIC/SOC Conf. (Rochester, 2002), pp. 239-243.
9. S. J. E. Wilton and R. Saleh, Programmable logic IP cores in SoC design: Opportunities and challenges, Proc. IEEE Conf. Custom Integrated Circuits (San Diego, 2001), pp. 63-66.
10. P. G. Paulin, Chips of the future: Soft, crunchy or hard? Proc. Design, Automation and Test in Europe Conf. (Paris, 2004), pp. 844-849.
11. J. Koehl, D. E. Lackey and G. Doerre, IBM's 50 million gate ASICs, Proc. 8th Asia and South Pacific Design Automation Conf. (Kitakyushu, 2003), pp. 628-634.
12. B. Dipert, S. Rawat and S. Tam, Future systems-on-chip: Software or hardware design? Proc. 37th Design Automation Conf. (Los Angeles, 2000), pp. 336-336.
13. D. D. Gajski, High-level Synthesis: Introduction to Chip and System Design (Kluwer Academic Publishers, Boston, 1992).
14. A. Habibi and S. Tahar, A survey on system-on-a-chip design languages, Proc. 3rd IEEE Int. Workshop on SoC for Real-Time Applications (Los Alamitos, 2003), pp. 212-215.
15. http://www.design-reuse.com.
16. Z. Stamenkovic, G. Panic, U. Jagdhold, H. Frankenfeldt, K. Tittelbach-Helmrich, G. Schoof and R. Kraemer, Modular processor: A flexible library of ASIC modules, Proc. IASTED Int. Conf. Applied Simulation and Modelling (Rhodes, 2004), pp. 428-432.
17. http://www.synopsys.com/Tools/Implementation.
18. http://www.xilinx.com/tools/logic.htm.
19. http://www.mentor.com/products/fv.
20. http://www.cadence.com/products/fv.
21. http://www.cadence.com/products/di/edi_system.
22. http://www.mips.com/products/cores.
23. http://www.arm.com/products/processors.
24. http://www.arc.com/configurablecores.
25. http://www.gaisler.com.
26. P. R. Panda, N. Dutt and A. Nicolau, Memory Issues in Embedded Systems-on-Chip: Optimizations and Exploration (Kluwer Academic Publishers, Boston, 1999).
27. W. T. Shiue and C. Chakrabarti, Memory exploration for low power embedded systems, Proc. 36th Design Automation Conf. (New Orleans, 1999), pp. 140-145.
28. T. Givargis, F. Vahid and J. Henkel, System-level exploration for Pareto-optimal configurations in parameterized SoC, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 10 (2002) 416-422.
29. R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan and P. Marwedel, Scratchpad memory: A design alternative for cache on-chip memory in embedded systems, Proc. 10th Int. Symp. Hardware/Software Codesign (Estes Park, 2002), pp. 73-78.
30. Z. Stamenkovic, F. Vater and Z. Dyka, A framework for selection of cache configurations for low power, Proc. 4th Int. Workshop on IP-Based System-on-Chip Design (Grenoble, 2003), pp. 137-140.
31. P. Kalla, X. S. Hu and J. Henkel, A flexible framework for communication evaluation in SoC design, Proc. 10th Asia and South Pacific Design Automation Conf. (Shanghai, 2005), pp. 956-959.
32. http://www.viragelogic.com/Memory.
33. http://www.dolphin-ic.com/Memory.html.
34. M. Helfenstein and G. Moschytz, Circuits and Systems for Wireless Communications (Kluwer Academic Publishers, Boston, 2000).
35. CoreConnect bus architecture, http://www.ibm.com.
36. AMBA on-chip bus standard, http://www.arm.com.
37. IEEE standard for information technology - Local and metropolitan area networks - Specific requirements: Wireless LAN MAC and PHY specifications, IEEE STD 802.11, IEEE Computer Society (2007).
38. IEEE Standard 802, Part 15.4: Wireless medium access control (MAC) and physical layer (PHY) specifications for low-rate wireless personal area networks, IEEE Computer Society (2006).
39. IEEE Standard 802, Part 15.3: Wireless medium access control (MAC) and physical layer (PHY) specifications for high-rate wireless personal area networks, IEEE Computer Society (2003).
40. G. Panic, D. Dietterle, Z. Stamenkovic and K. Tittelbach-Helmrich, A system-on-chip implementation of the IEEE 802.11a MAC layer, Proc. 3rd EUROMICRO Symp. Digital System Design (Antalya, 2003), pp. 319-324.
41. Innovations for High Performance microelectronics, http://www.ihp-microelectronics.com.
42. Z. Stamenkovic, D. Dietterle, G. Panic, W. Bocer, G. Schoof and J.-P. Ebert, MAC processor for BASUMA wireless body area network, Proc. 5th IASTED Int. Conf. Circuits, Signals, and Systems (Banff, 2007), pp. 47-52.
43. A. Milenkovic, C. Otto and E. Jovanov, Wireless sensor networks for personal health monitoring: Issues and an implementation, Comput. Commun. 29 (2006) 2521-2533.
44. E. Farella, A. Pieracci, L. Benini and A. Acquaviva, A wireless body area sensor network for posture detection, Proc. 11th IEEE Symp. Computers and Communications (Cagliari, 2006), pp. 454-459.
45. R. Bults, K. Wac, A. T. van Halteren, D. Konstantas, V. Jones and I. A. Widya, Body area networks for ambulant patient monitoring over next generation public wireless networks, Proc. 13th IST Mobile and Wireless Communications Summit (Lyon, 2004), pp. 181-185.
46. TANDEM system specification, Technical Report, IHP, Frankfurt (Oder) (2007).
47. IPMS430 processor user manual, IPMS Fraunhofer, Dresden (2007).
48. MSP430x1xx family user's guide, Texas Instruments, Dallas (2006).
49. G. Panic, Z. Stamenkovic and R. Kraemer, Power gating in wireless sensor networks, Proc. 3rd IEEE Int. Symp. Wireless Pervasive Computing (Santorini Island, 2008), pp. 499-503.
50. G. Panic, D. Dietterle and Z. Stamenkovic, Architecture of a power-gated wireless sensor node, Proc. 11th IEEE EUROMICRO Conf. Digital System Design (Parma, 2008), pp. 844-849.