Terrain Rendering Engine (TRE): Cell Broadband Engine Optimized Real-time Ray-caster

Barry Minor, IBM, 11501 Burnet Rd., Austin, TX 78758, [email protected]
Gordon Fossum, IBM, 11501 Burnet Rd., Austin, TX 78758, [email protected]
Van To, IBM, 11501 Burnet Rd., Austin, TX 78758, [email protected]

Abstract
The development of the Cell Broadband Engine (CBE), together with the advent of commercially available high resolution satellite images and digital elevation models (DEM), brings new possibilities to visualizing the world around us. We introduce the Terrain Rendering Engine (TRE), a client/server ray-casting system in which the client, through a thin layer of network interfaces, directs the CBE server to render and deliver 3-dimensional (3D) terrain views at interactive speed. The TRE server, optimized for the Cell Broadband Engine Architecture, is a ray-casting engine that can compute high resolution, high quality 3D views of terrain in real-time without the assistance of a Graphics Processing Unit (GPU). Moreover, the flexibility of the client/server model allows quick bring-up of the client interface on other systems.

Keywords
Cell Broadband Engine (CBE), "Cell", Synergistic Processor Element (SPE), Single Instruction Multiple Data (SIMD), Direct Memory Access (DMA), graphics hardware, ray-casting, Digital Elevation Model (DEM).

1. Introduction
The advent of commercially available high resolution satellite imagery and USGS digital elevation models (DEM) has greatly enhanced our ability to visualize the world around us. In this paper, we describe an interactive system, the Terrain Rendering Engine (TRE), for visualizing terrain in 3D with height data and a satellite image overlay, using a software implementation of ray-casting optimized for the Cell Broadband Engine (CBE). The system implements a client/server model in which the client directs the server to render, compress, and deliver images at interactive speeds for remote display. We show that by leveraging all the parallel processing capabilities of the CBE, such a system can achieve interactive frame rates at HDTV resolutions.

In our application, we have two sets of input: a digital elevation model (DEM), which consists of a uniform grid containing elevation data, and a satellite image that contains color data for each grid position. Figure 1.1 shows the two inputs we have for Mt. Rainier in Washington State. The DEM can be represented as a grey scale image, and the color satellite image can be saved as a regular JPEG.

Figure 1.1 Digital Elevation Model and Color Satellite Image

In order for a traditional graphics processing unit (GPU) to render this data, it must be preprocessed into a more digestible form. This typically involves decomposing the height field specified by the DEM into triangle meshes with precomputed texture and normal information per triangle vertex. This process produces a huge amount of data that is then transferred redundantly (rows of shared vertices are sent multiple times) over an I/O bus to the GPU for rendering. Data relaxation techniques can be used to reduce this volume of data, but they also reduce model detail. Furthermore, many of the triangles communicated to the GPU are not even visible in the finished image. To handle this optimally, the CPU must either remove some of the non-visible triangles via system-side view frustum culling, leaving the rest to GPU back-face culling and depth buffering, or communicate the entire scene to the GPU each frame and accept the degradation in performance.

Figure 1.2 TRE Ray-casting output

Rendering the terrain directly with ray-casting [Figure 1.2] reduces the scene's memory footprint by removing the need to decompose the height field into triangles and store pre-computed normals. Furthermore, ray-casting only processes the visible terrain fragments, thereby eliminating the need to cull and depth buffer fragments. However, today's general purpose CPUs can't handle such rendering tasks at HDTV resolutions and interactive frame rates. By leveraging the parallel processing capabilities of the CBE, we can finally ray-cast large 3D terrains in real-time.

This paper is organized as follows: section 2 contains background information on relevant aspects of the Cell Broadband Engine architecture; section 3 presents an overview of the ray-casting algorithm used in this implementation; section 4 gives a description of the TRE system, including a description of the client/server model; section 5 provides a detailed description of our ray-casting implementation and optimization on Cell; section 6 provides detailed information on our image compression technique; section 7 describes a technique called SPE repurposing, which we use to increase the SPEs' utilization; section 8 presents the TRE's performance results; and section 9 contains the conclusion and future work.

2. Cell Broadband Engine Overview
The Cell Broadband Engine (CBE) consists of one dual-threaded Power Processor Element (PPE) and eight SIMD Synergistic Processor Elements (SPE), along with an on-chip memory controller (MIC) and an on-chip controller for a configurable I/O interface (BIC), all interconnected by a coherent on-chip Element Interconnect Bus (EIB) [8].

Figure 2.1 Cell Broadband Engine

The key attributes of this concept are:
• A high-frequency design, allowing the processor to operate at a low voltage and low power while maintaining high performance [5].
• Power Architecture® compatibility, to provide a conventional entry point for programmers, for virtualization and multi-OS support, and to be able to leverage IBM's experience in designing and verifying symmetric multiprocessors (SMP).
• A SIMD architecture, supported by both the vector (VMX) extensions on the PPE and the SPE, as one of the means to improve game/media and scientific performance at improved power efficiency.
• A power and area efficient PPE that implements the Power Architecture ISA and supports the high design frequency.
• SPEs with local memory, asynchronous coherent DMA, and a large unified register file, in order to improve memory bandwidth and to provide a new level of combined power efficiency and performance [6][7].
• A high-bandwidth on-chip coherent fabric and high-bandwidth memory, to deliver performance on memory bandwidth intensive applications and to allow for high-bandwidth on-chip interactions between the processor elements.
• High-bandwidth flexible I/O, configurable to support a number of system organizations, including a single-BE configuration with dual I/O interfaces and a glue-less coherent dual-processor configuration.
• A fully custom, modular implementation to maximize performance per watt and performance per square millimeter of silicon and to facilitate the design of derivative products.

These features enable the CBE to excel at many rendering tasks, including ray-casting. The key to performance on the CBE is leveraging the SPEs efficiently by providing multiple independent data parallel tasks that optimize well in a SIMD processing environment.

3. Ray-casting Overview
The TRE uses a highly optimized height field ray-caster to render images. Ray-casting, like ray-tracing, follows mathematical rays of light from the viewer's virtual eye, through the view plane, and into the virtual world. The first location where the ray is found to intersect the terrain is considered to be the visible surface. Unlike ray-tracing, which generates additional secondary rays from the first point of intersection to evaluate reflected and refracted light, ray-casting evaluates lighting with a more conventional approach: at this first point of intersection a normal to the surface is computed and a surface shader is evaluated to determine the illumination of the sample. Ray-casting has the performance advantage, unlike polygon rendering, of only evaluating the visible samples. The challenge of producing a high performance ray-caster is finding the first intersection as efficiently as possible. While ray-casting is inherently a data-parallel computational problem, the entire scene must be considered during each ray intersection test.

The TRE is optimized to render height fields, which are 2D arrays of height values that represent the surface to be viewed. A technique called "vertical ray coherence" [1] was used to optimize the ray-casting of such surfaces. This optimization uses "vertical cuts", which are half-planes perpendicular to the plane of the height map, all radiating out from the vertical line containing the eye point. Each such vertical cut slices the view plane in a not-necessarily-vertical line. The big advantage is that all the information necessary to compute all of the samples in that vertical cut is contained in a close neighborhood of the intersection between the vertical cut and the height map. Furthermore, each sample in the vertical cut, when computed from bottom to top, only needs to consider the height data found in the buffer starting at the last known intersection. These two optimizations:
• Greatly reduce the amount of height data that needs to be considered when processing each vertical cut of samples.
• Continue to reduce the data that needs to be considered as the vertical cut is processed.
A scalar sketch of this bottom-to-top marching order follows.
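The sketch below is a minimal illustration of vertical ray coherence, not the TRE's actual kernel: the Cut structure, its field names, and the slope-based ray test are assumptions made for clarity.

#include <stddef.h>

/* One vertical cut: heights[i] is the terrain height at the i-th step
 * away from the eye along the cut, dists[i] its horizontal distance. */
typedef struct {
    const float *heights;
    const float *dists;
    size_t       count;
} Cut;

/* March one vertical cut of rays from the bottom of the screen to the
 * top.  A higher ray can only hit terrain at or beyond the previous
 * ray's hit, so the search resumes at last_hit instead of rescanning
 * the whole cut.  hit_out[r] == cut->count means the ray hit the sky. */
static void march_cut(const Cut *cut, const float *ray_slopes,
                      size_t nrays, float eye_h, size_t *hit_out)
{
    size_t last_hit = 0;
    for (size_t r = 0; r < nrays; r++) {              /* bottom to top */
        size_t i = last_hit;
        while (i < cut->count &&
               eye_h + ray_slopes[r] * cut->dists[i] > cut->heights[i])
            i++;                                      /* still above terrain */
        hit_out[r] = i;                               /* first intersection */
        last_hit = i;                                 /* next ray starts here */
    }
}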

4. TRE Application
The TRE was implemented using a client/server model. The client, implemented on an Apple G5 system, is connected to the server via gigabit Ethernet. The client specifies the terrain, viewer position and view vector, and rendering parameters to the TRE server, which in turn computes the 3D images and streams the compressed images back to the client for display. We have designed the interface between the client and the CBE server to be flexible enough that we can replace the Apple client with any light-weight device that has a display and a framebuffer (e.g. a wireless PDA or PlayStation Portable).

4.1 TRE Server
The TRE server, implemented on the STIDC CBE bring-up boards, runs in both uni-processor (UP) and symmetric multi-processor (SMP) modes. The server scales across as many SPEs as are available in the system. When the server starts rendering a frame, one SPE is reserved for image compression while the other available SPEs execute ray-casting kernels. Once the frame compression is complete, the reserved SPE is repurposed for ray-casting. Three threads execute on the available Multi-Threading (MT) slots of the PPE. The three threads' responsibilities break down as follows:
1. Frame preparation and SPE work communication.
2. Network tasks involving frame delivery.
3. Network tasks involving client communication.

Thread one requires the most processor resources, followed by threads two and three, so threads two and three share the same processor thread affinity in the UP model. The server implements a three-frame-deep pipeline (Figure 4.1.2) to exploit all the parallelism in the CBE.

Figure 4.1.2 Server Image Pipeline (the PPE prepares frame n+2 while SPEs 0-6 render frame n+1 and SPE 7 encodes frame n)

In stage one, the image preparation phase, PPE thread one decomposes the view screen into work blocks based on the vertical cuts dictated by the view parameters. Stage two, the sample generation phase, is where the SPEs decompose the vertical cuts into rays, cast the rays, shade the samples, and store the samples into the accumulation buffer. In stage three, the image compression/network phase, the compression SPE encodes the finished accumulation buffer and PPE thread two delivers it to the network layer. All three stages are simultaneously active on different execution cores of the chip, maximizing image throughput.
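The sketch below shows the frame offsets this pipeline implies. The three function names are illustrative stand-ins for the stages described above; in the real server the stages run concurrently on different cores rather than sequentially as written here.

void prep_frame(int f);          /* stage 1: PPE thread one        */
void spes_render(int f);         /* stage 2: SPE ray kernels       */
void encode_and_send(int f);     /* stage 3: encoder SPE + network */

/* Frame f is prepared while frame f-1 renders and frame f-2 is encoded
 * and delivered, giving the three-frame-deep pipeline of Figure 4.1.2. */
void run_pipeline(int nframes)
{
    for (int f = 0; f < nframes + 2; f++) {
        if (f < nframes)               prep_frame(f);
        if (f >= 1 && f - 1 < nframes) spes_render(f - 1);
        if (f >= 2)                    encode_and_send(f - 2);
    }
}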

4.2 TRE Client
The TRE client directs the rendering of the TRE server. We picked the Apple G5 as the platform on which to develop our client for aesthetic reasons only. The client is very light-weight and could easily be deployed on any device that has a framebuffer and enough computational power to receive and decode our images. The current TRE client provides the following facilities:
• Input data loading and delivery to the server
• Smooth path generation, both user-directed and random
• Joystick-directed flight simulation
• Rendering parameter modification
  o Sun angle
  o Lighting parameters
  o Fog and haze control
  o Bump mapping control
  o Output image size
  o Number of SPEs to render
• Server connection and selection
• Map cropping
• Streaming image decompression and display

5. Ray-casting on CBE
When implementing ray-casting of height fields, we use vertical ray coherence to break the rendering task down into data parallel work blocks, or vertical cuts of screen space samples, on the PPE, dispatching the blocks to ray kernels running on the SPEs. It is key that the SPEs work as independently as possible, so as to not over-task the PPE given the 8-to-1 SPE-to-PPE ratio, and we don't want the SPEs' streaming data flowing through the PPE's cache hierarchy. These work blocks are therefore mathematical descriptions of the vertical cuts based on the current view parameters. In Figure 5.1, where the green plane intersects the gray height map are the locations of the input height data needed to process the vertical cut of samples; where the green plane intersects the blue view screen are the locations of the accumulation buffer to be modified.

Figure 5.1

The PPE communicates the range of work blocks each SPE is responsible for via base address, count, and stride information once per frame. Each SPE then uses this information to compute the address of each work block, which is read into local store via a DMA read and processed. This decouples the PPE from the vertical cut processing, allowing it to prep the next frame in parallel. The SPEs place the output samples in an accumulation buffer using a gather DMA read, modify, and a scatter DMA write. Each SPE is responsible for four regions of the screen. Vertical cuts are processed in a round-robin fashion, one vertical cut per region, and left to right within each region. No synchronization is therefore needed on the accumulation buffer read/modify/write, as no two SPEs will ever attempt to modify the same locations in the accumulation buffer, even with two vertical cuts in flight (double buffering) per SPE. The SPEs are free to run at their own pace, processing each vertical cut without any data synchronization on the input or output. None of the SPEs' input or output data is touched by the PPE, thereby protecting the PPE's cache hierarchy from transient streaming data.
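A sketch of the SPE side of this addressing scheme follows. Only the base/count/stride contract comes from the description above; the descriptor size, field names, and tag usage are assumptions.

#include <spu_mfcio.h>

#define WB_SIZE 128   /* bytes per work-block descriptor (illustrative) */

typedef struct {
    uint64_t base;    /* effective address of this SPE's first block */
    uint32_t count;   /* number of work blocks this frame            */
    uint32_t stride;  /* bytes between consecutive blocks            */
} WorkRange;

static char wb_ls[WB_SIZE] __attribute__((aligned(128)));

void process_range(const WorkRange *wr)
{
    for (uint32_t i = 0; i < wr->count; i++) {
        uint64_t ea = wr->base + (uint64_t)i * wr->stride;
        mfc_get(wb_ls, ea, WB_SIZE, 0, 0, 0);  /* pull the block into local store */
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();             /* wait for the transfer */
        /* ...decompose this vertical cut into rays and render it... */
    }
}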

5.1 Ray Kernel
The SPE ray kernel is responsible for decomposing each vertical cut into rays, computing the ray/terrain intersections, evaluating the surface shader at each intersection, and updating the accumulation buffer with the new samples.

This process begins with a command parser that looks for work in the mailbox and dispatches the proper code fragment. The implemented commands include:
• Update Rendering Context
• Render Vertical Cuts
• Repurpose to Encoder Kernel

The "Update Rendering Context" command signals the SPE to DMA in a new block of rendering state, which includes such information as shader parameters for the lighting model and atmospheric effects. This command can be issued on a per frame basis, as the "Render Vertical Cuts" command describes a frame's worth of vertical cuts to each SPE. The SPE starts the rendering process by issuing a DMA to bring in the first work block describing a vertical cut. DMA lists are then computed to load the required height/color and accumulation data. Once the data is present in local store, the rendering loop is executed, which decomposes each vertical cut into rays, finds the ray/terrain intersections, evaluates the shader at each intersection, and updates the accumulation buffer with new samples. The two blocks of code that account for the majority of ray kernel cycles are the intersection test and the shader. The intersection test is broken into two phases: the first searches for the initial ray's intersection, and the second more carefully finds each intersection for rays in the vertical cut thereafter.

Figure 5.1.1 Ray Kernel Sample Generation (Ray Intersection → Shader: Texture Filtering, Normal Generation, Bump Mapping, Lighting, Atmosphere → Sample Accumulation → Multi-Sample Adjustment)
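A minimal sketch of such a mailbox-driven dispatcher is shown below, using the standard SPE mailbox intrinsic; the opcode values and the two handler functions are hypothetical.

#include <spu_mfcio.h>

enum { CMD_UPDATE_CONTEXT = 1, CMD_RENDER_CUTS = 2, CMD_REPURPOSE = 3 };

void update_rendering_context(void);  /* DMA in new rendering state       */
void render_vertical_cuts(void);      /* render one frame's worth of cuts */

void command_loop(void)
{
    for (;;) {
        uint32_t cmd = spu_read_in_mbox();  /* blocks until the PPE writes */
        switch (cmd) {
        case CMD_UPDATE_CONTEXT:
            update_rendering_context();
            break;
        case CMD_RENDER_CUTS:
            render_vertical_cuts();
            break;
        case CMD_REPURPOSE:
            /* Return to the loader so the encoder kernel can be brought
             * in; subsequent mailbox commands go to the new kernel
             * (section 7). */
            return;
        }
    }
}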

The shader provides both terrain illumination and atmospheric effects. The current shader contains the following functionality:
• Texture filtering via a two by two color neighborhood (sketched below)
• Surface normal computation via a four by four height neighborhood
• Bump mapping via a normal perturbation function [3]
• Surface illumination via a diffuse reflection and ambient lighting model
• Visible sun plus halo effects
• Non-linear atmospheric haze
• Non-linear ground fog
• Resolution independent clouds computed via multiple octaves of Perlin noise evaluated on the fly [4]
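As a concrete illustration of the first item, the scalar sketch below shows one way to filter a two by two color neighborhood; the TRE evaluates this in SIMD across four rays, and the structure and names here are illustrative only.

typedef struct { float r, g, b; } Color;

/* Bilinear filter of the 2x2 texel neighborhood around a hit point.
 * c00..c11 are the four neighboring texels and (fx, fy) the fractional
 * position of the hit inside that cell, each in [0,1). */
static Color bilerp(Color c00, Color c10, Color c01, Color c11,
                    float fx, float fy)
{
    Color top = { c00.r + fx * (c10.r - c00.r),
                  c00.g + fx * (c10.g - c00.g),
                  c00.b + fx * (c10.b - c00.b) };
    Color bot = { c01.r + fx * (c11.r - c01.r),
                  c01.g + fx * (c11.g - c01.g),
                  c01.b + fx * (c11.b - c01.b) };
    Color out = { top.r + fy * (bot.r - top.r),
                  top.g + fy * (bot.g - top.g),
                  top.b + fy * (bot.b - top.b) };
    return out;
}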

Dynamic multi-sampling is implemented by adjusting the distance between rays along the vertical axis (vertical cut) based on the under- or over-sampling of the prior four rays [2]. The sampling rate along the horizontal axis is a function of the size of the image, and is computed to ensure that no pixel is skipped over, even after accounting for round-off error.

The compute and data fetch phases of the ray kernel execute in parallel by exploiting the asynchronous execution of the SPE's DMA engine (SMF) and SIMD core (SPU). Multiple input and output buffers are used to decouple the two phases. The ray kernel processes the data using double-buffered inputs and outputs to hide the memory latency and data load time. Height/color and accumulation data are brought in by the SMF for vertical cut n+1 while samples are computed by the SPU for vertical cut n. As a result, the SPU spends less than 1% of its execution time waiting on DMA transfers.

Texture color data is interleaved with height data in system memory: 16 bits of color (5/6/5) and 16 bits of height. The surface shader requires a four by four height neighborhood to compute the surface normal and a two by two neighborhood for texture filtering. The height/color DMA list gathers blocks of four height/color pairs along the intersection of the vertical cut plane and the height/color plane [Figure 5.1.2].
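The double-buffering pattern can be sketched as follows, assuming hypothetical fetch_cut() and compute_cut() helpers that issue the gather DMA list for a cut and run the ray/shade loop on it; buffer sizes and tag assignments are illustrative.

#include <spu_mfcio.h>

#define CUT_BYTES 16384   /* illustrative input buffer size */

void fetch_cut(int cut, void *ls_buf, int tag);   /* issue gather DMA list */
void compute_cut(int cut, const void *ls_buf);    /* ray-cast and shade    */

static char hc_buf[2][CUT_BYTES] __attribute__((aligned(128)));

/* While the SPU computes on buffer `cur`, the SMF fills the other
 * buffer for the next vertical cut; tag groups 0 and 1 track the two
 * transfers independently. */
void render_cuts(int ncuts)
{
    int cur = 0;
    fetch_cut(0, hc_buf[0], 0);                         /* prime the pipe */
    for (int n = 0; n < ncuts; n++) {
        if (n + 1 < ncuts)
            fetch_cut(n + 1, hc_buf[cur ^ 1], cur ^ 1); /* prefetch next  */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();                      /* wait only for cut n */
        compute_cut(n, hc_buf[cur]);
        cur ^= 1;
    }
}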

Figure 5.1.2 Height Color Data Memory Layout

These 128 bit (quad word) quantities are word aligned in system memory and therefore can't be directly fetched by the SPE's quad word optimized DMA engine. To handle this situation, the height/color DMA list is constructed to fetch 256 bits of data starting at the naturally aligned quad word address of the desired word aligned 128 bit quantity. Once the data has been gathered into local store, pairs are run through the shuffle byte engine to reconstruct the desired 128 bit quantity for the shader [Figure 5.1.3].

Figure 5.1.3 SPE Unaligned Quad Word Load
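The shuffle step can be sketched with the SPE shuffle-bytes intrinsic as shown below; the pattern tables select bytes 4*woff through 4*woff+15 of the 32-byte (lo, hi) pair. This is an illustration of the technique, not the TRE's actual code.

#include <spu_intrinsics.h>

/* Rebuild a word-aligned (but not quadword-aligned) 128-bit
 * height/color group from the two naturally aligned quadwords fetched
 * to cover it.  woff is the word offset (0-3) of the group within the
 * first quadword. */
static vec_uchar16 load_unaligned_qw(vec_uchar16 lo, vec_uchar16 hi,
                                     unsigned int woff)
{
    static const vec_uchar16 pats[4] = {
        {  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
        {  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 },
        {  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23 },
        { 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27 },
    };
    return spu_shuffle(lo, hi, pats[woff]);   /* one shufb instruction */
}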

Samples are computed in a SIMD fashion by searching for four intersections at a time [Figure 5.1.4]; when all four are located, they are evaluated in parallel by the surface shader. Rays are packed four at a time into the single precision floating point channels of each vector register. All ancillary information needed for the shader is packaged in the same format using additional vector registers. When an intersection is found for a ray, the channel containing the ray is locked, for all registers participating in the rendering operations, using select instructions.

Figure 5.1.4 SIMD Ray-casting
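The select-based channel locking can be sketched as follows. The intersection test is simplified to a height comparison and sample_terrain() is a hypothetical lookup; the real kernel also bounds the march so rays that miss the terrain terminate.

#include <spu_intrinsics.h>

vec_float4 sample_terrain(void);   /* hypothetical terrain height lookup */

/* Four rays march together in the channels of one vector register;
 * once a channel finds its intersection it is frozen with spu_sel()
 * while the remaining channels continue. */
static vec_float4 find_four_hits(vec_float4 ray_h, vec_float4 step)
{
    vec_float4 hit_h = spu_splats(0.0f);
    vec_uint4  found = spu_splats(0u);              /* per-channel lock */

    /* Loop while any channel is still searching. */
    while (spu_extract(spu_gather(spu_cmpeq(found, spu_splats(0u))), 0)) {
        vec_float4 terrain = sample_terrain();
        /* Channels where the ray has reached the terrain, minus those
         * already locked. */
        vec_uint4 newly = spu_andc(spu_cmpgt(terrain, ray_h), found);
        hit_h = spu_sel(hit_h, ray_h, newly);       /* latch new hits   */
        found = spu_or(found, newly);
        /* Advance only the channels that are still searching. */
        ray_h = spu_sel(spu_add(ray_h, step), ray_h, found);
    }
    return hit_h;   /* four intersections, ready for the SIMD shader */
}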

The resulting IBM XLC-compiled C code occupies 29KB of local store and uses 204KB of buffer space plus 4KB of stack [Figure 5.1.5]. Height/color (HC) DMA lists are implemented in a destructive fashion: the list is constructed in the return data buffer and is overwritten, or destroyed, as the DMA list executes and data is returned to the buffer. The accumulation buffer operations are read/modify/write, so the DMA lists used to gather/read the data are preserved and re-executed to scatter/write the modified data back to the accumulation buffer.

Figure 5.1.5 Ray Kernel Local Store Memory Allocation (Height Color Data buffers, 57 KB each; HC DMA Lists, 13 KB each; Accumulation Data buffers, 30 KB each; Accum DMA Lists, 15 KB each; Stack, 4 KB; Code, 29 KB)

6. Image Compression
In addition to the SPEs running the ray kernels, one SPE is repurposed during the frame time for image compression. The compression kernel operates on the finished accumulation buffer, which is organized as a column major 2D array of single precision floating point (red, green, blue, count) data, one float per channel (128 bits per pixel). The count channel contains the number of samples accumulated in the pixel. The image compression kernel is responsible for the following tasks:
• Normalization of the accumulation buffer
• Compression of the normalized buffer
• Clearing of the accumulation buffer

The compression kernel reads sixteen by sixteen pixel tiles from the accumulation buffer into local store using DMA lists. These tiles are then normalized by dividing the color channels by the sample count for each pixel. The tile is then compressed using a multi-stage process involving color conversion, discrete cosine transformation (DCT), quantization, and run-length encoding. The resulting compressed data is written to a holding buffer in local store which, when full, is DMA-written back to system memory for network delivery. As each tile is processed, a tile of zeros is returned to the accumulation buffer, providing the clear operation.
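The per-tile normalize-and-clear step can be sketched in scalar form as below; the RGBA-count pixel layout follows the description above, while the tile indexing and output format are illustrative.

#define TILE 16

typedef struct { float r, g, b, count; } AccumPixel;

/* Divide each color channel by its sample count, then zero the pixel
 * so the tile returns to the accumulation buffer already cleared. */
static void normalize_and_clear(AccumPixel tile[TILE][TILE],
                                float rgb_out[TILE][TILE][3])
{
    for (int y = 0; y < TILE; y++) {
        for (int x = 0; x < TILE; x++) {
            float n   = tile[y][x].count;
            float inv = (n > 0.0f) ? 1.0f / n : 0.0f;
            rgb_out[y][x][0] = tile[y][x].r * inv;
            rgb_out[y][x][1] = tile[y][x].g * inv;
            rgb_out[y][x][2] = tile[y][x].b * inv;
            tile[y][x] = (AccumPixel){ 0.0f, 0.0f, 0.0f, 0.0f };
        }
    }
    /* The normalized tile then goes through color conversion, DCT,
     * quantization, and run-length encoding before DMA back out. */
}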

7. SPE Repurposing
Repurposing is a way to transfer unused cycles from an idle SPE kernel type to an active kernel type during a given frame. Each TRE SPE kernel includes the capability to repurpose itself to the other kernel type. For example, if a "repurpose" command is issued to an SPE running the ray kernel, it will finish its current ray kernel work, load the encoder kernel code, and begin encoder execution. Any commands behind the "repurpose" command in the mailbox will be interpreted as encoder kernel commands. The repurpose SPE thread command is implemented as a list of programs passed to the thread upon creation. When a "repurpose" command is issued, the SPE thread returns execution to a special SPE crt0 that DMAs in the next program's code and executes the new main. This provides a very lightweight, application controlled SPE context switch that, in our case, can be issued multiple times per frame, giving us the flexibility to use any SPE as our encoder. Since the TRE server dynamically distributes ray-casting work across all the SPEs based on which SPE finishes first and which finishes last, the repurposed SPE gets less work when the encoding load is larger (high frequency image) and more work when the encoding load is smaller (low frequency image). Work rebalancing takes place after each frame, so the work given to the repurposed SPE is rarely out of balance.

8. TRE Performance
The TRE code base was initially implemented on Apple G5 systems using their SIMD VMX engines. This provided a clean porting path to the CBE, as both the SPE and the G5's VMX engine have similar instruction sets. The SIMD optimization effort on the G5 resulted in a 3.25x speedup over the scalar code base, showing that we were in fact computationally bound on the G5.

The CBE PPE is used only as a control processor in the TRE and therefore has little impact on the TRE's overall performance. The overall programming goal for the TRE PPE threads was to have them touch none of the SPEs' streaming data and provide little if any of the computational load. Only a basic tuning effort was therefore expended on this code. The SPE threads, on the other hand, were highly optimized for computational efficiency, data throughput, and code/data footprint. All of our performance gains are the direct result of the heavy lifting but simple SPE processors.

The SPE ray kernel was originally written for the G5 using only a subset of the VMX intrinsics that mapped directly to the SPE. This allowed us to verify the algorithms at full speed and then directly port them to the SPE with minimal effort. Performance analysis was then performed on the CBE implementation using our system simulator. Hot spots were identified and aggressively optimized using loop unrolling, branch removal via select operations, immediate field instruction encodings, code interleaving, and SPE instruction pipeline rebalancing. Functional blocks found not to be performance critical were squeezed in the opposite direction, for reduced code footprint. Since the SPE is a dual issue machine, the theoretical optimum cycles per instruction (CPI) is 0.5. The optimized ray kernel executes with a CPI range of 0.7 to 0.8 and occupies only 29KB of local store space.

The volume of scattered data gathered into SPE local stores created a problem for the SPE translation lookaside buffer (TLB). We found that with 4KB pages, DMAs were bound by TLB misses as a result of ray kernel DMA lists accessing very large numbers of pages. All streaming data was placed in 16MB pages to alleviate this problem.

Since we wrote an optimized VMX version of our server for the Apple G5 as part of our development process, we thought it would be interesting to compare the performance of the two on this rendering task. We didn't implement an encoder for the G5 renderer, since it ran as part of our client and therefore had direct access to the framebuffer. In our benchmark the Apple has a distinct advantage, given it only has to render and display the uncompressed frames; the CBE server, on the other hand, has to render, compress, and deliver each frame. Even with this handicap, Cell outperforms the local G5 implementation by a factor of 50x.

The server has many rendering parameters that affect performance, including:
• Output image size
• Map size
• Visibility to full fog/haze
• Multi-sampling rate

For benchmarking purposes the following values were selected:
• 1280x720 (720p) output image size
• 7455x8005 map size
• 2048 map steps to full haze
• 1.33 x (2 - 8 dynamic), or ~2-32 samples per pixel

With these settings the following rendered frame rates were measured:

System                                            Frames per second
2.0 GHz Apple G5, 1GB memory (no image encode)    0.6
3.2 GHz UP Cell, 512MB memory                     30
3.2 GHz 2-way SMP Cell, 1GB memory                58

9. Conclusion and Future Work
In this paper, through the TRE system, we have outlined how a compute intensive graphics technique, ray-casting, can be decomposed into data parallel work blocks and executed efficiently on multiple SIMD cores within one or multiple Cell Broadband Engines. We have also shown that by pipelining the work block preparation, execution, and delivery of results, additional parallelism within the CBE can be exploited. Order of magnitude performance improvements over similarly clocked single threaded processors can be achieved when computational problems are decomposed and executed in this manner on the CBE. Graphics techniques once considered offline tasks can now be executed at interactive speeds using a commodity processor.

The TRE framework and the CBE present many avenues for future work:
• Embed the computed depth information for each pixel in the returned framebuffer. By preloading this depth information into the z-buffer, we can combine the ray-cast scene with GPU rasterized objects.
• Level of detail rendering.
• Interactive modification of the terrain. This would enable us to simulate shock waves, add fluid simulation to the lakes, avalanches to the snow, and other simulations that require real-time changes in terrain geometry.

References
[1] Cheol-Hi Lee, "A Terrain Rendering Method Using Vertical Ray Coherence," The Journal of Visualization and Computer Animation.
[2] Cohen-Or, D., Rich, E., Lerner, U., and Shenkar, V., "A Real-time Photo-realistic Visual Flythrough," IEEE Transactions on Visualization and Computer Graphics, Vol. 2, No. 3, September 1996.
[3] Blinn, J., "Simulation of Wrinkled Surfaces," Computer Graphics (Proc. SIGGRAPH), Vol. 12, No. 3, August 1978, pp. 286-292.
[4] Perlin, Ken, "Making Noise," http://www.noisemachine.com/talk1/
[5] H. Peter Hofstee, "Power Efficient Processor Architecture and the Cell Processor," to appear, Proceedings of the 11th Conference on High Performance Computing Architectures, Feb. 2005.
[6] B. Flachs, S. Asano, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H.-J. Oh, S. M. Mueller, O. Takahashi, A. Hatakeyama, Y. Watanabe, and N. Yano, "The Microarchitecture of the Streaming Processor for a CELL Processor," IEEE International Solid-State Circuits Symposium, Feb. 2005.
[7] T. Asano, T. Nakazato, S. Dhong, A. Kawasumi, J. Silberman, O. Takahashi, M. White, and H. Yoshihara, "A 4.8GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor," IEEE International Solid-State Circuits Symposium, Feb. 2005.
[8] D. Pham, S. Asano, M. Bolliger, M. N. Day, H. P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa, "The Design and Implementation of a First-Generation CELL Processor," IEEE International Solid-State Circuits Symposium, Feb. 2005.
