
Larrabee: A Many-Core x86 Architecture for Visual Computing

Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Pradeep Dubey, Stephen Junkins, Adam Lake, Robert Cavin, Roger Espasa, Ed Grochowski, and Toni Juan, Intel Corporation; Michael Abrash, RAD Game Tools; Jeremy Sugerman and Pat Hanrahan, Stanford University

The Larrabee many-core visual computing architecture uses multiple in-order x86 cores augmented by wide vector processor units, together with some fixed-function logic. This increases the architecture's programmability as compared to standard GPUs. The article describes the Larrabee architecture, a software renderer optimized for it, and other highly parallel applications. The article analyzes performance through scalability studies based on real-world workloads.

Modern GPUs are increasingly programmable to support advanced graphics algorithms and other parallel applications. However, general-purpose programmability of the graphics pipeline is restricted by limitations on the memory model and by the fixed pipeline that the GPUs implement. For example, work on the programmable pipeline stages is scheduled by fixed-function logic blocks interspersed in the pipeline. The architecture code-named Larrabee is based on an array of processor cores that run an extended version of the x86 instruction set. Each core contains a 16-wide vector processor for high throughput. Each processor core accesses its own subset of a coherent L2 cache, providing high-bandwidth local access and simplifying data sharing and synchronization. Larrabee also includes some fixed-function blocks for specialized operations such as texture filtering, although it performs rasterization, depth testing, and alpha blending entirely in software.

Larrabee supports a wide variety of general parallel applications that exploit standard software methods such as subroutines, virtual memory, and irregular data structures. Larrabee implements a binned renderer to increase parallelism and reduce memory bandwidth. It avoids the problems of some previous tile-based architectures, as demonstrated by performance analysis on existing game applications. Larrabee also provides a programming model that supports more general parallel applications, such as image processing, physical simulation, and medical and financial analytics, as our scalability and performance analysis demonstrates.

Editor's Note: © 2008 ACM, Inc. This is a minor revision of the work published in L. Seiler et al., ACM Transactions on Graphics, 27:3, http://doi.acm.org/10.1145/1399504.1360617.


Hardware architecture

Figure 1 shows a block diagram of the basic Larrabee architecture. We designed Larrabee around multiple instantiations of an in-order processor core augmented with a wide vector processor unit (VPU). Cores communicate through a high-bandwidth interconnect network with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic, depending on the exact application.

Larrabee's architecture was motivated by a study comparing peak performance for a modern CPU—the Intel Core 2 Duo processor—to a test design similar to (but not the same as) the ultimate Larrabee architecture. Modern CPUs are designed to perform well on a variety of application workloads by providing features such as out-of-order and speculative instruction execution, and they currently provide 4-wide vector units to match natural vector sizes. Highly parallel workloads such as graphics are easier to vectorize and let the compiler schedule instructions more easily than typical workloads. We concluded that a processor design optimized for these highly parallel workloads could achieve significantly higher throughput by using a 16-wide vector processor and smaller processor cores.

Figure 1. Schematic of the Larrabee many-core architecture: multiple in-order processor cores, each with its own coherent L2 cache subset, connected by an interprocessor ring network to fixed-function logic and to memory and I/O interfaces. The number of processor cores and the number and type of fixed-function and I/O blocks are implementation dependent, as are the positions of the core and fixed-function blocks on the chip.


Scalar unit and caches

Figure 2 shows a schematic of a single Larrabee processor core, plus its connection to the on-die interconnect network and the core's local subset of the second-level (L2) cache. To simplify the design, the scalar and vector units use separate register sets. Data transferred between them is written to memory and then read back from the L1 cache. We derived Larrabee's scalar pipeline from the dual-issue Pentium processor, which uses a short, inexpensive execution pipeline.

Figure 2. Larrabee processor core and associated system blocks: instruction decode feeding separate scalar and vector units with separate register sets, L1 instruction and data caches, the local subset of the L2 cache, and the connection to the ring network. Each core has fast access to its 256-Kbyte local subset of a coherent L2 cache. L1 cache sizes are 32 Kbytes each for instruction cache (I-cache) and data cache (D-cache). Ring network accesses pass through the L2 cache for coherency.


Figure 3. Vector unit block diagram: mask registers, a 16-wide vector arithmetic logic unit, vector registers, swizzle and replicate paths, numeric conversion units, and the L1 data cache. The VPU supports three-operand instructions that allow swizzling/replicating the vector lanes, and numeric conversion of memory data. Mask registers allow predicating the resulting vector writes.

Larrabee provides modern additions such as four-way multithreading and 64-bit extensions. The cores support the full Pentium processor x86 instruction set, so they can run existing code, including operating system kernels and applications. In our analysis, it's relatively easy for compilers to schedule dual-issue instructions.

Larrabee's L1 data cache allows low-latency accesses to cache memory from the scalar and vector units, which lets programmers treat it somewhat like a 32-Kbyte extended register file. Cache control instructions include prefetching into L2, prefetching from L2 to L1, and evicting from L1 after access, which allows efficient L1 cache use for streaming data. Larrabee also has a 32-Kbyte L1 instruction cache. Cache use is more effective when multiple threads running on the same core use the same data set—for example, rendering triangles to the same tile.
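As a concrete illustration of this streaming pattern, the following sketch uses the standard x86 _mm_prefetch hint as a stand-in for Larrabee's own prefetch and evict-after-access instructions (which we don't show); the prefetch distance kAhead is an illustrative tuning parameter, not a documented value:

    #include <xmmintrin.h>   // _mm_prefetch
    #include <cstddef>

    // Stream through a large array once: prefetch well ahead with the
    // nontemporal hint so consumed lines don't pollute the caches.
    float sum_stream(const float* data, size_t n) {
        constexpr size_t kAhead = 64;   // illustrative prefetch distance
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (i + kAhead < n)
                _mm_prefetch((const char*)(data + i + kAhead), _MM_HINT_NTA);
            sum += data[i];
        }
        return sum;
    }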

Larrabee’s global L2 cache is divided into a separate local subset per core. Each core has a fast direct access path to its own L2 cache, so all cores can access their own L2 cache in parallel for private or shared read-only data. Data written by a core is stored in its own L2 cache subset and is flushed from other subsets, if necessary. The 1,024-bit wide ring network allows fast accesses between cores, L2 caches, and memory. We specify 256 Kbytes for each L2 cache subset. This supports large tile sizes for software rendering.

Vector processing unit

Larrabee gains its computational density from the 16-wide VPU, which executes integer, single-precision float, and double-precision float instructions. The VPU and its registers are approximately one-third the area of the processor core but provide most of the integer and floating-point performance. Figure 3 shows a block diagram of the VPU with its connection to the shared L1 data cache. We chose a 16-wide VPU as a trade-off between increased computational density and the difficulty of obtaining high utilization for wider VPUs. Early analysis suggested 88 percent utilization for typical pixel shader workloads if 16 lanes process 16 separate pixels one component at a time—that is, with separate instructions to process red, green, and so on, for 16 pixels at a time, instead of processing multiple color channels at once. The Nvidia GeForce 8 operates similarly, organizing its scalar single-instruction, multiple-data (SIMD) processors in groups of 32 that execute the same instruction.1 The main difference is that in Larrabee the loop control, cache management, and other such operations are code that runs in parallel with the VPU, instead of being implemented as fixed-function logic.

Larrabee VPU instructions allow three source operands, one of which can come directly from the L1 cache. The VPU can read 8-bit unorm, 8-bit uint, 16-bit sint, and 16-bit float data from the cache and convert it to 32-bit floats or 32-bit integers with no loss of performance. The VPU also allows swizzling or replicating the source data in various ways. These two features greatly improve cache storage efficiency and reduce the number of instructions required to implement typical graphics algorithms.
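For illustration, here is one such load-with-convert written with AVX-512 intrinsics, a later Intel ISA whose 16-wide masked vectors resemble Larrabee's VPU (Larrabee's own instructions aren't shown here, and on Larrabee the conversion happens inside a single vector load): sixteen 8-bit unorm values become sixteen 32-bit floats in [0, 1].

    #include <immintrin.h>

    // Load 16 x 8-bit unorm and widen to 16 x float in [0, 1].
    __m512 load_unorm8_as_float(const unsigned char* src) {
        __m128i bytes = _mm_loadu_si128((const __m128i*)src);   // 16 bytes
        __m512i ints  = _mm512_cvtepu8_epi32(bytes);            // widen to 32-bit
        __m512  f     = _mm512_cvtepi32_ps(ints);               // int -> float
        return _mm512_mul_ps(f, _mm512_set1_ps(1.0f / 255.0f)); // scale to [0, 1]
    }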


The VPU supports a wide variety of instructions on both integer and floating-point data types, including fused multiply-add, store instructions that convert the data type, and load instructions for data types other than those mentioned above. The VPU instruction set also includes gather and scatter support, so each VPU lane can specify its own memory load or store address. This lets the VPU run 16 shader instances in parallel, each of which appears to run serially, even when performing array accesses with computed indices. The resulting gather and scatter operations are faster when the 16 VPU lanes access a small number of cache lines.

A mask register can predicate Larrabee VPU instructions. This register has one bit per vector lane and controls which vector lanes are written. It's also essential for mapping multiple shader instances onto separate VPU lanes. For example, the compiler can map a scalar if-then-else control structure onto the VPU by using an instruction to set a mask register based on a comparison, and then executing both the if and else clauses with opposite polarities of the mask register controlling whether to write results. It can skip a clause entirely if the mask register is all zeros or all ones. This reduces branch misprediction penalties for small clauses and gives the compiler's instruction scheduler greater freedom.

Finally, the VPU also uses these masks for packed load and packed store instructions, which access enabled elements from sequential locations in memory. This lets the programmer bundle sparse strands of execution satisfying complex branch conditions into a format more efficient for vector computation.
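A sketch of this if-then-else mapping, again in AVX-512 terms as a stand-in for Larrabee's mask registers (the per-lane computation is arbitrary, chosen only to show both clauses executing under opposite mask polarities):

    #include <immintrin.h>

    // Per lane: x = (x > 0) ? x * 2 : x - 1, across 16 floats at once.
    __m512 if_then_else(__m512 x) {
        __mmask16 m = _mm512_cmp_ps_mask(x, _mm512_setzero_ps(), _CMP_GT_OQ);
        __m512 r = _mm512_mask_mul_ps(x, m, x, _mm512_set1_ps(2.0f));       // "then" lanes
        r = _mm512_mask_sub_ps(r, _mm512_knot(m), x, _mm512_set1_ps(1.0f)); // "else" lanes
        return r;
    }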

Fixed-function logic

Unlike typical GPUs, Larrabee uses software in place of fixed-function logic for post-shader alpha blending, rasterization, and interpolation. In this article, rasterization refers to finding the coverage of a primitive, and interpolation refers to finding the values of parameters at covered sample positions in the primitive.

This is beneficial because fixed-function logic typically requires first-in, first-out (FIFO) queues for load balancing, and it can be difficult to size these logic blocks and their FIFO queues properly to avoid both wasted area and performance bottlenecks. Performing these operations in software also lets Larrabee easily alter how they are performed and where they fit into the pipeline.

Larrabee includes fixed-function texture filter logic because this operation can't be performed efficiently in software on the cores. Our analysis shows that software texture filtering on our cores would take 12 to 40 times longer than our fixed-function logic, depending on whether decompression is required. Larrabee supports the usual texture operations present in modern GPUs, as well as virtual memory for textures.

Software renderer

The key issue for achieving high performance in any parallel-rendering algorithm is to divide the rendering task into many subtasks that can be load balanced and executed in parallel with very few synchronization points. We achieve this using a sort-middle rendering algorithm2 that takes advantage of Larrabee's flexible memory model and software-controlled scheduling. Other commercial and research architectures use sort-middle algorithms under names such as binning, tiling, or chunking.

Stages of software rendering

For simplicity, we describe rendering to a single set of render targets, such as a 32-bit pixel buffer and a 32-bit depth/stencil buffer. We call the rendering commands that modify them an RTset. We call a set of primitives to render, together with a shared current rendering state, a primitive set, or PrimSet. Seiler et al.3 discuss more complex cases involving reordering rendering commands to multiple RTsets and more complex sets of render targets.

Figure 4 shows the broad structure for rendering a single RTset's PrimSets. The surface being rendered is split into tiles of pixels that are rendered during back-end processing. Each tile has a bin that will be filled with the triangles from a PrimSet that intersect that tile during front-end processing.


Figure 4. Larrabee software renderer structure. In the front end, multiple sets of primitives (PrimSets) from the command stream are processed in parallel to fill per-tile bins, stored in off-chip memory; in the back end, the bins are processed in parallel to render screen tiles to the frame buffer.

The set of bins for the whole RTset is the bin set. "Tile" and "bin" are sometimes used interchangeably, but we use "tile" to refer to the actual pixel data, and "bin" to refer to the set of primitives that are rendered to a tile.

For this 64-bit RTset, we used a tile size of 128 × 128 for single sampling or 64 × 64 with four samples per pixel. (At 8 bytes per pixel, a 128 × 128 tile occupies 128 Kbytes.) This fills half of a core's L2 cache subset, so the entire tile can stay in cache while a core renders to it. This reduces memory bandwidth: if a pixel is rendered multiple times, it's accessed from the cache multiple times but is read from and written to memory just once. The bins of render commands are stored off-chip, which eliminates the large FIFO queues that typical GPUs require between front-end vertex processing and back-end pixel processing.

Each PrimSet and each tile is rendered on a separate core. As a result, cores don't need to interact with each other except when they complete a task and need to get the next tile or PrimSet for processing. This lets the Larrabee software renderer use large numbers of cores in parallel. The front end performs vertex and geometry shading, and computes coverage to determine which bins each primitive overlaps. The back end performs pixel shading, depth/stencil testing, and alpha blending. Within a PrimSet or a tile, commands are processed in order, unlike some other tile-based algorithms. This approach works consistently well across a broad spectrum of existing applications.
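The front-end binning step can be pictured with the following sketch (ours; the types are illustrative, and only the 128 × 128 tile size comes from the text). Each triangle is appended to the bin of every tile its bounding box overlaps; back-end cores later drain one bin per tile:

    #include <vector>
    #include <algorithm>
    #include <cmath>

    constexpr int kTile = 128;                    // 128 x 128 pixels per tile

    struct Triangle { float x[3], y[3]; };

    struct BinSet {
        int tilesX, tilesY;
        std::vector<std::vector<Triangle>> bins;  // one bin of primitives per tile
        BinSet(int w, int h)
            : tilesX((w + kTile - 1) / kTile), tilesY((h + kTile - 1) / kTile),
              bins((size_t)tilesX * tilesY) {}
    };

    // Front end: conservative binning by the triangle's bounding box.
    void binTriangle(BinSet& bs, const Triangle& t) {
        int x0 = (int)std::floor(*std::min_element(t.x, t.x + 3)) / kTile;
        int x1 = (int)std::ceil (*std::max_element(t.x, t.x + 3)) / kTile;
        int y0 = (int)std::floor(*std::min_element(t.y, t.y + 3)) / kTile;
        int y1 = (int)std::ceil (*std::max_element(t.y, t.y + 3)) / kTile;
        for (int ty = std::max(y0, 0); ty <= std::min(y1, bs.tilesY - 1); ++ty)
            for (int tx = std::max(x0, 0); tx <= std::min(x1, bs.tilesX - 1); ++tx)
                bs.bins[(size_t)ty * bs.tilesX + tx].push_back(t);
    }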

Software rasterization and interpolation

Unlike modern GPUs, Larrabee doesn't use dedicated logic for rasterization and parameter interpolation.

These terms are used in this article to refer to computing coverage and computing parameter values, respectively.

The justification for performing interpolation in software is relatively simple. In older graphics APIs, interpolation produced fixed-point numbers, much like the most common texture-filtering operations. In modern graphics APIs such as Microsoft's DirectX 10, the required result is a 32-bit float, so reusing the existing VPU for interpolation is efficient.

Rasterization is unquestionably more efficient in dedicated logic than in software when running at peak rates, but dedicated logic has drawbacks for Larrabee. In a modern GPU, the rasterizer is a fine-grained serialization point: all primitives are put back in order before rasterization. Scaling the renderer over large numbers of cores requires eliminating all but the most coarse-grained serialization points. Also, a separate fixed-function rasterizer would require large FIFO queues to manage resource contention. Implementing the rasterizer in software lets us parallelize it over many cores or move it to different places in the rendering pipeline. For example, we can perform rasterization entirely in the front end, or perform coarse rasterization (to the tile level) in the front end and detailed rasterization in the back end. We can optimize the rasterization code for a particular workload or support alternative rasterization equations for special purposes.

Our algorithm is a highly optimized version of Greene's recursive descent algorithm.4 About 70 percent of the rasterizer instructions run on the VPU and exploit Larrabee's computational density. About 10 percent of the algorithm's efficiency comes from a few new bit-manipulation instructions.
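The structure of such a recursive-descent rasterizer looks roughly like the following (our scalar simplification for exposition; the real code is heavily vectorized). A block is trivially rejected if any edge excludes it, trivially accepted if all edges include it, and otherwise subdivided:

    #include <cstdio>

    struct Edge { float a, b, c; };    // inside where a*x + b*y + c >= 0

    static float eval(const Edge& e, float x, float y) { return e.a*x + e.b*y + e.c; }

    // Classify a size x size block at corner (x, y) against one edge:
    // +1 fully inside, -1 fully outside, 0 straddling.
    static int classify(const Edge& e, float x, float y, float size) {
        float maxX = e.a >= 0 ? x + size : x, maxY = e.b >= 0 ? y + size : y;
        float minX = e.a >= 0 ? x : x + size, minY = e.b >= 0 ? y : y + size;
        if (eval(e, minX, minY) >= 0) return +1;    // worst-case corner is inside
        if (eval(e, maxX, maxY) <  0) return -1;    // best-case corner is outside
        return 0;
    }

    static void emitBlock(float x, float y, float size) {
        printf("covered: (%g, %g) size %g\n", x, y, size);
    }

    void rasterize(const Edge tri[3], float x, float y, float size) {
        int c0 = classify(tri[0], x, y, size);
        int c1 = classify(tri[1], x, y, size);
        int c2 = classify(tri[2], x, y, size);
        if (c0 < 0 || c1 < 0 || c2 < 0) return;                            // trivial reject
        if (c0 > 0 && c1 > 0 && c2 > 0) { emitBlock(x, y, size); return; } // trivial accept
        if (size <= 1.0f) {                         // pixel level: test the center
            if (eval(tri[0], x + 0.5f, y + 0.5f) >= 0 &&
                eval(tri[1], x + 0.5f, y + 0.5f) >= 0 &&
                eval(tri[2], x + 0.5f, y + 0.5f) >= 0)
                emitBlock(x, y, 1.0f);
            return;
        }
        float h = size * 0.5f;                      // straddling: descend into quadrants
        rasterize(tri, x,     y,     h);
        rasterize(tri, x + h, y,     h);
        rasterize(tri, x,     y + h, h);
        rasterize(tri, x + h, y + h, h);
    }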

Back-end pixel processing

After the front-end processing for an RTset has finished filling the bins with triangle data, the renderer puts the RTset into an active list. Each core doing back-end work takes the next available tile from the list and renders the triangles in the corresponding bin.


The back-end code starts by prefetching the render target pixels into the L2 cache, where they remain until they're written back to memory after tile rendering is complete. Several optimizations can save substantial memory bandwidth. We can eliminate the tile read if the first command clears the entire tile. We can eliminate the write for depth data that isn't required after rendering. Finally, multisampled colors can often be resolved after the tile is rendered, so that we write only one color value to memory per pixel.

Figure 5 shows a back-end implementation that makes effective use of multiple threads executing on a single core. A setup thread reads primitives for the tile and interpolates per-vertex parameters to find their values at each sample. It then issues pixels to the work threads in groups of 16 that we call a qquad. The setup thread uses scoreboarding to ensure that qquads aren't passed to the work threads until any overlapping pixels have completed processing. It can also perform hierarchical Z tests on qquads. The three work threads perform all remaining pixel processing, including pre-shader early Z tests, the pixel shader, regular late Z tests, and alpha blending. Modern GPUs use dedicated logic for depth tests and alpha blending, but Larrabee uses its VPU, which lets it load balance over the widely varying performance these operations require in different shaders.

One remaining issue is texture coprocessor accesses, which can require hundreds of clocks of latency. Larrabee hides this latency by computing multiple qquads on each hardware thread. Each qquad's shader is called a fiber. The different fibers on a thread cooperatively switch between themselves after each texture read command, without any operating system intervention. Fibers execute in a circular queue. The compiler chooses the number of fibers so that by the time control flows back to a fiber, its texture access has had time to execute and the results are ready for processing.

Figure 5. Back-end rendering sequence for a tile. One setup thread reads the sub-bins for the tile, interpolates, and scoreboards, assigning primitives to one of three work threads that do early Z depth tests, pixel shader processing, late Z depth tests, and alpha blending against the tile held in the L2 cache.

Table 1. Workload summary for the three tested games.

Game                     Size                          Tested frames
Half Life 2 ep. 2        1,600 × 1,200, 4 samples      25 frames (1 in 30)
F.E.A.R.                 1,600 × 1,200, 4 samples      25 frames (1 in 100)
Gears of War             1,600 × 1,200, 1 sample       25 frames (1 in 250)

Fixed-function logic in modern GPUs uses similar methods, but in Larrabee the number of fibers and the fiber switching are under software control.
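The shape of that cooperative round-robin can be sketched as follows (our toy model; in the real design the compiler generates the switching code and picks the fiber count). Each fiber yields after issuing its texture read, and by the time the circular queue wraps back around, the result is assumed to have arrived:

    #include <cstdio>

    constexpr int kFibers = 4;          // illustrative; chosen by the compiler in practice

    struct Fiber {
        int qquad;                      // which 16-pixel group this fiber shades
        int pc = 0;                     // resume point: 0 = pre-texture, 1 = post-texture, 2 = done
    };

    void run_qquads(int firstQquad) {
        Fiber fibers[kFibers];
        for (int i = 0; i < kFibers; ++i) fibers[i].qquad = firstQquad + i;
        int live = kFibers;
        for (int i = 0; live > 0; i = (i + 1) % kFibers) {  // circular queue
            Fiber& f = fibers[i];
            if (f.pc == 0) {
                printf("fiber %d: issue texture read for qquad %d, yield\n", i, f.qquad);
                f.pc = 1;               // texture unit works while other fibers run
            } else if (f.pc == 1) {
                printf("fiber %d: texture ready, finish shading qquad %d\n", i, f.qquad);
                f.pc = 2;
                --live;
            }
        }
    }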

Renderer performance studies

We ran performance and scalability studies for the Larrabee software renderer on workloads derived from three well-known games: Epic Games' Gears of War, Monolith Productions' F.E.A.R., and Valve's Half Life 2 episode 2. Table 1 contains information about the tested frames from each game. Because we're scaling out to large numbers of cores, we use a high-end screen size with multisampling when supported. The frames are widely separated to catch different scene characteristics as the games progress.

These studies use a functional model to measure workload performance in terms of Larrabee units. A Larrabee unit is one Larrabee core running at 1 GHz. We chose the clock rate solely for ease of calculation, because real devices would ship with multiple cores and a variety of clock rates. Using Larrabee units lets us compare performance of Larrabee implementations with different numbers of cores running at different clock rates. A single Larrabee unit corresponds to a theoretical peak throughput of 32 Gflops, counting fused multiply-add as two operations (16 vector lanes × 2 operations × 1 GHz).


Figure 6. Relative scaling as a function of core count (Larrabee units, 1-GHz cores) for F.E.A.R., Half Life 2 episode 2, and Gears of War. The graph shows configurations with 8 to 48 cores, with each game's results plotted relative to the performance of an 8-core system.

Scalability studies

Figure 6 shows the results of our load-balancing tests, accounting for dependencies and scheduling delays. The figure shows six configurations, each of which scales the memory bandwidth and texture-filtering speed relative to the number of cores. The results of the load-balancing simulation show a falloff of 7 to 10 percent from a linear speedup at 48 cores. For these tests, we subdivide PrimSets containing more than 1,000 primitives. Additional tests show that F.E.A.R. falls off by only 2 percent if PrimSets are subdivided into groups of 200 primitives, so code tuning should improve the linearity.

Figure 7 shows the number of Larrabee units required to render sample frames from the three games at 60 frames per second. We simulated these results on a single core with the assumption that performance scales linearly. For Half Life 2 episode 2, roughly 10 Larrabee units are sufficient to ensure that all frames run at 60 fps or faster. For F.E.A.R. and Gears of War, roughly 25 Larrabee units suffice.

Software locks, which we didn't simulate, can also limit scalability. We explicitly designed the software-rendering pipeline to minimize the number of locks and other synchronization events. Locks are required when cores start to work on a new tile or PrimSet and in a few lower-frequency cases. Modern games usually have significantly fewer than 10,000 locks per frame. The Larrabee ring network requires on the order of 100 clocks for low-contention locks, so lock scaling should be fairly linear with the number of cores.

Figure 7. Overall performance. The graph shows the number of Larrabee units (cores running at 1 GHz) needed to achieve 60 frames per second for the 25 tested frames in each game.

Binning and bandwidth studies

We adopted a binning algorithm primarily to minimize software locks, but it also benefits load balancing and memory bandwidth. Our algorithm assigns back-end tiles to any core that is ready to process one, without attempting to reorder them for load balancing. This works well because in our tests most bins fall in the range of 1/2 to 2 times the mean bin processing time, and few bins exceed 3 times the mean. The rendering software can break up unusually large bins if they occur. This isn't a viable option when scheduling is controlled by fixed-function logic, as in modern GPUs.

Memory bandwidth is important because the memory subsystem can be one of the more costly and power-hungry parts of a GPU, from high-end to low-cost designs. It is often a limited resource that can cause bottlenecks if not carefully managed, in part because computational speed scales faster than memory bandwidth. Our performance studies measure computational speed, unrestricted by memory bandwidth, but it's important to consider how our binning method compares with standard immediate-mode rendering algorithms.

Figure 8 compares the total memory bandwidth per frame that we calculated for immediate-mode and binned rendering for the three games.


The graph presents per-frame data sorted from least to most bandwidth for the immediate-mode frames. For immediate mode, we assume perfect hierarchical depth culling, a 128-Kbyte texture cache, and 1-Mbyte depth and color caches to represent an ideal implementation. We further assume 2× color and 4× depth compression for single sampling, and 4× color and 8× depth compression for four samples per pixel.

Immediate mode uses more bandwidth for every tested frame: 2.4 to 7 times more for F.E.A.R., 1.5 to 2.6 times more for Gears of War, and 1.6 to 1.8 times more for Half Life 2 episode 2. Notably, binning achieves its greatest improvement when the immediate-mode bandwidth is highest, most likely because overdraw forces multiple memory accesses in immediate mode. Even with depth culling and frame buffer compression, the 1-Mbyte caches aren't large enough to catch most pixel overdraw. High resolutions tend to increase the advantage of binning because they increase the impact of pixel access bandwidth on performance.

Performance breakdowns

Figure 9 shows the average time spent in each rendering stage for the three tested games. Pixel shading is always a major portion of the rendering time, but the balance between stages can vary markedly from game to game. F.E.A.R. illustrates this: it uses stencil-volume shadows extensively, which reduces the pixel shading load but imposes heavy rasterization and depth-test loads. Furthermore, F.E.A.R. changes dramatically on a frame-to-frame basis. Thus, it's important to be able to reconfigure the computing resource allocation among stages, including rasterization, which takes 3.2 percent of the time in two of the games but 20.1 percent in F.E.A.R.

In addition to variances in the average load per stage, the balance varies dramatically within each frame. When rendering finely detailed objects that use small triangles, most of the processing time is spent in the front end, because many triangles cover relatively few pixels. When rendering backgrounds or doing full-scene modifications, most of the processing time is spent in the back end, because few triangles cover many pixels.

Figure 8. Bandwidth comparison of binning versus immediate mode, in Mbytes per frame for each game. Binning requires bin reads and writes, but eliminates many depth/color accesses that hierarchical depth tests don't eliminate. This results in less total bandwidth for binning.

Figure 9. End-to-end average time spent in each rendering stage (pre-vertex, vertex shade, rasterization, depth test, pixel setup, pixel shade, and alpha blend) for Larrabee's software renderer in three games.

We conclude that dynamic load balancing is important to achieving high average performance. Larrabee’s software scheduling algorithms dynamically adjust to varying load requirements across all of the pipeline stages. Modern GPUs can load balance between vertex and pixel shaders but not the other pipeline stages. The result is performance bottlenecks unless the FIFO queues and fixed-function units are over-designed to support worst-case loads.

Advanced applications

Larrabee supports high-performance implementations of many other parallel applications, including applications that use irregular data structures such as complex pointer trees, spatial data structures, or large sparse n-dimensional matrices. See Seiler et al.3 for further details and references.


Many-core programming model

The Larrabee native programming model resembles the programming model for x86 multicore architectures. Many C/C++ applications can be recompiled for Larrabee and will execute correctly with no modification. Such application portability alone can be an enormous productivity gain for developers, especially for large legacy x86 code bases such as those in high-performance computing and numeric-intensive computing environments.

Larrabee Native presents a flexible software-threading capability. The architecture-level threading capability is exposed as the well-known POSIX Threads API (P-threads). We extended the API to also let developers specify thread affinity with a particular hardware thread or core.

Although P-threads is a powerful thread-programming API, its thread-creation and thread-switching costs might be too high for some application threading. To amortize such costs, Larrabee Native provides a task-scheduling API based on a lightweight distributed task-stealing scheduler. A production implementation of such a task-programming API is available in Intel Threading Building Blocks. Finally, Larrabee Native provides additional thread programming support via OpenMP pragmas in the C/C++ compiler.

All Larrabee SIMD vector units are fully programmable by Larrabee Native application programmers. Larrabee Native's C/C++ compiler includes a Larrabee version of Intel's autovectorization compiler technology. Developers who need to program Larrabee vector units directly can do so with C++ vector intrinsics or inline Larrabee assembly code.

In addition to high-throughput application programming, developers could also use Larrabee Native to implement higher-level programming models that might automate some aspects of parallel programming or provide domain focus. Examples include Ct-style programming models,5 high-level library APIs such as the Intel Math Kernel Library (Intel MKL), and physics APIs. Developers can also re-implement existing GPGPU (general-purpose computing on GPUs) programming models via Larrabee Native if desired.
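Since the production counterpart of that task-scheduling API is Intel Threading Building Blocks, a standard TBB fragment illustrates the style such code takes. This is ordinary host-side TBB, not Larrabee-specific code, and the function and names are ours:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    // Scale an array in parallel; the task-stealing scheduler splits the
    // range into chunks and load balances them across hardware threads.
    void scale(std::vector<float>& data, float s) {
        tbb::parallel_for(tbb::blocked_range<size_t>(0, data.size()),
            [&](const tbb::blocked_range<size_t>& r) {
                for (size_t i = r.begin(); i != r.end(); ++i)
                    data[i] *= s;
            });
    }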

Finally, Larrabee provides excellent support for high-throughput applications that use irregular data structures such as complex pointer trees, spatial data structures, or large sparse n-dimensional matrices. Because thread or task scheduling is under their control, programmers can dynamically rebundle tasks that operate on these data structures to maintain SIMD efficiency or dynamically allocate different amounts of memory to different nodes. Unlike stream-based architectures,6 Larrabee allows, but doesn’t require, direct software management to load data into different levels of the memory hierarchy. Unlike many modern GPUs,1 the coherent cached memory hierarchy transparently supports data structure sharing across all of the cores. Software simply reads or writes data addresses, and hardware transparently loads data across the hierarchy. Software complexity is significantly reduced and data structures can employ hard-to-predict unstructured memory accesses.

Extended rendering applications

The Larrabee graphics-rendering pipeline is a Larrabee Native application. Because it is software written with high-level languages and tools, developers can add innovative rendering capabilities. We describe three example extensions of the graphics pipeline that we're studying. Future implementations could evolve toward a fully programmable graphics pipeline.

Render target read. Because Larrabee's graphics-rendering pipeline uses a software frame buffer, we can enable additional programmer access to those data structures. More specifically, a trivial extension to the Larrabee rendering pipeline would be to let pixel shaders directly read previously stored values in render targets. Such a capability could serve a variety of rendering applications, including programmer-defined blending operations and single-pass tone mapping.

Order-independent transparency. Currently, 3D application developers must either presort translucent surfaces or use multipass algorithms such as depth peeling to render the surfaces in the proper order.


Neither method allows the kinds of postrendering area effects that are possible with opaque models. Larrabee can support order-independent transparency (OIT) with no additional dedicated logic by storing multiple translucent surfaces in a per-pixel spatial data structure. After rendering the geometry, we can perform effects on the translucent surfaces, because each surface retains its own depth and color, before sorting and resolving the fragment samples stored per pixel.

Irregular shadow mapping. Shadow mapping is a popular real-time shadow approximation technique, but most implementations are plagued by visually displeasing aliasing artifacts. Irregular shadow mapping (ISM) offers an exact solution to this problem and places no additional burden on the application programmer.7 To implement ISM, we dynamically construct a spatial data structure in the light view using depth samples captured in the camera view. We customize Larrabee's software graphics pipeline by adding a stage that performs light-view rasterization against ISM's spatial data structure. Because the shadow map is computed at exact positions, the resulting shadow map is alias free. Developers can use this technique to achieve real-time hard shadowing effects or as the foundation for real-time soft shadowing effects.
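To make the OIT data structure described above concrete, here is a minimal sketch (ours, with an illustrative fragment layout, not Larrabee's renderer code) of per-pixel fragment lists and the back-to-front resolve pass:

    #include <vector>
    #include <algorithm>

    struct Fragment { float depth; float rgba[4]; };

    using PixelList = std::vector<Fragment>;   // one list per pixel in the tile

    // Resolve one pixel: sort far to near, then apply "over" blending.
    void resolve(PixelList& frags, float dst[4]) {
        std::sort(frags.begin(), frags.end(),
                  [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });
        for (const Fragment& f : frags) {
            float a = f.rgba[3];
            for (int c = 0; c < 3; ++c)
                dst[c] = f.rgba[c] * a + dst[c] * (1.0f - a);
        }
    }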

Figure 10. Real-time ray tracing scalability, in frames per second. The graph compares different numbers of Larrabee cores at a nominal 1-GHz clock speed with a 2.6-GHz Intel Xeon processor system with eight cores total.

Other throughput computing applications

Larrabee is also suitable for a wide variety of nongraphics throughput applications.

Real-time ray tracing. The highly irregular nature of the spatial data structures used in Whitted-style real-time ray tracers benefits from Larrabee's general-purpose memory hierarchy, relatively short pipeline, and VPU instruction set. Figure 10 compares Larrabee's performance with an instance of the ray tracer running on a 2.6-GHz Intel Xeon processor system with eight cores total. The latter uses 4.6 times more clock cycles than are required by eight Larrabee cores.

Figure 11. Scalability of select nongraphics applications and kernels, measured as parallel speedup for up to 64 Larrabee units (1-GHz cores): game fluid, game cloth, game rigid body, production face, production fluid, production cloth, foreground estimation, text indexing, home video editing, video cast indexing, sports video analysis, human body tracking, marching cubes, portfolio management, 3D-FFT, and BLAS3.


Game physics. Figure 11 shows the scalability of widely used game physics benchmarks and algorithms for game fluid, game cloth, and game rigid body. We achieve better than 50 percent resource utilization using up to 64 Larrabee cores for cloth, and near-linear parallel speedup for game fluid. Bader et al.8 provide details on the implementation and scalability analysis for these game physics workloads.

Image and video processing. Larrabee native implementations of traditional 2D filtering functions (both linear and nonlinear) and advanced functions such as video cast indexing, sports video analysis, human body tracking, and foreground estimation offer significant scalability. Biomedical imaging is an important subset of this processing type, requiring back-projection, volume rendering, automated segmentation, and robust deformable registration. Figure 11 shows examples of these algorithms.

Physical simulation. Modeling complex natural phenomena such as fire effects, waterfalls in virtual worlds, and collisions between rigid or deformable objects is challenging because of unstructured control flow and large data sets. Figure 11 shows specific simulation results based on Stanford's PhysBAM for production face, production fluid, and production cloth. Hughes et al.9 describe implementation and scalability analysis details.

Numeric computing. Finally, Figure 11 shows how Larrabee scales for various numerical analysis workloads. Examples include enterprise throughput computing applications such as text indexing; financial analytics such as portfolio analysis; and traditional high-performance computing kernels such as 3D-FFT and BLAS3 (with data sets larger than on-die cache).10

The Larrabee architecture opens a rich set of opportunities for both graphics rendering and throughput computing and is an appropriate platform for the convergence of GPU and CPU applications. We've observed a great deal of convergence toward a common core of computing primitives across the workloads that we analyzed on Larrabee. This underlying workload convergence10 implies potential for a common programming model, a common run time, and a native Larrabee implementation of common compute kernels, functions, and data structures. MICRO

Acknowledgments
Doug Carmean, Eric Sprangle, Dean Macri, Tony Salazar, and Anwar Rohillah started the Larrabee project with assistance from many others, both inside and outside Intel. We thank the many people whose hard work made this project possible, as well as the many who helped with this article. Jeff Boody, Dave Bookout, Jatin Chhugani, Chris Gorman, Greg Johnson, Danny Lynch, Oliver Macquelin, Teresa Morrison, Misha Smelyanskiy, Alexei Soupikov, and others from Intel's Application Research Lab, Software Systems Group, and Visual Computing Group performed workload implementation and data analysis.

Intel, Intel Core, Pentium, and Xeon are trademarks of Intel Corporation in the US and other countries. DirectX, GeForce, Gears of War, F.E.A.R., and Half Life 2 episode 2, as well as other names and brands, may be claimed as the property of others.

References
1. J. Nickolls, I. Buck, and M. Garland, "Scalable Parallel Programming with CUDA," ACM Queue, vol. 6, no. 2, 2008, pp. 40-53.
2. M. Eldridge, "Designing Graphics Architectures around Scalability and Communication," PhD thesis, Stanford Univ., 2001.
3. L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing," ACM Trans. Graphics, vol. 27, no. 3, 2008, pp. 1-15.
4. N. Greene, "Hierarchical Polygon Tiling with Coverage Masks," Proc. Siggraph, ACM Press, 1996, pp. 65-74.
5. A. Ghuloum et al., "Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architectures," Intel Technology J., vol. 11, no. 4, Nov. 2007, pp. 333-348.
6. D. Pham et al., "The Design and Implementation of a First-Generation CELL Processor," Proc. IEEE Int'l Solid-State Circuits Conf., IEEE Press, 2005, pp. 184-186.
7. G.S. Johnson et al., "The Irregular Z-Buffer: Hardware Acceleration for Irregular Data Structures," ACM Trans. Graphics, vol. 24, no. 4, 2005, pp. 1462-1482.
8. A. Bader et al., "Game Physics Performance on the Larrabee Architecture," Intel white paper, 2008; www.intel.com/technology/visual/microarch.htm.
9. C.J. Hughes et al., "Physical Simulation for Animation and Visual Effects: Parallelization and Characterization for Chip Multiprocessors," Proc. 34th Ann. Int'l Symp. Computer Architecture, ACM Press, 2007, pp. 220-231.
10. Y. Chen et al., "Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications," Proc. IEEE, vol. 96, no. 5, 2008, pp. 790-807.

Larry Seiler is a senior principal engineer with Intel's Visual Computing Group, working on graphics APIs and algorithms for Larrabee. He has a PhD in computer science from MIT.

Doug Carmean is an Intel fellow and Larrabee chief architect in the Visual Computing Group of Intel's Digital Enterprise Group. He has a bachelor's degree in electrical engineering from Oregon State University.

Eric Sprangle is a principal engineer with Intel's Visual Computing Group and a lead architect on the Larrabee project. He has an MS in computer architecture from the University of Michigan.

Tom Forsyth is a software and hardware architect working on the Larrabee project at Intel. He has an MA in computer science from Cambridge University.

Pradeep Dubey is a senior principal engineer in the Microprocessor Technology Lab at Intel working on emerging applications for Larrabee. He has a PhD in electrical engineering from Purdue University. He is a Fellow of the IEEE.

Stephen Junkins is a principal engineer in Intel's Software Services Group. He has an MS in computer science from Clemson University.

Adam Lake is a senior software architect with Intel's Advanced Visual Computing group. He has an MS in computer science from the University of North Carolina, Chapel Hill.

Robert Cavin is a senior architect on the Larrabee project at Intel and a primary contributor to the ISA, execution engine design and implementation, and performance projections. He has an MS in computer engineering and electrical engineering from the University of Florida.

Roger Espasa is a principal engineer with Intel's Visual Computing Group. He has a PhD in computer architecture from the Universitat Politècnica de Catalunya.

Ed Grochowski is a senior principal engineer at Intel, working on the architecture of the Larrabee core. He has an MS in electrical engineering from the University of California, Berkeley.

Toni Juan is a principal engineer at Intel and the architect of Larrabee's on-die interconnect, tag directory, and memory controller. He has a PhD in computer engineering from the Universitat Politècnica de Catalunya.

Michael Abrash is a programmer at RAD Game Tools. He has an MS in energy management and policy from the University of Pennsylvania.

Jeremy Sugerman is a PhD candidate in computer science at Stanford University and a principal engineer at VMware. He has an MS in computer science with a systems concentration from Stanford University.

Pat Hanrahan is the CANON Professor of Computer Science and Electrical Engineering at Stanford. He has a PhD in biophysics from the University of Wisconsin.

Direct questions and comments about this article to Pradeep Dubey, Intel Corp., 3600 Juliette Lane, Santa Clara, CA 95054; [email protected].

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/csdl.