ARCHITECTURE AND COMPILER DESIGN ISSUES IN PROGRAMMABLE MEDIA PROCESSORS
Jason Fritts
A DISSERTATION PRESENTED TO THE FACULTY OF PRINCETON UNIVERSITY IN CANDIDACY FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
RECOMMENDED FOR ACCEPTANCE BY THE DEPARTMENT OF ELECTRICAL ENGINEERING
June 2000
© Copyright by Jason Fritts, 2000. All rights reserved.
Abstract
The processing demands for multimedia applications are rapidly escalating. Many current applications are pushing the limits of existing microprocessors, and the next generation of multimedia promises considerably greater demands. Adequate support for future multimedia requires the flexibility and computing power of high-level language (HLL) programmable media processors.
This thesis examines the architecture and
compiler design issues for programmable media processors. Design of the architecture requires an accurate understanding of multimedia characteristics.
Using the MediaBench benchmark suite and the Impact compiler,
workload and architecture evaluations were performed to define the essential architecture for programmable media processors.
The workload evaluation examines various
processing aspects, including functional necessities, data types and sizes, branch performance, loop characteristics, memory statistics, and instruction level parallelism. The architecture evaluation examines the performance of dynamic versus static architecture features. Most existing media processors use static architectures, but as processors progress to higher frequencies, the dynamic aspects become more prominent and dynamic hardware may be necessary to minimize stall penalties. The evaluation covers static versus dynamic scheduling, dynamic aspects of instruction fetch, and performance effects in higher frequency processors. Finally, an investigation
of the memory hierarchy identifies the most significant bottlenecks in memory performance. The high degree of parallelism available in multimedia applications is well researched, but less well understood is how a compiler extracts and schedules that parallelism to highly parallel architectures. Evaluation of the compiler issues begins with an investigation of the available parallelism in multimedia. While instruction level parallelism unfortunately provides only modest performance, data parallelism offers a promising avenue for increased parallelism. However, data parallelism is of a coarser level of granularity than instruction level parallelism, so conventional compiler methods do not prove very effective. Parallel compiler methods are necessary to realize the benefits of data parallelism.
Unfortunately, parallel compilation requires complex
dependence analysis that is often unable to identify all available parallelism. Consequently, we propose a speculative run-time technique for data parallelism that executes loop iterations in parallel across a multi-clustered architecture. This method speculatively executes several loop iterations in parallel and provides architecture support for identifying and recovering from misspeculations.
Acknowledgments
Like all theses, this thesis is not the work of a single person. It reflects the ideas, support, efforts, sacrifices, and insights of innumerable people. While I am unable to indicate my appreciation to everyone individually below, I want to thank everyone who influenced this work and my efforts. First of all, I would like to extend my deep appreciation to Prof. Wayne Wolf, my thesis advisor, who has been a constant source of inspiration to me. I am very thankful for all his invaluable guidance and support. It has been an honor and a pleasure to work with him. I cannot imagine a better advisor. I am very grateful to Prof. Margaret Martonosi and Dr. Kemal Ebcioglu for their efforts in reading my thesis. Their comments and suggestions have been very helpful in improving the quality of this thesis. I would like to thank my former advisor Dr. Andrew Wolfe for introducing me to video signal processors. His enthusiasm for media processing inspired the direction of my thesis work. I am also thankful to Zhao Wu for his influence on my research. His work on video signal processing was very beneficial to my studies in media processing. I am very grateful to all the faculty, staff, and students of Electrical Engineering and Computer Science at Princeton University for such a rewarding graduate experience. They have all played an important part in my graduate education and have
helped me develop a strong computer engineering background. I also thank all my friends at Princeton for making my time there so enjoyable. Special thanks to Dr. Kemal Ebcioglu and my many colleagues at IBM Watson Research.
I thoroughly enjoyed my two years working with them and am indebted to
them for the opportunities and experience I received at Watson Research. Leaving Watson Research was one of the hardest parts of my decision to go into academia. I also want to extend my thanks to the IMPACT group at the University of Illinois at Urbana-Champaign for the use of their compiler and simulation tools.
I am
particularly grateful to Brian Deitrich and John Gyllenhall for all their time and assistance in helping familiarize me with the IMPACT compiler tools. The IMPACT tool set has been essential to my research work. I am forever indebted to my wife Shannon for all her love and support.
I
appreciate her understanding during the times I was unable to be with her, and am thankful for the spirit she shares with me when I am with her. She is my life. I dedicate this thesis to my family. My wife, my parents, Bill and Mary Fritts and Janet and Arthur Congdon, my brothers, my grandparents, and all my family who have always been there for me with constant love and support. This thesis would not have been possible without them. Final thanks go to Dr. Shapiro, Dr. Galloway, and my physical therapist, Barbara. They helped save my leg after an accident midway through my graduate education. I will be forever grateful. This work was supported by the National Science Foundation under Grant Number MIP-9408462 and the New Jersey Center for Multimedia Research.
Contents
Abstract  iii
Acknowledgments  v

Chapter 1. Introduction  1
  1.1 Evolution of Multimedia  3
  1.2 Current Processing Methods  11
    1.2.1 Application-specific processors  11
    1.2.2 Multimedia Extensions to General-Purpose Processors  15
    1.2.3 Media Processors  20
  1.3 Future Media Processors  29
    1.3.1 Potential of Parallel Media Processors  33
  1.4 Thesis Overview  38

Chapter 2. Design Methodology  41
  2.1 Design Methodology  43
  2.2 IMPACT Compiler  49
    2.2.1 Compiler  49
    2.2.2 Performance Analysis  54
    2.2.3 Primary ILP Optimizations  57
  2.3 MediaBench benchmark suite  63
  2.4 Summary  69

Chapter 3. Intrinsic Characteristics of Multimedia  71
  3.1 Related Work  75
  3.2 Operation Frequencies  77
  3.3 Basic Block and Branch Statistics  81
  3.4 Data Types and Sizes  85
  3.5 Memory Statistics  88
  3.6 Loop Statistics  97
  3.7 Instruction Level Parallelism  102
  3.8 Summary  106

Chapter 4. Datapath Architecture  109
  4.1 Static versus Dynamic Architectures  110
    4.1.1 Related Work  110
    4.1.2 Base Architecture Model  112
    4.1.3 Fundamental Architecture Style  115
    4.1.4 Fetch Architecture  124
    4.1.5 Frequency Effects  131
    4.1.6 Summary  135
  4.2 Highly Parallel Architectures  137
    4.2.1 Proposed Distributed Architectures  138
    4.2.2 Princeton Multi-Cluster Architecture  141
  4.3 Summary  143

Chapter 5. Memory Hierarchy  146
  5.1 Related Work  146
  5.2 L1 Cache  148
  5.3 L2 Cache  149
  5.4 External Memory  153
  5.5 Summary  159

Chapter 6. Compiler Methods for Media Processors  160
  6.1 Levels of Parallelism in Multimedia  162
    6.1.1 Instruction Level Parallelism  162
    6.1.2 Subword Parallelism  163
    6.1.3 Task-Level Parallelism  165
    6.1.4 Data Parallelism  170
  6.2 Compiling for Data Parallelism  177
    6.2.1 Related Work  178
    6.2.2 Speculative Execution of Data Parallelism  185
  6.3 Summary  230

Chapter 7. Parallel Media Processor  232
  7.1 Basic Multi-Cluster Organization  232
  7.2 Functional Requirements  233
  7.3 Instruction Control Stream  233
  7.4 Memory Hierarchy  234
  7.5 Static vs. Dynamic  235
  7.6 Compilation Methods  236
  7.7 Summary  236

Chapter 8. Conclusions and Future Directions  238
  8.1 Thesis Contributions  239
    8.1.1 Comprehensive Evaluation of Multimedia Characteristics  239
    8.1.2 Static vs. Dynamic Architecture Evaluation  240
    8.1.3 Cache Memory Hierarchy Evaluation  240
    8.1.4 Investigation of the Parallelism in Multimedia  241
    8.1.5 Speculative Broadcast Loop (SBL) Execution  241
    8.1.6 Multi-Level If-Conversion (MLIC)  242
    8.1.7 Dynamic Memory Conflict Checking  242
  8.2 Future Work  243
    8.2.1 Multi-Level Prefetch Hierarchy  243
    8.2.2 Combine Parallel Compiler with Speculative Broadcast Loop  243
    8.2.3 Single-Chip Multiprocessors for Media Processing  243
    8.2.4 Extend Multi-Level If-Conversion to Subword Parallelism  244
    8.2.5 Evaluating DSP Features for Media Processing  244

Appendix A. Architecture Performance by Application  245
Appendix B. Video Signal Processing Kernels  257
Appendix C. Data Partitioning  264
Bibliography  270
Chapter 1. Introduction
In recent years, the multimedia industry has been growing at a tremendous rate. The success of the Internet and World Wide Web, and the growing feasibility of image and video compression techniques have pushed multimedia into mainstream computing. Multimedia now defines a significant portion of the computing market, and this is expected to grow considerably.
As a consequence, the processing demands for
multimedia applications are rapidly escalating as users desire new and better applications. Many current applications are already beyond the limits of microprocessors, and the next generation of multimedia promises a wider range of applications with considerably greater processing demands. Adequate support for future multimedia will require much greater flexibility and computing power than currently available. This thesis presents single-chip programmable media processors as a potential solution to this problem. Media processors are the class of processors specifically designed to provide efficient processing support for multimedia data. In its truest sense, the term could conceptually encompass the entire range of hardware designed to enable multimedia signal processing in either its digital or analog form. However, the term “media processor” has more recently come to define the field of processors that are designed to support multimedia in its digital representation.
Furthermore, media processors are designed to support multiple types of multimedia signals.
Processors that handle only a single form of multimedia occupy their own categories, such as video signal processors, audio processors, or graphics processors. It is the goal of media processors to provide efficient processing support for multimedia in many, or ideally all, of its forms. The purpose behind the development of media processors is to take advantage of the increasing availability of silicon resources in single-chip processors and produce a single cost-effective solution for supporting many or all forms of multimedia. Currently, most multimedia systems use separate processors for the separate types of multimedia utilized within the system. For example, personal computer systems typically employ separate processors for audio, video, and graphics. Only recently have processors begun integrating audio and video or graphics and video within a single processor. The same is true of processors in embedded multimedia systems such as VCRs, DVD players, and video cameras.
Media processors eliminate the need for many separate processors within
multimedia systems, enabling full multimedia support using a single processor. This thesis proposes programmable media processors as a solution for supporting the next generation of multimedia. Adequate support for future multimedia requires processors with both significant computing power and flexibility over a wide range of applications.
Programmable media processors will combine aggressive compiler
methods with high frequency, highly parallel processors to provide the necessary throughput and flexibility. This thesis examines the architecture and compiler issues in realizing programmable media processors. The remainder of this chapter will examine multimedia and the media processor industry. Beginning with a discussion of multimedia signal processing, it shall be seen
that multimedia has gone through a number of evolutionary phases during its history and that each phase of evolution typically requires new or modified forms of processing. We are now on the verge of yet another generation of multimedia, and this must be considered in the design of future multimedia processing support. The second section will examine the current support methods for multimedia processing and indicate how they are expected to be insufficient for the next generation of multimedia. The third section will discuss the future of media processors, exploring the expected changes between existing and future media processors, and illustrating the potential performance of media processors over the next decade.
Finally, the chapter will close with an
overview of the organization of the remainder of this thesis.
1.1 Evolution of Multimedia

As in any processor field, media processors are strongly dependent upon the state of the multimedia industry.
Like general-purpose processing, multimedia signal
processing has been in existence for quite some time. However, unlike general-purpose processing, which has a rather well established set of applications, the multimedia signal processing industry has not yet stabilized. Multimedia applications are so dependent upon computing power that the industry continues to evolve as computing power increases.
Consequently, to effectively design media
processors, it is necessary not only to examine the history and current state of the multimedia industry, but also to have a good understanding of the future of multimedia. To understand the evolution of multimedia, it is necessary to first answer the question: "What is multimedia?" In essence, multimedia is a means of communication. Specifically, it is the use of a variety of communication agents to convey information to
people through one or more of the human senses. Multimedia is particularly important in conveying information that may not be easily or efficiently communicated through standard human conversation. Feelings and emotions, for example, are often more easily expressed through multimedia.
Artists have
effectively conveyed their ideas through paintings, sculptures, music, and theater for thousands of years. By a similar token, multimedia has also been useful for recording history and passing it down through time. Such methods of recording history have included caveman drawings, epic poems and bards’ tales verbally passed down through the generations, and more recently in history, writing. Of course these traditional forms of multimedia are not what jumps to mind when we refer to multimedia in relation to computers and multimedia signal processing. What multimedia refers to in this capacity is the electronic representation of multimedia. While this may include either analog or digital multimedia, it now largely embodies multimedia in its digital form. In its current state, electronic multimedia usually appeals to the human senses of sight and sound, and occasionally, touch; it remains to be seen how electronic multimedia will be used with respect to most of the remaining senses. The current dominant types of electronic multimedia include video, images, audio, graphics, and speech. Security is another area that has been classified within multimedia, as it is essential to the field of electronic communication.

Machine-Based Representation

For purposes of examining the evolution of multimedia, it is desirable to begin in a more general sense, and assume multimedia in the sense of a machine-based representation.
By a machine-based representation, we refer to those forms of
multimedia that are enabled by an electrical and/or mechanical system. The multimedia in these cases uses a representation meaningful only to the electrical/mechanical system, and it is the job of the system both to perform any necessary processing and to translate the representation into a form meaningful to humans. Under the assumption of a machine-based representation for multimedia, modern day digital multimedia can be traced back to photography in the mid-1800s¹. Developed in 1844, still image photography represents the first machine-based method of storing visual information. While photography involves only a single medium, it represents the beginning of machine-based multimedia representations. The first visual medium was followed shortly thereafter by the telegraph in 1866, the telephone in 1876, and the phonograph in 1877, which marked the introduction of various mediums for machine-based representations of audible information.
Similarly, the first moving-image mediums appeared in 1889, initially in the form of the kinetoscope (a viewing device that emulated video by flipping photographs rapidly in sequence), and eventually in 1895-96 in the form of the first motion pictures. The introduction of radio in 1895 (although the first commercial broadcast radio did not appear until 1920) marked the first wireless medium. While each of these mediums is fundamental to multimedia, they are only single mediums for communication. The first true multimedia did not appear until 1926, when sound was introduced into movies.
¹ The dates in the subsequent paragraphs and Figure 1.1 are from [1][2][3][4][5][6][7].
    1845          still image photography
    1866          telegraph
    1877          phonograph; telephone
    1889          kinetoscope
    1895          radio; movie projector
    1905          movies (commercial avail.)
    1920          radio (commercial avail.)
    1926          television; sound in movies
    1933          frequency modulation (FM)
    1940s         color TV; cable TV; television (commercial avail.)
    early 1950s   stereo movies
    late 1950s    stereo recordings (commercial avail.)
    1960s         solid-state TV
    1970s         VCR; digital multimedia
    1980s         stereo TV; CD-ROM; videoconferencing; HDTV
    early 1990s   JPEG coding; MPEG-1 coding; MPEG-2 coding
    late 1990s    DVD; MPEG-4 coding
    2001          MPEG-7 coding
    2005          MPEG-21 coding
Figure 1.1 – Timeline of significant events in multimedia history.

As illustrated in Figure 1.1, many other advances in multimedia followed thereafter, including the first television in 1926 (although it did not gain widespread commercial availability until after WWII in 1945), color TV in the 1940s, and stereophonic sound in the 1950s. These advances were strictly in the realm of analog methods for multimedia until the 1970s, when digital multimedia first began to appear in computers. Of course, digital multimedia did not immediately supplant analog forms of multimedia. Analog multimedia persists in many forms today, including much of the television industry, videocassette recorders and cameras, and many voice message systems. However, while analog multimedia will continue to remain in use for some time, digital multimedia is beginning to become the dominant form of multimedia.
Digital Multimedia

The era of digital multimedia began in the 1970s when early computer users started realizing that computers were capable of much more than just scientific computing. Computer users began experimenting with computer graphics and audio, and it quickly became apparent that digital multimedia required significant computing power to handle its large volumes of data. However, the introduction of multimedia in computers found considerable popularity, particularly with respect to computer games and entertainment. While the quality of computer multimedia was initially quite limited, it was still very successful in drawing in a new market of computer users. Like multimedia in the pre-digital era, the popularity of multimedia encouraged continued evolution in the digital era. However, unlike the pre-digital era, in which evolution was limited mostly by the ingenuity of engineers to design or adapt systems for new technologies, evolution in the digital era is limited primarily by the capabilities of existing computers. Multimedia is very compute intensive because it uses enormous amounts of data and typically has real-time constraints that require the processing to be performed within a limited period of time.
Consequently, the state of the digital
multimedia industry is determined by the capabilities of its processing support. As the performance of multimedia processing support increases, new applications and higher quality versions of older applications become feasible.
Essentially, the multimedia
industry evolves with processor performance. Compression of image and video is one example of this trend in digital multimedia. Digital video and image support in computers evolved initially through image processing, because video processing was well beyond the capabilities of early
computers. Early image processing methods stored and transmitted images in a raw (uncompressed) format. However, the large amounts of data in uncompressed images made the cost of storing and transmitting these images prohibitively expensive. Consequently, methods such as JPEG [7] were devised to enable compression and decompression of images. The compression algorithms initially processed images rather slowly, but once the processing power developed to enable fast compression and decompression of images, the use of digital images became popular in the computing industry. At about the same time, video was beginning to appear in digital multimedia. Initially, the JPEG compression methods were applied to video frames individually. However, because video requires orders of magnitude more data, the amount of data even when compressed still remained relatively high, so video was little used in computers initially. As computing power continued to increase and better compression methods became available, like MPEG [8], which also reduces temporal redundancy in video, video began to become a more realizable medium. Computer support has finally begun to reach the levels necessary for enabling real-time decoding of limited resolution levels of MPEG video, so video is now developing a strong foundation in the computing industry².
New advancements require new, and oftentimes substantially different,
systems for supporting the next generations of multimedia. This was evidenced above in the discussion of the evolution of image and video multimedia in computers. As new
² Encoding is still not realizable in real-time, but it can be done off-line, so it is less critical.
compression methods were introduced, computing technology had to achieve the necessary performance before the methods became practical. The evolution of image and video with respect to processing power is illustrated in Figure 1.2.

[Figure 1.2: performance versus time, with successive curves for image compression, video compression, and object-based multimedia processing.]

Figure 1.2 – Evolution of image and video vs. processor performance.

Television provides a good example from the pre-digital era. Not only did it combine video and audio media, but it also required wireless support and utilized a much greater bandwidth than any previous form of multimedia. It was first successfully tested in 1926, but because of its complexity, it did not see widespread commercial availability until 1945, after the end of World War II. And even with the significant popularity of television, it still took a decade before color television became commercially available (mid-1950s). Like television, the typical evolution time between the generation of a new multimedia technology and its commercial availability has usually been at least a few years, and often a decade or two. This time is lessening with digital multimedia since computing systems have the flexibility to upgrade to the next generation via
software. However, because evolution is still limited by the computing capabilities of multimedia processing support, upgrading software is usually not sufficient.

Object-Based Multimedia

Multimedia is now verging on another evolutionary phase as it begins to move towards advanced multimedia representations, as evidenced by applications such as MPEG-4 [9], MPEG-7 [10], video libraries [11], and even MPEG-21 [12], the newest potential multimedia standard³. Whereas current methods describe multimedia in terms of entire images, audio channels, or sequences of video frames, the next generation will describe multimedia using more advanced representations. An example is the MPEG-4 standard, which describes multimedia in terms of streams of objects [13]. These objects represent real-world objects, each with its own audio, video, and graphics characteristics that describe its spatial and temporal behavior. This object-oriented representation will effect significant changes by enabling higher compression rates, more interactive media, and content-based processing. As with previous generations of multimedia, this new generation will require new methods of support that provide considerably more advanced media processing [14]. The object-oriented representation provides new freedom and flexibility that introduces much greater computing demands and processing irregularity than seen in earlier generations of digital multimedia. As such, it is to be expected that the current systems for supporting multimedia will require significant modifications to effectively support the next generation of multimedia.
³ The MPEG Committee just began discussion of MPEG-21 in December 1999. Its goal is the standardization of a generic framework for all multimedia.
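A toy data structure makes the shift concrete. The sketch below is purely illustrative (the field names and layout are invented, not taken from the MPEG-4 specification), but it shows what an object-based scene implies for a processor: each object carries its own media streams and its own spatial and temporal behavior, so the work per object is irregular and data dependent.

    #include <stdint.h>

    /* Hypothetical object-based scene description.  Unlike frame-based
     * formats, where one regular decoder loop covers the whole picture,
     * every object here may demand a different mix of processing. */
    typedef struct {
        float x, y, z;               /* spatial placement in the scene */
        float t_start, t_end;        /* temporal extent of the object */
    } Behavior;

    typedef struct MediaObject {
        int                 id;
        Behavior            behavior;
        const uint8_t      *video;     /* coded video stream, or NULL */
        const uint8_t      *audio;     /* coded audio stream, or NULL */
        const uint8_t      *graphics;  /* 3-D/graphics data, or NULL */
        struct MediaObject *next;      /* a scene is a list of objects */
    } MediaObject;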
1.2 Current Processing Methods

To understand the need for new methods of support for the next generation of multimedia, it is necessary to first understand the limitations of the current methods of support. As discussed above, the next generation of multimedia is expected to require considerably more computing power and greater flexibility to accommodate increased processing irregularity and support over a wide range of applications. Current industry support for multimedia appears in three forms: application-specific processors, multimedia extensions to general-purpose processors, and media processors. While each of these methods of support is effective at providing either the necessary performance or flexibility, none are capable of accommodating both, which will be necessary for the future of multimedia.
1.2.1 Application-specific processors

Application-specific processors offer a hardware solution to multimedia processing support. This approach extends from early machine-based representations, when computers did not exist and hardware solutions were the only means of support. In this alternative, designers carefully examine the characteristics of the application (or class of applications, if the applications have sufficient similarity, such as MPEG-1 [8], MPEG-2 [15], and H.263), and then construct the hardware to most effectively meet the needs of that application.

Application-specific processors are able to achieve high degrees of performance by optimizing the hardware for the most critical sections of the application. Many multimedia applications spend a significant amount of their time within a small number of processing routines. As an example, the primary routines in MPEG video include the discrete cosine transform (DCT), motion estimation, and variable bit-rate coding; a representative kernel is sketched after the list below. Chip designers devote large portions of the silicon area to special functional units that implement these critical routines in the least amount of time by taking advantage of all available parallelism. Because VLSI technology now offers very high degrees of integration, application-specific processors are able to achieve low cost by combining on a single chip not only the highly optimized functional units, but also much of the additional system logic. This includes:

• on-chip memory for temporary storage and interfaces to external memory
• management for on-chip memory, external memory, and memory prefetching
• read-only memory storage for instructions and application-specific data
• registers and/or memory for user-definable application and system settings
• interfaces to the external multimedia sources/sinks
• analog-to-digital conversion of analog multimedia sources/sinks
• synchronization logic for separate media types
• network interfaces for multi-chip support
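As a concrete illustration of such a critical routine, the following C sketch gives the inner loop of motion estimation: a generic sum-of-absolute-differences (SAD) kernel, written here for illustration rather than taken from any particular codec. Every iteration applies the same few 8-bit operations to independent data, which is exactly the regularity that dedicated functional units exploit.

    #include <stdint.h>
    #include <stddef.h>

    /* Sum of absolute differences between a 16x16 macroblock of the
     * current frame and a candidate block in the reference frame.
     * Motion estimation evaluates this kernel at many candidate
     * positions per macroblock, making it a dominant cost in encoding. */
    uint32_t sad16x16(const uint8_t *cur, const uint8_t *ref, size_t stride)
    {
        uint32_t sad = 0;
        for (size_t y = 0; y < 16; y++) {
            for (size_t x = 0; x < 16; x++) {
                int d = (int)cur[y * stride + x] - (int)ref[y * stride + x];
                sad += (uint32_t)(d < 0 ? -d : d);   /* |difference| */
            }
        }
        return sad;
    }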
By combining much of the system logic on a chip and providing the ability to enable or disable parts of the system logic, application-specific processors offer a high-speed, single-chip solution usable in a wide variety of systems implementing that application. The chip is therefore usable over a larger market, enabling high performance at low cost. An example of an application-specific multimedia processor is C-Cube Microsystems’ ZiVA-3 DVD Decoder [16], shown in Figure 1.3. The C-Cube ZiVA-3 is an advanced DVD decoder with integrated audio DSP support. It supports DVD, VCD,
and CD-DA formats. The internal structure, shown in Figure 1.3, provides interfaces to external memory (up to 16 Mbit SDRAM), a host processor, the DVD/CD input, and the audio and video outputs. Upon input, media in DVD/CD form is descrambled and decoded into separate audio and video streams. The video stream is sent to both the MPEG-1/2 and sub-picture decoders, which decode the video before sending it to the video mixer. The video mixer combines the MPEG and sub-picture video streams, as well as a separate OSD (On-Screen Display) stream, to produce the final video output. The audio stream is routed to the three audio decoders, which support a variety of formats including MPEG, Dolby Digital, and CD-DA or LPCM. For full audio support, an audio DSP is also integrated on the ZiVA-3, enabling additional audio enhancements and support, including 3D audio, DVD audio, Pro Logic decoding, and C-Cube’s RealSonic audio technology. Additional examples of application-specific processors for video are given in [17]. The primary drawback of application-specific processors is that because they are highly optimized to implement a specific application at minimal cost, they provide very little flexibility. In many application-specific processors, the extent of the flexibility is a set of registers that allow the user to define a small number of application settings, or define which particular application to use if the processor targets a class of similar applications. The result is that these processors are only practical for implementing a specific existing application. They are not capable of supporting other applications or even supporting future versions of the same application. The simple, hardware-only approach of application-specific processors provides little to no flexibility.
[Figure 1.3: block diagram of the ZiVA-3. A DVD/CD interface feeds SecureView CSS descrambling (with optional bus key authentication) and a program stream decoder; the audio stream is routed to MPEG, Dolby Digital, and CD-DA/LPCM decoders and an audio DSP, while the video stream is routed to MPEG video, subpicture, and OSD decoders and a video mixer. A host interface with control logic, an SRAM interface with memory controller, and digital audio, audio, and video output interfaces complete the chip.]

Figure 1.3 - C-Cube’s ZiVA-3 application-specific processor for DVD decoding [16].
It will be difficult, if not impossible, to define application-specific processors for many future multimedia applications with advanced representations, such as MPEG-4 [9][4]. MPEG-4 is a standard that defines a syntax for object-based multimedia. Each object may have many media types associated with it, including, but not limited to, video, audio, and computer graphics. Furthermore, the processing of the media associated with each object is not limited to a specific set of tools. The processing of each object’s media can be defined separately. This flexibility of various media types and freedom of processing for each media type enables the possibility for a great deal of processing irregularity, which may be impossible to support strictly in hardware. Application-specific processors offer excellent performance in applications with significant processing regularity, but offer negligible flexibility for other uses. Alternative methods of multimedia support will likely be needed for the next generation of multimedia.
1.2.2 Multimedia Extensions to General-Purpose Processors

A second form of media processing that has achieved much success in recent years is the addition of multimedia extensions to the instruction sets of general-purpose processors.
This has become a very popular method in workstations and personal
computer systems as it provides improved performance for multimedia with little added cost to the processor. The primary basis of these multimedia extensions is that they take advantage of two facets of the nature of multimedia: a) multimedia signal processing typically uses small data types no larger than 8 or 16 bits, and b) multimedia spends a significant portion of its time in inner loops that have a high degree of processing regularity and
perform the same processing using separate data in each iteration. When processing individual 8-bit or 16-bit data elements, general-purpose processors with 32-bit and 64-bit datapaths waste much of the datapath’s resources. Instead, it was realized that several small data elements could be packed into a single register, enabling simultaneous processing of these separate data elements without requiring extra registers or operations. This form of SIMD (Single Instruction, Multiple Data) parallelism is commonly known as subword parallelism [18]. The multimedia extensions provide many instructions to enable efficient processing of subword parallelism.
Instructions are provided for both packing and
unpacking the separate data elements to/from a single register. With a 64-bit datapath, a packed register may contain eight 8-bit words, four 16-bit words, or two 32-bit words. Once in packed form, a variety of instructions exist for performing parallel operations, including parallel arithmetic, logic, compare, and shift operations. An example of a parallel ADD is shown in Figure 1.4(a). Additional instructions are necessary for mixing and reordering the contents of packed registers. One such example that mixes data elements from two separate packed registers is shown in Figure 1.4(b). Subword parallelism was first introduced by Hewlett-Packard in 1994 with the introduction of MAX-1 in the PA-RISC 1.0 instruction set [19].
MAX-1 initially
provided only the rudiments for subword parallelism, but it has since been extended by HP’s MAX-2 [18][20] and many other microprocessor vendors. Other implementations of subword parallelism in general-purpose processors include Intel’s MMX [21], Sun’s VIS [22], Alpha’s MVI [23], MIPS MDMX [24], and Motorola’s AltiVec [25][26]. These initial multimedia extensions operated solely on small integer data types, enabling
packing of 8-bit, 16-bit, and 32-bit data types into 32-bit or 64-bit registers. AMD’s 3DNow! [27] extension added subword parallelism for floating-point by enabling packing of two 32-bit single precision floating-point words into a single 64-bit floating-point register.
    (a)  [ x0 | x1 | x2 | x3 ]  +  [ y0 | y1 | y2 | y3 ]
             =  [ x0+y0 | x1+y1 | x2+y2 | x3+y3 ]

    (b)  MIX( [ a0 | a1 | a2 | a3 ], [ b0 | b1 | b2 | b3 ] )
             =  [ a0 | a1 | b0 | b1 ]

Figure 1.4 – Examples of typical subword parallel operations: (a) 4-way subword add, (b) mix operation mixing two subwords from each group.

Besides subword parallelism instructions, the multimedia extensions also include a few additional operations to assist with specific multimedia functions. Sun’s VIS is a good example, as it provides additional operations for efficient boundary processing, 3D addressing, and pixel distance computation (which is critical to motion estimation in MPEG encoding). Many of the other multimedia extensions provide other specialized instructions as well. A good summation of many of these multimedia ISA extensions is provided by Wu and Wolf [17].
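The semantics of both operations can be expressed in a few lines of portable C. The sketch below emulates Figure 1.4 on an ordinary 64-bit integer holding four 16-bit subwords (the lane width and helper names are illustrative choices, not any vendor’s intrinsics); a real multimedia extension performs each call in a single instruction.

    #include <stdint.h>

    /* (a) 4-way 16-bit parallel add: each lane wraps around independently,
     * so a carry out of one subword never corrupts its neighbor. */
    static uint64_t padd16(uint64_t x, uint64_t y)
    {
        uint64_t r = 0;
        for (int lane = 0; lane < 4; lane++) {
            uint64_t xs = (x >> (16 * lane)) & 0xFFFF;  /* extract subword */
            uint64_t ys = (y >> (16 * lane)) & 0xFFFF;
            r |= ((xs + ys) & 0xFFFF) << (16 * lane);   /* wraparound add */
        }
        return r;
    }

    /* (b) mix: take subwords {a0, a1} from the first packed register and
     * {b0, b1} from the second, as drawn in Figure 1.4(b). */
    static uint64_t mix16(uint64_t a, uint64_t b)
    {
        return (a & 0xFFFFFFFFULL) | (b << 32);
    }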
[Figure 1.5: packed registers in the register file (e.g., [x0|x1|x2|x3] and [x4|x5|x6|x7]) feed ALUs partitioned at subword boundaries, producing packed results (e.g., [a0|b0|a1|b1] and [z0|z1|z2|z3]).]
Figure 1.5 – Subword parallelism enabled by partitioned functional units.

The multimedia extensions have been such a success in general-purpose processors because they offer performance improvements for multimedia with minimal hardware modification. To enable processing for subword parallelism, about all that is necessary is to partition the ALUs (arithmetic logic units), as illustrated in Figure 1.5. This partitioning involves manipulating the carry chains to prevent overflow of the processing of one subword datum into the next. Some additional hardware is needed to implement methods of rearranging and packing/unpacking the packed registers, but little else is needed. In particular, subword parallelism does not require any additional ports in the register file, unlike most other forms of parallel processing. Additional hardware is also needed for the instructions that implement specific multimedia functions, but overall the typical area overhead for multimedia extensions in general-purpose processors is only between 0.1% (HP’s MAX-2) and 3% (Sun’s VIS) of the entire chip area.
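In software, the same carry-chain idea can be mimicked with the well-known "SIMD within a register" masking trick, which makes the hardware mechanism concrete. This is a generic sketch (16-bit lanes chosen for illustration), not code from any of the extensions above: the mask H blocks carries at each lane boundary exactly as a partitioned ALU does.

    #include <stdint.h>

    /* Partitioned 4 x 16-bit add on a 64-bit word with carries blocked at
     * lane boundaries -- the software analogue of cutting an ALU's carry
     * chain.  H marks the most significant bit of each 16-bit lane. */
    static uint64_t swar_add16(uint64_t x, uint64_t y)
    {
        const uint64_t H = 0x8000800080008000ULL;
        /* Clear each lane's top bit before adding, so no carry generated
         * in the low 15 bits can ever propagate into the next lane. */
        uint64_t low = (x & ~H) + (y & ~H);
        /* Restore each lane's top bit as the carry-discarding sum of the
         * operand top bits and the carry arriving from below. */
        return low ^ ((x ^ y) & H);
    }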
The primary drawback of these multimedia extensions is that there is currently no effective way to automate their use [28][29]. Conventional compiler methods have not been successful in utilizing subword parallelism, and while research is underway towards solving this problem (see Section 6.1.2), no methods have yet been found that enable efficient compilation for subword parallelism. Use of the additional multimedia-specific instructions also cannot be effectively automated since it is difficult for compilers to determine when they may be used. Consequently, the use of multimedia extensions requires programmers to explicitly reference the subword parallel or multimedia-specific instructions in their programs. Because this is tedious to do at the assembly level, special libraries are provided for referencing these instructions in high-level languages. However, these libraries vary between separate processors so applications have to be coded separately for each computing platform. Even with these libraries the performance gain is limited by a variety of factors, including limited SIMD parallelism, packing overhead, and intervening control code. While the peak performance improvement for most processors is between 4x and 8x speedup, the typical performance gain is usually only about 2-4x [30].
However, multimedia
extensions alone will not be sufficient to support the next generation of multimedia. While the multimedia-enhanced instruction set of a general-purpose processor provides the necessary flexibility, it offers only moderate performance improvement. Future multimedia applications will require much higher levels of performance.
The
fundamental limitation of this approach is that general-purpose processors are optimized for general-purpose applications, not multimedia applications.
General-purpose
processors devote a significant portion of the silicon area to complex memory management, TLBs, multi-level cache hierarchies, and dynamic scheduling methods. These complex structures are less important to multimedia applications, which would instead prefer additional parallel processing units and prefetching in the memory hierarchy.
To meet the processing demands of future multimedia, processors for
multimedia should be designed as separate entities [28][29]. As with other specialized processors that are optimized for their specific market [31], media processors should optimize chip area for the greatest media processing efficiency and provide the flexibility of high-level language (HLL) programmability, ideally without the necessity of special libraries or programming paradigms.
1.2.3 Media Processors

Programmable media processors have started to appear in the marketplace in recent years. Digital signal processors have been supporting various forms of multimedia for many years, but what distinguishes media processors from traditional digital signal processors is their ability to also support video and/or computer graphics. As such, the Texas Instruments TMS320C80 Multimedia Video Processor (MVP) [32][33] was one of the earliest commercially available media processors. This early media processor used a multiprocessor DSP (digital signal processor) architecture with four 32-bit integer DSPs and a floating-point RISC (reduced instruction set computer) processor to achieve up to 100 MFLOPS at 50 MHz. It was successful at paving the way for media processors, although it had a relatively short lifetime, as it was soon supplanted by a host of other media processors.
TI’s MVP is particularly important because it was the first and last media processor to employ a traditional DSP architecture. The traditional approach to digital signal processing has been to use a master processor to perform the control processing and then off-load the digital signal processing to separate DSP coprocessors. In the case of the TI MVP, the master processor and DSP coprocessors were simply integrated into a single chip. The problem that has existed for many years in this computing style is that it is difficult to program such systems.
Early digital signal processors could only be
programmed in assembly, and even though C compilers now exist for DSP processors, much of the programming still requires tedious hand optimization and/or special programming libraries to achieve reasonable performance. For simpler programmability and shorter development times, it is desirable to instead use an architectural approach that is more amenable to high-level language programming. This is why most subsequent media processors have adopted a VLIW architecture.

VLIW Architecture for Media Processors

VLIW architectures have become appealing to digital signal processing for a variety of reasons [34]. One of the foremost reasons is that digital signal processing applications typically provide a significant amount of instruction level parallelism (ILP). There exists a strong foundation of compiler technology for ILP, so VLIW processors enable excellent performance with high-level language programming without requiring tedious hand optimization. Superscalar architectures are also effective at supporting instruction level parallelism, but they employ complex control hardware for achieving high performance on control-intensive general-purpose applications.
The complex
control hardware typically becomes prohibitively expensive when attempting to support a
large number of parallel functional units. It is believed that VLIW processors are better able to support multimedia since they depend only on the compiler for program scheduling, enabling simpler hardware that can more easily support a large number of parallel functional units. Also, a simpler hardware design enables processors with lower cost and power, which are important in many embedded multimedia systems. The shift towards VLIW architectures for media processors is readily apparent among the media processors developed after the TI MVP. Included in this group are Philips TriMedia’s TM-1000 [35][36], Chromatic Research’s Mpact-1 [37] and Mpact-2 [38], MicroUnity’s Broadband MediaProcessor [39], Samsung’s MSP-1 [40], the joint Equator-Hitachi MAP1000 [41], and Texas Instruments’ own addition, the VelociTI architecture [42], which includes the TMS320C62x fixed-point processor and the TMS320C67x floating-point processor. The primary exception to this trend of using VLIW processors is NEC’s V830R/AV [43][44], which is a 2-way superscalar architecture for embedded multimedia applications. A good overview of many of these media processors is provided by Wu and Wolf [17].
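The contrast with superscalar control can be made concrete with a sketch of what a VLIW instruction word contains. The encoding below is hypothetical (it is not TriMedia’s or any other vendor’s actual format), but it shows the essential point: the compiler fixes the contents of every issue slot at compile time, so the hardware needs no dependence-checking or reordering logic.

    #include <stdint.h>

    /* Hypothetical 5-slot VLIW instruction word.  The compiler guarantees
     * at compile time that the five operations are independent, so each
     * cycle the hardware simply routes one operation to each functional
     * unit -- no dynamic dependence checking, unlike a superscalar. */
    typedef enum { OP_NOP, OP_ADD, OP_MUL, OP_LOAD, OP_STORE, OP_BRANCH } Opcode;

    typedef struct {
        Opcode  op;
        uint8_t dst, src1, src2;     /* register specifiers */
    } Operation;

    typedef struct {
        Operation slot[5];           /* one operation per functional unit */
    } VliwBundle;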
The newest entry to the ranks of commercial media processors is Philips TriMedia’s successor to the TM-1000, the TM-2000, which was recently introduced at the 1999 International Conference on Computer Design [45][46][47][48]. This introduction covered only the design of the VLIW core for the TM-2000, because the remainder of the chip is expected to strongly resemble the TM-1000. The resulting architecture would therefore appear as shown in Figure 1.6.
[Figure 1.6: (a) the TM-2000 VLIW core: a global register file of 128 x 64-bit words with 15 read ports and 5 write ports, a bypass network, a 32 Kb instruction cache and 16 Kb data cache with MMUs, exception handling, and pipelined instruction decode and launch. (b) the TM-1000 chip layout: an SDRAM main memory interface, video and audio in/out, VLD and image coprocessors, timers, I2C, synchronous serial, and PCI interfaces surrounding the VLIW CPU with its 32K I$ and 16K D$.]
Figure 1.6 – Architecture for the TriMedia TM-2000: (a) the design of the VLIW core for the TM-2000, and (b) the existing chip layout for the TM-1000, expected to be similar for the TM-2000 [46].
The TM-2000 is very similar to the TM-1000 in many respects. Like the TM-1000, it has a 5-issue VLIW architecture, and achieves its performance through a combination of instruction level parallelism, subword parallelism, and specialized hardware for variable bit-rate coding, image filtering and scaling, and color space conversion. The goal in the design of the TM-2000 was to achieve a 6x performance gain over the 100 MHz TM-1000. The use of a new process technology was expected to provide nearly 3x gain in clock frequency, so improvements to the architecture needed to account for a 2x performance improvement. The primary architecture modification was an increase of the datapath size from 32 to 64 bits, which provides twice the potential for subword parallelism. Additional modifications include a hardware TLB, a true dual-port data cache, additional instructions for critical multimedia functions, and super-ops, instructions that execute on 2 issue slots and can therefore have up to 4 source arguments. In addition to the existing media processor industry, there exists a large market of graphics processors that also appear to have many of the traits of media processors. It is difficult to say for certain, since there is little technical literature on these processors. It is likely, however, that these graphics processors will eventually merge into the media processor marketplace. Some of the major graphics processor vendors include ATI [49], S3 [50], and NVidia [51]. Each provides a host of graphics processors, the latest of which are ATI’s Rage 128 Pro, S3’s Savage2000, and NVidia’s GeForce 256. Graphics processors are built around a 64-bit, 128-bit, or 256-bit rendering engine, which provides the basic features (including triangle rendering) for supporting computer graphics.
In addition, there is a significant amount of specialized hardware for
supporting both video and graphics: a) graphics features include texturing, lighting,
shading, and alpha blending, and b) video functions include interpolation, scaling, color space conversion, motion compensation, discrete cosine transform, and variable bit-rate coding. Because graphics is an integral part of digital multimedia and future multimedia applications such as MPEG-4 incorporate graphics with other forms of multimedia, it is expected that graphics processors will have to merge with media processors to provide efficient support for all forms of multimedia.

Performance Mechanisms

As discussed above, the TriMedia TM-2000 achieves performance through a combination of instruction level parallelism, subword parallelism, and specialized hardware. Like the TM-2000, all the existing media processors combine a variety of different mechanisms for achieving high performance. These mechanisms can be broken down into five categories:

• Subword parallelism
• Specialized instructions and functional units
• Instruction level parallelism (ILP)
• Parallel processing
• Clock frequency
The first mechanism, subword parallelism, is the same form of SIMD parallelism on small data types discussed in Section 1.2.2 with respect to multimedia extensions for general-purpose processors. Subword parallelism is also a highly viable method for media processors and many of the existing media processors use it to some degree, as shown in Table 1.1. The second method, specialized instructions and functional units, is similar to the approach discussed in Section 1.2.1 with respect to application-specific
processors. Many media processors employ specialized hardware and instructions for achieving greater performance on specific multimedia functions, such as motion estimation, color-space conversion, variable bit-rate coding, filtering, and triangle rendering. Again, however, subword parallelism and specialized hardware require special programming libraries for effective use, so they are not as desirable from the standpoint of ease of high-level language programming. The third method, instruction level parallelism, is enabled by numerous parallel functional units in VLIW and superscalar architectures. Of the existing media processors only three do not utilize ILP: the TI MVP, which is a multiprocessor DSP, and the MicroUnity Broadband MediaProcessor and Samsung MSP-1, both of which provide extensive levels of subword and SIMD parallelism. Table 1.1 indicates the level of instruction level parallelism (i.e., issue width) available in each media processor. The fourth parallel mechanism, parallel processing, is utilized by only a few of the media processors. Parallel processing is the use of separate processors (not just separate functional units) to support more coarse-grained parallelism than available with ILP. This can be an effective method for achieving levels of performance beyond what is possible with just subword parallelism and ILP. However, special parallel compiler methods are required for utilizing parallel processors, and oftentimes best performance can only be achieved with the programmer explicitly declaring the parallelism in an application using special libraries and/or programming paradigms. The only existing media processors that employ parallel processing are the TI MVP, which combines four DSP processors with a separate RISC master processor, and the Samsung MSP-1, which combines a 256-bit vector processor with an ARM7 RISC master processor.
The final method of enabling higher performance is clock frequency. Table 1.1 indicates the peak clock frequencies for the existing, commercially available media processors.
The simple hardware design of VLIW architectures is believed to be
particularly amenable to high frequency design.
However, from Table 1.1, it is
immediately evident that existing media processors do not fully utilize their frequency potential.

| Media Processor | Subword Parallelism | Specialized Hardware | ILP | Parallel Processing | Clock Frequency |
|---|---|---|---|---|---|
| Texas Instruments MVP (C8x) | ✓ | ✓ | – | ✓ | 50 MHz |
| Texas Instruments VelociTI (C6x) | – | – | 8-wide | – | 300 MHz |
| Philips TriMedia TM-1000 | ✓ | ✓ | 5-wide | – | 166 MHz |
| Philips TriMedia TM-2000 | ✓ | ✓ | 5-wide | – | ~300 MHz |
| Equator/Hitachi MAP1000A | ✓ | ✓ | 4-wide | – | 220 MHz |
| NEC V830R/AV | –⁴ | ✓ | 2-wide | – | 200 MHz |
| Chromatic Research⁵ Mpact-1 | ✓ | ✓ | 5-wide | – | ? |
| Chromatic Research⁵ Mpact-2 | ✓ | ✓ | 6-wide | – | 125 MHz |
| MicroUnity⁶ MediaProcessor | ✓ | ✓ | – | – | ? |
| Samsung⁷ MSP-1 | ✓ | ✓ | – | ✓ | ? |

Table 1.1 – Performance methods utilized by existing media processors

⁴ Subword parallelism was originally intended in the NEC V830R/AV, but recent data sheets do not indicate any subword parallelism in the realized product.
⁵ Chromatic Research no longer produces either Mpact-1 or Mpact-2 and is now a fully owned subsidiary of ATI Technologies [52].
⁶ MicroUnity is still in the process of developing the MediaProcessor.
⁷ Samsung never completed development of the MSP-1.
Media processors have certainly demonstrated their worth, as many of them are able to achieve peak performance of many billions of operations per second. TI’s VelociTI is capable of up to 2.4 BOPS (billion operations per second), while TriMedia’s TM-1000 can achieve 4 BOPS and their new TM-2000 is targeted for 6 times that – 24 BOPS! Similarly, the Equator/Hitachi MAP1000A claims a performance of up to 23 BOPS. However, even with their existing capabilities, these early media processors achieve only a fraction of their potential performance. As is evident from Table 1.1, while most of them make use of subword parallelism, specialized hardware, and moderate levels of ILP, they are not taking advantage of all avenues of performance. Additional performance mechanisms that few media processors are utilizing include greater levels of ILP, parallel processing, and high frequency. These processors operate at only moderate frequencies of up to 300 MHz, while general-purpose processors are already operating at more than twice that.
The Intel Pentium III is
currently available at 800 MHz [53], while Compaq will soon be shipping Alpha 21264 processors at 1 GHz [54]. There is also significant performance potential with the use of single-chip parallel processors. Research in multimedia has demonstrated significant levels of parallelism at many levels of granularity. Current processors are only utilizing fine-grained subword parallelism and ILP. Much coarser granularities of parallelism also exist that require parallel processing methods.

Early media processors provide reasonably good performance and flexibility, but they currently achieve only a fraction of their potential performance. They operate at only moderate frequencies and rely mostly on subword parallelism, ILP, and application-specific hardware to achieve the bulk of their performance. Consequently, they do not
enable both performance and flexibility over a wide range of applications. The next generation of multimedia will demand significantly higher levels of performance and flexibility.
Achieving these goals requires advanced media processors capable of
effectively utilizing all avenues for performance.
1.3 Future Media Processors

To achieve their full potential, media processors will undergo some significant changes in the coming decade. As indicated above, media processors will have to explore the additional performance avenues of increased ILP, parallel processing, and high frequency to meet the demands of the next generation of multimedia. Such performance mechanisms will require significant silicon resources. Fortunately, the continuing advancement of VLSI technology will enable this within the next decade. Advances in VLSI technology will make possible chips with one billion transistors within a decade [55]. The increasing frequencies and numbers of transistors will enable the design of media processors that provide the necessary performance and flexibility for future multimedia. However, the question remains as to how to most effectively utilize these resources. Future media processors are expected to be single-chip parallel media processors (PMPs), which obtain performance using both high frequency and high degrees of coarse and fine-grained parallelism. Optimizing these processors for multimedia performance requires a careful balance of throughput, memory, and programmability, as illustrated in Figure 1.7. Achieving this balance defines the three key areas where PMPs will differ from existing media processors: greater throughput from more parallelism and higher frequencies, larger on-chip memory hierarchies, and more regular architectures.
[Figure 1.7 – Programmable media processor design philosophy: a balance of throughput (fast clock speed, high parallelism, high utilization), storage (large on-chip memory, large register file, efficient memory I/O), and programmability (high connectivity, regular arrangement, optimizing compiler).]

Greater Throughput

Meeting the computing demands of next generation multimedia requires significantly higher throughput than currently possible. Achieving this throughput entails optimizing speed, parallelism, and utilization. Future media processors will therefore need to provide much higher frequencies and numerous additional parallel functional units. Multimedia applications have demonstrated considerable parallelism [56], so there should be sufficient work for these additional parallel resources. Maximum throughput can then be achieved with a compiler capable of scheduling for high utilization. As will be discussed below, more regular architectures improve programmability and help the compiler achieve higher utilization. The additional parallelism provided within future media processors will be organized both within a single processor to provide additional fine-grained parallelism with ILP, and within separate processing units on the same chip to provide coarse-grained
parallelism with parallel processing. While it is currently unknown to what degree the additional functional units will be used with respect to ILP and parallel processing, it is expected that parallel processing will account for most of the additional units.
A
significant amount of coarse-grained data parallelism resides in multimedia that is not effectively being utilized by existing media processors with ILP and subword parallelism. High frequency, parallel architectures are needed to realize the full potential of future media processors.

Larger On-Chip Memory Hierarchies

Large, aggressive on-chip memory hierarchies are necessary to accommodate the high data rates and minimize penalties from large external memory latencies. Multimedia applications are typically dominated by enormous amounts of data. Applications that process a single medium such as video or graphics already entail a significant amount of data, and this is compounded when applications use numerous types of media. Larger memories are needed to contain the memory image for each media type when supporting many media types simultaneously on the same processor. Additionally, reducing the increasing processor-memory gap is critical to multimedia applications, which are so memory dependent. Using more of the silicon area for memory structures can effectively reduce the external memory latency in two ways. First, larger memory hierarchies enable more of the working set to be contained on chip, reducing penalties when re-accessing previously used data. Also, multimedia applications often have predictable memory access patterns. Prefetching memory structures have been shown to be effective at reducing the memory latencies for many multimedia applications [57][58]. It is not currently known what types of memory hierarchies will be most suitable for multimedia, but providing additional silicon resources for on-chip memory hierarchies will enable innovative memory hierarchy designs that combine memory size and prefetching to most effectively reduce the impact of memory latency.
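As a rough illustration of why predictable access patterns help, the sketch below (not from the thesis) issues software prefetches a fixed distance ahead of a streaming pixel loop; __builtin_prefetch is a GCC/Clang extension, and the prefetch distance is an assumed tuning parameter.

```c
#include <stddef.h>
#include <stdint.h>

#define PF_DIST 256  /* assumed prefetch distance in bytes; machine dependent */

/* Apply a fixed-point gain to a pixel stream, prefetching ahead of use. */
void scale_pixels(uint8_t *dst, const uint8_t *src, size_t n, int gain)
{
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(src + i + PF_DIST, 0, 0);  /* read, low temporal locality */
        int v = (src[i] * gain) >> 8;
        dst[i] = (uint8_t)(v > 255 ? 255 : v);        /* saturate to 8 bits */
    }
}
```

A hardware prefetching structure would detect the same unit-stride pattern automatically, without explicit hints from the program.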
Increased Architecture Regularity

Programmability is key to the success of programmable media processors. Achieving high utilization of a processor’s computational resources is crucial for maximizing throughput. Such utilization can only be achieved with a flexible architecture that can support a full range of multimedia applications. Ideally, media processors should be able to achieve efficient high-level language programmability without requiring special libraries or programming paradigms. This goal puts the burden on the compiler to transform high-level language programs into high-performance assembly implementations. Conventional compiler wisdom indicates that a compiler can more effectively target architectures that are more tightly coupled and have a more regular organization. Consequently, future media processors are expected to have more regular architectures and less specialized hardware for better high-level language programmability and greater flexibility over a wide range of applications. Most existing media processors are not amenable to the desired levels of programmability and flexibility because they depend heavily on subword parallelism and specialized hardware to achieve performance. Currently no compiler techniques exist for automating the use of subword parallelism, and the use of individual specialized functional units goes against the nature of architecture regularity. Also, specialized hardware does not provide much flexibility, as it is only useful for specific multimedia
functions and represents a waste of silicon area otherwise. Future programmable media processors will see significantly less specialized hardware, and will eventually develop the compiler methods for automating the use of subword parallelism. The Texas Instruments VelociTI media processor is the first media processor to promote the ideals of a regular architecture for better programmability and flexibility. It uses neither subword parallelism nor specialized hardware, depending only on ILP and architecture regularity to achieve its performance. Because it has been one of the more successful media processors, it lends considerable credence to the belief that programmability and flexibility are becoming necessary to the success of programmable media processors.

Greater throughput, larger on-chip memory hierarchies, and more architecture regularity represent significant differences between existing and future media processors. These changes will enable future media processors to provide the flexibility, programmability, and computing power necessary to meet the needs of the next generation of multimedia. To understand the extent of this potential, the next section examines the potential performance of media processors over the coming decade using a trace-driven simulation methodology.
1.3.1 Potential of Parallel Media Processors

As examined in a recent publication of ours [59], there exists considerable potential in parallel media processors. Over the next decade, advancing VLSI technology will enable media processors that achieve significantly higher parallelism and frequency than existing media processors.
This potential was explored using a trace-driven
simulator to evaluate the performance of three key video coding applications: H.263,
MPEG-2, and MPEG-4. MPEG-2 is a popular video coding standard that exploits spatial and temporal redundancies to achieve high compression ratios for mid to high-resolution video. H.263 is a similar application targeted for very low bit-rate video, and MPEG-4 is an object-based video coder. The trace-driven simulation environment measures performance by instrumenting a program with probes and then running the program to produce a program trace. The trace contains a sequential listing of the operations executed in the program, providing their instruction and data addresses. The program trace is input to a simulator, which schedules the code for simulation on a specified processor architecture. The architecture specification includes a variety of resource parameters, such as issue width, functional unit resources, number of registers, and organization and size of the memory hierarchy. A detailed evaluation of video signal processors using the trace-driven simulation methodology is provided by Wu [60]. The trace-driven simulation results represent an upper limit on the potential performance of media processors. The simulator currently assumes perfect branch prediction and memory disambiguation. Additionally, the simulator’s scheduler uses a scheduling window of up to one billion instructions, far exceeding the size of most compilers’ scheduling windows, which are typically around a hundred instructions. These idealisms cause the trace simulator to overestimate performance, so trace-driven simulation represents an upper limit on the achievable performance.
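The following sketch (illustrative only; the field names are hypothetical and not those of the actual simulation tools) shows what one record of such a program trace might contain:

```c
#include <stdint.h>
#include <stdio.h>

/* One entry per executed operation, written by an instrumentation probe. */
typedef struct {
    uint32_t instr_addr;  /* address of the executed operation             */
    uint32_t data_addr;   /* effective address, for memory operations only */
    uint8_t  op_class;    /* operation class, used later by the scheduler  */
} trace_rec;

/* Probe inserted before each operation by the instrumentation pass. */
static void probe(FILE *trace, uint32_t ia, uint32_t da, uint8_t op)
{
    trace_rec r = { ia, da, op };
    fwrite(&r, sizeof r, 1, trace);  /* append record to the program trace */
}
```

The simulator can then replay these records in order, scheduling each operation onto the modeled architecture.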
Parallel Performance

The first experiment examines the performance of a 32-issue multi-cluster VLIW architecture. To enable high frequency, the datapath model of the multi-cluster architecture subdivides the issue slots into 8 clusters of 4 issue slots per cluster. Each cluster contains its own functional units, register file, and potentially its own memory. The issue slots of all clusters are tightly interconnected with a single-cycle crossbar network, so this architecture is nearly equivalent to a 32-issue VLIW architecture. The operation latencies are modeled after those in the Alpha 21264 [61], which are shown below in Table 1.2. This architecture will be discussed in greater detail later in this thesis, and more on it can be found in our previous works [62][60].

| Operation | Latency |
|-----------|---------|
| ALU       | 1       |
| Memory    | 2       |
| Shift     | 1       |
| Multiply  | 7       |
| Divide    | 34      |

Table 1.2 – Operation latencies for the Alpha 21264.
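For illustration, such a latency table translates directly into a scheduler lookup; a hypothetical C encoding of Table 1.2 (names are illustrative, not from the actual simulator) might be:

```c
/* Operation latencies from Table 1.2, indexed by operation class. */
enum op_class { OP_ALU, OP_MEM, OP_SHIFT, OP_MUL, OP_DIV };

static const int latency[] = {
    [OP_ALU]   = 1,
    [OP_MEM]   = 2,
    [OP_SHIFT] = 1,
    [OP_MUL]   = 7,
    [OP_DIV]   = 34,
};
```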
Figure 1.8 shows the performance results for encoding and decoding of the three video applications. This experiment evaluates only the peak parallel performance, so it assumes a perfect cache memory with no penalties for memory stalls. As is evident from the figure, there exists considerable parallelism in these video applications. Most are able to execute around 15 instructions per cycle (IPC), and even the application with the lowest degree of parallelism still has an IPC of 9. Additional experiments on video applications using trace-driven simulation are provided by Wu [60], including simulations with memory systems composed of cache and streaming memory structures.
[Figure 1.8 – Potential parallelism within select video applications evaluated using trace-driven simulation [59]: IPC for H.263, MPEG-2, and MPEG-4 decode and encode.]

Potential from Increasing VLSI Technology

With the increasing silicon resources during the next decade, it is expected that media processors will use the majority of the additional real estate for larger on-chip memory hierarchies.
More on-chip memory will allow higher datapath-to-memory
bandwidths and decrease average latencies for memory accesses. As mentioned earlier, it is still unknown what the ideal memory hierarchy structure is for media processors. For purposes of this experiment, we will only consider on-chip cache memories, as this is the most widely used and thoroughly studied memory model. Comparisons of different memory architectures and their impact on multimedia applications using trace-driven simulation can be found in the literature [63][64][60]. In all cache simulations, a block size of 64 bytes and two-way set associativity are used, as this configuration outperforms other cache configurations of similar area [58]. The same 32-issue multi-cluster VLIW architecture is used, but we now assume that 256
KB of on-chip cache (32 KB per cluster) is used in 1999 and the size doubles every two to three years, finally reaching 8192 KB (1024 KB per cluster) after ten years. The simulation results, taking into account clock rate increases according to the National Technology Roadmap for Semiconductors [55], are shown in Figure 1.9. As can be seen, the performance continues to double approximately every four years. As the number of transistors and speed increase, so does performance. Media processors can expect performance improvements in excess of 5x in the coming decade just from advancing VLSI technology.
[Figure 1.9 – Relative speedup for MPEG-2 coder (encoder and decoder) based on increasing frequency and transistor density over the next decade, 1999-2009 [59].]

As evidenced by these experiments, there remains significant potential for media processors.
Trace-driven simulation found an average parallel performance of 15
instructions per cycle for video applications, and an expected 5-6x speedup from increasing VLSI technology over the coming decade. While these results were obtained with trace-driven simulation, and so represent an upper limit on achievable performance,
the potential performance is very significant, so highly parallel programmable media processors are strong candidates for supporting the next generation of multimedia. These unique processors for multimedia, parallel media processors (PMPs), will offer both the performance of high frequency, highly parallel architectures, and the flexibility of simple, regular architectures that are easily programmable using high-level languages. While such processors would initially be more costly, the flexibility over a wide range of applications should allow for low cost through mass production.
1.4 Thesis Overview

Chapter 2 describes the design methodology and experimentation environment.
Design of programmable media processors, like that of any programmable processor, requires the architecture and compiler designs to be highly complementary in order to achieve the highest efficiency. The designs of the architecture and compiler must be well balanced so the compiler is not constantly saturating some architecture resources while rarely using others. Achieving this balance requires a design methodology that explores both the technology-driven hardware tradeoffs, to determine what architectural features are feasible, and the application-driven tradeoffs, to determine which architectural features are desired. Chapter 2 will present an overall design methodology for incorporating both technology-driven and application-driven evaluation, but focuses primarily on the latter, examining the application-driven architectural issues from a compiler perspective. An overview of the evaluation environment, including the IMPACT compiler and the MediaBench benchmark suite, is also given in this chapter.

Chapter 3 performs a workload evaluation of the MediaBench benchmark for purposes of examining the intrinsic characteristics of multimedia applications. This
workload evaluation examines such characteristics as operation frequencies, basic block and branch statistics, data types and sizes, working set sizes, spatial locality, loop statistics, and scheduling parallelism. These intrinsic characteristics of multimedia help define many of the fundamental architecture features necessary to programmable media processors, considerably narrowing the size of the overall design space.

Chapter 4 presents an architecture evaluation examining the impact of various architecture features on media processors. Most available commercial media processors have followed a static scheduling approach to media processor design, placing the bulk of the performance in the hands of the compiler and/or programmer. As media processors progress to higher frequencies and higher degrees of parallelism, more dynamic architectures may become necessary. Chapter 4 examines the effectiveness of static and dynamic hardware support for a variety of architecture features, including static versus dynamic scheduling, compiler optimizations, issue width, branch prediction, and instruction fetch mechanisms. It also explores the performance of various processor frequency models to determine the impact of increasing frequency on future media processors. Finally, the chapter examines some distributed architecture methods for enabling both high frequency and highly parallel media processor architecture designs.

Chapter 5 examines the lower levels of the memory hierarchy using a cache-based memory system. This evaluation examines the change in performance from varying cache parameters, including the L2 cache parameters of cache size, line size, and latency, and the external memory parameters of bandwidth and latency. Determining the impact of these various parameters on overall performance defines the bottlenecks of memory
performance in media processors. Recommendations for modifications to the memory hierarchy design are presented based on these results.

Chapter 6 examines the compiler issues for programmable media processors. It begins with an overview of the various levels of parallelism within multimedia. Existing media processors predominantly use subword parallelism and instruction level parallelism, but there is also significant parallelism at coarser granularities, which can be used to significantly improve performance. Data parallelism in particular is found to be the most likely method of coarse-grained parallelism available to the compiler. The second part of the chapter examines the Speculative Broadcast Loop (SBL) method, a speculative approach to data parallelism that allows the compiler to optimistically schedule for data parallelism (versus non-speculative techniques that require conservative scheduling). Presented first is a discussion of the compiler and architecture methods necessary for this speculative run-time technique, followed by an examination of the resulting performance improvements.

The results of Chapters 3-6 are combined in Chapter 7, which uses the results to propose a design for future media processors.
The proposed programmable media
processor is a parallel media processor (PMP) that uses a multi-cluster architecture and supports SBL parallel execution. The thesis concludes with Chapter 8, which presents the conclusions of this work, describes the contributions made by this thesis, and provides recommendations for future research on programmable media processors.
Chapter 2. Design Methodology
The traditional approach to microprocessor design uses existing processor designs, CAD tools, and benchmarks to provide a starting point in the design process. Existing processor designs provide an architectural reference point from which design modifications can be made. Application characteristics determined from the benchmarks are then used to define desirable architectural features. And finally, CAD design tools are used to narrow the design space and simulate potential designs for evaluating various architectural tradeoffs. With programmable media processors, this is not the case. Unlike microprocessor architecture design, the design of programmable media processors is a relatively immature field. There are no design tools, no established benchmarks, and only a small number of existing programmable media processors. This provides a very limited foundation from which to begin a new design. Approaching the problem of programmable media processor design therefore requires a different strategy. Our proposed design strategy for media processors allows concurrent evaluation of both the technology-driven hardware tradeoffs and the application-driven architecture tradeoffs. The technology-driven evaluation was done previously by Dutta [65], so this study focuses on the latter, performing an application-driven architecture evaluation from a compiler perspective.
Evaluating the architecture from a compiler perspective provides three benefits. First, the use of a compiler performs a representative mapping of an application from high-level language to assembly code, and so provides an accurate representation of the application characteristics. This contrasts with the trace-driven simulation methodology, which has inherently unrealistic assumptions, such as perfect branch prediction, perfect memory disambiguation, and an infinite-sized scheduling window, that limit the accuracy of the application characteristics.
Second, exploring architecture features through a
compiler enables evaluation of both the architecture and the compiler. Performance results can be used to perform iterative improvements to both the architecture and the compiler. Finally, the simultaneous evaluation of both the architecture and compiler also models the interaction between the two, and so produces an architecture and compiler that are highly complementary. Performing a compiler-based architecture evaluation requires an aggressive, retargetable compiler and a representative benchmark suite. With respect to multimedia, the MediaBench benchmark suite, defined at the University of California at Los Angeles, provides the most representative benchmark for multimedia [66].
It was designed
specifically to provide a workload representative of the multimedia and communications industry. The IMPACT compiler, developed at the University of Illinois at Urbana-Champaign, provides the necessary evaluation environment for this media processor architecture evaluation [67].
It is an aggressive, optimizing ILP compiler with a
retargetable backend that provides the necessary tools for simulation and performance evaluation as well as compilation.
This chapter shall continue with a discussion of the design methodology in Section 2.1. Section 2.2 then provides an overview of the IMPACT compiler tools, while Section 2.3 gives an introduction to the MediaBench benchmark suite. Section 2.4 concludes the chapter with a summary of the design methodology and how it is carried out in the subsequent chapters.
2.1 Design Methodology

The most accurate method of architectural assessment involves circuit-level
timing simulation of full processor layout and cycle-level simulation of full applications based on optimized, compiled code. Unfortunately, the development of such a simulation model is extremely expensive and can only be accomplished for a tiny subset of the design space. A more practical design strategy is required that obtains results over a much larger design space. The proposed design methodology includes two exploration paths. The first evaluates technology-driven design parameters such as circuit speed and area performance to determine what architectural features are possible. The second path examines the application-driven parameters such as cycle-level instruction behavior to determine what architecture features are desired. This is a popular method of processor design because it allows concurrent exploration of both paths and enables both paths to continually refine the design space throughout the design process. An early version of this design methodology can be found in [68].
This section presents an updated
methodology incorporating aspects from other proposed design methodologies. Shown in Figure 2.1, the first path of the design methodology, evaluating technology-driven hardware tradeoffs, was performed by Santanu Dutta for his Ph.D. thesis in 1996 [65]. He examined the circuit speed and area performance of register files,
memory structures, and interconnect networks using a 0.25 μm CMOS process. While this work is now dated, it provides a good understanding of the relative tradeoffs in area and frequency for key architectural features. His findings were then applied to the second design path, evaluating the application-driven architectural tradeoffs of various multi-cluster datapath alternatives for video signal processors [62]. This study, lacking compilation and simulation tools and a benchmark suite, evaluated the performance of key video kernels hand-scheduled onto the various architecture alternatives (as discussed in Section 6.1.4). The application-driven evaluation was continued thereafter for video signal processors using a trace-driven simulation environment by Zhao Wu in his recent Ph.D. thesis [60]. This thesis shall add to the existing work by extending the application-driven evaluation to media processors using a compiler-based evaluation environment that examines both the architecture and compiler issues of a full multimedia benchmark suite.
[Figure 2.1 - Design methodology with paths for: a) evaluation of technology-driven hardware tradeoffs (circuit analysis), and b) evaluation of application-driven architectural tradeoffs (architecture analysis), both refining the design space toward a processor design.]

The design methodology used for our compiler-based architecture evaluation is most aptly characterized by the Y-chart approach developed by Bart Kienhuis in his Ph.D. thesis [69]. The essence of the Y-chart approach is the Y-chart shown in Figure 2.2. This approach evaluates each potential candidate architecture, referred to as an Architecture Instance, by evaluating its performance on a suite of Applications, using a Mapping by a compiler to generate the optimized assembly code. Performance Analysis is then used to evaluate the performance of the resulting code. The implications of the results can be used for iterative improvements to the architecture instance, mapping, or applications, as shown by the three feedback loops marked with light bulbs in the figure. For our purposes, we desire a media processor that is high-level language programmable without requiring special libraries or iterative improvements by the programmer for performance. Consequently, the feedback loop to the Applications box is ignored, and the compiler (Mapping) must carry the burden of achieving performance.
[Figure 2.2 - The Y-chart, in which an Architecture Instance and a suite of Applications are combined by a Mapping and evaluated by Performance Analysis to produce performance numbers; light bulbs indicate the three areas that influence performance of programmable architectures [69].]

The Y-chart approach is particularly useful because it can be used at all levels of the design space exploration. Bart Kienhuis also defines an Abstraction Pyramid, shown in Figure 2.3, which describes the design process at each level of the design space exploration. The upper levels of the abstraction pyramid represent the early stages of the design space exploration. The first two levels are back-of-the-envelope and estimation
models, where the design is merely modeled by mathematical relationships under simple design assumptions. The next two levels, abstract executable models and cycle-accurate models, describe the correct functional behavior of the architecture instance. The first provides general behavioral performance metrics unrelated to time, while the second provides accurate timing performance as well. The lowest level is the synthesizable VHDL model, which describes the architecture completely with the accuracy of a model potentially realizable in silicon.

There are many tradeoffs to the various levels of the abstraction pyramid. Lower levels provide greater accuracy, but require significantly more time for building the model and evaluating performance. Also, the models at lower levels are so detailed that they can only explore a narrow range of the design space. Consequently, exploration at all levels of the design process is crucial for evaluating the full range of the design space. Our design space exploration of programmable media processors focuses on the third and fourth levels of the abstraction pyramid, which describe the architecture at the behavioral level. The first two levels are more general than needed, and the lowest level, the VHDL model, is much more specific than needed at this stage of the exploration process. Figure 2.4 presents the complete design methodology for this thesis. The Y-chart approach is slightly modified for our design methodology.
As
opposed to searching for a single architecture design for media processors, this thesis is more concerned with determining the performance effects of various architecture and compiler features and parameters. Therefore, instead of running a single architecture instance and then iteratively improving the architecture or compiler, the approach is modified to examine a set of architectures with different values/implementations of a
single parameter/feature. This provides an understanding of the effects of that feature with respect to media processors. However, the most promising features and parameter values found at higher levels of the abstraction pyramid are used to update the base architecture model when moving to lower levels of the pyramid.
[Figure 2.3 - The abstraction pyramid represents the trade-off between modeling effort, evaluation speed, and accuracy, the three elements involved in a performance analysis [69]. From back-of-the-envelope models and estimation models at the top, through abstract executable models and cycle-accurate models, down to synthesizable VHDL models at the bottom, the cost of modeling/evaluation grows from low to high while the range of alternative realizations of the design space that can be explored narrows.]
[Figure 2.4 - Outline of the design methodology used within this thesis: an architecture evaluation path (examine intrinsic characteristics of multimedia, Chap. 3; evaluate architecture features, Chap. 4 + 5) and a compiler evaluation path (evaluate parallelism in multimedia, and extracting coarse and fine-grained parallelism, Chap. 6) lead to the programmable media processor design (Chap. 7). The Y-chart boxes indicate the Y-chart approach is used at each level of the design process.]

The remainder of this thesis explores the design of media processors as outlined in Figure 2.4. The architecture evaluation path is taken first in Chapter 3, which explores the intrinsic characteristics of multimedia applications. This stage uses a simple single-issue architecture model with classical-only optimizations in conjunction with profiling and high-level simulation to explore a wide range of the design space and find the intrinsic properties of the given multimedia benchmark suite. The results of this stage are utilized in Chapter 4 and Chapter 5, which evaluate a variety of architecture and memory features, respectively, using a cycle-accurate simulator to determine which features are desirable in media processors. Chapter 4 compares static and dynamic scheduling and architecture features for media processors, while Chapter 5 examines memory issues, particularly for the lower levels of the memory hierarchy (i.e. level 2 cache and external memory interface), to determine the critical areas of media processor memory hierarchy
design.

Chapter 6 examines the compiler issues for media processors. It begins with an examination of the parallelism available within multimedia applications and how we used hand scheduling of key multimedia kernels to initially evaluate the desirable compiler methods. The second half of the chapter examines compiler methods for extracting parallelism and proposes the Speculative Broadcast Loop (SBL) technique, a speculative run-time method designed for supporting more coarse-grained parallelism in multimedia applications. Chapter 7 completes the design process by combining the results of the architecture and compiler explorations to propose a design for future media processors.
2.2 IMPACT Compiler

The evaluation environment for this media processor study was provided by the
IMPACT compiler developed at the University of Illinois at Urbana-Champaign [67][70]. The IMPACT compiler is ideal because it supplies not only an aggressive ILP compiler but also the simulation environment and performance analysis tools necessary for a thorough design space exploration using the design methodology outlined above. This section will provide an overview of the compiler and performance analysis tools, followed by a short introduction to its two primary ILP optimizations, the superblock and hyperblock.
2.2.1 Compiler

The IMPACT compiler is a profile-based ILP compiler that supports a variety of aggressive ILP optimizations. It transforms C source code into a low-level intermediate representation (IR) called Lcode, which is effectively equivalent to assembly code. The compilation process, as outlined in Figure 2.5, uses three intermediate representations: Pcode, Hcode, and Lcode.
Pcode and Hcode are two high-level intermediate
representations, but Lcode is the primary IR used by the compiler. With the exception of procedure inlining, nearly all compiler optimizations are implemented at the Lcode level.

[Figure 2.5 - Organization of the IMPACT compiler: gen_CtoP converts C to the Pcode high-level intermediate representation; Pinlining performs procedure inlining; gen_PtoH converts Pcode to the Hcode high-level intermediate representation; gen_HtoL converts Hcode to the Lcode low-level intermediate representation; Lopti performs classical compiler optimizations; Lsuperscalar performs superscalar optimization, including unrolling and superblock optimization; Lhyper performs hyperblock optimization; and Limpact performs scheduling and register allocation, producing optimized, scheduled Lcode. There are three primary optimization paths: a) leftmost path – classical-only optimizations, b) middle path – adds superblock optimization, and c) rightmost path – adds hyperblock optimization.]

The basic compiler method enables the choice of three primary optimization paths. The first path, indicated by the leftmost path in Figure 2.5, performs only classical optimizations.
On the second path, the middle path in the figure, superblock and
additional superscalar optimizations such as loop unrolling are applied in addition to
classical optimization. Finally, the third path, represented by the rightmost path in Figure 2.5, also includes the hyperblock optimization (see Section 2.2.3 for more on the superblock and hyperblock). These three compilation methods will be used throughout this thesis and are summarized again in Table 2.1 for ease of reference. After optimization, scheduling and register allocation are performed by the Limpact scheduler. The result is essentially optimized, scheduled assembly-level code.

| Compilation Method | Optimizations |
|--------------------|---------------|
| Classical (C)   | Classical optimizations and procedure inlining. |
| Superscalar (S) | Includes all optimizations in Classical, and adds the superblock optimization and loop unrolling. |
| Hyperblock (H)  | Includes all optimizations in Superscalar, and adds the hyperblock optimization. |

Table 2.1 – Three primary compilation methods in the IMPACT compiler.

The compiler offers a number of important features that make it ideal for this media processor study, including many aggressive ILP optimizations, a retargetable backend, profiling tools, a machine-independent low-level intermediate representation, and flexible programming support.
One of the foremost reasons for choosing the
IMPACT compiler is that it supports many of the ILP optimizations believed to be necessary for successfully achieving high degrees of parallelism in multimedia applications.
In addition to the traditional compiler optimizations such as constant
propagation and common subexpression elimination, IMPACT provides many advanced ILP optimizations such as loop unrolling, procedure inlining, data and control speculation, predication, and software pipelining⁸. The two primary global scheduling methods used in the IMPACT compiler are the superblock and hyperblock. Using profile information, these optimizations form larger scheduling regions to enable increased ILP using speculation and predication, respectively (see Section 2.2.3).

⁸ Software pipelining is not available in publicly available versions of IMPACT.

In addition to its aggressive ILP optimizations, the IMPACT compiler offers a profiling tool that executes the compiled code and then annotates the execution information back into the code [71][72]. While the accuracy of profiling is hotly debated with respect to general-purpose applications, the predictable nature of most multimedia applications should enable relatively accurate profiling.
Consequently, in media
processing, profiling is an invaluable tool to the compiler for identifying the primary paths of execution.
Information about a path’s execution weight enables the compiler to
identify where it should concentrate its efforts. Also, path execution weight is necessary in many global ILP optimizations, such as trace scheduling, and superblock and hyperblock formation, which attempt to increase performance on the critical paths of execution at the expense of potentially decreased performance on non-critical paths. Another important element is the fact that the IMPACT compiler’s low-level IR, Lcode, provides a generic, architecture-independent language ideal for evaluating the intrinsic characteristics of multimedia applications.
The Lcode representation is
essentially a large, generic instruction set of simple operations. The instruction set is similar to those found on most typical RISC architectures, but not biased towards any
particular architecture. Consequently, Lcode is perfect for an architecture-independent workload evaluation. Additionally, the IMPACT compiler provides a retargetable backend through the use of a machine-description language [73][74].
The machine description language
enables a full description of the instruction set architecture (ISA) of the target architecture, including its instructions, the number of architected registers, the number and type of functional units, and the instruction formats and their execution latencies. This enables compilation and simulation of a variety of different architectures and ISAs, potentially even existing ISAs⁹. Additionally, while the compiler was originally designed as a general-purpose compiler, its retargetable backend may be used to provide support for some of the signal processing operations typically seen in DSPs. It is anticipated that some of these special operations may be useful in media processors. So, while this study only utilizes a general-purpose processor ISA, special DSP/media support may be added later with full advantage of aggressive ILP scheduling available in either case.

⁹ IMPACT used to provide support for existing ISAs such as HP PA-RISC, Intel x86, and Sun Sparc, but this support is being phased out.

Finally, the compiler itself is a research compiler designed with tools to readily adapt it to the needs of a particular research project. Source code is provided for all optimizations, and is typically organized in a modular fashion.
Consequently, the
existing functions form an extensive API (application programming interface) from which additional optimizations may be constructed.
The ability to modify existing
optimizations and construct new ones is critical to the research and design of aggressive processor technologies like programmable media processors.
2.2.2 Performance Analysis

IMPACT provides a variety of performance analysis tools, including profiling, simulation, and a machine-description language, for evaluating the performance of various architecture and compiler features and parameters. The first of these, profiling, was also mentioned above with respect to compiler optimizations. The profiling tools keep track of the total number of times each operation executes, and then annotate these statistics back into the code after profiling. These instruction usage statistics are useful not only for assisting the compiler in identifying the primary paths of execution, but are also helpful for static performance analysis. The average instruction usage rates indicate such statistics as average operation frequencies, average basic block size by execution weight, and peak static branch prediction performance.
[Figure 2.6 – Implementation of the IMPACT emulation-based simulator: Lprobe instruments Lcode into a probed executable, Lencode produces a static code image, and Lsim combines the two with parameter files and the machine-description language to produce simulation performance results.]

For more accurate analysis, simulation is enabled by an emulation-based simulator. As shown in Figure 2.6, the simulator involves three tools: Lprobe, Lencode, and Lsim. The Lprobe tool constructs a probed executable, which produces a program trace when executed by the simulator, Lsim. Because the trace contains only dynamic information such as instruction control flow and memory access addresses, the trace is
combined with a static code image, produced by Lencode, to provide all the necessary information about each instruction being executed.
Having full static and dynamic
information about each instruction allows cycle-accurate simulation. Parameter files and the machine-description language are then used to fully describe the architecture features and parameters, enabling simulation of many different architectures and ISAs. Two modes of simulation are provided. The first evaluates performance using a high-level behavioral model, which provides non-timing-based statistics such as cache miss rates, operation frequencies, and basic block and branch statistics. The second mode returns cycle-accurate simulation statistics. High-level simulation is ideal in the earlier stages of design because it enables a broader search of the design space, providing less accurate information with less simulation time. Conversely, the low-level simulator is better at later stages of design since it provides much more accurate information at the cost of significantly greater simulation time. Included in the low-level simulation tools are two base architecture models: a VLIW architecture and an in-order superscalar architecture. To examine the complete range of architectures for static and dynamic scheduling, as detailed in Chapter 4, we augmented the simulator to also model an out-of-order superscalar architecture. The dynamic scheduler for the out-of-order superscalar uses a method similar to that in the AMD K6-2 processor [75].

Because cycle-accurate simulation of long traces can be immensely time consuming, the IMPACT simulation environment, like many common performance analysis environments [76], uses a sampling method for performing partial simulation of such traces. The sampling method specifies two parameters: the number of instructions in each simulation sample, and the number of instructions to skip between samples. The
IMPACT developers recommend a sample size of 200,000 instructions, with the number of instructions to skip specified by the following equation:
max_skip_size = max( min(1×10⁹, trace_size) / 50 − sample_size, 0 )
The above equation provides progressive degrees of sampling according to application size.
For applications with 10M instructions, full sampling is necessary, while
applications with 100M instructions and 1B instructions may require as little as 10% and 1% sampling, respectively. Sampling by these criteria is reputed to enable accuracy within 5% of that from simulating the entire trace [77]. This error range should certainly hold for multimedia applications, which have more predictable compute patterns than general-purpose applications. It is questionable, however, for what range of target architectures this accuracy holds. The IMPACT developers do not specify precise criteria regarding the acceptable range of target architectures. Other studies in trace sampling have found that sampling ratios of 10% typically work very well. A trace sampling evaluation by Martonosi et al. [78] found that sampling with a ratio of 10% and sample sizes of 0.5M instructions gave an absolute error of less than 0.3% when using smaller cache sizes (of up to 128 KB), but much larger sampling sizes are needed for cache sizes of 1MB and up. In our own simulations, we also found that accuracy degenerates on architecture simulations modeling long external memory latencies. Consequently, to ensure reasonable accuracy, we doubled IMPACT’s recommended sample size to 400,000 instructions, and typically used a skip size of only half that specified by the above equation. And since the majority of our simulations use cache sizes of 256 KB or less, we expect the simulation accuracy from trace sampling to stay well within the 5% error margin.
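For concreteness, the skip-size equation transcribes directly into code; the following is a minimal sketch (sizes in instructions, not the actual IMPACT source):

```c
#include <stdint.h>

/* Skip size between samples, per the equation above. */
int64_t max_skip_size(int64_t trace_size, int64_t sample_size)
{
    int64_t capped = trace_size < 1000000000LL ? trace_size : 1000000000LL;
    int64_t skip   = capped / 50 - sample_size;
    return skip > 0 ? skip : 0;  /* never negative */
}
```

With a 200,000-instruction sample, a 10M-instruction trace yields a skip size of 0 (full simulation), while a 1B-instruction trace yields roughly 19.8M, i.e. about 1% sampling, matching the figures above.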
2.2.3 Primary ILP Optimizations

Early ILP optimizations attempted to increase execution performance by reducing the execution time of each basic block in a program. They focused on the basic block because a basic block has only a single entrance and a single exit for control flow, so the optimizations could ignore control dependencies, simplifying optimization complexity. However, because basic block size is limited, with a typical basic block having no more than 5 instructions, the benefit from these optimizations is limited. Consequently, more advanced methods that optimize ILP over a range of basic blocks were needed. The intra-basic block optimizations are known as local scheduling techniques, while inter-basic block optimizations are global scheduling techniques. Additional background on instruction level parallelism is provided by Fisher et al. [79] and Rau [80].

Two common methods for global scheduling are control speculation and predication. Speculation is the process of scheduling an instruction to execute before it is known whether its result will actually be needed. The goal of speculation is to minimize the effects of operation latency by producing an operation result early, so that when the result is needed the program does not have to wait while generating it. Speculation typically involves moving an instruction across one or more control flow boundaries. It is necessary to ensure that such speculative code motion does not impact the alternate paths of execution. Live variable analysis and additional bookkeeping code can be used to maintain program correctness.
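A minimal C-level sketch of control speculation (illustrative only): the load is hoisted above the branch that guards it, so its latency is already covered if the result turns out to be needed; on the other path the value is simply discarded. A real compiler must additionally guarantee that the hoisted load cannot fault, for example by using a non-faulting speculative load opcode.

```c
/* Original form: the load executes only after the test.
 *     if (valid) return table[idx];
 *     return -1;
 * Speculated form: */
int lookup(const int *table, int idx, int valid)
{
    int t = table[idx];  /* hoisted above the branch: speculative   */
    if (valid)
        return t;        /* result needed: latency already hidden   */
    return -1;           /* misspeculated: t is simply discarded    */
}
```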
To accomplish this,
58 predication enables an extra source operand, called a predicate, to be added to each instruction. If an instruction with a predicate source executes, the predicate source is evaluated to either true or false. If the predicate is false, execution of the instruction is cancelled, otherwise execution proceeds normally.
Predication eliminates control
dependencies by converting branch operations to compare instructions that set a predicate. Instructions dependent upon the branch operation are then predicated with the appropriate value of that predicate, so that only operations on the taken path will be executed if that branch had been taken, and only operations on the fall-through path will be executed if the branch had not been taken. In this manner, the control dependency on the branch is removed, replaced by a data dependency on the compare instruction that defines the predicate.
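In C terms, if-conversion of a simple maximum might look like the following sketch (on predicated hardware the two assignments would carry complementary predicate guards rather than a C conditional expression):

```c
/* Branch form: the result is control-dependent on the branch. */
int max_branch(int a, int b) { if (a > b) return a; return b; }

/* If-converted form: the compare defines a predicate p, and the
 * selection becomes data-dependent on p, with no branch to predict. */
int max_pred(int a, int b)
{
    int p = (a > b);   /* compare instruction sets the predicate   */
    return p ? a : b;  /* select; both "paths" live in one block   */
}
```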
There have been a variety of proposals for supporting speculation and predication in the compiler. Considerable research at IBM T.J. Watson Research has gone into the development of tree-VLIW instructions called tree regions, where each instruction is a separate region that combines multiple execution paths [81][82]. This multi-path compilation approach supports speculation and provides a predicate mask in the control header of each tree-VLIW instruction.
The Tinker group at North Carolina State
University adopts a similar multi-path tree structure, called a treegion, but the region is extended from a single parallel instruction to include multiple parallel instructions [83][84]. HP Labs uses an approach similar to the IMPACT group for ILP, but provides a more extensive ISA with the HP PlayDoh architecture [85]. The IMPACT compiler’s methods for utilizing speculation and predication involve the superblock and hyperblock optimizations, respectively. Each of these will be discussed in turn below.
Superblock

One early proposal for global scheduling using control speculation is trace scheduling [86][87][88]. The trace scheduling method attempts to increase performance on the most critical paths in a program. The optimization first divides the most frequently executed paths into sets of traces, with each trace defining a control path containing numerous basic blocks. Each trace is then scheduled, using control speculation while ignoring control flow boundaries. Because there may exist side entrances from off-trace instructions branching into the trace, or side exits where the control flow branches off-trace, bookkeeping must be performed at the side entrances and exits after scheduling to satisfy all the control dependences and ensure the correct execution of off-trace paths.
The result is that performance of the trace is
optimized using speculative code motion, while off-trace paths have reduced performance due to the bookkeeping overhead.
[Figure 2.7 – Superblock formation and tail duplication: (a) trace A-B-C-D with a side exit from A through G and a side entrance from F at C; (b) resulting superblock after tail duplication of C and D into C’ and D’.]
One problem with trace scheduling is that the bookkeeping needed for handling side entrances to the trace can be extremely cumbersome and reduces the effectiveness of further optimization of off-trace paths.
For improved performance, the IMPACT
compiler developed the superblock optimization [89]. A superblock is effectively a trace that does not have any side entrances, although it may still have side exits.
Tail
duplication is used to eliminate the side entrances to the superblock. An example of superblock formation and tail duplication is shown in Figure 2.7. Figure 2.7(a) shows the trace before superblock formation. Basic blocks A, B, C, and D reside on a critical path through the program. Basic block G represents a side exit from the trace at A that enters back into the trace at D. Basic block F represents a side entrance to the trace at C. To eliminate the side entrances by F and G, basic blocks C and D are duplicated and removed from the trace. Basic blocks F and G now branch to C’ and D’, which are no longer part of the trace, so there are now no side entrances to the trace. Basic blocks A, B, C, and D now define the resulting superblock, as shown in Figure 2.7(b), with no side entrances and a side exit from A to G. Speculative code scheduling can now be applied to the superblock as it would similarly be used in trace scheduling, but without the added bookkeeping necessary for supporting side entrances.
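At the source level, tail duplication amounts to giving the off-trace path its own copy of the shared tail block. The following conceptual sketch (block names follow Figure 2.7, with empty placeholder bodies, and the side-entrance structure simplified to one off-trace path):

```c
static void A(void) {} static void B(void) {} static void C(void) {}
static void D(void) {} static void G(void) {}

/* Before: block D has a side entrance from the off-trace path through G. */
void before(int hot)
{
    A();
    if (hot) { B(); C(); }
    else     { G(); }
    D();                          /* reached from both paths */
}

/* After tail duplication: the hot trace A-B-C-D has a single entrance
 * and can be scheduled as a superblock; the off-trace path uses copy D'. */
void after(int hot)
{
    A();
    if (hot) { B(); C(); D(); }   /* superblock candidate */
    else     { G(); D(); }        /* duplicated tail D'   */
}
```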
Hyperblock

A common global scheduling method for supporting predication is if-conversion [90][91]. If-conversion is a popular method of global scheduling on code regions in which the control flow splits from one basic block, and then merges back together in a later basic block. An example of such a region is depicted in Figure 2.8(a), in the region encompassed by basic blocks A-G: the paths initially split at block A, but then re-join at G. In such split-join code regions, if-conversion can be used to remove all control dependencies and convert the multi-path code segment to a single-path code segment using predication. All control dependencies are then converted to data dependencies on predicated operations, which enables greater opportunity for ILP. Two problems exist with if-conversion, though. First, not all of these paths may be frequently executed paths. It is not desirable to include operations on infrequently executed paths. These operations will be rarely executed and consequently represent wasted resources in critical traces. The second problem is that the various basic blocks may have significantly different lengths.
Some basic blocks may have very long
execution times whereas others may be quite short. It is not desirable to merge two paths with very different lengths, since the long paths will impose their lengths on the short paths, thereby considerably reducing performance on the short paths.
[Figure 2.8 – Hyperblock formation with tail duplication: (a) split-join region of basic blocks A-G with a side entrance from H; (b) resulting hyperblock of blocks A, B, C, D, E, and G after tail duplication of C and G into C’ and G’.]

The IMPACT compiler addresses these two problems using the hyperblock optimization [92]. The hyperblock forms regions in a similar fashion to if-conversion, but instead of forcing every block in the split-join region to be included, it uses tail
duplication and allows side exits. Therefore, only the frequently taken paths of similar lengths need be included in the hyperblock. Paths that are too long or too short can be eliminated via a side exit, and infrequently taken paths can similarly be excluded using side exits. Figure 2.8(a) shows a split-join region where paths ABCG and ADEG represent frequently taken paths of relatively similar lengths. Path ADFG may be either rarely taken or of significantly dissimilar length, so it has been excluded.
And like
superblocks, no side entrances are allowed to the hyperblock, so block H must be removed via tail duplication. The resulting hyperblock containing blocks A, B, C, D, E, and G is shown in Figure 2.8(b). Tail duplication was performed on blocks C and G to eliminate the side entrances of F and H. The resulting hyperblock is now free of all control dependences except the side exit at D, and can be aggressively scheduled with other local and global ILP optimizations, including speculation¹⁰. The IMPACT compiler has proved an invaluable tool for this media processor study. It provides both the compiler and simulation tools, along with the flexibility to examine a variety of different architectures and the tools for modifying and creating new compiler optimizations. The IMPACT compiler defines a complete architecture and compiler evaluation tool, enabling a thorough exploration of the design space for programmable media processors.
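As a rough C-level sketch of tail duplication (hypothetical control flow, loosely mirroring the Figure 2.7 discussion; only one block is duplicated for brevity), the side entrance from block G back into the trace is redirected to a duplicated copy of the tail:

    /* Before: G re-enters the trace A-B-C-D at block D. */
    void before_tail_duplication(int c1, int *x) {
        if (c1) goto G;     /* block A: side exit from the trace      */
        *x += 1;            /* block B                                */
        *x += 2;            /* block C                                */
    D:  *x += 3;            /* block D: G re-enters the trace here    */
        return;
    G:  *x -= 1;            /* off-trace block G                      */
        goto D;             /* side entrance back into the trace      */
    }

    /* After: G branches to a private copy D', so the trace A,B,C,D has
     * no side entrances and can be scheduled as a superblock. */
    void after_tail_duplication(int c1, int *x) {
        if (c1) goto G;     /* block A                                */
        *x += 1;            /* block B                                */
        *x += 2;            /* block C                                */
        *x += 3;            /* block D                                */
        return;
    G:  *x -= 1;            /* off-trace block G                      */
        *x += 3;            /* D': duplicated copy of D               */
        return;
    }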
¹⁰ Superblock scheduling can be applied to hyperblocks, as indicated by the third optimization path in Figure 2.5.
2.3 MediaBench benchmark suite

A benchmark suite representative of the multimedia industry is necessary to
complete this evaluation environment for programmable media processors. While such a requirement is relatively simple to accommodate in the general-purpose processor industry, this is not so with media processors. Until recently, there were only a few benchmarks that encompassed small niches of the multimedia industry, such as audio processing, image processing, or small DSP benchmarks. More commonly, only small kernels, such as filters, the discrete-cosine transform (DCT), or motion estimation, are used in evaluating processor performance. SPEC is currently working on a multimedia benchmark, SPECmedia, but it is not scheduled for release until early 2000, and it will initially focus only on the MPEG-2 motion video coder and key kernels for audio, video, and speech [93]. The only currently available benchmark containing full system-level media applications is MediaBench. The MediaBench benchmark, introduced by Lee, Potkonjak, and Mangione-Smith in late 1997 [66][94], is the first combination of multimedia applications to truly represent the overall multimedia industry. The benchmark was designed specifically to focus on portable applications written in a high-level language that are representative of the workload of emerging multimedia and communications systems.
It incorporates
multimedia applications written in C, ranging from image and video processing, to audio and speech compression, and even encryption and computer graphics. Table 2.2 gives a short description of the applications in the MediaBench benchmark suite.
A more
thorough description of the benchmark, as well as links to the various application developers, can be found at the MediaBench website [94].
Application    Description
ADPCM          A simple adaptive differential pulse code modulation scheme for audio compression (rawcaudio) and decompression (rawdaudio)
EPIC           An image compression coder (epic) and decoder (unepic) based on wavelets and run-length/Huffman entropy coding
G.721          Voice compression coder (g721enc) and decoder (g721dec) based on the CCITT G.711, G.721, and G.723 standards
Ghostscript    Interpreter (gs) for the PostScript language; performs file I/O but no graphical display
GSM            Full-rate speech transcoding coder (gsmencode) and decoder (gsmdecode) based on the European GSM 06.10 provisional standard
H.263          Very low bit-rate video coder (h263enc) and decoder (h263dec) based on the H.263 standard; performs file I/O but no graphical display; provided by Telenor R&D [95]
JPEG           Lossy image compression coder (cjpeg) and decoder (djpeg) for color and gray-scale images, based on the JPEG standard; performs file I/O but no graphical display
Mesa           3-D graphics library clone of OpenGL; includes three demo programs (mipmap, osdemo, texgen); performs file I/O but no graphical display
MPEG-2         Motion video compression coder (mpeg2enc) and decoder (mpeg2dec) for medium to high-quality video transmission, based on the MPEG-2 standard; performs file I/O but no graphical display
MPEG-4         Motion video compression coder (mpeg4enc) and decoder (mpeg4dec) using an object-based representation; based on the MPEG-4 standard; performs file I/O but no graphical display; provided by the European ACTS project MoMuSys [96]
Pegwit         Public-key encryption and authentication coder (pegwitenc) and decoder (pegwitdec), using elliptic curves over GF(2^255), SHA1 for hashing, and the Square block cipher for encryption
PGP            Public-key encryption coder (pgpencode) and decoder (pgpdecode) providing a hybrid-RSA encryption method, data compression, message digests for digital signatures, and sophisticated key management
Rasta          Feature extraction (rasta) for speech recognition, which supports the PLP, Rasta, and Jah-Rasta feature extraction techniques
Table 2.2 – Description of MediaBench benchmark suite.
Media      Application Package    Programs
Video      MPEG-2                 mpeg2dec, mpeg2enc
           H.263                  h263dec, h263enc
           MPEG-4                 mpeg4dec, mpeg4enc
Image      JPEG                   cjpeg, djpeg
           EPIC                   epic, unepic
           Ghostscript            gs
Graphics   Mesa                   mipmap, osdemo, texgen
Audio      ADPCM Coder            rawcaudio, rawdaudio
Speech     GSM                    gsmdecode, gsmencode
           G.721                  g721dec, g721enc
           Rasta                  rasta
Security   PGP                    pgpdecode, pgpencode
           Pegwit                 pegwitdec, pegwitenc

Table 2.3 – Breakdown of MediaBench applications by media type.

The MediaBench benchmark suite is broken down according to the six media types of video, image, graphics, audio, speech, and security, as shown in Table 2.3. There are a total of 25 benchmark programs among the 13 application packages. MediaBench was initially proposed with only one video application, MPEG-2. However, because video is one of the most critical media types in terms of computational complexity, and each of the benchmarks is given equal weight with respect to the results, we chose to augment MediaBench with two additional video applications, H.263 and MPEG-4. H.263 and MPEG-4 were chosen because they are distinct from MPEG-2, which targets medium to
high-quality video compression. H.263 was designed for very low bit-rate video, while MPEG-4 supports object-based video, which entails considerably more video processing and computational complexity. The addition of these two applications gives video a marginally greater weight than any of the other media types, which is quite reasonable because of its importance in media processing. Furthermore, the addition of H.263, and MPEG-4 in particular, is believed to make the benchmark more representative of emerging and future applications. And finally, these two applications enable comparison with the results of similar trace-driven simulation studies by Wu and Wolf [97][64][58]. Hereafter, all references to MediaBench refer to our augmented version of MediaBench. Some statistics for each benchmark's data sets are given in Table 2.4. The first column represents the number of static instructions in the assembly code (Lcode) when compiled onto a single-issue processor using classical-only optimizations. The last four columns provide statistics for the two data sets, including input data file size and number of dynamic instructions (determined using profiled instruction usage statistics). As is evident when comparing the two data sets, the number of dynamic instructions from input 2 is typically larger than that of input 1. Consequently, input 2 was chosen as the data set for profiling (training), since profiling with the larger data set should provide greater profiling accuracy. Also, simulation is significantly more time consuming than profiling, so simulating with the smaller input data set is more practical. Also evident from Table 2.4 is that MediaBench, while designed to be representative of the multimedia industry as a whole, is still in the early stages of development. There remain a number of areas for improvement, including the size and type of its applications and the size and number of input data sets. Some of the
applications in the suite, such as ADPCM, G.721, EPIC, and Pegwit, are relatively small programs that are not as representative of more complete system-level applications. Also, as we adjusted the weight of video in the benchmark suite by adding two additional video applications, it may be desirable to further change the make-up of the benchmark by including additional audio benchmarks, while possibly reducing the number of speech and security benchmarks. It would be particularly pertinent to add a higher-quality audio benchmark such as MP-3 to MediaBench.
Table 2.4 – MediaBench input data set characteristics. (Columns: program; number of static instructions in the compiled Lcode; and, for each of the two input data sets, the input data file size and the number of dynamic instructions.)
The input data sets also need further development. The graphics applications need to be re-written to enable additional data sets, and the size of some of the input data sets needs to be increased significantly. Many of the applications, including ADPCM, EPIC, JPEG, Mesa, Pegwit, and Rasta, currently provide small input sets (< 20M dynamic instructions) for one or both of the input data sets. Larger data sets are needed to avoid potentially skewing some of the architecture evaluation results, particularly those results associated with memory performance.
Additional data sets would also be
beneficial for enabling additional profiling and further performance analysis. Finally, some of the input parameters do not reflect typical operating parameters for those applications. For example, the MPEG-2 application is directed to use floating-point for computing the discrete-cosine transform (DCT) when an integer implementation is available and would typically be used. Overall, MediaBench is an excellent initial rendition of a multimedia benchmark suite, but we hope it continues to undergo refinement for more representative results in future analyses. Through the course of this thesis, the MediaBench applications are examined at three levels of granularity. The coarsest level of granularity evaluates the aggregate performance of the entire benchmark suite. This level provides an initial indicator for multimedia performance. Two finer levels of granularity provide results by media type and for each individual benchmark. When performance of a media type or benchmark deviates significantly from the average performance, this thesis shall endeavor to examine the implications of those results. For purposes of examining by media type, the various benchmarks are grouped into their six media type categories as delineated in Table 2.3. Also, because so many of the benchmarks are compression applications, the
compression benchmarks will also be identified by the encoder and decoder groupings, as specified by Table 2.5.
It is expected that the performance of these two types of
compression will deviate significantly in some areas, particularly with respect to memory characteristics.

Method   Programs
Decode   mpeg2dec, h263dec, mpeg4dec, djpeg, unepic, rawdaudio, gsmdecode, g721dec, pgpdecode, pegwitdec
Encode   mpeg2enc, h263enc, mpeg4enc, cjpeg, epic, rawcaudio, gsmencode, g721enc, pgpencode, pegwitenc

Table 2.5 – Breakdown of MediaBench benchmarks by compression direction.
2.4 Summary

This chapter provides an overview of the design methodology and evaluation
environment used in our exploration of media processors. As we indicated in Section 2.1, we are performing an application-driven architectural exploration of the design space for programmable media processors. The design methodology uses the Y-chart approach at multiple levels of abstraction (shown in Figure 2.4) to enable a thorough exploration of the design spaces for both the architecture and compiler. The IMPACT compiler and MediaBench benchmark suite provide an effective evaluation environment as they provide all the necessary tools for the Y-chart design approach. This thesis shall continue in Chapters 3-5 with an exploration of media processor architecture, and then continue in
Chapter 6 with a similar evaluation of compilers for programmable media processors. Chapter 7 shall use the results from the prior chapters to propose a potential parallel media processor implementation.
Chapter 3. Intrinsic Characteristics of Multimedia
Design of programmable media processors requires an accurate understanding of multimedia characteristics from a compiler perspective. While some of the fundamental characteristics of multimedia applications are already understood from a qualitative perspective, performing a workload evaluation of a multimedia benchmark suite will enable a quantitative understanding of these and other intrinsic characteristics. Using the MediaBench benchmark suite in conjunction with the IMPACT compiler, this chapter begins the first stage of our design methodology by examining the fundamental properties of multimedia applications.
Among the properties being evaluated are operation
frequencies, basic block and branch statistics, data types and sizes, memory characteristics such as working set size and spatial locality, loop statistics, and instruction level parallelism. From these properties we are able to draw initial conclusions about many of the media processor architecture features, including the type and ratio of functional units, datapath size, branch architecture, and basic memory structure. This information will be used to significantly narrow the design space before proceeding with the second phase of the architecture design exploration in Chapter 4. It will also be beneficial during the compiler design space exploration in Chapter 6.
Multimedia Qualitative Understandings

Multimedia applications are generally understood to have certain characteristics distinct from typical general-purpose applications. Such characteristics include intense computational loads, large amounts of streaming data, significant processing regularity that affords extensive parallelism, real-time constraints, and a tendency towards small integer data types.
Additional research on media processors further contends that
multimedia also has considerable control complexity in the less computationally intensive program sections [98][99]. Of course there is some variation in the characteristics of different media types, but these qualitative understandings define the most common characteristics.
This chapter will provide a more quantitative understanding of
multimedia characteristics and to what degree they are true for the various media types. This media processor research study shall proceed to examine many of these well-known multimedia qualities, as well as examine a variety of other characteristics. The intensive processing demands and large amounts of data are readily apparent from Table 2.4 in the last chapter.
Video applications are particularly data and computation
intensive, requiring between a few million (H.263) and many tens of millions (MPEG-4) of operations for decoding each video frame, and orders of magnitude more for encoding. Speech and security benchmarks are also quite computationally intensive, with speech benchmarks like G.721 requiring hundreds of millions of operations for coding/decoding less than half a minute of audio, and security benchmarks like PGP requiring hundreds of millions of operations to decode documents originally smaller than 100 KB. Image and graphics benchmarks are less intensive, but only because they process a single frame.
Like the transition from image to video, the transition from still to motion graphics will increase graphics' computation and data loads immensely. Multimedia processing regularity and control complexity shall be examined both in this chapter and in Chapter 4. In Section 3.6 of this chapter, it shall be seen that multimedia is heavily loop-oriented with a large number of iterations per loop, providing much processing regularity, but also has unexpected degrees of intra-loop control complexity. In Chapter 4, the issue of control will be examined while comparing the performance of dynamic out-of-order scheduling to static scheduling. It shall be seen that control complexity limits the effectiveness of static scheduling, while dynamic out-of-order scheduling provides an average of 60-80% better performance. Section 3.4 shall address the issue of multimedia's tendency to use small data types. This section uses profiling to determine the actual data type and size necessary for each multimedia operation and shows that multimedia uses small integer data types of 16 bits or less more than 65% of the time. The issue of real-time constraints in media processing will not be examined in this study, as this area has already received a good deal of attention. This thesis instead assumes that a real-time operating system can be used to support the real-time constraints, assuming the necessary processing power is available. The focus of this work is to achieve the highest levels of processing performance and thereby enable sufficient computing power for a real-time operating system to provide real-time performance. For further reference on real-time operating systems for multimedia, Steinmetz [100] and Nieh et al. [101] provide good overviews of the issues and describe potential implementations.
Evaluation Method

To accurately evaluate the intrinsic characteristics of the multimedia applications, the compiler was set to apply only classical optimizations while compiling the benchmark suite. Using solely classical optimizations, the compiler applies only those optimizations that eliminate redundancies in the code at the assembly level, such as common subexpression elimination and constant propagation. More aggressive optimizations such as loop unrolling, procedure inlining, or global scheduling are specifically disallowed, as they can add or remove non-redundant operations and can also change the size of basic blocks. Such modifications change the characteristics of the workload. Using only classical optimizations and compiling to Lcode, the IMPACT compiler's generic instruction set architecture, provides the most accurate method for measuring multimedia application characteristics. The architecture model used for the evaluation procedure was a single-issue processor with a RISC-like instruction set, as defined by Lcode, IMPACT's low-level intermediate representation. The assumed operation latencies were 1 cycle for integer ALU operations, 2 cycles for loads, 3 cycles for multiplies and floating-point operations, and 10 cycles for divides. The architecture model supported a large number of registers (64 integer registers and 64 floating-point registers), in order to exclude spill and fill code while measuring the intrinsic application characteristics. The remainder of this chapter continues with an exploration of the intrinsic characteristics of applications in the MediaBench multimedia benchmark suite. The implications of these characteristics on media processor architecture will also be discussed. Related work is first discussed in Section 3.1. Sections 3.2 and 3.3 then begin by
examining the operation, basic block, and branch statistics, which define the basic functional necessities and branch architecture.
Section 3.4 uses value profiling to
examine the actual data types and sizes used by multimedia. Knowledge of data types and sizes has important ramifications on datapath size and the utility of subword parallelism.
Section 3.5 performs a cache analysis, which determines the data and
instruction working set sizes and spatial locality, and provides insight into the memory requirements for media processors. Section 3.6 examines the loop characteristics of the various media types, evaluating loop weight according to loop level, and defining the path ratio and average number of iterations for loops in each application. And finally, the applications are scheduled onto a simple 8-issue processor for initial determinations of the available instruction level parallelism (ILP) in Section 3.7. The chapter concludes with a short summary in Section 3.8.
3.1 Related Work

Up until MediaBench was presented in late 1997 [66], there was not a significant
amount of research that studied characteristics of the entire multimedia field. For lack of a benchmark, prior research up to that point had focused primarily on small niches of the multimedia industry, and often examined the characteristics of kernels or small benchmarks. Kernels and toy benchmarks do not adequately define the industry, so the presentation of MediaBench marks the first real characterization of the multimedia industry¹¹. In this first paper, MediaBench and SPECint95 were compiled and simulated on a single-issue processor in the IMPACT environment, and performance was compared for the following characteristics: instruction and data cache read and write miss rates (on
a single 16 KB direct-mapped cache), bus utilization, branch statistics, integer ALU utilization, and IPC (instructions-per-cycle) results. Of these characteristics, MediaBench was found to have better performance than SPECint95 with respect to IPC, instruction cache hit rates, data cache read hit rates, and memory bus utilization. Only in data cache write hit rates was MediaBench found to have lower performance than SPECint95, and this is consistent with the streaming nature of data found in multimedia (when using a no-write-allocate data cache policy). A subsequent study evaluated MediaBench performance on a variety of different architectures [102]. Simulations were performed to determine the best price-performance point on media processor architectures with up to 8 issue slots, data and instruction cache sizes of 512 bytes to 8 KB, and different numbers and types of functional units. The performance was compared with a PowerPC 604, and the results indicated that media processors could achieve up to 3x performance over the PowerPC 604. Additionally, it was found that there was little performance benefit to having more than 3-4 issue slots. Evaluations of video signal processors (closely related to media processors) have been performed using trace-driven simulations [56][97][64][58] that evaluate parallelism, operation frequencies, and memory performance. However, these studies do not use compiled code, but code generated in a trace-driven simulation environment with assumptions such as perfect branch prediction, perfect memory disambiguation, and an infinite-sized scheduling window. Consequently, the performance results are much more idealistic than the results for compiled code. The trace-driven simulation results are best used to define an upper bound on potential performance.

¹¹ Information on studies prior to the introduction of MediaBench can also be found in [66].

Unlike the multimedia industry, the characteristics of general-purpose applications have been extensively researched. We refer the reader to Hennessy and Patterson [103], which provides an extensive overview of many of the characteristics for the SPEC benchmarks.
It shall also provide the primary vehicle for many of our
comparisons of multimedia and general-purpose application characteristics. The work presented in this chapter is designed to be much more comprehensive than prior studies. It not only examines many of the same characteristics over a broader range of the design space, but also provides an evaluation of many other characteristics such as memory spatial locality, data types and sizes, and loop statistics. The work presented in this chapter was first presented in [104], but it is extended here by incorporating loop statistic information and by breaking down the results according to the various media types, as defined in Table 2.3 and Table 2.5.
3.2 Operation Frequencies

Defining the correct resource balance in a media processor is of critical
importance. Not having enough of the desired resources lengthens the execution time of applications, while having too many underutilized resources forces extra area, lowers yield, and increases wire length and cycle time. Achieving the proper balance requires consideration of the operation frequencies and instruction level parallelism. While it is currently unknown what degree of parallelism can be obtained from a compiler, knowledge of the operation frequencies can be used to define the appropriate resource ratios.
Using the IMPACT compiler's profiling tool, described in Section 2.2.1, we obtain profiled Lcode that indicates the execution frequency of each operation in the program. This workload evaluation extracted the operation frequencies and provided aggregate results for various categories of operations, including integer and floating point, arithmetic versus logic and compares, and all control flow operations. The aggregate results for operation frequencies are reported in Figure 3.1 and Figure 3.2. Figure 3.1 displays the average operation frequencies over all benchmarks in MediaBench, while Figure 3.2 examines the integer and floating-point load and store frequencies for each media type.
Figure 3.1 – Average operation frequencies for all applications in MediaBench.

Examination of Figure 3.1 indicates multimedia operation frequencies are relatively similar to those for general-purpose applications, but there are a few exceptions. One major difference is that multimedia has significantly more arithmetic operations than general-purpose applications. The frequency of arithmetic operations on the DLX and Intel x86 architectures averages 14% for SPECint92 [103], while it is over 25% for multimedia (over 30% if you include the frequency of MOV operations). The extra arithmetic operations mean there is typically more processing per data element, so
the frequency of memory accesses is a bit lower (about 5-10%). Additional differences include less floating-point usage. Also, the percentage of compares is much lower, but this is because the Lcode instruction set provides ‘compare and branch’ operations, which eliminate the need for most compare operations. One big surprise is that the multiply operation is used less than 2% of the time. This is still twice the frequency of multiplies in general-purpose processors, but because DSP processors rely heavily on the multiply-accumulate operation, it was expected that the multiply operation would be more highly used. However, strength reduction is able to convert many of the multiply (and divide) operations for powers of 2 into shift operations. This would account for the low multiply
and divide usage and the unusually high usage (10%) of shift operations.
Figure 3.2 – Load and store integer and floating-point frequencies by media type.

To define the appropriate ratio of functional resources, we assign the operations to six basic functional unit groups: an integer ALU (IALU), a memory unit (LS), a branch unit (BR), a shifter (SH), a floating-point unit (FPU), and a multiplier (MULT). There could also be a divider, but from the negligible use of divide operations, a software
implementation is probably sufficient. Assuming an integer ALU can perform arithmetic operations, compares, logic operations, and moves, its overall usage frequency would be about 40%. As illustrated in Figure 3.2, the memory unit must support about 18% of the operations for integer and floating-point loads, as well as 8-9% for store operations. The shifter receives 10% usage, while the branch unit supports nearly 20% of operations for conditional branches, jumps, calls, and returns. The final 3-4% of operations are used by the multiplier and remaining floating-point operations. From this initial perspective, the ratio of resources might appear as follows:
• (IALU, LS, BR, SH, FPU, MULT) => (8, 5, 4, 2, 1, 1)
This ratio must also be able to accommodate significant variations in usage for certain critical units. From Figure 3.2 it can be seen that the percentage of loads and stores can vary significantly among the different benchmarks. For some applications, particularly video, image, and graphics applications, the frequency of loads and stores can reach upwards of 30-35%. It is also evident from this figure that the ratio of loads to stores can also vary considerably, as indicated by the differences between encode and decode compression applications. Similar to memory usage, floating-point operations are heavily utilized in certain applications like epic, mpeg2dec¹², rasta, and the Mesa applications of mipmap, osdemo, and texgen, where non-memory floating-point operations can reach up to 20% usage. With these considerations, and the fact that it can be difficult to support more than one branch per cycle, a more reasonable ratio would be as follows:
• (IALU, LS, BR, SH, FPU, MULT) => (3, 2, 1, 1, 1, 1)

One final consideration is that as more aggressive compiling methods are used, additional operations are introduced through speculation and predication that tend to increase the usage of the integer ALU unit. So, for more aggressive compiling strategies, the following ratio of resources might prove the most efficient:

• (IALU, LS, BR, SH, FPU, MULT) => (4, 2, 1, 1, 1, 1)

¹² The MPEG-2 decoder's use of floating point is an irregularity of MediaBench. The current parameters call for use of floating point for the DCT, but an integer version would be used under most circumstances.
This resource ratio is similar to the results from trace-driven simulation studies by Wu and Wolf [97].
For a 32-issue video signal processor (VSP), they found the
appropriate ratio of resources for a VLIW VSP to be 24 ALUs, 16 memory units, 8 shifters, and 8 pipelined multipliers (or 16 unpipelined multipliers).
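As a rough sanity check on the first ratio above, the measured usage frequencies (about 40% IALU, 27% memory, 20% branch, 10% shift, and 3-4% combined floating point and multiply) can be scaled to a hypothetical 20-slot machine:

\[ 20 \times (0.40,\ 0.27,\ 0.20,\ 0.10,\ 0.035) \approx (8.0,\ 5.4,\ 4.0,\ 2.0,\ 0.7) \]

Rounding the combined floating-point and multiply share up to one FPU and one multiplier recovers the (8, 5, 4, 2, 1, 1) ratio.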
3.3 Basic Block and Branch Statistics

It is also possible to extract basic block and branch prediction statistics from the
profiled Lcode using IMPACT's profiling tools. Information about basic block size provides a good estimation of the basic instruction level parallelism obtainable with a compiler. The average size of a basic block defines the maximum amount of local parallelism, i.e. the parallelism available only among operations in the same basic block. Of course, achieving the maximum is highly unlikely, as it would effectively require every basic block to execute all of its operations simultaneously. Typically, the overall speedup from local parallelism is not greater than 25-35% of that amount. However, the larger the basic block, the greater the potential parallelism. And as multimedia has demonstrated high degrees of parallelism [56], it is interesting to determine whether this translates into large basic blocks.
As discussed in Section 2.2.3, basic block sizes are often small, averaging around 5 operations in general-purpose applications, so global scheduling is necessary for achieving greater parallelism. For extracting more parallelism, global scheduling uses methods such as speculation and predication to search for parallelism beyond the bounds of a basic block. These two methods can have varying degrees of success, but both are dependent upon profiling statistics and the compiler's ability to accurately predict branches. Speculation is the process whereby an operation residing on the expected path of control flow executes as soon as its source operands become available, before it is actually known whether it needs to execute. Consequently, it is best used on critical paths with branches that are highly predictable. Predication is the conditional execution of an operation based on the state of a condition associated with the operation. It essentially combines two or more control paths into a single conditional control path, eliminating the dependence of operations on branches. Therefore, it is best used on branches that are more unpredictable. Because both rely on static branch prediction, the compiler's ability to predict branches influences the effectiveness of these methods. Higher branch prediction accuracy means more effective speculation and predication and greater parallelism. Using profiling, the static branch prediction results can be extracted from the profiled Lcode, which will provide a measure of the effectiveness of these global scheduling techniques. In the section on operation frequencies it was found that overall, nearly 20% of the operations are branch operations. This corresponds with our results for the overall average basic block size of 5.5 operations per basic block. More interesting than this simple average however, are the average basic block sizes for each application, as shown
in Figure 3.3.
The basic block results for various multimedia applications show
enormous variations in the average basic block sizes. The three Mesa applications, the JPEG decoder, the H.263 encoder, and the GSM encoder all have much larger basic block sizes than the remaining applications, so it is expected that these applications will
achieve better parallel performance.
Figure 3.3 – Average basic block sizes for each benchmark.

The results for average basic block size are very similar to general-purpose applications, which have an average basic block of about 5 operations [103]. However, general-purpose applications do not normally display such a wide variation of basic block sizes between different applications. It should be noted, however, that while the compiler is only using classical optimizations, the developers of some of the MediaBench applications manually unrolled some of the critical loops in the C source code to improve performance. In these cases, the large basic block sizes are not intrinsic properties of the multimedia code. Consequently, the typical basic block size is expected to be the same in multimedia as in general-purpose applications.
To provide a complete evaluation, static branch prediction performance was measured using both the training input and a separate evaluation input. The training input is the input used when performing profiling, which provides the instruction usage statistics from which the compiler optimizes the code. Because the code is optimized for the training input, evaluation of static branch prediction on the training input yields the peak performance for that input. Other input sets will typically yield lower performance. The performance of the evaluation input data set represents the realistic static branch prediction performance, which will typically, but not always, be lower than the performance of the training input. The results of static branch prediction on the training and evaluation inputs are shown in Figure 3.4. Overall the performance was very good. The average static branch prediction accuracy is 89.5% with the training input and 85.9% for the evaluation input. For the most part the realistic performance, as represented by the evaluation input, is only moderately less than the ideal performance. However, the difference is significant for pgpdecode, where the hit rate drops to only 41.5%. As expected, the evaluation input performance was nearly always lower than the ideal performance for the training input. The epic benchmark was the only exception, where the evaluation data set's hit rate was 0.5% higher. The static branch prediction results are better than expected, as compared with typical general-purpose applications.
With the exception of pgpdecode, the typical
realistic static branch hit rate is 87.9%. This is significantly better than the typical performance of static branch prediction in general-purpose processors, which have an average branch miss rate of 30.95% (hit rate of 69.05%), as measured on SPECint92 by Calder et al. [105].
The static branch prediction efficiency in multimedia is of
considerable benefit as it provides more accurate compile-time information about branches. This enables speculative execution and predicated execution to be applied with greater efficiency, increasing the benefits of global scheduling.
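As a minimal sketch (not IMPACT's actual implementation) of how profile-based static prediction accuracy can be computed, each branch is predicted in its majority direction from the profile, and the hit rate is the fraction of dynamic branches that agree with that prediction:

    #include <stdio.h>

    /* Per-branch profile counts from a training run. */
    struct branch_profile {
        long taken;      /* dynamic executions where the branch was taken */
        long not_taken;  /* dynamic executions where it fell through      */
    };

    double static_prediction_hit_rate(const struct branch_profile *b, int n) {
        long hits = 0, total = 0;
        for (int i = 0; i < n; i++) {
            long t = b[i].taken, nt = b[i].not_taken;
            hits  += (t > nt) ? t : nt;   /* majority direction is predicted */
            total += t + nt;
        }
        return total ? 100.0 * hits / total : 0.0;
    }

    int main(void) {
        struct branch_profile p[] = { {900, 100}, {40, 60} };  /* hypothetical */
        printf("%.1f%%\n", static_prediction_hit_rate(p, 2));  /* prints 87.3% */
        return 0;
    }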
Figure 3.4 – Static branch prediction performance: peak performance on the training input versus performance on a separate evaluation input.
3.4 Data Types and Sizes

Data type and size is another important issue in multimedia. As opposed to
conventional microprocessor applications, multimedia applications typically use small integer data types of 16 bits or less. This is particularly true of most video and image applications, and some audio applications.
Applications requiring larger data types
include computer graphics and other audio applications, which rely more on integer data types up to 32 bits and floating-point data types of 32 or 64 bits. The data characteristics used by multimedia applications are important for two reasons. First, subword parallelism relies on the use of small data types to achieve
additional SIMD parallelism, as discussed in Section 1.2.2. Second, smaller data types may allow media processors with narrower datapaths. If a sufficient fraction of the instructions in media processing operate on small integer data types, then it may be possible to design media processor datapaths that are less than 32 bits wide. Functional units for smaller integer datapaths would consume much less area and require fewer levels of logic, potentially enabling media processors with much higher frequencies. Because the support of larger data types, which do not fit in a narrow datapath, can require numerous additional operations and registers, the benefits of a smaller datapath can only be realized if the majority of the variables fit in the smaller datapath. To determine the effective data sizes for all integer data, the profiling tool for the IMPACT compiler, Lprobe, was modified to dump the value of each integer operation to the program trace. The high-level simulator was then able to monitor the actual value for each integer operation and keep track of its maximum absolute value. The number of bits required to hold this value defined the actual data type required for that operation. While this is not an exact method for computing the largest possible value for an operation or variable, the results are scaled according to the execution weight of the operations. Those operations that are executed more frequently are expected to be more accurate and will contribute more to the results, while those operations executed less frequently will be more prone to error, but will impact the final results much less. Consequently, the results are expected to be reasonably accurate. Figure 3.5 shows the average ratio of data types used by the various media types in the MediaBench benchmark suite.
Figure 3.5 – Ratio of data types according to media type.

From the results it is apparent that there is indeed a tendency toward small integer data types. Overall, nearly 40% of the operations in the benchmark suite require only byte integer data types, and over 65% require halfword or smaller data types. Also, the only significant exception to the average results is from the graphics applications, which rely more on floating point data types. Aside from graphics, it can quantitatively be stated that multimedia applications, as defined by the MediaBench benchmark suite, use 16-bit or smaller data types nearly 70% of the time, offering ample opportunity for subword parallelism or potentially even narrower datapaths. In comparison with general-purpose applications, the ratio of data types and sizes is significant. As reported by Hennessy and Patterson [103], the percentages for byte, halfword, word, and double-word data sizes are 7%, 19%, 74%, and 0% for SPECint92, and 0%, 0%, 31%, and 69% for SPECfp92. While these results were not obtained in the same manner that measured the minimum data size necessary for each operation, there is still a stark contrast between the multimedia and general-purpose results.
In examining the possibility of using a smaller datapath, it is evident that there is not a sufficient degree of byte data types to warrant datapaths less than 16 bits wide. However, there is a strong possibility for datapath widths of 16 bits or more, since nearly 73% of integer data types are 16 bits or less. The remaining large integer data types for pointers and non-pointers represent only 15% and 12% of the integer operations, respectively. The issue with regards to the feasibility of a narrow datapath is whether the pointers and other larger data sizes can be sufficiently supported to achieve an overall speedup by using a smaller, and consequently faster, datapath. An argument can be made for pointers that the lower bits in the pointer change often, while the more significant bits of the pointer change much less frequently. It should therefore be possible to perform computations on pointers typically using only the lower 16 bits. Another possible scheme might provide a single 32-bit unit for pointer manipulation while the remaining units in the datapath are 16 bits. Assuming a method can be found that provides a viable solution for processing pointers, only the 12% of integer data types defining large non-pointer variables will require additional operations for computation and use.
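A minimal sketch of the value-profiling idea used in this section (the actual measurement was done with a modified Lprobe; the functions below are only illustrative): track the maximum absolute value each operation produces, then convert that magnitude into the number of bits a signed integer would need.

    #include <stdint.h>

    /* Update the running maximum absolute value observed for one operation. */
    void observe(int64_t value, int64_t *max_abs) {
        int64_t a = value < 0 ? -value : value;
        if (a > *max_abs)
            *max_abs = a;
    }

    /* Bits needed for a signed integer with this maximum magnitude:
     * one sign bit plus enough magnitude bits (e.g. 255 -> 9 bits). */
    int bits_needed(int64_t max_abs) {
        int bits = 1;                 /* sign bit */
        while (max_abs > 0) {
            bits++;
            max_abs >>= 1;
        }
        return bits;
    }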
3.5 Memory Statistics

Understanding the memory characteristics of typical multimedia applications is of
paramount importance. Not only is it necessary to determine the amount of instruction and data memory needed for achieving good performance, but other characteristics such as spatial and temporal locality are also important factors.
Additionally, multimedia
applications typically involve streaming data. The memory characteristics of multimedia
applications should be examined for evidence that memory prefetching structures, such as stream buffers or stride prediction tables, may provide improved performance. Examination of the memory characteristics involved a cache regression study using the IMPACT simulator. To evaluate the working instruction and data set sizes for each application, instruction and data miss ratios were measured for all base 2 cache sizes between 1 KB and 4 MB, using a line size of 64 bytes. Similarly, spatial locality was evaluated for each application by measuring the instruction and data miss ratios for all base 2 cache line sizes between 8 bytes and 1024 bytes, assuming a 64 KB cache. When measuring the miss ratios, both read and write misses were measured. For a cache that uses a no-write-allocate policy, the write misses would have little effect on the working set size, but a conservative approach was assumed here to cover both policies. No tests were performed for measuring the effectiveness of stream buffers or stride prediction tables.
However, some initial observations can be drawn about the existence of
streaming data from the other results.

Working Set Size

The data and instruction working set sizes are displayed in Figure 3.6.
To
evaluate working set size, a cache regression was performed using a direct-mapped cache for all base 2 sizes between 1 KB and 4 MB, using a line size of 64 bytes. The number of read and write misses was measured, and an analysis of the results yielded the working set size for each application. This working set size is defined by the knee on the cache regression graph where the percentage of misses decreased dramatically with respect to smaller cache sizes. An example of such a knee is illustrated by the 8 KB cache in Figure 3.7, which presents the average instruction miss rate for various cache sizes. In the absence of a cache size exhibiting a knee, the working set size is defined as the size that reduces the miss ratio to below 3%.
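A minimal sketch of the fallback rule above (the data here is hypothetical; the real analysis reads the knee off the regression curve): report the smallest cache size whose miss ratio falls below 3%.

    #include <stdio.h>

    /* Given miss ratios (%) for ascending power-of-two cache sizes,
     * return the first size whose miss ratio drops below 3%. */
    int working_set_kb(const double *miss_pct, const int *size_kb, int n) {
        for (int i = 0; i < n; i++)
            if (miss_pct[i] < 3.0)
                return size_kb[i];
        return size_kb[n - 1];  /* working set exceeds the largest size */
    }

    int main(void) {
        int    sizes[]  = { 1, 2, 4, 8, 16, 32 };           /* hypothetical */
        double misses[] = { 9.2, 7.5, 5.1, 0.4, 0.3, 0.2 };
        printf("%d KB\n", working_set_kb(misses, sizes, 6)); /* prints 8 KB */
        return 0;
    }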
Figure 3.6 – Instruction and data memory working set sizes.

Based on the statistics, it appears that cache sizes do not need to be very large for most MediaBench applications, even in light of the large amounts of data required in multimedia. A data cache size of 32 KB provides an average miss rate of 2.0% on the benchmark suite. This size is sufficient for most of the benchmarks, except for pegwitenc and pegwitdec, which have miss rates of about 11% at 32 KB; their miss ratios remain high unless the cache reaches the full 128 KB necessary to contain their working sets. The other applications with working set sizes larger than 32 KB, osdemo and unepic, still have reasonable miss ratios of 3.9% and 5.5%, respectively, for this cache size. The trace-driven studies of Wu and Wolf [58] yielded similar results, as they concluded a 32 KB 2-way set associative cache with a 64 byte line was necessary for good performance on the H.263, MPEG-2, and MPEG-4 motion video applications.
The data working set sizes shown in Figure 3.6 were only computed for one input data set, but we expect data working set sizes will remain less than 32 KB for most input data sets. Variations in data set size or application operating parameters are known to affect data working set size, but the input data set used in this experiment provides a wide range of input and output data sizes. Some applications, such as EPIC, JPEG, and Rasta, have fairly small data sets, while other applications such as MPEG-4, H.263, and GSM have much larger data sets (see Table 2.4). The MPEG-4 encoder application, mpeg4enc, has a particularly large input set that is over 40 MB, yet its working set size is still only 32 KB. Consequently, we expect a 32 KB data cache will prove sufficient for this generation of media processors. The instruction cache results shown in Figure 3.6 and Figure 3.7 are even more surprising. A cache size of 8 KB provides the ideal instruction cache size for these applications, with an overall miss ratio of only 0.3%. Cache sizes smaller than 8 KB increase miss rates significantly, while larger instruction caches increase performance only marginally. Even the one application, gsmencode, with a larger working set size still has a miss rate of only 1.5% at 8 KB.
These results are somewhat surprising since many of the
applications have relatively large code sizes, shown previously in Table 2.4, with between ten thousand and a hundred and twenty thousand operations. This means all applications spend the majority of their processing time within only a small fraction of the entire code, in some cases less than 3%. The small instruction working set size provides a first indication of the processing regularity in media processing.
Figure 3.7 – Overall instruction cache miss ratios versus instruction cache size.

In comparison with the multimedia results, general-purpose applications typically exhibit lower performance. According to Hennessy and Patterson [103], the average miss ratios for an 8 KB direct-mapped instruction cache and a 32 KB data cache are 1.1% and 4.8%, respectively. These miss ratios are nearly 2-4 times worse than the multimedia miss rates for equivalent caches.
To achieve the same miss ratios as
multimedia applications, general-purpose processors would require a 32 KB instruction cache and a 128-256 KB data cache. The initial MediaBench results [66] for instruction and data cache miss rates indicate a similar difference between multimedia and general-purpose applications.
However, they measured data read and write hit rates
independently and found general-purpose applications have better write hit rates (when using a no-write-allocate policy).

Spatial Locality

Evaluation of the spatial locality for instruction and data memory was based on a cache line size regression for all base 2 line sizes between 8 bytes and 1024 bytes on a 64 KB direct-mapped cache. As line size increases, performance typically increases because
the processor will often use the additional data contained within the cache line without having to generate additional cache misses. The degree to which the processor can use the additional memory within longer line sizes represents the degree of spatial locality for an application. An equation was defined to quantitatively describe the spatial locality for increasing line size. 100% spatial locality is represented by a perfect decrease in cache misses relative to the change in line size: if the line size doubles, the number of cache misses halves. The degree of spatial locality is then defined as the ratio of the actual decrease in cache misses to the ideal decrease. Assuming A is the number of cache misses for the shorter line size, B is the number of misses for the longer line, and l_a and l_b are the respective line sizes for A and B, the equation becomes:

    spatial locality = (A − B) / (A − A · (l_a / l_b))

Measuring the spatial locality between subsequent cache line sizes in the cache regression, where l_b = 2 · l_a, the equation becomes:

    spatial locality (from doubling line size) = (A − B) / (A / 2)
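In code form, the equation above transcribes directly (a and b are the miss counts at the shorter line size la and the longer line size lb):

    /* Degree of spatial locality between two line sizes; returns 1.0
     * (i.e. 100%) for a perfect decrease in misses, and a negative
     * value if misses actually grew.  For lb = 2*la the denominator
     * reduces to a/2. */
    double spatial_locality(double a, double b, double la, double lb) {
        return (a - b) / (a - a * la / lb);
    }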
Using this equation, the spatial locality for data memory is shown in Figure 3.8. For a given line size, the spatial locality is relative to the next shorter line size, e.g. spatial locality for the 256 byte line is relative to the 128 byte line. As evident in the figure, the spatial locality for data memory is very good for smaller lines with an average spatial locality of 60.8% for line sizes up to 128 bytes, but quickly begins to decrease for lines with more than 32-64 bytes. For 256 byte lines, the degree of spatial locality becomes
negative for many media types, meaning the number of cache misses actually increased due to cache conflicts. Consequently it appears that the ideal cache line size for data
memory is either 32 or 64 bytes.
Figure 3.8 – Data memory spatial locality. Indicates spatial locality between the indicated line size and the previous line size of half that length.

The spatial locality results for instruction memory are similarly shown in Figure 3.9. For instruction memory, spatial locality remains very high for line sizes up to 1024 bytes in most media types except video and graphics. Overall, the average instruction memory spatial locality is 84.8% for line sizes up to 256 bytes, and still remains as high as 77.2% for line sizes up to 1024 bytes. As is evident in the figure, the degree of spatial locality begins to drop off considerably after 128-256 bytes, so the ideal block size for instruction memory is either 128 or 256 bytes.
Figure 3.9 – Instruction memory spatial locality. Indicates spatial locality between the indicated line size and the previous line size of half that length.

We are not aware of studies for general-purpose applications that quantitatively define spatial locality, but an examination of performance using different block sizes for the data cache was done by Hennessy and Patterson [103].
There was not any
significant performance improvement from increasing block sizes within the same data cache, and block sizes of 64 bytes or more generated performance degradations with many of the smaller data caches. Overall the results indicated a line size of 32 bytes is the best choice in general-purpose applications. The spatial locality results for multimedia are quite good, with an average spatial locality of 60.8% for data memory of line sizes up to 128 bytes, and 84.8% for instruction memory of line sizes up to 256 bytes. It is important to note, however, that these spatial locality results are not valid for all cache sizes. Spatial locality will vary with cache size since smaller cache sizes are more prone to cache conflicts that reduce spatial locality. Consequently, the spatial locality using a 64 KB cache serves as a reference point.
Larger cache sizes will have higher spatial locality while smaller cache sizes will have lower spatial locality. For 32 KB caches, we anticipate only marginal reduction in spatial locality, and expect line sizes of 32 or 64 bytes for data memory and 128 or 256 bytes for instruction memory will provide the best performance. Wu and Wolf [58] found similar results for data cache line sizes in a study using trace-driven simulation. However, they only experimented with data memory, and only evaluated line sizes up to 64 bytes.

Streaming Data

While no tests were performed to directly examine the value of stream buffers or stride prediction tables for multimedia applications, the cache and line size results do provide evidence that such support could be beneficial. Multimedia typically requires very large amounts of data. This is particularly true of video and computer graphics applications. In the operation frequencies results a number of applications, including H.263, Mesa, MPEG-2, MPEG-4, and Pegwit, were found to have very high frequencies for loads and stores. However, with the exception of Pegwit, none of the data working set sizes for these applications is very large, while the spatial locality results are good in all cases. The large amounts of data coupled with the small working set sizes indicates that the processor typically loads in a small amount of data, processes it, then throws it away. The high frequency of memory accesses and good spatial locality indicate that the many memory accesses are performed to and from the same cache lines, so most of the data is used before it is cast out of the cache. From these two indications, it can be concluded that the processor is constantly loading in small amounts of data, performing all the necessary work on that data, then throwing the data out, never (or rarely) needing access to it again. This perfectly describes the nature of streaming data. So while these
studies cannot comment on the performance gain from using stream buffers or stride prediction tables, there is substantial evidence to support the existence of a considerable amount of streaming data, so it is likely that performance gains can be obtained from memory prefetching support.
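For illustration only, a stride prediction table of the kind suggested here might look like the following sketch (all names and sizes are our own choices; prefetch() stands in for whatever cache-touch mechanism the hardware or simulator provides):

    #include <stdint.h>

    #define TABLE_SIZE 64

    /* One entry per tracked load instruction: its PC, the last address
     * it accessed, and the last observed stride between accesses. */
    struct spt_entry { uintptr_t pc, last_addr; intptr_t stride; };
    static struct spt_entry table[TABLE_SIZE];

    extern void prefetch(uintptr_t addr);   /* assumed cache-touch hook */

    void on_load(uintptr_t pc, uintptr_t addr) {
        struct spt_entry *e = &table[(pc / 4) % TABLE_SIZE];
        if (e->pc == pc) {
            intptr_t stride = (intptr_t)(addr - e->last_addr);
            if (stride != 0 && stride == e->stride)
                prefetch(addr + stride);    /* stride seen twice: prefetch */
            e->stride = stride;
        } else {
            e->pc = pc;                     /* new entry for this load    */
            e->stride = 0;
        }
        e->last_addr = addr;
    }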
3.6 Loop Statistics

In the previous section it was found that multimedia applications have small
instruction working set sizes, spending the majority of their time processing over small sections of the program code. To more fully understand the processing characteristics within these frequently executed program sections, we again utilize the profiling tools to examine the loop characteristics of multimedia applications.
Included among the characteristics examined are the loop execution weight by loop level, the average number of iterations per loop, and the typical path ratios within loops. These statistics enable a greater understanding of the degree of processing regularity in media processing.

The first parameter measured is the loop execution weight by loop level. Using a depth-first search through the functions in each application, we assign a loop level to each loop. The loop level is defined as the number of levels from an innermost loop: innermost loops are assigned level 1, their parent loops level 2, and so on. When a parent loop has child loops of different loop levels, the loop level of the parent is defined as one greater than the maximum loop level of its children. Function boundaries are ignored in this definition, so all loop levels are global; a small sketch of this numbering follows Figure 3.10 below.

The results, given in Figure 3.10, indicate that most multimedia applications spend 80-90% or more of their processing time just within the inner loops of the programs. The second loop level has an even higher execution weight, with nearly 95% of the execution time spent within the first and second level loops. The two exceptions that spend a much greater portion of their execution time in lower loop levels are the G.721 applications, g721dec and g721enc, and the Ghostscript application, gs. The G.721 applications have large 4th-level loops in which they spend over 30% of their execution time. Ghostscript spends a significant portion of its execution time in outer loops because it is an especially large application with 15 loop levels, much greater than the average of 4-8 loop levels found in most other applications.
Figure 3.10 – Percentage of loop execution weight by loop level.

From the loop execution statistics, we conclude that instruction working set sizes can be so small because such a significant portion of the execution time is spent processing over the two innermost loop levels. However, while this indicates processing regularity, it does not indicate the degree of processing regularity. That requires an understanding of how often these loops iterate.
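To make the numbering concrete, the following C sketch assigns loop levels over a simple loop-tree representation. The loop structure here is a hypothetical stand-in for the compiler's internal loop nest, not IMPACT's actual data structure.

    /* Hypothetical loop-tree node; child loops form a linked list. */
    typedef struct loop {
        struct loop *children;  /* first child loop, NULL if innermost */
        struct loop *next;      /* next sibling loop                   */
        int level;              /* computed loop level                 */
    } loop;

    /* Innermost loops get level 1; a parent is one greater than the
     * maximum level among its children.  Function boundaries are
     * ignored, so the recursion spans the whole (global) loop tree. */
    int assign_levels(loop *l)
    {
        int max_child = 0;
        for (loop *c = l->children; c != NULL; c = c->next) {
            int lev = assign_levels(c);
            if (lev > max_child)
                max_child = lev;
        }
        l->level = max_child + 1;   /* 1 for innermost loops */
        return l->level;
    }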
Results on the average number of loop iterations per loop are shown in Figure 3.11. The average number of loop iterations is weighted according to the number of invocations of each loop: for each loop, the average number of iterations is multiplied by the number of invocations of that loop, these products are summed over all loops, and the sum is divided by the total number of invocations for all loops. The average is weighted by invocations rather than by loop execution weight, since weighting by execution weight would bias the average toward loops with more iterations; a C sketch of this weighted average is given below. The results indicate that typical loops have a large number of iterations, about 10 iterations per loop on average. Some variation in the average number of iterations can be expected when using different data sets, but a comparison of the averages on the training data set versus the evaluation data set indicated these variations are typically within 5% for each application.

Figure 3.11 – Average number of loop iterations for each benchmark.

Among specific applications, the Mesa application mipmap and the audio applications are particularly prone to large numbers of iterations, each averaging many hundreds of iterations per loop. The only applications that have few iterations on average are the Epic encoder, epic, and the Mesa application texgen. Overall, these results enable us to assert that there is significant processing regularity, given the large average number of iterations per loop.
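A minimal sketch of this weighted average, assuming per-loop profile counts for invocations and average iterations, is:

    /* Invocation-weighted average iterations per loop.  Each loop i was
     * invoked invocations[i] times and averaged avg_iters[i] iterations
     * per invocation (both from profiling).  Weighting by invocations,
     * rather than execution weight, avoids biasing toward long loops. */
    double weighted_avg_iterations(int nloops,
                                   const double invocations[],
                                   const double avg_iters[])
    {
        double weighted_sum = 0.0, total_invocations = 0.0;
        for (int i = 0; i < nloops; i++) {
            weighted_sum      += avg_iters[i] * invocations[i];
            total_invocations += invocations[i];
        }
        return weighted_sum / total_invocations;
    }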
The last issue regarding processing regularity is the control complexity within each loop, which is measured by the path ratio of each loop. The path ratio is defined as the ratio of the average number of instructions executed per loop iteration to the total number of instructions in the loop. The path ratio is therefore a comparison of the typical loop path length to the combined length of all paths; the remaining off-path instructions in the loop are control divergences from the typical path. Low path ratios indicate a high number of control divergences and high control complexity, whereas high path ratios indicate few control divergences and low control complexity.
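Under this definition the per-loop metric reduces to a single division; a hedged sketch of it and of its execution-weighted average across a benchmark follows (the profile quantities are assumed inputs):

    /* Path ratio of one loop, plus the benchmark-level average weighted
     * by loop execution weight (fraction of dynamic instructions spent
     * in each loop).  Profile field names here are illustrative. */
    double path_ratio(double avg_instrs_per_iter, double static_instrs)
    {
        return avg_instrs_per_iter / static_instrs;
    }

    double weighted_path_ratio(int nloops, const double exec_weight[],
                               const double avg_instrs_per_iter[],
                               const double static_instrs[])
    {
        double sum = 0.0, weight = 0.0;
        for (int i = 0; i < nloops; i++) {
            sum    += exec_weight[i] *
                      path_ratio(avg_instrs_per_iter[i], static_instrs[i]);
            weight += exec_weight[i];
        }
        return sum / weight;
    }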
Figure 3.12 – Average path ratios of loops for the various media types.

Computation of the average path ratio proceeds as follows. For each loop, the loop execution weight can be broken down into three elements: the number of loop invocations, the average number of loop iterations per invocation, and the average number of instructions executed per loop invocation. The path ratio for each loop is
computed by taking the average number of instructions executed per loop invocation and dividing by the total number of instructions in the loop. The typical loop path ratio, weighted according to loop execution weight, is given in Figure 3.12. All six media types have moderately high average path ratios. However, because multimedia applications are typically associated with minimal control complexity, the results are lower than expected. The graphics and audio media types in particular have significantly lower path ratios than expected. The image media type's average was brought down principally by the Ghostscript application, gs, which has a typical path ratio of less than 50%, and the JPEG coder, cjpeg, which has a path ratio of 61.4%. The video average would also have been higher, but the H.263 and MPEG-4 decoders have modest path ratios of only 70-73%. Again, we can assume some variation between different data sets, but the variations between the training and evaluation data sets were typically within 1-2%. Overall, the results indicate good processing regularity. However, we expected path ratios of 90% or higher, not the 78% average that resulted. Consequently, media processing entails more control complexity than anticipated.

The results of the loop statistics indicate that multimedia applications are highly loop-oriented. Nearly 95% of all execution time is spent within the two innermost loop levels, and loops typically iterate about 10 times per invocation. These results indicate considerable regularity of loop processing in multimedia, but the path ratio results imply more control complexity within loops than expected. The average path ratio of 78% means that generally only about three-quarters of the instructions in a loop are executed on the typical path, and the remaining quarter are off-path instructions that signify greater control complexity within loops than anticipated. This additional control complexity will become even more apparent in Chapter 4 when comparing static scheduling to dynamic out-of-order scheduling. While the loop statistics indicate media processing is highly loop-oriented, future media processors must be able to accommodate additional intra-loop control complexity.
3.7 Instruction Level Parallelism

The last experiment in evaluating intrinsic multimedia characteristics examined the instruction level parallelism available from static scheduling.
This experiment
performs compilation using both local and global scheduling methods to evaluate the effectiveness of both traditional and aggressive ILP optimizations. The procedure for studying parallelism evaluates five different compilation methods for each application. The first compilation targets a single-issue processor and allows only classical optimizations for determining the base performance of the applications.
This first
compilation is the same compilation as used for all the above evaluations. The remaining four compilations all target an 8-issue processor. This processor is a simple architecture that provides 8 universal issue slots and uses the same operation latencies as the initial machine (1 cycle integer ALU operations, 2 cycle loads, 3 cycle multiplies and floating-point ops, 10 cycle divides), but only allows one branch per cycle. The first of the four compilations uses only classical optimizations with no procedure inlining.
The second, defined by the classical optimization path in Table 2.1, uses
classical optimizations as well as procedure inlining. The third compilation scheme uses the superscalar optimization path described in Table 2.1, performing the superblock optimization and other optimizations such as loop unrolling. The final compilation method uses the hyperblock optimization path described in Table 2.1.
Evaluation of the four different compilation methods on an 8-issue processor yields some initial ILP performance results, as shown in Figure 3.13. This study only examines parallel scheduling performance, so an ideal processor model was assumed, excluding any performance penalties from cache and branch effects. The overall results, while not spectacular, are reasonable and consistent with similar results found in a separate study of ILP on MediaBench [102]. The average results are given in Table 3.1.

Figure 3.13 – Speedup of an 8-issue architecture with respect to a single-issue architecture for various compilation methods.

Compilation Method              Parallelism
8-issue classical only          1.40
8-issue classical w/ inlining   1.44
8-issue superscalar             2.22
8-issue hyperblock              2.03

Table 3.1 – Parallelism results from various compiler optimization methods.
These results are in considerable contrast with the results from Wu and Wolf's trace-driven studies [97][64], which typically found 4-8x better parallel performance. However, those studies assumed perfect branch prediction, perfect memory disambiguation, and an infinite-sized scheduling window, and so represent an upper bound on the potential parallelism. Comparing the various compilation methods, it is somewhat surprising that the superscalar compilation path provides better speedup than the hyperblock compilation path.
Because it incorporates both hyperblock and superscalar optimizations, the
hyperblock has the potential for better performance than superscalar optimization. However, according to members of the IMPACT development team [77], the hyperblock optimization is still undergoing development and has not been fully tuned for maximum performance.
It is also important to note that having a higher ideal ILP does not
guarantee better performance. As will be seen in Chapter 4, the hyperblock enables better branch prediction and has better memory performance, so it often wins out over superscalar compilation under realistic architecture assumptions. The hyperblock is also effective at providing comparable performance to superscalar optimization without the same degree of code explosion, which can considerably increase instruction cache effects in the absence of a larger instruction cache. For MediaBench, superscalar optimization increased average code size by 95%, while the hyperblock increased it by only 62%.

In comparison with ILP performance in general-purpose applications, we find that the levels of ILP are similar. A recent study by the IMPACT group examined the performance of SPECint92, SPECint95, and UNIX applications on an EPIC (VLIW-like) architecture. While that study used a different architecture model (with shorter operation latencies), it demonstrated average ILP results of 2.85 IPC. Many other ILP studies conducted on VLIW and superscalar processors in the research community have demonstrated similar levels of ILP. While this ILP study on multimedia applications does not examine the performance of all aggressive ILP optimizations, our initial investigation suggests that multimedia applications contain no more ILP than general-purpose applications. Furthermore, we believe a more comprehensive study that evaluated additional aggressive ILP optimizations, such as software pipelining and loop transformations, would reach similar conclusions.

On a positive note, performance varies considerably from application to application. The video applications that are usually more compute intensive, such as MPEG-2/4 and H.263, typically exceed these averages. One surprising deviation from this trend, however, is h263enc, which has a maximum speedup of only 1.62. This is particularly unexpected in view of the fact that h263enc has the second largest average basic block size, and the applications with the largest basic block sizes nearly always exhibited the best parallel performance. H263enc must have many sequentially-dependent operations in its basic blocks to defy this general principle. On the whole, however, video, speech, and image applications performed better than the other applications with respect to parallel scheduling.

While prior research studies have found considerable parallelism in multimedia applications [56][97][64], it is evident from this experiment that such levels of parallelism are not likely to be attained with instruction level parallelism alone. ILP provides respectable parallelism, with typical scheduling performance of about 2 IPC, but
achieving high degrees of parallelism is critical to the success of programmable media processors. Consequently, it is necessary to explore additional avenues for parallelism.
3.8 Summary

This chapter presents a workload evaluation of the MediaBench multimedia benchmark suite for purposes of defining the intrinsic properties of multimedia applications. Using the IMPACT compiler we were able to analyze and quantitatively define the characteristics of complete applications from a compiler perspective. Included among the characteristics examined were operation frequencies, basic block sizes, branch prediction rates, data sizes, working set sizes, spatial locality, loop characteristics, and ILP scheduling performance. From these results, conclusions are made about many aspects of media processing.

The operation frequency statistics define the proper ratio of functional resources as (4, 2, 1, 1, 1, 1) for integer ALUs, memory units, branch units, shifters, floating-point units, and integer multipliers. Profiling data types and sizes indicated that nearly 70% of instructions operate on integer data sizes of only 8 and 16 bits. The only major exception was the graphics media type, which relies heavily on floating point. This tendency toward small integer data types provides significant opportunity for subword parallelism or potentially the use of narrower, faster datapaths.

The typically small basic block sizes indicate that the parallelism available in multimedia applications is not available within basic blocks; more aggressive parallel optimizations are needed to extract it. While the effectiveness of these scheduling methods depends upon profiling and the accuracy of static branch
prediction, the static prediction accuracy measured was relatively high, realistically averaging 87.9% for most applications, which bodes well for obtaining this parallelism.

A cache regression analysis concludes that working set sizes for both data and instructions are relatively small, so cache sizes of 32 KB for data and 8 KB for instructions are sufficient.
Spatial locality was excellent, averaging 60.8% for data
memory and 84.8% for instruction memory, for line sizes up to 128 bytes and 256 bytes, respectively. We expect the ideal line sizes for data memory will be 32-64 bytes, and 128-256 bytes for instruction memory. While no tests were performed to evaluate the benefits of stream buffers or stride prediction tables, considerable evidence was found of streaming data, so such additional memory support will likely provide improved memory system performance. An evaluation of loop characteristics was performed for purposes of examining the processing regularity of multimedia code. It was found that most applications spend nearly 95% of their execution time processing over the two innermost loops, and that loops tend to have a large number of iterations, typically 10 or more, with some loops having many hundreds or thousands of iterations.
While these results indicate
considerable processing regularity of loops, the path ratios within loops were only 78%, indicating higher control complexity than expected.
Consequently, future media
processors will need to accommodate greater degrees of intra-loop control complexity. Evaluation of instruction level parallelism unfortunately revealed that multimedia contains little more ILP than general-purpose applications. Static scheduling was unable to achieve more than a 2.2x speedup on average, even with the most aggressive ILP optimizations. While this study did not examine the full range of aggressive ILP
optimizations, these initial results indicate the ILP in multimedia falls well short of the parallelism known to be available.

In comparison with general-purpose applications, there were some similarities and some differences. The typical size of basic blocks was similar, but the operation frequencies showed some variation: multimedia applications have more arithmetic operations, far fewer floating-point operations, and slightly fewer memory accesses. There was a significant difference in the static branch prediction results, with multimedia applications having 2-3x fewer branch misses. There was also a large disparity between data types and sizes on multimedia and general-purpose applications, with multimedia using significantly fewer large data types. The memory characteristics favored the multimedia applications, and the ILP between the two was similar.

Among the various results, one of the most critical is that there is not a significant amount of ILP in multimedia applications. This was unexpected in light of the high degrees of parallelism known to exist in multimedia applications. Since parallelism is crucial to the success of programmable media processors, alternate methods of compilation will need to be explored. These and other compiler issues will be explored in greater detail in Chapter 6.
Chapter 4. Datapath Architecture
This chapter begins the second phase of our design methodology for media processor architectures, performing an in-depth evaluation of the datapath architecture for programmable media processors. There are two primary issues in the design of datapaths for media processors.
The first involves the choice of static versus dynamic
architectures. Early media processor designs have followed the DSP design philosophy, building predominantly static architectures.
While architectures
employing static scheduling, particularly VLIW architectures, have been regarded as the appropriate processor model for media processing, there is no definitive evidence proving them to be the ideal model. The first half of this chapter will investigate the dynamic aspects of media processing to determine whether static architectures are sufficient, or whether dynamic support may be necessary in future generations of media processors. The second issue in datapath design for media processors is how to achieve both high frequency and high parallelism. These are conflicting goals because of the extra demands additional parallelism places on the hardware. The design of high frequency, highly parallel media processors therefore requires the use of distributed architectures. The second half of this chapter will examine some proposed distributed architectures for media processors, and introduce the Princeton multi-cluster architecture for video signal and media processing.
4.1 Static versus Dynamic Architectures

Most existing programmable media processors have used statically-scheduled
VLIW and DSP architectures. These early media processors have used the simple hardware design of statically-scheduled architectures to help achieve low cost and low power. Static architectures have two primary drawbacks, however. First, they depend upon static scheduling by the compiler and/or programmer to provide effective program performance. Second, as media processors progress to higher frequencies and higher degrees of parallelism, the dynamic aspects of processing become more pronounced and dynamic hardware support may be needed to achieve high performance. This section compares the performance of static and dynamic architectures, including fundamental architecture style, instruction fetch architecture, and high frequency effects, in order to determine whether static or dynamic scheduling will be more desirable for future media processors. This section shall proceed with a discussion of the base architecture model for the architecture evaluation in Section 4.1.2. The subsequent sections will examine three processor architectures that define the range of static and dynamic scheduling. Results are presented from experiments that evaluate various architecture features, including fundamental architecture style, instruction fetch architecture, and high frequency effects.
4.1.1 Related Work

There has not been a significant amount of research comparing static versus dynamic architecture mechanisms for media processors. There are a number of industry-designed and commercially available media processors, as discussed in Section 1.2.3, but they predominantly use static architectures (the NEC 830AV/R is the sole exception), and the information published about these processors does not go into detail regarding architecture comparisons. There have been a number of research-based media and video signal processor proposals [106][107][108][98][99], but their performance has nearly always been evaluated using kernels. Only a select few have examined media processors using full application benchmarks [66][102][60], and these have all proposed static architectures. Among these, only [66] and [102] use compiled code from full applications, and only the second of the two evaluates multiple-issue media processors. That paper examined performance using various issue widths, cache sizes, and numbers of branch units on a static architecture. One important conclusion was that there is little performance benefit from having more than 3-4 issue slots for ILP. We also found this to be true in Section 4.1.3 below. General-purpose applications typically have similar issue-width bounds [109].

There have been a number of evaluations of static and dynamic architectures for general-purpose processing in the research community, but nearly all of these focus on only one of the two architecture models. With regard to research that has directly compared static and dynamic architectures, we are aware of only two studies [109][110], both made within the IMPACT group. Both studies examined static VLIW architectures,
in-order superscalar architectures, and out-of-order superscalar architectures. The studies examined two scheduling models, restricted and general. The restricted model did not allow code movement of potentially exception-causing instructions across branches.
The general model provides non-trapping instructions,
which eliminates this restriction. Under both scheduling models, it was found that the VLIW architecture and in-order superscalar perform comparably, with the superscalar
providing only slightly better performance. The out-of-order superscalar was significantly better under both models. On an 8-issue processor, the out-of-order superscalar provided up to 100% better performance with restricted scheduling and up to 50-60% better performance with general scheduling. However, as the processor issue width decreases, the performance benefits decrease significantly (only half the performance difference on a 4-issue processor). These results can readily be compared with our results below since we are using the same compiler, though our experiments use only the general scheduling model.

Numerous studies of dynamic branch prediction have been performed for general-purpose applications, though we know of none for media processors. While media processors have good static branch prediction, there is still much room for improvement. We refer readers to [111][105][112] for a good overview and comparison of different dynamic branch prediction schemes.
The results found below for multimedia
applications are quite similar to those found for general-purpose applications when using small dynamic branch predictors. The primary difference we found with multimedia applications is that dynamic branch predictors with more than 512 entries provide negligible additional benefit.
4.1.2 Base Architecture Model

Systematic evaluation of different architectures requires a base processor model against which all other processor models can be compared. The base processor model defined here is an 8-issue media processor targeting the frequency range from 500 MHz to 1 GHz. The processor has separate L1 instruction and data caches, a unified on-chip L2 cache, and an external memory bus that operates at ¼ the frequency of the processor. Using the working set size and spatial locality results from Section 3.5 in the previous chapter, the L1 data cache is defined as 32 KB direct-mapped with 64-byte lines, and the L1 instruction cache is defined as 16 KB direct-mapped with 256-byte lines. The instruction working set size was only 8 KB, but the 16 KB instruction cache allows for code expansion from aggressive compiler optimizations. Further details on the base processor model are provided in Table 4.1.

Parallelism                          8-issue
Frequency range                      500 MHz – 1 GHz
Processor-to-bus frequency ratio     4:1
Pipeline stages
  instruction fetch                  1
  decode/dispatch                    2
  execute                            (see Table 4.2)
  write back                         1
Register files
  integer                            64 registers
  floating-point                     64 registers
L1 instruction cache                 16 KB
  associativity                      direct-mapped
  line size                          256-byte
  miss latency                       20 processor cycles
L1 data cache                        32 KB
  associativity                      direct-mapped
  line size                          64-byte
  miss latency                       15 processor cycles
  model                              non-blocking, 8-entry miss buffer
  writes                             no write allocate, 8-entry write buffer
L2 cache                             256 KB
  associativity                      4-way set associative
  line size                          64-byte
  miss latency                       50 processor cycles
  model                              non-blocking, 8-entry miss buffer
  writes                             write allocate, 8-entry write buffer

Table 4.1 – Base processor model.
Operation         Number of Units   Latency
ALU               8                 1
Branches          1                 1
Load/Store        4                 loads - 3, stores - 2
Floating-Point    2                 4
Multiply/Divide   2                 mult - 5, div - 20

Table 4.2 – Number of parallel functional units and their operation latencies.

The base processor model supports numerous parallel functional units. The resource ratios are chosen according to the results from Section 3.2. The type, availability, and operation latencies of the functional units and their operations are given in Table 4.2. This base processor model provides the foundation for our architecture evaluation; all modifications to the base processor model are noted in each experiment.
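For reference, the parameters of Tables 4.1 and 4.2 can be captured compactly as a simulator configuration. The following C structure is a sketch of one such encoding; the field names are our own and do not correspond to the IMPACT simulator's configuration format.

    /* Sketch of the base processor model (Tables 4.1 and 4.2) as a
     * simulator configuration.  Field names are illustrative. */
    typedef struct {
        int size_kb, assoc, line_bytes, miss_cycles;
    } cache_cfg;

    typedef struct {
        int issue_width;                /* 8 universal issue slots     */
        int bus_ratio;                  /* processor : bus = 4 : 1     */
        int int_regs, fp_regs;          /* 64 integer + 64 FP          */
        cache_cfg l1i, l1d, l2;
        int alu_units, alu_lat;         /* 8 ALUs, 1 cycle             */
        int br_units,  br_lat;          /* 1 branch unit, 1 cycle      */
        int mem_units, load_lat, store_lat;  /* 4 units, 3 / 2 cycles  */
        int fp_units,  fp_lat;          /* 2 FP units, 4 cycles        */
        int mul_units, mul_lat, div_lat;     /* 2 units, 5 / 20 cycles */
    } machine_cfg;

    static const machine_cfg base_model = {
        .issue_width = 8, .bus_ratio = 4, .int_regs = 64, .fp_regs = 64,
        .l1i = { 16, 1, 256, 20 },      /* 16 KB direct-mapped, 256 B  */
        .l1d = { 32, 1,  64, 15 },      /* 32 KB direct-mapped, 64 B   */
        .l2  = { 256, 4, 64, 50 },      /* 256 KB 4-way, 64 B          */
        .alu_units = 8, .alu_lat = 1, .br_units = 1, .br_lat = 1,
        .mem_units = 4, .load_lat = 3, .store_lat = 2,
        .fp_units = 2, .fp_lat = 4,
        .mul_units = 2, .mul_lat = 5, .div_lat = 20,
    };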
With this base processor model, the architecture evaluation also uses a number of different compilation models. The performance of static scheduling is primarily
dependent upon the capabilities of the compiler, so evaluation under a variety of compiler methods is necessary to thoroughly gauge the effectiveness of static scheduling. Additionally, even with non-static architectures the compiler is becoming more important to architecture evaluation with the growing interaction between architecture and compiler [113]. The Classical, Superscalar, and Hyperblock compilation models introduced in Section 2.2.1 shall be used throughout the architecture evaluation to enable a thorough evaluation of all scheduling methods and mechanisms.
4.1.3 Fundamental Architecture Style

This section performs an evaluation of fundamental architecture types to determine the most suitable architecture for media processors. The basic difference between static and dynamic architectures is the type of scheduling method involved. Using three different processor architectures, this evaluation will examine the full range of static and dynamic scheduling.

The effectiveness of a particular scheduling method revolves around how often the dynamic aspects affect a program's schedule. The dynamic processing aspects with the primary impact on performance arise from memory stalls and branch mispredictions. Fully dynamic scheduling, found in out-of-order superscalar processors, is most effective at dealing with these dynamic aspects: dynamic branch prediction and out-of-order execution effectively hide many of the stall penalties from memory stalls and branch mispredictions.
This is not true of static
scheduling which enforces strict in-order processing, has only static branch prediction, and may require instructions to issue in groups of parallel operations.
In static
architectures, the occurrence of any dynamic stall affects the entire pipeline, forcing all operations to wait until the stall is resolved; no hiding of stall penalties is provided.

While static scheduling does not have the ability to hide stall penalties, it is commonly regarded as an effective architecture style for predictable applications, where effective static branch prediction and memory prefetching (the IMPACT compiler does not enable automatic memory prefetching, so that feature is not examined here) can be used to minimize the occurrence of dynamic stalls. Media processing has typically fallen within the realm of predictable applications because of the repetitive nature of processing large amounts of multimedia data. However, multimedia applications also contain complex control code interspersed between sections of data processing code. This is particularly true of emerging applications that integrate many types of multimedia or use advanced media representations. The added control complexity can significantly strain the ability of the compiler to make effective scheduling decisions, and therefore impedes performance.

This architecture style investigation continues with four experiments that examine various architecture features: basic architecture style, compiler optimization levels, processor issue width, and different instruction formats for VLIW architectures.

Basic architecture style

The first architecture experiment examines the range of static and dynamic scheduling using three basic architectures:
an out-of-order superscalar, an in-order
superscalar, and a VLIW processor. The first model is the out-of-order superscalar architecture, which provides full dynamic scheduling. It enables aggressive out-of-order instruction issue with a 32-entry issue-reorder buffer, provides dynamic memory disambiguation support for data speculation, and allows early evaluation of branches so long as they are resolved in order. The out-of-order scheduler effectively hides dynamic stall penalties by only holding up those operations dependent upon a dynamic stall, allowing all non-dependent operations (within a scheduling window defined by the size of the issue-reorder buffer) to bypass the dependent operations and execute early. The out-of-order superscalar processor also provides dynamic branch prediction with a 1024-entry 2-bit counter predictor, which helps avoid stalls from changes in the instruction control stream by keeping track of the branch run-time statistics and predicting the
expected branch direction and target from that dynamic information. The out-of-order superscalar model is the most effective architecture for dealing with the dynamic aspects of a program and is expected to have the best overall performance.

The second architecture model, an in-order superscalar, provides partial dynamic scheduling. It uses scoreboarding to enforce dynamic data dependency checking, so it still enables non-dependent operations to issue during a memory stall. However, because operations must issue in order, any one operation dependent on that memory stall will hold up the issue of all operations behind it.
Consequently, while it still hides some stall
penalties, it is not as effective as the out-of-order superscalar model. Also, like the out-of-order model, it employs dynamic branch prediction for minimizing stall penalties from changes in instruction control flow.

The third architecture model is a VLIW (very long instruction word) architecture, which is fully statically-scheduled and provides no dynamic hardware support. The VLIW architecture depends entirely on the compiler to define all data dependencies and explicitly schedule for parallelism.
The compiler schedules parallel operations into
groups of long instruction words, each of which enables simultaneous execution of all its parallel operations.
It requires strict in-order processing with no dynamic data
dependency checking, and employs only static branch prediction.
It provides no
mechanism for hiding latencies from stall penalties, so will typically have the worst performance. However, it also has the benefit of the simplest hardware design, which is believed to enable higher frequency processor design. To enable better performance, the base VLIW architecture supports a compressed instruction format.
Because the two superscalar processors use a different branch prediction scheme than the VLIW processor, a simple uncorrelated branch predictor was used that provides relatively comparable performance to that of static branch prediction. More complex dynamic branch predictors certainly exist which provide better performance, but such predictors employ branch correlation to improve performance.
Since static branch
prediction in the IMPACT compiler does not enable correlated branch prediction, we chose to use uncorrelated dynamic branch prediction in the base superscalar processor models as well, to enable a fair comparison of the three processors.

Memory-limited applications

While performing the experiments it immediately became obvious that memory is the primary bottleneck for many of the multimedia applications. As evident in Figure 4.1 and Figure 4.2, eight of the benchmarks are severely memory limited.
Their bus
utilization is very high and their IPC is quite low, especially in comparison with perfect cache performance. Because they are memory-limited, the impact of other architecture features being examined in this study will be significantly dampened by this bottleneck.
Consequently, these benchmarks will not be used in the remainder of this study.

Figure 4.1 – Performance of memory-limited benchmarks on an out-of-order superscalar.
Figure 4.2 – Bus utilization ranges for memory-limited benchmarks.

Before concluding with these benchmarks, though, it is interesting to note that aside from Ghostscript (gs) and the two Mesa graphics benchmarks, mipmap and osdemo, which are all non-compression applications, the remaining memory-limited benchmarks are all decompression benchmarks.
As opposed to encoding, where the amount of
computation per data element is typically much higher, the processing per data element in decoding is much lower, so decoding is more heavily dependent upon memory access than encoding and much more susceptible to memory bottlenecks.

Non-memory-limited applications

Figure 4.2 and Figure 4.3 show that the remaining benchmarks are not memory limited. Figure 4.2 provides the average bus utilization for both memory-limited and non-memory-limited benchmarks, and there is an obvious contrast between the two.
Also, Figure 4.3 indicates a much higher IPC for the non-memory limited
benchmarks on an out-of-order superscalar processor. There are still some benchmarks
that demonstrate a significant difference between their performance with and without perfect cache. However, the bus utilization is low in these cases, so the performance degradation is due to poor branch performance and/or normal memory stall penalties. These benchmarks form the basis for this media processor architecture evaluation.

Figure 4.3 – Comparison of performance of non-memory-limited benchmarks.

Results comparing the three basic architecture styles, shown in Figure 4.4, provide two important conclusions.
First, there is not a significant difference in
performance between the VLIW architecture and the in-order superscalar architecture. The difference is generally no more than 11%, a third of which is attributable to the difference in branch prediction methods, as discussed in Section 4.1.4. It was expected that the performance difference between VLIW and in-order superscalar would be higher since static scheduling inhibits the VLIW. The in-order superscalar is not so constrained and may issue operations as dictated by the dynamic data dependencies. The fact that the two results are so similar indicates that static scheduling works nearly as well as dynamic in-order scheduling for media processors. Because of its simple hardware design, the
VLIW architecture is obviously the better choice of the two. The results shown in Figure 4.4 are also presented individually for each application in Appendix A.

The second conclusion is that static scheduling is no match for dynamic out-of-order scheduling. The out-of-order superscalar processor enables an average of 70% better performance over the VLIW architecture for both real and perfect cache results. Some of this difference may be attributed to compiler inefficiencies, and it is possible that other aggressive ILP optimizations, such as software pipelining or loop transformations, may be able to boost VLIW and in-order superscalar performance. However, we believe the minimal difference in performance between the in-order superscalar and VLIW processors indicates relatively good compiler efficiency. Consequently, we attribute the majority of the performance difference to out-of-order issue capability. For achieving maximum ILP in media processors, dynamic out-of-order scheduling is a necessity.
Figure 4.4 – Comparison of performance of three processor models, simulated with and without perfect caches.
Compiler optimizations

Figure 4.4 also displays a marked difference in performance between the various compiler optimization levels. Because the performance of static scheduling is primarily dependent upon the compiler, evaluation with different compiler methods is necessary to thoroughly gauge its effectiveness. This experiment examined the three compiler optimization methods introduced in Section 2.2.1. From the results we can see that the hyperblock is the most effective method, providing 11% greater performance than the superscalar method and 31% better performance than the classical optimization method.
However, the hyperblock’s effectiveness varies considerably, with a 20%
performance increase over superscalar optimization on the out-of-order superscalar processor, but only 6% for the other architectures.
Also, predication requires an
additional source operand for specifying the predicate, so it is questionable whether an 11% performance improvement over the superscalar method justifies the extra register port (or separate predicate register file) and increased instruction size.

Issue width

Even with the most aggressive processor and compilation methods, the previous experiments produced ILP results of only 3 or fewer instructions per cycle, so we also experimented with smaller processor issue widths. Using the same ratio of functional units (rounding up as necessary), we evaluated processor widths of 1, 2, 4, and 8 issue slots.
As shown in Figure 4.5, performance levels off at 4 issue slots.
Performance increases by 18% when going from 2 to 4 issue slots, but increasing the issue width from 4 to 8 only generated 3% higher performance. Consequently, we expect optimum issue width for ILP in media processors to be 3-4 issue slots.
Figure 4.5 – Performance of various issue widths for VLIW and superscalar processors.

VLIW instruction formats

Within the VLIW architecture, there are two alternatives for representing a VLIW instruction: compressed and uncompressed formats. In the uncompressed format, VLIW instructions are scheduled so that all issue slots are explicitly assigned an operation; if not enough parallel operations are available to fill all issue slots, the empty slots are explicitly scheduled with NOP operations. This format wastes considerable space in memory and the instruction cache when there are many unused issue slots, as is typically the case. An alternative is a compressed instruction format, which eliminates the NOP operations and provides some mechanism, such as a stop bit, for indicating which operations belong in the same VLIW instruction. We found that the compressed format improved performance by 19% when using more aggressive compiler optimizations, as illustrated in Figure 4.6. The superblock and hyperblock optimization methods tend to increase code size by 50-100%; combining this larger code footprint with the additional space required for NOP operations, the working set size of many of the benchmarks exceeded the instruction cache size, significantly
decreasing performance. When compiling with just classical optimizations, the compressed instruction format only improved performance by 5%.

Figure 4.6 – Comparison of performance of VLIW fixed-width and variable-width instruction formats.
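The compressed format is straightforward to sketch: with a stop bit marking the last operation of each long instruction, the decoder regroups operations at fetch time and no NOPs are stored. The encoding below is a generic illustration (the stop-bit position and word layout are assumptions), not the format of any particular VLIW processor.

    #include <stdint.h>

    #define MAX_ISSUE 8
    #define STOP_BIT  0x80000000u  /* assumed: top bit ends the group */

    /* Gather one compressed VLIW instruction: consume operations from
     * the instruction stream until a stop bit is seen (or the issue
     * width is exhausted).  Empty slots are implied, not stored. */
    int fetch_vliw(const uint32_t *stream, int pos,
                   uint32_t group[MAX_ISSUE])
    {
        int n = 0;
        while (n < MAX_ISSUE) {
            uint32_t op = stream[pos + n];
            group[n++] = op & ~STOP_BIT;
            if (op & STOP_BIT)
                break;          /* last op of this long instruction */
        }
        return n;               /* ops consumed; caller advances pos */
    }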
4.1.4 Fetch Architecture

Dynamic aspects of instruction fetch are also important to processor design. The penalties from changes in instruction control flow are the second major source of stall penalties.
Minimizing these penalties on media processors requires a complete
understanding of the performance of the instruction fetch engine. This study examines the dynamic aspects of instruction fetch by evaluating three architecture features: aggressive versus conservative fetch mechanisms, dynamic versus static branch prediction, and the length of pre-execution pipelines.

Aggressive versus conservative fetch

A popular method for reducing the impact of stall penalties from instruction cache misses is to decouple the instruction fetch pipeline from the execution pipeline. This is enabled by an instruction buffer, which works in conjunction with branch prediction to
provide prefetching of the predicted instruction control stream.
Instructions are
prefetched and placed in the instruction buffer, from which the execution pipeline accesses instructions as needed. The buffering between the fetch and execute pipelines enables each to continue operating when the other stalls. This decoupled fetch-execute method is commonly found in superscalar architectures.
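The buffering itself can be sketched as a simple ring queue between the two pipelines; the interface below is illustrative, with the 24-entry size corresponding to three times an 8-wide issue width, as in the experiment that follows.

    #include <stdint.h>

    /* Instruction buffer decoupling fetch from execute: fetch enqueues
     * along the predicted path while execute dequeues as slots free
     * up, letting either side keep working while the other stalls. */
    #define IBUF_SIZE 24

    typedef struct {
        uint32_t ops[IBUF_SIZE];
        int head, tail, count;
    } ibuf;

    static int ibuf_push(ibuf *b, uint32_t op)   /* fetch side   */
    {
        if (b->count == IBUF_SIZE) return 0;     /* fetch stalls */
        b->ops[b->tail] = op;
        b->tail = (b->tail + 1) % IBUF_SIZE;
        b->count++;
        return 1;
    }

    static int ibuf_pop(ibuf *b, uint32_t *op)   /* execute side */
    {
        if (b->count == 0) return 0;             /* issue stalls */
        *op = b->ops[b->head];
        b->head = (b->head + 1) % IBUF_SIZE;
        b->count--;
        return 1;
    }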
Figure 4.7 – Performance difference between aggressive and conservative fetch mechanisms for four different processor models.

An experiment was performed to evaluate the benefit of decoupled fetch versus conventional un-buffered instruction fetch. The size of the instruction buffer was set to three times the processor issue width. The difference in performance between the aggressive and conservative methods on four different architectures is shown in Figure 4.7. Aggressive fetch provides negligible benefit for VLIW architectures, which are forced to issue complete groups of parallel operations. The aggressive fetch mechanism is more beneficial to superscalar architectures, which are able to issue operations atomically. However, the out-of-order superscalar processor has an issue-reorder buffer, which serves in a similar capacity, so an instruction buffer provides minimal additional benefit. Even
the in-order superscalar architecture, which receives the greatest benefit, still achieves only a moderate performance improvement. Because multimedia applications are highly loop-oriented and only execute over small sections of code, the decoupled fetch engine provides minimal gain for media processors.

Branch prediction

Branch prediction is a necessary mechanism for reducing the penalties associated with changes in the direction of the instruction control stream. Static branch prediction is most important from the perspective of static scheduling because its accuracy dictates the effectiveness of global scheduling techniques such as speculation and predication, which move operations across branches or combine branches, respectively. However, even though the predictable nature of multimedia code enables good static branch prediction in media processing, its accuracy is still limited. For example, the base VLIW processor, which uses static branch prediction, has an average branch hit rate of 86.1%. The base superscalar processors use a 1024-entry 2-bit counter dynamic branch predictor, which provides a hit rate of 91.1%, an improvement of 56% in branch miss rate. For further reduction of the penalties from mispredicted branches, a comparison was made of three dynamic branch predictors: an uncorrelated 2-bit counter predictor and two branch history table predictors.
For the 2-bit counter
predictor, five different prediction table sizes were examined with 512 to 8192 entries. With regards to dynamic branch history table predictors, there exist a variety of alternatives. Yeh and Patt [111] organized the major alternatives into nine different categories. On a cost-performance basis, they found that the PAs(6,16) predictor, which is a per-address history table with 6 bits of branch history and 16 pattern history tables,
provides the best performance on general-purpose code. Since they also found that additional bits of branch history provide improved performance, but only evaluated PAs with 1, 16, and 256 pattern history tables, we also suspect the PAs(10,8) predictor may prove a good alternative.
Consequently, for branch history table predictors we are evaluating the
PAs(6,16) and PAs(10,8) models with 256 to 2048 entries. The results will be presented according to the number of history and prediction bits required for each predictor, as defined in Table 4.3. This cost metric ignores the bits for address fields, as these fields will be necessary for all methods.

Dynamic Predictor            Number of Entries    Simplified Size (bits)
Uncorrelated 2-bit Counter   512                  1K
                             1024                 2K
                             2048                 4K
                             4096                 8K
                             8192                 16K
PAs(6,16)                    256                  3.5K
                             512                  5K
                             1024                 8K
                             2048                 14K
PAs(10,8)                    256                  3.5K
                             512                  6K
                             1024                 11K
                             2048                 21K

Table 4.3 – Dynamic branch predictor size.
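The uncorrelated 2-bit counter scheme is simple enough to sketch directly in C; the PAs schemes additionally keep a per-address history register that selects among pattern history tables. The sketch below shows only the 2-bit counter, with a 512-entry table whose 1K bits of state match the first row of Table 4.3.

    #include <stdint.h>

    /* Uncorrelated 2-bit saturating counter predictor, 512 entries,
     * indexed by branch PC.  States 0-1 predict not-taken, 2-3 taken. */
    #define BP_ENTRIES 512

    static uint8_t counters[BP_ENTRIES];   /* each holds 0..3 */

    int bp_predict(uint32_t pc)
    {
        return counters[(pc >> 2) % BP_ENTRIES] >= 2;   /* taken? */
    }

    void bp_update(uint32_t pc, int taken)
    {
        uint8_t *c = &counters[(pc >> 2) % BP_ENTRIES];
        if (taken  && *c < 3) (*c)++;   /* saturate at strongly taken     */
        if (!taken && *c > 0) (*c)--;   /* saturate at strongly not-taken */
    }

The saturation means a single anomalous branch outcome, such as a loop exit, does not flip a strongly biased prediction, which suits the highly loop-oriented behavior observed in Chapter 3.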
From the results presented in Figure 4.8, it is immediately apparent that media processors do not require large dynamic branch prediction tables. Even for the larger dynamic branch history tables, sizes beyond a few kilobits are simply wasted area. The most effective predictor was the PAs(10,8) model, although its hit rate was typically less than 0.5% better than the PAs(6,16) model, and both had hit rates only about 2% better than the uncorrelated 2-bit counter predictor. The best performance gains were actually realized by the hyperblock compiler method, which improves branch performance by removing some of the difficult-to-predict branches. Overall, media
processors can achieve good performance with small, simple dynamic branch predictors.

Figure 4.8 – Performance of three dynamic branch predictors versus predictor size: uncorrelated 2-bit counter, PAs(6,16), and PAs(10,8).

Figure 4.9 gives another perspective on dynamic branch prediction performance. In Figure 4.8, it was found that small dynamic branch predictors are sufficient for good branch prediction performance in media processing.
Figure 4.9 compares the
performance of a 512-entry 2-bit counter predictor, a 256-entry PAs(6,16) predictor, and a 256-entry PAs(10,8) predictor on an in-order superscalar processor. The results are
normalized to the execution time of the 2-bit counter predictor for each compilation method. The results indicate that the history table predictors perform an average of 1.5% better than the 2-bit counter, and the 256-entry PAs(10,8) performs the best overall.

Figure 4.9 – Average normalized execution time for dynamic branch prediction schemes; normalized to 2-bit counter performance for the corresponding compilation method.

A final alternative for branch prediction on media processors is to use correlated static branch prediction. One proposed method encodes branch history information in the program counter by duplicating basic blocks along paths with correlated branches. The initial study achieved up to a 14.7% improvement in branch prediction accuracy using branch correlation with a history of up to 8 branches, at the cost of less than a 30% increase in code size [114]. A subsequent study compared static correlated branch prediction to various dynamic branch predictors and found that it provided performance comparable to correlated dynamic branch predictors [115]. Furthermore, the use of static correlated branch prediction in conjunction with correlated dynamic branch predictors further improved performance. Also, it is conceivable that static correlated branch prediction could be performed without requiring code
duplication, by using a prepare-to-branch (PBR) operation [116] for static branch prediction in conjunction with conditional execution. We leave the study of correlated static branch prediction performance in media processing as an area for further research.

Pre-execution pipeline length

The length of the pipeline prior to the first execution stage is important in determining the cost of mispredicted branches.
A branch typically resolves
during the first execution stage of the pipeline, so the cost of mispredicting a branch is equal to one plus the number of pre-execution pipeline stages.
In most existing
processors the number of pre-execution pipeline stages is usually between three and five. There are one or two instruction fetch stages, one or two decode stages, and a register fetch stage. Superscalar processors typically have longer pre-execution pipelines than VLIW processors. We evaluated pre-execution pipelines with lengths from two to six stages on all three processor models. We also evaluated the performance using both conservative and aggressive fetch mechanisms, and fixed and variable-width VLIW instruction formats. Surprisingly, there was minimal variation in performance across the different architectures.
For each additional pre-execution pipeline stage added, performance
dropped by only 2%. Furthermore, this degradation was always within the range of 1.5-2.5%, irrespective of architecture style, compilation method, or fetch mechanism. Consequently, we can reliably expect about a 2% performance degradation for each additional pre-execution stage in media processor design.
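As a rough consistency check on this figure, the penalty model itself predicts about the same number. The following back-of-the-envelope uses assumed round values (a branch frequency near 15% of operations and a misprediction rate near 10%), not measured ones:

    mispredict penalty        = 1 + Npre cycles, so +1 cycle per miss per added stage
    extra stall cycles per op = P(branch) x P(miss) = 0.15 x 0.10 = 0.015
    relative slowdown         = 0.015 / CPI = 0.015 x IPC = 0.015 x 1.5 = roughly 2%

At IPCs near 1.5, each added pre-execution stage costs on the order of 2%, consistent with the measured degradation.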
4.1.5 Frequency Effects

As the frequency of a processor increases, wire delay becomes a more dominant factor. To maintain a short cycle time, the extra wire delay means less time is available for communication and logic delay in each pipeline stage. Effects of the extra wire delay on architecture design and performance include longer operation latencies and delayed bypassing. Each of these is discussed in turn below.

Operation latencies

The longer operation latencies that accompany higher processor frequencies reduce performance. From a scheduling point of view, longer operation latencies increase the length of the critical path in each scheduling region. Because the same number of operations is available per scheduling region, but the critical path is now longer, the overall parallelism is reduced. To examine the performance degradation from increasing operation latencies we evaluate three different processor models over the frequency range from 250 MHz to 2 GHz, each with the operation latencies scaled appropriately, as defined in Table 4.4. Aside from the scaled operation latencies, the architectures all remain the same as the base architecture described in Table 4.1. Figure 4.10 presents the performance results for the three processor frequency models with nine different combinations of compiler optimizations and architecture types.
While this figure indicates an approximate performance difference of 10%
between processor frequency models, the variation in performance degradation for different architectures and compiler methods is more apparent in Figure 4.11, which illustrates the performance differences between models 1 and 2, and between models 2 and 3. On average, going from one processor frequency model to the next higher-frequency model degrades performance by approximately 11%.
The dynamic out-of-order superscalar displays the least
performance degradation by far among the various architecture types.
Because
scheduling is variable at run time in out-of-order superscalar architectures, the schedule can be adjusted to lessen performance degradation. Scheduling is fixed at compile time in statically-scheduled architectures, so they must accept the full degradation.

Operation         Model 1        Model 2 (Base)    Model 3
Frequency Range   250-500 MHz    500 MHz – 1 GHz   1-2 GHz
ALU               1              1                 1
Branches          1              1                 1
Store             1              2                 3
Load              2              3                 4
Floating-Point    3              4                 5
Multiply          3              5                 7
Divide            10             20                30

Table 4.4 – Operation latencies for three processor frequency models.

Among the compiler optimization methods, the superscalar method is the least influenced by longer operation latencies. It employs significant amounts of speculation that help minimize the effects of longer operation latencies. The hyperblock method also employs speculation, but the method is optimized for predication, which converts control dependencies into additional data dependencies, thereby limiting speculation.
Figure 4.10 – Performance of three frequency models; I-O Superscalar indicates in-order superscalar; O-O-O Superscalar indicates out-of-order superscalar.

Figure 4.11 – Performance difference of three frequency models; I-O Superscalar indicates in-order superscalar; O-O-O Superscalar indicates out-of-order superscalar.
Bypassing models

Another aspect of increasing wire delay is greater communication cost. In existing processors that provide multiple parallel functional units, bypassing is typically provided between all the functional units: rather than waiting for a result to be written to the register file, results are immediately made available to all other functional units. Dependent operations may then use the result immediately, avoiding one or two cycles of extra delay waiting for the result from the register file. With increasing wire delay, the time to move a result between functional units increases, so bypassing requires more time. Providing bypassing therefore requires either lengthening the cycle time to allow the necessary communication time, or assuming a one-cycle delay for delayed bypassing or no bypassing. An experiment was conducted to compare the performance of bypassing without any delay and bypassing with a one-cycle delay. Figure 4.12 displays the difference in performance for immediate bypassing and delayed bypassing.
The extra one-cycle delay in bypassing creates significant performance degradation, with an average decrease of 35%. It is possible that limited degrees of bypassing, such as self-bypassing and/or bypassing between nearest-neighbor functional units, could eliminate some of this degradation, but to what degree is unknown. As in the experiment with longer operation latencies, the dynamic out-of-order superscalar processor and the superscalar compilation method were again the least susceptible to decreasing performance. Out-of-order scheduling and speculation will be important architecture and compiler features for mitigating the effects of increasing frequency in media processors.
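The severity of delayed bypassing can be seen with a first-order model of our own (not taken from the simulator). With full bypassing, an operation dependent on a producer with latency $L_i$ can issue after $L_i$ cycles; a one-cycle bypass delay adds a cycle per link of a dependence chain:

\[
T_{\text{full}} = \sum_i L_i , \qquad T_{\text{delayed}} = \sum_i (L_i + 1).
\]

For a chain of single-cycle ALU operations this is a worst-case 2x slowdown; the measured 35% average is consistent with only a portion of each schedule lying on such critical dependence chains.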
Figure 4.12 – Difference in performance (% IPC) between processors with immediate bypassing and bypassing with a one-cycle delay, for the VLIW, in-order superscalar, and out-of-order superscalar architectures under the classical, superscalar, and hyperblock compilation methods.
4.1.6 Summary

This section performed an architecture evaluation for purposes of examining the dynamic aspects of media processing. Most commercial media processors currently use statically-scheduled architectures, but with increasing frequencies and degrees of parallelism the dynamic aspects of processing become more pronounced, so future media processors may require dynamic scheduling.
This section examined three dynamic aspects of media processing: fundamental architecture style, instruction fetch architecture, and high-frequency effects, in order to determine the necessity of dynamic hardware support. In the exploration of fundamental architecture style, we found that statically-scheduled VLIW architectures perform comparably to in-order superscalar processors, but dynamic out-of-order scheduling performs significantly better, with 70% average improvement over the VLIW processor.
An evaluation of the compiler techniques
indicated the hyperblock optimization method produced the best performance, but the
performance difference over the superblock optimization method may not warrant the cost of adding conditional execution to the architecture. Because the combination of the best architecture and compiler still produced average IPCs of less than 3, we experimented with smaller issue widths, and found 3-4 issue slots to be ideal for supporting ILP in media processors. Finally, an evaluation of VLIW instruction formats yielded 19% better performance with a compressed format when using aggressive ILP optimizations.

Evaluation of the instruction fetch architecture determined that media processors do not require significant dynamic support for instruction fetch. Aggressive decoupled fetch mechanisms provide only moderate performance improvements, and good branch prediction performance can be achieved with small, simple branch predictors, such as a 256 or 512-entry PAs(10,8). Additionally, long pre-execution pipelines perform well since each additional stage decreases performance by only 2%.

Finally, the effects of higher processor frequencies were analyzed using processor models with longer operation latencies and delayed bypassing. The processor models with longer operation latencies reduced performance by an average of only 10%, whereas adding a one-cycle delay to bypassing caused a significant degradation of 35%. However, dynamic out-of-order scheduling and superscalar optimization were less susceptible to performance degradation than the other architectures and compiler methods.

Overall, we have shown that some dynamic hardware support will be needed in future generations of media processors. Dynamic out-of-order scheduling in particular had significantly better performance than the alternative architectures, and it is less susceptible to the high-frequency effects of longer operation latencies and reduced bypassing.
Also, small dynamic branch history tables can reduce branch miss rates by nearly 100% compared with static branch prediction. Dynamic hardware support such as out-of-order scheduling and branch history tables is necessary for achieving the highest performance in future media processors.
4.2 Highly Parallel Architectures

As discussed in Section 1.3, support for future generations of multimedia will
require media processors that are capable of considerably higher throughput than available today. The parallelism inherent in most multimedia applications can enable significant throughput, but even with this parallelism a high-frequency processor is still necessary for meeting the demands. One example of the computational demands on parallel media processors (PMPs) is given by the MPEG-2 encoder, which can require in excess of one billion operations to compress a few frames of 720x480 resolution video. This means media processors capable of many billions of operations per second are needed to achieve real-time MPEG-2 encoding. MPEG-4 encoding will only exacerbate this situation, as it must also perform segmentation of video frames into objects.

Unfortunately, high parallelism and high frequency are counter-productive goals. Very high frequency processors are quite feasible with low issue widths, but as parallelism increases, the demands on the register file, memory, and datapath grow significantly, often exponentially. Each additional issue slot adds 3-4 ports to the register file, potentially a memory port, one or more bypass paths, and extra functional units and wires that increase area and loading. Consequently, increased parallelism results in decreased frequency for conventional architectures. To achieve both high parallelism and high frequency, more distributed architectures are required.
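To put the MPEG-2 encoding demand cited above in concrete terms, here is a back-of-the-envelope estimate of our own, assuming "a few frames" means three and a 30 frames/s display rate:

\[
\frac{10^{9}\ \text{operations}}{3\ \text{frames}} \times 30\ \frac{\text{frames}}{\text{s}} \approx 10^{10}\ \text{operations/s},
\]

i.e., on the order of ten billion operations per second for real-time encoding.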
4.2.1 Proposed Distributed Architectures

A variety of architectures for future processors have been proposed that incorporate distributed processing methods for achieving greater parallelism in high-frequency processors. While many of these proposals have been for general-purpose processor architectures, they may also be relevant to media processors. These proposals include trace processors [117], single-chip multiprocessors [117], simultaneous multithreading [106], array processors [107], and multi-cluster architectures.

Trace processors consist of multiple distinct processing cores. Each processing core contains separate functional units and its own register file. The size of a processing core is limited to no more than 4-6 issue slots to maintain high frequency, but together the multiple processing cores enable high degrees of parallelism. A compiler or dynamic scheduler issues traces from a shared instruction issue unit for speculative execution on separate processing cores.
To maintain memory coherency during
speculative execution, the processing cores all share the same cache memory.

A single-chip multiprocessor is essentially a shared-memory multiprocessor integrated into a single chip. There are numerous processors on the same chip, each with its own register file, functional units, scheduler, and first-level instruction and data caches. The second-level cache and subsequent levels of the memory hierarchy are all shared. Like the trace processor, the processing cores are all limited in size for high frequency, but combined they enable high degrees of parallelism. One benefit of single-chip multiprocessors over conventional shared-memory multiprocessors is that the tight integration on a single chip enables significantly shorter communication times. This
allows the compiler to take advantage of finer granularities of parallelism than is reasonable with larger communication costs.

Simultaneous multi-threading is an extension to a processor whereby the processor is able to execute operations from multiple instruction streams (threads). These threads all vie for the same processor resources, such that when one thread is stalled, the other threads can take advantage of the stalled thread's resources and continue to maintain high utilization of processor resources.
While this is not a distributed
architecture technique in and of itself, it can easily be combined with distributed architecture techniques. By their nature, threads are distinct from each other, and can be run on separate processing cores. When one thread stalls, other threads can temporarily borrow the stalled thread's processing unit(s).

Array processors have been identified as possible alternatives for media processors because they enable high degrees of SIMD processing. The array processor proposed in [107] combines arrays of simple 8 or 16-bit processors with a RISC master processor. The array of processors can effectively support the regular, computationally intensive data processing, while the RISC master processor is able to support the less computationally intensive program sections that entail greater control complexity. Because all the processing units are separate, they enable high frequency, and their combined processing power enables significant throughput.

A final alternative for increasing parallelism while maintaining high frequency is through clustering. A clustered architecture divides the architecture into disjoint groups of functional units, with all clusters sharing the same instruction stream. Clusters are homogeneous, usually providing 2 to 4 issue slots, with each cluster having its own register file and potentially its
own local memory or cache. The distributed design allows high frequency by limiting the register ports, memory ports, bypass paths, and area within each cluster. The clusters are connected by a communication network, which passes results between clusters as required. Any number of clusters may be used to achieve the desired parallelism, with the primary impact on frequency coming from the increased demands on the communication network, the memory hierarchy, and the control logic. In essence, a clustered architecture is a group of processors that share a common control stream and are tightly connected by a low-latency interconnect.
This multi-cluster architecture
approach is the one used in our architecture for video signal and media processors.

Multi-cluster architectures are becoming popular because it is possible to achieve nearly equivalent performance on a clustered architecture as on an unclustered architecture with the same total issue width. Commercial processors such as the Alpha 21264 [61] and the TI VelociTI [42] already use clustering. Studies of clustering in superscalar architectures have shown only a slight decrease in performance when an appropriate selection algorithm exists for assigning instructions to clusters [118]. Similarly, on static architectures, a number of studies have developed effective techniques for scheduling on clustered architectures.
Two such works include list
scheduling of acyclic dependency graphs [119], and modulo scheduling of cyclic dependency graphs [120]. On clustered architectures with 3-4 issue slots per cluster, the performance degradation from clustering is typically less than 10% compared to a non-clustered architecture of the same issue width.
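A minimal sketch of such a selection algorithm is shown below, written in C in the spirit of the cited cluster-assignment work (the published algorithms [118][119] differ in detail; this simplified greedy version is our own illustration). Each operation is placed on the cluster that already holds most of its source operands, with ties broken by current cluster load, keeping both inter-cluster copies and load imbalance low.

#define NUM_CLUSTERS 8

/* Greedy cluster assignment: favor operand locality, then balance load.
 * src_cluster[i] gives the cluster holding the i-th source operand. */
int assign_cluster(const int src_cluster[], int num_srcs,
                   const int cluster_load[])
{
    int score[NUM_CLUSTERS] = {0};
    int best = 0;

    for (int i = 0; i < num_srcs; i++)
        score[src_cluster[i]]++;              /* count operands per cluster */

    for (int c = 1; c < NUM_CLUSTERS; c++)
        if (score[c] > score[best] ||
            (score[c] == score[best] && cluster_load[c] < cluster_load[best]))
            best = c;                          /* break ties by lighter load */

    return best;
}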
4.2.2 Princeton Multi-Cluster Architecture

In order to achieve a high-frequency processor design, the size of a processor's datapath is constrained by the size and the number of ports in the register file and local memory, as well as by the degree of bypassing and the cost of decoding many parallel operations. To enable processor widths of 8+ issue slots, a single global register file and memory simply are not feasible.
The alternative selected was a distributed cluster
architecture, similar to the one proposed in [121], where each cluster has its own functional units, register file, and local memory, and all clusters are connected by a communication network.

An evaluation of the VLSI design space for this wide-issue multi-cluster video signal processor was performed by Santanu Dutta during his Ph.D. work at Princeton [65][122]. Using a 0.25 μm CMOS process, he constructed and evaluated parameterizable designs of key architecture modules, including the register file, local memory, and a network interconnect. From the results of the detailed module designs, it was found that a video signal processor with a 16-bit datapath and 8 clusters of 4 issue slots per cluster was possible at a clock rate of 650 MHz. Each cluster could support a register file of up to 256 registers and a maximum of 32 KB of local memory. Inter-cluster communication is provided by a single-cycle 32x32 crossbar interconnect. Detailed design of the functional units was not explicitly performed, but external sources [123][124] indicate the performance of similar functional units designed in a 0.25 μm process.
From these results it was
concluded that the processor could support 4 ALUs, 1 memory unit, 1 shifter, and 1 multiplier, each with a latency of 1-2 cycles.
An alternate model with smaller clusters allowed a video signal processor with a clock frequency of 850 MHz. It supports 16 clusters of 2 issue slots per cluster, with local memories of up to 16 KB, and register files of 64 registers. In this case, however, only a 16x16 crossbar interconnect may be supported, allowing only one slot per cluster to be connected to the crossbar. More detail regarding the multi-cluster architecture for video signal processors may be found in Wolfe et al. [62].

This multi-cluster architecture has been modified for media processing, and is shown below in Figure 4.13. First, the datapath has been widened to 32 bits. While video signal processing predominantly uses 8 and 16-bit data types, the results of Section 3.4 indicate media processing requires floating point and a greater degree of larger integer data types.
Figure 4.13 – Overview of the Princeton multi-cluster architecture for media processing: eight clusters, each with a local register file, a local level-1 data cache, and ALU, FP, memory, multiply, and shift units, connected by an interconnect network and sharing a level-1 instruction cache and a level-2 unified cache.

Additional modifications have been made to the number and type of functional units. The results of Dutta's work indicate only a minor change in performance from increasing the number of memory ports from one to two, so we will continue with the resource ratios found in Section 3.2, and offer two memory units per 4-issue cluster.
Also, floating-point support has been added. Again, it uses the appropriate resource ratio, so one floating-point unit is provided per 4-issue cluster.

This multi-cluster architecture for media processing will be examined more closely in Chapter 6, when discussing the levels of parallelism available in multimedia and how a compiler must go about extracting that parallelism. While there does not exist significantly more ILP in media processing than in general-purpose processing, there is extensive data parallelism in multimedia. This data parallelism corresponds well with the design of a multi-cluster architecture.
A speculative method of compiling for data
parallelism will be introduced that enables much larger degrees of parallelism than ILP alone.
Some additional vector-like modifications will be made to the multi-cluster
architecture for supporting this speculative compilation approach.
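For reference, the design points discussed in this section can be summarized compactly as follows (our restatement in C, not a hardware description; the clock rate of the modified media-processing configuration was not separately evaluated and is carried over from the 8x4 video signal processor design, and the per-cluster unit mixes follow the ratios described above).

typedef struct {
    int clock_mhz;          /* target clock rate                        */
    int num_clusters;
    int issue_per_cluster;
    int regs_per_cluster;   /* registers in each local register file    */
    int local_mem_kb;       /* local memory or cache per cluster        */
    int datapath_bits;      /* widened from 16 to 32 for media          */
    int mem_units;          /* memory units per 4-issue cluster         */
    int fp_units;           /* floating-point units per 4-issue cluster */
} cluster_config_t;

static const cluster_config_t vsp_8x4   = { 650,  8, 4, 256, 32, 16, 1, 0 };
static const cluster_config_t vsp_16x2  = { 850, 16, 2,  64, 16, 16, 1, 0 };
static const cluster_config_t media_8x4 = { 650,  8, 4, 256, 32, 32, 2, 1 };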
4.3 Summary

This chapter examined two important issues in the design of datapaths for media
processors. First, it performed an architecture evaluation, which examined the dynamic aspects of media processing and the performance of various architectural features. Second, it discussed possible methods of design for highly parallel architectures and proposed a multi-cluster media processor design in particular.

The architecture evaluation provided a comparison of static and dynamic architectures for media processing.
Media processors have traditionally used static
architectures, but with increasing frequency and parallelism, the dynamic aspects of processing become more pronounced, so dynamic support may be needed. The areas examined in the architecture exploration included fundamental architecture style, fetch architectures, and frequency effects.
In the basic architecture style investigation it was found that VLIW architectures and in-order superscalars have comparable performance, but out-of-order processors provide 70% better performance.
Consequently, the highest performance media
processors will require dynamic out-of-order scheduling, but VLIW architectures are sufficient when low power and low cost are necessary. Additional conclusions include the fact that ILP can only practically support widths of 3-4 issue slots, and aggressive optimizations provided the best performance even on the dynamic superscalar processors, with the hyperblock optimization offering the best overall performance.

Evaluation of the fetch architecture for media processors concluded that instruction fetch is not a critical component of media processing. Aggressive fetch methods provide only moderate performance improvements on superscalar processors and negligible benefits on VLIW processors. Dynamic branch prediction is a desirable feature because it provides 50-100% better prediction rates than static branch prediction, but only small, simple dynamic branch predictors are required. Finally, adding extra pre-execution pipeline stages creates only a minimal 2% performance degradation for each additional stage, so longer pipelines have only minimal effect on branch penalties.

Analysis of various processor frequency models yielded important results. While increasing some operation latencies for higher frequency processors degrades performance by only 10%, delaying bypassing of results between functional units causes a 35% average performance decrease. The dynamic out-of-order superscalar processor and the superscalar optimization method, however, were least affected, proving the utility of speculation in minimizing the effects of high frequency.
The multi-cluster architecture was introduced as a possible architecture for future media processors. The next generation of multimedia promises significant computing demands, which can only be met by processors capable of both high frequency and high parallelism. Distributed architecture methods are necessary to achieve these conflicting goals. While a number of distributed architecture solutions are available, we believe the multi-cluster architecture will serve well, as it corresponds favorably with the characteristics of data parallelism, as discussed in Chapter 6.
Chapter 5. Memory Hierarchy
This chapter continues the architecture evaluation of media processors with an examination of the memory hierarchy. While numerous studies of memory performance in media processors have been performed, we still feel that the full memory hierarchy for programmable media processors is an ill-defined area.
Most existing studies have
focused on a particular level or aspect of the memory and have not examined the impact on the entire memory hierarchy. This chapter performs an evaluation of a cache memory hierarchy, varying many parameters at different levels of the hierarchy to determine the primary memory problems that exist in media processing. The chapter begins with an overview of some previous research on memory structures for multimedia, paying particular attention to memory prefetching structures such as stream buffers and stride prediction tables. Sections 5.2-5.4 then examine the various levels of the memory hierarchy, from the L1 cache to external memory. For the experiments performed in this chapter, the same base processor model introduced in Section 4.1.2 is used.
5.1 Related Work

Whereas general-purpose processors use multi-level cache memory hierarchies
[103], and typical DSPs use local memory with some form of prefetching such as DMA,
the memory hierarchy for media processors remains an unresolved issue. We believe some hybrid of the two will provide the best performance.

Studies by Wu and Wolf on video application traces [58][57] have examined both cache memory systems and hybrid memory architectures that combine a stream buffer or stride prediction table with cache. These studies concluded that cache memory combined with stream buffers had the best performance; however, they are based on trace-driven simulations that assume perfect branch prediction and memory disambiguation. Zucker et al. [125] at Stanford also examined the value of these streaming memory structures for multimedia. Using the JPEG and MPEG multimedia applications, that work examined three prefetching techniques for streaming data: the stream buffer, the stride prediction table, and a hybrid of the two called a stream cache. Compared to a cache system without any prefetching, all the techniques were effective at eliminating many of the cache misses, though the effectiveness of each varied according to the size of the cache used. For small caches, the stream buffer and stream cache were more effective, while the stride prediction table performed best for large cache sizes.

Another media processor architecture with streaming memory structures combined a clustered architecture with a stream buffer-like prefetching structure called a stream register file [108].
This architecture and memory hierarchy enable a high-
frequency clustered architecture coupled with the benefits of prefetching support. However, this scheme also requires a special programming paradigm, which is undesirable for generic HLL programmability.

In addition to the need for streaming memory support, the large data rates in multimedia can place significant burdens on external memory, so the external memory
bandwidth often becomes a bottleneck. To meet the necessary data rates, either memory hierarchies must be designed to reduce the external memory traffic, or scheduling methods must be used to restructure the program for improved data locality [107][126].

The number of parallel memory accesses each cycle is another critical issue in memory hierarchy design for media processors. Because memory operations amount to 25-30% of all operations, supporting many parallel operations also requires many parallel memory accesses per cycle. Large multi-ported memories are not feasible at high frequencies, so highly banked or distributed memories are required instead. Because of the large communication costs at high frequencies, even highly banked memories will not support more than 4-8 parallel memory accesses, so it is likely that separate local memories will be required in each cluster. Memory coherency will then need to be handled by the compiler and/or higher levels of the memory hierarchy.

Overall, the memory hierarchy for media processors is poorly understood. Although it is likely streaming memory structures will be necessary, it is unknown whether cache, memory with prefetching, cache with prefetching, or even some other novel memory structure will prove most suitable for multimedia. Furthermore, the problems of reducing external memory bandwidth and supporting numerous memory accesses per cycle are also open areas. Considerable research remains in the area of memory hierarchy design for media processors.
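To make the stride prediction table idea concrete, the following is a minimal sketch (our illustration of the general technique; the entry count, indexing, and confidence rule are assumptions, not parameters from the studies cited above). Each entry tracks the last address referenced by a load instruction and the stride between its last two accesses; once a stride repeats, the next address can be prefetched ahead of the demand access.

#include <stdint.h>

#define SPT_ENTRIES 64

typedef struct {
    uint32_t pc;         /* tag: address of the load instruction */
    uint32_t last_addr;  /* last data address it referenced      */
    int32_t  stride;     /* last observed stride                 */
    int      confident;  /* stride repeated on consecutive uses  */
} spt_entry_t;

static spt_entry_t spt[SPT_ENTRIES];

/* Update the table on each load; returns a prefetch address,
 * or 0 when no confident stride has been established. */
uint32_t spt_access(uint32_t pc, uint32_t addr)
{
    spt_entry_t *e = &spt[(pc >> 2) % SPT_ENTRIES];
    uint32_t prefetch = 0;

    if (e->pc == pc) {
        int32_t stride = (int32_t)(addr - e->last_addr);
        e->confident = (stride == e->stride);
        if (e->confident && stride != 0)
            prefetch = addr + stride;   /* predict the next access */
        e->stride = stride;
    } else {
        e->pc = pc;                     /* allocate entry for a new load */
        e->stride = 0;
        e->confident = 0;
    }
    e->last_addr = addr;
    return prefetch;
}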
5.2 L1 Cache

The studies on memory characteristics of multimedia applications in Chapter 3
revealed typical working set sizes of less than 32 KB for data memory and less than 8 KB for instruction memory. Use of aggressive compiler optimizations, however, increases code
size by 50-100%, so a 16 KB instruction cache is needed. The architecture evaluation in Chapter 4 used direct-mapped caches of 32 KB and 16 KB for the data cache and instruction cache, respectively. These caches provided excellent first-level cache performance and represent good choices for first-level memory in media processors. The primary remaining issue for first-level memory hierarchy design is whether memory prefetching structures such as stream buffers or stride prediction tables should be incorporated at this level, and if so, what their sizes should be. While no direct evaluation of these memory structures was performed here, considerable insight will be found with regard to memory prefetching in the subsequent sections.
5.3 L2 Cache

Conventional wisdom in processor design has been that providing more memory
on-chip enables better cache performance and reduces the impact of long external memory latencies. It is questionable whether this will be true in multimedia, however. Multimedia is prone to streaming data that is accessed by the application for a short period of time, and then thrown away and never needed again. In such cases, additional on-chip memory may not be of benefit. Experiments with the L2 cache varying cache size, line size, and miss latency have been performed to more thoroughly understand multimedia's use of the second-level cache.

Cache Size

The first experiment on the L2 cache evaluated the performance impact of increasing cache size. As mentioned above, we believed that increasing the amount of on-chip memory may not be as useful for media processing as for general-purpose processing, but even so the results are quite surprising. As is evident in Figure 5.1, there
is almost no impact of L2 cache size on performance. The average performance increase from doubling cache size is only 0.5%. Only three benchmarks show any noticeable difference in performance: unepic, mpeg4dec, and h263enc, which have performance increases of 7.0%, 6.2%, and 1.7%, respectively, for each doubling of L2 cache size.

While it is quite obvious that L2 cache size is not important from the standpoint of any one multimedia benchmark, it will become much more important when simultaneously executing many multimedia applications on the same processor. In such cases, larger L2 cache sizes will be needed to maintain the memory image for multimedia applications during context switches. However, with our base of 256 KB, the L2 cache is already able to support the memory image for more than 8 typical multimedia applications, since the average working set size is less than 32 KB.
Figure 5.1 – Performance results on IPC from varying L2 cache size (128 KB to 1024 KB), for the fixed-width VLIW, variable-width VLIW, in-order superscalar, and out-of-order superscalar processors.

Miss Latency

A second experiment was performed to determine the impact of increasing L2 cache miss latency on media processing. As shown below in Figure 5.2, L2 cache miss
latency had only a moderate effect on multimedia performance. Overall the performance degradation was only 5.6% for the various processor types and multimedia applications. The performance degradation was slightly greater on the out-of-order superscalar because of its higher IPC, and on the fixed-width VLIW because the L1 instruction cache is unable to hold the entire instruction working set for some applications. Among the various benchmarks, only three were significantly affected by increasing miss latency: pegwitenc, pegwitdec, and mpeg2dec.
The two Pegwit applications had average
performance degradations of 35% because the data working set does not fit entirely in the L1 data cache. Mpeg2dec has a considerably lower degradation of 16%, and all other applications are affected by less than 7% from each doubling of the L2 cache miss latency.
Figure 5.2 – Performance results on IPC from varying the latency to access a 64-byte L2 cache line (on an L1 cache miss), for the fixed-width VLIW, variable-width VLIW, in-order superscalar, and out-of-order superscalar processors.
Line Size

The final experiment on the L2 cache evaluates performance as a function of line size. The line size was varied from 32 to 512 bytes, as shown below in Figure 5.3.
As opposed to L2 cache size and miss latency, there is a much more noticeable deviation in performance from increasing L2 cache line size. Overall, the average decrease in performance is 10% when doubling line size, but this varies considerably over the various benchmarks and media types. As is evident in the figure, the speech and security media types are affected only minimally, with degradations of 1.5-3.5%; video is affected moderately, with a 12.8% degradation; while the audio, graphics, and image media types are heavily affected, with average degradations in the range of 32-57%. Looking at specific applications, it quickly becomes evident that those applications considered memory-limited in Section 4.1.3 of the previous chapter are the most heavily affected by increasing line size, with performance decreases in the range of 40.1-71.4% from each doubling of L2 cache line size.

Figure 5.3 – Performance results on IPC from varying the width of the L2 cache line (32 to 512 bytes), by media type (video, image, graphics, audio, speech, security) and by decode/encode.
The reason for the significant change in performance is the increasing latency to access memory when using longer line sizes. As will become very apparent in the next section, many multimedia benchmarks are highly dependent upon memory, and varying memory latency can change their performance significantly. In this experiment the effect is reduced somewhat because of spatial locality. By pulling in longer lines, spatial locality enables processing to make use of much of the additional data, but the extra time accessing the memory has greater impact than the benefit from spatial locality. Also, spatial locality begins to decrease considerably for data with line sizes above 64 bytes.

Overall, it has been shown that media processing is minimally affected by L2 cache parameters.
Cache size and miss latency have little to no effect on media
processing performance, although the L2 cache size will be more useful for supporting context switches when simultaneously supporting multiple multimedia applications. Line size can have considerable impact, but this is primarily due to the effect of increasing memory latency. Consequently, this is not really a problem with the L2 cache, but more a problem of long external memory latencies. The next section will reinforce this conclusion.
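A simple first-order model of our own (not a parameter of the simulator) makes the line-size effect clear. The time to service a miss grows with the transfer time of the line:

\[
t_{\text{miss}}(S) \;\approx\; t_{\text{access}} + \frac{S}{W}\, t_{\text{bus}},
\]

where $S$ is the line size in bytes, $W$ is the width of the memory bus in bytes, and $t_{\text{bus}}$ is the bus cycle time. Doubling $S$ roughly doubles the transfer term, which dominates for long lines, so longer lines raise the effective memory latency even when they improve spatial locality.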
5.4 External Memory

As found in the previous section, memory latency is a significant problem for
media processing. This conclusion can have one of two ramifications, however. Either the dependence on memory latency is caused naturally by the long latencies to external memory, or the applications are in fact limited by the external bandwidth, which is
lengthening the memory latency.
Two experiments are performed that examine
variations in the latency and bandwidth to external memory. These experiments are not only important for understanding the memory bottleneck, but are also necessary for understanding the impact of the increasing processor-memory gap. As processor performance continues to increase faster than memory performance, the effective latency between the processor and memory continues to grow. This also affects memory bandwidth, since bandwidth depends upon the speed at which memory is able to send and receive information over the external data bus. Consequently, in the design of future media processors it is important to understand the impact of relatively longer memory latencies and smaller bandwidths.

To evaluate the impact of memory latency on media processing, an experiment was performed that evaluated performance over a range of memory latencies. The performance results relative to IPC are shown in Figure 5.4, while the bus utilization is displayed in Figure 5.5. Overall, the results are not optimistic, indicating a considerable performance degradation of 20% for each doubling of external memory latency. However, this varies considerably among the benchmarks and media types. From the two figures it is easy to discern that the speech and security media types had little degradation from external memory latencies, with average degradations of 3.0% and 7.6%, respectively. Conversely, the audio, image, and graphics benchmarks had severe degradation from increasing memory latency, with degradations between 59-77%.
Figure 5.4 – Performance results on IPC from varying the latency (25 to 400 cycles) to access a 64-byte line from external memory (on an L2 cache miss), by media type and by decode/encode.
Figure 5.5 – Bus utilization from varying the latency to access a 64-byte line from external memory (on an L2 cache miss), by media type and by decode/encode.

Unfortunately, this experiment does not definitively prove that memory latency is the primary problem, since memory latency is not mutually exclusive of memory
bandwidth. By doubling the memory latency, we are also in effect halving the memory bandwidth. And as can be seen from the bus utilization results in Figure 5.5, the audio, graphics, and image media types, which have the greatest performance degradation in this experiment, also have very high bandwidth utilization. Consequently, their bottleneck may be bandwidth instead of latency.

Figure 5.6 – Performance results on IPC from varying the width of the external data bus (4 to 32 bytes), by media type and by decode/encode.

Only by evaluating the effects of memory bandwidth in concert with memory latency can the actual cause of the problem be determined. An evaluation of the dependence of media processing on memory bandwidth is performed by varying the size in bytes of the external memory bus. The results from this experiment are shown in Figure 5.6. Overall, the results were much more optimistic than the memory latency results. The average performance change from doubling (or halving) the width of the external memory bus is only 6.0%.
Again, however, there is
considerable variation of the results with respect to media type. The speech, security, and
encoding benchmarks have a very low degradation of 0.6-2.7%, while the other benchmarks are much higher, with averages of 7.5-13.9%. Again, these results are not mutually exclusive from memory latency. When memory bandwidth increases without a corresponding increase in latency, the act of fetching the extra data for free is effectively prefetching. So in applications with high spatial locality, the prefetching helps reduce the effects of memory latency. However, determination of the primary bottleneck can be made by examining the effects from both experiments in concert, as shown below in Table 5.1. This table displays the average degradations for memory latency and memory bandwidth, as well as the degree of utilization of the external memory bus, indicated in the Bus Utilization column. Benchmarks with low bus utilization in the range of 0-32% are marked with low (L) utilization. Similarly, the medium (M) and high (H) utilization ranges correspond with 33-66% and 67-100% utilization, respectively. The results in the table indicate that all benchmarks are 3-5 times more susceptible to changes in memory latency than memory bandwidth. While there exists some cross-correlation between memory latency and memory bandwidth degradation that would decrease the true memory latency degradation and increase the true memory bandwidth degradation, that correlation would not account for the 3-5x difference. Furthermore, the cross-correlation would exist primarily in those applications with high external memory bandwidths, which include the djpeg, h263dec, mipmap, mpeg4dec, osdemo, rawcaudio, rawdaudio, and unepic benchmarks. Of these eight benchmarks, we estimate that memory latency is the primary limiting factor, although djpeg, h263dec, mpeg4dec, rawdaudio, and unepic also experience a memory bandwidth limitation to a
lesser degree. Consequently, memory latency is the dominant bottleneck in cache-based media processor memory hierarchies.

Program       Avg. Latency       Avg. Bandwidth     Bus Utilization
              Degradation (%)    Degradation (%)    (L, M, H)
cjpeg              68.1               11.3               M
djpeg              96.2               27.8               H
epic               57.3                6.0               M
gs                 66.8               15.4               M
g721dec             0.1                0.0               L
g721enc             0.1                0.0               L
gsmdecode           0.8                0.1               L
gsmencode           3.6                0.4               L
h263dec            99.1               30.8               H
h263enc            14.1                1.6               L
mipmap             75.6               13.1               H
mpeg2dec           27.8                3.6               M
mpeg2enc           25.3                2.8               L
mpeg4dec           95.3               27.8               H
osdemo             92.7               24.5               H
pegwitdec          25.1                3.0               L
pegwitenc          14.9                2.7               L
pgpdecode           0.0                0.1               L
rasta              37.7                4.8               L
rawcaudio          87.7                0.3               H
rawdaudio         108.1               22.3               H
texgen             53.3                6.1               M
unepic             88.1               21.5               H

Table 5.1 – MediaBench memory characteristics.

Although memory latency is the dominant memory bottleneck, it is also the problem most amenable to a solution. As noted in Section 5.1, many studies have found that streaming memory structures in media processors can provide significant benefit in reducing the performance degradation due to long external memory latencies. Should streaming memory structures be able to reduce the memory latency problem significantly, the next problem media processors will have to contend with is limited external memory bandwidth.
5.5 Summary

This chapter used a cache-based memory hierarchy to explore many of the issues
in memory hierarchy design for media processors. The previous chapters were effective at defining the first level of the cache memory hierarchy, so this chapter focused on the second-level cache and the external memory interface. Surprisingly, it was found that the second-level cache had little impact on memory performance. Performance is nearly the same irrespective of L2 cache size, and L2 cache miss latency has only a nominal effect. The L2 cache will be important, however, for holding the memory image during context switches when simultaneously executing multiple multimedia applications. The L2 cache parameter that affected performance most significantly was L2 cache line size, and this can be attributed to the negative impact of longer cache lines on memory latency.

Examining external memory latency and bandwidth in greater detail, it was found that memory latency presents the greatest problem for media processing. However, because of the success other studies have found with streaming memory structures like stream buffers and stride prediction tables, it is also the problem with the simplest solution. Aside from memory latency, the only other significant problem media processor memory hierarchies will need to contend with is external memory bandwidth.
Chapter 6. Compiler Methods for Media Processors
This chapter proceeds to the second facet of our design methodology by exploring the compiler issues in the design of programmable media processors. The compiler is an important component in any high-level language programmable processor, but is even more important for media processors, since extraction of parallelism is crucial for achieving practical performance in many multimedia applications. The high degrees of parallelism found in multimedia offer optimistic opportunities for achieving high levels of throughput, but there remain many issues regarding how compilers can extract that parallelism and schedule it on high-frequency parallel processors. The previous chapters have shown that instruction level parallelism by itself is inadequate for achieving high parallelism, so it is necessary to pursue alternate methods of parallelism for high-performance media processors.

This chapter begins with an examination of the various levels of parallelism in multimedia.
It starts with a summary of instruction level parallelism and subword
parallelism, and then examines parallelism at coarser levels of granularity, including task-level parallelism and data parallelism. Task-level parallelism provides the coarsest level of parallelism and generally enables the best performance results. However, it is also the level of parallelism most difficult to automate, since recognizing the parallelism often requires in-depth knowledge of the application not available to the compiler.
Consequently, data parallelism is presented as the most likely avenue for obtaining further parallelism in multimedia.

The second section of the chapter proposes the Speculative Broadcast Loop (SBL) method, a new run-time loop parallelization method for supporting data parallelism. This method speculatively executes separate loop iterations in parallel across a wide-issue clustered architecture. We propose the use of run-time rather than static techniques because run-time methods are able to overcome some of the limitations of traditional parallel compiler methods. A significant amount of research has gone into the study of parallel loops and the design of compiler techniques for supporting loop-based parallelism on multiprocessors. It has been found that the biggest problem with loop parallelization is not in transforming the loops for parallelization, but in recognizing loops as parallel. Methods including dependence analysis, privatization, idiom recognition, and inter-procedural analysis have been defined that significantly aid in the recognition of parallel loops, but even with these advanced methods compilers are still unable to recognize all the available loop parallelism in applications. Consequently, run-time techniques provide an alternative for realizing greater degrees of loop parallelism.
Discussion of the static parallel
methods will examine basic loop parallelization and the most critical techniques for recognizing loop parallelism, including dependence analysis, privatization, idiom recognition, and inter-procedural analysis. Introduction to run-time techniques will cover both speculative and non-speculative techniques for run-time loop parallelization. The remainder of the chapter presents the SBL run-time technique for the speculative
execution of data parallelism, and discusses both the compiler and architecture modifications necessary for supporting this large-scale speculative method.
6.1 Levels of Parallelism in Multimedia

The existence of large degrees of parallelism in multimedia has been well
researched [56][97], but less well understood is the degree to which that parallelism is available at the various levels of granularity. This section will examine the range of parallelism from the finest level of granularity, instruction level parallelism, to the coarsest level of granularity, task-level parallelism.
While this section does not
quantitatively define the degree of parallelism at each of these levels, it shall endeavor to provide an understanding of both the type and relative degree of parallelism at each level. To achieve maximum performance in media processors the compiler must accommodate all possible levels of parallelism in media applications. In particular, data parallelism is presented as a level of parallelism that is currently under-exploited and may provide significant performance improvements for media processors.
6.1.1 Instruction Level Parallelism

Instruction level parallelism is defined as the parallelism that exists between separate instructions in the assembly-level application code. Because an instruction represents the smallest atomic element in application code, ILP is the finest granularity of parallelism. Additional details on ILP were examined in Chapter 2, and Chapters 3 and 4 performed a thorough examination of the ILP in multimedia applications with respect to the classical, superscalar, and hyperblock ILP compiler optimizations. Overall it was found that multimedia applications contain little more ILP than general-purpose applications. Using a realistic cache memory hierarchy, the typical ILP was only 1-2
IPC, while the performance assuming perfect memory was only 20-30% better, as shown in Figure 4.4. There exist some other ILP techniques that were not examined in this thesis, including software pipelining and loop transformations, which may yield better performance, but overall the performance gains we can expect from ILP are limited.
6.1.2 Subword Parallelism

As discussed in Section 1.2.2, subword parallelism is an efficient form of SIMD parallelism in which small data elements are packed into larger registers, and then special subword-parallel instructions are employed to perform the same operation on each data element in the packed registers.
Because a packed register usually contains 2 to 16
individual data elements, the atomic size of a subword-parallel operation is equivalent to multiple regular operations. Consequently, subword parallelism embodies a slightly coarser granularity of parallelism than instruction level parallelism.

The foundation of subword parallelism is built upon the data parallelism and small data sizes common to many multimedia applications. As discussed in Section 6.1.4 below, the data parallelism in multimedia naturally creates groups of operations that process independent data elements. Because the processing of one data-parallel element is independent of other data-parallel elements, the groups of operations associated with these elements are independent. Subword parallelism can be employed on independent operation groups when the operations use small data types and share a common instruction control stream.

Because of the coarser granularity of parallelism, conventional compiler techniques are unable to effectively accommodate subword parallelism. However, since subword parallelism requires negligible hardware modification, and has been shown to
offer promising speedups of 2-4x with hand coding [30], there is significant impetus to find compiler methods for subword parallelism. Cheong and Lam [127], Fisher and Dietz [128], and Lelait and Krall [129] are among the groups researching subword-parallel compiler methods. These groups are employing existing parallel loop compiler techniques [130] in their attempts to automate subword parallelism. Because data parallelism in multimedia is predominantly found in loops (as discussed below in Section 6.1.4), this approach offers much promise. An alternative approach is being undertaken by Larsen and Amarasinghe at MIT [131]. This approach views subword parallelism not just as a limited form of data parallelism, but also as another form of instruction level parallelism. They argue subword parallelism is useful not just on data-parallel elements, but on any set of isomorphic statements (statements within a basic block containing the same operations in the same order). They propose a process of selecting the appropriate groups of operations from among all possibilities using a cost criterion that measures the gain from subword parallelism after accounting for the packing and unpacking overhead.
They enable
subword parallelism on data-parallel elements as well by using loop unrolling to ensure that a sufficient number of operations are available to achieve a performance gain.

Overall, there is a strong likelihood that an effective method will be found for compiling with subword parallelism. While the granularity of subword parallelism is coarser than ILP, the size of the independent groups of operations can still be relatively small so long as there is a sufficient number of independent groups (i.e., groups of 3-4 operations may be sufficient for performance improvement if there exist 3-4 or more independent groups). Also, additional research is being undertaken for automatically
determining the minimum data sizes of variables in multimedia applications [132]. Minimizing data sizes could be of considerable benefit in two respects. First, using smaller data sizes can increase the number of packed data elements in fixed register sizes. Second, minimizing data sizes may also help increase the size of independent groups of operations by increasing the number of operations with appropriate data sizes. Unfortunately, the degree to which subword parallelism exists in multimedia cannot currently be evaluated.
The maximum achievable benefit from subword
parallelism is limited by the degree to which small data elements can be packed into registers. This limit is typically 4-8x, but with the overhead of packing, unpacking, and permuting small data elements in packed registers, the realizable benefit may be much lower. Furthermore, the above arguments indicate that subword parallelism overlaps the domains of both data parallelism and ILP. Consequently, it is difficult to determine the actual domain of subword parallelism.
The true degree of subword parallelism in
multimedia is only measurable once effective compiler methods for subword parallelism become available.
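To ground the packing discussion above, the following is a minimal sketch of subword parallelism emulated in plain C (our illustration, not code from the thesis or from [30]; real subword-parallel instructions perform the lane separation in hardware). Four 8-bit pixels are packed into one 32-bit word, and a single partitioned add processes all four lanes at once, with masking used to keep carries from spilling between lanes.

#include <stdint.h>

/* Partitioned 8-bit add on a packed 32-bit word (modular, no saturation):
 * add the low 7 bits of each lane, then fold in the top bits separately
 * so a carry cannot propagate into the neighboring lane. */
static uint32_t padd8(uint32_t a, uint32_t b)
{
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    return low ^ ((a ^ b) & 0x80808080u);
}

One call to padd8 replaces four independent byte additions, which is precisely the 4x packing benefit (before packing and unpacking overhead) described above.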
6.1.3 Task-Level Parallelism

While instruction-level parallelism and subword parallelism are finer-grained levels of parallelism, task-level parallelism enables much coarser degrees of parallelism. In task-level parallelism, a task may be any arbitrarily-sized piece of work within an application. By this definition, tasks are not necessarily parallel. Task parallelism is only possible among tasks that have little to no dependence between them and can therefore be executed concurrently or semi-concurrently.
In the broadest sense, task-level parallelism is the superset of all parallelism methods. Because an individual task can range in size from a single operation, to a single iteration of a loop, or even the call of a major function, task-level parallelism defines a superset of ILP, subword parallelism, and data parallelism. Beyond these three subsets, it also enables parallelism at coarser and less well-defined levels. However, within this thesis task-level parallelism is defined as the levels of parallelism not encompassed by ILP, subword parallelism, and data (loop-based) parallelism. Essentially, task-level parallelism defines the coarser levels of parallelism beyond the scope of the other three.

Figure 6.1 – Parallelization process for employing task-level parallelism [133]: decomposition breaks the computation into tasks (pieces of work); assignment specifies the mechanisms by which tasks are distributed among processes; orchestration specifies the mechanisms for communication, synchronization, and data access; and mapping binds processes to processors.

The process of identifying and employing task-level parallelism within an application is a complex one. The overall process is well characterized by Culler, Singh, and Gupta [133], and a diagram summarizing the four-phase process is shown in Figure 6.1. First, a decomposition phase is necessary to break execution of the application into
167 individual tasks. These tasks then go through an assignment phase that assigns tasks to processes, which are “abstract entities that perform tasks.”
Orchestration is then
performed on the processes to prepare them for parallel execution by defining the communication, synchronization, and memory access mechanisms necessary for executing the tasks in parallel. Finally, a mapping phase assigns the abstract processes to physical processors. The goal of this entire process is to achieve a balanced ratio of tasks to processors, and to provide all tasks with the necessary mechanisms for maintaining program correctness while executing in parallel on separate processors.

Unfortunately, this process has only been successfully automated on certain well-defined subsets of task-level parallelism. The primary subsets for which this is true are ILP, subword parallelism, and loop-level (data) parallelism. Instruction level parallelism and loop parallelism have been researched extensively and enable reasonably good compiler performance. Subword parallelism is a newer area, but successful compilation methods are expected soon. Outside of these three subsets, little success has been found in automating task-level parallelism. Consequently, employing task-level parallelism is primarily the responsibility of the programmer. The disadvantage of this is the enormous amount of extra work. The advantage is that experienced parallel programmers can usually generate significantly better results than any automation technique, because they have additional knowledge about the application not available to compilers.

An example of task-level parallelism used in multimedia is a parallel implementation of an MPEG-2 video decoder14 [134]. In this example, we analyzed the MPEG-2 syntax and identified two possible methods for defining concurrent tasks. The
14 This is the same MPEG-2 decoder used in the MediaBench benchmark suite.
MPEG-2 syntax is defined by a hierarchical structure, as shown in Figure 6.2. In this structure, a video stream is broken down into sequences, groups of pictures, pictures, slices, macroblocks, and blocks. Sequences are optional within the syntax, so they are not suitable for use as tasks. Because MPEG-2 uses inter-frame correlation to eliminate temporal redundancies, there is a significant amount of dependence between pictures, so the picture level is not suitable for concurrent tasks. The macroblock and block levels are both very fine-grained and would require significant communication and synchronization overhead, so they are also ruled out. However, the remaining two options, groups of pictures and slices, do present feasible alternatives for tasks, since in each case separate tasks are relatively independent of each other. The group of pictures defines a very coarse-grained level of task parallelism, while the slice level creates fine-grained task parallelism.

Figure 6.2 – Hierarchical structure of the MPEG-2 syntax: a video stream is composed of sequences, groups of pictures, pictures, slices, macroblocks, and blocks.

A parallel implementation of the MPEG-2 decoder, shown in Figure 6.3, was developed that enabled definition of tasks at either the group of pictures or slice level. In both versions, the basic structure of the MPEG-2 decoder required three primary types of processes. Because the video stream arrives at the decoder in a compressed sequential stream, it was necessary to have one process, known as the scan server, search through the video stream and identify each task (either the beginning of a group of pictures or the beginning of a slice). The locations of these tasks in the video stream were then stored in a task queue. From the task queue, a variable number of slave processes retrieved tasks
from the task queue, performed MPEG-2 decoding of those tasks, and stored the decoded output stream into a display queue. Finally, a display server would send the decoded output stream to either the file system or a video display unit. In both the group of pictures version and the slice version, the results of task-level parallelism were excellent.
The parallel MPEG-2 decoder was executed on a 16-
processor SGI Challenge multiprocessor. Two processors were assigned as the scan server and display server, so up to 14 additional processors were available for performing the actual decoding. For both the group of pictures and the slice versions, the speedup results were nearly linear with respect to the number of slave processors. The results for the parallel slice version are shown below in Figure 6.4. The results for the group of pictures version were slightly better because the coarser granularity of the tasks requires less synchronization overhead.
Figure 6.3 – Task-level parallel implementation of MPEG-2 video decoding [134]: a scan server feeds a task queue, slave processes decode tasks into a display queue, and a display server writes the output to disk or display.

While the hand-coding of task-level parallelism can certainly return exceptional results, the enormous amount of work for generating explicitly parallel code is not always cost effective.
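A hedged sketch of the task-queue mechanism in Figure 6.3 is shown below (our reconstruction in C with POSIX threads; the names, queue size, and task representation are illustrative, not from the original implementation). The scan server enqueues the stream offset of each task (a slice or a group of pictures), and each slave process dequeues and decodes them; the display queue follows the same pattern.

#include <pthread.h>

#define QUEUE_SIZE 64

typedef struct {
    long            offsets[QUEUE_SIZE]; /* byte offsets of tasks in the stream */
    int             head, tail, count;
    pthread_mutex_t lock;                /* assume PTHREAD_MUTEX_INITIALIZER    */
    pthread_cond_t  not_empty, not_full; /* assume PTHREAD_COND_INITIALIZER     */
} task_queue_t;

void enqueue_task(task_queue_t *q, long offset)   /* called by the scan server */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_SIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->offsets[q->tail] = offset;
    q->tail = (q->tail + 1) % QUEUE_SIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

long dequeue_task(task_queue_t *q)                /* called by each slave */
{
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    long offset = q->offsets[q->head];
    q->head = (q->head + 1) % QUEUE_SIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return offset;
}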
Furthermore, parallel code generated for one platform is not always
portable to other platforms. Consequently, it is desirable to use automatable alternatives
when they exist. Data (loop-level) parallelism presents one such alternative that enables coarser levels of parallelism than either ILP or subword parallelism.

Figure 6.4 – Speedup of the slice-based parallel MPEG-2 video decoder versus the number of processors (fast mode, 13 frames/GOP) [134].
6.1.4 Data Parallelism

The data parallelism found in multimedia offers optimistic opportunities for additional parallelism. First, aside from the limited forms of data parallelism supported by subword parallelism, data parallelism is the most underutilized type of parallelism in multimedia. Second, it offers potentially larger performance gains than ILP and subword parallelism. Finally, as shall be seen below, the data parallelism available in multimedia corresponds primarily to loop-based parallelism. Loop-based parallelism has been researched for decades, and many parallel compiler methods have been developed for utilizing it. While there are still a number of limitations on what these parallel compiler methods can accomplish, the existing methods are often able to achieve significant performance gains on loop-centric applications. Consequently, there are excellent opportunities for supporting data parallelism in multimedia.
The remainder of this section examines various aspects of data parallelism. The first subsection gives an overview of data parallelism, and shows an example of data parallelism in video signal processing.
The second subsection examines an early
experiment that hand-scheduled kernels onto a wide-issue clustered architecture. The results of this experiment led to further exploration of the potential of data parallelism.

Overview of Data Parallelism

Data parallelism is the parallelism that exists between data elements that have little or no processing dependency between them.
It may exist between any data
elements, but most commonly occurs between data elements that are not in close proximity to each other in either the spatial or temporal domains. The independence of such data allows computations on them to occur in parallel. Because multimedia has significant processing regularity, the data in multimedia is predominantly processed within loops. Consequently, multimedia data parallelism typically equates to loop-level parallelism. An example of this is the 2-D discrete cosine transform (DCT) algorithm, which is a critical kernel for both JPEG and MPEG compression. This algorithm, shown in Figure 6.5, takes an 8x8 block of an image or video frame and turns the 2-D spatial information into its corresponding 2-D frequency representation. A straightforward implementation of this algorithm uses a four-level nested loop where the two outer loops process the 64 elements of an 8x8 block. The two inner loops perform the processing of a single element. It can be seen that the statement y[m][n] = y[m][n] + x[i][j] * c[i][m] * c[j][n] creates a loop-carried dependence in the two inner loops. However, there is no such dependence in the outer loops. Each iteration in the two outer loops is independent of all other iterations, so the 64 iterations of the two
outer loops may be run in parallel. Additionally, the DCT is independent over all separate 8x8 blocks in each image or video frame, allowing for even greater degrees of parallelism. Such large degrees of parallelism may be found throughout multimedia applications.
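As an illustration (a sketch, not the thesis's scheduled code), the 64 independent (m, n) iterations of the 2-D DCT of Figure 6.5 could be flattened and dealt out across P workers or clusters; worker_id and P are hypothetical parameters:

    /* Hypothetical distribution of the 64 independent (m, n) iterations
       across P workers (worker_id in 0..P-1); y, x, and c as in Figure 6.5. */
    void dct_outer_parallel(double y[8][8], const double x[8][8],
                            const double c[8][8], int worker_id, int P)
    {
        int mn, m, n, i, j;
        for (mn = worker_id; mn < 64; mn += P) {   /* no cross-iteration deps */
            m = mn / 8;
            n = mn % 8;
            y[m][n] = 0.0;
            for (i = 0; i < 8; i++)                /* inner loops keep their  */
                for (j = 0; j < 8; j++)            /* loop-carried dependence */
                    y[m][n] = y[m][n] + x[i][j] * c[i][m] * c[j][n];
        }
    }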
Further experiments that hand-scheduled key video signal processing
kernels onto a wide clustered architecture clearly delineate this fact.
for (m = 0; m < 8; m++)
    for (n = 0; n < 8; n++) {
        y[m][n] = 0.0;
        for (i = 0; i < 8; i++)
            for (j = 0; j < 8; j++)
                y[m][n] = y[m][n] + x[i][j] * c[i][m] * c[j][n];
    }
Figure 6.5 – Straightforward implementation of the 2-D DCT algorithm.

Hand-Scheduling for Data Parallelism

Early in the evaluation of architectures and compiler methods for media processors and video signal processors (VSPs), there was no existing knowledge base from which to begin the design of VSPs and media processors. The large degrees of parallelism in multimedia had been discovered, but it was uncertain to what degree the parallelism was available at various levels of granularity.
Consequently, a hand-
scheduling experiment was used to explore both architecture and compiler design spaces by scheduling key VSP kernels onto various architectures using a number of different compiler techniques. While hand-scheduling of kernels does not accurately indicate the characteristics of full applications, kernels tend to dominate signal processing code to a
greater degree than general-purpose code, so we believe the results to be reasonably representative. Furthermore, kernel evaluation provides a good high-level view of the design space. The results of this experiment were first published at HPCA-3 [62]. The choice of kernels was primarily dictated by the trend in video applications towards compression and decompression.
The most common and critical routines
involved in video compression include motion estimation, discrete cosine transformation (DCT), variable bit rate (VBR) coding, and color-space conversion. While there are numerous other video application kernels aside from these four, between them they display many of the characteristics of video applications and thus are believed to provide a reasonable kernel set for early exploration. From these four areas, six different video kernels were selected, including full search motion estimation, three-step search motion estimation, traditional 2-D discrete cosine transform (DCT), row-column DCT, variable bit rate (VBR) coding, and RGB to 4:2:0 YCrCb color-space conversion. These kernels were hand scheduled onto various cluster architectures using many compiler strategies, including speculation, predication, list scheduling, loop unrolling, blocking, specialized instructions, and SIMD scheduling of parallel loops across clusters (scheduling for data parallelism). The C source code for these kernels is given in Appendix B. From the results we were able to draw a number of conclusions about architectures and compilers for VSP processors [62], but the results were most significant with regards to the various compiler strategies. For ILP techniques, software pipelining and loop unrolling produced the best results. However, the compilation technique that was the most valuable in general was the use of loop parallelism. A SIMD approach for exploiting loop parallelism was used that assigned a separate loop iteration to each
cluster. With this technique, the same series of operations performs identical processing on the separate loop iterations in each cluster. The only kernel that did not benefit from loop parallelism was the variable bit rate (VBR) coder. Because it is compressing data into a sequential output stream, the processing is highly sequential with many loop-carried dependencies.
Figure 6.6 – Speedup over a single-issue processor when using both data parallelism and instruction level parallelism for five kernels (full search motion estimation, 3-step search motion estimation, straight DCT, row/column DCT, and RGB to YCrCb color-space conversion); 8x parallel is an 8-cluster, 4-issue slot/cluster architecture; 16x parallel is a 16-cluster, 2-issue slot/cluster architecture.

It was further found that the ILP methods and the data/loop parallelism techniques are orthogonal. ILP could be applied simultaneously with loop parallelism to aggregately improve performance. The best performance was achieved through a combination of loop unrolling on inner loops and loop parallelism on outer loop levels. The overall speedup of this combined compilation technique with respect to a single-issue processor is shown in Figure 6.6. Speedup results are shown for two different architecture models. The first was an 8-cluster processor with 4-issue slots per cluster, while the second was a
16-cluster processor with 2-issue slots per cluster, so loop parallelism of 8x and 16x was used on these two architectures, respectively. While both architectures are effectively 32-issue architectures, it is easy to see that the processor with 16 clusters performed much better than the 8-cluster model. Consequently, the data parallelism within parallel loops easily dominated the ILP.

Granularity of Data Parallelism

To better understand the characteristics of data parallelism, it was necessary to determine the granularity of data parallelism in multimedia. We first examined the kernels used in the above hand-scheduling experiment to determine the minimum size of a single data-parallel element. Looking back to the DCT example, since the first loop level not containing loop-carried dependences defines the minimum data element size, in the DCT a single iteration of the n loop defines the size of an independent data element. A single iteration of the n loop iterates 8x8 = 64 times over the inner loop body, which contains 9 assembly operations.
Consequently, the minimum granularity of an
independent data element is 64*9=576 operations in the DCT kernel. Static evaluation of the other VSP kernels defined similar granularities for the minimum sized independent data elements. The average range for these kernels was found to be between 250 and 1K operations.
While this static determination of
granularity provides a useful base understanding, it is not necessarily indicative of full applications. Static evaluation of data parallelism granularity for full applications is a considerable undertaking, so instead an alternative experiment was performed using trace-driven simulation. By using the trace-driven simulator to examine the parallelism available in application traces over various scheduling window sizes, it is possible to determine the
amount of parallelism available in different-sized sections of the program trace. Comparing the parallelism over a range of scheduling window sizes will illustrate the granularity of different levels of parallelism in the application.

Figure 6.7 – Granularity of data parallelism in an MPEG-2 video coder; IPC for the encoder and decoder is plotted against scheduling window size (16 to 1G operations).

Using MPEG-2 encoder and decoder application traces, the experiment varied the scheduling window size from 16 to one billion operations. As shown in Figure 6.7, with smaller scheduling window sizes, only ILP is available, with an IPC of 8 for the encoder, and 1 for the decoder. Once the instruction scheduling window size increases above 4K operations, data parallelism begins to become available, and the parallelism increases to 14 for both the encoder and decoder. Consequently, in the MPEG-2 application, the minimum granularity of independent data elements is on the order of 4K+ operations. While the observed granularity of data parallelism in these applications seems large, the true granularity of the data parallelism is actually smaller than this experiment indicates.
The trace-driven simulator is unable to define the actual level of data
parallelism because it schedules the operations in the trace exactly as they occur and does
not perform any optimizations on the code as a compiler would. Many parallelizable loops may not be run in parallel because of loop-carried dependences from induction variables. Consequently, the trace-driven simulator cannot precisely model the actual level of data parallelism. The effect on the results in Figure 6.7 would be to shift the slope at 4K operations a little to the left. The actual minimum level of data parallelism will likely be between 500 and 2K operations.
6.2 Compiling for Data Parallelism

As found in the previous chapters, conventional compiler methods are unable to
make use of the data parallelism in multimedia.
The problem lies in the level of
granularity of data parallelism. The amount of processing for a single data element is usually quite large so that even when two independent data elements are relatively close together in the spatial or temporal domains, computations on these elements are typically several hundreds or thousands of operations apart in sequential program order. As found above, the typical minimum granularities of data-parallel elements are around 1K operations. Such coarse-grained parallelism is difficult for conventional compilers to accommodate, because they are most effective at finding parallelism within scheduling windows comprised of a limited number of sequential operations. The size of these instruction sequences is usually no more than a hundred instructions, which is much smaller than the minimum range of operations for most data-parallel elements. Consequently, alternative compiler methods are necessary for data parallelism. The issue of finding data parallelism in multimedia applications is closely related to the issues of automatic parallelization in multiprocessors and vector processors. Significant research has been performed in parallelizing compilers for taking advantage
of loop-based parallelism at various loop levels in program code. Section 6.2.1 will provide an overview of the static and run-time methods that have been found for supporting loop parallelism in parallel processors. Because of the limitations of static methods in recognizing loops as parallel, we developed the Speculative Broadcast Loop (SBL) technique, a new run-time loop/data parallelism method. This method is a vector-like technique that extracts data parallelism by scheduling independent loop iterations in parallel across a wide-issue clustered architecture. Speculative hardware is added that allows execution of parallel loop iterations even when it is uncertain whether the loop iterations are truly independent. The speculative hardware enables the compiler to optimistically assume a loop is parallel, and then provides a way to recover after executing dependent loop iterations in parallel. A run-time method supporting large-scale speculative execution in media processors will enable the greatest benefit from data parallelism.
6.2.1 Related Work

It was recognized early in the history of computers and compilers that significant performance improvements could be obtained from executing independent loop iterations in parallel. This realization led to decades of study on parallel compiler techniques. The majority of this work has been in the area of static compilation techniques for recognizing and transforming parallel loops, but there has also been significant research on run-time methods. These two areas of research are examined below.

Static Compilation

There are essentially two parts to static compilation for parallel loops. The first is recognizing parallelism in loops, while the second is transforming the program to enable
parallel execution of these loops. Of these two, the recognition of parallel loops is by far the most complex. Dependence analysis is used to determine whether a loop is parallel or not, but because of difficulties in statically disambiguating memory references, many parallel loops cannot be recognized. Except when speculative run-time methods exist to recover from conflicts in non-parallel loops, loops with indeterminate data dependences must be conservatively assumed to be not parallel. Dependence analysis works by evaluating the variables within a loop body to determine the data dependences between statements in the loop. There are three types of data dependences: true dependences, anti dependences, and output dependences. A true dependence (also known as a flow dependence or read-after-write dependence) is a dependence where a value is written to a variable earlier in sequential program order and then read later in sequential program order. This dependence enforces an ordering on the program such that the read of the value cannot occur before the write of the value, or program correctness will be violated. Similarly, an anti dependence (also known as a write-after-read dependence) is a dependence where a value is read earlier in the program and then written later. The write cannot occur before the read or the read will load an incorrect value. Finally, an output dependence (also known as a write-after-write dependence) is a write followed by another write to the same variable. Again, these dependences enforce orderings between dependent statements to ensure program correctness.
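A small illustration of the three dependence types (hypothetical straight-line code; S1 through S4 execute in program order):

    void dependence_kinds(double *a, int i, double x, double z, double w)
    {
        double y;
        a[i] = x;     /* S1 */
        y    = a[i];  /* S2: true (read-after-write) dependence on S1     */
        a[i] = z;     /* S3: anti (write-after-read) dependence on S2     */
        a[i] = w;     /* S4: output (write-after-write) dependence on S3  */
        (void)y;
    }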
There exist various methods for performing dependence analysis. Dependence analysis on scalar variables is typically straightforward, but analysis of arrays becomes more difficult. For arrays with indices defined by linear equations, there are well-defined methods that examine the array index equations over the loop range to determine whether the equations match for two program statements.
A match indicates the
existence of a data dependence. Some common methods for doing this include the GCD test [135], the λ-test [136], the I-test [137], and the Ω-test [138]. However, these tests are primarily limited to linear/affine equations.
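To give a flavor of how such a test works (an illustrative sketch, not an example from the thesis): the GCD test reports that a write to a[c1*i + k1] and a read of a[c2*i + k2] may touch the same element only if gcd(c1, c2) divides k2 - k1.

    void gcd_test_example(double *a, double *b, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            a[2*i] = (double)i;      /* writes touch only even indices     */
            b[i]   = a[2*i + 1];     /* reads touch only odd indices       */
        }
        /* gcd(2, 2) = 2 does not divide (1 - 0) = 1, so the write a[2i]
           and the read a[2i+1] can never reference the same element:
           the loop carries no dependence through a[].                     */
    }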
There exist significant problems in
evaluating dependences on arrays when the equations become nonlinear.
Similar
problems arise with respect to the dependence analysis of pointers. Because of these limitations, dependence analysis is often unable to determine all data dependences, and consequently must conservatively assume many loops are not parallel. Additional methods have been found to be beneficial in evaluating loop parallelism.
Among these are privatization, idiom recognition, and interprocedural
analysis. Privatization is the process of localizing data references for variables only used in a single iteration of a loop. This is useful in eliminating variables from the dependence analysis problem.
Interprocedural analysis is necessary for performing dependence
analysis across function boundaries [139]. Without interprocedural analysis, compilers are limited to parallelism only within individual functions. Idiom recognition is a process of identifying two specific types of variables in loop bodies: induction variables and reduction variables.
Induction variables typically represent the indices of a loop.
Induction variables are defined as variables that are incremented by a constant value once per loop iteration. Reduction variables typically act as accumulators. They are also incremented once per loop iteration, but usually by a variable.
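A small hypothetical loop illustrating both idioms:

    double idiom_example(const double *a, int n)
    {
        double sum = 0.0;            /* reduction variable: accumulator      */
        const double *p = a;         /* pointer induction variable           */
        int i;
        for (i = 0; i < n; i++) {    /* i: induction variable, step of 1     */
            sum += *p;               /* reduction: incremented by a value    */
            p   += 1;                /* induction: incremented by a constant */
        }
        return sum;
    }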
Once the data dependences have been identified in a loop, the appropriate loop transformations may be applied to enable parallel execution of the loop. Based on the data dependences, loops are categorized into three different groups: doall loops, doacross loops, and while loops. For doall loops, there are no cross-iteration
dependences, so whole loop iterations may be run in parallel (i.e. SIMD parallelism). Doacross loops may have cross-iteration dependences, so they usually cannot be run entirely in parallel. While loops are loops that iterate until a termination condition is satisfied, so they have indeterminate iteration spaces.
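Hypothetical examples of the three categories:

    void loop_categories(const int *a, int *b, int *c, int n)
    {
        int i;

        /* doall: iterations are fully independent and may run in parallel. */
        for (i = 0; i < n; i++)
            b[i] = a[i] * 2;

        /* doacross: each iteration reads the previous iteration's result.  */
        for (i = 1; i < n; i++)
            c[i] = c[i-1] + a[i];

        /* while loop: the iteration space is unknown until the termination
           condition fails (a while-doall if the body proves independent).  */
        i = 0;
        while (a[i] != 0) {
            b[i] = a[i] * 2;
            i++;
        }
    }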
Once the loop has been
categorized, the appropriate loop transformations can be applied to organize the loop for parallel execution. Various loop transformations include loop interchanging, skewing, unrolling, loop splitting, and two-version loops, among others. For an excellent overview on loop transformations, dependence analysis, privatization, and idiom recognition, we refer the reader to Banerjee [130]. For purposes of this thesis, we will be most concerned with doall loops, the primary set of loops suitable for SIMD parallelism. Examining MediaBench, the doall and the doacross loop styles appear to dominate multimedia code. Few while loops were found. However, the speculative nature of our proposed run-time loop parallelization method will also enable support for the while-doall subclass of while loops [140], which are while loops that are fully parallelizable but have an unknown number of iterations.

Run-Time Methods

In static compiler methods, when the dependence analysis fails to recognize a loop as parallel, the compiler must conservatively assume that the loop is not parallel. While this assumption certainly eliminates all non-parallel loops from being parallelized, it also eliminates all parallel loops. There are two alternative approaches, however. The first is to mark the loop as indeterminate and use run-time checks to determine if the loop
is parallel, and if so, execute it in parallel. The second alternative is to assume it is parallel and speculatively execute it in parallel at run-time. These two methods are often referred to as non-speculative and speculative run-time methods, respectively. Non-speculative run-time methods for loop parallelism typically employ the inspector/executor approach for checking and executing loops with indeterminate parallelism. The inspector is a test loop executed before the actual loop that examines the addresses of accesses to memory from variables that had indeterminate data dependences. Following execution of the test loop, the memory addresses are used to define the data dependences on those indeterminate variables and check if any cross-iteration dependences exist. Based on the results of the cross-iteration dependences, the loop may be either fully parallel, partially parallel, or not parallel. If the loop is not parallel, a sequential version of the loop is executed. If the loop is fully parallel, then the executor can execute all loop iterations in parallel. If the loop is only partially parallel, a run-time scheduler partitions the iterations into wavefronts. Each wavefront is a set of iterations that are all parallel with each other, but may not be parallel with iterations in other wavefronts. The wavefronts are then executed in sequential program order, with each iteration in a wavefront executing in parallel.
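As a hedged sketch of the inspector idea, consider a loop whose only indeterminate accesses are writes through a subscript array ix[] (all names hypothetical):

    #include <string.h>

    /* Inspector: returns 1 if no two iterations write the same element of
       a[], in which case the executor may run all iterations of
           a[ix[i]] = f(i);
       in parallel; otherwise a sequential (or wavefront) version is used.
       Assumes 0 <= ix[i] < n_elems.                                       */
    int inspector(const int *ix, int n, char *seen, int n_elems)
    {
        int i;
        memset(seen, 0, (size_t)n_elems);
        for (i = 0; i < n; i++) {
            if (seen[ix[i]])
                return 0;        /* two iterations touch the same element */
            seen[ix[i]] = 1;
        }
        return 1;                /* loop is fully parallel for this input */
    }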
A variety of methods using the
inspector/executor approach have been proposed.
A good overview of the various
methods is provided by Rauchwerger [141]. Speculative run-time methods do not require the overhead of an inspector test loop.
Testing of data dependences occurs simultaneously with parallel execution.
Speculative run-time methods for parallel loop execution can be grouped into two categories: those that only support fully parallel loops, and those that also support
partially parallel loops. The method that supports only fully parallel loops is the LRPD test [142][143]. In this approach, a shadow array is used to keep track of all memory accesses for each variable (usually arrays) that had indeterminate data dependences at compile time. As these variables are accessed during loop execution, information is added to the shadow array indicating whether the memory access was a read or a write (a separate field for each), and whether the element is privatizable (i.e. an element is defined as not privatizable if it is read and written in the same iteration, but read first). Additionally, the shadow array keeps track of the total number of writes and the total number of variables written in order to check if any output dependences exist. After execution of the loop, a post-execution phase examines the shadow array and determines if any cross-iteration dependences were found. If so, the execution of the loop is cancelled, the processor is restored to the state prior to execution of the loop, and loop execution restarts in sequential mode. While fully parallel loops provide the greatest potential parallelism since their parallelism is limited only by the number of iterations in the loop, partially parallel loops can represent a large portion of the execution time in some applications, so it is important to support them as well. Three methods have been proposed for speculatively supporting partially parallel loops.
In actuality, these methods were all designed to support a
superset of the parallelism encompassed by partially parallel loops. They all support parallelism across arbitrary tasks, even including tasks that cross function boundaries (and therefore require support for separate stacks). Of these three, the first defines tasks as sets of nodes in the control flow graph (CFG), while the other two define tasks as threads.
The multiscalar processor was the first method proposed for supporting large-scale speculative execution of arbitrary-sized parallel tasks. This method defines a task as an arbitrary group of nodes in a program’s control flow graph (CFG) [144][145]. This architecture groups a number of independent processing units into a ring architecture with a head processing unit and a tail processing unit. The head processing unit executes its task non-speculatively, while all subsequent processing units up to the tail unit execute their tasks progressively more speculatively (i.e. progressively later in the expected sequential program order). Execution of a speculative task can be interrupted by data or control misspeculations that squash the task and all tasks following it in the ring (up to the tail processing unit).
Speculative memory support is provided by an Address
Resolution Buffer (ARB) [144][146], which is a buffer that stores speculative memory accesses and checks for memory dependence violations. These speculative memory accesses either update the true processor state when their task successfully retires, or they are removed as their corresponding speculative tasks are squashed. Later versions of the multiscalar processor proposed the Speculative Version Cache (SVC) [147], which is similar to the ARB, but is a distributed cache instead of a buffer, and memory dependence checking is performed as part of the cache coherence protocol between separate caches in separate processing units. As opposed to the LRPD test, which stores the speculative state in shadow memory, this method uses the ARB buffer or SVC cache to store all speculative memory accesses, so this is a purely hardware approach to large-scale speculative parallel execution. The primary software required for the multiscalar approach is a compiler or a binary translator that statically defines tasks in the CFG and inserts the necessary communication and synchronization primitives.
The two proposed methods for large-scale parallel speculative execution using threads are the Thread-Level Data Speculation (TLDS) method [148][149] and the Thread-Level Speculation (TLS) method [150]. These methods both propose generating many potentially parallel threads within a program for speculative execution on a single-chip multiprocessor.
The separate processors on the multiprocessor are used to
speculatively execute separate tasks.
The primary differences between these two
methods and the multiscalar approach are: 1) the method for defining tasks, 2) a more distributed nature, and 3) the method for storing speculative state. While the multiscalar processor is more tightly coupled with a single logical register file, the speculative multithreading architectures are more distributed, with each processor having its own register file, cache, and instruction stream. With respect to storing the speculative state, the multiscalar architecture uses an ARB buffer or SVC cache that is separate from the existing data cache, while the multithreaded architectures propose using the L1 data cache itself for storing speculative memory operations. However, all three methods are similar in proposing modifications to the cache coherence protocol for performing speculative memory conflict checking. While the differences between the multithreaded architectures and the multiscalar processor are readily apparent, there do not currently appear to be significant differences between the two multithreaded architecture proposals.
6.2.2 Speculative Execution of Data Parallelism

We propose a variation of these speculative run-time methods known as the Speculative Broadcast Loop (SBL) method. This new vector-like run-time method is a simplified version of the multiscalar and multithreaded speculative methods that combines SIMD parallelism with large-scale speculative execution for supporting data
parallelism in multimedia. Like the multiscalar and multithreaded methods, a run-time method was chosen over static compilation because of the limitations of static methods in recognizing loops as parallel. In multimedia, the predictable nature of memory access typically means that potentially parallel elements are truly parallel even though they may not be provably parallel with dependence analysis. Consequently, we believe run-time methods will better suit the needs of media processors. The basis of the SBL run-time technique uses profiling and register dependence analysis to identify loops that are potentially parallel, and then optimistically schedules potentially parallel loop iterations across separate clusters (one iteration per cluster) in a multi-cluster architecture. During SBL execution the multi-cluster architecture simulates vector processing. Unlike the multiscalar and multithreaded architectures, which provide independent control streams for separate processing units, the SBL method uses a simplified scheme that broadcasts a single instruction control stream to each cluster so that the loop iterations are all processed in SIMD form, as shown in Figure 6.8. The SIMD parallelism of this method is not as flexible as the multiscalar and multithreaded implementations, but we believe this method matches well with the processing regularity in media processing, and expect it will enable similar performance levels with less hardware complexity. To enable SIMD execution of both outer loops and loops with complex control (MIMD) structures, parallel loops are scheduled using Multi-Level If-Conversion (MLIC) to eliminate all unnecessary branches, and speculative hardware is provided for recovering from instruction flow deviations between parallel iterations. The speculative hardware also enables recovery for iterations that have memory conflicts (i.e. loops that
are partially parallel or not parallel) and for iterations that have executed beyond the bounds of the loop (i.e. overshot the loop termination condition). A limitation of this method is that its SIMD nature prohibits parallelism across function boundaries. Consequently, while the SBL method supports both fully parallel and partially parallel loops, it cannot support arbitrary-sized tasks like the multiscalar and multithreaded architectures.

Figure 6.8 – SIMD parallelism model for broadcasting and speculatively executing parallel loops across an n-cluster architecture; the broadcast loop back-edge is now taken only 1/n times.
1.) Loop broadcast and multi-level loop scheduling for SIMD parallelism across a wide-issue cluster architecture.
2.) Hardware extensions for supporting large-scale speculative execution of parallel loop iterations.

These two aspects, as well as the corresponding architecture and compiler implications, will be discussed in the remainder of this section.
Discussion of the Speculative
Broadcast Loop method will first examine the multi-cluster architecture and how it supports loop broadcast.
Second, profiling and register dependence analysis are
presented as a method for finding parallel loops. Then the method for selecting and performing Multi-Level If-Conversion (MLIC) scheduling of parallel loops for broadcast on the multi-cluster architecture is examined. Finally, large-scale speculative execution and the resulting hardware modifications are discussed. The section closes with an examination of the simulation results on the MediaBench benchmark suite.

Multi-Cluster Architecture for Loop Broadcast

Loop parallelization on a multi-cluster architecture can be accomplished using loop broadcast. On a multi-cluster architecture, there exists only one instruction fetch engine, which retrieves a single instruction control stream and distributes it as necessary among the clusters. Because SIMD processing requires only a single control stream, SIMD parallelism may be supported on a clustered architecture by enabling the instructions for a parallel loop to be broadcast to all clusters. A likely method for setting and resetting broadcast mode in the processor would be the use of begin broadcast and end broadcast instructions, which could be scheduled before and after each parallel loop. When in broadcast mode, separate iterations of a parallel loop execute concurrently using
the same instruction stream, but each iteration uses its own data. Figure 6.9 shows the pipeline of a multi-cluster architecture as it would appear in broadcast mode. In normal mode, it would only issue instructions to individual clusters.
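To make the broadcast model concrete, the following hedged sketch brackets a parallel loop with hypothetical stand-ins for the proposed begin broadcast and end broadcast instructions (the actual instruction encodings and scheduling are not specified here):

    /* Hypothetical stand-ins for the proposed begin/end broadcast
       instructions; on real hardware these would be single operations
       that also act as checkpoints.                                   */
    #define BEGIN_BROADCAST()  /* enter broadcast (SIMD) mode */
    #define END_BROADCAST()    /* return to normal issue mode */

    void broadcast_loop_sketch(double y[8][8], const double x[8][8],
                               const double c[8][8])
    {
        int m, n, i, j;
        /* induction variables would be initialized per cluster here,
           before broadcast mode is entered (see the scheduling
           discussion later in this section)                          */
        BEGIN_BROADCAST();
        for (m = 0; m < 8; m++)        /* in broadcast mode the single */
            for (n = 0; n < 8; n++) {  /* fetched stream runs once per */
                y[m][n] = 0.0;         /* cluster, each cluster holding */
                for (i = 0; i < 8; i++)      /* its own (m, n) data     */
                    for (j = 0; j < 8; j++)
                        y[m][n] += x[i][j] * c[i][m] * c[j][n];
            }
        END_BROADCAST();
    }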
Figure 6.9 – Multi-cluster pipeline architecture with support for loop broadcast (pipeline stages: IF = instruction fetch, PD = pre-decode, D = decode, RF+I = register fetch and issue, EX = execute, RT+WB = retire and write back).

While access to a single instruction control stream is a limitation of SIMD parallelism, the choice of SIMD processing versus MIMD processing was based on minimizing hardware complexity and simplifying synchronization. First, there is much less pressure on the instruction cache with a single control path. Second, instead of designing a full single-chip multiprocessor, we are simply proposing a few extensions to a typical multi-cluster architecture. Third, we believe SIMD parallelism enables simpler synchronization.
Because all loop iterations share the same control stream, the
instruction flow synchronization is implicit in the implementation. The only additional synchronization necessary is that for ensuring the proper memory image, which can be accomplished by appropriately committing and squashing loops following speculative execution. And finally, the data parallelism available in multimedia lends itself well to SIMD parallelism. The processing of data elements is usually very regular, so most data
elements of the same type are operated on in the same manner. Furthermore, data-parallel elements are by definition independent or nearly independent, so little to no communication is needed between clusters.
Consequently, the data parallelism in
multimedia makes these applications amenable to SIMD processing across a multi-cluster architecture. We expect only a small loss in performance as compared with MIMD parallelism. An additional benefit of the multi-cluster architecture over a multiprocessor is that it can also be used as a wide-issue processor when the degree of ILP exceeds the number of issue slots in each cluster. As mentioned in Section 4.2.1, compiler methods exist [119][120] that enable scheduling of ILP across clusters with only a minimal decrease in performance relative to an un-clustered architecture of the same width. For supporting loop parallelism, a multi-cluster architecture is actually preferable to an un-clustered wide-issue processor. A wide-issue un-clustered architecture cannot provide the necessary register resources for supporting multiple parallel loops. Furthermore, a clustered architecture automatically provides privatization of register variables, and may enable privatization of memory variables when separate local memories or caches are used. Consequently, a multi-cluster architecture provides an excellent compromise for supporting both ILP and loop-based parallelism.

Finding Parallel Loops

Profiling and register dependence analysis are employed to determine which loops are potential candidates for loop parallelization. An alternative for finding parallel loops would be to use parallel compiler methods such as those discussed in the prior section. However, profiling can evaluate not only the memory access characteristics from each loop iteration to determine if cross-iteration dependences exist, but also many other
characteristics, including the loop statistics presented in Section 3.6, and additional memory characteristics such as read sharing, memory (cache) line sharing, number and granularity of loads/stores, and so on. These additional memory characteristics can prove useful for data partitioning, as discussed in Appendix C. Consequently, profiling is the method currently used by the SBL method for finding parallel loops. To generate loop profile information, the program trace is instrumented with additional information indicating the start of each loop, the beginning of subsequent iterations for each loop, and the end of each loop. Then the profiler is augmented with data structures that record the loop statistics as profiling is performed. The primary data structure is a loop data structure for each profiled loop in the program. Because loops are executed in a nested fashion, a loop execution stack is used to keep track of the nesting levels of loops. Whenever a new loop begins execution, its data structure is pushed onto the stack, and once it completes execution it is popped off the stack and placed in a reserve list. Within each loop data structure there is a linked list of loop iteration data structures. And within the loop iteration data structure are two linked lists of memory line data structures, one for memory reads and one for writes. These data structures keep track of every byte of memory accessed in the loop iteration currently being executed. Once a loop completes execution, the memory access information for each iteration is compared to determine the average memory usage statistics for the loop, including the number of memory conflicts per iteration (this is zero if the loop is fully parallel), the number of memory lines each iteration shared with other iterations, the number of loads and stores, the granularity of loads and stores within the memory lines accessed by the iteration, and the degree of read sharing per iteration. The completed
loop is popped off the loop execution stack, and the aggregate memory access information for all iterations is combined into the memory information for the current iteration of the next lower loop on the stack. This way the memory information can be maintained at all loop levels. While the loop profile statistics define the memory data dependences in each loop, it is also necessary to evaluate the register-based data dependences in each loop. This can be accomplished using the static code information after profiling is complete. The existence of cross-iteration dependences from register operands can easily be determined using knowledge of the induction variables and live-out sets for each loop. For cross-iteration dependences to exist in a loop, the dependence must occur through a register variable contained in one of the loop back-edge live-out sets (i.e. live-out set on path of branch to loop header). Collecting the set of register variables for all loop back-edge live-out sets, each register is evaluated as to whether it is written within the loop. If it is not written in the loop, then it cannot cause a cross-iteration dependence. If all register variables written by a loop are loop induction variables, then the loop is potentially parallelizable.
Because loop induction variables are incremented by a
constant on every loop iteration, the value of an induction variable can be predicted ahead of time for every iteration of the loop. Consequently, loop induction variables are allowable within parallel loops. If there exist register variables in the live-out set that are not induction variables, but are written within the loop, then the loop contains a cross-iteration dependence and is not parallelizable. When both the memory dependences and register dependences indicate a potentially parallel loop, the loop may be a candidate for SBL execution.
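A hedged sketch of the profiler bookkeeping just described (all structure and field names are hypothetical):

    #define LINE_SIZE 32                  /* assumed memory-line size     */

    typedef struct mem_line {
        unsigned long addr;               /* line address                 */
        unsigned char touched[LINE_SIZE]; /* per-byte access marks        */
        struct mem_line *next;
    } mem_line_t;

    typedef struct loop_iter {
        mem_line_t *reads;                /* lines read by this iteration */
        mem_line_t *writes;               /* lines written                */
        struct loop_iter *next;
    } loop_iter_t;

    typedef struct loop_info {
        int id;                           /* static loop identifier       */
        loop_iter_t *iters;               /* one record per iteration     */
        long conflicts;                   /* cross-iteration conflicts
                                             (0 if loop is fully parallel)*/
        long shared_lines, loads, stores;
        struct loop_info *below;          /* next lower loop on the stack */
    } loop_info_t;

    static loop_info_t *loop_stack;       /* innermost executing loop on top */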
Schedule Parallel Loops for Broadcast

After defining the set of candidate loops, it is necessary to select the appropriate loops and schedule them for SBL loop parallelization. Selection of the loop can be done according to any heuristic as long as two basic rules are followed. First, only one loop level can be selected for parallel execution. If a multi-level loop is parallel at numerous levels, only one level may be selected for loop broadcast. Second, because this method uses SIMD processing, loops containing function calls cannot be parallelized.
The
processor cannot guarantee the same flow of control within parallel function calls, so SBL across function calls is not allowed. Outside of these criteria, any heuristic may be used to select which loops to broadcast. A couple of different heuristics will be presented in the experiments section below. After selecting the loops to parallelize, it is necessary to schedule them for maximum SIMD parallelism. This essentially entails removing all unnecessary branches. We propose the Multi-Level If-Conversion (MLIC) method for scheduling broadcast loops. In loops that contain multiple control paths, this method combines all control paths into a single control path to enable SIMD processing. We shall first describe how MLIC works on inner loops, and then extend that for scheduling of multi-level loops. For an inner loop, combining multiple paths of control flow into a single control path proceeds as follows:

1.) Combine all loop back-edges into a single back-edge.
2.) Combine all loop exits into a single loop exit; insert extra branches outside the loop body if there are multiple exit destinations.
3.) Use if-conversion on the resulting diamond region to form a single control path.
The first two steps essentially convert the loop body into a diamond region with only one entry point, the loop header, and one exit point. The diamond region can then be predicated into a single control path using if-conversion [90][91]. The first step can be accomplished by converting all loop back-edges to instead branch to a single basic block. Within this basic block is a branch to the loop header, which serves as the single loop back-edge for the loop. The second step is a little more complex. A method similar to that used for the first step would be feasible if there were only one exit destination, but there can potentially be many exit destinations. Instead, a general methodology is needed for combining any number of exits into a single loop exit. An example of such a method is demonstrated in Figure 6.10. In this method, loop exits are all retargeted to jump to the loop back-edge basic block, from which point control flow will exit the loop through a single exit point. In the subsequent basic block, control flow then branches to the appropriate destination based on the state of a set of temporary values. The MLIC method converts the loop exit branches from control dependences into combined control and data dependences. This process is implemented by first splitting the loop back-edge basic block (BBz in the example) into two basic blocks, with BBz containing all the non-branch code, and BBz’ containing the loop back-edge branch. All loop exits may now be retargeted to branch to BBz’ without fear of executing the initial non-branch code in BBz. For each loop exit, except the fall-thru exit in the loop back-edge basic block, BBz’, a temporary variable (ra or rb), which can be either a register or a predicate, is assigned to that branch. This variable is initially reset in the loop header (BBa). Prior to each loop exit branch (in BBc or BBf), a comparison instruction is
added to set the variable if the loop exit branch is taken. However, instead of branching to the loop exit destination (either BB1 or BB2), the branch is redirected to the new loop back-edge basic block, BBz’. After exiting the loop, control flow passes to a new exit basic block (BB0’), which is added immediately after the loop back-edge basic block, BBz. In the new exit basic block, BB0’, a branch is added that branches to the original loop exit destination (BB0, BB1, or BB2) when the branch’s variable is set. A similar temporary variable (rc) is also assigned to the loop back-edge. Consequently, at the end of each iteration, control flow will either branch to the loop header (BBa) if rc is set, branch to BB1 if ra is set, branch to BB2 if rb is set, or fall through to BB0 if none of the new variables are set. The new control flow graph is now a diamond region with BBz’ as the only exit point of the graph.
If-conversion can now be applied to convert
all loop paths to a single control path, thereby eliminating all intra-loop branches. When the loop being parallelized is an outer loop, the situation becomes more complicated. In this case, it is no longer possible to eliminate all intra-loop branches. Some of these branches will be associated with loop back-edges and loop bypass branches for loops nested within the parallel loop. However, if-conversion can still be used on individual regions within the loop to eliminate all unnecessary branches. A recursive pseudo-code procedure for accomplishing this is given in Figure 6.11, and an example outer loop is shown in Figure 6.12.
Figure 6.10 – Example demonstrating loop exit combining on an inner loop.

The procedure for performing Multi-Level If-Conversion on a broadcast loop splits a multi-level loop into nested loop regions, pre-loop regions, and one post-loop region. Nested loop regions are recursively processed with the above procedure until an inner loop is found. The inner loop is processed according to the method defined earlier and demonstrated in Figure 6.10. For each nested loop there is a section of code prior to the loop (and potentially after an earlier nested loop), which we refer to as the pre-loop region. This region is if-converted in a similar manner to inner loop regions except that there is no loop back-edge, and instead of creating a new basic block after the last basic
block in the region, it uses the same loop exit basic block (BB0’) as the nested loop it precedes. For branches that branch out of the pre-loop region, but do not jump to the succeeding nested loop, a branch bypass path is taken around the nested loop to the loop exit basic block, BB0’. From there, control proceeds to the appropriate exit destination.

    combine_loop_exits(main_loop) {
        if (main_loop has child loops) {
            for each (child_loop) {
                combine_loop_exits(child_loop);      /* recurse into nested loop    */
                combine_pre_loop_exits(child_loop);  /* handle its pre-loop region  */
            }
            combine_post_loop_exits(main_loop);      /* handle the post-loop region */
        } else
            combine_inner_loop_exits(main_loop);     /* base case: an inner loop    */
    }
Figure 6.11 – Recursive procedure for combining loop exits in outer loops.
Figure 6.12 – Multi-level if-conversion to eliminate all unnecessary branches; the multi-level loop is partitioned into a pre-loop region, an inner loop body, and a post-loop region.
With parallel outer loops, there is still the potential that parallel iterations may have instruction control flow deviations when nested loops branch in different directions. In such events, the speculative execution support comes into play and squashes the deviating loop iterations. This will be discussed in greater detail shortly. After performing Multi-Level If-Conversion on the parallel loops, it is also necessary to insert the operations for setting and resetting broadcast mode, performing synchronization, and initializing loop variables. As described earlier, begin broadcast and end broadcast statements are used to put the processor into broadcast mode and back into normal mode, respectively.
As shown in Figure 6.8, these are to be placed
immediately before and after the parallel loop. Synchronization operations are also needed for supporting speculative execution. These synchronization operations act as checkpoints. At each checkpoint, the processor state is saved to provide a recovery point in case a misspeculation occurs. At the beginning of the loop, the begin broadcast operation can double as a synchronization operation, but synchronization is also needed at both the end of the loop and between groups of parallel iterations, so the end broadcast operation is not sufficient. A checkpoint operation needs to be placed in the parallel loop’s back-edge/exit basic block so that synchronization occurs on both the loop back-edge and the loop exit. Initialization operations are also needed to prepare all the variables for parallel loop execution. There are two types of initialization operations. The first type consists of copy operations for variables in the loop’s live-in set. Each loop needs access to all variables in the live-in set, so they must be copied to each cluster. The second type of operation initializes the loop induction variables. Because the loop induction variables will need to
be initialized separately for each cluster, the initialization of loop induction variables should occur before the begin broadcast operation.
However, the live-in set copy
operations could (and generally should) be scheduled after the begin broadcast operation. After multi-level if-conversion and initialization of all the loop variables, the program code is ready for parallel execution.

Speculative Hardware Support

Because static dependence analysis was not used to guarantee that there are no data dependences in parallel loops, it is necessary to assume that memory conflicts will occur. Profiling of loop memory iterations is performed to determine whether loops have data dependences under a given input set, but this does not definitively determine loop memory independence. Fortunately, the processing regularity in multimedia enables the profiling results to be relatively accurate, so by only choosing loops for parallelization whose profiles indicate no memory conflicts, memory conflicts will typically occur infrequently. Even so, it is necessary to provide a method of recovery when a memory conflict between loop iterations does occur.

Three Forms of Speculation

In addition to supporting speculative execution in the shadow of potential memory conflicts, it is also necessary to support control flow speculation. As mentioned above with regards to scheduling, it is sometimes desirable to perform parallel execution on an outer loop, but this gives rise to potential control flow deviations from nested loops. When control flow deviations do occur, a method for squashing the deviating outer loop iterations and recovering from the misspeculation is necessary. Furthermore, it is also desirable to support parallelism on loops, such as while loops, which have an
indeterminate number of loop iterations. This can only be achieved by speculatively executing as many parallel iterations as possible and then squashing and recovering from those iterations that exceeded the loop bounds.
Summarized, these three areas of
speculation are:

1.) Memory Independence Speculation
2.) Control Flow Speculation
3.) Parallel Loop Iteration Speculation

Of these three, memory independence speculation is the most difficult to support. The other two areas of speculation are primarily covered by the speculation and recovery requirements of Memory Independence Speculation, so they can be supported with little or no additional hardware.

Levels of Speculative State

Support for recovery from the misspeculation of parallel loop execution requires extending the size of the speculative state that may be maintained by the processor. As shown in Figure 6.13, there are a number of processor levels that could potentially maintain speculative state. Of these various levels, the only commonly used levels for supporting speculative state are register renaming and the register file. Of these two, the most frequently used is register renaming. While register renaming can be performed by the compiler, most general-purpose superscalar architectures employ out-of-order scheduling with register renaming in hardware.
The register renaming enables
speculative out-of-order execution of operations in the issue-reorder buffer. The renamed registers may hold a speculative processor state without contaminating the actual state contained in the register file. Those operations that are misspeculated are cancelled at
instruction retirement so that only correctly speculated operations update the true processor state.
Figure 6.13 – Potential levels of speculative processor state (RR = renaming registers, RF = register file, L1/L2/L3 = cache levels, EM = external memory).

Speculative state may be maintained in the register file by using a restorable register file, where registers contain a backup bit for each primary storage bit, as displayed in Figure 6.14. The primary bits in each register can hold speculative state while the backup bits maintain the true processor state. Consequently, this method allows a larger speculative state than register renaming by enabling the entire register file to hold a speculative state. Checkpoint operations are used to save the processor state before and after speculative execution. When in speculative execution mode, if the processor proceeds correctly to the next checkpoint without a misspeculation, then at the checkpoint the speculative state is known to be true and is stored into the backup register bits.
If a misspeculation occurs, then the saved processor state is restored and the
processor restarts from the last checkpoint and proceeds again in a non- or less-speculative mode. This method was used on a cycle-by-cycle basis for recovery from exceptions in the DAISY and BOA processors for binary translation [151][152], and is also being used in a recent patent [153].
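A minimal software model of the checkpoint/restore semantics of the backup-bit register file of Figure 6.14 (the hardware performs these copies in parallel in a single operation; NUM_REGS is an assumed register count):

    #define NUM_REGS 64

    static long reg[NUM_REGS];     /* primary bits: may hold speculative state */
    static long backup[NUM_REGS];  /* backup bits: last committed state        */

    void checkpoint(void)          /* commit: speculative state becomes true   */
    {
        int r;
        for (r = 0; r < NUM_REGS; r++)
            backup[r] = reg[r];
    }

    void restore(void)             /* misspeculation: roll back to checkpoint  */
    {
        int r;
        for (r = 0; r < NUM_REGS; r++)
            reg[r] = backup[r];
    }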
Figure 6.14 – Register file with backup bit for enabling speculation.

In the SBL method, we use the checkpoint scheme with a backup register file to enable speculative execution of broadcast loops.
As demonstrated in Figure 6.15,
execution shifts between a non-speculative sequential (normal) mode and a speculative parallel (broadcast) mode. In the figure, the processor is computing in normal non-speculative mode until it comes upon a broadcast loop. It synchronizes at the first checkpoint, then continues executing in speculative broadcast mode until either the next checkpoint is encountered or a misspeculation (from one of the three speculation areas) occurs. In a), no misspeculation occurs, so the broadcast loop executes two consecutive sets of parallel loop iterations and then returns to normal mode. In b), the broadcast loop successfully executes one set of parallel loop iterations, but misspeculates on the second set, so it backs up to the last checkpoint (last saved state), and then continues again in a less speculative mode. In c), the broadcast loop fails on the first set of parallel loop iterations, so it backs up to the saved state just before entering broadcast mode, and restarts in a less speculative mode.
Figure 6.15 – Checkpoint scheme stores processor state and enables a recovery point for misspeculations.

Unfortunately, the speculative register file and checkpoint scheme do not provide a large enough speculative state for Speculative Broadcast Loop execution.
When
concurrently executing potentially non-parallel loop iterations, speculative memory support is necessary to ensure proper recovery from any memory conflicts that may occur. Therefore, a large-scale speculative state that defines a portion of memory as speculative is necessary.
Because it will be necessary for the processor to recover
quickly from misspeculations, it is desirable to have the speculative memory as close to the processor as possible. Similar to the speculative multi-threading approaches, we propose extending the processor’s speculative state to include the L1 data cache. During broadcast mode, the L2 cache will be used to hold the true processor memory state, while the L1 data cache holds the speculative state. The full speculative model for the SBL method therefore includes restorable register files and checkpointing as well as a speculative L1 data cache. To enable speculation in the L1 data cache, a few modifications are necessary. First, during broadcast mode, writes to memory can only be sent to the L1 data cache. They cannot be written through to lower levels of the cache hierarchy, nor can
speculative cache lines be replaced if there is insufficient storage in the cache. Consequently, the use of a speculative L1 cache requires a write-allocate policy and a greater degree of associativity.
The TLDS multithreading research found that an associativity of up to 4 was typically sufficient for general-purpose applications, but associativities as low as 2 were possible if a victim cache was also used [148]. We expect an associativity of 4 will be sufficient for media processing. With regards to using the L1 data cache to hold speculative state, it should also be noted that while the size of the speculative state that can be supported by the L1 data cache is quite large, it is not infinite. This means the L1 data cache will occasionally run out of room, so a provision must be made for squashing loops because of insufficient cache space. Consequently, there is now a fourth form of squashing – a squash based on insufficient speculative state.

Two additional fields are needed in each cache line of the L1 data cache in order to support speculative memory. The first is an extra state bit that declares whether the cache line contains speculative data. The second field, max write iter (MWI), indicates the maximum cluster ID that wrote to this cache line. The purpose of the MWI field will be discussed in more detail below. The speculative state bits are designed to prevent the processor from overwriting true processor state. They indicate whether a cache line contains true processor state information (when the speculative bit is not set), or whether it contains speculative, potentially invalid information.

Use of the speculative bits with respect to SBL execution proceeds as follows. Upon entering speculative broadcast mode, it is not desirable to immediately cast out every dirty L1 cache line. Instead, all dirty cache lines remain in the L1 data cache until they are targeted by a speculative write operation. When a speculative write is made to a dirty cache line, the speculative bit is checked first. If the speculative bit is not set, the cache line contains valid processor state information, so the cache line is immediately cast out to the L2 cache. Then the memory write is completed and both the dirty and speculative bits are set for the cache line. Subsequent writes to the cache line will see that the speculative bit is set, and can write to the cache line normally. When reaching a checkpoint without having encountered a misspeculation, all speculative data in the L1 cache immediately becomes valid, and the speculative bits are cleared.
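This write path and checkpoint commit can be summarized in a short sketch. The C fragment below is purely illustrative: the type and helper names (CacheLine, spec, mwi, cast_out_to_L2, and so on) are our own labels for the mechanisms just described, not part of any actual implementation, and the L2 write-back is reduced to a stub.

    #include <stdbool.h>
    #include <stdint.h>

    #define LINE_SIZE 64

    typedef struct {
        uint8_t data[LINE_SIZE];
        bool    dirty;
        bool    spec;    /* speculative state bit */
        int     mwi;     /* max write iter: highest cluster ID to write this line */
    } CacheLine;

    /* Model stub: write the line's current (true-state) contents to the L2. */
    static void cast_out_to_L2(CacheLine *line) { (void)line; }

    /* Speculative write during broadcast mode (write-allocate assumed). */
    static void spec_write(CacheLine *line, int offset, uint8_t value, int cluster)
    {
        if (line->dirty && !line->spec)
            cast_out_to_L2(line);      /* preserve true processor state first */
        line->data[offset] = value;
        line->dirty = true;
        line->spec  = true;            /* subsequent writes skip the cast-out */
        if (cluster > line->mwi)
            line->mwi = cluster;       /* remember the latest iteration to write */
    }

    /* A checkpoint reached without misspeculation validates all speculative
     * data: only the speculative bits (and MWI fields) need to be cleared. */
    static void commit_checkpoint(CacheLine *cache, int nlines)
    {
        for (int i = 0; i < nlines; i++) {
            cache[i].spec = false;
            cache[i].mwi  = -1;
        }
    }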
Conversely, if a checkpoint is not successfully reached because of a misspeculation, the speculative data in the L1 cache is invalid, so all speculative cache lines must be marked invalid. In the next section we shall see that there is also the possibility of misspeculating on some iterations, but continuing parallel execution of others. This requires a selective checkpoint and recovery scheme, which complicates support for speculation in the L1 data cache. Selective squashing is the reason for the max write iteration (MWI) field. Its function will be discussed in more detail below.

Selective Squashing of Loop Iterations

Of the three forms of loop speculation defined earlier, we anticipate that Memory Independence Speculation and Parallel Loop Iteration Speculation will only rarely cause misspeculations. Conversely, in order to support outer loop parallelism on a SIMD processing platform, we expect control flow deviations from nested loops with non-static iteration counts will occur much more regularly. When this occurs it would be desirable to squash just the deviating loops and continue executing the remaining loop iterations in parallel. While this requires additional hardware support, we felt the potential gain was
worth the cost. Furthermore, this functionality gives the SBL method the capability to support partially parallel loops. For selectively squashing loop iterations when a misspeculation occurs, it is important to pay attention to sequential program order. Since none of the loops being broadcast are provably parallel, we cannot allow a later iteration to continue executing after squashing an earlier iteration without potentially violating sequential consistency. Consequently, when we squash any loop iteration we must also squash all loop iterations following it in sequential program order. To enable this, we enforce a left-to-right ordering on the clusters in the processor, as shown in Figure 6.16. Cluster 0 has the highest priority and will always be assigned the earliest loop iteration in program order. Additionally, when not running in broadcast mode, all sequential code will be executed on cluster 0. We refer to it as the Sequential Master. The priority decreases from cluster 0, such that cluster n-1 in an n-cluster architecture will always receive the latest, most speculative, loop iteration. As a result, when a given loop iteration is squashed, all loop iterations in clusters to its right must also be squashed. This method is not as flexible as the ring organization of the multiscalar architecture, but our implementation allows cluster priority to be hardwired into the architecture, thereby simplifying the control hardware for SBL.
Figure 6.16 – Left-to-right ordering of clusters in the processor.
After a loop iteration has been squashed, it remains assigned to the same cluster. Because a given cluster has already been initialized for a particular iteration (i.e. the cluster was properly initialized with the appropriate iteration variables, and it may have already pulled in some of the pertinent cache lines if it has its own local cache), it is best to let the iteration re-execute on that cluster. After the current set of parallel loop iterations has completed, the broadcast loop will restart for those iterations that were squashed. The left-most clusters that have completed their loop iterations will become quiescent while the later loop iterations are re-executed.
They cannot be assigned new loop iterations until the right-most clusters have completed their iterations, because doing so would violate the left-to-right ordering constraint. To keep track of which iterations have been squashed and which have completed, the processor uses two pairs of cluster pointers. The first pair of pointers (Cs, Ce) indicates the cluster IDs for the clusters containing the earliest (Cs) and latest (Ce) parallel iteration being executed. The Cs pointer is particularly important because it points to the current master loop iteration. The cluster executing this loop iteration defines the control flow for all other iterations. Since all loop iterations before it have already completed, it is the least speculative iteration being executed (although it can still be cancelled due to memory conflicts). The second pair of pointers (Ns, Ne) defines the start and end cluster IDs for the next set of parallel loop iterations. As loop iterations are squashed during SBL execution, these pointers change to reflect the actively executing iterations and the set of iterations to be executed next round. As an example, assume an 8-issue multi-cluster architecture is executing a parallel loop with 50 iterations. The first round assigns iterations 0-7 to clusters 0-7.
The (Cs, Ce) pointers then become (0, 7), and because we are initially assuming all 8 iterations will complete, the (Ns, Ne) pointer set also becomes (0, 7). During execution, the iteration in cluster 4 misspeculates, so (Cs, Ce) and (Ns, Ne) are immediately modified. The Ce pointer now changes to the latest iteration still being executed, so (Cs, Ce) becomes (0, 3). Likewise, (Ns, Ne) are set to begin the next execution round starting with cluster 4, so (Ns, Ne) becomes (4, 7). Assuming the first round completes without further misspeculations, the next round starts with (Cs, Ce) equal to (4, 7), and (Ns, Ne) is once again set to start with the next group of iterations, so it becomes (0, 7).
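A minimal sketch of this pointer bookkeeping, under the same assumptions as the example above (the names ClusterPtrs, squash_from, and next_round are ours, not hardware signal names):

    typedef struct { int Cs, Ce, Ns, Ne; } ClusterPtrs;

    /* A misspeculation in cluster c squashes c and every cluster to its
     * right, per the left-to-right ordering of Figure 6.16. */
    static void squash_from(ClusterPtrs *p, int c)
    {
        p->Ce = c - 1;   /* latest iteration still executing */
        p->Ns = c;       /* the next round re-executes the squashed tail */
        /* p->Ne is unchanged: it still marks the end of that tail */
    }

    /* At the end of a round, the next set becomes current and, absent
     * further squashes, a fresh full set of iterations is lined up. */
    static void next_round(ClusterPtrs *p, int nclusters)
    {
        p->Cs = p->Ns;
        p->Ce = p->Ne;
        p->Ns = 0;
        p->Ne = nclusters - 1;
    }

Starting from (0, 7)/(0, 7), squash_from(&p, 4) yields (0, 3)/(4, 7), and next_round(&p, 8) then yields (4, 7)/(0, 7), matching the worked example.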
Of course, there are ramifications of this scheme on handling the speculative state. When some iterations are squashed and the remaining iterations continue executing, the speculative state for the squashed loop iterations may also need to be squashed. When iterations are squashed from memory conflicts or because of executing an excessive number of iterations (i.e. overshooting the loop bounds), the speculative state for these squashed iterations is invalid and needs to be restored. However, for iterations squashed because of instruction control flow deviations, the speculative state does not need to be restored, because during re-execution of that iteration the loop will again write to and read from the same memory locations. Consequently, speculative state recovery is only needed after memory conflict and excessive loop iteration squashes. Recovery from these types of squashes involves restoring the original processor state that was saved at the last checkpoint. However, for selective squashing we should only restore the state for the squashed loop iterations; the state should not be changed for those iterations that completed correctly. Selectively restoring register files is easily accomplished by resetting the register files only in the clusters whose loop iterations were squashed. Unfortunately, selective restoring is not as simple for memory. Selective recovery in memory is complicated by the fact that multiple loop iterations may share the same cache line, so a single cache line could be written by many loop iterations. Using selective squashing, some of the entries in the cache line may have been written by squashed loop iterations, so they will need to be invalidated. This is the purpose of the max write iter (MWI) field in each cache line. By keeping track of the maximum cluster ID to write a cache line, it can be determined whether any squashed loop iterations wrote to that cache line. If so, it is necessary to generate a valid mask for the cache line, which indicates which bytes in the cache line are valid and which are invalid. As seen in the next section, a loop memory conflict (LMC) cache is used to keep track of the memory accesses made for each iteration. And while it may take many processor cycles, the LMC maintains sufficient information to compute the valid mask. Following completion of the remaining valid iterations, valid masks must be generated for all cache lines with invalid speculative data. All these cache lines must then be cast out to the L2 cache, writing back only those bytes marked as valid in the valid mask. This process of selectively restoring memory is expected to require a significant amount of time, but since memory conflict and excessive loop iteration squashes are not expected to occur frequently, the extra recovery time should have little impact on overall performance.
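As a rough sketch of how the MWI field gates this recovery work, assume a hypothetical LMC query lmc_written_by (our name, reduced to a stub) that reports whether a given cluster wrote a given byte of a line:

    #include <stdbool.h>

    #define LINE_SIZE 64

    /* Model stub for the LMC query: did the iteration on cluster c
     * write byte b of the line identified by tag? */
    static bool lmc_written_by(unsigned tag, int b, int c)
    {
        (void)tag; (void)b; (void)c;
        return false;
    }

    /* Build the per-byte valid mask for one cache line after squashing
     * all clusters >= first_squashed.  Only lines whose MWI reaches into
     * the squashed range need the (slow) per-byte LMC scan. */
    static void build_valid_mask(unsigned tag, int mwi, int first_squashed,
                                 int nclusters, bool valid[LINE_SIZE])
    {
        for (int b = 0; b < LINE_SIZE; b++) {
            valid[b] = true;
            if (mwi >= first_squashed)          /* a squashed iteration wrote this line */
                for (int c = first_squashed; c < nclusters; c++)
                    if (lmc_written_by(tag, b, c))
                        valid[b] = false;       /* byte must not be written back to L2 */
        }
    }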
Loop Memory Conflict Cache

The final element necessary to enable speculative execution is a test mechanism that checks for memory conflicts during execution of parallel loop iterations. Additionally, for supporting the selective iteration squashing mechanism, a set of rules is also needed for determining which iterations to squash on a memory conflict. We shall first discuss this set of rules and then examine the loop memory conflict (LMC) cache. Figure 6.17 summarizes the rules for squashing loop iterations due to a memory conflict. This figure shows the three possible situations in which a memory conflict may occur. These three categories correspond to the three types of data dependences. In the figure, the 'x's below the loop iterations indicate which iterations must be squashed in each case for maintaining memory correctness. When one of the accesses is a read, the iteration performing the read access defines the first iteration to be squashed. When an output dependence conflict occurs, the earlier iteration defines the first iteration to be squashed. Also note the implications of iteration squashing on the (Ns, Ne) pointers. To prevent the same conflict from occurring again, (Ns, Ne) must be assigned the appropriate loop bounds as defined in Figure 6.17. Given this set of rules for squashing loop iterations on memory conflicts, the final requirement for speculative execution is a test mechanism for checking whether memory conflicts exist. This is provided by the Loop Memory Conflict (LMC) cache. The LMC cache keeps track of all memory accesses in each parallel loop iteration.
Using a specialized cache for performing memory conflict checking provides numerous benefits. First, a cache provides high-speed on-chip memory without contaminating the L1 data cache with additional memory access information arrays. Second, a hardware approach enables cycle-by-cycle checking of memory conflicts as opposed to a post-loop-execution checking phase. This nearly eliminates the post-loop memory conflict checking overhead and also enables early notification of misspeculations. Early notification prevents the misspeculated loop iterations from continuing to access memory locations, and so avoids additional memory stall penalties. And finally, the cache methodology eliminates the need for the compiler to schedule actual memory conflict checking code into the program, so the loop will run at normal speed.
[Figure 6.17 depicts the three conflict cases (flow dependence, anti-dependence, and output dependence) across parallel iterations 0 to n-1, marking with 'x' the iterations squashed in each case and the resulting (Ns, Ne) bounds.]
Figure 6.17 – Guidelines for squashing loop iterations with memory conflicts.

Figure 6.18 and Figure 6.19 below show two possible versions of an LMC cache. The first design uses a single global LMC cache that keeps track of the memory access information for all clusters, while the second offers a distributed design where each cluster has a separate local LMC cache for keeping track of its own memory accesses. In both LMC cache designs, the cache is first completely invalidated when beginning speculative execution. Each memory access is then logged in the LMC cache on a byte-by-byte basis. A memory conflict checker continually examines the LMC cache over the course of the speculative period, comparing the memory accesses between iterations to
determine whether a memory conflict has occurred. While these two designs are inherently different in how they check for memory conflicts, both work to detect memory conflicts between separate parallel loop iterations. When a memory conflict occurs, the LMC cache returns the cluster IDs corresponding to the conflicting loop iterations. Based on these IDs, the appropriate loop iterations are squashed according to the rules defined in Figure 6.17.
Figure 6.18 – Global loop memory conflict (LMC) cache design.

In both LMC cache designs, the memory statistics for each byte are recorded in a byte conflict information (BCI) field. This field is organized into two subfields, one that keeps track of reads to this memory location and one that keeps track of writes. In the global LMC cache, there is a separate bit in each subfield for each iteration, so when loop iteration i accesses a memory location, it sets either the ith read bit (Ri) or the ith write bit (Wi) in the appropriate byte conflict information field. Consequently, each byte conflict information field in the LMC cache contains 2n bits, where n is the number of clusters in the processor. The distributed LMC cache similarly monitors memory accesses, but since each local LMC cache only monitors its own memory accesses, only 2
bits are needed for the byte conflict information field. However, since there will be n separate local LMC caches, this still amounts to the same aggregate number of bits for the byte conflict information field. Memory conflict checking is relatively simple within the global LMC cache since the memory access information for all loop iterations is contained in the same byte conflict information field. After updating a byte conflict information field, the field can be checked for a memory conflict. This involves checking to see whether two different loop iterations have accessed this byte. If so, and one or both of the iterations performed a write access, a memory conflict has occurred.
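For illustration, the global check reduces to a few bit operations per access. In the hedged sketch below, the BCI subfields are packed into two bitmasks, one bit per cluster (assuming at most 32 clusters); the names are ours:

    #include <stdbool.h>
    #include <stdint.h>

    /* Byte conflict information (BCI) in the global LMC cache: one read
     * bit and one write bit per cluster, i.e. 2n bits per monitored byte. */
    typedef struct {
        uint32_t r;   /* bit i set: iteration on cluster i read this byte */
        uint32_t w;   /* bit i set: iteration on cluster i wrote this byte */
    } BCI;

    /* Log an access, then test the updated field: a conflict exists when
     * two or more iterations touched the byte and at least one wrote it. */
    static bool bci_access(BCI *b, int cluster, bool is_write)
    {
        if (is_write)
            b->w |= 1u << cluster;
        else
            b->r |= 1u << cluster;

        uint32_t touched  = b->r | b->w;
        bool     multiple = (touched & (touched - 1)) != 0;
        return multiple && b->w != 0;
    }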
[Figure 6.19 shows per-cluster local LMC caches feeding a conflict checker, which compares the tags of lines in the same sets across iterations; if the tags match and multiple iterations access the same byte, with at least one access being a write, a conflict is indicated.]
Figure 6.19 – Distributed loop memory conflict (LMC) cache design.

Memory conflict checking on the distributed LMC cache design is more complicated because all the loop access information is distributed. The memory access information for each iteration must be combined for checking. One alternative for doing this is to send the memory access information for all loop iterations to a central location
for checking. However, since sending the contents of the entire cache at once would require too many wires, we recommend sending a single set of cache lines at a time. Checking then proceeds by comparing cache lines in corresponding sets to see if the tags match, followed by a byte-wise cache line check to see if any memory bytes conflict, indicating a cross-iteration dependence. In this scheme, an associativity of 4 or less is recommended, since k² comparators are needed to compare the tags between two loop iterations with k cache lines per set.
This number would become too large for associativities greater than 4. Another alternative for combining memory access information for separate loop iterations in a distributed LMC cache is to distribute the information for one cluster (one loop iteration) to each of the other clusters individually, and then perform checking locally at each cluster. A ring-based approach could work well here, where the memory access information for each cluster progresses around the ring to the other clusters for local conflict checking. As the memory access information for a cluster progresses around the ring, it is compared with the memory access information of each local cluster to see if the two loop iterations conflict. After comparing memory against the local cluster, the memory access information progresses to the next cluster in the ring, compares memory with that cluster, then moves on again, and so on. In an n-cluster machine, assuming a loop iteration's memory access information can be compared against a local cluster and progress to the next cluster in the ring in one cycle (which should be feasible since the two can be done in parallel), all loop iterations can be compared in n-1 cycles.
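The ring schedule can be modeled in software as follows; conflicts() stands in for the byte-level comparison of two iterations' access information, the inner loop models what the clusters would do concurrently in hardware, and all names are illustrative:

    #include <stdbool.h>

    typedef struct AccessInfo AccessInfo;   /* one iteration's LMC contents */

    /* Model stub: byte-level comparison of two iterations' access info. */
    static bool conflicts(const AccessInfo *a, const AccessInfo *b)
    {
        (void)a; (void)b;
        return false;
    }

    /* Each cluster's access summary circulates the ring, being compared
     * against one local cluster per step; n-1 steps cover every pair. */
    static bool ring_check(const AccessInfo *info[], int n,
                           int *earlier, int *later)
    {
        for (int step = 1; step < n; step++) {
            for (int c = 0; c < n; c++) {        /* concurrent in hardware */
                int from = (c + n - step) % n;   /* info arriving at cluster c */
                if (from < c && conflicts(info[from], info[c])) {
                    *earlier = from;             /* earlier iteration in program order */
                    *later   = c;
                    return true;
                }
            }
        }
        return false;
    }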
Hardware Delay and Complexity Issues

In the preceding sections, we have described some hardware additions necessary to a multi-cluster architecture for supporting the SBL method. In summary, those hardware modifications include:

• ability to broadcast the instruction control stream to all clusters
• two pairs of cluster pointers, (Cs, Ce) and (Ns, Ne), identifying the current and next set of clusters to execute in parallel
• checkpoint support
• restorable register files
• two additional fields in the L1 data cache
• a global or distributed LMC cache
• a global or distributed memory conflict checker
• hardware support for selectively restoring memory state
Among these required modifications, the first four are relatively trivial. Assuming we have a single global instruction cache, there already exists support for distributing instructions to clusters, so adding broadcast capability requires negligible hardware modifications. Likewise, the checkpoint signals and the signals defining the cluster pointers could also originate near the instruction cache and be broadcast with instructions. There will be some wire delay when switching between different cluster pointer configurations, but this delay should be less than the overhead for synchronizing at the end of speculative periods while waiting for memory conflict checking to complete. Finally, the restorable register files will increase register file area (by about 40-50%), but this is not expected to significantly affect cycle time.
The remaining four hardware additions are the primary noteworthy modifications. With respect to the additional fields in the L1 data cache, it is necessary to be able to globally reset both fields at the beginning and end of a speculative execution period, so these two cache line fields will likely need to be contained in a small, separate cache directory. With regards to the LMC cache, we face the same design challenges as for the L1 data cache. To support a global cache, the cache will need to be highly banked, and as with the data cache, this will probably only work for up to 4 or 8 parallel accesses. Consequently, the distributed LMC cache is the more likely choice. Unlike the data cache, however, the delay of the LMC cache will not significantly affect performance. Because synchronization is provided at the end of each speculation period, it is not necessary for memory conflict checking to declare conflicts prior to synchronization. Consequently, accesses to the LMC cache and memory conflict checking can be done asynchronously with actual execution, so LMC cache delay only affects the amount of synchronization overhead required at the end of a speculative period. A final concern with respect to the LMC cache is its size. The size of the LMC cache is determined both by the number of bytes being monitored and by the number of clusters. For more than 4 clusters, the LMC cache requires more bits for memory monitoring than the actual memory does. Fortunately, the size of the LMC cache does not need to be that large.
In the results section we initially ran the experiments assuming an LMC cache that monitored all memory accesses in the L1 data cache (i.e. monitored 32 KB), but additional experiments showed that LMC caches monitoring only 4-8 KB of memory decrease performance by only about 5%. Additionally, if
the SBL method is used in conjunction with a parallelizing compiler, it would only be necessary to monitor those memory accesses that have indeterminate data dependences (as done in the LRPD test). This could significantly decrease the amount of memory that must be monitored. Finally, if the number of clusters should become too large, there is always the option to perform memory monitoring and access checking as part of the cache coherence protocol, as proposed by the multiscalar and multithreading speculative approaches. With regards to memory dependence checking, its implementation is relatively straightforward in a global LMC cache, but is a bit more complicated with the distributed LMC cache. We believe the distributed ring memory checking scheme described above will work well without impacting synchronization overhead too significantly.
The primary issue with the distributed ring checker is how much memory access information is kept for each loop iteration. If it is too large, the number of wires may prohibit passing all memory access information for an iteration at once, requiring it to be broken into parts. With small LMC caches that monitor only 4-8 KB of memory, it should be feasible to pass all memory access information for an iteration in only one or two parts. The most complicated portion of the hardware for SBL execution involves selective recovery of memory. This involves examining the max write iteration (MWI) field for each cache line in the L1 data cache and determining whether a data cache line contains invalid data. If so, it is necessary to construct the valid mask using the memory access information stored in the LMC cache. We expect this recovery process will take a significant number of cycles. However, the squashes requiring selective recovery of
memory are not expected to happen very often.
Should it turn out that they happen more frequently, it may prove too expensive to support selective squashing for those types of misspeculations.

Results

To examine the effectiveness of the SBL method for multimedia, the MediaBench benchmark suite was simulated on a multi-cluster architecture using various numbers of clusters.
The first experiment evaluates performance assuming perfect cache memory, while the second examines performance with a global cache memory hierarchy. A final experiment examines the performance of select MediaBench applications after manually applying parallel compiler optimizations in a manner similar to a parallelizing compiler. The results measure speedup as the ratio of the performance of a multi-cluster architecture to the performance of a non-clustered architecture with resources equivalent to one cluster of the multi-cluster architecture. The base processor used in these experiments is a 4-issue processor with 4 ALUs, 2 memory units, 1 shifter, 1 multiplier, 1 floating-point unit, and 1 branch unit. Because both architectures can take equal advantage of ILP, the two processors are always compared using the same ILP compilation techniques. The LMC cache used in these initial experiments was a 32 KBE (KB entry) cache, meaning it monitors up to 32 KB of memory.
All other architecture parameters are consistent with the base processor model used in Chapter 4. One benefit of the SBL method is that it does not require any specific datapath architecture.
The SBL method effectively views the datapath as a black box with external interfaces for memory access, instruction control flow, and synchronization.
How the datapath works internally has little to no impact on the method. Consequently, the datapath can assume any architecture. Within these experiments we will examine three architecture models: 1) a VLIW architecture with conservative fetch and a compressed instruction format, 2) an in-order superscalar with aggressive fetch, and 3) an out-of-order superscalar with aggressive fetch. When measuring performance for these three processor models, performance will be compared with the single-cluster architecture using the same datapath. Consequently, a multi-cluster VLIW architecture is compared with a single-cluster VLIW, while a multi-cluster out-of-order superscalar processor is measured against a single-cluster out-of-order superscalar processor.

SBL Results on Multi-Cluster with Perfect Cache

The first experiment evaluates the performance of multi-cluster architectures with VLIW, in-order superscalar, and out-of-order superscalar datapaths. The classical ILP compilation method is used on both the multi-cluster and base single-cluster architectures. Two heuristics were used for selecting loops for broadcast speculative execution. The first heuristic, 'dp', simply selects the outermost loop for broadcast in an attempt to maximize parallelism. The second heuristic, 'smt', is similar in that it favors outer loop levels, but it only selects loops whose profiling information indicates an average of 2+ iterations and 5+ operations in the loop body.
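A sketch of the 'smt' selection test, assuming per-level profile data is available (the LoopProfile fields and pick_smt_loop are hypothetical names, not IMPACT interfaces):

    typedef struct {
        double avg_iterations;   /* from loop profiling */
        int    body_ops;         /* operations in the loop body */
    } LoopProfile;

    /* Walk a loop nest from the outermost level inward and return the
     * first level whose profile clears the smt thresholds (an average of
     * 2+ iterations and 5+ operations); -1 means no level qualifies. */
    static int pick_smt_loop(const LoopProfile nest[], int nlevels)
    {
        for (int level = 0; level < nlevels; level++)   /* 0 = outermost */
            if (nest[level].avg_iterations >= 2.0 && nest[level].body_ops >= 5)
                return level;
        return -1;
    }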
compressed instruction format, 2) an in-order superscalar with aggressive fetch, and 3) an out-of-order superscalar with aggressive fetch. When measuring performance for these three processor models, performance will be compared with the single-cluster architecture using the same datapath. Consequently, a multi-cluster VLIW architecture is compared with a single-cluster VLIW, while a multi-cluster out-of-order superscalar processor is measured against a single-cluster out-of-order superscalar processor. SBL Results on Multi-Cluster with Perfect Cache The first experiment evaluates the performance of multi-cluster architectures with VLIW, in-order superscalar, and out-of-order superscalar datapaths. The classical ILP compilation method is used on both the multi-cluster and base single-cluster architectures. Two heuristics were used for selecting loops for broadcast speculative execution. The first heuristic, ‘dp’, simply selects the outer most loop for broadcast in an attempt maximize parallelism. The second heuristic, ‘smt’, is similar in that it favors outer loop levels, but it only selects loops whose profiling information indicate an average of 2+ iterations and 5+ operations in the loop body. The results using these heuristics will be compared against the ideal parallism, which is calculated using Amdahl’s Law according to the number of clusters and the percentage of the program recognized as parallel. The average results across all of the MediaBench applications are given in Figure 6.20 and Figure 6.21. Figure 6.20 shows the average number of loop iterations the SBL
The average results across all of the MediaBench applications are given in Figure 6.20 and Figure 6.21. Figure 6.20 shows the average number of loop iterations the SBL method was able to execute within parallel loop regions. The results indicate the ideal, perfect, and true performance for the dp and smt heuristics.
The ideal number of iterations when executing in broadcast mode is equivalent to the number of clusters in the multi-cluster architecture.
The perfect results define the average number of loop iterations that could be executed if none of the loops were squashed (i.e. no memory conflicts, no instruction flow deviations, etc.). The true results indicate the average number of loop iterations that are actually executed when accounting for squashed loops that had to be re-executed. Overall, the parallelism results are quite good. For up to 8 clusters, the true results are within 80% of the ideal performance and 90% of the perfect performance results. Additionally, the percentage of squashed loops is quite low, and the majority of squashes are caused by instruction flow deviations. The SBL method shows excellent performance on the parallel regions within MediaBench.
Figure 6.20 – Average number of loop iterations executed in parallel by the SBL method; uses classical compilation and compares the dp and smt selection heuristics.

As shown in Figure 6.21, the performance of the SBL method on full applications is not quite as good. The SBL method was not able to find large enough amounts of
parallelism to achieve more than 2x speedup. To determine why performance was lower than expected, we examined the loop profiling statistics. Shown below in Figure 6.22 are the parallelism statistics from loop profiling. The overall parallelism in these applications averages about 40-50%, and only a few of the applications exceed 50% parallelism. While it was expected that some applications would have low parallelism, such as the audio benchmarks which had 0% parallelism (not shown), it was expected that the parallelism would be higher in most applications. In particular, the image and video compression benchmarks such as cjpeg, djpeg, h263dec, h263enc, and mpeg2enc were expected to have much more parallelism.
Figure 6.21 – Average performance of SBL method on MediaBench; uses classical compilation and compares the dp and smt selection heuristics.

It is encouraging to note, however, that the SBL method performed quite well with respect to the ideal performance. Even including the synchronization overhead (usually about 4-8 cycles) at the end of speculative periods, the SBL method was still able to perform at about 70-80% of ideal performance. The major exception to this trend is the out-of-order superscalar datapath.
Because the out-of-order superscalar datapath is able to achieve much higher IPC than in-order superscalar or VLIW datapaths, when using SBL execution in a multi-cluster architecture the synchronization overhead affects its performance much more significantly.
Consequently, the multi-cluster out-of-order superscalar is only able to achieve speedups of about 50% of ideal.

[Figure 6.22 plots loop parallelism (%) for each application, with separate series for the training and evaluation data sets.]
Figure 6.22 – Maximum loop parallelism according to profiling statistics.

On the applications that did display a significant amount of parallelism, the broadcast loop run-time method achieved excellent performance. Figure 6.23 and Figure 6.24 display the results for the epic and djpeg benchmarks. Epic, which has a parallelism of nearly 90% according to profiling, was able to achieve nearly a 6x speedup on the 16-issue multi-cluster architecture. The fact that the epic benchmark performed so well indicates that other benchmarks may do even better, assuming similar degrees of parallelism can be found using a parallel compiler. The epic benchmark demonstrated an especially large number of instruction control flow deviations, so a significant amount of overhead was required for recovering from misspeculated outer loops.
Programs containing loops with more statically determined iteration counts, such as MPEG-2 and H.263, should perform exceptionally well once the parallelism is found.
Figure 6.23 – Performance of SBL execution for the epic benchmark.
Figure 6.24 – Performance of SBL execution for the djpeg benchmark.

Djpeg also performed quite well, with an average speedup of 2.3x for the 8-cluster architecture. The djpeg benchmark is particularly interesting in that it was the only benchmark to display a significant performance difference between the dp and smt broadcast loop selection heuristics. There are a couple of final notes regarding these performance results. First, close examination of the results indicates the 1-cluster model usually has 5-10% lower performance than the base un-clustered model. This 5-10% performance degradation indicates the overhead cost of using multi-level if-conversion. Second, while these results were obtained using a 32 KBE LMC cache, a regression that evaluated smaller LMC cache sizes found that sizes of 4-8 KBE (i.e. monitoring 4-8 KB of memory) typically have performance within 5% of the 32 KBE cache, so they are sufficient for most applications. Epic is the only major exception to this trend, as it requires a full 32 KBE LMC cache: performance immediately drops by 15% for a 16 KBE cache, and progressively further for smaller LMC cache sizes. Overall, we can conclude that the SBL method worked very well on the MediaBench applications. The primary problem arose from the fact that profiling and register dependence analysis were unable to recognize higher degrees of parallelism. The typical parallelism was only about 40-50%, so the average speedup was about 2x on 8 and 16-cluster processors. While this is lower than expected, this performance is quite good considering that the overall IPC from combining data parallelism with ILP well exceeds the typical IPC from ILP alone. When designing the TM-2000, TriMedia did not increase its issue width over that used in the TM-1000 because they were unable to find any additional ILP in their applications [46]. Consequently, a method that achieves 2x speedup over ILP performance provides a respectable gain. Regardless, we still anticipate that much higher speedups can be achieved with a parallelizing compiler. A third experiment performed
below will model the performance obtainable with a parallelizing compiler by manually applying parallel compiler optimizations and measuring the resulting performance.

SBL Results on Multi-Cluster with Global Cache

In the second experiment we examine the performance of SBL execution on a multi-cluster architecture with a real memory hierarchy. This experiment assumes a global cache memory hierarchy with the same parameters as defined for the base processor in Chapter 4, except that the L1 data cache is changed to 4-way set associative and uses a write-allocate protocol. As indicated in Chapter 4, a highly banked global L1 data cache is only expected to be realizable for up to 4 or 8 parallel memory accesses, so distributed caches are expected for more than 4 clusters. Regardless, using a global cache for this experiment will still effectively model the effect of the SBL method upon memory performance.
As determined in Chapter 5, the primary problems with memory hierarchies in media processors are external memory latency and external memory bandwidth. It will be interesting to note what effect the extra memory pressure will have upon overall performance. Simulating the SBL method with a distributed cache is left as an experiment for future research. The average results of the multi-cluster architecture simulations with a cache memory hierarchy are excellent. As displayed in Figure 6.25, the results of simulations with cache memory are within 20-25% of those using perfect cache memory. The extra parallel memory accesses during broadcast loops do not appear to be putting undue pressure on the memory system. Additionally, the extra cost of writing data between the L2 cache and the speculative L1 data cache does not appear to impact performance significantly. This memory hierarchy performance bodes very well for the SBL method.
Figure 6.25 – Average performance of SBL method on MediaBench; uses classical compilation and compares the dp and smt selection heuristics.

SBL Results with Manual Parallel Optimizations

To better understand why profiling and register dependence analysis were unable to identify greater degrees of parallelism, we took a close look at the code for the MediaBench applications. It was found that many of the benchmarks employ “faster” versions of critical signal processing functions such as the DCT, and these optimized versions are not easily parallelizable. Also, function calls often prevented parallelism. For example, in mpeg2dec, the motion compensation routine, which accounts for 32% of execution time, is not parallelizable because of a single system call. However, while much of the code is not parallelizable in its current form, we found that parallel compiler optimizations should be able to reorganize many of these sections of code and achieve greater parallelism. Consequently, it appears that the expertise of loop-level parallelizing compilers will be necessary to achieve better performance. To measure the potential from using a parallelizing compiler, a third experiment was performed that hand-parallelized critical loops, as a parallel compiler would, within a
few of the MediaBench applications. This was done for two pairs of applications: the GSM applications, gsmdecode and gsmencode, and the MPEG-2 applications, mpeg2dec and mpeg2enc. The parallel optimizations used in manually parallelizing critical loops in these applications included inlining, privatization, reduction methods, and loop splitting. The results for these experiments are shown in Figures 6.26-6.29. The results indicate a clear increase in performance from using parallel compiler optimizations. The average speedup for 8 and 16-cluster architectures jumps from 1.3x to over 1.8x, nearly a three-fold increase in the speedup gained from SBL. This is because use of the parallel optimizations increased the available parallelism in the GSM applications from 20% to about 60%, and increased the parallelism in the MPEG-2 applications from an average of 30-40% to about 75%.
These performance increases are a result of parallelizing only one or two of the critical loops in each of these applications. Using a parallel optimizing compiler to fully parallelize the entire application should increase performance even further. Two notes are in order with respect to the parallel results in Figures 6.26-6.29. First, the results for the manually parallelized code tend to be lower than the performance of the original code for the 1-cluster and 2-cluster architectures, because some of the optimizations, such as loop splitting, may increase the number of sequential operations that need to be executed. Also, the jump in performance for the mpeg2enc application is due to inlining a critical procedure that IMPACT chose not to automatically inline because of its considerable size.
However, the function is executed so frequently that inlining it significantly improves performance even for regular sequential operation.
Figure 6.26 – Comparison of original and manually parallelized code for gsmdecode using the smt heuristic.
Figure 6.27 – Comparison of original and manually parallelized code for gsmencode using the smt heuristic.
Figure 6.28 – Comparison of original and manually parallelized code for mpeg2dec using the smt heuristic.
Figure 6.29 – Comparison of original and manually parallelized code for mpeg2enc using the smt heuristic.

From these three experiments, we can see that using the SBL method on multi-cluster architectures can significantly improve performance, particularly on the 8 and 16-cluster architectures. The SBL method was found to work quite well within parallel code regions, but the limitation on achieving even better speedups was the inability of profiling and register dependence analysis to find greater degrees of parallelism. Consequently, we conclude that a parallelizing compiler will be needed to achieve maximum performance.
6.3 Summary

The increasing computational complexity of multimedia applications continually demands more computing power from media processors. To meet these demands, media processors are gradually moving to wider architectures, which place greater demands on compilers for increased parallelism. An examination was made of the various levels of parallelism in multimedia applications. ILP and subword parallelism were identified as the two most common methods for achieving parallelism in multimedia, while data parallelism is currently very underutilized by media processing.
However, studies indicate enormous opportunities for increased performance with data parallelism. Consequently, we propose extending media processor architectures to support data parallelism. The second half of this chapter examined the architecture and compiler methods necessary for supporting data parallelism in media processors. Because data parallelism is closely associated with loop-based parallelism, the necessary loop-level compiler methods already exist.
However, because these compilers are limited in their ability to recognize all parallel loops, we propose using run-time methods to enable higher degrees of parallelism. A new run-time method, Speculative Broadcast Loop (SBL) execution, is proposed that is capable of speculatively executing both fully parallel and partially parallel loops. This technique uses SIMD processing to broadcast loop iterations for parallel execution across a wide-issue multi-clustered architecture. Speculative support is provided to enable proper recovery from memory conflicts, control flow misspeculations, and full loop iteration misspeculations.
The results indicate excellent parallelism is achieved within parallel regions of MediaBench, but the speedup for full applications was not as high as expected because profiling and register dependence analysis were unable to find more than 40-50% parallelism on average in the multimedia applications. However, experiments indicate that the use of a parallel compiler in conjunction with SBL execution can offer significantly greater parallelism in media processors.
Chapter 7. Parallel Media Processor
The last few chapters have provided a thorough evaluation of the architecture and compiler issues in media processor design. This chapter shall take the results of the evaluations and generate a proposal for a future media processor based on a multi-cluster architecture with support for Speculative Broadcast Loop (SBL) execution.
7.1 Basic Multi-Cluster Organization

Close examination of the multi-cluster broadcast loop simulation results in the last chapter indicates that for many applications, performance did not increase as substantially between the 8-cluster and 16-cluster models as it did for the smaller cluster counts. Furthermore, the loop statistics in Chapter 3 indicate that the average number of iterations per loop is around 10 for most applications, so we expect a multi-cluster architecture with 8 clusters will work well. We assign 4 issue slots to each cluster. This corresponds well with the results in Chapter 4, which indicated there is minimal performance improvement when using more than 4 issue slots.
Additionally, because of the enormous performance degradation caused by delayed bypassing (as found in Chapter 4), it is preferable to use full, immediate bypassing, which is usually only possible for up to 4 issue slots. Assigning 3 issue slots per cluster may also be a reasonable design alternative, but this would require additional performance verification, so we shall assume 4 for this design proposal. The datapath for this media processor will be 32 bits wide, and will include support for subword parallelism as well. According to the data type and size statistics found in Chapter 3, 8-bit data types are used nearly 40% of the time, and 16-bit or smaller data types are used nearly 70% of the time. Because compiler research for subword parallelism is expected to eventually realize effective compilation methods, a 32-bit datapath will provide some opportunity for supporting subword parallelism.
7.2 Functional Requirements

Chapter 3 presented a workload evaluation of the MediaBench benchmark suite. Among the various characteristics examined were operation frequencies, data types and sizes, and memory characteristics such as working set size and spatial locality. From the results it was concluded that media processors should provide a particular ratio of functional units. The recommended ratio was 4 ALUs to 2 memory units, 1 branch unit, 1 floating-point unit, 1 shifter, and 1 multiplier. This ratio of functional units shall provide the functional make-up of each cluster in our homogeneous multi-cluster architecture.
7.3 Instruction Control Stream

Evaluation of the instruction memory characteristics in Chapter 3 indicated typical instruction memory working set sizes of 8 KB. To effectively support code-expanding compilation methods, allowances should be provided for larger working sets, so a direct-mapped 16 KB instruction cache is used. Spatial locality is very high in instruction memory, so a line size of 256 bytes will be used. Because only one instruction control path exists in a multi-cluster architecture, the instruction cache is global and feeds instructions to all clusters in the processor. Finally, while Chapter 4 found only moderate benefits from using an aggressive fetch engine, we expect aggressive fetch mechanisms will be beneficial for broadcast loops, so a small instruction buffer is used to provide a modicum of decoupled fetch/execute support. Chapter 4 evaluated a number of architecture features, including dynamic branch prediction. While static branch prediction was found to be quite good on multimedia applications in Chapter 3, small dynamic branch predictors with 512 entries offer miss rates nearly two times lower, so a small dynamic branch predictor will be incorporated into the media processor.
7.4 Memory Hierarchy

Chapter 3 also found that data memory has an average working set size of 32 KB and relatively good spatial locality. Ideally, a single global memory would be preferable, but on a wide-issue multi-cluster architecture with numerous parallel memory accesses per cycle, it is expected that the data cache will need to be distributed. It is difficult to say precisely how much data cache is needed in each cluster, because the working set size does not necessarily scale with the number of processors. The broadcast loop simulations on multi-cluster architectures in the last chapter exhibited very good memory performance on an 8-cluster machine, so we suspect 8-16 KB is sufficient per local cache. In the absence of further information, we shall assume a 16 KB data cache. Since spatial locality is good in multimedia applications, a line size of 64 bytes will be used. To support speculative broadcast loop execution, it will be necessary to adopt a write-allocate policy and provide some set associativity. Consequently, each cluster has its own local 4-way set-associative 16 KB cache, which uses a write-allocate policy. To minimize the effects of long external memory latencies and take advantage of the streaming nature of multimedia data, a small prefetch memory structure, such as a stream buffer or stride prediction table, will be used in conjunction with the data cache. As found in Chapter 5, however, external memory bandwidth is one of the two most significant problems in media processor memory hierarchies. To prevent the memory prefetching from becoming overly aggressive and overloading the external bus bandwidth, a relatively small prefetch memory structure is used on-chip. We propose that a larger, more aggressive off-chip memory prefetch unit may provide improved memory performance without overloading the external memory bandwidth. To support speculation in the L1 data cache, both a loop memory conflict (LMC) cache and an on-chip L2 cache are necessary. The L2 cache will be a unified, 4-way set-associative, 256 KB on-chip cache. The LMC cache will be distributed, with a local copy in each cluster. We expect a 4-way set-associative 4 KBE (4 KB entry) LMC cache will be sufficient in each cluster.
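For reference, the cache parameters proposed in Sections 7.3 and 7.4 can be collected in one place. The struct below is merely a summary device (names ours); the L2 line size is not specified in the text and is left at zero.

    typedef struct {
        int size_kb;
        int line_bytes;
        int assoc;       /* 1 = direct-mapped */
    } CacheCfg;

    typedef struct {
        CacheCfg l1i;    /* global instruction cache */
        CacheCfg l1d;    /* per-cluster data cache, write-allocate */
        CacheCfg l2;     /* shared, unified on-chip L2 */
        CacheCfg lmc;    /* per-cluster LMC cache; size = KB of memory monitored */
    } MemHierarchy;

    static const MemHierarchy proposed = {
        .l1i = { 16,  256, 1 },   /* direct-mapped, 256-byte lines */
        .l1d = { 16,  64,  4 },   /* 4-way, 64-byte lines */
        .l2  = { 256, 0,   4 },   /* line size not given in the text */
        .lmc = { 4,   0,   4 },   /* 4 KBE, 4-way; line size not applicable */
    };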
7.5 Static vs. Dynamic

Chapter 4 performed a thorough evaluation of static versus dynamic methods and found that dynamic out-of-order scheduling usually provides 60-80% better performance on MediaBench. However, having 8 clusters each providing out-of-order scheduling would be prohibitively expensive in terms of cost, area, complexity, and power. On a multi-cluster architecture supporting Speculative Broadcast Loop (SBL) execution, dynamic scheduling would be most effective in speeding up the sequential sections of the program. Consequently, we propose making one cluster (the Sequential Master cluster) an out-of-order superscalar core and using static scheduling on all other clusters. This will enable the benefit of dynamic scheduling on multimedia code segments that do not contain significant parallelism.
7.6 Compilation Methods

Effective compiler support for parallel media processors requires a full range of optimizations. ILP and subword parallelism compilation methods are necessary for achieving maximum IPC within each cluster, while loop-based parallel methods are necessary for maximizing parallelism between clusters. Additionally, if there is a desire to schedule ILP across clusters, cluster scheduling methods will be required as well. Predication is necessary for supporting multi-level if-conversion, so compilation methods involving predication, such as the hyperblock optimization, may be used as well.
7.7 Summary

This chapter combined the conclusions from each of the previous chapters to generate a design for a potential parallel media processor. While this design is structurally similar to the media processor proposed in Section 4.2.2, there are a significant number of variations between the two. Two major differences are the support for Speculative Broadcast Loop execution and an out-of-order superscalar core for one of the clusters. Additionally, the processor supports dynamic branch prediction, an instruction buffer, and subword parallelism. Whereas only ILP could be supported in the initial multi-cluster architecture, this improved parallel media processor will be able to accommodate ILP, subword parallelism, and data parallelism. The improved design offers significant performance improvement over the previous implementation. The proposed parallel media processor is shown in Figure 7.1.
Figure 7.1 – Proposal for a multi-cluster parallel media processor.
Chapter 8. Conclusions and Future Directions
Multimedia dominates a significant portion of the computing industry. In the near future, multimedia workloads are expected to occupy nearly 90% of the computing cycles in most personal computers, and this share is still growing. As multimedia continues to grow, the demand for newer, more advanced, and more diverse applications will also continue to grow. To support these increasing demands, new methods for media processing will become necessary.
Because of their computationally intensive nature, many multimedia applications are already beyond the abilities of general-purpose microprocessors, and the future of multimedia promises significantly greater processing demands. Application-specific processors currently provide a cost-effective method for supporting many multimedia applications, but as applications continue to become more advanced, and new representations such as MPEG-4 and MPEG-7 are introduced, greater flexibility is required than these processors can support. The future of multimedia will require processors that can provide both considerable flexibility and significant computing power. Programmable media processors offer one such solution for the next generation of media processing. In this thesis we present a thorough evaluation of many of the architecture and compiler issues in media processor design. Media processor design is still a relatively immature field. As such, there is not a significant amount of research that has been done
in the field. Unlike general-purpose processors, where the research community understands well how they work and what architectures, architecture features, and compiler features produce the best results, there is no similar knowledge base in media processing. To promote the development of that knowledge base in media processor design, this thesis first uses the MediaBench benchmark suite to characterize multimedia applications, then performs an architecture evaluation that examines the performance of many fundamental architecture features on multimedia applications, and finally examines the parallelism in multimedia to determine what compilation methods are best suited for extracting parallelism in multimedia applications. This thesis found data parallelism to be a currently underexploited means of achieving greater parallelism, and presents the Speculative Broadcast Loop run-time method for supporting data parallelism in multimedia.
8.1 Thesis Contributions

Contained in this section is a summary of the primary contributions of this thesis to the research community:
8.1.1 Comprehensive Evaluation of Multimedia Characteristics

To design a processor for a specific application area, it is necessary to have a thorough understanding of the characteristics of that processing field.
Within multimedia, some key features such as streaming data, small data types, and processing regularity were understood, but there was neither a quantitative understanding of these traits nor an awareness of the more detailed characteristics of multimedia applications. This thesis performed a compiler-driven workload evaluation to quantitatively evaluate application features of the MediaBench multimedia benchmark suite. Included among the features examined were operation frequencies, basic block and branch statistics, data types and sizes, instruction and data memory working set sizes and spatial locality, loop statistics, and path complexity.
8.1.2 Static vs. Dynamic Architecture Evaluation

Existing media processors have almost exclusively used static VLIW or DSP architectures. While static architectures offer benefits such as lower cost and power, many dynamic architecture features are conducive to media processing as well. This thesis performed a thorough architecture evaluation that analyzed the performance of many fundamental architecture features on media processors. The evaluation compared three processor architectures spanning the full range of static and dynamic architectures. Each of these architectures was evaluated using a variety of architecture and compiler methods and features, including different issue widths, various compiler methods, aggressive and conservative fetch mechanisms, various dynamic branch prediction methods, different pipeline lengths, compressed versus uncompressed explicitly parallel instruction formats, and high frequency effects such as longer operation latencies and delayed bypassing.
8.1.3 Cache Memory Hierarchy Evaluation

An investigation was performed to determine the bottlenecks of media processor memory hierarchies. Earlier studies had primarily focused either on streaming memory structures for multimedia or on a single level of the memory hierarchy. This thesis examined a full memory hierarchy, including L1 cache, L2 cache, and external memory, to determine the effects of various memory parameters upon media processors. It was found that external memory latency and external memory bandwidth are the primary problems in media processor memory hierarchies.
8.1.4 Investigation of the Parallelism in Multimedia

Prior research studies had established the existence of significant parallelism in multimedia applications, but it was uncertain to what degree parallelism existed at the different levels. This thesis categorized multimedia parallelism into four levels: instruction level parallelism (ILP), subword parallelism, data parallelism, and task parallelism. Example studies are presented indicating the approximate degree of parallelism available at each of these levels. Particular focus was paid to instruction level parallelism and data parallelism. It was initially believed that instruction level parallelism would provide significantly greater degrees of parallelism than it actually does; these preliminary investigations indicate there is no more ILP in multimedia applications than in general-purpose applications. Data parallelism is identified as the currently most underutilized level of parallelism, but also the level of parallelism that provides the greatest opportunities for improved performance.
8.1.5 Speculative Broadcast Loop (SBL) Execution

To support the execution of data parallelism on a multi-cluster media processor architecture, the Speculative Broadcast Loop (SBL) method is presented as a new speculative run-time method for supporting data parallelism in multimedia. This method broadcasts separate loop iterations for SIMD processing across a wide-issue clustered architecture. Because existing parallel compilation techniques are often unable to accurately recognize all the parallel loops in an application, a speculative run-time method was developed to enable speculative execution of both fully parallel and partially parallel loops. This method is able to parallelize outer loop levels as well as inner loop levels.
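To make the execution model concrete, the following is a minimal sequential sketch of SBL-style execution, written for exposition (the names and the scalar modeling are illustrative assumptions, not the thesis implementation): a group of iterations executes speculatively into private storage, is checked for cross-iteration conflicts, and is either committed or re-executed serially.

    #define N        16
    #define CLUSTERS 4    /* iterations broadcast per group, one per cluster */

    static int a[N + CLUSTERS];   /* data updated by the loop        */
    static int idx[N];            /* indices unknown at compile time */

    /* The loop being parallelized: a[i] += a[idx[i]].  Whether two
     * iterations conflict depends on idx[], so the compiler cannot prove
     * the loop parallel -- the situation SBL targets. */
    static void iteration(int i) { a[i] += a[idx[i]]; }

    static void sbl_loop(void)
    {
        for (int base = 0; base < N; base += CLUSTERS) {
            int shadow[CLUSTERS];
            int group = (N - base < CLUSTERS) ? N - base : CLUSTERS;
            int conflict = 0;

            /* Speculative phase: each "cluster" executes one iteration,
             * buffering its result instead of committing it. */
            for (int c = 0; c < group; c++)
                shadow[c] = a[base + c] + a[idx[base + c]];

            /* Conflict check: iteration base+c read a[idx[base+c]]; if an
             * earlier iteration of the group writes that location, the
             * sequential semantics were violated. */
            for (int c = 1; c < group; c++)
                for (int e = 0; e < c; e++)
                    if (idx[base + c] == base + e)
                        conflict = 1;

            if (!conflict)
                for (int c = 0; c < group; c++)   /* commit */
                    a[base + c] = shadow[c];
            else
                for (int c = 0; c < group; c++)   /* recover: redo serially */
                    iteration(base + c);
        }
    }

On the actual architecture the group would execute concurrently, one iteration per cluster, with the conflict check performed by the hardware described in Section 8.1.7.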
8.1.6 Multi-Level If-Conversion (MLIC)

To support the SBL execution method described above, a new scheduling method was necessary for maximizing the SIMD parallelism in parallel loops. The speculative run-time method uses SIMD processing, so it is necessary to eliminate all unnecessary branches by combining all loop paths into a single control path. Normal if-conversion is able to eliminate all branches in an inner loop, but it was also desired that the SBL method support SIMD parallelism on outer loops. Because outer loops contain nested loops, it is not possible to eliminate all branches within the loop body. However, it is possible to eliminate all branches except the loop back-edge and loop bypass branches of those nested loops. We refer to this method as Multi-Level If-Conversion (MLIC). A sketch of the transformation is shown below.
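The following hypothetical example (written here for illustration, not taken from the thesis benchmarks) shows the effect on an outer loop containing a nested loop: the data-dependent branch becomes a predicate, so every outer-loop iteration follows a single control path, and only the nested loop's bypass and back-edge branches remain.

    /* Original outer-loop body:
     *
     *     for (i = 0; i < n; i++) {
     *         if (x[i] > 0)                  -- data-dependent branch
     *             for (j = 0; j < m; j++)    -- nested loop
     *                 s += x[i] * w[j];
     *         else
     *             s -= x[i];
     *     }
     *
     * After multi-level if-conversion: */
    double dot_or_penalty(const double *x, const double *w, int n, int m)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            int p = (x[i] > 0.0);              /* predicate define             */
            for (int j = 0; p && j < m; j++)   /* loop bypass + back-edge only */
                s += x[i] * w[j];
            s -= p ? 0.0 : x[i];               /* predicated else-path update  */
        }
        return s;
    }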
8.1.7 Dynamic Memory Conflict Checking

For speculative execution of parallel loop iterations, it is necessary to have a method for checking for memory conflicts between separate iterations. This thesis proposed new methods for storing the memory access information and dynamically checking for memory conflicts between loop iterations speculatively executing in parallel. The Loop Memory Conflict (LMC) cache monitors the read and write accesses to every byte of memory during SBL execution. A memory conflict checker compares the LMC's memory access information between loop iterations to check for occurrences of memory conflicts. Both global and distributed designs for the LMC cache and memory conflict checker were proposed.
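As a rough software model of the idea (a sketch under assumed parameters: a global design, one read bit and one write bit per byte of a fixed monitored region; the thesis hardware differs in detail), conflict checking between an earlier and a later iteration reduces to bitwise tests over the recorded access maps:

    #include <stdint.h>

    #define MEM_BYTES 4096   /* size of the monitored region (assumed) */

    /* Per-iteration access record: one read bit and one write bit per
     * byte of the monitored region. */
    typedef struct {
        uint8_t rd[MEM_BYTES / 8];
        uint8_t wr[MEM_BYTES / 8];
    } lmc_t;

    static void mark(uint8_t *bits, uint32_t addr)
    {
        bits[addr >> 3] |= (uint8_t)(1u << (addr & 7));
    }

    /* Called for every load/store an iteration performs speculatively. */
    static void lmc_load (lmc_t *m, uint32_t addr) { mark(m->rd, addr); }
    static void lmc_store(lmc_t *m, uint32_t addr) { mark(m->wr, addr); }

    /* Conflict check between an earlier iteration u and a later iteration
     * v: any byte written by u and touched by v, or read by u and written
     * by v, is a cross-iteration conflict, so v's results are squashed. */
    static int lmc_conflict(const lmc_t *u, const lmc_t *v)
    {
        for (int b = 0; b < MEM_BYTES / 8; b++)
            if ((u->wr[b] & (v->rd[b] | v->wr[b])) || (u->rd[b] & v->wr[b]))
                return 1;
        return 0;
    }

The check above is deliberately conservative; a design that buffers speculative stores and commits them in iteration order would only need to squash on true (write-then-read) violations.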
8.2 Future Work

Because media processor design is a relatively new field, there is an enormous amount of work still to be done. Some suggestions are presented here.
8.2.1 Multi-Level Prefetch Hierarchy

We alluded to the idea of a multi-level memory prefetch hierarchy in Chapter 7. The evaluation of the cache memory hierarchy found that external memory latency and external memory bandwidth are the two major bottlenecks of a media processor's memory hierarchy. Prefetching has been identified as an effective method for mitigating the external memory latency problem, since it fetches data before it is known whether that data is actually needed. However, because external memory bandwidth is also a problem, an overly aggressive on-chip prefetch engine should be avoided, since aggressive prefetching may overload the external memory bandwidth. We propose a two-level prefetch hierarchy, which uses conservative prefetching on-chip and provides a more aggressive prefetching engine off-chip, as sketched below.
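A minimal sketch of the proposed policy split, with assumed parameters (64-byte lines, a fixed lookahead depth) chosen purely for illustration:

    #include <stdint.h>

    #define LINE  64u   /* cache line size in bytes (assumed) */
    #define DEPTH 8     /* off-chip lookahead, in lines (assumed) */

    /* Conservative on-chip policy: on a demand miss, prefetch only the
     * next sequential line, keeping extra external-bus traffic small. */
    static uint64_t onchip_prefetch(uint64_t miss_addr)
    {
        return (miss_addr & ~(uint64_t)(LINE - 1)) + LINE;
    }

    /* Aggressive off-chip policy: once a stride is seen between
     * successive misses, run several lines ahead of the demand stream
     * into an off-chip prefetch buffer, where mispredicted prefetches do
     * not consume the processor's pin bandwidth. */
    static int offchip_prefetch(uint64_t miss_addr, uint64_t prev_miss,
                                uint64_t out[DEPTH])
    {
        int64_t stride = (int64_t)(miss_addr - prev_miss);
        if (stride == 0)
            return 0;                 /* no pattern yet -- stay idle */
        for (int k = 1; k <= DEPTH; k++)
            out[k - 1] = miss_addr + (uint64_t)((int64_t)k * stride);
        return DEPTH;
    }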
8.2.2 Combine Parallel Compiler with Speculative Broadcast Loop

The results from the broadcast loop simulations were lower than expected because profiling and register dependence analysis were unable to identify greater levels of parallelism. If a parallel compiler (such as SUIF or Polaris) were used as a front end both to identify parallelism (and potential parallelism) in loops and to transform those loops for SBL execution, performance could be significantly higher.
8.2.3 Single-Chip Multiprocessors for Media Processing

Instead of simply using SIMD processing on a multi-cluster architecture to support data parallelism in multimedia, it would be interesting to examine the performance gains from using a single-chip multiprocessor. We do not believe SIMD processing significantly hinders performance on the multi-cluster architecture, but in addition to allowing separate instruction control streams, a single-chip multiprocessor would also enable support for parallel loops containing function calls.
8.2.4 Extend Multi-Level If-Conversion to Subword Parallelism

If-conversion is essentially a method for packing multiple paths into a single path, whereas subword parallelism is a method for packing multiple operations into a single operation. If combined with some parallel loop compilation methods, there may be a way to retarget multi-level if-conversion for use with subword parallelism; the sketch below illustrates the operation packing involved.
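For reference, the kind of operation packing that subword parallelism performs can be written in plain C as a SWAR ("SIMD within a register") operation; this illustrative sketch packs four 8-bit additions into one 32-bit add:

    #include <stdint.h>

    /* SWAR sketch: the low seven bits of each byte are added directly;
     * the top bits are fixed up with XOR so carries never cross byte
     * boundaries (wrapping, non-saturating addition). */
    static uint32_t add_bytes(uint32_t a, uint32_t b)
    {
        uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
        return low7 ^ ((a ^ b) & 0x80808080u);
    }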
8.2.5 Evaluating DSP Features for Media Processing

This thesis explored the design space for media processing primarily using general-purpose processing design tools. Consequently, it was not possible to consider many architectural or compiler features from the DSP arena. It would be pertinent to also examine media processing from the perspective of the DSP design space.
Appendix A. Architecture Performance by Application
[Each figure in this appendix plotted IPC (y-axis) against compilation method (x-axis: Classical, Superscalar, Hyperblock) for six configurations: VLIW, in-order superscalar, and out-of-order superscalar, each with realistic and with perfect caches. Figure captions:]

Figure A.1 – Comparison of performance of three processor models on cjpeg
Figure A.2 – Comparison of performance of three processor models on djpeg
Figure A.3 – Comparison of performance of three processor models on epic
Figure A.4 – Comparison of performance of three processor models on gs
Figure A.5 – Comparison of performance of three processor models on g721dec
Figure A.6 – Comparison of performance of three processor models on g721enc
Figure A.7 – Comparison of performance of three processor models on gsmdec
Figure A.8 – Comparison of performance of three processor models on gsmenc
Figure A.9 – Comparison of performance of three processor models on h263dec
Figure A.10 – Comparison of performance of three processor models on h263enc
Figure A.11 – Comparison of performance of three processor models on mipmap
Figure A.12 – Comparison of performance of three processor models on mpeg2dec
Figure A.13 – Comparison of performance of three processor models on mpeg2enc
Figure A.14 – Comparison of performance of three processor models on mpeg4dec
Figure A.15 – Comparison of performance of three processor models on pegwitdec
Figure A.16 – Comparison of performance of three processor models on pegwitenc
Figure A.17 – Comparison of performance of three processor models on pgpdecode
Figure A.18 – Comparison of performance of three processor models on rasta
Figure A.19 – Comparison of performance of three processor models on rawcaudio
Figure A.20 – Comparison of performance of three processor models on rawdaudio
Figure A.21 – Comparison of performance of three processor models on texgen
Figure A.22 – Comparison of performance of three processor models on unepic
Appendix B. Video Signal Processing Kernels
B.1. Straightforward DCT

for (m = 0; m < 8; m++)
    for (n = 0; n < 8; n++) {
        y[m][n] = 0.0;
        for (i = 0; i < 8; i++)
            for (j = 0; j < 8; j++)
                y[m][n] += x[i][j] * c[i][m] * c[j][n];
    }
B.2. Row-Column DCT

Notes:
    Computes the 2-D DCT as 1-D transforms along the rows followed by 1-D
    transforms along the columns, reducing the work per 8 x 8 block from
    8192 multiplies (as in B.1) to 1024.

Code:
for (n = 0; n < 8; n++)
    for (i = 0; i < 8; i++) {
        z[i][n] = 0.0;
        for (j = 0; j < 8; j++)
            z[i][n] += x[i][j] * c[j][n];
    }
for (m = 0; m < 8; m++)
    for (n = 0; n < 8; n++) {
        y[m][n] = 0.0;
        for (i = 0; i < 8; i++)
            y[m][n] += z[i][n] * c[i][m];
    }
B.3. Full Search Motion Estimation

Variable definitions:
    best_x = x coordinate of the best matching motion vector
    best_y = y coordinate of the best matching motion vector
    sum    = absolute difference between block and reference image
    min    = minimum absolute difference (at position of motion vector (x,y))

Code:
min = INT_MAX;   /* initialize the running minimum (requires <limits.h>) */
for (m = -8; m < 7; m++)
    for (n = -8; n < 7; n++) {
        sum = 0;
        for (i = 0; i < 15; i++)
            for (j = 0; j < 15; j++)
                sum += abs(a[i][j] - b[i-m][j-n]);
        if (sum < min) {
            min = sum;
            best_x = m;
            best_y = n;
        }
    }
B.4. Color Space Conversion

Notes:
    Conversion is from 4:4:4 RGB to 4:2:0 YCrCb.
    Conversion is done on 8 x 8 blocks.

Code:
for (i = 0; i < 8; i++) {
    for (j = 0; j < 8; j++) {
        y[i][j] = 0.299 * (r[i][j] - g[i][j]) + g[i][j]
                + 0.114 * (b[i][j] - g[i][j]);
        by[i][j] = b[i][j] - y[i][j];
        ry[i][j] = r[i][j] - y[i][j];
    }
}
for (i = 0; i < 8; i += 2) {
    for (j = 0; j < 8; j += 2) {
        divider = 3;
        if (i == 0) {
            top_by = 0;
            top_ry = 0;
            divider -= 0.5;
        }
        else {
            top_by = 0.5 * by[i - 1][j];
            top_ry = 0.5 * ry[i - 1][j];
        }
        if (j == 0) {
            left_by = 0;
            left_ry = 0;
            divider -= 0.5;
        }
        else {
            left_by = 0.5 * by[i][j - 1];
            left_ry = 0.5 * ry[i][j - 1];
        }
        bot_by = 0.5 * by[i + 1][j];
        bot_ry = 0.5 * ry[i + 1][j];
        right_by = 0.5 * by[i][j + 1];
        right_ry = 0.5 * ry[i][j + 1];
        cur_by = by[i][j];
        cur_ry = ry[i][j];
        cb[i >> 1][j >> 1] = 0.564 * (cur_by + top_by + left_by + bot_by + right_by) / divider;
        cr[i >> 1][j >> 1] = 0.713 * (cur_ry + top_ry + left_ry + bot_ry + right_ry) / divider;
    }
}
B.5. Variable Length Encoding

Notes:
    Variable length coder as defined by the MPEG-2 video syntax. Case used
    was that of coding non-intra blocks (table B-14).
        - one of the two most common cases
        - one of the two most complex/demanding cases
        - the same methods may be used for all other tables
          (tables B-1 thru B-13, and B-15)
    Assumed zig-zag scan.

Variable definitions:
    word[i] = current partially built 16-bit word in stream
    pos     = last unused bit position in 'word'
    run     = 0;

Code:
/* NOTE: the shift expressions and the escape-sequence construction below
   are reconstructions; the original '<<' operators were lost in text
   extraction.  The escape format follows MPEG-2 table B-14: a 6-bit
   escape code, a 6-bit run, and a 12-bit signed level (24 bits total). */
for (n = 0; n < 64; n++) {
    length1 = 0;   /* length1 determines if escape sequence needed */

    /* determine code and code length */
    if (blk[n] == 0)
        run++;
    else {
        if ((run < 2) && ((level = abs(blk[n])) < 41)) {
            if ((n == 0) && (level == 1)) {
                code1 = 1;
                length1 = 1;
            }
            else {
                code1 = DCTtab1[run][level].code;
                length1 = DCTtab1[run][level].len;
            }
            sign = (blk[n] < 0);
            code2 = sign;
            length2 = 1;
        }
        else if ((run < 32) && (level < 6)) {
            code1 = DCTtab2[run][level].code;
            length1 = DCTtab2[run][level].len;
            sign = (blk[n] < 0);
            code2 = sign;
            length2 = 1;
        }
        if (length1 == 0) {   /* encode with escape sequence */
            code1 = (1 << 18) | (run << 12) | (blk[n] & 0xFFF);
            length1 = 24;
            length2 = 0;      /* sign is included in the 12-bit level */
        }
        run = 0;   /* code found => next run = 0 */

        /* shift code onto outgoing video stream */
        pos -= length1;
        if (pos > 0)
            word[i] = word[i] | (code1 << pos);
        else {
            word[i] = word[i] | (code1 >> -pos);
            i++;
            pos += 16;
            word[i] = 0 | (code1 << pos);
        }

        /* shift sign bit onto outgoing video stream */
        if (length2 > 0) {
            pos -= length2;
            if (pos > 0)
                word[i] = word[i] | (code2 << pos);
            else {
                word[i] = word[i] | (code2 >> -pos);
                i++;
                pos += 16;
                word[i] = 0 | (code2 << pos);
            }
        }
    }
}
/* append the end-of-block code "10" (2 bits) */
pos -= 2;
if (pos > 0)
    word[i] = word[i] | (2 << pos);
else {
    word[i] = word[i] | (2 >> -pos);
    i++;
    pos += 16;
    word[i] = 0 | (2 << pos);
}