Available Instruction-Level Parallelism for Superscalar and Superpipelined Machines

Norman P. Jouppi and David W. Wall
Digital Equipment Corporation, Western Research Lab
Abstract

Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to be roughly equivalent ways of exploiting instruction-level parallelism. A parameterizable code reorganization and simulation system was developed and used to measure instruction-level parallelism for a series of benchmarks. Results of these simulations in the presence of various compiler optimizations are presented. The average degree of superpipelining metric is introduced. Our simulations suggest that this metric is already high for many machines. These machines already exploit all of the instruction-level parallelism available in many non-numeric applications, even without parallel instruction issue or higher degrees of pipelining.

1. Introduction

Computer designers and computer architects have been striving to improve uniprocessor computer performance since the first computer was designed. The most significant advances in uniprocessor performance have come from exploiting advances in implementation technology. Architectural innovations have also played a part, and one of the most significant of these over the last decade has been the rediscovery of RISC architectures. Now that RISC architectures have gained acceptance in both scientific and marketing circles, computer architects have been thinking of new ways to improve uniprocessor performance. Many of these proposals, such as VLIW [12], superscalar, and even relatively old ideas such as vector processing, try to improve computer performance by exploiting instruction-level parallelism. They take advantage of this parallelism by issuing more than one instruction per cycle explicitly (as in VLIW or superscalar machines) or implicitly (as in vector machines). In this paper we will limit ourselves to improving uniprocessor performance, and will not discuss methods of improving application performance by using multiple processors in parallel.

As an example of instruction-level parallelism, consider the two code fragments in Figure 1-1. The three instructions in (a) are independent; there are no data dependencies between them, and in theory they could all be executed in parallel. In contrast, the three instructions in (b) cannot be executed in parallel, because the second instruction uses the result of the first, and the third instruction uses the result of the second.

[Figure 1-1: two code fragments; (a) lists three independent instructions — a Load, an Add, and an FPAdd]
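The contrast between the two fragments can be sketched with a toy scheduler. This is a minimal illustration, not the parameterizable simulation system the paper describes: it models an idealized machine that can issue up to `width` independent instructions per cycle with unit latency, and the register names and instruction encodings are assumptions made here for the example.

```python
# Toy in-order greedy scheduler (illustrative only, not the paper's system).
# Each instruction is a pair (dests, srcs) of register-name lists.

def schedule(instrs, width):
    """Return the number of cycles needed to issue `instrs` in order on a
    machine issuing up to `width` independent instructions per cycle,
    assuming every result is usable one cycle after issue."""
    ready_at = {}          # register name -> cycle its value becomes available
    cycle, issued = 0, 0
    for dsts, srcs in instrs:
        # Earliest cycle at which all source operands are available.
        earliest = max([ready_at.get(r, 0) for r in srcs], default=0)
        if earliest > cycle or issued == width:
            # Stall for the dependence, or start a fresh issue group.
            cycle = max(cycle + 1, earliest)
            issued = 0
        for r in dsts:
            ready_at[r] = cycle + 1   # unit latency: result ready next cycle
        issued += 1
    return cycle + 1

# Fragment (a): three independent instructions (a Load, an Add, an FPAdd).
frag_a = [(["r1"], ["r2"]), (["r3"], ["r3"]), (["f4"], ["f4", "f3"])]
# Fragment (b): each instruction consumes the previous instruction's result.
frag_b = [(["r1"], ["r2"]), (["r3"], ["r1"]), (["r4"], ["r3"])]

print(schedule(frag_a, width=3))  # 1 cycle: all three issue together
print(schedule(frag_b, width=3))  # 3 cycles: the dependence chain serializes
```

On a three-issue machine, fragment (a) completes in a single cycle while fragment (b) still takes three, which is the sense in which the parallelism is a property of the code, not just of the hardware.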