Trident: A Scalable Architecture for Scalar, Vector, and Matrix Operations

Mostafa I. Soliman and Stanislav G. Sedukhin
Graduate School of Computer Science and Engineering
The University of Aizu
Aizu-Wakamatsu City, Fukushima 965-8580, Japan
[email protected]; [email protected]

Abstract

Within a few years it will be possible to integrate a billion transistors on a single chip. At this integration level, we propose using a high-level ISA to express parallelism to hardware instead of spending a huge transistor budget on dynamically extracting it. Since the fundamental data structures for a wide variety of applications are scalar, vector, and matrix, our proposed Trident processor extends the classical vector ISA with matrix operations. The Trident processor consists of a set of parallel vector pipelines (PVPs) combined with a fast in-order scalar core. The PVPs can access both vector and matrix register files to perform vector, matrix, and matrix-vector operations. One key point of our design is the exploitation of up to three levels of data parallelism. Another key point is the ring register files for storing vector and matrix data. The ring structure of the register files reduces the number and size of the address decoders, the number of ports, the area overhead caused by the address bus, and the number of registers attached to bit lines, as well as providing local communication between PVPs. Scaling the Trident processor does not require more fetch, decode, or issue bandwidth; it only requires replicating the PVPs and enlarging the register files. Scientific, engineering, multimedia, and many other applications, which are based on a mixture of scalar, vector, and matrix operations, can be sped up on the Trident processor.

Keywords: Parallel processing, data parallelism, vector/matrix processing, ring register file, scalable hardware.

1 Introduction

Rapid improvements in semiconductor technology fuel processor performance growth: each increase in integration density allows for higher clock rates and offers new opportunities for microarchitectural innovation (Vajapeyam and Valero 2001, Hammond, Nayfeh, and Olukotun 1997). Within a few years it will be possible to integrate a billion transistors on a single chip (Brinkman 1997, Burger and Goodman 1997, Patt, Patel, Evers, Friendly, and Stark 1997). At this integration level, it is necessary to find new processor architectures to use this huge transistor budget efficiently and meet the requirements of future applications.

Copyright © 2002, Australian Computer Society, Inc. This paper appeared at the Seventh Asia-Pacific Computer Systems Architecture Conference (ACSAC'2002), Melbourne, Australia. Conferences in Research and Practice in Information Technology, Vol. 6. Feipei Lai and John Morris, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Traditionally, additional transistors have been used to improve processor performance by exploiting one or more forms of parallelism to perform more work per clock cycle. Instruction-level parallelism (ILP), thread-level parallelism (TLP), and data parallelism (DP) are the three major forms of parallelism. These forms of parallelism are not mutually exclusive and can be combined in one computer. We briefly discuss the superscalar processor as an ILP example and the single chip multiprocessor as a TLP example before introducing our processor design, which exploits a significant amount of DP.

Superscalar processors are capable of executing more than one instruction in parallel by exploiting ILP (Jouppi and Wall 1989, Smith and Sohi 1995). A significant portion of the die for most commercial superscalar processors is used by logic which searches for independent operations to execute in parallel. In contrast, the die area used to actually execute operations is relatively small (Lee and DeVries 1997). Superscalar processor performance is improved not only by trying to fetch and decode more instructions per cycle, but also by using wider out-of-order issue. However, the cost of issuing multiple instructions per cycle grows at least quadratically with issue width, and the required circuitry may soon limit the clock frequencies of superscalar processors (Palacharla, Jouppi, and Smith 1997). Moreover, research on improving superscalar performance suggests that superscalar processors wider than 4-issue may not be the most effective technique for exploiting ILP and using chip resources (Lee and DeVries 1997).

Implementing more than one processor on the same chip offers performance advantages over wide-issue superscalar processors (Hammond, Hubbert, Siu, Prabhu, Chen, and Olukotun 2000, Krishnan and Torrellas 1999). These single chip multiprocessors offer high performance on single applications by exploiting loop-level parallelism and provide high throughput and low interactive response time on multiprogramming workloads (Nayfeh, Hammond, and Olukotun 1996). Although a tightly integrated single chip multiprocessor has low interprocessor communication delays (for a relatively small number of processors), programs must still lay data out carefully in memory to avoid conflicts between processors, minimize data communication between processors, and express synchronization at any point in a program where processors may actively share data. To

Instruction                   Latency   Throughput
movaps xmm0, [Vec1]           4         1/2 cycles
mulps  xmm0, [Vec2]           5         1/2 cycles
movaps xmm1, [Vec1+16]        4         1/2 cycles
mulps  xmm1, [Vec2+16]        5         1/2 cycles
addps  xmm0, xmm1             4         1/2 cycles
movaps xmm1, xmm0             1         1/1 cycles
shufps xmm1, xmm1, 4Eh        2         1/2 cycles
addps  xmm0, xmm1             4         1/2 cycles
movaps xmm1, xmm0             1         1/1 cycles
shufps xmm1, xmm1, 11h        2         1/2 cycles
addps  xmm0, xmm1             4         1/2 cycles
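The SSE sequence above corresponds to the dot product of two 8-element single-precision vectors (assuming Vec1 and Vec2 each hold eight packed floats): the two mulps instructions form the packed products, the first addps combines the two partial-product vectors, and the shufps/addps pairs perform the horizontal reduction, leaving the total in the low element of xmm0. A minimal scalar C sketch of the same computation is given below; dot8 and the test values are illustrative names and data, not taken from the paper.

    #include <stdio.h>

    /* Scalar equivalent of the SSE listing above: dot product of two
       8-element single-precision vectors. */
    static float dot8(const float Vec1[8], const float Vec2[8])
    {
        float sum = 0.0f;
        for (int i = 0; i < 8; i++)
            sum += Vec1[i] * Vec2[i];  /* mulps/addps work plus the final
                                          shufps/addps horizontal reduction */
        return sum;
    }

    int main(void)
    {
        float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        printf("dot = %f\n", dot8(x, y));  /* 1*8 + 2*7 + ... + 8*1 = 120 */
        return 0;
    }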

Instr. Set    Example
Scalar        Addition: z=x+y;
              Addition: for(i=0;i
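The table fragment above contrasts how the same addition is expressed at different levels of the instruction set; the second row is truncated in the extracted text. A minimal C sketch of the three cases the paper targets (scalar, vector, and element-wise matrix addition) is given below; N, the function names, and the matrix case are illustrative assumptions rather than code from the paper, and the vector loop is the form the truncated for(i=0;i example presumably expands to.

    #define N 8  /* illustrative problem size */

    /* Scalar addition: one result element per operation. */
    void scalar_add(float *z, float x, float y)
    {
        *z = x + y;
    }

    /* Vector addition: one loop over the elements of x, y, and z. */
    void vector_add(float z[N], const float x[N], const float y[N])
    {
        for (int i = 0; i < N; i++)
            z[i] = x[i] + y[i];
    }

    /* Element-wise matrix addition: the two-dimensional analogue covered
       by Trident's matrix operations. */
    void matrix_add(float Z[N][N], float X[N][N], float Y[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                Z[i][j] = X[i][j] + Y[i][j];
    }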
