Hardware Prediction for Data Coherency of Scientific Codes on DSM 

JT. Acquaviva, CEA/DAM, French Atomic Energy Commission
W. Jalby, PRiSM Lab., Versailles University
[email protected] [email protected]

July 28, 2000

Abstract

This paper proposes a hardware mechanism for reducing the coherency overhead occurring in scientific computations on DSM systems. A first phase aims at detecting, in the address space, regular patterns (called streams) of coherency events (such as requests for exclusive or shared access, or invalidations). Once a stream is detected within a loop, the regularity of data accesses can be exploited at the loop level (spatial locality) but also between loops (temporal locality). We present a hardware mechanism capable of detecting and exploiting these regular patterns efficiently. The expected benefits as well as the hardware complexity are discussed, and the limited drawbacks and potential overheads are exposed. For a benchmark suite of typical scientific applications, the results are very promising, both in terms of coherency streams and of the effectiveness of our optimizations.

1 Introduction

Current architectures offering the shared memory abstraction over a physically distributed memory (DSM) constitute a very attractive solution for large scalable shared memory systems. Many of them rely on a hardware-based coherency scheme, which is costly in terms of hardware complexity but also in terms of performance. The latency of memory operations involving complex coherency transactions may become very large. Additionally, the coherency traffic itself can become substantial, consuming a non-negligible portion of the network bandwidth. Previous studies have shown that coherency traffic exhibits regular patterns [CBZ90]. Corresponding optimizations have been proposed to address some of these specific patterns [Per93], [Kax98]. More general schemes have also been proposed, but they remain costly in hardware: they require on-chip modifications or large extensions of the directory structure (memory overhead) [MH98].

In the scientific computing world, it is well known that data accesses are highly structured. This property is induced by the specific nature of scientific algorithms. The success of vector architectures and a large body of research on latency optimization techniques rely on this fact. Stream buffers in the T3E, processor prefetching in the Power3 and the upcoming Merced, and compiler-directed prefetching on SGI systems all aim at exploiting these regular patterns. Our investigations show that, for scientific codes running on DSM systems, there is a strong presence of vectors in the coherency traffic. A way to optimize this traffic is to take into account data streams and regularity in coherency patterns. This is illustrated by figure 1. Starting from this observed regularity of the coherency phenomena, we propose a mechanism able to capture and exploit efficiently such regular patterns, not only within loops but also between parallel loops. Our objectives are twofold: first, reduce read and write latencies; second, reduce data traffic (invalidations/downgrades). Our approach is essentially based on anticipating the memory transactions (Read, Write, Invalidation, Downgrade), relying on a simple hardware mechanism located at the L2 cache level (off-chip). Our scheme will first work by exploiting spatial locality properties within a loop, to aggregate sequences of homogeneous references to contiguous cache lines (such sequences will be called streams). For each of these sequences, a compact descriptor is dynamically built up. Then, our proposed mechanism is capable of recognizing the occurrences of similar streams in the following loops, exploiting temporal locality between streams.

Figure 1: Data requests in Shared or Exclusive mode occurring for LU. The x-axis represents time in CPU cycles, while the y-axis represents the address space by cache line number. Accesses exhibit a high degree of streaming, and the coherency behavior is also structured. Streams corresponding to the different nodes are recognizable. Such a figure is strongly related to the code structure and to the parallelization scheme. Our preliminary profiling investigations clearly show that a few loop nests produce the main part of this coherency traffic. Notice the strong behavioral distinction between Shared and Exclusive accesses: no write events appear after the first quarter of the execution; all nodes have already requested all needed data in Exclusive mode, and further writes are only local operations, not logged on the graph.

One of the difficulties with prediction mechanisms is the compromise to be made between the depth of anticipation (how far ahead an action can be anticipated) and accuracy: wrong predictions, and too many of them in particular, can severely degrade performance due to useless traffic. We will show that our mechanism performs well with respect to these two problems. The remainder of the paper is organized as follows. In section 2 the framework of our study is defined. Section 3 exposes the proposed architecture of the coherency optimizer. The methodology and simulation environment are described in section 4. Results are analyzed in section 5. Several enhancements/extensions are presented in section 6. Section 7 reviews other anticipation mechanisms which have been proposed in the literature. Finally, a short conclusion is given in section 8.

2 Background/Framework

2.1 Target Codes

Our proposal is strongly aimed at improving the performance of so-called scientific codes. A large fraction of these codes can be characterized by very regular data structures (typically multidimensional arrays), which in turn are accessed in a very regular (predictable) manner. This regularity has led to the concept of a "vector" (a sequence of memory locations regularly spaced), which has been very successfully exploited by vector architectures.


2.2 Programming Model

Due to the underlying architecture, we will naturally use a shared address space, and we will assume that these scientific codes have been parallelized using OpenMP parallel constructs. Due to the structure of these codes, most of the parallelism will be exploited at the loop level. According to our terminology, an epoch is a fully parallel construct in which no synchronization other than barriers¹ (located at the beginning and the end) is necessary to ensure correct execution. In particular, we will assume that there is no critical section within an epoch. This assumption is essentially made to simplify the presentation. The scheduling of iterations within an epoch is static, i.e., the iteration space is divided equally among the nodes at compile time. Interestingly enough, the regularity of access patterns which is naturally present within epochs is also present between epochs, i.e., the same vectors are used from one epoch to the next (cf. figure 1).

¹ The barrier code includes an extra write to a memory-mapped region in order to inform our mechanism that a barrier is occurring.
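To make the notion of an epoch concrete, here is a small hypothetical C/OpenMP fragment (ours, not taken from the paper): each parallel loop forms one epoch, bounded by its implicit barriers, and schedule(static) divides the iteration space equally among the nodes.

/* Hypothetical example of two consecutive epochs; compile with an
 * OpenMP-enabled compiler (e.g. -fopenmp). */
#define N 1048576
double a[N], b[N], c[N];

void two_epochs(void)
{
    /* Epoch 1: each node writes a contiguous chunk of a[], producing
     * Write streams; the implicit barrier at the end enforces coherency. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* Epoch 2: the same vectors are reused, so the same streams
     * reappear; this is the temporal locality exploited between epochs. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];
}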

2.3 Target Architecture

Our target architecture is a Distributed Shared Memory system, whose overall organization is given in figure 2. The address space is distributed across the nodes, on a page basis, in a round-robin fashion. Each page is associated with a home node, and this home node remains unchanged during the whole code execution: no page migration mechanism is used. For the sake of simplicity, we will assume that each node is a uniprocessor. Coherency is maintained via a full-map directory structure at cache-line granularity. For every cache line, the corresponding directory structure and information are maintained by its home node.

Figure 2: Basic architecture of a node in a DSM system: a CPU, memory declared as shared, a protocol engine, and the network interface.

The consistency model supported is relaxed consistency. Data races may occur only within epochs. Memory coherency is enforced only at synchronization barriers. It is the compiler's or programmer's responsibility to place barriers so as to guarantee the correctness of the execution. A cache line can be in one of the three classic states: Invalid, Shared or Exclusive, and the coherency protocol works by invalidation. The transition from the Exclusive or Shared state down to the Invalid state will be called Invalidation, and the transition from Exclusive to Shared will be called Downgrade. The coherency protocol is executed by a protocol engine located at the Network Interface level of each node. This engine maintains coherency among nodes. Since a node may be composed of multiple CPUs, the MESI protocol is used inside a node to guarantee the coherency of each cache with the node memory. This last layer is transparent to the protocol engine.
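As a minimal illustration of these states and transitions, the following C sketch (ours, ignoring the intra-node MESI layer, which is transparent to the protocol engine) enumerates the per-line states and the two transitions named above.

/* The three classic per-line states and the two inter-node
 * transitions named in the text; a minimal sketch, not the actual
 * protocol engine logic. */
typedef enum { INVALID, SHARED, EXCLUSIVE } line_state_t;

/* Invalidation: Exclusive or Shared -> Invalid. */
line_state_t invalidation(line_state_t s)
{
    (void)s;          /* any valid state is simply dropped */
    return INVALID;
}

/* Downgrade: Exclusive -> Shared, triggered by a remote read request. */
line_state_t downgrade(line_state_t s)
{
    return (s == EXCLUSIVE) ? SHARED : s;
}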

This system is strongly related to the Prism architecture described by Ekanadham et al. [ELPS98]. Prism supports an interesting page migration algorithm, but the interaction of page migration and coherency is not investigated in this paper.

To summarize, we advocate an off-chip coherency engine, performing prediction at the address level and monitoring the data flowing out of the node as well as the requests flowing in. We also need a dedicated buffer to store prefetched data. We try to limit hardware complexity as much as possible while still capturing the main part of the coherency traffic.

3 Coherency Optimizer

3.1 Overview

In order to reduce the coherency overhead, we propose to add a coherency optimizer within the protocol engine of each node. The regularity of data accesses will be characterized by means of streams. A stream consists of a segment of cache lines, consecutive in terms of addresses, having the same memory and coherency behavior. More precisely, a stream will be characterized by a head (start address), a tail (end address), a type of memory access (Read or Write), and a type of transition in the state diagram. The set of characteristics of a stream will be called a Stream Descriptor (SD); a possible layout is sketched at the end of this subsection. The following subsections give an outline of the coherency optimizer design; some details might have to be changed depending upon implementation choices. The coherency optimizer will work in two phases:

- Streams detection/recognition: identical coherency requests to consecutive data are detected. At this stage, a key decision has to be made: either we are observing a new stream and a Stream Descriptor has to be generated, or we have "recognized" a stream already observed in the past.

- Streams exploitation: again, two cases need to be distinguished. In the case of a new stream, anticipation of the actions will be launched, usually in a careful manner (one or two blocks ahead of the current block). In the case of an already observed stream, the anticipation can be much more aggressive, i.e. anticipating the behavior of multiple consecutive blocks, even of the whole stream.

Although our mechanism shares a lot in common with standard hardware prefetchers, three major differences have to be noted. First, we monitor not only Read/Write operations but also the corresponding coherency actions. Second, we store information on data access behavior through the Stream Descriptor; this is essential for the exploitation phase. Third, we observe not only the outgoing requests issued by the processor but also the incoming requests (Invalidation/Downgrade).
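To fix ideas, here is a possible C layout for a Stream Descriptor. Only the head, tail, access type and transition type are specified above; the field names, widths and the encoding of the activity bit (introduced in section 3.3) are our assumptions.

#include <stdint.h>

/* Possible Stream Descriptor (SD) layout; addresses are in units of
 * cache lines. */
typedef enum { ACC_READ, ACC_WRITE } access_t;
typedef enum { TR_FETCH_SHARED,     /* Read stream       */
               TR_FETCH_EXCLUSIVE,  /* Write stream      */
               TR_INVALIDATE,       /* Invalidate stream */
               TR_DOWNGRADE         /* Downgrade stream  */
} transition_t;

typedef struct {
    uint64_t     head;        /* start address of the stream        */
    uint64_t     tail;        /* end address (last line observed)   */
    access_t     access;      /* type of memory access              */
    transition_t transition;  /* transition in the state diagram    */
    uint8_t      active;      /* activity bit for the current epoch */
} stream_descriptor_t;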

3.2 Targeted Traffic

The streams detected will be classified into four major categories (the corresponding anticipation actions are sketched after the list):

- Read Streams: this corresponds to a standard prefetch request.

- Write Streams: they act like Read streams, except that the data are requested in Exclusive mode. Hence, invalidations are anticipated and overlap is improved.

- Invalidate Streams: a node may receive a long sequence of invalidations for data it has loaded. On such a sequence, the coherency optimizer generates a stream of self-invalidation requests; invalidation messages are also sent to the corresponding home node to decrease the degree of sharing.

- Downgrade Streams: similarly to the Invalidate streams, a node may receive a long sequence of remote read requests for data it has produced. Such requests downgrade the coherency state from Exclusive to Shared and send a copy to the requesting node and to the home node. The coherency optimizer anticipates this behavior, downgrading the mode from Exclusive to Shared in advance, which leads to an update message to the home node. A further optimization would be to also anticipate the requester and send it the data.
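The following C sketch summarizes the anticipation action attached to each category. The helper functions are hypothetical stand-ins for protocol engine operations; only the actions themselves are described in the text.

#include <stdint.h>

/* Hypothetical protocol engine primitives (assumed, not from the paper). */
void prefetch_line(uint64_t line, int exclusive);
void self_invalidate(uint64_t line);
void self_downgrade(uint64_t line);
void send_invalidation_to_home(int home, uint64_t line);
void send_update_to_home(int home, uint64_t line);

typedef enum { STREAM_READ, STREAM_WRITE,
               STREAM_INVALIDATE, STREAM_DOWNGRADE } stream_kind_t;

/* Anticipation action for the next line of a detected stream. */
void anticipate(stream_kind_t kind, uint64_t line, int home)
{
    switch (kind) {
    case STREAM_READ:                       /* standard prefetch          */
        prefetch_line(line, 0);
        break;
    case STREAM_WRITE:                      /* prefetch in Exclusive mode */
        prefetch_line(line, 1);
        break;
    case STREAM_INVALIDATE:                 /* self-invalidation, plus a  */
        self_invalidate(line);              /* message to the home node   */
        send_invalidation_to_home(home, line);
        break;
    case STREAM_DOWNGRADE:                  /* early Exclusive->Shared,   */
        self_downgrade(line);               /* with an update message to  */
        send_update_to_home(home, line);    /* the home node              */
        break;
    }
}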

3.3 Hardware Organization

The coherency optimizer needs to be tightly coupled to the coherency protocol engine. It has to monitor the requests produced by each node in order to detect the regular data patterns (vectors) of identical requests. Due to the page allocation, which tries to balance accesses over all home nodes, these vectors would be much less obvious to detect if the descriptor were placed at the home node level. Hence, we chose to place the optimizer at the Network Interface level, which is also the location of the coherency protocol engine. The Network Interface and the coherency engine allow the optimizer to access global addresses, alleviating the address translation problem found in several DSM systems (like Prism [ELPS98] or, more generally, S-COMA systems [SWCL95], [SN95]).

A simplified view of the overall hardware organization is given in figure 3. Four sets of basic components are distinguished (a possible C rendering of this storage is given at the end of this subsection):

- Address Buffers: these buffers are in turn subdivided into four buffers (Read Miss Buffer, Write Miss Buffer, Invalidation Buffer, Downgrade Buffer) depending on the type of memory request. All four address buffers are indexed by the address of the missing cache line and hold the most recent transactions observed, i.e. potential candidates for stream detection.

- Stream Buffers: these buffers are used to store the data corresponding to a detected stream. They are subdivided into two classes, Read Stream Buffers and Write Stream Buffers. Each Stream Buffer is allocated to a given detected stream according to an LRU policy, and the buffers operate according to a FIFO policy. For example, a Read Buffer will be filled by requests issued by the anticipation mechanism and emptied by requests issued by the processor. The two key parameters for these buffers are first their depth, which is strongly related to the degree of anticipation, and second their number, which corresponds to the number of simultaneously detected streams within an epoch. In our experiments, we found out that 32 buffers were enough.

- Stream Descriptor Table: this table records all of the descriptors corresponding to detected streams. It is indexed by the head address of the stream. Each time a new stream is detected, an entry is allocated in the SD Table. The table is split into two sections: the first section, called the New Section, is dedicated to "new" streams detected within the current epoch, while the second section, called the Old Section, is dedicated to streams detected in previous epochs. With each of these entries an activity flag is associated, indicating whether the stream is active for the current epoch. For the New Section, all of the activity flags are set to active. In case of an overflow of the table, Stream Descriptors are discarded according to an LRU policy.² In our experiments we observed that a 256-entry table was large enough to store all of the streams detected.

- Address Comparators: these comparators are used first for checking whether a miss request can be serviced by a stream buffer, and second for detecting that a stream has already been detected.

Upon a synchronization (typically a barrier), all of the address buffers and the Stream Buffers are flushed, and the content of the New Section of the Stream Descriptor Table is shifted into the Old Section to include the streams detected within the finishing epoch. All of the activity bits are reset to 0.

² A better policy would be to discard entries in the table depending upon the length of the stream: short streams should be evicted first.
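To fix ideas, the storage just described can be rendered in C as follows, using the sizes reported above (32 stream buffers, a 256-entry SD table). The buffer depth, the cache-line size and the even split between the New and Old sections are our assumptions.

#include <stdint.h>

#define N_STREAM_BUFFERS 32    /* enough in our experiments            */
#define SD_TABLE_ENTRIES 256   /* total SD table size                  */
#define BUFFER_DEPTH     8     /* assumed; tied to anticipation depth  */
#define LINE_BYTES       64    /* assumed cache-line size              */

/* Condensed form of the Stream Descriptor sketched in section 3.1. */
typedef struct {
    uint64_t head, tail;
    uint8_t  access, transition, active;
} stream_descriptor_t;

typedef struct {                      /* one address buffer, indexed by */
    uint64_t line[BUFFER_DEPTH];      /* the missing cache-line address */
} addr_buffer_t;

typedef struct {                      /* FIFO of anticipated cache lines */
    uint64_t line[BUFFER_DEPTH];
    uint8_t  data[BUFFER_DEPTH][LINE_BYTES];
    int      head, count;
} stream_buffer_t;

typedef struct {
    addr_buffer_t   read_miss, write_miss, invalidation, downgrade;
    stream_buffer_t streams[N_STREAM_BUFFERS];        /* LRU-allocated */
    stream_descriptor_t sd_new[SD_TABLE_ENTRIES / 2]; /* current epoch */
    stream_descriptor_t sd_old[SD_TABLE_ENTRIES / 2]; /* past epochs   */
} coherency_optimizer_t;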

3.4 Stream Detection/Recognition

This phase is an essential one, since all further optimizations/anticipations will take place on detected/recognized streams. Let us distinguish two cases, depending on whether requests are issued by the local CPU or by a remote CPU.

Local requests. Upon a read miss (the write miss case is handled in a similar manner), the address will potentially go through up to three stages (corresponding to checks against the various buffers), summarized in code after their description:

STAGE 1: Check against the Stream Buffers. If the data is found in one of the Stream Buffers (hit), it is directly sent to the processor and the corresponding Stream Descriptor is updated: if it is a new stream currently being detected, the tail address is incremented by one; otherwise no operation is performed on the SD table. If no Stream Buffer contains the requested cache line, the requested address is forwarded directly to the network and, in parallel, sent to Stage 2.

STAGE 2: Check against the entries of the Old Section of the Stream Descriptor Table. If the requested address matches one of the head addresses, the activity bit is set and a prefetch is launched. This case corresponds to the situation where a stream has been recognized. Otherwise, the requested address is sent to Stage 3.

STAGE 3: Check against the miss buffer. The address is first decremented by one before looking for a match with an entry in the miss buffer. If a match is found, it means that two consecutive cache lines have been requested and that a new stream has been detected. In such a case, an entry is created in the New Section of the Stream Descriptor Table and the matched address is retrieved from the miss buffer. If no match is found, the requested address is simply inserted into the miss buffer.
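The three stages can be summarized by the following C sketch. The lookup helpers are hypothetical, and the forwarding to the network at the end of Stage 1 happens in parallel with Stage 2 in hardware, whereas it is sequential here.

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers over the structures of section 3.3. */
bool stream_buffer_hit(uint64_t line);        /* Stage 1 lookup       */
bool old_section_head_match(uint64_t line);   /* Stage 2 lookup       */
bool miss_buffer_remove(uint64_t line);       /* Stage 3 lookup       */
void send_to_processor(uint64_t line);
void increment_sd_tail(uint64_t line);        /* only for new streams */
void forward_to_network(uint64_t line);
void set_activity_bit_and_prefetch(uint64_t line);
void sd_new_section_insert(uint64_t head);
void miss_buffer_insert(uint64_t line);

void on_local_read_miss(uint64_t line)
{
    /* STAGE 1: hit in a Stream Buffer -> serve locally, update the SD. */
    if (stream_buffer_hit(line)) {
        send_to_processor(line);
        increment_sd_tail(line);
        return;
    }
    forward_to_network(line);  /* miss: request the line anyway        */

    /* STAGE 2: head match in the Old Section -> stream recognized.     */
    if (old_section_head_match(line)) {
        set_activity_bit_and_prefetch(line);
        return;
    }

    /* STAGE 3: the previous line was also missed -> new stream.        */
    if (miss_buffer_remove(line - 1)) {
        sd_new_section_insert(line - 1);  /* head = the matched line    */
        return;
    }
    miss_buffer_insert(line);  /* no match: remember this miss          */
}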

[Figure 3: Simplified hardware organization of the coherency optimizer within the Network Interface: address buffers, stream buffers, Stream Descriptor Table, and address comparators.]