Producer-Push – a Protocol Enhancement to Page-based Software Distributed Shared Memory Systems

Sven Karlsson and Mats Brorsson
Computer Systems Group, Department of Information Technology, Lund University
P.O. Box 118, SE-221 00 LUND, Sweden
email: [email protected], URL: http://www.it.lth.se

Abstract

This paper describes a technique called producer-push that enhances the performance of a page-based software distributed shared memory system. Shared data, in software DSM systems, must normally be requested from the node that produced the latest value. Producer-push utilizes the execution history to predict this communication so that the data is pushed to the consumer before it is requested. In contrast to previously proposed mechanisms to proactively send data to where it is needed, producer-push uses information about the source code location of communication to more accurately predict the needed communication. Producer-push requires no source code modifications of the application and it effectively reduces the latency of shared memory accesses. This is confirmed by our performance evaluation, which shows that the average time to wait for memory updates is reduced by 74%. Producer-push also changes the communication pattern of an application, making it more suitable for modern networks. The latter is a result of a 44% reduction of the average number of messages and an enlargement of the average message size by 65%.

1. Introduction

Systems that implement a shared memory programming model on a distributed memory architecture provide a cost-effective means to create parallel computing platforms that are relatively easy to program. Such systems are generally called software distributed shared memory (software DSM) systems and can easily be put together from commodity components such as PCs and standard networks. Current LAN technologies can be used but generally deliver poor performance. However, high-bandwidth, low-latency networks such as SCI [15] have been available as off-the-shelf components for some time, and it is likely that more high-performance network technologies will emerge. Even with a standard ATM T1 network it is possible to achieve reasonable efficiency (50%-90%) for a wide range of applications on a software DSM system [16, 8]. However, given the current performance growth of microprocessor technology, it has been shown that even if network technology keeps up with the processor performance trend, aggressive latency hiding techniques will be needed to sustain parallel performance efficiency in the future [5, 16]. The existing standard technique to tolerate shared memory latency is to use a relaxed memory model [6]. In this paper we present a technique that complements this idea to mask and overlap some of the communication latency in shared memory applications on a software DSM system.

(The research in this paper was supported with computing resources by the Swedish Council for High Performance Computing (HPDR) and the Center for High Performance Computing (PDC), Royal Institute of Technology.)

In a shared memory parallel computing model with relaxed memory, the actual transfer of data does not occur when the data is produced but is deferred to when it is needed, i.e. when a processor issues a load or store instruction to a data item that has previously been written by some other processor. This leads to a request-reply scheme where data is requested and the processor that has an up-to-date copy replies. As a consequence, the communication latency directly contributes to the execution time. We have implemented and evaluated a technique called producer-push that uses simple heuristics to determine when data that has been modified by one processor is needed by some other processor, and to push that data to its anticipated destination before it has been requested. This communication can thus be almost completely overlapped with computation on the receiving processor. An additional advantage is that the processor that needs the data is not required to send a request message, since the data is already present in the local memory.

We have integrated the producer-push technique in TreadMarks, a state-of-the-art software DSM system [2]. The heuristics trigger producer-initiated communication automatically, without modification of the application, and, based on repetitive access patterns, are found to predict communication in certain parallel applications. We have performed an evaluation on seven applications from the TreadMarks distribution that we executed with and without the producer-push technique on an IBM SP2 using up to eight processing nodes. The evaluation shows that the producer-push technique indeed effectively reduces the time to request shared data. The resulting performance improvement depends on the original efficiency of the application; our experiments show a performance improvement of 55% in one case. Producer-push also does not hurt performance for applications that do not fit the behaviour assumed by the simple heuristics. The performance improvement is mainly due to a reduction in the time an application spends in the communication routines. There are two reasons for this. First, the requests for memory updates can in many cases be eliminated since the updated memory is already in place; second, the interconnection network is better utilized since producer-push results in fewer and larger messages.
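To make the request-reply scheme concrete, the following is a minimal sketch of how a faulting processor obtains up-to-date data. All names (page_t, has_newer_copy, and so on) are illustrative assumptions for exposition, not TreadMarks' actual API; the point is the stall at the reply wait, which is the latency producer-push hides.

    /* Sketch of the request-reply scheme described above.
       All helper names are illustrative assumptions. */
    typedef struct page page_t;

    extern int  num_procs;
    extern int  page_id_of(page_t *page);
    extern int  has_newer_copy(page_t *page, int proc);
    extern void send_update_request(int proc, int page_id);
    extern void wait_for_all_replies(page_t *page);
    extern void apply_updates(page_t *page);
    extern void restore_access(page_t *page);

    void on_access_fault(page_t *page)
    {
        /* Ask every processor known to hold a newer version of
           this page for its encoded updates. */
        for (int p = 0; p < num_procs; p++)
            if (has_newer_copy(page, p))
                send_update_request(p, page_id_of(page));

        wait_for_all_replies(page); /* the processor stalls here: this is
                                       the request-reply latency */
        apply_updates(page);        /* merge the received changes */
        restore_access(page);       /* the faulting access can now complete */
    }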

1.1 Contributions

The idea of transferring data before it has been requested is not new in the context of software DSM systems. Its success, or lack of success, depends on the quality of the mechanism that decides what to send and when. If the mechanism sends more data than is actually needed, or sends it to the wrong destinations, the performance gain is usually smaller than the performance loss caused by the increased network traffic. Keleher proposed a lazy-hybrid protocol in his Ph.D. thesis that appears similar to the protocol enhancement proposed here [11]. He later re-evaluated this protocol in the context of home-based software DSM systems [12]. The lazy-hybrid protocol is, like our technique, based on repetitive access patterns, but uses more indiscriminate heuristics that lead it to send substantially more data than needed, and to processors that do not use it. Seidel et al. have proposed an almost identical technique in their Affinity Entry Consistency Protocol [17]. In contrast to the techniques proposed by Keleher and Seidel, the heuristics in our producer-push technique go one step further: they identify the correct time to send data proactively by identifying where in the program new data is generated. This is described in detail in section 2.

Producer-push is related to producer-initiated communication, commonly found in message-passing applications [1]. However, it is also fundamentally different in that it does not require explicit message-passing code to be inserted into the applications; instead, the software DSM system takes care of all message passing. No modification of the application's source code is needed at all. Prefetching [4, 9] is also closely related to producer-push in that it tries to move data to its destination before it has been requested. However, while producer-push reduces the number of messages since the data is moved proactively, prefetching uses the same request-reply scheme as the normal protocol, and if the efficiency of the prefetching scheme is not very high it will send much more data over the network than needed. Producer-push eliminates many of the small request messages and thus utilizes the network better.

In short, the contributions of this paper are:

• the specification and implementation of producer-push in TreadMarks
• a thorough performance evaluation that provides valuable insight into parallel application behaviour

The performance evaluation was done on an upgraded IBM SP2 system with substantially higher performance than the experimental systems used by Keleher and Seidel. In the rest of the paper, we first describe the lazy release consistency protocol and how producer-push has the potential to increase performance. In section 3 we describe the experimental methodology we have used to evaluate the producer-push technique. The results are presented in section 4, and after a short discussion of potential improvements to producer-push in section 5, the paper is concluded in section 6.

2. Lazy release consistency and producer-push

2.1 The TreadMarks software DSM system

TreadMarks is a software package that provides programmers with a shared memory programming model on top of a network of workstations (NOW) [2]. Although TreadMarks defines its own programming model, it has been shown that it is possible to implement a standard programming model such as OpenMP on top of TreadMarks [7, 14]. As the nodes in a NOW do not physically share memory, information about changes to the shared memory is captured through the address translation mechanism in the processor, which then invokes the consistency protocol. The coherence granularity is thus a virtual page. The performance penalty of maintaining memory coherence in software and at such a coarse granularity is very high, and TreadMarks therefore uses a relaxed memory consistency model with a multiple-writer, invalidate protocol.
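On Unix-like systems, this use of the address translation mechanism is typically realized with page protection and a fault handler. The sketch below shows the idea; the helper consistency_fault and the fixed page size are illustrative assumptions, not the actual TreadMarks internals.

    /* Sketch: capturing writes to shared pages with the virtual memory
       protection mechanism, as page-based software DSM systems do. */
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define PAGE_SIZE 4096                      /* assumed page size */

    extern void consistency_fault(void *page);  /* illustrative: protocol entry */

    static void *page_align(void *addr)
    {
        return (void *)((uintptr_t)addr & ~(uintptr_t)(PAGE_SIZE - 1));
    }

    static void fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        void *page = page_align(info->si_addr);
        consistency_fault(page);                   /* run the protocol */
        mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE); /* retry access */
    }

    void install_write_detection(void *shared_base, size_t len)
    {
        struct sigaction sa;
        sa.sa_sigaction = fault_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        mprotect(shared_base, len, PROT_READ);  /* write-protect shared pages */
    }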

In a relaxed memory model, information about changes to the shared memory may be delayed until a synchronization point [6]. If all accesses to shared data in a shared memory program are protected by synchronization mechanisms, e.g. barriers or locks, we say that the program is data race free. It is thus not necessary that all processors see the same shared memory image at a given time. If a variable is protected by a lock and modified by a processor, it is only necessary to inform another processor about this update when the lock has been released and acquired by that other processor.

This memory model has been further improved in TreadMarks to reduce communication. A processor is notified about the changes to the shared memory made by other processors at the time of a lock acquire (instead of at the lock release, as is normal in relaxed memory models). The actual data is of course not transferred until it is needed, which is signalled by a memory access fault. This mechanism is called lazy release consistency, LRC. Since the implementation of producer-push is intimately related to LRC, we briefly explain how LRC works with a simple example in figure 1. In this example three processors access two variables within the same virtual page. Processor 0 first reads and writes variable X, and processor 1 reads and writes variable Y. The processors are then synchronized with a barrier, after which processor 2 reads and writes variable X. It should then see the value of X that was updated by processor 0 before the barrier. To achieve this, a copy of the page, called a twin, is created when processor 0 modifies the variable for the first time. This is done by the page fault handler that is invoked at the write access, since the page is write-protected from the start of the execution.

In the code for the barrier, each processor sends a message to processor 0 to notify that it has reached the barrier. Together with this message it transmits information about which pages it has modified during the previous interval. This information is called a write notice. Processor 0 collects all write notices, including the ones it created itself, and redistributes them to all processors with the same message that signals that they can proceed past the barrier.

When a processor receives the barrier release message, it continues execution after having removed all permissions to access pages for which it has received write notices. Thus, when processor 2 later accesses variable X it will experience a page fault. Processor 2 has received two write notices for this particular page, since both processors 0 and 1 have modified data in the page. It will then send a message to both processor 0 and processor 1 and request the updated memory. In order to save time and to more easily combine the changes made by processors 0 and 1, each of these processors creates a diff, an encoding of the changes made by that processor alone. For this it uses the twin created earlier. When the diff has been created, it is sent to processor 2, which applies the changes, and when all diffs have arrived it continues with its memory reference.
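A minimal sketch of the twin and diff mechanism just described might look as follows. The offset/value encoding is an illustrative assumption; TreadMarks' actual diff format is run-length based.

    /* Sketch of twin creation and diff encoding (illustrative). */
    #include <stddef.h>
    #include <string.h>

    #define PAGE_SIZE 4096                     /* assumed page size */

    typedef struct {
        unsigned char data[PAGE_SIZE];         /* current page contents */
        unsigned char twin[PAGE_SIZE];         /* pristine copy from first write */
    } page_t;

    typedef struct {
        size_t offset;                         /* byte position in the page */
        unsigned char value;                   /* new value at that position */
    } diff_entry_t;

    /* Invoked by the write fault handler at the first write in an
       interval: save an unmodified copy before the write proceeds. */
    void create_twin(page_t *page)
    {
        memcpy(page->twin, page->data, PAGE_SIZE);
    }

    /* Invoked when another processor requests this processor's changes:
       compare current contents against the twin and encode only the
       bytes this processor modified, so changes from multiple writers
       of the same page can be combined at the consumer. */
    size_t make_diff(const page_t *page, diff_entry_t *out)
    {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++) {
            if (page->data[i] != page->twin[i]) {
                out[n].offset = i;
                out[n].value  = page->data[i];
                n++;
            }
        }
        return n;                              /* entries sent to the requester */
    }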

[Figure 1. The Lazy Release Consistency protocol in TreadMarks defers the propagation of consistency information to the latest possible moment. The timeline shows P0 creating a twin at its first write access and later creating and sending a diff; write notices being created and sent in the barrier; and P2 removing access permissions, taking an access fault and requesting diffs from the other processors. RWX and RWY denote reads and writes to addresses X and Y in the same page.]

Lazy release protocols reduce network bandwidth requirements at the expense of increased latency, caused when faulting processors have to wait for diffs to arrive. This latency is analogous to coherency miss latencies in hardware-based systems. Another approach would be to let the processor broadcast all of its diffs directly when it enters a barrier. Protocols that behave like this are called eager protocols [13]. It has been shown that lazy protocols outperform eager protocols, since eager protocols send so much data that even very high performance networks become saturated.
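The contrast between the two protocol families can be sketched as follows; both variants are heavily simplified and all names are illustrative assumptions.

    /* Sketch contrasting eager and lazy diff propagation at a barrier. */
    extern int  num_procs, num_pages, my_id;
    extern int  page_modified(int page_id);
    extern void send_diff(int proc, int page_id);
    extern void send_write_notices(void);   /* announce modified page ids */
    extern void barrier_sync(void);

    /* Eager: every diff goes to every other processor at the barrier,
       whether or not that processor will ever touch the page. */
    void barrier_eager(void)
    {
        for (int p = 0; p < num_procs; p++)
            if (p != my_id)
                for (int pg = 0; pg < num_pages; pg++)
                    if (page_modified(pg))
                        send_diff(p, pg);
        barrier_sync();
    }

    /* Lazy: only write notices cross the network at the barrier; diffs
       are created and sent later, on demand, when a consumer actually
       faults on a page. */
    void barrier_lazy(void)
    {
        send_write_notices();
        barrier_sync();
    }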

2.2 Producer-push overview

The vast majority of parallel scientific applications use some sort of iterative algorithm, such as a system simulation in small time steps or an iterative solver for a numerical problem. Such an application typically performs the same calculations repetitively in each iteration and often uses the same data. We can thus use information gathered during one iteration to predict the communication in the following iterations. Very simple applications have only one phase, which is repeated over and over again. For such applications the lazy-hybrid protocol proposed by Keleher would perform very well [11]. However, many applications consist of several phases, and for these applications the simple heuristics used to send data proactively in the lazy-hybrid protocol will send too much data.

A processor can collect information on which other processors have requested diffs during an interval, i.e., a TreadMarks interval as described earlier. The processors that have issued diff requests are called consumers, since they want to use the data. The processor that responds to the diff requests is in this context called the producer. From the diff requests, the producer can, when it enters a barrier or releases a lock, collect information about which diffs the consumers need in a phase of the computation. Knowing this, the producer can send these diffs directly after it has released a lock or entered a barrier the next time it comes to this phase. We call this pushing. The consumer stores the pushed diffs in a diff cache, and when it later needs a diff it first looks in the diff cache to see if the diff has been pushed. If so, the consumer does not need to request the diff from the producer, and the latency due to the diff request is greatly reduced. This is the essence of the producer-push technique.

Besides reducing the latency of diff requests, producer-push also has the advantage that it causes the average network message size to increase. This can further boost performance, since modern high-bandwidth networks cannot reach their peak bandwidth unless fairly large messages are used [10]. Another advantage is that the technique does not require any source code modifications at all.
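The mechanism just described could be sketched as follows. The data structures and names are illustrative assumptions, not the actual TreadMarks code; the fixed table sizes in particular are for exposition only.

    /* Sketch of producer-push: the producer remembers which processors
       consumed which diffs in a phase and pushes those diffs unasked the
       next time the phase is reached; the consumer checks its diff cache
       before issuing a request. */
    #include <stddef.h>

    #define MAX_PHASES 64                   /* assumed limits */
    #define MAX_PAGES  1024

    typedef struct diff diff_t;

    extern int     num_pages;
    extern int     page_modified(int page_id);
    extern void    send_diff_reply(int proc, int page_id);
    extern void    push_diff(int proc, int page_id);    /* unsolicited send */
    extern diff_t *diff_cache_lookup(int page_id);
    extern void    apply_diff(int page_id, diff_t *d);
    extern void    request_diffs(int page_id);

    /* consumers[phase][page]: bitmask of processors that requested this
       page's diff the last time this phase was executed. */
    static unsigned consumers[MAX_PHASES][MAX_PAGES];

    /* Producer: serve a diff request and remember the consumer. */
    void on_diff_request(int requester, int page_id, int phase)
    {
        consumers[phase][page_id] |= 1u << requester;
        send_diff_reply(requester, page_id);
    }

    /* Producer: entering a barrier (or releasing a lock) in a phase seen
       before, push the diffs that were requested in the previous round. */
    void on_barrier_enter(int phase)
    {
        for (int pg = 0; pg < num_pages; pg++) {
            unsigned mask = consumers[phase][pg];
            for (int p = 0; mask != 0; p++, mask >>= 1)
                if ((mask & 1u) && page_modified(pg))
                    push_diff(p, pg);
        }
    }

    /* Consumer: on an access fault, the diff cache avoids the round trip. */
    void on_page_fault(int page_id)
    {
        diff_t *d = diff_cache_lookup(page_id);
        if (d != NULL)
            apply_diff(page_id, d);    /* hit: no request message sent */
        else
            request_diffs(page_id);    /* miss: fall back to request-reply */
    }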

2.3 Heuristics and implementation

In order to make producer-push work well, we need some kind of heuristic to guide the selection of which diffs to push, when they should be pushed and which processor they should be pushed to. We describe here the heuristic we have used and some details of the actual implementation. We first describe how we find logical intervals in the execution that are related to phases in the execution, as in the loop below.

int i;
for (i = 0; i < num_iterations; i++) {
    /* The original listing is truncated at this point in the source we
       worked from; the loop body is reconstructed here as a generic
       iterative phase (an assumption). */
    compute_on_shared_data(i);   /* illustrative work on shared memory */
    Tmk_barrier(0);              /* TreadMarks barrier ending the phase */
}
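The heuristic keys its bookkeeping on where in the program the synchronization occurs. One plausible way to obtain such a source-location identifier is the return address of the barrier or lock call; this realization is our assumption for illustration (using a GCC builtin), not necessarily the paper's exact mechanism.

    /* Sketch: deriving a phase identifier from the source code location
       of the synchronization call (illustrative). */
    #include <stdint.h>

    typedef uintptr_t phase_id_t;

    extern void producer_push_on_barrier(phase_id_t phase);  /* illustrative */
    extern void real_barrier(unsigned id);                   /* illustrative */

    /* Wrapper around the barrier primitive: the return address is the
       application's call site, so distinct barrier locations in the
       source yield distinct phase identifiers. */
    void dsm_barrier(unsigned id)
    {
        phase_id_t phase = (phase_id_t)__builtin_return_address(0);
        producer_push_on_barrier(phase);  /* push diffs recorded for phase */
        real_barrier(id);
    }

Diffs requested under a given phase identifier in one iteration can then be pushed when the same call site is reached in the next iteration.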
