Linear Aggressive Prefetching: A Way to Increase the Performance of Cooperative Caches.


T. Cortes and J. Labarta
Departament d'Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona
http://www.ac.upc.es/hpc
{toni, [email protected]}

Abstract

Cooperative caches offer huge amounts of caching memory that is not always used as well as it could be. We might find blocks in the cache that have not been requested for many hours. These blocks will hardly improve the performance of the system, while the buffers they occupy could be better used to speed up the I/O operations of the running applications. In this paper, we present a family of simple prefetching algorithms that increase file-system performance significantly. Furthermore, we also present a way to turn any simple prefetching algorithm into an aggressive one that limits its aggressiveness so as not to flood the cache unnecessarily. All these algorithms and mechanisms have been shown to increase the performance of two state-of-the-art parallel/distributed file systems: PAFS and xFS.

1. Introduction

Cooperative caches are becoming very popular in parallel/distributed environments [3, 5, 6, 8]. A cooperative cache is a large global cache built from the union of the local caches located in each node of the system. As all local caches are globally managed, a more effective cache can be achieved and the performance of the file system is greatly improved.

This kind of cache offers huge amounts of caching memory that is not always used as well as it could be. We might find blocks in the cache that have not been requested for many hours [4]. These blocks will hardly improve the performance of the system, and the buffers they occupy could be better used to improve the performance of the applications. Furthermore, the size of these caches will keep growing as memory sizes increase quite rapidly, so the above situation will become even more frequent. A good way to use these buffers more effectively is to prefetch the blocks needed by the applications. With this mechanism we can replace very old blocks by others that have a high probability of being used by the applications in the near future.

In this paper, we present the idea of linear aggressive prefetching. As the caches are quite large, we can implement quite aggressive prefetching algorithms, since the prefetched blocks will replace very old ones. If the blocks are correctly predicted, the system will observe a performance improvement. On the other hand, if the blocks are mispredicted, they will mostly replace very old blocks that nobody expects to find in the cache, so these mispredictions should not decrease the system performance. In conclusion, the larger the cache, the more aggressive the prefetching can be.

All the performance results presented in this paper have been obtained through simulation so that a wide range of architectures can be tested. The proposed algorithms have been tested on two state-of-the-art parallel/distributed file systems that implement a cooperative cache: PAFS [4, 6] and xFS [1].

* This work has been supported by the Spanish Ministry of Education (CICYT) under contract TIC-95-0429.
† A shorter version of this report was published in the Proceedings of the 13th International Parallel Processing Symposium (IPPS'99).


1.1. Related Work

Although prefetching is a very important issue in achieving high-performance file systems, not much work has been devoted to it. Most of the research done on prefetching has focused on mono-processor file systems and very little has been done for parallel/distributed systems. Furthermore, the majority of the work done on prefetching for parallel file systems has taken only one application into account. In this work, we propose prefetching algorithms and mechanisms aimed at increasing the performance of a system where many applications are running concurrently.

The first related work we would like to mention is the prefetching algorithm proposed by Vitter and Krishnan [18, 7]. In their work, they propose a prefetching algorithm based on Markov models. This algorithm has been the starting point of the IS PPM family of prefetching algorithms presented in this paper. In their paper, they also proposed a somewhat aggressive prefetching where the most probable blocks were prefetched, although only one of them might be requested. In our case, we trust the predictions and prefetch many blocks assuming that they have been correctly predicted; all prefetched blocks may be requested by the application.

One of the first works on parallel prefetching was developed by Kotz [10]. He presented a study of prefetching mechanisms and algorithms for parallel file systems. In his study, he used synthetic applications, and the algorithms were designed for a single application, not for all the applications in the machine as we do in this paper. Furthermore, the algorithms presented in his paper are quite different from the ones presented here and no aggressive mechanisms were proposed.

In parallel prefetching, there has also been some theoretical work that assumes knowledge of the access stream. The objective of that work was to schedule when each block should be prefetched to achieve the best performance [9]. Our work could complement this one, as we propose a way to decide which blocks to prefetch, so the whole access stream does not need to be known. There has also been some work that lets applications express the pattern used to access their files. With this information, the system can perform parallel prefetching if the data is kept on several disks [16]. This mechanism can perform quite aggressive prefetching, as the exact access pattern is known and no mispredictions will be made. That work relies on the user to specify the access pattern, while our approach tries to learn it on the fly. Finally, there was another project where files were prefetched before they were even opened. The system predicted which files were going to be used by keeping a record of the last used ones [11]. Whenever the system is able to predict that a given file is going to be used, the whole file is prefetched. This very aggressive approach is useful in a Unix-like environment where files are small. In a parallel environment with huge files, such as the one we target, this mechanism would prove too aggressive.

This paper is divided into 6 sections. In Section 2, a set of simple and intelligent prefetching algorithms is presented. Section 3 describes a mechanism to transform simple algorithms into linear aggressive ones. Section 4 goes through some implementation details that are needed to understand the experimental results presented in Section 5. Finally, Section 6 presents the conclusions extracted from this work.

2. Prefetching Algorithms

In this section, we describe the two basic algorithms used in this paper. The first one, OBA, is a well-known and widely-used prefetching algorithm, while the second one, IS PPM, is a contribution of this paper. In any case, we should remember that the main objective of this paper is not to propose the best prefetching algorithm but to build aggressive algorithms from much simpler ones. These new aggressive algorithms should increase the performance of the applications significantly.

2.1. One Block Ahead (OBA)

The first algorithm is the most widely used in sequential and parallel/distributed file systems. The idea of this simple algorithm is to take advantage of spatial locality: if a given file block is requested, it is very probable that the next sequential block will be requested too. Using this heuristic, whenever a block i is read or written, block i+1 is also requested for prefetching [17]. This is a very conservative algorithm, as only one block is prefetched after each request.
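As an illustration, the following minimal sketch shows the one-block-ahead rule in Python; the `file`, `cache` and `disk_queue` interfaces are assumptions made for this example, not part of PAFS or xFS.

```python
def on_block_request(file, block, cache, disk_queue):
    """One Block Ahead: after a request for `block`, schedule block + 1."""
    nxt = block + 1
    if nxt < file.num_blocks and not cache.contains(file, nxt):
        # Prefetches run at lower priority than user operations (see Section 4),
        # so they never delay demand reads or writes.
        disk_queue.enqueue_prefetch(file, nxt)
```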

[Figure 1. Access pattern used in the example: the application alternates 2-block and 3-block requests over blocks 1-24, separated by offset intervals of 3 and 5 blocks; accessed and not-accessed blocks are marked.]

2.2. Interval and Size Predictor (IS PPM)

The second algorithm is a more sophisticated one that tries to find the access pattern used by the application in order to predict the blocks that will be requested in the future. It is an evolution of the prediction-by-partial-match (PPM) algorithm proposed by Vitter and Krishnan [18, 7], which is based on Markov models.

The first thing we need to do is to describe the way we model the access pattern. In this work, we model a sequence of user requests by a list of pairs formed by a size and an offset interval. The size is the number of file blocks in a request; if a given operation only requests 2 bytes but from two different blocks, we assume that it was a two-block request. The second element used to model a sequence of requests is the offset interval: the difference, in blocks, between the first block of a given request and the first block of the previous one. With this information, our algorithm tries to detect the access pattern of offset intervals and requested sizes.

As in PPM, we build a graph that lets us predict which blocks will be requested by the application. The main difference between IS PPM and the original PPM is the information kept in the graph. In the original algorithm, the graph recorded which block had been requested after another given block. This is quite useful for paging (as proposed in that paper) but it is not the most adequate information for file prefetching. In our case, we want to keep the offset intervals between requests: we want to know that after jumping x blocks, the following request jumped another y blocks. As files are usually accessed using sequential or regular strided patterns [2, 13, 14], keeping the sequence of blocks is not enough, since the system would have to wait until a block had been accessed once before being able to prefetch it. Because we predict intervals, blocks that have never been requested can also be predicted.
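As a concrete illustration of this model, the sketch below (written for this text, not taken from PAFS or xFS) converts a stream of requests, each given by its first block and its size in blocks, into the (offset interval, size) pairs used by IS PPM. The example request stream reproduces the pattern of Figure 1.

```python
def to_interval_size_pairs(requests):
    """requests: list of (first_block, size_in_blocks) in arrival order.
    Returns the (offset interval, size) pairs that model the stream."""
    pairs = []
    for (prev_first, _), (cur_first, cur_size) in zip(requests, requests[1:]):
        interval = cur_first - prev_first  # jump, in blocks, from the previous request
        pairs.append((interval, cur_size))
    return pairs

# The pattern of Figure 1: 2-block and 3-block requests alternating.
print(to_interval_size_pairs([(1, 2), (4, 3), (9, 2), (12, 3), (17, 2)]))
# -> [(3, 3), (5, 2), (3, 3), (5, 2)]
```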

A second important difference compared to the original algorithm is the number of blocks to prefetch. In PPM, only one page was prefetched each time. In file-system accesses, on the other hand, we can predict both the position of the next request and the number of blocks that will be requested; this information can also be predicted using the IS PPM algorithm presented in this paper. Finally, the last difference is that in PPM a probability was kept in the links between nodes. This probability was used when more than one block had previously been accessed after a given block: the one that had been accessed more times was the one prefetched. We have observed that this heuristic is not the most adequate one for file systems [4]; we have found that following the path that has most recently been followed achieves a more accurate prediction.

As we have already mentioned, the basic structure used in this predictor is a graph where offset intervals and sizes are kept in the nodes, while the links between nodes are labeled with the time they were last used. The best way to understand how this mechanism works is to see how such a graph is created and how it can then be used to predict the blocks to be accessed. In Figure 1, we show the access pattern we will try to detect and predict as an example. The application starts by accessing the first two blocks in the file and continues with the following pattern: a request of 2 blocks is always followed by another of 3 blocks located 3 blocks apart, and this 3-block request is followed by another 2-block request that starts 5 blocks further into the file. After this new access, the pattern starts again, as a 2-block request has been made. The first time the application makes a request, the system cannot compute any interval (it needs at least two requests) and nothing can be done (Figure 2, t1).

[Figure 2. Steps taken by the system to construct the graph for the access pattern shown in Figure 1 (states t1 through t6, showing the nodes (I=3,S=3) and (I=5,S=2) and the time-stamped links between them).]

After the second request, we can add the first node to the pattern graph. This node means that, in the access pattern, there is a 3-block request after an offset interval of 3 blocks (Figure 2, t2). After the third request, the system can add a new node to the graph, meaning that there is a 2-block request after an offset interval of 5 blocks. Furthermore, the first link can also be added. This link specifies that after jumping 3 blocks and requesting 3 blocks, the application jumped 5 blocks to request 2 more blocks. The label of this link contains the time this happened (Figure 2, t3). After the fourth request, no new nodes are added, as the one the system would add (I=3,S=3) is already in the graph. The only thing to be done is to add a new link between both nodes, but in the opposite direction; the label of this link is also the current time stamp (Figure 2, t4). Finally, after the fifth request, no links or nodes are added, as they are already in the graph. The only modification is that the link from (I=3,S=3) to (I=5,S=2) gets a new time stamp (Figure 2, t5). This mechanism is used throughout the whole execution of the application so that new patterns are also reflected in the graph.

Once we understand how the prediction graph is constructed, we need to see how it can be used to predict the blocks needed by the application. Whenever the application makes a request, the system computes the offset interval between this request and the previous one. With this interval and the size of the current request, the system searches for the node that matches this information. If the node is in the graph, the system just has to follow the most-recently updated link, which gives the interval and the size of the next request. Adding this interval to the current file pointer, we get the offset of the predicted request; the number of blocks to prefetch is the size kept in the node. For example, if we use the graph shown in Figure 2, t4, we can predict the fifth request very easily. First, we compute the offset interval between the third and fourth requests and obtain an offset interval of 3. Second, as the size of the fourth request is 3 blocks and the computed interval is also 3 blocks, we need to find the node (I=3,S=3). Once found, we follow the most-recently updated link and reach node (I=5,S=2). Finally, we compute the offset of the next request by adding the offset of the fourth request (12) to the offset interval obtained from the graph (5). Now we have all the information to prefetch the next request: the prefetcher has to read blocks 17 and 18.

If we used this algorithm as it is, we would have a start-up problem. When there is no node that describes the current situation, there is nothing the system can do to predict the next block. As this will happen quite often during the first accesses to a file, we need a solution. Our proposal consists of using the OBA algorithm whenever not enough information is available in the graph. This should not happen very often, as it is only intended for the first accesses to a file. In practice, the number of blocks prefetched using the OBA algorithm was, on average, less than 1% when the files were large (CHARISMA workload) and around 25% when the files were small (Sprite workload). These percentages have been extracted from all the experiments discussed later in this paper.
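To make the mechanism concrete, here is a minimal sketch of a 1st-order IS PPM predictor, using the (interval, size) modeling shown earlier. It is an illustration written for this text, not the actual PAFS/xFS code: nodes are (interval, size) pairs, links are labeled with the time they were last used, prediction follows the most recently updated link, and an OBA-style fallback covers the start-up case.

```python
class ISPPM1:
    """1st-order Interval-and-Size PPM sketch."""

    def __init__(self):
        self.links = {}         # node -> {next_node: time the link was last used}
        self.prev_node = None   # (interval, size) of the latest request
        self.prev_first = None  # first block of the latest request
        self.prev_size = None
        self.clock = 0

    def record(self, first_block, size):
        """Update the graph with a new user request."""
        self.clock += 1
        if self.prev_first is not None:
            node = (first_block - self.prev_first, size)   # (offset interval, size)
            if self.prev_node is not None:
                # Label (or relabel) the link with the current time stamp.
                self.links.setdefault(self.prev_node, {})[node] = self.clock
            self.prev_node = node
        self.prev_first, self.prev_size = first_block, size

    def predict(self):
        """Return (first_block, size) to prefetch next, or None before any request."""
        if self.prev_first is None:
            return None
        candidates = self.links.get(self.prev_node)
        if not candidates:
            # Start-up case: not enough information yet, fall back to OBA.
            return (self.prev_first + self.prev_size, 1)
        interval, size = max(candidates, key=candidates.get)  # most-recent link
        return (self.prev_first + interval, size)

# Replaying the example of Figure 1 up to the fourth request:
p = ISPPM1()
for req in [(1, 2), (4, 3), (9, 2), (12, 3)]:
    p.record(*req)
print(p.predict())   # -> (17, 2): prefetch blocks 17 and 18
```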

jth-order Markov Predictor. A jth-order Markov predictor uses the last j requests from the sequence to make its prediction for the next request. So far, we have only used the last pair (offset interval, size) to predict the next request, which means that the predictor we have presented is a 1st-order one. Such a predictor will not be able to detect many regular access patterns where a request depends not only on the last offset interval, but on what happened in the previous j offset intervals. For this reason, we have implemented a whole family of IS PPM algorithms that take the order as a parameter. For instance, the 3rd-order algorithm keeps track of the last three pairs (offset interval, size); the only difference with the algorithm just described is that each node in the graph keeps 3 intervals and 3 sizes. Figure 3 shows the graph obtained by this predictor for the access pattern proposed in Figure 1. To distinguish the order of the algorithms, we add the order to the name of the predictor: the 1st-order predictor is called IS PPM:1, and IS PPM:3 is the name used for the 3rd-order one. These two instances of the family are the ones used in the experiments presented later in this paper.

[Figure 3. Final graph obtained by the 3rd-order predictor for the access pattern shown in Figure 1: two nodes, (I=3,S=3 / I=5,S=2 / I=3,S=3) and (I=5,S=2 / I=3,S=3 / I=5,S=2), linked in both directions with time stamps ti and ti+1.]
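For higher orders, the only change with respect to the sketch above is the key used to index the graph. The following hypothetical helper (an assumption of this text, following the same conventions as the earlier sketch) keeps a 3rd-order context as a tuple of the last three (interval, size) pairs:

```python
from collections import deque

class ContextJ:
    """Keeps the last j (interval, size) pairs as the graph key for IS PPM:j."""
    def __init__(self, j):
        self.j = j
        self.pairs = deque(maxlen=j)

    def push(self, interval, size):
        self.pairs.append((interval, size))

    def key(self):
        # Only a full context of j pairs can be used to index the graph.
        return tuple(self.pairs) if len(self.pairs) == self.j else None
```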

3. Linear Aggressive Prefetching

3.1. Aggressive Prefetching

In this section, we explain how to improve the performance of simple algorithms, such as the ones already presented, by making them aggressive. The idea is very simple: the prefetching mechanism prefetches blocks continuously as long as it can predict data that is not in the cache yet. On the other hand, if the system detects that a misprediction has occurred, it understands that the prediction path is not correct. This triggers a mechanism that updates the prefetching data structures (if needed), and the aggressive prefetching restarts once again from the mispredicted block. The number of prefetched blocks has no limitation except for the disk and network bandwidth.

Aggressive OBA (Agr OBA). The aggressive version of OBA simply consists of continuously prefetching blocks in a sequential way, starting from the last requested one and continuing until the end of the file. If a new request is made before the prefetching algorithm reaches the end of the file, and the requested block has not already been prefetched, the system restarts the prefetching from the new position of the file pointer. If the requested blocks have already been prefetched, the system prediction was correct and there is no need to modify the prefetching path; the system continues bringing new blocks as if the user had not requested any block.

Aggressive IS PPM (Agr IS PPM). The aggressive version of IS PPM:j works in a very similar way. After a user request, the prefetching mechanism decides which blocks to prefetch next, as in the non-aggressive version. Once these blocks have been prefetched, it behaves as if the user had already requested them and moves on to the next node in the graph. This keeps going until the next block to prefetch is beyond the end of the file, in which case the prefetching mechanism stops and waits for a new request. If the next user request is not the one predicted by the prefetching mechanism, the algorithm modifies the graph and restarts the prefetching sequence from the last requested block.

3.2. Limiting the Aggressiveness: Linear Aggressive Prefetching

It is obvious that there has to be a limit to the aggressiveness of the prefetching algorithms. For example, if the system had an extremely fast disk that could read the whole file between two requests, a very aggressive algorithm could be implemented: the system could prefetch the whole file after each request. This algorithm would not work well if files are larger than the cache, as a single prefetch would discard its own prefetched blocks. To limit the aggressiveness of this mechanism, we propose the linear aggressive algorithms.

The idea behind this limitation is to allow the prefetching of only a single block at a time for each file. We propose that a system with more than one disk should not use them to prefetch more than one block of the same file in parallel. The parallelism offered by multiple disks can still be exploited by prefetching blocks from different files: we may be prefetching as many blocks as there are disks, but always from different files.

A more intuitive way of aggressive prefetching consists of using the potential parallelism offered by the disks to prefetch several blocks of the same file in parallel [9, 15]. This way of prefetching achieves very good disk utilization when only one file is being accessed; in the same situation, our proposal leaves all disks but one unused. On the other hand, we believe that parallel machines are usually used to run many applications in parallel, minimizing the time during which only one file is being used. When as many files as disks are being accessed, our proposal achieves a similar disk utilization. Furthermore, it is a way to limit the aggressiveness, and the prefetching is more balanced, as blocks from more than one file are prefetched in parallel. We are not claiming that exploiting the parallelism of several disks results in bad system performance; we are only proposing a different way to do aggressive prefetching that limits the number of prefetched blocks by prefetching only one block per file at a time.
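The following sketch pictures the linear aggressive driver for one file, keeping at most one prefetch in flight per file at any time. The `predictor`, `cache` and `disk` interfaces are hypothetical and only serve the illustration; they are not the PAFS or xFS implementation.

```python
def linear_aggressive_prefetch(file, predictor, cache, disk):
    """Keep exactly one prefetch of this file in flight; ask the predictor
    for the next (first_block, size) and advance as if the user had
    already consumed it."""
    while True:
        target = predictor.predict()
        if target is None:
            return
        first, size = target
        if first >= file.num_blocks:
            return                                     # predicted past end of file: stop
        for block in range(first, min(first + size, file.num_blocks)):
            if not cache.contains(file, block):
                disk.read_low_priority(file, block)    # one block at a time, per file
                cache.wait_until_cached(file, block)   # enforces the linearity limit
        predictor.advance(first, size)                 # pretend the user requested it
```

On a misprediction (the real request differs from the predicted path), the caller would update the predictor's graph and restart this loop from the last requested block, as described above.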

4. Implementation Details

Once the prefetching algorithms have been designed, we need to test them on top of a file system with a cooperative cache. We have done so on two different file systems: PAFS [4, 6] and xFS [1]. These two file systems have been chosen because both implement a state-of-the-art cooperative cache, but it is important to notice that they have important design differences. In this paper, we do not aim to compare the two file systems but to evaluate the viability of the linear aggressive-prefetching algorithms we are presenting. A comparison between both systems may be found in earlier papers by our research group [4, 5, 6].

In PAFS, the management of a given file is handled by a single server. This kind of centralized management allows a simple implementation of the idea of linear aggressive prefetching. The server in charge of a file keeps all the prefetching information and can limit the number of simultaneously prefetched blocks to one. These servers have all the information about the access patterns because all requests have to go through them. All these factors favor the implementation of the proposed prefetching mechanism.

Implementing this kind of linear algorithm is not as simple in xFS if we want to maintain its original philosophy. In this system, each node is allowed to make its own decisions; nodes only contact a manager whenever external help is needed. Following this idea, it makes sense for each node to make its own prefetching decisions. Furthermore, only the local node can have full information about the access patterns performed by an application on a file; the manager only sees some of the operations, the ones that the local node cannot handle locally. This local prefetching presents a problem when implementing a linear aggressive prefetching. We can limit the number of blocks being prefetched by all the applications located in a single node, but there is no easy way to coordinate all nodes so as to avoid several prefetches of the same file being done in parallel. This problem could be solved using some external prefetcher, but this solution would modify the general philosophy of xFS. What we have done is to implement a linear aggressive prefetching per node: each node only prefetches one block per file at a time, but several blocks of the same file may be prefetched in parallel from different nodes. This is not a perfect implementation of our proposal, but it is the closest design we have been able to find without modifying the original philosophy of xFS. Furthermore, it will help us evaluate whether this linearity is a good idea or not.

There is one more implementation issue that should be explained before getting into the experimental results. It is clear that both read and write operations are more important than prefetching a block: a user-requested operation is always useful, while a prefetched block might not be used if it was mispredicted. For this reason, we have given a lower priority to prefetching operations. A block will never be prefetched if other operations are waiting to be performed on the same disk.
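This priority rule can be pictured as a two-level disk queue, sketched below with hypothetical types (not the actual PAFS/xFS scheduler): demand reads and writes are always served before any pending prefetch for the same disk.

```python
import collections

class DiskQueue:
    """Two-level queue: user requests first, prefetches only when the disk is idle."""
    def __init__(self):
        self.demand = collections.deque()    # user reads and writes
        self.prefetch = collections.deque()  # low-priority prefetch requests

    def submit_demand(self, req):
        self.demand.append(req)

    def submit_prefetch(self, req):
        self.prefetch.append(req)

    def next_request(self):
        if self.demand:
            return self.demand.popleft()
        if self.prefetch:
            return self.prefetch.popleft()
        return None
```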

5. Experimental Results

5.1. Methodology

Simulator. The file-system and cache simulator used in this project is part of DIMEMAS [12], which reproduces the behavior of a distributed-memory parallel machine. (DIMEMAS is a performance-prediction simulator developed by CEPBA-UPC and is available as a PALLAS GmbH product.) This software not only simulates the machine and the disk accesses, but also different short-term process-scheduling policies. The whole simulator is trace-driven, where traces contain CPU, communication and I/O demand sequences for every process instead of the absolute time for each event [4].

The communication and disk models implemented in the simulator are very important for understanding the results presented later. All communications are divided into two parts: a startup, or latency, and a data-transfer rate, or bandwidth. The startup is constant for each type of communication and is assumed to require CPU activity; it differs depending on whether the communication stays within a node or crosses the interconnection network. The data-transfer time is proportional to the size of the data sent and the interconnection-network bandwidth. The disk is also modeled by a latency and a bandwidth, where the latency depends on the kind of operation (read or write). Using this simulator, we have implemented all the proposed prefetching algorithms as part of PAFS and xFS. The simulator models all the servers, their communication, their CPU consumption and the interaction with the clients.

Workloads and Architectures. For our experiments, we used two sets of traces: CHARISMA [14] and Sprite [2]. The first one characterizes the behavior of a parallel machine (PM), while the second one characterizes the workload that may be found in a network of workstations (NOW).

The CHARISMA traces were taken from the Intel iPSC/860 at NASA Ames' Numerical Aerodynamic Simulation (NAS) facility. This multiprocessor has 128 computational nodes, each with 4 Mbytes of memory. The information kept in this trace file only contains the operations that went through the Concurrent File System; any I/O done through standard input and output or to the host file system was not recorded. The complete trace is about 156 hours long and was collected over a period of 3 weeks. To reduce the simulation time, we only used part of this trace file: a two-day period (February 15th and 16th), which was the busiest one in the trace. The first 10 hours were used to warm the cache, as we want to study the steady-state behavior.

The Sprite user community included about 30 full-time and 40 part-time users of the system. These traces list the activity of 48 client machines and some servers over a two-day period measured in the Sprite operating system. Although the trace is two days long, all measurements presented in this paper are taken from the 15th hour to the 48th hour, because we used the first fifteen hours to warm the cache.

As we have two different workloads, we also simulated two basic architectures. While the CHARISMA trace file has been tested on a parallel machine (PM), the Sprite trace file has been simulated on a NOW similar to the one used by Dahlin et al. [8]. The parameters used to simulate both architectures are shown in Table 1.

Table 1. Simulation parameters.

                                 PM          NOW
  Nodes                          128         50
  Buffer Size                    8 KB        8 KB
  Memory Bandwidth               500 MB/s    40 MB/s
  Network Bandwidth              200 MB/s    19.4 MB/s
  Local-Port Startup             2 µs        50 µs
  Remote-Port Startup            10 µs       100 µs
  Local Memory-copy Startup      1 µs        25 µs
  Remote Memory-copy Startup     5 µs        50 µs
  Number of Disks                16          8
  Disk-Block Size                8 KB        8 KB
  Disk Bandwidth                 10 MB/s     10 MB/s
  Disk Read Seek                 10.5 ms     10.5 ms
  Disk Write Seek                12.5 ms     12.5 ms
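As a rough illustration of the latency-plus-bandwidth models described above (using the Table 1 parameters, assuming the startup values are in microseconds and treating 1 MB as 10^6 bytes, which are simplifications made for this example only):

```python
def transfer_time_us(startup_us, size_bytes, bandwidth_mb_s):
    """Simple communication model: time = startup + size / bandwidth."""
    return startup_us + size_bytes / (bandwidth_mb_s * 1e6) * 1e6  # microseconds

# Sending one 8 KB block across the interconnection network:
print(transfer_time_us(10, 8 * 1024, 200))    # PM remote port:  ~51 microseconds
print(transfer_time_us(100, 8 * 1024, 19.4))  # NOW remote port: ~522 microseconds

def disk_read_time_ms(seek_ms, size_bytes, bandwidth_mb_s):
    """Simple disk model: time = seek latency + size / bandwidth."""
    return seek_ms + size_bytes / (bandwidth_mb_s * 1e6) * 1e3     # milliseconds

print(disk_read_time_ms(10.5, 8 * 1024, 10))  # ~11.3 ms for an 8 KB disk read
```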

5.2. Performance Results

This section is divided into four subsections. The first two describe the performance obtained by the CHARISMA workload when run on both PAFS and xFS. The other two describe the performance of Sprite, also under both file systems. It is important to notice that only the performance of read operations is presented, as write operations are not specially affected by an increase in the hit ratio [5, 4]; in any case, both have been simulated and taken into account.

CHARISMA (PM) on PAFS. The first experimental result is the read performance obtained by all the prefetching algorithms described in this paper. Figure 4 shows the average read time obtained by PAFS under the CHARISMA workload. These times have been measured with cache sizes that vary from 1 to 16 Mbytes per node.

[Figure 4. Average read time (in milliseconds) obtained running the CHARISMA workload under PAFS, for "local cache" sizes of 1 to 16 Mbytes and the algorithms NP, OBA, Ln_Agr_OBA, IS_PPM:1, Ln_Agr_IS_PPM:1, IS_PPM:3 and Ln_Agr_IS_PPM:3.]

The first thing that can be observed is that prefetching is a good solution for increasing the performance of parallel/distributed file systems: all prefetching algorithms achieve better performance than the original system where no prefetching was done (NP).

If we examine the graph in more detail, we observe that the prefetching algorithms can be divided into three groups. The first one, which only contains OBA, achieves the worst performance. The reason behind this small improvement is that the algorithm is too conservative: only one block is prefetched per request, so only a few blocks are prefetched when many more could have been brought from disk.

The second group of algorithms is made up of IS PPM:1 and IS PPM:3. They achieve very similar performance, which is much better than that obtained by OBA. These algorithms decrease the average read time very significantly for two reasons. The first is that they are intelligent enough to predict most of the access patterns used by the applications. The second is that they behave in a quite aggressive manner: they not only predict the offset of the next request, but also its size, which means that in a workload with large requests the number of prefetched blocks will also be large. This is the case of the CHARISMA workload, which has many large user requests. Another interesting result that can be extracted from this group is that the order of the Markov predictor does not make a significant difference in its effectiveness; most access patterns can be predicted with a first-order Markov predictor.

Finally, all the aggressive algorithms make up the third group. These algorithms achieve the best performance results: they nearly double the performance of the previous group and perform up to 4.6 times faster than with no prefetching when large caches are used. The controlled aggressiveness of these algorithms is the key to their excellent performance. Remember that these algorithms prefetch aggressively, but only one block per file at a time.

In this third group of algorithms, the behavior depends on whether small or large caches are used. In the first case, Ln Agr OBA achieves better performance than the other two algorithms. On the other hand, when caches are larger than 4 Mbytes, Ln Agr IS PPM:1 and Ln Agr IS PPM:3 obtain better read times. To explain this behavior we first need to describe a very frequently used access pattern. Many of the applications only access the first part of a file and leave the last portion unaccessed. Furthermore, the access to the first part is done using a pattern that usually ends up touching all blocks in this first part (not necessarily sequentially). As Ln Agr OBA reads all the blocks in a sequential way, it seldom accesses blocks located in the last part of the file, which will never be requested; it accesses all the blocks in the first part of the file, which might be requested later on. On the other hand, Ln Agr IS PPM can make large jumps and start prefetching from the non-accessed part. These prediction errors are quite important if the caches are small, because useful blocks are discarded to keep useless ones. When caches are large enough to tolerate these mispredictions without discarding useful blocks, the Ln Agr IS PPM algorithms behave better, because they have a better knowledge of the access pattern and predict the blocks more accurately and at the right time.

CHARISMA (PM) on xFS. Let us now perform the same experiment but using xFS as the base file system. The average read times obtained in these simulations are presented in Figure 5.

[Figure 5. Average read time (in milliseconds) obtained running the CHARISMA workload under xFS, for the same cache sizes and algorithms as in Figure 4.]

The first thing we observe is that there are only two groups of algorithms: one made up of OBA and another that contains the rest of the algorithms. As in the previous subsection, OBA increases the performance, but the gain is very small. This behavior can be explained by the same reason mentioned earlier: it is too conservative. The second group, made up of all the other algorithms, achieves better performance, but the gain is not as large as the one observed under PAFS. The reason behind this lack of improvement is that these aggressive algorithms are not really controlled by prefetching only one block per file at a time; remember that this limitation was not easy to implement in a system such as xFS. As a result, too many blocks are prefetched and the cache is flooded. We have observed that in the xFS executions the number of prefetched blocks doubles the number observed in the PAFS executions. This means that many useful blocks are discarded by other blocks that might be needed only after the discarded ones, or that might never be used at all. This shows that controlling the aggressiveness of prefetching algorithms by making them linear is a good idea.

If we examine the behavior of this last group in greater detail, we can observe that the behavior with small caches is not the same as with large caches. In the first case, the aggressive algorithms end up prefetching too many blocks, which flood the cache. This allows the less-aggressive algorithms, such as IS PPM:1 and IS PPM:3, to achieve better read times. On the other hand, when the cache is large enough, these extra prefetches become less important and the aggressive algorithms achieve better performance. We have to keep in mind that many more prefetches are done under xFS, as no truly linear aggressive algorithms could be implemented.

It is interesting to see that there is an exception to the above explanation. If we examine the graph, we can see that in the 1-Mbyte executions both Ln Agr IS PPM:1 and Ln Agr IS PPM:3 behave better than any of the others. This should not happen, as the cache is very small. The reason behind this odd behavior is a curious side effect. On one hand, the aggressive algorithms prefetch blocks continuously, regardless of the block that is currently being requested. On the other hand, the same file blocks are requested by more than one process at different instants of time. The curious effect is that the blocks prefetched for a given process reach the cache when another process really needs them. This synchronization only happens with very small caches and when this workload is used. Anyway, this is just an anecdotal detail, as we are interested in large caches due to the rapid increase in memory sizes.

Sprite (NOW) on PAFS. To extend the results presented so far, we have also tested the prefetching algorithms on another architecture and using a different workload. Figure 6 presents the average read time obtained by the different prefetching algorithms when the Sprite workload was run on PAFS. This experiment has also been done for different cache sizes.

[Figure 6. Average read time (in milliseconds) obtained running the Sprite workload under PAFS.]

In this graph, we can see that both Ln Agr IS PPM algorithms obtain the best performance, as they are intelligent enough to detect the access patterns and aggressive enough to prefetch many needed blocks. We can also observe that the non-aggressive versions of IS PPM do not achieve results as impressive as in the CHARISMA experiment. This is due to their lack of aggressiveness: with the CHARISMA workload, many requests were very large, which made the prefetching algorithm behave quite aggressively, while with this workload requests are much smaller and thus the aggressiveness is also much lower.

Regarding the lack of performance of Ln Agr OBA, there is a very simple explanation: the algorithm is not smart enough to correctly detect the access patterns used by the applications. As an example, with a 4-Mbyte cache, Ln Agr OBA has a misprediction ratio of 32%, while Ln Agr IS PPM only mispredicts 15% of the prefetched blocks. This means that Ln Agr OBA wastes much time prefetching useless blocks instead of prefetching the ones really needed. Finally, we can also observe that the improvements obtained by all prefetching algorithms are not as large as the ones observed under the CHARISMA workload. This is basically due to the small size of the files: as there are many files that only have a few blocks, and our algorithms cannot predict the first block, the percentage of non-prefetched blocks increases.

Sprite (NOW) on xFS. The last experiment we have performed is to run the Sprite workload on xFS with all the prefetching algorithms. The results are presented in Figure 7.

[Figure 7. Average read time (in milliseconds) obtained running the Sprite workload under xFS.]

We can see that, in this case, there is not much difference between the results obtained by PAFS (linear) and the ones obtained by xFS (not really linear). This can be easily explained by examining the trace files. In the Sprite workload, there is very little file sharing, which means that the non-linear aggressive algorithms behave much like their linear versions even though they are not linear. Another important characteristic of the workload is the size of the files: as they are quite small, they fit in the cache with no problem. For this reason, even when some file sharing occurs, a block will not be brought in several times for different processes, even with the non-linear algorithms.

5.3. Disk Traffic

So far, we have only been interested in the performance achieved by the linear aggressive algorithms, but this is only one part of the story. It is also important to study how much the disk traffic is increased by these prefetching algorithms: if the disks saturate, this might end up slowing down the whole system. For this reason, we have also studied the increase in the number of disk accesses due to the prefetching algorithms described in this paper.

Figure 8 presents the number of accessed blocks during the execution of the CHARISMA workload on top of PAFS. In the graph, we have drawn three bars that correspond to the number of accessed blocks when each of the aggressive algorithms was tested; the horizontal line represents the number of blocks accessed by the system when no prefetching was done. We can see that the extra number of disk blocks accessed is not very high except for very small caches (1 Mbyte). This means that this kind of prefetching can be an adequate solution, as good performance is obtained without much disk traffic. Furthermore, we can observe that there are cases where this traffic is even lower than when no prefetching is done. The reason behind this counter-intuitive behavior is quite simple. On one hand, the number of blocks read from disk increases with the linear aggressive-prefetching algorithms. On the other hand, the number of blocks written to disk decreases by more than the increase observed in disk reads.

[Figure 8. Number of disk accesses (in millions) performed when running the CHARISMA workload under PAFS, for "local-cache" sizes of 1 to 16 Mbytes; bars show Ln_Agr_OBA, Ln_Agr_IS_PPM:1 and Ln_Agr_IS_PPM:3, and a horizontal line shows NP.]

Let us explain the reason behind this lower number of disk writes. Many blocks are modified many times during their life in the cache. For fault-tolerance reasons, these blocks are periodically sent to the disk. If they remain in the cache for a long time and are modified frequently, they will be written to the disk many times; if they are in the cache for only a small amount of time, they will not be written more than once or twice. This is what happens in this experiment: as the applications run much faster with aggressive prefetching algorithms, the time a block is active in the cache is also smaller and thus it is sent to the disk fewer times. To show this effect, we present the average number of times a block is written to disk in Table 2.

Table 2. Average number of times a block is written to disk when running the CHARISMA workload under PAFS.

                      1MB    2MB    4MB    8MB    16MB
  NP                  5.9    8.8    11.7   11.7   11.7
  Ln Agr OBA          5.2    7.9    10.4   10.9   11.0
  Ln Agr IS PPM:1     4.2    7.2    10.4   10.5   10.6
  Ln Agr IS PPM:3     4.0    7.6    10.1   10.5   10.5

In Figure 9, we also present the number of disk accesses done by CHARISMA when run on top of xFS. The results are very similar, but in this case the aggressive algorithms always perform more disk accesses than the execution with no prefetching. This is because the algorithms are not really linear, so the extra read accesses done by the unlimited aggressive prefetching outweigh the decrease in the number of disk writes.

[Figure 9. Number of disk accesses (in millions) performed when running the CHARISMA workload under xFS.]

Finally, we present the same measure using the Sprite workload (Figures 10 and 11). We can also observe that these aggressive prefetching algorithms do not increase the disk traffic too much. This shows that these algorithms are adequate for achieving better performance without loading the disk too much.

[Figure 10. Number of disk accesses (in millions) performed when running the Sprite workload under PAFS.]

[Figure 11. Number of disk accesses (in millions) performed when running the Sprite workload under xFS.]

6. Conclusions

The first achievement of this work is the proposal of the IS PPM family of prefetching algorithms. They are prefetching algorithms based on Markov models that base their predictions not on the accessed blocks but on the intervals and sizes of the requested data. We have also presented a mechanism that transforms very simple algorithms into aggressive ones that control their aggressiveness in a very simple way, and we have shown that these linear aggressive algorithms perform much better than their non-aggressive versions. It is also important to remark that, although the proposed algorithms are quite aggressive, they do not load the disk much more than the system with no prefetching; there are even some cases where such algorithms decrease the disk traffic while still achieving excellent performance. Finally, although we first thought that such aggressive algorithms would only work with large caches, we have seen that this is not necessarily true: very good performance results have also been obtained with fairly small ones.

References

[1] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli, and R. Y. Wang. Serverless network file systems. In Proceedings of the 15th Symposium on Operating Systems Principles, pages 109-126, December 1995.
[2] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a distributed file system. In Proceedings of the 13th Symposium on Operating Systems Principles, pages 198-212. ACM Press, July 1991.
[3] M. Brodowicz and O. Johnson. Paradise: An advanced featured parallel file system. In Proceedings of the International Conference on Supercomputing, pages 220-226. ACM Press, July 1998.
[4] T. Cortes. Cooperative Caching and Prefetching in Parallel/Distributed File Systems. PhD thesis, Universitat Politècnica de Catalunya, Departament d'Arquitectura de Computadors, December 1997. http://www.ac.upc.es/homes/toni/thesis.html.
[5] T. Cortes, S. Girona, and J. Labarta. Design issues of a cooperative cache with no coherence problems. In Proceedings of the 5th Workshop on I/O in Parallel and Distributed Systems, pages 37-46, November 1997.
[6] T. Cortes and J. Labarta. PAFS: a new generation in cooperative caching. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, pages I:267-274. CSREA Press, July 1998.
[7] K. M. Curewitz, P. Krishnan, and J. S. Vitter. Practical prefetching via data compression. In Proceedings of the SIGMOD International Conference on Management of Data, pages 257-266. ACM Press, May 1993.
[8] M. D. Dahlin, R. Y. Wang, T. E. Anderson, and D. A. Patterson. Cooperative caching: Using remote client memory to improve file system performance. In Proceedings of the 1st Symposium on Operating Systems Design and Implementation, pages 267-280. USENIX Association, November 1994.
[9] T. Kimbrel, A. Tomkins, R. H. Patterson, B. Bershad, P. Cao, E. W. Felten, G. A. Gibson, A. Karlin, and K. Li. A trace-driven comparison of algorithms for parallel prefetching and caching. In Proceedings of the 2nd Symposium on Operating Systems Design and Implementation, pages 19-34. USENIX Association, October 1996.
[10] D. Kotz and C. S. Ellis. Prefetching in file systems for MIMD multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(2):218-230, April 1990.
[11] T. M. Kroeger and D. D. E. Long. Predicting future file-system actions from prior events. In Proceedings of the USENIX Annual Technical Conference. USENIX Association, January 1996.
[12] J. Labarta, S. Girona, V. Pillet, T. Cortes, and L. Gregoris. DiP: A parallel program development environment. In Proceedings of the 2nd International Euro-Par Conference, pages II:665-674, August 1996.
[13] T. M. Madhyastha and D. A. Reed. Input/output pattern classification using hidden Markov models. In Proceedings of the 5th Workshop on I/O in Parallel and Distributed Systems, pages 57-67. ACM Press, November 1997.
[14] N. Nieuwejaar, D. Kotz, A. Purakayastha, C. S. Ellis, and M. L. Best. File-access characteristics of parallel scientific workloads. IEEE Transactions on Parallel and Distributed Systems, 7(10):1075-1089, October 1996.
[15] R. H. Patterson and G. A. Gibson. Exposing I/O concurrency with informed prefetching. In Proceedings of the 3rd International Conference on Parallel and Distributed Information Systems, pages 7-16, 1994.
[16] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Zelenka. Informed prefetching and caching. In Proceedings of the 15th Symposium on Operating Systems Principles, pages 79-95. ACM Press, 1995.
[17] A. J. Smith. Disk cache - miss ratio analysis and design considerations. ACM Transactions on Computer Systems, 3(3):161-203, August 1985.
[18] J. S. Vitter and P. Krishnan. Optimal prefetching via data compression. Journal of the ACM, 43(5):771-793, September 1996.