Data and Program Restructuring of Irregular Applications for Cache-Coherent Multiprocessors

Karen A. Tomko and Santosh G. Abraham†
Department of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor, MI 48109-2122
Email:
[email protected]
Abstract

Applications with irregular data structures such as sparse matrices or finite element meshes account for a large fraction of engineering and scientific applications. Domain decomposition techniques are commonly used to partition these applications to reduce interprocessor communication on message passing parallel systems. Our work investigates the use of domain decomposition techniques on cache-coherent parallel systems. Many good domain decomposition algorithms are now available. We show that further application improvements are attainable using data and program restructuring in conjunction with domain decomposition. We give techniques for data layout to reduce communication, blocking with subdomains to improve uniprocessor cache behavior, and insertion of prefetches to hide the latency of interprocessor communication. This paper details our restructuring techniques and provides experimental results on the KSR1 multiprocessor for a sparse matrix application. The experimental results include counts of cache misses provided by the KSR PMON performance monitoring tool. Our data show that cache coherency traffic can be reduced by 30%-60% using our data layout scheme and that more than 53% of the remaining coherency cache misses can be eliminated using prefetch instructions.
1 Introduction

Applications with irregular data structures such as sparse matrices or finite element meshes account for a large fraction of engineering and scientific applications. Domain decomposition techniques are commonly used to partition these applications to reduce interprocessor communication on parallel systems. Many good domain decomposition algorithms are now available. However, data partitioning using domain decomposition is just one step in the process of optimizing a code for parallel execution. The problem from a compiler standpoint is: given a domain decomposition algorithm, an application, and a multiprocessor, how should the code and data be structured to achieve high performance?

The University of Michigan Center for Parallel Computing is partially funded by NSF grant CDA-92-14296, and this research was supported in part by ONR grant N00014-93-1-0163.
† Currently, Santosh Abraham is with Hewlett-Packard Laboratories, Palo Alto, CA 94303-0969.

In this paper, we consider the problem of optimizing irregular applications for a cache-coherent multiprocessor system such as the KSR1. If the code and data are naively parallelized, the application will not take advantage of the memory hierarchy of the multiprocessor. We concentrate on data and code restructuring techniques that reduce capacity and coherence cache misses by taking into account the cache and cache line sizes of the multiprocessor system.

Other researchers are developing more efficient and effective methods of domain decomposition; that work complements our own. We use recursive spectral bisection (RSB) decomposition for the experiments in this paper, but our methods are not tied to any one decomposition method, so the decomposition algorithm can be chosen based on application requirements.

Other researchers are also investigating compiler techniques for irregular codes. Much of that work has been targeted toward message passing machines and efficient message generation for those architectures. The goal is the same for a message passing architecture as for a cache-coherent architecture: minimize interprocessor communication while balancing load. However, the means to the goal differ for the two architectures. On a message passing architecture, communication is explicit, and the frequency of messages, not the amount of data transferred, often dominates the communication cost. On a cache-coherent multiprocessor, communication is implicit: it is determined by the layout of data in memory and the access pattern of the application. The communication cost is the number of cache lines transferred between processors, i.e.,
cache coherency traffic, which is a function of both the amount of data that must be transferred and the utilization of data within the cache line.

The rest of this paper is organized as follows. We discuss related work. We give some background material on domain decomposition algorithms and finite element meshes. Then we describe our program model for finite element applications and our machine model. We follow with a presentation of our techniques for optimizing the application and give experimental results for a finite element application.
2 Related Work

Our methods for optimizing irregular code for cache-coherent multiprocessors draw from many related research areas. The work closest to our own is the compilation of irregular codes for message passing machines and the compilation of regular codes for cache-coherent machines. Additionally, we make use of domain decomposition algorithms, which have primarily been developed by researchers in the application area.

Many research groups are working on run-time compilation of irregular applications for message passing machines. The application analysis required for message passing machines is similar to that required for shared memory machines. However, on a shared-address system, the size of messages is fixed to the cache line size and the scheduling of messages (the timing of inter-cache transfers) is determined by the data layout and access pattern. Saltz et al. [15] have developed a run-time library, PARTI, which has been used to facilitate much research on compiling for irregular applications. PARTI provides primitives to gather and scatter data, map local to global indices of distributed arrays, and determine which data items must be fetched from or transferred to another processor. We also determine which data items must be fetched from or transferred to another processor; currently we get this information from our domain decomposition tools, but the PARTI primitives may be useful for this purpose in the future. Since our machine model has a single address space, local-to-global mapping is not necessary. Ponnusamy et al. [16] and Lu and Chen [14] give methods for building an iteration dependence graph at runtime which is partitioned for parallel execution. We also generate a graph at runtime, but it represents the communication dependencies in the data structure, not dependencies between loop iterations. In the applications that we have studied in depth, the communication dependencies are defined by a finite element mesh or sparse matrix.
We use a domain decomposition tool to partition the graph and determine the communication dependencies between the domains. Brezany, Gerndt, Sipkova and Zima have incorporated generation of calls to the PARTI library into their SUPERB compiler [5]. Hanxleden, Kennedy, Koelbel, Das and Saltz have incorporated PARTI into the FORTRAN D compiler and use some heuristics to optimize communication performance [11]. The program analysis in these two compilers is similar to the analysis that is required to automate our irregular partitioning scheme; we intend to use what we can from this work in the future. The communication optimizations in the SUPERB and FORTRAN D compilers are designed for message passing machines and are not directly useful for our machine model.

Automatic partitioning techniques for regular parallel loops on cache-coherent processors which minimize coherency traffic have been developed by Hudak and Abraham [12] and by Agarwal, Kranz and Natarajan [1]. These techniques find optimal partitions for programs with linear array subscript expressions. However, the mathematical approaches used do not extend to the indirect array references common in irregular code. We use domain decomposition algorithms to partition irregular codes containing indirect accesses and then apply optimizations to minimize coherency traffic.

Blocking and data prefetching are common techniques for improving memory hierarchy performance on uniprocessors as well as multiprocessors. Gannon, Jalby and Gallivan [9] and Wolf and Lam [20] describe methods for improving uniprocessor cache performance using blocking for regular applications. Temam and Jalby [18] model the cache
behavior of sparse codes and propose a diagonal blocking technique. We propose a blocking technique that uses domain decomposition to partition the data assigned to each processor into blocks. Our results agree with those in [18]: blocking in sparse codes is beneficial only when unavoidable capacity misses¹ are not dominant. Gornish, Granston and Veidenbaum [10] and Callahan, Kennedy and Porterfield [6] both propose compiler algorithms for inserting prefetch instructions in regular dense applications. Neither algorithm takes indirect array accesses into account. Windheiser et al. [19] evaluate the use of prefetch and poststore commands in a sparse matrix application. Similarly, we insert prefetches into a sparse matrix application. In addition, we use information provided by the domain decomposition algorithm to identify the data that will cause coherency misses and thus should be prefetched.

We use domain decomposition algorithms to partition the data structures of irregular applications. Many domain decomposition algorithms have been proposed. Some examples are binary decomposition [3], recursive coordinate bisection (RCB), recursive graph bisection (RGB), recursive spectral bisection (RSB) decomposition [2], [17], greedy algorithms [8] and simulated annealing. These algorithms partition a graph representing the data structure or communication structure of the program with the goal of evenly distributing computational load amongst processors and minimizing the amount of data shared between two or more domains. We assume that domain decomposition is performed using one of these algorithms or a similar algorithm. Our optimizations are then applied in the context of a given data set, decomposition and application.

[Figure 1: A Simple Finite Element Mesh — triangular elements a-f interconnected at nodes 1-7.]
¹ Unavoidable capacity misses are misses that occur on the first access to a datum within the control loop, assuming the total data set is too large for the cache.

3 Finite Element Meshes and Domain Decomposition

A finite element mesh is a discrete representation of a continuous structure used as an approximate representation of the entire structure. The continuum is divided into discrete elements that may represent two-dimensional or three-dimensional objects. The elements are simple geometric shapes that are put together to represent the complex structures being modeled. Applications that model the surfaces of an object typically use two-dimensional elements such as 3-node triangular elements or 4-node rectangular elements, while applications modeling solid bodies typically use three-dimensional elements such as 4-node tetrahedral elements and 8-node brick elements. The elements are interconnected at points called nodes and have interactions at shared
    read input data
    call domain decomposition algorithm
    for time = 1 to max_cycles
        Parallel Region
            node or element vector updates
        End Parallel Region
        . . .
        Parallel Region
            node or element vector updates
        End Parallel Region

Figure 3: FE Application Template
nodes or along shared edges. The graph of a finite element mesh consists of nodes and elements, where the nodes are vertices of the graph and the elements are the faces of the graph. In Figure 1, a, b, c, d, e, and f are triangular elements and 1, 2, 3, 4, 5, 6, and 7 are nodes. Elements a and b are connected to each other at nodes 1 and 3. A finite element mesh from a radiation modeling application is given in Figure 2. The three-dimensional sphere in the figure is composed of approximately 3000 four-node tetrahedral elements; the lines in the figure represent the edges of the tetrahedral elements.

[Figure 2: Finite Element Mesh from a Radiation Modeling Application]

A domain decomposition algorithm partitions and assigns the nodes and elements of a finite element mesh to a set of processors. The graph partitioning and mapping of subgraphs to processors is commonly referred to in the literature as the mapping problem. For a given partition and assignment the application performance depends on computation and communication costs. The computation cost is a function of the number and type of elements and the number of nodes assigned to each processor, and is minimized if all processors are assigned equal amounts of work. The communication cost is minimized if the number of boundary nodes and boundary elements is minimized. A node assigned to a processor is a boundary node if it borders an element which is assigned to a different processor. Likewise, an element assigned to a processor is a boundary element if the element contains a node that is assigned to another processor. In Figure 1, if the mesh is partitioned into three domains D1 = {a, c, 1, 2, 3}, D2 = {e, f, 5, 6}, and D3 = {b, d, 4, 7}, then nodes 1, 3, 5, and 7 are boundary nodes and elements b, c, d, e, and f are boundary elements.
If we choose a different partition for the mesh in Figure 1, for example D1 = {a, c, 1, 2, 3}, D2 = {b, e, 5, 6}, and D3 = {d, f, 4, 7}, then nodes 1, 3, 4, 5, and 6 are boundary nodes and elements b, c, d, e, and f are boundary elements. This partition has one more boundary node that must be communicated between processors than the previous partition.
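The boundary definitions above lend themselves to a direct computation. The following sketch counts boundary nodes and elements for a given partition; the three-element triangle strip and the ownership maps used here are hypothetical illustrations, not the mesh of Figure 1.

```python
def boundaries(elements, elem_owner, node_owner):
    """Boundary nodes border an element owned by another domain;
    boundary elements contain a node owned by another domain."""
    bnodes = {n for e, ns in elements.items() for n in ns
              if node_owner[n] != elem_owner[e]}
    belems = {e for e, ns in elements.items()
              if any(node_owner[n] != elem_owner[e] for n in ns)}
    return bnodes, belems

# Hypothetical triangle strip: elements a, b, c over nodes 1..5.
mesh = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5]}
eown = {"a": "D1", "b": "D2", "c": "D2"}
nown = {1: "D1", 2: "D1", 3: "D2", 4: "D2", 5: "D2"}
```

For this toy partition, `boundaries(mesh, eown, nown)` yields boundary nodes {2, 3} and boundary elements {"a", "b"}; a decomposition algorithm tries to keep both sets small while balancing the element count per domain.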
4 Finite Element Application Model

Our research group has experience with two production applications which use the finite element (FE) method: an engineering simulation and a radiation modeling application. We base our program model on these two applications. Our FE application model has a single iterative loop, called the control loop, that dominates the execution time of the application. For example, the control loop of an application may iterate for a number of time steps, or it may iterate until some convergence criterion is met. Within each iteration of the control loop, each variable is updated at most once but may be referenced several times. For example, if the data is represented by vectors of node or element data, then each vector has a single update per control loop iteration. Additionally, we assume that the relationships between the variables are static with respect to the control loop. For example, the interaction between nodes and elements does not vary with respect to the control loop. The control loop may contain multiple parallel regions separated by sequential regions, where each parallel region performs updates to different vectors. For simplicity we assume a single parallel region during the description of our optimization techniques and give an explanation of how to apply the techniques when there are multiple parallel regions in Section 6.4. A template for our program model is given in Figure 3. The assumption that the relationships between variables are static is not strictly correct for our engineering simulation. Techniques to address the dynamic nature of the application will be considered in future work and are beyond the scope of this paper.
5 Machine Model

We use a shared address space MIMD multiprocessor as our machine model and assume a hardware-coherent memory hierarchy consisting of a local cache, possibly a local memory, and remote memory, where the latency to access remote memory is on the order of hundreds of cycles. Remote memory consists of the memory or cache storage local to other processors, and communication between processors occurs when remote memory is accessed. This model covers a wide spectrum of machine types including distributed shared memory architectures such as the Stanford DASH, bus-based multiprocessors such as the Sequent Symmetry and SGI Power Challenge series, and cache-only memory architectures (COMA) like the Kendall Square Research KSR1 and KSR2.

The Kendall Square Research KSR1 was used to evaluate our methods. We give a brief description of the architecture here, much of which has been taken from [19, 4]. The KSR1 is characterized by a hierarchical ring interconnection network and a cache-only memory architecture. Each cell, consisting of a 20 megahertz processor, a 512 kilobyte subcache, and a 32 megabyte local cache, is connected to a unidirectional pipelined slotted ring. Up to 32 processors may be connected to each ring, and multiple rings may be connected in a hierarchy of rings. We ran our experiments on a 64 processor, two ring system. The local cache subblock, called a subpage, is 128 bytes and serves as the unit of transfer between processors. Communication requests by any processor proceed around the ring in the direction of ring communication. Such requests are seen by all processors as they pass by, enabling the hardware cache management system to maintain memory coherency. Given a P processor system, when a processor i makes a read request on the ring, the first processor encountered which has a valid copy of the subpage, processor j, will respond by placing a copy of the subpage on the ring. Every processor encountered between j and i which contains an invalid copy of the subpage can optionally update its copy automatically as the request passes on its return path to the requesting processor i (referred to as automatic update). Likewise, if a processor needs to write to a shared copy of a subpage, it must send a transaction around the ring requesting that each processor with a copy of the subpage in its local cache mark the subpage as invalid. The KSR1 processor provides two instructions for explicitly hiding the latency of interprocessor transactions, prefetch and poststore. The prefetch instruction requests that a copy of a given subpage be moved into the local cache of the requesting processor.
The poststore instruction acts as a selective broadcast by causing the local cache to send a copy of a subpage out for one complete tour of the ring. Automatic updates occur for prefetch and poststore transactions as described above.

A shared-address space multiprocessor such as the KSR1 provides a simple programming paradigm to the user: there is no need to explicitly assign data to processors or to determine when data should be sent to or received from another processor. However, because remote data accesses may take an order of magnitude longer than local data accesses, high performance can only be achieved when remote accesses are minimal. On the KSR1, remote accesses take 135-175 processor cycles within a single ring of the system. Remote accesses occur implicitly whenever a new or an invalid data item is referenced by the processor and whenever a shared data item is written by the processor.
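Whatever mechanism issues the prefetch instructions, the processor must first know which subpages hold remote boundary data. The bookkeeping can be sketched as follows (Python for illustration; the helper name is hypothetical, and 8-byte data is assumed so that a 128-byte subpage holds 16 words):

```python
WORDS_PER_SUBPAGE = 16  # 128-byte KSR1 subpage / 8-byte word (assumed)

def subpages_to_prefetch(remote_ranges):
    """Map [start, end) word-index ranges of remote boundary data to the
    distinct subpage indices a processor would prefetch at the start of
    a control-loop iteration."""
    pages = set()
    for start, end in remote_ranges:
        pages.update(range(start // WORDS_PER_SUBPAGE,
                           (end - 1) // WORDS_PER_SUBPAGE + 1))
    return sorted(pages)
```

With the data layout of Section 6.1, each neighbor's boundary block occupies whole subpages, so the ranges rarely straddle extra lines and the prefetch list stays short.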
6 Optimizing Performance

We perform three types of optimizations to improve the performance of applications with irregular data sets. Our optimizations restructure the data layout and the code structure of the application in order to minimize interprocessor communication, improve uniprocessor cache behavior, and hide communication latency. While code restructuring can be done at compile time, data layout must be performed at run time because the sizes and contents of the arrays depend on the input data set. Additional run-time parameters such as the number of processors and the cache and cache line sizes are also required to perform the layout. After the data set
is partitioned into domains, the data layout can be done once; thus the cost of doing the data layout is amortized over many iterations of the control loop. In this section we give some background terminology used in describing our optimizations and then detail our strategy.

The following terminology is commonly used when discussing memory hierarchy performance. Memory reuse occurs when a data item or cache line is accessed in more than one iteration of a loop or within more than one loop. Temporal reuse occurs when the same datum is reused. Spatial reuse occurs when data in the same cache line are reused. A reuse is exploited when the data is retained in the cache between successive accesses. Data layout restructuring promotes exploitation of spatial reuse by controlling the assignment of data to cache lines such that reuse within a cache line occurs. Blocking promotes temporal reuse by limiting the active data set to the size of the cache, so that the next time a datum is accessed it is still present in the cache.

Throughout this section we use the following conventions when referring to objects in the finite element mesh. A mesh is the complete finite element graph for a given application data set. A submesh is a subgraph of a mesh. A domain is the submesh assigned to an individual processor. A subdomain is a submesh of a domain: a domain is further partitioned into subdomains to optimize uniprocessor cache behavior, and each processor may be assigned several subdomains which together make up its domain. Each element/node within a domain has several associated data fields, only some of which are used in the computation of neighboring nodes/elements. Inter-domain boundary data consists of the data fields associated with boundary nodes and elements on the border between two or more domains, i.e., data which introduces interprocessor communication.
Intra-domain boundary data consists of the data fields associated with boundary nodes and elements on the border between two or more subdomains that are assigned to the same processor. Intra-domain boundaries do not induce interprocessor communication.
6.1 Minimizing Interprocessor Communication
The amount of communication between processors is directly proportional to the size of the inter-domain boundary data. If the data is not laid out well in memory, there may be only a single word of boundary data in a given cache line, so a cache line must be transferred between processors for each boundary node datum. However, if the data is laid out properly, several boundary data are grouped into each cache line, taking advantage of spatial locality. In order to minimize interprocessor communication we reorder the node and element vectors which are involved in boundary interactions as follows. The data for each vector is grouped first by domain; thus, all of the data assigned to domain D_i has a contiguous index range. Within each domain the data falls into three categories: interior data, boundary data shared with one other domain, and boundary data shared with two or more other domains. Data d_ij is the data assigned to processor i which borders only domain j, and data d_i* is the data assigned to processor i that borders at least two other domains. Within each domain the data is ordered by grouping the two-way shared boundary data by neighboring domain. For example, all boundary data d_ij are laid out consecutively in memory, starting at the beginning of a cache line. Cache lines are padded with interior data when the number of boundary data shared with domain j is not an integral multiple of the cache line size.
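As a concrete sketch of this ordering (Python for illustration; `two_way` maps each neighboring domain j to that neighbor's d_ij items, and a cache line is assumed to hold 16 words):

```python
CACHE_LINE = 16  # words per line; a 128-byte subpage of 8-byte words (assumed)

def layout_domain(two_way, multiway, interior):
    """Order one domain's vector data: each neighbor's two-way boundary
    block starts on a cache-line boundary and is padded out with interior
    data; the multiway block d_i* follows, then the remaining interior."""
    order, pool = [], list(interior)
    for j in sorted(two_way):              # group d_ij by neighbor j
        block = list(two_way[j])
        pad = -len(block) % CACHE_LINE     # fill the partial last line
        block += [pool.pop() for _ in range(min(pad, len(pool)))]
        order += block
    return order + list(multiway) + pool
```

With two boundary items shared with domain 1, one multiway item, and twenty interior items, the boundary block is padded to a full 16-word line, so the multiway datum lands at index 16, the start of the next line.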
    Parallel Region
        domain = get_pid()
        node_start = node_start_index[domain]
        node_end = node_end_index[domain]
        element_start = element_start_index[domain]
        element_end = element_end_index[domain]
        for element = element_start to element_end
            body of element loop
        for node = node_start to node_end
            body of node loop
    End Parallel Region
[Figure 4: Data Layout for Minimizing Communication. Domain D2's data is laid out as two-way boundary blocks d_21, d_23, ..., d_2P, followed by the multiway boundaries d_2*, followed by the interior of D2; the domains D1, D2, ..., DP are laid out consecutively.]

The boundary data shared with two or more domains, d_i*, is also grouped contiguously, starting on a cache line boundary. We assume that d_i* is small, so it is not advantageous to use a sophisticated scheme to order this data. Finally, the remaining interior data is laid out contiguously in memory. This data layout strategy is diagrammed in Figure 4. The following expression bounds the amount of remote data that processor i must supply to the other processors j ≠ i given our layout strategy:
    r * Σ_{j=1..P, j≠i} ( ⌈|d_ij| / cache line size⌉ + min(|dm_ij|, ⌈|d_i*| / cache line size⌉) )        (1)

where r is the number of shared arrays associated with the boundary data, |d_ij| is the number of two-way shared boundary data assigned to i on the boundary with j, |dm_ij| is the number of multiway shared boundary data assigned to i and on a boundary of j, and |d_i*| is the total number of multiway shared boundary data assigned to processor i. In equation (1) we sum over all processors other than i. The first term of the summation accounts for the number of cache lines that must be retrieved due to data residing on i that is shared only with j. The second term of the summation accounts for the maximum number of cache lines that must be retrieved due to data residing on i that is shared by many processors and required by j. Since this data is not ordered in any particular way, the data required by j might be spread amongst all of the cache lines containing multiway shared data, or spread one word per cache line. The best that an ideal (not necessarily realizable) layout strategy could do is given by the following equation:

    r * Σ_{j=1..P, j≠i} ⌈(|d_ij| + |dm_ij|) / cache line size⌉        (2)
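Both bounds are easy to evaluate once the decomposition tool reports the boundary set sizes. A sketch (Python; the argument names are hypothetical):

```python
from math import ceil

def supply_bound(d2, dm, d_star, r, line):
    """Eq. (1): cache lines processor i must supply, where d2[j] = |d_ij|,
    dm[j] = |dm_ij|, d_star = |d_i*|, and line is the cache line size."""
    return r * sum(ceil(d2[j] / line) + min(dm[j], ceil(d_star / line))
                   for j in d2)

def ideal_bound(d2, dm, r, line):
    """Eq. (2): the same quantity under an ideal (not necessarily
    realizable) layout."""
    return r * sum(ceil((d2[j] + dm[j]) / line) for j in d2)
```

For example, with two neighbors sharing 20 and 5 two-way items, 2 and 1 multiway items, |d_i*| = 3, r = 2 shared arrays, and 16-word lines, equation (1) gives 10 lines and equation (2) gives 6, illustrating how small multiway sets keep the layout close to ideal.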
In practice we have found that both |dm_ij| and |d_i*| are small, and the communication requirements of our layout are very close to the bound given by equation (2). Figure 5 gives the template for the restructured application assuming the data layout strategy described above. Each processor looks up its start and end indices for the node and element arrays. The loop bounds for all node and element loops within the parallel region are modified to run from the start index to the end index.

Figure 5: Restructured code for minimizing interprocessor communication
6.2 Utilizing First Level Cache by Blocking
After performing domain decomposition and data layout to minimize communication, the next objective is to improve the cache performance on each processor. Blocking is a common technique used in regular dense applications for improving cache performance when the data working set is too large for the cache. Capacity misses are reduced by dividing the working set of a program region (e.g. a loop nest or parallel region) into blocks that fit entirely into the cache and performing calculations one block at a time. For regular programs such as matrix multiply, blocking is done by partitioning the matrices into rectangular blocks and restructuring the program to perform the multiply on one submatrix at a time.

We propose blocking of irregular data structures using domain decomposition to improve cache performance; we are not aware of any previous work that blocks irregular applications using domain decomposition. Instead of using rectangular blocks, we use the domain decomposition algorithm to produce subdomains within the domain assigned to each processor. The subdomains are generated such that the subdomain working set for the blocked code region fits entirely into the cache. Each processor iterates through its subdomains, executing the code in the blocked portion of the program one subdomain at a time. The domain decomposition algorithm is used to partition the data of a domain into b subdomains. The boundary data is assigned to an appropriate subdomain with the constraint that the number of cache lines containing boundary data is minimized. Methods to determine b and to assign boundary data to subdomains are described in the next sections. For simplicity we only present single-level blocking; the methods given here can be applied recursively to block for multiple levels of caching.
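Restructured for blocking, the element loop of the parallel region becomes a loop over subdomains, each visited in full before the next. A minimal sketch (Python; the per-subdomain start/end index arrays correspond to those described in Section 6.2.2):

```python
def blocked_region(start_idx, end_idx, elem_body):
    """Run the blocked code region one subdomain at a time so that each
    subdomain's working set stays resident in the cache."""
    for b in range(len(start_idx)):            # b subdomains
        for e in range(start_idx[b], end_idx[b]):
            elem_body(e)                       # body of the element loop
```

The element bodies still execute in a single pass over the domain; only the visiting order changes, so the transformation is legal whenever iterations within the blocked region are independent.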
6.2.1 Calculation of b
Assume the first level cache on a processing element of the multiprocessor is of size C. The number of subdomains, b, must be chosen such that all the data associated with a subdomain that is accessed within the blocked region of code is less than C. Assuming that elements are partitioned equitably among domains and subdomains, b can be determined as follows:

    element data + node data + boundary element data + boundary node data ≤ C

Let x be the number of elements assigned to each subdomain and assume that the number of nodes, boundary elements and boundary nodes can be determined as a function of x. Then we can rewrite the above relation as

    a0 x + a1 f(x) + a2 g(x) + a3 h(x) ≤ C        (3)

where a0, a1, a2, and a3 are the sizes of the element, boundary element, node, and boundary node data respectively, and f(x), g(x), and h(x) are the number of boundary elements, the number of nodes, and the number of boundary nodes. A value of x that satisfies equation (3) is calculated. The number of subdomains, b, is then chosen to satisfy the following relation:

    b ≥ ⌈(elements in domain) / x⌉
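When the partitioner is recursive, "subdivide until equation (3) holds" can be mimicked directly to derive b up front. A sketch (Python; the halving stands in for one more level of recursive bisection, and the 2D-regular-mesh counts used in the example are hypothetical unit-size data):

```python
import math

def pick_block_count(E, C, a, f, g, h):
    """Smallest b such that the per-subdomain working set of eq. (3),
    a0*x + a1*f(x) + a2*g(x) + a3*h(x), fits in a cache of size C.
    E is the element count of the domain; halving x mimics one further
    level of recursive bisection."""
    x = E
    while x > 1 and a[0]*x + a[1]*f(x) + a[2]*g(x) + a[3]*h(x) > C:
        x //= 2
    return math.ceil(E / x)

# 2D regular mesh: an n-by-n subdomain (x = n^2 elements) has x nodes
# and 2*sqrt(x) + 1 boundary elements/nodes; unit-size data assumed.
def b_2d(E, C):
    side = lambda x: 2 * math.sqrt(x) + 1
    return pick_block_count(E, C, (1, 1, 1, 1), side, lambda x: x, side)
```

For instance, a 4096-element domain and a 600-word cache settle at x = 256 elements per subdomain, i.e. b = 16 subdomains.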
A two-dimensional regular mesh with a subdomain of size n x n has n² elements, n² nodes, 2n + 1 boundary elements and 2n + 1 boundary nodes. Thus equation (3) for a two-dimensional regular mesh where the number of elements is x is

    a0 x + a1 (2x^(1/2) + 1) + a2 x + a3 (2x^(1/2) + 1) ≤ C

or

    (a0 + a2) x + 2(a1 + a3) x^(1/2) + (a1 + a3) ≤ C.

The above quadratic equation can be solved for x, and the number of subdomains calculated, once the coefficient values are determined. Blocking the two-dimensional mesh will reduce the capacity misses by a factor of n, where n ∝ √C if the coefficients are 1. Similarly, a three-dimensional regular mesh with a subdomain of size m x m x m has m³ elements and m³ nodes and 3m² + 3m boundary elements and boundary nodes; if x = m³ then equation (3) can be written

    (a0 + a2) x + 3(a1 + a3) x^(2/3) + 3(a1 + a3) x^(1/3) ≤ C.

As in the previous case, this equation can also be solved analytically, and the reduction in capacity misses will be proportional to ∛C.

The meshes for our applications are irregular, which makes it difficult to calculate an exact working set size for a subdomain. When the partitioning algorithm is recursive, we can test whether the subdomain obtained after a subdivision is sufficiently small. Partitioning algorithms often directly provide the number of nodes, boundary nodes and boundary elements. Even if not directly provided, they can be determined by inspection of the subdomains. Further subdivision is performed only if the size of the data set exceeds the cache size. In some cases, the partitioning tool requires the actual number of subdomains as an input. In such cases, we can bound the size of the working set and obtain a conservatively large value for b. Let s be the size, in number of nodes, of the most complex element type of the mesh. For example, the size of a triangular element is 3 and the size of a brick element is 8. Let y be the maximum number of elements to which a single node is adjacent in the input mesh. Recall that x, f(x), g(x), and h(x) are the number of elements, the number of boundary elements, the number of nodes and the number of boundary nodes respectively. We can derive the following relationships. The total number of nodes, including boundary nodes, in the working set is at most sx. Therefore g(x) + h(x) ≤ sx, which can be rewritten as h(x) = αsx and g(x) = (1 − α)sx for some fraction α. If we assume that the boundary data shared between two domains is split equally between the two domains, then for any given domain at most half of the node data that it references can belong to another processor. Therefore 0 ≤ α ≤ 1/2. This gives us bounds for the terms g(x) and h(x): g(x) ≤ sx and h(x) ≤ (1/2) sx. We now have to determine the maximum possible number of boundary elements. There are boundary elements only for the nodes that are on the subdomain boundary, and the number of such nodes is at most h(x), so the number of boundary elements f(x) is at most yαsx ≤ (1/2) ysx. As a result we can bound the left hand side of equation (3) as given by equation (4):

    a0 x + a1 f(x) + a2 g(x) + a3 h(x) ≤ a0 x + (1/2) a1 ysx + a2 sx + (1/2) a3 sx        (4)
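For the two-dimensional regular case, the quadratic in u = √x can be solved in closed form. A sketch (Python) of the largest admissible subdomain size:

```python
import math

def max_block_elems_2d(C, a0, a1, a2, a3):
    """Largest x with (a0+a2)x + 2(a1+a3)sqrt(x) + (a1+a3) <= C:
    the positive root of (a0+a2)u^2 + 2(a1+a3)u + (a1+a3) - C = 0
    with u = sqrt(x)."""
    A, B = a0 + a2, a1 + a3
    u = (-B + math.sqrt(B * B + A * (C - B))) / A
    return int(u * u)
```

With unit coefficients and C = 578, this gives x = 256, i.e. a 16 x 16 subdomain, which indeed meets the bound: 2*256 + 4*16 + 2 = 578.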
For example, if the data size associated with elements, boundary elements, nodes, and boundary nodes is one word (a0 = a1 = a2 = a3 = 1), the largest element is a four node element (s = 4), and the maximum adjacent elements of a node is four (y = 4) then the right hand side of equation (4) is equal to x + 8x + 4x + 2x = 15x.
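The conservative bound is cheap to evaluate; a short sketch (Python) reproducing the 15x figure above:

```python
def working_set_bound(x, a0, a1, a2, a3, s, y):
    """Right-hand side of eq. (4): a0*x + (1/2)*a1*y*s*x + a2*s*x
    + (1/2)*a3*s*x, an upper bound on the working set of x elements."""
    return a0 * x + a1 * y * s * x / 2 + a2 * s * x + a3 * s * x / 2
```

Under the stated assumptions (unit data sizes, s = 4, y = 4), the bound is 15 words per element, so a cache of C words conservatively admits about C/15 elements per subdomain.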
6.2.2 Assignment of boundary data
As mentioned earlier, the creation of subdomains is performed under the constraint that the number of cache lines of inter-domain boundary data is minimized. A domain D_i is divided into b subdomains using the decomposition algorithm, and the inter-domain boundary cache lines are assigned to the subdomains as follows. The data layout algorithm described in the previous section is used to minimize the number of cache lines of inter-domain boundary data. Recall that the algorithm provides a data layout in which the data is sorted by j, the neighboring domain with which the data is shared. Within each set d_ij the data is sorted by subdomain, thus grouping together inter-domain data by subdomain. Finally, each cache line of d_ij is assigned to the subdomain that has the most data elements in the cache line. The node and element data is laid out to reflect the subdomain decomposition and the boundary data assignment. The data is first ordered by domain, and within each domain grouped by subdomain. Data within a subdomain is grouped by inter-domain boundary data, then intra-domain boundary data, and finally interior data. Interior data is used to fill in incomplete cache lines of the inter-domain and intra-domain data. Figure 6 diagrams the data layout for blocked domain decomposition. The application must be restructured by adding a loop which iterates from 1 to b around the blocked code region. Two arrays can be used to keep track of the data assigned to each subdomain: a subdomain start index array and a subdomain end index array. Each subdomain consists of the elements (nodes) indexed between the start and end index. Figure 7 gives a pseudo-code template for restructuring an application.
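The cache-line assignment rule described above (each inter-domain boundary cache line goes to the subdomain owning the most data elements in it) can be sketched as follows. The representation of a cache line as a list of subdomain ids is a hypothetical simplification for illustration.

```python
from collections import Counter

def assign_boundary_cache_lines(cache_lines):
    """Assign each inter-domain boundary cache line to a subdomain.

    cache_lines: list of cache lines, each given as the list of subdomain
    ids that own the data elements packed into that line. Each line is
    assigned to the subdomain with the most elements in it (majority vote).
    """
    return [Counter(line).most_common(1)[0][0] for line in cache_lines]

# Example: the first line is mostly subdomain 2's data, the second mostly 1's.
# assign_boundary_cache_lines([[2, 2, 2, 1], [1, 1, 3, 1]]) -> [2, 1]
```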
6.3 Hiding Communication Latency
In the previous sections we described a data layout scheme that reduces the number of shared cache lines between domains, thereby reducing the cost of interprocessor communication. After applying the data layout schemes described above there will still be some boundary data that must be
[Figure 6: Data Layout for Blocked Domains. Diagram showing domains D1, D2, ..., DP, each divided into subdomains SD1, SD2, ..., SDb. Subdomain SD2's data is laid out as inter-domain boundary data (d21, d23, ..., d2p, d2*), then intra-domain boundary data (sd21, sd23, ..., sd2b, sd2*), then the interior of SD2.]
Parallel Region
    domain ← get_pid()
    for subdomain ← 1 to b
        node_start ← node_start_index[domain, subdomain]
        node_end ← node_end_index[domain, subdomain]
        element_start ← element_start_index[domain, subdomain]
        element_end ← element_end_index[domain, subdomain]
        for element ← element_start to element_end
            body of element loop
        for node ← node_start to node_end
            body of node loop
End Parallel Region

Figure 7: Restructured code for blocked domain decomposition

communicated (i.e., d_ij for all i and j). This communication causes some performance degradation of the application. The degradation can be reduced by overlapping the communication with computation. Thus the communication latency is hidden by prefetching the inter-domain boundary data. Prefetching can be combined with the data layout and program restructuring given in the previous two sections. We describe prefetching in the context of a blocked code, but prefetching can be similarly implemented on an unblocked code. To hide the latency of a code restructured as described in section 6.2, we start by partitioning the subdomains assigned to each processor into two sets: interior and boundary subdomains. An interior subdomain is one which requires no inter-domain boundary data and thus requires no communication. A boundary subdomain is one which does require inter-domain boundary data and thus requires communication. The blocked program can be augmented with prefetches as given in Figure 8. For each boundary subdomain the remote data are prefetched, followed by execution of the interior blocks and then execution of the boundary blocks. If the remote data arrive before execution of the block requiring the data then there is no cache miss and the processor does not have to stall. A mapping array is used to reorder the
Parallel Region
    domain ← get_pid()
    for boundary_subdomain ← num_interior + 1 to b
        subdomain ← map[boundary_subdomain]
        prefetch boundary data for subdomain
    for s ← 1 to b
        subdomain ← map[s]
        node_start ← node_start_index[domain, subdomain]
        node_end ← node_end_index[domain, subdomain]
        element_start ← element_start_index[domain, subdomain]
        element_end ← element_end_index[domain, subdomain]
        for element ← element_start to element_end
            body of element loop
        for node ← node_start to node_end
            body of node loop
End Parallel Region

Figure 8: Restructured code for latency hiding

subdomains such that the interior subdomains are processed before the boundary subdomains. Since latency-hiding mechanisms vary considerably from machine to machine, an implementation on a real machine may require a variation of the scheme outlined in Figure 8. For example, on the KSR1 both prefetch and poststore operations are available, so each operation's effectiveness must be evaluated. Data prefetching can be used by processor j in the example above to retrieve the cache lines containing boundary nodes required from processor i. Alternatively, processor i can poststore all cache lines containing boundary nodes assigned to i. A hybrid approach, where j prefetches all boundary nodes bordering i only and i poststores all boundary nodes bordering on more than two subdomains, can also be used. Ultimately, the choice of prefetching or poststoring data depends on the availability and cost of these operations on the target multiprocessor. Another machine-specific constraint on the prefetch (or poststore) mechanism is the number of outstanding prefetches supported by the machine. In Figure 8 all prefetches are executed together, but this will typically overflow the prefetch buffer of the machine, causing most of the prefetches to be ignored. To solve this problem we can spread the prefetches out by interleaving prefetches with subdomain execution, such that prefetching for subdomain j is performed with the execution of subdomain j − 1.
Figure 9 gives a template for a parallel region with the prefetches interleaved with execution of the subdomain computation. The algorithm outlined assumes that there are b total subdomains and k boundary subdomains, that the prefetch latency is less than the time to execute the computation for one subdomain, and that the number of cache lines to fetch for a given subdomain is less than or equal to the prefetch buffer size. The algorithm can be modified to spread the prefetches out further to accommodate a larger prefetch latency or a larger amount of data prefetched per subdomain.
Parallel Region
    for s ← 1 to b − k − 1
        subdomain ← map[s]
        compute subdomain
    for s ← b − k to b − 1
        subdomain ← map[s]
        next_subdomain ← map[s + 1]
        prefetch for next_subdomain
        compute subdomain
    subdomain ← map[b]
    compute subdomain
End Parallel Region

Figure 9: Restructured code for interleaved prefetching
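The interleaving schedule of Figure 9 can be sketched in ordinary code as follows. This is an illustrative sketch only: `prefetch` and `compute` are hypothetical stand-ins for the machine prefetch operation and the per-subdomain computation.

```python
def run_interleaved(b, k, subdomain_map, prefetch, compute):
    """Software-pipelined schedule: b subdomains total, the last k of
    subdomain_map are boundary subdomains. The prefetch for subdomain
    s + 1 is issued alongside the computation of subdomain s (Figure 9)."""
    # Interior subdomains up to b - k - 1 need no prefetching.
    for s in range(1, b - k):            # s = 1 .. b-k-1
        compute(subdomain_map[s])
    # While computing s, prefetch the data for s + 1.
    for s in range(b - k, b):            # s = b-k .. b-1
        prefetch(subdomain_map[s + 1])
        compute(subdomain_map[s])
    compute(subdomain_map[b])            # last boundary subdomain

# Recording the schedule shows each prefetch issued one step ahead:
trace = []
smap = {i: i for i in range(1, 7)}       # identity mapping, b = 6, k = 2
run_interleaved(6, 2, smap,
                prefetch=lambda sd: trace.append(("prefetch", sd)),
                compute=lambda sd: trace.append(("compute", sd)))
```

Only the boundary subdomains (5 and 6 in this run) are ever prefetched, and each prefetch overlaps the computation of the preceding subdomain.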
read input data
call domain decomposition algorithm
While residual > convergence criterion
    Parallel Region
        array and/or vector updates
    End Parallel Region
    . . .
    Parallel Region
        array and/or vector updates
        calculation of residual
    End Parallel Region

Figure 10: Sparse Matrix Application Template
6.4 Extensions
Not all FE applications are written with an explicit notion of elements and nodes. Some applications, such as our radiation modeling program (FEM-ATS, Finite Element Method Absorbing Termination Surface), build a sparse matrix which represents the interactions between elements. The control loop performs linear algebra operations on the sparse matrix and other data vectors abstracted from the original mesh. A template for such an application is given in Figure 10. The data layout algorithms given above can be adapted to this model as follows. Communication occurs in the program as a result of the sparse matrix, denoted as A, which represents the interactions within the finite element model. In the FEM-ATS application, A is n×n where n is the number of edges in the mesh, and there is a non-zero entry in a_ij if there is an interaction between edge i and edge j. Thus we are concerned with edges (pairs of nodes) but not with elements. To use our data layout scheme we need the notion of a boundary edge. A boundary edge is an edge that interacts with an edge assigned to another domain. Using this definition of boundary edges, the boundary data sets d_ij can be calculated and the reordering of edge data can be performed as described in section 6.1. The matrix A must be reordered both by row and by column. The subdomain generation and data restructuring can be done as described in section 6.2, but equation (3) must be modified. The working set should be calculated as a function of the sparse array size (e.g., the number of edges), not the number of elements. The new equation is a0 x + a1 f(x) ≤ C, where x is the number of edges assigned to a domain and f(x) is the number of edges required from other domains. Let y be the maximum number of other edges with which an edge may interact. We can thus bound the equation as follows:

a0 x + a1 f(x) ≤ a0 x + a1 yx

For a matrix/vector multiply, the coefficients are a function of the sparse matrix representation and the initial and result vectors. The value of y is a function of the number of nonzeros per row. For example, in our application a0 = 3, a1 = 5, and y = 15. Data restructuring for subdomains must be applied to both the rows and columns of A, as well as to the other vectors. In addition to handling the sparse matrix representation of the FE mesh, it is necessary to account for multiple parallel regions within the control loop. Each parallel region may have a different working set and different computational requirements. As a result, the domain decomposition that balances the load in one parallel region may not balance the load in another parallel region. Our method assumes that there is a single domain decomposition for the whole application. This domain decomposition should be chosen to balance the load of the parallel region having the largest execution time. In the future, we intend to investigate ways to vary the domain decomposition to account for dynamic changes that occur across iterations of the control loop. The dynamic load balancing techniques may also be able to alter domain boundaries between parallel regions within a single control iteration.
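The modified working-set test for the edge-based model can be sketched as follows. The cache-size argument is a hypothetical parameter; the default coefficients are those quoted above for our application (a0 = 3, a1 = 5, y = 15).

```python
def edge_working_set_bound(x, a0=3, a1=5, y=15):
    """Conservative working set (in words) for a domain of x edges:
    a0*x + a1*f(x) <= a0*x + a1*y*x, taking f(x) at its maximum y*x."""
    return a0 * x + a1 * y * x

def max_edges_per_domain(cache_words, a0=3, a1=5, y=15):
    # Largest x whose conservative bound still fits in the cache.
    return cache_words // (a0 + a1 * y)

# With a0=3, a1=5, y=15 the bound works out to 78 words per edge, so a
# 32K-word cache accommodates up to 32768 // 78 = 420 edges per domain.
```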
Since the working set varies from one parallel region to the next, another consequence of multiple parallel regions is that the subdomain size for one parallel region is not necessarily the size required for another parallel region. We determine the block size required for each parallel region and use the minimum (most constrained) block size for the entire control loop.
7 Experimental Results In order to evaluate the techniques described above, we hand-restructured the FEM-ATS radiation modeling application developed at the University of Michigan [7] and ran several experiments on a 64-processor Kendall Square Research KSR1. The main routine in FEM-ATS is a sparse solver which uses the biconjugate gradient method to iteratively calculate approximate solutions to a linear system. The vector operations of the main loop are outlined in Figure 11. The loop is executed as a single parallel region with barrier synchronizations to ensure the correctness of the dot products. Summing of the partial sums for the dot products is done sequentially on the master processor and requires a small amount of communication. The majority of interprocessor communication occurs between processors during the matrix-vector multiply and the vector update of p. Since A is static, the communication pattern is the same for all iterations of the solver. A detailed description of the parallelization of FEM-ATS is given in [19].
Repeat until (resd ≤ tol)
    q = Ap                  (1)
    α = tmop/(q · p)        (2)
    x = x + αp              (3)
    r = r − αq              (4)
    b = r/d                 (5)
    resd = |r · r|          (6)
    β = (r · b)/tmop        (7)
    tmop = tmop · β         (8)
    p = b + βp              (9)
EndRepeat

A is a matrix. q, p, x, r, b, d are vectors. α, β, tmop, resd, and tol are scalars.
Figure 11: Main loop of FEM-ATS, a radiation modeling application. Reproduced from [19]

We used two data sets in our experiments. The sizes of the data sets are n = 8,841 (9k data set) and n = 20,033 (20k data set), where the sparse matrix A is n×n. There are 136,491 and 306,635 non-zero elements in the A matrix for the two data sets, respectively. The data sets were partitioned into domains using a program provided by Barnard and Simon which implements their Recursive Spectral Bisection (RSB) algorithm [2]. The A matrix was partitioned directly (as opposed to partitioning the finite element mesh) since the structure of the A matrix determines the communication pattern within the iterative solver. We used data copying to group together the non-contiguous array sections accessed within each domain (subdomain). This transformation was done for all of our experiments in order to minimize conflict cache misses in the data subcache of the KSR1. The measurements reported in our experiments were gathered using the KSR PMON library [13]. PMON provides performance data on a per-processor basis for all threads in a process. PMON allows access to a hardware event monitor which counts cache miss events, prefetch events, and clock cycles for timing. This tool enables us to get real cache event counts on realistic data sets while the code executes. We performed three sets of experiments. The first set of experiments was designed to investigate how well our optimizations are able to minimize interprocessor communication in the FEM-ATS application. The second set evaluates the effects of blocking the application into subdomains. The third set investigates the ability to hide latency due to interprocessor communication in FEM-ATS.
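The recurrences of Figure 11 can be exercised on a small symmetric positive-definite system, for which the biconjugate gradient method reduces to conjugate gradient with the diagonal d of A as the preconditioner (step 5, b = r/d). This dense-matrix sketch is our own illustration, not the FEM-ATS sparse implementation.

```python
def solve_fig11(A, rhs, tol=1e-10, max_iter=100):
    """Run the Figure 11 recurrences on a small dense SPD system A x = rhs,
    starting from x = 0, with d = diag(A) as the preconditioner."""
    n = len(A)
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    matvec = lambda M, v: [dot(row, v) for row in M]

    x = [0.0] * n
    r = list(rhs)                      # r = rhs - A*x with x = 0
    d = [A[i][i] for i in range(n)]
    b = [ri / di for ri, di in zip(r, d)]
    p = list(b)
    tmop = dot(r, b)
    for _ in range(max_iter):
        q = matvec(A, p)                                # (1)
        alpha = tmop / dot(q, p)                        # (2)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # (3)
        r = [ri - alpha * qi for ri, qi in zip(r, q)]   # (4)
        b = [ri / di for ri, di in zip(r, d)]           # (5)
        resd = abs(dot(r, r))                           # (6)
        if resd <= tol:                # "Repeat until (resd <= tol)"
            break
        beta = dot(r, b) / tmop                         # (7)
        tmop = tmop * beta             # (8): tmop becomes r . b
        p = [bi + beta * pi for bi, pi in zip(b, p)]    # (9)
    return x

# A small SPD system whose solution is near [1, 2]:
# solve_fig11([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])
```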
7.1 Communication Experiments
For the communication experiments we restructured the program as given in section 6.1, tried several data layout schemes, and varied the number of processors from 8 to 40. The data layout schemes are summarized in Table 1. When the number of processors is increased beyond 28, the processors are divided among the two rings of the KSR1. Table 2 gives the division of processors between rings. For each run we measured the number of cache lines transferred between processors and
Mnemonic   Description
nd         no domain decomposition, contiguous block partitioning
d          domain decomposition, no data reordering
dr         domain decomposition, data reordering
drnp       domain decomposition, data reordering, no padding of cache lines

Table 1: Data Layout Schemes

Number of Processors   Ring 1   Ring 2
8                      8        0
16                     16       0
24                     24       0
32                     27       5
40                     27       13

Table 2: Processor Division Among Rings

the execution time for 100 iterations of the solver. The number of cache lines transferred between processors is equal to the number of cache misses in the secondary (local) cache of the KSR1. We separately measured the misses due to synchronization and reduction operations in the application, since our optimizations do not address this type of traffic. These misses are labeled as overhead in our results. The results of our communication experiments are given in Figure 12 and Figure 13. For both data sets our dr data layout scheme (as described in Section 6.1) performed better than contiguous blocking and domain decomposition without reordering. Overall, the dr layout has 30%-60% fewer coherency misses than contiguous blocking. The execution time measurements reflect the decrease in cache misses for our layout scheme. However, given the slow processor speed of the KSR1, 20 MHz, it is not surprising that these figures are less dramatic than the cache miss figures. The dr layout scheme takes 8%-16% less time to execute than contiguous blocking. The execution time improvements will be more significant on systems with higher processor speeds. Additionally, the improvements are greater for the 20k data set than the 9k data set. We used two different data layout schemes: the dr scheme is exactly as described in section 6.1, and the drnp scheme is as described in section 6.1 except that interior data is not used to pad out cache lines for each set of boundary data. Boundary data sets are packed tightly together, and two or more sets may share a cache line. We expected that the dr scheme would outperform the drnp scheme because dr minimizes the number of remote cache lines that a processor must fetch, while drnp has the potential for using more than the minimum number of cache lines for a boundary data set. Surprisingly, the drnp scheme performed as well as the dr scheme. This is a result of the automatic update feature of the KSR1.
If a cache line contains data required by two different processors, it is likely that one of the processors will get the data for free via the automatic update.
7.2 Blocking Experiments
In order to investigate blocking with subdomains, we executed several runs of the 9k data set on a uniprocessor and varied the number of subdomains from 1 to 56. The 9k data set was chosen because the entire working set of the solver fits in the secondary cache of a single processor. We measured misses from the first-level data cache of the processor and execution times for 100 iterations of the solver. The results are given in Figure 14. Blocking did not reduce cache misses or execution time for this application. After studying the application more closely to understand why subdomain blocking was not beneficial, we realized that blocking could only reduce misses occurring during the matrix-vector multiply. Furthermore, there is only reuse in the data from the p vector, and the structure of the A matrix determines the reuse distance for elements of p. Our data set requires up to 78 words of data to perform the calculations for each row of the matrix-vector multiply, and the KSR1 has a processor data subcache of 32K words. Therefore, block sizes of less than 420 rows should fit entirely in the processor data cache. For our data sets the original layout of A is such that most of the available reuse of data from p is exploited, thus blocking provided no additional benefits. We give a histogram of the reuse distance for accesses to p in Figure 15, where reuse distance is given in number of rows of A. For example, if the reuse distance for a reference to an element of p is 2, then the same element or cache line was referenced two rows earlier. Reuse of data will occur whenever the reuse distance is less than 420. Other applications will have to be evaluated to determine the usefulness of subdomain blocking.

[Figure 12: Results from Communication Experiments, 9K Data Set. Average secondary cache misses and execution time (seconds) versus number of processors (8 to 32) for the nd, d, dr, and drnp layouts, with overhead shown separately.]

[Figure 14: Results from Subdomain Blocking Experiments, 9K Data Set. First-level (primary) cache misses and execution time versus subdomain size in rows of A (8841 down to 158).]
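The reuse-distance measurement can be computed directly from the sparsity structure of A. In this sketch (our own illustration, not the paper's instrumentation) each row of A is given as the list of column indices of its nonzeros, i.e., the elements of p it touches during the multiply.

```python
def reuse_distance_histogram(rows):
    """For each access to an element of p during a row-by-row sparse
    matrix-vector multiply, record how many rows ago that element was
    last touched. First touches are ignored (there is no prior use)."""
    last_row = {}          # column index -> row of most recent access
    histogram = {}         # reuse distance (in rows) -> count
    for i, cols in enumerate(rows):
        for j in set(cols):               # each element touched once per row
            if j in last_row:
                dist = i - last_row[j]
                histogram[dist] = histogram.get(dist, 0) + 1
            last_row[j] = i
    return histogram

# Column 0 is touched in rows 0, 1, and 3 (distances 1 and 2); column 2
# in rows 1 and 2 (distance 1):
# reuse_distance_histogram([[0, 1], [0, 2], [2], [0]]) -> {1: 2, 2: 1}
```

Reuse is captured by the cache whenever the recorded distance stays below the block-size threshold (420 rows for our data sets).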
7.3 Prefetch Experiments
[Figure 13: Results from Communication Experiments, 20K Data Set. Average secondary cache misses and execution time (seconds) versus number of processors (8 to 40) for the nd, d, dr, and drnp layouts, with overhead shown separately.]
Our final set of experiments evaluates the use of the prefetch mechanism on the KSR1 to hide the latency of cache coherency traffic. Since blocking with subdomains did not improve the performance of our application, we instrumented an unblocked version of the application with prefetch instructions. Prefetching was done at two locations in the
[Figure 15: Histogram of Reuse Distance, 9k Data Set]

[Figure: Average Secondary Cache Misses, 20k Data Set (No Prefetch vs. Prefetch)]