Vectorization of Multigrid Codes using SIMD ISA Extensions C. García, R. Lario, M. Prieto, L. Piñuel, F. Tirado Departamento de Arquitectura de Computadores y Automática Facultad de C.C. Físicas. Universidad Complutense. Ciudad Universitaria s/n 28040 Madrid. Spain. {garsanca,rlario,mpmatias,lpinuel,ptirado}@dacya.ucm.es
Abstract
Motivated by the recent trend towards small-scale SIMD processing, we have addressed in this paper the vectorization of multigrid codes on modern microprocessors. The aim is to demonstrate that this relatively new feature can be beneficial not only for multimedia programs but also for such numerical codes. As target kernels we have considered both standard and robust multigrid algorithms, which demand different vectorization strategies. Furthermore, we have also studied the well-known NAS-MG program from the NAS Parallel Benchmarks. In all cases, the performance benefits are quite satisfactory. This research is particularly relevant if we envisage using in-processor parallelism as a way to scale up the speedup of other optimizations such as efficient memory-hierarchy exploitation or multiprocessor parallelization.

Index Terms—Multigrid methods, SIMD exploitation, Robust multigrid, Cache-efficient multigrid.

1 Introduction
The memory-wall trend [1,2] that we have witnessed over the last few years has motivated a significant amount of work on cache-conscious multigrid algorithms, and several schemes that overcome locality problems in a wide variety of situations have been developed [3,4,5]. We have likewise focused this research on improving multigrid codes from an implementation point of view. However, our motivation stems from a different microprocessor design tendency: the media trend [6,7]. Given the increasing importance of media to desktop computing, most microprocessor vendors have extended their general-purpose architectures with specific support for this kind of workload. The key point is that multimedia codes typically process narrow data types (8, 16 or 32 bits), whereas general-purpose systems are optimized for processing wider data. Rather than waste the wide ALUs when operating on small data, instruction-set architectures (ISA) have been extended with new instructions that operate on several narrow data items at the same time, allowing a form of small-scale SIMD parallelism at the subword level. Examples of such ISA extensions are Intel's MMX (MultiMedia eXtension) [8], Intel's SSE and SSE2 (Streaming SIMD Extensions) [9,10], AMD's 3DNow! [11], Sun's VIS (Visual Instruction Set) [12], Hewlett-Packard's MAX-2 (Multimedia Acceleration eXtension) [13] and PowerPC's AltiVec [14].
Currently, state-of-the-art media extensions, such as Intel's SSE/SSE2 and PowerPC's AltiVec, offer 128-bit SIMD registers with both integer and floating-point arithmetic (single or double precision). Although it is difficult to predict what features will be included in future general-purpose microprocessors, current trends suggest the growing importance of such media ISA extensions, which will become as natural as floating-point or integer instructions.

SIMD-conscious multigrid is not a new research issue. In fact, the effectiveness of vector processing in such codes was studied thoroughly in the golden age of vector supercomputers [15,16,17]. However, those computers differ substantially from general-purpose microprocessors in their vector capabilities, and consequently new issues have to be addressed. For example, over the years, traditional vector computers added strided and gather/scatter addressing to increase the number of programs that could be vectorized. In contrast, today's short-vector SIMD architectures support only unit-stride accesses: memory instructions load or store all elements at once from a single memory location.

In summary, it is our opinion that a revision of SIMD-conscious multigrid is of great practical interest in order to evaluate the potential benefits of this novel architectural characteristic. In our view, the performance impact of this kind of optimization may become even more significant as soon as new ISA extension generations augment the number and size of the SIMD registers. Nevertheless, we should mention again that technology trends are difficult to forecast. In fact, some authors have questioned, for example, the benefits that multimedia applications can obtain from larger SIMD registers [18], although this is probably not the case in our context.

The rest of this paper is organized as follows: Sections 2 and 3 briefly describe the main characteristics of the target computing platform and the investigated multigrid algorithms, respectively. Sections 4 and 5 summarize the proposed optimizations and present some performance results. Finally, the paper ends with some conclusions and hints about our future research.

2 Experimental Environment
The platform on which we have chosen to study the benefits of the SIMD extensions is a Pentium 4 based PC running under Linux, the main features of which are summarized in Table 1.

Table 1. Pentium 4 system main features

  Processor           Intel Pentium 4, 1.5 GHz
  Motherboard         DFI WT70-EC (Intel 82850 chipset)
  Memory              768 MB PC800 RDRAM
  Operating system    RedHat 7.2 (Enigma), Linux kernel 2.4.17, glibc 2.2.4-13
  Compiler            Intel C/C++ and Intel Fortran Compilers for Linux v5.0
  Compiler switches   -O3 -tpp7 -xW (automatic vectorization: -vec -restrict)
All optimizations have been carried out at the source-code level, avoiding assembly coding. In fact, the automatic vectorization provided by the Intel compilers achieves certain performance improvements in some cases. However, better speedups can usually be obtained by means of hand-tuned codes based on the ICC compiler intrinsics [19].

2.1 Guided Automatic Vectorization
From a programmer's point of view, the most suitable way to exploit SIMD extensions is automatic vectorization, since it avoids low-level coding techniques, which are platform dependent. Nevertheless, loops must fulfill some requirements in order to be automatically vectorized, and in most practical cases both code modifications and guided compilation are necessary. In particular, the Intel compiler [19] can only vectorize simple loop structures such as:

    for (i = 0; i < max; i++) {
        D[i] = A[i] + B[i]*C[i];
    }

Primarily, only inner loops with simple array index manipulation (i.e. unit increment) that iterate over contiguous memory locations are considered (thus avoiding non-contiguous accesses to vector elements). In addition, global variables must be avoided since they inhibit vectorization. Finally, if pointers are employed inside the loop, pointer disambiguation is mandatory (this must be done by hand using compiler directives), as illustrated by the sketch below.
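As an illustration, a minimal sketch of how such a loop might be written so that the compiler can vectorize it: restrict-qualified pointer parameters (enabled by the -restrict switch) give the compiler the pointer disambiguation it needs, and the unit-stride inner loop matches the pattern above. The function name and signature are ours, not part of the kernels.

    /* Hypothetical helper: a vectorizable triad with disambiguated pointers. */
    void triad(float *restrict d, const float *restrict a,
               const float *restrict b, const float *restrict c, int max)
    {
        int i;
        for (i = 0; i < max; i++)      /* unit stride, no aliasing: vectorizable */
            d[i] = a[i] + b[i] * c[i];
    }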
2.2 Compiler Intrinsics
Intel ICC compiler intrinsics provide access to the ISA functionality using C/C++ style coding instead of assembly language [19]. Most of them map one-to-one to actual SSE/SSE2 assembly instructions. Their operands use four additional data types representing 64-bit and 128-bit data. In particular, we have employed __m128 (4-wide single-precision vectors) and __m128d (2-wide double-precision vectors). Apart from arithmetic and logical SIMD operations, Intel's SSE/SSE2 also provides data manipulation instructions (scatter, gather, shuffle, etc.) and some operations, named cacheability instructions, to improve memory usage (prefetching, streaming stores, etc.). A short example of this coding style follows. Although these intrinsics limit portability, we should note that the GCC 3.2 GNU compiler accepts them as well and translates them into vectorial code for the AMD K7 [20]. Furthermore, this GCC version also admits similar compiler intrinsics that support other SIMD ISA extensions, such as PowerPC's AltiVec [20].
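To give a flavor of the intrinsics style, the loop of Section 2.1 might be hand-vectorized roughly as follows. This is a sketch under the assumptions that the arrays are 16-byte aligned and max is a multiple of four; the function name is illustrative.

    #include <xmmintrin.h>  /* SSE: __m128 and the _mm_*_ps intrinsics */

    void triad_sse(float *d, const float *a, const float *b,
                   const float *c, int max)
    {
        int i;
        for (i = 0; i < max; i += 4) {                    /* four floats per step */
            __m128 prod = _mm_mul_ps(_mm_load_ps(&b[i]), _mm_load_ps(&c[i]));
            _mm_store_ps(&d[i], _mm_add_ps(_mm_load_ps(&a[i]), prod));
        }
    }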
3 Target Applications
Multigrid methods are regarded as the fastest numerical methods for the solution of elliptic partial differential equations, and as among the fastest methods for other types of partial differential equations and integral equations [21]. These characteristics have made multigrid a common solution method in many application areas. As a result of its popularity, some multigrid solvers have gained widespread acceptance among the scientific and computer architecture communities as standard performance indicators. A well-known example is the NAS-MG [22], one of the kernels included in the NAS Parallel Benchmarks, whose sequential version can also be found in the SPEC CPU2000 suite [23]. A more recent example is the SMG98 from the ASCI 30-TeraOps Sample Application Codes [24].

In this work, we have opted to study different 2-D kernels first, before dealing with more elaborate codes such as the cited benchmarks, since they allow us to evaluate the investigated optimizations more precisely. Unlike most scientific codes, we have chosen single-precision data in this case. This simplification provides some estimation of the potential benefits of wider SIMD registers, since it allows us to use short vectors of four elements instead of two. Furthermore, we also present some preliminary performance results for the NAS-MG (Class B) in order to improve the dissemination and understanding of our results and to facilitate further comparisons. The target kernels and the NAS-MG are described below.

3.1 Point-wise Relaxation Kernels
We have considered two different point-wise relaxation kernels based on a 5-point finite-difference discretization of the Laplace operator on a 2-D rectangular grid. They consist of a V-cycle with a standard coarsening procedure and linear interpolation and restriction operators. One kernel is based on the conventional Gauss-Seidel smoother, whereas the other uses its Red-Black version. In both cases we have employed one pre- and one post-smoothing iteration and five iterations on the coarsest level [21].

3.2 Robust Kernels
These kernels represent a better characterization of the typical workload of a real-life multigrid application. They are also based on a 5-point finite-difference discretization of the Laplace operator, but on a 2-D stretched grid with stretching in two directions [21]. In this highly anisotropic problem, point-wise relaxation combined with standard (full) coarsening is not a reasonable choice [25]. The first approach studied keeps the standard coarsening but changes the relaxation procedure from point-wise to alternating zebra line-wise relaxation [21,25]. From an implementation point of view, the main problem of this robust technique is caused by the discrepancy between the memory access patterns of the two principal components of the algorithm: the x-line and the y-line relaxation. This difference causes one of these components to exhibit poor data locality in straightforward implementations of the algorithm. As a consequence, the performance of this kernel can be highly limited by the memory accesses. The other robust kernel considered avoids this problem by combining x-line zebra relaxation with y-semicoarsening [21,25]. In both cases, line relaxation is performed using Thomas' algorithm [21].

3.3 NAS-MG Benchmark
The NAS-MG benchmark uses a conventional multigrid V-cycle scheme in combination with a standard coarsening procedure. It obtains an approximate solution to a scalar Poisson problem on a rectangular 3-D grid with periodic boundary conditions. More information about the NAS-MG can be found in [22]. We should remark that, unlike the kernels, this benchmark uses double-precision data, which halves the theoretical speedup available in this case.

4 Kernel Optimizations
Before considering hand-tuned optimizations based on compiler intrinsics, we have first tried to take advantage of the automatic vectorization capabilities of the Intel ICC compiler (the kernels have been developed in C). For this purpose, we have introduced some directives that inform the compiler about pointer disambiguation and data alignment. In addition, we have performed some code modifications, such as changing the scope of variables or introducing loop fission, which eliminate some vectorization inhibitors (see the sketch below). This approach works fine for some operators such as the prolongation, the restriction or the metrics. However, it provides insignificant improvements on the smoother due to data dependencies and non-contiguous data accesses, and hence this component requires more coding effort.
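As an example of such a modification, loop fission can separate a vectorizable statement from a recurrence that inhibits vectorization. The following sketch is illustrative and not taken from the kernels:

    /* Before: the recurrence on s prevents vectorization of the whole loop. */
    void fused(float *s, float *t, const float *a, const float *b, int n)
    {
        int i;
        for (i = 1; i < n; i++) {
            t[i] = a[i] * b[i];
            s[i] = s[i - 1] + t[i];
        }
    }

    /* After fission: the first loop is dependency-free and vectorizable;
     * the sequential recurrence is confined to the second loop.           */
    void fissioned(float *s, float *t, const float *a, const float *b, int n)
    {
        int i;
        for (i = 1; i < n; i++)
            t[i] = a[i] * b[i];
        for (i = 1; i < n; i++)
            s[i] = s[i - 1] + t[i];
    }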
4.1 Gauss-Seidel Point-Wise Relaxation
As is well known, the Gauss-Seidel smoother is not fully vectorizable due to Read-After-Write (RAW) data dependencies.

    for i {
      for j {
        dw[i][j] = (r[i][j] - xm[i][j]*dw[i-1][j]) / dg[i][j];  /* vectorizable     */
        a        = ym[i][j] / dg[i][j];                         /* vectorizable     */
        dw[i][j] = dw[i][j] - a*dw[i][j-1];                     /* RAW dependencies */
        w[i][j]  = w[i][j] + dw[i][j];                          /* vectorizable     */
      }
    }

Fig 1: Main loop of the Gauss-Seidel smoother.
However, as figure 1 shows, some parts of the computation can be vectorized if the RAW dependencies are properly isolated [16]. In any case, the overall improvement is limited by this non-vectorizable part. The vectorization of the dependency-free computations is performed by simply strip-mining the inner loop. As mentioned above, the SIMD operations are explicitly specified by means of the Intel ICC compiler intrinsics [19]. The tricky part of this scheme is the integration of the sequential computation with the SIMD operations. As figure 2 shows, it consists of four steps, where computations are serialized using appropriate vector masks and shuffle operations. Taking into account the overhead associated with this computation, the performance results are satisfactory, even for small problem sizes (see figure 3). As expected, the speed-up grows with the problem size due to start-up overheads, varying from 1.3 to 1.85 for the range of grid sizes considered.
[Fig 2: Integration of the non-vectorizable computations of the Gauss-Seidel smoother using Intel ICC compiler intrinsics. In each of the four steps, the previously computed element is broadcast with _mm_shuffle_ps, multiplied by the coefficient vector (_mm_mul_ps), masked to a single lane (_mm_and_ps) and subtracted (_mm_sub_ps) from the dw vector; finally dw is accumulated into w (_mm_add_ps).]
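The four-step serialization of figure 2 can be expressed with SSE intrinsics roughly as follows. This is a minimal sketch, assuming the dependency-free part of dw and the coefficient vector a = ym/dg have already been computed in SIMD, and that carry holds the last element of the previous group in its highest lane; the helper name and the mask construction are ours, not the original implementation.

    #include <xmmintrin.h>

    /* Resolve the recurrence dw[k] -= a[k]*dw[k-1] inside one 4-wide vector. */
    static __m128 serialize_raw(__m128 dw, __m128 a, __m128 carry)
    {
        static const unsigned int m[4][4] = {   /* one single-lane mask per step */
            {0xFFFFFFFFu, 0u, 0u, 0u}, {0u, 0xFFFFFFFFu, 0u, 0u},
            {0u, 0u, 0xFFFFFFFFu, 0u}, {0u, 0u, 0u, 0xFFFFFFFFu}
        };
        __m128 mask, b;

        /* step 0: dw0 -= a0 * carry3 (last element of the previous group) */
        mask = _mm_loadu_ps((const float *)m[0]);
        b    = _mm_shuffle_ps(carry, carry, _MM_SHUFFLE(3, 3, 3, 3));
        dw   = _mm_sub_ps(dw, _mm_and_ps(_mm_mul_ps(a, b), mask));

        /* step 1: dw1 -= a1 * dw0 */
        mask = _mm_loadu_ps((const float *)m[1]);
        b    = _mm_shuffle_ps(dw, dw, _MM_SHUFFLE(0, 0, 0, 0));
        dw   = _mm_sub_ps(dw, _mm_and_ps(_mm_mul_ps(a, b), mask));

        /* step 2: dw2 -= a2 * dw1 */
        mask = _mm_loadu_ps((const float *)m[2]);
        b    = _mm_shuffle_ps(dw, dw, _MM_SHUFFLE(1, 1, 1, 1));
        dw   = _mm_sub_ps(dw, _mm_and_ps(_mm_mul_ps(a, b), mask));

        /* step 3: dw3 -= a3 * dw2 */
        mask = _mm_loadu_ps((const float *)m[3]);
        b    = _mm_shuffle_ps(dw, dw, _MM_SHUFFLE(2, 2, 2, 2));
        dw   = _mm_sub_ps(dw, _mm_and_ps(_mm_mul_ps(a, b), mask));

        return dw;   /* the caller accumulates dw into w with _mm_add_ps */
    }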
[Fig 3: Speed-up of the vectorial Gauss-Seidel kernel for problem sizes 128x128 to 1024x1024.]
4.2 Red-Black Point-Wise Relaxation
In order to implement the vectorial version of the Red-Black smoother, we have employed an agglomeration technique, which is graphically described in figures 4 and 5.
[Fig 4: Initial step of the vectorial Red-Black smoother: points of one color are packed into a vector register using _mm_load_ps and _mm_shuffle_ps.]

[Fig 5: Final step of the vectorial Red-Black smoother: updated points are interleaved back into the matrix using _mm_unpacklo_ps, _mm_unpackhi_ps and _mm_store_ps.]
The algorithm can take advantage of the SIMD extensions at the expense of an initial and a final step in which data (red or black points) are transferred between memory and the vector registers. In the first step (see figure 4), groups of four red or black points are packed into a single vector register by means of the Intel ICC _mm_load_ps and _mm_shuffle_ps compiler intrinsics (with the proper shuffle mask depending on the color). After the smoother iteration, which takes advantage of the SIMD operations, the new data can be stored in the original matrix without changing the previous values of the opposite color. The implementation is based on the _mm_unpacklo_ps, _mm_unpackhi_ps and _mm_store_ps compiler intrinsics (see figure 5), and is sketched below.
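A minimal sketch of one row fragment, assuming 16-byte aligned rows, eight consecutive points with the red ones at even indices, and smooth4 standing in for the actual five-point update (which also reads neighboring values); all names are illustrative.

    #include <xmmintrin.h>

    static __m128 smooth4(__m128 p) { return p; }   /* placeholder update */

    static void relax_red_quad(float *row)   /* row: 8 consecutive floats */
    {
        __m128 v0 = _mm_load_ps(row);         /* x0 x1 x2 x3 */
        __m128 v1 = _mm_load_ps(row + 4);     /* x4 x5 x6 x7 */

        /* gather the red (even-index) points: x0 x2 x4 x6 */
        __m128 red   = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(2, 0, 2, 0));
        /* keep the black (odd-index) points:  x1 x3 x5 x7 */
        __m128 black = _mm_shuffle_ps(v0, v1, _MM_SHUFFLE(3, 1, 3, 1));

        red = smooth4(red);                   /* SIMD smoother iteration */

        /* interleave the colors again: r0 x1 r1 x3 | r2 x5 r3 x7 */
        _mm_store_ps(row,     _mm_unpacklo_ps(red, black));
        _mm_store_ps(row + 4, _mm_unpackhi_ps(red, black));
    }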
[Fig 6: Speed-up of the vectorial Red-Black kernel for problem sizes 128x128 to 1024x1024.]
Figure 6 shows the benefits of this optimization for different problem sizes. Contrary to our expectations, the achievable speed-ups are lower than their Gauss-Seidel counterparts, varying from 1.2 to 1.5. This behavior is a consequence of the overhead associated with the additional data transfers and shuffle operations required by the agglomeration.
4.3 Robust Kernels
The vectorization scheme used for zebra line-wise relaxation is similar to the agglomeration technique applied in the Red-Black kernel.

[Fig 7: Agglomeration scheme for zebra line relaxation: groups of four same-color lines are transferred into an auxiliary buffer so that the vectorial relaxation operates on 4-float vector registers.]
The relaxation of a single line cannot be vectorized due to the data dependencies involved in Thomas' algorithm. However, lines of the same color can be relaxed simultaneously, which allows the exploitation of SIMD parallelism; a sketch is given below. For this purpose, groups of four lines of the same color are first transferred into an auxiliary buffer (see figure 7). Obviously, data transfers between the original matrix and the auxiliary buffer involve a significant overhead (see figure 8). However, it is compensated by far (in terms of execution time), since it avoids non-contiguous accesses when the vectorial computations are performed.
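Once four same-color lines are interleaved in the buffer, element j of all four lines occupies a single vector, and every step of Thomas' algorithm becomes an elementwise SIMD operation. The following sketch of the forward elimination assumes this interleaved layout; the names are ours, not the kernel's.

    #include <xmmintrin.h>

    /* a, b, c, d: tridiagonal coefficients and right-hand side of four
     * same-color lines, one __m128 per position j; cp, dp: modified
     * coefficients; n: line length.                                     */
    static void thomas_forward4(const __m128 *a, const __m128 *b,
                                const __m128 *c, const __m128 *d,
                                __m128 *cp, __m128 *dp, int n)
    {
        int j;
        __m128 one = _mm_set1_ps(1.0f);
        __m128 inv = _mm_div_ps(one, b[0]);
        cp[0] = _mm_mul_ps(c[0], inv);
        dp[0] = _mm_mul_ps(d[0], inv);
        for (j = 1; j < n; j++) {
            /* denom = b[j] - a[j]*cp[j-1], identical on all four lines */
            __m128 denom = _mm_sub_ps(b[j], _mm_mul_ps(a[j], cp[j - 1]));
            inv   = _mm_div_ps(one, denom);
            cp[j] = _mm_mul_ps(c[j], inv);
            dp[j] = _mm_mul_ps(_mm_sub_ps(d[j], _mm_mul_ps(a[j], dp[j - 1])), inv);
        }
        /* the backward substitution proceeds analogously */
    }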
[Fig 8: Execution time profile of the optimized X-line smoother of the alternating-line kernel, split into the X-line solver and the 2D→1D and 1D→2D data transfers, for problem sizes 128x128 to 1024x1024.]
Figure 9 shows the benefits of this optimization for the alternating-line approach. The performance gains are significantly greater than those of the point-wise kernels, ranging from 1.9 to 2.25. Unlike the point-wise kernels, these speed-ups decrease with the problem size due to the growing weight of the agglomeration data transfers (see figure 8).
[Fig 9: Speed-up of the vectorial alternating-line kernel (total, X-line and Y-line) for problem sizes 128x128 to 1024x1024.]

[Fig 10: Execution time profile of the original alternating-line kernel (X-line, Y-line, others).]
Basically, the Alternating-line kernel mimics the behavior of the X-line smoother since, as figures 10 and 11 show, X-line relaxation is the most expensive component of this kernel, accounting for around 40% to 50% of the execution time. As expected, Y-line relaxation achieves better speed-ups due to a better exploitation of the spatial data locality, which reduces the overhead associated with the agglomeration data transfers.
[Fig 11: Execution time profile of the optimized alternating-line kernel (X-line, Y-line, others).]

[Fig 12: Speed-up of the vectorial semicoarsening kernel (total and Y-line) for problem sizes 128x128 to 1024x1024.]
Figure 12 shows the improvement achieved on the semicoarsening approach. In this case, the performance gains are almost independent of the problem size. Contrary to our expectations, however, they are slightly lower than the alternating-line counterparts (around 1.75). As mentioned above, Y-line relaxation allows a better exploitation of spatial locality than X-line relaxation, which translates into better speed-ups. However, the Y-line length in the semicoarsening approach does not decrease with the multigrid level, which involves greater data-transfer overheads.

5 NAS-MG Optimizations
Unlike the previous kernels, automatic vectorization achieves a noticeable speedup (around 18%) in this case without any code restructuring or modification, partly due to the features of the Fortran language. Nevertheless, better performance can be achieved by expressing the SIMD parallelism explicitly through compiler intrinsics. For this purpose we have first profiled the benchmark, which allowed us to identify where it is necessary to invest more coding effort. As figure 13 shows, four routines, resid, psinv, comm1p and rprj3, account for around 85% of the execution time.
[Fig 13: NAS-MG (Class B) profiling on the target platform: percentage of execution time spent in resid, psinv, comm1p, rprj3 and others.]

[Fig 14: Speed-up of the automatic and intrinsic-based vectorial versions of the NAS-MG, per routine and in total.]
The resid function computes the residual operator on every grid level. In order to exploit the SIMD extensions, its main loop only requires strip-mining, since it presents neither data dependencies nor non-contiguous accesses (the NAS-MG authors designed this function with a potential vectorization in mind). Nevertheless, an additional source of improvement can be obtained in this case by using non-temporal stores (_mm_stream_pd). The goal of this alternative to the conventional store is to avoid cache pollution by bypassing the memory hierarchy and writing directly into main memory. This kind of store is quite helpful in media applications, since they often deal with long streams of data that lack temporal locality. In our context, this behavior is observed on the finest multigrid level. For this reason, our optimized code chooses between normal and non-temporal stores depending on the grid level, as sketched below.

The psinv function, which computes the smoother operator, is similar to the previous one, although in this case unaligned vector loads have also been necessary (_mm_loadu_pd besides the conventional _mm_load_pd). The comm1p function deals with the symmetric boundary conditions and only performs some data transfers between different buffers. The optimization in this case consists only of introducing vectorial load and store operations.
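Returning to resid, the level-dependent store policy might look like the following sketch, in which r = v - Au is computed elementwise with Au assumed precomputed; the names, data layout and per-call flag are ours, not the benchmark's.

    #include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */

    static void resid_store(double *r, const double *v, const double *au,
                            int n, int finest)
    {
        int i;
        for (i = 0; i < n; i += 2) {          /* two doubles per __m128d */
            __m128d res = _mm_sub_pd(_mm_load_pd(&v[i]), _mm_load_pd(&au[i]));
            if (finest)
                _mm_stream_pd(&r[i], res);    /* non-temporal: bypass the caches */
            else
                _mm_store_pd(&r[i], res);     /* normal cached store */
        }
    }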
The rprj3 function, which computes the restriction operator, does not present data dependencies either. However, unlike resid and psinv, it has required some agglomeration (similar to the Red-Black kernel) due to non-contiguous data accesses.

Figure 14 compares the improvements achieved by the automatic and the intrinsic-based vectorization. As expected, the hand-tuned version outperforms the automatic one, reaching a satisfactory global speed-up close to 1.4. The largest difference between both versions is obtained in the comm1p routine, since introducing vectorial copies reduces the number of instructions and, hence, the number of data dependencies. In resid and psinv, the hand-tuned version achieves an additional 20% gain over the automatic one. The main problem arises in the resid function: in spite of being fully vectorizable, its speed-up is only 1.2. A preliminary analysis using the Pentium 4 hardware performance counters suggests a significant memory bottleneck, which limits SIMD exploitation and makes cache optimizations of great interest. The results of the rprj3 optimization are quite satisfactory if we take into account the effect of the data agglomeration.
6 Conclusions and Future Research

To the best of the authors' knowledge, the vectorization of multigrid codes using small-scale SIMD ISA extensions has not been studied previously. We have proposed several implementations of such codes that exploit this kind of in-processor parallelism. To extend our analysis to a broad range of schemes, we have considered both standard and robust multigrid kernels, as well as the popular NAS-MG benchmark. Using the Intel Pentium 4 processor as experimental platform, we can draw the following conclusions:

1. Standard multigrid based on point-wise smoothers. The proposed vectorization of the Gauss-Seidel point-wise relaxation scheme achieves a speedup between 1.3 and 1.85 depending on the problem size. Contrary to our expectations, the speedup for the Red-Black kernel is slightly lower, between 1.2 and 1.5, due to the agglomeration overhead. In both cases, the speedup increases with the problem size as the start-up overheads are amortized.

2. Robust multigrid based on zebra line-wise relaxation. In this case, we have used a vectorization scheme based on an agglomeration technique instead of employing a vectorial line solver. That is, groups of four lines of the same color are relaxed simultaneously at the expense of a data transfer overhead. Nevertheless, this overhead is compensated by far, achieving a satisfactory speedup (around 2) in both the alternating-line and semicoarsening versions.

3. NAS-MG benchmark. Finally, we have carried out a first attempt to adapt the optimizations studied above to the ubiquitous NAS-MG, obtaining a preliminary speedup of 1.4 for the overall program. We should note that the vectorial transfers represent a significant percentage of this performance gain.

These encouraging results suggest that combining in-processor parallelism with other performance optimizations, such as efficient memory-hierarchy exploitation and multiprocessor parallelization, can attain a significant improvement. Our future research plans are aimed at analyzing the interaction and potential benefits of their combination.
7 References

[1] W. A. Wulf, S. A. McKee. "Hitting the Memory Wall: Implications of the Obvious". ACM SIGARCH Computer Architecture News, March 1995.
[2] A. Saulsbury, F. Pong, A. Nowatzyk. "Missing the Memory Wall: The Case for Processor/Memory Integration". In Proceedings of the International Symposium on Computer Architecture (ISCA '96), May 1996.
[3] C. Weiß, W. Karl, M. Kowarschik, U. Rüde. "Memory Characteristics of Iterative Methods". In Proceedings of the Supercomputing Conference, Portland, Oregon, November 1999.
[4] S. Sellappa, S. Chatterjee. "Cache-Efficient Multigrid Algorithms". In Proceedings of the 2001 International Conference on Computational Science, San Francisco, USA, May 2001.
[5] C. C. Douglas, J. Hu, W. Karl, M. Kowarschik, U. Rüde, C. Weiß. "Cache Optimization for Structured and Unstructured Grid Multigrid". Electronic Transactions on Numerical Analysis, 10 (2000), pp. 25-40.
[6] K. Diefendorff, P. K. Dubey. "How Multimedia Workloads Will Change Processor Design". IEEE Computer, vol. 30, Sept. 1997, pp. 43-45.
[7] R. B. Lee, M. D. Smith. "Media Processing: A New Design Target". IEEE Micro, vol. 16, Aug. 1996, pp. 6-9.
[8] A. Peleg, U. Weiser. "MMX Technology Extension to the Intel Architecture". IEEE Micro, July/Aug. 1996, pp. 42-50.
[9] S. Thakkar, T. Huff. "The Internet Streaming SIMD Extensions". Intel Technology Journal, Q2, 1999.
[10] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, P. Roussel. "The Microarchitecture of the Pentium 4 Processor". Intel Technology Journal, Q1, 2001.
[11] Advanced Micro Devices, Inc. "AMD 3DNow! Technology Manual". Publication number 21928, March 2000.
[12] M. Tremblay et al. "VIS Speeds New Media Processing". IEEE Micro, July/Aug. 1996, pp. 10-20.
[13] R. B. Lee. "Subword Parallelism with MAX-2". IEEE Micro, July/Aug. 1996, pp. 51-59.
[14] K. Diefendorff, P. Dubey, R. Hochsprung, H. Scales. "AltiVec Extension to PowerPC Accelerates Media Processing". IEEE Micro, pp. 85-96, April 2000.
[15] W. M. Lioen. "Parallelizing a Highly Vectorized Multigrid Code with Zebra Relaxation". In Proceedings of Supercomputing '92, Minneapolis, Minnesota, November 16-20, pp. 180-189.
[16] C. C. Douglas. "Some Remarks on Completely Vectorizing Point Gauss-Seidel While Using the Natural Ordering". Advances in Computational Mathematics, Vol. 2, No. 2, 1994.
[17] F. Saied, M. J. Holst. "Vector Multigrid: An Accuracy and Performance Study". Report No. UIUCDCS-R-90-1636, Dept. of Computer Science, University of Illinois at Urbana-Champaign, 1990.
[18] J. Corbal, M. Valero, R. Espasa. "Exploiting a New Level of DLP in Multimedia Applications". In MICRO 32, 1999.
[19] Intel Corp. Intel C/C++ and Intel Fortran Compilers for Linux. Information available at http://www.intel.com/software/products/compilers
[20] GNU GCC home page. http://gcc.gnu.org
[21] U. Trottenberg, C. Oosterlee, A. Schüller. "Multigrid". Academic Press, 2001.
[22] D. H. Bailey, T. Harris, R. van der Wijngaart, W. Saphir, A. Woo, M. Yarrow. "The NAS Parallel Benchmarks 2.0". Tech. Rep. NAS-95-010, NASA Ames Research Center, 1995.
[23] J. L. Henning. "SPEC CPU2000: Measuring CPU Performance in the New Millennium". IEEE Computer, Vol. 33 (7), July 2000, pp. 28-35.
[24] The ASCI 30-TeraOps Sample Application Codes. Information available at http://www.acl.lanl.gov/30TeraOpRFP/SampleApps
[25] M. Prieto, R. Santiago, D. Espadas, I. M. Llorente, F. Tirado. "Parallel Multigrid for Anisotropic Elliptic Equations". Journal of Parallel and Distributed Computing, vol. 61, no. 1, pp. 96-114. Academic Press, 2001.