Redbooks Paper Gaussian Performance Using AIX ...

Redbooks Paper Carlos P. Sosa Tina Tarquinio

Gaussian Performance Using AIX Large Pages

This Redpaper compares the performance of the latest major release of the popular electronic structure package Gaussian, Gaussian 03 Revision B.01, using AIX® large pages. The IBM eServer™ pSeries® 690 1.30 GHz and pSeries 650 1.45 GHz systems were used in this study. The molecules tested here (C2H4, C12H24O6, C8H4N4O2, C10H16, C54H90N6O18, C8H8O2, and C14H12N2O2) indicate that small systems such as C2H4 do not benefit from large pages. This conclusion is independent of the option selected in Gaussian. Systems such as C8H8O2 are borderline cases. All the other systems tested showed performance improvements. The trend seems to indicate that as the size of the system increases so does the performance when using large pages. The largest speedup that was observed was approximately 10%.

Introduction

Scientific applications, also considered numerically intensive programs, benefit from performance tuning. Performance tuning of applications can be carried out at different levels and for multiple targets. The level and the target can be selected by analyzing computational bottlenecks. A particular target that might be considered is I/O or memory performance. A complete review of this subject is beyond the scope of this study. However, the reference listed in footnote 1 provides an excellent source of information about the different levels and targets when tuning for better performance on IBM® systems. Even without a careful analysis as recommended, it is possible at the system level to improve performance. Although there are a multitude of parameters that can be changed to carry out system performance optimization1, in this study, we look at large pages. Traditionally, IBM systems have supported what are called 4 KB pages. 4 KB pages correspond to the unit (size) used when mapping virtual and physical memory1. On the new POWER4™ systems, in particular, with the introduction of the pSeries 690 Model 681, a unit of 16 MB was added in addition to the 4 KB unit. The 4 KB pages are called small pages (SP), and the 16 MB pages correspond to the large pages (LP). The primary benefit of large pages is to improve performance for certain applications1. In this study, we restrict our attention to the quantum chemistry application Gaussian 032.

1 The POWER4 Processor Introduction and Tuning Guide, SG24-7041

© Copyright IBM Corp. 2003. All rights reserved.

ibm.com/redbooks

Gaussian 03 can take advantage of both memory approaches, distributed or shared. Previous studies have reported the performance of Gaussian running on different architectures,3,4,5,6,7,8 including pSeries. Gaussian represents an excellent candidate to determine the effect of large versus small pages. Gaussian has a wide variety of options that allow one to test multiple hardware and software components. In the next section, we present the design features of the systems and configurations tested in this study. We also give a brief overview of large pages and provide a brief review section that describes Gaussian. In the last sections, we describe the benchmarks and their performance as a function of small and large pages. In addition, we tested multiple configurations of large pages under the 32-bit and 64-bit kernels. Although we try to maintain a constant set of benchmarks, it is important to note that from time to time we update this set to reflect new functionality in Gaussian or particular features where performance might play an important role and had not already been exposed with our current set of benchmarks. For that reason, we also introduce a benchmark that is based on the time-dependent density functional theory method9. We conclude this work with a summary.
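The difference between the 4 KB and 16 MB page sizes described in this introduction can be made concrete with a small calculation. The following is a minimal sketch, not part of the original study: it counts how many pages are needed to map a memory region of a given size. The 1 GB region and the function name `pages_needed` are hypothetical choices made for illustration only.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

SMALL_PAGE = 4 * KB    # small page (SP) size given in the text
LARGE_PAGE = 16 * MB   # large page (LP) size given in the text


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Return the number of pages required to map region_bytes (rounded up)."""
    return -(-region_bytes // page_bytes)


if __name__ == "__main__":
    region = 1 * GB  # hypothetical working set, chosen only for illustration
    small = pages_needed(region, SMALL_PAGE)
    large = pages_needed(region, LARGE_PAGE)
    print(f"4 KB pages needed:  {small}")
    print(f"16 MB pages needed: {large}")
    print(f"ratio: {small // large}")
```

Fewer pages for the same region means fewer address translations to track, which is the mechanism the next section links to translation lookaside buffer misses.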

Large pages

The key benefit of large pages is to improve performance for applications used in high-performance computing (HPC)1. It is important to point out that large pages might not be beneficial for all HPC applications. In particular, it is expected that large pages will help boost performance for cases in which an application is accessing large amounts of memory sequentially1. Also, applications spending a significant amount of time performing gather/scatter operations might benefit from large pages1. In general, the effect of large pages can be identified as reducing the number of translation lookaside buffer misses and improving data prefetch usage1. For the reader interested in using and installing large pages, Mall10 is highly recommended. In fact, this reference should be read in conjunction with this publication, because in this study, we only present a brief overview of large pages.

2 Gaussian 03, Revision B.01, M. J. Frisch, G. W. Trucks, H. B. Schlegel, G. E. Scuseria, M. A. Robb, J. R. Cheeseman, J. A. Montgomery, Jr., T. Vreven, K. N. Kudin, J. C. Burant, J. M. Millam, S. S. Iyengar, J. Tomasi, V. Barone, B. Mennucci, M. Cossi, G. Scalmani, N. Rega, G. A. Petersson, H. Nakatsuji, M. Hada, M. Ehara, K. Toyota, R. Fukuda, J. Hasegawa, M. Ishida, T. Nakajima, Y. Honda, O. Kitao, H. Nakai, M. Klene, X. Li, J. E. Knox, H. P. Hratchian, J. B. Cross, C. Adamo, J. Jaramillo, R. Gomperts, R. E. Stratmann, O. Yazyev, A. J. Austin, R. Cammi, C. Pomelli, J. W. Ochterski, P. Y. Ayala, K. Morokuma, G. A. Voth, P. Salvador, J. J. Dannenberg, V. G. Zakrzewski, S. Dapprich, A. D. Daniels, M. C. Strain, O. Farkas, D. K. Malick, A. D. Rabuck, K. Raghavachari, J. B. Foresman, J. V. Ortiz, Q. Cui, A. G. Baboul, S. Clifford, J. Cioslowski, B. B. Stefanov, G. Liu, A. Liashenko, P. Piskorz, I. Komaromi, R. L. Martin, D. J. Fox, T. Keith, M. A. Al-Laham, C. Y. Peng, A. Nanayakkara, M. Challacombe, P. M. W. Gill, B. Johnson, W. Chen, M. W. Wong, C. Gonzalez, and J. A. Pople, Gaussian, Inc., Pittsburgh PA, 2003
3 C. P. Sosa, J. Ochterski, J. Carpenter, and M. J. Frisch, J. Comp. Chem. 19, 1053 (1998); and references therein
4 J. Ochterski, C. P. Sosa, J. Carpenter, Performance of Parallel Gaussian94 on Cray Research Vector Supercomputers, in B. Winget and K. Winget, Editors, Cray User Group Proceedings, Cray User Group, Shepherdstown, WV, p. 108, 1996
5 D. P. Turner, G. W. Trucks, M. J. Frisch, Ab Initio Quantum Chemistry on a Workstation Cluster, in T. G. Matson, Editor, Parallel Computing in Computational Chemistry, ACS Series 592, American Chemical Society, Washington, DC, p. 62
6 C. P. Sosa, G. Scalmani, R. Gomperts, and M. J. Frisch, Parallel Computing 26, 843 (2000)
7 C. P. Sosa and S. Andersson, Gaussian benchmarks put the pSeries 690 server through its paces, IBM eServer Developer Domain, February 2002, http://www.ibm.com/servers/esdd/articles/gauss_bench/index.html
8 Ab Initio Quantum Chemistry on the IBM pSeries 690: A Comparison Between Turbo 1.3 GHz and 1.1 GHz, REDP0444
9 R. E. Stratmann, G. E. Scuseria, and M. J. Frisch, J. Chem. Phys. 109, 8218 (1998); and references therein for a complete review
10 M. Mall, AIX Support for Large Pages, April 2002, available at http://www.ibm.com/servers/aix/whitepapers/large_page.pdf

One of the key features of AIX is its support for a mixed mode, that is, part of the memory is used for large pages and the rest is used with small pages. The percentage of large pages can be tailored according to the needs of a particular site. AIX supports large pages with 32-bit and 64-bit kernels. Applications, either 32-bit or 64-bit, can take advantage of large pages. The extended common object file format (XCOFF or XCOFF64), the object file format for AIX, provides a flag that marks a binary as using large pages. The flag can be turned on at load time with ld10, or set (or cleared) on an existing binary with ldedit10, using the following commands:

- ld command: ld -blpdata -o a.out
- ldedit command: ldedit -blpdata a.out (or -bnolpdata a.out)

In addition, environment variables can be defined to use large pages. Environment variables take precedence over large pages set through ld or ldedit. LDR_CNTRL variables provide three options for large pages. See footnote 10 for details.
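As an illustration of the environment-variable route, the sketch below builds a process environment that sets LDR_CNTRL before a job is launched. It is a hedged example rather than the procedure used in this study. The value M is the one used for the kernel experiments reported later in this paper; the values Y and N are assumed here to be the other two options and should be checked against footnote 10. The helper name `large_page_env` and the commented-out launch command are hypothetical.

```python
import os

# Assumption: the three LARGE_PAGE_DATA settings are Y, N, and M.
# Only M appears in this paper; verify Y and N against footnote 10.
VALID_MODES = {"Y", "N", "M"}


def large_page_env(mode: str = "M", base_env=None) -> dict:
    """Return a copy of the environment with LDR_CNTRL set for large pages."""
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported LARGE_PAGE_DATA mode: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["LDR_CNTRL"] = f"LARGE_PAGE_DATA={mode}"
    return env


if __name__ == "__main__":
    env = large_page_env("M")
    print(env["LDR_CNTRL"])
    # A job could then be started with this environment, for example:
    #   subprocess.run(["g03", "input.com"], env=env)   # hypothetical command line
```

Because the variable is set per process, this approach leaves the XCOFF flag in the binary untouched, which is convenient when the same executable is timed with and without large pages.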

Hardware systems design features

The IBM eServer pSeries servers used in this study were a 1.45 GHz 650 Model 6M2 multiprocessor (1-, 2-, 4-, and 8-way SMP) and a 1.30 GHz POWER4 pSeries 690 Turbo multiprocessor (up to 32-way SMP). The pSeries 690 server is the latest UNIX® server from IBM, with the POWER4 MCM at the core of this latest architecture1,11. The building block for the system used here is an 8-way MCM Turbo running at 1.30 GHz. The Turbo system (8-way MCM) has two cores per L2 cache; therefore, one MCM is 8-way. A full description of the POWER4 architecture is beyond the scope of this work; however, in this section, we provide an overview of the most important features of this architecture. Further details are given in footnotes 1 and 11.

Each processor chip on the pSeries 690 consists of two microprocessors, an L2 cache that runs at the same speed as the microprocessor, the microprocessor interface unit (that is, the interface for each microprocessor to the rest of the system), the directory and cache controller for the L3 cache, the fabric bus controller, and a GX bus controller that enables I/O devices to connect to the central electronic complex (CEC). The L3 cache is a new component that was not available on the POWER3™ architecture. The L3 caches are mounted on a separate module.

The p650 takes full advantage of the p690 architectural features. These machines use the IBM POWER4 chip technology in a building-block approach for medium to large systems. This fact, of course, is consistent with the design principles of the p690; that is, full system design comes first12. One key difference from the p690 is that the packaging uses a Single Chip Module (SCM) containing either one or two processor cores13. In addition, the p650 is the first pSeries system to use the POWER4+™ chip. Similar to the p630, each chip is packaged on an SCM13,14.

Large pages are supported in the 32-bit kernel, as well as the 64-bit kernel. Large pages that will be used in a 32-bit kernel and 64-bit kernel are configured in a 32-bit and 64-bit kernel, respectively. It is also possible to use, in a 64-bit kernel, large pages that were configured in a 32-bit kernel.

11 H. M. Matis, J. D. McCalpin, M-C. Chiang, F. P. O'Connell, P. Buckland, IBM eServer pSeries 690: Configuring for Performance, IBM Corporation, Austin, TX, 2001
12 J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy, POWER4 System Microarchitecture, IBM J. Res. & Dev. 46, 5 (2002), available at http://www.research.ibm.com/journal/rd/461/tendler.html
13 pSeries 650 Model 6M2 Technical Overview and Introduction, REDP0194
14 B. Olszewski, IBM eServer p650 Performance Tuning, IBM Corporation, Marketing Communications Server group, December 2002, Somers, NY, available at http://www.ibm.com/servers/eserver/pseries/hardware/whitepapers/p650_perf.pdf

To test whether these different configurations have any impact on the performance of applications such as Gaussian, we created four different configurations (experiments). One of the Gaussian benchmarks was selected and tested on each configuration. This benchmark ran using 1-way, 2-way, 4-way, 8-way, 16-way, and 32-way configurations. The four configurations were built as follows:

- Configuration I: 32-bit kernel, with 64 GB large pages, set in 32-bit kernel. The machine was booted in a 32-bit kernel; then the large page amount was set, and then rebooted.
- Configuration II: 64-bit kernel, with 64 GB large pages, set in 64-bit kernel. The machine was booted in a 64-bit kernel; then the large page amount was set, and then rebooted.
- Configuration III: 64-bit kernel, with 64 GB large pages, set in 32-bit kernel. The machine was set as Configuration II; then the machine was booted in a 64-bit kernel with no change to the large page setting.
- Configuration IV: 64-bit kernel, with 64 GB large pages. The machine was set as Configuration III; then the following steps were performed:
  a. Set large page amount to 0.
  b. bosboot.
  c. Set large page amount to 64 GB.
  d. bosboot.
  e. reboot.

The last configuration was used to understand if performing these steps would deliver the same results as Configuration II. All the timings in this work correspond to the elapsed time. Timings were measured using the clock and CPU times printed by each link using the Gaussian #P option15. This information was used as input for a utility program that tabulates timings and speedups16. The Gaussian 03 binaries used in this study were compiled on a POWER4 system with the xlf 8.1.0.3 Fortran compilers17 and the ESSL libraries18.

Gaussian

Gaussian2 is a connected series of programs that can be used for performing a variety of electronic structure calculations: molecular mechanics, semi-empirical, ab initio, and density functional theory. Gaussian consists of a collection of programs commonly known as links.

Each link communicates through disk files and is grouped into overlays15. Links are independent executables located in the g03 directory and labeled as lxxx.exe, where xxx is the unique number of each link. In general, overlay zero is responsible for starting the program, including reading the input file. After the input file is read, the route card (keywords and options that specify all the Gaussian parameters) is translated into a sequence of links.
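The naming scheme for links can be sketched in a few lines. This is an illustrative helper, not code from Gaussian: `link_executable` follows the lxxx.exe pattern given above, and `overlay_of` assumes that the overlay number is the link number with its last two digits dropped, which agrees with l9999.exe belonging to overlay 99 as described in this section. Both function names are hypothetical.

```python
def link_executable(link: int) -> str:
    """Return the executable name for a link number, following lxxx.exe."""
    if link < 0:
        raise ValueError("link numbers are non-negative")
    return f"l{link}.exe"


def overlay_of(link: int) -> int:
    """Return the overlay a link belongs to.

    Assumption: the overlay is the link number without its last two digits.
    """
    if link < 0:
        raise ValueError("link numbers are non-negative")
    return link // 100


if __name__ == "__main__":
    # Links that appear in the benchmark tables of this paper, plus l9999.
    for link in (502, 703, 914, 1002, 1110, 9999):
        print(f"{link_executable(link):10s} overlay {overlay_of(link)}")
```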

15 AE. Frisch and M. J. Frisch, Gaussian98 User's Reference, 2nd Edition, Gaussian Inc., Pittsburgh, PA, available at http://www.gaussian.com
16 The utility program can be obtained from one of the authors (C.P. Sosa) at: mailto:[email protected]
17 XL Fortran for AIX, User's Guide Version 7 Release 1, SC09-2866
18 Engineering and Scientific Subroutine Library (ESSL) for AIX, program number 5765-F82



Overlay 99 (l9999.exe) terminates the run. In most cases, l9999.exe finishes with an archive entry (brief summary of the calculation). As previously pointed out, the Gaussian architecture on distributed or shared-addressable memory systems is basically the same3-8. Each link is responsible for continuing the sequence of links by invoking an exec() system call to run the next link. The links that run sequentially are links that are responsible for setting up the calculation and assigning symmetry. In previous publications, we summarized all the links that run in parallel3 and provided performance information3-8.

Selected benchmarks

As in previous studies3-8, we consider four major characteristics when studying parallel performance on the IBM SP or selecting benchmarks: job type, theoretical method, basis set size, and molecular size. The job type corresponds to a single point energy calculation, a gradient calculation, or calculation of second derivatives. Normally, single point calculations are used to compute accurate energies at a level of theory that is too expensive to carry out a full geometry optimization. Due to the importance of molecular structure in chemistry, a large majority of calculations are geometry optimizations using HF, DFT, or MP2. Geometry optimizations are normally followed by a frequency calculation. There are many options for carrying out calculations using Gaussian. In this study, we tried to select job types that reflect how users are currently running Gaussian. This is by no means a full representation of the options available in the program, but they do represent a large percentage of calculations carried out by typical researchers at computer centers. They also correspond to many of the benchmarks carried out for hardware procurements. In addition, and different from our previous work, we selected a small system (ethylene) and evaluated many of the options in Gaussian19. In this study, we chose the following types of calculations:

1. Single-point energy (SP) calculations at different levels of theory.
2. FORCE: This type of calculation corresponds to an SP calculation followed by the calculation of the first derivatives of the energy with respect to the position of the atoms in the molecule.
3. Frequency: This is a calculation of the second derivative of the energy and involves many SP and force calculations.

The time required for a geometry optimization is a multiple of the time needed for a single FORCE calculation.
Rather than doing a full geometry optimization, a FORCE calculation is recommended for benchmarks that involve hardware performance. It is equivalent to doing one cycle of the optimization and should provide a good approximation for the performance of an optimization calculation. A large number of approximate theoretical methods have been reported in the literature20. These methods range in accuracy and computational cost. Because Gaussian provides most of these methods, it is important to understand how they perform as a function of system resources. In this study, we refer to system resources as the number of processors, memory, and small or large pages needed for optimal performance. The theoretical methods chosen in this study have been extensively discussed in the literature20, and it is beyond the scope of this work to describe these methods. The approximations used in this work correspond to Hartree-Fock20, the three parameter hybrid density functional theory of Becke (B3-LYP)21,22, and configuration interaction singles (CIS) energy and gradients23.

Preliminary benchmarks were carried out on a single processor using single point energy calculations on ethylene, crown ether, and caffeine19. In the case of ethylene, multiple options in Gaussian were tested. In the case of crown ether and caffeine, the benchmarks were run using HF and MP2 methods. The cases used in this study for parallel performance correspond to the same cases from a previous study3-8. Case I is an SP calculation of α-pinene at the HF level of theory using the 6-311G(df,p) basis set. Case II corresponds to a FORCE calculation on a fairly large system, valinomycin. Case III is an α-pinene frequency calculation at the B3-LYP/6-31G(d) level of theory. Cases II and III exercise several of the links that run in parallel. Case IV is a FORCE calculation of acetyl phenol excited states using CIS with the 6-31++G basis set23. Case V is a single-point Time-Dependent B3-LYP9 energy on C14H12N2O2. This set of benchmarks represents small to large systems to test speedup and efficiency for most parallel links with and without large pages. All the geometries are available from the author24.

19 D. Feller, The EMSL Ab Initio Methods Benchmark Report: A Measure of Hardware and Software Performance in the Area of Electronic Structure Methods, June 1997, PNNL, Richland Washington, available at http://www.emsl.pnl.gov/docs/tms/abinitio/cover.html
20 W. J. Hehre, L. Radom, P. V. R. Schleyer, and J. A. Pople, Ab Initio Molecular Orbital Theory, John Wiley & Sons, 1985, ISBN 0471812412

Results and discussion

In this study, as in our previous reports, we look at performance in terms of speedup3-8. Speedup (S) is defined as the ratio of the serial run time (elapsed time, ts) over the time that it takes to do the same problem in parallel (elapsed time, tp):

$$ S = \frac{t_s}{t_p} \qquad (5) $$

Efficiency (e) is the fraction of time that a processor is doing useful work. It also provides an indirect indicator of the percentage of parallel code needed for linear or nearly linear scalability.

$$ e = \frac{S}{N_{\mathrm{procs}}} \qquad (6) $$
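Equations (5) and (6) can be evaluated directly from elapsed timings. The sketch below is a minimal illustration, not the tabulation utility used by the authors: it applies both definitions to the small-page l502.exe timings reported for α-pinene in Table 4. The function names are hypothetical.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Equation (5): S = t_s / t_p, using elapsed times."""
    return t_serial / t_parallel


def efficiency(s: float, n_procs: int) -> float:
    """Equation (6): e = S / N_procs."""
    return s / n_procs


if __name__ == "__main__":
    # Small-page l502.exe elapsed times (seconds) from Table 4,
    # keyed by number of processors.
    timings = {1: 771, 2: 391, 4: 238, 8: 152}
    t_serial = timings[1]
    for n_procs, t_parallel in timings.items():
        s = speedup(t_serial, t_parallel)
        e = efficiency(s, n_procs)
        print(f"{n_procs}-way: S = {s:.2f}, e = {e:.2f}")
```

Rounded to two decimals, these values match the S and e columns reported for l502.exe with small pages in Table 4.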

To compare scalability, we look at the measured speedup against ideal speedup, rather than using an extrapolated speedup as we have done in the past6. We take this approach because we want to compare how POWER4 machines perform, rather than analyzing the scalability of Gaussian. Table 1 shows all the options and molecules used in our single processor benchmarks. The objective of this preliminary set of benchmarks is to assess the effect of large pages on a very small system (ethylene molecule). In the same table, the effect of large pages is compared to systems larger than ethylene, such as crown ether and caffeine.

21 B. G. Johnson, P. M. W. Gill, and J. A. Pople, J. Chem. Phys., 98, 5612 (1993)
22 A. D. Becke, J. Chem. Phys., 98, 5648 (1993)
23 J. B. Foresman, M. Head-Gordon, J. A. Pople, and M. J. Frisch, J. Phys. Chem. 96, 135 (1992)
24 To obtain all the input files used in this study send an e-mail message to: mailto:[email protected]


Table 1 Preliminary benchmarks using ethylene and caffeine

Number  Molecule   Gaussian option
1       C2H4       HF/GEN SCF=(InCore,NOSINGLEPOINT)
2       C2H4       HF/6-311++G(3df,3pd) SCF=Conven
3       C2H4       HF/6-311++G(3df,3pd) SCF=(Direct,NoSinglePoint)
4       C2H4       HF/6-311++G(3df,3pd) Force SCF=Conven
5       C2H4       HF/6-311++G(3df,3pd) Freq SCF=Conven
6       C2H4       CCSD(T)/6-311++G(3df,3pd)
7       C2H4       MP2=(Conventional,Full)/6-311++G(3df,3pd) SCF=Conven
8       C2H4       MP2=(FULL,FULLDIRECT)/6-311++G(3df,3pd) SCF=DIRECT
9       C2H4       MP2=(FULL)/6-311++G(3df,3pd) FORCE SCF=Conven
10      C2H4       MP4/6-311++G(3df,3pd)
11      C2H4       SVWN/6-311++G(3df,3pd)
12      C2H4       BLYP/6-311++G(3df,3pd)
13      C2H4       UHF/6-311++G(3df,3pd) SCF=Conven
14      C12H24O6   HF/GEN SCF=(DIRECT,NOSINGLEPOINT)
15      C8H9N4O2   UHF/6-31G** SCF=(Direct,NoSinglePoint)
16      C8H9N4O2   MP2/6-31G** SCF=(Direct,NoSinglePoint) MAXDISK=20000000000000
17      C2H4       MP2/6-311++G(3df,3pd) Freq

Elapsed, user, and system times as provided by timex are reported in Table 2. Table 3 summarizes the results from Table 2 in a format that is easier to read: it shows the effect of large pages versus small pages as the percentage gained or lost in performance by using large pages (∆%). A positive number signifies that the effect of large pages is detrimental, and a negative number indicates a gain in performance. In the case of ethylene, the results are consistent. Although we see an improvement (between 1% and 20%) in the user time, the elapsed time shows that for all the options tested, the use of large pages is detrimental. This is not surprising, because the elapsed time is affected by the system time. Table 2 shows a comparison between small and large pages as a function of multiple options for ethylene and caffeine in Gaussian 03 Rev.B.01.

Table 2 Small and large pages as a function of multiple options for ethylene and caffeine[a]

            Small pages[b]                     Large pages[b]
Benchmark   Elapsed     User       System     Elapsed     User      System
1           14.94       11.86      1.68       37.87       9.58      24.53
2           16.45       7.82       4.21       38.25       7.51      23.55
3           27.67       25.39      0.98       47.54       24.84     18.99
4           29.18       19.95      4.75       54.67       19.03     28.01
5           155.32      142.65     7.51       179.62      137.54    34.50
6           318.35      257.01     45.52      326.45      240.13    67.18
7           104.59      54.82      36.00      128.1       53.63     58.66
8           39.6        37.18      1.15       60.2        36.20     20.19
9           65.72       53.29      6.77       90.03       50.74     31.61
10          154.92      123.75     27.73      176.07      122.69    47.42
11          20.38       18.30      0.92       40.68       18.32     18.87
12          21.6        19.57      1.04       41.7        19.44     18.87
13          24.51       10.48      8.10       45.92       10.26     27.07
14          10601.52    10594.07   3.92       10005.27    9977.75   21.58
15          768.37      763.45     2.36       758.51      733.48    20.44
16          1763.48     1711.22    38.33      1632.2      1552.20   63.05
17          1421.23     1364.85    30.27      1372.84     1282.38   58.25

a. See Table 1 for benchmark definitions.
b. All timings in seconds were obtained using timex.

This indicates that the 16 MB granularity of large pages used when mapping between virtual and physical memory is reflected in the elapsed time through the system time. Therefore, the time to solution will be affected. In other words, small Gaussian calculations that access small amounts of memory will not benefit from large pages. The almost 20% speedup for the user time seen in Benchmark 1 is not unusual. Benchmark 1 is an in-core calculation; although it is a small system, all the integrals are stored in memory. This particular benchmark requires much more memory than all the other ethylene benchmarks. Table 3 shows a comparison between small and large pages as a function of different options in Gaussian for ethylene, crown ether, and caffeine.

Table 3 Small and large pages as a function of different options in Gaussian[a]

               Small pages → Large pages (∆%)[c]
Benchmark[b]   Elapsed   User   System
1              153       -19    1360
2              133       -4     459
3              72        -2     1838
4              87        -5     490
5              16        -4     359
6              3         -7     48
7              22        -2     63
8              52        -3     1656
9              37        -5     367
10             14        -1     71
11             97        0      1951
12             93        -1     1714
13             87        -2     234
14             -6        -6     451
15             -1        -4     766
16             -7        -9     64
17             -3        -6     92

a. Gaussian 03 Rev.B.01.
b. All benchmarks defined in Table 1.
c. Percentage of performance gained or lost.
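The ∆% values in Table 3 can be related to the timings in Table 2 with a one-line formula. The sketch below assumes that ∆% is the change in time relative to the small-page timing, so that positive values mean large pages are slower. That assumption is not stated explicitly in the text, but it reproduces the Table 3 entries for Benchmark 1. The function name `delta_percent` is hypothetical.

```python
def delta_percent(t_small: float, t_large: float) -> float:
    """Percentage change in time when going from small to large pages.

    Assumption: the change is taken relative to the small-page timing.
    Positive values mean large pages are slower; negative values mean a gain.
    """
    return 100.0 * (t_large - t_small) / t_small


if __name__ == "__main__":
    # Benchmark 1 timings (seconds) from Table 2: (small pages, large pages).
    benchmark_1 = {
        "elapsed": (14.94, 37.87),
        "user": (11.86, 9.58),
        "system": (1.68, 24.53),
    }
    for kind, (t_small, t_large) in benchmark_1.items():
        print(f"{kind:8s} {delta_percent(t_small, t_large):8.0f}%")
```

For this benchmark the large increase in system time outweighs the gain in user time, which is why the elapsed time is worse with large pages.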

Table 4 summarizes the performance of Case I (Hartree-Fock single-point energy calculation). All these calculations were carried out on a pSeries 650 1.45 GHz machine. This system contains 346 basis functions and 548 primitive Gaussians. The integral buffers are 131072 64-bit words long. For the single processor runs, large pages tend to improve performance by about 3%. As the number of processors is increased, we see the opposite effect; large pages slightly slow down the calculations. Because this is a small- to medium-size case, the number of integrals calculated per processor (less use of memory) tends to decrease rapidly, and the system time becomes the dominating factor.

Table 4 Hartree-Fock Single-Point Energy Calculations on α-pinene (C10H16)[a]

Number of
processors   System              l502.exe[b]   S[c]   e[c]   Total[b,d]   S[c]   e[c]
1            p650 1.45 GHz SP    771           1.00   1.00   779          1.00   1.00
1            p650 1.45 GHz LP    747           1.00   1.00   758          1.00   1.00
2            p650 1.45 GHz SP    391           1.97   0.99   779          1.97   0.99
2            p650 1.45 GHz LP    381           1.96   0.98   389          1.95   0.98
4            p650 1.45 GHz SP    238           3.24   0.81   242          3.22   0.81
4            p650 1.45 GHz LP    232           3.22   0.81   243          3.12   0.78
8            p650 1.45 GHz SP    152           5.07   0.63   157          4.96   0.62
8            p650 1.45 GHz LP    144           5.19   0.65   157          4.83   0.60

a. Total of 346 basis functions; 6-311G(df,p) basis set; C1 symmetry.
b. All timings are in seconds and correspond to elapsed time.
c. Speedup and efficiency.
d. Total time to complete the run.



On the other hand, Table 5 shows a FORCE calculation carried out on a larger system (valinomycin). These results indicate that for this type of system, large pages tend to show a consistent performance improvement when going from 1-way to 8-way runs. The 1-way, 2-way, 4-way, and 8-way percentage decreases (SP to LP) for the column of total timings are 5%, 5%, 7%, and 5%, respectively.

Table 5 B3-LYP FORCE calculation on valinomycin (C54H90N6O18)[a]

Number of
processors   System              l502.exe[b]   S[c]   l703.exe[b]   S[c]   Total[b,d]   S[c]
1            p650 1.45 GHz SP    7,951         1.00   1,578         1.00   9,689        1.00
1            p650 1.45 GHz LP    7,550         1.00   1,487         1.00   9,198        1.00
2            p650 1.45 GHz SP    4,088         1.94   800           1.97   4,983        1.94
2            p650 1.45 GHz LP    3,870         1.95   754           1.97   4,722        1.95
4            p650 1.45 GHz SP    2,400         3.31   451           3.50   2,915        3.32
4            p650 1.45 GHz LP    2,240         3.37   394           3.77   2,710        3.39
8            p650 1.45 GHz SP    1,436         5.54   311           5.07   1,793        5.40
8            p650 1.45 GHz LP    1,335         5.66   293           5.08   1,699        5.41

a. Total of 882 basis functions; 3-21G basis set; C1 symmetry.
b. All timings are in seconds and correspond to elapsed time.
c. Speedup.
d. Total time to complete the run.

Table 6 summarizes timings for a frequency calculation with and without large pages. Several parallel links are exercised in this example: l502.exe, l1110.exe, l1002.exe, and l703.exe. l502.exe and l703.exe have been discussed previously. In this paragraph, we summarize what we have presented in several earlier papers3-8. l1002.exe solves the coupled-perturbed Hartree-Fock (CPHF) equations to produce the derivatives of the molecular orbital coefficients. l1110.exe computes the two-electron contribution to the Fock matrix derivatives with respect to nuclear coordinates20. Because only the direct scheme of the atomic orbitals (AO) production integrals is parallelized, it is not surprising that l1002.exe does not show the same type of scalability as l703.exe and l1110.exe3. Although α-pinene is a relatively small case, Table 6 shows that after analyzing all the parallel links individually, the effect of large pages is to improve performance between 1% and 6%. The column corresponding to the total time in Table 6 illustrates that the largest gain by using large pages corresponds to the single processor run. In this case, the gain in performance is about 9%. As the number of processors increases, the improvement in performance decreases. We can see that for the case that uses eight processors, the benefit of large pages is only 1%. The decrease in the benefit as the number of processors is increased might be explained in terms of the number of integrals that each processor has to compute. As we increase the number of processors, the number of integrals per processor gets smaller. In effect, as the number of processors increases, the size of the system (relatively speaking) gets smaller. As we have seen before, small systems do not access enough memory to benefit from large pages.

Table 6 B3-LYP frequency calculation on α-pinene (C10H16)[a]

Number of
processors   System              l502.exe[b]   l1110.exe[b]   l1002.exe[b]   l703.exe[b]   Total[b,d]   S[c]
1            p650 1.45 GHz SP    312           1105           1902           1424          4754         1.00
1            p650 1.45 GHz LP    307           1068           1852           1392          4336         1.00
2            p650 1.45 GHz SP    158           565            957            720           2408         1.97
2            p650 1.45 GHz LP    156           552            929            709           2362         1.96
4            p650 1.45 GHz SP    94            308            557            390           1355         3.51
4            p650 1.45 GHz LP    92            289            533            373           1308         3.54
8            p650 1.45 GHz SP    57            181            377            216           837          5.68
8            p650 1.45 GHz LP    56            174            366            211           831          5.58

a. Total of 182 basis functions; 6-31G(d) basis set; C1 symmetry.
b. All timings are in seconds and correspond to elapsed time.
c. Speedup.
d. Total time to complete the run.

Table 7 summarizes elapsed timings and total speedups for acetyl phenol with and without large pages. This benchmark consists of a CI-singles energy and FORCE calculation. This example illustrates the scalability of l914.exe. l914.exe computes excited states using CI-singles excitations23. This type of calculation runs in parallel, because the two-electron repulsion integrals contributing to the CI-singles can be computed using the PRISM algorithm25. Table 7 indicates that this small case is a borderline case. Large pages do not degrade performance, but improvement is minimal.

Table 7 CI-singles energy and FORCE calculation on acetyl phenol (C8H8O2)[a]

Number of
processors   System              l502.exe[b]   l914.exe[b]   l1002.exe[b]   l703.exe[b]   Total[b,d]   S[c]
1            p650 1.45 GHz SP    151           376           166            40            747          1.00
1            p650 1.45 GHz LP    150           365           164            39            740          1.00
2            p650 1.45 GHz SP    78            195           85             21            387          1.93
2            p650 1.45 GHz LP    78            187           84             20            385
4            p650 1.45 GHz SP    49            128           52             13            249          3.00
4            p650 1.45 GHz LP    49            125           52             13            261          2.84
8            p650 1.45 GHz SP    33            80            34             9             161          4.64
8            p650 1.45 GHz LP    32            79            34             9             178          4.16

a. Total of 154 basis functions; 6-31++G basis set; C1 symmetry.
b. All timings are in seconds and correspond to elapsed time.
c. Speedup.
d. Total time to complete the run.

25 P. M. Gill, M. Head-Gordon, and J. A. Pople, J. Phys. Chem., 94, 5564 (1990)

As we pointed out before, this case is becoming too small for the POWER4. All the links with 16 processors take less than 60 seconds; l703.exe takes only 5 seconds.

Table 8 illustrates the performance of a case similar to the previous one. Although the calculation also computes excited states, it uses different functionality (it is just a single-point calculation). The main difference is that the system is twice as large as the previous case (acetyl phenol). This difference is reflected in a better use of large pages. In this particular example, large pages speed up the parallel links by anywhere between 2% and 8%. The total time improves by about 7%, except for the 8-way case, which improves by only 3%.

Table 8 TD B3LYP energy single-point calculation on C14H12N2O2 (a)

| Number of processors | | l502.exe (b) | l914.exe (b) | Total (b,d) | S (c) |
|---|---|---|---|---|---|
| 1 | p650 1.45 GHz SP | 583 | 2385 | 2977 | 1.00 |
| 1 | p650 1.45 GHz LP | 567 | 2185 | 2766 | 1.00 |
| 2 | p650 1.45 GHz SP | 298 | 1211 | 1515 | 1.97 |
| 2 | p650 1.45 GHz LP | 291 | 1115 | 1416 | 1.95 |
| 4 | p650 1.45 GHz SP | 181 | 727 | 911 | 3.27 |
| 4 | p650 1.45 GHz LP | 171 | 671 | 856 | 3.23 |
| 8 | p650 1.45 GHz SP | 115 | 432 | 550 | 5.41 |
| 8 | p650 1.45 GHz LP | 110 | 407 | 533 | 5.19 |

a. Total of 294 basis functions; 6-31G* basis set; C1 symmetry.
b. All timings are in seconds and correspond to elapsed time.
c. Speedup.
d. Total time to complete the run.
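The percentages quoted for Table 8 can be checked with a short calculation. The sketch below assumes that "improvement" means the relative reduction in elapsed time when going from small to large pages at the same processor count; the text does not spell out the formula, so this definition and the helper names are assumptions.

```python
# Elapsed times (seconds) from Table 8: (l502.exe, l914.exe, total) per processor count.
table8 = {
    1: {"SP": (583, 2385, 2977), "LP": (567, 2185, 2766)},
    2: {"SP": (298, 1211, 1515), "LP": (291, 1115, 1416)},
    4: {"SP": (181, 727, 911), "LP": (171, 671, 856)},
    8: {"SP": (115, 432, 550), "LP": (110, 407, 533)},
}
LABELS = ("l502.exe", "l914.exe", "total")


def improvement(sp, lp):
    """Assumed definition: percent reduction in elapsed time from SP to LP."""
    return 100.0 * (sp - lp) / sp


def improvements(row):
    """Map each column label to its SP -> LP improvement in percent."""
    return {
        label: improvement(sp, lp)
        for label, sp, lp in zip(LABELS, row["SP"], row["LP"])
    }


if __name__ == "__main__":
    for n in sorted(table8):
        parts = ", ".join(
            f"{label}: {pct:.1f}%" for label, pct in improvements(table8[n]).items()
        )
        print(f"{n}-way  {parts}")
```

With this definition the total-time gain shrinks as the processor count grows, consistent with the observation that the 8-way case benefits least.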

Table 9 summarizes a series of experiments where we looked at performance as a function of the type of kernel. In other words, we compare the performance of the α-pinene frequency calculation using large pages configured on a particular AIX kernel (32-bit or 64-bit) and run on either the same or a different kernel. Although α-pinene is not a particularly large system, it exercises several links, and we considered this important information. All these experiments were run with LDR_CNTRL=LARGE_PAGE_DATA=M to ensure that the benchmarks would run with large pages. Otherwise, we want the run to stop.

The first experiment illustrated in Table 9 corresponds to 64 GB of large pages configured on a system with a 32-bit kernel and run on a system with a 32-bit kernel (I). This system is our point of reference. Experiments I and II summarize the effect on performance of running in a 32-bit kernel versus a 64-bit kernel. Here, large pages were configured in the same kernel that was used for the run. In this case (experiment I versus experiment II), we see from Table 9 that elapsed timings with the 64-bit kernel are faster than with the 32-bit kernel. The largest speedup is about 6% for the 16-way run. Similar results are observed for small pages. However, in the case of small pages, the gain from going to a 64-bit kernel is only about 2%, so the difference between running with a 32-bit kernel and a 64-bit kernel is insignificant. When running 1-way to 2-way, the largest performance improvement is in the system time (third column). A 1-way run shows a speedup in system time of about 20% when running on a system using the 64-bit kernel.

The first mixed configuration can be seen in experiment III, where 64 GB LP, configured in a 64-bit kernel, ran in a 32-bit kernel. For most of the cases, the performance of this experiment follows the trends of the two previous experiments. We do not observe any drastic changes in performance. In other words, for this example, the performance appears to be independent of where large pages were configured, but dependent on the type of kernel used to run the benchmarks.

Table 9 Effect of 32-bit kernel, 64-bit kernel, and mixed large pages configuration (a)

64 GB LP, configured in 32-bit kernel, ran in 32-bit kernel (I) (c)

| Number of processors | | Elapsed (b) | CPU (b) | System (b) |
|---|---|---|---|---|
| 1 | p690 1.30 GHz SP | 4937.82 | 4918.33 | 7.12 |
| 1 | p690 1.30 GHz LP | 4913.28 | 4865.00 | 35.56 |
| 2 | p690 1.30 GHz SP | 2517.62 | 4924.76 | 53.79 |
| 2 | p690 1.30 GHz LP | 2518.85 | 4862.32 | 83.78 |
| 4 | p690 1.30 GHz SP | 1319.76 | 4974.55 | 44.80 |
| 4 | p690 1.30 GHz LP | 1344.14 | 4899.11 | 88.41 |
| 8 | p690 1.30 GHz SP | 786.09 | 5165.39 | 141.27 |
| 8 | p690 1.30 GHz LP | 806.23 | 5060.64 | 182.04 |
| 16 | p690 1.30 GHz SP | 459.62 | 5518.78 | 176.49 |
| 16 | p690 1.30 GHz LP | 533.04 | 5465.41 | 255.55 |
| 32 | p690 1.30 GHz SP | 332.17 | 5426.79 | 959.13 |
| 32 | p690 1.30 GHz LP | 420.04 | 5306.89 | 1099.74 |

64 GB LP, configured in 64-bit kernel, ran in 64-bit kernel (II) (c)

| Number of processors | | Elapsed (b) | CPU (b) | System (b) |
|---|---|---|---|---|
| 1 | p690 1.30 GHz SP | 4926.16 | 4911.70 | 5.47 |
| 1 | p690 1.30 GHz LP | 4833.48 | 4796.62 | 28.26 |
| 2 | p690 1.30 GHz SP | 2518.01 | 4929.34 | 48.67 |
| 2 | p690 1.30 GHz LP | 2516.3 | 4873.67 | 72.54 |
| 4 | p690 1.30 GHz SP | 1313.5 | 4973.70 | 40.53 |
| 4 | p690 1.30 GHz LP | 1329.2 | 4904.31 | 74.77 |
| 8 | p690 1.30 GHz SP | 775.7 | 5164.63 | 124.50 |
| 8 | p690 1.30 GHz LP | 792.56 | 5087.87 | 156.94 |
| 16 | p690 1.30 GHz SP | 463.42 | 5559.63 | 186.36 |
| 16 | p690 1.30 GHz LP | 499.25 | 5445.06 | 196.08 |
| 32 | p690 1.30 GHz SP | 330.82 | 5437.20 | 1017.09 |
| 32 | p690 1.30 GHz LP | 412.96 | 5637.76 | 1203.92 |

64 GB LP, configured in 32-bit kernel, ran in 64-bit kernel (III) (c)

| Number of processors | | Elapsed (b) | CPU (b) | System (b) |
|---|---|---|---|---|
| 1 | p690 1.30 GHz SP | 4930.36 | 4910.97 | 5.69 |
| 1 | p690 1.30 GHz LP | 4842.09 | 4800.08 | 29.33 |
| 2 | p690 1.30 GHz SP | 2516.41 | 4932.02 | 47.12 |
| 2 | p690 1.30 GHz LP | 2514.23 | 4874.59 | 73.12 |
| 4 | p690 1.30 GHz SP | 1318.99 | 4973.09 | 41.88 |
| 4 | p690 1.30 GHz LP | 1325.58 | 4897.31 | 78.51 |
| 8 | p690 1.30 GHz SP | 780.31 | 5173.81 | 120.90 |
| 8 | p690 1.30 GHz LP | 796.97 | 5086.85 | 166.42 |
| 16 | p690 1.30 GHz SP | 463.6 | 5553.81 | 184.67 |
| 16 | p690 1.30 GHz LP | 499.87 | 5428.34 | 208.67 |
| 32 | p690 1.30 GHz SP | 322.16 | 5366.99 | 867.94 |
| 32 | p690 1.30 GHz LP | 410 | 5669.62 | 929.97 |

64 GB LP, configured in 64-bit kernel, ran in 64-bit kernel (IV) (c)

| Number of processors | | Elapsed (b) | CPU (b) | System (b) |
|---|---|---|---|---|
| 1 | p690 1.30 GHz SP | 4929.33 | 4910.72 | 5.83 |
| 1 | p690 1.30 GHz LP | 4864.85 | 4864.85 | 29.93 |
| 2 | p690 1.30 GHz SP | 2516.35 | 4931.07 | 47.40 |
| 2 | p690 1.30 GHz LP | 2513.07 | 4870.37 | 69.76 |
| 4 | p690 1.30 GHz SP | 1314.2 | 4976.32 | 40.09 |
| 4 | p690 1.30 GHz LP | 1331.01 | 4898.67 | 77.55 |
| 8 | p690 1.30 GHz SP | 777.79 | 5168.40 | 124.13 |
| 8 | p690 1.30 GHz LP | 802.64 | 5091.77 | 152.42 |
| 16 | p690 1.30 GHz SP | 467.93 | 5568.90 | 191.82 |
| 16 | p690 1.30 GHz LP | 501.68 | 5445.54 | 220.89 |
| 32 | p690 1.30 GHz SP | 342.03 | 5443.23 | 1209.16 |
| 32 | p690 1.30 GHz LP | 403.51 | 5641.90 | 992.51 |

a. This benchmark corresponds to the α-pinene frequency calculation.
b. All timings are in seconds and correspond to elapsed time.
c. See text for definitions of each configuration.
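The kernel comparison between experiments I and II can be reproduced from the large-page rows of Table 9. The sketch below assumes the gain is the relative reduction in time when moving from the 32-bit run (I) to the 64-bit run (II) at the same processor count; the function and variable names are illustrative only.

```python
# Large-page (LP) rows of Table 9: processors -> (elapsed, cpu, system) in seconds.
exp_I_lp = {  # configured and run in the 32-bit kernel
    1: (4913.28, 4865.00, 35.56),
    2: (2518.85, 4862.32, 83.78),
    4: (1344.14, 4899.11, 88.41),
    8: (806.23, 5060.64, 182.04),
    16: (533.04, 5465.41, 255.55),
    32: (420.04, 5306.89, 1099.74),
}
exp_II_lp = {  # configured and run in the 64-bit kernel
    1: (4833.48, 4796.62, 28.26),
    2: (2516.3, 4873.67, 72.54),
    4: (1329.2, 4904.31, 74.77),
    8: (792.56, 5087.87, 156.94),
    16: (499.25, 5445.06, 196.08),
    32: (412.96, 5637.76, 1203.92),
}


def gain(t32, t64):
    """Assumed definition: percent reduction in time from the 32-bit to the 64-bit run."""
    return 100.0 * (t32 - t64) / t32


# Column 0 is elapsed time, column 2 is system time.
elapsed_gain = {n: gain(exp_I_lp[n][0], exp_II_lp[n][0]) for n in exp_I_lp}
system_gain = {n: gain(exp_I_lp[n][2], exp_II_lp[n][2]) for n in exp_I_lp}
best = max(elapsed_gain, key=elapsed_gain.get)

if __name__ == "__main__":
    for n in sorted(elapsed_gain):
        print(f"{n:2d}-way: elapsed {elapsed_gain[n]:5.1f}%, system {system_gain[n]:6.1f}%")
    print(f"largest elapsed gain at {best}-way")
```

Under this definition the largest elapsed-time gain falls on the 16-way run, and the 1-way system time drops by roughly a fifth, in line with the figures quoted in the text.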



Summary

In this study, we tried to provide information about the performance of several options in the Gaussian program on the new family of IBM POWER4-based pSeries 650 and 690 servers. These options are commonly used by researchers at many computer centers and academic institutions, but this is by no means an exhaustive set of benchmarks. We ran a preliminary benchmark (C2H4) to test several options in the program with and without large pages. This system is clearly too small, which is why we consider it just a preliminary benchmark. The levels of theory tested here correspond to Hartree-Fock (HF), density functional theory (DFT), configuration interaction with single excitations (CI-singles), and time-dependent density functional theory (hybrid method). As first approximations to many higher-order methods, this set of methods is routinely used and can certainly provide valuable information when evaluating the newest POWER architecture.

We observed that small systems such as C2H4 do not benefit from the use of large pages. This is true for all the options tested here. Systems with fewer than 10 heavy atoms do not benefit from large pages. We should point out that we have not tested systems of this size with very large basis sets (for example, f and g polarization functions on heavy atoms). C8H8O2 is a borderline case where the performance improvement due to large pages is minimal. On the other hand, large systems such as valinomycin can show as much as a 10% improvement in performance due to large pages.

Finally, our experiments configuring large pages in different kernels indicate that running benchmarks in a 64-bit kernel is more efficient than in a 32-bit kernel. The largest performance improvement (32-bit kernel -> 64-bit kernel) is about 6%, observed for the 16-way run. The case that we tested where large pages were configured on a different type of kernel from the one used to run the benchmarks does not show a very large effect on performance.

Appendix A

The script used to enable large pages for all the Gaussian links is shown in Example 1.

Example 1 Enable Gaussian links for large pages

#!/bin/csh
#
# Mark the g03 driver and every Gaussian link as large-page enabled.
set ROOT=/bench1/cpsosa/g03
set NOL=`ls $ROOT/*.exe`
set L=`ls $ROOT/*.exel`
# g03 is given as a relative path, so it is resolved against the current directory.
foreach i ( g03 $NOL $L )
    ldedit -blpdata $i
    echo "LP enabled: $i"
end
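For readers who prefer to review the commands before changing any binaries, the loop in Example 1 can be sketched as a dry run. This is not the authors' script: it is a Python sketch that only builds and prints the `ldedit -blpdata` command lines. It assumes that `g03` lives directly under the root directory (Example 1 uses a relative `g03` instead), and the function name `ldedit_commands` is hypothetical.

```python
import glob
import os


def ldedit_commands(root):
    """Build (but do not run) one 'ldedit -blpdata' command per Gaussian binary.

    Assumption: g03 sits directly under root, next to the *.exe and *.exel links.
    """
    links = sorted(glob.glob(os.path.join(root, "*.exe")))
    links += sorted(glob.glob(os.path.join(root, "*.exel")))
    targets = [os.path.join(root, "g03")] + links
    return [["ldedit", "-blpdata", path] for path in targets]


if __name__ == "__main__":
    # Path taken from Example 1; nothing is executed, the commands are only printed.
    for cmd in ldedit_commands("/bench1/cpsosa/g03"):
        print(" ".join(cmd))
```

Printing the list first makes it easy to confirm which files would be modified; each entry could then be passed to `subprocess.run` on a system where `ldedit` is available.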



Appendix B

The script used to force large pages through the "mandatory" flag is shown in Example 2.

Example 2 Force large pages through the mandatory flag

#!/bin/ksh
# Make large pages mandatory for the data of every program started below.
export LDR_CNTRL=LARGE_PAGE_DATA=M
#
# "2> file 1>&2" sends both stderr and stdout of each run to the .timex file.
timex ./rung03 pinene_1 2> pinene_1.timex 1>&2
timex ./rung03 pinene_2 2> pinene_2.timex 1>&2
timex ./rung03 pinene_4 2> pinene_4.timex 1>&2
timex ./rung03 pinene_8 2> pinene_8.timex 1>&2
timex ./rung03 pinene_16 2> pinene_16.timex 1>&2
timex ./rung03 pinene_32 2> pinene_32.timex 1>&2
#
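The `.timex` files written by Example 2 could be collected into a table with a small parser. The sketch below assumes that each file contains lines of the form `real <seconds>`, `user <seconds>`, and `sys <seconds>`; the exact layout of the files is not shown in the paper, so treat this format as an assumption. The sample text reuses the 1-way SP row of experiment I in Table 9, on the further assumption that the CPU and System columns correspond to the `user` and `sys` lines.

```python
def parse_timex(text):
    """Extract real/user/sys seconds from timex-style text.

    Assumed format: one 'label value' pair per line; other lines are ignored.
    """
    timings = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("real", "user", "sys"):
            try:
                timings[parts[0]] = float(parts[1])
            except ValueError:
                continue  # not a plain number of seconds; skip the line
    return timings


# Illustrative input only: values from Table 9 (experiment I, 1-way, SP),
# laid out in the assumed timex format.
sample = """
real 4937.82
user 4918.33
sys 7.12
"""

if __name__ == "__main__":
    print(parse_timex(sample))
```

Lines that do not match the assumed pattern are skipped, so any program output that ends up in the same file through `1>&2` does not disturb the parse.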

Notes on benchmarks and values

The benchmarks and values shown here were derived using particular, well configured, development-level computer systems. Unless otherwise indicated for a system, the values were derived using 32-bit applications and external cache, if external cache is supported on the system. All benchmark values are provided "AS IS" and no warranties or guarantees are expressed or implied by IBM. Actual system performance may vary and is dependent upon many factors including system hardware configuration and software design and configuration. Buyers should consult other sources of information to evaluate the performance of systems they are considering buying and should consider conducting application oriented testing. For additional information about the benchmarks, values and systems tested, contact your local IBM office or IBM authorized reseller or access the following on the Web:
• TPC http://www.tpc.org
• GPC http://www.spec.org/gpc
• SPEC http://www.spec.org
• Pro/E http://www.proe.com
• Linpack http://www.netlib.no/netlib/benchmark/performance.ps
• Notesbench Mail http://www.notesbench.org
• VolanoMark http://www.volano.com
• Fluent http://www.fluent.com
• Gaussian http://www.gaussian.com



Unless otherwise indicated for a system, the performance benchmarks were conducted using AIX Versions 4.2.1 or 4.3, IBM C Set++ for AIX/6000 Version 4.1.0.1, and AIX XL FORTRAN Version 5.1.0.0 with optimization where the compilers were used in the benchmark tests. The preprocessors used in the benchmark tests include KAP 3.2 for FORTRAN and KAP/C 1.4.2 from Kuck & Associates and VAST-2 v4.01X8 from Pacific-Sierra Research. The preprocessors were purchased separately from these vendors.

The following SPEC and Linpack benchmarks reflect the performance of the microprocessor, memory architecture, and compiler of the tested system:
• SPECint95: SPEC component-level benchmark that measures integer performance. Result is the geometric mean of eight tests that comprise the CINT95 benchmark suite. All of these are written in the C language. SPECint_base95 is the result of the same tests as CINT95 with a maximum of four compiler flags that must be used in all eight tests.
• SPECint_rate95: Geometric average of the eight SPEC rates from the SPEC integer tests (CINT95). SPECint_base_rate95 is the result of the same tests as CINT95 with a maximum of four compiler flags that must be used in all eight tests.
• SPECfp95: SPEC component-level benchmark that measures floating-point performance. Result is the geometric mean of 10 tests, all written in FORTRAN, that are included in the CFP95 benchmark suite. SPECfp_base95 is the result of the same tests as CFP95 with a maximum of four compiler flags that must be used in all 10 tests.
• SPECfp_rate95: Geometric average of the 10 SPEC rates from SPEC floating-point tests (CFP95). SPECfp_base_rate95 is the result of the same tests as CFP95 with a maximum of four compiler flags that must be used in all 10 tests.
• SPECint2000: New SPEC component-level benchmark that measures integer performance. Result is the geometric mean of 12 tests that comprise the CINT2000 benchmark suite. All of these are written in the C language except for one, which is in C++. SPECint_base2000 is the result of the same tests as CINT2000 with a maximum of four compiler options that must be used in all 12 tests.
• SPECint_rate2000: Geometric average of the 12 SPEC rates from the SPEC integer tests (CINT2000). SPECint_base_rate2000 is the result of the same tests as CINT2000 with a maximum of four compiler options that must be used in all 12 tests.
• SPECfp2000: New SPEC component-level benchmark that measures floating-point performance. Result is the geometric mean of 14 tests, all written in FORTRAN and C languages, that are included in the CFP2000 benchmark suite. SPECfp_base2000 is the result of the same tests as CFP2000 with a maximum of four compiler options that must be used in all 14 tests.
• SPECfp_rate2000: Geometric average of the 14 SPEC rates from SPEC floating-point tests (CFP2000). SPECfp_base_rate2000 is the result of the same tests as CFP2000 with a maximum of four compiler options that must be used in all 14 tests.
• SPECweb96: Maximum number of Hypertext Transfer Protocol (HTTP) operations per second achieved on the SPECweb96 benchmark without significant degradation of response time. The Web server software is ZEUS v.1.1 from Zeus Technology Ltd.
• SPECweb99: Number of conforming, simultaneous connections the Web server can support using a predefined workload. The SPECweb99 test harness emulates clients sending the HTTP requests in the workload over slow Internet connections to the Web server. The Web server software is Zeus from Zeus Technology Ltd.
• LINPACK DP (Double Precision): n=100 is the array size. The results are measured in megaflops (MFLOPS).
• LINPACK SP (Single Precision): n=100 is the array size. The results are measured in MFLOPS.



• LINPACK TPP (Toward Peak Performance): n=1,000 is the array size. The results are measured in MFLOPS.
• LINPACK HPC (Highly Parallel Computing): Solve largest system of linear equations possible. The results are measured in GFLOPS.

VolanoMark is a 100% Pure Java™ server benchmark characterized by long-lasting network connections and high thread counts. In this context, long-lasting means the connections last several minutes or longer, rather than just a few seconds. The VolanoMark benchmark creates client connections in groups of 20 and measures how long it takes for the clients to take turns broadcasting their messages to the group. At the end of the test, it reports a score as the average number of messages transferred by the server per second. VolanoMark 2.1.2 local performance test measures throughput in messages per second. The final score is the average of the best two out of three results.

The following SPEC benchmark reflects the performance of the microprocessor, memory subsystem, disk subsystem, and network subsystem:
• SPECsfs97_R1: The SPECsfs97_R1 (or SPEC SFS 3.0) benchmark consists of two separate workloads, one for NFS V2 and one for NFS V3, which report two distinct metrics, SPECsfs97_R1.v2 and SPECsfs97_R1.v3, respectively. The metrics consist of a throughput component and an overall response time measure. The throughput (measured in operations per second) is the primary component used when comparing SFS performance between systems. The overall response time (average response time per operation) is a measure of how quickly the server responds to NFS operation requests over the range of tested throughput loads.

The following Transaction Processing Performance Council (TPC) benchmarks reflect the performance of the microprocessor, memory subsystem, disk subsystem, and some portions of the network:
• tpmC: TPC Benchmark C throughput measured as the average number of transactions processed per minute during a valid TPC-C configuration run of at least 20 minutes.
• $/tpmC: TPC Benchmark C price/performance ratio reflects the estimated five-year total cost of ownership for system hardware, software, and maintenance and is determined by dividing such estimated total cost by the tpmC for the system.
• QppH is the power metric of TPC-H and is based on a geometric mean of the 17 TPC-H queries, the insert test, and the delete test. It measures the ability of the system to give a single user the best possible response time by harnessing all available resources. QppH is scaled based on database size from 30 GB to 1 TB.
• QthH is the throughput metric of TPC-H and is a classical throughput measurement characterizing the ability of the system to support a multiuser workload in a balanced way. A number of query users is chosen, each of which must execute the full set of 17 queries in a different order. In the background, there is an update stream running a series of insert/delete operations. QthH is scaled based on the database size from 30 GB to 1 TB.
• $/QphH is the price/performance metric for the TPC-H benchmark, where QphH is the geometric mean of QppH and QthH. The price is the five-year cost of ownership for the tested configuration and includes maintenance and software support.

The following graphics benchmarks reflect the performance of the microprocessor, memory subsystem, and graphics adapter:
• SPECxpc results: Xmark93 is the weighted geometric mean of 447 tests executed in the x11perf suite and is an indicator of 2D graphics performance in an X environment. Larger values indicate better performance.



• SPECplb results (graPHIGS): PLBwire93 and PLBsurf93 are geometric means of literal and optimized Picture Level Benchmark (PLB) tests for 3D wireframe and 3D surface tests, respectively. The benchmark and tests were developed by the Graphics Performance Characterization (GPC) Committee. The results shown used the graPHIGS API. Larger values indicate better performance.
• SPECopc results: CDRS-03, CDRS-04, DX-03, DX-04, DX-05, DRV-04, DRV-05, DRV-06, Light-01, Light-02, Light-02, AWadvs-01, AWadvs-02, AWadvs-03, and ProCDRS-02 are weighted geometric means of individual viewset metrics. The viewsets were developed by independent software vendors (ISVs) with the assistance of OpenGL Performance Characterization (OPC) member companies. Larger values indicate better performance.

The following graphics benchmarks reflect the performance of the microprocessor, memory subsystem, graphics adapter, and disk subsystem:
• Bench95 and Bench97 Pro/E results: Bench95 and Bench97 Pro/E benchmarks have been developed by Texas Instruments to measure UNIX and Microsoft® Windows® NT workstations in a comparable real-world environment. Results shown are in minutes. Lower numbers indicate better performance.
• The NotesBench Mail workload simulates users reading and sending mail. A simulated user will execute a prescribed set of functions four times per hour and will generate mail traffic about every 90 minutes. Performance metrics are:
  – NotesMark: Transactions/minute (TPM).
  – NotesBench users: Number of client (user) sessions being simulated by the NotesBench workload.
  – $/NotesMark: Ratio of total system cost divided by the NotesMark (TPM) achieved on the Mail workload.
  – $/User: Ratio of total system cost divided by the number of client sessions successfully simulated for the Mail NotesBench workload measured.

Total system cost is the price of the server under test to the customer, including hardware, operating system, and Domino™ Server licenses.

The team that wrote this Redpaper

This Redpaper was produced by a team of specialists from around the world.

Carlos P. Sosa
IBM, pSeries High-Performance Computing, Chemistry and Life Sciences Development Solutions, and University of Minnesota Supercomputing Institute, Minneapolis, MN

Tina Tarquinio
IBM, pSeries Benchmark & Enablement Center, Poughkeepsie, NY

Thanks to the following people for their contributions to this project:

We would like to thank Pascal Vezzole, Colin Dumontier, and Tony Pirraglia for pointing out the potential problem with mixed kernel configurations when defining large pages. Our gratitude goes to Sharon Selzo for helping us with computational resources at the benchmark center in Poughkeepsie, New York. CPS would like to thank Bruce Hurley for encouragement and making this type of study possible and Steve Behling for assistance with many issues related to large pages. Special thanks to Elisabeth Stahl for carefully reading the manuscript and providing valuable suggestions.


Notices This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A. The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. 
Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. 
You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.

© Copyright IBM Corp. 2003. All rights reserved.


Send us your comments in one of the following ways:
• Use the online Contact us review redbook form found at: ibm.com/redbooks
• Send your comments in an Internet note to: [email protected]
• Mail your comments to: IBM Corporation, International Technical Support Organization, Dept. JN9B Building 003 Internal Zip 2834, 11400 Burnet Road, Austin, Texas 78758-3493 U.S.A.

Trademarks

The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:

AIX®, Domino™, Eserver™, eServer™, IBM®, ibm.com®, POWER3™, POWER4™, POWER4+™, pSeries®, Redbooks (logo)™

The following terms are trademarks of other companies: Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, and service names may be trademarks or service marks of others.
