J. Appl. Cryst. (1988). 21, 512-515
© 1988 International Union of Crystallography
Use of a Shared-Memory Parallel Processor in the Restrained Least-Squares Procedure of Hendrickson and Konnert

BY KRISHNAKUMAR N. PILLAI AND BRUCE W. SUTER
Department of Computer and Information Science, University of Alabama at Birmingham, Birmingham, AL 35294, USA

AND MIKE CARSON
Center for Macromolecular Crystallography, University of Alabama at Birmingham, Birmingham, AL 35294, USA (Received 22 December 1987; accepted 12 May 1988)
Abstract

The restrained least-squares refinement procedure PROLSQ of Hendrickson & Konnert [Hendrickson (1985). Methods in Enzymology, Vol. 115, edited by H. W. Wyckoff, C. H. W. Hirs & S. N. Timasheff, pp. 252-270. Orlando: Academic Press] has been implemented on a 30-processor Sequent Balance 21000 shared-memory parallel processor. The adaptation of PROLSQ to a general-purpose parallel computer offers the potential of increased performance as the physical limits of traditional sequential and vector-processing hardware are approached.
Introduction

The restrained least-squares refinement program PROLSQ (Konnert, 1976; Hendrickson, 1985) is one of the more common refinement tools used by protein crystallographers. This program combines stereochemical restraints with the least-squares refinement of atomic coordinates and temperature factors to best fit the observed X-ray diffraction data. In the original program, the structure factors and derivatives are calculated by the very time-consuming conventional summation procedure. Vector-processing supercomputers have been used by some researchers to reduce the computation time of PROLSQ (Hendrickson, 1985; Arnold et al., 1987). In addition, several groups have ported this program to more conventional sequential computers with attached array processors (Furey, Wang & Sax, 1982; Hendrickson, 1985; Cohen, 1986). Others have stressed the advantages of employing fast Fourier transform routines in a similar program (Tronrud, Ten Eyck & Matthews, 1987). The run time for PROLSQ can also be dramatically reduced through the use of a parallel processing computer such as the Sequent. As each generation of supercomputer evolves, a higher level of parallelism is used to provide additional performance. Current supercomputers, such as the Cray X-MP/48, are composed of four processors. Some of the next generation of supercomputers will have sixteen processors, requiring substantial algorithm development to utilize their potential fully. The Sequent architecture closely matches the shared-memory architecture of the Cray and therefore provides an excellent test bed for algorithm development for the next generation of supercomputers. Although the Sequent represents only a fraction of the Cray's processing power, it reflects the trend in parallel computers and future architectures.

The parallel processor and its software

The Sequent Balance 21000 system is a high-performance parallel computer that employs 30 identical processors. The CPUs are 32-bit NS32032 microprocessors, each computationally equivalent to a VAX 11/750 without an attached floating-point accelerator. The processors are tightly coupled, sharing a common memory of 16 Mbytes and a common bus. Each CPU has 8 kbytes of cache RAM. Parallel and ordinary sequential programs can be run simultaneously on the system. The operating system, DYNIX, is a version of UNIX 4.2bsd that has been enhanced to provide a UNIX System V applications environment and to exploit the Balance architecture.
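The time-consuming conventional summation mentioned above amounts to a direct sum over all atoms for every reflection. A minimal sketch (our own Python illustration, not the Fortran of PROLSQ: a triclinic P1 cell with fractional coordinates, temperature factors and space-group symmetry omitted):

```python
import cmath

def structure_factors(hkl_list, atoms):
    """Direct ('conventional') summation of structure factors for P1:
    F(h) = sum_j f_j * exp(2*pi*i * (h*x_j + k*y_j + l*z_j)),
    with atoms given as (scattering factor, fractional coordinates)."""
    result = []
    for h, k, l in hkl_list:
        F = sum(f * cmath.exp(2j * cmath.pi * (h * x + k * y + l * z))
                for f, (x, y, z) in atoms)
        result.append(F)
    return result

# A single atom at the origin contributes its full scattering factor
# to every reflection, since exp(0) = 1.
Fs = structure_factors([(1, 0, 0), (0, 2, 3)], [(6.0, (0.0, 0.0, 0.0))])
```

Each reflection costs a sum over every atom, which is why this calculation dominates the run time and why an optimized, space-group-specific routine is used for it.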
The problem

Our goal was to speed up the original sequential PROLSQ package by using the parallel processing facilities of the Sequent. Our approach was to convert an existing sequential program to a parallel one. The success of any such conversion is determined by the achieved speedup. Let T(N) be the time required to perform a certain algorithm using N processors. The speedup, S, is then defined as
S = T(1)/T(N).    (1)
A speedup proportional to N represents an upper bound, attainable only if all processors are always kept busy doing useful work. However, some fraction of the execution time must be spent in the sequential portion of the code (e.g. in initialization and I/O). According to Amdahl's (1967) rule, with 20 processors as little as 5% of sequential code in a parallel program will reduce the expected upper-bound speedup by half.
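Amdahl's rule as quoted here can be checked in a few lines (our own illustration, not part of the original program):

```python
def amdahl_speedup(seq_fraction, n_processors):
    """Upper-bound speedup of a program whose serial fraction
    cannot be parallelized (Amdahl, 1967)."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_processors)

# With no sequential code the bound is simply N:
ideal = amdahl_speedup(0.0, 20)     # 20.0
# 5% sequential code roughly halves it for 20 processors:
limited = amdahl_speedup(0.05, 20)  # about 10.3
```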
The PROLSQ program

The Konnert-Hendrickson package consists of three main programs: SCATT, PROTIN and PROLSQ. PROLSQ (protein least squares) is the actual refinement program. It reads diffraction data and scattering factors prepared by SCATT (scattering data), initial atomic coordinates and restraint specifications prepared by PROTIN (protein input), parameter shifts from previous refinement cycles, and control input (Hendrickson & Konnert, 1980). It then augments the normal-equation elements pertinent to each of the stereochemical restraints and the structure-factor observations. Fractional atomic coordinates are used to speed up the rate-limiting calculations concerning structure factors. For the same reason, a highly optimized space-group-specific routine, CALC, is used for computing the structure factors and their derivatives. Elements of the resulting sparse normal matrix are stored in a singly dimensioned array indexed by pointers. Next, PROLSQ uses a conjugate-gradient procedure to solve for the parameter shifts. Finally, the RTEST routine tests the expected impact of the new shifts on the R value and searches for an optimal shift-damping factor against a sample of the data. It has been noted that the vast majority of computation time in typical refinements is consumed by the routines CALC and RTEST, the math routines TRIG and EXPV, and the main refinement routine SFREF. In fact, nearly 94% of the time is spent in SFREF and CALC (Hendrickson, 1981).
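The conjugate-gradient step can be sketched as follows. This is our own minimal Python illustration, not the PROLSQ code; the normal matrix enters only through a matrix-vector product, so a sparse, pointer-indexed array fits the same interface:

```python
def conjugate_gradient(matvec, b, n_iter=50, tol=1e-10):
    """Minimal conjugate-gradient solver for A x = b, where A is
    symmetric positive definite and is supplied only through the
    product matvec(x) -> A x."""
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual b - A x for the start x = 0
    p = list(r)              # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(n_iter):
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x

# Tiny symmetric positive-definite system: A = [[4, 1], [1, 3]], b = [1, 2]
A = [[4.0, 1.0], [1.0, 3.0]]
shifts = conjugate_gradient(
    lambda v: [sum(a * vi for a, vi in zip(row, v)) for row in A],
    [1.0, 2.0])
```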
Implementation details

There are several issues which have to be considered in the parallelization of any sequential code. The first step is to locate the critical areas for parallelization. The main criterion for this selection is the amount of the total computation time consumed by each subroutine or function. This can be determined with the UNIX operating system's gprof utility (UNIX Timesharing System, 1983). The gprof results showed that the largest computational time was taken by CALC (which is called by SFREF and RTEST). There were
two possible alternatives here: to parallelize CALC directly, or to parallelize the routines SFREF and RTEST which call it. We chose the latter, as CALC is a problem-dependent routine and may be modified for different space groups. We also felt that little parallelism could be obtained from within CALC, since it consists of a loop with only eight iterations. The parallelization of SFREF is described below; a similar approach was used for the routine RTEST. The structure of the SFREF routine is shown in pseudocode:

    Initialize
    DO I = 1, Number of observed reflections
        Read reflection data from file
        CALL CALC
        Write reflection listing
        Increment R-factor sums, matrix and vector elements
    END DO
    Compute and print R factors

The programming method used in our approach is data partitioning, which is appropriate for applications that perform the same operations repeatedly on large collections of data. This involved creating multiple identical processes and assigning a portion of the data to each process; in the case of a loop, the iterations are performed in parallel. The algorithm used here is the self-scheduling algorithm (Quinn, 1987), which forms a task queue consisting of the loop iterations. Each process removes an iteration from this queue, executes it, and returns for more work. In any such conversion, the I/O operations are usually a problem as they are essentially sequential in nature. We have written data-server routines to handle all the required I/O, using two processors: one dedicated to reading (input) and one to writing (output). The input routine reads data from the file into a common buffer in shared memory and sets a flag indicating that the buffer contains data to be read. The first available process grabs this data, copies it into its local memory, and then resets the flag, enabling the input routine to perform another read. The output routine works in a similar way.
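The self-scheduling loop can be sketched in a few lines. This is our own illustration, using Python threads as a stand-in for the Sequent's processes; the dedicated I/O servers are omitted for brevity:

```python
import queue
import threading

def self_schedule(n_iterations, work, n_workers=4):
    """Self-scheduling loop (Quinn, 1987): the iteration indices sit in
    a shared task queue, and each worker repeatedly removes one,
    executes it, and returns for more, so faster workers naturally
    take on more iterations."""
    tasks = queue.Queue()
    for i in range(n_iterations):
        tasks.put(i)
    results = [None] * n_iterations

    def worker():
        while True:
            try:
                i = tasks.get_nowait()
            except queue.Empty:
                return               # queue drained: no more work
            results[i] = work(i)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# A toy per-iteration computation stands in for the CALC call
# made for each observed reflection.
partial = self_schedule(100, lambda i: i * i)
```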
The routine CALC computes the structure factors and their derivatives, with the elements of the resulting sparse matrix stored in a singly dimensioned array. We treated the entire matrix as a local variable and therefore passed a copy to each of the parallel processes created. Each process modifies its local copy of this matrix and, at the end, the global matrix is re-created from these local copies. The routine CALC (which is called by each of the loop iterations of SFREF or RTEST) has thus been treated as a local routine, and may be modified for different proteins without changing any of the parallel code.
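The local-copy strategy can be sketched as follows (again our own Python illustration; a static partition of the observations stands in for the self-scheduling queue, and the per-observation contributions are hypothetical):

```python
import threading

def parallel_accumulate(n_obs, contribution, n_workers=4, n_elements=8):
    """Each worker keeps a private copy of the (flattened) normal-matrix
    array and adds its share of observations to that copy only; the
    global array is re-created at the end by summing the local copies,
    so no locking is needed during accumulation."""
    local = [[0.0] * n_elements for _ in range(n_workers)]

    def worker(w):
        for i in range(w, n_obs, n_workers):
            for j, v in contribution(i):   # (index, value) pairs
                local[w][j] += v

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # merge: global element j is the sum of the workers' local copies
    return [sum(copies) for copies in zip(*local)]

# Toy "observation": observation i adds 1.0 to element i mod 8.
g = parallel_accumulate(80, lambda i: [(i % 8, 1.0)])
```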
Table 1. Timing data for the sequential and parallel versions of PROLSQ

Number of      SFREF     RTEST     PROLSQ    PROLSQ    PROLSQ
processors     CAM/P1    CAM/P1    CAM/P1    PNP/P1    PNP/R32
Sequential     10576     2505      13440     113552    63056
2              6155      1881      8366      59764     33187
4              3117      965       4412      30689     17042
8              1631      514       2475      15555     8881
16             973       313       1617      8288      4888
20             856       276       1461      7009      4121

Times are given in seconds for selected routines and for the entire program. The sequential version uses one processor on the Sequent; the parallel version uses the number of processors given, plus the I/O servers. CAM refers to calmodulin. The PNP data are run with a P1 and an R32 version of CALC.
Results

The parallel version of the program was tested with data from the protein calmodulin (Babu, Sack, Greenhough, Bugg, Means & Cook, 1985; Babu, Bugg & Cook, 1988). The crystals of calmodulin are triclinic (a = 29.84, b = 53.72, c = 24.94 Å, α = 93.3, β = 97.45, γ = 88.77°, Z = 1) and diffraction data are available to 2.35 Å resolution. Coordinates of 1143 atoms and corresponding isotropic temperature factors are refined against the 5393 observed reflections. The results are shown in Table 1 and Fig. 1, giving the timing data and the speedups observed with SFREF, RTEST and the entire PROLSQ program for different numbers of processors. As can be seen, an overall speedup of around nine has been obtained for this example using 20 processors for the computation plus the two processors dedicated to I/O. Simulations had shown that the speedup would be greater with a larger number of reflections. Further tests were therefore performed with data from the protein purine nucleoside phosphorylase (PNP) (Ealick et al., 1986).
Fig. 1. Speedup as a function of the number of processors. Curves are shown for the theoretical speedup, PNP/P1, PNP/R32, SFREF, PROLSQ and RTEST.
PNP crystals are of space group R32 (a = b = 142.9, c = 165.1 Å, α = β = 90, γ = 120°, Z = 1) and diffraction data are available to 3.0 Å resolution. Coordinates of 2279 atoms and an overall temperature factor are refined against 17 518 observed reflections. Table 1 and Fig. 1 also show the timing data and speedup curves for these data using the original triclinic version of CALC and an R32-specific version of CALC. As can be seen, an overall program speedup of over 15 is achieved using 20 (plus two) processors for both PNP computations.
Discussion

These results have been obtained by merely parallelizing two routines. Other routines could also be parallelized, but with diminishing returns. The expected maximum performance is found by extrapolating the speedup curves of Fig. 1. Were we to reserve the Sequent for exclusive use, 27 processors (plus two for I/O, plus one reserved for the operating system) could be used for PROLSQ computation. It was observed that as the number of loop iterations increases, the use of a higher number of processors increases the speedup. Note that while the CPU time for the space-group-specific CALC routine is roughly half that of the triclinic version, the speedup curves are virtually the same. This is as expected, recalling the definition of speedup in (1) and the design decision concerning the parallelization of CALC. Clearly the absolute CPU times reported are not that impressive. This PROLSQ code has since been ported to a Cray X-MP/24, where it runs the calmodulin benchmark in 21 s. The Cray version is about 70 times faster than the 20-processor run on the Sequent, but a Cray is also about seventy times more costly. Thus, a first-generation commercial parallel machine designed primarily for business applications runs a scientific application at about the same price/performance ratio as a Cray. The important point is that significant speedup trends were obtained by employing parallelism. Parallel machines are expected to be extremely important in the future. The new architectures should offer major
performance increases over current sequential machines. A major problem that remains is the development of parallel software methodologies (Fox & Messina, 1987). Others are exploring this area. The direct-methods program MULTAN (Main, Fiske, Hull, Lessinger, Germain, Declercq & Woolfson, 1980) has recently been implemented in the Lisp language on the parallel Connection Machine (Flippen-Anderson & Anderson, 1987). Scientific software ported to the relatively inexpensive parallel transputer chips is becoming commercially available. Another approach would be to reconsider the entire problem for a parallel computer and design a parallel algorithm from the outset. There is a consensus that developing a parallel algorithm can achieve better results than simply trying to parallelize an existing sequential program. For improved performance, changes can also be made at the algorithmic level itself. The PROLSQ program uses a conjugate-gradient method to solve the resulting equations, and this can be parallelized for greater speedup. Progress in parallel algorithms for this purpose has been made (Suter & Pillai, 1988). An increase of 50-fold in the speed of calculation of structure factors was attained in PROLSQ using a space-group-specific fast Fourier routine (Finzel, 1987). This clearly shows the importance of superior algorithms.

Special thanks to Y. S. Babu for providing the original version of the program and the test data and for valuable discussions. Thanks to Jim Fillers for the R32 version of CALC. Thanks to Charlie Bugg and Warren Jones for their encouragement and support.
References

AMDAHL, G. M. (1967). AFIPS Natl Comput. Conf. Expo. Conf. Proc. 30.
ARNOLD, E., VRIEND, G., LUO, M., GRIFFITH, J. P., KAMER, G., ERICKSON, J. W., JOHNSON, J. E. & ROSSMANN, M. G. (1987). Acta Cryst. A43, 346-361.
BABU, Y. S., BUGG, C. E. & COOK, W. J. (1988). In preparation.
BABU, Y. S., SACK, J. S., GREENHOUGH, T. J., BUGG, C. E., MEANS, A. R. & COOK, W. J. (1985). Nature (London), 315, 37-40.
COHEN, G. H. (1986). J. Appl. Cryst. 19, 486-488.
EALICK, S., GREENHOUGH, T., BABU, Y., CARTER, D., COOK, W. J., BUGG, C. E., RULE, S., HABASH, J., HELLIWELL, J., STOECKLER, J., CHEN, S. & PARKS, R. JR (1986). Ann. NY Acad. Sci. 451, 311-312.
FINZEL, B. C. (1987). J. Appl. Cryst. 20, 53-55.
FLIPPEN-ANDERSON, J. & ANDERSON, P. (1987). Proc. 38th Pittsburgh Conf., New Jersey, USA.
FOX, G. & MESSINA, P. (1987). Sci. Am. 257, 66-77.
FUREY, W., WANG, B. C. & SAX, M. (1982). J. Appl. Cryst. 15, 160-166.
HENDRICKSON, W. A. (1981). Refinement of Protein Structures, edited by P. A. MACHIN, J. W. CAMPBELL & M. ELDER, pp. 1-8. Warrington, England: SERC Daresbury Laboratory.
HENDRICKSON, W. A. (1985). Methods in Enzymology, Vol. 115, edited by H. W. WYCKOFF, C. H. W. HIRS & S. N. TIMASHEFF, pp. 252-270. Orlando: Academic Press.
HENDRICKSON, W. A. & KONNERT, J. H. (1980). Computing in Crystallography, edited by R. DIAMOND, S. RAMASESHAN & K. VENKATESAN, pp. 13.01-13.23. Bangalore: Indian Academy of Sciences.
KONNERT, J. H. (1976). Acta Cryst. A32, 614-617.
MAIN, P., FISKE, S. J., HULL, S. E., LESSINGER, L., GERMAIN, G., DECLERCQ, J.-P. & WOOLFSON, M. M. (1980). MULTAN80. A System of Computer Programs for the Automatic Solution of Crystal Structures from X-ray Diffraction Data. Univs. of York, England, and Louvain, Belgium.
QUINN, M. J. (1987). Designing Efficient Algorithms for Parallel Computers. New York: McGraw-Hill.
SUTER, B. W. & PILLAI, K. N. (1988). In preparation.
TRONRUD, D. E., TEN EYCK, L. F. & MATTHEWS, B. W. (1987). Acta Cryst. A43, 489-501.
UNIX TIMESHARING SYSTEM (1983). UNIX Programmer's Manual. New York: Holt, Rinehart and Winston.