Structure, Vol. 12, 1753–1761, October, 2004, ©2004 Elsevier Ltd. All rights reserved.
CRANK: New Methods for Automated Macromolecular Crystal Structure Solution
Steven R. Ness, Rudolf A.G. de Graaff, Jan Pieter Abrahams, and Navraj S. Pannu*
Biophysical Structural Chemistry, Gorlaeus Laboratories, Leiden University, 2300 RA, Leiden, The Netherlands
Summary
CRANK is a novel suite for automated macromolecular structure solution that uses recently developed programs for substructure detection, refinement, and phasing. CRANK uses these methods for substructure detection and phasing and combines them with existing crystallographic programs for density modification and automated model building in a convenient and easy-to-use CCP4i graphical interface. The data model used conforms to the XML (eXtensible Markup Language) specification and works as a common language to communicate data between many different applications inside and outside of the suite. The application of CRANK to various test cases has yielded promising results: with minimal user input, CRANK can produce better quality solutions than currently available programs.
Introduction
In the field of macromolecular crystallography methods development, a current goal is to automate the repetitive and time-consuming steps in structure solution. Over the last few years, expert knowledge in the underlying crystallographic methods has been utilized to automate structure solution (Brünger et al., 1998; Terwilliger and Berendzen, 1999; Emsley, 1999; Weeks et al., 2002; Schneider and Sheldrick, 2002; Holton and Alber, 2004; Brunzelle et al., 2003; Pape and Schneider, 2004). The increase in the power and automation of crystallographic software, combined with recent hardware advances, such as the advent of robotic platforms for crystal screening and growth, the availability of more powerful synchrotron radiation sources, and the automated loading of crystals, has no doubt contributed to the success of high-throughput structure solution or proteomics campaigns. A recent survey of the PDB revealed that several hundred structures have been deposited from these efforts (Brunzelle et al., 2003). Given the recent successes of genomics, proteomics, and structural proteomics efforts, it is inevitable that the trend to faster generation of datasets and structures will continue. As a result, the need for new fully automated and powerful tools will become even more intense.
*Correspondence:
[email protected]
DOI 10.1016/j.str.2004.07.018
Ways & Means
We present here CRANK, a new package for the automated structure solution of proteins. CRANK is composed of a number of programs, including new methods for substructure detection and refinement, and for protein structure solution from an anomalous or heavy atom signal. Thanks to the novel algorithms used, the suite pushes the limits on structures that can be automatically solved with minimal user intervention.
In determining the structure of a macromolecule, CRANK performs a number of steps. First, the substructure is determined. The substructure parameters are then refined to generate optimal phase information. These initial experimental phases are improved using density modification techniques to produce a map suitable for iterative automated model building and structure refinement. Tasks are integrated into a single automated suite, designed so that the user is required to input only a small number of experimentally determined parameters along with a structure factor file. The parameters include the number of residues, atomic scattering factors f′ and f″, and an estimate of the number and type of the anomalous and/or heavy atom scatterers present.
In order to facilitate the communication between different programs, each step of the CRANK pipeline outputs extensive information to the global output CRANK XML (eXtensible Markup Language) file. Viewers contained within the suite transparently convert this information to HTML for easy viewing in any web browser or as a text file. The strategy of outputting program run information in HTML has been used to good effect before in the SHARP suite (La Fortelle and Bricogne, 1997), and more recently in the CCP4 suite (Potterton et al., 2003). It is often challenging, however, for programs to automatically extract information from these HTML files. The use of XML allows both for visualization of data by users and also for the transport of this data inside and outside the suite.
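To illustrate how one pipeline step might record its results in XML so that a later step can read them back, here is a minimal sketch using only the Python standard library. The element and attribute names (`crank_run`, `step`, `site`) and the helper functions are invented for this example; they are not the actual CRANK XML schema, which is not specified here.

```python
import xml.etree.ElementTree as ET


def write_step(run, name, program, sites):
    """Append one pipeline step and its heavy-atom sites to the run element."""
    step = ET.SubElement(run, "step", name=name, program=program)
    for element, (x, y, z) in sites:
        ET.SubElement(step, "site", element=element,
                      x=f"{x:.4f}", y=f"{y:.4f}", z=f"{z:.4f}")
    return step


def read_sites(xml_text, step_name):
    """Return [(element, (x, y, z)), ...] recorded by the named step."""
    run = ET.fromstring(xml_text)
    step = run.find(f"./step[@name='{step_name}']")
    if step is None:
        return []
    return [(s.get("element"),
             (float(s.get("x")), float(s.get("y")), float(s.get("z"))))
            for s in step.findall("site")]


# Hypothetical element names; not the real CRANK XML schema.
run = ET.Element("crank_run")
write_step(run, "substructure_detection", "CRUNCH2",
           [("Se", (0.1234, 0.5, 0.25)), ("Se", (0.75, 0.3333, 0.9))])
xml_text = ET.tostring(run, encoding="unicode")

# A later step (for example refinement and phasing) reads the same data back.
sites = read_sites(xml_text, "substructure_detection")
```

Because the data are stored as named elements rather than formatted text, a downstream program can query exactly the fields it needs, which is the difficulty with HTML output noted above.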
The CRANK XML file format has been based in large part on the mmCIF file format, and with the converters in the suite, it is easy to convert mmCIF files to CRANK XML files and to convert CRANK XML to the mmCIF format. In all stages of data processing by CRANK, XML files are used to transmit information and commands between the various programs. XML is a standard language in computing, and is rapidly becoming a standard language in the macromolecular crystallography community (Hamelryck and Kjeldgaard, 2001; Murray-Rust, 1998; Ito et al., 2002; Winn et al., 2002; Leslie et al., 2002). CRANK also allows for the input and output of mmCIF files and dictionaries. By using these standard and extensible languages, it is hoped that data and results can be easily and automatically analyzed and shared. Further, using the functionality in XSLT (Extensible Stylesheet Language Transformations), it is possible to transform CRANK XML into any of the other currently available XML standards for storing crystallographic data, with CML (Chemical Markup Language) and the XML files used by the PDB project being two obvious and immediately useful examples. In addition, by allowing
Figure 1. CRANK Flowchart An example of data flow and program execution that is possible within the CRANK suite. Square white boxes represent programs that have CRANK XML tasks already implemented for communication within CRANK.
the reading and writing of mmCIF files, we hope to utilize the large amount of information and structure contained in the mmCIF language and dictionary. The program CRANK employs a data-centric program structure and integrates a wide variety of existing programs to fully automate the process of protein structure determination. By using standardized data formats and modular programming practice, it has proved relatively easy to add new crystallographic programs to the CRANK suite. The data flow within the CRANK suite can best be visualized as a pipeline with optional branches. Data flows from one end of the pipeline to
Figure 2. CRANK CCP4i Screenshot CRANK interface as implemented in the CCP4i suite. The procedures to be run can be selected in the “Procedure” menu. Subpanels for the tasks of substructure determination, refinement and phasing, and density modification are shown in their closed state—opening the panels allows additional specific parameters to be changed.
the other, with programs acting on the data as it proceeds down the pipeline, until it is transformed into a final validated output structure at the end. In this pipeline, the basic dataflow is: substructure determination, substructure refinement and phasing, density modification, and model building. This pipeline is shown graphically in Figure 1.
Program control and execution are also implemented in XML using a specialized dialect of XML called SOAP (Simple Object Access Protocol). SOAP is a new development in the XML community and is commonly used for applications like web services and RPC program communication. In the CRANK suite, the CCP4i interface creates a SOAP request out of input and decisions from the user. A picture of the CRANK CCP4i interface is shown in Figure 2. The CCP4i interface transmits this SOAP request to the CRANK program, which interprets the XML contained within and runs the various programs specified in the SOAP request. We also use a custom XML data structure for the rule-based systems within the CRANK suite. These simple methods allow us to make surprisingly complex decisions based on various criteria of the data output by the various programs within the suite.
There are two main advantages to using a system based on a central, standardized data communication language. The first is that without this, individual programs must be written to convert each type of program output into a form usable by other programs; this leads to a case where on the order of N² programs must be written, each transforming the data of one program into a form usable by another. In an implementation with a common data description language, on average 2N programs will
need to be written. This concept is graphically illustrated in Figure 3. The other advantage to this approach is that by using program prototypes and by modifying existing code, new programs to convert data from program output to XML or from XML to program input can be written and debugged in a very short amount of time. This facilitates the use of CRANK as an interface to many different programs. CRANK interfaces have already been constructed for various programs in the different stages of a crystal structure solution. Thus, the CRANK system creates a powerful and easy-to-use interface for our new programs CRUNCH2 (de Graaff et al., 2001) for substructure detection and BP3 (Pannu et al., 2003; Pannu and Read, 2004) for substructure phasing, in addition to allowing crystallographers to interface with existing powerful tools for structure solution.
Figure 3. CRANK Data Flow Diagram Illustrated is the concept of 2N connections needed in a method with a common language for transferring data between many different programs.
Results and Discussion
Datasets
Recently, we have performed a number of tests using data from single wavelength anomalous diffraction (SAD) experiments to test the CRANK suite. We have compared the results with other programs currently used for substructure detection and refinement. Information and references for the datasets used may be found in Table 1. As can be seen from this table, the datasets used are quite diverse, come from a wide variety of sources, and contain a wide variety of anomalously scattering atoms. For example, there are structures with endogenous atoms such as phosphorus as is seen in the DNA dataset, with essential heavy metal ions like the calcium in subtilisin and the iron in C. acidurici ferredoxin, with metabolic modification in the case of selenium atoms in the E. coli thioesterase, carbohydrate binding module CBM27, and MutS datasets and with heavy halide ions soaked into the crystal, as is the case in both the human acyl-protein thioesterase and the Pseudomonas serine carboxyl proteinase. In addition, the datasets represent a wide variation in the wavelength at which they were collected (from 0.88 Å to 1.54 Å), in resolution of the dataset (from 0.94 Å to 3.0 Å) and in the amount of anomalous signal present. The performance of the various algorithms on a large variety
of datasets should show if methods discussed here are widely applicable in macromolecular crystallography.
Substructure Detection
In these tests, we compare the results obtained by CRUNCH2 (de Graaff et al., 2001) with the programs SHELXD (Schneider and Sheldrick, 2002), SOLVE (Terwilliger and Berendzen, 1999), and HySS (Grosse-Kunstleve and Adams, 2003), evaluating the differences between the heavy atom positions, both in number of sites found and also the rms displacement in heavy atom
Table 1. Dataset Information

| Description | Spacegroup | Anomalous Scatterers | Wavelength (Å) | f″ (approx.) (e⁻) | Resolution (Å) | Number of Residues | References |
|---|---|---|---|---|---|---|---|
| C. acidurici ferredoxin | P43212 | 8 Fe | 0.88 | 1.25 | 0.94 | 64 | (Dauter et al., 1997; Dauter et al., 2002) |
| Carbohydrate binding module | P41212 | 4 Se | 0.98 | 0.43 | 2.0 | 351 | (Boraston et al., 2003; Dodson, 2003) |
| DNA oligomer (CGCGCG)2 | P212121 | 10 P | 1.54 | 2.23 | 1.5 | 12 | (Dauter and Adamiak, 2001; Dauter et al., 2002) |
| Human acyl-protein thioesterase | C2221 | 22 Br | 0.92 | 5 | 1.8 | 408 | (Devedjiev et al., 2000; Dauter et al., 2002) |
| E. coli thioesterase II | P43212 | 8 Se | 0.9789 | 5.4 | 2.5 | 570 | (Li et al., 2000; Dauter et al., 2002) |
| Lysozyme (high redundancy) | P43212 | 10 S, 8 Cl | 1.54 | 0.56 | 1.53 | 129 | (Dauter et al., 1999; Dauter et al., 2002) |
| Pseudomonas serine carboxyl proteinase | P62 | 9 Br | 0.92 | 5 | 1.8 | 375 | (Dauter et al., 2001; Dauter et al., 2002) |
| Calcium subtilisin | P212121 | 3 Ca | 1.54 | 1.28 | 1.75 | 269 | (Betzel et al., 1988; Dauter et al., 2002) |
| MutS binding to G-T mismatch | P212121 | 45 Se | 0.93 | 5 | 3.0 | 1580 | (Lamers, Perrakis et al., 2000) |
| Lysozyme (low redundancy) | P43212 | 10 S, 2 Cl | 1.54 | 0.56 | 1.64 | 129 | (Weiss, 2001) |
Crystallographic statistics and structural information of the various datasets used. Solvent content and number of residues were calculated from the final deposited PDB structure.
Table 2. Substructure Determination: Sites Found and RMS Deviations

| Dataset | Sites found: CRUNCH2 | Sites found: SHELXD | Sites found: HySS | Sites found: Solve | RMS against PDB model: CRUNCH2 | RMS against PDB model: SHELXD | RMS against PDB model: HySS | RMS against PDB model: Solve |
|---|---|---|---|---|---|---|---|---|
| C. acidurici ferredoxin | 8/8 | 8/8ᵃ | 1/8ᵃ | 2/8ᵃ | 0.33 | 0.44ᵃ | 1.32ᵃ | 1.57ᵃ |
| Carbohydrate binding module | 3/4 | 3/4 | 2/4 | 2/4 | 0.39 | 0.47 | 0.19 | 0.19 |
| DNA (CGCGCG)2 | 10/10 | 10/10 | 9/10ᵃ | 10/10ᵃ | 0.19 | 0.22 | 0.22ᵃ | 0.14ᵃ |
| Human acyl-protein thioesterase | 19/22 | 21/22 | 21/22ᵃ | 18/22 | 0.27 | 0.65 | 0.28ᵃ | 0.28 |
| E. coli Thioesterase II | 8/8 | 8/8 | 8/8 | 7/8 | 0.40 | 0.41 | 0.41 | 0.37 |
| Lysozyme (high redundancy) | 17/18 | 17/18ᵃ | 13/18ᵃ | 17/18 | 0.19 | 0.14ᵃ | 0.26ᵃ | 0.15 |
| Pseudomonas serine carboxyl proteinase | 8/9 | 7/9 | 8/9 | 8/9 | 0.22 | 0.30 | 0.22 | 0.14 |
| Subtilisin | 2/3 | 3/3 | 3/3 | 3/3 | 0.16 | 0.10 | 0.10 | 0.11 |
| MutS DNA binding protein | 45/45 | 45/45 | 45/45 | 30/45 | 0.55 | 0.50 | 0.60 | 0.38 |
| Lysozyme 180 data (lower redundancy) | 10/12 (14/18)ᵇ | 12/12ᵃ (14/18)ᵇ | 1/12ᵃ (1/18)ᵇ | 2/12ᵃ (2/18)ᵇ | 0.36 | 0.87ᵃ | 0.98ᵃ | 2.02ᵃ |

ᵃ Additional parameters were needed in program input to obtain better results over strictly default values. Details are given in the Experimental Procedures section.
ᵇ Shown in parentheses are the number of sites found when compared with the high redundancy lysozyme PDB file that contained more chlorine atoms.
The fraction of number of sites found over the total number of sites along with the root mean squared distance of these sites, as calculated by Emma from final deposited PDB structure.
positions compared to the final refined structures. These results can be seen in Table 2, with the timing results in Table 3. In these comparisons, we have primarily used the program Emma (Grosse-Kunstleve and Adams, 2003) from the CCTBX suite, but have also used the programs COMPARE (R.A.G.d.G., unpublished data) and NANTMRF (Smith, 2002). To run CRUNCH2, we use the DREAR suite (Blessing and Smith, 1999) for the generation of E(A) substructure factor amplitudes. We are currently developing an interface to the DREAR suite as well as writing a new program to obtain multivariate likelihood estimates of E(A) values for a variety of diffraction experiments using equations similar to those derived previously (Burla et al., 2003). The SHELXC program (http://shelx.uni-ac.gwdg.de/SHELX/) can also be used to generate substructure factor amplitudes.
The traditional CRUNCH algorithm uses random starting atom positions and the resulting random phases as the starting point for its dual space search. For this paper, we have implemented a seeding program for CRUNCH2 that uses a Patterson Minimum Function (PMF) algorithm. In our experiments, we have found that the use of the PMF algorithm enhances the already high hit rate of CRUNCH2, allowing us to run fewer trials in our CRANK search strategies. We have also enhanced the performance of the algorithm by using heuristic rules determined in these experiments, for example in determining the relative difficulty of different datasets and modifying the CRUNCH2 strategy in response. As can be seen in Table 2, CRUNCH2 was able to determine the anomalously diffracting substructure in all cases. In the case of ferredoxin, CRUNCH2 and SHELXD were the only programs that could determine the full substructure consisting of the iron atoms in two iron sulfur clusters. In almost all cases, if the atoms were found, the RMS distance from the PDB deposited atomic coordinates was below 1 Å.
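The "sites found" fractions and RMS distances of the kind reported in Table 2 can be illustrated with a deliberately simplified comparison. This sketch is not Emma's algorithm. It assumes both site lists are already in Cartesian Å and share the same origin, hand, and symmetry copy, ambiguities that a real comparison program has to resolve. The function name and the 1.0 Å tolerance are assumptions made for the example.

```python
import math


def match_sites(found, reference, tolerance=1.0):
    """Greedily pair each reference site with its nearest unused found site.

    Returns (n_matched, rms), where rms is the root mean square distance over
    the matched pairs, or None if no pair lies within the tolerance.
    """
    unused = list(found)
    squared = []
    for ref in reference:
        if not unused:
            break
        nearest = min(unused, key=lambda s: math.dist(s, ref))
        d = math.dist(nearest, ref)
        if d <= tolerance:
            squared.append(d * d)
            unused.remove(nearest)
    if not squared:
        return 0, None
    return len(squared), math.sqrt(sum(squared) / len(squared))


# Toy coordinates in Å; the third found site is a false positive.
reference = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
found = [(0.3, 0.0, 0.0), (5.0, 0.4, 0.0), (9.0, 9.0, 9.0)]
n_matched, rms = match_sites(found, reference)
print(f"{n_matched}/{len(reference)} sites, RMS = {rms:.2f} Å")
```

Here two of the three reference sites are matched, which would be reported as 2/3 in the style of Table 2.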
Table 3. Substructure Determination – Timing Information

| Dataset | CRUNCH2 time (sec) | SHELXD time (sec) | HySS time (normal) (sec) | Solve time (sec) |
|---|---|---|---|---|
| C. acidurici ferredoxin | 4224 | 1329 (500 trials) | 4110 | 539 |
| Carbohydrate Binding Module | 2155 | 77 | 37 | 952 |
| DNA (CGCGCG)2 | 644 | 11 | 337 (full) | 457 |
| Human Acyl Protein Thioesterase | 451 | 29 | 1527 (full) | 1661 |
| E. coli Thioesterase | 1401 (20 trials) | 64 | 210 | 1254 |
| Lysozyme (strong signal) | 6645 | 5336 (1000 trials) | 4547 (full) | 3202 |
| Pseudomonas serine carboxyl proteinase | 4600 | 62 | 469 | 1009 |
| Subtilisin | 562 | 17 | 76 | 324 |
| MutS DNA binding protein | 4466 | 67 | 760 | 3037 |
| Lysozyme 180 data (weaker signal) | 14020 (20 trials) | 4218 (1000 trials) | 7858 | 1187 |
Run times in seconds of the substructure determination programs used. In cases where more than the default 10 runs were needed, the number of runs is presented in brackets. For HySS, the normal search protocol was used; in cases where the full search protocol gave better results, the word "full" is appended in parentheses. All runs were performed on a 2.4 GHz Intel Pentium 4 running Linux 2.4.
Table 4. Refinement and Phasing – Map Correlation and Phase Difference

| Dataset | BP3 map correl | SHARP map correl | MLPHARE map correl | SOLVE map correl | BP3 PDIF (°) | SHARP PDIF (°) | MLPHARE PDIF (°) | SOLVE PDIF (°) |
|---|---|---|---|---|---|---|---|---|
| C. acidurici ferredoxin | 0.7225 | 0.7059 | 0.1403 | 0.3735 | 44.62 | 45.13 | 87.06 | 79.95 |
| Carbohydrate binding module | 0.3721 | 0.3736 | 0.2635 | 0.2929 | 75.10 | 74.93 | 77.49 | 75.79 |
| DNA (CGCGCG)2 | 0.6585 | 0.6500 | 0.3297 | 0.5242 | 48.75 | 49.62 | 63.12 | 55.67 |
| Human acyl-protein thioesterase | 0.4645 | 0.4261 | 0.3148 | 0.3740 | 65.74 | 67.76 | 70.86 | 68.47 |
| E. coli thioesterase | 0.5239 | 0.5210 | 0.4252 | 0.4272 | 62.21 | 62.28 | 65.08 | 65.37 |
| Lysozyme (strong signal) | 0.5761 | 0.5743 | 0.3750 | 0.4772 | 56.71 | 56.76 | 68.70 | 61.40 |
| Pseudomonas serine carboxyl proteinase | 0.2523 | 0.2388 | 0.1822 | 0.2012 | 78.40 | 78.88 | 79.80 | 80.41 |
| Subtilisin | 0.3653 | 0.3630 | 0.2477 | 0.2926 | 67.46 | 67.35 | 71.08 | 68.85 |
| MutS DNA binding protein | 0.5046 | 0.5139 | 0.3966 | 0.4276 | 60.61 | 60.03 | 65.87 | 63.09 |
| Lysozyme (weaker signal) | 0.4677 | 0.4576 | 0.2508 | 0.3719 | 62.56 | 62.47 | 74.42 | 70.59 |
Map correlation and phase difference as compared to phases calculated from the final deposited PDB structure. All phase comparisons were performed with the CCP4 program SFTOOLS on the set of phases common to all phasing and refinement programs.
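For readers unfamiliar with the quantities tabulated as "PDIF" and "cos(PDIF)", the following sketch computes an unweighted mean phase difference and the mean cosine of that difference between two phase sets. It is an illustration only: the function names are hypothetical, the toy phases are invented, and whether SFTOOLS applies amplitude or figure-of-merit weighting is not stated here, so no weighting is assumed.

```python
import math


def phase_difference_deg(phi_a, phi_b):
    """Absolute phase difference in degrees, wrapped into [0, 180]."""
    d = abs(phi_a - phi_b) % 360.0
    return 360.0 - d if d > 180.0 else d


def phase_statistics(phases, reference_phases):
    """Unweighted mean phase difference (degrees) and mean cosine of it."""
    diffs = [phase_difference_deg(a, b)
             for a, b in zip(phases, reference_phases)]
    mean_diff = sum(diffs) / len(diffs)
    mean_cos = sum(math.cos(math.radians(d)) for d in diffs) / len(diffs)
    return mean_diff, mean_cos


# Toy phases in degrees; real comparisons use thousands of reflections.
experimental = [10.0, 350.0, 90.0, 200.0]
model = [30.0, 20.0, 90.0, 110.0]
mean_diff, mean_cos = phase_statistics(experimental, model)
```

The wrap into [0, 180] matters: phases of 350° and 20° differ by 30°, not 330°. If the figures of merit reported by a phasing program are well calibrated, their mean should be close to `mean_cos`, which is the comparison made in Table 5.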
Although both CRUNCH2 and SHELXD were able to find the substructure in the difficult cases, CRUNCH2 did not require any additional information about the target structure. For SHELXD to find the full substructure in the case of the ferredoxin dataset, the minimum distance criterion was lowered to −1.5. With the two lysozyme datasets, in order to find both sulfurs involved in disulfide bonds, the minimum distance criterion was also lowered to −1.5. All of these data sets were also run with the Shake-and-Bake (SnB) suite (Weeks and Miller, 1999), which performed similarly to HySS. SnB was able to determine substructures for all but the ferredoxin and low redundancy lysozyme cases. However, it is not known if these structures can be solved through varying some of SnB's program parameters.
Substructure Refinement and Phasing
As can be directly seen from the results presented in Tables 4 and 5, both BP3 and SHARP outperform MLPHARE and SOLVE in all measures, including map correlation and phase difference, and also in the agreement between the cosine of the phase difference and the reported figure of merit. The run time of each program is shown in Table 6. In Table 4, both map correlation and phase difference with the published structure are shown. In this table, it can further be seen that BP3 outperforms SHARP in 8 of the 10 test cases by a small amount in terms of map correlation. Although the absolute difference between the results shown is small in most cases, an improvement in phase error of approximately one degree in BP3 over SHARP (as seen in the human acyl-protein thioesterase and ferredoxin cases) has led to better map building in ARP/wARP. Thus, even small improvements in solution quality at this early stage of structure determination can have an impact at the end of structure refinement. In Table 5, the figure of merit is compared to the cosine of the phase difference. The figure of merit is calculated
Table 5. Refinement and Phasing – Figure of Merit and Cosine of Phase Difference

| Dataset | BP3 FOM | BP3 cos(PDIF) | SHARP FOM | SHARP cos(PDIF) | MLPHARE FOM | MLPHARE cos(PDIF) | SOLVE FOM | SOLVE cos(PDIF) |
|---|---|---|---|---|---|---|---|---|
| C. acidurici ferredoxin | 0.56 | 0.59 | 0.35 | 0.58 | 0.01 | 0.04 | 0.09 | 0.13 |
| Carbohydrate binding module | 0.23 | 0.20 | 0.22 | 0.20 | 0.04 | 0.17 | 0.19 | 0.19 |
| DNA (CGCGCG)2 | 0.56 | 0.54 | 0.51 | 0.53 | 0.28 | 0.38 | 0.40 | 0.47 |
| Human acyl-protein thioesterase | 0.36 | 0.33 | 0.32 | 0.30 | 0.19 | 0.27 | 0.33 | 0.30 |
| E. coli thioesterase | 0.49 | 0.38 | 0.46 | 0.37 | 0.28 | 0.35 | 0.39 | 0.34 |
| Lysozyme (strong signal) | 0.47 | 0.44 | 0.41 | 0.44 | 0.20 | 0.30 | 0.32 | 0.39 |
| Pseudomonas serine carboxyl proteinase | 0.26 | 0.16 | 0.25 | 0.15 | 0.08 | 0.14 | 0.21 | 0.13 |
| Subtilisin | 0.33 | 0.31 | 0.32 | 0.31 | 0.21 | 0.26 | 0.32 | 0.29 |
| MutS DNA binding protein | 0.49 | 0.39 | 0.48 | 0.40 | 0.24 | 0.33 | 0.42 | 0.37 |
| Lysozyme (weaker signal) | 0.38 | 0.37 | 0.34 | 0.37 | 0.07 | 0.22 | 0.23 | 0.26 |
Figures of merit as estimated by each phasing and refinement program. Also shown is the cosine of the phase difference, "cos(PDIF)", as compared to phases calculated from the final deposited PDB structure. All phase comparisons were performed with the CCP4 program SFTOOLS on the set of phases common to all phasing and refinement programs.
Table 6. Refinement and Phasing – Timing Information

| Dataset | BP3 time (sec) | SHARP time (sec) | MLPHARE time (sec) | SOLVE time (sec) |
|---|---|---|---|---|
| C. acidurici ferredoxin | 990 | 481 | 152 | 230 |
| Carbohydrate Binding Module | 678 | 481 | 142 | 597 |
| DNA (CGCGCG)2 | 144 | 261 | 33 | 91 |
| Human acyl-protein thioesterase | 2073 | 931 | 914 | 790 |
| E. coli Thioesterase | 1401 | 703 | 159 | 754 |
| Lysozyme (strong signal) | 1212 | 246 | 255 | 1089 |
| Pseudomonas serine carboxyl proteinase | 1196 | 551 | 331 | 586 |
| Subtilisin | 191 | 266 | 71 | 334 |
| MutS DNA binding protein | 3998 | 4676 | 4101 | 1595 |
| Lysozyme (weaker signal) | 1442 | 515 | 242 | 255 |
Run times in seconds for all refinement and phasing programs. All runs were performed on a 2.4GHz Intel Pentium 4 running Linux 2.4.
by the refinement program, while the cosine of the phase difference is measured from the PDB deposited structure. How well these parameters agree is critical for the performance of density modification and other subsequent steps in structure determination. As can be seen from this table, the agreement between these quantities for BP3 and SHARP is quite good. The agreement is less good in the cases of MLPHARE and SOLVE. Notably, in the latest version of SHARP, which was used for the results in this paper, its performance in estimating figures of merit has significantly improved.
An interesting result was seen with the ferredoxin dataset. The best and second best CRUNCH2 solutions were input into both BP3 and SHARP. With the best CRUNCH2 solution, BP3 and SHARP gave similar phase errors: BP3 had a phase error of 44.62°, and SHARP had a phase error of 45.13°. However, with the second best CRUNCH2 solution, while BP3 still had a phase error of approximately 44°, SHARP gave a phase error of 49°. This may imply that, in this test case, BP3 has a larger radius of convergence than the other refinement programs examined here. Further testing to verify the convergence radius of BP3 is being conducted using all datasets presented in this paper.
Concluding Remarks
The above test cases show the effectiveness of the programs CRUNCH2 and BP3 and CRANK's ability to control the execution and decision making of the programs. CRUNCH2 was able to find substructures in all test cases with its default run parameters. SHELXD was the only other program able to solve all the substructures, but required extra program parameters that may not be known by the user (i.e., the minimum distance between atoms) in order for two of the test cases to be solved. Furthermore, using default settings, BP3 produced better map correlations for 8 out of the 10 data sets. The results quoted here for SHARP were the same or better than results obtained in previous papers (Dauter et al., 2002).
Tables 3 and 6 show a comparison of the time the various programs used to perform their tasks. As can be seen, the programs CRUNCH2 and BP3 run in a time comparable to existing programs; thus the CRANK suite is equally suitable for cases with a strong anomalous signal as well as more difficult cases that may not be solvable with current suites. Furthermore, to significantly improve CRUNCH2 run times, we are currently working on identifying CRUNCH2 solutions early, as they are generated, rather than requiring the full 10 or 20 trials. The above test cases used SAD data sets exclusively, but our methods are also applicable to other classes of diffraction experiment, including multiple-wavelength anomalous diffraction and multiple isomorphous replacement experiments.
CRANK has been designed from the start as a modular and flexible program for structure determination, and can thus easily be extended to use other programs. We are currently implementing a strategy to combine different algorithms for substructure detection in CRANK: to first run SHELXD to obtain a substructure solution and to verify this solution within CRUNCH2. We believe that the combination of these programs will be a powerful, efficient, and robust technique for substructure detection.
Experimental Procedures
In order to evaluate the effectiveness of the CRUNCH2 and BP3 algorithms and the CRANK pipeline, standardized tests of these programs were performed against the current leading programs in anomalous substructure detection, and in refinement and phasing. These other programs are used by many of the pipelines discussed in the introduction. All programs were first run with default values only. In cases when strictly default values failed to produce acceptable solutions, recommended values from the authors or program documentation were used instead. For CRUNCH2 and BP3, strictly default values only were used for all test cases.
Calculation of E(A) Values
The first step in anomalous substructure determination is the conversion of |F| values to a form more amenable to the mathematical treatments used in direct methods. Each substructure determination program does this procedure in a slightly different way. For CRUNCH2, we are currently using the DREAR module from the SnB suite (Weeks and Miller, 1999). Generally, default values were used in all cases in the DREAR interface (Blessing and Smith, 1999), with the exception that it was found useful to lower the Xmin and Ymin data cutoff limits to values of 0.1 (Weiss, 2001). In addition, for all substructure determination trials with all programs, we started with a high-resolution cutoff of 0.5 Å below the maximum resolution of the dataset. This cutoff improved the performance of most of the programs. In cases where this 0.5 Å cutoff failed to produce good results, different resolution cutoffs were used.
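The default truncation described above can be sketched as a simple filter on d-spacing. The sketch reads "0.5 Å below the maximum resolution" as keeping only reflections with d ≥ d_min + 0.5 Å, so a dataset extending to 1.8 Å would be cut at 2.3 Å. That reading, the tuple layout, and the function name are assumptions made for this example.

```python
def truncate_resolution(reflections, margin=0.5):
    """Keep reflections whose d-spacing is at least d_min + margin (in Å).

    `reflections` is a list of (hkl, d_spacing, amplitude) tuples; d_min is
    taken from the data as the smallest d-spacing present.
    """
    d_min = min(d for _, d, _ in reflections)
    cutoff = d_min + margin
    kept = [r for r in reflections if r[1] >= cutoff]
    return cutoff, kept


# Toy data: Miller indices, d-spacing in Å, amplitude.
reflections = [((1, 0, 0), 12.0, 250.0), ((2, 1, 0), 4.1, 90.0),
               ((5, 3, 1), 2.4, 30.0), ((8, 4, 2), 1.8, 12.0)]
cutoff, kept = truncate_resolution(reflections)
```

In this toy set only the 1.8 Å reflection is discarded. The weakest, highest-resolution shell usually carries the least reliable anomalous differences, which is one plausible reason such a cutoff helps most of the programs.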
These special cases are described in the individual program sections below.
Substructure Determination
CRUNCH2 was run with ten trials for most of the test cases presented below. For cases with a very low anomalous signal (i.e., lysozyme with low redundancy), twenty trials were employed. In all cases, the PMF (Patterson Minimum Function) algorithm was used to generate starting positions compliant with the Patterson map. These starting models were used for input to CRUNCH2 since they proved to lead to better solutions than purely random starts. CRUNCH2 requires only knowledge of the number of anomalous scatterers present and is therefore well suited both to inclusion in an automated suite and to use by those just beginning crystallography.
The program SHELXD does not have built-in termination conditions. Instead, it is designed so that the user interactively checks intermediate SHELXD results, or alternatively, specifies the number of trials for SHELXD to perform. For these tests, we ran SHELXD on all datasets with runs of 10, 100, and 1000 total trials. Results and timings from the 10 trial runs were taken in all cases, except when a longer run produced markedly better results. On advice from the program author, in the difficult test cases, a range of different cutoffs was tried, starting at the high resolution limit and going in steps of 0.1 Å up to 1 Å below the high resolution cutoff. In addition, in three of the test cases, additional parameters needed to be added to the SHELXD .ins file in order to generate solutions. In the case of the ferredoxin and the two lysozyme datasets, the MIND parameter, governing the minimum distance between atoms, needed to be lowered to "MIND -1.5", allowing atoms to be found at distances from 1.5 Å and up.
For the program HySS, two different search modes are available: fast and full. All datasets were run with both search options.
Again, in all cases, the results from the fast search mode were taken, except when more matching atom sites were found by the full search procedure. In communications with the author of HySS, we have been informed that in future versions, the termination criteria of HySS will be refined, allowing users to run HySS with the fast search option in all cases. On advice from the program author, in cases where the default 0.5 Å cutoff failed to produce results, resolution cutoffs from the high resolution cutoff up to 5 Å were tried in steps of 0.5 Å. In the ferredoxin case, a cutoff of 3 Å allowed HySS to find one Fe atom, corresponding to one of the Fe atoms in one of the two FeS clusters. HySS requires very little information about the target structure, with only the number and type of anomalous scatterers required as input.
SOLVE was run with default parameters, using the SAD script supplied with the SOLVE package as a template. As with the other substructure determination programs, the high-resolution reflections were truncated at 0.5 Å above the highest recorded reflections. In cases where this procedure failed to produce results, resolution values starting at the high-resolution cutoff and decreasing to 5.0 Å in steps of 0.5 Å were tried. In the few cases for which SOLVE did not have atomic form factors, these were obtained from CCP4 library files. SOLVE requires a little more information than CRUNCH2 and HySS; detailed examples can be found in the example scripts included with the SOLVE program and on the SOLVE website.
Refinement and Phasing
After the anomalously diffracting substructure had been determined, heavy atom positions were then sent to the next step in the processing pipeline, heavy atom refinement and phasing. In order to make the comparisons more manageable, only the highest scoring CRUNCH2 determined substructure was used as input to the various refinement and phasing programs.
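The fallback resolution scan described above for HySS and SOLVE (default cutoff first, then cutoffs from the high-resolution limit to 5 Å in 0.5 Å steps) can be written as a small search loop. This is a sketch of the test protocol as read from the text, not code from CRANK or from either program. `run_search` is a hypothetical callable standing in for an actual program run, and the order in which the fallback cutoffs are tried is an assumption.

```python
def cutoff_schedule(d_min, step=0.5, d_max=5.0, first_margin=0.5):
    """Resolution cutoffs (Å) to try, in order: default first, then a scan."""
    schedule = [round(d_min + first_margin, 2)]
    n = 0
    while True:
        cutoff = round(d_min + n * step, 2)
        if cutoff > d_max:
            break
        if cutoff not in schedule:
            schedule.append(cutoff)
        n += 1
    return schedule


def first_successful(d_min, run_search, **kwargs):
    """Try each cutoff until `run_search(cutoff)` returns a solution."""
    for cutoff in cutoff_schedule(d_min, **kwargs):
        solution = run_search(cutoff)
        if solution is not None:
            return cutoff, solution
    return None, None


# Hypothetical stand-in for a real program run: it "succeeds" only once the
# data have been truncated to 4.0 Å or lower resolution.
def fake_search(cutoff):
    return ["site"] if cutoff >= 4.0 else None


cutoff, solution = first_successful(3.0, fake_search)
```

For data extending to 3.0 Å the schedule is 3.5, 3.0, 4.0, 4.5, 5.0 Å, and the toy search stops at 4.0 Å. The finer SHELXD scan described earlier corresponds to `step=0.1` and `d_max=d_min + 1.0`.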
In order to allow for the direct comparison of phases, the atomic positions of the highest scoring CRUNCH2 trial were superimposed on the final PDB heavy atom positions using Emma (Grosse-Kunstleve and Adams, 2003). The heavy atom refinement and phasing programs used were BP3 (Pannu and Read, 2004), SHARP (La Fortelle and Bricogne, 1997), MLPHARE (Otwinowski, 1991), and SOLVE (Terwilliger and Berendzen, 1999). In all cases, only recommended settings were used, and automated scripts were set up to process all datasets. Results were
generated for the parameters of phase error, cosine of the phase error, figure of merit and map correlation with the final refined model. Automated scripts within CRANK were set up to run these programs, allowing us to try many different atomic models and parameters and to compare these different programs within the CRANK framework. To ensure that results were consistent with the original package, all final SHARP results were run through the SHARP program, it is these results that are presented in the tables. Program Availability CRANK, CRUNCH2, and BP3 are available from the web site: http:// www.bfsc.leidenuniv.nl/software/ Acknowledgments We would like to thank the people who contributed datasets for our analysis. Z. Dauter and colleagues provided all the data from the Jolly SAD paper; E. Dodson and G. Davies provided us with the CBM1 and CBM2 datasets; M. Weiss for the Lysozyme data; and T. Sixma provided us with the MutS dataset. The authors would also like to thank Fabio Dall’Antonia for kindly providing us with a beta version of the SITCOM program for substructure comparison. Funding for this research was provided by the N.W.O. Received: May 18, 2004 Revised: June 30, 2004 Accepted: July 25, 2004 Published: October 5, 2004 References Betzel, C., Dauter, Z., Dauter, M., Ingelman, M., Papendorf, G., Wilson, K.S., and Branner, S. (1988). Crystallization and preliminary X-ray diffraction studies of an alkaline protease from Bacillus lentus. J. Mol. Biol. 204, 803–804. Blessing, R.H., and Smith, G.D. (1999). Difference structure factor normalization for determining heavy-atom or anomalous scattering substructures. J. Appl. Crystallogr. 32, 664–670. Boraston, A.B., Revett, T.J., Boraston, C.M., Nurizzo, D., and Davies, G.J. (2003). Structural and thermodynamic dissection of specific mannan recognition by a carbohydrate binding module, TmCBM27. Structure 11, 665–675. 
Brünger, A.T., Adams, P.D., Clore, G.M., DeLano, W.L., Gros, P., Grosse-Kunstleve, R.W., Jiang, J.S., Kuszewski, J., Nilges, M., Pannu, N.S., et al. (1998). Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr. D 54, 905–921.
Brunzelle, J.S., Shafaee, P., Yang, X., Weigand, S., Ren, Z., and Anderson, W.F. (2003). Automated crystallographic system for high-throughput protein structure determination. Acta Crystallogr. D 59, 1138–1144.
Burla, M.C., Carrozzini, B., Cascarano, G.L., Giacovazzo, C., and Polidori, G. (2003). SAD or MAD phasing: location of the anomalous scatterers. Acta Crystallogr. D 59, 662–669.
Dauter, Z., Wilson, K.S., Sieker, L.C., Meyer, J., and Moulis, J.M. (1997). Atomic resolution (0.94 Å) structure of Clostridium acidurici ferredoxin. Detailed geometry of [4Fe-4S] clusters in a protein. Biochemistry 36, 16065–16073.
Dauter, Z., and Adamiak, D.A. (2001). Anomalous signal of phosphorus used for phasing DNA oligomer: importance of data redundancy. Acta Crystallogr. D 57, 990–995.
Dauter, Z., Dauter, M., and Dodson, E. (2002). Jolly SAD. Acta Crystallogr. D 58, 494–506.
Dauter, Z., Dauter, M., de La Fortelle, E., Bricogne, G., and Sheldrick, G.M. (1999). Can anomalous signal of sulfur become a tool for solving protein crystal structures? J. Mol. Biol. 289, 83–92.
Dauter, Z., Li, M., and Wlodawer, A. (2001). Practical experience with the use of halides for phasing macromolecular structures: a powerful tool for structural genomics. Acta Crystallogr. D 57, 239–249.
Devedjiev, Y., Dauter, Z., Kuznetsov, S.R., Jones, T.L.Z., and Derewenda, Z.S. (2000). Crystal structure of the human acyl protein thioesterase I from a single X-ray data set to 1.5 Å. Struct. Fold. Des. 8, 1137–1165.
Dodson, E. (2003). Is it jolly SAD? Acta Crystallogr. D 59, 1958–1965.
Emsley, P. (1999). Datasets to maps: a wrapper for SHELXS and the CCP4 program suite CCP4. CCP4 Newsletter on Protein Crystallography 36.
de Graaff, R.A.G., Hilge, M., van der Plas, J.L., and Abrahams, J.P. (2001). Matrix methods for solving protein substructures of chlorine and sulfur from anomalous data. Acta Crystallogr. D 57, 1857–1862.
Grosse-Kunstleve, R.W., and Adams, P.D. (2003). Substructure search procedures for macromolecular structures. Acta Crystallogr. D 59, 1966–1973.
Hamelryck, T., and Kjeldgaard, M. (2001). An open source multipurpose programming environment for macromolecular crystallography. CCP4 Newsletter on Protein Crystallography 39.
Holton, J., and Alber, T. (2004). Automated protein crystal structure determination using ELVES. Proc. Natl. Acad. Sci. USA 6, 1537–1542.
Ito, N., Kobayashi, K., Sakamoto, H., and Nakamura, H. (2002). Development of PDBj-ML. Acta Crystallogr. A 58 (Supplement), C367.
La Fortelle, E., and Bricogne, G. (1997). Methods Enzymol. 276, 472–494.
Lamers, M.H., Perrakis, A., Enzlin, J.H., Winterwerp, H.H.K., de Wind, N., and Sixma, T. (2000). The crystal structure of DNA mismatch repair protein MutS binding to a G.T mismatch. Nature 407, 711–717.
Leslie, A.G.W., Powell, H.R., Winter, G., Svensson, O., Spruce, D., McSweeney, S., Love, D., Kinder, S., Duke, E., and Nave, C. (2002). Automation of the collection and processing of X-ray diffraction data - a generic approach. Acta Crystallogr. D 58, 1924–1928.
Li, J., Derewenda, U., Dauter, Z., Smith, S., and Derewenda, Z.S. (2000). Crystal structure of the Escherichia coli thioesterase II, a homolog of the human Nef binding enzyme. Nat. Struct. Biol. 7, 555–559.
Murray-Rust, P. (1998). The globalization of crystallographic knowledge. Acta Crystallogr. D 54, 1065–1070.
Otwinowski, Z. (1991). Proceedings of the CCP4 Study Weekend. Isomorphous Scattering and Anomalous Replacement. Wolf, W., Evans, P.R., and Leslie, A.G.W., eds. (Warrington: Daresbury Laboratory), pp. 80–86.
Pannu, N.S., and Read, R.J. (2004). The application of multivariate statistical techniques improves single-wavelength anomalous diffraction phasing. Acta Crystallogr. D 60, 22–27.
Pannu, N.S., McCoy, A.J., and Read, R.J. (2003). Application of the complex multivariate normal distribution to crystallographic methods with insights into multiple isomorphous replacement phasing. Acta Crystallogr. D 59, 1801–1808.
Pape, T., and Schneider, T.R. (2004). HKL2MAP: a graphical user interface for macromolecular phasing with SHELX programs. J. Appl. Crystallogr. 37, 843–844.
Potterton, E., Briggs, P., Turkenburg, M., and Dodson, E. (2003). A graphical user interface to the CCP4 program suite. Acta Crystallogr. D 59, 1131–1137.
Schneider, T.R., and Sheldrick, G.M. (2002). Substructure solution with SHELXD. Acta Crystallogr. D 58, 1772–1779.
Sheldrick, G.M. SHELXC (http://shelx.uni-ac.gwdg.de/SHELX/).
Smith, G.D. (2002). Matching selenium-atom peak positions with a different hand or origin. J. Appl. Crystallogr. 35, 368–370.
Terwilliger, T.C., and Berendzen, J. (1999). Automated MAD and MIR structure solution. Acta Crystallogr. D 55, 849–861.
Weeks, C.M., and Miller, R. (1999). The design and implementation of SnB v2.0. J. Appl. Crystallogr. 32, 120–124.
Weeks, C.M., Blessing, R.H., Miller, R., Mungee, R., Potter, S.A., Rappleye, J., Smith, G.D., Xu, H., and Furey, W. (2002). Towards automated protein structure determination: BnP, the BnP-PHASES interface. Z. Kristallogr. 217, 686–693.
Weiss, M.S. (2001). Global indicators of X-ray data quality. J. Appl. Crystallogr. 34, 130–135.
Winn, M.D., Ashton, A.W., Briggs, P.J., Ballard, C.C., and Patel, P. (2002). Ongoing developments in CCP4 for high-throughput structure determination. Acta Crystallogr. D 58, 1929–1936.