SI Methods Final - PNAS

1 downloads 0 Views 4MB Size Report
Sequences and secondary structures of all proteins in the PDB .... QRs of residues 30-79, 29-78, 28-77…20-69 and select the IFU with the .... using PyMOL (10). ... letter abbreviations for amino acids are as follows: A: alanine; C: cysteine; D:.
Methods Database construction Databases of protein sequence and secondary structure annotations were assembled as follows. Sequences and secondary structures of all proteins in the PDB (updated 05/31/2017) were determined using both DSSP (1) and PROSS (2). DSSP determines secondary structure through hydrogen bonding patterns while PROSS identifies secondary structure through backbone dihedral angles. The resulting sets of secondary structure annotations and their corresponding amino acid sequences were updated to include chain breaks and unobserved regions using a metafile from the PDB (https://cdn.rcsb.org/etl/kabschSander/ss_dis.txt.gz).

Six fold-switching proteins were

identified using DSSP that were not identified with PROSS; 27 were identified through PROSS but not found using DSSP; the remaining 64 were identified using both libraries. All sequences in the database were used except those with continuous stretches of ≥4 amino acids annotated as X (0.05% of the database). Classification of fold switches Proteins with different oligomeric states were assigned to class A, B, or C as follows. Those reported to hetero-oligomerize with another macromolecule (protein, DNA, or RNA) in one conformation, or two different macromolecules in both conformations, were assigned to class A. Those with a ≥30% increase in number of intermolecular hydrogen bonds were assigned to class C. The remainder all showed a

www.pnas.org/cgi/doi/10.1073/pnas. 1800168115

1

≥30% increase in number of intermolecular hydrophobic contacts, and were assigned to class B. Hydrogen bonds were calculated using the criteria of Kortemme, et al. (3), and hydrophobic contacts were calculated as any carbon atoms within 4.5Å of one another. The oligomeric state of the PDB’s “biological assembly” was assumed to be its biologically relevant conformation; if more than one oligomeric state was provided, the literature was searched for mention of oligomeric state changes between the fold-switch pair. The rest of the fold-switching proteins (those that did not change oligomeric state) were assigned to class D.

Functional, organismal, and date-solved distributions Protein functions (Fig. 1) were drawn from the HEADER of their PDB files. Similarly, the organism from which these proteins originated was taken from SOURCE ORGANISM_SCIENTIFIC in the PDB files. For comparison, we used the same method to determine the distribution of all source organisms from a non-redundant subset of the PDB (2), serpins (4->1), and MacA (2->1) included in the organismal distribution. All fibril-forming proteins were counted because they were formed by proteins with disparate functions. Release dates were used to measure the

2

publication age of a protein structure. For each fold switch pair, the newer of the two dates was used in the distribution of solved-by dates (Fig. S4). This is because the newer of the two structures was required to demonstrate that a fold switch occurred. Identification of independent folding units Calculations of independent folding units (IFUs) were performed with SEED (5). Because this method is computationally intensive, the protein regions corresponding to fold-switching segments were given as reference for the IFU search. SEED then searched for IFUs by calculating the qualifying ratio (QR, the measure by which SEED determines independent folding cooperativity) of the reference sequence. The QR of the reference sequence was calculated. If it exceeded 0.78, SEED searched for a maximum QR by extending the reference sequence up to 15 residues both N-terminally and C-terminally (for a total extension of up to 30 residues) in increments of 1. If the QR of the reference